<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Liran Baba</title>
    <description>The latest articles on Forem by Liran Baba (@liran_baba).</description>
    <link>https://forem.com/liran_baba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853249%2F1fb2b801-c7ae-463d-b813-a600cdd7ca4f.png</url>
      <title>Forem: Liran Baba</title>
      <link>https://forem.com/liran_baba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/liran_baba"/>
    <language>en</language>
    <item>
      <title>AI made your team code faster. Everything after is still broken.</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Mon, 04 May 2026 16:01:37 +0000</pubDate>
      <link>https://forem.com/liran_baba/ai-made-your-team-code-faster-everything-after-is-still-broken-g6</link>
      <guid>https://forem.com/liran_baba/ai-made-your-team-code-faster-everything-after-is-still-broken-g6</guid>
      <description>&lt;p&gt;I sat with one of our teams a while back - multiple agents running in parallel, each on a different feature, output that used to take a sprint landing before lunch. I asked mid-session: did that bug from last week get fixed? The one a customer had flagged. Everyone paused. The agent had been asked to fix it. Someone thought it shipped. But had it reached prod? Was it still in staging? It might have gone out in one of the four releases that week and nobody tracked which one carried it. When that much is moving at once, the fix existed somewhere - they just had no way to know where.&lt;/p&gt;

&lt;p&gt;They opened four tabs. Checked a CI dashboard. Skimmed Slack looking for someone's deployment message. Twenty minutes gone. &lt;/p&gt;

&lt;h2&gt;
  
  
  Coding agents move fast - until binaries go live
&lt;/h2&gt;

&lt;p&gt;Coding agents shifted where the bottleneck is. Writing code used to be the hard part. Now it's everything after: releasing, knowing what actually shipped, and understanding what's running in which environment.&lt;/p&gt;

&lt;p&gt;When you ship faster, the operational surface grows - more releases, more artifacts, more environments to track. The tooling most teams use was built for a slower world, one where humans managed each step, version numbers were meaningful, and an artifact repo was something your DevOps person configured once and left running.&lt;/p&gt;

&lt;p&gt;Here's where that breaks.&lt;/p&gt;

&lt;p&gt;Your coding agent has no idea what's actually deployed. "Is the new auth flow live on staging?" "Did the security patch reach prod?" It can't tell you - it has no access to runtime environments. Instead of you hunting down answers, the agent tries to do the heavy lifting - scraping past commits, reaching into prod if you have the right access, and digging through issues. But because of its limited context, it only ever sees a fraction of the bigger picture.&lt;/p&gt;

&lt;p&gt;Version numbers have also stopped meaning much. "Deploy v2.4.1" is nearly useless as a description of intent. What you actually want is "deploy the latest build that fixed the checkout bug" or "what's different between the version in staging and prod" - but connecting code changes to what shipped to what's currently running is still manual, and it gets harder the faster you move.&lt;/p&gt;

&lt;p&gt;There's also the setup cost that nobody names out loud. Distribution management, package manager configs, CI/CD integration - before you've shipped anything, you've burned real time on infrastructure. For teams with no dedicated DevOps person, it's often what quietly kills momentum.&lt;/p&gt;

&lt;p&gt;When all of this is still on you - tracking what was built, what shipped, what's running where - it doesn't matter that the agent writes code fast. The last mile is still yours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbocqzd838714u5z2diiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbocqzd838714u5z2diiy.png" alt="AI made your team code faster. Releases, deploys, and tracking what's running where didn't keep up. Here's where the last mile breaks." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fly: Giving Binaries Agentic Wings
&lt;/h2&gt;

&lt;p&gt;We built Fly because we kept running into exactly this. Teams moving fast with AI, then stalling at the release and visibility layer where nothing is agentic and nothing talks to anything else.&lt;/p&gt;

&lt;p&gt;The artifact registry needs to be part of the workflow, not a separate system your agent has never heard of. That's where the twenty minutes go.&lt;/p&gt;

&lt;p&gt;In Fly, every push creates a traceable release with its PR, commits, and change summary attached automatically. That context is available through MCP, so Cursor, Claude Code, Copilot - whatever you use - actually knows what's been built and what's deployed.&lt;/p&gt;

&lt;p&gt;You can ask "find the release that fixed the checkout bug" or "what changes are queued between prod and staging?" or "deploy John's latest changes to production" and get something useful back instead of a blank stare or a three-minute token flood that may or may not give the right answer. You can ask what’s running in any environment without leaving your coding session, and get answers from up-to-date semantic tracking across every runtime, fully agentless.&lt;/p&gt;

&lt;p&gt;Releases are identified by the changes they contain, not just by a version number someone incremented in a pipeline. Setup is a few minutes: connect GitHub, run the Fly bash command, push once, and it picks up context from there.&lt;br&gt;
You can even ask Fly to ping you on Slack when Rachel pushes the new UI design to staging.&lt;/p&gt;

&lt;p&gt;We've been running this with teams for several months. The thing I hear most often is: "I didn’t realize how much time this was costing us until we had a better way." That's the right outcome - when the tooling becomes invisible in your workflow.&lt;/p&gt;

&lt;p&gt;If your team is shipping fast, but figuring out where finished code is actually running still costs you valuable minutes and multiple DMs, it’s time for a better way.&lt;br&gt;
Try this: &lt;a href="https://jfrog.com/fly"&gt;jfrog.com/fly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclosure: I’m AI Lead at JFrog. I use Fly day-to-day.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>devex</category>
    </item>
    <item>
      <title>Claude Code hooks: the half of Claude Code nobody uses</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Thu, 23 Apr 2026 17:32:45 +0000</pubDate>
      <link>https://forem.com/liran_baba/claude-code-hooks-the-half-of-claude-code-nobody-uses-5570</link>
      <guid>https://forem.com/liran_baba/claude-code-hooks-the-half-of-claude-code-nobody-uses-5570</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rsl98dbzb4i1qf8mm8r.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rsl98dbzb4i1qf8mm8r.webp" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was halfway through writing this post when I decided to fact-check myself. Opened &lt;code&gt;~/.claude/settings.json&lt;/code&gt;, expecting three or four hooks I'd forgotten about. There was one. A &lt;code&gt;Stop&lt;/code&gt; hook that plays a ding and says "your turn" when Claude finishes thinking. My hooks-to-skills ratio: 1 to 42.&lt;/p&gt;

&lt;p&gt;I'm not picking on myself. This is the median. I checked seven of my own project configs after that: zero hooks each. Skills got the awesome-lists. Hooks got a footnote. And the silence has a cost: wasted tokens, missed security incidents, and a control surface that ships in the box and never gets wired up.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code has 25+ hook event types. The average user has configured zero or one. I checked seven of my own project configs: zero hooks. My user-level config: one Stop hook.&lt;/li&gt;
&lt;li&gt;Skills feel like adding capability, which is fun. Hooks feel like writing policy, which sounds like work. The work is the part that pays.&lt;/li&gt;
&lt;li&gt;One &lt;code&gt;PreToolUse&lt;/code&gt; hook that swaps Grep for LSP cuts navigation tokens by 73-91% in the kit's own benchmarks (&lt;a href="https://github.com/nesaminua/claude-code-lsp-enforcement-kit" rel="noopener noreferrer"&gt;nesaminua/claude-code-lsp-enforcement-kit&lt;/a&gt;, MIT).&lt;/li&gt;
&lt;li&gt;The enterprise hook playbook (SOC2 audit, SIEM integration, supply-chain scanning, org-wide token budgets) does not exist publicly yet. If your platform team writes one now, you're early.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why skills get all the attention and hooks get none
&lt;/h2&gt;

&lt;p&gt;Skills fit a familiar mental model: drop a markdown file, write a description, the agent picks it up. They feel like &lt;em&gt;adding capability&lt;/em&gt;. Hooks are different. You write a script that fires at an execution boundary and returns an exit code or JSON to block, modify, or augment what happens next. They feel like &lt;em&gt;writing policy&lt;/em&gt;. I saw a &lt;a href="https://reddit.com/r/AI_Agents/comments/1smmjvl/" rel="noopener noreferrer"&gt;Reddit thread&lt;/a&gt; that nailed this for me: skills change what the model can do, hooks change when it can do it.&lt;/p&gt;

&lt;p&gt;That asymmetry shows up in adoption. Skills make Claude more capable; hooks make it more predictable. If you're a solo dev optimizing for capability, you reach for skills. If you're a team lead trying to keep five engineers from doing five different unsafe things, you reach for hooks. Most early adopters were solo, so the awesome-lists filled up with skills first.&lt;/p&gt;

&lt;p&gt;There's also a distribution problem. A skill is a markdown file. You can paste it in a Slack message. A hook is a &lt;code&gt;settings.json&lt;/code&gt; entry (or a bundled file in a plugin or skill) pointing to a shell script that touches your filesystem. Sharing it requires trust, a setup ritual, and someone willing to chmod +x a stranger's bash. That's a real barrier.&lt;/p&gt;

&lt;p&gt;Look at the curated lists on GitHub. Hook-only repos trail the mixed lists (skills, slash commands, MCP, hooks) by a wide margin, and broader awesome-claude-code lists treat hooks as a footnote. Even the &lt;a href="https://reddit.com/r/PromptEngineering/comments/1slpy2a/" rel="noopener noreferrer"&gt;post on how the creator of Claude Code uses it&lt;/a&gt; mentions hooks only in passing, and zero of the 26 comments noticed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My audit:&lt;/strong&gt; 42 user-level skills installed. 1 hook (the Stop notification mentioned above). Across 7 project configs: 0 hooks. The one that stings is my reddit-mcp repo, which gives Claude posting and deletion permissions on my Reddit account. Zero hooks there too. If I'm typical, the median ratio is something like 40 skills to 1 hook.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My &lt;a href="https://liranbaba.dev/blog/claude-code-source-leak/" rel="noopener noreferrer"&gt;deep dive on Claude Code's leaked source&lt;/a&gt; showed the harness has a &lt;code&gt;hooks/&lt;/code&gt; directory with 104 files. That's a lot of internal scaffolding for something most users ignore; it certainly matches what I see in our org.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hook event reference (Anthropic's official taxonomy)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk37phota0kqp0of50tsk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk37phota0kqp0of50tsk.webp" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are at least 25 hook events documented in &lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Anthropic's hooks documentation&lt;/a&gt;. Most third-party tutorials cover four. Here's the actual catalog, grouped by cadence, so you can see the full surface.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cadence&lt;/th&gt;
&lt;th&gt;Events&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Once per session&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SessionStart&lt;/code&gt;, &lt;code&gt;SessionEnd&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SessionStart&lt;/code&gt; can inject context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Once per turn&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UserPromptSubmit&lt;/code&gt;, &lt;code&gt;Stop&lt;/code&gt;, &lt;code&gt;StopFailure&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UserPromptSubmit&lt;/code&gt; and &lt;code&gt;Stop&lt;/code&gt; can block or modify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per tool call&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, &lt;code&gt;PostToolUseFailure&lt;/code&gt;, &lt;code&gt;PermissionRequest&lt;/code&gt;, &lt;code&gt;PermissionDenied&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PreToolUse&lt;/code&gt; can block or modify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async / lifecycle&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WorktreeCreate&lt;/code&gt;, &lt;code&gt;WorktreeRemove&lt;/code&gt;, &lt;code&gt;Notification&lt;/code&gt;, &lt;code&gt;ConfigChange&lt;/code&gt;, &lt;code&gt;InstructionsLoaded&lt;/code&gt;, &lt;code&gt;CwdChanged&lt;/code&gt;, &lt;code&gt;FileChanged&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Observation only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent team&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SubagentStart&lt;/code&gt;, &lt;code&gt;SubagentStop&lt;/code&gt;, &lt;code&gt;TeammateIdle&lt;/code&gt;, &lt;code&gt;TaskCreated&lt;/code&gt;, &lt;code&gt;TaskCompleted&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Observation only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compaction&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PreCompact&lt;/code&gt;, &lt;code&gt;PostCompact&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PreCompact&lt;/code&gt; can inject context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Elicitation&lt;/code&gt;, &lt;code&gt;ElicitationResult&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Elicitation&lt;/code&gt; can respond&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Configuration sits at four levels: user-wide (&lt;code&gt;~/.claude/settings.json&lt;/code&gt;), project-shared (&lt;code&gt;.claude/settings.json&lt;/code&gt;, commit it), project-private (&lt;code&gt;.claude/settings.local.json&lt;/code&gt;, gitignored), and managed policy (org admin only, individual devs can't disable). Most public configs use the first two. Managed policy is where org-wide control lives, and almost nobody ships templates for it.&lt;/p&gt;
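
&lt;p&gt;To make the layering concrete, here's a minimal user-level entry (the script path is hypothetical; the &lt;code&gt;hooks&lt;/code&gt; / &lt;code&gt;matcher&lt;/code&gt; structure follows Anthropic's hooks docs):&lt;/p&gt;

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "~/.claude/hooks/guard-writes.sh"
          }
        ]
      }
    ]
  }
}
```

&lt;p&gt;Commit the same shape in &lt;code&gt;.claude/settings.json&lt;/code&gt; to share it with the team; under managed policy the entry lives where individual devs can't remove it.&lt;/p&gt;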

&lt;p&gt;Handler types: &lt;code&gt;command&lt;/code&gt;, &lt;code&gt;http&lt;/code&gt;, &lt;code&gt;prompt&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;. Most public examples use &lt;code&gt;command&lt;/code&gt;. The &lt;code&gt;http&lt;/code&gt; handler is the door to centralized policy, where one webhook enforces the same rule across every developer in the org. I haven't seen it used in any public repo yet.&lt;/p&gt;

&lt;p&gt;Exit codes are easy to get wrong. &lt;code&gt;0&lt;/code&gt; means success. &lt;code&gt;2&lt;/code&gt; is a blocking error, but behavior depends on the event: &lt;code&gt;PreToolUse&lt;/code&gt; blocks the tool call, &lt;code&gt;Stop&lt;/code&gt; prevents session end, other events treat it as non-fatal. Any other code is non-blocking. The model can't see why a hook fired unless the hook writes to stderr; this is the most common reason hooks feel mysterious in practice.&lt;/p&gt;
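
&lt;p&gt;A minimal sketch of that contract (simplified: a real hook reads the tool call as JSON on stdin, typically via &lt;code&gt;jq&lt;/code&gt;; here the command arrives as a function argument):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the PreToolUse exit-code contract: 0 allows the tool call,
# 2 blocks it, and the reason must go to stderr or the model never sees it.
pretooluse_guard() {
  command_text="$1"    # simplified: real hooks parse this out of stdin JSON
  case "$command_text" in
    *"rm -rf"*)
      # writing to /dev/stderr keeps the explanation visible to the model
      echo "blocked: destructive delete; scope the rm instead" >/dev/stderr
      return 2
      ;;
  esac
  return 0             # any other command: allow
}
```

&lt;p&gt;Exit code 2 from a script like this blocks the call; any other nonzero code is treated as non-blocking.&lt;/p&gt;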

&lt;p&gt;The hook surface is wide. You can block, modify, observe, or augment almost any event in the agent's lifecycle. The shape of what's possible is "almost anything." The shape of what's actually configured on most machines is "nothing."&lt;/p&gt;

&lt;h2&gt;
  
  
  10 hooks people are actually running
&lt;/h2&gt;

&lt;p&gt;These are the hooks I'd put in front of my team. None are "block writes to .env" or "format on save." Each one is published somewhere (a repo or a Reddit thread) and does something non-obvious. Ranked roughly from most surprising to most foundational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. LSP-over-Grep enforcement.&lt;/strong&gt; &lt;em&gt;PreToolUse on Grep, Glob, Bash, Read.&lt;/em&gt; &lt;a href="https://github.com/nesaminua/claude-code-lsp-enforcement-kit" rel="noopener noreferrer"&gt;nesaminua/claude-code-lsp-enforcement-kit&lt;/a&gt; (MIT). Blocks Grep calls containing code symbols and forces the agent to use LSP &lt;code&gt;find_definition&lt;/code&gt; and &lt;code&gt;find_references&lt;/code&gt; instead. Documented per-call savings: definition lookup drops from ~6,500 to ~580 tokens. Real workweek aggregate: 320k to 85k navigation tokens, a 73% reduction. Works with cclsp/Serena, supports TypeScript and 13+ other languages. It punishes a default behavior. The agent can still grep, but the cost is paid in friction, not silently in tokens.&lt;/p&gt;
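
&lt;p&gt;The gating logic is roughly this shape (my sketch of the pattern, not the kit's code; the symbol heuristic is deliberately crude):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of LSP-over-Grep enforcement: refuse a Grep whose pattern looks
# like a bare code symbol and point the agent at the LSP tools instead.
grep_guard() {
  pattern="$1"
  # crude heuristic: a single identifier (CamelCase or snake_case), no spaces
  if printf '%s' "$pattern" | grep -Eq '^[A-Za-z_][A-Za-z0-9_]*$'; then
    echo "symbol search: use LSP find_definition / find_references" >/dev/stderr
    return 2
  fi
  return 0   # free-text patterns still go through Grep
}
```

&lt;p&gt;The agent pays the block in friction once, then reaches for LSP on the retry.&lt;/p&gt;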

&lt;p&gt;&lt;strong&gt;2. Knowledge-graph compile of the project.&lt;/strong&gt; &lt;em&gt;Skill + hook combo.&lt;/em&gt; &lt;a href="https://github.com/safishamsi/graphify" rel="noopener noreferrer"&gt;safishamsi/graphify&lt;/a&gt;. Karpathy-style: instead of re-reading raw files every session, compile the project into a structured wiki once, then query the wiki via skill. Hook installs the skill and registers &lt;code&gt;/graphify&lt;/code&gt; as the entry point. Claimed 71.5x token reduction per query on a mixed corpus. Treats context as a build artifact, not a runtime cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The 4-hook workflow enforcement stack.&lt;/strong&gt; &lt;em&gt;SessionStart + PreToolUse on Edit + Stop + PostToolUse on git commit.&lt;/em&gt; From &lt;code&gt;tacit7&lt;/code&gt; in the &lt;a href="https://reddit.com/r/AI_Agents/comments/1smmjvl/" rel="noopener noreferrer"&gt;hooks vs skills thread&lt;/a&gt;. SessionStart tells the agent to read the workflow skill. PreToolUse on Edit refuses if no task is registered. Stop refuses if the task isn't annotated. PostToolUse on &lt;code&gt;git commit&lt;/code&gt; logs the commit to an external app. Four hooks turn a probabilistic agent into a procedurally-compliant teammate, end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. "Fail twice, stop and ask."&lt;/strong&gt; &lt;em&gt;PostToolUseFailure + Notification.&lt;/em&gt; Another one from the same thread. If the same tool fails twice with similar errors, the hook halts the session and pings the human. The most common Claude Code failure mode is the agent looping on a bad call. This rule catches it in maybe 20 lines of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Per-file typecheck after every edit.&lt;/strong&gt; &lt;em&gt;PostToolUse on Write/Edit.&lt;/em&gt; From &lt;code&gt;DevMoses&lt;/code&gt; in the &lt;a href="https://reddit.com/r/ClaudeAI/comments/1s1ipep/" rel="noopener noreferrer"&gt;"5 levels of Claude Code"&lt;/a&gt; post. Runs &lt;code&gt;tsc --noEmit&lt;/code&gt; (or equivalent) on the single file Claude just edited, instead of flooding the agent with 200+ project-wide errors. Inverts the default. The agent gets a tight feedback loop on its own work without drowning in unrelated noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Dynamic permission control via hook-managed policy.&lt;/strong&gt; &lt;em&gt;PreToolUse + PermissionRequest.&lt;/em&gt; Also from the same thread. Hooks flip permissions on and off at runtime based on context (project, session source, current task). Claude Code's permission model is mostly static. This hook makes it conditional, which matters for orgs with role-based access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Session-conclusion guard.&lt;/strong&gt; &lt;em&gt;Stop + SessionEnd.&lt;/em&gt; &lt;a href="https://github.com/connerohnesorge/conclaude" rel="noopener noreferrer"&gt;connerohnesorge/conclaude&lt;/a&gt;. Refuses to end a session if there's uncommitted state, in-progress work, or unmerged checkpoints. Stops the "I closed the terminal and lost work" failure mode at the harness level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Filesystem offload of large tool outputs.&lt;/strong&gt; &lt;em&gt;PostToolUse on Read/Bash/WebFetch.&lt;/em&gt; &lt;a href="https://github.com/sheeki03/Few-Word" rel="noopener noreferrer"&gt;sheeki03/Few-Word&lt;/a&gt;. When a tool returns more than N tokens, write the result to disk and return a short summary plus a path. Treats the filesystem as an extension of context. The agent can re-read the slice it actually needs instead of choking on a 50KB file.&lt;/p&gt;
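
&lt;p&gt;The core move is small enough to sketch (byte counts stand in for tokens, and the threshold is made up):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the output-offload pattern: results over a size budget get
# parked on disk, and the agent receives a short pointer instead.
offload() {
  result="$1"
  limit="${2:-2000}"   # budget in bytes; real versions estimate tokens
  if [ "${#result}" -gt "$limit" ]; then
    path=$(mktemp "${TMPDIR:-/tmp}/tool-output.XXXXXX")
    printf '%s' "$result" > "$path"
    echo "output was ${#result} bytes; full text saved to $path"
  else
    printf '%s\n' "$result"
  fi
}
```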

&lt;p&gt;&lt;strong&gt;9. Cache-fix patches injected via hook.&lt;/strong&gt; &lt;em&gt;SessionStart.&lt;/em&gt; &lt;a href="https://github.com/Rangizingo/cc-cache-fix/tree/main" rel="noopener noreferrer"&gt;Rangizingo/cc-cache-fix&lt;/a&gt;. Patches a documented Claude Code bug where the &lt;code&gt;db8&lt;/code&gt; filter strips &lt;code&gt;deferred_tools_delta&lt;/code&gt; records, breaking the prompt cache on resumed sessions. The author's analysis claims it wasted ~250,000 API calls per day globally before being noticed. The hook applies the patch at session start. Hooks as a deployment mechanism for community fixes. No need to wait for Anthropic to ship a release. (I'd qualify the 250k figure as analysis-based, not independently confirmed by Anthropic.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. Lessons-learned hooks ("encode the mistake").&lt;/strong&gt; &lt;em&gt;Pattern, not a single repo.&lt;/em&gt; From &lt;code&gt;Aggressive-Sweet828&lt;/code&gt; in the hooks vs skills thread. Every time the agent makes a mistake you don't want repeated, turn it into a hook. Over time, your hooks become your team's quality bar, written in code instead of whispered in code review. Reframes hooks as institutional memory, not just guardrails.&lt;/p&gt;

&lt;p&gt;Of the ten, six are now sitting in my own &lt;code&gt;settings.json&lt;/code&gt;: the LSP enforcement kit (#1), the graphify knowledge-graph compile (#2), the fail-twice loop guard (#4), the filesystem offload for large outputs (#8), the cache-fix patch (#9), and the lessons-learned pattern (#10). The LSP kit and the fail-twice guard are the two I have the most to say about so far. The other four are too new for me to have a real story yet, and I'll update this section as they earn one.&lt;/p&gt;

&lt;p&gt;LSP enforcement kit (#1): install was a &lt;code&gt;git clone&lt;/code&gt; and a single &lt;code&gt;bash install.sh&lt;/code&gt;. The installer is idempotent and merges into &lt;code&gt;~/.claude/settings.json&lt;/code&gt; without touching what's already there. First session into the portfolio repo, the hook fired on the second tool call and refused a Grep for &lt;code&gt;BlogPostLayout&lt;/code&gt;. The agent reached for &lt;code&gt;find_definition&lt;/code&gt; instead and landed on the right file. The thing I didn't expect was how often I write prompts that assume Grep ("find where we use X"). The agent now has to translate those into LSP, which takes a beat. I'm keeping it on this repo and waiting to see what the weekly token total actually does.&lt;/p&gt;

&lt;p&gt;Fail-twice loop guard (#4): hand-rolled in about 20 lines of bash because the pattern is small enough I didn't want a dependency. It hasn't fired yet, probably because I haven't kicked off anything ambitious enough since installing it. The version I wrote compares the last two &lt;code&gt;PostToolUseFailure&lt;/code&gt; events for the same tool name and a similar error substring, and pings me via the same Stop-hook ding I already had. If it ever fires, I'll update this section with what it caught.&lt;/p&gt;
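
&lt;p&gt;For the curious, the core of my version looks roughly like this (the state-file path and the prefix-match "similarity" are my own choices):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the fail-twice guard: remember the last failure's tool name
# plus an error prefix; a repeat of the same pair halts and pings the human.
STATE="${TMPDIR:-/tmp}/last-tool-failure"
fail_guard() {
  tool="$1"; err="$2"
  key="$tool:$(printf '%s' "$err" | cut -c1-40)"   # coarse similarity check
  if [ -f "$STATE" ]; then
    if [ "$(cat "$STATE")" = "$key" ]; then
      echo "same failure twice for $tool; stopping to ask the human" >/dev/stderr
      rm -f "$STATE"
      return 2
    fi
  fi
  printf '%s' "$key" > "$STATE"
  return 0
}
```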

&lt;p&gt;The LSP entry is the most honest data point here. It's widely cited, and the kit ships its own reproducible benchmarks covering the operations the agent does dozens of times a day.&lt;/p&gt;

&lt;p&gt;The headline 80% savings number on the &lt;a href="https://reddit.com/r/AI_Agents/comments/1slligv/" rel="noopener noreferrer"&gt;original Reddit post&lt;/a&gt; is anecdotal. The 91% per-call and 73% workweek-aggregate numbers come from the kit's own benchmarks, which are reproducible. I'd treat the kit numbers as the reliable ones and the Reddit headline as directionally right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills vs hooks: a decision table
&lt;/h2&gt;

&lt;p&gt;Skills describe &lt;em&gt;what to try&lt;/em&gt;. Hooks define &lt;em&gt;what must happen&lt;/em&gt;. Pair them: a skill describes the workflow, a hook enforces the precondition.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Skills&lt;/th&gt;
&lt;th&gt;Hooks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mental model&lt;/td&gt;
&lt;td&gt;Add capability (request)&lt;/td&gt;
&lt;td&gt;Define policy (enforcement)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distribution&lt;/td&gt;
&lt;td&gt;Markdown file, frontmatter&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;settings.json&lt;/code&gt; entry pointing to script&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Determinism&lt;/td&gt;
&lt;td&gt;Probabilistic (model decides if/when)&lt;/td&gt;
&lt;td&gt;Deterministic (fires every event match)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token cost&lt;/td&gt;
&lt;td&gt;Loaded on demand, ~free when inactive&lt;/td&gt;
&lt;td&gt;Often &lt;em&gt;saves&lt;/em&gt; tokens (LSP swap, output offload)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Model writes about using it in transcripts&lt;/td&gt;
&lt;td&gt;Side effects + exit code; model often can't see why a hook fired&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Reusable workflows, domain expertise&lt;/td&gt;
&lt;td&gt;Guardrails, audit, cost control, integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure mode&lt;/td&gt;
&lt;td&gt;Model forgets to use it&lt;/td&gt;
&lt;td&gt;Hook breaks the session if poorly written&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sharing friction&lt;/td&gt;
&lt;td&gt;Low (a &lt;code&gt;.md&lt;/code&gt; file)&lt;/td&gt;
&lt;td&gt;Higher (script + permission + JSON)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A concrete pair: a "deploy" skill describes the deployment runbook (build, test, push, monitor). A &lt;code&gt;PreToolUse&lt;/code&gt; hook on the deploy command verifies the test suite passed in the last 5 minutes and you're on a release branch, and refuses otherwise. The skill teaches. The hook insists.&lt;/p&gt;
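
&lt;p&gt;The hook half of that pair might look like this (the marker file and the &lt;code&gt;release/&lt;/code&gt; branch convention are assumptions; swap in however your CI records a green run):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of the deploy precondition: refuse unless the test suite left a
# fresh marker (under 5 minutes old) and we are on a release branch.
MARKER="${TMPDIR:-/tmp}/tests-passed"
deploy_gate() {
  branch="$1"
  case "$branch" in
    release/*) ;;   # assumed branch convention
    *) echo "not a release branch: $branch" >/dev/stderr; return 2 ;;
  esac
  if [ -z "$(find "$MARKER" -mmin -5 2>/dev/null)" ]; then
    echo "no test pass in the last 5 minutes; run the suite first" >/dev/stderr
    return 2
  fi
  return 0
}
```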

&lt;p&gt;The insight that did the most for my own thinking: &lt;strong&gt;guardrails belong in hooks because blocks need to be deterministic, not described.&lt;/strong&gt; A skill that says "don't push to main without tests" is a polite request the model can ignore. A &lt;code&gt;PreToolUse&lt;/code&gt; hook that returns exit code 2 with &lt;code&gt;"decision": "block"&lt;/code&gt; cannot be ignored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://liranbaba.dev/blog/forgecode-vs-claude-code/" rel="noopener noreferrer"&gt;My comparison of ForgeCode and Claude Code&lt;/a&gt; called out hooks as one of Claude Code's real differentiators. Re-reading it now, I underweighted them. ForgeCode being faster doesn't matter if your team needs deterministic policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The enterprise patterns nobody is writing about
&lt;/h2&gt;

&lt;p&gt;Anthropic shipped the primitives this past quarter: &lt;code&gt;allowManagedHooksOnly&lt;/code&gt;, &lt;code&gt;allowedHttpHookUrls&lt;/code&gt;, &lt;code&gt;httpHookAllowedEnvVars&lt;/code&gt;, plus a drop-in &lt;code&gt;managed-settings.d/&lt;/code&gt; directory for stacking policy from multiple teams. What's missing is the layer above: published end-to-end SIEM, SOC2, and audit playbooks built on those primitives. I can't find a single public repo shipping an org-wide audit template I'd actually deploy.&lt;/p&gt;

&lt;p&gt;The vendors closest to filling that gap are the ones with skin in the AI-supply-chain game. (Disclosure: JFrog is my employer; not paid to link this.) &lt;a href="https://jfrog.com/blog/supply-chain-attackers-are-coming-for-your-agents/" rel="noopener noreferrer"&gt;Supply Chain Attackers Are Coming for Your Agents&lt;/a&gt; walks the Shai-Hulud npm worm, the postmark-mcp exfiltration, and the LiteLLM compromise as cases where a &lt;code&gt;PreToolUse&lt;/code&gt; hook on the install boundary would have caught the payload. &lt;a href="https://jfrog.com/blog/jfrog-ai-catalog-evolves-to-detect-shadow-ai-govern-mcps/" rel="noopener noreferrer"&gt;JFrog AI Catalog Evolves to Detect Shadow AI and Govern MCPs&lt;/a&gt; covers the upstream gateway angle that pairs with client-side hooks. Neither is a hook playbook, but they're closer than anything else I've found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security and policy enforcement.&lt;/strong&gt; A &lt;code&gt;PreToolUse&lt;/code&gt; hook on Write blocks anything outside approved paths (scope to &lt;code&gt;Write|Edit|MultiEdit&lt;/code&gt;). A &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook scrubs known credentials (AWS access key prefix, GitHub PAT, JFrog tokens) before the prompt leaves the machine, returning JSON &lt;code&gt;decision: "block"&lt;/code&gt; on a match. The session file where I found my database password (the one that started &lt;a href="https://dev.to/"&gt;Claudoscope&lt;/a&gt;) was created because Claude Code read a &lt;code&gt;.env&lt;/code&gt; and echoed the contents back. A &lt;code&gt;PreToolUse&lt;/code&gt; hook on Read with a &lt;code&gt;.env&lt;/code&gt; matcher would have refused that read. Claudoscope catches the credential after it lands in the JSONL; the hook prevents it from landing. The full story is in &lt;a href="https://liranbaba.dev/blog/found-database-password-in-claude-code-session/" rel="noopener noreferrer"&gt;how I found a database password in a session file&lt;/a&gt;.&lt;/p&gt;
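
&lt;p&gt;The scrubber is a few lines (the patterns are illustrative, not exhaustive; real deployments should cover every credential shape your org issues):&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of a UserPromptSubmit scrubber: if the prompt carries a known
# credential shape, emit the JSON block decision before it leaves the machine.
prompt_scrub() {
  prompt="$1"
  # AWS access keys start with AKIA; classic GitHub PATs with ghp_
  if printf '%s' "$prompt" | grep -Eq 'AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}'; then
    echo '{"decision": "block", "reason": "prompt contains a credential"}'
    return 0   # the JSON decision on stdout does the blocking
  fi
  return 0
}
```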

&lt;p&gt;&lt;strong&gt;Compliance and audit.&lt;/strong&gt; A &lt;code&gt;PostToolUse&lt;/code&gt; hook with handler type &lt;code&gt;http&lt;/code&gt; sends a structured event to a SIEM (Splunk, Datadog, Elastic) on every tool call: session ID, user, tool name, sanitized input, timestamp, project. &lt;code&gt;SessionStart&lt;/code&gt; and &lt;code&gt;SessionEnd&lt;/code&gt; book-end the audit log. Combine with managed policy and &lt;code&gt;allowManagedHooksOnly: true&lt;/code&gt; so individual devs can't disable the audit hook locally. This is the org-wide control surface the docs describe but no one's shipping templates for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost control.&lt;/strong&gt; Per-developer token budget: a &lt;code&gt;PreToolUse&lt;/code&gt; hook logs token estimates per tool call, totals them per day in a small SQLite file, and denies expensive calls once the budget is hit. Same pattern for model routing: rewrite the model selection or refuse a dispatch if the request is trivial enough that Opus is overkill.&lt;/p&gt;
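&lt;p&gt;The ledger half of that budget hook fits in a few lines. This is a sketch, not a drop-in: the budget number, the table schema, and how you estimate tokens per call are all assumptions you'd tune for your org.&lt;/p&gt;

```python
import datetime
import sqlite3

DAILY_BUDGET = 500_000  # illustrative per-developer daily token budget

def record_and_check(db_path, estimated_tokens, budget=DAILY_BUDGET):
    """Log a token estimate for today; return True while today's total is in budget."""
    today = datetime.date.today().isoformat()
    con = sqlite3.connect(db_path)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS usage (day TEXT, tokens INTEGER)")
        con.execute("INSERT INTO usage VALUES (?, ?)", (today, estimated_tokens))
        con.commit()
        (total,) = con.execute(
            "SELECT COALESCE(SUM(tokens), 0) FROM usage WHERE day = ?", (today,)
        ).fetchone()
        # True while the day's running total stays within budget
        return max(total - budget, 0) == 0
    finally:
        con.close()
```

&lt;p&gt;The &lt;code&gt;PreToolUse&lt;/code&gt; hook calls this with its per-call estimate and denies the call (exit code 2) once it returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;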

&lt;p&gt;&lt;strong&gt;DevSecOps and supply chain.&lt;/strong&gt; This is where my JFrog day job comes in. &lt;code&gt;PreToolUse&lt;/code&gt; on Bash intercepts &lt;code&gt;npm install&lt;/code&gt;, &lt;code&gt;pip install&lt;/code&gt;, etc., and runs the package through a vulnerability scanner before allowing the install. &lt;code&gt;PreToolUse&lt;/code&gt; on Write runs the staged change through JFrog Advanced Security and refuses the write if SAST finds an issue.&lt;/p&gt;
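&lt;p&gt;The interception half is mostly pattern matching on the Bash tool's command string. A sketch with illustrative patterns; the scanner call itself (Xray, or whatever you run) goes wherever a package spec comes back:&lt;/p&gt;

```python
import re

# Illustrative patterns for package-install commands an agent might run.
INSTALL_PATTERNS = [
    re.compile(r"\bnpm\s+(?:install|i)\s+([\w@./-]+)"),
    re.compile(r"\bpip3?\s+install\s+([\w.\[\]=-]+)"),
]

def extract_install(command):
    """Return the package spec if the shell command looks like an install."""
    for pattern in INSTALL_PATTERNS:
        match = pattern.search(command)
        if match:
            return match.group(1)  # hand this to your scanner before allowing
    return None
```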

&lt;p&gt;The conversation that keeps coming up on our side: whether agent-initiated package installs count as a developer action or a CI action under existing supply-chain policy. We don't have a clean answer yet. A &lt;code&gt;PreToolUse&lt;/code&gt; hook that routes installs through Xray would collapse the distinction; the same policy applies wherever the install happens.&lt;/p&gt;

&lt;p&gt;Swap in your scanner of choice. Supply-chain controls don't have to live in CI anymore. They can live at the agent's tool-call boundary, closer to where the risk is introduced.&lt;/p&gt;

&lt;p&gt;If you're a platform team standing up Claude Code at scale, you're filling that gap yourself, which is fine but probably not what you signed up for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The footguns
&lt;/h2&gt;

&lt;p&gt;Hooks are powerful because they run as the user. That's also the danger. Four real failure modes I've watched people hit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infinite loop.&lt;/strong&gt; A hook on &lt;code&gt;PostToolUse&lt;/code&gt; triggers another tool call that triggers the same hook. Fix: add a sentinel (env var or marker file) and short-circuit on repeat invocation. This bites the first time you write a &lt;code&gt;PostToolUse&lt;/code&gt; that does anything substantive.&lt;/p&gt;
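&lt;p&gt;The sentinel fix is small. A sketch using an environment variable (the name is illustrative): the hook's side effect runs with the sentinel set, so if that side effect re-triggers the same hook, the nested invocation sees it and bails instead of looping.&lt;/p&gt;

```python
import os
import subprocess

SENTINEL = "MY_POSTTOOL_HOOK_ACTIVE"  # illustrative env var name

def run_guarded(action_cmd):
    """Run the hook's side effect once; short-circuit on re-entrant invocation."""
    if os.environ.get(SENTINEL) == "1":
        return 0  # nested invocation: sentinel is set, do nothing
    env = dict(os.environ)
    env[SENTINEL] = "1"  # anything this command spawns inherits the sentinel
    return subprocess.run(action_cmd, env=env).returncode
```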

&lt;p&gt;&lt;strong&gt;The hook breaks the session and Claude has no idea why.&lt;/strong&gt; From the same thread: "if a hook fails the model has no idea why and can't self-correct." Mitigation: return useful text on stderr with exit code 2 so the model gets context, even when blocking. The default behavior of a silent block is the worst possible UX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission deadlock.&lt;/strong&gt; Also from the same thread: "I've already seen it deadlock a session when the hook permissions were set too tight." Always test with &lt;code&gt;--debug&lt;/code&gt; first. Always.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shell startup pollution.&lt;/strong&gt; Anthropic's docs explicitly warn: shell profiles printing text on startup interfere with JSON parsing. A single &lt;code&gt;echo&lt;/code&gt; line in your &lt;code&gt;.bashrc&lt;/code&gt; will silently break every JSON-output hook on your machine. This one is hilarious until it happens to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Claude Code hook?
&lt;/h3&gt;

&lt;p&gt;A user-defined command, HTTP endpoint, prompt, or agent invocation that fires at a specific lifecycle point: a tool call, session start, prompt submission, or one of 20+ other events. Defined in &lt;code&gt;settings.json&lt;/code&gt; or bundled with a plugin or skill. Returns control via exit codes and structured JSON. Hooks can block tool calls, modify their input, inject context, or just observe. (&lt;a href="https://code.claude.com/docs/en/hooks" rel="noopener noreferrer"&gt;Anthropic docs&lt;/a&gt;, 2026)&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between Claude Code hooks and skills?
&lt;/h3&gt;

&lt;p&gt;Skills are markdown workflows the model decides whether to use. Hooks are deterministic execution-boundary callbacks: they fire every time the matched event happens, regardless of what the model decides. Use skills to teach a workflow; use hooks when the precondition is non-negotiable. They pair well together, since the skill describes the procedure and the hook makes sure the agent actually followed it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Claude Code hooks reduce token usage?
&lt;/h3&gt;

&lt;p&gt;Yes, significantly. The most-cited example: a &lt;code&gt;PreToolUse&lt;/code&gt; hook that swaps Grep for LSP-based code navigation reduces per-call tokens from ~6,500 to ~580 (a 91% drop) for definition lookups, with a documented 73% real-world weekly aggregate reduction (&lt;a href="https://github.com/nesaminua/claude-code-lsp-enforcement-kit" rel="noopener noreferrer"&gt;nesaminua/claude-code-lsp-enforcement-kit&lt;/a&gt;, MIT). Other patterns (sandboxed tool output, knowledge-graph compilation) report 71.5x to 98% reductions in their respective scopes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I landed
&lt;/h2&gt;

&lt;p&gt;I came into this thinking I'd find a few clever hooks worth sharing. I came out wanting to spend a weekend writing the org-wide policy template that doesn't exist.&lt;/p&gt;

&lt;p&gt;If you've been shipping skills and ignoring hooks, you've taken the easier half. Skills are the part Claude can ignore. Hooks are the part it can't. For a solo dev that distinction is small. For a team it's most of the value.&lt;/p&gt;

&lt;p&gt;Practical first move, if you're still reading: write the &lt;code&gt;UserPromptSubmit&lt;/code&gt; hook that scrubs your most likely credentials before they leave the machine. Maybe 30 lines of Python or bash. It will catch a real incident inside a month. I'd bet on it.&lt;/p&gt;
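&lt;p&gt;The detection core of that scrubber really is small. A sketch: the two patterns below are the publicly documented shapes for AWS access keys and GitHub personal access tokens; add your org's own token formats, and wire the stdin/stdout plumbing per the hooks docs.&lt;/p&gt;

```python
import re

# Publicly documented credential shapes; extend with your org's token formats.
PATTERNS = [
    ("AWS access key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("GitHub PAT", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
]

def check_prompt(prompt):
    """Return a block decision dict if the prompt contains a credential."""
    for label, pattern in PATTERNS:
        if pattern.search(prompt):
            return {
                "decision": "block",
                "reason": f"Prompt appears to contain a {label}; remove it and retry.",
            }
    return None  # clean prompt, let it through
```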

&lt;p&gt;After that, audit your own &lt;code&gt;.claude/settings.json&lt;/code&gt;. If the count is zero or one, you've got a lot of room. (And if you also want to see what your sessions are actually doing while you sort out which hooks to write, that's what I built &lt;a href="https://dev.to/"&gt;Claudoscope&lt;/a&gt; for.)&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://liranbaba.dev/blog/claude-code-hooks-the-half-nobody-uses/" rel="noopener noreferrer"&gt;liranbaba.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>ForgeCode vs Claude Code: which AI coding agent actually wins?</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:24:56 +0000</pubDate>
      <link>https://forem.com/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c</link>
      <guid>https://forem.com/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c</guid>
      <description>&lt;p&gt;I've been using Claude Code for months. I like it. I genuinely don't get the Twitter hate. But there's one thing that's been driving me crazy: speed. I'll ask it to rename a variable across three files and it sits there thinking for 40 seconds. A simple test fix on a small repo, and I'm watching a spinner for two minutes. It's not a deal-breaker, but it's the kind of friction that builds up over a day.&lt;/p&gt;

&lt;p&gt;We recently rolled out Claude Code across our entire engineering org. We're not ditching Cursor, just giving devs the option to pick whatever tool works for them. And the feedback I kept hearing from people, unprompted: it's slow. Not everyone, not every task. But enough devs brought it up that it clearly wasn't just me being impatient.&lt;/p&gt;

&lt;p&gt;So I started looking at alternatives. OpenAI has &lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt; but I haven't tried the harness yet, just the models. The &lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0" rel="noopener noreferrer"&gt;TermBench 2.0 leaderboard&lt;/a&gt; is what caught my eye. ForgeCode at #1 with 81.8%. Claude Code at 58%, ranked #39. I installed ForgeCode that same day.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ForgeCode with Opus 4.6 was noticeably faster than Claude Code on the same tasks. Not marginal, real.&lt;/li&gt;
&lt;li&gt;ForgeCode topped &lt;a href="https://www.tbench.ai/" rel="noopener noreferrer"&gt;TermBench 2.0&lt;/a&gt; at 81.8%, but that's its own benchmark. On the independent &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench&lt;/a&gt;, the gap shrinks to 2.4 points.&lt;/li&gt;
&lt;li&gt;GPT 5.4 through ForgeCode was unstable for me. A research task on a small repo took 15 minutes.&lt;/li&gt;
&lt;li&gt;I'm double-dipping now. Claude Code is still primary, but the latency gains on ForgeCode are too real to ignore.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is ForgeCode (and why the benchmark confusion exists)?
&lt;/h2&gt;

&lt;p&gt;ForgeCode is not an AI model. It's a model-agnostic agent harness, open source under Apache 2.0, written in Rust, that wraps any LLM through OpenRouter or direct API keys. It launched in late January 2025 and hit &lt;a href="https://github.com/antinomyhq/forgecode" rel="noopener noreferrer"&gt;v2.8.0 on GitHub&lt;/a&gt; by April 2026 with over 6,000 stars.&lt;/p&gt;

&lt;p&gt;ForgeCode ships three built-in agents. &lt;code&gt;forge&lt;/code&gt; writes and edits code. &lt;code&gt;sage&lt;/code&gt; does read-only research and can't modify files. &lt;code&gt;muse&lt;/code&gt; generates plans and writes them to a &lt;code&gt;plans/&lt;/code&gt; directory. It's Zsh-native, using a &lt;code&gt;:&lt;/code&gt; prefix so you never leave your shell.&lt;/p&gt;

&lt;p&gt;Here's the thing that matters for evaluating the benchmark: TermBench 2.0 is ForgeCode's own benchmark, hosted at tbench.ai. The organization submitting entries is ForgeCode itself. That doesn't make the results wrong. But it's not a neutral third party.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does the benchmark actually hold up?
&lt;/h2&gt;

&lt;p&gt;On &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt;, an independent benchmark from Princeton and UChicago, ForgeCode + Claude 4 scored 72.7% compared to Claude 3.7 Sonnet's 70.3%. A 2.4-point gap, not the 24-point gap TermBench implies. That context changes the whole picture.&lt;/p&gt;

&lt;p&gt;The TermBench 2.0 numbers, self-reported by ForgeCode on tbench.ai:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ForgeCode + GPT 5.4: 81.8%&lt;/li&gt;
&lt;li&gt;ForgeCode + Claude Opus 4.6: 81.8%&lt;/li&gt;
&lt;li&gt;Claude Code + Claude Opus 4.6: 58.0% (rank #39)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SWE-bench Verified numbers, independent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ForgeCode + Claude 4: 72.7%&lt;/li&gt;
&lt;li&gt;Claude 3.7 Sonnet (extended thinking): 70.3%&lt;/li&gt;
&lt;li&gt;Claude 4.5 Opus: 76.8%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So how did ForgeCode reach 81.8%? Their blog documents four specific harness changes. They reordered JSON schema fields, putting &lt;code&gt;required&lt;/code&gt; before &lt;code&gt;properties&lt;/code&gt; to reduce GPT 5.4 tool-call errors. They flattened nested schemas. They added explicit truncation reminders when files are partially read. And they added a mandatory verification pass where a reviewer skill checks task completion before the agent can stop.&lt;/p&gt;

&lt;p&gt;These are real engineering improvements. They're also benchmark-specific optimizations. The r/ClaudeCode community called it "benchmaxxed," which is both funny and kind of fair.&lt;/p&gt;

&lt;p&gt;I've been eyeing this leaderboard for a while. The numbers are what pushed me to actually try ForgeCode. With Opus 4.6, it was noticeably faster than Claude Code. That part wasn't hype.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench scores&lt;/a&gt; went from 1.96% in late 2023 to 76.8% by early 2026. Everything's getting better fast. The question is whether a 2-point edge on an independent benchmark justifies switching your entire workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it's actually like to use ForgeCode
&lt;/h2&gt;

&lt;p&gt;Install is a one-liner: &lt;code&gt;curl -fsSL https://forgecode.dev/cli | sh&lt;/code&gt;. Then &lt;code&gt;forge provider login&lt;/code&gt; to set up your API keys and you're in. About the same friction as Claude Code. The Zsh plugin is a nice touch, you type &lt;code&gt;:&lt;/code&gt; followed by your prompt and it runs inline without switching contexts.&lt;/p&gt;

&lt;p&gt;First thing I tried: pointed it at my portfolio repo (Astro 6, maybe 30 files) with Opus 4.6 as the model. I asked it to add a post counter to the blog index page and wire it into the nav component. Claude Code takes about 90 seconds on that kind of task on this repo. ForgeCode did it in under 30. Correct output, clean diff, no hallucinated imports. The speed difference was immediately obvious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0smwwo4a6qc8ihow0i7q.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0smwwo4a6qc8ihow0i7q.webp" alt=" " width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran the same kind of test a few more times. A multi-file rename, adding an external link tooltip component, restructuring a layout. ForgeCode with Opus 4.6 was consistently faster. Not by a little. I could feel it in my workflow.&lt;/p&gt;

&lt;p&gt;Plan mode was the other thing that stood out. ForgeCode's &lt;code&gt;muse&lt;/code&gt; agent writes plans to a &lt;code&gt;plans/&lt;/code&gt; directory, and the output felt more detailed and verbose than Claude Code's plan mode. Whether that's good or bad depends on what you want. I kind of liked having the longer breakdown.&lt;/p&gt;

&lt;p&gt;Then I tried GPT 5.4 through ForgeCode, and it fell apart. I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it. So "ForgeCode is fast" needs a qualifier: ForgeCode with Opus 4.6 is fast. ForgeCode with GPT 5.4 was borderline unusable for me.&lt;/p&gt;

&lt;p&gt;But I'll give them this: the ForgeCode team explicitly says they've hired zero paid influencers. The low social media presence is intentional. Kind of respect that. In an industry where half the "honest reviews" have affiliate links in the description, that's almost suspiciously refreshing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ForgeCode is actually faster
&lt;/h2&gt;

&lt;p&gt;Part of it is just the Rust binary (Claude Code is TypeScript, so startup and memory are heavier). But that's not the whole story.&lt;/p&gt;

&lt;p&gt;ForgeCode has a context engine that indexes function signatures and module boundaries instead of dumping raw files into the context window. The agent pulls only what it needs. Some estimates say this cuts context size by about 90%, which means faster responses and cheaper models that don't lose the plot halfway through a task. That's the real reason the same model (Opus 4.6) responds faster through ForgeCode than through Claude Code.&lt;/p&gt;

&lt;p&gt;There's also a &lt;code&gt;--sandbox&lt;/code&gt; flag that creates an isolated git worktree and branch, so you can try something risky without touching your main tree and only merge back what works.&lt;/p&gt;

&lt;p&gt;What Claude Code has built &lt;em&gt;around&lt;/em&gt; the core loop (parallel agent execution, hooks, scheduled cloud tasks, auto-memory) doesn't exist in ForgeCode yet. The harness is fast. Everything around it is thin. ForgeCode is a Lambo with no cup holder. Fast as hell, but you're holding your coffee between your knees.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I missed when I wasn't using Claude Code
&lt;/h2&gt;

&lt;p&gt;I didn't appreciate this until I spent a few days away from Claude Code: the stuff around the agent matters more than the agent itself.&lt;/p&gt;

&lt;p&gt;With Claude Code, I have a CLAUDE.md in every project. My team shares the same project instructions. I have hooks that fire on file changes, so I can run secret scanning, linting, whatever I want on every edit. Auto-memory means I don't re-explain my codebase every session. And checkpoints mean every file edit gets snapshotted, so if the agent breaks something three steps back, I hit &lt;code&gt;/rewind&lt;/code&gt; and roll back without touching git.&lt;/p&gt;

&lt;p&gt;ForgeCode has AGENTS.md (similar idea to CLAUDE.md) and MCP support, so the basics are covered. But no hooks, no checkpoints, no auto-memory, no IDE extensions, no JetBrains plugin. The model-agnostic part is great. The ecosystem is still thin.&lt;/p&gt;

&lt;p&gt;For reference, here's the head-to-head:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ForgeCode&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model choice&lt;/td&gt;
&lt;td&gt;Any (300+)&lt;/td&gt;
&lt;td&gt;Claude only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes (Apache 2.0)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;TypeScript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project config&lt;/td&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;CLAUDE.md (hierarchical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP support&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (extensive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hooks&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (6 types)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled tasks&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (cloud + local)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-agents&lt;/td&gt;
&lt;td&gt;Yes (forge/sage/muse)&lt;/td&gt;
&lt;td&gt;Yes (parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan mode&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (Shift+Tab)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VS Code&lt;/td&gt;
&lt;td&gt;No extension&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JetBrains&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto memory&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checkpoints / rewind&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Where I landed
&lt;/h2&gt;

&lt;p&gt;I'm double-dipping. Claude Code is still my primary tool, but I keep ForgeCode open for tasks where the latency kills me. Sometimes I'll drop into Cursor for something visual. Three tools is kind of ridiculous, but the latency gains on ForgeCode are real enough that I can't just ignore them.&lt;/p&gt;

&lt;p&gt;Claude Code is where my project config lives, where my hooks fire, where my MCP connections run. That's my home base and it's not changing. But when I need something fast and self-contained, a quick refactor, a file rename across a module, something where I don't need the full ecosystem, I'll run it through ForgeCode with Opus 4.6 and it's done before Claude Code would've finished reading the context.&lt;/p&gt;

&lt;p&gt;As of April 2026, ForgeCode is faster than Claude Code when running the same model (Opus 4.6), but Claude Code has the deeper ecosystem with hooks, MCP, auto-memory, and IDE integrations. Neither wins across the board. Pick the one that matches how you work and be ready to use both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is ForgeCode's TermBench #1 score legitimate?
&lt;/h3&gt;

&lt;p&gt;TermBench is ForgeCode's own benchmark. On &lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt;, an independent benchmark from Princeton, ForgeCode + Claude 4 scored 72.7% compared to Claude 3.7 Sonnet's 70.3%. Solid, but not the 24-point gap TermBench suggests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can ForgeCode use my existing Claude or ChatGPT subscription?
&lt;/h3&gt;

&lt;p&gt;No. You need API keys, not a subscription login. Separate billing from whatever you pay for Claude Pro or ChatGPT Plus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does ForgeCode burn more tokens than Claude Code?
&lt;/h3&gt;

&lt;p&gt;Nobody's published hard numbers. ForgeCode's multi-agent setup (forge/sage/muse spawning sub-agents) almost certainly burns more tokens per session. I noticed it anecdotally but didn't measure. Track your own spend if you try it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is ForgeCode safe for proprietary code?
&lt;/h3&gt;

&lt;p&gt;The harness is open source, but default telemetry collects git user emails, scans SSH directories, and sends conversation data externally. &lt;a href="https://github.com/antinomyhq/forgecode/issues/1318" rel="noopener noreferrer"&gt;GitHub issue #1318&lt;/a&gt; raised data transparency concerns. The team addressed it in March 2025: set &lt;code&gt;FORGE_TRACKER=false&lt;/code&gt; to disable all tracking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is ForgeCode free?
&lt;/h3&gt;

&lt;p&gt;The code is free and open source (Apache 2.0). The hosted service was &lt;a href="https://reddit.com/r/cursor/comments/1maq1ex" rel="noopener noreferrer"&gt;originally unlimited&lt;/a&gt;, but switched to a tiered model in mid-2025 with daily request caps on the free tier.&lt;/p&gt;




&lt;p&gt;ForgeCode's benchmark lead exists on a test it runs itself. On independent benchmarks, it's comparable. The speed with Opus 4.6 is real. The GPT 5.4 experience was rough.&lt;/p&gt;

&lt;p&gt;I didn't expect to end up running two coding agents. But here I am. If ForgeCode ships hooks and the ecosystem catches up, that could change. For now, I'm using both, and it's working.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/antinomyhq/forgecode" rel="noopener noreferrer"&gt;ForgeCode GitHub Repository&lt;/a&gt; - GitHub, April 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.tbench.ai/leaderboard/terminal-bench/2.0" rel="noopener noreferrer"&gt;TermBench 2.0 Leaderboard&lt;/a&gt; - tbench.ai, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.swebench.com" rel="noopener noreferrer"&gt;SWE-bench Verified Leaderboard&lt;/a&gt; - Princeton/UChicago, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://code.claude.com/docs/en/" rel="noopener noreferrer"&gt;Claude Code Documentation&lt;/a&gt; - Anthropic, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/news/claude-3-7-sonnet" rel="noopener noreferrer"&gt;Anthropic Claude 3.7 Sonnet Announcement&lt;/a&gt; - Anthropic, February 2025&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://liranbaba.dev/blog/forgecode-vs-claude-code/" rel="noopener noreferrer"&gt;liranbaba.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>claudecode</category>
      <category>forgecode</category>
    </item>
    <item>
      <title>Cursor 3 shipped parallel agents, but is any of it new?</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Sun, 05 Apr 2026 15:06:27 +0000</pubDate>
      <link>https://forem.com/liran_baba/cursor-3-shipped-parallel-agents-but-is-any-of-it-new-2dd1</link>
      <guid>https://forem.com/liran_baba/cursor-3-shipped-parallel-agents-but-is-any-of-it-new-2dd1</guid>
      <description>&lt;p&gt;Cursor 3 shipped on April 2. The demos look great: eight AI agents running in parallel, each in its own Git worktree, building different parts of your project at the same time. The &lt;a href="https://news.ycombinator.com/item?id=47618084" rel="noopener noreferrer"&gt;Hacker News thread&lt;/a&gt; lit up. Product Hunt gave it the #3 spot for the day.&lt;/p&gt;

&lt;p&gt;Then I read the comments. One user reported spending $2,000 in two days on cloud agents. Another switched from $1,800/month on Cursor to roughly $200/month on Claude Code and Codex. A third said they had "zero interest" in forced agent swarms and were moving to VS Code with Claude Code instead.&lt;/p&gt;

&lt;p&gt;The coverage so far has been mostly feature recaps reprinting the press release. Nobody's asking the obvious questions: is parallel agent execution actually new? What does it really cost? And what happens when your agents need to share context?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Here's the Thing&lt;/strong&gt;&lt;br&gt;
Cursor 2 already supported parallel execution via worktree.json configuration. What Cursor 3 actually shipped is a UI layer (Agents Window sidebar, drag-drop tabs) on top of the same Git worktree primitives. The cost model is the real concern: early testers reported $2,000 bills in two days, and Cursor's pricing page doesn't explain why. The unsolved technical problem is context sharing between local and cloud agents, which the docs hand-wave as "summarized and reduced."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Cursor 3 actually shipped
&lt;/h2&gt;

&lt;p&gt;Cursor 3 lets you run up to 8 AI agents in parallel across isolated Git worktrees (&lt;a href="https://cursor.com/blog/cursor-3" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, 2026). Agents run locally via Composer 2 or in cloud isolation VMs. You can watch them all from a new sidebar called the Agents Window.&lt;/p&gt;

&lt;p&gt;That's the pitch, anyway.&lt;/p&gt;

&lt;p&gt;Cursor 2 already supported parallel agent execution through worktree.json configuration. The &lt;code&gt;/worktree&lt;/code&gt; command isn't new functionality. It's new UI. The Agents Window gives you visibility into what your agents are doing, and that part is genuinely useful. But calling this an architectural pivot is a stretch.&lt;/p&gt;

&lt;p&gt;The other additions: &lt;code&gt;/best-of-n&lt;/code&gt; runs the same prompt across multiple models side by side (Composer 2 vs. Claude vs. GPT). Design Mode lets you annotate UI elements and describe changes in plain English. The MCP Marketplace adds plugin support for hundreds of tools.&lt;/p&gt;

&lt;p&gt;Under the hood, &lt;code&gt;/worktree&lt;/code&gt; runs &lt;code&gt;git worktree add&lt;/code&gt; to create an isolated working directory on a new branch, then spawns an agent process scoped to that directory. Each agent gets its own filesystem view, so file edits don't collide mid-run. When the agent finishes, you review the diff and merge. This is the same thing you'd do manually with &lt;code&gt;git worktree add&lt;/code&gt; and a second terminal. Cursor 3 wraps it in a sidebar.&lt;/p&gt;
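&lt;p&gt;For reference, the manual version is a handful of commands (paths and branch names here are illustrative):&lt;/p&gt;

```shell
# Manual equivalent of Cursor's /worktree: isolate each agent on its own
# branch and directory. Paths and branch names are illustrative.
set -e
work=$(mktemp -d)
git -C "$work" init -q repo
cd "$work/repo"
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "base"

# One isolated working directory per agent, each on its own branch:
git worktree add -q "$work/agent-a" -b feature/agent-a
git worktree add -q "$work/agent-b" -b feature/agent-b

# Point each agent at its own directory; review the diffs and merge when done.
git worktree list
```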

&lt;h2&gt;
  
  
  The cost problem nobody is talking about
&lt;/h2&gt;

&lt;p&gt;Early adopters reported spending $2,000+ in two days running Cursor 3's cloud agents (&lt;a href="https://news.ycombinator.com/item?id=47618084" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;, 2026). That's not a typo. Two thousand dollars. Two days.&lt;/p&gt;

&lt;p&gt;Cursor's pricing page lists four tiers: Free, Pro at $20, Pro+ at $60, and Ultra at $200 per month (&lt;a href="https://cursor.com/pricing" rel="noopener noreferrer"&gt;cursor.com/pricing&lt;/a&gt;, 2026). Those numbers look reasonable until you start running cloud agents, whose resource costs are absent from the page entirely: no per-minute VM charges listed, no explanation of how usage is metered.&lt;/p&gt;

&lt;p&gt;HN user dirtbag__dad reported spending "$2k a week with premium models" before switching to Claude Code Max at "1/10th the price." Another commenter, verelo, switched from $1,800/month on Cursor to roughly $200/month on Claude and Codex, calling it "WAY better value for money."&lt;/p&gt;

&lt;p&gt;Same story every time. Listed price and actual spend have almost nothing in common. When your pricing page says $200/month but users regularly spend ten times that, the issue isn't pricing. It's that nobody can predict what anything costs before the bill shows up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code isn't immune either
&lt;/h3&gt;

&lt;p&gt;I should be fair here. Anthropic's flat-rate plans sound predictable, but they have their own version of this.&lt;/p&gt;

&lt;p&gt;In late March 2026, Claude Code Max plan users reported exhausting their quotas in under an hour, down from the eight hours the same quota had previously lasted (&lt;a href="https://www.theregister.com/2026/03/31/anthropic_claude_code_limits/" rel="noopener noreferrer"&gt;The Register&lt;/a&gt;, 2026). The story pulled 324 points on Hacker News. BBC covered it a day later.&lt;/p&gt;

&lt;p&gt;Anthropic acknowledged the problem on Reddit: "people are hitting usage limits in Claude Code way faster than expected." A March promotion that doubled limits ended on March 28. There were reports of prompt cache bugs inflating token usage by 10-20x. And Anthropic doesn't publicly specify exact usage caps for any plan.&lt;/p&gt;

&lt;p&gt;So people started building tools just to figure out their own limits. API proxy interceptors. One developer &lt;a href="https://www.claudecodecamp.com/p/i-tried-to-reverse-engineer-claude-code-s-usage-limits" rel="noopener noreferrer"&gt;tried to reverse-engineer the utilization headers&lt;/a&gt; that Anthropic sends on every API response, because Claude Code doesn't surface them to you.&lt;/p&gt;

&lt;p&gt;I &lt;a href="https://liranbaba.dev/blog/found-database-password-in-claude-code-session/" rel="noopener noreferrer"&gt;built Claudoscope&lt;/a&gt; partly for this reason. If the tool won't tell you what it costs, build something that will.&lt;/p&gt;

&lt;p&gt;Both tools have cost transparency problems. They're just structured differently. Cursor's is per-token opacity: you don't know what cloud agents will cost until the bill arrives. Anthropic's is undisclosed caps on plans marketed as generous. Neither side has figured this out yet, which is kind of remarkable given how much both charge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The context sharing problem
&lt;/h2&gt;

&lt;p&gt;This is the technical gap that nobody's writing about, and it's the one that actually matters for how well parallel agents work in practice.&lt;/p&gt;

&lt;p&gt;Each worktree agent runs in its own isolated branch. That's the point: isolation prevents file conflicts. But it also means Agent A doesn't know what Agent B is doing. If you're building an API endpoint in one worktree and the frontend that calls it in another, those agents are working from the same base commit. Neither sees the other's in-progress changes.&lt;/p&gt;

&lt;p&gt;Cursor's docs say local and cloud agent contexts are "summarized and reduced" before sharing. That's doing a lot of work as a sentence. How much of a 100k-line codebase survives summarization? What's the token budget for the summary? Is it a full AST-aware summary or just file path lists? The docs don't say.&lt;/p&gt;

&lt;p&gt;There's also the committed-vs-dirty question. Are cloud agents working from the latest committed state on the branch, or from your local uncommitted edits? If committed: you have to commit before spawning cloud agents, which means half-finished code landing in your Git history. If uncommitted: they need filesystem sync between local and cloud, which introduces latency and consistency issues. The docs are silent on this too.&lt;/p&gt;

&lt;p&gt;I've hit a version of this problem with Claude Code's worktree parallelism. Two agents building against the same API contract will sometimes diverge on field names or response shapes because neither agent sees the other's work until merge time. The fix is manual: define the contract first, commit it, then parallelize. That works, but it means true parallelism requires upfront planning that eats into the time savings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://liranbaba.dev/blog/claude-code-source-leak/" rel="noopener noreferrer"&gt;The Claude Code source leak&lt;/a&gt; exposed how their agent orchestration handles this internally: spawning sub-agents, tool call cascading through orchestration layers, sessions that retry failed operations in loops. Context sharing between agents is an unsolved problem across the entire category, not just Cursor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What parallel agents actually solve (and when they don't)
&lt;/h2&gt;

&lt;p&gt;Parallel agents deliver real speedups for the right kind of work. Building a full-stack feature with decoupled components? Four agents in parallel (UI, API, database, tests) can cut wall-clock time from eight hours to two (&lt;a href="https://cursor.com/docs/configuration/worktrees" rel="noopener noreferrer"&gt;Cursor docs&lt;/a&gt;, 2026). That's a genuine 4x on paper.&lt;/p&gt;

&lt;p&gt;I use Claude Code's worktree-based parallelism for similar workflows. Spin up multiple agents, each in an isolated branch, merge when they're done. The UX is rougher: no Agents Window, no drag-drop tabs, no visual status at a glance. But the core capability is the same, and the cost is flat.&lt;/p&gt;

&lt;p&gt;Here's where it falls apart. When Agent B depends on Agent A's output, you can't parallelize. That's most real work. For tasks under 30 minutes, the orchestration overhead eats the speedup. Solo devs on small projects get almost nothing from running eight agents simultaneously. And the context sharing gap I described above means agents working on related components will diverge unless you've done the upfront contract work.&lt;/p&gt;




&lt;p&gt;Cursor 3 is a polished UI layer on existing capabilities, positioned as an architectural breakthrough. The parallel agents are real but not new. The cost model is real but not transparent.&lt;/p&gt;

&lt;p&gt;If you're already in Claude Code, I don't see a reason to switch. If you're evaluating for the first time, try both. Run each for a week on real work, not demos. Track what you actually spend. Then decide.&lt;/p&gt;

&lt;p&gt;Or skip both and try &lt;a href="https://forgecode.dev/" rel="noopener noreferrer"&gt;ForgeCode&lt;/a&gt;. It's open source, terminal-based, and topped TermBench 2.0 at 81.8%. You bring your own API keys and pick your model. I haven't used it yet, but I'm giving it a weekend. Their blog post about hitting #1 is titled "benchmarks don't matter," which I kind of respect.&lt;/p&gt;

&lt;p&gt;That's really all I've got. Track your costs. The rest will sort itself out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much does Cursor 3 actually cost per month?
&lt;/h3&gt;

&lt;p&gt;Plans start at $20/month but real-world spend with cloud agents ranges from $200 to $1,800+ per month based on Hacker News community reports (&lt;a href="https://news.ycombinator.com/item?id=47618084" rel="noopener noreferrer"&gt;HN&lt;/a&gt;, 2026). Cloud agent resource costs aren't disclosed on the pricing page. Track your actual spend for a full week before committing to a plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you run Cursor 3 agents locally without cloud costs?
&lt;/h3&gt;

&lt;p&gt;Yes, local agents run Composer 2 on-device with no per-use charges. Cloud agents are where the parallel execution actually matters, though, and those costs aren't disclosed anywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Cursor 3 better than Claude Code for parallel tasks?
&lt;/h3&gt;

&lt;p&gt;Claude Code supports parallel execution via worktrees at a flat $100-$200/month rate. Cursor 3 offers better visual orchestration through the Agents Window but with unpredictable costs. Pick based on what matters more to you: UI visibility or cost predictability.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cursor.com/blog/cursor-3" rel="noopener noreferrer"&gt;Cursor 3 Announcement&lt;/a&gt; - Cursor, April 2, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cursor.com/pricing" rel="noopener noreferrer"&gt;Cursor Pricing&lt;/a&gt; - cursor.com, April 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cursor.com/docs/configuration/worktrees" rel="noopener noreferrer"&gt;Cursor Parallel Agents Docs&lt;/a&gt; - Cursor docs&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://news.ycombinator.com/item?id=47618084" rel="noopener noreferrer"&gt;HN: Cursor 3 Discussion&lt;/a&gt; - Hacker News, April 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.theregister.com/2026/03/31/anthropic_claude_code_limits/" rel="noopener noreferrer"&gt;Claude Code users hitting usage limits&lt;/a&gt; - The Register, March 31, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.claudecodecamp.com/p/i-tried-to-reverse-engineer-claude-code-s-usage-limits" rel="noopener noreferrer"&gt;Reverse Engineering Claude Code Limits&lt;/a&gt; - Claude Code Camp, April 1, 2026&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://liranbaba.dev/blog/cursor-3-parallel-agents/" rel="noopener noreferrer"&gt;liranbaba.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
    </item>
    <item>
      <title>Undercover mode, decoy tools, and a 3,167-line function: inside Claude Code's leaked source</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Thu, 02 Apr 2026 20:34:48 +0000</pubDate>
      <link>https://forem.com/liran_baba/undercover-mode-decoy-tools-and-a-3167-line-function-inside-claude-codes-leaked-source-2159</link>
      <guid>https://forem.com/liran_baba/undercover-mode-decoy-tools-and-a-3167-line-function-inside-claude-codes-leaked-source-2159</guid>
      <description>&lt;p&gt;On March 31, a single &lt;code&gt;.map&lt;/code&gt; file shipped inside an npm package and exposed the complete internals of Claude Code. The &lt;a href="https://news.ycombinator.com/item?id=47584540" rel="noopener noreferrer"&gt;Hacker News thread&lt;/a&gt; hit 2,060 points. Anthropic filed DMCA takedowns against 8,100+ GitHub repos. And I spent most of the afternoon reading TypeScript I wasn't supposed to see.&lt;/p&gt;

&lt;p&gt;I use Claude Code every day. I built &lt;a href="https://claudoscope.com/" rel="noopener noreferrer"&gt;Claudoscope&lt;/a&gt; because I wanted to understand what it was actually doing in my terminal. So when the source dropped, I went through it. Some of it confirmed things I'd suspected. Some of it genuinely surprised me.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A JavaScript source map in Claude Code v2.1.88 exposed ~1,700 TypeScript source files (&lt;a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/" rel="noopener noreferrer"&gt;alex000kim&lt;/a&gt;, 2026)&lt;/li&gt;
&lt;li&gt;Unreleased features include KAIROS autonomous mode, anti-distillation decoy tools, and "undercover mode" that hides AI authorship&lt;/li&gt;
&lt;li&gt;Anthropic's DMCA takedown hit 8,100+ repos, many containing no leaked code&lt;/li&gt;
&lt;li&gt;A clean-room rewrite called Claw Code gained 146,000 GitHub stars in under 48 hours&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;Security researcher Chaofan Shou &lt;a href="https://x.com/shoucccc/status/2038894956459290963" rel="noopener noreferrer"&gt;disclosed on X&lt;/a&gt; that Anthropic had shipped a JavaScript source map file inside Claude Code version 2.1.88 on npm. Source maps are debugging artifacts. They contain the original, readable TypeScript source before minification. They're not supposed to ship to production. This one did.&lt;/p&gt;

&lt;p&gt;Early speculation blamed a known Bun bug (&lt;a href="https://github.com/oven-sh/bun/issues/28001" rel="noopener noreferrer"&gt;oven-sh/bun#28001&lt;/a&gt;) where &lt;code&gt;bun serve&lt;/code&gt; sometimes exposes source maps in production. But that bug affects web apps hosted by Bun, not packages bundled with Bun and run locally. Claude Code uses Bun as a bundler and local runtime, not as a web server. Jarred Sumner, Bun's creator and now an Anthropic employee, confirmed Claude Code doesn't use &lt;code&gt;bun serve&lt;/code&gt;, ruling this out. His comment was, as far as anyone can tell, the only public response from an Anthropic employee about the leak. The actual cause of the source map shipping in the npm package remains unexplained.&lt;/p&gt;

&lt;p&gt;About 1,700 source files were exposed, spread across utils (564 files), components (389), commands (189), tools (184), services (130), hooks (104), ink (96), and bridge (31) directories. The &lt;code&gt;.map&lt;/code&gt; file sat on the npm CDN for anyone to download. When Anthropic responded, they deprecated the package version rather than unpublishing it, so the file stayed somewhat accessible even after the response.&lt;/p&gt;

&lt;p&gt;The HN thread generated 1,013 comments. Two follow-up analysis posts scored 1,354 and 1,078 points. People were interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was inside the code?
&lt;/h2&gt;

&lt;p&gt;The exposed code contains 35+ tools across six categories, 73+ slash commands, and over 200 server-side feature gates (&lt;a href="https://ccunpacked.dev/" rel="noopener noreferrer"&gt;ccunpacked.dev&lt;/a&gt;, 2026). The community built a &lt;a href="https://ccunpacked.dev/" rel="noopener noreferrer"&gt;visual guide&lt;/a&gt; mapping out an 11-step agent loop from keypress to response.&lt;/p&gt;

&lt;p&gt;The main &lt;code&gt;print.ts&lt;/code&gt; file is 5,594 lines long. Inside it, a single function spans 3,167 lines at 12 levels of nesting (&lt;a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/" rel="noopener noreferrer"&gt;alex000kim&lt;/a&gt;, 2026). Not great.&lt;/p&gt;

&lt;p&gt;There's an operational bug affecting 1,279 sessions that hit 50+ consecutive failures, wasting roughly 250,000 API calls per day globally. HN commenters said it was fixable with three lines.&lt;/p&gt;

&lt;p&gt;The tool taxonomy is more interesting than the code quality issues. File operations, bash execution, web browsing, agent orchestration, task management, cron jobs, worktree isolation. What looks like a coding assistant in the terminal is actually a full agent framework. Daemon mode. Unix domain socket communication between sessions. Remote control via mobile and browser.&lt;/p&gt;

&lt;p&gt;I've been watching Claude Code's behavior through Claudoscope session logs for months. The leaked architecture confirms patterns I'd noticed in the wild: tool calls cascading through orchestration layers, sessions spawning sub-agents, loops where it burns through tokens retrying failed operations over and over. Reading the source was like finally seeing the schematic for a machine I'd only heard running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The features nobody was supposed to see
&lt;/h2&gt;

&lt;p&gt;The most discussed findings weren't about code quality. They were about where Anthropic is heading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KAIROS&lt;/strong&gt; is a persistent autonomous agent mode. It runs on periodic &lt;code&gt;&amp;lt;tick&amp;gt;&lt;/code&gt; prompts, maintains daily append-only logs, subscribes to GitHub webhooks, and spawns background daemon workers. The source states it "becomes more autonomous when terminal unfocused." It includes a &lt;code&gt;/dream&lt;/code&gt; skill and five-minute cron refreshes. Claude Code that doesn't wait for you to type. That's what this is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Undercover mode&lt;/strong&gt; drew the sharpest reaction. The file &lt;code&gt;undercover.ts&lt;/code&gt; suppresses all signs of AI authorship when contributing to public or open-source repos. The instructions are blunt: "NEVER include the phrase 'Claude Code' or any mention that you are an AI" and remove "Co-Authored-By lines or any other attribution." It only runs for Anthropic employees (&lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;). The code says: "There is NO force-OFF."&lt;/p&gt;

&lt;p&gt;I keep coming back to this one. A company that's built its identity on AI safety and transparency had a mode specifically designed to hide AI involvement in open-source contributions. The file also prevents mention of internal model codenames like "Capybara" and "Tengu," which suggests unreleased models Anthropic hasn't publicly acknowledged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-distillation&lt;/strong&gt; sends decoy tool definitions to poison training data if competitors scrape API traffic. A secondary mechanism uses server-side text summarization with cryptographic signatures between tool calls to obscure reasoning chains. As multiple HN commenters pointed out, the strategic value of this system "evaporated the moment the .map file hit the CDN."&lt;/p&gt;

&lt;p&gt;Other exposed systems: native client attestation (DRM-like cryptographic verification of legitimate Claude Code binaries), frustration detection via regex (pattern-matching profanity like "wtf" and "dumbass" instead of using the LLM itself, which is kind of funny), and Buddy, a virtual terminal pet that turned out to be the 2026 April Fools' feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DMCA overreaction
&lt;/h2&gt;

&lt;p&gt;Anthropic's response to the leak may end up being the bigger story. On March 31 they filed DMCA takedown notices targeting an entire fork network of &lt;a href="https://github.com/github/dmca/blob/master/2026/03/2026-03-31-anthropic.md" rel="noopener noreferrer"&gt;8,100+ repositories&lt;/a&gt; on GitHub. The notice said: "The entire repository is infringing."&lt;/p&gt;

&lt;p&gt;Many of those repos had nothing to do with the leak. One developer &lt;a href="https://news.ycombinator.com/item?id=47584540" rel="noopener noreferrer"&gt;noted on HN&lt;/a&gt; that their fork "had not been modified since May" and "did not contain a copy of the leaked code." Others called it "misguided" and "ridiculous." I mean, yeah.&lt;/p&gt;

&lt;p&gt;The legal questions get weird fast. If Claude Code was partly written by Claude itself (Anthropic says they use their own tools internally), does the AI-generated portion qualify for copyright protection? One commenter raised a sharper point: &lt;code&gt;undercover.ts&lt;/code&gt; explicitly hides AI authorship, which could undermine Anthropic's own copyright claims. And a knowingly false DMCA claim can expose the filer to perjury liability.&lt;/p&gt;

&lt;p&gt;Anthropic executives later said the mass takedowns were accidental and retracted most of the notices (&lt;a href="https://techcrunch.com/2026/04/01/anthropic-took-down-thousands-of-github-repos-trying-to-yank-its-leaked-source-code-a-move-the-company-says-was-an-accident/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;, 2026). But by then the Streisand effect had done its work. Every takedown drew more attention to the code they were trying to hide.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the actual security risks?
&lt;/h2&gt;

&lt;p&gt;No user data was exposed. But the leak did expose systems Anthropic relies on to protect its product.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System exposed&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anti-distillation decoy tools&lt;/td&gt;
&lt;td&gt;Anyone scraping API traffic can now filter for fakes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Native client attestation&lt;/td&gt;
&lt;td&gt;Cryptographic hash mechanism publicly documented&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security header feature flags&lt;/td&gt;
&lt;td&gt;Remote disabling of security headers revealed&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unreleased product roadmap&lt;/td&gt;
&lt;td&gt;KAIROS, UltraPlan, Coordinator Mode visible to competitors&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal model codenames&lt;/td&gt;
&lt;td&gt;"Capybara," "Tengu" disclosed&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational bugs&lt;/td&gt;
&lt;td&gt;250K wasted API calls/day, trivially fixable&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The anti-distillation system is the clearest loss. Its entire value depended on competitors not knowing it existed.&lt;/p&gt;

&lt;p&gt;This connects to something I've written about before. When I &lt;a href="https://dev.to/blog/found-database-password-in-claude-code-session"&gt;found my database password sitting in a Claude Code session file&lt;/a&gt;, the issue wasn't that Claude Code was doing something malicious. The issue was that it operates with deep filesystem access and stores everything in unencrypted JSONL files that nobody checks. The source leak confirms what I suspected: there's limited internal safeguarding around what gets stored and transmitted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claw Code: 146K stars in 48 hours
&lt;/h2&gt;

&lt;p&gt;Within hours of the leak, a developer ported Claude Code's core architecture to Python and Rust from scratch. &lt;a href="https://github.com/ultraworkers/claw-code" rel="noopener noreferrer"&gt;Claw Code&lt;/a&gt; hit 146,000 GitHub stars and 101,000 forks in under 48 hours.&lt;/p&gt;

&lt;p&gt;It's a clean-room rewrite, not a fork of the leaked code. The repo disclaims any affiliation with Anthropic and says the exposed snapshot "is no longer part of the tracked repository state." The developer was later featured in a Wall Street Journal article as a power user who consumed "25 billion tokens" of AI coding tools per year.&lt;/p&gt;

&lt;p&gt;The project includes an interactive CLI, plugin system, MCP orchestration, streaming API support, and LSP integration. Rust (92.9%), Python (7.1%).&lt;/p&gt;

&lt;p&gt;We've seen this before. When Meta's LLaMA model weights leaked in 2023, they chased takedowns for a while, then gave up and went open. The community built derivatives no matter what legal said. 146K stars on Claw Code tells you what developers actually want. Whether Anthropic decides to offer an open alternative is almost beside the point now.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;This didn't happen in isolation. It capped a rough month for Anthropic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feb 16: Pentagon threatened Anthropic with punitive action&lt;/li&gt;
&lt;li&gt;Mar 5: Pentagon formally labeled Anthropic a "supply chain risk" (&lt;a href="https://www.wsj.com/politics/national-security/pentagon-formally-labels-anthropic-supply-chain-risk-escalating-conflict-ebdf0523" rel="noopener noreferrer"&gt;WSJ&lt;/a&gt;, 2026)&lt;/li&gt;
&lt;li&gt;Mar 9: Anthropic sued the Pentagon (&lt;a href="https://www.axios.com/2026/03/09/anthropic-sues-pentagon-supply-chain-risk-label" rel="noopener noreferrer"&gt;Axios&lt;/a&gt;, 2026)&lt;/li&gt;
&lt;li&gt;Mar 26: Federal judge blocked the Pentagon's effort (&lt;a href="https://www.cnn.com/2026/03/26/business/anthropic-pentagon-injunction-supply-chain-risk" rel="noopener noreferrer"&gt;CNN&lt;/a&gt;, 2026)&lt;/li&gt;
&lt;li&gt;Mar 31: Source code leaked via npm. DMCA takedowns hit 8,100+ repos&lt;/li&gt;
&lt;li&gt;Apr 1: TechCrunch runs &lt;a href="https://techcrunch.com/2026/03/31/anthropic-is-having-a-month/" rel="noopener noreferrer"&gt;"Anthropic is having a month"&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic built its brand on responsible development and safety-first engineering. Then a source map shipped in an npm package and nobody caught it. The DMCA response hit thousands of uninvolved developers. And &lt;code&gt;undercover.ts&lt;/code&gt; was hiding AI authorship while the company publicly advocated for transparency.&lt;/p&gt;

&lt;p&gt;I still use Claude Code. I don't think it's a bad product. But the gap between the safety messaging and the operational reality is now documented in 1,700 TypeScript files. Anyone can read them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do now
&lt;/h2&gt;

&lt;p&gt;If you use Claude Code, there's nothing you need to patch or update. The leak was Anthropic's source code, not your data.&lt;/p&gt;

&lt;p&gt;What's worth paying attention to is how Anthropic responds. As of this writing, there's been no official statement on their newsroom, blog, or developer channels. The only Anthropic employee who commented publicly was Jarred Sumner, and only to clarify that the Bun bug wasn't the cause. Whether they address undercover mode, the DMCA overreach, or the anti-distillation system will say a lot about how they handle things going forward.&lt;/p&gt;

&lt;p&gt;And if you're eyeing Claw Code as an alternative, know what you're getting into. It's a clean-room rewrite with different internals, not a fork.&lt;/p&gt;

&lt;p&gt;Or maybe this is the push to try something else entirely. &lt;a href="https://forgecode.dev/" rel="noopener noreferrer"&gt;ForgeCode&lt;/a&gt; currently tops TermBench 2.0 and has been getting a lot of attention. I haven't switched yet, but I'd be lying if I said I wasn't curious.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What exactly was leaked in the Claude Code source code?
&lt;/h3&gt;

&lt;p&gt;The full TypeScript source, exposed via a JavaScript source map in npm package v2.1.88. It included 35+ tools, 73+ slash commands, 200+ feature gates, and unreleased features like KAIROS autonomous mode and undercover mode (&lt;a href="https://ccunpacked.dev/" rel="noopener noreferrer"&gt;ccunpacked.dev&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why did Anthropic take down 8,100 GitHub repositories?
&lt;/h3&gt;

&lt;p&gt;They filed DMCA takedown notices targeting the entire fork network of the repo hosting the leaked code. Many repos contained no leaked material. Anthropic later called the mass takedown accidental and retracted most notices (&lt;a href="https://techcrunch.com/2026/04/01/anthropic-took-down-thousands-of-github-repos-trying-to-yank-its-leaked-source-code-a-move-the-company-says-was-an-accident/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;, 2026).&lt;/p&gt;

&lt;h3&gt;
  
  
  Is my data at risk from the Claude Code leak?
&lt;/h3&gt;

&lt;p&gt;No. This was source code, not user data. That said, the source did reveal how session data is handled and that feature flags exist to disable security headers remotely.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Claw Code?
&lt;/h3&gt;

&lt;p&gt;Someone ported Claude Code's core architecture to Python and Rust from scratch within hours of the leak. It's a clean-room rewrite, not a fork. 146,000 stars and 101,000 forks in under 48 hours. Not affiliated with Anthropic (&lt;a href="https://github.com/ultraworkers/claw-code" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;).&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/" rel="noopener noreferrer"&gt;Claude Code Source Leak Analysis&lt;/a&gt; - alex000kim, March 31, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ccunpacked.dev/" rel="noopener noreferrer"&gt;Claude Code Unpacked Visual Guide&lt;/a&gt; - ccunpacked.dev, April 1, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/github/dmca/blob/master/2026/03/2026-03-31-anthropic.md" rel="noopener noreferrer"&gt;Anthropic DMCA Notice&lt;/a&gt; - GitHub DMCA Archive, March 31, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://news.ycombinator.com/item?id=47584540" rel="noopener noreferrer"&gt;HN Thread: Source Leak Disclosure&lt;/a&gt; - Hacker News, March 31, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/04/01/anthropic-took-down-thousands-of-github-repos-trying-to-yank-its-leaked-source-code-a-move-the-company-says-was-an-accident/" rel="noopener noreferrer"&gt;Anthropic took down thousands of GitHub repos&lt;/a&gt; - TechCrunch, April 1, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcrunch.com/2026/03/31/anthropic-is-having-a-month/" rel="noopener noreferrer"&gt;Anthropic is having a month&lt;/a&gt; - TechCrunch, March 31, 2026&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ultraworkers/claw-code" rel="noopener noreferrer"&gt;Claw Code Repository&lt;/a&gt; - GitHub&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
    </item>
    <item>
      <title>I found my database password in a Claude Code session file</title>
      <dc:creator>Liran Baba</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:00:00 +0000</pubDate>
      <link>https://forem.com/liran_baba/i-found-my-database-password-in-a-claude-code-session-file-2fe8</link>
      <guid>https://forem.com/liran_baba/i-found-my-database-password-in-a-claude-code-session-file-2fe8</guid>
      <description>&lt;p&gt;I use Claude Code for most of my programming work, and I have very little idea what it's actually doing under the hood.&lt;/p&gt;

&lt;p&gt;A few months ago I was poking around &lt;code&gt;~/.claude/projects/&lt;/code&gt; and opened a session JSONL file. Buried in the conversation, Claude Code had read a &lt;code&gt;.env&lt;/code&gt; file and echoed its contents back as a tool result. My database password, sitting in plaintext, in a file I never look at.&lt;/p&gt;

&lt;p&gt;That was the afternoon I stopped what I was working on and started building Claudoscope.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem isn't Claude Code. It's visibility.
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't have a cost breakdown per session. The Enterprise API doesn't surface spend data at all; only the admin dashboard does, and it's not granular enough. When we rolled it out across the org, nobody could answer basic questions: which sessions are expensive? Is the agent stuck in a loop somewhere? Is our CLAUDE.md actually doing anything useful or just eating context window?&lt;/p&gt;

&lt;p&gt;And the security angle was worse. Session files contain the full conversation, including anything the agent reads from disk. If it touches a file with credentials, those credentials now live in an unencrypted JSONL file indefinitely. Nobody was checking for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhspjp6q2j12sjtbycae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdhspjp6q2j12sjtbycae.png" alt="Claudoscope menu bar widget" width="542" height="1064"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I built a flashlight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claudoscope&lt;/strong&gt; is a native macOS menu bar app. It watches your Claude Code session files locally, parses them, and gives you a dashboard. Nothing leaves your machine.&lt;/p&gt;

&lt;p&gt;The menu bar widget gives you a glance: today's sessions, tokens, cost, and any sessions that are currently running with a live cost number next to them. Click through to the full dashboard when you want the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Why did Tuesday cost $47?"
&lt;/h3&gt;

&lt;p&gt;That was the question I kept asking and couldn't answer. The analytics view breaks it down: cost by project, cost by model, daily trends. The cache tab shows whether your prompt cache is stable or busting on every request (cache busting is expensive and invisible without tracking). There's a what-if calculator that shows what your bill would look like if you moved Opus sessions to Sonnet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteqwnxt8mhsh3dc1oyut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fteqwnxt8mhsh3dc1oyut.png" alt="Claudoscope analytics dashboard" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  "Is my CLAUDE.md any good?"
&lt;/h3&gt;

&lt;p&gt;I didn't plan on building a config linter. It started as a quick check for obvious problems in my own setup. Then I ran it on a colleague's CLAUDE.md and found it was over 4,000 tokens, roughly 2% of a 200k-token context window eaten by instructions before the agent even started working. So I made it a rule.&lt;/p&gt;
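&lt;p&gt;The token budget check doesn't need anything fancy. Here's a crude version of the idea; the ~4-characters-per-token ratio is a rule of thumb for English text, not the linter's actual estimator or a real tokenizer.&lt;/p&gt;

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose.

    A rule-of-thumb heuristic, good for order-of-magnitude checks
    on config size, not an exact tokenizer.
    """
    return max(1, len(text) // 4)


# Every message in a session pays for these instructions again.
claude_md = "# Project rules\n" + "- Run the linter before committing.\n" * 300
print(f"~{estimate_tokens(claude_md)} tokens loaded per message")
```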

&lt;p&gt;The linter now has 19 rules. It checks CLAUDE.md structure, skill metadata, deprecated commands, token budget estimates. It groups findings by rule rather than by file, so you see patterns. One rule (subprocess env scrub) has a one-click auto-fix.&lt;/p&gt;

&lt;p&gt;The first time I ran it on our team's configs, it flagged raw XML brackets in a skill's frontmatter that would break the system prompt parser. Nobody had noticed because the failure was silent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hxx616iula51fb2lpbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hxx616iula51fb2lpbm.png" alt="Claudoscope health linter" width="800" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret scanning
&lt;/h3&gt;

&lt;p&gt;This is probably the most useful feature and also the hardest one to get people excited about. Did the agent just leak your credentials? You'd never know unless something was watching.&lt;/p&gt;

&lt;p&gt;Claudoscope scans session files for leaked credentials: private keys, AWS access keys, auth headers, API tokens, passwords in connection strings. It uses regex matching, Shannon entropy analysis, and allowlists for placeholder values. The entropy check matters because without it you get a wall of false positives from example code and docs.&lt;/p&gt;
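&lt;p&gt;A minimal sketch of how regex-plus-entropy scanning works. This is not Claudoscope's actual rule set; the two patterns, the allowlist, and the 3.5-bit threshold are illustrative assumptions.&lt;/p&gt;

```python
import math
import re

# Illustrative patterns only: real scanners ship many more rules
# (these are not Claudoscope's actual patterns).
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "connection_string_password": re.compile(r"://[^:/\s]+:([^@\s]+)@"),
}

# Placeholder values that should never fire an alert.
ALLOWLIST = {"changeme", "password", "example", "hunter2"}

def shannon_entropy(s: str) -> float:
    """Bits per character; random tokens score high, dictionary words low."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((s.count(c) / n) * math.log2(s.count(c) / n) for c in set(s))

def scan(text: str, threshold: float = 3.5):
    """Yield (rule_name, candidate) pairs that look like real secrets."""
    for rule, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            candidate = m.group(1) if m.groups() else m.group(0)
            if candidate.lower() in ALLOWLIST:
                continue  # known placeholder from docs or examples
            if shannon_entropy(candidate) >= threshold:
                yield rule, candidate
```

&lt;p&gt;The entropy gate is what kills the false positives: &lt;code&gt;password&lt;/code&gt; in an example connection string scores low and gets skipped, while a real random credential scores high.&lt;/p&gt;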

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2For6c3105nkk61j2h1q72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2For6c3105nkk61j2h1q72.png" alt="Claudoscope realtime secret scanning" width="720" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it finds something, a panel pops up on screen. Doesn't matter if the dashboard is open. It watches the tail of active session files and alerts you immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I learned from my own data
&lt;/h2&gt;

&lt;p&gt;Building this meant spending a lot of time inside Claude Code's JSONL format. A few things I didn't expect:&lt;/p&gt;

&lt;p&gt;Prompt cache reads are cheap ($0.30/MTok on Sonnet vs $3.00 uncached), so I assumed most of my input was cached. On some projects, 30-40% wasn't. The cache busts when session context shifts after compaction, and before I had a hit rate chart staring me in the face, I had no idea.&lt;/p&gt;

&lt;p&gt;I also figured my expensive sessions would be the big multi-hour ones. They weren't. The cost was in dozens of short sessions where Claude Code loaded context, did one thing, and exited. Each one paid full input with no cache. Fifty quick questions cost me more than the three-hour refactor.&lt;/p&gt;
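&lt;p&gt;The arithmetic behind that surprise, using the Sonnet prices above. Back-of-envelope only: it ignores cache-write surcharges and output tokens, and the 30k-token context figure is an assumption for illustration.&lt;/p&gt;

```python
# Sonnet input prices from above: $3.00/MTok uncached, $0.30/MTok cache read.
UNCACHED = 3.00 / 1_000_000    # dollars per uncached input token
CACHE_READ = 0.30 / 1_000_000  # dollars per cached input token

CONTEXT = 30_000  # assumed tokens of project context loaded per session

# Fifty quick questions: each session loads the context cold, once.
short_sessions = 50 * CONTEXT * UNCACHED

# One long refactor: context paid cold once, then 100 turns
# re-read it from cache.
long_session = CONTEXT * UNCACHED + 100 * CONTEXT * CACHE_READ

print(f"50 short sessions: ${short_sessions:.2f}")  # $4.50
print(f"1 long session:    ${long_session:.2f}")    # $0.99
```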

&lt;p&gt;Most CLAUDE.md files across our team were 2,000-5,000 tokens. That's context window you pay for on every message. A few people trimmed theirs after seeing the linter's token estimate.&lt;/p&gt;
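&lt;p&gt;If you want a quick ballpark yourself, the usual chars-divided-by-four heuristic is close enough for a linter-style estimate. This is a rule of thumb for English prose, not how any real tokenizer works:&lt;/p&gt;

```python
def estimate_tokens(text):
    """Rough rule of thumb: about 4 characters per token for English prose."""
    return max(1, len(text) // 4)

# A 12,000-character CLAUDE.md is roughly 3,000 tokens of overhead,
# paid again on every message in every session.
overhead = estimate_tokens("x" * 12_000)
```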

&lt;p&gt;And one gotcha for anyone parsing these files themselves: the JSONL contains intermediate records with a null &lt;code&gt;stop_reason&lt;/code&gt; (in-progress streaming responses). Sum all records naively and you double-count tokens. I shipped this bug and didn't catch it until my cost estimates came out 1.5-2x the actual Vertex bill. As far as I can tell, this isn't documented anywhere.&lt;/p&gt;
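&lt;p&gt;The fix is a one-line filter. The field names here (&lt;code&gt;stop_reason&lt;/code&gt;, &lt;code&gt;usage.output_tokens&lt;/code&gt;) are illustrative; verify them against your own session files before relying on them:&lt;/p&gt;

```python
import json

def sum_output_tokens(jsonl_text):
    """Sum output tokens, skipping in-progress streaming records."""
    total = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("stop_reason") is None:
            continue  # partial streaming record; counting it double-counts
        total += rec.get("usage", {}).get("output_tokens", 0)
    return total

sample = "\n".join([
    '{"stop_reason": null, "usage": {"output_tokens": 120}}',
    '{"stop_reason": "end_turn", "usage": {"output_tokens": 120}}',
])
# Without the null check this would report 240; the real answer is 120.
```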

&lt;h2&gt;
  
  
  Under the hood
&lt;/h2&gt;

&lt;p&gt;It watches &lt;code&gt;~/.claude/projects/&lt;/code&gt; with macOS FSEvents (not polling). Session parsing runs on a Swift actor for thread safety. Cost estimation runs per-message, not per-session, because different messages in the same session can use different models. There's an LRU cache (20 sessions) so navigating between recent sessions feels instant.&lt;/p&gt;
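&lt;p&gt;The app's cache is written in Swift, but the LRU idea is the same in any language. A minimal Python sketch of the same policy with the 20-session capacity:&lt;/p&gt;

```python
from collections import OrderedDict

class SessionCache:
    """Keep the N most recently viewed parsed sessions, evicting LRU."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, session_id):
        if session_id not in self._store:
            return None
        self._store.move_to_end(session_id)  # mark as most recently used
        return self._store[session_id]

    def put(self, session_id, parsed):
        if session_id in self._store:
            self._store.move_to_end(session_id)
        elif len(self._store) == self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        self._store[session_id] = parsed
```

&lt;p&gt;Reads bump a session to the back of the queue, so the sessions you're flipping between stay parsed and navigation feels instant.&lt;/p&gt;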

&lt;p&gt;I built it in SwiftUI, macOS 14+, Apple Silicon only. I wanted it to feel like a Mac app. That means no Linux or Windows, and I'm fine with that tradeoff.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;

&lt;p&gt;Free, open source, macOS only (Apple Silicon). Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew tap cordwainersmith/claudoscope
brew &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--cask&lt;/span&gt; claudoscope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or grab the DMG from &lt;a href="https://github.com/cordwainersmith/Claudoscope" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. It auto-updates. The cost estimation is most useful on Enterprise plans where per-session data isn't available, but session analytics and config linting work regardless of your plan.&lt;/p&gt;

&lt;p&gt;Go check your session files. You might not like what you find.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>claudecode</category>
      <category>security</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
