<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Harrison Guo</title>
    <description>The latest articles on Forem by Harrison Guo (@harrison_guo_e01b4c8793a0).</description>
    <link>https://forem.com/harrison_guo_e01b4c8793a0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809272%2Ff7da2c77-d1e2-4b04-8cf4-11c5f274f605.png</url>
      <title>Forem: Harrison Guo</title>
      <link>https://forem.com/harrison_guo_e01b4c8793a0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/harrison_guo_e01b4c8793a0"/>
    <language>en</language>
    <item>
      <title>Testing Real-World Go Backends Isn't What Many People Think</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 00:18:33 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/testing-real-world-go-backends-isnt-what-many-people-think-12nl</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/testing-real-world-go-backends-isnt-what-many-people-think-12nl</guid>
      <description>&lt;p&gt;I've reviewed enough Go backend test suites to notice a pattern. The services with the most unit tests are often the ones with the most production incidents. Not because unit tests cause incidents — because the teams writing unit tests and calling it a day weren't testing the things that actually broke.&lt;/p&gt;

&lt;p&gt;Production bugs in distributed Go backends don't usually look like "function computed wrong value." They look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The context deadline didn't propagate into the background goroutine, so under load it leaked."&lt;/li&gt;
&lt;li&gt;"Two services agreed on the happy path, but the error-shape contract diverged six months ago, and now one returns &lt;code&gt;status.Code(codes.Unavailable)&lt;/code&gt; where the other expects &lt;code&gt;codes.ResourceExhausted&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"The retry logic is race-y. With test-scale traffic it works; at 10x production it double-charges."&lt;/li&gt;
&lt;li&gt;"The database migration works on SQLite (our test DB) but not Postgres 15's stricter planner."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No unit test catches those. A different set of test shapes does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Stop framing tests as "unit vs integration." That's a level-of-isolation axis, and it's the least interesting one. The axes that matter for production Go: deterministic behavior (controlled clocks, seeded randomness), concurrency correctness (race detector, stress tests), contract fidelity (shared schemas, real downstreams), and environment fidelity (real DBs, real networks). Design your test suite around those; coverage follows.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Taxonomy
&lt;/h2&gt;

&lt;p&gt;"Unit tests test one function. Integration tests test several. E2E tests test the whole system."&lt;/p&gt;

&lt;p&gt;That framing is a starting point for junior engineers. It stops being useful the moment you're debugging why your Go service silently dropped a message in production. The level of isolation isn't the interesting axis. What is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic vs non-deterministic behavior.&lt;/strong&gt; Do the same inputs produce the same outputs every time?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency correctness.&lt;/strong&gt; When goroutines interleave differently, does the code still behave correctly, and will your tests notice if it stops?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract fidelity.&lt;/strong&gt; Do your assumptions about downstreams match what they actually do?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment fidelity.&lt;/strong&gt; Does your test environment reproduce the production runtime closely enough to catch real bugs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A test can be "unit" on the isolation axis and still score well on two or three of these. A test can be "integration" and miss all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic Behavior: The One Thing Every Test Should Have
&lt;/h2&gt;

&lt;p&gt;If you can't run your test a thousand times and get the same result, you have a flaky test, and flaky tests are worse than no tests — they train the team to ignore failures.&lt;/p&gt;

&lt;p&gt;The three sources of non-determinism in Go test suites, in order of prevalence:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Time
&lt;/h3&gt;

&lt;p&gt;Any test that calls &lt;code&gt;time.Now()&lt;/code&gt;, &lt;code&gt;time.After()&lt;/code&gt;, &lt;code&gt;time.Sleep()&lt;/code&gt;, or depends on wall-clock intervals is a landmine. It works on the developer's laptop and fails in a slow CI runner where GC decided to kick in.&lt;/p&gt;

&lt;p&gt;Fix: inject a clock. A minimal clock interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Clock&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
    &lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;realClock&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, &lt;code&gt;realClock&lt;/code&gt;. In tests, a &lt;code&gt;FakeClock&lt;/code&gt; that advances manually. Libraries like &lt;code&gt;github.com/benbjohnson/clock&lt;/code&gt; give you this for free.&lt;/p&gt;

&lt;p&gt;Payoff: a test that verifies "retries happen every 500ms for 3 attempts" becomes deterministic — advance the fake clock 500ms, observe a retry, advance another 500ms, observe again. No sleeping in the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Randomness
&lt;/h3&gt;

&lt;p&gt;Anything that shuffles, samples, picks a random ID, or generates random test data needs a seeded random source. The default &lt;code&gt;math/rand&lt;/code&gt; source behind &lt;code&gt;rand.Intn&lt;/code&gt; is process-global shared state; two tests running in parallel draw from the same sequence and perturb each other's results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In tests, pass a known seed. In production, &lt;code&gt;rand.NewSource(time.Now().UnixNano())&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Concurrency ordering
&lt;/h3&gt;

&lt;p&gt;The nasty one. A test that creates goroutines and checks a result has to either (a) synchronize on a deterministic completion signal (a channel, a &lt;code&gt;WaitGroup&lt;/code&gt;) or (b) poll with a timeout — which is back to non-determinism.&lt;/p&gt;

&lt;p&gt;The best habit: design for deterministic completion. If you're testing "five goroutines should all complete and total the result," use &lt;code&gt;sync.WaitGroup.Wait()&lt;/code&gt; or close a channel. Don't sleep. Don't poll.&lt;/p&gt;
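&lt;p&gt;A minimal sketch of that habit (hypothetical worker function, not from any particular codebase): completion is signalled by &lt;code&gt;WaitGroup.Wait&lt;/code&gt;, never by a sleep:&lt;/p&gt;

```go
package main

import "sync"

// sumConcurrently fans work out to one goroutine per input and waits
// for all of them on a WaitGroup. Completion is a deterministic
// signal, so a test asserting on the total never needs to sleep or poll.
func sumConcurrently(inputs []int) int {
	var wg sync.WaitGroup
	results := make(chan int, len(inputs)) // buffered: sends never block

	for _, v := range inputs {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			results <- v * v // stand-in for real work
		}(v)
	}

	wg.Wait()      // returns exactly when every worker has finished
	close(results) // safe: no sends can happen after Wait returns

	total := 0
	for r := range results {
		total += r
	}
	return total
}
```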

&lt;h2&gt;
  
  
  Concurrency Correctness: The Race Detector Is Not Optional
&lt;/h2&gt;

&lt;p&gt;Go ships with a race detector. Running &lt;code&gt;go test -race&lt;/code&gt; is one flag and it catches an entire category of bugs that will otherwise show up as "works on my machine." In my experience, any production Go service will, on first &lt;code&gt;-race&lt;/code&gt; run, surface at least one real data race that had been silently ignored.&lt;/p&gt;

&lt;p&gt;The race detector adds roughly 2-20x runtime and 5-10x memory overhead, so people skip it on every-save tests. Fine. Run it in CI. Run it on nightly integration tests. Run it on anything touching shared state. Some configurations I've seen work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every PR&lt;/strong&gt;: run unit tests with &lt;code&gt;-race&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nightly&lt;/strong&gt;: run full integration suite with &lt;code&gt;-race&lt;/code&gt; and a longer timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-release&lt;/strong&gt;: run stress tests with &lt;code&gt;-race&lt;/code&gt; against a production-sized dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost of running with &lt;code&gt;-race&lt;/code&gt; is engineering discipline. The payoff is not debugging a data race at 2 AM.&lt;/p&gt;

&lt;p&gt;Beyond the race detector, &lt;strong&gt;stress tests&lt;/strong&gt; are undervalued. A test that runs your concurrent path 1,000 times with different goroutine interleavings catches bugs that a single-iteration test never will.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestConcurrentWorkers_Stress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Short&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stress test"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"iter%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c"&gt;// ... actual test body ...&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;t.Parallel()&lt;/code&gt; + 1,000 iterations + &lt;code&gt;-race&lt;/code&gt; finds race conditions that a single deterministic run happily misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contract Fidelity: The Bug Class Everyone Misses
&lt;/h2&gt;

&lt;p&gt;Say your service calls a downstream gRPC service for payments. You write a mock that returns a successful response. Your tests pass. The downstream team changes their error code vocabulary. Your service now misinterprets their new error. Production finds out first.&lt;/p&gt;

&lt;p&gt;Contract testing addresses this. Two approaches work in practice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared schema, shared types
&lt;/h3&gt;

&lt;p&gt;If the downstream service publishes a protobuf file (they should), your service imports it directly. Your tests use types generated from the real contract. If the downstream changes the proto in a breaking way, your next build fails — loudly, at compile time.&lt;/p&gt;

&lt;p&gt;This is the simplest and often best answer for Go services with gRPC downstreams. The contract is literally the shared protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer-driven contract tests
&lt;/h3&gt;

&lt;p&gt;Each consumer writes tests that capture its expectations of the downstream. Those tests run against the real downstream (or a contract-testing tool like Pact). When the downstream changes, the contract tests catch the break before the written contract and production reality diverge.&lt;/p&gt;

&lt;p&gt;This helps for REST APIs where there's no single source of truth schema. It's more ceremony. For most gRPC Go services, shared protobufs cover it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "mock everything" antipattern
&lt;/h3&gt;

&lt;p&gt;If your test suite consists of mocks that return whatever your test needs, you're not testing integration. You're testing that your code calls your mocks correctly. That's a tautology. Real integration bugs live in the gap between your mock's behavior and the downstream's actual behavior.&lt;/p&gt;

&lt;p&gt;Have at least one test per integration point that hits the real downstream — either in a staging environment or via Testcontainers. Keep the mocks for fast feedback, but don't pretend they're the only tests you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Fidelity: Use Real Infra Where It Matters
&lt;/h2&gt;

&lt;p&gt;The sharpest line in my test taxonomy is between "close to production runtime" and "not close."&lt;/p&gt;

&lt;p&gt;Things that matter and are worth running on real infrastructure in tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databases.&lt;/strong&gt; SQLite is not Postgres is not MySQL. Query planner, isolation levels, and error shapes differ. Test with the DB you ship with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message brokers.&lt;/strong&gt; Kafka's ordering and offset semantics cannot be faked well. Use a real Kafka (or Redpanda) in tests that exercise ordering or replay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caches.&lt;/strong&gt; Redis has specific failover and eviction semantics. A fake in-memory map doesn't reproduce them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-sensitive downstream APIs.&lt;/strong&gt; Anything with rate limits or TTLs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that rarely matter and are fine with fakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object storage.&lt;/strong&gt; A local file-system backend usually reproduces S3 well enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics / tracing exporters.&lt;/strong&gt; Tests don't need a real Prometheus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email / SMS.&lt;/strong&gt; A mock recording calls is plenty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: &lt;strong&gt;test with real infra for anything where semantic difference is possible&lt;/strong&gt;. Testcontainers (&lt;code&gt;github.com/testcontainers/testcontainers-go&lt;/code&gt;) makes this painless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;setupPostgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;testcontainers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"postgres:15-alpine"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testdb"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithUsername&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testuser"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testpass"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Terminate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;dsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sslmode=disable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dsn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slow? Yes — each container takes a few seconds to start. But you can run them once per test package with a &lt;code&gt;TestMain&lt;/code&gt;, and the bugs they catch are the ones most worth catching.&lt;/p&gt;
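&lt;p&gt;The once-per-package pattern can be sketched with &lt;code&gt;sync.Once&lt;/code&gt; (a simplified stand-in: the container start is stubbed with a placeholder DSN, and in a real suite the body would call &lt;code&gt;setupPostgres&lt;/code&gt;-style code from &lt;code&gt;TestMain&lt;/code&gt; or a shared helper):&lt;/p&gt;

```go
package main

import "sync"

var (
	setupOnce  sync.Once
	setupCount int // demo bookkeeping; a real suite wouldn't need it
	sharedDSN  string
)

// dsnForTests hands every test in the package the same DSN, running the
// expensive container start exactly once per process. The stubbed body
// stands in for starting a Testcontainers Postgres and capturing its
// connection string.
func dsnForTests() string {
	setupOnce.Do(func() {
		setupCount++
		// Placeholder DSN; real code would come from the container.
		sharedDSN = "postgres://testuser:testpass@localhost:5432/testdb?sslmode=disable"
	})
	return sharedDSN
}
```

&lt;p&gt;Every test calls &lt;code&gt;dsnForTests()&lt;/code&gt;; only the first call pays the startup cost.&lt;/p&gt;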

&lt;h2&gt;
  
  
  A Real Taxonomy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBGYXN0WyJSdW4gb24gZXZlcnkgc2F2ZSJdCiAgICAgICAgVDFbIkZhc3QgdGVzdHM8YnIvPnB1cmUgZnVuY3Rpb25zIMK3IGFsZ29yaXRobXMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUFJbIlJ1biBvbiBldmVyeSBQUiJdCiAgICAgICAgVDJbIkNvbmN1cnJlbmN5IHRlc3RzPGJyLz4tcmFjZSDCtyBzdHJlc3MiXQogICAgICAgIFQzWyJEZXRlcm1pbmlzdGljIGludGVncmF0aW9uPGJyLz5mYWtlIGNsb2NrIMK3IGZha2UgZG93bnN0cmVhbSJdCiAgICAgICAgVDRbIlJlYWwtaW5mcmEgaW50ZWdyYXRpb248YnIvPlRlc3Rjb250YWluZXJzIFBvc3RncmVzIC8gUmVkaXMgLyBLYWZrYSJdCiAgICAgICAgVDVbIkNvbnRyYWN0IHRlc3RzPGJyLz5zaGFyZWQgc2NoZW1hcyDCtyBwcm90byB2ZXJzaW9ucyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBOaWdodGx5WyJSdW4gb24gc2NoZWR1bGUiXQogICAgICAgIFQ2WyJTdHJlc3MgdGVzdHM8YnIvPjEwMDAtaXRlciAtcmFjZSJdCiAgICAgICAgVDdbIkVuZC10by1lbmQ8YnIvPnJlYWwgc2VydmljZXMgwrcgc3RhZ2luZyJdCiAgICBlbmQKCiAgICBGYXN0IC0tPiBQUiAtLT4gTmlnaHRseQoKICAgIGNsYXNzRGVmIGZhc3QgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzc0RlZiBwciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIG5pZ2h0bHkgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzcyBGYXN0IGZhc3QKICAgIGNsYXNzIFBSIHByCiAgICBjbGFzcyBOaWdodGx5IG5pZ2h0bHk%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBGYXN0WyJSdW4gb24gZXZlcnkgc2F2ZSJdCiAgICAgICAgVDFbIkZhc3QgdGVzdHM8YnIvPnB1cmUgZnVuY3Rpb25zIMK3IGFsZ29yaXRobXMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUFJbIlJ1biBvbiBldmVyeSBQUiJdCiAgICAgICAgVDJbIkNvbmN1cnJlbmN5IHRlc3RzPGJyLz4tcmFjZSDCtyBzdHJlc3MiXQogICAgICAgIFQzWyJEZXRlcm1pbmlzdGljIGludGVncmF0aW9uPGJyLz5mYWtlIGNsb2NrIMK3IGZha2UgZG93bnN0cmVhbSJdCiAgICAgICAgVDRbIlJlYWwtaW5mcmEgaW50ZWdyYXRpb248YnIvPlRlc3Rjb250YWluZXJzIFBvc3RncmVzIC8gUmVkaXMgLyBLYWZrYSJdCiAgICAgICAgVDVbIkNvbnRyYWN0IHRlc3RzPGJyLz5zaGFyZWQgc2NoZW1hcyDCtyBwcm90byB2ZXJzaW9ucyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBOaWdodGx5WyJSdW4gb24gc2NoZWR1bGUiXQogICAgICAgIFQ2WyJTdHJlc3MgdGVzdHM8YnIvPjEwMDAtaXRlciAtcmFjZSJdCiAgICAgICAgVDdbIkVuZC10by1lbmQ8YnIvPnJlYWwgc2VydmljZXMgwrcgc3RhZ2luZyJdCiAgICBlbmQKCiAgICBGYXN0IC0tPiBQUiAtLT4gTmlnaHRseQoKICAgIGNsYXNzRGVmIGZhc3QgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzc0RlZiBwciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIG5pZ2h0bHkgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzcyBGYXN0IGZhc3QKICAgIGNsYXNzIFBSIHByCiAgICBjbGFzcyBOaWdodGx5IG5pZ2h0bHk%3D" alt="Flowchart of the test pipeline: fast tests run on every save; concurrency, deterministic integration, real-infra integration, and contract tests run on every PR; stress and end-to-end tests run on a schedule" width="1904" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the taxonomy I actually use when designing a test suite for a Go backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast tests&lt;/strong&gt; (seconds for the whole file): pure functions, algorithms, small state machines. Run on every save.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency tests&lt;/strong&gt; (seconds to a minute): anything with goroutines. Run with &lt;code&gt;-race&lt;/code&gt;. Run in PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic integration tests&lt;/strong&gt; (single-digit seconds per test): one module + fakes + fake clock. Fast enough to keep in the main test run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-infra integration tests&lt;/strong&gt; (seconds per test): one module + real DB / Kafka / Redis via Testcontainers. Run in PR, longer timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract tests&lt;/strong&gt; (milliseconds): verify shared schemas with downstreams. Run on every schema change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stress tests&lt;/strong&gt; (minutes): high-iteration, high-concurrency, with &lt;code&gt;-race&lt;/code&gt;. Run nightly or on schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end tests&lt;/strong&gt; (minutes): real services, real network, against a staging environment. Run pre-release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you'll notice: "unit" and "integration" don't appear as categories. That's on purpose. The level of isolation is an implementation detail. The purpose of the test is the taxonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Habits That Pay Off
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;t.Cleanup&lt;/code&gt; over &lt;code&gt;defer&lt;/code&gt;.&lt;/strong&gt; Cleanups run in LIFO order, can be registered from helpers anywhere in the test, and still run after parallel subtests finish, which a &lt;code&gt;defer&lt;/code&gt; in the parent function does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer table-driven tests.&lt;/strong&gt; Twenty tests as rows in a slice beats twenty nearly-identical test functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail tests with &lt;code&gt;t.Fatalf&lt;/code&gt;, not &lt;code&gt;t.Errorf&lt;/code&gt;, for setup failures.&lt;/strong&gt; A broken setup should abort; a broken assertion might allow the test to continue collecting more failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden files for complex outputs.&lt;/strong&gt; If you're verifying a generated SQL query, a serialized event, or a JSON response, a golden file comparison is more readable than a long string literal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate &lt;code&gt;_test.go&lt;/code&gt; files for slow tests with a build tag.&lt;/strong&gt; &lt;code&gt;//go:build integration&lt;/code&gt; lets you run them explicitly.&lt;/li&gt;
&lt;/ul&gt;
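&lt;p&gt;A golden-file comparison needs only a few lines of helper. This sketch (a hypothetical &lt;code&gt;checkGolden&lt;/code&gt;, including the conventional update mode you'd wire to an &lt;code&gt;-update&lt;/code&gt; flag) shows the shape:&lt;/p&gt;

```go
package main

import (
	"bytes"
	"os"
	"path/filepath"
)

// checkGolden compares got against testdata/<name>.golden. When update
// is true it rewrites the golden file instead (a real suite would wire
// this to a -update flag), so regenerating expectations is one command.
func checkGolden(name string, got []byte, update bool) (bool, error) {
	path := filepath.Join("testdata", name+".golden")
	if update {
		if err := os.MkdirAll("testdata", 0o755); err != nil {
			return false, err
		}
		return true, os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	return bytes.Equal(got, want), nil
}
```

&lt;p&gt;In a test: serialize the output, call the helper, and &lt;code&gt;t.Fatalf&lt;/code&gt; on a mismatch; the diff against the golden file is far easier to read than a failed long string literal.&lt;/p&gt;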

&lt;h2&gt;
  
  
  The Shift That Changed My Testing
&lt;/h2&gt;

&lt;p&gt;Coverage numbers lie. The question is not "what percent of lines are executed by tests" — it's "what percent of the risky behaviors are covered by tests that will actually fail when those behaviors break."&lt;/p&gt;

&lt;p&gt;A codebase with 95% line coverage and zero race tests, zero real-DB tests, and mock-heavy integration tests is brittle. A codebase with 60% line coverage, &lt;code&gt;go test -race&lt;/code&gt; in CI, Testcontainers for the DB, and a stress test for every hot concurrent path is not.&lt;/p&gt;

&lt;p&gt;The single biggest shift I recommend: &lt;strong&gt;stop thinking about tests in terms of isolation level, and start thinking about them in terms of the production failure modes you're actually afraid of&lt;/strong&gt;. Map each failure mode to a test shape. If you don't have a test shape for a failure mode, you don't really have that failure mode covered — you just hope it doesn't happen.&lt;/p&gt;

&lt;p&gt;Production has opinions about what you hope.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — the concurrency patterns that make production-shape Go possible.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-context-distributed-systems-production/" rel="noopener noreferrer"&gt;Go Context in Distributed Systems: What Actually Works in Production&lt;/a&gt; — the single most common test gap in Go services I review.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;Why Your "Fail-Fast" Strategy is Killing Your Distributed System&lt;/a&gt; — a production failure mode that's hard to test unless you design the test for it.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>testing</category>
      <category>backendengineering</category>
    </item>
    <item>
      <title>Scale-Up vs Scale-Out: Why Every Language Wins Somewhere</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 00:18:32 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/scale-up-vs-scale-out-why-every-language-wins-somewhere-3k6l</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/scale-up-vs-scale-out-why-every-language-wins-somewhere-3k6l</guid>
      <description>&lt;p&gt;I worked with a team that rewrote a critical service from Go to Rust because "performance." Six months later, the service was 30% faster, the team was miserable, and feature velocity had dropped to a crawl. Meanwhile the competitor team, still on Go, had shipped four new features.&lt;/p&gt;

&lt;p&gt;We did the postmortem eventually. The service handled maybe 2,000 requests per second on a 4-core machine. CPU utilization sat around 20%. Rust's extra speed bought us exactly nothing — the bottleneck was downstream database latency. What it cost us was every feature we didn't ship while writing &lt;code&gt;unsafe&lt;/code&gt; blocks, fighting the borrow checker, and nursing the team through the learning curve.&lt;/p&gt;

&lt;p&gt;That incident taught me the question I wish I'd learned earlier: &lt;strong&gt;what are you actually scaling, and does the language buy you the right kind of scale?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Language benchmarks optimize for one axis: per-request performance. Real systems have multiple axes — throughput, latency, concurrency, developer velocity, operational complexity, memory efficiency. Rust, Go, Java, Python aren't competing to be "fastest." They're different answers to different bets about what you're going to scale. Pick by fit, not by leaderboard.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Two Kinds of Scale
&lt;/h2&gt;

&lt;p&gt;At the top level, two strategies dominate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale-up&lt;/strong&gt;: make one machine do more. Vertical scaling. Faster CPUs, more RAM, specialized hardware, lower per-operation cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-out&lt;/strong&gt;: add more machines. Horizontal scaling. Cheaper commodity hardware, more concurrency, lots of work running in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't just infrastructure decisions. They're reflected in the language and ecosystem you pick. A language optimized for scale-up (Rust, C++) has different priorities than one optimized for scale-out (Go, Elixir) or one optimized for neither but for developer leverage (Python, Ruby).&lt;/p&gt;

&lt;p&gt;The big confusion comes from mixing axes. "Rust is faster than Go" is true on per-op microbenchmarks and irrelevant if your workload is I/O-bound service-to-service traffic. "Python is slow" is true in a compute-bound loop and irrelevant for a 500-QPS API that spends 95% of its time waiting on PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Language Actually Wins
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcXVhZHJhbnRDaGFydAogICAgdGl0bGUgTGFuZ3VhZ2UgZml0IGJ5IHdoYXQgeW91J3JlIHNjYWxpbmcKICAgIHgtYXhpcyBTY2FsZS1vdXQgKG1hbnkgbWFjaGluZXMgLyBjaGVhcCBjb25jdXJyZW5jeSkgLS0-IFNjYWxlLXVwIChvbmUgbWFjaGluZSwgcHVzaGVkIGhhcmQpCiAgICB5LWF4aXMgUHJvdG90eXBlIHZlbG9jaXR5IC0tPiBQcm9kdWN0aW9uIHJpZ29yCiAgICBxdWFkcmFudC0xICJTY2FsZS11cCArIHJpZ29yPGJyLz4oUnVzdCDCtyBDKysgwrcgWmlnKSIKICAgIHF1YWRyYW50LTIgIlNjYWxlLW91dCArIHJpZ29yPGJyLz4oR28gwrcgSmF2YS9Lb3RsaW4pIgogICAgcXVhZHJhbnQtMyAiU2NhbGUtb3V0ICsgdmVsb2NpdHk8YnIvPihQeXRob24gwrcgUnVieSDCtyBOb2RlKSIKICAgIHF1YWRyYW50LTQgIlNjYWxlLXVwICsgdmVsb2NpdHk8YnIvPihuYXJyb3cgbmljaGUpIgogICAgUnVzdDogWzAuODUsIDAuODVdCiAgICAiQysrIjogWzAuOTIsIDAuODhdCiAgICBHbzogWzAuMjUsIDAuNzVdCiAgICAiSmF2YS9Lb3RsaW4iOiBbMC4zMCwgMC44MF0KICAgIFB5dGhvbjogWzAuMjUsIDAuMjVdCiAgICBSdWJ5OiBbMC4yNSwgMC4zMF0KICAgIE5vZGU6IFswLjMwLCAwLjM1XQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcXVhZHJhbnRDaGFydAogICAgdGl0bGUgTGFuZ3VhZ2UgZml0IGJ5IHdoYXQgeW91J3JlIHNjYWxpbmcKICAgIHgtYXhpcyBTY2FsZS1vdXQgKG1hbnkgbWFjaGluZXMgLyBjaGVhcCBjb25jdXJyZW5jeSkgLS0-IFNjYWxlLXVwIChvbmUgbWFjaGluZSwgcHVzaGVkIGhhcmQpCiAgICB5LWF4aXMgUHJvdG90eXBlIHZlbG9jaXR5IC0tPiBQcm9kdWN0aW9uIHJpZ29yCiAgICBxdWFkcmFudC0xICJTY2FsZS11cCArIHJpZ29yPGJyLz4oUnVzdCDCtyBDKysgwrcgWmlnKSIKICAgIHF1YWRyYW50LTIgIlNjYWxlLW91dCArIHJpZ29yPGJyLz4oR28gwrcgSmF2YS9Lb3RsaW4pIgogICAgcXVhZHJhbnQtMyAiU2NhbGUtb3V0ICsgdmVsb2NpdHk8YnIvPihQeXRob24gwrcgUnVieSDCtyBOb2RlKSIKICAgIHF1YWRyYW50LTQgIlNjYWxlLXVwICsgdmVsb2NpdHk8YnIvPihuYXJyb3cgbmljaGUpIgogICAgUnVzdDogWzAuODUsIDAuODVdCiAgICAiQysrIjogWzAuOTIsIDAuODhdCiAgICBHbzogWzAuMjUsIDAuNzVdCiAgICAiSmF2YS9Lb3RsaW4iOiBbMC4zMCwgMC44MF0KICAgIFB5dGhvbjogWzAuMjUsIDAuMjVdCiAgICBSdWJ5OiBbMC4yNSwgMC4zMF0KICAgIE5vZGU6IFswLjMwLCAwLjM1XQ%3D%3D" alt="title Language fit by what you're scaling" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rough positioning — not a benchmark, a fit map. The language you pick should live near the kind of scaling your system actually demands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust / C++ / Zig — Scale-up champions
&lt;/h3&gt;

&lt;p&gt;These languages dominate when &lt;strong&gt;per-machine throughput is the bottleneck&lt;/strong&gt; and you can afford the engineering cost. That's a narrower set of problems than Twitter would have you believe, but the problems that exist are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency trading engines — microseconds matter, GC pauses are unacceptable, every cache line counts.&lt;/li&gt;
&lt;li&gt;Inference engines — llama.cpp, vLLM, mistral.rs. Memory layout, SIMD, custom kernels.&lt;/li&gt;
&lt;li&gt;Databases and storage engines — ScyllaDB, TiKV, FoundationDB internals. State machines that live forever and must not leak.&lt;/li&gt;
&lt;li&gt;Network data planes — Cloudflare's Pingora, proxies at the edge.&lt;/li&gt;
&lt;li&gt;Game engines, audio/video encoding, embedded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: &lt;strong&gt;one box, pushed hard, for years&lt;/strong&gt;. Memory safety matters because bugs compound over time. Performance matters because throughput per core is the product.&lt;/p&gt;

&lt;p&gt;The cost: every commit is slower. Refactoring is expensive. Onboarding is measured in months, not weeks. The compile times are what they are. You pay this cost every day the service exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go — Scale-out champion
&lt;/h3&gt;

&lt;p&gt;Go hits a specific sweet spot: &lt;strong&gt;cheap concurrency, predictable performance, fast-to-ship code, and easy to hire for&lt;/strong&gt;. It's a scale-out language.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thousands of goroutines per core, 2KB stacks, user-space context switching. The "cost of one more waiter" is nearly zero.&lt;/li&gt;
&lt;li&gt;Standard library is enough for 80% of backend work — HTTP server, JSON, SQL, crypto.&lt;/li&gt;
&lt;li&gt;Compilation is fast enough to stay in flow. Iteration loop feels similar to a dynamic language.&lt;/li&gt;
&lt;li&gt;Minimalism is aggressive. One person can read the whole language in a weekend. New hires are productive in days.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it loses: per-op performance. Go's GC is fine but not invisible. Zero-copy generic code is harder to write than in Rust. The type system doesn't prevent the entire class of bugs Rust's does.&lt;/p&gt;

&lt;p&gt;Go's bet: the problem you're most likely to have is "I need to handle 10x the concurrent work with 2x the code." Not "I need this loop to be 5% faster." For most backend services, that bet is right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Java / Kotlin — Mature scale-out with runtime depth
&lt;/h3&gt;

&lt;p&gt;The JVM is what you want when the workload is scale-out but you need runtime flexibility Go doesn't give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A mature JIT that optimizes hot paths beyond what AOT can.&lt;/li&gt;
&lt;li&gt;Rich profiling and monitoring (Java Flight Recorder, async-profiler) that make post-deploy tuning feasible.&lt;/li&gt;
&lt;li&gt;An ecosystem that, after 25 years, has a mature library for basically anything.&lt;/li&gt;
&lt;li&gt;Kotlin on top gives you modern syntax and coroutines without leaving the ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it loses: startup time, memory overhead, operational complexity (GC tuning is a real job), the occasional "it works on my JDK 11 but the prod JDK 17 changed something." Also: hiring is harder than Go now, at least in my corner of the industry.&lt;/p&gt;

&lt;p&gt;Java's bet: "you'll still be running this service in ten years, and you want to be able to tune its runtime when that day comes." For large enterprises with deep infrastructure, that bet pays off. For a startup shipping its first three services, the overhead is not worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python / Ruby — Developer-velocity champions
&lt;/h3&gt;

&lt;p&gt;The forgotten-but-dominant answer: languages that optimize neither scale-up nor scale-out, but &lt;strong&gt;scale-the-team&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast to write, fast to read, fast to debug.&lt;/li&gt;
&lt;li&gt;Massive libraries for data, ML, scripting, DSLs.&lt;/li&gt;
&lt;li&gt;Easy to onboard anyone — CS students, data scientists, analysts.&lt;/li&gt;
&lt;li&gt;Prototype-to-production path is shorter than anywhere else.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where they lose: per-core throughput, concurrency (the GIL is real), memory. Python and Ruby are not your language for a 100K QPS service.&lt;/p&gt;

&lt;p&gt;But a lot of real companies don't need a 100K QPS service. They need to get a thing working, put it in front of users, and iterate. If your current problem is "we need to ship the next feature this week," Python might be the right answer even if a Rust version would technically run faster.&lt;/p&gt;

&lt;p&gt;Python's bet: throughput isn't the constraint yet. Time-to-shipped-feature is. For most companies most of the time, that's correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Axes Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Beyond scale-up/scale-out, a few axes decide more projects than raw performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer-velocity per week
&lt;/h3&gt;

&lt;p&gt;"I can ship a feature and have it in production by Friday" beats "this service is 2x faster" most of the time. Measure it. If your current stack requires a two-day ceremony to deploy a one-line change, throughput is not your problem. Velocity is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational complexity
&lt;/h3&gt;

&lt;p&gt;Scale-up is operationally cheaper than scale-out. One machine, one process, one log. Scale-out gives you better redundancy but also distributed-systems problems — consistency, ordering, partial failure, chaos engineering. If your team is three people, the operational complexity of a 20-node scale-out cluster may eat more time than the language choice saves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory efficiency per dollar
&lt;/h3&gt;

&lt;p&gt;At cloud scale, memory is expensive. A Rust service that fits in 2GB where a Java service needs 8GB is a 4x savings on every instance. Multiply by thousands of instances and "per-op performance" stops being the interesting number — per-GB cost starts to matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hiring pool
&lt;/h3&gt;

&lt;p&gt;The language with the deepest talent pool in your market is usually the right answer for a new system, all else equal. A marginal technical improvement isn't worth a six-month hiring pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning curve shape
&lt;/h3&gt;

&lt;p&gt;Some languages have shallow onboarding (Go, Python) and a long tail of depth. Others have steep onboarding (Rust, Haskell) and you're productive only after the ramp. For a senior team on a long-lived system, steep is fine. For a fast-moving team, steep is expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern I See Repeated
&lt;/h2&gt;

&lt;p&gt;A company starts small, picks Python or Ruby, builds the thing, ships to production. Ten employees. One codebase. Life is fast.&lt;/p&gt;

&lt;p&gt;They grow to fifty engineers. The monolith cracks. Some services get rewritten in Go for concurrency and operational simplicity. A few performance-critical ones get written in Rust. Data infra sits on the JVM (Kafka, Spark, Flink). A few internal tools stay in Python because the team knows it and it works.&lt;/p&gt;

&lt;p&gt;Five years in, the stack is polyglot. Nobody regrets it. What they regret is the six months they spent trying to make a single-language stack work past its comfort zone — the Python team pushing for "just async more things," or the Rust team fighting the borrow checker on code that could have been Go, or the Java team explaining to a new hire why the stack trace is 400 lines long.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;pick the language that fits the service, not the service that fits the language&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Ask the Question Now
&lt;/h2&gt;

&lt;p&gt;When someone proposes "let's build this new thing in X," I ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's the expected traffic profile, and what's the per-request work shape?&lt;/li&gt;
&lt;li&gt;Is this scale-up limited (per-machine throughput) or scale-out limited (concurrent work)?&lt;/li&gt;
&lt;li&gt;Who's going to write this, and how fast do we need them productive?&lt;/li&gt;
&lt;li&gt;Who's going to operate this, and what's their tooling comfort?&lt;/li&gt;
&lt;li&gt;Does this interact with an existing ecosystem (JVM data platform, Rust security infra)?&lt;/li&gt;
&lt;li&gt;How long does it have to live?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The answer to those six questions usually lands me on one of three languages for 80% of systems I see: Go, Rust, or (for data-adjacent work) Kotlin on the JVM. Python still shows up for tools and glue. Everything else is contextual.&lt;/p&gt;

&lt;p&gt;The benchmarks don't help. Per-op microbenchmarks answer questions nobody is actually asking. The right question is which axes matter for &lt;em&gt;this&lt;/em&gt; system, and which language's bet lines up with those axes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Argument I've Stopped Having
&lt;/h2&gt;

&lt;p&gt;I still see engineers argue about whether Rust or Go is "better." Both are good languages. Both are bad choices for problems they weren't designed for. The meaningful question is which kind of scale you're paying for — and the honest answer is almost always a mix, evolving over time.&lt;/p&gt;

&lt;p&gt;The Rust rewrite I opened with wasn't a bad decision because Rust is a bad language. It was a bad decision because we weren't scale-up limited. We were downstream-database limited. No language could help with that.&lt;/p&gt;

&lt;p&gt;Know which scale you're buying, and buy it on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections: User-Space Context Switching, Explained&lt;/a&gt; — the design decision behind Go's scale-out bet.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — what you actually get with Go, and what you don't.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/nats-kafka-mqtt-same-category-different-jobs/" rel="noopener noreferrer"&gt;NATS vs Kafka vs MQTT: Same Category, Very Different Jobs&lt;/a&gt; — applying the same fit-vs-benchmark thinking to messaging.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programminglanguages</category>
      <category>systemdesign</category>
      <category>scale</category>
      <category>rust</category>
    </item>
    <item>
      <title>From Locks to Actors: The Four Pillars of Modern Concurrency</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 17 Apr 2026 05:50:27 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/from-locks-to-actors-the-four-pillars-of-modern-concurrency-3o50</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/from-locks-to-actors-the-four-pillars-of-modern-concurrency-3o50</guid>
      <description>&lt;p&gt;Most working engineers have spent ninety percent of their concurrent-programming life in one model: shared memory protected by locks. Threads that all see the same variables. Mutexes around the critical sections. Hope and care. It's the model every OS textbook teaches, every mainstream language supports, and every senior engineer has a horror story about.&lt;/p&gt;

&lt;p&gt;It's also not the only option. Or even the best one, for many of the problems it gets used for. Three other models — CSP, actors, and software transactional memory — have been around for decades, are mature enough for production, and each solves a class of problems that lock-based designs handle poorly.&lt;/p&gt;

&lt;p&gt;This is a map of all four, from a working backend engineer who uses each of them for different jobs, and a take on when each is the right answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBQMVsiMSDCtyBTaGFyZWQgTWVtb3J5ICsgTG9ja3MiXQogICAgICAgIE0xWyJUaHJlYWRzIHNoYXJlIGFkZHJlc3Mgc3BhY2UiXQogICAgICAgIE0yWyJNdXRleCDCtyBhdG9taWNzIMK3IGNvbmQgdmFyIl0KICAgICAgICBNM1siRGVhZGxvY2tzIMK3IHJhY2VzIMK3IGludmlzaWJsZSBidWdzIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFAyWyIyIMK3IENTUCDigJQgQ29tbXVuaWNhdGluZyBTZXF1ZW50aWFsIFByb2Nlc3NlcyJdCiAgICAgICAgQzFbIkdvcm91dGluZXMgKyBjaGFubmVscyJdCiAgICAgICAgQzJbIk93bmVyc2hpcCBtb3ZlcyB3aXRoIG1lc3NhZ2UiXQogICAgICAgIEMzWyJCYWNrcHJlc3N1cmUgYnVpbHQtaW4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggUDNbIjMgwrcgQWN0b3JzIl0KICAgICAgICBBMVsiTmFtZWQgZW50aXR5ICsgbWFpbGJveCJdCiAgICAgICAgQTJbIlByaXZhdGUgc3RhdGUgwrcgbm8gc2hhcmluZyJdCiAgICAgICAgQTNbIlN1cGVydmlzaW9uIMK3IGxldCBpdCBjcmFzaCJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBQNFsiNCDCtyBTb2Z0d2FyZSBUcmFuc2FjdGlvbmFsIE1lbW9yeSJdCiAgICAgICAgUzFbIk9wdGltaXN0aWMgdHJhbnNhY3Rpb25zIl0KICAgICAgICBTMlsiQ29tcG9zYWJsZSDCtyByZXRyeSBvbiBjb25mbGljdCJdCiAgICAgICAgUzNbIk5vIGxvY2tzLCBubyBkZWFkbG9ja3MiXQogICAgZW5kCgogICAgY2xhc3NEZWYgcGlsbGFyIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBQMSxQMixQMyxQNCBwaWxsYXI%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBQMVsiMSDCtyBTaGFyZWQgTWVtb3J5ICsgTG9ja3MiXQogICAgICAgIE0xWyJUaHJlYWRzIHNoYXJlIGFkZHJlc3Mgc3BhY2UiXQogICAgICAgIE0yWyJNdXRleCDCtyBhdG9taWNzIMK3IGNvbmQgdmFyIl0KICAgICAgICBNM1siRGVhZGxvY2tzIMK3IHJhY2VzIMK3IGludmlzaWJsZSBidWdzIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFAyWyIyIMK3IENTUCDigJQgQ29tbXVuaWNhdGluZyBTZXF1ZW50aWFsIFByb2Nlc3NlcyJdCiAgICAgICAgQzFbIkdvcm91dGluZXMgKyBjaGFubmVscyJdCiAgICAgICAgQzJbIk93bmVyc2hpcCBtb3ZlcyB3aXRoIG1lc3NhZ2UiXQogICAgICAgIEMzWyJCYWNrcHJlc3N1cmUgYnVpbHQtaW4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggUDNbIjMgwrcgQWN0b3JzIl0KICAgICAgICBBMVsiTmFtZWQgZW50aXR5ICsgbWFpbGJveCJdCiAgICAgICAgQTJbIlByaXZhdGUgc3RhdGUgwrcgbm8gc2hhcmluZyJdCiAgICAgICAgQTNbIlN1cGVydmlzaW9uIMK3IGxldCBpdCBjcmFzaCJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBQNFsiNCDCtyBTb2Z0d2FyZSBUcmFuc2FjdGlvbmFsIE1lbW9yeSJdCiAgICAgICAgUzFbIk9wdGltaXN0aWMgdHJhbnNhY3Rpb25zIl0KICAgICAgICBTMlsiQ29tcG9zYWJsZSDCtyByZXRyeSBvbiBjb25mbGljdCJdCiAgICAgICAgUzNbIk5vIGxvY2tzLCBubyBkZWFkbG9ja3MiXQogICAgZW5kCgogICAgY2xhc3NEZWYgcGlsbGFyIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBQMSxQMixQMyxQNCBwaWxsYXI%3D" alt="The four pillars of concurrency: shared memory with locks, CSP, actors, and software transactional memory" width="953" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Concurrency has four viable pillars: shared memory + locks (threads, mutexes), CSP (channels, Go), actors (mailboxes, Erlang), and STM (transactional memory, Clojure). None is universally better. Each solves a different problem and has a different failure mode. Senior designs often mix three of them in one system. Mutex-for-everything works until it doesn't — usually at exactly the scale you promised you'd never reach.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Pillar 1: Shared Memory + Locks
&lt;/h2&gt;

&lt;p&gt;The default. Threads, mutexes, atomics, condition variables. Every mainstream language has them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: multiple threads of execution share the same address space. They read and write the same data. Mutexes make sure only one thread touches a critical section at a time. Atomics do the same for single-word operations without a full lock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple shared counters and caches.&lt;/strong&gt; &lt;code&gt;atomic.AddInt64&lt;/code&gt;, &lt;code&gt;sync.Map&lt;/code&gt;, LRU caches. The right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight single-process coordination&lt;/strong&gt; where the code is small enough for one person to hold in their head.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance-critical paths&lt;/strong&gt; where the overhead of channel sends or actor dispatches is too much.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deadlocks.&lt;/strong&gt; Two threads acquire locks in opposite order. Happens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority inversion.&lt;/strong&gt; Low-priority thread holds the lock, high-priority thread waits, work piles up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock ordering bugs at scale.&lt;/strong&gt; When N components each take M locks, the reasoning gets exponential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-model weirdness.&lt;/strong&gt; What one thread writes, another may not immediately see. You start caring about happens-before, acquire/release semantics, and why &lt;code&gt;volatile&lt;/code&gt; in Java is not what you thought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invisible races.&lt;/strong&gt; The worst kind. Tests pass; production fails weirdly twice a month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use mutexes for small, localized shared state. Once the shared state has three collaborators or more, or a nontrivial invariant across fields, reach for one of the other models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 2: CSP (Communicating Sequential Processes)
&lt;/h2&gt;

&lt;p&gt;Tony Hoare's 1978 paper, popularized by Occam and now Go. The model Rob Pike and the rest of the Go team picked for Go's concurrency, by way of Newsqueak and Limbo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: processes don't share memory; they send messages on named &lt;strong&gt;channels&lt;/strong&gt;. Senders and receivers rendezvous on the channel. Ownership of data moves with the message. "Do not communicate by sharing memory; share memory by communicating."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines.&lt;/strong&gt; Data flows through stages, each a goroutine, connected by channels. Clean to read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fan-out / fan-in.&lt;/strong&gt; One producer, many workers, one aggregator. The channel topology &lt;em&gt;is&lt;/em&gt; the architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure.&lt;/strong&gt; A bounded channel blocks the producer when full. No extra flow control needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cancellation coordination.&lt;/strong&gt; &lt;code&gt;select&lt;/code&gt; with &lt;code&gt;&amp;lt;-ctx.Done()&lt;/code&gt; is a clean primitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle control.&lt;/strong&gt; Closing a channel is a broadcast to every listener.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deadlocks remain possible.&lt;/strong&gt; Two goroutines each waiting on the other's channel. Cycles in the channel graph are lethal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory leaks via unclosed channels.&lt;/strong&gt; A goroutine blocked on a send that will never be received lives forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Awkward request/reply.&lt;/strong&gt; You end up passing a reply channel with each request, which works but feels verbose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order isn't free.&lt;/strong&gt; Channel ordering is only per-channel. If you fan out and fan in, the aggregation is unordered unless you sort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use CSP for coordination-heavy designs. When the structure of "who's alive, who sends to whom, when do things stop" is the architecture, channels make that visible in the code.&lt;/p&gt;

&lt;p&gt;Go is the obvious exemplar, but CSP-style is also available in Rust (&lt;code&gt;crossbeam-channel&lt;/code&gt;, &lt;code&gt;tokio::sync::mpsc&lt;/code&gt;), Kotlin (coroutines with channels), Python (&lt;code&gt;asyncio.Queue&lt;/code&gt;), and C# (&lt;code&gt;System.Threading.Channels&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 3: Actors
&lt;/h2&gt;

&lt;p&gt;Carl Hewitt's 1973 paper. Made practical by Erlang (1986) and later Akka (Scala/Java). The model behind WhatsApp, decades of telecom switching, and most fault-tolerant messaging infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: an &lt;strong&gt;actor&lt;/strong&gt; is a named entity with private state and a mailbox. Other actors send messages to its address. Messages are processed one at a time from the mailbox. No shared memory. Parent actors supervise children; when a child crashes, the parent decides to restart, escalate, or ignore. Crashes are normal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault isolation at scale.&lt;/strong&gt; One actor crashing is expected; it doesn't take down the system. Supervision hierarchies make "let it crash" a sensible engineering strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful services.&lt;/strong&gt; Each actor holds its own state. Conceptually clean: no shared global state, no locks around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location transparency.&lt;/strong&gt; An actor can live in the same process, another process, or another machine. The sender doesn't know. This is where actors shine in distributed systems — the model scales across the network boundary natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive concurrency with stateful semantics.&lt;/strong&gt; Erlang routinely runs millions of actors per node. Each is cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mailbox unboundedness.&lt;/strong&gt; If a producer sends faster than the actor can process, the mailbox grows without bound. Bounded mailboxes exist; use them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message-ordering assumptions break across the network.&lt;/strong&gt; Within one node, delivery order is preserved between a given sender and receiver. Across nodes, all bets are off without explicit sequencing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing is harder.&lt;/strong&gt; Actors make their own state opaque; you test behavior through message exchange. Good frameworks help, but the habits needed are different from testing normal code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual mismatch in CRUD-style backends.&lt;/strong&gt; If your business logic is "select some rows, transform them, insert result," actors are overkill. They shine on long-lived stateful entities (a game character, a connected device, a user session), not on stateless request handlers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Erlang and Elixir are the canonical runtimes. Akka brings actors to the JVM. Pony is a rare actor-first typed language. In Go, you can simulate actors with a goroutine + channel-as-mailbox pattern, but you lose Erlang's supervision and "let it crash" semantics unless you build them yourself.&lt;/p&gt;

&lt;p&gt;Use actors when you have &lt;strong&gt;long-lived stateful entities with fault requirements&lt;/strong&gt;. Telecom, messaging, multiplayer game servers, IoT device shadows, any system where "this particular entity has its own state machine, and we really care when it crashes" is the shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 4: Software Transactional Memory (STM)
&lt;/h2&gt;

&lt;p&gt;Imagine database transactions, but for in-memory data. That's STM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: critical sections are wrapped in transactions. The runtime tracks reads and writes optimistically. On commit, if any data touched was modified by another transaction, the current one rolls back and retries. No explicit locks. And it composes: two transactions can be combined into a larger one without redesigning any locking order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Composable concurrent code.&lt;/strong&gt; Combining operations that were individually correct usually stays correct under STM. Lock-based code famously does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-mostly workloads.&lt;/strong&gt; STM with multi-version concurrency control scales reads without blocking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding the lock-ordering bug class.&lt;/strong&gt; No locks, no deadlocks. The failure mode is retry storms, which are easier to reason about.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I/O inside transactions is awful.&lt;/strong&gt; Transactions may retry. If you did I/O, you may have done it multiple times. Either separate I/O from transactional state, or the runtime has to forbid I/O inside transactions (Haskell's STM monad does this at the type level).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry storms under contention.&lt;/strong&gt; Heavy write contention on the same data means constant retries. In the worst case, throughput can be worse than locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited language support.&lt;/strong&gt; Clojure (built-in), Haskell (&lt;code&gt;STM&lt;/code&gt;), Scala (&lt;code&gt;scala-stm&lt;/code&gt;), Rust (experimental &lt;code&gt;stm&lt;/code&gt; crates). Not a mainstream feature of Go/Java/C#.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clojure is the canonical "STM as a first-class citizen" language — its refs and transactions are idiomatic. Haskell's &lt;code&gt;STM&lt;/code&gt; monad is arguably the cleanest realization. In other ecosystems, STM exists as libraries but hasn't displaced mutexes.&lt;/p&gt;

&lt;p&gt;Use STM when the concurrent state is small-to-medium, the access pattern is read-heavy with occasional writes, and you want the composability. For the rare problems that fit, STM is strictly simpler to reason about than locks. For problems that don't fit (I/O-heavy, write-contention-heavy), STM is worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Real Systems Mix Them
&lt;/h2&gt;

&lt;p&gt;The surprise for engineers who've only used one model: &lt;strong&gt;mature systems mix three of them in one codebase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical backend service I'd build today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutexes / atomics&lt;/strong&gt; for the inner loops — counters, caches, rate-limiter state, anything performance-critical with one clear owner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channels (CSP)&lt;/strong&gt; for coordination — worker pools, pipelines, cancellation, shutdown signaling, bounded queues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actors (in a sense)&lt;/strong&gt; for long-lived stateful entities — each connected client session, each in-flight request, each background job. In Go I'd model this as "one goroutine per entity, communicating via channels," which isn't formal actors but inherits the useful semantics: isolated state, message-passing, crash-isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I wouldn't use STM in that stack. Not because it's bad, but because the language runtime doesn't make it first-class. If I were writing Clojure, STM would be a natural fit for the in-memory state machines that would otherwise be locked maps.&lt;/p&gt;

&lt;p&gt;The old "pick one concurrency model" debate was always a false choice. The real decision is per-problem: what shape is the concurrent work, what's the state-sharing pattern, and what failure semantics do I want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Guide
&lt;/h2&gt;

&lt;p&gt;Quick map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I have a counter that multiple goroutines read and update.&lt;/strong&gt; → atomic or mutex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a pipeline of work that flows through stages.&lt;/strong&gt; → channels (CSP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a fleet of long-lived sessions, each with its own state and lifetime.&lt;/strong&gt; → actor pattern (goroutine + mailbox channel, or real actor framework).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a fleet of connected devices each with a state machine that must survive crashes.&lt;/strong&gt; → actor framework with supervision (Erlang, Akka, or Go with explicit crash/restart logic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have complex shared state with nontrivial invariants across fields, and updates are occasional but important to compose.&lt;/strong&gt; → STM if your language supports it; otherwise, lots of careful mutex discipline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a request/response flow with fan-out to downstreams.&lt;/strong&gt; → CSP with &lt;code&gt;errgroup.WithContext&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have no idea what I have.&lt;/strong&gt; → Start with mutexes, switch when it hurts. Don't over-engineer the first version.&lt;/li&gt;
&lt;/ul&gt;
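&lt;p&gt;To make the first bullet concrete — a shared counter needs nothing heavier than &lt;code&gt;sync/atomic&lt;/code&gt;. A minimal sketch (&lt;code&gt;countHits&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countHits bumps a shared counter from n goroutines. An atomic is the
// whole solution here: no channel ceremony, no mutex.
func countHits(n int) int64 {
	var hits atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			hits.Add(1)
		}()
	}
	wg.Wait()
	return hits.Load()
}

func main() {
	fmt.Println(countHits(100)) // 100
}
```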

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Most people who get bitten by concurrency bugs got bitten because they used the wrong model, not because they used it wrong. A mutex-heavy design for a workload that's really a pipeline is fragile. A channels-for-everything design when there's a shared counter underneath ends up with awkward rendezvous. An actors-everywhere design when the business is CRUD requests reads like over-engineering.&lt;/p&gt;

&lt;p&gt;The four pillars aren't competing theories of concurrency. They're four tools, each good at specific jobs. Senior engineers know all four and reach for the right one. Junior engineers reach for the only one they know and force-fit it.&lt;/p&gt;

&lt;p&gt;If your career so far has been mostly mutexes, spend a weekend reading the other three. Write a toy pipeline in Go channels. Read Erlang's supervision documentation. Play with Clojure refs. The investment pays back every time you sit in a design review and someone proposes locking their way out of a structural problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — CSP applied concretely in Go.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections&lt;/a&gt; — the runtime characteristics that make CSP cheap in Go.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/scale-up-scale-out-every-language-wins-somewhere/" rel="noopener noreferrer"&gt;Scale-Up vs Scale-Out: Why Every Language Wins Somewhere&lt;/a&gt; — the language-level view of the same question.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>concurrency</category>
      <category>systemdesign</category>
      <category>go</category>
      <category>erlang</category>
    </item>
    <item>
      <title>RPC vs NATS: It's Not About Sync vs Async — It's About Who Owns Completion</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 17 Apr 2026 05:50:26 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/rpc-vs-nats-its-not-about-sync-vs-async-its-about-who-owns-completion-1fi5</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/rpc-vs-nats-its-not-about-sync-vs-async-its-about-who-owns-completion-1fi5</guid>
      <description>&lt;p&gt;A team I worked with once migrated an order-placement path from gRPC to NATS because "it's decoupled and faster." The old flow was simple: the web service called &lt;code&gt;PlaceOrder&lt;/code&gt; via gRPC, got back an order ID, rendered success to the user. The new flow: web service publishes &lt;code&gt;order.place&lt;/code&gt; to NATS, an order-service consumes it and processes asynchronously.&lt;/p&gt;

&lt;p&gt;Within three weeks they had three kinds of incidents on rotation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate orders&lt;/strong&gt; — retry on the publisher side meant the same order was placed twice when the first publish actually succeeded but the ack was slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lost orders&lt;/strong&gt; — consumer crashed mid-process; no ack meant NATS redelivered, but the consumer had already partially committed state, so redelivery was rejected by a dedup check. The order just... disappeared from the user's perspective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dark-failure support tickets&lt;/strong&gt; — users reported "I clicked buy and nothing happened." From the publisher side, everything looked fine. From the consumer side, processing time had drifted from 50 ms to 45 seconds because a downstream DB had a slow query, and the web team had no telemetry on the consumer side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The retro landed on a single sentence: &lt;em&gt;we thought we were changing the transport; we actually changed who owned the completion of the work&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — RPC and pub/sub messaging look like two points on a sync-vs-async spectrum. They aren't. They're two fundamentally different &lt;strong&gt;ownership contracts&lt;/strong&gt;. In RPC, the caller owns knowing the work finished. In messaging, the receiver owns it. Swapping one for the other without inverting retry, idempotency, ack, and observability is how you turn a clean migration into a three-month incident parade.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Sync-vs-Async Trap
&lt;/h2&gt;

&lt;p&gt;The most common framing I see is this: RPC is synchronous, messaging is asynchronous, pick based on whether you need the answer immediately. That framing is almost useless in practice. It conflates two separate axes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis 1: Does the caller wait?&lt;/strong&gt; Sync vs async. This is a latency question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Axis 2: Who is responsible for knowing the work completed?&lt;/strong&gt; Caller or receiver. This is a contract question.&lt;/p&gt;

&lt;p&gt;You can have synchronous messaging (request-reply over NATS with a reply subject — caller waits, but transport is pub/sub). You can have asynchronous RPC (fire-and-forget gRPC — &lt;code&gt;stream.Send&lt;/code&gt; with no ack). What matters isn't how long the caller waits. It's who's on the hook if the work doesn't happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Clean Ownership Models
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBSUENbIlJQQyDigJQgY2FsbGVyIG93bnMgY29tcGxldGlvbiJdCiAgICAgICAgQzFbQ2xpZW50XSAtLT58IjEuIGNhbGwgwrcgd2FpdCJ8IFMxW1NlcnZlcl0KICAgICAgICBTMSAtLT58IjIuIGRpZCB0aGUgdGhpbmcgwrcgcmV0dXJuIHJlc3VsdDxici8-KG9yIHRpbWVvdXQg4oaSIGNsaWVudCBkZWNpZGVzKSJ8IEMxCiAgICBlbmQKCiAgICBzdWJncmFwaCBNc2dbIk1lc3NhZ2luZyDigJQgcmVjZWl2ZXIgb3ducyBjb21wbGV0aW9uIl0KICAgICAgICBQMVtQdWJsaXNoZXJdIC0tPnwiMS4gc2VuZCDCtyBmaXJlIGFuZCBmb3JnZXQifCBCMVtbTWVzc2FnZSBidXNdXQogICAgICAgIEIxIC0tPnwiMi4gZXZlbnR1YWxseSJ8IFIxW0NvbnN1bWVyXQogICAgICAgIFIxIC0tPnwiMy4gYWNrIChvciBOQUNLIMK3IHJlZGVsaXZlcikifCBCMQogICAgZW5kCgogICAgY2xhc3NEZWYgcnBjIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MgogICAgY2xhc3NEZWYgbXNnIGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgUlBDIHJwYwogICAgY2xhc3MgTXNnIG1zZw%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBSUENbIlJQQyDigJQgY2FsbGVyIG93bnMgY29tcGxldGlvbiJdCiAgICAgICAgQzFbQ2xpZW50XSAtLT58IjEuIGNhbGwgwrcgd2FpdCJ8IFMxW1NlcnZlcl0KICAgICAgICBTMSAtLT58IjIuIGRpZCB0aGUgdGhpbmcgwrcgcmV0dXJuIHJlc3VsdDxici8-KG9yIHRpbWVvdXQg4oaSIGNsaWVudCBkZWNpZGVzKSJ8IEMxCiAgICBlbmQKCiAgICBzdWJncmFwaCBNc2dbIk1lc3NhZ2luZyDigJQgcmVjZWl2ZXIgb3ducyBjb21wbGV0aW9uIl0KICAgICAgICBQMVtQdWJsaXNoZXJdIC0tPnwiMS4gc2VuZCDCtyBmaXJlIGFuZCBmb3JnZXQifCBCMVtbTWVzc2FnZSBidXNdXQogICAgICAgIEIxIC0tPnwiMi4gZXZlbnR1YWxseSJ8IFIxW0NvbnN1bWVyXQogICAgICAgIFIxIC0tPnwiMy4gYWNrIChvciBOQUNLIMK3IHJlZGVsaXZlcikifCBCMQogICAgZW5kCgogICAgY2xhc3NEZWYgcnBjIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MgogICAgY2xhc3NEZWYgbXNnIGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgUlBDIHJwYwogICAgY2xhc3MgTXNnIG1zZw%3D%3D" alt="C1[Client] --&amp;gt;|" width="380" height="840"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two models. Opposite error semantics. Opposite retry semantics. Opposite observability alignment. Swapping one for the other changes every downstream engineering assumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  RPC: caller owns completion
&lt;/h3&gt;

&lt;p&gt;When a client makes an RPC, it holds the connection open until a response comes back. That response is a statement by the server: &lt;em&gt;I did the thing, here's the result&lt;/em&gt;. If the call times out, the client assumes failure (possibly partial) and has to decide what to do about it.&lt;/p&gt;

&lt;p&gt;What this means operationally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry is a caller decision.&lt;/strong&gt; The caller knows whether the work is idempotent, how important it is, and how much budget is left.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors propagate naturally.&lt;/strong&gt; A gRPC status code goes right back up the call chain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability aligns.&lt;/strong&gt; The caller's span includes the work's duration. If it's slow, the caller sees it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure is immediate.&lt;/strong&gt; Callers block on slow servers, limiting their own rate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why RPC feels "simple" — the ownership contract is tight. The downside: the caller's fate is coupled to the callee's fate. A slow server propagates slowness back to every caller.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messaging: receiver owns completion
&lt;/h3&gt;

&lt;p&gt;When a publisher sends a message, the bus accepts it. The publisher's job is done. Whether the work happens — when, in what order, how many times, whether at all — is now somebody else's problem. Usually the consumer's.&lt;/p&gt;

&lt;p&gt;What this means operationally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry is a consumer decision.&lt;/strong&gt; The bus may redeliver on no-ack; the consumer has to decide how to handle that (idempotency key, dedup table, upsert).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors are silent on the publisher side.&lt;/strong&gt; A failed consumer doesn't tell the publisher. A dead-letter queue or out-of-band alerting has to be built.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability splits in two.&lt;/strong&gt; Publisher metrics say "I sent it." Consumer metrics say "I processed it." The gap between those — lag — is its own story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure is decoupled.&lt;/strong&gt; Publishers can happily overwhelm consumers, which means you need consumer-side rate limits or bounded queues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why pub/sub feels "flexible" — producers and consumers are independent. The downside: &lt;em&gt;nothing is automatic&lt;/em&gt;. Every property that RPC gave you for free (retry policy, error propagation, aligned observability, flow control) is now a thing you have to design and build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Decision
&lt;/h2&gt;

&lt;p&gt;Once you see it as an ownership question, the decision becomes clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Does the caller need the answer to decide what happens next?&lt;/strong&gt; → RPC. Auth check. Balance read. Inventory reservation. Any synchronous business flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the work a notification that something already happened?&lt;/strong&gt; → Messaging. "Order was placed." "User signed up." Downstream consumers that don't gate the primary flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can the work tolerate delay and be retried independently?&lt;/strong&gt; → Messaging. Email send. Indexing. Analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the work idempotent by construction, or can it be made so cheaply?&lt;/strong&gt; → Messaging works. If not, RPC's caller-owned retry is simpler to reason about.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can mix them. Most mature microservice stacks do. The mistake is picking messaging because "decoupled is better" without doing the consumer-side engineering that decoupling requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Has to Change in the Migration
&lt;/h2&gt;

&lt;p&gt;Here's the minimum checklist for every RPC → messaging migration. If any of these is missing, the old code was better.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Idempotency keys, enforced at the consumer
&lt;/h3&gt;

&lt;p&gt;Every message carries an operation ID. The consumer dedup-checks before committing. This is not optional. Without it, any redelivery (and there will be redeliveries) creates duplicate state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;alreadyProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="c"&gt;// idempotent: we already did this&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Begin&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rollback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;doTheWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="c"&gt;// message will be redelivered&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;markProcessed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;markProcessed&lt;/code&gt; call has to be in the same transaction as the actual work, or you have a race where the work commits but the dedup record doesn't. Then the next redelivery re-does it.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Explicit ack semantics
&lt;/h3&gt;

&lt;p&gt;Know whether your bus is at-most-once (send and forget, messages can be lost), at-least-once (redelivery on no-ack, duplicates possible), or effectively-once (at-least-once plus receiver-side dedup). Most production systems run on at-least-once with dedup. NATS core is at-most-once by default; NATS JetStream is at-least-once. Kafka is at-least-once with offset-based replay. RabbitMQ is configurable — check both sides agree.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Dead-letter path
&lt;/h3&gt;

&lt;p&gt;Messages that fail repeatedly have to go somewhere other than "redelivered forever." A dead-letter queue (or topic, or subject) plus an alert when non-trivial traffic hits it. Without this, a poison message takes a consumer out of service.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Consumer-side observability
&lt;/h3&gt;

&lt;p&gt;At minimum: consumer lag (messages published but not yet processed), processing time per message, error rate, redelivery rate. The publisher's metrics tell you about the bus, not about the work. If you can't see "how fast is the consumer chewing through the queue right now," you're flying blind during the next incident.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Replay and reprocessing
&lt;/h3&gt;

&lt;p&gt;What happens when the consumer has a bug that corrupts data for a day, you fix the bug, and now you need to reprocess yesterday's messages? In RPC, you'd re-run the caller. In messaging, you need the ability to replay from an offset or from a backup. If the bus doesn't give you that (NATS core doesn't, JetStream does), you need a separate event log.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Specific Pattern I Like: The Request-Reply on a Bus
&lt;/h2&gt;

&lt;p&gt;One thing that confuses the discussion: you &lt;em&gt;can&lt;/em&gt; do synchronous-looking work on a message bus. NATS has request-reply built in (&lt;code&gt;nc.Request(subject, payload, timeout)&lt;/code&gt;), where the publisher gets a correlated reply on a temporary subject. This gives you the RPC ergonomics while using the messaging infrastructure.&lt;/p&gt;

&lt;p&gt;When is this useful?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you want the operational simplicity of RPC (caller waits, caller decides) but your service mesh is the message bus and adding a gRPC stack is overhead.&lt;/li&gt;
&lt;li&gt;When you want transparent failover — multiple consumers can listen, any can reply, and the bus handles the routing.&lt;/li&gt;
&lt;li&gt;When you want unified observability — both "notify" and "ask" flows go through the same substrate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request-reply over NATS gives you back caller-owned completion semantics on messaging infrastructure. It's the "pick ownership model separately from transport" option. Many good designs use it.&lt;/p&gt;

&lt;p&gt;The one that doesn't work: request-reply where the reply is supposed to happen later, via a different message. At that point the caller has moved on, the completion is truly transferred, and you're back in consumer-owned territory. Don't pretend otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Framing I Use in Design Reviews
&lt;/h2&gt;

&lt;p&gt;When someone says "let's use NATS/Kafka/RabbitMQ for this," I ask exactly one question: &lt;em&gt;who is responsible if the work doesn't happen?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If the answer is "the caller will notice and retry," they want RPC. If the answer is "the receiver will eventually catch up," they want messaging. If the answer is "I don't know," the design isn't ready.&lt;/p&gt;

&lt;p&gt;Everything else — transport, framing, protocol — is implementation. The ownership contract is the architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/nats-kafka-mqtt-same-category-different-jobs/" rel="noopener noreferrer"&gt;NATS vs Kafka vs MQTT: Same Category, Very Different Jobs&lt;/a&gt; — once you've decided messaging, how to pick among the three.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;Why Your "Fail-Fast" Strategy is Killing Your Distributed System&lt;/a&gt; — retry and resilience on the RPC side of the boundary.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-context-distributed-systems-production/" rel="noopener noreferrer"&gt;Go Context in Distributed Systems: What Actually Works in Production&lt;/a&gt; — cancellation propagation in caller-owned flows.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>rpc</category>
      <category>nats</category>
      <category>grpc</category>
    </item>
    <item>
      <title>Go Context in Distributed Systems: What Actually Works in Production</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:36:46 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/go-context-in-distributed-systems-what-actually-works-in-production-4d7p</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/go-context-in-distributed-systems-what-actually-works-in-production-4d7p</guid>
      <description>&lt;p&gt;The bug was alive for three weeks. On a normal day it cost nothing. On the day it activated, it nearly took the service down.&lt;/p&gt;

&lt;p&gt;The pattern was simple. An HTTP handler had to fetch data from three downstream gRPC services and merge the results. The team had done the disciplined thing: set a 5-second deadline on the request context, propagate it all the way through to the handler, use &lt;code&gt;errgroup&lt;/code&gt; for parallelism. Except — and you've probably seen this one — the fan-out looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// has a 5-second deadline&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt; &lt;span class="c"&gt;// ← here&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt; &lt;span class="c"&gt;// ← here&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt; &lt;span class="c"&gt;// ← here&lt;/span&gt;

    &lt;span class="c"&gt;// ... some sync wait ...&lt;/span&gt;
    &lt;span class="n"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every day for three weeks, the downstreams responded in 20 ms and everything worked. Then one of them — the slow path — got a planned capacity change that degraded it from 20 ms to 20 seconds. Not a crash. Just slow. And the HTTP handler's 5-second deadline did exactly what it promised: returned a timeout to the client.&lt;/p&gt;

&lt;p&gt;But the three goroutines kept running. They didn't get the memo.&lt;/p&gt;

&lt;p&gt;Within ninety seconds, &lt;strong&gt;goroutines climbed from 2,000 to 80,000&lt;/strong&gt;, connection pools drained, the GC started to choke on the churn, and the entire service had to be restarted twice before someone figured out that &lt;code&gt;context.Background()&lt;/code&gt; inside a handler-scoped goroutine isn't a stylistic choice — it's a goroutine leak with extra steps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — &lt;code&gt;context.Context&lt;/code&gt; is not documentation. It is the runtime boundary between "this work still matters" and "this work should stop." Every time you launch a goroutine from inside a request-scoped context and fail to propagate the parent ctx, you are creating work that outlives its reason to exist. Under load, that's what brings a service down — not CPU, not memory, not the downstream. Goroutines that won't die.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Context Actually Is
&lt;/h2&gt;

&lt;p&gt;The single biggest mistake I see engineers make is treating &lt;code&gt;context.Context&lt;/code&gt; like an argument convention — "the standard library says I should pass one, so I pass one." That's the wrong mental model.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;context.Context&lt;/code&gt; is four things, in order of importance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A cancellation signal.&lt;/strong&gt; When the context is done (cancelled, deadline exceeded), every goroutine holding it is being asked to stop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A deadline.&lt;/strong&gt; How much wall-clock budget this work has before it's considered failed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An error cause.&lt;/strong&gt; Why the context ended (&lt;code&gt;context.Canceled&lt;/code&gt;, &lt;code&gt;context.DeadlineExceeded&lt;/code&gt;, or a custom reason via &lt;code&gt;context.Cause&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A narrow channel for request-scoped metadata.&lt;/strong&gt; Trace ID, deadline, auth principal. That's about it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice what's not on the list: data transport, DI container, settings object, session store, cache. If you're using context to pass any of those, you've already lost.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Context is control flow, not data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBIWyJIVFRQIGhhbmRsZXI8YnIvPmN0eCDCtyA1cyBkZWFkbGluZSJdIC0tPiBHMQogICAgSCAtLT4gRzIKICAgIEggLS0-IEczCiAgICBHMVsiZ29yb3V0aW5lIEE8YnIvPmNhbGxBKGdjdHggwrcgcmVxKSJdIC0tPiBEMVsoRG93bnN0cmVhbSBBKV0KICAgIEcyWyJnb3JvdXRpbmUgQjxici8-Y2FsbEIoZ2N0eCDCtyByZXEpIl0gLS0-IEQyWyhEb3duc3RyZWFtIEIpXQogICAgRzNbImdvcm91dGluZSBDPGJyLz5jYWxsQyhnY3R4IMK3IHJlcSkiXSAtLT4gRDNbKERvd25zdHJlYW0gQyldCgogICAgQ2FuY2Vse3siY3R4LkRvbmUoKSBmaXJlczxici8-dGltZW91dCwgY2xpZW50IGdvbmUsPGJyLz5vciBzaWJsaW5nIGVycm9yZWQifX0gLS4tPnxicm9hZGNhc3R8IEcxCiAgICBDYW5jZWwgLS4tPnxicm9hZGNhc3R8IEcyCiAgICBDYW5jZWwgLS4tPnxicm9hZGNhc3R8IEczCiAgICBIIC0uLT4gQ2FuY2VsCgogICAgY2xhc3NEZWYgaGFuZGxlciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIHdvcmtlciBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzRGVmIGNhbmNlbCBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAsc3Ryb2tlLWRhc2hhcnJheTo1IDUKICAgIGNsYXNzIEggaGFuZGxlcgogICAgY2xhc3MgRzEsRzIsRzMgd29ya2VyCiAgICBjbGFzcyBDYW5jZWwgY2FuY2Vs" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBIWyJIVFRQIGhhbmRsZXI8YnIvPmN0eCDCtyA1cyBkZWFkbGluZSJdIC0tPiBHMQogICAgSCAtLT4gRzIKICAgIEggLS0-IEczCiAgICBHMVsiZ29yb3V0aW5lIEE8YnIvPmNhbGxBKGdjdHggwrcgcmVxKSJdIC0tPiBEMVsoRG93bnN0cmVhbSBBKV0KICAgIEcyWyJnb3JvdXRpbmUgQjxici8-Y2FsbEIoZ2N0eCDCtyByZXEpIl0gLS0-IEQyWyhEb3duc3RyZWFtIEIpXQogICAgRzNbImdvcm91dGluZSBDPGJyLz5jYWxsQyhnY3R4IMK3IHJlcSkiXSAtLT4gRDNbKERvd25zdHJlYW0gQyldCgogICAgQ2FuY2Vse3siY3R4LkRvbmUoKSBmaXJlczxici8-dGltZW91dCwgY2xpZW50IGdvbmUsPGJyLz5vciBzaWJsaW5nIGVycm9yZWQifX0gLS4tPnxicm9hZGNhc3R8IEcxCiAgICBDYW5jZWwgLS4tPnxicm9hZGNhc3R8IEcyCiAgICBDYW5jZWwgLS4tPnxicm9hZGNhc3R8IEczCiAgICBIIC0uLT4gQ2FuY2VsCgogICAgY2xhc3NEZWYgaGFuZGxlciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIHdvcmtlciBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzRGVmIGNhbmNlbCBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAsc3Ryb2tlLWRhc2hhcnJheTo1IDUKICAgIGNsYXNzIEggaGFuZGxlcgogICAgY2xhc3MgRzEsRzIsRzMgd29ya2VyCiAgICBjbGFzcyBDYW5jZWwgY2FuY2Vs" alt="Context tree: an HTTP handler ctx with a 5s deadline fans out to goroutines A, B, C; ctx.Done() broadcasts cancellation to all three" width="660" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the parent ctx is cancelled, the signal propagates to every goroutine that inherited it. Every spawned call drops the work it was doing and returns. That's the &lt;em&gt;whole&lt;/em&gt; value of context — and the reason &lt;code&gt;context.Background()&lt;/code&gt; inside a spawned goroutine breaks everything: it severs this tree.&lt;/p&gt;

&lt;p&gt;Every correct use of context follows from this. The moment you treat it as something else — a way to pass a config value, a way to smuggle a feature flag, a way to avoid changing a function signature — you start breaking the cancellation semantics that make it useful at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Patterns That Work
&lt;/h2&gt;

&lt;p&gt;After enough production debugging, a small set of patterns covers 95% of cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Always propagate, never replace
&lt;/h3&gt;

&lt;p&gt;The outer context defines the lifetime of the work. Any goroutine spawned to do part of that work must inherit it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// ✗ Wrong: spawned work is unkillable&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;doWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt;

&lt;span class="c"&gt;// ✓ Right: spawned work dies with the parent&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;doWork&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your linter isn't flagging &lt;code&gt;context.Background()&lt;/code&gt; or &lt;code&gt;context.TODO()&lt;/code&gt; inside functions that already have a &lt;code&gt;ctx&lt;/code&gt; in scope, fix your linter. &lt;code&gt;contextcheck&lt;/code&gt; in &lt;code&gt;golangci-lint&lt;/code&gt; catches most of these.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Fan out with errgroup.WithContext
&lt;/h3&gt;

&lt;p&gt;Raw goroutines + &lt;code&gt;sync.WaitGroup&lt;/code&gt; is the wrong primitive for fan-out calls to downstreams. Use &lt;code&gt;golang.org/x/sync/errgroup&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;fanOut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;errgroup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;

    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;callC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two properties that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gctx&lt;/code&gt; inherits the parent's deadline and cancellation.&lt;/strong&gt; The spawned calls die when the caller gives up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The first error cancels the sibling calls.&lt;/strong&gt; If &lt;code&gt;callA&lt;/code&gt; fails fast, the in-flight &lt;code&gt;callB&lt;/code&gt; and &lt;code&gt;callC&lt;/code&gt; stop wasting work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are invisible in the code. That's the point. You get the right behavior without having to think about it per-callsite.&lt;/p&gt;
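&lt;p&gt;If you'd rather see that machinery than take it on faith, here is a stdlib-only sketch of roughly what &lt;code&gt;errgroup.WithContext&lt;/code&gt; wires up for you. &lt;code&gt;miniGroup&lt;/code&gt; and &lt;code&gt;demo&lt;/code&gt; are illustrative names — in real code, just use &lt;code&gt;errgroup&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// miniGroup sketches errgroup's contract: the first error cancels a shared
// context so in-flight siblings can stop early.
type miniGroup struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
	once   sync.Once
	err    error
}

func withContext(ctx context.Context) (*miniGroup, context.Context) {
	ctx, cancel := context.WithCancel(ctx)
	return &miniGroup{cancel: cancel}, ctx
}

func (g *miniGroup) Go(f func() error) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if err := f(); err != nil {
			g.once.Do(func() {
				g.err = err
				g.cancel() // broadcast: siblings see gctx.Done()
			})
		}
	}()
}

func (g *miniGroup) Wait() error {
	g.wg.Wait()
	g.cancel()
	return g.err
}

// demo returns whether the slow sibling observed cancellation, plus the
// group's first error.
func demo() (bool, error) {
	g, gctx := withContext(context.Background())
	var siblingStopped bool

	g.Go(func() error { return errors.New("callA failed fast") })
	g.Go(func() error {
		select {
		case <-gctx.Done(): // cancelled because a sibling errored
			siblingStopped = true
			return gctx.Err()
		case <-time.After(5 * time.Second):
			return nil // only reached if cancellation never propagated
		}
	})

	return siblingStopped, g.Wait()
}

func main() {
	stopped, err := demo()
	fmt.Println(stopped, err) // the slow call stopped as soon as callA failed
}
```

&lt;p&gt;The slow call returns in microseconds, not five seconds — that's the wasted work the real &lt;code&gt;errgroup&lt;/code&gt; saves you at every fan-out.&lt;/p&gt;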

&lt;h3&gt;
  
  
  3. Cap the subtree with WithTimeout
&lt;/h3&gt;

&lt;p&gt;The parent gives you the outer boundary. Sometimes you want a tighter one for a specific piece of work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;callSlowly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;800&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// ← don't leak the timer&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things people get wrong here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting &lt;code&gt;defer cancel()&lt;/code&gt;&lt;/strong&gt; leaks the timer and keeps the child context (and everything hanging off it) alive until the parent ends. It's small, but it adds up under load — and &lt;code&gt;go vet&lt;/code&gt;'s &lt;code&gt;lostcancel&lt;/code&gt; check flags it for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using &lt;code&gt;WithTimeout&lt;/code&gt; where &lt;code&gt;WithDeadline&lt;/code&gt; makes more sense&lt;/strong&gt; — if your budget is "finish by a fixed wall-clock time," use &lt;code&gt;WithDeadline&lt;/code&gt;. A relative duration and a fixed point in time aren't the same thing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stacking timeouts that exceed the parent.&lt;/strong&gt; A &lt;code&gt;WithTimeout(ctx, 30*time.Second)&lt;/code&gt; on a context that already has a 5-second deadline still times out in 5 seconds: a child can only tighten the parent's deadline, never extend it. If you're setting 30 seconds and expecting 30 seconds, check the parent's budget first.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Make cancellation observable
&lt;/h3&gt;

&lt;p&gt;In a handler loop or polling loop, cancellation must be checked at every iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;work&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;work&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've debugged a service that looked like it was "stuck" but was actually processing a queue in a tight loop that never checked &lt;code&gt;ctx.Done()&lt;/code&gt;. The cancellation had fired long ago; the code just didn't care.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Return ctx.Err() at the right boundary
&lt;/h3&gt;

&lt;p&gt;When a context ends, &lt;code&gt;ctx.Err()&lt;/code&gt; reports &lt;code&gt;context.Canceled&lt;/code&gt; or &lt;code&gt;context.DeadlineExceeded&lt;/code&gt;, and context-aware standard library calls return errors that wrap them. Your code needs to either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pass it up&lt;/strong&gt;, because the caller asked for cancellation and you're honoring it, or&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translate it&lt;/strong&gt;, because your API surface speaks a different error vocabulary (gRPC codes, HTTP status codes, domain errors).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;downstream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Was this our fault, or theirs?&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeadlineExceeded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DeadlineExceeded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"upstream deadline"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"caller cancelled"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't do this, the errors that reach your caller will be a mix of "the downstream is broken" and "you asked me to stop, remember?", and your on-call will waste hours separating the two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;There are a handful of things that look fine and aren't. These are the ones I see most.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;context.Background()&lt;/code&gt; inside a spawned goroutine
&lt;/h3&gt;

&lt;p&gt;The bug that opens this post. You already have a context in scope. Use it. Spawning with &lt;code&gt;context.Background()&lt;/code&gt; breaks the cancellation chain and creates work that outlives the caller. It's the single most common goroutine leak I've seen in production Go.&lt;/p&gt;

&lt;h3&gt;
  
  
  Passing the context by field instead of by argument
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// ✗ Wrong&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Worker&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;callA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c"&gt;// stale ctx&lt;/span&gt;

&lt;span class="c"&gt;// ✓ Right&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Worker&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Worker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;callA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Context is per-call, not per-object. The moment you stash it in a struct, you've made it stale — the context from construction time is not the context from the current call. &lt;code&gt;golangci-lint&lt;/code&gt; with the &lt;code&gt;contextcheck&lt;/code&gt; linter enabled catches most of these. If your CI doesn't run it, add it today.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storing business data in context
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// ✗ Wrong&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"currentUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// ✓ Right&lt;/span&gt;
&lt;span class="n"&gt;ProcessOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule is: &lt;strong&gt;if the function needs it to work, it goes in the signature&lt;/strong&gt;. If it's optional metadata that cross-cuts every call (trace ID, request ID, auth principal for logging), context is fine — but keep the key typed (not a raw string) and keep the set small.&lt;/p&gt;

&lt;h3&gt;
  
  
  Blanket rethrow without translating
&lt;/h3&gt;

&lt;p&gt;Returning &lt;code&gt;ctx.Err()&lt;/code&gt; from a library function when the caller doesn't know about context produces baffling errors two layers up. If you're writing something reusable, translate context errors to your own error type at the boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Small Debugging Tool
&lt;/h2&gt;

&lt;p&gt;When you suspect a context-propagation problem, the fastest way to find it is usually a goroutine dump under load. Something like this keeps one around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// /debug/goroutines — read-only, auth-gated in prod&lt;/span&gt;
&lt;span class="n"&gt;mux&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandleFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/debug/goroutines"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pprof&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"goroutine"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WriteTo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// 1 = text format with stacks&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ship it behind auth, point a cron or load test at the thing you're trying to exercise, and diff two snapshots 10 seconds apart. Goroutines that persist across snapshots and aren't in &lt;code&gt;netpoll&lt;/code&gt; or &lt;code&gt;runtime.park_m&lt;/code&gt; are your suspects. Nine times out of ten, when I follow the stack traces, the leaked goroutines were spawned from a handler that's already returned — because someone wrote &lt;code&gt;context.Background()&lt;/code&gt; inside a &lt;code&gt;go func()&lt;/code&gt;.&lt;/p&gt;
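&lt;p&gt;The snapshot-diffing step can be mechanized. Here's a sketch that parses the &lt;code&gt;debug=1&lt;/code&gt; profile format — stack headers look like &lt;code&gt;N @ addr addr ...&lt;/code&gt; — and flags stacks whose goroutine count grew between two dumps (&lt;code&gt;parseCounts&lt;/code&gt; and &lt;code&gt;grown&lt;/code&gt; are illustrative helpers):&lt;/p&gt;

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCounts reads the debug=1 goroutine-profile format and returns the
// goroutine count per stack signature (the address list after " @ ").
func parseCounts(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		n, sig, ok := strings.Cut(sc.Text(), " @ ")
		if !ok {
			continue // skip the header and "#"-prefixed frame lines
		}
		if c, err := strconv.Atoi(strings.TrimSpace(n)); err == nil {
			counts[sig] += c
		}
	}
	return counts
}

// grown returns stacks with strictly more goroutines in the later snapshot —
// the leak suspects worth reading frame by frame.
func grown(before, after map[string]int) []string {
	var suspects []string
	for sig, c := range after {
		if c > before[sig] {
			suspects = append(suspects, sig)
		}
	}
	return suspects
}

func main() {
	before := parseCounts("goroutine profile: total 3\n2 @ 0xaaa 0xbbb\n1 @ 0xccc\n")
	after := parseCounts("goroutine profile: total 9\n8 @ 0xaaa 0xbbb\n1 @ 0xccc\n")
	fmt.Println(grown(before, after)) // only the stack that grew shows up
}
```

&lt;p&gt;Feed it two real dumps from &lt;code&gt;/debug/goroutines&lt;/code&gt; taken 10 seconds apart and the output is your shortlist.&lt;/p&gt;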

&lt;h2&gt;
  
  
  Where This Leaves You
&lt;/h2&gt;

&lt;p&gt;The moment you treat &lt;code&gt;context.Context&lt;/code&gt; as decoration — as a parameter you pass because the lint rule told you to — you've already lost the benefit. The entire reason context exists is to be the one shared signal that ties the lifetime of spawned work to the lifetime of its cause. Ignore that and you get goroutine leaks. Honor it and you get a service that drains cleanly under partial failure.&lt;/p&gt;

&lt;p&gt;In a monolith, you can get away with sloppy cancellation because the damage stays local. In a distributed system, where one slow downstream can cascade through three layers of fan-out into a goroutine explosion, you cannot. The cost of sloppy context handling scales with the number of network hops, and modern architectures have many.&lt;/p&gt;

&lt;p&gt;The fix is boring. Use &lt;code&gt;errgroup.WithContext&lt;/code&gt; for fan-out. Never &lt;code&gt;context.Background()&lt;/code&gt; inside a handler-scoped goroutine. Translate context errors at API boundaries. Check &lt;code&gt;&amp;lt;-ctx.Done()&lt;/code&gt; in loops. Add a &lt;code&gt;/debug/goroutines&lt;/code&gt; endpoint and actually look at it.&lt;/p&gt;

&lt;p&gt;There are no clever moves here. There's only the habit of passing context correctly, every time, for years — and the services that outlast the ones that didn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections: User-Space Context Switching, Explained&lt;/a&gt; — the runtime-level counterpart: what makes spawning goroutines cheap in the first place.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;Why Your "Fail-Fast" Strategy is Killing Your Distributed System&lt;/a&gt; — a different angle on the same class of problem: behaviour during partial failure.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>context</category>
      <category>concurrency</category>
    </item>
    <item>
      <title>Go's Concurrency Is About Structure, Not Speed: chan and context as Lifecycle Primitives</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:24:53 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/gos-concurrency-is-about-structure-not-speed-chan-and-context-as-lifecycle-primitives-pdi</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/gos-concurrency-is-about-structure-not-speed-chan-and-context-as-lifecycle-primitives-pdi</guid>
      <description>&lt;p&gt;For a while, I thought channels were Go's way of doing message passing. Something like Erlang processes or actors, except with a simpler syntax. That understanding is fine if you're writing tutorials. It is not fine when you've just OOM-killed a pod for the third time in an hour because your worker pool wasn't really a pool.&lt;/p&gt;

&lt;p&gt;The moment it clicked for me was during a production incident. A Kafka consumer service had been humming along for months at about 1,000 messages per second. Then an upstream team replayed twelve hours of events into the topic at once — roughly &lt;strong&gt;1.2 million messages in two minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The consumer code looked like this, more or less:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;kafkaMessages&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// one goroutine per message, fire and forget&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what the runtime tried to do: spawn 1.2 million goroutines as fast as it could. It did. &lt;strong&gt;Heap climbed from 200 MB to 12 GB in about forty seconds.&lt;/strong&gt; GC pauses went from 2 ms to 800 ms. The pod got OOM-killed. Kubernetes restarted it. On restart, it re-read the uncommitted offsets. Repeat. It took forty minutes and manual producer-side rate limiting upstream before the system would stay up.&lt;/p&gt;

&lt;p&gt;The bug wasn't Kafka. It wasn't Go. It was the mental model — treating goroutines as "free" and treating channels as "a way to move data between them." Goroutines are not free under load. And channels are not pipes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — &lt;code&gt;chan&lt;/code&gt; and &lt;code&gt;context&lt;/code&gt; aren't just concurrency utilities. They're the two primitives Go gives you for drawing &lt;strong&gt;the boundaries of aliveness in your program&lt;/strong&gt;. &lt;code&gt;chan&lt;/code&gt; bounds how many things are alive at once (backpressure, ownership). &lt;code&gt;context&lt;/code&gt; bounds when they stop being alive (cancellation, deadline). Use them as the skeleton of your design, not as implementation details bolted on at the end.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Bounded Pool Fix
&lt;/h2&gt;

&lt;p&gt;The fix for the Kafka disaster is the classic bounded worker pool. The shape looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBLYWZrYVsoS2Fma2EgdG9waWMpXSAtLT4gUHJvZHVjZXJbIlByb2R1Y2VyIGdvcm91dGluZTxici8-cmVhZHMgb25lIGF0IGEgdGltZSJdCiAgICBQcm9kdWNlciAtLT58ImJvdW5kZWQgY2hhbm5lbDxici8-Y2FwYWNpdHkgTiJ8IEpvYnN7eyJqb2JzIGNoYW4ifX0KICAgIEpvYnMgLS0-IFcxWyJXb3JrZXIgMSJdCiAgICBKb2JzIC0tPiBXMlsiV29ya2VyIDIiXQogICAgSm9icyAtLT4gVzNbIldvcmtlciAuLi4iXQogICAgSm9icyAtLT4gV25bIldvcmtlciBNPGJyLz4oZml4ZWQgY291bnQpIl0KCiAgICBDdHhbKCJjdHguRG9uZSgpPGJyLz5icm9hZGNhc3QgY2FuY2VsIildIC0uLT4gUHJvZHVjZXIKICAgIEN0eCAtLi0-IFcxCiAgICBDdHggLS4tPiBXMgogICAgQ3R4IC0uLT4gVzMKICAgIEN0eCAtLi0-IFduCgogICAgY2xhc3NEZWYgY2xhbXAgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmLHN0cm9rZS13aWR0aDoycHgKICAgIGNsYXNzRGVmIHNpZ25hbCBmaWxsOiNmYWY1ZmYsc3Ryb2tlOiM2YjQ2YzEsc3Ryb2tlLWRhc2hhcnJheTo1IDUKICAgIGNsYXNzIEpvYnMgY2xhbXAKICAgIGNsYXNzIEN0eCBzaWduYWw%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBLYWZrYVsoS2Fma2EgdG9waWMpXSAtLT4gUHJvZHVjZXJbIlByb2R1Y2VyIGdvcm91dGluZTxici8-cmVhZHMgb25lIGF0IGEgdGltZSJdCiAgICBQcm9kdWNlciAtLT58ImJvdW5kZWQgY2hhbm5lbDxici8-Y2FwYWNpdHkgTiJ8IEpvYnN7eyJqb2JzIGNoYW4ifX0KICAgIEpvYnMgLS0-IFcxWyJXb3JrZXIgMSJdCiAgICBKb2JzIC0tPiBXMlsiV29ya2VyIDIiXQogICAgSm9icyAtLT4gVzNbIldvcmtlciAuLi4iXQogICAgSm9icyAtLT4gV25bIldvcmtlciBNPGJyLz4oZml4ZWQgY291bnQpIl0KCiAgICBDdHhbKCJjdHguRG9uZSgpPGJyLz5icm9hZGNhc3QgY2FuY2VsIildIC0uLT4gUHJvZHVjZXIKICAgIEN0eCAtLi0-IFcxCiAgICBDdHggLS4tPiBXMgogICAgQ3R4IC0uLT4gVzMKICAgIEN0eCAtLi0-IFduCgogICAgY2xhc3NEZWYgY2xhbXAgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmLHN0cm9rZS13aWR0aDoycHgKICAgIGNsYXNzRGVmIHNpZ25hbCBmaWxsOiNmYWY1ZmYsc3Ryb2tlOiM2YjQ2YzEsc3Ryb2tlLWRhc2hhcnJheTo1IDUKICAgIGNsYXNzIEpvYnMgY2xhbXAKICAgIGNsYXNzIEN0eCBzaWduYWw%3D" alt="Flowchart: a Kafka topic feeds a producer goroutine that reads one message at a time into a bounded jobs channel (capacity N), consumed by a fixed pool of M workers; ctx.Done() broadcasts cancel to the producer and every worker" width="893" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bounded channel is the concurrency clamp. The context is the kill switch. Neither alone is enough; together they give you a pipeline that drains cleanly under shutdown and refuses to explode under load.&lt;/p&gt;

&lt;p&gt;Here it is in full, because the code matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;workers&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;              &lt;span class="c"&gt;// fixed pool&lt;/span&gt;
        &lt;span class="n"&gt;bufferSize&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;             &lt;span class="c"&gt;// bounded queue&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bufferSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Spawn fixed workers&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;workers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Producer: push into bounded jobs, blocks when full&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// tell workers we're done&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="c"&gt;// enqueued; producer moves on&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// wait for shutdown&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// producer closed the channel&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at what this code does that the broken version didn't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fixed number of workers.&lt;/strong&gt; 50, period. Never more, regardless of input rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded queue.&lt;/strong&gt; At most 100 in-flight messages between producer and workers. When the queue is full, the producer &lt;em&gt;stops reading from Kafka&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure is implicit.&lt;/strong&gt; The blocking send on &lt;code&gt;jobs &amp;lt;- msg&lt;/code&gt; is the backpressure mechanism. No complicated flow control needed. The channel is the mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cancellation is wired everywhere.&lt;/strong&gt; &lt;code&gt;ctx.Done()&lt;/code&gt; is checked in the producer, in every worker, and inside &lt;code&gt;consumer.ReadMessage&lt;/code&gt;. When the parent context is cancelled, all of them stop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No semaphores. No rate limiters. No backoff. The channel semantics do the whole job. &lt;strong&gt;The channel isn't a pipe; it's a clamp.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Channels Are Lifecycle Primitives
&lt;/h2&gt;

&lt;p&gt;This is the insight I wish I'd had earlier: &lt;strong&gt;a channel isn't really about data transfer. It's about ownership and aliveness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you send on a channel, you're transferring ownership of a value from the sender to the receiver. The sender no longer owns it; the receiver does. That's useful, and it's the "share memory by communicating" idea you've probably read a dozen times.&lt;/p&gt;

&lt;p&gt;But the deeper use is backpressure. A channel with capacity N means "at most N things can be in flight between these two points in the program." When it's full, the producer has to stop. That stop is the entire backpressure signal — no separate rate limiter, no token bucket, no hand-rolled semaphore. The buffer size is the concurrency bound.&lt;/p&gt;

&lt;p&gt;Once you see this, you stop thinking of channels as "fancy queues" and start thinking of them as &lt;strong&gt;structural declarations&lt;/strong&gt;: &lt;em&gt;this is how many things can be happening in this zone of my program&lt;/em&gt;. That's a very different design tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four patterns that fall out
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Bounded pool.&lt;/strong&gt; The Kafka example above. Fixed workers consume from a bounded channel. The channel is the clamp on in-flight work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fan-out, fan-in.&lt;/strong&gt; One producer, N workers, one aggregator. Each stage is a channel. The sizes of those channels are the concurrency limits between stages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  producer  →  [chan N] →  worker pool (M)  →  [chan N']  →  aggregator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
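&lt;p&gt;A minimal runnable sketch of that shape (function and stage names here are illustrative, not a library API): one producer feeds a bounded &lt;code&gt;jobs&lt;/code&gt; channel, a fixed pool of workers squares each input, and a single aggregator sums the results. The channel capacities are the concurrency limits between stages.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// fanOutFanIn: one producer, a pool of workers, one aggregator.
// The channel capacities bound how much work is in flight between stages.
func fanOutFanIn(inputs []int, workers int) int {
	jobs := make(chan int, 4)    // producer -> workers: stage-1 clamp
	results := make(chan int, 4) // workers -> aggregator: stage-2 clamp

	// Producer owns jobs, so the producer closes it.
	go func() {
		defer close(jobs)
		for _, v := range inputs {
			jobs <- v
		}
	}()

	// Fan-out: fixed worker pool consumes jobs until it is closed.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range jobs {
				results <- v * v
			}
		}()
	}

	// Close results only after every worker has finished sending.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Fan-in: a single aggregator drains results.
	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(fanOutFanIn([]int{1, 2, 3, 4}, 3)) // 1+4+9+16 = 30
}
```

&lt;p&gt;Note who closes what: the producer closes &lt;code&gt;jobs&lt;/code&gt;, and a dedicated goroutine closes &lt;code&gt;results&lt;/code&gt; only after the &lt;code&gt;WaitGroup&lt;/code&gt; confirms every sender is done.&lt;/p&gt;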



&lt;p&gt;&lt;strong&gt;Rate-limited writer.&lt;/strong&gt; Want to batch writes to a slow downstream? One channel in, one goroutine that flushes every 100 items or every 500ms, whichever comes first. The channel is the queue; the goroutine is the flush policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful shutdown signal.&lt;/strong&gt; A &lt;code&gt;chan struct{}&lt;/code&gt; closed on shutdown is a broadcast to every goroutine listening. Every place that checks &lt;code&gt;case &amp;lt;-done:&lt;/code&gt; gets the signal at the same time, for free.&lt;/p&gt;
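&lt;p&gt;The broadcast works because a receive from a closed channel returns immediately. A tiny sketch (the &lt;code&gt;broadcast&lt;/code&gt; helper is illustrative) parking N goroutines on one &lt;code&gt;done&lt;/code&gt; channel and waking them all with a single &lt;code&gt;close&lt;/code&gt;:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// broadcast parks n goroutines on a done channel, closes it once,
// and reports how many woke up: close is a one-shot broadcast.
func broadcast(n int) int {
	done := make(chan struct{})
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		stopped int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done // a receive on a closed channel returns immediately
			mu.Lock()
			stopped++
			mu.Unlock()
		}()
	}
	close(done) // one close wakes every receiver
	wg.Wait()
	return stopped
}

func main() {
	fmt.Println(broadcast(5), "goroutines unblocked by one close")
}
```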

&lt;p&gt;None of these need mutexes. Mutexes show up when you have shared mutable state that multiple goroutines &lt;em&gt;read and modify together&lt;/em&gt; — a cache, a counter, a shared map. That's different from "multiple goroutines coordinating their lifecycles," which is what channels are for.&lt;/p&gt;

&lt;p&gt;The rule of thumb I use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coordinating goroutines?&lt;/strong&gt; Channels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sharing a counter or cache?&lt;/strong&gt; &lt;code&gt;sync.Mutex&lt;/code&gt;, &lt;code&gt;sync.RWMutex&lt;/code&gt;, or &lt;code&gt;sync/atomic&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both?&lt;/strong&gt; Channels for the outer shape (who's alive, when to stop), mutex for the inner state (protected data).&lt;/li&gt;
&lt;/ul&gt;
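&lt;p&gt;The "both" case in one small sketch (names are illustrative): channels shape the pool — who's alive, when to stop — while a mutex guards the shared map the workers mutate together.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// countWords: the channel is the outer shape (a bounded job feed into a
// fixed pool); the mutex is the inner state (a shared counts map).
func countWords(words []string, workers int) map[string]int {
	jobs := make(chan string, 8) // channel: coordinates the goroutines
	counts := make(map[string]int)
	var mu sync.Mutex // mutex: protects the shared map

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for w := range jobs {
				mu.Lock()
				counts[w]++
				mu.Unlock()
			}
		}()
	}
	for _, w := range words {
		jobs <- w
	}
	close(jobs) // sender closes; workers drain and exit
	wg.Wait()
	return counts
}

func main() {
	fmt.Println(countWords([]string{"a", "b", "a"}, 4)) // map[a:2 b:1]
}
```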

&lt;h2&gt;
  
  
  Context Is the Other Half
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;chan&lt;/code&gt; defines "how many alive," &lt;code&gt;context&lt;/code&gt; defines "when to die." You already know the story if you've written any Go: &lt;code&gt;context.Context&lt;/code&gt; carries a cancellation signal, an optional deadline, and a shallow bag of request-scoped metadata. It propagates down through function calls, and when it fires, every goroutine holding it is asked to stop.&lt;/p&gt;

&lt;p&gt;What I want to emphasize is the &lt;em&gt;pairing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In the bounded-pool example above, look at where &lt;code&gt;ctx&lt;/code&gt; appears:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the worker's &lt;code&gt;select&lt;/code&gt; loop — so a worker can stop mid-wait on &lt;code&gt;&amp;lt;-jobs&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the producer's &lt;code&gt;select&lt;/code&gt; — so the producer can stop mid-wait on &lt;code&gt;jobs &amp;lt;- msg&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the call to &lt;code&gt;consumer.ReadMessage(ctx)&lt;/code&gt; — so the Kafka read unblocks immediately on shutdown.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remove any one of those and the shutdown path has a hole. With all three, &lt;code&gt;cancel()&lt;/code&gt; on the parent context makes the entire pipeline drain and stop cleanly in under a second. The channel decides the structure; the context decides the termination.&lt;/p&gt;

&lt;p&gt;Neither primitive alone is enough. A bounded channel without cancellation will keep processing until its queue drains — which can be minutes for a deep queue. A cancelled context without a bounded channel still lets you create unbounded goroutines between now and the moment everyone notices. You need both.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;chan&lt;/code&gt; draws the boundaries in space. &lt;code&gt;context&lt;/code&gt; draws the boundary in time. Together they describe the shape and the lifetime of concurrent work.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What "Structure, Not Speed" Actually Means
&lt;/h2&gt;

&lt;p&gt;Go's concurrency model is often sold as fast. Sometimes it is. Per-request throughput in well-written Go is solidly middle of the pack — beaten by Rust and C++, comparable to Java and C#. You do not pick Go because it's fast.&lt;/p&gt;

&lt;p&gt;You pick Go because &lt;em&gt;the design of a concurrent program becomes tractable&lt;/em&gt;. A senior engineer reading a goroutine-and-channels design understands what's alive and what bounds it. A junior engineer reading the same code doesn't have to know about monitors, condition variables, or lock ordering. The shape of the program is visible in the channel declarations.&lt;/p&gt;

&lt;p&gt;That's the "structure" pitch. And it works because the primitives compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bounded channels compose into pipelines with known concurrency at each stage.&lt;/li&gt;
&lt;li&gt;Contexts compose into a lifetime tree where cancelling any subtree stops everything below it.&lt;/li&gt;
&lt;li&gt;Select statements compose cancellation, timeouts, and channel operations into a single readable switch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The failure mode — the one that gave me the Kafka outage — is treating these primitives as optional utilities you reach for when a standard pattern doesn't fit. They aren't. They're the &lt;strong&gt;first-class design vocabulary&lt;/strong&gt; of concurrent Go. The moment you're writing concurrent code without thinking in channels and contexts, you've left the paved road.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Things That Matter
&lt;/h2&gt;

&lt;p&gt;A few tactical points I've learned the expensive way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always document channel ownership.&lt;/strong&gt; Who closes it? Who sends? Who receives? A closed channel panics on send. A nil channel blocks forever in select. These are easy to reason about if ownership is clear, and confusing if it isn't. I use comments right at the declaration site: &lt;code&gt;// jobs: producer sends and closes; workers receive&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close from the sender, not the receiver.&lt;/strong&gt; There's exactly one sender and it owns the lifecycle. Multiple senders? Use a separate &lt;code&gt;done&lt;/code&gt; channel or a &lt;code&gt;sync.Once&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;select&lt;/code&gt; with &lt;code&gt;default&lt;/code&gt; is not backpressure, it's drop.&lt;/strong&gt; &lt;code&gt;select { case ch &amp;lt;- x: default: }&lt;/code&gt; drops the message if the channel is full. Sometimes that's what you want (metrics sampling). Often it's a bug disguised as a performance optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unbuffered channels are rendezvous, not pipes.&lt;/strong&gt; An unbuffered send completes the instant a receiver is ready, not before. This is sometimes exactly the synchronization you want (handoff semantics) and sometimes a deadlock waiting to happen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test under load, not just logic.&lt;/strong&gt; The Kafka incident would have been caught by any realistic load test. Unit tests happily ran the &lt;code&gt;go process(msg)&lt;/code&gt; version and passed. Load is what reveals structural bugs.&lt;/li&gt;
&lt;/ul&gt;
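&lt;p&gt;The drop-vs-backpressure point is worth seeing concretely. A sketch of the non-blocking send (the &lt;code&gt;trySend&lt;/code&gt; name is illustrative): with no receiver and a full buffer, every further message is silently lost.&lt;/p&gt;

```go
package main

import "fmt"

// trySend drops x when ch is full. Fine for sampled metrics;
// silent data loss anywhere correctness matters.
func trySend(ch chan int, x int) bool {
	select {
	case ch <- x:
		return true
	default:
		return false // channel full: message dropped, no backpressure
	}
}

func main() {
	ch := make(chan int, 2) // capacity 2, and nobody is receiving
	sent := 0
	for i := 0; i < 5; i++ {
		if trySend(ch, i) {
			sent++
		}
	}
	fmt.Println("sent", sent, "dropped", 5-sent) // sent 2 dropped 3
}
```

&lt;p&gt;Swap the &lt;code&gt;default&lt;/code&gt; for a &lt;code&gt;case &amp;lt;-ctx.Done():&lt;/code&gt; and you get blocking backpressure with a clean shutdown path instead of drops.&lt;/p&gt;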

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Three years ago I'd have written "use goroutines for parallelism, channels for communication, context for cancellation" and considered that good advice. I don't think it's wrong, but it misses the point.&lt;/p&gt;

&lt;p&gt;The better framing is: &lt;strong&gt;chan and context are the two primitives for drawing boundaries around concurrent work.&lt;/strong&gt; One draws the boundary of "how many alive." The other draws "when to die." Everything else — the pools, the pipelines, the cancellation trees — is built by composing these two.&lt;/p&gt;

&lt;p&gt;A design that doesn't specify those boundaries isn't really a design. It's just code that happens to spawn goroutines. Sometimes it works. Sometimes it eats a pod's memory in forty seconds.&lt;/p&gt;

&lt;p&gt;The Kafka incident fixed itself the day we stopped writing &lt;code&gt;go process(msg)&lt;/code&gt; and started writing &lt;code&gt;jobs &amp;lt;- msg&lt;/code&gt;. The second version is longer. It's also the version that doesn't page us at 3 AM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections: User-Space Context Switching, Explained&lt;/a&gt; — why spawning goroutines is cheap in the first place. The foundation that lets you do any of this.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-context-distributed-systems-production/" rel="noopener noreferrer"&gt;Go Context in Distributed Systems: What Actually Works in Production&lt;/a&gt; — the sibling post on context propagation patterns and the &lt;code&gt;context.Background()&lt;/code&gt; trap.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;Why Your "Fail-Fast" Strategy is Killing Your Distributed System&lt;/a&gt; — a different lens on the same underlying question: what should your program do when the world gets slow?&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>channels</category>
    </item>
    <item>
      <title>Why Go Handles Millions of Connections: User-Space Context Switching, Explained</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 14 Apr 2026 21:43:03 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/why-go-handles-millions-of-connections-user-space-context-switching-explained-kf3</guid>
      <description>&lt;p&gt;Somewhere around 40,000 concurrent connections, your Java service falls over. Not from CPU, not from network — from memory, because every connection is a thread and every thread wants its own megabyte of stack. By the time you've finished Googling whether this is a &lt;code&gt;-Xss&lt;/code&gt; problem or a &lt;code&gt;ulimit&lt;/code&gt; problem, Ops has already bumped the box to 64 GB and you've pushed the wall back another 20,000 connections. Linear in RAM. It never ends.&lt;/p&gt;

&lt;p&gt;A Go service on half that box can hold 200,000 connections without noticing. People assume it's because Go is faster. It isn't. Per-request, Go and Java are roughly the same — sometimes Java wins. What Go does differently is more fundamental: &lt;strong&gt;it stops asking the kernel to help.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — High-concurrency isn't about raw CPU. It's about how cheaply you can hold an idle connection open. Go's 2KB goroutine stacks and user-space M:N scheduler push the marginal cost of a connection close to zero. The kernel only gets involved when there's real I/O to do. This is the same principle HFT engines chase with DPDK and io_uring — Go just hands it to you for free.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Mental Model
&lt;/h2&gt;

&lt;p&gt;Most engineers I talk to think "threads are expensive because threading is hard." That's not wrong, but it misses the more mechanical reason.&lt;/p&gt;

&lt;p&gt;Every time a traditional language (Java pre-Loom, C# pre-async everywhere, classic Python) parks a thread waiting for I/O, it pays two concrete costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stack memory&lt;/strong&gt;: Default JVM thread stack is 1 MB. 40,000 threads = 40 GB of stack, most of which is unused.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-switch cost&lt;/strong&gt;: When the OS swaps the thread, it traps into the kernel, saves the full register set, swaps page tables if there's an address-space change, flushes TLB entries, and walks the scheduler's runqueue. Measured on modern x86, that's &lt;strong&gt;1–5 microseconds per switch&lt;/strong&gt;, plus the less visible cost of instruction-cache pollution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Multiply that by tens of thousands of waiters and you're paying the kernel a rent that has nothing to do with your actual workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Go Does Instead
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRCCiAgICBzdWJncmFwaCBKYXZhWyJKYXZhIMK3IG9uZSB0aHJlYWQgcGVyIGNvbm5lY3Rpb24iXQogICAgICAgIEpUMVsiVGhyZWFkIDE8YnIvPnN0YWNrIOKJiCAxIE1CIl0KICAgICAgICBKVDJbIlRocmVhZCAyPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQzWyJUaHJlYWQgLi4uPGJyLz5zdGFjayDiiYggMSBNQiJdCiAgICAgICAgSlQxIC0uLT58a2VybmVsIGNvbnRleHQgc3dpdGNoPGJyLz5UTEIgZmx1c2ggwrcgcmVnIHNhdmV8IEtlcm5lbDFbKEtlcm5lbCBzY2hlZHVsZXIpXQogICAgICAgIEpUMiAtLi0-IEtlcm5lbDEKICAgICAgICBKVDMgLS4tPiBLZXJuZWwxCiAgICBlbmQKCiAgICBzdWJncmFwaCBHb1siR28gwrcgZ29yb3V0aW5lcyBvbiBhIHNtYWxsIHBvb2wgb2YgT1MgdGhyZWFkcyJdCiAgICAgICAgRzFbIkdvcm91dGluZSAxPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHMlsiR29yb3V0aW5lIDI8YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIEczWyJHb3JvdXRpbmUgLi4uPGJyLz5zdGFjayAyIEtCIl0KICAgICAgICBHNFsiR29yb3V0aW5lIE48YnIvPnN0YWNrIDIgS0IiXQogICAgICAgIFJ1bnRpbWVbIkdvIHJ1bnRpbWUgc2NoZWR1bGVyPGJyLz5NOk4gwrcgdXNlciBzcGFjZSJdCiAgICAgICAgRzEgLS0-IFJ1bnRpbWUKICAgICAgICBHMiAtLT4gUnVudGltZQogICAgICAgIEczIC0tPiBSdW50aW1lCiAgICAgICAgRzQgLS0-IFJ1bnRpbWUKICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RbIk9TIHRocmVhZCAxIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1QyWyJPUyB0aHJlYWQgLi4uIl0KICAgICAgICBSdW50aW1lIC0tPnxydW5zIG9ufCBPU1RuWyJPUyB0aHJlYWQgR09NQVhQUk9DUyJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiBoZWF2eSBmaWxsOiNmZWQ3ZDcsc3Ryb2tlOiNjNTMwMzAKICAgIGNsYXNzRGVmIGxpZ2h0IGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgSmF2YSBoZWF2eQogICAgY2xhc3MgR28gbGlnaHQ%3D" alt="JT1[" width="1545" height="548"&gt;&lt;/a&gt;stack ≈ 1 MB"]"/&amp;gt;&lt;/p&gt;

&lt;p&gt;Go's concurrency is built on an &lt;strong&gt;M:N scheduler&lt;/strong&gt;. You have many goroutines (N) multiplexed onto a small number of OS threads (M, typically &lt;code&gt;GOMAXPROCS&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Here's the part that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A goroutine starts with a &lt;strong&gt;2 KB stack&lt;/strong&gt;, not a megabyte. Growth is copy-and-resize in user space, triggered by the function prologue when it detects a near-overflow.&lt;/li&gt;
&lt;li&gt;Switching between goroutines happens &lt;strong&gt;entirely in the Go runtime&lt;/strong&gt;. No syscall. No TLB flush. No register-set save-and-restore at OS cost. Roughly a couple hundred nanoseconds in microbenchmarks — an order of magnitude cheaper than an OS-level context switch. The exact number moves around with workload, scheduler contention, and Go version; what's stable is the order of magnitude.&lt;/li&gt;
&lt;li&gt;When a goroutine blocks on network I/O, the runtime parks it and flips the underlying OS thread to run a different goroutine. The goroutine's state lives in Go's own scheduler, not in a kernel wait queue.&lt;/li&gt;
&lt;/ul&gt;
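&lt;p&gt;You can get a rough feel for that switch cost yourself by bouncing a token between two goroutines over unbuffered channels — every hop forces a goroutine handoff inside the runtime. The exact figure depends on your machine and Go version; only the order of magnitude is stable, which is why no expected number is printed here. A sketch:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"time"
)

// pingPong bounces a token between two goroutines n times over
// unbuffered channels; each hop is a user-space goroutine switch.
func pingPong(n int) time.Duration {
	ping, pong := make(chan struct{}), make(chan struct{})
	go func() {
		for i := 0; i < n; i++ {
			<-ping
			pong <- struct{}{}
		}
	}()
	start := time.Now()
	for i := 0; i < n; i++ {
		ping <- struct{}{} // handoff to the other goroutine
		<-pong             // and back again
	}
	return time.Since(start)
}

func main() {
	const n = 1_000_000
	d := pingPong(n)
	// Each round trip is two handoffs; the per-switch cost is
	// machine-dependent, typically in the hundreds of nanoseconds.
	fmt.Printf("%d round trips in %v (%.0f ns/switch)\n",
		n, d, float64(d.Nanoseconds())/float64(2*n))
}
```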

&lt;p&gt;This is the actual answer to "why Go scales to millions of connections": &lt;strong&gt;the runtime refuses to hand idle work back to the kernel&lt;/strong&gt;. The kernel still does the real I/O — Go uses &lt;code&gt;epoll&lt;/code&gt; on Linux, &lt;code&gt;kqueue&lt;/code&gt; on BSD, IOCP on Windows — but it only involves the kernel when there's &lt;em&gt;actual&lt;/em&gt; work, not when a goroutine is just sitting around.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Small Benchmark That Tells the Whole Story
&lt;/h2&gt;

&lt;p&gt;Here's a stripped-down Go program that spins up N goroutines, each one holds a channel read, and prints the total RSS when they're all parked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"runtime"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
    &lt;span class="s"&gt;"syscall"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sscanf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="c"&gt;// park forever&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Let the runtime settle&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GC&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rusage&lt;/span&gt;
    &lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getrusage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syscall&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUSAGE_SELF&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"goroutines=%d  rss=%d KB  (%.1f KB/goroutine)&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Maxrss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On my laptop (M1, Go 1.22, macOS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;goroutines=10000    rss=28672 KB   (2.9 KB/goroutine)
goroutines=100000   rss=263168 KB  (2.6 KB/goroutine)
goroutines=1000000  rss=2600960 KB (2.6 KB/goroutine)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.6 KB per parked goroutine&lt;/strong&gt;, flat, all the way to a million. That's the story. Not 1 MB. Not 256 KB. Two and a half KB.&lt;/p&gt;

&lt;p&gt;Try the equivalent program with Java platform threads — &lt;code&gt;new Thread(() -&amp;gt; ...).start()&lt;/code&gt; — and you will run out of memory well before 100,000. The comparison isn't even close, and it isn't about execution speed — it's about what an idle waiter costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel in Finance: Same Problem, Opposite Extreme
&lt;/h2&gt;

&lt;p&gt;The part that made this click for me is noticing where else this principle shows up. High-frequency trading engines and exchange colocation boxes have the same bottleneck — kernel context switches are expensive — and they solve it the other way: &lt;strong&gt;skip the kernel entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DPDK&lt;/strong&gt; gives userspace direct access to the NIC. Packets bypass the kernel network stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel-bypass sockets&lt;/strong&gt; (Solarflare Onload, AWS Nitro enhanced networking) push the TCP/IP stack into userspace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;io_uring&lt;/strong&gt; on modern Linux brings the same idea to general-purpose code — a shared memory ring buffer between app and kernel, batched, with minimal syscalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDMA&lt;/strong&gt; lets network cards write directly into another machine's memory. No kernel on either end.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different tools, same target: &lt;strong&gt;syscalls and context switches are expensive; keep them off the hot path&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Go arrives at the same destination with a completely different route. Instead of bypassing the kernel, it hides the kernel behind a user-space scheduler and only calls in when absolutely necessary. HFT says "the kernel is slow, route around it." Go says "the kernel is slow, so we'll handle most of the state ourselves and only ring the kernel's doorbell when we have real work." The principle is identical.&lt;/p&gt;

&lt;p&gt;Once you see this pattern, you start seeing it everywhere. V8 Isolates. Erlang processes. Rust async runtimes. The details differ but the bet is the same: &lt;strong&gt;keep concurrency cheap by keeping it out of the kernel&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Go Actually Breaks Under Load
&lt;/h2&gt;

&lt;p&gt;None of this means Go scales forever. When I've seen Go services crack at scale, it's usually not the runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File descriptors&lt;/strong&gt;: Default &lt;code&gt;ulimit -n&lt;/code&gt; is 1024 on most systems. You'll hit this before you stress the scheduler. Push it to 1M if you're actually building a long-poll service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral ports&lt;/strong&gt;: If your service fans out to a downstream with lots of short-lived outbound connections, the 28K-ish default ephemeral port range bites before anything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conntrack tables&lt;/strong&gt;: Linux's &lt;code&gt;nf_conntrack_max&lt;/code&gt; default is laughably small for a real service. Tune it or turn it off on high-throughput paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC pressure from allocation-heavy handlers&lt;/strong&gt;: The scheduler is cheap; the garbage collector is not. &lt;code&gt;sync.Pool&lt;/code&gt;, stack-allocated buffers, and careful escape analysis still matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The load balancer&lt;/strong&gt;: Your L4/L7 LB probably caps out before Go does.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've watched a Go service sit happily at 400K connections on a single pod while the upstream Envoy bled under its own CPU budget. The Go process was the calm one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Concurrency Isn't a Speed Contest
&lt;/h2&gt;

&lt;p&gt;It's a cost-of-idleness contest.&lt;/p&gt;

&lt;p&gt;If you're building anything with long-lived connections — streaming APIs, WebSocket fan-out, server-sent events, message brokers, pub/sub gateways, anything with more connections than cores — the question isn't "is my language fast?" It's "&lt;strong&gt;how much does one idle waiter cost me?&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;Go's answer is 2.6 KB and 200 nanoseconds. That's why it scales.&lt;/p&gt;

&lt;p&gt;If you come from a world where "high concurrency" means "we bought a bigger box," Go can feel like cheating. It isn't. It's just a careful, decade-old design decision that says: the kernel is a system call you should make as rarely as possible, and when you must, do it in bulk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://golang.org/src/runtime/HACKING.md" rel="noopener noreferrer"&gt;The Go runtime's &lt;code&gt;HACKING.md&lt;/code&gt;&lt;/a&gt; — runtime internals from the Go team; pair it with Dmitry Vyukov's "Scalable Go Scheduler Design Doc" for the scheduler's rationale&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime/proc.go&lt;/code&gt; in the Go source tree — the actual M/P/G logic, shorter and more readable than you'd expect&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://people.freebsd.org/~jlemon/papers/kqueue.pdf" rel="noopener noreferrer"&gt;FreeBSD's &lt;code&gt;kqueue&lt;/code&gt; paper (Jonathan Lemon)&lt;/a&gt; — where &lt;code&gt;epoll&lt;/code&gt; got many of its ideas&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kernel.dk/io_uring.pdf" rel="noopener noreferrer"&gt;io_uring introduction (Jens Axboe)&lt;/a&gt; — the modern-kernel answer to the same problem Go solved in user space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to understand why the decade-old Go scheduler still holds up, read &lt;code&gt;runtime/proc.go&lt;/code&gt; once. The comments alone are worth an afternoon.&lt;/p&gt;

</description>
      <category>go</category>
      <category>concurrency</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:40 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-4-why-it-uses-markdown-files-instead-of-vector-dbs-1hf6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 4 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article replaces and deepens our earlier analysis, &lt;a href="https://harrisonsec.com/blog/claude-code-memory-simpler-than-you-think/" rel="noopener noreferrer"&gt;Claude Code's Memory Is Simpler Than You Think&lt;/a&gt;. The original focused on limitations. This one focuses on the &lt;strong&gt;why&lt;/strong&gt; — the first-principles tradeoffs behind every design choice.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Principle: Only Record What Cannot Be Derived
&lt;/h2&gt;

&lt;p&gt;This single constraint governs every decision in Claude Code's memory system:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't save code patterns — read the current code. Don't save git history — run &lt;code&gt;git log&lt;/code&gt;. Don't save file paths — glob the project. Don't save past bug fixes — they're in commits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't about saving storage. It's about &lt;strong&gt;preventing memory drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a memory says "auth module lives in &lt;code&gt;src/auth/&lt;/code&gt;", one refactor makes that memory a lie. But the model doesn't know it's a lie — it trusts specific references by default. A stale memory is worse than no memory at all, because the model acts on it with confidence.&lt;/p&gt;

&lt;p&gt;Code is self-describing. The source of truth is always the current state of the project, not a snapshot from three weeks ago. Memory should store &lt;strong&gt;meta-information&lt;/strong&gt; — who the user is, what they prefer, what decisions were made and why — not facts that the codebase already expresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Types, Closed Taxonomy
&lt;/h2&gt;

&lt;p&gt;Claude Code enforces exactly four memory types. Not tags. Not categories. Four types with hard boundaries:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What to Store&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;user&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Identity, preferences, expertise&lt;/td&gt;
&lt;td&gt;"Data scientist, focused on observability"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;feedback&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Behavioral corrections AND confirmations&lt;/td&gt;
&lt;td&gt;"Don't summarize after code changes — user reads diffs"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Decisions, deadlines, stakeholder context&lt;/td&gt;
&lt;td&gt;"Merge freeze after 2026-03-05 for mobile release"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;reference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pointers to external systems&lt;/td&gt;
&lt;td&gt;"Pipeline bugs tracked in Linear INGEST project"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why closed taxonomy beats open tagging:&lt;/strong&gt; Free-form tags cause label explosion. A model tagging memories freely might produce "coding-style", "code-style", "style-preference", "formatting" — four labels for the same concept. Closed taxonomy forces an explicit semantic choice. Each type has different storage structure (feedback requires &lt;code&gt;Why&lt;/code&gt; + &lt;code&gt;How to apply&lt;/code&gt; fields) and different retrieval behavior. The constraint buys clarity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Positive Feedback Matters More Than Corrections
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;feedback&lt;/code&gt; type stores both failures AND successes. The source code explains why:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If you only save corrections, you will avoid past mistakes but drift away from approaches the user has already validated, and may grow overly cautious."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Imagine the user says "this code style is great, keep doing this." If you don't save that, next session the model might "improve" the style — moving away from what the user explicitly liked. Positive feedback &lt;strong&gt;anchors&lt;/strong&gt; the model to known-good patterns. Without anchors, corrections alone push the model toward progressively safer (blander) output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Type: Relative Dates Kill You
&lt;/h3&gt;

&lt;p&gt;When a user says "merge freeze after Thursday", the memory must store "merge freeze after 2026-03-05." A memory read three weeks later has no idea what "Thursday" meant. This seems obvious, but it's an explicit rule in the source code because models default to storing user language verbatim.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sonnet Side-Query Instead of Vector Embeddings
&lt;/h2&gt;

&lt;p&gt;This is the design choice that draws the most criticism. Claude Code uses a live LLM call (Sonnet) to pick relevant memories instead of vector similarity search. Here's the actual tradeoff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBRdWVyeVsiVXNlciBxdWVyeSJdIC0tPiBTY2FuWyJTY2FuIG1lbW9yeSBkaXJcblJlYWQgZnJvbnRtYXR0ZXIgb25seVxuTWF4IDIwMCBmaWxlcyJdCiAgICBTY2FuIC0tPiBNYW5pZmVzdFsiRm9ybWF0IG1hbmlmZXN0XG50eXBlICsgZmlsZW5hbWUgKyB0aW1lc3RhbXBcbisgZGVzY3JpcHRpb24iXQogICAgTWFuaWZlc3QgLS0-IFNvbm5ldFsiU29ubmV0IHNpZGUtcXVlcnlcbn4yNTBtcywgMjU2IHRva2Vuc1xuU2VsZWN0IHRvcCA1Il0KICAgIFNvbm5ldCAtLT4gRmlsdGVyWyJEZWR1cGxpY2F0ZVxuUmVtb3ZlIGFscmVhZHktc3VyZmFjZWQiXQogICAgRmlsdGVyIC0tPiBJbmplY3RbIkluamVjdCBhcyBzeXN0ZW0tcmVtaW5kZXJcbldpdGggZnJlc2huZXNzIHdhcm5pbmciXQ%3D%3D" alt="flowchart LR" width="1704" height="118"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sonnet reads descriptions (not full content), evaluates semantic relevance, and returns up to 5 filenames. The call costs ~250ms and 256 output tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this over vector embeddings:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Sonnet Side-Query&lt;/th&gt;
&lt;th&gt;Vector Embeddings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Semantic depth&lt;/td&gt;
&lt;td&gt;Full language understanding — "deployment" matches "CI/CD"&lt;/td&gt;
&lt;td&gt;Cosine similarity — good but shallow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Zero — one API call&lt;/td&gt;
&lt;td&gt;Requires embedding model + vector store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transparency&lt;/td&gt;
&lt;td&gt;Can inspect WHY a memory was selected&lt;/td&gt;
&lt;td&gt;Opaque similarity scores&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per query&lt;/td&gt;
&lt;td&gt;~250ms + 256 tokens (shared prompt cache)&lt;/td&gt;
&lt;td&gt;Embedding call + search latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Degrades past ~200 files&lt;/td&gt;
&lt;td&gt;Scales to millions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff is deliberate: for a &lt;strong&gt;session-based CLI tool&lt;/strong&gt; where users typically have 20-100 memories, Sonnet's semantic understanding beats vector search's scale. The 250ms latency is hidden entirely through &lt;strong&gt;async prefetch&lt;/strong&gt; — the search runs in parallel while the model generates its response. For the user, memory recall is "free."&lt;/p&gt;

&lt;h3&gt;
  
  
  The 5-File Cap: Constraint as Design
&lt;/h3&gt;

&lt;p&gt;Why limit to 5 memories when a user might have 200?&lt;/p&gt;

&lt;p&gt;This is not a technical limitation. It's a &lt;strong&gt;behavioral nudge&lt;/strong&gt;. If the system scaled to inject 50 memories, users would never clean up stale ones. The 5-file cap pushes users to write better descriptions (so the right memories get selected) and consolidate outdated entries (so slots aren't wasted on stale info).&lt;/p&gt;

&lt;p&gt;Design principle: &lt;strong&gt;constraints that change user behavior beat constraints that scale infrastructure.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background Extraction: The Invisible Agent
&lt;/h2&gt;

&lt;p&gt;Claude Code doesn't just save memories when you say &lt;code&gt;/remember&lt;/code&gt;. After every conversation turn where the main agent stops (no more tool calls), a &lt;strong&gt;forked background agent&lt;/strong&gt; runs to extract memorable information.&lt;/p&gt;

&lt;p&gt;Key design details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutual exclusion&lt;/strong&gt;: If the main agent already wrote a memory in this turn, the extractor skips. No duplicate memories from the same conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trailing runs&lt;/strong&gt;: If extraction is still running when the next turn ends, the new request queues as &lt;code&gt;pendingContext&lt;/code&gt;. When the current extraction finishes, it picks up the pending work. No concurrent writes to the memory directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5-turn hard deadline&lt;/strong&gt;: The extractor gets at most 5 tool-call turns. Efficiency over completeness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal permissions&lt;/strong&gt;: Read/Grep/Glob unlimited. Write &lt;strong&gt;only&lt;/strong&gt; to the memory directory. Cannot modify project files, execute code, or call external services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared prompt cache&lt;/strong&gt;: The forked agent reuses the parent's cached system prompt — near-zero additional token overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The execution strategy is prescribed in the prompt: "Turn 1: parallel reads of all existing memories. Turn 2: parallel writes of new memories." Two turns for the common case. The 5-turn budget handles edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trust but Verify: The Eval That Proved It
&lt;/h2&gt;

&lt;p&gt;The most impactful section in Claude Code's memory prompt is &lt;code&gt;TRUSTING_RECALL_SECTION&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A memory that names a specific function, file, or flag is a claim that it existed &lt;em&gt;when the memory was written&lt;/em&gt;. It may have been renamed, removed, or never merged."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rule: before acting on a memory that references a file path, verify the file exists (Glob). Before trusting a function name, confirm it's still there (Grep).&lt;/p&gt;

&lt;p&gt;This section's value was proven empirically: &lt;strong&gt;without it, eval pass rate was 0/2. With it, 3/3.&lt;/strong&gt; Models default to trusting specific references in memory. They'll confidently say "as stored in memory, the auth module is at &lt;code&gt;src/auth/&lt;/code&gt;" — even when that path was renamed weeks ago. The verification requirement breaks this default behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydwixv55gmz3gnjza4wg.png" alt="Three Architectures, Three Tradeoffs" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architectures, Three Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; a ranking. I'm using OpenClaw and Hermes as contrasts because they represent the two obvious alternative bets: scale and autonomy. Claude Code, OpenClaw, and Hermes Agent made different choices for different deployment models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Markdown files (flat)&lt;/td&gt;
&lt;td&gt;MD + SQLite (FTS + vector)&lt;/td&gt;
&lt;td&gt;SQLite + FTS + MEMORY.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet side-query (semantic)&lt;/td&gt;
&lt;td&gt;Embedding cosine + FTS fusion&lt;/td&gt;
&lt;td&gt;Full-text search + structured queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero (filesystem only)&lt;/td&gt;
&lt;td&gt;SQLite + embedding model&lt;/td&gt;
&lt;td&gt;SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transparency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full (plain text, human-readable)&lt;/td&gt;
&lt;td&gt;Partial (vector scores opaque)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None (static after write)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Self-evolving (auto-generates skills)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Session-based, stateless between sessions&lt;/td&gt;
&lt;td&gt;Persistent, cross-session&lt;/td&gt;
&lt;td&gt;Persistent, self-improving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale ceiling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200 files by design&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;td&gt;Scales with SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Claude Code's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Optimize for zero infrastructure and full transparency. Accept a scale ceiling.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For a CLI tool that runs on a developer's laptop, requiring SQLite or an embedding service is friction. Plain Markdown files are human-readable, git-trackable, and editable with any text editor. The 200-file ceiling is intentional — if you need more, you should be consolidating, not scaling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Teams with hundreds of shared memories. Long-running projects where memory accumulation outpaces cleanup. Multi-user scenarios where memory needs to be queried across team members.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenClaw's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept infrastructure overhead for persistent cross-session scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw stores memories in SQLite with both full-text search and vector embeddings. This enables fuzzy semantic matching across thousands of memories, weighted fusion of multiple retrieval signals, and persistent state that survives across sessions indefinitely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Setup complexity. Users must configure embedding models. Vector similarity scores are opaque — when the wrong memory is recalled, debugging why is harder than inspecting a Sonnet side-query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hermes Agent's Bet
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Accept complexity for a self-evolving learning loop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Hermes doesn't just store memories — it generates &lt;strong&gt;skills&lt;/strong&gt; from completed tasks. After a complex task (5+ tool calls), the agent distills the entire process into a structured skill document. Next time it encounters a similar task, it loads the skill instead of solving from scratch. Skills self-iterate: if the agent finds a better approach during execution, it updates the skill automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When this breaks:&lt;/strong&gt; Skill quality is unverified. A bad skill propagated through the learning loop compounds errors. The self-evolving mechanism needs guardrails that don't exist yet — there's no eval framework for auto-generated skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Right Choice Depends on Your Deployment Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Session-based, single user, zero setup → Claude Code's approach
Persistent, multi-user, cross-session  → OpenClaw's approach  
Autonomous, self-improving, research    → Hermes's approach
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no universal "best." The first-principles question is: &lt;strong&gt;what are you optimizing for — simplicity, scale, or autonomy?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Teaches About Agent Design
&lt;/h2&gt;

&lt;p&gt;Three principles that transfer beyond memory systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Constraints that change user behavior &amp;gt; constraints that scale infrastructure.&lt;/strong&gt; The 5-file cap is more effective than unlimited vector search, because it forces better memory hygiene. Don't build capacity for a mess — design incentives for cleanliness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eval data beats intuition for prompt engineering.&lt;/strong&gt; The trust-verification section wasn't added because someone thought it was a good idea. It was added because evals went from 0/2 to 3/3. If you can't measure it, you're guessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use the model's own reasoning for retrieval when latency allows.&lt;/strong&gt; Sonnet understanding that "deployment" relates to "CI/CD" is something no keyword match or embedding similarity can reliably do. When your retrieval budget allows a model call, the quality ceiling is higher than any static index.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Part 3: Context Engineering — 5-Level Compression Pipeline&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See also: &lt;a href="https://harrisonsec.com/blog/claude-code-codex-plugin-two-brains/" rel="noopener noreferrer"&gt;Claude Code + Codex: Two Brains&lt;/a&gt; for how dual-AI workflows complement the memory system.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>memory</category>
      <category>agents</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 3: The 5-Level Compression Pipeline Behind 1M Tokens</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Wed, 08 Apr 2026 05:19:22 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-3-the-5-level-compression-pipeline-behind-200k-tokens-4im1</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 3 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory Tradeoffs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Context Engineering Is the Real Moat
&lt;/h2&gt;

&lt;p&gt;Every AI agent has the same fundamental constraint: a fixed-size context window. Claude's is now up to 1M tokens. That sounds massive — until you realize a real coding session can easily generate multiples of that. Dozens of file reads, hundreds of tool calls, thousands of lines of output.&lt;/p&gt;

&lt;p&gt;The model's decision quality depends entirely on what it sees. Get the tradeoff wrong, and it forgets which files it just edited, re-reads content it already saw, or contradicts its own earlier decisions.&lt;/p&gt;

&lt;p&gt;Think of the context window as an office desk. Limited surface area. You need the most important documents within arm's reach, everything else filed in drawers — retrievable, but not cluttering your workspace.&lt;/p&gt;

&lt;p&gt;Claude Code's context engineering is that filing system. And it's far more sophisticated than most people expect. In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we covered the 4-stage compression overview as part of the loop's survival mechanism. Here, we zoom into the internal engineering — revealing a 5th level most sessions never trigger, a dual-path algorithm that adapts to cache state, and a security blind spot in the summarizer.&lt;/p&gt;

&lt;p&gt;The compression pipeline alone lives in &lt;code&gt;src/services/compact/&lt;/code&gt; — over 3,960 lines of TypeScript across 5 files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-Level Compression Pipeline
&lt;/h2&gt;

&lt;p&gt;The design philosophy is &lt;strong&gt;progressive compression&lt;/strong&gt;: cheapest first, heaviest last. Each level is more expensive than the previous one — consuming more compute or discarding more context detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb8d9fzli4uyblkyxux3e.png" alt="The 5-Level Compression Pipeline" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBJbnB1dFsiTWVzc2FnZSBIaXN0b3J5Il0gLS0-IEwxWyJMZXZlbCAxOiBUb29sIFJlc3VsdCBCdWRnZXRcbjUwSyBjaGFyIHRocmVzaG9sZCDihpIgZGlzayArIDJLQiBwcmV2aWV3XG7wn5KwIENvc3Q6IFplcm8iXQogICAgTDEgLS0-IEwyWyJMZXZlbCAyOiBIaXN0b3J5IFNuaXBcbkZlYXR1cmUtZ2F0ZWQgdG9rZW4gcmVsZWFzZVxu8J-SsCBDb3N0OiBaZXJvIl0KICAgIEwyIC0tPiBMM1siTGV2ZWwgMzogTWljcm9jb21wYWN0XG5EdWFsIHBhdGg6IHRpbWUtYmFzZWQgT1IgY2FjaGUtZWRpdFxu8J-SsCBDb3N0OiBaZXJvIEFQSSBjYWxscyJdCiAgICBMMyAtLT4gTDRbIkxldmVsIDQ6IENvbnRleHQgQ29sbGFwc2VcblByb2plY3Rpb24tYmFzZWQgZm9sZGluZyB-OTAlXG7wn5KwIENvc3Q6IFplcm8gKG5vbi1kZXN0cnVjdGl2ZSkiXQogICAgTDQgLS0-IEw1WyJMZXZlbCA1OiBBdXRvY29tcGFjdFxuRm9yayBjaGlsZCBhZ2VudCBmb3IgZnVsbCBzdW1tYXJ5XG7wn5KwIENvc3Q6IE9uZSBBUEkgY2FsbCAoaXJyZXZlcnNpYmxlKSJd" alt="flowchart TD" width="276" height="950"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most conversations never reach Level 5. That's the point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1 — Tool Result Budget (Zero Cost)
&lt;/h3&gt;

&lt;p&gt;Problem: A single &lt;code&gt;FileReadTool&lt;/code&gt; call on a 10,000-line file dumps the entire thing into context. A &lt;code&gt;BashTool&lt;/code&gt; running &lt;code&gt;find&lt;/code&gt; returns thousands of paths.&lt;/p&gt;

&lt;p&gt;Solution: When a tool result exceeds 50,000 characters (&lt;code&gt;DEFAULT_MAX_RESULT_SIZE_CHARS&lt;/code&gt;), Claude Code doesn't truncate it — it &lt;strong&gt;persists the full output to disk&lt;/strong&gt; and keeps only a 2KB preview in context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;persisted-output&amp;gt;&lt;/span&gt;
Output too large (2.3 MB). Full output saved to:
/tmp/.claude/session-xxx/tool-results/toolu_abc123.txt

Preview (first 2.0 KB):
[first 2000 bytes of content]
...
&lt;span class="nt"&gt;&amp;lt;/persisted-output&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why persist instead of truncate? Truncation means permanent loss. If the model later needs line 500 of that output — maybe that's where the bug is — it can use the &lt;code&gt;Read&lt;/code&gt; tool to access the full file from disk. The 2KB preview gives enough context to decide whether that's necessary.&lt;/p&gt;
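&lt;p&gt;A minimal sketch of that budgeting step — the constant names and the 50K/2KB figures come from the article, but the function shape and wrapper text are assumptions, not the actual source:&lt;/p&gt;

```typescript
// Illustrative sketch of persist-instead-of-truncate; not the real implementation.
const DEFAULT_MAX_RESULT_SIZE_CHARS = 50_000;
const PREVIEW_SIZE_CHARS = 2_000;

interface BudgetedResult {
  inContext: string;  // what stays in the context window
  diskPath?: string;  // where the full output lives, if persisted
}

function budgetToolResult(
  output: string,
  persist: (content: string) => string, // writes full output to disk, returns its path
): BudgetedResult {
  if (output.length <= DEFAULT_MAX_RESULT_SIZE_CHARS) {
    return { inContext: output }; // small results pass through untouched
  }
  const diskPath = persist(output);
  const preview = output.slice(0, PREVIEW_SIZE_CHARS);
  return {
    inContext:
      `<persisted-output>\nOutput too large. Full output saved to:\n${diskPath}\n\n` +
      `Preview (first 2.0 KB):\n${preview}\n...\n</persisted-output>`,
    diskPath,
  };
}
```

&lt;p&gt;The preview plus path costs roughly 2KB of context regardless of the original size, while the &lt;code&gt;Read&lt;/code&gt; tool keeps the full output one call away.&lt;/p&gt;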

&lt;h3&gt;
  
  
  Level 2 — History Snip
&lt;/h3&gt;

&lt;p&gt;Think of History Snip as garbage collection for stale conversation scaffolding. If the session contains repetitive assistant wrappers, redundant bookkeeping, or older spans that no longer affect the next decision, this layer can cut them before heavier compression starts.&lt;/p&gt;

&lt;p&gt;Its real importance is accounting correctness. It feeds &lt;code&gt;snipTokensFreed&lt;/code&gt; into the autocompact threshold calculation. Without that correction, the last assistant message's &lt;code&gt;usage&lt;/code&gt; data still reflects the pre-snip context size, so autocompact can fire even after tokens were already freed.&lt;/p&gt;
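&lt;p&gt;The correction is easy to sketch. Everything here except &lt;code&gt;snipTokensFreed&lt;/code&gt; is an illustrative assumption rather than the real source:&lt;/p&gt;

```typescript
// Why the snip correction matters: the last usage report predates the snip,
// so the freed tokens must be subtracted before comparing to the threshold.
interface UsageAnchor {
  inputTokens: number; // server-reported context size at the last API response
}

function shouldAutocompact(
  anchor: UsageAnchor,
  snipTokensFreed: number, // tokens released by History Snip since that response
  thresholdTokens: number,
): boolean {
  const effectiveTokens = anchor.inputTokens - snipTokensFreed;
  return effectiveTokens >= thresholdTokens;
}
```

&lt;p&gt;Without the subtraction, the stale usage figure can sit above the threshold and fire an expensive autocompact for tokens that were already released.&lt;/p&gt;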

&lt;h3&gt;
  
  
  Level 3 — Microcompact (The Dual-Path Design)
&lt;/h3&gt;

&lt;p&gt;This is where it gets clever. Microcompact cleans up old tool results that are no longer useful — that file you read 30 minutes ago is probably irrelevant now, but it's still eating thousands of tokens.&lt;/p&gt;

&lt;p&gt;The twist: &lt;strong&gt;Microcompact has two completely different code paths&lt;/strong&gt;, selected based on cache state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path A — Cache Cold (Time-Based)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user was away long enough for the prompt cache to expire (default 5-minute TTL), the cache is already dead. Rebuilding is inevitable. So Microcompact goes ahead and &lt;strong&gt;directly modifies message content&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// microCompact.ts — cold path&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[Old tool result content cleared]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, brutal, effective. Keep only the N most recent compactable tool results, replace everything else with a placeholder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path B — Cache Hot (Cache-Editing)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the user is actively chatting and the prompt cache is warm — holding 100K+ tokens of cached prefix — directly modifying messages would &lt;strong&gt;invalidate the entire cache&lt;/strong&gt;. That's a massive cost hit.&lt;/p&gt;

&lt;p&gt;Instead, the hot path uses an API-level mechanism called &lt;code&gt;cache_edits&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag tool result blocks with &lt;code&gt;cache_reference: tool_use_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Construct &lt;code&gt;cache_edits&lt;/code&gt; blocks telling the server to delete those references in-place&lt;/li&gt;
&lt;li&gt;Server-side deletion preserves cache warmth — no client re-upload needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The messages themselves are returned &lt;strong&gt;unchanged&lt;/strong&gt;. The edit happens at the API layer, invisible to the local conversation state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Time-Based (Cold)&lt;/th&gt;
&lt;th&gt;Cache-Edit (Hot)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time gap exceeds threshold&lt;/td&gt;
&lt;td&gt;Tool count exceeds threshold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct message modification&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cache_edits&lt;/code&gt; API blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cache Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache rebuilds anyway&lt;/td&gt;
&lt;td&gt;Preserves 100K+ cached prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero (edits piggyback on next request)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two paths are mutually exclusive. Time-based takes priority — if the cache is already cold, using &lt;code&gt;cache_edits&lt;/code&gt; is pointless.&lt;/p&gt;
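&lt;p&gt;The selection logic reduces to a few lines. The 5-minute TTL comes from the article; the tool-count threshold is a placeholder, not a value from the source:&lt;/p&gt;

```typescript
// Sketch of the mutually exclusive path selection (illustrative, not the real code).
type MicrocompactPath = "time-based" | "cache-edit" | "none";

function selectMicrocompactPath(
  msSinceLastRequest: number,
  compactableToolResults: number,
  cacheTtlMs = 5 * 60 * 1000, // default prompt-cache TTL from the article
  toolCountThreshold = 20,    // placeholder value, not from the source
): MicrocompactPath {
  // Time-based takes priority: once the cache has expired,
  // server-side cache_edits would have nothing left to preserve.
  if (msSinceLastRequest > cacheTtlMs) return "time-based";
  if (compactableToolResults > toolCountThreshold) return "cache-edit";
  return "none";
}
```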

&lt;h3&gt;
  
  
  Level 4 — Context Collapse (Non-Destructive)
&lt;/h3&gt;

&lt;p&gt;Think of this as a database &lt;strong&gt;View&lt;/strong&gt; — the underlying table (message array) stays unchanged, but queries (API requests) see a filtered, summarized projection.&lt;/p&gt;

&lt;p&gt;Context Collapse triggers at ~90% utilization. Unlike autocompact, it's &lt;strong&gt;reversible&lt;/strong&gt; — original messages are never deleted, and the collapse can be rolled back if needed. The summaries live in a separate collapse store, and &lt;code&gt;projectView()&lt;/code&gt; overlays them onto the original messages at query time.&lt;/p&gt;
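&lt;p&gt;The View analogy maps onto a small projection function. The name &lt;code&gt;projectView()&lt;/code&gt; comes from the article; the message and store shapes are assumptions:&lt;/p&gt;

```typescript
// Non-destructive collapse: originals stay in the array,
// summaries overlay them only at query time.
interface Message {
  id: string;
  content: string;
}

function projectView(
  messages: Message[],
  collapseStore: Map<string, string>, // message id -> collapsed summary
): Message[] {
  return messages.map((m) =>
    collapseStore.has(m.id) ? { ...m, content: collapseStore.get(m.id)! } : m,
  );
}
```

&lt;p&gt;Rolling a collapse back is just clearing entries from the store — the originals were never touched.&lt;/p&gt;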

&lt;p&gt;Critical interaction: when Context Collapse is active, &lt;strong&gt;Autocompact is suppressed&lt;/strong&gt;. Both compete for the same token space — autocompact at ~87%, collapse at ~90% — and autocompact would destroy the fine-grained context that collapse is trying to preserve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 5 — Autocompact (The Last Resort)
&lt;/h3&gt;

&lt;p&gt;When everything else fails to keep tokens under control, the system forks a child agent to summarize the entire conversation. This is expensive and irreversible.&lt;/p&gt;

&lt;p&gt;The compression prompt uses a two-phase &lt;strong&gt;Chain-of-Thought Scratchpad&lt;/strong&gt; technique:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — the model walks through every message chronologically: user intent, approaches taken, key decisions, filenames, code snippets, errors, fixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt; block&lt;/strong&gt; — a structured summary with 9 standardized sections (Primary Request, Key Technical Concepts, Files and Code, Errors and Fixes, Problem Solving, All User Messages, Pending Tasks, Current Work, Optional Next Step)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical design: &lt;code&gt;formatCompactSummary()&lt;/code&gt; &lt;strong&gt;strips the &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; block&lt;/strong&gt; and keeps only the &lt;code&gt;&amp;lt;summary&amp;gt;&lt;/code&gt;. Chain-of-thought reasoning improves summary quality dramatically, but the reasoning itself would waste tokens if kept in context. Discard the work, keep the conclusion.&lt;/p&gt;
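&lt;p&gt;A simplified take on that stripping step — the real &lt;code&gt;formatCompactSummary()&lt;/code&gt; surely does more, but this shows the discard-the-work, keep-the-conclusion move:&lt;/p&gt;

```typescript
// Keep only the <summary> block; the <analysis> scratchpad improved the
// summary's quality but would waste tokens if retained in context.
function formatCompactSummary(modelOutput: string): string {
  const match = modelOutput.match(/<summary>([\s\S]*?)<\/summary>/);
  return match ? match[1].trim() : modelOutput.trim(); // fall back to raw output
}
```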

&lt;p&gt;&lt;strong&gt;Post-Compression Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autocompact's biggest risk: the model "forgets" files it just edited. The system automatically runs &lt;code&gt;runPostCompactCleanup()&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restore last 5 recently-read files (≤5K tokens each)&lt;/li&gt;
&lt;li&gt;Restore all activated skills (≤25K tokens total)&lt;/li&gt;
&lt;li&gt;Re-announce deferred tools, agent lists, MCP directives&lt;/li&gt;
&lt;li&gt;Reset Context Collapse state&lt;/li&gt;
&lt;li&gt;Restore Plan mode state if active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this recovery step, the model would start re-reading files it just edited — or worse, make contradictory changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Circuit Breaker Story&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On March 10, 2026, Anthropic's telemetry showed 1,279 sessions with 50+ consecutive autocompact failures. The worst session hit 3,272 consecutive failures. Globally, this wasted approximately 250,000 API calls per day.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, we mentioned the circuit breaker as a single boolean (&lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt;). Here's the production story behind it.&lt;/p&gt;

&lt;p&gt;The fix was three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 3 consecutive failures, stop trying. The context is irrecoverably over-limit — burning more API calls won't help. This is a textbook circuit breaker: detect a failure loop, break it early, fail gracefully.&lt;/p&gt;
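&lt;p&gt;The state machine around that constant is nearly as small as the constant itself. This is a generic circuit-breaker sketch, not the actual wiring:&lt;/p&gt;

```typescript
// Textbook circuit breaker: count consecutive failures,
// open after the limit, reset on any success.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

class AutocompactBreaker {
  private consecutiveFailures = 0;

  canAttempt(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
  }
  recordFailure(): void {
    this.consecutiveFailures += 1;
  }
  recordSuccess(): void {
    this.consecutiveFailures = 0; // one success closes the breaker again
  }
}
```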

&lt;p&gt;Three adjacent systems make this pipeline viable in production: accurate token estimation, prompt-cache boundaries, and the summarizer's security assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token Estimation Without API Calls
&lt;/h2&gt;

&lt;p&gt;Most agents estimate context size by counting tokens on the client. This typically has 30%+ error — enough to trigger compression too early or too late.&lt;/p&gt;

&lt;p&gt;Claude Code uses a smarter approach. Think of it as a morning weigh-in: you step on the scale at 75kg, then eat lunch. You don't need the scale again — estimating 75.5kg is good enough.&lt;/p&gt;

&lt;p&gt;The "scale" is the &lt;code&gt;usage&lt;/code&gt; data returned by every API response — server-side precise token counts. The "lunch" is the few messages added since then.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;tokenCountWithEstimation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Find the most recent message with server-reported usage&lt;/span&gt;
  &lt;span class="c1"&gt;// Use that as the anchor point&lt;/span&gt;
  &lt;span class="c1"&gt;// Estimate only the delta (new messages since anchor)&lt;/span&gt;
  &lt;span class="c1"&gt;// Result: &amp;lt;5% error vs 30%+ from pure client estimation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the need for tokenizer API calls while maintaining accuracy that's good enough for compression timing decisions.&lt;/p&gt;
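&lt;p&gt;Filling in the comment-only sketch above with a runnable version: the anchor-plus-delta idea is from the article, but the ~4-characters-per-token heuristic and the field names are assumptions:&lt;/p&gt;

```typescript
// Anchor on the most recent server-reported usage; client-estimate only the delta.
interface Msg {
  content: string;
  usageInputTokens?: number; // present when the server reported usage for this turn
}

function tokenCountWithEstimation(messages: Msg[]): number {
  // Walk backwards to find the anchor point.
  let anchorIdx = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].usageInputTokens !== undefined) {
      anchorIdx = i;
      break;
    }
  }
  const anchorTokens = anchorIdx >= 0 ? messages[anchorIdx].usageInputTokens! : 0;
  // Rough client-side estimate for the few messages added since the anchor.
  const deltaChars = messages
    .slice(anchorIdx + 1)
    .reduce((sum, m) => sum + m.content.length, 0);
  return anchorTokens + Math.ceil(deltaChars / 4);
}
```

&lt;p&gt;The estimation error is confined to the small delta, which is why the overall figure stays far more accurate than estimating the whole history client-side.&lt;/p&gt;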

&lt;h2&gt;
  
  
  The Prompt Cache Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's system prompt can be 50-100K tokens. Without caching, every API call would re-process this from scratch.&lt;/p&gt;

&lt;p&gt;The key innovation: &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; — a sentinel string that splits the system prompt into static and dynamic halves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before the boundary&lt;/strong&gt;: core instructions, tool descriptions, security rules — identical for ALL users globally → cached with &lt;code&gt;scope: 'global'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After the boundary&lt;/strong&gt;: MCP tool instructions, output preferences, language settings — varies per user → not cached globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means millions of Claude Code users &lt;strong&gt;share the same cached system prompt prefix&lt;/strong&gt;. One cache hit saves compute for everyone. But change one byte before the boundary, and the global cache breaks for all users.&lt;/p&gt;

&lt;p&gt;To protect this, Claude Code implements &lt;strong&gt;sticky-on latching&lt;/strong&gt; for beta headers: once a header is sent in a session, it persists for all subsequent requests — even if the feature flag is turned off mid-session. Flexibility sacrificed for cache stability.&lt;/p&gt;
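&lt;p&gt;The boundary split itself is straightforward. The sentinel's actual value isn't shown in the source, so it's a placeholder here, and the block shape is illustrative rather than the real API payload:&lt;/p&gt;

```typescript
// Split the system prompt at the sentinel: the static prefix is globally
// cacheable, the per-user tail is not.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<<dynamic-boundary>>"; // placeholder value

interface PromptBlock {
  text: string;
  cacheScope: "global" | "none";
}

function splitSystemPrompt(prompt: string): PromptBlock[] {
  const idx = prompt.indexOf(SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
  if (idx === -1) return [{ text: prompt, cacheScope: "global" }];
  return [
    // Identical for all users -> one shared cache entry worldwide.
    { text: prompt.slice(0, idx), cacheScope: "global" },
    // MCP instructions, output preferences, language settings -> per user.
    {
      text: prompt.slice(idx + SYSTEM_PROMPT_DYNAMIC_BOUNDARY.length),
      cacheScope: "none",
    },
  ];
}
```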

&lt;h2&gt;
  
  
  The Security Blind Spot
&lt;/h2&gt;

&lt;p&gt;Here's something the compression pipeline gets wrong: &lt;strong&gt;it treats all content equally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The autocompact summarizer processes user instructions and tool results through the same pipeline. If an attacker plants malicious instructions inside a project file — and the model reads that file — those instructions survive compression. They become part of the summary, indistinguishable from legitimate context.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;analysis&amp;gt;&lt;/code&gt; scratchpad that makes summaries so good also faithfully preserves injected instructions. There's no classification step that distinguishes "user said this" from "this was in a file the model read."&lt;/p&gt;

&lt;p&gt;Additionally, &lt;code&gt;truncateHeadForPTLRetry()&lt;/code&gt; reveals another edge: when the conversation is so long that the compression request itself triggers a Prompt-Too-Long error, the system recursively drops the oldest turns to make the compression fit. An attacker could craft inputs that survive this truncation — instructions placed strategically in the middle of conversations, not at the edges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Designs Worth Stealing
&lt;/h2&gt;

&lt;p&gt;If you're building your own agent, these patterns transfer directly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Progressive compression (cheapest first)&lt;/strong&gt; — Don't jump to expensive summarization. Try zero-cost approaches first. Most sessions will never need the heavy option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache-aware dual paths&lt;/strong&gt; — Let infrastructure state drive algorithm selection. When cache is cold, optimize for simplicity. When cache is hot, optimize for preservation. Same goal, different strategies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Circuit breakers on automated recovery&lt;/strong&gt; — Never let a fix become a new failure mode. If compression fails 3 times, it will fail a 4th time. Stop. The 250K API calls wasted per day before this fix landed are a cautionary tale for any self-healing system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Part 4: Memory — First-Principles Tradeoffs in Agent Persistence&lt;/a&gt; — why Anthropic chose Markdown files over vector databases, and when that's the wrong call.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Part 2: The 1,421-Line While Loop&lt;/a&gt; | &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>contextengineering</category>
      <category>agents</category>
      <category>compression</category>
    </item>
    <item>
      <title>Claude Code + Codex Plugin: Two AI Brains, One Terminal</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:47:24 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-codex-plugin-two-ai-brains-one-terminal-k31</guid>
      <description>&lt;p&gt;You're debugging a gnarly race condition. Claude Code has been going at it for 10 minutes — reading files, forming theories, running tests. Then it hits a wall. Same hypothesis, same failed fix, third attempt.&lt;/p&gt;

&lt;p&gt;What if you could call in a second brain — a completely different model with fresh eyes — without leaving your terminal?&lt;/p&gt;

&lt;p&gt;That's what the &lt;strong&gt;Codex plugin for Claude Code&lt;/strong&gt; does. It puts OpenAI's Codex (powered by GPT-5.4) inside your Claude Code session as a callable rescue agent. Two models. Two reasoning styles. One shared codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is It, Exactly?
&lt;/h2&gt;

&lt;p&gt;The Codex plugin is a &lt;strong&gt;Claude Code plugin&lt;/strong&gt; — not a standalone tool. It lives inside your Claude Code session and gives you slash commands to dispatch tasks to OpenAI's Codex CLI.&lt;/p&gt;

&lt;p&gt;Think of it as a second engineer sitting next to you. Claude (Opus) is your primary — it has the full conversation context, knows your project, runs your tools. Codex is your specialist — you hand it a focused task, it works in a sandboxed environment, and returns results.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;they don't compete. They complement.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude sees the big picture. It orchestrates, reads files, runs tools, manages state.&lt;/li&gt;
&lt;li&gt;Codex gets a sharp, scoped task. It reasons deeply on that one problem and comes back with an answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup: 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Install the Codex CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @openai/codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Authenticate
&lt;/h3&gt;

&lt;p&gt;Inside Claude Code, type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!codex login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a browser for OpenAI authentication. Once done, your token is stored locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Verify
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code will check that the Codex CLI is installed, authenticated, and ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoze1uyqa3jdtsphmnxs.jpg" alt="Codex setup — ready, authenticated, review gate available" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Commands
&lt;/h2&gt;

&lt;p&gt;The plugin adds 7 slash commands to Claude Code:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:setup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check installation and auth status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:rescue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hand a task to Codex (the main one you'll use)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run a Codex code review on your local git changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:adversarial-review&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same, but Codex actively challenges your design choices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:status&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check running/recent Codex jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:result&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Get the output of a finished background job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/codex:cancel&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kill an active background Codex job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Rescue Workflow: When Claude Gets Stuck
&lt;/h2&gt;

&lt;p&gt;This is where the plugin shines. Claude Code will &lt;strong&gt;proactively&lt;/strong&gt; spawn the Codex rescue agent when it detects it's stuck — same hypothesis loop, repeated failures, or a task that needs a second implementation pass.&lt;/p&gt;

&lt;p&gt;You can also trigger it manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue fix the race condition in src/worker.ts — tests pass locally but fail in CI under parallel execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude takes your request and shapes it into a structured prompt optimized for GPT-5.4&lt;/li&gt;
&lt;li&gt;The plugin invokes &lt;code&gt;codex-companion.mjs task&lt;/code&gt; with that prompt&lt;/li&gt;
&lt;li&gt;Codex works in the shared repository — reading files, reasoning, writing code&lt;/li&gt;
&lt;li&gt;Results come back into your Claude Code session&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvonf3oosq27oo2p89zj.jpg" alt="Codex rescue in action — dispatching task to GPT-5.4 via codex-companion" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Foreground vs Background
&lt;/h3&gt;

&lt;p&gt;Small, focused rescues run in the foreground — you wait and get the result immediately.&lt;/p&gt;

&lt;p&gt;Big, multi-step investigations can run in the background:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:rescue --background investigate why the build is 3x slower since the last merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on it later with &lt;code&gt;/codex:status&lt;/code&gt; and grab results with &lt;code&gt;/codex:result&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Review: A Second Opinion That Actually Pushes Back
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sends your local git diff to Codex for review. It checks against your working tree or branch changes.&lt;/p&gt;

&lt;p&gt;But the real power is the adversarial review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:adversarial-review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't "looks good to me." Codex will actively challenge your implementation approach, question design decisions, and flag things a polite reviewer wouldn't mention. It's the code review you &lt;em&gt;need&lt;/em&gt;, not the one you &lt;em&gt;want&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6x63jksxnw8i6dp99hff.jpg" alt="Codex review — checking git working tree for code review" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which Brain
&lt;/h2&gt;

&lt;p&gt;After a month of daily use, here's my mental model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Let Claude (Opus) Handle:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt; — multi-file changes, refactors across the codebase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-heavy tasks&lt;/strong&gt; — "fix this bug" when you've been discussing it for 20 messages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-heavy workflows&lt;/strong&gt; — file reads, grep, test runs, build commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation continuity&lt;/strong&gt; — anything that builds on prior context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Call in Codex (GPT-5.4) For:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fresh eyes&lt;/strong&gt; — when Claude is circling the same hypothesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep single-problem reasoning&lt;/strong&gt; — "why does this specific test fail under these exact conditions"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial review&lt;/strong&gt; — challenge assumptions Claude might share with you&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel investigation&lt;/strong&gt; — background a research task while Claude keeps working&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Pattern That Works Best
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Claude does the initial investigation — reads files, forms a theory&lt;/li&gt;
&lt;li&gt;If the theory doesn't pan out in 2-3 attempts, &lt;strong&gt;rescue to Codex&lt;/strong&gt; with the full context of what was tried&lt;/li&gt;
&lt;li&gt;Codex returns a diagnosis or fix&lt;/li&gt;
&lt;li&gt;Claude applies it in context, runs tests, iterates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two models. Two reasoning paths. Converging on the same answer faster than either alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced: Prompt Shaping
&lt;/h2&gt;

&lt;p&gt;The plugin includes a &lt;code&gt;gpt-5-4-prompting&lt;/code&gt; skill that automatically structures your rescue requests into Codex-optimized prompts using XML tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;task&amp;gt;&lt;/code&gt; — the concrete job&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;verification_loop&amp;gt;&lt;/code&gt; — how to confirm the fix works&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;grounding_rules&amp;gt;&lt;/code&gt; — stay anchored to evidence, not guesses&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;lt;action_safety&amp;gt;&lt;/code&gt; — don't refactor unrelated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to write these yourself. Claude does it automatically when it hands off to Codex. But knowing they exist explains why Codex rescue results are usually sharper than raw Codex CLI usage.&lt;/p&gt;
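&lt;p&gt;The shaping step amounts to wrapping the request in those tags. The tag names come from the plugin; this builder is a hypothetical illustration of what the skill produces, not its actual code:&lt;/p&gt;

```typescript
// Hypothetical sketch of structuring a rescue request into the XML tags
// the gpt-5-4-prompting skill uses.
function shapeRescuePrompt(task: string, verification: string): string {
  return [
    `<task>${task}</task>`,
    `<verification_loop>${verification}</verification_loop>`,
    `<grounding_rules>Stay anchored to evidence in the repository; do not guess.</grounding_rules>`,
    `<action_safety>Do not refactor unrelated code.</action_safety>`,
  ].join("\n");
}
```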

&lt;h2&gt;
  
  
  Advanced: The Review Gate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/codex:setup --enable-review-gate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When enabled, every &lt;code&gt;git commit&lt;/code&gt; in the repo triggers an automatic Codex review before the commit completes. It's a pre-commit hook powered by a second AI brain.&lt;/p&gt;

&lt;p&gt;This is aggressive — I only enable it on critical branches or before releases. But when you want zero-trust code quality, it's unmatched.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The Codex plugin doesn't replace Claude Code. It makes Claude Code &lt;strong&gt;anti-fragile&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every AI agent has blind spots — reasoning loops it can't escape, patterns it over-fits to, assumptions it shares with its user. A second model with a different training distribution breaks those loops.&lt;/p&gt;

&lt;p&gt;The dual-brain setup isn't about which model is "better." It's about &lt;strong&gt;coverage&lt;/strong&gt;. Two independent reasoning paths catch more bugs than one brilliant path run twice.&lt;/p&gt;

&lt;p&gt;If you're using Claude Code daily, install the Codex plugin. It's 3 minutes of setup and it will save you hours of "why is Claude stuck on this?"&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;Claude Code Architecture Deep Dive&lt;/a&gt; series. Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-query-loop/" rel="noopener noreferrer"&gt;The 1,421-Line While Loop That Runs Everything&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>codex</category>
      <category>agents</category>
      <category>openai</category>
    </item>
    <item>
      <title>Claude Code Deep Dive Part 2: The 1,421-Line While Loop That Runs Everything</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:24:18 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-deep-dive-part-2-the-1421-line-while-loop-that-runs-everything-121</guid>
      <description>&lt;p&gt;&lt;em&gt;This is Part 2 of our Claude Code Architecture Deep Dive series. &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1: 5 Hidden Features&lt;/a&gt; covered the surface-level discoveries. Now we go deeper.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Heart of Claude Code
&lt;/h2&gt;

&lt;p&gt;Every AI coding agent — Claude Code, Cursor, Copilot — runs some version of the same loop: send context to an LLM, get back text and tool calls, execute tools, feed results back, repeat. We called this &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;LLM talks, program walks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But Claude Code's implementation of this loop is anything but simple. It lives in &lt;code&gt;query.ts&lt;/code&gt;, a 1,729-line async generator. The &lt;code&gt;while(true)&lt;/code&gt; starts at line 307 and ends at line 1728 — a single loop body spanning 1,421 lines of production code.&lt;/p&gt;

&lt;p&gt;This is not a toy. This is the engine that processes every keystroke, every tool call, every error recovery, every context compression decision for millions of users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — line 307&lt;/span&gt;
&lt;span class="c1"&gt;// eslint-disable-next-line no-constant-condition&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toolUseContext&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;
    &lt;span class="c1"&gt;// ... 1,421 lines of state machine logic ...&lt;/span&gt;
    &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;next&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// while (true)  — line 1728&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why a State Machine, Not Recursion
&lt;/h2&gt;

&lt;p&gt;Early versions of Claude Code used recursion — the query function called itself. But recursion has a fatal flaw: in long conversations with hundreds of tool calls, the call stack keeps growing until it overflows.&lt;/p&gt;
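&lt;p&gt;The difference can be sketched in a few lines; the state shape and step function here are stand-ins, not the real &lt;code&gt;query.ts&lt;/code&gt; code:&lt;/p&gt;

```typescript
// Illustrative contrast, not the actual query.ts code: a recursive loop
// grows the call stack with every turn, while an iterative state machine
// reuses one frame. The State shape here is a stand-in.
type State = { turnCount: number; done: boolean };

function stepOnce(state: State): State {
  // Stand-in for one model call plus tool execution.
  const turnCount = state.turnCount + 1;
  return { turnCount, done: turnCount >= 10_000 };
}

// Recursive style: one stacked frame per turn until it returns.
function runRecursive(state: State): State {
  return state.done ? state : runRecursive(stepOnce(state));
}

// Iterative style: constant stack depth regardless of turn count.
function runLoop(state: State): State {
  while (!state.done) {
    state = stepOnce(state); // each assignment is a state transition
  }
  return state;
}

console.log(runLoop({ turnCount: 0, done: false }).turnCount); // 10000
```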

&lt;p&gt;The current design uses &lt;code&gt;while(true)&lt;/code&gt; with a &lt;code&gt;state&lt;/code&gt; object that carries context between iterations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// query.ts — lines 207-215 (State type, partial)&lt;/span&gt;
&lt;span class="nx"&gt;autoCompactTracking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AutoCompactTrackingState&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;maxOutputTokensRecoveryCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;hasAttemptedReactiveCompact&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;       &lt;span class="c1"&gt;// circuit breaker for 413 recovery&lt;/span&gt;
&lt;span class="nx"&gt;stopHookActive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;
&lt;span class="nx"&gt;turnCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;
&lt;span class="nx"&gt;transition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt; &lt;span class="c1"&gt;// why we continued&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;continue&lt;/code&gt; statement is a state transition. There are &lt;strong&gt;nine distinct &lt;code&gt;continue&lt;/code&gt; points&lt;/strong&gt; in the code (at lines including 950, 1115, 1165, 1220, 1251, 1305, 1316, and 1340), each representing a different reason to run another turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next tool call needed&lt;/li&gt;
&lt;li&gt;Reactive compact triggered after 413&lt;/li&gt;
&lt;li&gt;Max output tokens recovery&lt;/li&gt;
&lt;li&gt;Stop hook interrupted&lt;/li&gt;
&lt;li&gt;Token budget continuation&lt;/li&gt;
&lt;li&gt;And more&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Loop at a Glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBWyLikaAgQ29tcHJlc3MgQ29udGV4dDxici8-KDQgc3RhZ2VzKSJdIC0tPiBCWyLikaEgVG9rZW4gQnVkZ2V0IENoZWNrIl0KICAgIEIgLS0-IENbIuKRoiBDYWxsIE1vZGVsIEFQSTxici8-KHN0cmVhbWluZykiXQogICAgQyAtLT4gRFsi4pGjIFN0cmVhbSBUb29sIEV4ZWN1dGlvbjxici8-KHBhcmFsbGVsIHdpdGggZ2VuZXJhdGlvbikiXQogICAgRCAtLT4gRVsi4pGkIEVycm9yIFJlY292ZXJ5PGJyLz4oNDEzIOKGkiByZWFjdGl2ZSBjb21wYWN0KSJdCiAgICBFIC0tPiBGWyLikaUgU3RvcCBIb29rcyJdCiAgICBGIC0tPiBHWyLikaYgVG9rZW4gQnVkZ2V0IENoZWNrICMyIl0KICAgIEcgLS0-IEhbIuKRpyBFeGVjdXRlIFRvb2xzPGJyLz4oMTQtc3RlcCBwaXBlbGluZSkiXQogICAgSCAtLT4gSVsi4pGoIEluamVjdCBBdHRhY2htZW50czxici8-KG1lbW9yeSwgc2tpbGxzLCBxdWV1ZWQgY21kcykiXQogICAgSSAtLT4gSlsi4pGpIEFzc2VtYmxlIE1lc3NhZ2VzIl0KICAgIEogLS0-fCJuZXh0IHR1cm4ifCBBCgogICAgc3R5bGUgQSBmaWxsOiMxYTRkMmUsc3Ryb2tlOiMyMmM1NWUsY29sb3I6I2ZmZgogICAgc3R5bGUgQyBmaWxsOiMxYTNhNWMsc3Ryb2tlOiMzYjgyZjYsY29sb3I6I2ZmZgogICAgc3R5bGUgRCBmaWxsOiM0YTM1MjAsc3Ryb2tlOiNmNTllMGIsY29sb3I6I2ZmZgogICAgc3R5bGUgRSBmaWxsOiM0YTIwMjAsc3Ryb2tlOiNlZjQ0NDQsY29sb3I6I2ZmZgogICAgc3R5bGUgSCBmaWxsOiMzYTIwNTAsc3Ryb2tlOiM4YjVjZjYsY29sb3I6I2ZmZg%3D%3D" alt="flowchart TD" width="342" height="1198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  10 Steps Per Iteration
&lt;/h2&gt;

&lt;p&gt;Each time the loop runs, it does these 10 things in order. Every step has real source code behind it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Context Compression (4 stages)
&lt;/h3&gt;

&lt;p&gt;Before calling the API, the system tries to fit everything into the context window. Four compression mechanisms fire in priority order (imports at lines 12-16, 115-116):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snip Compact&lt;/strong&gt; — trims overly long individual messages in history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micro Compact&lt;/strong&gt; — finer-grained editing based on &lt;code&gt;tool_use_id&lt;/code&gt;, cache-friendly (line 370: "microcompact operates purely by tool_use_id")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Collapse&lt;/strong&gt; — folds inactive context regions into summaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Compact&lt;/strong&gt; — when total tokens approach the threshold, triggers full compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are not mutually exclusive — they run in priority order:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBWyJTbmlwIENvbXBhY3Q8YnIvPjxpPnRyaW0gbG9uZyBtZXNzYWdlczwvaT4iXSAtLT58InN0aWxsIHRvbyBiaWc_InwgQlsiTWljcm8gQ29tcGFjdDxici8-PGk-dG9vbF91c2VfaWQgZWRpdHM8L2k-Il0KICAgIEIgLS0-fCJzdGlsbCB0b28gYmlnPyJ8IENbIkNvbnRleHQgQ29sbGFwc2U8YnIvPjxpPmZvbGQgaW5hY3RpdmUgcmVnaW9uczwvaT4iXQogICAgQyAtLT58InN0aWxsIHRvbyBiaWc_InwgRFsiQXV0byBDb21wYWN0PGJyLz48aT5mdWxsIGNvbXByZXNzaW9uPC9pPiJdCiAgICBEIC0tPnwiQVBJIHJldHVybnMgNDEzInwgRVsiUmVhY3RpdmUgQ29tcGFjdDxici8-PGk-ZW1lcmdlbmN5LCBvbmNlIG9ubHk8L2k-Il0KCiAgICBzdHlsZSBBIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBCIGZpbGw6IzFhM2EyZSxzdHJva2U6IzRhZGU4MCxjb2xvcjojZmZmCiAgICBzdHlsZSBDIGZpbGw6IzNhMzUyMCxzdHJva2U6I2ZiYmYyNCxjb2xvcjojZmZmCiAgICBzdHlsZSBEIGZpbGw6IzRhMjAyMCxzdHJva2U6I2VmNDQ0NCxjb2xvcjojZmZmCiAgICBzdHlsZSBFIGZpbGw6IzRhMTAyMCxzdHJva2U6I2RjMjYyNixjb2xvcjojZmZm" alt="flowchart LR" width="1552" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system tries lightweight options first. If snip + micro bring tokens under the limit, the heavy compressors never run.&lt;/p&gt;
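&lt;p&gt;The priority-order behavior reduces to a simple loop over stages. The stage names mirror the list above, but the token math and the &lt;code&gt;compress&lt;/code&gt; helper are toy stand-ins for the real message-history compressors:&lt;/p&gt;

```typescript
// Toy sketch of the priority order. The stage names mirror the list above;
// the token math and compress() helper are stand-ins for the real
// message-history compressors.
type Compressor = { name: string; run: (tokens: number) => number };

const pipeline: Compressor[] = [
  { name: "snip",     run: (t) => Math.floor(t * 0.9) },  // trim long messages
  { name: "micro",    run: (t) => Math.floor(t * 0.8) },  // tool_use_id edits
  { name: "collapse", run: (t) => Math.floor(t * 0.5) },  // fold inactive regions
  { name: "auto",     run: (t) => Math.floor(t * 0.25) }, // full compression
];

function compress(tokens: number, limit: number): { tokens: number; used: string[] } {
  const used: string[] = [];
  for (const stage of pipeline) {
    if (tokens <= limit) break; // lightweight stages may already be enough
    tokens = stage.run(tokens);
    used.push(stage.name);
  }
  return { tokens, used };
}

console.log(compress(85_000, 80_000).used); // ["snip"]: the heavy stages never ran
```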

&lt;h3&gt;
  
  
  Step 2: Token Budget Check
&lt;/h3&gt;

&lt;p&gt;If a token budget is active (&lt;code&gt;feature('TOKEN_BUDGET')&lt;/code&gt;, line 280), the system checks whether to continue. Users can specify targets like "+500k", and the system tracks cumulative output tokens per turn, injecting nudge messages near the goal to keep the model working.&lt;/p&gt;
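&lt;p&gt;A rough sketch of the budget mechanics, assuming a "+500k"-style spec and a simple near-goal nudge rule; both helpers are hypothetical, not the actual &lt;code&gt;query.ts&lt;/code&gt; implementation:&lt;/p&gt;

```typescript
// Hypothetical sketch of the budget mechanics: parse a "+500k" style target
// and decide when to inject a nudge. Both helpers are illustrative, not the
// actual query.ts implementation.
function parseBudget(spec: string): number {
  const m = spec.match(/^\+(\d+)(k?)$/i);
  if (!m) throw new Error(`bad budget spec: ${spec}`);
  return Number(m[1]) * (m[2] ? 1_000 : 1);
}

function shouldNudge(cumulativeTokens: number, target: number): boolean {
  // Nudge once output is within 10% of the goal but has not yet reached it.
  return cumulativeTokens >= target * 0.9 && cumulativeTokens < target;
}

console.log(parseBudget("+500k"));          // 500000
console.log(shouldNudge(460_000, 500_000)); // true
```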

&lt;h3&gt;
  
  
  Step 3: Call Model API
&lt;/h3&gt;

&lt;p&gt;Line 659 — the actual API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for await (const message of deps.callModel({
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a streaming call. The response arrives token by token, and the system processes it incrementally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Streaming Tool Execution
&lt;/h3&gt;

&lt;p&gt;This is a critical optimization. Traditional agents wait for the model to finish generating all output, then execute tools. Claude Code uses &lt;code&gt;StreamingToolExecutor&lt;/code&gt; (imported at line 96):&lt;/p&gt;

&lt;p&gt;When the model is still generating its second tool call, the first one is already running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Agent (sequential):
┌─────────────────────────┐┌───┐┌───┐┌───┐┌───┐┌───┐
│  LLM generates 5 calls  ││ T1││ T2││ T3││ T4││ T5│  ← 30s total
└─────────────────────────┘└───┘└───┘└───┘└───┘└───┘

Claude Code (streaming):
┌─────────────────────────┐
│  LLM generates 5 calls  │
├──┬──┬──┬──┬─────────────┘
│T1│T2│T3│T4│T5│                                       ← 18s total
└──┴──┴──┴──┴──┘
↑ tools start while LLM is still generating
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a turn with 5 tool calls, the traditional agent waits 30 seconds; the streaming version finishes in 18 — a &lt;strong&gt;40% speedup&lt;/strong&gt; from architecture alone, not model improvements.&lt;/p&gt;

&lt;p&gt;Lines 554-555 reveal an interesting detail: &lt;code&gt;stop_reason === 'tool_use'&lt;/code&gt; is unreliable — "it's not always set correctly." The system detects tool calls by watching for &lt;code&gt;tool_use&lt;/code&gt; blocks during streaming instead.&lt;/p&gt;
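&lt;p&gt;The overlap can be modeled with an async generator: tool calls start the moment they are parsed, not after the whole response has been generated. The generator and executor below are toy stand-ins for &lt;code&gt;StreamingToolExecutor&lt;/code&gt;:&lt;/p&gt;

```typescript
// Toy model of the overlap, not StreamingToolExecutor itself: tool calls
// start the moment they are parsed from the stream, instead of after the
// whole response has been generated.
async function* generateToolCalls(): AsyncGenerator<string> {
  for (const name of ["read", "grep", "edit"]) {
    await new Promise((r) => setTimeout(r, 10)); // the model is still "generating"
    yield name;
  }
}

async function runTool(name: string): Promise<string> {
  await new Promise((r) => setTimeout(r, 20)); // the tool does real work
  return `${name}:done`;
}

async function runStreaming(): Promise<string[]> {
  const running: Promise<string>[] = [];
  for await (const call of generateToolCalls()) {
    running.push(runTool(call)); // start immediately; do not await here
  }
  return Promise.all(running); // tools overlap generation and each other
}

runStreaming().then((results) => console.log(results)); // ["read:done", "grep:done", "edit:done"]
```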

&lt;h3&gt;
  
  
  Step 5: Error Recovery
&lt;/h3&gt;

&lt;p&gt;What happens when the prompt is too long? The loop first tries a context-collapse drain; if that fails, it falls back to reactive compact (lines 15-16). If the API returns 413 (prompt too long), it triggers emergency compression and retries.&lt;/p&gt;

&lt;p&gt;But there's a circuit breaker: &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; (line 209, initialized &lt;code&gt;false&lt;/code&gt; at line 275) ensures each turn only attempts reactive compact once. Without this, a genuinely oversized conversation would loop forever.&lt;/p&gt;
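&lt;p&gt;The breaker reduces to a one-shot flag. A minimal sketch, with the surrounding loop and &lt;code&gt;compact()&lt;/code&gt; as stand-ins:&lt;/p&gt;

```typescript
// Minimal sketch of the one-shot breaker behind hasAttemptedReactiveCompact.
// The surrounding loop and compact() are stand-ins.
type TurnState = { hasAttemptedReactiveCompact: boolean };

function handlePromptTooLong(state: TurnState, compact: () => boolean): "retry" | "fail" {
  if (state.hasAttemptedReactiveCompact) {
    return "fail"; // already tried once this turn, so never loop forever
  }
  state.hasAttemptedReactiveCompact = true;
  return compact() ? "retry" : "fail";
}

const turn: TurnState = { hasAttemptedReactiveCompact: false };
console.log(handlePromptTooLong(turn, () => true)); // "retry"
console.log(handlePromptTooLong(turn, () => true)); // "fail": the breaker tripped
```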

&lt;p&gt;The system also handles model degradation — if the primary model fails, it can fall back to a different model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Stop Hooks
&lt;/h3&gt;

&lt;p&gt;After the model stops outputting, the system runs registered stop hooks. These can inspect the output and decide whether to let the model continue. This is where external governance plugs in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Token Budget Check (Again)
&lt;/h3&gt;

&lt;p&gt;Yes, checked twice — once before calling the model (should we even start?) and once after (did we exceed the budget?). The second check decides whether to inject a "keep going" nudge or stop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Tool Execution
&lt;/h3&gt;

&lt;p&gt;If the response contains &lt;code&gt;tool_use&lt;/code&gt; blocks, execute them. Two paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;runTools()&lt;/code&gt; (from &lt;code&gt;toolOrchestration.ts&lt;/code&gt;, line 98) — batch execution&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;StreamingToolExecutor&lt;/code&gt; (line 96) — streaming execution, gated by &lt;code&gt;config.gates.streamingToolExecution&lt;/code&gt; (line 561)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each tool call goes through the 14-step execution pipeline in &lt;code&gt;toolExecution.ts&lt;/code&gt; (1,745 lines) — validation, permission checks, hooks, actual execution, analytics. That's a story for &lt;a href="https://harrisonsec.com/blog/claude-code-deep-dive-tool-pipeline/" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 9: Attachment Injection
&lt;/h3&gt;

&lt;p&gt;After tools finish, the system injects additional context before the next turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory attachments&lt;/strong&gt; — relevant memories from the &lt;code&gt;memdir/&lt;/code&gt; system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill discovery&lt;/strong&gt; — matching skills based on the current task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queued commands&lt;/strong&gt; — any commands that were waiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This happens after tool execution but before the next API call, ensuring the model has fresh context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 10: Assemble and Loop
&lt;/h3&gt;

&lt;p&gt;Build the new message list from all the pieces — original conversation, tool results, attachments, system reminders — and go back to step 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Architecture Matters
&lt;/h2&gt;

&lt;p&gt;Most open-source AI agents implement the loop as 50 lines of pseudocode: call model, parse tool calls, execute, repeat. Claude Code's 1,421-line version exists because production reality is messy:&lt;/p&gt;
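&lt;p&gt;That naive loop, boiled down even further, looks something like this. It is a caricature rather than any real agent's code, and it is useful mostly for what it leaves out: compression, budget tracking, error recovery, streaming.&lt;/p&gt;

```typescript
// A caricature of the naive loop, not any real agent's code. Notice what is
// missing: compression, budget tracking, error recovery, streaming.
type Turn = { toolCalls: string[] };

function naiveAgentLoop(
  callModel: () => Turn,
  execute: (tool: string) => string
): string[] {
  const transcript: string[] = [];
  for (;;) {
    const turn = callModel();               // 1. send context, get response
    if (turn.toolCalls.length === 0) break; // 2. no tools requested: done
    for (const call of turn.toolCalls) {
      transcript.push(execute(call));       // 3. run tools sequentially
    }                                       // 4. feed results back, repeat
  }
  return transcript;
}

let turns = 0;
const result = naiveAgentLoop(
  () => (turns++ === 0 ? { toolCalls: ["ls"] } : { toolCalls: [] }),
  (tool) => `${tool}: ok`
);
console.log(result); // ["ls: ok"]
```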

&lt;p&gt;&lt;strong&gt;Context doesn't fit.&lt;/strong&gt; A real coding session easily hits 200K tokens. Without the 4-stage compression pipeline, the agent dies on every long conversation. Most agents just truncate and lose context. Claude Code compresses intelligently — lightweight first, heavy only when needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models fail.&lt;/strong&gt; APIs return 413, connections drop, rate limits hit. The 9 continue points aren't over-engineering — they're the minimum number of recovery paths needed for reliable operation. The &lt;code&gt;hasAttemptedReactiveCompact&lt;/code&gt; circuit breaker is the kind of detail that separates a demo from a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed matters more than strict execution order.&lt;/strong&gt; Streaming tool execution — starting the first tool while the model is still generating the third — is a user experience decision backed by architecture. Traditional agents feel slow because they are: they serialize everything. Claude Code parallelizes at the loop level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens cost money.&lt;/strong&gt; The &lt;code&gt;SYSTEM_PROMPT_DYNAMIC_BOUNDARY&lt;/code&gt; marker in &lt;code&gt;prompts.ts&lt;/code&gt; (914 lines) splits the system prompt into static (cacheable) and dynamic sections. If two requests share the same static prefix byte-for-byte, the API caches it. Source comment: "don't modify content before the boundary, or you'll destroy the cache." This is prompt cache economics — saving Anthropic real compute costs at scale.&lt;/p&gt;
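&lt;p&gt;The caching contract can be sketched in a few lines. The boundary marker name comes from the source; &lt;code&gt;assemblePrompt&lt;/code&gt; and the section contents are hypothetical:&lt;/p&gt;

```typescript
// Sketch of the cache-boundary contract. The marker name comes from the
// source; assemblePrompt and the section contents are hypothetical.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "<!-- dynamic-boundary -->";

function assemblePrompt(staticSections: string[], dynamicSections: string[]): string {
  // Never edit or reorder anything before the boundary: that would change
  // the cached prefix and force a full re-read on the next request.
  return [
    staticSections.join("\n"),
    SYSTEM_PROMPT_DYNAMIC_BOUNDARY,
    dynamicSections.join("\n"),
  ].join("\n");
}

const a = assemblePrompt(["You are a coding agent."], ["cwd: /tmp/project-a"]);
const b = assemblePrompt(["You are a coding agent."], ["cwd: /tmp/project-b"]);
const prefixA = a.split(SYSTEM_PROMPT_DYNAMIC_BOUNDARY)[0];
const prefixB = b.split(SYSTEM_PROMPT_DYNAMIC_BOUNDARY)[0];
console.log(prefixA === prefixB); // true: the cacheable part is byte-identical
```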

&lt;h2&gt;
  
  
  The Behavioral Constitution
&lt;/h2&gt;

&lt;p&gt;Buried inside the prompt assembly, &lt;code&gt;getSimpleDoingTasksSection()&lt;/code&gt; may be the most valuable function in the entire codebase. It encodes hard-won rules about what the model should NOT do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't add features the user didn't ask for&lt;/li&gt;
&lt;li&gt;Don't over-abstract — three duplicate lines beat a premature abstraction&lt;/li&gt;
&lt;li&gt;Don't add comments to code you didn't change&lt;/li&gt;
&lt;li&gt;Don't add unnecessary error handling&lt;/li&gt;
&lt;li&gt;Read code before modifying it&lt;/li&gt;
&lt;li&gt;If a method fails, diagnose before retrying&lt;/li&gt;
&lt;li&gt;Report honestly — don't say you ran something you didn't&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyone who has used Claude Code recognizes these rules. I've personally watched the system refuse to add "helpful" abstractions and stick to minimal changes. That's not the model being disciplined — it's the prompt constraining the model. The takeaway: &lt;strong&gt;don't trust model self-discipline. Codify the behavior.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Other Agents Compare
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Typical OSS Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Loop complexity&lt;/td&gt;
&lt;td&gt;1,421 lines, 9 continue points&lt;/td&gt;
&lt;td&gt;Unknown (closed source)&lt;/td&gt;
&lt;td&gt;~50-200 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compression&lt;/td&gt;
&lt;td&gt;4-stage pipeline + reactive 413 recovery&lt;/td&gt;
&lt;td&gt;Tab-level context pruning&lt;/td&gt;
&lt;td&gt;Truncate or fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool execution&lt;/td&gt;
&lt;td&gt;Streaming (parallel with generation)&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error recovery&lt;/td&gt;
&lt;td&gt;Circuit breakers, model fallback, emergency compact&lt;/td&gt;
&lt;td&gt;Basic retry&lt;/td&gt;
&lt;td&gt;Crash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt caching&lt;/td&gt;
&lt;td&gt;Static/dynamic boundary, section registry&lt;/td&gt;
&lt;td&gt;Unknown&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gap between Claude Code and most open-source agents is not model quality — it's the program layer. The model is the same Opus or Sonnet for everyone. What makes Claude Code feel different is 1,421 lines of careful engineering around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The query loop is where "LLM talks, program walks" becomes concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM outputs text and tool call JSON. That's it.&lt;/li&gt;
&lt;li&gt;The program handles compression, budget tracking, error recovery, streaming, permissions, memory injection, and 14-step tool validation.&lt;/li&gt;
&lt;li&gt;The 1,421 lines are not the model being smart. They're the program being careful.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building an AI agent and your main loop is under 100 lines, you're not handling the cases that matter. Production is not about the happy path. It's about what happens when context overflows, the API returns 413, the user's conversation hits 500 turns, and three tools need to run while the model is still thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: Part 3 — The 14-Step Tool Execution Pipeline (coming soon) — what happens between "model says call this tool" and the tool actually running.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://harrisonsec.com/blog/claude-code-source-leaked-hidden-features/" rel="noopener noreferrer"&gt;Part 1 — 5 Hidden Features Found in 510K Lines&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>agents</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Claude Code Source Leaked: 5 Hidden Features Found in 510K Lines of Code</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 31 Mar 2026 22:02:07 +0000</pubDate>
      <link>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</link>
      <guid>https://forem.com/harrison_guo_e01b4c8793a0/claude-code-source-leaked-5-hidden-features-found-in-510k-lines-of-code-3mbn</guid>
      <description>&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;Anthropic shipped Claude Code v2.1.88 to npm with a 60MB source map still attached. That single file contained 1,906 source files and 510,000 lines of fully readable TypeScript. No minification. No obfuscation. Just the raw codebase, sitting in a public registry for anyone to download.&lt;/p&gt;
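&lt;p&gt;This is why a shipped source map is equivalent to shipping the code: its &lt;code&gt;sourcesContent&lt;/code&gt; field embeds every original file verbatim. A toy payload (hypothetical contents) shows how little work recovery takes:&lt;/p&gt;

```typescript
// A toy source map payload (hypothetical contents) showing why an attached
// map is equivalent to shipping the code: sourcesContent embeds every
// original file verbatim, and recovery is a JSON parse away.
const sourceMap = JSON.stringify({
  version: 3,
  sources: ["src/query.ts", "src/buddy/types.ts"],
  sourcesContent: ["// query loop source", "// buddy pet types"],
  mappings: "AAAA",
});

function recoverSources(mapJson: string): Map<string, string> {
  const map = JSON.parse(mapJson);
  const out = new Map<string, string>();
  map.sources.forEach((path: string, i: number) => {
    out.set(path, map.sourcesContent[i]); // the original file, fully readable
  });
  return out;
}

console.log(recoverSources(sourceMap).size); // 2
```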

&lt;p&gt;Within hours, backup repositories appeared on GitHub. One of them — &lt;a href="https://github.com/instructkr/claude-code" rel="noopener noreferrer"&gt;instructkr/claude-code&lt;/a&gt; — racked up 20,000+ stars almost instantly. Anthropic pulled the package, but the code was already mirrored everywhere. The cat was out of the bag, and it had opinions about AI safety.&lt;/p&gt;

&lt;h2&gt;
  
  
  5 Hidden Features Found in the Source
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Buddy Pet System
&lt;/h3&gt;

&lt;p&gt;Deep in &lt;code&gt;buddy/types.ts&lt;/code&gt;, there is a complete virtual pet system. Eighteen species, five rarity tiers, shiny variants, hats, custom eyes, and stat blocks. This was clearly planned as an April Fools' easter egg.&lt;/p&gt;

&lt;p&gt;The species list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SPECIES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;duck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;goose&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dragon&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;octopus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;owl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;penguin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;turtle&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;snail&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ghost&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axolotl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;capybara&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cactus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;robot&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rabbit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mushroom&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chonk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rarity weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;RARITY_WEIGHTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;common&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 60%&lt;/span&gt;
  &lt;span class="na"&gt;uncommon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 25%&lt;/span&gt;
  &lt;span class="na"&gt;rare&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// 10%&lt;/span&gt;
  &lt;span class="na"&gt;epic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;//  4%&lt;/span&gt;
  &lt;span class="na"&gt;legendary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;   &lt;span class="c1"&gt;//  1%&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each buddy gets a hat, eyes, and stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tophat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;propeller&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;halo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wizard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;beanie&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tinyduck&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Eye&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;·&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;✦&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;×&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;◉&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;°&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Stat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DEBUGGING&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PATIENCE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CHAOS&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;WISDOM&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SNARK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your buddy is generated deterministically from &lt;code&gt;hash(userId)&lt;/code&gt;, so every account gets the same pet every time. There is also a &lt;code&gt;shiny&lt;/code&gt; boolean variant — presumably the rare version you brag about in team Slack.&lt;/p&gt;
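&lt;p&gt;The actual mapping function isn't in the quoted snippet — only the union types are — but a deterministic &lt;code&gt;hash(userId)&lt;/code&gt; picker would look something like this sketch (the hash choice and shiny odds are my invention):&lt;/p&gt;

```typescript
// Hypothetical sketch of the hash(userId) -> buddy mapping. The Hat/Eye
// unions mirror the leaked types; the hash and odds are stand-ins.
type Hat = 'none' | 'crown' | 'tophat' | 'propeller' | 'halo' | 'wizard' | 'beanie' | 'tinyduck';
type Eye = '·' | '✦' | '×' | '◉' | '@' | '°';

const HATS: Hat[] = ['none', 'crown', 'tophat', 'propeller', 'halo', 'wizard', 'beanie', 'tinyduck'];
const EYES: Eye[] = ['·', '✦', '×', '◉', '@', '°'];

// FNV-1a string hash — a simple stand-in for whatever hash the source uses.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

interface Buddy { hat: Hat; eye: Eye; shiny: boolean; }

function buddyFor(userId: string): Buddy {
  const h = fnv1a(userId);
  return {
    hat: HATS[h % HATS.length],
    eye: EYES[(h >>> 3) % EYES.length],
    shiny: h % 128 === 0, // rare variant — these odds are invented
  };
}
```

&lt;p&gt;The point of hashing rather than random rolling: the same user sees the same pet across sessions and machines, with no stored state.&lt;/p&gt;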

&lt;p&gt;This was 100% an April 1st drop. The leak killed the surprise.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Undercover Mode
&lt;/h3&gt;

&lt;p&gt;This one raised eyebrows. In &lt;code&gt;utils/undercover.ts&lt;/code&gt;, there is a mode that makes Claude pretend to be a human developer.&lt;/p&gt;

&lt;p&gt;When active, undercover mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removes &lt;strong&gt;all&lt;/strong&gt; Anthropic traces from commits and pull requests&lt;/li&gt;
&lt;li&gt;Strips &lt;code&gt;Co-Authored-By&lt;/code&gt; headers&lt;/li&gt;
&lt;li&gt;Removes any mention of "Claude Code" or model names&lt;/li&gt;
&lt;li&gt;Writes commit messages "as a human developer would"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// UNDERCOVER MODE — CRITICAL&lt;/span&gt;
&lt;span class="c1"&gt;// You are operating UNDERCOVER in a PUBLIC/OPEN-SOURCE repository.&lt;/span&gt;
&lt;span class="c1"&gt;// Your commit messages, PR titles, and PR bodies MUST NOT contain&lt;/span&gt;
&lt;span class="c1"&gt;// ANY Anthropic-internal information. Do not blow your cover.&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// NEVER include in commit messages or PR descriptions:&lt;/span&gt;
&lt;span class="c1"&gt;// - Internal model codenames (animal names like Capybara, Tengu, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - Unreleased model version numbers (e.g., opus-4-7, sonnet-4-8)&lt;/span&gt;
&lt;span class="c1"&gt;// - The phrase "Claude Code" or any mention that you are an AI&lt;/span&gt;
&lt;span class="c1"&gt;// - Co-Authored-By lines or any other attribution&lt;/span&gt;
&lt;span class="c1"&gt;//&lt;/span&gt;
&lt;span class="c1"&gt;// Write commit messages as a human developer would.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
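&lt;p&gt;The stripping behavior that comment describes amounts to a commit-message sanitizer. A minimal sketch — function and pattern names are mine, not from the leaked source:&lt;/p&gt;

```typescript
// Hypothetical attribution-stripping step for undercover mode.
const BANNED_LINES = [
  /^Co-Authored-By:.*$/gim,
  /^.*Generated with \[?Claude Code\]?.*$/gim,
];

function sanitizeCommitMessage(msg: string): string {
  let out = msg;
  for (const re of BANNED_LINES) out = out.replace(re, '');
  // Collapse the blank lines left behind by removed trailers.
  return out.replace(/\n{3,}/g, '\n\n').trim();
}
```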



&lt;p&gt;It auto-activates for public and open-source repositories. You can force it on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_UNDERCOVER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no force-OFF override. The safety default is undercover-on for public repos.&lt;/p&gt;

&lt;p&gt;The kicker: this feature is gated to internal Anthropic employees only (&lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;). Anthropic's own developers use Claude Code on open-source projects, and this mode ensures their contributions look human-written. Draw your own conclusions about what that means for AI attribution norms.&lt;/p&gt;
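&lt;p&gt;Pieced together from the article's details — internal-only gating, the &lt;code&gt;CLAUDE_CODE_UNDERCOVER&lt;/code&gt; force-on flag, default-on for public repos, no force-off — the activation logic presumably resembles this (the exact composition is my guess):&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the undercover-mode activation check.
// Flag and USER_TYPE names appear in the article; the wiring is a guess.
interface RepoInfo { isPublic: boolean; isOpenSource: boolean; }

function undercoverActive(
  env: Record<string, string | undefined>,
  userType: string,
  repo: RepoInfo,
): boolean {
  if (userType !== 'ant') return false;                   // internal employees only
  if (env['CLAUDE_CODE_UNDERCOVER'] === '1') return true; // force-on override
  return repo.isPublic || repo.isOpenSource;              // default-on for public repos
}
```

&lt;p&gt;Note there is no branch that turns it off for a public repo — consistent with the "no force-OFF override" observation above.&lt;/p&gt;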

&lt;h3&gt;
  
  
  3. Kairos — Permanent Memory
&lt;/h3&gt;

&lt;p&gt;Behind the feature flag &lt;code&gt;KAIROS&lt;/code&gt; in &lt;code&gt;main.tsx&lt;/code&gt; and the &lt;code&gt;memdir/&lt;/code&gt; directory, there is a persistent memory system that survives across sessions.&lt;/p&gt;

&lt;p&gt;This is not the &lt;code&gt;.claude/&lt;/code&gt; project memory you already know. Kairos is a four-stage memory consolidation pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Orient&lt;/strong&gt; — scan context, identify what matters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collect&lt;/strong&gt; — gather facts, decisions, patterns from the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consolidate&lt;/strong&gt; — merge new memories with existing long-term store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prune&lt;/strong&gt; — discard stale or low-value memories&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system runs automatically when you are not actively using Claude Code. It tracks memory age, performs periodic scans, and supports team memory paths — meaning shared memory across a team's Claude Code instances.&lt;/p&gt;
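&lt;p&gt;The consolidate and prune stages can be sketched in a few lines. The stage names come from the article; the data shapes, dedup strategy, and pruning horizon here are all my assumptions:&lt;/p&gt;

```typescript
// Hypothetical sketch of Kairos's consolidate + prune stages.
interface Memory { text: string; createdAt: number; value: number; }

const MAX_AGE_MS = 30 * 24 * 60 * 60 * 1000; // pruning horizon — invented

function consolidate(store: Memory[], session: Memory[], now: number): Memory[] {
  // Orient + Collect are assumed to have produced `session` already.
  const seen = new Set(store.map(m => m.text));
  const merged = [...store, ...session.filter(m => !seen.has(m.text))];      // Consolidate
  return merged.filter(m => now - m.createdAt < MAX_AGE_MS && m.value > 0);  // Prune
}
```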

&lt;p&gt;This turns Claude Code from a stateless tool into a persistent assistant that learns your codebase, your patterns, and your preferences over time. It is the most architecturally significant hidden feature in the leak.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ultraplan — Deep Task Planning
&lt;/h3&gt;

&lt;p&gt;The feature flag &lt;code&gt;ULTRAPLAN&lt;/code&gt; in &lt;code&gt;commands.ts&lt;/code&gt; enables a deep planning mode that can run for up to 30 minutes on a single task. It uses remote agent execution — meaning the heavy thinking happens server-side, not in your terminal.&lt;/p&gt;

&lt;p&gt;Ultraplan is listed under &lt;code&gt;INTERNAL_ONLY_COMMANDS&lt;/code&gt;. Anthropic's engineers apparently have access to a planning mode that goes far beyond what ships to paying customers. This is the kind of feature that separates "AI autocomplete" from "AI architect."&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Multi-Agent, Voice, and Daemon Modes
&lt;/h3&gt;

&lt;p&gt;The source reveals several execution modes that are not publicly documented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator mode&lt;/strong&gt; — orchestrates multiple Claude instances running in parallel, each working on a subtask&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice mode&lt;/strong&gt; (&lt;code&gt;VOICE_MODE&lt;/code&gt; flag) — voice input/output for Claude Code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridge mode&lt;/strong&gt; (&lt;code&gt;BRIDGE_MODE&lt;/code&gt;) — remote control of a Claude Code instance from another process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daemon mode&lt;/strong&gt; (&lt;code&gt;DAEMON&lt;/code&gt;) — runs Claude Code as a background process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDS inbox&lt;/strong&gt; (&lt;code&gt;UDS_INBOX&lt;/code&gt;) — Unix domain socket for inter-process communication between Claude instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these paint a picture of Claude Code evolving from a single-user CLI into a multi-agent orchestration platform. The daemon + UDS architecture means Claude Code instances can message each other, coordinate work, and run without a terminal attached.&lt;/p&gt;
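&lt;p&gt;An inbox over a Unix domain socket typically means framed messages on a byte stream. The article doesn't show the wire format, but a plausible minimal framing is newline-delimited JSON — here as a sketch with an invented message shape:&lt;/p&gt;

```typescript
// Hypothetical framing for an inter-instance inbox: newline-delimited JSON
// over a Unix domain socket. The message shape is invented for illustration.
interface InboxMessage { from: string; kind: 'task' | 'result' | 'ping'; body: string; }

function encode(msg: InboxMessage): string {
  return JSON.stringify(msg) + '\n';
}

// Accumulates a socket's byte stream and yields complete messages; the
// trailing partial line is carried over to the next chunk.
function decodeChunk(buffer: string, chunk: string): { buffer: string; messages: InboxMessage[] } {
  const data = buffer + chunk;
  const lines = data.split('\n');
  const rest = lines.pop() ?? '';
  return { buffer: rest, messages: lines.filter(Boolean).map(l => JSON.parse(l)) };
}
```

&lt;p&gt;With framing like this, any Claude Code instance (or the daemon) can write to another instance's socket without the two sharing a terminal.&lt;/p&gt;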

&lt;h2&gt;
  
  
  The Core Architecture
&lt;/h2&gt;

&lt;p&gt;The entire Claude Code engine lives in &lt;code&gt;queryLoop()&lt;/code&gt; at &lt;code&gt;query.ts&lt;/code&gt; line 241. At line 307, there is a &lt;code&gt;while(true)&lt;/code&gt; loop that drives everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;callModel()&lt;/code&gt; sends the conversation to the LLM&lt;/li&gt;
&lt;li&gt;The LLM returns text and &lt;code&gt;tool_use&lt;/code&gt; JSON blocks&lt;/li&gt;
&lt;li&gt;The program parses each &lt;code&gt;tool_use&lt;/code&gt;, checks permissions, executes the tool&lt;/li&gt;
&lt;li&gt;Results feed back into the conversation&lt;/li&gt;
&lt;li&gt;Loop continues until the LLM stops requesting tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the "LLM talks, program walks" pattern I wrote about &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;previously&lt;/a&gt;. The LLM decides what to do. The program decides whether to allow it, then does it. Seeing it confirmed in 510K lines of production code is satisfying.&lt;/p&gt;
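&lt;p&gt;The five steps above reduce to a loop you can sketch in a screenful. This is a simplification of the shape described, not the actual &lt;code&gt;query.ts&lt;/code&gt; code — the types and helper signatures are stand-ins:&lt;/p&gt;

```typescript
// Minimal sketch of the queryLoop() pattern: model proposes tool calls,
// program gates and executes them, results feed back in, repeat.
type Block =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; name: string; input: unknown };

async function queryLoop(
  callModel: (history: Block[]) => Promise<Block[]>,
  runTool: (name: string, input: unknown) => Promise<string>,
  canUse: (name: string) => boolean,
): Promise<Block[]> {
  const history: Block[] = [];
  while (true) {
    const reply = await callModel(history);       // 1. send conversation
    history.push(...reply);                       // 2. text + tool_use blocks
    const toolCalls = reply.filter(b => b.type === 'tool_use');
    if (toolCalls.length === 0) break;            // 5. no tools requested: done
    for (const call of toolCalls) {
      if (call.type !== 'tool_use') continue;
      const result = canUse(call.name)            // 3. permission check
        ? await runTool(call.name, call.input)    //    then execute
        : 'permission denied';
      history.push({ type: 'text', text: result }); // 4. feed result back
    }
  }
  return history;
}
```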

&lt;h2&gt;
  
  
  Security Architecture
&lt;/h2&gt;

&lt;p&gt;Claude Code's permission system is the most carefully engineered part of the codebase. Every tool call passes through six layers, implemented in &lt;code&gt;useCanUseTool.tsx&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Config allowlist&lt;/strong&gt; — checks project and user configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-mode classifier&lt;/strong&gt; — determines if the tool is safe for autonomous execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator gate&lt;/strong&gt; — validates against the orchestration layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swarm worker gate&lt;/strong&gt; — checks permissions for sub-agent execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bash classifier&lt;/strong&gt; — analyzes shell commands for safety&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive user prompt&lt;/strong&gt; — final human confirmation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;External commands run in a sandbox. This is defense-in-depth done right. The irony is that the company that built this careful permission model forgot to strip a source map from their npm package.&lt;/p&gt;
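&lt;p&gt;Structurally, six layers like these compose naturally as a short-circuiting chain: each gate either decides outright or defers to the next, and only a full fall-through reaches the human. The layer names come from the article; the verdict type and signatures below are mine:&lt;/p&gt;

```typescript
// Hypothetical sketch of chaining permission gates. 'ask' means
// "no opinion — escalate"; a definitive verdict short-circuits.
type Verdict = 'allow' | 'deny' | 'ask';
type Gate = (tool: string, input: unknown) => Verdict;

function decide(gates: Gate[], tool: string, input: unknown): Verdict {
  for (const gate of gates) {
    const v = gate(tool, input);
    if (v !== 'ask') return v;
  }
  return 'ask'; // fell through every layer: prompt the user interactively
}
```

&lt;p&gt;The design property worth noting: a deny anywhere in the chain wins, and the interactive prompt is the fallback rather than the first resort — which is what makes autonomous execution tolerable.&lt;/p&gt;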

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;The moat for AI coding tools is not the CLI. It is the model. Anyone can read this source code and understand the architecture, but nobody can replicate Sonnet or Opus. The &lt;code&gt;queryLoop()&lt;/code&gt; pattern is elegant but simple — the magic is in what &lt;code&gt;callModel()&lt;/code&gt; returns. That said, the product roadmap is now public. Competitors know about Kairos, Ultraplan, multi-agent coordination, and voice mode. That is real strategic damage.&lt;/p&gt;

&lt;p&gt;For a company that positions itself as the responsible AI lab — the one that takes safety seriously — shipping a fully readable source map to a public registry is a notable operational security failure. The six-layer permission system in the code is impressive. The process that let a 60MB source map slip through CI/CD is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch the Deep Dive
&lt;/h2&gt;

&lt;p&gt;I broke down the full AI agent architecture — the same query loop that Claude Code uses — in a 15-minute video: &lt;a href="https://youtu.be/giNERYV-X7k" rel="noopener noreferrer"&gt;Watch on YouTube&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For background on the "LLM talks, program walks" pattern: &lt;a href="https://harrisonsec.com/blog/ai-stack-explained-llm-talks-program-walks/" rel="noopener noreferrer"&gt;Read: The AI Stack Explained — LLM Talks, Program Walks&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Coming next: a deep dive into Claude Code's 6-layer permission system and the Kairos memory architecture — with full code walkthroughs. Subscribe to catch it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>anthropic</category>
      <category>agents</category>
      <category>security</category>
    </item>
  </channel>
</rss>
