<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Harrison Guo</title>
    <description>The latest articles on Forem by Harrison Guo (@harrisonsec).</description>
    <link>https://forem.com/harrisonsec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809272%2F593698c5-7201-4bb0-898e-055cdbc0a2d2.png</url>
      <title>Forem: Harrison Guo</title>
      <link>https://forem.com/harrisonsec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/harrisonsec"/>
    <language>en</language>
    <item>
      <title>Node Turns Waiting Into Events. Go Moves Context Switching Into User Space.</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Tue, 28 Apr 2026 18:01:12 +0000</pubDate>
      <link>https://forem.com/harrisonsec/node-turns-waiting-into-events-go-moves-context-switching-into-user-space-58ik</link>
      <guid>https://forem.com/harrisonsec/node-turns-waiting-into-events-go-moves-context-switching-into-user-space-58ik</guid>
      <description>&lt;p&gt;Most discussions of TypeScript/Node vs Go concurrency stop at the surface: &lt;em&gt;Node is async, Go is threaded.&lt;/em&gt; That framing isn't wrong — it just isn't deep enough to be useful when you're picking a runtime, debugging a tail-latency problem, or explaining to your team why one of the services keeps falling over under CPU load.&lt;/p&gt;

&lt;p&gt;The real difference is not async vs threaded. It's a question about where, in the system, suspended work lives — and what shape it takes when it's resumed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Both Node and Go refuse to let the CPU sit idle while a request waits on I/O. They disagree on the unit of scheduling. Node's unit is the &lt;em&gt;continuation&lt;/em&gt; — the tail of an async function captured as a heap closure. Go's unit is the &lt;em&gt;goroutine&lt;/em&gt; — a full call stack the runtime can suspend and resume in user space. That single decision cascades into every other property of each runtime.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Question
&lt;/h2&gt;

&lt;p&gt;"Async vs threaded" is the wrong frame because it makes you think the choice is between paradigms. It isn't. Both runtimes have already made the &lt;em&gt;same&lt;/em&gt; fundamental decision: do not block an OS thread waiting for slow external work. The interesting choice is &lt;em&gt;how&lt;/em&gt; they implement that.&lt;/p&gt;

&lt;p&gt;The actually useful question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When a request is waiting for I/O — for a database, an HTTP call, a Redis round-trip, a file read — &lt;strong&gt;what does the CPU do, and where does the suspended state of that request live?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you frame it that way, Node and Go aren't opposites. They're two answers to the same question — and each answer cascades into a different language shape, a different library style, and a different failure mode under load.&lt;/p&gt;

&lt;p&gt;The naive blocking model answers the question with "an OS thread waits for the syscall to return." That model collapses around a few thousand concurrent connections — memory per thread, scheduler overhead, kernel context-switch cost. By 40,000 connections you're out of RAM, not CPU. Node and Go both refuse to do this. They diverge on &lt;em&gt;which resource gets freed up&lt;/em&gt; and &lt;em&gt;how the suspended work is captured for later resumption.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Node's Answer: Turn Waiting Into an Event
&lt;/h2&gt;

&lt;p&gt;Node's model can be summarized in one line: &lt;strong&gt;the JS main thread only executes code that's already ready to run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads as if the function is paused, blocking on the database. It isn't. When V8 compiles an &lt;code&gt;async&lt;/code&gt; function, it effectively rewrites the body into a resumable state machine, with each &lt;code&gt;await&lt;/code&gt; becoming a suspension point.&lt;/p&gt;

&lt;p&gt;The function above gets transformed into something equivalent to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;asyncFn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;closure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;                  &lt;span class="c1"&gt;// heap object holding locals&lt;/span&gt;

    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
          &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     &lt;span class="c1"&gt;// await → register continuation&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                         &lt;span class="c1"&gt;// ← function POPS here&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="nx"&gt;closure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;           &lt;span class="c1"&gt;// resume: locals live in closure&lt;/span&gt;
          &lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;closure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;await&lt;/code&gt; is not a pause.&lt;/strong&gt; It's the point at which V8 returns from the function and pops the JS stack frame. The "rest of the function" is captured as a continuation registered on the awaited Promise via &lt;code&gt;.then&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local variables move to the heap.&lt;/strong&gt; Because the stack frame is gone, locals (&lt;code&gt;user&lt;/code&gt; here) live in a heap closure, accessible only when the state machine resumes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Each &lt;code&gt;await&lt;/code&gt; slices the function into another state.&lt;/strong&gt; A function with two &lt;code&gt;await&lt;/code&gt;s runs in three event-loop turns, with three independently-pushed JS frames, with all live state stored in heap closures between them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That third point is the most non-obvious. A single &lt;code&gt;async&lt;/code&gt; function is &lt;strong&gt;not&lt;/strong&gt; one unit of execution — it's a sequence of fresh frames separated by event-loop turns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBFTCBhcyBFdmVudCBMb29wIChsaWJ1dikKICAgIHBhcnRpY2lwYW50IEpTIGFzIEpTIE1haW4gVGhyZWFkIChWOCkKICAgIHBhcnRpY2lwYW50IEggYXMgSGVhcCAoY2xvc3VyZXMpCiAgICBwYXJ0aWNpcGFudCBLIGFzIEtlcm5lbCAvIEkvTwoKICAgIHJlY3QgcmdiKDI1NCwgMjQzLCAxOTkpCiAgICBOb3RlIG92ZXIgRUwsSzogVHVybiAxCiAgICBFTC0-PkpTOiBkaXNwYXRjaCBoYW5kbGVyKCkKICAgIGFjdGl2YXRlIEpTCiAgICBOb3RlIG92ZXIgSlM6IGNvbnN0IGEgPSAxCiAgICBKUy0-PkpTOiBjYWxsIGNvbXB1dGUxKCkg4oaSIHJldHVybnMgUHJvbWlzZQogICAgSlMtPj5IOiBWOCBzdG9yZXMgY2xvc3VyZSB7c3RhdGU6MSwgYX0KICAgIEpTLT4-SDogcmVnaXN0ZXIgc3RlcCBhcyAudGhlbiBoYW5kbGVyCiAgICBKUy0tPj5FTDogaGFuZGxlciBmcmFtZSBQT1BQRUQsIHJldHVybnMgUHJvbWlzZQogICAgZGVhY3RpdmF0ZSBKUwogICAgZW5kCgogICAgRUwtPj5LOiBlcG9sbF93YWl0IChubyBtaWNyb3Rhc2tzKQogICAgTm90ZSBvdmVyIEVMLEs6IC4uLiB0aW1lIHBhc3NlcywgT1MgdGhyZWFkIHBhcmtlZCAuLi4KICAgIEstLT4-RUw6IEkvTyByZWFkeSAoY29tcHV0ZTEgcmVzb2x2ZWQpCiAgICBFTC0-PkVMOiBlbnF1ZXVlIHN0ZXAgaW4gVjggbWljcm90YXNrIHF1ZXVlCgogICAgcmVjdCByZ2IoMjE5LCAyMzQsIDI1NCkKICAgIE5vdGUgb3ZlciBFTCxLOiBUdXJuIDIKICAgIEVMLT4-SlM6IGludm9rZSBzdGVwKHZhbHVlKSDigJQgTkVXIGZyYW1lCiAgICBhY3RpdmF0ZSBKUwogICAgSlMtPj5IOiBsb2FkIGNsb3N1cmUge3N0YXRlOjEsIGF9CiAgICBOb3RlIG92ZXIgSlM6IHggPSB2YWx1ZSwgc3RhdGUg4oaSIDIKICAgIEpTLT4-SlM6IGNhbGwgY29tcHV0ZTIoKSDihpIgcmV0dXJucyBQcm9taXNlCiAgICBKUy0-Pkg6IHJlZ2lzdGVyIHN0ZXAgKG5leHQgc3RhdGUpCiAgICBKUy0tPj5FTDogZnJhbWUgUE9QUEVEIGFnYWluCiAgICBkZWFjdGl2YXRlIEpTCiAgICBlbmQKCiAgICBLLS0-PkVMOiBjb21wdXRlMiByZXNvbHZlZAogICAgRUwtPj5FTDogZW5xdWV1ZSBzdGVwCgogICAgcmVjdCByZ2IoMjIwLCAyNTIsIDIzMSkKICAgIE5vdGUgb3ZlciBFTCxLOiBUdXJuIDMKICAgIEVMLT4-SlM6IGludm9rZSBzdGVwKHZhbHVlKSDigJQgeWV0IGFub3RoZXIgbmV3IGZyYW1lCiAgICBhY3RpdmF0ZSBKUwogICAgSlMtPj5IOiBsb2FkIGNsb3N1cmUge3N0YXRlOjIsIGEsIHh9CiAgICBOb3RlIG92ZXIgSlM6IHkgPSB2YWx1ZSwgc3RhdGUg4oaSIGRvbmUKICAgIEpTLT4-SlM6IHJlcy5qc29uKGEgKyB4ICsgeSkKICAgIEpTLS0-PkVMOiBoYW5kbGVyJ3MgUHJvbWlzZSByZXNvbHZlZAogICAgZGVhY3RpdmF0ZSBKUwogICAgZW5k" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBhdXRvbnVtYmVyCiAgICBwYXJ0aWNpcGFudCBFTCBhcyBFdmVudCBMb29wIChsaWJ1dikKICAgIHBhcnRpY2lwYW50IEpTIGFzIEpTIE1haW4gVGhyZWFkIChWOCkKICAgIHBhcnRpY2lwYW50IEggYXMgSGVhcCAoY2xvc3VyZXMpCiAgICBwYXJ0aWNpcGFudCBLIGFzIEtlcm5lbCAvIEkvTwoKICAgIHJlY3QgcmdiKDI1NCwgMjQzLCAxOTkpCiAgICBOb3RlIG92ZXIgRUwsSzogVHVybiAxCiAgICBFTC0-PkpTOiBkaXNwYXRjaCBoYW5kbGVyKCkKICAgIGFjdGl2YXRlIEpTCiAgICBOb3RlIG92ZXIgSlM6IGNvbnN0IGEgPSAxCiAgICBKUy0-PkpTOiBjYWxsIGNvbXB1dGUxKCkg4oaSIHJldHVybnMgUHJvbWlzZQogICAgSlMtPj5IOiBWOCBzdG9yZXMgY2xvc3VyZSB7c3RhdGU6MSwgYX0KICAgIEpTLT4-SDogcmVnaXN0ZXIgc3RlcCBhcyAudGhlbiBoYW5kbGVyCiAgICBKUy0tPj5FTDogaGFuZGxlciBmcmFtZSBQT1BQRUQsIHJldHVybnMgUHJvbWlzZQogICAgZGVhY3RpdmF0ZSBKUwogICAgZW5kCgogICAgRUwtPj5LOiBlcG9sbF93YWl0IChubyBtaWNyb3Rhc2tzKQogICAgTm90ZSBvdmVyIEVMLEs6IC4uLiB0aW1lIHBhc3NlcywgT1MgdGhyZWFkIHBhcmtlZCAuLi4KICAgIEstLT4-RUw6IEkvTyByZWFkeSAoY29tcHV0ZTEgcmVzb2x2ZWQpCiAgICBFTC0-PkVMOiBlbnF1ZXVlIHN0ZXAgaW4gVjggbWljcm90YXNrIHF1ZXVlCgogICAgcmVjdCByZ2IoMjE5LCAyMzQsIDI1NCkKICAgIE5vdGUgb3ZlciBFTCxLOiBUdXJuIDIKICAgIEVMLT4-SlM6IGludm9rZSBzdGVwKHZhbHVlKSDigJQgTkVXIGZyYW1lCiAgICBhY3RpdmF0ZSBKUwogICAgSlMtPj5IOiBsb2FkIGNsb3N1cmUge3N0YXRlOjEsIGF9CiAgICBOb3RlIG92ZXIgSlM6IHggPSB2YWx1ZSwgc3RhdGUg4oaSIDIKICAgIEpTLT4-SlM6IGNhbGwgY29tcHV0ZTIoKSDihpIgcmV0dXJucyBQcm9taXNlCiAgICBKUy0-Pkg6IHJlZ2lzdGVyIHN0ZXAgKG5leHQgc3RhdGUpCiAgICBKUy0tPj5FTDogZnJhbWUgUE9QUEVEIGFnYWluCiAgICBkZWFjdGl2YXRlIEpTCiAgICBlbmQKCiAgICBLLS0-PkVMOiBjb21wdXRlMiByZXNvbHZlZAogICAgRUwtPj5FTDogZW5xdWV1ZSBzdGVwCgogICAgcmVjdCByZ2IoMjIwLCAyNTIsIDIzMSkKICAgIE5vdGUgb3ZlciBFTCxLOiBUdXJuIDMKICAgIEVMLT4-SlM6IGludm9rZSBzdGVwKHZhbHVlKSDigJQgeWV0IGFub3RoZXIgbmV3IGZyYW1lCiAgICBhY3RpdmF0ZSBKUwogICAgSlMtPj5IOiBsb2FkIGNsb3N1cmUge3N0YXRlOjIsIGEsIHh9CiAgICBOb3RlIG92ZXIgSlM6IHkgPSB2YWx1ZSwgc3RhdGUg4oaSIGRvbmUKICAgIEpTLT4-SlM6IHJlcy5qc29uKGEgKyB4ICsgeSkKICAgIEpTLS0-PkVMOiBoYW5kbGVyJ3MgUHJvbWlzZSByZXNvbHZlZAogICAgZGVhY3RpdmF0ZSBKUwogICAgZW5k" alt="sequenceDiagram" width="1103" height="1604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no "paused" function. There are only &lt;em&gt;captured continuations&lt;/em&gt; and &lt;em&gt;fresh frames that resume them&lt;/em&gt;. The event loop is the dispatcher: it watches for I/O readiness via libuv, for resolved Promises (via V8's microtask queue), for timers — and pulls the corresponding continuation onto the JS thread when it's ready to run. One thread can manage tens of thousands of concurrent connections, because at any moment only a handful of them have work that's actually ready.&lt;/p&gt;

&lt;p&gt;This is event-driven concurrency in its precise sense — the runtime turns "waiting" into a registered event, and only resumes the captured continuation when the event fires.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Visible Side Effect: Function Color
&lt;/h3&gt;

&lt;p&gt;Because the suspension point has to be marked at compile time, async-ness becomes part of the function's &lt;em&gt;type&lt;/em&gt;. A function that does I/O returns &lt;code&gt;Promise&amp;lt;T&amp;gt;&lt;/code&gt;. Its callers must &lt;code&gt;await&lt;/code&gt; it. Once they &lt;code&gt;await&lt;/code&gt;, they themselves return &lt;code&gt;Promise&amp;lt;T&amp;gt;&lt;/code&gt;. The "color" propagates up the call stack until you hit an async-aware entry point — typically the top of an HTTP handler or the event loop itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/" rel="noopener noreferrer"&gt;Bob Nystrom named this the function color problem&lt;/a&gt; in 2015. It's not a notation choice — it's a &lt;strong&gt;logical consequence of the stackless coroutine model&lt;/strong&gt;. V8 cannot save and restore arbitrary JS call stacks. The only way to express suspension is "return a Promise and be marked &lt;code&gt;async&lt;/code&gt;," and once one function does that, every function on the way up has to do the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBOb2RlWyI8Yj5Ob2RlIOKAlCBDb2xvciBDYXNjYWRlcyBVcCB0aGUgQ2FsbCBTdGFjazwvYj4iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIG4xWyI8Yj5yZWFkRnJvbURCKCk8L2I-IPCfn6U8YnIvPuKGkiBQcm9taXNlJmx0O0RhdGEmZ3Q7PGJyLz48Yj5kb2VzIEkvTzwvYj4iXQogICAgICAgIG4yWyI8Yj5mZXRjaFVzZXIoKTwvYj4g8J-fpTxici8-4oaSIFByb21pc2UmbHQ7VXNlciZndDs8YnIvPjxiPm11c3QgYXdhaXQgcmVhZEZyb21EQjwvYj4iXQogICAgICAgIG4zWyI8Yj5oYW5kbGVSZXF1ZXN0KCk8L2I-IPCfn6U8YnIvPuKGkiBQcm9taXNlJmx0O1Jlc3BvbnNlJmd0Ozxici8-PGI-bXVzdCBhd2FpdCBmZXRjaFVzZXI8L2I-Il0KICAgICAgICBuNFsiPGI-cm91dGUoJy91c2VyJywgaGFuZGxlcik8L2I-IPCfn6U8YnIvPjxiPm11c3QgYWNjZXB0IFByb21pc2UgcmV0dXJuPC9iPiJdCiAgICAgICAgbjVbIjxiPm1haW4oKTwvYj4g8J-fpTxici8-4oaSIFByb21pc2UmbHQ7dm9pZCZndDs8YnIvPjxiPnRvcC1sZXZlbCBuZWVkcyBhd2FpdDwvYj4iXQogICAgICAgIG4xIC0uY29sb3IgaW5mZWN0cy4tPiBuMgogICAgICAgIG4yIC0uY29sb3IgaW5mZWN0cy4tPiBuMwogICAgICAgIG4zIC0uY29sb3IgaW5mZWN0cy4tPiBuNAogICAgICAgIG40IC0uY29sb3IgaW5mZWN0cy4tPiBuNQogICAgZW5kCgogICAgc3ViZ3JhcGggR29bIjxiPkdvIOKAlCBObyBDb2xvciwgTm8gQ2FzY2FkZTwvYj4iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIGcxWyI8Yj5yZWFkRnJvbURCKCk8L2I-IOKsnDxici8-4oaSIERhdGE8YnIvPjxiPmJsb2NrcyBvbiBJL08gaW50ZXJuYWxseTwvYj4iXQogICAgICAgIGcyWyI8Yj5mZXRjaFVzZXIoKTwvYj4g4qycPGJyLz7ihpIgVXNlcjxici8-PGI-cGxhaW4gY2FsbDwvYj4iXQogICAgICAgIGczWyI8Yj5oYW5kbGVSZXF1ZXN0KCk8L2I-IOKsnDxici8-4oaSIFJlc3BvbnNlPGJyLz48Yj5wbGFpbiBjYWxsPC9iPiJdCiAgICAgICAgZzRbIjxiPnJvdXRlKCcvdXNlcicsIGhhbmRsZXIpPC9iPiDirJw8YnIvPjxiPmhhbmRsZXIgaXMgYSBwbGFpbiBmdW5jPC9iPiJdCiAgICAgICAgZzVbIjxiPm1haW4oKTwvYj4g4qycPGJyLz48Yj5wbGFpbiBmdW5jPC9iPiJdCiAgICAgICAgZzEgLS0-IGcyCiAgICAgICAgZzIgLS0-IGczCiAgICAgICAgZzMgLS0-IGc0CiAgICAgICAgZzQgLS0-IGc1CiAgICBlbmQKCiAgICBOb2RlIH5-fiBHbwoKICAgIGNsYXNzRGVmIHJlZENsYXNzIGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNixzdHJva2Utd2lkdGg6MnB4LGNvbG9yOiM3ZjFkMWQKICAgIGNsYXNzRGVmIHBsYWluQ2xhc3MgZmlsbDojZjNmNGY2LHN0cm9rZTojNmI3MjgwLHN0cm9rZS13aWR0aDoycHgsY29sb3I6IzExMTgyNwoKICAgIGNsYXNzIG4xLG4yLG4zLG40LG41IHJlZENsYXNzCiAgICBjbGFzcyBnMSxnMixnMyxnNCxnNSBwbGFpbkNsYXNz" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBOb2RlWyI8Yj5Ob2RlIOKAlCBDb2xvciBDYXNjYWRlcyBVcCB0aGUgQ2FsbCBTdGFjazwvYj4iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIG4xWyI8Yj5yZWFkRnJvbURCKCk8L2I-IPCfn6U8YnIvPuKGkiBQcm9taXNlJmx0O0RhdGEmZ3Q7PGJyLz48Yj5kb2VzIEkvTzwvYj4iXQogICAgICAgIG4yWyI8Yj5mZXRjaFVzZXIoKTwvYj4g8J-fpTxici8-4oaSIFByb21pc2UmbHQ7VXNlciZndDs8YnIvPjxiPm11c3QgYXdhaXQgcmVhZEZyb21EQjwvYj4iXQogICAgICAgIG4zWyI8Yj5oYW5kbGVSZXF1ZXN0KCk8L2I-IPCfn6U8YnIvPuKGkiBQcm9taXNlJmx0O1Jlc3BvbnNlJmd0Ozxici8-PGI-bXVzdCBhd2FpdCBmZXRjaFVzZXI8L2I-Il0KICAgICAgICBuNFsiPGI-cm91dGUoJy91c2VyJywgaGFuZGxlcik8L2I-IPCfn6U8YnIvPjxiPm11c3QgYWNjZXB0IFByb21pc2UgcmV0dXJuPC9iPiJdCiAgICAgICAgbjVbIjxiPm1haW4oKTwvYj4g8J-fpTxici8-4oaSIFByb21pc2UmbHQ7dm9pZCZndDs8YnIvPjxiPnRvcC1sZXZlbCBuZWVkcyBhd2FpdDwvYj4iXQogICAgICAgIG4xIC0uY29sb3IgaW5mZWN0cy4tPiBuMgogICAgICAgIG4yIC0uY29sb3IgaW5mZWN0cy4tPiBuMwogICAgICAgIG4zIC0uY29sb3IgaW5mZWN0cy4tPiBuNAogICAgICAgIG40IC0uY29sb3IgaW5mZWN0cy4tPiBuNQogICAgZW5kCgogICAgc3ViZ3JhcGggR29bIjxiPkdvIOKAlCBObyBDb2xvciwgTm8gQ2FzY2FkZTwvYj4iXQogICAgICAgIGRpcmVjdGlvbiBUQgogICAgICAgIGcxWyI8Yj5yZWFkRnJvbURCKCk8L2I-IOKsnDxici8-4oaSIERhdGE8YnIvPjxiPmJsb2NrcyBvbiBJL08gaW50ZXJuYWxseTwvYj4iXQogICAgICAgIGcyWyI8Yj5mZXRjaFVzZXIoKTwvYj4g4qycPGJyLz7ihpIgVXNlcjxici8-PGI-cGxhaW4gY2FsbDwvYj4iXQogICAgICAgIGczWyI8Yj5oYW5kbGVSZXF1ZXN0KCk8L2I-IOKsnDxici8-4oaSIFJlc3BvbnNlPGJyLz48Yj5wbGFpbiBjYWxsPC9iPiJdCiAgICAgICAgZzRbIjxiPnJvdXRlKCcvdXNlcicsIGhhbmRsZXIpPC9iPiDirJw8YnIvPjxiPmhhbmRsZXIgaXMgYSBwbGFpbiBmdW5jPC9iPiJdCiAgICAgICAgZzVbIjxiPm1haW4oKTwvYj4g4qycPGJyLz48Yj5wbGFpbiBmdW5jPC9iPiJdCiAgICAgICAgZzEgLS0-IGcyCiAgICAgICAgZzIgLS0-IGczCiAgICAgICAgZzMgLS0-IGc0CiAgICAgICAgZzQgLS0-IGc1CiAgICBlbmQKCiAgICBOb2RlIH5-fiBHbwoKICAgIGNsYXNzRGVmIHJlZENsYXNzIGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNixzdHJva2Utd2lkdGg6MnB4LGNvbG9yOiM3ZjFkMWQKICAgIGNsYXNzRGVmIHBsYWluQ2xhc3MgZmlsbDojZjNmNGY2LHN0cm9rZTojNmI3MjgwLHN0cm9rZS13aWR0aDoycHgsY29sb3I6IzExMTgyNwoKICAgIGNsYXNzIG4xLG4yLG4zLG40LG41IHJlZENsYXNzCiAgICBjbGFzcyBnMSxnMixnMyxnNCxnNSBwbGFpbkNsYXNz" alt="flowchart LR" width="714" height="997"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hard Limit
&lt;/h3&gt;

&lt;p&gt;The model fails the moment your code stops waiting. A single CPU-bound operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* heavy work */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…holds the JS main thread, and &lt;em&gt;every other request on this process is dead&lt;/em&gt; until it returns. The event loop has nowhere else to go. Worker threads, child processes, or splitting CPU work into a separate service are real fixes, but they're escape hatches — they exist because the core model gives you exactly one thread executing JS.&lt;/p&gt;




&lt;h2&gt;
  
  
  Go's Answer: Move Context Switching Into User Space
&lt;/h2&gt;

&lt;p&gt;In Go, you write synchronous code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sendResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;await&lt;/code&gt;. There is no callback. The function looks like it blocks on the database. And yet the program scales to hundreds of thousands of concurrent operations on modest hardware.&lt;/p&gt;

&lt;p&gt;The trick is that the &lt;em&gt;scheduling boundary has been moved.&lt;/em&gt; Where Node has the programmer mark the suspension point with &lt;code&gt;await&lt;/code&gt; and the runtime captures a continuation, Go lets the programmer write straight-line code and has the &lt;em&gt;runtime&lt;/em&gt; suspend the entire goroutine when it hits a blocking I/O call.&lt;/p&gt;

&lt;p&gt;This is the central insight, and the cleanest one-line statement of Go's concurrency model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Go's essence is the user-space-ification of context switching.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A goroutine isn't an OS thread. It's a small (initially 2 KB) growable stack and a register snapshot, managed by the Go runtime. The runtime maps a large number of goroutines (G) onto a small number of OS threads (M) using scheduling contexts (P). This is the GMP model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;G&lt;/strong&gt; — a goroutine. The unit of scheduling. Cheap to create, cheap to suspend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M&lt;/strong&gt; — an OS thread. Only a handful actively run Go code; extra Ms may exist parked in blocking syscalls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt; — a scheduling context ("processor"). Exactly &lt;code&gt;GOMAXPROCS&lt;/code&gt; of them, each with its own run queue of Gs; an M must hold a P to run Go code.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;many G  →  Go scheduler  →  few M  →  CPU cores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a goroutine hits a blocking syscall or a channel wait, the Go runtime suspends the goroutine — saves its stack and registers — detaches it from the current M, and schedules another runnable goroutine onto that M. When the original goroutine's wait completes, it's marked runnable again, and some M eventually picks it up and resumes execution from the suspension point. &lt;strong&gt;None of this enters the kernel.&lt;/strong&gt; No &lt;code&gt;clone(2)&lt;/code&gt;, no kernel-mediated thread switch, no kernel scheduler queue. The bookkeeping is all in user space.&lt;/p&gt;

&lt;p&gt;That's the user-space-ification. The CPU still has to switch contexts when work shifts between goroutines, but the cost is roughly a function call plus a stack swap — not a kernel-mediated thread switch.&lt;/p&gt;
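
&lt;p&gt;To make the cost side of this concrete, here's a rough sketch you can run: park a large number of goroutines and see how much runtime memory each suspended goroutine costs. The exact figure varies by Go version and platform; expect a few kilobytes each, dominated by the initial stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "runtime"
    "sync"
)

// Park 100k goroutines on a channel and measure how much memory the runtime
// asked the OS for. The interesting number is the order of magnitude:
// a few KB per suspended goroutine, versus the ~8 MB of stack a default
// pthread reserves.
func main() {
    const n = 100_000
    block := make(chan struct{})
    var started sync.WaitGroup

    var before runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&amp;amp;before)

    started.Add(n)
    for i := 0; i &amp;lt; n; i++ {
        go func() {
            started.Done()
            &amp;lt;-block // park: the runtime saves this goroutine's stack and registers
        }()
    }
    started.Wait() // every goroutine exists and is parked

    var after runtime.MemStats
    runtime.GC()
    runtime.ReadMemStats(&amp;amp;after)

    fmt.Printf("%d parked goroutines, ~%d bytes each\n", n, (after.Sys-before.Sys)/n)
    close(block)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;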

&lt;p&gt;The key contrast with Node's model is in &lt;em&gt;where the suspended state lives:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBOb2RlWyI8Yj5Ob2RlIOKAlCBTdGFja2xlc3MgQ29yb3V0aW5lPC9iPiJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgblN0YWNrWyI8Yj5KUyBDYWxsIFN0YWNrPC9iPjxici8-KG9uZSBmcmFtZSBhdCBhIHRpbWUpPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPuKaoCA8Yj5jdXJyZW50bHkgZW1wdHk8L2I-PGJyLz4oYWxsIGFzeW5jIGZucyBwb3BwZWQsPGJyLz53YWl0aW5nIGluIGV2ZW50IGxvb3ApIl0KICAgICAgICBuSGVhcFsiPGI-SGVhcDwvYj4iXQogICAgICAgIG5DMVsiPGI-Y29udGludWF0aW9uICMxPC9iPjxici8-eyBzdGF0ZTogMSw8YnIvPiZuYnNwOyZuYnNwO2xvY2Fsczoge3JlcSwgcmVzLCBhfSw8YnIvPiZuYnNwOyZuYnNwO3N0ZXA6IGZuIHB0ciB9Il0KICAgICAgICBuQzJbIjxiPmNvbnRpbnVhdGlvbiAjMjwvYj48YnIvPnsgc3RhdGU6IDAsIC4uLiB9Il0KICAgICAgICBuQzNbIjxiPmNvbnRpbnVhdGlvbiAjMzwvYj48YnIvPnsgc3RhdGU6IDIsIC4uLiB9Il0KICAgICAgICBuSGVhcCAtLT4gbkMxCiAgICAgICAgbkhlYXAgLS0-IG5DMgogICAgICAgIG5IZWFwIC0tPiBuQzMKICAgICAgICBuTm90ZVsiPGI-RWFjaCA8Y29kZT5hd2FpdDwvY29kZT4gcG9wcyB0aGUgZnJhbWUuPC9iPjxici8-U3RhdGUgbGl2ZXMgb25seSBpbiBoZWFwIGNsb3N1cmVzLjxici8-U3RhY2sgaXMgcmV1c2VkIGFjcm9zcyBhbGwgdHVybnMuIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEdvWyI8Yj5HbyDigJQgU3RhY2tmdWwgQ29yb3V0aW5lPC9iPiJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgZ01bIjxiPk9TIFRocmVhZCAoTSk8L2I-PGJyLz5jdXJyZW50bHkgcnVubmluZyBHMyDilrYiXQogICAgICAgIGdIZWFwWyI8Yj5IZWFwPC9iPiJdCiAgICAgICAgZ0cxWyI8Yj5nb3JvdXRpbmUgRzE8L2I-ICgyIEtCIHN0YWNrKTxici8-4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSBPGJyLz5wcm9jZXNzKCk8YnIvPiZuYnNwOyZuYnNwO-KGsyBzbG93RG91YmxlKCk8YnIvPiZuYnNwOyZuYnNwOyZuYnNwOyZuYnNwO-KGsyB0aW1lLlNsZWVwKCkg4piFcGFya2VkIl0KICAgICAgICBnRzJbIjxiPmdvcm91dGluZSBHMjwvYj4gKDIgS0Igc3RhY2spPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPmhhbmRsZXIoKTxici8-Jm5ic3A7Jm5ic3A74oazIGRiLlF1ZXJ5KCkg4piFcGFya2VkIl0KICAgICAgICBnRzNbIjxiPmdvcm91dGluZSBHMzwvYj4gKDIgS0Igc3RhY2spPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPmN1cnJlbnRseSBvbiBNIOKWtiJdCiAgICAgICAgZ0hlYXAgLS0-IGdHMQogICAgICAgIGdIZWFwIC0tPiBnRzIKICAgICAgICBnSGVhcCAtLT4gZ0czCiAgICAgICAgZ05vdGVbIjxiPkVhY2ggZ29yb3V0aW5lIG93bnMgaXRzIGZ1bGwgc3RhY2suPC9iPjxici8-UnVudGltZSBzYXZlcy9yZXN0b3JlcyBlbnRpcmUgc3RhY2s8YnIvPm9uIHN1c3BlbmQuIE5vIGZyYW1lIHBvcCBuZWVkZWQuIl0KICAgIGVuZAoKICAgIE5vZGUgfn5-IEdvCgogICAgY2xhc3NEZWYgbm9kZUFsZXJ0IGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNixzdHJva2Utd2lkdGg6M3B4LGNvbG9yOiM3ZjFkMWQKICAgIGNsYXNzRGVmIG5vZGVDbGFzcyBmaWxsOiNmZWYzYzcsc3Ryb2tlOiNkOTc3MDYsY29sb3I6IzExMTgyNwogICAgY2xhc3NEZWYgZ29DbGFzcyBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMyNTYzZWIsY29sb3I6IzExMTgyNwogICAgY2xhc3NEZWYgbm90ZUNsYXNzIGZpbGw6I2ZmZmZmZixzdHJva2U6IzM3NDE1MSxzdHJva2Utd2lkdGg6MS41cHgsY29sb3I6IzExMTgyNwoKICAgIGNsYXNzIG5TdGFjayBub2RlQWxlcnQKICAgIGNsYXNzIG5IZWFwLG5DMSxuQzIsbkMzIG5vZGVDbGFzcwogICAgY2xhc3MgZ00sZ0hlYXAsZ0cxLGdHMixnRzMgZ29DbGFzcwogICAgY2xhc3Mgbk5vdGUsZ05vdGUgbm90ZUNsYXNz" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBOb2RlWyI8Yj5Ob2RlIOKAlCBTdGFja2xlc3MgQ29yb3V0aW5lPC9iPiJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgblN0YWNrWyI8Yj5KUyBDYWxsIFN0YWNrPC9iPjxici8-KG9uZSBmcmFtZSBhdCBhIHRpbWUpPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPuKaoCA8Yj5jdXJyZW50bHkgZW1wdHk8L2I-PGJyLz4oYWxsIGFzeW5jIGZucyBwb3BwZWQsPGJyLz53YWl0aW5nIGluIGV2ZW50IGxvb3ApIl0KICAgICAgICBuSGVhcFsiPGI-SGVhcDwvYj4iXQogICAgICAgIG5DMVsiPGI-Y29udGludWF0aW9uICMxPC9iPjxici8-eyBzdGF0ZTogMSw8YnIvPiZuYnNwOyZuYnNwO2xvY2Fsczoge3JlcSwgcmVzLCBhfSw8YnIvPiZuYnNwOyZuYnNwO3N0ZXA6IGZuIHB0ciB9Il0KICAgICAgICBuQzJbIjxiPmNvbnRpbnVhdGlvbiAjMjwvYj48YnIvPnsgc3RhdGU6IDAsIC4uLiB9Il0KICAgICAgICBuQzNbIjxiPmNvbnRpbnVhdGlvbiAjMzwvYj48YnIvPnsgc3RhdGU6IDIsIC4uLiB9Il0KICAgICAgICBuSGVhcCAtLT4gbkMxCiAgICAgICAgbkhlYXAgLS0-IG5DMgogICAgICAgIG5IZWFwIC0tPiBuQzMKICAgICAgICBuTm90ZVsiPGI-RWFjaCA8Y29kZT5hd2FpdDwvY29kZT4gcG9wcyB0aGUgZnJhbWUuPC9iPjxici8-U3RhdGUgbGl2ZXMgb25seSBpbiBoZWFwIGNsb3N1cmVzLjxici8-U3RhY2sgaXMgcmV1c2VkIGFjcm9zcyBhbGwgdHVybnMuIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEdvWyI8Yj5HbyDigJQgU3RhY2tmdWwgQ29yb3V0aW5lPC9iPiJdCiAgICAgICAgZGlyZWN0aW9uIFRCCiAgICAgICAgZ01bIjxiPk9TIFRocmVhZCAoTSk8L2I-PGJyLz5jdXJyZW50bHkgcnVubmluZyBHMyDilrYiXQogICAgICAgIGdIZWFwWyI8Yj5IZWFwPC9iPiJdCiAgICAgICAgZ0cxWyI8Yj5nb3JvdXRpbmUgRzE8L2I-ICgyIEtCIHN0YWNrKTxici8-4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSB4pSBPGJyLz5wcm9jZXNzKCk8YnIvPiZuYnNwOyZuYnNwO-KGsyBzbG93RG91YmxlKCk8YnIvPiZuYnNwOyZuYnNwOyZuYnNwOyZuYnNwO-KGsyB0aW1lLlNsZWVwKCkg4piFcGFya2VkIl0KICAgICAgICBnRzJbIjxiPmdvcm91dGluZSBHMjwvYj4gKDIgS0Igc3RhY2spPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPmhhbmRsZXIoKTxici8-Jm5ic3A7Jm5ic3A74oazIGRiLlF1ZXJ5KCkg4piFcGFya2VkIl0KICAgICAgICBnRzNbIjxiPmdvcm91dGluZSBHMzwvYj4gKDIgS0Igc3RhY2spPGJyLz7ilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIHilIE8YnIvPmN1cnJlbnRseSBvbiBNIOKWtiJdCiAgICAgICAgZ0hlYXAgLS0-IGdHMQogICAgICAgIGdIZWFwIC0tPiBnRzIKICAgICAgICBnSGVhcCAtLT4gZ0czCiAgICAgICAgZ05vdGVbIjxiPkVhY2ggZ29yb3V0aW5lIG93bnMgaXRzIGZ1bGwgc3RhY2suPC9iPjxici8-UnVudGltZSBzYXZlcy9yZXN0b3JlcyBlbnRpcmUgc3RhY2s8YnIvPm9uIHN1c3BlbmQuIE5vIGZyYW1lIHBvcCBuZWVkZWQuIl0KICAgIGVuZAoKICAgIE5vZGUgfn5-IEdvCgogICAgY2xhc3NEZWYgbm9kZUFsZXJ0IGZpbGw6I2ZlZTJlMixzdHJva2U6I2RjMjYyNixzdHJva2Utd2lkdGg6M3B4LGNvbG9yOiM3ZjFkMWQKICAgIGNsYXNzRGVmIG5vZGVDbGFzcyBmaWxsOiNmZWYzYzcsc3Ryb2tlOiNkOTc3MDYsY29sb3I6IzExMTgyNwogICAgY2xhc3NEZWYgZ29DbGFzcyBmaWxsOiNkYmVhZmUsc3Ryb2tlOiMyNTYzZWIsY29sb3I6IzExMTgyNwogICAgY2xhc3NEZWYgbm90ZUNsYXNzIGZpbGw6I2ZmZmZmZixzdHJva2U6IzM3NDE1MSxzdHJva2Utd2lkdGg6MS41cHgsY29sb3I6IzExMTgyNwoKICAgIGNsYXNzIG5TdGFjayBub2RlQWxlcnQKICAgIGNsYXNzIG5IZWFwLG5DMSxuQzIsbkMzIG5vZGVDbGFzcwogICAgY2xhc3MgZ00sZ0hlYXAsZ0cxLGdHMixnRzMgZ29DbGFzcwogICAgY2xhc3Mgbk5vdGUsZ05vdGUgbm90ZUNsYXNz" alt="flowchart LR" width="1904" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Node, the JS call stack is shared and almost always near-empty — every async function in flight has already popped, with its state sitting in a heap closure. In Go, every goroutine owns its full call chain on its own heap-allocated stack; suspended goroutines look like frozen frames waiting for the runtime to resume them on some OS thread.&lt;/p&gt;

&lt;p&gt;This is also why neither language can simply borrow the other's model. &lt;strong&gt;Node runs on V8&lt;/strong&gt;, which was designed in 2008 for browser JS — single call stack, synchronous semantics, no concept of saving stacks across yields. Adding stackful coroutines would mean rewriting the engine, which is roughly what Java's Project Loom did to the JVM at huge cost. &lt;strong&gt;Go was designed from scratch&lt;/strong&gt; with a runtime that owns stacks, can grow them, and can save them. The choice is locked in by runtime architecture, not language taste.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "User-Space" Actually Buys You
&lt;/h2&gt;

&lt;p&gt;The slogan only matters if user-space context switching is meaningfully cheaper than the kernel-mediated kind. It is — by more than an order of magnitude.&lt;/p&gt;

&lt;p&gt;The setup: two goroutines pinned to one OS thread (&lt;code&gt;GOMAXPROCS=1&lt;/code&gt;), ping-ponging via &lt;code&gt;runtime.Gosched()&lt;/code&gt; and via an unbuffered channel; versus two pthreads pinned to one core (&lt;code&gt;taskset -c 0&lt;/code&gt;), ping-ponging via &lt;code&gt;pthread_mutex&lt;/code&gt; + &lt;code&gt;pthread_cond&lt;/code&gt;. (Reproduction code at the end of the post.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measured on Intel N100, Ubuntu 24.04 (kernel 6.8.0), Go 1.23.4, gcc 13.3:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;ns / switch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Goroutine yield (&lt;code&gt;runtime.Gosched&lt;/code&gt;, GOMAXPROCS=1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~102 ns&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Goroutine round-trip via unbuffered channel&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~436 ns&lt;/strong&gt; (≈218 ns per G-switch + channel coordination)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pthread switch (mutex+cond ping-pong, single core)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~2,900 ns&lt;/strong&gt; (range 2,818–3,611 across 5 runs of 2M iterations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ratio: roughly &lt;strong&gt;28× cheaper&lt;/strong&gt; for the bare scheduler yield, &lt;strong&gt;~13× cheaper&lt;/strong&gt; for the apples-to-apples synchronized round-trip.&lt;/p&gt;

&lt;p&gt;Where the gap comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mode switch.&lt;/strong&gt; The user → kernel → user round-trip alone is ~100 ns of entry/exit and ABI-mandated register save/restore. A goroutine switch never crosses that line.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler work in kernel space.&lt;/strong&gt; Linux's scheduler (CFS, or EEVDF on recent kernels) keeps per-CPU runqueues of runnable threads behind locks, plus cross-CPU load balancing. The Go scheduler does the same job in user space with per-P local runqueues and lock-free fast paths — and skips the kernel locks entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache and TLB effects.&lt;/strong&gt; A kernel scheduler may migrate a thread to a different core, costing you cold L1/L2 and an instruction-cache reload. Goroutines normally stay on the same M, so the cache stays warm.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the model does &lt;em&gt;not&lt;/em&gt; buy you: a goroutine that makes a real blocking syscall still pays for a real OS thread switch — the runtime detaches the G from its M and may spin up another M so the rest of the goroutines keep running. Async preemption (Go 1.14+, signal-based) is the runtime's answer to tight loops that never yield, and it has its own cost. Once you saturate &lt;code&gt;GOMAXPROCS&lt;/code&gt;, the user-space runqueue itself starts to show up in profiles.&lt;/p&gt;
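
&lt;p&gt;The async-preemption point is easy to see for yourself. A minimal sketch, assuming Go 1.14 or newer: pin the runtime to a single P, spin one goroutine in a pure-CPU loop with no function calls, and note that the main goroutine still gets scheduled. On Go 1.13 and earlier the same program never gets past the &lt;code&gt;Sleep&lt;/code&gt;, because cooperative preemption only happened at function-call boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // one P: only one goroutine can execute Go code at a time

    go func() {
        for {
            // pure CPU, no function calls, no cooperative yield points
        }
    }()

    // The hot loop owns the only P. Signal-based async preemption (Go 1.14+)
    // interrupts it so the main goroutine can run again when the timer fires.
    time.Sleep(100 * time.Millisecond)
    fmt.Println("still responsive despite the hot loop")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;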

&lt;p&gt;The "user-space-ification" buys you &lt;strong&gt;cheap G-to-G switching on a hot M.&lt;/strong&gt; That's where the order-of-magnitude lives. The syscalls, the M-to-M handoffs, the actual kernel work — those are still as expensive as they always were. The model wins by making the &lt;em&gt;common case&lt;/em&gt; — many concurrent goroutines, mostly waiting, occasionally running — almost free.&lt;/p&gt;

&lt;p&gt;(N100 is a low-power Alder Lake-N E-core; absolute numbers will be smaller on a server-class Xeon or EPYC, but the ratio is expected to hold.)&lt;/p&gt;




&lt;h2&gt;
  
  
  The Unit of Scheduling
&lt;/h2&gt;

&lt;p&gt;The cleanest comparison is to ask what each runtime actually schedules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Node / TypeScript&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unit of scheduling&lt;/td&gt;
&lt;td&gt;callback / Promise continuation&lt;/td&gt;
&lt;td&gt;goroutine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What's captured at suspension&lt;/td&gt;
&lt;td&gt;tail of an async function as a heap closure&lt;/td&gt;
&lt;td&gt;full call stack + registers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How code looks&lt;/td&gt;
&lt;td&gt;explicit &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;straight-line synchronous&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspension marked by&lt;/td&gt;
&lt;td&gt;the programmer (&lt;code&gt;await&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;the runtime (any blocking op)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Suspended state lives in&lt;/td&gt;
&lt;td&gt;V8 microtask queue + heap closure&lt;/td&gt;
&lt;td&gt;goroutine stack on the user-space heap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel involvement&lt;/td&gt;
&lt;td&gt;epoll/kqueue/IOCP via libuv&lt;/td&gt;
&lt;td&gt;epoll/kqueue/IOCP via netpoller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU parallelism&lt;/td&gt;
&lt;td&gt;one main JS thread; needs workers/cluster for cores&lt;/td&gt;
&lt;td&gt;M:N scheduler runs goroutines across cores natively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function color&lt;/td&gt;
&lt;td&gt;yes (Promise infects up the call stack)&lt;/td&gt;
&lt;td&gt;no (any function may block)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What breaks under CPU load&lt;/td&gt;
&lt;td&gt;the entire event loop&lt;/td&gt;
&lt;td&gt;nothing — scheduler runs another G on another M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two columns describe deeply different mental models, but they belong to the same family. They are both &lt;em&gt;user-space concurrency runtimes that avoid kernel thread-per-request.&lt;/em&gt; They differ in where the suspension is captured (the language vs. the call stack) and how broad the scheduler's mandate is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the Boundaries Diverge: CPU-Bound Work
&lt;/h2&gt;

&lt;p&gt;Node and Go look interchangeable on I/O-bound workloads. They diverge sharply the moment CPU work enters the picture.&lt;/p&gt;

&lt;p&gt;Node's event loop has one job: dispatch ready callbacks onto a single JS thread. If a callback runs for 200 ms doing JSON parsing or hashing, the loop is &lt;em&gt;frozen&lt;/em&gt; for those 200 ms. Every other suspended continuation has to wait. Throughput collapses.&lt;/p&gt;

&lt;p&gt;Go's runtime has a different mandate. It doesn't only manage waiting — it also manages execution. If you spawn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;task1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;task2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;task3&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…the scheduler is happy to put each goroutine on a different M, run them on different cores in true parallel, and preempt long-running goroutines so they don't starve the rest of the runtime. CPU-bound goroutines aren't a special case to work around. They're just goroutines.&lt;/p&gt;
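
&lt;p&gt;A small sketch of what that buys you in practice (the &lt;code&gt;burn&lt;/code&gt; helper and the 300 ms figure are placeholders for real CPU-bound work): run one CPU-bound goroutine per core, and the wall-clock time stays close to the cost of a single task, because the scheduler spreads the goroutines across OS threads on separate cores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

// burn stands in for a CPU-bound task: no I/O, nothing but spinning.
func burn(d time.Duration) {
    deadline := time.Now().Add(d)
    for time.Now().Before(deadline) {
    }
}

func main() {
    n := runtime.NumCPU()
    fmt.Printf("running %d CPU-bound goroutines on %d cores\n", n, n)

    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i &amp;lt; n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            burn(300 * time.Millisecond)
        }()
    }
    wg.Wait()

    // Expect roughly 300 ms, not n x 300 ms: true parallelism, no event loop to freeze.
    fmt.Println("elapsed:", time.Since(start).Round(time.Millisecond))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;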

&lt;p&gt;That's why Go's concurrency model covers more ground:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Node's model mainly solves non-CPU-bound concurrency — network I/O, database waits, downstream API calls. Go's model solves I/O waiting &lt;em&gt;and&lt;/em&gt; CPU parallelism with the same primitive.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't a knock on Node. The event loop is brilliant at what it's designed for: lots of slow waits, light per-request CPU. It's the natural shape of API gateways, BFFs, websocket hubs, real-time aggregation, and most of the JSON-shuffling that makes up modern web backends. But sustained CPU work, mixed CPU + I/O pipelines, long-lived infrastructure services — those are workloads where Go's scheduler-driven model has more headroom built in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Answers to the Same Question
&lt;/h2&gt;

&lt;p&gt;Strip away the implementation details and the two runtimes are answering the same question with different abstractions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Concurrency at scale is the problem of what to do with the CPU while a request waits on I/O.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Node's answer: turn the wait into an event, capture the rest of the function as a continuation, resume the continuation when the event fires. One thread cycling through ready continuations.&lt;/p&gt;

&lt;p&gt;Go's answer: run the request on a goroutine, suspend the goroutine in user space when it blocks, schedule another runnable goroutine onto the OS thread, resume the original when its wait completes.&lt;/p&gt;

&lt;p&gt;Two ways of solving the same waste. One state-machines it. The other lowers the cost of context switching far enough that you can afford to keep one execution flow per request.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Two answers to one question: one is events, implemented as a state machine. The other is low-cost user-space context switching.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But there's a deeper layer worth surfacing. The two answers also disagree about &lt;em&gt;whether suspension should be visible in the type system.&lt;/em&gt; Node says yes — &lt;code&gt;Promise&amp;lt;T&amp;gt;&lt;/code&gt; is part of the signature, &lt;code&gt;async&lt;/code&gt; is part of the contract, function color propagates. Go says no — any function may block, and the type doesn't carry that information.&lt;/p&gt;

&lt;p&gt;This visibility-vs-uniformity trade-off shows up far beyond Node and Go. It's the same shape as monadic IO vs implicit IO in Haskell, checked vs unchecked exceptions in Java, capability-based security vs ambient authority. Each pair makes the same trade: composable static reasoning vs ergonomic uniform code. Node and Go are picking sides of a much bigger question.&lt;/p&gt;

&lt;p&gt;You see the consequence in the libraries. Node libraries publish &lt;code&gt;fs.readFile&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; &lt;code&gt;fs.readFileSync&lt;/code&gt;, two retry helpers (one for sync ops, one for async), &lt;code&gt;p-limit&lt;/code&gt;-style bounded-concurrency wrappers around &lt;code&gt;Promise.all&lt;/code&gt;. Go libraries publish &lt;code&gt;os.ReadFile&lt;/code&gt; (one function), one &lt;code&gt;Retry(op func() error, n int) error&lt;/code&gt;, twenty lines of &lt;code&gt;chan&lt;/code&gt; + &lt;code&gt;WaitGroup&lt;/code&gt; for bounded concurrency. The Go versions aren't simpler because Go developers are smarter — they're simpler because the runtime hides the same complexity that Node's type system insists on exposing.&lt;/p&gt;
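
&lt;p&gt;To ground that last claim, here is roughly what those "twenty lines of &lt;code&gt;chan&lt;/code&gt; + &lt;code&gt;WaitGroup&lt;/code&gt;" look like. A sketch, not a library: &lt;code&gt;processItem&lt;/code&gt; and the limit of 4 are placeholders, and the buffered channel is doing the job &lt;code&gt;p-limit&lt;/code&gt; does on the Node side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "sync"
)

// processItem stands in for whatever blocking work is being fanned out.
func processItem(id int) { fmt.Println("processed", id) }

func main() {
    const limit = 4                   // at most 4 goroutines in flight
    sem := make(chan struct{}, limit) // buffered channel used as a semaphore
    var wg sync.WaitGroup

    for id := 1; id &amp;lt;= 100; id++ {
        wg.Add(1)
        sem &amp;lt;- struct{}{} // acquire a slot; blocks once `limit` workers are running
        go func(id int) {
            defer wg.Done()
            defer func() { &amp;lt;-sem }() // release the slot
            processItem(id)
        }(id)
    }
    wg.Wait()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;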




&lt;h2&gt;
  
  
  The Closing Line
&lt;/h2&gt;

&lt;p&gt;If you remember one thing from this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Node turns waiting into events. Go turns execution flows into schedulable units. Both refuse to let the CPU sit idle while I/O blocks — they just disagree on what the unit of scheduling should be.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or, if you want the deeper layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Node makes "this function might suspend" visible at the type level. Go makes it invisible.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the whole story. Everything else — &lt;code&gt;await&lt;/code&gt; vs &lt;code&gt;go&lt;/code&gt;, libuv vs the netpoller, V8's microtask queue vs GMP, single-thread bottleneck vs CPU-bound resilience, libraries that look complicated vs libraries that look simple — falls out of that one disagreement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Reproduce the Benchmark
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;goroutine_switch_test.go&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;GOMAXPROCS=1 go test -bench=. -benchtime=5s -count=5&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;bench&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"runtime"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
    &lt;span class="s"&gt;"testing"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// Channel ping-pong: each iter is a full round-trip = 2 G-switches.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkGoroutineSwitchChannel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}{}&lt;/span&gt;
        &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StopTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Bare scheduler yield. Each iter ≈ 1 G-switch.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkGoroutineSwitchGosched&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;half&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gosched&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;half&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Gosched&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;pthread_switch.c&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;gcc -O2 -o pthread_switch pthread_switch.c -lpthread &amp;amp;&amp;amp; taskset -c 0 ./pthread_switch 2000000&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define _GNU_SOURCE
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;pthread.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdio.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;time.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;pthread_mutex_t&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PTHREAD_MUTEX_INITIALIZER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="n"&gt;pthread_cond_t&lt;/span&gt;  &lt;span class="n"&gt;cv&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PTHREAD_COND_INITIALIZER&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="k"&gt;volatile&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;    &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt;            &lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nf"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;my_turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;intptr_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pthread_mutex_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;my_turn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;pthread_cond_wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;turn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;my_turn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;pthread_cond_broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;pthread_mutex_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="nf"&gt;now_ns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;timespec&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;clock_gettime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CLOCK_MONOTONIC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tv_sec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tv_nsec&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;argc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;iters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argc&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;atol&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000000L&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;pthread_t&lt;/span&gt; &lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;pthread_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;intptr_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pthread_create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;worker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="kt"&gt;intptr_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pthread_join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;pthread_join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now_ns&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ns / switch: %.1f&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;iters&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GOMAXPROCS=1&lt;/code&gt; forces both goroutines onto the same M so we measure pure G-to-G switching, not cross-core migration. &lt;code&gt;taskset -c 0&lt;/code&gt; pins both pthreads to one CPU so they actually have to context-switch (otherwise they run in parallel on two cores and there is nothing to measure). Both benches do the simplest possible synchronized hand-off — no I/O, no real work — so what is left is the cost of the switch itself.&lt;/p&gt;

</description>
      <category>go</category>
      <category>node</category>
      <category>concurrency</category>
      <category>javascript</category>
    </item>
    <item>
      <title>gRPC Interceptors in Production: Design Patterns That Survive Real Load</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 17:02:20 +0000</pubDate>
      <link>https://forem.com/harrisonsec/grpc-interceptors-in-production-design-patterns-that-survive-real-load-372h</link>
      <guid>https://forem.com/harrisonsec/grpc-interceptors-in-production-design-patterns-that-survive-real-load-372h</guid>
      <description>&lt;p&gt;gRPC interceptors are the middleware pattern, specialized for gRPC. If you've written HTTP middleware before, the shape is familiar — a function that wraps a call: it can observe or modify the request, pass it to the next handler, then observe or modify the response. The difference: gRPC's type system makes the flavors (unary, server-stream, client-stream, bidi) explicit, and chain ordering matters more than most people realize.&lt;/p&gt;

&lt;p&gt;Most online examples show a single toy interceptor. Production systems stack five to ten of them per service. Getting the composition right — ordering, concern separation, testability — is half of running a gRPC-based microservice well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — gRPC interceptors are middleware with more explicit types. Chain them outside-in: observability wraps everything, then throttling, then auth, then retry, then the actual service. Keep each interceptor focused on one concern; the moment an interceptor does two things you're writing coupled middleware. Stream interceptors are trickier than unary — don't copy-paste unary logic into stream without thinking. Test the chain composition with bufconn, not just each interceptor in isolation.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Four Interceptor Types
&lt;/h2&gt;

&lt;p&gt;gRPC has four interceptor signatures, two for client, two for server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unary server interceptor&lt;/strong&gt;: wraps a single request → single response call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream server interceptor&lt;/strong&gt;: wraps streaming RPCs (server-stream, client-stream, bidi).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unary client interceptor&lt;/strong&gt;: wraps the client side of a unary call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream client interceptor&lt;/strong&gt;: wraps the client side of a streaming call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unary interceptors are easy. Stream interceptors are harder because you're wrapping a bidirectional wire, not a single call.&lt;/p&gt;

&lt;p&gt;Example unary server interceptor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;loggingInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"method=%s duration=%s err=%v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullMethod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loggingInterceptor&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Straightforward. Now stack five of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaining and Order
&lt;/h2&gt;

&lt;p&gt;Real services need multiple interceptors. grpc-go gives you &lt;code&gt;grpc.ChainUnaryInterceptor(...)&lt;/code&gt; and &lt;code&gt;grpc.ChainStreamInterceptor(...)&lt;/code&gt; built in, or you can use the chaining helpers from &lt;code&gt;github.com/grpc-ecosystem/go-grpc-middleware&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChainUnaryInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;observabilityInterceptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c"&gt;// outermost&lt;/span&gt;
        &lt;span class="n"&gt;rateLimitInterceptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;authInterceptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;validationInterceptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;businessLogicContextInterceptor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// innermost&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chain order matters &lt;em&gt;enormously&lt;/em&gt;. Interceptors execute outside-in on the way to the handler, inside-out on the way back. Put an interceptor on the wrong side of another and you get bugs that are hard to debug.&lt;/p&gt;

&lt;p&gt;Canonical order I use:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBDbGllbnQoW2dSUEMgY2xpZW50XSkgLS0-IEkxCiAgICBJMVsiT2JzZXJ2YWJpbGl0eTxici8-dHJhY2luZyDCtyBtZXRyaWNzIMK3IGxvZ2dpbmciXSAtLT4gSTIKICAgIEkyWyJSYXRlIGxpbWl0aW5nIC8gcXVvdGEiXSAtLT4gSTMKICAgIEkzWyJBdXRoPGJyLz5hdXRobiDCtyBhdXRoeiJdIC0tPiBJNAogICAgSTRbIlZhbGlkYXRpb24iXSAtLT4gSTUKICAgIEk1WyJSZXRyeSAvIGlkZW1wb3RlbmN5Il0gLS0-IEk2CiAgICBJNlsiQ29udGV4dCBlbnJpY2htZW50Il0gLS0-IEhhbmRsZXJ7eyJCdXNpbmVzcyBoYW5kbGVyIn19CgogICAgY2xhc3NEZWYgb3V0ZXIgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzc0RlZiBtaWQgZmlsbDojZThmNGY4LHN0cm9rZTojMmM1MjgyCiAgICBjbGFzc0RlZiBpbm5lciBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzIEkxIG91dGVyCiAgICBjbGFzcyBJMixJMyxJNCBtaWQKICAgIGNsYXNzIEk1LEk2IGlubmVy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBDbGllbnQoW2dSUEMgY2xpZW50XSkgLS0-IEkxCiAgICBJMVsiT2JzZXJ2YWJpbGl0eTxici8-dHJhY2luZyDCtyBtZXRyaWNzIMK3IGxvZ2dpbmciXSAtLT4gSTIKICAgIEkyWyJSYXRlIGxpbWl0aW5nIC8gcXVvdGEiXSAtLT4gSTMKICAgIEkzWyJBdXRoPGJyLz5hdXRobiDCtyBhdXRoeiJdIC0tPiBJNAogICAgSTRbIlZhbGlkYXRpb24iXSAtLT4gSTUKICAgIEk1WyJSZXRyeSAvIGlkZW1wb3RlbmN5Il0gLS0-IEk2CiAgICBJNlsiQ29udGV4dCBlbnJpY2htZW50Il0gLS0-IEhhbmRsZXJ7eyJCdXNpbmVzcyBoYW5kbGVyIn19CgogICAgY2xhc3NEZWYgb3V0ZXIgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzc0RlZiBtaWQgZmlsbDojZThmNGY4LHN0cm9rZTojMmM1MjgyCiAgICBjbGFzc0RlZiBpbm5lciBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzIEkxIG91dGVyCiAgICBjbGFzcyBJMixJMyxJNCBtaWQKICAgIGNsYXNzIEk1LEk2IGlubmVy" alt="Client([gRPC client]) --&amp;gt; I1" width="1784" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outside-in on the way to the handler, inside-out on the way back. Observability must wrap everything — so it sees every rejection, every rate-limit hit, every failed auth — otherwise you have operational blind spots. Details:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Observability (tracing + metrics + logging)&lt;/strong&gt; — outermost. You want to see every request, including the ones that get rejected by later interceptors. If observability is inside auth, unauth'd attempts are invisible — a security-relevant blind spot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rate limiting / quota&lt;/strong&gt; — before auth. Why? Because auth involves token verification (DB lookup, JWT parsing, external identity service), and you don't want unauthenticated requests to cost you CPU. Rate-limit first, authenticate second.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auth (authentication + authorization)&lt;/strong&gt; — before business logic. Reject unauthenticated/unauthorized requests early.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Validation (request shape, basic sanity)&lt;/strong&gt; — before business logic. Catches malformed requests before they hit service code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retry / idempotency handling&lt;/strong&gt; — closer to business. Only retry what actually made it through auth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request context enrichment (trace IDs, user metadata)&lt;/strong&gt; — innermost. Populate context with validated data for the service to use.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Inverted order produces real bugs. I've seen auth outside observability (auth failures weren't logged). Retry outside rate limiter (a retry storm blew through the rate limit). Validation outside observability (validation failures invisible in metrics). Each one a real incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping Interceptors Focused
&lt;/h2&gt;

&lt;p&gt;The rule: &lt;strong&gt;one concern per interceptor&lt;/strong&gt;. The moment you have an "auth-and-logging" interceptor, you're coupling concerns that should evolve separately.&lt;/p&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't: single "observability" interceptor that does tracing, metrics, and logging in one function.&lt;/li&gt;
&lt;li&gt;Do: three interceptors (&lt;code&gt;tracingInterceptor&lt;/code&gt;, &lt;code&gt;metricsInterceptor&lt;/code&gt;, &lt;code&gt;loggingInterceptor&lt;/code&gt;), chained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost: three function-call overheads instead of one. Marginal.&lt;/p&gt;

&lt;p&gt;Benefit: you can swap tracing backends without touching logging. You can disable metrics in tests without disabling tracing. Each interceptor is testable in isolation.&lt;/p&gt;

&lt;p&gt;This is the same argument for Unix pipes over monolithic commands. Composition beats monoliths.&lt;/p&gt;
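
&lt;p&gt;Wired up, that composition is one chain call. A sketch (the three interceptor names are placeholders for your own implementations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;s := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        tracingInterceptor, // spans only
        metricsInterceptor, // histograms only
        loggingInterceptor, // log lines only
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;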

&lt;h2&gt;
  
  
  Common Interceptor Recipes
&lt;/h2&gt;

&lt;p&gt;Real interceptors I've written variants of many times:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tracing (OpenTelemetry)
&lt;/h3&gt;

&lt;p&gt;Use the &lt;code&gt;otelgrpc&lt;/code&gt; integration from &lt;code&gt;go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc&lt;/code&gt;. Don't write your own — the ecosystem is mature. Current idiomatic setup uses a &lt;code&gt;StatsHandler&lt;/code&gt;, which hooks deeper than the interceptor chain and captures stream events correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatsHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otelgrpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServerHandler&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChainUnaryInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c"&gt;/* your app interceptors */&lt;/span&gt; &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Older codebases still use &lt;code&gt;otelgrpc.UnaryServerInterceptor()&lt;/code&gt; and &lt;code&gt;otelgrpc.StreamServerInterceptor()&lt;/code&gt; — those are deprecated but still work. Migrate when convenient; don't rewrite in a panic.&lt;/p&gt;
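
&lt;p&gt;For reference, the older interceptor-based registration you'll still see in those codebases looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Deprecated wiring: interceptor-based otelgrpc, common in pre-StatsHandler code.
s := grpc.NewServer(
    grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
    grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;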

&lt;h3&gt;
  
  
  Metrics
&lt;/h3&gt;

&lt;p&gt;Prometheus histogram of request duration per method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reqDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;promauto&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHistogramVec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HistogramOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"grpc_server_request_duration_seconds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Buckets&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefBuckets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;metricsInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;reqDuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithLabelValues&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullMethod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Seconds&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: cardinality of &lt;code&gt;method&lt;/code&gt; is bounded (you know your service's methods). Cardinality of &lt;code&gt;code&lt;/code&gt; is bounded (gRPC codes are a fixed enum). Don't add user-id or request-id as labels — that's cardinality-explosion territory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auth
&lt;/h3&gt;

&lt;p&gt;Extract bearer token from metadata, verify, inject user context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;authInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FromIncomingContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unauthenticated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no metadata"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;md&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unauthenticated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no auth token"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;verifyToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unauthenticated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"invalid token"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Skip certain public methods&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;isPublic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullMethod&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userCtxKey&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key detail: add the user context here, near the boundary. Service code reads it from context. You don't pass claims as argument through every service method.&lt;/p&gt;
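
&lt;p&gt;On the service side, keep that context read in one small accessor. A sketch, assuming the claims type is whatever your &lt;code&gt;verifyToken&lt;/code&gt; returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// UserFromContext returns the claims the auth interceptor stored, if any.
func UserFromContext(ctx context.Context) (*Claims, bool) {
    claims, ok := ctx.Value(userCtxKey{}).(*Claims)
    return claims, ok
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;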

&lt;h3&gt;
  
  
  Rate limiting
&lt;/h3&gt;

&lt;p&gt;Token bucket per caller or per method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;rateLimitInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limiter&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Limiter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInterceptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;limiter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResourceExhausted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"rate limited"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production rate limiting is fancier — per-tenant, distributed state in Redis, burst capacity — but the shape is the same. Reject with &lt;code&gt;ResourceExhausted&lt;/code&gt; before doing work.&lt;/p&gt;
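
&lt;p&gt;The per-caller shape is a map of buckets keyed by identity. A sketch, with illustrative limits and no eviction (real deployments evict idle entries or move the state to Redis):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;type callerLimiters struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

// get lazily creates one token bucket per caller identity.
func (c *callerLimiters) get(caller string) *rate.Limiter {
    c.mu.Lock()
    defer c.mu.Unlock()
    l, ok := c.limiters[caller]
    if !ok {
        l = rate.NewLimiter(rate.Limit(100), 200) // 100 req/s, burst 200: illustrative numbers
        c.limiters[caller] = l
    }
    return l
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;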

&lt;h3&gt;
  
  
  Retry (client-side)
&lt;/h3&gt;

&lt;p&gt;Client interceptor that retries on transient errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;retryClientInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryClientInterceptor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
        &lt;span class="n"&gt;cc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClientConn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoker&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryInvoker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CallOption&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;invoker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;isRetryable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;uint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retry is one of the most dangerous interceptors. Get it wrong (no idempotency keys, retry non-idempotent operations, retry storm during outage) and it causes more production incidents than it prevents. Pair with &lt;a href="https://github.com/grpc-ecosystem/go-grpc-middleware" rel="noopener noreferrer"&gt;&lt;code&gt;grpc-middleware/retry&lt;/code&gt;&lt;/a&gt; if you can; it's battle-tested.&lt;/p&gt;
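
&lt;p&gt;The &lt;code&gt;isRetryable&lt;/code&gt; helper above is left undefined on purpose. One conservative version (an assumption; tune it to your error model) retries only codes that signal transient server-side trouble:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func isRetryable(err error) bool {
    switch status.Code(err) {
    case codes.Unavailable, codes.ResourceExhausted:
        return true // transient; safe to retry idempotent calls
    default:
        return false // never retry InvalidArgument, PermissionDenied, etc.
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;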

&lt;h2&gt;
  
  
  The Stream Interceptor Trap
&lt;/h2&gt;

&lt;p&gt;Stream interceptors are harder. The interceptor signature gives you a &lt;code&gt;grpc.ServerStream&lt;/code&gt;, which is a bidirectional channel. Logging becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;loggingStreamInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srv&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerStream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StreamServerInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StreamHandler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stream=%s duration=%s err=%v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FullMethod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Since&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This only logs at stream-end, not per message. If you want per-message observability, you need to wrap the &lt;code&gt;ServerStream&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;observedStream&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerStream&lt;/span&gt;
    &lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recv&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;observedStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;SendMsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddInt64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SendMsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;observedStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RecvMsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServerStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RecvMsg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddInt64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pass the wrapper to the handler. This is the pattern for any stream interceptor that needs per-message visibility.&lt;/p&gt;
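
&lt;p&gt;Putting it together, a sketch of the interceptor that hands the wrapper to the handler and reports the counters at stream end (the function name is mine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func countingStreamInterceptor(srv interface{}, ss grpc.ServerStream,
    info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
    wrapped := &amp;amp;observedStream{ServerStream: ss}
    err := handler(srv, wrapped)
    log.Printf("stream=%s sent=%d recv=%d err=%v", info.FullMethod,
        atomic.LoadInt64(&amp;amp;wrapped.sent), atomic.LoadInt64(&amp;amp;wrapped.recv), err)
    return err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;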

&lt;p&gt;Common mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting to propagate context to the wrapper.&lt;/strong&gt; The wrapped stream's &lt;code&gt;Context()&lt;/code&gt; should be the enriched context (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-message overhead blows up long streams.&lt;/strong&gt; A message-level log line is fine at 100 msgs/sec. At 100K msgs/sec, it's your dominant cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State in the wrapper not thread-safe.&lt;/strong&gt; Streams can be concurrent on the &lt;code&gt;Send&lt;/code&gt; and &lt;code&gt;Recv&lt;/code&gt; sides. Protect counters.&lt;/li&gt;
&lt;/ul&gt;
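
&lt;p&gt;For the context-propagation mistake, the fix is to carry the enriched context in the wrapper and override &lt;code&gt;Context()&lt;/code&gt;. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;type ctxStream struct {
    grpc.ServerStream
    ctx context.Context // the enriched context built in the interceptor
}

// Context overrides the embedded stream's Context so the handler sees the enrichment.
func (s *ctxStream) Context() context.Context { return s.ctx }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;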

&lt;h2&gt;
  
  
  Testing Interceptor Chains
&lt;/h2&gt;

&lt;p&gt;Unit test each interceptor in isolation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestAuthInterceptor_NoToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// no metadata&lt;/span&gt;
    &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnaryServerInfo&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;FullMethod&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"/my.Service/Method"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"handler should not be called"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;authInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;codes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unauthenticated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Integration-test the chain end-to-end using &lt;code&gt;bufconn&lt;/code&gt; (in-memory connection):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestChain_Ordering&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;lis&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufconn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;lis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChainUnaryInterceptor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;observability&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;business&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RegisterMyServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;realImpl&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Serve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lis&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bufnet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContextDialer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DialContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="n"&gt;grpc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTransportCredentials&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;insecure&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewCredentials&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewMyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// assert on behavior end-to-end&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Integration tests catch bugs that unit tests don't: metadata propagation, interceptor ordering, context enrichment visible to the handler. Don't skip them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns That Save Time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;grpc-middleware/v2&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;github.com/grpc-ecosystem/go-grpc-middleware/v2&lt;/code&gt;) for chain helpers, recovery, and batteries-included interceptors. Don't reinvent every wheel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep error semantics consistent&lt;/strong&gt;. Every interceptor should return &lt;code&gt;status.Error(code, msg)&lt;/code&gt; for failures. Don't return raw Go errors — clients can't parse them properly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip-list for public methods.&lt;/strong&gt; Auth and rate limiting often need to skip health-check and reflection endpoints. Keep the skip list in one place (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-service vs global interceptors&lt;/strong&gt;. Most interceptors are global (tracing, metrics, auth). A few might be per-service (e.g., a bespoke rate limiter for a specific hot endpoint). Compose accordingly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Panic recovery at the outermost layer&lt;/strong&gt;. A panic in a handler shouldn't kill the server. Use the &lt;code&gt;recovery&lt;/code&gt; middleware from &lt;code&gt;grpc-middleware&lt;/code&gt; or write your own, and put it first in the chain.&lt;/li&gt;
&lt;/ul&gt;
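
&lt;p&gt;To make the skip-list idea concrete, here is a minimal sketch of an auth interceptor that bypasses public endpoints. The &lt;code&gt;publicMethods&lt;/code&gt; set and the &lt;code&gt;validateToken&lt;/code&gt; helper are illustrative assumptions, not a prescribed implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// imports: context, google.golang.org/grpc, google.golang.org/grpc/codes,
// google.golang.org/grpc/metadata, google.golang.org/grpc/status

// publicMethods is the single place the skip list lives (hypothetical names).
var publicMethods = map[string]bool{
    "/grpc.health.v1.Health/Check": true,
    "/grpc.reflection.v1alpha.ServerReflection/ServerReflectionInfo": true,
}

func authInterceptor(ctx context.Context, req any, info *grpc.UnaryServerInfo,
    handler grpc.UnaryHandler) (any, error) {
    if publicMethods[info.FullMethod] {
        return handler(ctx, req) // public endpoint: skip auth entirely
    }
    md, _ := metadata.FromIncomingContext(ctx)
    if err := validateToken(md.Get("authorization")); err != nil { // validateToken is assumed
        return nil, status.Error(codes.Unauthenticated, "invalid token")
    }
    return handler(ctx, req)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A rate-limiting interceptor can consult the same &lt;code&gt;publicMethods&lt;/code&gt; map, which keeps the skip list from drifting between interceptors.&lt;/p&gt;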

&lt;h2&gt;
  
  
  The Discipline That Makes This Work
&lt;/h2&gt;

&lt;p&gt;Interceptors are the right tool for cross-cutting concerns — the things every RPC needs but the service code shouldn't have to think about. The discipline is: one concern per interceptor, careful ordering, consistent error semantics, tested end-to-end.&lt;/p&gt;

&lt;p&gt;The services I've seen do this well have clean business logic (because the cross-cutting stuff is outside it) and reliable operational behavior (because the interceptor chain is tested as a unit, not just piece-by-piece). The services that do it poorly have auth logic sprinkled through their handlers, tracing that randomly misses requests, and rate limiters that certain code paths bypass entirely.&lt;/p&gt;

&lt;p&gt;Interceptor order is one of those details that looks tactical and turns out to be architectural. Get it right once; the service's behavior improves every release.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-context-distributed-systems-production/" rel="noopener noreferrer"&gt;Go Context in Distributed Systems: What Actually Works in Production&lt;/a&gt; — the context that flows through every interceptor.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/rpc-vs-nats-who-owns-completion/" rel="noopener noreferrer"&gt;RPC vs NATS: It's Not About Sync vs Async — It's About Who Owns Completion&lt;/a&gt; — the shape of gRPC calls as one side of the bigger messaging picture.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/observability-cost-attribution-dual-path-architecture/" rel="noopener noreferrer"&gt;Observability and Cost Attribution: Why One Pipeline Isn't Enough&lt;/a&gt; — why tracing interceptors alone aren't enough for business attribution.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>grpc</category>
      <category>interceptors</category>
    </item>
    <item>
      <title>Go Generics, One Year In: Which Promises Held, Which Didn't</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:30:05 +0000</pubDate>
      <link>https://forem.com/harrisonsec/go-generics-one-year-in-which-promises-held-which-didnt-44m7</link>
      <guid>https://forem.com/harrisonsec/go-generics-one-year-in-which-promises-held-which-didnt-44m7</guid>
      <description>&lt;p&gt;Go 1.18 shipped generics in March 2022. The two years before that were dominated by hopeful blog posts ("finally, a real type system!") and the two years after by the predictable backlash ("why did we even bother, Go was simpler"). I've written production Go before and after. The honest answer is somewhere in the middle and closer to "useful for a narrower set of problems than we expected."&lt;/p&gt;

&lt;p&gt;This is a look back from someone who has shipped generic code in anger and reviewed a lot more of it. What held up. What didn't. What habits to adopt and which to avoid.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Go generics are genuinely valuable for &lt;strong&gt;parametric operations on container-shaped types&lt;/strong&gt; — slices, maps, channels, any-key lookup tables, min/max/sum utilities. Less valuable for "clever abstractions" that dress up control flow as type magic. The clearest gains are in the standard library itself (&lt;code&gt;slices&lt;/code&gt;, &lt;code&gt;maps&lt;/code&gt;) and in domain-specific utility packages. Most application code didn't need generics before and doesn't need them after. The mistake is not using generics; it's using them for things interfaces already handled fine.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Generics Actually Are
&lt;/h2&gt;

&lt;p&gt;Go generics are &lt;strong&gt;type parameters on functions and types&lt;/strong&gt;. A function like &lt;code&gt;slices.Contains&lt;/code&gt; can be written once, work for any slice element type, and still be type-checked at compile time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt; &lt;span class="n"&gt;comparable&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="n"&gt;E&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three features you should know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Type parameters&lt;/strong&gt;: the &lt;code&gt;[E any]&lt;/code&gt; or &lt;code&gt;[E comparable]&lt;/code&gt; in brackets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints&lt;/strong&gt;: tell the compiler what operations the type parameter supports. &lt;code&gt;any&lt;/code&gt;, &lt;code&gt;comparable&lt;/code&gt;, or custom interfaces like &lt;code&gt;constraints.Ordered&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approximate constraints&lt;/strong&gt;: &lt;code&gt;~[]E&lt;/code&gt; means "any type whose underlying type is &lt;code&gt;[]E&lt;/code&gt;" — lets you be flexible about named slice types (a short example follows this list).&lt;/li&gt;
&lt;/ul&gt;
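
&lt;p&gt;A quick illustration of the third point, reusing the &lt;code&gt;Contains&lt;/code&gt; function from above (the &lt;code&gt;UserIDs&lt;/code&gt; type is invented for the example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;type UserIDs []string // a named slice type

func hasAlice(ids UserIDs) bool {
    // Because the constraint is ~[]E, S is inferred as UserIDs and E as string.
    // No conversion to []string is needed at the call site.
    return Contains(ids, "alice")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;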

&lt;p&gt;What they aren't: Java-style wildcards, C++ SFINAE, or anything that mimics variance. The design is deliberately narrower than most prior languages. It's more like Rust's generics, minus the trait system's complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Generics Clearly Win
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard-library style container and utility functions
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;slices&lt;/code&gt; and &lt;code&gt;maps&lt;/code&gt; packages in the standard library are the canonical example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;slices&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"alice"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;slices&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;maps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;maps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before generics, these were either hand-written per-type (tedious, error-prone), done via &lt;code&gt;interface{}&lt;/code&gt; (type-unsafe, slow), or done via &lt;code&gt;reflect&lt;/code&gt; (slow and error-prone). Generics are strictly better for these.&lt;/p&gt;

&lt;p&gt;The same pattern shows up in third-party libraries: &lt;code&gt;samber/lo&lt;/code&gt; (JS-style utilities), &lt;code&gt;thoas/go-funk&lt;/code&gt; (functional helpers), and many domain-specific ones. If you reach for lodash-style helpers in JavaScript, you'll want something similar in Go, and generics made that workable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrency helpers
&lt;/h3&gt;

&lt;p&gt;Generic worker pools, futures, result types — these all benefit from generics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt;  &lt;span class="n"&gt;T&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt;  &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Future&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before generics, you'd have had an &lt;code&gt;interface{}&lt;/code&gt; return and a type assertion at the call site. Now you can express "this future produces a T" in the type. Cleaner at the boundary, safer at the call site.&lt;/p&gt;
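
&lt;p&gt;For completeness, here's roughly how such a future might be produced. The snippet above omits the constructor, so this is a sketch under that assumption; the name &lt;code&gt;Go&lt;/code&gt; is invented:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Go runs fn in its own goroutine and resolves the future with its result.
func Go[T any](fn func() (T, error)) *Future[T] {
    f := &amp;amp;Future[T]{done: make(chan struct{})}
    go func() {
        f.val, f.err = fn() // fill in the result
        close(f.done)       // unblock every Get
    }()
    return f
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A call site then reads &lt;code&gt;fut := Go(func() (*User, error) { return loadUser(id) })&lt;/code&gt; followed by &lt;code&gt;u, err := fut.Get()&lt;/code&gt;, where &lt;code&gt;u&lt;/code&gt; is statically a &lt;code&gt;*User&lt;/code&gt; with no type assertion (&lt;code&gt;User&lt;/code&gt; and &lt;code&gt;loadUser&lt;/code&gt; are placeholders).&lt;/p&gt;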

&lt;h3&gt;
  
  
  Typed collections
&lt;/h3&gt;

&lt;p&gt;If your system has a genuinely typed container use case — say, an ordered map keyed by a domain ID — generics let you write it once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OrderedMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="n"&gt;comparable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;  &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a rare case where "custom generic container" is the right tool. The majority of code doesn't need this. But when you do need it, the generics version is much better than the &lt;code&gt;interface{}&lt;/code&gt; alternative.&lt;/p&gt;
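
&lt;p&gt;A couple of methods, sketched to show how the type parameters flow through (not a full implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func NewOrderedMap[K comparable, V any]() *OrderedMap[K, V] {
    return &amp;amp;OrderedMap[K, V]{data: make(map[K]V)}
}

func (m *OrderedMap[K, V]) Set(k K, v V) {
    if _, exists := m.data[k]; !exists {
        m.order = append(m.order, k) // remember first-insertion order
    }
    m.data[k] = v
}

func (m *OrderedMap[K, V]) Get(k K) (V, bool) {
    v, ok := m.data[k]
    return v, ok
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;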

&lt;h3&gt;
  
  
  Numerical / algorithmic code
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;constraints.Ordered&lt;/code&gt; (or its post-1.21 replacement &lt;code&gt;cmp.Ordered&lt;/code&gt;) is the key constraint for "works for any numeric or ordered type":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Max&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;cmp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ordered&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Math helpers, min/max, sum, average — all cleanly generic. Readable, type-safe, performant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Generics Don't Help, Or Hurt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  "Generic services" and similar framework-y code
&lt;/h3&gt;

&lt;p&gt;I've seen codebases where someone wrote a generic "repository" type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;FindByID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;Save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The instinct — "all repositories do the same thing" — is mostly wrong. Real repositories differ in query shape, error cases, caching rules, transaction boundaries. Forcing them behind a generic interface either (a) produces a lowest-common-denominator API that doesn't fit any actual use, or (b) gets so many type parameters that readability collapses.&lt;/p&gt;

&lt;p&gt;The Go idiom is usually better: one non-generic &lt;code&gt;UserRepository&lt;/code&gt;, one &lt;code&gt;OrderRepository&lt;/code&gt;, etc. Each concrete, each tuned to its domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Over-constrained helpers
&lt;/h3&gt;

&lt;p&gt;If your "generic" function has five type parameters with custom constraints each, readability dies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Complicated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;comparable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="n"&gt;Hashable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;M&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="n"&gt;M&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is technically legal. Reading it, you realize it's a glorified map-with-cache-and-error. Interfaces or function types would have been clearer. Generics don't make complex APIs simple; they just let you make them complex in a type-checked way.&lt;/p&gt;
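
&lt;p&gt;For contrast, one way the same job could be written with fewer type parameters and a plain function argument. It's a sketch with invented names, not a drop-in replacement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Two type parameters instead of five; the cache is a plain map the caller owns.
func ProcessAll[T comparable, V any](items []T, cache map[T]V, f func(T) (V, error)) error {
    for _, item := range items {
        if _, ok := cache[item]; ok {
            continue // already computed
        }
        v, err := f(item)
        if err != nil {
            return err
        }
        cache[item] = v
    }
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;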

&lt;h3&gt;
  
  
  Behavioral polymorphism
&lt;/h3&gt;

&lt;p&gt;Interfaces are still the right tool when different types have &lt;strong&gt;different behavior&lt;/strong&gt;. A generic &lt;code&gt;Process[T any](x T) error&lt;/code&gt; doesn't help if you actually want different logic per type. You want an interface with a method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Good use of interface&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Processor&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Bad use of generics&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ProcessGeneric&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// can't actually differentiate behavior&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation: &lt;strong&gt;generics for parametric operations (same logic, any type), interfaces for polymorphic behavior (different logic per type).&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance: Usually a Wash
&lt;/h2&gt;

&lt;p&gt;The performance story is more nuanced than either "generics are slow" or "generics are free."&lt;/p&gt;

&lt;p&gt;Go's current generic implementation uses &lt;strong&gt;GCShape stenciling&lt;/strong&gt; — one compiled version per "GC shape" (roughly, per memory layout). This is between full monomorphization (one version per type, like Rust) and type-erased dispatch (one version total, as with Java's erasure).&lt;/p&gt;

&lt;p&gt;Practical implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small primitive types (int, int64)&lt;/strong&gt; often get specialized versions. Competitive with hand-written.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pointer and interface types&lt;/strong&gt; share one compiled version, since they share a GC shape. Slightly slower than a hand-specialized version but usually faster than interface-based dispatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call overhead is similar to function calls&lt;/strong&gt;, not interface dispatch. No devirtualization issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile times increase&lt;/strong&gt;, especially for libraries with many instantiations. This is the real cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Benchmarks I've seen: generic versions are within 5% of hand-written equivalents, and consistently faster than &lt;code&gt;interface{}&lt;/code&gt;-based alternatives. Performance is almost never the deciding factor — readability and design fit matter more.&lt;/p&gt;
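
&lt;p&gt;If you want to check that claim on your own code, a benchmark pair like the following is enough (a sketch: &lt;code&gt;Max&lt;/code&gt; is the generic version from earlier, &lt;code&gt;maxIface&lt;/code&gt; an invented &lt;code&gt;interface{}&lt;/code&gt;-based stand-in):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func maxIface(a, b interface{}) interface{} {
    if a.(int) &amp;gt; b.(int) {
        return a
    }
    return b
}

func BenchmarkMaxGeneric(b *testing.B) {
    for i := 0; i &amp;lt; b.N; i++ {
        _ = Max(i, i+1)
    }
}

func BenchmarkMaxIface(b *testing.B) {
    for i := 0; i &amp;lt; b.N; i++ {
        _ = maxIface(i, i+1).(int)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;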

&lt;h2&gt;
  
  
  Idioms That Emerged
&lt;/h2&gt;

&lt;p&gt;Over the years since 1.18, a few conventions have stuck:&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefer &lt;code&gt;any&lt;/code&gt; to &lt;code&gt;interface{}&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;any&lt;/code&gt; is a type alias for &lt;code&gt;interface{}&lt;/code&gt; added in 1.18. Shorter, clearer. Use it everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-letter type parameters for simple cases, descriptive for complex
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;T&lt;/code&gt;, &lt;code&gt;K&lt;/code&gt;, &lt;code&gt;V&lt;/code&gt; for the obvious cases. More descriptive when the role is specific:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Reduce&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Out&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial&lt;/span&gt; &lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
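
&lt;p&gt;A body for that signature, plus a call site, to show where the descriptive names pay off (&lt;code&gt;Order&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func Reduce[In any, Out any](items []In, f func(Out, In) Out, initial Out) Out {
    acc := initial
    for _, item := range items {
        acc = f(acc, item)
    }
    return acc
}

// Usage: Out is inferred from the initial argument, so the accumulator type
// is explicit at the call site.
//   total := Reduce(orders, func(sum int, o Order) int { return sum + o.Cents }, 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;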



&lt;h3&gt;
  
  
  Put constraints in a dedicated package
&lt;/h3&gt;

&lt;p&gt;If you have several custom constraints, group them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;constraints&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Ordered&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Numeric&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="err"&gt;~&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The standard &lt;code&gt;golang.org/x/exp/constraints&lt;/code&gt; (and later &lt;code&gt;cmp.Ordered&lt;/code&gt; in 1.21) set the pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use &lt;code&gt;~T&lt;/code&gt; approximations for flexibility
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;~[]E&lt;/code&gt; includes named slice types. &lt;code&gt;~int&lt;/code&gt; includes &lt;code&gt;type MyInt int&lt;/code&gt;. Almost always the right choice for generic parametric code: callers with named types aren't locked out, while the constraint still pins down the underlying type the function can rely on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Never overload generic helpers to do too much
&lt;/h3&gt;

&lt;p&gt;Each generic function should do one parametric thing. Generic helpers that try to be many things at once collapse under type-parameter weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Standard Library Won
&lt;/h2&gt;

&lt;p&gt;The clearest vindication of Go generics is what happened to the standard library. &lt;code&gt;slices&lt;/code&gt;, &lt;code&gt;maps&lt;/code&gt;, &lt;code&gt;cmp.Ordered&lt;/code&gt; — these additions are uncontroversially better than the pre-1.18 alternatives. A lot of code that used to be hand-rolled or based on &lt;code&gt;sort.Interface&lt;/code&gt; has cleaner replacements.&lt;/p&gt;

&lt;p&gt;The user-land picture is more mixed. Libraries that benefit from generics genuinely use them well (&lt;code&gt;samber/lo&lt;/code&gt;, &lt;code&gt;kelindar/column&lt;/code&gt;, many others). Libraries that don't need them mostly haven't been retrofitted with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Do Now
&lt;/h2&gt;

&lt;p&gt;A few simple rules I apply:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prefer standard library generic helpers over hand-rolled.&lt;/strong&gt; &lt;code&gt;slices.Contains&lt;/code&gt;, &lt;code&gt;slices.Sort&lt;/code&gt;, &lt;code&gt;maps.Keys&lt;/code&gt; — use them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write a generic helper only when I have at least two concrete use cases for it.&lt;/strong&gt; One use case is a pattern waiting to be born, not necessarily a generic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer functions to methods on generic types&lt;/strong&gt; when possible. Methods have more friction: they can't declare additional type parameters of their own, and you can't add methods outside the defining package.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep constraints simple.&lt;/strong&gt; &lt;code&gt;any&lt;/code&gt;, &lt;code&gt;comparable&lt;/code&gt;, &lt;code&gt;cmp.Ordered&lt;/code&gt;, and domain-specific single-type-union constraints cover 95% of cases. More complex constraints usually mean the abstraction is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never turn interfaces into generics just because you can.&lt;/strong&gt; If the types have genuinely different behavior, an interface is right.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where Generics Actually Sit Now
&lt;/h2&gt;

&lt;p&gt;Generics were oversold before they landed ("Go finally becomes a real language!") and over-applied in the aftermath ("generics everywhere!"). The truth is narrower and more boring: they're a useful addition for a specific class of problems, mostly centered on parametric operations over containers and numerics. They improved the standard library. They haven't changed the shape of most Go code.&lt;/p&gt;

&lt;p&gt;If you've been writing Go and wondering whether you're missing out by not using generics, the answer is almost certainly no. Code without them is still idiomatic. Code with them, when the use case fits, is cleaner. Neither is dominant. Both are fine.&lt;/p&gt;

&lt;p&gt;The one concrete thing I'd say: &lt;strong&gt;learn the generic parts of the standard library&lt;/strong&gt;. &lt;code&gt;slices&lt;/code&gt;, &lt;code&gt;maps&lt;/code&gt;, &lt;code&gt;cmp.Ordered&lt;/code&gt;. Use them reflexively. Stop hand-rolling &lt;code&gt;indexOf&lt;/code&gt; and &lt;code&gt;contains&lt;/code&gt;. Everything else can wait until you have a real problem that generics solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-profiling-pprof-escape-analysis-inlining/" rel="noopener noreferrer"&gt;Go Profiling in Anger: pprof, Escape Analysis, and Inlining Without Magic&lt;/a&gt; — the performance toolchain that tells you whether your generic code actually matches the hand-written version.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-sync-pool-buffer-reuse-when-it-helps/" rel="noopener noreferrer"&gt;sync.Pool in Go: When It Actually Helps, and When It Quietly Hurts&lt;/a&gt; — another feature most commonly misapplied.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/scale-up-scale-out-every-language-wins-somewhere/" rel="noopener noreferrer"&gt;Scale-Up vs Scale-Out: Why Every Language Wins Somewhere&lt;/a&gt; — the meta-question behind every language-feature debate.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>generics</category>
      <category>typeparameters</category>
    </item>
    <item>
      <title>Go Profiling in Anger: pprof, Escape Analysis, and Inlining Without Magic</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:29:23 +0000</pubDate>
      <link>https://forem.com/harrisonsec/go-profiling-in-anger-pprof-escape-analysis-and-inlining-without-magic-3ij</link>
      <guid>https://forem.com/harrisonsec/go-profiling-in-anger-pprof-escape-analysis-and-inlining-without-magic-3ij</guid>
      <description>&lt;p&gt;Go's performance culture has a ritual quality. "Use sync.Pool." "Avoid interface boxing." "Preallocate slices." Copy-pasted from blog posts and applied without measurement. Sometimes helpful. Often hollow.&lt;/p&gt;

&lt;p&gt;The honest answer is that Go performance work is mostly &lt;strong&gt;just profiling&lt;/strong&gt;. Good profiling tells you what's actually slow. Bad profiling — or no profiling — leaves you guessing. The toolchain that Go ships with is genuinely excellent; more engineers should use it, and fewer should follow checklist optimizations they haven't measured.&lt;/p&gt;

&lt;p&gt;This is a practical, end-to-end guide to pprof, escape analysis, and inlining — the three Go-specific tools that answer most performance questions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Start every Go perf investigation with a CPU pprof of the hot path under realistic load. 80% of issues are obvious in the flame graph. For the remaining 20%, add a heap profile and look for allocation pressure driving GC. Only after you've localized the problem with real data should you reach for micro-optimizations: escape analysis via &lt;code&gt;-gcflags='-m'&lt;/code&gt;, inlining hints, and targeted benchmark-driven rewrites. Skip the profile step, and you are optimizing the wrong thing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Investigation Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydChbUGVyZm9ybWFuY2UgY29uY2Vybl0pIC0tPiBDUFVbVGFrZSBDUFUgcHJvZmlsZTxici8-LWh0dHAgcHByb2YgwrcgMzBzIHVuZGVyIGxvYWRdCiAgICBDUFUgLS0-IEhvdHtIb3QgY29kZTxici8-b2J2aW91cz99CiAgICBIb3QgLS0-fFllc3wgRml4MVtGaXggdGhlIGhvdCBwYXRoIMK3IHJlLW1lYXN1cmVdCiAgICBIb3QgLS0-fE5vIMK3IEdDIGhpZ2h8IEhlYXBbVGFrZSBoZWFwIC8gYWxsb2MgcHJvZmlsZV0KICAgIEhlYXAgLS0-IEFsbG9jU2l0ZXtTcGVjaWZpYzxici8-YWxsb2Mgc2l0ZT99CiAgICBBbGxvY1NpdGUgLS0-fFllc3wgRXNjYXBlW0NoZWNrIC1nY2ZsYWdzPSctbSc8YnIvPmZvciB0aGF0IGZ1bmN0aW9uXQogICAgQWxsb2NTaXRlIC0tPnxOb3wgQmVuY2hNaWNyb1tJc29sYXRlIGluIGJlbmNobWFyazxici8-LWJlbmNobWVtIMK3IC1jb3VudD01XQogICAgRXNjYXBlIC0tPiBGaXgyW0ZpeCBhbGxvYyDCtyByZS1tZWFzdXJlXQogICAgQmVuY2hNaWNybyAtLT4gRml4M1tPcHRpbWl6ZSBvciBhY2NlcHRdCiAgICBGaXgxIC0tPiBWZXJpZnlbUHJvZmlsZSBhZ2FpbiDCtyBjb25maXJtXQogICAgRml4MiAtLT4gVmVyaWZ5CiAgICBGaXgzIC0tPiBWZXJpZnkKCiAgICBjbGFzc0RlZiBzdGFydCBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIGFjdGlvbiBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzRGVmIHZlcmlmeSBmaWxsOiNmZWY1ZTcsc3Ryb2tlOiNiNzc5MWYKICAgIGNsYXNzIFN0YXJ0IHN0YXJ0CiAgICBjbGFzcyBGaXgxLEZpeDIsRml4MyBhY3Rpb24KICAgIGNsYXNzIFZlcmlmeSB2ZXJpZnk%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydChbUGVyZm9ybWFuY2UgY29uY2Vybl0pIC0tPiBDUFVbVGFrZSBDUFUgcHJvZmlsZTxici8-LWh0dHAgcHByb2YgwrcgMzBzIHVuZGVyIGxvYWRdCiAgICBDUFUgLS0-IEhvdHtIb3QgY29kZTxici8-b2J2aW91cz99CiAgICBIb3QgLS0-fFllc3wgRml4MVtGaXggdGhlIGhvdCBwYXRoIMK3IHJlLW1lYXN1cmVdCiAgICBIb3QgLS0-fE5vIMK3IEdDIGhpZ2h8IEhlYXBbVGFrZSBoZWFwIC8gYWxsb2MgcHJvZmlsZV0KICAgIEhlYXAgLS0-IEFsbG9jU2l0ZXtTcGVjaWZpYzxici8-YWxsb2Mgc2l0ZT99CiAgICBBbGxvY1NpdGUgLS0-fFllc3wgRXNjYXBlW0NoZWNrIC1nY2ZsYWdzPSctbSc8YnIvPmZvciB0aGF0IGZ1bmN0aW9uXQogICAgQWxsb2NTaXRlIC0tPnxOb3wgQmVuY2hNaWNyb1tJc29sYXRlIGluIGJlbmNobWFyazxici8-LWJlbmNobWVtIMK3IC1jb3VudD01XQogICAgRXNjYXBlIC0tPiBGaXgyW0ZpeCBhbGxvYyDCtyByZS1tZWFzdXJlXQogICAgQmVuY2hNaWNybyAtLT4gRml4M1tPcHRpbWl6ZSBvciBhY2NlcHRdCiAgICBGaXgxIC0tPiBWZXJpZnlbUHJvZmlsZSBhZ2FpbiDCtyBjb25maXJtXQogICAgRml4MiAtLT4gVmVyaWZ5CiAgICBGaXgzIC0tPiBWZXJpZnkKCiAgICBjbGFzc0RlZiBzdGFydCBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIGFjdGlvbiBmaWxsOiNmMGZmZjQsc3Ryb2tlOiMyZjg1NWEKICAgIGNsYXNzRGVmIHZlcmlmeSBmaWxsOiNmZWY1ZTcsc3Ryb2tlOiNiNzc5MWYKICAgIGNsYXNzIFN0YXJ0IHN0YXJ0CiAgICBjbGFzcyBGaXgxLEZpeDIsRml4MyBhY3Rpb24KICAgIGNsYXNzIFZlcmlmeSB2ZXJpZnk%3D" alt="Start([Performance concern]) --&amp;gt; CPU[Take CPU profile&amp;lt;br/&amp;gt;-http pprof · 30s under load]" width="803" height="1086"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU Profiling: The First Thing, Always
&lt;/h2&gt;

&lt;p&gt;Every Go binary can expose a pprof HTTP endpoint in two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"net/http/pprof"&lt;/span&gt;
&lt;span class="c"&gt;// later&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:6060"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under load, grab a CPU profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go tool pprof &lt;span class="nt"&gt;-http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:9999 http://localhost:6060/debug/pprof/profile?seconds&lt;span class="o"&gt;=&lt;/span&gt;30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens a flame graph in your browser. The wide blocks are where CPU time is spent. Usually the answer is immediate — "oh, JSON encoding is 40% of my CPU; let me switch to a faster encoder." Or "regex compilation is in the hot path because someone forgot to pre-compile."&lt;/p&gt;

&lt;p&gt;A few things that look surprising on first profile but shouldn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;runtime.mallocgc&lt;/code&gt; taking 10%+&lt;/strong&gt; is GC pressure. You're allocating a lot. Look at heap profile next.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;runtime.schedule&lt;/code&gt; or &lt;code&gt;runtime.findrunnable&lt;/code&gt; taking 5%+&lt;/strong&gt; means you have too many goroutines churning. Check if you're spawning per-request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;syscall.Syscall&lt;/code&gt; high&lt;/strong&gt; means you're system-call-heavy — usually I/O. Either buffer/batch, or consider epoll-direct if it's in your hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mutex.Lock&lt;/code&gt; visible&lt;/strong&gt; means contention. Either shrink the lock hold time or shard the lock.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't guess your way through these. Click into each, read the stack, find the user code that caused it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap Profiling: When CPU Points to GC
&lt;/h2&gt;

&lt;p&gt;If &lt;code&gt;runtime.mallocgc&lt;/code&gt; shows up in your CPU profile as a non-trivial chunk, heap profile tells you why:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go tool pprof &lt;span class="nt"&gt;-http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:9999 http://localhost:6060/debug/pprof/heap
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go tool pprof &lt;span class="nt"&gt;-http&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;:9999 http://localhost:6060/debug/pprof/allocs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;heap&lt;/code&gt; shows current memory usage. &lt;code&gt;allocs&lt;/code&gt; shows cumulative allocations since program start — this is usually what you want to optimize.&lt;/p&gt;

&lt;p&gt;In the flame graph, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specific allocation sites taking disproportionate share.&lt;/strong&gt; A single line of code creating 50% of allocations is an obvious target.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calls to &lt;code&gt;makeslice&lt;/code&gt;, &lt;code&gt;makemap&lt;/code&gt;, &lt;code&gt;newobject&lt;/code&gt;&lt;/strong&gt; with known-size inputs. If you know the size, preallocate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface boxing in hot paths.&lt;/strong&gt; Every time you pass a concrete type through an &lt;code&gt;interface{}&lt;/code&gt; argument in a tight loop, the runtime may heap-allocate the boxed value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;String concatenation with &lt;code&gt;+&lt;/code&gt;.&lt;/strong&gt; This is the textbook preventable allocation — use &lt;code&gt;strings.Builder&lt;/code&gt; (a before/after sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
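
&lt;p&gt;The last point in practice, as a small before/after sketch (function names are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Before: every += copies the whole string so far into a new allocation.
func joinPlus(parts []string) string {
    s := ""
    for _, p := range parts {
        s += p
    }
    return s
}

// After: one growable buffer, typically a couple of allocations total.
func joinBuilder(parts []string) string {
    var b strings.Builder
    for _, p := range parts {
        b.WriteString(p)
    }
    return b.String()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;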

&lt;p&gt;The goal isn't "zero allocations" — that's usually not practical. The goal is "allocations per operation in a tight, repeated path are bounded and understood."&lt;/p&gt;

&lt;h2&gt;
  
  
  Escape Analysis: The Compiler's Story
&lt;/h2&gt;

&lt;p&gt;Go's compiler decides at compile time whether a variable lives on the stack (essentially free, reclaimed when the function returns) or the heap (allocated, GC-tracked). This is called escape analysis.&lt;/p&gt;

&lt;p&gt;To see the analysis for your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go build &lt;span class="nt"&gt;-gcflags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'-m'&lt;/span&gt; ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./foo.go:12:6: can inline hotFunction
./foo.go:15:10: &amp;amp;Thing{} escapes to heap
./foo.go:18:14: make([]int, 100) does not escape
./foo.go:22:6: parameter "x" escapes to heap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key things to read for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;escapes to heap&lt;/code&gt;&lt;/strong&gt; — this allocation is heap-allocated. If it's in a hot path, investigate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;does not escape&lt;/code&gt;&lt;/strong&gt; — stack-allocated, free. You want most short-lived locals to do this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;parameter escapes to heap&lt;/code&gt;&lt;/strong&gt; — the caller's passed value escapes because this function keeps a reference to it. Often fixable by taking a copy or not storing a reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common surprise: &lt;strong&gt;passing a value to a function that eventually hands it to &lt;code&gt;interface{}&lt;/code&gt; causes the value to escape&lt;/strong&gt;. A pattern like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handleRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"got request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// req.ID boxes to interface{} and may escape&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;req.ID&lt;/code&gt; escapes because of the &lt;code&gt;...interface{}&lt;/code&gt; argument. In a tight path, this is measurable. Fix: use a typed logger that takes concrete types, or accept the cost because logging on the hot path is usually not the hot path.&lt;/p&gt;
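
&lt;p&gt;As one concrete option for the typed-logger route (assuming &lt;code&gt;req.ID&lt;/code&gt; is an &lt;code&gt;int64&lt;/code&gt;): &lt;code&gt;log/slog&lt;/code&gt;'s typed attribute constructors carry small values without pushing them through &lt;code&gt;...interface{}&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "context"
    "log/slog"
)

func handleRequest(ctx context.Context, req *Request) {
    // slog.Int64 packs the value into a slog.Value without heap-allocating it.
    slog.LogAttrs(ctx, slog.LevelInfo, "got request", slog.Int64("id", req.ID))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;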

&lt;p&gt;Escape analysis is one of those things where reading the output a few times is worth it. You start seeing your code differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inlining: When the Compiler Eliminates the Call
&lt;/h2&gt;

&lt;p&gt;Go's compiler inlines small functions to avoid call overhead. Seeing what got inlined:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go build &lt;span class="nt"&gt;-gcflags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'-m'&lt;/span&gt; ./... 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'can inline|cannot inline'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./foo.go:12:6: can inline hotFunction
./foo.go:18:6: cannot inline bigFunction: function too complex: cost 117 exceeds budget 80
./foo.go:22:6: cannot inline interfacingFunction: call to unknown method
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default budget is 80 AST nodes. Hard blockers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calls through interfaces.&lt;/strong&gt; The compiler doesn't know what concrete method gets called. No inlining.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calls to functions that contain loops with &lt;code&gt;for range&lt;/code&gt; over a channel.&lt;/strong&gt; Historically blocked, though the mid-stack inliner has improved this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive functions.&lt;/strong&gt; Obvious.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functions over the budget.&lt;/strong&gt; Refactor smaller if the call is hot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When to care:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never in normal code. Go inlines what it can; your code runs.&lt;/li&gt;
&lt;li&gt;Sometimes in tight hot loops where the call overhead is 10%+ of the total work. Benchmark shows it.&lt;/li&gt;
&lt;li&gt;Occasionally when you control an interface boundary and can replace it with a concrete type on a hot path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't structure your code around inlining. Code readability beats hypothetical call-overhead wins in nearly every case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks: The Ground Truth
&lt;/h2&gt;

&lt;p&gt;Every perf claim should be backed by a benchmark. &lt;code&gt;testing.B&lt;/code&gt; is the tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkEncodeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;newResponse&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReportAllocs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;BenchmarkEncode &lt;span class="nt"&gt;-benchmem&lt;/span&gt; &lt;span class="nt"&gt;-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-count=5&lt;/code&gt; runs each bench 5 times, so you can compare variance. Don't trust a single run. Hardware, OS scheduling, thermals — all add noise.&lt;/p&gt;

&lt;p&gt;For comparing two implementations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;BenchmarkEncodeResponse &lt;span class="nt"&gt;-benchmem&lt;/span&gt; &lt;span class="nt"&gt;-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 ./... &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; old.txt
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;change code&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;BenchmarkEncodeResponse &lt;span class="nt"&gt;-benchmem&lt;/span&gt; &lt;span class="nt"&gt;-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 ./... &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; new.txt
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;benchstat old.txt new.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;benchstat&lt;/code&gt; (&lt;code&gt;golang.org/x/perf/cmd/benchstat&lt;/code&gt;) gives you statistical significance. If the difference isn't statistically meaningful, you didn't actually improve anything — you just rolled the dice differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 80/20 of Go Performance
&lt;/h2&gt;

&lt;p&gt;After enough of this work, a few patterns dominate the real wins:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Query shape, not language.&lt;/strong&gt; A slow endpoint is usually doing 10 DB queries when it could do 1. Go is almost never the bottleneck; the data layer is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network hop count.&lt;/strong&gt; Every inter-service call adds latency. Merging two small services or co-locating tight integrations beats any language-level optimization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching at the right layer.&lt;/strong&gt; A well-placed LRU cache saves more than micro-optimizing the uncached path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preallocating known-size slices/maps.&lt;/strong&gt; &lt;code&gt;make([]int, 0, n)&lt;/code&gt;, when you know &lt;code&gt;n&lt;/code&gt;, is almost free. The default &lt;code&gt;make([]int, 0)&lt;/code&gt; reallocates repeatedly as you append (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding interface boxing in loops.&lt;/strong&gt; This is the one micro-optimization that regularly shows up in real profiles.&lt;/li&gt;
&lt;/ol&gt;
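
&lt;p&gt;Point 4 in code, as a quick sketch (function names invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Capacity 0: append re-allocates and copies as the slice grows.
func doubleGrow(inputs []int) []int {
    out := make([]int, 0)
    for _, v := range inputs {
        out = append(out, v*2)
    }
    return out
}

// Known final size: one allocation up front, no copying.
func doublePrealloc(inputs []int) []int {
    out := make([]int, 0, len(inputs))
    for _, v := range inputs {
        out = append(out, v*2)
    }
    return out
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;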

&lt;p&gt;Everything else — &lt;code&gt;sync.Pool&lt;/code&gt;, escape analysis hand-tuning, loop unrolling — is a long-tail optimization. Worth it when profiling tells you it is. Premature otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Habit I Recommend
&lt;/h2&gt;

&lt;p&gt;Before adding any optimization, do exactly three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take a profile with the optimization off. Save it.&lt;/li&gt;
&lt;li&gt;Apply the optimization.&lt;/li&gt;
&lt;li&gt;Take a profile with the optimization on. Compare.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the comparison doesn't show clear improvement on the metric you cared about, revert. Do not add complexity without evidence.&lt;/p&gt;

&lt;p&gt;This sounds obvious. Almost nobody does it. Most perf work in Go codebases accumulates dead optimizations that add nothing or actively hurt — but nobody knows which, because nobody benchmarked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Habit That Compounds
&lt;/h2&gt;

&lt;p&gt;Go's performance tooling is better than Go's performance culture gives it credit for. pprof, escape analysis, inlining diagnostics, and benchmarks are built in. They're precise. They tell you the truth.&lt;/p&gt;

&lt;p&gt;The reason most Go code isn't as fast as it could be isn't that Go is slow (it isn't). It's that engineers copy-paste optimizations they haven't measured, call the work done, and move on. The few engineers who profile first and optimize second write code that's actually fast — and usually simpler than the ritual-heavy version.&lt;/p&gt;

&lt;p&gt;Profile first. Everything else follows.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-sync-pool-buffer-reuse-when-it-helps/" rel="noopener noreferrer"&gt;sync.Pool in Go: When It Actually Helps, and When It Quietly Hurts&lt;/a&gt; — the one Go optimization most likely to be misapplied.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections: User-Space Context Switching, Explained&lt;/a&gt; — understanding the runtime is the prerequisite to understanding profiles.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/testing-real-world-go-backends/" rel="noopener noreferrer"&gt;Testing Real-World Go Backends Isn't What Many People Think&lt;/a&gt; — benchmarking is the last mile of testing.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>performance</category>
      <category>pprof</category>
    </item>
    <item>
      <title>sync.Pool in Go: When It Actually Helps, and When It Quietly Hurts</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:28:42 +0000</pubDate>
      <link>https://forem.com/harrisonsec/syncpool-in-go-when-it-actually-helps-and-when-it-quietly-hurts-2676</link>
      <guid>https://forem.com/harrisonsec/syncpool-in-go-when-it-actually-helps-and-when-it-quietly-hurts-2676</guid>
      <description>&lt;p&gt;&lt;code&gt;sync.Pool&lt;/code&gt; is one of those Go features that shows up prominently in "how to write fast Go" blog posts and then gets applied to everything. The result is a codebase sprinkled with pools that don't help and sometimes hurt. Most Go code I review does not need &lt;code&gt;sync.Pool&lt;/code&gt;. The code that does need it often uses it wrong.&lt;/p&gt;

&lt;p&gt;This is a working engineer's take on when pooling actually helps, when it's wasted effort, and the specific traps it creates.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — &lt;code&gt;sync.Pool&lt;/code&gt; is a GC-pressure reducer for workloads that allocate large-ish, short-lived objects at high frequency. It is not a general-purpose optimization. The cases where it clearly helps: per-request buffers in HTTP handlers, encoder/decoder instances, JSON buffers, protocol frame buffers. The cases where it hurts or is wasted: small objects, infrequent allocations, long-lived state, and any code that forgets to reset pooled items. Benchmark before and after — always.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What sync.Pool Actually Does
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;sync.Pool&lt;/code&gt; is a free-list for objects that the GC can clear. You &lt;code&gt;Get()&lt;/code&gt; an object (fresh or recycled). You use it. You &lt;code&gt;Put()&lt;/code&gt; it back. The runtime tries to give you a recycled one next time, but reserves the right to drop the whole pool on GC.&lt;/p&gt;

&lt;p&gt;Key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The GC drains pools during collection.&lt;/strong&gt; (Since Go 1.13 a victim cache lets an item survive one extra cycle, but no longer than that.) This is crucial. Pools are not a long-term cache — they're a hint to the runtime that "if you're going to collect these, wait a moment in case they get reused first."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-P (per scheduler processor) local storage.&lt;/strong&gt; Most &lt;code&gt;Get()&lt;/code&gt;/&lt;code&gt;Put()&lt;/code&gt; calls hit the current P's local pool with no contention. Scaling across cores is nearly free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No guarantees.&lt;/strong&gt; A &lt;code&gt;Get()&lt;/code&gt; might return a fresh object. A &lt;code&gt;Put()&lt;/code&gt; might be discarded if the pool is full or the GC just fired.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design is exactly right for "reusable scratch space." It's wrong for "cached resources I need to stay around" (use a real cache instead).&lt;/p&gt;

&lt;h2&gt;
  
  
  Should You Pool This?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydChbQ29uc2lkZXJpbmcgc3luYy5Qb29sP10pIC0tPiBRMXtIYXZlIHlvdTxici8-YmVuY2htYXJrZWQ8YnIvPi1iZW5jaG1lbT99CiAgICBRMSAtLT58Tm98IFNraXAxW0JlbmNobWFyayBmaXJzdC48YnIvPk1vc3QgY29kZSBkb2Vzbid0IG5lZWQgdGhpcy5dCiAgICBRMSAtLT58WWVzfCBRMntPYmplY3Qgc2l6ZTxici8-PiAxIEtCP30KICAgIFEyIC0tPnxObyDCtyBzbWFsbCBvYmplY3R8IFNraXAyW1Bvb2wgb3ZlcmhlYWQgZXhjZWVkczxici8-YWxsb2MgY29zdC4gVXNlICduZXcnLl0KICAgIFEyIC0tPnxZZXN8IFEze0FsbG9jYXRpb25zPGJyLz5mcmVxdWVudD88YnIvPjEwMDBzL3NlY30KICAgIFEzIC0tPnxObyDCtyByYXJlfCBTa2lwM1tHQyBoYW5kbGVzIHRoaXMgZmluZS48YnIvPlNraXAuXQogICAgUTMgLS0-fFllc3wgUTR7U2hvcnQtbGl2ZWQ8YnIvPmFuZCBlYXNpbHk8YnIvPnJlc2V0P30KICAgIFE0IC0tPnxObyDCtyBsb25nLWxpdmVkfCBTa2lwNFtVc2UgYSByZWFsIGNhY2hlPGJyLz5vciByZXNvdXJjZSBwb29sLl0KICAgIFE0IC0tPnxZZXN8IFVzZVtVc2Ugc3luYy5Qb29sLjxici8-QWx3YXlzIFJlc2V0IG9uIEdldCBhbmQgUHV0Ll0KCiAgICBjbGFzc0RlZiBza2lwIGZpbGw6I2ZlZDdkNyxzdHJva2U6I2M1MzAzMAogICAgY2xhc3NEZWYgdXNlIGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgU2tpcDEsU2tpcDIsU2tpcDMsU2tpcDQgc2tpcAogICAgY2xhc3MgVXNlIHVzZQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBTdGFydChbQ29uc2lkZXJpbmcgc3luYy5Qb29sP10pIC0tPiBRMXtIYXZlIHlvdTxici8-YmVuY2htYXJrZWQ8YnIvPi1iZW5jaG1lbT99CiAgICBRMSAtLT58Tm98IFNraXAxW0JlbmNobWFyayBmaXJzdC48YnIvPk1vc3QgY29kZSBkb2Vzbid0IG5lZWQgdGhpcy5dCiAgICBRMSAtLT58WWVzfCBRMntPYmplY3Qgc2l6ZTxici8-PiAxIEtCP30KICAgIFEyIC0tPnxObyDCtyBzbWFsbCBvYmplY3R8IFNraXAyW1Bvb2wgb3ZlcmhlYWQgZXhjZWVkczxici8-YWxsb2MgY29zdC4gVXNlICduZXcnLl0KICAgIFEyIC0tPnxZZXN8IFEze0FsbG9jYXRpb25zPGJyLz5mcmVxdWVudD88YnIvPjEwMDBzL3NlY30KICAgIFEzIC0tPnxObyDCtyByYXJlfCBTa2lwM1tHQyBoYW5kbGVzIHRoaXMgZmluZS48YnIvPlNraXAuXQogICAgUTMgLS0-fFllc3wgUTR7U2hvcnQtbGl2ZWQ8YnIvPmFuZCBlYXNpbHk8YnIvPnJlc2V0P30KICAgIFE0IC0tPnxObyDCtyBsb25nLWxpdmVkfCBTa2lwNFtVc2UgYSByZWFsIGNhY2hlPGJyLz5vciByZXNvdXJjZSBwb29sLl0KICAgIFE0IC0tPnxZZXN8IFVzZVtVc2Ugc3luYy5Qb29sLjxici8-QWx3YXlzIFJlc2V0IG9uIEdldCBhbmQgUHV0Ll0KCiAgICBjbGFzc0RlZiBza2lwIGZpbGw6I2ZlZDdkNyxzdHJva2U6I2M1MzAzMAogICAgY2xhc3NEZWYgdXNlIGZpbGw6I2YwZmZmNCxzdHJva2U6IzJmODU1YQogICAgY2xhc3MgU2tpcDEsU2tpcDIsU2tpcDMsU2tpcDQgc2tpcAogICAgY2xhc3MgVXNlIHVzZQ%3D%3D" alt="Start([Considering sync.Pool?]) --&amp;gt; Q1{Have you&amp;lt;br/&amp;gt;benchmarked&amp;lt;br/&amp;gt;-benchmem?}" width="918" height="1217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most paths in real code exit this flow long before hitting "use". That's correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Pooling Helps: Per-Request Buffers
&lt;/h2&gt;

&lt;p&gt;Canonical case. An HTTP handler serializes a response to a buffer, writes the buffer, moves on. The next request does the same thing. Without pooling, the GC collects the buffer every request. With pooling, the buffer is reused:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;bufferPool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufferPool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;bufferPool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="n"&gt;writeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under realistic load (thousands of requests per second), this typically reduces allocation pressure by 20-40% and measurably lowers GC pause times. The exact number depends on your allocation pattern, but the principle holds: &lt;strong&gt;large, frequent, short-lived allocations are exactly what pooling is for&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What makes this the canonical case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buffers are big enough (4KB initial) that the allocation actually matters.&lt;/li&gt;
&lt;li&gt;They're frequent — thousands per second.&lt;/li&gt;
&lt;li&gt;Short-lived — used within one request.&lt;/li&gt;
&lt;li&gt;Easy to reset — &lt;code&gt;buf.Reset()&lt;/code&gt; clears it cleanly.&lt;/li&gt;
&lt;li&gt;Same shape every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you see a request-scoped buffer that fits all five, pooling almost always pays.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Pooling Is Wasted Effort
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Small objects.&lt;/strong&gt; Pooling a 24-byte struct with three fields is almost never worth it. The pool's own overhead (per-P lookup, interface boxing) is larger than the allocation. Benchmark to confirm — you'll see &lt;code&gt;allocs/op&lt;/code&gt; go down but &lt;code&gt;ns/op&lt;/code&gt; stay the same or go up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Not worth it:&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Small&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;smallPool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Small&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="c"&gt;// Just use new(Small) or &amp;amp;Small{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Infrequent allocations.&lt;/strong&gt; If your code path runs once an hour, pooling saves nothing meaningful. The GC handles a handful of allocations just fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-lived state.&lt;/strong&gt; Connection objects, database handles, caches. These shouldn't be in &lt;code&gt;sync.Pool&lt;/code&gt; — they should be in a proper cache or connection pool (like &lt;code&gt;*sql.DB&lt;/code&gt;, which internally manages connections without &lt;code&gt;sync.Pool&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anything you can't reliably reset.&lt;/strong&gt; If an object has state that needs to be "returned to zero," and you can forget to zero it, you're one typo away from data leaking between requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reset Trap
&lt;/h2&gt;

&lt;p&gt;The single most dangerous mistake with &lt;code&gt;sync.Pool&lt;/code&gt;: forgetting to reset the object before putting it back, or reusing it before clearing whatever was in it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Wrong:&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responseData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// might not start empty&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// buf still has data; next caller might see it&lt;/span&gt;

&lt;span class="c"&gt;// Right:&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// ← explicit&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responseData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has caused real production incidents. Pooled buffers across request handlers have leaked bearer tokens, user PII, and password reset codes when a reset was missed. The runtime doesn't help — there's no "enforce reset" mechanism. You have to do it.&lt;/p&gt;

&lt;p&gt;Habits that reduce the risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Always pair &lt;code&gt;Get&lt;/code&gt; with a &lt;code&gt;defer Reset+Put&lt;/code&gt;&lt;/strong&gt; at the top of the function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset at both ends&lt;/strong&gt; (on Get and on Put) — paranoid but effective; the wrapper sketch after this list builds this in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For byte slices, shrink before return&lt;/strong&gt;: &lt;code&gt;buf.Reset()&lt;/code&gt; on a &lt;code&gt;bytes.Buffer&lt;/code&gt; resets length but keeps capacity — that's usually what you want. For a raw &lt;code&gt;[]byte&lt;/code&gt;, use &lt;code&gt;buf[:0]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make your &lt;code&gt;New&lt;/code&gt; function return a pre-reset object.&lt;/strong&gt; Don't assume it's always "fresh."&lt;/li&gt;
&lt;/ul&gt;
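
&lt;p&gt;One way to make the pairing hard to get wrong is a small typed wrapper around the pool. A sketch, not a standard API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// bufPool wraps sync.Pool so every Get and every Put goes through a Reset.
type bufPool struct{ p sync.Pool }

func newBufPool(capacity int) *bufPool {
    return &amp;amp;bufPool{p: sync.Pool{
        New: func() interface{} { return bytes.NewBuffer(make([]byte, 0, capacity)) },
    }}
}

func (bp *bufPool) Get() *bytes.Buffer {
    buf := bp.p.Get().(*bytes.Buffer)
    buf.Reset() // defensive: never trust a recycled buffer to arrive empty
    return buf
}

func (bp *bufPool) Put(buf *bytes.Buffer) {
    buf.Reset() // and never hand data back to the pool
    bp.p.Put(buf)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
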

&lt;h2&gt;
  
  
  The Alloc Benchmark Methodology
&lt;/h2&gt;

&lt;p&gt;The only honest way to know whether pooling is helping is &lt;code&gt;go test -bench=. -benchmem&lt;/code&gt;. Here's what a useful benchmark looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkWithoutPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReportAllocs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;writeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exampleRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkWithPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReportAllocs&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufferPool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;writeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exampleRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;bufferPool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-benchmem&lt;/span&gt;
&lt;span class="go"&gt;BenchmarkWithoutPool-10    200000    8431 ns/op    4352 B/op    3 allocs/op
BenchmarkWithPool-10       500000    3214 ns/op     128 B/op    1 allocs/op
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;allocs/op&lt;/code&gt; drops significantly&lt;/strong&gt; (here: 3 → 1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ns/op&lt;/code&gt; drops or stays flat&lt;/strong&gt; (here: 8431 → 3214).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;code&gt;allocs/op&lt;/code&gt; drops but &lt;code&gt;ns/op&lt;/code&gt; goes up, pooling is adding overhead without saving enough GC pressure to justify itself. That's the "wasted effort" signal.&lt;/p&gt;

&lt;p&gt;The benchmark alone isn't enough, though — you also need production evidence. pprof heap profiles before and after deployment should show reduced allocation. If the prod numbers don't match the benchmark, you're measuring the wrong thing.&lt;/p&gt;
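
&lt;p&gt;Assuming the service exposes &lt;code&gt;net/http/pprof&lt;/code&gt;, collecting that evidence looks roughly like this (host and port are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;$ # before the deploy
$ go tool pprof -sample_index=alloc_space -top http://localhost:6060/debug/pprof/heap
$ # after the deploy: same command, compare the allocation attributed to the handler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
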

&lt;h2&gt;
  
  
  A Pattern That Actually Works: Scoped Pools
&lt;/h2&gt;

&lt;p&gt;One pattern I've found useful: &lt;strong&gt;scope the pool to the type of work it serves&lt;/strong&gt;. Don't have one giant pool that everything pulls from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// JSON response buffer pool&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;jsonBufPool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Protocol frame buffer pool (different typical size)&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;frameBufPool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why separate pools matter: if you have one shared pool, you might &lt;code&gt;Get()&lt;/code&gt; a 64KB buffer when you needed a 4KB one and waste memory. Or worse, you might &lt;code&gt;Get()&lt;/code&gt; a 4KB one for a 64KB job and grow it (defeating pooling's purpose).&lt;/p&gt;

&lt;p&gt;Separate pools stay close to their intended sizes. Each pool's items are homogeneous. The New function's initial capacity reflects the typical workload.&lt;/p&gt;
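
&lt;p&gt;A usage sketch for the JSON pool (the &lt;code&gt;writeJSON&lt;/code&gt; helper is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func writeJSON(w http.ResponseWriter, v interface{}) error {
    buf := jsonBufPool.Get().(*bytes.Buffer)
    buf.Reset()                // reset on Get
    defer jsonBufPool.Put(buf) // back to the JSON pool, never the frame pool

    if err := json.NewEncoder(buf).Encode(v); err != nil {
        return err
    }
    w.Header().Set("Content-Type", "application/json")
    _, err := w.Write(buf.Bytes())
    return err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
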

&lt;h2&gt;
  
  
  The Big Thing &lt;code&gt;sync.Pool&lt;/code&gt; Isn't
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;sync.Pool&lt;/code&gt; is not a replacement for bounded resource pools (database connections, HTTP clients, goroutine worker pools). Those need explicit lifecycle management, health checks, and non-discardable state. Use a real pool library for them.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sync.Pool&lt;/code&gt; is also not a cache. A cache holds items you want to find again. &lt;code&gt;sync.Pool&lt;/code&gt; holds items you might reuse if one's convenient, and discards them otherwise. Different primitive for a different problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Most Go code is fast enough without pooling. Before adding &lt;code&gt;sync.Pool&lt;/code&gt; to your hot path, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Have I actually benchmarked this with &lt;code&gt;-benchmem&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;Are the objects I'd pool both large and frequent?&lt;/li&gt;
&lt;li&gt;Can I reliably reset them?&lt;/li&gt;
&lt;li&gt;Is GC pressure in pprof profiles actually a problem?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If any answer is no, skip the pool. The simpler code is almost always the better code.&lt;/p&gt;

&lt;p&gt;The cases where pooling pays are real but narrower than internet wisdom suggests. Per-request buffers, protocol frame buffers, encoder/decoder state, crypto scratch space. Beyond that, the pool usually adds more lines of code than it saves nanoseconds — and each of those lines is one more place where a missing &lt;code&gt;Reset()&lt;/code&gt; can leak bytes between requests.&lt;/p&gt;

&lt;p&gt;Measure. Then decide.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — the bigger principle: Go optimizes for correct structure, not raw speed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/testing-real-world-go-backends/" rel="noopener noreferrer"&gt;Testing Real-World Go Backends Isn't What Many People Think&lt;/a&gt; — how to actually benchmark and prove a pool helps.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>performance</category>
      <category>syncpool</category>
    </item>
    <item>
      <title>IronSys: A Production Blueprint for Modern Concurrency</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:28:01 +0000</pubDate>
      <link>https://forem.com/harrisonsec/ironsys-a-production-blueprint-for-modern-concurrency-8c9</link>
      <guid>https://forem.com/harrisonsec/ironsys-a-production-blueprint-for-modern-concurrency-8c9</guid>
      <description>&lt;p&gt;In the last post I walked through the four concurrency pillars — shared memory + locks, CSP, actors, STM — and argued that real systems mix them on purpose. Someone reasonably asked: &lt;em&gt;okay, but what does that actually look like?&lt;/em&gt; Fair question. Abstract taxonomy is less useful than a worked example.&lt;/p&gt;

&lt;p&gt;IronSys is that worked example. It's a composite blueprint — not a real service, but representative of a class of services I've designed, helped design, or debugged in production. Let's say it's a mid-sized backend system: public API, stateful user sessions, streaming data in, aggregation and reporting out. The kind of thing that appears in the middle of any serious platform.&lt;/p&gt;

&lt;p&gt;The interesting part isn't the features. It's which concurrency primitive shows up where, and why.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — IronSys is a composite production blueprint: a multi-service Go backend with stateful user sessions, streaming ingest, and usage aggregation. It uses CSP channels for pipelines and coordination, a goroutine-per-entity actor pattern for stateful sessions, mutexes and atomics for hot shared counters, and durable queues for cross-service handoff. Each primitive is picked for a specific failure mode. The pattern is not "mix for variety"; it's "match the primitive to the work."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The System Shape
&lt;/h2&gt;

&lt;p&gt;Before deciding on concurrency primitives, sketch the work shapes. IronSys has four:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Public API&lt;/strong&gt; — request/response, modest concurrency, latency-sensitive. The classic HTTP backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Live sessions&lt;/strong&gt; — stateful, long-lived per-user entities. Think multiplayer game server, collaborative editor, real-time dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming ingest&lt;/strong&gt; — high-throughput events arriving over Kafka/NATS, fanned out to workers for processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch aggregation&lt;/strong&gt; — periodic rollup jobs that read from storage, compute, write back.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Four shapes, four concurrency patterns. The wrong design would apply the same primitive to all four. The right design picks each separately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBTaGFwZXNbIldvcmsgc2hhcGVzIl0KICAgICAgICBTMVsiMS4gUHVibGljIEFQSTxici8-c3RhdGVsZXNzIMK3IHJlcXVlc3QvcmVzcG9uc2UiXQogICAgICAgIFMyWyIyLiBMaXZlIHNlc3Npb25zPGJyLz5zdGF0ZWZ1bCDCtyBsb25nLWxpdmVkIl0KICAgICAgICBTM1siMy4gU3RyZWFtaW5nIGluZ2VzdDxici8-aGlnaCB0aHJvdWdocHV0IMK3IHN0YXRlbGVzcyJdCiAgICAgICAgUzRbIjQuIEJhdGNoIGFnZ3JlZ2F0aW9uPGJyLz5waXBlbGluZSDCtyBzY2hlZHVsZWQiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUHJpbWl0aXZlc1siQ29uY3VycmVuY3kgcHJpbWl0aXZlcyJdCiAgICAgICAgUDFbIkdvcm91dGluZSArIG11dGV4PGJyLz5wZXItcmVxdWVzdCBoYW5kbGVyIl0KICAgICAgICBQMlsiR29yb3V0aW5lLXBlci1lbnRpdHk8YnIvPmFjdG9yLWxpa2UgwrcgcHJpdmF0ZSBzdGF0ZSJdCiAgICAgICAgUDNbIkJvdW5kZWQgY2hhbm5lbCArIHdvcmtlciBwb29sPGJyLz5DU1AgwrcgYmFja3ByZXNzdXJlIl0KICAgICAgICBQNFsiQ1NQIHBpcGVsaW5lICsgZXJyZ3JvdXA8YnIvPnN0YWdlZCDCtyBjYW5jZWxsYWJsZSJdCiAgICBlbmQKCiAgICBTMSAtLT4gUDEKICAgIFMyIC0tPiBQMgogICAgUzMgLS0-IFAzCiAgICBTNCAtLT4gUDQKCiAgICBjbGFzc0RlZiBzaGFwZSBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIHByaW0gZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzcyBTaGFwZXMgc2hhcGUKICAgIGNsYXNzIFByaW1pdGl2ZXMgcHJpbQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBTaGFwZXNbIldvcmsgc2hhcGVzIl0KICAgICAgICBTMVsiMS4gUHVibGljIEFQSTxici8-c3RhdGVsZXNzIMK3IHJlcXVlc3QvcmVzcG9uc2UiXQogICAgICAgIFMyWyIyLiBMaXZlIHNlc3Npb25zPGJyLz5zdGF0ZWZ1bCDCtyBsb25nLWxpdmVkIl0KICAgICAgICBTM1siMy4gU3RyZWFtaW5nIGluZ2VzdDxici8-aGlnaCB0aHJvdWdocHV0IMK3IHN0YXRlbGVzcyJdCiAgICAgICAgUzRbIjQuIEJhdGNoIGFnZ3JlZ2F0aW9uPGJyLz5waXBlbGluZSDCtyBzY2hlZHVsZWQiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUHJpbWl0aXZlc1siQ29uY3VycmVuY3kgcHJpbWl0aXZlcyJdCiAgICAgICAgUDFbIkdvcm91dGluZSArIG11dGV4PGJyLz5wZXItcmVxdWVzdCBoYW5kbGVyIl0KICAgICAgICBQMlsiR29yb3V0aW5lLXBlci1lbnRpdHk8YnIvPmFjdG9yLWxpa2UgwrcgcHJpdmF0ZSBzdGF0ZSJdCiAgICAgICAgUDNbIkJvdW5kZWQgY2hhbm5lbCArIHdvcmtlciBwb29sPGJyLz5DU1AgwrcgYmFja3ByZXNzdXJlIl0KICAgICAgICBQNFsiQ1NQIHBpcGVsaW5lICsgZXJyZ3JvdXA8YnIvPnN0YWdlZCDCtyBjYW5jZWxsYWJsZSJdCiAgICBlbmQKCiAgICBTMSAtLT4gUDEKICAgIFMyIC0tPiBQMgogICAgUzMgLS0-IFAzCiAgICBTNCAtLT4gUDQKCiAgICBjbGFzc0RlZiBzaGFwZSBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIHByaW0gZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzcyBTaGFwZXMgc2hhcGUKICAgIGNsYXNzIFByaW1pdGl2ZXMgcHJpbQ%3D%3D" alt="S1[" width="686" height="596"&gt;&lt;/a&gt;stateless · request/response"]"/&amp;gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The API Handlers
&lt;/h2&gt;

&lt;p&gt;Nothing fancy. Stock Go HTTP server. Each request is its own goroutine (Go's runtime does this automatically). Shared state — rate limiters, cache, config — is protected by mutexes or atomics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RateLimiter&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;      &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RateLimiter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newBucket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Obvious choice. The contention is bounded by request rate, the state is small, a mutex is the simplest possible tool. Over-engineering here — sharded maps, lock-free data structures — buys nothing.&lt;/p&gt;
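
&lt;p&gt;The bucket itself isn't the interesting part, but for completeness: a minimal token-bucket sketch of what &lt;code&gt;newBucket&lt;/code&gt; and &lt;code&gt;allow&lt;/code&gt; might look like (capacity and refill rate are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;const (
    bucketCap  = 10.0 // burst size
    refillRate = 5.0  // tokens added per second
)

type bucket struct {
    tokens float64
    last   time.Time
}

func newBucket() *bucket {
    return &amp;amp;bucket{tokens: bucketCap, last: time.Now()}
}

// allow runs with the RateLimiter's mutex already held, so it needs no locking of its own.
func (b *bucket) allow() bool {
    now := time.Now()
    b.tokens += now.Sub(b.last).Seconds() * refillRate
    if b.tokens &amp;gt; bucketCap {
        b.tokens = bucketCap
    }
    b.last = now
    if b.tokens &amp;lt; 1 {
        return false
    }
    b.tokens--
    return true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
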

&lt;p&gt;What IronSys does here that many teams miss: &lt;strong&gt;every handler is context-aware from request entry&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;HandleFoo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Foo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parseReq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;writeResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Context flows everywhere downstream. The handler layer is boring; that's the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Live Sessions — Actor Pattern in Go
&lt;/h2&gt;

&lt;p&gt;Each active user session is a long-lived goroutine with an inbox channel. I call this the &lt;strong&gt;goroutine-per-entity pattern&lt;/strong&gt; — it's Erlang actors without the runtime, built from Go primitives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Session&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;       &lt;span class="n"&gt;SessionID&lt;/span&gt;
    &lt;span class="n"&gt;mailbox&lt;/span&gt;  &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;SessionCmd&lt;/span&gt;  &lt;span class="c"&gt;// the "actor" inbox&lt;/span&gt;
    &lt;span class="n"&gt;shutdown&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;    &lt;span class="n"&gt;sessionState&lt;/span&gt;      &lt;span class="c"&gt;// private to this goroutine&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;SessionCmd&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;op&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt;   &lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt;  &lt;span class="k"&gt;chan&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;SessionReply&lt;/span&gt; &lt;span class="c"&gt;// optional reply channel&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mailbox&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mailbox&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shutdown&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// persist final state&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this pattern, not "session is a struct with a mutex"?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State is private to one goroutine&lt;/strong&gt;. No sharing, no locks, no lock-ordering bugs. The session state is accessed by exactly one execution context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serial message processing&lt;/strong&gt;. Commands process one at a time, in FIFO order. Business invariants hold naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural location for cross-session coordination&lt;/strong&gt;. Each session is a message destination. Broadcasting to all sessions, or routing a command to a specific session, is just "send on its inbox."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean lifecycle&lt;/strong&gt;. The goroutine runs until &lt;code&gt;shutdown&lt;/code&gt; or &lt;code&gt;ctx.Done&lt;/code&gt;. State is flushed once, on exit. No race between "is this session still alive" and "did we finish writing its state."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The manager that creates and routes to sessions looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;SessionManager&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mu&lt;/span&gt;       &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RWMutex&lt;/span&gt;
    &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SessionID&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SessionManager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;SessionID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RLock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RUnlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SessionManager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;SessionID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Session&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;runSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// supervisor goroutine&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the mixing: the manager uses a mutex-protected map (shared state with a clear owner), individual sessions use the actor pattern (isolated state, message-passing). Two primitives, picked per-job.&lt;/p&gt;

&lt;p&gt;This pattern scales to millions of sessions because goroutines are cheap. I've seen this exact pattern serve 400K concurrent sessions on a single pod.&lt;/p&gt;
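
&lt;p&gt;The &lt;code&gt;SessionCmd&lt;/code&gt; struct carries an optional reply channel. A sketch of the calling side of that request/reply, in the same package as the structs above (the &lt;code&gt;Ask&lt;/code&gt; helper and its cancellation handling are illustrative, not part of IronSys):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Ask routes a command to a session's mailbox and waits for a single reply,
// giving up if the caller's context is cancelled first.
func Ask(ctx context.Context, s *Session, op string, args interface{}) (SessionReply, error) {
    reply := make(chan SessionReply, 1) // buffered: the session never blocks on the reply
    cmd := SessionCmd{op: op, args: args, reply: reply}

    select {
    case s.mailbox &amp;lt;- cmd:
    case &amp;lt;-ctx.Done():
        var zero SessionReply
        return zero, ctx.Err()
    }

    select {
    case r := &amp;lt;-reply:
        return r, nil
    case &amp;lt;-ctx.Done():
        var zero SessionReply
        return zero, ctx.Err()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
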

&lt;h2&gt;
  
  
  The Streaming Ingest — Bounded Worker Pool (CSP)
&lt;/h2&gt;

&lt;p&gt;Kafka consumer feeding a worker pool. Canonical CSP territory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cons&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="c"&gt;// Fixed worker pool&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;workerCount&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Producer&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cons&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bounded channel is the concurrency clamp. Kafka can push as fast as it wants; the worker pool consumes at its own pace; backpressure propagates back to Kafka's consumer offset naturally.&lt;/p&gt;

&lt;p&gt;Why not actors here? Because the work items are stateless — you're processing events, not maintaining per-entity state. The overhead of an actor (mailbox, dispatch, ownership) is unjustified. CSP is the right fit.&lt;/p&gt;

&lt;p&gt;Why not mutex + a worker loop? You could, but the channel primitive is exactly the right shape — bounded capacity + safe cross-goroutine handoff + graceful shutdown — without needing to build those three features yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Batch Aggregation — Pipelines + errgroup
&lt;/h2&gt;

&lt;p&gt;Nightly rollup: read from storage, compute per-account aggregates, write back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runRollup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;errgroup&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Stage 1: parse&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;ParsedEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parseStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// Stage 2: aggregate (keyed by account)&lt;/span&gt;
    &lt;span class="n"&gt;agged&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Aggregate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;aggregateStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c"&gt;// Stage 3: persist&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Go&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;persistStage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three stages in a pipeline. Each stage is a goroutine, connected by bounded channels. &lt;code&gt;errgroup&lt;/code&gt; ties them together: first error cancels the whole pipeline.&lt;/p&gt;

&lt;p&gt;The aggregation stage internally uses a map protected by a mutex. Today a single goroutine owns that map, so there's no contention and the lock is essentially free; the protection is there so the stage stays correct if a future change introduces more readers.&lt;/p&gt;

&lt;p&gt;This is textbook CSP: &lt;em&gt;the topology of channels is the architecture&lt;/em&gt;. Read the code and the shape of the computation is obvious.&lt;/p&gt;
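
&lt;p&gt;For concreteness, here's a minimal sketch of what that middle stage could look like. It assumes &lt;code&gt;ParsedEvent&lt;/code&gt; carries an &lt;code&gt;AccountID&lt;/code&gt; and that &lt;code&gt;Aggregate&lt;/code&gt; has an &lt;code&gt;Add&lt;/code&gt; method returning the updated value; both are stand-ins, not IronSys code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func aggregateStage(ctx context.Context, in &amp;lt;-chan ParsedEvent, out chan&amp;lt;- Aggregate) error {
    var mu sync.Mutex                       // uncontended today; defensive against future readers
    byAccount := make(map[string]Aggregate) // keyed by account ID

    for {
        select {
        case ev, ok := &amp;lt;-in:
            if !ok {
                // Input closed: push the per-account results downstream.
                for _, agg := range byAccount {
                    select {
                    case out &amp;lt;- agg:
                    case &amp;lt;-ctx.Done():
                        return ctx.Err()
                    }
                }
                return nil
            }
            mu.Lock()
            byAccount[ev.AccountID] = byAccount[ev.AccountID].Add(ev) // assumed: Add folds the event in
            mu.Unlock()
        case &amp;lt;-ctx.Done():
            return ctx.Err()
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
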

&lt;h2&gt;
  
  
  The Cross-Service Handoff — Durable Queues
&lt;/h2&gt;

&lt;p&gt;IronSys talks to two other services: a billing service (async, eventually consistent) and an auth service (sync, immediate).&lt;/p&gt;

&lt;p&gt;For billing: a dedicated NATS JetStream subject with at-least-once delivery. Usage events go in one end; the billing service reads them. The emission codepath has a local write-ahead log so that if NATS is briefly down, events buffer on disk and replay when the connection recovers.&lt;/p&gt;

&lt;p&gt;For auth: gRPC with tight timeouts. Caller owns completion. If auth is slow, the API handler's deadline fires and the request fails fast.&lt;/p&gt;
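
&lt;p&gt;A minimal sketch of the auth call shape. The &lt;code&gt;authpb&lt;/code&gt; client and the 150ms budget are placeholders; the point is that the deadline rides on the caller's context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Caller owns completion: a tight deadline bounds how long the handler waits.
func checkAuth(ctx context.Context, client authpb.AuthClient, token string) (*authpb.CheckReply, error) {
    ctx, cancel := context.WithTimeout(ctx, 150*time.Millisecond)
    defer cancel()

    // If auth is slow, this returns context.DeadlineExceeded and the API
    // handler fails fast instead of queueing behind a sick dependency.
    return client.Check(ctx, &amp;amp;authpb.CheckRequest{Token: token})
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
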

&lt;p&gt;Two different ownership models for two different shapes of work. See: &lt;a href="https://harrisonsec.com/blog/rpc-vs-nats-who-owns-completion/" rel="noopener noreferrer"&gt;RPC vs NATS: Who Owns Completion&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Primitives Map
&lt;/h2&gt;

&lt;p&gt;Summarizing which primitive serves which job in IronSys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Work shape&lt;/th&gt;
&lt;th&gt;Primitive&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HTTP request handling&lt;/td&gt;
&lt;td&gt;Stock &lt;code&gt;net/http&lt;/code&gt; + goroutine per request&lt;/td&gt;
&lt;td&gt;Language default, right for stateless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot shared state (rate limiter, cache)&lt;/td&gt;
&lt;td&gt;Mutex / atomic&lt;/td&gt;
&lt;td&gt;Simplest primitive that works&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateful user sessions&lt;/td&gt;
&lt;td&gt;Goroutine-per-entity (actor-like)&lt;/td&gt;
&lt;td&gt;Isolated state, message-passing, serial processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session directory&lt;/td&gt;
&lt;td&gt;RWMutex-protected map&lt;/td&gt;
&lt;td&gt;Shared lookup, read-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming event processing&lt;/td&gt;
&lt;td&gt;Bounded channel + worker pool (CSP)&lt;/td&gt;
&lt;td&gt;Backpressure, parallelism, graceful shutdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-stage data pipeline&lt;/td&gt;
&lt;td&gt;CSP pipeline + errgroup&lt;/td&gt;
&lt;td&gt;Stage topology = architecture; first-error cancels all&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async cross-service handoff&lt;/td&gt;
&lt;td&gt;Durable queue (NATS JetStream / Kafka)&lt;/td&gt;
&lt;td&gt;Receiver owns completion, at-least-once delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sync cross-service call&lt;/td&gt;
&lt;td&gt;gRPC with ctx timeout&lt;/td&gt;
&lt;td&gt;Caller owns completion, fast failure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice: &lt;strong&gt;three of the four concurrency pillars show up&lt;/strong&gt;. Mutexes in the rate limiter. CSP in the event pipeline. Actors (in pattern) in the session runtime. (STM is the missing fourth; it would show up if I were doing this in Clojure or Haskell.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Architecture Gets Wrong
&lt;/h2&gt;

&lt;p&gt;Every architecture has weaknesses. IronSys's are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The actor pattern isn't real actors.&lt;/strong&gt; Without Erlang-style supervision, if a session goroutine panics, Go's default behavior is to kill the whole process. Adding per-session panic recovery is easy but not free. In practice, most teams hit this six months in, add a recovery wrapper (see the sketch after this list), and move on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded channels can mask slow downstream.&lt;/strong&gt; If a channel fills up and the producer blocks, that's backpressure — great. But if the channel is buffered too large, you can buffer a lot of work into memory before realizing downstream is slow. Tune buffer sizes with measurements, not guesses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goroutine-per-entity has a per-session baseline cost.&lt;/strong&gt; Cheap but not free. A million sessions is ~2.5GB of goroutine stacks. For services where most entities are inactive, a lazy pattern (spin up on activity, suspend to disk on idle) is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixing paradigms cognitively.&lt;/strong&gt; New engineers have to learn four patterns instead of one. The productivity hit is real for the first two weeks; the payoff is in the next two years.&lt;/li&gt;
&lt;/ul&gt;
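
&lt;p&gt;For reference, a minimal per-session recovery wrapper might look like the sketch below. A production version would cap or back off restarts so a hot panic loop doesn't quietly mask a real bug.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// goSupervised runs fn in its own goroutine and restarts it only if it panics.
// Hypothetical helper, not IronSys code.
func goSupervised(ctx context.Context, name string, fn func(context.Context)) {
    var run func()
    run = func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("session %s panicked: %v (restarting)", name, r)
                if ctx.Err() == nil {
                    go run() // restart; cap or back off in real code
                }
            }
        }()
        fn(ctx)
    }
    go run()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
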

&lt;h2&gt;
  
  
  What This Blueprint Is Really Selling
&lt;/h2&gt;

&lt;p&gt;A system with four work shapes should have four concurrency patterns, not one stretched to cover everything. The four pillars aren't theoretical; they map to real design decisions, and production Go services that use them deliberately are easier to reason about than those that don't.&lt;/p&gt;

&lt;p&gt;What IronSys is really selling is &lt;strong&gt;intentional heterogeneity&lt;/strong&gt;. Every primitive is there for a reason. Every reason is traceable to a specific failure mode you want to prevent. The architecture should be legible — a new engineer reading the code should understand why a channel is there instead of a mutex, why a session has its own goroutine instead of being a struct in a shared map, why billing goes through a durable queue instead of a gRPC call.&lt;/p&gt;

&lt;p&gt;If you can't answer "why this primitive here," the code isn't finished. It's just working, for now.&lt;/p&gt;

&lt;p&gt;Blueprints are useful precisely because they're generic. The specifics of your system will be different. But the decision framework — what's the work shape, what's the failure mode, what's the right primitive — is the same every time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/four-pillars-modern-concurrency-locks-to-actors/" rel="noopener noreferrer"&gt;From Locks to Actors: The Four Pillars of Modern Concurrency&lt;/a&gt; — the taxonomy behind the choices in IronSys.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — chan and context as the glue across all of these.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/rpc-vs-nats-who-owns-completion/" rel="noopener noreferrer"&gt;RPC vs NATS: It's Not About Sync vs Async — It's About Who Owns Completion&lt;/a&gt; — the cross-service handoff choices.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/testing-real-world-go-backends/" rel="noopener noreferrer"&gt;Testing Real-World Go Backends Isn't What Many People Think&lt;/a&gt; — how you verify a system like this actually holds up.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>concurrency</category>
      <category>systemdesign</category>
      <category>go</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Docker Kubernetes: What They Really Changed (It's Not What You Think)</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:13:19 +0000</pubDate>
      <link>https://forem.com/harrisonsec/docker-x-kubernetes-what-they-really-changed-its-not-what-you-think-1972</link>
      <guid>https://forem.com/harrisonsec/docker-x-kubernetes-what-they-really-changed-its-not-what-you-think-1972</guid>
      <description>&lt;p&gt;"A Docker container is basically a lightweight VM, right?" No. That sentence alone causes more architectural misunderstandings than any other in modern backend engineering. A VM virtualizes hardware. A container is a set of Linux kernel features — namespaces, cgroups, overlay filesystems — wrapped in a nicer CLI. Same host kernel, same memory space, same attack surface if the kernel has a bug. The marketing that says otherwise has cost teams real money in misconfigured production.&lt;/p&gt;

&lt;p&gt;Kubernetes gets the same treatment. "It's a tool for running containers." Also not really. Kubernetes is a distributed scheduler, service-discovery and load-balancing layer, declarative control plane, and reconciliation engine. Containers are one of the things it happens to run. Treating Kubernetes as "container orchestration" produces systems that break in predictable, frustrating ways — because the team never learned that the reconciliation loop, not the container, is the thing that actually matters.&lt;/p&gt;

&lt;p&gt;This is a working engineer's re-read of what Docker and Kubernetes actually changed. Not the marketing story. The underneath-the-hood story that tells you when to reach for them and when they're overkill.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Docker didn't invent Linux namespaces, cgroups, or filesystem layering; it packaged them into a developer-friendly workflow. That workflow is what changed. Kubernetes didn't invent distributed scheduling, service discovery, or rolling deployments; it standardized the declarative, reconciliation-loop pattern for all of them. That pattern is what changed. Understanding these primitives (namespaces + cgroups + reconciliation loops) tells you when to reach for the tools and when the tools are overkill.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Docker Actually Is
&lt;/h2&gt;

&lt;p&gt;Docker is a set of Linux kernel features wrapped in a nice CLI and an image format. The features existed before Docker; they just weren't accessible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linux namespaces&lt;/strong&gt; — process, mount, network, IPC, UTS, user, cgroup. Each namespace gives a process its own view of that resource. When your container thinks it has PID 1, it really thinks so; inside its PID namespace, the host's init is invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cgroups (v1/v2)&lt;/strong&gt; — resource accounting and limits. How much CPU, memory, I/O bandwidth a group of processes can use. This is why a misconfigured container can eat a host's memory and take everything else down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Union / overlay filesystems&lt;/strong&gt; — the thing that lets you stack "base image" + "layer 1" + "layer 2" without copying. OverlayFS on modern kernels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image format (OCI)&lt;/strong&gt; — a standard way to package a root filesystem plus metadata into something reproducible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker's innovation was not inventing any of this. It was making them &lt;strong&gt;accessible&lt;/strong&gt;. &lt;code&gt;docker run -p 8080:80 nginx&lt;/code&gt; hides a beautiful horror of namespace creation, iptables rules, virtual ethernet pairs, overlay mounts, and cgroup assignment. Before Docker, you'd have spent a week reading &lt;code&gt;unshare(2)&lt;/code&gt; and &lt;code&gt;ip netns add&lt;/code&gt; to reproduce this. After Docker, you did it in a workshop afternoon.&lt;/p&gt;

&lt;p&gt;What actually changed: &lt;strong&gt;deployments became reproducible&lt;/strong&gt;. The image you built on your laptop contained everything needed to run — OS libraries, Python version, environment. "Works on my machine" stopped being a coping mechanism and started being a legitimate development artifact. That's the Docker revolution. Not containers. Reproducible, portable environments.&lt;/p&gt;

&lt;p&gt;The thing that is &lt;em&gt;not&lt;/em&gt; true, despite the marketing: Docker containers are not VMs. They share the host kernel. A kernel exploit in one container can reach the host and other containers. Containers are a soft isolation — good enough for most production multi-tenant workloads, not good enough for hostile tenants.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kubernetes Actually Is
&lt;/h2&gt;

&lt;p&gt;Kubernetes is a declarative control plane built on the &lt;strong&gt;reconciliation loop&lt;/strong&gt; pattern. This is the single most important idea to internalize.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You write a manifest describing the &lt;strong&gt;desired state&lt;/strong&gt;: "three replicas of this deployment, exposed through this service, attached to this config."&lt;/li&gt;
&lt;li&gt;You hand the manifest to the control plane: "make it so."&lt;/li&gt;
&lt;li&gt;Kubernetes runs an unending loop: observe the current state, compare to desired, take actions to close the gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything Kubernetes does follows this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Deployment&lt;/code&gt; controllers watch the pod count, scale up if low, scale down if high.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ReplicaSet&lt;/code&gt; controllers ensure N identical pods exist.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Service&lt;/code&gt; controllers maintain the iptables / IPVS / eBPF rules that route virtual IPs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Ingress&lt;/code&gt; controllers watch Ingress resources and configure the edge proxy.&lt;/li&gt;
&lt;li&gt;The scheduler watches for unscheduled pods and binds them to nodes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Node&lt;/code&gt; controller watches node health and evicts pods from unhealthy nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your application is just the &lt;strong&gt;data&lt;/strong&gt; in the reconciliation loop. The loops run forever, closing gaps. That's Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBHaXRbKEdpdCDCtyBtYW5pZmVzdHM8YnIvPnNvdXJjZSBvZiB0cnV0aCldIC0tPiBBUElbS3ViZXJuZXRlcyBBUEkgc2VydmVyXQoKICAgIHN1YmdyYXBoIExvb3BbIlJlY29uY2lsaWF0aW9uIGxvb3AgwrcgZm9yZXZlciJdCiAgICAgICAgRGVzaXJlZFsiRGVzaXJlZCBzdGF0ZTxici8-ZnJvbSBtYW5pZmVzdCJdIC0tPiBDb21wYXJle01hdGNoP30KICAgICAgICBPYnNlcnZlZFsiT2JzZXJ2ZWQgc3RhdGU8YnIvPmZyb20gY2x1c3RlciJdIC0tPiBDb21wYXJlCiAgICAgICAgQ29tcGFyZSAtLT58Tm8gwrcgYWN0fCBBY3Rpb25bIkNvbnRyb2xsZXIgdGFrZXMgYWN0aW9uPGJyLz5zY2FsZSDCtyBzY2hlZHVsZSDCtyBldmljdCDCtyByb3V0ZSJdCiAgICAgICAgQWN0aW9uIC0tPiBPYnNlcnZlZAogICAgICAgIENvbXBhcmUgLS0-fFllcyDCtyB3YWl0fCBPYnNlcnZlZAogICAgZW5kCgogICAgQVBJIC0tPiBEZXNpcmVkCiAgICBBUEkgLS0-IE9ic2VydmVkCgogICAgVXNlcihbWW91IMK3IGt1YmVjdGwgYXBwbHldKSAtLT58dXBkYXRlIG1hbmlmZXN0fCBHaXQKCiAgICBjbGFzc0RlZiBsb29wIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBMb29wIGxvb3A%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBHaXRbKEdpdCDCtyBtYW5pZmVzdHM8YnIvPnNvdXJjZSBvZiB0cnV0aCldIC0tPiBBUElbS3ViZXJuZXRlcyBBUEkgc2VydmVyXQoKICAgIHN1YmdyYXBoIExvb3BbIlJlY29uY2lsaWF0aW9uIGxvb3AgwrcgZm9yZXZlciJdCiAgICAgICAgRGVzaXJlZFsiRGVzaXJlZCBzdGF0ZTxici8-ZnJvbSBtYW5pZmVzdCJdIC0tPiBDb21wYXJle01hdGNoP30KICAgICAgICBPYnNlcnZlZFsiT2JzZXJ2ZWQgc3RhdGU8YnIvPmZyb20gY2x1c3RlciJdIC0tPiBDb21wYXJlCiAgICAgICAgQ29tcGFyZSAtLT58Tm8gwrcgYWN0fCBBY3Rpb25bIkNvbnRyb2xsZXIgdGFrZXMgYWN0aW9uPGJyLz5zY2FsZSDCtyBzY2hlZHVsZSDCtyBldmljdCDCtyByb3V0ZSJdCiAgICAgICAgQWN0aW9uIC0tPiBPYnNlcnZlZAogICAgICAgIENvbXBhcmUgLS0-fFllcyDCtyB3YWl0fCBPYnNlcnZlZAogICAgZW5kCgogICAgQVBJIC0tPiBEZXNpcmVkCiAgICBBUEkgLS0-IE9ic2VydmVkCgogICAgVXNlcihbWW91IMK3IGt1YmVjdGwgYXBwbHldKSAtLT58dXBkYXRlIG1hbmlmZXN0fCBHaXQKCiAgICBjbGFzc0RlZiBsb29wIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBMb29wIGxvb3A%3D" alt="Git[(Git · manifests&amp;lt;br/&amp;gt;source of truth)] --&amp;gt; API[Kubernetes API server]" width="1723" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every Kubernetes feature — Deployments, Services, Ingresses, HPAs, CronJobs, StatefulSets — is some controller running this exact pattern. Once you see it, the platform stops being magic.&lt;/p&gt;
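
&lt;p&gt;The loop itself is simple enough to sketch. This isn't Kubernetes source, just the shape every controller follows, written in Go because that's what most controllers are written in; the helpers (&lt;code&gt;loadDesiredState&lt;/code&gt;, &lt;code&gt;observeClusterState&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;, &lt;code&gt;apply&lt;/code&gt;) are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// The shape of a controller: observe, compare, act, repeat forever.
func reconcileLoop(ctx context.Context, interval time.Duration) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for {
        select {
        case &amp;lt;-ctx.Done():
            return
        case &amp;lt;-ticker.C:
            desired := loadDesiredState()     // from the API server / manifests
            observed := observeClusterState() // list pods, nodes, endpoints
            for _, action := range diff(desired, observed) {
                if err := apply(ctx, action); err != nil {
                    log.Printf("reconcile: %v", err) // log and try again next tick
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real controllers are driven by watches on the API server rather than a fixed ticker, but the observe-compare-act shape is the same.&lt;/p&gt;
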

&lt;p&gt;What actually changed because of this: &lt;strong&gt;the operational model became shared across companies&lt;/strong&gt;. Before Kubernetes, every engineering team had a bespoke orchestration system: a collection of Chef/Puppet/Ansible recipes, some custom scripts, a deploy button, and a few senior engineers who knew which knobs to turn during incidents. Different at every company. Opaque to new hires. Sensitive to key-person risk.&lt;/p&gt;

&lt;p&gt;Kubernetes is many things, but the single biggest thing it did was replace a hundred bespoke orchestration glues with one standard. It's not the best tool for every problem — Nomad is simpler, ECS is more managed, Cloud Run hides the thing entirely — but it's the standard, and "it's the standard" has real value: hires know it, vendors build for it, books exist, the job market is liquid.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mental Model Most People Miss
&lt;/h2&gt;

&lt;p&gt;Once you see "reconciliation loop," you stop asking questions Kubernetes doesn't answer.&lt;/p&gt;

&lt;p&gt;"How do I deploy?" You don't. You update a manifest. A controller observes the change and reconciles.&lt;/p&gt;

&lt;p&gt;"How do I roll back?" You don't. You update the manifest back. A controller observes the change and reconciles in the other direction.&lt;/p&gt;

&lt;p&gt;"Why did my pod get killed?" Because a controller decided the current state (this pod is here, on this node) didn't match the desired state (node is draining, or pod is over its memory limit, or a replica count decreased). It closed the gap.&lt;/p&gt;

&lt;p&gt;"Why can't I SSH in and hand-edit things?" Because the next reconcile loop will undo your edit. The manifest is the source of truth. If you want to change behavior, change the manifest.&lt;/p&gt;

&lt;p&gt;This is a shift from imperative ops ("run these commands to deploy") to declarative ops ("the system should look like this; make it so"). Git becomes the history of what your infrastructure should be. Time travel works. Change review works. Disaster recovery becomes "re-apply the manifests to a new cluster." When it clicks, you stop fighting the platform.&lt;/p&gt;

&lt;p&gt;Until it clicks, the platform feels maddening. "I just want to run a container" — yes, but the platform doesn't care about the one-off action you want to perform. It cares about the state that should hold continuously. Every &lt;code&gt;kubectl apply&lt;/code&gt; is a statement of desired state, not an imperative command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed in Practice
&lt;/h2&gt;

&lt;p&gt;Concretely, what looks different on a team that's moved from "SSH into the box and &lt;code&gt;systemctl restart&lt;/code&gt;" to a reconciled-state model:&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment became a git push
&lt;/h3&gt;

&lt;p&gt;Before: log into the bastion, pull the latest build, restart the service, watch the log.&lt;br&gt;
After: merge to main, CI pushes image to registry, ArgoCD/Flux observes the manifest change, the Deployment controller updates the ReplicaSet, pods roll gradually.&lt;/p&gt;

&lt;p&gt;Benefits: change review, audit trail, rollback by git revert, consistent deploys across teams.&lt;br&gt;
Costs: debugging a broken deploy requires understanding the CD pipeline, the manifest, and the controller that's reconciling. The failure mode surface is wider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling became a number in a file
&lt;/h3&gt;

&lt;p&gt;Before: write a script that watches metrics, calls the cloud API, hopes for the best.&lt;br&gt;
After: &lt;code&gt;replicas: 10&lt;/code&gt; in a manifest, or an HPA (Horizontal Pod Autoscaler) that watches metrics and adjusts the Deployment.&lt;/p&gt;

&lt;p&gt;Benefits: declarative, versioned, reproducible.&lt;br&gt;
Costs: HPA behavior is subtle — wrong thresholds cause thrashing, wrong metrics cause over/underscaling. Many teams never invest in tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service discovery became DNS
&lt;/h3&gt;

&lt;p&gt;Before: register in Consul, read from Consul, keep a catalog. Or hardcode IPs. Or run a homegrown service registry.&lt;br&gt;
After: &lt;code&gt;my-service.my-namespace.svc.cluster.local&lt;/code&gt; resolves to a stable virtual IP. Kube-proxy or CNI load-balances to healthy pods.&lt;/p&gt;

&lt;p&gt;Benefits: services don't need to know how other services run. Standard DNS.&lt;br&gt;
Costs: the DNS / networking layer is one of the hardest parts of Kubernetes to debug. When service discovery breaks, you're reading iptables or eBPF maps, not a Consul dashboard.&lt;/p&gt;
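
&lt;p&gt;Still, from the application's point of view it really is just DNS. A sketch, with a made-up service name and port:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Inside the cluster, the Service's DNS name resolves to a stable virtual IP;
// kube-proxy (or the CNI) spreads connections across healthy pods.
func callBilling(ctx context.Context) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet,
        "http://billing.payments.svc.cluster.local:8080/v1/health", nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
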

&lt;h3&gt;
  
  
  Configuration became a manifest
&lt;/h3&gt;

&lt;p&gt;Before: environment variables, .env files, maybe Consul KV.&lt;br&gt;
After: ConfigMaps and Secrets, mounted as env vars or volumes.&lt;/p&gt;

&lt;p&gt;Benefits: versioned, reviewed, separate from code.&lt;br&gt;
Costs: changing a ConfigMap doesn't automatically restart pods. You have to annotate the Deployment or use something like reloader. New users get bitten by this constantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Kubernetes Is Overkill
&lt;/h2&gt;

&lt;p&gt;I'll say it directly: most teams adopting Kubernetes for the first time don't need it.&lt;/p&gt;

&lt;p&gt;Rules of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two or three services, one team&lt;/strong&gt;: you don't need Kubernetes. ECS, Nomad, Cloud Run, or even systemd + Ansible will do. The operational overhead of Kubernetes exceeds its benefit at this scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ten to twenty services, small team&lt;/strong&gt;: Kubernetes starts breaking even if you pick a managed service (EKS, GKE, AKS). Don't run your own control plane.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fifty+ services, multiple teams, serious release engineering needs&lt;/strong&gt;: Kubernetes is probably the right call. The cost of complexity is amortized over the benefits of a shared declarative platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dangerous zone is 5-15 services on a small team. At that scale, Kubernetes often wins the resume-driven-development vote and loses the actual-outcomes vote. Pick a simpler tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Kubernetes Is the Right Answer
&lt;/h2&gt;

&lt;p&gt;The jobs where Kubernetes genuinely shines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-service, multi-team engineering orgs&lt;/strong&gt; where consistency matters more than per-service optimality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-out workloads with heterogeneous shapes&lt;/strong&gt; — web apps, job runners, ML batch jobs, stateful databases, all on one platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teams that want declarative infrastructure&lt;/strong&gt; — GitOps via ArgoCD/Flux, infra PRs reviewed like code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workloads with nontrivial scheduling&lt;/strong&gt; — affinity rules, taints, GPU allocation, spot instances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operators ecosystem&lt;/strong&gt; — Kubernetes operators (Prometheus operator, cert-manager, etc.) let you extend the same reconciliation model to application-specific concerns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the pattern: Kubernetes wins when you want the platform's primitives — declarative state, reconciliation, operators — beyond just container scheduling. If you only want "run my container," you're buying a jumbo jet to fly to the next town.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Tell a Team Starting Fresh
&lt;/h2&gt;

&lt;p&gt;Two concrete takeaways I'd hand to engineers thinking about Docker and Kubernetes.&lt;/p&gt;

&lt;p&gt;For Docker: the image isn't the point. Reproducibility is. An image built on your laptop that runs unchanged in CI and production — that's the contract you got. Break it (say, by mutating state inside the running container) and you lose the value. The container is a delivery mechanism for a reproducible environment.&lt;/p&gt;

&lt;p&gt;For Kubernetes: the manifest is the source of truth. Every piece of your infrastructure — deployments, services, secrets, ingresses, policies — lives in git. Every change is a git change. Every rollback is a git revert. If you find yourself running &lt;code&gt;kubectl edit&lt;/code&gt; on production, something is wrong with your workflow, not with Kubernetes.&lt;/p&gt;

&lt;p&gt;Both tools won because they codified patterns that were already emerging in sophisticated shops. They didn't invent the patterns. They made them accessible, portable, and standard. That's the fifteen-year revolution. Not containers. Not YAML. The standardization of patterns that used to require a senior infrastructure team to implement from scratch at every company.&lt;/p&gt;

&lt;p&gt;When you work with the grain of the pattern — reproducible environments for Docker, reconciled declarative state for Kubernetes — both tools get out of the way. When you fight the grain, they fight back.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections&lt;/a&gt; — Linux primitives that Docker is built on, seen from the language side.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/observability-cost-attribution-dual-path-architecture/" rel="noopener noreferrer"&gt;Observability and Cost Attribution: Why One Pipeline Isn't Enough&lt;/a&gt; — what happens to operational complexity when you have dozens of services.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/scale-up-scale-out-every-language-wins-somewhere/" rel="noopener noreferrer"&gt;Scale-Up vs Scale-Out: Why Every Language Wins Somewhere&lt;/a&gt; — the architectural decision that drives whether you need Kubernetes at all.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>kubernetes</category>
      <category>containers</category>
      <category>devops</category>
    </item>
    <item>
      <title>Observability and Cost Attribution: Why One Pipeline Isn't Enough</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Mon, 20 Apr 2026 07:13:17 +0000</pubDate>
      <link>https://forem.com/harrisonsec/observability-and-cost-attribution-why-one-pipeline-isnt-enough-1283</link>
      <guid>https://forem.com/harrisonsec/observability-and-cost-attribution-why-one-pipeline-isnt-enough-1283</guid>
      <description>&lt;p&gt;A team I worked with tried to build their billing system on top of their tracing pipeline. The idea was clean: every operation already generates a span; spans already have duration and attributes; adding &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;billable_units&lt;/code&gt; to each span lets finance query the trace store to compute invoices. One pipeline, less infrastructure. Beautiful.&lt;/p&gt;

&lt;p&gt;Six weeks before the first billing cycle, the wheels came off. The tracing system was sampling at 10% because full-capture was too expensive. The sampler was head-based, meaning whether a trace got kept was decided at request entry, long before the code knew whether the request was billable. Some users got charged for 10% of their actual usage; others got free service. Nobody's invoice agreed with the other team's report.&lt;/p&gt;

&lt;p&gt;The workaround — "don't sample billable traces" — sounded reasonable, broke the tracing pipeline's cost model immediately, and created a dozen new edge cases around which requests counted as "billable." Within a month the team was reluctantly building a second pipeline for billing. They still had the first one for traces. Now they had two pipelines that disagreed with each other.&lt;/p&gt;

&lt;p&gt;The postmortem landed on a single sentence: &lt;strong&gt;observability and cost attribution aren't the same problem, and pretending they are is expensive twice.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Tracing and metrics optimize for signal-to-noise — you want the interesting outliers, sampling is OK, dropping data is tolerable. Billing optimizes for completeness and auditability — every event must be captured and durably recorded, end of story. The two pipelines have opposite trade-offs on sampling, retention, schema evolution, and cost. Building them as one pipeline forces one of the two to lose. Build them as two, share primitives where possible, let each specialize where it must.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why They Look Alike
&lt;/h2&gt;

&lt;p&gt;Observability pipelines and billing pipelines do look eerily similar from a distance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both capture events from production systems.&lt;/li&gt;
&lt;li&gt;Both attach metadata to those events.&lt;/li&gt;
&lt;li&gt;Both aggregate events over time windows.&lt;/li&gt;
&lt;li&gt;Both export to a query layer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's tempting — especially to engineers who like clean architecture — to say &lt;em&gt;these are the same problem&lt;/em&gt; and build one system. The similarity is surface. The constraints are opposite.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Observability&lt;/th&gt;
&lt;th&gt;Billing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Loss tolerance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (sampling is fine)&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency tolerance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds to minutes&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Days to weeks&lt;/td&gt;
&lt;td&gt;Years&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, frequent&lt;/td&gt;
&lt;td&gt;Slow, with audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cardinality profile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low cardinality on hot dims&lt;/td&gt;
&lt;td&gt;Arbitrary (per user, per resource)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SRE, engineering, on-call&lt;/td&gt;
&lt;td&gt;Finance, legal, customer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blind spot in a dashboard&lt;/td&gt;
&lt;td&gt;Wrong invoice, legal exposure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The one that really matters: &lt;strong&gt;loss tolerance&lt;/strong&gt;. Everything else follows from it.&lt;/p&gt;

&lt;p&gt;A tracing pipeline that drops 10% of spans is fine. You still see the outliers. You still find the slow paths. The system does its job.&lt;/p&gt;

&lt;p&gt;A billing pipeline that drops 10% of events is a disaster. Some users underpay. Some users overpay. Finance reconciliation fails. You end up manually auditing transactions for weeks.&lt;/p&gt;

&lt;p&gt;The moment one pipeline has to satisfy zero-loss and the other can tolerate 90% sampling, you have two different systems whether you wanted one or two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dual-Path Architecture
&lt;/h2&gt;

&lt;p&gt;The design I keep reaching back to is straightforward: &lt;strong&gt;two pipelines, shared ingest, separate durability and query paths&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBcHBbQXBwbGljYXRpb24gY29kZV0gLS0-IHxPVExQIHNwYW5zfCBUcmFjZUNvbAogICAgQXBwIC0tPiB8c3RydWN0dXJlZCB1c2FnZSBldmVudHM8YnIvPnVuc2FtcGxlZHwgVXNhZ2VRCgogICAgc3ViZ3JhcGggVHJhY2VQYXRoWyJUcmFjaW5nIHBhdGgg4oCUIGxvc3MtdG9sZXJhbnQsIGZhc3QiXQogICAgICAgIFRyYWNlQ29sWyJUcmFjaW5nIGNvbGxlY3RvciJdIC0tPiBTYW1wbGVyWyJTYW1wbGVyPGJyLz5oZWFkIG9yIHRhaWwgwrcgfjEwJSJdCiAgICAgICAgU2FtcGxlciAtLT4gSG90U3RvcmVbIkhvdCB0cmFjZSBzdG9yZTxici8-VGVtcG8gLyBKYWVnZXI8YnIvPmRheXMgcmV0ZW50aW9uIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEJpbGxpbmdQYXRoWyJCaWxsaW5nIHBhdGgg4oCUIHplcm8tbG9zcywgYXVkaXRhYmxlIl0KICAgICAgICBVc2FnZVFbIkR1cmFibGUgcXVldWU8YnIvPkthZmthIC8gTkFUUyBKZXRTdHJlYW08YnIvPldBTC1kdXJhYmxlIl0gLS0-IFdhcmVob3VzZVsiQ29sdW1uYXIgd2FyZWhvdXNlPGJyLz5CaWdRdWVyeSAvIFNub3dmbGFrZSAvIENsaWNrSG91c2U8YnIvPnllYXJzIHJldGVudGlvbiJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiB0cmFjZSBmaWxsOiNmZWY1ZTcsc3Ryb2tlOiNiNzc5MWYKICAgIGNsYXNzRGVmIGJpbGwgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhLHN0cm9rZS13aWR0aDoycHgKICAgIGNsYXNzIFRyYWNlUGF0aCB0cmFjZQogICAgY2xhc3MgQmlsbGluZ1BhdGggYmlsbA%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBBcHBbQXBwbGljYXRpb24gY29kZV0gLS0-IHxPVExQIHNwYW5zfCBUcmFjZUNvbAogICAgQXBwIC0tPiB8c3RydWN0dXJlZCB1c2FnZSBldmVudHM8YnIvPnVuc2FtcGxlZHwgVXNhZ2VRCgogICAgc3ViZ3JhcGggVHJhY2VQYXRoWyJUcmFjaW5nIHBhdGgg4oCUIGxvc3MtdG9sZXJhbnQsIGZhc3QiXQogICAgICAgIFRyYWNlQ29sWyJUcmFjaW5nIGNvbGxlY3RvciJdIC0tPiBTYW1wbGVyWyJTYW1wbGVyPGJyLz5oZWFkIG9yIHRhaWwgwrcgfjEwJSJdCiAgICAgICAgU2FtcGxlciAtLT4gSG90U3RvcmVbIkhvdCB0cmFjZSBzdG9yZTxici8-VGVtcG8gLyBKYWVnZXI8YnIvPmRheXMgcmV0ZW50aW9uIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEJpbGxpbmdQYXRoWyJCaWxsaW5nIHBhdGgg4oCUIHplcm8tbG9zcywgYXVkaXRhYmxlIl0KICAgICAgICBVc2FnZVFbIkR1cmFibGUgcXVldWU8YnIvPkthZmthIC8gTkFUUyBKZXRTdHJlYW08YnIvPldBTC1kdXJhYmxlIl0gLS0-IFdhcmVob3VzZVsiQ29sdW1uYXIgd2FyZWhvdXNlPGJyLz5CaWdRdWVyeSAvIFNub3dmbGFrZSAvIENsaWNrSG91c2U8YnIvPnllYXJzIHJldGVudGlvbiJdCiAgICBlbmQKCiAgICBjbGFzc0RlZiB0cmFjZSBmaWxsOiNmZWY1ZTcsc3Ryb2tlOiNiNzc5MWYKICAgIGNsYXNzRGVmIGJpbGwgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhLHN0cm9rZS13aWR0aDoycHgKICAgIGNsYXNzIFRyYWNlUGF0aCB0cmFjZQogICAgY2xhc3MgQmlsbGluZ1BhdGggYmlsbA%3D%3D" alt="App[Application code] --&amp;gt; |OTLP spans| TraceCol" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two emission paths from the application. Two pipelines behind them. Each tuned for its job.&lt;/p&gt;

&lt;h3&gt;
  
  
  The tracing path
&lt;/h3&gt;

&lt;p&gt;Stays conventional. OpenTelemetry SDK emits spans. Collector applies head-based or tail-based sampling. Hot store (Tempo, Jaeger, Grafana Cloud) gets 10-20% of the volume. Retention a few days to a few weeks. Query layer is for engineers debugging incidents.&lt;/p&gt;

&lt;p&gt;What I optimize for here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost per span&lt;/strong&gt; — you're keeping billions; every byte matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query latency&lt;/strong&gt; — on-call wants answers in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-instrumentation coverage&lt;/strong&gt; — the fewer things you have to manually instrument, the better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I don't care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full capture. Sampling is fine.&lt;/li&gt;
&lt;li&gt;Long retention. You're debugging last Tuesday, not last fiscal year.&lt;/li&gt;
&lt;li&gt;Per-user accuracy. If a single user's trace got dropped, nobody cares.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The usage-event path
&lt;/h3&gt;

&lt;p&gt;The dedicated billing pipeline. Every billable operation emits a &lt;strong&gt;usage event&lt;/strong&gt; — a small, structured record with everything finance needs and nothing it doesn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ue_01HFNGR..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"occurred_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-14T18:22:30.145Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acc_12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resource_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"res_6789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api.request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dimensions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"units"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpu_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;147&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"egress_bytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8342&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"idempotency_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"req_abc_20260214182230"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rules on this path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unsampled.&lt;/strong&gt; Every billable operation emits exactly one event. No head sampling. No tail sampling. No "approximate."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable writes.&lt;/strong&gt; Emitter has a local write-ahead log or durable queue. If the downstream is down, events buffer locally until delivery. No dropped events under partial failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency keys.&lt;/strong&gt; Every event has a unique ID (or composite key) so downstream dedup is trivial. This lets you retry safely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema versioned and immutable.&lt;/strong&gt; Once an event shape is shipped, it doesn't mutate. New fields add a new version. Old versions keep working until you intentionally deprecate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long retention.&lt;/strong&gt; Years, usually. Auditors ask for 2023's data in 2027.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downstream infrastructure matches: &lt;strong&gt;Kafka or NATS JetStream with high replication factor&lt;/strong&gt; for ingest, &lt;strong&gt;columnar warehouse&lt;/strong&gt; (BigQuery, Snowflake, ClickHouse) for aggregation and query, &lt;strong&gt;separate auth and access control&lt;/strong&gt; from engineering-facing tools.&lt;/p&gt;
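
&lt;p&gt;A sketch of an emit path that follows these rules. The &lt;code&gt;WAL&lt;/code&gt;, &lt;code&gt;Publisher&lt;/code&gt;, and &lt;code&gt;UsageEvent&lt;/code&gt; types and the subject name are placeholders; what matters is the order of operations: durable local append first, publish second, dedupe downstream on the idempotency key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Emit exactly one usage event per billable operation. Append to a local
// write-ahead log before publishing, so a downstream outage buffers events
// on disk instead of dropping them.
func emitUsage(ctx context.Context, wal *WAL, q Publisher, ev UsageEvent) error {
    if ev.IdempotencyKey == "" {
        return errors.New("usage event missing idempotency key")
    }
    if err := wal.Append(ev); err != nil { // durable before anything else
        return err
    }
    if err := q.Publish(ctx, "usage.events", ev); err != nil {
        // Leave it in the WAL; a background loop replays undelivered events.
        return nil
    }
    return wal.MarkDelivered(ev.IdempotencyKey)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
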

&lt;h3&gt;
  
  
  What the two paths share
&lt;/h3&gt;

&lt;p&gt;Not nothing. They share:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The trace/request ID.&lt;/strong&gt; Usage events include the trace ID of the request that generated them. This is the &lt;em&gt;one&lt;/em&gt; cross-pipeline link that matters — when finance escalates "this user says they were charged for X requests but they swear they only made Y," you want to be able to find the traces of those Y requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry as the emission library.&lt;/strong&gt; OTel can emit both spans and custom events. Using it for both keeps the instrumentation codepaths uniform. But the pipelines behind the emitter are different.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The application's definition of an "operation."&lt;/strong&gt; Both pipelines have opinions about what counts as one operation. Keep that definition single-source.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Head-Sampling Kills Billing
&lt;/h2&gt;

&lt;p&gt;Worth dwelling on the specific thing that breaks when you try to unify.&lt;/p&gt;

&lt;p&gt;Head-based sampling decides whether to record a trace at entry, based on trace ID. It's O(1), stateless, and fair across traffic shapes — the standard default.&lt;/p&gt;
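
&lt;p&gt;For concreteness, the entire decision a head sampler makes fits in a few lines, and nothing about the request's billability is visible to it. This is a sketch of the idea, not any particular SDK's implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Head-based sampling: keep a fixed fraction (ratio) of traces, decided purely
// from the trace ID at request entry. Whether the request turns out to be
// billable is unknowable at this point.
func sampleHead(traceID [16]byte, ratio float64) bool {
    v := binary.BigEndian.Uint64(traceID[8:]) // low 8 bytes as a uniform hash
    return float64(v) &amp;lt; ratio*float64(math.MaxUint64)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
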

&lt;p&gt;The failure: &lt;strong&gt;at entry time, the system has no idea whether this request will be billable.&lt;/strong&gt; The sampler doesn't know if the user is on a paid plan, if the request will succeed, if it will hit a billable feature. It just picks randomly.&lt;/p&gt;

&lt;p&gt;Tail-based sampling fixes part of this — you decide after the fact, based on span attributes. Now you can keep all errors, all slow requests, all requests from paid users. Better, but still subject to buffering limits. Heavy tail-based samplers sit in front of your trace ingest pipeline and drop spans when buffers fill, which still gives you lossy billing during traffic bursts.&lt;/p&gt;

&lt;p&gt;The only sampler that's correct for billing is "capture everything." And "capture everything" is what the tracing pipeline tries to avoid, because that's what makes it expensive.&lt;/p&gt;

&lt;p&gt;You can do "capture everything for billable operations, sample everything else" in one pipeline. It works. It also ends up being the most complex sampler you've ever written, with an exception branch that duplicates the decision logic from your actual billing code. The dedicated usage-event path is simpler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cardinality and the Per-User Problem
&lt;/h2&gt;

&lt;p&gt;A related anti-pattern: attaching user ID as a Prometheus label.&lt;/p&gt;

&lt;p&gt;Prometheus (and most metrics systems) store one time series per label combination. Add a &lt;code&gt;user_id&lt;/code&gt; label to a metric that ten thousand users hit, and you just created ten thousand time series. Add a &lt;code&gt;request_type&lt;/code&gt; label alongside, and that's ten thousand × request-type-count. Cardinality explodes. Your metrics storage bill goes with it.&lt;/p&gt;

&lt;p&gt;The instinct is fine — "I want to track per-user throughput" — but the mechanism is wrong. A high-cardinality metric label is the wrong-shaped container for per-user data; a usage event is the right one. Emit a usage event with &lt;code&gt;account_id&lt;/code&gt; as a dimension, then aggregate per user in the warehouse at query time.&lt;/p&gt;

&lt;p&gt;Rule I use: &lt;strong&gt;metrics for engineering-facing dashboards, events for business-facing attribution&lt;/strong&gt;. If the label cardinality could exceed ~1,000 distinct values, it belongs in an event, not a label.&lt;/p&gt;
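
&lt;p&gt;The rule in code, as a sketch: the Prometheus counter keeps a bounded label set, and the per-account detail rides on an unsampled usage event instead. &lt;code&gt;emitUsageEvent&lt;/code&gt; and &lt;code&gt;UsageEvent&lt;/code&gt; stand in for the durable emit path described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Engineering-facing: bounded label set, safe for Prometheus.
var apiRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "api_requests_total"},
    []string{"route", "status"}, // low cardinality only; no user_id label
)

// Business-facing: per-account detail goes into an unsampled usage event.
func recordRequest(ctx context.Context, route, status, accountID string) {
    apiRequests.WithLabelValues(route, status).Inc()
    emitUsageEvent(ctx, UsageEvent{
        AccountID:  accountID,
        Operation:  "api.request",
        Dimensions: map[string]string{"route": route},
        Units:      map[string]int{"requests": 1},
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
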

&lt;h2&gt;
  
  
  The Boring Operational Details
&lt;/h2&gt;

&lt;p&gt;Where the two pipelines actually differ in day-to-day ops:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retention&lt;/strong&gt;. Tracing a few weeks, maybe. Billing store, years. Warehouse partitioning by date and account_id makes multi-year queries practical. Archive older partitions to object storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access control&lt;/strong&gt;. Traces: engineers. Billing events: accounting + support + an audit-only read path for legal. Not the same principals, not the same ACL model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema governance&lt;/strong&gt;. Traces: OTel semantic conventions, loose. Billing events: your own schema with a proto or Avro definition, version bumps tracked in a migration log, additive only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconciliation&lt;/strong&gt;. Billing needs to agree with itself. Daily reconciliation job that asserts "yesterday's event count per user equals the sum of the per-hour counts" catches silent drops early. No equivalent makes sense for tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replay&lt;/strong&gt;. When a billing bug is discovered, you need to replay historical events through a fixed pipeline. Kafka's offset model makes this natural; NATS JetStream has it too. The tracing pipeline rarely needs replay — if the last two weeks of traces have a bug, you shrug and fix forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Can Get Away With One
&lt;/h2&gt;

&lt;p&gt;Small workloads with no audit requirement, usage-based pricing below ~$1/user, and a team of three — one pipeline is fine. Add user attributes to spans, store them all, build a nightly aggregation job, call it billing. It works.&lt;/p&gt;

&lt;p&gt;The threshold where it stops working is somewhere around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Revenue per customer exceeds the cost of a mistake.&lt;/strong&gt; At $10k/month per customer, a dropped event is a $10k issue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The first auditor asks "show me exactly what this customer used in March 2024."&lt;/strong&gt; Unsampled, durable, retrievable, signed — those are the table stakes for audit-grade billing, and sampled traces can't meet any of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering starts wanting cheaper traces.&lt;/strong&gt; When the tracing pipeline outgrows your budget and someone proposes "let's sample more aggressively," you're about to break billing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When any of those lights up, separate the pipelines. The cheapest time to separate is before you've built tools on top of the unified one.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Invest in Splitting the Pipelines
&lt;/h2&gt;

&lt;p&gt;Observability and cost attribution are adjacent problems that optimize for opposite things. A tracing pipeline that compromises on completeness becomes a bad billing pipeline. A billing pipeline that compromises on cardinality and retention becomes a bad tracing pipeline. Building one system that satisfies both usually produces two systems that satisfy neither.&lt;/p&gt;

&lt;p&gt;The dual-path design isn't more complex. It's just &lt;em&gt;honest&lt;/em&gt; about the constraints. Same emission library, same operation definition, two paths behind the emitter, each tuned for its job.&lt;/p&gt;

&lt;p&gt;If you're about to launch usage-based pricing and you're planning to compute invoices from your trace store, rethink it now. The sooner you split, the cheaper the split.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/nats-kafka-mqtt-same-category-different-jobs/" rel="noopener noreferrer"&gt;NATS vs Kafka vs MQTT: Same Category, Very Different Jobs&lt;/a&gt; — why the durability choice on the billing path matters so much.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/rpc-vs-nats-who-owns-completion/" rel="noopener noreferrer"&gt;RPC vs NATS: It's Not About Sync vs Async — It's About Who Owns Completion&lt;/a&gt; — completion ownership applies to the emit path, too.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>observability</category>
      <category>billing</category>
      <category>costattribution</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>The 90% Problem: Why Most AI Agents Are Still Broken</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 23:02:07 +0000</pubDate>
      <link>https://forem.com/harrisonsec/the-90-problem-why-most-ai-agents-are-still-broken-3pd4</link>
      <guid>https://forem.com/harrisonsec/the-90-problem-why-most-ai-agents-are-still-broken-3pd4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmreka6ybk5ljcyqpu2sc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmreka6ybk5ljcyqpu2sc.jpg" alt="The 90% Problem" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Agent Works Great. Until It Doesn't.
&lt;/h2&gt;

&lt;p&gt;You built an AI agent over the weekend. It calls tools, remembers context, follows instructions. You demo it to your team. Everyone's impressed.&lt;/p&gt;

&lt;p&gt;Monday morning, a user types "rename Ember to Infernia." Your agent loops 15 times, burns through your API budget, and returns a response that doesn't contain the word "Infernia." A rename. One entity. One operation.&lt;/p&gt;

&lt;p&gt;I've been there. I ran an eval suite on a production agent — 5 test cases, 5 runs each. Pass rate: &lt;strong&gt;40%.&lt;/strong&gt; Not on hard tasks. On things like "update the right character out of six" and "rename one entity." The model was GPT-4 class. Plenty capable. The problem was everything &lt;em&gt;around&lt;/em&gt; the model.&lt;/p&gt;

&lt;p&gt;This is the 90% problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Building the core loop (perceive → reason → act):  10% of the work
Making it not break in production:                  90% of the work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It took me a while to see where the problem actually was. The real gap wasn't missing features. It was open loops: verification that doesn't retry, memory that doesn't decay, compression that doesn't circuit-break.&lt;/p&gt;

&lt;p&gt;Here's what I found — from analyzing Claude Code's leaked source, where a 1,729-line &lt;code&gt;query.ts&lt;/code&gt; file contains a 1,421-line &lt;code&gt;while(true)&lt;/code&gt; loop inside a roughly 512,000-line codebase, and from fixing a production agent's pass rate with code changes alone. No model upgrade. No prompt magic. Just engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Pillars — And Where Agents Actually Fail
&lt;/h2&gt;

&lt;p&gt;Every production agent needs five things. Most only build two of them well.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pillar&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Who Does the Work&lt;/th&gt;
&lt;th&gt;Most Agents' Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What the LLM sees&lt;/td&gt;
&lt;td&gt;Code orchestrates; LLM helps compress&lt;/td&gt;
&lt;td&gt;Dump everything, hope for the best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What persists across sessions&lt;/td&gt;
&lt;td&gt;Code orchestrates; LLM helps recall&lt;/td&gt;
&lt;td&gt;Basic store/retrieve, no lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent checks its own output&lt;/td&gt;
&lt;td&gt;Code triggers; LLM judges&lt;/td&gt;
&lt;td&gt;Not implemented or logs-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent thinks before acting&lt;/td&gt;
&lt;td&gt;LLM (decompose tasks, sequence steps)&lt;/td&gt;
&lt;td&gt;Decent — LLMs are good at this&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent interacts with the world&lt;/td&gt;
&lt;td&gt;LLM selects, Code executes&lt;/td&gt;
&lt;td&gt;Decent — most mature pillar&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm48fwvkvq4f2foh8qsrz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm48fwvkvq4f2foh8qsrz.jpg" alt="The Five Pillars of Agent Architecture" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Planning and Tool Use work reasonably well because &lt;strong&gt;they ride on model improvements&lt;/strong&gt;. GPT-3.5 struggled with tool calling; Claude Opus 4.6 is reliable. You get these improvements for free with model upgrades.&lt;/p&gt;

&lt;p&gt;Context and Memory are where agents fail because &lt;strong&gt;they're engineering problems, not model problems&lt;/strong&gt;. Reflection sits in the middle: the LLM can judge quality, but code still has to trigger that check, route the result, and do something with it. No model upgrade will fix a context pipeline that dumps 10 irrelevant entities into the prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM vs Code Divide
&lt;/h2&gt;

&lt;p&gt;This is the most important insight for anyone building agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HIGH LLM dependence (improves with better models):
  Planning        → LLM generates the plan
  Reflection      → LLM evaluates quality
  Tool Selection  → LLM picks the right tool

LOW LLM dependence (never improves from model upgrades):
  Context Management  → Code sorts, filters, compresses
  Memory Management   → Code stores, retrieves, scores, decays
  Error Handling      → Code classifies errors, retries, circuit-breaks
  Tool Execution      → Code runs tools, parallelizes, batches
  State Management    → Code tracks progress, checkpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnw3ua812lnk1q6bumjl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnw3ua812lnk1q6bumjl.jpg" alt="LLM vs Code Divide" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But low LLM dependence does not mean zero model calls. It means the failure mode is mostly in the orchestration. Even code-dominated pillars still use models in three very different ways:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdiiw89cv7gosi43o10s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdiiw89cv7gosi43o10s.jpg" alt="Three Ways Agents Call LLMs" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Direct LLM call (sideQuery)&lt;/strong&gt; — Code asks a narrow question, takes the answer, and moves on. Example: Claude Code's memory recall uses a single Sonnet side-query to choose 5 relevant memories from roughly 200 files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Forked sub-agent&lt;/strong&gt; — Code delegates a bounded task to a child agent with its own context, tools, and loop. Example: Claude Code's autocompact hands summarization to a child agent instead of forcing the main loop to do it inline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tool-use loop&lt;/strong&gt; — The LLM decides which tool to call, the program executes it, and the result flows back into the next turn. This is the main agent loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Simple question (which memories are relevant?)  → Direct call
Complex but bounded task (summarize this)       → Forked sub-agent
Open-ended execution (build this feature)       → Tool-use loop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This choice is not academic. It changes latency, token cost, and failure modes. In Claude Code's memory system, a side-query is cheap. A forked summarizer is much heavier. Using the wrong pattern wastes budget or hurts reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trap:&lt;/strong&gt; Teams chase model upgrades ("let's switch to Claude Opus") instead of fixing their context pipeline. Better models help — but in my experience, fixing the context pipeline delivers a larger improvement per dollar than upgrading the model.&lt;/p&gt;

&lt;p&gt;In one production system, fixing context management alone — without changing the model — moved quality from 40% to 60%. Seven out of eight fixes were pure code, zero LLM cost. The model was always capable. The context was holding it back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What 90% Actually Looks Like — From Claude Code's Source
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpfl91l7pkun6clj45qx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxpfl91l7pkun6clj45qx.jpg" alt="The Anatomy of Production: Claude Code" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's leaked source is roughly 512,000 lines. Here's the useful way to think about that split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query.ts orchestrator file:             1,729 lines    (~0.3%)
Core while(true) loop inside it:        1,421 lines
Everything else:                     ~510,000 lines    (~99.7%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That "everything else" is the 90%:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Management (3,960 lines in src/services/compact/):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5-level progressive compression pipeline&lt;/li&gt;
&lt;li&gt;Microcompact with dual code paths based on cache state&lt;/li&gt;
&lt;li&gt;Token estimation without API calls (&amp;lt;5% error)&lt;/li&gt;
&lt;li&gt;Post-compression recovery (restore last 5 files, skills, agent state)&lt;/li&gt;
&lt;li&gt;Circuit breaker: 3 consecutive failures → stop (after 250K API calls/day were wasted without it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory System (1,736 lines in src/memdir/):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-type closed taxonomy with structured frontmatter&lt;/li&gt;
&lt;li&gt;Sonnet side-query for semantic retrieval (250ms, async prefetch)&lt;/li&gt;
&lt;li&gt;Background extraction agent with mutual exclusion&lt;/li&gt;
&lt;li&gt;Trust verification (eval went 0/2 → 3/3 with this addition)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Error Handling (spread across entire codebase):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message normalization: fix orphan tool_use/tool_result pairs from crashes&lt;/li&gt;
&lt;li&gt;Prompt-Too-Long recovery: reactive compression as last resort&lt;/li&gt;
&lt;li&gt;Tool failure classification: timeout vs permission vs not-found&lt;/li&gt;
&lt;li&gt;Max output token escalation: 8K default → 64K on truncation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Permission System (multi-layer):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool-level risk classification&lt;/li&gt;
&lt;li&gt;User confirmation for dangerous operations&lt;/li&gt;
&lt;li&gt;Sandbox isolation for high-risk tools&lt;/li&gt;
&lt;li&gt;Context injection scanning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this is intellectually exciting. It's plumbing. But without it, the "exciting" part — the agent loop — crashes on every non-trivial conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenClaw and Hermes Surface the Same Pattern
&lt;/h2&gt;

&lt;p&gt;Two open-source agents worth watching right now — OpenClaw and Hermes Agent — illustrate the same architectural lesson.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context management: still basic; I don't see the kind of progressive compression Claude Code built&lt;/li&gt;
&lt;li&gt;Memory: Markdown + SQLite (more sophisticated than Claude Code's storage layer)&lt;/li&gt;
&lt;li&gt;Reflection: limited; I don't yet see a strong closed verification loop&lt;/li&gt;
&lt;li&gt;Security: public reports in early 2026 highlighted exposed instances and malicious marketplace skills; &lt;code&gt;openclaw security audit&lt;/code&gt; exists, but tools alone don't close the operational loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hermes Agent:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context management: still basic&lt;/li&gt;
&lt;li&gt;Memory: SQLite + full-text search + &lt;code&gt;MEMORY.md&lt;/code&gt; dual-layer&lt;/li&gt;
&lt;li&gt;Reflection: self-evolving skills generated from completed tasks&lt;/li&gt;
&lt;li&gt;Error handling: layered on paper, but still early&lt;/li&gt;
&lt;li&gt;Maturity: promising, but I haven't seen evidence yet that the self-iteration loop holds up at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both can complete tasks. My point is not that they don't work. It's that the hardest production loops — compression, failure accounting, verification retries, and memory hygiene — still look only partially closed. The features exist; &lt;strong&gt;the loops aren't closed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfz0xlyi4ev29uyb768f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfz0xlyi4ev29uyb768f.jpg" alt="Features Exist vs Loops Are Closed" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  "Features Exist" vs "Loops Are Closed"
&lt;/h2&gt;

&lt;p&gt;This is the most overlooked distinction in agent architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Open loop:   Build a verification step → log issues → done
Closed loop: Build a verification step → log issues → retry with feedback → verify again

Open loop:   Score memory relevance → store the score → done  
Closed loop: Score memory relevance → reinforce high-scoring memories → decay low-scoring → improve retrieval over time

Open loop:   Detect compression failure → log it → continue
Closed loop: Detect compression failure → count consecutive failures → circuit-break after 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In every open-source agent I've analyzed so far, some of the critical loops are still open. The infrastructure is there. The wiring is only partially connected.&lt;/p&gt;

&lt;p&gt;Here's the test: look at your agent's verification step. Does it log a failure and move on? That's an open loop. Does it log, retry with the failure as feedback, and verify again? That's closed. The difference is one &lt;code&gt;if&lt;/code&gt; statement and a retry call — but it's the difference between "we have quality checks" and "we actually catch errors before users see them."&lt;/p&gt;
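
&lt;p&gt;A minimal Go sketch of the closed version: &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; stand in for whatever model calls your agent makes, and the three-attempt cap is an arbitrary illustration, not anyone's production setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Closed loop: verify the output, and on failure retry with the verifier's
// feedback folded into the next attempt, instead of logging and moving on.
// generate and verify are stubs for your own model calls.
func runWithVerification(ctx context.Context, task string) (string, error) {
    feedback := ""
    for attempt := 1; attempt &amp;lt;= 3; attempt++ {
        out, err := generate(ctx, task, feedback) // LLM call (stub)
        if err != nil {
            return "", err
        }
        issues, err := verify(ctx, task, out) // model-as-judge or rule check (stub)
        if err != nil {
            return "", err
        }
        if len(issues) == 0 {
            return out, nil // loop closed: output passed verification
        }
        log.Printf("attempt %d failed verification: %v", attempt, issues)
        feedback = strings.Join(issues, "; ") // failures become input to the retry
    }
    return "", errors.New("verification failed after 3 attempts")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
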

&lt;p&gt;This is the hardest 10% of the 90%. Not building the infrastructure — connecting it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbhfohjvdiddbtk9g8lf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbhfohjvdiddbtk9g8lf.jpg" alt="The Proof: 40% to 60%" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Proof: 40% to 60% With Code Alone
&lt;/h2&gt;

&lt;p&gt;I ran A/B evals on a production agent. Same model, same test cases, different code. Result: &lt;strong&gt;40% → 60% pass rate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The breakdown: 8 fixes total. &lt;strong&gt;7 were pure code — zero LLM cost.&lt;/strong&gt; Context prioritization, structured error classification, round limits, conclusion preservation during truncation, circuit breakers. The only fix that used an LLM call was a pre-loop planning step at $0.003 per request.&lt;/p&gt;

&lt;p&gt;The model was always capable. The context was holding it back.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Full case study with per-test breakdown: How I Improved an AI Agent from 40% to 60%)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Builders
&lt;/h2&gt;

&lt;p&gt;If you're building an AI agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't start with the model.&lt;/strong&gt; Start with context management. Clean, prioritized, bounded input is the highest-leverage investment you can make.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Close your loops.&lt;/strong&gt; If you built a verification step, make it retry. If you built memory scoring, wire the reinforcement. Half-built infrastructure is worse than none — it gives false confidence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure before you upgrade.&lt;/strong&gt; Before switching to a more expensive model, run an eval suite on your current one. The bottleneck is probably context, not capability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Budget 90% of your time for the 90%.&lt;/strong&gt; The agent loop is a weekend project. Error handling, compression, memory lifecycle, permission systems — that's the real work. Plan accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model is a commodity. The engineering around it is the product.&lt;/p&gt;

&lt;p&gt;Ask yourself: what percentage of your agent's codebase is the core loop, and what percentage is everything else? If you don't know the answer, that's where to start.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Diagrams from this essay packaged as a single-file reference: &lt;a href="https://harrisonsec.com/downloads/Engineering_Reliable_Agents.pdf" rel="noopener noreferrer"&gt;Engineering Reliable Agents (PDF)&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Part of the AI Agent Architecture series.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Deep dives into the 90%:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://harrisonsec.com/blog/claude-code-context-engineering-compression-pipeline/" rel="noopener noreferrer"&gt;Claude Code Part 3: The 5-Level Compression Pipeline&lt;/a&gt; — how Anthropic solved context management&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://harrisonsec.com/blog/claude-code-memory-first-principles-tradeoffs/" rel="noopener noreferrer"&gt;Claude Code Part 4: Why Markdown Instead of Vector DBs&lt;/a&gt; — first-principles memory tradeoffs&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;40% to 60% With A/B Data — the full case study behind the numbers in this article&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>architecture</category>
      <category>claudecode</category>
      <category>openclaw</category>
    </item>
    <item>
      <title>Testing Real-World Go Backends Isn't What Many People Think</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 00:18:33 +0000</pubDate>
      <link>https://forem.com/harrisonsec/testing-real-world-go-backends-isnt-what-many-people-think-12nl</link>
      <guid>https://forem.com/harrisonsec/testing-real-world-go-backends-isnt-what-many-people-think-12nl</guid>
      <description>&lt;p&gt;I've reviewed enough Go backend test suites to notice a pattern. The services with the most unit tests are often the ones with the most production incidents. Not because unit tests cause incidents — because the teams writing unit tests and calling it a day weren't testing the things that actually broke.&lt;/p&gt;

&lt;p&gt;Production bugs in distributed Go backends don't usually look like "function computed wrong value." They look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The context deadline didn't propagate into the background goroutine, so under load it leaked."&lt;/li&gt;
&lt;li&gt;"Two services agreed on the happy path, but the error-shape contract diverged six months ago, and now one returns &lt;code&gt;status.Code(codes.Unavailable)&lt;/code&gt; where the other expects &lt;code&gt;codes.ResourceExhausted&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"The retry logic is race-y. With test-scale traffic it works; at 10x production it double-charges."&lt;/li&gt;
&lt;li&gt;"The database migration works on SQLite (our test DB) but not Postgres 15's stricter planner."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No unit test catches those. A different set of test shapes does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Stop framing tests as "unit vs integration." That's a level-of-isolation axis, and it's the least interesting one. The axes that matter for production Go: deterministic behavior (controlled clocks, seeded randomness), concurrency correctness (race detector, stress tests), contract fidelity (shared schemas, real downstreams), and environment fidelity (real DBs, real networks). Design your test suite around those; coverage follows.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Wrong Taxonomy
&lt;/h2&gt;

&lt;p&gt;"Unit tests test one function. Integration tests test several. E2E tests test the whole system."&lt;/p&gt;

&lt;p&gt;That framing is a starting point for junior engineers. It stops being useful the moment you're debugging why your Go service silently dropped a message in production. The level of isolation isn't the interesting axis. What is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic vs non-deterministic behavior.&lt;/strong&gt; Do the same inputs produce the same outputs every time?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency correctness.&lt;/strong&gt; Do the race conditions stay caught?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract fidelity.&lt;/strong&gt; Do your assumptions about downstreams match what they actually do?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment fidelity.&lt;/strong&gt; Does your test environment reproduce the production runtime closely enough to catch real bugs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A test can be "unit" on the isolation axis but score on two or three of these. A test can be "integration" and miss all four.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic Behavior: The One Thing Every Test Should Have
&lt;/h2&gt;

&lt;p&gt;If you can't run your test a thousand times and get the same result, you have a flaky test, and flaky tests are worse than no tests — they train the team to ignore failures.&lt;/p&gt;

&lt;p&gt;The three sources of non-determinism in Go test suites, in order of prevalence:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Time
&lt;/h3&gt;

&lt;p&gt;Any test that calls &lt;code&gt;time.Now()&lt;/code&gt;, &lt;code&gt;time.After()&lt;/code&gt;, &lt;code&gt;time.Sleep()&lt;/code&gt;, or depends on wall-clock intervals is a landmine. It works on the developer's laptop and fails in a slow CI runner where GC decided to kick in.&lt;/p&gt;

&lt;p&gt;Fix: inject a clock. A minimal clock interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Clock&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
    &lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;realClock&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;realClock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, &lt;code&gt;realClock&lt;/code&gt;. In tests, a &lt;code&gt;FakeClock&lt;/code&gt; that advances manually. Libraries like &lt;code&gt;github.com/benbjohnson/clock&lt;/code&gt; give you this for free.&lt;/p&gt;

&lt;p&gt;Payoff: a test that verifies "retries happen every 500ms for 3 attempts" becomes deterministic — advance the fake clock 500ms, observe a retry, advance another 500ms, observe again. No sleeping in the test.&lt;/p&gt;
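
&lt;p&gt;If you'd rather not pull in a dependency, a fake clock for the interface above fits in a few dozen lines. This is a sketch, not the &lt;code&gt;benbjohnson/clock&lt;/code&gt; API: it covers only what the retry example needs, and &lt;code&gt;Sleep&lt;/code&gt; blocks until some other goroutine calls &lt;code&gt;Advance&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// A minimal FakeClock satisfying the Clock interface above. Advance moves
// time forward and fires any After timers that have come due. Sketch only:
// real fake-clock libraries handle Sleep and concurrent waiters more carefully.
type FakeClock struct {
    mu     sync.Mutex
    now    time.Time
    timers []fakeTimer
}

type fakeTimer struct {
    at time.Time
    ch chan time.Time
}

func (f *FakeClock) Now() time.Time {
    f.mu.Lock()
    defer f.mu.Unlock()
    return f.now
}

// Sleep blocks until Advance has moved the clock past d.
func (f *FakeClock) Sleep(d time.Duration) { &amp;lt;-f.After(d) }

func (f *FakeClock) After(d time.Duration) &amp;lt;-chan time.Time {
    f.mu.Lock()
    defer f.mu.Unlock()
    ch := make(chan time.Time, 1) // buffered so Advance never blocks on send
    f.timers = append(f.timers, fakeTimer{at: f.now.Add(d), ch: ch})
    return ch
}

// Advance moves the fake clock forward and fires every timer that is now due.
func (f *FakeClock) Advance(d time.Duration) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.now = f.now.Add(d)
    remaining := f.timers[:0]
    for _, tm := range f.timers {
        if !tm.at.After(f.now) {
            tm.ch &amp;lt;- f.now
        } else {
            remaining = append(remaining, tm)
        }
    }
    f.timers = remaining
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
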

&lt;h3&gt;
  
  
  2. Randomness
&lt;/h3&gt;

&lt;p&gt;Anything that shuffles, samples, picks a random ID, or generates random test data needs a seeded random source. &lt;code&gt;math/rand.Intn&lt;/code&gt; with the default source is process-global shared state; two tests running in parallel can interfere with each other's sequences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Service&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rng&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In tests, pass a known seed. In production, &lt;code&gt;rand.NewSource(time.Now().UnixNano())&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Concurrency ordering
&lt;/h3&gt;

&lt;p&gt;The nasty one. A test that creates goroutines and checks a result has to either (a) synchronize on a deterministic completion signal (a channel, a &lt;code&gt;WaitGroup&lt;/code&gt;) or (b) poll with a timeout — which is back to non-determinism.&lt;/p&gt;

&lt;p&gt;The best habit: design for deterministic completion. If you're testing "five goroutines should all complete and total the result," use &lt;code&gt;sync.WaitGroup.Wait()&lt;/code&gt; or close a channel. Don't sleep. Don't poll.&lt;/p&gt;
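
&lt;p&gt;A minimal example of that habit, with the worker shape kept deliberately trivial: the test blocks on the &lt;code&gt;WaitGroup&lt;/code&gt;, so it finishes for the same reason on every run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Deterministic completion: the test waits on the WaitGroup, not a sleep,
// so it passes or fails for the same reason on every run.
func TestWorkersCompleteAndTotal(t *testing.T) {
    var (
        wg    sync.WaitGroup
        mu    sync.Mutex
        total int
    )
    for i := 1; i &amp;lt;= 5; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            mu.Lock()
            total += n
            mu.Unlock()
        }(i)
    }
    wg.Wait() // completion signal; no time.Sleep anywhere

    if total != 15 {
        t.Fatalf("total = %d, want 15", total)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
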

&lt;h2&gt;
  
  
  Concurrency Correctness: The Race Detector Is Not Optional
&lt;/h2&gt;

&lt;p&gt;Go ships with a race detector. Running &lt;code&gt;go test -race&lt;/code&gt; is one flag and it catches an entire category of bugs that will otherwise show up as "works on my machine." In my experience, any production Go service will, on first &lt;code&gt;-race&lt;/code&gt; run, surface at least one real data race that had been silently ignored.&lt;/p&gt;

&lt;p&gt;The race detector adds ~5-10x runtime overhead, so people skip it on every-save tests. Fine. Run it in CI. Run it on nightly integration tests. Run it on anything touching shared state. Some configurations I've seen work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every PR&lt;/strong&gt;: run unit tests with &lt;code&gt;-race&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nightly&lt;/strong&gt;: run full integration suite with &lt;code&gt;-race&lt;/code&gt; and a longer timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-release&lt;/strong&gt;: run stress tests with &lt;code&gt;-race&lt;/code&gt; against a production-sized dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost of running with &lt;code&gt;-race&lt;/code&gt; is engineering discipline. The payoff is not debugging a data race at 2 AM.&lt;/p&gt;
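
&lt;p&gt;In command form, one way to wire those tiers (the &lt;code&gt;-short&lt;/code&gt; and build-tag split is a convention, not the only one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# every PR: unit tests under the race detector
go test -race -short ./...

# nightly: full integration suite, longer timeout
go test -race -tags integration -timeout 30m ./...

# pre-release: stress tests only, repeated
go test -race -run Stress -count 5 ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
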

&lt;p&gt;Beyond the race detector, &lt;strong&gt;stress tests&lt;/strong&gt; are undervalued. A test that runs your concurrent path 1,000 times with different goroutine interleavings catches bugs that a single-iteration test never will.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestConcurrentWorkers_Stress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Short&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"stress test"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"iter%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="c"&gt;// ... actual test body ...&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;t.Parallel()&lt;/code&gt; + 1,000 iterations + &lt;code&gt;-race&lt;/code&gt; finds race conditions that a single deterministic run happily misses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contract Fidelity: The Bug Class Everyone Misses
&lt;/h2&gt;

&lt;p&gt;Say your service calls a downstream gRPC service for payments. You write a mock that returns a successful response. Your tests pass. The downstream team changes their error code vocabulary. Your service now misinterprets their new error. Production finds out first.&lt;/p&gt;

&lt;p&gt;Contract testing addresses this. Two approaches work in practice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Shared schema, shared types
&lt;/h3&gt;

&lt;p&gt;If the downstream service publishes a protobuf file (they should), your service imports it directly. Your tests use types generated from the real contract. If the downstream changes the proto in a way that breaks your usage, your next build fails — loudly, at compile time.&lt;/p&gt;

&lt;p&gt;This is the simplest and often best answer for Go services with gRPC downstreams. The contract is literally the shared protobuf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer-driven contract tests
&lt;/h3&gt;

&lt;p&gt;Each consumer writes tests that capture its expectations of the downstream. Those tests run against the real downstream (or through a contract-testing framework like Pact). When the downstream changes, the contract tests catch it before the contract as written and the downstream's actual behavior diverge.&lt;/p&gt;

&lt;p&gt;This helps for REST APIs where there's no single source of truth schema. It's more ceremony. For most gRPC Go services, shared protobufs cover it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "mock everything" antipattern
&lt;/h3&gt;

&lt;p&gt;If your test suite consists of mocks that return whatever your test needs, you're not testing integration. You're testing that your code calls your mocks correctly. That's a tautology. Real integration bugs live in the gap between your mock's behavior and the downstream's actual behavior.&lt;/p&gt;

&lt;p&gt;Have at least one test per integration point that hits the real downstream — either in a staging environment or via Testcontainers. Keep the mocks for fast feedback, but don't pretend they're the only tests you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Fidelity: Use Real Infra Where It Matters
&lt;/h2&gt;

&lt;p&gt;The sharpest line in my test taxonomy is between "close to production runtime" and "not close."&lt;/p&gt;

&lt;p&gt;Things that matter and are worth running on real infrastructure in tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databases.&lt;/strong&gt; SQLite is not Postgres is not MySQL. Query planner, isolation levels, and error shapes differ. Test with the DB you ship with.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message brokers.&lt;/strong&gt; Kafka's ordering and offset semantics cannot be faked well. Use a real Kafka (or Redpanda) in tests that exercise ordering or replay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caches.&lt;/strong&gt; Redis has specific failover and eviction semantics. A fake in-memory map doesn't reproduce them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-sensitive downstream APIs.&lt;/strong&gt; Anything with rate limits or TTLs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things that rarely matter and are fine with fakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Object storage.&lt;/strong&gt; A local file-system backend usually reproduces S3 well enough.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics / tracing exporters.&lt;/strong&gt; Tests don't need a real Prometheus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email / SMS.&lt;/strong&gt; A mock recording calls is plenty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: &lt;strong&gt;test with real infra for anything where semantic difference is possible&lt;/strong&gt;. Testcontainers (&lt;code&gt;github.com/testcontainers/testcontainers-go&lt;/code&gt;) makes this painless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;setupPostgres&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RunContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;testcontainers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"postgres:15-alpine"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testdb"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithUsername&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testuser"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"testpass"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Cleanup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Terminate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;dsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sslmode=disable"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dsn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slow? Yes — each container takes a few seconds to start. But you can run them once per test package with a &lt;code&gt;TestMain&lt;/code&gt;, and the bugs they catch are the ones most worth catching.&lt;/p&gt;
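
&lt;p&gt;One way to amortize that startup cost is a package-level &lt;code&gt;TestMain&lt;/code&gt; that starts the container once and shares the DSN. A minimal sketch, reusing the same setup calls as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// One container per test package: TestMain starts Postgres once, shares the
// DSN through a package-level variable, and tears the container down at the
// end. Sketch only; error handling kept minimal.
var testDSN string

func TestMain(m *testing.M) {
    ctx := context.Background()
    c, err := postgres.RunContainer(ctx,
        testcontainers.WithImage("postgres:15-alpine"),
        postgres.WithDatabase("testdb"),
        postgres.WithUsername("testuser"),
        postgres.WithPassword("testpass"),
    )
    if err != nil {
        log.Fatalf("start postgres: %v", err)
    }

    dsn, err := c.ConnectionString(ctx, "sslmode=disable")
    if err != nil {
        log.Fatalf("connection string: %v", err)
    }
    testDSN = dsn // individual tests open their own connections from this

    code := m.Run()
    _ = c.Terminate(ctx)
    os.Exit(code)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
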

&lt;h2&gt;
  
  
  A Real Taxonomy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBGYXN0WyJSdW4gb24gZXZlcnkgc2F2ZSJdCiAgICAgICAgVDFbIkZhc3QgdGVzdHM8YnIvPnB1cmUgZnVuY3Rpb25zIMK3IGFsZ29yaXRobXMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUFJbIlJ1biBvbiBldmVyeSBQUiJdCiAgICAgICAgVDJbIkNvbmN1cnJlbmN5IHRlc3RzPGJyLz4tcmFjZSDCtyBzdHJlc3MiXQogICAgICAgIFQzWyJEZXRlcm1pbmlzdGljIGludGVncmF0aW9uPGJyLz5mYWtlIGNsb2NrIMK3IGZha2UgZG93bnN0cmVhbSJdCiAgICAgICAgVDRbIlJlYWwtaW5mcmEgaW50ZWdyYXRpb248YnIvPlRlc3Rjb250YWluZXJzIFBvc3RncmVzIC8gUmVkaXMgLyBLYWZrYSJdCiAgICAgICAgVDVbIkNvbnRyYWN0IHRlc3RzPGJyLz5zaGFyZWQgc2NoZW1hcyDCtyBwcm90byB2ZXJzaW9ucyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBOaWdodGx5WyJSdW4gb24gc2NoZWR1bGUiXQogICAgICAgIFQ2WyJTdHJlc3MgdGVzdHM8YnIvPjEwMDAtaXRlciAtcmFjZSJdCiAgICAgICAgVDdbIkVuZC10by1lbmQ8YnIvPnJlYWwgc2VydmljZXMgwrcgc3RhZ2luZyJdCiAgICBlbmQKCiAgICBGYXN0IC0tPiBQUiAtLT4gTmlnaHRseQoKICAgIGNsYXNzRGVmIGZhc3QgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzc0RlZiBwciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIG5pZ2h0bHkgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzcyBGYXN0IGZhc3QKICAgIGNsYXNzIFBSIHByCiAgICBjbGFzcyBOaWdodGx5IG5pZ2h0bHk%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBGYXN0WyJSdW4gb24gZXZlcnkgc2F2ZSJdCiAgICAgICAgVDFbIkZhc3QgdGVzdHM8YnIvPnB1cmUgZnVuY3Rpb25zIMK3IGFsZ29yaXRobXMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggUFJbIlJ1biBvbiBldmVyeSBQUiJdCiAgICAgICAgVDJbIkNvbmN1cnJlbmN5IHRlc3RzPGJyLz4tcmFjZSDCtyBzdHJlc3MiXQogICAgICAgIFQzWyJEZXRlcm1pbmlzdGljIGludGVncmF0aW9uPGJyLz5mYWtlIGNsb2NrIMK3IGZha2UgZG93bnN0cmVhbSJdCiAgICAgICAgVDRbIlJlYWwtaW5mcmEgaW50ZWdyYXRpb248YnIvPlRlc3Rjb250YWluZXJzIFBvc3RncmVzIC8gUmVkaXMgLyBLYWZrYSJdCiAgICAgICAgVDVbIkNvbnRyYWN0IHRlc3RzPGJyLz5zaGFyZWQgc2NoZW1hcyDCtyBwcm90byB2ZXJzaW9ucyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBOaWdodGx5WyJSdW4gb24gc2NoZWR1bGUiXQogICAgICAgIFQ2WyJTdHJlc3MgdGVzdHM8YnIvPjEwMDAtaXRlciAtcmFjZSJdCiAgICAgICAgVDdbIkVuZC10by1lbmQ8YnIvPnJlYWwgc2VydmljZXMgwrcgc3RhZ2luZyJdCiAgICBlbmQKCiAgICBGYXN0IC0tPiBQUiAtLT4gTmlnaHRseQoKICAgIGNsYXNzRGVmIGZhc3QgZmlsbDojZjBmZmY0LHN0cm9rZTojMmY4NTVhCiAgICBjbGFzc0RlZiBwciBmaWxsOiNlOGY0Zjgsc3Ryb2tlOiMyYzUyODIKICAgIGNsYXNzRGVmIG5pZ2h0bHkgZmlsbDojZmVmNWU3LHN0cm9rZTojYjc3OTFmCiAgICBjbGFzcyBGYXN0IGZhc3QKICAgIGNsYXNzIFBSIHByCiAgICBjbGFzcyBOaWdodGx5IG5pZ2h0bHk%3D" alt="T1[" width="" height=""&gt;&lt;/a&gt;pure functions · algorithms"]"/&amp;gt;&lt;/p&gt;

&lt;p&gt;Here's the taxonomy I actually use when designing a test suite for a Go backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast tests&lt;/strong&gt; (seconds for the whole file): pure functions, algorithms, small state machines. Run on every save.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency tests&lt;/strong&gt; (seconds to a minute): anything with goroutines. Run with &lt;code&gt;-race&lt;/code&gt;. Run in PR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic integration tests&lt;/strong&gt; (single-digit seconds per test): one module + fakes + fake clock. Fast enough to keep in the main test run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-infra integration tests&lt;/strong&gt; (seconds per test): one module + real DB / Kafka / Redis via Testcontainers. Run in PR, longer timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract tests&lt;/strong&gt; (milliseconds): verify shared schemas with downstreams. Run on every schema change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stress tests&lt;/strong&gt; (minutes): high-iteration, high-concurrency, with &lt;code&gt;-race&lt;/code&gt;. Run nightly or on schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end tests&lt;/strong&gt; (minutes): real services, real network, against a staging environment. Run pre-release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What you'll notice: "unit" and "integration" don't appear as categories. That's on purpose. The level of isolation is an implementation detail. The purpose of the test is the taxonomy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small Habits That Pay Off
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;t.Cleanup&lt;/code&gt; over &lt;code&gt;defer&lt;/code&gt;.&lt;/strong&gt; Cleanups run in LIFO order, can be registered from helpers as well as the test body, and still run after parallel subtests finish, which a plain &lt;code&gt;defer&lt;/code&gt; in the test function can't guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer table-driven tests.&lt;/strong&gt; Twenty tests as rows in a slice beats twenty nearly-identical test functions (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail tests with &lt;code&gt;t.Fatalf&lt;/code&gt;, not &lt;code&gt;t.Errorf&lt;/code&gt;, for setup failures.&lt;/strong&gt; A broken setup should abort the test immediately; a failed assertion can use &lt;code&gt;t.Errorf&lt;/code&gt; so the test keeps running and collects more failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden files for complex outputs.&lt;/strong&gt; If you're verifying a generated SQL query, a serialized event, or a JSON response, a golden file comparison is more readable than a long string literal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate &lt;code&gt;_test.go&lt;/code&gt; files for slow tests with a build tag.&lt;/strong&gt; &lt;code&gt;//go:build integration&lt;/code&gt; lets you opt in explicitly with &lt;code&gt;go test -tags integration&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
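
&lt;p&gt;For the table-driven habit, a minimal sketch; &lt;code&gt;parseTimeout&lt;/code&gt; is a hypothetical function under test, not anything from the standard library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// A table-driven test: each case is a row, and the same body runs once per
// row as a named subtest.
func TestParseTimeout(t *testing.T) {
    cases := []struct {
        name    string
        in      string
        want    time.Duration
        wantErr bool
    }{
        {name: "plain seconds", in: "30s", want: 30 * time.Second},
        {name: "minutes", in: "2m", want: 2 * time.Minute},
        {name: "empty input", in: "", wantErr: true},
        {name: "garbage", in: "soon", wantErr: true},
    }

    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            got, err := parseTimeout(tc.in)
            if tc.wantErr {
                if err == nil {
                    t.Fatalf("parseTimeout(%q): expected error, got %v", tc.in, got)
                }
                return
            }
            if err != nil {
                t.Fatalf("parseTimeout(%q): unexpected error: %v", tc.in, err)
            }
            if got != tc.want {
                t.Fatalf("parseTimeout(%q) = %v, want %v", tc.in, got, tc.want)
            }
        })
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
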

&lt;h2&gt;
  
  
  The Shift That Changed My Testing
&lt;/h2&gt;

&lt;p&gt;Coverage numbers lie. The question is not "what percent of lines are executed by tests" — it's "what percent of the risky behaviors are covered by tests that will actually fail when those behaviors break."&lt;/p&gt;

&lt;p&gt;A codebase with 95% line coverage and zero race tests, zero real-DB tests, and mock-heavy integration tests is brittle. A codebase with 60% line coverage, &lt;code&gt;go test -race&lt;/code&gt; in CI, Testcontainers for the DB, and a stress test for every hot concurrent path is not.&lt;/p&gt;

&lt;p&gt;The single biggest shift I recommend: &lt;strong&gt;stop thinking about tests in terms of isolation level, and start thinking about them in terms of the production failure modes you're actually afraid of&lt;/strong&gt;. Map each failure mode to a test shape. If you don't have a test shape for a failure mode, you don't really have that failure mode covered — you just hope it doesn't happen.&lt;/p&gt;

&lt;p&gt;Production has opinions about what you hope.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — the concurrency patterns that make production-shape Go possible.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-context-distributed-systems-production/" rel="noopener noreferrer"&gt;Go Context in Distributed Systems: What Actually Works in Production&lt;/a&gt; — the single most common test gap in Go services I review.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/fail-fast-bounded-resilience-distributed-systems/" rel="noopener noreferrer"&gt;Why Your "Fail-Fast" Strategy is Killing Your Distributed System&lt;/a&gt; — a production failure mode that's hard to test unless you design the test for it.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>testing</category>
      <category>backendengineering</category>
    </item>
    <item>
      <title>Scale-Up vs Scale-Out: Why Every Language Wins Somewhere</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Sat, 18 Apr 2026 00:18:32 +0000</pubDate>
      <link>https://forem.com/harrisonsec/scale-up-vs-scale-out-why-every-language-wins-somewhere-3k6l</link>
      <guid>https://forem.com/harrisonsec/scale-up-vs-scale-out-why-every-language-wins-somewhere-3k6l</guid>
      <description>&lt;p&gt;I worked with a team that rewrote a critical service from Go to Rust because "performance." Six months later, the service was 30% faster, the team was miserable, and feature velocity had dropped to a crawl. Meanwhile the competitor team, still on Go, had shipped four new features.&lt;/p&gt;

&lt;p&gt;We did the postmortem eventually. The service handled maybe 2,000 requests per second on a 4-core machine. CPU utilization sat around 20%. Rust's extra speed bought us exactly nothing — the bottleneck was downstream database latency. What it cost us was every feature we didn't ship while writing unsafe, fighting the borrow checker, and nursing the team through the learning curve.&lt;/p&gt;

&lt;p&gt;That incident taught me the question I wish I'd learned earlier: &lt;strong&gt;what are you actually scaling, and does the language buy you the right kind of scale?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Language benchmarks optimize for one axis: per-request performance. Real systems have multiple axes — throughput, latency, concurrency, developer velocity, operational complexity, memory efficiency. Rust, Go, Java, Python aren't competing to be "fastest." They're different answers to different bets about what you're going to scale. Pick by fit, not by leaderboard.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Two Kinds of Scale
&lt;/h2&gt;

&lt;p&gt;At the top level, two strategies dominate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale-up&lt;/strong&gt;: make one machine do more. Vertical scaling. Faster CPUs, more RAM, specialized hardware, lower per-operation cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-out&lt;/strong&gt;: add more machines. Horizontal scaling. Cheaper commodity hardware, more concurrency, lots of work running in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't just infrastructure decisions. They're reflected in the language and ecosystem you pick. A language optimized for scale-up (Rust, C++) has different priorities than one optimized for scale-out (Go, Elixir) or one optimized for neither but for developer leverage (Python, Ruby).&lt;/p&gt;

&lt;p&gt;The big confusion comes from mixing axes. "Rust is faster than Go" is true on per-op microbenchmarks and irrelevant if your workload is I/O-bound service-to-service traffic. "Python is slow" is true in a compute-bound loop and irrelevant for a 500-QPS API that spends 95% of its time waiting on PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Each Language Actually Wins
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcXVhZHJhbnRDaGFydAogICAgdGl0bGUgTGFuZ3VhZ2UgZml0IGJ5IHdoYXQgeW91J3JlIHNjYWxpbmcKICAgIHgtYXhpcyBTY2FsZS1vdXQgKG1hbnkgbWFjaGluZXMgLyBjaGVhcCBjb25jdXJyZW5jeSkgLS0-IFNjYWxlLXVwIChvbmUgbWFjaGluZSwgcHVzaGVkIGhhcmQpCiAgICB5LWF4aXMgUHJvdG90eXBlIHZlbG9jaXR5IC0tPiBQcm9kdWN0aW9uIHJpZ29yCiAgICBxdWFkcmFudC0xICJTY2FsZS11cCArIHJpZ29yPGJyLz4oUnVzdCDCtyBDKysgwrcgWmlnKSIKICAgIHF1YWRyYW50LTIgIlNjYWxlLW91dCArIHJpZ29yPGJyLz4oR28gwrcgSmF2YS9Lb3RsaW4pIgogICAgcXVhZHJhbnQtMyAiU2NhbGUtb3V0ICsgdmVsb2NpdHk8YnIvPihQeXRob24gwrcgUnVieSDCtyBOb2RlKSIKICAgIHF1YWRyYW50LTQgIlNjYWxlLXVwICsgdmVsb2NpdHk8YnIvPihuYXJyb3cgbmljaGUpIgogICAgUnVzdDogWzAuODUsIDAuODVdCiAgICAiQysrIjogWzAuOTIsIDAuODhdCiAgICBHbzogWzAuMjUsIDAuNzVdCiAgICAiSmF2YS9Lb3RsaW4iOiBbMC4zMCwgMC44MF0KICAgIFB5dGhvbjogWzAuMjUsIDAuMjVdCiAgICBSdWJ5OiBbMC4yNSwgMC4zMF0KICAgIE5vZGU6IFswLjMwLCAwLjM1XQ%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FcXVhZHJhbnRDaGFydAogICAgdGl0bGUgTGFuZ3VhZ2UgZml0IGJ5IHdoYXQgeW91J3JlIHNjYWxpbmcKICAgIHgtYXhpcyBTY2FsZS1vdXQgKG1hbnkgbWFjaGluZXMgLyBjaGVhcCBjb25jdXJyZW5jeSkgLS0-IFNjYWxlLXVwIChvbmUgbWFjaGluZSwgcHVzaGVkIGhhcmQpCiAgICB5LWF4aXMgUHJvdG90eXBlIHZlbG9jaXR5IC0tPiBQcm9kdWN0aW9uIHJpZ29yCiAgICBxdWFkcmFudC0xICJTY2FsZS11cCArIHJpZ29yPGJyLz4oUnVzdCDCtyBDKysgwrcgWmlnKSIKICAgIHF1YWRyYW50LTIgIlNjYWxlLW91dCArIHJpZ29yPGJyLz4oR28gwrcgSmF2YS9Lb3RsaW4pIgogICAgcXVhZHJhbnQtMyAiU2NhbGUtb3V0ICsgdmVsb2NpdHk8YnIvPihQeXRob24gwrcgUnVieSDCtyBOb2RlKSIKICAgIHF1YWRyYW50LTQgIlNjYWxlLXVwICsgdmVsb2NpdHk8YnIvPihuYXJyb3cgbmljaGUpIgogICAgUnVzdDogWzAuODUsIDAuODVdCiAgICAiQysrIjogWzAuOTIsIDAuODhdCiAgICBHbzogWzAuMjUsIDAuNzVdCiAgICAiSmF2YS9Lb3RsaW4iOiBbMC4zMCwgMC44MF0KICAgIFB5dGhvbjogWzAuMjUsIDAuMjVdCiAgICBSdWJ5OiBbMC4yNSwgMC4zMF0KICAgIE5vZGU6IFswLjMwLCAwLjM1XQ%3D%3D" alt="title Language fit by what you're scaling" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rough positioning — not a benchmark, a fit map. The language you pick should live near the kind of scaling your system actually demands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust / C++ / Zig — Scale-up champions
&lt;/h3&gt;

&lt;p&gt;These languages dominate when &lt;strong&gt;per-machine throughput is the bottleneck&lt;/strong&gt; and you can afford the engineering cost. That's a narrower set of problems than Twitter would have you believe, but the problems that exist are real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-frequency trading engines — microseconds matter, GC pauses are unacceptable, every cache line counts.&lt;/li&gt;
&lt;li&gt;Inference engines — llama.cpp, vLLM, mistral.rs. Memory layout, SIMD, custom kernels.&lt;/li&gt;
&lt;li&gt;Databases and storage engines — ScyllaDB, TiKV, FoundationDB internals. State machines that live forever and must not leak.&lt;/li&gt;
&lt;li&gt;Network data planes — Cloudflare's Pingora, proxies at the edge.&lt;/li&gt;
&lt;li&gt;Game engines, audio/video encoding, embedded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern: &lt;strong&gt;one box, pushed hard, for years&lt;/strong&gt;. Memory safety matters because bugs compound over time. Performance matters because throughput per core is the product.&lt;/p&gt;

&lt;p&gt;The cost: every commit is slower. Refactoring is expensive. Onboarding is measured in months, not weeks. The compile times are what they are. You pay this cost every day the service exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  Go — Scale-out champion
&lt;/h3&gt;

&lt;p&gt;Go hits a specific sweet spot: &lt;strong&gt;cheap concurrency, predictable performance, fast-to-ship code, and easy to hire for&lt;/strong&gt;. It's a scale-out language.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thousands of goroutines per core, 2KB starting stacks, user-space context switching. The "cost of one more waiter" is nearly zero (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Standard library is enough for 80% of backend work — HTTP server, JSON, SQL, crypto.&lt;/li&gt;
&lt;li&gt;Compilation is fast enough to stay in flow. Iteration loop feels similar to a dynamic language.&lt;/li&gt;
&lt;li&gt;Minimalism is aggressive. One person can read the whole language in a weekend. New hires are productive in days.&lt;/li&gt;
&lt;/ul&gt;
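
&lt;p&gt;Roughly what that looks like in code (the handler and &lt;code&gt;callDatabase&lt;/code&gt; below are illustrative placeholders, not a real API): each request gets its own goroutine, and blocking on a slow downstream parks a few kilobytes of user-space stack instead of an OS thread.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch only: one goroutine per request. callDatabase stands in for any
// slow downstream dependency; the names are made up for illustration.
package main

import (
    "fmt"
    "net/http"
    "time"
)

func callDatabase() string {
    time.Sleep(50 * time.Millisecond) // this goroutine parks; the OS thread moves on
    return "rows"
}

func handler(w http.ResponseWriter, r *http.Request) {
    // net/http already runs each request in its own goroutine, so blocking
    // here is cheap: a small stack, not a kernel thread.
    rows := callDatabase()
    fmt.Fprintln(w, rows)
}

func main() {
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}
&lt;/code&gt;&lt;/pre&gt;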

&lt;p&gt;Where it loses: per-op performance. Go's GC is fine but not invisible. Zero-copy generic code is harder to write than in Rust. The type system doesn't prevent the entire class of bugs Rust's does.&lt;/p&gt;

&lt;p&gt;Go's bet: the problem you're most likely to have is "I need to handle 10x the concurrent work with 2x the code." Not "I need this loop to be 5% faster." For most backend services, that bet is right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Java / Kotlin — Mature scale-out with runtime depth
&lt;/h3&gt;

&lt;p&gt;The JVM is what you want when the workload is scale-out but you need runtime flexibility Go doesn't give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A mature JIT that optimizes hot paths beyond what AOT can.&lt;/li&gt;
&lt;li&gt;Rich profiling and monitoring (JFR/Flight Recorder, async-profiler) that make post-deploy tuning feasible.&lt;/li&gt;
&lt;li&gt;An ecosystem that, after 25 years, has a mature library for basically anything.&lt;/li&gt;
&lt;li&gt;Kotlin on top gives you modern syntax and coroutines without leaving the ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it loses: startup time, memory overhead, operational complexity (GC tuning is a real job), the occasional "it works on my JDK 11 but the prod JDK 17 changed something." Also: hiring is harder than it is for Go these days, at least in my corner of the industry.&lt;/p&gt;

&lt;p&gt;Java's bet: "you'll still be running this service in ten years, and you want to be able to tune its runtime when that day comes." For large enterprises with deep infrastructure, that bet pays off. For a startup shipping its first three services, the overhead is not worth it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python / Ruby — Developer-velocity champions
&lt;/h3&gt;

&lt;p&gt;The forgotten-but-dominant answer: languages that optimize neither scale-up nor scale-out, but &lt;strong&gt;scale-the-team&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast to write, fast to read, fast to debug.&lt;/li&gt;
&lt;li&gt;Massive libraries for data, ML, scripting, DSLs.&lt;/li&gt;
&lt;li&gt;Easy to onboard anyone — CS students, data scientists, analysts.&lt;/li&gt;
&lt;li&gt;Prototype-to-production path is shorter than anywhere else.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where they lose: per-core throughput, concurrency (the GIL is real), memory. Python and Ruby are not your language for a 100K QPS service.&lt;/p&gt;

&lt;p&gt;But a lot of real companies don't need a 100K QPS service. They need to get a thing working, put it in front of users, and iterate. If your current problem is "we need to ship the next feature this week," Python might be the right answer even if a Rust version would technically run faster.&lt;/p&gt;

&lt;p&gt;Python's bet: throughput isn't the constraint yet. Time-to-shipped-feature is. For most companies most of the time, that's correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Axes Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Beyond scale-up/scale-out, a few axes decide more projects than raw performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Developer-velocity per week
&lt;/h3&gt;

&lt;p&gt;"I can ship a feature and have it in production by Friday" beats "this service is 2x faster" most of the time. Measure it. If your current stack requires a two-day ceremony to deploy a one-line change, throughput is not your problem. Velocity is.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational complexity
&lt;/h3&gt;

&lt;p&gt;Scale-up is operationally cheaper than scale-out. One machine, one process, one log. Scale-out gives you better redundancy but also distributed-systems problems — consistency, ordering, partial failure, chaos engineering. If your team is three people, the operational complexity of a 20-node scale-out cluster may eat more time than the language choice saves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory efficiency per dollar
&lt;/h3&gt;

&lt;p&gt;At cloud scale, memory is expensive. A Rust service that fits in 2GB where a Java service needs 8GB is a 4x savings on every instance. Multiply by thousands of instances and "per-op performance" stops being the interesting number — per-GB cost starts to matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hiring pool
&lt;/h3&gt;

&lt;p&gt;The language with the deepest talent pool in your market is usually the right answer for a new system, all else equal. A marginal technical improvement isn't worth a six-month hiring pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning curve shape
&lt;/h3&gt;

&lt;p&gt;Some languages have shallow onboarding (Go, Python) and a long tail of depth. Others have steep onboarding (Rust, Haskell) and you're productive only after the ramp. For a senior team on a long-lived system, steep is fine. For a fast-moving team, steep is expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern I See Repeated
&lt;/h2&gt;

&lt;p&gt;A company starts small, picks Python or Ruby, builds the thing, ships to production. Ten employees. One codebase. Life is fast.&lt;/p&gt;

&lt;p&gt;They grow to fifty engineers. The monolith cracks. Some services get rewritten in Go for concurrency and operational simplicity. A few performance-critical ones get written in Rust. Data infra sits on the JVM (Kafka, Spark, Flink). A few internal tools stay in Python because the team knows it and it works.&lt;/p&gt;

&lt;p&gt;Five years in, the stack is polyglot. Nobody regrets it. What they regret is the six months they spent trying to make a single-language stack work past its comfort zone — the Python team pushing for "just async more things," or the Rust team fighting the borrow checker on code that could have been Go, or the Java team explaining to a new hire why the stack trace is 400 lines long.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;pick the language that fits the service, not the service that fits the language&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Ask the Question Now
&lt;/h2&gt;

&lt;p&gt;When someone proposes "let's build this new thing in X," I ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's the expected traffic profile, and what's the per-request work shape?&lt;/li&gt;
&lt;li&gt;Is this scale-up limited (per-machine throughput) or scale-out limited (concurrent work)?&lt;/li&gt;
&lt;li&gt;Who's going to write this, and how fast do we need them productive?&lt;/li&gt;
&lt;li&gt;Who's going to operate this, and what's their tooling comfort?&lt;/li&gt;
&lt;li&gt;Does this interact with an existing ecosystem (JVM data platform, Rust security infra)?&lt;/li&gt;
&lt;li&gt;How long does it have to live?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The answers to those six questions usually land me on one of three languages for 80% of the systems I see: Go, Rust, or (for data-adjacent work) Kotlin on the JVM. Python still shows up for tools and glue. Everything else is contextual.&lt;/p&gt;

&lt;p&gt;The benchmarks don't help. Per-op microbenchmarks answer questions nobody is actually asking. The right question is which axes matter for &lt;em&gt;this&lt;/em&gt; system, and which language's bet lines up with those axes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Argument I've Stopped Having
&lt;/h2&gt;

&lt;p&gt;I still see engineers argue about whether Rust or Go is "better." Both are good languages. Both are bad choices for problems they weren't designed for. The meaningful question is which kind of scale you're paying for — and the honest answer is almost always a mix, evolving over time.&lt;/p&gt;

&lt;p&gt;The Rust rewrite I opened with wasn't a bad decision because Rust is a bad language. It was a bad decision because we weren't scale-up limited. We were downstream-database limited. No language could help with that.&lt;/p&gt;

&lt;p&gt;Know which scale you're buying, and buy it on purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections: User-Space Context Switching, Explained&lt;/a&gt; — the design decision behind Go's scale-out bet.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — what you actually get with Go, and what you don't.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/nats-kafka-mqtt-same-category-different-jobs/" rel="noopener noreferrer"&gt;NATS vs Kafka vs MQTT: Same Category, Very Different Jobs&lt;/a&gt; — applying the same fit-vs-benchmark thinking to messaging.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programminglanguages</category>
      <category>systemdesign</category>
      <category>scale</category>
      <category>rust</category>
    </item>
    <item>
      <title>From Locks to Actors: The Four Pillars of Modern Concurrency</title>
      <dc:creator>Harrison Guo</dc:creator>
      <pubDate>Fri, 17 Apr 2026 05:50:27 +0000</pubDate>
      <link>https://forem.com/harrisonsec/from-locks-to-actors-the-four-pillars-of-modern-concurrency-3o50</link>
      <guid>https://forem.com/harrisonsec/from-locks-to-actors-the-four-pillars-of-modern-concurrency-3o50</guid>
      <description>&lt;p&gt;Most working engineers have spent ninety percent of their concurrent-programming life in one model: shared memory protected by locks. Threads that all see the same variables. Mutexes around the critical sections. Hope and care. It's the model every OS textbook teaches, every mainstream language supports, and every senior engineer has a horror story about.&lt;/p&gt;

&lt;p&gt;It's also not the only option. Or even the best one, for many of the problems it gets used for. Three other models — CSP, actors, and software transactional memory — have been around for decades, are mature enough for production, and each solves a class of problems that lock-based designs handle poorly.&lt;/p&gt;

&lt;p&gt;This is a map of all four, from a working backend engineer who uses each of them for different jobs, and a take on when each is the right answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBQMVsiMSDCtyBTaGFyZWQgTWVtb3J5ICsgTG9ja3MiXQogICAgICAgIE0xWyJUaHJlYWRzIHNoYXJlIGFkZHJlc3Mgc3BhY2UiXQogICAgICAgIE0yWyJNdXRleCDCtyBhdG9taWNzIMK3IGNvbmQgdmFyIl0KICAgICAgICBNM1siRGVhZGxvY2tzIMK3IHJhY2VzIMK3IGludmlzaWJsZSBidWdzIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFAyWyIyIMK3IENTUCDigJQgQ29tbXVuaWNhdGluZyBTZXF1ZW50aWFsIFByb2Nlc3NlcyJdCiAgICAgICAgQzFbIkdvcm91dGluZXMgKyBjaGFubmVscyJdCiAgICAgICAgQzJbIk93bmVyc2hpcCBtb3ZlcyB3aXRoIG1lc3NhZ2UiXQogICAgICAgIEMzWyJCYWNrcHJlc3N1cmUgYnVpbHQtaW4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggUDNbIjMgwrcgQWN0b3JzIl0KICAgICAgICBBMVsiTmFtZWQgZW50aXR5ICsgbWFpbGJveCJdCiAgICAgICAgQTJbIlByaXZhdGUgc3RhdGUgwrcgbm8gc2hhcmluZyJdCiAgICAgICAgQTNbIlN1cGVydmlzaW9uIMK3IGxldCBpdCBjcmFzaCJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBQNFsiNCDCtyBTb2Z0d2FyZSBUcmFuc2FjdGlvbmFsIE1lbW9yeSJdCiAgICAgICAgUzFbIk9wdGltaXN0aWMgdHJhbnNhY3Rpb25zIl0KICAgICAgICBTMlsiQ29tcG9zYWJsZSDCtyByZXRyeSBvbiBjb25mbGljdCJdCiAgICAgICAgUzNbIk5vIGxvY2tzLCBubyBkZWFkbG9ja3MiXQogICAgZW5kCgogICAgY2xhc3NEZWYgcGlsbGFyIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBQMSxQMixQMyxQNCBwaWxsYXI%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBQMVsiMSDCtyBTaGFyZWQgTWVtb3J5ICsgTG9ja3MiXQogICAgICAgIE0xWyJUaHJlYWRzIHNoYXJlIGFkZHJlc3Mgc3BhY2UiXQogICAgICAgIE0yWyJNdXRleCDCtyBhdG9taWNzIMK3IGNvbmQgdmFyIl0KICAgICAgICBNM1siRGVhZGxvY2tzIMK3IHJhY2VzIMK3IGludmlzaWJsZSBidWdzIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFAyWyIyIMK3IENTUCDigJQgQ29tbXVuaWNhdGluZyBTZXF1ZW50aWFsIFByb2Nlc3NlcyJdCiAgICAgICAgQzFbIkdvcm91dGluZXMgKyBjaGFubmVscyJdCiAgICAgICAgQzJbIk93bmVyc2hpcCBtb3ZlcyB3aXRoIG1lc3NhZ2UiXQogICAgICAgIEMzWyJCYWNrcHJlc3N1cmUgYnVpbHQtaW4iXQogICAgZW5kCgogICAgc3ViZ3JhcGggUDNbIjMgwrcgQWN0b3JzIl0KICAgICAgICBBMVsiTmFtZWQgZW50aXR5ICsgbWFpbGJveCJdCiAgICAgICAgQTJbIlByaXZhdGUgc3RhdGUgwrcgbm8gc2hhcmluZyJdCiAgICAgICAgQTNbIlN1cGVydmlzaW9uIMK3IGxldCBpdCBjcmFzaCJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBQNFsiNCDCtyBTb2Z0d2FyZSBUcmFuc2FjdGlvbmFsIE1lbW9yeSJdCiAgICAgICAgUzFbIk9wdGltaXN0aWMgdHJhbnNhY3Rpb25zIl0KICAgICAgICBTMlsiQ29tcG9zYWJsZSDCtyByZXRyeSBvbiBjb25mbGljdCJdCiAgICAgICAgUzNbIk5vIGxvY2tzLCBubyBkZWFkbG9ja3MiXQogICAgZW5kCgogICAgY2xhc3NEZWYgcGlsbGFyIGZpbGw6I2U4ZjRmOCxzdHJva2U6IzJjNTI4MixzdHJva2Utd2lkdGg6MnB4CiAgICBjbGFzcyBQMSxQMixQMyxQNCBwaWxsYXI%3D" alt="M1[" width="953" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; — Concurrency has four viable pillars: shared memory + locks (threads, mutexes), CSP (channels, Go), actors (mailboxes, Erlang), and STM (transactional memory, Clojure). None is universally better. Each solves a different problem and has a different failure mode. Senior designs often mix three of them in one system. Mutex-for-everything works until it doesn't — usually at exactly the scale you promised you'd never reach.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Pillar 1: Shared Memory + Locks
&lt;/h2&gt;

&lt;p&gt;The default. Threads, mutexes, atomics, condition variables. Every mainstream language has them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: multiple threads of execution share the same address space. They read and write the same data. Mutexes make sure only one thread touches a critical section at a time. Atomics do the same for single-word operations without a full lock.&lt;/p&gt;
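
&lt;p&gt;A minimal Go sketch of the two most common shapes: an atomic counter next to a mutex-guarded map. The names are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch: atomic for a single shared word, a mutex for a small piece of
// shared state with an invariant. Names are made up for illustration.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

type cache struct {
    mu   sync.Mutex        // guards data: one goroutine in the critical section at a time
    data map[string]string // shared state every goroutine can reach
}

func (c *cache) put(k, v string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[k] = v
}

func main() {
    var hits atomic.Int64 // single-word counter: no lock needed
    c := &amp;amp;cache{data: map[string]string{}}

    var wg sync.WaitGroup
    for i := 0; i &amp;lt; 8; i++ {
        wg.Add(1)
        go func(n int) {
            defer wg.Done()
            c.put(fmt.Sprintf("k%d", n), "v")
            hits.Add(1)
        }(i)
    }
    wg.Wait()
    fmt.Println(hits.Load()) // 8
}
&lt;/code&gt;&lt;/pre&gt;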

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple shared counters and caches.&lt;/strong&gt; &lt;code&gt;atomic.AddInt64&lt;/code&gt;, &lt;code&gt;sync.Map&lt;/code&gt;, LRU caches. The right tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight single-process coordination&lt;/strong&gt; where the code is small enough for one person to hold in their head.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance-critical paths&lt;/strong&gt; where the overhead of channel sends or actor dispatches is too much.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deadlocks.&lt;/strong&gt; Two threads acquire locks in opposite order. Happens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Priority inversion.&lt;/strong&gt; Low-priority thread holds the lock, high-priority thread waits, work piles up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock-ordering bugs at scale.&lt;/strong&gt; When N components each take M locks, the number of acquisition orders you have to reason about grows combinatorially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory-model weirdness.&lt;/strong&gt; What one thread writes, another may not immediately see. You start caring about happens-before, acquire/release semantics, and why &lt;code&gt;volatile&lt;/code&gt; in Java is not what you thought.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invisible races.&lt;/strong&gt; The worst kind. Tests pass; production fails weirdly twice a month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use mutexes for small, localized shared state. Once the shared state has three collaborators or more, or a nontrivial invariant across fields, reach for one of the other models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 2: CSP (Communicating Sequential Processes)
&lt;/h2&gt;

&lt;p&gt;Tony Hoare's 1978 paper, popularized by Occam and now Go. The model Rob Pike and Ken Thompson picked for Go's concurrency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: processes don't share memory; they send messages on named &lt;strong&gt;channels&lt;/strong&gt;. Senders and receivers rendezvous on the channel. Ownership of data moves with the message. "Do not communicate by sharing memory; share memory by communicating."&lt;/p&gt;
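
&lt;p&gt;A toy two-stage pipeline in Go makes the model concrete (stage names are made up): closing a channel ends the downstream loop, and an unbuffered send is built-in backpressure.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Toy pipeline: generate, then square, then print. Each stage is a goroutine,
// connected by channels; ownership of each value moves with the send.
package main

import "fmt"

func generate(nums ...int) &amp;lt;-chan int {
    out := make(chan int)
    go func() {
        for _, n := range nums {
            out &amp;lt;- n // blocks until the next stage receives: backpressure
        }
        close(out) // signals "no more values" to the receiver
    }()
    return out
}

func square(in &amp;lt;-chan int) &amp;lt;-chan int {
    out := make(chan int)
    go func() {
        for n := range in { // loop ends when generate closes its channel
            out &amp;lt;- n * n
        }
        close(out)
    }()
    return out
}

func main() {
    for v := range square(generate(2, 3, 4)) {
        fmt.Println(v) // 4, 9, 16
    }
}
&lt;/code&gt;&lt;/pre&gt;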

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pipelines.&lt;/strong&gt; Data flows through stages, each a goroutine, connected by channels. Clean to read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fan-out / fan-in.&lt;/strong&gt; One producer, many workers, one aggregator. The channel topology &lt;em&gt;is&lt;/em&gt; the architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure.&lt;/strong&gt; A bounded channel blocks the producer when full. No extra flow control needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cancellation coordination.&lt;/strong&gt; &lt;code&gt;select&lt;/code&gt; with &lt;code&gt;&amp;lt;-ctx.Done()&lt;/code&gt; is a clean primitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle control.&lt;/strong&gt; Closing a channel is a broadcast to every listener.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deadlocks remain possible.&lt;/strong&gt; Two goroutines each waiting on the other's channel. Cycles in the channel graph are lethal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory leaks via unclosed channels.&lt;/strong&gt; A goroutine blocked on a send that will never be received lives forever.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Awkward request/reply.&lt;/strong&gt; You end up passing a reply channel with each request, which works but feels verbose.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order isn't free.&lt;/strong&gt; Channel ordering is only per-channel. If you fan out and fan in, the aggregation is unordered unless you sort.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use CSP for coordination-heavy designs. When the structure of "who's alive, who sends to whom, when do things stop" is the architecture, channels make that visible in the code.&lt;/p&gt;

&lt;p&gt;Go is the obvious exemplar, but CSP-style is also available in Rust (&lt;code&gt;crossbeam-channel&lt;/code&gt;, &lt;code&gt;tokio::sync::mpsc&lt;/code&gt;), Kotlin (coroutines with channels), Python (&lt;code&gt;asyncio.Queue&lt;/code&gt;), and C# (&lt;code&gt;System.Threading.Channels&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 3: Actors
&lt;/h2&gt;

&lt;p&gt;Carl Hewitt's 1973 paper. Made practical by Erlang (1986) and later Akka (Scala/Java). The model behind WhatsApp, a decade of telecom, and most fault-tolerant messaging infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: an &lt;strong&gt;actor&lt;/strong&gt; is a named entity with private state and a mailbox. Other actors send messages to its address. Messages are processed one at a time from the mailbox. No shared memory. Parent actors supervise children; when a child crashes, the parent decides to restart, escalate, or ignore. Crashes are normal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fault isolation at scale.&lt;/strong&gt; One actor crashing is expected; it doesn't take down the system. Supervision hierarchies make "let it crash" a sensible engineering strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful services.&lt;/strong&gt; Each actor holds its own state. Conceptually clean: no shared global state, no locks around it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location transparency.&lt;/strong&gt; An actor can live in the same process, another process, or another machine. The sender doesn't know. This is where actors shine in distributed systems — the model scales across the network boundary natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Massive concurrency with stateful semantics.&lt;/strong&gt; Erlang routinely runs millions of actors per node. Each is cheap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mailbox unboundedness.&lt;/strong&gt; If a producer sends faster than the actor can process, the mailbox grows without bound. Bounded mailboxes exist; use them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message-ordering assumptions break across the network.&lt;/strong&gt; Within one node, delivery order is preserved per sender. Across nodes, all bets are off without explicit sequencing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing is harder.&lt;/strong&gt; Actors make their own state opaque; you test behavior through message exchange. Good frameworks help, but the habits needed are different from testing normal code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conceptual mismatch in CRUD-style backends.&lt;/strong&gt; If your business logic is "select some rows, transform them, insert result," actors are overkill. They shine on long-lived stateful entities (a game character, a connected device, a user session), not on stateless request handlers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Erlang and Elixir are the canonical runtimes. Akka brings actors to the JVM. Pony is a rare actor-first typed language. In Go, you can simulate actors with a goroutine + channel-as-mailbox pattern, but you lose Erlang's supervision and "let it crash" semantics unless you build them yourself.&lt;/p&gt;
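
&lt;p&gt;Roughly, that simulation looks like this (the &lt;code&gt;account&lt;/code&gt; actor and its message type are illustrative; note there is no supervision, so a panic just kills the goroutine):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Actor-ish sketch: one goroutine owns the state, a channel is the mailbox,
// and all mutation happens by sending it messages. Names are made up.
package main

import "fmt"

type deposit struct {
    amount int
    reply  chan int // carries the new balance back to the sender
}

// account is the "actor": private state, one message processed at a time.
func account(mailbox &amp;lt;-chan deposit) {
    balance := 0
    for msg := range mailbox {
        balance += msg.amount
        msg.reply &amp;lt;- balance
    }
}

func main() {
    mailbox := make(chan deposit, 16) // bounded mailbox: senders block when it fills
    go account(mailbox)

    reply := make(chan int)
    mailbox &amp;lt;- deposit{amount: 50, reply: reply}
    fmt.Println(&amp;lt;-reply) // 50
}
&lt;/code&gt;&lt;/pre&gt;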

&lt;p&gt;Use actors when you have &lt;strong&gt;long-lived stateful entities with fault requirements&lt;/strong&gt;. Telecom, messaging, multiplayer game servers, IoT device shadows, any system where "this particular entity has its own state machine, and we really care when it crashes" is the shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pillar 4: Software Transactional Memory (STM)
&lt;/h2&gt;

&lt;p&gt;Imagine database transactions, but for in-memory data. That's STM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;: critical sections are wrapped in transactions. The runtime tracks reads and writes optimistically. On commit, if any data touched was modified by another transaction, the current one rolls back and retries. No explicit locks. Composability — two transactions can be combined into a larger one without redesigning the locking order.&lt;/p&gt;
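
&lt;p&gt;Go has no STM, but the optimistic core of the idea (read a snapshot, compute, commit only if nothing changed, otherwise retry) can be sketched on a single word with compare-and-swap. Real STM generalizes this to whole read/write sets; the sketch below is only the shape of the loop.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Not STM, just the optimistic read/validate/retry loop it is built on,
// collapsed onto one atomic word for illustration.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func addOptimistically(v *atomic.Int64, delta int64) {
    for {
        old := v.Load() // read the "snapshot"
        if v.CompareAndSwap(old, old+delta) {
            return // commit succeeded: nobody else wrote in between
        }
        // conflict: another writer won the race, so retry with a fresh read
    }
}

func main() {
    var total atomic.Int64
    var wg sync.WaitGroup
    for i := 0; i &amp;lt; 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            addOptimistically(&amp;amp;total, 1)
        }()
    }
    wg.Wait()
    fmt.Println(total.Load()) // 100
}
&lt;/code&gt;&lt;/pre&gt;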

&lt;p&gt;&lt;strong&gt;Where it shines&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Composable concurrent code.&lt;/strong&gt; Combining operations that were individually correct usually stays correct under STM. Lock-based code famously does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-mostly workloads.&lt;/strong&gt; STM with multi-version concurrency control scales reads without blocking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding the lock-ordering bug class.&lt;/strong&gt; No locks, no deadlocks. The failure mode is retry storms, which are easier to reason about.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I/O inside transactions is awful.&lt;/strong&gt; Transactions may retry. If you did I/O, you may have done it multiple times. Either separate I/O from transactional state, or the runtime has to forbid I/O inside transactions (Haskell's STM monad does this at the type level).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry storms under contention.&lt;/strong&gt; Heavy write contention on the same data means constant retries. In the worst case, throughput can be worse than locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited language support.&lt;/strong&gt; Clojure (built-in), Haskell (&lt;code&gt;STM&lt;/code&gt;), Scala (&lt;code&gt;scala-stm&lt;/code&gt;), Rust (experimental &lt;code&gt;stm&lt;/code&gt; crates). Not a mainstream feature of Go/Java/C#.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clojure is the canonical "STM as a first-class citizen" language — its refs and transactions are idiomatic. Haskell's &lt;code&gt;STM&lt;/code&gt; monad is arguably the cleanest realization. In other ecosystems, STM exists as libraries but hasn't displaced mutexes.&lt;/p&gt;

&lt;p&gt;Use STM when the concurrent state is small-to-medium, the access pattern is read-heavy with occasional writes, and you want the composability. For the rare problems that fit, STM is strictly simpler to reason about than locks. For problems that don't fit (I/O-heavy, write-contention-heavy), STM is worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Real Systems Mix Them
&lt;/h2&gt;

&lt;p&gt;The surprise for engineers who've only used one model: &lt;strong&gt;mature systems mix three of them in one codebase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A typical backend service I'd build today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mutexes / atomics&lt;/strong&gt; for the inner loops — counters, caches, rate-limiter state, anything performance-critical with one clear owner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channels (CSP)&lt;/strong&gt; for coordination — worker pools, pipelines, cancellation, shutdown signaling, bounded queues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actors (in a sense)&lt;/strong&gt; for long-lived stateful entities — each connected client session, each in-flight request, each background job. In Go I'd model this as "one goroutine per entity, communicating via channels," which isn't formal actors but inherits the useful semantics: isolated state, message-passing, crash-isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And I wouldn't use STM in that stack. Not because it's bad, but because the language runtime doesn't make it first-class. If I were writing Clojure, STM would be a natural fit for the in-memory state machines that would otherwise be locked maps.&lt;/p&gt;

&lt;p&gt;The old "pick one concurrency model" debate was always a false choice. The real decision is per-problem: what shape is the concurrent work, what's the state-sharing pattern, and what failure semantics do I want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Guide
&lt;/h2&gt;

&lt;p&gt;Quick map:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;I have a counter that multiple goroutines read and update.&lt;/strong&gt; → atomic or mutex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a pipeline of work that flows through stages.&lt;/strong&gt; → channels (CSP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a fleet of long-lived sessions, each with its own state and lifetime.&lt;/strong&gt; → actor pattern (goroutine + mailbox channel, or real actor framework).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a fleet of connected devices each with a state machine that must survive crashes.&lt;/strong&gt; → actor framework with supervision (Erlang, Akka, or Go with explicit crash/restart logic).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have complex shared state with nontrivial invariants across fields, and updates are occasional but important to compose.&lt;/strong&gt; → STM if your language supports it; otherwise, lots of careful mutex discipline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have a request/response flow with fan-out to downstreams.&lt;/strong&gt; → CSP with &lt;code&gt;errgroup.WithContext&lt;/code&gt; (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I have no idea what I have.&lt;/strong&gt; → Start with mutexes, switch when it hurts. Don't over-engineer the first version.&lt;/li&gt;
&lt;/ul&gt;
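
&lt;p&gt;For the fan-out case, a minimal sketch with &lt;code&gt;errgroup.WithContext&lt;/code&gt; (the &lt;code&gt;fetch&lt;/code&gt; helper and URLs are placeholders): both downstream calls run concurrently, the first error cancels the shared context, and &lt;code&gt;Wait&lt;/code&gt; reports it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Sketch of fan-out to two downstreams with golang.org/x/sync/errgroup.
// fetch and the URLs are illustrative placeholders.
package main

import (
    "context"
    "fmt"
    "io"
    "net/http"

    "golang.org/x/sync/errgroup"
)

func fetch(ctx context.Context, url string) (string, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return "", err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

func main() {
    g, ctx := errgroup.WithContext(context.Background())

    var users, orders string
    g.Go(func() error {
        var err error
        users, err = fetch(ctx, "https://users.internal/api") // placeholder URL
        return err
    })
    g.Go(func() error {
        var err error
        orders, err = fetch(ctx, "https://orders.internal/api") // placeholder URL
        return err
    })

    // Wait returns the first non-nil error; the shared ctx is canceled on failure.
    if err := g.Wait(); err != nil {
        fmt.Println("fan-out failed:", err)
        return
    }
    fmt.Println(len(users), len(orders))
}
&lt;/code&gt;&lt;/pre&gt;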

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;Most people who get bitten by concurrency bugs got bitten because they used the wrong model, not because they used it wrong. A mutex-heavy design for a workload that's really a pipeline is fragile. A channels-for-everything design when there's a shared counter underneath ends up with awkward rendezvous. An actors-everywhere design when the business is CRUD requests reads like over-engineering.&lt;/p&gt;

&lt;p&gt;The four pillars aren't competing theories of concurrency. They're four tools, each good at specific jobs. Senior engineers know all four and reach for the right one. Junior engineers reach for the only one they know and force-fit it.&lt;/p&gt;

&lt;p&gt;If your career so far has been mostly mutexes, spend a weekend reading the other three. Write a toy pipeline in Go channels. Read Erlang's supervision documentation. Play with Clojure refs. The investment pays back every time you sit in a design review and someone proposes locking their way out of a structural problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-chan-context-structure-not-speed/" rel="noopener noreferrer"&gt;Go's Concurrency Is About Structure, Not Speed&lt;/a&gt; — CSP applied concretely in Go.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/go-millions-connections-user-space-context-switching/" rel="noopener noreferrer"&gt;Why Go Handles Millions of Connections&lt;/a&gt; — the runtime characteristics that make CSP cheap in Go.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://harrisonsec.com/blog/scale-up-scale-out-every-language-wins-somewhere/" rel="noopener noreferrer"&gt;Scale-Up vs Scale-Out: Why Every Language Wins Somewhere&lt;/a&gt; — the language-level view of the same question.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>concurrency</category>
      <category>systemdesign</category>
      <category>go</category>
      <category>erlang</category>
    </item>
  </channel>
</rss>
