Forem: Alex Cloudstar

Claude's June 15 Pricing Split: What Indie Devs Actually Need to Do Before the Meter Starts

Alex Cloudstar — Fri, 15 May 2026 18:33:21 +0000

The first time I noticed how much programmatic Claude usage was hiding inside my $20 Pro subscription was when I added a GitHub Action that ran claude -p on every pull request. The action did three things: summarised the diff, drafted a release note, and flagged any new env vars touched. It worked. I shipped it. I forgot about it.

Three weeks later I had pushed forty PRs and the action had run about a hundred and ten times after retries and reruns. I never saw a bill. Not because the work was free, but because Anthropic was quietly eating the cost. I was paying $20 a month and consuming what would have been, by my own back-of-envelope math, somewhere between $80 and $180 of API usage in that window. The subscription was doing the heavy lifting, and the subscription was a great deal.

That deal ends on June 15, 2026.

Anthropic announced on May 13 that starting on that date, Claude subscriptions split into two billing pools. Interactive use (claude.ai, the terminal Claude Code session you sit in front of, Cowork) stays exactly where it is. Programmatic use (Claude Agent SDK, claude -p, Claude Code GitHub Actions, every third-party harness that talks to your subscription) moves to a new monthly Agent SDK credit pool, denominated in dollars, billed at full API rates. The headline subscription prices do not move. Your $20 plan is still $20. But the bucket your scripts have been drinking from quietly stops being all-you-can-eat.

This post is the practical version of that announcement. What the new pools actually look like, the cost math nobody is publishing, who wins, who loses, and the concrete things to change in your CI and your scripts before the meter flips on.

What Actually Changes On June 15

The shape of the change is simple. Today, when you call Claude from a script, a CI job, or any tool that authenticates through your subscription, that call hits the same rate-limited subscription bucket as a chat in claude.ai. Heavy users have rate limits, light users do not, and nobody pays per token. After June 15, the same call hits a separate dollar-denominated pool that resets monthly and bills every token at standard API rates.

Anthropic's pricing page sketches the new pools roughly like this:

Plan	Subscription price	New monthly Agent SDK credit
Free	$0	none
Pro	$20	$20
Max 5x	$100	$100
Max 20x	$200	$200
Team Standard	$25 per seat	about $20 per seat
Team Premium	$125 per seat	about $100 per seat
Enterprise	custom	custom

The credit is per user. It does not pool across a team. It does not roll over. Once it is gone, your programmatic calls either stop, or fall through to pay-as-you-go API billing if you have explicitly opted into "extra usage" on the account.

The list of things that draw from the new pool is wider than most people realise. It covers Claude Agent SDK calls, every claude -p invocation, Claude Code running inside GitHub Actions, and third-party harnesses like OpenClaw, Conductor, Jean, Hermes, and Zed's ACP integration. What stays on the existing subscription rate limits is shorter: the claude.ai web, desktop, and mobile apps, the interactive Claude Code terminal session you launch by typing claude and waiting for the prompt, and Cowork.

The line is "is there a human reading the response in real time." If yes, subscription. If no, meter.

For the math that follows, the API rates are the same ones the Anthropic console shows: Sonnet 4.6 is $3 per million input tokens and $15 per million output tokens. Opus is roughly five times that. Haiku is roughly five times cheaper.

The Cost Math Nobody Is Showing You

The friendly way to read the new pool is "you get $20 of free API a month bundled with Pro." The honest way to read it is "your $20 used to subsidise way more than $20 of API, and now it does not." Both are true. Which one applies to you depends entirely on how much programmatic usage you actually do.

Three concrete examples make the difference visible.

One claude -p invocation. Suppose you run claude -p "summarise this diff" against a medium PR. The diff is around 8k tokens. The system prompt and tool definitions add another 4k. Claude writes a 600-token summary. That is 12k input and 600 output, so 0.012 * $3 + 0.0006 * $15 = $0.045. Roughly four and a half cents. Cheap. You could run that 440 times before exhausting a $20 credit.

That sounds like a lot. It is not. Watch what happens when you put it in CI.

A Claude Code GitHub Action on every PR. Same diff, but the action also calls Claude twice more: once to draft a release note, once to look for risky changes. Three calls per PR, $0.13 each PR on average once you account for tool use and a longer context window the second time around. Ten PRs a month: $1.30, basically free. Fifty PRs a month, which is normal for a small team that ships daily: $6.50, still fine. Two hundred PRs across a small org or any team that uses PR-per-commit conventions: $26. You blew through your $20 credit on PR review alone and you have not run a single claude -p of your own yet.

A background agent loop. This is where it gets ugly. A loop that runs every ten minutes, reads recent logs, and decides if anything is worth paging you about. Each iteration is roughly 6k input and 400 output, so $0.024 a call. Six calls an hour, 144 a day, around $3.40 a day, $100 a month per loop. You hit the $100 Max credit in 29 days and the $200 Max credit in 58. If the loop crashes and restarts more aggressively, or if you have two of them, or if a single iteration accidentally pulls in 60k of log context because someone's stack trace was long, the numbers move fast.

This is also where the surprise charges live. The story making the rounds in May is the developer who racked up $200.98 in API charges because a commit message contained the string "HERMES.md" and got auto-flagged as third-party tool use, billing programmatic instead of interactive. Anthropic eventually reversed it after the screenshot went viral, but the lesson stuck. The classifier is doing more work than people assume, and the line between "I am chatting with Claude" and "Claude is acting on my behalf" is fuzzier than you would hope.

The single most useful thing you can do this week is run the actual numbers for your actual usage. Not what you think you do. What you actually do.

Who Wins, Who Loses

The plain way to read June 15 is that it sorts users into two buckets that used to be the same bucket.

The winners are light scripters. If you run claude -p two or three times a week, you have never come close to your subscription rate limit, and you have never seen the meter because there was no meter. Starting June 15 you get a $20 (or $100, or $200) bucket that you will not exhaust, ever. The pricing change is, for you, a free upgrade. You get explicit limits and a transparent budget you previously did not have. If that is your usage pattern, you can mostly stop reading here.

The losers are anyone running 24/7 automation or shared CI on a single seat. Zed's blog put a hard number on the implicit subsidy that is going away. They wrote that Claude subscriptions "previously subsidised agent usage at roughly 15 to 30 times compared to API pricing." Translate that and you get the real story of June 15: programmatic users were getting a 15x to 30x discount. That discount is gone. The credit pool just makes the new rack rate look a little nicer.

The other group of losers is teams. Credits do not pool across users. If you have three engineers on a Team plan and one of them runs all the CI automation, the other two seats' $20 credits sit unused while seat one runs out on day eleven. The fix is either to refactor automation to use a dedicated API key with its own billing, or to spread the work across seats in a way that makes no engineering sense but makes accounting sense. Both are awkward.

And then there is the truly heavy use case. Ben Hylak, CTO of Raindrop.ai, called the change "either really silly, or shows how bad of a spot Anthropic is in re: GPUs." That second reading is the interesting one. If Anthropic is rationing compute by raising the price of background agent loops while keeping interactive chat unchanged, they are signalling that the long tail of always-on automation has become economically painful to subsidise. That signal matters when you are deciding whether to bet a startup on always-on Claude.

What To Change Before June 15: A Checklist

Most of the work you can do in the next month is small. It is also the difference between waking up on June 16 to a clean monthly bill and waking up to a Slack message from your co-founder asking why the OpenAI invoice has a Claude line on it.

Audit every place you call Claude programmatically. This is the work nobody wants to do. It is also the only one that matters. A rough first pass:

rg -n "anthropic|claude" \
  .github/workflows \
  scripts \
  apps \
  packages

You are looking for every cron, every action, every script, every server route, every background worker. Write the list down on actual paper. For each one, answer two questions: how often does it run, and roughly how many tokens per run. If you do not know, instrument it (see below) and come back next week.

Add token-level logging to every claude -p and Agent SDK call. The Anthropic SDK returns usage in the response. Wrap every call in a tiny logger and dump the result to a file or to your existing observability stack. Concrete pattern:


const SONNET_INPUT_PER_M = 3;
const SONNET_OUTPUT_PER_M = 15;

const client = new Anthropic();

type ClaudeCallContext = {
  caller: string;
  workflow?: string;
};

export async function runClaude(
  args: Anthropic.MessageCreateParams,
  context: ClaudeCallContext,
) {
  const start = Date.now();
  const response = await client.messages.create(args);
  const inputTokens = response.usage.input_tokens;
  const outputTokens = response.usage.output_tokens;
  const cost =
    (inputTokens / 1_000_000) * SONNET_INPUT_PER_M +
    (outputTokens / 1_000_000) * SONNET_OUTPUT_PER_M;

  logger.info({
    event: 'claude_call',
    caller: context.caller,
    workflow: context.workflow,
    inputTokens,
    outputTokens,
    cost,
    durationMs: Date.now() - start,
  });

  return response;
}

A week of this data tells you which workflows are cheap, which are scary, and which are blowing your future credit pool in three days. The number that matters is dollars per day per workflow, not tokens per call.

Set explicit max_tokens everywhere. The single most common way to nuke a credit pool is a runaway response. Set max_tokens to something realistic for the task. A diff summary does not need 4096 tokens. 600 is plenty. A release note generator does not need 8000. 1500 is plenty. The model will respect the cap. Your wallet will thank you.

Cap GitHub Actions runs with concurrency and if guards. The fan-out pattern is the cost-equivalent of leaving a tap running. If your action runs on push, every commit triggers it. If you also run on pull_request, the same commit triggers it twice. Add a concurrency block that cancels in-progress runs for the same PR, and add an if guard that skips drafts:

concurrency:
  group: claude-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  review:
    if: github.event.pull_request.draft == false
    runs-on: ubuntu-latest
    steps:
      - uses: anthropic/claude-code-action@v1

That single change cut my own action cost in half because I had been leaving runs queued on every push to a draft branch.

Decide per-workflow: subscription credit, direct API, or off. This is the strategic choice. For each programmatic Claude call you found in step one, pick one of three lanes. Subscription credit is for things that are predictable, run a known number of times per month, and fit comfortably inside your plan's allocation. Direct API is for things that need real budget alerts, team-wide pooling, or volume that exceeds the credit. Off means you switch that workflow to a cheaper model (Haiku, or a non-Claude alternative) or you decide it does not earn its place.

Wire spending alerts on the API key your overflow falls through to. If you enable extra usage so jobs do not just hard-stop, set a budget alert at $50 and another at $200. The Anthropic console supports this. Without it, your June bill arrives as a surprise. With it, you get a Slack message at 50 percent of expected and you can decide whether to keep going or pull the cord.

Pre-cache deterministic prompts. If the same system prompt is going into every call, and it is 4k tokens long, you are paying for that prompt every single time. Anthropic's prompt caching cuts the cost of repeated prefix tokens by roughly 90 percent on hit. Configure it for any workflow where the prefix is stable. A linter that always sends the same rule set is the canonical case.

A lot of this is the same instinct that drove feature flags for solo developers: wrap the expensive thing in a switch you can flip when reality disagrees with your assumptions. The flag does not save you money. The fact that you can turn the workflow off without a deploy does.

Should You Just Skip The Credit Pool And Go Direct API?

For some workflows, yes. The credit pool is convenient. It is also limited in ways that matter at any real scale.

The case for staying on subscription credit is genuine. The credit is bundled with a plan you are already paying for, so the first $20 or $200 of API is sunk cost. The auth flow is the one you already have. You do not need a separate billing relationship. For light, sporadic, per-developer use, it is the simplest option and it just works.

The case for moving a workflow to a direct API key starts when any of three things are true. First, the workflow runs unattended and you need predictable budget alerts. The credit pool does not give you per-workflow caps. A dedicated API key with a budget on the Anthropic console does. Second, the workflow is shared across a team. A single API key with billing on the company card scales better than asking every engineer's $20 personal credit to subsidise CI. Third, the workflow is hot enough to blow through the credit halfway through the month. Once you are out of credit, the difference between "fall through to API" and "use a dedicated API key from the start" is mostly bookkeeping, except the dedicated key is on the company card and has alerts.

If you are debating between credit and direct API and you cannot decide, the tiebreaker is: who pays the bill, and do they want to see the line item? If the answer is "the company does, and accounting wants a single line that says Anthropic API," go direct. If the answer is "I do, and it goes on my personal card alongside the subscription," credit is fine.

There is also the option that nobody likes to talk about, which is dropping to Haiku for any workflow where the quality difference does not matter. Sonnet is the right default for a chat. For a script that classifies log lines or extracts JSON from a fixed template, Haiku is five times cheaper and indistinguishable. Audit which workflows actually need Sonnet output and which inherited Sonnet because that was the default.

A Brief Word On Trust

The June 15 change is the third pricing reversal in three months. On April 4 Anthropic cut off OAuth access for third-party harnesses without warning, breaking every OpenClaw and Conductor setup. On April 22, Claude Code briefly disappeared from the Pro plan for a "test affecting about 2 percent of new prosumer signups," then came back inside 24 hours after Simon Willison surfaced it. And now, on May 13, this. Anthropic is not being malicious. They are figuring out the economics of an explicitly subsidised product in public, in real time, and it shows.

The practical takeaway is not "Claude is bad now." Sonnet is still the model I reach for first when I want code that works. The takeaway is that planning a business around the assumption that today's pricing will be next year's pricing is naive. If a single workflow in your product depends entirely on Claude being cheap, you have a single point of failure. Building a thin abstraction over the model call so you can swap providers (Sonnet, Codex, a local model, whatever) without rewriting business logic is no longer a "nice to have." It is part of indie-dev hygiene now, the same way idempotency in your Stripe webhooks is part of any real billing setup.

The thin abstraction is not exotic. A function called runModel(prompt, options) that picks an SDK based on an env var is enough. The point is that the day Anthropic announces the next change, you can flip a flag and route traffic somewhere else for a week while you decide what to do. That option is the actual safety net. The $20 credit is not.

The Real Lesson

Programmatic Claude was subsidised. That is the truth almost nobody is saying out loud. The subscription price was set when Anthropic wanted users on the platform and was willing to absorb compute cost to get them. That is also a normal stage of any platform. It does not last forever and it never did. The surprise is not that the meter is starting. The surprise is that we treated the un-metered version as the long-term default.

The people who will be fine on June 16 are the people who instrumented their spend before they had to, who know which workflows earn their place, who set caps on the runaway ones, and who have a working fallback if Claude prices double again next quarter. The people who will not be fine are the ones who find out about the change from a billing email.

You have about a month. Run the audit. Add the logging. Cap the workflows. Decide which lane each one belongs in. Wire the alerts. Pre-cache the stable prompts. None of it is hard. All of it is the difference between a clean June and a noisy one.

If you only do one thing this week, do the audit. Everything else flows from knowing where Claude actually lives in your stack.

Zero-Downtime Postgres Migrations: The Mistakes That Locked My Production Database

Alex Cloudstar — Fri, 15 May 2026 09:31:54 +0000

The first production database migration I ran that broke things took down an internal tool for forty-two minutes. The migration looked harmless. It added a NOT NULL column to a table with thirty-eight million rows. I ran it on a Wednesday afternoon, watched it sit at "pending" for a few seconds, then watched our entire app stop responding. Postgres was rewriting the table. Every read and write was queued behind an ACCESS EXCLUSIVE lock. I had no idea this would happen because in development the same migration ran in two hundred milliseconds.

That was the day I learned the difference between a migration that works on a small table and a migration that works on a real production database. They are not the same operation. They have different cost models, different failure modes, and different blast radius. The Postgres docs describe the locking behaviour of every command, but you have to know to look. Most ORM migration tutorials do not even mention locks.

This is the post I wish I had read before that Wednesday. It covers the operations that quietly lock your tables, the expand-and-contract pattern that lets you change schema without downtime, and the migrations I now refuse to run during business hours no matter how confident I am.

The Mental Model You Need First

A Postgres database serving live traffic is a high-frequency machine. Reads and writes are happening every millisecond. Every operation you run has to share the same data with everything else. The way Postgres makes that sharing safe is locks.

There are eight lock modes in Postgres, but the only one that matters for migration safety is ACCESS EXCLUSIVE. That lock blocks every other operation on the table, including reads. If your migration takes ACCESS EXCLUSIVE on a large table and holds it for thirty seconds, your app is down for thirty seconds. If it holds it for thirty minutes, your app is down for thirty minutes.

The dangerous operations are the ones that quietly take ACCESS EXCLUSIVE while also requiring a full table scan or table rewrite. These are the ones that work fine on a small table in staging and freeze you in production. The list of operations like this is short but important to memorise:

ALTER TABLE ... ADD COLUMN ... NOT NULL without a constant default
ALTER TABLE ... ADD COLUMN ... DEFAULT with a volatile default (older Postgres versions)
ALTER TABLE ... ALTER COLUMN TYPE for most type changes that require a rewrite
ALTER TABLE ... ADD CONSTRAINT for CHECK, FOREIGN KEY, or UNIQUE without using NOT VALID first
CREATE INDEX without CONCURRENTLY
VACUUM FULL
CLUSTER

Every other migration operation is either cheap (metadata change only) or has a safe variant if you know the magic words. The safe variants are the entire point of this post.

The Five-Step Migration That Made Me Stop Breaking Things

The pattern that solves almost every schema change is called expand-and-contract, or sometimes parallel change. It works because at no point in the migration is the application unable to read or write. The database holds both shapes of the schema at once, and the application code learns to deal with the old shape and the new shape simultaneously during the transition.

The five steps:

Expand. Add the new schema (new column, new table, new index) in a way that does not break the current application.
Dual-write. Update the application so every write that touches the old schema also writes to the new schema.
Backfill. Copy historical data from the old schema to the new schema, in batches, without locking.
Switch reads. Update the application to read from the new schema. The old schema is still being written to, just in case.
Contract. Stop writing to the old schema. Drop it.

Each step is a separate deploy. That is not optional. The whole point is that the database is always in a state where both the previous deploy and the next deploy work against it. If you collapse steps, you reintroduce coupling between the database state and the application state and you lose the zero-downtime property.

The most common mistake is trying to do steps 1, 2, and 5 in one deploy. That is the old-school way of writing migrations. It only works if you take downtime. The moment you have real users, you cannot take downtime, and you have to think in deploys.

Adding A Column Safely

The classic example is "add a phone_number column to users." A naive migration looks like this:

ALTER TABLE users
  ADD COLUMN phone_number TEXT NOT NULL DEFAULT '';

On Postgres 11 and later, this is actually fast because Postgres stores the default in the catalog and does not rewrite the table. On Postgres 10 or earlier, it rewrites every row. So step zero is "know which version of Postgres you are on." Postgres 18 is the current major as of May 2026, and you should be on it or close to it for any new project.

Even on modern Postgres, the safe pattern for a column the app actually needs is:

Step 1, expand. Add the column as nullable with no default. This is a metadata-only change and runs in milliseconds.

ALTER TABLE users ADD COLUMN phone_number TEXT;

Step 2, dual-write. Update the application to write phone_number on every insert and update where you have it. Old rows still have NULL. New rows have the value.

Step 3, backfill. If you have a source for historical phone numbers, copy them in batches. Use a script that does small updates with sleeps in between, not a single UPDATE users SET phone_number = ... over the whole table.

UPDATE users
SET phone_number = legacy_phone
WHERE id IN (
  SELECT id FROM users
  WHERE phone_number IS NULL AND legacy_phone IS NOT NULL
  ORDER BY id
  LIMIT 1000
);

Run that in a loop with a short pause between batches. Each batch holds row locks for milliseconds. Your application keeps serving traffic.

Step 4, switch reads. Update the application to read phone_number instead of legacy_phone. Both columns still exist.

Step 5, contract. Once you are confident nothing reads legacy_phone, drop it.

ALTER TABLE users DROP COLUMN legacy_phone;

Dropping a column in Postgres is metadata-only. The data is reclaimed by autovacuum later. The drop itself is instant.

If you also want phone_number to be NOT NULL, add the constraint after the backfill, using the safe two-step variant covered later.

Renaming A Column Without Downtime

Renaming a column is the migration most people get wrong. The naive version:

ALTER TABLE users RENAME COLUMN email_address TO email;

This is metadata-only and fast. The problem is not the database. The problem is the deploy. The old application code is still running and querying email_address. The new code queries email. Whichever one runs against the renamed column at any moment will break.

The safe version uses expand-and-contract:

Add a new email column. Empty.
Dual-write: every write to email_address also writes to email.
Backfill: copy historical email_address to email.
Switch reads to email.
Stop writing to email_address. Drop it.

Five deploys. Boring. Survives every concurrent deploy and rollback in between.

If you are using an ORM that auto-generates migrations from your models, this pattern requires you to break out of the ORM's default behaviour. You will need to write the SQL by hand or use a tool that understands expand-and-contract natively. pgroll is the open-source tool that does this for Postgres, and it has become the default I reach for in 2026.

Adding An Index Without Locking The Table

Indexes are the single most common cause of accidental downtime. The default CREATE INDEX takes a write lock on the table for the entire duration of the build. On a table with hundreds of millions of rows, that can be ten or twenty minutes. Every write to the table during that time waits.

The fix is one word: CONCURRENTLY.

CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

The concurrent variant takes a much weaker lock that does not block writes. It is slower overall because it scans the table twice, but it does not stop your app.

Two gotchas with CONCURRENTLY that bite people:

The first is that it cannot run inside a transaction. Most migration frameworks wrap migrations in a transaction by default. You have to tell the framework not to. In Prisma, this means writing the migration as a raw SQL file with the right annotation. In Rails, you set disable_ddl_transaction!. In every framework, it is one line of config and forgetting it makes the migration fail loudly.

The second is that CREATE INDEX CONCURRENTLY can fail partway and leave an invalid index sitting on the table. If that happens, the index exists but is marked invalid and will not be used by queries. You have to drop the invalid index and try again. Always check the state of an index after building it:

SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indrelid = 'users'::regclass;

If any indisvalid is false, drop that index and rebuild.

Changing A Column Type Without Rewriting The Table

Some type changes are free. VARCHAR(50) to VARCHAR(100) is free. INTEGER to BIGINT is not free, it rewrites the table. TEXT to INTEGER is definitely not free.

The expansion-and-contract pattern works for type changes too:

Add a new column with the target type.
Dual-write to both the old column and the new column, casting the value on the way in.
Backfill historical rows in batches.
Switch reads to the new column.
Drop the old column. Rename the new column to the old name (this is metadata-only).

The "rename to the old name" step is what makes this transparent to the rest of the application. By the end, the column name is the same, the type is different, and there was no table rewrite.

For columns that need to grow from INTEGER to BIGINT (because you ran out of auto-increment space), this pattern is the standard answer. Doing it the naive way on a billion-row table is a multi-hour outage.

Constraints Without Lock Storms

Adding a NOT NULL constraint on an existing column without taking a long lock requires a two-step:

-- Step 1: add the constraint as NOT VALID
ALTER TABLE users
  ADD CONSTRAINT users_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- Step 2: validate the constraint without an exclusive lock
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;

The NOT VALID part tells Postgres "trust me, future writes will satisfy this, but do not check the existing rows yet." That part is fast and only locks briefly. The VALIDATE step scans the table but does so without blocking reads or writes.

Once the constraint is validated, you can convert it to a proper NOT NULL on the column itself if you want. In Postgres 12 and later, you can do this without rewriting the table because Postgres knows the check constraint already proved every row is non-null.

The same pattern works for FOREIGN KEY constraints. Add as NOT VALID, then VALIDATE separately. Splitting the operation into the cheap declaration and the slower scan is how you keep production responsive.

For UNIQUE constraints, the trick is to build the underlying index CONCURRENTLY, then attach the constraint to it:

CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users (email);
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;

The constraint creation in the second step is metadata-only because the index already exists. Zero downtime.

The Backfill Script That Will Not Crash Your Database

Backfills are where production migrations go to die. People write a single UPDATE over millions of rows, that update holds locks and bloats the WAL, and the database starts misbehaving. Then they kill the query, which Postgres has to roll back, which takes longer than running the query in the first place.

The pattern that works is small batches, slow pace, observable progress. Something like:

async function backfillPhoneNumbers() {
  let lastId = 0;
  const batchSize = 1000;

  while (true) {
    const result = await db.query(
      `UPDATE users
       SET phone_number = legacy_phone
       WHERE id > $1
         AND id <= $1 + $2
         AND phone_number IS NULL
         AND legacy_phone IS NOT NULL
       RETURNING id`,
      [lastId, batchSize]
    );

    if (result.rows.length === 0 && lastId > maxUserId) break;

    lastId += batchSize;
    await sleep(100);
  }
}

The script processes a thousand rows at a time, sleeps for a hundred milliseconds between batches, and tracks progress by primary key. Each batch holds row locks for a tiny window. If the script dies, it can resume from the last lastId. If it falls behind, the sleep can be reduced. If the database starts struggling, the sleep can be increased without changing the logic.

Run this from a worker, not from inside your migration tool. Migrations should be schema changes. Data backfills are application code with retry semantics, and they belong in your background job system.

The Tools That Make Expand-And-Contract Less Painful

The expand-and-contract pattern is correct but verbose. Five deploys for one schema change is a lot of process for a small team. Two tools have changed how I do this in 2026.

pgroll by Xata. An open-source tool that runs schema migrations as multi-step plans, automatically maintaining both the old and new schemas during the rollout. You describe the migration in a JSON or YAML file (add column, rename column, change type), and pgroll figures out the expand-contract steps and runs them safely. It serves both schema versions through Postgres views so the old app and the new app can both read and write during the transition. The contraction step happens when you explicitly complete the migration.

The other one is Atlas by Ariga. A schema-as-code tool that lints migrations for unsafe operations, warns you about lock implications, and integrates with CI to block dangerous migrations before they merge. The linter alone has saved me twice from running an ALTER COLUMN TYPE that would have rewritten a large table.

For projects already on a framework-specific migration system (Prisma, Drizzle, ActiveRecord, Alembic), the workflow is usually:

Write the migration in the framework's native format for the safe operations.
For the dangerous operations, write raw SQL and add the safe variants by hand.
Run pgroll or Atlas alongside the framework, or graduate to them for any non-trivial change.

The Prisma migrate workflow in particular has gotten better about generating CREATE INDEX CONCURRENTLY and similar safe variants in 2025 and 2026, but it still cannot reason about expand-contract across multiple deploys. That part is your job. The Prisma vs Neon comparison covers more of the Postgres-on-modern-platforms angle if you are picking a stack from scratch.

What I Refuse To Run During Business Hours

Even with the safe patterns, some operations should not be run during peak traffic. The list:

VACUUM FULL on any table that fits in cache. It rewrites the table from scratch and takes ACCESS EXCLUSIVE for the entire duration. There is almost never a case where you actually need it in 2026 because routine autovacuum handles bloat for most workloads. If you do need it, schedule a maintenance window.

Any backfill that touches a hot table (high write volume from your app). Even with small batches, the contention with normal traffic can degrade performance. Run backfills during low-traffic hours, even if the script is technically safe at any time.

Reindexing a unique index, even with CONCURRENTLY. The brief windows of exclusive locking around the rebuild can stack with your app traffic and cause weird stalls. Schedule it for quiet hours.

Any ALTER TABLE that you have not tested against a production-sized dataset. Staging databases are usually a fraction of production size. The lock behaviour is the same, but the duration scales with row count. Test against a copy of production or against a representative subset.

The general rule: if a migration could theoretically lock the table, the time of day matters even if you think it will be fast. The cost of being wrong at 2pm is much higher than the cost of being wrong at 2am.

Connection Pools And Migration Coordination

The other thing that bites people during migrations is the connection pool. Many migration tools take a separate connection to run the migration. Your app holds connections through PgBouncer or a built-in pool. If your migration takes an exclusive lock and your app has a connection mid-transaction on the same table, both will sit there waiting for each other.

The fix is to set a lock_timeout on the migration session:

SET lock_timeout = '5s';

ALTER TABLE users ADD COLUMN phone_number TEXT;

If the migration cannot acquire its lock within five seconds, it fails fast instead of blocking the entire database. You catch the error in your migration tool, wait, and retry. This is much better than the default behaviour, which is to wait forever and queue every other query behind you.

Pair this with statement_timeout to prevent any single migration statement from running indefinitely. Five seconds is a good default for the lock_timeout. The statement_timeout depends on the operation, but for any single DDL statement on a properly-sized table, anything over thirty seconds is suspicious.

SET lock_timeout = '5s';
SET statement_timeout = '30s';

For long backfills, the statement_timeout needs to be larger, but those should be running as application code with their own retry logic, not as migration statements.

Rollback Plans That Actually Work

The version of "rollback" you learn in tutorials does not work for real production migrations. The naive idea is "if the new code breaks, roll back to the old code, and roll back the migration." This fails because the migration usually cannot be reversed without data loss. Dropping a column you just added is fine. Reverting a column rename is fine if you saved the old data, which you did because you used expand-and-contract. But reverting a backfill that mutated data is generally impossible without restoring from backup.

The actual rollback strategy that works:

For the expand-and-contract pattern, the rollback at any step is to deploy the previous version of the application. The database state is forward-compatible at every step because both schemas are present. You never need to undo a migration mid-flight, you just deploy the previous code and the schema continues to support it.

For destructive steps (dropping a column, dropping a table), there is no rollback. The data is gone. The mitigation is to leave the destructive step until you are absolutely certain nothing reads or writes the old column. In practice this means waiting a full release cycle (or two) between the "switch reads" step and the "contract" step.

For data mutations (a backfill that transformed a column), the rollback is a restore from backup. This is fine for non-critical data. For critical data, the right strategy is to not mutate in place. Add a new column, backfill the new column with the transformed value, and switch reads. The original data is preserved until the final contraction step, which you delay until you trust the new column.

Take a logical backup before every production migration that touches data. Postgres pg_dump of the affected tables, copied somewhere safe, dated. The backup will be useless 99% of the time. The 1% will save your business.

What I Run In Production

The setup I have arrived at after enough scars:

Every migration goes through a CI pipeline that runs Atlas in lint mode against the migration file. The lint catches unsafe operations (missing CONCURRENTLY, NOT NULL without NOT VALID, type changes without expand-contract). The pipeline blocks the PR if any unsafe pattern is detected.

Every dangerous migration is broken into expand-and-contract steps. I use pgroll when the team is willing to learn it. For projects where pgroll is too much overhead, I write the steps as separate migration files with clear comments about which deploy each one belongs to.

Every backfill runs as a worker job, not as part of the migration. The worker uses small batches, primary-key cursors, and a sleep between batches. The job's progress is visible in the same observability stack the rest of the app uses. The production observability post for solo developers covers what that stack looks like for small teams.

Every migration session sets lock_timeout = '5s' and statement_timeout appropriate to the operation. Migrations fail fast instead of blocking the app.

Every destructive step (dropping a column or table) is delayed until at least one full release cycle after the switch-reads step. I have never regretted the wait. I have regretted not waiting.

Every production migration that touches data takes a pg_dump of the affected tables first. The dump is stored in object storage with a date prefix and a retention policy. The dump itself takes seconds for most tables. The peace of mind is worth the seconds.

Every migration is reviewed by at least one other engineer who has the explicit job of asking "what does this lock, and for how long." The review is not about the SQL syntax. The review is about the production behaviour.

What I Would Tell You If You Asked

Most production database outages are not from queries. They are from migrations. The migration runs, takes a lock you did not expect, the lock cascades into every other operation, and your app is down before you have finished refreshing the deploy log.

The fix is not to be smarter at writing migrations. The fix is to assume every operation is dangerous until you have verified it is not. Look up the lock mode in the Postgres docs. Check the version-specific behaviour. Test against a production-sized dataset. Use CONCURRENTLY, NOT VALID, expand-and-contract. Split every dangerous change into multiple deploys.

The reason this matters more in 2026 than it did five years ago is that databases have gotten bigger faster than tooling has gotten safer. A SaaS that hit ten million rows in 2020 hits a hundred million in 2026 because the underlying compute is cheaper and AI features generate more rows per user. The migrations that were fine on small tables are not fine on the new tables. The patterns that protected you at small scale do not protect you at the scale you are about to operate at.

The good news is that the safe patterns are well-known and the tools are good. pgroll, Atlas, and the safe variants built into your migration framework cover almost every case. The hard part is internalising that "safe" is not the default and you have to ask for it explicitly. Once you do, the production-Postgres game becomes much less scary.

The broader instinct, the one that generalises beyond databases, is to ask "what locks while this runs and what happens to everyone waiting." That question is the difference between code that runs in isolation and code that runs in production. It is the same question that separates a working webhook handler from a Stripe webhook handler that does not double-bill users, and the same question that separates a real-time feature from a real-time feature that survives a network blip. Production is the place where invisible costs become visible. Migrations are the part where they become visible to your customers.

Run your migrations like the database is also doing other things. Because it is.

Feature Flags For Solo Developers in 2026: When You Need Them, When You Do Not, And What I Actually Use

Alex Cloudstar — Fri, 15 May 2026 09:31:53 +0000

The first feature flag I ever shipped was a single boolean in a YAML config file. The second one was a boolean in a database row. The third one was a third-party feature flag service I spent forty minutes setting up because a tutorial said I should. The third one was a mistake. I deleted the integration two weeks later and went back to the database row.

If you are a solo developer or a two-person team building a real product, you have probably seen the same tutorials I did. They tell you that feature flags are the modern way to deploy safely. They show you LaunchDarkly, Statsig, Flagsmith, ConfigCat. They start at fifty dollars a month and scale up fast. The pitch is targeted at companies with a hundred engineers running thousands of experiments. You are not that company. You may never be that company. And the truth nobody tells you is that you do not need a feature flag service to ship safely. You need to know when a flag is actually earning its keep and when it is just ceremony.

This post is the version of the feature flag conversation I wish someone had had with me. It covers what feature flags actually do, when a config file is enough, the three lightweight setups that earn their place at solo and small-team scale, and the moment you finally graduate to a proper service.

What Feature Flags Actually Do

A feature flag is a runtime switch that changes application behaviour without redeploying. That is the whole definition. Everything else is a feature on top of the switch.

The switches come in four flavours, ordered by complexity:

The first is a release toggle. You merge code that is not ready for users yet, but you put it behind a flag that is off by default. When the code is ready, you flip the flag on. This decouples "the code is in main" from "the feature is live." The flag is binary, the audience is everyone, and the flag goes away once the feature ships.

The second is a kill switch. You ship a feature, something goes wrong, and you flip the flag off to instantly disable the feature without deploying. The flag stays in the codebase forever as a panic button. Most features that touch payments, file uploads, or third-party APIs need a kill switch.

The third is a gradual rollout. You enable a feature for ten percent of users, watch the metrics, ramp to twenty-five percent, fifty, one hundred. If something breaks, you ramp back down. This is where the targeting logic starts to matter, because "ten percent of users" needs to be a stable subset, not a random ten percent on every request.

The fourth is an experiment. Two variants, randomised assignment, conversion tracking, statistical analysis. This is what feature flag companies actually sell. It is a tiny fraction of what most products need.

For a solo developer, the first three matter. The fourth is almost never worth the overhead until you have enough traffic that statistical tests actually converge, which is roughly when you have something like ten thousand monthly active users. Before that point, you are looking at noise.

When You Do Not Need A Flag

Most code does not need a flag. The instinct to flag everything is the same instinct that leads to over-engineered codebases full of abstractions for hypothetical futures. Flags add cognitive overhead because every branch is now two branches the reader has to track. Old flags that nobody removed are dead weight that gets in the way of every refactor.

Skip the flag if:

The feature is small and contained. A UI tweak, a copy change, a small bug fix. Deploy it. If it breaks, deploy a fix or a revert. The flag costs more than the protection it provides.

The feature is part of a larger flow you cannot ship partially. If turning the flag off would leave the product in a broken state, the flag is not protecting you, it is hiding a coupling.

The change is irreversible. A database migration, a webhook reconfiguration, a third-party API switch. Flags cannot undo state changes. They can only prevent state changes from happening in the first place, and only if the flag is checked before the change.

You are pre-launch. If nobody is using the product, there is nothing to break. Flags exist to protect users, not to protect future users. Ship without flags until you have real users to disappoint.

Most solo projects spend their first six to twelve months in the "no users yet" or "few users" stage. During that stage, the right deploy strategy is to ship fast, watch for errors, revert if needed. Flags add overhead with little payoff.

The signal that you should start using flags is the first time you ship a change and immediately wish you could undo it without a redeploy. That is the moment a kill switch would have helped. That is the moment to start.

When You Actually Need A Flag

The use cases where flags genuinely pay off, even at small scale:

Anything that touches money. Payment flows, subscription changes, refund logic, pricing tiers. The cost of a broken deploy is real money lost or refunded. A kill switch on the critical path is cheap insurance. I have one on every Stripe-related path in my products. It has fired twice in two years. Both times saved me an evening of customer support.

Anything that calls a third-party API on the critical path. Email sending, SMS, image processing, anything that costs money per call and has rate limits. If the API misbehaves or your usage spikes, you want to disable the integration without redeploying.

Anything that is rolling out to a specific user segment. Beta features for paying users, admin-only tools, region-specific behaviour. The flag is doing real work of routing different users to different code paths.

Anything you would otherwise gate by environment variable. A flag service or database row is just a better-shaped environment variable for things you want to change without a deploy. The env vars in production article touches on this in a different context, but the principle is the same: anything you want to flip at runtime should not live in your bundled config.

Pre-launch dark launches. You merge the new feature with the flag off, you flip the flag on for your own account, you test in production with real data, you flip on for a few users, you ramp up. This pattern works regardless of team size. A solo developer can dark-launch by treating themselves as user-one.

If none of those apply, you do not need a flag. If at least one does, you need a flag for that specific feature, not a flag system across your whole codebase.

The Three Setups That Cover Solo And Small-Team Scale

Here are the three feature flag setups I have actually used in production, in increasing order of complexity. Pick the smallest one that solves your current problem.

Setup One: The Config File

The simplest possible feature flag is a constant in code. Some teams call this anti-pattern. I call it the right answer for the first few flags.

export const flags = {
  newCheckoutFlow: false,
  exportPdfV2: false,
  betaAiSummariser: true,
};

The flag is checked at runtime. Flipping it requires a deploy. The deploy is the whole point: the change is recorded in git, reviewed in a PR, and rolled back by reverting the commit.

This works when:

You can deploy quickly (under five minutes from "flip the flag" to "users see the change").
You do not need to roll out to a subset of users.
You do not need to disable the feature without a deploy.

It does not work when you need a kill switch, because by the time you have edited the config and pushed a deploy, the broken feature has already harmed your users. For everything else, this is enough. Most of the flags I have ever shipped have been this shape.

The graduation signal is the first time you wish you could disable something without a deploy. Move to Setup Two.

Setup Two: The Database Row

A single table of flags, read at request time, with a small layer of caching to avoid querying on every request:

CREATE TABLE feature_flags (
  key TEXT PRIMARY KEY,
  enabled BOOLEAN NOT NULL DEFAULT false,
  enabled_for_user_ids JSONB DEFAULT '[]'::jsonb,
  enabled_percentage INTEGER DEFAULT 0,
  updated_at TIMESTAMPTZ DEFAULT now()
);

Application code that reads the flag:

async function isEnabled(key: string, userId?: string): Promise<boolean> {
  const flag = await getCachedFlag(key);
  if (!flag) return false;
  if (flag.enabled) return true;
  if (userId && flag.enabled_for_user_ids.includes(userId)) return true;
  if (userId && stableBucket(userId, key) < flag.enabled_percentage) return true;
  return false;
}

The stableBucket function hashes the user ID with the flag key and returns a number from zero to ninety-nine. The hash is deterministic, so the same user always lands in the same bucket for the same flag. A percentage rollout uses the bucket: users in buckets zero through twenty-four get the feature when the rollout is at twenty-five percent.

function stableBucket(userId: string, key: string): number {
  const hash = createHash('sha256').update(userId + key).digest();
  return hash.readUInt32BE(0) % 100;
}

Cache reads aggressively. Most flags do not change minute-to-minute, so a thirty-second in-process cache is fine. Some teams use a Redis layer for sub-second propagation across nodes. For a solo product running on a single VM or a serverless platform with a small number of instances, in-process is enough.

Build a tiny admin page (or a CLI command) to flip flags. Two endpoints: list flags, update flag. Restrict access to your own admin account. The admin page does not need to be pretty. It needs to work in thirty seconds when something is on fire.

This setup gives you kill switches, percentage rollouts, per-user enabling, and same-database transactional safety. It costs zero dollars in marginal infrastructure if you already have a database. It scales to thousands of flags before performance becomes an issue.

The graduation signal is when you want a flag to be the same across multiple unrelated apps, or when you need targeting more sophisticated than user-ID-and-percentage, or when you want non-engineers to manage flags. Move to Setup Three.

Setup Three: A Hosted Service

When you cross the line where Setup Two stops paying off, the move is to a hosted feature flag service. The market in 2026 is broad. The realistic options for small teams:

ConfigCat is the cheapest option for small teams that want a managed dashboard. The free tier covers most solo projects (ten flags, two environments, sub-second propagation). Pricing scales by request volume, not seat count, which fits indie hackers better than per-seat pricing.

Flagsmith has a generous self-hosted open-source mode. If you want a real dashboard but you do not want to pay, you can host it yourself for free on the same VM your app runs on. The self-hosted path is genuinely production-ready, not a stripped-down version.

Statsig is more focused on experimentation and analytics. It is the right move if you actually need A/B testing with statistical confidence, not just toggles. The free tier is generous (a million events a month) but the value of the platform is in the analytics, which only matter if you have enough traffic for statistical significance.

Unleash is the open-source veteran. Mature, well-documented, self-hostable, and the SDKs are widely used. It is what I would pick if I were running a small team that needs the full feature flag feature set without paying for it.

Skip LaunchDarkly until you are a real company with real budget. Their product is fine, their pricing is not designed for solo developers, and the feature gap between LaunchDarkly and ConfigCat at small scale is not worth the difference in cost.

The criteria for picking among the lightweight options:

If you want zero infrastructure and pay-as-you-go pricing: ConfigCat.
If you want self-hosting and a real dashboard: Flagsmith or Unleash.
If you actually need experimentation analytics: Statsig.

For most solo products that need flags, ConfigCat or self-hosted Unleash is the answer. The decision is reversible: the SDK abstraction is similar across providers, and migrating from one to another is a couple of hours if you wrote your flag-check function as a thin wrapper around the provider's SDK.

The Wrapper That Makes Migration Easy

The biggest mistake I see people make with feature flag services is to call the provider SDK directly from every place in the codebase. Then when you want to swap providers, every flag check has to change.

The fix is a one-function abstraction:

// flags.ts

export async function isEnabled(
  key: string,
  context?: { userId?: string }
): Promise<boolean> {
  return configcat.isEnabled(key, context);
}

Every flag check in your app calls isEnabled. The provider is hidden behind a single function. Swapping providers is changing one file.

This wrapper is also where you put your test overrides. In tests, you replace the provider with a static map. In development, you can read from a local YAML file. In production, you hit the real service. The wrapper handles the dispatch.

const overrides = process.env.FLAG_OVERRIDES
  ? JSON.parse(process.env.FLAG_OVERRIDES)
  : null;

export async function isEnabled(key: string, context?: any) {
  if (overrides && key in overrides) return overrides[key];
  return provider.isEnabled(key, context);
}

The override layer is what lets you debug "why is this flag not firing" without poking around in a third-party dashboard. Set FLAG_OVERRIDES={"newCheckout":true} in your local env, restart, the flag is on for you regardless of the dashboard state.

The Hygiene That Keeps Flag Sprawl In Check

Every feature flag is a code branch the reader of your code has to track. Old flags rot. They survive past the feature they were gating, they get checked in dead code paths, and eventually a new engineer asks "what does this flag even do" and nobody knows.

The hygiene that prevents this:

Every flag has a sunset date in a comment when you add it. Six months out by default, less for release toggles. When the date hits, the flag either gets removed or its date gets extended with a written reason.

// FLAG: newCheckoutFlow
// Added: 2026-03-12
// Sunset: 2026-09-12
// Owner: alex
// Purpose: gradual rollout of new Stripe Checkout integration
if (await isEnabled('newCheckoutFlow', { userId })) {
  return renderNewCheckout();
}
return renderLegacyCheckout();

Release toggles get cleaned up after the feature ships. Kill switches stay forever, but they are explicitly labelled as kill switches and not as release toggles.

A monthly cron job (or a calendar reminder for solo devs) to grep the codebase for flag names that no longer exist in the flag store, or flags in the flag store that are not checked anywhere. Both are signs of drift.

The flag table or service is the source of truth for what flags exist. If you check isEnabled('foo') and foo does not exist in the store, the result should be a logged warning, not a silent false. Silent false is how flag drift bugs hide.

Treat the flag list like the dependency list. Audit it. Prune it. A flag that has been on for six months everywhere is not a flag, it is dead code with a runtime branch.

The Pattern For Rolling Out Without Pain

When you are ready to enable a feature, the pattern that works is the same regardless of which setup you are on:

Start by enabling for yourself only. Use the per-user-ID list. Verify the feature works in production with your real data.

Enable for a single friendly user. Someone who knows you are testing and will tell you if something breaks.

Enable for one percent. Watch error rates, latency, conversion. Watch for a full business day.

Enable for ten percent. Watch again. The metrics you care about most depend on the feature: error rate for technical reliability, conversion or retention for product impact.

Enable for fifty percent. By now you have enough data that something has shown up if it was going to.

Enable for one hundred. Leave the flag in for a couple of weeks. If anything goes wrong, you can roll back to zero without redeploying.

After two weeks of clean operation at one hundred percent, remove the flag from the code. The feature is now permanent.

The whole arc takes one to three weeks for a non-trivial feature. The pace is set by how much traffic you have. With a hundred users a day, a percentage rollout takes longer to gather meaningful signal than with a million. At very small scale, the percentage rollout is mostly ceremony, and you should focus on the per-user-ID enabling instead.

Common Mistakes That Hurt

The mistakes I made or watched others make, in rough order of pain:

Putting flag checks inside hot loops. Every flag check is a network call or at least a cache lookup. If you check the same flag a thousand times in a request, that is a thousand checks. Cache the result for the request. Most flag libraries do this automatically; your custom wrapper should too.

Forgetting to handle the flag-not-found case. If your flag service is down or the flag does not exist, the default behaviour should be the safe one. For a release toggle, safe is false (do not show the new thing). For a kill switch, safe might be true (the feature is on, do not disable it). Think about what "safe" means for each flag explicitly.

Using flags for things that should be database columns. "Is this user on the pro plan" is not a feature flag, it is a user state. Storing it as a flag means you have to keep the flag service in sync with the database. Just check the database. Flags are for code paths, not user attributes.

Building targeting rules in the dashboard that nobody can read. A flag that is enabled for "users in North America with more than ten projects created in the last thirty days who have not seen the welcome email" is a flag that nobody will understand six months later. If the targeting is complex, the targeting belongs in code, not in the flag service. Keep the flag itself simple.

Treating flag changes as harmless. Flipping a flag in production is a deploy in disguise. It changes the running behaviour of your app. Log it. Track who did it. Be able to ask "what changed in the last hour" and have the answer include flag flips.

The Honest Cost Picture

Here is what feature flag tooling actually costs at solo and small-team scale in 2026.

A config-file setup is free. Cost is one minute per flag, the minute it takes to add the boolean.

A database-row setup is free, assuming you have a database. Cost is a few hours to build the table, the wrapper, the admin page, and the bucketing function. Then a minute per flag.

ConfigCat at small scale: free for the first ten flags, then around forty dollars a month for the next tier. Pricing scales by request volume, which for a solo product is usually fine for months or years.

Statsig at small scale: free up to a million events a month, which covers most products until you have real traffic.

Self-hosted Unleash or Flagsmith: the marginal cost of running another container on your VM. Twenty dollars a month of VM, less if you are already running other services on the same box.

LaunchDarkly: starts at a few hundred dollars a month, scales fast. Worth it for real engineering teams. Not worth it for solo developers.

The honest math: for the first hundred users, the config file is fine. For the first thousand users, the database row covers it. After that, a hosted service starts paying off because the admin UI saves time. The right time to upgrade is when you find yourself in the database every week to flip flags and you wish you had a dashboard. Not before.

What I Run In Production

For my main product, which has a few thousand monthly active users, the setup is:

The flag store is a Postgres table called feature_flags. The schema is the one above. Updates are made through a small admin page I built into the product itself, accessible only to my admin user. The page lists every flag, shows its current state, and has a toggle. Updates go through a Postgres transaction that also writes an audit log row. The audit log captures who changed what, when, and why.

The flag check function lives in src/lib/flags.ts. It is the wrapper. Every flag check in the app goes through that function. The function caches results for thirty seconds in process. On serverless, the cache lifetime is the function instance's lifetime, which is usually a few minutes.

Each flag has a comment in code with the sunset date, the owner (me), and the purpose. A monthly calendar reminder makes me look at the flag list and prune anything that has been on for six months. The list has about twenty flags right now. Half are kill switches that will never go away. The other half are release toggles in various states.

For one product where I have a co-founder and we are running A/B tests, I added Statsig on top of the Postgres flags. The Postgres flags handle release toggles and kill switches. Statsig handles the experiment variants. The wrapper in flags.ts dispatches to the right backend based on the flag name's prefix.

That is the whole setup. No third-party feature flag service for the main product. Statsig only where the analytics earn their place. Total cost: zero dollars a month for the flag infrastructure, plus the Statsig free tier for the experiments. Total complexity: one table, one wrapper, one admin page.

What I Would Tell You If You Asked

You probably do not need a feature flag service. You probably do not need a feature flag system. You probably need a kill switch on your Stripe webhook, a release toggle for the next big feature, and a per-user enable for your own account so you can dark-launch.

The mistake is to start with the service and let it dictate how you think about flags. The right move is to start with the smallest thing that solves the actual problem, and graduate only when the smallest thing stops working. The graduation signals are concrete: "I need to flip this without a deploy" moves you from config file to database. "I need targeting more complex than user-and-percentage" moves you from database to service. "I need experimentation analytics" moves you to a service with analytics, not a service with just flags.

The bigger pattern, the one that generalises beyond feature flags, is that solo developers and small teams should use the smallest tool that works and resist the pressure to adopt the tools that are built for hundred-engineer teams. The same instinct that says "just use a config file" for flags is the same instinct that says "pick the boring stack" for everything else. The complexity of your tooling should scale with the complexity of your problem, not with the LinkedIn job titles of the people writing tutorials.

Flags are a tool. The tool earns its place when you have a specific problem it solves. Until then, the tool is overhead. Ship without flags until you have the specific problem. Then ship the smallest version of the flag system that solves it. Then graduate when the small version stops fitting.

The product I am working on right now has twenty active flags and zero feature flag service. The cost is twenty seconds when I add a flag and a few cents per month in database load. The protection is real. The complexity is small. That is the whole point.

Use flags when they earn their keep. Resist them when they do not. The smaller your team, the more important it is to know the difference.

Stripe Webhooks in Production: Idempotency, Retries, and the Mistakes That Cost Me Real Money

Alex Cloudstar — Fri, 15 May 2026 09:18:56 +0000

The first Stripe webhook bug that cost me actual money happened on a Tuesday. A user signed up for a paid plan, the checkout.session.completed event arrived, my handler created their workspace, and then Stripe retried the same event nine seconds later because my response had taken longer than the timeout. The retry created a second workspace. The user could see both. They picked the first one and ignored the second. A week later they cancelled because their data had silently been split across two accounts and they thought we were buggy.

We were buggy. Not because the webhook code was wrong in any single line, but because I had treated webhooks like a normal API call. They are not a normal API call. They are a message queue with weird rules, and if you do not respect those rules you ship a product that quietly corrupts billing state.

This post is the version of the webhook integration guide I wish I had read before shipping. The Stripe docs are good. They are not pessimistic enough about what happens when your handler meets the real internet.

What Stripe Actually Sends You

A webhook is a POST request from Stripe to a URL you control, with a JSON body describing an event. Events are things that happened: a checkout completed, an invoice was paid, a subscription was updated, a payment failed. Stripe sends every event to every endpoint you have configured to listen for that event type.

The shape of the request is simple. Headers contain a signature. The body is JSON. You verify the signature, parse the body, do something, return 200. Stripe sees the 200, marks the event delivered, and moves on. That is the happy path.

The unhappy path is where the work is. If your endpoint returns anything other than a 2xx within Stripe's timeout, Stripe retries. If your endpoint times out, Stripe retries. If your endpoint returns 200 but you crashed before persisting anything, the event is lost from your side and Stripe thinks it succeeded. If two events arrive at almost the same time, you can process them out of order. If Stripe's own infrastructure has a delivery delay, you can receive a customer.subscription.deleted before the customer.subscription.updated that preceded it.

Three things to internalize before writing a single line of handler code:

Stripe will retry the same event many times. Your handler must be idempotent. Processing the same event twice must produce the same result as processing it once.
Events do not arrive in order. Your handler cannot assume that the event you are reading is the most recent state of the underlying object.
Stripe does not care about your downstream systems. If your database is down, your queue is full, or your downstream API has rate-limited you, that is your problem. Stripe just keeps retrying.

If you build around these three rules from day one, the rest of the work is small. If you do not, you spend the next year writing patches.

Signature Verification, And Why You Cannot Skip It

The first thing your handler does, before parsing the body, is verify the signature. Stripe signs every request with a secret you configured on the endpoint. The signature is in a header called Stripe-Signature. If the signature does not match, drop the request. Period.


const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const webhookSecret = process.env.STRIPE_WEBHOOK_SECRET!;

export async function POST(req: Request) {
  const signature = req.headers.get('stripe-signature');
  if (!signature) return new Response('No signature', { status: 400 });

  const rawBody = await req.text();
  let event: Stripe.Event;

  try {
    event = stripe.webhooks.constructEvent(rawBody, signature, webhookSecret);
  } catch (err) {
    return new Response('Invalid signature', { status: 400 });
  }

  // event is now verified
}

Two things people get wrong here. First, you must use the raw request body, not a parsed JSON object. The signature is computed over the exact bytes Stripe sent. If your framework auto-parses JSON, you have to either disable that for the webhook route or read the raw body separately. Next.js App Router gives you req.text() which works. Express needs express.raw({ type: 'application/json' }) for the route.

Second, the signature header includes a timestamp. The constructEvent helper enforces a default tolerance of 300 seconds, which prevents replay attacks where someone captures a webhook payload and resends it later. Do not extend this tolerance unless you have a specific reason. Five minutes is plenty.

If you are testing locally, the Stripe CLI forwards events with a generated webhook secret you can use during development. The signature is real. The tolerance check still applies. This is by design and it catches plenty of bugs before they ship.

Idempotency Is The Whole Job

Every webhook handler is a function that takes an event ID and updates some state. The contract is: processing the same event ID twice must do the same thing as processing it once. That is what idempotency means in this context.

The simplest implementation is a table of processed event IDs.

CREATE TABLE processed_webhook_events (
  event_id TEXT PRIMARY KEY,
  event_type TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

In the handler, you insert the event ID before doing anything else. If the insert fails on the primary key conflict, you return 200 immediately. The event has already been processed.

const result = await db.query(
  `INSERT INTO processed_webhook_events (event_id, event_type)
   VALUES ($1, $2)
   ON CONFLICT (event_id) DO NOTHING
   RETURNING event_id`,
  [event.id, event.type]
);

if (result.rowCount === 0) {
  // already processed
  return new Response('ok', { status: 200 });
}

This is the cheapest idempotency strategy that works. It has two limitations that bite you in real production.

The first is the window between insert and the actual work. If you insert the event ID, then crash before doing the work, the next retry sees the event ID and skips. The work is lost.

The fix is to wrap the insert and the work in the same database transaction. If the work fails, the transaction rolls back, the event ID is gone, and the retry will process it. If the work succeeds, both the event ID and the work commit together.

await db.transaction(async (tx) => {
  const result = await tx.query(
    `INSERT INTO processed_webhook_events (event_id, event_type)
     VALUES ($1, $2)
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [event.id, event.type]
  );

  if (result.rowCount === 0) return; // already processed

  await handleEvent(tx, event);
});

The second limitation is that this only protects against duplicate event IDs. If Stripe sends two genuinely different events for the same underlying state change (which can happen, especially with subscription lifecycle events), each event has its own ID and both will be processed. Your handler has to be idempotent at the business level too. Updating a subscription to active when it is already active should be a no-op, not an error.

The Reordering Problem

Webhooks do not arrive in order. The Stripe docs say this in a sentence and most readers skip past it. Then they ship a handler that assumes the latest event reflects the latest state, and they get strange bugs that take days to track down.

The clearest version of this problem is subscription lifecycle. A user upgrades from plan A to plan B. Stripe emits a customer.subscription.updated event. Two seconds later they downgrade to plan C. Another customer.subscription.updated event. Both events sit in Stripe's delivery queue. The second one arrives first. Your handler sets the user's plan to C. Then the first event arrives and your handler sets the plan back to B. The user is on B in your database and C in Stripe.

The fix is to never trust the event payload as the source of truth for current state. The event tells you something happened. The current state lives on the object in Stripe. For anything that matters (subscription state, customer state, invoice state), refetch the object from Stripe by ID and use that response.

async function handleSubscriptionUpdated(event: Stripe.Event) {
  const subscriptionId = (event.data.object as Stripe.Subscription).id;

  // do not trust the event payload, refetch
  const subscription = await stripe.subscriptions.retrieve(subscriptionId);

  await upsertSubscriptionState(subscription);
}

This costs you an extra API call per webhook. It is worth it. The extra call returns the canonical current state, regardless of which event arrived in which order. Your database converges to the right state even when the events arrive in the wrong order.

There is a more subtle version of this problem for objects that change rapidly. If a subscription is updated five times in a second, you might get five webhooks and each refetch returns whatever the state is at the time you ask, which might be the same value for all five. That is fine. The end state is correct, which is the only thing that matters.

For events about objects you cannot refetch (like payment_intent.succeeded after the intent has been processed), you store the relevant fields from the event and accept that you might overwrite with stale data. The fix is to compare timestamps or version fields if Stripe provides them, and only apply the update if it is newer.

Respond Fast, Process Later

Stripe's timeout for webhooks is approximately 30 seconds, but the practical timeout you should target is 5 seconds or less. The reason is not Stripe's patience. It is your retry behaviour. If your handler takes 8 seconds on a good day and 12 seconds on a slow day, you are going to time out on the slow days, get retried, and double-process events. The solution is to do as little work as possible in the handler itself and offload the rest.

The pattern is:

Verify the signature.
Insert the event into a local queue table or a real message queue.
Return 200 immediately.
Process the queued event asynchronously in a worker.

export async function POST(req: Request) {
  const signature = req.headers.get('stripe-signature');
  const rawBody = await req.text();

  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(rawBody, signature!, webhookSecret);
  } catch {
    return new Response('Invalid signature', { status: 400 });
  }

  await db.query(
    `INSERT INTO webhook_events (event_id, event_type, payload, status)
     VALUES ($1, $2, $3, 'pending')
     ON CONFLICT (event_id) DO NOTHING`,
    [event.id, event.type, JSON.stringify(event)]
  );

  return new Response('ok', { status: 200 });
}

A separate worker reads from webhook_events, processes each event, and updates the status. This decouples the speed of your processing from the speed of your acknowledgement. Your webhook endpoint becomes a fast, dumb, almost-impossible-to-break inbox.

The trade-off is that you now have a second system to operate. The worker needs to handle failures, retries, and dead-letter cases. If you do not have a queueing system already, the cheapest version is a polling worker that selects pending events and processes them one at a time. It is not elegant. It works.

For a more complete look at queue selection for solo developers, the background jobs and observability question is the broader topic this pattern sits inside.

Race Conditions With Your Own Application

The most insidious bugs come from webhooks racing with user actions. A user completes checkout in their browser. Your app reads success_url and tries to read the subscription state to show them their new plan. The webhook has not arrived yet. Your app reads stale state. The user sees the old plan.

There are three approaches to this, in increasing order of robustness.

The first is to poll. After the user lands on the success page, your frontend polls your backend for a few seconds checking whether the subscription is active. The webhook usually arrives within a second or two. The polling stops as soon as it does. This is ugly but it works for most products.

The second is to fetch synchronously. When the user lands on the success page, your backend hits Stripe directly to fetch the current subscription state and write it to your database before responding. The webhook still arrives and is idempotent, but you do not depend on it for the immediate UX. This costs you an extra Stripe API call per checkout but eliminates the race entirely.

The third is to make the synchronous fetch and the webhook converge on the same code path. Your success_url handler calls the same function the webhook would call, passing the subscription ID. The function refetches state from Stripe and upserts. Whichever one runs first wins. The other is a no-op. This is the cleanest answer and it generalises beyond checkout to any user flow that depends on Stripe state.

async function reconcileSubscription(subscriptionId: string) {
  const subscription = await stripe.subscriptions.retrieve(subscriptionId);
  await upsertSubscriptionState(subscription);
}

// from the webhook
await reconcileSubscription(event.data.object.id);

// from the success_url handler
await reconcileSubscription(session.subscription);

The function is idempotent. It can be called from either path with no coordination. The race disappears.

What To Listen For And What To Ignore

Stripe sends many event types. Most products only care about a handful. Listening for events you do not handle is a small cost (you have to verify signatures and ignore them) but a bigger cognitive cost (the events show up in logs and confuse you when you debug).

For a typical SaaS with subscriptions, the events that matter are:

checkout.session.completed: a user completed a hosted checkout. Use this to provision their account.
customer.subscription.created, customer.subscription.updated, customer.subscription.deleted: the canonical subscription lifecycle. Refetch the subscription on every event.
invoice.paid: a recurring invoice was paid. Use this to extend the user's access.
invoice.payment_failed: a recurring invoice failed. Use this to flag the user for dunning, suspend access, or send a payment update email.
customer.subscription.trial_will_end: three days before a trial expires. Useful for sending warning emails.

If you are doing one-time payments, you care about payment_intent.succeeded and payment_intent.payment_failed. If you are doing usage-based billing, you care about invoice.upcoming so you can preview the next bill.

Everything else, ignore unless you have a specific reason to care. Stripe has more than 200 event types. Most are diagnostic or only relevant for specific products (Connect, Issuing, Terminal). Listening to all of them is a recipe for noise.

Configure the endpoint in Stripe's dashboard to only send the events you handle. This reduces the volume hitting your endpoint and reduces the surface area of what can go wrong.

Testing Webhooks Without Losing Your Mind

Local development with webhooks used to be miserable. The Stripe CLI fixed most of it. You run stripe listen --forward-to localhost:3000/api/webhooks/stripe and it forwards real Stripe events from your test account to your local server with a temporary webhook secret. You can also trigger specific events with stripe trigger checkout.session.completed for testing handlers in isolation.

The trigger command is what most people miss. You do not have to manually create checkouts and subscriptions to test every handler. Stripe ships a list of common scenarios you can fire with one command. This makes integration testing tractable.

For unit tests, the Stripe Node SDK exposes the same constructEvent function. You can build a fake event payload, sign it with a test secret, and run your handler against it. This is fast and reliable. The only thing you cannot easily simulate locally is the order in which events arrive, but you can build that into your tests by deliberately calling your handlers out of order and confirming the end state is correct.

For end-to-end tests against the live (test mode) Stripe API, the trick is to use idempotency keys on every Stripe API call. This means a flaky test that retries does not double-charge the test customer. The idempotency key is a header on every Stripe API call; passing the same key with the same parameters returns the cached response instead of creating a new resource.

await stripe.customers.create(
  { email: 'test@example.com' },
  { idempotencyKey: `test-customer-${testRunId}` }
);

This is unrelated to webhook idempotency but worth mentioning because both protect against duplicate work, and people often have one and not the other.

What I Run In Production

The setup that has not bitten me in eighteen months is:

A single webhook endpoint that handles all event types. The endpoint verifies the signature, inserts the raw event into a webhook_events table with status = 'pending', and returns 200. Total work in the request handler: signature verification plus one insert.

A worker process that polls the webhook_events table every second, picks up pending events, and dispatches them to type-specific handlers. The worker uses SELECT ... FOR UPDATE SKIP LOCKED so multiple worker instances can run safely.

Type-specific handlers that refetch the relevant Stripe object before applying any state changes. No handler trusts the event payload as the source of truth for current state.

A retry policy in the worker that retries failed events up to five times with exponential backoff, then moves them to a dead-letter table that pages me if anything lands there. The dead-letter table has had four entries in eighteen months. Each was a genuine bug I needed to know about.

An idempotency check at the start of each handler, even though the queue table also has unique constraints. Belt and braces.

A reconcile function that can be called from both the webhook path and the user-facing checkout success path, so races between the two converge instead of conflicting.

A daily cron job that fetches all active subscriptions from Stripe and reconciles them against my database. This catches anything I missed: dropped events, edge cases, bugs in my own code. It runs at 3am and emails me a diff if it finds anything. In eighteen months it has caught exactly two real issues, both of which were my fault.

That last one is the thing most teams skip. Webhooks are a delivery mechanism, not a guarantee. Stripe themselves recommend periodic reconciliation against the API as the canonical source of truth. If your billing state matters (and if you are reading this, it does), you want a backstop that does not depend on every webhook firing correctly forever.

What I Would Tell You If You Asked

Most webhook bugs are not in the webhook code. They are in the assumption that webhooks are simple. They are not simple. They are a distributed system with at-least-once delivery, out-of-order events, and timeouts that turn correctness bugs into double-billing incidents.

If you have one weekend to ship a webhook integration that will survive contact with real users, do this:

Verify signatures. Insert events into a queue. Return 200 fast. Process from the queue with idempotent handlers that refetch state from Stripe. Add a daily reconciliation job. Wire up alerts on the dead-letter table.

That is it. Everything else is a refinement on top of that pattern. The pattern itself does not change between a side project doing $50 a month and a SaaS doing $50,000 a month. The volume changes. The architecture does not.

For the broader question of which auth provider sits in front of your billing flow, the auth comparison post covers the trade-offs that matter for a billing-heavy product. And if you are still picking your stack, the stop obsessing about the perfect stack post is the thing I should have read three projects ago.

Webhooks are one of the few areas where the boring, paranoid version of the integration is also the cheapest one to maintain. Build it boring. Build it paranoid. Sleep through your weekends.

Server-Sent Events vs WebSockets in 2026: When Each One Actually Wins

Alex Cloudstar — Fri, 15 May 2026 09:18:55 +0000

The last three real-time features I shipped were all built on Server-Sent Events. Two of them, I had originally planned to build on WebSockets. The third I had already half-built on WebSockets before I realised the WebSocket part was adding work and removing nothing. I ripped it out and the feature shipped a day earlier.

This is not a takedown of WebSockets. WebSockets are the right answer for plenty of things. The point is that the reflex to reach for WebSockets the moment someone says "real-time" is wrong more often than people admit, and the cost of that wrong choice is paid quietly for the entire lifetime of the feature.

Most of what people call "real-time" is one-way streaming from server to client. Notifications, live counters, progress bars, AI token streams, log tails, dashboard updates, presence indicators. None of these need the client to talk back over the same channel. They need the server to push, the client to listen, and a way to reconnect when the network blips. That shape is what SSE was designed for and what WebSockets are overbuilt for.

The decision is not "which one is better." The decision is "which one is the right shape for my data flow." Once you frame it that way, the answer is usually obvious.

What Each One Actually Is

Server-Sent Events is a one-way streaming protocol that runs over plain HTTP. The client opens a long-lived GET request. The server holds the connection open and writes chunks of text in a specific format. The browser parses each chunk into an event and fires it at an EventSource object. The connection is unidirectional: server pushes, client receives, that is the whole protocol.

data: {"type":"progress","percent":42}

data: {"type":"progress","percent":67}

data: {"type":"complete"}

That is literally the wire format. Two newlines separate events. Lines starting with data: are the payload. Other prefixes (event:, id:, retry:) add metadata. Browsers have been parsing this since 2011.

WebSockets is a bidirectional binary or text protocol that starts with an HTTP upgrade handshake and then runs over its own framing on top of TCP. After the handshake, the connection is no longer HTTP. It is a full-duplex pipe where either side can send messages at any time. The browser exposes it through new WebSocket(url). Most server frameworks have their own WebSocket library (ws for Node, gorilla/websocket for Go) because the protocol does not map cleanly onto the request-response model frameworks are built around.

The first big difference is the connection model. SSE runs on the same HTTP stack as everything else in your app. Your reverse proxy understands it. Your CDN can pass it through. Your load balancer routes it like any other request. WebSockets sit outside the normal HTTP request lifecycle. Most infrastructure handles them with a special path, special timeouts, and special sticky-session rules.

The second big difference is direction. SSE is server-to-client only. If the client needs to send something to the server, it makes a normal HTTP request. WebSockets are bidirectional from the start. If you do not need bidirectionality, that is a feature you are paying for and not using.

The Reflex To Reach For WebSockets

The reason most developers reach for WebSockets first is path dependency. The first big real-time tutorial people encountered (Socket.io chat apps, around 2015) used WebSockets. The mental model of "real-time = WebSockets" got cemented before SSE was widely known. Most React tutorials for live features still start with socket.io-client. Most Stack Overflow answers about "how do I push from server to client" answer with WebSockets even when the question describes a one-way flow.

This is a vibes-driven default, not an evaluated one. The honest version of the comparison for one-way flows is:

SSE: a single HTTP GET, your existing auth middleware works, your existing logging works, your CDN handles it, automatic reconnection is in the spec, the wire format is human-readable. No new dependencies.
WebSockets for the same one-way flow: a separate protocol upgrade, sticky-session rules in your load balancer, a separate auth path (cookies often do not work the same way), no automatic reconnection unless you write it, binary framing you have to debug with Wireshark or a browser DevTools panel that is less mature than the network panel.

For a one-way flow, every line of that comparison favours SSE. The only reason to pick WebSockets for a one-way flow is if you already have a WebSocket infrastructure for other reasons and adding SSE would be a second thing to maintain. That is a real reason. It is just rarer than the reflex suggests.

When WebSockets Actually Win

There are flows where SSE is the wrong shape and WebSockets are the right one. The clearest signal is bidirectional, low-latency, high-frequency message exchange. Specifically:

Multiplayer games and collaborative editing. The client sends actions every few milliseconds, the server merges and broadcasts, the client renders. Round-trip latency matters. The overhead of opening a new HTTP request per client action would crush you. WebSockets are the right shape.

Real-time voice or video signaling. The signaling layer for WebRTC negotiates connections through a server, and the negotiation is bidirectional and bursty. WebSockets are the standard tool. Once the WebRTC connection is established, the actual audio/video flows peer-to-peer over a different protocol entirely.

Chat applications with presence and typing indicators. Messages flow both ways. Presence updates flow both ways. Typing indicators are bursty bidirectional events. You can build this on long-polling or SSE plus POST, but it is more code and the latency is worse. WebSockets fit.

Low-latency trading interfaces, live auction bidding, multiplayer drawing tools. Anything where the round-trip latency between a client action and a server-broadcast response matters at the sub-100ms level.

The pattern across all of these is that the client and the server are in conversation. Not a server monologue with the client occasionally interrupting, an actual conversation. If your feature is a conversation, WebSockets earn their cost.

When SSE Actually Wins (Which Is Most Of The Time)

The flows that suit SSE are the ones where the server pushes and the client mostly listens. These cover more product features than people expect:

AI token streaming. The server generates a response from an LLM, streams tokens as they arrive, and the client renders them. The client does not need to interrupt in real time over the same channel. If the user wants to cancel, that is a separate HTTP request. This is the canonical SSE use case in 2026 and the reason the AI SDK in most frameworks defaults to SSE under the hood, not WebSockets. The AI SDK guide covers this in more detail for the framework-specific patterns.

Progress bars for long-running jobs. The user kicks off an import, a render, a deployment. The server reports progress every second or so until done. One-way push. SSE is the right answer.

Notifications. A bell icon updates when something happens. The client connects on page load. The server pushes events as they happen. The client renders. SSE.

Live counters and dashboards. Visitor counts, sales dashboards, system metrics, social media counters. The server has updates. The client wants them. No back-channel needed beyond normal HTTP requests when the user actually does something.

Log streaming. Tail a server log into a browser. Push lines as they appear. SSE.

Server-driven UI updates in collaborative tools, where one user changes something and other users need to see the change. The clients are the source of changes, but they communicate via normal HTTP POSTs to the server. The server fans the changes back out over SSE. The fan-out direction is one-way.

For all of these, the SSE version of the implementation is shorter, the operational footprint is smaller, and the failure modes are the standard HTTP ones your tooling already understands.

The Code Comparison That Made It Click For Me

The clearest way to see the difference is to look at the same feature built both ways. Take a progress bar that streams updates from a long-running job.

The SSE version, server side (Node, any HTTP framework):

export async function GET(req: Request) {
  const stream = new ReadableStream({
    async start(controller) {
      const encoder = new TextEncoder();

      for await (const update of jobUpdates(req)) {
        const data = `data: ${JSON.stringify(update)}\n\n`;
        controller.enqueue(encoder.encode(data));
      }

      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}

The client side:

const es = new EventSource('/api/job-progress');
es.onmessage = (event) => {
  const update = JSON.parse(event.data);
  setProgress(update.percent);
};

That is the whole feature. The browser handles reconnection automatically. If the network drops for ten seconds, the EventSource reopens and resumes. If the server crashes and restarts, the client reconnects without code.

The WebSocket version is longer in both directions. You set up a WebSocket route on the server (most frameworks need a separate adapter or middleware). The client opens a WebSocket. You send and receive JSON messages. You write reconnection logic because the WebSocket API does not include it. You handle the case where the server restarts and you need to re-authenticate.

const ws = new WebSocket(`wss://${location.host}/api/job-progress`);
ws.onmessage = (event) => {
  const update = JSON.parse(event.data);
  setProgress(update.percent);
};
ws.onclose = () => {
  // reconnect logic here, including backoff and re-auth
};

The reconnect block is the thing that gets you. Most tutorials skip it. Most production WebSocket clients re-implement it. The EventSource API includes it for free.

Auth, Which Is Where People Get Stuck

Auth on SSE works exactly like auth on any HTTP request. The browser sends cookies. Your middleware reads them. Your handler verifies. The connection is opened. This is the same code path as every other request in your app.

Auth on WebSockets is its own conversation. You can pass cookies during the handshake (which is technically an HTTP upgrade), but many WebSocket libraries do not expose the cookies cleanly to handler code. The common workaround is to send an auth token as the first message after the connection opens, which means there is a brief window where the connection is open but unauthenticated. You have to handle that window correctly or you have a security bug. People get this wrong.

The other workaround is to put the token in the URL as a query parameter, which works but leaks the token into server access logs and is generally a bad idea.

For SSE, you do nothing. The cookie just works. This is one of those quiet wins that does not show up in feature comparisons but saves you a half-day of debugging the first time you try it.

The Connection Limit Trap

There is one specific gotcha with SSE that catches people: HTTP/1.1 browsers limit concurrent connections per origin to six. If your user has six tabs open and each one has an SSE connection, the seventh tab cannot make any HTTP requests at all until one of the SSE connections closes. The browser is queuing requests behind the connection limit.

The fix is to use HTTP/2 or HTTP/3 in production. Both protocols multiplex many streams over a single connection, so the six-connection limit does not apply. Vercel, Cloudflare, Fastly, AWS CloudFront all serve HTTP/2 by default. If you are behind a proper edge layer, this is a non-issue.

If you are running SSE through a self-hosted reverse proxy on HTTP/1.1, this can bite you, and the symptom is "the seventh tab is broken." Always check your edge configuration. HTTP/2 fixes the entire class of problem.

WebSockets do not have this limit because they use a separate connection upgrade per socket, not a pool of HTTP requests. This is a real WebSocket advantage in the narrow case where you have to run on HTTP/1.1 for some reason, but in 2026 there is almost no reason you have to.

What About Long Polling

Long polling is the original "real-time over HTTP" pattern. The client opens a request. The server holds it open until there is data, then responds. The client immediately opens another request. The loop continues forever.

Long polling works, but in 2026 there is no reason to start with it. SSE is the modern version of the same idea, with a defined wire format, automatic reconnection, and ten lines less code on both sides. The only reason to reach for long polling is if you are stuck behind a proxy that strips streaming responses, which is rare and usually fixable.

I have not built a long-polling feature on purpose in five years. It is the kind of thing you only choose when something else is broken.

Frameworks Are Quietly Choosing For You

The framework you are building on probably has an opinion. Next.js, Astro, SvelteKit, and most modern React frameworks support SSE through their normal response APIs without any special configuration. You return a ReadableStream with the right headers and it works. The same frameworks need extra adapters or third-party libraries to support WebSockets, especially in serverless deploys.

This is not an accident. SSE fits the request-response model that frameworks are built around. The handler returns a response, the response is a stream, the stream stays open for a while. WebSockets break that model. They need a long-lived connection that is not tied to a single request, which most serverless platforms do not support cleanly.

If you are deploying to Vercel, Cloudflare Workers, AWS Lambda, or any other function-based platform, SSE works in their default execution model. WebSockets need either a workaround (Cloudflare Durable Objects, Lambda WebSocket API Gateway) or a separate long-lived server. The infrastructure cost of WebSockets in a serverless world is real.

For long-lived servers (Fly.io, Railway, Render, a VM you manage), WebSockets are fine. The infrastructure cost goes away. The cost is now operational: keeping a server alive for the duration of every connected client.

The platform-aware version of the decision: if you are on serverless, default to SSE unless you have a bidirectional flow that requires WebSockets. If you are on long-lived servers, the choice is more open and you can pick based on the data flow alone.

What I Run In Production

The pattern I have landed on for most real-time features:

The server uses SSE for any one-way flow: notifications, progress, AI streams, dashboards, logs. The handler returns a ReadableStream with Content-Type: text/event-stream. Auth runs through the normal middleware. Reconnection is the browser's problem.

The client uses EventSource for SSE, with a tiny wrapper that handles typed events and JSON parsing. The wrapper is about thirty lines of code and replaces a much larger WebSocket client library.

For the rare bidirectional features (a collaborative editor I built last year, a live drawing tool, an internal admin tool with two-way command-and-control), I reach for WebSockets and accept the cost. I run those on a long-lived Fly.io server, separate from the rest of the app. The separation is deliberate: keeping WebSocket complexity out of the main serverless app is worth one extra deployment target.

A small amount of glue connects the two. The SSE-served frontend can send commands to the server via normal HTTP POST. The server fans the commands out to the connected SSE clients. This pattern (POST in, SSE out) covers more "bidirectional" features than you would expect, because most of those features are actually one-way fan-out with occasional user-initiated commands.

The result is a real-time stack that is mostly boring HTTP, with a small WebSocket surface for the cases that genuinely need it. The boring HTTP part is the part that scales, deploys, and debugs cleanly. The WebSocket part is the part that needs care.

What I Would Tell You If You Asked

The question is not "which protocol is better." The question is "what shape is my data flow."

If your data flow is one-way from server to client, with the client only occasionally needing to send commands back, use SSE. Use POST for the commands. You will write less code, deploy on more platforms, and your auth will already work.

If your data flow is genuinely bidirectional, with the client and server in continuous conversation, use WebSockets. Accept that you are taking on a deployment burden and a new failure mode in exchange for the flexibility you actually need.

If you are not sure which one you have, look at how many messages per second the client sends versus the server. If it is one-to-many in the server's favour, you have a one-way flow with a back-channel. SSE. If it is more balanced, you have a conversation. WebSockets.

The mistake I made for years was treating "real-time" as a synonym for "WebSockets" and reaching for the heavier option by reflex. The mistake cost me real engineering time, real infrastructure complexity, and real debugging hours that I would not have spent if I had picked the simpler protocol. SSE is not new. It is just quiet, and quiet things lose to loud things in tutorials and frameworks even when they are the better tool.

The broader version of this lesson is that the framework defaults are usually right, and the boring HTTP-shaped answer is usually right, and the moment you find yourself adding a new dependency to solve a problem the platform already solved you should pause and check. The stop obsessing about the perfect stack post is the same instinct applied to a different question. The instinct generalises: pick the tool that matches the shape of the problem, not the tool that pattern-matched to the buzzword.

For real-time in 2026, the shape of most problems is one-way. The tool that matches is SSE. Use it without apology.

TypeScript at Scale: Why Your tsc Takes 90 Seconds and How to Fix It

Alex Cloudstar — Fri, 08 May 2026 08:41:54 +0000

The TypeScript codebase I inherited last year had a clean build time of 94 seconds. Incremental builds were 12 seconds on a good day. The editor would freeze for two or three seconds every time you hovered over a Zod schema. Nobody wrote new code without first opening their second monitor to scroll Twitter while the language server caught up.

It is now 11 seconds for a clean build, sub-second incremental, and the editor stays responsive. We did not move to Project Corsa. We did not switch to Bun. We did not split the repo. We deleted three patterns that were generating millions of redundant type instantiations and tightened a few tsconfig settings. The work took about a week.

Most TypeScript performance problems at scale are not "TypeScript is slow." They are "we are asking TypeScript to do something quadratic and it is doing it." This post is the diagnostic playbook for figuring out which thing your codebase is doing.

The First Question: Where Is the Time Going

Before tuning anything, get real numbers. The TypeScript compiler ships with two flags that turn the diagnostic question from "feels slow" into "spends 47% of its time in type checking step X."

npx tsc --extendedDiagnostics

The output gives you a breakdown: parse time, bind time, check time, emit time, total memory usage. If "Check time" dominates, your problem is in the type system. If "I/O Read time" or "Parse time" dominates, your problem is the size of what you are loading. These are very different problems with very different fixes.

The next flag is more targeted:

npx tsc --generateTrace ./trace

This drops a Chrome trace file into ./trace. Open it in chrome://tracing or https://ui.perfetto.dev. You get a flame graph of every file the compiler checked, how long each took, and what types it instantiated.

The pattern to look for is single files that take seconds. Healthy code generates a flame graph where most files complete in under 100ms and the long tail tops out somewhere around 500ms. A file that takes 5 seconds is a file with a type the compiler is struggling with. A file that takes 30 seconds is the file generating most of your build pain, and finding it is most of the work.

@typescript/analyze-trace is the tool that reads the trace and tells you what is hot:

npx @typescript/analyze-trace ./trace

It surfaces the worst-offending files, the deepest type instantiations, and the most expensive type aliases. The output is sometimes opaque, but the file names it gives you are almost always the right places to look.

The Patterns That Actually Cost You

In every slow codebase I have looked at, the cost concentrates in a small number of patterns. The patterns are recognizable once you know what to look for.

Deeply Nested Generic Inference

This is the most common offender, and it almost always lives in code that wraps a library with a generic helper.

function withRetry<T extends (...args: any[]) => Promise<any>>(
  fn: T,
  options: RetryOptions
): (...args: Parameters<T>) => Promise<Awaited<ReturnType<T>>> {
  // ...
}

const fetchUser = withRetry(api.users.fetch, { retries: 3 });

Looks fine. The cost shows up when you wrap something whose signature is itself heavily generic. If api.users.fetch returns a Drizzle query result, or a tRPC procedure, or a Zod-inferred type, the compiler has to expand all of those generics every time the wrapper is instantiated. If withRetry is used in 200 places across your codebase, the compiler does that work 200 times in every type check.

The fix is rarely to delete the wrapper. It is to break the chain of inference at strategic points. Instead of inferring Awaited<ReturnType<T>> deep inside the type, accept a simpler input type and let the user spell it out at the call site, or use a type assertion to terminate the inference.

Conditional Type Recursion in Hot Paths

type DeepReadonly<T> = {
  readonly [K in keyof T]: T[K] extends object
    ? T[K] extends Function
      ? T[K]
      : DeepReadonly<T[K]>
    : T[K];
};

A DeepReadonly over a small interface is fine. A DeepReadonly applied to your top-level state type, which contains your database row types, which reference your domain types, which contain unions of all your enums, is a recursive type explosion. The compiler will work through it, sometimes. Sometimes it gives up and emits any, silently. Either way it is slow.

The default position for recursive utility types should be: do not. If you find yourself reaching for DeepPartial, DeepReadonly, DeepKeys, or anything that walks an arbitrary tree, ask whether you actually need the type to be deep. Most of the time you need it to be one or two levels deep, which is a much cheaper type to write explicitly.

When you do need recursion, cap the depth:

type DeepReadonly<T, Depth extends number = 4> = Depth extends 0
  ? T
  : { readonly [K in keyof T]: DeepReadonly<T[K], Decrement<Depth>> };

This gives you the safety of a finite recursion at the cost of writing a numeric depth helper. The compiler can always finish.

Massive Discriminated Unions

A union with eight variants is fast. A union with 200 variants generated from a Zod schema or a code generator is slow. Every time you narrow the union with a discriminator, the compiler has to consider every variant and prove which ones are eliminated.

type Event =
  | { type: 'user.created'; payload: UserCreated }
  | { type: 'user.updated'; payload: UserUpdated }
  // ... 198 more
  ;

function handle(event: Event) {
  switch (event.type) {
    case 'user.created': return handleUserCreated(event.payload);
    // ...
  }
}

The narrowing inside the switch is where time goes. The compiler proves at each case statement which variants of the union are still possible. With 200 variants, that proof gets expensive. If handle is called from many places, and each call site re-checks the union, you can pay this cost thousands of times in a single type check.

Two fixes that usually work: split the union at module boundaries so any single function only deals with a subset, or convert the union into a record type keyed by the discriminator and look up the handler dynamically. The latter sacrifices exhaustiveness checking, which you can get back with a satisfies clause:

const handlers = {
  'user.created': handleUserCreated,
  'user.updated': handleUserUpdated,
  // ...
} satisfies Record<Event['type'], (payload: any) => void>;

function handle(event: Event) {
  return handlers[event.type](event.payload);
}

The compiler still verifies completeness on the satisfies, but the lookup at the call site is constant-time, not a union narrowing.

`as const` Object Literals With Heavy Inference

export const routes = {
  users: {
    list: '/users',
    detail: '/users/:id',
    create: '/users',
    update: '/users/:id',
  },
  posts: {
    // ...
  },
} as const;

type RouteKey = `${keyof typeof routes}.${keyof typeof routes[keyof typeof routes]}`;

The as const keeps the literal types, which is what you want. The template literal type at the bottom is what is expensive. It generates the cartesian product of all top-level keys and all nested keys, and TypeScript materializes the full set during type checking. For a route table with 50 sections and 5 routes each, you have a 250-element string union that has to be computed every time something references RouteKey.

The fix is to keep the inferred type but stop computing the joined string union at the type level. If you need to enumerate all routes, generate the list at runtime from the object and accept that you pay a tiny startup cost. If you need it at compile time for autocompletion, narrow the scope of the type so it only covers one section at a time.

Library-Caused Slowdown

Sometimes the slow file is not your code. It is node_modules/some-library/dist/index.d.ts. The trace will show this clearly. Common offenders historically have been older versions of typed-form libraries, validation libraries with very expressive types, and ORMs that try to type your entire schema.

The trace will tell you which library. The fix is usually one of: upgrade to a newer version that has fixed the issue, swap the library, or wrap the library at a thin module boundary so the heavy types do not leak into your call sites. The wrapping pattern works better than people expect: define a narrower internal type for the bits of the library you actually use, and import only that internal type from the rest of the codebase. The compiler stops re-checking the library's types every time you reference your internal type.

Project References, the Right Way

tsconfig project references are the thing everyone reaches for and rarely sets up correctly.

The promise of project references is that you split your codebase into smaller projects, each with its own tsconfig.json, and the compiler builds each project once and reuses the output. Incremental builds are dramatically faster because changing a leaf project does not invalidate the type checking of unaffected projects.

The catch is that project references require composite mode, which requires every referenced project to emit declaration files, which means every referenced project needs a real build output. This is fine for libraries. It is awkward for application code that historically just relied on tsc --noEmit for type checking and a separate bundler for output.

The setup that has worked for me on a Next.js + workspace setup:

apps/
  web/tsconfig.json
packages/
  domain/tsconfig.json
  database/tsconfig.json
  ui/tsconfig.json
tsconfig.base.json
tsconfig.json

The root tsconfig.json references each project:

{
  "files": [],
  "references": [
    { "path": "./packages/domain" },
    { "path": "./packages/database" },
    { "path": "./packages/ui" },
    { "path": "./apps/web" }
  ]
}

Each package has composite: true, declaration: true, and produces a .tsbuildinfo file. The first build is roughly the same speed as before. The second build is dramatically faster because unchanged packages are skipped entirely.

The mistake to avoid: do not split into projects until you have profiled and have a real reason. A small codebase with project references is slower than the same codebase without, because the overhead of the build orchestration outweighs the savings. The crossover point is usually somewhere around 50,000 lines of TypeScript or three to four logical domains that change independently.

For Astro, SvelteKit, and Next.js apps specifically, the project reference setup interacts with the framework's own type generation. Read the framework's docs before assuming the standard setup will work; they often have specific guidance.

Compiler Settings That Matter for Speed

A handful of tsconfig options have a direct performance impact. Most of the others do not, regardless of what online guides claim.

skipLibCheck: true. This is the single highest-impact setting for most codebases. It tells the compiler not to type-check your node_modules. The downside is that a broken type declaration in a dependency will not be caught at type-check time. The upside is that you stop doing redundant work for hundreds of dependencies. Almost every production codebase should have this on. Library authors who publish types should have it off in their own builds and on in their consumers' builds.

incremental: true with a tsBuildInfoFile. This caches the type-check graph between runs. Even on a single project (no references), this halves the time of subsequent runs because most files have not changed.

isolatedModules: true. Required if you are using a separate bundler for emit (which you almost certainly are in 2026 with Vite, Bun, esbuild, Turbopack, or any of the others). Forces you to write code that can be compiled file-by-file without cross-file type information. Slightly more restrictive but enables the bundler to skip work the compiler would otherwise have to redo.

moduleResolution: "bundler". The newer resolution mode introduced in TypeScript 5.0. Faster than node16 for most setups because it skips some of the legacy behavior. Use it if your bundler is newer than 2023.

noUncheckedIndexedAccess: true. Not a performance setting, but worth mentioning because people assume it slows things down. It does not. It changes the inferred type of array index access from T to T | undefined. Pure type-system change, no impact on check time.

The compiler options that do not matter for speed despite the rumors: strict, noImplicitAny, strictNullChecks, exactOptionalPropertyTypes. Turning these off does not measurably speed up type checking. They affect what gets reported, not how much work the compiler does.

Editor Performance Is a Different Problem

The TypeScript language server is what your editor uses for autocomplete, hover info, go-to-definition, and inline errors. It runs the same compiler as tsc but with different priorities: it tries to give you fast partial answers rather than complete answers.

When the editor feels slow, the tsc benchmark does not always reflect it. The language server has its own performance characteristics. The diagnostic for editor performance is to open the TypeScript: Open TS Server log command in VS Code (or your editor's equivalent) and watch what it is doing. You will see entries like:

Info 1234 [10:31:42.123] getQuickInfoAtPosition: 4823.4ms

A getQuickInfoAtPosition taking five seconds means the type at the position you hovered is genuinely that expensive to compute. The hot path in the compiler for hovers is type display, and large inferred types (especially from generic libraries) can blow up at display time even when type checking them is fast.

Two specific editor optimizations that help:

Memory limit: 8192 (or higher). The default language server memory limit is 3GB. Codebases with very rich types blow past this and the language server starts garbage collecting aggressively, which feels like lag. Bumping the limit in your editor settings is free if you have the RAM.

Disable inlay hints in the files where they are slow. Inlay hints (the inferred parameter types and return types shown in the editor) require the language server to compute every type for display. In files with heavy generics, this is the single most expensive operation. Most editors let you disable specific inlay hint categories. Turning off "All inlay hints" on a heavy file is a quality-of-life win even if you keep them on globally.

If you are running Cursor, Zed, or any of the AI-augmented IDEs from the IDE comparison post, the language server runs the same way. The AI features are layered on top, but the underlying TypeScript performance is the language server's responsibility, and the same diagnostics apply.

What Project Corsa Changes, and What It Does Not

The Go-based TypeScript compiler (Project Corsa) is the largest single performance change to the language since it shipped. The headline numbers are real: 10x faster on most codebases, sometimes more on codebases that are I/O bound.

What it does not change is the type system. A codebase with quadratic type-instantiation patterns will still have quadratic type-instantiation patterns under Corsa. The 10x speedup compounds: a 90-second build becomes 9 seconds, but a 9-minute build becomes 54 seconds, which is still slow. If your codebase is generating millions of redundant type instantiations, fixing those patterns is still worth doing. Corsa makes the existing work faster; it does not make the work go away.

For most codebases, the incremental version of Corsa lands as a drop-in replacement for tsc and the language server. The migration is small. The wins are large. It is worth doing as soon as it is stable for your version of TypeScript. It is not worth waiting for if your build is currently slow; the patterns described above will pay off both before and after Corsa lands.

A Concrete Diagnostic Loop

If your build is slow and you do not know why, here is the order of operations that almost always isolates the problem.

Start with npx tsc --extendedDiagnostics and capture the timings. Save the output. You will compare against this later.

Run npx tsc --generateTrace ./trace and npx @typescript/analyze-trace ./trace. The output will list the hottest files. Pick the top three.

Open each of the hot files. Look at the imports first. The expensive types usually come in through an import. Note any types from libraries that look complex (Zod, Drizzle, tRPC, anything with deep generics).

Search for usages of those types in the file. Find any place where a generic is being inferred deeply or a conditional type is being recursively expanded. These are your candidates for surgery.

Try the fixes one at a time. After each, re-run tsc --extendedDiagnostics and compare against the baseline. You want to see the check time drop. If it does not, revert and try the next thing.

The reason for one-at-a-time changes is that some "fixes" make things worse, and a batched change hides which one helped and which one hurt. The diagnostic is fast enough that the patience pays off.

Once the hot files are no longer hot, run the trace again. New hot files will surface as the previous ones fall down the list. Stop when the worst file is in a range you are happy with, usually 200ms or less for a single file.

The whole loop is a day or two of focused work for most codebases. The win is permanent unless someone reintroduces the same patterns, which is why a tsc --extendedDiagnostics check in CI as a regression guardrail is worth considering.

What I Would Tell You If You Asked

If you have a slow TypeScript codebase and limited time, the highest-leverage thing you can do is generate a trace and read it. Most teams skip this and try fixes blind. The fixes work some of the time, but the trace tells you exactly where to look, and the work after that is usually small.

The second highest-leverage thing is skipLibCheck: true, if you do not already have it. The savings are immediate. The downside is rarely material.

The third is to cap any recursive utility types you have introduced and to push deeply inferred generic helpers to terminate inference earlier. These are pattern-level changes, not config tweaks, and they require reading the trace to know which patterns matter for your codebase.

What I would not do: rewrite to a different language or framework hoping the performance will be better. Bun, Deno, and esbuild are faster at the bundling and parsing parts, but the type checking is still TypeScript's compiler doing TypeScript's compiler work. The gains from tooling come from building, not type-checking. You can ship faster builds with a faster bundler and still have a 90-second tsc because nothing about the bundler changed how the type system works.

The honest summary: TypeScript at scale is fast enough if you do not do the expensive things, and slow if you do. The expensive things are knowable and the fixes are not exotic. The work is figuring out which of them your codebase is doing, which is what the trace is for.

For the broader picture of where TypeScript is heading, the Project Corsa post covers what is coming. For a related performance angle on running TypeScript without a build step at all, the type-stripping post is useful. Both are about reducing the work the toolchain has to do. This post is about reducing the work the type system has to do, which is the part you control directly even before any new compiler ships.

Passkeys in Production: What I Wish I Knew Before Replacing Passwords

Alex Cloudstar — Fri, 08 May 2026 08:41:53 +0000

The first passkey login I shipped to real users worked perfectly for forty minutes. Then the support tickets started.

A user with a personal MacBook and a work Windows laptop could not figure out why his iPhone passkey was not showing up on the Windows machine. A second user had set up a passkey on her phone, lost the phone in a taxi, and now could not get into her account because we had quietly deleted her password fallback when she enrolled. A third user was on a corporate-managed Chrome that had WebAuthn policy-locked to platform authenticators only, but our flow assumed roaming authenticators would always be offered.

None of these are bugs in WebAuthn. They are the gap between "passkeys work" as a protocol statement and "passkeys work for the actual humans using your product." Most articles on this topic stop at the first half. This one is about the second half, the part you only learn by shipping.

What Passkeys Actually Are, Stripped of Marketing

A passkey is a WebAuthn credential where the private key lives in something the user trusts (their device, their password manager, their security key) and the public key lives on your server. Authentication is a signature challenge. Your server sends a random nonce, the authenticator signs it with the private key, you verify the signature against the public key you stored at registration.

That much has been true since WebAuthn level 1 in 2019. What changed in 2022 and shipped broadly through 2024 and 2025 is the sync part. Apple, Google, and Microsoft started syncing WebAuthn credentials across devices through their cloud accounts. Then 1Password, Bitwarden, and Dashlane started doing the same across platforms. The credential is no longer locked to a single device.

The user-facing pitch is "no more passwords, no more phishing, your account is just there on every device you trust." The pitch is mostly true. The mostly part is where the work is.

Three things to internalize before writing any registration code:

A passkey is bound to a relying party ID, which is your domain. Cross-domain passkeys do not exist. A passkey for app.example.com cannot be used on example.com unless you set the RP ID to the parent domain at registration time. You make this choice once and you live with it.
A user can have many passkeys. They will. Treat the credential as the primary key for authentication, not the user. One user, many credentials, with metadata on each one (device label, last used, transport types).
The authenticator decides what is possible. Some authenticators are platform-bound (Touch ID without iCloud Keychain). Some are roaming (YubiKey). Some are syncing (iCloud Keychain, 1Password). Your code asks for what you want and the browser tells you what you got. You design around the answer, not around your assumptions.

The Protocol in One Page

Registration is a four-step dance. The browser API is navigator.credentials.create() with a publicKey options object. You generate the options on the server, send them down, the browser creates the credential, you send the attestation back, you verify and store.

// Server: generate registration options

const options = await generateRegistrationOptions({
  rpName: 'Example',
  rpID: 'example.com',
  userID: new TextEncoder().encode(user.id),
  userName: user.email,
  attestationType: 'none',
  excludeCredentials: existingCredentials.map((c) => ({
    id: c.credentialId,
    transports: c.transports,
  })),
  authenticatorSelection: {
    residentKey: 'preferred',
    userVerification: 'preferred',
    authenticatorAttachment: undefined,
  },
});

await sessionStore.set(session.id, { challenge: options.challenge });
return options;

Three knobs in that block matter more than they look:

attestationType: 'none' is the default for consumer apps. Anything else asks the authenticator to prove what it is, which is useful for regulated environments and a privacy concern for everyone else. Most consumer flows do not need it.
residentKey: 'preferred' asks for a discoverable credential, which is what makes the "click sign in and just be signed in" flow work without typing a username. The browser respects the preference but does not always honor it. You handle both cases on login.
authenticatorAttachment: undefined means the user can pick a platform authenticator (Touch ID, Windows Hello) or a roaming one (security key, phone). Locking this to platform will exclude users who want their YubiKey. Locking to cross-platform will exclude users who want Face ID. Leaving it open is almost always right.

// Server: generate authentication options

const options = await generateAuthenticationOptions({
  rpID: 'example.com',
  userVerification: 'preferred',
  allowCredentials: undefined, // empty for discoverable credential flow
});

await sessionStore.set(session.id, { challenge: options.challenge });
return options;

Leaving allowCredentials empty triggers the discoverable credential flow: the browser shows the user every passkey they have for your domain, they pick one, and you find out which user it is from the credential ID after the assertion. This is the flow you want. The alternative, asking the user for their username first and then sending the list of credentials they own, is fine for sign-in form layouts but gives up the magic.

The verification step on the server is where you check the signature, the challenge match, the origin, the RP ID hash, and the signature counter (if the authenticator increments one). @simplewebauthn/server handles all of that. You hand it the response, the expected challenge from the session, and your domain, and it tells you whether to trust this assertion.

Most of the protocol-level work is solved by the SimpleWebAuthn library on Node.js, webauthn-rs on Rust, and platform-specific equivalents on Go and Python. Writing it yourself in 2026 is not a sign of seriousness. It is a sign of not having read the spec carefully enough to notice how many ways there are to subtly miscount bytes when parsing the authenticator data.

The Account Model You Actually Need

The schema for storing passkeys is small but easy to get wrong. The shape that has held up for me across three production rollouts:

type User = {
  id: string;
  email: string;
  emailVerifiedAt: Date | null;
  createdAt: Date;
};

type Credential = {
  id: string;                 // your primary key
  userId: string;             // foreign key
  credentialId: Uint8Array;   // WebAuthn credential ID
  publicKey: Uint8Array;      // COSE-encoded public key
  signatureCounter: number;
  transports: AuthenticatorTransport[];
  deviceLabel: string;        // user-editable
  lastUsedAt: Date;
  createdAt: Date;
  backupEligible: boolean;
  backupState: boolean;
};

Two fields people skip and regret: backupEligible and backupState. These come from flags on the authenticator data and they tell you whether the credential is syncing across the user's devices. A credential that is backupEligible: true, backupState: true is a credential that exists in iCloud Keychain or 1Password or similar. If the user loses their phone, that credential is still recoverable. A credential with backupEligible: false is locked to one device. If that device dies, the credential dies with it.

You do not show these flags to the user as raw booleans. You use them to decide what to tell the user about recovery. A user who has only single-device credentials needs more aggressive prompting to add a second factor or set up recovery. A user with synced credentials is in much better shape.

The transports array is what makes the autofill UI on the next device work. A credential created on an iPhone reports ['internal', 'hybrid']. The hybrid transport is what enables QR-code-mediated cross-device auth where the user scans a code on a desktop with their phone to log in. Storing transports correctly and passing them back in excludeCredentials and allowCredentials makes the browser surface the right options at the right moments.

The deviceLabel field exists because users will end up with five or six credentials and need to be able to tell them apart. "iPhone 15 Pro," "Work MacBook," "1Password," "YubiKey 5C." The browser does not give you a clean device name on registration. You ask the user. A small text input at the end of the registration flow with a sensible default like "Device added on May 8, 2026" is enough.

The Recovery Problem

Here is the part most demos skip. Passkeys without a recovery story are worse than passwords, because at least passwords have email-based reset flows that everyone understands.

The mental model that has worked: a user account needs at least two ways back in, and they need to be independent failure modes. If both of your recovery methods require the user's phone, losing the phone takes the user out of the account permanently. That is a churn event and, for some applications, a regulatory issue.

The recovery options worth combining:

A second passkey, registered on a different authenticator. "Add another device" is the clean version of this. The phone is one credential, the password manager is another, the laptop's platform authenticator is a third.
An emailed magic link. Cheap, familiar to users, and works as long as email is accessible. The downside is that it makes your account security exactly as good as the user's email security, which is a known weak link. For a consumer product this is usually acceptable. For a financial product it is not.
A printed or shown-once recovery code. A 16-character string the user is told to save somewhere. Most users will not save it. The ones who will are exactly the users you want to keep.
Identity verification through a third-party service. KYC providers can re-verify the user against their original ID. Expensive and slow. Use this for high-value accounts.

The pattern that holds up: at registration time, push the user to set up a second method before they finish onboarding. If they bail, mark the account as having weak recovery and show a banner on every login until they fix it. The friction is worth it. The cost of supporting "I lost my only passkey" tickets is high and the resolution is often "the user creates a new account and we lose their data."

The other thing to do at registration time: do not delete the password if the user has one. Add the passkey alongside, mark passkeys as preferred, and offer to remove the password later once the user has multiple working passkeys. A common rollout mistake is treating passkey registration as a one-way migration. It should be additive. The password becomes a fallback. Once the user has confirmed they can log in with their passkey on every device they use, you can offer to remove the password. Never remove it without an explicit user action.

The Cross-Device Reality

The hardest part of shipping passkeys is not writing the code. It is reasoning about what happens when a user sits down at a device that does not have their credential.

The clean cases:

iPhone user opens Safari on their iPhone or Mac signed into the same iCloud account. The credential syncs. Login works.
1Password user with the browser extension installed and unlocked. The credential is in 1Password. The extension intercepts the WebAuthn ceremony. Login works.
Android user with Google Password Manager and Chrome signed in. The credential syncs across their Android devices and Chrome on desktop. Login works.

The messy cases:

Mac user logs in on a Windows laptop. iCloud Keychain does not exist on Windows. The user needs to use the cross-device flow: the browser shows a QR code, the user scans it with their iPhone, the iPhone authenticates over Bluetooth, and the desktop receives the assertion through a relay server. This works but it is not obvious to users. The first time they see the QR code they assume something is broken.
A user with credentials only in their work device's platform authenticator goes home and tries to log in on their personal laptop. Same QR code flow needed. If their work device is in their pocket, it works. If they left it at the office, they are locked out unless they have a second method.
A user on a corporate-managed device where IT has disabled cross-device authentication. The QR code flow does not appear. The user can only log in if they have a credential on this specific device. Your support team will see this case more than you expect.
A user whose password manager is locked. 1Password and Bitwarden need to be unlocked before they can serve a passkey. If the user just opened their browser, the autofill prompt may not show their saved passkeys until they manually unlock their password manager. This is confusing and looks like the passkey is missing.

The pattern that helps: never assume a login attempt is final. Always offer at least two paths on the login page. "Sign in with passkey" and "Email me a sign-in link" side by side. The passkey path covers most cases. The email path covers the user who is on a new device, locked password manager, or weird policy environment. Forcing users into a single path is where the support tickets come from.

The other thing that helps: explicit copy. When the QR code flow triggers, do not just show the QR code. Tell the user "Use your phone to scan this code and approve the sign-in." Most users have never seen a WebAuthn cross-device flow and need a sentence to recognize what is happening.

What Breaks in the Wild

A list of real failures from real production rollouts. None of these are exotic.

Safari and the third-party cookie blocker. Safari's privacy mode in some configurations blocks the storage that holds the WebAuthn challenge if you store it in a cookie scoped wrong. If you are seeing intermittent challenge mismatch errors specifically on Safari, check that your session cookie has SameSite=Lax and is not getting blocked by intelligent tracking prevention. Storing the challenge server-side keyed by session ID dodges this entirely.

Subdomain credential split. A user registers a passkey on app.example.com because that is what the browser was on at the time. They later try to log in on example.com. The credential does not show up because the RP ID does not match. Fix: pick one canonical RP ID at the start, usually the registrable domain (example.com), and use it everywhere. Migrating later is painful.

Counter rollback. Some authenticators (notably some old YubiKeys) increment the signature counter on each authentication. Some (most platform authenticators today) do not, and the counter stays at zero. Your verification logic should accept both. A naive "counter must always increase" check rejects platform authenticator users intermittently.

The exclude list explosion. excludeCredentials is meant to prevent the user from registering the same authenticator twice. If a user has 12 credentials, you send 12 entries in the exclude list. Some authenticators handle this poorly and time out. Cap the exclude list at the user's most recently used credentials, or skip it entirely and dedupe on the server when you receive the registration response.

Resident key promises broken. You ask for residentKey: 'required' because you want discoverable credential flows. The user's authenticator does not support it. The browser silently registers a non-discoverable credential. The user's next login does not show their passkey in the autofill prompt because the credential is not discoverable. Fix: check the response's authenticatorAttachment and credentialDeviceType to see what you actually got, and surface a warning if the flow you wanted is not what was created.

Email-as-username collision with discoverable credentials. You designed your sign-in page to ask for an email first, then offer a passkey. Discoverable credential flow is a button labeled "Sign in with passkey" that bypasses the email entry. New users who open your sign-in page see two options and pick the wrong one. The fix is to combine: show the passkey button up front, and below it, the email input for users who do not have a passkey or want the magic-link path.

The Code That Holds Up

What I have ended up with after a few rounds of iteration, on the server side:

import {
  generateRegistrationOptions,
  verifyRegistrationResponse,
  generateAuthenticationOptions,
  verifyAuthenticationResponse,
} from '@simplewebauthn/server';

const RP = {
  id: process.env.WEBAUTHN_RP_ID!,
  name: process.env.WEBAUTHN_RP_NAME!,
  origin: process.env.WEBAUTHN_ORIGIN!,
};

export async function startRegistration(user: User) {
  const credentials = await db.credentials.findByUserId(user.id);
  const options = await generateRegistrationOptions({
    rpName: RP.name,
    rpID: RP.id,
    userID: new TextEncoder().encode(user.id),
    userName: user.email,
    attestationType: 'none',
    excludeCredentials: credentials.slice(0, 10).map((c) => ({
      id: c.credentialId,
      transports: c.transports,
    })),
    authenticatorSelection: {
      residentKey: 'preferred',
      userVerification: 'preferred',
    },
  });
  await sessionChallenges.set(user.id, options.challenge, { ttl: 300 });
  return options;
}

export async function finishRegistration(user: User, response: RegistrationResponseJSON, label: string) {
  const expectedChallenge = await sessionChallenges.get(user.id);
  if (!expectedChallenge) throw new Error('challenge expired');

  const verification = await verifyRegistrationResponse({
    response,
    expectedChallenge,
    expectedOrigin: RP.origin,
    expectedRPID: RP.id,
  });

  if (!verification.verified || !verification.registrationInfo) {
    throw new Error('registration failed');
  }

  const info = verification.registrationInfo;
  await db.credentials.create({
    userId: user.id,
    credentialId: info.credential.id,
    publicKey: info.credential.publicKey,
    signatureCounter: info.credential.counter,
    transports: response.response.transports ?? [],
    deviceLabel: label || `Device added ${new Date().toLocaleDateString()}`,
    backupEligible: info.credentialBackedUp,
    backupState: info.credentialBackedUp,
  });

  await sessionChallenges.delete(user.id);
}

The login side is the same shape with generateAuthenticationOptions and verifyAuthenticationResponse. The thing worth noting is that on a discoverable credential flow, you do not know which user is logging in until after you verify the assertion. So you look up the credential by credentialId first, then load the user, then verify. The order matters because verification needs the public key that belongs to that credential.

The session challenge storage is the unsexy part that is worth getting right. A short-lived TTL (five minutes is plenty) keyed by something stable for the request, and never reused. Reusing a challenge breaks the security model entirely. If you are tempted to write your own challenge storage, use Redis or your existing session store and move on.

For the broader auth library question of whether to build this yourself or pick a service like Clerk, Auth0, or Better Auth, the auth library comparison is worth reading. Most of the hosted providers now offer passkey support out of the box, with the same recovery and cross-device subtleties handled for you. The decision is the standard one: build for control and customization, buy for speed and offloaded support burden.

The Browser Compatibility Floor in 2026

A short matrix of where things actually work as of mid-2026:

Safari 17+ supports passkeys, syncs through iCloud Keychain, supports the cross-device hybrid transport.
Chrome 125+ supports passkeys on macOS, Windows, Linux, ChromeOS, and Android. Google Password Manager syncs across signed-in devices.
Firefox 122+ supports the WebAuthn API but does not sync credentials itself. It defers to the OS-level platform authenticator on macOS and Windows. On Linux, the user's experience depends on whether they have a hardware authenticator plugged in.
Edge follows Chrome.
Mobile browsers all defer to the OS authenticator. iOS Safari uses iCloud Keychain. Android Chrome uses Google Password Manager. Both work well.

Conditional UI (the autofill prompt that shows passkeys without the user clicking anything) requires the page to call navigator.credentials.get() with mediation: 'conditional' and an <input autocomplete="username webauthn">. This works in Safari 16+, Chrome 108+, and Firefox 119+. The user experience is excellent when it lands. The fallback to a clicked button needs to exist for browsers that do not support it.

The compatibility story is in a much better place than it was even a year ago. The remaining gap is configuration, not capability. Corporate-managed environments are still where things break, and the gap between what the spec allows and what enterprise IT permits is the gap your support tickets will live in.

What I Would Tell My Past Self

Three things that would have saved me significant time on the first rollout.

The recovery story is the product. Spend more time on it than on the registration flow. Most engineering attention goes to "how do we make registration smooth" and not enough goes to "what happens when the user calls support saying their phone fell in a lake." The second one is what determines whether passkeys are a net win for your users or a way for them to get locked out.

Add passkey support without removing passwords first. Treat passwords as a legacy fallback, not a problem to eliminate. Letting users opt in incrementally and confirming their passkeys work across all their devices before any cleanup means the rollback path stays open. Removing passwords prematurely is how you generate a churn event.

Test on a corporate-managed Windows laptop. The flows that are smooth on a personal MacBook with iCloud Keychain are not necessarily smooth on a managed Windows device with a third-party password manager. The only way to know is to try, and ideally to ship a beta to a population that includes those users before you flip the default.

Passkeys are better than passwords for users who already have a sync mechanism set up. They are an improvement for users with one device. They are a regression for users you push into them without giving them a working recovery story. The technology is solid. The product work around it is where the wins and losses are.

If you are building auth from scratch in 2026 and want to skip most of this, the auth library comparison is the honest version of which providers handle the messy parts well. If you are extending an existing auth system, the SimpleWebAuthn library plus the schema above will get you to a working passkey flow in a week. Getting it to a flow that does not generate support tickets takes longer, and the difference is mostly the work described in this post.

The protocol is solved. The product is not. That is the gap worth budgeting for.

JavaScript Async Lifetimes: The Leak You Have and Probably Do Not Know About

Alex Cloudstar — Thu, 07 May 2026 08:28:47 +0000

Here is a production bug I have seen three times now, in three different codebases, written by three developers who all considered themselves experienced with async JavaScript.

A route handler fires three parallel database queries with Promise.all. One of them hits a slow external service and times out after 30 seconds. Promise.all rejects immediately. The handler sends a 500. The caller moves on. The other two queries are still running. They are holding database connection pool slots. At a few hundred concurrent requests, the pool exhausts. Every subsequent request queues waiting for a slot. The app looks hung, but the logs show mostly successes.

The fix everyone reaches for is adding a shorter timeout to the slow query. That helps but does not solve the underlying issue. When Promise.all rejects, it rejects. It does not cancel the tasks it was waiting on. Those tasks have no owner anymore. They run to completion or to error, nobody is listening, and the resources they hold are not released until they are done.

This is the async leak problem in JavaScript, and it is more common than most people realize because it is often invisible. The code "works" in the sense that it produces correct outputs. The resource leak shows up as a slow degradation under load, a pool exhaustion event, or a flaky test that passes locally and fails in CI on a slow machine.

ES2026 shipped the primitives to actually fix this. You do not need a library. You do need to understand what you are composing and why.

The Three Failure Modes Worth Knowing

Before the solution, the problem is worth making concrete. These are the three production patterns I have seen cause real incidents.

The Abandoned Fetch

async function loadDashboard(userId: string) {
  const [user, settings, notifications] = await Promise.all([
    fetchUser(userId),
    fetchSettings(userId),
    fetchNotifications(userId), // slow, sometimes takes 10 seconds
  ]);
  renderDashboard(user, settings, notifications);
}

The user navigates away before the notifications fetch completes. The component unmounts. Your framework might fire a cleanup callback, but that cleanup has no way to reach inside Promise.all and abort the in-flight fetches. All three requests continue running. In a single-page app with heavy route churn, these orphaned fetches accumulate. They fill browser connection slots, they log errors to surfaces nobody checks, and they burn mobile data the user did not ask to spend.

The Zombie Database Query

const [userData, auditLog, recommendations] = await Promise.all([
  db.users.findOne(id),         // completes in 5ms
  db.audit.findByUser(id),      // completes in 12ms
  externalService.recommend(id), // times out after 30s
]);

When recommend throws, Promise.all rejects. Your code catches the error and returns a 500. findOne and findByUser are still holding connection pool slots from the database. In a busy API, this pattern under load means your connection pool fills with queries attached to requests that have already failed, and new requests queue waiting for slots that are technically occupied by work nobody is waiting for.

The Port Still Bound

async function run() {
  const server = await startServer(3000);
  await performSetup(); // slow, sometimes takes a few seconds
  await server.waitForShutdown();
}

process.on('SIGINT', () => process.exit(0));

You hit Ctrl-C during performSetup. The process.exit(0) fires synchronously, tearing down the event loop before performSetup has a chance to resume and reach any cleanup code. The port stays bound. You try to restart and get EADDRINUSE. You have seen this. The fix is usually "kill the process manually" rather than "understand why the port is not being released."

All three of these have the same root cause: the tasks you started have no owner. When the parent gives up, the children keep running. The language gave you a way to start concurrent work, but not a way to define what happens to that work when the context that started it goes away.

What ES2026 Actually Gives You

The honest framing first: JavaScript in 2026 does not have a "structured concurrency" primitive in the way Go, Kotlin, or Swift do. There is no native task scope that automatically propagates cancellation to children when the parent exits. That language feature does not exist yet.

What does exist is a set of composable primitives that were not in the language two years ago. Together they make it possible to build the pattern yourself without depending on an external library.

`await using` and `Symbol.asyncDispose`

The Explicit Resource Management proposal reached Stage 4 in May 2025. await using is now available natively in Node.js 24+ and Chrome 134+. TypeScript has supported it since version 5.2 with transpilation.

The core idea: any object that defines [Symbol.asyncDispose]() returning a Promise can be declared with await using. When the enclosing block exits, regardless of how it exits (normal return, thrown error, early return), the runtime calls and awaits that method before continuing.

class DatabaseConnection {
  constructor(private conn: Connection) {}

  async query<T>(sql: string, params: unknown[]): Promise<T> {
    return this.conn.execute(sql, params);
  }

  async [Symbol.asyncDispose]() {
    await this.conn.close();
  }
}

async function getUser(id: string) {
  await using db = new DatabaseConnection(await pool.acquire());
  // the connection releases when this block exits, always
  return db.query('SELECT * FROM users WHERE id = ?', [id]);
}

The important part is "always." Not "if we reach the cleanup code." Not "if the Promise chain resolved normally." The disposal runs if the function returns, if it throws, and if something higher up calls its AbortSignal. The LIFO ordering also matters: multiple await using declarations in the same block dispose in reverse order, which is what you want when resources depend on each other.

AsyncDisposableStack extends this for ad-hoc aggregation:

async function withCleanup() {
  await using stack = new AsyncDisposableStack();
  const conn = stack.use(await openConnection());
  stack.defer(async () => await logCompletion());
  // both cleanup when block exits, in reverse registration order
  return conn.query('...');
}

The limitation worth knowing: Safari does not support await using natively as of early 2026. TypeScript's transpilation covers it for browser targets, but if you rely on native support in a Safari-heavy environment, test carefully.

`AbortSignal.any()` for Composed Cancellation

AbortSignal.any() shipped in all major browsers in March 2024 (Chrome 116+, Firefox 124+, Safari 17.4+) and is available in Node.js 20+. It takes an array of AbortSignal instances and returns a new signal that fires the moment any of the input signals fires.

const controller = new AbortController();
const timeoutSignal = AbortSignal.timeout(5000);
const combined = AbortSignal.any([controller.signal, timeoutSignal]);

const response = await fetch(url, { signal: combined });

The fetch aborts if the user cancels (via controller.abort()) or if the 5-second timeout fires, whichever comes first. The combined signal's reason property tells you which input triggered it.

The real value is in composition. You can have a request-scoped abort signal, a user-interaction abort signal, and a global shutdown signal, and combine them into one that you pass into all the work spawned for a given operation. Any of them firing aborts everything.

Building a Task Scope

These two primitives together make a small but useful abstraction possible. I have been using a version of this in a handful of projects.

class TaskScope {
  private controller = new AbortController();
  readonly signal = this.controller.signal;
  private tasks: Promise<unknown>[] = [];

  spawn<T>(fn: (signal: AbortSignal) => Promise<T>): Promise<T> {
    const task = fn(this.signal).catch((err) => {
      if (err.name !== 'AbortError') this.controller.abort(err);
      throw err;
    });
    this.tasks.push(task);
    return task as Promise<T>;
  }

  async [Symbol.asyncDispose]() {
    this.controller.abort();
    await Promise.allSettled(this.tasks);
  }
}

Using it:

async function loadDashboard(userId: string, parentSignal: AbortSignal) {
  const scopeSignal = AbortSignal.any([
    parentSignal,
    AbortSignal.timeout(8000),
  ]);
  const scopeController = new AbortController();
  const combinedSignal = AbortSignal.any([scopeSignal, scopeController.signal]);

  await using scope = new TaskScope();

  const [user, settings, notifications] = await Promise.all([
    scope.spawn((sig) => fetchUser(userId, sig)),
    scope.spawn((sig) => fetchSettings(userId, sig)),
    scope.spawn((sig) => fetchNotifications(userId, sig)),
  ]);

  return { user, settings, notifications };
}

When any of the spawned tasks fails, the catch handler in spawn calls this.controller.abort(). All other spawned tasks receive the abort signal and should stop work. When the await using block exits, the asyncDispose method fires the abort and waits for all tasks to settle before releasing.

This does not magically make your fetch calls abort cleanly. Each function you pass to spawn needs to actually respect the signal. That means threading the signal through to every fetch call, every database query, every async operation that has a cancellation mechanism. The scope provides the structure; you still do the wiring.

The fetch case is easy because the fetch API accepts a signal. The database case depends on your driver. Many modern Node.js database drivers support AbortSignal on query calls. If yours does not, you wrap the query in a Promise.race against the abort signal and release the connection in the losing branch. It is more boilerplate, but the intent is explicit.

`AsyncLocalStorage` as Context Carrier

One more tool that ties this together, particularly in server environments: AsyncLocalStorage from Node.js.

The use case is ambient context, values that need to be available to anything spawned within a request without being passed as arguments everywhere. Request IDs, user sessions, cancellation tokens, tracing metadata.

Node.js 24 changed the internal implementation of AsyncLocalStorage from the legacy async_hooks machinery to a new AsyncContextFrame backend. The public API did not change but the correctness did. Earlier versions had edge cases where context could be silently lost across certain microtask boundary patterns. The Node 24 implementation is more reliable, which matters specifically for patterns where context carries cancellation tokens through nested async call chains.

import { AsyncLocalStorage } from 'node:async_context'; // Node 24+

const requestContext = new AsyncLocalStorage<{ signal: AbortSignal; requestId: string }>();

app.use((req, res, next) => {
  const controller = new AbortController();
  res.on('close', () => controller.abort(new Error('client disconnected')));
  requestContext.run({ signal: controller.signal, requestId: req.id }, next);
});

async function anywhereInTheStack() {
  const ctx = requestContext.getStore();
  if (!ctx) throw new Error('called outside a request context');
  // ctx.signal is the request-scoped abort signal
  // no need to thread it through every function signature
}

This pattern composes cleanly with TaskScope. The scope reads the ambient signal from the store, combines it with its own signal, and any work spawned inside inherits both.

When to Reach for Effection

The primitives above get you a long way. For most server routes and browser interactions, await using plus AbortSignal.any() plus a thin scope abstraction covers the problem.

Effection is worth knowing about for cases where the generator-based model is a better fit. It is a maintained library (~5KB gzipped) that enforces the lifetime guarantees at the library level: no task outlives its parent, cancellation propagates down the entire task tree, and cleanup always runs.


await main(function* () {
  const result = yield* race([
    function* () { return yield* fetchUser(id); },
    function* () { yield* sleep(5000); throw new Error('timeout'); },
  ]);
  // the losing task is actively cancelled, not just abandoned
});

The difference from Promise.race is that Effection's race actively cancels the loser and awaits its cleanup before resolving. Promise.race abandons the loser. That distinction is exactly the failure mode described at the start.

The tradeoff is the generator syntax. It is not familiar to most JavaScript developers, it requires buy-in from the whole team, and it does not incrementally compose with existing async/await code. I would reach for Effection on greenfield CLIs and servers where correctness is the priority and the team is willing to adopt the model. For existing codebases, the await using approach is easier to add incrementally.

The Honest Limitation

I said this at the start and it is worth repeating: JavaScript in 2026 does not enforce task lifetime guarantees. The language lets you build the pattern. It does not require it.

Compare this with Go's goroutines, where passing a context.Context is idiomatic and cancellation propagation is expected by every library you use. Or Kotlin coroutines with structured concurrency enforced by the CoroutineScope. Or Swift's async let, which lexically bounds the lifetime of the spawned task. In those languages, "structured" is a property the runtime or compiler enforces.

In JavaScript, "structured" is a property you add to your codebase through discipline and a thin abstraction. The discipline part is the limiting factor. A new engineer joins, writes Promise.all without threading signals through, and the leak is back.

The TC39 Concurrency Control proposal (Stage 1) is about concurrency limiting, not lifetime management. It adds a governor model for capping concurrent operations, which is useful but a different problem. There is no proposal on the standards track for native task lifetime management as of mid-2026.

What we have is enough to write correct code. What we do not have is a language that makes incorrect code hard to write. That gap is worth being honest about, particularly if you are introducing this pattern to a team that is used to Promise.all and considers the topic closed.

Making It Stick in Practice

The structural change that actually made this work in a production codebase I maintain: treat task scope as a first-class part of the request lifecycle, not an optional add-on.

Every route handler receives an abort signal from the framework (or creates one tied to the response close event). That signal flows into a TaskScope that wraps the handler. Every async operation inside the handler uses scope.spawn rather than raw Promise.all. New code added later follows the same pattern because the pattern is already in the scaffolding.

The cost of adoption is the upfront wiring: making sure fetch calls and database queries actually accept and respect an abort signal. Most modern Node.js libraries do. For the ones that do not, a wrapper that races against the signal is worth writing once and reusing.

The benefit is not academic. Database connection pool exhaustion under load is a genuinely painful incident. Orphaned fetches in a React app are a common source of "this bug only happens after you navigate quickly" reports. Ports that stay bound after Ctrl-C are a small irritation that adds up over a development day.

These primitives exist now, they are stable in Node.js 24 and modern browsers, and they compose cleanly without pulling in a new runtime model. The question is whether you add the pattern to your scaffolding now or explain the connection pool leak to your on-call engineer six months from now.

Given how central async JavaScript is to AI agent tooling and multi-step pipelines where task cancellation actually matters, this is one of those patterns that goes from "good practice" to "necessary" as the complexity of what you are building goes up. The primitives are there. Worth using them.

Anthropic and SpaceX: What the Colossus Deal Actually Means for Developers

Alex Cloudstar — Thu, 07 May 2026 08:28:14 +0000

On May 6, Claude Code's five-hour rate limits doubled. The peak-hour throttling that had been frustrating paid users for months disappeared. Most people noticed the change and moved on without looking too closely at what caused it.

The answer is strange enough that I think it is worth looking at closely. Anthropic rented the entire Colossus 1 supercomputer cluster in Memphis, Tennessee from SpaceX. That is 220,000 NVIDIA GPUs and 300 megawatts of power capacity, coming online within a month of the announcement. The reason it is strange: three months before signing this deal, Elon Musk had posted on X that Anthropic's AI was "misanthropic and evil" and told the company it was "doomed."

Let me walk through what actually happened, what it means practically, and what I think it signals about where we are in the AI compute story.

What Colossus 1 Actually Is

Most people have heard the name but do not have a clear picture of the scale. Colossus 1 is the original AI supercomputer cluster that xAI (Musk's AI company) built in Memphis starting in 2024. It went operational in July of that year, remarkably fast for infrastructure of that size.

The hardware breakdown: the cluster runs a mix of NVIDIA H100s, H200s, and GB200s. 220,000 GPUs total. The 300 megawatt power draw is equivalent to the entire electricity load of roughly 300,000 average American homes. When it launched, it was described as the largest AI training facility in the world by a significant margin.

Here is what changed and why the deal was possible. Since then, xAI (now merged into SpaceX after a $1.25 trillion all-stock deal in February 2026) built Colossus 2, an even larger cluster with around 520,000 GB200s targeting one gigawatt of power capacity. When Grok's training workloads migrated to the newer, faster hardware, Colossus 1 became a 300-megawatt facility generating very little revenue. The deal with Anthropic solves that problem.

Anthropic gets the compute immediately. SpaceX gets rental income ahead of its planned June 2026 IPO. That is the straightforward business logic.

Why Anthropic Needed This

Dario Amodei was on stage at Anthropic's developer conference the same day the deal was announced. He said something that landed harder than most conference quotes: the company had projected 10x growth in Q1 2026. The actual number was 80x, annualized. He called it "just crazy" and "too hard to handle."

Claude Code specifically drove a lot of that. The adoption curve for AI coding tools has been steep across the industry, and Claude Code became the default choice for a large chunk of that market. The infrastructure was not built for 80x growth. That is what was behind the rate limit caps and the peak-hour throttling that paying users had been hitting for months. It was a capacity problem, not a policy problem.

Anthropic is not short on future compute commitments. The company has deals with Amazon (up to $25 billion invested, roughly 5 gigawatts of Trainium capacity coming over the next few years), Google (up to $40 billion invested, 5 gigawatts via Broadcom), and several other infrastructure partners. The total compute reserved across all of those deals is measured in gigawatts.

The problem those deals do not solve is now. AWS Trainium rollouts and Google TPU clusters are measured in years, not weeks. Colossus 1 is available within a month of the announcement. For a company that just discovered its demand is 8x higher than forecast, "available in weeks" is worth a lot even at a smaller scale than the future partnerships will deliver.

The current deal also appears to be focused on inference rather than training. Anthropic trains Claude on AWS Trainium and Google TPUs. Colossus 1's hardware mix, particularly the H100 and H200 GPU density, is better suited for the inference workloads that serve Claude Pro, Claude Max, and the API. The immediate user-facing impact, the doubled rate limits and removed peak throttling, is consistent with that.

The Musk Reversal

This is the part of the story that every tech journalist covered, and for good reason. The timeline is genuinely unusual.

In February 2026, hours after Anthropic announced a $30 billion funding round, Musk posted directly at the @AnthropicAI account: "Your AI hates Whites & Asians, especially Chinese, heterosexuals and men. This is misanthropic and evil. Fix it." In other posts around the same period he called Anthropic "Misanthropic," said it "hates Western civilization," and declared that "Winning was never in the set of possible outcomes for Anthropic."

He also had a specific grievance: Anthropic had cut off xAI's access to Claude through Cursor, citing their commercial terms that prohibit using the API to build competing AI products. (Anthropic did the same to OpenAI in August 2025.) The xAI cofounder Tony Wu confirmed it internally: "We will take a hit on productivity, but it really forces us to develop our own coding products and models."

Three months later they signed a deal together.

Musk's explanation, posted the day after the announcement: "I spent a lot of time last week with senior members of the Anthropic team to understand what they do to ensure Claude is good for humanity and was impressed. Everyone I met was highly competent and cared a great deal about doing the right thing. No one set off my evil detector. So long as they engage in critical self-examination, Claude will probably be good."

There is one unusual clause buried in the deal: SpaceX reserves the right to reclaim the compute if Anthropic's AI "engages in actions that harm humanity." Whether that is meaningful contractual language or a rhetorical add-on is hard to say from the outside, but it is the kind of condition that reflects how personally Musk was taking the criticism before the handshake.

My read on the reversal is simpler than the drama makes it seem. Colossus 1 was sitting underutilized. Anthropic needed compute fast and had budget to pay for it. Both sides had a clear financial reason to set the insults aside. The "evil detector" framing is Musk, but the underlying transaction is just two companies with complementary short-term needs.

What Actually Changed for Claude Users

The practical changes are real and immediate.

For Claude Code specifically: five-hour rate limits doubled for Pro, Max, Team, and Enterprise plans. The peak-hour throttling that kicked in during high-demand periods is gone for Pro and Max accounts. If you have been hitting rate limit errors in the late afternoon US time, that should largely stop.

For API users on Opus models: Anthropic described the limits as "considerably raised" without publishing exact numbers. The framing in the announcement focused on the ability to "process significantly more input and output tokens per minute."

The rate limit doubling matters more than it might sound if you are actively building with Claude Code. The five-hour window was a real constraint on complex, multi-step agentic tasks. Longer context windows, more tool calls, deeper refactors, those all burn limits faster. Doubling the window is a meaningful change for anyone doing serious work rather than quick edits.

The timing of availability is also notable. Colossus 1 is supposed to come online for Anthropic within one month of the announcement. That is unusually fast for infrastructure at this scale, but the cluster is already built and operational. It is a matter of provisioning Anthropic's access rather than constructing anything.

The Compute Race Is Now a First-Class Business Problem

Something this deal makes clear, if it was not already, is that AI compute is now a strategic constraint that the companies in this space have to solve actively and continuously.

Anthropic's situation is a good illustration. They have gigawatt-scale deals committed with Amazon and Google. They also just signed an emergency lease on a competitor's data center because the demand curve outran their projections by a factor of eight. Both things can be true at once. Long-term infrastructure deals are not enough on their own when you are growing at rates this fast.

The orbital compute angle in the announcement is worth noting, even if it reads as forward-looking. Anthropic and SpaceX expressed interest in developing "multiple gigawatts of orbital AI compute capacity." SpaceX filed with the FCC in January 2026 for authorization to deploy a satellite constellation for exactly this purpose. Google published a feasibility study suggesting space-based data centers become cost-competitive with terrestrial ones once Starship brings launch costs down to around $200 per kilogram, which is a realistic target on a ten-year horizon.

I would not count orbital compute as near-term capacity planning. But it does reflect where the ceiling conversation is already happening. Terrestrial power, land, and cooling are the constraints. SpaceX has a credible path to removing those constraints eventually, and Anthropic is a customer with both the compute need and the capital to be interesting to them as a long-term partner.

The Weird Politics at the Edge of This Deal

This part is less about development and more about context, but I think it matters for how you read the deal.

Anthropic has said they are "very intentional" about where they add compute capacity, specifically mentioning a preference for democratic countries with stable legal frameworks. In the same month they signed this deal, they were actively suing the Trump administration to reverse a Defense Department decision that blacklisted them as a supply chain risk and cut them off from federal contracts.

Musk, who controls SpaceX and now SpaceXAI, is closely aligned with that same administration. There is an obvious tension between Anthropic's stated preference for democratic infrastructure partners and signing a major deal with someone whose political alignment is with the government that just tried to cut them off.

I am not drawing a conclusion here, partly because the financial logic of the deal is clear and partly because I do not have visibility into how Anthropic weighed the tradeoff internally. But it is the kind of contradiction that tends to come up again when there is a policy dispute down the line. If SpaceX invokes the "harms humanity" reclaim clause someday, that context will matter.

What This Means for Developers Using Claude

The immediate practical takeaway is: the bottleneck you were hitting on Claude Code is about to be significantly less painful.

The longer-term takeaway is less tidy. The AI infrastructure layer is consolidating around a small number of very large players, and the relationships between those players are more complicated than a simple vendor-customer model. Anthropic's compute stack now includes Amazon, Google, Microsoft, SpaceX, and Fluidstack in a mix of equity investments, compute credits, and rental agreements. Those relationships come with interests that are not always perfectly aligned with the people building on the platform.

This is not a reason to stop building on Claude. The rate limits are better, the pricing is still competitive, and the prompt caching economics still favor Anthropic for high-volume production features. For complex agents, the Claude-specific features (extended thinking, memory primitives, tool use) remain genuinely strong. If you have been building your AI agent architecture around Claude, the deal does not change the calculus there.

What it does is add one more data point to the general pattern of the AI infrastructure layer being much more entangled than the clean abstractions on the surface suggest. The API call you make to get a completion goes through a stack that includes a data center leased from the company whose CEO called your provider "evil" this spring. That is not an argument for or against using the API. It is just an accurate description of the current state of things.

The Short Version

SpaceX had a 300-megawatt data center with 220,000 GPUs sitting underutilized after upgrading to newer hardware. Anthropic was growing 8x faster than projected and hitting capacity limits. They made a deal that makes clear financial sense for both parties, regardless of what either CEO had said about the other three months earlier.

Claude Code rate limits doubled as a direct result. That is the part that affects your day-to-day work, and it is a real improvement for anyone doing serious agentic development.

The rest of the story, the Musk reversal, the orbital compute ambitions, the political contradictions, is worth understanding as context for an industry where the infrastructure layer is genuinely complicated and the companies building on it are making consequential decisions about who they do business with. Those decisions have a way of mattering more than they seem to at announcement time.

RAG Chunking Strategies In Production 2026: What Actually Survives Real Documents And Real Queries

Alex Cloudstar — Wed, 06 May 2026 07:47:35 +0000

The first RAG system I shipped chunked every document at 512 tokens with a 50 token overlap, because that was the example in the tutorial I was reading at three in the morning. It worked well enough to ship. It worked poorly enough that two weeks later a customer support engineer pinged me with a screenshot of the assistant confidently citing a policy document, except the cited paragraph was the second half of one policy glued to the first half of an unrelated one. The model had retrieved a chunk that crossed a section boundary, and the chunk read like a single coherent rule that did not exist anywhere in the source. Fixing that one bug took longer than building the original retriever.

That was a few years ago. The pattern has not changed. Teams still ship RAG systems where the LLM is sophisticated, the embedding model is fine, the vector store is overkill for the data volume, and the chunker is a one-line call to a default splitter that tears documents apart at arbitrary character offsets. The retrieval looks like it is working in the demo, because the demo uses clean Wikipedia paragraphs. It stops working the moment the documents are real, which means messy, inconsistent, structurally meaningful, and full of edge cases the default chunker has never seen.

By 2026 the production patterns for chunking have settled. They are not glamorous. They are mostly about respecting the structure the document already has, sizing chunks to match how the embedding model thinks, and making the retrieval shape match the queries you actually expect. This post is what I would tell my past self before that 3 a.m. tutorial, and what I would build into any retrieval pipeline before its first real user.

Chunking Is The Hidden Half Of RAG

The framing most teams start with is that RAG is about retrieval and generation, with chunking somewhere in the wiring. That framing is wrong. The chunker decides what answers can possibly be found, because the unit of retrieval is the chunk. If the right answer lives in a span the chunker split in half, the retriever cannot return it intact, and the model cannot cite it. Every other component in the pipeline is downstream of the chunking choice.

This is the same lesson I keep relearning in every retrieval project. You can change the embedding model, swap the vector store, tune the top-k, add a reranker, and you are still bottlenecked by whether the chunks contain the answers the user asks about. A great LLM cannot answer from a chunk that does not contain the relevant information. A great embedding model cannot match a query to a chunk where the answer is split across two retrievable units. The chunker is the floor, and most teams ship with that floor lower than they realize.

The reason it stays hidden is that chunking failures are silent. The system returns plausible-looking citations, the model produces fluent answers, and only a careful read of the source documents reveals that the answer is wrong, or partial, or stitched together from the wrong context. Compare that to a pipeline where the embedding model is broken: queries return obvious garbage, on-call gets paged, the bug is fixed in an afternoon. Chunking bugs do not page anyone. They show up as a slow drift in answer quality and an unhappy customer support engineer who does not know how to file the ticket.

Fixed-Size Chunking Is The Default For A Reason, And A Trap For Another

The default everybody starts with is fixed-size chunking. Pick a chunk size, pick an overlap, slide a window across the document. It is one line of code. It works on any document type. It produces predictable chunk counts and predictable storage costs. There is a real reason this pattern is the default, and there is a real reason it stops being good enough the moment the documents have any structure at all.

The strength of fixed-size chunking is that it is uniform. Every chunk is the same size, every chunk has the same overlap with its neighbors, and the embedding model sees inputs in a consistent shape. That uniformity matters more than people give it credit for. Embedding quality is sensitive to chunk size, and a pipeline where chunks vary wildly in length produces vectors that are not directly comparable. A 50-token chunk and a 2000-token chunk live in different parts of the embedding space, even if they describe the same topic, because the model encodes density and breadth differently. Fixed-size chunking sidesteps that problem by pretending everything is the same shape.

The weakness is the part everybody hits within a week of shipping. Fixed-size chunking ignores the structure of the document. It splits in the middle of sentences, in the middle of code blocks, between a heading and the section it introduces, between a question and its answer. The overlap parameter is supposed to paper over this, but overlap is a band-aid. A 50-token overlap on a 512-token chunk gives the next chunk a small lead-in to the previous one, but it does not preserve the boundary that mattered, which was the section heading. The retriever finds the body but loses the title that explained what the body was about.

The pattern that has worked when I am stuck with fixed-size chunking is to preprocess aggressively. Before the splitter runs, I prepend every chunk with the document title and the nearest preceding heading. The chunker still cuts where it cuts, but the chunk now carries enough context that the embedding can place it in the right neighborhood. This is a hack, and it works, and it is almost always worth the small storage hit. The chunk that says "from a document titled X, in a section about Y, the following text..." retrieves better than the chunk that starts mid-paragraph with no signal of where it came from.

Structure-Aware Chunking Is Where Production Lives

The next step up, and the one most production systems should be at, is to chunk along the structure the document already carries. Markdown documents have headings. HTML has tags. PDFs have pages and, with the right parser, sections. Code has functions and classes. Notion pages, Confluence pages, and most internal documentation systems expose a structural tree if you ask nicely. Use it.

The pattern is to split at structural boundaries first, then post-process to merge or further split based on size constraints. A markdown document becomes a tree of sections, each section becomes a candidate chunk, and any section that exceeds the embedding model's effective context gets recursively split along sub-headings. Sections that are too small get merged with their neighbors, but only their structural neighbors, never across a top-level heading. The output is chunks that respect the author's intent: each chunk is a thing the author wrote as a unit, not a slice of arbitrary text.

The benefit shows up in retrieval quality, but it also shows up in citation quality. When a structural chunk is retrieved, the model can cite the section heading directly. The user can see "this answer comes from Section 4.2 of the Refunds Policy" instead of "this answer comes from chunk 137." That is a product feature. Users trust citations they can verify. Citations that point to recognizable structural units are easier to verify than citations that point to opaque ranges.

The trap with structure-aware chunking is that the structural parser has to be good. A bad markdown parser will mistake a code block for a heading and chunk wrong. A bad PDF parser will fail to find sections in a document where the section breaks are visual rather than semantic, which is most real PDFs. Investing in the parser is the unglamorous part of this work. The right move is to spend a day looking at how your parser actually splits a representative sample of your documents, and to fix the cases where it is wrong. The fixes pay back for the lifetime of the index.

Semantic Chunking Sounds Smart, Mostly Is Not

There is a class of chunking strategies marketed as "semantic" that try to use embeddings or a small model to find natural break points in the text. The pitch is that the chunker reads the document, notices where the topic shifts, and cuts there. The pitch is correct in theory. In practice, semantic chunking works well on a narrow set of documents and poorly on most of the rest, and the cost is high enough that the trade is rarely good.

Where it works is on flowing prose without explicit structure. Long-form articles, transcripts, books. The structural signals are absent, the topic shifts are real, and a semantic chunker can find a cut point that a fixed-size chunker would miss. If the entire corpus is documents like this, semantic chunking is worth the engineering cost.

Where it fails is everything else. On structured documents the semantic chunker fights with the structure. The headings already mark topic shifts, and the embedding-based detector is noisy enough to put cuts in places where the author did not intend cuts. On code, on logs, on FAQs, on transactional documents, semantic chunking adds latency and cost without measurable retrieval improvements. The teams I have seen ship semantic chunking and keep it are the ones whose corpus is dominated by long prose. Everybody else has either ripped it out or quietly downgraded to structure-aware with semantic-style heuristics for the rare cases where it matters.

The compromise that works is to use a semantic detector only as a fallback. If a structural chunk is too long to fit the embedding model's window, use a semantic detector to find the best cut point inside it. That keeps the cost bounded and the benefit targeted at the cases where structure has run out.

Hierarchical Chunking And The Parent-Child Pattern

The pattern that has earned its place in production over the last two years is hierarchical chunking, sometimes called the parent-child or small-to-big pattern. The idea is to chunk at two granularities. Small chunks, sized for retrieval, are what the embedding model and the vector store see. Large chunks, sized for context, are what the LLM sees when a small chunk is retrieved. The retrieval index points from the small chunk to its parent.

The reason this works is that retrieval and generation have different sweet spots. Retrieval works best on chunks small enough that the embedding represents a single coherent idea. The vector for a 200-token chunk about how to issue a refund is sharp. The vector for a 2000-token chunk that contains that same idea plus four other ideas is blurred, because the embedding has to average over all of them. Generation, on the other hand, works best with more context, because the model needs the surrounding details to produce a complete answer.

The hierarchical pattern lets you have both. The retriever finds the precise small chunk that matches the query. The pipeline then expands to the parent, which is the section or the page or the document, and sends that to the LLM. The model gets the precision of the small chunk's match and the context of the parent's surroundings. The cost is a little extra storage for the parent text, which is rounding error in any production vector store.

The discipline is to set the parent boundary at a level that means something. Parents that are entire documents are usually too big. Parents that are paragraphs are usually too small. The right level is almost always the structural level: a section in a markdown doc, a page in a PDF, a function in a code file. The parent is the unit a human would point to when asked "where did this come from."

The same discipline I covered in RAG vs long context applies here, because hierarchical chunking is partly an answer to the question of how much context to send. The retrieval narrows the search. The parent expansion gives the model enough surrounding text to produce a grounded answer. Tuning the small-chunk size and the parent size independently is one of the highest-leverage tuning operations in a RAG pipeline.

Chunk Size: The Number Everyone Asks About And The Wrong One To Optimize First

The first question every team asks is what chunk size to use. The honest answer is that it depends on the embedding model, the document type, and the query shape, and the fastest way to get to a good number is to start at 256 to 512 tokens and adjust by measuring. Anchoring to a number before measuring is how teams end up with a confidently wrong setting.

Embedding models have an effective context that is shorter than their advertised maximum. A model with a 8192-token context window does not produce equally good embeddings for 8192-token chunks as it does for 512-token chunks. The longer the input, the more the embedding has to compress, and the more semantic detail gets lost in the averaging. The advertised context is the limit, not the recommendation. The recommendation is usually a few hundred tokens, sometimes up to a thousand for newer models. Check the model card. Then verify on your own data, because model cards are written for a benchmark and not for your corpus.

Document type matters because chunk size interacts with information density. Technical documentation packs ideas tightly: a 256-token chunk of API reference can contain three or four distinct facts. Narrative content is sparser: a 256-token chunk of a blog post might contain half of a single argument. The right chunk size for the dense corpus is smaller, because the embedding can capture the multi-fact density at smaller sizes. The right chunk size for the sparse corpus is larger, because cutting too small leaves the chunks without enough signal to retrieve.

Query shape matters because the chunk has to answer the kind of question users ask. If the queries are precise lookups ("what is the refund window for product X"), small chunks win, because the answer is a single fact and small chunks isolate facts. If the queries are exploratory ("how does our refund process work"), larger chunks win, because the answer needs context the user is implicitly asking the system to assemble. Most production systems get a mix of both, and the right move is hierarchical chunking, which sidesteps the choice.

Overlap: The Knob That Matters Less Than You Think

The other parameter every tutorial mentions is overlap. The standard advice is to overlap chunks by 10 to 20 percent. The standard advice is fine and almost never the difference between a working system and a broken one. Overlap is a small lever, and tuning it is one of the last things to do.

The reason overlap exists is to handle the case where the answer to a query straddles a chunk boundary. With no overlap, the answer is split between two chunks, and neither chunk is a great match for the query. With overlap, one of the two chunks contains the full answer, and the retriever can find it. This is real, and overlap helps, and the help is bounded.

The case where overlap stops helping is when the chunk boundaries are wrong in the first place. Adding overlap to a fixed-size chunker that splits in the middle of sentences does not produce chunks that respect sentence boundaries. It produces chunks that share a few sentences with their neighbors and still split mid-sentence at the start and end. The fix is not more overlap. The fix is structure-aware chunking that does not split mid-sentence.

The other case where overlap is wasted is when the chunk size is already large enough that boundary-straddling answers are rare. A 2000-token chunk almost never has its answer split across the boundary, because almost any answer fits inside it. Spending storage on overlap at that size is paying for an edge case that does not happen.

The pattern I default to is small overlap, around 10 percent, on smallish chunks, around 256 to 512 tokens. It is a sensible setting that does not need tuning unless something else in the pipeline forces it. If the retrieval quality is bad, do not start by tuning overlap. Start by looking at whether the chunks themselves make sense.

Metadata Is The Multiplier

The chunk text is not the only thing you store. Every chunk should carry metadata that lets the retriever filter, the reranker reason, and the LLM cite. Document title. Section heading. Source URL. Author. Publication date. Document type. Tags. Whatever your system has that distinguishes documents from each other.

Metadata pays back in three places. First, in retrieval, where filters cut the search space and improve precision. A query about a 2024 policy should not return a chunk from a 2020 policy, no matter how semantically similar the text is. A metadata filter on date solves that without any embedding-side work. Second, in reranking, where the metadata becomes additional features the reranker can weight. Recent documents, authoritative sources, official policies score higher. Third, in citation, where the metadata is what the LLM uses to tell the user where the answer came from. A citation is only as good as the metadata behind it.

The pattern that has worked is to over-collect metadata at chunking time and decide later what to use. Storage is cheap. Re-chunking the corpus to add a missing field is expensive. If the source has it, capture it. The first time you need to filter by something you did not capture is the day you regret not capturing it.

Tables, Code, And Other Things That Break Default Chunkers

Default chunkers handle prose. They do not handle tables, code blocks, lists with structural meaning, or multi-column PDFs. Each of these requires a different strategy, and each of them shows up in real corpora, and each of them silently degrades retrieval if you do not address them.

Tables are the worst offender. A table chunked by character count loses its row structure and becomes a stream of cells the embedding model cannot interpret. The fix is to detect tables before chunking and serialize them in a format that preserves structure. Markdown tables, JSON arrays of row objects, or natural-language summaries of the table contents all work, with different trade-offs. The summary approach is the highest quality and the highest cost, because it requires running the table through a small model. The markdown approach is cheaper and works for most queries that ask about the table's contents.

Code blocks should be chunked by the structure of the code, not by line count. A function or class is the natural unit. Chunking in the middle of a function produces chunks that have neither the signature nor the implementation, and the embedding represents nothing useful. Most languages have AST parsers that can extract function-level chunks cleanly. The investment pays back in code-search quality, which is otherwise terrible.

Multi-column PDFs are the failure mode that catches every team that ships RAG against scanned documents. The default text extractor reads top-to-bottom, left-to-right, which produces a stream where the first sentence of column one is followed by the first sentence of column two. The chunks are gibberish. The fix is a layout-aware extractor that respects columns, of which there are several open and commercial options as of 2026. Pick one, evaluate on your corpus, switch.

How To Know Your Chunking Is Wrong

The hardest part of chunking is that the failure signal is buried in answer quality, which is hard to measure and slow to surface. The discipline is to build a small evaluation set early, before the chunker is locked in, and to run it on every chunking change.

The eval set is a list of representative queries with known correct answers and known correct source spans in the corpus. For each query, the eval measures whether the retrieval returned the chunk containing the correct span, and whether the LLM produced an answer matching the expected one. This is the same evals discipline I covered in AI evals for solo developers, applied to the retrieval-and-generation pipeline as a unit.

The chunking-specific signal to watch is recall at k. If the correct chunk is in the top 10 results most of the time, the chunker is doing its job. If the correct chunk is missing from the top 10 even when the embedding model is solid and the query is clear, the chunker has split the answer in a way that breaks retrieval. That signal is much faster to act on than answer quality, because it points directly at the chunking step.

The other signal is qualitative. Read the chunks. Take a sample of fifty chunks at random and read them as if you were the embedding model. Do they make sense as standalone units? Do they cut off mid-thought? Do they have enough context to be retrievable? Five minutes of reading chunks beats five hours of tuning hyperparameters, every time, and most teams skip it because it does not feel like engineering. It is the most engineering thing you can do at this layer.

What I Would Build From Scratch In 2026

If I were starting a RAG pipeline today, the chunker would be structure-aware, hierarchical, with metadata enrichment, with a small overlap, with special handling for tables and code, and with an eval set running on every change. The chunk size would be a few hundred tokens for retrieval, with parents at the section or page level for generation. The fixed-size fallback would only kick in for unstructured prose, and even then with title and heading prepended to every chunk. The semantic chunker would be a fallback inside the structural chunker, used only when a structural unit was too large to embed cleanly.

That stack is not novel. It is the stack the production teams I trust have converged on, and it is unglamorous in the same way the guardrails layer is unglamorous and the observability layer is unglamorous. The interesting work is at the LLM, the visible improvements are at the LLM, and the actual quality ceiling sits at the chunker. Most of the wins in a RAG system over the next year are going to come from teams realizing this and putting an engineer on the chunking layer for a week instead of swapping models for the third time.

If your RAG system is producing answers that look right but feel slightly off, the answer is almost never the LLM. It is almost always the chunker, doing exactly what you told it to do, on documents that did not deserve to be cut where they got cut. Fixing that is the highest-leverage thing you can do in retrieval, and it is sitting there, waiting for somebody to read fifty chunks and notice.

Embedding Models And Reranking In Production 2026: Picking The Pair That Actually Lifts Retrieval Quality

Alex Cloudstar — Wed, 06 May 2026 07:47:34 +0000

The first time I swapped an embedding model in production, the answer quality on our internal eval set jumped by twelve points and the latency went down. I felt very smart for about a week. Then a customer success engineer asked why the assistant had stopped finding documents that contained exact product SKUs, and I spent a Saturday discovering that the new model, which was great at semantic similarity, had gotten worse at lexical matching. The old model carried enough surface-level signal to find the SKU. The new one had been trained out of that and pretended every SKU was a similar SKU. Recall on a specific class of query had collapsed, and our eval set had not covered that class.

That is the standard embedding-model story. The model that wins on benchmarks is not always the model that wins on your data, and the model that wins on your data is not always the model that keeps winning when the queries change shape next quarter. Embeddings are not a commodity. The choice of embedding model and the decision of whether to put a reranker behind it are two of the highest-leverage tuning operations in a retrieval pipeline, and most teams treat both as defaults. The defaults are not bad. They are also not what you ship past year one.

By 2026 the patterns for picking embedding models and adding rerankers have settled into a small set of choices that consistently outperform the defaults. None of them are exotic. All of them are about understanding what each layer does, what it cannot do, and where the failure modes hide. This post is what I would tell my past self after that Saturday.

What An Embedding Model Actually Encodes

The framing that helps most when picking an embedding model is to think about what the model was trained to optimize, because that is what its vectors will encode well. Models trained on web search query-document pairs are good at matching short queries to long documents. Models trained on natural language inference are good at semantic similarity between full sentences. Models trained on code are good at code-to-code or code-to-comment retrieval. Models trained on multilingual corpora are good at cross-language retrieval and often slightly worse at any single language than a dedicated monolingual model.

What this means in practice is that the right model for your corpus depends on what your queries and documents look like. A support knowledge base with short user queries and medium-length policy documents wants a model trained on query-document pairs. A semantic search across blog posts wants a model trained on long-form similarity. A code search wants a code-specific model. A multilingual product wants a multilingual model and accepts the small penalty in any single language. Defaulting to the highest-MTEB-scoring model regardless of corpus is how teams end up with embeddings that are good in general and mediocre on the specific shape of data they actually run.

The other thing the embedding encodes is what it does not encode. Most general-purpose embedding models are trained to be invariant to surface-level details that do not affect meaning. Word order, exact phrasing, specific identifiers, punctuation. That invariance is great for semantic search. It is terrible for any retrieval that depends on those exact details. SKUs, version numbers, function names, error codes. The model has been trained to compress these into a representation where similar identifiers are close to each other, which is exactly the wrong behavior when the user wants the specific identifier and not a similar one.

The fix is not always a different embedding model. The fix is often a hybrid retrieval pipeline that combines dense embeddings with a lexical signal. More on that below. But the framing matters: if you understand what the embedding encodes, you understand which queries it will fail on, and you can plan for those failures instead of being surprised by them in production.

The Embedding Model Choice In Three Tiers

The market in 2026 looks like three tiers, and most teams should pick from one of them based on their constraints.

The frontier tier is the proprietary embedding APIs from the major model providers. These are the models with the highest benchmark scores, the broadest training, and the steepest cost. They are the right default when you do not want to think about it, when latency is not critical, and when sending your data to an external API is acceptable. The capability is real. The trade is the per-token cost and the network round trip on every embed call.

The open-weights tier is the strong open models, the descendants of E5, BGE, GTE, Nomic, and the like. By 2026 these are good enough that the gap with the frontier API tier is small for most use cases, and they can be served on commodity GPUs at a fraction of the cost. The trade is that you now run inference: GPU bills, autoscaling, monitoring. For high-volume retrieval, this is almost always cheaper than the API after a few weeks. For low-volume systems, the operational cost is not worth it. The same calculus I covered in small language models in production applies here, because embedding models are exactly that: small models you can host yourself when the volume justifies it.

The specialized tier is models fine-tuned for a specific domain or task. Code embeddings, scientific paper embeddings, legal document embeddings, product search embeddings. These are not always better than the general models on benchmarks, but they are often better on the specific shape of data they were trained for. For domain-heavy products, this tier is worth the search cost. For general-purpose retrieval, it is not.

The pattern that has worked when I am unsure is to pick a strong open-weights model, run it on a representative eval set, and only escalate to the frontier tier if the open model leaves measurable quality on the table. Start cheap, measure, escalate only when measurement justifies it. The opposite pattern, starting on the frontier API and trying to descend later, almost always stalls because the team gets used to the latency and quality and the migration becomes a project.

Embedding Dimension And The Cost Curve

The other axis on which embedding models differ is dimension. Models output vectors of varying lengths: 384, 512, 768, 1024, 1536, sometimes higher. Higher dimensions can encode more information. They also cost more in storage, more in retrieval, and more in latency, and the cost scales linearly with the number of vectors in the index.

The trade-off is real and the right setting depends on corpus size. For small indexes, up to a few million vectors, dimension does not matter much. The storage and retrieval costs are rounding error, and the quality gain from higher dimensions is worth taking. For larger indexes, tens or hundreds of millions of vectors, dimension becomes a real cost line. Doubling the dimension doubles the storage and roughly doubles the retrieval cost. At those scales, the right move is often the lower-dimension variant of the same model family, accepting a small quality hit for a large cost reduction.

The pattern that has emerged in 2026 is Matryoshka embeddings, where the same model can produce vectors at multiple dimensions and the lower-dimension variant is a meaningful prefix of the higher-dimension one. This lets a single model serve both a fast, low-dimension index for the first retrieval pass and a slower, high-dimension representation for reranking. If your embedding model supports this, use it. If it does not, picking a fixed dimension that fits the corpus size is the right move. Avoid the trap of picking the highest dimension the model offers because it scored slightly higher on the benchmark. The benchmark did not run at your scale.

Hybrid Search Is Not Optional

Pure dense retrieval, where the only signal is embedding similarity, is the default in tutorials and the wrong default in production. By 2026 the consensus pattern is hybrid search: combine dense retrieval with a lexical signal, usually BM25 or its variants, and merge the results. Teams that do this consistently see measurable lifts on real-world queries. Teams that skip it consistently rediscover this lesson when their assistant fails to find the document containing the exact phrase the user typed.

The reason hybrid works is that dense embeddings and lexical search fail in opposite ways. Dense embeddings handle paraphrases, synonyms, and semantic similarity. They miss exact-match queries with rare terms. Lexical search handles exact matches and rare terms. It misses paraphrases. The two signals together cover both failure modes, and the resulting retrieval is more robust than either alone.

The pattern that has worked is to run both retrievers in parallel, take the top-k from each, and merge with a reciprocal rank fusion or a weighted score combination. The simplest weighting is to give each retriever equal weight and fuse by reciprocal rank, which produces solid results without any tuning. The tuned version weights the two signals based on the query type, but the simple version is good enough for most production systems and avoids the complexity of dynamic weighting.

The implementation cost is low. Most modern vector stores support a sparse index alongside the dense one, and the additional storage for the sparse index is small. The latency cost is also low, because the two retrievals run in parallel and the merge is a few milliseconds. The quality lift is real and shows up most clearly on the queries that pure dense retrieval was secretly failing on. If your retrieval pipeline is dense-only, adding a sparse component is the highest-leverage change available, and it is usually a half-day project.

What A Reranker Does, And Why You Probably Need One

A reranker is a model that runs on the top results from the initial retriever and reorders them by relevance to the query. The initial retriever, dense or hybrid, optimizes for recall: getting the right candidates into the top-k. The reranker optimizes for precision: making sure the most relevant candidates are at the top of that list, where the LLM will see them.

The reason rerankers exist is that the initial retriever is doing fast similarity matching against a vector index, and that matching is approximate. A bi-encoder embedding model produces one vector per document and one vector per query, then computes similarity. It is fast and scales to billions of documents. It is also limited, because the document and the query are encoded independently, without the model ever seeing them together. A cross-encoder, which is what most rerankers are, takes the query and a candidate document as a single input and produces a relevance score that takes both into account. It is much slower, because it has to run for each candidate. It is also much more accurate, because the model can attend to specific overlaps and interactions between query and document.

The production pattern is to use the bi-encoder for the first pass, retrieve the top 50 to 200 candidates, and run the cross-encoder reranker on that smaller set to pick the top 5 to 10 that go to the LLM. The bi-encoder handles the scaling problem. The cross-encoder handles the quality problem. Together they get you both, with a latency cost in the tens to low hundreds of milliseconds for typical reranker sizes.

The teams that ship without a reranker usually do so because the demo looked fine and the additional latency felt unnecessary. The teams that add a reranker after the fact almost always see a measurable lift in answer quality, especially on harder queries where the initial retrieval put the right document at rank 5 instead of rank 1. The LLM cannot prioritize a document the retrieval pipeline ranked low, and a reranker is the cheapest way to fix that ordering.

Picking A Reranker

Rerankers come in roughly the same three tiers as embedding models. Frontier APIs from major providers, open-weights cross-encoders, and specialized variants. The cost calculus is similar but the latency story is different. Reranking adds latency on every query, which means it sits in the user-perceived path. The choice of reranker is a tighter trade-off than the choice of embedding model, because embedding latency is paid once at indexing time while reranking latency is paid on every query.

The frontier rerankers are accurate and add real latency. They are the right choice for high-stakes retrieval where the latency budget can absorb a few hundred milliseconds. The open-weights rerankers are nearly as accurate and faster, especially when self-hosted on a GPU close to the application. They are the right choice for most production systems, particularly chat applications where the user is waiting on the response.

The other lever is reranker size. The same family often comes in multiple sizes, and the small variants are dramatically faster than the large ones with a small quality penalty. For most production systems, the small variant is the right starting point, and the upgrade to a larger variant happens only if the quality measurements justify it. The latency budget is real, and a 50-millisecond reranker that is 95 percent as good as a 250-millisecond reranker is the better production choice nine times out of ten.

The pattern that has worked when I am picking a reranker is to evaluate three to five candidates on the same eval set used for the embedding model, look at both the quality lift and the p95 latency, and pick the one that maximizes the quality-per-millisecond. The candidate list is small, the eval is fast, and the answer is almost always clearer than it looks before you measure.

Cost And Latency Budgets

A pipeline with hybrid retrieval and reranking has more moving parts than a pure dense pipeline, and each part has its own cost and latency profile. The discipline is to be honest about the budget at each stage and to allocate it intentionally.

The dense retrieval is the cheapest and fastest stage. It runs in milliseconds against a vector index, and the cost is dominated by the storage of the vectors themselves. The sparse retrieval is similarly cheap, with the storage cost of an inverted index that scales with the number of unique tokens in the corpus. Both run in parallel and contribute milliseconds to the latency budget.

The reranker is the expensive stage. A cross-encoder running on 50 candidates is a meaningful chunk of latency, and on 200 candidates it can dominate. The lever is the candidate count: rerank fewer candidates and the latency drops linearly. The right candidate count is the smallest one that still surfaces the correct document into the top-k after reranking, which is something the eval set can tell you. Most production systems land somewhere between 30 and 100 candidates, and the variance below that range is small.

The LLM call is the slowest and most expensive stage by far, and the retrieval pipeline's job is to keep its input small and relevant. A retrieval that returns five precise chunks lets the LLM run on a small input and produce a fast, focused answer. A retrieval that returns twenty mediocre chunks forces the LLM to read more, costs more in tokens, and dilutes the answer. Investing in retrieval quality is the same as investing in LLM cost reduction, and the LLM cost optimization story I covered earlier is downstream of how good the retrieval is.

Multilingual, Multimodal, And The Rest Of The Long Tail

Most embedding models are trained primarily on English. If your corpus or your queries are in other languages, you need a multilingual model, and you need to be honest about the quality trade. Multilingual models are usually slightly worse at any single language than a dedicated monolingual model, and the gap shrinks every year but does not close. For a single-language product, monolingual is the right choice. For a multilingual product, multilingual is the right choice, and the small quality gap is the price of language coverage.

Multimodal embeddings, where the model encodes both text and images into the same vector space, have matured to the point where they are useful in production for image-text retrieval and visual search. The trade-off is that a model trained on text-image pairs is usually worse at pure text-text retrieval than a dedicated text model. For products where images are central, multimodal embeddings are the right choice. For products where images are incidental, the right move is often two separate indexes, one for text and one for images, with the application deciding which to query based on the input.

The long tail of edge cases is the part where evals matter most. Numeric reasoning, chronological ordering, complex multi-clause queries, queries that mix exact matches with semantic intent. Each of these is a class where embedding-only retrieval can fail in ways that are not obvious until they show up in production. The defense is the eval set, again. Cover the long tail in your evals and the failures show up before the users find them.

How To Tune The Pipeline Without Breaking It

Embedding models and rerankers have a lot of knobs, and the temptation is to tune everything at once. The discipline is to tune one thing at a time, on a fixed eval set, with a measurement loop that takes minutes rather than days.

Start with the embedding model. Pick three candidates, run them on the eval, look at recall at the top-k that the reranker will see. Pick the best one and lock it in.

Move to the reranker. Pick two or three candidates, run them on the locked embedding model, look at the answer quality and the latency. Pick the one that maximizes quality within the latency budget.

Then tune the candidate count for reranking. Sweep from 20 to 200, plot quality versus latency, pick the knee of the curve. The knee is usually obvious. The temptation to rerank everything is rarely justified by the data.

Finally, tune the merge weights for hybrid retrieval, if you are running it. The default of equal weights with reciprocal rank fusion is usually within a percent or two of the optimum, and tuning past that is worth doing only if the gap shows up in evals.

The discipline that ties all of this together is the same one I covered for AI evals for solo developers, and it applies the same way here: build the eval first, run the eval on every change, trust the eval over your intuition. Retrieval is a place where intuition is consistently wrong, because the failure modes are subtle and the wins are often counter-intuitive.

What I Would Build From Scratch

If I were building a retrieval pipeline today, I would start with a strong open-weights embedding model in the bi-encoder tier, hybrid search combining dense and BM25 with reciprocal rank fusion, a small open-weights cross-encoder reranker on the top 50 candidates, and an eval set built from real user queries and corrected answers. The candidate count and the reranker size would be tuned by measurement. The frontier APIs would be in reserve for the case where the open stack hit a quality ceiling I could measure.

That stack is unglamorous. It is also the stack that production teams have converged on by 2026, because it works and because the trade-offs are honest. The interesting work in retrieval is no longer at the embedding model. It is at the chunker, where the unit of retrieval gets decided, and at the reranker, where the order gets fixed. The same chunking discipline I covered in RAG chunking strategies in production is the layer above this one, and the two layers together are most of what determines whether a RAG system is good or just demoable.

If your retrieval is producing the right kind of answer at the wrong rank, the fix is a reranker. If it is failing to find documents that contain the exact phrase the user typed, the fix is hybrid search. If it is finding the wrong documents entirely, the fix is the chunker or the embedding model, in that order. The patterns are mostly known. The work is in measuring carefully and resisting the urge to swap models when the actual problem is one layer up or one layer down.

The pipeline that ships in 2026 and still works in 2027 is the one with an eval set that grows when production surfaces a new failure class, a chunker that respects document structure, an embedding model picked on data and not on benchmarks, hybrid retrieval as a default, and a small fast reranker that earns its latency. None of that is novel. All of it is the thing that turns a retrieval demo into a retrieval product, and most teams are still one or two of these layers short of where they need to be.

Small Language Models In Production 2026: Where SLMs Beat Frontier Models, And Where They Quietly Fail

Alex Cloudstar — Tue, 05 May 2026 09:58:59 +0000

The first time I replaced a frontier model with a small one in production, the cost graph dropped by ninety percent and the on-call channel got quieter. The first time I tried to do that and broke the product, the cost graph also dropped by ninety percent, but the user complaints climbed in a way the dashboard did not catch for two days. Both runs taught me the same thing from opposite directions: small language models are a real production lever, and the lever does not move the same way for every task. The teams I trust have spent 2025 and into 2026 figuring out which tasks bend nicely under a small model and which tasks break the moment you try to save a dollar.

By small language model I mean roughly the 1B to 30B parameter range. Phi-4 size, Llama 3 8B size, Qwen 2.5 7B size. Models that fit on a single consumer or low-tier datacenter GPU, run at low latency without exotic infrastructure, and cost an order of magnitude less per token than a frontier model. The capability gap between these and the frontier has narrowed enough that the question is no longer "can a small model do this" for many tasks. The question is "is the gap small enough to matter for your specific use case." That is a different question, and answering it well is what this post is about.

This is what has worked, what has not, and what I would consider before swapping any frontier call for a smaller one in 2026.

What An SLM Actually Is, In Production Terms

The interesting line is not parameter count, it is deployment shape. A small language model is a model you can serve yourself, on infrastructure you control, at predictable latency and cost. A frontier model is one you call over an API, at the API's latency and pricing. The capability gap between the two is the headline. The deployment gap is where the actual product implications live.

When the model is yours to host, you control the latency. You control the rate limits. You control whether the data leaves your network. You can fine-tune. You can quantize. You can colocate the model with the rest of your stack and avoid a network round trip on every call. Those capabilities are not free. You are now responsible for the GPU bill, the deployment, the autoscaling, the monitoring, and the failover. The trade is real and the calculation is rarely the one teams expect when they start.

When the model is an API, you give up control and you get reliability and capability for the price. The frontier model is run by people whose only job is to run it. Your token cost includes a margin, but it also includes the on-call rotation, the multi-region failover, and the model itself. The trade is paying more per token to do less work yourself, and for many production workloads that is the right trade.

The production version of "should we use an SLM" is "is this workload high enough volume, low enough complexity, and stable enough in shape that owning the model is cheaper than renting it." If the answer is yes, an SLM is on the table. If the answer is no, the frontier API is almost always still the right call.

Where SLMs Beat Frontier Models In 2026

There is a clear class of tasks where a small model, fine-tuned or even just well-prompted, matches or beats a frontier model in production. Knowing the shape of these tasks is the key to picking the right ones to migrate.

Classification is the most obvious win. Sentiment, intent, topic, language, content moderation, routing decisions. These are tasks where the input is a chunk of text and the output is one of a known set of labels. A 7B model fine-tuned on a few thousand examples typically beats a frontier model on a fixed-label classification task, runs ten times faster, and costs ten times less. The frontier model is doing more work than the task needs. The small model is doing exactly what the task needs.

Extraction is the next clear win. Pulling structured fields out of unstructured text. Names, dates, amounts, IDs, sentiment per aspect. The same shape as classification but with multiple output fields. Fine-tuned SLMs are very good at this. The benchmark gap between a fine-tuned 8B model and a frontier model on a domain-specific extraction task is often within noise, and the latency and cost gap is enormous.

Reformatting and rewriting are good targets when the source and target are both in the model's wheelhouse. Convert this prose into bullets. Convert this CSV into JSON. Convert this email into a summary. The task is structurally simple and high volume, and the small model handles it cheaply. The frontier model is overkill.

Routing decisions inside an agent are a sweet spot. The "which tool should I call" decision can often be made by a small model with a tight prompt, faster and cheaper than asking the frontier model. The same goes for "is this query in scope" or "is this response complete." These are gateway decisions that fire on every request, so the cost savings compound.

Embedding-adjacent tasks like reranking and similarity scoring are not always SLM tasks in the traditional sense, but small dedicated models in this space have gotten very good. If your retrieval pipeline is calling a frontier model to rerank retrieved chunks, you are leaving money and latency on the table. A small reranker is a better fit and the gap is not capability, it is engineering effort to swap.

The pattern is that SLMs win on tasks that are narrow, high-volume, and tolerant of fine-tuning. They lose on tasks that are open-ended, low-volume, or that require the kind of broad world knowledge that only a frontier model has internalized.

Where SLMs Quietly Fail

The failures are the part that the benchmark tables do not show, because the benchmark tables are testing the tasks the small models are good at. The production failures live in a different shape of task.

Long-horizon reasoning is the first place SLMs fall apart. Multi-step planning, math with several intermediate steps, code that has to track state across many lines, agentic loops that span more than three or four tool calls. The small model can take any one step. It cannot reliably keep the chain coherent across many of them. By the fifth step, it has lost the plot, and the failure looks like a model that confidently does the wrong thing for reasons that do not match the trace.

Open-ended generation that has to be on-brand and competent is the second place. Long-form writing where the user expects the same quality as the frontier model. Customer-facing replies in a domain where tone matters. Content where the difference between "good" and "fine" is what the product is selling. A small model can do the work. The output reads like a small model did it, and users notice.

Anything that requires the model to know things it was not fine-tuned on. The frontier models have absorbed a huge slice of public knowledge in their pretraining. A 7B model has absorbed a smaller slice. Tasks that require recall of facts, especially current ones, are tasks where the SLM will hallucinate or wave generically while the frontier model gets it right. The gap closes for domains you fine-tune on. It widens for everything else.

Edge cases in classification. The 8B model is great at the ninety-five percent of inputs it has seen variants of in training. It is mediocre on the long-tail five percent. The frontier model is great on both. If your application sees a fat tail of weird inputs, the SLM will quietly misclassify the weird ones, and you will not notice until the metric for "how often the user clicked the wrong-result-feedback button" creeps up.

Reasoning over long context. The small model has a smaller working memory in practice, even when its advertised context window is large. Document QA over a fifty-page contract is a task where the frontier model still wins, because the small model loses focus partway through and starts answering from a few salient chunks instead of the whole document. The same task on a one-page input is fine. The threshold is real and worth measuring on your specific workload.

The failure mode that is hardest to catch is the slow drift. The SLM works on the launch dataset and degrades on the data that comes in three months later, because the data distribution shifted and the model was fine-tuned on the old shape. The frontier model is more robust to this kind of drift because its pretraining was broader. The SLM needs to be retrained or refreshed. If you do not have the pipeline to do that, you have a model whose quality drops slowly and whose problems show up in user complaints, not in your eval suite.

The Routing Pattern: Use Both

The teams that are getting the most out of SLMs in 2026 are not picking SLM or frontier. They are routing between them based on the task. The pattern is roughly:

A cheap, fast classifier or rule-based router takes the incoming request and decides whether it is a task an SLM can handle or one that needs a frontier model. Easy classification or extraction goes to the SLM. Open-ended, multi-step, or out-of-domain requests go to the frontier model. The router itself is often a small model, because deciding "is this complex" is a classification task in itself.

For requests that go to the SLM, you get the fast, cheap path. For requests that go to the frontier model, you pay for capability. The blended cost across the workload is dramatically lower than running everything through the frontier, and the quality on the hard requests is the same as it would have been without the router.

The pattern that I covered in the LLM router and model routing patterns post is the same shape. Routing by task, with a fallback path, is the production architecture that has won. Single-model architectures are now the exception, not the default, in any system that is cost-sensitive at all.

The trick to making the router work is to be honest about what each model can do, and to monitor the rate at which the router sends things to the wrong path. A router that sends ten percent of frontier-needing requests to the SLM is producing bad outputs on those requests, and the user does not know that the model decision was the cause. Instrument the router. Sample the SLM responses for human review. Be willing to tighten the router as you learn the shape of the wrong-path failures.

Fine-Tuning Is The Multiplier

A small model out of the box is okay. A small model fine-tuned on your task is often as good as a frontier model on that task. The discipline of fine-tuning is what unlocks most of the SLM win, and it is also the part that teams underspend on because it requires data, infrastructure, and a willingness to maintain a training pipeline.

The data piece is the hardest. You need labeled examples, in your domain, in the shape the model needs to produce. Some of that data you have. Some of it you have to generate or label. The frontier model is your best tool for generating training data: prompt it carefully, generate examples, validate a sample by hand, and use the rest to fine-tune the small model. This is the loop that makes fine-tuning practical: the frontier model trains the small model, the small model serves production, and the frontier model handles the long tail of requests the small model cannot.

The infrastructure piece is now solvable with managed services. The bar to fine-tune a 7B or 13B model has dropped enough that a single engineer can run the loop in a week. LoRA-style adapters mean you do not have to host a separate full model per fine-tune; you host the base model and swap adapters per task. That is a real architectural advantage that did not exist as cleanly two years ago.

The willingness piece is harder than the technical pieces. Fine-tuning is not a one-time job. The model needs to be retrained as the data drifts, as the task evolves, as new edge cases come in. The team has to own that pipeline, and the pipeline has to be on a schedule, with monitoring, with a rollback story. Without that, the fine-tuned model is a snapshot that gets stale, and the staleness shows up in production. The same maintenance discipline I covered in the LLM fine-tuning developer guide applies, and the teams that take it seriously are the ones who get sustained wins from SLMs.

Latency: The Quiet Reason To Switch

The cost win is the headline. The latency win is the one that changes the product. A small model running on a colocated GPU answers in tens of milliseconds for short prompts. A frontier API call is hundreds of milliseconds at best, sometimes more under load, with a long tail that is meaningfully worse. The difference is not in the marketing copy. It is in the user experience.

For interactive features where the model is on the critical path of a user action, a sub-100ms response feels like an interaction, and a 500ms response feels like a wait. The same feature with a small model can be enabled in places where a frontier model could not. Autocomplete. Inline suggestions. Real-time classification. These are features that exist or do not based on the latency budget.

For batch and background workflows, the latency difference matters less, but throughput differences are large. A self-hosted small model can run hundreds of concurrent requests on one GPU. The frontier API has rate limits. For high-volume offline work, the SLM throughput advantage compounds with the cost advantage and produces savings that are hard to ignore.

The latency story has a wrinkle: cold starts. A self-hosted model on autoscaling infrastructure has cold starts, and a cold start on a 13B model loading into GPU memory is not trivial. The pattern is to keep at least one warm replica per region and to be careful about scaling-to-zero on user-facing paths. The cost of one warm replica is small. The cost of a thirty-second cold start in front of a user is large.

Cost: The Math Is Different Than You Think

The naive cost comparison is per-token API price versus GPU hourly cost divided by tokens served. That math is right but incomplete. The full picture includes the engineering time to ship and maintain the SLM stack, the cost of the fine-tuning pipeline, the cost of the eval and monitoring infrastructure, and the cost of the inevitable migration when the base model gets superseded by a better one.

For low-volume workloads, the frontier API wins on total cost. The fixed costs of running your own model are larger than the per-token savings until the volume is high enough. The crossover point varies by workload, but for most teams it is somewhere north of a million tokens per day on a sustained basis. Below that, paying the API is the right call, and the engineering effort is better spent elsewhere.

For high-volume workloads, the SLM stack starts winning, and the win compounds with each layer of optimization: quantization, batching, KV caching, request scheduling. By the time you are running real volume on dedicated hardware, the per-token cost is a fraction of the API price, and the question is whether you have the engineering bandwidth to keep that stack running well.

The hidden cost is the migration cost when the base model improves. The Llama 3 fine-tune you shipped last year is now behind a Llama 4 base model on the same task. Migrating means retraining, re-evaluating, redeploying. That is a quarter of work, not a sprint. Build the pipeline so that the migration is as automated as you can make it, because there will be more of them.

What I Would Build Today

If I were starting a new AI product in 2026, I would default to a frontier API for v0. The capability gap is large enough at the start that owning the model is a distraction from product work. The cost will not matter at v0 volume. Ship the product, get users, learn what the workload actually looks like.

After v0, I would profile the workload by request type. The high-volume, narrow tasks are the ones to migrate first. Classification, extraction, simple reformatting, routing decisions. These are the tasks where an SLM is reliably as good or better, and where the per-request savings compound to real money.

I would keep the frontier model in the loop for the long tail. Open-ended generation, complex reasoning, multi-step agent flows, anything where the SLM is not yet matching the bar. Route by request shape. Be honest about which tasks are which. Update the routing as the SLMs get better, because they will.

I would invest in the fine-tuning pipeline early once the migration starts paying off. The pipeline is the multiplier. Without it, the SLM is mediocre and the team gets discouraged. With it, the SLM is competitive and the cost and latency wins are real.

The other thing I would invest in early is the monitoring and rollback story. SLMs fail differently from frontier models. The failure modes are subtler. The eval suite has to catch them. The rollback path has to exist. The same observability discipline I covered in AI agent observability and debugging in production applies double, because the SLM is a model you own and the responsibility for its quality is yours.

The frame that has held up across a year of running this is that SLMs are a tool, not a strategy. The strategy is "use the right model for the task." The SLM is one of the models. The frontier is another. The router is the part of the system that knows which is which, and the team's job is to keep the router honest and the SLMs sharp. The teams that did that in 2025 are the teams whose AI features are profitable in 2026. The teams that did not are the teams whose AI line item is the largest one on the cloud bill, and who are now scrambling to migrate under deadline pressure.

If your AI product is a single API call to a frontier model on every request, the next quarter's work is probably about replacing some of those calls with smaller models you own. The capability has caught up enough to make it worth doing. The patterns are clear enough to make it doable. The hard part is being honest about which tasks are SLM tasks and which are not, and that honesty is the work that does not show up in the model card.

Forem: Alex Cloudstar

Claude's June 15 Pricing Split: What Indie Devs Actually Need to Do Before the Meter Starts

What Actually Changes On June 15

The Cost Math Nobody Is Showing You

Who Wins, Who Loses

What To Change Before June 15: A Checklist

Should You Just Skip The Credit Pool And Go Direct API?

A Brief Word On Trust

The Real Lesson

Zero-Downtime Postgres Migrations: The Mistakes That Locked My Production Database

The Mental Model You Need First

The Five-Step Migration That Made Me Stop Breaking Things

Adding A Column Safely

Renaming A Column Without Downtime

Adding An Index Without Locking The Table

Changing A Column Type Without Rewriting The Table

Constraints Without Lock Storms

The Backfill Script That Will Not Crash Your Database

The Tools That Make Expand-And-Contract Less Painful

What I Refuse To Run During Business Hours

Connection Pools And Migration Coordination

Rollback Plans That Actually Work

What I Run In Production

What I Would Tell You If You Asked

Feature Flags For Solo Developers in 2026: When You Need Them, When You Do Not, And What I Actually Use

What Feature Flags Actually Do

When You Do Not Need A Flag

When You Actually Need A Flag

The Three Setups That Cover Solo And Small-Team Scale

Setup One: The Config File

Setup Two: The Database Row

Setup Three: A Hosted Service

The Wrapper That Makes Migration Easy

The Hygiene That Keeps Flag Sprawl In Check

The Pattern For Rolling Out Without Pain

Common Mistakes That Hurt

The Honest Cost Picture

What I Run In Production

What I Would Tell You If You Asked

Stripe Webhooks in Production: Idempotency, Retries, and the Mistakes That Cost Me Real Money

What Stripe Actually Sends You

Signature Verification, And Why You Cannot Skip It

Idempotency Is The Whole Job

The Reordering Problem

Respond Fast, Process Later

Race Conditions With Your Own Application

What To Listen For And What To Ignore

Testing Webhooks Without Losing Your Mind

What I Run In Production

What I Would Tell You If You Asked

Server-Sent Events vs WebSockets in 2026: When Each One Actually Wins

What Each One Actually Is

The Reflex To Reach For WebSockets

When WebSockets Actually Win

When SSE Actually Wins (Which Is Most Of The Time)

The Code Comparison That Made It Click For Me

Auth, Which Is Where People Get Stuck

The Connection Limit Trap

What About Long Polling

Frameworks Are Quietly Choosing For You

What I Run In Production

What I Would Tell You If You Asked

TypeScript at Scale: Why Your tsc Takes 90 Seconds and How to Fix It

The First Question: Where Is the Time Going

The Patterns That Actually Cost You

Deeply Nested Generic Inference

Conditional Type Recursion in Hot Paths

Massive Discriminated Unions

as const Object Literals With Heavy Inference

Library-Caused Slowdown

Project References, the Right Way

Compiler Settings That Matter for Speed

Editor Performance Is a Different Problem

What Project Corsa Changes, and What It Does Not

A Concrete Diagnostic Loop

What I Would Tell You If You Asked

Passkeys in Production: What I Wish I Knew Before Replacing Passwords

What Passkeys Actually Are, Stripped of Marketing

The Protocol in One Page

The Account Model You Actually Need

`as const` Object Literals With Heavy Inference

`await using` and `Symbol.asyncDispose`

`AbortSignal.any()` for Composed Cancellation

`AsyncLocalStorage` as Context Carrier