Forem: Takayuki Kawazoe

"Why we told our AI plan generator to never split tests into a separate sub-task"

Takayuki Kawazoe — Tue, 26 May 2026 08:24:33 +0000

The run was marked failed. Two of the three sub-tasks merged cleanly. The third one, titled "Add tests for is_sent=True treated as read in test_inbox_service_unread_propagation.py", never finished. CI retried up to the cap, all failures, then gave up. The whole plan was thrown out even though two thirds of the actual code had already landed on green branches.

The fix turned out to be one paragraph in one prompt. Not a code change in the dispatcher. Not a new CI flag. Just a rule that says: if a sub-task introduces or modifies code, the unit tests for that code go in the same sub-task. The "tests as their own task" pattern is forbidden.

Here is what I observed, why the AI reached for the wrong decomposition, and the exact prompt rule that closed the gap.

What actually happened

Codens Purple has what I call a plan generator. That is the part of the system that takes one PRD or bug report and breaks it into sub-tasks. Each sub-task then gets dispatched on its own Git branch, runs in parallel with the others, and merges back to the base when its CI goes green. The piece of the plan generator that actually does the splitting is driven by what we internally call the analyze prompt, which is just the system prompt the model sees when it decides "how should this work be carved up."

On a project called opsguide-back, for one bug, the plan generator produced this triple:

1. Add tests for is_sent=True treated as read in
   test_inbox_service_unread_propagation.py
2. Fix _store_messages_batch in inbox_service.py to mark
   self-sent messages as read
3. Add sender_email exclusion to _build_activity_unread_count
   in resolver.py

If you read that as a human reviewer, it looks great. Three clean concerns, easy to review independently, no overlap in files touched. Textbook parallelization.

It died anyway. Sub-tasks 2 and 3 both finished and merged. Sub-task 1, the test-only one, kept failing CI. Its branch contained only changes to the test file. The implementation functions it was asserting against did not exist on that branch yet, because the implementation lived on a sibling branch that this branch could not see. pytest collected the test, tried to import the helpers, and the asserted behaviour was simply not present. Retry, retry, retry, give up. Run failed.

The cruel part is that if the merge order had happened to put the test branch last, after both impl branches had landed, the test would have passed. But we cannot guarantee that order. Each sub-task races on its own.

Why the AI did this

This was not a model failure. The model did exactly what every general-purpose decomposition heuristic would tell you to do. Split tests from implementation so they can move in parallel. That is correct advice for a human team, where the reviewer and the merge queue keep the order honest, and where a developer can rebase a test PR onto the impl PR before merging.

The thing the model did not know is that our dispatch system runs each sub-task on its own isolated branch. Each sub-task sees the base branch plus its own changes, and nothing else. Sibling sub-tasks' work is invisible to it until merge time. That is not a universal fact about software development. It is a property of how we, specifically, run parallel agents. Nothing in the model's training corpus tells it that this constraint applies, because most of the corpus is about human teams.

So the model reached for the most-cited decomposition pattern it knew, which happens to be wrong for our dispatcher. The mistake lived in the prompt. We had been asking the model to plan parallel work without telling it the actual rules of "parallel" in our system.

This is the general shape of a lot of AI agent failures I have hit. The agent is not bad at reasoning. It is reasoning correctly in the wrong universe, because the prompt forgot to describe the universe.

The fix

We added this block to the analyze prompt. It is the only change.

## CRITICAL: Tests live with their implementation

NEVER split tests for new behaviour into a separate sub-task. Every sub-task
that introduces or modifies code MUST also add the unit tests for that code
in the SAME sub-task. The pattern "Sub-task A: implement X / Sub-task B:
add tests for X" is FORBIDDEN.

Title heuristic: if you are about to write a sub-task title that starts
with "Add tests for ..." or "Write tests for ...", STOP and merge it
into the impl sub-task whose code it tests.

Two things are doing the work here. The first is the explicit "FORBIDDEN" framing. The second, which I think matters more in practice, is the title heuristic. The model writes the title before it writes the body. If we can get it to catch itself at the title stage, the bad plan never gets generated in the first place, so we do not have to rely on a later pass to repair it.

We also rewrote the few-shot examples in the same prompt. Before, the example impl sub-task's ## Steps section only listed source-code file edits. After, every example impl sub-task lists the implementation file edit and the test file edit side by side. Roughly:

 ## Steps
 1. Edit src/inbox_service.py: in _store_messages_batch,
    set is_read=True when message.sender_email == account_owner_email.
+2. Edit tests/test_inbox_service_unread_propagation.py:
+   add unit test asserting is_sent=True self-messages count
+   as read.

That tiny diff is the part that changes behaviour. Models pattern-match very strongly on few-shot examples. If every example shows tests bundled with impl, the model produces the same shape.

Since the rule went in, the plan generator has stopped emitting "Add tests for ..." sub-tasks on new behaviour. The test-only failure mode is gone.

The exception

There is one shape of test-only sub-task that is still fine. If we are backfilling a regression test for code that is already on the base branch, the test-only sub-task is allowed. The reason is symmetrical to the original failure: when the implementation already exists on main, a test-only branch has everything it needs to compile, import, and assert. pytest finds the function, the test runs, CI passes.

The prompt calls that out explicitly so the model does not over-apply the new rule and start refusing legitimate backfill work. The line in the prompt is roughly "the rule is about new behaviour introduced in this plan, not about all test-only sub-tasks ever."

Generalizing

The bigger lesson is that AI agents reach for human-team decompositions by default, and that is fine when your dispatch system also behaves like a human team. Most agent dispatch systems do not. Ours runs sub-tasks on isolated branches with no cross-visibility. Some teams run agents in long-lived shared worktrees. Some serialize. Each of these creates its own invisible constraint on what can and cannot be split.

The agent does not know which one you have. It cannot infer it from the codebase, because none of those constraints are encoded in the code. They live in the dispatcher.

So the work, when you start letting an agent plan parallel sub-tasks, is to spend prompt tokens drawing the line between what can be split and what cannot. For us that line was: tests for new code live with the new code. For someone else it might be: never split a migration from the code that depends on it. Or: never split a config change from the deployment that consumes it. The shape of the rule depends entirely on your dispatcher, not on the model.

The pattern I would suggest is to add a single "CRITICAL" section to the planning prompt that enumerates the constraints your dispatcher imposes. Use a title-stage heuristic so the model self-rejects bad plans before generating the body. Rewrite the few-shot examples to demonstrate the right shape, because that is what the model actually copies.

We rebuild Codens with Codens. Every prompt rule like this one came from watching a real run fail and adding the one sentence that would have prevented it. If you want to see how the parallel planner works end to end, the English landing page is at https://www.codens.ai/en/.

"Why your Playwright screenshots show for Japanese / Chinese / Korean text, and the 3-line Dockerfile fix"

Takayuki Kawazoe — Mon, 25 May 2026 06:44:17 +0000

I opened the screenshot artifact for our codens.ai landing page smoke test and the page was full of square boxes. Where the Japanese hero copy should have been, there was a row of □□□□□. Where the feature names were, more boxes. The nav looked like an ancient artifact from a half-decoded file.

The page itself was fine. I had the dev server open in another tab and the Japanese rendered perfectly. The problem was inside the Playwright container.

Three lines in the Dockerfile fixed it:

    fonts-noto-cjk \
    fonts-noto-cjk-extra \
    fonts-noto-color-emoji \

That is the entire fix. If you only came for the answer, you can close the tab now. If you want to know why this happens and where else it will bite you, keep reading.

What is actually happening

The official Playwright Docker image (and most slim base images people build on) only installs Latin fonts. In our case it was fonts-liberation plus fonts-dejavu-core. That is enough to render English, most European languages, basic punctuation, and not much else.

When Chromium tries to paint a character it has no glyph for, it does the only thing it can do. It draws the missing-glyph placeholder, which on most systems is that hollow rectangle people call a tofu box. The character code is correct. The DOM is correct. The page is correct. The screenshot rendering side just has no shape to draw.

This is the part that confuses people the first time. The browser is not broken. The test is not broken. The page is not broken. The container does not have the font installed, so when the screenshot is composited there is nothing to fill the box with.

You can verify this in two seconds. SSH into the container, run fc-list | grep -i cjk, and you will see an empty result. That is the whole story.

The fix

Three apt packages, added to whatever RUN apt-get install block already exists in your Dockerfile.

Before:

RUN apt-get update && apt-get install -y \
    fonts-liberation \
    fonts-dejavu-core \
    && rm -rf /var/lib/apt/lists/*

After:

RUN apt-get update && apt-get install -y \
    fonts-liberation \
    fonts-dejavu-core \
    fonts-noto-cjk \
    fonts-noto-cjk-extra \
    fonts-noto-color-emoji \
    && rm -rf /var/lib/apt/lists/*

What each one buys you:

fonts-noto-cjk is the main package. It covers Japanese kana, the Han characters used in both Japanese and Simplified Chinese, and Korean Hangul. This is the one that fixes most of the boxes.
fonts-noto-cjk-extra covers the long tail. Traditional Chinese variants, less common Han glyphs, characters that show up in proper nouns. Worth including because the cost is small and you do not want to debug a single rare character later.
fonts-noto-color-emoji is the one people forget. If your page has any emoji, you will get tofu for those too. Most modern marketing pages have at least a checkmark or a sparkle somewhere.

Image size impact is about 70 MB on a Debian or Ubuntu base. CJK font files are large because there are tens of thousands of glyphs. If you are squeezing every megabyte you can use the smaller variable-weight subset, but for a CI image used by a test runner the 70 MB is irrelevant.

I shipped this in commit 40422650 for Codens Blue, our QA agent. Rebuilt the image, reran the same smoke test, and the screenshot came out with actual readable Japanese.

Why you only notice after the fact

This is the annoying part. Nothing in your test suite tells you the screenshot is broken.

Unit tests pass. The page renders correctly when a human visits it. The Playwright test reports green because the test only checks that the page loaded and the screenshot was saved. CI is happy. The artifact thumbnail in the GitHub Actions UI is tiny and you cannot tell tofu from text at that size.

You notice when someone opens the screenshot to share it. A designer asks for the latest LP screenshot to compare against a Figma mock. A stakeholder pulls a screenshot for a Slack thread. A regression alert fires and you open the diff. That is when the boxes show up and someone asks why the page is full of squares.

You can technically assert against tofu rendering inside the test. Sample a region that should contain CJK text, check that not every pixel in that region is identical white, fail if it looks suspiciously uniform. I have seen people do this. The implementation cost almost never beats the cost of just installing the fonts once. Three lines of Dockerfile beats a hundred lines of pixel sampling logic.

The same trap is everywhere

Playwright is just the messenger. Anything that wraps a headless Chromium in a Docker container has this problem if the base image lacks CJK fonts.

Puppeteer, pyppeteer, playwright-python, Selenium with headless Chrome, any custom screenshot service built on chrome-launcher, server-side rendering pipelines that use headless Chrome to generate Open Graph images. Same root cause every time. Same fix every time.

If your product touches any audience outside Latin script, default to installing the CJK and emoji fonts in your base image. Treat it as part of the container setup, not as a thing you wait to hit. The cost is 70 MB and three lines. The cost of not doing it is some future Slack message that says "why is the page full of boxes" and then an afternoon of confused debugging.

Wrap

That is the whole thing. Three apt packages, one rebuild, done. If you are running Codens Blue or any other screenshot-based QA flow against a multilingual page, this is the first place to look when boxes appear.

If you want to see the actual landing page these screenshots are taken from, it lives at https://www.codens.ai/en/.

"Adding Cursor Composer 2.5 as a third executor lane: 10x cheaper than Opus at comparable scores, but smoke tells a different story"

Takayuki Kawazoe — Mon, 25 May 2026 00:00:19 +0000

A roughly tenfold per-task cost drop at comparable accuracy is one of those numbers you do not get to ignore for very long. Composer 2.5 published SWE-Bench Multilingual figures in the same neighborhood as Opus, and the per-attempt API cost is about an order of magnitude lower. For an agent harness that runs hundreds of attempts per project per week, a 10x cost compression on a viable lane reshapes the unit economics enough to justify a real integration, not just a spike.

So I shipped Composer 2.5 as a third executor lane in Codens Purple, the orchestration service that decides which model runs each task. Codens was already running two lanes side by side: Claude via the raw Anthropic API and a self-hosted Qwen deployment. The third lane went in over two days, May 23-24, across a Phase 1 skeleton commit, a Phase 2 SDK wire, an ECS Fargate task definition change, an IAM credential isolation fix, and a one-project canary toggle.

Then I ran a smoke pass. 16 failed out of 25 attempts across v4 through v17. The integration works. The benchmark numbers are not the production numbers. This is the writeup of both halves: what shipped, and what the smoke phase actually told me.

Why a third lane at all

The case for a third lane is the same case I made earlier this year for the per-model retry cap pattern. Each model has its own failure shape and its own cost curve. Pinning the whole harness to one provider means inheriting one bill, one rate-limit policy, and one definition of "the model got it wrong."

Composer 2.5 changes the cost arithmetic in a way that matters at our retry caps. Codens retries each task per model up to a cap: claude=3, qwen=6, composer-2.5=5 for now. At cap=3 with Opus, the worst-case attempt cost dominates the per-task budget. At cap=3 with Composer 2.5 at roughly 1/10 the per-attempt rate and comparable accuracy, the worst-case attempt cost drops by roughly an order of magnitude even before factoring in higher-than-Opus first-pass success. That math is what made integration time worth spending.

The optionality argument also got stronger recently. Anthropic clarified that the Agent SDK and claude -p CLI workflows are not covered by subscription plans for agent use cases, which validates the API-direct path Codens already runs on. Adding a Cursor lane on top of that is the same bet, extended: do not get pinned to any one vendor's pricing or policy, and keep the harness free to route tasks to whichever lane wins on cost and reliability for the workload at hand.

Executor lane design

The pleasant part of the design was that PurpleTask.execute_model already supported per-task model switching, and PurpleProject.default_model already let an entire project pin a model. Adding the third lane was not an architecture change. It was an enum value plus a new runner module.

class PurpleTask(Base):
    # existing fields elided
    execute_model = Column(
        Enum("opus", "sonnet", "qwen", "composer-2.5", name="execute_model"),
        nullable=True,
    )

The runner dispatcher already had two branches: runner_claude.py for the Anthropic API path that wraps the claude -p CLI, and runner_qwen.py for the self-hosted endpoint. The third runner, runner_cursor.py, slots in next to those two with the same input contract (task spec, workspace dir, env) and the same output contract (workspace diff, structured result, failure_reason on non-zero).

I split the change into two commits on purpose. Phase 1 was a validation-only runner that exited non-zero on every invocation, plus the enum addition. Shippable in isolation, zero behavior change for existing tasks because nothing pointed at composer-2.5 yet. Phase 2 was the actual SDK call. Splitting like this means each commit can be reverted on its own, and the enum migration is not coupled to any SDK behavior question.

I have learned the hard way that bundling an enum addition with the runtime that depends on it produces commits you cannot cleanly revert when the runtime turns out to be the problem. Phase 1 / Phase 2 splits are cheap insurance.

Phase 1: the skeleton

Phase 1, commit 5a575031, did three things and nothing else. It added composer-2.5 to the model enum, registered runner_cursor.py in the dispatch table, and made the runner validate its inputs and exit non-zero with a clear "not yet implemented" failure_reason. The migration ran on staging. The dispatch table picked up the new entry. No production task pointed at the new lane, so the runner was never invoked in the live path.

This is the kind of commit that looks like it does nothing and is actually doing the most important thing: proving the surrounding plumbing is correct before the new code can hide bugs in the plumbing. If Phase 2 had landed in one shot and the SDK call had failed, I would have spent the next hour trying to figure out whether the failure was in the dispatcher, the env wiring, the IAM role, or the SDK. With Phase 1 already in production for an hour, the only thing Phase 2 could break was the SDK call itself.

Phase 2: wiring the Cursor SDK

Phase 2, commit b1e7ebcd, is where the real work happened. The Cursor Python SDK exposes a session that walks Bridge → Client → Agent → events. The shape in the runner is:

bridge = await Bridge.launch(...)
client = Client(bridge=bridge)
agent = await client.agent.create(
    model=ModelSelection(id=model_id),
    local=LocalAgentOptions(cwd=workspace_dir),
)
run = await agent.send(prompt, SendOptions(...))
async for event in run.events():
    handle_event(event)

The local=LocalAgentOptions(cwd=workspace_dir) part matters: Cursor agents can run remotely or locally, and for Codens the workspace is already mounted into the Fargate task at a known path, so local-mode keeps the file IO inside the task and avoids round-tripping the diff over the wire. agent.send returns a run handle whose events() async iterator yields the structured event stream we already know how to consume from the Claude path. The translation layer in runner_cursor.py normalizes Cursor's event shapes to the internal event schema that the rest of Purple already speaks.

CURSOR_API_KEY is the obvious blocker. We store it in AWS Secrets Manager at purple-codens-prod/cursor-api-key and inject it into the per-task environment so the SDK picks it up automatically. The ECS Fargate task definition change in PR #1156 (commits d1ef5db4 and 656f42e4) exposes the secret ARN as an environment variable:

{
  "name": "CURSOR_API_KEY_SECRET_ARN",
  "value": "arn:aws:secretsmanager:ap-northeast-1:...:secret:purple-codens-prod/cursor-api-key"
}

The entrypoint script resolves it before launching the runner:

CURSOR_API_KEY=$(aws secretsmanager get-secret-value \
    --secret-id "$CURSOR_API_KEY_SECRET_ARN" \
    --query SecretString --output text)
export CURSOR_API_KEY
exec python -m purple.runner_cursor "$@"

This part is where I introduced a bug I want to flag specifically, because it is the kind of bug a multi-tenant SaaS should never ship. Initial commit pulled the secret using whatever AWS_PROFILE was active in the task environment, which in some code paths inherited from the customer's connected AWS credentials. That is wrong in a multi-tenant harness. The fix in commit 6210a052 makes the entrypoint use the ECS task IAM role for the Secrets Manager call, never the customer's profile. Customer credentials are scoped to customer resources only. Platform credentials, including our Cursor API key, must resolve through the task role. Easy mistake, important fix.

The canary procedure

I do not trust new lanes in production until a real project has run on them for at least a day. The canary procedure (commit d6fe3cb3) is intentionally small: flip purple_projects.default_model = 'composer-2.5' on exactly one internal Corevice-org project, dogfood it, and watch the metrics. Every other project stays on whatever model they were already on, which means the canary is fully isolated.

The SQL is one row:

UPDATE purple_projects
SET default_model = 'composer-2.5'
WHERE id = '<internal-project-id>';

Rollback is the same statement with the prior value. No code deploy involved. This is one of the upsides of keeping model selection as runtime data rather than baking it into deploy artifacts: rollback is a transaction, not a release.

The comparison axes we track on the canary versus the same project's last 30 days on Opus:

Completion rate (task finishes without exhausting retries)
Verify pass rate (Codens verify steps succeed against the final diff)
Wall time per task
Cost per completed task

The point of the canary is not to certify the lane is good. The point is to surface the failure modes that benchmarks do not surface, before any real customer touches the new lane.

What the smoke runs actually showed

Across v4 through v17, the smoke pass ran 25 attempts on the canary project. Nine finished. Sixteen failed. That is a 36% completion rate on a workload where the equivalent Opus runs were sitting around 80%+. The benchmark numbers and the production numbers were not the same numbers.

Two failure modes accounted for almost all of the misses.

The Cursor SDK bridge dropped mid-session on a handful of long-running tasks. When the bridge dropped, the workspace diff in progress was lost, the run handle errored, and the runner reported a generic SDK exception. Salvaging the partial diff at the moment the bridge dropped was the obvious fix. Commit 0f95f020 catches the bridge-drop exception, snapshots whatever is currently on disk in the workspace, and feeds that diff into the retry attempt's context so the next attempt does not start from zero.

The other failure mode was uglier. When a task exhausted its retry cap, the runner reported failure_reason = "exceeded max executions (5)" and that was it. The operator on the other side had no visibility into why each of those five attempts had failed. The fix in the same commit (0f95f020) enriches failure_reason with the last attempt's actual error string. Now when the cap is exhausted, the operator sees "exceeded max executions (5): last attempt failed with: <real error>" and can route the task to a different lane or escalate.

Two smaller fixes shipped alongside. Commit 1be0614f surfaces the AWS CLI failure when the Secrets Manager call fails. Previously the entrypoint swallowed it silently and the runner started with an empty CURSOR_API_KEY, producing an opaque 401 from the SDK three seconds later. Now the entrypoint exits non-zero with the AWS CLI error before the runner even starts. Commit 64af2b50 cleans up the per-task env injection and drops a message field collision between the Cursor event schema and our internal one that was causing some events to lose their payload during translation.

None of these fixes turn Composer 2.5 into a production-grade lane for our workload. They turn it into a lane I can operate, observe, and reason about while we keep iterating on it. The canary stays canary. Customer-facing projects stay on the lanes they were on.

Closing

Multi-lane executor architecture is a hedge, and like all hedges, the value shows up only when you actually need it. Composer 2.5 may or may not become a default-routing lane for Codens in the coming weeks. The 10x cost compression is real, the benchmark numbers are real, and the smoke phase is also real. The point of the canary procedure is that we get to find out which of those three numbers matters for our workload before any customer feels it.

The integration cost was a Phase 1 skeleton, a Phase 2 SDK wire, an ECS task definition change, an IAM fix, and a one-row SQL toggle. The integration value, regardless of whether Composer 2.5 sticks, is one more lane the harness can route through next time a pricing announcement or a model release reshapes the cost curve. That optionality is what an AI dev harness is supposed to give you.

Codens is at https://www.codens.ai/en/ if you want to see what a multi-lane harness for autonomous code repair and QA looks like in production.

"Centralizing billing across 5 products triggered a 403 nobody saw coming"

Takayuki Kawazoe — Sat, 23 May 2026 10:44:56 +0000

We flipped USE_BCP=true on Red at 14:02. The first 403 hit Sentry at 14:06. By 14:11 the pattern was clear: any user who tried to do something that touched org-level credit (granting a teammate access, viewing the org credit balance, kicking off a fix run under an org-scoped project) got a 403 back from the Red API, which had received a 403 from BCP, which had received a "not a member" from Auth.

Staging didn't catch it. I want to be honest about that part before anything else. Staging had two users in one org, both of which had been provisioned by me through the Auth admin path months ago, so their org memberships existed in Auth's org_members table by accident of history. Every code path I exercised in staging happened to read from a row that was already there. The bug only fires when a user accepts an org invitation on the product side after the cutover, and we had no synthetic flow for that in staging. Lesson noted, expensive way to learn it.

This post is about what actually broke, why the design wasn't wrong (the implementation was missing), and the three branches I considered for where org-membership authority should live before settling on the one that produced the bug.

Phase H: why centralize billing now

Codens is five products plus two platform services. Red does auto-fix, Blue does QA, Green does PRDs, Yellow is the engineering activity ledger, Purple is the orchestration layer. Auth is the identity service. BCP, the Billing Control Plane, is the newest piece and the subject of this story.

Until last quarter, each product calculated its own credit consumption. That was fine when Red was the only product taking money. It became untenable around the time Green went into beta, because we had three different rounding rules, two slightly different definitions of "what counts as a billable run," and a support ticket pattern that boiled down to "my org's credit balance on Red doesn't match my org's credit balance on Blue and you charged me twice." Phase H of the architecture roadmap pulls all of that into BCP. Every product reads its credit policy from BCP, posts consumption events to BCP, and asks BCP "can this user/org afford this operation?" before starting work.

The cutover is gated behind two env vars per product:

USE_BCP=true
BCP_API_URL=https://api.billing.codens.ai

I cut one product at a time, starting with Red because it has the highest traffic and the most mature billing surface. Red PR #266 was the actual flip. Blue PR #233 and Green PR #411 followed once Red had been stable for a week. Yellow and Purple are scheduled for next quarter, both still on local credit math.

The cutover order matters for this story because the 403 only manifests on org-scoped operations. Red individual-account billing kept working perfectly. So did Blue and Green individual accounts. It was specifically the org-shared credit pool path that exploded, and only for users who had joined their org through the product-side invitation flow rather than through Auth's admin console.

Tracing the 403

The first instinct was "BCP is misconfigured." It wasn't. BCP logs showed clean inbound requests with the right org_id, the right user_id, the right requested operation. BCP then made an internal call to Auth: "is user X a member of org Y?" Auth returned false. BCP returned 403. Red returned 403. User saw 403.

The Auth log line was the clarifying one:

GET /internal/orgs/{org_id}/members/{user_id} -> 404

So Auth wasn't broken either. Auth was correctly reporting that user X was not a member of org Y, as far as Auth knew. I pulled the user out of the database. The user existed in Auth's users table. The org existed in Auth's organizations table. The link row in Auth's org_members was missing.

I went over to Red's database. The link row was there. Red had a row that said user X belonged to org Y, with the role and joined-at timestamp from the day the user accepted the invitation. Red had been authoritative for this relationship the entire time.

CDTSK-1392 captured the root cause. Auth Codens is supposed to be master of organizations and memberships, but each product had grown its own organizations and org_members tables back when each product was a standalone service. Invitation acceptance was handled locally by each product. The row landed in the product's database, and nobody told Auth. Pre-BCP, this didn't matter, because the product was the one authorizing org-scoped operations against its own tables. Post-BCP, BCP asks Auth, Auth doesn't know, 403.

The bug is not in the centralization. The bug is that we shipped centralization assuming a sync that didn't exist.

Three branches for where authority lives

Before writing the sync, I had to decide whether the sync was even the right answer. There are three reasonable places to put authority over org membership in a multi-product setup like ours.

Authority in the auth service. Auth is the master record. Every product holds a local cache (or a foreign-key shadow) and reflects changes back to Auth as they happen. This is what we have. It's the most conventional choice. The downside is the one we just discovered: every product-side write path that affects membership has to remember to call Auth, and forgetting is silent until something else (like BCP) starts depending on Auth being correct.

Authority in billing itself. BCP owns the org and member tables. Every product reads from BCP. This has the appeal of "the system that needs to know the truth owns the truth." It also means every product becomes hard-dependent on BCP being up to render a user's basic org context, which is a much bigger blast radius than billing being temporarily degraded. I didn't want every Red dashboard render to fail because BCP was deploying.

Authority distributed across products. Each product remains the source of truth for memberships that originate in that product. BCP, when asked to authorize an org-scoped operation, routes the membership question to whichever product owns the org. This sounds clever for two products. With five products, the routing table is a permanent piece of infrastructure that has to be updated every time a new product launches, and the question "who owns this org" is itself a piece of state that has to live somewhere central. You've reinvented the auth service, badly.

I chose branch one. The 403 wasn't evidence of a wrong choice. It was evidence that I'd shipped half of a choice. The half I shipped (BCP queries Auth) was correct. The half I hadn't shipped (products tell Auth about new memberships) was the gap.

The sync endpoint

The fix has two halves. Auth needs an endpoint that products can call. Products need to call it at the right moments.

On the Auth side, I added POST /api/v1/internal/organizations/{org_id}/members:upsert. The verb is upsert deliberately. The endpoint is idempotent and the products call it both on invitation acceptance and on role changes, so the handler has to be willing to create or update without the caller knowing which case applies. The response status differentiates: 201 if a new membership row was created, 200 if an existing row was updated.

Getting FastAPI to actually return 201 vs 200 from the same handler was the part that almost shipped broken. PR #124 was the fix. The original handler looked like this:

@router.post(
    "/organizations/{org_id}/members:upsert",
    response_model=UpsertOrgMemberResponse,
)
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> UpsertOrgMemberResponse:
    result = await use_case.execute(org_id, payload)
    return UpsertOrgMemberResponse.from_domain(result)

When you annotate the return as a Pydantic model, FastAPI takes over status code resolution and forces the default for the route (200 for POST in our config, or 201 if you set status_code= on the decorator). Either way you can't branch. You get one status for both the create and the update case, which silently broke the idempotency contract for any caller that wanted to distinguish.

The fix is to return JSONResponse directly so the handler controls the status:

@router.post("/organizations/{org_id}/members:upsert")
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> JSONResponse:
    result = await use_case.execute(org_id, payload)
    status = 201 if result.created else 200
    return JSONResponse(
        status_code=status,
        content=UpsertOrgMemberResponse.from_domain(result).model_dump(mode="json"),
    )

You lose automatic OpenAPI response model inference, which is a real cost. You get correct semantics, which is a bigger gain. I document the response shape with responses={200: ..., 201: ...} on the decorator to keep the OpenAPI spec honest.

On the product side, Red PR #264 added the client call at the two moments membership state changes: invitation acceptance and role update.

async def accept_invitation(self, invitation_id: UUID, user_id: UUID) -> None:
    invitation = await self.invitations.get(invitation_id)
    await self.org_members.create(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    await self.auth_client.upsert_org_member(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    await self.invitations.mark_accepted(invitation_id)

The Auth call is not in a transaction with the local write, which is a deliberate choice and a place where I might be wrong. If the local write succeeds and the Auth call fails, we have drift. The current mitigation is a nightly reconciliation job that compares product org_members to Auth org_members and re-upserts anything missing. I'd rather drift and reconcile than block invitation acceptance on Auth being reachable.

Blue and Green shipped matching calls in their respective PRs.

Side cleanup: while I was in BCP I noticed that the bonus-credit endpoint silently dropped its grant when the grant_type field name on the wire didn't match what the receiver expected (the sender was using bonus_type, the receiver was reading grant_type, Pydantic accepted the payload with extra="ignore" and quietly inserted a row with the default grant type). PR #265 fixed the Red caller and PR #231 fixed Blue. Lesson there is to not use extra="ignore" on internal wire models, but that's another post.

Lessons

The biggest one is that staging only catches the bugs you have data for. The org-membership row was present in staging by historical accident, so the path that read it worked. I now provision a fresh, end-to-end test user (sign up, accept invitation, perform org-scoped action) as part of pre-cutover validation, scripted, not "remember to do it."

Cutting one product at a time was the only thing that kept the blast radius survivable. If I had flipped all three on the same morning the triage would have taken twice as long, because every signal would have been duplicated three ways. The order Red, then Blue, then Green wasn't load-balanced for anything clever — it was just the order I trusted the metrics on.

Naming the endpoint :upsert instead of overloading POST .../members mattered more than I expected. When the FastAPI status code issue came up, the conversation was "the upsert endpoint should return different codes for create vs update," which is a one-sentence problem statement. If the endpoint had been POST /members I'd have spent another hour arguing about whether 200 or 201 was correct in the abstract.

Wrap

The hardest part of centralizing anything across a product family is not the new service. The new service is straightforward, you write it, you deploy it, you wire up clients. The hard part is figuring out who is allowed to be the source of truth for the relationships the new service depends on, and then making every existing write path honor that choice. We chose Auth as the master for org membership, which I still think is right. We just hadn't enforced it everywhere it mattered, and BCP was the first dependent that actually cared.

If you want to see how the rest of the harness fits together, the English landing page is at https://www.codens.ai/en/. Yellow and Purple come onto BCP next quarter. I'll write that one up too, hopefully without the same shape of bug.

"When the AI gets stuck, the engineer fetches the same PRD via MCP and keeps going"

Takayuki Kawazoe — Wed, 20 May 2026 07:33:54 +0000

Last Tuesday I watched our auto-fix agent burn through three retries on a session-handling bug and surrender. The failure mode was honest. It tried, the diff broke a test we did not know existed, it tried again, the second diff fought with an old idempotency check, the third diff was basically the first one with renamed variables. Then it stopped. The bug report sat in our system marked analysis_failed, the proposed plan was there, the partial diff was there, and the engineer who had to take over was sitting in Slack scrolling.

That gap, the moment between "AI gave up" and "engineer is coding," is where most AI dev tools quietly cost more than they save. The engineer cannot just resume. They have to reconstruct what the AI was looking at: which PRD section, which kickoff decision, which root cause analysis, which files the bug report pointed at. The data exists. It just lives in five places and none of them are inside the IDE.

We shipped codens-mcp v0.7.5 partly to close that gap. The AI workflow inside Codens reads and writes the same PRDs, bug reports, kickoffs, and run logs that an engineer can now pull into Claude Code over MCP with one call. Same source of truth. Two surfaces. The handoff loses nothing.

The 80/20 reality nobody markets

The honest number for a well-tuned AI dev harness on real production code is somewhere between 80% and 90% of tasks completed end-to-end. The rest is novel business logic, conflicts with code the AI never saw, spec ambiguity that no amount of retry will resolve, and the long tail of edge cases that someone has to think through. I do not believe the "100% AI development" pitch and I do not think anyone shipping into real codebases does either.

The 20% is not the problem. The problem is the seam between the 80% and the 20%.

When the AI hands a task back, the human arrives without context. The PRD is in Notion. The bug analysis is in Sentry plus some chat thread. The kickoff decision that explains "we chose JWT not session cookies" is buried in a meeting recap. The engineer has to play archaeologist before they write a single line. And because the AI workflow has already burned through three retries, the next attempt starts from a worse position than if the engineer had been the first responder.

Most AI dev tools optimize the 80%. They get better at the part the AI was already good at. The 20% gets a "human-in-the-loop" label and a button that says "request review." That button does not solve anything. The engineer still has to find everything.

Codens treats the seam as the actual product. The 80% has to keep getting better, obviously. But the 20% is where the trust gets built or destroyed, and the only way to make it good is to make the takeover instantaneous.

One source of truth, two read paths

Every artifact the AI produces or consumes during a task is a first-class entity in Codens, stored in Postgres, owned by a project, scoped to an org. Green Codens owns the planning side: Consultation (the requirement-gathering conversation), PRD (the structured spec), Kickoff (the implementation plan with vision, scope, tech selection, milestones), Plan (the task breakdown). Red Codens owns the repair side: Bug Report (with the AI's root cause analysis attached), Bug Fix Plan (proposed impact scope and test requirements). Purple Codens owns execution: Run (the live event stream from a workflow), Logs.

The AI workflow writes to these entities through internal service calls. When the Green PRD AI generator finishes a section, it patches the PRD row. When Red's analyzer finishes, it attaches an analysis blob to the bug report. When Purple's runner emits an event, it goes to the run's event log. Nothing escapes into chat. Nothing depends on a human copying text from one tab to another.

The second read path is codens-mcp. It is a Python package that registers as an MCP server inside Claude Code (or any other MCP client). It authenticates with the same JWT the web app uses, talks to the same backend APIs that the AI workflow talks to, and exposes 38 tools that cover 137+ actions. When an engineer calls green_prd(action="get", prd_id=...), they get the same PRD bytes the AI agent read three retries ago.

The point is not "we have an API." Every product has an API. The point is that the AI workflow and the engineer use the same access shape against the same row. There is no "engineer-facing version" of the PRD that drifts from the "AI-facing version." There is one row. Both sides read it. Both sides can write it.

What codens-mcp actually exposes

The retrieval surface that matters for a takeover is small. An engineer who arrives at a failed task needs to know: what was being built, what decisions were already made, what the AI tried, and where it broke.

Install and authenticate once:

pip install codens-mcp
codens-mcp login

login runs Device Code Flow against the Codens auth service and stores a JWT at ~/.purple-codens/credentials.json. From that point every tool call carries the token automatically.

{
  "mcpServers": {
    "codens": { "command": "codens-mcp", "args": ["serve"] }
  }
}

Then the engineer, in their IDE, asks Claude to pull the bug report the AI was working on:

red_bug_report(
    action="get",
    organization_id="org_abc",
    bug_id="bug_2f8a"
)
# -> { id, title, description, severity, steps_to_reproduce,
#      expected_behavior, actual_behavior, affected_files,
#      analysis: { root_cause, evidence, suspected_files }, ... }

The action parameter pattern is the whole reason 38 tools cover 137+ operations. One green_prd tool handles create, list, get, update, delete, update_section, approve, submit_for_review, request_changes, archive, unarchive, link_notion, unlink_notion, and consistency-check. The tool descriptor that the model loads at startup is one short signature, not fifteen. (We have written separately about why that matters for context budget — the short version is that a five-server stack burns 55K tokens advertising itself before any work; codens-mcp burns under 5K for everything.)

For a takeover the engineer typically chains two or three calls:

green_kickoff(action="get", kickoff_id="kck_7a1c")
# -> vision, scope, non-goals, tech selection, milestones

green_plan(action="get_tasks", plan_id="pln_91de")
# -> ordered task list with status and dependencies

purple_run(action="get_status", run_id="run_be40")
# -> last events, failure reason, partial outputs

Three calls. Maybe forty seconds. The engineer now has the same view of the work that the AI had when it gave up, without leaving the IDE and without reading a single Slack thread.

Walking through a real takeover

The Tuesday session-handling bug. Here is what actually happened after the third retry failed.

The on-call engineer opened their IDE. Claude Code was already running with codens-mcp registered. They typed:

"Pull bug report bug_2f8a and the latest fix plan."

Claude called red_bug_report(action="get", bug_id="bug_2f8a") and red_bug_fix_plan(action="get_by_bug", bug_id="bug_2f8a") in parallel. Both returned in under a second. The analysis pointed at the auth middleware. The fix plan listed the three files the AI thought needed to change and the test it expected to pass. The engineer read it in maybe two minutes.

Then they asked:

"What did the last Purple run actually do?"

Claude called purple_run(action="get_status", run_id=...) and purple_run(action="subscribe_events", run_id=...) for replay. The event log showed exactly which test had failed on each retry and why the third retry had effectively reverted to the first. The AI had been bouncing between two incompatible local minima.

That was the engineer's "aha." The fix plan was conceptually right, but the test the AI was retrying against was wrong, written by an earlier feature, asserting a behavior the new spec explicitly changed. The engineer fixed the test, applied the AI's second-attempt diff with a four-line manual adjustment, and shipped it. From bug report open to PR merged: 23 minutes, including reading.

Without codens-mcp that same takeover would have been: open Sentry, search by ticket, copy stack trace, open Notion, find the PRD by title, scroll to the right section, open the chat thread where the kickoff lived, find the test naming pattern, grep the repo, then start coding. I have timed that path on myself. It is between 25 and 45 minutes before the first edit.

The tradeoff

The price of "one source of truth, two read paths" is schema discipline. Every artifact has to be modeled well enough that the AI workflow and the engineer both find what they need in it. You cannot let the PRD turn into a Markdown blob with five conflicting section conventions, because the AI's update_section action and the engineer's get_section reader both depend on the structure being honest. You cannot let the bug report become a free-text field with the root cause analysis stuffed at the bottom in a different format every time, because the takeover tooling that highlights analysis.suspected_files will silently miss them.

This is heavier upfront than the alternative, which is to let each side render its own view. The alternative loses every time. The drift between "what the PM thinks the spec says" and "what the engineer thinks the spec says" is, in my experience, the single biggest source of bugs in features that get partially built by an AI. The schema discipline pays for itself the first time a takeover succeeds in under thirty minutes.

The other cost is honest: we run on the Anthropic API direct path, with per-token billing and our own multi-model routing across Claude and Qwen. That gives us control over the escalation path (AI workflow to engineer manual takeover via MCP) independent of what any single platform decides about subscription-tier agent access. When the platform shifts, the takeover path does not move.

Wrap

Graceful degradation is the unappreciated half of AI dev tool design. Anyone can build an agent that succeeds on the easy 80%. The teams that ship into real production code earn their trust on the 20% where the agent gives up and a human takes over. The only way to make that takeover not feel like a downgrade is to make the data the human needs be exactly the data the agent had, in the same shape, one tool call away.

That is what codens-mcp is. The AI does most of the work. When it cannot, the engineer reads the same row.

Codens English landing: https://www.codens.ai/en/
codens-mcp on PyPI: https://pypi.org/project/codens-mcp/

"One JWT, five services, and the python-jose audience list trap"

Takayuki Kawazoe — Sat, 16 May 2026 04:34:53 +0000

audience must be a string or None.

That was the exception python-jose threw the moment our unified MCP server tried to talk to the second backend behind it. The token was valid. The signature checked out. The claims were correct. The library just refused to accept a list as the expected audience, and the JWT spec disagrees with the library on whether that should be a problem.

We run a single MCP server, codens-mcp on PyPI, that fronts five backends: Red (auto-fix), Blue (QA), Green (PRD), Purple (orchestration), and Auth. One MCP token, five destinations. When Claude calls a Red tool, the MCP server proxies an HTTP request to the Red backend carrying that same token. Same for Blue, Green, Purple, Auth. Each backend has its own primary audience for its own user-facing tokens, and we wanted all of them to also accept the MCP server's token without minting five service-specific JWTs per session.

This is the story of how that ran into a python-jose quirk, and the 12-line workaround we ended up shipping.

The architecture, briefly

Codens exposes 31 tools across the five product surfaces through one MCP server. From Claude's side it is a single connection. From the backends' side, each one sees a normal authenticated HTTP request with a bearer token in the header. The token is issued by the Auth service. Its aud claim is purple-codens-mcp, because the MCP server is the thing the user logged into when they connected their client.

Each backend already had its own audience for its first-party tokens. Green expects green-codens. Red expects red-codens. And so on. Those audiences were baked into the OAuth verifier and matched the audience claim on tokens minted by that service's own login flow.

We had two ways forward.

The first option: mint five tokens per MCP session. The MCP server logs into Red, Green, Blue, Purple, and Auth as the user, gets five JWTs, and selects the right one based on which tool the user invoked. This is conceptually clean. It also means five times the token issuance, five rotation surfaces, five sets of refresh flows to coordinate, and a routing layer in the MCP server that has to know which token belongs to which tool. None of that adds value.

The second option: mint one token, declare its audience as purple-codens-mcp, and teach every backend to accept that audience in addition to its own primary one. The MCP server holds one credential. Each backend keeps its primary audience for its own native flows and additionally trusts MCP-issued tokens. Rotation surface stays small. The routing logic in the MCP server disappears.

We picked option two. The plan was to add a per-service config that lists additional accepted audiences, expand the verifier to check against the union, and ship it.

Fix v1: pass a list to python-jose

The setting looked like this in every backend service:

class Settings(BaseSettings):
    OAUTH_AUDIENCE: str = "green-codens"
    OAUTH_ADDITIONAL_AUDIENCES: list[str] = ["purple-codens-mcp"]

The verifier change looked equally innocuous. python-jose's jwt.decode accepts an audience keyword. The naive reading of every JWT tutorial on the internet says you give it the expected audience and it checks the token's aud against that. So we built a list of accepted audiences and handed it over:

audiences = [self.audience] if verify_audience and self.audience else []
if audiences and settings.OAUTH_ADDITIONAL_AUDIENCES:
    audiences.extend(settings.OAUTH_ADDITIONAL_AUDIENCES)

payload = jwt.decode(
    token,
    self.secret_key,
    algorithms=[self.algorithm],
    audience=audiences if audiences else None,
)

This is the version we wrote, ran a quick local smoke test against, and pushed to the dev environment thinking the work was done. The shape of the change matched the shape of the problem. A list of allowed audiences in, an aud claim checked against that list, request accepted. Done.

The dev environment, of course, immediately disagreed.

The trap

The MCP server made its first call into Green and the request came back as a 401. The Green logs had the actual exception underneath the generic auth failure:

TypeError: audience must be a string or None

python-jose's jwt.decode does not accept a list for its audience parameter. If you pass one, it raises before it even looks at the token. The library has only ever supported single-string audience verification. There is no flag, no overload, no helper that takes a list.

RFC 7519 is unambiguous on the other side of this question. Section 4.1.3 defines aud as either a single case-sensitive string or an array of case-sensitive strings, and verification logic is supposed to check that the recipient identifies itself with at least one of the values present. The spec assumes set membership semantics on both ends. The token can have multiple audiences, and the verifier can accept multiple audiences. Whether either side is a list is a transport detail.

python-jose is one of the most-used Python JWT libraries. Most FastAPI tutorials reach for it without thinking. It is also old, and the maintainer activity is thin. There is a multi-year-old GitHub issue tracking exactly this limitation, with patches floating around in forks and pull requests that never merged. The library's behavior is what it is, and if you need list audience verification, you are on your own.

The honest read here is that the JWT spec describes capability and most libraries describe a comfortable subset of it. The subset is usually fine. The moment you do anything cross-service it stops being fine.

Fix v2: decode without audience verification, then verify manually

The fix that worked is to use python-jose for what it is good at, which is signature verification and claim decoding, and do the audience check ourselves. python-jose lets you disable individual claim checks through its options dict. verify_aud: False turns off the built-in audience verification entirely. The signature, expiry, issuer, and everything else still get checked. We just take responsibility for aud.

should_verify_aud = verify_audience and bool(self.audience)

payload = jwt.decode(
    token,
    self.secret_key,
    algorithms=[self.algorithm],
    options={"verify_aud": False},
)

if should_verify_aud:
    allowed_audiences = {self.audience, *settings.OAUTH_ADDITIONAL_AUDIENCES}
    token_aud = payload.get("aud")
    token_aud_set = (
        set(token_aud) if isinstance(token_aud, list)
        else {token_aud} if token_aud is not None
        else set()
    )
    if not (token_aud_set & allowed_audiences):
        raise InvalidTokenError(
            f"Invalid audience: token aud={token_aud!r}, expected one of {sorted(allowed_audiences)}"
        )

The set intersection does the entire job. token_aud_set & allowed_audiences returns a set of values present in both, and if that set is empty the token is for someone else and we reject it. If the token's aud is a single string we wrap it in a one-element set. If it is a list we convert directly. If it is missing we get an empty set and the intersection is empty, which fails closed.

One subtle thing about the order. We compute should_verify_aud before calling jwt.decode, not after, because we want the variable to capture the caller's intent independent of what python-jose returns. If someone passes verify_audience=False, we skip the manual check entirely. If they pass verify_audience=True but the service has no configured audience, there is nothing to verify against, so we also skip. The manual block only runs when there is something real to check.

The error message includes both the token's actual aud value and the sorted list of audiences we accept. When you debug an inter-service auth failure at 2am, the only thing worse than a 401 with no detail is a 401 that tells you nothing about the mismatch. The cost of formatting that message into the exception is zero and the time it saves is real.

The bonus pattern: decode and verify as separate steps

Once you have done this once, decoupling decoding from verification starts to feel like the right default for any JWT code that has to do anything non-trivial. The library is good at parsing the structure and confirming the signature. Your service is the one that knows which claims matter and what acceptance looks like.

The same pattern handles a bunch of adjacent problems. Token introspection for audit logs without re-running all the checks. Soft expiry where you log a warning at 90 percent of the lifetime instead of rejecting. Migration windows where you accept tokens signed with either the old or new key for a week. Custom claim validation that the library has never heard of. Whenever a future library bug lands in the issuer check or the expiry math, you have an escape hatch already in place because the verification logic is yours.

This is also the answer even if python-jose ships list audience support tomorrow. You do not lose anything by owning the audience check. You gain a place to put the next requirement that does not fit cleanly into a kwarg.

Wrap

Multi-service authentication keeps running into the gap between what JWT can do and what the convenient libraries actually do. The spec is generous. The libraries are opinionated. When you stitch services together, the opinions usually have to give.

The unified-token path was worth the workaround. One JWT, one rotation, one issuer, five backends that each know how to accept it. The cost was a dozen lines of manual verification in a shared OAuth module. We would make the same trade again.

If you want to see how Codens uses this on the agent side, the English landing page is at https://www.codens.ai/en/. The MCP server is codens-mcp on PyPI and it is what the agent connects to when it needs to talk to any of the five product surfaces.

"Claude 3, Qwen 6: why we set a different fix_verify retry cap per model"

Takayuki Kawazoe — Fri, 15 May 2026 07:58:45 +0000

Claude gets 3 retries. Qwen gets 6. Everything else gets 5.

That is the default fix_verify_retry_cap in Codens Purple right now, after a few weeks of staring at fix-rate curves per model. It started as one global cap, the same number for every model the workflow could route to. We changed it once we had enough production data to see that the same number was both too high for one model and too low for another at the same time.

This is the story of the split, what the loop actually does, and the few lines of code that put the policy in.

The fix_verify loop

Codens Purple runs an agent that proposes a code fix, then verifies it by running a test or a check, then decides whether to retry with feedback from the verification step. The loop looks roughly like this. Generate a candidate change, apply it, run the verify command, read the result. If verify passes, the loop is done. If verify fails, feed the failure output back into the next prompt and try again. Each retry is a new API call. Each API call costs per-token credits, and verify itself costs wall clock time plus whatever the test suite costs to run.

The retry cap is the integer that says how many of those iterations the loop is allowed before it gives up and surfaces the partial result to the user. A cap of 1 means one attempt, no retry. A cap of 3 means an initial attempt plus two retries. A cap of 6 means up to six attempts total.

The cap matters because the curve of "fix succeeds at attempt N" is not flat. It is heavily front-loaded. Most successful fixes succeed on attempt 1 or 2. The question for any given model is how long the long tail is, and how much of that tail is worth paying for.

When we had one cap for all models, that one number had to be a compromise. The compromise was bad in two directions at once.

How we got to multi-model

Codens started with Claude as the only model. Specifically, Claude via the Anthropic API, using a raw API key with per-token billing. Not the subscription, not the bundled tier. We are a multi-tenant product running thousands of small fix_verify cycles per day across many customers, and a subscription does not cleanly support that shape of workload. Per-token billing lets us scale spend with usage and attribute cost back to the project that incurred it.

This came up again recently when Anthropic announced that the claude -p print mode, the Agent SDK, and CI use cases now require an API plan rather than a subscription. For us this was a non-event. We were already on the API. The announcement just confirmed that the path we picked is the path Anthropic wants production agent workloads to take.

Claude is excellent for fix_verify. The per-attempt success rate is high and the failure modes are usually informative, meaning when it does not fix the bug on attempt 1, the diff it produces and the verify output together give the next attempt a real signal. The downside is cost. At scale, with thousands of fix loops a day, the per-token bill is a real line item.

A few months in, we started evaluating Qwen as a secondary model to drive cost down on a subset of tasks. Qwen runs on our own infrastructure on AWS EC2 hosts, which gives us per-token cost well below the Anthropic API for the same task size. The tradeoff was the reliability profile. Per-attempt success rate is lower than Claude. Failure modes are noisier. Some of the time the model will produce a syntactically valid but semantically wrong patch, and the verify step is the only thing that catches it.

This is exactly the kind of model where retries earn their keep. Qwen's curve of cumulative success vs attempt number rises more slowly than Claude's, but it keeps rising further out. Attempt 5 is still adding meaningful success rate. With Claude, attempt 5 is mostly wasted credits on a fundamentally wrong understanding that more retries are not going to fix.

So we had two models in production with different shapes of success curve, and we were applying the same retry cap to both. Something had to give.

Why one cap did not work

Suppose we set the global cap to 3, tuned for Claude. Claude is fine. Qwen leaves real success on the table, because attempts 4, 5, and 6 would have converted a measurable fraction of failures into passes, and now they do not happen. Fix rate drops on Qwen-routed tasks. Users notice. They route more work to Claude, which is the opposite of what we wanted from introducing Qwen.

Suppose we set the global cap to 6, tuned for Qwen. Qwen is fine. Claude wastes credits. Attempts 4, 5, and 6 on a Claude-routed task that has already failed three times have a low chance of succeeding, because Claude's failure mode at attempt 3 is usually "I do not understand the bug" or "the test I am running is checking something I cannot see," and the same prompt with the same verify output is not going to flip that on attempt 6. We were paying full Sonnet-tier per-token cost for those attempts.

The compromise we ran for a while was a cap of 5 globally. It was bad on both axes. Claude wasted 2 attempts worth of credits on its failure cases. Qwen left 1 attempt worth of success on the floor. We could see this in the data once we started bucketing the loop outcome by model and attempt number. The right answer was clearly per-model, not global.

The per-model defaults

The implementation is small. We added a nullable integer column on the project table, fix_verify_retry_cap, with NULL meaning "use the model-based default." A helper function returns the default for a given model name. The use case layer combines the two when it kicks off a loop.

The helper:

def _default_fix_verify_cap(model: str) -> int:
    name = (model or "").lower()
    if name.startswith("claude"):
        return 3
    if name.startswith("qwen"):
        return 6
    return 5

The schema field, on the project update payload:

class PurpleProjectUpdate(BaseModel):
    fix_verify_retry_cap: Optional[int] = Field(
        default=None, ge=1, le=20
    )

The Alembic migration adds the column:

op.add_column(
    "purple_projects",
    sa.Column("fix_verify_retry_cap", sa.Integer(), nullable=True),
)

And the use case resolves the effective cap when it starts a task:

effective_cap = (
    pp.fix_verify_retry_cap
    or _default_fix_verify_cap(execute_model)
)

The override range is 1 to 20. One on the low end because some projects have run a single attempt followed by a human review, and we do not want to break that pattern. Twenty on the high end because it is a reasonable ceiling for a customer who wants to push the long tail of a cheap self-hosted model further than our default. If they set 20 and burn through it, that is their cost. We log the effective cap on every task so it shows up in the project audit log alongside the outcome.

The defaults of 3, 5, 6 are not magic numbers pulled out of intuition. We picked them by plotting cumulative fix rate against attempt number for each model from a few weeks of production runs and looking at where the curve flattens. For Claude, the curve is essentially flat past attempt 3. For Qwen, it is still meaningfully rising at 5 and starts to flatten at 6. For other models we had less data, so 5 is the safe middle.

The tradeoff

The honest cost of this change is that adding a new model to the routing layer is no longer free. Before, we added a model and it inherited the global cap. Now we have to pick a default. If we do not pick one, the model falls through to the 5 default, which is usually fine but not always optimal.

In practice, this turned into a small ritual when introducing a new model. Route a small fraction of traffic to it at cap 8 or 10 for a week, plot the curve, find the elbow, set the default to one or two above the elbow. The ritual takes a few hours of analysis on top of the model integration itself. We considered automating it, computing the default from rolling fix rates per model on a cadence. We have not built that yet. The set of models we route to is small enough that a manual review every couple of months is fine. If the set grew to ten or more, automation would start to pay back.

The other tradeoff is that the policy is now opinionated in a way users can feel. If a customer on a Claude-routed project reports "fix gave up too early," the answer is sometimes "the default cap is 3, raise it to 5 on your project and try again." That is a real conversation we have had. It is the price of a default that is right on average but not for every codebase.

What the cap is, really

A retry cap is a budget. Specifically, it is a budget that integrates two things at once. The marginal probability of success at each attempt. The marginal cost of each attempt. The optimal cap is the largest N where the expected value of attempt N is still positive, which means attempt N's marginal success times the value of a fix exceeds attempt N's marginal cost in credits and verify time. That number is per-model because both factors are per-model.

When we set 3 for Claude and 6 for Qwen, we are saying the integral converges faster on Claude because high per-attempt success runs out of incremental room quickly, and converges slower on Qwen because lower per-attempt success keeps adding incremental room for longer at a much lower per-attempt cost. The split is what makes a multi-model workflow economically coherent.

If you are running anything like this loop in production, do not pick one number for all your models. Plot the curve. The number falls out.

Codens Purple is part of the harness at https://www.codens.ai/en/ . The retry cap split lives in purple-codens under the project use case layer.

"When 'Control request timeout: initialize' actually means SIGKILL: Claude Code CLI OOM inside Celery"

Takayuki Kawazoe — Thu, 14 May 2026 00:08:43 +0000

A production Celery task in Codens Green started returning this, intermittently, only under real load:

Control request timeout: initialize

The string is suspiciously specific. It looks like the kind of message you would see if Claude Code CLI's MCP initialization handshake had timed out on the other side of a pipe. That is what it sounds like. That is not what it was.

The task is analyze_code_specification. It spawns Claude Code CLI as a subprocess to analyze a repository against a PRD. It worked in staging, worked locally, worked in CI. It failed in production a few times a day, almost always when more than one analysis was running at the same time.

What we eventually shipped: route that task to a dedicated Celery queue, run that queue on a separate ECS Fargate worker tier with 8 GB of memory, pin concurrency to 1. The real bug was the Linux kernel OOM killer terminating Claude Code CLI partway through startup, before it could complete its handshake with the parent task. The misleading log line was just what survives when a child process is shot in the head mid-init.

This is the chase.

The wrong paths

I spent the better part of a day inside Claude Code CLI's initialization code path, because that is where the error string lived.

First theory: stdio buffering. The CLI talks to the parent over stdin/stdout. If the parent is not reading fast enough, the child can block on a full pipe and look like it is hanging. I added explicit buffer drains, raised the timeout, switched to line-buffered mode on both sides. The error still happened.

Second theory: MCP protocol version mismatch. Maybe a recent Claude Code update changed the init handshake and our version pin was stale. I diffed the changelog, compared protocol versions across our deployed image and a known-good local environment. They matched.

Third theory: a bug in the agent SDK config. We pass a lot of options into the CLI. Maybe one of them was triggering a slow path during init that exceeded the handshake budget. I trimmed the config down to the smallest reproducible set, then to nothing. Same error in production. Still nothing in staging.

Fourth theory, the one I am least proud of: maybe Claude Code itself has an upstream init bug under concurrent load. I drafted half of a GitHub issue before I noticed I had no actual evidence and was just frustrated.

None of these held up. The fingerprint of the failure, intermittent, only under load, only in production, did not match any of them. Buffering bugs are deterministic. Protocol mismatches are deterministic. Config bugs are deterministic. This was load-correlated. That is a different shape of problem.

The exit code

The thing that finally cracked it was looking at the subprocess exit code instead of the log message. We were capturing the error string before we captured returncode, and the error string was so plausible it had crowded out the rest of the diagnostic surface.

proc = await asyncio.create_subprocess_exec(*cmd, ...)
stdout, stderr = await proc.communicate()
if proc.returncode != 0:
    logger.error("claude code failed rc=%s", proc.returncode)

The value coming out was -9.

On POSIX, when subprocess reports a negative return code, the absolute value is the signal that killed the child. Signal 9 is SIGKILL. SIGKILL cannot be caught, cannot be handled, cannot be cleaned up after. The process is removed from the run queue. There is exactly one common source of SIGKILL on Linux that arrives without a parent or operator sending it on purpose: the kernel OOM killer.

That was the moment. This is no longer a Claude Code problem. This is an OS-level problem. The CLI had not timed out during initialization. The CLI had been shot during initialization, by the kernel, for using too much memory.

The "Control request timeout: initialize" message was a downstream symptom. The parent task was waiting for the child to finish its handshake. The child was killed mid-handshake. The parent eventually gave up waiting and surfaced the most specific thing it knew, which was that init had not completed in time. The error was technically true and completely misleading.

OOM math

Once you know the shape, the math is easy.

Claude Code CLI is not a small process. It boots a JavaScript runtime, loads the agent SDK, hydrates context, and prepares for tool calls. In our workload, resident memory per invocation sits between roughly 500 MB and 1.5 GB, peaking higher during initial context load.

Our Celery worker pool was the general-purpose one. Sized for the rest of our tasks, which are normal Python work: webhook fan-out, database writes, small HTTP calls. Those tasks live happily in well under 200 MB each. The worker host had memory headroom appropriate to that profile, with default Celery concurrency, which spins up multiple worker processes per host so several tasks run in parallel.

That is fine for normal traffic. It is not fine when two of those parallel tasks each decide to spawn a 1+ GB CLI subprocess.

Picture the failure mode. Two PRDs are submitted within the same minute. Two Celery workers pick up analyze_code_specification. Each launches Claude Code CLI. Both CLIs start allocating. The host's resident memory climbs past its limit. The kernel's OOM killer wakes up and picks a victim, typically the largest recent allocator. Claude Code CLI dies with SIGKILL. The Celery task surfaces "Control request timeout: initialize" because that is what it saw from its end of the pipe. The other task may or may not also die, depending on timing.

The reason this never showed up in staging was simple: staging has one user, me, running one job at a time. Concurrency was always 1 by accident. The bug needed two simultaneous invocations on the same host to express itself.

The fix, in four parts

I did not want to over-engineer this. The fix is structurally small. It is mostly Celery routing and infra sizing.

1. Dedicated queue. analyze_code_specification got its own queue, separated from everything else.

# celery_app.py
task_routes = {
    "tasks.analyze_code_specification": {"queue": "analysis"},
    "tasks.run_fix": {"queue": "fixing"},
    "tasks.control_plane.*": {"queue": "control_plane"},
    "tasks.plan_monitor.*": {"queue": "plan_monitor"},
    # everything else falls through to "default"
}

The point of the queue split is not load balancing. It is so we can attach a different worker profile to this task without changing anything about the others.

2. Dedicated ECS Fargate worker tier. The analysis queue gets its own worker service, on its own Fargate task definition, with 8 GB of memory. The rest of the workers stay on the smaller general-purpose host. One service, one queue, one process shape.

3. Concurrency = 1. The worker for the analysis queue starts like this:

celery -A app worker -Q analysis --concurrency 1 --loglevel info

This is the load-bearing piece. Even on an 8 GB host, if you let two CLI invocations run in parallel, you can still blow past the limit when both peak at 1.5 GB at the same time and the OS plus worker plus everything else has its own footprint. Concurrency 1 means exactly one Claude Code CLI subprocess exists on this host at any time. Two analyses come in, the second one queues, waits, runs next. Slower, totally fine, never OOMs.

4. Memory headroom. 1 CLI × roughly 1.5 GB peak × concurrency 1, against 8 GB total, with the worker process and OS taking a few hundred MB. That gives more than 5 GB of headroom for a worst-case CLI invocation. If we ever needed to raise concurrency to 2, we would also need to either double the instance size or accept the OOM risk back. We chose not to.

We also added regression tests at the routing layer, asserting that analyze_code_specification resolves to the analysis queue, that control-plane tasks do not accidentally get rerouted there, and that plan-monitor isolation is preserved. The routing dict is the kind of thing that quietly bit-rots in a PR review, and a misroute would silently bring the bug back.

Tradeoffs

The dedicated worker tier is more expensive per task than just bumping the general worker's RAM. It scales slower under burst load because the queue depth gates throughput. It is one more service to deploy, monitor, alert on, and update during a Claude Code CLI version bump. None of that is free.

What we got in return is that this failure mode cannot happen anymore for any reason that is not "we accidentally raised concurrency above 1." That is a single config line in one repo with a test guarding it. I will take that tradeoff.

What generalizes

Two things stuck with me after this.

One: when a child process surfaces a plausible-sounding error during a handshake, check returncode before you check the message. A negative return code on POSIX is a different category of failure from anything the application itself can report. A negative number is the OS telling you the application never got a chance.

Two: per-task memory profiles matter for Celery worker sizing in a way that defaults do not protect you from. A worker pool tuned for 200 MB tasks will silently kill a 1.5 GB task and tell you something else happened. If your task spawns a subprocess that is heavier than your worker, the right answer is almost always a separate queue with its own concurrency and its own host, not a bigger general-purpose host.

We build Codens, an AI dev harness with this kind of analysis baked in. https://www.codens.ai/en/

"Cutting MCP token bloat by 12x: what happened when we packed 31 tools into one server"

Takayuki Kawazoe — Tue, 12 May 2026 02:49:09 +0000

Earlier this week @akshay_pachaar summarized a year of MCP-vs-CLI arguing into one sharp line:

"The MCP vs CLI debate. For most of 2025, AI Engineers argued about it. The skeptics had real numbers: Playwright MCP eats 13.7K tokens, Chrome DevTools MCP eats 18K. A 5-server setup burns 55K tokens before any work."

He is right. Those numbers are the steady drumbeat against MCP as a delivery format. If your agent burns 55K tokens just advertising capabilities, the protocol starts to look like a tax.

We just shipped a counter-data point. codens-mcp is a single Python package that exposes 31 tools across five products (Purple, Red, Blue, Green, Auth, plus a cross-product registration tool). I sat down with wc -c and a calculator and got a number I had to triple-check: the entire tool surface, descriptions and all, is ~4,720 tokens. That is roughly 12x less than the 5-server number in the tweet, and about 3x less than Playwright MCP alone.

This is not a "look how clever we are" post. It is the boring engineering answer: most of MCP's token cost is not the protocol, it is the loading strategy. Below I walk through how we measured it, the five architecture decisions that made the number small, and the real tradeoffs we ate to get there.

The measurement

Here is the actual byte count from the tool definition files, straight off disk:

auth_tools.py     1,555 chars
blue_tools.py     2,576 chars
cross_tools.py    3,913 chars
green_tools.py    6,160 chars
purple_tools.py   1,448 chars   # re-exports 16 tools from purple-codens-mcp
red_tools.py      3,231 chars
                 ───────
total            18,883 chars  ≈ 4,720 tokens

The 4 chars/token heuristic is a known underestimate for natural-language English (3.5 is closer to GPT/Claude tokenizers in practice), but it is fine as an upper-bound on a registration payload that contains a mix of Python identifiers, docstrings, and JSON-schema-ish hints. The MCP server sends a slightly inflated version of these definitions over the wire as tool descriptors, so the on-context cost the model sees is in the same order of magnitude. I have done the apples-to-apples comparison with tiktoken on the rendered descriptors and the number lands between 4.4K and 5.1K depending on whether you count the JSON schema framing. ~4,720 is the honest middle.

The 31 tools break down like this:

Purple (16, re-exported from purple-codens-mcp): purple_login, purple_whoami, purple_analyze_repo, purple_register_project, and twelve more covering projects, repos, instructions, workflows, and SSE.
Red (4): red_create_bug_report, red_get_bug_report, red_analyze_bug_report, red_submit_bug_fix_plan_to_purple.
Blue (4): blue_list_e2e_tests, blue_generate_e2e_test, blue_run_e2e_test, blue_get_e2e_test_results.
Green (4): green_create_consultation_with_message, green_send_consultation_message, green_convert_consultation_to_prd, green_create_kickoff.
Auth (2): auth_agent_signup, auth_get_pricing.
Cross (1): codens_register_project_unified.

Where this lands against the public reference points:

Server	Tools	Approx. tokens
Playwright MCP	many	13,700
Chrome DevTools MCP	many	18,000
5-server stack (mixed)	varies	~55,000
`codens-mcp` (unified)	31	~4,720

If we had shipped five separate MCPs, one per product, even at a conservative per-server registration overhead the stack would have cost ~65K tokens of context before any tool ran. We did not, and that is the whole story.

Why one package works

Five decisions did the work. None of them are clever. All of them are boring tradeoffs that happen to compound.

1. Prefix namespacing instead of MCP-server-level scoping

Every tool carries its product prefix in the name. The flat namespace makes the file you saw above legal:

purple_login, purple_whoami, purple_analyze_repo, ...
red_create_bug_report, red_analyze_bug_report, ...
blue_generate_e2e_test, blue_run_e2e_test, ...
green_convert_consultation_to_prd, ...
auth_agent_signup, auth_get_pricing
codens_register_project_unified

We pay verbosity in the tool name. We get zero collision risk and one MCP process. I considered nested groupings (codens.red.create_bug_report style), but flat names render cleaner in tool-use traces and grep better in logs. Worth it.

2. Shared client code

All five product clients live in one place:

src/codens_mcp/client/
  auth.py
  blue.py
  green.py
  red.py
  auth_helper.py    # JWT load/refresh, shared

This is the part that does not show up in the token count but matters for the maintenance story. Five separate MCP packages would mean five copies of auth_helper.py drifting independently. One package means one bug fix.

3. Single auth flow

Auth Codens is the SSO root for the family, so the MCP server only ever speaks one login dialect:

codens-mcp login        # Device Code Flow, runs once
# token persisted to ~/.purple-codens/credentials.json
# every product client reads the same file

The historical path is ~/.purple-codens/credentials.json because Purple shipped first and we did not want to break existing users by renaming. Cosmetic debt, zero functional cost.

4. Re-export pattern for Purple

This is the move that kept us honest. Purple already had a standalone MCP package on PyPI (purple-codens-mcp) before the unified server existed. We did not fork it. The unified package imports and re-registers Purple's tools:

# src/codens_mcp/tools/purple_tools.py
from purple_codens_mcp.tools.project_tools import register_project_tools
from purple_codens_mcp.tools.repo_tools    import register_repo_tools
# ...four more imports

def register_purple_tools(mcp: FastMCP) -> None:
    _register_purple_auth(mcp, _purple_get_client)
    _register_projects(mcp, _purple_get_client)
    _register_repos(mcp, _purple_get_client)
    # ...

Existing users of purple-codens-mcp on PyPI keep working unchanged. codens-mcp adds Red, Blue, Green, Auth, and Cross on top. One package can be fully replaced by the other without breaking anyone, which gave us a safe rollout.

5. Lazy execution

The 4,720 tokens is the registration cost. Claude Code sees all 31 tool descriptors at startup. Each tool's actual HTTP call only fires on invocation, and the per-call response is bounded by the tool's own prompt (usually a few hundred tokens of JSON). The thing that scales linearly with use is the conversation transcript, not the registration. Bloat at startup is the lever; we pulled it once, and the rest of the session is unaffected.

The honest tradeoffs

Unified is not free. Three things we gave up:

One process is one failure mode. If codens-mcp crashes, all five product surfaces are gone simultaneously. With separate MCPs each product gets its own isolation boundary and a Red bug cannot take down Green tooling. We accepted this because we are a small shop, the package is small, and a crash in production would tell us we have a much bigger problem than tool routing.

Update cadence is coupled. Shipping a new Red tool means cutting a new version of the whole package. Users get every product's churn whether they wanted it or not. We considered semver-per-product subnamespacing and rejected it because our internal release cadence is already weekly and roughly synchronized; the imaginary user who wants Red on a daily cycle but Green frozen does not exist for us yet.

Permission boundary is coarse at the MCP layer. Authenticating once gives the user access to all 31 tools. You cannot tell Claude Code "allow Red but not Green" through the MCP descriptors alone. We solved this one level up: Auth Codens enforces role-based permissions on the server side, so even if the MCP exposes green_create_kickoff, the API call rejects users who do not have the Green entitlement. The MCP becomes the surface; the gate lives elsewhere.

"Unified is always right" is not the conclusion here. If you ship one MCP per oncall team and the teams release on different cycles, you are paying the token tax for a reason, and the isolation buys you something real. The unified shape worked for us because the products were already coupled.

Where the token bloat actually comes from

Akshay's follow-up tweet closes the loop:

"The protocol was never the bottleneck. The loading strategy was."

That is the line I want every MCP author to internalize. The 55K-token figure is not what MCP-the-spec costs. It is what N separate handshakes plus N capability advertisements plus N redundant client preambles cost when you let your tools sprawl into N independent servers.

Look at the math from the other direction. If five separate MCPs each carry a 10–15K registration footprint (one server's worth of capability JSON, instructions, schema bundles), you are at 50–75K before the model has done anything useful. Collapse the five servers to one and the registration overhead collapses too, because there is only one capability list, one instruction blob, one schema bundle, and the per-tool descriptor cost is small.

The protocol is doing its job. The protocol is also fine with you stacking five copies of itself in your config file, because that is a user choice, not a spec smell. Treating MCP servers like microservices ("one per product, for isolation") is the analogue of running 30 Lambda cold starts where one process would do.

We did not invent a new transport. We did not strip schemas. We just stopped paying for five handshakes when one would do.

The principle

Partition your MCP surface by domain, not by tool class. If five tools share an auth root, a release cadence, and a user mental model, they belong in one server. If they do not, split. The token cost is a downstream signal of how well that partition matches reality.

codens-mcp is on PyPI: pip install codens-mcp. Code lives at github.com/codens-ai. If you want the user-facing pitch, that is at codens.ai/en.

"How one empty message poisoned an entire AI consultation (and the three-layer fix)"

Takayuki Kawazoe — Mon, 11 May 2026 05:33:11 +0000

A user opened a support thread saying their AI consultation had gone unresponsive. Every message they sent came back with an error. Refreshing didn't help. Starting a new tab didn't help. From their side, the conversation was dead.

The product is Codens Green, a PRD management tool where users hold long, iterative conversations with Claude to refine product requirements. Some of those conversations run dozens of turns. This particular one had thirty-something messages of history, all looking normal in the database. The row was there. The user was authenticated. The organization had credits. And yet every new message hit the API and bounced.

By the time we shipped the fix it was three layers deep, and only one of those layers is the "actual" fix. The other two were the kind of belt-and-suspenders you only put on once you've been burned. I want to walk through what we saw, what we tried first (which was wrong), what the real cause turned out to be, and the shape of the patch.

What 400 BadRequest looked like

The backend log for the failing consultation looked like this on every request:

ERROR Failed to generate AI response: Error code: 400
{'type': 'invalid_request_error',
 'message': 'messages.17: text content blocks must be non-empty'}

Same error, same index, every time. The user retried, our code retried, the error didn't move. Index 17 was always index 17 because index 17 was sitting in their stored history.

I went down the wrong path first. The error code was 400, which felt like an auth-shaped problem, so I started there. Wrong key? The key was fine, every other org was working. Rate limit? No, this org wasn't anywhere close. Model deprecation? We were on a current model, and other consultations using the exact same model were responding normally. I checked the Anthropic status page. Green across the board. I checked our own credit-deduction logic to make sure we weren't somehow short-circuiting requests. Clean.

About forty minutes in I noticed the messages.17 part of the error and felt stupid. The API was telling me exactly which message in the array it didn't like. I just hadn't read it.

The real cause

I pulled the consultation row, parsed its messages JSON, and walked it. Most messages had a few hundred characters of content. Message 17, an assistant message, had content: "". Empty string. Not whitespace, not null, just empty.

Claude's API rejects requests where any message in the messages array has empty content. That's a hard validation at the boundary, not a soft failure. Which meant: the moment that empty message landed in the consultation's history, every future call was guaranteed to fail, because every future call assembled the full history and sent it back to the API. The conversation had been poisoned by one row.

The user couldn't recover from inside the app. Our UI didn't expose a "delete message" affordance for this surface, and even if it did, the broken message was an assistant turn, not theirs to edit. From the user's perspective, the consultation just stopped working. Forever. With no error message that meant anything to them.

This is the worst kind of bug. It only surfaces for users with enough history to have triggered the rare condition that produced the bad row, the dashboards don't flag it (a 400 from Claude looks like an intermittent upstream failure if you don't drill in), and the root cause is invisible because it happened on some earlier request you weren't watching.

How an empty assistant message ever got saved

Once I knew what to look for, the chain was straightforward.

Claude's API occasionally returns a response where the assistant's text_content is empty. I don't have a great theory for why. Could be transient, could be an edge case in their content filtering, could be a race in how we parse content blocks when the response has tool-use blocks but no text blocks. It's rare. I'd guess less than one in ten thousand calls in our traffic. But across enough users and enough turns, "rare" becomes "guaranteed."

Our previous code did approximately this:

ai_result = await self._claude_client.generate_consultation_response(
    messages=messages,
    title=consultation.title,
    context=consultation.context,
)
ai_response = ai_result["response"]
# ...
consultation.add_assistant_message(ai_response, metadata=ai_metadata)

ai_response could be "". Nothing checked. The empty string flowed into add_assistant_message, got appended to the message list, and the entity got persisted. From that point forward, the consultation was permanently broken.

One unchecked write, two days earlier, became a permanent block on the user's account.

The three-layer fix

The patch split into three layers. Each one defends a different boundary, and only the middle one is what I'd call the real fix. The other two are there because the real fix doesn't help users who already have a poisoned row, and because I wanted to bound the failure surface.

Layer 1: filter on the way out

In the Consultation domain entity, get_messages_for_ai() is what assembles the array we send to Claude. The old version included every non-system message. The new version also excludes anything with empty or whitespace-only content:

def get_messages_for_ai(self) -> list[dict[str, str]]:
    return [
        {"role": msg.role.value, "content": msg.content}
        for msg in self.messages
        if msg.role != MessageRole.SYSTEM
        and msg.content
        and msg.content.strip()
    ]

This is the layer that unsticks every existing poisoned consultation. We didn't run a data migration. We didn't write a one-shot cleanup script. The filter at read time simply skips the bad row on the way to the API, and the conversation works again. The bad row is still sitting in the DB, but it's never sent anywhere that would reject it.

I want to be honest about what this layer is and isn't. It's defensive. It papers over bad data. It does not prevent the bug from happening again. If you only ship this layer, you keep generating empty rows and keep skipping them, which is fine until something else relies on the history being complete (PRD generation from conversation summary, for instance) and now the user's PRD is missing a turn.

Layer 2: detect on the way in

This is the real fix. In our Claude client wrapper, generate_consultation_response() now refuses to return an empty response at all:

text_content = "".join(
    block.text for block in response.content if block.type == "text"
)
if not text_content.strip():
    raise ValueError("No text content in Claude API response")

If Claude hands us back a response with no text blocks (or only empty text blocks), we raise. The caller in AddMessageUseCase already has a try/except around the API call and falls back to a generic "sorry, please try again" message. Crucially, that fallback message goes to the user as a transient response. It does not get persisted as an assistant turn:

try:
    messages = consultation.get_messages_for_ai()
    ai_result = await self._claude_client.generate_consultation_response(...)
    ai_response = ai_result["response"]
except Exception as e:
    logger.error(f"Failed to generate AI response: {e}")
    ai_response = "申し訳ありません。AIからの応答の生成中にエラーが発生しました。..."

Wait, that's not quite right as stated. Look at the existing code and you'll see the fallback message does get persisted via add_assistant_message further down. That's a separate concern we'll come back to. What matters here is that with Layer 2 in place, the assistant message that gets stored on a failed call is either real text or our explicit, non-empty fallback string. It is never "". The DB cannot accumulate another poisoned row from this code path.

If you can only ship one of the three layers, ship this one. Defending at the output boundary, the moment data crosses from "external API response" into "thing we persist," is where bad data deserves to die. Filtering at read time is a workaround. Validating at write time is the fix.

Layer 3: bound the history

This one is technically a separate bug, but I shipped it in the same PR series because the user-visible symptom overlaps. Long consultations were starting to push against the context window, and a few users were seeing failures that looked similar (intermittent API errors on long-running conversations) but had a different cause.

So in AddMessageUseCase, we cap the history we send:

MAX_HISTORY = 40
if len(messages) > MAX_HISTORY:
    messages = messages[-MAX_HISTORY:]
    while messages and messages[0]["role"] != "user":
        messages = messages[1:]

Forty messages is roughly twenty user/assistant turns. The trailing slice gets the most recent context, which is almost always what matters. The while loop handles a Claude API requirement that conversations must start with a user role. If the slice happens to begin with an assistant message (because we truncated mid-turn), we drop the leading assistants until we find a user message.

Three things to flag about Layer 3. First, twenty turns is a product choice, not a technical limit; we picked it because our consultation UI doesn't show more than that comfortably anyway, and longer histories were producing diminishing returns on AI quality. Second, the first-user-role correction is a Claude-specific constraint. Don't carry this verbatim to a different provider without checking their docs. Third, this layer is unrelated to the empty-message bug. It's bundled in because the failure mode looks adjacent from a triage perspective, and shipping them together meant one round of regression testing instead of two.

The migration we didn't write

One thing I want to underline. Layer 1, the read-time filter, accidentally did the work of a data migration without being a data migration. Every existing poisoned consultation in our DB started working again the moment the deploy went out. No SQL to write, no rows to update, no offline job to run. The defensive layer absorbed the historical damage.

That's not always the right tradeoff. If we'd needed downstream consumers (analytics, PRD generation, exports) to see a complete history, leaving bad rows in place would have leaked into those features later. In our case the only consumer that read the bad message was the call to Claude itself, so filtering at read time was sufficient. But it's worth naming the pattern explicitly: a defensive read-side filter can serve as a zero-downtime migration for a class of bad data, as long as you're confident you've enumerated every reader.

What I'd take away

The thing I keep coming back to is that the cause of the user's problem (one empty cell, written two days earlier, somewhere on the request path) had nothing visible in common with the symptom they were experiencing (every new message fails with a 400 today). The signal that mattered was buried in the error message itself, and I spent forty minutes chasing API keys before I read it. Read the error.

The three-layer shape, defend on the way in, defend on the way out, bound the size, is general. It works for any case where you're persisting outputs from an external API and replaying them as inputs. Validate before you persist. Filter before you replay. Cap the surface.

If you're building anything with Claude, Codens is what we use this same stack to build.

"Persisting your real Chrome login across Playwright restarts on macOS"

Takayuki Kawazoe — Sun, 10 May 2026 02:08:29 +0000

Every macOS reboot, the same ritual. Open the Playwright-controlled Chrome window, see seven publishing tabs all logged out, and spend the next ten minutes typing passwords and tapping the Google account picker. Zenn, dev.to, note, Substack, X, LinkedIn, the Google Search Console dashboard. All gone, all needing the same Google SSO dance through my corevice.com workspace account.

I run a one-person GTM operation for Codens and the publishing pipeline is entirely Playwright-driven. npx @playwright/cli@latest opens a real Chrome with a persistent profile, and a stack of small scripts paste titles and bodies into each editor. It works beautifully until the host reboots and the user-data-dir at /tmp/chrome-pw-corevice evaporates with the rest of /tmp.

I finally sat down and fixed it. The result is a thirty-line shell script that clones my daily-driver Chrome profile into the Playwright tmpdir on every launch, with two non-obvious tricks that make the cookies actually decrypt. This post is about those two tricks.

Why the obvious copy doesn't work

The first thing anyone tries is the obvious thing.

cp -r ~/Library/Application\ Support/Google/Chrome/Default \
      /tmp/chrome-pw-corevice/Default

Run it, fire up Playwright, and Chrome opens looking like it has my profile. History is there. Bookmarks are there. Extensions are there. But every site is logged out, and the cookie jar in DevTools is either empty or full of cookies that don't authenticate anything.

The reason is that Playwright launches Chrome with two flags I didn't know about until I started digging:

--use-mock-keychain
--password-store=basic

Those flags tell Chrome to bypass the macOS Keychain entirely and use a hardcoded mock encryption key for cookies and the password store. From Playwright's point of view this is the right default. CI runners don't have a real keychain. Headless containers don't have a real keychain. The mock makes Chrome boot reliably in places where Keychain Access doesn't exist.

But for me, this is exactly wrong. The cookies my daily-driver Chrome wrote to disk were encrypted with the real keychain key, the one Chrome stored under "Chrome Safe Storage" in my login keychain on first install. The cookies that just got copied over are still encrypted with that real key. Playwright's Chrome boots with the mock key, tries to decrypt them, gets garbage, and silently treats every cookie as invalid.

I tried storageState first, which is the documented Playwright path for this. Export cookies and localStorage from one context, inject into another. It works for some sites and dies for others. Substack stalled at the Google SSO redirect and never finished the auth handshake. Note's editor wanted a CSRF token tied to a session cookie that storageState had captured but which the server no longer accepted, presumably because the session was bound to the original UA fingerprint. After the third site failed in a different way I gave up on storageState and went back to cloning the whole profile.

So: two real fixes are needed. Make Playwright's Chrome speak the same encryption language as my daily Chrome, and copy the cookie database in a way that doesn't corrupt it.

Fix one, patch the keychain flag

Playwright's CLI bundles its Chrome launch arguments inside playwright-core/lib/coreBundle.js. When you run npx @playwright/cli@latest, npm caches that file under ~/.npm/_npx/<hash>/node_modules/playwright-core/lib/coreBundle.js. The file is huge and minified, but the two strings I care about appear verbatim:

"--use-mock-keychain",
"--password-store=basic"

A sed rewrite is enough. Swap the first to --use-real-keychain and the second to --password-store=keychain. Chrome on macOS recognizes both, and once they're in place the launched Chrome reads its encryption key from the same login keychain entry as my daily-driver Chrome. The cookies decrypt. SSO holds.

The patch wants to be idempotent because npx happily re-extracts the package if it gets purged from the cache, and I don't want to re-edit the file by hand each time. So the script does three things. It locates the bundle with find. It checks whether the bundle still contains --use-mock-keychain, which means it hasn't been patched yet. If so, it makes a .bak copy on first patch and runs sed -i '' in place.

The .bak is the escape hatch. If a future Playwright update changes those flags or relies on the mock keychain elsewhere and my patch breaks something, I can mv coreBundle.js.bak coreBundle.js and be back to stock in one command.

The first time you launch the patched Chrome, macOS will pop a Keychain Access dialog asking you to allow access to "Chrome Safe Storage." Click Always Allow. After that, no prompts.

Fix two, SQLite backup for the cookie file

With the keychain flag patched, the next failure mode is more subtle. Sometimes the cookies decrypt, sometimes they don't, and when they don't, the SQLite file looks corrupt. Chrome refuses to read it and silently starts a fresh empty cookie jar.

Chrome's Cookies file is a SQLite database. My daily-driver Chrome is almost always running, which means it's holding write locks on that database, and depending on timing it may have a partial write in progress when cp reads the file. The result is a torn copy: the bytes are physically there, but the SQLite page checksums don't match the WAL log, and SQLite refuses to open it.

The right tool for snapshotting a live SQLite database is the .backup command:

sqlite3 "${SOURCE_PROFILE}/Default/Cookies" \
  ".backup ${PW_PROFILE_DIR}/Default/Cookies"

This isn't just a smarter copy. It uses SQLite's online backup API, which acquires a read lock, copies pages in a way that's transactionally consistent with the source database's current state, and produces a target file that opens cleanly. You can run it while Chrome is actively writing to the source. The output is always a valid database.

The script removes the stale Cookies and Cookies-journal files first, then runs .backup on every launch. That way the cookie jar is always fresh, even if I haven't rebooted but I have used my daily Chrome to log into a new site since the last Playwright session.

The script

The whole thing is at runbooks/launch/playwright-launch.sh in my GTM repo. Roughly thirty lines if you don't count comments.

#!/usr/bin/env bash
# Launch playwright-cli against the corevice.com Chrome profile copy.
# Idempotent: if profile copy missing, re-creates it; if patch missing, re-applies.
# Usage: ./playwright-launch.sh open <url>
#        ./playwright-launch.sh <command> [args...]

set -euo pipefail

PW_CACHE_BASE="${HOME}/.npm/_npx"
PW_PROFILE_DIR="/tmp/chrome-pw-corevice"
SOURCE_PROFILE="${HOME}/Library/Application Support/Google/Chrome"

ensure_profile_copy() {
  if [ ! -d "${PW_PROFILE_DIR}/Default" ] || [ ! -f "${PW_PROFILE_DIR}/Default/Cookies" ]; then
    echo "[setup] copying profile..."
    mkdir -p "${PW_PROFILE_DIR}/Default"
    rsync -a \
      --exclude='Cache' \
      --exclude='Code Cache' \
      --exclude='GPUCache' \
      --exclude='Service Worker' \
      --exclude='ShaderCache' \
      --exclude='GraphiteDawnCache' \
      --exclude='component_crx_cache' \
      --exclude='extensions_crx_cache' \
      --exclude='Sessions' \
      --exclude='File System' \
      --exclude='blob_storage' \
      --exclude='Cookies' \
      --exclude='Cookies-journal' \
      "${SOURCE_PROFILE}/Default/" "${PW_PROFILE_DIR}/Default/"
    cp "${SOURCE_PROFILE}/Local State" "${PW_PROFILE_DIR}/" 2>/dev/null || true
    cp "${SOURCE_PROFILE}/First Run" "${PW_PROFILE_DIR}/" 2>/dev/null || true
  fi

  # Always refresh cookies via SQLite .backup (safe with Chrome running)
  rm -f "${PW_PROFILE_DIR}/Default/Cookies" "${PW_PROFILE_DIR}/Default/Cookies-journal"
  sqlite3 "${SOURCE_PROFILE}/Default/Cookies" \
    ".backup ${PW_PROFILE_DIR}/Default/Cookies" 2>/dev/null
}

ensure_patch() {
  local cb
  cb=$(find "${PW_CACHE_BASE}" -path '*playwright-core/lib/coreBundle.js' 2>/dev/null | head -1)
  if [ -z "$cb" ]; then
    echo "[setup] @playwright/cli not yet installed; npx -y will install it"
    return
  fi
  if grep -q -- '--use-mock-keychain' "$cb"; then
    echo "[setup] patching playwright to use real keychain..."
    [ ! -f "${cb}.bak" ] && cp "$cb" "${cb}.bak"
    sed -i '' \
      -e 's|"--password-store=basic"|"--password-store=keychain"|' \
      -e 's|"--use-mock-keychain",|"--use-real-keychain",|' \
      "$cb"
  fi
}

ensure_profile_copy
ensure_patch

export PATH="${HOME}/.asdf/shims:${PATH}"

if [ "${1:-}" = "open" ]; then
  shift
  exec npx -y @playwright/cli@latest open --headed "$@" \
    --browser chrome \
    --profile "${PW_PROFILE_DIR}"
fi

exec npx -y @playwright/cli@latest \
  "$@" \
  --browser chrome \
  --profile "${PW_PROFILE_DIR}"

A few notes on the rsync exclude list. All the cache directories are excluded because they're large, regenerable, and sometimes hold OS-specific binary blobs that Chrome will rebuild on first launch. Sessions is excluded so Playwright's Chrome doesn't try to restore tabs from my daily browsing. File System and blob_storage are excluded for size. Cookies and Cookies-journal are excluded specifically because we handle them via .backup immediately after the rsync, and we want that to be the authoritative copy.

Local State and First Run are copied separately. Local State is where Chrome stores the encrypted master key reference and a few profile-level settings. First Run is a sentinel file that suppresses the first-run wizard.

The patched diff itself is two lines:

- "--use-mock-keychain",
+ "--use-real-keychain",
- "--password-store=basic"
+ "--password-store=keychain"

That's the whole keychain fix. Two strings.

A side issue, headless mode breaks the clipboard

The script forces --headed for the open subcommand, and there's a story behind that. My publish scripts work by pbcopy-ing the title and body into the clipboard, focusing the editor field via Playwright, and then sending Cmd+V. CodeMirror, Substack's editor, dev.to's editor — they all behave better with a real paste than with type() calls that fire individual keypress events. Markdown formatting survives. Code blocks stay intact. Smart-quote autocorrect doesn't fire.

But headless Chromium doesn't have a system clipboard. navigator.clipboard.readText() returns empty, the paste handler sees no data, and the form silently stays empty. I lost an hour to that one before realizing the open command was defaulting to headless mode in the version of @playwright/cli I was on. Forcing --headed makes the daemon run as a real Chrome window with full clipboard access, which is what I want anyway because I sometimes want to glance at the publish flow while it's running.

The non-open commands pass through unchanged, so anything else that wants headless behavior still gets it.

What it's worth

Five to ten minutes of manual relogins per reboot, multiplied by however often macOS decides to update overnight. Across a year that's hours I get back, and more importantly the publish scripts now run unattended. I push a draft, the script opens the right tab, pastes the right content, and I review the rendered preview before clicking publish.

If you're running a similar setup, the script is generic. Change SOURCE_PROFILE if you use a non-Default Chrome profile, change PW_PROFILE_DIR if you don't trust /tmp to survive your reboot policy, and the rest should work.

This is the kind of small infrastructure work that makes solo operations possible. We build a lot of these at Codens, where the day job is wiring AI agents into the same kind of publishing and dev pipelines.

"Why your long-running AI agent feels broken (even when it isn't)"

Takayuki Kawazoe — Fri, 08 May 2026 04:06:22 +0000

A support ticket came in last month with the subject line "the plan generator is broken." It was not, in fact, broken. The Celery task was running. The downstream service had accepted the job. The database row was sitting there with generation_status = 'in_progress' exactly as designed. From the server's point of view, the system was healthy.

From the user's point of view, they had clicked a button fifteen minutes ago and nothing had happened since.

I run Codens, a small AI dev harness, mostly solo. We have a product called Green Codens that turns Product Requirements Documents into actionable dev plans. The plan generation is a long-running AI job. It can take 30 seconds for a tiny repo or 30+ minutes for a sprawling one. We had built two completion paths: a webhook for the happy case and a polling fallback for when the webhook missed. The webhook had silently failed during a deploy. The polling fallback was scheduled to make its first call fifteen minutes after submission.

We changed two numbers. The same workflow now feels roughly fifteen times faster. Total compute is basically unchanged. This post is about why those two numbers mattered so much, and what they imply about designing async UX in AI products in general.

What was actually happening

Green Codens does the PRD authoring side. A separate service we call Purple Codens does the heavier lifting: cloning the repo, reading code, running an analysis agent, producing a structured task list. When a user converts a PRD into a dev plan, Green submits an analyze job to Purple, gets a 202 Accepted and a job id back, and then has to wait for the result.

There are two completion paths.

The first is a webhook, which is just the server-to-server "I'm done" callback. When Purple finishes, it POSTs the result back to Green with a signature, and Green applies it to the plan row. This is the happy path and it usually works.

The second path is a polling fallback. Webhooks miss for boring reasons. A receiver might be mid-deploy and bouncing 503s for thirty seconds. A signing key rotation might leave one side temporarily unable to verify the other. A network blip might drop the request and the sender's retry policy might give up before the receiver is back. None of these are exotic. All of them happen in real production systems. So Green also runs a Celery task that wakes up periodically, asks Purple "hey, what's the status of job X?", and applies the result if the job is done.

The polling task is idempotent. If the webhook already applied the result, the polling task sees generation_status = 'completed' and is a no-op. If the webhook missed, the polling task is the safety net that catches the dropped result.

Here is what the original schedule looked like:

# Original (bad)
_INITIAL_COUNTDOWN = 900  # wait 15 minutes before first poll
_RETRY_COUNTDOWN = 300    # then poll every 5 minutes, up to 12 times

Total polling window: 15 + 12 × 5 = 75 minutes. The reasoning was server-side and superficially sensible. Most analyses on real customer repos finish somewhere in the 10 to 20 minute range. Polling earlier than 15 minutes "wastes" API calls on jobs that are obviously still running. Polite. Considerate. Reasonable in isolation.

The problem was that the user does not live in the server's frame of reference. The user clicks the button, sees a "your plan is being analyzed..." spinner, and then the front-end is silent. If the webhook fires, great, the spinner becomes a result. If the webhook does not fire, the user sits with that spinner for a full fifteen minutes before any other code path even tries to discover the truth. They reload the page. They check the network tab. They contact us. By the time the polling fallback fires its first request, the user has already decided we are broken.

The retry-design trap

When you reach for retry logic in any system, the default mental model most engineers grab is "start short, double each time, give up at some bound." If you have ever written time.sleep(2 ** attempt) you have used it. It is taught early, it appears in HTTP client libraries, it ships in AWS SDKs by default. It is the right answer to a real problem.

But it is the right answer to a specific problem: you are calling something that is probably failing, and you do not want to hammer it while it is on fire. Each retry is a fresh attempt at the same operation. You assume the remote side might be temporarily unable to serve you, you give it space to recover, and you increase the wait between attempts so that if the outage is long, you are not piling on. The pattern protects the server from you.

The polling fallback in Green is doing something different. The job we are checking on is, in the overwhelming majority of cases, completely healthy. It started running a few minutes ago. It is going to finish on its own. The only reason we are polling at all is to catch the rare case where Purple finished, told us about it, and the message did not get through. We are not retrying a failing call. We are scanning for a missed event.

Once you frame it that way, the standard retry shape becomes obviously wrong. Starting short and lengthening makes sense when "short" means "give the failing thing a moment to recover." That is not what we are doing. We are saying "did the message arrive yet?" There is no recovery happening on the other side, because the other side is fine. Waiting longer between checks does not help anyone. It just delays the moment we notice the missed message.

If you stay with the standard shape and just shorten the initial wait, you end up over-polling at the tail. A job that legitimately takes 35 minutes does not need someone tapping it on the shoulder every 30 seconds for the back half of its run. That actually does spend API calls and Celery worker capacity for no information gain.

The shape we wanted was something the standard pattern does not provide a good vocabulary for. Aggressive at the start. Calmer at the end. Inverted from the usual instinct. Every framing I tried for it (front-loaded, decaying, head-heavy) sounded jargony and made the actual idea harder to talk about than it deserved. So I will skip the label entirely and just describe the shape.

We want the first poll within roughly a minute of submitting the job, because the cost of a missed webhook is measured in the user's emotional clock. We want a tight cluster of polls in the first five minutes, because that is the window in which essentially every kind of webhook failure manifests. Then we want to space out, because once you are ten minutes into a healthy job, the user has already accepted that this is going to take a while, and quick polling buys nothing.

The numbers after the change

Here is the new schedule, lifted from poll_purple_analyze_job.py:

# Polling window = 60s initial + sum(_RETRY_BACKOFFS) ≈ 73 min total.
# Front-loaded so a missed webhook is noticed within ~2 minutes.
_RETRY_BACKOFFS = [60, 60, 120, 240, 480, 480, 480, 480, 480, 480, 480, 480]
_MAX_RETRIES = len(_RETRY_BACKOFFS)

The submitting task schedules the first poll with countdown=60 instead of countdown=900. Each retry uses the next entry in the array as its countdown. Once the array is exhausted, the task gives up and marks the plan as failed so the UI can exit the loading state.

The total budget is almost identical to the old design. Old: 15 + 12 × 5 = 75 minutes. New: 1 + 1 + 2 + 4 + (8 × 8) = 72 minutes. Both cover the long tail of legitimately long analyses with room to spare. Both stop somewhere around the 70-minute mark, which is where we have decided that further waiting is not actually going to produce a useful result and the right move is to surface the failure and let the user retry from the PRD page.

What changed is the distribution of those minutes.

Metric	Before	After
Time to first poll	15 min	1 min
Worst-case missed-webhook detection	15 min	2 min
Polls in the first 5 minutes	0	4
Polls in the first 10 minutes	0	5
Total polling budget	75 min	72 min
Polls at the long tail (every interval)	5 min	8 min

The most important row in that table is the second one. Worst case detection went from fifteen minutes to two. That is a roughly 7.5× improvement in the time it takes the system to notice that a webhook went missing. For users who hit this path, that translates directly into how long they sit watching nothing happen.

Why is two minutes the right ceiling for missed-webhook detection? It comes from looking at how webhook failures actually present in our environment. Configuration errors and signature mismatches surface on the very first request, because the verification step is deterministic and the same key is used every time. Network blips, deploy bounces, and 5xx storms are short-lived. We have never seen a webhook failure pattern in production that took more than a couple of minutes to show up. So if we have not heard back within the first five-ish minutes of polling, the failure is one of the loud, immediate kinds, and it is already in our logs. If the webhook does eventually arrive late, the polling task is idempotent and skips out as soon as it sees the plan resolved.

Conversely, the long tail is where polite polling actually pays off. Once a job has been running for ten minutes and is still in in_progress, you are probably looking at one of the genuinely slow analyses. Polling that every 30 seconds does nothing useful and just clutters logs. Eight-minute intervals at the tail give the job room to finish on its own and only check in occasionally.

The dispatch in the submitting task is a single line:

poll_purple_analyze_job.apply_async(
    kwargs={
        "plan_id": str(plan_id),
        "analyze_job_id": analyze_job_id,
        "organization_id": str(purple_org_id),
        "project_id": str(project_id),
        "retry_count": 0,
    },
    countdown=60,  # was 900
)

That single number, 900 to 60, is most of the user-facing improvement. The array reshape is what protects the server from the consequences.

The deeper lesson

The thing I keep coming back to after this change is how much of "this product feels good" turns out to be set in the first sixty to ninety seconds of any long async operation.

A user clicking "generate plan" is making a small bet. They believe, tentatively, that this is going to work. They are willing to wait. But they need the system to keep that belief warm, and the way you keep it warm is by giving them a sign of life early. It can be a progress bar that moves. It can be a status string that updates. It can be, in our case, a backend that quickly notices when something has gone wrong and surfaces the truth instead of letting the spinner spin.

What the system absolutely cannot do is stay silent for fifteen minutes. By minute three the user has already started constructing a story about what is broken. By minute five they are looking for a way to cancel. By minute ten they have moved on and the next time they come back they will arrive expecting failure. Even if the webhook eventually fires at minute twelve and everything works, the experience has been spent.

The original 15-minute initial wait was reasoning about the wrong thing. It was optimizing the API call profile against the modal completion time of the underlying job. That is a real number and it is a real consideration, but it is not the constraint that should drive the polling cadence. The constraint that should drive the polling cadence is "how long can the user sit in front of a silent screen before they conclude we are broken." For our users, that number is somewhere between 60 and 90 seconds. Past that, you are losing them.

This generalizes. Any time you have a long-running async AI task, somewhere in the system there is a piece of code that decides how often the rest of the system asks "is it done yet." That code is a UX decision, not a backend decision. Treat it that way.

The framing I now use when reviewing this kind of code is to separate two distinct questions and answer them separately. Question one: how quickly do we need to detect that the happy path failed? That governs the early polling cadence. The answer is almost always "faster than you think," because the happy path failing silently is the worst experience the system can produce. Question two: how patiently can we wait for the work to finish on its own? That governs the late polling cadence. The answer is usually "more patiently than you think," because once the user has accepted the wait, polling more often does not buy anyone anything.

Server politeness is a real cost, and I do not want to pretend otherwise. Hammering an internal API every five seconds for an hour wastes capacity and clutters dashboards. But you weigh it against the perception cost. For a small B2B SaaS like ours, a single user concluding the product is broken and ghosting is far more expensive than any conceivable amount of well-bounded internal polling traffic. We are on a private API to our own service. The economics are not even close.

We added a single line to our internal design checklist as a result of this work: "First poll inside 60 seconds." When we review any new long-running async flow, that line gets checked. If we are scheduling the first liveness check more than a minute after submission, we have to justify it explicitly, in writing, against the user-perception cost. So far we have not had a single case that survived that justification.

What else got fixed along the way

A couple of things came along for the ride in the same PR, because once you start looking at one polling task you tend to notice the things around it.

The polling task now has an explicit "give up" path that marks the plan as failed when the retry array is exhausted. The original code logged a warning and exited. The plan row stayed in in_progress forever, which meant the UI loading state never resolved and the user could not even retry generation, because the front-end refused to start a new job while the previous one was supposedly still running. The fix is small but important: when retries hit the wall, write an explanatory error message to the plan, mark it failed, and publish a status-change event so the UI exits the spinner. The error message tells the user how long we waited and suggests retrying from the PRD detail page. It is also idempotent, so if the webhook arrives late and resolves the plan as completed, the giveup path sees generation_status is no longer in_progress and does nothing.

We also added an admin recovery endpoint for the case where a plan does get stuck in some unexpected state, usually because of a bug we have not seen yet. It manually transitions a plan back to a state where the user can retry. This sits in our admin tools and is not user-facing, but it has been useful exactly twice in the month since we shipped it, both for cases that taught us about new failure modes we then fixed properly. Operational tools earn their keep.

Neither of these changes was the headline of the PR. They were both downstream consequences of taking the polling task seriously enough to read it line by line. That is its own lesson. Polling tasks tend to be the bit of code nobody reads. They are scheduled once when the feature is built and then they quietly run forever. The next time you find yourself in a polling fallback that nobody has touched in months, it is worth half an hour of your time to read the whole thing and ask whether the cadence still matches what users actually need.

Wrap

The principle, in one sentence: poll the way the user feels the product, not the way the server feels the load. Almost everything else falls out of that.

If you want to see what the rest of Codens looks like, the English landing page is at https://www.codens.ai/en/ and our help docs (which include a lot more about how Green and Purple talk to each other) live at https://help.codens.ai/en/. The polling task discussed in this post lives in the open part of our backend; if you happen to spot a different case where this same trade-off applies, I would genuinely like to hear about it.