<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Takayuki Kawazoe</title>
    <description>The latest articles on Forem by Takayuki Kawazoe (@zoetaka38).</description>
    <link>https://forem.com/zoetaka38</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3902826%2F0187a85d-f9a1-45bb-871d-bf5e49ddcccc.jpeg</url>
      <title>Forem: Takayuki Kawazoe</title>
      <link>https://forem.com/zoetaka38</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/zoetaka38"/>
    <language>en</language>
    <item>
      <title>Claude Code Skills vs deterministic verify commands — same checks, very different ergonomics</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sun, 03 May 2026 08:05:21 +0000</pubDate>
      <link>https://forem.com/zoetaka38/claude-code-skills-vs-deterministic-verify-commands-same-checks-very-different-ergonomics-fgb</link>
      <guid>https://forem.com/zoetaka38/claude-code-skills-vs-deterministic-verify-commands-same-checks-very-different-ergonomics-fgb</guid>
      <description>&lt;p&gt;A peer showed me their verification pipeline last week. We were both running coding agents over real tickets, both wanted "did the agent actually finish the work" to be a hard gate, and we both had a thing for it. Theirs was a Claude Code Skill — a Markdown file with a few prompt fragments and tool permissions, invoked by the agent itself when it thought verification was warranted. Mine was a deterministic shell command pinned into a workflow step that runs every time, no agent judgment involved. Same intent. Very different operational shape.&lt;/p&gt;

&lt;p&gt;The conversation that followed — "why did you pick the one you picked, and where does it bite you" — is what this post is. I'm building an AI dev harness called Codens; what's relevant here is that its verification layer is on the command side of this divide, and my read on Skill-based verification is partly observational from peers running them and partly from the Skills I do have installed for &lt;em&gt;developer-side&lt;/em&gt; work in the same repo. I'll be honest about which is which.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Skill-based verification" looks like
&lt;/h2&gt;

&lt;p&gt;A Skill in Claude Code is a Markdown-defined behavior contract. It lives in &lt;code&gt;.claude/skills/&amp;lt;name&amp;gt;/SKILL.md&lt;/code&gt; (project-scoped) or in the user's home directory, has a YAML frontmatter with a &lt;code&gt;name&lt;/code&gt; and a &lt;code&gt;description&lt;/code&gt;, and a body that's free-form prompt text plus optional references to scripts, sub-agents, or tool permissions. The agent reads the available Skills' descriptions on every turn and decides — in the same way it decides whether to call a tool — whether the current situation warrants invoking one.&lt;/p&gt;

&lt;p&gt;The shape of a verification Skill is something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;name: verify-task
description: |
  Run after implementing a Notion ticket. Checks that lint, types,
  and unit tests pass for the touched code paths, and reports back
  in plain English with a recommended next action.

&lt;span class="gh"&gt;# Verify Task&lt;/span&gt;

When invoked, do the following:
&lt;span class="p"&gt;1.&lt;/span&gt; Read the diff and infer which packages were touched.
&lt;span class="p"&gt;2.&lt;/span&gt; For each touched package, run the appropriate lint/type/test commands.
&lt;span class="p"&gt;3.&lt;/span&gt; If everything passes, return a one-line summary.
&lt;span class="p"&gt;4.&lt;/span&gt; If something fails, summarize the smallest reproducer, link to the
   failing line, and suggest one concrete fix.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strengths are what you'd expect from giving an LLM a verification rubric instead of a script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contextual judgment.&lt;/strong&gt; The Skill decides &lt;em&gt;what&lt;/em&gt; to verify based on the diff. A pure-frontend change skips the backend test suite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural-language failure explanations.&lt;/strong&gt; When something breaks, the agent doesn't dump 1500 bytes of stderr — it summarizes. "The test &lt;code&gt;test_normalize_uv_venv&lt;/code&gt; fails because the function now strips a trailing slash the test expects to remain."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output that varies in shape with the run.&lt;/strong&gt; A clean pass returns a one-liner; a messy failure returns a structured triage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The agent decides if and when to call it.&lt;/strong&gt; No engine logic to wire.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last property is also where Skill-based verification gets a little scary. It depends on the agent reading the room correctly — exactly the property that makes me reluctant to put production gating on it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;verify_commands&lt;/code&gt; looks like
&lt;/h2&gt;

&lt;p&gt;The mechanism on the other side is what Codens uses internally. The unit of verification is a shell command (or a chain of them) declared on the ticket itself, and the workflow engine runs it as a step. No model in the loop, no judgment about whether to run, no judgment about how to interpret the result. Pass means exit 0. Fail means non-zero. Output is captured verbatim and the tail surfaces in the failure message.&lt;/p&gt;

&lt;p&gt;The runner is small. Here's the actual thing, from &lt;code&gt;purple-codens/backend/src/infrastructure/workflow/steps/run_tests.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npm run test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Interpolate context variables (e.g., {verify_commands} from Notion)
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;span class="c1"&gt;# Build final command: prepend project-level defaults before task-specific commands.
# Each chunk runs inside its own subshell `(...)` so a `cd dir &amp;amp;&amp;amp; ...` step
# in the project default does not leak its pwd into the task-level verify and
# cause the next `cd dir` to fail. The agent service runs the final string
# under /bin/sh (dash on Debian/Ubuntu), where a failed `cd` returns exit 2 —
# which RunTestsStep would otherwise surface as the misleading
# "Tests failed (exit code: 2)".
&lt;/span&gt;&lt;span class="n"&gt;default_verify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_verify_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_verify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;default_verify&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;task_verify&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;default_verify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_verify&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth pointing out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Two layers chained.&lt;/strong&gt; &lt;code&gt;default_verify_commands&lt;/code&gt; is the project-level "always run this first" — typically &lt;code&gt;format &amp;amp;&amp;amp; lint &amp;amp;&amp;amp; typecheck&lt;/code&gt;. &lt;code&gt;verify_commands&lt;/code&gt; is the task-level — usually a targeted test that exercises the specific change. They concatenate with &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;, so defaults must pass before task-specific runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subshell wrapping.&lt;/strong&gt; Each chunk is wrapped in &lt;code&gt;(...)&lt;/code&gt; so a &lt;code&gt;cd dir &amp;amp;&amp;amp; ...&lt;/code&gt; in one part doesn't leak its working directory into the next. The kind of thing you only fix after staring at "Tests failed (exit code: 2)" in Slack and realizing dash exits 2 when &lt;code&gt;cd&lt;/code&gt; fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The preset wires it.&lt;/strong&gt; From &lt;code&gt;notion_compatible.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_tests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{verify_commands}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timeout_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notify_success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_failure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fix_verify"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When verify fails, the surfaced error includes the tail of the captured output — not just the exit code — so the next step has something concrete to feed back to the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tests failed (exit code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;). Last output:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tests failed (exit code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;); agent reported no output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;StepResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_step_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;previous_step_output&lt;/code&gt; is the variable the next claude_code step (&lt;code&gt;fix_verify&lt;/code&gt;) interpolates into its prompt: "Verification failed. Fix the issues. Output: &amp;lt;last 1500 bytes&amp;gt;." The agent gets stderr to react to, but the &lt;em&gt;decision&lt;/em&gt; about whether verification passed or failed is made by the engine reading the exit code — not by the agent reading the output.&lt;/p&gt;

&lt;p&gt;Strengths, mirroring the Skill side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It always runs.&lt;/strong&gt; No "did the agent reach for the Skill" question. Step exists, step executes, exit code resolves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bool pass/fail.&lt;/strong&gt; Engine routes on &lt;code&gt;on_success&lt;/code&gt; or &lt;code&gt;on_failure&lt;/code&gt;. No interpretation layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured failure output.&lt;/strong&gt; A 1500-byte tail of real stderr, not an LLM's summary of it. Sometimes the LLM is wrong about what failed; the raw tail is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drop into CI.&lt;/strong&gt; The same command line runs in the workflow, in GitHub Actions, and on a contributor's laptop. Portable in a way a Skill is not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where each one is the better fit
&lt;/h2&gt;

&lt;p&gt;Both mechanisms can do "verification." The interesting question is when each one is &lt;em&gt;ergonomically&lt;/em&gt; the right tool, not just when it's technically possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skill is the better fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The check requires &lt;em&gt;contextual judgment&lt;/em&gt; about what to look at. "Did the PR include sensible test coverage for the changed lines" is a Skill question — what counts as sensible varies by file, by convention, by whether the change is a refactor or new behavior. A command can't make that judgment without becoming an LLM under the hood.&lt;/li&gt;
&lt;li&gt;The output is a &lt;em&gt;report&lt;/em&gt;, not a gate. "Summarize the risk of this change" produces prose that varies with the run. Forcing it into a deterministic command's stdout feels backwards.&lt;/li&gt;
&lt;li&gt;The check is &lt;em&gt;opportunistic&lt;/em&gt; — useful when relevant, skippable when not. The agent deciding "this is a docs-only change, verify-task doesn't apply" is the right behavior. A command would have to bake that branch in.&lt;/li&gt;
&lt;li&gt;The check is &lt;em&gt;for the human reviewer&lt;/em&gt;. PR descriptions, change-impact estimates, "what else might this affect" — content for the conversation log, not a gate that blocks the PR from existing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Command is the better fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The check is a &lt;em&gt;gate&lt;/em&gt;. The pipeline must not advance if it fails. There must be no path where the agent decided not to call it.&lt;/li&gt;
&lt;li&gt;The check has a &lt;em&gt;natural deterministic implementation&lt;/em&gt;. Linters, type checkers, test runners, schema diffs, contract tests against an OpenAPI spec — they exist already and have spent years getting their exit codes right. Wrapping them in a Skill adds an interpretation layer that only makes them less reliable.&lt;/li&gt;
&lt;li&gt;You want &lt;em&gt;one source of truth&lt;/em&gt; across local, CI, and the agent loop. The same &lt;code&gt;npm run test:e2e&lt;/code&gt; runs in all three. If the Skill version wins, you now have two truth sources, and they will eventually disagree.&lt;/li&gt;
&lt;li&gt;The failure output is &lt;em&gt;load-bearing&lt;/em&gt;. When the agent gets routed back into a fix loop with "Verification failed. Output: &amp;lt;tail&amp;gt;," it needs the actual stderr, not a paraphrase. Paraphrases drop information the next turn needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A meta-property: &lt;strong&gt;commands compose with other commands; Skills compose less well with non-LLM tooling.&lt;/strong&gt; If verify also gates Slack, the merge step, the deploy — that's engine-level routing, and the engine wants &lt;code&gt;bool&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The traps each side hits
&lt;/h2&gt;

&lt;p&gt;Skill side, failure modes I keep seeing in peers' setups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The agent forgets to invoke.&lt;/strong&gt; The Skill description matched ten previous turns; this turn the agent is mid tool-use chain and never reaches for it. No "did you remember to verify" hook — just the agent's running judgment, which is a function of the prompt window at that exact turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill bloat eats context.&lt;/strong&gt; Each installed Skill's description sits in the context window every turn. A repo with twenty Skills pays twenty descriptions of attention budget for the chance one matches. Verification needs to be always-available, which means its description has to fit in the "always relevant" slot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard to verify which version is running.&lt;/strong&gt; Skills get edited; the agent reads the file at invocation time. Did the last fix actually take effect? With a command, the answer is &lt;code&gt;git log -- path/to/script&lt;/code&gt;. With a Skill, it's "let me read the file and hope no one edited it in the last hour."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-reports are not gates.&lt;/strong&gt; A Skill returning "verification passed" is an LLM telling you the LLM thinks it passed. Fine for a PR comment or triage note. Not fine for "should this merge to main." The pass-signal is structurally weaker than an exit code, and the only way to upgrade it is to put a deterministic check inside the Skill — at which point the Skill is just a shell wrapper around a command.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Command side, scars I have from each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Feedback latency.&lt;/strong&gt; Commands take minutes. A full verify chain (format → lint → typecheck → test) is slower. A Skill that says "looks fine" in three seconds feels great by comparison — until it's wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output-size overflow.&lt;/strong&gt; Captured stdout can be megabytes. We tail to 1500 bytes for the error message; truncation can land mid-error. If the traceback is at the top of the output and the bottom is just pytest's summary, the tail you surface is the wrong tail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sloppy exit codes hide "why."&lt;/strong&gt; Plenty of CLI tools exit non-zero for both "your code failed" and "I couldn't even start." &lt;code&gt;cd nonexistent &amp;amp;&amp;amp; pytest&lt;/code&gt; exits 2 from the failed &lt;code&gt;cd&lt;/code&gt;, which the runner reports as "Tests failed (exit code: 2)" — same shape as a real test exit 2. The subshell-wrapping fix landed precisely because this confusion was costing afternoons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No graceful "doesn't apply."&lt;/strong&gt; A command that runs &lt;code&gt;pytest backend/tests/&lt;/code&gt; on a frontend-only PR runs the whole suite anyway, or fails because backend fixtures aren't set up on this branch. There's no built-in "skip if irrelevant" — either the command short-circuits itself, or the engine adds routing logic. Both are work the Skill gets for free from the agent's judgment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My current division of labor (provisional)
&lt;/h2&gt;

&lt;p&gt;In Codens, the production workflow is &lt;em&gt;commands all the way down&lt;/em&gt;. Verify is a &lt;code&gt;run_tests&lt;/code&gt; step. Conflict checks are a &lt;code&gt;check_conflicts&lt;/code&gt; step. CI gating is a &lt;code&gt;wait_ci_checks&lt;/code&gt; step. The agent (&lt;code&gt;claude_code&lt;/code&gt; step) is the thing that makes the &lt;em&gt;changes&lt;/em&gt;; everything that &lt;em&gt;judges&lt;/em&gt; whether a change is acceptable is deterministic. The reasoning is the gating property — these checks must run, must pass before the next step, and must produce structured output that the next agent step can react to.&lt;/p&gt;

&lt;p&gt;What I do have on the Skill side is &lt;em&gt;developer-facing&lt;/em&gt;: the Skills in &lt;code&gt;purple-codens/.claude/skills/&lt;/code&gt; are &lt;code&gt;test-generator&lt;/code&gt;, &lt;code&gt;pr-creator&lt;/code&gt;, and &lt;code&gt;bug-analyzer&lt;/code&gt;. These run when I'm working in Claude Code locally on the Codens repo itself — when I ask "add tests for this," the test-generator Skill picks up. They are not invoked by the production workflow that runs on customer code. So the honest answer to "does Codens use Skills" is "yes, for the human-in-Claude-Code experience inside the repo; no, for the autonomous workflow that ships to customers." Different mechanisms for different operating modes, even though the underlying tasks (generate tests, write PR descriptions, analyze bugs) are similar.&lt;/p&gt;

&lt;p&gt;A first-pass mapping of what I've settled on for which mechanism does what:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lint, type check, unit test&lt;/td&gt;
&lt;td&gt;command&lt;/td&gt;
&lt;td&gt;Gate. Already a tool with a good exit code. Composes with CI.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run task-specific verify&lt;/td&gt;
&lt;td&gt;command&lt;/td&gt;
&lt;td&gt;Gate. Output is load-bearing for &lt;code&gt;fix_verify&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate PR description&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Report, not a gate. Wants prose. Varies with run.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze bug from stack trace&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Contextual. Output is for the human reading triage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Estimate change impact across files&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;Judgment-heavy. Doesn't gate anything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict detection&lt;/td&gt;
&lt;td&gt;command&lt;/td&gt;
&lt;td&gt;Gate. Engine routes on result.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test generation&lt;/td&gt;
&lt;td&gt;Skill&lt;/td&gt;
&lt;td&gt;One-shot, in-conversation. Not a workflow step.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wait for CI checks&lt;/td&gt;
&lt;td&gt;command&lt;/td&gt;
&lt;td&gt;External signal, deterministic poll.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern that fell out: &lt;strong&gt;gating belongs to commands, judgment-as-a-report belongs to Skills&lt;/strong&gt;. When I find myself wanting "the gate decision should depend on the agent's judgment," that's usually a sign that the gate is wrong, not that I should put a Skill behind it. Either tighten the command so it fails on the right things, or stop trying to gate that property.&lt;/p&gt;

&lt;h2&gt;
  
  
  The unresolved question
&lt;/h2&gt;

&lt;p&gt;Both mechanisms have escape hatches that, if I leaned on them, would blur the distinction.&lt;/p&gt;

&lt;p&gt;On the Skill side, you could imagine a hook — "force the agent to invoke this Skill before exiting" — that turned the Skill into a mandatory step. That removes the "agent forgets to invoke" trap, but in exchange you've recreated the command. The Skill's contextual judgment now happens in a fixed slot, the engine routes on its output, and the only difference from a command is that the Skill body is Markdown instead of a shell script. It's not clear that buys you anything except a worse failure-output story.&lt;/p&gt;

&lt;p&gt;On the command side, you could add "skip this command if X" routing — &lt;code&gt;if diff_files match frontend/**, skip backend/run_tests&lt;/code&gt; — and recover some of the Skill's irrelevance handling. That works for a few cases, but the routing logic accumulates: skip if frontend-only, skip if docs-only, skip if dependency-only, skip if revert. Each new condition is a string of YAML or a few lines of Python. At some point you've hand-coded the agent's judgment into a rule engine, and you'd have been better off just letting the model decide.&lt;/p&gt;

&lt;p&gt;So maybe the real shape is that "Skill vs command" is the two ends of a slider labeled &lt;strong&gt;how much judgment do I delegate to the model&lt;/strong&gt;. All the way to the command end: zero delegation, the engine decides everything. All the way to the Skill end: full delegation, the model decides whether and how to verify. The interesting question isn't "which one" but "how much delegation does this specific check tolerate." Gates tolerate very little. Reports tolerate a lot. Most things sit somewhere on the slider, and the choice between Skill and command is really a choice about where on the slider to draw the line for that one check.&lt;/p&gt;

&lt;p&gt;The version of this I haven't resolved: &lt;strong&gt;what's a check that genuinely benefits from being on the slider's middle, not at either end?&lt;/strong&gt; "Verify-with-judgment" — the engine gates on a deterministic signal, but the &lt;em&gt;interpretation&lt;/em&gt; of the signal for the next agent's prompt is generated by a Skill. That'd give you the gate's certainty and the Skill's prose. I haven't built it yet. I suspect when I do, the seam between the two halves — "the deterministic part decided pass/fail, the prose part explains why" — is where the next class of bugs lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you've shipped both Skill-based and command-based verification in the same agent system and have a heuristic for when to use which, I'd want to hear your rule. My current rule — &lt;strong&gt;if it gates, it's a command; if it reports, it's a Skill&lt;/strong&gt; — feels right but I've only been running it for a few months, and I haven't seen it under the load that breaks it. The peer who showed me their Skill-based pipeline was running it for a smaller, less gated workflow than mine, and we both came away unconvinced that either of us had the universally right answer.&lt;/p&gt;

&lt;p&gt;The other thing I'm curious about is the middle of the slider: have you found a way to combine the two — deterministic gate plus agent-generated explanation — that doesn't decay into either "the explanation is unreliable so we ignore it" or "the gate is too rigid so we override it"? That's where I think the next interesting design lives, and I don't yet have the shape of it.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>devops</category>
      <category>testing</category>
    </item>
    <item>
      <title>I shipped a regex to catch AI agents committing only half the work. Then I ripped it back out.</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sat, 02 May 2026 11:12:05 +0000</pubDate>
      <link>https://forem.com/zoetaka38/i-shipped-a-regex-to-catch-ai-agents-committing-only-half-the-work-then-i-ripped-it-back-out-3imi</link>
      <guid>https://forem.com/zoetaka38/i-shipped-a-regex-to-catch-ai-agents-committing-only-half-the-work-then-i-ripped-it-back-out-3imi</guid>
      <description>&lt;p&gt;The agent committed the test file, pushed it, and reported success. The implementation file the test was supposed to exercise was not in the diff. Exit code 0. PR open. CI green. From every downstream system, the run looked clean. The only reason I caught it was that someone glanced at the PR diff before merging and said "wait, where's the actual change."&lt;/p&gt;

&lt;p&gt;That moment, six months ago, is when I decided I needed something stronger than "the agent said it was done." This post is the story of what I built, why I kept patching it for half a year, and why I finally deleted the whole thing in a single commit: &lt;code&gt;911005fd&lt;/code&gt;, 742 lines deleted, 39 added, net -703, plus around 280 unit tests removed. The thing I deleted was a regex-based "diff shape guard" that tried to detect partial implementations by comparing the agent's diff against the file paths mentioned in the ticket. The thing I kept was an existing behavior-based check — &lt;code&gt;verify_commands&lt;/code&gt; — that turns out to subsume everything the regex was trying to do, and a lot more.&lt;/p&gt;

&lt;p&gt;I'm building an AI dev harness called Codens — happy to talk about it but it's not the point of this post. The point is the lesson: &lt;strong&gt;shape validation of AI output is a trap that compounds every time you try to fix it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: what "partial implementation" looks like
&lt;/h2&gt;

&lt;p&gt;When you hand a ticket to a coding agent, the failure modes that &lt;em&gt;aren't&lt;/em&gt; "the agent crashed" are the ones that bite you. Exit 0, commits made, work incomplete in a specific way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test file written, implementation file missing.&lt;/li&gt;
&lt;li&gt;Implementation written, migration missing.&lt;/li&gt;
&lt;li&gt;Function signature updated, call sites in three other modules still pass the old shape.&lt;/li&gt;
&lt;li&gt;Route handler added, OpenAPI spec the SDK regenerates from unchanged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent is locally consistent in all of these — from inside its context window it did the natural next step. From outside, the PR is half a feature.&lt;/p&gt;

&lt;p&gt;Our workflow launches Claude Code on a remote worker, hands it a Notion ticket, lets it create a branch, implement, commit, push. Then a &lt;code&gt;verify&lt;/code&gt; step runs the ticket's &lt;code&gt;verify_commands&lt;/code&gt; — a per-ticket shell snippet that says "task is done iff these commands pass." If verify fails, a &lt;code&gt;fix_verify&lt;/code&gt; claude_code step gets the verify output piped into its prompt and tries again. The preset matters here because the whole article hinges on the relationship between &lt;code&gt;develop&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;backend/src/infrastructure/workflow/presets/notion_compatible.json&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(excerpt)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"develop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"opus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"max_turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"timeout_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1800&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_failure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notify_failure"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_tests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{verify_commands}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timeout_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notify_success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_failure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fix_verify"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fix_verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"opus"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_override"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Verification failed. Fix the issues.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Output:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;{previous_step_output}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_failure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notify_failure"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;verify_commands&lt;/code&gt; is the &lt;strong&gt;behavior-based&lt;/strong&gt; check: did the thing the ticket described actually happen, judged by running real tests against real code. Slow, expensive (seconds to minutes), semantically strong.&lt;/p&gt;

&lt;p&gt;What I added on top of it was a &lt;strong&gt;shape-based&lt;/strong&gt; check: did the diff at least &lt;em&gt;touch&lt;/em&gt; the files the ticket mentioned. Cheap, fast, runs immediately after &lt;code&gt;develop&lt;/code&gt;. The intuition was "if the ticket says modify &lt;code&gt;src/notion_sync.py&lt;/code&gt; and the agent's diff doesn't touch it, fail fast — don't even bother running verify."&lt;/p&gt;

&lt;p&gt;The intuition was wrong in a way that took me half a year to fully accept.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the shape guard actually was
&lt;/h2&gt;

&lt;p&gt;The mechanism had three pieces, all in &lt;code&gt;backend/src/utils/expected_files.py&lt;/code&gt;. The first was a regex that pulled file-shaped substrings out of the ticket body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/utils/expected_files.py (DELETED)
# Matches relative file paths with known source-code extensions.
# Accepts optional leading directory segments (lowercase, hyphens, underscores,
# digits) followed by a filename that may begin with uppercase or digits.
&lt;/span&gt;&lt;span class="n"&gt;_FILE_PATH_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:[a-z][a-z0-9_-]*/)*[a-zA-Z0-9][a-zA-Z0-9_.-]*\.(?:py|tsx?|md|sh|ya?ml)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension allowlist is restrictive on purpose — &lt;code&gt;.docx&lt;/code&gt;, &lt;code&gt;.png&lt;/code&gt;, &lt;code&gt;.csv&lt;/code&gt; would have generated noise without ever being something the agent should be modifying.&lt;/p&gt;

&lt;p&gt;The second piece was the cover-check, which used a deliberately loose suffix match so that a ticket saying "modify &lt;code&gt;notion_sync.py&lt;/code&gt;" would still match a diff that included &lt;code&gt;src/infrastructure/notion_sync.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (DELETED) _diff_covers_expected
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_diff_covers_expected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;diff_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;expected_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return (all_covered, missing_files).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;expected_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;ef&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;expected_files&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;
            &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;ef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;diff_files&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The third piece was the call site inside &lt;code&gt;claude_code.py&lt;/code&gt;, sitting right after &lt;code&gt;develop&lt;/code&gt; finished and right before the no-op detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/infrastructure/workflow/steps/claude_code.py (DELETED block)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;repos_committed_set&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolve_conflicts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;_guard_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notion_body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_goal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;_guard_expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_extract_expected_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_guard_body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_guard_expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_guard_diff_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diff_files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_guard_diff_raw&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;_guard_diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_guard_diff_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# worker did not supply diff_files — fall back to the agent's
&lt;/span&gt;            &lt;span class="c1"&gt;# marker block, then to scanning stdout for path-shaped substrings.
&lt;/span&gt;            &lt;span class="n"&gt;_guard_diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nf"&gt;_parse_modified_files_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_extract_expected_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_guard_ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_guard_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_diff_covers_expected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_guard_diff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_guard_expected&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;_guard_ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;StepResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expected files missing from diff: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_guard_missing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time it caught a real bug — a ticket that mentioned &lt;code&gt;src/notion_sync.py::_normalize_uv_venv&lt;/code&gt; where the agent had only created the test file — it felt like the right investment. A cheap regex was catching a class of error that would otherwise have wasted minutes of verify_commands time and produced a misleadingly-green run. That single early win paid for the rest of the half-year.&lt;/p&gt;

&lt;p&gt;It also paid for what I now think were four bad bets, each of which I patched my way through before finally giving up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reason 1 we ripped it out: half the tickets had no file paths to extract
&lt;/h2&gt;

&lt;p&gt;The first crack was statistical. After a month in production, roughly half of all tickets did not mention a single concrete file path. They were tickets like "The reaction bar feels sluggish on mobile, tighten it" or "When the agent fails three times in a row, send a Slack alert." Normal tickets — they describe the &lt;em&gt;outcome&lt;/em&gt;, not the &lt;em&gt;implementation surface area&lt;/em&gt;. The guard's response, by deliberate design, was to skip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (DELETED docstring)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns an empty list when no paths are found, which causes the shape guard
to skip (false-positive avoidance for tickets that don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t mention specific files).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That skip was the only safe response — firing on &lt;code&gt;expected_files=[]&lt;/code&gt; would have failed every outcome-style ticket. But the consequence is that the guard only protected runs where someone had bothered to write file paths into the description. The other half ran with no protection at all. I had built a safety net with holes the size of normal human ticket-writing, and the math on that was bad before I even hit the next three problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reason 2: the regex caught file paths in "References" and "Notes" sections too
&lt;/h2&gt;

&lt;p&gt;The second crack appeared when teams started writing more structured tickets — the kind with sections for "Implementation," "References," "Test Plan," "Notes." The regex pulled file paths out of all of them. So a ticket that said "implement X in &lt;code&gt;foo.py&lt;/code&gt;, and as background see how &lt;code&gt;bar.py&lt;/code&gt; does it" would set &lt;code&gt;expected_files = ["foo.py", "bar.py"]&lt;/code&gt;. The agent would correctly modify &lt;code&gt;foo.py&lt;/code&gt;, leave the reference file alone, commit, and the guard would fail the run because &lt;code&gt;bar.py&lt;/code&gt; wasn't in the diff.&lt;/p&gt;

&lt;p&gt;The patch was a section-aware preprocessor. Strip out the headings I didn't want the regex to scan, then run the regex on what's left:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (DELETED) _EXCLUDED_SECTION_TITLES
&lt;/span&gt;&lt;span class="n"&gt;_EXCLUDED_SECTION_TITLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verification checklist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;テスト&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;検証&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checklist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;references&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;参考&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;備考&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus a markdown heading walker that propagated exclusion to subsections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (DELETED) _strip_excluded_sections
&lt;/span&gt;&lt;span class="n"&gt;_HEADING_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^(#{1,6})\s+(.+)$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MULTILINE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_strip_excluded_sections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
    &lt;span class="n"&gt;headings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_HEADING_RE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;headings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;excluded_depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;excluded_depth&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;excluded_depth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;excluded_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;excluded_depth&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_EXCLUDED_SECTION_TITLES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;excluded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;excluded_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="c1"&gt;# ... rebuild kept text from non-excluded chunks ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked in the narrow sense. The next week, a ticket came in with a "## Background" section. Then "## 関連 PR" (Related PR). Then "## 過去の議論" (Prior Discussion). Then English variants the original list missed: "## Prior Art," "## Context," "## Linked Issues."&lt;/p&gt;

&lt;p&gt;Every new convention any engineer adopted for organizing tickets now required me to add a string to &lt;code&gt;_EXCLUDED_SECTION_TITLES&lt;/code&gt;, in two languages — or to go educate the engineer about which section names my regex understood. &lt;strong&gt;The ticket-writing humans did not exist to make my regex work.&lt;/strong&gt; They were communicating intent to an agent that had no trouble understanding what "Background" meant. Only the regex did. Every new exclusion added complexity while reducing scope. Each patch was net-negative on actual protection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reason 3: the agent's stdout fallback broke when the stdout got truncated
&lt;/h2&gt;

&lt;p&gt;The shape guard had two ways to know what the agent had touched. The first was &lt;code&gt;result["diff_files"]&lt;/code&gt;, a comma-separated list of paths from the worker, derived from &lt;code&gt;git diff --name-only&lt;/code&gt; after the agent finished. The second was a fallback that scanned the agent's stdout for path-shaped substrings, used when the worker hadn't yet been upgraded to return &lt;code&gt;diff_files&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# (DELETED fallback)
&lt;/span&gt;&lt;span class="n"&gt;_guard_diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;_parse_modified_files_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;_extract_expected_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fallback worked until the stdout got long enough to hit our truncator. Claude Code can produce thousands of lines of tool-use output. We truncate captured output to a fixed byte budget. When that truncator landed mid-path — &lt;code&gt;src/utils/expe...&lt;/code&gt; instead of &lt;code&gt;src/utils/expected_files.py&lt;/code&gt; — the regex stopped matching, the fallback returned an incomplete list, and the cover-check failed against a perfectly good diff.&lt;/p&gt;

&lt;p&gt;The patch was to ask the agent to emit a deterministic marker block at the end of its output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PURPLE-MODIFIED-FILES]
src/utils/expected_files.py
src/infrastructure/workflow/steps/claude_code.py
[/PURPLE-MODIFIED-FILES]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parsed by &lt;code&gt;_parse_modified_files_marker&lt;/code&gt;, which survived the cleanup because it's still useful for diagnostics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/utils/expected_files.py (kept)
&lt;/span&gt;&lt;span class="n"&gt;_MODIFIED_FILES_MARKER_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\[PURPLE-MODIFIED-FILES\](.*?)\[/PURPLE-MODIFIED-FILES\]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The marker survives truncation because it's emitted near the end of the run. It also handles paths the regex couldn't — Next.js dynamic routes like &lt;code&gt;app/[locale]/page.tsx&lt;/code&gt; were tripping &lt;code&gt;_FILE_PATH_RE&lt;/code&gt; because of the bracket segments.&lt;/p&gt;

&lt;p&gt;But the agent doesn't emit the marker reliably. When &lt;code&gt;max_turns&lt;/code&gt; is exhausted mid-tool-use, when the run hits a timeout, when &lt;code&gt;stop_reason&lt;/code&gt; ends as &lt;code&gt;tool_use&lt;/code&gt; rather than a final assistant message — the summary block at the end never gets written. So now I had a fallback for the fallback: the original regex scan applied to the truncated stdout I'd just established was unreliable. Three tiers of input source, each with its own failure modes, and the guard's behavior was a function of which tier happened to be active for any given run. I was spending more time reasoning about the guard's failure modes than the guard was saving me from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reason 4: the lockfile + no-op marker collision broke the design assumption
&lt;/h2&gt;

&lt;p&gt;This is the one that finally killed it.&lt;/p&gt;

&lt;p&gt;Some of our verify_commands include &lt;code&gt;npm install&lt;/code&gt; to make sure dependencies are present before running tests. There's also a marker convention for "this task is already done":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/infrastructure/workflow/steps/claude_code.py (kept)
&lt;/span&gt;&lt;span class="n"&gt;NOOP_VERIFIED_MARKER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PURPLE-NO-OP-VERIFIED-OK]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the agent emits that marker, the workflow treats the run as a verified no-op — no commits expected, no diff expected, success.&lt;/p&gt;

&lt;p&gt;The collision: when the agent runs &lt;code&gt;npm install&lt;/code&gt; as part of &lt;em&gt;checking&lt;/em&gt; whether the task is already done, npm regenerates &lt;code&gt;package-lock.json&lt;/code&gt;. The agent's session honestly concludes "nothing to do here" and emits &lt;code&gt;[PURPLE-NO-OP-VERIFIED-OK]&lt;/code&gt;. But the worker's commit hook commits the regenerated lockfile because it's the one mechanical change that happened during the session.&lt;/p&gt;

&lt;p&gt;Post-run state: &lt;code&gt;repos_committed_set = {"frontend"}&lt;/code&gt; because of the lockfile, &lt;code&gt;[PURPLE-NO-OP-VERIFIED-OK]&lt;/code&gt; is in the output, ticket says "fix the ReactionBar timing." The shape guard fires, looks for &lt;code&gt;ReactionBar.tsx&lt;/code&gt; in the diff, finds only &lt;code&gt;package-lock.json&lt;/code&gt;, stops the run with &lt;code&gt;Expected files missing from diff: ReactionBar.tsx&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Nothing has gone wrong. The agent correctly determined the work was already done. The lockfile is an unrelated side effect of the verification step. The guard is producing a false positive because its assumption — "if there are commits, the diff should match the ticket" — doesn't survive the case where commits exist for reasons orthogonal to the ticket.&lt;/p&gt;

&lt;p&gt;The patch was to short-circuit the guard when the no-op marker is present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/infrastructure/workflow/steps/claude_code.py (current)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;repos_committed_set&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;step_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolve_conflicts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# When [PURPLE-NO-OP-VERIFIED-OK] is present but diff_files is unavailable
&lt;/span&gt;    &lt;span class="c1"&gt;# (Option B above cannot clear repos_committed_set), re-route to the no-op
&lt;/span&gt;    &lt;span class="c1"&gt;# path so the verified-no-op handling takes effect correctly.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_has_already_done_marker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ClaudeCodeStep: [PURPLE-NO-OP-VERIFIED-OK] marker present for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run=%s step=%s — re-routing to no-op path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;repos_committed_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repos_committed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding that branch is the moment I lost faith in the design. The shape guard, whose job was to be an &lt;em&gt;external observer&lt;/em&gt; of the agent's output, was now branching on the agent's &lt;em&gt;self-report&lt;/em&gt;. The architectural premise had been "we don't trust the agent to tell us if the work is done — we look at the diff." Trusting the marker for the no-op case was the exact kind of trust I'd set out to avoid by writing the guard in the first place. I was paying a maintenance tax on a thing that no longer had its original justification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The behavior-based alternative was already in the workflow
&lt;/h2&gt;

&lt;p&gt;Once I stepped back and asked what verify_commands actually catches, the answer was: &lt;strong&gt;everything the shape guard was trying to catch, plus more.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrote the test, didn't write the implementation: pytest fails on import error or assertion.&lt;/li&gt;
&lt;li&gt;Wrote the implementation, didn't write the migration: the app fails to start with a schema mismatch.&lt;/li&gt;
&lt;li&gt;Updated the function signature, didn't update the callers: mypy catches it. Or the test does.&lt;/li&gt;
&lt;li&gt;Wrote the route, didn't update the OpenAPI spec: the SDK regen step in verify_commands fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;run_tests.py&lt;/code&gt; step is small in the current code. The interesting bits are the variable interpolation, the merging of project-level defaults with task-specific verify, and the subshell wrapping that prevents &lt;code&gt;cd&lt;/code&gt; leakage between chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/infrastructure/workflow/steps/run_tests.py (current)
&lt;/span&gt;&lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npm run test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_command&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;KeyError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;span class="c1"&gt;# Build final command: prepend project-level defaults before task-specific commands.
# Each chunk runs inside its own subshell `(...)` so a `cd dir &amp;amp;&amp;amp; ...` step
# in the project default does not leak its pwd into the task-level verify and
# cause the next `cd dir` to fail. The agent service runs the final string
# under /bin/sh (dash on Debian/Ubuntu), where a failed `cd` returns exit 2 —
# which RunTestsStep would otherwise surface as the misleading
# "Tests failed (exit code: 2)".
&lt;/span&gt;&lt;span class="n"&gt;default_verify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_verify_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_verify&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_commands&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;default_verify&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;task_verify&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;default_verify&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_verify&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;test_command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details worth flagging: the subshell wrapping and the &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; chaining. The subshell exists because dash (the default &lt;code&gt;/bin/sh&lt;/code&gt; on Debian/Ubuntu) propagates &lt;code&gt;cd&lt;/code&gt; failures as exit 2, which used to surface as a confusing "Tests failed (exit code: 2)" error that had nothing to do with the actual test outcome. Wrapping each chunk in &lt;code&gt;( ... )&lt;/code&gt; keeps &lt;code&gt;cd&lt;/code&gt; scoped to its own chunk. The &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; chaining means project defaults run first and task verify only runs if defaults pass — the right shape for "format → lint → typecheck → test."&lt;/p&gt;

&lt;p&gt;When verify fails, the surfaced error includes the tail of the captured output, not just the bare exit code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;
&lt;span class="n"&gt;tail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;_ERROR_TAIL_BYTES&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tests failed (exit code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;). Last output:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tail&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tests failed (exit code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;); agent reported no output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;StepResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_step_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;previous_step_output&lt;/code&gt; is the variable &lt;code&gt;fix_verify&lt;/code&gt;'s &lt;code&gt;prompt_override&lt;/code&gt; interpolates. When verify fails, the next claude_code step receives the actual stderr tail in its prompt: "Verification failed. Fix the issues. Output: ." Agent reads what failed, edits, workflow loops back to verify, runs again. Two retries before giving up.&lt;/p&gt;

&lt;p&gt;The shape guard was checking "did the diff include this filename." Verify is checking "does the codebase, after the agent's commits, satisfy the executable contract." The latter is strictly stronger. Anything shape caught, verify catches &lt;em&gt;and&lt;/em&gt; hands the agent something concrete to fix in the next loop. Anything shape missed — outcome-style tickets, agent edits to unrelated files, lockfile drift — verify handles correctly because it doesn't care what the diff looks like, only whether the code passes.&lt;/p&gt;

&lt;p&gt;The single case where verify is weaker is "the ticket has no &lt;code&gt;verify_commands&lt;/code&gt; written," and the right response to that is to make &lt;code&gt;verify_commands&lt;/code&gt; a required ticket field, not to maintain a 130-line regex module as a fallback safety net.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was removed, what was kept
&lt;/h2&gt;

&lt;p&gt;The deletion commit is &lt;code&gt;911005fd&lt;/code&gt;, footprint across five files: 39 lines added, 742 deleted. Net 703 lines gone. Around 280 unit tests deleted — regex variants for &lt;code&gt;_extract_expected_files&lt;/code&gt;, heading-scope tests for &lt;code&gt;_strip_excluded_sections&lt;/code&gt;, suffix-match patterns for &lt;code&gt;_diff_covers_expected&lt;/code&gt;, two integration tests in &lt;code&gt;test_claude_code_step.py&lt;/code&gt; that exercised the guard's fail and pass paths.&lt;/p&gt;

&lt;p&gt;What survived in &lt;code&gt;expected_files.py&lt;/code&gt; is just the marker parser, with a docstring rewrite to reflect that it's no longer a fallback for shape detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/src/utils/expected_files.py (current, post-cleanup)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Modified-files marker parsing utilities (used for no-op diagnostics).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NOOP_VERIFIED_MARKER&lt;/code&gt; and &lt;code&gt;_has_already_done_marker&lt;/code&gt; also stayed. They let the agent self-report a &lt;em&gt;no-op&lt;/em&gt; in a way that's safely scoped: the worst case of trusting the marker incorrectly is that we treat a real-work run as a no-op, catchable by other guards downstream and by verify itself if there's something to verify. Self-reports are fine when the trust scope is fail-safe — when the failure mode of trusting wrongly is benign. They are not fine when the trust scope is "this guard runs partial-implementation detection," because then the failure mode of trusting wrongly is letting partial implementations through.&lt;/p&gt;

&lt;p&gt;The residue I'd put in front of anyone considering an agent self-report marker: &lt;strong&gt;what's the worst thing that happens if the agent emits the marker dishonestly or by mistake?&lt;/strong&gt; If the answer is "the workflow takes a safer path than it would have," ship it. If the answer is "the workflow skips a check that was supposed to catch real bugs," don't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The general pattern: shape vs behavior validation of AI output
&lt;/h2&gt;

&lt;p&gt;Every shape check is a proxy for some behavior check you actually care about. "The diff contains the filename" is a proxy for "the agent did the work the ticket described." "The output ends with &lt;code&gt;[DONE]&lt;/code&gt;" is a proxy for "the agent reached a clean stopping condition." "The PR title matches &lt;code&gt;feat:|fix:|chore:&lt;/code&gt;" is a proxy for "the change is conventionally classified."&lt;/p&gt;

&lt;p&gt;Proxies are cheap and look like good engineering when you write them — fast, deterministic, no LLM in the loop, easy to test. But the moment you start patching them against false positives, you are in a race against the unbounded variety of human and agent expression, and you will lose. Each patch shrinks the proxy's scope while expanding its surface area. Eventually you're maintaining hundreds of lines that catch a smaller set of bugs than the underlying behavior check would catch on its own.&lt;/p&gt;

&lt;p&gt;The lesson: &lt;strong&gt;if a behavior check exists and runs in your pipeline already, do not put a shape proxy in front of it.&lt;/strong&gt; The shape proxy gives you false confidence ("the cheap check passed, so the run is probably fine") and false alarms ("the cheap check failed, so the run is probably bad") in roughly equal measure, while the behavior check has already done a strictly more accurate job. The proxy's only honest selling point is latency — fail fast before the expensive thing runs. For AI agent output that selling point almost never pays off, because the proxies are too leaky to safely fail-fast on.&lt;/p&gt;

&lt;p&gt;If you find yourself naming a function &lt;code&gt;_strip_excluded_sections&lt;/code&gt; on day three of building a check, take that as the signal that the check itself is in trouble.&lt;/p&gt;

&lt;h2&gt;
  
  
  A side note worth keeping
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;[PURPLE-MODIFIED-FILES]&lt;/code&gt; marker is still useful for &lt;em&gt;diagnostics&lt;/em&gt; even though we no longer use it as input to a guard. When a run does something surprising, having the agent's own list of files-it-thinks-it-touched in the captured output saves time during post-mortems. The pattern I now look for: code that's bad as a control input can be fine as an observability input, and the right move is to keep the data and delete the control logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you've shipped a similar diff-shape guard for AI agent output and later stripped it back out, I'd love to hear what tipped you over. My suspicion is that the false-positive curve looks roughly the same shape for everyone — early easy wins, then a plateau where every patch shrinks the cohort of tickets the guard protects, until the cohort is small enough that the maintenance cost dominates. The version of this question I keep asking is: is there a shape check that doesn't have this maintenance arc, or is shape validation of unstructured agent output just structurally bad and I should stop reaching for it?&lt;/p&gt;

&lt;p&gt;The marker that &lt;em&gt;did&lt;/em&gt; survive — &lt;code&gt;[PURPLE-NO-OP-VERIFIED-OK]&lt;/code&gt; — survived because its trust scope is narrow and fail-safe. If you've shipped agent self-report markers with a wider scope that didn't decay into the same tar pit, I'd want to hear how.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>3 MCP server failure modes that bit us in production, and how we ship around them</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Sat, 02 May 2026 02:54:28 +0000</pubDate>
      <link>https://forem.com/zoetaka38/3-mcp-server-failure-modes-that-bit-us-in-production-and-how-we-ship-around-them-4fc9</link>
      <guid>https://forem.com/zoetaka38/3-mcp-server-failure-modes-that-bit-us-in-production-and-how-we-ship-around-them-4fc9</guid>
      <description>&lt;p&gt;MCP feels easy until it isn't. The first time you wire up a stdio server and call a tool from a Claude Agent SDK loop, the whole thing fits on a slide. Then you put it in front of customer codebases, customer GitHub credentials, customer build containers, and the sharp edges show up in places the spec is silent on. Tools start shadowing each other. The agent confidently uses a built-in &lt;code&gt;Read&lt;/code&gt; when you wanted it to go through your sandboxed file server. Environment variables you set on the parent process reappear inside the MCP child and become tokens-in-prompts.&lt;/p&gt;

&lt;p&gt;I'm building a SaaS that uses MCP heavily across a few different services (Codens, an AI dev harness with several specialized agents — happy to talk about it but it isn't the point of this post). Across those services we have GitHub MCPs for repo reads, an in-process Playwright MCP for browser exploration, and per-workspace local-file MCPs that let an agent navigate a cloned repo without escaping the sandbox. Three failure modes have shown up enough to rewrite the way we build this stuff.&lt;/p&gt;

&lt;p&gt;Below is each one with the code we actually ship, not a hypothetical fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 1: built-in tools shadow your MCP tools, silently
&lt;/h2&gt;

&lt;p&gt;The first time it bit us was in the test-case generation flow. The agent was supposed to be reading files only through a workspace-scoped MCP server — a thin wrapper that resolves paths, refuses to escape the workspace root, and truncates large files. Standard sandbox stuff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read file content from local workspace. Provide the relative file path.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="n"&gt;full_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workspace_path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;full_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;full_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;relative_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workspace_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access denied: path outside workspace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;# ... read + truncate ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;relative_to&lt;/code&gt; check is the whole point of the wrapper. The agent should not be able to read &lt;code&gt;/etc/passwd&lt;/code&gt;, or &lt;code&gt;/app/...&lt;/code&gt; (where our backend code lives in the container), or anything else outside the cloned customer repo. The MCP server enforces the boundary; the agent only ever sees what the tool returns.&lt;/p&gt;

&lt;p&gt;The bug: the Claude Agent SDK ships built-in tools — &lt;code&gt;Read&lt;/code&gt;, &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Edit&lt;/code&gt;, &lt;code&gt;Glob&lt;/code&gt;, &lt;code&gt;Grep&lt;/code&gt;, &lt;code&gt;Bash&lt;/code&gt; — and unless you tell it otherwise, those built-ins are also available. The agent had two ways to read a file: the sandboxed &lt;code&gt;mcp__local__read_file&lt;/code&gt;, and the unsandboxed built-in &lt;code&gt;Read&lt;/code&gt;. We passed &lt;code&gt;allowed_tools&lt;/code&gt; listing the MCP tools by full name and assumed that was an allowlist. It wasn't tight enough — built-ins kept being chosen.&lt;/p&gt;

&lt;p&gt;The first sign was test cases that referenced files the agent couldn't possibly have seen if it was only reading from the customer's workspace. It was reading our own backend code from &lt;code&gt;/app/&lt;/code&gt; because the container Python path was sitting right there, and the built-in &lt;code&gt;Read&lt;/code&gt; tool happily walked into it.&lt;/p&gt;

&lt;p&gt;The fix is two-part, and both parts matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cwd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_workspace_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_mcp_server&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__list_directory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__get_file_structure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__get_multiple_files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;disallowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Edit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Glob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MultiEdit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NotebookEdit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;allowed_tools&lt;/code&gt; is necessary but not sufficient. The agent treats it as "these are allowed", not as "these are the only tools that exist." If a built-in is available and not explicitly in &lt;code&gt;disallowed_tools&lt;/code&gt;, it stays in the toolbox. We also reinforce the boundary in the system prompt with text the model actually pays attention to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== CRITICAL FILE ACCESS RULES ===
1. You MUST ONLY use these MCP tools to access files:
   - mcp__local__read_file
   - mcp__local__list_directory
   - mcp__local__get_file_structure
   - mcp__local__get_multiple_files
2. NEVER use built-in tools like Read, Glob, Bash, or Grep
3. NEVER read files from /app/ - that is a different application
4. ALL file paths you read should be relative to the workspace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note point 3 — explicit-path negation. We learned that one the hard way. The model has a tendency to "explore" by trying common system paths if it's confused about the workspace. Telling it &lt;code&gt;NEVER read files from /app/&lt;/code&gt; (the container path where our service actually runs) is dumb-looking but absolutely necessary to stop the model from leaking implementation details about the host into its analysis output.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;cwd=str(self._workspace_path)&lt;/code&gt; is the third leg. Without it, when a built-in &lt;em&gt;did&lt;/em&gt; slip through during early debugging, it ran with the SDK's default working directory, which made the boundary violation worse. With &lt;code&gt;cwd&lt;/code&gt; set, even an accidental tool use is at least scoped to the right tree.&lt;/p&gt;

&lt;p&gt;If I were starting again, I'd treat the agent's tool list as a default-deny zone from line one. The mental model "built-ins exist; my MCP tools are added on top" is wrong for any production agent that needs sandboxing. Default-deny everything, then pass &lt;code&gt;allowed_tools&lt;/code&gt; listing exactly the MCP tool names you want, then triple-belt-and-suspenders by enumerating the built-ins in &lt;code&gt;disallowed_tools&lt;/code&gt; so a future SDK version that adds a new built-in (&lt;code&gt;Read2&lt;/code&gt;, &lt;code&gt;Search&lt;/code&gt;, etc.) doesn't quietly enable it for you. The cost of the extra few lines is nothing. The cost of an agent reading the wrong filesystem in production is everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 2: tool name collisions across MCP servers, with no useful error
&lt;/h2&gt;

&lt;p&gt;Once you have more than one MCP server in the same agent loop, you find out fast that "the same tool name in different servers" is a thing. Our regression test generator runs two MCPs simultaneously — a &lt;code&gt;local&lt;/code&gt; server for code analysis and a &lt;code&gt;playwright&lt;/code&gt; server for browser interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;playwright&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;playwright_server&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;allowed_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_get_links&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_wait&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_get_visited_pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__playwright__browser_console_messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_mcp_server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_local_mcp_server&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__list_directory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp__local__get_file_structure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_servers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the prefixing convention: every tool name is &lt;code&gt;mcp__&amp;lt;server&amp;gt;__&amp;lt;tool&amp;gt;&lt;/code&gt;. That's not cosmetic. It is the difference between "the agent knows which server to call" and "two &lt;code&gt;read_file&lt;/code&gt; tools shadow each other and you find out at runtime which one it picks."&lt;/p&gt;

&lt;p&gt;The collision case for us was real. Across our services, several different MCP servers each define a &lt;code&gt;read_file&lt;/code&gt; tool — one for the workspace, one for GitHub repo reads, one inside the GitHub-tools server we use for fix generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# github-tools server: reads from a remote repo via the GitHub API
&lt;/span&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read file content from GitHub repository&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;github_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_content&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# local server: reads from the cloned workspace
&lt;/span&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read file content from local workspace. Provide the relative file path.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_file_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;full_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workspace_path&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same name, different schemas, different intent. If both servers are mounted into the same agent loop and you list the tools as bare &lt;code&gt;read_file&lt;/code&gt;, the SDK has to pick one, and which one it picks is not the kind of thing you want the LLM's prompt-time context to silently determine. Even worse, the agent will eventually try to call the &lt;em&gt;other&lt;/em&gt; one and pass the wrong arg shape — &lt;code&gt;{"owner": ..., "repo": ..., "file_path": ...}&lt;/code&gt; against the local server's &lt;code&gt;{"file_path": ...}&lt;/code&gt; schema — and you get either a tool error the agent retries past, or, worse, an opaque "tool not found" that the agent reasons around by giving up on file access entirely.&lt;/p&gt;

&lt;p&gt;The fix is to never refer to MCP tools by bare name. Always use &lt;code&gt;mcp__&amp;lt;server&amp;gt;__&amp;lt;tool&amp;gt;&lt;/code&gt;. The system prompt gets paranoid too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ONLY use mcp__local__ tools. Start with get_file_structure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use BOTH mcp__local__ (code analysis) and mcp__playwright__ (browser) tools.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leading &lt;code&gt;mcp__local__&lt;/code&gt; prefix shows up in the system prompt because the model picks tools partly from the prompt's pattern matching. Saying "use the read_file tool" leaves it ambiguous; saying "use mcp_&lt;em&gt;local&lt;/em&gt;_read_file" tells it both the server and the operation, and the model is much less likely to pick a built-in or the wrong server's tool.&lt;/p&gt;

&lt;p&gt;There's a deeper point here about how you scope a MCP server. We deliberately scope by &lt;em&gt;concern&lt;/em&gt;, not by &lt;em&gt;protocol&lt;/em&gt;: there's a &lt;code&gt;github&lt;/code&gt; server (operations against the GitHub API), a &lt;code&gt;local&lt;/code&gt; server (operations against a cloned workspace), and a &lt;code&gt;playwright&lt;/code&gt; server (operations against a live browser). Within each concern, tool names are short and don't repeat across concerns. We do not have one mega-server with 30 tools because at that point the system prompt has to disambiguate which tool to use within the same namespace, and that's a much harder prompt-engineering problem than just naming the servers right.&lt;/p&gt;

&lt;p&gt;If I were starting again: pick a strict naming convention for your MCP servers (we use lowercase singular nouns: &lt;code&gt;local&lt;/code&gt;, &lt;code&gt;github&lt;/code&gt;, &lt;code&gt;playwright&lt;/code&gt;, &lt;code&gt;appium&lt;/code&gt;) and a strict prefix for every tool reference, both in &lt;code&gt;allowed_tools&lt;/code&gt; and in the system prompt. Treat any unprefixed tool name in your code or prompts as a code smell. The 10 minutes you spend writing a small lint rule that flags &lt;code&gt;read_file&lt;/code&gt; without &lt;code&gt;mcp__&amp;lt;server&amp;gt;__&lt;/code&gt; will save you hours of debugging "why did the agent call the wrong server" in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure 3: the parent process's environment bleeds into the MCP server
&lt;/h2&gt;

&lt;p&gt;This one is the most interesting because the bug isn't in the MCP transport — it's in how MCP server subprocesses (and even in-process tools that hit external APIs) inherit the parent's process environment.&lt;/p&gt;

&lt;p&gt;Concretely: our automatic-fix flow runs against customer repositories. Some customers configure their own AWS credentials so the fix-generation agent can run their tests in a sandbox. Those credentials get loaded by the use case and passed through to the SDK as extra env vars:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# generate_fix_use_case.py
&lt;/span&gt;&lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credential_set_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extra_env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_credential_use_case&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_env_vars_for_project&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_DEFAULT_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;extra_env&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;default_profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_profile&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Injecting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; extra env vars for project &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far so good. The agent needs &lt;code&gt;AWS_*&lt;/code&gt; to talk to the customer's AWS, and we pass them in via &lt;code&gt;extra_env&lt;/code&gt;. The bug is what happens when this collides with our own &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The naive way to merge extra env into the agent's environment — and the way we wrote it the first time — looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG. extra_env can overwrite ANTHROPIC_API_KEY.
&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_env&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spread order matters. With &lt;code&gt;extra_env&lt;/code&gt; last, anything in &lt;code&gt;extra_env&lt;/code&gt; overrides keys we set first. If a customer's credentials bundle happens to contain an &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (for whatever reason — they were experimenting, they have an internal Bedrock+Anthropic setup, they pasted the wrong env block into the credentials UI), that key now silently replaces ours in the agent's process. We'd be billing the agent's work to the wrong key, the customer would be billing the agent's work to &lt;em&gt;their&lt;/em&gt; key, or — depending on what the value actually was — the SDK would fail authentication entirely and we'd see "401" errors that look like quota issues but aren't.&lt;/p&gt;

&lt;p&gt;The fix is one line, but only because we now know it's a one-line fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create agent options — extra_env is merged first so ANTHROPIC_API_KEY cannot
# be overridden by user-supplied credentials
&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_env&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;extra_env&lt;/code&gt; first, our key last. Now the customer's env can't shadow our key, no matter what they put in their credentials bundle. The comment on top of every one of these blocks (we have four of them across &lt;code&gt;analyze_error&lt;/code&gt;, &lt;code&gt;generate_fix&lt;/code&gt;, &lt;code&gt;improve_fix_with_test_feedback&lt;/code&gt;, and &lt;code&gt;analyze_and_fix_error_with_agent&lt;/code&gt;) is verbatim the same text, because if anyone ever "cleans up" the merge order in a future refactor, they need a giant flag in the diff that says no, this is load-bearing.&lt;/p&gt;

&lt;p&gt;There's a wider principle here that's worth saying directly: &lt;strong&gt;anything you let into an MCP child process's environment is data you've handed to a place the LLM can read indirectly&lt;/strong&gt;. If the MCP server prints diagnostics with env vars in them, those go to logs the model can see in some configurations. If the MCP server reads &lt;code&gt;os.environ&lt;/code&gt; to pick up credentials and accidentally reflects them in an error message, those land in the tool result the model receives. Treat the env passed to an MCP server like you'd treat the request body of an internal API: minimum scope, validated keys, no secrets you wouldn't want the model to be aware of.&lt;/p&gt;

&lt;p&gt;The other half of this fix is logging discipline. When &lt;code&gt;extra_env&lt;/code&gt; shows up, we log only its size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Injecting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; extra env vars for project &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not the keys. Not the values. A count. If a customer reports "the agent didn't see my AWS creds," the answer is in the count and a separate audit log of which credential set was selected, never in a debug dump of the env block itself. The same log line that says "we injected 4 vars" tells you the wiring is working without putting the secrets where any future debugger can grab them.&lt;/p&gt;

&lt;p&gt;If I were starting again: I would build a &lt;code&gt;build_env(*sources, base)&lt;/code&gt; helper from day one, with &lt;code&gt;base&lt;/code&gt; always last in the spread and a hardcoded list of "system-owned" keys (the Anthropic key, our own internal service tokens) that simply cannot be overridden no matter what comes in &lt;code&gt;*sources&lt;/code&gt;. The four spread-order copies in our codebase all do the same thing; they would do it once if I'd written a helper. They don't, because each one was added when its calling use case was added, and the merge-order discipline was learned at a time when the helper would have meant a refactor I didn't have time for. Every new place you spread env into an SDK options object is a place you can get the order wrong. Centralize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What these three failure modes have in common
&lt;/h2&gt;

&lt;p&gt;The failure modes all live at the same seam: &lt;strong&gt;the boundary between the agent process and the things the agent process delegates to&lt;/strong&gt;. With built-in shadowing, the agent has access to a tool you didn't want it to have because you didn't enumerate the disallowed set. With name collisions, the agent has access to two tools by the same name and resolves which one ambiguously. With env bleeding, the agent's children inherit context (credentials, paths, caches) that the parent didn't deliberately decide to pass through.&lt;/p&gt;

&lt;p&gt;In each case the spec was technically silent. There's nothing in the MCP spec that says "you must shadow built-ins by listing them in disallowed_tools" because the SDK is the one that exposes built-ins, not MCP. There's nothing that says "always prefix tool names" because the MCP spec is per-server; the prefixing convention comes from how you wire multiple servers into one agent. Env inheritance is a Unix process detail that has nothing to do with MCP. But all three behave like MCP problems because they're invisible until the agent is running and you can't reason backwards from a "tool not found" or a "wrong API key" message to which seam was leaky.&lt;/p&gt;

&lt;p&gt;The mental model that helped most was: &lt;strong&gt;the MCP server lifecycle is its own concern, not a free side effect of the agent process lifecycle&lt;/strong&gt;. Just because your agent process was started with the right environment, the right tool list, and the right working directory doesn't mean the MCP children are running with that same shape. Every option you set on &lt;code&gt;ClaudeAgentOptions&lt;/code&gt; — &lt;code&gt;env&lt;/code&gt;, &lt;code&gt;cwd&lt;/code&gt;, &lt;code&gt;allowed_tools&lt;/code&gt;, &lt;code&gt;disallowed_tools&lt;/code&gt;, &lt;code&gt;mcp_servers&lt;/code&gt;, &lt;code&gt;max_turns&lt;/code&gt; — has a "what does this look like from inside an MCP child" reading, and the production-correct answer is almost always more restrictive than the demo-correct answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you've shipped a multi-MCP-server agent into production, I'd love to hear which of these bit you and which ones I haven't found yet. Especially: how do you handle MCP servers that have legitimate reasons to need real environment variables (a Stripe MCP that genuinely needs &lt;code&gt;STRIPE_API_KEY&lt;/code&gt;, an internal-tools MCP that needs a service token), without giving up the "no parent env bleed" property? I have an answer that involves a small per-call env builder, but I'm half-convinced there's a cleaner pattern I haven't seen yet.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>python</category>
      <category>devops</category>
    </item>
    <item>
      <title>One SSE connection, N upstream agents — the stream relay we shipped</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Fri, 01 May 2026 08:22:34 +0000</pubDate>
      <link>https://forem.com/zoetaka38/one-sse-connection-n-upstream-agents-the-stream-relay-we-shipped-20a6</link>
      <guid>https://forem.com/zoetaka38/one-sse-connection-n-upstream-agents-the-stream-relay-we-shipped-20a6</guid>
      <description>&lt;p&gt;The pod that was tailing one of our long-running agents got rotated during a deploy. The client connection to the dashboard didn't disconnect — it sat there, EventSource happily believing the world was fine, with zero events coming through. The agent was still working. The events were still being emitted upstream. The middle box that was supposed to forward them to the browser was simply gone.&lt;/p&gt;

&lt;p&gt;That moment is the reason this post exists. If you're going to put a relay between N independent event-emitting backends and one client-facing SSE connection, you have to answer a specific question: &lt;strong&gt;when the relay process dies, what guarantees that another one picks up where it left off, and that the client doesn't notice?&lt;/strong&gt; A lot of "just proxy SSE" designs fall apart at exactly this seam.&lt;/p&gt;

&lt;p&gt;I'm building a SaaS that touches this problem (Codens, an AI dev harness with a few specialized agents — happy to talk about it but it isn't the point of this post). The relay I'll walk through fans in live telemetry from N agent jobs running on independent nodes into a single per-run aggregated stream the dashboard subscribes to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of the problem
&lt;/h2&gt;

&lt;p&gt;The setup, concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;N agent jobs run, each on its own host. Each exposes &lt;code&gt;GET /jobs/{job_id}/stream&lt;/code&gt; and emits events for one workflow run only.&lt;/li&gt;
&lt;li&gt;The dashboard wants a live feed for a specific run. It opens &lt;strong&gt;one&lt;/strong&gt; SSE connection to our API, not N.&lt;/li&gt;
&lt;li&gt;We have multiple instances of the relay process. Any can die at any time. Deploys rotate them constantly.&lt;/li&gt;
&lt;li&gt;Reconnects are constant — laptops sleep, networks blip, tabs reload.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The naive shape that almost works: each browser opens an SSE connection through one of our API pods, and that pod opens an SSE connection upstream to the agent. Two-hop proxy, no state. Here's why it breaks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Two browsers viewing the same run = two upstream connections.&lt;/strong&gt; Hits the agent's per-job stream twice. Some agents fight back; some let you do it but dedupe is your problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconnects = a new upstream connection too.&lt;/strong&gt; You either lose all events between disconnect and reconnect, or you require the upstream to support &lt;code&gt;Last-Event-ID&lt;/code&gt;. Many homegrown SSE servers don't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No durable buffer.&lt;/strong&gt; A client offline for 10 seconds loses those events. No tape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-pod independent tailers = N tailers per run with N pods.&lt;/strong&gt; Wasted upstream connections, no single source of truth for "what events have we seen for run X."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The shape we ended up with: exactly &lt;strong&gt;one&lt;/strong&gt; stream-relay pod owns the upstream connection for a given run at a time. That pod tails the agent's SSE, writes every event into a Redis Stream keyed by run, and the API layer reads from Redis Streams to serve clients. Many client SSE connections, one upstream connection, durable buffer in the middle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐      one connection per run     ┌─────────────────┐
│  Agent pod A │ ─────────────────────────────▶  │ stream-relay    │
└──────────────┘                                 │  pod (owner)    │
┌──────────────┐                                 │                 │
│  Agent pod B │ ─────────────────────────────▶  │ XADD events:{r} │
└──────────────┘                                 └────────┬────────┘
   ... N agent jobs ...                                   │
                                                          ▼
                                                 ┌─────────────────┐
                                                 │  Redis Streams  │
                                                 │  events:{run_id}│
                                                 │   (maxlen 10k)  │
                                                 └────────┬────────┘
                                                          │
                                  many client SSE conns   │
                              ◀───────────────────────────┘
                              (FastAPI sse_starlette pods)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting design decisions are all about the middle box: who tails, how takeover works, how the API layer reads from the buffer, what counts as "alive."&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateless relay, leased ownership
&lt;/h2&gt;

&lt;p&gt;The relay process is stateless. It boots, looks at the database for active workflow runs with an active job, and tries to acquire a per-run ownership lock in Redis. If it gets the lock, it spawns an asyncio task that opens the upstream SSE and tails it. If another relay instance got the lock first, it doesn't spawn a tailer. Every 5 seconds, it scans again.&lt;/p&gt;

&lt;p&gt;The main loop, abridged from &lt;code&gt;stream_relay/main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;POLL_INTERVAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;  &lt;span class="c1"&gt;# seconds between active-run scans
&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Refresh owned locks
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ownership&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh_all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Clean up finished tasks
&lt;/span&gt;    &lt;span class="n"&gt;finished&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;done&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;finished&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CancelledError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream-relay: tailer task error run_id={}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Discover new running workflow_runs
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;session_factory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;running_infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_find_running_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;active_run_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;running_infos&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Cancel tasks for runs that are no longer active
&lt;/span&gt;    &lt;span class="n"&gt;stale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;active_tasks&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;active_run_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rid&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stale&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rid&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Acquire + start new tasks
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;running_infos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ownership&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;acquired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;tail_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ownership&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;streams_client&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tail-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;active_tasks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POLL_INTERVAL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three pieces are doing real work. (1) The lock refresh runs first so we don't lose ownership while we're busy. (2) The DB scan is the source of truth for "what runs should be tailed right now"; the relay never trusts its own in-memory state. (3) Acquisition is best-effort — if someone else got there first, we move on.&lt;/p&gt;

&lt;p&gt;Stateless is deliberate. The relay doesn't keep a cursor file. There's no "what was I doing before I crashed" reconciliation. If a pod restarts, on its next 5-second tick it refreshes whatever locks it still holds (none, just booted), looks at active runs, and tries to acquire each. If the previous owner died, the lock has expired (60s TTL) and this pod takes over.&lt;/p&gt;

&lt;p&gt;The instance ID is computed once at boot — ECS task ID, falling back to hostname, falling back to UUID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_INSTANCE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STREAM_RELAY_INSTANCE_ID&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ECS_TASK_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gethostname&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because the ownership value stored in Redis is this instance ID. When checking "do I still own this lock," you compare the value, not just check existence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lock, in Redis, in nine lines
&lt;/h2&gt;

&lt;p&gt;Every run_id gets one Redis key with a 60-second TTL. The owner refreshes it every 20 seconds. If the owner dies, the lock expires within 60s and another pod can grab it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LOCK_TTL_SECS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="n"&gt;REFRESH_INTERVAL_SECS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="n"&gt;_KEY_PREFIX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LOCK_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;acquired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_owned&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if we already own it (e.g. after a crash + restart)
&lt;/span&gt;        &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_owned&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OwnershipManager.acquire failed run_id={}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SET key value NX EX 60&lt;/code&gt; is the entire trick. NX = only if not exists, EX = TTL in seconds. Atomic. The "already own it after a crash + restart" branch is the case where the hostname is stable across restarts and we want to recognize the lock as ours rather than wait out the TTL.&lt;/p&gt;

&lt;p&gt;Refresh is defensive too. It checks the value before extending TTL, so if someone else has taken over we don't blindly re-extend their lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_owner: lock lost for run_id={} (owner={} expected={})&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instance_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_owned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOCK_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_owned&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OwnershipManager.refresh failed run_id={}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# conservative: don't stop on Redis error
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;return True&lt;/code&gt; on Redis error is deliberate. If Redis is briefly unreachable, we keep tailing rather than abandon a working stream. Redis comes back, next refresh confirms or denies. The downside: a brief window where two pods could both think they own a run during a Redis flap, but they're both writing to the same &lt;code&gt;events:{run_id}&lt;/code&gt; stream — duplicates, not data loss. Duplicates the consumer can dedupe; data loss it can't recover.&lt;/p&gt;

&lt;p&gt;The 60s TTL / 20s refresh ratio gives a 3x safety factor. The pod refreshes three times before TTL would expire; even one missed refresh is fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pod rotation, in practice
&lt;/h2&gt;

&lt;p&gt;The flow when a relay pod is rotated mid-tail:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod A is tailing run X. Its &lt;code&gt;stream_owner:X&lt;/code&gt; Redis key has its instance ID and a 60s TTL, refreshed every 20s.&lt;/li&gt;
&lt;li&gt;Pod A receives SIGTERM (deploy, scale-down, ECS task draining).&lt;/li&gt;
&lt;li&gt;Inside &lt;code&gt;relay_main&lt;/code&gt;, the shutdown handler cancels all active tasks, awaits them, closes Redis.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tail_stream&lt;/code&gt;'s &lt;code&gt;finally&lt;/code&gt; block calls &lt;code&gt;ownership.release(run_id)&lt;/code&gt;, which deletes the Redis key — but only if it still belongs to this instance.&lt;/li&gt;
&lt;li&gt;Pod B, on its next 5-second poll, finds run X in the DB, tries to acquire, gets it, spawns a tailer.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OwnershipManager.release failed run_id={}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_owned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bad case — the one that brought me to this design — is a pod dying &lt;em&gt;without&lt;/em&gt; running its shutdown handler. SIGKILL after a stuck SIGTERM. OOM. Network partition. The key still exists with its TTL ticking down; within 60 seconds it expires and Pod B's next poll picks it up.&lt;/p&gt;

&lt;p&gt;Worst-case takeover: ~60s for TTL + up to 5s for the next poll = ~65s of no events relayed. The dashboard sees a gap. The events emitted during that gap &lt;em&gt;aren't lost&lt;/em&gt; — they're still buffered in the upstream agent's own per-job queue, and the new tailer reads them on reconnect. (Whether that holds depends on your upstream; ours has a per-job in-memory replay buffer.)&lt;/p&gt;

&lt;p&gt;We picked the boring 60/20 numbers because they're easy to reason about and a one-minute gap during deploys is tolerable. Shorter TTLs halve worst-case latency at the cost of more Redis traffic. The knobs are there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tailer: one upstream connection per run
&lt;/h2&gt;

&lt;p&gt;The tailer is the asyncio task spawned for each owned run. It opens an httpx streaming GET to the agent's SSE endpoint and parses events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tail_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aioredis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ownership&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OwnershipManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;streams_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RedisStreamsClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/jobs/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_build_headers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# --- ownership check (before reconnect) ---
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ownership&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_owned_by_us&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1800.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;15.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

                    &lt;span class="n"&gt;check_deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;CHECK_INTERVAL_SECS&lt;/span&gt;

                    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_parse_sse_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;streams_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="c1"&gt;# ... heartbeats + terminal-event handling ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details that matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The httpx client is shared across all tailers in the same pod.&lt;/strong&gt; We have a single &lt;code&gt;httpx.AsyncClient&lt;/code&gt; and pass it into every tailer task. If you create one client per tailer, you blow your file descriptor budget at ~100 concurrent runs and lose connection pooling per host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;http_limits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Limits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_keepalive_connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_connections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keepalive_expiry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;http_limits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# all tailers share `http`
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The read timeout is 30 minutes.&lt;/strong&gt; SSE streams can be quiet for a long time and that's normal. Set the read timeout to something sensible-sounding like 30 seconds and you get spurious &lt;code&gt;ReadTimeout&lt;/code&gt; exceptions during legitimate slow phases of the agent's work. Learned this the hard way.&lt;/p&gt;

&lt;p&gt;Reconnect logic is unsurprising — exponential backoff on connection/read errors, give up after 5 consecutive failures, release the lock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadTimeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RemoteProtocolError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_CONSECUTIVE_ERRORS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;

&lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;consecutive_errors&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5s, 10s, 20s, 40s, capped at 60s. After 5 failures we exit; the main loop's next poll re-acquires and starts fresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two heartbeats: liveness vs forward progress
&lt;/h2&gt;

&lt;p&gt;You'd think one heartbeat is enough. It isn't. The bug: an upstream agent stops making real progress (stuck on a tool call, the underlying LLM silently retrying internally) but its SSE stream still emits &lt;code&gt;keepalive&lt;/code&gt; / &lt;code&gt;ping&lt;/code&gt; events every 15 seconds because the agent's web framework wants the connection warm. If your stall detector reads a single "is the SSE alive" heartbeat, it sees green forever. The job is stuck and you don't know.&lt;/p&gt;

&lt;p&gt;The fix is two heartbeats, read by different consumers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LIVENESS_TTL_SECS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;        &lt;span class="c1"&gt;# 5 minutes
&lt;/span&gt;&lt;span class="n"&gt;RUN_PROGRESS_TTL_SECS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;   &lt;span class="c1"&gt;# 60 minutes
&lt;/span&gt;
&lt;span class="c1"&gt;# SSE event types that indicate Claude is making forward progress.
&lt;/span&gt;&lt;span class="n"&gt;_FORWARD_PROGRESS_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;_parse_sse_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;streams_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;now_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# Liveness heartbeat: any SSE frame proves the connection is alive.
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;liveness_heartbeat:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LIVENESS_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;now_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Forward-progress heartbeat: only real work advances this.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_is_forward_progress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forward_progress_heartbeat:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;RUN_PROGRESS_TTL_SECS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;now_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;liveness_heartbeat:{run_id}&lt;/code&gt; advances on &lt;strong&gt;any&lt;/strong&gt; SSE event including pings. It answers "is the upstream connection alive at all?"&lt;/p&gt;

&lt;p&gt;&lt;code&gt;forward_progress_heartbeat:{run_id}&lt;/code&gt; advances only on event types that mean the agent emitted something it worked for — &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt;, &lt;code&gt;tool_result&lt;/code&gt;. It answers "is the agent doing useful work, or idling?" If this is stale but liveness is fresh, the connection is fine but the agent is stalled.&lt;/p&gt;

&lt;p&gt;A separate orphan-detection job reads the forward-progress key, not the liveness key. Mix them up and you get false positives during legitimate slow-thinking phases or false negatives during real stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The buffer: Redis Streams with a hard cap
&lt;/h2&gt;

&lt;p&gt;Every event the tailer parses is appended to a Redis Stream keyed by run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisStreamsClient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;aioredis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_prefix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_STREAM_MAXLEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 10_000
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;entry_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
                &lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;approximate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;entry_id&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RedisStreamsClient.xadd failed run_id={}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;maxlen=10_000, approximate=True&lt;/code&gt; is the load-bearing flag. Without &lt;code&gt;approximate=True&lt;/code&gt;, Redis trims exactly to 10k on every XADD — O(n) walk. With it, Redis trims at radix-tree-node boundaries: slightly fuzzy but vastly faster. Streams stabilize around 10k–11k entries, which is fine.&lt;/p&gt;

&lt;p&gt;10k events per run is roughly half an hour of busy work for our agents. A client reconnecting within that window gets the entire history replayed from Redis. Beyond that, old events fall off; persist them separately if you need them long-term.&lt;/p&gt;

&lt;p&gt;We use Streams (&lt;code&gt;XADD&lt;/code&gt; / &lt;code&gt;XREAD&lt;/code&gt; / &lt;code&gt;XREADGROUP&lt;/code&gt;) and not just Pub/Sub precisely for this durability. Pub/Sub is fire-and-forget — no subscriber, event gone. Streams keep the bytes around so a client reconnecting after a blip replays from where it left off.&lt;/p&gt;

&lt;p&gt;There's a layer above this in our codebase that bridges the buffered events into the per-client SSE response, including a history replay on connect so a late joiner sees prior events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_subscribe_to_channel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncGenerator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aioredis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REDIS_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pubsub&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pubsub&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Subscribe BEFORE fetching history so we don't miss events published
&lt;/span&gt;        &lt;span class="c1"&gt;# between the LRANGE and the first get_message() call.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;pubsub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;history_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sse:history:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;history_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lrange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;last_replayed_seq&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event_json&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_seq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;last_replayed_seq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event_json&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The order matters: &lt;strong&gt;subscribe before reading history.&lt;/strong&gt; Otherwise events published between LRANGE and the first &lt;code&gt;get_message()&lt;/code&gt; disappear. Live messages are then deduplicated against &lt;code&gt;last_replayed_seq&lt;/code&gt; — every event carries a monotonic &lt;code&gt;event_seq&lt;/code&gt; and the loop drops anything whose seq was already replayed. That seq is what gives the client a continuous-looking stream across pod takeovers, even if a reconnect briefly lands on a different API pod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backpressure on slow clients
&lt;/h2&gt;

&lt;p&gt;The trap with SSE fan-out is that a single slow client (hotel wifi, throttled background tab, dev who left a curl pipe running for two days) piles up unsent events in the per-connection buffer.&lt;/p&gt;

&lt;p&gt;The choices that protect the rest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The relay never blocks on a client.&lt;/strong&gt; It only writes to Redis. A slow client is the API pod's problem, not the relay's. The relay keeps tailing at full speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Redis Stream is bounded.&lt;/strong&gt; 10k entries cap. A reconnecting client gets at most 10k events; beyond that, the stream has aged out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-connection sse-starlette buffers are small.&lt;/strong&gt; Connections that fall too far behind get closed. We don't extend their grace period.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The single most important rule is the first: &lt;strong&gt;the upstream-tailing path is independent of any single client&lt;/strong&gt;. Many pods serve SSE to many clients all reading from the same Redis Streams buffer. A slow consumer only slows itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this is overkill
&lt;/h2&gt;

&lt;p&gt;If you have one service emitting events, one client per run, no deploys, a short session bound, and reconnects are rare — a single-pod direct SSE proxy with no Redis in the middle is fine. Ship that.&lt;/p&gt;

&lt;p&gt;The relay is justified by &lt;strong&gt;N agents × M clients × frequent deploys × long session lengths&lt;/strong&gt;. Take any of the four out and the cost-benefit shifts. The specific moment where this design pays off is the rotated-pod-mid-stream case. If that never happens for you, most of the ownership-lock complexity is theatre. Keep the Redis Streams buffer for reconnect-replay; drop the leased ownership.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd build differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A real &lt;code&gt;Last-Event-ID&lt;/code&gt; header path.&lt;/strong&gt; Reconnect today uses a history-list replay keyed off the channel. That's fine for our common case but it's not the SSE-spec way. Cleaner: client sends &lt;code&gt;Last-Event-ID&lt;/code&gt; on reconnect, the API seeks the Redis Stream to that ID. We will get there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fewer Redis round-trips per event.&lt;/strong&gt; Every event is one XADD plus a SETEX for liveness, sometimes two more SETEXes for forward progress. Three to four Redis calls per event. A pipelined batch would cut latency. We haven't needed it yet, but it'll show up the first time someone points a debugger at the inner loop.&lt;/p&gt;

&lt;p&gt;If you've shipped a multi-instance SSE relay, I'd love to hear which corner you cut. Especially: did you go with a leased single-tailer-per-run like this, or did you let every pod tail every run and dedupe at the consumer layer? I'm half-convinced the second design is simpler and I keep talking myself out of it.&lt;/p&gt;

</description>
      <category>sse</category>
      <category>redis</category>
      <category>fastapi</category>
      <category>python</category>
    </item>
    <item>
      <title>Why we built our own credit ledger instead of using Stripe metered billing</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Fri, 01 May 2026 07:08:30 +0000</pubDate>
      <link>https://forem.com/zoetaka38/why-we-built-our-own-credit-ledger-instead-of-using-stripe-metered-billing-4ke5</link>
      <guid>https://forem.com/zoetaka38/why-we-built-our-own-credit-ledger-instead-of-using-stripe-metered-billing-4ke5</guid>
      <description>&lt;p&gt;A few months in, an agent I was running reserved about 1,000 credits' worth of tokens for a long-running task, then crashed mid-flight. The credits were sitting in a "reserved" bucket on the org. The user's available balance had gone down. The work hadn't actually happened.&lt;/p&gt;

&lt;p&gt;If your billing layer is Stripe metered, you have an awkward conversation with yourself at this point. &lt;code&gt;stripe.SubscriptionItem.create_usage_record(...)&lt;/code&gt; doesn't have a &lt;code&gt;quantity=-1000&lt;/code&gt; "actually never mind" path that respects your prior idempotency keys. Negative quantities exist, but they aren't the same primitive as a release; they show up as a separate usage record and they don't cancel the original reservation, because there was no reservation — Stripe metered doesn't have one. The model is: emit a usage event, it accrues, you bill at period end.&lt;/p&gt;

&lt;p&gt;That moment is when I started writing my own credit ledger.&lt;/p&gt;

&lt;p&gt;I'm building a SaaS that touches this problem (Codens, an AI dev harness with a few specialized agents that share an org-wide credit pool — happy to talk about it but it's not the point of this post). The point of this post is the architectural reason why Stripe metered, which is genuinely great at what it's designed for, doesn't fit a multi-product credit-pool product, and what the homegrown alternative actually looks like.&lt;/p&gt;

&lt;p&gt;I'll keep this builder-to-builder. No pricing math, no markup talk, just the primitives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Stripe metered is great at
&lt;/h2&gt;

&lt;p&gt;Stripe metered is designed around a clean model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One product (or one subscription item per product).&lt;/li&gt;
&lt;li&gt;A customer is on a subscription.&lt;/li&gt;
&lt;li&gt;You report usage events as they happen.&lt;/li&gt;
&lt;li&gt;At the end of the billing period, Stripe converts events into an invoice.&lt;/li&gt;
&lt;li&gt;Pricing tiers (graduated, volume) are configured on the price object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you sell a single product priced per API call, per minute, per GB — this is the right answer. You should use it. You'll be done in a day, you'll get hosted invoices and dunning for free, and you'll never reimplement tax rates.&lt;/p&gt;

&lt;p&gt;The problems start the moment you violate any of those assumptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it stops being a good fit
&lt;/h2&gt;

&lt;p&gt;In my case, four assumptions were violated at once.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The pool is org-wide, not per-product
&lt;/h3&gt;

&lt;p&gt;I have multiple agents (PRD writing, error auto-fix, test gen, activity ledger, and a workflow harness) that all draw from one shared credit balance attached to the customer's organization. There is no "Subscription for the test-gen agent" — there's an organization, and an org-wide pool that any agent can pull from.&lt;/p&gt;

&lt;p&gt;You can model this in Stripe two ways and they both hurt. One subscription item per product means N usage streams to keep in sync, and the customer doesn't have a single balance — they have N usages that get reconciled at invoice time. One subscription item with a "credits" SKU means you've already started building a homegrown ledger; you've just put a thin Stripe wrapper on it.&lt;/p&gt;

&lt;p&gt;The org-wide entity is the unit I actually want to lock and update atomically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# billing-control-plane/backend/src/domain/entities/organization_credit.py
&lt;/span&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrganizationCredit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;
    &lt;span class="n"&gt;billing_paused&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;paused_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;available_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reserved_balance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two columns: &lt;code&gt;balance&lt;/code&gt; and &lt;code&gt;reserved_balance&lt;/code&gt;. &lt;code&gt;available_balance&lt;/code&gt; is just their difference. Every consume, reserve, refund, chargeback takes a &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; on this row before doing anything. There is exactly one row per organization, and it's the truth.&lt;/p&gt;

&lt;p&gt;You cannot get &lt;code&gt;SELECT FOR UPDATE&lt;/code&gt; semantics out of Stripe. Stripe's API is eventually consistent from your perspective; you submit a usage event, you trust their backend to process it, you read it back via the dashboard or their API. That's fine for billing-period accounting. It's not fine for "did this agent's request just bring the org under their available balance, yes or no, right now."&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Long-running tasks need reservations
&lt;/h3&gt;

&lt;p&gt;Several of my agent operations take 30 seconds to 30 minutes. The naive "consume after the fact" pattern blows up: two concurrent requests both pass the balance check, both run, the second one was always going to overspend, and now I owe somebody an apology.&lt;/p&gt;

&lt;p&gt;The fix is reservations. Before the work starts, reserve the credits. When the work finishes, consume the actual amount and release the rest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# reserve_credit_use_case.py — abridged
&lt;/span&gt;&lt;span class="n"&gt;credit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credit_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_by_auth_org_id_for_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;NotFound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;organization &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; not initialized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing_paused&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BillingPaused&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing paused: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;paused_reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;available_balance&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;InsufficientCredits&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;available=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;available_balance&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; required=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expires_in_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reservation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Reservation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reserved_by_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reservation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reservation_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# balance is NOT changed; only reserved_balance moves
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credit_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance_delta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance_delta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trick is that &lt;code&gt;balance&lt;/code&gt; doesn't move during a reservation. Only &lt;code&gt;reserved_balance&lt;/code&gt; does. So the org's total credit balance is unchanged ("the customer didn't lose money yet, the work hasn't been done"), but their &lt;code&gt;available_balance&lt;/code&gt; (&lt;code&gt;balance - reserved_balance&lt;/code&gt;) drops, which is what the next concurrent request sees. If a second agent tries to reserve more than what's left available, it fails before doing any work.&lt;/p&gt;

&lt;p&gt;Then on success the reservation gets consumed for the actual amount, and the difference goes back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# consume_reservation_use_case.py — abridged
&lt;/span&gt;&lt;span class="n"&gt;new_balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual_amount&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credit_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance_delta&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance_delta&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;credit_back&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actual_amount&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;balance&lt;/code&gt; drops by the actual consumed amount. &lt;code&gt;reserved_balance&lt;/code&gt; drops by the full reservation amount (so what got over-reserved is returned to available). One row update, atomic.&lt;/p&gt;

&lt;p&gt;If the agent crashed instead, the matching call is &lt;code&gt;release_reservation&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# release_reservation_use_case.py — abridged
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credit_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance_delta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance_delta&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;balance&lt;/code&gt; unchanged, &lt;code&gt;reserved_balance&lt;/code&gt; drops back. The credits are available again.&lt;/p&gt;

&lt;p&gt;There is no Stripe metered primitive for this. The closest you can build is a "we'll send you a usage event later, maybe" pattern, which means the customer's available balance is computed in your code anyway, which means you're keeping a parallel ledger anyway, which means... you've built this.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Idempotency on retries needs payload checking, not just key checking
&lt;/h3&gt;

&lt;p&gt;This is the one that bit me hardest in early prototypes. Every agent retries. If the network blips during a &lt;code&gt;reserve&lt;/code&gt; call, the agent retries with the same idempotency key. You cannot double-reserve.&lt;/p&gt;

&lt;p&gt;Stripe's &lt;code&gt;Idempotency-Key&lt;/code&gt; header gets you a long way for Stripe API calls. But the key alone isn't enough for a credit ledger — you also need to verify that the &lt;em&gt;payload&lt;/em&gt; is the same. If the agent reuses a key but with a different amount (almost always a bug), you should reject it loudly, not return the cached response for the wrong amount.&lt;/p&gt;

&lt;p&gt;That's what the idempotency manager looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# idempotency_manager.py — the core branching
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lease_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_LEASE_SECONDS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retention_days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_RETENTION_DAYS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IdempotencyOutcome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IdempotencyKey&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_for_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;UTC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload_hash&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;payload_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;IdempotencyConflict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idempotency_key=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; payload_hash mismatch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IdempotencyOutcome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REPLAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed_terminal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IdempotencyOutcome&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FAILED_REPLAY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# in_flight
&lt;/span&gt;        &lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;in_flight_lease_until&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;retry_after&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;lease&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;IdempotencyInFlight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retry_after_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retry_after&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four states matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Completed&lt;/strong&gt; with matching payload hash: return the cached response, do not touch balance. This is a successful retry of a prior success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Completed&lt;/strong&gt; with mismatched payload hash: throw. Same key, different payload is a bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-flight, lease still valid&lt;/strong&gt;: another worker is doing this right now. Tell the caller to retry after N seconds. This prevents two workers from racing on the same key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed terminal&lt;/strong&gt;: cached error response. This stops retry storms on permanently-bad requests (insufficient credits, malformed payload).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key observation: &lt;strong&gt;payload hash is computed by the use case and includes only the fields that should make the request unique.&lt;/strong&gt; For reserve, that's &lt;code&gt;{amount, expires_in_seconds}&lt;/code&gt;. For consume, that's &lt;code&gt;{service_id, operation_type, input_tokens, output_tokens, model_name}&lt;/code&gt;. If the caller resubmits with different content but the same key, the lock catches it before any balance changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_payload_hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic SHA-256 hash of the canonical JSON payload.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The whole &lt;code&gt;acquire → mutate balance → complete&lt;/code&gt; sequence runs inside one DB transaction. If the use case crashes between &lt;code&gt;acquire&lt;/code&gt; and &lt;code&gt;complete&lt;/code&gt;, the row is left &lt;code&gt;in_flight&lt;/code&gt; with a lease. Lease expires, next attempt takes over. No double-spend.&lt;/p&gt;

&lt;p&gt;I genuinely cannot see how to express this in terms of Stripe usage records. They aren't the right primitive — they're append-only, conditionally cancellable, and the cancellation isn't tied to a payload hash.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Refunds at the credit level vs the invoice level
&lt;/h3&gt;

&lt;p&gt;In Stripe-metered land, a refund is an invoice-level concept. You issued an invoice for $X based on accumulated usage; you can refund some or all of $X. That's fine for "we billed you wrong, here's your money back."&lt;/p&gt;

&lt;p&gt;In credit-pool land, two different things happen and they're not the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real money refund.&lt;/strong&gt; The customer paid for credits, decided not to use the product, charges back or asks for a refund. The money flows back via Stripe. The credits should be removed from the balance. But if they already spent some of them, you have an AR situation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational credit-back.&lt;/strong&gt; A reservation was bigger than the actual consumption, so credits go back to available. No money moves at all. This happens hundreds of times a day per active org.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are different transaction types in the ledger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# credit_transaction.py
&lt;/span&gt;&lt;span class="n"&gt;VALID_TRANSACTION_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;consume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chargeback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reservation_consume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reservation_release&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ar_settlement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migration_apply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migration_rollback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reservation_consume&lt;/code&gt; and &lt;code&gt;reservation_release&lt;/code&gt; are operational. &lt;code&gt;refund&lt;/code&gt; and &lt;code&gt;chargeback&lt;/code&gt; involve real money. Stripe metered conflates these because Stripe is invoice-shaped, not credit-shaped.&lt;/p&gt;

&lt;p&gt;The refund use case interestingly does not change &lt;code&gt;balance&lt;/code&gt; at all — it logs the Stripe refund event for accounting and audit, because the refund itself happens in Stripe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# refund_credit_use_case.py — abridged
# Refund: balance NOT changed, affects_balance=FALSE
&lt;/span&gt;&lt;span class="n"&gt;tx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CreditTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance_after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance_after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reserved_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;affects_balance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;related_transaction_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_transaction_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stripe_refund_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stripe_refund_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;affects_balance=False&lt;/code&gt; flag is a small thing that does a lot of work — it lets the audit trail include the refund event without double-counting it against the balance, because the chargeback/refund recovery happens via a separate flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The audit trail you want is row-level, not invoice-level
&lt;/h3&gt;

&lt;p&gt;When a customer asks "why did our balance drop by 4,000 last Tuesday," what you want to be able to answer is: this reservation, by this service, for this operation, with this idempotency key, that resolved into this consume transaction. Then you want to be able to replay the exact event sequence.&lt;/p&gt;

&lt;p&gt;I write every state-changing operation through three things in the same transaction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The balance update on &lt;code&gt;organization_credit&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;credit_transaction&lt;/code&gt; row that records the delta, the type, the related reservation/transaction, and metadata.&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;outbox_event&lt;/code&gt; that downstream consumers (notifications, analytics, the customer's audit dashboard) can replay.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# from reserve_credit_use_case.py
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;outbox_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;OutboxEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;aggregate_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credit.reserved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reservation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reservation_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth_org_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transactional outbox pattern is the boring right answer for "I need to atomically update state and emit an event." If event delivery fails, the state still moved consistently and the event will be redelivered. If state failed, no event went out. You can rebuild any downstream view by replaying the outbox.&lt;/p&gt;

&lt;p&gt;Stripe's webhooks are excellent — but they're Stripe's events, not yours. They tell you "an invoice was paid" or "a usage record was created." They don't tell you "service yellow-codens reserved 1200 credits at 14:02 on behalf of operation x for org Y." That's the granularity an org admin actually wants in their audit log.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we kept Stripe for: the actual money
&lt;/h2&gt;

&lt;p&gt;The thing I am very deliberately not building is a payment processor. I am not handling card numbers, I am not implementing 3DS, I am not arguing with chargebacks. Stripe is excellent at all of that.&lt;/p&gt;

&lt;p&gt;So Stripe still owns the seam where money becomes credits. The package-purchase flow has two stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# purchase_package_use_case.py — abridged
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stripe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_checkout_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;stripe_customer_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stripe_customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stripe_price_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;package&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stripe_price_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;success_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cancel_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cancel_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth_org_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;package_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;package_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;package&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credits&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's stage one — &lt;code&gt;PreparePurchaseUseCase&lt;/code&gt; creates a Stripe Checkout session for a credit package. No balance changes. The customer gets sent to Stripe's hosted checkout.&lt;/p&gt;

&lt;p&gt;Stage two runs from the Stripe webhook handler when payment actually clears:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# from ApplyPurchasedCreditsUseCase
&lt;/span&gt;&lt;span class="n"&gt;settlement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_settle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;incoming_credit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;net_to_balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settlement&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;remaining_credit&lt;/span&gt;
&lt;span class="n"&gt;new_balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;net_to_balance&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;credit_repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_balance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth_org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance_delta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;net_to_balance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reserved_balance_delta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stripe webhook arrives, dedup'd by &lt;code&gt;stripe_event_id&lt;/code&gt;. AR settlement happens first (if the org owed money from a chargeback, that gets paid down before new credits land in the balance). Whatever's left becomes available balance.&lt;/p&gt;

&lt;p&gt;Note the seam: &lt;strong&gt;purchase amounts go through Stripe, internal pool movements go through BCP.&lt;/strong&gt; Stripe is the source of truth for "did this customer pay." BCP is the source of truth for "what's the org's available balance right now and what is each agent allowed to spend." Neither tries to be the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it cost to build
&lt;/h2&gt;

&lt;p&gt;This is the honest part. The credit-pool service is around 4-6 weeks of focused work to do well enough to trust in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain modeling: organizations, credits, reservations, transactions, AR, idempotency keys, outbox events.&lt;/li&gt;
&lt;li&gt;The five core use cases (reserve, consume, consume-reservation, release, refund) plus the edge cases (chargeback, grant, AR settlement, expired-reservation cleanup).&lt;/li&gt;
&lt;li&gt;Tests, especially concurrency tests for the lock + reservation paths.&lt;/li&gt;
&lt;li&gt;The Stripe webhook surface — checkout success, payment intent, dispute, refund, all idempotent on &lt;code&gt;stripe_event_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;An admin surface to inspect balances and reservations during incidents.&lt;/li&gt;
&lt;li&gt;A migration path from a previous, simpler "just deduct" model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You are not going to ship this in a weekend, even with the schema fully designed. The concurrency tests alone took me longer than I expected, because deadlocks under &lt;code&gt;SELECT FOR UPDATE&lt;/code&gt; ordering are subtle and the bugs only show up under load.&lt;/p&gt;

&lt;p&gt;If this isn't core to what your product is, don't build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Stripe metered is fine
&lt;/h2&gt;

&lt;p&gt;To be specific about when the build vs buy answer flips back the other way — Stripe metered is the right answer if all of these are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One product, one usage stream per customer. Per-seat or per-API-call pricing.&lt;/li&gt;
&lt;li&gt;You're OK with usage being consumed as it's reported (no reservations).&lt;/li&gt;
&lt;li&gt;You're OK with the customer's "balance" being computed at invoice time, not in real-time.&lt;/li&gt;
&lt;li&gt;Your retries can be made strictly idempotent by the Stripe &lt;code&gt;Idempotency-Key&lt;/code&gt; header alone, without payload-hash protection.&lt;/li&gt;
&lt;li&gt;You don't need to atomically move money between sub-products on the same account.&lt;/li&gt;
&lt;li&gt;Your refunds are invoice-level ("we billed you wrong"), not credit-level ("here's some make-good credits for the outage").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If even one of those isn't true, you're looking at a hybrid where you keep some ledger state on your side anyway. At which point: the ledger you're keeping is the actual source of truth, and Stripe is the payment layer. Be honest about which one's the system of record.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three things I wish I'd known sooner
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reservations are the load-bearing primitive, not consumes.&lt;/strong&gt; I started by modeling consumption as a single atomic deduct. Reservations got bolted on later when the long-running-task problem hit. If I were starting again, I would build reserve/consume/release as the only three primitives from day one, and treat "synchronous consume" as a sugar over "reserve then immediately consume."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Payload-hash idempotency is non-negotiable.&lt;/strong&gt; Key-only idempotency lets bugs through that you'll only catch in customer-support tickets. The day you accidentally retry the same key with a different &lt;code&gt;actual_amount&lt;/code&gt; and silently return the cached result, you've corrupted a balance. Make it a hard error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep the money seam thin and explicit.&lt;/strong&gt; It's tempting to let your ledger know about Stripe customers, prices, invoices, taxes. Don't. The boundary is: Stripe collects money and tells you (via webhooks dedup'd by event ID); your ledger turns money-in into credits and bookkeeps every internal movement. The simpler that seam stays, the easier the next payment provider integration is when it inevitably comes up.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm building this credit ledger inside a larger AI dev harness called Codens — the homegrown billing layer is the "Billing Control Plane," and it's what lets the agents share an org-wide pool with the reservation/release semantics described above. LP: &lt;a href="https://www.codens.ai/" rel="noopener noreferrer"&gt;https://www.codens.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've built a credit ledger that survived contact with real customers and a few chargebacks, I'd love to hear which corners you cut and which ones bit you. Or if you're staring down this build vs buy decision right now and want to compare notes — also good.&lt;/p&gt;

</description>
      <category>stripe</category>
      <category>billing</category>
      <category>postgres</category>
      <category>architecture</category>
    </item>
    <item>
      <title>I let my AI agents be advisory-only. Here's the rules-first PR risk engine I shipped instead.</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Thu, 30 Apr 2026 01:24:23 +0000</pubDate>
      <link>https://forem.com/zoetaka38/i-let-my-ai-agents-be-advisory-only-heres-the-rules-first-pr-risk-engine-i-shipped-instead-5c6k</link>
      <guid>https://forem.com/zoetaka38/i-let-my-ai-agents-be-advisory-only-heres-the-rules-first-pr-risk-engine-i-shipped-instead-5c6k</guid>
      <description>&lt;p&gt;A pattern I keep seeing in early AI-in-the-SDLC teams: someone wires an LLM into the PR-review pipeline as a quality gate, the LLM marks one perfectly fine PR as "risky" two weeks in, the team lead overrides it once and grumbles about it twice, and within a month the AI gate is silently disabled.&lt;/p&gt;

&lt;p&gt;You can't recover from that. Once a TL has spent a Friday afternoon explaining to engineering why the AI thinks their PR is dangerous when it isn't, "AI dev tools" become a punchline in their next 1:1 with the CTO. And the AI was probably right &lt;em&gt;some&lt;/em&gt; of the time — you just lost the chance to find out which times.&lt;/p&gt;

&lt;p&gt;I'm building a SaaS that touches this problem (Codens, an AI dev harness — happy to talk about it but it's not the point of this post). When I designed the PR risk evaluation service for it, I started with five non-negotiable design rules that I think apply to any AI-in-workflow product:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI is advisory only.&lt;/strong&gt; Never auto-blocks a merge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The TL owns code quality.&lt;/strong&gt; Not the AI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OK needs no reason. NG needs one line.&lt;/strong&gt; Don't ask humans to defend approvals.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw events, derived facts, and AI suggestions are stored separately.&lt;/strong&gt; Audit trail must survive any single layer being wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration &amp;gt; code.&lt;/strong&gt; Risk rules are repository-editable, not in source control of the AI service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post is the implementation of those constraints — what the engine actually looks like, with the code that ships in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: rules first, AI second
&lt;/h2&gt;

&lt;p&gt;The flow for a PR walks through three stages, in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PR webhook
    ↓
[1] Rules engine          ← pure deterministic, no LLM
    ↓ rule matches + risk score
[2] Threshold check       ← rule-driven "is this high-risk"
    ↓ if high-risk
[3] AI advisory layer     ← LLM looks at the matched rules and PR diff,
                            offers an opinion, never blocks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TL still presses the merge button. The AI's role is at most "raise a concern in a comment that the TL can dismiss." The rules engine's role is "tell the TL to look closer at certain files." Nothing fans out from the AI to anything that gates a deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rules engine: 4 match types, and that's it
&lt;/h2&gt;

&lt;p&gt;Every rule has one of four match types. I deliberately resisted adding a fifth (regex was tempting; I kept saying no).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RuleMatchType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;FILE_PATH_GLOB&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_path_glob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# auth/**, billing/**, migrations/**
&lt;/span&gt;    &lt;span class="n"&gt;KEYWORD&lt;/span&gt;        &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;          &lt;span class="c1"&gt;# "permission", "PII", "encryption"
&lt;/span&gt;    &lt;span class="n"&gt;LABEL&lt;/span&gt;          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;            &lt;span class="c1"&gt;# GitHub label "risk:high"
&lt;/span&gt;    &lt;span class="n"&gt;FILE_EXTENSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_extension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# .sql, .tf, .pem
&lt;/span&gt;
&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RiskRule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;match_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RuleMatchType&lt;/span&gt;
    &lt;span class="n"&gt;risk_area&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;RiskArea&lt;/span&gt;          &lt;span class="c1"&gt;# "auth" / "billing" / "data" / etc.
&lt;/span&gt;    &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nb"&gt;float&lt;/span&gt;             &lt;span class="c1"&gt;# 0.5, 1.0, 2.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why only four? Because the people writing these rules are repository owners, not the AI service author. If I add regex, the rules become unmaintainable for anyone who isn't a regex hobbyist. With glob + keyword + label + extension, you cover roughly 90% of useful "this PR touches a sensitive area" detection, and the remaining 10% probably needs a human anyway.&lt;/p&gt;

&lt;p&gt;The evaluation is dead simple — fnmatch for the path glob, lowercased substring match for keywords, label lookup for labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assess_pr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GitHubPullRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AssessmentResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rulesets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ruleset_repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_applicable_rulesets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;repository_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repository_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RuleMatch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ruleset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rulesets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ruleset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RuleMatchType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FILE_PATH_GLOB&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_match_file_paths&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RuleMatchType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;KEYWORD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_match_keywords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RuleMatchType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LABEL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_match_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;RuleMatchType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FILE_EXTENSION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_match_file_extensions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;total_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;is_high_risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;high_risk_threshold&lt;/span&gt;  &lt;span class="c1"&gt;# default 1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Path matching uses &lt;code&gt;fnmatch.fnmatch&lt;/code&gt; with &lt;code&gt;**&lt;/code&gt; collapsed to &lt;code&gt;*&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_match_file_paths&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GitHubPullRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RiskRule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RuleMatch&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;RuleMatch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;glob_pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changed_file_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fnmatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;glob_pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RuleMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;rule_pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;match_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;risk_area&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk_area&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;matched_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Up to this point: zero LLM calls. A PR touching &lt;code&gt;migrations/0042_add_user_pii.sql&lt;/code&gt; with a &lt;code&gt;risk:high&lt;/code&gt; label and "permission" in the title gets a deterministic score derived from the rule weights, and that score is reproducible by anyone reading the ruleset.&lt;/p&gt;

&lt;h2&gt;
  
  
  High-risk → precheck profile, not AI judgment
&lt;/h2&gt;

&lt;p&gt;When &lt;code&gt;is_high_risk&lt;/code&gt; is true, the next thing that happens is &lt;em&gt;not&lt;/em&gt; an LLM call. It's a precheck profile lookup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_high_risk&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;assessment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk_areas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;precheck_profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile_repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_best_profile_for_areas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;risk_areas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;assessment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;risk_areas&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;precheck_profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;assessment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;precheck_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;PrecheckItemResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;item_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;is_required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_required&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;precheck_profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A precheck profile is a checklist. For each risk area, the team has authored a list of "if you touched this area, did you remember to do X?" items. Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;auth&lt;/code&gt; area: "Did you isolate OAuth scope changes into a separate PR?"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;billing&lt;/code&gt; area: "Does the price calculation route through &lt;code&gt;bcp_price()&lt;/code&gt;?"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;migrations&lt;/code&gt; area: "Did you verify the migration is reversible with &lt;code&gt;--dry-run&lt;/code&gt;?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The PR author or reviewer ticks ✅ / ❌ + a one-line reason for ❌. &lt;strong&gt;The AI doesn't tick these. Ever.&lt;/strong&gt; It can't, by design — the precheck profile is a human checklist, and the AI has no API to mark items.&lt;/p&gt;

&lt;p&gt;If the engineer wants to know "is my migration reversible," they run &lt;code&gt;--dry-run&lt;/code&gt;. If they want a second opinion, they ask the TL. The AI sees this whole exchange, but its participation is bounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the AI finally shows up
&lt;/h2&gt;

&lt;p&gt;After the rules-first stages run, the AI gets to chime in. Two narrow things only:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Per-precheck-item advisory.&lt;/strong&gt; The AI can attach a comment to a checklist item: "Looking at the related files, I think items 2 and 3 are the ones worth reviewing closely." It cannot mark them done. It cannot add new items. It can only annotate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Anomaly detection.&lt;/strong&gt; Things like "this PR is 4× larger than the median PR for this area" or "the test count went down by 30% but the implementation grew." Statistical pattern matching on top of historical data, surfaced as a comment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AISuggestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;suggestion_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;       &lt;span class="c1"&gt;# "anomaly" | "precheck_advisory" | "scope_warning"
&lt;/span&gt;    &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="nb"&gt;str&lt;/span&gt;       &lt;span class="c1"&gt;# "info" | "warning"  ← never "blocking"
&lt;/span&gt;    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="nb"&gt;float&lt;/span&gt;     &lt;span class="c1"&gt;# 0.0-1.0
&lt;/span&gt;    &lt;span class="n"&gt;is_dismissed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="nb"&gt;bool&lt;/span&gt;      &lt;span class="c1"&gt;# TL can dismiss → never re-shown for this PR
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;is_dismissed&lt;/code&gt; flag does a lot of work. A TL who decides the AI's anomaly detection is too noisy on a particular repo can dismiss its suggestions for a few PRs in a row, and they stop appearing. The AI gets quieter automatically when the team finds it unhelpful, instead of making them open a settings page to silence it.&lt;/p&gt;

&lt;p&gt;This is, I think, the single most important UX move in any AI-in-workflow product: &lt;strong&gt;let the human turn it down without leaving the workflow.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-table audit log
&lt;/h2&gt;

&lt;p&gt;The other constraint that paid off in practice was separating the storage of three things that look similar but aren't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub webhook
    ↓
raw_events                 (immutable, the bytes that arrived)
    ↓ parser
gate_events / pr_stats     (derived facts: "PR opened", "review submitted")
    ↓ AI evaluation (non-blocking)
ai_suggestions             (the AI's commentary on the facts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why three tables instead of one row-per-PR-event with everything denormalized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A bug in the parser&lt;/strong&gt; (the path from raw_events → derived) shows up as wrong facts. You can re-run the parser against &lt;code&gt;raw_events&lt;/code&gt; and regenerate &lt;code&gt;gate_events&lt;/code&gt; without touching the AI's history. Recoverable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A bad AI prompt&lt;/strong&gt; (the path from facts → suggestions) shows up as wrong opinions. You can re-run the AI evaluator with a corrected prompt and regenerate &lt;code&gt;ai_suggestions&lt;/code&gt;. The facts are unchanged, the events are unchanged. Recoverable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A compliance question&lt;/strong&gt; ("show me what the AI told the TL on PR #4912") returns from &lt;code&gt;ai_suggestions&lt;/code&gt; directly. The fact that something was suggested, the fact that it was dismissed, the fact that the TL still merged anyway — all separately queryable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you put it all in one table, your "the AI was wrong here" investigation becomes "we lost the original event because we overwrote it with the corrected derivation". You can't get back to ground truth.&lt;/p&gt;

&lt;p&gt;This costs more storage (raw events are big) and a slightly more complex backfill story, but I'd take that trade every time over having to apologize to a customer with "we don't actually know what happened."&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd push back on if I were rebuilding from scratch
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A 5th match type.&lt;/strong&gt; I still want regex sometimes. I haven't added it because the day I do, three teams will write regexes, two will be wrong, and one will write a catastrophically slow one that DoSes the rule engine. If I do add it, it'll be &lt;code&gt;safe_regex&lt;/code&gt; with a strict timeout and a max-length cap, and I'll fight the urge to allow lookarounds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More AI in the audit suggestions.&lt;/strong&gt; Current prompts are conservative on purpose — short context, explicit "advisory only" framing, low temperature. There's a part of me that wants to let the AI chase suspicious patterns more aggressively across the codebase. But every time I've drafted that more-ambitious prompt, my mental simulation ends with the same false-positive-kills-trust failure mode that started this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An AI in your workflow that can be dismissed at the workflow level (one click, no settings page) survives much longer than an AI that requires reconfiguration to ignore.&lt;/strong&gt; The dismiss button is the trust-preserving move.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rules + thresholds + checklists do most of the gating work.&lt;/strong&gt; AI is an excellent annotator for "these specific files in this specific PR are worth a closer look", but it's a bad replacement for "should this merge." Don't promote it past advisory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Separate raw events, derived facts, and AI suggestions in storage from day one.&lt;/strong&gt; It's cheap to do upfront and impossible to retrofit when you discover your AI was systematically wrong about a class of PRs for two months and you need to figure out where to draw the rollback line.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm building this engine inside a larger AI dev harness called Codens — 5 specialized agents that share a credit pool across PRD writing, error auto-fix, test gen, and engineering activity tracking. The risk evaluation in this post is the "Yellow Codens" service. LP: &lt;a href="https://www.codens.ai/" rel="noopener noreferrer"&gt;https://www.codens.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've shipped an AI gate that survived team contact for &amp;gt;6 months, I'd love to hear what kept it alive. Or if you've shipped one that died from a single false positive — even more interested in that.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>github</category>
      <category>devops</category>
    </item>
    <item>
      <title>Stop AI from hallucinating E2E test selectors — code analysis + live browser exploration via Claude Agent SDK and 2 MCP servers</title>
      <dc:creator>Takayuki Kawazoe</dc:creator>
      <pubDate>Wed, 29 Apr 2026 09:58:33 +0000</pubDate>
      <link>https://forem.com/zoetaka38/stop-ai-from-hallucinating-e2e-test-selectors-code-analysis-live-browser-exploration-via-claude-1g3n</link>
      <guid>https://forem.com/zoetaka38/stop-ai-from-hallucinating-e2e-test-selectors-code-analysis-live-browser-exploration-via-claude-1g3n</guid>
      <description>&lt;p&gt;Generating E2E tests with an LLM sounds great in a demo. You hand a Playwright test spec to Claude, ask it to produce TypeScript code, and the toy app passes.&lt;/p&gt;

&lt;p&gt;Plug it into a real codebase and the wheels come off immediately. The AI confidently generates &lt;code&gt;await page.click('#login-button')&lt;/code&gt; for a project where the actual element is &lt;code&gt;&amp;lt;button data-testid="auth-submit"&amp;gt;Sign in&amp;lt;/button&amp;gt;&lt;/code&gt;. Selectors are invented from common patterns ("most projects use &lt;code&gt;#login-button&lt;/code&gt;") rather than read from your code or your DOM. Result: ~every generated test fails on first run, because the selectors are pure hallucination.&lt;/p&gt;

&lt;p&gt;This post is the architecture I shipped to fix that — how I make the AI &lt;strong&gt;read the codebase and drive an actual browser&lt;/strong&gt; before it writes a single line of test code, and why the implementation choices look the way they do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture in one diagram
&lt;/h2&gt;

&lt;p&gt;The agent gets two MCP (Model Context Protocol) servers wired in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              ┌──────────────────────────────────┐
              │     Claude Agent SDK Client      │
              │                                  │
   ┌──────────┤   mcp_servers = {                │
   │          │     "local":      &amp;lt;SDK MCP&amp;gt;,     │
   │          │     "playwright": &amp;lt;SDK MCP&amp;gt;      │
   │          │   }                              │
   │          └──────┬───────────────────────────┘
   │                 │
   ▼                 ▼
[ workspace        [ Playwright API
  file read ]        (Chromium running) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "local" MCP server exposes file-read / list-directory tools backed by the developer's repo. The "playwright" MCP server exposes browser-control tools — navigate, snapshot, click, type — backed by an actual &lt;code&gt;chromium.launch()&lt;/code&gt; instance.&lt;/p&gt;

&lt;p&gt;Critically, &lt;strong&gt;both MCP servers run inside the same Python process&lt;/strong&gt; as the agent client. Anthropic publishes a Docker image for an "official" Playwright MCP, but I deliberately don't use it — more on that below.&lt;/p&gt;

&lt;p&gt;Here's the wiring (Python, with the Anthropic &lt;code&gt;claude-agent-sdk&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# infrastructure/external_apis/exploratory_agent_client.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;claude_agent_sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ClaudeSDKClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;create_sdk_mcp_server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser_navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Navigate to a URL in the browser.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browser_navigate_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domcontentloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;@tool&lt;/code&gt; decorator turns a Python function into an MCP tool the agent can call. The Playwright &lt;code&gt;page&lt;/code&gt; object is captured in the closure, so the tool drives a single, persistent browser session that the agent navigates step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why in-process MCP, not Docker MCP
&lt;/h2&gt;

&lt;p&gt;Anthropic ships &lt;code&gt;mcr.microsoft.com/playwright/mcp&lt;/code&gt; as a Docker image. The "right" thing on paper is to spawn that container as a subprocess of the agent and let stdio do the talking.&lt;/p&gt;

&lt;p&gt;In practice, my agent runs inside a Celery worker on ECS Fargate. &lt;strong&gt;Docker-in-Docker on Fargate is a tax&lt;/strong&gt;: Fargate doesn't expose &lt;code&gt;/var/run/docker.sock&lt;/code&gt; by default, the workarounds (sysbox, Docker rootless inside a container, side-cars) all add operational complexity, and you lose the ability to attach event listeners to the Playwright browser from the agent process.&lt;/p&gt;

&lt;p&gt;In-process MCP via &lt;code&gt;create_sdk_mcp_server()&lt;/code&gt; sidesteps all of this. The Playwright &lt;code&gt;page&lt;/code&gt; is just a Python variable. I can attach &lt;code&gt;page.on("console", ...)&lt;/code&gt; and &lt;code&gt;page.on("request", ...)&lt;/code&gt; listeners &lt;em&gt;outside&lt;/em&gt; the tools, and have the tools query that captured state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;console_messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;network_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;console_messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;network_requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resource_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser_console_messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get console messages captured from the browser (errors, warnings, logs).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browser_console_messages_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;console_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent generates a test and the test causes a JS console error, the agent can call &lt;code&gt;browser_console_messages&lt;/code&gt; to see it and write an &lt;code&gt;expect(consoleErrors).toEqual([])&lt;/code&gt; assertion guarding against it. With Docker MCP this is much harder — the listener state lives in a different process from the tool implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-phase flow
&lt;/h2&gt;

&lt;p&gt;Just letting the agent loose on "URL + workspace, generate tests" produces sloppy results. I prompt it through three explicit phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Code analysis
&lt;/h3&gt;

&lt;p&gt;Tools available: &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;list_directory&lt;/code&gt; (the local MCP).&lt;/p&gt;

&lt;p&gt;The system prompt tells the agent to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify the framework (Next.js, React, Vue, etc.) and routing pattern&lt;/li&gt;
&lt;li&gt;Locate page components, forms, key UI flows&lt;/li&gt;
&lt;li&gt;Extract &lt;code&gt;data-testid&lt;/code&gt; / &lt;code&gt;aria-label&lt;/code&gt; patterns this project uses&lt;/li&gt;
&lt;li&gt;Output a JSON summary of &lt;code&gt;detected_pages&lt;/code&gt;, &lt;code&gt;detected_routes&lt;/code&gt;, &lt;code&gt;detected_forms&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The point of this phase is to make the agent &lt;strong&gt;discover this project's selector conventions&lt;/strong&gt; before it generates any test code. If the project uses &lt;code&gt;data-testid="page-login-submit"&lt;/code&gt; consistently, the agent learns the pattern; later, when it sees the actual DOM, it gravitates toward selectors that match the convention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Browser exploration
&lt;/h3&gt;

&lt;p&gt;Tools available: &lt;code&gt;browser_navigate&lt;/code&gt;, &lt;code&gt;browser_snapshot&lt;/code&gt;, &lt;code&gt;browser_click&lt;/code&gt;, &lt;code&gt;browser_type&lt;/code&gt;, &lt;code&gt;browser_console_messages&lt;/code&gt;, etc. (the Playwright in-process MCP).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;browser_snapshot&lt;/code&gt; is the linchpin. Returning the raw HTML of the page would burn hundreds of thousands of tokens for a single SPA. Instead, the tool returns the &lt;strong&gt;accessibility tree as a list of refs&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser_snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get accessibility tree snapshot ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browser_snapshot_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accessibility&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;elements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;ref_counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;button&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;link&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;radio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combobox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;listbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;option&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;menuitem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}:&lt;/span&gt;
            &lt;span class="n"&gt;ref_counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ref&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ref_counter&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;children&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])):&lt;/span&gt;
            &lt;span class="nf"&gt;extract_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;extract_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elements&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;element_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elements&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent sees something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"textbox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Email"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"textbox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Password"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sign in"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then it can issue &lt;code&gt;browser_type {ref: "e1", text: "test@example.com"}&lt;/code&gt; and the tool resolves the ref back to the live element. This is the same pattern the official Playwright MCP uses, and it's the right tradeoff: dramatically less token usage, and the agent operates on stable, semantic refs instead of pixel coordinates or fragile CSS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Test synthesis
&lt;/h3&gt;

&lt;p&gt;Now the agent has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(from Phase 1) The repo's selector convention (e.g. &lt;code&gt;data-testid="page-{name}-{element}"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;(from Phase 2) The actual ref/role/name for every element on every page it explored&lt;/li&gt;
&lt;li&gt;(also from Phase 2) The console-error history during exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hand it all of that and ask for Playwright TypeScript code. The output is qualitatively different from a 1-shot generation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selectors match the project's convention instead of generic &lt;code&gt;#button&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Selectors target real DOM that exists, because the agent saw them in &lt;code&gt;browser_snapshot&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Console-error checkpoints get encoded as &lt;code&gt;expect&lt;/code&gt; assertions&lt;/li&gt;
&lt;li&gt;Steps that produced visible network errors during exploration become explicit &lt;code&gt;await page.waitFor*&lt;/code&gt; calls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons after running this in production
&lt;/h2&gt;

&lt;p&gt;Three things that aren't obvious from the diagram:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Show the AI both the source code AND the live DOM.&lt;/strong&gt; Spec-only prompts will hallucinate selectors. Code-analysis-only will guess at runtime behavior. Browser-exploration-only will ignore project conventions. You need both, fed in as separate phases so the agent has enough context to bridge them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. In-process MCP beats Docker MCP for a cloud-deployed agent.&lt;/strong&gt; It's an operational convenience as much as a technical one — you keep the Playwright &lt;code&gt;page&lt;/code&gt; object accessible from the same Python process that runs the agent, you can attach event listeners outside tools, and you don't fight Fargate over Docker-in-Docker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Accessibility-tree refs are the right token-economy move.&lt;/strong&gt; Returning raw HTML in &lt;code&gt;browser_snapshot&lt;/code&gt; would 10x the cost per exploration and isn't even the most useful representation. Returning a flat list of &lt;code&gt;{ref, role, name}&lt;/code&gt; is faster, cheaper, and aligns the agent's "click this element" mental model with what's stable across DOM changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm still working on
&lt;/h2&gt;

&lt;p&gt;The 100-element cap in &lt;code&gt;browser_snapshot&lt;/code&gt; is a heuristic — pages with massive product grids overflow it and the agent can miss interactive elements past element 100. I'd like to make it pagination-aware (let the agent ask for "elements 100–200" if it suspects something is missing) without making the prompt longer. Suggestions welcome.&lt;/p&gt;

&lt;p&gt;The agent is also still bad at &lt;strong&gt;flows that span an email round-trip&lt;/strong&gt; (signup → magic link → continue). MCP-driven email-inbox access could close that gap, but I haven't built it yet.&lt;/p&gt;

&lt;p&gt;This pattern is shipped in Codens — a workflow-based AI harness for software development that I've been building solo at Corevice for the last 6 months. The exploratory E2E generator above is the "Blue Codens" service, one of 5 specialized agents that share state across one workflow (Green for PRD, Purple for orchestration, Red for auto-fix, Blue for QA, Yellow for engineering activity ledger). LP: &lt;a href="https://www.codens.ai/" rel="noopener noreferrer"&gt;https://www.codens.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback on the architecture, especially from anyone who's shipped Claude Agent SDK to production, very welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>mcp</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
