<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Webmaster Ramos</title>
    <description>The latest articles on Forem by Webmaster Ramos (@webramos).</description>
    <link>https://forem.com/webramos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1273297%2F5ca1ac25-c251-4144-a980-975f0b6a7c4d.jpg</url>
      <title>Forem: Webmaster Ramos</title>
      <link>https://forem.com/webramos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/webramos"/>
    <language>en</language>
    <item>
      <title>Six Principles in Practice: How an Agentic E2E Found 11 Production Bugs in 8 Runs</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Mon, 18 May 2026 07:23:46 +0000</pubDate>
      <link>https://forem.com/webramos/six-principles-in-practice-how-an-agentic-e2e-found-11-production-bugs-in-8-runs-2e1d</link>
      <guid>https://forem.com/webramos/six-principles-in-practice-how-an-agentic-e2e-found-11-production-bugs-in-8-runs-2e1d</guid>
      <description>&lt;h2&gt;
  
  
  Eight runs, eleven bugs
&lt;/h2&gt;

&lt;p&gt;I ran my E2E testing system on a production ecommerce platform eight times in&lt;br&gt;
a row – across five different business modules, in three different surface&lt;br&gt;
configurations (admin / desktop storefront / mobile-first storefront). Across&lt;br&gt;
those eight runs the system found &lt;strong&gt;eleven production bugs&lt;/strong&gt;, each one&lt;br&gt;
attached to a specific file and line via a &lt;code&gt;root_cause_slug&lt;/code&gt;. Between runs&lt;br&gt;
the knowledge base grew from 25 gotchas to 42 (+67% in nine days), and the&lt;br&gt;
first-try pass rate (&lt;code&gt;first_try_pass_rate&lt;/code&gt;) climbed from 14% to 95%.&lt;/p&gt;

&lt;p&gt;One detail up front: the methodology was assembled &lt;strong&gt;in a side stream&lt;/strong&gt;&lt;br&gt;
alongside product work, not as a dedicated project. Calibration cycles were&lt;br&gt;
interleaved between features, new-module sprints and routine support. Eight&lt;br&gt;
runs is not "eight weeks of full-time work" but &lt;strong&gt;eight iteration points&lt;/strong&gt;&lt;br&gt;
accumulated in parallel with shipping production code. Most of that time I&lt;br&gt;
was writing business logic, not agents.&lt;/p&gt;

&lt;p&gt;This isn't a story about "which framework to pick". Most teams start with&lt;br&gt;
E2E by asking exactly that question – and six months later they have a&lt;br&gt;
flaky suite that quietly gets disabled in CI. The right question is &lt;strong&gt;on&lt;br&gt;
what conditions these tests are entitled to exist at all&lt;/strong&gt;, and what agent&lt;br&gt;
architecture lets them compound instead of accumulating noise.&lt;/p&gt;

&lt;p&gt;This article is a closing piece for the previous publication&lt;br&gt;
&lt;a href="https://webmaster-ramos.com/blog/six-principles-agent-systems" rel="noopener noreferrer"&gt;Six Principles for Agent Systems That Don't Hallucinate&lt;/a&gt;.&lt;br&gt;
There I worked through the principles as an abstraction. Here – what&lt;br&gt;
happens when you apply them to a concrete task, in production, across two&lt;br&gt;
independent stacks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Premise: six principles applied to E2E
&lt;/h2&gt;

&lt;p&gt;E2E testing is a convenient test bed for agent systems for three reasons.&lt;br&gt;
First, &lt;strong&gt;the validator is deterministic&lt;/strong&gt; – the test either passes or it&lt;br&gt;
doesn't, and there is no room for probabilistic judgment. Second, the cycle&lt;br&gt;
is short – one run takes minutes, not hours or days. Third, the domain&lt;br&gt;
gives an explicit signal when the system has "learned" the stack –&lt;br&gt;
&lt;code&gt;first_try_pass_rate&lt;/code&gt; plateaus.&lt;/p&gt;

&lt;p&gt;All three properties are the same ones the Six Principles are built on in&lt;br&gt;
the general case: architecture over prompt-tweaking, deterministic context&lt;br&gt;
over probabilistic retrieval, closed-loop validation with a hard signal,&lt;br&gt;
three-category attribution, editorial gates instead of auto-promotion,&lt;br&gt;
multi-run measurement as proof of compounding.&lt;/p&gt;

&lt;p&gt;If these principles work in the general case, then on E2E they should&lt;br&gt;
deliver a &lt;strong&gt;measurable effect&lt;/strong&gt;. This essay is about the measured effect.&lt;/p&gt;
&lt;h2&gt;
  
  
  The contract: seven environment principles
&lt;/h2&gt;

&lt;p&gt;E2E tests live or die by their relationship with the environment. Without&lt;br&gt;
an explicit contract, every flaky-test debate converges on the same&lt;br&gt;
question: &lt;em&gt;is this a bug in the test, in the application, or in CI?&lt;/em&gt; – and&lt;br&gt;
no one can answer, because there is no shared baseline.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ENVIRONMENT.md&lt;/code&gt; is a markdown document with seven numbered principles.&lt;br&gt;
Each is one paragraph plus a short &lt;em&gt;why&lt;/em&gt;. Three audiences read it: a human&lt;br&gt;
during onboarding, an LLM agent during test generation, and the test&lt;br&gt;
runner (the last one via &lt;code&gt;playwright.config.ts&lt;/code&gt;, not directly).&lt;/p&gt;

&lt;p&gt;The principles in short:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The container is an external dependency.&lt;/strong&gt; Tests do not start or
stop the application. If the instance is unavailable, the preflight
check (principle 4) fails before any spec runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The database is dirty by default.&lt;/strong&gt; Demo data is reused across runs.
Test data is isolated via a prefix (&lt;code&gt;e2e_*&lt;/code&gt;), seeds are idempotent
through &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential execution.&lt;/strong&gt; &lt;code&gt;workers: 1&lt;/code&gt;, &lt;code&gt;retries: 0&lt;/code&gt;,
&lt;code&gt;fullyParallel: false&lt;/code&gt;. This is not a performance compromise – it is a
methodology commitment. Half of this principle – the &lt;strong&gt;no-retries
doctrine&lt;/strong&gt; – is the most load-bearing rule in the entire methodology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health check before everything.&lt;/strong&gt; &lt;code&gt;global-setup.ts&lt;/code&gt; makes one HEAD
request to a health endpoint before any spec runs. Without the health
check, the first failing test out of 50 produces an inscrutable
timeout; with it, one clear error appears in five seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seed vs assertion separation.&lt;/strong&gt; Seed specs configure state
(&lt;code&gt;tests/_seed/&lt;/code&gt;), assertion specs verify behavior
(&lt;code&gt;tests/modules/&amp;lt;feature&amp;gt;/&lt;/code&gt;). The underscore prefix is not stylistic;
it is lexicographic sort order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host runner + MCP browser.&lt;/strong&gt; Playwright runs on the host machine;
during test generation the LLM agent has access to MCP browser tools
– this lets it &lt;strong&gt;observe&lt;/strong&gt; the real DOM rather than invent selectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session caching with TTL.&lt;/strong&gt; Login is cached to a file; TTL depends
on the backend's nature (admin session with DevMode login – 15
minutes; Redis session under a strict security policy – 2 minutes).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each principle in depth lives in&lt;br&gt;
&lt;a href="https://github.com/webmaster-ramos/e2e-llm-agents/blob/main/specs/contract-spec.md" rel="noopener noreferrer"&gt;&lt;code&gt;contract-spec.md&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
in the principles repo.&lt;/p&gt;

&lt;p&gt;The principles are deliberately minimal. The contract does not address&lt;br&gt;
test data factories (a structural question), selector strategy (a&lt;br&gt;
generator concern), or CI (orthogonal). The contract is the smallest&lt;br&gt;
explicit commitment that makes the rest of the methodology coherent.&lt;br&gt;
Extending the contract is fine; expecting the contract to cover&lt;br&gt;
everything is a category error.&lt;/p&gt;
&lt;h2&gt;
  
  
  Four layers of code
&lt;/h2&gt;

&lt;p&gt;The contract says what tests do and don't do. The structure says where&lt;br&gt;
the artifacts of doing those things physically live. Four layers, with&lt;br&gt;
strict one-way dependency direction:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol6nexnvctoysz9onwiv.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fol6nexnvctoysz9onwiv.webp" alt="Architecture diagram: knowledge/ at top is read by LLM agents during planning, generation and healing but is never imported by tests; runtime stack below is tests/ → pages/ → lib/ with strict one-way dependency" width="800" height="498"&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;lib/&lt;/code&gt;&lt;/strong&gt; – stateless utilities. If a function in &lt;code&gt;lib/&lt;/code&gt; is called
&lt;code&gt;setupCheckoutTaxForRegion&lt;/code&gt;, it doesn't belong in &lt;code&gt;lib/&lt;/code&gt; – it belongs
in a Page Object or a flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pages/&lt;/code&gt;&lt;/strong&gt; – Page Objects. Stateful. &lt;strong&gt;Extracted only after the
third real use&lt;/strong&gt; (Rule of Three).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tests/&lt;/code&gt;&lt;/strong&gt; – the specs themselves. &lt;code&gt;_seed/&lt;/code&gt; (idempotent setup) and
&lt;code&gt;modules/&amp;lt;feature&amp;gt;/&lt;/code&gt; (per-feature assertions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;knowledge/&lt;/code&gt;&lt;/strong&gt; – markdown/YAML references for LLM agents.
&lt;strong&gt;Never imported by tests.&lt;/strong&gt; This is data for agents, not code for
the runtime.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;tests → pages → lib&lt;/code&gt; direction is one-way. Reverse edges are&lt;br&gt;
forbidden. Empirically: across four cross-stack ports, every cycle of&lt;br&gt;
"lib imports from pages" had to be reverted within the same sprint. The&lt;br&gt;
cost of portability with a cycle in place is too high.&lt;/p&gt;

&lt;p&gt;The most common objection is "extract &lt;code&gt;pages/&lt;/code&gt; from day one?". No. Rule&lt;br&gt;
of Three: one test – leave it inline; two – leave them duplicated;&lt;br&gt;
&lt;em&gt;the third&lt;/em&gt; – extract into &lt;code&gt;lib/&lt;/code&gt; (stateless) or &lt;code&gt;pages/&lt;/code&gt; (stateful).&lt;br&gt;
At two uses you don't yet see what is &lt;strong&gt;actually&lt;/strong&gt; shared. The third use&lt;br&gt;
shows the real abstraction instead of a coincidental match between two&lt;br&gt;
cases.&lt;/p&gt;

&lt;p&gt;A single &lt;code&gt;playwright.config.ts&lt;/code&gt; serves several orthogonal surface&lt;br&gt;
combinations – not "different browsers" but &lt;strong&gt;different DOMs&lt;/strong&gt;. On my&lt;br&gt;
ecommerce platform: admin / classic storefront (legacy MVC) / modern&lt;br&gt;
storefront (Alpine.js). Different DOM, different selectors, the same&lt;br&gt;
behavior cases. One run produces three results with a per-project&lt;br&gt;
breakdown in &lt;code&gt;metrics.jsonl&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  The four-agent pipeline
&lt;/h2&gt;

&lt;p&gt;The pipeline runs four agents in sequence: &lt;strong&gt;analyze → plan → generate →&lt;br&gt;
heal&lt;/strong&gt;. Each agent has one cognitive task, one input shape, one output&lt;br&gt;
shape.&lt;/p&gt;
&lt;h3&gt;
  
  
  Analyzer
&lt;/h3&gt;

&lt;p&gt;The first. Discovery: scans the codebase, identifies modules, routes,&lt;br&gt;
DB schema, dependencies. Writes results into &lt;code&gt;e2e/.state/*.json&lt;/code&gt; –&lt;br&gt;
persistent JSON artifacts. The phase is &lt;strong&gt;cheap and cacheable&lt;/strong&gt; – on&lt;br&gt;
every run it first checks the mtime of its outputs; if they are fresh,&lt;br&gt;
it skips entirely.&lt;/p&gt;

&lt;p&gt;The skip logic here is not optimization, it is architecture. Most cycles&lt;br&gt;
work on a stable codebase; re-scanning the source tree every time is&lt;br&gt;
waste. The analyzer's artifacts (&lt;code&gt;modules.json&lt;/code&gt;, &lt;code&gt;schema-map.json&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;project-auth.yml&lt;/code&gt;, &lt;code&gt;project-seed.yml&lt;/code&gt;) are read by the planner,&lt;br&gt;
generator and healer – each takes what it needs, no one re-runs&lt;br&gt;
discovery.&lt;/p&gt;
&lt;h3&gt;
  
  
  Planner
&lt;/h3&gt;

&lt;p&gt;The second. Reasoning: takes the analyzer's output plus the KB, writes&lt;br&gt;
a &lt;code&gt;plan.md&lt;/code&gt; – a numbered list of test cases for one feature. Each case:&lt;br&gt;
short title, preconditions, steps, expected outcome, optional KB&lt;br&gt;
references to relevant gotchas.&lt;/p&gt;

&lt;p&gt;The planner is a &lt;strong&gt;distinct phase&lt;/strong&gt;, not a step inside the generator,&lt;br&gt;
because planning and code-generation are different cognitive modes.&lt;br&gt;
Planning needs broad context (feature semantics, edge cases, KB flags).&lt;br&gt;
Generation needs narrow context (the exact selector for one button on&lt;br&gt;
one page). Trying to do both in one prompt produces either an&lt;br&gt;
over-prompted generator (slow, expensive) or an under-prompted planner&lt;br&gt;
(shallow plans, missed edge cases).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;plan.md&lt;/code&gt; is not test code. It is a specification that the generator&lt;br&gt;
turns into code in the next phase. The same &lt;code&gt;plan.md&lt;/code&gt; could be&lt;br&gt;
implemented in a different test framework.&lt;/p&gt;
&lt;h3&gt;
  
  
  Generator
&lt;/h3&gt;

&lt;p&gt;The third. Code emission: takes &lt;code&gt;plan.md&lt;/code&gt; and writes &lt;code&gt;*.spec.ts&lt;/code&gt;. &lt;strong&gt;The&lt;br&gt;
defining rule is selector discipline&lt;/strong&gt;: every selector that appears in&lt;br&gt;
a generated spec must be &lt;strong&gt;observed in the live application via MCP&lt;br&gt;
browser tools&lt;/strong&gt; – not inferred from sources, not guessed from a&lt;br&gt;
screenshot.&lt;/p&gt;

&lt;p&gt;What "stable selector" means depends on the surface. For each project&lt;br&gt;
the generator has a preference hierarchy: &lt;code&gt;getByRole(...)&lt;/code&gt; →&lt;br&gt;
&lt;code&gt;getByPlaceholder(...)&lt;/code&gt; → scoped CSS → id – in descending order of&lt;br&gt;
stability.&lt;/p&gt;

&lt;p&gt;What is &lt;strong&gt;forbidden&lt;/strong&gt;: deriving a selector from source code (the&lt;br&gt;
rendered DOM may differ); guessing from a screenshot; "the button&lt;br&gt;
probably has the class &lt;code&gt;.btn-primary&lt;/code&gt;". If a stable selector doesn't&lt;br&gt;
exist, the correct reaction is to &lt;strong&gt;report a gap&lt;/strong&gt; back to the planner,&lt;br&gt;
not to write something brittle and hope.&lt;/p&gt;
&lt;h3&gt;
  
  
  Healer
&lt;/h3&gt;

&lt;p&gt;The fourth, and the most important. Diagnosis: runs the specs, observes&lt;br&gt;
failures, &lt;strong&gt;attributes each failure to one of three categories&lt;/strong&gt; –&lt;br&gt;
&lt;code&gt;test-bug&lt;/code&gt; / &lt;code&gt;app-bug&lt;/code&gt; / &lt;code&gt;env-drift&lt;/code&gt; – and writes a structured&lt;br&gt;
&lt;strong&gt;heal-finding&lt;/strong&gt; with the audit trail.&lt;/p&gt;

&lt;p&gt;That attribution is what &lt;strong&gt;makes the no-retries doctrine actionable&lt;/strong&gt;.&lt;br&gt;
Each category has its own remediation path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;test-bug&lt;/code&gt; → the healer fixes the spec.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;app-bug&lt;/code&gt; → the healer &lt;strong&gt;does not fix the application&lt;/strong&gt;. It files
the bug with &lt;code&gt;root_cause_slug&lt;/code&gt; and leaves the spec failing as a true
positive.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;env-drift&lt;/code&gt; → the healer surfaces the drift; the contract may need
updating.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The healer is also the agent that &lt;strong&gt;proposes KB candidates&lt;/strong&gt;. A failure&lt;br&gt;
exposed a gotcha future tests should know? The healer writes a candidate&lt;br&gt;
into &lt;code&gt;e2e/knowledge/_inbox/&lt;/code&gt;. The candidate is &lt;strong&gt;not auto-promoted&lt;/strong&gt; –&lt;br&gt;
an editorial gate decides.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why four agents, not three
&lt;/h3&gt;

&lt;p&gt;Early versions of the methodology used three agents (planner /&lt;br&gt;
generator / healer) and folded discovery into the planner. The&lt;br&gt;
four-agent split was empirical: a planner prompt that also did&lt;br&gt;
discovery was noticeably worse at both jobs. Pulling the analyzer into&lt;br&gt;
its own phase made each phase smaller, cheaper, and individually&lt;br&gt;
skippable (analyzer caches; planner skips when &lt;code&gt;plan.md&lt;/code&gt; exists;&lt;br&gt;
healer skips on a green run).&lt;/p&gt;

&lt;p&gt;The pipeline produces &lt;strong&gt;measurable artifacts at every boundary&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;e2e/.state/*.json&lt;/code&gt; after analysis, &lt;code&gt;plan.md&lt;/code&gt; after planning,&lt;br&gt;
&lt;code&gt;*.spec.ts&lt;/code&gt; after generation, a six-section heal-finding after healing.&lt;br&gt;
Each is reviewable. Each is comparable across runs.&lt;/p&gt;

&lt;p&gt;Per-agent depth lives in the four agent-role specs in the principles&lt;br&gt;
repo. Skill-level orchestration lives in&lt;br&gt;
&lt;a href="https://github.com/webmaster-ramos/e2e-llm-agents/blob/main/specs/skill-design.md" rel="noopener noreferrer"&gt;&lt;code&gt;skill-design.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Knowledge as the fourth layer
&lt;/h2&gt;

&lt;p&gt;The knowledge base is &lt;strong&gt;the fourth layer of the structure&lt;/strong&gt; and the&lt;br&gt;
input that makes agents &lt;em&gt;learn&lt;/em&gt; between runs. KB files are YAML&lt;br&gt;
documents, read by agents at planning, generation and healing time.&lt;br&gt;
They are never imported by test code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Two categories of entry
&lt;/h3&gt;

&lt;p&gt;Every KB entry is one of two kinds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gotcha&lt;/strong&gt; – prose advisory for the agent. "When clicking a button
inside a modal on the admin surface, wait for the loading-mask
overlay to disappear". Gotchas are advisory; the agent reads them as
context, not as an enforceable rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint pattern&lt;/strong&gt; – a machine-checkable rule with a severity. "If a
spec calls &lt;code&gt;page.click()&lt;/code&gt; on &lt;code&gt;.btn-primary&lt;/code&gt; without preceding
&lt;code&gt;waitForLoadingMask()&lt;/code&gt; – warning". Lint patterns are runnable as a
static analysis pass (&lt;code&gt;--phase lint&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An entry starts as a gotcha. Promotion to a lint pattern is downstream&lt;br&gt;
and explicit, after the gotcha &lt;strong&gt;has fired enough times&lt;/strong&gt; for the rule&lt;br&gt;
to clearly generalize.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project-local vs cross-stack
&lt;/h3&gt;

&lt;p&gt;The KB has two homes. &lt;strong&gt;Project-local&lt;/strong&gt; lives inside &lt;code&gt;e2e/knowledge/&lt;/code&gt; –&lt;br&gt;
about this codebase: auth patterns, seed fixtures, business-domain&lt;br&gt;
quirks. &lt;strong&gt;Cross-stack&lt;/strong&gt; lives outside any single project (for example,&lt;br&gt;
in &lt;code&gt;~/.claude/skills/e2e-kb/kb/&lt;/code&gt;) – about a &lt;em&gt;technology&lt;/em&gt;: UI framework&lt;br&gt;
patterns, Tailwind class quirks, admin framework selectors. Cross-stack&lt;br&gt;
KB &lt;strong&gt;generalizes across every project that uses the same technology&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This split is what produces &lt;strong&gt;cross-project knowledge transfer&lt;/strong&gt;. When&lt;br&gt;
you start a new project on a familiar technology, cross-stack KB&lt;br&gt;
applies on day one. Project-local KB starts empty and fills as the&lt;br&gt;
codebase reveals its quirks.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;sources.yml&lt;/code&gt; routing and &lt;code&gt;kb_by_app&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;sources.yml&lt;/code&gt; describes which KBs apply to which surface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;universal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;project-auth.yml&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;project-seed.yml&lt;/span&gt;

&lt;span class="na"&gt;by_surface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;admin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;modules/admin.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/admin-framework.yml&lt;/span&gt;
  &lt;span class="na"&gt;storefront&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;modules/storefront.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/alpine-js.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/tailwind-css.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner reads &lt;code&gt;sources.yml&lt;/code&gt; and loads only the KBs relevant to the&lt;br&gt;
target.&lt;/p&gt;

&lt;p&gt;For multi-app projects – one repository hosting two genuinely different&lt;br&gt;
application stacks (a FastAPI backend and a Next.js admin in the same&lt;br&gt;
work tree) – the pattern extends to &lt;code&gt;kb_by_app&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kb_by_app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/fastapi.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/alpine-js.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/tailwind-css.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;project-auth.yml&lt;/span&gt;
  &lt;span class="na"&gt;admin-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/nextjs.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/shadcn-ui.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;platform/tailwind-css.yml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;project-auth.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tests for the backend receive &lt;code&gt;fastapi&lt;/code&gt; + &lt;code&gt;alpine-js&lt;/code&gt; KB; tests for the&lt;br&gt;
admin receive &lt;code&gt;nextjs&lt;/code&gt; + &lt;code&gt;shadcn-ui&lt;/code&gt;. No cross-contamination, no agents&lt;br&gt;
loading irrelevant gotchas. The same &lt;code&gt;tailwind-css.yml&lt;/code&gt; is reused in&lt;br&gt;
both apps – &lt;strong&gt;cross-project KB reuse on a small scale&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Editorial promotion
&lt;/h3&gt;

&lt;p&gt;The single most important rule of the KB layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Auto-promotion of healer candidates into the active KB is forbidden.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Auto-promotion optimizes recall at the expense of precision. The&lt;br&gt;
resulting KB describes &lt;strong&gt;the system's errors&lt;/strong&gt;, not what is true. The&lt;br&gt;
agent then retrieves contradictory advice (every fix has become a&lt;br&gt;
"principle"), and compounding &lt;strong&gt;flips sign&lt;/strong&gt; – the saturation curve&lt;br&gt;
moves &lt;em&gt;downward&lt;/em&gt; instead of upward.&lt;/p&gt;

&lt;p&gt;Promotion is editorial: the healer writes a candidate into &lt;code&gt;_inbox/&lt;/code&gt;;&lt;br&gt;
a reviewer asks two questions (&lt;em&gt;does this generalize? is it not covered&lt;br&gt;
by an existing entry?&lt;/em&gt;); only on two yeses does the candidate move&lt;br&gt;
into the active KB. The editorial gate is what &lt;strong&gt;keeps the KB useful&lt;/strong&gt;&lt;br&gt;
as the project grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The healing loop and the saturation curve
&lt;/h2&gt;

&lt;p&gt;The healer produces &lt;strong&gt;one artifact per run&lt;/strong&gt;: a markdown file in&lt;br&gt;
&lt;code&gt;e2e/.state/heal-findings/&lt;/code&gt;, timestamped. Six sections, always in the&lt;br&gt;
same order, even on green runs.&lt;/p&gt;

&lt;p&gt;The six sections are not bureaucracy. They are an &lt;strong&gt;audit trail&lt;/strong&gt; that&lt;br&gt;
stays readable two months later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A. Diagnosis matrix&lt;/strong&gt; – a table: tests × projects. Pass / Fail /
Skip / N/A in cells. The reviewer sees it first – "what failed and
where" before any narrative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B. Hypothesis on root causes&lt;/strong&gt; – for each failure, a working
theory. Each hypothesis names an attribution category.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;C. Healing action + decision rationale&lt;/strong&gt; – what the healer did.
&lt;code&gt;test-bug&lt;/code&gt; is fixed in the spec, &lt;code&gt;app-bug&lt;/code&gt; is filed with a slug,
&lt;code&gt;env-drift&lt;/code&gt; is surfaced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;D. Verification checklist&lt;/strong&gt; – how to confirm the fix worked. A
checklist, not prose. This is what makes the audit trail &lt;em&gt;closeable&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;E. KB candidates&lt;/strong&gt; – gotchas worth promoting (via the editorial
gate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F. Out-of-scope siblings&lt;/strong&gt; – observations that surfaced during the
run but are &lt;strong&gt;not the focus&lt;/strong&gt; of this finding. Test-infra glitches,
environment quirks, remarks worth follow-up but no action right now.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Section F matters separately: without it, observations either clutter&lt;br&gt;
the main narrative (A–E) or get lost and rediscovered a month later&lt;br&gt;
as "haven't we seen this already?".&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this produces compounding
&lt;/h3&gt;

&lt;p&gt;Every heal-finding's Section E feeds the &lt;code&gt;_inbox/&lt;/code&gt;. The editorial gate&lt;br&gt;
either promotes or rejects. Promoted entries become available to the&lt;br&gt;
next run's planner and generator. The next run on the same surface&lt;br&gt;
&lt;strong&gt;starts with a richer KB&lt;/strong&gt;, and &lt;code&gt;first_try_pass_rate&lt;/code&gt; rises.&lt;/p&gt;

&lt;p&gt;That is the compounding mechanism. It depends on three preconditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three-category attribution&lt;/strong&gt; (the no-retries doctrine) – without
it, failures become "flaky", and the healer has nothing structured
to record.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Editorial promotion&lt;/strong&gt; – without it, the KB becomes an error log,
and the curve flips sign.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-run findings&lt;/strong&gt; (the six-section discipline) – without them,
the audit trail is missing, and the next reviewer can't follow the
chain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The saturation curve
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn7wibgqlpuqg2nfxp1o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn7wibgqlpuqg2nfxp1o.webp" alt="Line chart: first_try_pass_rate climbs from ~14% at run 1, through ~78% at run 2, to ~95% at run 3, then plateaus through runs 4–8 (the saturation signal)" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run 1: low pass rate. Every gotcha is new; the KB is empty for the&lt;br&gt;
surface.&lt;/p&gt;

&lt;p&gt;Runs 2–3: rate climbs steeply, as the first findings get promoted into&lt;br&gt;
the KB. The agent now reads gotchas it discovered itself last time.&lt;/p&gt;

&lt;p&gt;Runs 4+: rate plateaus. The KB has captured the surface's&lt;br&gt;
idiosyncrasies; further runs only encounter rare new gotchas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The plateau is saturation&lt;/strong&gt;. The empirical signal that the&lt;br&gt;
methodology &lt;strong&gt;has paid for itself&lt;/strong&gt; on this surface. After saturation&lt;br&gt;
the cost of a new test on the surface is dominated by &lt;em&gt;defining the&lt;br&gt;
case&lt;/em&gt;, not &lt;em&gt;learning the surface&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Across my eight runs: on a mature module (third run on the same&lt;br&gt;
surface) &lt;code&gt;first_try_pass_rate&lt;/code&gt; reached ~95%. On a new surface of the&lt;br&gt;
same platform – first run ~14%, second ~78%, third ~95%. The same&lt;br&gt;
shape on each of six modules: low → climb → plateau. This isn't a&lt;br&gt;
theoretical benefit – it is measured.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the metrics &lt;strong&gt;don't&lt;/strong&gt; track
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Test execution time.&lt;/strong&gt; Playwright's reporter handles that.
&lt;code&gt;metrics.jsonl&lt;/code&gt; is about the quality of generated tests, not their
runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code coverage in the line-coverage sense.&lt;/strong&gt; That is a different
methodology (instrument, run, report).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subjective quality.&lt;/strong&gt; "Are these tests good?" is a review question,
not a metric. The metric measures whether they pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full &lt;code&gt;metrics.jsonl&lt;/code&gt; schema with the additive evolution v1 → v2,&lt;br&gt;
the definition of &lt;code&gt;first_try_pass_rate&lt;/code&gt;, the &lt;code&gt;root_cause_slug&lt;/code&gt;&lt;br&gt;
discipline – all in&lt;br&gt;
&lt;a href="https://github.com/webmaster-ramos/e2e-llm-agents/blob/main/specs/metric-design.md" rel="noopener noreferrer"&gt;&lt;code&gt;metric-design.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stack-agnostic: porting to FastAPI in days, not weeks
&lt;/h2&gt;

&lt;p&gt;The strongest objection to any "works for me" methodology is &lt;em&gt;it works&lt;br&gt;
only because you know your stack&lt;/em&gt;. I ported the methodology to a second&lt;br&gt;
stack – not on a new machine and not for an article, but for my own&lt;br&gt;
pet project: FastAPI + Alpine (backend) + Next.js (admin) in a single&lt;br&gt;
work tree. The port took days, not weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  What carries over 1-to-1
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The four agents&lt;/strong&gt; (analyze / plan / generate / heal) – same
prompts, same contract between phases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;e2e-coverage&lt;/code&gt; skill&lt;/strong&gt; – same orchestrator, same artifact on
output (a &lt;code&gt;metrics.jsonl&lt;/code&gt; line, a heal-finding).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;ENVIRONMENT.md&lt;/code&gt; pattern&lt;/strong&gt; – 7 principles stayed. Some are
trivially satisfied (no auth → principle 7 N/A), but the contract
kept its shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STRUCTURE.md&lt;/strong&gt; – four layers, Rule of Three, dependency direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-retries doctrine&lt;/strong&gt; – &lt;code&gt;retries: 0&lt;/code&gt; on the new stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is rewritten
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;knowledge/&lt;/code&gt;&lt;/strong&gt; – local gotchas (FastAPI middleware quirks instead
of legacy MVC). Cross-stack KB (&lt;code&gt;alpine-js.yml&lt;/code&gt;, &lt;code&gt;tailwind-css.yml&lt;/code&gt;)
is reused without change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;lib/&lt;/code&gt;&lt;/strong&gt; – FastAPI auth helpers instead of platform CLI calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;playwright.config.ts&lt;/code&gt;&lt;/strong&gt; – different projects (&lt;code&gt;backend&lt;/code&gt; +
&lt;code&gt;admin-app&lt;/code&gt; instead of &lt;code&gt;admin&lt;/code&gt; + &lt;code&gt;classic-storefront&lt;/code&gt; +
&lt;code&gt;modern-storefront&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What appeared new on the second stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;kb_by_app&lt;/code&gt; routing&lt;/strong&gt; – a solution to the multi-app problem (one
&lt;code&gt;e2e/&lt;/code&gt; serving two genuinely different app stacks). The pattern then
back-ports onto the first stack if a multi-app scenario emerges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Metrics on the second stack
&lt;/h3&gt;

&lt;p&gt;First run on the new backend: &lt;code&gt;first_try_pass_rate&lt;/code&gt; ~48.6%. Second –&lt;br&gt;
~91.4%. Same two-to-three runs of the same surface, the same&lt;br&gt;
compounding shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What matters:&lt;/strong&gt; the second stack didn't "repeat" the first. It&lt;br&gt;
showed that &lt;strong&gt;the shape of the curve is a property of the loop, not&lt;br&gt;
of the task&lt;/strong&gt;. Detect a deterministic validator (tests pass/fail,&lt;br&gt;
build succeeds, types check), close the loop (executor → validator,&lt;br&gt;
auto-revert on regression, KB grows only on validated new error&lt;br&gt;
classes) – and compounding appears regardless of stack.&lt;/p&gt;

&lt;p&gt;After the cross-platform port I have &lt;strong&gt;n=2 platforms plus n=8 runs&lt;br&gt;
within the first&lt;/strong&gt;. The KB saturation curve is not a Magento artifact.&lt;br&gt;
It is a property of the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The six principles from the meta article are not dogma. They are a set&lt;br&gt;
of architectural commitments that &lt;strong&gt;will make an agent system compound,&lt;br&gt;
if you accept them&lt;/strong&gt;. This article showed what happens on a concrete&lt;br&gt;
task – E2E testing – when you accept them in full.&lt;/p&gt;

&lt;p&gt;What is useful here beyond "another E2E framework":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structural framing of the flaky-test problem.&lt;/strong&gt; Not "which runner
to buy", but &lt;em&gt;what conditions must be met for tests to exist as a
signal rather than as noise&lt;/em&gt;. Those conditions are expressed in the
seven contract principles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compounding proved through measurement.&lt;/strong&gt; The KB saturation curve
is not theoretical. It appears on two independent stacks, in the
same shape. Single-run anecdotes really are almost useless for
evaluating an architecture; the multi-run curve is a different
story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Editorial gates as load-bearing.&lt;/strong&gt; Auto-promotion is the most
obvious step that &lt;strong&gt;breaks&lt;/strong&gt; compounding. That is counter-intuitive
and worth surfacing explicitly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to apply the methodology – the &lt;strong&gt;principles repo&lt;/strong&gt; with&lt;br&gt;
granular specs and illustrative examples (against todomvc as a neutral&lt;br&gt;
target): &lt;a href="https://github.com/webmaster-ramos/e2e-llm-agents" rel="noopener noreferrer"&gt;https://github.com/webmaster-ramos/e2e-llm-agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The canonical narrative with principles and architecture lives on the&lt;br&gt;
site at &lt;code&gt;/docs/e2e-llm-agents&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://webmaster-ramos.com/docs/e2e-llm-agents" rel="noopener noreferrer"&gt;https://webmaster-ramos.com/docs/e2e-llm-agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The meta article with the six principles as an abstraction:&lt;br&gt;
&lt;a href="https://webmaster-ramos.com/blog/six-principles-agent-systems" rel="noopener noreferrer"&gt;https://webmaster-ramos.com/blog/six-principles-agent-systems&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>agentskills</category>
      <category>playwright</category>
      <category>e2e</category>
    </item>
    <item>
      <title>Six Principles for Agent Systems That Don't Hallucinate</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Tue, 12 May 2026 07:06:27 +0000</pubDate>
      <link>https://forem.com/webramos/six-principles-for-agent-systems-that-dont-hallucinate-14gn</link>
      <guid>https://forem.com/webramos/six-principles-for-agent-systems-that-dont-hallucinate-14gn</guid>
      <description>&lt;h2&gt;
  
  
  Why this article exists
&lt;/h2&gt;

&lt;p&gt;Agentic development with LLMs in 2026 is no longer an "interesting experiment". It is its own engineering discipline. By an "agent" I mean a program built on top of a language model that performs a &lt;strong&gt;structured task inside a product&lt;/strong&gt; rather than merely replying to a user in chat: it reads code, makes decisions, writes files, calls external APIs, and returns a result. I join product teams where three to five such agents already work in parallel: code review bots, content classifiers, ticket routers, recommendation pipelines, internal documentation generators. A demo can be assembled in one evening. Production cannot.&lt;/p&gt;

&lt;p&gt;The line between a demo-quality and a production-quality agent system is not where people usually look for it. The deciding factor is not the model, not the token budget, and not the quality of the prompts. The deciding factor is &lt;strong&gt;the architecture of the system in which the model operates&lt;/strong&gt; – and that architecture does not come from "build your first agent" tutorials. It comes from failed attempts.&lt;/p&gt;

&lt;p&gt;At that boundary, every agent system runs into the same three problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinations&lt;/strong&gt; – the agent invents facts that sound plausible but do not match reality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-reproducibility&lt;/strong&gt; – the same prompt produces different results across runs, and errors cannot be debugged properly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No way to accumulate knowledge&lt;/strong&gt; – every run starts from zero, and the mistakes of one run do not help the next one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of those three, the first two are discussed by almost everyone writing about multi-agent systems in 2025 – roles, JSON contracts, system prompts. The third one – how an agent system becomes cheaper and more accurate with each subsequent run – is barely discussed, because that conversation requires metrics across multiple runs, not a single production anecdote.&lt;/p&gt;

&lt;p&gt;These three problems are &lt;strong&gt;not properties of the model&lt;/strong&gt;. They are solved at the architecture level of the system the model operates inside. Below are six design principles that address them.&lt;/p&gt;

&lt;p&gt;The principles are universal. They work equally well for &lt;strong&gt;code review, refactoring tools, security audit pipelines, migration tools, documentation generators, customer support routing, content moderation, and data pipelines with LLM stages&lt;/strong&gt;. Wherever you have multiple roles, a rules-heavy domain, and a need for reproducible output, these six layers apply.&lt;/p&gt;

&lt;p&gt;I distilled them while building one specific system – LLM agents for E2E testing. That took a month and a half of part-time iteration and produced measurable results: eight runs, eleven production bugs found automatically, and a first-try pass rate that rose from 14% to 95% as the knowledge base – hereafter "KB" – became more saturated. Each principle below is paired with one concrete E2E example and one or two applications in other domains.&lt;/p&gt;

&lt;p&gt;Those eight runs are not a generic "it works for us" claim. They are the first trendline. On a single run, any architecture can sound plausible. Across eight runs, you start to see which principles actually deliver ROI and which ones are overhead without return.&lt;/p&gt;

&lt;p&gt;The six principles work together as layers. Remove one, and the whole stack collapses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 1: An explicit contract
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; A document that describes &lt;strong&gt;the rules the agent operates under and that do not change between runs&lt;/strong&gt;. Not code, not config, but text. Usually Markdown with five to ten numbered principles, around 500–800 words (~3–5 KB). In my E2E version it is seven principles, 83 lines, about 600 words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; Without an explicit contract, the agent makes arbitrary choices every time it encounters ambiguity. "Should the database be clean for each test, or should it stay dirty?" The agent picks one answer today, another next week, and you end up with incompatible tests. With a contract ("the DB is dirty by default; demo data is reused"), the answer is predefined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; My &lt;code&gt;ENVIRONMENT.md&lt;/code&gt; contains seven principles: the container as an external dependency, a dirty database, sequential execution, a health check, seed/assertion separation, a host runner plus MCP browser, and session caching. Each one is a short paragraph plus a brief rationale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; A security-audit agent gets &lt;code&gt;SCOPE.md&lt;/code&gt;: what is in scope (&lt;code&gt;src/&lt;/code&gt; production code) and what is not (test fixtures, &lt;code&gt;vendor/&lt;/code&gt;, deprecated code). Without that contract, the agent will report vulnerabilities in demo files and waste your time on false positives. A code-review agent gets &lt;code&gt;STYLE.md&lt;/code&gt; with an explicit instruction: "code style is already formalised in &lt;code&gt;.eslintrc&lt;/code&gt;; do not comment on formatting." A refactoring agent gets &lt;code&gt;BOUNDARIES.md&lt;/code&gt;: which modules it may not touch and which public APIs it may not break.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; The agent acts on unstated assumptions, and in half the cases those assumptions will be the opposite of yours. Two weeks later, the team no longer understands why the agent behaves one way today and another way tomorrow. A month later, they stop trusting its output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 2: Role separation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; A complex task equals multiple cognitive modes, and those modes cannot live inside one agent. You need separate roles with differentiated tools, context, and instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; A single prompt cannot simultaneously demand "explore broadly" and "do not deviate from the plan." A single context cannot hold both architecture diagrams and specific code. A single toolset cannot be optimal both for browser automation and for editing text files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; Four agents: &lt;code&gt;e2e-analyzer&lt;/code&gt; (discovery), &lt;code&gt;e2e-planner&lt;/code&gt; (strategy), &lt;code&gt;e2e-generator&lt;/code&gt; (implementation), and &lt;code&gt;e2e-healer&lt;/code&gt; (diagnostics). Each has its own MCP tools, its own context, and its own responsibility. The generator is not allowed to invent selectors; the healer is not allowed to expand coverage. Those constraints are what make the system predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; In a refactoring agent pipeline, &lt;code&gt;code-mapper&lt;/code&gt; builds the dependency graph and writes &lt;code&gt;dependencies.json&lt;/code&gt;; &lt;code&gt;refactor-planner&lt;/code&gt; reads the graph and writes &lt;code&gt;refactor-plan.md&lt;/code&gt; with numbered steps; &lt;code&gt;refactor-applier&lt;/code&gt; applies each step; &lt;code&gt;verifier&lt;/code&gt; runs tests and checks the result. In a code-review pipeline, &lt;code&gt;static-scanner&lt;/code&gt; looks for obvious anti-patterns; &lt;code&gt;context-reader&lt;/code&gt; loads related files; &lt;code&gt;reviewer&lt;/code&gt; writes comments; &lt;code&gt;summarizer&lt;/code&gt; aggregates them into one review message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; A monolithic agent with one giant prompt initially looks "simpler" – one file, one entry point. Two weeks later the prompt is 800 lines of contradictory instructions, the context is bloated, and the output is worse than that of a simple script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 3: Persistent state between phases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Artifacts the agent writes to disk and that survive between runs. Not RAM, not in-memory state, but files with structured data that can be read by both humans and downstream agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; Discovery is an expensive phase. If you rescan the codebase from scratch every time, you pay for that in both context and time. But discovery changes slowly: the list of modules, the database schema, the routes. Do it once, save it, and later phases can read the result.&lt;/p&gt;

&lt;p&gt;An additional benefit is that persistent state enables &lt;strong&gt;idempotent skip logic&lt;/strong&gt;. If &lt;code&gt;modules.json&lt;/code&gt; is still fresh (by &lt;code&gt;mtime&lt;/code&gt;, the file modification time), the analyze phase is skipped automatically. The pipeline becomes cheap on repeated runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; The analyzer writes &lt;code&gt;modules.json&lt;/code&gt; (modules, routes, dependencies) and &lt;code&gt;schema-map.json&lt;/code&gt; (database schema). On the second run for the same module, the analyze phase takes zero seconds. Those files are also useful &lt;strong&gt;in their own right&lt;/strong&gt;: a new team member can read &lt;code&gt;schema-map.json&lt;/code&gt; and understand in fifteen minutes what would otherwise take a full day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; A migration tool uses &lt;code&gt;mappings.json&lt;/code&gt; (old names → new names) and &lt;code&gt;applied-steps.jsonl&lt;/code&gt; (what has already been done). &lt;code&gt;.jsonl&lt;/code&gt; means &lt;a href="https://jsonlines.org" rel="noopener noreferrer"&gt;JSON Lines&lt;/a&gt;: an append-only format with one JSON object per line. It is ideal for event logs: a new entry is just &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; appended to the file, you do not need to parse the whole thing, and one corrupted line does not invalidate the rest. If a migration stops halfway through, the restart reads &lt;code&gt;applied-steps.jsonl&lt;/code&gt; and continues from there. A customer-support pipeline can keep &lt;code&gt;session-context.json&lt;/code&gt; for each conversation so a new request reads prior context instead of starting from zero. A documentation generator can rebuild &lt;code&gt;module-graph.json&lt;/code&gt; only when the source files changed, speeding repeated runs up by an order of magnitude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; Every run is expensive. The pipeline cannot be stopped and resumed. Artifacts live in one agent's head and disappear as soon as the context is cleared.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 4: Knowledge as a separate layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Domain knowledge – platform patterns, known constraints, gotchas you only discover in real-world use – lives in &lt;strong&gt;separate files&lt;/strong&gt; that agents read but do &lt;strong&gt;not import&lt;/strong&gt; into the main code. Curated Markdown or YAML, not an embedding vector store where texts are pre-translated into numeric representations and retrieved by similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; Domain knowledge changes on a different rhythm than the code itself. A UI framework might update once a year; your code changes every week. If the knowledge is baked into the code, a framework upgrade becomes a migration. If it lives in a separate layer, you change one YAML file and everything else stays intact.&lt;/p&gt;

&lt;p&gt;A curated KB is also &lt;strong&gt;deterministic&lt;/strong&gt;. RAG chooses top-k documents by embedding similarity, and if an important paragraph misses the retrieval cut, the agent runs without it. A flat KB is either entirely present in context or it is not – and that is immediately visible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; On my ecommerce project, the local KB is 12 Markdown files (&lt;code&gt;admin&lt;/code&gt;, &lt;code&gt;classic-storefront&lt;/code&gt;, &lt;code&gt;modern-storefront&lt;/code&gt;), plus 9 YAML files in a global cross-stack KB (&lt;code&gt;tailwind-css&lt;/code&gt;, &lt;code&gt;alpine-js&lt;/code&gt;, &lt;code&gt;fastapi&lt;/code&gt;, &lt;code&gt;nextjs&lt;/code&gt;, and so on). When I ported the method to FastAPI + NextJS, &lt;code&gt;tailwind-css.yml&lt;/code&gt;, &lt;code&gt;alpine-js.yml&lt;/code&gt;, and &lt;code&gt;mailpit.yml&lt;/code&gt; &lt;strong&gt;just worked&lt;/strong&gt; on the new stack without modification. That is cross-project KB reuse: platform knowledge isolated into its own layer travels across projects.&lt;/p&gt;

&lt;p&gt;This is a rare kind of evidence in the current multi-agent literature – almost every public case study shows one system on one stack. Portability is what confirms that the split between code, KB, and agents is not cosmetic but architectural: the KB layer behaves like a self-contained component.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; A security-audit KB can cover CVE categories, OWASP patterns, and framework-specific gotchas (XSS in template engines, SQL injection in ORM bypasses). A customer-support KB can encode ticket types, escalation rules, and refund policies. A documentation generator KB can define documentation formats (JSDoc, RST, OpenAPI) and conventions for each language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; Knowledge gets smeared across prompts and code. Every agent ends up with its own copy of the rules, and those copies drift apart over time. When the platform changes, there is no single place to update.&lt;/p&gt;

&lt;h3&gt;
  
  
  When RAG is actually needed
&lt;/h3&gt;

&lt;p&gt;A flat KB stops working at one of three thresholds: around 200k tokens (too expensive to load in full), uncurated sources (code, tickets, logs), or &lt;strong&gt;history-driven retrieval&lt;/strong&gt; (when the agent benefits from the top-k most similar prior cases). At those thresholds, the KB evolves into RAG – but that is a change of tool, not of methodology. The contract, role separation, and persistent state still remain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 5: Closed-loop learning (knowledge compounding)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Every failure or error is turned into a &lt;strong&gt;structured artifact&lt;/strong&gt; – not "fixed a selector," but a completed template with diagnosis, hypotheses, action taken, verification, KB candidates, and out-of-scope items. Those artifacts then &lt;strong&gt;feed back into the KB&lt;/strong&gt;, so the next agent run already sees them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; Without a closed loop, every run rediscovers the same failures. With one, you get &lt;strong&gt;knowledge compounding&lt;/strong&gt;. The KB grows by the same logic as compound interest: the system becomes cheaper and more accurate on every pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; The healer writes per-run files under &lt;code&gt;heal-findings/&amp;lt;date&amp;gt;-&amp;lt;module&amp;gt;.md&lt;/code&gt; with six sections: A (diagnosis), B (hypotheses), C (action), D (verification), E (KB candidates), F (out-of-scope siblings). Section E is the promotion path into the KB. On my project, across eight runs, the KB grew by 67% (from 25 gotchas to 42), and &lt;code&gt;first_try_pass_rate&lt;/code&gt; rose from 14% (a new module) to 95% (the third run of the same module). That is the &lt;strong&gt;KB saturation curve&lt;/strong&gt;: same agents, same prompts, different feed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; In a code-review pipeline, each rejected agent comment becomes structured feedback ("false positive: the agent flagged X, but X is allowed in this module under line N of &lt;code&gt;CONTRACT.md&lt;/code&gt;") and is then promoted into the KB, so the next run sees the rule. In a migration tool, each failed migration becomes a markdown report with the root cause, then a rule in &lt;code&gt;migration-gotchas.yml&lt;/code&gt;, so the next migration does not repeat the mistake. In a security audit, each false positive becomes a rule in &lt;code&gt;audit-exceptions.yml&lt;/code&gt;, improving signal-to-noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; Agents do not learn between runs. The tenth run is as expensive as the first. Every failure requires manual diagnosis from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Principle 6: Additive instrumentation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is.&lt;/strong&gt; Metrics after each run are written to a file with an &lt;strong&gt;evolving schema&lt;/strong&gt;: new fields are added, old fields stay. v1 records remain valid after v2 fields are introduced. No breaking changes, no migrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it works.&lt;/strong&gt; Without quantitative feedback, "is it getting better?" is an unanswerable question. The feeling that "it seems faster now" is not data. With &lt;code&gt;metrics.jsonl&lt;/code&gt;, you can actually see the trendline.&lt;/p&gt;

&lt;p&gt;There is a second benefit: an additive schema lets you &lt;strong&gt;learn gradually which metrics matter&lt;/strong&gt;. I did not know in advance that &lt;code&gt;first_try_pass_rate&lt;/code&gt; would become a key metric; it only appeared on the third run, when I noticed that the number of healing iterations was a proxy for KB maturity. If the schema had been rigid, I would have needed a migration for older records. With an additive schema, I simply added the field and the old records stayed valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2E example.&lt;/strong&gt; &lt;code&gt;metrics.jsonl&lt;/code&gt; v1 (the first two runs) contains timestamp, target, stack, phases, kb_updates, and volume. v2 (from the third run onward) adds &lt;code&gt;first_try_pass_rate&lt;/code&gt;, &lt;code&gt;real_app_bugs_found[]&lt;/code&gt;, &lt;code&gt;test_churn&lt;/code&gt;, &lt;code&gt;kb_hits&lt;/code&gt;, &lt;code&gt;patterns_added&lt;/code&gt;, and &lt;code&gt;wall_clock_ms&lt;/code&gt;. The v1 records &lt;strong&gt;remained valid&lt;/strong&gt;, which lets me query across all eight runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-E2E application.&lt;/strong&gt; In an ML training pipeline, &lt;code&gt;experiments.jsonl&lt;/code&gt; can record hyperparameters, dataset version, and metrics. In a refactoring tool, &lt;code&gt;refactor-runs.jsonl&lt;/code&gt; can track the number of changed files, tests broken or restored, and review time. In customer support, &lt;code&gt;tickets.jsonl&lt;/code&gt; can store time-to-first-response, escalation depth, and resolution type.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks without it.&lt;/strong&gt; You cannot say objectively whether the system is improving. Debates about whether it got better or worse get resolved by intuition instead of data. When a new agent introduces an unexpected regression, you do not see it until complaints accumulate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What these principles give you together
&lt;/h2&gt;

&lt;p&gt;Each principle on its own is a useful pattern. Together they produce a system with specific properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy.&lt;/strong&gt; Contract + source reading + role separation cut down the space for improvisation. The agent works from ground truth – what is actually in the code – rather than guesses about how it might be organised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fewer hallucinations.&lt;/strong&gt; Persistent state provides stable context; the KB provides deterministic rules; the closed loop catches hallucinations and prevents them from recurring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility.&lt;/strong&gt; The same input artifact plus the same KB snapshot should produce the same output. Different results across runs are treated as a bug to investigate, not as "the nature of LLMs."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge accumulation.&lt;/strong&gt; Closed-loop learning plus additive metrics turn every run into data. After ten runs, you know more about your system than after a hundred one-off GPT calls driven by a single prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability.&lt;/strong&gt; The same six principles work for E2E testing, code review, refactoring, security audit, and migration tools. Only the KB and helpers are platform-specific; the architecture is not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What these principles do not give you
&lt;/h2&gt;

&lt;p&gt;I would not present this as a silver bullet. The principles solve a specific class of problems – accuracy and reproducibility in multi-agent systems – and do not solve others.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They do not make the agent &lt;strong&gt;smarter&lt;/strong&gt;. GPT does not turn into an expert just because you wrapped six layers around it. If the task requires creativity or deep understanding, the agent stays limited by the model.&lt;/li&gt;
&lt;li&gt;They do not work well for &lt;strong&gt;very short tasks&lt;/strong&gt;. The payback starts after three to five runs. If you only run the system once, the overhead is not justified.&lt;/li&gt;
&lt;li&gt;They &lt;strong&gt;do not replace review&lt;/strong&gt;. Closed-loop learning catches errors that the agent or the system itself already noticed. Errors nobody recognised as errors still stay in the code.&lt;/li&gt;
&lt;li&gt;They &lt;strong&gt;require discipline&lt;/strong&gt;. Six-section heal findings, an explicit contract, persistent state – all of that is work. If the team is not willing to maintain those artifacts, the method turns into dead weight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;I am now applying these six principles to a third independent domain – knowledge work (planning, learning, content), not software development. This is a deliberate attempt to eliminate the method's software bias: the first two validations were in E2E testing, and it is still unclear which principles are code-specific and which are truly domain-agnostic.&lt;/p&gt;

&lt;p&gt;If you are applying a similar architecture in another domain – or, conversely, if you found where it stops working – I would love to hear about it. I am especially interested in cases where a principle &lt;strong&gt;did not work&lt;/strong&gt;. Those cases show the boundaries of the method more clearly than successful implementations do.&lt;/p&gt;




&lt;p&gt;P.S. In parallel, I am writing a more technical deep-dive series on one concrete application of these principles – E2E testing: a month and a half of iteration, eight runs, a six-section healing protocol, and a breakdown of KB-saturation metrics. I am also preparing an open-source companion repo with a reference implementation of the six principles – framework, four agents, metrics schema, and skeleton KBs. Announcements for new articles and the repo launch go out on &lt;a href="https://linkedin.com/in/webmaster-ramos" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;; the articles themselves are published on the &lt;a href="https://webmaster-ramos.com/blog" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>architecture</category>
    </item>
    <item>
      <title>One Nav, Two Stacks: A Microfrontend Between Magento and Laravel Without Replatforming</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Thu, 30 Apr 2026 21:04:50 +0000</pubDate>
      <link>https://forem.com/webramos/one-nav-two-stacks-a-microfrontend-between-magento-and-laravel-without-replatforming-3on5</link>
      <guid>https://forem.com/webramos/one-nav-two-stacks-a-microfrontend-between-magento-and-laravel-without-replatforming-3on5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A working reference implementation on two production-grade stacks (Magento 2.4 + Laravel 11), with the host integration shape shown below and a server-rendered nav skeleton shipped on day one - not retrofitted after GSC panic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Mid-market ecommerce rarely lives on one stack. The industry answer - "replatform everything onto one stack" - is a $100-500k, 6-12 month project most of them cannot afford.&lt;/p&gt;

&lt;p&gt;I shipped a smaller answer on a real client stack: &lt;strong&gt;a 15-20 kb Preact microfrontend that mounts into both Magento 2.4 and Laravel 11 via a manifest&lt;/strong&gt;. This is not a Module Federation hello-world - it is two real host integrations, around 120 lines of PHP on Magento and around 90 lines on Laravel, with one &lt;code&gt;pnpm build&lt;/code&gt; and both sites updating in under a minute.&lt;/p&gt;

&lt;p&gt;The opinionated part: &lt;strong&gt;microfrontends failed as a product architecture but work as a repair strategy.&lt;/strong&gt; Greenfield product teams drown in coordination cost. Repair across inherited stacks is a different problem - and the pattern solves it cleanly, if you get the SEO contract right before shipping.&lt;/p&gt;

&lt;p&gt;The proof point is deliberately concrete: &lt;strong&gt;a before/after crawler diff on identical URLs and user-agents&lt;/strong&gt;. Not a modelled SEO score, not a Lighthouse proxy, but raw HTML facts - bytes, anchor counts, and whether the navigation exists in initial markup for non-rendering crawlers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody names out loud
&lt;/h2&gt;

&lt;p&gt;A mid-market ecommerce group with multiple brands, accrued over years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Magento 2.4 storefront&lt;/strong&gt; - catalogue, cart, checkout.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Laravel 11 marketing site&lt;/strong&gt; - brand story, awards programme, editorial.&lt;/li&gt;
&lt;li&gt;A handful of single-purpose SPAs on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each stack has its own header and footer. When marketing adds a top-level category, it ships in one stack in a week and in the other in three. When design changes the logo, it takes two sprints to roll out across everything.&lt;/p&gt;

&lt;p&gt;The cost is not engineering hours. The cost is that the brand is visibly inconsistent to customers, the teams know it, and every cross-team sync about the nav takes an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just consolidate on one stack" is not the answer
&lt;/h2&gt;

&lt;p&gt;The standard advice is a monorepo or a headless rewrite. Both are correct on paper and wrong in the field.&lt;/p&gt;

&lt;p&gt;Monorepos assume teams that want to converge. Inherited teams - Magento folks who have been on that stack for seven years, a Laravel team that came with an acquisition - do not want to converge. They have skill investment, release cadences, and on-call rotations built around their stack. A monorepo migration is a political project before it is an engineering one, and most mid-market companies cannot push one through.&lt;/p&gt;

&lt;p&gt;Headless replatforming is the same project in a different wrapper. Twelve-month runway, executive buy-in, and a new front end that has to outpace the rate at which the old ones break the business.&lt;/p&gt;

&lt;p&gt;A shared nav microfrontend does not compete with monorepo architecturally. It competes with &lt;strong&gt;doing nothing&lt;/strong&gt; - which is what the organisation was going to do for the next two years anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why repair is different from design
&lt;/h2&gt;

&lt;p&gt;Spotify publicly &lt;a href="https://blog.pragmaticengineer.com/spotify-squads/" rel="noopener noreferrer"&gt;rolled back its extreme squad autonomy&lt;/a&gt;. The failure mode is always the same: teams own product-level vertical slices, those slices compose into one surface, coordination cost explodes, UX inconsistency becomes a feature of the architecture.&lt;/p&gt;

&lt;p&gt;That is a real lesson. It does not mean no microfrontend is ever right.&lt;/p&gt;

&lt;p&gt;Repair is a different problem than design. You are not building the surface - the surface already exists, in two incompatible implementations. You are installing the narrowest possible shared layer that brings them back into visual alignment. The nav is exactly that narrow: no business logic, no routing, no data dependencies beyond a flat link tree.&lt;/p&gt;

&lt;p&gt;Everything the microfrontend critique flags - duplicate runtime, fragmented UX ownership, coordination overhead - either does not apply to a 15 kb shell (runtime is negligible) or applies &lt;em&gt;less&lt;/em&gt; than the status quo (UX is already fragmented; we are &lt;em&gt;reducing&lt;/em&gt; coordination by centralising the decision).&lt;/p&gt;

&lt;h2&gt;
  
  
  Shell architecture: 15-20 kb, one build, one file
&lt;/h2&gt;

&lt;p&gt;A Preact tree built with &lt;a href="https://vite.dev/guide/build.html#library-mode" rel="noopener noreferrer"&gt;Vite's library mode&lt;/a&gt; into one IIFE script and one CSS file, both with content-hashed filenames. A &lt;code&gt;manifest.json&lt;/code&gt; maps logical names to hashed URLs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preact over React&lt;/strong&gt; - ~10 kb gzipped vs ~45 kb. Non-negotiable at a 15-20 kb budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IIFE over ES modules&lt;/strong&gt; - works in Magento's RequireJS environment without extra config, and in any &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag on any stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cssCodeSplit: false&lt;/code&gt;&lt;/strong&gt; - one file, one request, no FOUC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind with a prefix&lt;/strong&gt; - scoped classes, no collision with host CSS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-hashed URLs via manifest&lt;/strong&gt; - immutable caching. Hosts read the manifest at render time and emit &lt;code&gt;&amp;lt;link href="/nav/shared-nav.abc123.css"&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pnpm build&lt;/code&gt; takes ~8 seconds. Hosts pick up new hashes within their cache TTL. One bugfix lands on both sites in ~1 minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host integration: Magento 2.4
&lt;/h2&gt;

&lt;p&gt;Around 120 lines of new PHP, three files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\Model\SharedNavManifest&lt;/code&gt;&lt;/strong&gt; (~85 lines) - HTTP-fetches the manifest with Magento's cache backend, falls back to a non-hashed &lt;code&gt;shared-nav.iife.js&lt;/code&gt; on fetch failure so the nav never disappears, only loses cache-busting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\ViewModel\SharedNavAssets&lt;/code&gt;&lt;/strong&gt; (~26 lines) - the ViewModel that phtml templates talk to. CSS goes through a ViewModel rather than static &lt;code&gt;&amp;lt;css&amp;gt;&lt;/code&gt; layout XML because the URL has a hash in it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\etc\frontend\di.xml&lt;/code&gt;&lt;/strong&gt; (~7 lines) - wires the manifest URL through deploy config.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two phtml partials - header and footer - emit the mount divs and asset tags. Included from &lt;code&gt;default.xml&lt;/code&gt;, so every page type inherits the shared nav.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host integration: Laravel 11
&lt;/h2&gt;

&lt;p&gt;Around 90 lines. Smaller because the service container carries more weight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;App\Services\SharedNavManifest&lt;/code&gt;&lt;/strong&gt; (~65 lines) - HTTP-fetches the manifest, caches via &lt;code&gt;Cache::remember('shared_nav.manifest', 60, ...)&lt;/code&gt;, logs and falls back to the unhashed bundle on fetch failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;config/services.php&lt;/code&gt;&lt;/strong&gt; - three lines exposing &lt;code&gt;services.shared_nav.manifest_url&lt;/code&gt; as env-driven config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two Blade layouts&lt;/strong&gt; - public and a secondary layout for older marketing pages - emit &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags from the manifest service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 60-second cache TTL controls how fast a &lt;code&gt;pnpm build&lt;/code&gt; propagates - aggressive enough for release cadence, conservative enough that manifest fetches are one request per minute per worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Representative code shape (abridged)
&lt;/h2&gt;

&lt;p&gt;The full production classes are client code, so I am not publishing them verbatim here. But the integration should not stay abstract either. This is the &lt;strong&gt;shape&lt;/strong&gt; of the two host adapters - abridged to show the contract rather than every guardrail and framework detail.&lt;/p&gt;

&lt;p&gt;A note on the manifest keys: Vite indexes &lt;code&gt;manifest.json&lt;/code&gt; by entry source path and asset name - &lt;code&gt;src/main.tsx&lt;/code&gt; and &lt;code&gt;style.css&lt;/code&gt; in our build - not by the output filename. The host lookups use those keys; the unhashed filenames (&lt;code&gt;shared-nav.iife.js&lt;/code&gt;, &lt;code&gt;shared-nav.css&lt;/code&gt;) are only used as fallbacks when the manifest fetch fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Magento 2.4 - manifest service shape:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SharedNavManifest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getJsUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'src/main.tsx'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fallbackJs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getCssUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'style.css'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fallbackCss&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// SSR fallback: fetched once from the shell, cached in Magento's cache&lt;/span&gt;
    &lt;span class="c1"&gt;// backend, and inlined into the mount div at render time.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getHeaderHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'header.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getFooterHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'footer.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1) read cached manifest&lt;/span&gt;
        &lt;span class="c1"&gt;// 2) on miss, fetch remote manifest URL&lt;/span&gt;
        &lt;span class="c1"&gt;// 3) cache parsed JSON&lt;/span&gt;
        &lt;span class="c1"&gt;// 4) on failure, log and fall back to unhashed asset names&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1) read cached snapshot for $key&lt;/span&gt;
        &lt;span class="c1"&gt;// 2) on miss, fetch rendered HTML from the shell (e.g. /nav/header.html)&lt;/span&gt;
        &lt;span class="c1"&gt;// 3) cache body with a short TTL&lt;/span&gt;
        &lt;span class="c1"&gt;// 4) on failure, return '' so the shell still hydrates later&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Laravel 11 - manifest service shape:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SharedNavManifest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'shared_nav.manifest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// GET config('services.shared_nav.manifest_url'), parse JSON.&lt;/span&gt;
            &lt;span class="c1"&gt;// On failure, log and return [] so fallback filenames kick in.&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;jsUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'src/main.tsx'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s1"&gt;'shared-nav.iife.js'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;cssUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'style.css'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s1"&gt;'shared-nav.css'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;headerHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'header.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;footerHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'footer.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"shared_nav.snapshot:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// fetch rendered HTML from the shell (e.g. /nav/header.html)&lt;/span&gt;
            &lt;span class="c1"&gt;// return '' on failure so the shell still hydrates later&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is why the line counts matter. The host code is small enough to be reviewable, boring enough to be supportable, and explicit enough that another PHP team can own it without learning a front-end platform first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The host &amp;lt;-&amp;gt; shell contract
&lt;/h2&gt;

&lt;p&gt;The two sides agree on a tiny surface:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Host provides:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"{{ $nav-&amp;gt;cssUrl() }}"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-header"&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"min-height: 80px;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{!! $nav-&amp;gt;headerHtml() !!}&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- page body --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-footer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{!! $nav-&amp;gt;footerHtml() !!}&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;defer&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"{{ $nav-&amp;gt;jsUrl() }}"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Shell provides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;nav-fallback.html&lt;/code&gt; emitted at build time, split into header and footer snippets the host inlines into the mount divs (the SSR fallback).&lt;/li&gt;
&lt;li&gt;Client-side mount into &lt;code&gt;#sa-header&lt;/code&gt; and &lt;code&gt;#sa-footer&lt;/code&gt; that replaces the SSR snapshot with the interactive tree (dropdowns, mobile menu, state).&lt;/li&gt;
&lt;li&gt;One CSS file, one JS file, no global pollution (IIFE scope).&lt;/li&gt;
&lt;li&gt;No knowledge of Magento or Laravel. No runtime config, no feature flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else - routing, authentication, cart state, checkout - stays on the host. The nav does not know the host exists. The host does not know the nav is Preact. That is the whole integration.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;min-height: 80px&lt;/code&gt; on the header mount is anti-CLS insurance - the slot reserves its space before hydration, so Core Web Vitals do not punish the deferred render.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SEO question, answered honestly
&lt;/h2&gt;

&lt;p&gt;This is the part every microfrontend post skips or hand-waves. I will not.&lt;/p&gt;

&lt;p&gt;Also, this section is intentionally based on &lt;strong&gt;observable crawler facts&lt;/strong&gt;, not modelled SEO metrics. I am not claiming a ranking uplift from a synthetic score. I am showing what a crawler can and cannot see in the initial HTML before and after the fallback ships.&lt;/p&gt;

&lt;p&gt;Without a fallback, initial HTML is two empty divs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-header"&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"min-height: 80px;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-footer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Googlebot renders JavaScript (eventually) and sees the nav - with a delay measured in days. But &lt;strong&gt;GPTBot, ClaudeBot, and PerplexityBot do not render JavaScript&lt;/strong&gt;. They see the empty divs. As far as AI search is concerned, the site has no nav.&lt;/p&gt;

&lt;p&gt;I measured this before shipping the SSR fallback. Three pages, five user-agents, identical &lt;code&gt;curl&lt;/code&gt; invocations. Same URLs, same crawl method, same parsing rule - only the fallback changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before SSR fallback:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Homepage&lt;/th&gt;
&lt;th&gt;/about&lt;/th&gt;
&lt;th&gt;/portfolio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bytes&lt;/td&gt;
&lt;td&gt;35,050&lt;/td&gt;
&lt;td&gt;35,050&lt;/td&gt;
&lt;td&gt;97,521&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; total&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from nav&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Twelve anchors per page, none of them structural. Every page - no matter how deep - exposed the same twelve inline body links to a non-rendering crawler. Sitemap.xml covered URL discovery, but not the four things nav does beyond discovery:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Link equity&lt;/strong&gt; - a multi-level nav is hundreds of internal links per page pointing at categories. Without it, category pages lose authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crawl budget&lt;/strong&gt; - Googlebot prioritises pages by incoming-link density. Sitemap-only pages get crawled less often.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic hierarchy&lt;/strong&gt; - sitemap is flat. Nav signals semantic structure ("Shop -&amp;gt; Men -&amp;gt; Shoes").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI assistant context&lt;/strong&gt; - ChatGPT and Perplexity build mental models from HTML, often ignoring sitemaps. Without nav in HTML, AI knows your URLs but not your structure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three-level mitigation ladder:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;noscript&amp;gt;&lt;/code&gt; fallback&lt;/strong&gt; with critical links inside the mount div (hours of work).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSR skeleton&lt;/strong&gt; - Vite emits a &lt;code&gt;nav-fallback.html&lt;/code&gt; at build time; hosts inline it into the mount divs before hydration replaces it (a day or two).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full SSR service&lt;/strong&gt; - a Node process renders each nav request server-side (a week, plus a new production dependency).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Level 2 is the sweet spot for an ecommerce group this size. We shipped it before the first production release. Same &lt;code&gt;curl&lt;/code&gt; invocations, four days later:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After SSR fallback:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Homepage&lt;/th&gt;
&lt;th&gt;/about&lt;/th&gt;
&lt;th&gt;/portfolio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bytes&lt;/td&gt;
&lt;td&gt;98,881&lt;/td&gt;
&lt;td&gt;98,881&lt;/td&gt;
&lt;td&gt;161,348&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; total&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from nav (&lt;code&gt;#sa-header&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from footer (&lt;code&gt;#sa-footer&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All five user-agents received byte-identical HTML (the only per-request variance is the Laravel CSRF meta token). The nav and footer tree are in initial HTML - 100 additional anchors per page, constant across every page, visible to every crawler that can parse HTML.&lt;/p&gt;

&lt;p&gt;That matters methodologically. A crawler can disagree with my interpretation of the SEO impact, but it cannot disagree with &lt;code&gt;35,050 -&amp;gt; 98,881&lt;/code&gt; bytes or &lt;code&gt;12 -&amp;gt; 112&lt;/code&gt; anchors under the same crawl conditions. This is a reusable audit method, not a one-off anecdote.&lt;/p&gt;

&lt;p&gt;The gap closed on release day. No retroactive GSC panic, no "we measured a drop and here's how we fixed it" narrative. The honest framing is "we knew the risk, we closed it before shipping".&lt;/p&gt;

&lt;h2&gt;
  
  
  What this article proves today - and what it does not yet
&lt;/h2&gt;

&lt;p&gt;This article proves three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;integration pattern is real&lt;/strong&gt; on two production-grade PHP stacks.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;SEO risk is real&lt;/strong&gt; if the shell ships with empty mount points only.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Level 2 fallback&lt;/strong&gt; closes that crawler-visibility gap on day one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What it does &lt;strong&gt;not&lt;/strong&gt; prove yet is a 90-day business outcome story. I do not have a "three months later, here are CrUX and GSC deltas" chart in this draft, because that would require waiting for the post-release window to mature. I would rather publish the implementation pattern and the crawler evidence honestly than pretend I have impact numbers I do not have yet.&lt;/p&gt;

&lt;p&gt;That makes this a build-and-ship case study, not a finished growth narrative. When post-launch search-console and field-performance data are mature enough to be worth showing, they belong in a follow-up article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this pattern does not solve
&lt;/h2&gt;

&lt;p&gt;Not overselling: the shared nav is the &lt;strong&gt;minimum viable shared surface&lt;/strong&gt; - that is its strength and its ceiling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary page content&lt;/strong&gt; still diverges. Magento renders products; Laravel renders marketing copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared checkout&lt;/strong&gt; - not solved. Checkout lives in Magento; marketing links into it via cookies on a common parent domain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared authentication&lt;/strong&gt; - not solved. Cookies, redirects, OAuth handshakes - all host-specific.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared search&lt;/strong&gt; - could be built on top of the shell, but we did not. Search UX is coupled to Magento-native catalogue data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A shared nav is not a distributed-front-end strategy. It is a band-aid across a healed fracture. If you need a distributed front end, you need a different architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this pattern fits
&lt;/h2&gt;

&lt;p&gt;Short checklist. If you check fewer than three boxes, do something else.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &lt;strong&gt;two or more existing stacks&lt;/strong&gt; with established teams you cannot realistically move.&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;no budget or appetite&lt;/strong&gt; for a full front-end unification right now.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;primary pain is UX inconsistency&lt;/strong&gt;, not performance or architectural debt.&lt;/li&gt;
&lt;li&gt;Nobody on the executive side is willing to own a "unified portal" programme.&lt;/li&gt;
&lt;li&gt;You need to be &lt;strong&gt;AI-agent ready&lt;/strong&gt; - which means the nav must be in initial HTML, not only after JS runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If all five apply, the pattern pays for itself in weeks, not quarters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The same shell is about to land on two more stacks in the same group - a greenfield Magento storefront rewrite and a full Laravel marketing rewrite. Both will consume the existing &lt;code&gt;manifest.json&lt;/code&gt; unchanged. Zero additional shell work, the same integration footprint per host. That is the portability proof.&lt;/p&gt;

&lt;p&gt;If the pattern looks like it might fit your stack, the interesting conversation is not "how do I build a shell" - Vite's library mode docs will get you there in an afternoon. The interesting conversation is the SEO contract and shipping Level 2 on day one instead of retrofitting it after Google Search Console punishes you.&lt;/p&gt;

&lt;p&gt;For a mid-market team, that is usually the real decision framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1:&lt;/strong&gt; &lt;code&gt;&amp;lt;noscript&amp;gt;&lt;/code&gt; links when the risk window is small and the nav is shallow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2:&lt;/strong&gt; build-time SSR fallback when you need full crawler-visible structure without adding a Node service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3:&lt;/strong&gt; full SSR service when the nav is dynamic enough that static fallback HTML becomes a maintenance problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where outside perspective usually helps, and where I spend most of my consulting time.&lt;/p&gt;

</description>
      <category>microfrontend</category>
      <category>ecommerce</category>
      <category>magneto</category>
      <category>laravel</category>
    </item>
    <item>
      <title>Everyone Says MCP Beats CLI. The AWS Benchmark Disagrees.</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Tue, 21 Apr 2026 19:12:16 +0000</pubDate>
      <link>https://forem.com/webramos/mcp-vs-cli-for-ai-agents-a-real-aws-benchmark-and-why-the-popular-narrative-asks-the-wrong-4h8</link>
      <guid>https://forem.com/webramos/mcp-vs-cli-for-ai-agents-a-real-aws-benchmark-and-why-the-popular-narrative-asks-the-wrong-4h8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Full code, aggregated numbers (n=10 across 5 tasks and 5 transports), and a curated selection of 8 hand-picked runs live in the &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;&lt;code&gt;mcp-vs-cli-aws-benchmark&lt;/code&gt;&lt;/a&gt; repo. This article is a dense version of &lt;code&gt;docs/findings.md&lt;/code&gt; from the same repo, rewritten for a reader who doesn't have an hour to study the test harness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question in the title is wrong.&lt;/strong&gt; "MCP or CLI?" assumes they have the same use case and one of them is objectively better. In reality it's a trade-off between two currencies: &lt;strong&gt;engineering time&lt;/strong&gt; vs. &lt;strong&gt;input tokens per run&lt;/strong&gt;, and you need both numbers to decide.&lt;/p&gt;

&lt;p&gt;I compared &lt;strong&gt;raw aws CLI&lt;/strong&gt; against the official &lt;strong&gt;awslabs.aws-api-mcp-server&lt;/strong&gt; on five read-only tasks against a real production AWS account. The model is Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (no Claude Code and no claude-agent-sdk, to avoid poisoning the context). Ground truth is collected via boto3, verification is automatic. n=10 per &lt;code&gt;(task, transport)&lt;/code&gt; cell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; a well-designed CLI tool beats awslabs MCP by 43-60% on input tokens on &lt;strong&gt;every one&lt;/strong&gt; of the five tasks, at equal success rate. But it takes half a day of engineering work per service.&lt;/p&gt;

&lt;p&gt;If you run 200 agent invocations a day - put MCP in and forget about it. If you run 200 thousand - sit down and write your own tool wrapper following the checklist at the end of the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this whole debate comes from
&lt;/h2&gt;

&lt;p&gt;Since February 2026, dev Twitter and dev.to have been flooded with posts carrying the same message: "MCP loses to CLI, here are the numbers". Titles like &lt;a href="https://jannikreinhard.com/2026/02/22/why-cli-tools-are-beating-mcp-for-ai-agents/" rel="noopener noreferrer"&gt;«Why CLI Tools Are Beating MCP for AI Agents»&lt;/a&gt;, &lt;a href="https://www.scalekit.com/blog/mcp-vs-cli-use" rel="noopener noreferrer"&gt;«MCP vs CLI: Benchmarking AI Agent Cost &amp;amp; Reliability»&lt;/a&gt;, &lt;a href="https://oneuptime.com/blog/post/2026-02-03-cli-is-the-new-mcp/view" rel="noopener noreferrer"&gt;«Why CLI is the New MCP for AI Agents»&lt;/a&gt;. They all cite the same Scalekit benchmark, which reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP is 10-32x more expensive than CLI&lt;/strong&gt; on input tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; CLI 100%, MCP 72% (the cause of all 28% of failures is TCP timeouts connecting to the GitHub Copilot MCP server).&lt;/li&gt;
&lt;li&gt;Example: a simple "what language is this repo?" query - CLI 1,365 tokens, MCP 44,026 tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors' explanation: &lt;strong&gt;schema dump&lt;/strong&gt;. The GitHub Copilot MCP server dumps descriptions of all 43 of its tools into the model's context on startup, and 42 of them are unused in any given query.&lt;/p&gt;

&lt;p&gt;The problem is that this benchmark is &lt;strong&gt;n=1 on a single service&lt;/strong&gt;, with one kind of MCP server ("fat", per-resource). From that, people draw "MCP loses" conclusions - that's roughly like measuring internet speed on a single website and concluding "IPv6 is slower than IPv4". There is a useful signal, but no grounds for generalisation.&lt;/p&gt;

&lt;p&gt;I decided to reproduce the comparison on a different service (AWS), with a larger n, and in a setting where the MCP server is &lt;strong&gt;not&lt;/strong&gt; designed as a "fat" directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS has already done its homework
&lt;/h2&gt;

&lt;p&gt;The first thing I found when I went to look at &lt;code&gt;awslabs/mcp&lt;/code&gt; was &lt;strong&gt;not&lt;/strong&gt; what I had expected. Following the Scalekit GitHub Copilot MCP analogy, I was expecting to see dozens of per-resource MCP servers: &lt;code&gt;awslabs/ec2&lt;/code&gt;, &lt;code&gt;awslabs/s3&lt;/code&gt;, &lt;code&gt;awslabs/iam&lt;/code&gt;, each with their own 20-30 tools (&lt;code&gt;describe_instances&lt;/code&gt;, &lt;code&gt;run_instances&lt;/code&gt;, &lt;code&gt;terminate_instances&lt;/code&gt;, &lt;code&gt;modify_instance_attribute&lt;/code&gt;...). That would have been a clean schema dump in the context of a single task.&lt;/p&gt;

&lt;p&gt;In reality, the main AWS MCP server - &lt;a href="https://awslabs.github.io/mcp/servers/aws-api-mcp-server" rel="noopener noreferrer"&gt;&lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt;&lt;/a&gt; - is built very differently. It exposes &lt;strong&gt;three&lt;/strong&gt; tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;call_aws&lt;/code&gt; - takes an aws CLI command string (or an array of up to 20 commands for batch mode) and runs it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;suggest_aws_commands&lt;/code&gt; - natural language to a list of candidate aws CLI commands. The authors explicitly mark it as &lt;code&gt;FALLBACK&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_execution_plan&lt;/code&gt; - multi-step plans, experimental, gated behind an environment variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default &lt;strong&gt;two&lt;/strong&gt; are published (without &lt;code&gt;get_execution_plan&lt;/code&gt;). And there is a built-in &lt;code&gt;READ_OPERATIONS_ONLY=true&lt;/code&gt; switch - you can tell the server "describe/list/get only" and it will cut everything else off at its own level.&lt;/p&gt;

&lt;p&gt;This is an important engineering choice: AWS itself acknowledged the schema-dump problem and &lt;strong&gt;opted out&lt;/strong&gt; of a fat MCP server in favour of a wrapper over the CLI living under the MCP protocol. Comparing such a wrapper against "raw CLI" is a far more honest experiment than repeating Scalekit on the GitHub MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;The details (runner code, ground-truth script, whitelist) are in the &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. Here - compressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 read-only tasks&lt;/strong&gt; against a production-like AWS account:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;What it tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;simple&lt;/td&gt;
&lt;td&gt;List running EC2 in &lt;code&gt;us-west-2&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;One API call + filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_policy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;edge&lt;/td&gt;
&lt;td&gt;Bucket policy for a single bucket&lt;/td&gt;
&lt;td&gt;Handling of an optional resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_regions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chained&lt;/td&gt;
&lt;td&gt;All S3 buckets + region of each&lt;/td&gt;
&lt;td&gt;List + per-item lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iam_admin_roles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;filter&lt;/td&gt;
&lt;td&gt;IAM roles with &lt;code&gt;AdministratorAccess&lt;/code&gt; policy&lt;/td&gt;
&lt;td&gt;Pagination + content filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chained&lt;/td&gt;
&lt;td&gt;CloudWatch CPU over 60 min for running EC2&lt;/td&gt;
&lt;td&gt;Composition + time windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The correct answer for &lt;code&gt;iam_admin_roles&lt;/code&gt; in my account is an &lt;strong&gt;empty list&lt;/strong&gt;. A separate honesty test: will the model make up role names.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (~150 lines). Why not &lt;code&gt;claude-agent-sdk&lt;/code&gt; or Claude Code - see the "methodology notes" section below, this choice cost me a day and a half.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transports:&lt;/strong&gt; CLI - &lt;code&gt;subprocess.run(['aws', ...])&lt;/code&gt; behind a whitelist. MCP - the &lt;code&gt;mcp&lt;/code&gt; python lib, which boots &lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt; via &lt;code&gt;uvx&lt;/code&gt; stdio and performs a real MCP handshake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; a dedicated IAM user &lt;code&gt;mcp-benchmark&lt;/code&gt; with &lt;code&gt;ReadOnlyAccess&lt;/code&gt; + a local command whitelist. Two lines of defense - in case the model tries to break something.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; a boto3 script captures ground truth before the benchmark, a verifier compares the model's JSON response automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n=10 per cell&lt;/strong&gt;, median on the main metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  First attempt: CLI loses everywhere
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spoiler for anyone who won't read to the end:&lt;/strong&gt; everything you are about to see - CLI failing two tasks, 60% success rate, a naive strategy with 36 tool calls - turned into the opposite result three days later: a CLI that beats MCP by 43-60% on tokens. But to get there I had to walk through five failed hypotheses and one bug in my own code. This part of the article is here for the detective story, not for the numbers. The numbers are at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the pilot run with three transports (plain &lt;code&gt;cli&lt;/code&gt;, &lt;code&gt;cli&lt;/code&gt; with an enriched description, &lt;code&gt;mcp&lt;/code&gt;) the picture looked like a confirmation of the Scalekit narrative. On &lt;code&gt;iam_admin_roles&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cli&lt;/code&gt; plain: &lt;strong&gt;36 tool calls&lt;/strong&gt;, 20k input tokens, 68 seconds. Strategy: &lt;code&gt;list-roles&lt;/code&gt; + &lt;code&gt;list-attached-role-policies&lt;/code&gt; for each of the 34 roles in the account.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcp&lt;/code&gt;: &lt;strong&gt;1 tool call&lt;/strong&gt;, 5k input tokens, 4 seconds. One command: &lt;code&gt;iam list-entities-for-policy --policy-arn ... --entity-filter Role&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The same model on the same prompt made a different command choice.&lt;/strong&gt; On MCP - perfect; on CLI - the naive, linear-complexity path.&lt;/p&gt;

&lt;p&gt;Even scarier was &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;. CLI failed in &lt;strong&gt;60% of cases&lt;/strong&gt;: it hit the max_turns limit trying to guess the correct timestamp for CloudWatch &lt;code&gt;get-metric-statistics&lt;/code&gt;. I looked at the logs and saw commands with &lt;code&gt;--start-time 2025-05-16T...&lt;/code&gt;, &lt;code&gt;--start-time 2025-07-14T...&lt;/code&gt; - the model clearly had no idea what year it was.&lt;/p&gt;

&lt;p&gt;MCP in the same conditions made 3 calls, always with correct 2026 timestamps, 100% success.&lt;/p&gt;

&lt;p&gt;This looked like a ready-made "CLI loses" article. Good thing I didn't stop there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five hypotheses, five ablation experiments
&lt;/h2&gt;

&lt;p&gt;Before publishing results like that, I wanted to understand &lt;strong&gt;why&lt;/strong&gt;. "MCP is smarter" is not an explanation, it's a description. Sonnet 4.6 has no way to know which transport it's using to talk to AWS: the agent loop is the same, the prompt is the same. Something &lt;strong&gt;structural&lt;/strong&gt; in the MCP transport was making the model behave differently.&lt;/p&gt;

&lt;p&gt;What follows is five controlled experiments. Each time I took the CLI transport and added &lt;strong&gt;one&lt;/strong&gt; trait from the MCP world to test an isolated hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 1: tool description length and structure.&lt;/strong&gt; awslabs's &lt;code&gt;call_aws&lt;/code&gt; description is ~3000 characters with examples and best practices. My &lt;code&gt;aws_cli&lt;/code&gt; was ~500. I wrote &lt;code&gt;tools_cli_rich.py&lt;/code&gt; with a description of the same length, including a direct hint: "For 'find roles attached to policy X', use &lt;code&gt;iam list-entities-for-policy --policy-arn ... --entity-filter Role&lt;/code&gt; instead of listing every role and inspecting each one."&lt;/p&gt;

&lt;p&gt;Result on &lt;code&gt;iam_admin_roles&lt;/code&gt;: &lt;strong&gt;37 tool calls&lt;/strong&gt;, the same naive strategy. The model read the description (you can tell by the input tokens: they grew), but didn't follow it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 2: the presence of a second "hinter" tool.&lt;/strong&gt; Besides &lt;code&gt;call_aws&lt;/code&gt;, awslabs exposes &lt;code&gt;suggest_aws_commands&lt;/code&gt;, whose description includes an example: "List all IAM users who have AdministratorAccess policy". Maybe the mere presence of this description in context works as "scaffolding", even if the model never actually calls &lt;code&gt;suggest_aws_commands&lt;/code&gt; itself?&lt;/p&gt;

&lt;p&gt;I made &lt;code&gt;tools_cli_with_fake_suggest.py&lt;/code&gt;: a second tool that returns an error when called, with a &lt;strong&gt;verbatim&lt;/strong&gt; copy of awslabs's &lt;code&gt;suggest_aws_commands&lt;/code&gt; description. Result: &lt;strong&gt;35 tool calls&lt;/strong&gt;, the same naive strategy. The model did &lt;strong&gt;not&lt;/strong&gt; call the fake &lt;code&gt;suggest_aws_commands&lt;/code&gt; (because the description says in black and white "use only when uncertain") - it just read it. And that didn't help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 3: tool and parameter names.&lt;/strong&gt; awslabs's tool is called &lt;code&gt;call_aws&lt;/code&gt; with a &lt;code&gt;cli_command&lt;/code&gt; parameter. Mine was &lt;code&gt;aws_cli&lt;/code&gt; with a &lt;code&gt;command&lt;/code&gt; parameter. Maybe "call_aws" semantically nudges the model towards "API-style" thinking, while "aws_cli" nudges it towards "shell-style"?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tools_cli_renamed.py&lt;/code&gt;: renamed everything, even added a &lt;code&gt;max_results&lt;/code&gt; parameter for full parity. Result: &lt;strong&gt;39 tool calls&lt;/strong&gt;, naive strategy. This hypothesis was a miss too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 4: MCP capabilities / prompts / resources.&lt;/strong&gt; Maybe the MCP server passes something beyond the tool list to the model? The protocol has three other channels: &lt;code&gt;prompts&lt;/code&gt; (system prompts from the server), &lt;code&gt;resources&lt;/code&gt; (documents for RAG) and &lt;code&gt;instructions&lt;/code&gt; (system-level instructions).&lt;/p&gt;

&lt;p&gt;I wrote a diagnostic script and asked the server directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capabilities: experimental={} logging=LoggingCapability()
              prompts=PromptsCapability(listChanged=False)
              resources=ResourcesCapability(subscribe=False, listChanged=False)
              tools=ToolsCapability(listChanged=True)
instructions: None
prompts: 0
resources: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server &lt;strong&gt;declares&lt;/strong&gt; the capability but publishes nothing. &lt;code&gt;instructions&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. It really does send the model only the tool list and nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 5: runtime context in the system prompt.&lt;/strong&gt; This was the most productive one. I made a &lt;code&gt;cli-ctx&lt;/code&gt; transport - the same &lt;code&gt;aws_cli&lt;/code&gt;, but with four extra lines in the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime context (provided by the runner, not by the tool):
- Current UTC time: 2026-04-08T23:06:57Z
- Default AWS region: us-west-2
- This account is real and live; commands return real data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four lines. 118 tokens.&lt;/p&gt;

&lt;p&gt;And here is what happened on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;, n=3:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Wall&lt;/th&gt;
&lt;th&gt;Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cli&lt;/code&gt; plain&lt;/td&gt;
&lt;td&gt;13-15&lt;/td&gt;
&lt;td&gt;26-55k&lt;/td&gt;
&lt;td&gt;50-70s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;13.4k&lt;/td&gt;
&lt;td&gt;14s&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cli-ctx&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.1k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cli-ctx&lt;/code&gt; didn't just catch up with MCP - it &lt;strong&gt;beat&lt;/strong&gt; it. Three times fewer input tokens and faster wall-clock.&lt;/p&gt;

&lt;p&gt;Where did the effect come from? I went into the MCP server logs and looked at &lt;strong&gt;what exactly&lt;/strong&gt; it returns to the model in each tool result. And here's what was in the very first &lt;code&gt;call_aws&lt;/code&gt; response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"ResponseMetadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"RequestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"HTTPStatusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"HTTPHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wed, 08 Apr 2026 00:15:21 GMT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The awslabs MCP server passes the &lt;strong&gt;full HTTP headers&lt;/strong&gt; from the AWS API back, including &lt;code&gt;date&lt;/code&gt;. Raw aws CLI v2 returns only the response body without headers. The model on MCP knows, from the very first tool call, what today's date is; the model on raw CLI does not, because its training cutoff is somewhere in 2025, and it honestly assumes it's still 2025.&lt;/p&gt;

&lt;p&gt;The entire gap on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt; was explained by an &lt;strong&gt;HTTP Date header leaking through the MCP abstraction&lt;/strong&gt;. Four lines in the system prompt reproduce the effect for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That was the moment I rethought all the previous results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three mechanisms I found and closed
&lt;/h2&gt;

&lt;p&gt;The first mechanism - &lt;strong&gt;effect A, HTTP metadata&lt;/strong&gt; - is already covered in the previous section. Runtime context in the system prompt closed the failures on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;, and that's the most important of the three effects. But on &lt;code&gt;iam_admin_roles&lt;/code&gt; (36 vs 1) and &lt;code&gt;s3_bucket_regions&lt;/code&gt; (16 vs 2) the gap remained. So there had to be at least one more thing going on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Effect B: batch calling
&lt;/h3&gt;

&lt;p&gt;On &lt;code&gt;s3_bucket_regions&lt;/code&gt; in the MCP run I looked at the second tool call and saw this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;call_aws&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cli_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws s3api get-bucket-location --bucket bucket-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws s3api get-bucket-location --bucket bucket-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An array of 15 commands. In a single call. I went to the &lt;code&gt;call_aws&lt;/code&gt; description and found this section:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Batch Running:&lt;/strong&gt; The tool can also run multiple independent commands at the same time. Call this tool with multiple CLI commands whenever possible. You can call at most 20 CLI commands in batch mode.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So &lt;code&gt;cli_command&lt;/code&gt; accepts &lt;code&gt;anyOf string | array of strings&lt;/code&gt;, and the server executes them in parallel inside its own process, returning the results together. The model reads this and uses it.&lt;/p&gt;

&lt;p&gt;My original &lt;code&gt;aws_cli&lt;/code&gt; accepted only a string. I wrote &lt;code&gt;tools_cli_v2.py&lt;/code&gt;: added batch support to the input schema, rewrote the description following the same structure as awslabs's, and added parallel execution via &lt;code&gt;asyncio.gather&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On &lt;code&gt;s3_bucket_regions&lt;/code&gt; this instantly cut the tool call count from 16 to 2 - exactly like MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Effect C: "smart" command choice - turned out to be a benchmark bug
&lt;/h3&gt;

&lt;p&gt;But on &lt;code&gt;iam_admin_roles&lt;/code&gt; the effect remained. The model on &lt;code&gt;cli-v2&lt;/code&gt; kept doing 36 calls. I was convinced this was some subtle feature of how the model models command selection, and I was preparing an "unexplained mystery" section for the article.&lt;/p&gt;

&lt;p&gt;Then I ran &lt;code&gt;cli-v2 iam_admin_roles&lt;/code&gt; again and carefully looked at the raw trace instead of the aggregated numbers. Here is the first tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. aws_cli (0ms, error=True)
   aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
     --entity-filter Role --output json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Execution time 0ms. error=True.&lt;/strong&gt; The model &lt;strong&gt;immediately&lt;/strong&gt; tried the right command - exactly the same one MCP uses. And got an error. Not from AWS - the error never reached AWS. The error came from my own &lt;code&gt;safety.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-roles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-attached-role-policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# list-entities-for-policy WAS NOT IN THIS LIST
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote the whitelist based on how &lt;em&gt;I&lt;/em&gt; pictured this task being solved. And I put in exactly the commands needed for the naive path. The model on CLI tried the optimal command, got rejected, fell back to the naive path and conscientiously walked through all 36 roles.&lt;/p&gt;

&lt;p&gt;The awslabs MCP server has its own allowlist - significantly broader. And &lt;code&gt;list-entities-for-policy&lt;/code&gt; is allowed there.&lt;/p&gt;

&lt;p&gt;This was a &lt;strong&gt;benchmark bug&lt;/strong&gt;, not a property of MCP. I added one line to the whitelist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-roles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-attached-role-policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-entities-for-policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;- this one
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And re-ran &lt;code&gt;cli-v2 iam_admin_roles&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Wall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cli&lt;/code&gt; plain&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;20k&lt;/td&gt;
&lt;td&gt;68s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5k&lt;/td&gt;
&lt;td&gt;4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cli-v2&lt;/code&gt; (whitelist fixed)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Exactly one tool call. And at the same time fewer input tokens than MCP, because we have one tool description of ~3000 characters and MCP has two descriptions totalling ~5800 characters.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;methodologically&lt;/strong&gt; important point for anyone who wants to reproduce a benchmark like this: &lt;strong&gt;your own whitelist can silently determine the outcome&lt;/strong&gt;. If the allowlist only covers the commands needed for the naive strategy, you aren't measuring the transport, you're measuring your whitelist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final table: cli-full vs mcp at n=10
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cli-full&lt;/code&gt; is the union of all three improvements in a single transport:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batch input&lt;/strong&gt; (cli-v2 tool spec).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich tool description&lt;/strong&gt; with batch examples and best practices (cli-v2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime context&lt;/strong&gt; in the system prompt (cli-ctx).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad whitelist&lt;/strong&gt; with &lt;code&gt;list-entities-for-policy&lt;/code&gt; and everything else needed for the optimal path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At n=10 per cell, median:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;cli-full input&lt;/th&gt;
&lt;th&gt;mcp input&lt;/th&gt;
&lt;th&gt;Δ input&lt;/th&gt;
&lt;th&gt;cli-full calls&lt;/th&gt;
&lt;th&gt;mcp calls&lt;/th&gt;
&lt;th&gt;cli-full ok%&lt;/th&gt;
&lt;th&gt;mcp ok%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3,053&lt;/td&gt;
&lt;td&gt;5,368&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;90%*&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_policy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,975&lt;/td&gt;
&lt;td&gt;5,425&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-45%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_regions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,801&lt;/td&gt;
&lt;td&gt;14,317&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iam_admin_roles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,934&lt;/td&gt;
&lt;td&gt;5,213&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-44%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,345&lt;/td&gt;
&lt;td&gt;9,461&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-44%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* the single failure on &lt;code&gt;ec2_running cli-full #9&lt;/code&gt; was an &lt;code&gt;HTTP 529 Overloaded&lt;/code&gt; from the Anthropic API. That's infrastructure noise, not a transport problem. I deliberately did not retry failed runs to avoid masking real failures - and this lone 529 made it into the stats as 90%. MCP could just as easily have caught the same 529; it just got lucky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cli-full beats MCP on input tokens on every one of the five tasks, 43-60%. Success rate - parity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On wall clock MCP wins on 4 of 5 tasks. Reason: wall clock is dominated by AWS API call time, not by model turn time. Tokens don't translate directly into seconds. The only wall-clock win for CLI is &lt;code&gt;s3_bucket_regions&lt;/code&gt;, where MCP spends time marshalling a 15-item batch through its protocol layer, and my &lt;code&gt;asyncio.gather&lt;/code&gt; does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The right question: how much is your engineering time worth
&lt;/h2&gt;

&lt;p&gt;This is where the popular "CLI is better than MCP" narrative breaks.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;cli-full&lt;/code&gt; is a few hundred lines of code and &lt;strong&gt;half a day of debugging&lt;/strong&gt;. A tool wrapper with a whitelist, a rich description copied from awslabs best practices, batch support via &lt;code&gt;asyncio.gather&lt;/code&gt;, a system prompt with runtime context, verify + ground truth for a specific task. And that's only for AWS. For GCP, for Linear, for Notion - everything from scratch.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt; is &lt;strong&gt;one command&lt;/strong&gt; (&lt;code&gt;uvx awslabs.aws-api-mcp-server@latest&lt;/code&gt;) and one environment variable. Works with every AWS service, not with five tasks. Best practices are already baked in by the authors (who &lt;strong&gt;know&lt;/strong&gt; AWS better than I do). Updates come with &lt;code&gt;@latest&lt;/code&gt;. Read-only mode is an environment variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP pays with service knowledge, CLI pays with engineering labour.&lt;/strong&gt; It's a question of which currency you pay for your agent in: person-hours or tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to choose MCP
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High velocity, low QPS.&lt;/strong&gt; New project, the agent has to work tomorrow. MCP installs in 30 seconds and covers everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad surface.&lt;/strong&gt; The agent pokes at EC2, S3, IAM, Lambda, CloudWatch, RDS, ECS. Writing a CLI wrapper for each service is an unrealistic budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polyglot environment.&lt;/strong&gt; AWS today, GCP tomorrow, Notion the day after. Per-service CLI wrappers don't scale; one MCP server per service does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're not an expert on the service.&lt;/strong&gt; You don't know by heart that &lt;code&gt;list-entities-for-policy&lt;/code&gt; is more efficient than &lt;code&gt;list-attached-role-policies&lt;/code&gt; in a loop. The awslabs authors do. You reuse their knowledge by paying a few extra tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low QPS.&lt;/strong&gt; A few hundred agent invocations a day. Saving 8k tokens per request is a few dollars a month. Engineering time costs more.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  When to choose a purpose-built CLI
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-QPS production.&lt;/strong&gt; A million calls a day x 8k extra tokens x $3/M input = $24/day on top. That's $8k a year, which is enough to hire a contractor to write the tool wrapper once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Narrow, stable task set.&lt;/strong&gt; The agent does five specific things. A narrow whitelist and a short description will be more compact than any universal MCP server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full control over the context.&lt;/strong&gt; Every token in the system prompt and tool description is yours. No ~3KB of hidden awslabs guidance, no update surprises, no external dependency that might suddenly change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance / audit.&lt;/strong&gt; Every tool call is visible, every input is validated by your code, every failure mode is known. MCP adds a protocol layer between you and the AWS API that some audits won't accept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already have the knowledge.&lt;/strong&gt; If you know how to work with the service efficiently, you can bake that knowledge into the tool description once and reuse it forever.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Checklist: how to build a cli-full equivalent
&lt;/h2&gt;

&lt;p&gt;If after all this you've decided your use case is CLI, here are six items that turn "raw &lt;code&gt;subprocess.run&lt;/code&gt;" into something that beats awslabs MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Accept batch input.&lt;/strong&gt; Tool input schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"cli_command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"anyOf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the model passes an array, the runner executes the commands in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt; (or equivalent) and returns the results in list order with index headers &lt;code&gt;[1/15]&lt;/code&gt;, &lt;code&gt;[2/15]&lt;/code&gt;... Saves 10-20x on tool calls for tasks where one command has to be run with different parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Put runtime context in the system prompt.&lt;/strong&gt; Minimum - four lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime context (provided by the runner, not by the tool):
- Current UTC time: &amp;lt;now&amp;gt;
- Default region: &amp;lt;region&amp;gt;
- Identity: &amp;lt;arn&amp;gt;
- This account is real and live; commands return real data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This closes a whole class of problems where the model gets confused about dates, regions, or thinks it's working against documentation rather than production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Write a rich tool description.&lt;/strong&gt; Aim for 2500-3000 characters. A structure that works (copying awslabs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short tool description (1 sentence).&lt;/li&gt;
&lt;li&gt;Key constraints (allowed commands, region defaults, auth model).&lt;/li&gt;
&lt;li&gt;A "Best practices" section - how to pick commands, when to use batch, when to use &lt;code&gt;--query&lt;/code&gt; and &lt;code&gt;--filters&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;An "Anti-patterns" section - an explicit "don't list-then-iterate if there's a more specific operation".&lt;/li&gt;
&lt;li&gt;2-3 concrete examples covering different task categories.&lt;/li&gt;
&lt;li&gt;Restrictions: no shell pipes, no &lt;code&gt;--profile&lt;/code&gt;, no substitution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model reads this as a cookbook. A badly written description means the model writes naive commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The whitelist must cover the **optimal&lt;/strong&gt; commands, not just the "obvious" ones.** This is the point that cost me half a day. Ask yourself: "what would a senior AWS engineer write for this task?" - and make sure that command is in the whitelist. Not just the commands needed for the naive strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Return structured output, not prose.&lt;/strong&gt; Always &lt;code&gt;--output json&lt;/code&gt; + truncate to a fixed byte budget with an explicit truncation marker. The model has to know that the response was truncated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Forward tool errors to the model verbatim.&lt;/strong&gt; When a command fails with &lt;code&gt;[exit=N] &amp;lt;stderr&amp;gt;&lt;/code&gt;, return it to the model as-is. It can self-correct on the next turn. Silent failures waste turns for nothing.&lt;/p&gt;

&lt;p&gt;Following these six rules turns the CLI wrapper from a parody of a tool into something that actually beats awslabs MCP on tokens. Takes half a day per service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology notes
&lt;/h2&gt;

&lt;p&gt;Three things I spent time on and which are worth knowing if you want to reproduce a benchmark like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: claude-agent-sdk and Claude Code poison the context.&lt;/strong&gt; For the first two days I was measuring CLI vs MCP through &lt;code&gt;claude-agent-sdk&lt;/code&gt;, and the numbers were wild. 30k input tokens on a "how many running EC2" task. I thought for a long time that it was protocol overhead, but no - it was Claude Code through the SDK dragging my &lt;strong&gt;entire&lt;/strong&gt; user-level &lt;code&gt;~/.claude.json&lt;/code&gt; into the context: figma MCP, pencil MCP, PubMed MCP, Gmail, Calendar, Bash, Edit, Read... 40+ tools from other servers I hadn't asked for. I rewrote the runner onto the direct Anthropic API - cache_read dropped from 30k to 0, input tokens dropped to "normal" 2k on a simple task. If you are benchmarking agents through someone else's ready-made harness, check &lt;strong&gt;with your own eyes&lt;/strong&gt; what exactly goes into the model on the first system turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: your own whitelist is an invisible benchmark variable.&lt;/strong&gt; I already wrote about this in the "effect C" section. I'll repeat: &lt;strong&gt;any&lt;/strong&gt; safety / security / validation layer between the model and the real service is &lt;strong&gt;part&lt;/strong&gt; of what you are measuring, even if you don't consciously think of it that way. If your whitelist forces the model into a narrow path, you are measuring the model's behaviour in that narrow path, not the model's behaviour in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third: &lt;code&gt;success_rate&lt;/code&gt; and retry policy.&lt;/strong&gt; One of my &lt;code&gt;cli-full ec2_running&lt;/code&gt; runs fell over with an HTTP 529 Overloaded from the Anthropic API. In the stats that's 90% success rate, even though it's not a transport issue. I decided &lt;strong&gt;not&lt;/strong&gt; to retry, because then the risk of masking real problems is too high. The article has to mention that 529 explicitly - otherwise the reader will compare 100% MCP against "90%" CLI and draw the wrong conclusion. Retry policy is yet another invisible variable the benchmark has to state out loud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;Everything is in a public repo: &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's in there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;src/agent_loop.py&lt;/code&gt; - ~150 lines of a self-contained agent loop on the direct Anthropic API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/tools_cli.py&lt;/code&gt;, &lt;code&gt;tools_cli_v2.py&lt;/code&gt;, &lt;code&gt;tools_mcp.py&lt;/code&gt; - CLI and MCP transports. Plus the ablation variants (&lt;code&gt;tools_cli_rich.py&lt;/code&gt;, &lt;code&gt;tools_cli_renamed.py&lt;/code&gt;, &lt;code&gt;tools_cli_with_fake_suggest.py&lt;/code&gt;) from the "five hypotheses" section.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/runner.py&lt;/code&gt; - CLI for running &lt;code&gt;--tasks &amp;lt;ids&amp;gt; --transports &amp;lt;ids&amp;gt; --n &amp;lt;N&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/aggregate.py&lt;/code&gt; - medians + IQR + success rate from raw JSONL.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/safety.py&lt;/code&gt; - whitelist + injection guard.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/ground_truth.py&lt;/code&gt; - a boto3 script that captures ground truth from a live account (parameterised via &lt;code&gt;BENCH_S3_BUCKET&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;results/scrubbed/final_summary.json&lt;/code&gt; - aggregated numbers at n=10 across all &lt;code&gt;(task, transport)&lt;/code&gt; cells. These are the same numbers as in the tables above, in machine-readable form.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;results/scrubbed/sample_runs.jsonl&lt;/code&gt; - 8 hand-curated runs, one per key storyline in the article: naive CLI on &lt;code&gt;iam_admin_roles&lt;/code&gt; (36 calls), MCP on the same task (1 call), cli-full (1 call); CLI failure on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt; due to 2025 timestamps vs the cli-ctx fix; naive CLI on &lt;code&gt;s3_bucket_regions&lt;/code&gt; (16 calls) vs MCP with batch (2 calls) vs cli-full with batch (2 calls). All role, bucket and instance names are replaced with &lt;code&gt;role-N&lt;/code&gt;, &lt;code&gt;bucket-N&lt;/code&gt;, &lt;code&gt;i-instanceNN&lt;/code&gt;. Metrics and full model response text are preserved.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docs/findings.md&lt;/code&gt; - extended analytical notes, part of which went into this article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why there are &lt;strong&gt;no&lt;/strong&gt; full 250 raw runs in the repo: the raw JSONL files contain real IAM role names, S3 bucket names and EC2 instance IDs from my AWS account, &lt;strong&gt;woven into free-form text&lt;/strong&gt; of model responses and batch commands. They can't be auto-scrubbed without a manual mapping for every name, and one missed line is a leak. So the repo only includes what I reviewed by eye: the aggregated &lt;code&gt;final_summary.json&lt;/code&gt; and 8 curated sample runs. If you want to see a full dataset, the best way to get a correct one is to run the benchmark on your own account in ~20 minutes (see below).&lt;/p&gt;

&lt;p&gt;To run the benchmark under your own account:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a dedicated IAM user with the &lt;code&gt;ReadOnlyAccess&lt;/code&gt; policy + any extra grants for your tasks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cp .env.example .env&lt;/code&gt;, fill in &lt;code&gt;AWS_PROFILE&lt;/code&gt;, &lt;code&gt;AWS_REGION&lt;/code&gt;, &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, &lt;code&gt;BENCH_S3_BUCKET&lt;/code&gt; (the name of any bucket in your account for the bucket-policy task).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.ground_truth&lt;/code&gt; - captures ground truth for your account.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.runner --n 10&lt;/code&gt; - runs the full series, ~15-20 minutes, ~$5-10 on the Anthropic API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.aggregate results/raw/*.jsonl&lt;/code&gt; - prints the table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you repeat this on your own stack and get different numbers - drop me a line, I'd love to compare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The popular "MCP loses to CLI" narrative rests on a single benchmark&lt;/strong&gt; (Scalekit, n=1, GitHub Copilot MCP). It is correct &lt;strong&gt;in its own conditions&lt;/strong&gt;, but generalising from it to "MCP is bad" is a mistake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS has already solved the schema-dump problem&lt;/strong&gt; in &lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt;. Their flagship MCP server is essentially the CLI with two tools, and that's a fair benchmark partner for raw aws CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On a fair 5-task series at n=10, &lt;code&gt;cli-full&lt;/code&gt; beats MCP on input tokens by 43-60% on every task.&lt;/strong&gt; But that takes writing a tool wrapper, a whitelist, a system prompt, a rich description. Half a day of engineering per service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The real question isn't "MCP or CLI" but "how much does your engineering time cost vs how much do your tokens cost".&lt;/strong&gt; MCP wins on velocity, broad surface, polyglot, low-QPS. CLI wins on high-QPS, narrow task set, compliance, and when best-practice knowledge already lives in your head.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three gap mechanisms&lt;/strong&gt; - HTTP metadata, batch calling, a broad allowlist - are &lt;strong&gt;reproducible&lt;/strong&gt; in a CLI tool via 4 lines in the system prompt, &lt;code&gt;anyOf string | array&lt;/code&gt; in the input schema, and one line in the whitelist. None of them is a structural property of the MCP protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodologically&lt;/strong&gt; - check with your own eyes what goes into the model's context, treat your own whitelist as a benchmark variable, and state your retry policy explicitly when reporting success rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If after all this you look at your own use case and decide you want a well-designed CLI tool, the six-item checklist is above. If you decide you want MCP - &lt;code&gt;uvx awslabs.aws-api-mcp-server@latest&lt;/code&gt; and you're in the game.&lt;/p&gt;

&lt;p&gt;Both options are &lt;strong&gt;correct answers to different questions&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>mcp</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Stop Using JSON in Claude Prompts. I Tested 4 Formats — One Won by 30%.</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Tue, 14 Apr 2026 20:22:36 +0000</pubDate>
      <link>https://forem.com/webramos/yaml-vs-markdown-vs-json-vs-toon-which-format-is-most-efficient-for-the-claude-api-4l94</link>
      <guid>https://forem.com/webramos/yaml-vs-markdown-vs-json-vs-toon-which-format-is-most-efficient-for-the-claude-api-4l94</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;My own benchmark across three Claude tiers (Haiku, Sonnet, Opus): 120 data files, 8 real-world scenarios, 5 formats. Tokens, cost, and accuracy – numbers, not opinions.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  You Are Overpaying for Prompts
&lt;/h2&gt;

&lt;p&gt;Every time you send data to the Claude API, the format of that data determines how many tokens you spend. The same 200-product catalog in JSON costs 15,879 tokens. In Markdown, it costs 7,814. In TOON, 6,088. That is a 62% difference.&lt;/p&gt;

&lt;p&gt;A 120-task list? JSON consumes 8,500 tokens. TOON uses 2,267. Savings: 73%.&lt;/p&gt;

&lt;p&gt;The problem is that every existing benchmark focuses on GPT, Gemini, and Llama. There has not been a public benchmark for Claude. I decided to fix that.&lt;/p&gt;

&lt;p&gt;I ran 450 API calls on Claude Haiku 4.5, tested Sonnet 4.6 and Opus 4.6, and counted tokens across 120 files using Anthropic’s production tokenizer. Eight real-world scenarios, five formats. In this article – the results, the conclusions, and specific recommendations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Formats at a Glance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  JSON (JavaScript Object Notation)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2001; ECMA-404 standard (2013)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; Douglas Crockford&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; APIs, data exchange between systems, configuration files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; strict typing, nesting via &lt;code&gt;{}&lt;/code&gt; and &lt;code&gt;[]&lt;/code&gt;, mandatory quotes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JSON is the lingua franca of programmatic interfaces. Every API speaks JSON, and every language can parse it. But that universality comes at a price in an LLM context: quotes, braces, and commas all consume tokens. They carry syntactic weight, but not semantic meaning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"products"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;29.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"in_stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  YAML (YAML Ain't Markup Language)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2001; YAML 1.2 standard (2009)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Clark Evans, Ingy döt Net, Oren Ben-Kiki&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; configuration files (Docker Compose, Kubernetes, GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; indentation-based structure, minimal punctuation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;YAML is the de facto standard of the DevOps world. It reads like pseudocode and usually does not require quotes. The trade-off is that repeating keys for every array item eats up much of the punctuation savings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;products&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Mouse&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;29.99&lt;/span&gt;
    &lt;span class="na"&gt;in_stock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Markdown
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2004&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; John Gruber (with Aaron Swartz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; documentation, READMEs, blogs, wikis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; human-first syntax – headings &lt;code&gt;#&lt;/code&gt;, tables &lt;code&gt;|&lt;/code&gt;, lists &lt;code&gt;-&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Markdown is the most “native” format for LLMs. Models have been trained on billions of READMEs and wiki pages. GitHub, Notion, Obsidian – all rely on Markdown. It is a communication format, not a data format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Products&lt;/span&gt;

| ID | Name  | Price | In Stock |
|----|-------|-------|----------|
| 1  | Mouse | 29.99 | Yes      |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plain Text
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; human communication – emails, notes, instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; no syntax, no markup, maximum flexibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plain text with no markup. It minimizes token overhead, but it provides no explicit structure for programmatic data extraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Products: Mouse (ID 1, $29.99, in stock)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TOON (Token-Oriented Object Notation)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2025 (v1.0 – November 2025, MIT license)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; open-source community (&lt;a href="https://github.com/toon-format/toon" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; token optimization in LLM prompts, replacing JSON in AI workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; a YAML + CSV hybrid (indentation for objects, row-style encoding for arrays)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The newest format in this comparison. TOON was created for one purpose: minimize tokens while preserving lossless JSON round-tripping. For arrays of homogeneous objects, field names are declared once and values are written as CSV-style rows. On GPT-5 Nano, it showed 99.4% accuracy with 46% token savings. Before this benchmark, it had not been tested on Claude.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;products[1]{id,name,price,in_stock}:
1,Mouse,29.99,true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I Tested
&lt;/h3&gt;

&lt;p&gt;Eight scenarios, each in three sizes (S / M / L), each in five formats. Total: 120 data files.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;S&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;L&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;System prompt / instructions&lt;/td&gt;
&lt;td&gt;Rules, sections&lt;/td&gt;
&lt;td&gt;10 rules&lt;/td&gt;
&lt;td&gt;30 rules&lt;/td&gt;
&lt;td&gt;60 rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product catalog&lt;/td&gt;
&lt;td&gt;Tabular data&lt;/td&gt;
&lt;td&gt;20 products&lt;/td&gt;
&lt;td&gt;100 products&lt;/td&gt;
&lt;td&gt;200 products&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Roadmap / tasks&lt;/td&gt;
&lt;td&gt;Statuses, dependencies&lt;/td&gt;
&lt;td&gt;15 tasks&lt;/td&gt;
&lt;td&gt;50 tasks&lt;/td&gt;
&lt;td&gt;120 tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Business rules&lt;/td&gt;
&lt;td&gt;Conditional logic&lt;/td&gt;
&lt;td&gt;8 rules&lt;/td&gt;
&lt;td&gt;25 rules&lt;/td&gt;
&lt;td&gt;50 rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Few-shot classification&lt;/td&gt;
&lt;td&gt;Input-output examples&lt;/td&gt;
&lt;td&gt;5 examples&lt;/td&gt;
&lt;td&gt;15 examples&lt;/td&gt;
&lt;td&gt;40 examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Organizational hierarchy&lt;/td&gt;
&lt;td&gt;3 levels of nesting&lt;/td&gt;
&lt;td&gt;12 people&lt;/td&gt;
&lt;td&gt;60 people&lt;/td&gt;
&lt;td&gt;150 people&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;Endpoints, parameters&lt;/td&gt;
&lt;td&gt;5 endpoints&lt;/td&gt;
&lt;td&gt;15 endpoints&lt;/td&gt;
&lt;td&gt;30 endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Output format&lt;/td&gt;
&lt;td&gt;Requesting data in a given format&lt;/td&gt;
&lt;td&gt;10 countries&lt;/td&gt;
&lt;td&gt;50 countries&lt;/td&gt;
&lt;td&gt;100 countries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Few-shot&lt;/strong&gt; (scenario 5) is a prompting technique in which several “input → output” examples are included directly in the prompt so the model can infer the task from a pattern. For example: &lt;code&gt;"Great product!" → positive&lt;/code&gt;, &lt;code&gt;"Terrible quality" → negative&lt;/code&gt;, then the question &lt;code&gt;"Love it!" → ?&lt;/code&gt;. Zero examples is zero-shot, one example is one-shot, several examples is few-shot. The format of those examples directly affects cost: 40 pairs in JSON take 2,131 tokens; in TOON, 996.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For scenarios 2, 3, 6, and 7, I prepared questions with precomputed correct answers (ground truth). For scenarios 1, 4, and 5, scoring was manual and rubric-based. For scenario 8, I measured output tokens and format compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Models and Pricing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Input ($/1M)&lt;/th&gt;
&lt;th&gt;Output ($/1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;td&gt;$75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Accuracy was measured across all three tiers. Sizes S and M were tested for accuracy. L-size was used only for token counts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean-Test Principle
&lt;/h3&gt;

&lt;p&gt;All requests were sent directly via the &lt;code&gt;anthropic&lt;/code&gt; Python SDK: plain &lt;code&gt;client.messages.create()&lt;/code&gt; with &lt;code&gt;temperature=0&lt;/code&gt;. No MCP servers, IDE plugins, or agent frameworks.&lt;/p&gt;

&lt;p&gt;Token counting was done with &lt;code&gt;client.messages.count_tokens()&lt;/code&gt; – Anthropic’s production tokenizer, i.e. the same numbers used for billing. &lt;strong&gt;The tokenizer is the same across all Claude tiers&lt;/strong&gt; – so the token-count data applies to all Claude models.&lt;/p&gt;

&lt;p&gt;Benchmark code: &lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Input-Token Efficiency
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;These numbers apply to all Claude tiers – Haiku, Sonnet, and Opus all use the same tokenizer. The only cost difference comes from the price per token.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Summary Table: Average Input Tokens Across All Scenarios
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Average tokens&lt;/th&gt;
&lt;th&gt;vs JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-53%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-57%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TOON saves 62% of input tokens on average versus JSON. Markdown saves 53%. YAML, despite its minimal punctuation, saves only 32% – because of repeated keys and indentation overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Breakdown by Scenario (% Savings vs JSON, L-size)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;YAML&lt;/th&gt;
&lt;th&gt;MD&lt;/th&gt;
&lt;th&gt;TXT&lt;/th&gt;
&lt;th&gt;TOON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instructions&lt;/td&gt;
&lt;td&gt;-22%&lt;/td&gt;
&lt;td&gt;-29%&lt;/td&gt;
&lt;td&gt;-24%&lt;/td&gt;
&lt;td&gt;-24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Products&lt;/td&gt;
&lt;td&gt;-29%&lt;/td&gt;
&lt;td&gt;-51%&lt;/td&gt;
&lt;td&gt;-53%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;-35%&lt;/td&gt;
&lt;td&gt;-63%&lt;/td&gt;
&lt;td&gt;-69%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-73%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business Rules&lt;/td&gt;
&lt;td&gt;-28%&lt;/td&gt;
&lt;td&gt;-52%&lt;/td&gt;
&lt;td&gt;-48%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot&lt;/td&gt;
&lt;td&gt;-31%&lt;/td&gt;
&lt;td&gt;-45%&lt;/td&gt;
&lt;td&gt;-37%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-53%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchy&lt;/td&gt;
&lt;td&gt;-37%&lt;/td&gt;
&lt;td&gt;-61%&lt;/td&gt;
&lt;td&gt;-67%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-68%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Docs&lt;/td&gt;
&lt;td&gt;-35%&lt;/td&gt;
&lt;td&gt;-45%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  YAML Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz60wk1w9ooculhvvjok1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz60wk1w9ooculhvvjok1.png" alt="YAML savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  MD Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vl0ox06w09jse0ncgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vl0ox06w09jse0ncgq.png" alt="MD savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  TXT Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu83ntyoo8rm0ug98vbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu83ntyoo8rm0ug98vbx.png" alt="TXT savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  TOON Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070ng6789odiolaota0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070ng6789odiolaota0y.png" alt="TOON savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Charts by Scenario
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Instructions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzchje0o1csh4gvxsnh3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzchje0o1csh4gvxsnh3j.png" alt="Input tokens: Instructions" width="800" height="485"&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Products
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjvnbl7gwa854kp4jv79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjvnbl7gwa854kp4jv79.png" alt="Input tokens: Products" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Tasks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc0qtuu6gqkkt9veu9vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc0qtuu6gqkkt9veu9vm.png" alt="Input tokens: Tasks" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Rules
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu0vc317rzolm6qit0gc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu0vc317rzolm6qit0gc.png" alt="Input tokens: Rules" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Few-shot
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rys14l7hw6f7n6pk8q3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rys14l7hw6f7n6pk8q3.png" alt="Input tokens: Few-shot" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Hierarchy
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foayupph0gz3j4x0rigs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foayupph0gz3j4x0rigs0.png" alt="Input tokens: Hierarchy" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: API Docs
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw984vwh94vw6r6e1uzod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw984vwh94vw6r6e1uzod.png" alt="Input tokens: API Docs" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Observations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TOON is the clear leader for tabular data.&lt;/strong&gt; Product catalogs, task lists, few-shot examples – anything that looks like an array of homogeneous objects. Savings: 62–73% versus JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Markdown is the best all-purpose format.&lt;/strong&gt; A stable 50–65% reduction across all data types. It is the only format that performs consistently well across tables, instructions, and hierarchies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML is underwhelming.&lt;/strong&gt; Many people expect YAML to be much more compact than JSON. In practice, the savings are only 14–41%. The reason is repeated keys for every array element.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plain Text wins on API docs.&lt;/strong&gt; For technical specifications, plain text is more efficient than TOON (59% vs 53%). Without extra syntax, descriptive text compresses better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale barely affects the percentage savings.&lt;/strong&gt; The difference between S and L is under 2 percentage points. Format drives efficiency more than data volume does.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Haiku 4.5: When Format Matters
&lt;/h2&gt;

&lt;p&gt;Haiku is the most format-sensitive tier. In 35% of questions, it produced different answers depending on the input format. Accuracy spread reached as high as 36 percentage points between the best and worst format within the same scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy by Scenario
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Products (product catalog)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3auwn1sahrmazcfa9ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3auwn1sahrmazcfa9ih.png" alt="Accuracy Haiku: Products" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Tasks (tasks / roadmap)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rf6ypezr9byqkghfq7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rf6ypezr9byqkghfq7q.png" alt="Accuracy Haiku: Tasks" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Hierarchy (organizational hierarchy)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3f3q2rmsufbuj0z6726.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3f3q2rmsufbuj0z6726.png" alt="Accuracy Haiku: Hierarchy" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: API Docs (documentation)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoum4j3h0tmlw7e36qlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoum4j3h0tmlw7e36qlw.png" alt="Accuracy Haiku: API Docs" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;YAML&lt;/th&gt;
&lt;th&gt;MD&lt;/th&gt;
&lt;th&gt;TXT&lt;/th&gt;
&lt;th&gt;TOON&lt;/th&gt;
&lt;th&gt;Best&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Products&lt;/td&gt;
&lt;td&gt;63.4%&lt;/td&gt;
&lt;td&gt;61.4%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;66.2%&lt;/td&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.7%&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;65.3%&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchy&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Docs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.1%&lt;/td&gt;
&lt;td&gt;78.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON/YAML/TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Hierarchy shows the sharpest gap:&lt;/strong&gt; YAML (92.9%) vs Markdown (57.1%) – a 36-point difference. Tree-like structures are clearly easier for Haiku to parse in an indentation-based format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Docs: Markdown performs unexpectedly poorly&lt;/strong&gt; – 57.1% vs 85.7% for JSON. For technical specifications with parameters and types, explicit structure matters more than compactness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy by Size (Haiku)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S (small data)&lt;/td&gt;
&lt;td&gt;80.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M (medium data)&lt;/td&gt;
&lt;td&gt;67.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scale matters more than format.&lt;/strong&gt; Accuracy drops by 13 points when moving from S to M – more than the average difference between formats (5.7 points). The implication is straightforward: reduce data volume first, then optimize format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost: Haiku
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0026&lt;/td&gt;
&lt;td&gt;$260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0018&lt;/td&gt;
&lt;td&gt;$177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0012&lt;/td&gt;
&lt;td&gt;$121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0011&lt;/td&gt;
&lt;td&gt;$111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0010&lt;/td&gt;
&lt;td&gt;$98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$162/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Output Format: Haiku
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Output tokens: S-size (10 countries) – Haiku, Sonnet, Opus
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Output tokens: M-size (50 countries) – Haiku, Sonnet, Opus
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requested format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;th&gt;Savings vs JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;465&lt;/td&gt;
&lt;td&gt;1,985&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;296&lt;/td&gt;
&lt;td&gt;1,352&lt;/td&gt;
&lt;td&gt;-32..36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;165&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,125&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43..65%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;294&lt;/td&gt;
&lt;td&gt;1,381&lt;/td&gt;
&lt;td&gt;-30..37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;342&lt;/td&gt;
&lt;td&gt;1,369&lt;/td&gt;
&lt;td&gt;-26..31%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Markdown is the cheapest output format on Haiku.&lt;/strong&gt; 165 vs 465 tokens on S-size – a 65% reduction. At $4 per 1M output tokens, that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: TOON loses on output.&lt;/strong&gt; Haiku does not know the TOON format and, instead of producing compact CSV-like rows, tends to emit verbose plain text that only vaguely resembles TOON. A few-shot example improves TOON output quality, but it still trails Markdown in efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output-Format Choice: Technical Requirements
&lt;/h3&gt;

&lt;p&gt;Output cost is not the only thing that matters. Often, Claude’s response must be processed programmatically – parsed, inserted into a database, or passed to another service. The best output format depends on who or what is going to read it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User-facing answer in UI&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Renders natively, lowest token cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend parsing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reliable, universal, guaranteed structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config / YAML pipeline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-readable + machine-parsable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rows for CSV / spreadsheet&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal overhead, structure via delimiters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compact output for TOON SDK&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only if using Opus, or with a few-shot example&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; if a human reads the output, use Markdown. If code reads it, use JSON or YAML. Do not optimize output cost at the expense of parsing reliability in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for Haiku
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;Best input&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Best output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalogs, lists&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks / roadmap&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;71.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MD&lt;/strong&gt; or &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON or YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.3% (-0.5% vs JSON)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On Haiku, format matters – especially for hierarchies and API documentation. Use TOON on input where token savings are worth a small accuracy trade-off, but &lt;strong&gt;do not use TOON on output&lt;/strong&gt; without a few-shot example.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sonnet 4.6: Format Affects Cost, Not Quality
&lt;/h2&gt;

&lt;p&gt;Sonnet 4.6 produced identical answers across all five formats. In 100% of questions, the result was the same regardless of how the data was represented. For Sonnet, format optimization is pure cost reduction with no quality trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy: Format-Invariant
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Accuracy by model and format
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faman5i1lej65040r6mqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faman5i1lej65040r6mqc.png" alt="Accuracy by model and format" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The answers are completely identical across all formats. Switching from JSON to TOON saves 62% of input tokens while preserving the same output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost: Sonnet
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0098&lt;/td&gt;
&lt;td&gt;$975&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0066&lt;/td&gt;
&lt;td&gt;$663&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0045&lt;/td&gt;
&lt;td&gt;$454&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0042&lt;/td&gt;
&lt;td&gt;$417&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0037&lt;/td&gt;
&lt;td&gt;$368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$607/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100K requests per month, switching from JSON to TOON saves $607/month. On Sonnet, output costs $15 per 1M tokens, so output optimization also matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Format: Sonnet
&lt;/h3&gt;

&lt;p&gt;Output tokens for Sonnet (estimated as characters ÷ 3.5 chars/token):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;~210&lt;/td&gt;
&lt;td&gt;~1,120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;~195&lt;/td&gt;
&lt;td&gt;~1,023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~143&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~746&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;~103&lt;/td&gt;
&lt;td&gt;~549&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~86&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~414&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparison of output tokens across all three models (S-size):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;M-size (50 countries):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Sonnet, TOON output requires a few-shot example.&lt;/strong&gt; Without extra context, Sonnet interprets “TOON format” literally – as an abbreviation connected to cartoons – and returns an irrelevant answer. With a format example in the prompt, it generates correct TOON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical requirements for output on Sonnet&lt;/strong&gt; are the same as on Haiku: if a downstream system parses the response programmatically, use JSON or YAML. If a human is going to read it, use Markdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for Sonnet
&lt;/h3&gt;

&lt;p&gt;On Sonnet, format choice is a pure cost optimization. The logic is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input data:&lt;/strong&gt; use TOON (for tables) or MD (for instructions / hierarchies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable output:&lt;/strong&gt; Markdown (-65% vs JSON)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-parsed output:&lt;/strong&gt; JSON (most reliable) or YAML (more compact, still parseable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON output:&lt;/strong&gt; add a few-shot example to the prompt; otherwise the answer may be incorrect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optimal prompt design: &lt;strong&gt;MD for instructions + TOON for data + a request for MD/JSON output&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Opus 4.6: Maximum Capability, Also Format-Invariant
&lt;/h2&gt;

&lt;p&gt;Opus 4.6 is the strongest model and the most expensive one. Like Sonnet, it is completely insensitive to input format. But Opus has one unique advantage: it knows TOON “out of the box.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy: Format-Invariant
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The answers are 100% identical across all formats. Changing format affects only cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost: Opus
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0488&lt;/td&gt;
&lt;td&gt;$4,878&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0331&lt;/td&gt;
&lt;td&gt;$3,312&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0227&lt;/td&gt;
&lt;td&gt;$2,271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0209&lt;/td&gt;
&lt;td&gt;$2,087&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0184&lt;/td&gt;
&lt;td&gt;$1,839&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,039/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On Opus, switching from JSON to TOON saves over $3,000/month at 100K requests. Output costs $75 per 1M tokens – so format optimization has the largest financial impact here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Format: Opus
&lt;/h3&gt;

&lt;p&gt;Output tokens for Opus (estimated as characters ÷ 3.5 chars/token):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;~254&lt;/td&gt;
&lt;td&gt;~1,271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;~286&lt;/td&gt;
&lt;td&gt;~1,414&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~177&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~814&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;~194&lt;/td&gt;
&lt;td&gt;~986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~106&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~543&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparison of output tokens across all three models (S-size):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;M-size (50 countries):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus generates TOON without hints.&lt;/strong&gt; That is the key difference from Sonnet and Haiku. Opus knows the format and produces valid TOON output on the first try.&lt;/p&gt;

&lt;h4&gt;
  
  
  Can Claude generate valid TOON output?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="/media/blog/chart-toon-output.png" class="article-body-image-wrapper"&gt;&lt;img src="/media/blog/chart-toon-output.png" alt="TOON output generation across models"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Without example in prompt&lt;/th&gt;
&lt;th&gt;With few-shot example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Cartoon / irrelevant&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;Verbose plain text&lt;/td&gt;
&lt;td&gt;Closer to TOON, but still inaccurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practical terms, this means: if you need TOON output and want it to work reliably without prompt scaffolding, use Opus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Requirements for Output: When Parsing Matters More Than Cost
&lt;/h3&gt;

&lt;p&gt;On Opus, output costs $75 per 1M tokens – so output-format savings are highly relevant. But the requirements of the downstream system still take priority:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios where output must be parsed programmatically:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The response goes into a database or structured store – use &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Another LLM or service consumes the response through an API – use &lt;strong&gt;JSON&lt;/strong&gt; or &lt;strong&gt;YAML&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The response is part of a pipeline (the next step processes the data) – use &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The response is rendered in the UI as text or a document – use &lt;strong&gt;Markdown&lt;/strong&gt; (lowest token cost)&lt;/li&gt;
&lt;li&gt;You need compact machine-readable output and already have a TOON SDK – use &lt;strong&gt;TOON&lt;/strong&gt; (only Opus works reliably without prompt help)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key point:&lt;/strong&gt; output on Opus costs $75 per 1M – five times more than input. A 65% output reduction (Markdown vs JSON) can matter even more than input savings. But do not trade away parse reliability just to cut cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for Opus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; TOON for tabular data (-62%), MD for instructions (-53%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable output:&lt;/strong&gt; Markdown (-65% output tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-parsed output:&lt;/strong&gt; JSON – reliable and universal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON output:&lt;/strong&gt; works without few-shot – Opus’s unique advantage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not use JSON on input:&lt;/strong&gt; it is the most expensive format with no accuracy benefit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy Across All Models and Formats
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;75.3%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;75.1%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;69.6%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;70.6%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;74.8%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Sonnet and Opus, format does not affect accuracy. For Haiku, it matters materially – especially for hierarchies and documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Matrix: Input Format
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;Sonnet / Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompts / instructions&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MD&lt;/strong&gt; (-29%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; or &lt;strong&gt;MD&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalogs, lists&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TXT&lt;/strong&gt; (70.2%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-62%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks / roadmap&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON&lt;/strong&gt; (71.0%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-73%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business rules&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON&lt;/strong&gt; (stable)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-63%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (≈JSON)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-53%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchies&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;YAML&lt;/strong&gt; (92.9%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; or &lt;strong&gt;MD&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON/YAML&lt;/strong&gt; (85.7%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TXT&lt;/strong&gt; (-59%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Decision Matrix: Output Format
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Output consumer&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI / end user&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API / JSON parser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML pipeline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON SDK&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;with few-shot*&lt;/td&gt;
&lt;td&gt;with few-shot*&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV / spreadsheet&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Requires a few-shot example in the prompt&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy was measured only on S+M sizes.&lt;/strong&gt; L-size includes token counts only. Accuracy may degrade more sharply on larger data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The data is synthetic.&lt;/strong&gt; Catalogs and tasks were script-generated. Real-world data may be messier (missing fields, Unicode, long descriptions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic scoring covers 4 of 8 cases.&lt;/strong&gt; Cases 1, 4, and 5 require rubric-based evaluation. The accuracy numbers here cover cases 2, 3, 6, and 7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet / Opus were tested via subscription (subagents).&lt;/strong&gt; Output-token counts are estimated, not directly measured. Haiku was tested via API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No A/B test on live traffic.&lt;/strong&gt; This is a laboratory benchmark. The impact on a production product must be validated separately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code and data are open – reproduce it, extend it, challenge it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Opus and Sonnet are completely insensitive to format.&lt;/strong&gt; I expected a 3–5% gap. I got 0%. For the higher tiers, format is pure cost optimization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML is not as efficient as many assume.&lt;/strong&gt; The expectation is usually “YAML is more compact than JSON.” In practice, the savings are only 32%. Repeated keys wipe out much of the benefit from removing braces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TOON works on Claude without special training.&lt;/strong&gt; Claude may not have seen much TOON in training data, yet all three tiers parse it correctly – essentially on par with JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Opus knows TOON; Sonnet does not.&lt;/strong&gt; Opus generates valid TOON output without hints. Sonnet interpreted “TOON format” as “cartoon” and produced an irrelevant answer. With a few-shot example, both work correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Markdown is the best output format.&lt;/strong&gt; The gap in output tokens between JSON and Markdown is 65%. At $75 per 1M on Opus, that is significant. It is also the only format every tier generates natively without extra prompting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On Haiku, scale matters more than format.&lt;/strong&gt; Accuracy drops from 80.3% (S) to 67.2% (M) – a 13-point drop. The average difference between formats is 5.7 points. On Sonnet and Opus, scale is much less of an issue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do these results apply to other models (GPT, Gemini)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The trends are similar, but the numbers differ. Every model has its own tokenizer. On GPT-5 Nano, YAML shows 62% accuracy on nested data (&lt;a href="https://www.improvingagents.com/blog/best-nested-data-format/" rel="noopener noreferrer"&gt;ImprovingAgents&lt;/a&gt;); on Claude Haiku, it reaches 93%. Use these results for Claude, and other benchmarks for other models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How were tokens counted?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;client.messages.count_tokens()&lt;/code&gt; – the standard Anthropic SDK method and production tokenizer. These are the same numbers used for billing. The tokenizer is the same across all tiers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not test XML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;XML is rarely used in modern LLM workflows. Existing benchmarks (&lt;a href="https://shshell.com/blog/token-efficiency-module-13-lesson-2-format-comparison" rel="noopener noreferrer"&gt;ShShell&lt;/a&gt;) suggest that XML is significantly more expensive than Markdown in token terms, with comparable or worse accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is TOON a serious format or just hype?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TOON v1.0 was released in November 2025 under MIT, and there are SDKs in 6+ languages. For tabular data, the savings are real – 62% on Claude with JSON-level accuracy. Opus generates TOON output without prompting. Other tiers require a few-shot example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does the input format affect the output format?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Partially. If you provide data in YAML, Claude is more likely to structure its answer with indentation. But an explicit instruction such as “Return as a Markdown table” overrides that tendency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is it worth converting all prompts away from JSON?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At 100K requests/month on Sonnet, moving from JSON to TOON saves $607/month. On Opus, it saves $3,039/month. For hobby projects with 1K requests, the difference is around $6. Run the math on your own usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can you combine formats in one prompt?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes – and that is usually the recommended approach. Markdown for instructions + TOON for data + a request for output in the format you need. Claude handles multi-format prompts well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Where is the benchmark source code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;. All 120 data files, 51 questions, ground truth, runner, and scorer are open for reproduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data format in a prompt is not a cosmetic choice. On the Claude API, the gap between JSON and TOON is 62% on input tokens. Markdown saves 65% on output tokens. At 100K requests/month on Opus, that means $3,039 saved on input and even more on output.&lt;/p&gt;

&lt;p&gt;But the main finding is not about tokens. &lt;strong&gt;Claude Sonnet 4.6 and Opus 4.6 are completely insensitive to format.&lt;/strong&gt; They produced 100% identical answers on JSON, YAML, Markdown, Plain Text, and TOON. For the higher tiers, format optimization is pure savings with no quality trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only Haiku 4.5 is meaningfully format-sensitive&lt;/strong&gt; – and only there does the choice of format affect accuracy (by up to 36 percentage points). On Haiku, format should be matched to data type: YAML for hierarchies, JSON for tasks with dependencies.&lt;/p&gt;

&lt;p&gt;Beyond cost, there are technical requirements: if the output must be parsed programmatically, JSON is more reliable than Markdown. If a human reads the answer, Markdown is cheaper. Opus is the only tier that generates TOON natively; Sonnet and Haiku require a few-shot example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR by tier:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Does format affect accuracy?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes, by up to 36 points&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best input (data)&lt;/td&gt;
&lt;td&gt;YAML/JSON/TXT by data type&lt;/td&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best input (instructions)&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best output (human-readable)&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best output (parsing)&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON output without prompt help&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON -&amp;gt; TOON savings&lt;/td&gt;
&lt;td&gt;$162 / 100K&lt;/td&gt;
&lt;td&gt;$607 / 100K&lt;/td&gt;
&lt;td&gt;$3,039 / 100K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Benchmark run in April 2026 on Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;120 data files, 8 scenarios, 3 sizes, 5 formats, 3 models.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;All code and data: &lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>claude</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
