<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Webmaster Ramos</title>
    <description>The latest articles on Forem by Webmaster Ramos (@webramos).</description>
    <link>https://forem.com/webramos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1273297%2F5ca1ac25-c251-4144-a980-975f0b6a7c4d.jpg</url>
      <title>Forem: Webmaster Ramos</title>
      <link>https://forem.com/webramos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/webramos"/>
    <language>en</language>
    <item>
      <title>One Nav, Two Stacks: A Microfrontend Between Magento and Laravel Without Replatforming</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Thu, 30 Apr 2026 21:04:50 +0000</pubDate>
      <link>https://forem.com/webramos/one-nav-two-stacks-a-microfrontend-between-magento-and-laravel-without-replatforming-3on5</link>
      <guid>https://forem.com/webramos/one-nav-two-stacks-a-microfrontend-between-magento-and-laravel-without-replatforming-3on5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A working reference implementation on two production-grade stacks (Magento 2.4 + Laravel 11), with the host integration shape shown below and a server-rendered nav skeleton shipped on day one - not retrofitted after a Google Search Console (GSC) panic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Mid-market ecommerce rarely lives on one stack. The industry answer - "replatform everything onto one stack" - is a $100-500k, 6-12 month project most mid-market companies cannot afford.&lt;/p&gt;

&lt;p&gt;I shipped a smaller answer on a real client stack: &lt;strong&gt;a 15-20 kb Preact microfrontend that mounts into both Magento 2.4 and Laravel 11 via a manifest&lt;/strong&gt;. This is not a Module Federation hello-world - it is two real host integrations, around 120 lines of PHP on Magento and around 90 lines on Laravel, with one &lt;code&gt;pnpm build&lt;/code&gt; and both sites updating in under a minute.&lt;/p&gt;

&lt;p&gt;The opinionated part: &lt;strong&gt;microfrontends failed as a product architecture but work as a repair strategy.&lt;/strong&gt; Greenfield product teams drown in coordination cost. Repair across inherited stacks is a different problem - and the pattern solves it cleanly, if you get the SEO contract right before shipping.&lt;/p&gt;

&lt;p&gt;The proof point is deliberately concrete: &lt;strong&gt;a before/after crawler diff on identical URLs and user-agents&lt;/strong&gt;. Not a modelled SEO score, not a Lighthouse proxy, but raw HTML facts - bytes, anchor counts, and whether the navigation exists in initial markup for non-rendering crawlers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody names out loud
&lt;/h2&gt;

&lt;p&gt;A mid-market ecommerce group with multiple brands, accrued over years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Magento 2.4 storefront&lt;/strong&gt; - catalogue, cart, checkout.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Laravel 11 marketing site&lt;/strong&gt; - brand story, awards programme, editorial.&lt;/li&gt;
&lt;li&gt;A handful of single-purpose SPAs on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each stack has its own header and footer. When marketing adds a top-level category, it ships in one stack in a week and in the other in three. When design changes the logo, it takes two sprints to roll out across everything.&lt;/p&gt;

&lt;p&gt;The cost is not engineering hours. The cost is that the brand is visibly inconsistent to customers, the teams know it, and every cross-team sync about the nav takes an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "just consolidate on one stack" is not the answer
&lt;/h2&gt;

&lt;p&gt;The standard advice is a monorepo or a headless rewrite. Both are correct on paper and wrong in the field.&lt;/p&gt;

&lt;p&gt;Monorepos assume teams that want to converge. Inherited teams - Magento folks who have been on that stack for seven years, a Laravel team that came with an acquisition - do not want to converge. They have skill investment, release cadences, and on-call rotations built around their stack. A monorepo migration is a political project before it is an engineering one, and most mid-market companies cannot push one through.&lt;/p&gt;

&lt;p&gt;Headless replatforming is the same project in a different wrapper. Twelve-month runway, executive buy-in, and a new front end that has to outpace the rate at which the old ones break the business.&lt;/p&gt;

&lt;p&gt;A shared nav microfrontend does not compete with a monorepo architecturally. It competes with &lt;strong&gt;doing nothing&lt;/strong&gt; - which is what the organisation was going to do for the next two years anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why repair is different from design
&lt;/h2&gt;

&lt;p&gt;Spotify publicly &lt;a href="https://blog.pragmaticengineer.com/spotify-squads/" rel="noopener noreferrer"&gt;rolled back its extreme squad autonomy&lt;/a&gt;. The failure mode is always the same: teams own product-level vertical slices, those slices compose into one surface, coordination cost explodes, UX inconsistency becomes a feature of the architecture.&lt;/p&gt;

&lt;p&gt;That is a real lesson. It does not mean no microfrontend is ever right.&lt;/p&gt;

&lt;p&gt;Repair is a different problem from design. You are not building the surface - the surface already exists, in two incompatible implementations. You are installing the narrowest possible shared layer that brings them back into visual alignment. The nav is exactly that narrow: no business logic, no routing, no data dependencies beyond a flat link tree.&lt;/p&gt;

&lt;p&gt;Everything the microfrontend critique flags - duplicate runtime, fragmented UX ownership, coordination overhead - either does not apply to a 15 kb shell (runtime is negligible) or applies &lt;em&gt;less&lt;/em&gt; than the status quo (UX is already fragmented; we are &lt;em&gt;reducing&lt;/em&gt; coordination by centralising the decision).&lt;/p&gt;

&lt;h2&gt;
  
  
  Shell architecture: 15-20 kb, one build, one file
&lt;/h2&gt;

&lt;p&gt;A Preact tree built with &lt;a href="https://vite.dev/guide/build.html#library-mode" rel="noopener noreferrer"&gt;Vite's library mode&lt;/a&gt; into one IIFE script and one CSS file, both with content-hashed filenames. A &lt;code&gt;manifest.json&lt;/code&gt; maps logical names to hashed URLs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preact over React&lt;/strong&gt; - ~10 kb gzipped vs ~45 kb. Non-negotiable at a 15-20 kb budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IIFE over ES modules&lt;/strong&gt; - works in Magento's RequireJS environment without extra config, and in any &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag on any stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cssCodeSplit: false&lt;/code&gt;&lt;/strong&gt; - one file, one request, no FOUC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tailwind with a prefix&lt;/strong&gt; - scoped classes, no collision with host CSS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-hashed URLs via manifest&lt;/strong&gt; - immutable caching. Hosts read the manifest at render time and emit &lt;code&gt;&amp;lt;link href="/nav/shared-nav.abc123.css"&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pnpm build&lt;/code&gt; takes ~8 seconds. Hosts pick up new hashes within their cache TTL. One bugfix lands on both sites in ~1 minute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host integration: Magento 2.4
&lt;/h2&gt;

&lt;p&gt;Around 120 lines of new PHP, three files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\Model\SharedNavManifest&lt;/code&gt;&lt;/strong&gt; (~85 lines) - HTTP-fetches the manifest with Magento's cache backend, falls back to a non-hashed &lt;code&gt;shared-nav.iife.js&lt;/code&gt; on fetch failure so the nav never disappears, only loses cache-busting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\ViewModel\SharedNavAssets&lt;/code&gt;&lt;/strong&gt; (~26 lines) - the ViewModel that phtml templates talk to. CSS goes through a ViewModel rather than static &lt;code&gt;&amp;lt;css&amp;gt;&lt;/code&gt; layout XML because the URL has a hash in it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Acme\Theme\etc\frontend\di.xml&lt;/code&gt;&lt;/strong&gt; (~7 lines) - wires the manifest URL through deploy config.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two phtml partials - header and footer - emit the mount divs and asset tags. Included from &lt;code&gt;default.xml&lt;/code&gt;, so every page type inherits the shared nav.&lt;/p&gt;

&lt;h2&gt;
  
  
  Host integration: Laravel 11
&lt;/h2&gt;

&lt;p&gt;Around 90 lines. Smaller because the service container carries more weight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;App\Services\SharedNavManifest&lt;/code&gt;&lt;/strong&gt; (~65 lines) - HTTP-fetches the manifest, caches via &lt;code&gt;Cache::remember('shared_nav.manifest', 60, ...)&lt;/code&gt;, logs and falls back to the unhashed bundle on fetch failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;config/services.php&lt;/code&gt;&lt;/strong&gt; - three lines exposing &lt;code&gt;services.shared_nav.manifest_url&lt;/code&gt; as env-driven config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two Blade layouts&lt;/strong&gt; - public and a secondary layout for older marketing pages - emit &lt;code&gt;&amp;lt;link&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tags from the manifest service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 60-second cache TTL controls how fast a &lt;code&gt;pnpm build&lt;/code&gt; propagates - aggressive enough for release cadence, conservative enough that manifest fetches are one request per minute per worker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Representative code shape (abridged)
&lt;/h2&gt;

&lt;p&gt;The full production classes are client code, so I am not publishing them verbatim here. But the integration should not stay abstract either. This is the &lt;strong&gt;shape&lt;/strong&gt; of the two host adapters - abridged to show the contract rather than every guardrail and framework detail.&lt;/p&gt;

&lt;p&gt;A note on the manifest keys: Vite indexes &lt;code&gt;manifest.json&lt;/code&gt; by entry source path and asset name - &lt;code&gt;src/main.tsx&lt;/code&gt; and &lt;code&gt;style.css&lt;/code&gt; in our build - not by the output filename. The host lookups use those keys; the unhashed filenames (&lt;code&gt;shared-nav.iife.js&lt;/code&gt;, &lt;code&gt;shared-nav.css&lt;/code&gt;) are only used as fallbacks when the manifest fetch fails.&lt;/p&gt;
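
&lt;p&gt;For orientation, this is roughly what that &lt;code&gt;manifest.json&lt;/code&gt; looks like - an illustrative sketch, with made-up hashes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "src/main.tsx": {
    "file": "shared-nav.abc123.js",
    "src": "src/main.tsx",
    "isEntry": true
  },
  "style.css": {
    "file": "shared-nav.def456.css",
    "src": "style.css"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
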

&lt;p&gt;&lt;strong&gt;Magento 2.4 - manifest service shape:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SharedNavManifest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getJsUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'src/main.tsx'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fallbackJs&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getCssUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'style.css'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;fallbackCss&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// SSR fallback: fetched once from the shell, cached in Magento's cache&lt;/span&gt;
    &lt;span class="c1"&gt;// backend, and inlined into the mount div at render time.&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getHeaderHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'header.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;getFooterHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'footer.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1) read cached manifest&lt;/span&gt;
        &lt;span class="c1"&gt;// 2) on miss, fetch remote manifest URL&lt;/span&gt;
        &lt;span class="c1"&gt;// 3) cache parsed JSON&lt;/span&gt;
        &lt;span class="c1"&gt;// 4) on failure, log and fall back to unhashed asset names&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1) read cached snapshot for $key&lt;/span&gt;
        &lt;span class="c1"&gt;// 2) on miss, fetch rendered HTML from the shell (e.g. /nav/header.html)&lt;/span&gt;
        &lt;span class="c1"&gt;// 3) cache body with a short TTL&lt;/span&gt;
        &lt;span class="c1"&gt;// 4) on failure, return '' so the shell still hydrates later&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Laravel 11 - manifest service shape:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SharedNavManifest&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'shared_nav.manifest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// GET config('services.shared_nav.manifest_url'), parse JSON.&lt;/span&gt;
            &lt;span class="c1"&gt;// On failure, log and return [] so fallback filenames kick in.&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;jsUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'src/main.tsx'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s1"&gt;'shared-nav.iife.js'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;cssUrl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s1"&gt;'/nav/'&lt;/span&gt; &lt;span class="mf"&gt;.&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s1"&gt;'style.css'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s1"&gt;'file'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s1"&gt;'shared-nav.css'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;headerHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'header.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;footerHtml&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'footer.html'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;snapshotHtml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Cache&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"shared_nav.snapshot:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// fetch rendered HTML from the shell (e.g. /nav/header.html)&lt;/span&gt;
            &lt;span class="c1"&gt;// return '' on failure so the shell still hydrates later&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is why the line counts matter. The host code is small enough to be reviewable, boring enough to be supportable, and explicit enough that another PHP team can own it without learning a front-end platform first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The host &amp;lt;-&amp;gt; shell contract
&lt;/h2&gt;

&lt;p&gt;The two sides agree on a tiny surface:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Host provides:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"{{ $nav-&amp;gt;cssUrl() }}"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-header"&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"min-height: 80px;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{!! $nav-&amp;gt;headerHtml() !!}&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- page body --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-footer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;{!! $nav-&amp;gt;footerHtml() !!}&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;defer&lt;/span&gt; &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"{{ $nav-&amp;gt;jsUrl() }}"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Shell provides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;nav-fallback.html&lt;/code&gt; emitted at build time, split into header and footer snippets the host inlines into the mount divs (the SSR fallback).&lt;/li&gt;
&lt;li&gt;Client-side mount into &lt;code&gt;#sa-header&lt;/code&gt; and &lt;code&gt;#sa-footer&lt;/code&gt; that replaces the SSR snapshot with the interactive tree (dropdowns, mobile menu, state).&lt;/li&gt;
&lt;li&gt;One CSS file, one JS file, no global pollution (IIFE scope).&lt;/li&gt;
&lt;li&gt;No knowledge of Magento or Laravel. No runtime config, no feature flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else - routing, authentication, cart state, checkout - stays on the host. The nav does not know the host exists. The host does not know the nav is Preact. That is the whole integration.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;min-height: 80px&lt;/code&gt; on the header mount is anti-CLS insurance - the slot reserves its space before hydration, so Core Web Vitals do not punish the deferred render.&lt;/p&gt;

&lt;h2&gt;
  
  
  The SEO question, answered honestly
&lt;/h2&gt;

&lt;p&gt;This is the part every microfrontend post skips or hand-waves. I will not.&lt;/p&gt;

&lt;p&gt;Also, this section is intentionally based on &lt;strong&gt;observable crawler facts&lt;/strong&gt;, not modelled SEO metrics. I am not claiming a ranking uplift from a synthetic score. I am showing what a crawler can and cannot see in the initial HTML before and after the fallback ships.&lt;/p&gt;

&lt;p&gt;Without a fallback, initial HTML is two empty divs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-header"&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"min-height: 80px;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"sa-footer"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Googlebot renders JavaScript (eventually) and sees the nav - with a delay measured in days. But &lt;strong&gt;GPTBot, ClaudeBot, and PerplexityBot do not render JavaScript&lt;/strong&gt;. They see the empty divs. As far as AI search is concerned, the site has no nav.&lt;/p&gt;

&lt;p&gt;I measured this before shipping the SSR fallback. Three pages, five user-agents, identical &lt;code&gt;curl&lt;/code&gt; invocations. Same URLs, same crawl method, same parsing rule - only the fallback changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before SSR fallback:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Homepage&lt;/th&gt;
&lt;th&gt;/about&lt;/th&gt;
&lt;th&gt;/portfolio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bytes&lt;/td&gt;
&lt;td&gt;35,050&lt;/td&gt;
&lt;td&gt;35,050&lt;/td&gt;
&lt;td&gt;97,521&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; total&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from nav&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Twelve anchors per page, none of them structural. Every page - no matter how deep - exposed the same twelve inline body links to a non-rendering crawler. Sitemap.xml covered URL discovery, but not the four things nav does beyond discovery:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Link equity&lt;/strong&gt; - a multi-level nav is hundreds of internal links per page pointing at categories. Without it, category pages lose authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crawl budget&lt;/strong&gt; - Googlebot prioritises pages by incoming-link density. Sitemap-only pages get crawled less often.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic hierarchy&lt;/strong&gt; - sitemap is flat. Nav signals semantic structure ("Shop -&amp;gt; Men -&amp;gt; Shoes").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI assistant context&lt;/strong&gt; - ChatGPT and Perplexity build mental models from HTML, often ignoring sitemaps. Without nav in HTML, AI knows your URLs but not your structure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The three-level mitigation ladder:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;noscript&amp;gt;&lt;/code&gt; fallback&lt;/strong&gt; with critical links inside the mount div (hours of work).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSR skeleton&lt;/strong&gt; - Vite emits a &lt;code&gt;nav-fallback.html&lt;/code&gt; at build time; hosts inline it into the mount divs before hydration replaces it (a day or two).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full SSR service&lt;/strong&gt; - a Node process renders each nav request server-side (a week, plus a new production dependency).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Level 2 is the sweet spot for an ecommerce group this size. We shipped it before the first production release. Same &lt;code&gt;curl&lt;/code&gt; invocations, four days later:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After SSR fallback:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Homepage&lt;/th&gt;
&lt;th&gt;/about&lt;/th&gt;
&lt;th&gt;/portfolio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bytes&lt;/td&gt;
&lt;td&gt;98,881&lt;/td&gt;
&lt;td&gt;98,881&lt;/td&gt;
&lt;td&gt;161,348&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; total&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;td&gt;112&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from nav (&lt;code&gt;#sa-header&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anchors from footer (&lt;code&gt;#sa-footer&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All five user-agents received identical HTML down to the byte (the only per-request variance is the Laravel CSRF meta token). The nav and footer trees are in the initial HTML - 100 additional anchors per page, constant across every page, visible to every crawler that can parse HTML.&lt;/p&gt;

&lt;p&gt;That matters methodologically. A crawler can disagree with my interpretation of the SEO impact, but it cannot disagree with &lt;code&gt;35,050 -&amp;gt; 98,881&lt;/code&gt; bytes or &lt;code&gt;12 -&amp;gt; 112&lt;/code&gt; anchors under the same crawl conditions. This is a reusable audit method, not a one-off anecdote.&lt;/p&gt;
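
&lt;p&gt;If you want to reproduce the audit, the whole method fits in a few lines. A minimal sketch in Python rather than raw &lt;code&gt;curl&lt;/code&gt; - the URLs and user-agent strings here are placeholders, substitute your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Crawler-diff audit: same URLs, same user-agents, raw HTML facts.
import re
import urllib.request

URLS = ["https://example.com/", "https://example.com/about"]
AGENTS = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1)",
    "gptbot": "GPTBot/1.0",
}
ANCHORS = re.compile(rb"&amp;lt;a\s[^&amp;gt;]*href", re.IGNORECASE)

for url in URLS:
    for name, ua in AGENTS.items():
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        body = urllib.request.urlopen(req, timeout=10).read()
        # per-mount counts come from slicing the HTML at the #sa-header
        # and #sa-footer divs; omitted here for brevity
        print(f"{url} [{name}] bytes={len(body)} anchors={len(ANCHORS.findall(body))}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
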

&lt;p&gt;The gap closed on release day. No retroactive GSC panic, no "we measured a drop and here's how we fixed it" narrative. The honest framing is "we knew the risk, we closed it before shipping".&lt;/p&gt;

&lt;h2&gt;
  
  
  What this article proves today - and what it does not yet
&lt;/h2&gt;

&lt;p&gt;This article proves three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;integration pattern is real&lt;/strong&gt; on two production-grade PHP stacks.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;SEO risk is real&lt;/strong&gt; if the shell ships with empty mount points only.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Level 2 fallback&lt;/strong&gt; closes that crawler-visibility gap on day one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What it does &lt;strong&gt;not&lt;/strong&gt; prove yet is a 90-day business outcome story. I do not have a "three months later, here are CrUX and GSC deltas" chart in this draft, because that would require waiting for the post-release window to mature. I would rather publish the implementation pattern and the crawler evidence honestly than pretend I have impact numbers I do not have yet.&lt;/p&gt;

&lt;p&gt;That makes this a build-and-ship case study, not a finished growth narrative. When post-launch search-console and field-performance data are mature enough to be worth showing, they belong in a follow-up article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this pattern does not solve
&lt;/h2&gt;

&lt;p&gt;Not overselling: the shared nav is the &lt;strong&gt;minimum viable shared surface&lt;/strong&gt; - that is its strength and its ceiling.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary page content&lt;/strong&gt; still diverges. Magento renders products; Laravel renders marketing copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared checkout&lt;/strong&gt; - not solved. Checkout lives in Magento; marketing links into it via cookies on a common parent domain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared authentication&lt;/strong&gt; - not solved. Cookies, redirects, OAuth handshakes - all host-specific.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared search&lt;/strong&gt; - could be built on top of the shell, but we did not. Search UX is coupled to Magento-native catalogue data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A shared nav is not a distributed-front-end strategy. It is a band-aid across a badly healed fracture. If you need a distributed front end, you need a different architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  When this pattern fits
&lt;/h2&gt;

&lt;p&gt;Short checklist. If you check fewer than three boxes, do something else.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have &lt;strong&gt;two or more existing stacks&lt;/strong&gt; with established teams you cannot realistically move.&lt;/li&gt;
&lt;li&gt;There is &lt;strong&gt;no budget or appetite&lt;/strong&gt; for a full front-end unification right now.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;primary pain is UX inconsistency&lt;/strong&gt;, not performance or architectural debt.&lt;/li&gt;
&lt;li&gt;Nobody on the executive side is willing to own a "unified portal" programme.&lt;/li&gt;
&lt;li&gt;You need to be &lt;strong&gt;AI-agent ready&lt;/strong&gt; - which means the nav must be in initial HTML, not only after JS runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If all five apply, the pattern pays for itself in weeks, not quarters.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The same shell is about to land on two more stacks in the same group - a greenfield Magento storefront rewrite and a full Laravel marketing rewrite. Both will consume the existing &lt;code&gt;manifest.json&lt;/code&gt; unchanged. Zero additional shell work, the same integration footprint per host. That is the portability proof.&lt;/p&gt;

&lt;p&gt;If the pattern looks like it might fit your stack, the interesting conversation is not "how do I build a shell" - Vite's library mode docs will get you there in an afternoon. The interesting conversation is the SEO contract and shipping Level 2 on day one instead of retrofitting it after Google Search Console punishes you.&lt;/p&gt;

&lt;p&gt;For a mid-market team, that is usually the real decision framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Level 1:&lt;/strong&gt; &lt;code&gt;&amp;lt;noscript&amp;gt;&lt;/code&gt; links when the risk window is small and the nav is shallow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 2:&lt;/strong&gt; build-time SSR fallback when you need full crawler-visible structure without adding a Node service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level 3:&lt;/strong&gt; full SSR service when the nav is dynamic enough that static fallback HTML becomes a maintenance problem.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where outside perspective usually helps, and where I spend most of my consulting time.&lt;/p&gt;

</description>
      <category>microfrontend</category>
      <category>ecommerce</category>
      <category>magneto</category>
      <category>laravel</category>
    </item>
    <item>
      <title>MCP vs CLI for AI Agents: A Real AWS Benchmark (and Why the Popular Narrative Asks the Wrong Question)</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Tue, 21 Apr 2026 19:12:16 +0000</pubDate>
      <link>https://forem.com/webramos/mcp-vs-cli-for-ai-agents-a-real-aws-benchmark-and-why-the-popular-narrative-asks-the-wrong-4h8</link>
      <guid>https://forem.com/webramos/mcp-vs-cli-for-ai-agents-a-real-aws-benchmark-and-why-the-popular-narrative-asks-the-wrong-4h8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Full code, aggregated numbers (n=10 across 5 tasks and 5 transports), and 8 hand-picked example runs live in the &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;&lt;code&gt;mcp-vs-cli-aws-benchmark&lt;/code&gt;&lt;/a&gt; repo. This article is a condensed version of &lt;code&gt;docs/findings.md&lt;/code&gt; from the same repo, rewritten for a reader who doesn't have an hour to study the test harness.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The question in the title is wrong.&lt;/strong&gt; "MCP or CLI?" assumes they have the same use case and one of them is objectively better. In reality it's a trade-off between two currencies: &lt;strong&gt;engineering time&lt;/strong&gt; vs. &lt;strong&gt;input tokens per run&lt;/strong&gt;, and you need both numbers to decide.&lt;/p&gt;

&lt;p&gt;I compared &lt;strong&gt;raw aws CLI&lt;/strong&gt; against the official &lt;strong&gt;awslabs.aws-api-mcp-server&lt;/strong&gt; on five read-only tasks against a real production AWS account. The model is Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (no Claude Code and no claude-agent-sdk, to avoid poisoning the context). Ground truth is collected via boto3, verification is automatic. n=10 per &lt;code&gt;(task, transport)&lt;/code&gt; cell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; a well-designed CLI tool beats awslabs MCP by 43-60% on input tokens on &lt;strong&gt;every one&lt;/strong&gt; of the five tasks, at equal success rate. But it takes half a day of engineering work per service.&lt;/p&gt;

&lt;p&gt;If you run 200 agent invocations a day, put MCP in and forget about it. If you run 200,000, sit down and write your own tool wrapper, following the checklist at the end of the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this whole debate comes from
&lt;/h2&gt;

&lt;p&gt;Since February 2026, dev Twitter and dev.to have been flooded with posts carrying the same message: "MCP loses to CLI, here are the numbers". Titles like &lt;a href="https://jannikreinhard.com/2026/02/22/why-cli-tools-are-beating-mcp-for-ai-agents/" rel="noopener noreferrer"&gt;"Why CLI Tools Are Beating MCP for AI Agents"&lt;/a&gt;, &lt;a href="https://www.scalekit.com/blog/mcp-vs-cli-use" rel="noopener noreferrer"&gt;"MCP vs CLI: Benchmarking AI Agent Cost &amp;amp; Reliability"&lt;/a&gt;, and &lt;a href="https://oneuptime.com/blog/post/2026-02-03-cli-is-the-new-mcp/view" rel="noopener noreferrer"&gt;"Why CLI is the New MCP for AI Agents"&lt;/a&gt;. They all cite the same Scalekit benchmark, which reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP is 10-32x more expensive than CLI&lt;/strong&gt; on input tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; CLI 100%, MCP 72% (all of the failures were TCP timeouts connecting to the GitHub Copilot MCP server).&lt;/li&gt;
&lt;li&gt;Example: a simple "what language is this repo?" query - CLI 1,365 tokens, MCP 44,026 tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authors' explanation: &lt;strong&gt;schema dump&lt;/strong&gt;. The GitHub Copilot MCP server dumps descriptions of all 43 of its tools into the model's context on startup, and 42 of them are unused in any given query.&lt;/p&gt;

&lt;p&gt;The problem is that this benchmark is &lt;strong&gt;n=1 on a single service&lt;/strong&gt;, with one kind of MCP server ("fat", per-resource). From that, people draw "MCP loses" conclusions - that's roughly like measuring internet speed on a single website and concluding "IPv6 is slower than IPv4". There is a useful signal, but no grounds for generalisation.&lt;/p&gt;

&lt;p&gt;I decided to reproduce the comparison on a different service (AWS), with a larger n, and in a setting where the MCP server is &lt;strong&gt;not&lt;/strong&gt; designed as a "fat" directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS has already done its homework
&lt;/h2&gt;

&lt;p&gt;The first thing I found when I went to look at &lt;code&gt;awslabs/mcp&lt;/code&gt; was &lt;strong&gt;not&lt;/strong&gt; what I had expected. Following the Scalekit GitHub Copilot MCP analogy, I was expecting to see dozens of per-resource MCP servers: &lt;code&gt;awslabs/ec2&lt;/code&gt;, &lt;code&gt;awslabs/s3&lt;/code&gt;, &lt;code&gt;awslabs/iam&lt;/code&gt;, each with their own 20-30 tools (&lt;code&gt;describe_instances&lt;/code&gt;, &lt;code&gt;run_instances&lt;/code&gt;, &lt;code&gt;terminate_instances&lt;/code&gt;, &lt;code&gt;modify_instance_attribute&lt;/code&gt;...). That would have been a clean schema dump in the context of a single task.&lt;/p&gt;

&lt;p&gt;In reality, the main AWS MCP server - &lt;a href="https://awslabs.github.io/mcp/servers/aws-api-mcp-server" rel="noopener noreferrer"&gt;&lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt;&lt;/a&gt; - is built very differently. It exposes &lt;strong&gt;three&lt;/strong&gt; tools:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;call_aws&lt;/code&gt; - takes an aws CLI command string (or an array of up to 20 commands for batch mode) and runs it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;suggest_aws_commands&lt;/code&gt; - natural language to a list of candidate aws CLI commands. The authors explicitly mark it as &lt;code&gt;FALLBACK&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_execution_plan&lt;/code&gt; - multi-step plans, experimental, gated behind an environment variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By default &lt;strong&gt;two&lt;/strong&gt; are published (without &lt;code&gt;get_execution_plan&lt;/code&gt;). And there is a built-in &lt;code&gt;READ_OPERATIONS_ONLY=true&lt;/code&gt; switch - you can tell the server "describe/list/get only" and it will refuse everything else at the server level.&lt;/p&gt;
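
&lt;p&gt;Booting the server in that mode from a client is a few lines with the &lt;code&gt;mcp&lt;/code&gt; Python SDK over stdio. A sketch - it assumes &lt;code&gt;uvx&lt;/code&gt; is installed and AWS credentials are already in the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Boot awslabs.aws-api-mcp-server read-only over stdio (sketch).
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

params = StdioServerParameters(
    command="uvx",
    args=["awslabs.aws-api-mcp-server@latest"],
    env={**os.environ, "READ_OPERATIONS_ONLY": "true"},
)

async def main() -&amp;gt; None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # two tools by default: call_aws and suggest_aws_commands
            print([t.name for t in tools.tools])

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
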

&lt;p&gt;This is an important engineering choice: AWS itself acknowledged the schema-dump problem and &lt;strong&gt;opted out&lt;/strong&gt; of a fat MCP server in favour of a wrapper over the CLI living under the MCP protocol. Comparing such a wrapper against "raw CLI" is a far more honest experiment than repeating Scalekit on the GitHub MCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;The details (runner code, ground-truth script, whitelist) are in the &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. Here is the compressed version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 read-only tasks&lt;/strong&gt; against a production-like AWS account:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;What it tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;simple&lt;/td&gt;
&lt;td&gt;List running EC2 in &lt;code&gt;us-west-2&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;One API call + filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_policy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;edge&lt;/td&gt;
&lt;td&gt;Bucket policy for a single bucket&lt;/td&gt;
&lt;td&gt;Handling of an optional resource&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_regions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chained&lt;/td&gt;
&lt;td&gt;All S3 buckets + region of each&lt;/td&gt;
&lt;td&gt;List + per-item lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iam_admin_roles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;filter&lt;/td&gt;
&lt;td&gt;IAM roles with &lt;code&gt;AdministratorAccess&lt;/code&gt; policy&lt;/td&gt;
&lt;td&gt;Pagination + content filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chained&lt;/td&gt;
&lt;td&gt;CloudWatch CPU over 60 min for running EC2&lt;/td&gt;
&lt;td&gt;Composition + time windows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The correct answer for &lt;code&gt;iam_admin_roles&lt;/code&gt; in my account is an &lt;strong&gt;empty list&lt;/strong&gt;. A separate honesty test: will the model make up role names?&lt;/p&gt;
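
&lt;p&gt;Ground truth for that cell, for instance, is a single boto3 call - a sketch, with pagination elided:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Ground truth for iam_admin_roles: which roles have the policy attached?
import boto3

iam = boto3.client("iam")
resp = iam.list_entities_for_policy(
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
    EntityFilter="Role",
)
print(sorted(r["RoleName"] for r in resp["PolicyRoles"]))  # [] in this account
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
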

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; Claude Sonnet 4.6, direct Anthropic API, my own minimal agent loop (~150 lines; a condensed sketch follows this list). Why not &lt;code&gt;claude-agent-sdk&lt;/code&gt; or Claude Code? See the "methodology notes" section below - that choice cost me a day and a half.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transports:&lt;/strong&gt; CLI - &lt;code&gt;subprocess.run(['aws', ...])&lt;/code&gt; behind a whitelist. MCP - the &lt;code&gt;mcp&lt;/code&gt; python lib, which boots &lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt; via &lt;code&gt;uvx&lt;/code&gt; stdio and performs a real MCP handshake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; a dedicated IAM user &lt;code&gt;mcp-benchmark&lt;/code&gt; with &lt;code&gt;ReadOnlyAccess&lt;/code&gt; + a local command whitelist. Two lines of defence, in case the model tries to break something.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification:&lt;/strong&gt; a boto3 script captures ground truth before the benchmark, a verifier compares the model's JSON response automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n=10 per cell&lt;/strong&gt;; medians reported for the main metrics.&lt;/li&gt;
&lt;/ul&gt;
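
&lt;p&gt;The loop is small enough to show. A condensed sketch of its shape - illustrative, not the repo's exact code; the model id and the whitelist contents are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal agent loop over the CLI transport (sketch).
import shlex
import subprocess

import anthropic

MODEL = "claude-sonnet-4-6"  # placeholder id
ALLOWED = {("ec2", "describe-instances"), ("iam", "list-roles")}  # excerpt

TOOLS = [{
    "name": "aws_cli",
    "description": "Run a read-only aws CLI command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def run_aws(command: str) -&amp;gt; str:
    argv = shlex.split(command)
    # second line of defence after the ReadOnlyAccess IAM user
    if argv[:1] != ["aws"] or tuple(argv[1:3]) not in ALLOWED:
        return f"blocked: {command}"
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=120)
    return proc.stdout or proc.stderr

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "List running EC2 in us-west-2 as JSON."}]
for _ in range(15):  # max_turns
    resp = client.messages.create(
        model=MODEL, max_tokens=2048, tools=TOOLS, messages=messages)
    if resp.stop_reason != "tool_use":
        break
    results = [{"type": "tool_result", "tool_use_id": b.id,
                "content": run_aws(b.input["command"])}
               for b in resp.content if b.type == "tool_use"]
    messages += [{"role": "assistant", "content": resp.content},
                 {"role": "user", "content": results}]
print(resp.content[-1].text)  # the model's final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
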

&lt;h2&gt;
  
  
  First attempt: CLI loses everywhere
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spoiler for anyone who won't read to the end:&lt;/strong&gt; everything you are about to see - CLI failing two tasks, 60% success rate, a naive strategy with 36 tool calls - turned into the opposite result three days later: a CLI that beats MCP by 43-60% on tokens. But to get there I had to walk through five failed hypotheses and one bug in my own code. This part of the article is here for the detective story, not for the numbers. The numbers are at the end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On the pilot run with three transports (plain &lt;code&gt;cli&lt;/code&gt;, &lt;code&gt;cli&lt;/code&gt; with an enriched description, &lt;code&gt;mcp&lt;/code&gt;) the picture looked like a confirmation of the Scalekit narrative. On &lt;code&gt;iam_admin_roles&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cli&lt;/code&gt; plain: &lt;strong&gt;36 tool calls&lt;/strong&gt;, 20k input tokens, 68 seconds. Strategy: &lt;code&gt;list-roles&lt;/code&gt; + &lt;code&gt;list-attached-role-policies&lt;/code&gt; for each of the 34 roles in the account.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;mcp&lt;/code&gt;: &lt;strong&gt;1 tool call&lt;/strong&gt;, 5k input tokens, 4 seconds. One command: &lt;code&gt;iam list-entities-for-policy --policy-arn ... --entity-filter Role&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The same model on the same prompt made a different command choice.&lt;/strong&gt; On MCP - perfect; on CLI - the naive, linear-complexity path.&lt;/p&gt;

&lt;p&gt;Even scarier was &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;. CLI failed in &lt;strong&gt;60% of cases&lt;/strong&gt;: it hit the max_turns limit trying to guess the correct timestamp for CloudWatch &lt;code&gt;get-metric-statistics&lt;/code&gt;. I looked at the logs and saw commands with &lt;code&gt;--start-time 2025-05-16T...&lt;/code&gt;, &lt;code&gt;--start-time 2025-07-14T...&lt;/code&gt; - the model clearly had no idea what year it was.&lt;/p&gt;

&lt;p&gt;MCP in the same conditions made 3 calls, always with correct 2026 timestamps, 100% success.&lt;/p&gt;

&lt;p&gt;This looked like a ready-made "CLI loses" article. Good thing I didn't stop there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five hypotheses, five ablation experiments
&lt;/h2&gt;

&lt;p&gt;Before publishing results like that, I wanted to understand &lt;strong&gt;why&lt;/strong&gt;. "MCP is smarter" is not an explanation, it's a description. Sonnet 4.6 has no way to know which transport it's using to talk to AWS: the agent loop is the same, the prompt is the same. Something &lt;strong&gt;structural&lt;/strong&gt; in the MCP transport was making the model behave differently.&lt;/p&gt;

&lt;p&gt;What follows is five controlled experiments. Each time I took the CLI transport and added &lt;strong&gt;one&lt;/strong&gt; trait from the MCP world to test an isolated hypothesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 1: tool description length and structure.&lt;/strong&gt; awslabs's &lt;code&gt;call_aws&lt;/code&gt; description is ~3000 characters with examples and best practices. My &lt;code&gt;aws_cli&lt;/code&gt; was ~500. I wrote &lt;code&gt;tools_cli_rich.py&lt;/code&gt; with a description of the same length, including a direct hint: "For 'find roles attached to policy X', use &lt;code&gt;iam list-entities-for-policy --policy-arn ... --entity-filter Role&lt;/code&gt; instead of listing every role and inspecting each one."&lt;/p&gt;

&lt;p&gt;Result on &lt;code&gt;iam_admin_roles&lt;/code&gt;: &lt;strong&gt;37 tool calls&lt;/strong&gt;, the same naive strategy. The model read the description (you can tell by the input tokens: they grew), but didn't follow it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 2: the presence of a second "hinter" tool.&lt;/strong&gt; Besides &lt;code&gt;call_aws&lt;/code&gt;, awslabs exposes &lt;code&gt;suggest_aws_commands&lt;/code&gt;, whose description includes an example: "List all IAM users who have AdministratorAccess policy". Maybe the mere presence of this description in context works as "scaffolding", even if the model never actually calls &lt;code&gt;suggest_aws_commands&lt;/code&gt; itself?&lt;/p&gt;

&lt;p&gt;I made &lt;code&gt;tools_cli_with_fake_suggest.py&lt;/code&gt;: a second tool that returns an error when called, with a &lt;strong&gt;verbatim&lt;/strong&gt; copy of awslabs's &lt;code&gt;suggest_aws_commands&lt;/code&gt; description. Result: &lt;strong&gt;35 tool calls&lt;/strong&gt;, the same naive strategy. The model did &lt;strong&gt;not&lt;/strong&gt; call the fake &lt;code&gt;suggest_aws_commands&lt;/code&gt; (because the description says in black and white "use only when uncertain") - it just read it. And that didn't help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 3: tool and parameter names.&lt;/strong&gt; awslabs's tool is called &lt;code&gt;call_aws&lt;/code&gt; with a &lt;code&gt;cli_command&lt;/code&gt; parameter. Mine was &lt;code&gt;aws_cli&lt;/code&gt; with a &lt;code&gt;command&lt;/code&gt; parameter. Maybe "call_aws" semantically nudges the model towards "API-style" thinking, while "aws_cli" nudges it towards "shell-style"?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tools_cli_renamed.py&lt;/code&gt;: renamed everything, even added a &lt;code&gt;max_results&lt;/code&gt; parameter for full parity. Result: &lt;strong&gt;39 tool calls&lt;/strong&gt;, naive strategy. This hypothesis was a miss too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis 4: MCP capabilities / prompts / resources.&lt;/strong&gt; Maybe the MCP server passes something beyond the tool list to the model? The protocol has three other channels: &lt;code&gt;prompts&lt;/code&gt; (system prompts from the server), &lt;code&gt;resources&lt;/code&gt; (documents for RAG) and &lt;code&gt;instructions&lt;/code&gt; (system-level instructions).&lt;/p&gt;

&lt;p&gt;I wrote a diagnostic script and asked the server directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capabilities: experimental={} logging=LoggingCapability()
              prompts=PromptsCapability(listChanged=False)
              resources=ResourcesCapability(subscribe=False, listChanged=False)
              tools=ToolsCapability(listChanged=True)
instructions: None
prompts: 0
resources: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server &lt;strong&gt;declares&lt;/strong&gt; the capability but publishes nothing. &lt;code&gt;instructions&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt;. It really does send the model only the tool list and nothing else.&lt;/p&gt;
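
&lt;p&gt;The probe itself is a handful of lines on top of the same stdio session setup as in the earlier sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# What does the server actually publish beyond the tool list? (sketch)
async def probe(session: ClientSession) -&amp;gt; None:
    init = await session.initialize()
    print("capabilities:", init.capabilities)
    print("instructions:", init.instructions)
    print("prompts:", len((await session.list_prompts()).prompts))
    print("resources:", len((await session.list_resources()).resources))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
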

&lt;p&gt;&lt;strong&gt;Hypothesis 5: runtime context in the system prompt.&lt;/strong&gt; This was the most productive one. I made a &lt;code&gt;cli-ctx&lt;/code&gt; transport - the same &lt;code&gt;aws_cli&lt;/code&gt;, but with four extra lines in the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime context (provided by the runner, not by the tool):
- Current UTC time: 2026-04-08T23:06:57Z
- Default AWS region: us-west-2
- This account is real and live; commands return real data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four lines. 118 tokens.&lt;/p&gt;

&lt;p&gt;And here is what happened on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;, n=3:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Wall&lt;/th&gt;
&lt;th&gt;Success&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cli&lt;/code&gt; plain&lt;/td&gt;
&lt;td&gt;13-15&lt;/td&gt;
&lt;td&gt;26-55k&lt;/td&gt;
&lt;td&gt;50-70s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;13.4k&lt;/td&gt;
&lt;td&gt;14s&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cli-ctx&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.1k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cli-ctx&lt;/code&gt; didn't just catch up with MCP - it &lt;strong&gt;beat&lt;/strong&gt; it. Three times fewer input tokens and faster wall-clock.&lt;/p&gt;

&lt;p&gt;Where did the effect come from? I went into the MCP server logs and looked at &lt;strong&gt;what exactly&lt;/strong&gt; it returns to the model in each tool result. And here's what was in the very first &lt;code&gt;call_aws&lt;/code&gt; response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"ResponseMetadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"RequestId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"HTTPStatusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"HTTPHeaders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wed, 08 Apr 2026 00:15:21 GMT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The awslabs MCP server passes the &lt;strong&gt;full HTTP headers&lt;/strong&gt; from the AWS API back, including &lt;code&gt;date&lt;/code&gt;. Raw aws CLI v2 returns only the response body without headers. The model on MCP knows, from the very first tool call, what today's date is; the model on raw CLI does not, because its training cutoff is somewhere in 2025, and it honestly assumes it's still 2025.&lt;/p&gt;

&lt;p&gt;The entire gap on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt; was explained by an &lt;strong&gt;HTTP Date header leaking through the MCP abstraction&lt;/strong&gt;. Four lines in the system prompt reproduce the effect for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That was the moment I rethought all the previous results.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three mechanisms I found and closed
&lt;/h2&gt;

&lt;p&gt;The first mechanism - &lt;strong&gt;effect A, HTTP metadata&lt;/strong&gt; - is already covered in the previous section. Runtime context in the system prompt closed the failures on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;, and that's the most important of the three effects. But on &lt;code&gt;iam_admin_roles&lt;/code&gt; (36 vs 1) and &lt;code&gt;s3_bucket_regions&lt;/code&gt; (16 vs 2) the gap remained. So there had to be at least one more thing going on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Effect B: batch calling
&lt;/h3&gt;

&lt;p&gt;On &lt;code&gt;s3_bucket_regions&lt;/code&gt; in the MCP run I looked at the second tool call and saw this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;call_aws&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cli_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws s3api get-bucket-location --bucket bucket-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws s3api get-bucket-location --bucket bucket-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An array of 15 commands. In a single call. I went to the &lt;code&gt;call_aws&lt;/code&gt; description and found this section:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Batch Running:&lt;/strong&gt; The tool can also run multiple independent commands at the same time. Call this tool with multiple CLI commands whenever possible. You can call at most 20 CLI commands in batch mode.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So &lt;code&gt;cli_command&lt;/code&gt; accepts &lt;code&gt;anyOf string | array of strings&lt;/code&gt;, and the server executes them in parallel inside its own process, returning the results together. The model reads this and uses it.&lt;/p&gt;

&lt;p&gt;My original &lt;code&gt;aws_cli&lt;/code&gt; accepted only a string. I wrote &lt;code&gt;tools_cli_v2.py&lt;/code&gt;: added batch support to the input schema, rewrote the description following the same structure as awslabs's, and added parallel execution via &lt;code&gt;asyncio.gather&lt;/code&gt;.&lt;/p&gt;
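&lt;p&gt;A minimal sketch of that batch path, assuming every command has already passed the whitelist (&lt;code&gt;run_one&lt;/code&gt; and &lt;code&gt;run_batch&lt;/code&gt; are illustrative names, not the repo's exact code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the batch execution path (simplified).
import asyncio

async def run_one(cmd: str) -&gt; str:
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    if proc.returncode != 0:
        return f"[exit={proc.returncode}] {err.decode()}"
    return out.decode()

async def run_batch(cli_command) -&gt; str:
    # Accept a single string or a list, mirroring the anyOf input schema
    cmds = [cli_command] if isinstance(cli_command, str) else list(cli_command)
    results = await asyncio.gather(*(run_one(c) for c in cmds))
    # Index headers so the model can match each result to its command
    return "\n\n".join(f"[{i + 1}/{len(cmds)}]\n{r}" for i, r in enumerate(results))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;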

&lt;p&gt;On &lt;code&gt;s3_bucket_regions&lt;/code&gt; this instantly cut the tool call count from 16 to 2 - exactly like MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Effect C: "smart" command choice - turned out to be a benchmark bug
&lt;/h3&gt;

&lt;p&gt;But on &lt;code&gt;iam_admin_roles&lt;/code&gt; the effect remained. The model on &lt;code&gt;cli-v2&lt;/code&gt; kept doing 36 calls. I was convinced this was some subtle property of how the model selects commands, and I was preparing an "unexplained mystery" section for the article.&lt;/p&gt;

&lt;p&gt;Then I ran &lt;code&gt;cli-v2 iam_admin_roles&lt;/code&gt; again and carefully looked at the raw trace instead of the aggregated numbers. Here is the first tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. aws_cli (0ms, error=True)
   aws iam list-entities-for-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
     --entity-filter Role --output json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Execution time 0ms. error=True.&lt;/strong&gt; The model &lt;strong&gt;immediately&lt;/strong&gt; tried the right command - exactly the same one MCP uses. And got an error. Not from AWS - the error never reached AWS. The error came from my own &lt;code&gt;safety.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-roles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-attached-role-policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# list-entities-for-policy WAS NOT IN THIS LIST
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote the whitelist based on how &lt;em&gt;I&lt;/em&gt; pictured this task being solved. And I put in exactly the commands needed for the naive path. The model on CLI tried the optimal command, got rejected, fell back to the naive path and conscientiously walked through all 36 roles.&lt;/p&gt;
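&lt;p&gt;The guard that produced that rejection is roughly this shape - a simplified sketch, not the exact code from &lt;code&gt;safety.py&lt;/code&gt; (which also blocks pipes, &lt;code&gt;--profile&lt;/code&gt; and substitution):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified sketch of the whitelist check.
import shlex

def check(command: str, allowed: dict) -&gt; str | None:
    """Return an error string for the model if the command is rejected."""
    parts = shlex.split(command)
    if len(parts) &lt; 3 or parts[0] != "aws":
        return "[exit=1] only 'aws SERVICE OPERATION ...' is allowed"
    service, operation = parts[1], parts[2]
    if operation not in allowed.get(service, set()):
        return f"[exit=1] '{service} {operation}' is not in the whitelist"
    return None  # command may proceed to execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;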

&lt;p&gt;The awslabs MCP server has its own allowlist - significantly broader. And &lt;code&gt;list-entities-for-policy&lt;/code&gt; is allowed there.&lt;/p&gt;

&lt;p&gt;This was a &lt;strong&gt;benchmark bug&lt;/strong&gt;, not a property of MCP. I added one line to the whitelist:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-roles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-attached-role-policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list-entities-for-policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;- this one
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And re-ran &lt;code&gt;cli-v2 iam_admin_roles&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;Calls&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Wall&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;cli&lt;/code&gt; plain&lt;/td&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;20k&lt;/td&gt;
&lt;td&gt;68s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5k&lt;/td&gt;
&lt;td&gt;4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;cli-v2&lt;/code&gt; (whitelist fixed)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Exactly one tool call. And at the same time fewer input tokens than MCP, because we have one tool description of ~3000 characters and MCP has two descriptions totalling ~5800 characters.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;methodologically&lt;/strong&gt; important point for anyone who wants to reproduce a benchmark like this: &lt;strong&gt;your own whitelist can silently determine the outcome&lt;/strong&gt;. If the allowlist only covers the commands needed for the naive strategy, you aren't measuring the transport, you're measuring your whitelist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final table: cli-full vs mcp at n=10
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cli-full&lt;/code&gt; combines every fix from the three mechanisms in a single transport:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Batch input&lt;/strong&gt; (cli-v2 tool spec).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich tool description&lt;/strong&gt; with batch examples and best practices (cli-v2).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime context&lt;/strong&gt; in the system prompt (cli-ctx).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad whitelist&lt;/strong&gt; with &lt;code&gt;list-entities-for-policy&lt;/code&gt; and everything else needed for the optimal path.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At n=10 per cell, median:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;cli-full input&lt;/th&gt;
&lt;th&gt;mcp input&lt;/th&gt;
&lt;th&gt;Δ input&lt;/th&gt;
&lt;th&gt;cli-full calls&lt;/th&gt;
&lt;th&gt;mcp calls&lt;/th&gt;
&lt;th&gt;cli-full ok%&lt;/th&gt;
&lt;th&gt;mcp ok%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_running&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3,053&lt;/td&gt;
&lt;td&gt;5,368&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;90%*&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_policy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,975&lt;/td&gt;
&lt;td&gt;5,425&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-45%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3_bucket_regions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,801&lt;/td&gt;
&lt;td&gt;14,317&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-60%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;iam_admin_roles&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,934&lt;/td&gt;
&lt;td&gt;5,213&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-44%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ec2_cpu_last_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,345&lt;/td&gt;
&lt;td&gt;9,461&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-44%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;* the single failure on &lt;code&gt;ec2_running cli-full #9&lt;/code&gt; was an &lt;code&gt;HTTP 529 Overloaded&lt;/code&gt; from the Anthropic API. That's infrastructure noise, not a transport problem. I deliberately did not retry failed runs to avoid masking real failures - and this lone 529 made it into the stats as 90%. MCP could just as easily have caught the same 529; it just got lucky.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cli-full beats MCP on input tokens on every one of the five tasks, 43-60%. Success rate - parity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On wall clock MCP wins on 4 of 5 tasks. Reason: wall clock is dominated by AWS API call time, not by model turn time. Tokens don't translate directly into seconds. The only wall-clock win for CLI is &lt;code&gt;s3_bucket_regions&lt;/code&gt;, where MCP spends time marshalling a 15-item batch through its protocol layer, and my &lt;code&gt;asyncio.gather&lt;/code&gt; does not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The right question: how much is your engineering time worth
&lt;/h2&gt;

&lt;p&gt;This is where the popular "CLI is better than MCP" narrative breaks.&lt;/p&gt;

&lt;p&gt;My &lt;code&gt;cli-full&lt;/code&gt; is a few hundred lines of code and &lt;strong&gt;half a day of debugging&lt;/strong&gt;. A tool wrapper with a whitelist, a rich description copied from awslabs best practices, batch support via &lt;code&gt;asyncio.gather&lt;/code&gt;, a system prompt with runtime context, verify + ground truth for a specific task. And that's only for AWS. For GCP, for Linear, for Notion - everything from scratch.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt; is &lt;strong&gt;one command&lt;/strong&gt; (&lt;code&gt;uvx awslabs.aws-api-mcp-server@latest&lt;/code&gt;) and one environment variable. Works with every AWS service, not with five tasks. Best practices are already baked in by the authors (who &lt;strong&gt;know&lt;/strong&gt; AWS better than I do). Updates come with &lt;code&gt;@latest&lt;/code&gt;. Read-only mode is an environment variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP buys you service knowledge at the price of tokens; a custom CLI buys token efficiency at the price of engineering labour.&lt;/strong&gt; It's a question of which currency you pay for your agent in: person-hours or tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to choose MCP
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High velocity.&lt;/strong&gt; New project, the agent has to work tomorrow. MCP installs in 30 seconds and covers everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad surface.&lt;/strong&gt; The agent pokes at EC2, S3, IAM, Lambda, CloudWatch, RDS, ECS. Writing a CLI wrapper for each service is an unrealistic budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polyglot environment.&lt;/strong&gt; AWS today, GCP tomorrow, Notion the day after. Hand-writing a CLI wrapper per service doesn't scale; installing a ready-made MCP server per service does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're not an expert on the service.&lt;/strong&gt; You don't know by heart that &lt;code&gt;list-entities-for-policy&lt;/code&gt; is more efficient than &lt;code&gt;list-attached-role-policies&lt;/code&gt; in a loop. The awslabs authors do. You reuse their knowledge by paying a few extra tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low QPS.&lt;/strong&gt; A few hundred agent invocations a day. Saving 8k tokens per request is a few dollars a month. Engineering time costs more.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  When to choose a purpose-built CLI
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High-QPS production.&lt;/strong&gt; A million calls a day x 8k extra tokens x $3/M input = $24k a day on top. That's millions a year - orders of magnitude more than it costs to have the tool wrapper written once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Narrow, stable task set.&lt;/strong&gt; The agent does five specific things. A narrow whitelist and a short description will be more compact than any universal MCP server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full control over the context.&lt;/strong&gt; Every token in the system prompt and tool description is yours. No ~3KB of hidden awslabs guidance, no update surprises, no external dependency that might suddenly change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance / audit.&lt;/strong&gt; Every tool call is visible, every input is validated by your code, every failure mode is known. MCP adds a protocol layer between you and the AWS API that some audits won't accept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already have the knowledge.&lt;/strong&gt; If you know how to work with the service efficiently, you can bake that knowledge into the tool description once and reuse it forever.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Checklist: how to build a cli-full equivalent
&lt;/h2&gt;

&lt;p&gt;If after all this you've decided your use case is CLI, here are six items that turn "raw &lt;code&gt;subprocess.run&lt;/code&gt;" into something that beats awslabs MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Accept batch input.&lt;/strong&gt; Tool input schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"cli_command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"anyOf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the model passes an array, the runner executes the commands in parallel via &lt;code&gt;asyncio.gather&lt;/code&gt; (or equivalent) and returns the results in list order with index headers &lt;code&gt;[1/15]&lt;/code&gt;, &lt;code&gt;[2/15]&lt;/code&gt;... Saves 10-20x on tool calls for tasks where one command has to be run with different parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Put runtime context in the system prompt.&lt;/strong&gt; Minimum - four lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime context (provided by the runner, not by the tool):
- Current UTC time: &amp;lt;now&amp;gt;
- Default region: &amp;lt;region&amp;gt;
- Identity: &amp;lt;arn&amp;gt;
- This account is real and live; commands return real data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This closes a whole class of problems where the model gets confused about dates, regions, or thinks it's working against documentation rather than production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Write a rich tool description.&lt;/strong&gt; Aim for 2500-3000 characters. A structure that works (copying awslabs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short tool description (1 sentence).&lt;/li&gt;
&lt;li&gt;Key constraints (allowed commands, region defaults, auth model).&lt;/li&gt;
&lt;li&gt;A "Best practices" section - how to pick commands, when to use batch, when to use &lt;code&gt;--query&lt;/code&gt; and &lt;code&gt;--filters&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;An "Anti-patterns" section - an explicit "don't list-then-iterate if there's a more specific operation".&lt;/li&gt;
&lt;li&gt;2-3 concrete examples covering different task categories.&lt;/li&gt;
&lt;li&gt;Restrictions: no shell pipes, no &lt;code&gt;--profile&lt;/code&gt;, no substitution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model reads this as a cookbook. A badly written description means the model writes naive commands.&lt;/p&gt;
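&lt;p&gt;As a skeleton - illustrative wording, not the literal description shipped in the repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Execute read-only AWS CLI commands. Accepts one command or a batch of up to 20.

Constraints: whitelisted read-only operations only; default region applies
unless --region is passed; credentials come from the runner.

Best practices:
- Prefer one specific operation over list-then-iterate
  (e.g. list-entities-for-policy instead of looping over every role).
- Batch independent commands into a single call whenever possible.
- Use --query and --filters to trim output server-side.

Anti-patterns:
- Do not enumerate resources one call at a time when a batch will do.

Examples:
- "aws ec2 describe-instances --filters Name=instance-state-name,Values=running"
- ["aws s3api get-bucket-location --bucket a",
   "aws s3api get-bucket-location --bucket b"]

Restrictions: no shell pipes, no --profile, no command substitution.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;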

&lt;p&gt;&lt;strong&gt;4. The whitelist must cover the &lt;em&gt;optimal&lt;/em&gt; commands, not just the "obvious" ones.&lt;/strong&gt; This is the point that cost me half a day. Ask yourself: "what would a senior AWS engineer write for this task?" - and make sure that command is in the whitelist. Not just the commands needed for the naive strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Return structured output, not prose.&lt;/strong&gt; Always &lt;code&gt;--output json&lt;/code&gt; + truncate to a fixed byte budget with an explicit truncation marker. The model has to know that the response was truncated.&lt;/p&gt;
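&lt;p&gt;A sketch of that truncation - the 16 KiB budget is an arbitrary example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: fixed byte budget with an explicit truncation marker.
MAX_BYTES = 16_384  # arbitrary example budget

def truncate(output: str) -&gt; str:
    data = output.encode()
    if len(data) &lt;= MAX_BYTES:
        return output
    kept = data[:MAX_BYTES].decode(errors="ignore")
    dropped = len(data) - MAX_BYTES
    return kept + f"\n[TRUNCATED: {dropped} bytes dropped - refine the query]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;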

&lt;p&gt;&lt;strong&gt;6. Forward tool errors to the model verbatim.&lt;/strong&gt; When a command fails, return &lt;code&gt;[exit=N] &amp;lt;stderr&amp;gt;&lt;/code&gt; to the model as-is. It can self-correct on the next turn. Silent failures waste turns for nothing.&lt;/p&gt;
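&lt;p&gt;In code this is the whole trick - a sketch for the single-command, synchronous case:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: forward failures verbatim so the model can self-correct next turn.
import subprocess

def run(cmd: list) -&gt; str:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return f"[exit={proc.returncode}] {proc.stderr.strip()}"
    return proc.stdout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;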

&lt;p&gt;Following these six rules turns the CLI wrapper from a parody of a tool into something that actually beats awslabs MCP on tokens. Takes half a day per service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology notes
&lt;/h2&gt;

&lt;p&gt;Three things I spent time on and which are worth knowing if you want to reproduce a benchmark like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First: claude-agent-sdk and Claude Code poison the context.&lt;/strong&gt; For the first two days I was measuring CLI vs MCP through &lt;code&gt;claude-agent-sdk&lt;/code&gt;, and the numbers were wild. 30k input tokens on a "how many running EC2" task. I thought for a long time that it was protocol overhead, but no - it was Claude Code through the SDK dragging my &lt;strong&gt;entire&lt;/strong&gt; user-level &lt;code&gt;~/.claude.json&lt;/code&gt; into the context: figma MCP, pencil MCP, PubMed MCP, Gmail, Calendar, Bash, Edit, Read... 40+ tools from other servers I hadn't asked for. I rewrote the runner onto the direct Anthropic API - cache_read dropped from 30k to 0, input tokens dropped to "normal" 2k on a simple task. If you are benchmarking agents through someone else's ready-made harness, check &lt;strong&gt;with your own eyes&lt;/strong&gt; what exactly goes into the model on the first system turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second: your own whitelist is an invisible benchmark variable.&lt;/strong&gt; I already wrote about this in the "effect C" section. I'll repeat: &lt;strong&gt;any&lt;/strong&gt; safety / security / validation layer between the model and the real service is &lt;strong&gt;part&lt;/strong&gt; of what you are measuring, even if you don't consciously think of it that way. If your whitelist forces the model into a narrow path, you are measuring the model's behaviour in that narrow path, not the model's behaviour in general.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third: &lt;code&gt;success_rate&lt;/code&gt; and retry policy.&lt;/strong&gt; One of my &lt;code&gt;cli-full ec2_running&lt;/code&gt; runs fell over with an HTTP 529 Overloaded from the Anthropic API. In the stats that's 90% success rate, even though it's not a transport issue. I decided &lt;strong&gt;not&lt;/strong&gt; to retry, because then the risk of masking real problems is too high. The article has to mention that 529 explicitly - otherwise the reader will compare 100% MCP against "90%" CLI and draw the wrong conclusion. Retry policy is yet another invisible variable the benchmark has to state out loud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;Everything is in a public repo: &lt;a href="https://github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/mcp-vs-cli-aws-benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's in there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;src/agent_loop.py&lt;/code&gt; - ~150 lines of a self-contained agent loop on the direct Anthropic API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/tools_cli.py&lt;/code&gt;, &lt;code&gt;tools_cli_v2.py&lt;/code&gt;, &lt;code&gt;tools_mcp.py&lt;/code&gt; - CLI and MCP transports. Plus the ablation variants (&lt;code&gt;tools_cli_rich.py&lt;/code&gt;, &lt;code&gt;tools_cli_renamed.py&lt;/code&gt;, &lt;code&gt;tools_cli_with_fake_suggest.py&lt;/code&gt;) from the "five hypotheses" section.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/runner.py&lt;/code&gt; - CLI for running &lt;code&gt;--tasks &amp;lt;ids&amp;gt; --transports &amp;lt;ids&amp;gt; --n &amp;lt;N&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/aggregate.py&lt;/code&gt; - medians + IQR + success rate from raw JSONL.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/safety.py&lt;/code&gt; - whitelist + injection guard.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;src/ground_truth.py&lt;/code&gt; - a boto3 script that captures ground truth from a live account (parameterised via &lt;code&gt;BENCH_S3_BUCKET&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;results/scrubbed/final_summary.json&lt;/code&gt; - aggregated numbers at n=10 across all &lt;code&gt;(task, transport)&lt;/code&gt; cells. These are the same numbers as in the tables above, in machine-readable form.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;results/scrubbed/sample_runs.jsonl&lt;/code&gt; - 8 hand-curated runs, one per key storyline in the article: naive CLI on &lt;code&gt;iam_admin_roles&lt;/code&gt; (36 calls), MCP on the same task (1 call), cli-full (1 call); CLI failure on &lt;code&gt;ec2_cpu_last_hour&lt;/code&gt; due to 2025 timestamps vs the cli-ctx fix; naive CLI on &lt;code&gt;s3_bucket_regions&lt;/code&gt; (16 calls) vs MCP with batch (2 calls) vs cli-full with batch (2 calls). All role, bucket and instance names are replaced with &lt;code&gt;role-N&lt;/code&gt;, &lt;code&gt;bucket-N&lt;/code&gt;, &lt;code&gt;i-instanceNN&lt;/code&gt;. Metrics and full model response text are preserved.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docs/findings.md&lt;/code&gt; - extended analytical notes, part of which went into this article.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why there are &lt;strong&gt;no&lt;/strong&gt; full 250 raw runs in the repo: the raw JSONL files contain real IAM role names, S3 bucket names and EC2 instance IDs from my AWS account, &lt;strong&gt;woven into free-form text&lt;/strong&gt; of model responses and batch commands. They can't be auto-scrubbed without a manual mapping for every name, and one missed line is a leak. So the repo only includes what I reviewed by eye: the aggregated &lt;code&gt;final_summary.json&lt;/code&gt; and 8 curated sample runs. If you want to see a full dataset, the best way to get a correct one is to run the benchmark on your own account in ~20 minutes (see below).&lt;/p&gt;

&lt;p&gt;To run the benchmark under your own account:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a dedicated IAM user with the &lt;code&gt;ReadOnlyAccess&lt;/code&gt; policy + any extra grants for your tasks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cp .env.example .env&lt;/code&gt;, fill in &lt;code&gt;AWS_PROFILE&lt;/code&gt;, &lt;code&gt;AWS_REGION&lt;/code&gt;, &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, &lt;code&gt;BENCH_S3_BUCKET&lt;/code&gt; (the name of any bucket in your account for the bucket-policy task).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.ground_truth&lt;/code&gt; - captures ground truth for your account.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.runner --n 10&lt;/code&gt; - runs the full series, ~15-20 minutes, ~$5-10 on the Anthropic API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m src.aggregate results/raw/*.jsonl&lt;/code&gt; - prints the table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you repeat this on your own stack and get different numbers - drop me a line, I'd love to compare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The popular "MCP loses to CLI" narrative rests on a single benchmark&lt;/strong&gt; (Scalekit, n=1, GitHub Copilot MCP). It is correct &lt;strong&gt;in its own conditions&lt;/strong&gt;, but generalising from it to "MCP is bad" is a mistake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS has already solved the schema-dump problem&lt;/strong&gt; in &lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt;. Their flagship MCP server is essentially the CLI with two tools, and that's a fair benchmark partner for raw aws CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On a fair 5-task series at n=10, &lt;code&gt;cli-full&lt;/code&gt; beats MCP on input tokens by 43-60% on every task.&lt;/strong&gt; But that takes writing a tool wrapper, a whitelist, a system prompt, a rich description. Half a day of engineering per service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The real question isn't "MCP or CLI" but "how much does your engineering time cost vs how much do your tokens cost".&lt;/strong&gt; MCP wins on velocity, broad surface, polyglot, low-QPS. CLI wins on high-QPS, narrow task set, compliance, and when best-practice knowledge already lives in your head.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three gap mechanisms&lt;/strong&gt; - HTTP metadata, batch calling, a broad allowlist - are &lt;strong&gt;reproducible&lt;/strong&gt; in a CLI tool via 4 lines in the system prompt, &lt;code&gt;anyOf string | array&lt;/code&gt; in the input schema, and one line in the whitelist. None of them is a structural property of the MCP protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodologically&lt;/strong&gt; - check with your own eyes what goes into the model's context, treat your own whitelist as a benchmark variable, and state your retry policy explicitly when reporting success rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If after all this you look at your own use case and decide you want a well-designed CLI tool, the six-item checklist is above. If you decide you want MCP - &lt;code&gt;uvx awslabs.aws-api-mcp-server@latest&lt;/code&gt; and you're in the game.&lt;/p&gt;

&lt;p&gt;Both options are &lt;strong&gt;correct answers to different questions&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>mcp</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>YAML vs Markdown vs JSON vs TOON: Which Format Is Most Efficient for the Claude API</title>
      <dc:creator>Webmaster Ramos</dc:creator>
      <pubDate>Tue, 14 Apr 2026 20:22:36 +0000</pubDate>
      <link>https://forem.com/webramos/yaml-vs-markdown-vs-json-vs-toon-which-format-is-most-efficient-for-the-claude-api-4l94</link>
      <guid>https://forem.com/webramos/yaml-vs-markdown-vs-json-vs-toon-which-format-is-most-efficient-for-the-claude-api-4l94</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;My own benchmark across three Claude tiers (Haiku, Sonnet, Opus): 120 data files, 8 real-world scenarios, 5 formats. Tokens, cost, and accuracy – numbers, not opinions.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  You Are Overpaying for Prompts
&lt;/h2&gt;

&lt;p&gt;Every time you send data to the Claude API, the format of that data determines how many tokens you spend. The same 200-product catalog in JSON costs 15,879 tokens. In Markdown, it costs 7,814. In TOON, 6,088. That is a 62% difference.&lt;/p&gt;

&lt;p&gt;A 120-task list? JSON consumes 8,500 tokens. TOON uses 2,267. Savings: 73%.&lt;/p&gt;

&lt;p&gt;The problem is that every existing benchmark focuses on GPT, Gemini, and Llama. There has not been a public benchmark for Claude. I decided to fix that.&lt;/p&gt;

&lt;p&gt;I ran 450 API calls on Claude Haiku 4.5, tested Sonnet 4.6 and Opus 4.6, and counted tokens across 120 files using Anthropic’s production tokenizer. Eight real-world scenarios, five formats. This article covers the results, the conclusions, and specific recommendations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Formats at a Glance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  JSON (JavaScript Object Notation)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2001; ECMA-404 standard (2013)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; Douglas Crockford&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; APIs, data exchange between systems, configuration files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; strict typing, nesting via &lt;code&gt;{}&lt;/code&gt; and &lt;code&gt;[]&lt;/code&gt;, mandatory quotes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JSON is the lingua franca of programmatic interfaces. Every API speaks JSON, and every language can parse it. But that universality comes at a price in an LLM context: quotes, braces, and commas all consume tokens. They carry syntactic weight, but not semantic meaning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"products"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mouse"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;29.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"in_stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  YAML (YAML Ain't Markup Language)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2001; YAML 1.2 standard (2009)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authors:&lt;/strong&gt; Clark Evans, Ingy döt Net, Oren Ben-Kiki&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; configuration files (Docker Compose, Kubernetes, GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; indentation-based structure, minimal punctuation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;YAML is the de facto standard of the DevOps world. It reads like pseudocode and usually does not require quotes. The trade-off is that repeating keys for every array item eats up much of the punctuation savings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;products&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Mouse&lt;/span&gt;
    &lt;span class="na"&gt;price&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;29.99&lt;/span&gt;
    &lt;span class="na"&gt;in_stock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Markdown
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2004&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; John Gruber (with Aaron Swartz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; documentation, READMEs, blogs, wikis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; human-first syntax – headings &lt;code&gt;#&lt;/code&gt;, tables &lt;code&gt;|&lt;/code&gt;, lists &lt;code&gt;-&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Markdown is the most “native” format for LLMs. Models have been trained on billions of READMEs and wiki pages. GitHub, Notion, Obsidian – all rely on Markdown. It is a communication format, not a data format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Products&lt;/span&gt;

| ID | Name  | Price | In Stock |
|----|-------|-------|----------|
| 1  | Mouse | 29.99 | Yes      |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Plain Text
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; human communication – emails, notes, instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; no syntax, no markup, maximum flexibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plain text with no markup. It minimizes token overhead, but it provides no explicit structure for programmatic data extraction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Products: Mouse (ID 1, $29.99, in stock)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TOON (Token-Oriented Object Notation)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Year created:&lt;/strong&gt; 2025 (v1.0 – November 2025, MIT license)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author:&lt;/strong&gt; open-source community (&lt;a href="https://github.com/toon-format/toon" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Primary use case:&lt;/strong&gt; token optimization in LLM prompts, replacing JSON in AI workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key characteristic:&lt;/strong&gt; a YAML + CSV hybrid (indentation for objects, row-style encoding for arrays)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The newest format in this comparison. TOON was created for one purpose: minimize tokens while preserving lossless JSON round-tripping. For arrays of homogeneous objects, field names are declared once and values are written as CSV-style rows. On GPT-5 Nano, it showed 99.4% accuracy with 46% token savings. Before this benchmark, it had not been tested on Claude.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;products[1]{id,name,price,in_stock}:
1,Mouse,29.99,true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
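&lt;p&gt;To make the row-style encoding concrete, here is a toy encoder for flat arrays of homogeneous objects - &lt;strong&gt;not&lt;/strong&gt; the official implementation (quoting, nesting and escaping live in the spec), just the shape of the idea:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy TOON-style encoder: field names declared once, values as CSV-style rows.
def to_toon(name, rows):
    fields = list(rows[0])
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    def cell(value):
        # TOON spells booleans lowercase, like JSON
        return str(value).lower() if isinstance(value, bool) else str(value)
    lines = [",".join(cell(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + lines)

print(to_toon("products", [{"id": 1, "name": "Mouse", "price": 29.99, "in_stock": True}]))
# products[1]{id,name,price,in_stock}:
# 1,Mouse,29.99,true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;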






&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I Tested
&lt;/h3&gt;

&lt;p&gt;Eight scenarios, each in three sizes (S / M / L), each in five formats. Total: 120 data files.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;S&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;L&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;System prompt / instructions&lt;/td&gt;
&lt;td&gt;Rules, sections&lt;/td&gt;
&lt;td&gt;10 rules&lt;/td&gt;
&lt;td&gt;30 rules&lt;/td&gt;
&lt;td&gt;60 rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Product catalog&lt;/td&gt;
&lt;td&gt;Tabular data&lt;/td&gt;
&lt;td&gt;20 products&lt;/td&gt;
&lt;td&gt;100 products&lt;/td&gt;
&lt;td&gt;200 products&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Roadmap / tasks&lt;/td&gt;
&lt;td&gt;Statuses, dependencies&lt;/td&gt;
&lt;td&gt;15 tasks&lt;/td&gt;
&lt;td&gt;50 tasks&lt;/td&gt;
&lt;td&gt;120 tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Business rules&lt;/td&gt;
&lt;td&gt;Conditional logic&lt;/td&gt;
&lt;td&gt;8 rules&lt;/td&gt;
&lt;td&gt;25 rules&lt;/td&gt;
&lt;td&gt;50 rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Few-shot classification&lt;/td&gt;
&lt;td&gt;Input-output examples&lt;/td&gt;
&lt;td&gt;5 examples&lt;/td&gt;
&lt;td&gt;15 examples&lt;/td&gt;
&lt;td&gt;40 examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Organizational hierarchy&lt;/td&gt;
&lt;td&gt;3 levels of nesting&lt;/td&gt;
&lt;td&gt;12 people&lt;/td&gt;
&lt;td&gt;60 people&lt;/td&gt;
&lt;td&gt;150 people&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;Endpoints, parameters&lt;/td&gt;
&lt;td&gt;5 endpoints&lt;/td&gt;
&lt;td&gt;15 endpoints&lt;/td&gt;
&lt;td&gt;30 endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Output format&lt;/td&gt;
&lt;td&gt;Requesting data in a given format&lt;/td&gt;
&lt;td&gt;10 countries&lt;/td&gt;
&lt;td&gt;50 countries&lt;/td&gt;
&lt;td&gt;100 countries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Few-shot&lt;/strong&gt; (scenario 5) is a prompting technique in which several “input → output” examples are included directly in the prompt so the model can infer the task from a pattern. For example: &lt;code&gt;"Great product!" → positive&lt;/code&gt;, &lt;code&gt;"Terrible quality" → negative&lt;/code&gt;, then the question &lt;code&gt;"Love it!" → ?&lt;/code&gt;. Zero examples is zero-shot, one example is one-shot, several examples is few-shot. The format of those examples directly affects cost: 40 pairs in JSON take 2,131 tokens; in TOON, 996.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For scenarios 2, 3, 6, and 7, I prepared questions with precomputed correct answers (ground truth). For scenarios 1, 4, and 5, scoring was manual and rubric-based. For scenario 8, I measured output tokens and format compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Models and Pricing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Input ($/1M)&lt;/th&gt;
&lt;th&gt;Output ($/1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku 4.5&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Mid&lt;/td&gt;
&lt;td&gt;$3&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.6&lt;/td&gt;
&lt;td&gt;Premium&lt;/td&gt;
&lt;td&gt;$15&lt;/td&gt;
&lt;td&gt;$75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Accuracy was measured across all three tiers on sizes S and M; the L size was used only for token counts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Clean-Test Principle
&lt;/h3&gt;

&lt;p&gt;All requests were sent directly via the &lt;code&gt;anthropic&lt;/code&gt; Python SDK: plain &lt;code&gt;client.messages.create()&lt;/code&gt; with &lt;code&gt;temperature=0&lt;/code&gt;. No MCP servers, IDE plugins, or agent frameworks.&lt;/p&gt;

&lt;p&gt;Token counting was done with &lt;code&gt;client.messages.count_tokens()&lt;/code&gt; – Anthropic’s production tokenizer, i.e. the same numbers used for billing. &lt;strong&gt;The tokenizer is the same across all Claude tiers&lt;/strong&gt; – so the token-count data applies to all Claude models.&lt;/p&gt;
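&lt;p&gt;The whole harness fits in a few lines - a minimal sketch, where the model id string is illustrative and &lt;code&gt;data_block&lt;/code&gt; stands in for one of the 120 generated files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the clean-test setup: direct SDK, temperature=0, no middleware.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

data_block = "products[1]{id,name,price,in_stock}:\n1,Mouse,29.99,true"
messages = [{"role": "user", "content": data_block + "\n\nIs the Mouse in stock?"}]

# Billing-accurate input count - the same tokenizer for every Claude tier
count = client.messages.count_tokens(model="claude-haiku-4-5", messages=messages)
print(count.input_tokens)

# The benchmark call itself: bare messages.create, deterministic output
reply = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=256,
    temperature=0,
    messages=messages,
)
print(reply.content[0].text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;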

&lt;p&gt;Benchmark code: &lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Input-Token Efficiency
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;These numbers apply to all Claude tiers – Haiku, Sonnet, and Opus all use the same tokenizer. The only cost difference comes from the price per token.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Summary Table: Average Input Tokens Across All Scenarios
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Average tokens&lt;/th&gt;
&lt;th&gt;vs JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-53%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-57%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;TOON saves 62% of input tokens on average versus JSON. Markdown saves 53%. YAML, despite its minimal punctuation, saves only 32% – because of repeated keys and indentation overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Breakdown by Scenario (% Savings vs JSON, L-size)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;YAML&lt;/th&gt;
&lt;th&gt;MD&lt;/th&gt;
&lt;th&gt;TXT&lt;/th&gt;
&lt;th&gt;TOON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instructions&lt;/td&gt;
&lt;td&gt;-22%&lt;/td&gt;
&lt;td&gt;-29%&lt;/td&gt;
&lt;td&gt;-24%&lt;/td&gt;
&lt;td&gt;-24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Products&lt;/td&gt;
&lt;td&gt;-29%&lt;/td&gt;
&lt;td&gt;-51%&lt;/td&gt;
&lt;td&gt;-53%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;-35%&lt;/td&gt;
&lt;td&gt;-63%&lt;/td&gt;
&lt;td&gt;-69%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-73%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business Rules&lt;/td&gt;
&lt;td&gt;-28%&lt;/td&gt;
&lt;td&gt;-52%&lt;/td&gt;
&lt;td&gt;-48%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-63%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot&lt;/td&gt;
&lt;td&gt;-31%&lt;/td&gt;
&lt;td&gt;-45%&lt;/td&gt;
&lt;td&gt;-37%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-53%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchy&lt;/td&gt;
&lt;td&gt;-37%&lt;/td&gt;
&lt;td&gt;-61%&lt;/td&gt;
&lt;td&gt;-67%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-68%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Docs&lt;/td&gt;
&lt;td&gt;-35%&lt;/td&gt;
&lt;td&gt;-45%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-59%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  YAML Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz60wk1w9ooculhvvjok1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz60wk1w9ooculhvvjok1.png" alt="YAML savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  MD Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vl0ox06w09jse0ncgq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vl0ox06w09jse0ncgq.png" alt="MD savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  TXT Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu83ntyoo8rm0ug98vbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu83ntyoo8rm0ug98vbx.png" alt="TXT savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  TOON Savings vs JSON (%, L-size)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070ng6789odiolaota0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070ng6789odiolaota0y.png" alt="TOON savings vs JSON by scenario" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Charts by Scenario
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Instructions
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzchje0o1csh4gvxsnh3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzchje0o1csh4gvxsnh3j.png" alt="Input tokens: Instructions" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Products
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjvnbl7gwa854kp4jv79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjvnbl7gwa854kp4jv79.png" alt="Input tokens: Products" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Tasks
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc0qtuu6gqkkt9veu9vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flc0qtuu6gqkkt9veu9vm.png" alt="Input tokens: Tasks" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Rules
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu0vc317rzolm6qit0gc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiu0vc317rzolm6qit0gc.png" alt="Input tokens: Rules" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Few-shot
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rys14l7hw6f7n6pk8q3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rys14l7hw6f7n6pk8q3.png" alt="Input tokens: Few-shot" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: Hierarchy
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foayupph0gz3j4x0rigs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foayupph0gz3j4x0rigs0.png" alt="Input tokens: Hierarchy" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Input tokens by scenario: API Docs
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw984vwh94vw6r6e1uzod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw984vwh94vw6r6e1uzod.png" alt="Input tokens: API Docs" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Observations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TOON is the clear leader for tabular data.&lt;/strong&gt; Product catalogs, task lists, few-shot examples – anything that looks like an array of homogeneous objects. Savings: 62–73% versus JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Markdown is the best all-purpose format.&lt;/strong&gt; A stable 50–65% reduction across all data types. It is the only format that performs consistently well across tables, instructions, and hierarchies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML is underwhelming.&lt;/strong&gt; Many people expect YAML to be much more compact than JSON. In practice, the savings are only 14–41%. The reason is repeated keys for every array element.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plain Text wins on API docs.&lt;/strong&gt; For technical specifications, plain text is more efficient than TOON (59% vs 53%). Without extra syntax, descriptive text compresses better.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scale barely affects the percentage savings.&lt;/strong&gt; The difference between S and L is under 2 percentage points. Format drives efficiency more than data volume does.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Haiku 4.5: When Format Matters
&lt;/h2&gt;

&lt;p&gt;Haiku is the most format-sensitive tier. In 35% of questions, it produced different answers depending on the input format. Across the accuracy matrix, the spread between the best and the worst format-scenario combination reached 36 percentage points.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy by Scenario
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Products (product catalog)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3auwn1sahrmazcfa9ih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3auwn1sahrmazcfa9ih.png" alt="Accuracy Haiku: Products" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Tasks (tasks / roadmap)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rf6ypezr9byqkghfq7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rf6ypezr9byqkghfq7q.png" alt="Accuracy Haiku: Tasks" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: Hierarchy (organizational hierarchy)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3f3q2rmsufbuj0z6726.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3f3q2rmsufbuj0z6726.png" alt="Accuracy Haiku: Hierarchy" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Accuracy Haiku: API Docs (documentation)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoum4j3h0tmlw7e36qlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsoum4j3h0tmlw7e36qlw.png" alt="Accuracy Haiku: API Docs" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;YAML&lt;/th&gt;
&lt;th&gt;MD&lt;/th&gt;
&lt;th&gt;TXT&lt;/th&gt;
&lt;th&gt;TOON&lt;/th&gt;
&lt;th&gt;Best&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Products&lt;/td&gt;
&lt;td&gt;63.4%&lt;/td&gt;
&lt;td&gt;61.4%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;66.2%&lt;/td&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.7%&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;td&gt;56.7%&lt;/td&gt;
&lt;td&gt;65.3%&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchy&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;78.2%&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Docs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;57.1%&lt;/td&gt;
&lt;td&gt;78.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;85.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSON/YAML/TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Hierarchy is where YAML pulls ahead:&lt;/strong&gt; 92.9% vs 78.2% for Plain Text. Tree-like structures are clearly easier for Haiku to parse in an indentation-based format. The headline 36-point spread pairs the best cell in this table (YAML on Hierarchy, 92.9%) with the worst (Markdown on API Docs, 57.1%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Docs: Markdown performs unexpectedly poorly&lt;/strong&gt; – 57.1% vs 85.7% for JSON. For technical specifications with parameters and types, explicit structure matters more than compactness.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy by Size (Haiku)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S (small data)&lt;/td&gt;
&lt;td&gt;80.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M (medium data)&lt;/td&gt;
&lt;td&gt;67.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Scale matters more than format.&lt;/strong&gt; Accuracy drops by 13 points when moving from S to M – more than the average difference between formats (5.7 points). The implication is straightforward: reduce data volume first, then optimize format.&lt;/p&gt;
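
&lt;p&gt;In code, that ordering is trivial to enforce: trim before you serialize. A minimal sketch with hypothetical catalog records (the field names are ours, not from the benchmark):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical records: only three of the many fields matter
# for the question we are about to ask the model.
all_products = [
    {"id": 1, "name": "Mouse", "price": 29.9, "ean": "4006381333931",
     "warehouse": "DE-7", "created_at": "2025-11-02"},
    {"id": 2, "name": "Keyboard", "price": 49.9, "ean": "4006381333948",
     "warehouse": "DE-7", "created_at": "2025-11-03"},
]

def trim(records, fields, limit=50):
    """Keep only the fields the task needs and cap the record count."""
    return [{f: r[f] for f in fields} for r in records[:limit]]

compact = trim(all_products, fields=("id", "name", "price"))
# Serialize `compact`, not `all_products`: volume first, format second.
&lt;/code&gt;&lt;/pre&gt;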

&lt;h3&gt;
  
  
  Cost: Haiku
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0026&lt;/td&gt;
&lt;td&gt;$260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0018&lt;/td&gt;
&lt;td&gt;$177&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0012&lt;/td&gt;
&lt;td&gt;$121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0011&lt;/td&gt;
&lt;td&gt;$111&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0010&lt;/td&gt;
&lt;td&gt;$98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$162/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Output Format: Haiku
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Output tokens: S-size (10 countries) – Haiku, Sonnet, Opus
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Output tokens: M-size (50 countries) – Haiku, Sonnet, Opus
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requested format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;th&gt;Savings vs JSON&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;465&lt;/td&gt;
&lt;td&gt;1,985&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;296&lt;/td&gt;
&lt;td&gt;1,352&lt;/td&gt;
&lt;td&gt;-32% to -36%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;165&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,125&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43% to -65%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;294&lt;/td&gt;
&lt;td&gt;1,381&lt;/td&gt;
&lt;td&gt;-30% to -37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;342&lt;/td&gt;
&lt;td&gt;1,369&lt;/td&gt;
&lt;td&gt;-26% to -31%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Markdown is the cheapest output format on Haiku.&lt;/strong&gt; 165 vs 465 tokens on S-size – a 65% reduction. At $4 per 1M output tokens, that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important: TOON loses on output.&lt;/strong&gt; Haiku does not know the TOON format and, instead of producing compact CSV-like rows, tends to emit verbose plain text that only vaguely resembles TOON. A few-shot example improves TOON output quality, but it still trails Markdown in efficiency.&lt;/p&gt;
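
&lt;p&gt;The few-shot nudge can be as small as one sample block in the prompt. A minimal sketch (the country rows are invented; the shape is the tabular TOON form used throughout this benchmark):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FEW_SHOT_TOON = """Answer in TOON format. Example of the expected shape:

countries[2]{name,capital,population_m}:
  France,Paris,68.2
  Spain,Madrid,48.3

Now produce the requested rows in exactly this layout."""

# Prepend FEW_SHOT_TOON to the user message before calling Haiku;
# without it, Haiku tends to fall back to verbose plain text.
&lt;/code&gt;&lt;/pre&gt;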

&lt;h3&gt;
  
  
  Output-Format Choice: Technical Requirements
&lt;/h3&gt;

&lt;p&gt;Output cost is not the only thing that matters. Often, Claude’s response must be processed programmatically – parsed, inserted into a database, or passed to another service. The best output format depends on who or what is going to read it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage scenario&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User-facing answer in UI&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Renders natively, lowest token cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend parsing&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reliable, universal, guaranteed structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config / YAML pipeline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human-readable + machine-parsable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rows for CSV / spreadsheet&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minimal overhead, structure via delimiters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compact output for TOON SDK&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only if using Opus, or with a few-shot example&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt; if a human reads the output, use Markdown. If code reads it, use JSON or YAML. Do not optimize output cost at the expense of parsing reliability in production.&lt;/p&gt;
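
&lt;p&gt;Being defensive is cheap even with JSON: models occasionally wrap the object in a Markdown fence. A minimal parsing sketch (the helper name is ours):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json

def parse_model_json(text):
    """Strip an optional ```json fence, then parse strictly."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]    # drop the opening fence line
        cleaned = cleaned.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(cleaned)                 # raises on malformed output
&lt;/code&gt;&lt;/pre&gt;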

&lt;h3&gt;
  
  
  Recommendations for Haiku
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;Best input&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Best output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;stable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalogs, lists&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks / roadmap&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;71.0%&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MD&lt;/strong&gt; or &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;92.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON or YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;65.3% (-0.5 pts vs JSON)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;MD&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On Haiku, format matters – especially for hierarchies and API documentation. Use TOON on input where token savings are worth a small accuracy trade-off, but &lt;strong&gt;do not use TOON on output&lt;/strong&gt; without a few-shot example.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sonnet 4.6: Format Affects Cost, Not Quality
&lt;/h2&gt;

&lt;p&gt;Sonnet 4.6 produced identical answers across all five formats. In 100% of questions, the result was the same regardless of how the data was represented. For Sonnet, format optimization is pure cost reduction with no quality trade-off.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy: Format-Invariant
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Accuracy by model and format
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faman5i1lej65040r6mqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faman5i1lej65040r6mqc.png" alt="Accuracy by model and format" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The answers are completely identical across all formats. Switching from JSON to TOON saves 62% of input tokens while preserving the same output.&lt;/p&gt;
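
&lt;p&gt;For flat, uniform arrays – the case where TOON shines – the conversion is mechanical. A minimal hand-rolled sketch, assuming every record has the same keys and no commas inside values (official TOON SDKs exist if you would rather not maintain this yourself):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def to_toon(name, records):
    """Serialize a uniform list of dicts as a tabular TOON block."""
    fields = list(records[0])
    header = f"{name}[{len(records)}]{{{','.join(fields)}}}:"
    rows = ["  " + ",".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header] + rows)

data = [{"id": 1, "name": "Mouse"}, {"id": 2, "name": "Keyboard"}]
print(to_toon("products", data))
# products[2]{id,name}:
#   1,Mouse
#   2,Keyboard
&lt;/code&gt;&lt;/pre&gt;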

&lt;h3&gt;
  
  
  Cost: Sonnet
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0098&lt;/td&gt;
&lt;td&gt;$975&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0066&lt;/td&gt;
&lt;td&gt;$663&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0045&lt;/td&gt;
&lt;td&gt;$454&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0042&lt;/td&gt;
&lt;td&gt;$417&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0037&lt;/td&gt;
&lt;td&gt;$368&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$607/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 100K requests per month, switching from JSON to TOON saves $607/month. On Sonnet, output costs $15 per 1M tokens, so output optimization also matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Format: Sonnet
&lt;/h3&gt;

&lt;p&gt;Output tokens for Sonnet (estimated as characters ÷ 3.5 chars/token):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;~210&lt;/td&gt;
&lt;td&gt;~1,120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;~195&lt;/td&gt;
&lt;td&gt;~1,023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~143&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~746&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;~103&lt;/td&gt;
&lt;td&gt;~549&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~86&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~414&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparison of output tokens across all three models (S-size):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;M-size (50 countries):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On Sonnet, TOON output requires a few-shot example.&lt;/strong&gt; Without extra context, Sonnet reads “TOON format” as something cartoon-related and returns an irrelevant answer. With a format example in the prompt (like the sample block shown in the Haiku section), it generates correct TOON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical requirements for output on Sonnet&lt;/strong&gt; are the same as on Haiku: if a downstream system parses the response programmatically, use JSON or YAML. If a human is going to read it, use Markdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for Sonnet
&lt;/h3&gt;

&lt;p&gt;On Sonnet, format choice is a pure cost optimization. The logic is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input data:&lt;/strong&gt; use TOON (for tables) or MD (for instructions / hierarchies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable output:&lt;/strong&gt; Markdown (-65% vs JSON)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-parsed output:&lt;/strong&gt; JSON (most reliable) or YAML (more compact, still parseable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON output:&lt;/strong&gt; add a few-shot example to the prompt; otherwise the answer may be incorrect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optimal prompt design: &lt;strong&gt;MD for instructions + TOON for data + a request for MD/JSON output&lt;/strong&gt;.&lt;/p&gt;
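
&lt;p&gt;A prompt skeleton following that design might look like this (the task text and data are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# MD for instructions, TOON for data, explicit output format at the end.
prompt = """## Task

Pick the cheapest product and justify the choice in one line.

## Data

products[2]{id,name,price}:
  1,Mouse,29.9
  2,Keyboard,49.9

## Output

Return a Markdown table with columns: name, price, reason."""
&lt;/code&gt;&lt;/pre&gt;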




&lt;h2&gt;
  
  
  Opus 4.6: Maximum Capability, Also Format-Invariant
&lt;/h2&gt;

&lt;p&gt;Opus 4.6 is the strongest model and the most expensive one. Like Sonnet, it is completely insensitive to input format. But Opus has one unique advantage: it knows TOON “out of the box.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy: Format-Invariant
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The answers are 100% identical across all formats. Changing format affects only cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost: Opus
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Avg tokens&lt;/th&gt;
&lt;th&gt;Cost / request&lt;/th&gt;
&lt;th&gt;100K requests / month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;3,252&lt;/td&gt;
&lt;td&gt;$0.0488&lt;/td&gt;
&lt;td&gt;$4,878&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;2,208&lt;/td&gt;
&lt;td&gt;$0.0331&lt;/td&gt;
&lt;td&gt;$3,312&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;1,514&lt;/td&gt;
&lt;td&gt;$0.0227&lt;/td&gt;
&lt;td&gt;$2,271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TXT&lt;/td&gt;
&lt;td&gt;1,391&lt;/td&gt;
&lt;td&gt;$0.0209&lt;/td&gt;
&lt;td&gt;$2,087&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;1,226&lt;/td&gt;
&lt;td&gt;$0.0184&lt;/td&gt;
&lt;td&gt;$1,839&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON -&amp;gt; TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-62%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,039/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On Opus, switching from JSON to TOON saves over $3,000/month at 100K requests. Output costs $75 per 1M tokens – so format optimization has the largest financial impact here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Format: Opus
&lt;/h3&gt;

&lt;p&gt;Output tokens for Opus (estimated as characters ÷ 3.5 chars/token):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;S (10 countries)&lt;/th&gt;
&lt;th&gt;M (50 countries)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;~254&lt;/td&gt;
&lt;td&gt;~1,271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;~286&lt;/td&gt;
&lt;td&gt;~1,414&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~177&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~814&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;~194&lt;/td&gt;
&lt;td&gt;~986&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~106&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~543&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparison of output tokens across all three models (S-size):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfvqb33zeladjlrvo53x.png" alt="Output tokens S-size, all 3 models" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;M-size (50 countries):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1q9vzf5yk20kzlf9p6gy.png" alt="Output tokens M-size, all 3 models" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus generates TOON without hints.&lt;/strong&gt; That is the key difference from Sonnet and Haiku. Opus knows the format and produces valid TOON output on the first try.&lt;/p&gt;

&lt;h4&gt;
  
  
  Can Claude generate valid TOON output?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="/media/blog/chart-toon-output.png" class="article-body-image-wrapper"&gt;&lt;img src="/media/blog/chart-toon-output.png" alt="TOON output generation across models"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Without example in prompt&lt;/th&gt;
&lt;th&gt;With few-shot example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Opus 4.6&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Cartoon / irrelevant&lt;/td&gt;
&lt;td&gt;Valid TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haiku 4.5&lt;/td&gt;
&lt;td&gt;Verbose plain text&lt;/td&gt;
&lt;td&gt;Closer to TOON, but still inaccurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In practical terms, this means: if you need TOON output and want it to work reliably without prompt scaffolding, use Opus.&lt;/p&gt;
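
&lt;p&gt;Even on Opus, a cheap structural check before handing TOON output to a parser is worth having. A minimal validator sketch – it only verifies the tabular header contract and the declared row count, nothing more:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

HEADER = re.compile(r"^\w+\[(\d+)\]\{[\w,]+\}:$")

def looks_like_toon(text):
    """True if line one is a tabular TOON header and the declared
    row count matches the number of rows that follow."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    match = HEADER.match(lines[0])
    return bool(match) and len(lines) - 1 == int(match.group(1))
&lt;/code&gt;&lt;/pre&gt;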

&lt;h3&gt;
  
  
  Technical Requirements for Output: When Parsing Matters More Than Cost
&lt;/h3&gt;

&lt;p&gt;On Opus, output costs $75 per 1M tokens – so output-format savings are highly relevant. But the requirements of the downstream system still take priority:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenarios where output must be parsed programmatically:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The response goes into a database or structured store – use &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Another LLM or service consumes the response through an API – use &lt;strong&gt;JSON&lt;/strong&gt; or &lt;strong&gt;YAML&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The response is part of a pipeline (the next step processes the data) – use &lt;strong&gt;JSON&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The response is rendered in the UI as text or a document – use &lt;strong&gt;Markdown&lt;/strong&gt; (lowest token cost)&lt;/li&gt;
&lt;li&gt;You need compact machine-readable output and already have a TOON SDK – use &lt;strong&gt;TOON&lt;/strong&gt; (only Opus works reliably without prompt help)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The key point:&lt;/strong&gt; output on Opus costs $75 per 1M – five times more than input. A 65% output reduction (Markdown vs JSON) can matter even more than input savings. But do not trade away parse reliability just to cut cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendations for Opus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; TOON for tabular data (-62%), MD for instructions (-53%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable output:&lt;/strong&gt; Markdown (-65% output tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-parsed output:&lt;/strong&gt; JSON – reliable and universal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TOON output:&lt;/strong&gt; works without few-shot – Opus’s unique advantage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not use JSON on input:&lt;/strong&gt; it is the most expensive format with no accuracy benefit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy Across All Models and Formats
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;75.3%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML&lt;/td&gt;
&lt;td&gt;75.1%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;69.6%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plain Text&lt;/td&gt;
&lt;td&gt;70.6%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;74.8%&lt;/td&gt;
&lt;td&gt;89.4%&lt;/td&gt;
&lt;td&gt;93.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Sonnet and Opus, format does not affect accuracy. For Haiku, it matters materially – especially for hierarchies and documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Matrix: Input Format
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data type&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;Sonnet / Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System prompts / instructions&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;MD&lt;/strong&gt; (-29%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; or &lt;strong&gt;MD&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catalogs, lists&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TXT&lt;/strong&gt; (70.2%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-62%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tasks / roadmap&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON&lt;/strong&gt; (71.0%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-73%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business rules&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON&lt;/strong&gt; (stable)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-63%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot examples&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (≈JSON)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; (-53%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hierarchies&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;YAML&lt;/strong&gt; (92.9%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TOON&lt;/strong&gt; or &lt;strong&gt;MD&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;JSON/YAML&lt;/strong&gt; (85.7%)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TXT&lt;/strong&gt; (-59%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Decision Matrix: Output Format
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Output consumer&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;th&gt;Haiku&lt;/th&gt;
&lt;th&gt;Sonnet&lt;/th&gt;
&lt;th&gt;Opus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UI / end user&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API / JSON parser&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YAML pipeline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;YAML&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;td&gt;reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON SDK&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TOON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;with few-shot*&lt;/td&gt;
&lt;td&gt;with few-shot*&lt;/td&gt;
&lt;td&gt;native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CSV / spreadsheet&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;TXT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;td&gt;with template&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Requires a few-shot example in the prompt&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy was measured only on S and M sizes.&lt;/strong&gt; For L, only token counts were collected, so accuracy may degrade more sharply on larger data than these numbers show.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The data is synthetic.&lt;/strong&gt; Catalogs and tasks were script-generated. Real-world data may be messier (missing fields, Unicode, long descriptions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic scoring covers 4 of 8 cases.&lt;/strong&gt; Cases 1, 4, and 5 require rubric-based evaluation. The accuracy numbers here cover cases 2, 3, 6, and 7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet / Opus were tested via subscription (subagents).&lt;/strong&gt; Output-token counts are estimated, not directly measured. Haiku was tested via API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No A/B test on live traffic.&lt;/strong&gt; This is a laboratory benchmark. The impact on a production product must be validated separately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code and data are open – reproduce it, extend it, challenge it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Opus and Sonnet are completely insensitive to format.&lt;/strong&gt; I expected a 3–5% gap. I got 0%. For the higher tiers, format is pure cost optimization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;YAML is not as efficient as many assume.&lt;/strong&gt; The expectation is usually “YAML is more compact than JSON.” In practice, the savings averaged only 32% (14–41% depending on scenario). Repeated keys wipe out much of the benefit of removing braces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TOON works on Claude without special training.&lt;/strong&gt; Claude may not have seen much TOON in training data, yet all three tiers parse it correctly – essentially on par with JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Opus knows TOON; Sonnet does not.&lt;/strong&gt; Opus generates valid TOON output without hints. Sonnet interpreted “TOON format” as “cartoon” and produced an irrelevant answer. With a few-shot example, both work correctly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Markdown is the best output format.&lt;/strong&gt; The gap in output tokens between JSON and Markdown is 65%. At $75 per 1M on Opus, that is significant. It is also the only format every tier generates natively without extra prompting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On Haiku, scale matters more than format.&lt;/strong&gt; Accuracy drops from 80.3% (S) to 67.2% (M) – a 13-point drop. The average difference between formats is 5.7 points. On Sonnet and Opus, scale is much less of an issue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Do these results apply to other models (GPT, Gemini)?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The trends are similar, but the numbers differ. Every model has its own tokenizer. On GPT-5 Nano, YAML shows 62% accuracy on nested data (&lt;a href="https://www.improvingagents.com/blog/best-nested-data-format/" rel="noopener noreferrer"&gt;ImprovingAgents&lt;/a&gt;); on Claude Haiku, it reaches 93%. Use these results for Claude, and other benchmarks for other models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How were tokens counted?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;client.messages.count_tokens()&lt;/code&gt; – the standard Anthropic SDK method and production tokenizer. These are the same numbers used for billing. The tokenizer is the same across all tiers.&lt;/p&gt;
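
&lt;p&gt;Reproducing the count for your own payloads takes a few lines. A sketch against the Anthropic Python SDK (the model id is illustrative – use whichever tier you deploy):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def input_tokens(text, model="claude-haiku-4-5"):
    """Count billable input tokens for a single user message."""
    resp = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return resp.input_tokens

# Compare the same payload serialized two ways before picking a format:
# print(input_tokens(json_payload), input_tokens(toon_payload))
&lt;/code&gt;&lt;/pre&gt;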

&lt;p&gt;&lt;strong&gt;Q: Why not test XML?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;XML is rarely used in modern LLM workflows. Existing benchmarks (&lt;a href="https://shshell.com/blog/token-efficiency-module-13-lesson-2-format-comparison" rel="noopener noreferrer"&gt;ShShell&lt;/a&gt;) suggest that XML is significantly more expensive than Markdown in token terms, with comparable or worse accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is TOON a serious format or just hype?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TOON v1.0 was released in November 2025 under MIT, and there are SDKs in 6+ languages. For tabular data, the savings are real – 62% on Claude with JSON-level accuracy. Opus generates TOON output without prompting. Other tiers require a few-shot example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does the input format affect the output format?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Partially. If you provide data in YAML, Claude is more likely to structure its answer with indentation. But an explicit instruction such as “Return as a Markdown table” overrides that tendency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is it worth converting all prompts away from JSON?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At 100K requests/month on Sonnet, moving from JSON to TOON saves $607/month. On Opus, it saves $3,039/month. For hobby projects with 1K requests, the difference is around $6. Run the math on your own usage.&lt;/p&gt;
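
&lt;p&gt;The math is one multiplication. A sketch using this benchmark's average token counts as defaults (swap in your own volume and pricing):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def monthly_savings(requests, tokens_json=3252, tokens_toon=1226,
                    usd_per_mtok=3.0):
    """Input-side savings from moving JSON data to TOON.
    usd_per_mtok defaults to Sonnet-class input pricing."""
    saved = (tokens_json - tokens_toon) * requests
    return saved / 1_000_000 * usd_per_mtok

print(monthly_savings(100_000))  # ~607.8 - matches the Sonnet table above
&lt;/code&gt;&lt;/pre&gt;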

&lt;p&gt;&lt;strong&gt;Q: Can you combine formats in one prompt?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes – and that is usually the recommended approach. Markdown for instructions + TOON for data + a request for output in the format you need. Claude handles multi-format prompts well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Where is the benchmark source code?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;. All 120 data files, 51 questions, ground truth, runner, and scorer are open for reproduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data format in a prompt is not a cosmetic choice. On the Claude API, the gap between JSON and TOON is 62% on input tokens. Markdown saves 65% on output tokens. At 100K requests/month on Opus, that means $3,039 saved on input and even more on output.&lt;/p&gt;

&lt;p&gt;But the main finding is not about tokens. &lt;strong&gt;Claude Sonnet 4.6 and Opus 4.6 are completely insensitive to format.&lt;/strong&gt; They produced 100% identical answers on JSON, YAML, Markdown, Plain Text, and TOON. For the higher tiers, format optimization is pure savings with no quality trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Only Haiku 4.5 is meaningfully format-sensitive&lt;/strong&gt; – and only there does the choice of format affect accuracy (by up to 36 percentage points). On Haiku, format should be matched to data type: YAML for hierarchies, JSON for tasks with dependencies.&lt;/p&gt;

&lt;p&gt;Beyond cost, there are technical requirements: if the output must be parsed programmatically, JSON is more reliable than Markdown. If a human reads the answer, Markdown is cheaper. Opus is the only tier that generates TOON natively; Sonnet and Haiku require a few-shot example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR by tier:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Haiku 4.5&lt;/th&gt;
&lt;th&gt;Sonnet 4.6&lt;/th&gt;
&lt;th&gt;Opus 4.6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Does format affect accuracy?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes, by up to 36 points&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best input (data)&lt;/td&gt;
&lt;td&gt;YAML/JSON/TXT by data type&lt;/td&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;td&gt;TOON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best input (instructions)&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best output (human-readable)&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;td&gt;MD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best output (parsing)&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TOON output without prompt help&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON -&amp;gt; TOON savings&lt;/td&gt;
&lt;td&gt;$162 / 100K&lt;/td&gt;
&lt;td&gt;$607 / 100K&lt;/td&gt;
&lt;td&gt;$3,039 / 100K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Benchmark run in April 2026 on Claude Opus 4.6, Sonnet 4.6, and Haiku 4.5.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;120 data files, 8 scenarios, 3 sizes, 5 formats, 3 models.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;All code and data: &lt;a href="https://github.com/webmaster-ramos/yaml-vs-md-benchmark" rel="noopener noreferrer"&gt;github.com/webmaster-ramos/yaml-vs-md-benchmark&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>claude</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
