<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Marcus Chen</title>
    <description>The latest articles on Forem by Marcus Chen (@marcuswwchen).</description>
    <link>https://forem.com/marcuswwchen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859428%2F572085fe-831d-498b-854b-41102c7902ee.jpg</url>
      <title>Forem: Marcus Chen</title>
      <link>https://forem.com/marcuswwchen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/marcuswwchen"/>
    <language>en</language>
    <item>
      <title>Prefix caching in vLLM under multi-tenant agent traffic</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 26 May 2026 06:35:20 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/prefix-caching-in-vllm-under-multi-tenant-agent-traffic-5e2j</link>
      <guid>https://forem.com/marcuswwchen/prefix-caching-in-vllm-under-multi-tenant-agent-traffic-5e2j</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT drop from 480ms to 110ms on one tenant and stay exactly the same on another. The split wasn't about traffic volume. It was about how each team templated their system prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Our fine-tuning team serves 14 enterprise agents through a shared inference cluster. Four H100 nodes, vLLM 0.6.x, Qwen2.5-32B as the workhorse model. Traffic is bursty. One customer's nightly workflow can hit 8k requests in twenty minutes while another trickles through 30 calls an hour.&lt;/p&gt;

&lt;p&gt;Before turning on prefix caching, TTFT across the cluster sat at 410ms p50 and 1.2s p95. Cost wasn't the urgent problem. Latency was, because agents loop. A 400ms TTFT on a 12-step sequential plan turns into 4.8 seconds of dead time before the user sees anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the cache actually does
&lt;/h2&gt;

&lt;p&gt;vLLM's prefix cache keeps KV blocks for tokens it has already processed. If a new request shares a prefix with something in the cache, those blocks get reused instead of recomputed. The unit is a block (16 tokens by default), so reuse happens in whole blocks: a block that matches only partway is recomputed.&lt;/p&gt;

&lt;p&gt;If your system prompt is 1,024 tokens and identical across requests, you skip prefill for 1,024 tokens. At Qwen2.5-32B prefill speeds, that's roughly 90 to 110ms saved per call on our hardware.&lt;/p&gt;
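
&lt;p&gt;To make the block arithmetic concrete, here is a minimal sketch of how many tokens a new request could reuse. It is not vLLM's implementation: &lt;code&gt;shared_prefix_len&lt;/code&gt; and &lt;code&gt;reusable_tokens&lt;/code&gt; are hypothetical helpers, and plain integers stand in for real tokenizer IDs.&lt;/p&gt;

```python
BLOCK_SIZE = 16  # the default block size mentioned above


def shared_prefix_len(a, b):
    """Count how many leading token IDs two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_tokens(cached_prompt, new_prompt, block_size=BLOCK_SIZE):
    """Tokens whose KV blocks can be reused: only whole shared blocks count."""
    shared = shared_prefix_len(cached_prompt, new_prompt)
    return (shared // block_size) * block_size


# Identical 1,024-token system prompt followed by different user turns.
system = list(range(1024))
print(reusable_tokens(system + [7, 8, 9], system + [5, 6]))  # 1024

# Prompts that diverge at index 47: only the first two full blocks survive.
a = list(range(100))
b = a[:47] + [-1] + a[48:]
print(reusable_tokens(a, b))  # 32
```

&lt;p&gt;The second case is the important one: 47 shared tokens round down to 32, because the third block contains the first differing token.&lt;/p&gt;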

&lt;h2&gt;
  
  
  Where it worked
&lt;/h2&gt;

&lt;p&gt;Tenant A's agent uses a fixed system prompt assembled at deploy time. Same 1,847 tokens for every request, byte-for-byte. After we flipped &lt;code&gt;enable_prefix_caching=True&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFT p50: 480ms → 110ms&lt;/li&gt;
&lt;li&gt;TTFT p95: 1.4s → 280ms&lt;/li&gt;
&lt;li&gt;GPU prefill compute dropped by 38%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Their hit rate ran around 94% steady-state. The 6% misses were cold starts after pod restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it didn't
&lt;/h2&gt;

&lt;p&gt;Tenant B's agent rebuilds its system prompt every call. They inject the current timestamp, a session UUID, and a hash of recent tool outputs into the first 200 tokens. Looked stable on paper. In practice, every request had a unique prefix starting at token 47.&lt;/p&gt;

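&lt;p&gt;vLLM caches at block granularity. One differing token invalidates its own block and every block after it, because a block only matches when everything before it matches too. With the divergence at token 47, at most the first two blocks (32 tokens) were ever reusable. Tenant B's hit rate: 0.3%.&lt;/p&gt;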

&lt;p&gt;We didn't catch this in staging because our staging traffic replays canned prompts. The diff between tenants only showed up under real traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix for Tenant B
&lt;/h2&gt;

&lt;p&gt;I talked their team into pushing the volatile fields to the end of the prompt. Took two hours of refactoring on their side. After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFT p50: 510ms → 145ms&lt;/li&gt;
&lt;li&gt;Hit rate: 0.3% → 87%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then they asked why nobody mentioned this in the vLLM docs. The docs do mention it. Nobody reads docs when defaults already look fine on the neighboring tenant.&lt;/p&gt;
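
&lt;p&gt;The shape of Tenant B's refactor can be sketched roughly like this. The prompt text, field names, and &lt;code&gt;build_prompt&lt;/code&gt; are all hypothetical, not their actual template. The only point is the ordering: byte-stable text first, per-request fields last.&lt;/p&gt;

```python
import os
import uuid
from datetime import datetime, timezone

# Hypothetical static instructions; in practice this is the long system prompt.
STATIC_SYSTEM_PROMPT = "You are an enterprise workflow agent. Follow the tool policy.\n"


def build_prompt(user_message, tool_output_hash):
    """Put byte-stable text first and per-request fields at the very end."""
    volatile = (
        f"timestamp: {datetime.now(timezone.utc).isoformat()}\n"
        f"session: {uuid.uuid4()}\n"
        f"tool_state: {tool_output_hash}\n"
    )
    return STATIC_SYSTEM_PROMPT + user_message + "\n" + volatile


first = build_prompt("Summarize account 42.", "a1b2")
second = build_prompt("Draft a follow-up email.", "c3d4")

# Everything the two requests share sits at the front, where the cache can use it.
shared = os.path.commonprefix([first, second])
print(shared.startswith(STATIC_SYSTEM_PROMPT))  # True
```

&lt;p&gt;A check like the last two lines is cheap to run in CI against real templates, which is roughly the test our canned staging prompts never gave us.&lt;/p&gt;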

&lt;h2&gt;
  
  
  Config
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vllm serve flags we landed on&lt;/span&gt;
&lt;span class="s"&gt;--model Qwen/Qwen2.5-32B-Instruct&lt;/span&gt;
&lt;span class="s"&gt;--enable-prefix-caching&lt;/span&gt;
&lt;span class="s"&gt;--block-size &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;
&lt;span class="s"&gt;--gpu-memory-utilization &lt;/span&gt;&lt;span class="m"&gt;0.92&lt;/span&gt;
&lt;span class="s"&gt;--max-num-seqs &lt;/span&gt;&lt;span class="m"&gt;256&lt;/span&gt;
&lt;span class="s"&gt;--swap-space &lt;/span&gt;&lt;span class="m"&gt;16&lt;/span&gt;
&lt;span class="s"&gt;--preemption-mode recompute&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--preemption-mode recompute&lt;/code&gt; matters under memory pressure. We tried &lt;code&gt;swap&lt;/code&gt; and watched the cache thrash when bursts hit. Recompute throws cache blocks away cleanly instead of evicting them to CPU and back.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Prompt structure&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;TTFT p50 before&lt;/th&gt;
&lt;th&gt;TTFT p50 after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tenant A (fixed)&lt;/td&gt;
&lt;td&gt;Static 1,847-token prefix&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;480ms&lt;/td&gt;
&lt;td&gt;110ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tenant B (before fix)&lt;/td&gt;
&lt;td&gt;Volatile fields at token 47&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;td&gt;510ms&lt;/td&gt;
&lt;td&gt;505ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tenant B (after fix)&lt;/td&gt;
&lt;td&gt;Volatile fields moved to tail&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;td&gt;510ms&lt;/td&gt;
&lt;td&gt;145ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal eval pipeline&lt;/td&gt;
&lt;td&gt;Per-eval unique prompts&lt;/td&gt;
&lt;td&gt;4%&lt;/td&gt;
&lt;td&gt;390ms&lt;/td&gt;
&lt;td&gt;380ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The eval pipeline row is honest. Prefix caching does nothing for workloads where every prompt is genuinely unique. We left it on anyway because the overhead is negligible.&lt;/p&gt;

&lt;p&gt;For routing across providers when we burst beyond self-hosted capacity, we run a small gateway in front (Bifrost is what we landed on, but the principle works with any of them). The local cache only helps for traffic that lands back on our own node, not the failover path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;The cache costs GPU memory. We reserved roughly 14% of HBM for cached blocks at our &lt;code&gt;max-num-seqs&lt;/code&gt; setting. That's KV capacity we can't use for batch concurrency. Worth it for us because TTFT mattered more than throughput. Not worth it if you're optimizing for tokens-per-second on offline batch.&lt;/p&gt;

&lt;p&gt;Cache invalidation is binary at block boundaries. A one-token change at position 0 kills the whole prefix. No fuzzy matching. Semantic-caching products exist for that, but they're a different beast. They cache responses, not KV state, and the failure modes differ.&lt;/p&gt;

&lt;p&gt;The cache is per-node. We have four nodes behind a round-robin LB, so a prompt that has warmed one node still has roughly a 75% chance of landing on a cold one the next time it arrives. We considered sticky routing by prompt hash. Decided the complexity wasn't worth a 200ms improvement on first-contact latency. Maybe later.&lt;/p&gt;
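
&lt;p&gt;For reference, the sticky routing we decided against would look something like the sketch below. This is an illustration, not something we run: the node names and &lt;code&gt;pick_node&lt;/code&gt; are hypothetical, and a character prefix stands in for a token prefix.&lt;/p&gt;

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names


def pick_node(prompt, prefix_chars=2048, nodes=NODES):
    """Map a prompt's leading characters to one node, deterministically."""
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


system_prompt = "policy text " * 300  # longer than prefix_chars
# Same leading text, different tails: both requests land on the same node.
print(pick_node(system_prompt + "user question one") ==
      pick_node(system_prompt + "user question two"))  # True
```

&lt;p&gt;Plain modulo hashing remaps most prompts whenever the node count changes, and one hot prefix can pin a whole tenant to a single node. Both are part of the complexity we didn't want to take on.&lt;/p&gt;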

&lt;p&gt;The model is the easy part. Knowing where your tokens go is the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html" rel="noopener noreferrer"&gt;vLLM automatic prefix caching docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2309.06180" rel="noopener noreferrer"&gt;PagedAttention paper (Kwon et al., 2023)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/text-generation-inference" rel="noopener noreferrer"&gt;Hugging Face TGI prefix caching notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lmsys.org/blog/2024-01-17-sglang/" rel="noopener noreferrer"&gt;SGLang RadixAttention writeup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>mlops</category>
      <category>infrastructure</category>
      <category>pytorch</category>
    </item>
    <item>
      <title>We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Mon, 25 May 2026 16:03:44 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/we-audited-our-agent-tool-call-traces-half-our-eval-data-was-garbage-152m</link>
      <guid>https://forem.com/marcuswwchen/we-audited-our-agent-tool-call-traces-half-our-eval-data-was-garbage-152m</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a manual audit of 1,200 of them, ~48% were unusable: tool calls that "succeeded" but returned wrong data, retries masking provider failures, and silent fallbacks that changed which model answered. Putting Bifrost in front of the agent fleet fixed the trace problem more than any sampling strategy we tried.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We run an enterprise agent product. Sales-ops automations mostly. Each user task ends up as a chain of 8-40 tool calls across a planner model, a worker model, and roughly 12 internal tools.&lt;/p&gt;

&lt;p&gt;For the last quarter my team has been building a fine-tune dataset from real traces. The plan was straightforward. Pull successful task completions. Filter by user thumbs-up. Use the trace as the training signal.&lt;/p&gt;

&lt;p&gt;It did not work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "successful" actually meant in our traces
&lt;/h2&gt;

&lt;p&gt;The first audit pass was 1,200 traces, two engineers, three weeks. We tagged each trace as "clean", "noisy", or "corrupted".&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;% of traces&lt;/th&gt;
&lt;th&gt;What it meant&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clean&lt;/td&gt;
&lt;td&gt;52%&lt;/td&gt;
&lt;td&gt;Tool calls returned correct data, model picked the right next step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;Right answer eventually, but with hidden retries, fallback to a different model, or stale cache hits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Corrupted&lt;/td&gt;
&lt;td&gt;17%&lt;/td&gt;
&lt;td&gt;Trace claimed success, output was wrong. User had not noticed yet.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The noisy category is the one that broke me. We had been treating these as gold-standard data. Take a trace where the planner called &lt;code&gt;crm_lookup&lt;/code&gt;, got a 500, retried twice, then succeeded on a fallback Anthropic key, all while the original trace span still pointed at OpenAI gpt-4o. The training pair we would have generated: "given this user input, output this tool call sequence." But the sequence was the result of three providers and two model versions stitched together. No reproducibility.&lt;/p&gt;

&lt;p&gt;Worse: nothing in our trace told us which model actually produced the final answer. We had a &lt;code&gt;model&lt;/code&gt; field. It logged whichever provider was configured at request start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we ended up putting a gateway in front of everything
&lt;/h2&gt;

&lt;p&gt;We tried two things first. Both partial fixes.&lt;/p&gt;

&lt;p&gt;The first was logging at the application layer. Wrap every provider call, log model, latency, retry count, fallback path. This works until every service brings its own SDK and its own retry policy. Our Python service used the official &lt;code&gt;openai&lt;/code&gt; client. Our Go service used a hand-rolled HTTP client. The TypeScript planner used the Vercel AI SDK. Three different definitions of "retry".&lt;/p&gt;

&lt;p&gt;The second was forcing all traffic through LiteLLM. It got us to a unified call surface but the observability was thin for our needs, and the failover behaviour was harder to reason about under load. Not a knock on LiteLLM, it just was not the shape we wanted.&lt;/p&gt;

&lt;p&gt;We migrated the fleet behind Bifrost about five months ago. Two reasons specific to our problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;Automatic Fallbacks&lt;/code&gt; config makes the fallback chain a first-class object.&lt;/strong&gt; When a request fails over from Anthropic to Bedrock, that is in the response metadata. Not in three different log lines you have to join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Prometheus metrics&lt;/strong&gt; (observability docs) meant &lt;code&gt;bifrost_requests_total&lt;/code&gt; was tagged by the actual provider that served the request, not the one we asked for.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is a chunk of the config that mattered for trace cleanup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_API_KEY_1&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_API_KEY_2&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_API_KEY&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;fallback_to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;anthropic/claude-sonnet-4-6&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;

&lt;span class="na"&gt;logging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;include_fallback_chain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;include_provider_actual&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two &lt;code&gt;include_*&lt;/code&gt; flags meant every trace span we emitted downstream had a deterministic answer to "who served this token". Our corrupted-trace rate on the next 5,000 sampled traces dropped from 17% to under 3%.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the audit actually changed about our fine-tuning
&lt;/h2&gt;

&lt;p&gt;We stopped using user thumbs-up as the primary filter. Thumbs-up correlates with "user got what they wanted eventually", not "the model made the right call". Now the filter is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-provider, single-model trace (no fallback fired)&lt;/li&gt;
&lt;li&gt;No retry on any tool call&lt;/li&gt;
&lt;li&gt;Tool call result schemas validated post-hoc against a recorded ground truth&lt;/li&gt;
&lt;li&gt;Span timing within 1.5x median for that task class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That filter throws away about 71% of our raw traces. Painful. But the 29% that survives is data we can actually train on.&lt;/p&gt;
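
&lt;p&gt;As a rough sketch, the four filter rules above translate into something like the following. The &lt;code&gt;Trace&lt;/code&gt; fields and &lt;code&gt;is_trainable&lt;/code&gt; are hypothetical names, not our real schema, and the schema check is assumed to have already run and left a boolean behind.&lt;/p&gt;

```python
from dataclasses import dataclass
from operator import le
from statistics import median


@dataclass
class Trace:
    # All field names are hypothetical; map them to your own trace schema.
    providers_used: tuple  # providers that actually served calls in this trace
    models_used: tuple     # models that actually served calls in this trace
    retry_count: int       # total retries across all tool calls
    schema_valid: bool     # result of the post-hoc ground-truth validation
    duration_ms: float
    task_class: str


def is_trainable(trace, median_ms_by_class):
    """Apply the four filter rules to one trace."""
    one_provider = len(set(trace.providers_used)) == 1
    one_model = len(set(trace.models_used)) == 1
    no_retries = trace.retry_count == 0
    limit_ms = 1.5 * median_ms_by_class[trace.task_class]
    in_time = le(trace.duration_ms, limit_ms)  # at most 1.5x the class median
    return one_provider and one_model and no_retries and trace.schema_valid and in_time


traces = [
    Trace(("openai",), ("gpt-4o",), 0, True, 900.0, "crm_update"),
    Trace(("openai", "anthropic"), ("gpt-4o", "claude-sonnet-4-6"), 2, True, 2400.0, "crm_update"),
    Trace(("openai",), ("gpt-4o",), 0, True, 1000.0, "crm_update"),
]
medians = {"crm_update": median(t.duration_ms for t in traces)}
kept = [t for t in traces if is_trainable(t, medians)]
print(len(kept))  # 2
```

&lt;p&gt;The middle trace is dropped twice over: a fallback fired and it needed retries. The provider and model fields are only trustworthy if they record who actually served each call, which is what the gateway metadata gave us.&lt;/p&gt;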

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Honest take on what this did not solve.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost is not a debugger.&lt;/strong&gt; It tells you which provider served the request and whether a fallback fired. It does not tell you whether the tool result was &lt;em&gt;correct&lt;/em&gt;. We still need the post-hoc schema validation pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching (docs) made the corruption worse before it got better.&lt;/strong&gt; Cache hits looked like fresh model calls in our old logging. We had to explicitly tag cached responses in the trace pipeline. Once tagged, fine, but the default was confusing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteLLM has a larger provider list in the long tail.&lt;/strong&gt; If you need niche providers, check both before committing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portkey's prompt management UI is nicer.&lt;/strong&gt; We do prompt management elsewhere so it did not matter for us. If you want one tool for both, Portkey is worth a look.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The MCP gateway feature (docs) is interesting but we have not put it in production.&lt;/strong&gt; Cannot vouch for it yet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. The infrastructure around the trace is where your eval dataset lives or dies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks docs&lt;/li&gt;
&lt;li&gt;Bifrost observability defaults&lt;/li&gt;
&lt;li&gt;LiteLLM proxy docs for honest comparison&lt;/li&gt;
&lt;li&gt;Anthropic's tool use guide — the trace structure section is the relevant one&lt;/li&gt;
&lt;li&gt;OpenTelemetry GenAI semantic conventions — what we wish our old logging had matched&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Why Your LLM Eval Harness Is Lying to You (And How to Fix It)</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 22 May 2026 06:32:58 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/why-your-llm-eval-harness-is-lying-to-you-and-how-to-fix-it-2dmb</link>
      <guid>https://forem.com/marcuswwchen/why-your-llm-eval-harness-is-lying-to-you-and-how-to-fix-it-2dmb</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report 87% pass rate on a static suite that hasn't been touched in four months, while the model silently regresses on the queries that actually matter. Here's how we restructured ours at Nexus Labs after a bad week in February.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We shipped a fine-tuned Llama 3.1 70B variant in late January. Eval score: 91.2 on our internal suite. Two weeks later, support tickets spiked. Customers running multi-step agent workflows were getting truncated tool calls roughly 12% of the time. Our eval suite caught zero of these.&lt;/p&gt;

&lt;p&gt;The suite wasn't broken. It was answering a question nobody had asked in months.&lt;/p&gt;

&lt;h2&gt;
  
  
  The static-suite trap
&lt;/h2&gt;

&lt;p&gt;Here's the pattern I keep seeing. Team builds an eval set of 500 examples around the time they ship v1. Each example gets a reference answer and a string-match or embedding-similarity check. The suite becomes the source of truth. CI gates on it. Dashboards graph it. Nobody questions it.&lt;/p&gt;

&lt;p&gt;But your traffic distribution shifts. New customer onboards with a different query pattern. A prompt change upstream alters tool-call frequency. The suite still passes because the suite hasn't moved.&lt;/p&gt;

&lt;p&gt;We pulled three months of production traces and binned them by intent cluster. The original eval suite covered four of the eleven clusters that showed up in real traffic. The four it covered were the easiest ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually changed
&lt;/h2&gt;

&lt;p&gt;Three things. None of them clever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Replay-based eval, refreshed weekly.&lt;/strong&gt; We sample 2,000 real production traces per week, strip PII, and run them through the candidate model. We compare structured outputs (tool calls, JSON fields) against the production response using exact match on tool name plus a learned judge for arguments. Free-form text gets a pairwise preference check against the current production model using a separate judge model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cluster-stratified sampling.&lt;/strong&gt; Embed every trace with &lt;code&gt;text-embedding-3-large&lt;/code&gt;, cluster with HDBSCAN, sample proportionally. This stops the eval from being dominated by the one chatty customer who sends 40% of traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Adversarial slices owned by humans.&lt;/strong&gt; Our support team flags any ticket that traces back to a model failure. Those traces get added to a permanent adversarial set. That set grows. It never shrinks. Currently sitting at 847 examples and climbing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;eval_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replay&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;sample_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;window_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt;
    &lt;span class="na"&gt;strip_pii&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;cluster_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hdbscan&lt;/span&gt;
    &lt;span class="na"&gt;min_cluster_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
  &lt;span class="na"&gt;judges&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;structured&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exact_match_with_arg_judge&lt;/span&gt;
    &lt;span class="na"&gt;freeform&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pairwise_preference&lt;/span&gt;
    &lt;span class="na"&gt;judge_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;
  &lt;span class="na"&gt;adversarial&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./evals/adversarial_permanent.jsonl&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3.0&lt;/span&gt;
  &lt;span class="na"&gt;gates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;regression_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.02&lt;/span&gt;
    &lt;span class="na"&gt;adversarial_floor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;weight: 3.0&lt;/code&gt; on adversarial is deliberate. Those examples represent real customer pain. A 1% regression on adversarial costs us more than a 1% regression on the easy cases.&lt;/p&gt;
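
&lt;p&gt;The cluster-stratified sampling step can be sketched as below, assuming the embedding and HDBSCAN passes have already produced one cluster label per trace. &lt;code&gt;stratified_sample&lt;/code&gt; is a hypothetical helper, and the floor of one trace per cluster is an assumption added for illustration, not a setting from our config.&lt;/p&gt;

```python
import random
from collections import defaultdict


def stratified_sample(trace_ids, cluster_labels, sample_size, seed=0):
    """Sample traces per cluster, with quotas proportional to cluster size."""
    by_cluster = defaultdict(list)
    for trace_id, label in zip(trace_ids, cluster_labels):
        by_cluster[label].append(trace_id)

    rng = random.Random(seed)
    total = len(trace_ids)
    picked = []
    for label, members in sorted(by_cluster.items()):
        # Proportional quota, with at least one trace so small clusters stay visible.
        quota = max(1, round(sample_size * len(members) / total))
        picked.extend(rng.sample(members, min(quota, len(members))))
    return picked


# 100 traces in three clusters of 70, 25, and 5.
ids = list(range(100))
labels = [0] * 70 + [1] * 25 + [2] * 5
sample = stratified_sample(ids, labels, sample_size=20)
print(len(sample))  # 20
```

&lt;p&gt;Because quotas are rounded per cluster, the result can land a few traces above or below &lt;code&gt;sample_size&lt;/code&gt; on less tidy inputs. HDBSCAN's noise label (-1) is treated here as just another cluster.&lt;/p&gt;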

&lt;h2&gt;
  
  
  Routing the eval traffic
&lt;/h2&gt;

&lt;p&gt;Running 2,000 traces against a candidate model plus a judge model plus the production baseline gets expensive fast. We were burning $400/week on judge calls alone before we got serious about caching and routing.&lt;/p&gt;

&lt;p&gt;Two things helped. First, semantic caching on the judge prompts. The same trace evaluated twice against the same model pair should not cost twice. Second, we route across providers based on per-token cost for the judge role specifically. We use Bifrost (&lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;) for this because it gives us one OpenAI-compatible endpoint and lets us shift judge traffic between Anthropic and Google without touching the eval code. LiteLLM works similarly if that's already in your stack.&lt;/p&gt;

&lt;p&gt;Cost dropped to $140/week. Same coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: what we tried
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Coverage of real traffic&lt;/th&gt;
&lt;th&gt;Maintenance cost&lt;/th&gt;
&lt;th&gt;Catches silent regressions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static curated suite&lt;/td&gt;
&lt;td&gt;Low (drifts fast)&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Rarely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure replay&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Sometimes (misses rare-but-critical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replay + cluster sampling + adversarial&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-judge-only with no replay&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Inconsistent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Trade-offs and limitations
&lt;/h2&gt;

&lt;p&gt;Replay-based eval has real problems and I don't want to undersell them.&lt;/p&gt;

&lt;p&gt;Judge models are not ground truth. Pairwise preference between two model outputs is noisy. We run each comparison three times with temperature 0.3 and take majority vote. Even then, agreement with human raters sits around 78% on our adversarial slice. Useful, not authoritative.&lt;/p&gt;
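
&lt;p&gt;The three-run majority vote looks roughly like this. &lt;code&gt;majority_preference&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;judge&lt;/code&gt; is assumed to be any callable that sends one pairwise comparison to the judge model (at temperature 0.3 in our case) and returns "A" or "B".&lt;/p&gt;

```python
from collections import Counter


def majority_preference(judge, prompt, output_a, output_b, runs=3):
    """Call a pairwise judge several times and return the most common verdict."""
    votes = [judge(prompt, output_a, output_b) for _ in range(runs)]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / runs


# Stub judge with canned verdicts, standing in for real model calls.
canned = iter(["A", "B", "A"])


def stub_judge(prompt, output_a, output_b):
    return next(canned)


verdict, agreement = majority_preference(stub_judge, "query", "candidate", "baseline")
print(verdict)  # A
```

&lt;p&gt;With two labels and an odd number of runs there is always a strict majority. Keeping the agreement fraction next to the verdict makes it easy to route split votes to human review.&lt;/p&gt;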

&lt;p&gt;PII stripping is fragile. We use a regex stack plus a small NER model. We still find leakage occasionally during audits. If your domain has strict data handling rules, you may need synthetic replays instead of real ones, which loses some of the distributional fidelity that makes this work.&lt;/p&gt;

&lt;p&gt;Replay assumes today's traffic looks like tomorrow's. For a stable product, fine. For one shipping new features weekly, you're always one release behind.&lt;/p&gt;

&lt;p&gt;And the adversarial set has a selection bias. We only add examples that humans flagged. Failures nobody noticed don't make it in. We try to compensate by manually sampling 50 random traces per week for human review, but we're not closing the loop completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What hasn't worked
&lt;/h2&gt;

&lt;p&gt;Tried benchmark suites like MT-Bench and HELM as our primary gate. Useless for our domain. They measure general capability. We don't ship general capability. We ship agent reliability on a narrow task surface.&lt;/p&gt;

&lt;p&gt;Tried a single LLM-as-judge with one rubric. Too much variance. Rubric drift between runs was higher than the signal we were trying to measure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/EleutherAI/lm-evaluation-harness" rel="noopener noreferrer"&gt;Eleuther's lm-evaluation-harness&lt;/a&gt; — good reference for general benchmark plumbing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/anthropics/anthropic-cookbook/tree/main/misc" rel="noopener noreferrer"&gt;Anthropic's evals cookbook&lt;/a&gt; — pairwise judge patterns worth borrowing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hdbscan.readthedocs.io/" rel="noopener noreferrer"&gt;HDBSCAN docs&lt;/a&gt; — clustering algorithm we use for stratification&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hamel.dev/blog/posts/evals/" rel="noopener noreferrer"&gt;Hamel Husain on evals&lt;/a&gt; — the post that pushed us to take replay seriously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. The eval is where you find out if you actually shipped what you thought you shipped.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
    <item>
      <title>Measuring AI Gateway Failover: 30 Days of Production Data</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Thu, 21 May 2026 16:02:16 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/measuring-ai-gateway-failover-30-days-of-production-data-336k</link>
      <guid>https://forem.com/marcuswwchen/measuring-ai-gateway-failover-30-days-of-production-data-336k</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR: We measured failover latency across three AI gateways (Bifrost, LiteLLM, Portkey) during 30 days of production traffic at Nexus Labs. Bifrost added 11ms p99 overhead with automatic provider fallback. The model is the easy part. Routing it reliably is not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our agent platform at Nexus Labs handles around 2.4M LLM requests per day. Half of those hit OpenAI, the rest spread across Anthropic, Bedrock, and Vertex. When OpenAI had its 4-hour incident on April 23, we lost 38 minutes of traffic before our homegrown retry logic gave up and rerouted.&lt;/p&gt;

&lt;p&gt;That hurt. So we replaced the retry layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual problem
&lt;/h2&gt;

&lt;p&gt;Most gateway benchmarks measure throughput on a happy path with no failures. That tells you very little about production. What I care about: how long does it take for a request to recover when a provider returns 429 or 503? How much p99 latency does the gateway add when nothing is wrong?&lt;/p&gt;

&lt;p&gt;Our team of 9 engineers spent two weeks instrumenting three options. Same hardware (c6i.4xlarge, 2 nodes behind an NLB). Same upstream credentials. Same request distribution sampled from our actual logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Each gateway sat between our agent service and four providers. We configured identical fallback chains: OpenAI primary, Anthropic secondary, Bedrock tertiary. Cache disabled. Rate limits set to mirror our prod allocation.&lt;/p&gt;

&lt;p&gt;Here's the Bifrost config we used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;openai&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.OPENAI_API_KEY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="na"&gt;anthropic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.ANTHROPIC_API_KEY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="na"&gt;bedrock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;env.AWS_BEDROCK_KEY&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

&lt;span class="na"&gt;fallbacks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;fallback_to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic.claude-sonnet-4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documented behavior is at &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;. LiteLLM and Portkey have equivalent configs. Different YAML shape, same semantics.&lt;/p&gt;
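
&lt;p&gt;To make the fallback semantics concrete, here is a minimal Python sketch of what a chain like this does conceptually. It is not Bifrost's implementation. &lt;code&gt;ProviderError&lt;/code&gt;, &lt;code&gt;call_with_fallback&lt;/code&gt;, and the stub providers are hypothetical names, and treating 429 and 503 as the retryable statuses is an assumption for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

RETRYABLE_STATUS = {429, 503}  # assumption: statuses that trigger retry or fallback


class ProviderError(Exception):
    """Hypothetical error raised by a provider client."""

    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_fallback(chain, prompt, retries_per_provider=1):
    """Try each (name, call) pair in order; return (name, response, elapsed_ms)."""
    start = time.perf_counter()
    last_error = None
    for name, call in chain:
        # One initial attempt plus the configured retries, then switch provider
        for _ in range(retries_per_provider + 1):
            try:
                response = call(prompt)
                elapsed_ms = (time.perf_counter() - start) * 1000
                return name, response, elapsed_ms
            except ProviderError as err:
                last_error = err
                if err.status not in RETRYABLE_STATUS:
                    raise  # non-retryable: surface it to the caller
    raise RuntimeError("all providers in the chain failed") from last_error


# Stub providers standing in for real SDK calls
def openai_down(prompt):
    raise ProviderError(429)


def anthropic_ok(prompt):
    return f"echo: {prompt}"


chain = [("openai", openai_down), ("anthropic", anthropic_ok)]
name, response, elapsed_ms = call_with_fallback(chain, "ping")
print(name, response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The elapsed time returned here covers the failed attempts plus the successful one, which is the quantity a failover-time measurement cares about.&lt;/p&gt;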

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;We ran 720 hours of mirrored traffic. Numbers below are from the actual logs, not synthetic load.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;p50 overhead&lt;/th&gt;
&lt;th&gt;p99 overhead&lt;/th&gt;
&lt;th&gt;Failover time (provider down)&lt;/th&gt;
&lt;th&gt;Memory at 1k RPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;3ms&lt;/td&gt;
&lt;td&gt;11ms&lt;/td&gt;
&lt;td&gt;180ms (one retry + switch)&lt;/td&gt;
&lt;td&gt;412 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;8ms&lt;/td&gt;
&lt;td&gt;41ms&lt;/td&gt;
&lt;td&gt;620ms&lt;/td&gt;
&lt;td&gt;890 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey (self-hosted)&lt;/td&gt;
&lt;td&gt;6ms&lt;/td&gt;
&lt;td&gt;29ms&lt;/td&gt;
&lt;td&gt;340ms&lt;/td&gt;
&lt;td&gt;650 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bifrost is written in Go. LiteLLM is Python with FastAPI. That accounts for most of the gap on the hot path. Not all of it. Bifrost's fallback chain evaluates synchronously without re-queuing the request, which matters when you're already on retry attempt two.&lt;/p&gt;

&lt;p&gt;Portkey was solid, but the self-hosted version lagged behind their managed offering on features. LiteLLM's killer feature for our team was its richer support for custom cost-tracking callbacks. We still use those for finance reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we used Bifrost for
&lt;/h2&gt;

&lt;p&gt;Three things, specifically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback routing.&lt;/strong&gt; When OpenAI returns 429, the request goes to Anthropic with the equivalent model. Our agent code never knows. Docs at &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching.&lt;/strong&gt; For our evaluation harness specifically. We replay 18,000 prompts against new model versions nightly. Cache hit rate is 73% because the evaluation suite asks the same questions repeatedly. That's around 13k requests we don't pay for each night. Reference: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prometheus metrics.&lt;/strong&gt; Native export. We already had a Prom stack. Five-minute integration. The default dashboards aren't great but the metrics themselves are useful. Reference: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we did not use
&lt;/h2&gt;

&lt;p&gt;MCP gateway, governance, SSO. Our auth sits in front of the gateway, not inside it. The custom plugins interface looked interesting but we haven't needed one yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Bifrost is younger than LiteLLM. The provider list is wide (23+) but if you need a niche provider, check the docs first. The plugin interface is straightforward so you can add one yourself, but that's still work.&lt;/p&gt;

&lt;p&gt;The web UI is decent for initial setup, not where you want to be doing complex governance. Configure things in YAML and version them in git like anything else.&lt;/p&gt;

&lt;p&gt;If you're already deep in LiteLLM and using its callback ecosystem, migration cost is real. LiteLLM has more community integrations because it's been around longer. Portkey is also a fine choice if you want a managed control plane and don't want to operate a gateway yourself. Pick based on what your team will actually maintain.&lt;/p&gt;

&lt;p&gt;Last caveat. The numbers above are from our workload. Your traffic shape will differ. Run the test yourself before deciding.&lt;/p&gt;
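
&lt;p&gt;If you want to reproduce the overhead columns on your own traffic, a rough sketch follows. It assumes you log two timings per request: total time through the gateway and time spent waiting on the upstream provider. The function name and the synthetic data are placeholders, not our tooling or our measurements.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
import statistics


def overhead_percentiles(total_ms, upstream_ms):
    """Return (p50, p99) of per-request gateway overhead in milliseconds."""
    overhead = [t - u for t, u in zip(total_ms, upstream_ms)]
    cuts = statistics.quantiles(overhead, n=100)  # 99 cut points: p1..p99
    return cuts[49], cuts[98]


# Synthetic placeholder data so the sketch runs on its own
rng = random.Random(0)
upstream = [rng.uniform(200, 900) for _ in range(10_000)]
total = [u + rng.uniform(1, 12) for u in upstream]

p50, p99 = overhead_percentiles(total, upstream)
print(f"p50 overhead: {p50:.1f} ms, p99 overhead: {p99:.1f} ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Subtracting upstream time per request, rather than comparing two separate runs, keeps provider latency swings out of the overhead figure.&lt;/p&gt;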

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bifrost retries and fallbacks: &lt;a href="https://docs.getbifrost.ai/features/retries-and-fallbacks" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/retries-and-fallbacks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost semantic caching: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/semantic-caching&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost observability: &lt;a href="https://docs.getbifrost.ai/features/observability/default" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/features/observability/default&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost provider configuration: &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;https://docs.getbifrost.ai/quickstart/gateway/provider-configuration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bifrost source: &lt;a href="https://github.com/maximhq/bifrost" rel="noopener noreferrer"&gt;https://github.com/maximhq/bifrost&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the easy part. Routing it under failure is the hard part. Spend the time on the boring infrastructure problem.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>llm</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Gemma 4's multi-token prediction head actually means for your eval pipeline</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:21:46 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/what-gemma-4s-multi-token-prediction-head-actually-means-for-your-eval-pipeline-3ik</link>
      <guid>https://forem.com/marcuswwchen/what-gemma-4s-multi-token-prediction-head-actually-means-for-your-eval-pipeline-3ik</guid>
      <description>&lt;p&gt;Gemma 4 dropped with a multi-token prediction (MTP) head and immediately every benchmark thread on r/LocalLLaMA and r/MachineLearning filled up with MMLU scores, HumanEval numbers, and throughput charts.&lt;/p&gt;

&lt;p&gt;Most of those benchmarks don't measure what the MTP head changes. Here's what is happening under the hood, and what it means if you're running your own eval pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MTP actually is
&lt;/h2&gt;

&lt;p&gt;Standard autoregressive generation predicts one token at a time. At each step, the model outputs a probability distribution over the vocabulary, samples a token, appends it, and repeats.&lt;/p&gt;

&lt;p&gt;Multi-token prediction trains an additional head to predict multiple future tokens simultaneously. The core model still generates token-by-token at inference time, but the MTP head is used during training as an auxiliary loss — forcing the model to maintain internal representations that are useful several tokens ahead.&lt;/p&gt;

&lt;p&gt;The practical effect at inference time (depending on how it's deployed): speculative decoding becomes more effective because the MTP head can propose candidate continuations that the main model is more likely to accept. This is where the throughput numbers come from.&lt;/p&gt;

&lt;p&gt;Here's a simplified view of what speculative decoding with an MTP head looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;speculative_decode_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;main_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;draft_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# or MTP head used as draft
&lt;/span&gt;    &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# number of draft tokens to generate
&lt;/span&gt;    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    One round of speculative decoding.
    Draft model proposes `gamma` tokens, main model verifies.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;
    &lt;span class="n"&gt;draft_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate gamma draft tokens
&lt;/span&gt;    &lt;span class="n"&gt;draft_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;draft_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;draft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draft_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;next_token_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draft_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
            &lt;span class="n"&gt;next_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_token_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;draft_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;draft_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;draft_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next_token&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify with main model
&lt;/span&gt;    &lt;span class="n"&gt;candidate_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;draft_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;main_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;main_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Accept/reject draft tokens
&lt;/span&gt;    &lt;span class="n"&gt;accepted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gamma&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;main_logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;main_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="n"&gt;draft_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draft_tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Simple greedy acceptance check (real implementations use sampling)
&lt;/span&gt;        &lt;span class="n"&gt;main_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main_logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draft_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;accepted&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Return accepted prefix + one correction token
&lt;/span&gt;    &lt;span class="n"&gt;result_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;accepted&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate_ids&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;result_length&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MTP head improves the acceptance rate in that verification loop. More accepted draft tokens per round means higher effective throughput, because each main-model forward pass yields more output tokens.&lt;/p&gt;
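
&lt;p&gt;A back-of-the-envelope way to see the effect: if each draft token is accepted independently with probability &lt;code&gt;alpha&lt;/code&gt; (a simplifying assumption, since real acceptance varies by position and content), the expected number of tokens emitted per round with &lt;code&gt;gamma&lt;/code&gt; drafts is &lt;code&gt;(1 - alpha**(gamma + 1)) / (1 - alpha)&lt;/code&gt;. The helper below is an illustrative sketch, not something from the Gemma 4 release.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def expected_tokens_per_round(alpha, gamma=4):
    """Expected tokens emitted per verification round.

    Assumes each draft token is accepted independently with probability alpha.
    Counts the accepted drafts plus the one token the main model always adds.
    """
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


for alpha in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_round(alpha)
    print(f"alpha={alpha}: {tokens:.2f} tokens per round")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This counts tokens per round, not wall-clock speedup. The time spent producing drafts still has to be paid, so the real gain is smaller than the ratio suggests.&lt;/p&gt;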

&lt;h2&gt;
  
  
  Why your benchmark results are probably misleading
&lt;/h2&gt;

&lt;p&gt;The throughput gains from MTP are real, but they're not uniform across tasks. The acceptance rate of speculative decoding depends on how predictable the output sequence is.&lt;/p&gt;

&lt;p&gt;High acceptance rate (MTP helps a lot):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code generation — syntax is highly structured&lt;/li&gt;
&lt;li&gt;Structured data extraction — JSON, CSV, templated output&lt;/li&gt;
&lt;li&gt;Formulaic text — boilerplate, standard contract language, templated responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lower acceptance rate (MTP helps less):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open-ended generation with high entropy&lt;/li&gt;
&lt;li&gt;Creative writing&lt;/li&gt;
&lt;li&gt;Reasoning chains that make non-obvious inferential jumps&lt;/li&gt;
&lt;li&gt;Adversarial inputs where the model is already uncertain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your benchmark is mostly code or structured output tasks, your throughput numbers will look great. If your production use case is open-ended dialogue or reasoning-heavy tasks, the gains will be smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually measured
&lt;/h2&gt;

&lt;p&gt;At Nexus Labs, we maintain domain-specific eval suites for our enterprise automation use cases. I ran Gemma 4 through these last week. Three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured extraction (contract parsing, form extraction)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# eval structure — simplified
&lt;/span&gt;&lt;span class="n"&gt;eval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_extraction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_variants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-mtp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;throughput_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.923&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.871&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-mtp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;throughput_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;f1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.924&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.873&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# ~18% throughput improvement, no quality regression
# This is the good case for MTP
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Open-ended summarization&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open_ended_summarization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;throughput_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;612&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rouge_l&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.441&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic_drift_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.031&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-mtp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;throughput_tps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;679&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rouge_l&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.438&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic_drift_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.047&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# ~11% throughput improvement
# Small but consistent increase in mid-sentence topic drift
# ROUGE difference is within noise, but topic_drift_rate is reproducible
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;topic_drift_rate&lt;/code&gt; here is an internal metric — we flag spans where the model shifts semantic focus within a sentence boundary. It's a custom eval, not something you'll find in standard benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial robustness suite&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;eval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adversarial_robustness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_families&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paraphrase_invariance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# same meaning, different phrasing
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;format_variation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# valid but unusual formatting
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rare_edge_cases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# valid but low-frequency inputs
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ambiguity_resolution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# genuinely ambiguous inputs
&lt;/span&gt;    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.847&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4-mtp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.849&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Effectively identical — MTP doesn't help or hurt adversarial robustness
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The adversarial result is the most important one for production deployments. Throughput gains are nice. Robustness is what keeps you off the incident page.&lt;/p&gt;
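
&lt;p&gt;To compare the three suites on one scale, it helps to express each metric as a change relative to the base variant. The sketch below uses only the figures reported above. The &lt;code&gt;relative_change&lt;/code&gt; helper and the dictionary layout are just for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def relative_change(base, variant):
    """Fractional change of variant relative to base."""
    return (variant - base) / base


# (base, mtp) pairs copied from the result blocks above
reported = {
    "structured_extraction throughput_tps": (847, 1001),
    "open_ended_summarization throughput_tps": (612, 679),
    "open_ended_summarization topic_drift_rate": (0.031, 0.047),
    "adversarial_robustness overall_pass_rate": (0.847, 0.849),
}

for name, (base, mtp) in reported.items():
    print(f"{name}: {relative_change(base, mtp):+.1%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Seen this way, the topic drift rate rises by roughly half from a small base, which is why it is worth tracking even though the absolute numbers look tiny.&lt;/p&gt;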

&lt;h2&gt;
  
  
  What this means for your eval pipeline
&lt;/h2&gt;

&lt;p&gt;If you're evaluating Gemma 4 for a production deployment, here's what to actually do:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Build task-specific benchmarks, not generic ones&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generic benchmarks tell you how the model performs on generic tasks. Your use case is not generic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DomainEvalSuite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_name&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;test_cases&lt;/span&gt;  &lt;span class="c1"&gt;# [{input, expected_output, metadata}]
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_cases&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;case&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

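    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Backend-specific stub: call your inference stack here and return the decoded text
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Implement task-specific scoring, not ROUGE or BLEU
&lt;/span&gt;        &lt;span class="c1"&gt;# Exact match, F1 over extracted fields, custom rubric, whatever fits
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;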

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# an empty suite would otherwise divide by zero
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no results to aggregate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# tail performance matters
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fail_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;P10 tail performance and fail rate matter more than mean score for production systems. Mean score hides your worst cases.&lt;/p&gt;
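
&lt;p&gt;To see how a mean can hide the tail, here is a minimal sketch that applies the same three aggregates to two score lists. The &lt;code&gt;summarize&lt;/code&gt; helper and both lists are made up for illustration; they are not results from the runs above.&lt;/p&gt;

```python
def summarize(scores: list[float], fail_threshold: float = 0.5) -> dict:
    """Return mean, 10th-percentile score, and share of cases below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    return {
        "mean": sum(ordered) / len(ordered),
        "p10": ordered[len(ordered) // 10],
        "fail_rate": sum(1 for s in ordered if fail_threshold > s) / len(ordered),
    }


# Hypothetical score lists, invented for this example only
steady = [0.8] * 10              # every case is decent
spiky = [1.0] * 8 + [0.0] * 2    # most cases perfect, two total failures

steady_summary = summarize(steady)
spiky_summary = summarize(spiky)
print(steady_summary)
print(spiky_summary)
```

&lt;p&gt;Both lists average roughly 0.8, but only the second one has a p10 of 0.0 and a 20% fail rate. That second profile is the one that pages you.&lt;/p&gt;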

&lt;p&gt;&lt;strong&gt;2. Separate throughput eval from quality eval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't conflate them. Run throughput benchmarks under controlled conditions (fixed prompt length, fixed output length, known hardware). Run quality benchmarks separately. Don't let throughput optimization choices degrade quality metrics.&lt;/p&gt;
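
&lt;p&gt;A rough sketch of what "controlled conditions" can look like in code. Everything here is an assumption for illustration: &lt;code&gt;measure_throughput&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;generate_fn&lt;/code&gt; stands in for whatever wraps your real backend and returns the number of tokens it produced.&lt;/p&gt;

```python
import time
from typing import Callable


def measure_throughput(
    generate_fn: Callable[[list[int], int], int],
    prompt_len: int = 512,
    output_len: int = 128,
    warmup: int = 2,
    runs: int = 5,
) -> dict:
    """Time generate_fn on a fixed-length prompt and report output tokens per second."""
    prompt = [0] * prompt_len  # placeholder token ids; every run sees the same length
    for _ in range(warmup):
        generate_fn(prompt, output_len)  # warm caches before timing anything
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate_fn(prompt, output_len)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    rates.sort()
    return {
        "median_tok_per_s": rates[len(rates) // 2],
        "min_tok_per_s": rates[0],
        "runs": runs,
    }


def fake_generate(prompt: list[int], max_new_tokens: int) -> int:
    """Stand-in backend: sleeps briefly and reports max_new_tokens as produced."""
    time.sleep(0.001)
    return max_new_tokens


report = measure_throughput(fake_generate, prompt_len=64, output_len=16, runs=3)
print(report)
```

&lt;p&gt;Nothing in this function touches a quality metric, and that is the point: the quality suite runs on its own schedule with its own inputs.&lt;/p&gt;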

&lt;p&gt;&lt;strong&gt;3. Test MTP specifically for your task type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your use case is structured output: MTP is probably worth it. Measure the throughput gain and verify that quality holds.&lt;/p&gt;

&lt;p&gt;If your use case is open-ended generation: measure both throughput gain AND any quality regressions before assuming MTP is better.&lt;/p&gt;
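
&lt;p&gt;One way to make that check explicit is a small gate that looks at both numbers at once. This is a sketch under assumptions: &lt;code&gt;should_enable_mtp&lt;/code&gt;, the dictionary keys, and the thresholds are hypothetical, and the numbers in the example are invented, not measurements.&lt;/p&gt;

```python
def should_enable_mtp(
    base: dict,
    mtp: dict,
    min_speedup: float = 1.1,
    max_quality_drop: float = 0.01,
) -> bool:
    """Accept MTP only if it is meaningfully faster and quality has not regressed."""
    speedup = mtp["tok_per_s"] / base["tok_per_s"]
    quality_drop = base["pass_rate"] - mtp["pass_rate"]
    return speedup >= min_speedup and max_quality_drop >= quality_drop


# Invented numbers for two task types, for illustration only
structured_ok = should_enable_mtp(
    base={"tok_per_s": 100.0, "pass_rate": 0.90},
    mtp={"tok_per_s": 150.0, "pass_rate": 0.90},
)
open_ended_ok = should_enable_mtp(
    base={"tok_per_s": 100.0, "pass_rate": 0.80},
    mtp={"tok_per_s": 120.0, "pass_rate": 0.75},
)
print(structured_ok, open_ended_ok)
```

&lt;p&gt;Pick the thresholds per task. A one-point drop may be fine for a summarizer and unacceptable for an extraction pipeline.&lt;/p&gt;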

&lt;p&gt;&lt;strong&gt;4. Include an adversarial subset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At minimum, include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Paraphrase variants of your test inputs (same intent, different wording)&lt;/li&gt;
&lt;li&gt;Format variations (valid but unusual)&lt;/li&gt;
&lt;li&gt;A few hand-crafted tricky cases specific to your domain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your model passes the standard tests but fails on paraphrases of those same tests, it is matching surface wording rather than generalizing.&lt;/p&gt;
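
&lt;p&gt;A minimal sketch of how to measure that gap, assuming each test case records whether the original input passed and whether each of its paraphrases passed. The &lt;code&gt;paraphrase_gap&lt;/code&gt; helper, the field names, and the sample records are all hypothetical.&lt;/p&gt;

```python
def paraphrase_gap(results: list[dict]) -> dict:
    """Compare pass rate on original inputs with pass rate on their paraphrases."""
    flags = [p for r in results for p in r["paraphrases_passed"]]
    if not results or not flags:
        raise ValueError("need at least one case with at least one paraphrase")
    original_rate = sum(1 for r in results if r["original_passed"]) / len(results)
    paraphrase_rate = sum(1 for p in flags if p) / len(flags)
    # Cases that pass as written but break under rewording are the brittle ones
    brittle = [
        r["case_id"]
        for r in results
        if r["original_passed"] and not all(r["paraphrases_passed"])
    ]
    return {
        "original_pass_rate": original_rate,
        "paraphrase_pass_rate": paraphrase_rate,
        "brittle_cases": brittle,
    }


# Invented records for illustration only
records = [
    {"case_id": "a", "original_passed": True, "paraphrases_passed": [True, True]},
    {"case_id": "b", "original_passed": True, "paraphrases_passed": [False, True]},
    {"case_id": "c", "original_passed": False, "paraphrases_passed": [False, False]},
]
gap = paraphrase_gap(records)
print(gap)
```

&lt;p&gt;The list of brittle case IDs is usually more useful than the two rates, because it tells you which inputs to read by hand.&lt;/p&gt;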

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Gemma 4 MTP is a real improvement for throughput on structured tasks. The benchmark numbers showing gains on MMLU/HumanEval are real but somewhat misleading — those tasks happen to be ones where MTP acceptance rates are high.&lt;/p&gt;

&lt;p&gt;Build your own eval suite. Measure what matters for your use case. Check both throughput and quality. Include adversarial coverage.&lt;/p&gt;

&lt;p&gt;The model is not the hard part.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm happy to go deeper on any of the eval methodology here — custom metrics design, adversarial suite construction, or the MTP acceptance rate analysis. Drop questions in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>webdev</category>
      <category>future</category>
    </item>
    <item>
      <title>Mastering Local AI Agents for Everyday Programming in 2026</title>
      <dc:creator>Marcus Chen</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:10:19 +0000</pubDate>
      <link>https://forem.com/marcuswwchen/mastering-local-ai-agents-for-everyday-programming-in-2026-4dj0</link>
      <guid>https://forem.com/marcuswwchen/mastering-local-ai-agents-for-everyday-programming-in-2026-4dj0</guid>
      <description>&lt;p&gt;The landscape of software development is shifting beneath our feet. While large cloud-based LLMs have dominated headlines, 2026 is the year &lt;strong&gt;local AI agents&lt;/strong&gt; have truly matured into indispensable tools for everyday programming. &lt;/p&gt;

&lt;p&gt;By running autonomous, agentic workflows on our own silicon, developers are unlocking new levels of privacy, speed, and offline capability. In this post, we'll explore why local agents matter and how you can seamlessly integrate them into your coding routine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Local Agents?
&lt;/h3&gt;

&lt;p&gt;Cloud LLMs are powerful, but they have limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Privacy:&lt;/strong&gt; Not every codebase can or should be sent over the wire. Local agents keep proprietary logic strictly on your machine.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Latency:&lt;/strong&gt; No network round trips means near-instant feedback for lightweight refactors or shell queries.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Cost:&lt;/strong&gt; Once you have the hardware, inference is virtually free. This enables "infinite loop" agents that can continuously run tests and iteratively fix bugs in the background without racking up API bills.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Essential Workflows for Local Agents
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. The Autonomous Test-Fixer
&lt;/h4&gt;

&lt;p&gt;Instead of manually deciphering stack traces, local agents can watch your test output. When a test fails, the agent isolates the failure, analyzes the relevant module, and proposes a fix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of a broken function
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_discount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;discount_percent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;discount_percent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Oops, forgot to divide by 100
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A background local agent detects the &lt;code&gt;AssertionError&lt;/code&gt;, understands the logic, and patches the math error before you even switch back to your editor.&lt;/p&gt;
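
&lt;p&gt;For the function above, the patch you would expect such an agent to propose is a one-line change. The assertion below is a hypothetical test, written here only to show the kind of check that fails against the broken version and passes against the fix.&lt;/p&gt;

```python
def calculate_discount(price: float, discount_percent: float) -> float:
    """Return the price after taking discount_percent (0 to 100) off."""
    return price - (price * discount_percent / 100)


# Hypothetical test: 25% off 200 should be 150.
# The broken version returns 200 - 200 * 25, which is -4800.
assert calculate_discount(200.0, 25) == 150.0
```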

&lt;h4&gt;
  
  
  2. PR Review and Digest
&lt;/h4&gt;

&lt;p&gt;Local models, especially code-specialized ones, can read your diffs before you commit. They act as a ruthless but helpful rubber duck, pointing out logical gaps or missing edge-case coverage without complaining about rate limits.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Deep-Dive Log Analysis
&lt;/h4&gt;

&lt;p&gt;Sifting through thousands of lines of logs locally? A local agent can process massive log files right where they live, grepping for anomalies and synthesizing a human-readable summary.&lt;/p&gt;
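
&lt;p&gt;Here is a small sketch of the pre-filtering step that could run before any model sees the logs: keep only warning and error lines, collapse numbers so repeated messages group together, and count them. The &lt;code&gt;digest&lt;/code&gt; helper, the level names, and the sample lines are assumptions for illustration, and the model call that would turn the counts into a summary is left out.&lt;/p&gt;

```python
import re
from collections import Counter

LEVEL_RE = re.compile(r"\b(ERROR|WARN|CRITICAL)\b")
NUMBER_RE = re.compile(r"\d+")


def digest(lines: list[str], top_n: int = 5) -> list[tuple[str, int]]:
    """Group ERROR/WARN/CRITICAL lines by message shape and return the most common."""
    shapes = Counter()
    for line in lines:
        if LEVEL_RE.search(line):
            # Replace every run of digits with N so ids and durations do not split groups
            shapes[NUMBER_RE.sub("N", line.strip())] += 1
    return shapes.most_common(top_n)


# Invented log lines for illustration only
sample = [
    "INFO request 101 ok",
    "ERROR timeout after 30s on shard 4",
    "ERROR timeout after 31s on shard 7",
    "WARN retry 2 for job 88",
]
summary = digest(sample)
print(summary)
```

&lt;p&gt;Handing the agent a few dozen grouped shapes instead of the raw file keeps the prompt small, which matters more on local hardware than it does against a cloud API.&lt;/p&gt;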

&lt;h3&gt;
  
  
  Tools to Get Started
&lt;/h3&gt;

&lt;p&gt;If you're looking to build your own local agentic stack, here are a few tools leading the charge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Ollama / LM Studio:&lt;/strong&gt; The backbone for running quantized models efficiently.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OpenClaw / Aider:&lt;/strong&gt; Terminal-native agents that can directly edit your files and run shell commands.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The question is no longer &lt;em&gt;if&lt;/em&gt; AI will write code, but &lt;em&gt;where&lt;/em&gt; that AI lives. By embracing local agents, we get the best of both worlds: the intelligence of modern LLMs with the speed, privacy, and control of our own hardware.&lt;/p&gt;

&lt;p&gt;Are you running any agents locally in your workflow? Let me know in the comments!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
