Forem: Marcus Chen

Prefix caching in vLLM under multi-tenant agent traffic

Marcus Chen — Tue, 26 May 2026 06:35:20 +0000

TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT p50 drop from 480ms to 110ms on one tenant and barely move (510ms to 505ms) on another. The split wasn't about traffic volume. It was about how each team templated their system prompts.

The setup

Our fine-tuning team serves 14 enterprise agents through a shared inference cluster. Four H100 nodes, vLLM 0.6.x, Qwen2.5-32B as the workhorse model. Traffic is bursty. One customer's nightly workflow can hit 8k requests in twenty minutes while another trickles through 30 calls an hour.

Before turning on prefix caching, TTFT across the cluster sat at 410ms p50 and 1.2s p95. Cost wasn't the urgent problem. Latency was, because agents loop. A 400ms TTFT on a 12-step plan turns into 4.8 seconds of dead time before the user sees anything.

What the cache actually does

vLLM's prefix cache keeps KV blocks for tokens it has already processed. If a new request shares a prefix with something in the cache, those blocks get reused instead of recomputed. The unit is a block (16 tokens by default), so only whole blocks are matched: a shared prefix is reused up to the last full block boundary.

If your system prompt is 1,024 tokens and identical across requests, you skip prefill for 1,024 tokens. At Qwen2.5-32B prefill speeds, that's roughly 90 to 110ms saved per call on our hardware.
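To make the block granularity concrete, here is a minimal sketch of how many prompt tokens are eligible for reuse. It is a simplified model, not vLLM's implementation (which hashes blocks rather than comparing token lists), and `reusable_prefix_tokens` is a hypothetical helper name.

```python
def reusable_prefix_tokens(cached: list[int], incoming: list[int], block_size: int = 16) -> int:
    """Count tokens of `incoming` that fall in full blocks shared with `cached`."""
    shared = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        shared += 1
    # Only whole blocks are reusable; a partially matching block is recomputed.
    return (shared // block_size) * block_size


system_prompt = list(range(1024))          # stand-in token ids for a 1,024-token system prompt
request_a = system_prompt + [9001, 9002]   # same prefix, different user turn
request_b = system_prompt + [7001]
print(reusable_prefix_tokens(request_a, request_b))  # 1024
```

A prefix that diverges partway through a block only gets credit up to the previous block boundary.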

Where it worked

Tenant A's agent uses a fixed system prompt assembled at deploy time. Same 1,847 tokens for every request, byte-for-byte. After we flipped enable_prefix_caching=True:

TTFT p50: 480ms → 110ms
TTFT p95: 1.4s → 280ms
GPU prefill compute dropped by 38%

Their hit rate ran around 94% steady-state. The 6% misses were cold starts after pod restarts.

Where it didn't

Tenant B's agent rebuilds its system prompt every call. They inject the current timestamp, a session UUID, and a hash of recent tool outputs into the first 200 tokens. Looked stable on paper. In practice, every request had a unique prefix starting at token 47.

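vLLM caches at block granularity. A differing token invalidates its own block and every block after it. With the first difference at token 47, at most the first two 16-token blocks (32 tokens) were ever reusable. Tenant B's hit rate: 0.3%.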

We didn't catch this in staging because our staging traffic replays canned prompts. The diff between tenants only showed up under real traffic.

The fix for Tenant B

I talked their team into pushing the volatile fields to the end of the prompt. Took two hours of refactoring on their side. After:

TTFT p50: 510ms → 145ms
Hit rate: 0.3% → 87%

Then they asked why nobody mentioned this in the vLLM docs. The docs do mention it. Nobody reads docs when defaults already look fine on the neighboring tenant.
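The refactor is easier to see in code. This is a minimal sketch, not Tenant B's actual template: the prompt text, field names, and helper functions are all stand-ins. It compares how much leading text two consecutive prompts share under each layout.

```python
import uuid
from datetime import datetime, timezone

# Stand-in for a fixed system prompt; the real text is whatever you assemble at deploy time.
STATIC_SYSTEM = "You are a sales-ops agent. Follow the tool schema exactly.\n" * 40


def volatile_fields() -> str:
    """Per-request values that differ on every call."""
    return f"ts={datetime.now(timezone.utc).isoformat()} session={uuid.uuid4()}"


def prompt_volatile_first() -> str:
    # Per-request fields near the top: the shared prefix ends almost immediately.
    return volatile_fields() + "\n" + STATIC_SYSTEM


def prompt_volatile_last() -> str:
    # Static text first: every request shares the full static prefix.
    return STATIC_SYSTEM + volatile_fields() + "\n"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


print(shared_prefix_chars(prompt_volatile_first(), prompt_volatile_first()))
print(shared_prefix_chars(prompt_volatile_last(), prompt_volatile_last()))
```

The second number covers the whole static section. The first stops inside the volatile header, which is the character-level version of what the cache saw at the token level.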

Config

# vllm serve flags we landed on
--model Qwen/Qwen2.5-32B-Instruct
--enable-prefix-caching
--block-size 16
--gpu-memory-utilization 0.92
--max-num-seqs 256
--swap-space 16
--preemption-mode recompute

--preemption-mode recompute matters under memory pressure. We tried swap and watched the cache thrash when bursts hit. Recompute throws cache blocks away cleanly instead of evicting them to CPU and back.

Comparison

Workload	Prompt structure	Hit rate	TTFT p50 before	TTFT p50 after
Tenant A (fixed)	Static 1,847-token prefix	94%	480ms	110ms
Tenant B (before fix)	Volatile fields at token 47	0.3%	510ms	505ms
Tenant B (after fix)	Volatile fields moved to tail	87%	510ms	145ms
Internal eval pipeline	Per-eval unique prompts	4%	390ms	380ms

The eval pipeline row is honest. Prefix caching does nothing for workloads where every prompt is genuinely unique. We left it on anyway because the overhead is negligible.

For routing across providers when we burst beyond self-hosted capacity, we run a small gateway in front (Bifrost is what we landed on, but the principle works with any of them). The local cache only helps for traffic that lands back on our own node, not the failover path.

Trade-offs and limitations

The cache costs GPU memory. We reserved roughly 14% of HBM for cached blocks at our max-num-seqs setting. That's memory we can't use for batch concurrency. Worth it for us because TTFT mattered more than throughput. Not worth it if you're optimizing for tokens-per-second on offline batch.

Cache invalidation is binary at block boundaries. A one-token change at position 0 kills the whole prefix. No fuzzy matching. Semantic-caching products exist for that, but they're a different beast. They cache responses, not KV state, and the failure modes differ.

The cache is per-node. We have four nodes behind a round-robin LB, so even after one node has cached a prompt, the next request carrying it lands on a cold node 75% of the time. We considered sticky routing by prompt hash. Decided the complexity wasn't worth a 200ms improvement on first-contact latency. Maybe later.
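For reference, sticky routing by prompt hash could look roughly like this. It is a sketch of the idea we decided against, not something we run: the node names, the `pick_node` helper, and the 2,048-character prefix window are all assumptions.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names


def pick_node(prompt: str, prefix_chars: int = 2048, nodes: list[str] = NODES) -> str:
    """Route by a hash of the prompt's leading characters so shared prefixes land together."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


system_prompt = "You are a sales-ops agent.\n" * 100  # longer than the prefix window
print(pick_node(system_prompt + "user: list open deals"))
print(pick_node(system_prompt + "user: draft a follow-up"))  # same node as the line above
```

Plain modulo hashing reshuffles most prompts whenever the node count changes, which is part of the complexity we did not want to take on.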

The model is the easy part. Knowing where your tokens go is the hard part.

We Audited Our Agent Tool-Call Traces. Half Our Eval Data Was Garbage.

Marcus Chen — Mon, 25 May 2026 16:03:44 +0000

TL;DR: We pulled 41,000 production agent traces at Nexus Labs to build a fine-tuning dataset. After a manual audit of 1,200 of them, ~48% were unusable: tool calls that "succeeded" but returned wrong data, retries masking provider failures, and silent fallbacks that changed which model answered. Putting Bifrost in front of the agent fleet fixed the trace problem more than any sampling strategy we tried.

We run an enterprise agent product. Sales-ops automations mostly. Each user task ends up as a chain of 8-40 tool calls across a planner model, a worker model, and roughly 12 internal tools.

For the last quarter my team has been building a fine-tune dataset from real traces. The plan was straightforward. Pull successful task completions. Filter by user thumbs-up. Use the trace as the training signal.

It did not work.

What "successful" actually meant in our traces

The first audit pass was 1,200 traces, two engineers, three weeks. We tagged each trace as "clean", "noisy", or "corrupted".

Category	% of traces	What it meant
Clean	52%	Tool calls returned correct data, model picked the right next step
Noisy	31%	Right answer eventually, but with hidden retries, fallback to a different model, or stale cache hits
Corrupted	17%	Trace claimed success, output was wrong. User had not noticed yet.

The noisy category is the one that broke me. We had been treating these as gold-standard data. A typical example: the planner called crm_lookup, got a 500, retried twice, then succeeded on a fallback Anthropic key while the original trace span still pointed at OpenAI gpt-4o. The training pair we would have generated: "given this user input, output this tool call sequence." But the sequence was the result of three providers and two model versions stitched together. No reproducibility.

Worse: nothing in our trace told us which model actually produced the final answer. We had a model field. It logged whichever provider was configured at request start.

Why we ended up putting a gateway in front of everything

We tried two things first. Both partial fixes.

The first was logging at the application layer. Wrap every provider call, log model, latency, retry count, fallback path. This works until you have four services calling four SDKs with four retry policies. Our Python service used the official openai client. Our Go service used a hand-rolled HTTP client. The TypeScript planner used Vercel AI SDK. Three different definitions of "retry".

The second was forcing all traffic through LiteLLM. It got us to a unified call surface but the observability was thin for our needs, and the failover behaviour was harder to reason about under load. Not a knock on LiteLLM, it just was not the shape we wanted.

We migrated the fleet behind Bifrost about five months ago. Two reasons specific to our problem:

The Automatic Fallbacks config makes the fallback chain a first-class object. When a request fails over from Anthropic to Bedrock, that is in the response metadata. Not in three different log lines you have to join.
Native Prometheus metrics (observability docs) meant bifrost_requests_total is tagged by the actual provider that served the request, not the one we asked for.

Here is a chunk of the config that mattered for trace cleanup:

providers:
  openai:
    keys:
      - value: env.OPENAI_API_KEY_1
        weight: 0.7
      - value: env.OPENAI_API_KEY_2
        weight: 0.3
  anthropic:
    keys:
      - value: env.ANTHROPIC_API_KEY

fallbacks:
  - model: openai/gpt-4o
    fallback_to:
      - anthropic/claude-sonnet-4-6
      - openai/gpt-4o-mini

logging:
  include_fallback_chain: true
  include_provider_actual: true

The two include_* flags meant every trace span we emitted downstream had a deterministic answer to "who served this token". Our corrupted-trace rate across the next 5,000 sampled traces dropped from 17% to under 3%.

What the audit actually changed about our fine-tuning

We stopped using user thumbs-up as the primary filter. Thumbs-up correlates with "user got what they wanted eventually", not "the model made the right call". Now the filter is:

Single-provider, single-model trace (no fallback fired)
No retry on any tool call
Tool call result schemas validated post-hoc against a recorded ground truth
Span timing within 1.5x median for that task class

That filter throws away about 71% of our raw traces. Painful. But the 29% that survives is data we can actually train on.
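The four criteria above translate into a small predicate. This is a minimal sketch under stated assumptions: the `Trace` record and its field names are hypothetical, and `schemas_valid` stands in for the outcome of the post-hoc ground-truth validation.

```python
from dataclasses import dataclass


@dataclass
class Trace:  # hypothetical record shape for illustration
    task_class: str
    providers_used: set[str]
    models_used: set[str]
    fallback_fired: bool
    tool_retry_count: int
    schemas_valid: bool   # result of the post-hoc ground-truth validation
    duration_s: float


def is_trainable(trace: Trace, median_duration_s: dict[str, float]) -> bool:
    """Apply the four filter criteria to one trace."""
    single_source = (
        len(trace.providers_used) == 1
        and len(trace.models_used) == 1
        and not trace.fallback_fired
    )
    within_timing = trace.duration_s <= 1.5 * median_duration_s[trace.task_class]
    return (
        single_source
        and trace.tool_retry_count == 0
        and trace.schemas_valid
        and within_timing
    )


medians = {"lead_enrichment": 10.0}
clean = Trace("lead_enrichment", {"openai"}, {"gpt-4o"}, False, 0, True, 12.0)
print(is_trainable(clean, medians))  # True
```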

Trade-offs and limitations

Honest take on what this did not solve.

Bifrost is not a debugger. It tells you which provider served the request and whether a fallback fired. It does not tell you whether the tool result was correct. We still need the post-hoc schema validation pass.
Semantic caching (docs) made the corruption worse before it got better. Cache hits looked like fresh model calls in our old logging. We had to explicitly tag cached responses in the trace pipeline. Once tagged, fine, but the default was confusing.
LiteLLM has a larger provider list in the long tail. If you need niche providers, check both before committing.
Portkey's prompt management UI is nicer. We do prompt management elsewhere so it did not matter for us. If you want one tool for both, Portkey is worth a look.
The MCP gateway feature (docs) is interesting but we have not put it in production. Cannot vouch for it yet.

The model is the easy part. The infrastructure around the trace is where your eval dataset lives or dies.

Why Your LLM Eval Harness Is Lying to You (And How to Fix It)

Marcus Chen — Fri, 22 May 2026 06:32:58 +0000

TL;DR: Most eval harnesses I see in production are measuring the wrong thing. They report 87% pass rate on a static suite that hasn't been touched in four months, while the model silently regresses on the queries that actually matter. Here's how we restructured ours at Nexus Labs after a bad week in February.

We shipped a fine-tuned Llama 3.1 70B variant in late January. Eval score: 91.2 on our internal suite. Two weeks later, support tickets spiked. Customers running multi-step agent workflows were getting truncated tool calls roughly 12% of the time. Our eval suite caught zero of these.

The suite wasn't broken. It was answering a question nobody had asked in months.

The static-suite trap

Here's the pattern I keep seeing. Team builds an eval set of 500 examples around the time they ship v1. Each example gets a reference answer and a string-match or embedding-similarity check. The suite becomes the source of truth. CI gates on it. Dashboards graph it. Nobody questions it.

But your traffic distribution shifts. New customer onboards with a different query pattern. A prompt change upstream alters tool-call frequency. The suite still passes because the suite hasn't moved.

We pulled three months of production traces and binned them by intent cluster. The original eval suite covered four of the eleven clusters that showed up in real traffic. The four it covered were the easiest ones.

What we actually changed

Three things. None of them clever.

1. Replay-based eval, refreshed weekly. We sample 2,000 real production traces per week, strip PII, and run them through the candidate model. We compare structured outputs (tool calls, JSON fields) against the production response using exact match on tool name plus a learned judge for arguments. Free-form text gets a pairwise preference check against the current production model using a separate judge model.

2. Cluster-stratified sampling. Embed every trace with text-embedding-3-large, cluster with HDBSCAN, sample proportionally. This stops the eval from being dominated by the one chatty customer who sends 40% of traffic.

3. Adversarial slices owned by humans. Our support team flags any ticket that traces back to a model failure. Those traces get added to a permanent adversarial set. That set grows. It never shrinks. Currently sitting at 847 examples and climbing.
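The cluster-stratified sampling in step 2 can be sketched as follows, assuming cluster labels have already been produced upstream (for example by HDBSCAN, where label -1 marks noise and is treated here as its own bucket). The `stratified_sample` helper and the `min_per_cluster` floor are assumptions for illustration, not our exact sampler.

```python
import random
from collections import defaultdict


def stratified_sample(
    labels: list[int],
    sample_size: int,
    min_per_cluster: int = 1,
    seed: int = 0,
) -> list[int]:
    """Return trace indices sampled per cluster, proportional to cluster size."""
    rng = random.Random(seed)
    by_cluster: dict[int, list[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    picked: list[int] = []
    for _, members in sorted(by_cluster.items()):
        quota = round(sample_size * len(members) / len(labels))
        # The floor keeps small clusters represented; never ask for more than exists.
        quota = min(len(members), max(min_per_cluster, quota))
        picked.extend(rng.sample(members, quota))
    return picked


labels = [0] * 80 + [1] * 15 + [2] * 5
sample = stratified_sample(labels, sample_size=20)
print(len(sample))  # 20
```

Because of rounding and the per-cluster floor, the returned sample can be slightly larger or smaller than `sample_size` on other inputs.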

eval_config:
  replay:
    sample_size: 2000
    window_days: 7
    strip_pii: true
    cluster_method: hdbscan
    min_cluster_size: 15
  judges:
    structured: exact_match_with_arg_judge
    freeform: pairwise_preference
    judge_model: claude-sonnet-4-6
  adversarial:
    path: ./evals/adversarial_permanent.jsonl
    weight: 3.0
  gates:
    regression_threshold: 0.02
    adversarial_floor: 0.85

The weight: 3.0 on adversarial is deliberate. Those examples represent real customer pain. A 1% regression on adversarial costs us more than a 1% regression on the easy cases.
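How the weight and the gates combine is not spelled out in the config, so the following is one plausible reading rather than the harness itself: adversarial examples count three times in an aggregate pass rate, and a candidate must clear both the regression threshold and the adversarial floor. The function names are hypothetical.

```python
def weighted_score(
    replay_pass: list[bool],
    adversarial_pass: list[bool],
    adversarial_weight: float = 3.0,
) -> float:
    """Pass rate where each adversarial example counts `adversarial_weight` times."""
    total = len(replay_pass) + adversarial_weight * len(adversarial_pass)
    earned = sum(replay_pass) + adversarial_weight * sum(adversarial_pass)
    return earned / total


def passes_gates(
    candidate_score: float,
    baseline_score: float,
    adversarial_pass_rate: float,
    regression_threshold: float = 0.02,
    adversarial_floor: float = 0.85,
) -> bool:
    """Block the release on a large regression or a weak adversarial pass rate."""
    no_regression = (baseline_score - candidate_score) <= regression_threshold
    return no_regression and adversarial_pass_rate >= adversarial_floor


score = weighted_score([True] * 90 + [False] * 10, [True] * 8 + [False] * 2)
print(round(score, 4))  # 0.8769
```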

Routing the eval traffic

Running 2,000 traces against a candidate model plus a judge model plus the production baseline gets expensive fast. We were burning $400/week on judge calls alone before we got serious about caching and routing.

Two things helped. First, semantic caching on the judge prompts. The same trace evaluated twice against the same model pair should not cost twice. Second, we route across providers based on per-token cost for the judge role specifically. We use Bifrost (https://github.com/maximhq/bifrost) for this because it gives us one OpenAI-compatible endpoint and lets us shift judge traffic between Anthropic and Google without touching the eval code. LiteLLM works similarly if that's already in your stack.

Cost dropped to $140/week. Same coverage.

Comparison: what we tried

Approach	Coverage of real traffic	Maintenance cost	Catches silent regressions
Static curated suite	Low (drifts fast)	Low	Rarely
Pure replay	High	Medium	Sometimes (misses rare-but-critical)
Replay + cluster sampling + adversarial	High	Medium-high	Yes
LLM-judge-only with no replay	Medium	Low	Inconsistent

Trade-offs and limitations

Replay-based eval has real problems and I don't want to undersell them.

Judge models are not ground truth. Pairwise preference between two model outputs is noisy. We run each comparison three times with temperature 0.3 and take majority vote. Even then, agreement with human raters sits around 78% on our adversarial slice. Useful, not authoritative.

PII stripping is fragile. We use a regex stack plus a small NER model. We still find leakage occasionally during audits. If your domain has strict data handling rules, you may need synthetic replays instead of real ones, which loses some of the distributional fidelity that makes this work.

Replay assumes today's traffic looks like tomorrow's. For a stable product, fine. For one shipping new features weekly, you're always one release behind.

And the adversarial set has a selection bias. We only add examples that humans flagged. Failures nobody noticed don't make it in. We try to compensate by manually sampling 50 random traces per week for human review, but we're not closing the loop completely.
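The three-run majority vote described above for judge comparisons is simple enough to sketch. The `judge` callable is a hypothetical wrapper around whatever judge model you use; it is assumed to return "A" or "B" and to apply its own temperature setting.

```python
from collections import Counter
from typing import Callable


def majority_preference(
    judge: Callable[[str, str, str], str],
    prompt: str,
    output_a: str,
    output_b: str,
    runs: int = 3,
) -> str:
    """Ask the judge `runs` times and return the most common verdict."""
    verdicts = [judge(prompt, output_a, output_b) for _ in range(runs)]
    return Counter(verdicts).most_common(1)[0][0]
```

With an odd number of runs and two possible verdicts there is always a strict majority, which is one reason to use three runs rather than two or four.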

What hasn't worked

Tried benchmark suites like MT-Bench and HELM as our primary gate. Useless for our domain. They measure general capability. We don't ship general capability. We ship agent reliability on a narrow task surface.

Tried a single LLM-as-judge with one rubric. Too much variance. Rubric drift between runs was higher than the signal we were trying to measure.

Measuring AI Gateway Failover: 30 Days of Production Data

Marcus Chen — Thu, 21 May 2026 16:02:16 +0000

TL;DR: We measured failover latency across three AI gateways (Bifrost, LiteLLM, Portkey) during 30 days of production traffic at Nexus Labs. Bifrost added 11ms p99 overhead with automatic provider fallback. The model is the easy part. Routing it reliably is not.

Our agent platform at Nexus Labs handles around 2.4M LLM requests per day. Half of those hit OpenAI, the rest spread across Anthropic, Bedrock, and Vertex. When OpenAI had its 4-hour incident on April 23, we lost 38 minutes of traffic before our homegrown retry logic gave up and rerouted.

That hurt. So we replaced the retry layer.

The actual problem

Most gateway benchmarks measure throughput on a cold path with no failures. That tells you very little about production. What I care about: how long does it take for a request to recover when a provider returns 429 or 503? How much p99 latency does the gateway add when nothing is wrong?

Our team of 9 engineers spent two weeks instrumenting three options. Same hardware (c6i.4xlarge, 2 nodes behind an NLB). Same upstream credentials. Same request distribution sampled from our actual logs.

Setup

Each gateway sat between our agent service and four providers. We configured identical fallback chains: OpenAI primary, Anthropic secondary, Bedrock tertiary. Cache disabled. Rate limits set to mirror our prod allocation.

Here's the Bifrost config we used:

providers:
  openai:
    keys:
      - value: env.OPENAI_API_KEY
        weight: 1.0
  anthropic:
    keys:
      - value: env.ANTHROPIC_API_KEY
        weight: 1.0
  bedrock:
    keys:
      - value: env.AWS_BEDROCK_KEY
        weight: 1.0

fallbacks:
  - provider: openai
    model: gpt-4o
    fallback_to:
      - provider: anthropic
        model: claude-sonnet-4
      - provider: bedrock
        model: anthropic.claude-sonnet-4

Documented behavior is at https://docs.getbifrost.ai/features/retries-and-fallbacks. LiteLLM and Portkey have equivalent configs. Different YAML shape, same semantics.
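The semantics we configured in all three gateways boil down to "retry once, then switch provider". Here is a minimal sketch of that behavior. It is not any gateway's implementation: `send` is a hypothetical callable that returns an HTTP status and a body, and treating only 429 and 503 as retryable is an assumption.

```python
from typing import Callable

RETRYABLE = {429, 503}


def call_with_fallback(
    chain: list[tuple[str, str]],
    send: Callable[[str, str], tuple[int, str]],
    retries_per_provider: int = 1,
) -> dict:
    """Try each (provider, model) in order; retry retryable errors, then move on."""
    attempts: list[tuple[str, int]] = []
    for provider, model in chain:
        for _ in range(retries_per_provider + 1):
            status, body = send(provider, model)
            attempts.append((provider, status))
            if status == 200:
                return {"served_by": provider, "model": model, "body": body, "attempts": attempts}
            if status not in RETRYABLE:
                break  # do not retry this provider; fall through to the next one
    raise RuntimeError(f"all providers failed: {attempts}")
```

Returning `served_by` and the attempt list with the response is the part that matters for tracing: the caller can see which provider answered without joining log lines.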

Results

We ran 720 hours of mirrored traffic. Numbers below are from the actual logs, not synthetic load.

Gateway	p50 overhead	p99 overhead	Failover time (provider down)	Memory at 1k RPS
Bifrost	3ms	11ms	180ms (one retry + switch)	412 MB
LiteLLM	8ms	41ms	620ms	890 MB
Portkey (self-hosted)	6ms	29ms	340ms	650 MB

Bifrost is written in Go. LiteLLM is Python with FastAPI. That accounts for most of the gap on the hot path. Not all of it. Bifrost's fallback chain evaluates synchronously without re-queuing the request, which matters when you're already on retry attempt two.

Portkey was solid but the self-hosted version lagged their managed offering in feature parity. LiteLLM's killer feature for our team was richer support for custom cost-tracking callbacks. We still use those for finance reporting.

What we used Bifrost for

Three things, specifically.

Fallback routing. When OpenAI returns 429, the request goes to Anthropic with the equivalent model. Our agent code never knows. Docs at https://docs.getbifrost.ai/features/retries-and-fallbacks.

Semantic caching. For our evaluation harness specifically. We replay 18,000 prompts against new model versions nightly. Cache hit rate is 73% because the evaluation suite asks the same questions repeatedly. That's around 13k requests we don't pay for each night. Reference: https://docs.getbifrost.ai/features/semantic-caching.

Prometheus metrics. Native export. We already had a Prom stack. Five-minute integration. The default dashboards aren't great but the metrics themselves are useful. Reference: https://docs.getbifrost.ai/features/observability/default.

What we did not use

MCP gateway, governance, SSO. Our auth sits in front of the gateway, not inside it. The custom plugins interface looked interesting but we haven't needed one yet.

Trade-offs and Limitations

Bifrost is younger than LiteLLM. The provider list is wide (23+) but if you need a niche provider, check the docs first. The plugin interface is straightforward so you can add one yourself, but that's still work.

The web UI is decent for initial setup, not where you want to be doing complex governance. Configure things in YAML and version them in git like anything else.

If you're already deep in LiteLLM and using its callback ecosystem, migration cost is real. LiteLLM has more community integrations because it's been around longer. Portkey is also a fine choice if you want a managed control plane and don't want to operate a gateway yourself. Pick based on what your team will actually maintain.

Last caveat. The numbers above are from our workload. Your traffic shape will differ. Run the test yourself before deciding.

What Gemma 4's multi-token prediction head actually means for your eval pipeline

Marcus Chen — Tue, 07 Apr 2026 12:21:46 +0000

Gemma 4 dropped with a multi-token prediction (MTP) head and immediately every benchmark thread on r/LocalLLaMA and r/MachineLearning filled up with MMLU scores, HumanEval numbers, and throughput charts.

Most of those benchmarks are not measuring what the MTP head actually changes. Here's what's actually happening, and what it means if you're running your own eval pipeline.

What MTP actually is

Standard autoregressive generation predicts one token at a time. At each step, the model outputs a probability distribution over the vocabulary, samples a token, appends it, and repeats.

Multi-token prediction trains an additional head to predict multiple future tokens simultaneously. The core model still generates token-by-token at inference time, but the MTP head is used during training as an auxiliary loss — forcing the model to maintain internal representations that are useful several tokens ahead.

The practical effect at inference time (depending on how it's deployed): speculative decoding becomes more effective because the MTP head can propose candidate continuations that the main model is more likely to accept. This is where the throughput numbers come from.

Here's a simplified view of what speculative decoding with an MTP head looks like:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def speculative_decode_step(
    main_model,
    draft_model,  # or MTP head used as draft
    input_ids: torch.Tensor,
    gamma: int = 4,  # number of draft tokens to generate
    temperature: float = 1.0,
) -> torch.Tensor:
    """
    One round of greedy speculative decoding (batch size 1).
    Draft model proposes `gamma` tokens, main model verifies.
    `temperature` is unused in this greedy sketch.
    """
    draft_tokens = []

    # Generate gamma draft tokens
    draft_input = input_ids.clone()
    with torch.no_grad():
        for _ in range(gamma):
            draft_out = draft_model(draft_input)
            next_token_logits = draft_out.logits[:, -1, :]
            next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
            draft_tokens.append(next_token)
            draft_input = torch.cat([draft_input, next_token], dim=-1)

    # Verify with main model
    candidate_ids = torch.cat([input_ids] + draft_tokens, dim=-1)
    with torch.no_grad():
        main_out = main_model(candidate_ids)

    # Accept/reject draft tokens
    accepted = 0
    for i in range(gamma):
        main_logits = main_out.logits[:, input_ids.shape[1] + i - 1, :]
        draft_token = draft_tokens[i]

        # Simple greedy acceptance check (real implementations use sampling)
        main_token = torch.argmax(main_logits, dim=-1, keepdim=True)
        if torch.equal(main_token, draft_token):
            accepted += 1
        else:
            break

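    # Return accepted prefix + one token from the main model:
    # a correction on mismatch, or a bonus token if all drafts were accepted
    prefix_len = input_ids.shape[1] + accepted
    next_logits = main_out.logits[:, prefix_len - 1, :]
    correction = torch.argmax(next_logits, dim=-1, keepdim=True)
    return torch.cat([candidate_ids[:, :prefix_len], correction], dim=-1)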

The MTP head improves the acceptance rate in that inner loop. More accepted draft tokens per round = higher effective throughput.
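To put a number on that, here is a small sketch of the standard expected-tokens-per-round estimate for speculative decoding. It assumes each draft token is accepted independently with the same probability `alpha`, which real workloads only approximate, and it counts the extra token the main model contributes each round.

```python
def expected_tokens_per_round(alpha: float, gamma: int = 4) -> float:
    """Expected tokens emitted per verification round with `gamma` draft tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_round(alpha), 2))
```

The curve is steep near the top: moving acceptance from 0.7 to 0.9 buys far more than moving from 0.5 to 0.7, which is why predictable outputs benefit disproportionately.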

Why your benchmark results are probably misleading

The throughput gains from MTP are real, but they're not uniform across tasks. The acceptance rate of speculative decoding depends on how predictable the output sequence is.

High acceptance rate (MTP helps a lot):

Code generation — syntax is highly structured
Structured data extraction — JSON, CSV, templated output
Formulaic text — boilerplate, standard contract language, templated responses

Lower acceptance rate (MTP helps less):

Open-ended generation with high entropy
Creative writing
Reasoning chains that make non-obvious inferential jumps
Adversarial inputs where the model is already uncertain

If your benchmark is mostly code or structured output tasks, your throughput numbers will look great. If your production use case is open-ended dialogue or reasoning-heavy tasks, the gains will be smaller.

What I actually measured

At Nexus, we maintain domain-specific eval suites for our enterprise automation use cases. I ran Gemma 4 through these last week. Three categories:

Structured extraction (contract parsing, form extraction)

# eval structure — simplified
eval_results = {
    "task": "structured_extraction",
    "model_variants": ["gemma4-base", "gemma4-mtp"],
    "metrics": {
        "gemma4-base":  {"throughput_tps": 847,  "f1": 0.923, "exact_match": 0.871},
        "gemma4-mtp":   {"throughput_tps": 1001, "f1": 0.924, "exact_match": 0.873},
    }
}
# ~18% throughput improvement, no quality regression
# This is the good case for MTP

Open-ended summarization

eval_results = {
    "task": "open_ended_summarization",
    "metrics": {
        "gemma4-base":  {"throughput_tps": 612, "rouge_l": 0.441, "topic_drift_rate": 0.031},
        "gemma4-mtp":   {"throughput_tps": 679, "rouge_l": 0.438, "topic_drift_rate": 0.047},
    }
}
# ~11% throughput improvement
# Small but consistent increase in mid-sentence topic drift
# ROUGE difference is within noise, but topic_drift_rate is reproducible

topic_drift_rate here is an internal metric — we flag spans where the model shifts semantic focus within a sentence boundary. It's a custom eval, not something you'll find in standard benchmarks.

Adversarial robustness suite

eval_results = {
    "task": "adversarial_robustness",
    "test_families": [
        "paraphrase_invariance",     # same meaning, different phrasing
        "format_variation",          # valid but unusual formatting
        "rare_edge_cases",           # valid but low-frequency inputs
        "ambiguity_resolution",      # genuinely ambiguous inputs
    ],
    "metrics": {
        "gemma4-base": {"overall_pass_rate": 0.847},
        "gemma4-mtp":  {"overall_pass_rate": 0.849},
    }
}
# Effectively identical — MTP doesn't help or hurt adversarial robustness

The adversarial result is the most important one for production deployments. Throughput gains are nice. Robustness is what keeps you off the incident page.

What this means for your eval pipeline

If you're evaluating Gemma 4 for a production deployment, here's what to actually do:

1. Build task-specific benchmarks, not generic ones

Generic benchmarks tell you how the model performs on generic tasks. Your use case is not generic.

class DomainEvalSuite:
    def __init__(self, task_name: str, test_cases: list[dict]):
        self.task_name = task_name
        self.test_cases = test_cases  # [{input, expected_output, metadata}]

    def run(self, model, tokenizer) -> dict:
        results = []
        for case in self.test_cases:
            output = self._generate(model, tokenizer, case["input"])
            score = self._score(output, case["expected_output"])
            results.append({
                "input": case["input"],
                "output": output,
                "score": score,
                "metadata": case["metadata"]
            })
        return self._aggregate(results)

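    def _generate(self, model, tokenizer, prompt: str) -> str:
        # Minimal greedy generation; assumes a Hugging Face-style model and tokenizer
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    def _score(self, output: str, expected: str) -> float:
        # Implement task-specific scoring — not ROUGE, not BLEU
        # Exact match, F1 over extracted fields, custom rubric, whatever fits
        raise NotImplementedError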

    def _aggregate(self, results: list) -> dict:
        scores = [r["score"] for r in results]
        return {
            "mean": sum(scores) / len(scores),
            "p10": sorted(scores)[len(scores) // 10],  # tail performance matters
            "fail_rate": sum(1 for s in scores if s < 0.5) / len(scores),
        }

P10 tail performance and fail rate matter more than mean score for production systems. Mean score hides your worst cases.
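As one example of task-specific scoring, here is a minimal sketch of F1 over extracted fields for a structured-extraction task. It assumes outputs have already been parsed into flat dicts; `field_f1` is a hypothetical helper, and nested or fuzzy-matched fields would need more care.

```python
def field_f1(predicted: dict, expected: dict) -> float:
    """F1 over (field, value) pairs: a field counts only if key and value both match."""
    if not predicted and not expected:
        return 1.0
    correct = sum(
        1 for key, value in predicted.items()
        if key in expected and expected[key] == value
    )
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(expected)
    return 2 * precision * recall / (precision + recall)


predicted = {"party": "Acme", "term": "12m", "fee": "100"}
expected = {"party": "Acme", "term": "24m"}
print(round(field_f1(predicted, expected), 2))  # 0.4
```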

2. Separate throughput eval from quality eval

Don't conflate them. Run throughput benchmarks under controlled conditions (fixed prompt length, fixed output length, known hardware). Run quality benchmarks separately. Don't let throughput optimization choices degrade quality metrics.
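A throughput run kept separate from quality scoring can be as small as this sketch. The `generate` callable is a hypothetical wrapper that runs one prompt and returns the number of tokens it produced; the warmup count and fixed output length are assumptions to adjust for your hardware.

```python
import time
from typing import Callable


def measure_throughput(
    generate: Callable[[str, int], int],
    prompts: list[str],
    max_new_tokens: int = 128,
    warmup: int = 2,
) -> float:
    """Return generated tokens per second over `prompts` after a short warmup."""
    for prompt in prompts[:warmup]:
        generate(prompt, max_new_tokens)  # warm caches; not timed
    start = time.perf_counter()
    total_tokens = sum(generate(prompt, max_new_tokens) for prompt in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

Nothing here inspects output quality, which is the point: the same prompts go through the quality suite in a separate run.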

3. Test MTP specifically for your task type

If your use case is structured output: MTP is probably worth it, measure the throughput gain and verify quality holds.

If your use case is open-ended generation: measure both throughput gain AND any quality regressions before assuming MTP is better.

4. Include an adversarial subset

At minimum, include:

Paraphrase variants of your test inputs (same intent, different wording)
Format variations (valid but unusual)
A few hand-crafted tricky cases specific to your domain

If your model passes the standard tests but fails on paraphrases of those same tests, it has memorized the test phrasing rather than generalized.
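One way to watch for this is to track the gap between pass rates on original inputs and on their paraphrases. A minimal sketch, assuming each result row carries a hypothetical `variant` label and a boolean `passed` flag:

```python
def paraphrase_gap(results: list[dict]) -> float:
    """Pass rate on original inputs minus pass rate on their paraphrases."""
    originals = [r["passed"] for r in results if r["variant"] == "original"]
    paraphrases = [r["passed"] for r in results if r["variant"] == "paraphrase"]
    if not originals or not paraphrases:
        raise ValueError("need both original and paraphrase results")
    return sum(originals) / len(originals) - sum(paraphrases) / len(paraphrases)


results = [
    {"variant": "original", "passed": True},
    {"variant": "original", "passed": True},
    {"variant": "paraphrase", "passed": True},
    {"variant": "paraphrase", "passed": False},
]
print(paraphrase_gap(results))  # 0.5
```

A gap near zero is what you want. A large positive gap means the suite is rewarding familiarity with its own wording.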

The bottom line

Gemma 4 MTP is a real improvement for throughput on structured tasks. The benchmark numbers showing gains on MMLU/HumanEval are real but somewhat misleading — those tasks happen to be ones where MTP acceptance rates are high.

Build your own eval suite. Measure what matters for your use case. Check both throughput and quality. Include adversarial coverage.

The model is not the hard part.

I'm happy to go deeper on any of the eval methodology here — custom metrics design, adversarial suite construction, or the MTP acceptance rate analysis. Drop questions in the comments.

Mastering Local AI Agents for Everyday Programming in 2026

Marcus Chen — Fri, 03 Apr 2026 17:10:19 +0000

The landscape of software development is shifting beneath our feet. While large cloud-based LLMs have dominated headlines, 2026 is the year local AI agents have truly matured into indispensable tools for everyday programming.

By running autonomous, agentic workflows on our own silicon, developers are unlocking new levels of privacy, speed, and offline capability. In this post, we'll explore why local agents matter and how you can seamlessly integrate them into your coding routine.

Why Local Agents?

Cloud LLMs are powerful, but they have limitations:

Privacy: Not every codebase can or should be sent over the wire. Local agents keep proprietary logic strictly on your machine.
Latency: No network trips means near-instant feedback for lightweight refactors or shell queries.
Cost: Once you have the hardware, inferences are virtually free. This enables "infinite loop" agents that can continuously run tests and iteratively fix bugs in the background without racking up API bills.

Essential Workflows for Local Agents

1. The Autonomous Test-Fixer

Instead of manually deciphering stack traces, local agents can watch your test output. When a test fails, the agent isolates the failure, analyzes the relevant module, and proposes a fix.

# Example of a broken function
def calculate_discount(price, discount_percent):
    return price - (price * discount_percent) # Oops, forgot to divide by 100

A background local agent detects the AssertionError, understands the logic, and patches the math error before you even switch back to your editor.
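For this example, the patch such an agent would be expected to propose is a one-line change. The sketch below assumes `discount_percent` is given as a whole-number percentage, as the comment in the broken version implies.

```python
def calculate_discount(price: float, discount_percent: float) -> float:
    """Return the price after a percentage discount (e.g. 25 means 25% off)."""
    return price - (price * discount_percent / 100)


print(calculate_discount(200, 25))  # 150.0
```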

2. PR Review and Digest

Local code-specialized models can read your diffs before you commit. They act as a ruthless but helpful rubber duck, pointing out logical gaps or missing edge-case coverage without complaining about rate limits.

3. Deep-Dive Log Analysis

Sifting through thousands of lines of logs locally? A local agent can process massive log files right where they live, grepping for anomalies and synthesizing a human-readable summary.

Tools to Get Started

If you're looking to build your own local agentic stack, here are a few tools leading the charge:

Ollama / LM Studio: The backbone for running quantized models efficiently.
OpenClaw / Aider: Terminal-native agents that can directly edit your files and run shell commands.

Conclusion

The question is no longer if AI will write code, but where that AI lives. By embracing local agents, we get the best of both worlds: the intelligence of modern LLMs with the speed, privacy, and control of our own hardware.

Are you running any agents locally in your workflow? Let me know in the comments!
