Forem: Ian L. Paterson

Building llama.cpp from source on a Dell Precision T5820 with an RTX 3090 Ti (after seven power cycles)

Ian L. Paterson — Mon, 18 May 2026 20:09:10 +0000

I pulled a Quadro M4000 out of a used Dell Precision T5820, dropped in an RTX 3090 Ti, and turned the box into a homelab inference node running Qwen3.6-27B at 42 tok/s. Getting there took seven BIOS power cycles before the PCIe link would train. The Dell forum threads and the LLM-generated answers all miss the same thing: the fix is patience.

This post has the working recipe, the from-source llama.cpp build, the 12VHPWR connector physics that nobody explains, and the long-context tricks that let a $700 used GPU serve 262K-token windows on a single 24 GB card. Numbers are from May 2026 against driver 580.142, Qwen3.6-27B Q4_K_M, and llama.cpp at the commit current at publish. Versions in this stack move fast, so treat the specific numbers as a snapshot.

The working recipe

If you landed here from a Dell forum thread and just need the answer:

BIOS 2.41 or newer on the T5820. Verify in System Information.
Disable Secure Boot, set boot mode to UEFI only, and leave Primary Video on Auto.
3090 Ti in slot 1 (top, CPU lanes) or slot 4 (also CPU lanes). Slot 1 is x8 on a Xeon W-2223 build and slot 4 is x16, but PCIe Gen3 x8 does not bottleneck a single GPU inference workload. Pick on clearance.
12VHPWR seated until you hear the latch click. Three separate PSU cables to the 3-to-1 adapter. Y-splitters and pigtails are a fire hazard at 450 W, so all three 8-pin inputs need to be populated from three independent rails.
Both PSUs powered before you press the Dell power button. If you are running dual-PSU, bring up the GPU PSU first.
First boot may power-cycle five to seven times before POST. Do not abort early. The BIOS is retraining the PCIe link.
After Linux boots, sudo apt install nvidia-driver-580, reboot, then verify with nvidia-smi.

Step 6 is the step most forum advice skips.

Building llama.cpp from source for the 3090 Ti

For a single-user 24 GB box, the right answer is to build llama.cpp from source against your exact GPU compute capability. Ollama, Docker images, and prebuilt binaries all lag on features and hide tuning flags. Building from source picks up upstream improvements the same day they land, runs faster on this hardware, and gives you access to every knob.

git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DLLAMA_CURL=ON
cmake --build build --config Release -j$(nproc)

86 is sm_86, Ampere, the 3090 Ti's compute capability (same number for the regular 3090). Build takes about fifteen minutes on eight threads, with CUDA kernel codegen (nvcc, ptxas, cicc) doing most of the work. Subsequent rebuilds are quick if you set up ccache before the first build.

Pull a model and start the server:

huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir ~/models
~/llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
  -ngl 99 --host 127.0.0.1 --port 8080 \
  -c 8192 --jinja

That gives you an OpenAI-compatible API on localhost:8080. -ngl 99 puts all layers on the GPU and --jinja is mandatory if you want tool calling to work, which is covered in its own section below. At this baseline configuration the box uses 17.3 GiB of VRAM and serves Qwen3.6-27B Q4_K_M at 42 tok/s on a 200-word generation, with 86% GPU utilization and 445 W under load.

The seven-power-cycle install diary

Order matters, so here is what actually happened.

Fresh Ubuntu 25.10 install on the box, SSH access via Tailscale.
lspci saw only the Quadro M4000 (PCH-attached, x4) on bus 04. The 3090 Ti did not enumerate, both CPU PCIe root ports were empty, and the card was inert with no fan twitch on power-on and no LED.
False lead. I assumed PSU sequencing was the issue, pulled the card to the bench, and jumped PS_ON on the 1 kW supply. The card stayed inert. Twenty minutes in I remembered that most retail Add-In Board (AIB) cards refuse to light or spin until they are seated in a PCIe slot, which makes bench tests unreliable for "is the card alive" checks on Ampere-class GPUs.
Real cause: the 12VHPWR connector was loose. I reseated it firmly until it clicked, the card LED lit up, and it was alive.
Installed in slot 1 of the T5820, powered up the GPU PSU first, then pressed the Dell power button.
Boot loop. The box power-cycled four, five, six, seven times before settling into another black screen with no SSH, and I aborted.
Searched Dell forums and LinusTechTips and found multiple unresolved threads. Dell's official guidance qualifies the RTX 3090 for slots 2 and 4 of the T5820, the two x16 CPU slots.
Tried slot 4, same boot-loop pattern, aborted again.
Pulled the M4000 entirely and booted to BIOS on the 3090 Ti to confirm Secure Boot disabled, UEFI only, Primary Video on Auto. The BIOS 2.41 System Setup screen does not expose a user-facing toggle for memory-mapped I/O (MMIO) above 4 GB on this revision, and the firmware appears to handle the mapping automatically.
Reinstalled in slot 1 (more clearance for the 3.5-slot girth than slot 4, accepting the x8 lane drop). The boot loop returned, but this time I waited instead of aborting and the box POSTed cleanly on the seventh cycle. SSH came back. lspci showed 0000:b3:00.0 GA102 [GeForce RTX 3090 Ti] [10de:2203] on a CPU root complex, which is what was supposed to happen all along.
Installed nvidia-driver-580 via apt, rebooted, and nvidia-smi came up clean.

Why this is not a software fix

The BIOS gets there or it doesn't, and on a fresh install with a high-power GPU it takes more attempts than feels reasonable. The Dell community threads stay unresolved because the resolution lives in the firmware's PCIe link-training routine, which runs on its own schedule. Once you understand that, the right move is to wait.

The fixes that don't work (and why the forums keep recommending them)

Every Dell community thread I read in the BIOS 2.41 era, every LLM-generated answer I asked for, and every YouTube tutorial in the first two pages of results converges on the same four pieces of advice. None of them matched what I was seeing on this build. They are worth naming, because if you are debugging this in the moment, you will burn an hour on each before you give up.

The 10-pin CPU2 to 8-pin adapter

This is a real Dell part, and on a build where the GPU draws power from the workstation's stock 950 W supply it does matter. Once you put the 3090 Ti on a separate dedicated PSU (which you should, given the 450 W thermal design power, or TDP), the CPU2 adapter becomes irrelevant because the GPU is no longer pulling from the Dell rail at all.

Above 4G Decoding in BIOS

The toggle is not in my BIOS 2.41 System Setup, and on this revision the firmware appears to handle MMIO above 4 GB automatically. The screenshots in those forum threads are from older Dell consumer BIOSes or other workstation lines, so if your firmware does expose the toggle, leave it on. The qualification doesn't break anything either way.

Use slot 2 or slot 4, not slot 1

All three options are CPU lanes (slots 2 and 4 at x16, slot 1 at x8 on Xeon W-2223 builds), and the practical difference for a single-GPU inference workload is negligible because PCIe Gen3 x8 is not the bottleneck. I tried slot 1 first, hit the boot loop, moved to slot 4 because that's what the table said, hit the same boot loop, and went back to slot 1 because it has more clearance for the 3.5-slot girth. Slot choice was a dead end.

Deep power reset (hold the power button for thirty seconds)

This drains residual charge from the PSU capacitors and addresses stuck power states, which is a real failure mode for some hardware but the wrong diagnosis here. The boot loop is the BIOS taking multiple cycles to negotiate PCIe link training with a card outside its qualification database, and holding the power button is harmless against that but also does nothing to speed it up.

The link-training cycles run on firmware time, and every shortcut the forums recommend (BIOS flags, adapters, slot moves) leaves that time unchanged. If the card is seated correctly and the power is right, stop aborting after cycle four and let it run to seven. The forum threads stay open because most people abort before the BIOS finishes.

12VHPWR: the connector that fails silently

The 16-pin 12VHPWR connector is its own category of pain, and most write-ups about 3090 Ti, 4090, or 5090 problems are downstream of it. The Founders Edition adapter that ships with most cards is a 3-to-1 setup where three 8-pin PCIe inputs collapse into a single 16-pin output that plugs into the GPU. Three rules the marketing material does not stress:

All three 8-pin inputs need to be populated from three separate PSU rails with three separate cables. Y-splitters and pigtails are a fire hazard at 450 W. The 2023 CableMod recall covered angled adapters where the connector could shift loose under cable tension, but the underlying physics (partial contact at 35-40 A) is the same failure mode you create when you split rails or share them.
The 12VHPWR latch needs to audibly click. A connector seated 95% of the way will pass continuity tests, fail under load, and on some cards melt the connector housing. The audible click is the only reliable signal, so push until it clicks. If it does not click, the card is not seated.
The card will not light on the bench. Ampere-generation cards keep the fans and the LED off until they are seated in a PCIe slot, so you cannot validate "is this card alive" by jumping PS_ON on the PSU and looking for fan spin. The card has to be installed.

The third rule cost me the most time. I pulled the card to the bench, jumped PS_ON, watched it sit dark and inert, and concluded the card was DOA. It was fine all along, just waiting for a slot before it would wake up.

262K context on a single 24 GB card

This is the upgrade nobody mentions in the homelab threads. The 8 K baseline is what the recipe ships with, but Q4 KV cache and flash attention rewrite the memory math entirely.

~/llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
  -ngl 99 --host 127.0.0.1 --port 8080 \
  -c 262144 \
  -fa on -ctk q4_0 -ctv q4_0 \
  --parallel 1 \
  --jinja

Qwen3.6-27B Q4_K_M holds 262,144 tokens of context on a single 24 GB 3090 Ti at 39 tok/s eval and 86 tok/s prompt processing. VRAM use sits at 21.3 GiB with 2.7 GiB headroom against the 24,564 MiB cap, which is 3 tok/s slower than the 8K baseline and rounding error for most workloads.

KV cache type is what makes this possible. Most setups leave KV at fp16 default, or push it to q8 thinking "more bits equals more quality." On Qwen3.6-27B dense at 262 K context:

KV cache type	VRAM at 262K	Throughput
`fp16`	does not fit	n/a
`q8_0`	23 GiB (just fits)	extrapolated ~3x slower from 96K measurement
`q4_0`	21.3 GiB	39 tok/s

I verified the q8 trap on this rig at the context length where I could measure it directly. At 96 K context I observed a 23% throughput hit on the q4 to q8 swap, which dropped from 39 tok/s to 30 tok/s. The penalty scales with context length because more KV cells means more per-token dequant work. The 3x slowdown at 262 K is an extrapolation from that scaling rather than a head-to-head measurement, since q8 barely fits at 262K and I did not push a long run through it. Either way, the direction is consistent: q8 KV is a trap on consumer 24 GB hardware.

A couple of tradeoffs worth flagging:

--parallel 1 is a single slot, which is fine for solo use. Concurrent users will queue, which is rarely what you want.
KV cache quality at q4 over very long contexts is empirically untested for this model. Long-document recall could degrade in ways that throughput numbers will never show, so a needle-in-haystack pass is a precondition for trusting this configuration for real long-document work.

Hobbyist-tier hardware can now serve frontier-tier context lengths if you accept single-user throughput. A $700 used 3090 Ti running 262 K context locally breaks even against Claude Sonnet's $15/M output pricing after about two weeks of pegged inference. Add roughly $18 of electricity during those two weeks at $0.12 per kWh, or about $39 a month if you keep the card pegged. The ceiling shifted, and most homelab write-ups have not caught up.

For comparison, as measured on my Mac Studio M2 Max with 32 GB unified memory, MLX 1.6.0 runs Qwen3.6-35B-A3B (UD-Q4_K_XL, 35B total / 3B active per token) at 49 tok/s on a 32K context window. That is roughly the same throughput as the 3090 Ti on dense Qwen3.6-27B at 39 tok/s, with a smaller context ceiling and a larger model. The mixture-of-experts (MoE) bandwidth-divided-by-active-params speedup math (3B active / 400 GB/s) does not translate cleanly at Max-class memory bandwidth, since the headline MoE wins live on M-Ultra (800 GB/s) and leave Studio Max behind.

The silent OOM: context checkpoints and prompt cache

Two runtime allocations can add up to 19 GiB on top of the model and don't appear in the static VRAM math everyone publishes. Context checkpoints and the slot prompt cache absolutely show up at peak load, and the failure mode is silent.

Context checkpoints (-ctxcp or --ctx-checkpoints) cache intermediate KV states so the server can rewind without reprocessing the prefix. The default is 32 per slot. Each checkpoint on Qwen3.6-27B runs roughly 150 MiB, so 4 parallel slots × 32 checkpoints × 150 MiB gives a worst case of 19 GiB on top of the model. That is not headroom anyone is publishing about.

Slot prompt cache caches recent prompts (default 8 GiB limit) so reused prefixes skip reprocessing. Invisible in the "model plus KV" math, very visible at peak.

A 128 K q8-KV configuration typically reports 21 GiB at startup with 3 GiB free and runs fine for a dozen short turns, until a long context-heavy turn lands. The checkpoint cache spikes 4-5 GiB, the prompt cache takes another 2-3 GiB, and the server dies with cudaMalloc failing on the next 200 MiB allocation. The log shows happy request handling and then srv operator(): cleaning up before exit... followed by silence, with no OOM trace and no backtrace. The CUDA layer just quits.

Two flags neither tutorial mentions will fix this. -np 1 collapses the parallel slot pool to one (the pool is just an OOM multiplier on every per-slot cache when you are the only user), and -ctxcp 4 caps context checkpoints at 4 per slot, which drops that allocation from 4.8 GiB to 600 MiB. With both caps plus q4 KV at 128 K context, the configuration holds at 18.6 GiB used and 6 GiB free across long-context sessions. Without them, the same flags die on the first long-prompt turn.

If your llama-server "randomly" dies under load and the log tail shows cleaning up before exit with no error, you are probably hitting this. Watch GPU memory during a real workload rather than only at startup, since startup numbers underreport the peak by several GiB.

Why your tool calls are hallucinating: the --jinja flag

Add --jinja to the llama-server launch command. Without it, llama-server falls back to a C++ template path that silently drops the tools parameter before the model ever sees the request, so any request that depends on tool schemas behaves as if no tools were declared. I verified this with a direct curl. Same weights, same prompt, flag on versus off: with the flag off the model roleplayed the tool call as plain text, and with the flag on it returned a proper tool_calls array. One server flag, entirely different observable behavior.

To confirm the flag is doing what it should, grep the launch command for --jinja and check the log for chat template, thinking = 1. If the template line shows but the flag is absent, that is the bug.

A related Qwen3-specific gotcha. The /no_think sentinel as a system-prompt string is silently ignored by Qwen3.6-27B, and the working lever is chat_template_kwargs.enable_thinking=false in the request body. The intuitive next move ("turn off thinking on leaf subagents to save time-to-first-token") does not survive a controlled test. I ran 24 trials of parent-with-thinking vs subagent-without across two task types, both modes always hit max quality, and thinking-ON was consistently faster end-to-end. The intuition was noise. A separate post on speculative decoding covers that bench in detail. For this article, leave thinking on for both roles and add --jinja.

The numbers in one place

metric	value
GPU	RTX 3090 Ti, GA102, sm_86, 24,564 MiB
Model	Qwen3.6-27B Q4_K_M (unsloth GGUF)
Throughput at 8K context	42 tok/s eval
Throughput at 262K context	39 tok/s eval, 86 tok/s prompt
VRAM at 262K context	21.3 GiB used, 2.7 GiB headroom
Q4→Q8 KV penalty at 96K	23% throughput hit (39 → 30 tok/s)
Power under load	445 W, 86% GPU util, 57°C
Idle, no model loaded	32°C, 10 W
Idle, model resident at P8	50°C, 26 W
Cold boot to first POST	7 BIOS power cycles
Build time, llama.cpp from source	~15 min on 8 threads

FAQ

Will a Dell Precision T5820 accept an RTX 3090 Ti?

Yes. Dell qualifies the 3090 for slots 2 and 4 (the x16 CPU slots), but slot 1 works just as well on the x8 lanes. PCIe Gen3 x8 does not bottleneck single-GPU inference, so the slot choice comes down to clearance rather than throughput. The card is a 3.5-slot girth, so slot 1 has the most physical clearance. Run the 3090 Ti off a separate dedicated PSU rather than the Dell 950 W stock supply, because the Dell rails are not designed for the 450 W transient spikes the card pulls on the 12 V line.

Why does my Dell Precision boot-loop with a new GPU?

The BIOS is retraining the PCIe link with a card that is outside its qualification database. On BIOS 2.41 with a 3090 Ti in a fresh install, five to seven power cycles is normal. The official advice (10-pin CPU2 adapter, Above 4G Decoding, slot 2/4, deep power reset) does not change the outcome, so the correct response is to let the system cycle until it POSTs.

At what point should I give up and assume the boot loop is a real fault?

Ten cycles without POST is my own abort threshold. Beyond that, check 12VHPWR seating first (audible latch click), then PSU rail integrity (continuity test on each 8-pin input from the wall to the adapter), then try a different slot. If all three pass and it still loops on a fresh install, you may have a genuine PCIe fault or a card with degraded power delivery, which is a return-to-vendor situation.

Is the T5820 950 W PSU enough for a 3090 Ti?

Technically yes, but practically you want a separate PSU for the GPU. The Dell stock supply has the cable connectors, but the 12 V rail was not designed for the 450 W transient spikes the 3090 Ti pulls under load. A dedicated 1 kW supply with PS_ON jumped to ground costs about $80 and removes the entire failure class. (PS_ON is the green wire on a 24-pin ATX connector. Tied to a black ground, it tells the PSU to stay on without a motherboard.)

What is sm_86 in the CUDA build command?

sm_86 is the compute capability identifier for Nvidia's Ampere generation, which covers the RTX 3090, 3090 Ti, A40, A100, and a few others. The -DCMAKE_CUDA_ARCHITECTURES=86 flag tells nvcc to generate kernels for that target only, which keeps build time down and avoids fat-binary bloat. 4090 owners use 89, H100 owners use 90.

What does -ngl 99 do in llama-server?

-ngl is the number of model layers to offload to the GPU. Setting it to 99 means "all of them," since no current model has more than 99 layers, so the entire model lives in VRAM. Lower numbers split the model between CPU RAM and VRAM, which costs throughput badly. On a 24 GB card with a 27B Q4 model, 99 fits comfortably and there is no reason to do anything else.

Where is the Above 4G Decoding toggle on a Dell Precision T5820?

Not in BIOS 2.41 System Setup as a user-facing toggle. On this revision the firmware appears to handle MMIO above 4 GB automatically. Older Dell consumer BIOSes and other workstation lines expose it, which is what the forum screenshots are showing. If your firmware does expose it, leave it on, since the qualification does not break anything either way.

llama.cpp vs Ollama on a 3090 Ti, which should I run?

llama.cpp from source is the right call for single-user latency and tuning headroom. Ollama works fine for a "just works" start, but it ships pre-built binaries that lag on features, wraps llama.cpp anyway, and hides flags like --jinja, -ctk, and -ctxcp that materially change throughput and VRAM behavior on a 24 GB card. Build llama.cpp yourself and you get the same backend and every knob.

What's left on this box

nvidia-smi -pl 350 power-limit to drop heat with a marginal throughput cost. The card is still running at the 450 W default.
vLLM comparison on the same model. llama.cpp wins on single-user latency. vLLM should win on batched throughput, so it is worth measuring.
RAM upgrade in transit. 16 GB is anemic for a Skylake-W board, so 4×32 GB RDIMMs are ordered to take the C422 chipset to its quad-channel ceiling.

A note for anyone copying this verbatim: my production unit since this writeup has swapped the vanilla build/bin/llama-server for the MTP branch (build-mtp/bin/) with --spec-type mtp for speculative decoding, and the bind moved from 127.0.0.1 to 0.0.0.0 so a separate agent host on the Tailscale mesh can reach it. The recipe above is still the right starting point, and the companion post on DFlash vs MTP benchmarks covers the swap in full detail.

For the "so what do I actually run on this box" question, see the Inference Arbitrage write-up, which covers how I route calls across this box, Mac Studio, and cloud frontier models based on task type and cost.

Companion post: for the speculative decoding benchmarks on this build (DFlash vs MTP, decode rates across output lengths, lossless probe results), see Three Months of Speed-Up Experiments on a 3090 Ti.

Sources

Dell Community: Precision 5820 with RTX 3090 boot loop: unresolved
Dell Community: Anyone running a 3090 Ti in a 5820: speculative, no working fix
Dell Community: 5820 boot loop when Tesla P100 installed: same pattern, different GPU
Dell Precision 5820 Owner's Manual, PCIe slots: slot lane assignments
keturk/llm_on_rtx_3090: closest competitor (Ubuntu + Docker + Ollama, does not address the boot loop)
llama.cpp: primary upstream

Anti-detect browser benchmark 2026: 7 stealth tools, 31 Cloudflare targets, 651 verdicts

Ian L. Paterson — Mon, 18 May 2026 20:00:02 +0000

I built a scraper. Cloudflare killed it in 48 hours.

I built a web scraper for Canadian small-cap stock data and Cloudflare blocked it within 48 hours. After testing seven popular stealth-browser libraries against that gate, three times each from a real residential network, only one of the seven got through.

The one that worked drives Chrome directly, without going through Playwright (the standard Python library for browser automation). Every other browser in the test fails the same gate, regardless of whether it uses Playwright, a patched Chromium fork, a Firefox fork, or raw HTTP.

Meet the contenders

nodriver

nodriver drives system Chrome over a direct WebSocket connection to the browser's DevTools port. There is no Playwright shim in the control plane, no Runtime.enable call sequence at startup, and no middleware layer between the Python code and the browser process. It is the successor to undetected-chromedriver, from the same author (ultrafunkamsterdam). The key feature in plain terms: removing Playwright from the loop means the browser's automation footprint looks different to a detection gate, because the CDP handshake sequence no longer has Playwright's fingerprint on it.

GitHub: github.com/ultrafunkamsterdam/nodriver
License: AGPL-3.0 (note: AGPL requires services using nodriver to open-source modifications. Different from undetected-chromedriver's MIT license.)
Why I tested it: it scored 28 OK / 0 blocked, the only browser with zero blocked cells across all 31 targets.

CloakBrowser

CloakBrowser is a patched Chromium fork with, per CloakHQ's documentation, 49 source-level C++ modifications targeting automation signals. It ships its own bundled Chromium build, so the browser binary itself has been modified before launch, not patched via Python hooks at runtime. It has a drop-in Playwright-compatible API.

GitHub: github.com/CloakHQ/CloakBrowser (13.5k stars, as of 2026-05-17)
License: MIT (Python wrapper); custom CloakBrowser binary license
Note: the macOS darwin-arm64 build is pinned to Chromium 145 as of 2026-03-04. CloakHQ has shipped 14 Linux/Windows releases since then with no macOS update.
In this bench because: it is the most-starred free anti-detect browser and makes the strongest marketing claims. Also because @hasantoxr's 14/14 claim needed production verification.

curl_cffi

curl_cffi is a Python HTTP library that wraps curl-impersonate, which replaces the entire HTTPS client stack with one shaped like a real Chrome installation. It has no JavaScript engine, it is an HTTP-only tool, but the TLS handshake it sends looks indistinguishable from Chrome to a fingerprinting gate. Version 0.15.0 with impersonate="chrome" (the launch flag that selects which browser shape to imitate) defaults to Chrome 145/146 shape.

GitHub: github.com/lexiforest/curl_cffi
License: MIT
In this bench because: it is the correct "raw HTTP floor" baseline, and the interesting question is how many production targets require JavaScript at all.

Patchright

Patchright is a Playwright fork that patches the CDP-leak signals Playwright exposes during browser startup, specifically the Runtime.enable and Target.setAutoAttach call sequences that anti-bot systems can detect in the protocol handshake. It supports channel=chrome, which tells it to drive the system's installed Google Chrome binary instead of a bundled Chromium, giving it a real Chrome 148 TLS fingerprint and version stamp.

GitHub: github.com/Kaliiiiiiiiii-Vinyzu/patchright (3.2k stars, as of 2026-05-17)
License: Apache-2.0
In this bench because: it is the most actively maintained patched Playwright fork and the most technically credible alternative to vanilla Playwright.

Camoufox

Camoufox is a Firefox fork modified at the C level to spoof fingerprinting APIs (canvas, WebGL, screen geometry, navigator properties) using randomized but internally consistent values. Because it is Firefox-derived, its TLS handshake has a Firefox shape (a different cipher suite order than Chrome), which is detectable but also whitelisted on many targets that block Chrome-shaped automation.

GitHub: github.com/daijro/camoufox (8.4k stars, as of 2026-05-17)
License: MPL-2.0
In this bench because: Firefox-derived TLS is a genuinely different attack surface from the Chromium-based tools, and the question of whether that difference helps or hurts on production targets is worth measuring.

vanilla Playwright

Vanilla Playwright is the baseline. Chromium 147, no stealth patches, Microsoft's official automation library. It does not attempt to hide that it is an automation tool. Its TLS handshake is a Chromium handshake, its CDP startup sequence is stock, and its navigator properties advertise webdriver: true unless patched.

GitHub: github.com/microsoft/playwright
License: Apache-2.0
Included as: the baseline every other browser has to beat.

rebrowser-playwright

rebrowser-playwright is a Playwright fork that applies CDP-leak patches similar to Patchright's approach, but with a different patch strategy and different bundled Chromium version (136, which is twelve versions behind Patchright's Chrome 148). The repository's last code commit was September 2024 (the GitHub "pushed_at" of 2025-05-09 reflects a metadata-only touch), so it is effectively unmaintained.

GitHub: github.com/rebrowser/rebrowser-playwright
License: unspecified (no LICENSE file in the repo)
Reason for inclusion: it is the second major patched Playwright fork. The head-to-head with Patchright and vanilla was the direct question the bench was built to answer.

What I did

Seven browsers, 31 targets across four categories (JS-layer detection panels, TLS fingerprint endpoints, live Cloudflare and other anti-bot production sites, and high-traffic content sites that fingerprint quietly), three independent sweeps from one residential Mac Studio IP across one night. Headed mode, all free or already on hand. Full source and raw records at github.com/ianlpaterson/anti-detect-browser-bench.

Which browser wins the bench?

nodriver wins outright with zero blocked targets. Patchright, CloakBrowser, and Camoufox cluster in the middle. Vanilla and rebrowser tie at the bottom. Raw curl_cffi ties CloakBrowser at 26 OK: a 21-line wrapper performs identically to a Chromium fork with 49 source-level C++ patches.

The signal driving most disagreement is automation-protocol fingerprinting (anti-bot gates checking HOW the browser is being driven, not what the browser claims to be). nodriver wins because it drives Chrome over CDP directly, with no Playwright shim. Playwright leaves protocol-level traces that fingerprint-patch tools do not address.

Per-browser totals (identical across N=3)

651 records (217 cells × 3 runs), zero verdict drift across five hours from one residential IP.

Browser	OK	Gated	Blocked	Engine
nodriver	28	3	0	Google Chrome 148.0.7778.168 (system browser)
cloak	26	3	2	Chromium 145.0.7632.109
curl_baseline	26	3	2	curl_cffi 0.15.0 (impersonate=chrome)
patchright	25	3	3	Chrome 148.0.7778.168 (channel=chrome)
camofox	25	3	3	Firefox 135.0.1-beta.24 (camoufox)
vanilla	24	2	5	Chromium 147.0.7727.15
rebrowser	24	2	5	Chromium 136.0.7103.25 (rebrowser bundle v1169)

Per-target verdict matrix

Twenty-five of thirty-one targets agree across all seven browsers. The six that disagree are where the signal lives.

Target	vanilla	patch	cloak	camo	rebro	nodrv	curl
amazon-product	ok	ok	ok	ok	ok	ok	ok
booking-search	ok	ok	ok	ok	ok	ok	ok
bot-incolumitas	gated	gated	gated	gated	gated	gated	gated
browserleaks	ok	ok	ok	ok	ok	ok	ok
browserleaks-tls	ok	ok	ok	ok	ok	ok	ok
browserscan-bot	ok	ok	ok	ok	ok	ok	ok
canadianinsider	BLK	BLK	BLK	BLK	BLK	ok	BLK
ceo-ca	ok	ok	ok	ok	ok	ok	ok
creepjs	ok	ok	ok	ok	ok	ok	ok
crunchbase-cf	ok	ok	ok	ok	ok	ok	ok
devto	ok	ok	ok	BLK	ok	ok	ok
github-explore	ok	ok	ok	ok	ok	ok	ok
glassdoor	BLK	BLK	BLK	BLK	BLK	gated	BLK
google-search	BLK	BLK	ok	ok	BLK	ok	ok
indeed-jobs	ok	ok	ok	ok	ok	ok	ok
instagram-post	ok	ok	ok	ok	ok	ok	ok
linkedin-jobs	ok	ok	ok	ok	ok	ok	ok
medium	BLK	gated	gated	gated	BLK	ok	gated
newsfilecorp	ok	ok	ok	ok	ok	ok	ok
nowsecure-cf	ok	ok	ok	ok	ok	ok	ok
pixelscan-bot	ok	ok	ok	ok	ok	ok	ok
pixelscan-fp	ok	ok	ok	ok	ok	ok	ok
rebrowser-detector	ok	ok	ok	ok	ok	ok	ok
reddit	ok	ok	ok	ok	ok	ok	ok
sannysoft	ok	ok	ok	ok	ok	ok	ok
sedarplus	gated	gated	gated	gated	gated	gated	gated
stackoverflow	BLK	ok	ok	ok	BLK	ok	ok
stockwatch	ok	ok	ok	ok	ok	ok	ok
tiktok-user	ok	ok	ok	ok	ok	ok	ok
tls-peet	ok	ok	ok	ok	ok	ok	ok
x-explore	ok	ok	ok	ok	ok	ok	ok

Which six targets disagree across browsers?

canadianinsider: nodriver passes, every other browser hard-blocked. Six Chromium and Firefox stealth approaches fail the same Cloudflare-Turnstile-protected page, while Chrome 148 driven over plain CDP with no Playwright passes. I retried canadianinsider across all three sweeps to confirm nodriver wasn't catching a brief Cloudflare window. The pass reproduced every time.

medium: nodriver alone passes OK. Patchright/Cloak/Camoufox/curl hit a Cloudflare interstitial, vanilla and rebrowser hard-blocked. Same axis as canadianinsider, softer landing.

glassdoor: nodriver gets a soft DataDome challenge, the other six hit a 403.

google-search: Cloak, Camoufox, nodriver, and curl_cffi pass. Patchright, vanilla, and rebrowser hard-block. Patchright joining the unpatched-browsers club is notable: the patches don't address whatever this gate keys on.

dev.to: six browsers pass, Camoufox alone blocked, consistent with a Firefox TLS quirk the CDN flags.

stackoverflow: vanilla and rebrowser hard-blocked, every other browser passes. The clearest single-finding evidence that rebrowser's CDP-leak patches do not change outcomes. rebrowser fails exactly where vanilla fails, on this one cell and across the rest of the matrix.

rebrowser-playwright vs vanilla Playwright: identical block sets

rebrowser-playwright and vanilla Playwright both score 24 OK, 2 gated, 5 blocked. Block sets identical: canadianinsider, medium, google-search, stackoverflow, glassdoor. On these 31 targets, rebrowser is functionally vanilla.

stackoverflow is the clearest evidence: vanilla and rebrowser both blocked while every other browser passes. One caveat: rebrowser ships Chromium 136, eleven versions behind vanilla's Chromium 147 and twelve behind Patchright's Chrome 148. Some of the identical-to-vanilla performance is plausibly a version effect.

Patchright vs vanilla Playwright: +1 OK, same google-search blind spot

Patchright is +1 OK and -2 blocked versus vanilla (25 vs 24 OK, 3 vs 5 blocked). The gain comes from stackoverflow. Patchright does not recover canadianinsider or google-search.

The channel=chrome flag matters as much as the patches: running system Chrome 148 delivers fingerprint protection at a layer no patch can replicate. I expected the patches to be the bigger lever. The matrix showed channel=chrome is.

Camoufox vs Patchright: tied at 25 OK, different failure modes

Camoufox runs Firefox 135.0.1-beta.24. Blocks: canadianinsider (every browser except nodriver fails), sedarplus (every browser fails, F5 BIG-IP ASM), and dev.to (only Camoufox fails, Firefox TLS quirk).

In the camoufox vs playwright comparison, Camoufox passes google-search where three Chromium-based browsers (including vanilla Playwright) block, and passes medium's gate where vanilla and rebrowser are hard-blocked. The Firefox TLS shape loses on one cell (dev.to) and wins on one (google-search). When I started this bench I assumed Firefox would lose harder than it did on Chromium-shaped gates. It didn't. The matrix driver is automation-protocol fingerprinting, not cipher lists.

curl_cffi vs CloakBrowser: identical scoreboards on 31 targets

26 OK, 2 blocked, 3 gated each. They share the same failures (canadianinsider, glassdoor) and the same gates (medium, sedarplus, bot-incolumitas), differing only in which specific cells fall where.

CloakBrowser is a 130MB Chromium fork with 49 C++ patches. curl_cffi is a 6.4MB Python wheel with a Chrome-shaped HTTPS stack. Same matrix result. I triple-checked the curl_baseline column because it reads like a typo. It isn't. If a 21-line wrapper ties a 130MB patched fork, that fork is paying for something the matrix doesn't measure.

Why is CloakBrowser's macOS build stuck on Chromium 145?

The CloakBrowser darwin-arm64 build at version 0.3.28 ships Chromium 145.0.7632.109. CloakHQ's GitHub releases page shows fourteen Linux and Windows releases since then, the most recent at 146.0.7680.177.4 on 2026-04-28. The macOS pipeline has been dead for two months.

The bench measures what's actually shipping. The old version is the only version macOS users can install. rebrowser's bundled Chromium 136 is the same story: twelve versions behind stable.

nodriver vs Playwright: why automation-protocol fingerprinting beats patches

nodriver passes canadianinsider while every patched Chromium fails it. nodriver connects to system Chrome's DevTools port over a plain WebSocket, without Playwright's accessibility layer, without the Runtime.enable and Target.setAutoAttach sequence Playwright issues at startup.

Why fingerprint patches don't reach the protocol layer

The gate checks the protocol handshake the browser exposes on the way to rendering. Static fingerprints (TLS handshake, JA4 hash, navigator properties, canvas readback) are the wrong surface. CDP through Playwright leaves a recognizable shape. A fingerprint-patch tool can rewrite navigator properties all day without touching this layer.

canadianinsider, medium, and glassdoor confirm this across three vendors. Cloudflare on canadianinsider doesn't look at the JS-runtime layer because the automation-protocol layer already gave the browser away. nodriver still gets gated on sedarplus, bot-incolumitas, and glassdoor (soft, not hard-blocked).

Why a residential proxy doesn't help: shape coherence

The gate cross-checks layers for consistency. The Mac Studio is shape-coherent: residential British Columbia IP, macOS Chrome TLS handshake, macOS Chrome JavaScript fingerprints, macOS Chrome HTTP/2 SETTINGS frame ordering (each HTTP/2 client advertises tuning parameters in a predictable order that fingerprints the client library), all from the same host, with no mismatched signal for a gate to flag. A proxy only rewrites the source IP, and everything above TCP still leaks the real client.

Anti-bot signal	Where it actually comes from	Does a proxy fix it?
IP address (residential vs datacenter ASN)	TCP source	Yes
TLS / JA4 handshake fingerprint	The HTTPS client	No
HTTP/2 SETTINGS frame ordering	The HTTPS client	No
navigator.platform, navigator.userAgent	The browser process	No
Canvas / WebGL / audio fingerprints	Browser process, host GPU, host fonts	No
screen.width, screen.height, devicePixelRatio	The browser process	No
navigator.connection rtt + effectiveType	Network conditions	No (round-trip time gets worse through a proxy)

A Linux server behind a residential proxy manufactures a fresh contradiction the gate uses against it.

For HTTP-only scraping, the TLS layer alone is recoverable with curl_cffi. For JavaScript-rendered scraping, the browser process has to live on the host that owns the residential IP. There is no proxy shortcut that delivers macOS-shape JavaScript fingerprints from a Linux server. I tried a residential proxy in front of a Linux VPS during an earlier sweep, and every Cloudflare-protected target blocked it harder than the unproxied Mac Studio.

How the bench classifies each cell

Methodology lives in the bench repo at github.com/ianlpaterson/anti-detect-browser-bench. Highlights: a four-way verdict classifier (ok/gated/blocked/error) checking title regexes, body vendor signatures, and short-body shim pages. Each cell gets up to 3 attempts with early-exit on 2 consecutive matching non-error verdicts. The seven browsers run in randomized order each sweep to prevent reputation accumulation. Phase 6 wall clock was 5h11m across N=3 from 22:39:39 PT through 03:51:17 PT. Peak RSS ranged 57.9MB (curl_baseline) to 13306MB (Patchright). Eighty unit and regression tests cover the classifier, response-frame logic, retry logic, and the curl_baseline parser. The README has the full details.

What this bench can't tell you

The matrix is one residential IP, one operating system, thirty-one targets, one night. Patterns that survived three independent runs are stable inside that frame. Outside it, several axes can flip the result.

Rotating proxies change the ranking

The techinz/browsers-benchmark repository tests with rotating residential proxies and up to 3 retries per cell. Their headline: Camoufox bypasses 100%, CloakBrowser 83.3% headed and 50.0% headless. The order is reversed from mine. Rotating proxies prevent IP reputation from accumulating against Camoufox's identifiable Firefox fingerprint. My single-IP setup is the worst case for Camoufox in that one respect.

Targets shift mid-test

sedarplus.ca swapped anti-bot vendors mid-session during Phase 5: F5 BIG-IP ASM early, Radware Shieldsquare later, gated on every browser by Phase 6. A matrix is a snapshot, not a permanent fact. Cloudflare Turnstile is also site-state-dependent (the v2 stub reported all browsers failed nowsecure-cf, Phase 6 shows all pass).

Version effects masquerade as patch quality

Browser versions span Chrome 136 (rebrowser) to Chrome 148 (Patchright via channel=chrome and nodriver via system Chrome), with Firefox 135 (Camoufox) in the mix. Some apparent stealth is "this build happens to match what users have installed."

The durable finding

Automation-protocol fingerprinting is a separate problem from TLS fingerprinting and JS-layer detection. The stealth-browser ecosystem is mostly solving the first two. Defeating the third requires a control plane that is not Playwright.

Which anti-detect browser should you use?

The most important decision happens before browser choice: identify which layer your target gates on. JS-fingerprint targets are where current Chromium passes unpatched. TLS-fingerprint targets reward Camoufox's Firefox shape and curl_cffi's impersonate=chrome about equally. Automation-protocol-fingerprinting targets are the cliff: Playwright forks fail regardless of patch quality.

If you're scraping...	Use	Why
Cloudflare-gated production targets where canadianinsider matters	nodriver	The only browser on the matrix with zero blocked cells
A drop-in Playwright replacement and your stack is locked in	Patchright with channel=chrome	Real Chrome 148 + the smallest patch-quality risk
Sites that whitelist Firefox or fingerprint Chrome shape	Camoufox	Firefox 135 stealth, beats Chromium forks on google-search
HTML you can parse without JavaScript execution	curl_cffi	26 of 31 targets in a 21-line wrapper, no browser process
A turnkey patched Chromium and you'll write the cookie layer	CloakBrowser	Matches curl_baseline on this matrix, real product, real build chain

nodriver is the only browser through canadianinsider. Tradeoff: its asyncio object model requires an adapter layer in any Playwright codebase. AGPL-3.0 license has commercial implications worth reviewing with counsel.

Patchright with channel=chrome beats vanilla by one OK and unblocks stackoverflow by running system Chrome 148 instead of bundled Chromium.

Camoufox beats Chromium forks on google-search and medium's gate but loses dev.to on a Firefox TLS quirk.

curl_cffi is pip install curl_cffi + requests.get(url, impersonate="chrome"). The TLS handshake defaults to Chrome 145/146 shape.

CloakBrowser is a real product with cookie handling and Turnstile auto-resolve worth paying for. Raw stealth headroom over curl_cffi is not supported by the data.

rebrowser-playwright posts the same OK set as vanilla and has had no real code commits since September 2024. Skip.

In production, I run nodriver for canadianinsider scraping, Patchright with channel=chrome everywhere else, and curl_cffi for the targets that don't need JavaScript at all.

FAQ

What is automation-protocol fingerprinting and how does it differ from JA4 or JS-layer detection?

JA4 hashes the TLS handshake. JS-layer detection inspects navigator properties after the page loads. Automation-protocol fingerprinting sits between them, detecting the protocol shape used to drive the browser. Fingerprint patches do not reach this layer.

Why does nodriver pass canadianinsider when every patched Chromium fails?

nodriver drives system Chrome over a plain CDP connection with no Playwright in the loop, which removes the protocol-handshake shape the gate keys on at startup.

Can I just use a residential proxy with my existing Playwright on a VPS?

No. A proxy rewrites only the IP layer. TLS handshake, HTTP/2 frames, navigator properties, and canvas fingerprints all originate from the actual host. A Linux server behind a residential proxy still advertises a Linux-shape browser, which is the contradiction gates flag.

Is CloakBrowser actually no better than curl?

On 31 targets, OK counts are identical (26 each). CloakBrowser has features the matrix does not test: cookie handling, Turnstile auto-resolve, human-cursor modeling. For HTML parsing, curl_cffi is sufficient. For persistent browser sessions, Cloak is the better tool.

Did Cloudflare Turnstile pass for every browser? The v2 stub said no browser passed.

Yes, every browser passed nowsecure-cf in Phase 6. The v2 stub measured the same target two weeks earlier and got the opposite result. Turnstile is site-state-dependent. Treat any single-session result as transient.

What about rebrowser passing JS-layer detection panels its README highlights?

rebrowser passes browserscan-bot, pixelscan-bot, and rebrowser-detector. So does every other browser including vanilla. Those panels are saturated by current Chromium. The disagreement happens on production targets, where rebrowser fails identically to vanilla.

Should I run rotating residential proxies?

If your target list overlaps with mine, yes. Camoufox's Firefox TLS fingerprint is identifiable on repeated single-IP hits, and rotation solves that. Proxies are a larger lever than browser choice for sustained workloads.

Why didn't you include real Mozilla Firefox?

Manual Firefox 150 on canadianinsider passed Cloudflare Turnstile from a fresh private window. Selenium-driving the same Firefox with dom.webdriver.enabled=false got blocked. The gate keys on detectable automation, not Firefox itself.

What is the difference between Patchright and vanilla Playwright?

Patchright patches CDP-leak signals at startup and supports channel=chrome to drive system Chrome 148. On Phase 6 it scores +1 OK over vanilla (25 vs 24) and unblocks stackoverflow. The version advantage does as much work as the patches.

Does curl_cffi bypass Cloudflare?

Sometimes. On Phase 6 it passed 26 of 31 targets by replacing the HTTPS client stack with one shaped like current Chrome. It fails on canadianinsider, glassdoor, and any target requiring JavaScript execution.

What is JA4 fingerprinting?

JA4 hashes the TLS handshake: cipher suites, extensions, ALPN, and their order. Anti-bot services match that hash against known browser profiles. A headless-Chrome JA4 differs measurably from a real-Chrome JA4.

How does Cloudflare detect headless browsers?

Cloudflare layers signals: TLS handshake (JA3/JA4), HTTP/2 SETTINGS frame ordering, JavaScript runtime properties, behavioral patterns, and automation-protocol shape. The Phase 6 matrix shows automation-protocol shape is the layer most patched browsers ignore.

Is nodriver safe to use in production?

Functionally yes. nodriver is actively maintained and scored 28 of 31 with zero blocked cells. Caveats: its asyncio object model means no Playwright drop-in, cookies carry across targets by default, and AGPL-3.0 has commercial-license implications worth checking with counsel.

Does Patchright work with Cloudflare Turnstile?

Patchright passes some Cloudflare-gated targets but not canadianinsider or google-search in Phase 6 testing. It scores +1 OK over vanilla Playwright by recovering stackoverflow, but the CDP-leak patches do not address the automation-protocol layer that Cloudflare Turnstile keys on for its hardest challenges.

How do I install curl_cffi?

Run pip install curl_cffi. Then from curl_cffi import requests and requests.get(url, impersonate="chrome"). The impersonate="chrome" flag selects Chrome 145/146 TLS shape (curl_cffi 0.15.0). No browser binary required. The full library is 6.4MB.

Is nodriver a drop-in replacement for Playwright?

No. nodriver uses an asyncio object model incompatible with Playwright's sync API and page/context/browser hierarchy. Migrating requires rewriting control flow, not just swapping the import. The payoff is zero Playwright in the protocol stack, which is what gets through canadianinsider and similar Cloudflare-Turnstile-protected targets.

Tools mentioned

nodriver (github) - Direct-CDP successor to undetected-chromedriver. No Playwright shim in the control plane. License: AGPL-3.0. Pricing: free.
CloakBrowser (github) - Patched Chromium fork with 49 source-level C++ fingerprint modifications. Drop-in Playwright API. License: MIT (wrapper), custom binary license. Pricing: free.
curl_cffi (docs | github) - Python HTTP client wrapping curl-impersonate. Spoofs TLS/JA4 fingerprints without a JS engine. License: MIT. Pricing: free.
Patchright (github) - Playwright fork that patches CDP-leak signals at startup. Supports channel=chrome for real Chrome TLS. License: Apache-2.0. Pricing: free.
Camoufox (site | github) - Firefox fork with C-level canvas/WebGL/navigator spoofing. Firefox TLS shape. License: MPL-2.0. Pricing: free.
Playwright (site | github) - Microsoft's reference browser automation library. The unpatched baseline. License: Apache-2.0. Pricing: free.
rebrowser-playwright (github) - CDP-patch Playwright fork. Last code commit September 2024, Chromium 136. License: unspecified. Pricing: free.

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

Ian L. Paterson — Mon, 18 May 2026 19:59:51 +0000

The setup

The starting line was 43 tokens per second decode on vanilla llama.cpp. The finishing line, three months later, is 39 to 49 tokens per second decode that doesn't collapse at long context, using a completely different speculative decoding technique than the one Claude and Ian started with. This box runs Qwen3.6-27B Q4_K_M on a single RTX 3090 Ti (the T5820 build notes are here), serving an agent stack (Hermes/k2) over the OpenAI-compatible llama.cpp HTTP API.

The full audit trail is below: autoregressive baseline, DFlash via the BeeLlama fork, then MTP via vanilla llama.cpp once the workload reality caught up to the bench. Every knob got measured, most got rejected, and the production state at the end is simpler than what it replaced.

Terms used: Autoregressive = baseline generation, one token at a time, no speculation. Drafter = small model that proposes tokens for the target to verify. KV cache = stored key/value pairs from previous tokens so attention doesn't recompute every step. Prefill = the model reading the prompt before generating. Decode = generating tokens after prefill. TTFT = time to first token. MTP = multi-token prediction (extra head layers on the target that predict several tokens in parallel).

What's running in prod right now

End state as of 2026-05-15:

Binary: vanilla llama.cpp, build-mtp/bin/llama-server, built from the MTP PR branch (commit ebe4fca, PR #22673). PR #22673 merged to master on 2026-05-16, so any master checkout after that date ships --spec-type mtp natively.
Model: Qwen3.6-27B-Q4_K_M-mtp.gguf from unsloth/Qwen3.6-27B-MTP-GGUF (MTP heads baked into the weights, no separate drafter file).
Flags:

--spec-type mtp \
--spec-draft-n-max 6 \
--spec-draft-p-min 0.75 \
--reasoning-budget 256 \
-c 131072 -fa on -ctk q4_0 -ctv q4_0 --jinja \
--alias qwen3.6-27b-q4_k_m

VRAM: ~20.4 to 20.9 GiB depending on measurement state (idle vs under light load on a 24 GiB card).
Decode rate: 39 to 49 tok/s across tested output lengths (100, 500, 1000, 2000 tokens), low end at out=500, high end at out=2000. Plain autoregressive's flat ~29 tok/s plus its faster prefill still beats MTP on wall clock below output ~900 tokens.
Wall time at output 2000: 91 seconds, vs 112 seconds on DFlash and 107 on plain autoregressive.
Reversibility: previous-state backups under ~/.config/systemd/user/llama-server.service.pre-*. Each swap in the table below is one cp away from a revert.

Hermes hits the same alias it always did. Zero client changes through five binary swaps.

TL;DR

MTP wins on wall clock above output ~900 tokens. Below that, plain autoregressive is faster.
The DFlash Decode Collapse. DFlash decode drops from 46.9 to 30.1 tok/s as output grows from 100 to 2000 tokens. MTP holds 39 to 49 tok/s flat across the same range.
Speculative decoding is NOT bit-for-bit lossless at temp=0 on free-form prose. Tool-call schema integrity is preserved identically. True for both DFlash and MTP. Backed empirically (the lossless probe, N=3 per workload) and academically (arxiv:2605.09992 on drafter attention drift).
--spec-draft-p-min 0.75 is the vanilla llama.cpp flag that changed the MTP verdict from "buried at Phase 2" to "shipped at Phase 9." Filter lives in PR #22397 (April 28).
--reasoning-budget 256 saves ~10 seconds per request on Qwen3.6 with no quality regression.
Single-prompt vendor benchmarks overstate DFlash gains by 30 to 60%. Distribution evidence (N=10+) is non-optional. N=1 misleads even at the right prompt size.
Cache reuse beats every decode optimization when it hits. ~60x prefill speedup on realistic varying traffic with a stable prefix.
PR #22673 (MTP) merged 2026-05-16. Builds from master after that date have --spec-type mtp natively.

What got tested

#	Knob	Result	Status	Notes
1	Bigbatch (`-ub 256 -b 2048`)	+19% decode, +10% prompt, +186 MiB VRAM	✅ kept	Free win on prompt-eval kernels. Carried into every later config.
2	DFlash (BeeLlama fork, default `n_max=16`)	43 → 148 tok/s on linked-list code	✅ shipped 2026-05-12	First big swap. Same OpenAI/jinja API surface, same flags Hermes needed.
3	TurboQuant KV (`turbo4` K + `turbo3_tcq` V)	Decode parity (178 vs 176), prompt -37% on sm_86 (Ampere, RTX 3090/3090 Ti)	❌ rejected	Cross-stack corroboration: Red Hat AI researchers (Kurtić, Goin, Marques) reach the same verdict on H100 in the TurboQuant writeup on the vLLM blog. Hopper FP8 path not available on Ampere.
4	DDTree branch verify (`--spec-branch-budget 22`)	-49% to -58% decode across workloads	❌ rejected	Anbeeld (BeeLlama maintainer) flags DDTree as "very much work in progress" in the README. Confirmed.
5	`enable_thinking:false` server-wide	+30% peak decode	❌ rejected	Hermes/k2 hallucinated within minutes. Qwen3.6 reasoning is load-bearing. Reverted within ~15 min.
6	`--spec-draft-n-max` sweep (4 to 16)	12 wins by a hair; surface flat ±5% across 10-14	✅ kept 12	Initial sharp-peak finding flattened on the N=10 sweep. Same prod number, more nuance.
7	Q8_0 drafter (1.77 GB)	Tied Q4_K_M at N=10	❌ rejected	N=3 looked like a +10% surprise; N=10 collapsed it to noise.
8	Q5_K_S target (18 GB) + Q4_K_M drafter	+5% code, -10% chat	❌ rejected	"Precision combo" wins on pure code. Hermes traffic is reasoning + chat heavy.
9	q8_0 KV cache (instead of q4_0)	-25% throughput; VRAM +1.8 GB	❌ rejected	Re-confirmed pre-DFlash lesson under DFlash. Workload-consistent penalty.
10	CopySpec (suffix matching, no drafter)	> 300x slowdown: 150-token prompt timed out at 600s	❌ rejected	The drafter is load-bearing: on workloads without repetitive structure, CopySpec timed out in 600 seconds. The drafter is the entire performance story for speculative decoding on such workloads.
11	MTP, first pass (`n_max=3`, no `p_min`)	1.4x autoregressive; ~2x slower than DFlash on long workloads	❌ buried (Phase 2)	Tested at short context with original flags. Same prose drift profile as DFlash. Looked dead. Wasn't.
12	MTP, second pass (`n_max=6 --spec-draft-p-min 0.75`)	1.8x autoregressive; decode doesn't collapse at long context	✅ shipped 2026-05-15	The `--spec-draft-p-min` filter (vanilla llama.cpp, PR #22397, April 28) changed the verdict. Decode holds 39 to 49 tok/s across every output length while DFlash drops 47 → 30 tok/s.
13	`--reasoning-budget 256`	Saves ~10 seconds per request, no quality regression	✅ shipped 2026-05-15	Caps runaway reasoning chains at 256 tokens. Highest-impact, lowest-risk flag in the sweep.

End-state decode rates:

output_len	Autoregressive tok/s	DFlash tok/s	MTP (prod) tok/s	Wall-clock winner
100	28.9	46.9	44.6	Autoregressive (prefill dominates)
500	29.1	37.0	39.1	Autoregressive by 4-7s
1000	29.1	30.2	44.4	MTP/autoregressive tied within 86ms
2000	29.0	30.1	48.9	MTP by 16-21s

What does "speed up a local LLM" actually mean

Three numbers that decouple at production scale:

Decode rate (tok/s): how fast tokens come out once generation starts.
TTFT (time-to-first-token): how long until the first visible character appears.
Wall clock (TTFT + decode × length): what users actually feel.

Most speculative decoding marketing optimizes the first number. Production users feel the third. At 43K of context (Hermes-shape traffic, N=10 sampling), prefill is roughly 80% of wall clock, reasoning is ~12%, and decode is the remaining ~8%. A 2x decode improvement doesn't double the wall clock. It nudges the smallest of three timescales.

Three months of decode tuning got production from 43 to 124 tok/s on a synthetic linked-list benchmark, then a single Hermes-shape bench at 43K context showed the gains evaporating at the real workload. The fix was changing the speculative decoding technique to one that doesn't collapse at long context.

Plain autoregressive: the baseline

Plain autoregressive means generating one token at a time with no speculation, no drafter, no MTP heads. It's the reference every measurement here compares against, and it's a live competitor: on certain workloads it still wins on wall clock.

On Qwen3.6-27B Q4_K_M with -c 131072 -fa on -ctk q4_0 -ctv q4_0, plain autoregressive decodes at ~29 tok/s and stays there regardless of output length. It has the fastest prefill of the three modes (~37.5s at 43K context, vs MTP's ~49s and DFlash's ~44s). No drafter KV cache means no bandwidth contention and no collapse at long output. It also never speeds up.

How DFlash got here (2026-05-12)

The Dell T5820 install was the hardware story (companion post forthcoming). DFlash was the software follow-up. Initial scan of Luce-Org/lucebox-hub (advertising 3.43x decode + 10x TTFT on RTX 3090) ran into the same blocker: their daemon is a raw generate primitive with no OpenAI API, no jinja chat templates, no tool calling. Slotting it behind Hermes/k2 would need a chat-template shim written from scratch.

BeeLlama.cpp by Anbeeld already had the shim baked in: DFlash speculative decoding, TurboQuant KV cache, and CopySpec fallback layered onto the OpenAI server with --jinja and tool-call detection preserved. Different binary. Same flags Hermes needed.

The clean A/B: same Qwen3.6-27B Q4_K_M target, same KV quant, same -c 131072, same -fa on. Same workload (1200-token Python linked-list class, temperature 0, seed 42). BeeLlama with --spec-type dflash vs the same BeeLlama with no --spec-* flags. DFlash was the only variable.

Config	Decode tok/s	Prompt tok/s	VRAM MiB	vs Autoregressive
Autoregressive baseline	43.15	202	18634	1.00x
DFlash, q4_0 KV	148.46	189	19760	3.44x
DFlash + bigbatch, thinking ON	~124	~210	20372	2.88x
DFlash + bigbatch, thinking OFF (peak)	176.02	209	19946	4.08x

The headline is the third row. Server-wide enable_thinking:false was tested, ran for ~15 minutes in production, and reverted because the model started narrating work it never did ("running on a Pi, give it a second" on a 3090 Ti) and made up status messages. Qwen3.6 is a reasoning model. Server-wide thinking-off tanks output quality across the agent stack, and reasoning back on costs ~30% of the peak.

Tool calling stayed intact through the swap. Standard OpenAI-shape tools array request came back with finish_reason: "tool_calls" and a clean tool_calls array. No shim needed.

The drafter knobs

DFlash's default --spec-draft-n-max is 16: the drafter guesses up to 16 tokens per round, the target verifies them all at once. Anything the drafter got right is free; anything past the first wrong guess is wasted compute. A sweep across 4, 8, 12, 16 (plus a fine pass at 10, 11, 13, 14 a week later) put the optimum at 12 on this workload , a wide valley flat to ±5% across n_max 10-14.

Config	latency	tool-call	chat-short	code-long
prod (n_max=16, cross=1024, adaptive ON)	92	76	59	130
nmax-8 noadapt	104	75	63	126
nmax-12 noadapt	111	78	73	137
nmax-16 noadapt	104	77	65	126
crossctx-2048	98	70	64	157

The chat-short bump (+24% decode, -27% wall clock) came from making each wrong guess cheaper. With n_max=16, every rejected draft on <think> content burned 12-15 wasted tokens. At n_max=12, the same rejections cost 8-11 tokens. Multiplied across thousands of speculation cycles per response, that's the 24%.

The lossless probe

The textbook claim: speculative decoding with rejection sampling (the mechanism that accepts draft tokens matching the target's distribution and corrects ones that don't) at temperature 0 produces output bit-for-bit identical to autoregressive. At temp=0 there's no random sampling, so every token should land on the target model's argmax. Pure speed optimization, zero quality impact.

Nobody on either side of the Qwen 3.6 and BeeLlama conversation had published a measurement, so Claude and Ian ran one: same target, temp=0, fixed seed 42, five deterministic prompts. We cached the autoregressive baseline, then ran DFlash against it with a character-level Levenshtein diff.

Workload	Identical to autoregressive?	Median char drift	Notes
latency (single digit "4")	YES (3/3)	0	Trivial
tool-call (`get_weather`)	YES (3/3)	0	Schema + args match exactly
chat-short (TCP handshake, ~1300 chars)	NO (0/3)	~1100	~86% of reference length. Semantically similar, textually distinct.
code-long (Python class, ~2600 chars)	NO (0/3)	94	~3-6% drift. Variable names + docstrings varied.

Lossless held narrowly for short deterministic answers and structured outputs like tool calls. It broke for sustained prose past a few hundred tokens. Probable cause: drafter distributional drift. The DFlash drafter is not an exact match for the target's logit distribution, so at rejection-sampling boundaries where two tokens sit at near-equal probability, even small drafter drift flips the accepted token. One flipped token branches into a different sentence.

The agentic-stack consequence: tool-call schema integrity is preserved (tool_call_schema_match_all = true across all iterations). Function names, argument keys, JSON shape all stay identical run-to-run. Free-form chat text varies at temp=0, the same way any non-deterministic backend would.

The MTP head-to-head (Phase 2) ran the same probe a week later and got the same drift profile : ~1000 chars on chat-short, lossless on tool calls. Two implementations, same theoretical guarantee failing the same way. Academic backing arrived at the right time: arxiv:2605.09992 "Attention Drift in Autoregressive Speculative Decoding Drafters" measured the same phenomenon at the model-internal level. Our Levenshtein probe and their attention-pattern analysis are pointing at the same thing from opposite ends.

Phase 1: what lost

Before committing to MTP testing, four more configs against the n_max=12 baseline:

Config	latency	tool-call	chat-short	code-long	Outcome
prod (nmax=12)	105.6	77.4	69.7	131.3	Baseline
nmax-4	77.6 (-26%)	64.9 (-16%)	58.7 (-16%)	82.4 (-37%)	Regression on every workload
nmax-8	105.7 (tie)	74.7 (-3%)	63.2 (-9%)	124.4 (-5%)	Strictly worse
q8-kv (nmax=12)	97.3 (-8%)	68.0 (-12%)	63.9 (-8%)	100.2 (-24%)	25% penalty confirmed, VRAM +1.8 GB
copyspec	TIMED OUT at >600s	-	-	-	Catastrophic

The Unsloth MTP configuration guide's recommendation of n_max=2 does not transfer to DFlash. The nmax curve is monotonically worse going smaller. MTP and DFlash are different techniques with different optimal draft windows. CopySpec without a drafter is a 300x slowdown, not a floor. BeeLlama's README describes it as model-free suffix matching. On the 150-token latency prompt it didn't complete in 600 seconds. The drafter is load-bearing; the drafter is the entire performance story for speculative decoding on workloads with no repetitive structure.

Phase 2: MTP buried

Multi-token prediction in vanilla llama.cpp (PR #22673, am17an) predicts multiple target-model tokens in parallel without a separate draft model. Simpler architecture, smaller VRAM footprint, comparable advertised speedup.

The first MTP test ran at short context with --spec-draft-n-max 3 (no --spec-draft-p-min flag existed yet). Same target weights as DFlash, same q4_0 KV, reasoning ON. Three findings: MTP wasn't lossless on prose either (~1000 chars drift on chat-short, same magnitude as DFlash); MTP ran ~1.3-1.4x slower than DFlash on matched chat-short workloads (52-57 tok/s vs ~73 tok/s); MTP preserved tool-call schema integrity (safe for Hermes).

Conclusion at the time: same drift profile on prose, lower throughput on matched workloads, no operational win. DFlash stays. Phase 2 looked dead. It was on the wrong settings, at the wrong context size.

Phase 6: prefill is most of TTFT

Production was running DFlash + bigbatch + nmax=12. Hermes felt slow. The decode bench said 70 tok/s on chat-short, which should have been ~10 seconds wall-clock on a typical answer. Real Hermes traffic was hitting 12-32 second TTFT. The numbers didn't add up.

So Ian and Claude ran one bench at the actual Hermes workload shape: ~43K context (system message + multi-turn history + tools array), reasoning ON, then sent the same body twice in a row.

Metric	Iter 1 (cold)	Iter 2 (warm, identical body)
Wall TTFT	48.48s	7.32s
Server `prompt_ms` (prefill)	46.90s	0.24s
Tokens evaluated (`prompt_n`)	43,241	4
Tokens reused (`cache_n`)	0	43,237
Decode rate	33.9 tok/s	43.2 tok/s

Cold prefill was most of TTFT. The model spent 46.9 seconds reading the prompt before generating anything. The whole decode-throughput investigation had been tuning the smallest slice of the wall clock. Cache reuse was the entire game when it worked : second request: 43,237 of 43,241 tokens reused, prefill dropped to 0.24s, ~200x speedup on the smallest slice that suddenly mattered.

Phase 7 immediately re-validated at N=10-20 and corrected the headlines: the 200x cache speedup required byte-identical bodies, realistic Hermes traffic (varying user message turn-to-turn) gets closer to 60x. The "96.7% prefill share of TTFT" was a high-tail observation; at N=10 the median is 88%. Reasoning budget effect, originally measured as <5%, was actually 20.7% with proper sampling. Three Phase 6 numbers off by 1.5x to 4x from a single observation.

Phase 7: DFlash hurts prefill

The Phase 6 framing was "speculative decoding optimizes the wrong axis." It implicitly assumed DFlash was neutral on prefill. Phase 7 actually measured it.

Metric	Autoregressive (N=10)	DFlash (N=10)	Δ
Prefill at 43K context	37,498 ms	44,153 ms	DFlash is 17.8% SLOWER
Decode at 43K context (out=100)	29.3 tok/s	38.4 tok/s	DFlash +30%

DFlash trades prefill speed for decode speed: the drafter prefills alongside the target, adding 17.8% wall time at 43K context. On a TTFT-dominated workload (large system message, mostly tool calls + short replies, which is roughly what Hermes runs), DFlash was making user-felt latency worse, not better. Production was about to swap.

Phase 8: the DFlash Decode Collapse

Two sweeps, four output lengths each at N=3.

The DFlash Decode Collapse:

output_len	Autoregressive wall (ms)	DFlash wall (ms)	DFlash decode tok/s
100	41,888	48,394	46.9
500	55,785	59,995	37.0
1000	72,966	79,528	30.2
2000	107,476	112,567	30.1

DFlash decode at out=100 is 46.9 tok/s. At out=2000 it's 30.1, basically autoregressive speed. The drafter's KV cache grows alongside the target's, the small drafter is more bandwidth-bound, and the speedup erodes as the conversation gets longer. By out=2000, DFlash is paying its 6-7 second prefill tax for no decode benefit. arxiv:2604.26412 "When Hidden States Drift: KV Caches and Long-Range Speculative Decoding" names this drafter-bandwidth bottleneck at the research level; the table above is the practitioner measurement. Three months of tuning a drafter ratio that evaporates at output > 1000 tokens.

MTP, with the --spec-draft-p-min 0.75 filter on drafter logits:

output_len	Autoregressive wall (ms)	DFlash wall (ms)	new MTP wall (ms)	MTP decode tok/s
100	41,888	48,394	52,609	44.5
500	55,785	59,995	62,951	39.1
1000	72,966	79,528	73,089	44.5
2000	107,476	112,567	91,427	48.8

MTP's decode rate does not collapse. It holds 39 to 49 tok/s across every output length tested. MTP has no separate drafter model: the multi-token heads share the target's hidden state and its KV cache. No second KV cache to feed, no bandwidth contention, no drafter-KV-grows-with-output bottleneck.

Crossover math: MTP has worse prefill (~49s vs autoregressive's 37.5s) but sustained-high decode (~46 tok/s vs autoregressive's 29). MTP overcomes its 11.5s prefill tax at output ~900 tokens: 11.5 / (1/29 - 1/46) ≈ 902. Below that, autoregressive wins. Above, MTP wins. DFlash is below both at every tested length.

Phase 2 buried MTP at short context with the wrong settings. The p_min=0.75 filter plus a long-context workload exhumed it.

Can a second GPU help?

The first instinct on the bandwidth-starved-drafter problem is to throw a second GPU at it: drafter on one card, target on the other, in parallel. The math doesn't reward it. Within a single draft/verify cycle there's no parallelism (target verification depends on drafter output). Across cycles, async or lookahead spec decoding gives a theoretical speedup ceiling of 1 + min(time_d, time_t) / max(time_d, time_t).

At long context the drafter dominates the cycle, so parallel speedup tops out around 1.125x. A second 3090 Ti buys a 1.2x win for ~$700. MTP gives the same architectural win for $0 via shared KV and no bandwidth contention. llama.cpp doesn't support async 2-GPU spec decoding anyway. vLLM and TensorRT-LLM do, which means buying hardware AND switching the inference stack.

Phase 9: production switch (2026-05-15)

The llama-server unit on ubuntu1 got rewritten. BeeLlama out, vanilla llama.cpp build-mtp in. Standard Q4_K_M GGUF out, MTP-variant Q4_K_M in. Separate drafter file dropped entirely. Hermes didn't need touching: same alias, same OpenAI shape, zero client changes. About 5 minutes total wall-time.

Verification bench against the live prod unit, N=3, four output lengths:

output_len	wall median (ms)	decode tok/s	reasoning_chars median
100	53,138	44.6	145
500	62,997	39.1	186
1000	73,052	44.4	803
2000	91,304	48.9	816

Numbers match Phase 8 standalone within 1%. --reasoning-budget 256 introduces no measurable regression. Production VRAM 20,902 MiB. The swap deleted code.

Sidebar: Tailscale relay vs LAN throughput

The MTP-variant model swap was painful the first time. Mac Studio and ubuntu1 are on the same Tailscale network, so the obvious move is scp over the 100.x address. That tops out at 1 to 2 MB/s because Tailscale routes through a DERP relay in Seattle. The 16 GB swap would have taken hours.

Both boxes are also physically Ethernet-bridged on the local LAN: ubuntu1 at 192.168.2.2, Mac Studio at 192.168.2.1. Same scp command pointed at the LAN address: 100 MB/s, ~3 minutes for the 16 GB model. 50 to 100x speedup over the Tailscale path.

The DigitalOcean droplet pulled MTP weights directly from HuggingFace at 67 MB/s when the Tailscale-from-Mac-Studio attempt hung on auth for 30 minutes. Public-internet egress was 30x faster than the mesh peer.

Methodology lessons

Distribution evidence is mandatory. Single-prompt benchmarks inflated DFlash gains 30 to 60% versus an 8-prompt diversity sweep. N=1 misleads even at the right prompt size: Phase 6 → 7 had three headline numbers off by 1.5x to 4x from one observation. When a vendor publishes "3.4x," assume the median is closer to 1.5 to 2x. And match the bench prompt size to production: Phase 5 was the right experiment on the wrong workload (Phase 7 redid it at 43K context and got a 4x larger effect).
Decode tok/s and TTFT decouple on reasoning models at production context. At 43K context they're three timescales (prefill, reasoning, decode) at roughly 80% / 12% / 8% of wall clock. Optimize what users feel.
Spec decoding is workload-dependent, and "workload" includes output length. DFlash is 1.6x on chat-short (out~700) and 1.04x at out=2000. Autoregressive holds ~29 tok/s flat. MTP holds 39 to 49 tok/s under prod flags. Pick the technique whose curve fits your traffic. Cache reuse is binary: ~60x prefill speedup with a stable prefix, full cost when it isn't.
Spec decoding at temp=0 is NOT bit-for-bit lossless on prose. Identical on tool calls and one-token answers, completely different on free-form prose. True for both DFlash and MTP. Both implementations fail the textbook lossless guarantee on prose-heavy workloads.

FAQ

What is multi-token prediction (MTP)?

MTP adds "head" layers to the target model that predict several tokens in parallel with the main next-token prediction. Drafts get verified on the next forward pass: accepted tokens are free, the first rejected one cuts off the rest. No separate drafter model, shared KV cache, same speculative-decoding mechanism as DFlash but with drafts coming from inside the target.

Does MTP change the output?

On tool calls and short structured answers, bit-for-bit identical to autoregressive at temp=0 (verified N=3 per workload). On free-form prose past a few hundred tokens, ~1000 characters of textual drift on chat-short, same magnitude as DFlash. Semantically equivalent, textually different. For regression tests that diff against gold outputs, disable speculative decoding entirely.

MTP vs DFlash on a 3090 Ti, which one's faster?

Depends on output length. Below output 500, autoregressive > DFlash > MTP because prefill dominates. By output 1000, autoregressive ≈ MTP > DFlash. By output 2000, MTP > autoregressive > DFlash by 16 to 21 seconds wall clock. DFlash decode collapses from 47 to 30 tok/s as output grows because its drafter's KV cache competes for bandwidth. MTP shares the target's KV, no collapse.

InsiderLLM has a DFlash-vs-MTP head-to-head on the same hardware that benches a single short-output point. The DFlash Decode Collapse only appears past output 500-1000 tokens, which is why it doesn't surface in short-prompt comparisons.

Does MTP work on Qwen3.6-27B dense?

Yes. Unsloth ships Qwen3.6-27B-MTP-GGUF with the heads baked in. The HackMD MoE benchmark initially said "MTP doesn't help" on Qwen3.6-35B-A3B, then a May 8 2026 update flipped to +27.5% with corrected flags. The MoE story is still evolving. Dense 27B with --spec-type mtp --spec-draft-p-min 0.75 --spec-draft-n-max 6 runs at 39 to 49 tok/s across the tested output lengths.

Why does MTP need a custom llama.cpp build?

PR #22673 (am17an, opened May 4 2026, merged to master May 16 2026) added --spec-type mtp. Builds from master after that date have it natively. For earlier checkouts, fetch the PR branch and build with -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_NATIVE=ON. Takes ~4 to 13 minutes depending on cache state.

What about Apple Silicon?

This post is RTX 3090 Ti specific. MTP on Metal has different gotchas (issue #23011 flags "MTP slower than baseline on Apple Metal despite high acceptance" on the 35B-A3B variant). The MLX backend is a separate story. For the Apple Silicon side of local inference (LM Studio tuning, KV-cache quantization, the sysctl GPU-memory-cap fix), see LM Studio Errors on Apple Silicon.

Money quotes

"Same box, same model, same flags. Different binary. Nearly three times the throughput."
"Speculative decoding is supposed to be lossless at temp=0. We measured it. It isn't, on prose. Tool calls survive."
"CopySpec without the drafter is a 300x slowdown. The drafter isn't overhead. The drafter is the entire performance story."
"Three months of tuning a drafter ratio that evaporates at output > 1000 tokens."
"MTP doesn't have a separate drafter. The heads are part of the same forward pass. There's no second KV cache to feed. So when the context gets long, MTP doesn't slow down. DFlash does."
"Production went from BeeLlama + DFlash + custom drafter back to vanilla llama.cpp + the MTP-variant GGUF. The swap deleted code."
"Hermes never noticed. We changed the binary, the model, the spec-decoding technique, and the drafter situation, and Hermes kept hitting the same alias."

Companion post forthcoming: Building a 3090 Ti Homelab Inference Node on a Dell Precision T5820. All commands and configs reproducible.

LLM Benchmark Rankings 2026: 15 Models Tested on 38 Real Coding Tasks

Ian L. Paterson — Mon, 18 May 2026 19:59:50 +0000

Most LLM benchmarks measure raw intelligence. Real deployment decisions also depend on latency, format reliability, and data boundaries, including when a task should stay on-prem instead of going to a public cloud.

Most LLM benchmarks measure raw intelligence. Real deployment decisions also depend on response speed, format reliability, and data boundaries, including when a task should stay on-prem instead of going to a public cloud. And while every model vendor says "test on your own data," but almost nobody publishes those results with cost, latency, and pass-rate data attached.

This is that test. Total cost: $2.29.

Fifteen models across thirty-eight tasks from my daily work as a tech executive, five hundred seventy API calls scored deterministically with an LLM judge pass for QA.

And the winning model is not a model, but a maxim:

Routing beats model selection.

For most daily tasks, the cheap models are good enough, and the routing decision is worth more than picking the "best" model.

Key findings:

Opus & Sonnet both scores 100%.
Gemini Flash scores 97% at $0.003/run.
GPT-oss-20b scores 98.3% running locally on-prem.

The surprises:

MiniMax M2.5 is not in most people's rotation, and yet scores 98.6% with 100% pass rate and returns clean structured output (bare JSON, no wrapper text) on 23 of 38 tests.
And GPT-oss-20b, which barely exists on public leaderboards, outscored Haiku, R1, and GPT-5-Nano while costing nothing.

The Problem I Was Solving

I run a publicly traded cybersecurity company and use AI in production every day. This benchmark started as a cost-control question: when is a budget model good enough that wasting money on frontier prices becomes indefensible? I do not need a model that wins one-shot snake games. I need a model that completes production work reliably, is quick to respond and doesn't break the bank.

LLM inference prices have fallen 10-50x per year since 2022 (Epoch AI), which makes the routing question worth actually answering with real data.

Scope: text-only, single-shot prompts routed by task type. No agent harness. The constraints (speed, cost, data sovereignty, on-prem option) determined which 15 models made the list. Thirty-eight tasks is a small sample, and a different practitioner's workload would produce different rankings. The test suite and harness are published on GitHub so anyone can extend it with their own tasks.

The Practical Stack

This benchmark is a routing guide, not a ranking. The data consistently points to three tiers working together, not one model doing everything.

Speed and cost (Flash, GPT-oss-20b, Haiku, DeepSeek V3): extraction, batch jobs, classification, health checks. Gemini Flash at 1.1s and $0.003/run handles 97.1% of tasks. GPT-oss-20b hits 98.3% for free. These models cover the high-volume, low-stakes layer.

General purpose workhorse (Sonnet): 100% accuracy, $0.20/run, 4.6s median. The model you reach for when the task matters and you don't want to think about routing. MiniMax M2.5 belongs here too for batch pipelines that need clean structured output.

High-end reasoner (Opus, Codex CLI, Kimi K2.5): multi-step causal chains, complex planning, style-constrained writing. This is where cheaper models drop to 60-80% and frontier models earn their cost. Codex is free with a ChatGPT Pro subscription.

One latency caveat: Kimi K2.5 (29s median), DeepSeek R1 (23s), and MiniMax M2.5 (16s) are thinking models. Accurate, but impractical for interactive agent loops.

The 38 Tasks and 15 Models

I had Claude analyze two weeks of my session logs to build the test suite. Coding and data dominate my workload, so they got the most tests. A few tasks use Canadian context (TSX-V press releases, regulatory classification) since that's my actual data.

Group	Tests	What it tests	Real-world example
E - Extraction	5	Pull structured data from messy text	Mining press releases with null traps - hallucinating a missing Cu grade corrupts a downstream database
C - Code	7	Write and fix code	Bash, Python, Rust, TypeScript - my actual production stack, in my order of preference
R - Reasoning	5	Multi-step logic, contradiction detection	Cause-effect chains and root cause analysis - the group that differentiates models most in practice
W - Writing	5	Style-constrained drafting	Tested against my specific rules (no em dashes, no "Not X. Y." device)
P - Planning	4	Task decomposition, spec writing	Edge case enumeration for production systems
I - Investments	4	Prediction markets, options extraction	Portfolio signals from financial text
D - Data	4	CSV/JSON manipulation	Transformation and normalization from real automation pipelines
H - Health	2	Ops parsing	Cron log analysis, schema drift detection
L - Letter Counting	1	Character-level processing	"How many e's in nevertheless?" The trap: it's 4, not 3
M - Math	1	Multi-step arithmetic	Modular arithmetic chain where step 1 wrong cascades through everything

Llama, Mistral, and Cohere aren't in my daily rotation so they're absent, but the suite is published for anyone to extend.

Scored deterministically, with an Opus judge pass for QA. Every test is rerunnable. 14 of 15 models score above 85%, which suggests the tasks are broadly achievable rather than shaped to favor any one provider.

2026 Benchmark Results

Rank	Model	Quality	Pass Rate	Cost	Median Time	Total Time
1	Claude Sonnet 4.6	100.0%	38/38 (100%)	$0.20	4.6s	3.6 min
1	Claude Opus 4.6	100.0%	38/38 (100%)	$0.69	4.1s	3.3 min
3	Kimi K2.5	98.6%	38/38 (100%)	$0.13	29.2s	33 min
4	MiniMax M2.5	98.6%	38/38 (100%)	$0.07	15.9s	19 min
5	Gemini 2.5 Pro	98.3%	37/38 (97%)	$0.71	13.8s	10 min
6	GPT-5.2-codex (Codex CLI)	98.3%	37/38 (97%)	$0.16	4.6s	3.9 min
7	GPT-oss-20b	98.3%	37/38 (97%)	$0.00	4.1s	3.3 min
8	GPT-5.2	98.0%	37/38 (97%)	$0.15	3.0s	2.5 min
9	Gemini 2.5 Flash	97.1%	35/38 (92%)	$0.003	1.1s	52s
10	DeepSeek R1	96.8%	37/38 (97%)	$0.12	23.1s	22 min
11	Claude Haiku 4.5	95.9%	37/38 (97%)	$0.04	2.2s	1.6 min
12	GPT-5-Nano	94.8%	35/38 (92%)	$0.03	11.1s	11 min
13	DeepSeek V3 (Chat)	88.7%	34/38 (89%)	$0.008	5.8s	4.6 min
14	Qwen 3.5 35B (local)	85.8%	33/38 (87%)	$0.00	6.1s	4.4 min
15	Gemma 3 12B (local)	80.6%	32/38 (84%)	$0.00	5.8s	5.8 min

All models run through OpenRouter for apples-to-apples comparison. Differences under ~2 percentage points are within the noise floor - treat models within that band as statistically tied.

Gemini Flash is the speed/cost champion: 1.1s median, $0.003 total, 92% pass rate. If you can tolerate 3 failures out of 38, this is absurd value.

GPT-oss-20b at $0.00 and 98.3% is the free-tier surprise. It ties Gemini Pro and Codex on points while costing nothing.

Key Findings

1. Sonnet is the benchmark ceiling on value

172.5/172.5 points, 38/38 pass rate, $0.20 total cost, 4.6s median response time. Opus matches Sonnet on accuracy (100.0%) but costs 3.5x more ($0.69). No other model matched Sonnet's combination of perfect accuracy, reasonable cost, and fast response.

2. MiniMax M2.5 is the format compliance champion

100% pass rate under both deterministic scoring and Opus-as-judge. MiniMax responses are extremely concise: 23 of 38 are JSON-only with zero explanation text, 14 under 200 characters. It never triggers must_not_contain penalties because there is no text to penalize. Format compliance is a real, separate capability that matters in production pipelines.

MiniMax is doing something operationally important here. Most models add wrapper text ("Here's the JSON..."), markdown fences, or extra rationale that breaks deterministic parsers. MiniMax mostly does not. It returns parseable payloads with almost no conversational scaffolding, which means fewer downstream failures in automation.

3. Gemini Flash redefines the cost floor

$0.003 for 97.1% quality. 1.1-second median response. 110.6 tok/s. For extraction, data transformation, batch jobs, and health checks, Flash is the rational default and the best Brains per Buck in this benchmark. The 3 tests it fails (E4, R1, R4) are all reasoning-adjacent. On pure data tasks, Flash is perfect.

4. The free tier is competitive

GPT-oss-20b at $0.00 outscores Haiku, R1, and GPT-5-Nano. The free tier is no longer an afterthought.

5. Thinking models pay a steep latency tax for marginal gains

Kimi K2.5 matches Sonnet's 100% pass rate but takes 33 minutes vs 3.6 minutes (9.3x slower) and produces 4.8x more output tokens. MiniMax M2.5 is the fastest thinker at 19 minutes, still 5.3x slower than Sonnet. The quality improvement from extended thinking is marginal on these tasks, but the latency cost is not.

6. Reasoning is the only category with a hard quality split

R-group has a 13.3% failure rate, the highest of any category. Four models failed R1 (gold production calculation), four failed R4 (root cause identification). The models that fail reasoning tasks are predictable: smaller models and budget options. If your task involves multi-step causal chains or root cause analysis, that is where frontier models justify their price.

2026 Category Difficulty by Task Type

Reasoning and Math are where models split. Planning and Code are essentially solved at 0-1% failure rates across all 15 models. Writing (5.3%) and Reasoning (13.3%) are where routing decisions matter most.

Want a personalized recommendation? Try the LLM Picker tool to find the right model for your specific use case, budget, and priorities.

Is This Benchmark Too Easy?

A fair critique is that this benchmark looks easy. Twenty-six of thirty-eight tests had zero failures across all fifteen models. A skeptic could call that too easy, and they would be missing the point.

Most daily LLM work is already easy for current models. The core decision is rarely "which model can solve olympiad-level physics." The core decision is "which model can handle Tuesday afternoon production tasks with predictable quality, speed, and cost." A $0.002-per-task model scoring 98.6% on real work is the result that matters.

MiniMax M2.5 is the clearest example. It sits at #27 on LiveBench, where competition math and agentic coding dominate the ranking. On SWE-bench Verified it ranks #4 (80.2%), and in this benchmark it scores 98.6% with 100% format compliance. LiveBench asks if a model can do IMO-style problems. This benchmark asks whether it can extract Q3 revenue from an earnings call and return clean structured output. The gap between academic benchmarks and practical benchmarks is not a bug in either one. It is evidence that the market has segmented. Frontier intelligence is one product, reliable task completion is another, and most teams are buying the first when they need the second.

What This Means for Actual Usage

LLM Model Routing: Which Tier for Which Task?

The benchmark data clusters into four natural tiers. In practice, LLM model routing means sending each task to the cheapest model that reliably clears your quality bar, then escalating only when the task type needs more reasoning depth or stricter output quality.

Tier	Models	Quality	Cost/Run	Best For
Free	GPT-oss-20b, Qwen 3.5 35B, Gemma 12B	80-98%	$0.00	Extraction (GPT-oss: 98.3%), local-only workloads
Budget	Gemini Flash, DeepSeek V3	88-97%	$0.003-$0.008	Batch jobs, health checks, data transforms, speed-critical agentic loops
Mid	Haiku, MiniMax M2.5, GPT-5-Nano	94-99%	$0.03-$0.07	Code, data, most production tasks
Frontier	Sonnet, Opus, GPT-5.2-codex, Kimi K2.5	98-100%	$0.13-$0.69	Reasoning, writing with style constraints, complex planning

This is Inference Arbitrage in practice: route each task to the cheapest model that still clears your bar.

For a session where the same tier mapping ran headfirst into a three-month LLM hallucination loop, and only a deterministic tool caught the real bug, see The LLM Kept Saying ‘Fixed.’ For Three Months, It Wasn’t.

There is a second optimization layer most teams miss, Remnant Tokens

If you pay for Google Workspace, you get a daily bucket of Gemini calls. If you pay for ChatGPT Pro ($200/month), Codex usage is covered by subscription. Those allowances are prepaid capacity, and unused capacity expires.

A practical routing policy is: burn remnant token capacity first, then use the tiered routing table for overflow and for tasks that need higher reasoning depth.

The actual routing logic, including escalation thresholds, fallback patterns, and cost guardrails for agent loops, will be covered in a follow-up.

This test was run in March of 2026, and with the rate of change in AI land, could be out of date quickly. Including a reminder of the year for future users reference.

Best LLM for Coding Tasks (2026)

For coding tasks, Sonnet and GPT-5.2-codex both scored 100%, and planning/code categories were near-solved across the full field (0-1% failure rates). The ranking differences come mostly from reasoning and style-constrained writing, not core code generation.

Cheapest LLM for Production Use (2026)

Gemini 2.5 Flash posted 97.1% quality for $0.003 per 38-test run with a 1.1s median response time, making it the cheapest paid production option in this benchmark. DeepSeek V3 is also cheap at $0.008 but trails on quality at 88.7%.

Best Open Source LLM (2026)

GPT-oss-20b scored 98.3% with a 97% pass rate at $0.00, outperforming the other local/open models in this benchmark. Qwen 3.5 35B scored 85.8% and Gemma 3 12B scored 80.6%, so GPT-oss-20b is the strongest free open model here.

GPT-oss barely exists on public leaderboards. Artificial Analysis tracks the larger 120B variant as the highest-ranked American open-weight model (Intelligence Index 33). The 20B version we tested locally is not independently ranked anywhere we found. This benchmark may be its first independent public evaluation.

What I'm Actually Going to Use

The benchmark confirmed some choices and changed others. Here's my actual routing plan going forward.

Opus 4.6 as the orchestrator for main work. This is the model I sit in front of for interactive sessions: managing plans, interactive dialogue, coordinating subagents. Opus ties Sonnet on batch accuracy but costs 3.5x more, though interactive debugging rewards extended context handling and multi-turn coherence over single-shot accuracy.

Extensive Sonnet subagent work. Sonnet scored 100% and costs $0.20 per run. There's very little downside to forking Sonnet agents to grind in the background on research, code review, data analysis, and file processing. The benchmark confirmed what I'd already suspected: Sonnet is the workhorse. Sonnet matching Opus on quality at one-third the price is the clearest signal in the data.

Gemini Flash for quick classification, web searches, and batch extraction. Paid Google accounts come with a generous bucket of free API calls and OAuth calls per day. Flash at 1.1s median and 97.1% quality is effectively free at my usage volume. For anything where I need a fast answer and can tolerate the occasional reasoning miss, Flash is the default.

On-prem: keep Qwen 3.5 35B as the primary local model, but explore GPT-oss-20b. Qwen runs on my Mac Studio through LM Studio, powered by OpenClaw as the agent framework, and handles basic tasks at 20.3 tok/s for $0.00. GPT-oss-20b's 98.3% score is hard to ignore, though. I'll be running both against real-world scenarios over the coming weeks to see if GPT-oss-20b holds up outside the benchmark.

If you have ChatGPT Pro ($200/month), use Codex CLI. GPT-5.2-codex scored 98.3% with 97% pass rate. The Pro subscription covers the API cost, so every Codex call is effectively free. For coding tasks especially, this is a strong ceiling at no additional per-call cost.

A note on batch scores vs interactive debugging

These benchmark results measure batch accuracy: give a model a well-defined task, score the output deterministically. They do not fully predict how a model performs in an interactive debugging session where context accumulates, the goal shifts mid-conversation, and the model needs to track its own prior reasoning.

After running this benchmark, Claude Code was switched to Haiku as the default model (95.9% here, $0.036/run). Claude Code is my daily operating system, with a persistent memory architecture that carries context across sessions. The first real test was a cron job that had stopped firing. Haiku circled the problem, made plausible-looking changes, and failed to fix it across multiple turns. Switching to Sonnet with extended thinking resolved it in one exchange. The routing table holds for batch API work.

For interactive debugging with long context chains, the benchmark scores understate the gap between tiers. Treat them as a floor, not a ceiling.

This benchmark covers 38 tasks from one practitioner's workflow. It does not cover creative writing, image analysis, long-context document tasks, or multi-turn conversation. The routing table above is a starting point, not a universal prescription.

Cost Considerations

What It Cost to Run

Total benchmark cost: $2.29 for 570 calls across 15 models via OpenRouter. That is the entire cost to rank 15 models against 38 tasks and compute Inference ROI from real workloads instead of assumptions.

The table tells the story: spending $0.20 (Sonnet) gets you perfect marks. Spending $0.69 (Opus) gets you the same marks at 3.5x the price. The marginal return on spending above Sonnet is zero, not negative.

On academic benchmarks the ranking is reversed. LiveBench places Opus at #3 (76.33) and Sonnet at #17 (68.19). SWE-bench Verified has Opus at #1 (80.8%) and Sonnet mid-pack (77.2%). Aider Polyglot has Opus at #14 (72.0%) and Sonnet at #22 (61.3%). The parity in our results reflects task difficulty: for structured daily work, Sonnet's instruction-following precision matches Opus's reasoning depth. The gap reopens on competition math and multi-file refactoring.

Gemini Flash at $0.003 per 38-test run delivers 97.1% quality. Opus at $0.69 delivers 100.0%. That is The 265x Question: when is a 265x cost difference worth a 2.9 percentage point quality gap? That ratio holds for data tasks where Flash scores 100%. On reasoning tasks, Flash drops to 60% while Opus stays near 100%. The routing thesis is that the 265x stat applies to some tasks and not others.

Methodology

How the Benchmark Actually Ran

The 500-line Python harness: runner, scorer, adapters, report generator, schema validator. Adapters handle auth, ID mapping, response normalization per model. The runner fires 38 prompts per model in parallel threads, capturing time.monotonic() per call, writing raw results to JSON. One model, 38 calls, one output file. All models tested at default weights with no fine-tuning.

METHODOLOGY
===========

Step 1: Task Inventory
  |
  +--> Pulled 38 tasks from Ian's own Claude Code
  |    session history (not academic benchmarks)
  |
  +--> 10 groups: E/C/R/W/P/H/I/D/L/M
  |    (extraction, code, reasoning, writing,
  |     planning, health, investments, data,
  |     letter counting, math)
  |
  +--> Canadian context built in: TSX-V drill
       results, prediction markets, cron ops
         |
         v
Step 2: Test Harness
  |
  +--> 5 model adapters built:
  |    Anthropic SDK / Gemini REST /
  |    OpenRouter / LM Studio / Codex CLI MCP
  |
  +--> 11 deterministic scorer types:
  |    json_object, code_exec,
  |    writing_constraints, json_array...
  |    (NO LLM judge - can't test with the
  |     same tool you're evaluating)
  |
  +--> Runner: parallel threads, wall_time
       captured via time.monotonic()
         |
         v
Step 3: Benchmark Run (March 1-8, 2026)
  |
  +--> 15 models x 38 tests = 570 calls
  |
  +--> Captured per-call:
  |    quality score, wall time,
  |    tok/s, cost (USD)
  |
  +--> Total cost: $2.29
         |
         v
Step 4: QA Pass (parallel)
  |
  +--[Codex subagent]--> Automated integrity:
  |    completeness, score sanity, cost
  |    >> Found 3 scorer bugs (CSV parser,
  |       JSON regex, R1 format instruction)
  |
  +--[Opus subagent]---> Manual review:
       every failure examined
       >> Found 4 subtler bugs (wrong extract
          fn, I2 all-pass regression, haiku
          penalized for showing work, R2 narrow)
         |
         v
Step 5: Results
  |
  +--> Corrected rankings published
  +--> Raw results + all 38 prompts on GitHub
  +--> Routing table derived from group scores

Scoring is deterministic. The 11 scorer types (json_object, code_exec, writing_constraints, etc.) score raw responses against defined criteria. Pass/fail is computed, not judged. Every result is verifiable by rerunning the same call.

All 15 models were called through the same OpenRouter API with identical parameters: a system prompt, a user prompt, and max_tokens=8192. No model received special reasoning configuration, thinking budgets, or elevated inference modes. MiniMax M2.5, Kimi K2.5, and DeepSeek R1 are architecturally thinking models. They produce reasoning traces by default on every API call. There is no "turn it off" switch the way Gemini Flash offers a thinkingBudget parameter. SWE-bench separately evaluates these models in a "high reasoning" configuration that boosts reasoning effort beyond the default. We did not use high reasoning mode. What you see in the results is what you would get if you called each model's API today with a standard prompt and no special parameters. The latency and token cost of that default reasoning (19 minutes for MiniMax, 33 minutes for Kimi, 22 minutes for R1 on 38 tests vs Sonnet's 3.6 minutes) is documented in the Speed and Key Findings sections above.

LiveBench's ICLR 2025 research showed that LLM-as-judge scoring has 21-46% error rates on hard tasks, which is one reason this benchmark uses deterministic scoring for all 38 tests, with the Opus judge as a QA layer rather than the primary scorer.

No model refused any of the 38 prompts. Refusal rate testing (how often safety filters block production-style requests) matters for deployment but is outside this benchmark's scope.

The full test harness, all 38 prompts, and raw model responses are on [GitHub repo - coming soon].

Why Your LLM Benchmark Infrastructure Matters More Than the Models

The original benchmark ran 15 models through five different adapters: claude -p for Anthropic, codex exec for GPT-5.2 variants, the Gemini CLI for Google, LM Studio for open-source, and OpenRouter for everything else. The v2 rerun routed seven of those models through OpenRouter instead. Same prompts. Same scorer. Different plumbing.

Four Gemini responses turned out to be CLI artifacts, not model output. C3 came back as </code>. D2 came back as "I have completed the task as requested." These aren't bad answers. They're the CLI capturing a status message while the model's actual output went to a tool-use sandbox the CLI never surfaced. Through OpenRouter, all four returned valid, high-scoring responses.

Fourteen tests flipped between v1 and v2 on identical prompts. R4 (root cause analysis) was the most contested: three models changed their answer on different days. GPT-5.2-codex and Gemini Flash went from correct to incorrect. Gemini Pro went from incorrect to correct. That's the inherent noise floor of LLM evaluation, and any benchmark that doesn't acknowledge it is reporting signal mixed with static.

The lesson: if your models aren't all going through the same adapter path, your numbers contain an unknown amount of infrastructure noise. You might be ranking adapters, not models.

Every QA Layer Caught Something the Previous Layer Missed

The benchmark went through five QA passes before the numbers were publishable. Each one found problems the previous missed.

The Codex pass ran in 9 minutes, the Opus pass in 13, and they caught completely different categories of bugs. Automated testing found structural problems (CSV parser brittleness, JSON regex overreach, a max_score calculation that produced quality scores over 100%). LLM-powered review found semantic ones (a wrong answer key in the letter-counting test, one model giving the right answer in prose and getting zero because the scorer required JSON, "Haiku beats Sonnet" turning out to be a pure scorer artifact). Neither alone sufficed.

The takeaway for anyone running their own benchmark: parallel QA with different model types catches different failure modes. Single-pass evaluation ships errors.

Format Compliance Is a Real Capability (and Small Models Don't Have It)

Gemma (12B) and Qwen 3.5 (35B) both returned correct answers in formats the scorer couldn't parse. Repeatedly, on different test types, despite explicit format instructions.

Gemma returned Python code for D3 (a CSV transformation task). The prompt said "Return CSV." Gemma wrote a Python script that would produce the correct CSV if executed. The csv_transform scorer tried to parse Python as CSV: 0% quality. Gemma did the same thing on L1 (letter counting), returning a Python function instead of JSON. Qwen returned a Markdown table for D3 instead of CSV. The values were correct. The parser crashed.

MiniMax M2.5, by contrast, returned bare JSON on 23 of 38 responses. No explanations, no code blocks, no Markdown wrapping. It scored 100% pass rate under both the deterministic scorer and the Opus LLM-as-judge. That discipline, understanding that the consumer of your output is a machine and not a human who can interpret Python as "approximately JSON," is itself a form of intelligence.

No other benchmark we reviewed (Artificial Analysis, SWE-bench, Aider, LiveBench, SEAL, Epoch AI) scores format compliance as an independent capability. They either penalize it silently or ignore it. For production pipelines where the consumer of model output is a parser, not a human, this gap in benchmark coverage is significant.

How Does Claude 4 Compare?

Claude-specific queries are the 4th largest search cluster for this post, so here is the dedicated breakdown. Opus 4.6 and Sonnet 4.6 both scored 100.0% on all 38 tasks with a 100% pass rate. The difference is cost: Sonnet costs $0.20 per run, Opus costs $0.69. That is 3.5x the price for identical accuracy on these tasks. Unless you need Opus-tier reasoning for competition math or multi-file refactoring (where academic benchmarks show a gap), Sonnet is the better allocation.

Haiku 4.5 scored 95.9% at $0.04 per run with a 97% pass rate (37/38). It failed one reasoning task (R4) and lost partial credit on a few others, but for batch classification, extraction, and health-check jobs it handles the workload at one-fifth the cost of Sonnet. The real-world test was less encouraging: Haiku circled a broken cron job for multiple turns without fixing it, while Sonnet with extended thinking resolved it in one exchange. Haiku is a solid batch workhorse. For anything requiring multi-step debugging, escalate to Sonnet.

If you want the full Claude family routing logic: Haiku for batch API work under $0.04/call, Sonnet for everything interactive or reasoning-heavy at $0.20/call, and Opus only when you have confirmed that Sonnet fails on the specific task type. In this benchmark, that confirmation never came.

LLM Price-Performance Rankings

Sorted by cost-per-correct-task from cheapest to most expensive, with a minimum 90% quality threshold to filter out models that are cheap but unreliable:

$0.00/run: GPT-oss-20b scored 98.3% running locally. Zero API cost. If you have the hardware, this is the price-performance champion by definition.

$0.003/run: Gemini 2.5 Flash hit 97.1% quality at 110.6 tok/s. Three-tenths of a cent per run. For extraction, batch jobs, and data transforms, Flash is the rational default.

$0.04/run: Claude Haiku 4.5 posted 95.9% with a 2.2s median response. Reliable for production tasks that need Anthropic-level instruction following without Sonnet pricing.

$0.07/run: MiniMax M2.5 scored 98.6% with 100% format compliance. The most structured output of any model tested, which matters when your downstream consumer is a parser.

$0.13/run: Kimi K2.5 matched MiniMax at 98.6% quality. Strong across all categories with a 100% pass rate.

$0.20/run: Claude Sonnet 4.6 scored 100.0%. Perfect marks. Every dollar above this bought zero additional accuracy in this benchmark.

The gap between $0.003 (Flash) and $0.20 (Sonnet) is 67x in cost for 2.9 percentage points of quality. On pure data tasks, Flash scores 100% and that 67x gap buys nothing. On reasoning tasks, Flash drops to 60% and Sonnet stays near 100%. The routing decision depends entirely on the task type.

Which LLMs Work Best for AI Agents?

Agentic reliability is not just accuracy. It is pass rate (does the model produce usable output every time?), format compliance (does the output parse without manual cleanup?), and consistency under multi-step reasoning. From the benchmark data, three tiers emerge for agent use.

Tier 1 (ship it): Sonnet 4.6 (100% pass rate, 100% quality), Opus 4.6 (same marks), MiniMax M2.5 (100% pass rate, 98.6% quality with near-zero wrapper text), and Kimi K2.5 (100% pass rate, 98.6%). These four models never failed to produce scoreable output and rarely lost partial credit. For autonomous loops where a single malformed response can cascade into retry spirals, that pass rate matters more than marginal quality differences.

Tier 2 (reliable with guardrails): GPT-5.2-codex (97% pass rate, 98.3% quality), Haiku 4.5 (97%, 95.9%), and Gemini 2.5 Pro (97%, 98.3%). Each failed one task outright. In an agent loop with retry logic, these are fine. Without retries, expect occasional dropped steps.

Tier 3 (batch only): Gemini Flash (92% pass rate), GPT-5-Nano (92%), DeepSeek V3 (89%), and the local models (84-87%). These models skip or malform output often enough that unsupervised agent loops will accumulate errors. They excel in supervised batch pipelines where a human or a Tier 1 model reviews the output.

The single best predictor of agentic reliability in this benchmark was not accuracy, it was pass rate. A model that scores 95% but always returns parseable output is more useful in a pipeline than one that scores 98% but occasionally returns unparseable responses that require exception handling.

How Do These Compare to Artificial Analysis?

Artificial Analysis runs synthetic benchmarks (MMLU, HumanEval, MATH, and similar) across a large model set with excellent speed and pricing data. Their rankings reflect performance on academic-style questions. This benchmark uses 38 tasks pulled from my actual Claude Code session history: regex extraction, API debugging, cron troubleshooting, writing with style constraints, prediction market parsing. The methodology is different, and so are some of the rankings.

For example, Artificial Analysis ranks Opus well above Sonnet on most metrics. In this benchmark, they tied at 100%. The gap in academic benchmarks reflects tasks (competition math, multi-file refactoring, PhD-level science) that did not appear in my workload. Conversely, this benchmark surfaces format compliance and deterministic scoring failures that synthetic benchmarks typically ignore.

Neither benchmark is wrong. They measure different things. If you are choosing a model for academic research or competition coding, Artificial Analysis and SWE-bench are more relevant. If you are choosing a model for production pipelines, data extraction, agent loops, and daily coding tasks, this benchmark is closer to what your actual usage will look like. Use both.

Frequently Asked Questions

What are Kimi K2.5's benchmark results in 2026?

Kimi K2.5 scored 98.6% quality with a 100% pass rate for $0.13 per run and a 29.2-second median response time. It is a thinking model from Moonshot AI, tying MiniMax M2.5 on quality and pass rate. The catch is latency: 2,002 seconds total for 38 tests (33 minutes). It generates 57,569 output tokens, 4.8x more than Sonnet, because extended reasoning traces are included in the output count.

Kimi scored 100% in the earlier 37-test version. The additional test didn't change the picture. At $0.13 per run, Kimi is cheaper than Sonnet but the latency tax (29s median) makes it impractical outside of batch processing where wall-clock time is irrelevant.

Kimi K2.5 is the most consistently validated model across benchmarks. Artificial Analysis ranks it #2 among open-weight models (Intelligence Index 47). SEAL places it #2 on MultiChallenge. In our benchmark: 98.6% quality, 100% pass rate. The only thing holding it back is speed.

How does DeepSeek R1 compare to Claude Sonnet in 2026?

DeepSeek R1 scored 96.8% with a 97% pass rate for $0.12 per run, versus Claude Sonnet 4.6 at 100.0% and $0.20 on the same 38 tasks. Every existing "DeepSeek R1 vs Claude" comparison references Claude 3.5 or 3.7 Sonnet. This DeepSeek R1 review is the first head-to-head with Opus 4.6 and Sonnet 4.6 on the same test suite.

Metric	DeepSeek R1	Claude Sonnet 4.6	Claude Opus 4.6
Quality	96.8%	100.0%	100.0%
Pass Rate	97%	100%	100%
Cost/Run	$0.12	$0.20	$0.69
Median Time	23.1s	4.6s	4.1s
Total Time	22 min	3.6 min	3.3 min

R1 is close on accuracy (96.8% vs Sonnet's 100%) but the latency gap is the story. R1 takes 5x longer per call than Sonnet. In agentic loops where calls chain sequentially, that compounds into minutes of dead time per task.

DeepSeek V3 (Chat) is the faster alternative at 5.8s median, but drops to 88.7% quality with 89% pass rate. The R1/V3 split within DeepSeek's own lineup mirrors the Flash/Pro split within Google's: the cheaper model is dramatically faster, the expensive model buys reasoning accuracy at a steep latency cost.

DeepSeek R1's weakness is consistent across benchmarks. Artificial Analysis ranks the original R1 at #32 of 65 open-weight models (Intelligence Index 27), though the updated R1 0528 scores significantly higher. Aider Polyglot has it at 56.9%. The thinking architecture buys correctness on reasoning tasks at the cost of format compliance and speed everywhere else.

If latency doesn't matter and you need to minimize cost, R1 at $0.12 is competitive with Sonnet at $0.20. If response time matters at all, Sonnet wins on every dimension except price.

How does Codex CLI compare to Claude Code for coding tasks?

GPT-5.2-codex scored 98.3% with a 97% pass rate for $0.16 per run, while Sonnet scored 100.0% with a 100% pass rate for $0.20. GPT-5.2-codex is the model powering Codex CLI, and Sonnet is the default model in Claude Code. The gap is small: 1.7 percentage points and one additional failed test.

Metric	Codex CLI (GPT-5.2-codex)	Claude Code (Sonnet 4.6)
Quality	98.3%	100.0%
Pass Rate	97% (37/38)	100% (38/38)
Cost/Run	$0.16	$0.20
Median Time	4.6s	4.6s
Code Tasks (C-group)	100%	100%

On coding tasks specifically, both scored 100%. The difference shows up in reasoning: Codex failed R4 (root cause identification), picking the proximate trigger ("traffic spike") instead of the underlying vulnerability ("pool config"). Sonnet passed it.

The cost story matters more than the accuracy gap. Codex CLI is subscription-backed with ChatGPT Pro ($200/month), so usage is a predictable fixed-cost envelope instead of an open-ended per-call API bill. Sonnet costs $0.20 per 38-test run through the API, or is available via Claude Code Pro ($20/month, rate-limited) or Max ($100-200/month, higher limits). If you're already paying for ChatGPT Pro and your work is code-heavy, Codex CLI is strong value.

What are MiniMax M2.5's benchmark results in 2026?

MiniMax M2.5 scored 98.6% with a 100% pass rate for $0.069 per run and a 15.9-second median response time. It is the least-discussed model in this benchmark and one of the strongest performers, making it one of four models (alongside Sonnet, Opus, and Kimi K2.5) to pass every test.

Metric	MiniMax M2.5	Claude Sonnet 4.6	Kimi K2.5
Quality	98.6%	100.0%	98.6%
Pass Rate	100%	100%	100%
Cost/Run	$0.069	$0.20	$0.13
Median Time	15.9s	4.6s	29.2s
Output Tokens	55,856	11,985	57,569

Where MiniMax stands out is format compliance. It returned bare JSON on 23 of 38 responses with zero explanation text, no code blocks, no Markdown wrapping. No other model was that disciplined. For batch pipelines where you need reliable structured output and the downstream consumer is a parser (not a human), MiniMax at $0.069 is hard to beat.

MiniMax M2.5 independently ranks #4 on SWE-bench Verified (80.2%), confirming this result is not a fluke of our benchmark design.

The speed penalty is real: 15.9s median, 19 minutes total for 38 tests, making it 4.4x slower than Sonnet. The thinking-model architecture generates 55,856 output tokens (4.7x more than Sonnet) because reasoning traces are included, which inflates both cost and wall-clock time.

Is Claude Haiku as good as Sonnet?

For most tasks, yes. Haiku scored 95.9% and passed 37 of 38 tests, handling extraction, code, data, and planning cleanly. The gap shows up in reasoning: Haiku failed R4 (root cause analysis), where it picked the surface-level trigger instead of the underlying config flaw. I ran Haiku as my default model for a week after this benchmark, and the pattern held. Batch work was fine, but multi-step debugging sessions needed Sonnet.

What is the cheapest LLM that actually works?

Gemini 2.5 Flash at $0.003 for a 38-test run, with 97.1% quality and a 1.1-second median response. I use it for all my batch classification and health-check jobs because the Google Workspace free tier covers most of my volume. If you want true $0.00, GPT-oss-20b scored 98.3% running locally on a Mac Studio with 192GB RAM.

Is DeepSeek as good as Claude?

R1 scored 96.8% vs Sonnet's 100%, so accuracy is close. The real gap is speed: R1's 23-second median means a 10-call agent chain takes nearly 4 minutes of dead time, compared to 46 seconds with Sonnet. I tested R1 in an agentic loop for cron debugging and the wait times made it unusable for interactive work, though it would be fine for overnight batch jobs where nobody is watching.

How does Codex CLI compare to Claude Code?

Both hit 100% on coding tasks (C-group). The divergence is reasoning: Codex failed R4 by identifying "traffic spike" as the root cause instead of "connection pool misconfiguration." At $0.16 per run vs Sonnet's $0.20, the cost difference is marginal, but Codex's subscription model (ChatGPT Pro at $200/month) makes per-call cost predictable in a way API billing does not.

Is GPT-5.2 better than Claude Sonnet?

In this benchmark, Sonnet scored 100.0% (38/38) and GPT-5.2 scored 98.0% (37/38). GPT-5.2 is faster at 3.0s median vs Sonnet's 4.6s, which adds up in high-volume loops. On the one test GPT-5.2 failed (W3, style-constrained writing), it produced an em dash despite the prompt explicitly prohibiting them, a formatting discipline issue rather than a reasoning failure.

Which LLM is best for reasoning tasks?

Sonnet 4.6 and Opus 4.6 both scored 100% across all 38 tests, including the R-group where 13.3% of all model responses failed. The specific tests that separate models are R1 (multi-step gold production calculation, where four models got the math wrong) and R4 (root cause identification, where four models picked symptoms over causes). Kimi K2.5 and MiniMax M2.5 are close at 98.6%, but they take 6-8x longer per call.

Is MiniMax M2.5 worth using?

For batch pipelines, absolutely. MiniMax returned bare JSON on 23 of 38 tests with zero wrapper text, which means fewer parser failures downstream. At $0.069 per run with 100% pass rate, I'm adding it to my overnight batch rotation for structured extraction jobs. The 15.9-second median makes it too slow for interactive work, but for anything where the output feeds a script rather than a human, it is hard to beat on value.

Should I just pay for Claude Max and stop worrying about model selection?

The Max plan removes per-call anxiety, but three constraints remain. Data sovereignty is the biggest: some of my workloads (Canadian securities data, defense-adjacent analysis) cannot leave my infrastructure, period. Second, parallel subagents burn tokens fast, and even "unlimited" plans have effective rate limits that throttle heavy concurrent usage. Third, GPT-oss-20b at $0.00 outscores Haiku, R1, and GPT-5-Nano on these tasks. Routing by task type took me about 30 minutes to set up and saves real money every day.

Benchmarks Worth Reading

This benchmark tests 38 practical tasks from one person's workflow. For broader coverage, these are the benchmarks and leaderboards I actually reference when evaluating models:

Artificial Analysis - Independent quality, speed, latency, and price measurements across 100+ models, updated daily. The best single-page comparison of the dimensions practitioners actually care about.
SWE-bench - Tests whether models can resolve real GitHub issues from popular open-source repos, end-to-end. The standard measure of agentic coding capability. The Verified subset is human-validated.
Aider Polyglot Leaderboard - 225 coding exercises across 6 languages, testing both generation and iterative debugging. Polyglot scope reflects real developer work better than Python-only benchmarks.
LiveBench - Monthly-refreshed questions with verifiable ground-truth answers, designed to combat data contamination. No LLM-as-judge, objective scoring only. Top models still score below 70%.
SEAL Leaderboards (Scale AI) - Expert-driven evaluations including software engineering, professional reasoning (finance, law), and agentic tool use. Enterprise-oriented benchmarks that other leaderboards lack.
Epoch AI Benchmarks - Historical performance tracking across major benchmarks, covering 3,200+ models. Useful for seeing where the trajectory is heading, not just where it is today.
Sebastian Raschka: A Dream of Spring for Open-Weight LLMs (Jan-Feb 2026) - Comprehensive roundup of 10 open-weight models, covering the Qwen, Gemma, and GPT-oss families this benchmark tests locally.

About the Author

Ian Paterson is CEO of a publicly traded cybersecurity company in Canada, and has used GenAI in production since 2022 for extraction pipelines, trading automation, investment analysis, and writing workflows. He runs a motley collection of personal infrastructure ranging from cheap VPS providers, repurposesed Mac Studios and old PCs on their 6th life, mostly talking to Anthropic, Google, and OpenRouter APIs. This benchmark came out of frustration at not knowing whether his API spend was optimally allocated.

For more on managing costs across multiple models, see how I route 200+ daily LLM calls across five models. I also built a unified quota tracker that monitors rate limits across Claude, Codex, and Gemini from one script.

Inference Arbitrage: How I Route 200+ Daily LLM Calls Across Five Models

Ian L. Paterson — Mon, 18 May 2026 19:58:51 +0000

Inference arbitrage means routing each AI task to the cheapest model that can handle it at acceptable quality, instead of sending everything to the most expensive one. No benchmark tells you which model to use for which task at which price point. I published a 38-task benchmark across 15 models last week and the top finding was a routing principle, not a model name: match the model to the task, and most of your tasks don't need the expensive one.

What Does My AI Workday Look Like?

I was on a flight last month, SSH'd into a cloud server over spotty airplane wifi, half a dozen subagents running in parallel. I watched my weekly token allocation drain faster than I'd planned, and by the time I landed I was rationing for the rest of the week.

I now plan heavy jobs around the weekly reset cycle. Monday, when the budget is flush, I queue the expensive reasoning tasks. By Thursday, everything possible routes to cheaper models or defers to the next cycle.

I parsed my Claude Code logs from Feb 28 through Mar 2 and categorized every session by task type.

Task Type	Sessions	Avg Duration	Estimated Calls/Day
Coding / system work / ops	48	~45m	50-80
Data analysis	18	~45m	25-40
Research	12	~35m	15-25
Writing / content	8	~60m	20-35
Email / comms	3	~30m	5-10

A typical day runs 80-120 API calls during interactive work, plus 50-200 from automated scripts. Peak days during benchmark development spiked to 7,700 calls (during benchmark automation, not typical usage). I'm a Claude Max subscriber, so take the daily-driver recommendation with that context.

The Five-Model Stack

Sonnet (Claude Code, daily driver). Where I spend most of my time. Sonnet handles everything interactive: coding, debugging, file edits, writing, planning. It scored 100% on my benchmark at $0.20/run with a 4.6s median response, and for my call volume the quality-to-cost ratio is unmatched.

Opus (escalation model). When Sonnet gets something wrong or I'm debugging a genuinely hard problem, I escalate. Opus also scored 100%, but at $0.69/run, a 3.5x premium for zero additional quality on most tasks. Where it earns that premium: ambiguous reasoning, multi-step causal chains, and problems where the first answer needs to be right because verification is expensive.

Codex subagents (cross-checking and cost spreading). I run OpenAI's Codex CLI as a deliberately separate inference channel, spreading token consumption across subscription plans and cross-checking Opus's work. Same problem, both models, compare answers: agreement means high confidence, disagreement tells me where to dig. GPT-5.2-codex scored 98.3% on the benchmark, and a second opinion from a differently-architected model has caught real bugs that single-model workflows miss. During one refactor last week, Codex flagged a race condition in a monitoring script that Sonnet had approved twice.

Gemini Flash CLI (research and file reads). Gemini reads local files via @file syntax, has built-in Google Search, and runs fast enough that I've burned through 1,000 calls in a single research sprint. I once needed founding dates and employee counts for 100 companies, and Gemini had it done in five minutes flat while Claude's budget stayed untouched. Every Gemini query is one that doesn't count against my Claude budget.

Qwen 3.5 35B on-prem (Mac Studio, async work). The slowest model in my stack, running through OpenClaw on a Mac Studio. Qwen handles cron jobs, overnight batch processing, and anything I can queue and forget: sovereignty (nothing leaves the machine) and cost (free after hardware), scoring 85.8% on the benchmark. Solid for extraction and code, but only 60% on reasoning. I tried it on a reasoning-heavy debugging session once and lost 20 minutes before escalating to Sonnet.

Full LM Studio setup and tuning guide for Apple Silicon.

The Routing Decision Tree

                                ┌──────────────────┐
                                │      OPUS        │
                                │ Complex reasoning│
                                │ $0.69/call       │
                                └────────┬─────────┘
                                         │
┌────────────────────┐          ┌────────┴────────┐          ┌────────────────────┐
│   QWEN LOCAL       │──────────│     SONNET      │──────────│     GEMINI         │
│   Sensitive data   │          │    (default)     │          │  Research, free    │
│   Overnight batch  │          │   $100/mo Max    │          │  @file, web search │
│   $0 (on-prem)     │          │   100% quality   │          │  1,000 calls/day   │
└────────────────────┘          └────────┬────────┘          └────────────────────┘
                                         │
                                ┌────────┴─────────┐
                                │     CODEX        │
                                │  Cross-check     │
                                │  Diff. arch.     │
                                │  $20/mo Plus     │
                                └──────────────────┘

Three heuristics drive most of the routing (38 tasks, so treat these percentages as directional):

Sensitive data stays on-prem. Anything touching client work or regulated industries goes to Qwen local, regardless of quality scores.
Reasoning tasks pay for frontier. Extraction and simple code score 100% on every model including free ones, but reasoning and planning show a 20-44 point gap between free and premium.
Everything else defaults to Sonnet. 100% across all categories at $0.20/run, and Claude Code's native file access makes it the only option for agentic coding loops.

I choose models at session and tool level, while Claude Code handles sub-call routing internally. In my session data, 12.2% of calls were auto-routed to Haiku for simple tasks like file reads and short bash commands, regardless of the parent session's model.

Models, Costs, and Why Each One's There

Model	Tasks	Subscription	Per-Call Cost	Why
Sonnet (Claude Code)	Interactive coding, debugging, file edits, writing, planning	$100/mo (Max 5x)	~$0.20/call	100% quality, 4.6s median, native file access
Opus (Claude Code)	Complex reasoning, ambiguous problems, escalation	Included in Max ($0.69 at API rates)	~$0.69/call	3.5x premium justified on multi-step reasoning
GPT-5.2-codex (Codex CLI)	Cross-checking critical decisions, parallel work	$20/mo (ChatGPT Plus)	Included	Different architecture catches different bugs
Gemini Flash (CLI + API)	Research, web lookups, file summaries, bulk classification	$0 (free tier)	$0	Built-in search, 1.1s response, 1,000 calls/day free
Qwen 3.5 (Mac Studio, local)	Overnight batch, cron jobs, extraction	$0 (on-prem)	$0	Sovereign, 100% on extraction, 97% on code

Total monthly spend: $120/mo. At API rates, my typical 80-120 daily interactive Sonnet calls alone would cost $480-720/mo, so the Max subscription pays for itself on volume before accounting for Opus access.

Capability Constraints and Quality Gaps by Task Type

Not every model can do everything. Before routing by quality, check whether the model even supports the capability you need.

Model	Web Search	Usage Limits	Cost
Qwen 3.5 local	Needs API key	Single-threaded	$0
Gemini Flash CLI	Yes (built-in Google)	1,000/day free	$0
Claude Code (Sonnet/Opus)	Limited (WebFetch)	225/5hr (Max 5x)	$100/mo
Codex (via ChatGPT Plus)	Yes (browser)	Quota-based	$20/mo

Quality varies dramatically by task type. These numbers come from my 38-task benchmark across 15 models, grouped by category to show where cheap models hold up and where they fall apart.

Task Category	Free Models	Cheap Paid	Premium	Gap	Verdict
Extraction	100%	100%	100%	0	Use cheapest
Simple code	97-100%	97-100%	100%	0-3%	Use cheapest
Complex code + reasoning	60-100%	80%	100%	13-40%	Pay for frontier
Writing	77-96%	89-100%	97-100%	11%	Context-dependent
Planning + system health	50-94%	94-100%	100%	25-44%	Pay for frontier
Data analysis	75-80%	75-95%	95-100%	20%	Pay for frontier
Investments	83-87%	87-100%	87%	2%	Use cheapest

If the gap between free and paid exceeds 10 percentage points, pay for frontier. Below 10, free or cheap is fine, and the savings compound across hundreds of calls per week. Paying Opus rates for extraction is a 17x premium for zero quality improvement. Routing reasoning tasks to Qwen means getting the wrong answer 40% of the time.

The full quality and cost breakdown across all 15 models is in my 38-task LLM benchmark.

Routing in Practice: Three Real Examples

Web research batch (100 companies, pulling founding year, HQ, employee count, latest funding). Gemini Flash handles this in 5 minutes at $0 because it's the only programmable option with built-in web search.

Categorize 1,000 local files. Qwen local runs overnight at $0 but takes 107 minutes; Gemini Flash finishes in 17 minutes via @file syntax. Claude Code Max could do it, but burning a $100/mo subscription on classification wastes its real value.

Clean up a 100-file codebase. Claude Code Max is the only option that autonomously navigates a repo, edits files, runs tests, and recovers from errors, so there's no real alternative for this class of work.

What Are You Actually Paying For?

Free Only	Split Stack	Max 5x + Supplements (my setup)	Max 20x + Supplements
Monthly cost	~$10 (electricity for the Mac Studio)	~$40	~$120
What's included	Qwen local + Gemini free + gpt-oss-20b via OpenRouter	Claude Pro ($20) + ChatGPT Plus ($20) + Gemini free + Qwen local	Claude Max 5x ($100) + ChatGPT Plus ($20) + Gemini free + Qwen local
Message limit	None (local), 1,000/day (Gemini)	Rolling caps on Claude Pro and ChatGPT	225/5hr window
Best for	Privacy-sensitive work, budget-zero	Individual devs, <4 hrs AI coding/day	Most professional developers

The free tier can't do agentic coding without building the plumbing yourself, and reasoning accuracy drops to 60%. The split stack at $40/mo gets to 95-97% quality, but Claude Pro's rolling usage cap will hit at the worst moment, deep into a complex session. Max 5x at $100/mo is the practical sweet spot for most work: Sonnet and Opus on demand, native file access, and 225 messages per 5-hour window that most sessions don't exhaust.

I use Max 5x ($100/mo), and most weeks the 225-message/5hr ceiling is enough. Some weeks I barely touch it. Other weeks, batch jobs and sustained research sessions push past it by Tuesday, and I'm rationing or deferring work to the next reset window. Max 20x ($200/mo) would eliminate that ceiling anxiety, but I haven't found the extra $100/mo justified yet. For competitive context, ChatGPT Plus runs $20/mo, Google AI Ultra hits $250/mo, and SuperGrok is $30/mo.

Provider Trust and Jurisdiction Risk

DeepSeek R1 scores 96.8% and MiniMax M2.5 hits 98.6% at $0.07/run, so the quality is genuinely competitive. The question is whether you trust the provider's data handling. The Canadian federal government restricted DeepSeek from government devices in February 2025, BC banned it from provincial devices, and in February 2026 Anthropic alleged that DeepSeek, Moonshot AI, and MiniMax ran coordinated distillation attacks targeting Claude.

My position: less-trusted models for personal experimentation on non-sensitive data, kept away from client work or regulated industries. Running them via OpenRouter routes calls through US infrastructure, which reduces but doesn't eliminate the risk.

Where Benchmark Meets Practice

The benchmark suggests Haiku (95.9%, $0.04/run) is the optimal cost-quality model, and Claude Code already routes Haiku-appropriate calls (short responses, file reads, simple bash) automatically. In my session data, roughly 70% of calls fit that profile.

But I also offload work to Gemini Flash that would otherwise burn Haiku calls against my Claude quota. Gemini is faster (1.1s vs 2s), has built-in web search, and doesn't count against my Max 5x message ceiling at all. Every file summary or web lookup I route to Gemini is one fewer call ticking down my 225-message window.

Claude Code doesn't expose per-turn routing decisions, so the gap between the optimal routing table and what the tooling actually supports is where real savings sit.

For the practitioner companion piece - what happens when a routed session fails to catch a subtle bug because the deterministic tool isn't in the pipeline - see The LLM Kept Saying ‘Fixed.’.

FAQ

Does extended thinking burn more Claude Max quota?

Yes. Thinking tokens count against your quota, and a thinking-heavy model generates roughly 5x more tokens than standard. On Max 5x (225 messages/5hr), heavy thinking hits the ceiling 3-4x faster than standard Sonnet calls.

If I spawn a Haiku subagent from Claude Code, does that count as a Haiku call or Sonnet call?

Claude Code's subagent routing is opaque but observable. In my session data, 12.2% of calls were routed to Haiku automatically, and subagent-heavy workflows are more quota-efficient than single-thread sessions.

Why not use an ML-based router like RouteLLM?

For my workflow (80-120 calls/day, 5 models), the routing logic fits in my head: extraction goes cheap, reasoning goes frontier, everything else defaults to Sonnet. The router overhead only makes sense at enterprise scale where per-call savings outweigh the routing infrastructure cost across millions of calls.

Should I use a thinking model for agentic coding loops?

Generally no. Thinking models add 10-25 seconds of latency above baseline per call for the reasoning phase, and in agentic loops with 50+ sequential calls that compounds to roughly 24 minutes of wall clock time versus about 1.7 minutes with Haiku at 2s per call. Use thinking models for single high-stakes decisions and fast models for the iterative loop.

Is it worth running a local model daily?

Yes, but depends on workload mix. Qwen 3.5 locally scores 100% on extraction and 97% on code, covering overnight batch jobs at zero marginal cost. The tradeoff: ~29s median response and 60% reasoning accuracy versus 100% for frontier. If you have batch work that can run overnight and care about data sovereignty, a local model pays for itself in the first month.

If you want the full build recipe for a CUDA-side local setup, I documented the from-source llama.cpp build on a Dell T5820 and RTX 3090 Ti separately, including the boot-loop fix and the production server flags. For the three months of speed tuning that followed (autoregressive baseline, DFlash, MTP, ending at 39 to 49 tok/s on a single 3090 Ti), see Three Months of Speed-Up Experiments on a 3090 Ti.

Can I use DeepSeek or other less-trusted providers for production work?

The quality is real (R1 scored 96.8%), but several governments have restricted specific providers, and Anthropic has documented distillation attacks by some. For anything touching client data, trusted providers or local only.

Companion toLLM Benchmark 2026: 38 Actual Tasks, 15 Models for $2.29, which has the full quality, cost, and speed data across all 15 models. The benchmark test suite and scoring harness are on GitHub.

Stop Claude Code from Lobotomizing Itself Mid-Task

Ian L. Paterson — Mon, 18 May 2026 19:58:28 +0000

Claude Code has a feature called auto-compact that quietly destroys your session quality.

The Problem

I was three hours into a multi-file refactoring session, had just finished explaining which modules needed interface changes and which were already done. Auto-compact fired at 80% context. When Claude came back, it had no idea which files had been edited and which still needed changes. It suggested modifying a file I'd finished an hour earlier and forgot a constraint I'd repeated twice. I had to re-explain the entire task from scratch.

The community calls this "lobotomization" and it's accurate. Post-compaction Claude loses track of what repo you're in, forgets constraints you set, drops skills you invoked. The quality drop is immediate and obvious.

A note on terminology: throughout this post I reference /flush, /project, and other slash commands. These are custom Claude Code skills I built (stored as markdown files in ~/.claude/commands/), not built-in features. /project loads a project's saved state (CLAUDE.md with Working/Blocked/Next, domain topic files, recent daily logs) so Claude knows what happened last session. /flush captures the current session state to multiple places (daily log, project state, MEMORY.md, topic files) so the next session can pick up where this one left off. Think of /project as "load game" and /flush as "save game." The skill files are simple markdown prompts that Claude follows as instructions.

The Technical Reality

Community analysis of Claude Code's minified source found a hardcoded buffer that triggers compaction. The exact value has changed across versions (it was ~13k in early 2026, reportedly ~33k in later builds), but the mechanism is the same:

var jbA = 13000;

function lwB() {
    let T = UhT();      // Available input (context - max_output_tokens)
    let R = T - jbA;    // Subtract 13k buffer
    return R;           // This is where auto-compact triggers
}

Buffer Reserved = max output tokens + safety buffer.

Output Token Setting	Buffer Reserved	Usable Context
64k (max)	77k (38.5%)	123k
32k (default)	45k (22.5%)	155k
Auto-compact off	None	200k

With default settings, you're losing 22.5% of your context window to a safety buffer you may never need.

These numbers reflect the ~13k buffer from early 2026 builds. If the buffer has increased in your version, the usable context will be lower. The principle holds regardless: auto-compact reserves a meaningful chunk of your context window.

Why This Matters

Research on iterative context rewriting (arXiv 2510.04618) calls it "context collapse": when LLMs rewrite their own context iteratively, accuracy drops. Claude Code's auto-compact does exactly this.

Compaction is lossy compression. You're asking an LLM to decide what's "important" and discard the rest. For the kind of work I do (multi-file refactors, constraint-heavy debugging sessions, anything where I've spent twenty minutes explaining what not to touch), that's a bad tradeoff.

The Fix

Add one line to ~/.claude.json:

{
  "autoCompactEnabled": false
}

Or run /config in an active session and toggle "Auto-compact enabled" off.

The New Workflow

With auto-compact disabled, you get the full 200k context window with no buffer held in reserve. No surprise compaction mid-task. The session stays coherent until you decide otherwise.

When I do need to reclaim context, I use /compact with explicit instructions: /compact "focus on the authentication refactor, discard the earlier debugging". This way I control what survives instead of letting Claude guess. More often, though, the better move is /export to save the conversation, then start a fresh session and have Claude read it back in. Fresh context, curated history, no lossy summarization.

I also dropped my flush threshold from 75% to 60%. When context hits 60%, Claude prompts me to run /flush before continuing. This captures the session state while there's still room to maneuver.

The Tradeoff

Disabling auto-compact means sessions will hit the context limit. You'll need to manually compact, export and restart, or finish the task before running out. I'd rather use that 45k toward skills, rules, and context I deliberately set up than reserve it for an automatic summarization that makes Claude worse.

Anthropic's own guidance has moved toward recommending auto-compact stay enabled, and for simple single-task sessions that's reasonable. For constraint-heavy work where I've built up significant context (multi-file refactors, debugging with specific rules about what not to touch), the automatic summarization loses too much.

Claude Code now supports 1M token context windows on some plans, which reduces the urgency of this problem. But context hygiene still matters at any window size. A 1M window fills up the same way a 200k window does, just slower. Knowing when to flush deliberately versus letting auto-compaction guess is a workflow discipline, not a context size question.

Session Management (Beyond Compaction)

The Context % Statusline

I have a custom statusline script that shows [XX%] at every prompt so I always know how much context is left. The script reads context_window.used_percentage from Claude Code's JSON input and renders it inline:

# ~/.claude/statusline-command.sh
PERCENT=$(echo "$input" | jq -r '.context_window.used_percentage // 0' | cut -d. -f1)
printf "\033[01;32m%s@%s\033[00m:\033[01;34m%s\033[00m [%s%%]" "$user" "$host" "$cwd" "$PERCENT"

The Emergency Rescue

When you hit >90% and can't even run /flush (the output won't fit), the workaround is a second terminal:

# In the dying session:
"Write a summary of everything we've done to ~/session-dump.txt"

# In the fresh session:
"Read ~/session-dump.txt and run /flush for those projects"

The dying session has enough context left to write a file. The fresh session has enough context to process it. One session as the brain, the other as the hands.

Thresholds

Context %	Action
<60%	Work normally
60%	Claude suggests /flush
85%	Claude auto-runs /flush (non-interactive)
>90%	Emergency: dump to file, rescue from new session
100%	Session dead. Hope you flushed.

The Session Lifecycle

Session Start: /project

Every session starts with /project <name>, which fuzzy-matches against a project index, reads the project's CLAUDE.md (including the State section with Working/Blocked/Next), pulls in domain-specific topic files, and checks recent daily logs for open threads. After that, Claude knows what I was doing last session, what's blocked, and what's next. I've wasted enough time re-explaining context to treat this as non-optional.

During the Session

MEMORY.md is always available (auto-loaded). Topic files get pulled when the task crosses domains. Context files from llm-context/ are loaded as needed.

Session End: /flush

Think of /flush as a save game hotkey. You just scraped through a gnarly debugging session, you're at 65% context, and the next task is going to be heavy. /flush captures session state to three or four places: daily log, project state (Working/Blocked/Next), MEMORY.md for new lessons, and optionally a domain topic file if something specialized came up. All append-only, new entries prepended, old entries never dropped.

I learned the hard way that skipping /flush before switching tasks means the next session starts cold. Claude picks up CLAUDE.md and has no idea what happened in the gap.

I watched this exact failure mode play out over three months on a cron health system, where each cold-start session ratified a broken fix (see: The LLM Kept Saying ‘Fixed.’).

After /flush: Why /clear Beats /compact

After flushing, I use /clear instead of /compact. Compaction spends tokens on a lossy summary. /clear costs zero tokens and gives a clean 200k window. The memory system means nothing is lost: /flush already persisted everything worth keeping, and /project reloads it.

/compact still has a use case mid-session (at ~70% context, same task, don't want to reload everything). For session boundaries, /clear is strictly better.

Daily Cron

Drift check confirms indexes match filesystem reality. Heartbeat alerts for overdue P1 tasks get written to the daily log automatically.

Data Flow Summary

/project (read) → work → /flush (write) → /clear → /project (read) → ...

Each session starts where the last one left off, without carrying the previous session's token baggage.

Hooks as Workflow Enforcement

Claude Code's settings.json supports lifecycle hooks: PreToolUse, PostToolUse, and PermissionRequest. These fire on every tool call, and the output gets injected into the conversation as a system reminder. Most people use them for security (blocking dangerous commands). But they're also the best way to enforce workflow patterns that instructions alone can't guarantee.

Example: Forcing a Plan Template

I have a plan template at ~/llm-context/plan-template.md with pre-flight checklists, agent routing guidance, TDD decisions per step, and a verification matrix. CLAUDE.md says "use this template for any plan with 3+ steps." Claude follows this instruction maybe 70% of the time. When it doesn't, the plan is worse: missing verification steps, no fail-fast conditions, no agent routing.

The fix is a PostToolUse hook on EnterPlanMode. When Claude enters plan mode, a 5-line shell script fires and injects the entire plan template into the conversation as a system reminder. Claude can't miss it because the template is literally in the conversation at the moment planning starts.

The hook script:

#!/bin/bash
TEMPLATE="$HOME/llm-context/plan-template.md"
if [[ -f "$TEMPLATE" ]]; then
  echo "MANDATORY: Use the following plan template. Follow its structure exactly."
  echo ""
  cat "$TEMPLATE"
fi

The settings.json entry:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "EnterPlanMode",
        "hooks": [{
          "type": "command",
          "command": "/path/to/plan-mode-template.sh",
          "timeout": 5
        }]
      }
    ]
  }
}

Why This Works Better Than Instructions

A PostToolUse hook injects the template into the conversation at the moment it's needed. Claude can't miss what's already on screen. CLAUDE.md instructions compete for attention with everything else in the system prompt, and compliance hovers around 70%. The hook makes it 100%.

Any time you catch yourself thinking "Claude keeps forgetting to do X before Y," that's a hook. I have a PostToolUse on WebFetch that injects prompt injection warnings after fetching external content (caught a real issue once when a scraped page had instructions embedded in it). I have a PreToolUse on Bash that adds safety checks before shell commands in my infrastructure directories. A PostToolUse on Write/Edit that reminds Claude to lint after file modifications.

What Belongs Here vs Memory Architecture

The session management side is the easy part. How the memory files are structured, what goes in MEMORY.md vs topic files vs project state, is the part most people get wrong. That's a separate post.

The LLM Kept Saying “Fixed.” For Three Months, It Wasn’t.

Ian L. Paterson — Mon, 18 May 2026 19:56:26 +0000

That afternoon a Slack bot told me a script had NEVER RUN. That was a lie. The script had pulled 81 weather observations two minutes earlier. Unwinding the lie took three hours.

The bigger lie had been running for three months underneath it.

Three months of "got it"

Before the session in this post, the cron health alert had been firing two or three times a week for three months. Each time, I'd paste the alert into a Claude Code session and ask the LLM to figure out why a script was reporting NEVER RUN. Each time the LLM would root around, land on something plausible, propose a fix, and confirm it with some variation of "yep, that's it, we got it." I'd apply the fix and move on.

The fix was never the fix. Some of the time the LLM had just pushed alerted_until a few months forward, quieting the alert without touching the structural bug. Some of the time it edited the wrong file. Either way the alert came back within a week or two on a different script, and the loop rolled forward.

Each session was a cold start. The model had no memory of the previous session, of the pattern, of anything. The workflow failure was mine. I was treating fifteen independent debugging sessions as if they were one ongoing conversation, and the model was only seeing the one in front of it.

I had an inbox and a model saying "got it," and that was enough.

I run about sixty-six scheduled scripts on a personal VPS. This is one story from that pile. I'd benchmarked the frontier models a few weeks earlier. The tier mapping that came out of it was fine for planning work. It was not sufficient for catching hallucinated fixes.

What was actually broken

The cron health monitor is a dead man's switch. Every scheduled script is supposed to send a heartbeat ping on each run. If the monitor doesn't see a ping within the expected window, it fires a Slack alert.

Healthchecks.io, Cronitor, and Dead Man's Snitch all solve this as a service: you get an HTTP endpoint per check, you hit it from your cron script, and they alert you if a ping goes missing. My system was a homegrown version of the same pattern, which is why the bugs described here were possible. A SaaS monitor would have refused to let me register a slug that had never sent a ping.

The architecture had three components that had to agree:

 crontab.txt
  (slug tag)
      |
      v
checks.json
(registry) ------> source code
      |           (health_run())
      |                  |
      +------> pings.json <------+
              (runtime record)

Nothing enforced the edges. You could register a script in checks.json without adding a health_run() call. You could tag a cron line with a slug that didn't exist in the registry. You could mute an alert indefinitely without touching source.

Every new cron script I'd shipped had reproduced the same bug. Registered in the registry, no ping call in source, alert fires, alert gets muted. The monitor was doing its job. I was systematically ignoring it.

"Didn't we just fix this?" - Me, that afternoon, wrong.

The bug that would have wiped everything

I opened a session and put Opus 4.6 on architecture review while Codex CLI (GPT-5) rewrote the validator and tagged 66 cron lines with their slugs. About thirty-five minutes, start to finish.

"[K-2SO] The chaos has been slightly inconvenienced." - Opus 4.6, after round 2, before we discovered the crontab-apply.sh bug that would have silently deleted every scheduled job on the VPS.

(K-2SO is the sardonic persona I prompt Claude Code with.)

Four audit passes in, Sonnet 4.5 with shellcheck wired in flagged a bug in crontab-apply.sh that neither Opus nor Codex had caught during implementation or the cold audit round. The script was supposed to install a new crontab safely. The actual sequence was:

# BEFORE: install first, verify after
crontab new.txt                             # already live
crontab -l > verify.txt
diff crontab.txt verify.txt || exit 1       # too late

If the diff failed, the script exited 1. The new crontab was already live. A malformed crontab.txt would have wiped every scheduled job on the VPS with no restore path.

The fix is obvious once you see it:

# AFTER: verify first, install only on pass
crontab -n new.txt || { restore_backup; exit 1; }
crontab new.txt
crontab -l > verify.txt
diff crontab.txt verify.txt || { restore_backup; exit 1; }

This bug had been in the script since the script was written. Opus and Codex both looked at the file and missed it. I never looked at it at all. I was trusting the two frontier models to flag anything off.

Before this session, shellcheck wasn't in my review pipeline at all. When Sonnet 4.5 caught the bug on Round 4, it wasn't because the model out-reasoned Opus and Codex. The qa-bash skill wires shellcheck into the review. Once shellcheck scanned the file, it flagged the order-of-operations pattern on its own. Sonnet read the output and passed it upstream. A validator that always passes isn't a validator. It's a confidence injection machine. I had been using the whole session pipeline as one for three months.

The numbers

Session length:            3 hours
Distinct bugs found:       18 (9 during implementation wave, 9 across 4 audit passes)
Audit passes:              4 (Codex cold audit + qa-bash + qa-python + Opus final)
Pass-by-pass bug count:    5, 1, 3, 0
Cron lines tagged:         66
Coverage:                  0% -> 94% (86 tests)
Scripts never pinging:     handful -> 1 (legitimate edge case)
Alerts muted to 2099:      15 -> 1 (legitimate intentional mute)

Key lessons when working with LLMs

1. Use deterministic tools wherever possible

Shellcheck, pytest-cov, mypy, a type checker, a linter, a schema validator, any tool that either finds a bug or doesn't find a bug with no probabilistic layer in between is the first thing you should reach for. LLMs are useful for everything that can't be checked deterministically, but stacking more LLM passes is not a substitute for a single deterministic tool with domain-specific rules.

The LLM on its own had confidently endorsed broken fixes for three months. A shell linter caught a crontab-wipe bug on its first scan.

2. Picking which LLM for each pass

The tier mapping I use with Claude Code:

Opus 4.6 : architecture review, session planning, final-pass oversight. Best at noticing what's missing from a diff.
Codex CLI (GPT-5) : implementation. Writes the code fastest and sticks closest to the plan.
Sonnet 4.5 : skill-driven QA passes where a deterministic tool is wired in. I use two custom Claude Code skills, qa-bash (which runs shellcheck) and qa-python (which runs pytest, pytest-cov, and mypy). The model drives the skill, the tool finds the bugs.
Haiku 4.5 : structured extraction, tight tasks with known output shape, anything that would be overkill for a bigger model.

Your stack will be different. The piece worth copying is pairing each LLM review pass with a deterministic tool, not stacking prompts.

3. Loop until zero, in contained systems

On a personal cron supervisor, roughly two hundred lines of logic with deterministic inputs, the bug count trends toward zero across passes. Pass one found five bugs. Pass two found one. Pass three found three (the implementation wave had created new surface to audit). Pass four found zero. That's where I stopped.

Past that size, or once database side effects and real concurrency enter, the pass count stops converging and "loop until zero" becomes a paralysis spiral. This is a rule for small, fully-owned systems, not for production services with moving dependencies.

Quick reference

What is a dead man's switch in cron monitoring?

A monitoring pattern where the absence of a signal triggers the alert, not the presence of one. Every scheduled script sends a heartbeat ping on each run. If the monitor doesn't see a ping within the expected window, the script is assumed dead. Healthchecks.io, Cronitor, and Dead Man's Snitch are commercial implementations.

How does this compare to healthchecks.io or Cronitor?

Same pattern, rolled by hand. A SaaS monitor wouldn't have let me register a slug and never ping it, because the slug doesn't exist until the first ping lands. Most of the referential integrity gaps that caused the bugs in this post are enforced by those services at signup.

What is the "install-before-verify" anti-pattern?

Installing a change to a live system before validating it, with no rollback path if validation fails. In the crontab case, crontab new.txt made the new config live immediately, and the diff check that followed couldn't undo it. The fix is to validate syntax in a staging slot first with crontab -n, then install on pass.

What is the "muting is not fixing" anti-pattern?

Responding to a noisy monitor by silencing it (pushing alerted_until forward, adding a filter, raising a threshold) without addressing why the alert fired. Debt accumulates invisibly. Three months of mutes looks fine on the dashboard. One bad state escaping to production recovers the debt with interest.

What is referential integrity in a cron monitoring system?

The property that the three components describing a scheduled job (the crontab line, the registry entry in checks.json, and the health-ping calls in source code) must all agree. Without enforcement, you can register a job that has no ping call, tag a cron line with a slug that doesn't exist in the registry, or mute an alert indefinitely without touching source. SaaS monitors enforce this at signup. A homegrown system has to add the gate deliberately.

I had been fixing this bug, one alert at a time, for three months. Every fix was a mute. Every mute was a debt I told myself I'd deal with later. I didn't.

Three hours with a shell linter. I had spent more than that, cumulatively, letting a confident LLM talk me out of reading my own code.

Fix once is a lie. Loop until zero.

How I Track Claude, Codex, and Gemini Quotas from One Script

Ian L. Paterson — Mon, 18 May 2026 19:56:25 +0000

(If you're trying to decide which model to switch to when one runs dry, I benchmarked 15 models on 38 real coding tasks with full cost-per-task breakdowns.)

I run three AI coding CLIs daily. None of them tell me whether I'm about to hit a rate limit. I periodically get locked out mid-task and spend ten minutes figuring out which tool ran out, when it resets, and whether I should switch models or wait.

I built a script that collects quota data from all three, writes it to a single JSON file, and runs on an hourly cron. The whole thing feeds a status line in Claude Code:

Session: ███░░░⏐░░░░░░░░░░░░░ 10% (3h12m left)
Weekly:  ████████░░⏐░░░░░░░░░ 44% (Thu Mar 05 8pm PT)

The filled blocks (█) show usage consumed. The marker (⏐) shows where you are in the time window. If the blocks outpace the marker, you're burning budget faster than time is passing.

How do you query Claude Code's rate limit programmatically?

Claude Code authenticates via OAuth, with credentials stored at ~/.claude/.credentials.json. What I couldn't find in any official documentation is that api.anthropic.com/api/oauth/usage returns the data you need: utilization percentages and reset timestamps for both the 5-hour rolling window and the 7-day weekly allocation. It's used internally by Claude Code's HUD, but it doesn't appear in Anthropic's public API reference.

TOKEN=$(jq -r '.claudeAiOauth.accessToken // empty' "$HOME/.claude/.credentials.json")

curl -s --max-time 10 \
  -H "Authorization: Bearer $TOKEN" \
  -H "anthropic-beta: oauth-2025-04-20" \
  https://api.anthropic.com/api/oauth/usage

The response:

{
  "five_hour": { "utilization": 0.42, "resets_at": "2026-02-28T17:00:00Z" },
  "seven_day": { "utilization": 0.61, "resets_at": "2026-03-07T08:00:00Z" },
  "seven_day_sonnet": { "utilization": 0.35, "resets_at": "2026-03-07T08:00:00Z" }
}

I found this by searching Anthropic's GitHub issues for "usage" and "quota," then trying endpoints until one worked. The catch is that anthropic-beta: oauth-2025-04-20 header. Without it, you get a 401. The date in the version string suggests this header has been updated before, so when it changes again, the collector breaks silently.

Claude Code uses a 5-hour rolling session window, not a 24-hour one, and the weekly limit resets Thursday at 8pm PT, not midnight UTC. I only discovered both by watching the numbers change over a week of collection. The docs don't mention either detail.

I've been hitting this endpoint hourly since February with zero failures. But "undocumented beta endpoint" is not a phrase that inspires long-term confidence.

How do you track Gemini CLI usage without an API?

Gemini CLI has a /stats command that works interactively, but non-interactively it completes silently with no output (the stats render through Ink, a React-based terminal renderer, which doesn't survive piping or capture). GitHub issue #19067 has a user asking the same question I had. The maintainer response: "there's no way within Gemini CLI to see your daily quota."

The workaround: Gemini stores session files at ~/.gemini/tmp/_/chats/session-YYYY-MM-DD_.json. Each file is one session. The free tier allows 1,000 requests per day, resetting daily. There's no official way to query how many you've used. You count your own files. I verified the session counts against Google's AI Studio usage dashboard to make sure the file-based approach was tracking correctly.

A note on what this actually measures: the free tier limit is requests, but what we're counting is session files and token consumption. Session files don't map 1:1 to API requests (a single session can contain multiple turns), so the session count is a lower bound, not an exact quota meter. For my usage patterns, the counts track closely enough to be useful as a warning signal, but they won't catch you at exactly request 999.

Counting files only tells you session counts. For actual consumption data, you parse the JSON. Each session file has a messages array where every message includes a tokens.total field:

files = glob.glob(os.path.join(base, '*/chats/session-*.json'))
for f in files:
    fname = os.path.basename(f)
    file_date = fname[8:18]  # YYYY-MM-DD from session-YYYY-MM-DDTHH-MM-*.json
    with open(f) as fh:
        data = json.load(fh)
    file_tokens = sum(
        m.get('tokens', {}).get('total', 0)
        for m in data.get('messages', [])
    )
    if file_date in week:
        week[file_date]['sessions'] += 1
        week[file_date]['tokens'] += file_tokens

The output includes today's session and token counts, lifetime totals, and per-day breakdowns for the past week. All from flat files that were never intended to be an API. If Google changes the session file schema or directory structure, there's no migration path. You find out when the numbers stop updating.

How does Codex CLI expose rate limits?

Codex has an interactive /status command that shows rate limits in a TUI modal. The same problem as Gemini: it only works inside the REPL (the interactive read-eval-print loop). codex /status exits immediately with nothing.

My first approach felt like performing surgery with oven mitts: launch Codex inside a tmux session, send /status as keystrokes, wait for the modal to render, press Escape to dismiss it, send /help to push the status into the scroll buffer, capture 300 lines of scroll history, grep for "% left." It required sleep statements between every step and never worked reliably.

Then I found codex app-server.

Codex ships with an app-server subcommand that speaks JSON-RPC over stdin/stdout. OpenAI has documentation for it, though I only found it after discovering the feature in the source code. The account/rateLimits/read method isn't prominently featured, but it works. You spawn the process, send an initialize handshake, then call it. The 500ms delay between handshake and request is necessary because shorter values produce empty responses before the connection is ready.

const proc = spawn('codex', ['app-server'], { stdio: ['pipe','pipe','ignore'] });
const rl = readline.createInterface({ input: proc.stdout });

const send = (m) => proc.stdin.write(JSON.stringify(m) + '\n');

// Handshake
send({
  method: 'initialize', id: 0,
  params: { clientInfo: { name: 'quota-collector', title: 'Quota Collector', version: '1.0' } }
});

// Request rate limits after handshake completes
setTimeout(() => send({ method: 'account/rateLimits/read', id: 1, params: {} }), 500);

The response comes back with rateLimits.primary (five-hour window) and rateLimits.secondary (weekly), each with usedPercent and resetsAt. No scroll buffer archaeology required.

Codex also maintains a SQLite database at ~/.codex/state_5.sqlite (the _5 is a schema version, so this path may change in future releases) with a threads table that tracks every session:

WITH days AS (
    SELECT date('now', '-' || n || ' days') AS day
    FROM (SELECT 0 AS n UNION SELECT 1 UNION SELECT 2
          UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6)
)
SELECT d.day, COUNT(t.id) AS sessions, COALESCE(SUM(t.tokens_used),0) AS tokens
FROM days d
LEFT JOIN threads t ON date(t.created_at, 'unixepoch') = d.day
GROUP BY d.day ORDER BY d.day;

The collector: one cron, one JSON

claude_json=$(collect_claude)
codex_json=$(collect_codex)
gemini_json=$(collect_gemini)

ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)

jq -n \
    --argjson claude "$claude_json" \
    --argjson codex "$codex_json" \
    --argjson gemini "$gemini_json" \
    --arg ts "$ts" \
    '{collected_at: $ts, claude: $claude, codex: $codex, gemini: $gemini}' \
    > "$OUTFILE"

After writing latest.json, the script appends a compacted copy to a daily JSONL file for historical tracking. One line per collection, roughly 15 lines per day, about 2KB. Want to know how fast you burned through your five-hour window last Tuesday? jq -r '.claude.five_hour.utilization' quota/history/2026-03-11.jsonl and watch the numbers climb line by line.

How I display it

A separate status line script reads latest.json and renders progress bars on every Claude Code interaction. Green below 70%, amber at 70-90%, red above 90%. The time marker (⏐) gives you an instant burn-rate signal without reading numbers.

The same JSON feeds a dashboard with budget bars and threshold alerts. Both are reading a flat file, no live API calls during rendering. The hourly cron does the expensive work once, everything downstream reads the result.

Why not just use ccusage?

ccusage reads Claude Code and Codex JSONL logs, giving you per-model cost breakdowns with date filtering. It's better than the collector at USD cost tracking and historical queries. If that's what you need, use it.

I built my own because the collector does three things ccusage doesn't:

It tracks Gemini (1,000 req/day free tier, no JSONL logs to read, only session file parsing).
It acts on the data (configurable spend alerts, pipeline kill switches when costs spike).
It burns unused Codex budget on maintenance tasks when weekly utilization drops below 60%.

It also doesn't require an additional MCP server. It reads a flat JSON file.

The usage data exists somewhere in every AI CLI. It might be an undocumented endpoint, a pile of session files, or a JSON-RPC server. An hourly cron that normalizes it all into one JSON file is about 250 lines of bash, Python, and JavaScript. An afternoon of reverse engineering, once you know where to look.

OpenClaw: 13 Errors, $1.50/Month, and an AI Team That Doesn’t Need the Cloud

Ian L. Paterson — Sat, 16 May 2026 23:10:31 +0000

I run a team of AI agents on a Mac I bought in 2022. They handle my Slack, run research, draft content, monitor infrastructure, and spawn sub-agents for compound tasks. The whole operation costs me $1.50 a month in electricity. Zero API fees, zero cloud provider. Just OpenClaw in a Docker container and a 30-billion parameter model running locally.

This is the full setup, including every config that matters, every error I hit, and the performance tuning that took me from 12 tokens per second to 49. (Full cost breakdown: $1.50 electricity vs. the $330/month I was paying before. Details at the end.)

What OpenClaw Actually Is

I first installed OpenClaw the week it launched in January 2026. By that point it had already crossed 100,000 GitHub stars. Peter Steinberger's interview with Lex Fridman is worth the watch if you want the full backstory.

OpenClaw is an open-source AI agent framework. You give it access to tools (shell, browser, file system, messaging) and a model, and it acts autonomously. You can talk to it through Slack, a web UI, or the CLI. It supports sub-agents for compound tasks, custom skills, and model routing across multiple providers.

The key difference from Claude Code: OpenClaw runs in a loop. It has a heartbeat system that keeps agents alive and working even when you're not at the keyboard. Claude Code (in its default interactive mode) waits for you to type something. OpenClaw doesn't. That's what drew me in. I wanted agents running overnight, picking up tasks from a queue, monitoring infrastructure, drafting reports while I slept. Right now my agents handle Slack triage (summarizing threads, flagging action items), run multi-source research with parallel sub-agents, draft first passes of content like this article, and monitor my infrastructure for drift. The compound task pattern is the most useful: I describe what I want, and OpenClaw spawns three sub-agents to tackle different angles simultaneously.

You can run it in Docker, on a VPS, on a Mac Mini under your desk, or an old PC with a GPU. Pair it with a local model and your cloud costs go to zero.

The 30,000+ exposed instances found by security researchers tell you two things: a lot of people are running this, and most of them didn't read the security docs. I nearly joined that list myself (see Error #9 and the network isolation section above).

OpenClaw Setup: Installation and First Config

I started with a fresh Ubuntu system and followed the install docs on GitHub. The first thing you hit is missing build tools. apt-get install python3, gcc, make, and the usual suspects before the install script will complete. Node.js 18+ is required.

The first run drops you into a config wizard that asks about gateway mode, model selection, and channel setup. I skipped it and wrote the config files directly. The wizard is interactive, which doesn't work if you're setting up over SSH or scripting the deployment.

You need three things configured:

A model provider. I started with OpenRouter for testing before moving everything to my on-prem server. Initially I pointed it at Opus, which worked perfectly and consumed credits at an alarming rate. Scaled down to Sonnet, then found MiniMax M2.1 and Kimi2.5 as workable cheap alternatives. I avoided the free models entirely because free endpoints on OpenRouter don't reliably support tools, system prompts, and structured output, which OpenClaw requires. They also route through providers whose data retention policies vary. Defeats the purpose if you're trying to own your stack.
A messaging channel. I use Slack with Socket Mode. No public webhook endpoint needed, just a bot token and an app token. The non-obvious gotcha: under App Home, "Allow users to send messages from the messages tab" must be checked. It's separate from OAuth scopes and nothing in the docs tells you this. Socket Mode connects fine without it. Messages just silently never arrive.
Network isolation. I didn't bother with the application-level security settings initially. Instead I firewalled everything off, set up Tailscale, restricted SSH to private keys only, and bound all services to localhost. If nothing is listening on a public port, most of the hardening guides are solving a problem you don't have.

The config lives in three places:

File	What it does
Environment file	API keys, tokens, gateway token
clawdbot.json	Model, plugins, channels
auth-profiles.json	Model provider authentication

One warning: clawdbot config set says "Updated" but doesn't always write to the file the gateway actually reads. Edit the JSON directly. I stopped using the CLI for config changes after the third time it silently did nothing.

Connecting LM Studio as a Local LLM Provider

I run LM Studio on a 2022 Mac Studio (M1 Max, 32GB). The model is Qwen3-Coder-30B-A3B, a mixture-of-experts architecture where 3 billion parameters are active per token but all 30 billion live in memory. In GGUF Q4_K_S quantization it's about 17.5GB on disk.

LM Studio exposes an OpenAI-compatible API on localhost. OpenClaw connects to it like any other model provider. If the machine running OpenClaw and the machine running LM Studio are the same box, you just point at localhost and you're done.

My setup is split across two machines. OpenClaw runs on a server, the Mac Studio runs the model. An SSH reverse tunnel connects them over Tailscale, which means you've got an encrypted tunnel inside an encrypted VPN. Belt and suspenders.

Server (OpenClaw, localhost:port)
        ↓
   Tailscale mesh (encrypted)
        ↓
   SSH tunnel (encrypted)
        ↓
Mac Studio (LM Studio :port)

The tunnel adds less than one millisecond of latency. The bottleneck is always model inference, never the network.

I use a macOS LaunchAgent to keep the tunnel alive. It reconnects automatically after network drops, sleep/wake, router reboots. No third-party tools needed.

Bonus: Residential IP

There's an added bonus to running on-prem hardware at your house that I didn't anticipate. Your home internet connection has a residential IP address. When your agents browse the web, fetch pages, or interact with APIs, they're coming from an IP that looks like a normal person, not a datacenter.

Residential IPs don't get automatically blocked or hit with CAPTCHAs the way cheap hosting providers do. I've had agents get blocked on a VPS and work fine through the Mac Studio on the same site, same request, same minute.

Config Gotchas for LM Studio + OpenClaw

This is where I lost the most time.

The two settings that burned me longest were both silent failures. The provider name in your config must be "openai," not "lmstudio" or "local" or anything creative. OpenClaw's auth resolution silently fails on unrecognized provider names. Nothing in the logs, nothing on screen. It just doesn't connect. Same story with the API mode: it must be "openai-completions" (which calls /v1/chat/completions). The other option, "openai-responses," calls /v1/responses, which hangs indefinitely on LM Studio. The naming suggests they're interchangeable. They're not. I read the OpenClaw source code to figure both of these out.

The auth profile format also isn't documented. After guessing for two hours, I found it requires a v1 store format:

{
  "version": 1,
  "profiles": {
    "openai:default": {
      "type": "api_key",
      "provider": "openai",
      "key": "lm-studio"
    }
  }
}

The key value doesn't matter for LM Studio (it doesn't authenticate) but it can't be empty or OpenClaw skips the provider.

Two more that bit me later: the context window defaults to 4,096 tokens, but OpenClaw's system prompt alone is 17,000 tokens. You'll get "Cannot truncate prompt" errors until you bump the context in LM Studio's UI to at least 32,768. Set it in the UI, not the CLI. The CLI setting doesn't survive a crash.

And the Jinja template bug: Qwen3-Coder's GGUF template includes a | tojson | safe filter. LM Studio's Jinja engine doesn't support | safe. It only triggers with complex tool schemas (nested JSON in parameters), so you might run fine for days before hitting it. Fix: edit the template in LM Studio's UI (My Models > Prompt Template), find the two occurrences of | tojson | safe, and change them to | tojson.

13 OpenClaw Errors and How I Fixed Them

Thirteen errors across the full setup. None of them had useful documentation when I searched. That's why they're all here.

Error	Root Cause
1	"Unknown model" on OpenRouter
2	Free models fail
3	"Cannot truncate prompt"
4	Provider auth silent fail
5	"openai-responses" hangs
6	"Unknown filter: safe"
7	MLX crashes under load
8	Session history bloat
9	Sub-agent token mismatch
10	"openclaw: command not found"
11	GATEWAY_BIND=lan breaks auth
12	bootstrapMaxChars crash
13	Speculative decoding freeze

1. "Unknown model" on OpenRouter

OpenClaw has an internal model registry. Models not in it get rejected before the request leaves the box. MiniMax M2.1 wasn't listed. Fix: add it to agents.defaults.models (plural, not model singular) as an allowlist entry. The naming matters too. OpenRouter uses lowercase IDs (minimax/minimax-m2.1, not MiniMax-M2.1). Check openrouter.ai/models for exact strings.

2. Free models can't handle the agent protocol

Llama 3.3 70B returned 404. Gemma 3 27B threw "Upstream error from OpenInference." Free endpoints on OpenRouter don't reliably support tools, system prompts, and structured output, which is everything OpenClaw needs. Save yourself the debugging: use paid models or go local.

3. "Cannot truncate prompt with n_keep >= n_ctx"

LM Studio defaults to 4,096 tokens of context. OpenClaw's system prompt is 17,000 tokens. The math doesn't work. Set context to at least 32,768 in LM Studio's UI settings, not the CLI. The CLI setting doesn't survive a crash.

4. Provider auth silently fails

If you name your provider "lmstudio" instead of "openai" in the config, OpenClaw's auth resolution doesn't error. It just doesn't connect. Nothing in the logs, nothing on screen. I read the source code to find this.

5. "openai-responses" hangs indefinitely

openai-responses calls /v1/responses, which LM Studio doesn't serve. openai-completions calls /v1/chat/completions, which it does. The naming suggests they're interchangeable. They're not.

6. Jinja template: "Unknown StringValue filter: safe"

Qwen3-Coder's GGUF template uses | tojson | safe. LM Studio doesn't support the safe filter. Only triggers with complex nested tool schemas, so you might run fine for days before hitting it. Fix: edit the template in LM Studio UI, remove | safe from both occurrences.

7. 30B MLX model crashes under sustained load

I tried the MLX 4-bit version of Qwen3-Coder first. It loaded fine, ran fine for short conversations, then started producing hallucinated gibberish followed by "Exit code: null." No useful error in logs. Switched to GGUF Q4_K_S with sysctl tuning for the GPU memory cap and it's been stable since.

8. Session history bloat

OpenClaw persists conversation history per session. After extended debugging, one session had 1,836 lines. New requests were failing because the history plus the system prompt exceeded the context window. Fix: delete stale session files from ~/.openclaw/agents/main/sessions/. Not obvious that this is a thing you need to do.

9. Sub-agent auth: the three-token problem

When sub-agents spawn, they connect back to the gateway via WebSocket. Three auth layers must agree:

Device identity token
Gateway paired device record
Gateway auth token (stored in the config AND the env file AND gateway-token.txt)

After upgrading OpenClaw, the config auto-migrated with a new gateway token, but the env file kept the old one. Sub-agents read from env, the gateway validates against config. Every spawn failed with "device_token_mismatch." Zero useful error messages.

Post-upgrade checklist I wish I'd had:

chown -R clawdbot:clawdbot .git/ dist/ if git ops ran as root

10. "openclaw: command not found" from sub-agents

Source installs don't add the binary to PATH. Sub-agents need it to announce back to the gateway. Fix: ln -sf /opt/clawdbot/openclaw.mjs /usr/local/bin/openclaw

11. GATEWAY_BIND=lan breaks sub-agents

Known issue (#916 on GitHub). Internal gateway calls don't pass auth correctly when bound to the LAN interface. Change to loopback and sub-agents start working.

12. bootstrapMaxChars as an object crashes the gateway

The config expects a plain number (e.g., 8000). I passed it as an object with per-file settings. The gateway crashed on startup with no indication of which config key was wrong.

13. Speculative decoding froze the Mac twice

I tried enabling speculative decoding with a 0.75B draft model to speed up generation. At 120k context, the Mac Studio hard-froze. Power cycled, tried again at 140k. Froze again. On 32GB Apple Silicon with a 30B model near the context ceiling, there's no memory headroom for a draft model. The display server gets killed first, which means no graceful shutdown. Just a black screen and a hard reboot.

Apple Silicon Performance Tuning: 12 to 49 Tokens Per Second

Out of the box, I was getting 12 tokens per second with frequent crashes. After tuning, 49 tokens per second at 140,000 tokens of context, stable for days.

KV Cache Quantization: The Single Biggest Win

The key-value cache stores attention state for every token in your context window. By default, LM Studio keeps it in F16 (16-bit floating point). Switching both K and V to Q8_0 (8-bit) nearly doubles your usable context and increases generation speed.

Setting	Max Context	Gen Speed	Notes
F16 KV	75,000	12-35 t/s	Default
Q8_0 KV	140,000	49 t/s	Production config

I tested every increment:

32k stable → 49k tight → 65k comfortable → 75k ceiling (F16) → 80k fails to load → 120k works (Q8_0) → 140k production ceiling (Q8_0) → 150k fails → 200k OOM kills the display server.

Set Flash Attention to explicit "On" in LM Studio, not "Auto." Auto doesn't always activate, and it's required for KV cache quantization to work.

sysctl: Raise the GPU Memory Cap

macOS caps GPU memory at about 66% of unified RAM by default. On a 32GB machine, that's roughly 21GB. A 17.5GB model plus KV cache at any reasonable context length blows right past that.

sudo sysctl iogpu.wired_limit_mb=24576

This raises the cap to 24GB. Persist it in /etc/sysctl.conf so it survives reboots. This was the difference between "crashes under load" and "stable at 140k context."

CPU Threads: Less is More

Apple Silicon has performance cores and efficiency cores. The M1 Max has 8 P-cores and 2 E-cores. I assumed 10 threads would be faster than 8. It's not. The E-cores are slower and create a bottleneck. Use P-cores only.

What Didn't Help

I also tried batch size 1536 (vs 768), speculative decoding with a 0.75B draft model, and sub-4-bit quantization (Q3_K_M, IQ3_XS). Batch size made zero practical difference (48.5 vs 49.2 t/s). Speculative decoding froze the Mac twice at 120K+ context because there's no memory headroom for a draft model on 32GB. And sub-4-bit quants are actually slower on Apple Silicon because of dequantization overhead. Q4_K_S is the sweet spot.

OpenClaw Config Optimization

The model is only half the story. OpenClaw's defaults are designed for 200k-context cloud models. On local hardware, you need to trim.

The first thing I changed was bootstrapMaxChars, from 20,000 down to 8,000. OpenClaw loads project files into context at the start of every request. At 20k per file, a single bootstrap was consuming half my context window before the conversation even started.

Next was contextPruning (cache-ttl, 5 minutes). Old context that hasn't been referenced gets dropped automatically. Before this, I was running out of context mid-conversation because stale tool outputs were taking up space.

Finally, historyLimit: 3. This caps how much Slack conversation history gets loaded. I had a busy channel filling the context with old messages, crowding out the actual work.

Production Config Summary

For anyone running a 30B MoE model on 32GB Apple Silicon:

Setting	Value
Model	Qwen3-Coder-30B-A3B Q4_K_S
Context	140,000 tokens
KV Cache (K and V)	Q8_0
Flash Attention	On (explicit)
CPU Threads	8 (P-cores only)
GPU Layers	All (49/49)
Batch Size	768
bootstrapMaxChars	8000
historyLimit	3
contextPruning	cache-ttl, 5m

Generation speed: 49 tokens/second. Prompt eval: ~470 tokens/second. Three sub-agents spawning concurrently at 120k context, all completing successfully.

10 Things I'd Change on a Fresh Install

If I were starting over tomorrow, in order:

Run sysctl iogpu.wired_limit_mb=24576 before loading any model. I was convinced the model was unstable when the real problem was macOS starving the GPU of memory. Persist it in /etc/sysctl.conf immediately.
Start with GGUF, not MLX. MLX has better integration with some tools but its KV cache quantization is buggy (fails above 1k tokens on 8-bit). GGUF's Q8_0 KV cache just works and it's the single biggest performance unlock.
Set LM Studio's default context in the UI on first launch. When the model crashes (and it will, while you're testing limits), LM Studio reloads it with the default from settings.json. If that default is 4,096 and your system prompt is 17,000 tokens, you're in a crash loop that looks like the model is broken when it's actually a config problem.
Read the OpenClaw source for auth-profiles.json. The format isn't documented anywhere. I burned two hours guessing before reading the source.
Symlink the openclaw binary to /usr/local/bin immediately after a source install. You won't know sub-agents need it in PATH until they silently fail, and "silently" means a 60-second timeout with no error message.
After every OpenClaw upgrade, verify the env file tokens match the config tokens. The config auto-migrates. The env file doesn't.
Set both K and V cache to Q8_0 from day one. I ran F16 initially because I didn't know better. The switch doubled my context and increased speed. There's no reason not to do it.
Don't try 200k context on 32GB. I know the model card says it supports it. It will OOM kill your display server and you'll be reaching for the power button. 140k is the ceiling on 32GB with this model. Respect it.
Clear session history periodically. OpenClaw doesn't do this for you. Sessions accumulate conversation history that counts against your context window. I didn't notice until a session had 1,836 lines and new requests were failing for no apparent reason.
Test with complex tool schemas early. The Jinja template bug only triggers with nested JSON parameters. Simple requests work fine. You'll think everything is stable, deploy to production, and then it breaks on the first real compound task.

Cost Breakdown

My Setup

Item	Monthly Cost
Electricity (BC Hydro)	~$1.50
VPS (optional)	$0 - $24
Cloud API	$0
Total	$1.50 - $25.50

The Mac Studio draws about 11W idle, 60W under load. In practice it averages around 20W because agents run in bursts, not continuously. BC Hydro's residential rate ($0.0996/kWh) on 20W average works out to about $1.50 a month. Your mileage varies by jurisdiction, but even in expensive markets you're looking at $3-5.

The VPS is optional. If you run OpenClaw directly on the same machine as your model, it's electricity only. I use a VPS for remote access and 24/7 uptime independent of my home network, but it's not required.

What I Was Paying Before

Service	Monthly Cost
Claude Code (Max)	$100
ChatGPT Pro	$200
OpenRouter credits	$20-50
Misc API calls	$10-30
Total	$330-380

Not all of that is replaced. I still use Claude Code for interactive work and ChatGPT for specific tasks. But the 24/7 autonomous agents, the batch jobs, the research tasks, the monitoring, the content drafts, all of that moved to OpenClaw on local hardware. The subscriptions that were funding autonomous work went to zero.

Cloud API Cost Comparison

These costs assume heavy autonomous agent usage (millions of tokens per month). Light usage would be significantly cheaper on cloud APIs:

Setup	Monthly Cost
VPS + Anthropic Sonnet 4	$124+
VPS + Anthropic Sonnet 4.5	$44-74
VPS + OpenRouter MiniMax M2.1	$29-39
VPS + local model	$25.50
Local only + local model	$1.50

The bottom line is the one nobody writes about because it requires owning hardware. But if you already have a Mac with 32GB sitting on a desk, you already own the most expensive part.

About the Author

Ian L. Paterson is CEO of Plurilock, a publicly traded cybersecurity company. This is part of a series documenting what it looks like to build AI-powered infrastructure for real work. Other posts cover persistent memory for Claude Code, session lifecycle management, and the daily automation layer that ties it all together.

Related: LM Studio Troubleshooting

Hit an error running this setup? Fixes for the most common LM Studio issues on Apple Silicon, including prompt truncation, jinja template failures, and performance tuning (12 to 49 tok/s): LM Studio Errors on Apple Silicon.

Related: Building a Memory System for Claude Code

This setup powers the persistent memory behind my Claude Code workflows. See how MEMORY.md, topic files, and automated maintenance keep context alive across sessions: Claude Code Memory System.

For a deeper look at how I decide which models handle which tasks, see how I route 200+ daily LLM calls across five models.

Which cloud models give the best results per dollar? I tested 15 models on 38 real coding tasks and ranked them by accuracy and cost. Sonnet 4.6 scored 100% at $0.20 per task. Flash 2.5 hit 97% for $0.003.

Ian Paterson is CEO of Plurilock (TSXV: PLUR) and writes about AI engineering, cybersecurity, and building in public at ianlpaterson.com.