<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Parag Darade</title>
    <description>The latest articles on Forem by Parag Darade (@parag_d).</description>
    <link>https://forem.com/parag_d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903878%2F35b21846-9327-4a4c-8126-39079eee577e.png</url>
      <title>Forem: Parag Darade</title>
      <link>https://forem.com/parag_d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/parag_d"/>
    <language>en</language>
    <item>
      <title>The Prompt Caching Mistake That's Costing You 70% More Than You Need to Pay</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:25:20 +0000</pubDate>
      <link>https://forem.com/parag_d/the-prompt-caching-mistake-thats-costing-you-70-more-than-you-need-to-pay-19ol</link>
      <guid>https://forem.com/parag_d/the-prompt-caching-mistake-thats-costing-you-70-more-than-you-need-to-pay-19ol</guid>
      <description>&lt;h1&gt;
  
  
  The Prompt Caching Mistake That's Costing You 70% More Than You Need to Pay
&lt;/h1&gt;

&lt;p&gt;Here is a specific claim: most teams building on Claude or GPT-4o are paying three to ten times more per token than they need to, and the reason is not model selection or request volume — it is prompt construction.&lt;/p&gt;

&lt;p&gt;Prompt caching has been available on Anthropic's API since August 2024 and on OpenAI's since October 2024. The economics are not subtle: Anthropic cache reads cost $0.30 per million tokens versus $3.00 per million for fresh tokens, a 90 percent reduction. OpenAI's automatic caching runs at roughly 50 percent off. If your application sends any system prompt longer than a few hundred tokens — and most do — you are either capturing that discount or you are not, and the line between those two outcomes is exactly one architectural decision most teams have never made deliberately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most applications do not actually capture the discount
&lt;/h2&gt;

&lt;p&gt;Prefix caching works by hashing the leading bytes of your prompt against a stored version. If the leading bytes match, you pay the cache-read price. If they do not, you pay full price and the provider stores the new prefix for the next call.&lt;/p&gt;

&lt;p&gt;The mistake I have seen repeatedly is teams interleaving dynamic content — the user's name, a session ID, a timestamp, the current date — into the beginning or middle of their system prompt. It looks like a small thing. A system prompt that opens with "Today is April 30, 2026. You are assisting {user_name}..." is entirely uncacheable, because the prefix changes with every request. You have spent fifteen thousand tokens of careful prompt engineering writing tool definitions, personas, and step-by-step instructions, and none of it is ever reused.&lt;/p&gt;

&lt;p&gt;The fix is architectural, not subtle: static content at the front, dynamic content at the end. Everything that varies per-request belongs in the user turn or at the tail of the message array, never in the prefix.&lt;/p&gt;
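
&lt;p&gt;What that looks like in code, as a minimal sketch with the Anthropic Python SDK: the model id, prompt text, and &lt;code&gt;ask()&lt;/code&gt; helper are placeholders, not a prescription.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: the static prefix carries the cache breakpoint, everything
# per-request lives in the user turn. Model id and prompt text are placeholders.
import datetime
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STATIC_SYSTEM_PROMPT = """You are a support assistant for ExampleCo.
... persona, policies, output format, tool rules (identical on every call) ..."""

def ask(user_name, question):
    today = datetime.date.today().isoformat()
    response = client.messages.create(
        model="claude-sonnet-4-5",          # placeholder model id
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,   # byte-identical across requests
            "cache_control": {"type": "ephemeral"},
        }],
        # Everything that varies per request goes after the cached prefix.
        messages=[{
            "role": "user",
            "content": f"Today is {today}. You are assisting {user_name}.\n\n{question}",
        }],
    )
    return response.content[0].text
&lt;/code&gt;&lt;/pre&gt;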

&lt;h2&gt;
  
  
  What that looks like in production
&lt;/h2&gt;

&lt;p&gt;ProjectDiscovery, the security tooling company behind Nuclei, published a &lt;a href="https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching" rel="noopener noreferrer"&gt;detailed post-mortem in early 2025&lt;/a&gt; on how they cut LLM costs by 59 percent overall — reaching 70 percent in the most recent ten-day window of their measurement period. Their agentic system runs on Claude with system prompts exceeding 20,000 tokens per agent, and average tasks run 26 steps with 40 tool calls. At that scale, every prompt construction mistake compounds badly.&lt;/p&gt;

&lt;p&gt;Their starting cache hit rate was under 8 percent. After deploying what they call the relocation trick — moving working memory and runtime context from the cacheable prefix to the message tail — their cache hit rate jumped to 74 percent overnight. One case in their logs shows a single task processing 67.5 million input tokens at a 91.8 percent cache hit rate. That is a different cost structure entirely, not a marginal improvement.&lt;/p&gt;

&lt;p&gt;The implementation involved three explicit cache breakpoints: one at the end of the static system prompt with a one-hour TTL shared across users, one at the last static tool definition, and one at a sliding window of recent conversation history with a five-minute TTL. Stable template variables replaced actual values in the system prompt to maintain byte-identical prompts across different users. The architecture is not complex, but it requires thinking about your prompt as having structure — a static spine with dynamic arms — rather than as a single string you construct fresh at request time.&lt;/p&gt;
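
&lt;p&gt;Sketched against the Anthropic Messages API, that layout might look something like the following. The prompt contents, tool schema, and &lt;code&gt;build_request()&lt;/code&gt; helper are illustrative rather than ProjectDiscovery's actual code, and the one-hour TTL may require a beta header depending on your SDK version.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: three explicit breakpoints, static spine first, dynamic tail last.
import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "... 20k tokens of instructions, template variables instead of real values ..."

STATIC_TOOLS = [
    {"name": "read_file", "description": "Read a file from the workspace.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "run_scan", "description": "Run a scan against a target.",
     "input_schema": {"type": "object",
                      "properties": {"target": {"type": "string"}},
                      "required": ["target"]}},
]

def build_request(history, working_memory, task):
    # Breakpoint 2: cache_control on the last static tool covers every
    # definition before it; per-session tools would be appended after it.
    tools = [dict(t) for t in STATIC_TOOLS]
    tools[-1]["cache_control"] = {"type": "ephemeral", "ttl": "1h"}

    # Breakpoint 3: mark the end of the stable recent-history window so the
    # next agent step reads it back at the default 5-minute TTL.
    messages = [dict(m) for m in history]
    if messages:
        messages[-1] = {"role": messages[-1]["role"],
                        "content": [{"type": "text",
                                     "text": messages[-1]["content"],
                                     "cache_control": {"type": "ephemeral"}}]}

    # Dynamic tail: working memory and runtime context stay out of the prefix.
    messages.append({"role": "user",
                     "content": f"{working_memory}\n\nCurrent task: {task}"})

    return dict(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        # Breakpoint 1: static system prompt, shared across users, 1-hour TTL.
        system=[{"type": "text",
                 "text": STATIC_SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
        tools=tools,
        messages=messages,
    )

response = client.messages.create(**build_request([], "no memory yet", "triage the latest scan results"))
&lt;/code&gt;&lt;/pre&gt;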

&lt;h2&gt;
  
  
  What the research confirms
&lt;/h2&gt;

&lt;p&gt;A January 2026 paper from researchers at Tsinghua and Microsoft, "&lt;a href="https://arxiv.org/abs/2601.06007" rel="noopener noreferrer"&gt;Don't Break the Cache&lt;/a&gt;," evaluated caching strategies across OpenAI, Anthropic, and Google over 500 agentic sessions using a multi-turn benchmark with 10,000-token system prompts and up to 50 tool calls per session. The measured cost reduction range was 41 to 80 percent depending on provider and strategy, with time-to-first-token improving by 13 to 31 percent as a secondary effect.&lt;/p&gt;

&lt;p&gt;The paper's central finding matches what ProjectDiscovery discovered operationally: full-context caching can paradoxically increase latency because dynamic tool results appearing mid-context invalidate the cache entry. The winning strategy across every provider was placing dynamic content at the end of the prompt, never in the middle. Strategic cache block placement — not just enabling caching — is what separates teams at 10 percent hit rates from teams at 75 percent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do if you are not measuring this yet
&lt;/h2&gt;

&lt;p&gt;Add cache hit rate to your request logging today. Both Anthropic and OpenAI return cache usage metadata in their API responses. Anthropic provides &lt;code&gt;cache_creation_input_tokens&lt;/code&gt; and &lt;code&gt;cache_read_input_tokens&lt;/code&gt; directly in the usage object. If you are not recording those fields, you cannot see whether you are paying full price on every call, which means you almost certainly are.&lt;/p&gt;
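
&lt;p&gt;A minimal logging wrapper, assuming the Anthropic Python SDK; the &lt;code&gt;call_and_log()&lt;/code&gt; helper and logger name are placeholders, while the usage fields are the ones the API returns.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: record cache usage on every call so hit rate becomes a metric you
# can chart rather than a guess.
import logging
import anthropic

log = logging.getLogger("llm.cache")
client = anthropic.Anthropic()

def call_and_log(**request_kwargs):
    response = client.messages.create(**request_kwargs)
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens  # input tokens billed at the full rate
    total_input = cached + written + fresh
    hit_rate = cached / total_input if total_input else 0.0
    log.info("input=%d cache_read=%d cache_write=%d hit_rate=%.1f%%",
             total_input, cached, written, hit_rate * 100)
    return response
&lt;/code&gt;&lt;/pre&gt;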

&lt;p&gt;Once you are measuring, audit your system prompt construction. Look for any field that changes between requests — user ID, current date, session state — and move it to the tail. If you are using function calling, make sure your tool definitions are static and appear before any dynamic content. If you are building agentic systems with multi-turn history, mark your cache breakpoints explicitly on Anthropic, or structure your message array so that the first several turns are stable across sessions.&lt;/p&gt;

&lt;p&gt;The numbers are real. At production scale — tens of millions of tokens per day — the difference between an 8 percent cache hit rate and a 75 percent cache hit rate is a monthly bill that looks like a different company. The engineering to get there is a few hours of work. The actual win was never about which model you picked.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Cache Hit Rate Is the Cost Lever Your Team Is Probably Ignoring</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Thu, 30 Apr 2026 08:39:01 +0000</pubDate>
      <link>https://forem.com/parag_d/cache-hit-rate-is-the-cost-lever-your-team-is-probably-ignoring-78g</link>
      <guid>https://forem.com/parag_d/cache-hit-rate-is-the-cost-lever-your-team-is-probably-ignoring-78g</guid>
      <description>&lt;h1&gt;
  
  
  Cache Hit Rate Is the Cost Lever Your Team Is Probably Ignoring
&lt;/h1&gt;

&lt;p&gt;I have watched teams spend a month on model selection benchmarks — GPT-4o versus Claude Sonnet 4.5 versus Gemini 2.5 Pro — then deploy with a prompt structure that breaks cache hits on every single request, paying three to five times more than they should for work the provider has already done. The model selection decision is worth something. The prompt structure decision is worth more. For any workload with repeated or agentic patterns, it is not close.&lt;/p&gt;

&lt;p&gt;The mechanism is the KV cache. Every major LLM API — Anthropic, OpenAI, Google — reuses computation when a new request begins with tokens it has already processed. Anthropic charges &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;cache reads at 10 percent of the standard input price&lt;/a&gt;. A clean cache hit is a 90 percent discount on those tokens. The catch is that cache hits only fire when the prefix — the exact sequence of tokens at the start of your request — matches what was previously cached. Move one token, change one value, and the cache restarts. This is why the position of dynamic content inside your prompt is the variable that determines whether any of this pays off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when you get it wrong
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching" rel="noopener noreferrer"&gt;ProjectDiscovery ships Neo&lt;/a&gt;, an agentic task runner built on Claude Opus 4.5. In early February 2025, their cache hit rate sat at 4.2 percent. Their system prompts ran to 2,500 lines of YAML — roughly 20,000 tokens per agent — but working memory, skills context, and runtime session identifiers were embedded inside that prefix. Every request began with a slightly different token sequence. The cache never fired.&lt;/p&gt;

&lt;p&gt;The cost difference between a 3 percent cache rate and a 91 percent cache rate on identical workloads is not linear. ProjectDiscovery found a comparison task that ran 66.8 million input tokens at 3.2 percent cache utilization pre-optimization. Equivalent tasks post-optimization ran at 91.8 percent cache rates. The cost differential was roughly 60x for the same computation. Not 60 percent. Sixty times.&lt;/p&gt;

&lt;p&gt;What they changed was the position of dynamic content. Working memory, runtime context, and session-specific data moved out of the cacheable prefix into the user message tail. The static YAML system prompt stayed put, with an explicit cache breakpoint marking it cacheable. One structural change shifted cache rates from 4.2 percent to 73.7 percent in two weeks. By March 2025, the rate was 84.3 percent, 9.8 billion tokens had been served from cache, and their overall token bill had dropped 59 percent. The implementation is not a new service or a new model — it is a reordering of content within requests you are already making.&lt;/p&gt;

&lt;h2&gt;
  
  
  The research backs this up across providers
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://arxiv.org/abs/2601.06007" rel="noopener noreferrer"&gt;January 2026 paper on prompt caching for agentic tasks&lt;/a&gt; tested caching strategies across 500-plus agent sessions on GPT-5.2, Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-4o, all with 10,000-token system prompts on PhD-level research tasks. Cost savings ranged from 41 to 80 percent. But the distribution matters: the best-performing strategy for every model was caching the system prompt only — not full context.&lt;/p&gt;

&lt;p&gt;The counterintuitive finding was that caching too aggressively makes things worse. GPT-4o showed an 8.8 percent latency regression when full-context caching was on, because volatile tool results in the middle of the context created cache mismatches that added overhead instead of reducing it. The same model with system-prompt-only caching showed a 30.9 percent latency improvement. Claude Sonnet 4.5 with system-prompt-only caching achieved 78.5 percent cost savings and a 22.9 percent reduction in time-to-first-token. The lesson generalizes: a misplaced cache boundary costs you overhead without delivering savings, while a well-placed one compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three questions worth asking about your prompts today
&lt;/h2&gt;

&lt;p&gt;First: where does your dynamic content live? If timestamps, user identifiers, session context, or per-request state are inside your system prompt — especially near the start — they are breaking cache hits on every request. Move them to the user message.&lt;/p&gt;

&lt;p&gt;Second: are your tool definitions ordered consistently? The arXiv paper found that dynamic tool sets — where tool lists change between requests — break cache hits because the prefix diverges. If you are adding or removing tools based on user context, that variation belongs at the tail, not the front.&lt;/p&gt;

&lt;p&gt;Third: what is your actual cache hit rate right now? Anthropic's API returns cache read token counts in the &lt;code&gt;usage&lt;/code&gt; field of every response. If you are not logging &lt;code&gt;cache_read_input_tokens&lt;/code&gt; alongside your standard token counts, you have no visibility into whether caching is doing anything at all. In my experience, teams that start logging this number for the first time are surprised by how low it is — and usually find a misplaced dynamic value in the first fifteen minutes of investigation.&lt;/p&gt;
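
&lt;p&gt;If you already log those counts per request, the aggregate is a few lines. A sketch, assuming each logged row carries the three usage fields Anthropic returns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: aggregate cache hit rate over logged requests.
def cache_hit_rate(rows):
    read = sum(r["cache_read_input_tokens"] for r in rows)
    written = sum(r["cache_creation_input_tokens"] for r in rows)
    fresh = sum(r["input_tokens"] for r in rows)
    total = read + written + fresh
    return read / total if total else 0.0

# Example: two calls, the second one hitting a 20k-token cached prefix.
rows = [
    {"input_tokens": 300, "cache_creation_input_tokens": 20000, "cache_read_input_tokens": 0},
    {"input_tokens": 350, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 20000},
]
print(f"{cache_hit_rate(rows):.1%}")  # roughly 49% of input tokens served from cache
&lt;/code&gt;&lt;/pre&gt;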

&lt;p&gt;The pricing math closes quickly. Anthropic charges 1.25x the base input price for the cache write at the 5-minute TTL, and 0.10x for reads. A single cache hit already more than recovers the write premium. A prompt written once and read ten times runs at roughly 20 percent of what the same eleven requests would cost uncached. At any production volume above a few hundred requests per day, that breakeven happens on day one.&lt;/p&gt;
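
&lt;p&gt;The arithmetic, as a quick sketch you can rerun with your own read counts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the breakeven math on Anthropic's 5-minute cache:
# write = 1.25x base input price, read = 0.10x, uncached = 1.00x.
def relative_cost(reads):
    cached = 1.25 + reads * 0.10     # one cache write, then n reads
    uncached = 1.0 * (1 + reads)     # the same n + 1 requests at full price
    return cached / uncached

print(relative_cost(1))    # 0.675: ahead after the very first hit
print(relative_cost(10))   # about 0.20: roughly a fifth of the uncached cost
&lt;/code&gt;&lt;/pre&gt;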

&lt;p&gt;The boring optimization wins here, as it usually does: audit where your static content ends and your dynamic content begins, draw that boundary explicitly with a cache breakpoint, and let the provider do the work it is already equipped to do.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Fix Your Prompt Structure Before You Touch Your Infrastructure</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Thu, 30 Apr 2026 04:05:41 +0000</pubDate>
      <link>https://forem.com/parag_d/fix-your-prompt-structure-before-you-touch-your-infrastructure-26d</link>
      <guid>https://forem.com/parag_d/fix-your-prompt-structure-before-you-touch-your-infrastructure-26d</guid>
      <description>&lt;h1&gt;
  
  
  Fix Your Prompt Structure Before You Touch Your Infrastructure
&lt;/h1&gt;

&lt;p&gt;Most engineering teams treat LLM inference costs as an infrastructure problem. They evaluate model quantization, shop for cheaper GPU rentals, debate whether to move from GPT-4o to Claude Sonnet, and benchmark open-source alternatives. I have watched teams spend weeks on this and save fifteen percent. The same teams were running their system prompts with a timestamp in the first line and paying full token price on every single request.&lt;/p&gt;

&lt;p&gt;The optimization I am talking about is prompt caching. &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;Anthropic charges $0.30 per million tokens for cache reads versus $3.00 per million for fresh input tokens&lt;/a&gt; — a 10x price difference for bytes the model already processed in the last hour. &lt;a href="https://platform.openai.com/docs/guides/prompt-caching" rel="noopener noreferrer"&gt;OpenAI applies automatic 50% discounts on cached tokens&lt;/a&gt;. The savings are not theoretical. They compound over every request your system makes, and most teams are not capturing them because they are breaking the cache themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cache-busting actually looks like
&lt;/h2&gt;

&lt;p&gt;Caching works by hashing the prefix of your prompt. If the hash matches a recent request, you pay the cheap rate. If it does not, you pay full price on the entire prefix.&lt;/p&gt;

&lt;p&gt;The failure mode is deceptively simple: any dynamic content in your system prompt breaks the hash. A current timestamp. A user ID. A session context block. A "today's date is April 30, 2026" string you added because the model kept getting dates wrong. Any of these changes between requests, which pushes the cache hit rate to near zero and guarantees you pay $3.00/M on every input token.&lt;/p&gt;

&lt;p&gt;I have seen this specific mistake in every LLM system I have audited that started as a quick prototype. The system prompt grows organically — someone adds a date, someone adds a user's account tier, someone adds a "recent conversation summary" block — and by the time the system is in production, the cacheable prefix is maybe the first hundred tokens of a twenty-thousand-token prompt. Cache hit rate: seven percent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ProjectDiscovery case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching" rel="noopener noreferrer"&gt;ProjectDiscovery's engineering team published a detailed breakdown&lt;/a&gt; of exactly this problem in early 2025. Their security agent Neo runs an average of 26 steps per task with roughly 40 tool calls. Each step sent a prompt that included a 20,000-token system prompt — 2,500 lines of YAML, tool definitions, and runtime state including working memory and skills context that changed every step.&lt;/p&gt;

&lt;p&gt;Their initial cache hit rate: 7%.&lt;/p&gt;

&lt;p&gt;The fix was structural. They moved dynamic content — working memory, runtime variables, skills context — out of the system prompt and into a user message appended at the tail of the conversation. The static system prompt stayed static. The dynamic state moved to the only place it should have been: after the stable prefix.&lt;/p&gt;

&lt;p&gt;Cache hit rate after the change: 74% within the same deployment cycle, 84% by mid-March 2025. Total cost reduction: 59% compared to baseline. Over the six weeks following deployment, they served 9.8 billion input tokens from cache rather than paying full price for them.&lt;/p&gt;

&lt;p&gt;That last number is worth sitting with. 9.8 billion tokens at $0.30/M instead of $3.00/M. The engineering work took days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structural rule
&lt;/h2&gt;

&lt;p&gt;Everything static goes first. Everything dynamic goes last.&lt;/p&gt;

&lt;p&gt;Your system prompt — instructions, persona, output format, tool definitions — is static. It changes when you ship a new version, not on every request. Mark it as cacheable and never mix runtime state into it. Your dynamic content — user context, current date, session variables, retrieved chunks from RAG — goes into the user message at the end of the conversation. This is where the model expects context to live anyway.&lt;/p&gt;

&lt;p&gt;Tool definitions deserve a specific note. If your tool list is partly static (the core tools your agent always has) and partly dynamic (tools you inject based on user permissions or task context), sort the static tools first and place them before the dynamic ones. ProjectDiscovery made tool definitions their second cache breakpoint, keeping a 1-hour TTL on the stable portion even in conversations with changing tool sets. The incremental token cost of alphabetically sorting a list of tool names before your agent runs is zero. The cache savings compound over every task.&lt;/p&gt;
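
&lt;p&gt;A sketch of that ordering; the tool names and schemas are invented for illustration, and the &lt;code&gt;cache_control&lt;/code&gt; entry on the last static tool marks the breakpoint that covers everything before it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: always-present tools first (sorted for a stable byte order), the
# last one marked cacheable, per-user tools appended after the breakpoint.
def build_tools(static_tools, dynamic_tools):
    tools = sorted((dict(t) for t in static_tools), key=lambda t: t["name"])
    tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}
    return tools + list(dynamic_tools)

STATIC_TOOLS = [
    {"name": "search_code", "description": "Search the repository.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}},
                      "required": ["query"]}},
    {"name": "read_file", "description": "Read a file by path.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
]

# Tools gated by user permissions go last so they never disturb the prefix.
tools = build_tools(STATIC_TOOLS, dynamic_tools=[])
&lt;/code&gt;&lt;/pre&gt;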

&lt;h2&gt;
  
  
  How to tell if you are affected
&lt;/h2&gt;

&lt;p&gt;Pull your Anthropic usage metrics for the last seven days. If your cache read token rate is below 40% of total input tokens, your prompts are almost certainly structured wrong. Below 20%, the culprit is almost always something dynamic sitting in the system prompt itself — probably a date, a user attribute, or a context block that varies per session.&lt;/p&gt;

&lt;p&gt;For OpenAI, automatic caching applies the 50% discount whenever a matching prefix exists, so the failure mode is less visible in billing. You are still missing cache hits, you just do not see the rate directly in your dashboard without explicitly measuring prefix stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this sits in the optimization stack
&lt;/h2&gt;

&lt;p&gt;Before re-ranking. Before switching embedding models. Before evaluating managed vector databases. Before any model fine-tuning. The &lt;a href="https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025" rel="noopener noreferrer"&gt;ZenML survey of 1,200 production LLM deployments&lt;/a&gt; cites Care Access achieving 86% cost reduction through prompt caching, and Riskspan cutting per-deal processing costs by 90x through LLM optimization. Both numbers are large enough to sound inflated, but the mechanism is reliable: if you move from a 7% cache hit rate to a 74% cache hit rate on a 20,000-token prompt, you have changed the effective price of those input tokens by roughly 3x.&lt;/p&gt;

&lt;p&gt;The audit your system needs first is not an architecture review. It is reading your own system prompt and asking which lines change between requests. Move those lines to the bottom. The savings will show up in your billing dashboard before the end of the week.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Prompt Tax Most LLM Teams Are Silently Paying</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Wed, 29 Apr 2026 19:57:42 +0000</pubDate>
      <link>https://forem.com/parag_d/the-prompt-tax-most-llm-teams-are-silently-paying-1nml</link>
      <guid>https://forem.com/parag_d/the-prompt-tax-most-llm-teams-are-silently-paying-1nml</guid>
      <description>&lt;h1&gt;
  
  
  The Prompt Tax Most LLM Teams Are Silently Paying
&lt;/h1&gt;

&lt;p&gt;Anthropic shipped prompt caching in August 2024. Nearly two years later, &lt;a href="https://www.datadoghq.com/state-of-ai-engineering/" rel="noopener noreferrer"&gt;Datadog's State of AI Engineering report&lt;/a&gt; found that only 28 percent of LLM API calls across their observed production deployments show cached-read tokens — despite the fact that 69 percent of all input tokens in those same deployments live in system prompts. The math is not subtle: most teams are sending the same fifty thousand tokens on every request and paying full rate for all of them.&lt;/p&gt;

&lt;p&gt;This is not an obscure optimization from a recent release. Both Anthropic and OpenAI have had prompt caching available for over a year. OpenAI applies it automatically on GPT-4o calls longer than 1,024 tokens, at a 50 percent discount, requiring zero code changes. Anthropic's implementation requires marking your cache breakpoints explicitly, but the discount is steeper: &lt;a href="https://platform.claude.com/docs/en/build-with-claude/prompt-caching" rel="noopener noreferrer"&gt;cache reads on Claude Sonnet cost $0.30 per million tokens&lt;/a&gt; versus $3.00 per million for fresh input — a 90 percent reduction. The teams behind the other 72 percent of calls are not missing some edge-case optimization. They are missing the most obvious cost lever available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why System Prompts Are the Structural Problem
&lt;/h2&gt;

&lt;p&gt;Most LLM applications I have seen share the same shape: a system prompt running anywhere from five hundred to fifty thousand tokens — instructions, persona text, policy constraints, tool definitions, few-shot examples — followed by a user message and sometimes retrieved context. The system prompt does not change between requests. It is identical for user one and user ten thousand.&lt;/p&gt;

&lt;p&gt;This is exactly the workload prompt caching was built for. The model processes the system prompt once, writes the KV cache state to a fast-access store, and on every subsequent request within the cache window, reads from that state instead of recomputing from scratch. You pay the write cost once — 1.25x the normal input rate on Anthropic's five-minute TTL — and then ten percent of the normal rate on every read thereafter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://labeveryday.medium.com/prompt-caching-is-a-must-how-i-went-from-spending-720-to-72-monthly-on-api-costs-3086f3635d63" rel="noopener noreferrer"&gt;Du'An Lightfoot's YouTube analytics bot&lt;/a&gt; puts the economics in concrete terms. His system included 81,262 tokens of video metadata JSON in every request — paying $0.24 per call against Claude 3.5 Haiku's base rate. After caching, subsequent requests dropped to $0.024. The monthly bill went from $720 to $72. The only change was marking the cacheable prefix in the API request.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://medium.com/tr-labs-ml-engineering-blog/prompt-caching-the-secret-to-60-cost-reduction-in-llm-applications-6c792a0ac29b" rel="noopener noreferrer"&gt;Thomson Reuters Labs engineering team&lt;/a&gt; measured similar numbers on research paper analysis. A 30,000-token document with three parallel queries cost $0.34 per session on Claude 3.5 Sonnet without caching. With a cache-warmed prefix, the same workload ran at $0.14 — a 59 percent reduction — and subsequent queries ran 20 percent faster. For larger prompts the latency gains are more dramatic: Anthropic's own documentation shows a 100,000-token prompt dropping from 11.5 seconds to 2.4 seconds with caching, roughly a 79 percent reduction in time-to-first-token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Actually Breaks
&lt;/h2&gt;

&lt;p&gt;The implementation failure mode is not the API call. That part is two lines of JSON. The failure mode is prompt structure.&lt;/p&gt;

&lt;p&gt;Caching works by prefix matching. Everything up to your first &lt;code&gt;cache_control&lt;/code&gt; breakpoint must be byte-for-byte identical across requests for a cache hit to register. This means your cacheable content has to come &lt;em&gt;first&lt;/em&gt; in the prompt. If you build prompts dynamically and prepend user-specific context before the system instructions — which is a common pattern when you want to personalize early — you get zero cache hits and no error to investigate. The system just silently processes fresh tokens on every call.&lt;/p&gt;

&lt;p&gt;The correct structure is: static system instructions first, then cacheable reference material, then dynamic context, then the user query. The TR Labs team learned a related lesson the hard way: they parallelized three document-analysis requests before issuing a sequential cache write, and ended up with a 4.2 percent cache hit rate and costs 60 percent higher than their fully uncached baseline. Each parallel thread had written its own redundant cache entry. Their fix was a single synchronous "warming" call before fanning out.&lt;/p&gt;

&lt;p&gt;Another trap is the cache TTL. Anthropic's default window is five minutes. If your workload has gaps longer than that — overnight batch jobs, infrequent API calls, anything without steady traffic throughout the day — you pay the write premium on every call and recover nothing. The one-hour TTL doubles the write cost to 2x the normal input rate, but that premium is recovered within the first two or three requests in the same window. Know your request distribution before picking a TTL.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual First Step
&lt;/h2&gt;

&lt;p&gt;Before you add a re-ranker, upgrade your embedding model, or benchmark a new chunking strategy: open your provider dashboard, find your average input token count per call, and separate it into system tokens versus dynamic tokens. If the static portion is above two thousand tokens and your request volume is more than a few hundred calls per day, you are probably leaving 50 to 90 percent of your input token spend on the table.&lt;/p&gt;

&lt;p&gt;On OpenAI, caching is already happening automatically — check whether your prompt structure is prefix-stable enough to be hitting it. On Anthropic, add &lt;code&gt;cache_control&lt;/code&gt; markers to your system prompt and reference content, run a day of traffic, and look at &lt;code&gt;cache_read_input_tokens&lt;/code&gt; in the response metadata.&lt;/p&gt;
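
&lt;p&gt;Checking both providers takes a few lines. A sketch, with the prompt and model ids as placeholders; the usage fields are the ones each API returns in its response object.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: read the cache counters each provider reports per response.
import anthropic
from openai import OpenAI

claude = anthropic.Anthropic()
oai = OpenAI()

BIG_STATIC_PROMPT = "... a few thousand tokens of static instructions ..."

a = claude.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=256,
    system=[{"type": "text", "text": BIG_STATIC_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "First question"}],
)
print(a.usage.cache_creation_input_tokens, a.usage.cache_read_input_tokens)

o = oai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": BIG_STATIC_PROMPT},
              {"role": "user", "content": "First question"}],
)
# Automatic caching: nonzero once a matching 1,024-plus-token prefix exists.
print(o.usage.prompt_tokens_details.cached_tokens)
&lt;/code&gt;&lt;/pre&gt;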

&lt;p&gt;The boring optimization is the one that ships in an afternoon, requires no new infrastructure, and cuts your monthly API bill in half. Most teams are still waiting for a reason to look at it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Prompt Caching Works. Your Prompt Assembly Code Does Not.</title>
      <dc:creator>Parag Darade</dc:creator>
      <pubDate>Wed, 29 Apr 2026 18:31:37 +0000</pubDate>
      <link>https://forem.com/parag_d/prompt-caching-works-your-prompt-assembly-code-does-not-5edc</link>
      <guid>https://forem.com/parag_d/prompt-caching-works-your-prompt-assembly-code-does-not-5edc</guid>
      <description>&lt;h1&gt;
  
  
  Prompt Caching Works. Your Prompt Assembly Code Does Not.
&lt;/h1&gt;

&lt;p&gt;I have watched teams enable Anthropic's prompt caching, wait a billing cycle, and conclude that the advertised 90% discount on input tokens is marketing fiction. It is not. The discount is real — Anthropic charges $0.30 per million tokens for cache reads against $3.00 for fresh input, a genuine 10x difference. What is fiction is the assumption that flipping the flag is sufficient.&lt;/p&gt;

&lt;p&gt;The failure mode is architectural. The default way engineers build LLM applications — dynamically assembling prompts from system instructions, retrieved context, conversation history, and user input — produces prompts that defeat the cache on every single call, regardless of what the documentation says.&lt;/p&gt;

&lt;h2&gt;
  
  
  What prefix invariance actually means
&lt;/h2&gt;

&lt;p&gt;Anthropic's cache operates on prefix invariance. It checks the prompt from the beginning outward. The cached prefix must be byte-for-byte identical to a prior request. The moment any content changes, the cache misses for that position and everything that follows it.&lt;/p&gt;

&lt;p&gt;This seems obvious until you look at how most production prompt assembly actually works. A typical chain: &lt;code&gt;[system prompt] + [RAG chunks from this query] + [conversation history] + [user message]&lt;/code&gt;. If the RAG chunks differ between requests — which they do, by definition — then nothing after them can ever match a prior request, and unless the system prompt alone clears the minimum cacheable length and carries its own breakpoint, the cache never gets a stable prefix to work with. The dynamic content sits upstream of everything that follows it, and the cache sees a novel prompt every time.&lt;/p&gt;

&lt;p&gt;Anthropic requires a minimum of 1024 tokens in the cached block and supports up to four explicit breakpoints per prompt. These parameters are not the bottleneck. The bottleneck is content ordering.&lt;/p&gt;
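
&lt;p&gt;Here is the same chain reordered for a stable prefix, as a sketch with retrieval and prompt text stubbed out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: same ingredients, reordered. The system instructions become the
# cached prefix; retrieved chunks ride in the user turn with the question.
import anthropic

client = anthropic.Anthropic()
SYSTEM_INSTRUCTIONS = "... static instructions, persona, output format ..."

def answer(question, history, retrieve):
    chunks = retrieve(question)  # dynamic by definition, so keep it out of the prefix
    user_turn = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {question}"
    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        system=[{"type": "text", "text": SYSTEM_INSTRUCTIONS,
                 "cache_control": {"type": "ephemeral"}}],
        messages=history + [{"role": "user", "content": user_turn}],
    )
&lt;/code&gt;&lt;/pre&gt;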

&lt;h2&gt;
  
  
  From 7% to 85% in one deployment
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching" rel="noopener noreferrer"&gt;ProjectDiscovery&lt;/a&gt; runs an AI security research platform built on agent swarms. Each task averages 26 steps and 40 tool calls, working from a system prompt that exceeds 2,500 lines of YAML — over 20,000 tokens per agent. The economics of caching a system prompt that size are not subtle: sent 100 times, it costs roughly $6.00 at fresh input pricing and $0.67 with caching. They had every incentive to get this right.&lt;/p&gt;

&lt;p&gt;Their initial cache hit rate was 7%.&lt;/p&gt;

&lt;p&gt;The diagnosis was prompt structure. Dynamic task context — the current scan target, task parameters, variable tool outputs — was being injected into the cacheable prefix before the static system prompt content. From the cache's perspective, every request opened with novel content. The 20,000-token system prompt that should have dominated the cached prefix was sitting downstream of tokens that changed on every call.&lt;/p&gt;

&lt;p&gt;The fix was architectural, not technical: relocate all dynamic content from the cacheable prefix to the tail of the prompt, after the cache breakpoints, delivered as part of the user message rather than embedded in the system prompt. They also structured three explicit breakpoints — one for the static system prompt, one for the conversation sliding window, one for tool definitions. A single deployment on February 16 moved the hit rate from 7% to 73.7%. By March 23 it had reached 85%. The cost reduction was 59% overall and climbing toward 70% in the most recent measurement window.&lt;/p&gt;

&lt;p&gt;For their longest agentic tasks — one ran to 1,663 steps and 57.5 million input tokens — cache rates hit 92.9%. At that scale, the difference between a 7% and 93% cache rate on a single task is not rounding error. It is the difference between running the task economically and not running it at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parallel request trap
&lt;/h2&gt;

&lt;p&gt;There is a second structural failure mode that hits applications using parallel LLM calls for throughput — batch document analysis, concurrent summarization, fan-out agent patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/tr-labs-ml-engineering-blog/prompt-caching-the-secret-to-60-cost-reduction-in-llm-applications-6c792a0ac29b" rel="noopener noreferrer"&gt;Thomson Reuters Labs published&lt;/a&gt; a specific breakdown of this problem. Their pipeline ingested a 30,000-token document and ran multiple analytical queries against it in parallel to reduce latency. Cache hit rate without modification: 4.2%.&lt;/p&gt;

&lt;p&gt;The cause is a race condition in cache population. When two parallel requests arrive simultaneously against a prefix that has no existing cache entry, both trigger a cache write. The second write is redundant — you pay the $3.75/M write premium twice, and the second entry is wasted. Every subsequent request that arrives before any cache entry is established repeats this. In a burst of parallel calls, you can write the same prefix dozens of times and read it zero times in the same request window.&lt;/p&gt;

&lt;p&gt;The fix is cache warming: a single synchronous call to establish the cache entry before the parallel batch is dispatched. The warming call costs 3.98 seconds of overhead. Against a session with three parallel queries, that overhead is roughly 5% of total session time. Against a session with twenty queries, under 1%. The cost comparison on their 30,000-token document with three questions: $0.34 without warming, $0.14 with it — 60% cheaper, from a wrapper function that fires one request before releasing the batch.&lt;/p&gt;
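
&lt;p&gt;The warming pattern itself is small. A sketch with the async Anthropic client, where the document, queries, and model id are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: establish the cache entry once, then fan out. Without the first
# awaited call, every parallel request races to write the same prefix.
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()

SYSTEM = [{"type": "text",
           "text": "... the 30,000-token document plus analysis instructions ...",
           "cache_control": {"type": "ephemeral"}}]

async def ask(question):
    return await client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": question}],
    )

async def main(questions):
    await ask("Summarize the document in one sentence.")  # warming call pays the write once
    return await asyncio.gather(*(ask(q) for q in questions))

results = asyncio.run(main(["What are the key findings?",
                            "List the limitations.",
                            "What methodology was used?"]))
&lt;/code&gt;&lt;/pre&gt;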

&lt;p&gt;This failure produces correct results. Both code paths return valid completions. The only signal that something is wrong is a bill that is higher than it should be, which most teams attribute to volume rather than structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to look
&lt;/h2&gt;

&lt;p&gt;Find every place in your codebase where a prompt is assembled. Identify which content is stable across requests and which is dynamic. If dynamic content appears before any cache breakpoint, you have a structural problem that no amount of breakpoint configuration will fix.&lt;/p&gt;

&lt;p&gt;The three offenders I see most often: RAG chunks injected into the system prompt block rather than the user message, user-specific metadata prepended as a system prefix, and timestamp or request-ID fields inadvertently baked into the cacheable portion for debugging purposes.&lt;/p&gt;

&lt;p&gt;Once you restructure for a stable prefix, add &lt;code&gt;cache_control: {"type": "ephemeral"}&lt;/code&gt; at the end of the static block and watch the &lt;code&gt;cache_read_input_tokens&lt;/code&gt; field in the response. If that field is zero on requests after the first, your prefix is still changing. The field will tell you immediately whether the fix held.&lt;/p&gt;

&lt;p&gt;The savings that caching advertises are real. They are just gated behind understanding that the cache cannot compensate for a prompt assembly pattern that was never designed with prefix stability in mind.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
