<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gauravdagde</title>
    <description>The latest articles on Forem by gauravdagde (@gauravdagde).</description>
    <link>https://forem.com/gauravdagde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F370559%2F2c8c382f-9e05-4ad6-93eb-07769ca1b8c6.jpeg</url>
      <title>Forem: gauravdagde</title>
      <link>https://forem.com/gauravdagde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gauravdagde"/>
    <language>en</language>
    <item>
      <title>GPT-5 vs Claude Sonnet 4: real per-task cost and benchmark comparison for production workloads</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Mon, 27 Apr 2026 03:44:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/gpt-5-vs-claude-sonnet-4-real-per-task-cost-and-benchmark-comparison-for-production-workloads-2c8d</link>
      <guid>https://forem.com/gauravdagde/gpt-5-vs-claude-sonnet-4-real-per-task-cost-and-benchmark-comparison-for-production-workloads-2c8d</guid>
      <description>&lt;p&gt;You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. Pricing pages give you per-million-token numbers. Benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.&lt;/p&gt;

&lt;p&gt;This post is that comparison.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deprecation note before we go further.&lt;/strong&gt; Claude Sonnet 4 (&lt;code&gt;claude-sonnet-4-20250514&lt;/code&gt;, launched May 22, 2025) is &lt;strong&gt;deprecated and retires on June 15, 2026.&lt;/strong&gt; Anthropic's recommended migration target is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6 because the original Sonnet 4 won't be in production a quarter from now. Numbers below cover both.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPT-5 is roughly &lt;strong&gt;1.6–2.0x cheaper per task&lt;/strong&gt; than Sonnet 4 / 4.6 on most workload mixes ($1.25/$10 vs $3/$15 per MTok).&lt;/li&gt;
&lt;li&gt;GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet wins on agentic tool-use reliability (Tau-Bench).&lt;/li&gt;
&lt;li&gt;The most cost-effective production setup is &lt;strong&gt;neither alone&lt;/strong&gt; — a routing layer that uses GPT-5 nano or Haiku 4.5 for simple work and escalates to GPT-5 or Sonnet 4.6 only when needed.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Pricing reality (April 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;GPT-5 (Aug 2025)&lt;/th&gt;
&lt;th&gt;Sonnet 4 (deprecating)&lt;/th&gt;
&lt;th&gt;Sonnet 4.6 (current)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.25 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3.00 / MTok&lt;/td&gt;
&lt;td&gt;$3.00 / MTok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$10.00 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15.00 / MTok&lt;/td&gt;
&lt;td&gt;$15.00 / MTok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;~$0.125 / MTok (~90% off)&lt;/td&gt;
&lt;td&gt;$0.30 / MTok (cache read)&lt;/td&gt;
&lt;td&gt;$0.30 / MTok (cache read)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;400K (2x rate &amp;gt;272K on GPT-5.4)&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M (flat pricing)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max output&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning model?&lt;/td&gt;
&lt;td&gt;Yes — reasoning tokens billed as output; 5 effort levels&lt;/td&gt;
&lt;td&gt;Extended thinking; thinking tokens as output&lt;/td&gt;
&lt;td&gt;Extended + adaptive thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch API&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge cutoff&lt;/td&gt;
&lt;td&gt;~Sep/Oct 2024&lt;/td&gt;
&lt;td&gt;~Mar 2025&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Aug 2025&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: &lt;a href="https://platform.openai.com/docs/models/gpt-5" rel="noopener noreferrer"&gt;OpenAI GPT-5 model page&lt;/a&gt;, &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic pricing&lt;/a&gt;, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic models overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The headline:&lt;/strong&gt; GPT-5 is 2.4x cheaper on input, 1.5x cheaper on output. On a typical 4:1 input:output mix, that blends to a 1.6–2.0x cost advantage. Caching changes the picture: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), front-loading cost onto the first call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The long-context line is where Sonnet 4.6 quietly wins.&lt;/strong&gt; GPT-5.4 charges &lt;strong&gt;2x the standard rate above 272K input tokens.&lt;/strong&gt; Sonnet 4.6 has flat pricing across its full 1M token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), Sonnet 4.6 is often &lt;strong&gt;cheaper per request&lt;/strong&gt; despite the higher per-token rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Head-to-head benchmarks (launch window)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GPT-5&lt;/th&gt;
&lt;th&gt;Sonnet 4&lt;/th&gt;
&lt;th&gt;Margin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;72.7% (80.2% high-compute)&lt;/td&gt;
&lt;td&gt;Tight; high-compute Sonnet leads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2025 (no tools)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.5%&lt;/td&gt;
&lt;td&gt;GPT-5 +24.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.4% (Pro)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75.4%&lt;/td&gt;
&lt;td&gt;GPT-5 +13.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU (multimodal)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74.4%&lt;/td&gt;
&lt;td&gt;GPT-5 +9.8pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tau-Bench Retail&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tau-Bench Airline&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: &lt;a href="https://openai.com/index/introducing-gpt-5/" rel="noopener noreferrer"&gt;OpenAI GPT-5 launch&lt;/a&gt;, &lt;a href="https://www.anthropic.com/news/claude-4" rel="noopener noreferrer"&gt;Anthropic Claude 4 launch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;GPT-5 dominates pure reasoning benchmarks&lt;/strong&gt; (math, science, multimodal). &lt;strong&gt;Sonnet 4 holds its own on agentic tool-use&lt;/strong&gt; where reliability matters more than peak intelligence. Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current generation narrows significantly.&lt;/strong&gt; &lt;a href="https://localaimaster.com/models/swe-bench-explained-ai-benchmarks" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt;: Sonnet 4.6 &lt;strong&gt;79.6%&lt;/strong&gt;, GPT-5.4 &lt;strong&gt;~80%&lt;/strong&gt;, Opus 4.5/4.6 &lt;strong&gt;80.8–80.9%&lt;/strong&gt;. The gap between picking GPT-5.4 and Sonnet 4.6 for SWE work is much smaller than the launch numbers suggest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost per real task
&lt;/h2&gt;

&lt;p&gt;Five workloads at typical sizes, no caching, no batching.&lt;/p&gt;
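
&lt;p&gt;Every number below comes from the same arithmetic: tokens divided by one million, times the per-MTok rate from the pricing table. A minimal sketch (the token counts are the workload assumptions, the rates are the published list prices):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// perTaskCost returns USD for one request given per-MTok rates.
func perTaskCost(inTok, outTok, inRate, outRate float64) float64 {
    return inTok/1e6*inRate + outTok/1e6*outRate
}

func main() {
    // 4:1 input:output mix, e.g. 2,000 in / 500 out
    gpt5 := perTaskCost(2000, 500, 1.25, 10.00)
    sonnet := perTaskCost(2000, 500, 3.00, 15.00)
    fmt.Printf("GPT-5 $%.6f  Sonnet $%.6f  ratio %.2fx\n", gpt5, sonnet, sonnet/gpt5)
    // GPT-5 $0.007500  Sonnet $0.013500  ratio 1.80x
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;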

&lt;h3&gt;
  
  
  Customer support reply (200 in / 150 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;At 100K replies/mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.001750&lt;/td&gt;
&lt;td&gt;$175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.002850&lt;/td&gt;
&lt;td&gt;$285&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.6x cheaper.&lt;/strong&gt; Both handle this well; pick on cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code review of 500-line PR (4,000 in / 800 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.013000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.024000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.85x cheaper.&lt;/strong&gt; Sonnet 4.6's SWE-bench parity and agentic tool-use track record make it the conventional choice for code-review agents anyway. The premium buys reliability on the tool-use chain, not raw intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document summarization (3,000 in / 400 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.007750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.015000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.94x cheaper.&lt;/strong&gt; Quality comparable for sub-200K-token documents. Above 272K tokens, &lt;strong&gt;Sonnet 4.6's flat 1M context flips the cost picture&lt;/strong&gt; — cheaper per long-document call than GPT-5.4.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG-enabled Q&amp;amp;A (2,500 in / 250 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.005625&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.011250&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 2.0x cheaper.&lt;/strong&gt; Both handle RAG well; pick on cost unless you've benchmarked Sonnet 4.6 outperforming on your specific retrieval-grounded answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic task with 5 tool calls (~8,000 in / ~3,500 out incl reasoning)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.045000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.076500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.7x cheaper per agent run.&lt;/strong&gt; Sonnet 4 / 4.6's agentic reliability advantage means production agents often pay the premium for fewer retries and tool-use failures. &lt;strong&gt;Cost-per-successful-task ratio narrows considerably once retry math is included.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Reasoning token caveat.&lt;/strong&gt; Add ~30–60% to GPT-5 output cost at &lt;code&gt;reasoning_effort &amp;gt;= medium&lt;/code&gt;. Reasoning tokens are silent and consume your output budget. Sonnet 4.6 extended thinking has the same dynamic. Numbers above assume &lt;code&gt;low&lt;/code&gt; effort.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Production gotchas neither pricing page mentions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. GPT-5 reasoning inflation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reasoning_effort&lt;/code&gt; has 5 levels (none/low/medium/high/xhigh). &lt;strong&gt;xhigh runs ~3–5x the cost of low&lt;/strong&gt; because of hidden reasoning token volume. The output cap (&lt;code&gt;max_output_tokens&lt;/code&gt; on the Responses API, &lt;code&gt;max_completion_tokens&lt;/code&gt; on Chat Completions) ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Good — explicit; controls hidden reasoning cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;      &lt;span class="c1"&gt;# explicit, not default
&lt;/span&gt;    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;             &lt;span class="c1"&gt;# ceiling includes reasoning
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Be concise" in the prompt does not control reasoning verbosity. &lt;a href="https://docs.bswen.com/blog/2026-04-04-gpt54-reasoning-effort-levels-guide/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sonnet output verbosity
&lt;/h3&gt;

&lt;p&gt;Anthropic's own docs note Sonnet's "engaging responses" and recommend prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5. Cost implication: more output tokens at $15/M. Mitigation: explicit length constraints + &lt;code&gt;max_tokens&lt;/code&gt; ceiling.&lt;/p&gt;
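
&lt;p&gt;A minimal sketch of that mitigation, calling the Messages API directly. The model ID here is an assumption; check Anthropic's models overview for the current Sonnet 4.6 identifier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Two controls stacked: a length instruction in the prompt and a
// hard max_tokens ceiling on billed output.
func summarize(doc string) (*http.Response, error) {
    payload, err := json.Marshal(map[string]any{
        "model":      "claude-sonnet-4-6", // assumed ID; verify before use
        "max_tokens": 400,                 // hard cap on output tokens
        "messages": []map[string]string{
            {"role": "user", "content": "Summarize in at most 150 words:\n\n" + doc},
        },
    })
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequest("POST", "https://api.anthropic.com/v1/messages",
        bytes.NewReader(payload))
    if err != nil {
        return nil, err
    }
    req.Header.Set("x-api-key", os.Getenv("ANTHROPIC_API_KEY"))
    req.Header.Set("anthropic-version", "2023-06-01")
    req.Header.Set("content-type", "application/json")
    return http.DefaultClient.Do(req)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;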

&lt;h3&gt;
  
  
  3. Long-context pricing flip
&lt;/h3&gt;

&lt;p&gt;Above 272K input tokens, GPT-5.4 charges 2x standard input rate. Sonnet 4.6's 1M context has flat pricing throughout. For workloads regularly using long context — codebases &amp;gt;50K lines, multi-document research synthesis, RAG with large retrieval windows — &lt;strong&gt;Sonnet 4.6 can be cheaper per request&lt;/strong&gt; despite the higher per-token rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Rate limits
&lt;/h3&gt;

&lt;p&gt;OpenAI GPT-5 (post-Sept 2025 increase): T1 500K TPM, T2 1M, T3 2M, T4 4M, T5 40M. &lt;a href="https://x.com/OpenAIDevs/status/1966610846559134140" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Anthropic: T1 (after $5 deposit) 50 RPM, T2 1K, T3 2K, T4 4K.&lt;/p&gt;

&lt;p&gt;Not directly comparable (TPM vs RPM), but production teams with bursty traffic on Anthropic tend to need tier escalation earlier than they would for an equivalent OpenAI workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Reliability (Dec 2025 reference)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://isdown.app/blog/llm-providers-status-report-december-2025" rel="noopener noreferrer"&gt;IsDown LLM provider report Dec 2025&lt;/a&gt;: Anthropic 20 incidents (7 major), 184.5 hrs total. OpenAI 22 incidents (1 major), 182.7 hrs. Anthropic fewer total but more severe; OpenAI more frequent but minor.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Recent security incidents
&lt;/h3&gt;

&lt;p&gt;November 2025 OpenAI Mixpanel breach (API portal customer profiles); Anthropic Claude Code internal-files exposure. Both public. &lt;a href="https://incidentdatabase.ai/blog/incident-report-2025-november-december-2026-january/" rel="noopener noreferrer"&gt;AI Incident DB&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Geographic / data residency premiums
&lt;/h3&gt;

&lt;p&gt;Anthropic 1.1x multiplier for US-only &lt;code&gt;inference_geo&lt;/code&gt; on Opus 4.6+. Bedrock and Vertex regional endpoints add ~10% premium for Sonnet 4.5+. &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The decision framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose GPT-5 (or GPT-5 mini) when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Math-heavy reasoning, science Q&amp;amp;A, technical analysis&lt;/li&gt;
&lt;li&gt;Structured-output generation at lower cost&lt;/li&gt;
&lt;li&gt;High-volume workloads where the 1.6–2.0x price gap compounds&lt;/li&gt;
&lt;li&gt;RAG and document summarization on documents under 272K tokens&lt;/li&gt;
&lt;li&gt;Customer support replies where per-reply cost matters more than peak quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Claude Sonnet 4.6 when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agentic workflows with tool-use reliability requirements&lt;/li&gt;
&lt;li&gt;Software engineering agents (code review, refactoring, multi-file patches)&lt;/li&gt;
&lt;li&gt;Long-context workloads — flat 1M pricing beats GPT-5's 2x above 272K&lt;/li&gt;
&lt;li&gt;Writing-heavy tasks where coherent verbose output is preferred&lt;/li&gt;
&lt;li&gt;Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose neither alone for production
&lt;/h3&gt;

&lt;p&gt;The architecture that minimizes cost-per-task across a real product is a &lt;strong&gt;routing layer&lt;/strong&gt; that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, escalating to GPT-5 or Sonnet 4.6 only for requests that need the capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lmsys.org/blog/2024-07-01-routellm/" rel="noopener noreferrer"&gt;Berkeley's RouteLLM benchmarks&lt;/a&gt; demonstrate ~85% cost reduction at 95% quality on routable workloads. The setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the pricing comparison doesn't capture
&lt;/h2&gt;

&lt;p&gt;The 2x price gap between GPT-5 and Sonnet 4.6 is real, but it's not the most consequential variable in your bill. The variables that matter more, in roughly this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Whether you're routing at all.&lt;/strong&gt; A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether prompt caching is active.&lt;/strong&gt; Up to 90% off cached input on both providers. A bug that breaks the cache (a timestamp in the prompt prefix, say) costs you more than picking the more expensive model would; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether &lt;code&gt;reasoning_effort&lt;/code&gt; is set.&lt;/strong&gt; Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether you're on the Batch API for batchable work.&lt;/strong&gt; Flat 50% off — invisible if traffic is all real-time, enormous if any non-realtime work is in the mix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Then, finally, the per-token rate of the model you picked.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
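
&lt;p&gt;On point 2: both providers cache on an exact prefix match, so anything volatile has to sit after everything static. A sketch of the structure (the names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// buildPrompt keeps the prefix byte-identical across requests so
// provider-side caching can hit. A timestamp placed before the
// static blocks would invalidate the cache on every call.
func buildPrompt(system, fewShot, question string) string {
    return system + fewShot + // static prefix: cacheable
        "\n\nAsked at: " + time.Now().Format(time.RFC3339) + // volatile tail
        "\nQuestion: " + question
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;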

&lt;p&gt;Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more.&lt;/p&gt;

&lt;p&gt;Full breakdown with the per-task cost table, deprecation timeline, and production gotchas: &lt;strong&gt;&lt;a href="https://preto.ai/blog/gpt-5-vs-claude-sonnet-4/" rel="noopener noreferrer"&gt;preto.ai/blog/gpt-5-vs-claude-sonnet-4/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>claude</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>How We Log LLM Requests at Sub-50ms Latency Using ClickHouse</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Thu, 23 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/how-we-log-llm-requests-at-sub-50ms-latency-using-clickhouse-3jbn</link>
      <guid>https://forem.com/gauravdagde/how-we-log-llm-requests-at-sub-50ms-latency-using-clickhouse-3jbn</guid>
      <description>&lt;p&gt;We were logging 50,000 LLM requests per day to PostgreSQL. Query latency was fine. At 400,000 requests, cost aggregation queries started taking 3 seconds. At 2 million, the database was the slowest thing in the stack.&lt;/p&gt;

&lt;p&gt;We switched to ClickHouse. Here's exactly what changed and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM request logs are append-only, high-cardinality, analytics-heavy — that's a ClickHouse workload, not Postgres&lt;/li&gt;
&lt;li&gt;Switching dropped our cost dashboard p95 from 3.2s to 12ms (with materialized views)&lt;/li&gt;
&lt;li&gt;Our async Go write path keeps logging overhead under 2ms p95&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Schema That Broke PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Our initial PostgreSQL schema was straightforward — UUID primary key, indexes on &lt;code&gt;(tenant_id, created_at)&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;feature&lt;/code&gt;. It worked fine at low volume.&lt;/p&gt;

&lt;p&gt;The problem was our query pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 10 million rows, these GROUP BY queries were scanning hundreds of thousands of rows per tenant. PostgreSQL is row-oriented — to compute &lt;code&gt;SUM(cost_usd)&lt;/code&gt;, it reads entire rows to extract one column. At scale, that's a lot of I/O for a field you could pre-aggregate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLM Logs Are a ClickHouse Workload
&lt;/h2&gt;

&lt;p&gt;Three properties make LLM request logs a natural fit for ClickHouse:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Append-only writes.&lt;/strong&gt; LLM logs are never updated or deleted (outside retention policies). ClickHouse is optimized for high-throughput inserts — it doesn't pay the MVCC overhead PostgreSQL does on write-heavy workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Columnar storage.&lt;/strong&gt; When you query &lt;code&gt;SUM(cost_usd) GROUP BY model&lt;/code&gt;, ClickHouse reads only the &lt;code&gt;cost_usd&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; columns from disk. PostgreSQL reads entire rows. For a 20-column table where you're aggregating 2 columns, ClickHouse does 10% of the I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Materialized views.&lt;/strong&gt; ClickHouse pre-aggregates data at insert time. When a new log row lands, a materialized view fires and updates a running total in a summary table. Your dashboard hits the summary — not 10 million raw rows.&lt;/p&gt;

&lt;p&gt;For reference: Langfuse uses PostgreSQL for its logging backend. It works well at lower volumes and for teams that need ACID transactions. If you're running more than 1M requests/month and querying with analytics patterns, you'll eventually feel the difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our ClickHouse Schema
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;    &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;     &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;         &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;provider&lt;/span&gt;      &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;feature&lt;/span&gt;       &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;       &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_tokens&lt;/span&gt;  &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cost_usd&lt;/span&gt;      &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency_ms&lt;/span&gt;    &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cached&lt;/span&gt;        &lt;span class="n"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;    &lt;span class="nb"&gt;DateTime&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LowCardinality(String)&lt;/code&gt; for &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;provider&lt;/code&gt;, &lt;code&gt;feature&lt;/code&gt; — bounded value sets. ClickHouse dictionary-encodes them, cutting storage ~4x and improving scan speed.&lt;/li&gt;
&lt;li&gt;Monthly partitions let us drop old data with &lt;code&gt;ALTER TABLE DROP PARTITION&lt;/code&gt; instead of slow DELETE queries (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
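
&lt;p&gt;The partition point is worth one concrete line. A sketch using clickhouse-go's native interface (the partition expression matches the &lt;code&gt;toYYYYMM&lt;/code&gt; key above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// dropMonth removes one monthly partition: a metadata operation,
// not a row-by-row DELETE. yyyymm is an internal value like
// "202601", never user input.
func dropMonth(ctx context.Context, conn driver.Conn, yyyymm string) error {
    return conn.Exec(ctx, "ALTER TABLE llm_requests DROP PARTITION "+yyyymm)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;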




&lt;h2&gt;
  
  
  Materialized Views That Pre-Aggregate at Insert Time
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_daily&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;   &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;       &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;day&lt;/span&gt;         &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_cost&lt;/span&gt;  &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;total_reqs&lt;/span&gt;  &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UInt64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;avg_latency&lt;/span&gt; &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AggregatingMergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_mv&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_daily&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sumState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;countState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_reqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;avgState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_latency&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dashboard queries now hit &lt;code&gt;cost_by_model_daily&lt;/code&gt; — a pre-aggregated summary with orders of magnitude fewer rows than the raw logs.&lt;/p&gt;
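
&lt;p&gt;One detail the DDL above implies but doesn't spell out: &lt;code&gt;AggregateFunction&lt;/code&gt; columns store intermediate states, so reads must finalize them with the &lt;code&gt;-Merge&lt;/code&gt; combinators. A sketch of the dashboard query, again via clickhouse-go, assuming the schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// costByModel finalizes the aggregate states at read time.
// Forgetting the -Merge combinators is the classic
// AggregatingMergeTree mistake: raw states are not usable values.
func costByModel(ctx context.Context, conn driver.Conn, tenant string) (driver.Rows, error) {
    return conn.Query(ctx, `
        SELECT model,
               sumMerge(total_cost)   AS cost,
               countMerge(total_reqs) AS reqs,
               avgMerge(avg_latency)  AS latency
        FROM cost_by_model_daily
        WHERE tenant_id = ? AND day &amp;gt;= today() - 30
        GROUP BY model`, tenant)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;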

&lt;p&gt;&lt;strong&gt;Query latency comparison (10M rows):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;PostgreSQL p95&lt;/th&gt;
&lt;th&gt;ClickHouse raw&lt;/th&gt;
&lt;th&gt;ClickHouse mat. view&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost by model, 30d&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost by feature, 7d&lt;/td&gt;
&lt;td&gt;2.8s&lt;/td&gt;
&lt;td&gt;160ms&lt;/td&gt;
&lt;td&gt;9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly spend, last 24h&lt;/td&gt;
&lt;td&gt;4.1s&lt;/td&gt;
&lt;td&gt;210ms&lt;/td&gt;
&lt;td&gt;15ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Async Write Path: Keeping Logging Under 2ms
&lt;/h2&gt;

&lt;p&gt;Getting queries fast was the first problem. Making sure logging never blocks a request was the second.&lt;/p&gt;

&lt;p&gt;Our Go implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Buffered channel — fire and forget from request handler&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;logCh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;logRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;logCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// buffered, will be flushed async&lt;/span&gt;
  &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// channel full — drop rather than block&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"log_dropped"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Background goroutine: batch flush every 500ms&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runLogFlusher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clickhouse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;flushBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;flushBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client gets its LLM response immediately. The log entry goes into a buffered channel. A background goroutine flushes to ClickHouse in batches of 500 every 500ms. If the channel fills under load, we drop the entry — we'd rather lose a log than slow down a request.&lt;/p&gt;

&lt;p&gt;This gives us p95 logging overhead under 2ms. The ClickHouse batch insert takes 10–30ms for 500 rows — invisible to any individual request.&lt;/p&gt;
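
&lt;p&gt;&lt;code&gt;flushBatch&lt;/code&gt; itself is a thin wrapper over clickhouse-go v2's batch API (&lt;code&gt;driver.Conn&lt;/code&gt;). A sketch with the same drop-on-failure policy as the channel; the &lt;code&gt;LogEntry&lt;/code&gt; field names are assumed to mirror the table columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func flushBatch(db driver.Conn, entries []LogEntry) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    batch, err := db.PrepareBatch(ctx, "INSERT INTO llm_requests")
    if err != nil {
        metrics.Increment("log_flush_failed")
        return
    }
    for _, e := range entries {
        // Append order must match the column order of the table.
        if err := batch.Append(e.RequestID, e.TenantID, e.Model, e.Provider,
            e.Feature, e.UserID, e.InputTokens, e.OutputTokens,
            e.CostUSD, e.LatencyMS, e.Cached, e.CreatedAt); err != nil {
            metrics.Increment("log_flush_failed")
            return
        }
    }
    if err := batch.Send(); err != nil {
        metrics.Increment("log_flush_failed")
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;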




&lt;h2&gt;
  
  
  When to Switch (and When Not To)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write pattern&lt;/td&gt;
&lt;td&gt;Mixed read/write, ACID&lt;/td&gt;
&lt;td&gt;Append-only inserts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;Point lookups, joins&lt;/td&gt;
&lt;td&gt;Aggregations, scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;~5M rows comfortably&lt;/td&gt;
&lt;td&gt;Billions of rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational cost&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right when&lt;/td&gt;
&lt;td&gt;&amp;lt;500K req/month&lt;/td&gt;
&lt;td&gt;&amp;gt;1M req/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Don't over-engineer. If you're under 500K LLM requests per month, Postgres with a &lt;code&gt;(tenant_id, created_at)&lt;/code&gt; index is fine. Migrate when your analytics queries start timing out — not before.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost intelligence that runs on this ClickHouse stack. One URL change to see your spend broken down by model and feature. Free up to 10K requests, no credit card required.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>clickhouse</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Streaming SSE Proxying for LLM APIs: The Hard Parts</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/streaming-sse-proxying-for-llm-apis-the-hard-parts-4d60</link>
      <guid>https://forem.com/gauravdagde/streaming-sse-proxying-for-llm-apis-the-hard-parts-4d60</guid>
      <description>&lt;p&gt;OpenAI streaming looks simple from the outside. Set &lt;code&gt;stream: true&lt;/code&gt;, iterate the response, pipe it to the client. One afternoon of work.&lt;/p&gt;

&lt;p&gt;Then you ship it. A client disconnects mid-generation and you eat 2,000 tokens nobody received. A slow mobile client causes your proxy's memory to climb. An OpenAI rate limit hits after you've already sent a 200. Here's what actually happens when you proxy SSE at scale — and the Go patterns that fix each failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are four production failure modes: chunk boundary corruption, token leaks on disconnect, unbounded buffering under backpressure, and mid-stream errors after a 200&lt;/li&gt;
&lt;li&gt;Each has a clean fix in ~50 lines of Go&lt;/li&gt;
&lt;li&gt;All four are running in production at Preto at 5,000+ streaming req/s, &amp;lt;50ms p95 overhead&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How LLM SSE Actually Works
&lt;/h2&gt;

&lt;p&gt;OpenAI's streaming response is plain HTTP with &lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;. The body is a sequence of newline-delimited frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hello"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;" world"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;DONE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event is &lt;code&gt;data: {JSON}\n\n&lt;/code&gt;. The stream ends with &lt;code&gt;data: [DONE]\n\n&lt;/code&gt;. A proxy needs to read this upstream, optionally inspect frames, and write them downstream without adding latency. Here's what breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 1: Chunk Boundary Corruption
&lt;/h2&gt;

&lt;p&gt;TCP hands you a byte stream, not message boundaries. A single SSE event may arrive split across multiple reads, or multiple events may arrive in one read.&lt;/p&gt;

&lt;p&gt;If you need to inspect frames (extract token counts, inject headers, filter events), a naive fixed-buffer read corrupts the framing. Use &lt;code&gt;bufio.Scanner&lt;/code&gt; with a custom SSE split function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;proxySSE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadCloser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;onEvent&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// push each event immediately — don't buffer&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;onEvent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data: "&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// strip "data: " prefix&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Split on double-newline (SSE event boundary)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advance&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimRight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimRight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;flusher.Flush()&lt;/code&gt; call on every event is critical. Without it, Go's &lt;code&gt;http.ResponseWriter&lt;/code&gt; buffers writes and the client gets batched chunks — defeating streaming entirely.&lt;/p&gt;
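
&lt;p&gt;One related footgun: &lt;code&gt;bufio.Scanner&lt;/code&gt; caps tokens at 64 KiB by default, and a single oversized SSE event (a big tool-call delta, say) makes &lt;code&gt;Scan()&lt;/code&gt; return false with &lt;code&gt;bufio.ErrTooLong&lt;/code&gt;. Give it headroom right after creating the scanner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Default max token size is bufio.MaxScanTokenSize (64 KiB).
// Raise the ceiling so one oversized event can't kill the stream.
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024) // allow events up to 1 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;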




&lt;h2&gt;
  
  
  Failure 2: Token Leaks on Client Disconnect
&lt;/h2&gt;

&lt;p&gt;This is the most expensive failure mode.&lt;/p&gt;

&lt;p&gt;When a client closes the connection mid-stream — tab closed, app backgrounded, network timeout — a naive proxy keeps the upstream OpenAI request running. OpenAI finishes generating the full completion. You're billed for every token.&lt;/p&gt;

&lt;p&gt;At 1,000 req/s with a 5% disconnect rate and 500 average output tokens: 25,000 wasted tokens per second — hundreds of dollars per day at GPT-4o-mini pricing, more at GPT-4o.&lt;/p&gt;

&lt;p&gt;The fix is Go context propagation. When a client disconnects, Go's &lt;code&gt;net/http&lt;/code&gt; server cancels &lt;code&gt;r.Context()&lt;/code&gt;. Pass that context to the upstream call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;handleStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// r.Context() is cancelled when the client disconnects&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequestWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"https://api.openai.com/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Clone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// client disconnected — upstream cancelled, no leak&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"upstream error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;502&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// ... proxy the stream&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key change is &lt;code&gt;http.NewRequestWithContext&lt;/code&gt; instead of &lt;code&gt;http.NewRequest&lt;/code&gt;. When the context is cancelled, Go's HTTP client aborts the upstream TCP connection. OpenAI stops generating. You stop paying.&lt;/p&gt;
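
&lt;p&gt;A variant worth considering: derive the upstream context from &lt;code&gt;r.Context()&lt;/code&gt; but add an absolute cap, so a hung upstream can't hold a connection forever even when the client stays connected. The two-minute figure below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: client disconnect still cancels, and a wedged upstream is
// also cut off after an absolute cap (2 minutes here, purely illustrative).
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Minute)
defer cancel()
// pass ctx to http.NewRequestWithContext exactly as above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;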




&lt;h2&gt;
  
  
  Failure 3: Backpressure and Unbounded Buffering
&lt;/h2&gt;

&lt;p&gt;OpenAI streams at ~50–100 tokens/second. A fast client reads fine. A slow client — one processing each chunk before reading the next, or on a congested connection — falls behind.&lt;/p&gt;

&lt;p&gt;Kernel socket buffers are bounded (~64–256KB per connection). When they fill, &lt;code&gt;Write()&lt;/code&gt; blocks, which blocks your upstream reader, which stalls the TCP window, which pauses OpenAI. The stream stalls but the connection stays open — holding resources indefinitely.&lt;/p&gt;

&lt;p&gt;The dangerous alternative is an unbounded in-memory buffer: under load, enough slow clients will OOM the proxy.&lt;/p&gt;

&lt;p&gt;Our fix: a bounded channel between reader and writer goroutines, with a timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;streamWithBackpressure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadCloser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// max 64 events buffered&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;  &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eventCh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// client too slow — abort cleanly&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the downstream writer doesn't consume an event within 5 seconds, the reader exits, closes the channel, and the writer loop exits cleanly. Context cancellation terminates the upstream connection.&lt;/p&gt;
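
&lt;p&gt;On Go 1.20+ there's a lighter alternative worth knowing: put a deadline on each client write via &lt;code&gt;http.ResponseController&lt;/code&gt; instead of decoupling through a channel. A sketch, reusing the 5-second figure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch (Go 1.20+): per-write deadline instead of a buffered channel.
// A client that can't keep up makes Write fail; the handler exits and
// context cancellation cuts off the upstream as before.
rc := http.NewResponseController(w)
for scanner.Scan() {
    rc.SetWriteDeadline(time.Now().Add(5 * time.Second))
    if _, err := w.Write(append(scanner.Bytes(), '\n', '\n')); err != nil {
        return // slow or dead client
    }
    rc.Flush()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;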




&lt;h2&gt;
  
  
  Failure 4: Mid-Stream Errors After HTTP 200
&lt;/h2&gt;

&lt;p&gt;HTTP status codes are sent before the body. Once you've written &lt;code&gt;200 OK&lt;/code&gt; and started streaming, you cannot send a &lt;code&gt;429&lt;/code&gt; or &lt;code&gt;503&lt;/code&gt; if something goes wrong. This happens in practice: OpenAI sends a 200 header, begins streaming, then hits an internal rate limit mid-generation. The stream truncates. The client sees success with an incomplete response — and typically retries, paying for the partial generation a second time.&lt;/p&gt;

&lt;p&gt;The fix: send an in-band error event before closing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;writeSSEError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"data: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// After your scanner loop:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;writeSSEError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"stream_error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"upstream stream interrupted"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clients should check every &lt;code&gt;data:&lt;/code&gt; payload for an &lt;code&gt;error&lt;/code&gt; key, not just the HTTP status code. A truncated stream with an in-band error should retry differently from a clean completion.&lt;/p&gt;
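
&lt;p&gt;A client-side sketch of that check, assuming one JSON payload per &lt;code&gt;data:&lt;/code&gt; line as in the examples above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Client-side sketch: distinguish in-band errors from a clean [DONE].
for scanner.Scan() {
    payload, ok := strings.CutPrefix(scanner.Text(), "data: ")
    if !ok || payload == "[DONE]" {
        continue
    }
    var chunk struct {
        Error *struct {
            Code    string `json:"code"`
            Message string `json:"message"`
        } `json:"error"`
    }
    if json.Unmarshal([]byte(payload), &amp;amp;chunk) == nil &amp;amp;&amp;amp; chunk.Error != nil {
        // truncated stream: retry with backoff, don't treat as success
        break
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;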




&lt;h2&gt;
  
  
  Bonus: Cost Tracking Without Blocking
&lt;/h2&gt;

&lt;p&gt;You can't calculate output token cost until the stream ends, and by default OpenAI's streaming responses don't include token counts at all. You'd need to build your own token counter — and you'd still be estimating.&lt;/p&gt;

&lt;p&gt;Solution: pass &lt;code&gt;stream_options: {"include_usage": true}&lt;/code&gt; in the request body. OpenAI sends a final usage chunk before &lt;code&gt;[DONE]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[],&lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;387&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;DONE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your &lt;code&gt;onEvent&lt;/code&gt; handler, watch for the &lt;code&gt;usage&lt;/code&gt; field. When it appears, fire the log entry into your async channel (see: &lt;a href="https://preto.ai/blog/clickhouse-llm-logging/" rel="noopener noreferrer"&gt;how we log at sub-50ms with ClickHouse&lt;/a&gt;) and pass the chunk through unmodified. Zero latency added.&lt;/p&gt;
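
&lt;p&gt;A sketch of that watch, where &lt;code&gt;payload&lt;/code&gt; is the raw bytes of one event and &lt;code&gt;usageCh&lt;/code&gt; is a hypothetical buffered channel drained by the async logger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: detect the final usage chunk and hand it off without blocking.
// usageCh is a hypothetical buffered channel behind the async logger.
var chunk struct {
    Usage *struct {
        PromptTokens     int `json:"prompt_tokens"`
        CompletionTokens int `json:"completion_tokens"`
    } `json:"usage"`
}
if json.Unmarshal(payload, &amp;amp;chunk) == nil &amp;amp;&amp;amp; chunk.Usage != nil {
    select {
    case usageCh &amp;lt;- *chunk.Usage:
    default: // never stall the stream for logging
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;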




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk corruption&lt;/td&gt;
&lt;td&gt;Fixed-buffer reads split events&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bufio.Scanner&lt;/code&gt; with SSE split func&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token leak&lt;/td&gt;
&lt;td&gt;Upstream outlives client&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NewRequestWithContext(r.Context())&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backpressure / OOM&lt;/td&gt;
&lt;td&gt;Unbounded buffer&lt;/td&gt;
&lt;td&gt;Bounded channel + write timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-stream errors&lt;/td&gt;
&lt;td&gt;Can't change 200 status&lt;/td&gt;
&lt;td&gt;In-band error event before close&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All four are in production in Preto's Go proxy. If you want to see the full latency breakdown — proxy overhead vs. provider TTFT — on your own traffic, &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto is free up to 10K requests&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost intelligence that runs on this proxy stack.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Prompt Hashing for Duplicate Detection: Cutting LLM Waste With SHA-256</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/prompt-hashing-for-duplicate-detection-cutting-llm-waste-with-sha-256-5hfm</link>
      <guid>https://forem.com/gauravdagde/prompt-hashing-for-duplicate-detection-cutting-llm-waste-with-sha-256-5hfm</guid>
      <description>&lt;p&gt;You know your LLM bill is higher than it should be. You can see total spend and total tokens in the OpenAI dashboard. What you can't see: how many of those requests are asking the exact same question that was answered five minutes ago.&lt;/p&gt;

&lt;p&gt;Prompt hashing is the cheapest LLM optimization available — no model changes, no prompt rewrites, under 1ms added to the request path, no false positives. You hash the request, check the cache, skip the LLM on a hit — and don't pay for that call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The average production app sends 15–30% duplicate LLM requests. SHA-256 exact hashing catches all of them with zero false positives&lt;/li&gt;
&lt;li&gt;The hash key must cover model + full messages array + normalized generation params — not just the prompt string&lt;/li&gt;
&lt;li&gt;When teams first connect Preto, the average exact duplicate rate on day one is &lt;strong&gt;18%&lt;/strong&gt; — pure recoverable waste&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Duplicates Are More Common Than You Think
&lt;/h2&gt;

&lt;p&gt;The first reaction: "our requests aren't duplicates — every user's query is different." True for open-ended chat. Not true for most production LLM use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support/FAQ bots.&lt;/strong&gt; "What are your hours?" "How do I reset my password?" "What's your return policy?" These arrive hundreds of times per day, word for word. Every one hits the LLM fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled jobs.&lt;/strong&gt; Weekly reports, nightly summaries, cron-triggered pipelines. Same prompt template, same data, same request — billed every run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-layer cache misses.&lt;/strong&gt; The most common root cause: someone added LLM calls to a hot path without adding a cache. Every page load calls the LLM even when nothing changed. The cache was on the roadmap. It never shipped.&lt;/p&gt;

&lt;p&gt;SHA-256 hashing at the proxy layer catches all of these before they reach the provider — even if the application has no caching at all. You stop paying for duplicates without touching a line of app code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Hash (Most Implementations Get This Wrong)
&lt;/h2&gt;

&lt;p&gt;The naive approach: hash the last user message. Wrong in production.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Summarize this document"&lt;/code&gt; sent to GPT-4o with temperature 0.2 should cache differently from the same string sent to GPT-4o-mini with temperature 0.8. Same prompt text, different request, different output.&lt;/p&gt;

&lt;p&gt;The hash key must include everything that deterministically affects the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model name&lt;/strong&gt; — &lt;code&gt;gpt-4o&lt;/code&gt; and &lt;code&gt;gpt-4o-mini&lt;/code&gt; are different requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full messages array&lt;/strong&gt; — role + content for every message, including system prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation params&lt;/strong&gt; — temperature, max_tokens, top_p, frequency_penalty, presence_penalty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exclude: &lt;code&gt;user&lt;/code&gt; field, request IDs, &lt;code&gt;stream&lt;/code&gt; flag, &lt;code&gt;stream_options&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Go Implementation
&lt;/h2&gt;

&lt;p&gt;The critical requirement: &lt;em&gt;canonical&lt;/em&gt; serialization. The same logical request must always produce the same byte sequence. Hashing the raw request body fails this: clients differ in key order, whitespace, and which defaults they send. Re-marshaling into fixed structs canonicalizes the request, since Go marshals struct fields in declaration order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;CacheKey&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Model&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;    &lt;span class="s"&gt;`json:"model"`&lt;/span&gt;
    &lt;span class="n"&gt;Messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="s"&gt;`json:"messages"`&lt;/span&gt;
    &lt;span class="n"&gt;Params&lt;/span&gt;   &lt;span class="n"&gt;KeyParams&lt;/span&gt; &lt;span class="s"&gt;`json:"params"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Role&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"role"`&lt;/span&gt;
    &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"content"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;KeyParams&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Temperature&lt;/span&gt;      &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"temperature,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;MaxTokens&lt;/span&gt;        &lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="s"&gt;`json:"max_tokens,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;TopP&lt;/span&gt;             &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"top_p,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;FrequencyPenalty&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"frequency_penalty,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;PresencePenalty&lt;/span&gt;  &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"presence_penalty,omitempty"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ComputePromptHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;CacheKey&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalizeMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Params&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KeyParams&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Temperature&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temperature&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MaxTokens&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;TopP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TopP&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c"&gt;// ... other params&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// struct fields marshal in declaration order — deterministic&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;normalizeMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The float normalization (&lt;code&gt;math.Round(x*1000)/1000&lt;/code&gt;) matters. Different SDK versions can produce &lt;code&gt;0.7&lt;/code&gt; vs &lt;code&gt;0.6999999999999998&lt;/code&gt; for the same value. Without normalization, these hash differently and your cache never hits.&lt;/p&gt;

&lt;p&gt;The Redis cache (both methods take the request context, so lookups are cancelled along with the request):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cacheTTL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hour&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CachedResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ph:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="n"&gt;CachedResponse&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CachedResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ph:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cacheTTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
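
&lt;p&gt;Wired into the request path, the flow is hash, check, forward on miss. A sketch; &lt;code&gt;parseChatRequest&lt;/code&gt;, &lt;code&gt;serveCached&lt;/code&gt; and &lt;code&gt;forward&lt;/code&gt; are hypothetical stand-ins for your own handler plumbing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Request-path sketch. parseChatRequest, serveCached and forward are
// hypothetical helpers, not part of any SDK.
func (p *Proxy) handle(w http.ResponseWriter, r *http.Request) {
    req := parseChatRequest(r) // decode body into the SDK request struct
    hash := ComputePromptHash(req)

    if cached, ok := p.cache.Get(r.Context(), hash); ok {
        serveCached(w, cached) // skip the provider entirely
        return
    }
    resp := forward(w, r) // proxy to the provider, tee the response
    p.cache.Set(r.Context(), hash, resp)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;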






&lt;h2&gt;
  
  
  Real Duplicate Rates by Use Case (Anonymized Production Data)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Exact Duplicate Rate&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Support / FAQ bots&lt;/td&gt;
&lt;td&gt;35–45%&lt;/td&gt;
&lt;td&gt;Users ask the same questions constantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled / batch jobs&lt;/td&gt;
&lt;td&gt;20–60%&lt;/td&gt;
&lt;td&gt;Retried jobs, multi-instance deploys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code assist / review&lt;/td&gt;
&lt;td&gt;8–15%&lt;/td&gt;
&lt;td&gt;Boilerplate patterns, shared stubs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General chat / copilot&lt;/td&gt;
&lt;td&gt;3–8%&lt;/td&gt;
&lt;td&gt;Lowest — open-ended queries rarely repeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Discovered on day one of proxy connection&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;18% is not an edge case — it's the first thing visible when a team adds a proxy with prompt hashing. Some teams see 40%. That's the floor of recoverable waste, before semantic caching, model routing, or any other optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do With the Hash Beyond Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Duplicate rate reporting.&lt;/strong&gt; Track cache hit rate per endpoint. This surfaces which parts of your application send duplicate traffic — usually a missing application-layer cache that should be fixed at the root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost projection.&lt;/strong&gt; A hash seen 200 times this month × $0.008/request (roughly GPT-4o at 500 output tokens; GPT-4o-mini would be far cheaper) = $1.60 in recoverable waste for one prompt. Multiply across all duplicate hashes for a concrete monthly saving figure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abuse detection.&lt;/strong&gt; A single hash 10,000 times in an hour from one user is a different pattern from organic duplicates. Rate limit by hash to catch prompt injection loops and runaway retry logic.&lt;/p&gt;
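
&lt;p&gt;A sketch of per-hash rate limiting with a fixed-window Redis counter; the one-hour window and 10,000-request threshold are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: fixed-window counter per (user, hash). Window and threshold
// are illustrative; tune to your traffic.
func (c *Cache) AllowByHash(ctx context.Context, user, hash string) bool {
    key := "rl:" + user + ":" + hash
    n, err := c.redis.Incr(ctx, key).Result()
    if err != nil {
        return true // fail open on Redis errors
    }
    if n == 1 {
        c.redis.Expire(ctx, key, time.Hour)
    }
    return n &amp;lt;= 10000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;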




&lt;h2&gt;
  
  
  Where Hashing Falls Short
&lt;/h2&gt;

&lt;p&gt;SHA-256 catches exact duplicates. It misses semantic duplicates.&lt;/p&gt;

&lt;p&gt;"What are your business hours?" and "When do you open?" hash differently. They're the same question.&lt;/p&gt;

&lt;p&gt;Semantic caching handles this with embedding similarity — it can catch 2–3x more waste, but requires an embedding model, a vector store, and careful threshold tuning to avoid false positives.&lt;/p&gt;
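
&lt;p&gt;The core of the similarity check itself is small; the hard parts are the embedding calls and the threshold. A sketch, where &lt;code&gt;embed&lt;/code&gt; is a hypothetical call to your embedding model and 0.92 is purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch only. embed() is a hypothetical embedding-model call; the 0.92
// threshold is illustrative and must be tuned against labeled pairs.
func cosine(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// if cosine(embed(query), candidate.Embedding) &amp;gt;= 0.92 { /* semantic hit */ }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;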

&lt;p&gt;&lt;strong&gt;The right order:&lt;/strong&gt; implement exact hashing first. Zero-risk, one afternoon, immediate savings. Add semantic caching once you've measured the remaining duplicate rate.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — the proxy layer that computes prompt hashes on every request and surfaces duplicate rates, recoverable waste, and per-feature cost breakdown in real time. Free up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>ai</category>
      <category>backend</category>
    </item>
    <item>
      <title>LLM Gateway vs LLM Proxy vs LLM Router: What's the Difference?</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</link>
      <guid>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</guid>
      <description>&lt;p&gt;Everyone calls their product a "gateway" now. LiteLLM markets itself as both a proxy and a gateway. Portkey is a gateway. Helicone's docs use proxy and gateway interchangeably. There's a well-cited Medium post by Bijit Ghosh that ranks on Google for this comparison — correct high-level definitions, but it stops before the implementation details that tell you what to actually choose and deploy.&lt;/p&gt;

&lt;p&gt;Here's the precise version: three different layers, concrete Go code for each, and a decision framework based on team size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proxy&lt;/strong&gt; = transport layer. Pipes requests from your app to the provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router&lt;/strong&gt; = decision layer. Chooses which model or provider handles the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway&lt;/strong&gt; = policy layer. Auth, rate limits, budget enforcement, audit trails&lt;/li&gt;
&lt;li&gt;They're not separate products — they're three layers of the same stack&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Proxy: Transport Layer
&lt;/h2&gt;

&lt;p&gt;A proxy intercepts your HTTP request and forwards it to the provider. Your app changes one thing: the &lt;code&gt;base_url&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Before&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// After — same SDK, same code, different URL&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBaseURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://proxy.your-company.com/v1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal Go proxy handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Swap client key → upstream provider key&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providerKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://api.openai.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;httputil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSingleHostReverseProxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core. A proxy doesn't decide anything — it doesn't choose GPT-4o over GPT-4o-mini, doesn't enforce rate limits. It pipes traffic. Everything else is built on top of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Router: Decision Layer
&lt;/h2&gt;

&lt;p&gt;A router decides which model and provider handle each request. It returns a routing decision; the proxy executes it. The router is pure business logic — no HTTP, no transport — which makes it testable independently and swappable without touching the proxy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing&lt;/strong&gt; (most valuable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimateComplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Short, simple: classification, extraction, booleans&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Medium: summarization, structured output&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Complex: multi-step reasoning, code generation&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
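
&lt;p&gt;&lt;code&gt;estimateComplexity&lt;/code&gt; can start as a cheap heuristic. One possible shape (the signals and weights are illustrative, not tuned):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// One possible estimateComplexity heuristic. Signals and weights are
// illustrative; a real router tunes them against eval results.
func (r *Router) estimateComplexity(req *ChatRequest) float64 {
    if len(req.Messages) == 0 {
        return 0
    }
    var score float64
    totalLen := 0
    for _, m := range req.Messages {
        totalLen += len(m.Content)
    }
    if totalLen &amp;gt; 4000 {
        score += 0.4 // long context tends to mean harder tasks
    }
    last := strings.ToLower(req.Messages[len(req.Messages)-1].Content)
    for _, kw := range []string{"step by step", "explain why", "refactor", "prove"} {
        if strings.Contains(last, kw) {
            score += 0.3
            break
        }
    }
    if len(req.Messages) &amp;gt; 6 {
        score += 0.2 // long multi-turn conversations escalate
    }
    return math.Min(score, 1.0)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;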



&lt;p&gt;&lt;strong&gt;Failover routing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gemini-1.5-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"google"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteWithFailover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAvailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
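
&lt;p&gt;The &lt;code&gt;r.circuit.IsAvailable&lt;/code&gt; call above implies a circuit breaker. A minimal sketch (the 5-failure threshold and 30-second cooldown are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Minimal circuit breaker sketch. Threshold and cooldown are illustrative.
type Circuit struct {
    mu       sync.Mutex
    failures map[string]int
    openedAt map[string]time.Time
}

func NewCircuit() *Circuit {
    return &amp;amp;Circuit{failures: map[string]int{}, openedAt: map[string]time.Time{}}
}

func (c *Circuit) IsAvailable(provider string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    opened, tripped := c.openedAt[provider]
    return !tripped || time.Since(opened) &amp;gt;= 30*time.Second
}

func (c *Circuit) RecordFailure(provider string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.failures[provider]++
    if c.failures[provider] &amp;gt;= 5 {
        c.openedAt[provider] = time.Now() // open the circuit
        c.failures[provider] = 0
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;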



&lt;p&gt;&lt;strong&gt;Metadata-based routing&lt;/strong&gt; (route by feature tag your app sets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteByTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Feature"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"support-bot"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"code-review"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Gateway: Policy Layer
&lt;/h2&gt;

&lt;p&gt;A gateway adds policy enforcement above the router and proxy. The defining characteristic: the gateway has a concept of &lt;em&gt;identity&lt;/em&gt;. It knows which team or user is sending each request and enforces rules based on that identity.&lt;/p&gt;

&lt;p&gt;In Go, a gateway is a middleware chain wrapping the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BuildGateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c"&gt;// validate key → resolve tenant identity&lt;/span&gt;
        &lt;span class="n"&gt;RateLimitMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// per-tenant request + token rate limits&lt;/span&gt;
        &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c"&gt;// per-team monthly spend enforcement&lt;/span&gt;
        &lt;span class="n"&gt;AuditMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c"&gt;// log every request with identity + decision&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupTenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unauthorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProviderKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MonthlySpend&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BudgetLimit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`{"error":"budget_exceeded"}`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;429&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A proxy is stateless with respect to the caller. A gateway is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Products Map to These Layers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Router&lt;/th&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Cost Intelligence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (100+ providers)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (enterprise)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;— (async only)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preto&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (recommendations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing to know about Langfuse: it's an async observer — it doesn't sit in the request path. That's a deliberate architectural choice, a different layer entirely: zero proxy latency, but also no caching, routing, or real-time budget enforcement. If post-hoc observability is all you need, that trade is fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Actually Need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One team, one model, under $2K/month&lt;/strong&gt; → direct SDK calls. Add a proxy for logging once you have real traffic to observe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple models, cost visibility needed&lt;/strong&gt; → proxy + router. One URL change gives you per-request cost attribution and the ability to route simple tasks to cheaper models. Teams typically see 20–40% cost reduction within the first week of enabling model routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple teams, budget enforcement needed&lt;/strong&gt; → gateway. The moment two teams share an API key and neither can see what the other spends, you have a governance problem. A bill spike hits. Nobody knows which team caused it. Nobody can be held accountable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance requirements (SOC 2, HIPAA, GDPR)&lt;/strong&gt; → gateway with audit logging and PII controls. Policies on paper aren't evidence; a gateway gives you the audit trail to prove the controls were actually enforced.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — all three layers (proxy + router + gateway) plus cost intelligence in one URL change. Free up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>backend</category>
      <category>go</category>
    </item>
    <item>
      <title>LLM Semantic Caching: The 95% Hit Rate Myth (and What Production Data Actually Shows)</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:47:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</link>
      <guid>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</guid>
      <description>&lt;p&gt;You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.&lt;/p&gt;

&lt;p&gt;The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic and the reality was different. Much different.&lt;/p&gt;

&lt;p&gt;This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to &lt;em&gt;accuracy&lt;/em&gt; of cache matches, not frequency of hits.&lt;/li&gt;
&lt;li&gt;Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.&lt;/li&gt;
&lt;li&gt;Start with exact caching. Add semantic caching only if the marginal improvement justifies the complexity.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Exact Caching vs. Semantic Caching: Two Different Problems
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, the distinction matters: most teams should start with exact caching and add semantic caching only if exact matching alone doesn't catch enough traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exact caching
&lt;/h3&gt;

&lt;p&gt;Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms, zero LLM cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero false positives. Sub-millisecond lookup. Trivial to implement.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.&lt;/p&gt;

&lt;p&gt;Exact caching alone catches more traffic than you'd expect. The average production app sends &lt;strong&gt;15-30% identical requests&lt;/strong&gt; — automated pipelines, retries, and users asking the same FAQ.&lt;/p&gt;
&lt;h3&gt;
  
  
  Semantic caching
&lt;/h3&gt;

&lt;p&gt;Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ~2-5ms
&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms total
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Catches semantically similar requests with different wording.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 95% Myth: What the Numbers Actually Say
&lt;/h2&gt;

&lt;p&gt;The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Portkey (production)&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;RAG use cases, 99% match accuracy&lt;/td&gt;
&lt;td&gt;Vendor data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EdTech platform (production)&lt;/td&gt;
&lt;td&gt;~45%&lt;/td&gt;
&lt;td&gt;Student Q&amp;amp;A — high repetition&lt;/td&gt;
&lt;td&gt;Case study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT Semantic Cache (academic)&lt;/td&gt;
&lt;td&gt;61-69%&lt;/td&gt;
&lt;td&gt;Controlled benchmark, curated dataset&lt;/td&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General production estimate&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;td&gt;Mixed traffic across use cases&lt;/td&gt;
&lt;td&gt;Industry average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat (production)&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Unique conversations, low repetition&lt;/td&gt;
&lt;td&gt;Observed range&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 95% number, when you trace it back, almost always refers to &lt;strong&gt;match accuracy&lt;/strong&gt; — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.&lt;/p&gt;

&lt;p&gt;The honest range for production semantic caching: &lt;strong&gt;20-45% hit rate&lt;/strong&gt;, depending heavily on use case.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why academic benchmarks are misleading:&lt;/strong&gt; Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Hit Rates by Use Case: Where Caching Works (and Doesn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Expected Hit Rate&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / customer support&lt;/td&gt;
&lt;td&gt;40-60%&lt;/td&gt;
&lt;td&gt;Users ask the same questions in slightly different ways. High repetition, bounded answer space.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classification / labeling&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;td&gt;Automated pipelines often send identical or near-identical inputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal knowledge base Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;30-45%&lt;/td&gt;
&lt;td&gt;Employees ask similar questions about policies, processes, docs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG with document retrieval&lt;/td&gt;
&lt;td&gt;15-25%&lt;/td&gt;
&lt;td&gt;Context varies per query even if questions are similar.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Conversations are unique. Multi-turn context makes each request different.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;td&gt;High specificity per request. Users want varied outputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Threshold Problem: 0.85 vs. 0.92 vs. 0.98
&lt;/h2&gt;

&lt;p&gt;The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.85 (aggressive):&lt;/strong&gt; More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.92 (balanced):&lt;/strong&gt; The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.98 (conservative):&lt;/strong&gt; Almost-exact matching. Very few false positives, but you're only catching the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.&lt;/p&gt;
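
&lt;p&gt;There's no substitute for measuring this on your own traffic. Below is a minimal sketch of a threshold sweep against a hand-labeled set of query pairs; &lt;code&gt;embed&lt;/code&gt; is a stand-in for whatever embedding function you use, assumed to return unit-normalized vectors so the dot product equals cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: sweep candidate thresholds over labeled query pairs.
# `pairs` is a list of (query_a, query_b, same_intent) triples.
import numpy as np

def sweep_thresholds(pairs, embed, thresholds=(0.85, 0.92, 0.98)):
    sims = [(float(np.dot(embed(a), embed(b))), same) for a, b, same in pairs]
    for t in thresholds:
        matches = [same for sim, same in sims if sim &amp;gt;= t]
        false_pos = sum(1 for same in matches if not same)
        print(f"threshold {t}: {len(matches)} matches, {false_pos} false positives")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pick the highest threshold whose false-positive count your application can actually tolerate.&lt;/p&gt;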


&lt;h2&gt;
  
  
  Five Failure Modes Nobody Warns You About
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Context-dependent queries that look identical
&lt;/h3&gt;

&lt;p&gt;"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Time-sensitive queries returning stale answers
&lt;/h3&gt;

&lt;p&gt;"What's the latest pricing for GPT-5?" cached last week is wrong this week if pricing changed. TTL helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Embedding model drift
&lt;/h3&gt;

&lt;p&gt;If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.&lt;/p&gt;
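
&lt;p&gt;A cheap way to sidestep drift entirely: namespace the cache by embedding model version, so a model bump abandons old entries instead of comparing incompatible vectors. A sketch, with a hypothetical version tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: bump EMBED_MODEL_VERSION whenever the embedding model
# changes; old namespaces stop being queried and age out via TTL.
EMBED_MODEL_VERSION = "text-embedding-3-small-2025-01"  # hypothetical tag

def cache_namespace(version=EMBED_MODEL_VERSION):
    return f"semcache:{version}"

# All reads and writes go through the namespace, e.g.
# vector_db.search(cache_namespace(), embedding, threshold=0.92)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
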
&lt;h3&gt;
  
  
  4. Cache poisoning from bad responses
&lt;/h3&gt;

&lt;p&gt;If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.&lt;/p&gt;
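
&lt;p&gt;The quality gate doesn't have to be sophisticated to stop the worst cases. A sketch of cheap pre-cache checks; the specific heuristics are examples, not a hallucination detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: cheap sanity checks before a response enters the cache.
def should_cache(response, finish_reason):
    if finish_reason != "stop":  # truncated or content-filtered
        return False
    if len(response) &amp;lt; 20:  # suspiciously short
        return False
    if "i'm not sure" in response.lower():  # hedged non-answer
        return False
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
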
&lt;h3&gt;
  
  
  5. Streaming response caching complexity
&lt;/h3&gt;

&lt;p&gt;Most LLM calls use streaming (&lt;code&gt;stream: true&lt;/code&gt;). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.&lt;/p&gt;
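
&lt;p&gt;If you choose to simulate streaming, the replay itself is short; the chunk size and delay below are arbitrary pacing values, not recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: replay a cached response as a stream so clients built
# around `stream: true` keep working.
import asyncio

async def replay_stream(cached, chunk_size=24, delay=0.01):
    for i in range(0, len(cached), chunk_size):
        yield cached[i:i + chunk_size]
        await asyncio.sleep(delay)  # artificial pacing between chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;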


&lt;h2&gt;
  
  
  The Dollar Math: What Caching Actually Saves
&lt;/h2&gt;

&lt;p&gt;For a team spending &lt;strong&gt;$5,000/month&lt;/strong&gt; on LLM APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10% hits&lt;/td&gt;
&lt;td&gt;$500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20% hits&lt;/td&gt;
&lt;td&gt;$1,000/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30% hits&lt;/td&gt;
&lt;td&gt;$1,500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;45% hits&lt;/td&gt;
&lt;td&gt;$2,250/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The savings come from two places: &lt;strong&gt;avoided LLM calls&lt;/strong&gt; (the obvious one) and &lt;strong&gt;reduced latency&lt;/strong&gt; (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.&lt;/p&gt;

&lt;p&gt;The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Right Architecture: Layer Exact and Semantic Caching
&lt;/h2&gt;

&lt;p&gt;The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Layer 1: Exact cache (sub-ms, zero false positives)
&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt;

&lt;span class="c1"&gt;# Layer 2: Semantic cache (2-5ms, threshold-gated)
&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;semantic_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Cache miss: call the LLM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to both layers
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average app we onboard discovers that &lt;strong&gt;18% of requests are exact duplicates&lt;/strong&gt; on day one — before semantic matching even kicks in.&lt;/p&gt;

&lt;p&gt;Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With Measurement, Not Implementation
&lt;/h2&gt;

&lt;p&gt;The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.&lt;/p&gt;

&lt;p&gt;Measure first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log all prompts for a week.&lt;/strong&gt; Hash them. Count exact duplicates. That's your floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample 1,000 requests.&lt;/strong&gt; Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimate savings.&lt;/strong&gt; Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.&lt;/p&gt;
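
&lt;p&gt;Step 1 needs nothing more than a hash and a counter. A sketch of the floor measurement, assuming you've already dumped a week of prompts into a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: exact-duplicate floor, i.e. the share of requests that
# repeat an earlier prompt byte-for-byte.
import hashlib
from collections import Counter

def duplicate_floor(prompts):
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    repeats = sum(c - 1 for c in counts.values())  # occurrences after the first
    return repeats / max(len(prompts), 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The ceiling measurement is the same idea with embeddings and a similarity threshold in place of the hash.&lt;/p&gt;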




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that detects exact duplicates and semantically similar requests across your traffic. See your cache potential and projected savings before you build anything. Free for up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>redis</category>
      <category>backend</category>
    </item>
    <item>
      <title>We built an LLM proxy that adds 47ms of latency. Here's every millisecond accounted for.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:45:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</link>
      <guid>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</guid>
      <description>&lt;p&gt;Your LLM API request passes through 7 layers before it reaches OpenAI. Authentication. Rate limiting. Cache lookup. Model routing. The upstream call itself. Fallback logic. Logging and cost attribution. Most teams have no idea what happens in between — or that the entire round trip adds less than 50 milliseconds.&lt;/p&gt;

&lt;p&gt;This post breaks down every layer of an LLM proxy, what each one costs in latency, and why those 47 milliseconds determine whether your AI infrastructure scales — or quietly bankrupts you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An LLM proxy intercepts your API request and passes it through 7 processing layers in under 50ms — adding auth, caching, routing, failover, and cost tracking that the provider API doesn't give you.&lt;/li&gt;
&lt;li&gt;Proxy overhead (3-50ms) is under 3% of total request time. The cost of &lt;em&gt;not&lt;/em&gt; having a proxy — untracked spend, zero failover, no per-feature attribution — is far higher.&lt;/li&gt;
&lt;li&gt;The setup is one line of code: change your &lt;code&gt;base_url&lt;/code&gt;. Everything else stays the same.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Is an LLM Proxy (and Why Should a CTO Care)?
&lt;/h2&gt;

&lt;p&gt;An LLM proxy sits between your application code and the LLM provider. Your app sends requests to the proxy URL instead of directly to &lt;code&gt;api.openai.com&lt;/code&gt;. The proxy handles everything else: authentication, routing, caching, logging, failover.&lt;/p&gt;

&lt;p&gt;Think of it as an API gateway — but AI-aware. Traditional gateways (Kong, Nginx) understand HTTP. An LLM proxy understands tokens, models, prompt structure, and cost-per-request. It can make routing decisions based on task complexity, enforce per-team budget limits, and detect that 30% of your requests are semantically identical and cacheable.&lt;/p&gt;

&lt;p&gt;The setup is one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — same SDK, same code, different base URL
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.preto.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything downstream — your prompts, your response handling, your error handling — stays the same. The proxy is transparent to your application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 7 Layers Your Request Passes Through
&lt;/h2&gt;

&lt;p&gt;Here's what happens in those 47 milliseconds, layer by layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Ingress and Authentication (~2-5ms)
&lt;/h3&gt;

&lt;p&gt;The proxy receives your HTTP request and validates the API key. But unlike a direct OpenAI call, the key maps to an internal identity: a team, a project, a budget. Your upstream provider keys are never exposed to application code.&lt;/p&gt;

&lt;p&gt;One leaked key doesn't compromise your entire OpenAI account — it compromises one team's allocation with a hard spending cap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Rate Limiting and Budget Enforcement (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;Before the request goes anywhere, the proxy checks two things: Is this user within their rate limit? Is their team within its budget?&lt;/p&gt;

&lt;p&gt;Smart proxies enforce &lt;em&gt;token-level&lt;/em&gt; rate limits, not just request-level — because one 100K-context request is not the same as one 500-token classification. Budget checks happen in-memory (synced with Redis every ~10ms) so they don't block the request path.&lt;/p&gt;
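
&lt;p&gt;A sketch of what token-level limiting means: the budget is tokens per window, decremented by an estimate before the request is forwarded. Production versions keep this state in Redis; this in-process version just shows the shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: fixed-window token limiter (in-process; Redis in production).
import time

class TokenRateLimiter:
    def __init__(self, tokens_per_minute):
        self.capacity = tokens_per_minute
        self.remaining = tokens_per_minute
        self.window_start = time.monotonic()

    def allow(self, estimated_tokens):
        now = time.monotonic()
        if now - self.window_start &amp;gt;= 60:  # new one-minute window
            self.remaining = self.capacity
            self.window_start = now
        if estimated_tokens &amp;gt; self.remaining:
            return False  # would blow the window's token budget
        self.remaining -= estimated_tokens
        return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;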

&lt;h3&gt;
  
  
  Layer 3: Cache Lookup (~1-8ms; hit returns in &amp;lt;5ms, saving 500ms-5s)
&lt;/h3&gt;

&lt;p&gt;The proxy checks whether it has seen this request — or one semantically similar — before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact caching&lt;/strong&gt; hashes the prompt and returns an identical response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; generates an embedding, computes cosine similarity against recent requests, and returns a cached response if similarity exceeds a threshold.&lt;/p&gt;

&lt;p&gt;A cache hit skips the LLM entirely: response in under 5ms instead of 2-5 seconds. In production, hit rates range from 20% to 45% depending on the use case — even 20% is a meaningful cost reduction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Routing and Model Selection (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;If the request isn't cached, the proxy decides where to send it. Simple routing forwards to the model specified in the request. Advanced routing makes a decision: load balance across multiple Azure OpenAI deployments, select a cheaper model for simple tasks, or route based on headers or request patterns.&lt;/p&gt;

&lt;p&gt;Cost-based routing — sending classification tasks to GPT-5 Mini instead of GPT-5 — can cut 80% of cost on affected requests with no accuracy loss.&lt;/p&gt;
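
&lt;p&gt;The decision logic can start as a plain heuristic. This sketch routes on prompt length and a few keywords; real routers use task tags or a small classifier, and the cutoffs here are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: toy cost-based router; cheap model for short, simple prompts.
def pick_model(prompt):
    looks_complex = len(prompt) &amp;gt; 2000 or any(
        kw in prompt.lower() for kw in ("prove", "refactor", "step by step")
    )
    return "gpt-5" if looks_complex else "gpt-5-mini"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;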

&lt;h3&gt;
  
  
  Layer 5: Upstream Call + Streaming (~500ms-5,000ms)
&lt;/h3&gt;

&lt;p&gt;The proxy forwards the request to the selected provider with the upstream API key. For streaming responses (&lt;code&gt;stream: true&lt;/code&gt;), the proxy pipes tokens back to your application as they arrive — the client starts receiving output before the full response is generated.&lt;/p&gt;

&lt;p&gt;The proxy also enforces request timeouts, killing requests that exceed a duration threshold before they waste tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 6: Fallback and Retry (~0ms unless triggered: then 100-500ms)
&lt;/h3&gt;

&lt;p&gt;If the primary provider returns a 429 (rate limit), 503 (service unavailable), or times out, the proxy retries with exponential backoff — then falls back to the next provider in the chain.&lt;/p&gt;

&lt;p&gt;GPT-5 fails? Route to Claude Sonnet. Claude is down? Try Gemini Pro.&lt;/p&gt;

&lt;p&gt;Circuit breakers monitor error rates per provider: when a provider crosses a failure threshold, it's automatically removed from the rotation and re-tested after a cooldown period. Teams running this report 99.97% effective uptime despite individual provider outages, with failover in milliseconds instead of the 5+ minutes it takes to update a hard-coded API key.&lt;/p&gt;
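
&lt;p&gt;Stripped of the circuit-breaker bookkeeping, the retry-then-failover loop looks roughly like this; &lt;code&gt;call_provider&lt;/code&gt; is a stand-in for the real upstream call and the backoff constants are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: retry with exponential backoff, then fall through the chain.
import random
import time

PROVIDER_CHAIN = ["openai", "anthropic", "google"]
RETRIABLE = {429, 503}

def complete_with_failover(request, call_provider, max_retries=2):
    for provider in PROVIDER_CHAIN:
        for attempt in range(max_retries + 1):
            status, body = call_provider(provider, request)
            if status == 200:
                return body
            if status not in RETRIABLE:
                break  # non-retriable error: try the next provider
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)  # backoff + jitter
    raise RuntimeError("all providers exhausted")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;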

&lt;h3&gt;
  
  
  Layer 7: Logging, Cost Attribution, and Response (~2-5ms, async)
&lt;/h3&gt;

&lt;p&gt;As the response streams back, the proxy calculates cost (input tokens × input price + output tokens × output price), tags the request with team/feature/environment metadata, and ships the log to your observability backend.&lt;/p&gt;

&lt;p&gt;This happens asynchronously — the client gets the response immediately. The log includes: model used, tokens consumed, cost, latency, cache hit/miss, which feature triggered it, and whether the request fell back to a secondary provider.&lt;/p&gt;
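
&lt;p&gt;The cost math itself is two multiplications. The sketch below uses the per-million-token prices quoted in this post; prices drift, so treat the table as illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: per-request cost attribution ($ per million tokens).
PRICES = {
    "gpt-5": (1.25, 5.00),  # (input, output), as quoted in this post
}

def request_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens / 1e6) * price_in + (output_tokens / 1e6) * price_out

# 500 input + 300 output tokens on gpt-5: about $0.0021 per request
print(request_cost("gpt-5", 500, 300))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;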




&lt;h2&gt;
  
  
  47ms in Context: Why Proxy Overhead Doesn't Matter (and When It Does)
&lt;/h2&gt;

&lt;p&gt;The proxy typically adds 7-25ms to a request that takes 500ms-5,000ms from the LLM itself. That's 0.5-3% overhead. For most teams, this is noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;LLM Latency&lt;/th&gt;
&lt;th&gt;Proxy Overhead&lt;/th&gt;
&lt;th&gt;% Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard completion (GPT-5, 500 tokens out)&lt;/td&gt;
&lt;td&gt;~2,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming first token (TTFT)&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;6.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit (semantic match)&lt;/td&gt;
&lt;td&gt;&amp;lt;5ms&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;160%*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form generation (2K tokens)&lt;/td&gt;
&lt;td&gt;~8,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;0.25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini model classification&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*The cache hit row looks alarming — but the total response time is 13ms instead of 2,000ms. Your user got a response 150x faster.&lt;/p&gt;

&lt;p&gt;The only scenario where proxy latency is a real concern: &lt;strong&gt;real-time applications with sub-100ms requirements&lt;/strong&gt; and no caching benefit — voice AI, game NPCs, live translation. For these, a Rust or Go proxy (under 1ms overhead) is the right choice. For everything else, the 20ms is the best trade in your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proxy Architecture Patterns: Forward, Reverse, and Sidecar
&lt;/h2&gt;

&lt;p&gt;Not all proxies work the same way. The architecture pattern determines your failure modes, your latency profile, and what features you can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward Proxy (Client-Side Integration)
&lt;/h3&gt;

&lt;p&gt;Your application points at the proxy URL. The proxy forwards requests to the provider. This is the most common pattern (Portkey, LiteLLM, Preto). You get the full feature set: caching, routing, failover, cost tracking. The trade-off: the proxy is in the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse Proxy (Edge-Deployed)
&lt;/h3&gt;

&lt;p&gt;The proxy runs at the edge (e.g., Cloudflare Workers), intercepting requests globally with minimal latency. Helicone uses this pattern. Low latency from geographic proximity, but limited by what you can run in an edge function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sidecar / Async Observer
&lt;/h3&gt;

&lt;p&gt;The proxy doesn't sit in the request path at all. Instead, it observes traffic after the fact — through SDK hooks, log tailing, or provider API polling. Langfuse advocates this approach. Zero latency impact, no single point of failure — but you lose caching, real-time routing, and failover.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The honest trade-off:&lt;/strong&gt; A synchronous proxy creates a dependency. Run it as a horizontally scaled service behind a load balancer, with health checks and automatic instance replacement. Keep a direct-to-provider fallback for critical paths. This is standard infrastructure — the same way you'd deploy any API gateway.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Proxy Overhead Actually Costs in Dollars
&lt;/h2&gt;

&lt;p&gt;The proxy adds latency. It also saves money. Here's the math for a team running 100,000 LLM requests per day on GPT-5 ($1.25/1M input, $5.00/1M output) with an average of 500 input + 300 output tokens per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly LLM spend without a proxy:&lt;/strong&gt; $6,450/month&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the proxy saves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantic caching (30% hit rate): -$1,935/month&lt;/li&gt;
&lt;li&gt;Cost-based routing (40% of requests downgraded to GPT-5 Mini): -$1,548/month&lt;/li&gt;
&lt;li&gt;Budget enforcement (prevents 2 runaway features/quarter): -$800-2,000/quarter&lt;/li&gt;
&lt;li&gt;Automatic failover (avoids 3 provider outages/quarter): prevents 4-12 hours of downtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Net result:&lt;/strong&gt; $3,483/month in direct savings, plus avoided downtime. The proxy pays for itself in the first week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of Not Having a Proxy
&lt;/h2&gt;

&lt;p&gt;Without a proxy, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No per-feature cost attribution.&lt;/strong&gt; OpenAI gives you two fields for attribution: &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt;. That's it. You can't see which feature is responsible for 60% of your bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic failover.&lt;/strong&gt; When OpenAI goes down — and it does, multiple times per quarter — every AI feature in your product goes down with it. Manual failover takes 5+ minutes. At 3am, nobody is watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching layer.&lt;/strong&gt; Identical requests hit the LLM every time. The average production app sends 15-30% duplicate or near-duplicate requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No budget enforcement.&lt;/strong&gt; A new feature ships with a prompt that generates 2,000 output tokens per request instead of 300. Nobody notices until the monthly bill arrives 3x higher than expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The average production app we onboard discovers that &lt;strong&gt;18% of its requests are cacheable on day one&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build vs. Buy: The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Building a production-grade LLM proxy is a 6-12 month engineering effort. Based on published estimates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core gateway (routing, auth, failover): $200K-$300K in engineering time&lt;/li&gt;
&lt;li&gt;Observability (logging, dashboards, alerting): $100K-$150K&lt;/li&gt;
&lt;li&gt;Prompt management UI: $100K-$150K&lt;/li&gt;
&lt;li&gt;Compliance and security (SOC 2, HIPAA): $50K-$100K/year ongoing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total first-year investment: $450K-$700K&lt;/strong&gt;, plus 12-18 months before your AI features ship with production-grade infrastructure.&lt;/p&gt;

&lt;p&gt;One real case study: a team replaced their custom LLM manager with a managed proxy and &lt;strong&gt;removed 11,005 lines of code&lt;/strong&gt; across 112 files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build if:&lt;/strong&gt; LLM routing is your core product differentiator, you have unique compliance requirements, or your scale requires custom optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buy if:&lt;/strong&gt; You want to ship AI features this month, your engineering team should be building product not infrastructure, and your LLM spend is between $1K and $100K/month.&lt;/p&gt;




&lt;h2&gt;
  
  
  Latency Benchmarks by Implementation Language
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;Pure routing, no observability platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorZero&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99&lt;/td&gt;
&lt;td&gt;10,000 QPS&lt;/td&gt;
&lt;td&gt;Built-in A/B testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;~1-5ms P95&lt;/td&gt;
&lt;td&gt;~10,000 RPS&lt;/td&gt;
&lt;td&gt;Edge-deployed on Cloudflare Workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;&amp;lt;10ms&lt;/td&gt;
&lt;td&gt;1,000 RPS&lt;/td&gt;
&lt;td&gt;Full-featured: guardrails, prompt mgmt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;3-50ms&lt;/td&gt;
&lt;td&gt;1,000 QPS&lt;/td&gt;
&lt;td&gt;Most flexible (100+ providers)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rust and Go proxies handle 5-10x more throughput with 10-100x less overhead than Python. But LiteLLM has the largest provider coverage. For most teams under 1,000 RPS, the language doesn't matter. At 5,000+ RPS, it's the first thing that matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Don't Need a Proxy
&lt;/h2&gt;

&lt;p&gt;Skip the proxy if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're calling one model, from one service, at low volume&lt;/li&gt;
&lt;li&gt;Your LLM spend is under $500/month&lt;/li&gt;
&lt;li&gt;You need observability but not routing (an async observer works fine)&lt;/li&gt;
&lt;li&gt;You're still prototyping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the proxy when you have multiple models, multiple teams, real money at stake, and no visibility into where it's going.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. If you're evaluating options, the full build vs. buy decision checklist (12 questions, PDF) is linked below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
    <item>
      <title>We evaluated Go, Rust, and Python for our LLM proxy. Go won - and not for the reason you'd expect.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Fri, 03 Apr 2026 11:54:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</link>
      <guid>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</guid>
      <description>&lt;p&gt;We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.&lt;/li&gt;
&lt;li&gt;Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.&lt;/li&gt;
&lt;li&gt;Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Three Contenders
&lt;/h2&gt;

&lt;p&gt;When we started building Preto's proxy layer, we had three options on the table. Each had a strong case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust&lt;/strong&gt; was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt; was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark That Settled the Python Question
&lt;/h2&gt;

&lt;p&gt;We ruled Python out first. Not because it's slow in theory — because it's slow in practice at our target scale.&lt;/p&gt;

&lt;p&gt;LiteLLM's own published benchmarks tell the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At 500 RPS:&lt;/strong&gt; Stable. ~40ms overhead. Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 1,000 RPS:&lt;/strong&gt; Memory climbs to 4GB+. Latency variance increases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 2,000 RPS:&lt;/strong&gt; Timeouts start. Memory hits 8GB+. Requests fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's &lt;code&gt;asyncio&lt;/code&gt; helps, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.&lt;/p&gt;

&lt;p&gt;LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Python isn't wrong — it's wrong for this. If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Go vs. Rust: Where the Decision Gets Interesting
&lt;/h2&gt;

&lt;p&gt;With Python out, the real comparison begins. Here's what we measured and researched:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Proxy overhead&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99 at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max throughput (single instance)&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;10,000+ QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory under load&lt;/td&gt;
&lt;td&gt;~200MB at 5K RPS&lt;/td&gt;
&lt;td&gt;~50MB at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency model&lt;/td&gt;
&lt;td&gt;Goroutines (lightweight)&lt;/td&gt;
&lt;td&gt;async/await (Tokio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming HTTP support&lt;/td&gt;
&lt;td&gt;stdlib net/http&lt;/td&gt;
&lt;td&gt;hyper/axum (good, more code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to implement proxy MVP&lt;/td&gt;
&lt;td&gt;~2 weeks&lt;/td&gt;
&lt;td&gt;~5-6 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hiring pool&lt;/td&gt;
&lt;td&gt;Large (DevOps, backend)&lt;/td&gt;
&lt;td&gt;Small (systems specialists)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compile times&lt;/td&gt;
&lt;td&gt;~5 seconds&lt;/td&gt;
&lt;td&gt;~2-5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~15MB&lt;/td&gt;
&lt;td&gt;~8MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Factor That Made It Obvious: Goroutines and Streaming
&lt;/h2&gt;

&lt;p&gt;An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.&lt;/p&gt;

&lt;p&gt;In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;proxyHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Forward to upstream LLM provider&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;handleFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// try next provider&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Stream tokens back as they arrive&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// send immediately&lt;/span&gt;
            &lt;span class="n"&gt;trackTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// async cost tracking&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core loop. In Rust, the equivalent code involves &lt;code&gt;async/await&lt;/code&gt;, &lt;code&gt;Pin&amp;lt;Box&amp;lt;dyn Stream&amp;gt;&amp;gt;&lt;/code&gt;, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.&lt;/p&gt;

&lt;p&gt;When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.&lt;/p&gt;
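&lt;p&gt;To make that concrete, here is a minimal sketch of what "a new function in the chain" looks like, using Go's conventional &lt;code&gt;func(http.Handler) http.Handler&lt;/code&gt; middleware shape. &lt;code&gt;teamFromAPIKey&lt;/code&gt; and &lt;code&gt;overBudget&lt;/code&gt; are hypothetical stand-ins, not our production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "net/http"

// Hypothetical helpers standing in for the real key-lookup and spend-check logic.
func teamFromAPIKey(r *http.Request) string { return r.Header.Get("X-API-Key") }

func overBudget(team string) bool { return false }

// budgetMiddleware wraps any handler with a spend check before routing.
// Adding it to the chain is one extra call site, no restructuring.
func budgetMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if overBudget(teamFromAPIKey(r)) {
            http.Error(w, "monthly budget exceeded", http.StatusPaymentRequired)
            return
        }
        next.ServeHTTP(w, r)
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because every layer has the same shape, composing them is plain function application; there is no lifetime or trait-bound bookkeeping when a new layer goes in.&lt;/p&gt;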




&lt;h2&gt;
  
  
  The Real-World Request Lifecycle in Our Go Proxy
&lt;/h2&gt;

&lt;p&gt;Here's how a request flows through our stack, with timing at each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS termination + HTTP parse&lt;/strong&gt; — handled by Go's &lt;code&gt;net/http&lt;/code&gt; server. ~1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key lookup + team resolution&lt;/strong&gt; — in-memory map with Redis sync every 10ms. ~0.5ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit check&lt;/strong&gt; — token-bucket algorithm in goroutine-safe map. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget enforcement&lt;/strong&gt; — check team's monthly spend against cap. ~0.2ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache probe&lt;/strong&gt; — SHA-256 hash of prompt + model + params, checked against local cache with Redis fallback. ~1-3ms. (Key construction is sketched after this list.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route selection&lt;/strong&gt; — match model to upstream endpoint, apply load balancing weights. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upstream call + streaming&lt;/strong&gt; — goroutine holds connection, pipes &lt;code&gt;data:&lt;/code&gt; chunks back. 500ms-5,000ms (the LLM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async logging&lt;/strong&gt; — cost calculation and log entry shipped to ClickHouse via buffered channel. ~0ms on the request path (fires in a background goroutine; see the sketch below).&lt;/li&gt;
&lt;/ol&gt;
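
&lt;p&gt;For step 5, a minimal sketch of the cache key, assuming it is just a SHA-256 over the request fields that change the completion (the exact field set in our proxy differs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey hashes every field that affects the completion.
// The field set shown here is illustrative.
func cacheKey(model, prompt string, temperature float64, maxTokens int) string {
    h := sha256.New()
    // NUL separators prevent ambiguous concatenations
    // ("gpt-5"+"abc" vs "gpt-5a"+"bc").
    fmt.Fprintf(h, "%s\x00%s\x00%g\x00%d", model, prompt, temperature, maxTokens)
    return hex.EncodeToString(h.Sum(nil))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;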

&lt;p&gt;&lt;strong&gt;Total proxy overhead: ~5-8ms.&lt;/strong&gt; The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.&lt;/p&gt;
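
&lt;p&gt;Step 8 is why logging costs ~0ms on the request path. Here is a sketch of the buffered-channel pattern with illustrative names (&lt;code&gt;LogEntry&lt;/code&gt;, &lt;code&gt;trackUsage&lt;/code&gt;, &lt;code&gt;logWriter&lt;/code&gt;), not our production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "sync/atomic"
    "time"
)

type LogEntry struct {
    Model        string
    InputTokens  int
    OutputTokens int
    CostUSD      float64
}

var (
    logCh       = make(chan LogEntry, 10000) // buffered: enqueue never blocks the request path
    droppedLogs uint64
)

// trackUsage is called from the request goroutine; it returns immediately.
func trackUsage(e LogEntry) {
    select {
    case logCh &lt;- e:
    default:
        // Channel full: drop the log line rather than stall a live request.
        atomic.AddUint64(&amp;droppedLogs, 1)
    }
}

// logWriter runs in a background goroutine and batches writes to ClickHouse.
func logWriter(flush func([]LogEntry)) {
    batch := make([]LogEntry, 0, 500)
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case e := &lt;-logCh:
            batch = append(batch, e)
            if len(batch) == cap(batch) {
                flush(batch)
                batch = batch[:0]
            }
        case &lt;-ticker.C:
            if len(batch) &gt; 0 {
                flush(batch)
                batch = batch[:0]
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;select&lt;/code&gt; with a &lt;code&gt;default&lt;/code&gt; branch on enqueue is the important part: if the logging pipeline backs up, we drop a log line instead of stalling a live request.&lt;/p&gt;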




&lt;h2&gt;
  
  
  What We'd Choose Rust For
&lt;/h2&gt;

&lt;p&gt;This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;We needed to handle 10,000+ QPS on a single instance.&lt;/strong&gt; At that scale, Rust's zero-cost abstractions and lack of GC pauses become meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory was a hard constraint.&lt;/strong&gt; Rust's 50MB footprint vs. Go's 200MB matters if you're running on edge nodes or embedded devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The proxy was the entire product.&lt;/strong&gt; If our company was an LLM proxy company, spending 3x longer on the core engine is justified. Our proxy is infrastructure — the product is cost intelligence built on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.&lt;/p&gt;

&lt;p&gt;For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons From 6 Months in Production
&lt;/h2&gt;

&lt;p&gt;Three things surprised us after shipping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Garbage collection pauses are a non-issue.&lt;/strong&gt; Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune &lt;code&gt;GOGC&lt;/code&gt; — we never needed to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The standard library HTTP server is production-ready.&lt;/strong&gt; We started with Go's &lt;code&gt;net/http&lt;/code&gt; and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Goroutine leaks are the real danger.&lt;/strong&gt; Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. &lt;code&gt;runtime.NumGoroutine()&lt;/code&gt; caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. Monitor goroutine count as a first-class metric from day one.&lt;/p&gt;
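
&lt;p&gt;The fix was treating the count as a first-class gauge. A minimal sketch using only the standard library (importing &lt;code&gt;expvar&lt;/code&gt; registers &lt;code&gt;/debug/vars&lt;/code&gt; on the default mux):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "expvar"
    "net/http"
    "runtime"
)

func main() {
    // Publish the live goroutine count; scrape /debug/vars
    // and alert when it trends upward without traffic growth.
    expvar.Publish("goroutines", expvar.Func(func() any {
        return runtime.NumGoroutine()
    }))
    http.ListenAndServe(":8081", nil) // nil = http.DefaultServeMux
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A leak shows up as a count that climbs and never comes back down, which is exactly the 200-to-45,000 pattern we missed.&lt;/p&gt;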




&lt;h2&gt;
  
  
  The Build vs. Buy Question
&lt;/h2&gt;

&lt;p&gt;If you're evaluating whether to build your own proxy or use a managed solution, the math is sobering: a production-grade proxy is a 6-12 month engineering effort, roughly $450K-$700K in first-year engineering time when you include observability, a management UI, and compliance work.&lt;/p&gt;

&lt;p&gt;One team we onboarded had built their own LLM manager — a reasonable decision at the time. When they migrated to a managed proxy, they removed &lt;strong&gt;11,005 lines of code across 112 files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Build if LLM routing is your core product differentiator. Buy if you want to ship AI features this month.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. Free for up to 10K requests. See what your LLM spend actually looks like.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>go</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to integrate coroot-pg-agent with prometheus</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Tue, 23 Aug 2022 21:23:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</link>
      <guid>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</guid>
      <description>&lt;p&gt;For monitoring postgres server most of the opensource stacks consists of grafana with prometheus.&lt;/p&gt;

&lt;p&gt;Connecting postgres metrics to prometheus is very interesting task and there are certain tools/libraries are available.&lt;/p&gt;

&lt;p&gt;Such libraries are helpful for monitoring and writing alert rules over prometheus.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;postgres_exporter (&lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;https://github.com/prometheus-community/postgres_exporter&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;coroot-pg-agent (&lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Today we are going to take a closer look at coroot-pg-agent.&lt;/p&gt;

&lt;p&gt;coroot-pg-agent can be run using Docker; more information can be found at &lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Announcement on the official PostgreSQL site: &lt;a href="https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/" rel="noopener noreferrer"&gt;https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, while running coroot-pg-agent with Prometheus, there are a few things we should keep in mind.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When run with Docker, coroot-pg-agent listens on port 80 by default.
We can run it on a custom port using the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name coroot-pg-agent \
--env DSN="postgresql://&amp;lt;USER&amp;gt;:&amp;lt;PASSWORD&amp;gt;@&amp;lt;HOST&amp;gt;:5432/postgres?connect_timeout=1&amp;amp;statement_timeout=30000" \
--env LISTEN="0.0.0.0:&amp;lt;custom_port_for_pg_agent&amp;gt;" \
-p &amp;lt;custom_port_for_pg_agent&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt; \
ghcr.io/coroot/coroot-pg-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also set the scrape interval by passing --env PG_SCRAPE_INTERVAL.&lt;/p&gt;

&lt;p&gt;After executing the above command we see output like the following (custom_port_for_pg_agent is 3000 here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I0823 21:00:58.259629       1 main.go:35] static labels: map[]
I0823 21:00:58.273610       1 main.go:41] listening on: 0.0.0.0:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;prometheus.yml for the Prometheus configuration:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval: 5m
  scrape_timeout: 3m
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: coroot-pg-agent
    static_configs:
      - targets: ["&amp;lt;localhost-ip&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt;"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can always change the scrape interval and timeouts as per our needs; I was testing locally, hence these values.&lt;/p&gt;

&lt;p&gt;Keep in mind while editing the above YAML that scrape_timeout should never exceed scrape_interval.&lt;/p&gt;

&lt;p&gt;To run Prometheus with Docker we can use the following command, using the official prom/prometheus image from &lt;a href="https://hub.docker.com/r/prom/prometheus" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run \
    -p 9090:9090 \
    -v ~/pro/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above command you will see output like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ts=2022-08-23T21:02:26.203Z caller=main.go:495 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-08-23T21:02:26.203Z caller=main.go:539 level=info msg="Starting Prometheus Server" mode=server version="(version=2.38.0, branch=HEAD, revision=818d6e60888b2a3ea363aee8a9828c7bafd73699)"
ts=2022-08-23T21:02:26.203Z caller=main.go:544 level=info build_context="(go=go1.18.5, user=root@e6b781f65453, date=20220816-13:29:14)"
ts=2022-08-23T21:02:26.204Z caller=main.go:545 level=info host_details="(Linux 5.10.47-linuxkit #1 SMP PREEMPT Sat Jul 3 21:50:16 UTC 2021 aarch64 87decec12cad (none))"
ts=2022-08-23T21:02:26.204Z caller=main.go:546 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-08-23T21:02:26.204Z caller=main.go:547 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-08-23T21:02:26.205Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-08-23T21:02:26.205Z caller=main.go:976 level=info msg="Starting TSDB ..."
ts=2022-08-23T21:02:26.206Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2022-08-23T21:02:26.207Z caller=head.go:495 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2022-08-23T21:02:26.207Z caller=head.go:538 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=10.125µs
ts=2022-08-23T21:02:26.207Z caller=head.go:544 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2022-08-23T21:02:26.207Z caller=head.go:615 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-08-23T21:02:26.207Z caller=head.go:621 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=21.416µs wal_replay_duration=117.958µs total_replay_duration=159.167µs
ts=2022-08-23T21:02:26.208Z caller=main.go:997 level=info fs_type=EXT4_SUPER_MAGIC
ts=2022-08-23T21:02:26.208Z caller=main.go:1000 level=info msg="TSDB started"
ts=2022-08-23T21:02:26.208Z caller=main.go:1181 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2022-08-23T21:02:26.210Z caller=main.go:1218 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=2.047292ms db_storage=750ns remote_storage=1.709µs web_handler=292ns query_engine=583ns scrape=341.75µs scrape_sd=16.708µs notify=542ns notify_sd=792ns rules=1µs tracing=9.625µs
ts=2022-08-23T21:02:26.210Z caller=main.go:961 level=info msg="Server is ready to receive web requests."
ts=2022-08-23T21:02:26.210Z caller=manager.go:941 level=info component="rule manager" msg="Starting rule manager..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can open localhost:9090 in a browser and see a screen like the following:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" alt="landing page for prometheus" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, to see the targets, we can visit the &lt;a href="http://localhost:9090/targets?search=" rel="noopener noreferrer"&gt;Targets&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" alt="Targets from prometheus" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the targets are up, we can see their status change as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" alt="Targets updated statuses" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visit the &lt;a href="http://localhost:9090/graph?g0.expr=&amp;amp;g0.tab=1&amp;amp;g0.stacked=0&amp;amp;g0.show_exemplars=0&amp;amp;g0.range_input=1h" rel="noopener noreferrer"&gt;Graph&lt;/a&gt; page, and if we click the search bar (I've kept auto-suggestions on) we see the following.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" alt="Shows graphs page" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>prometheusexporter</category>
      <category>corootpgagent</category>
      <category>postgresmonitoring</category>
    </item>
  </channel>
</rss>
