<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gauravdagde</title>
    <description>The latest articles on Forem by gauravdagde (@gauravdagde).</description>
    <link>https://forem.com/gauravdagde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F370559%2F2c8c382f-9e05-4ad6-93eb-07769ca1b8c6.jpeg</url>
      <title>Forem: gauravdagde</title>
      <link>https://forem.com/gauravdagde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gauravdagde"/>
    <language>en</language>
    <item>
      <title>GPT-5 vs Claude Sonnet 4: real per-task cost and benchmark comparison for production workloads</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Mon, 27 Apr 2026 03:44:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/gpt-5-vs-claude-sonnet-4-real-per-task-cost-and-benchmark-comparison-for-production-workloads-2c8d</link>
      <guid>https://forem.com/gauravdagde/gpt-5-vs-claude-sonnet-4-real-per-task-cost-and-benchmark-comparison-for-production-workloads-2c8d</guid>
      <description>&lt;p&gt;You're choosing between GPT-5 and Claude Sonnet 4 for a production workload. Pricing pages give you per-million-token numbers. Benchmark leaderboards give you scores that don't always survive contact with your actual queries. The honest comparison lives in between — per-task cost on workloads that look like yours, with the gotchas that don't show up on either page.&lt;/p&gt;

&lt;p&gt;This post is that comparison.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deprecation note before we go further.&lt;/strong&gt; Claude Sonnet 4 (&lt;code&gt;claude-sonnet-4-20250514&lt;/code&gt;, launched May 22, 2025) is &lt;strong&gt;deprecated and retires on June 15, 2026.&lt;/strong&gt; Anthropic's recommended migration target is Claude Sonnet 4.6 — same pricing, larger context window, more capable. Most teams choosing between GPT-5 and "Sonnet 4" today are practically choosing between GPT-5 and Sonnet 4.6 because the original Sonnet 4 won't be in production a quarter from now. Numbers below cover both.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPT-5 is roughly &lt;strong&gt;1.6–2.0x cheaper per task&lt;/strong&gt; than Sonnet 4 / 4.6 on most workload mixes ($1.25/$10 vs $3/$15 per MTok).&lt;/li&gt;
&lt;li&gt;GPT-5 wins on math and science reasoning (AIME 2025: 94.6% vs 70.5%; GPQA Diamond: 88.4% vs 75.4%). Sonnet wins on agentic tool-use reliability (Tau-Bench).&lt;/li&gt;
&lt;li&gt;The most cost-effective production setup is &lt;strong&gt;neither alone&lt;/strong&gt; — a routing layer that uses GPT-5 nano or Haiku 4.5 for simple work and escalates to GPT-5 or Sonnet 4.6 only when needed.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Pricing reality (April 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;GPT-5 (Aug 2025)&lt;/th&gt;
&lt;th&gt;Sonnet 4 (deprecating)&lt;/th&gt;
&lt;th&gt;Sonnet 4.6 (current)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1.25 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$3.00 / MTok&lt;/td&gt;
&lt;td&gt;$3.00 / MTok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$10.00 / MTok&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15.00 / MTok&lt;/td&gt;
&lt;td&gt;$15.00 / MTok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;~$0.125 / MTok (~90% off)&lt;/td&gt;
&lt;td&gt;$0.30 / MTok (cache read)&lt;/td&gt;
&lt;td&gt;$0.30 / MTok (cache read)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;400K (2x rate &amp;gt;272K on GPT-5.4)&lt;/td&gt;
&lt;td&gt;200K&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1M (flat pricing)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max output&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;td&gt;64K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning model?&lt;/td&gt;
&lt;td&gt;Yes — reasoning tokens billed as output; 5 effort levels&lt;/td&gt;
&lt;td&gt;Extended thinking; thinking tokens as output&lt;/td&gt;
&lt;td&gt;Extended + adaptive thinking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch API&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;td&gt;50% off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge cutoff&lt;/td&gt;
&lt;td&gt;~Sep/Oct 2024&lt;/td&gt;
&lt;td&gt;~Mar 2025&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Aug 2025&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: &lt;a href="https://platform.openai.com/docs/models/gpt-5" rel="noopener noreferrer"&gt;OpenAI GPT-5 model page&lt;/a&gt;, &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Anthropic pricing&lt;/a&gt;, &lt;a href="https://platform.claude.com/docs/en/about-claude/models/overview" rel="noopener noreferrer"&gt;Anthropic models overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The headline:&lt;/strong&gt; GPT-5 is 2.4x cheaper on input, 1.5x cheaper on output. On a typical 4:1 input:output mix, that blends to a 1.6–2.0x cost advantage. Caching changes the picture: GPT-5's 90% cached-input discount is competitive with Anthropic's 90% cache-read discount, but Anthropic charges a 1.25x premium on cache writes (5-min TTL), front-loading cost onto the first call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The long-context line is where Sonnet 4.6 quietly wins.&lt;/strong&gt; GPT-5.4 charges &lt;strong&gt;2x the standard rate above 272K input tokens.&lt;/strong&gt; Sonnet 4.6 has flat pricing across its full 1M token context. For document-heavy workloads (large codebases, long PDF analysis, research synthesis), Sonnet 4.6 is often &lt;strong&gt;cheaper per request&lt;/strong&gt; despite the higher per-token rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Head-to-head benchmarks (launch window)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;GPT-5&lt;/th&gt;
&lt;th&gt;Sonnet 4&lt;/th&gt;
&lt;th&gt;Margin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SWE-bench Verified&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;74.9%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;72.7% (80.2% high-compute)&lt;/td&gt;
&lt;td&gt;Tight; high-compute Sonnet leads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2025 (no tools)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;94.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70.5%&lt;/td&gt;
&lt;td&gt;GPT-5 +24.1pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88.4% (Pro)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75.4%&lt;/td&gt;
&lt;td&gt;GPT-5 +13.0pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU (multimodal)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74.4%&lt;/td&gt;
&lt;td&gt;GPT-5 +9.8pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aider Polyglot&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;88%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tau-Bench Retail&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;80.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tau-Bench Airline&lt;/td&gt;
&lt;td&gt;n/a published&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;60.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sources: &lt;a href="https://openai.com/index/introducing-gpt-5/" rel="noopener noreferrer"&gt;OpenAI GPT-5 launch&lt;/a&gt;, &lt;a href="https://www.anthropic.com/news/claude-4" rel="noopener noreferrer"&gt;Anthropic Claude 4 launch&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;GPT-5 dominates pure reasoning benchmarks&lt;/strong&gt; (math, science, multimodal). &lt;strong&gt;Sonnet 4 holds its own on agentic tool-use&lt;/strong&gt; where reliability matters more than peak intelligence. Tau-Bench is a stronger predictor of how a model behaves inside a long agent loop than MMLU is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Current generation narrows significantly.&lt;/strong&gt; &lt;a href="https://localaimaster.com/models/swe-bench-explained-ai-benchmarks" rel="noopener noreferrer"&gt;SWE-bench Verified&lt;/a&gt;: Sonnet 4.6 &lt;strong&gt;79.6%&lt;/strong&gt;, GPT-5.4 &lt;strong&gt;~80%&lt;/strong&gt;, Opus 4.5/4.6 &lt;strong&gt;80.8–80.9%&lt;/strong&gt;. The gap between picking GPT-5.4 and Sonnet 4.6 for SWE work is much smaller than the launch numbers suggest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost per real task
&lt;/h2&gt;

&lt;p&gt;Five workloads at typical sizes, no caching, no batching.&lt;/p&gt;
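
&lt;p&gt;Every number below comes from the same arithmetic: tokens divided by one million, times the per-MTok rate from the pricing table. A minimal sketch (the token counts are the workload assumptions, the rates are the published list prices):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// perTaskCost returns USD for one request given per-MTok rates.
func perTaskCost(inTok, outTok, inRate, outRate float64) float64 {
    return inTok/1e6*inRate + outTok/1e6*outRate
}

func main() {
    // 4:1 input:output mix, e.g. 2,000 in / 500 out
    gpt5 := perTaskCost(2000, 500, 1.25, 10.00)
    sonnet := perTaskCost(2000, 500, 3.00, 15.00)
    fmt.Printf("GPT-5 $%.6f  Sonnet $%.6f  ratio %.2fx\n", gpt5, sonnet, sonnet/gpt5)
    // GPT-5 $0.007500  Sonnet $0.013500  ratio 1.80x
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;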

&lt;h3&gt;
  
  
  Customer support reply (200 in / 150 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;At 100K replies/mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.001750&lt;/td&gt;
&lt;td&gt;$175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.002850&lt;/td&gt;
&lt;td&gt;$285&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.6x cheaper.&lt;/strong&gt; Both handle this well; pick on cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code review of 500-line PR (4,000 in / 800 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.013000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.024000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.85x cheaper.&lt;/strong&gt; Sonnet 4.6's SWE-bench parity and agentic tool-use track record make it the conventional choice for code-review agents anyway. The premium buys reliability on the tool-use chain, not raw intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document summarization (3,000 in / 400 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.007750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.015000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.94x cheaper.&lt;/strong&gt; Quality comparable for sub-200K-token documents. Above 272K tokens, &lt;strong&gt;Sonnet 4.6's flat 1M context flips the cost picture&lt;/strong&gt; — cheaper per long-document call than GPT-5.4.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG-enabled Q&amp;amp;A (2,500 in / 250 out)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.005625&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.011250&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 2.0x cheaper.&lt;/strong&gt; Both handle RAG well; pick on cost unless you've benchmarked Sonnet 4.6 outperforming on your specific retrieval-grounded answer quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic task with 5 tool calls (~8,000 in / ~3,500 out incl reasoning)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.045000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sonnet 4 / 4.6&lt;/td&gt;
&lt;td&gt;$0.076500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;GPT-5 1.7x cheaper per agent run.&lt;/strong&gt; Sonnet 4 / 4.6's agentic reliability advantage means production agents often pay the premium for fewer retries and tool-use failures. &lt;strong&gt;Cost-per-successful-task ratio narrows considerably once retry math is included.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Reasoning token caveat.&lt;/strong&gt; Add ~30–60% to GPT-5 output cost at &lt;code&gt;reasoning_effort &amp;gt;= medium&lt;/code&gt;. Reasoning tokens are silent and consume your output budget. Sonnet 4.6 extended thinking has the same dynamic. Numbers above assume &lt;code&gt;low&lt;/code&gt; effort.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Production gotchas neither pricing page mentions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. GPT-5 reasoning inflation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;reasoning_effort&lt;/code&gt; has 5 levels (none/low/medium/high/xhigh). &lt;strong&gt;xhigh runs ~3–5x the cost of low&lt;/strong&gt; because of hidden reasoning token volume. The output cap (&lt;code&gt;max_output_tokens&lt;/code&gt; on the Responses API, &lt;code&gt;max_completion_tokens&lt;/code&gt; on Chat Completions) ≠ visible output for reasoning models — the budget includes the reasoning tokens you're billed for but never see.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Good — explicit; controls hidden reasoning cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;effort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;      &lt;span class="c1"&gt;# explicit, not default
&lt;/span&gt;    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;             &lt;span class="c1"&gt;# ceiling includes reasoning
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Be concise" in the prompt does not control reasoning verbosity. &lt;a href="https://docs.bswen.com/blog/2026-04-04-gpt54-reasoning-effort-levels-guide/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sonnet output verbosity
&lt;/h3&gt;

&lt;p&gt;Anthropic's own docs note Sonnet's "engaging responses" and recommend prompt-tuning for concision. Real-world reports consistently describe Sonnet outputs as more verbose than GPT-5. Cost implication: more output tokens at $15/M. Mitigation: explicit length constraints + &lt;code&gt;max_tokens&lt;/code&gt; ceiling.&lt;/p&gt;
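
&lt;p&gt;A minimal sketch of that mitigation, calling the Messages API directly. The model ID here is an assumption; check Anthropic's models overview for the current Sonnet 4.6 identifier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Two controls stacked: a length instruction in the prompt and a
// hard max_tokens ceiling on billed output.
func summarize(doc string) (*http.Response, error) {
    payload, err := json.Marshal(map[string]any{
        "model":      "claude-sonnet-4-6", // assumed ID; verify before use
        "max_tokens": 400,                 // hard cap on output tokens
        "messages": []map[string]string{
            {"role": "user", "content": "Summarize in at most 150 words:\n\n" + doc},
        },
    })
    if err != nil {
        return nil, err
    }
    req, err := http.NewRequest("POST", "https://api.anthropic.com/v1/messages",
        bytes.NewReader(payload))
    if err != nil {
        return nil, err
    }
    req.Header.Set("x-api-key", os.Getenv("ANTHROPIC_API_KEY"))
    req.Header.Set("anthropic-version", "2023-06-01")
    req.Header.Set("content-type", "application/json")
    return http.DefaultClient.Do(req)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;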

&lt;h3&gt;
  
  
  3. Long-context pricing flip
&lt;/h3&gt;

&lt;p&gt;Above 272K input tokens, GPT-5.4 charges 2x standard input rate. Sonnet 4.6's 1M context has flat pricing throughout. For workloads regularly using long context — codebases &amp;gt;50K lines, multi-document research synthesis, RAG with large retrieval windows — &lt;strong&gt;Sonnet 4.6 can be cheaper per request&lt;/strong&gt; despite the higher per-token rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Rate limits
&lt;/h3&gt;

&lt;p&gt;OpenAI GPT-5 (post-Sept 2025 increase): T1 500K TPM, T2 1M, T3 2M, T4 4M, T5 40M. &lt;a href="https://x.com/OpenAIDevs/status/1966610846559134140" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Anthropic: T1 (after $5 deposit) 50 RPM, T2 1K, T3 2K, T4 4K.&lt;/p&gt;

&lt;p&gt;Not directly comparable (TPM vs RPM), but production teams with bursty traffic on Anthropic tend to need tier escalation earlier than they would for an equivalent OpenAI workload.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Reliability (Dec 2025 reference)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://isdown.app/blog/llm-providers-status-report-december-2025" rel="noopener noreferrer"&gt;IsDown LLM provider report Dec 2025&lt;/a&gt;: Anthropic 20 incidents (7 major), 184.5 hrs total. OpenAI 22 incidents (1 major), 182.7 hrs. Anthropic fewer total but more severe; OpenAI more frequent but minor.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Recent security incidents
&lt;/h3&gt;

&lt;p&gt;November 2025 OpenAI Mixpanel breach (API portal customer profiles); Anthropic Claude Code internal-files exposure. Both public. &lt;a href="https://incidentdatabase.ai/blog/incident-report-2025-november-december-2026-january/" rel="noopener noreferrer"&gt;AI Incident DB&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Geographic / data residency premiums
&lt;/h3&gt;

&lt;p&gt;Anthropic 1.1x multiplier for US-only &lt;code&gt;inference_geo&lt;/code&gt; on Opus 4.6+. Bedrock and Vertex regional endpoints add ~10% premium for Sonnet 4.5+. &lt;a href="https://platform.claude.com/docs/en/about-claude/pricing" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The decision framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose GPT-5 (or GPT-5 mini) when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Math-heavy reasoning, science Q&amp;amp;A, technical analysis&lt;/li&gt;
&lt;li&gt;Structured-output generation at lower cost&lt;/li&gt;
&lt;li&gt;High-volume workloads where the 1.6–2.0x price gap compounds&lt;/li&gt;
&lt;li&gt;RAG and document summarization on documents under 272K tokens&lt;/li&gt;
&lt;li&gt;Customer support replies where per-reply cost matters more than peak quality&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose Claude Sonnet 4.6 when
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agentic workflows with tool-use reliability requirements&lt;/li&gt;
&lt;li&gt;Software engineering agents (code review, refactoring, multi-file patches)&lt;/li&gt;
&lt;li&gt;Long-context workloads — flat 1M pricing beats GPT-5's 2x above 272K&lt;/li&gt;
&lt;li&gt;Writing-heavy tasks where coherent verbose output is preferred&lt;/li&gt;
&lt;li&gt;Workloads where retry math (failed tool calls) makes "cheaper" GPT-5 more expensive in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose neither alone for production
&lt;/h3&gt;

&lt;p&gt;The architecture that minimizes cost-per-task across a real product is a &lt;strong&gt;routing layer&lt;/strong&gt; that uses GPT-5 nano or Haiku 4.5 for simple classification and FAQ-pattern requests, escalating to GPT-5 or Sonnet 4.6 only for requests that need the capability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lmsys.org/blog/2024-07-01-routellm/" rel="noopener noreferrer"&gt;Berkeley's RouteLLM benchmarks&lt;/a&gt; demonstrate ~85% cost reduction at 95% quality on routable workloads. The setup is straightforward; the gain is much larger than the GPT-5-vs-Sonnet pricing gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the pricing comparison doesn't capture
&lt;/h2&gt;

&lt;p&gt;The 2x price gap between GPT-5 and Sonnet 4.6 is real, but it's not the most consequential variable in your bill. The variables that matter more, in roughly this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Whether you're routing at all.&lt;/strong&gt; A team running 100% on Sonnet 4.6 is paying 6–10x what a team running a routed mix of Haiku 4.5 + Sonnet 4.6 pays for the same product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether prompt caching is active.&lt;/strong&gt; Up to 90% off cached input on both providers. A bug that breaks the cache (a timestamp in the prompt prefix, say) costs you more than picking the more expensive model would; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether &lt;code&gt;reasoning_effort&lt;/code&gt; is set.&lt;/strong&gt; Default reasoning settings on GPT-5 can blow your output budget by 3–5x silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whether you're on the Batch API for batchable work.&lt;/strong&gt; Flat 50% off — invisible if traffic is all real-time, enormous if any non-realtime work is in the mix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Then, finally, the per-token rate of the model you picked.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
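
&lt;p&gt;On point 2: both providers cache on an exact prefix match, so anything volatile has to sit after everything static. A sketch of the structure (the names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// buildPrompt keeps the prefix byte-identical across requests so
// provider-side caching can hit. A timestamp placed before the
// static blocks would invalidate the cache on every call.
func buildPrompt(system, fewShot, question string) string {
    return system + fewShot + // static prefix: cacheable
        "\n\nAsked at: " + time.Now().Format(time.RFC3339) + // volatile tail
        "\nQuestion: " + question
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;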

&lt;p&gt;Picking the right model matters. Picking the right routing, caching, and reasoning configuration matters more.&lt;/p&gt;

&lt;p&gt;Full breakdown with the per-task cost table, deprecation timeline, and production gotchas: &lt;strong&gt;&lt;a href="https://preto.ai/blog/gpt-5-vs-claude-sonnet-4/" rel="noopener noreferrer"&gt;preto.ai/blog/gpt-5-vs-claude-sonnet-4/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>claude</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>How We Log LLM Requests at Sub-50ms Latency Using ClickHouse</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Thu, 23 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/how-we-log-llm-requests-at-sub-50ms-latency-using-clickhouse-3jbn</link>
      <guid>https://forem.com/gauravdagde/how-we-log-llm-requests-at-sub-50ms-latency-using-clickhouse-3jbn</guid>
      <description>&lt;p&gt;We were logging 50,000 LLM requests per day to PostgreSQL. Query latency was fine. At 400,000 requests, cost aggregation queries started taking 3 seconds. At 2 million, the database was the slowest thing in the stack.&lt;/p&gt;

&lt;p&gt;We switched to ClickHouse. Here's exactly what changed and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM request logs are append-only, high-cardinality, analytics-heavy — that's a ClickHouse workload, not Postgres&lt;/li&gt;
&lt;li&gt;Switching dropped our cost dashboard p95 from 3.2s to 12ms (with materialized views)&lt;/li&gt;
&lt;li&gt;Our async Go write path keeps logging overhead under 2ms p95&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Schema That Broke PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Our initial PostgreSQL schema was straightforward — UUID primary key, indexes on &lt;code&gt;(tenant_id, created_at)&lt;/code&gt;, &lt;code&gt;model&lt;/code&gt;, and &lt;code&gt;feature&lt;/code&gt;. It worked fine at low volume.&lt;/p&gt;

&lt;p&gt;The problem was our query pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 days'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 10 million rows, these GROUP BY queries were scanning hundreds of thousands of rows per tenant. PostgreSQL is row-oriented — to compute &lt;code&gt;SUM(cost_usd)&lt;/code&gt;, it reads entire rows to extract one column. At scale, that's a lot of I/O for a field you could pre-aggregate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LLM Logs Are a ClickHouse Workload
&lt;/h2&gt;

&lt;p&gt;Three properties make LLM request logs a natural fit for ClickHouse:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Append-only writes.&lt;/strong&gt; LLM logs are never updated or deleted (outside retention policies). ClickHouse is optimized for high-throughput inserts — it doesn't pay the MVCC overhead PostgreSQL does on write-heavy workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Columnar storage.&lt;/strong&gt; When you query &lt;code&gt;SUM(cost_usd) GROUP BY model&lt;/code&gt;, ClickHouse reads only the &lt;code&gt;cost_usd&lt;/code&gt; and &lt;code&gt;model&lt;/code&gt; columns from disk. PostgreSQL reads entire rows. For a 20-column table where you're aggregating 2 columns, ClickHouse does 10% of the I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Materialized views.&lt;/strong&gt; ClickHouse pre-aggregates data at insert time. When a new log row lands, a materialized view fires and updates a running total in a summary table. Your dashboard hits the summary — not 10 million raw rows.&lt;/p&gt;

&lt;p&gt;For reference: Langfuse uses PostgreSQL for its logging backend. It works well at lower volumes and for teams that need ACID transactions. If you're running more than 1M requests/month and querying with analytics patterns, you'll eventually feel the difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our ClickHouse Schema
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;request_id&lt;/span&gt;    &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;     &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;         &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;provider&lt;/span&gt;      &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;feature&lt;/span&gt;       &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;       &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_tokens&lt;/span&gt;  &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cost_usd&lt;/span&gt;      &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency_ms&lt;/span&gt;    &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;cached&lt;/span&gt;        &lt;span class="n"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;    &lt;span class="nb"&gt;DateTime&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth noting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LowCardinality(String)&lt;/code&gt; for &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;provider&lt;/code&gt;, &lt;code&gt;feature&lt;/code&gt; — bounded value sets. ClickHouse dictionary-encodes them, cutting storage ~4x and improving scan speed.&lt;/li&gt;
&lt;li&gt;Monthly partitions let us drop old data with &lt;code&gt;ALTER TABLE DROP PARTITION&lt;/code&gt; instead of slow DELETE queries (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
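
&lt;p&gt;The partition point is worth one concrete line. A sketch using clickhouse-go's native interface (the partition expression matches the &lt;code&gt;toYYYYMM&lt;/code&gt; key above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// dropMonth removes one monthly partition: a metadata operation,
// not a row-by-row DELETE. yyyymm is an internal value like
// "202601", never user input.
func dropMonth(ctx context.Context, conn driver.Conn, yyyymm string) error {
    return conn.Exec(ctx, "ALTER TABLE llm_requests DROP PARTITION "+yyyymm)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;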




&lt;h2&gt;
  
  
  Materialized Views That Pre-Aggregate at Insert Time
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_daily&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;   &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;       &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;day&lt;/span&gt;         &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;total_cost&lt;/span&gt;  &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;total_reqs&lt;/span&gt;  &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UInt64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;avg_latency&lt;/span&gt; &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AggregatingMergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_mv&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;cost_by_model_daily&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sumState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;countState&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_reqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;avgState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_latency&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;llm_requests&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dashboard queries now hit &lt;code&gt;cost_by_model_daily&lt;/code&gt; — a pre-aggregated summary with orders of magnitude fewer rows than the raw logs.&lt;/p&gt;
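
&lt;p&gt;One detail the DDL above implies but doesn't spell out: &lt;code&gt;AggregateFunction&lt;/code&gt; columns store intermediate states, so reads must finalize them with the &lt;code&gt;-Merge&lt;/code&gt; combinators. A sketch of the dashboard query, again via clickhouse-go, assuming the schema above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// costByModel finalizes the aggregate states at read time.
// Forgetting the -Merge combinators is the classic
// AggregatingMergeTree mistake: raw states are not usable values.
func costByModel(ctx context.Context, conn driver.Conn, tenant string) (driver.Rows, error) {
    return conn.Query(ctx, `
        SELECT model,
               sumMerge(total_cost)   AS cost,
               countMerge(total_reqs) AS reqs,
               avgMerge(avg_latency)  AS latency
        FROM cost_by_model_daily
        WHERE tenant_id = ? AND day &amp;gt;= today() - 30
        GROUP BY model`, tenant)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;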

&lt;p&gt;&lt;strong&gt;Query latency comparison (10M rows):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;PostgreSQL p95&lt;/th&gt;
&lt;th&gt;ClickHouse raw&lt;/th&gt;
&lt;th&gt;ClickHouse mat. view&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost by model, 30d&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;180ms&lt;/td&gt;
&lt;td&gt;12ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost by feature, 7d&lt;/td&gt;
&lt;td&gt;2.8s&lt;/td&gt;
&lt;td&gt;160ms&lt;/td&gt;
&lt;td&gt;9ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly spend, last 24h&lt;/td&gt;
&lt;td&gt;4.1s&lt;/td&gt;
&lt;td&gt;210ms&lt;/td&gt;
&lt;td&gt;15ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Async Write Path: Keeping Logging Under 2ms
&lt;/h2&gt;

&lt;p&gt;Getting queries fast was the first problem. Making sure logging never blocks a request was the second.&lt;/p&gt;

&lt;p&gt;Our Go implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Buffered channel — fire and forget from request handler&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;logCh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="n"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;logRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;logCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// buffered, will be flushed async&lt;/span&gt;
  &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="c"&gt;// channel full — drop rather than block&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"log_dropped"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Background goroutine: batch flush every 500ms&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runLogFlusher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clickhouse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Millisecond&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;LogEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;flushBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;flushBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client gets its LLM response immediately. The log entry goes into a buffered channel. A background goroutine flushes to ClickHouse in batches of 500 every 500ms. If the channel fills under load, we drop the entry — we'd rather lose a log than slow down a request.&lt;/p&gt;

&lt;p&gt;This gives us p95 logging overhead under 2ms. The ClickHouse batch insert takes 10–30ms for 500 rows — invisible to any individual request.&lt;/p&gt;
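
&lt;p&gt;&lt;code&gt;flushBatch&lt;/code&gt; itself is a thin wrapper over clickhouse-go v2's batch API (&lt;code&gt;driver.Conn&lt;/code&gt;). A sketch with the same drop-on-failure policy as the channel; the &lt;code&gt;LogEntry&lt;/code&gt; field names are assumed to mirror the table columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func flushBatch(db driver.Conn, entries []LogEntry) {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    batch, err := db.PrepareBatch(ctx, "INSERT INTO llm_requests")
    if err != nil {
        metrics.Increment("log_flush_failed")
        return
    }
    for _, e := range entries {
        // Append order must match the column order of the table.
        if err := batch.Append(e.RequestID, e.TenantID, e.Model, e.Provider,
            e.Feature, e.UserID, e.InputTokens, e.OutputTokens,
            e.CostUSD, e.LatencyMS, e.Cached, e.CreatedAt); err != nil {
            metrics.Increment("log_flush_failed")
            return
        }
    }
    if err := batch.Send(); err != nil {
        metrics.Increment("log_flush_failed")
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;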




&lt;h2&gt;
  
  
  When to Switch (and When Not To)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write pattern&lt;/td&gt;
&lt;td&gt;Mixed read/write, ACID&lt;/td&gt;
&lt;td&gt;Append-only inserts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query pattern&lt;/td&gt;
&lt;td&gt;Point lookups, joins&lt;/td&gt;
&lt;td&gt;Aggregations, scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;~5M rows comfortably&lt;/td&gt;
&lt;td&gt;Billions of rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational cost&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right when&lt;/td&gt;
&lt;td&gt;&amp;lt;500K req/month&lt;/td&gt;
&lt;td&gt;&amp;gt;1M req/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Don't over-engineer. If you're under 500K LLM requests per month, Postgres with a &lt;code&gt;(tenant_id, created_at)&lt;/code&gt; index is fine. Migrate when your analytics queries start timing out — not before.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost intelligence that runs on this ClickHouse stack. One URL change to see your spend broken down by model and feature. Free up to 10K requests, no credit card required.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>clickhouse</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Streaming SSE Proxying for LLM APIs: The Hard Parts</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/streaming-sse-proxying-for-llm-apis-the-hard-parts-4d60</link>
      <guid>https://forem.com/gauravdagde/streaming-sse-proxying-for-llm-apis-the-hard-parts-4d60</guid>
      <description>&lt;p&gt;OpenAI streaming looks simple from the outside. Set &lt;code&gt;stream: true&lt;/code&gt;, iterate the response, pipe it to the client. One afternoon of work.&lt;/p&gt;

&lt;p&gt;Then you ship it. A client disconnects mid-generation and you eat 2,000 tokens nobody received. A slow mobile client causes your proxy's memory to climb. An OpenAI rate limit hits after you've already sent a 200. Here's what actually happens when you proxy SSE at scale — and the Go patterns that fix each failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are four production failure modes: chunk boundary corruption, token leaks on disconnect, unbounded buffering under backpressure, and mid-stream errors after a 200&lt;/li&gt;
&lt;li&gt;Each has a clean fix in ~50 lines of Go&lt;/li&gt;
&lt;li&gt;All four are running in production at Preto at 5,000+ streaming req/s, &amp;lt;50ms p95 overhead&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How LLM SSE Actually Works
&lt;/h2&gt;

&lt;p&gt;OpenAI's streaming response is plain HTTP with &lt;code&gt;Content-Type: text/event-stream&lt;/code&gt;. The body is a sequence of newline-delimited frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Hello"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;" world"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;DONE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event is &lt;code&gt;data: {JSON}\n\n&lt;/code&gt;. The stream ends with &lt;code&gt;data: [DONE]\n\n&lt;/code&gt;. A proxy needs to read this upstream, optionally inspect frames, and write them downstream without adding latency. Here's what breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 1: Chunk Boundary Corruption
&lt;/h2&gt;

&lt;p&gt;TCP hands you a byte stream, not message boundaries. A single SSE event may arrive split across multiple reads, or multiple events may arrive in one read.&lt;/p&gt;

&lt;p&gt;If you need to inspect frames (extract token counts, inject headers, filter events), a naive fixed-buffer read corrupts the framing. Use &lt;code&gt;bufio.Scanner&lt;/code&gt; with a custom SSE split function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;proxySSE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadCloser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;onEvent&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// push each event immediately — don't buffer&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;onEvent&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data: "&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;6&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// strip "data: " prefix&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Split on double-newline (SSE event boundary)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;advance&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimRight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;atEOF&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimRight&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;flusher.Flush()&lt;/code&gt; call on every event is critical. Without it, Go's &lt;code&gt;http.ResponseWriter&lt;/code&gt; buffers writes and the client gets batched chunks — defeating streaming entirely.&lt;/p&gt;
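
&lt;p&gt;One related footgun: &lt;code&gt;bufio.Scanner&lt;/code&gt; caps tokens at 64 KiB by default, and a single oversized SSE event (a big tool-call delta, say) makes &lt;code&gt;Scan()&lt;/code&gt; return false with &lt;code&gt;bufio.ErrTooLong&lt;/code&gt;. Give it headroom right after creating the scanner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Default max token size is bufio.MaxScanTokenSize (64 KiB).
// Raise the ceiling so one oversized event can't kill the stream.
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024) // allow events up to 1 MiB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;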




&lt;h2&gt;
  
  
  Failure 2: Token Leaks on Client Disconnect
&lt;/h2&gt;

&lt;p&gt;This is the most expensive failure mode.&lt;/p&gt;

&lt;p&gt;When a client closes the connection mid-stream — tab closed, app backgrounded, network timeout — a naive proxy keeps the upstream OpenAI request running. OpenAI finishes generating the full completion. You're billed for every token.&lt;/p&gt;

&lt;p&gt;At 1,000 req/s with a 5% disconnect rate and 500 average output tokens: 25,000 wasted tokens per second — hundreds of dollars per day at GPT-4o-mini pricing, more at GPT-4o.&lt;/p&gt;

&lt;p&gt;The fix is Go context propagation. When a client disconnects, Go's &lt;code&gt;net/http&lt;/code&gt; server cancels &lt;code&gt;r.Context()&lt;/code&gt;. Pass that context to the upstream call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;handleStream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// r.Context() is cancelled when the client disconnects&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewRequestWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"POST"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"https://api.openai.com/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Clone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// client disconnected — upstream cancelled, no leak&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"upstream error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;502&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// ... proxy the stream&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key change is &lt;code&gt;http.NewRequestWithContext&lt;/code&gt; instead of &lt;code&gt;http.NewRequest&lt;/code&gt;. When the context is cancelled, Go's HTTP client aborts the upstream TCP connection. OpenAI stops generating. You stop paying.&lt;/p&gt;
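
&lt;p&gt;A variant worth considering: derive the upstream context from &lt;code&gt;r.Context()&lt;/code&gt; but add an absolute cap, so a hung upstream can't hold a connection forever even when the client stays connected. The two-minute figure below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: client disconnect still cancels, and a wedged upstream is
// also cut off after an absolute cap (2 minutes here, purely illustrative).
ctx, cancel := context.WithTimeout(r.Context(), 2*time.Minute)
defer cancel()
// pass ctx to http.NewRequestWithContext exactly as above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;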




&lt;h2&gt;
  
  
  Failure 3: Backpressure and Unbounded Buffering
&lt;/h2&gt;

&lt;p&gt;OpenAI streams at ~50–100 tokens/second. A fast client reads fine. A slow client — one processing each chunk before reading the next, or on a congested connection — falls behind.&lt;/p&gt;

&lt;p&gt;Kernel socket buffers are bounded (~64–256KB per connection). When they fill, &lt;code&gt;Write()&lt;/code&gt; blocks, which blocks your upstream reader, which stalls the TCP window, which pauses OpenAI. The stream stalls but the connection stays open — holding resources indefinitely.&lt;/p&gt;

&lt;p&gt;The dangerous alternative is an unbounded in-memory buffer: under load, enough slow clients will OOM the proxy.&lt;/p&gt;

&lt;p&gt;Our fix: a bounded channel between reader and writer goroutines, with a timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;streamWithBackpressure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;upstream&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadCloser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// max 64 events buffered&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;  &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eventCh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scanner&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;bufio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewScanner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scanSSEEvents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;After&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="c"&gt;// client too slow — abort cleanly&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;eventCh&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the downstream writer doesn't consume an event within 5 seconds, the reader exits, closes the channel, and the writer loop exits cleanly. Context cancellation terminates the upstream connection.&lt;/p&gt;
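
&lt;p&gt;On Go 1.20+ there's a lighter alternative worth knowing: put a deadline on each client write via &lt;code&gt;http.ResponseController&lt;/code&gt; instead of decoupling through a channel. A sketch, reusing the 5-second figure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch (Go 1.20+): per-write deadline instead of a buffered channel.
// A client that can't keep up makes Write fail; the handler exits and
// context cancellation cuts off the upstream as before.
rc := http.NewResponseController(w)
for scanner.Scan() {
    rc.SetWriteDeadline(time.Now().Add(5 * time.Second))
    if _, err := w.Write(append(scanner.Bytes(), '\n', '\n')); err != nil {
        return // slow or dead client
    }
    rc.Flush()
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;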




&lt;h2&gt;
  
  
  Failure 4: Mid-Stream Errors After HTTP 200
&lt;/h2&gt;

&lt;p&gt;HTTP status codes are sent before the body. Once you've written &lt;code&gt;200 OK&lt;/code&gt; and started streaming, you cannot send a &lt;code&gt;429&lt;/code&gt; or &lt;code&gt;503&lt;/code&gt; if something goes wrong. This happens in practice: OpenAI sends a 200 header, begins streaming, then hits an internal rate limit mid-generation. The stream truncates. The client sees success with an incomplete response — and typically retries, paying for the partial generation a second time.&lt;/p&gt;

&lt;p&gt;The fix: send an in-band error event before closing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;writeSSEError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"data: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;%s&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// After your scanner loop:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scanner&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;writeSSEError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"stream_error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"upstream stream interrupted"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clients should check every &lt;code&gt;data:&lt;/code&gt; payload for an &lt;code&gt;error&lt;/code&gt; key, not just the HTTP status code. A truncated stream with an in-band error should retry differently from a clean completion.&lt;/p&gt;
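
&lt;p&gt;A client-side sketch of that check, assuming one JSON payload per &lt;code&gt;data:&lt;/code&gt; line as in the examples above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Client-side sketch: distinguish in-band errors from a clean [DONE].
for scanner.Scan() {
    payload, ok := strings.CutPrefix(scanner.Text(), "data: ")
    if !ok || payload == "[DONE]" {
        continue
    }
    var chunk struct {
        Error *struct {
            Code    string `json:"code"`
            Message string `json:"message"`
        } `json:"error"`
    }
    if json.Unmarshal([]byte(payload), &amp;amp;chunk) == nil &amp;amp;&amp;amp; chunk.Error != nil {
        // truncated stream: retry with backoff, don't treat as success
        break
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;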




&lt;h2&gt;
  
  
  Bonus: Cost Tracking Without Blocking
&lt;/h2&gt;

&lt;p&gt;You can't calculate output token cost until the stream ends, and by default OpenAI's streaming responses don't include token counts at all. You'd need to build your own token counter — and you'd still be estimating.&lt;/p&gt;

&lt;p&gt;Solution: pass &lt;code&gt;stream_options: {"include_usage": true}&lt;/code&gt; in the request body. OpenAI sends a final usage chunk before &lt;code&gt;[DONE]&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:[],&lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;387&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;data:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;DONE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your &lt;code&gt;onEvent&lt;/code&gt; handler, watch for the &lt;code&gt;usage&lt;/code&gt; field. When it appears, fire the log entry into your async channel (see: &lt;a href="https://preto.ai/blog/clickhouse-llm-logging/" rel="noopener noreferrer"&gt;how we log at sub-50ms with ClickHouse&lt;/a&gt;) and pass the chunk through unmodified. Zero latency added.&lt;/p&gt;
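
&lt;p&gt;A sketch of that watch, where &lt;code&gt;payload&lt;/code&gt; is the raw bytes of one event and &lt;code&gt;usageCh&lt;/code&gt; is a hypothetical buffered channel drained by the async logger:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: detect the final usage chunk and hand it off without blocking.
// usageCh is a hypothetical buffered channel behind the async logger.
var chunk struct {
    Usage *struct {
        PromptTokens     int `json:"prompt_tokens"`
        CompletionTokens int `json:"completion_tokens"`
    } `json:"usage"`
}
if json.Unmarshal(payload, &amp;amp;chunk) == nil &amp;amp;&amp;amp; chunk.Usage != nil {
    select {
    case usageCh &amp;lt;- *chunk.Usage:
    default: // never stall the stream for logging
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;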




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chunk corruption&lt;/td&gt;
&lt;td&gt;Fixed-buffer reads split events&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;bufio.Scanner&lt;/code&gt; with SSE split func&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token leak&lt;/td&gt;
&lt;td&gt;Upstream outlives client&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NewRequestWithContext(r.Context())&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backpressure / OOM&lt;/td&gt;
&lt;td&gt;Unbounded buffer&lt;/td&gt;
&lt;td&gt;Bounded channel + write timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mid-stream errors&lt;/td&gt;
&lt;td&gt;Can't change 200 status&lt;/td&gt;
&lt;td&gt;In-band error event before close&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All four are in production in Preto's Go proxy. If you want to see the full latency breakdown — proxy overhead vs. provider TTFT — on your own traffic, &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto is free up to 10K requests&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost intelligence that runs on this proxy stack.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>backend</category>
      <category>ai</category>
    </item>
    <item>
      <title>Prompt Hashing for Duplicate Detection: Cutting LLM Waste With SHA-256</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Thu, 16 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/prompt-hashing-for-duplicate-detection-cutting-llm-waste-with-sha-256-5hfm</link>
      <guid>https://forem.com/gauravdagde/prompt-hashing-for-duplicate-detection-cutting-llm-waste-with-sha-256-5hfm</guid>
      <description>&lt;p&gt;You know your LLM bill is higher than it should be. You can see total spend and total tokens in the OpenAI dashboard. What you can't see: how many of those requests are asking the exact same question that was answered five minutes ago.&lt;/p&gt;

&lt;p&gt;Prompt hashing is the cheapest LLM optimization available — no model changes, no prompt rewrites, under 1ms added to the request path, no false positives. You hash the request, check the cache, skip the LLM on a hit — and don't pay for that call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The average production app sends 15–30% duplicate LLM requests. SHA-256 exact hashing catches all of them with zero false positives&lt;/li&gt;
&lt;li&gt;The hash key must cover model + full messages array + normalized generation params — not just the prompt string&lt;/li&gt;
&lt;li&gt;When teams first connect Preto, the average exact duplicate rate on day one is &lt;strong&gt;18%&lt;/strong&gt; — pure recoverable waste&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Duplicates Are More Common Than You Think
&lt;/h2&gt;

&lt;p&gt;The first reaction: "our requests aren't duplicates — every user's query is different." True for open-ended chat. Not true for most production LLM use cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Support/FAQ bots.&lt;/strong&gt; "What are your hours?" "How do I reset my password?" "What's your return policy?" These arrive hundreds of times per day, word for word. Every one hits the LLM fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled jobs.&lt;/strong&gt; Weekly reports, nightly summaries, cron-triggered pipelines. Same prompt template, same data, same request — billed every run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application-layer cache misses.&lt;/strong&gt; The most common root cause: someone added LLM calls to a hot path without adding a cache. Every page load calls the LLM even when nothing changed. The cache was on the roadmap. It never shipped.&lt;/p&gt;

&lt;p&gt;SHA-256 hashing at the proxy layer catches all of these before they reach the provider — even if the application has no caching at all. You stop paying for duplicates without touching a line of app code.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Hash (Most Implementations Get This Wrong)
&lt;/h2&gt;

&lt;p&gt;The naive approach: hash the last user message. Wrong in production.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"Summarize this document"&lt;/code&gt; sent to GPT-4o with temperature 0.2 should cache differently from the same string sent to GPT-4o-mini with temperature 0.8. Same prompt text, different request, different output.&lt;/p&gt;

&lt;p&gt;The hash key must include everything that deterministically affects the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model name&lt;/strong&gt; — &lt;code&gt;gpt-4o&lt;/code&gt; and &lt;code&gt;gpt-4o-mini&lt;/code&gt; are different requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full messages array&lt;/strong&gt; — role + content for every message, including system prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation params&lt;/strong&gt; — temperature, max_tokens, top_p, frequency_penalty, presence_penalty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exclude: &lt;code&gt;user&lt;/code&gt; field, request IDs, &lt;code&gt;stream&lt;/code&gt; flag, &lt;code&gt;stream_options&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Go Implementation
&lt;/h2&gt;

&lt;p&gt;The critical requirement: &lt;em&gt;canonical&lt;/em&gt; serialization. The same logical request must always produce the same byte sequence. Hashing the raw request body fails this: clients differ in key order, whitespace, and which defaults they send. Re-marshaling into fixed structs canonicalizes the request, since Go marshals struct fields in declaration order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;CacheKey&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Model&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;    &lt;span class="s"&gt;`json:"model"`&lt;/span&gt;
    &lt;span class="n"&gt;Messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="s"&gt;`json:"messages"`&lt;/span&gt;
    &lt;span class="n"&gt;Params&lt;/span&gt;   &lt;span class="n"&gt;KeyParams&lt;/span&gt; &lt;span class="s"&gt;`json:"params"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Role&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"role"`&lt;/span&gt;
    &lt;span class="n"&gt;Content&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`json:"content"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;KeyParams&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Temperature&lt;/span&gt;      &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"temperature,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;MaxTokens&lt;/span&gt;        &lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="s"&gt;`json:"max_tokens,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;TopP&lt;/span&gt;             &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"top_p,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;FrequencyPenalty&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"frequency_penalty,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;PresencePenalty&lt;/span&gt;  &lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"presence_penalty,omitempty"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ComputePromptHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;CacheKey&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;normalizeMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;Params&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KeyParams&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Temperature&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Temperature&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MaxTokens&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MaxTokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;TopP&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TopP&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c"&gt;// ... other params&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// struct fields marshal in declaration order — deterministic&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;normalizeMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;msgs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TrimSpace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The float normalization (&lt;code&gt;math.Round(x*1000)/1000&lt;/code&gt;) matters. Different SDK versions can produce &lt;code&gt;0.7&lt;/code&gt; vs &lt;code&gt;0.6999999999999998&lt;/code&gt; for the same value. Without normalization, these hash differently and your cache never hits.&lt;/p&gt;

&lt;p&gt;The Redis cache (both methods take the request context, so lookups are cancelled along with the request):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;cacheTTL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hour&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CachedResponse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ph:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="n"&gt;CachedResponse&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;CachedResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ph:"&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cacheTTL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
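
&lt;p&gt;Wired into the request path, the flow is hash, check, forward on miss. A sketch; &lt;code&gt;parseChatRequest&lt;/code&gt;, &lt;code&gt;serveCached&lt;/code&gt; and &lt;code&gt;forward&lt;/code&gt; are hypothetical stand-ins for your own handler plumbing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Request-path sketch. parseChatRequest, serveCached and forward are
// hypothetical helpers, not part of any SDK.
func (p *Proxy) handle(w http.ResponseWriter, r *http.Request) {
    req := parseChatRequest(r) // decode body into the SDK request struct
    hash := ComputePromptHash(req)

    if cached, ok := p.cache.Get(r.Context(), hash); ok {
        serveCached(w, cached) // skip the provider entirely
        return
    }
    resp := forward(w, r) // proxy to the provider, tee the response
    p.cache.Set(r.Context(), hash, resp)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;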






&lt;h2&gt;
  
  
  Real Duplicate Rates by Use Case (Anonymized Production Data)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Exact Duplicate Rate&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Support / FAQ bots&lt;/td&gt;
&lt;td&gt;35–45%&lt;/td&gt;
&lt;td&gt;Users ask the same questions constantly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduled / batch jobs&lt;/td&gt;
&lt;td&gt;20–60%&lt;/td&gt;
&lt;td&gt;Retried jobs, multi-instance deploys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code assist / review&lt;/td&gt;
&lt;td&gt;8–15%&lt;/td&gt;
&lt;td&gt;Boilerplate patterns, shared stubs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General chat / copilot&lt;/td&gt;
&lt;td&gt;3–8%&lt;/td&gt;
&lt;td&gt;Lowest — open-ended queries rarely repeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weighted average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~18%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Discovered on day one of proxy connection&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;18% is not an edge case — it's the first thing visible when a team adds a proxy with prompt hashing. Some teams see 40%. That's the floor of recoverable waste, before semantic caching, model routing, or any other optimization.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Do With the Hash Beyond Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Duplicate rate reporting.&lt;/strong&gt; Track cache hit rate per endpoint. This surfaces which parts of your application send duplicate traffic — usually a missing application-layer cache that should be fixed at the root.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost projection.&lt;/strong&gt; A hash seen 200 times this month × $0.008/request (roughly GPT-4o at 500 output tokens; GPT-4o-mini would be far cheaper) = $1.60 in recoverable waste for one prompt. Multiply across all duplicate hashes for a concrete monthly saving figure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abuse detection.&lt;/strong&gt; A single hash 10,000 times in an hour from one user is a different pattern from organic duplicates. Rate limit by hash to catch prompt injection loops and runaway retry logic.&lt;/p&gt;
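
&lt;p&gt;A sketch of per-hash rate limiting with a fixed-window Redis counter; the one-hour window and 10,000-request threshold are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch: fixed-window counter per (user, hash). Window and threshold
// are illustrative; tune to your traffic.
func (c *Cache) AllowByHash(ctx context.Context, user, hash string) bool {
    key := "rl:" + user + ":" + hash
    n, err := c.redis.Incr(ctx, key).Result()
    if err != nil {
        return true // fail open on Redis errors
    }
    if n == 1 {
        c.redis.Expire(ctx, key, time.Hour)
    }
    return n &amp;lt;= 10000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;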




&lt;h2&gt;
  
  
  Where Hashing Falls Short
&lt;/h2&gt;

&lt;p&gt;SHA-256 catches exact duplicates. It misses semantic duplicates.&lt;/p&gt;

&lt;p&gt;"What are your business hours?" and "When do you open?" hash differently. They're the same question.&lt;/p&gt;

&lt;p&gt;Semantic caching handles this with embedding similarity — it can catch 2–3x more waste, but requires an embedding model, a vector store, and careful threshold tuning to avoid false positives.&lt;/p&gt;
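
&lt;p&gt;The core of the similarity check itself is small; the hard parts are the embedding calls and the threshold. A sketch, where &lt;code&gt;embed&lt;/code&gt; is a hypothetical call to your embedding model and 0.92 is purely illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Sketch only. embed() is a hypothetical embedding-model call; the 0.92
// threshold is illustrative and must be tuned against labeled pairs.
func cosine(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// if cosine(embed(query), candidate.Embedding) &amp;gt;= 0.92 { /* semantic hit */ }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;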

&lt;p&gt;&lt;strong&gt;The right order:&lt;/strong&gt; implement exact hashing first. Zero-risk, one afternoon, immediate savings. Add semantic caching once you've measured the remaining duplicate rate.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — the proxy layer that computes prompt hashes on every request and surfaces duplicate rates, recoverable waste, and per-feature cost breakdown in real time. Free up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>llm</category>
      <category>ai</category>
      <category>backend</category>
    </item>
    <item>
      <title>LLM Gateway vs LLM Proxy vs LLM Router: What's the Difference?</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</link>
      <guid>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</guid>
      <description>&lt;p&gt;Everyone calls their product a "gateway" now. LiteLLM markets itself as both a proxy and a gateway. Portkey is a gateway. Helicone's docs use proxy and gateway interchangeably. There's a well-cited Medium post by Bijit Ghosh that ranks on Google for this comparison — correct high-level definitions, but it stops before the implementation details that tell you what to actually choose and deploy.&lt;/p&gt;

&lt;p&gt;Here's the precise version: three different layers, concrete Go code for each, and a decision framework based on team size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proxy&lt;/strong&gt; = transport layer. Pipes requests from your app to the provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router&lt;/strong&gt; = decision layer. Chooses which model or provider handles the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway&lt;/strong&gt; = policy layer. Auth, rate limits, budget enforcement, audit trails&lt;/li&gt;
&lt;li&gt;They're not separate products — they're three layers of the same stack&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Proxy: Transport Layer
&lt;/h2&gt;

&lt;p&gt;A proxy intercepts your HTTP request and forwards it to the provider. Your app changes one thing: the &lt;code&gt;base_url&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Before&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// After — same SDK, same code, different URL&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBaseURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://proxy.your-company.com/v1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal Go proxy handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Swap client key → upstream provider key&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providerKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://api.openai.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;httputil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSingleHostReverseProxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core. A proxy doesn't decide anything — it doesn't choose GPT-4o over GPT-4o-mini, doesn't enforce rate limits. It pipes traffic. Everything else is built on top of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Router: Decision Layer
&lt;/h2&gt;

&lt;p&gt;A router decides which model and provider handle each request. It returns a routing decision; the proxy executes it. The router is pure business logic — no HTTP, no transport — which makes it testable independently and swappable without touching the proxy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing&lt;/strong&gt; (most valuable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimateComplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Short, simple: classification, extraction, booleans&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Medium: summarization, structured output&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Complex: multi-step reasoning, code generation&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
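
&lt;p&gt;&lt;code&gt;estimateComplexity&lt;/code&gt; can start as a cheap heuristic. One possible shape (the signals and weights are illustrative, not tuned):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// One possible estimateComplexity heuristic. Signals and weights are
// illustrative; a real router tunes them against eval results.
func (r *Router) estimateComplexity(req *ChatRequest) float64 {
    if len(req.Messages) == 0 {
        return 0
    }
    var score float64
    totalLen := 0
    for _, m := range req.Messages {
        totalLen += len(m.Content)
    }
    if totalLen &amp;gt; 4000 {
        score += 0.4 // long context tends to mean harder tasks
    }
    last := strings.ToLower(req.Messages[len(req.Messages)-1].Content)
    for _, kw := range []string{"step by step", "explain why", "refactor", "prove"} {
        if strings.Contains(last, kw) {
            score += 0.3
            break
        }
    }
    if len(req.Messages) &amp;gt; 6 {
        score += 0.2 // long multi-turn conversations escalate
    }
    return math.Min(score, 1.0)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;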



&lt;p&gt;&lt;strong&gt;Failover routing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gemini-1.5-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"google"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteWithFailover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAvailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
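
&lt;p&gt;The &lt;code&gt;r.circuit.IsAvailable&lt;/code&gt; call above implies a circuit breaker. A minimal sketch (the 5-failure threshold and 30-second cooldown are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Minimal circuit breaker sketch. Threshold and cooldown are illustrative.
type Circuit struct {
    mu       sync.Mutex
    failures map[string]int
    openedAt map[string]time.Time
}

func NewCircuit() *Circuit {
    return &amp;amp;Circuit{failures: map[string]int{}, openedAt: map[string]time.Time{}}
}

func (c *Circuit) IsAvailable(provider string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    opened, tripped := c.openedAt[provider]
    return !tripped || time.Since(opened) &amp;gt;= 30*time.Second
}

func (c *Circuit) RecordFailure(provider string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.failures[provider]++
    if c.failures[provider] &amp;gt;= 5 {
        c.openedAt[provider] = time.Now() // open the circuit
        c.failures[provider] = 0
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;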



&lt;p&gt;&lt;strong&gt;Metadata-based routing&lt;/strong&gt; (route by feature tag your app sets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteByTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Feature"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"support-bot"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"code-review"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Gateway: Policy Layer
&lt;/h2&gt;

&lt;p&gt;A gateway adds policy enforcement above the router and proxy. The defining characteristic: the gateway has a concept of &lt;em&gt;identity&lt;/em&gt;. It knows which team or user is sending each request and enforces rules based on that identity.&lt;/p&gt;

&lt;p&gt;In Go, a gateway is a middleware chain wrapping the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BuildGateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c"&gt;// validate key → resolve tenant identity&lt;/span&gt;
        &lt;span class="n"&gt;RateLimitMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// per-tenant request + token rate limits&lt;/span&gt;
        &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c"&gt;// per-team monthly spend enforcement&lt;/span&gt;
        &lt;span class="n"&gt;AuditMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c"&gt;// log every request with identity + decision&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupTenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unauthorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProviderKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MonthlySpend&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BudgetLimit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`{"error":"budget_exceeded"}`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;429&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A proxy is stateless with respect to the caller. A gateway is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Products Map to These Layers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Router&lt;/th&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Cost Intelligence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (100+ providers)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (enterprise)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;— (async only)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preto&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (recommendations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing to know about Langfuse: it's an async observer — it doesn't sit in the request path. That's a deliberate architectural choice, a different layer entirely: zero proxy latency, but also no caching, routing, or real-time budget enforcement. If post-hoc observability is all you need, that trade is fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Actually Need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One team, one model, under $2K/month&lt;/strong&gt; → direct SDK calls. Add a proxy for logging once you have real traffic to observe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple models, cost visibility needed&lt;/strong&gt; → proxy + router. One URL change gives you per-request cost attribution and the ability to route simple tasks to cheaper models. Teams typically see 20–40% cost reduction within the first week of enabling model routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple teams, budget enforcement needed&lt;/strong&gt; → gateway. The moment two teams share an API key and neither can see what the other spends, you have a governance problem. A bill spike hits. Nobody knows which team caused it. Nobody can be held accountable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance requirements (SOC 2, HIPAA, GDPR)&lt;/strong&gt; → gateway with audit logging and PII controls. Policies on paper aren't evidence; a gateway gives you the audit trail to prove the controls were actually enforced.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — all three layers (proxy + router + gateway) plus cost intelligence in one URL change. Free up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>backend</category>
      <category>go</category>
    </item>
    <item>
      <title>LLM Semantic Caching: The 95% Hit Rate Myth (and What Production Data Actually Shows)</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:47:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</link>
      <guid>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</guid>
      <description>&lt;p&gt;You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.&lt;/p&gt;

&lt;p&gt;The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic and the reality was different. Much different.&lt;/p&gt;

&lt;p&gt;This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to &lt;em&gt;accuracy&lt;/em&gt; of cache matches, not frequency of hits.&lt;/li&gt;
&lt;li&gt;Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.&lt;/li&gt;
&lt;li&gt;Start with exact caching. Add semantic caching only if the marginal improvement justifies the complexity.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Exact Caching vs. Semantic Caching: Two Different Problems
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, the distinction matters: most teams should start with exact caching and add semantic caching only if exact matching alone doesn't catch enough traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exact caching
&lt;/h3&gt;

&lt;p&gt;Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms, zero LLM cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero false positives. Sub-millisecond lookup. Trivial to implement.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.&lt;/p&gt;

&lt;p&gt;Exact caching alone catches more traffic than you'd expect. The average production app sends &lt;strong&gt;15-30% identical requests&lt;/strong&gt; — automated pipelines, retries, and users asking the same FAQ.&lt;/p&gt;
&lt;h3&gt;
  
  
  Semantic caching
&lt;/h3&gt;

&lt;p&gt;Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ~2-5ms
&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms total
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Catches semantically similar requests with different wording.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 95% Myth: What the Numbers Actually Say
&lt;/h2&gt;

&lt;p&gt;The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Portkey (production)&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;RAG use cases, 99% match accuracy&lt;/td&gt;
&lt;td&gt;Vendor data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EdTech platform (production)&lt;/td&gt;
&lt;td&gt;~45%&lt;/td&gt;
&lt;td&gt;Student Q&amp;amp;A — high repetition&lt;/td&gt;
&lt;td&gt;Case study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT Semantic Cache (academic)&lt;/td&gt;
&lt;td&gt;61-69%&lt;/td&gt;
&lt;td&gt;Controlled benchmark, curated dataset&lt;/td&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General production estimate&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;td&gt;Mixed traffic across use cases&lt;/td&gt;
&lt;td&gt;Industry average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat (production)&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Unique conversations, low repetition&lt;/td&gt;
&lt;td&gt;Observed range&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 95% number, when you trace it back, almost always refers to &lt;strong&gt;match accuracy&lt;/strong&gt; — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.&lt;/p&gt;

&lt;p&gt;The honest range for production semantic caching: &lt;strong&gt;20-45% hit rate&lt;/strong&gt;, depending heavily on use case.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why academic benchmarks are misleading:&lt;/strong&gt; Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Hit Rates by Use Case: Where Caching Works (and Doesn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Expected Hit Rate&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / customer support&lt;/td&gt;
&lt;td&gt;40-60%&lt;/td&gt;
&lt;td&gt;Users ask the same questions in slightly different ways. High repetition, bounded answer space.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classification / labeling&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;td&gt;Automated pipelines often send identical or near-identical inputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal knowledge base Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;30-45%&lt;/td&gt;
&lt;td&gt;Employees ask similar questions about policies, processes, docs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG with document retrieval&lt;/td&gt;
&lt;td&gt;15-25%&lt;/td&gt;
&lt;td&gt;Context varies per query even if questions are similar.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Conversations are unique. Multi-turn context makes each request different.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;td&gt;High specificity per request. Users want varied outputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Threshold Problem: 0.85 vs. 0.92 vs. 0.98
&lt;/h2&gt;

&lt;p&gt;The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.85 (aggressive):&lt;/strong&gt; More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.92 (balanced):&lt;/strong&gt; The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.98 (conservative):&lt;/strong&gt; Almost-exact matching. Very few false positives, but you're only catching the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.&lt;/p&gt;
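
&lt;p&gt;There's no substitute for measuring this on your own traffic. Below is a minimal sketch of a threshold sweep against a hand-labeled set of query pairs; &lt;code&gt;embed&lt;/code&gt; is a stand-in for whatever embedding function you use, assumed to return unit-normalized vectors so the dot product equals cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: sweep candidate thresholds over labeled query pairs.
# `pairs` is a list of (query_a, query_b, same_intent) triples.
import numpy as np

def sweep_thresholds(pairs, embed, thresholds=(0.85, 0.92, 0.98)):
    sims = [(float(np.dot(embed(a), embed(b))), same) for a, b, same in pairs]
    for t in thresholds:
        matches = [same for sim, same in sims if sim &amp;gt;= t]
        false_pos = sum(1 for same in matches if not same)
        print(f"threshold {t}: {len(matches)} matches, {false_pos} false positives")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pick the highest threshold whose false-positive count your application can actually tolerate.&lt;/p&gt;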


&lt;h2&gt;
  
  
  Five Failure Modes Nobody Warns You About
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Context-dependent queries that look identical
&lt;/h3&gt;

&lt;p&gt;"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Time-sensitive queries returning stale answers
&lt;/h3&gt;

&lt;p&gt;"What's the latest pricing for GPT-5?" cached last week is wrong this week if pricing changed. TTL helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Embedding model drift
&lt;/h3&gt;

&lt;p&gt;If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.&lt;/p&gt;
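
&lt;p&gt;A cheap way to sidestep drift entirely: namespace the cache by embedding model version, so a model bump abandons old entries instead of comparing incompatible vectors. A sketch, with a hypothetical version tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: bump EMBED_MODEL_VERSION whenever the embedding model
# changes; old namespaces stop being queried and age out via TTL.
EMBED_MODEL_VERSION = "text-embedding-3-small-2025-01"  # hypothetical tag

def cache_namespace(version=EMBED_MODEL_VERSION):
    return f"semcache:{version}"

# All reads and writes go through the namespace, e.g.
# vector_db.search(cache_namespace(), embedding, threshold=0.92)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
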
&lt;h3&gt;
  
  
  4. Cache poisoning from bad responses
&lt;/h3&gt;

&lt;p&gt;If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.&lt;/p&gt;
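
&lt;p&gt;The quality gate doesn't have to be sophisticated to stop the worst cases. A sketch of cheap pre-cache checks; the specific heuristics are examples, not a hallucination detector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: cheap sanity checks before a response enters the cache.
def should_cache(response, finish_reason):
    if finish_reason != "stop":  # truncated or content-filtered
        return False
    if len(response) &amp;lt; 20:  # suspiciously short
        return False
    if "i'm not sure" in response.lower():  # hedged non-answer
        return False
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
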
&lt;h3&gt;
  
  
  5. Streaming response caching complexity
&lt;/h3&gt;

&lt;p&gt;Most LLM calls use streaming (&lt;code&gt;stream: true&lt;/code&gt;). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.&lt;/p&gt;
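
&lt;p&gt;If you choose to simulate streaming, the replay itself is short; the chunk size and delay below are arbitrary pacing values, not recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: replay a cached response as a stream so clients built
# around `stream: true` keep working.
import asyncio

async def replay_stream(cached, chunk_size=24, delay=0.01):
    for i in range(0, len(cached), chunk_size):
        yield cached[i:i + chunk_size]
        await asyncio.sleep(delay)  # artificial pacing between chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;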


&lt;h2&gt;
  
  
  The Dollar Math: What Caching Actually Saves
&lt;/h2&gt;

&lt;p&gt;For a team spending &lt;strong&gt;$5,000/month&lt;/strong&gt; on LLM APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10% hits&lt;/td&gt;
&lt;td&gt;$500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20% hits&lt;/td&gt;
&lt;td&gt;$1,000/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30% hits&lt;/td&gt;
&lt;td&gt;$1,500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;45% hits&lt;/td&gt;
&lt;td&gt;$2,250/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The savings come from two places: &lt;strong&gt;avoided LLM calls&lt;/strong&gt; (the obvious one) and &lt;strong&gt;reduced latency&lt;/strong&gt; (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.&lt;/p&gt;

&lt;p&gt;The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Right Architecture: Layer Exact and Semantic Caching
&lt;/h2&gt;

&lt;p&gt;The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Layer 1: Exact cache (sub-ms, zero false positives)
&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt;

&lt;span class="c1"&gt;# Layer 2: Semantic cache (2-5ms, threshold-gated)
&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;semantic_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Cache miss: call the LLM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to both layers
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average app we onboard discovers that &lt;strong&gt;18% of requests are exact duplicates&lt;/strong&gt; on day one — before semantic matching even kicks in.&lt;/p&gt;

&lt;p&gt;Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With Measurement, Not Implementation
&lt;/h2&gt;

&lt;p&gt;The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.&lt;/p&gt;

&lt;p&gt;Measure first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log all prompts for a week.&lt;/strong&gt; Hash them. Count exact duplicates. That's your floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample 1,000 requests.&lt;/strong&gt; Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimate savings.&lt;/strong&gt; Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.&lt;/p&gt;
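
&lt;p&gt;Step 1 needs nothing more than a hash and a counter. A sketch of the floor measurement, assuming you've already dumped a week of prompts into a list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: exact-duplicate floor, i.e. the share of requests that
# repeat an earlier prompt byte-for-byte.
import hashlib
from collections import Counter

def duplicate_floor(prompts):
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    repeats = sum(c - 1 for c in counts.values())  # occurrences after the first
    return repeats / max(len(prompts), 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The ceiling measurement is the same idea with embeddings and a similarity threshold in place of the hash.&lt;/p&gt;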




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that detects exact duplicates and semantically similar requests across your traffic. See your cache potential and projected savings before you build anything. Free for up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>redis</category>
      <category>backend</category>
    </item>
    <item>
      <title>We built an LLM proxy that adds 47ms of latency. Here's every millisecond accounted for.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:45:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</link>
      <guid>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</guid>
      <description>&lt;p&gt;Your LLM API request passes through 7 layers before it reaches OpenAI. Authentication. Rate limiting. Cache lookup. Model routing. The upstream call itself. Fallback logic. Logging and cost attribution. Most teams have no idea what happens in between — or that the entire round trip adds less than 50 milliseconds.&lt;/p&gt;

&lt;p&gt;This post breaks down every layer of an LLM proxy, what each one costs in latency, and why those 47 milliseconds determine whether your AI infrastructure scales — or quietly bankrupts you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An LLM proxy intercepts your API request and passes it through 7 processing layers in under 50ms — adding auth, caching, routing, failover, and cost tracking that the provider API doesn't give you.&lt;/li&gt;
&lt;li&gt;Proxy overhead (3-50ms) is under 3% of total request time. The cost of &lt;em&gt;not&lt;/em&gt; having a proxy — untracked spend, zero failover, no per-feature attribution — is far higher.&lt;/li&gt;
&lt;li&gt;The setup is one line of code: change your &lt;code&gt;base_url&lt;/code&gt;. Everything else stays the same.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Is an LLM Proxy (and Why Should a CTO Care)?
&lt;/h2&gt;

&lt;p&gt;An LLM proxy sits between your application code and the LLM provider. Your app sends requests to the proxy URL instead of directly to &lt;code&gt;api.openai.com&lt;/code&gt;. The proxy handles everything else: authentication, routing, caching, logging, failover.&lt;/p&gt;

&lt;p&gt;Think of it as an API gateway — but AI-aware. Traditional gateways (Kong, Nginx) understand HTTP. An LLM proxy understands tokens, models, prompt structure, and cost-per-request. It can make routing decisions based on task complexity, enforce per-team budget limits, and detect that 30% of your requests are semantically identical and cacheable.&lt;/p&gt;

&lt;p&gt;The setup is one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — same SDK, same code, different base URL
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.preto.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything downstream — your prompts, your response handling, your error handling — stays the same. The proxy is transparent to your application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 7 Layers Your Request Passes Through
&lt;/h2&gt;

&lt;p&gt;Here's what happens in those 47 milliseconds, layer by layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Ingress and Authentication (~2-5ms)
&lt;/h3&gt;

&lt;p&gt;The proxy receives your HTTP request and validates the API key. But unlike a direct OpenAI call, the key maps to an internal identity: a team, a project, a budget. Your upstream provider keys are never exposed to application code.&lt;/p&gt;

&lt;p&gt;One leaked key doesn't compromise your entire OpenAI account — it compromises one team's allocation with a hard spending cap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Rate Limiting and Budget Enforcement (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;Before the request goes anywhere, the proxy checks two things: Is this user within their rate limit? Is their team within its budget?&lt;/p&gt;

&lt;p&gt;Smart proxies enforce &lt;em&gt;token-level&lt;/em&gt; rate limits, not just request-level — because one 100K-context request is not the same as one 500-token classification. Budget checks happen in-memory (synced with Redis every ~10ms) so they don't block the request path.&lt;/p&gt;
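
&lt;p&gt;A sketch of what token-level limiting means: the budget is tokens per window, decremented by an estimate before the request is forwarded. Production versions keep this state in Redis; this in-process version just shows the shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: fixed-window token limiter (in-process; Redis in production).
import time

class TokenRateLimiter:
    def __init__(self, tokens_per_minute):
        self.capacity = tokens_per_minute
        self.remaining = tokens_per_minute
        self.window_start = time.monotonic()

    def allow(self, estimated_tokens):
        now = time.monotonic()
        if now - self.window_start &amp;gt;= 60:  # new one-minute window
            self.remaining = self.capacity
            self.window_start = now
        if estimated_tokens &amp;gt; self.remaining:
            return False  # would blow the window's token budget
        self.remaining -= estimated_tokens
        return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;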

&lt;h3&gt;
  
  
  Layer 3: Cache Lookup (~1-8ms; hit returns in &amp;lt;5ms, saving 500ms-5s)
&lt;/h3&gt;

&lt;p&gt;The proxy checks whether it has seen this request — or one semantically similar — before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact caching&lt;/strong&gt; hashes the prompt and returns an identical response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; generates an embedding, computes cosine similarity against recent requests, and returns a cached response if similarity exceeds a threshold.&lt;/p&gt;

&lt;p&gt;A cache hit skips the LLM entirely: response in under 5ms instead of 2-5 seconds. In production, hit rates range from 20% to 45% depending on the use case — even 20% is a meaningful cost reduction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Routing and Model Selection (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;If the request isn't cached, the proxy decides where to send it. Simple routing forwards to the model specified in the request. Advanced routing makes a decision: load balance across multiple Azure OpenAI deployments, select a cheaper model for simple tasks, or route based on headers or request patterns.&lt;/p&gt;

&lt;p&gt;Cost-based routing — sending classification tasks to GPT-5 Mini instead of GPT-5 — can cut 80% of cost on affected requests with no accuracy loss.&lt;/p&gt;
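
&lt;p&gt;The decision logic can start as a plain heuristic. This sketch routes on prompt length and a few keywords; real routers use task tags or a small classifier, and the cutoffs here are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: toy cost-based router; cheap model for short, simple prompts.
def pick_model(prompt):
    looks_complex = len(prompt) &amp;gt; 2000 or any(
        kw in prompt.lower() for kw in ("prove", "refactor", "step by step")
    )
    return "gpt-5" if looks_complex else "gpt-5-mini"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;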

&lt;h3&gt;
  
  
  Layer 5: Upstream Call + Streaming (~500ms-5,000ms)
&lt;/h3&gt;

&lt;p&gt;The proxy forwards the request to the selected provider with the upstream API key. For streaming responses (&lt;code&gt;stream: true&lt;/code&gt;), the proxy pipes tokens back to your application as they arrive — the client starts receiving output before the full response is generated.&lt;/p&gt;

&lt;p&gt;The proxy also enforces request timeouts, killing requests that exceed a duration threshold before they waste tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 6: Fallback and Retry (~0ms unless triggered: then 100-500ms)
&lt;/h3&gt;

&lt;p&gt;If the primary provider returns a 429 (rate limit), 503 (service unavailable), or times out, the proxy retries with exponential backoff — then falls back to the next provider in the chain.&lt;/p&gt;

&lt;p&gt;GPT-5 fails? Route to Claude Sonnet. Claude is down? Try Gemini Pro.&lt;/p&gt;

&lt;p&gt;Circuit breakers monitor error rates per provider: when a provider crosses a failure threshold, it's automatically removed from the rotation and re-tested after a cooldown period. Teams running this report 99.97% effective uptime despite individual provider outages, with failover in milliseconds instead of the 5+ minutes it takes to update a hard-coded API key.&lt;/p&gt;
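
&lt;p&gt;Stripped of the circuit-breaker bookkeeping, the retry-then-failover loop looks roughly like this; &lt;code&gt;call_provider&lt;/code&gt; is a stand-in for the real upstream call and the backoff constants are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: retry with exponential backoff, then fall through the chain.
import random
import time

PROVIDER_CHAIN = ["openai", "anthropic", "google"]
RETRIABLE = {429, 503}

def complete_with_failover(request, call_provider, max_retries=2):
    for provider in PROVIDER_CHAIN:
        for attempt in range(max_retries + 1):
            status, body = call_provider(provider, request)
            if status == 200:
                return body
            if status not in RETRIABLE:
                break  # non-retriable error: try the next provider
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)  # backoff + jitter
    raise RuntimeError("all providers exhausted")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;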

&lt;h3&gt;
  
  
  Layer 7: Logging, Cost Attribution, and Response (~2-5ms, async)
&lt;/h3&gt;

&lt;p&gt;As the response streams back, the proxy calculates cost (input tokens × input price + output tokens × output price), tags the request with team/feature/environment metadata, and ships the log to your observability backend.&lt;/p&gt;

&lt;p&gt;This happens asynchronously — the client gets the response immediately. The log includes: model used, tokens consumed, cost, latency, cache hit/miss, which feature triggered it, and whether the request fell back to a secondary provider.&lt;/p&gt;
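
&lt;p&gt;The cost math itself is two multiplications. The sketch below uses the per-million-token prices quoted in this post; prices drift, so treat the table as illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: per-request cost attribution ($ per million tokens).
PRICES = {
    "gpt-5": (1.25, 5.00),  # (input, output), as quoted in this post
}

def request_cost(model, input_tokens, output_tokens):
    price_in, price_out = PRICES[model]
    return (input_tokens / 1e6) * price_in + (output_tokens / 1e6) * price_out

# 500 input + 300 output tokens on gpt-5: about $0.0021 per request
print(request_cost("gpt-5", 500, 300))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;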




&lt;h2&gt;
  
  
  47ms in Context: Why Proxy Overhead Doesn't Matter (and When It Does)
&lt;/h2&gt;

&lt;p&gt;The proxy typically adds 7-25ms to a request that takes 500ms-5,000ms from the LLM itself. That's 0.5-3% overhead. For most teams, this is noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;LLM Latency&lt;/th&gt;
&lt;th&gt;Proxy Overhead&lt;/th&gt;
&lt;th&gt;% Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard completion (GPT-5, 500 tokens out)&lt;/td&gt;
&lt;td&gt;~2,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming first token (TTFT)&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;6.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit (semantic match)&lt;/td&gt;
&lt;td&gt;&amp;lt;5ms&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;160%*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form generation (2K tokens)&lt;/td&gt;
&lt;td&gt;~8,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;0.25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini model classification&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*The cache hit row looks alarming — but the total response time is 13ms instead of 2,000ms. Your user got a response 150x faster.&lt;/p&gt;

&lt;p&gt;The only scenario where proxy latency is a real concern: &lt;strong&gt;real-time applications with sub-100ms requirements&lt;/strong&gt; and no caching benefit — voice AI, game NPCs, live translation. For these, a Rust or Go proxy (under 1ms overhead) is the right choice. For everything else, the 20ms is the best trade in your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proxy Architecture Patterns: Forward, Reverse, and Sidecar
&lt;/h2&gt;

&lt;p&gt;Not all proxies work the same way. The architecture pattern determines your failure modes, your latency profile, and what features you can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward Proxy (Client-Side Integration)
&lt;/h3&gt;

&lt;p&gt;Your application points at the proxy URL. The proxy forwards requests to the provider. This is the most common pattern (Portkey, LiteLLM, Preto). You get the full feature set: caching, routing, failover, cost tracking. The trade-off: the proxy is in the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse Proxy (Edge-Deployed)
&lt;/h3&gt;

&lt;p&gt;The proxy runs at the edge (e.g., Cloudflare Workers), intercepting requests globally with minimal latency. Helicone uses this pattern. Low latency from geographic proximity, but limited by what you can run in an edge function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sidecar / Async Observer
&lt;/h3&gt;

&lt;p&gt;The proxy doesn't sit in the request path at all. Instead, it observes traffic after the fact — through SDK hooks, log tailing, or provider API polling. Langfuse advocates this approach. Zero latency impact, no single point of failure — but you lose caching, real-time routing, and failover.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The honest trade-off:&lt;/strong&gt; A synchronous proxy creates a dependency. Run it as a horizontally scaled service behind a load balancer, with health checks and automatic instance replacement. Keep a direct-to-provider fallback for critical paths. This is standard infrastructure — the same way you'd deploy any API gateway.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Proxy Overhead Actually Costs in Dollars
&lt;/h2&gt;

&lt;p&gt;The proxy adds latency. It also saves money. Here's the math for a team running 100,000 LLM requests per day on GPT-5 ($1.25/1M input, $5.00/1M output) with an average of 500 input + 300 output tokens per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly LLM spend without a proxy:&lt;/strong&gt; $6,450/month&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the proxy saves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantic caching (30% hit rate): -$1,935/month&lt;/li&gt;
&lt;li&gt;Cost-based routing (40% of requests downgraded to GPT-5 Mini): -$1,548/month&lt;/li&gt;
&lt;li&gt;Budget enforcement (prevents 2 runaway features/quarter): -$800-2,000/quarter&lt;/li&gt;
&lt;li&gt;Automatic failover (avoids 3 provider outages/quarter): prevents 4-12 hours of downtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Net result:&lt;/strong&gt; $3,483/month in direct savings, plus avoided downtime. The proxy pays for itself in the first week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of Not Having a Proxy
&lt;/h2&gt;

&lt;p&gt;Without a proxy, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No per-feature cost attribution.&lt;/strong&gt; OpenAI gives you two fields for attribution: &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt;. That's it. You can't see which feature is responsible for 60% of your bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic failover.&lt;/strong&gt; When OpenAI goes down — and it does, multiple times per quarter — every AI feature in your product goes down with it. Manual failover takes 5+ minutes. At 3am, nobody is watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching layer.&lt;/strong&gt; Identical requests hit the LLM every time. The average production app sends 15-30% duplicate or near-duplicate requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No budget enforcement.&lt;/strong&gt; A new feature ships with a prompt that generates 2,000 output tokens per request instead of 300. Nobody notices until the monthly bill arrives 3x higher than expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The average production app we onboard discovers that &lt;strong&gt;18% of its requests are cacheable on day one&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build vs. Buy: The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Building a production-grade LLM proxy is a 6-12 month engineering effort. Based on published estimates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core gateway (routing, auth, failover): $200K-$300K in engineering time&lt;/li&gt;
&lt;li&gt;Observability (logging, dashboards, alerting): $100K-$150K&lt;/li&gt;
&lt;li&gt;Prompt management UI: $100K-$150K&lt;/li&gt;
&lt;li&gt;Compliance and security (SOC 2, HIPAA): $50K-$100K/year ongoing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total first-year investment: $450K-$700K&lt;/strong&gt;, plus 12-18 months before your AI features ship with production-grade infrastructure.&lt;/p&gt;

&lt;p&gt;One real case study: a team replaced their custom LLM manager with a managed proxy and &lt;strong&gt;removed 11,005 lines of code&lt;/strong&gt; across 112 files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build if:&lt;/strong&gt; LLM routing is your core product differentiator, you have unique compliance requirements, or your scale requires custom optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buy if:&lt;/strong&gt; You want to ship AI features this month, your engineering team should be building product not infrastructure, and your LLM spend is between $1K and $100K/month.&lt;/p&gt;




&lt;h2&gt;
  
  
  Latency Benchmarks by Implementation Language
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;Pure routing, no observability platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorZero&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99&lt;/td&gt;
&lt;td&gt;10,000 QPS&lt;/td&gt;
&lt;td&gt;Built-in A/B testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;~1-5ms P95&lt;/td&gt;
&lt;td&gt;~10,000 RPS&lt;/td&gt;
&lt;td&gt;Edge-deployed on Cloudflare Workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;&amp;lt;10ms&lt;/td&gt;
&lt;td&gt;1,000 RPS&lt;/td&gt;
&lt;td&gt;Full-featured: guardrails, prompt mgmt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;3-50ms&lt;/td&gt;
&lt;td&gt;1,000 QPS&lt;/td&gt;
&lt;td&gt;Most flexible (100+ providers)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rust and Go proxies handle 5-10x more throughput with 10-100x less overhead than Python. But LiteLLM has the largest provider coverage. For most teams under 1,000 RPS, the language doesn't matter. At 5,000+ RPS, it's the first thing that matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Don't Need a Proxy
&lt;/h2&gt;

&lt;p&gt;Skip the proxy if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're calling one model, from one service, at low volume&lt;/li&gt;
&lt;li&gt;Your LLM spend is under $500/month&lt;/li&gt;
&lt;li&gt;You need observability but not routing (an async observer works fine)&lt;/li&gt;
&lt;li&gt;You're still prototyping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the proxy when you have multiple models, multiple teams, real money at stake, and no visibility into where it's going.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. If you're evaluating options, the full build vs. buy decision checklist (12 questions, PDF) is linked below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
    <item>
      <title>We evaluated Go, Rust, and Python for our LLM proxy. Go won - and not for the reason you'd expect.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Fri, 03 Apr 2026 11:54:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</link>
      <guid>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</guid>
      <description>&lt;p&gt;We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.&lt;/li&gt;
&lt;li&gt;Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.&lt;/li&gt;
&lt;li&gt;Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Three Contenders
&lt;/h2&gt;

&lt;p&gt;When we started building Preto's proxy layer, we had three options on the table. Each had a strong case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust&lt;/strong&gt; was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt; was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark That Settled the Python Question
&lt;/h2&gt;

&lt;p&gt;We ruled Python out first. Not because it's slow in theory — because it's slow in practice at our target scale.&lt;/p&gt;

&lt;p&gt;LiteLLM's own published benchmarks tell the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At 500 RPS:&lt;/strong&gt; Stable. ~40ms overhead. Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 1,000 RPS:&lt;/strong&gt; Memory climbs to 4GB+. Latency variance increases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 2,000 RPS:&lt;/strong&gt; Timeouts start. Memory hits 8GB+. Requests fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's &lt;code&gt;asyncio&lt;/code&gt; helps, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.&lt;/p&gt;

&lt;p&gt;LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Python isn't wrong — it's wrong for this. If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Go vs. Rust: Where the Decision Gets Interesting
&lt;/h2&gt;

&lt;p&gt;With Python out, the real comparison begins. Here's what we measured and researched:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Proxy overhead&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99 at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max throughput (single instance)&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;10,000+ QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory under load&lt;/td&gt;
&lt;td&gt;~200MB at 5K RPS&lt;/td&gt;
&lt;td&gt;~50MB at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency model&lt;/td&gt;
&lt;td&gt;Goroutines (lightweight)&lt;/td&gt;
&lt;td&gt;async/await (Tokio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming HTTP support&lt;/td&gt;
&lt;td&gt;stdlib net/http&lt;/td&gt;
&lt;td&gt;hyper/axum (good, more code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to implement proxy MVP&lt;/td&gt;
&lt;td&gt;~2 weeks&lt;/td&gt;
&lt;td&gt;~5-6 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hiring pool&lt;/td&gt;
&lt;td&gt;Large (DevOps, backend)&lt;/td&gt;
&lt;td&gt;Small (systems specialists)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compile times&lt;/td&gt;
&lt;td&gt;~5 seconds&lt;/td&gt;
&lt;td&gt;~2-5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~15MB&lt;/td&gt;
&lt;td&gt;~8MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Factor That Made It Obvious: Goroutines and Streaming
&lt;/h2&gt;

&lt;p&gt;An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.&lt;/p&gt;

&lt;p&gt;In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;proxyHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Forward to upstream LLM provider&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;handleFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// try next provider&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Stream tokens back as they arrive&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// send immediately&lt;/span&gt;
            &lt;span class="n"&gt;trackTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// async cost tracking&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core loop. In Rust, the equivalent code involves &lt;code&gt;async/await&lt;/code&gt;, &lt;code&gt;Pin&amp;lt;Box&amp;lt;dyn Stream&amp;gt;&amp;gt;&lt;/code&gt;, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.&lt;/p&gt;

&lt;p&gt;When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.&lt;/p&gt;
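&lt;p&gt;To make that concrete, here is a minimal sketch of what "a new function in the chain" looks like, using Go's conventional &lt;code&gt;func(http.Handler) http.Handler&lt;/code&gt; middleware shape. &lt;code&gt;teamFromAPIKey&lt;/code&gt; and &lt;code&gt;overBudget&lt;/code&gt; are hypothetical stand-ins, not our production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "net/http"

// Hypothetical helpers standing in for the real key-lookup and spend-check logic.
func teamFromAPIKey(r *http.Request) string { return r.Header.Get("X-API-Key") }

func overBudget(team string) bool { return false }

// budgetMiddleware wraps any handler with a spend check before routing.
// Adding it to the chain is one extra call site, no restructuring.
func budgetMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if overBudget(teamFromAPIKey(r)) {
            http.Error(w, "monthly budget exceeded", http.StatusPaymentRequired)
            return
        }
        next.ServeHTTP(w, r)
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because every layer has the same shape, composing them is plain function application; there is no lifetime or trait-bound bookkeeping when a new layer goes in.&lt;/p&gt;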




&lt;h2&gt;
  
  
  The Real-World Request Lifecycle in Our Go Proxy
&lt;/h2&gt;

&lt;p&gt;Here's how a request flows through our stack, with timing at each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS termination + HTTP parse&lt;/strong&gt; — handled by Go's &lt;code&gt;net/http&lt;/code&gt; server. ~1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key lookup + team resolution&lt;/strong&gt; — in-memory map with Redis sync every 10ms. ~0.5ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit check&lt;/strong&gt; — token-bucket algorithm in goroutine-safe map. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget enforcement&lt;/strong&gt; — check team's monthly spend against cap. ~0.2ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache probe&lt;/strong&gt; — SHA-256 hash of prompt + model + params, checked against local cache with Redis fallback. ~1-3ms. (Key construction is sketched after this list.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route selection&lt;/strong&gt; — match model to upstream endpoint, apply load balancing weights. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upstream call + streaming&lt;/strong&gt; — goroutine holds connection, pipes &lt;code&gt;data:&lt;/code&gt; chunks back. 500ms-5,000ms (the LLM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async logging&lt;/strong&gt; — cost calculation and log entry shipped to ClickHouse via buffered channel. ~0ms on the request path (fires in a background goroutine; see the sketch below).&lt;/li&gt;
&lt;/ol&gt;
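
&lt;p&gt;For step 5, a minimal sketch of the cache key, assuming it is just a SHA-256 over the request fields that change the completion (the exact field set in our proxy differs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// cacheKey hashes every field that affects the completion.
// The field set shown here is illustrative.
func cacheKey(model, prompt string, temperature float64, maxTokens int) string {
    h := sha256.New()
    // NUL separators prevent ambiguous concatenations
    // ("gpt-5"+"abc" vs "gpt-5a"+"bc").
    fmt.Fprintf(h, "%s\x00%s\x00%g\x00%d", model, prompt, temperature, maxTokens)
    return hex.EncodeToString(h.Sum(nil))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;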

&lt;p&gt;&lt;strong&gt;Total proxy overhead: ~5-8ms.&lt;/strong&gt; The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.&lt;/p&gt;
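
&lt;p&gt;Step 8 is why logging costs ~0ms on the request path. Here is a sketch of the buffered-channel pattern with illustrative names (&lt;code&gt;LogEntry&lt;/code&gt;, &lt;code&gt;trackUsage&lt;/code&gt;, &lt;code&gt;logWriter&lt;/code&gt;), not our production code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "sync/atomic"
    "time"
)

type LogEntry struct {
    Model        string
    InputTokens  int
    OutputTokens int
    CostUSD      float64
}

var (
    logCh       = make(chan LogEntry, 10000) // buffered: enqueue never blocks the request path
    droppedLogs uint64
)

// trackUsage is called from the request goroutine; it returns immediately.
func trackUsage(e LogEntry) {
    select {
    case logCh &lt;- e:
    default:
        // Channel full: drop the log line rather than stall a live request.
        atomic.AddUint64(&amp;droppedLogs, 1)
    }
}

// logWriter runs in a background goroutine and batches writes to ClickHouse.
func logWriter(flush func([]LogEntry)) {
    batch := make([]LogEntry, 0, 500)
    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case e := &lt;-logCh:
            batch = append(batch, e)
            if len(batch) == cap(batch) {
                flush(batch)
                batch = batch[:0]
            }
        case &lt;-ticker.C:
            if len(batch) &gt; 0 {
                flush(batch)
                batch = batch[:0]
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;select&lt;/code&gt; with a &lt;code&gt;default&lt;/code&gt; branch on enqueue is the important part: if the logging pipeline backs up, we drop a log line instead of stalling a live request.&lt;/p&gt;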




&lt;h2&gt;
  
  
  What We'd Choose Rust For
&lt;/h2&gt;

&lt;p&gt;This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;We needed to handle 10,000+ QPS on a single instance.&lt;/strong&gt; At that scale, Rust's zero-cost abstractions and lack of GC pauses become meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory was a hard constraint.&lt;/strong&gt; Rust's 50MB footprint vs. Go's 200MB matters if you're running on edge nodes or embedded devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The proxy was the entire product.&lt;/strong&gt; If our company was an LLM proxy company, spending 3x longer on the core engine is justified. Our proxy is infrastructure — the product is cost intelligence built on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.&lt;/p&gt;

&lt;p&gt;For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons From 6 Months in Production
&lt;/h2&gt;

&lt;p&gt;Three things surprised us after shipping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Garbage collection pauses are a non-issue.&lt;/strong&gt; Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune &lt;code&gt;GOGC&lt;/code&gt; — we never needed to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The standard library HTTP server is production-ready.&lt;/strong&gt; We started with Go's &lt;code&gt;net/http&lt;/code&gt; and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Goroutine leaks are the real danger.&lt;/strong&gt; Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. &lt;code&gt;runtime.NumGoroutine()&lt;/code&gt; caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. Monitor goroutine count as a first-class metric from day one.&lt;/p&gt;
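
&lt;p&gt;The fix was treating the count as a first-class gauge. A minimal sketch using only the standard library (importing &lt;code&gt;expvar&lt;/code&gt; registers &lt;code&gt;/debug/vars&lt;/code&gt; on the default mux):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "expvar"
    "net/http"
    "runtime"
)

func main() {
    // Publish the live goroutine count; scrape /debug/vars
    // and alert when it trends upward without traffic growth.
    expvar.Publish("goroutines", expvar.Func(func() any {
        return runtime.NumGoroutine()
    }))
    http.ListenAndServe(":8081", nil) // nil = http.DefaultServeMux
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A leak shows up as a count that climbs and never comes back down, which is exactly the 200-to-45,000 pattern we missed.&lt;/p&gt;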




&lt;h2&gt;
  
  
  The Build vs. Buy Question
&lt;/h2&gt;

&lt;p&gt;If you're evaluating whether to build your own proxy or use a managed solution, the math is sobering: a production-grade proxy is a 6-12 month engineering effort, roughly $450K-$700K in first-year engineering time when you include observability, a management UI, and compliance work.&lt;/p&gt;

&lt;p&gt;One team we onboarded had built their own LLM manager — a reasonable decision at the time. When they migrated to a managed proxy, they removed &lt;strong&gt;11,005 lines of code across 112 files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Build if LLM routing is your core product differentiator. Buy if you want to ship AI features this month.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. Free for up to 10K requests. See what your LLM spend actually looks like.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>go</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to integrate coroot-pg-agent with prometheus</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Tue, 23 Aug 2022 21:23:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</link>
      <guid>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</guid>
      <description>&lt;p&gt;For monitoring postgres server most of the opensource stacks consists of grafana with prometheus.&lt;/p&gt;

&lt;p&gt;Connecting postgres metrics to prometheus is very interesting task and there are certain tools/libraries are available.&lt;/p&gt;

&lt;p&gt;Such libraries are helpful for monitoring and writing alert rules over prometheus.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;postgres_exporter (&lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;https://github.com/prometheus-community/postgres_exporter&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;coroot-pg-agent (&lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Today we are going to take a closer look at coroot-pg-agent.&lt;/p&gt;

&lt;p&gt;coroot-pg-agent can be run using Docker; more information can be found at &lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Announcement on the official PostgreSQL site: &lt;a href="https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/" rel="noopener noreferrer"&gt;https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, while running coroot-pg-agent with Prometheus, there are a few things we should keep in mind.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When run with Docker, coroot-pg-agent listens on port 80 by default.
We can run it on a custom port using the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name coroot-pg-agent \
--env DSN="postgresql://&amp;lt;USER&amp;gt;:&amp;lt;PASSWORD&amp;gt;@&amp;lt;HOST&amp;gt;:5432/postgres?connect_timeout=1&amp;amp;statement_timeout=30000" \
--env LISTEN="0.0.0.0:&amp;lt;custom_port_for_pg_agent&amp;gt;" \
-p &amp;lt;custom_port_for_pg_agent&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt; \
ghcr.io/coroot/coroot-pg-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also set the scrape interval by passing --env PG_SCRAPE_INTERVAL.&lt;/p&gt;

&lt;p&gt;After executing the above command we see output like the following (custom_port_for_pg_agent is 3000 here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I0823 21:00:58.259629       1 main.go:35] static labels: map[]
I0823 21:00:58.273610       1 main.go:41] listening on: 0.0.0.0:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;prometheus.yml for the Prometheus configuration:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval: 5m
  scrape_timeout: 3m
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: coroot-pg-agent
    static_configs:
      - targets: ["&amp;lt;localhost-ip&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt;"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can always change the scrape interval and timeouts as per our needs; I was testing locally, hence these values.&lt;/p&gt;

&lt;p&gt;Keep in mind while editing the above YAML that scrape_timeout should never exceed scrape_interval.&lt;/p&gt;

&lt;p&gt;To run Prometheus with Docker we can use the following command, using the official prom/prometheus image from &lt;a href="https://hub.docker.com/r/prom/prometheus" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run \
    -p 9090:9090 \
    -v ~/pro/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above command you will see output like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ts=2022-08-23T21:02:26.203Z caller=main.go:495 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-08-23T21:02:26.203Z caller=main.go:539 level=info msg="Starting Prometheus Server" mode=server version="(version=2.38.0, branch=HEAD, revision=818d6e60888b2a3ea363aee8a9828c7bafd73699)"
ts=2022-08-23T21:02:26.203Z caller=main.go:544 level=info build_context="(go=go1.18.5, user=root@e6b781f65453, date=20220816-13:29:14)"
ts=2022-08-23T21:02:26.204Z caller=main.go:545 level=info host_details="(Linux 5.10.47-linuxkit #1 SMP PREEMPT Sat Jul 3 21:50:16 UTC 2021 aarch64 87decec12cad (none))"
ts=2022-08-23T21:02:26.204Z caller=main.go:546 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-08-23T21:02:26.204Z caller=main.go:547 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-08-23T21:02:26.205Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-08-23T21:02:26.205Z caller=main.go:976 level=info msg="Starting TSDB ..."
ts=2022-08-23T21:02:26.206Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2022-08-23T21:02:26.207Z caller=head.go:495 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2022-08-23T21:02:26.207Z caller=head.go:538 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=10.125µs
ts=2022-08-23T21:02:26.207Z caller=head.go:544 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2022-08-23T21:02:26.207Z caller=head.go:615 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-08-23T21:02:26.207Z caller=head.go:621 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=21.416µs wal_replay_duration=117.958µs total_replay_duration=159.167µs
ts=2022-08-23T21:02:26.208Z caller=main.go:997 level=info fs_type=EXT4_SUPER_MAGIC
ts=2022-08-23T21:02:26.208Z caller=main.go:1000 level=info msg="TSDB started"
ts=2022-08-23T21:02:26.208Z caller=main.go:1181 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2022-08-23T21:02:26.210Z caller=main.go:1218 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=2.047292ms db_storage=750ns remote_storage=1.709µs web_handler=292ns query_engine=583ns scrape=341.75µs scrape_sd=16.708µs notify=542ns notify_sd=792ns rules=1µs tracing=9.625µs
ts=2022-08-23T21:02:26.210Z caller=main.go:961 level=info msg="Server is ready to receive web requests."
ts=2022-08-23T21:02:26.210Z caller=manager.go:941 level=info component="rule manager" msg="Starting rule manager..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can open localhost:9090 in a browser and see a screen like the following:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" alt="landing page for prometheus" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, to see the targets, we can visit the &lt;a href="http://localhost:9090/targets?search=" rel="noopener noreferrer"&gt;Targets&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" alt="Targets from prometheus" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the targets are up, we can see their status change as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" alt="Targets updated statuses" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visit the &lt;a href="http://localhost:9090/graph?g0.expr=&amp;amp;g0.tab=1&amp;amp;g0.stacked=0&amp;amp;g0.show_exemplars=0&amp;amp;g0.range_input=1h" rel="noopener noreferrer"&gt;Graph&lt;/a&gt; page, and if we click the search bar (I've kept auto-suggestions on) we see the following.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" alt="Shows graphs page" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>prometheusexporter</category>
      <category>corootpgagent</category>
      <category>postgresmonitoring</category>
    </item>
  </channel>
</rss>
