<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gauravdagde</title>
    <description>The latest articles on Forem by gauravdagde (@gauravdagde).</description>
    <link>https://forem.com/gauravdagde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F370559%2F2c8c382f-9e05-4ad6-93eb-07769ca1b8c6.jpeg</url>
      <title>Forem: gauravdagde</title>
      <link>https://forem.com/gauravdagde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gauravdagde"/>
    <language>en</language>
    <item>
      <title>LLM Gateway vs LLM Proxy vs LLM Router: What's the Difference?</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</link>
      <guid>https://forem.com/gauravdagde/llm-gateway-vs-llm-proxy-vs-llm-router-whats-the-difference-3o5a</guid>
      <description>&lt;p&gt;Everyone calls their product a "gateway" now. LiteLLM markets itself as both a proxy and a gateway. Portkey is a gateway. Helicone's docs use proxy and gateway interchangeably. There's a well-cited Medium post by Bijit Ghosh that ranks on Google for this comparison — correct high-level definitions, but it stops before the implementation details that tell you what to actually choose and deploy.&lt;/p&gt;

&lt;p&gt;Here's the precise version: three different layers, concrete Go code for each, and a decision framework based on team size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proxy&lt;/strong&gt; = transport layer. Pipes requests from your app to the provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Router&lt;/strong&gt; = decision layer. Chooses which model or provider handles the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateway&lt;/strong&gt; = policy layer. Auth, rate limits, budget enforcement, audit trails&lt;/li&gt;
&lt;li&gt;They're not separate products — they're three layers of the same stack&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Proxy: Transport Layer
&lt;/h2&gt;

&lt;p&gt;A proxy intercepts your HTTP request and forwards it to the provider. Your app changes one thing: the &lt;code&gt;base_url&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Before&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// After — same SDK, same code, different URL&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithBaseURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://proxy.your-company.com/v1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minimal Go proxy handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Swap client key → upstream provider key&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;providerKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://api.openai.com"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;httputil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSingleHostReverseProxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core. A proxy doesn't decide anything — it doesn't choose GPT-4o over GPT-4o-mini, doesn't enforce rate limits. It pipes traffic. Everything else is built on top of this.&lt;/p&gt;
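&lt;p&gt;One detail worth knowing before you ship this: LLM providers stream responses as server-sent events, and &lt;code&gt;httputil.ReverseProxy&lt;/code&gt; buffers writes by default, which turns token-by-token streaming into one blob at the end. A negative &lt;code&gt;FlushInterval&lt;/code&gt; flushes every write immediately. A minimal sketch (the upstream URL and function name are illustrative):&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http/httputil"
	"net/url"
)

// newStreamingProxy builds a reverse proxy that forwards SSE
// chunks as they arrive instead of buffering the whole response.
func newStreamingProxy(upstream string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.FlushInterval = -1 // negative: flush immediately after each write
	return proxy, nil
}

func main() {
	proxy, err := newStreamingProxy("https://api.openai.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("flush interval:", proxy.FlushInterval)
	// In production: http.ListenAndServe(":8080", proxy)
}
```

&lt;p&gt;Streaming is the thing hand-rolled proxies usually break first; the rest of the handler stays exactly as above.&lt;/p&gt;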




&lt;h2&gt;
  
  
  The Router: Decision Layer
&lt;/h2&gt;

&lt;p&gt;A router decides which model and provider handle each request. It returns a routing decision; the proxy executes it. The router is pure business logic — no HTTP, no transport — which makes it testable independently and swappable without touching the proxy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-based routing&lt;/strong&gt; (most valuable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;estimateComplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Short, simple: classification, extraction, booleans&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;complexity&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Medium: summarization, structured output&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// Complex: multi-step reasoning, code generation&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-opus-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
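&lt;p&gt;The &lt;code&gt;estimateComplexity&lt;/code&gt; call above is left undefined, and what it should look like depends entirely on your traffic. A deliberately crude, dependency-free heuristic (illustrative only, not a recommendation) weighs prompt length plus a few trigger phrases:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// estimateComplexity maps a prompt to a score between 0 and 1.
// Length is a rough proxy for task size; certain phrases signal
// multi-step work. A real router might use a small classifier here.
func estimateComplexity(prompt string) float64 {
	score := float64(len(prompt)) / 4000.0
	lower := strings.ToLower(prompt)
	for _, phrase := range []string{"step by step", "refactor", "debug", "write a"} {
		if strings.Contains(lower, phrase) {
			score += 0.3
		}
	}
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	fmt.Println(estimateComplexity("Is this email spam? Reply yes or no."))
	fmt.Println(estimateComplexity("Refactor this service to use dependency injection, step by step."))
}
```

&lt;p&gt;The 0.3 and 0.7 cutoffs in the switch above then map directly onto the cheap, medium, and frontier tiers.&lt;/p&gt;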



&lt;p&gt;&lt;strong&gt;Failover routing:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gemini-1.5-pro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"google"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteWithFailover&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;circuit&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsAvailable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;providerChain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
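&lt;p&gt;The real work in failover happens inside &lt;code&gt;r.circuit.IsAvailable&lt;/code&gt;, which the snippet assumes exists. A minimal circuit breaker (a sketch; the thresholds and function names are assumptions, not any library's API) trips a provider after consecutive failures and retries it after a cooldown:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	maxFails = 3                // consecutive failures before the circuit opens
	cooldown = 30 * time.Second // how long an open circuit stays open
)

var (
	mu       sync.Mutex
	fails    = map[string]int{}
	openedAt = map[string]time.Time{}
)

// IsAvailable reports whether a provider should receive traffic.
func IsAvailable(provider string) bool {
	mu.Lock()
	defer mu.Unlock()
	if fails[provider] >= maxFails {
		if time.Since(openedAt[provider]) >= cooldown {
			fails[provider] = 0 // cooldown elapsed: close the circuit and retry
			return true
		}
		return false
	}
	return true
}

// ReportFailure records one failed call; at maxFails the circuit opens.
func ReportFailure(provider string) {
	mu.Lock()
	defer mu.Unlock()
	fails[provider]++
	if fails[provider] == maxFails {
		openedAt[provider] = time.Now()
	}
}

// ReportSuccess resets the failure count after a good response.
func ReportSuccess(provider string) {
	mu.Lock()
	defer mu.Unlock()
	fails[provider] = 0
}

func main() {
	ReportFailure("openai")
	ReportFailure("openai")
	ReportFailure("openai")
	fmt.Println("openai available:", IsAvailable("openai"))       // prints false
	fmt.Println("anthropic available:", IsAvailable("anthropic")) // prints true
}
```

&lt;p&gt;The failover loop would call &lt;code&gt;ReportFailure&lt;/code&gt; whenever a request errors, which is what makes the chain self-healing rather than static.&lt;/p&gt;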



&lt;p&gt;&lt;strong&gt;Metadata-based routing&lt;/strong&gt; (route by a feature tag your app sets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RouteByTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-Feature"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"support-bot"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"code-review"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RoutingDecision&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Provider&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Gateway: Policy Layer
&lt;/h2&gt;

&lt;p&gt;A gateway adds policy enforcement above the router and proxy. The defining characteristic: the gateway has a concept of &lt;em&gt;identity&lt;/em&gt;. It knows which team or user is sending each request and enforces rules based on that identity.&lt;/p&gt;

&lt;p&gt;In Go, a gateway is a middleware chain wrapping the proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BuildGateway&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c"&gt;// validate key → resolve tenant identity&lt;/span&gt;
        &lt;span class="n"&gt;RateLimitMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c"&gt;// per-tenant request + token rate limits&lt;/span&gt;
        &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c"&gt;// per-team monthly spend enforcement&lt;/span&gt;
        &lt;span class="n"&gt;AuditMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c"&gt;// log every request with identity + decision&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;AuthMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LookupTenant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"unauthorized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;401&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Authorization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Bearer "&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProviderKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BudgetMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Handler&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HandlerFunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tenant&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenantKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tenant&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MonthlySpend&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BudgetLimit&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`{"error":"budget_exceeded"}`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;429&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServeHTTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A proxy is stateless with respect to the caller. A gateway is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Products Map to These Layers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Router&lt;/th&gt;
&lt;th&gt;Gateway&lt;/th&gt;
&lt;th&gt;Cost Intelligence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (100+ providers)&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (enterprise)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;— (async only)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preto&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓ (recommendations)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing to know about Langfuse: it's an async observer that never sits in the request path. That's a deliberate architectural choice, not a gap: zero added proxy latency, but also no caching, no routing, and no real-time budget enforcement. It's the right tool if post-hoc observability is all you need.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Actually Need
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One team, one model, under $2K/month&lt;/strong&gt; → direct SDK calls. Add a proxy for logging once you have real traffic to observe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple models, cost visibility needed&lt;/strong&gt; → proxy + router. One URL change gives you per-request cost attribution and the ability to route simple tasks to cheaper models. Teams typically see 20–40% cost reduction within the first week of enabling model routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple teams, budget enforcement needed&lt;/strong&gt; → gateway. The moment two teams share an API key and neither can see what the other spends, you have a governance problem. A bill spike hits. Nobody knows which team caused it. Nobody can be held accountable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compliance requirements (SOC 2, HIPAA, GDPR)&lt;/strong&gt; → gateway with audit logging and PII controls. Auditors will ask who accessed which data, through which model, and when; a gateway gives you the audit trail to prove it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — all three layers (proxy + router + gateway) plus cost intelligence in one URL change. Free up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>backend</category>
      <category>go</category>
    </item>
    <item>
      <title>LLM Semantic Caching: The 95% Hit Rate Myth (and What Production Data Actually Shows)</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:47:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</link>
      <guid>https://forem.com/gauravdagde/llm-semantic-caching-the-95-hit-rate-myth-and-what-production-data-actually-shows-8ga</guid>
      <description>&lt;p&gt;You opened your OpenAI dashboard this morning and felt that familiar pit in your stomach. The number was higher than last month. Again. Somebody mentioned semantic caching — "just cache the responses, cut costs by 90%." So you looked into it.&lt;/p&gt;

&lt;p&gt;The vendor pages all say the same thing: 95% cache hit rates, 90% cost reduction, millisecond responses. Then you ran the numbers on your own traffic and the reality was different. Much different.&lt;/p&gt;

&lt;p&gt;This post breaks down how semantic caching actually works, what the published production hit rates are (not the marketing numbers), and which use cases benefit — and which don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Published production hit rates range from 20-45%, not 90-95%. The 95% number refers to &lt;em&gt;accuracy&lt;/em&gt; of cache matches, not frequency of hits.&lt;/li&gt;
&lt;li&gt;Even a 20% hit rate saves real money — $1,000/month on a $5K LLM bill — while cutting latency from 2-5s to under 5ms on cached requests.&lt;/li&gt;
&lt;li&gt;Start with exact caching. Add semantic caching only if the marginal improvement justifies the complexity.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Exact Caching vs. Semantic Caching: Two Different Problems
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, the distinction matters because most teams should start with exact caching and only add semantic caching if exact caching alone doesn't cover enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exact caching
&lt;/h3&gt;

&lt;p&gt;Hash the full prompt (including model name, temperature, and other parameters) with SHA-256. If the hash matches a stored request, return the cached response. Zero ambiguity — the prompt is identical, so the response is valid.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms, zero LLM cost
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Zero false positives. Sub-millisecond lookup. Trivial to implement.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Misses rephrased duplicates. "How do I reset my password?" and "password reset help" are different hashes.&lt;/p&gt;

&lt;p&gt;Exact caching alone catches more traffic than you'd expect. The average production app sends &lt;strong&gt;15-30% identical requests&lt;/strong&gt; — automated pipelines, retries, and users asking the same FAQ.&lt;/p&gt;
&lt;h3&gt;
  
  
  Semantic caching
&lt;/h3&gt;

&lt;p&gt;Generate a vector embedding of the prompt, compare it via cosine similarity to stored embeddings, and return a cached response if the similarity exceeds a threshold. This catches rephrased duplicates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embed_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ~2-5ms
&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;5ms total
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Catches semantically similar requests with different wording.&lt;br&gt;
&lt;strong&gt;Cons:&lt;/strong&gt; Embedding generation adds 2-5ms. False positives are possible. Threshold tuning is critical and use-case dependent.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 95% Myth: What the Numbers Actually Say
&lt;/h2&gt;

&lt;p&gt;The "95% cache hit rate" claim circulates across vendor marketing pages. Here's what the published data actually shows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Portkey (production)&lt;/td&gt;
&lt;td&gt;~20%&lt;/td&gt;
&lt;td&gt;RAG use cases, 99% match accuracy&lt;/td&gt;
&lt;td&gt;Vendor data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EdTech platform (production)&lt;/td&gt;
&lt;td&gt;~45%&lt;/td&gt;
&lt;td&gt;Student Q&amp;amp;A — high repetition&lt;/td&gt;
&lt;td&gt;Case study&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT Semantic Cache (academic)&lt;/td&gt;
&lt;td&gt;61-69%&lt;/td&gt;
&lt;td&gt;Controlled benchmark, curated dataset&lt;/td&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General production estimate&lt;/td&gt;
&lt;td&gt;30-40%&lt;/td&gt;
&lt;td&gt;Mixed traffic across use cases&lt;/td&gt;
&lt;td&gt;Industry average&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat (production)&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Unique conversations, low repetition&lt;/td&gt;
&lt;td&gt;Observed range&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 95% number, when you trace it back, almost always refers to &lt;strong&gt;match accuracy&lt;/strong&gt; — meaning 95% of the time a cache returns a response, that response is correct for the query. Not that 95% of queries hit the cache. These are fundamentally different metrics.&lt;/p&gt;

&lt;p&gt;The honest range for production semantic caching: &lt;strong&gt;20-45% hit rate&lt;/strong&gt;, depending heavily on use case.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why academic benchmarks are misleading:&lt;/strong&gt; Academic benchmarks test against curated datasets where similar questions are intentionally grouped. Production traffic is messier — 60-70% of real queries are genuinely unique. The 61-69% hit rates from research papers don't survive contact with production diversity.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Hit Rates by Use Case: Where Caching Works (and Doesn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Expected Hit Rate&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FAQ / customer support&lt;/td&gt;
&lt;td&gt;40-60%&lt;/td&gt;
&lt;td&gt;Users ask the same questions in slightly different ways. High repetition, bounded answer space.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classification / labeling&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;td&gt;Automated pipelines often send identical or near-identical inputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Internal knowledge base Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;30-45%&lt;/td&gt;
&lt;td&gt;Employees ask similar questions about policies, processes, docs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG with document retrieval&lt;/td&gt;
&lt;td&gt;15-25%&lt;/td&gt;
&lt;td&gt;Context varies per query even if questions are similar.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-ended chat&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;td&gt;Conversations are unique. Multi-turn context makes each request different.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;5-15%&lt;/td&gt;
&lt;td&gt;High specificity per request. Users want varied outputs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: &lt;strong&gt;bounded answer spaces with repetitive inputs cache well. Open-ended, context-dependent, or creative tasks don't.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Threshold Problem: 0.85 vs. 0.92 vs. 0.98
&lt;/h2&gt;

&lt;p&gt;The cosine similarity threshold is the most important — and most under-discussed — configuration in semantic caching. It's the knob that determines whether your cache is useful or dangerous.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.85 (aggressive):&lt;/strong&gt; More cache hits, but higher false positive rate. "How to reset my password" might match "How to change my email" — similar intent, wrong answer. Good for FAQ-style use cases where a slightly imprecise answer is acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.92 (balanced):&lt;/strong&gt; The sweet spot for most production use cases. Catches clear rephrasings while rejecting distinct-but-similar queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold 0.98 (conservative):&lt;/strong&gt; Almost-exact matching. Very few false positives, but you're only catching the most obvious rephrasings. At this point, exact caching captures nearly as much with zero false positive risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universal correct threshold. It depends on the cost of a wrong answer in your application. A customer support bot returning a slightly wrong FAQ answer is tolerable. A medical advice application returning a cached answer for a different condition is dangerous.&lt;/p&gt;
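&lt;p&gt;Threshold tuning is easiest to reason about offline. A minimal sketch, assuming you already have cosine similarities for a set of hand-labeled query pairs (the helper name and the sample pairs are illustrative):&lt;/p&gt;

```python
# Sketch: sweep candidate thresholds over labeled (similarity, is_duplicate)
# pairs to see the hit-count / false-positive trade-off at each setting.
def sweep_thresholds(labeled_pairs, thresholds=(0.85, 0.92, 0.98)):
    results = {}
    for t in thresholds:
        hits = [pair for pair in labeled_pairs if pair[0] >= t]
        true_pos = sum(1 for sim, dup in hits if dup)
        results[t] = {"hits": len(hits), "false_positives": len(hits) - true_pos}
    return results

# Illustrative labels: 0.90 is a distinct-but-similar query, i.e. a false positive.
pairs = [(0.99, True), (0.93, True), (0.91, True), (0.90, False), (0.86, False)]
print(sweep_thresholds(pairs))
```

&lt;p&gt;Dropping the threshold from 0.92 to 0.85 picks up more hits, at the cost of the false positives described above.&lt;/p&gt;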


&lt;h2&gt;
  
  
  Five Failure Modes Nobody Warns You About
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Context-dependent queries that look identical
&lt;/h3&gt;

&lt;p&gt;"What's the status?" asked by User A about Order #4521 and User B about Order #7893 will have near-identical embeddings. Without user-scoped or session-scoped cache keys, User B gets User A's order status. Cache keys must include relevant context — not just the prompt text.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Time-sensitive queries returning stale answers
&lt;/h3&gt;

&lt;p&gt;"What's the latest pricing for GPT-5?" cached last week is wrong this week if pricing changed. TTL helps, but the right TTL varies by query type. Pricing questions need TTLs of hours. FAQ answers can cache for days. One-size-fits-all TTL is a guarantee of either stale answers or low hit rates.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Embedding model drift
&lt;/h3&gt;

&lt;p&gt;If you update your embedding model, all previously cached embeddings become invalid. The similarity scores between old and new embeddings are meaningless. You need a cache invalidation strategy tied to your embedding model version. Most teams learn this the hard way after a model update causes a spike in incorrect cache responses.&lt;/p&gt;
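&lt;p&gt;A cheap insurance policy: namespace the vector index by embedding-model version, so an upgrade starts a fresh index rather than comparing incompatible embeddings. The version tag below is an illustrative assumption:&lt;/p&gt;

```python
# Sketch: tie the semantic-cache namespace to the embedding model version.
# Bumping the version abandons old entries (they age out via TTL) instead
# of letting cross-version similarity scores produce garbage matches.
EMBED_MODEL_VERSION = "text-embedding-3-small@1"  # illustrative version tag

def cache_namespace(base: str = "semantic-cache") -> str:
    return f"{base}:{EMBED_MODEL_VERSION}"

print(cache_namespace())  # semantic-cache:text-embedding-3-small@1
```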
&lt;h3&gt;
  
  
  4. Cache poisoning from bad responses
&lt;/h3&gt;

&lt;p&gt;If the LLM returns a hallucinated or incorrect response and you cache it, every similar future query gets that same bad answer. The cache amplifies the error. Mitigation: add quality checks before caching (confidence scores, length validation, format checks), or let users flag cached responses as incorrect to trigger cache eviction.&lt;/p&gt;
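&lt;p&gt;The quality gate doesn't need to be sophisticated to catch the worst offenders. A minimal sketch, with illustrative length thresholds and refusal prefixes:&lt;/p&gt;

```python
# Sketch: cheap pre-cache checks so an obviously bad response is served
# once but never amplified by the cache.
REFUSAL_PREFIXES = ("i'm sorry", "i cannot", "as an ai")

def cacheable(response: str) -> bool:
    text = response.strip()
    if not text:
        return False                    # empty output
    if len(text) > 20_000:
        return False                    # runaway generation
    if text.lower().startswith(REFUSAL_PREFIXES):
        return False                    # don't cache refusals
    return True

assert cacheable("Your order ships Tuesday.")
assert not cacheable("I'm sorry, I can't help with that.")
```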
&lt;h3&gt;
  
  
  5. Streaming response caching complexity
&lt;/h3&gt;

&lt;p&gt;Most LLM calls use streaming (&lt;code&gt;stream: true&lt;/code&gt;). You can't cache a streaming response mid-stream — you need to buffer the full response, then store it. On cache hit, you either return the full response instantly (breaking the streaming contract your client expects) or simulate streaming by chunking the cached response with artificial delays. Both are engineering overhead that vendors rarely mention.&lt;/p&gt;
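&lt;p&gt;The second option, simulated streaming, is a few lines. This sketch chunks the cached text and yields it like a stream; chunk size and pacing are illustrative knobs, and real SSE framing is omitted:&lt;/p&gt;

```python
import time

# Sketch: replay a cached response as a chunked stream so clients built
# for `stream: true` keep working on cache hits.
def stream_cached(response: str, chunk_size: int = 8, delay_s: float = 0.0):
    for i in range(0, len(response), chunk_size):
        if delay_s:
            time.sleep(delay_s)  # optional pacing to mimic token cadence
        yield response[i:i + chunk_size]

chunks = list(stream_cached("cached answer text", chunk_size=6))
assert "".join(chunks) == "cached answer text"
```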


&lt;h2&gt;
  
  
  The Dollar Math: What Caching Actually Saves
&lt;/h2&gt;

&lt;p&gt;For a team spending &lt;strong&gt;$5,000/month&lt;/strong&gt; on LLM APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Monthly Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10% hits&lt;/td&gt;
&lt;td&gt;$500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20% hits&lt;/td&gt;
&lt;td&gt;$1,000/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30% hits&lt;/td&gt;
&lt;td&gt;$1,500/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;45% hits&lt;/td&gt;
&lt;td&gt;$2,250/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The savings come from two places: &lt;strong&gt;avoided LLM calls&lt;/strong&gt; (the obvious one) and &lt;strong&gt;reduced latency&lt;/strong&gt; (the hidden one). A cache hit returns in under 5ms instead of 2-5 seconds. For customer-facing applications, that latency improvement often matters more than the dollar savings.&lt;/p&gt;

&lt;p&gt;The cost of running the cache itself is minimal. Embedding generation uses a small model (text-embedding-3-small at $0.02/1M tokens). Vector storage in Redis or a dedicated vector DB adds $50-200/month depending on cache size. The infrastructure cost is under 5% of the savings at even a 10% hit rate.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Right Architecture: Layer Exact and Semantic Caching
&lt;/h2&gt;

&lt;p&gt;The best approach is a two-layer cache that checks exact matches first (fast, zero risk) and falls back to semantic matching only when needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Layer 1: Exact cache (sub-ms, zero false positives)
&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exact_hit&lt;/span&gt;

&lt;span class="c1"&gt;# Layer 2: Semantic cache (2-5ms, threshold-gated)
&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;semantic_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;semantic_hit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Cache miss: call the LLM
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to both layers
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average app we onboard discovers that &lt;strong&gt;18% of requests are exact duplicates&lt;/strong&gt; on day one — before semantic matching even kicks in.&lt;/p&gt;

&lt;p&gt;Cache backends matter less than you'd think. In-memory works for single-instance proxies. Redis works for distributed deployments. Dedicated vector databases (Qdrant, Pinecone) are worth it only if your cache exceeds 1M entries — below that, Redis with vector search is sufficient and simpler to operate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start With Measurement, Not Implementation
&lt;/h2&gt;

&lt;p&gt;The most common mistake: building a caching layer before understanding what your traffic looks like. You might spend two weeks implementing semantic caching only to discover that your traffic is 90% unique, context-dependent queries with a 12% hit rate ceiling.&lt;/p&gt;

&lt;p&gt;Measure first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log all prompts for a week.&lt;/strong&gt; Hash them. Count exact duplicates. That's your floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample 1,000 requests.&lt;/strong&gt; Generate embeddings. Cluster them. Count how many fall within a 0.92 similarity threshold. That's your ceiling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimate savings.&lt;/strong&gt; Floor hit rate × monthly LLM spend = guaranteed savings. Ceiling hit rate × monthly spend = maximum possible savings. If both numbers are under $200/month, caching isn't worth the engineering effort.&lt;/li&gt;
&lt;/ol&gt;
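&lt;p&gt;Step 1 is a few lines once the prompts are logged. A minimal sketch of the floor measurement (the sample log is illustrative):&lt;/p&gt;

```python
import hashlib
from collections import Counter

# Sketch: hash a week of logged prompts and count exact repeats.
# The repeat fraction is the guaranteed (floor) cache hit rate.
def exact_duplicate_rate(prompts):
    counts = Counter(hashlib.sha256(p.encode()).hexdigest() for p in prompts)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(prompts)

log = ["reset password", "reset password", "billing help", "reset password"]
print(exact_duplicate_rate(log))  # 0.5: two of the four requests were repeats
```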

&lt;p&gt;If both numbers justify the effort, start with exact caching only. Run it for two weeks. Then add semantic caching on top and compare the marginal improvement. If semantic caching only adds 5-8 percentage points over exact caching, the false positive risk may not justify the complexity.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that detects exact duplicates and semantically similar requests across your traffic. See your cache potential and projected savings before you build anything. Free for up to 10K requests.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>redis</category>
      <category>backend</category>
    </item>
    <item>
      <title>We built an LLM proxy that adds 47ms of latency. Here's every millisecond accounted for.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:45:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</link>
      <guid>https://forem.com/gauravdagde/we-built-an-llm-proxy-that-adds-47ms-of-latency-heres-every-millisecond-accounted-for-2lnk</guid>
      <description>&lt;p&gt;Your LLM API request passes through 7 layers before it reaches OpenAI. Authentication. Rate limiting. Cache lookup. Model routing. The upstream call itself. Fallback logic. Logging and cost attribution. Most teams have no idea what happens in between — or that the entire round trip adds less than 50 milliseconds.&lt;/p&gt;

&lt;p&gt;This post breaks down every layer of an LLM proxy, what each one costs in latency, and why those 47 milliseconds determine whether your AI infrastructure scales — or quietly bankrupts you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An LLM proxy intercepts your API request and passes it through 7 processing layers in under 50ms — adding auth, caching, routing, failover, and cost tracking that the provider API doesn't give you.&lt;/li&gt;
&lt;li&gt;Proxy overhead (3-50ms) is under 3% of total request time. The cost of &lt;em&gt;not&lt;/em&gt; having a proxy — untracked spend, zero failover, no per-feature attribution — is far higher.&lt;/li&gt;
&lt;li&gt;The setup is one line of code: change your &lt;code&gt;base_url&lt;/code&gt;. Everything else stays the same.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What Is an LLM Proxy (and Why Should a CTO Care)?
&lt;/h2&gt;

&lt;p&gt;An LLM proxy sits between your application code and the LLM provider. Your app sends requests to the proxy URL instead of directly to &lt;code&gt;api.openai.com&lt;/code&gt;. The proxy handles everything else: authentication, routing, caching, logging, failover.&lt;/p&gt;

&lt;p&gt;Think of it as an API gateway — but AI-aware. Traditional gateways (Kong, Nginx) understand HTTP. An LLM proxy understands tokens, models, prompt structure, and cost-per-request. It can make routing decisions based on task complexity, enforce per-team budget limits, and detect that 30% of your requests are semantically identical and cacheable.&lt;/p&gt;

&lt;p&gt;The setup is one line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After — same SDK, same code, different base URL
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.preto.ai/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything downstream — your prompts, your response handling, your error handling — stays the same. The proxy is transparent to your application code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 7 Layers Your Request Passes Through
&lt;/h2&gt;

&lt;p&gt;Here's what happens in those 47 milliseconds, layer by layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Ingress and Authentication (~2-5ms)
&lt;/h3&gt;

&lt;p&gt;The proxy receives your HTTP request and validates the API key. But unlike a direct OpenAI call, the key maps to an internal identity: a team, a project, a budget. Your upstream provider keys are never exposed to application code.&lt;/p&gt;

&lt;p&gt;One leaked key doesn't compromise your entire OpenAI account — it compromises one team's allocation with a hard spending cap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Rate Limiting and Budget Enforcement (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;Before the request goes anywhere, the proxy checks two things: Is this user within their rate limit? Is their team within its budget?&lt;/p&gt;

&lt;p&gt;Smart proxies enforce &lt;em&gt;token-level&lt;/em&gt; rate limits, not just request-level — because one 100K-context request is not the same as one 500-token classification. Budget checks happen in-memory (synced with Redis every ~10ms) so they don't block the request path.&lt;/p&gt;
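&lt;p&gt;Token-level limiting is essentially a token bucket denominated in LLM tokens rather than requests. A minimal single-process sketch (rates are illustrative; a distributed proxy would back this with Redis as described above):&lt;/p&gt;

```python
import time

# Sketch: a token bucket where the cost of a request is its estimated token
# count, so one 100K-context call draws far more budget than a 500-token call.
class TokenBudget:
    def __init__(self, tokens_per_sec: float, burst: float):
        self.rate, self.capacity = tokens_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= estimated_tokens:
            self.tokens -= estimated_tokens
            return True
        return False

bucket = TokenBudget(tokens_per_sec=1_000, burst=5_000)
assert bucket.allow(500)          # small classification request passes
assert not bucket.allow(100_000)  # giant context request is throttled
```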

&lt;h3&gt;
  
  
  Layer 3: Cache Lookup (~1-8ms; hit returns in &amp;lt;5ms, saving 500ms-5s)
&lt;/h3&gt;

&lt;p&gt;The proxy checks whether it has seen this request — or one semantically similar — before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact caching&lt;/strong&gt; hashes the prompt and returns an identical response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; generates an embedding, computes cosine similarity against recent requests, and returns a cached response if similarity exceeds a threshold.&lt;/p&gt;

&lt;p&gt;A cache hit skips the LLM entirely: response in under 5ms instead of 2-5 seconds. In production, hit rates range from 20% to 45% depending on the use case — even 20% is a meaningful cost reduction.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Routing and Model Selection (~1-3ms)
&lt;/h3&gt;

&lt;p&gt;If the request isn't cached, the proxy decides where to send it. Simple routing forwards to the model specified in the request. Advanced routing makes a decision: load balance across multiple Azure OpenAI deployments, select a cheaper model for simple tasks, or route based on headers or request patterns.&lt;/p&gt;

&lt;p&gt;Cost-based routing — sending classification tasks to GPT-5 Mini instead of GPT-5 — can cut 80% of cost on affected requests with no accuracy loss.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 5: Upstream Call + Streaming (~500ms-5,000ms)
&lt;/h3&gt;

&lt;p&gt;The proxy forwards the request to the selected provider with the upstream API key. For streaming responses (&lt;code&gt;stream: true&lt;/code&gt;), the proxy pipes tokens back to your application as they arrive — the client starts receiving output before the full response is generated.&lt;/p&gt;

&lt;p&gt;The proxy also enforces request timeouts, killing requests that exceed a duration threshold before they waste tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 6: Fallback and Retry (~0ms unless triggered: then 100-500ms)
&lt;/h3&gt;

&lt;p&gt;If the primary provider returns a 429 (rate limit), 503 (service unavailable), or times out, the proxy retries with exponential backoff — then falls back to the next provider in the chain.&lt;/p&gt;

&lt;p&gt;GPT-5 fails? Route to Claude Sonnet. Claude is down? Try Gemini Pro.&lt;/p&gt;

&lt;p&gt;Circuit breakers monitor error rates per provider: when a provider crosses a failure threshold, it's automatically removed from the rotation and re-tested after a cooldown period. Teams running this report 99.97% effective uptime despite individual provider outages, with failover in milliseconds instead of the 5+ minutes it takes to update a hard-coded API key.&lt;/p&gt;
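&lt;p&gt;The control flow is simple enough to sketch. Everything here is illustrative: &lt;code&gt;call_provider&lt;/code&gt; stands in for the real upstream call, and the retryable status set and backoff constants are assumptions:&lt;/p&gt;

```python
import time

# Sketch: retry with exponential backoff, then fall through a provider chain.
RETRYABLE = {429, 503}

def call_with_fallback(request, providers, call_provider, max_retries=2):
    for provider in providers:
        for attempt in range(max_retries + 1):
            status, body = call_provider(provider, request)
            if status == 200:
                return provider, body
            if status not in RETRYABLE:
                break  # hard error: move on to the next provider
            time.sleep(min(0.05 * 2 ** attempt, 0.5))  # exponential backoff
    raise RuntimeError("all providers failed")

# Simulated upstream: the first provider is rate-limited, the second works.
def fake_call(provider, request):
    return (429, None) if provider == "primary" else (200, "ok")

print(call_with_fallback("req", ["primary", "secondary"], fake_call))
```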

&lt;h3&gt;
  
  
  Layer 7: Logging, Cost Attribution, and Response (~2-5ms, async)
&lt;/h3&gt;

&lt;p&gt;As the response streams back, the proxy calculates cost (input tokens × input price + output tokens × output price), tags the request with team/feature/environment metadata, and ships the log to your observability backend.&lt;/p&gt;

&lt;p&gt;This happens asynchronously — the client gets the response immediately. The log includes: model used, tokens consumed, cost, latency, cache hit/miss, which feature triggered it, and whether the request fell back to a secondary provider.&lt;/p&gt;
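&lt;p&gt;The cost calculation in this layer is plain arithmetic. A sketch using the GPT-5 per-token prices quoted later in this post (the price table structure is an illustrative assumption):&lt;/p&gt;

```python
# Sketch: per-request cost = input tokens x input price + output tokens x
# output price, with prices expressed in dollars per 1M tokens.
PRICES = {"gpt-5": {"input": 1.25, "output": 5.00}}  # $ per 1M tokens

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The average request from the cost section below: 500 in + 300 out tokens.
print(round(request_cost("gpt-5", 500, 300), 6))  # 0.002125 dollars
```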




&lt;h2&gt;
  
  
  47ms in Context: Why Proxy Overhead Doesn't Matter (and When It Does)
&lt;/h2&gt;

&lt;p&gt;The proxy adds 7-25ms to a request that takes 500ms-5,000ms from the LLM itself. That's 0.5-3% overhead. For most teams, this is noise.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;LLM Latency&lt;/th&gt;
&lt;th&gt;Proxy Overhead&lt;/th&gt;
&lt;th&gt;% Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standard completion (GPT-5, 500 tokens out)&lt;/td&gt;
&lt;td&gt;~2,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;1.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming first token (TTFT)&lt;/td&gt;
&lt;td&gt;~300ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;6.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache hit (semantic match)&lt;/td&gt;
&lt;td&gt;&amp;lt;5ms&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;160%*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long-form generation (2K tokens)&lt;/td&gt;
&lt;td&gt;~8,000ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;0.25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mini model classification&lt;/td&gt;
&lt;td&gt;~400ms&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;5.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*The cache hit row looks alarming — but the total response time is 13ms instead of 2,000ms. Your user got a response 150x faster.&lt;/p&gt;

&lt;p&gt;The only scenario where proxy latency is a real concern: &lt;strong&gt;real-time applications with sub-100ms requirements&lt;/strong&gt; and no caching benefit — voice AI, game NPCs, live translation. For these, a Rust or Go proxy (under 1ms overhead) is the right choice. For everything else, the 20ms is the best trade in your stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Proxy Architecture Patterns: Forward, Reverse, and Sidecar
&lt;/h2&gt;

&lt;p&gt;Not all proxies work the same way. The architecture pattern determines your failure modes, your latency profile, and what features you can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward Proxy (Client-Side Integration)
&lt;/h3&gt;

&lt;p&gt;Your application points at the proxy URL. The proxy forwards requests to the provider. This is the most common pattern (Portkey, LiteLLM, Preto). You get the full feature set: caching, routing, failover, cost tracking. The trade-off: the proxy is in the critical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse Proxy (Edge-Deployed)
&lt;/h3&gt;

&lt;p&gt;The proxy runs at the edge (e.g., Cloudflare Workers), intercepting requests globally with minimal latency. Helicone uses this pattern. Low latency from geographic proximity, but limited by what you can run in an edge function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sidecar / Async Observer
&lt;/h3&gt;

&lt;p&gt;The proxy doesn't sit in the request path at all. Instead, it observes traffic after the fact — through SDK hooks, log tailing, or provider API polling. Langfuse advocates this approach. Zero latency impact, no single point of failure — but you lose caching, real-time routing, and failover.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The honest trade-off:&lt;/strong&gt; A synchronous proxy creates a dependency. Run it as a horizontally scaled service behind a load balancer, with health checks and automatic instance replacement. Keep a direct-to-provider fallback for critical paths. This is standard infrastructure — the same way you'd deploy any API gateway.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Proxy Overhead Actually Costs in Dollars
&lt;/h2&gt;

&lt;p&gt;The proxy adds latency. It also saves money. Here's the math for a team running 100,000 LLM requests per day on GPT-5 ($1.25/1M input, $5.00/1M output) with an average of 500 input + 300 output tokens per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly LLM spend without a proxy:&lt;/strong&gt; $6,450/month&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the proxy saves:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Semantic caching (30% hit rate): -$1,935/month&lt;/li&gt;
&lt;li&gt;Cost-based routing (40% of requests downgraded to GPT-5 Mini): -$1,548/month&lt;/li&gt;
&lt;li&gt;Budget enforcement (prevents 2 runaway features/quarter): -$800-2,000/quarter&lt;/li&gt;
&lt;li&gt;Automatic failover (avoids 3 provider outages/quarter): prevents 4-12 hours of downtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Net result:&lt;/strong&gt; $3,483/month in direct savings, plus avoided downtime. The proxy pays for itself in the first week.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cost of Not Having a Proxy
&lt;/h2&gt;

&lt;p&gt;Without a proxy, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No per-feature cost attribution.&lt;/strong&gt; OpenAI gives you two fields for attribution: &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;project&lt;/code&gt;. That's it. You can't see which feature is responsible for 60% of your bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic failover.&lt;/strong&gt; When OpenAI goes down — and it does, multiple times per quarter — every AI feature in your product goes down with it. Manual failover takes 5+ minutes. At 3am, nobody is watching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No caching layer.&lt;/strong&gt; Identical requests hit the LLM every time. The average production app sends 15-30% duplicate or near-duplicate requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No budget enforcement.&lt;/strong&gt; A new feature ships with a prompt that generates 2,000 output tokens per request instead of 300. Nobody notices until the monthly bill arrives 3x higher than expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The average production app we onboard discovers that &lt;strong&gt;18% of its requests are cacheable on day one&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build vs. Buy: The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Building a production-grade LLM proxy is a 6-12 month engineering effort. Based on published estimates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core gateway (routing, auth, failover): $200K-$300K in engineering time&lt;/li&gt;
&lt;li&gt;Observability (logging, dashboards, alerting): $100K-$150K&lt;/li&gt;
&lt;li&gt;Prompt management UI: $100K-$150K&lt;/li&gt;
&lt;li&gt;Compliance and security (SOC 2, HIPAA): $50K-$100K/year ongoing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total first-year investment: $450K-$700K&lt;/strong&gt;, plus 12-18 months before your AI features ship with production-grade infrastructure.&lt;/p&gt;

&lt;p&gt;One real case study: a team replaced their custom LLM manager with a managed proxy and &lt;strong&gt;removed 11,005 lines of code&lt;/strong&gt; across 112 files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build if:&lt;/strong&gt; LLM routing is your core product differentiator, you have unique compliance requirements, or your scale requires custom optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buy if:&lt;/strong&gt; You want to ship AI features this month, your engineering team should be building product not infrastructure, and your LLM spend is between $1K and $100K/month.&lt;/p&gt;




&lt;h2&gt;
  
  
  Latency Benchmarks by Implementation Language
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Overhead&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bifrost&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;Pure routing, no observability platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorZero&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99&lt;/td&gt;
&lt;td&gt;10,000 QPS&lt;/td&gt;
&lt;td&gt;Built-in A/B testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helicone&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;~1-5ms P95&lt;/td&gt;
&lt;td&gt;~10,000 RPS&lt;/td&gt;
&lt;td&gt;Edge-deployed on Cloudflare Workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portkey&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;&amp;lt;10ms&lt;/td&gt;
&lt;td&gt;1,000 RPS&lt;/td&gt;
&lt;td&gt;Full-featured: guardrails, prompt mgmt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiteLLM&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;3-50ms&lt;/td&gt;
&lt;td&gt;1,000 QPS&lt;/td&gt;
&lt;td&gt;Most flexible (100+ providers)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rust and Go proxies handle 5-10x more throughput with 10-100x less overhead than Python. But LiteLLM has the largest provider coverage. For most teams under 1,000 RPS, the language doesn't matter. At 5,000+ RPS, it's the first thing that matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  When You Don't Need a Proxy
&lt;/h2&gt;

&lt;p&gt;Skip the proxy if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're calling one model, from one service, at low volume&lt;/li&gt;
&lt;li&gt;Your LLM spend is under $500/month&lt;/li&gt;
&lt;li&gt;You need observability but not routing (an async observer works fine)&lt;/li&gt;
&lt;li&gt;You're still prototyping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the proxy when you have multiple models, multiple teams, real money at stake, and no visibility into where it's going.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. If you're evaluating options, the full build vs. buy decision checklist (12 questions, PDF) is linked below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>backend</category>
    </item>
    <item>
      <title>We evaluated Go, Rust, and Python for our LLM proxy. Go won - and not for the reason you'd expect.</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Fri, 03 Apr 2026 11:54:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</link>
      <guid>https://forem.com/gauravdagde/we-evaluated-go-rust-and-python-for-our-llm-proxy-go-won-and-not-for-the-reason-youd-expect-4a2j</guid>
      <description>&lt;p&gt;We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.&lt;/li&gt;
&lt;li&gt;Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.&lt;/li&gt;
&lt;li&gt;Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Three Contenders
&lt;/h2&gt;

&lt;p&gt;When we started building Preto's proxy layer, we had three options on the table. Each had a strong case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rust&lt;/strong&gt; was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go&lt;/strong&gt; was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmark That Settled the Python Question
&lt;/h2&gt;

&lt;p&gt;We ran Python off the list first. Not because it's slow in theory — because it's slow in practice at our target scale.&lt;/p&gt;

&lt;p&gt;LiteLLM's own published benchmarks tell the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At 500 RPS:&lt;/strong&gt; Stable. ~40ms overhead. Acceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 1,000 RPS:&lt;/strong&gt; Memory climbs to 4GB+. Latency variance increases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At 2,000 RPS:&lt;/strong&gt; Timeouts start. Memory hits 8GB+. Requests fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's &lt;code&gt;asyncio&lt;/code&gt; helps, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.&lt;/p&gt;

&lt;p&gt;LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Python isn't wrong — it's wrong for this. If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Go vs. Rust: Where the Decision Gets Interesting
&lt;/h2&gt;

&lt;p&gt;With Python out, the real comparison begins. Here's what we measured and researched:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Go&lt;/th&gt;
&lt;th&gt;Rust&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Proxy overhead&lt;/td&gt;
&lt;td&gt;~11μs at 5K RPS&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms P99 at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max throughput (single instance)&lt;/td&gt;
&lt;td&gt;5,000+ RPS&lt;/td&gt;
&lt;td&gt;10,000+ QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory under load&lt;/td&gt;
&lt;td&gt;~200MB at 5K RPS&lt;/td&gt;
&lt;td&gt;~50MB at 10K QPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency model&lt;/td&gt;
&lt;td&gt;Goroutines (lightweight)&lt;/td&gt;
&lt;td&gt;async/await (Tokio)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming HTTP support&lt;/td&gt;
&lt;td&gt;stdlib net/http&lt;/td&gt;
&lt;td&gt;hyper/axum (good, more code)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to implement proxy MVP&lt;/td&gt;
&lt;td&gt;~2 weeks&lt;/td&gt;
&lt;td&gt;~5-6 weeks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hiring pool&lt;/td&gt;
&lt;td&gt;Large (DevOps, backend)&lt;/td&gt;
&lt;td&gt;Small (systems specialists)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compile times&lt;/td&gt;
&lt;td&gt;~5 seconds&lt;/td&gt;
&lt;td&gt;~2-5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Binary size&lt;/td&gt;
&lt;td&gt;~15MB&lt;/td&gt;
&lt;td&gt;~8MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Factor That Made It Obvious: Goroutines and Streaming
&lt;/h2&gt;

&lt;p&gt;An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.&lt;/p&gt;

&lt;p&gt;In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;proxyHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Forward to upstream LLM provider&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DefaultClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstreamReq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;handleFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// try next provider&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Stream tokens back as they arrive&lt;/span&gt;
    &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flusher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;flusher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// send immediately&lt;/span&gt;
            &lt;span class="n"&gt;trackTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c"&gt;// async cost tracking&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the core loop. In Rust, the equivalent code involves &lt;code&gt;async/await&lt;/code&gt;, &lt;code&gt;Pin&amp;lt;Box&amp;lt;dyn Stream&amp;gt;&amp;gt;&lt;/code&gt;, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.&lt;/p&gt;

&lt;p&gt;When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real-World Request Lifecycle in Our Go Proxy
&lt;/h2&gt;

&lt;p&gt;Here's how a request flows through our stack, with timing at each stage:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TLS termination + HTTP parse&lt;/strong&gt; — handled by Go's &lt;code&gt;net/http&lt;/code&gt; server. ~1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key lookup + team resolution&lt;/strong&gt; — in-memory map with Redis sync every 10ms. ~0.5ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limit check&lt;/strong&gt; — token-bucket algorithm in goroutine-safe map. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget enforcement&lt;/strong&gt; — check team's monthly spend against cap. ~0.2ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache probe&lt;/strong&gt; — SHA-256 hash of prompt + model + params, checked against local cache with Redis fallback. ~1-3ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route selection&lt;/strong&gt; — match model to upstream endpoint, apply load balancing weights. ~0.1ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upstream call + streaming&lt;/strong&gt; — goroutine holds connection, pipes &lt;code&gt;data:&lt;/code&gt; chunks back. 500ms-5,000ms (the LLM).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async logging&lt;/strong&gt; — cost calculation and log entry shipped to ClickHouse via buffered channel. ~0ms on the request path (fires in background goroutine).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total proxy overhead: ~5-8ms.&lt;/strong&gt; The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We'd Choose Rust For
&lt;/h2&gt;

&lt;p&gt;This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;We needed to handle 10,000+ QPS on a single instance.&lt;/strong&gt; At that scale, Rust's zero-cost abstractions and lack of GC pauses become meaningful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory was a hard constraint.&lt;/strong&gt; Rust's 50MB footprint vs. Go's 200MB matters if you're running on edge nodes or embedded devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The proxy was the entire product.&lt;/strong&gt; If our company was an LLM proxy company, spending 3x longer on the core engine is justified. Our proxy is infrastructure — the product is cost intelligence built on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.&lt;/p&gt;

&lt;p&gt;For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons From 6 Months in Production
&lt;/h2&gt;

&lt;p&gt;Three things surprised us after shipping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Garbage collection pauses are a non-issue.&lt;/strong&gt; Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune &lt;code&gt;GOGC&lt;/code&gt; — we never needed to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The standard library HTTP server is production-ready.&lt;/strong&gt; We started with Go's &lt;code&gt;net/http&lt;/code&gt; and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Goroutine leaks are the real danger.&lt;/strong&gt; Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. &lt;code&gt;runtime.NumGoroutine()&lt;/code&gt; caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. Monitor goroutine count as a first-class metric from day one.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Build vs. Buy Question
&lt;/h2&gt;

&lt;p&gt;If you're evaluating whether to build your own proxy or use a managed solution, the math is sobering: a production-grade proxy is a 6-12 month engineering effort, roughly $450K-$700K in first-year engineering time when you include observability, a management UI, and compliance work.&lt;/p&gt;

&lt;p&gt;One team we onboarded had built their own LLM manager — a reasonable decision at the time. When they migrated to a managed proxy, they removed &lt;strong&gt;11,005 lines of code across 112 files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Build if LLM routing is your core product differentiator. Buy if you want to ship AI features this month.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're building &lt;a href="https://preto.ai" rel="noopener noreferrer"&gt;Preto.ai&lt;/a&gt; — LLM cost optimization that sits in your proxy layer. Free for up to 10K requests. See what your LLM spend actually looks like.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>go</category>
      <category>rust</category>
    </item>
    <item>
      <title>How to integrate coroot-pg-agent with prometheus</title>
      <dc:creator>gauravdagde</dc:creator>
      <pubDate>Tue, 23 Aug 2022 21:23:00 +0000</pubDate>
      <link>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</link>
      <guid>https://forem.com/gauravdagde/how-to-integrate-coroot-pg-agent-with-prometheus-2i48</guid>
      <description>&lt;p&gt;For monitoring postgres server most of the opensource stacks consists of grafana with prometheus.&lt;/p&gt;

&lt;p&gt;Connecting postgres metrics to prometheus is very interesting task and there are certain tools/libraries are available.&lt;/p&gt;

&lt;p&gt;Such libraries are helpful for monitoring and writing alert rules over prometheus.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;postgres_exporter(&lt;a href="https://github.com/prometheus-community/postgres_exporter" rel="noopener noreferrer"&gt;https://github.com/prometheus-community/postgres_exporter&lt;/a&gt;) &lt;/li&gt;
&lt;li&gt;coroot-pg-agent(&lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Today we are going to take a closer look at coroot-pg-agent.&lt;/p&gt;

&lt;p&gt;coroot-pg-agent can be run using Docker; more information can be found in the repo (&lt;a href="https://github.com/coroot/coroot-pg-agent" rel="noopener noreferrer"&gt;https://github.com/coroot/coroot-pg-agent&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Announcement on the official PostgreSQL site: &lt;a href="https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/" rel="noopener noreferrer"&gt;https://www.postgresql.org/about/news/coroot-pg-agent-an-open-source-postgres-exporter-for-prometheus-2488/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, while running coroot-pg-agent with Prometheus, there are a few things we should keep in mind.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When run with Docker, coroot-pg-agent listens on port 80 by default. We can run it on a custom port with the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --name coroot-pg-agent \
--env DSN="postgresql://&amp;lt;USER&amp;gt;:&amp;lt;PASSWORD&amp;gt;@&amp;lt;HOST&amp;gt;:5432/postgres?connect_timeout=1&amp;amp;statement_timeout=30000" \
--env LISTEN="0.0.0.0:&amp;lt;custom_port_for_pg_agent&amp;gt;" \
-p &amp;lt;custom_port_for_pg_agent&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt; \
ghcr.io/coroot/coroot-pg-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also pass the scrape interval using &lt;code&gt;--env PG_SCRAPE_INTERVAL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After executing the above command, we see output like the following (custom_port_for_pg_agent is 3000 here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I0823 21:00:58.259629       1 main.go:35] static labels: map[]
I0823 21:00:58.273610       1 main.go:41] listening on: 0.0.0.0:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Configure Prometheus through prometheus.yml:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval: 5m
  scrape_timeout: 3m
  evaluation_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: coroot-pg-agent
    static_configs:
      - targets: ["&amp;lt;localhost-ip&amp;gt;:&amp;lt;custom_port_for_pg_agent&amp;gt;"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can always change the scrape interval and timeouts as per our needs; I was testing locally, hence kept them like this.&lt;/p&gt;

&lt;p&gt;Keep in mind while editing the above YAML that scrape_interval should always be greater than scrape_timeout.&lt;/p&gt;

&lt;p&gt;To run Prometheus using Docker we can use the following command, using the official image from &lt;a href="https://hub.docker.com/r/prom/prometheus" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run \
    -p 9090:9090 \
    -v ~/pro/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After executing the above command, you will see output like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ts=2022-08-23T21:02:26.203Z caller=main.go:495 level=info msg="No time or size retention was set so using the default time retention" duration=15d
ts=2022-08-23T21:02:26.203Z caller=main.go:539 level=info msg="Starting Prometheus Server" mode=server version="(version=2.38.0, branch=HEAD, revision=818d6e60888b2a3ea363aee8a9828c7bafd73699)"
ts=2022-08-23T21:02:26.203Z caller=main.go:544 level=info build_context="(go=go1.18.5, user=root@e6b781f65453, date=20220816-13:29:14)"
ts=2022-08-23T21:02:26.204Z caller=main.go:545 level=info host_details="(Linux 5.10.47-linuxkit #1 SMP PREEMPT Sat Jul 3 21:50:16 UTC 2021 aarch64 87decec12cad (none))"
ts=2022-08-23T21:02:26.204Z caller=main.go:546 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-08-23T21:02:26.204Z caller=main.go:547 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-08-23T21:02:26.205Z caller=web.go:553 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-08-23T21:02:26.205Z caller=main.go:976 level=info msg="Starting TSDB ..."
ts=2022-08-23T21:02:26.206Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2022-08-23T21:02:26.207Z caller=head.go:495 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2022-08-23T21:02:26.207Z caller=head.go:538 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=10.125µs
ts=2022-08-23T21:02:26.207Z caller=head.go:544 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2022-08-23T21:02:26.207Z caller=head.go:615 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
ts=2022-08-23T21:02:26.207Z caller=head.go:621 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=21.416µs wal_replay_duration=117.958µs total_replay_duration=159.167µs
ts=2022-08-23T21:02:26.208Z caller=main.go:997 level=info fs_type=EXT4_SUPER_MAGIC
ts=2022-08-23T21:02:26.208Z caller=main.go:1000 level=info msg="TSDB started"
ts=2022-08-23T21:02:26.208Z caller=main.go:1181 level=info msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2022-08-23T21:02:26.210Z caller=main.go:1218 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=2.047292ms db_storage=750ns remote_storage=1.709µs web_handler=292ns query_engine=583ns scrape=341.75µs scrape_sd=16.708µs notify=542ns notify_sd=792ns rules=1µs tracing=9.625µs
ts=2022-08-23T21:02:26.210Z caller=main.go:961 level=info msg="Server is ready to receive web requests."
ts=2022-08-23T21:02:26.210Z caller=manager.go:941 level=info component="rule manager" msg="Starting rule manager..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can open &lt;a href="http://localhost:9090" rel="noopener noreferrer"&gt;localhost:9090&lt;/a&gt; in the browser and see a screen like the following&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhx0ostww5ry85asdlvt.png" alt="landing page for prometheus" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, to see the targets we can visit &lt;a href="http://localhost:9090/targets?search=" rel="noopener noreferrer"&gt;Targets&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dyj5z3x516h4gpoggtx.png" alt="Targets from prometheus" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once targets are up, we can see their statuses change as follows&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrkj4kvhyi993jr0q8eh.png" alt="Targets updated statuses" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visit the &lt;a href="http://localhost:9090/graph?g0.expr=&amp;amp;g0.tab=1&amp;amp;g0.stacked=0&amp;amp;g0.show_exemplars=0&amp;amp;g0.range_input=1h" rel="noopener noreferrer"&gt;graph&lt;/a&gt; page, and if we click the search bar (I've kept auto-suggestions on), we see something like the following.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firlygs8jc4nqsk0uc0jy.png" alt="Shows graphs page" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>prometheusexporter</category>
      <category>corootpgagent</category>
      <category>postgresmonitoring</category>
    </item>
  </channel>
</rss>
