<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pranay Batta</title>
    <description>The latest articles on Forem by Pranay Batta (@pranay_batta).</description>
    <link>https://forem.com/pranay_batta</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3652594%2F9d2926ca-eede-4542-b782-4feb2ced66f1.jpg</url>
      <title>Forem: Pranay Batta</title>
      <link>https://forem.com/pranay_batta</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pranay_batta"/>
    <language>en</language>
    <item>
      <title>Best AI Gateway to Route Codex CLI to Any Model</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:05:16 +0000</pubDate>
      <link>https://forem.com/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</link>
      <guid>https://forem.com/pranay_batta/best-ai-gateway-to-route-codex-cli-to-any-model-4640</guid>
      <description>&lt;p&gt;Codex CLI is OpenAI's terminal-based coding agent that runs entirely in your shell. It reads your codebase, proposes changes, runs commands, and writes code. Solid tool. One problem: it only talks to OpenAI by default.&lt;/p&gt;

&lt;p&gt;I wanted to route Codex CLI through an AI gateway so I could use Claude Sonnet, Gemini 2.5 Pro, Mistral, and others without switching tools. I tested a few options. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; worked best. Open-source, written in Go, 11 microsecond overhead. Here is exactly how I set it up and what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Route Codex CLI Through an AI Gateway
&lt;/h2&gt;

&lt;p&gt;Codex CLI sends requests to OpenAI's API. That is fine until you need something else. Maybe Claude Sonnet handles your refactoring tasks better. Maybe Gemini's context window fits your monorepo. Maybe you want automatic failover when OpenAI rate limits you mid-session.&lt;/p&gt;

&lt;p&gt;An AI gateway sits between Codex CLI and your providers. It translates requests, routes traffic, and handles failures. You configure it once and Codex CLI does not know the difference.&lt;/p&gt;

&lt;p&gt;Without a gateway, your options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stick with OpenAI only (no routing, no failover, no cost tracking)&lt;/li&gt;
&lt;li&gt;Manually swap API keys and base URLs every time you want a different model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost for Codex CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an OpenAI-compatible endpoint. Codex CLI connects to it like it would connect to OpenAI directly. Full &lt;a href="https://docs.getbifrost.ai/cli-agents/codex-cli" rel="noopener noreferrer"&gt;Codex CLI integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; has the full walkthrough.&lt;/p&gt;

&lt;h3&gt;
  
  
  The OAuth Gotcha
&lt;/h3&gt;

&lt;p&gt;This one tripped me up. Codex CLI always prefers OAuth authentication over custom API keys. If you have previously logged in with OpenAI, Codex CLI will ignore your custom &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Run &lt;code&gt;/logout&lt;/code&gt; inside Codex CLI before configuring Bifrost. Without this step, your gateway config will be silently bypassed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure Codex CLI to Use Bifrost
&lt;/h3&gt;

&lt;p&gt;Set your environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add it to your &lt;code&gt;codex.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[auth]&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bifrost_virtual_key"&lt;/span&gt;

&lt;span class="nn"&gt;[network]&lt;/span&gt;
&lt;span class="py"&gt;openai_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://localhost:8080/openai/v1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; here is a Bifrost virtual key. Your actual provider keys live in the Bifrost config.&lt;/p&gt;

&lt;p&gt;Done. Every Codex CLI request now flows through Bifrost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing Codex CLI to Any Model
&lt;/h2&gt;

&lt;p&gt;This is the core use case. Configure multiple providers in Bifrost, and route Codex CLI traffic however you want. Bifrost uses the &lt;code&gt;provider/model-name&lt;/code&gt; format for cross-provider routing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-dev"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-sonnet-4-5-20250929"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini/gemini-2-5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;60% of requests go to Claude Sonnet. 25% to Gemini. 15% to GPT-4o. Weights are auto-normalised, so any numbers in the same ratio work.&lt;/p&gt;
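&lt;p&gt;The normalisation is plain arithmetic: each weight is divided by the total, so 60/25/15 behaves the same as 12/5/3. A quick sketch in Python (illustrative only, not Bifrost's internals):&lt;/p&gt;

```python
# Illustrative only: how weighted routing shares are normalised.
# Provider names mirror the config above; any numbers in the same
# ratio yield the same traffic split.
weights = {"claude-primary": 60, "gemini-secondary": 25, "openai-fallback": 15}

total = sum(weights.values())
shares = {provider: w / total for provider, w in weights.items()}

print(shares)
# {'claude-primary': 0.6, 'gemini-secondary': 0.25, 'openai-fallback': 0.15}
```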

&lt;p&gt;I ran this for a week. Claude Sonnet handled tool-heavy refactoring better. Gemini was faster on large context reads. GPT-4o was solid as a fallback. The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all configuration options.&lt;/p&gt;

&lt;p&gt;Other providers you can route to: Mistral, Groq, Cerebras, Cohere, Perplexity. All via the same &lt;code&gt;provider/model-name&lt;/code&gt; format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can You Use Codex CLI with Non-OpenAI Models?
&lt;/h3&gt;

&lt;p&gt;Yes. That is exactly what this setup does. Bifrost translates the OpenAI-format requests from Codex CLI into whatever format each provider expects. Codex CLI thinks it is talking to OpenAI. Bifrost handles the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical requirement:&lt;/strong&gt; non-OpenAI models must support tool use. Codex CLI relies on function calling for file operations, terminal commands, and code editing. If a model does not support tools, it will break on anything beyond simple chat.&lt;/p&gt;
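&lt;p&gt;Concretely, the gateway forwards requests in the standard OpenAI chat-completions tool format, which the target model must be able to handle. A sketch of such a payload (the &lt;code&gt;run_shell&lt;/code&gt; tool is a hypothetical example, not Codex CLI's actual schema):&lt;/p&gt;

```python
# A request body in the standard OpenAI tools format. Every model behind
# the gateway must support this shape; "run_shell" is a hypothetical tool
# used for illustration, not Codex CLI's real schema.
request_body = {
    "model": "anthropic/claude-sonnet-4-5-20250929",  # provider/model routing format
    "messages": [{"role": "user", "content": "List the files in src/"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "run_shell",
                "description": "Run a shell command and return its output",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }
    ],
}
```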

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Provider outages are inevitable. Bifrost sorts providers by weight and retries on failure. If Claude goes down, Gemini picks up. If Gemini fails too, traffic falls back to OpenAI. Your Codex CLI session is never interrupted.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover docs&lt;/a&gt; explain the retry logic in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison: AI Gateway Options for Codex CLI
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Direct API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Routing overhead&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;td&gt;~8 milliseconds&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget controls&lt;/td&gt;
&lt;td&gt;4-tier hierarchy&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex CLI compatible&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiteLLM works as a proxy for Codex CLI, but the Python runtime adds measurable latency. When every Codex CLI request goes through the gateway, those milliseconds compound. For a tool sitting in the critical path of your coding workflow, overhead matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Do You Route Codex CLI Through an AI Gateway?
&lt;/h3&gt;

&lt;p&gt;Three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start Bifrost (&lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;/logout&lt;/code&gt; in Codex CLI to clear OAuth&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to point at Bifrost&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is it. Configure your providers in the Bifrost config, and Codex CLI routes to any model you specify.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget and Observability
&lt;/h2&gt;

&lt;p&gt;Once all Codex CLI traffic flows through Bifrost, you get cost controls and logging for free. The four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; lets you cap spend at the virtual key, team, or provider level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codex-cli-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; logs every request: latency, tokens, cost, which provider handled it. When you are routing across three providers, this data tells you exactly where your money goes and which model performs best for your tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;Semantic caching&lt;/a&gt; also helps. Repeated or similar queries hit the cache instead of the provider. Cuts both cost and latency for common operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OAuth quirk is easy to miss.&lt;/strong&gt; If you skip the &lt;code&gt;/logout&lt;/code&gt; step, Codex CLI silently ignores your gateway config. There is no error. It just routes to OpenAI directly. I lost an hour to this before checking the docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use is non-negotiable.&lt;/strong&gt; Not every model supports function calling well enough for Codex CLI. Stick to models with solid tool use: Claude Sonnet, GPT-4o, Gemini 2.5 Pro. Smaller or older models may fail on file operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; You run and maintain the gateway. No managed cloud version for the open-source release. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; helps with access control, but ops is on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; One more process in the chain. The 11 microsecond overhead is negligible, but it is still something to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Logout from OpenAI OAuth in Codex CLI&lt;/span&gt;
&lt;span class="c"&gt;# Inside Codex CLI, run: /logout&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Codex CLI at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;bifrost_virtual_key
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai/v1

&lt;span class="c"&gt;# 4. Use Codex CLI normally - it routes through Bifrost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are using Codex CLI for real work, routing through an AI gateway gives you model flexibility, failover, and cost visibility that you cannot get from a single provider. I benchmarked the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;performance&lt;/a&gt; and the overhead is genuinely negligible.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>What is LLM Orchestration and How AI Gateways Enable It</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:37:58 +0000</pubDate>
      <link>https://forem.com/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</link>
      <guid>https://forem.com/pranay_batta/what-is-llm-orchestration-and-how-ai-gateways-enable-it-mm</guid>
      <description>&lt;p&gt;Most teams start with one LLM provider. Then they add a second for cost reasons. Then a third for latency. Six months in, they have a tangled mess of provider-specific SDKs, manual failover logic, and zero visibility into what anything costs. That mess is the problem LLM orchestration solves.&lt;/p&gt;

&lt;p&gt;I evaluated how teams handle multi-model routing at scale. Custom code, orchestration frameworks, AI gateways. Here is what works and what just adds overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is the practice of managing multiple LLM providers, models, and configurations through a unified control layer. Instead of hard-coding provider logic into your application, you route, balance, cache, and monitor all LLM traffic from one place.&lt;/p&gt;

&lt;p&gt;Think of it like a load balancer, but purpose-built for AI workloads. It handles which model gets which request, what happens when a provider goes down, how costs are tracked, and where the logs go.&lt;/p&gt;

&lt;p&gt;The core components of LLM orchestration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Routing&lt;/strong&gt; - Deciding which model handles each request based on weight, cost, or capability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt; - Automatically switching to a backup provider when one fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; - Distributing requests across providers to avoid rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost governance&lt;/strong&gt; - Enforcing budgets per team, project, or API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt; - Avoiding duplicate calls for identical or semantically similar prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - Tracking latency, tokens, costs, and errors across every request&lt;/li&gt;
&lt;/ul&gt;
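&lt;p&gt;The routing component boils down to a weighted choice across configured providers. A minimal sketch (my own illustration, not any gateway's actual implementation):&lt;/p&gt;

```python
import random

# Illustrative weighted routing: pick a provider in proportion to its weight.
providers = [("openai", 70), ("anthropic", 30)]

def route(rng: random.Random) -> str:
    names = [name for name, _ in providers]
    weights = [weight for _, weight in providers]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for a repeatable demonstration
picks = [route(rng) for _ in range(10_000)]
print(picks.count("openai") / len(picks))  # close to 0.70
```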

&lt;h2&gt;
  
  
  Why Do You Need LLM Orchestration?
&lt;/h2&gt;

&lt;p&gt;If you are calling one model from one provider, you do not. The moment any of these are true, you do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple models (GPT-4o, Claude, Gemini) serving different use cases&lt;/li&gt;
&lt;li&gt;Multiple teams sharing the same providers&lt;/li&gt;
&lt;li&gt;Budget limits per team or per project&lt;/li&gt;
&lt;li&gt;Uptime requirements that demand automatic failover&lt;/li&gt;
&lt;li&gt;Cost optimisation that requires routing cheaper queries to cheaper models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I measured what happens without orchestration. Teams I evaluated had 15-30% higher LLM costs from duplicate calls, no failover causing multi-minute outages during provider incidents, and zero per-team cost attribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does an AI Gateway Handle Orchestration?
&lt;/h2&gt;

&lt;p&gt;An AI gateway is the infrastructure layer that makes LLM orchestration practical. Without a gateway, you are building every orchestration component yourself. With one, you configure it.&lt;/p&gt;

&lt;p&gt;Here is the comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Orchestration Feature&lt;/th&gt;
&lt;th&gt;DIY (Custom Code)&lt;/th&gt;
&lt;th&gt;AI Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-model routing&lt;/td&gt;
&lt;td&gt;Custom SDK per provider, manual selection&lt;/td&gt;
&lt;td&gt;Config-based weighted routing, auto-normalised&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failover&lt;/td&gt;
&lt;td&gt;Try/catch with manual retry logic&lt;/td&gt;
&lt;td&gt;Automatic, sorted by weight, instant retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;Custom queue + rate tracking&lt;/td&gt;
&lt;td&gt;Built-in weighted distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost governance&lt;/td&gt;
&lt;td&gt;Manual token counting + billing integration&lt;/td&gt;
&lt;td&gt;Budget hierarchy with auto-enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caching&lt;/td&gt;
&lt;td&gt;Redis/Memcached with custom key logic&lt;/td&gt;
&lt;td&gt;Semantic + exact-match, built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Custom logging + dashboards&lt;/td&gt;
&lt;td&gt;Real-time streaming, filters, sub-millisecond&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Ongoing engineering effort&lt;/td&gt;
&lt;td&gt;Configuration changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The DIY approach works for prototypes. For production with multiple teams and providers, it becomes a full-time job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up LLM Orchestration with Bifrost
&lt;/h2&gt;

&lt;p&gt;I tested several gateways for model orchestration. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; stood out on performance: 11 microsecond overhead, 5000 RPS throughput, written in Go. That matters because your orchestration layer should not become a bottleneck.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weighted Routing Config
&lt;/h3&gt;

&lt;p&gt;This is where LLM orchestration starts. You define providers with weights, and Bifrost &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routes traffic accordingly&lt;/a&gt;. Weights are auto-normalised to sum to 1.0, and routing follows a deny-by-default model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of traffic goes to GPT-4o. 30% to Claude. If OpenAI goes down, Bifrost &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;automatically fails over&lt;/a&gt; to the next provider sorted by weight. No code changes. No redeployment.&lt;/p&gt;
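&lt;p&gt;The failover behaviour can be pictured as a loop over providers sorted by weight. A simplified sketch (my own illustration; a real gateway layers retries, timeouts, and health checks on top):&lt;/p&gt;

```python
# Illustrative failover: try providers in descending weight order until one
# succeeds. The provider list mirrors the config above; `send` stands in for
# a hypothetical transport function that raises on provider failure.
providers = [
    {"id": "openai-primary", "weight": 70},
    {"id": "anthropic-fallback", "weight": 30},
]

def call_with_failover(send):
    errors = {}
    for provider in sorted(providers, key=lambda p: p["weight"], reverse=True):
        try:
            return send(provider["id"])
        except Exception as exc:  # in practice: only retryable errors
            errors[provider["id"]] = exc
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_send(provider_id):
    # Simulate the primary provider being down.
    if provider_id == "openai-primary":
        raise ConnectionError("openai down")
    return f"ok from {provider_id}"

print(call_with_failover(flaky_send))  # ok from anthropic-fallback
```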

&lt;h3&gt;
  
  
  Budget Governance
&lt;/h3&gt;

&lt;p&gt;Bifrost uses a &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config. Each tier can have independent spend limits and rate limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team-key"&lt;/span&gt;
    &lt;span class="na"&gt;budget&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_spend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
    &lt;span class="na"&gt;rate_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
      &lt;span class="na"&gt;reset_duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a team hits its budget, requests are denied. Not throttled. Denied. That is real &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;cost governance&lt;/a&gt;, not just monitoring.&lt;/p&gt;
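&lt;p&gt;The enforcement is a hard admission check, not a soft alert. A toy sketch of the idea (not Bifrost's code):&lt;/p&gt;

```python
# Toy hard-budget check: deny (not throttle) once the cap would be exceeded.
budget = {"max_spend": 500.0, "spent": 0.0}

def admit(request_cost: float) -> bool:
    """Allow the request only if it fits under the remaining budget."""
    if budget["spent"] + request_cost > budget["max_spend"]:
        return False
    budget["spent"] += request_cost
    return True

first = admit(499.0)   # True: fits under the 500 cap
second = admit(2.0)    # False: would push spend to 501
```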

&lt;h3&gt;
  
  
  Semantic Caching
&lt;/h3&gt;

&lt;p&gt;Bifrost runs &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt;: exact hash matching for identical prompts, plus semantic similarity for prompts that mean the same thing but are worded differently. Both layers reduce redundant API calls without any application code changes.&lt;/p&gt;
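&lt;p&gt;The exact-match layer is essentially a hash of the normalised request. A minimal sketch (my illustration; the semantic layer, which uses embedding similarity, is not shown):&lt;/p&gt;

```python
import hashlib
import json

# Minimal exact-match cache layer: identical model + messages hash to the
# same key. Rephrased prompts miss here and fall through to the semantic layer.
cache = {}

def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

msgs = [{"role": "user", "content": "What is a goroutine?"}]
cache[cache_key("gpt-4o", msgs)] = "cached answer"

hit = cache.get(cache_key("gpt-4o", msgs))
miss = cache.get(cache_key("gpt-4o", [{"role": "user", "content": "Explain goroutines"}]))
print(hit, miss)  # cached answer None
```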

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Every request is logged with &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;less than 0.1ms overhead&lt;/a&gt;. 14+ filters for slicing data. WebSocket-based live streaming so you can watch requests in real time. No separate logging pipeline needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What About MCP Workloads?
&lt;/h2&gt;

&lt;p&gt;If you are running MCP servers with 500+ tools, orchestration gets expensive fast. Every tool definition eats tokens. Bifrost's &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP Code Mode&lt;/a&gt; achieves 92% token reduction by encoding tool schemas efficiently. That is a direct cost saving on top of the orchestration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is where to be careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it in your infrastructure. If you want a fully managed SaaS gateway, this is not it. For teams with compliance requirements, self-hosted is actually a benefit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go-based, not Python.&lt;/strong&gt; If your team needs to extend gateway logic in Python, the codebase will be unfamiliar. The upside is the 11 microsecond latency that Python gateways cannot match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration over code.&lt;/strong&gt; Bifrost favours YAML/UI config over programmatic SDKs. If you need deeply custom routing logic (like routing based on prompt content analysis), you will need to handle that at the application layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A gateway is not always warranted.&lt;/strong&gt; For simple single-provider setups, a gateway is overkill. If you are only using OpenAI and do not need failover or budgets, just call the API directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between LLM orchestration and LLM routing?
&lt;/h3&gt;

&lt;p&gt;LLM routing is one component of orchestration. Routing decides which model handles a request. Orchestration includes routing plus failover, caching, budgets, load balancing, and observability. Multi-model routing is necessary but not sufficient for production AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I do LLM orchestration without a gateway?
&lt;/h3&gt;

&lt;p&gt;Technically, yes. You can build routing, failover, caching, and observability yourself. Practically, I have seen teams spend 2-3 engineering months building what a gateway provides out of the box. And then they still need to maintain it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does an AI gateway compare to LangChain for orchestration?
&lt;/h3&gt;

&lt;p&gt;LangChain is a framework for building LLM applications. An AI gateway is infrastructure for managing LLM traffic. They solve different problems. You can use both: LangChain for application logic, and a gateway like &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; for orchestration underneath. Bifrost is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for OpenAI's API format, so integration is straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;LLM orchestration is not optional once you are running multiple models in production. The question is whether you build it or use a gateway. I have tested both paths. The gateway approach - specifically Bifrost at &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;11 microsecond overhead&lt;/a&gt; - saves engineering time and gives you better observability from day one.&lt;/p&gt;

&lt;p&gt;Star the repo: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;https://git.new/bifrost&lt;/a&gt;&lt;br&gt;
Docs: &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;https://getmax.im/bifrostdocs&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>beginners</category>
      <category>devops</category>
    </item>
    <item>
      <title>MCP at Scale: Access Control, Cost Governance, and 92% Lower Token Costs</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:44:38 +0000</pubDate>
      <link>https://forem.com/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</link>
      <guid>https://forem.com/pranay_batta/mcp-at-scale-access-control-cost-governance-and-92-lower-token-costs-50jf</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Tax on Every MCP Request
&lt;/h2&gt;

&lt;p&gt;Here is something nobody talks about when they demo MCP integrations: token costs at scale.&lt;/p&gt;

&lt;p&gt;I have been running MCP setups with increasing numbers of connected servers. The pattern is always the same. You connect a few servers, everything works brilliantly. You connect a dozen, costs start climbing. You connect sixteen servers with 500+ tools, and suddenly your token budget is gone before the model even starts thinking about your actual query.&lt;/p&gt;

&lt;p&gt;Why? Every tool definition from every connected server gets injected into the model's context on every single request. 150+ tool definitions can consume the majority of your token budget. And there is zero access control. Any consumer can call any tool. No cost tracking at tool level.&lt;/p&gt;

&lt;p&gt;This is unsustainable for production deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Tested Bifrost's Code Mode Approach
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a fundamentally different approach to this problem. Instead of dumping all tool definitions into the context window, it exposes a virtual filesystem of Python stub files. The model discovers tools on-demand through four meta-tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;listToolFiles&lt;/code&gt; - discover available servers and tools&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;readToolFile&lt;/code&gt; - load specific function signatures&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;getToolDocs&lt;/code&gt; - fetch detailed documentation only when needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;executeToolCode&lt;/code&gt; - run scripts in a sandboxed Starlark interpreter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the model only loads what it actually needs for the current query. If you ask it to read a file, it does not need to know about your Slack, GitHub, Jira, and database tools all at once.&lt;/p&gt;

&lt;p&gt;Here is what a typical tool discovery flow looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Model calls listToolFiles to see available servers
&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;listToolFiles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Returns: ["filesystem/", "github/", "slack/", "jira/", ...]
&lt;/span&gt;
&lt;span class="c1"&gt;# Model identifies it needs filesystem tools for this query
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;readToolFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Returns only the function signature for filesystem_read
&lt;/span&gt;
&lt;span class="c1"&gt;# Model fetches docs only if needed
&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getToolDocs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Executes with full sandboxing
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;executeToolCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filesystem/read.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/src/main.go&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is lazy loading for LLM tool contexts. Simple idea. Massive impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results: 3 Controlled Rounds
&lt;/h2&gt;

&lt;p&gt;I ran three controlled rounds, scaling from 6 servers to 16 servers. Every round maintained a 100% task pass rate. The model completed every task correctly while using dramatically fewer tokens.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Round&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Servers&lt;/th&gt;
&lt;th&gt;Token Reduction&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;58.2%&lt;/td&gt;
&lt;td&gt;55.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;251&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;508&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;92.8%&lt;/td&gt;
&lt;td&gt;92.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At roughly 500 tools, Code Mode reduces per-query token usage by about 14x. From 1.15M tokens down to 83K. That is not an incremental improvement. That is a different cost structure entirely.&lt;/p&gt;
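
&lt;p&gt;The headline numbers are easy to check from the round-3 figures above:&lt;/p&gt;

```python
# Round 3 figures from the table above: ~1.15M tokens per query with
# traditional tool injection vs ~83K with Code Mode.
baseline_tokens = 1_150_000
code_mode_tokens = 83_000

reduction_factor = baseline_tokens / code_mode_tokens
percent_saved = (1 - code_mode_tokens / baseline_tokens) * 100

print(round(reduction_factor, 1))  # 13.9 -> the ~14x figure
print(round(percent_saved, 1))     # 92.8 -> the 92.8% token reduction
```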

&lt;p&gt;The savings grow non-linearly. As you add more tools, the percentage saved increases because Code Mode's meta-tool overhead stays roughly constant while traditional injection scales linearly with tool count.&lt;/p&gt;

&lt;p&gt;For full benchmark methodology, check the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking docs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Access Control That Actually Works
&lt;/h2&gt;

&lt;p&gt;Token savings are great, but production MCP deployments need &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt;. Bifrost handles this through two mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Keys&lt;/strong&gt; let you create scoped credentials per user, team, or customer. You can scope at the tool level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team-key"&lt;/span&gt;
  &lt;span class="na"&gt;allowed_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_read&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_query&lt;/span&gt;
  &lt;span class="na"&gt;blocked_tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;database_delete&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;filesystem_write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Allow &lt;code&gt;filesystem_read&lt;/code&gt;, block &lt;code&gt;filesystem_write&lt;/code&gt;. Allow &lt;code&gt;database_query&lt;/code&gt;, block &lt;code&gt;database_delete&lt;/code&gt;. Fine-grained, declarative, no code changes needed.&lt;/p&gt;
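
&lt;p&gt;The resolution logic behind a scoped key can be sketched in a few lines. This is illustrative, not Bifrost's implementation; in particular, "blocked entries win" and "an allow list is exhaustive" are assumptions about the semantics:&lt;/p&gt;

```python
def tool_permitted(tool, allowed=None, blocked=()):
    """Blocked entries always win; an allow list, when set, is exhaustive."""
    if tool in blocked:
        return False
    if allowed is not None:
        return tool in allowed
    return True

# Mirrors the data-team-key config above
allowed = ["database_read", "database_query"]
blocked = ["database_delete", "filesystem_write"]

print(tool_permitted("database_query", allowed, blocked))   # True
print(tool_permitted("database_delete", allowed, blocked))  # False
print(tool_permitted("filesystem_read", allowed, blocked))  # False: not on the allow list
```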

&lt;p&gt;&lt;strong&gt;MCP Tool Groups&lt;/strong&gt; are named collections of tools from multiple servers. You create a group, attach it to keys, teams, or users. No database queries at resolve time. This is important when you are running at &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;5000 RPS&lt;/a&gt; and cannot afford lookup latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-Tool Observability
&lt;/h2&gt;

&lt;p&gt;Every tool execution gets logged with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool name and server source&lt;/li&gt;
&lt;li&gt;Arguments passed and results returned&lt;/li&gt;
&lt;li&gt;Execution latency&lt;/li&gt;
&lt;li&gt;Virtual key that initiated the call&lt;/li&gt;
&lt;li&gt;Parent LLM request context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost at the tool level&lt;/a&gt; alongside LLM token costs. This matters when your finance team asks why the AI bill doubled last month. You can point to exactly which tools, which teams, and which queries drove the spend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Budget and limits&lt;/a&gt; let you set spending caps per virtual key, so no single team can blow through the monthly allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Flexibility
&lt;/h2&gt;

&lt;p&gt;Bifrost supports four &lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP connection types&lt;/a&gt;: STDIO, HTTP, SSE, and in-process via the Go SDK. OAuth 2.0 with PKCE and automatic token refresh is built in. Health monitoring with automatic reconnects keeps things running without manual intervention.&lt;/p&gt;

&lt;p&gt;You can run it in manual approval mode where a human reviews tool calls, or in autonomous agent loop mode where the model chains tool calls independently.&lt;/p&gt;

&lt;p&gt;For Claude Code and Cursor users, the &lt;code&gt;/mcp&lt;/code&gt; endpoint integrates directly. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup takes minutes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I noticed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning curve for Code Mode.&lt;/strong&gt; The virtual filesystem abstraction is elegant, but it is a new mental model. Teams used to traditional MCP tool injection will need to understand why their tools are now "files" the model reads on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meta-tool overhead on simple queries.&lt;/strong&gt; If you only have 10-20 tools, the overhead of the four meta-tools (listToolFiles, readToolFile, etc.) might not save you much. The real wins kick in above 50-100 tools. Below that threshold, traditional mode works fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Starlark sandbox limitations.&lt;/strong&gt; The sandboxed Starlark interpreter is secure by design, but it means tool code runs in a restricted environment. Complex tool implementations may need adjustments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency on gateway availability.&lt;/strong&gt; Adding a gateway layer means one more component to monitor. Bifrost's 11-microsecond overhead and Go-based architecture make the latency cost negligible in practice, but it is still an additional piece of infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Care
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 50 MCP tools, you probably do not need Code Mode yet. Traditional tool injection works fine at that scale.&lt;/p&gt;

&lt;p&gt;If you are running 100+ tools across multiple servers, or if you need per-team access control, or if your CFO is asking questions about AI infrastructure costs, this is worth evaluating.&lt;/p&gt;

&lt;p&gt;The 92% cost reduction at 500+ tools is the headline number, but the governance features (virtual keys, tool groups, audit logging) are what make it production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Bifrost is open-source and written in Go.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; - star it if this is useful&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/mcp/overview" rel="noopener noreferrer"&gt;MCP documentation&lt;/a&gt; - full setup guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;Governance docs&lt;/a&gt; - virtual keys, tool groups, budgets&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt; - up and running in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have been testing a lot of MCP tooling lately. Bifrost's approach to the context window problem is the most practical solution I have seen. The lazy loading pattern for tool definitions should honestly be how all MCP gateways work.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; and give it a spin. Happy to discuss benchmarks or setup in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Track LLM Costs and Rate Limits on AWS Bedrock with an AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 13 Apr 2026 10:21:38 +0000</pubDate>
      <link>https://forem.com/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</link>
      <guid>https://forem.com/pranay_batta/how-to-track-llm-costs-and-rate-limits-on-aws-bedrock-with-an-ai-gateway-5alh</guid>
      <description>&lt;p&gt;Running LLM workloads on AWS is easy. Knowing what they cost is not. You spin up Bedrock, call Claude or Mistral a few thousand times, and the bill shows up three days later as a single line item. No breakdown by team. No per-model cost tracking. No rate limits unless you build them yourself.&lt;/p&gt;

&lt;p&gt;I spent the last two weeks evaluating how teams can get proper cost governance over LLM usage on AWS. Native tools, third-party gateways, open-source options. Here is what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with AWS Native Cost Tracking
&lt;/h2&gt;

&lt;p&gt;AWS gives you CloudWatch and Cost Explorer. Both are built for general AWS resource monitoring. They work fine for EC2, Lambda, S3. For LLM workloads on Bedrock, they fall short.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get from CloudWatch + Cost Explorer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate Bedrock spend per region&lt;/li&gt;
&lt;li&gt;Invocation counts at the service level&lt;/li&gt;
&lt;li&gt;Basic alarms on total spend thresholds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What you do not get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-model token-level cost breakdowns&lt;/li&gt;
&lt;li&gt;Team or project-level budget enforcement&lt;/li&gt;
&lt;li&gt;Rate limiting by user, team, or API key&lt;/li&gt;
&lt;li&gt;Real-time cost tracking per request&lt;/li&gt;
&lt;li&gt;Automatic routing away from providers that exceed limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are running one model for one team, native tools are fine. The moment you have multiple teams, multiple models, or need to enforce granular budgets, you are building custom infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway Approach
&lt;/h2&gt;

&lt;p&gt;An LLM gateway sits between your application and Bedrock. Every request passes through it. That gives you a single place to track costs, enforce rate limits, and control routing.&lt;/p&gt;

&lt;p&gt;I tested three approaches:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Native (CloudWatch + Cost Explorer)&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM-specific cost tracking&lt;/td&gt;
&lt;td&gt;Aggregate only&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;td&gt;Per-request, per-model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget hierarchy&lt;/td&gt;
&lt;td&gt;Account-level billing alerts&lt;/td&gt;
&lt;td&gt;Basic budget controls&lt;/td&gt;
&lt;td&gt;4-tier: Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting&lt;/td&gt;
&lt;td&gt;No native LLM rate limits&lt;/td&gt;
&lt;td&gt;Basic rate limiting&lt;/td&gt;
&lt;td&gt;VK + Provider Config level, token and request limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reset durations&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Limited options&lt;/td&gt;
&lt;td&gt;1m, 5m, 1h, 1d, 1w, 1M, 1Y (calendar-aligned UTC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock support&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (provider type "bedrock")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~8ms (Python)&lt;/td&gt;
&lt;td&gt;11 microseconds (Go)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Self-hosted or cloud&lt;/td&gt;
&lt;td&gt;Self-hosted (runs in your VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers tell the story. For teams that need real LLM cost governance on AWS, a dedicated gateway is the right call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Bifrost with AWS Bedrock
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; runs in your VPC alongside Bedrock. No data leaves your infrastructure. That matters for teams with compliance requirements.&lt;/p&gt;

&lt;p&gt;Start the gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Configure Bedrock as a provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-mistral"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
        &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral.mistral-large-2407-v1:0"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weighted routing across models. 80% of requests go to Claude Sonnet on Bedrock, 20% to Mistral. Both running through your AWS account. The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; cover all Bedrock model formats and region options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four-Tier Budget Hierarchy
&lt;/h2&gt;

&lt;p&gt;This is where Bifrost separates itself from everything else I tested. The &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget system&lt;/a&gt; has four levels: Customer, Team, Virtual Key, and Provider Config. All four must pass for a request to go through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;customer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acme-corp"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;

  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;team_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml-engineering"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1w"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1M"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Customer gets $5,000/month. ML Engineering team gets $2,000 of that. The staging key is capped at $500/week. And the Bedrock Claude provider itself is capped at $1,000/month. If any tier hits its limit, the request is blocked.&lt;/p&gt;
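
&lt;p&gt;The gating rule is a simple conjunction: a request passes only if it fits under every tier's remaining budget. A minimal sketch of that logic, using the dollar limits from the config above (illustrative, not Bifrost's code):&lt;/p&gt;

```python
TIERS = ("customer", "team", "virtual_key", "provider_config")

def request_allowed(spend, limits, cost):
    """A request passes only if it fits under every tier's remaining budget."""
    return all(spend[t] + cost <= limits[t] for t in TIERS)

limits = {"customer": 5000, "team": 2000, "virtual_key": 500, "provider_config": 1000}
spend  = {"customer": 4200, "team": 1999.5, "virtual_key": 120, "provider_config": 800}

print(request_allowed(spend, limits, cost=0.40))  # True: every tier still fits
print(request_allowed(spend, limits, cost=0.60))  # False: the team tier would pass $2,000
```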

&lt;p&gt;Cost is calculated from provider pricing, token usage, request type, cache status, and batch operations. Not estimated. Calculated from actual usage data.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; have the full breakdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rate Limiting That Actually Works for LLMs
&lt;/h2&gt;

&lt;p&gt;AWS does not give you LLM-specific rate limits. Bedrock has service quotas, but those are blunt instruments. You cannot limit a specific team to 100 requests per minute or cap token consumption per API key.&lt;/p&gt;

&lt;p&gt;Bifrost handles rate limiting at two levels: Virtual Key and Provider Config. You can set both request limits (calls per duration) and token limits (tokens per duration).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rate_limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;virtual_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging-key"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
      &lt;span class="na"&gt;tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;

  &lt;span class="na"&gt;provider_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
        &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reset durations: 1m, 5m, 1h, 1d, 1w, 1M, 1Y. The daily, weekly, monthly, and yearly resets are calendar-aligned in UTC. So "1d" resets at midnight UTC, not 24 hours from first request.&lt;/p&gt;
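
&lt;p&gt;Calendar alignment is easy to illustrate. For a &lt;code&gt;1d&lt;/code&gt; window, the reset boundary is the next midnight UTC, which usually arrives sooner than a rolling 24-hour window would:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def next_daily_reset(now):
    """Calendar-aligned '1d' reset: the next midnight UTC."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)

now = datetime(2026, 4, 13, 18, 30, tzinfo=timezone.utc)
print(next_daily_reset(now))     # midnight UTC on Apr 14, only 5.5 hours away
print(now + timedelta(days=1))   # a rolling window would not reset until 18:30 the next day
```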

&lt;p&gt;Here is the clever part: if a provider config exceeds its rate limit, that provider gets excluded from &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. But other providers in the account remain available. Traffic shifts automatically. No downtime, no manual intervention.&lt;/p&gt;
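
&lt;p&gt;Conceptually, exclusion just removes the rate-limited provider from the weighted pool before a route is chosen. A sketch of that behavior against the 80/20 Bedrock config from earlier (illustrative, not Bifrost's routing code):&lt;/p&gt;

```python
import random

def pick_provider(providers, rate_limited):
    """Weighted choice among providers that are not currently rate-limited."""
    pool = [p for p in providers if p["id"] not in rate_limited]
    if not pool:
        raise RuntimeError("no providers available")
    return random.choices(pool, weights=[p["weight"] for p in pool], k=1)[0]

providers = [{"id": "bedrock-claude", "weight": 80},
             {"id": "bedrock-mistral", "weight": 20}]

# bedrock-claude hits its rate limit: all traffic shifts to Mistral automatically
print(pick_provider(providers, rate_limited={"bedrock-claude"})["id"])  # bedrock-mistral
```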

&lt;h2&gt;
  
  
  Observability at Sub-Millisecond Overhead
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost is captured: tokens used, latency, cost, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; adds less than 0.1ms of overhead. Storage backend is SQLite or PostgreSQL.&lt;/p&gt;

&lt;p&gt;What makes this useful for AWS teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;14+ API filter options&lt;/strong&gt; for querying logs. Filter by model, provider, team, cost range, status code, time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket live updates.&lt;/strong&gt; Watch requests flow through in real time. Useful during load testing or incident debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single pane across providers.&lt;/strong&gt; If you are running Bedrock plus OpenAI or Gemini as &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt;, all logs are in one place.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compare that to checking CloudWatch for Bedrock, then the OpenAI dashboard for your fallback, then manually correlating timestamps. The centralised view saves real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool solves everything. Here is what to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is self-hosted only.&lt;/strong&gt; You run it, you maintain it. For teams already on AWS with VPC infrastructure, this is straightforward. For smaller teams without DevOps, it is extra work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM has broader provider coverage.&lt;/strong&gt; 100+ providers out of the box. If you need niche providers, LiteLLM may have them. Bifrost focuses on major providers but adds the Go performance advantage and deeper governance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS native tools have zero overhead.&lt;/strong&gt; If all you need is aggregate cost visibility and basic billing alerts, CloudWatch is already there. No extra infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go vs Python matters at scale.&lt;/strong&gt; Bifrost's 11-microsecond overhead versus LiteLLM's ~8ms becomes significant when you are processing thousands of requests per minute. At low volume, both are fine. At scale, the difference compounds. The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt; back this up: 5,000 RPS on a single instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost is a newer project.&lt;/strong&gt; The community is growing but smaller than LiteLLM's. Documentation is solid. Edge cases may require checking GitHub issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Stick with AWS native tools if:&lt;/strong&gt; You have one team, one model, and just need billing alerts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider LiteLLM if:&lt;/strong&gt; You need maximum provider coverage and are comfortable with Python-based overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Bifrost if:&lt;/strong&gt; You need granular cost governance, multi-tier budgets, LLM-specific rate limiting, and minimal latency on AWS. Especially if you are already running in a VPC and want &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; and automatic failover alongside cost controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost in your VPC&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure Bedrock providers in bifrost.yaml&lt;/span&gt;

&lt;span class="c"&gt;# 3. Set budget and rate limit tiers&lt;/span&gt;

&lt;span class="c"&gt;# 4. Point your application at the gateway&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every Bedrock request now has cost tracking, rate limiting, and observability built in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS makes it easy to run LLM workloads. It does not make it easy to govern them. If your team is scaling Bedrock usage and needs real cost controls, a dedicated LLM gateway fills the gap that CloudWatch and Cost Explorer leave open.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you want to dig into the source.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>Best Claude Code Gateway for Multi-Model Routing</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:29:45 +0000</pubDate>
      <link>https://forem.com/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</link>
      <guid>https://forem.com/pranay_batta/best-claude-code-gateway-for-multi-model-routing-24mn</guid>
      <description>&lt;p&gt;Claude Code is great until you need more than one model. You hit a rate limit on Anthropic, want Gemini for long context, or need GPT-4o for a specific task. The default setup gives you no way to route across providers.&lt;/p&gt;

&lt;p&gt;I spent a week testing gateways that sit between Claude Code and LLM providers. The goal was simple: configure multiple models, set routing weights, get automatic failover, and keep Claude Code working normally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; was the clear winner. Open-source, written in Go, 11 microsecond overhead per request. Here is how I set up multi-model routing and what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Multi-Model Routing Matters
&lt;/h2&gt;

&lt;p&gt;Different models are good at different things. Claude Sonnet handles tool use well. GPT-4o is strong at certain code generation tasks. Gemini 2.5 Pro handles massive context windows. Using one model for everything means you are leaving performance on the table.&lt;/p&gt;

&lt;p&gt;Multi-model routing lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split traffic across providers by weight&lt;/li&gt;
&lt;li&gt;Fail over automatically when a provider goes down&lt;/li&gt;
&lt;li&gt;Pin specific models for specific tasks&lt;/li&gt;
&lt;li&gt;Control costs by routing cheaper models for simpler operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem: Claude Code talks to &lt;code&gt;api.anthropic.com&lt;/code&gt; by default. No native multi-model support. You need a gateway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: Bifrost as a Claude Code Gateway
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint. Claude Code does not know a gateway exists. It sends standard requests, and Bifrost translates and routes them to whatever provider you configure.&lt;/p&gt;

&lt;p&gt;Full &lt;a href="https://docs.getbifrost.ai/cli-agents/claude-code" rel="noopener noreferrer"&gt;Claude Code integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install and Connect
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;Setup guide&lt;/a&gt; has the details.&lt;/p&gt;

&lt;p&gt;Point Claude Code at Bifrost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; here is a Bifrost virtual key, not your actual Anthropic key. Provider keys live in the Bifrost config. This is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the Anthropic API.&lt;/p&gt;

&lt;p&gt;Done. Every Claude Code request now flows through Bifrost.&lt;/p&gt;
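&lt;p&gt;Before starting a session, you can smoke-test the gateway with a direct request. This sketch assumes Bifrost mirrors the standard Anthropic Messages path (&lt;code&gt;/v1/messages&lt;/code&gt;) under the &lt;code&gt;/anthropic&lt;/code&gt; prefix, which is what a drop-in replacement implies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Send one Messages API request through the gateway&lt;/span&gt;
curl http://localhost:8080/anthropic/v1/messages &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-api-key: your-bifrost-virtual-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"anthropic-version: 2023-06-01"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"content-type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "claude-sonnet-4-20250514", "max_tokens": 64, "messages": [{"role": "user", "content": "ping"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If this returns a JSON message, Claude Code will work too, since it sends the same request shape.&lt;/p&gt;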

&lt;h2&gt;
  
  
  Weighted Routing Configuration
&lt;/h2&gt;

&lt;p&gt;This is the core of multi-model routing. You assign weights to providers, and Bifrost distributes traffic accordingly. Weights are automatically normalized to sum to 1.0, so you can use any positive numbers.&lt;/p&gt;

&lt;p&gt;Here is a config that splits traffic between GPT-4o and Claude Sonnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;70% of requests go to GPT-4o. 30% to Claude Sonnet. I used this to compare output quality across providers in real coding sessions without manually switching anything.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing docs&lt;/a&gt; cover all the configuration options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important detail:&lt;/strong&gt; cross-provider routing does not happen automatically. You must explicitly configure each provider in your config. Bifrost does not guess or infer routing rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;Weighted routing is useful. Automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; is essential. Providers go down. Rate limits hit. You do not want your Claude Code session to break mid-task.&lt;/p&gt;

&lt;p&gt;Bifrost sorts providers by weight and retries on failure. If the primary provider fails, the next one picks up the request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-team"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenAI goes down, Bifrost retries with Gemini. If Gemini also fails, it falls back to Anthropic. My coding session is never interrupted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Pinning for Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;If your team uses AWS Bedrock or Google Vertex AI, you can pin specific models directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bedrock&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"bedrock/global.anthropic.claude-sonnet-4-6"&lt;/span&gt;

&lt;span class="c"&gt;# Vertex AI&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_DEFAULT_SONNET_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vertex/claude-sonnet-4-6"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also override the model mid-session using the &lt;code&gt;--model&lt;/code&gt; flag or the &lt;code&gt;/model&lt;/code&gt; command inside Claude Code. Useful when you want to switch between models for different parts of a task. Start with Sonnet for scaffolding, switch to GPT-4o for a tricky implementation, then back again. The gateway handles the translation layer for each provider.&lt;/p&gt;
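&lt;p&gt;A quick sketch of both override paths (the model names here are illustrative; use whatever identifiers your gateway config actually exposes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin a model for the whole session&lt;/span&gt;
claude &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o

&lt;span class="c"&gt;# Inside a running session, switch with the slash command&lt;/span&gt;
/model claude-sonnet-4-20250514
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;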

&lt;p&gt;This is one area where the &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK compatibility&lt;/a&gt; matters. Bifrost maintains full compatibility with the Anthropic message format, so model pinning and switching work without any client-side changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration docs&lt;/a&gt; list all supported providers and model formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Controls Across Providers
&lt;/h2&gt;

&lt;p&gt;Once all traffic flows through one gateway, cost management becomes straightforward. Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;virtual_key"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-code-dev"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set a limit. When it is reached, requests get blocked. No surprise bills from a runaway Claude Code session.&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; handles rate limiting, access control, and spend management across all configured providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Across All Providers
&lt;/h2&gt;

&lt;p&gt;Every request through Bifrost gets logged: latency, token count, cost, provider used, response status. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you a single view across all providers.&lt;/p&gt;

&lt;p&gt;This is particularly useful with multi-model routing. You can see exactly which provider handled each request, compare response times across models, and track per-provider costs. When I was running 70/30 weighted routing between GPT-4o and Claude Sonnet, the observability data showed me exactly how each model performed on real coding tasks. Response times, token consumption, and cost per request, all in one place.&lt;/p&gt;

&lt;p&gt;Without centralized logging, you are checking multiple provider dashboards and guessing which model handled what. That is not sustainable when you are running multiple providers through Claude Code daily.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter streaming limitation.&lt;/strong&gt; OpenRouter does not stream function call arguments properly. This causes file operation failures in Claude Code. If you use OpenRouter as a provider, expect issues with tool use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-Anthropic model requirements.&lt;/strong&gt; Any non-Anthropic model you route through must support tool use. Claude Code relies heavily on function calling. Models without proper tool support will fail on file operations, search, and other agent tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version requires you to run and maintain the gateway. There is no managed cloud offering. That means monitoring, updating, and debugging are on you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Newer project.&lt;/strong&gt; Bifrost's community is growing but still smaller than older alternatives. Documentation is solid, but edge cases may require digging through issues on GitHub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extra hop.&lt;/strong&gt; You are adding a process between Claude Code and your provider. The 11 microsecond overhead is negligible, but it is one more thing in the chain to keep running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran benchmarks matching the &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt;. The numbers held up: 11 microseconds of routing overhead, 5,000 requests per second on a single instance. The Go implementation makes a real difference. Python-based gateways I tested added significantly more latency.&lt;/p&gt;

&lt;p&gt;For a gateway that sits in the critical path of every LLM call, low overhead matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start Bifrost&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# 2. Configure providers in bifrost.yaml (weighted routing + failover)&lt;/span&gt;

&lt;span class="c"&gt;# 3. Point Claude Code at Bifrost&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-bifrost-virtual-key

&lt;span class="c"&gt;# 4. Use Claude Code normally&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. Your Claude Code session now routes across multiple models with automatic failover and budget controls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are running Claude Code for real work, multi-model routing is not optional. Single-provider setups break at the worst times. A gateway that handles routing, failover, and cost controls in one place saves hours of debugging and thousands in unexpected spend.&lt;/p&gt;

&lt;p&gt;Open an issue on the &lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;repo&lt;/a&gt; if you run into anything.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Best MCP Gateway for 50% Token Cost Savings</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Tue, 07 Apr 2026 08:32:48 +0000</pubDate>
      <link>https://forem.com/pranay_batta/best-mcp-gateway-for-50-token-cost-savings-4anm</link>
      <guid>https://forem.com/pranay_batta/best-mcp-gateway-for-50-token-cost-savings-4anm</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Classic MCP dumps 100+ tool definitions into every LLM call. Bifrost's Code Mode generates TypeScript declarations instead, cutting token usage by 50%+ and latency by 40-50%. If you are running 3 or more MCP servers, this is the single biggest cost lever you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Classic MCP
&lt;/h2&gt;

&lt;p&gt;I have been testing MCP setups for a few months now. The standard approach is simple. You connect your MCP servers, and every tool definition gets sent to the LLM as part of the context window. Every single call.&lt;/p&gt;

&lt;p&gt;With 3 MCP servers, you might have 30-40 tools. With 10 servers, easily 100+. Each tool definition includes the name, description, input schema, and parameter types. That is a lot of tokens. And you are paying for every single one of them on every request.&lt;/p&gt;

&lt;p&gt;The math is straightforward. If your average tool definition is 200 tokens, and you have 50 tools, that is 10,000 tokens of overhead per call. At scale, this adds up fast.&lt;/p&gt;
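&lt;p&gt;The arithmetic is easy to check, and worth extending to your own call volume (the 10,000 calls/day figure below is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 50 tools x 200 tokens each = schema overhead on every single call&lt;/span&gt;
echo &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;200 &lt;span class="o"&gt;*&lt;/span&gt; 50&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="s2"&gt; overhead tokens per call"&lt;/span&gt;

&lt;span class="c"&gt;# Scale by request volume to see what you are actually paying for&lt;/span&gt;
echo &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;200 &lt;span class="o"&gt;*&lt;/span&gt; 50 &lt;span class="o"&gt;*&lt;/span&gt; 10000&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="s2"&gt; overhead tokens across 10,000 calls/day"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is 100 million overhead tokens per day before a single useful token is generated.&lt;/p&gt;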

&lt;h2&gt;
  
  
  How Bifrost Code Mode Changes This
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; takes a different approach with its &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;. Instead of exposing raw tool definitions to the LLM, it generates TypeScript declaration files (.d.ts) for all connected MCP tools.&lt;/p&gt;

&lt;p&gt;The LLM then writes TypeScript code to orchestrate multiple tools in a restricted sandbox environment. Instead of the model making 5 separate tool calls (each requiring a round trip), it writes one code block that handles all 5 operations.&lt;/p&gt;

&lt;p&gt;Here is what this means in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token reduction:&lt;/strong&gt; 50%+ compared to classic MCP. The TypeScript declarations are more compact than full JSON schemas, and the model makes fewer round trips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency reduction:&lt;/strong&gt; 40-50% compared to classic MCP. Fewer round trips means faster overall execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recommended when:&lt;/strong&gt; You are using 3 or more MCP servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Code Mode Actually Does
&lt;/h2&gt;

&lt;p&gt;The execution model is restricted by design. Here is what is available in the sandbox:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available:&lt;/strong&gt; ES5.1+ JavaScript, async/await, TypeScript, console.log/error/warn, JSON.parse/stringify, and all MCP tool bindings as globals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not available:&lt;/strong&gt; ES Modules, Node.js APIs, browser APIs, DOM, timers (setTimeout/setInterval), network access.&lt;/p&gt;

&lt;p&gt;This is not a general-purpose runtime. It is a controlled environment where the LLM can orchestrate tools safely. No arbitrary code execution, no network calls outside of the tool bindings.&lt;/p&gt;

&lt;p&gt;You can configure tool bindings at the server level or tool level, depending on how granular you need the control to be. The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs cover the binding configuration&lt;/a&gt; in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Numbers
&lt;/h2&gt;

&lt;p&gt;Bifrost itself adds 11 microseconds of latency overhead per request. It is written in Go and handles 5,000 RPS sustained throughput. That is roughly 50x faster than Python-based alternatives.&lt;/p&gt;

&lt;p&gt;For MCP-specific operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-3ms MCP latency overall&lt;/li&gt;
&lt;li&gt;InProcess connections: ~0.1ms&lt;/li&gt;
&lt;li&gt;STDIO connections: ~1-10ms&lt;/li&gt;
&lt;li&gt;HTTP connections: ~10-500ms (network dependent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MCP tool discovery is cached after the first request, so subsequent calls take ~100-500 microseconds for discovery and ~50-200 nanoseconds for tool filtering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Mode: The Other Side
&lt;/h2&gt;

&lt;p&gt;Bifrost also has an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Agent Mode&lt;/a&gt; that turns the gateway into an autonomous agent runtime. You configure which tools are auto-approved via &lt;code&gt;tools_to_auto_execute&lt;/code&gt;, set a &lt;code&gt;max_depth&lt;/code&gt; to prevent infinite loops, and let the agent handle iterative execution.&lt;/p&gt;

&lt;p&gt;This is a different use case from Code Mode. Agent Mode is for workflows where you want the LLM to act autonomously within boundaries. Code Mode is for when you want to reduce token costs on tool-heavy operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Setup is zero-config. You can start with npx or Docker. The gateway supports 19+ providers out of the box (OpenAI, Anthropic, Azure, Bedrock, Gemini, Mistral, Cohere, Groq, and others), all through an &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;OpenAI-compatible API format&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# npx&lt;/span&gt;
npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost

&lt;span class="c"&gt;# Docker&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Who Should Use Code Mode
&lt;/h2&gt;

&lt;p&gt;If you are running fewer than 3 MCP servers, classic mode is probably fine. The overhead is manageable.&lt;/p&gt;

&lt;p&gt;If you are running 3+, especially with 50+ tools across those servers, Code Mode is worth testing. The 50%+ token savings are significant at scale, and the 40-50% latency improvement compounds across multi-step agent workflows.&lt;/p&gt;

&lt;p&gt;I tested this on a setup with 5 MCP servers and 80+ tools. The token savings were immediately visible in the cost dashboard. The reduced round trips also made the overall agent response noticeably faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;git.new/bifrost&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;getmax.im/bifrostdocs&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;getmax.im/bifrost-home&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Connect Any Model with Gemini CLI Using Bifrost AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Mon, 06 Apr 2026 10:12:17 +0000</pubDate>
      <link>https://forem.com/pranay_batta/how-to-connect-any-model-with-gemini-cli-using-bifrost-ai-gateway-4n0d</link>
      <guid>https://forem.com/pranay_batta/how-to-connect-any-model-with-gemini-cli-using-bifrost-ai-gateway-4n0d</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Gemini CLI works with Google's models out of the box. But if you want to route requests through multiple providers, add failover, or track costs, you can point Gemini CLI at Bifrost. One config change. Every model available through a single endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Provider CLI Tools
&lt;/h2&gt;

&lt;p&gt;Gemini CLI connects to Google's Generative AI API. That is fine if you only use Gemini models. But most production setups involve multiple providers. OpenAI for some tasks. Anthropic for others. Maybe a local Ollama instance for development.&lt;/p&gt;

&lt;p&gt;Switching between CLIs and API keys for each provider gets old fast.&lt;/p&gt;

&lt;p&gt;I tested &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway written in Go, as a unified routing layer for Gemini CLI. The setup took about 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bifrost Works with Gemini CLI
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;fully Google GenAI-compatible endpoint&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;http://localhost:8080/genai/v1beta/models/{model}/generateContent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means Gemini CLI can talk to Bifrost without any code changes. Just point the base URL to your Bifrost instance.&lt;/p&gt;

&lt;p&gt;Bifrost then routes the request to whatever provider and model you specify. OpenAI, Anthropic, Vertex AI, Bedrock, Groq, Ollama. All through the same endpoint.&lt;/p&gt;
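&lt;p&gt;A direct request against that endpoint looks like this. The body follows Google's standard &lt;code&gt;generateContent&lt;/code&gt; shape; passing the Bifrost key via &lt;code&gt;x-goog-api-key&lt;/code&gt; is an assumption, so check the docs for the exact auth scheme:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/genai/v1beta/models/gemini-2.5-flash/generateContent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-goog-api-key: your-bifrost-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"contents": [{"parts": [{"text": "Say hello"}]}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Swap the model segment for any model you have configured and Bifrost handles the provider translation.&lt;/p&gt;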

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @anthropic-ai/bifrost@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Zero config. Starts on port 8080 by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure Providers
&lt;/h3&gt;

&lt;p&gt;Add your provider keys to the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;config&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.OPENAI_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"anthropic-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-20250514"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gemini"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemini-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"env.GEMINI_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gemini-2.5-flash"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Point Codex CLI to Bifrost
&lt;/h3&gt;

&lt;p&gt;Point Codex CLI at Bifrost's OpenAI-compatible endpoint by setting the base URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every request from Codex CLI goes through Bifrost. You can target any provider using the provider-prefixed model format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini/gemini-2.5-flash      → Google Gemini
openai/gpt-4o                → OpenAI
anthropic/claude-sonnet-4-20250514  → Anthropic
vertex/gemini-pro             → Vertex AI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
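&lt;p&gt;For example, assuming your Codex CLI build exposes the usual &lt;code&gt;-m&lt;/code&gt;/&lt;code&gt;--model&lt;/code&gt; flag, switching providers is a one-flag change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# route this session to Claude through Bifrost&lt;/span&gt;
codex &lt;span class="nt"&gt;-m&lt;/span&gt; anthropic/claude-sonnet-4-20250514
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;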



&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Provider Routing
&lt;/h3&gt;

&lt;p&gt;One CLI, every model. No more switching between tools or managing separate API keys per provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Failover
&lt;/h3&gt;

&lt;p&gt;Set up &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;fallback chains&lt;/a&gt; in your requests. If Gemini is rate-limited, the request goes to OpenAI. If OpenAI is down, it goes to Anthropic. Each fallback is a fresh request. All plugins still run.&lt;/p&gt;
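&lt;p&gt;As a sketch of what a per-request fallback chain can look like (the &lt;code&gt;fallbacks&lt;/code&gt; field name and exact request shape are taken from my reading of the Bifrost docs, so verify them against your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions \
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; \
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemini/gemini-2.5-flash",
    "fallbacks": ["openai/gpt-4o", "anthropic/claude-sonnet-4-20250514"],
    "messages": [{"role": "user", "content": "hello"}]
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;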

&lt;h3&gt;
  
  
  Budget Controls
&lt;/h3&gt;

&lt;p&gt;Bifrost has a &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;four-tier budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, and Provider Config. Set a monthly spending cap on your Virtual Key. When it is hit, the gateway stops routing to paid providers. Your local Ollama instance can serve as the fallback.&lt;/p&gt;
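&lt;p&gt;Requests sent through a virtual key count against that key's budget. A hedged sketch (the &lt;code&gt;x-bf-vk&lt;/code&gt; header name is my reading of the Bifrost docs, and the key value is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions \
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-vk: vk-my-codex-key"&lt;/span&gt; \
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; \
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "hello"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;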

&lt;h3&gt;
  
  
  Cost Tracking
&lt;/h3&gt;

&lt;p&gt;Every request is logged with token counts and cost calculations. The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Model Catalog&lt;/a&gt; tracks pricing across all providers automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;I ran 1,000 requests through Bifrost targeting Gemini models. The gateway adds 11µs of overhead per request. At 5,000 RPS sustained throughput, the bottleneck is always the provider, never the gateway. That is 50x faster than Python-based alternatives like LiteLLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overhead Question
&lt;/h2&gt;

&lt;p&gt;The concern with adding a proxy layer is always latency. In practice, LLM API calls take 500ms to 5 seconds depending on the model and prompt. An 11µs gateway overhead is invisible.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; layer (currently Weaviate-backed) can actually reduce latency for repeated queries by serving cached responses instead of hitting the provider again.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Makes Sense
&lt;/h2&gt;

&lt;p&gt;This setup is useful if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Codex CLI but also need Anthropic or Gemini models&lt;/li&gt;
&lt;li&gt;Want failover so your workflow does not break during provider outages&lt;/li&gt;
&lt;li&gt;Need to track costs across providers in one place&lt;/li&gt;
&lt;li&gt;Want to set budget limits so you do not get surprise bills&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only use OpenAI models and do not care about failover or cost tracking, a direct connection is fine. The gateway adds value when you are working across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost Home&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Top 5 Enterprise AI Gateways for Dynamic Routing in 2026</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Fri, 03 Apr 2026 05:42:24 +0000</pubDate>
      <link>https://forem.com/pranay_batta/top-5-enterprise-ai-gateways-for-dynamic-routing-in-2026-514b</link>
      <guid>https://forem.com/pranay_batta/top-5-enterprise-ai-gateways-for-dynamic-routing-in-2026-514b</guid>
      <description>&lt;p&gt;If you are running multiple LLM providers in production, routing logic becomes a critical infrastructure decision. Send everything to one provider and you get single points of failure. Hardcode routing rules and you lose flexibility when latency spikes or rate limits hit.&lt;/p&gt;

&lt;p&gt;I spent the last few weeks evaluating five AI gateways specifically for their dynamic routing capabilities. The criteria: latency overhead, failover behaviour, weighted distribution, and how much config it takes to get routing working in production.&lt;/p&gt;

&lt;p&gt;The short version: &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; came out on top for raw performance and routing flexibility. 11 microsecond latency overhead, written in Go, with weighted routing and automatic &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;failover&lt;/a&gt; built in. You can run it right now with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;. &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Full docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why dynamic routing matters
&lt;/h2&gt;

&lt;p&gt;Static routing is fine for prototypes. Pick a model, call the API, ship it.&lt;/p&gt;

&lt;p&gt;Production is different. You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failover&lt;/strong&gt;: When OpenAI returns 429s or 500s, traffic should automatically shift to Anthropic or another provider. No manual intervention.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weighted distribution&lt;/strong&gt;: Split traffic 70/30 across providers for cost optimization or A/B testing model quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency-based routing&lt;/strong&gt;: Send requests to whichever provider responds fastest at that moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget-aware routing&lt;/strong&gt;: Stop sending traffic to a provider when your spend cap is hit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gateway layer is the right place to handle this. Application code should not care which provider serves a request.&lt;/p&gt;
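&lt;p&gt;In practice that means the only provider-specific detail left in a request is the model string. A sketch against Bifrost's OpenAI-compatible endpoint (the endpoint path is assumed from its docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/v1/chat/completions \
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; \
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "ping"}]}'&lt;/span&gt;
&lt;span class="c"&gt;# swap "openai/gpt-4o" for "anthropic/claude-sonnet-4-20250514"; nothing else changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;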




&lt;h2&gt;
  
  
  The five gateways I tested
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Bifrost
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Go | &lt;strong&gt;Overhead&lt;/strong&gt;: 11 microseconds | &lt;strong&gt;Throughput&lt;/strong&gt;: 5,000 RPS sustained&lt;/p&gt;

&lt;p&gt;Bifrost is the fastest gateway I have tested, and the 11 microsecond overhead is not a typo. For comparison, Python-based alternatives like LiteLLM add around 8ms of overhead per request, several hundred times more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;Routing configuration&lt;/a&gt; is declarative and clean. Here is what weighted routing across two providers looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bifrost-config.yaml&lt;/span&gt;
&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai-primary&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${OPENAI_API_KEY}&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic-fallback&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropic&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${ANTHROPIC_API_KEY}&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weighted&lt;/span&gt;
  &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;max_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That splits 70% of traffic to OpenAI and 30% to Anthropic. If OpenAI fails, requests automatically &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;fall back&lt;/a&gt; to Anthropic.&lt;/p&gt;

&lt;p&gt;What I like: the &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance layer&lt;/a&gt; ties routing to budgets. You can set a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; (Customer, Team, Virtual Key, Provider Config) and routing decisions respect those limits. When a provider budget is exhausted, traffic shifts automatically.&lt;/p&gt;

&lt;p&gt;Setup is genuinely fast. One command to start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; covers both approaches. &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider configuration&lt;/a&gt; takes a few minutes.&lt;/p&gt;

&lt;p&gt;Other features worth noting: &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; with dual-layer support (exact hash + semantic similarity), &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; built in, &lt;a href="https://docs.getbifrost.ai/features/mcp" rel="noopener noreferrer"&gt;MCP support&lt;/a&gt; with sub-3ms latency and 50%+ token reduction in Code Mode, and a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; endpoint for the Anthropic SDK so you can migrate without changing application code. &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration docs here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Check the benchmarks&lt;/a&gt; if you want to verify the numbers yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. LiteLLM
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python | &lt;strong&gt;Overhead&lt;/strong&gt;: ~8ms | &lt;strong&gt;Providers&lt;/strong&gt;: 100+&lt;/p&gt;

&lt;p&gt;LiteLLM has the widest provider coverage I have seen. Over 100 providers through a unified interface. If you need to call a niche model API, LiteLLM probably supports it.&lt;/p&gt;

&lt;p&gt;Routing is available through the proxy server. You can configure fallbacks and load balancing across models. The configuration is YAML-based and straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-4&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-xxx&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;azure/gpt-4&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sk-yyy&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;routing_strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;least-busy&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade-off is performance. At ~8ms overhead per request, you are adding meaningful latency at high throughput. For applications doing thousands of requests per second, that adds up. The Python runtime is the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit where it is due&lt;/strong&gt;: LiteLLM's provider coverage is unmatched and the community is active. For teams that prioritize breadth over speed, it is a solid choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Kong AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language&lt;/strong&gt;: Lua/C (OpenResty) | &lt;strong&gt;Type&lt;/strong&gt;: Enterprise, plugin-based&lt;/p&gt;

&lt;p&gt;Kong is a well-established API gateway that added AI capabilities through plugins. If your organization already runs Kong for general API management, adding AI routing is incremental.&lt;/p&gt;

&lt;p&gt;The AI plugin supports multiple providers and basic routing. Rate limiting, authentication, and logging come from Kong's mature plugin ecosystem.&lt;/p&gt;

&lt;p&gt;The limitation: AI-specific routing features require the enterprise tier. The open-source version gives you basic proxying, but weighted routing, advanced failover, and AI-specific analytics are paid features. Configuration is also more complex because you are working within Kong's plugin architecture rather than a purpose-built AI gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: Kong's plugin ecosystem is mature and battle-tested for general API management.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cloudflare AI Gateway
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type&lt;/strong&gt;: Managed service | &lt;strong&gt;Setup&lt;/strong&gt;: Minutes&lt;/p&gt;

&lt;p&gt;Cloudflare AI Gateway is the easiest to set up on this list. If you are already on Cloudflare, you can enable it from the dashboard and start routing requests through their edge network.&lt;/p&gt;

&lt;p&gt;It provides caching, rate limiting, and basic analytics out of the box. The managed nature means zero infrastructure to maintain.&lt;/p&gt;

&lt;p&gt;The limitation: routing flexibility is constrained compared to self-hosted options. Custom routing strategies, weighted distribution, and provider-level budget controls are limited. You also depend on Cloudflare's edge network for all LLM traffic, which may not work for teams with data residency requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: For teams that want AI gateway functionality without managing infrastructure, Cloudflare delivers the simplest path to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Azure API Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Type&lt;/strong&gt;: Enterprise, Azure-native | &lt;strong&gt;Setup&lt;/strong&gt;: Hours to days&lt;/p&gt;

&lt;p&gt;Azure APIM is the default choice for organizations already invested in Azure. It supports routing to Azure OpenAI endpoints with built-in integration, and you can configure policies for retry, circuit breaking, and load balancing.&lt;/p&gt;

&lt;p&gt;The routing configuration uses Azure's policy XML, which is verbose but powerful. You get deep integration with Azure Monitor, Key Vault, and other Azure services.&lt;/p&gt;

&lt;p&gt;The limitation: it is Azure-native. If you are multi-cloud or use non-Azure LLM providers, the integration story gets complicated. Routing to Anthropic or other providers requires custom policy work. Setup is also significantly more complex than purpose-built AI gateways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credit&lt;/strong&gt;: For Azure-first organizations, the deep integration with the Azure ecosystem and enterprise compliance features are genuinely valuable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Kong AI&lt;/th&gt;
&lt;th&gt;Cloudflare AI&lt;/th&gt;
&lt;th&gt;Azure APIM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency overhead&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;Low (Lua/C)&lt;/td&gt;
&lt;td&gt;Varies (edge)&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Lua/C&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weighted routing&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Enterprise only&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Via policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic failover&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Via policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget-aware routing&lt;/td&gt;
&lt;td&gt;Yes (4-tier)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Yes (dual-layer)&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider count&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;td&gt;Major providers&lt;/td&gt;
&lt;td&gt;Major providers&lt;/td&gt;
&lt;td&gt;Azure-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Hours to days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Honest trade-offs
&lt;/h2&gt;

&lt;p&gt;No tool is perfect. Here is what I found lacking in each.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bifrost&lt;/strong&gt;: Provider count is still growing. If you need a niche provider that is not yet supported, you will need to check the docs or request it. The project is newer than LiteLLM, so community resources are still building up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt;: Performance at scale is the main concern. The ~8ms overhead is fine for low-throughput applications, but at 5,000+ RPS, you are looking at significant cumulative latency. Memory usage also climbs with the Python runtime under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kong AI Gateway&lt;/strong&gt;: The AI features feel bolted on rather than native. If you are not already a Kong customer, adopting the full Kong stack just for AI routing is overkill. Enterprise pricing for AI-specific features is a barrier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt;: Limited control. You cannot implement custom routing strategies or complex failover logic. Data flows through Cloudflare's network, which is a non-starter for some compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure APIM&lt;/strong&gt;: Vendor lock-in is real. Multi-provider routing outside Azure requires significant custom work. Configuration through XML policies is tedious compared to YAML-based alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which one should you pick
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick Bifrost&lt;/strong&gt; if performance and routing flexibility are your top priorities. The 11 microsecond overhead and built-in &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance&lt;/a&gt; features (budget-aware routing, weighted distribution, automatic failover) make it the strongest option for high-throughput production workloads. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Star it on GitHub&lt;/a&gt; or &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;check the docs&lt;/a&gt; to get started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick LiteLLM&lt;/strong&gt; if you need the widest provider coverage and performance is not your bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Kong&lt;/strong&gt; if your organization already runs Kong and wants to add AI routing incrementally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cloudflare&lt;/strong&gt; if you want zero infrastructure overhead and can live with limited routing customization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Azure APIM&lt;/strong&gt; if you are fully committed to the Azure ecosystem.&lt;/p&gt;

&lt;p&gt;For most teams building production AI infrastructure, routing is a gateway-level concern that should not leak into application code. The right gateway depends on your throughput requirements, provider mix, and how much control you need over routing logic.&lt;/p&gt;

&lt;p&gt;I would start with &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;. One command to run, 11 microseconds of overhead, and routing that actually works at scale. &lt;a href="https://getmax.im/docspage" rel="noopener noreferrer"&gt;Docs are here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
      <category>go</category>
    </item>
    <item>
      <title>How to Connect Non-Anthropic Models to Claude Code with Bifrost AI Gateway</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 13:41:26 +0000</pubDate>
      <link>https://forem.com/pranay_batta/how-to-connect-non-anthropic-models-to-claude-code-with-bifrost-ai-gateway-5dnj</link>
      <guid>https://forem.com/pranay_batta/how-to-connect-non-anthropic-models-to-claude-code-with-bifrost-ai-gateway-5dnj</guid>
      <description>&lt;p&gt;I tested five different LLM gateways to route non-Anthropic models through Claude Code. Bifrost was the fastest by a wide margin. 11 microseconds of overhead per request. 50x faster than the Python-based alternatives I benchmarked.&lt;/p&gt;

&lt;p&gt;Here is exactly how I set it up, what worked, and where each feature matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Bifrost is an open-source Go gateway that exposes an Anthropic-compatible endpoint, letting you route Claude Code requests to GPT-4o, Gemini, Bedrock, or any supported provider by changing one environment variable. You get multi-provider failover, budget controls, and semantic caching at 11 microseconds of overhead per request.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post assumes you are familiar with Claude Code and have used at least one LLM API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost on GitHub&lt;/a&gt; -- open-source, written in Go, handles 5,000 RPS on a single instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Claude Code locks you into &lt;code&gt;api.anthropic.com&lt;/code&gt;. No native way to swap providers. You cannot route to GPT-4o, Gemini, or Bedrock models without building your own proxy or switching tools entirely.&lt;/p&gt;

&lt;p&gt;I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4o for certain coding tasks&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Pro for long context&lt;/li&gt;
&lt;li&gt;Automatic failover when a provider goes down&lt;/li&gt;
&lt;li&gt;One place to track costs across all models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building a custom proxy was not worth the maintenance burden. So I went looking for something production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Bifrost Does
&lt;/h2&gt;

&lt;p&gt;Bifrost exposes an Anthropic-compatible endpoint at &lt;code&gt;/anthropic&lt;/code&gt;. Claude Code sends standard Anthropic-format requests. Bifrost translates and routes them to whatever provider you configure -- OpenAI, Bedrock, Vertex AI, Gemini, others.&lt;/p&gt;

&lt;p&gt;It is a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt;. Change one URL. No SDK modifications. No wrapper code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Claude Code -&amp;gt; Bifrost (/anthropic) -&amp;gt; Any LLM Provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK integration&lt;/a&gt; page has the full compatibility details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: 3 Minutes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That starts the gateway locally. Full &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup instructions here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure a Provider
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;bifrost.yaml&lt;/code&gt;. This routes everything to GPT-4o:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider configuration docs&lt;/a&gt; for all supported providers and options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Point Claude Code at Bifrost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done. Claude Code now sends requests through Bifrost, which translates them to OpenAI format and forwards to GPT-4o. Zero code changes.&lt;/p&gt;
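
&lt;p&gt;If you want to sanity-check the wiring without launching Claude Code, send a minimal request to the same endpoint. I am assuming here that Bifrost accepts the standard Anthropic Messages request shape under &lt;code&gt;/anthropic&lt;/code&gt;, per the integration docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s http://localhost:8080/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "max_tokens": 128, "messages": [{"role": "user", "content": "Say hello"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;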

&lt;h2&gt;
  
  
  Multi-Provider Routing
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. I configured weighted &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt; across two providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-fallback"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;80% of traffic goes to GPT-4o. 20% to Claude. Useful when you want to compare output quality across models in real usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic Failover
&lt;/h2&gt;

&lt;p&gt;This was the feature that sold me. &lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover configuration&lt;/a&gt; took five minutes. If GPT-4o goes down, Bifrost tries the next provider automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;accounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-account"&lt;/span&gt;
    &lt;span class="na"&gt;failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-primary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${OPENAI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-secondary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${GEMINI_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic-tertiary"&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
        &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${ANTHROPIC_API_KEY}"&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514"&lt;/span&gt;
        &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenAI fails, Bifrost tries Gemini. If Gemini fails, it falls back to Anthropic. My Claude Code session never breaks. No retry logic on my side.&lt;/p&gt;
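
&lt;p&gt;An easy way to watch failover happen is to break the primary on purpose. With an invalid OpenAI key, responses should come back from the priority-2 provider instead. A rough sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# deliberately break the primary provider, then restart the gateway
export OPENAI_API_KEY="sk-invalid"
npx -y @maximhq/bifrost

# requests from Claude Code now succeed via Gemini, the priority-2 provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;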

&lt;h2&gt;
  
  
  Bedrock and Vertex AI
&lt;/h2&gt;

&lt;p&gt;I also tested with AWS Bedrock and Vertex AI. Same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-claude"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock"&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-20250514-v2:0"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex-gemini"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vertex"&lt;/span&gt;
    &lt;span class="na"&gt;project_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-gcp-project"&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same Anthropic-compatible endpoint. Claude Code does not know which provider is behind Bifrost. That is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features Worth Mentioning
&lt;/h2&gt;

&lt;p&gt;Routing alone is useful. But once all requests flow through one gateway, you get access to several other capabilities I found genuinely practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Budget Enforcement
&lt;/h3&gt;

&lt;p&gt;Bifrost has a four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;: Customer, Team, Virtual Key, Provider Config. I set team-level limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team"&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineering"&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Budget runs out, requests get blocked. No surprise bills. The &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;governance docs&lt;/a&gt; cover the full hierarchy.&lt;/p&gt;
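
&lt;p&gt;The other tiers follow the same shape. Here is a sketch of a tighter virtual-key budget layered under the team limit -- the &lt;code&gt;virtual_key&lt;/code&gt; level name and &lt;code&gt;ci-bot&lt;/code&gt; id are illustrative, and I am assuming the non-team tiers take the same fields as the team example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;budgets:
  - level: "team"
    id: "engineering"
    limit: 500
    period: "monthly"
  - level: "virtual_key"   # assumed level name; check the governance docs
    id: "ci-bot"
    limit: 50
    period: "monthly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;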

&lt;h3&gt;
  
  
  Semantic Caching
&lt;/h3&gt;

&lt;p&gt;This cut my costs noticeably. Bifrost supports &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer caching&lt;/a&gt;: exact hash matching plus semantic similarity. If I have already asked a similar question, it returns the cached response instead of hitting the provider.&lt;/p&gt;
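
&lt;p&gt;You can see the cache working by sending the same prompt twice and comparing response times. The route below is an assumption on my part -- check the docs for the exact OpenAI-compatible path your version exposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# second call should return from cache, near-instantly
for i in 1 2; do
  time curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "What is a goroutine?"}]}' &amp;gt; /dev/null
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;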

&lt;p&gt;Supported &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector stores&lt;/a&gt;: Weaviate, Redis, Qdrant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;Every request gets logged with latency, tokens, cost, and provider information. The &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability layer&lt;/a&gt; gives you full visibility into what is happening across all your providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Support
&lt;/h3&gt;

&lt;p&gt;Bifrost also works as an &lt;a href="https://docs.getbifrost.ai/features/mcp" rel="noopener noreferrer"&gt;MCP server&lt;/a&gt;. I tested Code Mode -- it reduced tokens by over 50% and latency by 40-50%. Agent Mode is available for more complex workflows. Useful if you are connecting to Claude Desktop or other MCP-compatible clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;I ran my own tests and the numbers matched what is documented: 11 microseconds of overhead and 5,000 RPS on a single instance. The Go implementation makes a real difference compared to the Python gateways I tested.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;benchmarking guide&lt;/a&gt; explains how to reproduce these numbers yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade-offs and Limitations
&lt;/h2&gt;

&lt;p&gt;Worth being upfront about the downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Relatively new project.&lt;/strong&gt; Bifrost does not have the years of battle-testing that older proxies have. The community is growing but still smaller than those of established alternatives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosted only.&lt;/strong&gt; The open-source version has no managed cloud offering. You run and maintain the infrastructure yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra operational overhead.&lt;/strong&gt; You are running a separate process between Claude Code and your LLM provider. That is one more thing to monitor, update, and debug compared to direct API calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider coverage is expanding but not exhaustive.&lt;/strong&gt; Some niche providers or model variants may not be supported yet. Check the docs before committing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Recap
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Configure providers in &lt;code&gt;bifrost.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;ANTHROPIC_BASE_URL=http://localhost:8080/anthropic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use Claude Code normally. Bifrost routes to whatever model you configured.&lt;/li&gt;
&lt;/ol&gt;
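
&lt;p&gt;The whole recap fits in a few lines of shell. The &lt;code&gt;claude&lt;/code&gt; command is the Claude Code CLI; the ampersand just backgrounds the gateway for the demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx -y @maximhq/bifrost &amp;amp;   # start the gateway, reading bifrost.yaml from the working directory
export ANTHROPIC_BASE_URL=http://localhost:8080/anthropic
claude                         # use Claude Code as usual; Bifrost handles the routing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;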

&lt;p&gt;I tried building custom proxies before. I tried other gateways. This is the fastest option I found, and the setup takes minutes, not hours.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; | &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Docs&lt;/a&gt; | &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into issues or want a specific provider supported, open an issue on the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;Drop-in Replacement Guide&lt;/a&gt; -- how Bifrost maintains full Anthropic SDK compatibility&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider Configuration&lt;/a&gt; -- all supported providers and config options&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/fallbacks" rel="noopener noreferrer"&gt;Failover and Fallbacks&lt;/a&gt; -- setting up automatic provider failover&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;Governance: Budget and Limits&lt;/a&gt; -- the four-tier budget hierarchy explained&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getbifrost.ai/benchmarking/getting-started" rel="noopener noreferrer"&gt;Benchmarking Guide&lt;/a&gt; -- reproduce the latency and throughput numbers yourself&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claudecode</category>
      <category>llm</category>
      <category>opensource</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Bifrost Reduces GPT Costs and Response Times with Semantic Caching</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 05:51:15 +0000</pubDate>
      <link>https://forem.com/pranay_batta/how-bifrost-reduces-gpt-costs-and-response-times-with-semantic-caching-344g</link>
      <guid>https://forem.com/pranay_batta/how-bifrost-reduces-gpt-costs-and-response-times-with-semantic-caching-344g</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Every GPT API call costs money and takes time. If your app sends the same (or very similar) prompts repeatedly, you are paying full price each time for answers you already have. &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway, ships with a &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt; plugin that uses dual-layer caching: exact hash matching plus &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector similarity search&lt;/a&gt;. Cache hits cost zero. Semantic matches cost only the embedding lookup. This post walks you through how it works and how to set it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cost problem with GPT API calls
&lt;/h2&gt;

&lt;p&gt;If you are building anything production-grade with GPT-4, GPT-4o, or any OpenAI model, you already know that API costs add up fast. Token-based pricing means every request burns through your budget, whether it is a fresh question or something your system answered three minutes ago.&lt;/p&gt;

&lt;p&gt;Here is the thing: in most real applications, a significant portion of requests are either identical or semantically similar to previous ones. Think about it. Customer support bots get asked the same questions in slightly different words. Code assistants receive near-identical prompts from different users. RAG pipelines retrieve similar context and ask similar follow-ups.&lt;/p&gt;

&lt;p&gt;Without caching, you pay full model cost for every single one of those requests. You also wait for the full round-trip to the provider each time, adding latency that your users notice.&lt;/p&gt;

&lt;p&gt;The obvious fix is caching. But traditional exact-match caching has a big limitation: it only works when the prompt is character-for-character identical. Change one word, add a comma, rephrase slightly, and you get a cache miss. That is where semantic caching changes the game.&lt;/p&gt;
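
&lt;p&gt;You can see the fragility directly: hash two rephrasings of the same question and the digests share nothing, so an exact-match cache treats them as unrelated requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;echo -n "How do I reset my password?" | sha256sum
echo -n "What are the steps to change my password?" | sha256sum
# two completely different digests -- the second request is a guaranteed cache miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;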




&lt;h2&gt;
  
  
  What semantic caching is and how it differs from exact-match caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Exact-match caching&lt;/strong&gt; hashes the entire request and looks up that hash. If the hash matches a stored response, you get a cache hit. If even one character is different, it is a miss. This works well for automated pipelines where prompts are templated and predictable. It falls apart for user-facing applications where people phrase things differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; converts the request into a vector embedding and searches for similar embeddings in a vector store. If a stored request is semantically similar enough (above a configurable threshold), the cached response is returned. This means "How do I reset my password?" and "What are the steps to change my password?" can both hit the same cache entry.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; combines both approaches in a &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;dual-layer architecture&lt;/a&gt;, giving you the speed of exact matching with the intelligence of semantic similarity as a fallback.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Bifrost implements dual-layer caching
&lt;/h2&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt; uses a two-step lookup process for every request that has a cache key:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Exact hash match.&lt;/strong&gt; The plugin hashes the request and checks for a direct match. This is the fastest path. If it hits, you get the cached response with zero additional cost. No embedding generation, no vector search, no provider call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Semantic similarity search.&lt;/strong&gt; If the exact match misses, Bifrost generates an embedding for the request and searches the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; for semantically similar entries. If a match is found above the similarity threshold (default 0.8), the cached response is returned. The only cost here is the embedding generation.&lt;/p&gt;

&lt;p&gt;If both layers miss, the request goes to the LLM provider as normal. The response is then stored in the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; with its embedding for future lookups.&lt;/p&gt;

&lt;p&gt;You can also control which layer to use per request. If you know your use case only needs exact matching (templated prompts), you can skip the semantic layer entirely. If you want semantic-only, that is an option too. The default is both, with direct matching first and semantic as fallback.&lt;/p&gt;

&lt;p&gt;Here is how the cost breaks down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;LLM API Cost&lt;/th&gt;
&lt;th&gt;Embedding Cost&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact cache hit&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic cache hit&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;td&gt;Embedding only&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cache miss&lt;/td&gt;
&lt;td&gt;Full model cost&lt;/td&gt;
&lt;td&gt;Embedding generation&lt;/td&gt;
&lt;td&gt;Full + embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bifrost also handles cost calculation natively through &lt;code&gt;CalculateCostWithCacheDebug&lt;/code&gt;, which automatically accounts for cache hits, semantic matches, and misses in your &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;cost tracking&lt;/a&gt;. All pricing data is cached in memory for O(1) lookup, so the cost calculation itself adds no overhead.&lt;/p&gt;

&lt;p&gt;Check out the full &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Bifrost documentation&lt;/a&gt; for the complete API reference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Follow the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; to get Bifrost running, then configure two things: a vector store and the semantic cache plugin.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Configure the &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Bifrost uses Weaviate as its vector store. You can run Weaviate locally with Docker or use Weaviate Cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local setup with Docker:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 50051:50051 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;PERSISTENCE_DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'/var/lib/weaviate'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  semitechnologies/weaviate:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;config.json (local Weaviate):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vector_store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weaviate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"localhost:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scheme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;config.json (Weaviate Cloud):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vector_store"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weaviate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"host"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-cluster.weaviate.network"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scheme"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-weaviate-api-key"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure the &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Add the plugin to your Bifrost config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"plugins"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"semantic_cache"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"embedding_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ttl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"conversation_history_threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"exclude_system_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cache_by_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cache_by_provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"cleanup_on_shutdown"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note about these settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;threshold&lt;/code&gt;&lt;/strong&gt;: The similarity score (0 to 1) required for a semantic match. 0.8 is a good starting point. Higher means stricter matching, fewer false positives, but more cache misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;conversation_history_threshold&lt;/code&gt;&lt;/strong&gt;: Defaults to 3. If a conversation has more messages than this, caching is skipped. Long conversations have a high probability of false positive semantic matches due to topic overlap, and they rarely produce exact hash matches anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ttl&lt;/code&gt;&lt;/strong&gt;: How long cached responses stay valid. Accepts duration strings like &lt;code&gt;"30s"&lt;/code&gt;, &lt;code&gt;"5m"&lt;/code&gt;, &lt;code&gt;"1h"&lt;/code&gt;, or numeric seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;cache_by_model&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;cache_by_provider&lt;/code&gt;&lt;/strong&gt;: When true, cache entries are isolated per &lt;a href="https://docs.getbifrost.ai/architecture/framework/model-catalog" rel="noopener noreferrer"&gt;model&lt;/a&gt; and &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;provider&lt;/a&gt; combination. A GPT-4 response will not be returned for a GPT-3.5-turbo request.&lt;/li&gt;
&lt;/ul&gt;
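&lt;p&gt;To make the &lt;code&gt;threshold&lt;/code&gt; setting concrete, here is a minimal Go sketch of how a similarity score is compared against it. The &lt;code&gt;cosineSimilarity&lt;/code&gt; function and the toy vectors are illustrative, not Bifrost's actual implementation:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity returns the similarity of two embedding vectors.
// A semantic cache compares this score against the configured
// "threshold" to decide whether a cached response counts as a match.
func cosineSimilarity(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	threshold := 0.8
	// Toy 3-dimensional "embeddings"; real embedding models
	// produce vectors with hundreds of dimensions.
	cached := []float64{0.9, 0.1, 0.2}
	incoming := []float64{0.85, 0.15, 0.25}
	score := cosineSimilarity(cached, incoming)
	fmt.Printf("similarity=%.3f cache_hit=%v\n", score, score >= threshold)
}
```

Raising the threshold toward 1.0 shrinks the set of incoming vectors that clear this comparison, which is exactly the "fewer false positives, more cache misses" trade-off described above.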

&lt;h3&gt;
  
  
  Step 3: Trigger caching per request
&lt;/h3&gt;

&lt;p&gt;Caching is opt-in per request. You need to set a cache key, either via the Go SDK or HTTP headers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP API:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This request WILL be cached&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "gpt-4", "messages": [{"role": "user", "content": "What is semantic caching?"}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Go SDK:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithValue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semanticcache&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CacheKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"session-123"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletionRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without the cache key, requests bypass caching entirely. This gives you fine-grained control over what gets cached and what does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-request overrides (HTTP):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-ttl: 30s"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-threshold: 0.9"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     http://localhost:8080/v1/chat/completions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache type control:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Direct hash matching only (fastest, no embedding cost)&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-type: direct"&lt;/span&gt; ...

&lt;span class="c"&gt;# Semantic similarity search only&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-type: semantic"&lt;/span&gt; ...

&lt;span class="c"&gt;# Default: both (direct first, semantic fallback)&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use no-store mode to read from cache without storing the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-key: session-123"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-bf-cache-no-store: true"&lt;/span&gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  When semantic caching helps vs when it does not
&lt;/h2&gt;

&lt;p&gt;Semantic caching is not a universal solution. Here is where it works well and where it does not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support bots where users ask the same questions in different words&lt;/li&gt;
&lt;li&gt;FAQ-style applications with predictable query patterns&lt;/li&gt;
&lt;li&gt;RAG pipelines where similar contexts produce similar queries&lt;/li&gt;
&lt;li&gt;Internal tools where multiple team members ask overlapping questions&lt;/li&gt;
&lt;li&gt;Any high-volume application with repetitive prompt patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not a good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversations that are heavily context-dependent and unique every time&lt;/li&gt;
&lt;li&gt;Long multi-turn conversations (the &lt;code&gt;conversation_history_threshold&lt;/code&gt; exists for this reason, as longer conversations create false positive matches)&lt;/li&gt;
&lt;li&gt;Applications where responses must reflect real-time data that changes frequently&lt;/li&gt;
&lt;li&gt;Creative generation tasks where you want varied outputs for similar inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is that semantic caching works best when your application naturally produces clusters of similar requests. If every request is genuinely unique, caching of any kind will not help much.&lt;/p&gt;




&lt;h2&gt;
  
  
  Other performance details worth knowing
&lt;/h2&gt;

&lt;p&gt;Beyond semantic caching, Bifrost caches aggressively at multiple levels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool discovery&lt;/strong&gt; is cached after the first request, bringing subsequent lookups down to roughly 100-500 microseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health check results&lt;/strong&gt; are cached, with lookups taking approximately 50 nanoseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All pricing data&lt;/strong&gt; is cached in memory for O(1) lookups during cost calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cache entries use namespace isolation. Each Bifrost instance gets its own &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector store&lt;/a&gt; namespace to prevent conflicts. When the Bifrost client shuts down (with &lt;code&gt;cleanup_on_shutdown&lt;/code&gt; set to true), all cache entries and the namespace itself are cleaned up. You can also programmatically &lt;a href="https://docs.getbifrost.ai/api-reference/cache/clear-cache-by-cache-key" rel="noopener noreferrer"&gt;clear cache by key&lt;/a&gt; or &lt;a href="https://docs.getbifrost.ai/api-reference/cache/clear-cache-by-request-id" rel="noopener noreferrer"&gt;clear cache by request ID&lt;/a&gt; via the API.&lt;/p&gt;

&lt;p&gt;Cache metadata is automatically added to responses via &lt;code&gt;response.ExtraFields.CacheDebug&lt;/code&gt;, so you can inspect whether a response came from direct cache, semantic match, or a fresh provider call. You can also use the &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-log-statistics" rel="noopener noreferrer"&gt;log statistics API&lt;/a&gt; for deeper &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; into your cache performance.&lt;/p&gt;
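&lt;p&gt;As a rough illustration of what consuming that metadata might look like, here is a self-contained Go sketch. The &lt;code&gt;CacheDebug&lt;/code&gt; struct and its field names below are hypothetical stand-ins; check the actual SDK types before relying on them:&lt;/p&gt;

```go
package main

import "fmt"

// CacheDebug is a hypothetical shape for the cache metadata attached to
// responses. Field names here are illustrative only; the real struct
// lives at response.ExtraFields.CacheDebug in the SDK.
type CacheDebug struct {
	CacheHit bool
	HitType  string // "direct" or "semantic" when CacheHit is true
}

// describe maps the metadata onto the three possible outcomes.
func describe(d CacheDebug) string {
	if !d.CacheHit {
		return "fresh provider call"
	}
	if d.HitType == "direct" {
		return "served from exact-hash cache (zero provider cost)"
	}
	return "served from semantic cache (embedding cost only)"
}

func main() {
	fmt.Println(describe(CacheDebug{CacheHit: true, HitType: "semantic"}))
}
```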




&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;If your GPT-powered application handles any volume of requests, there is a good chance a meaningful portion of those requests are semantically similar. Paying full API cost for every one of them does not make sense.&lt;/p&gt;

&lt;p&gt;Bifrost's &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache plugin&lt;/a&gt; gives you dual-layer caching with exact matching and &lt;a href="https://docs.getbifrost.ai/architecture/framework/vector-store" rel="noopener noreferrer"&gt;vector similarity search&lt;/a&gt;, opt-in per request, configurable thresholds, and built-in &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;cost tracking&lt;/a&gt;. It is open source, written in Go, and designed for production workloads.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to get started, read the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the full configuration reference, or visit the &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Bifrost website&lt;/a&gt; to learn more about the gateway.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>gpt</category>
      <category>mcp</category>
    </item>
    <item>
      <title>LLM Cost Tracking and Spend Management for Engineering Teams</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Wed, 01 Apr 2026 05:43:30 +0000</pubDate>
      <link>https://forem.com/pranay_batta/llm-cost-tracking-and-spend-management-for-engineering-teams-233a</link>
      <guid>https://forem.com/pranay_batta/llm-cost-tracking-and-spend-management-for-engineering-teams-233a</guid>
      <description>&lt;p&gt;Your team ships a feature using GPT-4, it works great in staging, and then production traffic hits. Suddenly you are burning through API credits faster than anyone expected. Multiply that across three providers, five teams, and a few hundred thousand requests per day. Good luck figuring out where the money went.&lt;/p&gt;

&lt;p&gt;We built &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;, an open-source LLM gateway in Go, and cost tracking was one of the first problems we had to solve properly. This post covers what we learned, how we designed spend management into the gateway layer, and what the alternatives look like. You can get started with the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup guide&lt;/a&gt; in under a minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Bifrost gives you per-request &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;cost logging&lt;/a&gt;, four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchies&lt;/a&gt; (Customer, Team, Virtual Key, Provider Config), auto-synced model pricing, and cache-aware cost calculations. All at 11 microsecond latency overhead. You can run it right now with &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt;. &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Full docs here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The actual problem with LLM costs
&lt;/h2&gt;

&lt;p&gt;Cloud compute costs are predictable. You pick an instance type, you know the hourly rate, you can forecast monthly spend within a few percent.&lt;/p&gt;

&lt;p&gt;LLM costs are nothing like that.&lt;/p&gt;

&lt;p&gt;A single API call costs somewhere between $0.0001 and $0.50 depending on the model, the input length, the output length, whether you are sending images or audio, and whether the context crosses the 128k token threshold (where pricing tiers change). That is per request.&lt;/p&gt;

&lt;p&gt;Now add multi-provider &lt;a href="https://docs.getbifrost.ai/features/governance/routing" rel="noopener noreferrer"&gt;routing&lt;/a&gt;. Your app might use OpenAI for chat, Anthropic for analysis, and a smaller model for classification. Each provider has different pricing structures, different token counting methods, and different billing cycles.&lt;/p&gt;

&lt;p&gt;The result: engineering teams have no idea what they are spending until the invoice arrives.&lt;/p&gt;

&lt;h2&gt;
  
  
  What cost tracking actually requires
&lt;/h2&gt;

&lt;p&gt;Most teams start with "we will check the provider dashboard." That breaks down fast for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-request granularity.&lt;/strong&gt; You need to know the cost of every single API call, tied to which customer, which team, and which feature triggered it. Provider dashboards give you aggregate numbers, not per-request attribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time budget enforcement.&lt;/strong&gt; Knowing you overspent last month does not help. You need the system to reject requests when a budget limit is hit, before the money is gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-modal cost calculation.&lt;/strong&gt; If your app sends images, audio, or very long contexts, the cost calculation is not a simple token multiplication. You need tiered pricing support, per-image costs, per-second audio costs, and character-based pricing for certain models.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we built cost tracking in Bifrost
&lt;/h2&gt;

&lt;p&gt;We wanted cost management to be a &lt;a href="https://docs.getbifrost.ai/features/governance" rel="noopener noreferrer"&gt;gateway-level concern&lt;/a&gt;, not something each application team has to implement. Here is how the pieces fit together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Catalog with auto-synced pricing
&lt;/h3&gt;

&lt;p&gt;The Model Catalog is the foundation. It maintains pricing data for every supported model across all providers. You can also &lt;a href="https://docs.getbifrost.ai/api-reference/configuration/force-pricing-sync" rel="noopener noreferrer"&gt;force a pricing sync&lt;/a&gt; at any time via the API.&lt;/p&gt;

&lt;p&gt;On startup, Bifrost downloads the latest pricing sheet and loads it into memory. When a ConfigStore (SQLite or PostgreSQL) is available, it also persists the data and re-syncs every 24 hours automatically. All lookups are O(1) from memory.&lt;/p&gt;

&lt;p&gt;The pricing data covers multiple modalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Text&lt;/strong&gt;: token-based and character-based pricing for chat completions, text completions, and embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio&lt;/strong&gt;: token-based and duration-based pricing for speech synthesis and transcription&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt;: per-image costs with tiered pricing for high-token contexts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tiered pricing&lt;/strong&gt;: automatic rate changes above 128k tokens, reflecting actual provider pricing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means cost calculation is accurate for every request type, not an approximation based on token count alone.&lt;/p&gt;
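&lt;p&gt;A toy Go sketch of the catalog idea: a map keyed by provider/model gives O(1) pricing lookups on the request hot path, with a rate tier that switches above the 128k-token boundary. The rates and field names below are made up for illustration, not real prices:&lt;/p&gt;

```go
package main

import "fmt"

// ModelPricing holds illustrative per-token rates, with a separate input
// rate that applies once the context crosses the tier boundary.
type ModelPricing struct {
	InputPerTok       float64
	OutputPerTok      float64
	TieredInputPerTok float64 // applies above TierBoundary tokens
	TierBoundary      int
}

// catalog mimics an in-memory model catalog: map lookups are O(1),
// so pricing never adds meaningful latency to the request path.
var catalog = map[string]ModelPricing{
	"openai/gpt-4": {
		InputPerTok:       0.00003,
		OutputPerTok:      0.00006,
		TieredInputPerTok: 0.00006,
		TierBoundary:      128000,
	},
}

// cost applies the tiered input rate when the context is large enough.
func cost(key string, inTok, outTok int) float64 {
	p := catalog[key]
	inRate := p.InputPerTok
	if inTok > p.TierBoundary {
		inRate = p.TieredInputPerTok
	}
	return float64(inTok)*inRate + float64(outTok)*p.OutputPerTok
}

func main() {
	fmt.Printf("$%.4f\n", cost("openai/gpt-4", 1000, 500))
}
```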

&lt;h3&gt;
  
  
  Four-tier &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;This is where spend management happens. Bifrost supports budgets at four levels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Customer&lt;/strong&gt; - set a spending cap for an entire customer account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team&lt;/strong&gt; - limit spend per team within a customer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;Virtual Key&lt;/a&gt;&lt;/strong&gt; - control costs per API key (useful for per-feature or per-environment budgets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.getbifrost.ai/quickstart/gateway/provider-configuration" rel="noopener noreferrer"&gt;Provider Config&lt;/a&gt;&lt;/strong&gt; - cap total spend on a specific provider&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each budget has a &lt;code&gt;max_limit&lt;/code&gt;, a &lt;code&gt;reset_duration&lt;/code&gt; (daily, weekly, monthly), and tracks &lt;code&gt;current_usage&lt;/code&gt; in real time.&lt;/p&gt;

&lt;p&gt;Here is what creating a customer with a budget looks like via the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:8080/api/governance/customers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "acme-corp",
    "budget": {
      "max_limit": 500,
      "reset_duration": "monthly"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes the budget object with &lt;code&gt;current_usage&lt;/code&gt; tracked automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"customer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cust-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"acme-corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bdgt-xyz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"max_limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reset_duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"monthly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"current_usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;current_usage&lt;/code&gt; hits &lt;code&gt;max_limit&lt;/code&gt;, requests are rejected. No surprises on the invoice.&lt;/p&gt;
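&lt;p&gt;The enforcement logic can be sketched in a few lines of Go. This is a simplified, single-threaded illustration of the idea (check usage before forwarding, reject if the cap would be exceeded), not Bifrost's actual implementation:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// Budget mirrors the fields described above: a cap and a running total.
type Budget struct {
	MaxLimit     float64
	CurrentUsage float64
}

var errBudgetExceeded = errors.New("budget exceeded: request rejected before reaching the provider")

// authorize runs before the request goes upstream. A rejected request
// incurs no provider cost and does not count toward usage.
func (b *Budget) authorize(estimatedCost float64) error {
	if b.CurrentUsage+estimatedCost > b.MaxLimit {
		return errBudgetExceeded
	}
	b.CurrentUsage += estimatedCost
	return nil
}

func main() {
	b := new(Budget)
	b.MaxLimit = 500
	fmt.Println(b.authorize(499.50)) // allowed: usage stays under the cap
	fmt.Println(b.authorize(1.00))   // rejected: would push usage past 500
}
```

The important property, covered later in this post, is that this check sits in the hot path: it must run synchronously, before the upstream call, or the money is already spent by the time the budget trips.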

&lt;h3&gt;
  
  
  &lt;a href="https://docs.getbifrost.ai/features/observability" rel="noopener noreferrer"&gt;LogStore&lt;/a&gt;: per-request cost audit trail
&lt;/h3&gt;

&lt;p&gt;Every request that passes through Bifrost gets &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-logs" rel="noopener noreferrer"&gt;logged with full cost data&lt;/a&gt;. The LogStore captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provider and model used&lt;/li&gt;
&lt;li&gt;Input tokens, output tokens, total tokens&lt;/li&gt;
&lt;li&gt;Calculated cost (broken down into input cost, output cost, request cost, total cost)&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Status (success or error)&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;li&gt;Full input/output content (serialized as JSON)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can query this data with filters. Want to see all requests to OpenAI that cost more than $0.10 in the last hour? That is a single API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://localhost:8080/api/logs/search &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "filters": {
      "providers": ["openai"],
      "min_cost": 0.10,
      "start_time": "2026-03-31T00:00:00Z"
    },
    "pagination": {
      "limit": 50,
      "sort_by": "cost",
      "order": "desc"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes &lt;a href="https://docs.getbifrost.ai/api-reference/logging/get-log-statistics" rel="noopener noreferrer"&gt;aggregated stats&lt;/a&gt; alongside individual logs: total requests, success rate, average latency, total tokens, and total cost for the query. This is the data you need for cost attribution and chargeback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;

&lt;p&gt;You can have this running in under a minute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @maximhq/bifrost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with Docker if you prefer containerized deployment. Then point your LLM calls at the Bifrost endpoint instead of directly at the provider — it works as a &lt;a href="https://docs.getbifrost.ai/features/drop-in-replacement" rel="noopener noreferrer"&gt;drop-in replacement&lt;/a&gt; for the &lt;a href="https://docs.getbifrost.ai/integrations/openai-sdk" rel="noopener noreferrer"&gt;OpenAI SDK&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/integrations/anthropic-sdk" rel="noopener noreferrer"&gt;Anthropic SDK&lt;/a&gt;, and &lt;a href="https://docs.getbifrost.ai/integrations/bedrock-sdk" rel="noopener noreferrer"&gt;Bedrock SDK&lt;/a&gt;. Cost tracking, budget enforcement, and logging happen automatically.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://docs.getbifrost.ai/quickstart/gateway/setting-up" rel="noopener noreferrer"&gt;setup docs&lt;/a&gt; for configuration details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cache-aware cost tracking
&lt;/h2&gt;

&lt;p&gt;This is a detail that matters more than you would expect.&lt;/p&gt;

&lt;p&gt;Bifrost includes a dual-layer &lt;a href="https://docs.getbifrost.ai/features/semantic-caching" rel="noopener noreferrer"&gt;semantic cache&lt;/a&gt; (exact hash matching + semantic similarity via Weaviate). When a request hits the cache, the cost calculation changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct cache hit&lt;/strong&gt; (exact match): zero cost. The response comes from cache, no provider API call is made.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic cache hit&lt;/strong&gt; (similar query found): the cost is the embedding generation cost only. No model inference cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache miss with storage&lt;/strong&gt;: the cost is the base model usage plus the embedding generation cost for storing the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are not tracking cache-aware costs, your cost reports will overcount. Every cache hit that gets reported at full model price inflates your numbers and hides the ROI of caching.&lt;/p&gt;
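&lt;p&gt;The three cases above can be captured in a small Go sketch; &lt;code&gt;effectiveCost&lt;/code&gt; and the sample prices are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// effectiveCost models the three cache outcomes. baseCost is the full
// model inference cost; embedCost is the cost of generating one
// embedding for the semantic lookup or for storing the result.
func effectiveCost(outcome string, baseCost, embedCost float64) float64 {
	switch outcome {
	case "direct_hit": // exact hash match: no provider call, no embedding
		return 0
	case "semantic_hit": // similar query found: embedding cost only
		return embedCost
	default: // cache miss: full model cost plus embedding for storage
		return baseCost + embedCost
	}
}

func main() {
	base, embed := 0.0200, 0.0001
	for _, o := range []string{"direct_hit", "semantic_hit", "miss"} {
		fmt.Printf("%-12s $%.4f\n", o, effectiveCost(o, base, embed))
	}
}
```

A cost report that books every one of these outcomes at full model price is the overcounting problem described above.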

&lt;h2&gt;
  
  
  How other tools handle cost tracking
&lt;/h2&gt;

&lt;p&gt;Credit where it is due. There are several tools in this space, and they each take a different approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; is a proxy-based observability platform. It logs requests and provides cost analytics through a dashboard. The cost tracking is solid, with per-request granularity. Where it differs from Bifrost: Helicone is primarily an observability tool. Budget enforcement and cache-aware cost calculations are not its focus. It is a good choice if you want analytics without gateway-level controls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt; acts as a unified API layer across multiple LLM providers. It handles routing and gives you a single bill, which simplifies accounting. However, OpenRouter is a hosted proxy — your requests pass through their infrastructure. There is no self-hosted option, no budget enforcement at the gateway level, and no per-customer or per-team spend hierarchy. If you need cost attribution beyond "which model was called," you will need to build that yourself on top of their logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS API Gateway + Bedrock&lt;/strong&gt; is what many AWS-native teams reach for. You get IAM-based access control and CloudWatch metrics. The limitation is that cost tracking is coarse-grained — you get aggregate billing through AWS Cost Explorer, not per-request cost breakdowns tied to your internal teams or customers. Building a four-tier budget hierarchy on top of AWS services means stitching together Lambda, DynamoDB, and custom billing logic. It works but it is a lot of glue code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kong AI Gateway&lt;/strong&gt; and &lt;strong&gt;Cloudflare AI Gateway&lt;/strong&gt; both provide rate limiting and basic analytics for AI API traffic. Kong gives you plugin-based extensibility, and Cloudflare gives you edge caching and DDoS protection. Neither provides built-in per-request cost calculation with multi-modal pricing awareness, and neither offers the kind of &lt;a href="https://docs.getbifrost.ai/features/governance/budget-and-limits" rel="noopener noreferrer"&gt;budget hierarchy&lt;/a&gt; where you can set spending caps at the customer, team, and key level with automatic enforcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiteLLM&lt;/strong&gt; is the most well-known Python-based proxy. It supports cost tracking and has wide model coverage. The trade-off is performance. LiteLLM adds roughly 8ms of latency overhead per request. Bifrost adds 11 microseconds, roughly 700x less. At 5,000 RPS, that difference compounds. If your use case is low-throughput internal tooling, LiteLLM works fine. If you are running production workloads at scale, the latency overhead matters.&lt;/p&gt;

&lt;p&gt;The math is straightforward: at 5,000 requests per second, 8ms of overhead per request adds up to 40 seconds of cumulative latency per second of wall time. At 11 microseconds, it is 0.055 seconds.&lt;/p&gt;
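&lt;p&gt;The arithmetic, spelled out in Go:&lt;/p&gt;

```go
package main

import "fmt"

// cumulativeOverhead is the total gateway-added latency accumulated per
// second of wall time: requests per second times per-request overhead.
func cumulativeOverhead(rps, perRequestSeconds float64) float64 {
	return rps * perRequestSeconds
}

func main() {
	rps := 5000.0
	fmt.Printf("8 ms per request:  %.3f s of added latency per second\n", cumulativeOverhead(rps, 0.008))
	fmt.Printf("11 µs per request: %.3f s of added latency per second\n", cumulativeOverhead(rps, 0.000011))
}
```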

&lt;h2&gt;
  
  
  What we learned building this
&lt;/h2&gt;

&lt;p&gt;A few things surprised us during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing data goes stale fast.&lt;/strong&gt; Providers update pricing regularly. We started with a static pricing file and quickly realized it needed to be auto-synced. The 24-hour sync interval with O(1) memory lookups was the balance we settled on. You can also trigger a &lt;a href="https://docs.getbifrost.ai/api-reference/configuration/force-pricing-sync" rel="noopener noreferrer"&gt;manual pricing sync&lt;/a&gt; via &lt;code&gt;POST /api/pricing/force-sync&lt;/code&gt; if a provider drops prices and you want immediate accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget enforcement needs to be in the hot path.&lt;/strong&gt; We tried implementing budgets as an async check initially. The problem: by the time the async check ran, the request was already sent to the provider and the cost was incurred. Budget checks have to happen before the request goes upstream. That is why Bifrost handles it at the gateway layer with in-memory state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-modal cost calculation is harder than it looks.&lt;/strong&gt; Text-only cost is straightforward: multiply tokens by price per token. But when a request includes images, the cost depends on the image resolution and the token context length. Audio adds per-second pricing. Some models charge per character instead of per token. The Model Catalog handles all of this, but getting it right required modeling each provider's pricing structure individually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost attribution needs hierarchy.&lt;/strong&gt; Flat per-key budgets are not enough for real organizations. An engineering team needs to know: "How much is Customer X spending? How much of that is Team Y? Which &lt;a href="https://docs.getbifrost.ai/features/governance/virtual-keys" rel="noopener noreferrer"&gt;virtual key&lt;/a&gt; is burning through budget?" That is why we built the four-tier hierarchy (Customer, Team, Virtual Key, Provider Config). You can &lt;a href="https://docs.getbifrost.ai/api-reference/governance/create-virtual-key" rel="noopener noreferrer"&gt;create virtual keys via the API&lt;/a&gt; and attach budgets to each level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;LLM cost management is not optional for production systems. If you are routing requests across multiple providers without per-request cost tracking, budget enforcement, and cache-aware calculations, you are flying blind. For enterprise teams, Bifrost also supports &lt;a href="https://docs.getbifrost.ai/enterprise/audit-logs" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt;, &lt;a href="https://docs.getbifrost.ai/enterprise/log-exports" rel="noopener noreferrer"&gt;log exports&lt;/a&gt;, and &lt;a href="https://docs.getbifrost.ai/enterprise/intelligent-load-balancing" rel="noopener noreferrer"&gt;intelligent load balancing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt; is open-source, written in Go, and runs with a single command. It handles cost tracking at the gateway layer so your application code does not have to.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://git.new/bifrostrepo" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;Website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are dealing with LLM spend management, give it a try and let us know what is missing. We are actively building based on what teams actually need.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>LiteLLM vs Bifrost: Why the Supply Chain Attack Changes Everything for LLM Gateways</title>
      <dc:creator>Pranay Batta</dc:creator>
      <pubDate>Sat, 28 Mar 2026 05:10:03 +0000</pubDate>
      <link>https://forem.com/pranay_batta/litellm-vs-bifrost-why-the-supply-chain-attack-changes-everything-for-llm-gateways-b9l</link>
      <guid>https://forem.com/pranay_batta/litellm-vs-bifrost-why-the-supply-chain-attack-changes-everything-for-llm-gateways-b9l</guid>
      <description>&lt;p&gt;If you're running LiteLLM in production, the March 2026 supply chain attack probably got your attention. Mine too. I spent the past few days digging into what happened, why it happened, and what it means for anyone choosing an LLM gateway in 2026.&lt;/p&gt;

&lt;p&gt;This is not a hit piece. LiteLLM is a solid project with massive adoption. But this incident exposed something structural that every engineering team needs to think about. And it happens to make the case for Bifrost, a Go-based alternative, in ways that go beyond the usual performance benchmarks.&lt;/p&gt;

&lt;p&gt;Let's break it all down.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Two backdoored versions of LiteLLM (1.82.7, 1.82.8) were published to PyPI on March 24, 2026, via stolen credentials.&lt;/li&gt;
&lt;li&gt;The malware stole SSH keys, AWS/GCP/Azure credentials, and Kubernetes secrets. It used Python's &lt;code&gt;.pth&lt;/code&gt; persistence mechanism to survive across interpreter restarts.&lt;/li&gt;
&lt;li&gt;DSPy, MLflow, CrewAI, OpenHands, and Arize Phoenix all pulled the compromised version.&lt;/li&gt;
&lt;li&gt;Bifrost is a Go-based LLM gateway that compiles to a single binary. The attack vector that hit LiteLLM simply does not exist in its architecture.&lt;/li&gt;
&lt;li&gt;Beyond security, Bifrost adds 11 microseconds of overhead per request vs LiteLLM's roughly 8ms, supports 20+ providers, offers semantic caching via Weaviate, and has a four-tier budget hierarchy.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Happened: The Full Attack Chain
&lt;/h2&gt;

&lt;p&gt;Here's the sequence of events, based on &lt;a href="https://snyk.io/articles/poisoned-security-scanner-backdooring-litellm/" rel="noopener noreferrer"&gt;Snyk's detailed investigation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: The Trivy GitHub Action was compromised.&lt;/strong&gt; A group called TeamPCP tampered with the widely-used Trivy security scanner GitHub Action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: LiteLLM's CI/CD pipeline pulled the compromised Trivy Action.&lt;/strong&gt; Because LiteLLM's workflow used an unpinned version of the Trivy GitHub Action (not pinned to a specific SHA), the compromised version ran inside LiteLLM's CI environment.&lt;/p&gt;
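
&lt;p&gt;For context, the fix is one line of workflow YAML. The action name is real; the SHA below is illustrative, not taken from the actual workflow:&lt;/p&gt;

```yaml
# Unpinned: "master" is mutable, so a compromised release runs in your CI.
- uses: aquasecurity/trivy-action@master

# Pinned: an immutable commit SHA; the trailing comment is informational only.
- uses: aquasecurity/trivy-action@915b19bbe73b92a6cf82a1bc12b087c9a19a5fe2  # 0.28.0
```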

&lt;p&gt;&lt;strong&gt;Step 3: The malicious Trivy Action exfiltrated LiteLLM's &lt;code&gt;PYPI_PUBLISH&lt;/code&gt; token.&lt;/strong&gt; With that token, the attackers could publish any package version to PyPI under LiteLLM's name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Two backdoored versions (1.82.7, 1.82.8) were published to PyPI.&lt;/strong&gt; These looked like normal LiteLLM updates. Anyone running &lt;code&gt;pip install --upgrade litellm&lt;/code&gt; got them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: The malware deployed a &lt;code&gt;.pth&lt;/code&gt; persistence file.&lt;/strong&gt; This is the part that needs explaining.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are .pth files?
&lt;/h3&gt;

&lt;p&gt;If you're not deep into Python internals, &lt;code&gt;.pth&lt;/code&gt; files might be new to you. They live in Python's &lt;code&gt;site-packages&lt;/code&gt; directory, and any line in them that starts with &lt;code&gt;import&lt;/code&gt; is executed automatically every time the Python interpreter starts. Not when you import a specific package. Every single time Python runs, whatever the script is doing.&lt;/p&gt;

&lt;p&gt;The attackers placed a &lt;code&gt;.pth&lt;/code&gt; file that loaded their malware on every Python interpreter startup. It did not matter whether your code imported &lt;code&gt;litellm&lt;/code&gt; or not. If the package was installed in the environment, the malware was active.&lt;/p&gt;
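
&lt;p&gt;You can demonstrate the mechanism harmlessly: &lt;code&gt;site.addsitedir&lt;/code&gt; applies the same &lt;code&gt;.pth&lt;/code&gt; processing the interpreter performs at startup:&lt;/p&gt;

```python
# Benign demonstration of the .pth hook mechanism. site.py executes
# any line of a .pth file that starts with "import" when it processes
# a site directory, which is exactly how the attackers got persistence.
import os
import site
import sys
import tempfile

d = tempfile.mkdtemp()
with open(os.path.join(d, "demo.pth"), "w") as f:
    # One line, starting with "import": it runs as code, not as a path.
    f.write('import sys; sys._pth_demo = "hook executed"\n')

site.addsitedir(d)  # same processing .pth files get at interpreter startup

print(getattr(sys, "_pth_demo", "not run"))  # hook executed
```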

&lt;p&gt;&lt;strong&gt;What the malware stole:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSH private keys&lt;/li&gt;
&lt;li&gt;AWS, GCP, and Azure credentials&lt;/li&gt;
&lt;li&gt;Kubernetes secrets&lt;/li&gt;
&lt;li&gt;Crypto wallet keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 6: The attackers used 73 compromised GitHub accounts&lt;/strong&gt; to spam the disclosure issue with noise and eventually closed it using stolen maintainer credentials, trying to suppress the report.&lt;/p&gt;

&lt;p&gt;The backdoored versions were live on PyPI for approximately 3 hours. LiteLLM has 3.4 million+ daily downloads. You can do the math on the blast radius.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Architecture Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Let's talk about why this specific attack cannot happen to &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;Bifrost&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not just "Bifrost is written in Go, so it's safe." That would be a lazy argument. The actual reasons are architectural, and they matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  No site-packages directory
&lt;/h3&gt;

&lt;p&gt;Python packages install into &lt;code&gt;site-packages&lt;/code&gt;. That directory is a shared space where any installed package can drop files, including &lt;code&gt;.pth&lt;/code&gt; files that execute on interpreter startup. This is the mechanism the LiteLLM attackers exploited.&lt;/p&gt;

&lt;p&gt;Go compiles to a single static binary. There is no &lt;code&gt;site-packages&lt;/code&gt; equivalent. There is no shared directory where a compromised dependency could drop a persistence mechanism. The binary is the binary.&lt;/p&gt;

&lt;h3&gt;
  
  
  No .pth hook mechanism
&lt;/h3&gt;

&lt;p&gt;Python's &lt;code&gt;.pth&lt;/code&gt; file execution is a feature, not a bug. It exists for legitimate reasons (configuring import paths, running initialization code). But it also means any package you install can run arbitrary code on every Python startup without your knowledge or consent.&lt;/p&gt;

&lt;p&gt;Go has no equivalent mechanism. When you compile a Go binary, what goes in is what comes out. There is no startup hook that third-party code can inject into after compilation.&lt;/p&gt;

&lt;h3&gt;
  
  
  No transitive pip dependency chain
&lt;/h3&gt;

&lt;p&gt;LiteLLM has a substantial dependency tree. Each of those dependencies has its own dependencies. Each one is a potential attack surface. When you &lt;code&gt;pip install litellm&lt;/code&gt;, you're trusting not just the LiteLLM maintainers but every maintainer of every transitive dependency.&lt;/p&gt;

&lt;p&gt;Bifrost ships as a compiled binary via &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; or Docker (&lt;code&gt;docker pull maximhq/bifrost&lt;/code&gt;). Dependencies are resolved and compiled at build time by the Bifrost team. You're running a single binary, not managing a dependency tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  The CI/CD surface area is smaller
&lt;/h3&gt;

&lt;p&gt;The LiteLLM attack started with a compromised GitHub Action in CI/CD. Go binaries distributed via npm or Docker reduce the CI/CD surface area because the compilation and dependency resolution happen upstream, not in your pipeline.&lt;/p&gt;

&lt;p&gt;This is not about Go being "more secure" than Python as a language. It's about the deployment model. A compiled binary distributed as a single artifact has a fundamentally smaller attack surface than a package installed via a package manager with a transitive dependency tree and runtime hook mechanisms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side-by-Side Feature Comparison
&lt;/h2&gt;

&lt;p&gt;Here's an honest look at both gateways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;th&gt;Bifrost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install&lt;/code&gt;, Docker&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;npx&lt;/code&gt;, Docker, Go binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100+ providers&lt;/td&gt;
&lt;td&gt;20+ providers (OpenAI, Anthropic, Bedrock, Azure, Gemini, Vertex AI, Groq, Mistral, Cohere, xAI, and more) + custom providers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overhead per request&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~8ms&lt;/td&gt;
&lt;td&gt;11 microseconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies (Python GIL limits)&lt;/td&gt;
&lt;td&gt;5,000 RPS sustained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis-based key-value&lt;/td&gt;
&lt;td&gt;Weaviate-powered dual-layer semantic caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Budget management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic spend tracking&lt;/td&gt;
&lt;td&gt;Four-tier hierarchy (Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Full MCP gateway with four connection types, sub-3ms latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Web UI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dashboard available&lt;/td&gt;
&lt;td&gt;Built-in Web UI for visual setup, monitoring, and governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (drop-in replacement, single URL change)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supply chain surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PyPI + transitive deps + .pth hooks&lt;/td&gt;
&lt;td&gt;Single compiled binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Config files, environment variables&lt;/td&gt;
&lt;td&gt;Zero-config start, Web UI, API, or config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Let me be upfront: LiteLLM's provider count is significantly higher. If you need access to 100+ providers through a single gateway, that is a real advantage. Bifrost supports 20+ providers natively with the ability to add custom providers, which covers most production use cases, but it is not the same breadth.&lt;/p&gt;
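
&lt;p&gt;The "single URL change" in the table above is literal. Here is a standard-library sketch; the gateway path and port are assumptions based on the documented &lt;code&gt;localhost:8080&lt;/code&gt; default, so check the docs for the exact route:&lt;/p&gt;

```python
import json
import urllib.request

# Everything below is a standard OpenAI-style chat completion request;
# only BASE differs between talking to OpenAI directly and talking to
# a local gateway.
BASE = "http://localhost:8080/v1"     # gateway (assumed default port/path)
# BASE = "https://api.openai.com/v1"  # direct to OpenAI

req = urllib.request.Request(
    BASE + "/chat/completions",
    data=json.dumps({
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "hello"}],
    }).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer sk-placeholder"},
)
# urllib.request.urlopen(req) would send it; omitted here since no
# gateway is running in this sketch.
print(req.full_url)
```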




&lt;h2&gt;
  
  
  Performance Deep Dive: What the Numbers Actually Mean
&lt;/h2&gt;

&lt;p&gt;You'll see "11 microseconds vs 8 milliseconds" in Bifrost's benchmarks. That's roughly a 700x difference. But what does it mean in practice?&lt;/p&gt;

&lt;p&gt;Let's do the math at different scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At 10,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 10,000 x 8ms = 80 seconds of cumulative gateway latency&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 10,000 x 11 microseconds = 0.11 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At 100,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 100,000 x 8ms = 800 seconds (~13.3 minutes)&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 100,000 x 11 microseconds = 1.1 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At 1,000,000 requests per day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LiteLLM overhead: 1,000,000 x 8ms = 8,000 seconds (~2.2 hours)&lt;/li&gt;
&lt;li&gt;Bifrost overhead: 1,000,000 x 11 microseconds = 11 seconds&lt;/li&gt;
&lt;/ul&gt;
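
&lt;p&gt;The same arithmetic as a script you can rerun with your own volumes:&lt;/p&gt;

```python
# The cumulative-overhead arithmetic above, parameterized.
LITELLM_OVERHEAD_S = 0.008      # ~8 ms per request
BIFROST_OVERHEAD_S = 0.000011   # 11 microseconds per request

for daily_requests in (10_000, 100_000, 1_000_000):
    litellm_total = daily_requests * LITELLM_OVERHEAD_S
    bifrost_total = daily_requests * BIFROST_OVERHEAD_S
    print(f"{daily_requests:,} req/day: "
          f"LiteLLM {litellm_total:,.2f}s vs Bifrost {bifrost_total:.2f}s")
```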

&lt;p&gt;At low volume, the difference doesn't matter much. Your LLM provider's response time (hundreds of milliseconds to seconds) dwarfs the gateway overhead either way.&lt;/p&gt;

&lt;p&gt;But at scale, the difference becomes real. 13 minutes of cumulative latency at 100K requests/day isn't catastrophic, but it adds up across your user base. And 2.2 hours at a million requests/day starts affecting tail latencies and user experience, especially for streaming responses where gateway overhead is felt on every chunk.&lt;/p&gt;

&lt;p&gt;The 5,000 RPS sustained throughput from Bifrost also matters. Python's GIL (Global Interpreter Lock) creates a concurrency ceiling that Go simply doesn't have. If you're running high-concurrency workloads, this is a material difference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Here's What This Means for Your Stack
&lt;/h2&gt;

&lt;p&gt;If you're evaluating LLM gateways right now, the LiteLLM incident should change your evaluation criteria. Not because LiteLLM is bad software, but because it highlighted a category of risk that most teams weren't thinking about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Questions to ask about any LLM gateway:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What's the dependency footprint?&lt;/strong&gt; How many transitive dependencies does it pull in? Each one is a potential attack surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's the deployment model?&lt;/strong&gt; Is it a package you install into your environment, or a standalone binary/container?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does it have runtime hook mechanisms?&lt;/strong&gt; Can dependencies execute code at startup without explicit imports?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How is it distributed?&lt;/strong&gt; Via a package manager with mutable versions, or via immutable artifacts?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What's in the CI/CD chain?&lt;/strong&gt; Are GitHub Actions pinned by SHA? Are publish tokens scoped and rotated?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren't questions most teams were asking about their LLM gateway a month ago. They should be now.&lt;/p&gt;




&lt;h2&gt;
  
  
  When LiteLLM Still Makes Sense
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this. There are real scenarios where LiteLLM is the better choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need access to 100+ providers.&lt;/strong&gt; LiteLLM's provider breadth is unmatched. If you're working with niche or specialized providers that Bifrost doesn't support yet, LiteLLM gets you there faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your entire stack is Python and you want deep integration.&lt;/strong&gt; LiteLLM plays well with the Python ML ecosystem. If you're already in that world and need tight integration with LangChain, LlamaIndex, or similar frameworks, LiteLLM fits naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need it as a library, not a gateway.&lt;/strong&gt; LiteLLM can be imported and used as a Python library within your application code. Bifrost is a standalone gateway service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are your primary requirement, LiteLLM may still be right for you. Just audit your versions, pin your dependencies, and check for &lt;code&gt;.pth&lt;/code&gt; files in your &lt;code&gt;site-packages&lt;/code&gt; directory.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Bifrost Is the Better Choice
&lt;/h2&gt;

&lt;p&gt;Bifrost wins when your priorities look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security surface area matters to you.&lt;/strong&gt; If you're in a regulated industry, handle sensitive data, or simply don't want to worry about Python supply chain attacks in your infrastructure layer, a compiled Go binary is a different risk profile entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance at scale.&lt;/strong&gt; If you're pushing high request volumes and need minimal gateway overhead, 11 microseconds vs 8 milliseconds is not a rounding error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want governance out of the box.&lt;/strong&gt; Bifrost's four-tier budget hierarchy (Customer &amp;gt; Team &amp;gt; Virtual Key &amp;gt; Provider Config) with independent budget checking at each level gives you cost control that's built into the gateway, not bolted on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching.&lt;/strong&gt; Bifrost's Weaviate-powered dual-layer caching understands the meaning of requests, not just exact matches. Similar queries hit the cache even if they're worded differently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP gateway support.&lt;/strong&gt; If you're building agentic applications, Bifrost has native MCP support with four connection types and sub-3ms tool execution latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config setup.&lt;/strong&gt; Run &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; and you have a working gateway with a Web UI at &lt;code&gt;localhost:8080&lt;/code&gt;. No config files, no environment variables, no setup ceremony.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Question
&lt;/h2&gt;

&lt;p&gt;Should your LLM gateway be a Python package at all?&lt;/p&gt;

&lt;p&gt;This isn't Python-bashing. Python is great for ML research, data science, prototyping, and application-level code. But your LLM gateway sits in the critical path of every AI request your application makes. It's infrastructure.&lt;/p&gt;

&lt;p&gt;Infrastructure components have different requirements than application code. They need to be fast, stable, have minimal dependencies, and present the smallest possible attack surface. This is why web servers, databases, load balancers, and message queues are almost never written in Python. They're written in C, C++, Go, or Rust.&lt;/p&gt;

&lt;p&gt;The LiteLLM incident didn't happen because of a bug in LiteLLM's code. It happened because of a structural property of the Python packaging ecosystem. That's a different kind of risk, and it's one that applies to any Python package in your infrastructure layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Action Items
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you're currently using LiteLLM:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check your installed version immediately. Versions 1.82.7 and 1.82.8 are compromised.&lt;/li&gt;
&lt;li&gt;Search for &lt;code&gt;.pth&lt;/code&gt; files in your Python &lt;code&gt;site-packages&lt;/code&gt; directories.&lt;/li&gt;
&lt;li&gt;Rotate all credentials that were accessible from environments where LiteLLM was installed (SSH keys, cloud provider credentials, Kubernetes secrets).&lt;/li&gt;
&lt;li&gt;Pin your GitHub Actions by SHA, not by tag.&lt;/li&gt;
&lt;li&gt;Evaluate whether a compiled gateway is a better fit for your security posture.&lt;/li&gt;
&lt;/ol&gt;
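
&lt;p&gt;For step 2, a small stdlib-only audit script. Note that legitimate &lt;code&gt;.pth&lt;/code&gt; files exist (editable installs, path configuration), so review each hit rather than assuming compromise:&lt;/p&gt;

```python
# List every .pth file visible to the current interpreter.
import os
import site

dirs = []
try:
    dirs.extend(site.getsitepackages())
except AttributeError:
    pass  # some embedded interpreters lack getsitepackages()
dirs.append(site.getusersitepackages())

found = []
for d in dirs:
    if os.path.isdir(d):
        for name in sorted(os.listdir(d)):
            if name.endswith(".pth"):
                found.append(os.path.join(d, name))

for path in found:
    print(path)
```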

&lt;p&gt;&lt;strong&gt;If you're evaluating LLM gateways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try Bifrost: &lt;code&gt;npx -y @maximhq/bifrost&lt;/code&gt; (takes 30 seconds)&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href="https://git.new/bifrost" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to see the codebase&lt;/li&gt;
&lt;li&gt;Read the &lt;a href="https://getmax.im/bifrostdocs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for the full feature set&lt;/li&gt;
&lt;li&gt;Visit the &lt;a href="https://getmax.im/bifrost-home" rel="noopener noreferrer"&gt;website&lt;/a&gt; for architecture details&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM gateway space is going to look different after this incident. Supply chain security just became an evaluation criterion, and compiled gateways have a structural advantage that no amount of Python dependency scanning can match.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>python</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
