<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AwxGlobal</title>
    <description>The latest articles on Forem by AwxGlobal (@awxglobal).</description>
    <link>https://forem.com/awxglobal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907767%2Ffd1b18a3-e608-4663-901b-77f3b2e9c47c.png</url>
      <title>Forem: AwxGlobal</title>
      <link>https://forem.com/awxglobal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/awxglobal"/>
    <language>en</language>
    <item>
      <title>Preventing CrewAI Budget Overruns: Hard Limits Per Agent Role</title>
      <dc:creator>AwxGlobal</dc:creator>
      <pubDate>Mon, 04 May 2026 07:00:43 +0000</pubDate>
      <link>https://forem.com/awxglobal/preventing-crewai-budget-overruns-hard-limits-per-agent-role-3njp</link>
      <guid>https://forem.com/awxglobal/preventing-crewai-budget-overruns-hard-limits-per-agent-role-3njp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://awx-shredder.fly.dev/blog/preventing-crewai-budget-overruns-hard-limits-per-agent-role" rel="noopener noreferrer"&gt;awx-shredder.fly.dev/blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;Preventing CrewAI Budget Overruns: Hard Limits Per Agent Role&lt;/h1&gt;

&lt;p&gt;A multi-agent CrewAI workflow spun up in production last month and burned through $340 in API costs before anyone noticed. The culprit? A research agent stuck in a loop, making hundreds of GPT-4 calls to refine a single report. The agent eventually completed its task, but the invoice was brutal.&lt;/p&gt;

&lt;p&gt;CrewAI's built-in &lt;code&gt;max_rpm&lt;/code&gt; and &lt;code&gt;max_iter&lt;/code&gt; parameters help, but they're blunt instruments. RPM limits don't account for token usage variance—a single long-context call can cost 50x more than a short one. Iteration limits stop runaway loops but won't catch an agent that makes expensive calls within reasonable iteration counts.&lt;/p&gt;
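&lt;p&gt;For reference, both knobs are per-agent settings. A typical configuration, with hypothetical values, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from crewai import Agent

# Hypothetical caps: blunt, but better than nothing
researcher = Agent(
    role="Research Analyst",
    goal="Gather comprehensive market data",
    backstory="Senior analyst at a market research firm",
    max_rpm=10,   # at most 10 requests per minute
    max_iter=15,  # at most 15 reasoning iterations per task
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;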

&lt;p&gt;What you actually need is hard budget enforcement at the agent level, ideally denominated in dollars rather than tokens or requests.&lt;/p&gt;

&lt;h2&gt;Why Agent-Level Budgets Matter&lt;/h2&gt;

&lt;p&gt;In a typical CrewAI setup, different agents have wildly different cost profiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Research agents&lt;/strong&gt; make many calls with large contexts (expensive)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing agents&lt;/strong&gt; make quick classification calls (cheap)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writing agents&lt;/strong&gt; generate long-form content (moderate, predictable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A flat budget across all agents means your routing agent's allocation goes unused while your research agent blows past acceptable spend. You need granular control that maps to how each agent actually behaves in production.&lt;/p&gt;

&lt;h2&gt;The Naive Approach: Token Counting&lt;/h2&gt;

&lt;p&gt;Your first instinct might be to wrap the LLM client and count tokens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BudgetEnforcedLLM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget_dollars&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget_dollars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;budget_dollars&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent_dollars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoding_for_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Simplified pricing (actual pricing is more complex)
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_1k_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0015&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_1k_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.06&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent_dollars&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget_dollars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Budget exceeded: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent_dollars&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; / $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;budget_dollars&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Estimate input cost
&lt;/span&gt;        &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Count actual output tokens
&lt;/span&gt;        &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_1k_input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; 
                &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price_per_1k_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spent_dollars&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="c1"&gt;# Usage with CrewAI
&lt;/span&gt;&lt;span class="n"&gt;research_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BudgetEnforcedLLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-turbo-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;budget_dollars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gather comprehensive market data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;research_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in theory, but falls apart in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pricing complexity&lt;/strong&gt;: OpenAI's pricing varies by model version, has different rates for cached tokens, and changes frequently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counting edge cases&lt;/strong&gt;: Function calls, vision inputs, and prompt caching make accurate token estimation nearly impossible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No persistence&lt;/strong&gt;: Restarting your process resets budgets, making daily limits impossible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No observability&lt;/strong&gt;: You're flying blind until something breaks&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Production-Grade Budget Enforcement&lt;/h2&gt;

&lt;p&gt;The real solution is intercepting calls at the API level, not in your application code. This is where a proxy layer makes sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://awx-shredder.fly.dev" rel="noopener noreferrer"&gt;AWX Shredder&lt;/a&gt; is built specifically for this use case—it sits between your CrewAI agents and OpenAI's API, enforcing hard budget limits per agent role without requiring code changes beyond swapping the base URL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;

&lt;span class="c1"&gt;# Configure different budget headers for each agent role
&lt;/span&gt;&lt;span class="n"&gt;research_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-turbo-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://awx-shredder.fly.dev/proxy/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;default_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWX-Agent-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWX-Daily-Budget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-turbo-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://awx-shredder.fly.dev/proxy/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;default_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWX-Agent-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-AWX-Daily-Budget&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gather comprehensive market data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;research_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create engaging reports from research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;writer_llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy tracks actual API costs in real-time and returns HTTP 429 (rate limit) responses the moment an agent exceeds its daily budget. CrewAI handles this gracefully—the agent fails fast rather than racking up surprise costs.&lt;/p&gt;

&lt;h2&gt;Rolling Your Own Proxy&lt;/h2&gt;

&lt;p&gt;If you need full control, building a simple budget-enforcing proxy isn't difficult. The minimal viable version needs the following (a sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Redis instance for persisting spend by agent ID and date&lt;/li&gt;
&lt;li&gt;A FastAPI or Express server that proxies to &lt;code&gt;api.openai.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Middleware that checks budget before forwarding requests&lt;/li&gt;
&lt;li&gt;A parser for OpenAI's response to extract actual costs from the &lt;code&gt;usage&lt;/code&gt; field&lt;/li&gt;
&lt;/ul&gt;
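
&lt;p&gt;Here's a minimal sketch of that shape, assuming FastAPI, httpx, and a local Redis. The header names, pricing table, and error payload are illustrative, not AWX Shredder's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal budget-enforcing proxy sketch (illustrative, not production-ready)
import datetime
import os

import httpx
import redis.asyncio as redis
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
r = redis.Redis()
PRICE_PER_1K = {"input": 0.01, "output": 0.03}  # example gpt-4-turbo rates

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    agent_id = request.headers.get("x-agent-id", "default")
    budget = float(request.headers.get("x-daily-budget", "5.00"))
    key = f"spend:{agent_id}:{datetime.date.today().isoformat()}"

    # Check the budget BEFORE forwarding
    spent = float(await r.get(key) or 0.0)
    if spent &gt;= budget:
        return JSONResponse(status_code=429, content={"error": "daily budget exceeded"})

    body = await request.json()
    async with httpx.AsyncClient() as client:
        upstream = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json=body,
            timeout=60,
        )
    data = upstream.json()

    # Record the actual cost from the usage field
    usage = data.get("usage", {})
    cost = (usage.get("prompt_tokens", 0) / 1000 * PRICE_PER_1K["input"]
            + usage.get("completion_tokens", 0) / 1000 * PRICE_PER_1K["output"])
    await r.incrbyfloat(key, cost)
    await r.expire(key, 172800)  # keep two days of spend history

    return JSONResponse(status_code=upstream.status_code, content=data)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;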

&lt;p&gt;The tricky parts are handling streaming responses (costs aren't known until the stream completes) and dealing with OpenAI's eventual consistency in reporting costs. You'll also need to build alerting and dashboards, and handle key rotation.&lt;/p&gt;

&lt;p&gt;For most teams, the build-vs-buy calculation favors a managed solution unless budget enforcement is core to your product's IP.&lt;/p&gt;

&lt;h2&gt;Key Takeaways&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Set agent-level budgets based on each role's expected behavior, not flat limits across your crew&lt;/li&gt;
&lt;li&gt;Enforce budgets at the API proxy level, not in application code&lt;/li&gt;
&lt;li&gt;Use hard blocks (HTTP 429) rather than soft warnings—agents can't learn from budget alerts&lt;/li&gt;
&lt;li&gt;Start conservative: set daily budgets at 2x your observed costs, then tighten after a week of production data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today's action: audit your last month of OpenAI bills, break down costs by the logical "agent roles" you're running, and set per-role daily budgets at 150% of the highest daily spend you saw. Hard limits prevent one-off disasters while giving your agents room to operate.&lt;/p&gt;

</description>
      <category>crewai</category>
      <category>agents</category>
      <category>budget</category>
      <category>production</category>
    </item>
    <item>
      <title>Why Your LLM Agent Costs 10x More Than Your Estimate</title>
      <dc:creator>AwxGlobal</dc:creator>
      <pubDate>Sat, 02 May 2026 07:00:43 +0000</pubDate>
      <link>https://forem.com/awxglobal/why-your-llm-agent-costs-10x-more-than-your-estimate-4o78</link>
      <guid>https://forem.com/awxglobal/why-your-llm-agent-costs-10x-more-than-your-estimate-4o78</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://awx-shredder.fly.dev/blog/why-your-llm-agent-costs-10x-more-than-your-estimate" rel="noopener noreferrer"&gt;awx-shredder.fly.dev/blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;Why Your LLM Agent Costs 10x More Than Your Estimate&lt;/h1&gt;

&lt;p&gt;Your product manager approved the $500/month LLM budget. Two weeks later, you're staring at a $4,200 bill from OpenAI. The agent works perfectly in testing, but production is eating tokens like a memory leak eats RAM.&lt;/p&gt;

&lt;p&gt;I've debugged this exact scenario four times in the past year. The culprit is never a single smoking gun—it's the multiplication of hidden costs that developers systematically underestimate during planning.&lt;/p&gt;

&lt;h2&gt;The Token Math Nobody Does&lt;/h2&gt;

&lt;p&gt;Most developers estimate LLM costs like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1000 requests/day × 500 tokens/request × $0.002/1k tokens = $1/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This calculation assumes every request is a pristine, single-shot API call. Real agents don't work that way.&lt;/p&gt;

&lt;p&gt;Here's what actually happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompts are charged on every call.&lt;/strong&gt; That 800-token system prompt explaining your agent's role, output format, and business rules? It's not free. It's billed on every single request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool call overhead compounds.&lt;/strong&gt; Each function call requires the model to output JSON, your code to execute the function, and the results to be sent back along with the full conversation history. A single user request often triggers 3-5 tool calls.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conversation history grows with every turn.&lt;/strong&gt; If you're maintaining context across turns (and you probably are), each subsequent message includes all previous messages, so the tokens billed grow quadratically with conversation length.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's recalculate with reality:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What you estimated
&lt;/span&gt;&lt;span class="n"&gt;simple_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimate: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;simple_cost&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# $1.00/day
&lt;/span&gt;
&lt;span class="c1"&gt;# What actually happens
&lt;/span&gt;&lt;span class="n"&gt;system_prompt_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;
&lt;span class="n"&gt;avg_user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;
&lt;span class="n"&gt;avg_assistant_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
&lt;span class="n"&gt;tool_calls_per_request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.5&lt;/span&gt;  &lt;span class="c1"&gt;# Average across all requests
&lt;/span&gt;&lt;span class="n"&gt;tool_call_overhead&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;  &lt;span class="c1"&gt;# JSON formatting + function results
&lt;/span&gt;&lt;span class="n"&gt;conversation_turns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# Average conversation length
&lt;/span&gt;
&lt;span class="c1"&gt;# First turn
&lt;/span&gt;&lt;span class="n"&gt;turn_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system_prompt_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_user_input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_assistant_response&lt;/span&gt;

&lt;span class="c1"&gt;# Subsequent turns include history
&lt;/span&gt;&lt;span class="n"&gt;turn_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;turn_1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_user_input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_assistant_response&lt;/span&gt;
&lt;span class="n"&gt;turn_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;turn_2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_user_input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_assistant_response&lt;/span&gt;  
&lt;span class="n"&gt;turn_4&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;turn_3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_user_input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;avg_assistant_response&lt;/span&gt;

&lt;span class="c1"&gt;# Add tool call overhead
&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;turn_1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;turn_2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;turn_3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;turn_4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_calls_per_request&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;tool_call_overhead&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;conversation_turns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;real_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.002&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reality: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;real_cost&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# $12.40/day
&lt;/span&gt;
&lt;span class="n"&gt;multiplier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;real_cost&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;simple_cost&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hidden multiplier: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;multiplier&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 12.4x
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This 22x multiplier is before we account for the really expensive mistakes.&lt;/p&gt;

&lt;h2&gt;Retry Loops: The Silent Budget Killer&lt;/h2&gt;

&lt;p&gt;Retry logic is essential for production reliability. It's also where costs spiral out of control.&lt;/p&gt;

&lt;p&gt;Consider this common pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_agent_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks reasonable. But what happens when OpenAI has a bad day and timeouts spike to 15% of requests?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;15% of your 1000 daily requests now make 2-3 attempts&lt;/li&gt;
&lt;li&gt;Each retry sends the full prompt again (including that 800-token system prompt)&lt;/li&gt;
&lt;li&gt;Your token consumption jumps by 20-40% instantly (see the back-of-envelope after this list)&lt;/li&gt;
&lt;/ul&gt;
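
&lt;p&gt;The back-of-envelope behind that jump, with assumed numbers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Input-token inflation from retries (all numbers are assumptions)
requests = 1000
input_tokens = 1250      # system prompt + user message, resent on every attempt
retry_rate = 0.15        # share of requests that fail at least once
extra_attempts = 1.5     # average additional attempts per failing request

baseline = requests * input_tokens
extra = requests * retry_rate * extra_attempts * input_tokens
print(f"extra input tokens: {extra:,.0f} (+{extra / baseline:.1%})")  # +22.5%
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;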

&lt;p&gt;Worse yet: I've seen validation retry loops where the agent's output doesn't match the expected schema, so the developer adds logic to retry with error feedback. Each failed parse triggers another full API call with the previous attempt's context. A single malformed JSON response can cascade into 5-10 retry attempts.&lt;/p&gt;

&lt;h2&gt;Tool Call Explosion&lt;/h2&gt;

&lt;p&gt;Function calling feels efficient—until you look at the token counts.&lt;/p&gt;

&lt;p&gt;Every tool call follows this pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model decides to call a function (outputs JSON)&lt;/li&gt;
&lt;li&gt;Your code executes the function&lt;/li&gt;
&lt;li&gt;Results are formatted and sent back&lt;/li&gt;
&lt;li&gt;Model processes results and decides next step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step includes the full conversation history and system prompt. A research agent that calls &lt;code&gt;search()&lt;/code&gt;, then &lt;code&gt;fetch_url()&lt;/code&gt;, then &lt;code&gt;extract_data()&lt;/code&gt; isn't making three cheap calls—it's making three increasingly expensive calls as the context window fills with previous tool results.&lt;/p&gt;
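&lt;p&gt;A toy trace makes the growth visible. The token counts here are assumptions, not measurements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative context growth across three tool calls in one turn
system, history = 800, 150  # system prompt + user message

for step, tool_result in enumerate([400, 900, 600], start=1):  # search, fetch_url, extract_data
    request_tokens = system + history  # the full context is resent on every call
    history += 60 + tool_result        # model's JSON tool call + the tool's output
    print(f"call {step}: {request_tokens} input tokens")

# call 1: 950 input tokens
# call 2: 1410 input tokens
# call 3: 2370 input tokens
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;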

&lt;p&gt;The economics get brutal with GPT-4. A complex agent workflow that feels like "just a few tool calls" can easily consume 15,000-20,000 tokens per user request.&lt;/p&gt;

&lt;h2&gt;Production Reality Check&lt;/h2&gt;

&lt;p&gt;After you've shipped and the costs are running hot, you need visibility and hard limits. Setting OpenAI spending limits helps but doesn't give you per-agent granularity or prevent runaway costs before they hit your credit card.&lt;/p&gt;

&lt;p&gt;For production deployments where budget control is non-negotiable, tools like &lt;a href="https://awx-shredder.fly.dev" rel="noopener noreferrer"&gt;AWX Shredder&lt;/a&gt; act as a hard circuit breaker—an OpenAI-compatible proxy that blocks requests when an agent exceeds its daily budget. It takes one environment variable change and gives you real-time spend tracking with alerts before you blow through your allocation.&lt;/p&gt;

&lt;h2&gt;What You Should Do Today&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit your actual token consumption.&lt;/strong&gt; Log &lt;code&gt;usage.total_tokens&lt;/code&gt; from every API response for a week. Calculate the median and p95. You'll be surprised.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Count your system prompt tokens.&lt;/strong&gt; Use &lt;code&gt;tiktoken&lt;/code&gt; to get exact counts (see the snippet after this list). If your system prompt is over 500 tokens, consider whether every instruction is essential.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Track retry rates.&lt;/strong&gt; Add metrics for how often your retry logic actually fires. Set alerts when retry rates exceed 5%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model tool call patterns.&lt;/strong&gt; Log how many function calls the average request triggers. If it's more than 3, consider whether you can combine tools or reduce the decision tree.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set hard budget limits per agent.&lt;/strong&gt; Don't rely on cost estimates. Implement actual spending caps that prevent runaway costs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
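
&lt;p&gt;For item 2, a quick check looks like this (the prompt file path is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Count system prompt tokens exactly
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
with open("system_prompt.txt") as f:
    system_prompt = f.read()

print(f"system prompt: {len(encoding.encode(system_prompt))} tokens")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;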

&lt;p&gt;The gap between estimated and actual LLM costs isn't a rounding error—it's the difference between a sustainable product and a budget crisis. The math is straightforward once you account for what actually gets billed.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>llm</category>
      <category>cost</category>
      <category>agents</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>AwxGlobal</dc:creator>
      <pubDate>Fri, 01 May 2026 22:55:32 +0000</pubDate>
      <link>https://forem.com/awxglobal/-5234</link>
      <guid>https://forem.com/awxglobal/-5234</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123" class="crayons-story__hidden-navigation-link"&gt;Per-agent daily spend limits: the architecture every AI team needs&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/awxglobal" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907767%2Ffd1b18a3-e608-4663-901b-77f3b2e9c47c.png" alt="awxglobal profile" class="crayons-avatar__image" width="460" height="460"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/awxglobal" class="crayons-story__secondary fw-medium m:hidden"&gt;
              AwxGlobal
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                AwxGlobal
                
              
              &lt;div id="story-author-preview-content-3597616" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/awxglobal" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3907767%2Ffd1b18a3-e608-4663-901b-77f3b2e9c47c.png" class="crayons-avatar__image" alt="" width="460" height="460"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;AwxGlobal&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 1&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123" id="article-link-3597616"&gt;
          Per-agent daily spend limits: the architecture every AI team needs
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/architecture"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;architecture&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/agents"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;agents&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Per-agent daily spend limits: the architecture every AI team needs</title>
      <dc:creator>AwxGlobal</dc:creator>
      <pubDate>Fri, 01 May 2026 21:07:07 +0000</pubDate>
      <link>https://forem.com/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123</link>
      <guid>https://forem.com/awxglobal/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs-3123</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://awx-shredder.fly.dev/blog/per-agent-daily-spend-limits-the-architecture-every-ai-team-needs" rel="noopener noreferrer"&gt;awx-shredder.fly.dev/blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;Per-agent daily spend limits: the architecture every AI team needs&lt;/h1&gt;

&lt;p&gt;Your Slack bot just burned through $847 in four hours because a junior dev accidentally pushed a loop that called &lt;code&gt;gpt-4-turbo&lt;/code&gt; on every message edit event. Your customer support agent hit an infinite reasoning loop and racked up $2,300 in o1-preview costs before anyone noticed. These aren't hypothetical scenarios—they're the kind of incidents that happen weekly across AI engineering teams.&lt;/p&gt;

&lt;p&gt;The problem isn't that developers are careless. It's that LLM APIs have fundamentally different cost characteristics than traditional APIs. A single malformed request can cost $50. A logic bug can drain thousands before your monitoring alerts even fire. And when you're running multiple agents—research bots, customer support, data analysis tools—the blast radius of a single misconfigured agent can wipe out your entire API budget.&lt;/p&gt;

&lt;h2&gt;Why application-level budget checks fail&lt;/h2&gt;

&lt;p&gt;Most teams start with application-level budget enforcement. You add a counter in your database, increment it on each API call, and check before making requests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;current_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_daily_spend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_spend&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DAILY_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;BudgetExceededError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate and record cost
&lt;/span&gt;    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment_spend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks reasonable until you hit production. The cost calculation happens &lt;em&gt;after&lt;/em&gt; the API call completes. If your database write fails, you've lost spend tracking. If the process crashes between the API call and the database update, that cost vanishes. Race conditions mean multiple requests can check the budget simultaneously, all see they're under the limit, and fire off requests that collectively exceed it.&lt;/p&gt;

&lt;p&gt;More critically: this pattern requires every callsite in your codebase to route through your budget enforcement logic. Third-party libraries that call OpenAI directly bypass it entirely. That LangChain agent you integrated? It's not checking budgets. The new engineer who doesn't know about your internal wrapper? They import &lt;code&gt;openai&lt;/code&gt; directly and circumvent everything.&lt;/p&gt;

&lt;h2&gt;The proxy architecture&lt;/h2&gt;

&lt;p&gt;The robust solution is budget enforcement at the network layer. Every LLM API call flows through a proxy that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authenticates the agent&lt;/strong&gt; making the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checks current spend&lt;/strong&gt; against the daily limit &lt;em&gt;before&lt;/em&gt; forwarding to the LLM provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocks the request&lt;/strong&gt; immediately if the limit is exceeded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Records actual costs&lt;/strong&gt; from the LLM response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregates spend&lt;/strong&gt; across all instances of your application&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture makes budget enforcement impossible to bypass. Applications can't accidentally route around it because the proxy is configured at the network level via &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt;. Multiple application instances automatically share the same spend tracking because it's centralized in the proxy.&lt;/p&gt;

&lt;p&gt;Here's what the client-side configuration looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// points to proxy&lt;/span&gt;
  &lt;span class="na"&gt;defaultHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;X-Agent-ID&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;customer-support-bot&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// This call is budget-enforced automatically&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4-turbo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The proxy intercepts the request, checks whether &lt;code&gt;customer-support-bot&lt;/code&gt; has budget remaining today, and either forwards it to OpenAI or returns a 429 error. Your application code doesn't need to think about budgets—they're enforced at the infrastructure level.&lt;/p&gt;

&lt;h2&gt;Building vs. buying&lt;/h2&gt;

&lt;p&gt;Implementing a production-grade proxy requires solving several non-trivial problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming support&lt;/strong&gt;: LLM streaming responses require careful proxy handling to calculate costs from partial responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counting accuracy&lt;/strong&gt;: Different models have different pricing for input/output tokens, and your cost calculations need to match OpenAI's billing exactly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic spend updates&lt;/strong&gt;: You need transactional guarantees that spend increments don't get lost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region deployment&lt;/strong&gt;: Low latency requires running the proxy close to your application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert fatigue&lt;/strong&gt;: Teams need warnings before hitting limits, not just hard blocks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams that need this now, &lt;a href="https://awx-shredder.fly.dev" rel="noopener noreferrer"&gt;AWX Shredder&lt;/a&gt; is a production-ready proxy that handles all of this. Change &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to &lt;code&gt;https://awx-shredder.fly.dev/proxy/v1&lt;/code&gt;, set per-agent daily budgets, and get email alerts at 50%/80%/100% thresholds. It's OpenAI-compatible, so existing code works unchanged.&lt;/p&gt;

&lt;p&gt;For teams building internally, the core architecture is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run a lightweight HTTP proxy (Node.js with &lt;code&gt;http-proxy-middleware&lt;/code&gt; or Python with &lt;code&gt;aiohttp&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use Redis for atomic spend tracking with daily key expiration (sketched after this list)&lt;/li&gt;
&lt;li&gt;Parse token usage from OpenAI responses and multiply by model-specific pricing&lt;/li&gt;
&lt;li&gt;Return 429 errors when budgets are exceeded&lt;/li&gt;
&lt;li&gt;Implement request signing or API keys to authenticate agents&lt;/li&gt;
&lt;/ol&gt;
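
&lt;p&gt;The atomicity piece is the part worth getting right. A hedged sketch using Redis's atomic &lt;code&gt;INCRBYFLOAT&lt;/code&gt; as a reserve-then-refund counter (key names and limits are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime

import redis

r = redis.Redis()

def try_spend(agent_id: str, estimated_cost: float, daily_limit: float) -&gt; bool:
    """Atomically reserve estimated_cost against today's budget."""
    key = f"spend:{agent_id}:{datetime.date.today().isoformat()}"
    new_total = r.incrbyfloat(key, estimated_cost)  # atomic read-modify-write
    r.expire(key, 2 * 86400)  # daily key, kept briefly for auditing
    if new_total &gt; daily_limit:
        r.incrbyfloat(key, -estimated_cost)  # roll back the reservation
        return False
    return True
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because the increment happens before the check, two concurrent requests can't both sneak under the limit the way the application-level version allows.&lt;/p&gt;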

&lt;p&gt;The tricky parts are handling streaming correctly (you need to buffer the response to extract token counts while still streaming to the client) and keeping your pricing table up to date as OpenAI changes model costs.&lt;/p&gt;
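&lt;p&gt;For streaming, one approach that avoids re-tokenizing the buffered text is to ask the API for a final usage chunk. A minimal sketch with the OpenAI Python SDK, assuming the endpoint supports &lt;code&gt;stream_options&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries exact usage
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")  # relay as it arrives
    if chunk.usage:  # only set on the final chunk: record spend here
        print(f"\ntotal tokens: {chunk.usage.total_tokens}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;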

&lt;h2&gt;The enforcement guarantee&lt;/h2&gt;

&lt;p&gt;The key insight is that budget enforcement must happen &lt;em&gt;before&lt;/em&gt; cost is incurred, not after. Application-level tracking is audit logging. Proxy-level blocking is actual enforcement.&lt;/p&gt;

&lt;p&gt;When your proxy returns 429, that request never reaches OpenAI. No tokens are consumed. No cost is charged. The agent is hard-stopped until the daily limit resets. This guarantee—that exceeding a budget is architecturally impossible—is what lets you safely increase agent autonomy without fear of runaway costs.&lt;/p&gt;

&lt;h2&gt;What to do today&lt;/h2&gt;

&lt;p&gt;If you're running multiple AI agents in production, implement per-agent spend limits this week. The next production incident will happen—the question is whether it costs $50 or $5,000. Pick a proxy architecture (build or buy), assign realistic daily budgets to each agent (10-20% above their typical daily spend), and configure alerts before you hit limits. Your infrastructure should make expensive mistakes impossible, not just unlikely.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>openai</category>
      <category>agents</category>
    </item>
    <item>
      <title>I woke up to a $400 OpenAI bill. Here's what I built to make sure it never happens again.</title>
      <dc:creator>AwxGlobal</dc:creator>
      <pubDate>Fri, 01 May 2026 15:52:35 +0000</pubDate>
      <link>https://forem.com/awxglobal/i-woke-up-to-a-400-openai-bill-heres-what-i-built-to-make-sure-it-never-happens-again-5a8b</link>
      <guid>https://forem.com/awxglobal/i-woke-up-to-a-400-openai-bill-heres-what-i-built-to-make-sure-it-never-happens-again-5a8b</guid>
      <description>&lt;p&gt;It was 2am when the email came in.&lt;/p&gt;

&lt;p&gt;"Your OpenAI usage has exceeded $400."&lt;/p&gt;

&lt;p&gt;An agent I'd been testing had hit a loop. It kept retrying a failed call, over and over, for six hours while I slept. By the time I saw it, the damage was done.&lt;/p&gt;

&lt;p&gt;I went looking for a tool that would let me set a hard daily limit per agent — something that would block the call before it ever reached OpenAI. I couldn't find one.&lt;/p&gt;

&lt;p&gt;So I built it.&lt;/p&gt;

&lt;h2&gt;The problem with AI agents and money&lt;/h2&gt;

&lt;p&gt;When you build with AI APIs, you're essentially writing a blank check. Your agent makes calls, OpenAI charges you, and you find out later. There's no circuit breaker. There's no per-agent budget. There's no "stop after $10."&lt;/p&gt;

&lt;p&gt;Soft limits and billing alerts exist, but they notify you after the money is gone.&lt;/p&gt;

&lt;p&gt;What I needed was a hard block. Something that intercepts the call and returns an error before it reaches the API.&lt;/p&gt;

&lt;h2&gt;What AWX Shredder does&lt;/h2&gt;

&lt;p&gt;It's a proxy. Instead of your agent calling api.openai.com directly, it calls your AWX endpoint. AWX checks the agent's daily budget, and either forwards the call or blocks it with a 402.&lt;/p&gt;

&lt;p&gt;Setup is two lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1
OPENAI_API_KEY=your_awx_key
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You create agents in a dashboard, set a daily budget per agent, and that's it. Every call is logged. When an agent hits its limit, calls stop — not after, not eventually, right then.&lt;/p&gt;

&lt;h2&gt;Why per-agent matters&lt;/h2&gt;

&lt;p&gt;Most billing tools are org-level. You get one limit for everything.&lt;/p&gt;

&lt;p&gt;But in a real system you might have five agents running at once — a researcher, a writer, a code reviewer, a scraper, a scheduler. If one goes rogue, you want to block that agent, not your whole org.&lt;/p&gt;

&lt;p&gt;AWX gives each agent its own daily budget. One agent burning money doesn't affect the others.&lt;/p&gt;

&lt;h2&gt;It works with Claude too&lt;/h2&gt;

&lt;p&gt;Not just OpenAI. Point your Anthropic client at /proxy/v1/messages and the same budget enforcement applies.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;It's free to start: &lt;a href="https://awx-shredder.fly.dev" rel="noopener noreferrer"&gt;awx-shredder.fly.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've ever had an unexpected AI bill, or you're building agents and want to sleep at night — this is for you.&lt;/p&gt;

&lt;p&gt;I'm building this in public. Follow along if you want to see how it grows.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
