Forem: AwxGlobal

Claude 3.5 Sonnet vs Haiku: Why Your Agent Budget Disappeared in 3 Hours

AwxGlobal — Sat, 16 May 2026 07:00:42 +0000

Claude 3.5 Sonnet vs Haiku: Why Your Agent Budget Disappeared in 3 Hours

Last Tuesday at 2 PM, our customer support agent processed 847 tickets. By 5 PM, we'd burned through $340 of our daily Claude budget. The agent was working exactly as designed—reading tickets, searching our knowledge base, drafting responses—but we'd configured it to use Claude 3.5 Sonnet for every single operation. That's when I learned the hard way that model selection isn't just about capability; it's about matching the right model to each specific task.

The Pricing Reality

Claude's pricing structure creates a 15x cost differential that most developers underestimate:

Claude 3.5 Sonnet: $3/MTok input, $15/MTok output
Claude 3.5 Haiku: $0.25/MTok input, $1.25/MTok output

For a typical agent that processes 10K input tokens and generates 2K output tokens per interaction:

Sonnet: $0.03 input + $0.03 output = $0.06 per interaction
Haiku: $0.0025 input + $0.0025 output = $0.005 per interaction

That customer support agent running 1,000 interactions daily? We're talking $60/day with Sonnet versus $5/day with Haiku. Over a month, that's $1,800 vs $150.

When Sonnet Actually Matters

The instinct is to use the most powerful model everywhere, but Sonnet's advantages only matter for specific workloads:

Use Sonnet for:

Complex reasoning chains (multi-step problem solving)
Code generation requiring architectural decisions
Nuanced content creation where tone and style matter
Ambiguous instructions that need interpretation

Use Haiku for:

Classification and routing
Structured data extraction
Simple summarization
Template filling with clear instructions
Fact retrieval from provided context

In our support agent, 80% of the work was classification ("What category is this ticket?") and data extraction ("Pull the order number and issue description"). We were using a sledgehammer to tap in thumbtacks.

Building a Hybrid Agent Architecture

Here's how we rebuilt the agent with model routing:

import anthropic
from enum import Enum

class TaskComplexity(Enum):
    SIMPLE = "claude-3-5-haiku-20241022"
    COMPLEX = "claude-3-5-sonnet-20241022"

class HybridAgent:
    def __init__(self, api_key: str):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.daily_cost = 0.0

    def route_task(self, task_type: str) -> str:
        """Route to appropriate model based on task complexity."""
        simple_tasks = {
            "classify", "extract", "summarize_short",
            "validate", "route", "tag"
        }
        return (TaskComplexity.SIMPLE.value if task_type in simple_tasks 
                else TaskComplexity.COMPLEX.value)

    def process_ticket(self, ticket_text: str) -> dict:
        # Step 1: Classify with Haiku (cheap)
        classification = self.client.messages.create(
            model=self.route_task("classify"),
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"""Classify this support ticket into ONE category:
                BILLING, TECHNICAL, ACCOUNT, SHIPPING

                Ticket: {ticket_text}

                Response format: CATEGORY_NAME"""
            }]
        )

        category = classification.content[0].text.strip()

        # Step 2: Extract key data with Haiku (cheap)
        extraction = self.client.messages.create(
            model=self.route_task("extract"),
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Extract structured data as JSON:
                {{"order_id": "...", "issue": "...", "urgency": "low|medium|high"}}

                Ticket: {ticket_text}"""
            }]
        )

        # Step 3: Generate response with Sonnet ONLY if complex
        model = (TaskComplexity.COMPLEX.value if category == "TECHNICAL" 
                 else TaskComplexity.SIMPLE.value)

        response = self.client.messages.create(
            model=model,
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Draft a response for this {category} ticket.
                Data: {extraction.content[0].text}
                Original: {ticket_text}"""
            }]
        )

        return {
            "category": category,
            "data": extraction.content[0].text,
            "response": response.content[0].text
        }

This approach cut our per-ticket cost from $0.06 to roughly $0.015—a 75% reduction. Classification and extraction run on Haiku for $0.003 combined, while only technical tickets get the Sonnet treatment for response generation.

The Budget Control Problem

Even with optimized routing, agent costs can spiral. A bug in retry logic, a sudden spike in usage, or an inefficient prompt can blow through your budget before you notice. You need hard stops, not just monitoring.

For production agents, I now use AWX Shredder as a proxy layer that blocks requests when daily budgets are exceeded. It's OpenAI-compatible, so it works with Anthropic's API through a simple base URL change. Set per-agent budgets, get alerts at 50%/80% usage, and most importantly: requests hard-fail when you hit 100%. No surprise $5,000 bills.

Measuring What Matters

Track these metrics for each agent:

Cost per interaction: Total daily spend / interactions processed
Model distribution: % of calls to Haiku vs Sonnet
Quality by model: User satisfaction scores segmented by which model generated the response

For our support agent, quality metrics showed that Haiku-generated responses for simple categories had identical satisfaction scores to Sonnet. Users couldn't tell the difference—because the task didn't require Sonnet's capabilities.

Implementation Checklist

Before deploying your next Claude-powered agent:

Audit each operation: List every LLM call your agent makes
Classify by complexity: Which operations truly need reasoning vs. pattern matching?
Implement routing logic: Start with a simple dictionary of task types to models
Add budget controls: Implement hard stops on spending, not just alerts
Monitor model performance: Track quality metrics by model to validate your routing decisions

Start with the audit today. List every Claude API call in your agent, estimate monthly token usage, and calculate costs at both Sonnet and Haiku pricing. You'll likely find 60-80% of your operations can run on Haiku with zero quality loss. That's your savings opportunity.

Alerting on LLM Cost Thresholds: When to Warn vs When to Hard-Block

AwxGlobal — Thu, 14 May 2026 07:00:42 +0000

Alerting on LLM Cost Thresholds: When to Warn vs When to Hard-Block

Last month, our AI-powered support agent racked up $4,800 in OpenAI charges over a weekend. A misconfigured retry loop hit GPT-4 with full conversation history on every attempt. The API never said no—it just kept billing us.

If you're running LLM agents in production, this nightmare scenario is closer than you think. The question isn't whether to set up cost alerts, but how to structure them so you catch problems early without killing legitimate usage.

The Three-Tier Alert Strategy

Most developers instinctively reach for a single budget threshold: "Alert me when we hit $X." But production systems need a graduated response that balances awareness, urgency, and damage control.

Here's what works:

50% threshold: Passive monitoring

Log the event and send a low-priority notification. This is your early warning system. Normal usage patterns should occasionally hit this mark—it means your budget is sized correctly. No action required, just awareness.

80% threshold: Active investigation

Page the on-call engineer or send a high-priority alert. Something unusual is happening. Maybe it's legitimate (unexpected traffic spike, complex queries), maybe it's a bug. You need human eyes on this before it becomes a crisis.

100% threshold: Hard block

Stop all API calls immediately. Yes, this will cause user-facing failures. But uncontrolled cost spirals are worse than temporary downtime. You can always manually override if the spending is legitimate.

Implementation: Building Your Own Cost Gate

The naive approach is to check spending after each API call:

import openai
import redis
from datetime import datetime

class CostGate:
    def __init__(self, redis_client, daily_budget_usd):
        self.redis = redis_client
        self.budget = daily_budget_usd

    def get_today_key(self, agent_id):
        date = datetime.utcnow().strftime("%Y-%m-%d")
        return f"llm_cost:{agent_id}:{date}"

    def record_cost(self, agent_id, tokens_used, cost_per_1k):
        key = self.get_today_key(agent_id)
        cost = (tokens_used / 1000) * cost_per_1k

        # Atomic increment
        total_cost = self.redis.incrbyfloat(key, cost)
        self.redis.expire(key, 86400 * 2)  # Keep 2 days

        usage_pct = (total_cost / self.budget) * 100

        if usage_pct >= 100:
            raise BudgetExceededError(f"Agent {agent_id} exceeded daily budget")
        elif usage_pct >= 80:
            self.alert_high_usage(agent_id, total_cost, usage_pct)
        elif usage_pct >= 50:
            self.log_milestone(agent_id, total_cost, usage_pct)

        return total_cost

    def check_budget_before_call(self, agent_id, estimated_tokens):
        key = self.get_today_key(agent_id)
        current_cost = float(self.redis.get(key) or 0)

        if current_cost >= self.budget:
            raise BudgetExceededError(
                f"Agent {agent_id} already at ${current_cost:.2f} of ${self.budget} budget"
            )

# Usage
gate = CostGate(redis_client, daily_budget_usd=100)

try:
    gate.check_budget_before_call(agent_id="support-bot", estimated_tokens=2000)

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages
    )

    gate.record_cost(
        agent_id="support-bot",
        tokens_used=response.usage.total_tokens,
        cost_per_1k=0.03
    )
except BudgetExceededError as e:
    # Log and fail gracefully
    return {"error": "Service temporarily unavailable"}

This works, but you're reinventing infrastructure. You need:

Accurate token counting across multiple models
Rate limiting logic
Alert delivery (email, Slack, PagerDuty)
Dashboard for visibility
Handling of streaming responses
Coordination across distributed services

When to Use a Proxy Layer

For production systems, wrapping your API client isn't enough. You need enforcement at the network level. If a rogue process bypasses your wrapper, you're back to unlimited billing.

This is where proxy solutions become essential. Tools like AWX Shredder (awx-shredder.fly.dev) sit between your application and OpenAI, enforcing budgets at the protocol level. Change OPENAI_BASE_URL and suddenly every API call—even from legacy code or third-party libraries—goes through the cost gate. It handles the 50%/80%/100% alert progression out of the box.

The key architectural benefit: your application code stays simple. No cost-tracking logic scattered across microservices. Budget enforcement becomes infrastructure, not application logic.

Sizing Your Thresholds

Daily budgets should be based on P95 usage, not average. If your agent typically costs $50/day but occasionally spikes to $120, a $100 daily budget will cause false positives.

Start with this formula:

daily_budget = (P95_daily_usage * 1.5) + expected_growth_buffer

For new agents without usage history, be conservative. A $20/day limit won't break the bank if something goes wrong, and you can increase it once you understand actual patterns.

The Hard-Block Decision

Some developers resist hard-blocking at 100%, fearing user impact. But consider the alternative: a coding error that burns through thousands of dollars before anyone notices.

Hard-blocking is the right default. For agents that genuinely need to exceed daily budgets, implement an explicit override mechanism:

def create_completion(messages, allow_budget_override=False):
    if allow_budget_override:
        # Use separate, monitored endpoint
        return openai.ChatCompletion.create(...)
    else:
        # Route through cost gate
        return gated_completion(messages)

Overrides should require a paper trail—log who authorized it and why.

What to Do Right Now

Calculate your current daily LLM spend per agent
Set up basic Redis-backed tracking with the code above (or deploy a proxy layer)
Configure 50%/80%/100% alerts with your actual thresholds
Test the hard-block by manually triggering it in staging

The first time that 100% threshold stops a runaway agent from emptying your AWS budget, you'll wonder how you ever ran LLMs without it.

Stopping AutoGen Agents Before They Drain Your OpenAI Budget

AwxGlobal — Tue, 12 May 2026 07:00:43 +0000

Originally published at awx-shredder.fly.dev/blog

Stopping AutoGen Agents Before They Drain Your OpenAI Budget

Your AutoGen agent just racked up $240 in API costs overnight. The culprit? A reflection loop where two agents debated code formatting standards for 3,000 turns before you woke up and killed the process. If you've deployed AutoGen agents in any production or semi-production environment, you've either experienced this already or you're about to.

AutoGen's multi-agent conversations are powerful precisely because they can iterate autonomously. But that autonomy becomes expensive fast when agents get stuck in loops, misinterpret termination conditions, or simply take more turns than you anticipated to solve a problem.

The Built-In Options (And Why They're Not Enough)

AutoGen provides max_consecutive_auto_reply to limit turns per agent. Here's the standard approach:

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}],
        "temperature": 0.7,
    },
    max_consecutive_auto_reply=10,  # Limit turns
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=5,
    code_execution_config={"work_dir": "coding"},
)

This caps the number of consecutive replies, but it's a blunt instrument. A 10-turn conversation with GPT-3.5-turbo costs pennies. The same 10 turns with GPT-4 on a complex problem with large context windows can cost $5-10. You're controlling iterations, not spending.

The max_tokens parameter in llm_config helps, but only per request. In a 50-turn conversation, you're still looking at 50 separate API calls, each maxing out your token limit. And if you set it too low, you get truncated responses that derail the conversation.

The Real Problem: No Budget Awareness

AutoGen agents have no concept of cumulative cost. They don't know if you've already spent $100 today across other conversations. They can't distinguish between a test environment where you want to spend $1 and production where $50 is acceptable.

You need budget enforcement at the API layer, not the conversation layer. This means intercepting OpenAI API calls before they happen and blocking requests that exceed spending limits.

Building Your Own Budget Guard

The straightforward approach is wrapping the OpenAI client with a budget tracker:

import openai
from datetime import datetime, timedelta
import tiktoken

class BudgetGuardedClient:
    def __init__(self, daily_budget_usd=10.0):
        self.daily_budget = daily_budget_usd
        self.today_spend = 0.0
        self.last_reset = datetime.now().date()
        self.pricing = {
            "gpt-4": {"input": 0.03 / 1000, "output": 0.06 / 1000},
            "gpt-3.5-turbo": {"input": 0.0015 / 1000, "output": 0.002 / 1000},
        }

    def _reset_if_new_day(self):
        today = datetime.now().date()
        if today > self.last_reset:
            self.today_spend = 0.0
            self.last_reset = today

    def _estimate_cost(self, model, prompt_tokens, completion_tokens):
        if model not in self.pricing:
            model = "gpt-4"  # Conservative fallback
        return (
            prompt_tokens * self.pricing[model]["input"] +
            completion_tokens * self.pricing[model]["output"]
        )

    def chat_completion_create(self, **kwargs):
        self._reset_if_new_day()

        if self.today_spend >= self.daily_budget:
            raise Exception(f"Daily budget of ${self.daily_budget} exceeded")

        response = openai.ChatCompletion.create(**kwargs)

        cost = self._estimate_cost(
            kwargs.get("model", "gpt-4"),
            response.usage.prompt_tokens,
            response.usage.completion_tokens
        )
        self.today_spend += cost

        return response

# Use it with AutoGen
guarded_client = BudgetGuardedClient(daily_budget_usd=5.0)

# Monkey-patch or wrap AutoGen's LLM calls
# (Implementation details vary based on AutoGen version)

This works for single-process development, but breaks down in production:

No persistence: Restart your app, reset your budget counter
No multi-agent differentiation: All agents share one budget
No alerting: You find out you hit the limit when the conversation fails
No concurrent safety: Multiple agents in different processes can race past the limit

Production-Grade Budget Enforcement

For production AutoGen deployments, you need a proxy that sits between your agents and OpenAI. AWX Shredder (awx-shredder.fly.dev) handles this exact use case—it's an OpenAI-compatible proxy that enforces hard daily budgets per agent or per project. Change one environment variable (OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1), configure your budgets, and the proxy blocks requests the moment an agent exceeds its limit. You get real-time spend tracking, alerts at 50%/80%/100% of budget, and a dashboard showing which agents are burning through credits.

The critical advantage: budget enforcement happens before the API call, not after. The conversation stops cleanly, you don't get charged, and you can investigate why the agent needed so many turns.

Architectural Patterns That Reduce Cost

Beyond hard limits, design your agent conversations to minimize waste:

Use cheaper models for orchestration. Your planner agent that decides which specialist agent to call next doesn't need GPT-4. Use GPT-3.5-turbo for routing, GPT-4 for the actual work.

Implement streaming termination. Check for termination conditions as responses stream in, not after the full completion. You can cancel mid-response if you detect the agent is going off-track.

Cache aggressively. If three agents need the same context (like a large document), don't include it in all three conversations. Use a retrieval pattern where agents query for specific sections.

Set conversation-level budgets. Don't just limit per agent—limit the entire group chat. If your code review workflow should never cost more than $0.50, enforce it at the conversation level.

Start Here Today

Add basic budget tracking to your AutoGen setup this week. Even a simple wrapper that logs cumulative costs and raises warnings gives you visibility you don't have now. Set a modest daily limit ($5-10) and run your typical workflows. You'll quickly learn which conversations are expensive and which agents need better termination conditions.

The goal isn't to eliminate all LLM costs—it's to eliminate surprise costs. AutoGen agents should fail fast and loudly when they hit budget limits, not silently drain your account while you sleep.

Per-Agent Cost Tracking: Why Your LLM Analytics Are Probably Wrong

AwxGlobal — Fri, 08 May 2026 07:00:43 +0000

Originally published at awx-shredder.fly.dev/blog

Per-Agent Cost Tracking: Why Your LLM Analytics Are Probably Wrong

Your CFO asks why the OpenAI bill jumped 340% last month. You open your dashboard and see... one giant line item. Maybe you've got it broken down by API key, but three teams share the same key. You know the costs are coming from your agent fleet, but which agents? Which conversations? Which models are burning through your budget at 3 AM?

This isn't a hypothetical. I've watched teams spend $40K/month on GPT-4 calls before realizing that a single misconfigured agent was responsible for 60% of it. The agent was retrying failed calls with exponential backoff but no maximum—running for days on edge cases.

The Three Dimensions That Actually Matter

Cost visibility for AI APIs needs three dimensions simultaneously: who (agent/user), what (model/operation), and when (time-series data). Miss any one and you're flying blind.

Per-agent tracking tells you which parts of your system are expensive. Your customer support bot might cost $0.03 per conversation while your code review agent costs $2.40. Without this, you're optimizing in the dark.

Per-model tracking reveals substitution opportunities. If 80% of your GPT-4 calls could use GPT-4-mini, that's a 97% cost reduction on that subset.

Per-hour tracking catches runaway processes and usage patterns. That spike at 2 AM? A batch job that should've been rate-limited. The gradual climb every afternoon? User traffic you can predict and budget for.

Building Cost Tracking Into Your Middleware

The right place to track costs is in your API middleware layer, not scattered across services. Here's a Python example using a decorator pattern that captures all three dimensions:

import time
import logging
from functools import wraps
from datetime import datetime
from openai import OpenAI

class CostTracker:
    def __init__(self):
        self.costs = []
        # Pricing per 1M tokens (update these regularly)
        self.pricing = {
            'gpt-4-turbo': {'input': 10.0, 'output': 30.0},
            'gpt-4': {'input': 30.0, 'output': 60.0},
            'gpt-3.5-turbo': {'input': 0.5, 'output': 1.5},
        }

    def calculate_cost(self, model, input_tokens, output_tokens):
        if model not in self.pricing:
            logging.warning(f"Unknown model {model}, cost not tracked")
            return 0

        input_cost = (input_tokens / 1_000_000) * self.pricing[model]['input']
        output_cost = (output_tokens / 1_000_000) * self.pricing[model]['output']
        return input_cost + output_cost

    def track_call(self, agent_id, model, input_tokens, output_tokens, timestamp):
        cost = self.calculate_cost(model, input_tokens, output_tokens)
        self.costs.append({
            'agent_id': agent_id,
            'model': model,
            'input_tokens': input_tokens,
            'output_tokens': output_tokens,
            'cost': cost,
            'timestamp': timestamp,
            'hour': timestamp.replace(minute=0, second=0, microsecond=0)
        })
        return cost

def track_llm_call(agent_id):
    tracker = CostTracker()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start_time = datetime.utcnow()
            result = func(*args, **kwargs)

            # Extract token usage from response
            usage = result.usage
            model = result.model

            cost = tracker.track_call(
                agent_id=agent_id,
                model=model,
                input_tokens=usage.prompt_tokens,
                output_tokens=usage.completion_tokens,
                timestamp=start_time
            )

            logging.info(f"Agent {agent_id} | {model} | ${cost:.4f}")
            return result
        return wrapper
    return decorator

# Usage
@track_llm_call(agent_id='customer-support-v2')
def call_gpt(messages):
    client = OpenAI()
    return client.chat.completions.create(
        model='gpt-4-turbo',
        messages=messages
    )

This gives you structured logs you can pipe to your analytics stack. But there's a gap: this tracks costs after they happen. You need real-time budgets to prevent runaway spending.

Real-Time Budget Enforcement

Logging is reactive. You need proactive budget enforcement—hard limits that block requests before they hit your bill. This requires a stateful proxy that:

Intercepts all LLM API calls
Tracks cumulative spend per agent in real-time
Blocks requests that would exceed budget
Alerts before you hit limits

For production use, AWX Shredder solves this specific problem as a drop-in OpenAI-compatible proxy. Set OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1, configure per-agent daily budgets, and it hard-blocks calls the moment an agent exceeds its limit. You get real-time spend tracking, email alerts at 50%/80%/100%, and a dashboard—without changing your application code beyond the base URL.

What to Alert On

Once you have visibility, alert on these signals:

Hourly spend rate exceeding daily budget pro-rata: If you've spent 30% of your daily budget in the first 2 hours (8.3% of the day), something's wrong.
Per-agent spend deviating >3σ from baseline: Catches both runaway agents and sudden drops (which might indicate broken functionality).
Model selection anomalies: If an agent that normally uses GPT-3.5 suddenly switches to GPT-4, investigate.
Token efficiency degradation: Track cost-per-successful-operation. If your code review agent's cost-per-review doubles, your prompts might be degrading.

The Data Schema That Scales

Store cost data in a time-series database with this schema:

timestamp | agent_id | model | input_tokens | output_tokens | cost | metadata

The metadata JSON field is crucial—store request IDs, user IDs, feature flags, anything you might need to slice by later. You can't predict what questions your CFO will ask.

Partition by day and agent_id for query performance. Your most common query will be "show me the last 7 days for agent X."

Start Today

If you're not tracking per-agent costs right now, add the decorator pattern above to your highest-spend endpoint this afternoon. Point it at CloudWatch, Datadog, or even a Postgres table. You'll have visibility within hours, not sprints.

Then ask: which agent cost the most yesterday? You probably don't know the answer. By tomorrow, you should.

What happens when an AI agent hits a rate limit — and how to design around it

AwxGlobal — Tue, 05 May 2026 07:00:39 +0000

Originally published at awx-shredder.fly.dev/blog

What happens when an AI agent hits a rate limit — and how to design around it

Your AI agent is processing customer support tickets at 3 AM. It's been running flawlessly for hours, then suddenly: RateLimitError: You exceeded your current quota. The agent crashes. Thirty tickets sit in limbo. Your on-call phone rings.

This isn't a hypothetical. Rate limits and budget exhaustion are distinct failure modes with different blast radiuses, and most developers conflate them until production teaches them otherwise.

Rate limits vs budget limits: different animals

A rate limit restricts requests per time window — 3,500 requests per minute for GPT-4, for example. Cross it and you get a 429 status code. Wait 60 seconds and you're back in business.

A budget limit is about cumulative spend. Once you've burned through your daily or monthly allocation, you're done until the reset. The API returns 429 with insufficient_quota as the error type, but the fix isn't waiting — it's either increasing your budget or stopping work entirely.

The failure modes differ:

Rate limit: Temporary. Backoff and retry works.
Budget limit: Terminal for that billing period. Retry loops just burn CPU.

Yet both return 429. Your error handling needs to distinguish them.

Parsing the error correctly

OpenAI's Python SDK raises RateLimitError for both. The distinction lives in the error message or response headers. Here's how to differentiate:

from openai import OpenAI, RateLimitError
import time

client = OpenAI()

def call_with_smart_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
        except RateLimitError as e:
            error_message = str(e).lower()

            # Budget exhausted - don't retry
            if "quota" in error_message or "insufficient" in error_message:
                print(f"Budget exhausted: {e}")
                # Log to monitoring, alert ops, gracefully degrade
                raise BudgetExhaustedError("Daily budget hit") from e

            # Rate limit - exponential backoff
            if attempt < max_retries - 1:
                wait_time = (2 ** attempt) + (random.random() * 0.1)
                print(f"Rate limited, backing off {wait_time:.2f}s")
                time.sleep(wait_time)
            else:
                raise

class BudgetExhaustedError(Exception):
    pass

This prevents the classic mistake: retry loops that hammer the API when you're out of budget, racking up failed request logs and wasting cycles.

Backoff strategies that actually work

Exponential backoff with jitter is table stakes. The jitter (random component) prevents thundering herds when multiple agents hit limits simultaneously.

But there's a subtlety: OpenAI returns retry-after headers on rate limit responses. Respect them:

except RateLimitError as e:
    retry_after = e.response.headers.get('retry-after')
    if retry_after:
        wait_time = int(retry_after)
    else:
        wait_time = min((2 ** attempt) + random.random(), 60)
    time.sleep(wait_time)

Adaptive rate limiting is the next level. Track your request success rate and slow down proactively before hitting limits:

class AdaptiveRateLimiter:
    def __init__(self, initial_rate=10):
        self.requests_per_second = initial_rate
        self.window_start = time.time()
        self.requests_in_window = 0

    def acquire(self):
        now = time.time()
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.requests_in_window = 0

        if self.requests_in_window >= self.requests_per_second:
            sleep_time = 1.0 - (now - self.window_start)
            if sleep_time > 0:
                time.sleep(sleep_time)
            self.window_start = time.time()
            self.requests_in_window = 0

        self.requests_in_window += 1

    def on_rate_limit(self):
        # Reduce rate by 50% when we hit a limit
        self.requests_per_second = max(1, self.requests_per_second * 0.5)

    def on_success(self):
        # Gradually increase rate by 10% on sustained success
        self.requests_per_second = min(100, self.requests_per_second * 1.1)

Queue design for resilient agents

The real solution isn't better retry logic — it's building agents that fail gracefully. Use a persistent queue:

Accept work into a queue (Redis, SQS, PostgreSQL with SKIP LOCKED)
Workers pull from the queue with visibility timeouts
On rate limit: Release the message back to the queue, don't retry immediately
On budget exhaustion: Stop pulling from the queue entirely, alert, and wait for budget reset

This architecture decouples work acceptance from execution. When you hit limits, work queues up instead of erroring out.

# Pseudocode for queue-based processing
while True:
    message = queue.receive(wait_time=20)
    if not message:
        continue

    try:
        result = call_with_smart_retry(message.data)
        message.delete()
    except BudgetExhaustedError:
        message.release()  # Back to queue
        print("Budget exhausted, sleeping until reset")
        time.sleep(3600)  # Check hourly
    except RateLimitError:
        message.release()  # Back to queue with delay
        time.sleep(10)  # Brief pause before next pull

Hard budget enforcement

If you need guaranteed budget enforcement at the API level rather than in your application logic, AWX Shredder provides a proxy layer that blocks requests the moment an agent exceeds its daily budget. It's OpenAI-compatible and requires only a base URL change: OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1. This prevents the scenario where your retry logic has bugs or a runaway agent burns through budget before your application-level checks catch it.

What to implement today

Add error type detection to your LLM calls. Distinguish rate limits from budget exhaustion.
Implement exponential backoff with jitter, respecting retry-after headers.
Move to queue-based processing if you're doing any multi-request workflows.
Set up monitoring for rate limit and budget exhaustion events. These should page someone.

The difference between a resilient AI agent and a fragile one isn't the model you use — it's how you handle the inevitable moment when the API says "no."

Preventing CrewAI Budget Overruns: Hard Limits Per Agent Role

AwxGlobal — Mon, 04 May 2026 07:00:43 +0000

Originally published at awx-shredder.fly.dev/blog

Preventing CrewAI Budget Overruns: Hard Limits Per Agent Role

A multi-agent CrewAI workflow spun up in production last month and burned through $340 in API costs before anyone noticed. The culprit? A research agent stuck in a loop, making hundreds of GPT-4 calls to refine a single report. The agent eventually completed its task, but the invoice was brutal.

CrewAI's built-in max_rpm and max_iter parameters help, but they're blunt instruments. RPM limits don't account for token usage variance—a single long-context call can cost 50x more than a short one. Iteration limits stop runaway loops but won't catch an agent that makes expensive calls within reasonable iteration counts.

What you actually need is hard budget enforcement at the agent level, ideally denominated in dollars rather than tokens or requests.

Why Agent-Level Budgets Matter

In a typical CrewAI setup, different agents have wildly different cost profiles:

Research agents make many calls with large contexts (expensive)
Routing agents make quick classification calls (cheap)
Writing agents generate long-form content (moderate, predictable)

A flat budget across all agents means your routing agent's allocation goes unused while your research agent blows past acceptable spend. You need granular control that maps to how each agent actually behaves in production.

The Naive Approach: Token Counting

Your first instinct might be to wrap the LLM client and count tokens:

from crewai import Agent, LLM
from collections import defaultdict
import tiktoken

class BudgetEnforcedLLM:
    def __init__(self, model_name, budget_dollars):
        self.llm = LLM(model=model_name)
        self.model_name = model_name
        self.budget_dollars = budget_dollars
        self.spent_dollars = 0.0
        self.encoding = tiktoken.encoding_for_model(model_name)

        # Simplified pricing (actual pricing is more complex)
        self.price_per_1k_input = 0.03 if "gpt-4" in model_name else 0.0015
        self.price_per_1k_output = 0.06 if "gpt-4" in model_name else 0.002

    def call(self, messages, **kwargs):
        if self.spent_dollars >= self.budget_dollars:
            raise RuntimeError(f"Budget exceeded: ${self.spent_dollars:.4f} / ${self.budget_dollars}")

        # Estimate input cost
        input_tokens = sum(len(self.encoding.encode(str(m))) for m in messages)

        response = self.llm.call(messages, **kwargs)

        # Count actual output tokens
        output_tokens = len(self.encoding.encode(response.content))

        cost = (input_tokens / 1000 * self.price_per_1k_input + 
                output_tokens / 1000 * self.price_per_1k_output)
        self.spent_dollars += cost

        return response

# Usage with CrewAI
research_llm = BudgetEnforcedLLM("gpt-4-turbo-preview", budget_dollars=5.0)
researcher = Agent(
    role="Research Analyst",
    goal="Gather comprehensive market data",
    llm=research_llm,
    verbose=True
)

This works in theory, but falls apart in practice:

Pricing complexity: OpenAI's pricing varies by model version, has different rates for cached tokens, and changes frequently
Token counting edge cases: Function calls, vision inputs, and prompt caching make accurate token estimation nearly impossible
No persistence: Restarting your process resets budgets, making daily limits impossible
No observability: You're flying blind until something breaks

Production-Grade Budget Enforcement

The real solution is intercepting calls at the API level, not in your application code. This is where a proxy layer makes sense.

AWX Shredder is built specifically for this use case—it sits between your CrewAI agents and OpenAI's API, enforcing hard budget limits per agent role without requiring code changes beyond swapping the base URL:

import os
from crewai import Agent, Crew, Task, LLM

# Configure different budget headers for each agent role
research_llm = LLM(
    model="gpt-4-turbo-preview",
    base_url="https://awx-shredder.fly.dev/proxy/v1",
    api_key=os.getenv("OPENAI_API_KEY"),
    default_headers={"X-AWX-Agent-ID": "researcher", "X-AWX-Daily-Budget": "5.00"}
)

writer_llm = LLM(
    model="gpt-4-turbo-preview",
    base_url="https://awx-shredder.fly.dev/proxy/v1",
    api_key=os.getenv("OPENAI_API_KEY"),
    default_headers={"X-AWX-Agent-ID": "writer", "X-AWX-Daily-Budget": "3.00"}
)

researcher = Agent(
    role="Research Analyst",
    goal="Gather comprehensive market data",
    llm=research_llm,
    verbose=True
)

writer = Agent(
    role="Content Writer",
    goal="Create engaging reports from research",
    llm=writer_llm,
    verbose=True
)

The proxy tracks actual API costs in real-time and returns HTTP 429 (rate limit) responses the moment an agent exceeds its daily budget. CrewAI handles this gracefully—the agent fails fast rather than racking up surprise costs.

Rolling Your Own Proxy

If you need full control, building a simple budget-enforcing proxy isn't difficult. The minimal viable version needs:

A Redis instance for persisting spend by agent ID and date
A FastAPI or Express server that proxies to api.openai.com
Middleware that checks budget before forwarding requests
A parser for OpenAI's response to extract actual costs from the usage field

The tricky parts are handling streaming responses (costs aren't known until the stream completes) and dealing with OpenAI's eventual consistency in reporting costs. You'll also need to build alerting, dashboards, and handle key rotation.

For most teams, the build-vs-buy calculation favors a managed solution unless budget enforcement is core to your product's IP.

Key Takeaways

Set agent-level budgets based on each role's expected behavior, not flat limits across your crew
Enforce budgets at the API proxy level, not in application code
Use hard blocks (HTTP 429) rather than soft warnings—agents can't learn from budget alerts
Start conservative: set daily budgets at 2x your observed costs, then tighten after a week of production data

Today's action: audit your last month of OpenAI bills, break down costs by the logical "agent roles" you're running, and set per-role daily budgets at 150% of the highest daily spend you saw. Hard limits prevent one-off disasters while giving your agents room to operate.

Why Your LLM Agent Costs 10x More Than Your Estimate

AwxGlobal — Sat, 02 May 2026 07:00:43 +0000

Originally published at awx-shredder.fly.dev/blog

Why Your LLM Agent Costs 10x More Than Your Estimate

Your product manager approved the $500/month LLM budget. Two weeks later, you're staring at a $4,200 bill from OpenAI. The agent works perfectly in testing, but production is eating tokens like a memory leak eats RAM.

I've debugged this exact scenario four times in the past year. The culprit is never a single smoking gun—it's the multiplication of hidden costs that developers systematically underestimate during planning.

The Token Math Nobody Does

Most developers estimate LLM costs like this:

1000 requests/day × 500 tokens/request × $0.002/1k tokens = $1/day

This calculation assumes every request is a pristine, single-shot API call. Real agents don't work that way.

Here's what actually happens:

System prompts are charged on every call. That 800-token system prompt explaining your agent's role, output format, and business rules? It's not free. It's billed on every single request.
Tool call overhead compounds exponentially. Each function call requires the model to output JSON, you to execute the function, then send results back with the full conversation history. A single user request often triggers 3-5 tool calls.
Conversation history grows linearly. If you're maintaining context across turns (and you probably are), each subsequent message in a conversation includes all previous messages.

Let's recalculate with reality:

# What you estimated
simple_cost = 1000 * 500 * 0.002 / 1000
print(f"Estimate: ${simple_cost}/day")  # $1.00/day

# What actually happens
system_prompt_tokens = 800
avg_user_input = 150
avg_assistant_response = 300
tool_calls_per_request = 3.5  # Average across all requests
tool_call_overhead = 250  # JSON formatting + function results
conversation_turns = 4  # Average conversation length

# First turn
turn_1 = system_prompt_tokens + avg_user_input + avg_assistant_response

# Subsequent turns include history
turn_2 = turn_1 + avg_user_input + avg_assistant_response
turn_3 = turn_2 + avg_user_input + avg_assistant_response  
turn_4 = turn_3 + avg_user_input + avg_assistant_response

# Add tool call overhead
total_tokens = (turn_1 + turn_2 + turn_3 + turn_4) + (tool_calls_per_request * tool_call_overhead * conversation_turns)

real_cost = 1000 * total_tokens * 0.002 / 1000
print(f"Reality: ${real_cost}/day")  # $12.40/day

multiplier = real_cost / simple_cost
print(f"Hidden multiplier: {multiplier}x")  # 12.4x

This 12x multiplier is before we account for the really expensive mistakes.

Retry Loops: The Silent Budget Killer

Retry logic is essential for production reliability. It's also where costs spiral out of control.

Consider this common pattern:

from openai import OpenAI
import time

client = OpenAI()

def call_agent_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                timeout=30
            )
            return response
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff

This looks reasonable. But what happens when OpenAI has a bad day and timeouts spike to 15% of requests?

15% of your 1000 daily requests now make 2-3 attempts
Each retry sends the full prompt again (including that 800-token system prompt)
Your token consumption jumps by 20-40% instantly

Worse yet: I've seen validation retry loops where the agent's output doesn't match the expected schema, so the developer adds logic to retry with error feedback. Each failed parse triggers another full API call with the previous attempt's context. A single malformed JSON response can cascade into 5-10 retry attempts.

Tool Call Explosion

Function calling feels efficient—until you look at the token counts.

Every tool call follows this pattern:

Model decides to call a function (outputs JSON)
Your code executes the function
Results are formatted and sent back
Model processes results and decides next step

Each step includes the full conversation history and system prompt. A research agent that calls search(), then fetch_url(), then extract_data() isn't making three cheap calls—it's making three increasingly expensive calls as the context window fills with previous tool results.

The economics get brutal with GPT-4. A complex agent workflow that feels like "just a few tool calls" can easily consume 15,000-20,000 tokens per user request.

Production Reality Check

After you've shipped and the costs are running hot, you need visibility and hard limits. Setting OpenAI spending limits helps but doesn't give you per-agent granularity or prevent runaway costs before they hit your credit card.

For production deployments where budget control is non-negotiable, tools like AWX Shredder (awx-shredder.fly.dev) act as a hard circuit breaker—an OpenAI-compatible proxy that blocks requests when an agent exceeds its daily budget. It takes one environment variable change and gives you real-time spend tracking with alerts before you blow through your allocation.

What You Should Do Today

Audit your actual token consumption. Log usage.total_tokens from every API response for a week. Calculate the median and p95. You'll be surprised.
Count your system prompt tokens. Use tiktoken to get exact counts. If your system prompt is over 500 tokens, consider whether every instruction is essential.
Track retry rates. Add metrics for how often your retry logic actually fires. Set alerts when retry rates exceed 5%.
Model tool call patterns. Log how many function calls the average request triggers. If it's more than 3, consider whether you can combine tools or reduce the decision tree.
Set hard budget limits per agent. Don't rely on cost estimates. Implement actual spending caps that prevent runaway costs.

The gap between estimated and actual LLM costs isn't a rounding error—it's the difference between a sustainable product and a budget crisis. The math is straightforward once you account for what actually gets billed.

[Boost]

AwxGlobal — Fri, 01 May 2026 22:55:32 +0000

AwxGlobal

May 1

Per-agent daily spend limits: the architecture every AI team needs

#ai #architecture #openai #agents

Comments

4 min read

Per-agent daily spend limits: the architecture every AI team needs

AwxGlobal — Fri, 01 May 2026 21:07:07 +0000

Originally published at awx-shredder.fly.dev/blog

Per-agent daily spend limits: the architecture every AI team needs

Your Slack bot just burned through $847 in four hours because a junior dev accidentally pushed a loop that called gpt-4-turbo on every message edit event. Your customer support agent hit an infinite reasoning loop and racked up $2,300 in o1-preview costs before anyone noticed. These aren't hypothetical scenarios—they're the kind of incidents that happen weekly across AI engineering teams.

The problem isn't that developers are careless. It's that LLM APIs have fundamentally different cost characteristics than traditional APIs. A single malformed request can cost $50. A logic bug can drain thousands before your monitoring alerts even fire. And when you're running multiple agents—research bots, customer support, data analysis tools—the blast radius of a single misconfigured agent can take down your entire API budget.

Why application-level budget checks fail

Most teams start with application-level budget enforcement. You add a counter in your database, increment it on each API call, and check before making requests:

async def call_llm(agent_id: str, messages: list):
    current_spend = await db.get_daily_spend(agent_id)
    if current_spend >= DAILY_LIMIT:
        raise BudgetExceededError()

    response = await openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages
    )

    # Calculate and record cost
    cost = calculate_cost(response.usage)
    await db.increment_spend(agent_id, cost)
    return response

This looks reasonable until you hit production. The cost calculation happens after the API call completes. If your database write fails, you've lost spend tracking. If the process crashes between the API call and the database update, that cost vanishes. Race conditions mean multiple requests can check the budget simultaneously, all see they're under the limit, and fire off requests that collectively exceed it.

More critically: this pattern requires every callsite in your codebase to route through your budget enforcement logic. Third-party libraries that call OpenAI directly bypass it entirely. That LangChain agent you integrated? It's not checking budgets. The new engineer who doesn't know about your internal wrapper? They import openai directly and circumvent everything.

The proxy architecture

The robust solution is budget enforcement at the network layer. Every LLM API call flows through a proxy that:

Authenticates the agent making the request
Checks current spend against the daily limit before forwarding to the LLM provider
Blocks the request immediately if the limit is exceeded
Records actual costs from the LLM response
Aggregates spend across all instances of your application

This architecture makes budget enforcement impossible to bypass. Applications can't accidentally route around it because the proxy is configured at the network level via OPENAI_BASE_URL. Multiple application instances automatically share the same spend tracking because it's centralized in the proxy.

Here's what the client-side configuration looks like:

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL, // points to proxy
  defaultHeaders: {
    'X-Agent-ID': 'customer-support-bot'
  }
});

// This call is budget-enforced automatically
const response = await client.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [{ role: 'user', content: 'Hello' }]
});

The proxy intercepts the request, checks if customer-support-bot has budget remaining today, and either forwards it to OpenAI or returns a 429 error. Your application code doesn't need to think about budgets—they're enforced infrastructure-level.

Building vs. buying

Implementing a production-grade proxy requires solving several non-trivial problems:

Streaming support: LLM streaming responses require careful proxy handling to calculate costs from partial responses
Token counting accuracy: Different models have different pricing for input/output tokens, and your cost calculations need to match OpenAI's billing exactly
Atomic spend updates: You need transactional guarantees that spend increments don't get lost
Multi-region deployment: Low latency requires running the proxy close to your application
Alert fatigue: Teams need warnings before hitting limits, not just hard blocks

For teams that need this now, AWX Shredder is a production-ready proxy that handles all of this. Change OPENAI_BASE_URL to https://awx-shredder.fly.dev/proxy/v1, set per-agent daily budgets, and get email alerts at 50%/80%/100% thresholds. It's OpenAI-compatible, so existing code works unchanged.

For teams building internally, the core architecture is straightforward:

Run a lightweight HTTP proxy (Node.js with http-proxy-middleware or Python with aiohttp)
Use Redis for atomic spend tracking with daily key expiration
Parse token usage from OpenAI responses and multiply by model-specific pricing
Return 429 errors when budgets are exceeded
Implement request signing or API keys to authenticate agents

The tricky parts are handling streaming correctly (you need to buffer the response to extract token counts while still streaming to the client) and keeping your pricing table up to date as OpenAI changes model costs.

The enforcement guarantee

The key insight is that budget enforcement must happen before cost is incurred, not after. Application-level tracking is audit logging. Proxy-level blocking is actual enforcement.

When your proxy returns 429, that request never reaches OpenAI. No tokens are consumed. No cost is charged. The agent is hard-stopped until the daily limit resets. This guarantee—that exceeding a budget is architecturally impossible—is what lets you safely increase agent autonomy without fear of runaway costs.

What to do today

If you're running multiple AI agents in production, implement per-agent spend limits this week. The next production incident will happen—the question is whether it costs $50 or $5,000. Pick a proxy architecture (build or buy), assign realistic daily budgets to each agent (10-20% above their typical daily spend), and configure alerts before you hit limits. Your infrastructure should make expensive mistakes impossible, not just unlikely.

I woke up to a $400 OpenAI bill. Here's what I built to make sure it never happens again.

AwxGlobal — Fri, 01 May 2026 15:52:35 +0000

It was 2am when the email came in.

"Your OpenAI usage has exceeded $400."

An agent I'd been testing had hit a loop. It kept retrying a failed call, over and over, for six hours while I slept. By the time I saw it, the damage was done.

I went looking for a tool that would let me set a hard daily limit per agent — something that would block the call before it ever reached OpenAI. I couldn't find one.

So I built it.

The problem with AI agents and money

When you build with AI APIs, you're essentially writing a blank check. Your agent makes calls, OpenAI charges you, and you find out later. There's no circuit breaker. There's no per-agent budget. There's no "stop after $10."

Soft limits and billing alerts exist, but they notify you after the money is gone.

What I needed was a hard block. Something that intercepts the call and returns an error before it reaches the API.

What AWX Shredder does

It's a proxy. Instead of your agent calling api.openai.com directly, it calls your AWX endpoint. AWX checks the agent's daily budget, and either forwards the call or blocks it with a 402.

Setup is two lines:

OPENAI_BASE_URL=https://awx-shredder.fly.dev/proxy/v1
OPENAI_API_KEY=your_awx_key
You create agents in a dashboard, set a daily budget per agent, and that's it. Every call is logged. When an agent hits its limit, calls stop — not after, not eventually, right then.

Why per-agent matters

Most billing tools are org-level. You get one limit for everything.

But in a real system you might have five agents running at once — a researcher, a writer, a code reviewer, a scraper, a scheduler. If one goes rogue, you want to block that agent, not your whole org.

AWX gives each agent its own daily budget. One agent burning money doesn't affect the others.

It works with Claude too

Not just OpenAI. Point your Anthropic client at /proxy/v1/messages and the same budget enforcement applies.

Try it

It's free to start: awx-shredder.fly.dev

If you've ever had an unexpected AI bill, or you're building agents and want to sleep at night — this is for you.

I'm building this in public. Follow along if you want to see how it grows.

ai #openai #webdev #productivity