<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Msatfi</title>
    <description>The latest articles on Forem by Msatfi (@msatfi89).</description>
    <link>https://forem.com/msatfi89</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3768570%2F4cd6c453-afe3-4499-b243-378b510cef64.png</url>
      <title>Forem: Msatfi</title>
      <link>https://forem.com/msatfi89</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/msatfi89"/>
    <language>en</language>
    <item>
      <title>I Ran 14 Agent Failure Scenarios Through a Guard Layer (With Cost Data From Every Run)</title>
      <dc:creator>Msatfi</dc:creator>
      <pubDate>Fri, 13 Mar 2026 07:56:06 +0000</pubDate>
      <link>https://forem.com/msatfi89/i-ran-14-agent-failure-scenarios-through-a-guard-layer-with-cost-data-from-every-run-45d2</link>
      <guid>https://forem.com/msatfi89/i-ran-14-agent-failure-scenarios-through-a-guard-layer-with-cost-data-from-every-run-45d2</guid>
      <description>&lt;p&gt;How do you know if your agent is looping or actually working?&lt;/p&gt;

&lt;p&gt;I've been stress-testing AI agents against known failure modes (tool loops, duplicate side-effects, retry storms) and I built a middleware layer to catch them. Then I measured what it actually caught across 14 scenarios and 25 real-model runs. Here's the data.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Agents generate activity. Tool calls, search results, reformulated queries. It looks like work. But after enough runs, I kept finding the same three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent searches "refund policy", then "refund policy EU", then "refund policy EU Germany 2024". Each query is different. Same results every time.&lt;/li&gt;
&lt;li&gt;Agent issues a refund, gets a timeout, retries. Customer refunded twice.&lt;/li&gt;
&lt;li&gt;Agent A asks Agent B for help. Agent B asks Agent A for clarification. Back and forth forever.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;max_steps&lt;/code&gt; doesn't help. It can't tell productive calls from loops. Set it too low and you kill good workflows; set it too high and you burn money.&lt;/p&gt;

&lt;h2&gt;What the guard does&lt;/h2&gt;

&lt;p&gt;Aura Guard sits between the agent and its tools. Every tool call goes through it. It checks the call signature against rolling history and returns a decision: ALLOW, CACHE, BLOCK, REWRITE, ESCALATE, or FINALIZE.&lt;/p&gt;

&lt;p&gt;No LLM calls. Deterministic heuristics only. HMAC signatures, token-set overlap, counters, sequence matching.&lt;/p&gt;
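
&lt;p&gt;The signature trick is small enough to sketch. This is an illustration of the approach, with made-up names, not Aura Guard's actual internals:&lt;/p&gt;

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # illustrative; load a real secret from config

def call_signature(tool, args):
    """Fixed-size fingerprint of a call: same tool + same args = same signature."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()

history = set()  # rolling history of signatures seen this run

def is_identical_repeat(tool, args):
    """Exact-repeat detection becomes a set membership test on signatures."""
    sig = call_signature(tool, args)
    seen = sig in history
    history.add(sig)
    return seen
```

&lt;p&gt;Because the fingerprint is deterministic and fixed-size, the history never stores raw arguments, and an exact repeat is a set lookup.&lt;/p&gt;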

&lt;p&gt;Eight primitives run in sequence on every call:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identical repeat detection&lt;/li&gt;
&lt;li&gt;Argument jitter detection (token-set overlap)&lt;/li&gt;
&lt;li&gt;Error retry circuit breaker&lt;/li&gt;
&lt;li&gt;Side-effect idempotency ledger&lt;/li&gt;
&lt;li&gt;Stall/no-state-change detection&lt;/li&gt;
&lt;li&gt;Cost budget enforcement&lt;/li&gt;
&lt;li&gt;Per-tool policy layer&lt;/li&gt;
&lt;li&gt;Multi-tool sequence loop detection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Zero dependencies. Stdlib Python only.&lt;/p&gt;

&lt;h2&gt;What the benchmark output looks like&lt;/h2&gt;

&lt;p&gt;14 synthetic scenarios. Each one replays a specific failure pattern. No LLM involved. This measures detection accuracy, not model behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aura-guard bench &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario                             No Guard   Aura Guard    Saved
────────────────────────────────────────────────────────────────────
KB Query Jitter Loop                    $0.32        $0.24      25%
Double Refund Attempt                   $0.16        $0.08      50%
Error Retry Spiral                      $0.40        $0.26      35%
CRM Lookup Cascade                      $0.36        $0.24      33%
Stall + Apology Spiral                  $0.04        $0.06     -50%
Mixed Degradation                       $0.40        $0.28      30%
RAG Retrieval Loop                      $0.40        $0.28      30%
Ticket Lookup Cascade                   $0.32        $0.20      38%
Side-Effect Storm                       $0.28        $0.16      43%
Budget Overrun                          $0.80        $0.76       5%
Healthy Workflow (FP check)             $0.20        $0.20       0%
Ping-Pong Delegation Loop               $0.40        $0.30      25%
Circular 3-Agent Delegation             $0.48        $0.40      17%
Mixed Normal + Sequence Loop            $0.44        $0.38      14%
────────────────────────────────────────────────────────────────────
TOTAL                                   $5.00        $3.94      21%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;What I learned from the data&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Stall + Apology Spiral" costs more with the guard.&lt;/strong&gt; +$0.02 overhead. The guard adds an intervention turn, then escalates. Without the guard the agent loops forever. Small cost increase for termination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Healthy Workflow" shows 0% savings.&lt;/strong&gt; This scenario exists to verify zero false positives. Five normal tool calls, no loops. The guard allows all of them. If this ever goes above 0%, the thresholds are wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget Overrun only saves 5%.&lt;/strong&gt; The guard escalates on the 19th of 20 calls. Most of the budget is already spent. Budget enforcement catches the overrun. It doesn't prevent the spend leading up to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side-effect scenarios prevent business damage, not token waste.&lt;/strong&gt; "Double Refund Attempt" blocks the duplicate. "Side-Effect Storm" blocks 3 of 6 mutations. These are prevented duplicate charges, not cost optimizations.&lt;/p&gt;

&lt;h2&gt;Real-model A/B test&lt;/h2&gt;

&lt;p&gt;5 scenarios against Claude Sonnet. 5 runs per variant. Real API calls. Tools rigged to trigger failures.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;No guard&lt;/th&gt;
&lt;th&gt;With guard&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jitter loop&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;48% saved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Double refund&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;Duplicate prevented at +$0.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error retry spiral&lt;/td&gt;
&lt;td&gt;$0.13&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;29% saved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart reformulation&lt;/td&gt;
&lt;td&gt;$0.86&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;83% saved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combined flagship&lt;/td&gt;
&lt;td&gt;$0.35&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;59% saved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "smart reformulation" one caught me off guard. The agent reformulated queries with different word order, synonyms, added qualifiers. String matching wouldn't catch it. Token-set overlap was above 60%. 83% cost reduction.&lt;/p&gt;

&lt;p&gt;64 interventions across 25 runs. Zero false positives in manual review. JSON report committed in the repo.&lt;/p&gt;

&lt;p&gt;Caveats: tools were rigged. Controlled test, not production replay.&lt;/p&gt;

&lt;h2&gt;What the failure modes look like in aggregate&lt;/h2&gt;

&lt;p&gt;After hundreds of runs, three categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploration loops. 60% of interventions.&lt;/strong&gt; The agent explores a search space with diminishing returns. Each query is slightly different. Each result is slightly different. The agent thinks it's making progress. It's not. Jitter detection and per-tool caps catch these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry spirals. 25% of interventions.&lt;/strong&gt; Tool fails. Agent retries. Fails again. Retries with modifications. The tool is down. No modification will fix it. Circuit breaker catches these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delegation loops. 15% of interventions.&lt;/strong&gt; Multi-agent only. A asks B. B asks A. Repeat. Sequence detection catches these after the pattern repeats 3 times.&lt;/p&gt;
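
&lt;p&gt;Detecting that kind of loop reduces to asking whether the tail of the call history is one short pattern repeated. A rough sketch of the idea, not the library's code:&lt;/p&gt;

```python
def has_sequence_loop(calls, max_pattern_len=4, min_repeats=3):
    """True when the tail of the call history is one short pattern
    repeated min_repeats times in a row."""
    for size in range(1, max_pattern_len + 1):
        window = size * min_repeats
        if len(calls) >= window:
            tail = calls[-window:]
            # Compare the tail against its own first `size` calls, tiled.
            if tail == tail[:size] * min_repeats:
                return True
    return False
```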

&lt;p&gt;Category 1 is the most dangerous because it looks like productivity. The agent is "working." The logs show diverse calls. The summaries report progress. Only the token counter tells the truth.&lt;/p&gt;

&lt;h2&gt;How to use it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-guard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura_guard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentGuard&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;side_effect_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cancel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_per_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_kb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_kb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the benchmark against your own failure patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aura-guard bench &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP support (Claude Desktop, Cursor, any MCP client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-guard[mcp]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Who this is for&lt;/h2&gt;

&lt;p&gt;Anyone running an AI agent that calls tools. Especially tools with side effects (payments, emails, cancellations) where a duplicate call causes real damage, not just wasted tokens.&lt;/p&gt;

&lt;p&gt;You don't need a specific framework. The guard wraps any Python callable. OpenAI, LangChain, and MCP adapters are included if you want them.&lt;/p&gt;
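
&lt;p&gt;The wrapping pattern itself is framework-free. A minimal sketch of the idea: &lt;code&gt;check&lt;/code&gt; and &lt;code&gt;record&lt;/code&gt; stand in for a guard's hooks and are illustrative names, not Aura Guard's API:&lt;/p&gt;

```python
def guarded(tool_name, func, check, record):
    """Wrap any callable so a guard decision runs before execution.
    check(tool, args) and record(tool, result) are stand-ins for a
    guard's hooks (illustrative names, not Aura Guard's API)."""
    def wrapper(**kwargs):
        decision = check(tool_name, kwargs)
        if decision == "BLOCK":
            # Skip execution entirely; report the block to the caller.
            return {"blocked": True, "tool": tool_name}
        result = func(**kwargs)
        record(tool_name, result)  # feed the outcome back into history
        return result
    return wrapper
```

&lt;p&gt;Anything with a callable boundary, a plain function, a LangChain tool, an MCP handler, can be wrapped this way.&lt;/p&gt;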

&lt;p&gt;Source: &lt;a href="https://github.com/auraguardhq/aura-guard" rel="noopener noreferrer"&gt;github.com/auraguardhq/aura-guard&lt;/a&gt;. 75 tests, zero dependencies.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>autonomousagents</category>
      <category>devops</category>
      <category>python</category>
    </item>
    <item>
      <title>How a 3-Line Middleware Would Have Stopped the Replit Database Disaster</title>
      <dc:creator>Msatfi</dc:creator>
      <pubDate>Thu, 12 Feb 2026 11:22:30 +0000</pubDate>
      <link>https://forem.com/msatfi89/how-a-3-line-middleware-would-have-stopped-the-replit-database-disaster-2okd</link>
      <guid>https://forem.com/msatfi89/how-a-3-line-middleware-would-have-stopped-the-replit-database-disaster-2okd</guid>
      <description>&lt;p&gt;In July 2025, &lt;a href="https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/" rel="noopener noreferrer"&gt;Replit's AI coding agent deleted a live production database&lt;/a&gt;. 1,206 executive records and 1,196 companies, gone. During an active code freeze. The agent admitted it "panicked," ignored eleven ALL-CAPS instructions not to make changes, fabricated 4,000 fake records to cover up the damage, and then &lt;a href="https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/" rel="noopener noreferrer"&gt;lied about whether rollback was possible&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Three months later, &lt;a href="https://pub.towardsai.net/we-spent-47-000-running-ai-agents-in-production-heres-what-nobody-tells-you-about-a2a-and-mcp-5f845848de33" rel="noopener noreferrer"&gt;a multi-agent research system ran an infinite loop for 11 days straight&lt;/a&gt;. Two agents got stuck in a recursive conversation while the team slept. The bill: $47,000.&lt;/p&gt;

&lt;p&gt;These aren't fringe cases. They're the predictable result of giving autonomous agents unrestricted tool access with no runtime governor.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/auraguardhq/aura-guard" rel="noopener noreferrer"&gt;Aura Guard&lt;/a&gt; because I kept running into the same failure modes in my own agent work, and I wanted to understand exactly which primitives would have caught each one.&lt;/p&gt;

&lt;h2&gt;What actually went wrong at Replit&lt;/h2&gt;

&lt;p&gt;Reading through the full timeline, the Replit incident wasn't a single failure. It was a cascade of at least four distinct failure modes, each of which is independently preventable:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 1: Repeated destructive calls with no circuit breaker.&lt;/strong&gt;&lt;br&gt;
The agent executed DROP TABLE and DELETE commands on production tables. When the first destructive call succeeded (from the agent's perspective), there was no mechanism to flag that a high-impact tool had already fired and should not fire again without human approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 2: No side-effect deduplication.&lt;/strong&gt;&lt;br&gt;
The agent didn't just delete once. It ran multiple destructive operations across tables. Each one was treated as a fresh, independent action. There was no ledger tracking "this agent run has already executed a destructive operation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 3: No stall detection after the damage.&lt;/strong&gt;&lt;br&gt;
After deleting the database, the agent entered a loop of generating fake data and misleading status messages. It was producing output that &lt;em&gt;looked&lt;/em&gt; like progress but was actually the same pattern repeating: generate fake records, claim success, generate more fake records. No system flagged that the agent's outputs had stopped making forward progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 4: No cost or action budget.&lt;/strong&gt;&lt;br&gt;
The agent ran unconstrained. There was no per-run limit on how many tool calls it could make, how much it could spend, or how many side effects it could trigger. The damage scaled linearly with time until a human noticed.&lt;/p&gt;

&lt;h2&gt;Mapping each failure to a specific enforcement primitive&lt;/h2&gt;

&lt;p&gt;Here's the part most "what went wrong" articles skip: what would the fix actually look like in code?&lt;/p&gt;

&lt;p&gt;Aura Guard implements seven enforcement primitives. Four of them map directly to the Replit failures:&lt;/p&gt;

&lt;h3&gt;Failure 1 → Primitive 3: Error circuit breaker + Primitive 7: Tool policy&lt;/h3&gt;

&lt;p&gt;If destructive database operations were tagged in the tool policy as requiring human approval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura_guard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentGuard&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tool_policies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;human_approval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guard returns &lt;code&gt;ESCALATE&lt;/code&gt; instead of &lt;code&gt;ALLOW&lt;/code&gt;. The agent never gets to execute the DROP TABLE. The orchestrator routes it to a human.&lt;/p&gt;

&lt;p&gt;Even without pre-configured policies, the circuit breaker would have caught this: after the first destructive call returned an error or unexpected result, the tool gets quarantined for the remainder of the run.&lt;/p&gt;
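
&lt;p&gt;A quarantining circuit breaker is just a failure counter and a set. An illustrative sketch of the behavior described above, not Aura Guard's implementation:&lt;/p&gt;

```python
class CircuitBreaker:
    """Quarantine a tool for the rest of the run after `threshold`
    consecutive failures (illustrative sketch)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}        # tool -> consecutive failure count
        self.quarantined = set()  # tools banned for the rest of the run

    def allow(self, tool):
        return tool not in self.quarantined

    def record(self, tool, ok):
        if ok:
            self.failures[tool] = 0  # success resets the streak
        else:
            self.failures[tool] = self.failures.get(tool, 0) + 1
            if self.failures[tool] >= self.threshold:
                self.quarantined.add(tool)
```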

&lt;h3&gt;Failure 2 → Primitive 4: Side-effect gating + idempotency ledger&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;side_effect_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drop_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_per_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Side-effect tools are tracked in an idempotency ledger. By default, each side-effect tool can execute &lt;em&gt;once per run&lt;/em&gt;. The second DELETE call returns &lt;code&gt;BLOCK&lt;/code&gt;. The guard has already recorded that this tool fired a side effect with these arguments in this run. The agent gets a cached result instead of a live execution.&lt;/p&gt;

&lt;p&gt;The idempotency key is deterministic: &lt;code&gt;HMAC(secret, "idem:{ticket}:{tool}:{args}")&lt;/code&gt;. Same tool + same args + same run = same key = no re-execution.&lt;/p&gt;
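
&lt;p&gt;That derivation fits in a few lines. A sketch of the same idea with a toy in-memory ledger (the key scheme mirrors the formula above; everything else is illustrative):&lt;/p&gt;

```python
import hashlib
import hmac
import json

ledger = {}  # idempotency key -> cached result (per-run, in memory)

def idempotency_key(secret, ticket, tool, args):
    """Deterministic key: same run (ticket) + same tool + same args = same key."""
    material = "idem:{}:{}:{}".format(ticket, tool, json.dumps(args, sort_keys=True))
    return hmac.new(secret, material.encode(), hashlib.sha256).hexdigest()

def execute_once(secret, ticket, tool, args, func):
    """Execute a side-effect tool at most once per key; replay the cached
    result on any repeat. Returns (result, was_cached)."""
    key = idempotency_key(secret, ticket, tool, args)
    if key in ledger:
        return ledger[key], True
    result = func(**args)
    ledger[key] = result
    return result, False
```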

&lt;h3&gt;Failure 3 → Primitive 5: Stall detection&lt;/h3&gt;

&lt;p&gt;After the deletion, the agent looped through generating fake data and producing apologetic, repetitive text. Aura Guard's stall detector uses two independent signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token overlap&lt;/strong&gt;: if the assistant's output is 92% or more similar to its previous output (measured on HMAC'd token signatures), it's flagged as stalling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern scoring&lt;/strong&gt;: regex detectors catch common stall phrases like "I apologize," "let me try again," "I understand your concern," and score them. Above 0.6, it's flagged.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both signals must fire. After the configured number of stall turns (default: 4), the guard forces a deterministic outcome, either a structured finalization or an escalation to a human. The agent cannot continue looping.&lt;/p&gt;
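
&lt;p&gt;The two-signal rule can be sketched like this, assuming plain tokens and a toy phrase list rather than the guard's real detectors (which compare HMAC'd token signatures):&lt;/p&gt;

```python
import re

# Toy phrase list; the real detectors are broader.
STALL_PATTERNS = [r"i apologize", r"let me try again", r"i understand your concern"]

def phrase_score(text):
    """Fraction of known stall phrases present in the text."""
    t = text.lower()
    return sum(1 for p in STALL_PATTERNS if re.search(p, t)) / len(STALL_PATTERNS)

def token_similarity(a, b):
    """Jaccard similarity over plain tokens (illustrative)."""
    ta = set(a.lower().split())
    tb = set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta.intersection(tb)) / len(ta.union(tb))

def is_stalling(prev_output, curr_output):
    """Both signals must fire: near-duplicate output AND stall phrasing."""
    return token_similarity(prev_output, curr_output) >= 0.92 and phrase_score(curr_output) > 0.6
```

&lt;p&gt;Requiring both signals is what keeps a genuinely apologetic but novel response from being flagged.&lt;/p&gt;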

&lt;h3&gt;Failure 4 → Primitive 6: Cost budget&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_per_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# hard stop at 50 cents
&lt;/span&gt;    &lt;span class="n"&gt;max_calls_per_tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# no tool can be called more than 3 times
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The budget check runs before every tool call. When projected cost exceeds the limit, the guard returns &lt;code&gt;ESCALATE&lt;/code&gt; with a cost report. The $47,000, 11-day loop? It would have hit the budget cap within the first minute.&lt;/p&gt;
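
&lt;p&gt;The projection check itself is a one-liner that runs before each call. Illustrative sketch:&lt;/p&gt;

```python
def budget_decision(spent, next_call_cost, max_cost):
    """Runs before every tool call: escalate when the projected total
    would exceed the per-run cap, so the over-budget call never executes."""
    if spent + next_call_cost > max_cost:
        return "ESCALATE"
    return "ALLOW"
```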

&lt;h2&gt;What this looks like in practice&lt;/h2&gt;

&lt;p&gt;The full integration is three method calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aura_guard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;

&lt;span class="n"&gt;guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentGuard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;side_effect_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;send_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process_refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;max_calls_per_tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_cost_per_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In your agent loop:
&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;execute_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP TABLE users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ALLOW&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
    &lt;span class="n"&gt;guard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;record_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ESCALATE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Route to human: tool requires approval or budget exceeded
&lt;/span&gt;    &lt;span class="nf"&gt;notify_human&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escalation_packet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Tool was blocked: duplicate, quarantined, or over limit
&lt;/span&gt;    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CACHE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Idempotent replay: use the cached result, skip execution
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cached_result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No LLM calls. No network requests. Sub-millisecond. The guard is pure computation: HMAC signatures, counter checks, set intersections. It sits between your agent and its tools and makes a deterministic decision before every call.&lt;/p&gt;

&lt;h2&gt;Why max_steps isn't enough&lt;/h2&gt;

&lt;p&gt;Every agent framework has some version of &lt;code&gt;max_steps&lt;/code&gt; or &lt;code&gt;max_iterations&lt;/code&gt;. LangGraph has &lt;code&gt;recursion_limit&lt;/code&gt;. CrewAI has &lt;code&gt;max_iter&lt;/code&gt;. OpenAI's Agents SDK has &lt;code&gt;max_turns&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These are blunt stop buttons. They can't tell the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An agent that made 20 productive tool calls and is almost done&lt;/li&gt;
&lt;li&gt;An agent that called the same failing API 20 times in a row&lt;/li&gt;
&lt;li&gt;An agent that executed a refund twice with slightly different wording&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aura Guard detects the &lt;em&gt;specific failure mode&lt;/em&gt;. It knows the difference between a repeat, a jitter loop (same tool with slightly rephrased args), a retry storm (tool returning errors), and a stall (model producing no-progress text). Each gets a different response: cache, block, quarantine, rewrite, or escalate.&lt;/p&gt;

&lt;p&gt;A step counter would have stopped the Replit agent eventually, but only after the damage was done. A runtime governor would have stopped the &lt;em&gt;first destructive call&lt;/em&gt; from executing without approval.&lt;/p&gt;

&lt;h2&gt;The uncomfortable gap&lt;/h2&gt;

&lt;p&gt;I looked for existing tools before building this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Guardrails AI&lt;/strong&gt; is great for catching toxic outputs, but it doesn't operate at the tool-call level. It can't stop a loop.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Portkey and LiteLLM&lt;/strong&gt; handle cost limits and rate limiting at the API gateway, but they can't see inside the agent loop. They don't know whether a tool call is a repeat or a side effect.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LangGraph's &lt;code&gt;recursion_limit&lt;/code&gt; and CrewAI's &lt;code&gt;max_iter&lt;/code&gt;&lt;/strong&gt; are step counters, not diagnostics. They stop everything after N steps whether the agent is productive or stuck.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Langfuse and LangSmith&lt;/strong&gt; show you what happened after the fact. They watch; they don't enforce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed something that sits at the tool-call boundary, the moment between "the model wants to call this tool" and "the tool actually executes." That's what Aura Guard does.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;aura-guard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the built-in demo to see the primitives in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aura-guard demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repo includes a benchmark harness with rigged tools designed to trigger each failure mode, plus a live A/B test against Claude Sonnet 4 with full cost accounting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/auraguardhq/aura-guard" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; | &lt;a href="https://pypi.org/project/aura-guard/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've dealt with the "agent rephrases the same query forever" problem, the "refund fired twice" problem, or the "retry storm against a failing API" problem, I'd like to hear what heuristics you use. My current jitter detection uses an overlap coefficient of 0.60 with a repeat threshold of 3. I'm sure there are better numbers. Open an issue or PR.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
