<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tazsat0512</title>
    <description>The latest articles on Forem by tazsat0512 (@tazsat0512).</description>
    <link>https://forem.com/tazsat0512</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851278%2F93d7d45d-abdb-4918-af88-fc8855ef2c2f.png</url>
      <title>Forem: tazsat0512</title>
      <link>https://forem.com/tazsat0512</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tazsat0512"/>
    <language>en</language>
    <item>
      <title>How I Built Open-Source Guardrails That Auto-Stop Runaway AI Agents</title>
      <dc:creator>tazsat0512</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:16:18 +0000</pubDate>
      <link>https://forem.com/tazsat0512/how-i-built-open-source-guardrails-that-auto-stop-runaway-ai-agents-249m</link>
      <guid>https://forem.com/tazsat0512/how-i-built-open-source-guardrails-that-auto-stop-runaway-ai-agents-249m</guid>
      <description>&lt;p&gt;Runaway AI agents are expensive. Stories of agents burning through &lt;strong&gt;thousands of dollars overnight&lt;/strong&gt; come up regularly on &lt;a href="https://www.reddit.com/r/ChatGPT/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt; and &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; — no budget limit, no loop detection, no kill switch. The agent keeps calling GPT-4 in an infinite loop until someone wakes up and pulls the plug.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/tazsat0512/reivo-guard" rel="noopener noreferrer"&gt;reivo-guard&lt;/a&gt; to prevent this. It's an open-source guardrail library that detects and stops runaway AI agents — with &lt;strong&gt;sub-microsecond overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture decisions behind each detection layer.&lt;/p&gt;

&lt;h2&gt;The Problem: Agents Don't Know When to Stop&lt;/h2&gt;

&lt;p&gt;LLM agents fail in predictable ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infinite loops&lt;/strong&gt; — The agent keeps asking the same question, or semantically similar variations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost explosions&lt;/strong&gt; — Token consumption spikes 100x with no warning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality degradation&lt;/strong&gt; — Responses get worse over time but the agent keeps going&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cliff-edge failures&lt;/strong&gt; — Everything works until the budget hits 100%, then a hard crash&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Among the tools I evaluated (Helicone, Portkey, LangSmith, Lunary, LiteLLM), most either &lt;em&gt;observe&lt;/em&gt; these failures (dashboards, alerts) or enforce &lt;em&gt;static&lt;/em&gt; rules (rate limits, budget caps). I wanted something that &lt;em&gt;detects&lt;/em&gt; and &lt;em&gt;acts&lt;/em&gt; adaptively — so I built it.&lt;/p&gt;

&lt;h2&gt;Architecture Overview&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;guard.before()  →  Budget check, loop detection, session validation
       ↓
    LLM API call
       ↓
guard.after()   →  Cost tracking, quality verification, trend analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guard functions are side-effect-free on the hot path — state lives in a key-value store interface (&lt;code&gt;GuardStore&lt;/code&gt;), so it works in serverless (Cloudflare Workers, Lambda) or as a library.&lt;/p&gt;
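&lt;p&gt;To make that concrete, here's a minimal sketch of a key-value interface in the spirit of &lt;code&gt;GuardStore&lt;/code&gt;. The method names and the synchronous in-memory backing are illustrative assumptions, not the library's actual API:&lt;/p&gt;

```typescript
// Minimal key-value store sketch. Method names and the in-memory backing
// are illustrative assumptions, not reivo-guard's real API surface.
interface KVStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

// In-memory implementation; a real deployment would back this with
// Workers KV, Redis, DynamoDB, etc.
class MemoryStore implements KVStore {
  private data: { [key: string]: string } = {};
  get(key: string): string | null {
    return key in this.data ? this.data[key] : null;
  }
  set(key: string, value: string): void {
    this.data[key] = value;
  }
}

// Guard state round-trips through the store, so the guard itself stays stateless.
const store: KVStore = new MemoryStore();
store.set("budget:team-a", JSON.stringify({ usedUsd: 12.5, limitUsd: 100 }));
const budget = JSON.parse(store.get("budget:team-a") || "{}");
```

&lt;p&gt;Because all state lives behind that interface, the same guard code runs unchanged in a Worker, a Lambda, or a long-lived process.&lt;/p&gt;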

&lt;p&gt;The key insight: &lt;strong&gt;split checks into sync (blocking) and async (post-response)&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Sync/Async&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget enforcement&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Must block before spending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hash loop detection&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Fixed-size scan (n=20), sub-microsecond&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EWMA anomaly&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;O(1), sub-microsecond&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TF-IDF cosine loop&lt;/td&gt;
&lt;td&gt;Async&lt;/td&gt;
&lt;td&gt;O(W × V) where W=window, V=vocab. Runs in &lt;code&gt;waitUntil()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-as-Judge quality&lt;/td&gt;
&lt;td&gt;Async&lt;/td&gt;
&lt;td&gt;~100ms external call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality trend&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Fixed-size scan (n=50), lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Layer 1: Loop Detection (Two Algorithms)&lt;/h2&gt;

&lt;h3&gt;Hash Match (The Fast Path)&lt;/h3&gt;

&lt;p&gt;The simplest detector: keep a sliding window of prompt hashes and count exact matches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;LOOP_HASH_WINDOW&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// last 20&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;newHash&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;isLoop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;matchCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;LOOP_HASH_THRESHOLD&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// ≥5 matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Most agent loops are &lt;em&gt;exact&lt;/em&gt; duplicates. The agent asks "What is the capital of France?" five times in a row. Hash match catches this with sub-microsecond overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why window=20, threshold=5?&lt;/strong&gt; Agents legitimately retry 2-3 times (network errors, rate limits). 5 matches in 20 requests means 25% of recent traffic is identical — that's a loop, not a retry.&lt;/p&gt;
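&lt;p&gt;Putting the snippet above into a self-contained form (the djb2-style string hash here is a stand-in; the library's actual hashing may differ):&lt;/p&gt;

```typescript
const LOOP_HASH_WINDOW = 20;   // sliding window of recent prompt hashes
const LOOP_HASH_THRESHOLD = 5; // exact repeats that count as a loop

// Toy djb2-style string hash, a stand-in for the real implementation.
function hashPrompt(prompt: string): number {
  let h = 5381;
  for (const ch of prompt) {
    h = (h * 33 + ch.charCodeAt(0)) % 2147483647;
  }
  return h;
}

function detectLoopByHash(hashes: number[], prompt: string): { isLoop: boolean } {
  const newHash = hashPrompt(prompt);
  const window = hashes.slice(-LOOP_HASH_WINDOW);
  const matchCount = window.filter(h => h === newHash).length + 1;
  hashes.push(newHash); // record for the next call
  return { isLoop: matchCount >= LOOP_HASH_THRESHOLD };
}
```

&lt;p&gt;Four prior occurrences in the window plus the incoming request crosses the threshold of 5.&lt;/p&gt;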

&lt;h3&gt;TF-IDF Cosine Similarity (The Smart Path)&lt;/h3&gt;

&lt;p&gt;Hash match misses &lt;em&gt;rephrased&lt;/em&gt; loops: "What's the capital of France?" vs "Tell me France's capital city." Same intent, different hash.&lt;/p&gt;

&lt;p&gt;The cosine detector builds TF-IDF vectors from prompt text and computes pairwise similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Tokenize: lowercase, split on \W+, filter len &amp;gt; 1
2. TF: freq / tokenCount per document
3. IDF: log(n / docFrequency) across all documents
4. Cosine: dot(a, b) / (||a|| × ||b||)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
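&lt;p&gt;The four steps above, as a direct unoptimized sketch (the library's real code differs in detail):&lt;/p&gt;

```typescript
// Step 1: tokenize — lowercase, split on non-word chars, drop 1-char tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(t => t.length > 1);
}

// Cosine similarity between docs[i] and docs[j] using TF-IDF weights
// computed over the whole sliding window of docs.
function tfidfCosine(docs: string[], i: number, j: number): number {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;

  // Step 3 (document frequency first): idf = log(n / docFrequency)
  const idf: { [term: string]: number } = {};
  for (const tokens of tokenized) {
    const seen: { [term: string]: boolean } = {};
    for (const t of tokens) {
      if (!seen[t]) {
        seen[t] = true;
        idf[t] = (idf[t] || 0) + 1;
      }
    }
  }
  for (const t of Object.keys(idf)) idf[t] = Math.log(n / idf[t]);

  // Step 2: tf = freq / tokenCount, then weight by idf.
  const vector = (tokens: string[]) => {
    const v: { [term: string]: number } = {};
    for (const t of tokens) v[t] = (v[t] || 0) + 1 / tokens.length;
    for (const t of Object.keys(v)) v[t] = v[t] * idf[t];
    return v;
  };
  const a = vector(tokenized[i]);
  const b = vector(tokenized[j]);

  // Step 4: cosine = dot(a, b) / (norm(a) * norm(b))
  let dot = 0, na = 0, nb = 0;
  for (const t of Object.keys(a)) {
    na += a[t] * a[t];
    if (t in b) dot += a[t] * b[t];
  }
  for (const t of Object.keys(b)) nb += b[t] * b[t];
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}
```

&lt;p&gt;Note the IDF is computed over the sliding window itself, which is what makes repeated prompts stand out: terms shared by every document in the window get an IDF of zero and stop contributing.&lt;/p&gt;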



&lt;p&gt;&lt;strong&gt;Threshold: 0.92.&lt;/strong&gt; This is deliberately high. At 0.92, the prompts need to share ~85% of their meaningful vocabulary. "How do I sort a list in Python?" and "Python list sorting method?" score ~0.89, below threshold. But four variations of the same question cross it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not embeddings?&lt;/strong&gt; TF-IDF runs locally in &amp;lt;1ms. Embedding APIs add 50-200ms latency and cost money. For loop detection, lexical similarity is good enough — and it's free.&lt;/p&gt;

&lt;p&gt;This runs async (&lt;code&gt;waitUntil()&lt;/code&gt;) so it never blocks the response path.&lt;/p&gt;

&lt;h2&gt;Layer 2: Budget Enforcement with Graceful Degradation&lt;/h2&gt;

&lt;p&gt;Hard budget cutoffs create terrible UX. You're mid-conversation, and suddenly: &lt;code&gt;403 Forbidden&lt;/code&gt;. No warning, no wind-down.&lt;/p&gt;

&lt;p&gt;Instead, reivo-guard implements &lt;strong&gt;four degradation levels&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 80%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;normal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80-95%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aggressive&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Force cheaper model routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95-100%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;new_sessions_only&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Existing sessions continue, new ones blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 100%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All requests rejected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDegradationLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usedUsd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;limitUsd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usedUsd&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;limitUsd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blocked&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;blockAll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_sessions_only&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;blockNewSessions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aggressive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;forceAggressiveRouting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;normal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 80%?&lt;/strong&gt; At 80% budget consumption, you start routing to cheaper models (GPT-4o-mini instead of GPT-4o). The user barely notices quality difference for most tasks, but cost drops 10-20x.&lt;/p&gt;
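&lt;p&gt;The routing decision itself can be as simple as the sketch below. The model names and the mapping are example choices, not something the library prescribes:&lt;/p&gt;

```typescript
// Illustrative routing by degradation level. Level names come from the
// table above; the model mapping is an example, not fixed by reivo-guard.
type Level = "normal" | "aggressive" | "new_sessions_only" | "blocked";

function pickModel(level: Level, requested: string): string | null {
  if (level === "blocked") return null;             // reject outright
  if (level === "aggressive") return "gpt-4o-mini"; // force the cheap model
  return requested; // normal, or an existing session under new_sessions_only
}
```
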

&lt;p&gt;&lt;strong&gt;Alert deduplication&lt;/strong&gt;: Thresholds fire at 50%, 80%, 100% — but only once each. No alert storms.&lt;/p&gt;
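&lt;p&gt;One way to get fire-once semantics is to compare the previous usage ratio against the new one and only emit thresholds crossed in between (a sketch; the real implementation persists the last-seen ratio through the store):&lt;/p&gt;

```typescript
const ALERT_THRESHOLDS = [0.5, 0.8, 1.0]; // 50%, 80%, 100%

// Returns only the thresholds newly crossed since the last check,
// so each alert fires exactly once as usage climbs.
function newlyCrossed(prevRatio: number, ratio: number): number[] {
  return ALERT_THRESHOLDS
    .filter(t => ratio >= t)        // crossed now
    .filter(t => !(prevRatio >= t)); // but not already crossed before
}
```
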

&lt;p&gt;Note: Portkey and LiteLLM also offer degradation strategies (fallback chains and budget caps respectively). reivo-guard's approach is more granular (4 levels with progressive restrictions) but theirs are more battle-tested at scale.&lt;/p&gt;

&lt;h2&gt;Layer 3: Anomaly Detection (EWMA)&lt;/h2&gt;

&lt;p&gt;Budget limits catch &lt;em&gt;expected&lt;/em&gt; overuse. EWMA catches &lt;em&gt;unexpected&lt;/em&gt; spikes.&lt;/p&gt;

&lt;p&gt;If an agent normally uses 1,000 tokens per request and suddenly jumps to 100,000 — that's an anomaly, even if there's budget remaining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponentially Weighted Moving Average&lt;/strong&gt; tracks both the mean and variance of token consumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Update running statistics&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;newValue&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newEwma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newVariance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaVariance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Detect anomaly&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stdDev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaVariance&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;stdDev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;isAnomaly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ANOMALY_Z_THRESHOLD&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// z &amp;gt; 3.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note on the variance formula: this is a Welford-style EWMA variance update rather than the textbook &lt;code&gt;α*(x-μ)² + (1-α)*σ²&lt;/code&gt;. Both converge to the same result, but this form is slightly more numerically stable for streaming updates since it uses the pre-update diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why EWMA, not a simple moving average?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O(1) space: just two numbers (mean + variance), no window buffer&lt;/li&gt;
&lt;li&gt;Adapts to trends: if usage gradually increases, that's not an anomaly&lt;/li&gt;
&lt;li&gt;Converges fast: ~10 samples and the variance is reliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why α=0.3?&lt;/strong&gt; Aggressive enough to track trend shifts, but not so aggressive that a single outlier moves the baseline. A spike of 10x will trigger z &amp;gt; 3.0 (anomaly) but won't corrupt the baseline mean for subsequent checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical ordering&lt;/strong&gt;: You must call &lt;code&gt;detectAnomaly()&lt;/code&gt; &lt;strong&gt;before&lt;/strong&gt; &lt;code&gt;updateEwma()&lt;/code&gt;. If you update first, the variance absorbs the spike and the z-score drops. This is the kind of bug that only shows up in production.&lt;/p&gt;
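&lt;p&gt;A minimal demonstration of that ordering, using the formulas from above (the state shape and the zero-variance warmup guard are simplifications):&lt;/p&gt;

```typescript
const EWMA_ALPHA = 0.3;
const ANOMALY_Z_THRESHOLD = 3.0;

interface EwmaState { ewmaValue: number; ewmaVariance: number; }

function detectAnomaly(state: EwmaState, rate: number): boolean {
  const stdDev = Math.sqrt(state.ewmaVariance);
  if (stdDev === 0) return false; // warmup: no baseline yet
  return (rate - state.ewmaValue) / stdDev > ANOMALY_Z_THRESHOLD;
}

function updateEwma(state: EwmaState, value: number): void {
  const diff = value - state.ewmaValue;
  state.ewmaValue += EWMA_ALPHA * diff;
  state.ewmaVariance = (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff);
}

// Detect FIRST, then update. Updating first folds the spike into the
// variance, inflating stdDev and suppressing the very z-score that
// should have flagged it.
function check(state: EwmaState, rate: number): boolean {
  const isAnomaly = detectAnomaly(state, rate);
  updateEwma(state, rate);
  return isAnomaly;
}
```

&lt;p&gt;Feed it a jittery ~1,000-token baseline and the z-scores stay under 2; a 100,000-token spike then lands orders of magnitude past the threshold.&lt;/p&gt;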

&lt;h2&gt;Layer 4: Quality Verification&lt;/h2&gt;

&lt;p&gt;Cost and loops are necessary but not sufficient. An agent can stay within budget, never loop, but produce &lt;em&gt;garbage&lt;/em&gt; outputs. We need quality signals.&lt;/p&gt;

&lt;h3&gt;Logprobs (OpenAI &amp;amp; Google)&lt;/h3&gt;

&lt;p&gt;When available, logprobs are the cheapest quality signal — they come free with the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Map mean logprob to 0-1 score&lt;/span&gt;
&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;meanLogprob&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// logprob  0 → score 1.0 (certain)&lt;/span&gt;
&lt;span class="c1"&gt;// logprob -1 → score 0.5 (medium)&lt;/span&gt;
&lt;span class="c1"&gt;// logprob -2 → score 0.0 (uncertain)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simple linear mapping. Logprobs are logarithmic so a nonlinear mapping might be more principled, but in practice this threshold-based approach (flag below -1.0) works well enough for the binary "retry or not" decision.&lt;/p&gt;

&lt;p&gt;If the mean logprob falls below -1.0 (~37% average token confidence), the response is flagged for potential retry with a better model.&lt;/p&gt;
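&lt;p&gt;End to end, the scoring-and-flagging step looks roughly like this (the constant name is for illustration; the real config key may differ):&lt;/p&gt;

```typescript
// Illustrative constant name; the -1.0 cutoff matches the text above.
const LOGPROB_FLAG_THRESHOLD = -1.0;

function scoreFromLogprobs(tokenLogprobs: number[]): { score: number; flagged: boolean } {
  const mean = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  // Linear map: logprob 0 -> 1.0, -1 -> 0.5, -2 -> 0.0 (clamped)
  const score = Math.max(0, Math.min(1, 1 + mean / 2));
  const flagged = mean >= LOGPROB_FLAG_THRESHOLD ? false : true;
  return { score, flagged };
}
```
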

&lt;h3&gt;LLM-as-Judge (Anthropic &amp;amp; Fallback)&lt;/h3&gt;

&lt;p&gt;Anthropic doesn't expose logprobs. So we use GPT-4o-mini as a judge — truncate the prompt (500 chars) and response (1000 chars), ask for a 0-1 quality score.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;&amp;lt;$0.0001 per judgment.&lt;/strong&gt; At this price, you can judge every response.&lt;/p&gt;

&lt;h3&gt;Quality Trend Detection&lt;/h3&gt;

&lt;p&gt;Individual quality scores fluctuate. What matters is the &lt;em&gt;trend&lt;/em&gt;. If quality degrades over a session, the model should auto-upgrade:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare: avg(last 5 scores) vs avg(earlier scores)
If delta ≤ -0.15 AND recent avg &amp;lt; 0.5 → upgrade model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
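&lt;p&gt;As code, with the window sizes and thresholds from the rule above (the function name is illustrative):&lt;/p&gt;

```typescript
const TREND_RECENT = 5;    // size of the "recent" window
const TREND_DELTA = -0.15; // minimum drop vs earlier scores
const TREND_FLOOR = 0.5;   // recent average must also be poor in absolute terms

// Upgrade when recent quality dropped at least 0.15 below earlier quality
// AND the recent average is below 0.5.
function shouldUpgrade(scores: number[]): boolean {
  if (scores.length > TREND_RECENT) {
    const recent = scores.slice(-TREND_RECENT);
    const earlier = scores.slice(0, scores.length - TREND_RECENT);
    const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const delta = avg(recent) - avg(earlier);
    return TREND_DELTA >= delta ? TREND_FLOOR > avg(recent) : false;
  }
  return false; // not enough history for a trend
}
```
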



&lt;p&gt;This creates an automatic feedback loop: cheap model → quality drops → upgrade to better model → quality recovers.&lt;/p&gt;

&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;Individual guard checks add &lt;strong&gt;sub-microsecond overhead&lt;/strong&gt;, and even the combined sync path stays in the low microseconds — negligible vs. LLM API latency (100-3000ms).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;checkBudget()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~70 ns&lt;/td&gt;
&lt;td&gt;Pure arithmetic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;detectLoopByHash()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~200 ns&lt;/td&gt;
&lt;td&gt;Array scan, n=20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;getDegradationLevel()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~25 ns&lt;/td&gt;
&lt;td&gt;Three comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;guard.before()&lt;/code&gt; (Python)&lt;/td&gt;
&lt;td&gt;~2.5 µs&lt;/td&gt;
&lt;td&gt;All sync checks combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;guard.after()&lt;/code&gt; (Python)&lt;/td&gt;
&lt;td&gt;~0.3 µs&lt;/td&gt;
&lt;td&gt;Cost tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Measured by timing 100K iterations and dividing the wall-clock total by the count, on an Apple M3. These numbers should be taken as order-of-magnitude — at this scale, JIT warmup, GC pauses, and measurement overhead all matter. The &lt;a href="https://github.com/tazsat0512/reivo-guard/tree/main/bench" rel="noopener noreferrer"&gt;benchmark code&lt;/a&gt; is in the repo if you want to reproduce or challenge the methodology.&lt;/p&gt;

&lt;p&gt;The point isn't the exact nanosecond count — it's that guard overhead is 5-6 orders of magnitude smaller than the LLM call it's protecting.&lt;/p&gt;

&lt;h2&gt;What I'd Do Differently&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Python first.&lt;/strong&gt; The AI ecosystem runs on Python. I started with TypeScript because my proxy runs on Cloudflare Workers, but standalone adoption would've been faster with Python-first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simpler API surface.&lt;/strong&gt; The TypeScript API exposes individual functions (&lt;code&gt;checkBudget&lt;/code&gt;, &lt;code&gt;detectLoopByHash&lt;/code&gt;, &lt;code&gt;getDegradationLevel&lt;/code&gt;). The Python API has a simpler &lt;code&gt;guard.before()&lt;/code&gt; / &lt;code&gt;guard.after()&lt;/code&gt; pattern. The Python approach is better for most users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skip TF-IDF for v1.&lt;/strong&gt; Hash match catches 90%+ of real loops. Cosine similarity is cool engineering but hasn't triggered in my testing where hash match didn't already catch it. (To be fair, my test traffic is limited — this may change with more diverse usage patterns.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx reivo-guard-demo  &lt;span class="c"&gt;# Interactive demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/tazsat0512/reivo-guard" rel="noopener noreferrer"&gt;github.com/tazsat0512/reivo-guard&lt;/a&gt; — MIT licensed, TypeScript + Python.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've had your own runaway agent story, I'd love to hear it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>python</category>
    </item>
  </channel>
</rss>
