<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: tazsat0512</title>
    <description>The latest articles on Forem by tazsat0512 (@tazsat0512).</description>
    <link>https://forem.com/tazsat0512</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851278%2F93d7d45d-abdb-4918-af88-fc8855ef2c2f.png</url>
      <title>Forem: tazsat0512</title>
      <link>https://forem.com/tazsat0512</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tazsat0512"/>
    <language>en</language>
    <item>
      <title>How I Built Open-Source Guardrails That Auto-Stop Runaway AI Agents</title>
      <dc:creator>tazsat0512</dc:creator>
      <pubDate>Mon, 30 Mar 2026 11:16:18 +0000</pubDate>
      <link>https://forem.com/tazsat0512/how-i-built-open-source-guardrails-that-auto-stop-runaway-ai-agents-249m</link>
      <guid>https://forem.com/tazsat0512/how-i-built-open-source-guardrails-that-auto-stop-runaway-ai-agents-249m</guid>
      <description>&lt;p&gt;Runaway AI agents are expensive. Stories of agents burning through &lt;strong&gt;thousands of dollars overnight&lt;/strong&gt; come up regularly on &lt;a href="https://www.reddit.com/r/ChatGPT/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt; and &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; — no budget limit, no loop detection, no kill switch. The agent keeps calling GPT-4 in an infinite loop until someone wakes up and pulls the plug.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/tazsat0512/reivo-guard" rel="noopener noreferrer"&gt;reivo-guard&lt;/a&gt; to prevent this. It's an open-source guardrail library that detects and stops runaway AI agents — with &lt;strong&gt;sub-microsecond overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post walks through the architecture decisions behind each detection layer.&lt;/p&gt;

&lt;h2&gt;The Problem: Agents Don't Know When to Stop&lt;/h2&gt;

&lt;p&gt;LLM agents fail in predictable ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infinite loops&lt;/strong&gt; — The agent keeps asking the same question, or semantically similar variations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost explosions&lt;/strong&gt; — Token consumption spikes 100x with no warning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality degradation&lt;/strong&gt; — Responses get worse over time but the agent keeps going&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cliff-edge failures&lt;/strong&gt; — Everything works until the budget hits 100%, then a hard crash&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Among the tools I evaluated (Helicone, Portkey, LangSmith, Lunary, LiteLLM), most either &lt;em&gt;observe&lt;/em&gt; these failures (dashboards, alerts) or enforce &lt;em&gt;static&lt;/em&gt; rules (rate limits, budget caps). I wanted something that &lt;em&gt;detects&lt;/em&gt; and &lt;em&gt;acts&lt;/em&gt; adaptively — so I built it.&lt;/p&gt;

&lt;h2&gt;Architecture Overview&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;guard.before()  →  Budget check, loop detection, session validation
       ↓
    LLM API call
       ↓
guard.after()   →  Cost tracking, quality verification, trend analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Guard functions are side-effect-free on the hot path — state lives in a key-value store interface (&lt;code&gt;GuardStore&lt;/code&gt;), so it works in serverless (Cloudflare Workers, Lambda) or as a library.&lt;/p&gt;
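&lt;p&gt;To make that concrete, here's a minimal sketch of a key-value interface in the spirit of &lt;code&gt;GuardStore&lt;/code&gt;. The method names and the synchronous in-memory backing are illustrative assumptions, not the library's actual API:&lt;/p&gt;

```typescript
// Minimal key-value store sketch. Method names and the in-memory backing
// are illustrative assumptions, not reivo-guard's real API surface.
interface KVStore {
  get(key: string): string | null;
  set(key: string, value: string): void;
}

// In-memory implementation; a real deployment would back this with
// Workers KV, Redis, DynamoDB, etc.
class MemoryStore implements KVStore {
  private data: { [key: string]: string } = {};
  get(key: string): string | null {
    return key in this.data ? this.data[key] : null;
  }
  set(key: string, value: string): void {
    this.data[key] = value;
  }
}

// Guard state round-trips through the store, so the guard itself stays stateless.
const store: KVStore = new MemoryStore();
store.set("budget:team-a", JSON.stringify({ usedUsd: 12.5, limitUsd: 100 }));
const budget = JSON.parse(store.get("budget:team-a") || "{}");
```

&lt;p&gt;Because all state lives behind that interface, the same guard code runs unchanged in a Worker, a Lambda, or a long-lived process.&lt;/p&gt;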

&lt;p&gt;The key insight: &lt;strong&gt;split checks into sync (blocking) and async (post-response)&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Sync/Async&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget enforcement&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Must block before spending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hash loop detection&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Fixed-size scan (n=20), sub-microsecond&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EWMA anomaly&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;O(1), sub-microsecond&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TF-IDF cosine loop&lt;/td&gt;
&lt;td&gt;Async&lt;/td&gt;
&lt;td&gt;O(W × V) where W=window, V=vocab. Runs in &lt;code&gt;waitUntil()&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-as-Judge quality&lt;/td&gt;
&lt;td&gt;Async&lt;/td&gt;
&lt;td&gt;~100ms external call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality trend&lt;/td&gt;
&lt;td&gt;Sync&lt;/td&gt;
&lt;td&gt;Fixed-size scan (n=50), lightweight&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Layer 1: Loop Detection (Two Algorithms)&lt;/h2&gt;

&lt;h3&gt;Hash Match (The Fast Path)&lt;/h3&gt;

&lt;p&gt;The simplest detector: keep a sliding window of prompt hashes and count exact matches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hashes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;LOOP_HASH_WINDOW&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// last 20&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;matchCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;newHash&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;isLoop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;matchCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;LOOP_HASH_THRESHOLD&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// ≥5 matches&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt;: Most agent loops are &lt;em&gt;exact&lt;/em&gt; duplicates. The agent asks "What is the capital of France?" five times in a row. Hash match catches this with sub-microsecond overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why window=20, threshold=5?&lt;/strong&gt; Agents legitimately retry 2-3 times (network errors, rate limits). 5 matches in 20 requests means 25% of recent traffic is identical — that's a loop, not a retry.&lt;/p&gt;
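&lt;p&gt;Putting the snippet above into a self-contained form (the djb2-style string hash here is a stand-in; the library's actual hashing may differ):&lt;/p&gt;

```typescript
const LOOP_HASH_WINDOW = 20;   // sliding window of recent prompt hashes
const LOOP_HASH_THRESHOLD = 5; // exact repeats that count as a loop

// Toy djb2-style string hash, a stand-in for the real implementation.
function hashPrompt(prompt: string): number {
  let h = 5381;
  for (const ch of prompt) {
    h = (h * 33 + ch.charCodeAt(0)) % 2147483647;
  }
  return h;
}

function detectLoopByHash(hashes: number[], prompt: string): { isLoop: boolean } {
  const newHash = hashPrompt(prompt);
  const window = hashes.slice(-LOOP_HASH_WINDOW);
  const matchCount = window.filter(h => h === newHash).length + 1;
  hashes.push(newHash); // record for the next call
  return { isLoop: matchCount >= LOOP_HASH_THRESHOLD };
}
```

&lt;p&gt;Four prior occurrences in the window plus the incoming request crosses the threshold of 5.&lt;/p&gt;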

&lt;h3&gt;TF-IDF Cosine Similarity (The Smart Path)&lt;/h3&gt;

&lt;p&gt;Hash match misses &lt;em&gt;rephrased&lt;/em&gt; loops: "What's the capital of France?" vs "Tell me France's capital city." Same intent, different hash.&lt;/p&gt;

&lt;p&gt;The cosine detector builds TF-IDF vectors from prompt text and computes pairwise similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Tokenize: lowercase, split on \W+, filter len &amp;gt; 1
2. TF: freq / tokenCount per document
3. IDF: log(n / docFrequency) across all documents
4. Cosine: dot(a, b) / (||a|| × ||b||)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
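&lt;p&gt;The four steps above, as a direct unoptimized sketch (the library's real code differs in detail):&lt;/p&gt;

```typescript
// Step 1: tokenize — lowercase, split on non-word chars, drop 1-char tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(t => t.length > 1);
}

// Cosine similarity between docs[i] and docs[j] using TF-IDF weights
// computed over the whole sliding window of docs.
function tfidfCosine(docs: string[], i: number, j: number): number {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;

  // Step 3 (document frequency first): idf = log(n / docFrequency)
  const idf: { [term: string]: number } = {};
  for (const tokens of tokenized) {
    const seen: { [term: string]: boolean } = {};
    for (const t of tokens) {
      if (!seen[t]) {
        seen[t] = true;
        idf[t] = (idf[t] || 0) + 1;
      }
    }
  }
  for (const t of Object.keys(idf)) idf[t] = Math.log(n / idf[t]);

  // Step 2: tf = freq / tokenCount, then weight by idf.
  const vector = (tokens: string[]) => {
    const v: { [term: string]: number } = {};
    for (const t of tokens) v[t] = (v[t] || 0) + 1 / tokens.length;
    for (const t of Object.keys(v)) v[t] = v[t] * idf[t];
    return v;
  };
  const a = vector(tokenized[i]);
  const b = vector(tokenized[j]);

  // Step 4: cosine = dot(a, b) / (norm(a) * norm(b))
  let dot = 0, na = 0, nb = 0;
  for (const t of Object.keys(a)) {
    na += a[t] * a[t];
    if (t in b) dot += a[t] * b[t];
  }
  for (const t of Object.keys(b)) nb += b[t] * b[t];
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}
```

&lt;p&gt;Note the IDF is computed over the sliding window itself, which is what makes repeated prompts stand out: terms shared by every document in the window get an IDF of zero and stop contributing.&lt;/p&gt;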



&lt;p&gt;&lt;strong&gt;Threshold: 0.92.&lt;/strong&gt; This is deliberately high. At 0.92, the prompts need to share ~85% of their meaningful vocabulary. "How do I sort a list in Python?" and "Python list sorting method?" score ~0.89, below threshold. But four variations of the same question cross it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not embeddings?&lt;/strong&gt; TF-IDF runs locally in &amp;lt;1ms. Embedding APIs add 50-200ms latency and cost money. For loop detection, lexical similarity is good enough — and it's free.&lt;/p&gt;

&lt;p&gt;This runs async (&lt;code&gt;waitUntil()&lt;/code&gt;) so it never blocks the response path.&lt;/p&gt;

&lt;h2&gt;Layer 2: Budget Enforcement with Graceful Degradation&lt;/h2&gt;

&lt;p&gt;Hard budget cutoffs create terrible UX. You're mid-conversation, and suddenly: &lt;code&gt;403 Forbidden&lt;/code&gt;. No warning, no wind-down.&lt;/p&gt;

&lt;p&gt;Instead, reivo-guard implements &lt;strong&gt;four degradation levels&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Usage&lt;/th&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 80%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;normal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80-95%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aggressive&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Force cheaper model routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;95-100%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;new_sessions_only&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Existing sessions continue, new ones blocked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;≥ 100%&lt;/td&gt;
&lt;td&gt;&lt;code&gt;blocked&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All requests rejected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDegradationLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usedUsd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;limitUsd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usedUsd&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;limitUsd&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;blocked&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;blockAll&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_sessions_only&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;blockNewSessions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aggressive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;forceAggressiveRouting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;normal&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 80%?&lt;/strong&gt; At 80% budget consumption, you start routing to cheaper models (GPT-4o-mini instead of GPT-4o). The user barely notices quality difference for most tasks, but cost drops 10-20x.&lt;/p&gt;
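&lt;p&gt;The routing decision itself can be as simple as the sketch below. The model names and the mapping are example choices, not something the library prescribes:&lt;/p&gt;

```typescript
// Illustrative routing by degradation level. Level names come from the
// table above; the model mapping is an example, not fixed by reivo-guard.
type Level = "normal" | "aggressive" | "new_sessions_only" | "blocked";

function pickModel(level: Level, requested: string): string | null {
  if (level === "blocked") return null;             // reject outright
  if (level === "aggressive") return "gpt-4o-mini"; // force the cheap model
  return requested; // normal, or an existing session under new_sessions_only
}
```
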

&lt;p&gt;&lt;strong&gt;Alert deduplication&lt;/strong&gt;: Thresholds fire at 50%, 80%, 100% — but only once each. No alert storms.&lt;/p&gt;
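&lt;p&gt;One way to get fire-once semantics is to compare the previous usage ratio against the new one and only emit thresholds crossed in between (a sketch; the real implementation persists the last-seen ratio through the store):&lt;/p&gt;

```typescript
const ALERT_THRESHOLDS = [0.5, 0.8, 1.0]; // 50%, 80%, 100%

// Returns only the thresholds newly crossed since the last check,
// so each alert fires exactly once as usage climbs.
function newlyCrossed(prevRatio: number, ratio: number): number[] {
  return ALERT_THRESHOLDS
    .filter(t => ratio >= t)        // crossed now
    .filter(t => !(prevRatio >= t)); // but not already crossed before
}
```
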

&lt;p&gt;Note: Portkey and LiteLLM also offer degradation strategies (fallback chains and budget caps respectively). reivo-guard's approach is more granular (4 levels with progressive restrictions) but theirs are more battle-tested at scale.&lt;/p&gt;

&lt;h2&gt;Layer 3: Anomaly Detection (EWMA)&lt;/h2&gt;

&lt;p&gt;Budget limits catch &lt;em&gt;expected&lt;/em&gt; overuse. EWMA catches &lt;em&gt;unexpected&lt;/em&gt; spikes.&lt;/p&gt;

&lt;p&gt;If an agent normally uses 1,000 tokens per request and suddenly jumps to 100,000 — that's an anomaly, even if there's budget remaining.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponentially Weighted Moving Average&lt;/strong&gt; tracks both the mean and variance of token consumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Update running statistics&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;newValue&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newEwma&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newVariance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaVariance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;EWMA_ALPHA&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Detect anomaly&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stdDev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaVariance&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;currentRate&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ewmaValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;stdDev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;isAnomaly&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;zScore&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ANOMALY_Z_THRESHOLD&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt; &lt;span class="c1"&gt;// z &amp;gt; 3.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note on the variance formula: this is a Welford-style EWMA variance update rather than the textbook &lt;code&gt;α*(x-μ)² + (1-α)*σ²&lt;/code&gt;. Both converge to the same result, but this form is slightly more numerically stable for streaming updates since it uses the pre-update diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why EWMA, not a simple moving average?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O(1) space: just two numbers (mean + variance), no window buffer&lt;/li&gt;
&lt;li&gt;Adapts to trends: if usage gradually increases, that's not an anomaly&lt;/li&gt;
&lt;li&gt;Converges fast: ~10 samples and the variance is reliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why α=0.3?&lt;/strong&gt; Aggressive enough to track trend shifts, but not so aggressive that a single outlier moves the baseline. A spike of 10x will trigger z &amp;gt; 3.0 (anomaly) but won't corrupt the baseline mean for subsequent checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical ordering&lt;/strong&gt;: You must call &lt;code&gt;detectAnomaly()&lt;/code&gt; &lt;strong&gt;before&lt;/strong&gt; &lt;code&gt;updateEwma()&lt;/code&gt;. If you update first, the variance absorbs the spike and the z-score drops. This is the kind of bug that only shows up in production.&lt;/p&gt;
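&lt;p&gt;A minimal demonstration of that ordering, using the formulas from above (the state shape and the zero-variance warmup guard are simplifications):&lt;/p&gt;

```typescript
const EWMA_ALPHA = 0.3;
const ANOMALY_Z_THRESHOLD = 3.0;

interface EwmaState { ewmaValue: number; ewmaVariance: number; }

function detectAnomaly(state: EwmaState, rate: number): boolean {
  const stdDev = Math.sqrt(state.ewmaVariance);
  if (stdDev === 0) return false; // warmup: no baseline yet
  return (rate - state.ewmaValue) / stdDev > ANOMALY_Z_THRESHOLD;
}

function updateEwma(state: EwmaState, value: number): void {
  const diff = value - state.ewmaValue;
  state.ewmaValue += EWMA_ALPHA * diff;
  state.ewmaVariance = (1 - EWMA_ALPHA) * (state.ewmaVariance + EWMA_ALPHA * diff * diff);
}

// Detect FIRST, then update. Updating first folds the spike into the
// variance, inflating stdDev and suppressing the very z-score that
// should have flagged it.
function check(state: EwmaState, rate: number): boolean {
  const isAnomaly = detectAnomaly(state, rate);
  updateEwma(state, rate);
  return isAnomaly;
}
```

&lt;p&gt;Feed it a jittery ~1,000-token baseline and the z-scores stay under 2; a 100,000-token spike then lands orders of magnitude past the threshold.&lt;/p&gt;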

&lt;h2&gt;Layer 4: Quality Verification&lt;/h2&gt;

&lt;p&gt;Cost and loops are necessary but not sufficient. An agent can stay within budget, never loop, but produce &lt;em&gt;garbage&lt;/em&gt; outputs. We need quality signals.&lt;/p&gt;

&lt;h3&gt;Logprobs (OpenAI &amp;amp; Google)&lt;/h3&gt;

&lt;p&gt;When available, logprobs are the cheapest quality signal — they come free with the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Map mean logprob to 0-1 score&lt;/span&gt;
&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;meanLogprob&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="c1"&gt;// logprob  0 → score 1.0 (certain)&lt;/span&gt;
&lt;span class="c1"&gt;// logprob -1 → score 0.5 (medium)&lt;/span&gt;
&lt;span class="c1"&gt;// logprob -2 → score 0.0 (uncertain)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simple linear mapping. Logprobs are logarithmic so a nonlinear mapping might be more principled, but in practice this threshold-based approach (flag below -1.0) works well enough for the binary "retry or not" decision.&lt;/p&gt;

&lt;p&gt;If the mean logprob falls below -1.0 (~37% average token confidence), the response is flagged for potential retry with a better model.&lt;/p&gt;
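&lt;p&gt;End to end, the scoring-and-flagging step looks roughly like this (the constant name is for illustration; the real config key may differ):&lt;/p&gt;

```typescript
// Illustrative constant name; the -1.0 cutoff matches the text above.
const LOGPROB_FLAG_THRESHOLD = -1.0;

function scoreFromLogprobs(tokenLogprobs: number[]): { score: number; flagged: boolean } {
  const mean = tokenLogprobs.reduce((a, b) => a + b, 0) / tokenLogprobs.length;
  // Linear map: logprob 0 -> 1.0, -1 -> 0.5, -2 -> 0.0 (clamped)
  const score = Math.max(0, Math.min(1, 1 + mean / 2));
  const flagged = mean >= LOGPROB_FLAG_THRESHOLD ? false : true;
  return { score, flagged };
}
```
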

&lt;h3&gt;LLM-as-Judge (Anthropic &amp;amp; Fallback)&lt;/h3&gt;

&lt;p&gt;Anthropic doesn't expose logprobs. So we use GPT-4o-mini as a judge — truncate the prompt (500 chars) and response (1000 chars), ask for a 0-1 quality score.&lt;/p&gt;

&lt;p&gt;Cost: &lt;strong&gt;&amp;lt;$0.0001 per judgment.&lt;/strong&gt; At this price, you can judge every response.&lt;/p&gt;

&lt;h3&gt;Quality Trend Detection&lt;/h3&gt;

&lt;p&gt;Individual quality scores fluctuate. What matters is the &lt;em&gt;trend&lt;/em&gt;. If quality degrades over a session, the model should auto-upgrade:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare: avg(last 5 scores) vs avg(earlier scores)
If delta ≤ -0.15 AND recent avg &amp;lt; 0.5 → upgrade model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
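&lt;p&gt;As code, with the window sizes and thresholds from the rule above (the function name is illustrative):&lt;/p&gt;

```typescript
const TREND_RECENT = 5;    // size of the "recent" window
const TREND_DELTA = -0.15; // minimum drop vs earlier scores
const TREND_FLOOR = 0.5;   // recent average must also be poor in absolute terms

// Upgrade when recent quality dropped at least 0.15 below earlier quality
// AND the recent average is below 0.5.
function shouldUpgrade(scores: number[]): boolean {
  if (scores.length > TREND_RECENT) {
    const recent = scores.slice(-TREND_RECENT);
    const earlier = scores.slice(0, scores.length - TREND_RECENT);
    const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
    const delta = avg(recent) - avg(earlier);
    return TREND_DELTA >= delta ? TREND_FLOOR > avg(recent) : false;
  }
  return false; // not enough history for a trend
}
```
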



&lt;p&gt;This creates an automatic feedback loop: cheap model → quality drops → upgrade to better model → quality recovers.&lt;/p&gt;

&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;Individual guard checks add &lt;strong&gt;sub-microsecond overhead&lt;/strong&gt;, and even the combined sync path stays in the low microseconds — negligible vs. LLM API latency (100-3000ms).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;checkBudget()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~70 ns&lt;/td&gt;
&lt;td&gt;Pure arithmetic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;detectLoopByHash()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~200 ns&lt;/td&gt;
&lt;td&gt;Array scan, n=20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;getDegradationLevel()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~25 ns&lt;/td&gt;
&lt;td&gt;Three comparisons&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;guard.before()&lt;/code&gt; (Python)&lt;/td&gt;
&lt;td&gt;~2.5 µs&lt;/td&gt;
&lt;td&gt;All sync checks combined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;guard.after()&lt;/code&gt; (Python)&lt;/td&gt;
&lt;td&gt;~0.3 µs&lt;/td&gt;
&lt;td&gt;Cost tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Measured by timing 100K iterations and dividing the wall-clock total by the count, on an Apple M3. These numbers should be taken as order-of-magnitude — at this scale, JIT warmup, GC pauses, and measurement overhead all matter. The &lt;a href="https://github.com/tazsat0512/reivo-guard/tree/main/bench" rel="noopener noreferrer"&gt;benchmark code&lt;/a&gt; is in the repo if you want to reproduce or challenge the methodology.&lt;/p&gt;

&lt;p&gt;The point isn't the exact nanosecond count — it's that guard overhead is 5-6 orders of magnitude smaller than the LLM call it's protecting.&lt;/p&gt;

&lt;h2&gt;What I'd Do Differently&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with Python first.&lt;/strong&gt; The AI ecosystem runs on Python. I started with TypeScript because my proxy runs on Cloudflare Workers, but standalone adoption would've been faster with Python-first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simpler API surface.&lt;/strong&gt; The TypeScript API exposes individual functions (&lt;code&gt;checkBudget&lt;/code&gt;, &lt;code&gt;detectLoopByHash&lt;/code&gt;, &lt;code&gt;getDegradationLevel&lt;/code&gt;). The Python API has a simpler &lt;code&gt;guard.before()&lt;/code&gt; / &lt;code&gt;guard.after()&lt;/code&gt; pattern. The Python approach is better for most users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Skip TF-IDF for v1.&lt;/strong&gt; Hash match catches 90%+ of real loops. Cosine similarity is cool engineering but hasn't triggered in my testing where hash match didn't already catch it. (To be fair, my test traffic is limited — this may change with more diverse usage patterns.)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Try It&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx reivo-guard-demo  &lt;span class="c"&gt;# Interactive demo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GitHub: &lt;a href="https://github.com/tazsat0512/reivo-guard" rel="noopener noreferrer"&gt;github.com/tazsat0512/reivo-guard&lt;/a&gt; — MIT licensed, TypeScript + Python.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've had your own runaway agent story, I'd love to hear it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>typescript</category>
      <category>python</category>
    </item>
  </channel>
</rss>
