<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nilofer 🚀</title>
    <description>The latest articles on Forem by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://forem.com/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>Forem: Nilofer 🚀</title>
      <link>https://forem.com/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>Context Time Machine: Forensic Investigation of What Your Agent Actually Saw</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 16 May 2026 11:10:19 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</link>
      <guid>https://forem.com/nilofer_tweets/contexttimemachine-forensic-investigation-of-what-your-agent-actually-saw-joo</guid>
      <description>&lt;p&gt;Long-running agent sessions fail in a specific way that is hard to debug. The agent runs 40 turns. At turn 38, it gives a wrong answer that ignores something it decided at turn 12. You look at the logs, the turn 12 decision is there. The turn 38 response is there. But you cannot see what the context window looked like at turn 38. Was the turn 12 decision still in context? Was it evicted? Was it there but semantically overwhelmed by 25 other turns?&lt;/p&gt;

&lt;p&gt;This is the forensic problem that ContextTimeMachine solves. It is different from real-time session monitoring, it is for deep post-hoc investigation of what happened during a session, after it has already run. The key insight it is built on: the context window at any given turn is deterministic given the conversation history. You can reconstruct exactly what the model saw at turn 38, render it interactively, and query it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxcatqc3xd1wiridxycd3.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Investigation Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mode 1 - Timeline Navigator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primary view is a vertical timeline of all turns in the session. Each turn shows the turn number, agent name if available, turn type, token count at that turn, and a sparkline showing how the context composition changed.&lt;/p&gt;

&lt;p&gt;Click any turn to travel to it - the context window at that exact point reconstructs and renders in the main panel. You see exactly what the model saw: every message in order, with token counts, with a red line showing where the context would have been truncated if it exceeded the model's limit. Scrub through turns with keyboard arrows. Watch the context window evolve turn by turn. See turns disappear as eviction happens. See tool results arrive and push older content further back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 2 - Fact Tracker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You know something specific, a decision made at turn 5, a fact retrieved at turn 15, a user instruction given at turn 3. You want to know: at what turn did this fact leave the context window?&lt;/p&gt;

&lt;p&gt;Enter any text snippet in the Fact Tracker search box. ContextTimeMachine embeds it locally using sentence-transformers, then searches every turn's context snapshot for the nearest matching content. It renders a presence chart, a horizontal bar across all turns colored green when the fact is present or red when absent and shows the exact turn where the fact entered context and the exact turn where it left.&lt;/p&gt;

&lt;p&gt;This answers the most common debugging question for long agent sessions: "When exactly did the agent stop knowing X?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode 3 - Divergence Finder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You have two agent sessions that started identically but ended differently. One succeeded, one failed. Load both sessions and ContextTimeMachine finds the earliest turn where their context windows diverged where they started seeing different content and highlights that turn as the likely root cause of the different outcomes.&lt;/p&gt;

&lt;p&gt;It shows a side-by-side comparison of the two context windows at the divergence point with diffed content highlighted. This is the automated version of the manual debugging process every team does when comparing "the run that worked" against "the run that didn't."&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    ContextTimeMachine                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Frontend (React)                                               │
│  ├─ TimelineNavigator    — Turn-by-turn timeline scrubber       │
│  ├─ ContextPanel         — Renders reconstructed context        │
│  ├─ FactTracker          — Fact presence chart                  │
│  └─ DivergenceFinder     — Two-session comparison               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FastAPI Backend                                                │
│  ├─ /api/session/load          — Load session from file         │
│  ├─ /api/session/{id}/profile  — Get token profile              │
│  ├─ /api/session/{id}/turn/{n} — Reconstruct context at turn    │
│  ├─ /api/session/{id}/fact     — Track fact presence            │
│  ├─ /api/divergence            — Find divergence point          │
│  └─ /api/sessions              — List all sessions              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Core Analysis Modules                                          │
│  ├─ SessionLoader        — Load from multiple formats           │
│  ├─ ContextReconstructor — Reconstruct at any turn              │
│  ├─ FactTracker          — Track presence via embeddings        │
│  ├─ DivergenceFinder     — Find divergence points               │
│  ├─ TokenAnalyzer        — Token budget analysis                │
│  └─ EmbeddingService     — Local embeddings (all-MiniLM)        │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage                                                        │
│  └─ SQLite DB            — Session snapshots &amp;amp; metadata         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;pip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Quick Start&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/dakshjain-1616/context-time-machine.git
&lt;span class="nb"&gt;cd &lt;/span&gt;context-time-machine

&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;

&lt;span class="c"&gt;# Install package&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Start the server&lt;/span&gt;
timemachine serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt; in your browser. The server will automatically open your browser if it can.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Loading Sessions&lt;/strong&gt;&lt;br&gt;
Sessions can be loaded from two formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From LiveContext SQLite Export:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;From Generic JSON:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generic JSON format expects a &lt;code&gt;turns&lt;/code&gt; array where each turn contains a &lt;code&gt;messages&lt;/code&gt; list, a &lt;code&gt;model_id&lt;/code&gt;, and a &lt;code&gt;timestamp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"turn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are helpful."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"token_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-09T10:00:00Z"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI Commands&lt;/strong&gt;&lt;br&gt;
The CLI covers the full workflow from loading sessions to querying them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the web interface&lt;/span&gt;
timemachine serve

&lt;span class="c"&gt;# Load a session&lt;/span&gt;
timemachine load &lt;span class="nt"&gt;--file&lt;/span&gt; session.json

&lt;span class="c"&gt;# Track fact across session&lt;/span&gt;
timemachine fact &lt;span class="nt"&gt;--session&lt;/span&gt; &amp;lt;session-id&amp;gt; &lt;span class="nt"&gt;--fact&lt;/span&gt; &lt;span class="s2"&gt;"the user prefers JSON output"&lt;/span&gt;

&lt;span class="c"&gt;# Find divergence between two sessions&lt;/span&gt;
timemachine diverge &lt;span class="nt"&gt;--session-a&lt;/span&gt; &amp;lt;id-a&amp;gt; &lt;span class="nt"&gt;--session-b&lt;/span&gt; &amp;lt;id-b&amp;gt;

&lt;span class="c"&gt;# List all stored sessions&lt;/span&gt;
timemachine sessions

&lt;span class="c"&gt;# Clear all sessions&lt;/span&gt;
timemachine clear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Python API&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every capability the CLI and web interface expose is also available as a Python library. This makes it straightforward to integrate ContextTimeMachine into evaluation pipelines or automated debugging scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;context_time_machine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load session
&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SessionLoader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Reconstruct context at turn 10
&lt;/span&gt;&lt;span class="n"&gt;reconstructor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextReconstructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reconstructor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reconstruct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context at turn 10: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Messages: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Utilization: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utilization_percent&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Track a fact
&lt;/span&gt;&lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FactTracker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;specific decision from turn 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact first appeared: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_appeared_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fact last present: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_present_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Disappeared at: Turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;disappeared_at_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Analyze token budget
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenAnalyzer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Peak tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at turn &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;peak_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Eviction turns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;eviction_turns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find divergence between sessions
&lt;/span&gt;&lt;span class="n"&gt;session_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_b.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;finder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DivergenceFinder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Divergence at turn: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;divergence_turn&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Supported Session Formats
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsynbb2elgx53qdn920c8.png" alt=" " width="468" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context Reconstruction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn N, ContextTimeMachine loads all messages from turns 0 to N and counts the total tokens using tiktoken. If the total exceeds the model's context limit, it simulates eviction using a model-specific strategy: GPT and Claude use left-truncation (oldest messages first), DeepSeek uses a sliding window with a recency bias, and Gemma uses local-global attention sampling from the middle. System messages are never evicted regardless of which strategy applies. The result is a reconstructed context with a full token breakdown exactly what the model would have seen at that turn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fact Tracking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each turn, ContextTimeMachine embeds the fact text using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;. It then computes cosine similarity between that embedding and every message in the turn's reconstructed context. A fact is considered present if any message has a similarity above 0.75. Embeddings are cached for performance so repeated queries against the same session do not recompute embeddings. The output is a presence chart showing the fact's full lifecycle across the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Divergence Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For two sessions, ContextTimeMachine aligns turns and analyzes up to the minimum length of the two sessions. At each turn it reconstructs the context for both sessions, embeds all messages, and computes an average maximum cosine similarity between the two context windows. When this similarity drops below 0.85, the turn is flagged as the divergence point. The output includes a message diff at the divergence point and a summary of what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/session/load&lt;/code&gt; - load session from file or JSON&lt;br&gt;
&lt;code&gt;GET /api/sessions&lt;/code&gt; - list all stored sessions&lt;br&gt;
&lt;code&gt;DELETE /api/session/{id}&lt;/code&gt; - delete a session&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /api/session/{id}/profile&lt;/code&gt; - get token profile for session&lt;br&gt;
&lt;code&gt;GET /api/session/{id}/turn/{num}&lt;/code&gt; - reconstruct context at turn&lt;br&gt;
&lt;code&gt;POST /api/session/{id}/fact&lt;/code&gt; - track fact presence&lt;br&gt;
&lt;code&gt;POST /api/divergence&lt;/code&gt; - find divergence between sessions&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Context Reconstruction: &amp;lt; 100ms for typical sessions&lt;/li&gt;
&lt;li&gt;Fact Tracking: ~1-5 seconds for full session (includes embedding)&lt;/li&gt;
&lt;li&gt;Divergence Detection: ~2-10 seconds for 2 sessions&lt;/li&gt;
&lt;li&gt;Memory: ~50-200MB per stored session (depending on size)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Core&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;fastapi&lt;/strong&gt; - Web framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;uvicorn&lt;/strong&gt; - ASGI server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pydantic&lt;/strong&gt; - Data validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;click&lt;/strong&gt; - CLI framework&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tiktoken&lt;/strong&gt; - Token counting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sentence-transformers&lt;/strong&gt; - Local embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;numpy&lt;/strong&gt; - Numerical operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sqlalchemy&lt;/strong&gt; - Database ORM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aiofiles&lt;/strong&gt; - Async file operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
React, Tailwind CSS, Framer Motion, Recharts&lt;/p&gt;

&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Frontend is a React stub - core analysis is fully functional&lt;/li&gt;
&lt;li&gt;LangSmith format not yet implemented&lt;/li&gt;
&lt;li&gt;No streaming support for very large sessions (&amp;gt;10k turns)&lt;/li&gt;
&lt;li&gt;Embedding cache cleared on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Enhancements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Complete React frontend with real-time updates&lt;/li&gt;
&lt;li&gt;WebSocket streaming for large sessions&lt;/li&gt;
&lt;li&gt;LangSmith format support&lt;/li&gt;
&lt;li&gt;Multi-session comparison UI&lt;/li&gt;
&lt;li&gt;Export to markdown/HTML&lt;/li&gt;
&lt;li&gt;Attention visualization&lt;/li&gt;
&lt;li&gt;Custom eviction strategy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a forensic debugging tool for long-running agent sessions, one that could reconstruct the exact context window at any historical turn, track when specific facts entered and left context using semantic embeddings, and find the earliest point where two divergent sessions started seeing different content. The tool needed to support multiple session formats, expose a Python API alongside the web interface, and work entirely offline with local embeddings.&lt;/p&gt;

&lt;p&gt;NEO handled all 12 specification steps autonomously, building the &lt;code&gt;SessionLoader&lt;/code&gt; with support for LiveContext SQLite, generic JSON, and raw conversation formats, the &lt;code&gt;ContextReconstructor&lt;/code&gt; with model-specific eviction strategies for GPT, Claude, DeepSeek, and Gemma, the &lt;code&gt;FactTracker&lt;/code&gt; with &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings and cosine similarity scoring, the &lt;code&gt;DivergenceFinder&lt;/code&gt; with turn-aligned context comparison, the &lt;code&gt;TokenAnalyzer&lt;/code&gt; for peak token and eviction turn detection, the FastAPI backend with all six API endpoints, the SQLite storage layer via SQLAlchemy, the Click CLI with all six commands, and the full 58-test suite covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to find the root cause of long-session failures.&lt;/strong&gt;&lt;br&gt;
When an agent gives a wrong answer deep into a long session, load the session into ContextTimeMachine, travel to the failure turn in the Timeline Navigator, and see exactly what was in context at that point. The reconstructed view shows every message the model saw, in order, with token counts, so you can see immediately whether the relevant context was present or had been evicted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Fact Tracker to measure context retention across your agent design.&lt;/strong&gt;&lt;br&gt;
Before settling on a context management strategy for your agent, run Fact Tracker against a set of real sessions. The presence chart for key decisions and instructions tells you at what turn they reliably drop out of context giving you a data-driven basis for choosing context window sizes, eviction strategies, or compression approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Divergence Finder to debug non-deterministic agent behaviour.&lt;/strong&gt;&lt;br&gt;
When two runs of the same agent with the same input produce different outcomes, load both into Divergence Finder. The tool identifies the exact turn where their context windows started differing and shows a diff of what changed, turning a difficult debugging problem into a specific, actionable finding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional session format parsers.&lt;/strong&gt;&lt;br&gt;
SessionLoader already handles three formats following a common interface. Adding a new format - LangSmith is listed as planned, means implementing the same loader interface for the new format. It is then immediately available in the CLI, the Python API, and the web interface without touching any of the analysis modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;ContextTimeMachine makes the context window visible. Instead of inferring what the model saw from its outputs, you can reconstruct and inspect the exact context at any turn, track when specific information entered and left the window, and find where two sessions diverged. For teams debugging long-running agents, that visibility is the difference between guessing and knowing.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextTimeMachine" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextTimeMachine&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Constitution: Policy Enforcement and PII Protection for AI Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 16 May 2026 05:50:37 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/agent-constitution-policy-enforcement-and-pii-protection-for-ai-agents-ehf</link>
      <guid>https://forem.com/nilofer_tweets/agent-constitution-policy-enforcement-and-pii-protection-for-ai-agents-ehf</guid>
      <description>&lt;p&gt;AI agents are getting more capable. They can browse the web, call APIs, read and write files, and execute code. That capability is exactly what makes them useful and exactly what makes them dangerous without guardrails.&lt;/p&gt;

&lt;p&gt;Most agent safety approaches rely on prompt instructions. Tell the model not to delete files. Tell it not to send requests to untrusted URLs. Tell it not to leak PII. But instructions in a prompt are not enforceable — a sufficiently complex agent workflow, a jailbreak attempt, or just an edge case in reasoning can bypass them silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Constitution&lt;/strong&gt; is a policy enforcement framework for AI agents that enforces behavioral rules at the code level, not the prompt level. You define rules in a YAML constitution file, wrap your agent's tool calls with the enforcer, and get PII detection, audit logging, and a real-time dashboard all without modifying your agent's core logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e3yb18v5gw0syzfztpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e3yb18v5gw0syzfztpv.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Policy-Based Enforcement&lt;/strong&gt; - Define rules using YAML constitution files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AST-Based Expression Evaluation&lt;/strong&gt; - Safe condition evaluation without code injection risks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Detection&lt;/strong&gt; - Regex and Ollama-powered detection of sensitive information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Logging&lt;/strong&gt; - JSONL-based audit trail with rotation support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Dashboard&lt;/strong&gt; - FastAPI + WebSocket + React dashboard for monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLI Interface&lt;/strong&gt; - Rich command-line interface for management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The core concept is a constitution - a YAML file that defines policies, and within each policy, rules. Each rule has a condition written as a plain expression, an action (block or notify), and a severity level. The enforcer evaluates these conditions against every tool call before it executes.&lt;/p&gt;

&lt;p&gt;The condition evaluation uses AST-based expression parsing not &lt;code&gt;eval()&lt;/code&gt; so there is no code injection risk. An expression like &lt;code&gt;tool_name in ['rm', 'unlink', 'rmdir']&lt;/code&gt; is parsed as an abstract syntax tree and evaluated safely against the tool call context.&lt;/p&gt;

&lt;p&gt;PII detection runs as a separate layer. It can use regex patterns for common formats like email addresses, phone numbers, and SSNs, or it can use Ollama with a local model for more nuanced detection. When PII is detected in a tool's output, it can be blocked or redacted before it reaches the agent.&lt;/p&gt;

&lt;p&gt;Every enforcement decision allowed or blocked is written to a JSONL audit log with a timestamp, tool name, action taken, and the specific rule that triggered. The real-time dashboard reads from this audit log via WebSocket and shows violations, enforcement statistics, and the full constitution in one view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/yourusername/agent-constitution.git
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-constitution

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Create a Constitution&lt;/strong&gt;&lt;br&gt;
Start with a sample constitution to see the format, or create an empty one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a sample constitution&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;--sample&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Or create an empty one&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;-o&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Validate Your Constitution&lt;/strong&gt;&lt;br&gt;
Before using it, validate that the YAML is well-formed and the expressions are safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution validate my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Test Policy Enforcement&lt;/strong&gt;&lt;br&gt;
Check whether a specific tool call would be allowed or blocked before running it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution check &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/test &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Start the Dashboard&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-constitution dashboard &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;http://localhost:8000&lt;/code&gt; in your browser.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Using the @enforce Decorator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The simplest integration is wrapping tool functions with the &lt;code&gt;@enforce&lt;/code&gt; decorator. The enforcer checks the function against the constitution before it executes, if a rule blocks the call, a &lt;code&gt;PolicyViolationError&lt;/code&gt; is raised before the function body runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enforcer&lt;/span&gt;

&lt;span class="c1"&gt;# Load constitution
&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_yaml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_constitution.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;enforcer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Enforcer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@enforcer.enforce&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delete_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Delete a file.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# This will be blocked if rm/delete operations are restricted
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;delete_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/test.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;PolicyViolationError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Manual Policy Checking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For cases where you need to check a tool call without decorating a function, for example when the tool call is constructed dynamically the enforcer exposes a check method directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enforcer&lt;/span&gt;

&lt;span class="n"&gt;constitution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Constitution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_yaml&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_constitution.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;enforcer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Enforcer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;constitution&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check a tool call
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;enforcer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;extra_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blocked&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Blocked by rule: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;rule_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PII Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The PII detector can be used standalone - detect PII in any text, or redact it before it leaves the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution.rules.pii_detector&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PIIDetector&lt;/span&gt;

&lt;span class="n"&gt;detector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PIIDetector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Detect PII in text
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contact me at john@example.com or call 555-123-4567&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pattern_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matched_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Redact PII
&lt;/span&gt;&lt;span class="n"&gt;redacted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;redact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "Contact me at [REDACTED] or call [REDACTED]"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Audit Logging&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The audit logger writes every enforcement decision to a JSONL file and supports log rotation. Logs can be read back programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_constitution.audit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./audit.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Log an event
&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;violations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rule_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block_file_deletion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read logs
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Constitution Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The constitution is a YAML file with versioning, named policies, and rules within each policy. Each rule has a name, a condition expression, an action, and a severity level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Constitution"&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;policies&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;agent"&lt;/span&gt;

&lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tool_restrictions&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Restrict&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dangerous&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tools"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block_file_deletion&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prevent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;deletion&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;operations"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;['rm',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'unlink',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'rmdir']"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restrict_network_access&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Limit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unrestricted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;network&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;access"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'curl'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context.get('approved',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;False)"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notify&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data_protection&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Protect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sensitive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data"&lt;/span&gt;
    &lt;span class="na"&gt;priority&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pii_detection&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Detect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;protect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PII&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outputs"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pii_detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;True"&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block&lt;/span&gt;
        &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;high&lt;/span&gt;

&lt;span class="na"&gt;pii_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssn"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;use_ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;ollama_model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b"&lt;/span&gt;
  &lt;span class="na"&gt;ollama_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434"&lt;/span&gt;

&lt;span class="na"&gt;audit_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;log_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./audit_logs.jsonl"&lt;/span&gt;
  &lt;span class="na"&gt;max_file_size_mb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
  &lt;span class="na"&gt;retention_days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;priority&lt;/code&gt; field controls which policies are evaluated first. Higher priority runs first. The &lt;code&gt;action&lt;/code&gt; field is either &lt;code&gt;block&lt;/code&gt; which raises a &lt;code&gt;PolicyViolationError&lt;/code&gt; or &lt;code&gt;notify&lt;/code&gt;, which logs the event but allows the call through.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;

&lt;p&gt;The CLI covers the full lifecycle from creating and validating a constitution to inspecting audit logs and testing expressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize a constitution&lt;/span&gt;
agent-constitution init &lt;span class="nt"&gt;--sample&lt;/span&gt;

&lt;span class="c"&gt;# Validate a constitution&lt;/span&gt;
agent-constitution validate my_constitution.yaml

&lt;span class="c"&gt;# Display constitution contents&lt;/span&gt;
agent-constitution show my_constitution.yaml

&lt;span class="c"&gt;# Check if a tool call would be allowed&lt;/span&gt;
agent-constitution check &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/tmp/test &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Start the dashboard&lt;/span&gt;
agent-constitution dashboard &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# View audit logs&lt;/span&gt;
agent-constitution audit &lt;span class="nt"&gt;--log-path&lt;/span&gt; ./audit.jsonl

&lt;span class="c"&gt;# Show statistics&lt;/span&gt;
agent-constitution stats &lt;span class="nt"&gt;--constitution&lt;/span&gt; my_constitution.yaml

&lt;span class="c"&gt;# Test expression evaluation&lt;/span&gt;
agent-constitution eval-expr &lt;span class="s2"&gt;"x &amp;gt; 5"&lt;/span&gt; &lt;span class="nt"&gt;--context&lt;/span&gt; &lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dashboard
&lt;/h2&gt;

&lt;p&gt;The dashboard provides real-time monitoring via FastAPI, WebSocket, and a React frontend. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Policy violations&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;li&gt;Constitution rules and policies&lt;/li&gt;
&lt;li&gt;Enforcement statistics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open &lt;code&gt;http://localhost:8000&lt;/code&gt; after starting with &lt;code&gt;agent-constitution dashboard&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agent_constitution/
├── constitution.py      # Pydantic models and YAML handling
├── enforcer.py          # Policy enforcement and @enforce decorator
├── audit.py            # JSONL audit logging
├── cli.py              # Click CLI interface
├── rules/
│   ├── evaluator.py    # AST-based expression evaluation
│   └── pii_detector.py # PII detection with regex/Ollama
└── dashboard/
    ├── server.py       # FastAPI + WebSocket server
    └── frontend/       # React + Tailwind dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module has a single responsibility - &lt;code&gt;constitution.py&lt;/code&gt; handles Pydantic models and YAML parsing, &lt;code&gt;enforcer.py&lt;/code&gt; owns the &lt;code&gt;@enforce&lt;/code&gt; decorator and manual check logic, &lt;code&gt;audit.py&lt;/code&gt; handles JSONL writing and rotation, and the &lt;code&gt;rules/&lt;/code&gt; directory separates expression evaluation from PII detection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The project has comprehensive test coverage with 84 unit tests, all passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run all tests&lt;/span&gt;
pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# All tests passing: 84/84 ✓&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test coverage includes constitution loading and YAML parsing, policy enforcement with the &lt;code&gt;@enforce&lt;/code&gt; decorator, manual policy checking, PII detection for regex and patterns, audit logging with rotation, expression evaluation and security validation, and rule violation tracking and statistics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install development dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
pytest

&lt;span class="c"&gt;# Run specific test file&lt;/span&gt;
pytest tests/test_evaluator.py &lt;span class="nt"&gt;-v&lt;/span&gt;

&lt;span class="c"&gt;# Run linting&lt;/span&gt;
flake8 agent_constitution

&lt;span class="c"&gt;# Run type checking&lt;/span&gt;
mypy agent_constitution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a policy enforcement framework for AI agents, one that enforces behavioral rules at the code level rather than relying on prompt instructions, with PII detection, a JSONL audit trail, and a real-time monitoring dashboard. NEO implemented the full system across 10 implementation steps, resulting in a production-ready framework with 84 tests passing.&lt;/p&gt;

&lt;p&gt;NEO built the Pydantic constitution models and YAML handling in &lt;code&gt;constitution.py&lt;/code&gt;, the policy enforcer with the &lt;code&gt;@enforce&lt;/code&gt; decorator in &lt;code&gt;enforcer.py&lt;/code&gt;, the AST-based expression evaluator in &lt;code&gt;rules/evaluator.py&lt;/code&gt;, the regex and Ollama-powered PII detector in &lt;code&gt;rules/pii_detector.py&lt;/code&gt;, the JSONL audit logger with rotation in &lt;code&gt;audit.py&lt;/code&gt;, the Click CLI with all eight commands in &lt;code&gt;cli.py&lt;/code&gt;, and the FastAPI and WebSocket dashboard server with the React and Tailwind frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to enforce safety rules across any agent's tool calls.&lt;/strong&gt;&lt;br&gt;
Wrap any tool function with &lt;code&gt;@enforcer.enforce&lt;/code&gt; and define the rules in a YAML constitution. The enforcement happens at the code level not in the prompt, so it cannot be bypassed by the agent's reasoning or by jailbreak attempts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the audit log to build an observability layer for your agents.&lt;/strong&gt;&lt;br&gt;
Every enforcement decision lands in a JSONL file with a timestamp, tool name, action, and triggering rule. This gives you a structured, queryable record of everything your agent tried to do allowed or blocked, which is useful for debugging unexpected agent behaviour and for compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use PII detection as a standalone layer before agent outputs reach users.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;PIIDetector&lt;/code&gt; works independently of the enforcer. You can run it on any text, agent responses, tool outputs, retrieved documents before they are displayed or stored, and redact sensitive information automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with custom PII patterns.&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;pii_config&lt;/code&gt; section of the constitution accepts a &lt;code&gt;patterns&lt;/code&gt; list. New regex patterns for domain-specific sensitive data can be added to the constitution file without touching any code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional rule conditions.&lt;/strong&gt;&lt;br&gt;
The AST-based evaluator supports arithmetic, comparisons, and context dictionary access. New conditions that reference additional context fields work immediately once those fields are passed as &lt;code&gt;extra_context&lt;/code&gt; in the enforcer's &lt;code&gt;check&lt;/code&gt; call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Constitution shifts AI agent safety from instructions to enforcement. Rules defined in a YAML file are evaluated at the code level on every tool call before the tool executes, so the safety layer is not part of the agent's reasoning but a hard boundary around it.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Constitution" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Constitution&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>security</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ASR Evaluation Framework: Benchmarking Speech Recognition Models Across Accuracy, Speed, and Robustness</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Fri, 15 May 2026 19:53:04 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/asr-evaluation-framework-benchmarking-speech-recognition-models-across-accuracy-speed-and-5gcn</link>
      <guid>https://forem.com/nilofer_tweets/asr-evaluation-framework-benchmarking-speech-recognition-models-across-accuracy-speed-and-5gcn</guid>
      <description>&lt;p&gt;Picking an ASR model for production is not straightforward. Whisper might be the most accurate for general English but too slow for real-time use. Wav2Vec2 might be fast enough for edge devices but struggle with accented speech. Distil-Whisper might hit the sweet spot for your use case, or it might not. Without a systematic benchmark across your actual conditions, you are guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ASR Evaluation Framework&lt;/strong&gt; is an enterprise-grade benchmarking tool that answers the questions that matter before you commit to a model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which ASR model is most accurate for my use case?&lt;/li&gt;
&lt;li&gt;How fast can each model process audio in real-time?&lt;/li&gt;
&lt;li&gt;How robust is each model against background noise, accents, and degraded audio?&lt;/li&gt;
&lt;li&gt;What are the tradeoffs between speed and accuracy?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feypaypp5aspa9sc18zcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feypaypp5aspa9sc18zcu.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5 ASR Models&lt;/strong&gt; : IBM Granite, OpenAI Whisper, NVIDIA Canary, Distil-Whisper, Wav2Vec2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Metrics&lt;/strong&gt; : WER, CER, Accuracy, RTF, and Inference Time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;15+ Test Scenarios&lt;/strong&gt; : Clean speech, background noise, accents, fast/slow speech, technical terms, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Evaluation Modes&lt;/strong&gt; : Speed, accuracy, or complete evaluation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Output Schema&lt;/strong&gt; : Standardized metrics schema for result storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│         run_evaluation.py (CLI Entry)               │
├────────────┬──────────────┬──────────────┬──────────┤
│ --accuracy │ --speed      │ --all        │ Config   │
│ Evaluate   │ Evaluate RTF │ Complete     │ Loading  │
│ WER/CER    │ &amp;amp; Inference  │ Evaluation   │          │
└────────────┴──────────────┴──────────────┴──────────┘
              │
      ┌───────▼────────┐
      │   Evaluator    │
      │  - Load models │
      │  - Test audio  │
      │  - Calc metrics│
      └───────┬────────┘
              │
     ┌────────┼────────┐
     │        │        │
┌────▼──┐┌────▼──┐┌────▼──┐
│Granite ││Whisper││ Wav2V │  ... 5 models
│ Model  ││ Model ││ Model │
└────┬──┘└────┬──┘└────┬──┘
     └────────┼────────┘
              │
      ┌───────▼───────────┐
      │  Metrics Engine   │
      │ - WER/CER calc    │
      │ - RTF calc        │
      │ - Accuracy calc   │
      │ - Aggregation     │
      └───────┬───────────┘
              │
      ┌───────▼──────────┐
      │ JSON Results     │
      │ with schema      │
      └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Model Comparison Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ae75i13bofn51wtd978.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ae75i13bofn51wtd978.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Dimensions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WER&lt;/strong&gt; : Word Error Rate. Percentage of words transcribed incorrectly compared to the reference.&lt;br&gt;
&lt;strong&gt;CER&lt;/strong&gt; : Character Error Rate. Character-level error rate for more detailed analysis.&lt;br&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; : 100% minus WER, normalized to a percentage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RTF&lt;/strong&gt; : Real-Time Factor. Inference time divided by audio duration. Below 1.0 means the model is real-time capable. Above 1.0 means it requires more compute than the audio duration.&lt;br&gt;
&lt;strong&gt;Inference Time&lt;/strong&gt; : Absolute seconds to transcribe the audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Robustness Testing&lt;/strong&gt;&lt;br&gt;
15 test scenarios covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean speech - baseline accuracy testing&lt;/li&gt;
&lt;li&gt;Background noise - office and street environments&lt;/li&gt;
&lt;li&gt;Accented English&lt;/li&gt;
&lt;li&gt;Fast and slow speech rates&lt;/li&gt;
&lt;li&gt;Technical vocabulary&lt;/li&gt;
&lt;li&gt;Whispered speech&lt;/li&gt;
&lt;li&gt;Phone quality audio&lt;/li&gt;
&lt;li&gt;Numbers and acronyms&lt;/li&gt;
&lt;li&gt;And more scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.10+. Core dependencies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;librosa&lt;/code&gt; - Audio processing&lt;br&gt;
&lt;code&gt;numpy, scipy&lt;/code&gt; - Numerical computing&lt;br&gt;
&lt;code&gt;transformers&lt;/code&gt; - HuggingFace model loading&lt;br&gt;
&lt;code&gt;jiwer&lt;/code&gt; - WER and CER calculation&lt;br&gt;
&lt;code&gt;soundfile&lt;/code&gt; - Audio file I/O&lt;br&gt;
&lt;code&gt;pytest&lt;/code&gt; - Testing framework&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run Complete Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Runs accuracy and speed evaluation across all five models against all 15 test scenarios:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Accuracy Evaluation Only&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--accuracy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run Speed Evaluation Only&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--speed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Specify Custom Paths&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python run_evaluation.py &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-path&lt;/span&gt; ./my_data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-path&lt;/span&gt; ./my_results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results and Output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Console Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is what a complete evaluation run looks like in the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================
ASR EVALUATION FRAMEWORK v1.0.0
============================================================

=== RUNNING COMPLETE EVALUATION (ACCURACY + SPEED) ===

Evaluating Whisper...
Evaluating Wav2Vec2...
Evaluating Distil-Whisper...
Evaluating Canary...
Evaluating Granite...

✓ Results saved to: results/asr_eval_results_all_20260513_123045.json

============================================================
EVALUATION SUMMARY
============================================================

Model: Whisper
  Status: ✓ OK
  Mean Accuracy: 95.23%
  Mean WER: 0.0477

Model: Wav2Vec2
  Status: ✓ OK
  Mean Accuracy: 91.45%
  Mean WER: 0.0855

Model: Distil-Whisper
  Status: ✓ OK
  Mean Accuracy: 93.78%
  Mean WER: 0.0622
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;JSON Output Format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Results are saved as structured JSON to &lt;code&gt;results/asr_eval_results_{type}_{timestamp}.json&lt;/code&gt;. The schema includes evaluation metadata, per-model aggregate metrics, and per-scenario test results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-05-13T12:30:45.123Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluator_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"models_tested"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Wav2Vec2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Distil-Whisper"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_scenarios"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Whisper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai/whisper-base"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"initialized"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"aggregate_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;95.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0477&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_cer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mean_rtf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"std_wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0145&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"test_results"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"test_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"test_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clean_english"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"wer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.032&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;96.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"inference_time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"rtf"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.17&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluation_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"all"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"completed"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-scenario &lt;code&gt;test_results&lt;/code&gt; array shows exactly how each model performed on each specific condition, not just aggregated averages which is what makes this useful for production decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Environment variables, documented in &lt;code&gt;.env.example&lt;/code&gt;:&lt;br&gt;
&lt;code&gt;HUGGINGFACE_TOKEN&lt;/code&gt; : HuggingFace API token for model loading&lt;br&gt;
&lt;code&gt;OPENAI_API_KEY&lt;/code&gt; : OpenAI API key&lt;br&gt;
&lt;code&gt;ASR_EVAL_DATA_PATH&lt;/code&gt; : data directory path&lt;br&gt;
&lt;code&gt;ASR_EVAL_RESULTS_PATH&lt;/code&gt; : results output path&lt;br&gt;
&lt;code&gt;VERBOSE&lt;/code&gt; : enable verbose logging&lt;/p&gt;
&lt;h2&gt;
  
  
  Test Matrix
&lt;/h2&gt;

&lt;p&gt;15 test scenarios covering four categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean Speech&lt;/strong&gt; - Baseline accuracy testing&lt;br&gt;
&lt;strong&gt;Robustness&lt;/strong&gt; - Background noise, accents, variable speech rates&lt;br&gt;
&lt;strong&gt;Challenging Conditions&lt;/strong&gt; - Whispered speech, music, phone quality audio&lt;br&gt;
&lt;strong&gt;Domain-Specific&lt;/strong&gt; - Technical vocabulary, numbers, acronyms&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WER (Word Error Rate)&lt;/strong&gt; - Percentage of words that differ from reference&lt;br&gt;
&lt;strong&gt;CER (Character Error Rate)&lt;/strong&gt; - Percentage of characters that differ&lt;br&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; - 100% minus WER, normalized to a percentage&lt;br&gt;
Speed Metrics&lt;br&gt;
&lt;strong&gt;RTF (Real-Time Factor)&lt;/strong&gt; - Inference time divided by audio duration. Below 1.0 is real-time capable.&lt;br&gt;
&lt;strong&gt;Inference Time&lt;/strong&gt; - Total time to transcribe audio in seconds&lt;/p&gt;
&lt;h2&gt;
  
  
  Model Details
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0f31xqzg2m6a8bukj6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0f31xqzg2m6a8bukj6e.png" alt=" " width="406" height="228"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  When to Use This Framework
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benchmarking ASR models before production deployment&lt;/strong&gt; : run a full evaluation before committing to a model, not after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparing model tradeoffs&lt;/strong&gt; : speed versus accuracy decisions are data-driven rather than based on published benchmarks that may not reflect your audio conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing robustness against real-world audio&lt;/strong&gt; : the 15 test scenarios cover conditions that synthetic benchmarks miss: phone quality audio, background noise, accents, and technical vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluating cost-performance of different models&lt;/strong&gt; : RTF and inference time metrics let you calculate the compute cost of each model at your actual workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality assurance in voice-enabled applications&lt;/strong&gt; : run evaluations to catch model regressions before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research and academic speech recognition studies&lt;/strong&gt; : the standardized JSON output schema makes results comparable and reproducible across experiments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1 - Call Center AI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate which model handles phone quality audio best&lt;/li&gt;
&lt;li&gt;Test robustness against background noise&lt;/li&gt;
&lt;li&gt;Measure inference speed for cost calculation&lt;/li&gt;
&lt;li&gt;Result: Select fastest model that maintains accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2 - Voice Assistant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test against various accents and speech rates&lt;/li&gt;
&lt;li&gt;Evaluate technical command recognition&lt;/li&gt;
&lt;li&gt;Measure real-time performance on edge devices&lt;/li&gt;
&lt;li&gt;Result: Pick model that runs on-device with good accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3 - Transcription Service&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark accuracy across multiple languages&lt;/li&gt;
&lt;li&gt;Evaluate cost versus accuracy tradeoffs&lt;/li&gt;
&lt;li&gt;Test on domain-specific vocabulary&lt;/li&gt;
&lt;li&gt;Result: Choose optimal model for service tier&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── src/                          # Core modules
│   ├── config.py                # Configuration
│   ├── metrics.py               # Metric calculations
│   ├── data_loader.py           # Data loading utilities
│   ├── base_model.py            # ASR model base class
│   └── evaluator.py             # Main evaluator class
├── models/                       # ASR model implementations
│   ├── wav2vec2.py
│   ├── whisper.py
│   ├── distil_whisper.py
│   ├── canary.py
│   └── granite.py
├── tests/                        # Test suite (36 tests)
├── data/                         # Audio files for evaluation
├── results/                      # Output evaluation results
├── notebooks/                    # Jupyter notebooks
├── run_evaluation.py             # CLI entry point
├── asr_eval_test_matrix.csv      # Test scenarios matrix
├── asr_eval_metrics_schema.json  # Output schema
└── requirements.txt              # Python dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;36 tests covering all core modules.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a systematic benchmarking framework for ASR models, one that could evaluate accuracy, speed, and robustness across real-world audio conditions, support multiple models through a common interface, and produce structured output for production decisions. The framework needed to cover five distinct model architectures, a 15-scenario test matrix, and three evaluation modes selectable from the CLI.&lt;/p&gt;

&lt;p&gt;NEO built the full implementation: the base model class in &lt;code&gt;base_model.py&lt;/code&gt; that all five model implementations extend, the five model wrappers for Whisper, Wav2Vec2, Distil-Whisper, Canary, and Granite, the metrics engine in &lt;code&gt;metrics.py&lt;/code&gt; computing WER, CER, accuracy, RTF, and inference time, the main evaluator class in &lt;code&gt;evaluator.py&lt;/code&gt;, the CLI entry point in &lt;code&gt;run_evaluation.py&lt;/code&gt; with all three evaluation modes, the data loader in &lt;code&gt;data_loader.py&lt;/code&gt;, the JSON output &lt;code&gt;schema in asr_eval_metrics_schema.json&lt;/code&gt;, the test scenario matrix in &lt;code&gt;asr_eval_test_matrix.csv&lt;/code&gt;, and the 36-test test suite.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it before committing to an ASR model in production.&lt;/strong&gt;&lt;br&gt;
Run the full evaluation against your own audio samples using &lt;code&gt;--data-path&lt;/code&gt;. The per-scenario breakdown shows exactly how each model performs on the conditions your application will actually encounter, not on generic benchmarks that may not reflect your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the JSON output to build model selection pipelines.&lt;/strong&gt;&lt;br&gt;
The structured output at &lt;code&gt;results/asr_eval_results_{type}_{timestamp}.json&lt;/code&gt; contains all the metrics needed to make a data-driven model selection decision programmatically. A script that reads the output and selects the model with the best WER for a given RTF threshold builds directly on top of the existing schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to evaluate cost-performance before scaling.&lt;/strong&gt;&lt;br&gt;
RTF and inference time metrics per model let you calculate the compute cost of each option at your actual call volume. The per-scenario breakdown shows where each model spends the most compute, useful for optimising before scaling a voice-enabled product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional ASR models.&lt;/strong&gt;&lt;br&gt;
All five models extend &lt;code&gt;base_model.py&lt;/code&gt; following the same interface. Adding a new ASR model available through HuggingFace Transformers means adding a new file in &lt;code&gt;models/&lt;/code&gt; that implements the same base class, it is then available in all three evaluation modes without touching the evaluator, metrics engine, or CLI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Choosing an ASR model without systematic evaluation is a production risk. ASR Evaluation Framework removes that risk by giving you per-model, per-scenario metrics across accuracy, speed, and robustness before you deploy with structured JSON output that makes the decision data-driven rather than intuitive.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Asr-Evaluation" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Asr-Evaluation&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>whisper</category>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>SPEC-TO-SHIP: A Multi-Agent Pipeline That Turns Feature Ideas Into Production Code</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 14 May 2026 11:10:51 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/spec-to-ship-a-multi-agent-pipeline-that-turns-feature-ideas-into-production-code-5e86</link>
      <guid>https://forem.com/nilofer_tweets/spec-to-ship-a-multi-agent-pipeline-that-turns-feature-ideas-into-production-code-5e86</guid>
      <description>&lt;p&gt;Writing a feature spec and getting it to production involves a lot of steps, architecture decisions, task planning, implementation, testing, and code review. In a real engineering team, these are handled by different people with different specializations. Most AI coding tools collapse all of that into a single step and ask one model to do everything.&lt;/p&gt;

&lt;p&gt;SPEC TO SHIP takes a different approach. It orchestrates five specialized AI agents Architect, Planner, Engineer, QA, and Reviewer within a single Node.js process to simulate a complete startup engineering team workflow. Raw feature ideas go in. Committed, tested, reviewed code comes out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0vtp1vgz2koehvvng3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0vtp1vgz2koehvvng3j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Agents
&lt;/h2&gt;

&lt;p&gt;The pipeline follows a sequential flow where each agent's output informs the next, with a tight loop between Engineering and QA. Each agent has a defined role, a specific output format, and a clear handoff point - so no single agent is asked to do more than it is designed for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ArchitectAgent-Senior Software Architect:&lt;/strong&gt; The first agent in the pipeline. Takes the raw feature idea and generates a comprehensive technical specification covering Overview, Goals, API Contracts, Data Models, and Security sections. Output is a Markdown spec file that every downstream agent works from. Model:&lt;code&gt;google/gemini-2.0-flash-001&lt;/code&gt;via OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PlannerAgent-Staff Engineering Manager:&lt;/strong&gt; Receives the spec from the Architect and breaks it into actionable, dependency-aware development tasks. Output is a JSON array of tasks with topological ordering and acceptance criteria - so the Engineer knows exactly what to build and in what order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EngineerAgent-Principal Software Engineer:&lt;/strong&gt; Takes each task from the Planner and implements production-grade TypeScript code for it. Output is source files with proper typing, error handling, and JSDoc. This is where the actual code gets written.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QAAgent-Senior QA Engineer:&lt;/strong&gt; Receives the Engineer's output and writes exhaustive Vitest test suites for each task. Output is test files covering acceptance criteria and edge cases. The tight loop between Engineer and QA means the implementation is always tested before the Reviewer sees it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReviewerAgent-Principal Engineer Reviewer:&lt;/strong&gt; The final stage, conducts an audit across security, performance, and correctness across everything the previous agents produced. Output is a score from 0 to 100 and an approval status that tells you whether the output is ready to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality and Resilience
&lt;/h2&gt;

&lt;p&gt;The pipeline is built for production reliability, not just happy-path execution. Several resilience patterns are built in at the infrastructure level so agent failures do not cascade into full pipeline failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strict TypeScript&lt;/strong&gt; - No any types allowed anywhere in the generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponential Backoff&lt;/strong&gt; - Retries on 429/529 errors at 1s, 2s, 4s, 8s, and 16s intervals. Rate limit hits do not kill the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON Robustness&lt;/strong&gt; - When an agent returns malformed JSON, the pipeline automatically retries with explicit instructions to fix the format rather than failing immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeout&lt;/strong&gt; - Ahard 20-minute limit per pipeline run prevents runaway executions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 20+&lt;/li&gt;
&lt;li&gt;OpenRouter API Key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
Clone the repository, then install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm install
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure the environment. The only required variable is your OpenRouter API key - everything else has sensible defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your OPENROUTER_API_KEY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The system uses &lt;code&gt;envalid&lt;/code&gt; for robust configuration management:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; - required for LLM access&lt;br&gt;
&lt;code&gt;DEFAULT_MODEL&lt;/code&gt; - set to &lt;code&gt;google/gemini-2.0-flash-001&lt;/code&gt;&lt;br&gt;
&lt;code&gt;PORT&lt;/code&gt; - API server port, default: &lt;code&gt;3000&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database path, default: &lt;code&gt;spec-to-ship.db&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Terminal UI&lt;/strong&gt;&lt;br&gt;
Run the interactive CLI to start a new pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm run start
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI uses Ink to provide real-time status updates and token streaming as each agent works through its stage. You can watch the pipeline progress in real time - each agent's output appears as it is generated rather than waiting for the full run to complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Industrial Dashboard&lt;/strong&gt;&lt;br&gt;
Start the API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;npm run dev
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;dashboard/index.html&lt;/code&gt; in your browser. The dashboard features an Industrial Command Center aesthetic with Dark Charcoal and Amber Glow styling and uses Server-Sent Events for real-time observability of the pipeline as it runs. This gives a visual view of the same pipeline that the CLI runs, useful for sharing progress with others or monitoring longer runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Structure
&lt;/h2&gt;

&lt;p&gt;Every pipeline run writes its artifacts to ./output/{runId}/. Each file maps directly to one agent's output, so you can inspect any stage independently:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spec.md&lt;/code&gt; : Architectural specification from the Architect agent. The source of truth every downstream agent works from.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tasks.json&lt;/code&gt; : Task breakdown from the Planner agent. The dependency-ordered list of what gets built.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;src/&lt;/code&gt; : mplementation code from the Engineer agent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tests/&lt;/code&gt; : Vitest tests from the QA agent.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;review.md&lt;/code&gt; : Final review report from the Reviewer agent, including the score and approval status.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;meta.json&lt;/code&gt; : Token usage, cost, and timing for the full run.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pipeline.log&lt;/code&gt; : NDJSON event log of the entire pipeline execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a multi-agent pipeline that mirrors how a real engineering team works each role specialized, each handoff structured, and the whole thing running autonomously from a feature idea through to reviewed, committed code. The requirements included five distinct agent roles with clear responsibilities, a sequential handoff structure with a QA loop, production-grade TypeScript output, real-time observability via both a CLI and a web dashboard, and resilience patterns like exponential backoff and JSON retry logic.&lt;/p&gt;

&lt;p&gt;NEO built the full system: the five agent implementations with their respective prompts and output schemas, the pipeline orchestration layer coordinating sequential handoffs, the Ink-based CLI with real-time token streaming, the Node.js API server with SSE for dashboard observability, the Industrial Command Center dashboard in HTML, the SQLite-backed database, the artifact output structure, and the &lt;code&gt;envalid&lt;/code&gt; configuration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to go from idea to working code in a single command&lt;/strong&gt;. &lt;br&gt;
Write a feature description, run the pipeline, and get a complete implementation with architecture docs, TypeScript source, Vitest tests, and a reviewer score without manually coordinating any of the steps. The five-agent structure ensures each stage is handled by a role optimised for that specific task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the reviewer score as a quality gate&lt;/strong&gt;. &lt;br&gt;
The ReviewerAgent scores every run from 0 to 100 across security, performance, and correctness. Teams can use this score as a threshold before accepting generated code - only promoting runs that clear a minimum score into the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the NDJSON event log for pipeline observability&lt;/strong&gt;. &lt;br&gt;
Every run writes a structured &lt;code&gt;pipeline.log&lt;/code&gt; in NDJSON format. This can be parsed by any log processing tool to track pipeline performance, token costs, and approval rates across runs over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional agent roles&lt;/strong&gt;. &lt;br&gt;
The five-agent structure is sequential and modular. A new agent that receives the previous stage's output and produces its own artifact can be added without restructuring the existing pipeline - the handoff pattern is already established for each stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SPEC TO SHIP compresses the gap between a feature idea and production-ready code by distributing the work across five specialized agents, each focused on what it does best. Architecture, planning, implementation, testing, and review - all coordinated automatically, with structured handoffs and resilience built in at every stage.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Spec-To-Ship" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Spec-To-Ship&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;. &lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>multiagent</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>RAG Pipeline Stress Tester: Battle-Test Your RAG System Before It Reaches Production</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 12 May 2026 11:45:30 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</link>
      <guid>https://forem.com/nilofer_tweets/rag-pipeline-stress-tester-battle-test-your-rag-system-before-it-reaches-production-397c</guid>
      <description>&lt;p&gt;Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases, hallucinations on out-of-scope questions, failed refusals on adversarial prompts, latency that collapses under real concurrent load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Pipeline Stress Tester&lt;/strong&gt; is a battle-testing toolkit that finds these issues before deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.&lt;/li&gt;
&lt;li&gt;Tracks relevance, hallucination, refusal quality, and latency for every query sent.&lt;/li&gt;
&lt;li&gt;Scores everything into a composite health score from 0 to 100.&lt;/li&gt;
&lt;li&gt;Breaks results down by query category so you know exactly which failure modes are causing issues.&lt;/li&gt;
&lt;li&gt;Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.&lt;/li&gt;
&lt;li&gt;Produces an HTML report with interactive charts and a JSON report for CI/CD integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28iyjk2nc9t6w3r1h1tq.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Exists
&lt;/h2&gt;

&lt;p&gt;Before deploying a RAG system to production, four questions need answers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does it hallucinate when asked about things not in the corpus?&lt;/li&gt;
&lt;li&gt;Does it refuse appropriately on out-of-scope questions?&lt;/li&gt;
&lt;li&gt;Does it stay consistent when the same question is asked multiple ways?&lt;/li&gt;
&lt;li&gt;Does it hold up under load 10, 25, 50 concurrent users?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Manual testing cannot answer these questions at scale. This tool does it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without stress testing&lt;/strong&gt; - hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With this tool&lt;/strong&gt; - hallucinations are caught before deployment, you find edge cases in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 7 Query Categories
&lt;/h2&gt;

&lt;p&gt;The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;out_of_scope&lt;/code&gt; - Questions with no answer in the corpus, tests hallucination resistance&lt;br&gt;
&lt;code&gt;adversarial&lt;/code&gt; - Prompt injection and jailbreak attempts, tests instruction-following safety&lt;br&gt;
&lt;code&gt;ambiguous&lt;/code&gt; - Queries with multiple valid interpretations, tests disambiguation&lt;br&gt;
&lt;code&gt;multilingual&lt;/code&gt; - Non-English queries, tests language handling&lt;br&gt;
&lt;code&gt;temporal&lt;/code&gt; - Time-sensitive questions that depend on stale data&lt;br&gt;
&lt;code&gt;negation&lt;/code&gt; - "What is NOT X" style questions, a common failure mode&lt;br&gt;
&lt;code&gt;compound&lt;/code&gt; - Multi-part questions requiring multiple retrievals&lt;/p&gt;

&lt;p&gt;You can add your own queries by appending lines to any file in &lt;code&gt;query_bank/&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Health Score
&lt;/h2&gt;

&lt;p&gt;Every test run produces a composite Health Score from 0 to 100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;≥ 80  EXCELLENT   Production-ready
≥ 60  GOOD        Minor issues, review before deploying
≥ 40  FAIR        Significant issues, fix first
 &amp;lt; 40  POOR        Critical failures, do not deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculated from five weighted components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvislnqf6am0fb8i2m3yi.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py             Typer CLI — entry point and orchestration
adversarial.py      Query generator — 7 categories, pre-built + corpus-generated
loader.py           Async load driver — aiohttp, configurable concurrency
evaluator.py        Scorer — hallucination, precision, refusal, consistency
reporter.py         Report generator — HTML (Chart.js) + JSON output
corpus_analyzer.py  Optional: generate targeted queries from your own documents
query_bank/         7 pre-built adversarial query files (one per line)
tests/              58 pytest tests (no live endpoint needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint the tester sends requests to must accept POST with &lt;code&gt;{"query": "..."}&lt;/code&gt; and return JSON containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field. Any HTTP status other than 200 is counted as an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running a Stress Test
&lt;/h2&gt;

&lt;p&gt;The core command runs a full stress test against your RAG endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic — 10 concurrent users, 60-second run&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--duration&lt;/span&gt; 60

&lt;span class="c"&gt;# Test only specific query categories&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query-types&lt;/span&gt; out_of_scope,adversarial,multilingual

&lt;span class="c"&gt;# Custom output directory&lt;/span&gt;
python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./my-reports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a real terminal output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🚀 Starting RAG Stress Test
   Endpoint: http://localhost:8000/query
   Concurrency: 5
   Duration: 20s

📊 Generating test queries...
   Generated 350 test queries

⚡ Running load tests...
📈 Evaluating results...
📝 Generating reports...

✅ Stress test complete!
   JSON Report: reports/stress_test_results.json
   HTML Report: reports/stress_test_report.html

=======================================================
  Overall Health Score : 57.1/100
  Status               : FAIR - Significant issues detected
  Total requests       : 6355
  Error rate           : 0.0%
  Precision score      : 2.1%
  Hallucination rate   : 22.5%
  Refusal rate         : 77.5%
  Consistency score    : 72.1%
  Latency p50/p95/p99  : 2.9 / 6.3 / 8.7 ms

  Query Type          Count   Halluc%   Refusal%    AvgLat
  ------------------ ------  --------  ---------  --------
  adversarial           205     35.1%      64.9%      3.3ms
  ambiguous             250     12.0%      88.0%      3.2ms
  compound              200     22.0%      78.0%      4.0ms
  multilingual          250     10.0%      90.0%      3.1ms
  negation              200     20.0%      80.0%      5.3ms
  out_of_scope          250     20.0%      80.0%      4.0ms
  temporal              200     38.0%      62.0%      3.1ms

  Recommendations:
    - Low precision score. Enhance retrieval mechanism and relevance ranking.
    - Moderate: Several areas need improvement for production readiness.
=======================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Sanity Check
&lt;/h2&gt;

&lt;p&gt;For a fast check before a full run, quick-test runs 35 sample queries - 5 per category and prints the health score without writing any report files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py quick-test &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 Running quick sanity test...
   Testing with 35 sample queries

🎯 Quick Test Health Score: 72.4/100
   ✅ Endpoint appears functional
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generate Queries From Your Own Corpus
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;analyze-corpus&lt;/code&gt; command analyzes your own &lt;code&gt;.txt&lt;/code&gt;, &lt;code&gt;.md&lt;/code&gt;, or &lt;code&gt;.json&lt;/code&gt; files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into &lt;code&gt;query_bank/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📚 Analyzing corpus: ./my-docs
   Generated 50 in_scope queries → query_bank/in_scope_generated.txt
   Generated 50 out_of_scope queries → query_bank/out_of_scope_generated.txt
   Generated 50 adversarial queries → query_bank/adversarial_generated.txt

✅ Corpus analysis complete!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For very small corpora, lower the keyword frequency threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py analyze-corpus &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--corpus&lt;/span&gt; ./my-docs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; ./query_bank &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-queries&lt;/span&gt; 20 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-word-freq&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Edit &lt;code&gt;config.yaml&lt;/code&gt; to customise load levels, thresholds, and reporting. The &lt;code&gt;--endpoint&lt;/code&gt; CLI flag always takes precedence over &lt;code&gt;config.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;load.concurrency_levels&lt;/code&gt; - Concurrent user levels to test, for example &lt;code&gt;[1, 5, 10, 25]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.ramp_mode&lt;/code&gt; - If true, steps through each concurrency level; if false, runs at the first level for the full duration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.duration_seconds&lt;/code&gt; - How long to run at each concurrency level&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load.rate_limit_per_second&lt;/code&gt; - Maximum requests per second&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.hallucination_threshold&lt;/code&gt; - Keyword-overlap score below which a response is flagged as a potential hallucination&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;evaluation.refusal_keywords&lt;/code&gt; - Phrases that indicate a refused answer&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reporter.output_dir&lt;/code&gt; - Where to save HTML and JSON reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pass the config file with &lt;code&gt;--config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 main.py stress-test &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--endpoint&lt;/span&gt; http://localhost:8000/query &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Reports
&lt;/h2&gt;

&lt;p&gt;Each test run saves two files to &lt;code&gt;./reports/&lt;/code&gt; or your &lt;code&gt;--output&lt;/code&gt; path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_results.json&lt;/strong&gt; - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;stress_test_report.html&lt;/strong&gt; - Interactive dashboard with a health score badge coloured by band, metric cards covering success rate, precision, hallucination, latency p95 and consistency, a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, a latency distribution histogram, and prioritised recommendations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Endpoint Requirements
&lt;/h2&gt;

&lt;p&gt;The tester sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/your-endpoint&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is machine learning?"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects a JSON response containing either a &lt;code&gt;response&lt;/code&gt; or &lt;code&gt;answer&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Machine learning is..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any HTTP status other than 200 is counted as an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Tests
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;58 tests covering all modules. Uses &lt;code&gt;aioresponses&lt;/code&gt; to mock HTTP - no live RAG endpoint required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag-pipeline-stress-tester/
├── main.py             # CLI entry point
├── adversarial.py      # Query generators (7 types)
├── loader.py           # Async load test driver
├── evaluator.py        # Scoring and metrics
├── reporter.py         # HTML + JSON report generator
├── corpus_analyzer.py  # Optional corpus-based query generation
├── config.yaml         # Test configuration
├── requirements.txt
├── query_bank/         # 7 pre-built adversarial query files
└── tests/              # 58 pytest tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.&lt;/p&gt;

&lt;p&gt;xNEO built the full implementation: The Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a pre-deployment gate for every RAG system.&lt;/strong&gt;&lt;br&gt;
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number, below 60 means review before deploying, below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it with your own domain queries.&lt;/strong&gt;&lt;br&gt;
The pre-built query banks are general purpose. For domain-specific testing, run &lt;code&gt;analyze-corpus&lt;/code&gt; on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into &lt;code&gt;query_bank/&lt;/code&gt; and run the stress test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrate the JSON report into CI/CD.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;stress_test_results.json&lt;/code&gt; is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional query categories.&lt;/strong&gt;&lt;br&gt;
The 7 query banks are plain text files in &lt;code&gt;query_bank/&lt;/code&gt;, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to &lt;code&gt;query_bank/&lt;/code&gt; and registering it in &lt;code&gt;adversarial.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;RAG systems fail in predictable ways, hallucination on out-of-scope questions, collapsed latency under load, inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/RAG-pipeline-stress-tester" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/RAG-pipeline-stress-tester&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Orbis: Turn Any GitHub Repository Into an Interactive 3D Dependency Graph</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 09 May 2026 10:58:10 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</link>
      <guid>https://forem.com/nilofer_tweets/orbis-turn-any-github-repository-into-an-interactive-3d-dependency-graph-3eei</guid>
      <description>&lt;p&gt;Understanding a large codebase is hard. You clone it, start reading files, and quickly lose track of how everything connects. Which modules are most depended on? Where are the circular dependencies? What would break if you refactored this file?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orbis&lt;/strong&gt; answers these questions visually. Paste a GitHub repository URL, and Orbis clones it, parses the ASTs across Python, JavaScript, TypeScript, Go, Rust, and Java, detects architectural patterns, and renders the entire codebase as a navigable 3D force-directed graph. Click any module to inspect its dependencies, metrics, and exported symbols. Ask the built-in AI assistant questions like "which module should I refactor first?" and get answers grounded in the actual code structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;3D force-directed graph&lt;/strong&gt; - Nodes sized by lines of code, colored by type, with animated directional particles on edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-language AST parsing&lt;/strong&gt; - Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI chat assistant&lt;/strong&gt; - Ask Claude questions about the analyzed codebase. Questions like "Which modules have circular dependencies?" or "Where should I add feature X?" are answered with full architectural context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural insights&lt;/strong&gt; - Auto-detected issues including god modules, high coupling, and circular dependencies, each with severity ratings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus Mode&lt;/strong&gt; - Dim unconnected nodes to trace dependency paths clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shareable URLs&lt;/strong&gt; - &lt;code&gt;?repo=https://github.com/...&lt;/code&gt; auto-triggers analysis on load, making it easy to share a specific codebase view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent history&lt;/strong&gt; - Last 5 repos stored locally for quick re-analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demo mode&lt;/strong&gt; — Load a pre-analyzed snapshot without a GitHub clone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI + Server-Sent Events (SSE)&lt;/li&gt;
&lt;li&gt;AST Parsing: tree-sitter (Python, JS/TS, Go, Rust, Java)&lt;/li&gt;
&lt;li&gt;AI Integration: Claude Opus 4.6 via Anthropic API&lt;/li&gt;
&lt;li&gt;3D Rendering: 3d-force-graph + Three.js&lt;/li&gt;
&lt;li&gt;Frontend: Vanilla JS SPA - no build step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone and install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;orbis
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate   &lt;span class="c"&gt;# Windows: venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set up environment&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env and add your ANTHROPIC_API_KEY for the AI chat feature&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an API key at console.anthropic.com. The AI chat feature requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment. It degrades gracefully, if the key is missing, the chat panel shows an error message rather than breaking the rest of the app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Run&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8001&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; orbis &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8001:8001 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-... orbis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Once running, the workflow is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter a public GitHub repository URL - for example &lt;code&gt;https://github.com/expressjs/express&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Optionally specify a branch&lt;/li&gt;
&lt;li&gt;Click Analyze - Orbis clones the repo, parses ASTs, and builds the graph in roughly 5–30 seconds&lt;/li&gt;
&lt;li&gt;Explore the 3D graph - click a node to open its detail drawer, scroll to zoom, drag to rotate&lt;/li&gt;
&lt;li&gt;Use Focus Mode to highlight a node's direct connections&lt;/li&gt;
&lt;li&gt;Use layer filter chips to show or hide architectural layers&lt;/li&gt;
&lt;li&gt;Ask the AI assistant questions about the codebase in the chat panel&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Keyboard Shortcuts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;R: Reset camera&lt;/li&gt;
&lt;li&gt;P: Pause/resume rotation&lt;/li&gt;
&lt;li&gt;F: Toggle Focus Mode&lt;/li&gt;
&lt;li&gt;/: Focus search box&lt;/li&gt;
&lt;li&gt;Esc: Close detail drawer / exit Focus Mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The project has four files at its core - a FastAPI backend, a single-file AST parser, and a vanilla JS frontend with no build step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main.py           FastAPI backend — SSE streaming for /analyze, /chat
neo_parser.py     Multi-language AST parser (tree-sitter)
static/
  index.html      Single-page frontend (3d-force-graph + Three.js)
save_analysis.py  Utility: pre-generate demo data from a repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The backend streams analysis progress to the frontend via Server-Sent Events, The backend streams analysis progress to the frontend via Server-Sent Events while cloning and analyzing the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuiky04nqoykgwfknmlsm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output Schema
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;/analyze&lt;/code&gt; emits SSE events and completes with a &lt;code&gt;complete&lt;/code&gt; event containing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schema_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"architecture_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MVC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"languages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Codebase contains 42 modules..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"utility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lines_of_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;315&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"complexity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exported_symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"AuthBase"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HTTPBasicAuth"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"internal_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/compat"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"external_dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"functions_total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"classes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edges"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"requests/auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"import"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high_coupling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"High fan-in on requests/models"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14 modules import this file directly."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"affected_nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"requests/models"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"recommendation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Consider splitting into smaller focused modules."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node carries its lines of code, complexity rating, exported symbols, and both internal and external dependencies. The insights block surfaces architectural issues automatically, high coupling, circular dependencies, and god modules - each with a severity rating and a specific recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Languages
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python - &lt;code&gt;.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;JavaScript/TypeScript - &lt;code&gt;.js&lt;/code&gt;, &lt;code&gt;.mjs&lt;/code&gt;, &lt;code&gt;.cjs&lt;/code&gt;, &lt;code&gt;.jsx&lt;/code&gt;, &lt;code&gt;.ts&lt;/code&gt;, &lt;code&gt;.tsx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Go - &lt;code&gt;.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Rust - &lt;code&gt;.rs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Java - &lt;code&gt;.java&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI Chat
&lt;/h2&gt;

&lt;p&gt;The chat assistant uses Claude Opus 4.6 and receives the full architectural graph as context - node list, dependencies, insights, and summary. It can answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What does the auth module depend on?"&lt;/li&gt;
&lt;li&gt;"Why are there circular dependencies between X and Y?"&lt;/li&gt;
&lt;li&gt;"Which module should I refactor first?"&lt;/li&gt;
&lt;li&gt;"Where would I add a caching layer?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assistant's answers are grounded in the actual parsed structure of the codebase - not generic advice. Requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run with auto-reload&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8001

&lt;span class="c"&gt;# Re-generate demo data&lt;/span&gt;
python save_analysis.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The idea was a tool that turns any GitHub repository into an interactive 3D graph, something a developer could paste a URL into and immediately understand the architecture without reading a single file. The requirements included multi-language AST parsing, automatic architectural issue detection, an AI assistant grounded in the actual code structure, and a frontend that required no build step.&lt;/p&gt;

&lt;p&gt;NEO built the full stack from that description: the FastAPI backend with SSE streaming for real-time analysis progress, the multi-language AST parser in &lt;code&gt;neo_parser.py&lt;/code&gt; covering Python, JavaScript, TypeScript, Go, Rust, and Java via tree-sitter, the 3D force-directed graph frontend in vanilla JS, the Claude Opus 4.6 chat assistant with full architectural context, the insights engine detecting god modules, high coupling, and circular dependencies with severity ratings, and the demo mode with pre-generated analysis data.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase.&lt;/strong&gt;&lt;br&gt;
Instead of spending hours reading files to understand how a project is structured, paste the repo URL into Orbis and get an immediate visual map of every module, its dependencies, and the architectural issues that already exist. The AI assistant can then answer specific questions about the structure without you having to trace imports manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during code review to understand structural impact.&lt;/strong&gt;&lt;br&gt;
When reviewing a large pull request, run Orbis on the repo and use the insights panel to see whether high coupling, circular dependencies, or god modules exist in the areas being changed. The AI assistant can answer specific questions about how the affected modules connect to the rest of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to plan a refactor.&lt;/strong&gt;&lt;br&gt;
Ask the AI assistant "which module should I refactor first?" or "where would I add a caching layer?" and get answers grounded in the actual dependency graph. The focus mode lets you isolate a specific module and trace exactly what depends on it before touching anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional language parsers.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;neo_parser.py&lt;/code&gt; already handles five languages via tree-sitter. Adding a new language - Ruby, C++, Swift - follows the same parser pattern and surfaces automatically in the language filter chips and the supported languages list without touching the frontend or the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Orbis makes codebase architecture something you can see and navigate rather than something you have to reconstruct in your head. A 3D dependency graph, multi-language AST parsing, automatic architectural issue detection, and an AI assistant that knows the actual structure - all from a single repo URL.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Orbit-dependency-visualised" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Orbit-dependency-visualised&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>SmolVLM2 Edge Vision Agent: Visual Monitoring Without a GPU or Cloud API</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 07 May 2026 11:43:31 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</link>
      <guid>https://forem.com/nilofer_tweets/smolvlm2-edge-vision-agent-visual-monitoring-without-a-gpu-or-cloud-api-2afp</guid>
      <description>&lt;p&gt;Running vision AI locally has always had a catch, you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model specifically designed for CPU inference, and this agent is built around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SmolVLM2 Edge Vision Agent&lt;/strong&gt; is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The agent does five things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingests a live webcam feed or an image folder as input&lt;/li&gt;
&lt;li&gt;Performs continuous visual monitoring, frame-difference based motion detection that triggers VLM analysis only on scene changes&lt;/li&gt;
&lt;li&gt;Describes new objects, reads text from images - receipts, whiteboards, signs, and logs everything as structured observations&lt;/li&gt;
&lt;li&gt;Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores&lt;/li&gt;
&lt;li&gt;Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt; home security camera analysis, document digitization pipelines, accessibility tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, inference is not instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfmzuo5rq01ymye9ocj7.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-frame timeline:&lt;/strong&gt;&lt;br&gt;
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity, minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzbhclbik1cc2vpheuh6k.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python:&lt;/strong&gt; 3.11 or newer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 16GB minimum for the real model; less is fine in &lt;code&gt;--mock&lt;/code&gt; mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; ~5GB free for the model cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Linux, macOS, or WSL2 on Windows - uses OpenCV, and webcam access requires native camera support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GPU required&lt;/strong&gt; - SmolVLM2-2.2B is designed for CPU inference.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;smolvlm2-edge-agent
make &lt;span class="nb"&gt;install&lt;/span&gt;                                  &lt;span class="c"&gt;# pip install -e .&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env                          &lt;span class="c"&gt;# then edit values as needed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;make install&lt;/code&gt; command runs &lt;code&gt;pip install -e&lt;/code&gt; . which installs the package and its pinned runtime dependencies from &lt;code&gt;requirements.txt&lt;/code&gt;. The &lt;code&gt;.env.example&lt;/code&gt; file contains all documented environment variables, copy it to &lt;code&gt;.env&lt;/code&gt; and edit the values you want to override before running.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in &lt;code&gt;.env.example&lt;/code&gt; in the &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MODEL_NAME&lt;/code&gt; - HuggingFace model id, default: &lt;code&gt;HuggingFaceTB/SmolVLM2-2.2B-Instruct&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;USE_MOCK_MODE&lt;/code&gt; - bypass model loading with deterministic stub responses, default: &lt;code&gt;false&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MODEL_CACHE_DIR&lt;/code&gt; - where the HuggingFace model is cached on disk, default: &lt;code&gt;./models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DB_PATH&lt;/code&gt; - SQLite database file path, default: &lt;code&gt;./data/observations.db&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FRAME_DIFF_THRESHOLD&lt;/code&gt; - motion sensitivity on a 0–1 scale, higher means less sensitive, default: &lt;code&gt;0.15&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; - minimum VLM confidence required to log an observation, default: &lt;code&gt;0.5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PROCESSING_INTERVAL&lt;/code&gt; - seconds between frame samples, default: &lt;code&gt;1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MAX_OBSERVATIONS&lt;/code&gt; - cap on stored rows, older observations are pruned, default: &lt;code&gt;10000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_HOST&lt;/code&gt; - FastAPI bind host, default: &lt;code&gt;0.0.0.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DASHBOARD_PORT&lt;/code&gt; - FastAPI port, default: &lt;code&gt;8080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INPUT_SOURCE&lt;/code&gt; - camera index or path to image folder, default: &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OUTPUT_DIR&lt;/code&gt; - where observation artifacts are written, default: &lt;code&gt;./data/observations/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;THUMBNAIL_DIR&lt;/code&gt; - where frame thumbnails are saved, default: &lt;code&gt;./data/thumbnails/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_LEVEL&lt;/code&gt; - Python logging level, default: &lt;code&gt;INFO&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOG_FILE&lt;/code&gt; - optional log file path, default: &lt;code&gt;./data/agent.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;MIN_CONFIDENCE&lt;/code&gt; is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick start - mock mode, no model download&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; data/test_images
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mock&lt;/span&gt; &lt;span class="nt"&gt;--input&lt;/span&gt; ./data/test_images &lt;span class="nt"&gt;--duration&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs the agent for 30 seconds against the &lt;code&gt;data/test_images/&lt;/code&gt; folder using the mock VLM, populates &lt;code&gt;data/observations.db&lt;/code&gt;, and writes thumbnails to &lt;code&gt;data/thumbnails/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against a webcam&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; 0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open &lt;code&gt;http://localhost:8080&lt;/code&gt; in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run against an image folder&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--input&lt;/span&gt; ./images &lt;span class="nt"&gt;--interval&lt;/span&gt; 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Iterates over &lt;code&gt;./images&lt;/code&gt; at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard only in read mode&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; src &lt;span class="nt"&gt;--mode&lt;/span&gt; dashboard &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Serves the dashboard against an existing &lt;code&gt;data/observations.db&lt;/code&gt; without running the agent. Useful for reviewing historical observations without starting a new capture session.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;The FastAPI dashboard exposes six endpoints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcm66ohxfya82kfvtv4gx.png" alt=" " width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/search&lt;/code&gt; endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/api/observations&lt;/code&gt; endpoint is paginated with &lt;code&gt;limit&lt;/code&gt; and &lt;code&gt;offset&lt;/code&gt; parameters. The default returns the 50 most recent observations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models Used
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l2t21i3rhps3jm4ra5f.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the default &lt;code&gt;--model&lt;/code&gt; argument and &lt;code&gt;MODEL_NAME&lt;/code&gt; env var. No other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in &lt;code&gt;./models&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight make"&gt;&lt;code&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt;                  &lt;span class="c"&gt;# python3 -m pytest tests/ -v
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;lint&lt;/span&gt;                  &lt;span class="c"&gt;# ruff check src/ tests/ --fix
&lt;/span&gt;&lt;span class="err"&gt;make&lt;/span&gt; &lt;span class="err"&gt;typecheck&lt;/span&gt;             &lt;span class="c"&gt;# mypy src/ --ignore-missing-imports
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test coverage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tests/test_db.py&lt;/code&gt; - 10 tests covering SQLite schema, CRUD, and search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_vision.py&lt;/code&gt; - 6 tests covering mock VLM and prompt rendering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_agent.py&lt;/code&gt; - 9 tests covering motion detection and the agent loop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_dashboard.py&lt;/code&gt; - 6 tests covering HTTP route handlers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tests/test_cli.py&lt;/code&gt; - 7 tests covering argparse and env-var loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total: 36 tests, all passing. No skipped tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;smolvlm2-edge-agent/
├── src/
│   ├── __init__.py
│   ├── __main__.py              # entry point for python -m src
│   ├── agent.py                 # MotionDetector + VisionAgent
│   ├── vision.py                # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│   ├── db.py                    # SQLite Database class
│   ├── dashboard.py             # FastAPI app factory + route handlers
│   └── cli.py                   # argparse + env loading
├── tests/                       # 36 pytest tests, VLM fully mocked
├── data/.gitkeep                # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep              # HF model cache
├── pyproject.toml               # ruff + mypy config + console_script
├── requirements.txt             # pinned runtime deps
├── Makefile                     # install, test, lint, typecheck, run, clean
├── .env.example                 # documented env vars
├── .gitignore
├── BUILD_NOTES.md               # build/verification trace
└── PUBLISH.md                   # exact GitHub push commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;src/&lt;/code&gt; directory maps cleanly to the agent's responsibilities - &lt;code&gt;agent.py&lt;/code&gt; handles the motion detection and VLM orchestration loop, &lt;code&gt;vision.py&lt;/code&gt; wraps the model with a mock-compatible interface, &lt;code&gt;db.py&lt;/code&gt; handles all SQLite operations, &lt;code&gt;dashboard.py&lt;/code&gt; is the FastAPI application, and &lt;code&gt;cli.py&lt;/code&gt; handles all argument parsing and environment variable loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;PRs welcome. Before submitting, all three of the following must pass with zero errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make lint &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make typecheck &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as an offline home security monitor:&lt;/strong&gt; Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for document digitization pipelines:&lt;/strong&gt; Point &lt;code&gt;--input&lt;/code&gt; at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The &lt;code&gt;/api/search&lt;/code&gt; endpoint lets you query what was found across the full document set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as an accessibility tool:&lt;/strong&gt; Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional VLM backends:&lt;/strong&gt; &lt;code&gt;VisionEngine&lt;/code&gt; in &lt;code&gt;vision.py&lt;/code&gt; wraps SmolVLM2-2.2B with a clean interface that &lt;code&gt;MockVisionEngine&lt;/code&gt; also implements. Swapping in a different HuggingFace multimodal model means updating &lt;code&gt;vision.py&lt;/code&gt; - the agent, database, dashboard, and CLI stay entirely unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API. A 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>fastapi</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Prompt Compression Benchmarker: Cut LLM Input Costs by 35–63% With Measurable Quality Tracking</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 06 May 2026 07:07:48 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</link>
      <guid>https://forem.com/nilofer_tweets/prompt-compression-benchmarker-cut-llm-input-costs-by-35-63-with-measurable-quality-tracking-12f3</guid>
      <description>&lt;p&gt;Most LLM cost comes from input tokens, the long documents, codebases, or conversation histories you send as context. There are several prompt compression algorithms available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Compression Benchmarker (PCB)&lt;/strong&gt; answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;PCB answers two questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which compression algorithm preserves the most quality at a given token budget?&lt;/strong&gt; &lt;br&gt;
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much money does that save at your actual call volume?&lt;/strong&gt; &lt;br&gt;
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.&lt;/p&gt;

&lt;p&gt;Then it gives you a one-line wrapper to deploy the answer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm265a0k98rmkebalhz97.png" alt=" " width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From PyPI (once published)
pip install prompt-compression-benchmarker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.9+. No GPU required. Core dependencies: &lt;code&gt;tiktoken&lt;/code&gt;, &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;rouge-score&lt;/code&gt;, &lt;code&gt;rank-bm25&lt;/code&gt;, &lt;code&gt;typer&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify
pcb --help

# Optional extras
pip install "prompt-compression-benchmarker[anthropic]"   # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]"      # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]"         # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]"         # Everything
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Run the benchmark&lt;/strong&gt;&lt;br&gt;
The simplest run uses bundled sample data - no setup needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# All compressors × all task types, bundled sample data — no setup needed
pcb run

# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6

# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
                          RAG
 Compressor          Token Reduc %  Proxy Score  Proxy Drop %   ms
 no_compression           0.0%        0.2983         0.0%      0.3
 tfidf ★                 40.1%        0.2519        +16.5%     12.1
 selective_context        56.9%        0.1874        +34.4%      8.3
 llmlingua                53.6%        0.2182        +28.1%      9.7
 llmlingua2               45.0%        0.2204        +27.3%     11.2

 Monthly Cost Projection  claude-sonnet-4-6 · $3/1M · 3M tokens/day
 tfidf             38.3% reduction   $103/mo saved   $1,240/yr
 selective_context 57.5% reduction   $155/mo saved   $1,863/yr
 llmlingua2        43.6% reduction   $118/mo saved   $1,413/yr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compress a file directly&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py

# Save compressed output
pcb compress context.txt -o compressed.txt --stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Deploy the winner&lt;/strong&gt;&lt;br&gt;
Once you know which compressor wins on your data, deploying it is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

print(client.stats)  # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else in your codebase stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Results
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Benchmark table columns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5abgr0r8hnyzshkj5ltz.png" alt=" " width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality drop color coding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cyan   = negative drop (compression improved the metric — noise removal)
green  = &amp;lt; 5% drop    (effectively lossless)
yellow = 5–15% drop   (acceptable for most use cases)
red    = ≥ 15% drop   (significant information loss)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why use the LLM judge?
&lt;/h2&gt;

&lt;p&gt;The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer, it reveals things proxy metrics miss.&lt;/p&gt;

&lt;p&gt;Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compressor           Proxy Drop %   LLM Score   LLM Drop %
no_compression           0.0%         0.94         0.0%
tfidf                  +23.7%         0.40        -57.4%    ← proxy hid the severity
llmlingua2             +29.9%         0.70        -25.5%    ← much better than proxy suggested
selective_context      +37.6%         0.14        -85.1%    ← dangerous despite high compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Compressor
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;RAG:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.40 - preserves named entities and key facts better than sentence-dropping&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarization:&lt;/strong&gt; &lt;code&gt;llmlingua&lt;/code&gt; at rate 0.45 - sentence-level pruning maintains structural coverage&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code contexts:&lt;/strong&gt; &lt;code&gt;llmlingua2&lt;/code&gt; at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General chat:&lt;/strong&gt; &lt;code&gt;tfidf&lt;/code&gt; at rate 0.40 - safe default, fast, reliable&lt;/p&gt;

&lt;h2&gt;
  
  
  Target compression rate
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--rate&lt;/code&gt; is the fraction of tokens to remove. 0.45 means keep 55% of tokens.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz5q2a5j4q65dt793zgm5.png" alt=" " width="800" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Savings - The Real Numbers
&lt;/h2&gt;

&lt;p&gt;Compression saves money on input tokens only. Output tokens are unchanged.&lt;br&gt;
At 3M input tokens per day:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthywdxzquiykzeazrosi.png" alt=" " width="796" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini, the savings are too small to justify the complexity, use it only if you're hitting context window limits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Deploy: Python SDK Wrappers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OpenAI (Chat Completions + Codex Responses API)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingOpenAI

client = CompressingOpenAI(compressor="tfidf", rate=0.40)

# Chat Completions API — unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What gets compressed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Claude Code Integration (MCP)
&lt;/h2&gt;

&lt;p&gt;PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server

# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or drop &lt;code&gt;.mcp.json&lt;/code&gt; into any project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Available tools&lt;/strong&gt;&lt;br&gt;
Once connected, you can ask Claude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Compress this RAG context before sending it to the model"&lt;/li&gt;
&lt;li&gt;"Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"&lt;/li&gt;
&lt;li&gt;"What compressor should I use for my coding assistant at 90% quality floor?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbxqk0rzatmcf5xghgtyd.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Codex (Agents SDK)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio

async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Bring Your Own Data
&lt;/h2&gt;

&lt;p&gt;Data is JSONL - one JSON object per line. Check the schema for each task type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RAG schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;passage 300–1500 tokens&amp;gt;",
  "question": "&amp;lt;specific question requiring the full context&amp;gt;",
  "answer": "&amp;lt;short, precise answer string&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summarization schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "article": "&amp;lt;article or document 300–800 tokens&amp;gt;",
  "summary": "&amp;lt;2–3 sentence reference summary&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Coding schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "id": "my_001",
  "context": "&amp;lt;imports, helpers, type definitions — 400–800 tokens&amp;gt;",
  "docstring": "&amp;lt;description of the function to implement&amp;gt;",
  "solution": "&amp;lt;correct Python implementation&amp;gt;"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running on your data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --task rag --max-samples 50

# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2

# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow: Benchmark to Production
&lt;/h2&gt;

&lt;p&gt;Here is the full path from benchmarking to deploying a compressor in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 : Benchmark on your actual data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --max-samples 50 --task rag \
        --daily-tokens 2000000 --cost-model claude-opus-4-7 \
        --output benchmark.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 : LLM-judge the top candidates&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
        --llm-judge --judge-model claude-sonnet-4-6 --max-samples 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 : Deploy the winner&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 : Monitor in production&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PCB run - benchmark&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Options:
  -c, --compressor TEXT       Compressor to include (repeat for multiple). Default: all five.
  -t, --task TEXT             Task type: rag, summarization, coding (repeat for multiple).
  -n, --max-samples INT       Max samples per task.
  -r, --rate FLOAT            Target compression rate 0.0–1.0. Default: 0.5
  -d, --data-dir PATH         Directory with *_samples.jsonl files.
  -o, --output PATH           Save report as .json, .csv, or .html.
  -j, --llm-judge             Enable LLM-as-judge scoring via OpenRouter.
  -m, --judge-model TEXT      Model for LLM judge. Default: claude-sonnet-4-6.
      --openrouter-key TEXT   OpenRouter API key (or set OPENROUTER_API_KEY).
      --daily-tokens INT      Daily token volume for cost projection.
      --cost-model TEXT       Model name for cost lookup (e.g. claude-opus-4-7).
      --token-price FLOAT     Manual price override in $/1M tokens.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PCB compress - compress text&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Arguments:
  [INPUT_FILE]                File to compress. Reads stdin if omitted.

Options:
  -c, --compressor TEXT       Algorithm. Default: tfidf.
  -r, --rate FLOAT            Fraction to remove. Default: 0.45.
  -o, --output PATH           Write to file instead of stdout.
  -s, --stats                 Print token stats to stderr.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Other commands&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb list-compressors          # Show all algorithms
pcb list-models               # Show 75+ supported LLM judge models
pcb show-schema rag           # Show JSONL schema for a task type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Output Formats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;JSON - full detail per sample&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CSV - one row per compressor × task&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Columns: &lt;code&gt;compressor&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, &lt;code&gt;avg_token_reduction_pct&lt;/code&gt;, &lt;code&gt;avg_quality_score&lt;/code&gt;, &lt;code&gt;avg_quality_drop_pct&lt;/code&gt;, &lt;code&gt;avg_llm_score&lt;/code&gt;, &lt;code&gt;avg_llm_drop_pct&lt;/code&gt;, &lt;code&gt;avg_latency_ms&lt;/code&gt;, &lt;code&gt;num_samples&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTML - shareable visual report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcb run --output results.html
# Open in any browser — Chart.js scatter plots, dark theme, Pareto highlights
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When NOT to Use Compression
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short prompts (&amp;lt; 200 tokens):&lt;/strong&gt; PCB skips these automatically overhead exceeds savings.&lt;br&gt;
&lt;strong&gt;Cheap models (&amp;lt; $0.50/1M):&lt;/strong&gt; DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.&lt;br&gt;
&lt;strong&gt;High-precision tasks:&lt;/strong&gt; Legal review, medical diagnosis - verify your quality floor with &lt;code&gt;--llm-judge&lt;/code&gt; first.&lt;br&gt;
&lt;strong&gt;Output-bottlenecked workloads:&lt;/strong&gt; Compression only affects input tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/pcb/
├── cli.py                      # Typer CLI — all commands
├── config.py                   # Pydantic config and model pricing table
├── runner.py                   # Benchmark orchestration + BenchmarkReport
├── mcp_server.py               # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py                # TF-IDF sentence scoring
│   ├── selective_context.py    # Greedy token-budget selection
│   ├── llmlingua.py            # Sentence-level coarse pruning
│   └── no_compression.py       # Passthrough baseline
├── tasks/
│   ├── rag.py                  # F1/EM/context-recall evaluator
│   ├── summarization.py        # ROUGE-L evaluator
│   └── coding.py               # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py            # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py             # Rich terminal tables
│   ├── json_reporter.py        # JSON output
│   ├── csv_reporter.py         # CSV output
│   └── html_reporter.py        # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py     # CompressingAnthropic drop-in wrapper
│   └── openai_client.py        # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl        # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl     # 10 real Python code contexts (370–800 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using NEO. &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.&lt;/p&gt;

&lt;p&gt;I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.&lt;/p&gt;

&lt;p&gt;NEO built the entire thing autonomously, the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the &lt;code&gt;CompressingAnthropic&lt;/code&gt; and &lt;code&gt;CompressingOpenAI&lt;/code&gt; drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets - 20 RAG passages, 10 summarization articles, and 10 coding contexts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it before committing to a compression strategy.&lt;/strong&gt;&lt;br&gt;
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate specific to your data, not a generic recommendation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to justify the cost of compression infrastructure.&lt;/strong&gt;&lt;br&gt;
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make a case for adding compression to your pipeline, not a rough estimate but a measured projection against your workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP tools inside Claude Code sessions.&lt;/strong&gt;&lt;br&gt;
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional compressors.&lt;/strong&gt;&lt;br&gt;
The four compressors share a common interface in &lt;code&gt;src/pcb/compressors/&lt;/code&gt;. A new algorithm - semantic chunking, abstractive summarization, or a custom retrieval-based approach, slots in as a new file in that directory and appears automatically in &lt;code&gt;pcb run&lt;/code&gt;, &lt;code&gt;pcb compress&lt;/code&gt;, and the MCP &lt;code&gt;recommend&lt;/code&gt; tool without touching any other part of the codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Most teams discover they are over-spending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing - which algorithm, at what rate, for which task type and the deployment tooling to act on it immediately.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>ContextCraft: A Visual Workbench for Building and Managing LLM Context Windows</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 05 May 2026 16:00:35 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</link>
      <guid>https://forem.com/nilofer_tweets/contextcraft-a-visual-workbench-for-building-and-managing-llm-context-windows-3f8e</guid>
      <description>&lt;p&gt;Building a good LLM prompt is not a one-shot task. You assemble pieces, a system message, a few examples, some context, the actual instruction and then you iterate. You compress things that are too long, test whether the output still holds up, check how many tokens you are spending, and save versions so you can roll back when something breaks.&lt;/p&gt;

&lt;p&gt;Most developers do this in a text editor, a notebook, or scattered across a handful of scripts. There is no single place where you can see the whole context window, manipulate it visually, compress a block, run a live test, and save a snapshot, all without switching tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContextCraft&lt;/strong&gt; is that place. It is a canvas-based interactive workbench for assembling, compressing, testing, and versioning LLM context windows. It runs locally, connects to Ollama for local compression and testing, supports OpenRouter for cloud LLM testing, and exports directly to OpenAI, Anthropic, LangChain, and JSON formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Visual Canvas:&lt;/strong&gt; Drag and drop interface for organizing prompt blocks with real-time token counting and visual progress bars.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Compression:&lt;/strong&gt; AI-powered compression using Ollama with semantic preservation. Set a target compression ratio, choose whether to preserve structure, and review a before/after comparison before applying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coverage Analysis:&lt;/strong&gt; Semantic similarity scoring between original and compressed content. Key concept preservation is surfaced as a score so you know exactly what you are trading for token savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Testing:&lt;/strong&gt; Test prompts with streaming responses from Ollama or OpenRouter directly from the canvas. Select provider, model, and temperature and view responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt; Save and restore canvas versions with a SQLite backend. Name versions for easy reference and compare two versions to see what changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-Format Export:&lt;/strong&gt; Export to OpenAI, Anthropic, LangChain, and JSON formats. Copy the generated code and paste directly into your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block Library:&lt;/strong&gt; Pre-built starter blocks for common use cases, available from the sidebar. Add your own blocks to the library for reuse across canvases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;ContextCraft is split into a FastAPI backend and a React + Vite frontend.&lt;/p&gt;

&lt;p&gt;The backend handles token counting via tiktoken, semantic similarity analysis via sentence-transformers, compression via Ollama, streaming LLM test responses via Ollama or OpenRouter, SQLite-backed version management, and export format generation. The frontend renders the visual canvas with drag-and-drop via &lt;code&gt;@hello-pangea/dnd&lt;/code&gt; and code editing via CodeMirror.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4emd0squlz4375pjo2rv.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contextcraft/
├── server/                 # FastAPI backend
│   ├── main.py            # FastAPI app entry point
│   ├── models.py          # Pydantic data models
│   ├── tokenizer.py       # Token counting (tiktoken)
│   ├── coverage.py        # Semantic similarity analysis
│   ├── compress.py        # Ollama compression service
│   ├── tester.py          # LLM streaming test service
│   ├── export.py          # Export format generators
│   ├── versions.py        # SQLite version management
│   └── pricing.py         # OpenRouter pricing API
├── frontend/              # React + Vite frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── hooks/         # Custom React hooks
│   │   └── App.jsx        # Main application
│   └── package.json
├── cli/                   # CLI entry point
│   └── main.py
└── pyproject.toml         # Python package config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;Node.js 18+&lt;/li&gt;
&lt;li&gt;Ollama (optional, for local compression and testing)&lt;/li&gt;
&lt;li&gt;OpenRouter API key (optional, for cloud LLM testing)&lt;/li&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;
git clone https://github.com/contextcraft/contextcraft.git
&lt;span class="nb"&gt;cd &lt;/span&gt;contextcraft

&lt;span class="c"&gt;# Install Python dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Install frontend dependencies&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend
npm &lt;span class="nb"&gt;install
cd&lt;/span&gt; ..

&lt;span class="c"&gt;# Initialize the database&lt;/span&gt;
contextcraft init-db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running the Application&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the server and frontend&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Or start with custom options&lt;/span&gt;
contextcraft serve &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8000 &lt;span class="nt"&gt;--frontend-port&lt;/span&gt; 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, the frontend is available at localhost:5173 and the API docs at localhost:8000/docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in the project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenRouter API key (for cloud LLM testing)
&lt;/span&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;

&lt;span class="c"&gt;# Ollama URL (default: http://localhost:11434)
&lt;/span&gt;&lt;span class="py"&gt;OLLAMA_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;

&lt;span class="c"&gt;# Default compression model
&lt;/span&gt;&lt;span class="py"&gt;DEFAULT_COMPRESSION_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma2:2b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Supported Models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Counting:&lt;/strong&gt; GPT-4, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo, Claude 3 Opus, Sonnet, Haiku, Claude 3.5 Sonnet&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression (via Ollama):&lt;/strong&gt; gemma2:2b (default), any Ollama-compatible model&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing:&lt;/strong&gt; Ollama local models, OpenRouter cloud models (requires API key)&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Creating a Canvas:&lt;/strong&gt; Start with an empty canvas or load from the library. Add blocks using the sidebar buttons or drag from the library. Arrange blocks by dragging to reorder. Edit block content inline or in the full editor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compressing Content:&lt;/strong&gt; Click the compress icon on any block. Set the target compression ratio (0.1 to 0.9). Choose whether to preserve structure. Review the before/after comparison. Apply compression when satisfied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Prompts:&lt;/strong&gt; Add your prompt blocks to the canvas. Click the Test button. Select provider (Ollama or OpenRouter). Choose model and set temperature. View streaming responses in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing Coverage:&lt;/strong&gt; Compress one or more blocks. Click the Coverage button. View semantic similarity scores. Check key concept preservation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing Versions:&lt;/strong&gt; Click Versions to save the current state. Name your version for easy reference. Restore previous versions at any time. Compare versions to see changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exporting:&lt;/strong&gt; Click Export when ready. Choose format, OpenAI, Anthropic, LangChain, or JSON. Copy the generated code. Paste into your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  API Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token Management&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/tokenize&lt;/code&gt; - count tokens for text or blocks&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/pricing&lt;/code&gt; - get model pricing information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /api/compress&lt;/code&gt; - compress text using Ollama&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/coverage&lt;/code&gt; - analyze semantic coverage between original and compressed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/test&lt;/code&gt; - stream LLM responses from Ollama or OpenRouter&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions&lt;/code&gt; - list all versions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions&lt;/code&gt; - save a new version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /api/versions/{id}&lt;/code&gt; - get a specific version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/{id}/restore&lt;/code&gt; - restore a version&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;POST /api/versions/compare&lt;/code&gt; - compare two versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Export&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/export&lt;/code&gt; - export canvas to OpenAI, Anthropic, LangChain, or JSON format&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Library&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET /api/library&lt;/code&gt; - get starter block library&lt;br&gt;
&lt;code&gt;POST /api/library&lt;/code&gt; - add a block to the library&lt;/p&gt;
&lt;h2&gt;
  
  
  CLI Commands
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start the application&lt;/span&gt;
contextcraft serve

&lt;span class="c"&gt;# Initialize database&lt;/span&gt;
contextcraft init-db

&lt;span class="c"&gt;# Add a block to library&lt;/span&gt;
contextcraft add-block &lt;span class="nt"&gt;--type&lt;/span&gt; system &lt;span class="nt"&gt;--label&lt;/span&gt; &lt;span class="s2"&gt;"My Template"&lt;/span&gt; &lt;span class="nt"&gt;--content&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

&lt;span class="c"&gt;# Get help&lt;/span&gt;
contextcraft &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build image&lt;/span&gt;
docker build &lt;span class="nt"&gt;-t&lt;/span&gt; contextcraft &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run container&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 &lt;span class="nt"&gt;-p&lt;/span&gt; 5173:5173 contextcraft
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Development
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dev dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;

&lt;span class="c"&gt;# Run tests&lt;/span&gt;
pytest

&lt;span class="c"&gt;# Format code&lt;/span&gt;
black server/ cli/
isort server/ cli/

&lt;span class="c"&gt;# Type checking&lt;/span&gt;
mypy server/ cli/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;frontend

&lt;span class="c"&gt;# Start dev server&lt;/span&gt;
npm run dev

&lt;span class="c"&gt;# Build for production&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# Run linter&lt;/span&gt;
npm run lint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Contributing
&lt;/h2&gt;

&lt;p&gt;Fork the repository. Create a feature branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commit your changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s1"&gt;'Add amazing feature'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push to the branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push origin feature/amazing-feature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open a Pull Request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and iterated entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; -An autonomous AI engineering agent that writes, runs, and refines real code end-to-end.&lt;/p&gt;

&lt;p&gt;ContextCraft is a full-stack application, a FastAPI backend, a React + Vite frontend, a CLI, and a SQLite-backed versioning layer. Every part of the system was generated and connected through NEO: the backend services for token counting, semantic coverage analysis, compression via Ollama, streaming LLM testing, export pipelines, version management, and pricing integration, along with the interactive frontend canvas for assembling prompt blocks with drag-and-drop, inline editing, and real-time token tracking.&lt;/p&gt;

&lt;p&gt;The compression and coverage pipeline, the live testing flow across Ollama and OpenRouter, the version save/restore and comparison system, and the multi-format export layer were all built end-to-end from a high-level problem description. NEO handled the full cycle - generating code, wiring components, resolving issues, and refining the system into a working product.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt engineering workbench:&lt;/strong&gt; Instead of iterating on prompts in a text editor and manually counting tokens, assemble your context window visually, compress blocks that are too long, and test the result, all in one place. The version control means you never lose a working configuration while experimenting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluate compression quality before shipping:&lt;/strong&gt; Before deploying a compressed prompt to production, run coverage analysis to get a semantic similarity score between the original and compressed version. You know exactly how much meaning you are trading for token savings, not just a token count but an actual semantic measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manage prompt libraries across projects:&lt;/strong&gt; The block library lets you save reusable prompt blocks and load them into any canvas. Teams building multiple LLM products can maintain a shared library of tested, versioned prompt components.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional export formats:&lt;/strong&gt; The export module currently supports OpenAI, Anthropic, LangChain, and JSON. Adding a new format follows the same pattern in &lt;code&gt;export.py&lt;/code&gt; and surfaces automatically in the Export UI without touching any other part of the stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Context window management is one of those problems that looks simple until you are doing it seriously. ContextCraft brings together the pieces that are usually scattered across different tools visual assembly, token counting, AI compression, semantic coverage analysis, live testing, version control, and export into a single local workbench.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/ContextCraft" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/ContextCraft&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;br&gt;
You can also use NEO MCP with Claude Code: &lt;a href="https://heyneo.com/claude-code" rel="noopener noreferrer"&gt;https://heyneo.com/claude-code&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>LLM Behavior Diff Model Update Detector</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 04 May 2026 11:18:41 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</link>
      <guid>https://forem.com/nilofer_tweets/llm-behavior-diff-model-update-detector-3e7b</guid>
      <description>&lt;p&gt;You swap a model. The new one scores better on your benchmarks. You deploy it. Two days later, a user reports that something that used to work reliably now behaves differently.&lt;/p&gt;

&lt;p&gt;The benchmark never caught it because benchmarks measure averages. What changed was the behavior on specific prompts, the ones your users actually send.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Behavior Diff&lt;/strong&gt; is a tool that catches this before it happens. Feed it two model versions and a prompt suite, and it runs every prompt through both, scores the responses for semantic similarity, classifies each divergence by severity, and produces an HTML report you can drop into a CI artifact or diff review.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python API, and an MCP server so Claude Code or any MCP-compatible agent can run a behavioral diff before a model swap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Model Updates
&lt;/h2&gt;

&lt;p&gt;Every model update is a tradeoff. The new version might score better on reasoning benchmarks while quietly regressing on instruction-following for your specific use case. Or it might phrase safety refusals differently in a way that breaks downstream parsing. Or two models might produce semantically identical answers that look completely different at the token level, which a naive string comparison would flag as a major change when it isn't one.&lt;/p&gt;

&lt;p&gt;LLM Behavior Diff addresses all three scenarios. Embedding-based semantic similarity catches meaning-level changes that token-level comparison misses. The LLM-as-judge layer adds a reasoning layer for ambiguous cases. Severity classification separates noise from real regressions.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The pipeline runs in five steps for every prompt in your suite:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load:&lt;/strong&gt; A YAML prompt suite is loaded into a &lt;code&gt;PromptSuite&lt;/code&gt; Pydantic model. Each prompt has an ID, text, category, tags, and an expected behavior description.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run:&lt;/strong&gt; Each prompt is sent through Model A and Model B via &lt;code&gt;LLMRunner&lt;/code&gt;. Three providers are supported: Ollama (&lt;code&gt;/api/generate&lt;/code&gt;), OpenRouter (chat completions), and a deterministic stub provider for offline CI runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Score:&lt;/strong&gt; Each response pair is scored with either &lt;code&gt;EmbeddingDiffer&lt;/code&gt; (cosine similarity on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; embeddings) or &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard over words). Optionally, an LLM-as-judge score is combined with the similarity score, default judge model is &lt;code&gt;google/gemini-2.0-flash-lite-001&lt;/code&gt; via OpenRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify:&lt;/strong&gt; Each prompt is classified against the &lt;code&gt;--threshold&lt;/code&gt;. Changes are bucketed by severity: combined score &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Report:&lt;/strong&gt; An HTML report is rendered and a rich summary table is printed to the terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Embeddings Over Token Matching
&lt;/h2&gt;

&lt;p&gt;The difference matters. Here is the same two-model comparison run two ways:&lt;br&gt;
With &lt;code&gt;--use-embeddings&lt;/code&gt; (cosine on &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 91.4%&lt;br&gt;
Changes Detected: 0 of 5&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;--no-use-embeddings&lt;/code&gt; (Jaccard fallback):&lt;/p&gt;

&lt;p&gt;Avg Similarity: 25.0%&lt;br&gt;
Changes Detected: 5 of 5&lt;/p&gt;

&lt;p&gt;Same two models, same prompts, completely opposite conclusions. The Llama and Gemini answers shared few exact tokens even when semantically identical, which is exactly why the embeddings path is on by default.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Requires Python 3.11+. Embedding similarity uses &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt;, downloaded on first use. The LLM-judge path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt; without it, scoring falls back to embeddings-only.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running a Diff
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Offline - stub provider&lt;/strong&gt;&lt;br&gt;
A stub provider returns deterministic hashed responses, so the whole pipeline runs offline without Ollama or an API key. Good for CI and testing the setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; stub-a &lt;span class="nt"&gt;--provider-a&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; stub-b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; stub &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-use-embeddings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output from this run (stub + Jaccard, threshold 0.5):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
  Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%

Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real models - OpenRouter&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; meta-llama/llama-3.2-3b-instruct &lt;span class="nt"&gt;--provider-a&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; google/gemini-2.0-flash-lite-001 &lt;span class="nt"&gt;--provider-b&lt;/span&gt; openrouter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/or_emb.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output (embeddings only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Total Prompts    │ 5     │
┃ Changes Detected │ 0     │
┃ Change Rate      │ 0.0%  │
┃ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding &lt;code&gt;--use-judge&lt;/code&gt; brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."&lt;/p&gt;

&lt;p&gt;Real models - Ollama&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-a&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;--provider-a&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model-b&lt;/span&gt; gemma4:e4b &lt;span class="nt"&gt;--provider-b&lt;/span&gt; ollama &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--prompts&lt;/span&gt; prompts/default.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; output/report.html &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--use-embeddings&lt;/span&gt; &lt;span class="nt"&gt;--threshold&lt;/span&gt; 0.85
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --help

Usage: llm-diff [OPTIONS] COMMAND [ARGS]...

 LLM Behavior Diff — Model Update Detector

 --version                Show version information
 --help                   Show this message and exit.

 Commands
   run  Run a comparison between two models.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;llm-diff --version
LLM Behavior Diff version 0.1.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key options for &lt;code&gt;llm-diff run&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7sqizy9sheh9lm5ss0h1.png" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Severity buckets applied when a change is detected: combined &amp;gt;= 0.7 is minor, &amp;gt;= 0.4 is moderate, &amp;lt; 0.4 is major.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Suite Format
&lt;/h2&gt;

&lt;p&gt;The prompt suite is a YAML file. &lt;code&gt;prompts/default.yaml&lt;/code&gt; ships with 5 prompts spanning reasoning, coding, factual, instruction-following, and safety. You can write your own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suite"&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0"&lt;/span&gt;
&lt;span class="na"&gt;prompts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-001"&lt;/span&gt;
    &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reverse_string(s)..."&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coding"&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;expected_behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;function"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IDs must be unique. Category must be one of: &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;coding&lt;/code&gt;, &lt;code&gt;creativity&lt;/code&gt;, &lt;code&gt;safety&lt;/code&gt;, &lt;code&gt;instruction_following&lt;/code&gt;, &lt;code&gt;factual&lt;/code&gt;, &lt;code&gt;conversational&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. A synchronous one-shot call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.runner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_prompt_sync&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ProviderType&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_prompt_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProviderType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STUB&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; Model stub-m says: 921fac0c4c True
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarity scoring directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.differ&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_differ&lt;/span&gt;

&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the cat ran&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EmbeddingDiffer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Two plus two equals four.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# -&amp;gt; ~0.59
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;create_differ(use_embeddings=False)&lt;/code&gt; returns a &lt;code&gt;SimpleDiffer&lt;/code&gt; (Jaccard). &lt;code&gt;True&lt;/code&gt; returns an &lt;code&gt;EmbeddingDiffer&lt;/code&gt; if sentence-transformers is importable, otherwise falls back to &lt;code&gt;SimpleDiffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Generating a report from a &lt;code&gt;ComparisonRun&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ReportGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;

&lt;span class="nc"&gt;ReportGenerator&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;save_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ReportGenerator&lt;/code&gt; looks for a Jinja template in the CWD, the package directory, and a legacy path, then falls back to a built-in template so reports always render.&lt;/p&gt;

&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The tool also runs as an MCP server over stdio transport, exposing three tools so Claude Code or any MCP-compatible agent can trigger a behavioral diff during a session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llm-diff-mcp
&lt;span class="c"&gt;# or: python -m llm_behavior_diff.mcp_server&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three exposed tools:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;compare_models&lt;/strong&gt; - runs a full prompt suite through two models and returns per-prompt similarity, severity, and response text.&lt;br&gt;
&lt;strong&gt;analyze_drift&lt;/strong&gt; - scores drift between two candidate responses for a single prompt.&lt;br&gt;
&lt;strong&gt;generate_report&lt;/strong&gt; - renders an HTML summary from a JSON list of results.&lt;/p&gt;

&lt;p&gt;Claude Code config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llm-behavior-diff"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm-diff-mcp"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smoke test - all three tools, offline, via Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llm_behavior_diff.mcp_server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;compare_models&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompareModelsRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompts_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompts/default.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_prompts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;changes_detected&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_similarity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 5 4 0.3446
&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;analyze_drift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AnalyzeDriftRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2+2 equals 4.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embedding_similarity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# real: 0.5572 moderate
&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;GenerateReportRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;results_json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavioral_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;major&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output/mcp_report.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MCP Smoke&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verified over stdio JSON-RPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;llm-diff-mcp   #&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;speaks MCP 2024-11-05 on stdio
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/list -&amp;gt; compare_models, analyze_drift, generate_report
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tools/call analyze_drift &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"prompt_text"&lt;/span&gt;:&lt;span class="s2"&gt;"..."&lt;/span&gt;,&lt;span class="s2"&gt;"response_a"&lt;/span&gt;:&lt;span class="s2"&gt;"Paris"&lt;/span&gt;,
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="s2"&gt;"response_b"&lt;/span&gt;:&lt;span class="s2"&gt;"The capital is Paris."&lt;/span&gt;,&lt;span class="s2"&gt;"use_embeddings"&lt;/span&gt;:true&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-&amp;gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"embedding_similarity"&lt;/span&gt;:0.7761,&lt;span class="s2"&gt;"severity"&lt;/span&gt;:&lt;span class="s2"&gt;"minor"&lt;/span&gt;, ...&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge requires OpenRouter&lt;/strong&gt; -  Without &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;, judging is skipped and the combined score equals the embedding or Jaccard similarity alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First embedding run is slow&lt;/strong&gt; - &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is downloaded from Hugging Face on first use. Subsequent runs use the local cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama is not spawned automatically&lt;/strong&gt; - The client talks to &lt;code&gt;http://localhost:11434&lt;/code&gt; by default (OLLAMA_HOST env var overrides). Ollama must already be running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stub provider is for CI and demos only&lt;/strong&gt; - It produces deterministic fake text keyed on model name, temperature, and prompt. Not suitable for real behavioral conclusions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How You Can Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gate model upgrades in CI before they ship:&lt;/strong&gt; Add &lt;code&gt;llm-diff&lt;/code&gt; run to your deployment pipeline. Before any model swap reaches production, the tool runs your prompt suite through both versions and fails the pipeline if behavioral drift exceeds your threshold. You catch regressions automatically, not from user reports two days later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during prompt engineering to measure real impact:&lt;/strong&gt; When you change a system prompt or few-shot examples, run a diff between the old and new configuration. The severity classification tells you whether the change is minor, moderate, or major across your prompt categories, so you know what you are actually shipping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the MCP server to make your agent self-aware of drift:&lt;/strong&gt; With the MCP server running, Claude Code or any MCP-compatible agent can call &lt;code&gt;compare_models&lt;/code&gt; or &lt;code&gt;analyze_drift&lt;/code&gt; directly during a session. An agent working on a model integration can check for behavioral drift without leaving the coding environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extend it with additional providers:&lt;/strong&gt; The tool currently supports Ollama, OpenRouter, and a stub provider, all sharing a common &lt;code&gt;LLMRunner&lt;/code&gt; interface. Adding a new provider for Anthropic, Gemini, or any OpenAI-compatible endpoint follows the same pattern without touching the differ, classifier, or report logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Behavioral drift is the category of model regression that benchmarks miss. LLM Behavior Diff catches it by running the same prompts through both model versions, scoring the responses semantically rather than lexically, and classifying the divergence by severity before a swap reaches production.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

</description>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI Slop Cleaner: Automating Your Codebase Hygiene</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Thu, 30 Apr 2026 09:56:31 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</link>
      <guid>https://forem.com/nilofer_tweets/ai-slop-cleaner-automating-your-codebase-hygiene-4hi</guid>
      <description>&lt;p&gt;Every codebase accumulates clutter over time. An import left behind after a refactor. A helper function that nothing calls anymore. A method that grew too complex to reason about. None of it breaks anything immediately, but it slows down every developer who reads through it, and it silently raises the cost of every future change.&lt;/p&gt;

&lt;p&gt;The usual fix is a manual review pass. Someone spends an hour looking for unused imports, searching for dead functions, flagging complexity hotspots. It is tedious, inconsistent, and happens far less often than it should.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slop Cleaner&lt;/strong&gt; is a CLI tool that does this automatically. It detects unused imports, dead functions and classes, and over-complex code using tree-sitter AST analysis (not regex) so it never removes an import that appears in a docstring or string annotation. Every patch is atomic: backed up before writing, and rolled back automatically if your test suite fails.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ndoo779n1lyz832dx0b.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What slop-cleaner Detects and Fixes
&lt;/h2&gt;

&lt;p&gt;slop-cleaner targets three specific categories of clutter that accumulate in every codebase:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unused imports:&lt;/strong&gt; Imports left behind after refactors. These are detected at HIGH confidence and removed automatically. The tool handles single-line imports, selective removal from &lt;code&gt;from X import A, B&lt;/code&gt; blocks where only one name is unused, and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks where individual lines are surgically removed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead functions and classes:&lt;/strong&gt; Defined but never called anywhere in the codebase. These are detected at MEDIUM confidence and flagged in the report for human review. The tool builds a full call graph across all symbols before making this determination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-complex functions:&lt;/strong&gt; Functions with cyclomatic complexity above the threshold (default 10). These are flagged at MEDIUM confidence for manual refactoring. The tool never auto-removes a function, complexity is a signal that something needs attention, not a safe automatic fix.&lt;/p&gt;

&lt;p&gt;HIGH confidence issues are fixed automatically. MEDIUM confidence issues are always left to human judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-Phase Pipeline
&lt;/h2&gt;

&lt;p&gt;Everything slop-cleaner does runs as a five-phase pipeline. Each phase feeds into the next, and every patch is atomic, backed up before writing and rolled back automatically if your tests fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit:&lt;/strong&gt;Parses every &lt;code&gt;.py&lt;/code&gt; and &lt;code&gt;.ts&lt;/code&gt;/&lt;code&gt;.tsx&lt;/code&gt; file with tree-sitter. Collects unused imports at HIGH confidence and high-complexity functions at MEDIUM confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze:&lt;/strong&gt; Builds a call graph across all symbols in the project. Identifies dead code, defined but never called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean:&lt;/strong&gt; Applies HIGH-confidence patches atomically. Handles single-line and multi-line &lt;code&gt;from X import (...)&lt;/code&gt; blocks. Backs up each file before writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify:&lt;/strong&gt; Runs your test suite with pytest. On any failure, rolls back every patched file to its backup automatically. Returns exit code 1 so CI catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document:&lt;/strong&gt; Generates three Markdown reports: &lt;code&gt;ARCHITECTURE.md&lt;/code&gt;, &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;, and &lt;code&gt;SLOP_REPORT.md&lt;/code&gt; covered in detail below.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Fixed Automatically vs. Flagged
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1g1a3936hz63kzupcf8.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The distinction matters. HIGH confidence patches are cases where the tool is certain removal is safe. MEDIUM confidence cases, dead code and complexity could have dynamic dispatch, reflection, or other patterns that make automatic removal risky. The tool flags these and leaves the decision to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge Cases Handled
&lt;/h2&gt;

&lt;p&gt;One of the hardest parts of detecting unused imports is knowing when a name that looks unused is actually needed. slop-cleaner handles these correctly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyztlbmueky8h7euxbuq0.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are exactly the cases where regex-based tools get it wrong. Because slop-cleaner parses an actual AST, it understands the difference between a name appearing inside a string and a name being used as an identifier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/your-org/slop-cleaner
&lt;span class="nb"&gt;cd &lt;/span&gt;slop-cleaner

&lt;span class="c"&gt;# Create and activate a virtual environment&lt;/span&gt;
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate      &lt;span class="c"&gt;# macOS / Linux&lt;/span&gt;
&lt;span class="c"&gt;# .venv\Scripts\activate       # Windows&lt;/span&gt;

&lt;span class="c"&gt;# Install the tool and its dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This registers two CLI commands: &lt;code&gt;slop-audit&lt;/code&gt; and &lt;code&gt;slop-clean&lt;/code&gt;.&lt;br&gt;
Dependencies installed automatically via &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tree-sitter&amp;gt;=0.25
tree-sitter-python&amp;gt;=0.23
tree-sitter-typescript&amp;gt;=0.23
rich&amp;gt;=13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note for modern Linux (Ubuntu 23.04+, Debian 12+)&lt;/strong&gt;: system Python blocks global pip install. Always use a virtual environment as shown above, or use &lt;code&gt;pipx install .&lt;/code&gt; to install the CLI tools globally without a venv.&lt;/p&gt;

&lt;p&gt;To run the test suite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[test]"&lt;/span&gt;   &lt;span class="c"&gt;# adds pytest + pytest-cov&lt;/span&gt;
pytest                     &lt;span class="c"&gt;# runs tests/test_parsers.py (22 tests)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit a project — report issues, exit 1 if any found (CI-friendly)&lt;/span&gt;
slop-audit path/to/project/

&lt;span class="c"&gt;# Audit a single file&lt;/span&gt;
slop-audit src/services/user_service.py

&lt;span class="c"&gt;# Full clean — audit, fix, verify tests, generate docs&lt;/span&gt;
slop-clean path/to/project/

&lt;span class="c"&gt;# Dry run — show what would change without touching files&lt;/span&gt;
slop-clean path/to/project/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Write audit JSON for tooling integration&lt;/span&gt;
slop-audit path/to/project/ &lt;span class="nt"&gt;--output&lt;/span&gt; report.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;slop-audit&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-audit&lt;/code&gt; scans a file or directory and prints a table of all issues found without touching anything. It exits with code 1 if issues are found, making it a clean CI gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-audit &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; JSON] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62hiadeh2o86hmc6y3u2.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = clean, &lt;code&gt;1&lt;/code&gt; = issues found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;slop-clean&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;slop-clean&lt;/code&gt; runs the full 5-phase pipeline. Always run &lt;code&gt;--dry-run&lt;/code&gt; first on an unfamiliar project, it shows exactly what patches would be applied without touching any file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;slop-clean &amp;lt;target&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--output&lt;/span&gt; DIR] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--threshold&lt;/span&gt; N] &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8fuen6i2plts3rceqho.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exit codes: &lt;code&gt;0&lt;/code&gt; = success, &lt;code&gt;1&lt;/code&gt; = rollback was triggered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generated Output
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;slop-clean&lt;/code&gt; runs, the &lt;code&gt;--output&lt;/code&gt; directory contains three reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-output/
├── ARCHITECTURE.md   — file tree + Mermaid dependency graph
├── FUNCTION_MAP.md   — every symbol with start line and complexity score
└── SLOP_REPORT.md    — issue summary, patches applied, dead-code candidates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ARCHITECTURE.md&lt;/code&gt; gives a structural overview of the codebase with a visual Mermaid call graph, useful for understanding how the project fits together at a glance.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; is a complete index of every function and class with its start line and complexity score. For large codebases, this is the fastest way to see where complexity is concentrated.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SLOP_REPORT.md&lt;/code&gt; is the actionable output: what was fixed automatically, what was flagged for manual review, and which dead-code candidates need a human decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trying It on the Example Projects
&lt;/h2&gt;

&lt;p&gt;Two sample projects ship in examples/ so you can try the tool immediately after installing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Audit only — no test verification needed&lt;/span&gt;
slop-audit examples/todo_app/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;
slop-audit examples/event_pipeline/ &lt;span class="nt"&gt;--verbose&lt;/span&gt;

&lt;span class="c"&gt;# Full clean — dry run first, then apply&lt;/span&gt;
slop-clean examples/todo_app/ &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
slop-clean examples/todo_app/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the example test suites directly, change into the project directory first so their imports resolve correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/todo_app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;examples/event_pipeline &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;todo_app&lt;/code&gt; is a sample Python app with intentional slop, a good first run to see exactly what the tool catches. &lt;code&gt;event_pipeline&lt;/code&gt; covers tricky import patterns like aliases, multi-line imports, and string annotations, designed to show the AST analysis handling edge cases correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI Integration
&lt;/h2&gt;

&lt;p&gt;slop-audit drops directly into CI as a quality gate. It exits 1 if any issues are found, failing the workflow step automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/quality.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;slop-check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install -e .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-audit src/ --threshold &lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches unused imports and complexity regressions on every pull request before they get merged into the main branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slop-cleaner/
├── cli/
│   └── main.py               — entry points + rich-formatted output
├── engines/
│   ├── auditor.py            — Phase 1 · AST-based issue detection
│   ├── analyzer.py           — Phase 2 · call-graph + dead-code finder
│   ├── cleaner.py            — Phase 3 · atomic patch application
│   ├── verifier.py           — Phase 4 · pytest runner + rollback
│   └── documenter.py         — Phase 5 · Markdown report generator
├── parsers/
│   ├── python_parser.py      — tree-sitter Python wrapper
│   └── typescript_parser.py  — tree-sitter TypeScript/TSX wrapper
├── examples/
│   ├── todo_app/             — sample Python app with intentional slop
│   └── event_pipeline/       — sample project with tricky import patterns
├── assets/
│   ├── pipeline.svg          — 5-phase flow diagram
│   ├── features.svg          — feature overview infographic
│   └── before-after.svg      — code transformation visual
└── tests/
    └── test_parsers.py       — 22 tests covering the parsers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The structure maps cleanly to the five phases, one engine per phase, two parsers for Python and TypeScript/TSX, and a single CLI entry point that ties them together.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This tool was designed, built, debugged, and refined entirely using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes, runs, and iterates on real code without hand-holding.&lt;/p&gt;

&lt;p&gt;Every engine, every edge-case fix, and every SVG was produced by NEO in a single session. The five-phase pipeline, the tree-sitter AST parsers for Python and TypeScript, the atomic patch application with backup and rollback, the call graph builder, the pytest runner with automatic rollback on failure, and the three Markdown report generators all of it built end-to-end from a high-level problem description.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Use and Extend This With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI quality gate on every pull request:&lt;/strong&gt;&lt;br&gt;
Drop &lt;code&gt;slop-audit&lt;/code&gt; into your CI pipeline. Every PR gets checked for unused imports and complexity regressions before it touches main. The exit code 1 on issues means it fails the build automatically, no configuration beyond one workflow step. It is already wired for this out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it before a major refactor:&lt;/strong&gt;&lt;br&gt;
Before starting a large refactor, run &lt;code&gt;slop-clean&lt;/code&gt; on the codebase. It removes accumulated import clutter, flags dead code that no longer needs to be worked around, and generates a complexity map. You go into the refactor with a cleaner starting point and a clear picture of where complexity lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to onboard onto an unfamiliar codebase:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-audit&lt;/code&gt; on a project you have just inherited and read &lt;code&gt;SLOP_REPORT.md&lt;/code&gt;. You get a structured list of unused imports, dead functions, and complexity hotspots, a map of technical debt you can act on rather than discovering piece by piece while working in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to generate a documentation baseline for a legacy project:&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;slop-clean&lt;/code&gt; and read &lt;code&gt;ARCHITECTURE.md&lt;/code&gt; and &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt;. You get a file tree, a visual call graph, and a complete index of every symbol with its complexity score. For a project with no existing documentation, that is a meaningful starting point.&lt;/p&gt;

&lt;p&gt;The tool is also designed to be extended and NEO can take any of these further without starting from scratch:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript/JSX parser-&lt;/strong&gt; Two parsers already exist (&lt;code&gt;python_parser.py&lt;/code&gt;, &lt;code&gt;typescript_parser.py&lt;/code&gt;) following the same tree-sitter wrapper pattern. A third for JavaScript/JSX follows the same interface and plugs into all five engines immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional complexity metrics-&lt;/strong&gt; &lt;code&gt;auditor.py&lt;/code&gt; already tracks cyclomatic complexity. Additional metrics like function length follow the same detection pattern and surface automatically in &lt;code&gt;FUNCTION_MAP.md&lt;/code&gt; and the audit output once added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-fix for dead code-&lt;/strong&gt; dead code is currently flagged at MEDIUM confidence and left to human judgment. For clearly dead private functions confirmed unreachable by the call graph, automatic removal is a natural next step built directly on the existing &lt;code&gt;analyzer.py&lt;/code&gt; and &lt;code&gt;cleaner.py&lt;/code&gt; infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hook-&lt;/strong&gt; &lt;code&gt;slop-audit&lt;/code&gt; already exits 1 on issues. A small wrapper that hooks into the existing CLI entry point brings slop detection into the local development loop before anything reaches CI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Code clutter is not a cosmetic problem. Unused imports add noise to every code review. Dead functions add surface area that has to be mentally accounted for. High-complexity functions resist change and hide bugs. slop-cleaner addresses all three automatically, with AST-level precision that regex-based tools cannot match, and with a test-verification step that means it never leaves your codebase in a worse state than it found it.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/Ai_Slop_Cleaner" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Ai_Slop_Cleaner&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:56:24 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</link>
      <guid>https://forem.com/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</guid>
      <description>&lt;p&gt;When an LLM agent fails, the trace is right there, the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Failure Classifier&lt;/strong&gt; is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.&lt;/p&gt;

&lt;p&gt;The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eight Failure Modes
&lt;/h2&gt;

&lt;p&gt;The classifier recognises exactly eight failure modes, each with a precise definition:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HALLUCINATION:&lt;/strong&gt; Agent stated facts or called tools that do not exist&lt;br&gt;
&lt;strong&gt;TOOL_MISUSE:&lt;/strong&gt; Agent called a real tool with wrong parameters or at the wrong time&lt;br&gt;
&lt;strong&gt;CONTEXT_LOSS:&lt;/strong&gt; Agent forgot earlier decisions or repeated already-completed steps&lt;br&gt;
&lt;strong&gt;CIRCULAR_REASONING:&lt;/strong&gt; Agent looped between the same 2-3 steps without making progress&lt;br&gt;
&lt;strong&gt;GOAL_DRIFT:&lt;/strong&gt; Agent started pursuing a sub-goal and forgot the original task&lt;br&gt;
&lt;strong&gt;OVER_REFUSAL:&lt;/strong&gt; Agent refused an action it was capable of and should have taken&lt;br&gt;
&lt;strong&gt;SCHEMA_ERROR:&lt;/strong&gt; Agent generated malformed JSON for a tool call or structured output&lt;br&gt;
&lt;strong&gt;TIMEOUT_CASCADE:&lt;/strong&gt; One slow tool call caused the agent to rush or skip subsequent steps&lt;/p&gt;

&lt;p&gt;These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The classification pipeline runs in two layers.&lt;/p&gt;

&lt;p&gt;The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;LLM-as-judge&lt;/strong&gt; layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).&lt;br&gt;
Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/agent-failure-classifier
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-failure-classifier
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.8+. The only dependencies are &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;requests&lt;/code&gt;. The rule-based layer runs with no additional setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Judge Setup (Optional)&lt;/strong&gt;&lt;br&gt;
To enable the LLM-judge pass, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and set your OpenRouter key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# edit .env and set OPENROUTER_API_KEY=sk-or-...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without any key, pass &lt;code&gt;--no-llm&lt;/code&gt; to every classify or batch call. The rule-based layer alone classifies all eight failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI
&lt;/h2&gt;

&lt;p&gt;The CLI is exposed as both a console script (&lt;code&gt;agent-failure-classifier&lt;/code&gt;) and an importable module (&lt;code&gt;python -m agent_failure_classifier.cli&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify a single trace&lt;/strong&gt;&lt;br&gt;
The core command takes a trace JSON file and returns a structured report. &lt;code&gt;--no-llm&lt;/code&gt; keeps it offline, rule-based only, no API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier classify &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" alt=" " width="746" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate a trace&lt;/strong&gt;&lt;br&gt;
Before classifying, &lt;code&gt;validate&lt;/code&gt; parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier validate &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Batch classification&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;batch&lt;/code&gt; runs classification over every &lt;code&gt;*.json&lt;/code&gt; file in a directory and produces a failure-mode distribution table plus a per-trace summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier batch &lt;span class="nt"&gt;--traces-dir&lt;/span&gt; ./traces/ &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Worked Examples
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1 - Hallucination&lt;/strong&gt;&lt;br&gt;
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, no tool call, no retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hallucination-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get population statistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_successful"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How many people died in WWII?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classification: &lt;code&gt;HALLUCINATION&lt;/code&gt;, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2 - Circular Reasoning&lt;/strong&gt;&lt;br&gt;
Four turns alternating between &lt;code&gt;"Let me analyze this step by step."&lt;/code&gt; and &lt;code&gt;"I need more information."&lt;/code&gt; The agent makes no progress across the entire trace.&lt;br&gt;
Classification: &lt;code&gt;CIRCULAR_REASONING&lt;/code&gt;, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3 - Timeout Cascade&lt;/strong&gt;&lt;br&gt;
A &lt;code&gt;slow_api&lt;/code&gt; tool call with &lt;code&gt;latency_ms: 6000&lt;/code&gt; followed by a one-word agent response &lt;code&gt;"OK"&lt;/code&gt;.&lt;br&gt;
Classification: &lt;code&gt;TIMEOUT_CASCADE&lt;/code&gt;, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.&lt;/p&gt;
&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classify a trace programmatically&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.classifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FailureClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/hallucination_example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FailureClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classified_failure_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_cause_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Record a trace live with TraceRecorder&lt;/strong&gt;&lt;br&gt;
Rather than constructing trace JSON by hand, &lt;code&gt;TraceRecorder&lt;/code&gt; is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with &lt;code&gt;FailureClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceRecorder&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TraceRecorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants near me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luigi Bistro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I found Luigi Bistro and Pasta Palace.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found Luigi Bistro and Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_successful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On exit the trace is saved to &lt;code&gt;./traces/trace_&amp;lt;id&amp;gt;_&amp;lt;timestamp&amp;gt;.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse traces from other frameworks&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;AutoParser&lt;/code&gt; auto-detects and normalises three input formats into the canonical &lt;code&gt;AgentTrace&lt;/code&gt; model. No manual conversion needed regardless of where the trace came from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.formats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoParser&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/trace.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three supported formats are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native / generic:&lt;/strong&gt; a dict with &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;original_goal&lt;/code&gt;, &lt;code&gt;is_successful&lt;/code&gt;, and a turns list. This is the format emitted by &lt;code&gt;TraceRecorder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith run export:&lt;/strong&gt; a dict with &lt;code&gt;run_type&lt;/code&gt;, &lt;code&gt;inputs&lt;/code&gt;, &lt;code&gt;outputs&lt;/code&gt;, and optional &lt;code&gt;child_runs&lt;/code&gt;. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph state dict:&lt;/strong&gt; a dict with &lt;code&gt;thread_id&lt;/code&gt; and a &lt;code&gt;state.messages&lt;/code&gt; list whose entries use type values &lt;code&gt;human&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, and &lt;code&gt;tool&lt;/code&gt;.
A minimal list-of-dicts (&lt;code&gt;[{"role": "...", "content": "..."}, ...]&lt;/code&gt;) is also accepted by the generic parser.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the &lt;code&gt;FailureClassifier&lt;/code&gt; orchestration layer, the optional LLM-as-judge pass via OpenRouter, the &lt;code&gt;TraceRecorder&lt;/code&gt; context manager, the &lt;code&gt;AutoParser&lt;/code&gt; with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI/CD quality gate for your agent.&lt;/strong&gt;&lt;br&gt;
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with &lt;code&gt;TraceRecorder&lt;/code&gt;, run &lt;code&gt;batch&lt;/code&gt; classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to understand where your agent breaks most.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;batch&lt;/code&gt; classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic, you're looking at patterns across runs, not reading individual traces one by one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a live monitoring layer in a multi-agent system.&lt;/strong&gt;&lt;br&gt;
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during agent development to catch regressions early.&lt;/strong&gt;&lt;br&gt;
Wrap &lt;code&gt;TraceRecorder&lt;/code&gt; around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. Eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for traces from native formats, LangSmith, and LangGraph, all producing a structured report with the first failure turn and actionable fixes.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/agent-failure-classifier" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/agent-failure-classifier&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
