Forem: Nilofer 🚀

Context Time Machine: Forensic Investigation of What Your Agent Actually Saw

Nilofer 🚀 — Sat, 16 May 2026 11:10:19 +0000

Long-running agent sessions fail in a specific way that is hard to debug. The agent runs 40 turns. At turn 38, it gives a wrong answer that ignores something it decided at turn 12. You look at the logs: the turn 12 decision is there, and so is the turn 38 response. But you cannot see what the context window looked like at turn 38. Was the turn 12 decision still in context? Was it evicted? Was it there but semantically overwhelmed by 25 other turns?

This is the forensic problem that ContextTimeMachine solves. It is not a real-time session monitor; it is for deep post-hoc investigation of what happened during a session, after it has already run. The key insight it is built on: the context window at any given turn is deterministic given the conversation history. You can reconstruct exactly what the model saw at turn 38, render it interactively, and query it.

Three Investigation Modes

Mode 1 - Timeline Navigator

The primary view is a vertical timeline of all turns in the session. Each turn shows the turn number, agent name if available, turn type, token count at that turn, and a sparkline showing how the context composition changed.

Click any turn to travel to it - the context window at that exact point reconstructs and renders in the main panel. You see exactly what the model saw: every message in order, with token counts, with a red line showing where the context would have been truncated if it exceeded the model's limit. Scrub through turns with keyboard arrows. Watch the context window evolve turn by turn. See turns disappear as eviction happens. See tool results arrive and push older content further back.

Mode 2 - Fact Tracker

You know something specific: a decision made at turn 5, a fact retrieved at turn 15, a user instruction given at turn 3. You want to know: at what turn did this fact leave the context window?

Enter any text snippet in the Fact Tracker search box. ContextTimeMachine embeds it locally using sentence-transformers, then searches every turn's context snapshot for the nearest matching content. It renders a presence chart, a horizontal bar across all turns colored green where the fact is present and red where it is absent, and shows the exact turn where the fact entered context and the exact turn where it left.

This answers the most common debugging question for long agent sessions: "When exactly did the agent stop knowing X?"

Mode 3 - Divergence Finder

You have two agent sessions that started identically but ended differently. One succeeded, one failed. Load both sessions and ContextTimeMachine finds the earliest turn where their context windows diverged, that is, where they started seeing different content, and highlights that turn as the likely root cause of the different outcomes.

It shows a side-by-side comparison of the two context windows at the divergence point with diffed content highlighted. This is the automated version of the manual debugging process every team does when comparing "the run that worked" against "the run that didn't."

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    ContextTimeMachine                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Frontend (React)                                               │
│  ├─ TimelineNavigator    — Turn-by-turn timeline scrubber       │
│  ├─ ContextPanel         — Renders reconstructed context        │
│  ├─ FactTracker          — Fact presence chart                  │
│  └─ DivergenceFinder     — Two-session comparison               │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  FastAPI Backend                                                │
│  ├─ /api/session/load          — Load session from file         │
│  ├─ /api/session/{id}/profile  — Get token profile              │
│  ├─ /api/session/{id}/turn/{n} — Reconstruct context at turn    │
│  ├─ /api/session/{id}/fact     — Track fact presence            │
│  ├─ /api/divergence            — Find divergence point          │
│  └─ /api/sessions              — List all sessions              │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Core Analysis Modules                                          │
│  ├─ SessionLoader        — Load from multiple formats           │
│  ├─ ContextReconstructor — Reconstruct at any turn              │
│  ├─ FactTracker          — Track presence via embeddings        │
│  ├─ DivergenceFinder     — Find divergence points               │
│  ├─ TokenAnalyzer        — Token budget analysis                │
│  └─ EmbeddingService     — Local embeddings (all-MiniLM)        │
│                                                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Storage                                                        │
│  └─ SQLite DB            — Session snapshots & metadata         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Installation

Prerequisites

Python 3.10+
pip

Quick Start

# Clone the repository
git clone https://github.com/dakshjain-1616/context-time-machine.git
cd context-time-machine

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install package
pip install -e .

# Start the server
timemachine serve

Open http://localhost:8000 in your browser. The server will automatically open your browser if it can.

Usage

Loading Sessions
Sessions can be loaded from two formats.

From LiveContext SQLite Export:

timemachine load --file session.db

From Generic JSON:

timemachine load --file session.json

The generic JSON format expects a turns array where each turn contains a messages list, a model_id, and a timestamp:

{
  "turns": [
    {
      "turn": 0,
      "messages": [
        {"role": "system", "content": "You are helpful.", "token_count": 3},
        {"role": "user", "content": "What is 2+2?", "token_count": 4}
      ],
      "model_id": "gpt-4",
      "timestamp": "2026-05-09T10:00:00Z"
    }
  ],
  "model_id": "gpt-4"
}
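
As a rough illustration of how a file in this shape could be consumed, the sketch below parses the example with the standard library and sums token_count per turn. The helper name turn_token_totals is hypothetical and not part of ContextTimeMachine, and the sketch assumes every message carries a token_count, as in the example above.

```python
import json

# Same shape as the generic JSON example above.
SESSION_JSON = """
{
  "turns": [
    {
      "turn": 0,
      "messages": [
        {"role": "system", "content": "You are helpful.", "token_count": 3},
        {"role": "user", "content": "What is 2+2?", "token_count": 4}
      ],
      "model_id": "gpt-4",
      "timestamp": "2026-05-09T10:00:00Z"
    }
  ],
  "model_id": "gpt-4"
}
"""


def turn_token_totals(session: dict) -> dict[int, int]:
    """Map each turn number to the sum of its messages' token_count values.

    Hypothetical helper for illustration; not a ContextTimeMachine function.
    """
    totals = {}
    for turn in session["turns"]:
        totals[turn["turn"]] = sum(m["token_count"] for m in turn["messages"])
    return totals


session = json.loads(SESSION_JSON)
totals = turn_token_totals(session)
print(totals)
```

Checking a file this way before running timemachine load is a cheap way to catch missing keys early.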

CLI Commands
The CLI covers the full workflow from loading sessions to querying them:

# Start the web interface
timemachine serve

# Load a session
timemachine load --file session.json

# Track fact across session
timemachine fact --session <session-id> --fact "the user prefers JSON output"

# Find divergence between two sessions
timemachine diverge --session-a <id-a> --session-b <id-b>

# List all stored sessions
timemachine sessions

# Clear all sessions
timemachine clear

Python API

Every capability the CLI and web interface expose is also available as a Python library. This makes it straightforward to integrate ContextTimeMachine into evaluation pipelines or automated debugging scripts:

from context_time_machine import (
    SessionLoader,
    ContextReconstructor,
    FactTracker,
    DivergenceFinder,
    TokenAnalyzer,
)

# Load session
loader = SessionLoader()
session = loader.load("session.json")

# Reconstruct context at turn 10
reconstructor = ContextReconstructor()
context = reconstructor.reconstruct(session, turn_number=10)
print(f"Context at turn 10: {context.total_tokens} tokens")
print(f"Messages: {len(context.messages)}")
print(f"Utilization: {context.utilization_percent}%")

# Track a fact
tracker = FactTracker()
result = tracker.track(session, "specific decision from turn 5")
print(f"Fact first appeared: Turn {result.first_appeared_turn}")
print(f"Fact last present: Turn {result.last_present_turn}")
print(f"Disappeared at: Turn {result.disappeared_at_turn}")

# Analyze token budget
analyzer = TokenAnalyzer()
profile = analyzer.analyze_session(session)
print(f"Peak tokens: {profile.peak_tokens} at turn {profile.peak_turn}")
print(f"Eviction turns: {profile.eviction_turns}")

# Find divergence between sessions
session_b = loader.load("session_b.json")
finder = DivergenceFinder()
result = finder.find(session, session_b)
print(f"Divergence at turn: {result.divergence_turn}")
print(result.summary)

Supported Session Formats

How It Works

Context Reconstruction

For each turn N, ContextTimeMachine loads all messages from turns 0 to N and counts the total tokens using tiktoken. If the total exceeds the model's context limit, it simulates eviction using a model-specific strategy: GPT and Claude use left-truncation (oldest messages first), DeepSeek uses a sliding window with a recency bias, and Gemma uses local-global attention sampling from the middle. System messages are never evicted regardless of which strategy applies. The result is a reconstructed context with a full token breakdown: what the model would have seen at that turn under the simulated eviction strategy.
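
To make the left-truncation case concrete, here is a minimal sketch of the idea rather than the tool's actual implementation. It assumes token counts are already attached to each message (the real tool counts them with tiktoken), and the Message class and reconstruct_left_truncated function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    token_count: int  # assumed precomputed; the real tool uses tiktoken


def reconstruct_left_truncated(messages: list[Message], limit: int) -> list[Message]:
    """Drop the oldest non-system messages until the total fits within limit."""
    kept = list(messages)
    total = sum(m.token_count for m in kept)
    i = 0
    while total > limit and i < len(kept):
        if kept[i].role == "system":
            i += 1  # system messages are pinned, so skip past them
            continue
        total -= kept[i].token_count
        del kept[i]  # evict the oldest evictable message
    return kept


# All messages from turns 0..N, flattened in order.
history = [
    Message("system", "You are helpful.", 3),
    Message("user", "Decision: use JSON output.", 6),
    Message("assistant", "Understood.", 2),
    Message("user", "Now summarize the report.", 5),
]
window = reconstruct_left_truncated(history, limit=10)
print([m.content for m in window])
```

With a limit of 10 tokens, the early decision is the first message to be evicted while the system prompt survives, which is the failure pattern described at the start of this post.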

Fact Tracking

For each turn, ContextTimeMachine embeds the fact text using all-MiniLM-L6-v2. It then computes cosine similarity between that embedding and every message in the turn's reconstructed context. A fact is considered present if any message has a similarity above 0.75. Embeddings are cached for performance so repeated queries against the same session do not recompute embeddings. The output is a presence chart showing the fact's full lifecycle across the session.
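
The presence check described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool's code: the 2-D vectors stand in for real all-MiniLM-L6-v2 embeddings, and presence_by_turn is a hypothetical helper name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def presence_by_turn(fact_vec, contexts, threshold=0.75):
    """contexts[t] holds the message embeddings in turn t's reconstructed context."""
    return [any(cosine(fact_vec, m) > threshold for m in msgs) for msgs in contexts]


# Toy 2-D vectors standing in for real sentence embeddings.
fact = [1.0, 0.0]
contexts = [
    [[0.0, 1.0]],              # turn 0: fact not yet mentioned
    [[0.0, 1.0], [0.9, 0.1]],  # turn 1: fact enters context
    [[0.9, 0.1], [0.2, 0.8]],  # turn 2: still present
    [[0.2, 0.8]],              # turn 3: evicted
]
presence = presence_by_turn(fact, contexts)
first = presence.index(True)
last = len(presence) - 1 - presence[::-1].index(True)
print(presence, first, last)
```

The list of booleans corresponds to the green and red segments of the presence chart, and first and last mark the turns where the fact entered and was last seen.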

Divergence Detection

For two sessions, ContextTimeMachine aligns turns and analyzes up to the minimum length of the two sessions. At each turn it reconstructs the context for both sessions, embeds all messages, and computes an average maximum cosine similarity between the two context windows. When this similarity drops below 0.85, the turn is flagged as the divergence point. The output includes a message diff at the divergence point and a summary of what changed.
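
A minimal sketch of the comparison, again with toy vectors in place of real embeddings. One assumption to flag: the text does not say whether the average maximum similarity is computed in one direction or symmetrically, so this sketch averages over the messages of the first session only. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def avg_max_similarity(ctx_a, ctx_b):
    """Mean, over messages in ctx_a, of the best cosine match found in ctx_b."""
    if not ctx_a or not ctx_b:
        return 0.0
    return sum(max(cosine(a, b) for b in ctx_b) for a in ctx_a) / len(ctx_a)


def find_divergence(session_a, session_b, threshold=0.85):
    """Return the first aligned turn whose similarity drops below threshold, else None."""
    for turn in range(min(len(session_a), len(session_b))):
        if avg_max_similarity(session_a[turn], session_b[turn]) < threshold:
            return turn
    return None


# Each session is a list of turns; each turn is a list of message embeddings.
m1, m2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
run_ok = [[m1], [m1, m2], [m1, m2, [0.0, 0.0, 1.0]]]
run_bad = [[m1], [m1, m2], [m1, m2, [0.0, 0.9, 0.1]]]
divergence_turn = find_divergence(run_ok, run_bad)
print(divergence_turn)
```

The two runs share identical context for turns 0 and 1; at turn 2 the third message has no close match on the other side, which pulls the average below 0.85.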

API Endpoints

Session Management

POST /api/session/load - load session from file or JSON
GET /api/sessions - list all stored sessions
DELETE /api/session/{id} - delete a session

Analysis

GET /api/session/{id}/profile - get token profile for session
GET /api/session/{id}/turn/{num} - reconstruct context at turn
POST /api/session/{id}/fact - track fact presence
POST /api/divergence - find divergence between sessions

Performance

Context Reconstruction: < 100ms for typical sessions
Fact Tracking: ~1-5 seconds for full session (includes embedding)
Divergence Detection: ~2-10 seconds for 2 sessions
Memory: ~50-200MB per stored session (depending on size)

Dependencies

Core

fastapi - Web framework
uvicorn - ASGI server
pydantic - Data validation
click - CLI framework
tiktoken - Token counting
sentence-transformers - Local embeddings
numpy - Numerical operations
sqlalchemy - Database ORM
aiofiles - Async file operations

Frontend
React, Tailwind CSS, Framer Motion, Recharts

Known Limitations

Frontend is a React stub - core analysis is fully functional
LangSmith format not yet implemented
No streaming support for very large sessions (>10k turns)
Embedding cache cleared on restart

Future Enhancements

Complete React frontend with real-time updates
WebSocket streaming for large sessions
LangSmith format support
Multi-session comparison UI
Export to markdown/HTML
Attention visualization
Custom eviction strategy support

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks including AI model evals, prompt optimization and end to end AI pipeline development.

The requirement was a forensic debugging tool for long-running agent sessions, one that could reconstruct the exact context window at any historical turn, track when specific facts entered and left context using semantic embeddings, and find the earliest point where two divergent sessions started seeing different content. The tool needed to support multiple session formats, expose a Python API alongside the web interface, and work entirely offline with local embeddings.

NEO handled all 12 specification steps autonomously, building the SessionLoader with support for LiveContext SQLite, generic JSON, and raw conversation formats, the ContextReconstructor with model-specific eviction strategies for GPT, Claude, DeepSeek, and Gemma, the FactTracker with all-MiniLM-L6-v2 embeddings and cosine similarity scoring, the DivergenceFinder with turn-aligned context comparison, the TokenAnalyzer for peak token and eviction turn detection, the FastAPI backend with all six API endpoints, the SQLite storage layer via SQLAlchemy, the Click CLI with all six commands, and the full 58-test suite covering all core modules.

How You Can Use and Extend This With NEO

Use it to find the root cause of long-session failures.
When an agent gives a wrong answer deep into a long session, load the session into ContextTimeMachine, travel to the failure turn in the Timeline Navigator, and see exactly what was in context at that point. The reconstructed view shows every message the model saw, in order, with token counts, so you can see immediately whether the relevant context was present or had been evicted.

Use Fact Tracker to measure context retention across your agent design.
Before settling on a context management strategy for your agent, run Fact Tracker against a set of real sessions. The presence chart for key decisions and instructions tells you at what turn they reliably drop out of context, giving you a data-driven basis for choosing context window sizes, eviction strategies, or compression approaches.

Use Divergence Finder to debug non-deterministic agent behaviour.
When two runs of the same agent with the same input produce different outcomes, load both into Divergence Finder. The tool identifies the exact turn where their context windows started differing and shows a diff of what changed, turning a difficult debugging problem into a specific, actionable finding.

Extend it with additional session format parsers.
SessionLoader already handles three formats following a common interface. Adding a new format (LangSmith is listed as planned) means implementing the same loader interface for the new format. It is then immediately available in the CLI, the Python API, and the web interface without touching any of the analysis modules.

Final Notes

ContextTimeMachine makes the context window visible. Instead of inferring what the model saw from its outputs, you can reconstruct and inspect the exact context at any turn, track when specific information entered and left the window, and find where two sessions diverged. For teams debugging long-running agents, that visibility is the difference between guessing and knowing.

The code is at https://github.com/dakshjain-1616/ContextTimeMachine
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

Agent Constitution: Policy Enforcement and PII Protection for AI Agents

Nilofer 🚀 — Sat, 16 May 2026 05:50:37 +0000

AI agents are getting more capable. They can browse the web, call APIs, read and write files, and execute code. That capability is exactly what makes them useful and exactly what makes them dangerous without guardrails.

Most agent safety approaches rely on prompt instructions. Tell the model not to delete files. Tell it not to send requests to untrusted URLs. Tell it not to leak PII. But instructions in a prompt are not enforceable — a sufficiently complex agent workflow, a jailbreak attempt, or just an edge case in reasoning can bypass them silently.

Agent Constitution is a policy enforcement framework for AI agents that enforces behavioral rules at the code level, not the prompt level. You define rules in a YAML constitution file, wrap your agent's tool calls with the enforcer, and get PII detection, audit logging, and a real-time dashboard, all without modifying your agent's core logic.

Features

Policy-Based Enforcement - Define rules using YAML constitution files
AST-Based Expression Evaluation - Safe condition evaluation without code injection risks
PII Detection - Regex and Ollama-powered detection of sensitive information
Audit Logging - JSONL-based audit trail with rotation support
Real-Time Dashboard - FastAPI + WebSocket + React dashboard for monitoring
CLI Interface - Rich command-line interface for management

How It Works

The core concept is a constitution - a YAML file that defines policies, and within each policy, rules. Each rule has a condition written as a plain expression, an action (block or notify), and a severity level. The enforcer evaluates these conditions against every tool call before it executes.

The condition evaluation uses AST-based expression parsing, not eval(), so there is no code injection risk. An expression like tool_name in ['rm', 'unlink', 'rmdir'] is parsed as an abstract syntax tree and evaluated safely against the tool call context.
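
To show what whitelist-based AST evaluation looks like in general, here is a small sketch built on Python's ast module. It is not the project's evaluator: safe_eval is a hypothetical name, and it supports only names, constants, lists, and/or/not, and single comparisons. Anything else, including method calls such as context.get(...), is rejected.

```python
import ast
import operator

# Whitelisted comparison operators; anything else is rejected.
_COMPARE = {
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
    ast.Gt: operator.gt,
    ast.Lt: operator.lt,
    ast.In: lambda a, b: a in b,
    ast.NotIn: lambda a, b: a not in b,
}


def safe_eval(expr: str, names: dict):
    """Evaluate a small boolean expression by walking its AST, never calling eval()."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return names[node.id]  # KeyError for unknown names
        if isinstance(node, (ast.List, ast.Tuple)):
            return [visit(e) for e in node.elts]
        if isinstance(node, ast.BoolOp):
            values = [visit(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not visit(node.operand)
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            op = _COMPARE.get(type(node.ops[0]))
            if op is not None:
                return op(visit(node.left), visit(node.comparators[0]))
        # Calls, attribute access, lambdas, etc. all end up here.
        raise ValueError(f"unsupported expression: {type(node).__name__}")

    return visit(ast.parse(expr, mode="eval"))


condition = "tool_name in ['rm', 'unlink', 'rmdir']"
print(safe_eval(condition, {"tool_name": "rm"}))
print(safe_eval(condition, {"tool_name": "ls"}))
```

The safety comes from the default branch: node types that are not explicitly handled raise an error instead of executing, so an injected call never runs.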

PII detection runs as a separate layer. It can use regex patterns for common formats like email addresses, phone numbers, and SSNs, or it can use Ollama with a local model for more nuanced detection. When PII is detected in a tool's output, it can be blocked or redacted before it reaches the agent.

Every enforcement decision, allowed or blocked, is written to a JSONL audit log with a timestamp, tool name, action taken, and the specific rule that triggered. The real-time dashboard reads from this audit log via WebSocket and shows violations, enforcement statistics, and the full constitution in one view.

Installation

# Clone the repository
git clone https://github.com/dakshjain-1616/Agent-Constitution.git
cd Agent-Constitution

# Install dependencies
pip install -e .

Quick Start

1. Create a Constitution
Start with a sample constitution to see the format, or create an empty one:

# Create a sample constitution
agent-constitution init --sample -o my_constitution.yaml

# Or create an empty one
agent-constitution init -o my_constitution.yaml

2. Validate Your Constitution
Before using it, validate that the YAML is well-formed and the expressions are safe:

agent-constitution validate my_constitution.yaml

3. Test Policy Enforcement
Check whether a specific tool call would be allowed or blocked before running it:

agent-constitution check rm --arg path=/tmp/test --constitution my_constitution.yaml

4. Start the Dashboard

agent-constitution dashboard --constitution my_constitution.yaml

Then open http://localhost:8000 in your browser.

Usage

Using the @enforce Decorator

The simplest integration is wrapping tool functions with the @enforce decorator. The enforcer checks the function against the constitution before it executes. If a rule blocks the call, a PolicyViolationError is raised before the function body runs.

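import os

# Assumption: PolicyViolationError is exported at the package top level
# alongside Constitution and Enforcer; adjust the import if it lives elsewhere.
from agent_constitution import Constitution, Enforcer, PolicyViolationError

# Load constitution
constitution = Constitution.from_yaml("my_constitution.yaml")
enforcer = Enforcer(constitution=constitution)

@enforcer.enforce
def delete_file(path: str):
    """Delete a file."""
    os.remove(path)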

# This will be blocked if rm/delete operations are restricted
try:
    delete_file("/tmp/test.txt")
except PolicyViolationError as e:
    print(f"Blocked: {e}")
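
For readers curious how a check-before-run decorator can be put together, the following is a simplified sketch and not the library's enforcer. It blocks by function name, whereas the real enforcer evaluates constitution rules; PolicyViolation and make_enforcer are hypothetical stand-ins.

```python
import functools


class PolicyViolation(Exception):
    """Hypothetical stand-in for the library's PolicyViolationError."""


def make_enforcer(blocked_tools):
    """Return a decorator that refuses to run functions named in blocked_tools."""

    def enforce(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The check runs first, so a blocked call never reaches func's body.
            if func.__name__ in blocked_tools:
                raise PolicyViolation(f"{func.__name__} is blocked by policy")
            return func(*args, **kwargs)

        return wrapper

    return enforce


enforce = make_enforcer(blocked_tools={"delete_file"})
calls = []  # records which function bodies actually ran


@enforce
def delete_file(path):
    calls.append(path)  # stands in for the real side effect


@enforce
def read_file(path):
    calls.append(path)
    return f"contents of {path}"


try:
    delete_file("/tmp/test.txt")
except PolicyViolation as e:
    print(f"Blocked: {e}")
print(read_file("/tmp/notes.txt"))
```

Because the wrapper raises before calling the wrapped function, the blocked path never appears in calls.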

Manual Policy Checking

For cases where you need to check a tool call without decorating a function, for example when the tool call is constructed dynamically, the enforcer exposes a check method directly:

from agent_constitution import Constitution, Enforcer

constitution = Constitution.from_yaml("my_constitution.yaml")
enforcer = Enforcer(constitution=constitution)

# Check a tool call
result = enforcer.check(
    tool_name="curl",
    tool_args={"url": "https://example.com"},
    extra_context={"approved": False}
)

if result.blocked:
    print(f"Blocked by rule: {result.violations[0].rule_name}")
else:
    print("Allowed")

PII Detection

The PII detector can be used standalone - detect PII in any text, or redact it before it leaves the agent:

from agent_constitution.rules.pii_detector import PIIDetector

detector = PIIDetector()

# Detect PII in text
text = "Contact me at john@example.com or call 555-123-4567"
matches = detector.detect(text)

for match in matches:
    print(f"Found {match.pattern_name}: {match.matched_text}")

# Redact PII
redacted = detector.redact(text)
print(redacted)  # "Contact me at [REDACTED] or call [REDACTED]"
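
For intuition about the regex layer, here is a small standalone sketch using only the re module. The patterns are deliberately simple illustrations and are an assumption, not the patterns PIIDetector ships with; real-world PII formats vary far more than this.

```python
import re

# Illustrative patterns only; not the library's actual pattern set.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect(text):
    """Return (pattern_name, matched_text) pairs in order of appearance."""
    found = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), name, m.group()))
    return [(name, matched) for _, name, matched in sorted(found)]


def redact(text, placeholder="[REDACTED]"):
    """Replace every match of every pattern with the placeholder."""
    for pattern in PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact me at john@example.com or call 555-123-4567"
matches = detect(sample)
redacted = redact(sample)
print(matches)
print(redacted)
```

Regexes like these catch well-formed values only, which is why the framework also offers the Ollama-backed detector for less regular cases.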

Audit Logging

The audit logger writes every enforcement decision to a JSONL file and supports log rotation. Logs can be read back programmatically:

from agent_constitution.audit import AuditLogger

logger = AuditLogger(log_path="./audit.jsonl")

# Log an event
logger.log(
    event_type="tool_call",
    tool_name="rm",
    action="block",
    allowed=False,
    violations=[{"rule_name": "block_file_deletion"}]
)

# Read logs
for entry in logger.read_logs(limit=10):
    print(f"{entry.timestamp}: {entry.event_type} - {entry.tool_name}")
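
The JSONL format itself is simple enough to sketch with the standard library: one JSON object per line, appended on write and parsed line by line on read. The function names below are hypothetical and the sketch leaves out rotation entirely.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_event(path, **fields):
    """Append one JSON object per line, stamped with a UTC timestamp."""
    entry = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def read_events(path, limit=None):
    """Read entries back in write order; limit keeps only the most recent ones."""
    with open(path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return entries[-limit:] if limit else entries


# Write to a temporary directory so the sketch leaves no files behind in cwd.
log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_event(log_path, tool_name="rm", action="block", allowed=False,
             rule_name="block_file_deletion")
append_event(log_path, tool_name="ls", action="allow", allowed=True, rule_name=None)

events = read_events(log_path, limit=10)
for event in events:
    print(event["timestamp"], event["tool_name"], event["action"])
```

Because each line is an independent JSON document, the log can be tailed, grepped, or loaded into other tools without a custom parser.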

Constitution Format

The constitution is a YAML file with versioning, named policies, and rules within each policy. Each rule has a name, a condition expression, an action, and a severity level:

version: "1.0"
name: "My Agent Constitution"
description: "Security policies for my AI agent"

policies:
  - name: tool_restrictions
    description: "Restrict access to dangerous tools"
    priority: 10
    rules:
      - name: block_file_deletion
        description: "Prevent file system deletion operations"
        condition: "tool_name in ['rm', 'unlink', 'rmdir']"
        action: block
        severity: critical

      - name: restrict_network_access
        description: "Limit unrestricted network access"
        condition: "tool_name == 'curl' and not context.get('approved', False)"
        action: notify
        severity: high

  - name: data_protection
    description: "Protect sensitive data"
    priority: 5
    rules:
      - name: pii_detection
        description: "Detect and protect PII in outputs"
        condition: "pii_detected == True"
        action: block
        severity: high

pii_config:
  enabled: true
  patterns: ["email", "ssn", "phone"]
  use_ollama: true
  ollama_model: "gemma3:4b"
  ollama_url: "http://localhost:11434"

audit_config:
  enabled: true
  log_path: "./audit_logs.jsonl"
  max_file_size_mb: 100
  retention_days: 30

The priority field controls which policies are evaluated first: higher priority runs first. The action field is either block, which raises a PolicyViolationError, or notify, which logs the event but allows the call through.
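
The ordering and action semantics can be sketched as follows. This is an illustration under assumptions, not the enforcer's code: conditions are plain Python callables instead of expression strings, every policy is evaluated even after a block is found, and the Rule, Policy, and check names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # callable stand-in for an expression string
    action: str  # "block" or "notify"


@dataclass
class Policy:
    name: str
    priority: int
    rules: list[Rule] = field(default_factory=list)


def check(policies, call):
    """Return (blocked, triggered_rule_names), visiting higher priority first."""
    triggered, blocked = [], False
    for policy in sorted(policies, key=lambda p: p.priority, reverse=True):
        for rule in policy.rules:
            if rule.condition(call):
                triggered.append(rule.name)
                if rule.action == "block":
                    blocked = True  # notify rules are recorded but do not block
    return blocked, triggered


policies = [
    Policy("data_protection", 5, [
        Rule("pii_detection", lambda c: c.get("pii_detected", False), "block"),
    ]),
    Policy("tool_restrictions", 10, [
        Rule("block_file_deletion",
             lambda c: c["tool_name"] in ("rm", "unlink", "rmdir"), "block"),
        Rule("restrict_network_access",
             lambda c: c["tool_name"] == "curl" and not c.get("approved", False),
             "notify"),
    ]),
]

curl_result = check(policies, {"tool_name": "curl"})
rm_result = check(policies, {"tool_name": "rm", "pii_detected": True})
print(curl_result)
print(rm_result)
```

The unapproved curl call triggers a notify rule and is still allowed, while the rm call is blocked, with the priority-10 policy's rule listed ahead of the priority-5 one.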

CLI Commands

The CLI covers the full lifecycle from creating and validating a constitution to inspecting audit logs and testing expressions:

# Initialize a constitution
agent-constitution init --sample

# Validate a constitution
agent-constitution validate my_constitution.yaml

# Display constitution contents
agent-constitution show my_constitution.yaml

# Check if a tool call would be allowed
agent-constitution check rm --arg path=/tmp/test --constitution my_constitution.yaml

# Start the dashboard
agent-constitution dashboard --constitution my_constitution.yaml

# View audit logs
agent-constitution audit --log-path ./audit.jsonl

# Show statistics
agent-constitution stats --constitution my_constitution.yaml

# Test expression evaluation
agent-constitution eval-expr "x > 5" --context x=10

Dashboard

The dashboard provides real-time monitoring via FastAPI, WebSocket, and a React frontend. It shows:

Policy violations
Audit logs
Constitution rules and policies
Enforcement statistics

Open http://localhost:8000 after starting with agent-constitution dashboard.

Architecture

agent_constitution/
├── constitution.py      # Pydantic models and YAML handling
├── enforcer.py          # Policy enforcement and @enforce decorator
├── audit.py            # JSONL audit logging
├── cli.py              # Click CLI interface
├── rules/
│   ├── evaluator.py    # AST-based expression evaluation
│   └── pii_detector.py # PII detection with regex/Ollama
└── dashboard/
    ├── server.py       # FastAPI + WebSocket server
    └── frontend/       # React + Tailwind dashboard

Each module has a single responsibility - constitution.py handles Pydantic models and YAML parsing, enforcer.py owns the @enforce decorator and manual check logic, audit.py handles JSONL writing and rotation, and the rules/ directory separates expression evaluation from PII detection.

Testing

The project has comprehensive test coverage with 84 unit tests, all passing:

# Run all tests
pytest tests/ -v

# All tests passing: 84/84 ✓

Test coverage includes constitution loading and YAML parsing, policy enforcement with the @enforce decorator, manual policy checking, PII detection for regex and patterns, audit logging with rotation, expression evaluation and security validation, and rule violation tracking and statistics.

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run specific test file
pytest tests/test_evaluator.py -v

# Run linting
flake8 agent_constitution

# Run type checking
mypy agent_constitution

How I Built This Using NEO

The requirement was a policy enforcement framework for AI agents, one that enforces behavioral rules at the code level rather than relying on prompt instructions, with PII detection, a JSONL audit trail, and a real-time monitoring dashboard. NEO implemented the full system across 10 implementation steps, resulting in a production-ready framework with 84 tests passing.

NEO built the Pydantic constitution models and YAML handling in constitution.py, the policy enforcer with the @enforce decorator in enforcer.py, the AST-based expression evaluator in rules/evaluator.py, the regex and Ollama-powered PII detector in rules/pii_detector.py, the JSONL audit logger with rotation in audit.py, the Click CLI with all eight commands in cli.py, and the FastAPI and WebSocket dashboard server with the React and Tailwind frontend.

How You Can Use and Extend This With NEO

Use it to enforce safety rules across any agent's tool calls.
Wrap any tool function with @enforcer.enforce and define the rules in a YAML constitution. The enforcement happens at the code level, not in the prompt, so it cannot be bypassed by the agent's reasoning or by jailbreak attempts.

Use the audit log to build an observability layer for your agents.
Every enforcement decision lands in a JSONL file with a timestamp, tool name, action, and triggering rule. This gives you a structured, queryable record of everything your agent tried to do, allowed or blocked, which is useful for debugging unexpected agent behaviour and for compliance requirements.

Use PII detection as a standalone layer before agent outputs reach users.
The PIIDetector works independently of the enforcer. You can run it on any text (agent responses, tool outputs, retrieved documents) before it is displayed or stored, and redact sensitive information automatically.

Extend it with custom PII patterns.
The pii_config section of the constitution accepts a patterns list. New regex patterns for domain-specific sensitive data can be added to the constitution file without touching any code.

Extend it with additional rule conditions.
The AST-based evaluator supports arithmetic, comparisons, and context dictionary access. New conditions that reference additional context fields work immediately once those fields are passed as extra_context in the enforcer's check call.

Final Notes

Agent Constitution shifts AI agent safety from instructions to enforcement. Rules defined in a YAML file are evaluated at the code level on every tool call before the tool executes, so the safety layer is not part of the agent's reasoning but a hard boundary around it.

The code is at https://github.com/dakshjain-1616/Agent-Constitution
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

ASR Evaluation Framework: Benchmarking Speech Recognition Models Across Accuracy, Speed, and Robustness

Nilofer 🚀 — Fri, 15 May 2026 19:53:04 +0000

Picking an ASR model for production is not straightforward. Whisper might be the most accurate for general English but too slow for real-time use. Wav2Vec2 might be fast enough for edge devices but struggle with accented speech. Distil-Whisper might hit the sweet spot for your use case, or it might not. Without a systematic benchmark across your actual conditions, you are guessing.

ASR Evaluation Framework is an enterprise-grade benchmarking tool that answers the questions that matter before you commit to a model:

Which ASR model is most accurate for my use case?
How fast can each model process audio in real-time?
How robust is each model against background noise, accents, and degraded audio?
What are the tradeoffs between speed and accuracy?

Features

5 ASR Models: IBM Granite, OpenAI Whisper, NVIDIA Canary, Distil-Whisper, Wav2Vec2
Comprehensive Metrics: WER, CER, Accuracy, RTF, and Inference Time
15+ Test Scenarios: Clean speech, background noise, accents, fast/slow speech, technical terms, and more
Flexible Evaluation Modes: Speed, accuracy, or complete evaluation
JSON Output Schema: Standardized metrics schema for result storage

Architecture

┌─────────────────────────────────────────────────────┐
│         run_evaluation.py (CLI Entry)               │
├────────────┬──────────────┬──────────────┬──────────┤
│ --accuracy │ --speed      │ --all        │ Config   │
│ Evaluate   │ Evaluate RTF │ Complete     │ Loading  │
│ WER/CER    │ & Inference  │ Evaluation   │          │
└────────────┴──────────────┴──────────────┴──────────┘
              │
      ┌───────▼────────┐
      │   Evaluator    │
      │  - Load models │
      │  - Test audio  │
      │  - Calc metrics│
      └───────┬────────┘
              │
     ┌────────┼────────┐
     │        │        │
┌────▼──┐┌────▼──┐┌────▼──┐
│Granite││Whisper││ Wav2V │  ... 5 models
│ Model  ││ Model ││ Model │
└────┬──┘└────┬──┘└────┬──┘
     └────────┼────────┘
              │
      ┌───────▼───────────┐
      │  Metrics Engine   │
      │ - WER/CER calc    │
      │ - RTF calc        │
      │ - Accuracy calc   │
      │ - Aggregation     │
      └───────┬───────────┘
              │
      ┌───────▼──────────┐
      │ JSON Results     │
      │ with schema      │
      └──────────────────┘

Model Comparison Overview

Evaluation Dimensions

Accuracy Metrics

WER: Word Error Rate. Percentage of words transcribed incorrectly compared to the reference.
CER: Character Error Rate. Character-level error rate for more detailed analysis.
Accuracy: 100% minus WER, normalized to a percentage.
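
As a worked example of these three definitions, the sketch below computes WER and CER with a plain edit-distance routine. The framework itself lists jiwer for this calculation; the stdlib version here is only meant to show the arithmetic, and it assumes a non-empty reference and no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the quick brown fox jumps"
hypothesis = "the quick brown box jumps"
word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
accuracy = 100 * (1 - word_error)
print(f"WER: {word_error:.2f}")
print(f"CER: {char_error:.2f}")
print(f"Accuracy: {accuracy:.1f}%")
```

One wrong word out of five gives a WER of 0.20 and an accuracy of 80%, while the same single-character slip is only 1 of 25 characters at the CER level.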

Speed Metrics

RTF: Real-Time Factor. Inference time divided by audio duration. Below 1.0 means the model is real-time capable. Above 1.0 means transcription takes longer than the audio itself lasts.
Inference Time: Absolute seconds to transcribe the audio.
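
A minimal sketch of how RTF can be measured around any transcription callable. The fake_transcribe function is a placeholder that sleeps instead of running a model, and real_time_factor is a hypothetical helper, not part of the framework.

```python
import time


def real_time_factor(transcribe, audio, audio_duration_s):
    """Time one transcription call and divide by the clip's duration."""
    start = time.perf_counter()
    text = transcribe(audio)
    inference_s = time.perf_counter() - start
    return text, inference_s, inference_s / audio_duration_s


def fake_transcribe(audio):
    # Placeholder for a real model call; sleeps to simulate work.
    time.sleep(0.05)
    return "hello world"


text, inference_s, rtf = real_time_factor(
    fake_transcribe, audio=b"", audio_duration_s=2.0
)
print(f"inference={inference_s:.3f}s rtf={rtf:.3f} real_time={rtf < 1.0}")
```

In practice a single timing is noisy, so it is worth averaging several runs per clip and excluding model load time from the measurement.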

Robustness Testing
15 test scenarios covering:

Clean speech - baseline accuracy testing
Background noise - office and street environments
Accented English
Fast and slow speech rates
Technical vocabulary
Whispered speech
Phone quality audio
Numbers and acronyms
And more scenarios

Installation

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Requires Python 3.10+. Core dependencies:

librosa - Audio processing
numpy, scipy - Numerical computing
transformers - HuggingFace model loading
jiwer - WER and CER calculation
soundfile - Audio file I/O
pytest - Testing framework

Usage

Run Complete Evaluation

Runs accuracy and speed evaluation across all five models against all 15 test scenarios:

python run_evaluation.py --all

Run Accuracy Evaluation Only

python run_evaluation.py --accuracy

Run Speed Evaluation Only

python run_evaluation.py --speed

Specify Custom Paths

python run_evaluation.py --all \
  --data-path ./my_data \
  --output-path ./my_results

Results and Output

Console Output

Here is what a complete evaluation run looks like in the terminal:

============================================================
ASR EVALUATION FRAMEWORK v1.0.0
============================================================

=== RUNNING COMPLETE EVALUATION (ACCURACY + SPEED) ===

Evaluating Whisper...
Evaluating Wav2Vec2...
Evaluating Distil-Whisper...
Evaluating Canary...
Evaluating Granite...

✓ Results saved to: results/asr_eval_results_all_20260513_123045.json

============================================================
EVALUATION SUMMARY
============================================================

Model: Whisper
  Status: ✓ OK
  Mean Accuracy: 95.23%
  Mean WER: 0.0477

Model: Wav2Vec2
  Status: ✓ OK
  Mean Accuracy: 91.45%
  Mean WER: 0.0855

Model: Distil-Whisper
  Status: ✓ OK
  Mean Accuracy: 93.78%
  Mean WER: 0.0622

JSON Output Format

Results are saved as structured JSON to results/asr_eval_results_{type}_{timestamp}.json. The schema includes evaluation metadata, per-model aggregate metrics, and per-scenario test results:

{
  "evaluation_metadata": {
    "timestamp": "2026-05-13T12:30:45.123Z",
    "evaluator_version": "1.0.0",
    "models_tested": ["Whisper", "Wav2Vec2", "Distil-Whisper"],
    "test_scenarios": 15,
    "evaluation_type": "all"
  },
  "model_results": {
    "Whisper": {
      "model_name": "Whisper",
      "model_id": "openai/whisper-base",
      "initialized": true,
      "aggregate_metrics": {
        "mean_accuracy": 95.23,
        "mean_wer": 0.0477,
        "mean_cer": 0.0234,
        "mean_rtf": 1.15,
        "std_wer": 0.0145
      },
      "test_results": [
        {
          "test_id": 1,
          "test_name": "clean_english",
          "wer": 0.032,
          "cer": 0.015,
          "accuracy": 96.8,
          "inference_time": 2.34,
          "rtf": 1.17
        }
      ]
    }
  },
  "summary": {
    "total_tests": 15,
    "evaluation_type": "all",
    "status": "completed"
  }
}

The per-scenario test_results array shows exactly how each model performed on each specific condition, not just aggregated averages, which is what makes this useful for production decisions.
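
As a sketch of how this schema can be consumed, the function below picks the model with the lowest mean WER among those that stay under an RTF budget. It relies only on the keys shown in the sample above; the function name and the selection rule are assumptions, not part of the framework.

```python
import json


def pick_model(results, max_rtf):
    """Return the initialized model with the lowest mean_wer whose
    mean_rtf is at or below max_rtf, or None if nothing qualifies."""
    candidates = []
    for name, model in results["model_results"].items():
        if not model.get("initialized", False):
            continue  # skip models that failed to load
        metrics = model["aggregate_metrics"]
        if metrics["mean_rtf"] <= max_rtf:
            candidates.append((metrics["mean_wer"], name))
    return min(candidates)[1] if candidates else None


def load_results(path):
    """Read a results file written by an evaluation run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling pick_model(load_results(path), max_rtf=1.0) on a results file would return the most accurate model that is also real-time capable.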

Configuration

Environment variables, documented in .env.example:
HUGGINGFACE_TOKEN : HuggingFace API token for model loading
OPENAI_API_KEY : OpenAI API key
ASR_EVAL_DATA_PATH : data directory path
ASR_EVAL_RESULTS_PATH : results output path
VERBOSE : enable verbose logging

Test Matrix

15 test scenarios covering four categories:

Clean Speech - Baseline accuracy testing
Robustness - Background noise, accents, variable speech rates
Challenging Conditions - Whispered speech, music, phone quality audio
Domain-Specific - Technical vocabulary, numbers, acronyms

Metrics

Accuracy Metrics

WER (Word Error Rate) - Percentage of words that differ from reference
CER (Character Error Rate) - Percentage of characters that differ
Accuracy - 100% minus WER, normalized to a percentage

Speed Metrics

RTF (Real-Time Factor) - Inference time divided by audio duration. Below 1.0 is real-time capable.
Inference Time - Total time to transcribe audio in seconds

Model Details

When to Use This Framework

Benchmarking ASR models before production deployment : run a full evaluation before committing to a model, not after.

Comparing model tradeoffs : speed versus accuracy decisions are data-driven rather than based on published benchmarks that may not reflect your audio conditions.

Testing robustness against real-world audio : the 15 test scenarios cover conditions that synthetic benchmarks miss: phone quality audio, background noise, accents, and technical vocabulary.

Evaluating cost-performance of different models : RTF and inference time metrics let you calculate the compute cost of each model at your actual workload.

Quality assurance in voice-enabled applications : run evaluations to catch model regressions before they reach production.

Research and academic speech recognition studies : the standardized JSON output schema makes results comparable and reproducible across experiments.

Real-World Scenarios

Scenario 1 - Call Center AI

Evaluate which model handles phone quality audio best
Test robustness against background noise
Measure inference speed for cost calculation
Result: Select fastest model that maintains accuracy

Scenario 2 - Voice Assistant

Test against various accents and speech rates
Evaluate technical command recognition
Measure real-time performance on edge devices
Result: Pick model that runs on-device with good accuracy

Scenario 3 - Transcription Service

Benchmark accuracy across multiple languages
Evaluate cost versus accuracy tradeoffs
Test on domain-specific vocabulary
Result: Choose optimal model for service tier

Project Structure

.
├── src/                          # Core modules
│   ├── config.py                # Configuration
│   ├── metrics.py               # Metric calculations
│   ├── data_loader.py           # Data loading utilities
│   ├── base_model.py            # ASR model base class
│   └── evaluator.py             # Main evaluator class
├── models/                       # ASR model implementations
│   ├── wav2vec2.py
│   ├── whisper.py
│   ├── distil_whisper.py
│   ├── canary.py
│   └── granite.py
├── tests/                        # Test suite (36 tests)
├── data/                         # Audio files for evaluation
├── results/                      # Output evaluation results
├── notebooks/                    # Jupyter notebooks
├── run_evaluation.py             # CLI entry point
├── asr_eval_test_matrix.csv      # Test scenarios matrix
├── asr_eval_metrics_schema.json  # Output schema
└── requirements.txt              # Python dependencies

Testing

pytest tests/ -v

36 tests covering all core modules.

How I Built This Using NEO

The requirement was a systematic benchmarking framework for ASR models, one that could evaluate accuracy, speed, and robustness across real-world audio conditions, support multiple models through a common interface, and produce structured output for production decisions. The framework needed to cover five distinct model architectures, a 15-scenario test matrix, and three evaluation modes selectable from the CLI.

NEO built the full implementation: the base model class in base_model.py that all five model implementations extend, the five model wrappers for Whisper, Wav2Vec2, Distil-Whisper, Canary, and Granite, the metrics engine in metrics.py computing WER, CER, accuracy, RTF, and inference time, the main evaluator class in evaluator.py, the CLI entry point in run_evaluation.py with all three evaluation modes, the data loader in data_loader.py, the JSON output schema in asr_eval_metrics_schema.json, the test scenario matrix in asr_eval_test_matrix.csv, and the 36-test test suite.

How You Can Use and Extend This With NEO

Use it before committing to an ASR model in production.
Run the full evaluation against your own audio samples using --data-path. The per-scenario breakdown shows exactly how each model performs on the conditions your application will actually encounter, not on generic benchmarks that may not reflect your use case.

Use the JSON output to build model selection pipelines.
The structured output at results/asr_eval_results_{type}_{timestamp}.json contains all the metrics needed to make a data-driven model selection decision programmatically. A script that reads the output and selects the model with the best WER for a given RTF threshold builds directly on top of the existing schema.

Use it to evaluate cost-performance before scaling.
RTF and inference time metrics per model let you calculate the compute cost of each option at your actual call volume. The per-scenario breakdown shows where each model spends the most compute, useful for optimising before scaling a voice-enabled product.

Extend it with additional ASR models.
All five models extend base_model.py following the same interface. Adding a new ASR model available through HuggingFace Transformers means adding a new file in models/ that implements the same base class. It is then available in all three evaluation modes without touching the evaluator, metrics engine, or CLI.
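
The shape of such an extension might look like the sketch below. The real interface in base_model.py is not shown in this post, so the class name, the load and transcribe methods, and the EchoModel wrapper are all hypothetical stand-ins; only model_name, model_id, and initialized mirror fields from the JSON output.

```python
from abc import ABC, abstractmethod


class BaseASRModel(ABC):
    """Hypothetical stand-in for the base class in src/base_model.py."""

    def __init__(self, model_name, model_id):
        self.model_name = model_name
        self.model_id = model_id
        self.initialized = False

    @abstractmethod
    def load(self):
        """Load weights and set self.initialized."""

    @abstractmethod
    def transcribe(self, audio_path):
        """Return the transcript for one audio file."""


class EchoModel(BaseASRModel):
    """Toy wrapper showing what a new file in models/ would provide."""

    def __init__(self):
        super().__init__(model_name="Echo", model_id="example/echo")

    def load(self):
        # A real wrapper would load a HuggingFace model here.
        self.initialized = True

    def transcribe(self, audio_path):
        if not self.initialized:
            raise RuntimeError("call load() before transcribe()")
        return f"transcript for {audio_path}"
```

Because the evaluator would only depend on the abstract methods, a wrapper like this needs no changes elsewhere.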

Final Notes

Choosing an ASR model without systematic evaluation is a production risk. ASR Evaluation Framework reduces that risk by giving you per-model, per-scenario metrics across accuracy, speed, and robustness before you deploy, with structured JSON output that makes the decision data-driven rather than intuitive.

The code is at https://github.com/dakshjain-1616/Asr-Evaluation
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

SPEC-TO-SHIP: A Multi-Agent Pipeline That Turns Feature Ideas Into Production Code

Nilofer 🚀 — Thu, 14 May 2026 11:10:51 +0000

Writing a feature spec and getting it to production involves a lot of steps: architecture decisions, task planning, implementation, testing, and code review. In a real engineering team, these are handled by different people with different specializations. Most AI coding tools collapse all of that into a single step and ask one model to do everything.

SPEC TO SHIP takes a different approach. It orchestrates five specialized AI agents (Architect, Planner, Engineer, QA, and Reviewer) within a single Node.js process to simulate a complete startup engineering team workflow. Raw feature ideas go in. Committed, tested, reviewed code comes out.

The Five Agents

The pipeline follows a sequential flow where each agent's output informs the next, with a tight loop between Engineering and QA. Each agent has a defined role, a specific output format, and a clear handoff point - so no single agent is asked to do more than it is designed for.

ArchitectAgent - Senior Software Architect: The first agent in the pipeline. Takes the raw feature idea and generates a comprehensive technical specification covering Overview, Goals, API Contracts, Data Models, and Security sections. Output is a Markdown spec file that every downstream agent works from. Model: google/gemini-2.0-flash-001 via OpenRouter.

PlannerAgent - Staff Engineering Manager: Receives the spec from the Architect and breaks it into actionable, dependency-aware development tasks. Output is a JSON array of tasks with topological ordering and acceptance criteria - so the Engineer knows exactly what to build and in what order.

EngineerAgent - Principal Software Engineer: Takes each task from the Planner and implements production-grade TypeScript code for it. Output is source files with proper typing, error handling, and JSDoc. This is where the actual code gets written.

QAAgent - Senior QA Engineer: Receives the Engineer's output and writes exhaustive Vitest test suites for each task. Output is test files covering acceptance criteria and edge cases. The tight loop between Engineer and QA means the implementation is always tested before the Reviewer sees it.

ReviewerAgent - Principal Engineer Reviewer: The final stage. Audits everything the previous agents produced for security, performance, and correctness. Output is a score from 0 to 100 and an approval status that tells you whether the output is ready to ship.
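
The Planner's dependency-aware ordering can be illustrated with a short sketch. The project itself is TypeScript; Python is used here purely for illustration, and the task fields id and depends_on are hypothetical, since the post does not show the actual tasks.json schema.

```python
from graphlib import TopologicalSorter


def order_tasks(tasks):
    """Return tasks sorted so every task comes after its dependencies.

    Raises graphlib.CycleError on circular dependencies and ValueError
    when a task depends on an unknown id.
    """
    by_id = {task["id"]: task for task in tasks}
    graph = {}
    for task in tasks:
        deps = set(task.get("depends_on", []))
        unknown = deps - by_id.keys()
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        graph[task["id"]] = deps  # node -> set of predecessors
    return [by_id[task_id] for task_id in TopologicalSorter(graph).static_order()]
```

With an ordering like this, an Engineer stage can simply walk the list from top to bottom.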

Quality and Resilience

The pipeline is built for production reliability, not just happy-path execution. Several resilience patterns are built in at the infrastructure level so agent failures do not cascade into full pipeline failures:

Strict TypeScript - No any types allowed anywhere in the generated code.

Exponential Backoff - Retries on 429/529 errors at 1s, 2s, 4s, 8s, and 16s intervals. Rate limit hits do not kill the pipeline.

JSON Robustness - When an agent returns malformed JSON, the pipeline automatically retries with explicit instructions to fix the format rather than failing immediately.

Timeout - A hard 20-minute limit per pipeline run prevents runaway executions.
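
The backoff schedule above can be sketched in a few lines. This is an illustration in Python rather than the project's TypeScript code; HttpError and call_with_backoff are hypothetical names, and making a final attempt after the 16s wait is an assumption.

```python
import time

RETRYABLE_STATUSES = {429, 529}
DELAYS_SECONDS = (1, 2, 4, 8, 16)


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(request, sleep=time.sleep):
    """Call request(), retrying on 429/529 with doubling delays.

    Other errors propagate immediately. After the last delay one final
    attempt is made, and its error (if any) is raised to the caller.
    """
    for delay in DELAYS_SECONDS:
        try:
            return request()
        except HttpError as err:
            if err.status not in RETRYABLE_STATUSES:
                raise
            sleep(delay)
    return request()
```

Passing sleep as a parameter keeps the retry logic testable without real waiting.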

Getting Started

Prerequisites

Node.js 20+
OpenRouter API Key

Installation
Clone the repository, then install dependencies:

npm install

Configure the environment. The only required variable is your OpenRouter API key - everything else has sensible defaults:

cp .env.example .env
# Edit .env and add your OPENROUTER_API_KEY

Configuration

The system uses envalid for robust configuration management:

OPENROUTER_API_KEY - required for LLM access
DEFAULT_MODEL - set to google/gemini-2.0-flash-001
PORT - API server port, default: 3000
DB_PATH - SQLite database path, default: spec-to-ship.db

Usage

Terminal UI
Run the interactive CLI to start a new pipeline:

npm run start

The CLI uses Ink to provide real-time status updates and token streaming as each agent works through its stage. You can watch the pipeline progress in real time - each agent's output appears as it is generated rather than waiting for the full run to complete.

Industrial Dashboard
Start the API server:

npm run dev

Then open dashboard/index.html in your browser. The dashboard features an Industrial Command Center aesthetic with Dark Charcoal and Amber Glow styling and uses Server-Sent Events for real-time observability of the pipeline as it runs. This gives a visual view of the same pipeline that the CLI runs, useful for sharing progress with others or monitoring longer runs.

Output Structure

Every pipeline run writes its artifacts to ./output/{runId}/. Each file maps directly to one agent's output, so you can inspect any stage independently:

spec.md : Architectural specification from the Architect agent. The source of truth every downstream agent works from.

tasks.json : Task breakdown from the Planner agent. The dependency-ordered list of what gets built.

src/ : Implementation code from the Engineer agent.

tests/ : Vitest tests from the QA agent.

review.md : Final review report from the Reviewer agent, including the score and approval status.

meta.json : Token usage, cost, and timing for the full run.

pipeline.log : NDJSON event log of the entire pipeline execution.
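
Because pipeline.log is NDJSON, each line is an independent JSON object and can be processed with very little code. The sketch below is in Python for illustration; the agent and tokens fields are hypothetical, since the post does not document the event schema.

```python
import json
from collections import Counter


def read_ndjson(lines):
    """Parse an iterable of NDJSON lines, skipping blank ones."""
    events = []
    for line in lines:
        line = line.strip()
        if line:
            events.append(json.loads(line))
    return events


def tokens_by_agent(events):
    """Sum a hypothetical 'tokens' field per hypothetical 'agent' field."""
    totals = Counter()
    for event in events:
        totals[event.get("agent", "unknown")] += event.get("tokens", 0)
    return dict(totals)
```

An open file handle can be passed straight to read_ndjson, since iterating a file yields one line at a time.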

How I Built This Using NEO

The idea was a multi-agent pipeline that mirrors how a real engineering team works: each role specialized, each handoff structured, and the whole thing running autonomously from a feature idea through to reviewed, committed code. The requirements included five distinct agent roles with clear responsibilities, a sequential handoff structure with a QA loop, production-grade TypeScript output, real-time observability via both a CLI and a web dashboard, and resilience patterns like exponential backoff and JSON retry logic.

NEO built the full system: the five agent implementations with their respective prompts and output schemas, the pipeline orchestration layer coordinating sequential handoffs, the Ink-based CLI with real-time token streaming, the Node.js API server with SSE for dashboard observability, the Industrial Command Center dashboard in HTML, the SQLite-backed database, the artifact output structure, and the envalid configuration layer.

How You Can Use and Extend This With NEO

Use it to go from idea to working code in a single command.
Write a feature description, run the pipeline, and get a complete implementation with architecture docs, TypeScript source, Vitest tests, and a reviewer score without manually coordinating any of the steps. The five-agent structure ensures each stage is handled by a role optimised for that specific task.

Use the reviewer score as a quality gate.
The ReviewerAgent scores every run from 0 to 100 across security, performance, and correctness. Teams can use this score as a threshold before accepting generated code - only promoting runs that clear a minimum score into the codebase.

Use the NDJSON event log for pipeline observability.
Every run writes a structured pipeline.log in NDJSON format. This can be parsed by any log processing tool to track pipeline performance, token costs, and approval rates across runs over time.

Extend it with additional agent roles.
The five-agent structure is sequential and modular. A new agent that receives the previous stage's output and produces its own artifact can be added without restructuring the existing pipeline - the handoff pattern is already established for each stage.

Final Notes

SPEC TO SHIP compresses the gap between a feature idea and production-ready code by distributing the work across five specialized agents, each focused on what it does best. Architecture, planning, implementation, testing, and review - all coordinated automatically, with structured handoffs and resilience built in at every stage.

The code is at https://github.com/dakshjain-1616/Spec-To-Ship

RAG Pipeline Stress Tester: Battle-Test Your RAG System Before It Reaches Production

Nilofer 🚀 — Tue, 12 May 2026 11:45:30 +0000

Most RAG systems get tested with a handful of happy-path questions. Someone asks "what is machine learning?", gets a reasonable answer, and calls it done. Then it goes to production and users find the edge cases: hallucinations on out-of-scope questions, failed refusals on adversarial prompts, and latency that collapses under real concurrent load.

RAG Pipeline Stress Tester is a battle-testing toolkit that finds these issues before deployment.

What It Does

Takes any HTTP RAG endpoint and hammers it with 7 categories of adversarial queries under configurable concurrent load.
Tracks relevance, hallucination, refusal quality, and latency for every query sent.
Scores everything into a composite health score from 0 to 100.
Breaks results down by query category so you know exactly which failure modes are causing issues.
Measures p50, p95, and p99 latency under realistic concurrent load, not just single-request response times.
Produces an HTML report with interactive charts and a JSON report for CI/CD integration.
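
For readers unfamiliar with latency percentiles, here is a minimal sketch using the nearest-rank method. The tool's own percentile calculation is not shown in this post and may use a different method such as interpolation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """p50, p95, and p99 for a list of per-request latencies."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

The p99 value is the one that exposes tail latency, which single-request timing never shows.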

Why This Exists

Before deploying a RAG system to production, four questions need answers:

Does it hallucinate when asked about things not in the corpus?
Does it refuse appropriately on out-of-scope questions?
Does it stay consistent when the same question is asked multiple ways?
Does it hold up under load with 10, 25, or 50 concurrent users?

Manual testing cannot answer these questions at scale. This tool does it automatically.

Without stress testing - hallucinations get discovered in production, users find edge cases first, latency under load is guesswork, and there is no audit trail.

With this tool - hallucinations are caught before deployment, you find edge cases in batch, p50/p95/p99 latency is measured at realistic concurrency, and every test run produces a timestamped JSON and HTML report.

The 7 Query Categories

The tool ships with 7 pre-built adversarial query banks, each targeting a specific failure mode:

out_of_scope - Questions with no answer in the corpus, tests hallucination resistance
adversarial - Prompt injection and jailbreak attempts, tests instruction-following safety
ambiguous - Queries with multiple valid interpretations, tests disambiguation
multilingual - Non-English queries, tests language handling
temporal - Time-sensitive questions that depend on stale data
negation - "What is NOT X" style questions, a common failure mode
compound - Multi-part questions requiring multiple retrievals

You can add your own queries by appending lines to any file in query_bank/.

Health Score

Every test run produces a composite Health Score from 0 to 100:

≥ 80  EXCELLENT   Production-ready
≥ 60  GOOD        Minor issues, review before deploying
≥ 40  FAIR        Significant issues, fix first
 < 40  POOR        Critical failures, do not deploy
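
The bands above translate directly into a lookup. The function name is illustrative; the thresholds and labels come from the table.

```python
def health_band(score):
    """Map a 0-100 health score to its band label."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "EXCELLENT"
    if score >= 60:
        return "GOOD"
    if score >= 40:
        return "FAIR"
    return "POOR"
```

A score of 57.1, as in the sample run later in this post, lands in the FAIR band.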

Calculated from five weighted components:

Architecture

main.py             Typer CLI — entry point and orchestration
adversarial.py      Query generator — 7 categories, pre-built + corpus-generated
loader.py           Async load driver — aiohttp, configurable concurrency
evaluator.py        Scorer — hallucination, precision, refusal, consistency
reporter.py         Report generator — HTML (Chart.js) + JSON output
corpus_analyzer.py  Optional: generate targeted queries from your own documents
query_bank/         7 pre-built adversarial query files (one per line)
tests/              58 pytest tests (no live endpoint needed)

Install

pip install -r requirements.txt

The endpoint the tester sends requests to must accept POST with {"query": "..."} and return JSON containing either a response or answer field. Any HTTP status other than 200 is counted as an error.

Running a Stress Test

The core command runs a full stress test against your RAG endpoint:

# Basic — 10 concurrent users, 60-second run
python3 main.py stress-test \
  --endpoint http://localhost:8000/query \
  --concurrency 10 \
  --duration 60

# Test only specific query categories
python3 main.py stress-test \
  --endpoint http://localhost:8000/query \
  --query-types out_of_scope,adversarial,multilingual

# Custom output directory
python3 main.py stress-test \
  --endpoint http://localhost:8000/query \
  --output ./my-reports

Here is what the terminal output from a real run looks like:

🚀 Starting RAG Stress Test
   Endpoint: http://localhost:8000/query
   Concurrency: 5
   Duration: 20s

📊 Generating test queries...
   Generated 350 test queries

⚡ Running load tests...
📈 Evaluating results...
📝 Generating reports...

✅ Stress test complete!
   JSON Report: reports/stress_test_results.json
   HTML Report: reports/stress_test_report.html

=======================================================
  Overall Health Score : 57.1/100
  Status               : FAIR - Significant issues detected
  Total requests       : 6355
  Error rate           : 0.0%
  Precision score      : 2.1%
  Hallucination rate   : 22.5%
  Refusal rate         : 77.5%
  Consistency score    : 72.1%
  Latency p50/p95/p99  : 2.9 / 6.3 / 8.7 ms

  Query Type          Count   Halluc%   Refusal%    AvgLat
  ------------------ ------  --------  ---------  --------
  adversarial           205     35.1%      64.9%      3.3ms
  ambiguous             250     12.0%      88.0%      3.2ms
  compound              200     22.0%      78.0%      4.0ms
  multilingual          250     10.0%      90.0%      3.1ms
  negation              200     20.0%      80.0%      5.3ms
  out_of_scope          250     20.0%      80.0%      4.0ms
  temporal              200     38.0%      62.0%      3.1ms

  Recommendations:
    - Low precision score. Enhance retrieval mechanism and relevance ranking.
    - Moderate: Several areas need improvement for production readiness.
=======================================================

Quick Sanity Check

For a fast check before a full run, quick-test runs 35 sample queries (5 per category) and prints the health score without writing any report files:

python3 main.py quick-test --endpoint http://localhost:8000/query

🔍 Running quick sanity test...
   Testing with 35 sample queries

🎯 Quick Test Health Score: 72.4/100
   ✅ Endpoint appears functional

Generate Queries From Your Own Corpus

The analyze-corpus command analyzes your own .txt, .md, or .json files, extracts domain keywords, and produces targeted in-scope, out-of-scope, and adversarial query files you can drop into query_bank/:

python3 main.py analyze-corpus \
  --corpus ./my-docs \
  --output ./query_bank \
  --num-queries 50

📚 Analyzing corpus: ./my-docs
   Generated 50 in_scope queries → query_bank/in_scope_generated.txt
   Generated 50 out_of_scope queries → query_bank/out_of_scope_generated.txt
   Generated 50 adversarial queries → query_bank/adversarial_generated.txt

✅ Corpus analysis complete!

For very small corpora, lower the keyword frequency threshold:

python3 main.py analyze-corpus \
  --corpus ./my-docs \
  --output ./query_bank \
  --num-queries 20 \
  --min-word-freq 1

Configuration

Edit config.yaml to customise load levels, thresholds, and reporting. The --endpoint CLI flag always takes precedence over config.yaml.

load.concurrency_levels - Concurrent user levels to test, for example [1, 5, 10, 25]
load.ramp_mode - If true, steps through each concurrency level; if false, runs at the first level for the full duration
load.duration_seconds - How long to run at each concurrency level
load.rate_limit_per_second - Maximum requests per second
evaluation.hallucination_threshold - Keyword-overlap score below which a response is flagged as a potential hallucination
evaluation.refusal_keywords - Phrases that indicate a refused answer
reporter.output_dir - Where to save HTML and JSON reports
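
Two of these options have behavior that is easy to show in code. The sketch below illustrates the ramp_mode semantics and a simple refusal check. It takes the load section as a plain dict and is not the tool's own code; the function names are illustrative, and case-insensitive substring matching for refusals is an assumption.

```python
def plan_stages(load):
    """Return (concurrency, duration_seconds) stages for a run.

    With ramp_mode on, each concurrency level gets its own stage.
    With it off, only the first level runs for the full duration.
    """
    levels = load["concurrency_levels"]
    duration = load["duration_seconds"]
    if load.get("ramp_mode", False):
        return [(level, duration) for level in levels]
    return [(levels[0], duration)]


def is_refusal(response, refusal_keywords):
    """True when the response contains any refusal phrase.

    Case-insensitive substring matching is an assumption.
    """
    text = response.lower()
    return any(phrase.lower() in text for phrase in refusal_keywords)
```

With concurrency_levels set to [1, 5, 10, 25] and ramp_mode on, a run has four stages of duration_seconds each.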

Pass the config file with --config:

python3 main.py stress-test \
  --endpoint http://localhost:8000/query \
  --config config.yaml

Output Reports

Each test run saves two files to ./reports/ or your --output path:

stress_test_results.json - Machine-readable raw data with per-query latency, success and failure flags, hallucination scores, and a per-type breakdown. Useful for CI/CD integration.

stress_test_report.html - Interactive dashboard with a health score badge coloured by band, metric cards covering success rate, precision, hallucination, latency p95 and consistency, a bar chart of success rate by query type, a grouped bar chart of hallucination and refusal rate by query type, a latency distribution histogram, and prioritised recommendations.
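
A CI gate on the JSON report could look like the sketch below. The exact key that holds the health score is not documented in this post, so health_score is a hypothetical key name, and the default threshold of 60 is only an example.

```python
import json


def passes_gate(report, minimum=60.0, key="health_score"):
    """True when the report's health score clears the minimum.

    'health_score' is a hypothetical key; check the real JSON report.
    """
    return float(report[key]) >= minimum


def main(argv):
    """Return a process exit code: 0 on pass, 1 on fail.

    argv[1] is the path to stress_test_results.json.
    """
    with open(argv[1], encoding="utf-8") as handle:
        report = json.load(handle)
    return 0 if passes_gate(report) else 1
```

A CI step would call sys.exit(main(sys.argv)) so that a failing score fails the build.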

Endpoint Requirements

The tester sends:

POST /your-endpoint
{"query": "What is machine learning?"}

It expects a JSON response containing either a response or answer field:

{"response": "Machine learning is..."}

Any HTTP status other than 200 is counted as an error.
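
If you want to try the tester before pointing it at a real system, a stub endpoint that honors this contract is enough. The sketch below uses only the standard library; the handler name, the canned reply, and the 400 response for malformed bodies are all assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockRagHandler(BaseHTTPRequestHandler):
    """Accepts POST {"query": "..."} and returns {"response": "..."}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            query = payload["query"]
        except (ValueError, KeyError, TypeError):
            self._send(400, {"error": "expected a JSON body with a 'query' field"})
            return
        # Canned refusal-style answer; a real endpoint would run retrieval.
        self._send(200, {"response": f"No information available about: {query}"})

    def _send(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the console quiet under load


def make_server(port=8000):
    """Build a server bound to localhost; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), MockRagHandler)
```

Running make_server().serve_forever() gives you a local endpoint on port 8000 that accepts the request shape shown above.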

Running Tests

python3 -m pytest tests/ -v

58 tests covering all modules. Uses aioresponses to mock HTTP - no live RAG endpoint required.

Project Structure

rag-pipeline-stress-tester/
├── main.py             # CLI entry point
├── adversarial.py      # Query generators (7 types)
├── loader.py           # Async load test driver
├── evaluator.py        # Scoring and metrics
├── reporter.py         # HTML + JSON report generator
├── corpus_analyzer.py  # Optional corpus-based query generation
├── config.yaml         # Test configuration
├── requirements.txt
├── query_bank/         # 7 pre-built adversarial query files
└── tests/              # 58 pytest tests

How I Built This Using NEO

The requirement was a toolkit that could stress test any RAG endpoint automatically, not just for latency but for hallucination, refusal quality, and consistency under concurrent load. The tool needed to work against any endpoint with a standard request format, produce structured reports for CI/CD integration, and ship with pre-built adversarial query banks covering the failure modes that matter most before a RAG deployment.

NEO built the full implementation: the Typer CLI with all three commands, the async load driver backed by aiohttp, the query generator covering all 7 adversarial categories, the hallucination and precision scorer, the composite health score calculator with five weighted components, the HTML report generator with Chart.js charts, the JSON reporter, the corpus analyzer for generating domain-specific queries, and the full test suite of 58 tests with HTTP mocked via aioresponses.

How You Can Use and Extend This With NEO

Use it as a pre-deployment gate for every RAG system.
Before any RAG endpoint goes to production, run a stress test against it. The health score gives you a single number: below 80 means review before deploying, below 60 means fix the issues first, and below 40 means do not deploy. The per-category breakdown tells you exactly which failure modes are causing the score to drop.

Use it with your own domain queries.
The pre-built query banks are general purpose. For domain-specific testing, run analyze-corpus on your own documents to generate in-scope, out-of-scope, and adversarial queries targeted at your actual corpus, then drop them into query_bank/ and run the stress test.

Integrate the JSON report into CI/CD.
stress_test_results.json is machine-readable and contains per-query latency, hallucination scores, and the health score. A CI step that reads the health score and fails the pipeline below a threshold turns RAG quality into an automated deployment gate.

Extend it with additional query categories.
The 7 query banks are plain text files in query_bank/, one query per line. Adding a new category for a specific failure mode your RAG system faces means adding a new file to query_bank/ and registering it in adversarial.py.

Final Notes

RAG systems fail in predictable ways: hallucination on out-of-scope questions, collapsed latency under load, and inconsistent refusals. RAG Pipeline Stress Tester surfaces all of these before production, with a structured health score, per-category metrics, and reports that fit directly into a CI/CD pipeline.

The code is at https://github.com/dakshjain-1616/RAG-pipeline-stress-tester

Orbis: Turn Any GitHub Repository Into an Interactive 3D Dependency Graph

Nilofer 🚀 — Sat, 09 May 2026 10:58:10 +0000

Understanding a large codebase is hard. You clone it, start reading files, and quickly lose track of how everything connects. Which modules are most depended on? Where are the circular dependencies? What would break if you refactored this file?

Orbis answers these questions visually. Paste a GitHub repository URL, and Orbis clones it, parses the ASTs across Python, JavaScript, TypeScript, Go, Rust, and Java, detects architectural patterns, and renders the entire codebase as a navigable 3D force-directed graph. Click any module to inspect its dependencies, metrics, and exported symbols. Ask the built-in AI assistant questions like "which module should I refactor first?" and get answers grounded in the actual code structure.

Features

3D force-directed graph - Nodes sized by lines of code, colored by type, with animated directional particles on edges.

Multi-language AST parsing - Python, JavaScript/TypeScript, Go, Rust, and Java via tree-sitter.

AI chat assistant - Ask Claude questions about the analyzed codebase. Questions like "Which modules have circular dependencies?" or "Where should I add feature X?" are answered with full architectural context.

Architectural insights - Auto-detected issues including god modules, high coupling, and circular dependencies, each with severity ratings.

Focus Mode - Dim unconnected nodes to trace dependency paths clearly.

Shareable URLs - ?repo=https://github.com/... auto-triggers analysis on load, making it easy to share a specific codebase view.

Recent history - Last 5 repos stored locally for quick re-analysis.

Demo mode - Load a pre-analyzed snapshot without a GitHub clone.
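
To give a feel for how circular dependencies can be found in a module graph, here is a minimal depth-first search sketch. It reads the id and internal_dependencies fields that appear in the /analyze output, but it is not Orbis's own detector, and it reports one cycle per back edge rather than every possible cycle.

```python
def find_cycles(nodes):
    """Return dependency cycles as lists of node ids (first id repeated
    at the end). Uses recursion, so very deep graphs may need an
    iterative version."""
    graph = {n["id"]: n.get("internal_dependencies", []) for n in nodes}
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node_id: WHITE for node_id in graph}
    path, cycles = [], []

    def visit(node_id):
        color[node_id] = GRAY
        path.append(node_id)
        for dep in graph[node_id]:
            if dep not in color:
                continue  # dependency outside the analyzed node set
            if color[dep] == GRAY:
                # Back edge: dep is already on the current path.
                cycles.append(path[path.index(dep):] + [dep])
            elif color[dep] == WHITE:
                visit(dep)
        path.pop()
        color[node_id] = BLACK

    for node_id in graph:
        if color[node_id] == WHITE:
            visit(node_id)
    return cycles
```

Each returned list can be read as a chain of imports that ends where it started.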

Tech Stack

Backend: FastAPI + Server-Sent Events (SSE)
AST Parsing: tree-sitter (Python, JS/TS, Go, Rust, Java)
AI Integration: Claude Opus 4.6 via Anthropic API
3D Rendering: 3d-force-graph + Three.js
Frontend: Vanilla JS SPA - no build step

Quick Start

1. Clone and install

cd orbis
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Set up environment

cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY for the AI chat feature

Get an API key at console.anthropic.com. The AI chat feature requires ANTHROPIC_API_KEY in your environment. It degrades gracefully: if the key is missing, the chat panel shows an error message rather than breaking the rest of the app.

3. Run

uvicorn main:app --host 0.0.0.0 --port 8001

Open http://localhost:8001.

Docker

docker build -t orbis .
docker run -p 8001:8001 -e ANTHROPIC_API_KEY=sk-ant-... orbis

Usage

Once running, the workflow is straightforward:

Enter a public GitHub repository URL - for example https://github.com/expressjs/express
Optionally specify a branch
Click Analyze - Orbis clones the repo, parses ASTs, and builds the graph in roughly 5–30 seconds
Explore the 3D graph - click a node to open its detail drawer, scroll to zoom, drag to rotate
Use Focus Mode to highlight a node's direct connections
Use layer filter chips to show or hide architectural layers
Ask the AI assistant questions about the codebase in the chat panel

Keyboard Shortcuts

R: Reset camera
P: Pause/resume rotation
F: Toggle Focus Mode
/: Focus search box
Esc: Close detail drawer / exit Focus Mode

Architecture

The project has four files at its core - a FastAPI backend, a single-file AST parser, a vanilla JS frontend with no build step, and a utility script for pre-generating demo data:

main.py           FastAPI backend — SSE streaming for /analyze, /chat
neo_parser.py     Multi-language AST parser (tree-sitter)
static/
  index.html      Single-page frontend (3d-force-graph + Three.js)
save_analysis.py  Utility: pre-generate demo data from a repo

The backend streams analysis progress to the frontend via Server-Sent Events while cloning and analyzing the repo.
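To make the streaming step concrete, here is a minimal sketch of how a client might split a Server-Sent Events stream into events. The event:/data: framing is standard SSE. The progress event name and its payload are assumptions for illustration; only the complete event is named in this post.

```python
import json

def parse_sse(stream_text):
    """Split raw SSE text into (event_name, payload) pairs.

    Events are separated by a blank line; each event may carry an
    `event:` line and one or more `data:` lines.
    """
    events = []
    for block in stream_text.strip().split("\n\n"):
        name, data_lines = "message", []  # "message" is the SSE default event name
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events

# Hypothetical stream: the "progress" event and its fields are illustrative only.
sample = (
    'event: progress\ndata: {"stage": "cloning"}\n\n'
    'event: complete\ndata: {"schema_version": "2.0", "nodes": []}\n\n'
)
events = parse_sse(sample)
final = dict(events)["complete"]
print(final["schema_version"])
```

A real client would read the response incrementally rather than from one string, but the framing rules are the same.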

API Endpoints

Output Schema

/analyze emits SSE events and completes with a complete event containing:

{
  "schema_version": "2.0",
  "architecture_type": "MVC",
  "languages": { "python": 42 },
  "summary": "Codebase contains 42 modules...",
  "nodes": [{
    "id": "requests/auth",
    "label": "auth.py",
    "type": "utility",
    "language": "python",
    "lines_of_code": 315,
    "complexity": "medium",
    "exported_symbols": ["AuthBase", "HTTPBasicAuth"],
    "internal_dependencies": ["requests/compat"],
    "external_dependencies": [],
    "metrics": { "functions_total": 12, "classes": 4 }
  }],
  "edges": [{ "from": "requests/api", "to": "requests/auth", "type": "import" }],
  "insights": [{
    "type": "high_coupling",
    "severity": "high",
    "title": "High fan-in on requests/models",
    "description": "14 modules import this file directly.",
    "affected_nodes": ["requests/models"],
    "recommendation": "Consider splitting into smaller focused modules."
  }]
}

Each node carries its lines of code, complexity rating, exported symbols, and both internal and external dependencies. The insights block surfaces architectural issues automatically (high coupling, circular dependencies, and god modules), each with a severity rating and a specific recommendation.
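As a rough illustration of how insights like these can be derived from the edges array, the sketch below counts fan-in per module and finds one dependency cycle with a depth-first search. This is not Orbis's actual insights engine, which is not shown in this post; the third edge in the sample is hypothetical and added only to create a cycle.

```python
from collections import Counter

def fan_in(edges):
    """Count how many modules import each target module."""
    return Counter(edge["to"] for edge in edges)

def find_cycle(edges):
    """Return one dependency cycle as a list of node ids, or None."""
    graph = {}
    for edge in edges:
        graph.setdefault(edge["from"], []).append(edge["to"])

    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the path from the first visit to here.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in list(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None

edges = [
    {"from": "requests/api", "to": "requests/auth", "type": "import"},
    {"from": "requests/auth", "to": "requests/compat", "type": "import"},
    {"from": "requests/compat", "to": "requests/api", "type": "import"},  # hypothetical
]
print(fan_in(edges))
print(find_cycle(edges))
```

A high fan-in count is the kind of signal behind a high_coupling insight, and any cycle found corresponds to a circular dependency.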

Supported Languages

Python - .py
JavaScript/TypeScript - .js, .mjs, .cjs, .jsx, .ts, .tsx
Go - .go
Rust - .rs
Java - .java
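The extension list above maps naturally onto a lookup table. Here is a small sketch that counts files per language, similar in shape to the languages field in the output schema. The language keys other than python are assumptions, and the real neo_parser.py may organize this differently.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extension map taken from the supported-languages list above.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript", ".mjs": "javascript", ".cjs": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".go": "go",
    ".rs": "rust",
    ".java": "java",
}

def count_languages(paths):
    """Count source files per language, ignoring unsupported extensions."""
    counts = Counter()
    for path in paths:
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())
        if language:
            counts[language] += 1
    return dict(counts)

print(count_languages(["requests/auth.py", "web/app.tsx", "README.md"]))
```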

AI Chat

The chat assistant uses Claude Opus 4.6 and receives the full architectural graph as context - node list, dependencies, insights, and summary. It can answer questions like:

"What does the auth module depend on?"
"Why are there circular dependencies between X and Y?"
"Which module should I refactor first?"
"Where would I add a caching layer?"

The assistant's answers are grounded in the actual parsed structure of the codebase - not generic advice. Requires ANTHROPIC_API_KEY in your environment.

Development

# Run with auto-reload
uvicorn main:app --reload --port 8001

# Re-generate demo data
python save_analysis.py

How I Built This Using NEO

The idea was a tool that turns any GitHub repository into an interactive 3D graph, something a developer could paste a URL into and immediately understand the architecture without reading a single file. The requirements included multi-language AST parsing, automatic architectural issue detection, an AI assistant grounded in the actual code structure, and a frontend that required no build step.

NEO built the full stack from that description: the FastAPI backend with SSE streaming for real-time analysis progress, the multi-language AST parser in neo_parser.py covering Python, JavaScript, TypeScript, Go, Rust, and Java via tree-sitter, the 3D force-directed graph frontend in vanilla JS, the Claude Opus 4.6 chat assistant with full architectural context, the insights engine detecting god modules, high coupling, and circular dependencies with severity ratings, and the demo mode with pre-generated analysis data.

How You Can Use and Extend This With NEO

Use it to onboard onto an unfamiliar codebase.
Instead of spending hours reading files to understand how a project is structured, paste the repo URL into Orbis and get an immediate visual map of every module, its dependencies, and the architectural issues that already exist. The AI assistant can then answer specific questions about the structure without you having to trace imports manually.

Use it during code review to understand structural impact.
When reviewing a large pull request, run Orbis on the repo and use the insights panel to see whether high coupling, circular dependencies, or god modules exist in the areas being changed. The AI assistant can answer specific questions about how the affected modules connect to the rest of the codebase.

Use it to plan a refactor.
Ask the AI assistant "which module should I refactor first?" or "where would I add a caching layer?" and get answers grounded in the actual dependency graph. The focus mode lets you isolate a specific module and trace exactly what depends on it before touching anything.

Extend it with additional language parsers.
neo_parser.py already handles five languages via tree-sitter. Adding a new language - Ruby, C++, Swift - follows the same parser pattern and surfaces automatically in the language filter chips and the supported languages list without touching the frontend or the API.

Final Notes

Orbis makes codebase architecture something you can see and navigate rather than something you have to reconstruct in your head. A 3D dependency graph, multi-language AST parsing, automatic architectural issue detection, and an AI assistant that knows the actual structure - all from a single repo URL.

The code is at https://github.com/dakshjain-1616/Orbit-dependency-visualised
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

SmolVLM2 Edge Vision Agent: Visual Monitoring Without a GPU or Cloud API

Nilofer 🚀 — Thu, 07 May 2026 11:43:31 +0000

Running vision AI locally has always had a catch: you need a GPU, or you need to send frames to a cloud API and pay per call. SmolVLM2-2.2B changes that. It is a 2.2B-parameter multimodal model specifically designed for CPU inference, and this agent is built around it.

SmolVLM2 Edge Vision Agent is a fully offline edge vision agent that ingests a live webcam feed or an image folder, detects motion using frame-difference analysis, triggers VLM analysis only on scene changes, and persists structured observations to a local SQLite database with a FastAPI web dashboard for review. No API costs. No network calls after the first model download. 16GB RAM, no GPU required.

Project Overview

The agent does five things:

Ingests a live webcam feed or an image folder as input
Performs continuous visual monitoring, using frame-difference motion detection that triggers VLM analysis only on scene changes
Describes new objects, reads text from images - receipts, whiteboards, signs, and logs everything as structured observations
Persists observations to a local SQLite database with timestamps, thumbnails, descriptions, and confidence scores
Exposes a FastAPI web dashboard with live feed, latest observations, and a searchable log

It runs entirely offline. The model auto-downloads on first run and is cached locally from that point forward.

Use cases: home security camera analysis, document digitization pipelines, accessibility tools.

Architecture

The key design decision is the motion gate. Running a 2.2B-parameter model on every frame would be unusable on CPU hardware, because inference is not instant. The agent solves this by running frame-difference motion detection on every frame first, and only invoking the VLM when a scene change is detected above the configured threshold.

Per-frame timeline:
Every frame goes through motion detection first. If the frame difference is below the threshold, the frame is dropped with no further processing. If motion is detected, the VLM runs, produces a description, and the observation is stored in SQLite with a thumbnail. This design means expensive model inference only happens when something actually changes in the scene, keeping a Pi-class CPU usable while still describing every meaningful scene change.

The FRAME_DIFF_THRESHOLD defaults to 0.15 and controls how sensitive the motion detector is. A higher value means less sensitivity: minor lighting changes or small movements are ignored. A lower value triggers the VLM more frequently.
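A motion gate of this kind can be sketched in a few lines. The version below uses mean absolute pixel difference between two grayscale frames, scaled to the 0-1 range, which is one common way to compute a frame-difference score. How the agent's MotionDetector actually computes its score is not shown in this post, so treat the formula and function names as assumptions.

```python
def frame_difference(previous, current):
    """Mean absolute pixel difference between two grayscale frames, scaled to 0-1."""
    total = 0
    count = 0
    for prev_row, cur_row in zip(previous, current):
        for prev_px, cur_px in zip(prev_row, cur_row):
            total += abs(prev_px - cur_px)
            count += 1
    return total / (count * 255) if count else 0.0

def should_run_vlm(previous, current, threshold=0.15):
    """Only invoke the expensive model when the scene changed enough."""
    return frame_difference(previous, current) > threshold

# Frames are lists of rows with 0-255 grayscale values.
dark = [[0, 0], [0, 0]]
flicker = [[10, 0], [0, 0]]       # small lighting change
intruder = [[255, 255], [0, 0]]   # half the frame changed

print(should_run_vlm(dark, flicker))
print(should_run_vlm(dark, intruder))
```

With the default threshold, the small flicker is dropped and the large change passes the gate. Raising the threshold makes more frames fall into the first category.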

Prerequisites

Python: 3.11 or newer.
RAM: 16GB minimum for the real model; less is fine in --mock mode.
Disk: ~5GB free for the model cache.
OS: Linux, macOS, or WSL2 on Windows - uses OpenCV, and webcam access requires native camera support.
No GPU required - SmolVLM2-2.2B is designed for CPU inference.

Installation

git clone https://github.com/dakshjain-1616/smolvlm2-edge-agent.git
cd smolvlm2-edge-agent
make install                                  # pip install -e .
cp .env.example .env                          # then edit values as needed

The make install command runs pip install -e . to install the package and its pinned runtime dependencies from requirements.txt. The .env.example file contains all documented environment variables; copy it to .env and edit the values you want to override before running.

Configuration

Every tunable is configurable via CLI flags and environment variables. CLI flags take precedence over environment variables. All variables are documented in .env.example in the repository.

MODEL_NAME - HuggingFace model id, default: HuggingFaceTB/SmolVLM2-2.2B-Instruct
USE_MOCK_MODE - bypass model loading with deterministic stub responses, default: false
MODEL_CACHE_DIR - where the HuggingFace model is cached on disk, default: ./models
DB_PATH - SQLite database file path, default: ./data/observations.db
FRAME_DIFF_THRESHOLD - motion sensitivity on a 0–1 scale, higher means less sensitive, default: 0.15
MIN_CONFIDENCE - minimum VLM confidence required to log an observation, default: 0.5
PROCESSING_INTERVAL - seconds between frame samples, default: 1.0
MAX_OBSERVATIONS - cap on stored rows, older observations are pruned, default: 10000
DASHBOARD_HOST - FastAPI bind host, default: 0.0.0.0
DASHBOARD_PORT - FastAPI port, default: 8080
INPUT_SOURCE - camera index or path to image folder, default: 0
OUTPUT_DIR - where observation artifacts are written, default: ./data/observations/
THUMBNAIL_DIR - where frame thumbnails are saved, default: ./data/thumbnails/
LOG_LEVEL - Python logging level, default: INFO
LOG_FILE - optional log file path, default: ./data/agent.log

MIN_CONFIDENCE is worth paying attention to — observations where the VLM's confidence falls below 0.5 are not stored. Raising this filters out uncertain detections. Lowering it logs more, including lower-confidence observations.
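The precedence rule (CLI flag first, then environment variable, then built-in default) can be sketched with argparse. The --threshold flag name here is hypothetical, since the post does not list the flag for this setting; the real cli.py may be structured differently.

```python
import argparse

def resolve_threshold(argv, environ):
    """Resolve the motion threshold: CLI flag, then env var, then default."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--threshold", type=float, default=None)  # hypothetical flag name
    args = parser.parse_args(argv)
    if args.threshold is not None:
        return args.threshold
    return float(environ.get("FRAME_DIFF_THRESHOLD", "0.15"))

print(resolve_threshold(["--threshold", "0.3"], {"FRAME_DIFF_THRESHOLD": "0.2"}))
print(resolve_threshold([], {"FRAME_DIFF_THRESHOLD": "0.2"}))
print(resolve_threshold([], {}))
```

In a real entry point you would pass sys.argv[1:] and os.environ instead of the literal values used here.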

Usage

Quick start - mock mode, no model download

The fastest way to verify the full pipeline is mock mode. It bypasses model loading entirely and uses deterministic stub responses, so you can confirm the agent loop, database writes, thumbnail generation, and dashboard all work before committing to the 5GB model download:

mkdir -p data/test_images
python3 -m src --mock --input ./data/test_images --duration 30

This runs the agent for 30 seconds against the data/test_images/ folder using the mock VLM, populates data/observations.db, and writes thumbnails to data/thumbnails/.

Run against a webcam

python3 -m src --input 0 --port 8080

Camera index 0 is the default device. For additional cameras, use index 1, 2, and so on. Open http://localhost:8080 in a browser to see the live dashboard. The dashboard shows the live feed, the most recent observations, and a searchable log of everything the agent has recorded.

Run against an image folder

python3 -m src --input ./images --interval 2.0

Iterates over ./images at 2-second intervals. Useful for batch processing a folder of scanned documents, receipts, or photos without a live camera feed.

Dashboard only in read mode

python3 -m src --mode dashboard --port 8080

Serves the dashboard against an existing data/observations.db without running the agent. Useful for reviewing historical observations without starting a new capture session.

API Reference

The FastAPI dashboard exposes six endpoints:

The /api/search endpoint runs full-text search over stored observation descriptions, useful for finding all observations that mention a specific object, person, or piece of text across the full history.

The /api/observations endpoint is paginated with limit and offset parameters. The default returns the 50 most recent observations.
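Limit/offset pagination maps directly onto SQLite's LIMIT and OFFSET clauses. The sketch below shows the idea against an in-memory database. The table and column names are assumptions for illustration, not the agent's actual schema.

```python
import sqlite3

def fetch_page(conn, limit=50, offset=0):
    """Return one page of observations, newest first."""
    rows = conn.execute(
        "SELECT id, description FROM observations "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return rows.fetchall()

# Hypothetical schema, populated with 120 dummy rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO observations (description) VALUES (?)",
    [(f"observation {n}",) for n in range(1, 121)],
)

first_page = fetch_page(conn)              # the 50 most recent rows
second_page = fetch_page(conn, offset=50)  # the next 50
print(first_page[0], second_page[0])
```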

Models Used

HuggingFaceTB/SmolVLM2-2.2B-Instruct is the default for both the --model argument and the MODEL_NAME env var. No other models are referenced in code, config, or docs. The model is downloaded from HuggingFace on first run and cached in ./models.

Testing

The test suite covers all five modules - database, vision, agent, dashboard, and CLI - with the VLM fully mocked so no model download is needed to run tests:

make test                  # python3 -m pytest tests/ -v
make lint                  # ruff check src/ tests/ --fix
make typecheck             # mypy src/ --ignore-missing-imports

Test coverage:

tests/test_db.py - 10 tests covering SQLite schema, CRUD, and search
tests/test_vision.py - 6 tests covering mock VLM and prompt rendering
tests/test_agent.py - 9 tests covering motion detection and the agent loop
tests/test_dashboard.py - 6 tests covering HTTP route handlers
tests/test_cli.py - 7 tests covering argparse and env-var loading

Total: 36 tests, all passing. No skipped tests.

Project Structure

smolvlm2-edge-agent/
├── src/
│   ├── __init__.py
│   ├── __main__.py              # entry point for python -m src
│   ├── agent.py                 # MotionDetector + VisionAgent
│   ├── vision.py                # VisionEngine (SmolVLM2 wrapper, with MockVisionEngine)
│   ├── db.py                    # SQLite Database class
│   ├── dashboard.py             # FastAPI app factory + route handlers
│   └── cli.py                   # argparse + env loading
├── tests/                       # 36 pytest tests, VLM fully mocked
├── data/.gitkeep                # observations.db, thumbnails/, test_images/ land here
├── models/.gitkeep              # HF model cache
├── pyproject.toml               # ruff + mypy config + console_script
├── requirements.txt             # pinned runtime deps
├── Makefile                     # install, test, lint, typecheck, run, clean
├── .env.example                 # documented env vars
├── .gitignore
├── BUILD_NOTES.md               # build/verification trace
└── PUBLISH.md                   # exact GitHub push commands

The src/ directory maps cleanly to the agent's responsibilities - agent.py handles the motion detection and VLM orchestration loop, vision.py wraps the model with a mock-compatible interface, db.py handles all SQLite operations, dashboard.py is the FastAPI application, and cli.py handles all argument parsing and environment variable loading.

Contributing

PRs welcome. Before submitting, all three of the following must pass with zero errors:

make lint && make typecheck && make test

How I Built This Using NEO

The process started with an idea - a fully offline edge vision agent that runs on CPU-only hardware with no GPU and no cloud API calls. I put together a clear project description with the requirements, tech stack, and expected output, and handed it to NEO. From there NEO handled the full build autonomously: writing the code, running tests, fixing issues, and iterating until everything was working end to end. Once NEO completed the build, I did a manual review, tested it myself, and fed any improvements back - which NEO then implemented.

How You Can Use and Extend This With NEO

Use it as an offline home security monitor: Point it at a webcam, let it run, and review what it logged through the dashboard. Every scene change is stored with a timestamp, description, confidence score, and thumbnail - all locally, with no data leaving your machine.

Use it for document digitization pipelines: Point --input at a folder of scanned receipts, whiteboards, or handwritten notes. The VLM reads text from images and logs structured observations. The /api/search endpoint lets you query what was found across the full document set.

Use it as an accessibility tool: Run it against a webcam feed to generate continuous natural language descriptions of what is visible in the environment - stored and searchable, entirely offline.

Extend it with additional VLM backends: VisionEngine in vision.py wraps SmolVLM2-2.2B with a clean interface that MockVisionEngine also implements. Swapping in a different HuggingFace multimodal model means updating vision.py - the agent, database, dashboard, and CLI stay entirely unchanged.

Final Notes

SmolVLM2 Edge Vision Agent shows that meaningful vision AI does not require a GPU or a cloud API. A 2.2B-parameter model, motion-gated inference, a local SQLite store, and a FastAPI dashboard, all running offline on commodity hardware.

The code is at https://github.com/dakshjain-1616/SmolVLM2-Edge-Vision-Agent
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

Prompt Compression Benchmarker: Cut LLM Input Costs by 35–63% With Measurable Quality Tracking

Nilofer 🚀 — Wed, 06 May 2026 07:07:48 +0000

Most LLM cost comes from input tokens: the long documents, codebases, or conversation histories you send as context. There are several prompt compression algorithms available, but nobody tells you which one actually works best for your specific workload, or how much quality you are trading for the savings.

Prompt Compression Benchmarker (PCB) answers both questions. It benchmarks every major prompt compression algorithm against your actual data, shows you exactly how much quality each one drops, projects the real dollar savings at your call volume, and then gives you a one-line wrapper to deploy the winner as a drop-in replacement around your Anthropic or OpenAI client.

What It Does

PCB answers two questions:

Which compression algorithm preserves the most quality at a given token budget?
Benchmark mode runs all compressors against your data and scores each one with task-specific quality metrics and an optional LLM-as-judge.

How much money does that save at your actual call volume?
Cost projection mode takes your daily token volume and model pricing and gives you monthly and annual savings per compressor.

Then it gives you a one-line wrapper to deploy the answer.

Installation

# From source
git clone https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
cd Prompt-Compression-Benchmarker
pip install .

# From PyPI (once published)
pip install prompt-compression-benchmarker

Requires Python 3.9+. No GPU required. Core dependencies: tiktoken, scikit-learn, rouge-score, rank-bm25, typer, rich.

# Verify
pcb --help

# Optional extras
pip install "prompt-compression-benchmarker[anthropic]"   # SDK wrapper for Anthropic
pip install "prompt-compression-benchmarker[openai]"      # SDK wrapper for OpenAI
pip install "prompt-compression-benchmarker[mcp]"         # MCP server for Claude Code
pip install "prompt-compression-benchmarker[all]"         # Everything

Quick Start

1. Run the benchmark
The simplest run uses bundled sample data - no setup needed:

# All compressors × all task types, bundled sample data — no setup needed
pcb run

# Target a specific task with cost projection
pcb run --task rag --max-samples 20 --daily-tokens 2000000 --cost-model claude-sonnet-4-6

# Add LLM-as-judge for deeper quality scoring (requires OpenRouter API key)
export OPENROUTER_API_KEY=sk-or-...
pcb run --llm-judge --judge-model claude-sonnet-4-6 --max-samples 10

Here is what a real benchmark run looks like - RAG task, 3M tokens/day, claude-sonnet-4-6 pricing:

pcb run --daily-tokens 3000000 --cost-model claude-sonnet-4-6
                          RAG
 Compressor          Token Reduc %  Proxy Score  Proxy Drop %   ms
 no_compression           0.0%        0.2983         0.0%      0.3
 tfidf ★                 40.1%        0.2519        +16.5%     12.1
 selective_context        56.9%        0.1874        +34.4%      8.3
 llmlingua                53.6%        0.2182        +28.1%      9.7
 llmlingua2               45.0%        0.2204        +27.3%     11.2

 Monthly Cost Projection  claude-sonnet-4-6 · $3/1M · 3M tokens/day
 tfidf             38.3% reduction   $103/mo saved   $1,240/yr
 selective_context 57.5% reduction   $155/mo saved   $1,863/yr
 llmlingua2        43.6% reduction   $118/mo saved   $1,413/yr

The ★ marks the Pareto-optimal compressor - best token savings given a quality drop below 20%.
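The star rule can be read as "highest token reduction among compressors whose quality drop stays under the cap". The sketch below applies that reading to the numbers from the table above. It is a simplified interpretation, not PCB's actual selection code.

```python
def pick_winner(results, max_drop_pct=20.0):
    """Pick the compressor with the highest token reduction under the quality-drop cap."""
    eligible = [
        r for r in results
        if r["name"] != "no_compression" and r["drop_pct"] < max_drop_pct
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["reduction_pct"])["name"]

# Values copied from the RAG benchmark table above.
results = [
    {"name": "no_compression", "reduction_pct": 0.0, "drop_pct": 0.0},
    {"name": "tfidf", "reduction_pct": 40.1, "drop_pct": 16.5},
    {"name": "selective_context", "reduction_pct": 56.9, "drop_pct": 34.4},
    {"name": "llmlingua", "reduction_pct": 53.6, "drop_pct": 28.1},
    {"name": "llmlingua2", "reduction_pct": 45.0, "drop_pct": 27.3},
]
print(pick_winner(results))
```

Only tfidf stays under the 20% cap in this run, which is why it carries the star even though other compressors remove more tokens.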

2. Compress a file directly

# Compress from a file or stdin, output to stdout
pcb compress context.txt --compressor llmlingua2 --rate 0.45 --stats

# Pipe it into any script
cat rag_context.txt | pcb compress | python send_to_claude.py

# Save compressed output
pcb compress context.txt -o compressed.txt --stats

3. Deploy the winner
Once you know which compressor wins on your data, deploying it is one line:

from pcb.middleware import CompressingAnthropic

# Drop-in replacement for anthropic.Anthropic()
client = CompressingAnthropic(compressor="llmlingua2", rate=0.45)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

print(client.stats)  # CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

Everything else in your codebase stays the same.

Understanding the Results

Benchmark table columns

Quality drop color coding

cyan   = negative drop (compression improved the metric — noise removal)
green  = < 5% drop    (effectively lossless)
yellow = 5–15% drop   (acceptable for most use cases)
red    = ≥ 15% drop   (significant information loss)
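The color bands translate into a simple threshold function. This is a sketch of the mapping as described above, with a hypothetical function name.

```python
def drop_color(drop_pct):
    """Map a quality drop percentage to the color band used in the report."""
    if drop_pct < 0:
        return "cyan"    # compression improved the metric
    if drop_pct < 5:
        return "green"   # effectively lossless
    if drop_pct < 15:
        return "yellow"  # acceptable for most use cases
    return "red"         # significant information loss

for value in (-2.0, 3.1, 9.9, 16.5):
    print(value, drop_color(value))
```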

Why use the LLM judge?

The proxy score (F1, ROUGE, BM25) is fast and free but mechanical. The LLM judge calls a real model to evaluate whether the compressed context still supports the correct answer, so it reveals things proxy metrics miss.

Here is a real example showing why this matters - RAG task, 5 samples, LLM judge = claude-sonnet-4-6:

Compressor           Proxy Drop %   LLM Score   LLM Drop %
no_compression           0.0%         0.94         0.0%
tfidf                  +23.7%         0.40        -57.4%    ← proxy hid the severity
llmlingua2             +29.9%         0.70        -25.5%    ← much better than proxy suggested
selective_context      +37.6%         0.14        -85.1%    ← dangerous despite high compression

Rule of thumb: use proxy scores to compare many configs quickly, then LLM-judge the top 2–3 before deploying.
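The LLM Drop % column in the table above is consistent with a relative change against the no_compression baseline score, where a negative value means the judged score fell. The sketch below reproduces the column under that assumption; the function name is hypothetical.

```python
def relative_change_pct(score, baseline):
    """Relative change of a score against the baseline, in percent."""
    return round((score - baseline) / baseline * 100, 1)

baseline = 0.94  # no_compression LLM score from the table above
llm_scores = {"tfidf": 0.40, "llmlingua2": 0.70, "selective_context": 0.14}

changes = {name: relative_change_pct(score, baseline) for name, score in llm_scores.items()}
for name, change in changes.items():
    print(f"{name:20s} {change:+.1f}%")
```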

Choosing a Compressor

RAG: llmlingua2 at rate 0.40 - preserves named entities and key facts better than sentence-dropping

Summarization: llmlingua at rate 0.45 - sentence-level pruning maintains structural coverage

Code contexts: llmlingua2 at rate 0.35 - keeps imports, identifiers, type names; removes boilerplate

General chat: tfidf at rate 0.40 - safe default, fast, reliable

Target compression rate

--rate is the fraction of tokens to remove. 0.45 means keep 55% of tokens.
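As a quick worked example of that arithmetic, the sketch below converts a rate into a target token count. It describes the target only; the reduction a compressor actually achieves can differ, as the benchmark tables show. The function name is hypothetical.

```python
def tokens_kept(original_tokens, rate):
    """Target number of tokens remaining after removing `rate` of them."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    return round(original_tokens * (1.0 - rate))

print(tokens_kept(1000, 0.45))
```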

Cost Savings - The Real Numbers

Compression saves money on input tokens only. Output tokens are unchanged.
At 3M input tokens per day:

Compression is most valuable on premium models. On DeepSeek or GPT-4.1-mini, the savings are too small to justify the complexity; use it only if you're hitting context window limits.

# Check your own workload
pcb run --max-samples 10 --daily-tokens 5000000 --cost-model claude-opus-4-7
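The projection itself is simple arithmetic. Here is a sketch that assumes a 30-day month and applies the reduction to input-token spend only. With the tfidf figures from the earlier run (3M tokens/day, $3 per 1M tokens, 38.3% reduction) it lands close to the reported $103/mo. The function name and the 30-day assumption are mine, not PCB's.

```python
def project_savings(daily_tokens, price_per_million, reduction_pct, days_per_month=30):
    """Return (monthly, yearly) input-token savings in dollars."""
    daily_cost = daily_tokens / 1_000_000 * price_per_million
    monthly = daily_cost * days_per_month * reduction_pct / 100
    return monthly, monthly * 12

monthly, yearly = project_savings(3_000_000, 3.0, 38.3)
print(f"${monthly:.2f}/mo saved, ${yearly:,.2f}/yr")
```

Because the saving scales linearly with the price per million tokens, the same reduction on a cheap model yields a proportionally smaller number.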

Deploy: Python SDK Wrappers

Anthropic

from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    verbose=True,
)

response = client.messages.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": very_long_document}],
    max_tokens=1024,
)

# Cumulative stats
print(client.stats)
# CompressionStats(calls=47, tokens_saved=21,800, reduction=44.8%)

# Estimate monthly savings
print(client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000))
# 588.0

OpenAI (Chat Completions + Codex Responses API)

from pcb.middleware import CompressingOpenAI

client = CompressingOpenAI(compressor="tfidf", rate=0.40)

# Chat Completions API — unchanged
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": long_context}]
)

# Responses API (Codex / o-series)
response = client.responses.create(
    model="codex-mini-latest",
    input=long_codebase_context,
    reasoning={"effort": "high"}
)

What gets compressed

By default, only "user" role messages over 100 tokens are compressed. System prompts and assistant history are passed through unchanged.

client = CompressingAnthropic(
    compressor="llmlingua2",
    rate=0.45,
    compress_roles=("user", "system"),  # also compress system prompt
)
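To illustrate the selection rule, here is a sketch of how a wrapper might decide which messages to compress. It is not the middleware's actual code. The whitespace split is a rough stand-in for a real tokenizer such as tiktoken, and the function and parameter names other than compress_roles are hypothetical.

```python
def select_for_compression(messages, compress_roles=("user",), min_tokens=100):
    """Return the indexes of messages that would be compressed."""
    selected = []
    for index, message in enumerate(messages):
        # Rough token estimate; a real implementation would use a tokenizer.
        token_count = len(message["content"].split())
        if message["role"] in compress_roles and token_count > min_tokens:
            selected.append(index)
    return selected

long_text = " ".join(["word"] * 150)
messages = [
    {"role": "system", "content": long_text},
    {"role": "user", "content": "short question"},
    {"role": "user", "content": long_text},
    {"role": "assistant", "content": long_text},
]
print(select_for_compression(messages))
print(select_for_compression(messages, compress_roles=("user", "system")))
```

Short messages and untouched roles pass through as-is, so only the long user message is selected by default.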

Claude Code Integration (MCP)

PCB ships an MCP server that adds four compression tools directly into Claude Code conversations.

Setup

# Add to the current project
claude mcp add pcb -s project -- python -m pcb.mcp_server

# Or add to all your projects
claude mcp add pcb -s user -- python -m pcb.mcp_server

Or drop .mcp.json into any project root:

{
  "mcpServers": {
    "pcb": {
      "type": "stdio",
      "command": "python",
      "args": ["-m", "pcb.mcp_server"]
    }
  }
}

Available tools
Once connected, you can ask Claude:

"Compress this RAG context before sending it to the model"
"Estimate how much I'd save compressing my prompts on claude-opus-4-7 at 2000 calls/day"
"What compressor should I use for my coding assistant at 90% quality floor?"

OpenAI Codex (Agents SDK)

from agents import Agent, Runner
from agents.mcp import MCPServerStdio
import asyncio

async def main():
    async with MCPServerStdio(
        name="pcb",
        params={"command": "python", "args": ["-m", "pcb.mcp_server"]},
    ) as pcb_server:
        agent = Agent(
            name="CostAwareAssistant",
            model="codex-mini-latest",
            mcp_servers=[pcb_server],
        )
        result = await Runner.run(
            agent,
            "Compress this codebase context and estimate savings: " + codebase_context
        )
        print(result.final_output)

asyncio.run(main())

Bring Your Own Data

Data is JSONL - one JSON object per line. Check the schema for each task type:

pcb show-schema rag
pcb show-schema summarization
pcb show-schema coding

RAG schema

{
  "id": "my_001",
  "context": "<passage 300–1500 tokens>",
  "question": "<specific question requiring the full context>",
  "answer": "<short, precise answer string>"
}

Summarization schema

{
  "id": "my_001",
  "article": "<article or document 300–800 tokens>",
  "summary": "<2–3 sentence reference summary>"
}

Coding schema

{
  "id": "my_001",
  "context": "<imports, helpers, type definitions — 400–800 tokens>",
  "docstring": "<description of the function to implement>",
  "solution": "<correct Python implementation>"
}
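Before running a benchmark on your own files, it can help to check that every line parses and carries the fields shown in the schemas above. The sketch below does that. It is a standalone helper, not part of PCB, and the function name is hypothetical.

```python
import json

# Required fields per task, taken from the schemas above.
REQUIRED_FIELDS = {
    "rag": {"id", "context", "question", "answer"},
    "summarization": {"id", "article", "summary"},
    "coding": {"id", "context", "docstring", "solution"},
}

def validate_jsonl(lines, task):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "expected a JSON object"))
            continue
        missing = REQUIRED_FIELDS[task] - set(record)
        if missing:
            problems.append((number, "missing fields: " + ", ".join(sorted(missing))))
    return problems

sample_lines = [
    '{"id": "my_001", "context": "c", "question": "q", "answer": "a"}',
    '{"id": "my_002", "context": "c"}',
    "not json",
]
for problem in validate_jsonl(sample_lines, "rag"):
    print(problem)
```

To check a file, pass an open file handle as lines, since iterating a text file yields one line at a time.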

Running on your data

pcb run --data-dir ./my_data --task rag --max-samples 50

# Compare specific compressors
pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2

# Export results
pcb run --data-dir ./my_data --output results.json
pcb run --data-dir ./my_data --output results.csv
pcb run --data-dir ./my_data --output results.html

Workflow: Benchmark to Production

Here is the full path from benchmarking to deploying a compressor in production.

Step 1: Benchmark on your actual data

pcb run --data-dir ./my_data --max-samples 50 --task rag \
        --daily-tokens 2000000 --cost-model claude-opus-4-7 \
        --output benchmark.json

Step 2: LLM-judge the top candidates

pcb run --data-dir ./my_data --compressor tfidf --compressor llmlingua2 \
        --llm-judge --judge-model claude-sonnet-4-6 --max-samples 30

Step 3: Deploy the winner

from pcb.middleware import CompressingAnthropic

client = CompressingAnthropic(compressor="llmlingua2", rate=0.40)
# Everything else in your codebase stays the same

Step 4: Monitor in production

import logging

logger = logging.getLogger("pcb")

# Log cumulative savings every 1,000 calls (client is the wrapper from Step 3)
if client.stats.calls % 1000 == 0:
    logger.info(
        "pcb savings: calls=%d saved=%d tokens (%.1f%%) est_monthly=$%.0f",
        client.stats.calls,
        client.stats.tokens_saved,
        client.stats.reduction_pct,
        client.stats.monthly_savings_usd(price_per_million=15.0, daily_calls_estimate=2000),
    )

CLI Reference

pcb run - benchmark

Options:
  -c, --compressor TEXT       Compressor to include (repeat for multiple). Default: all five.
  -t, --task TEXT             Task type: rag, summarization, coding (repeat for multiple).
  -n, --max-samples INT       Max samples per task.
  -r, --rate FLOAT            Target compression rate 0.0–1.0. Default: 0.5
  -d, --data-dir PATH         Directory with *_samples.jsonl files.
  -o, --output PATH           Save report as .json, .csv, or .html.
  -j, --llm-judge             Enable LLM-as-judge scoring via OpenRouter.
  -m, --judge-model TEXT      Model for LLM judge. Default: claude-sonnet-4-6.
      --openrouter-key TEXT   OpenRouter API key (or set OPENROUTER_API_KEY).
      --daily-tokens INT      Daily token volume for cost projection.
      --cost-model TEXT       Model name for cost lookup (e.g. claude-opus-4-7).
      --token-price FLOAT     Manual price override in $/1M tokens.

pcb compress - compress text

Arguments:
  [INPUT_FILE]                File to compress. Reads stdin if omitted.

Options:
  -c, --compressor TEXT       Algorithm. Default: tfidf.
  -r, --rate FLOAT            Fraction to remove. Default: 0.45.
  -o, --output PATH           Write to file instead of stdout.
  -s, --stats                 Print token stats to stderr.

Other commands

pcb list-compressors          # Show all algorithms
pcb list-models               # Show 75+ supported LLM judge models
pcb show-schema rag           # Show JSONL schema for a task type

Output Formats

JSON - full detail per sample

pcb run --output results.json

CSV - one row per compressor × task

pcb run --output results.csv

Columns: compressor, task, avg_token_reduction_pct, avg_quality_score, avg_quality_drop_pct, avg_llm_score, avg_llm_drop_pct, avg_latency_ms, num_samples

HTML - shareable visual report

pcb run --output results.html
# Open in any browser — Chart.js scatter plots, dark theme, Pareto highlights

When NOT to Use Compression

Short prompts (< 200 tokens): PCB skips these automatically, since the overhead exceeds the savings.
Cheap models (< $0.50/1M): DeepSeek, Gemini Flash, GPT-4.1-mini - savings too small.
High-precision tasks: Legal review, medical diagnosis - verify your quality floor with --llm-judge first.
Output-bottlenecked workloads: Compression only affects input tokens.

Project Structure

src/pcb/
├── cli.py                      # Typer CLI — all commands
├── config.py                   # Pydantic config and model pricing table
├── runner.py                   # Benchmark orchestration + BenchmarkReport
├── mcp_server.py               # FastMCP server for Claude Code / Codex
├── compressors/
│   ├── tfidf.py                # TF-IDF sentence scoring
│   ├── selective_context.py    # Greedy token-budget selection
│   ├── llmlingua.py            # Sentence-level coarse pruning
│   └── no_compression.py       # Passthrough baseline
├── tasks/
│   ├── rag.py                  # F1/EM/context-recall evaluator
│   ├── summarization.py        # ROUGE-L evaluator
│   └── coding.py               # BM25 + identifier preservation
├── evaluators/
│   └── llm_judge.py            # OpenRouter LLM-as-judge (75+ models)
├── reporters/
│   ├── terminal.py             # Rich terminal tables
│   ├── json_reporter.py        # JSON output
│   ├── csv_reporter.py         # CSV output
│   └── html_reporter.py        # Chart.js HTML report
├── middleware/
│   ├── anthropic_client.py     # CompressingAnthropic drop-in wrapper
│   └── openai_client.py        # CompressingOpenAI drop-in wrapper
└── data/
    ├── rag_samples.jsonl        # 20 real-world factual passages (400–450 tokens)
    ├── summarization_samples.jsonl  # 10 real news-style articles
    └── coding_samples.jsonl     # 10 real Python code contexts (370–800 tokens)

How I Built This Using NEO

I described the problem at a high level: a tool that benchmarks multiple prompt compression algorithms against real workloads, scores quality loss empirically, projects actual dollar savings at a given token volume, and makes it trivially easy to deploy the winning algorithm into an existing Anthropic or OpenAI codebase.

NEO built the entire thing autonomously: the Typer CLI with all commands and flags, all five compressor implementations, the F1/ROUGE-L/BM25 task evaluators, the OpenRouter LLM-as-judge with support for 75+ models, the cost projection engine with the model pricing table, the CompressingAnthropic and CompressingOpenAI drop-in wrappers, the FastMCP server with four tools, the JSON/CSV/HTML reporters, and the three bundled sample datasets (20 RAG passages, 10 summarization articles, and 10 coding contexts).

How You Can Use and Extend This With NEO

Use it before committing to a compression strategy.
Before wiring any compressor into your production stack, run pcb on a sample of your actual prompts. The benchmark tells you which algorithm preserves the most quality at your target compression rate, specific to your data rather than a generic recommendation.

Use it to justify the cost of compression infrastructure.
The cost projection output gives you monthly and annual savings at your actual token volume and model pricing. This is the number you need to make a case for adding compression to your pipeline, not a rough estimate but a measured projection against your workload.
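
The arithmetic behind such a projection is straightforward. The sketch below is a simplified version under stated assumptions, not pcb's cost engine: it counts input tokens only, ignores the compute cost of compression itself, and the volume, price, and reduction figures are illustrative.

```python
def projected_savings(monthly_input_tokens, price_per_million, token_reduction_pct):
    """Return (monthly, annual) dollar savings from removing a share of input tokens."""
    saved_tokens = monthly_input_tokens * token_reduction_pct / 100.0
    monthly = saved_tokens / 1_000_000 * price_per_million
    return monthly, monthly * 12

# Illustrative inputs: 500M input tokens/month, $3.00 per 1M tokens, 45% reduction.
monthly, annual = projected_savings(500_000_000, 3.00, 45.0)
print(f"${monthly:,.2f}/month, ${annual:,.2f}/year")
# -> $675.00/month, $8,100.00/year
```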

Use the MCP tools inside Claude Code sessions.
With the MCP server connected, you can ask Claude to compress a context, estimate savings, or recommend a compressor without leaving your coding environment. This makes compression a natural part of the agent workflow rather than a separate offline step.

Extend it with additional compressors.
The four compressors share a common interface in src/pcb/compressors/. A new algorithm (semantic chunking, abstractive summarization, or a custom retrieval-based approach) slots in as a new file in that directory and appears automatically in pcb run, pcb compress, and the MCP recommend tool without touching any other part of the codebase.
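
To make the idea concrete, here is what a very small compressor could look like. The interface shown (a name attribute and a compress(text, rate) method) is an assumption for illustration; the actual base class in src/pcb/compressors/ may differ, so check the repository before copying this. HeadTailCompressor is a hypothetical algorithm that keeps sentences from the start and end of the text, and it interprets rate as the fraction of sentences (not tokens) to remove.

```python
import re

class HeadTailCompressor:
    """Hypothetical example: keep leading and trailing sentences, drop the middle."""
    name = "head_tail"

    def compress(self, text: str, rate: float = 0.45) -> str:
        # Naive sentence split on ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        keep = max(1, round(len(sentences) * (1.0 - rate)))
        head = (keep + 1) // 2          # sentences kept from the start
        tail = keep - head              # sentences kept from the end
        kept = sentences[:head]
        if tail:
            kept += sentences[len(sentences) - tail:]
        return " ".join(kept)

print(HeadTailCompressor().compress("A. B. C. D.", rate=0.5))
# -> A. D.
```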

Final Notes

Most teams discover they are over-spending on input tokens only after the bill arrives. pcb gives you the benchmark data to make an informed decision before committing (which algorithm, at what rate, for which task type) and the deployment tooling to act on it immediately.

The code is at https://github.com/dakshjain-1616/Prompt-Compression-Benchmarker
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code

ContextCraft: A Visual Workbench for Building and Managing LLM Context Windows

Nilofer 🚀 — Tue, 05 May 2026 16:00:35 +0000

Building a good LLM prompt is not a one-shot task. You assemble pieces (a system message, a few examples, some context, the actual instruction) and then you iterate. You compress things that are too long, test whether the output still holds up, check how many tokens you are spending, and save versions so you can roll back when something breaks.

Most developers do this in a text editor, a notebook, or scattered across a handful of scripts. There is no single place where you can see the whole context window, manipulate it visually, compress a block, run a live test, and save a snapshot, all without switching tools.

ContextCraft is that place. It is a canvas-based interactive workbench for assembling, compressing, testing, and versioning LLM context windows. It runs locally, connects to Ollama for local compression and testing, supports OpenRouter for cloud LLM testing, and exports directly to OpenAI, Anthropic, LangChain, and JSON formats.

Features

Visual Canvas: Drag and drop interface for organizing prompt blocks with real-time token counting and visual progress bars.

Smart Compression: AI-powered compression using Ollama with semantic preservation. Set a target compression ratio, choose whether to preserve structure, and review a before/after comparison before applying.

Coverage Analysis: Semantic similarity scoring between original and compressed content. Key concept preservation is surfaced as a score so you know exactly what you are trading for token savings.

LLM Testing: Test prompts with streaming responses from Ollama or OpenRouter directly from the canvas. Select provider, model, and temperature and view responses in real time.

Version Control: Save and restore canvas versions with a SQLite backend. Name versions for easy reference and compare two versions to see what changed.

Multi-Format Export: Export to OpenAI, Anthropic, LangChain, and JSON formats. Copy the generated code and paste directly into your application.

Block Library: Pre-built starter blocks for common use cases, available from the sidebar. Add your own blocks to the library for reuse across canvases.

Architecture

ContextCraft is split into a FastAPI backend and a React + Vite frontend.

The backend handles token counting via tiktoken, semantic similarity analysis via sentence-transformers, compression via Ollama, streaming LLM test responses via Ollama or OpenRouter, SQLite-backed version management, and export format generation. The frontend renders the visual canvas with drag-and-drop via @hello-pangea/dnd and code editing via CodeMirror.

contextcraft/
├── server/                 # FastAPI backend
│   ├── main.py            # FastAPI app entry point
│   ├── models.py          # Pydantic data models
│   ├── tokenizer.py       # Token counting (tiktoken)
│   ├── coverage.py        # Semantic similarity analysis
│   ├── compress.py        # Ollama compression service
│   ├── tester.py          # LLM streaming test service
│   ├── export.py          # Export format generators
│   ├── versions.py        # SQLite version management
│   └── pricing.py         # OpenRouter pricing API
├── frontend/              # React + Vite frontend
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── hooks/         # Custom React hooks
│   │   └── App.jsx        # Main application
│   └── package.json
├── cli/                   # CLI entry point
│   └── main.py
└── pyproject.toml         # Python package config

Getting Started

Prerequisites

Python 3.9+
Node.js 18+
Ollama (optional, for local compression and testing)
OpenRouter API key (optional, for cloud LLM testing)

Installation

# Clone the repository
git clone https://github.com/contextcraft/contextcraft.git
cd contextcraft

# Install Python dependencies
pip install -e ".[dev]"

# Install frontend dependencies
cd frontend
npm install
cd ..

# Initialize the database
contextcraft init-db

Running the Application

# Start the server and frontend
contextcraft serve

# Or start with custom options
contextcraft serve --host 0.0.0.0 --port 8000 --frontend-port 3000

Once running with the default options, the frontend is available at localhost:5173 and the API docs at localhost:8000/docs.

Configuration

Create a .env file in the project root:

# OpenRouter API key (for cloud LLM testing)
OPENROUTER_API_KEY=your_api_key_here

# Ollama URL (default: http://localhost:11434)
OLLAMA_URL=http://localhost:11434

# Default compression model
DEFAULT_COMPRESSION_MODEL=gemma2:2b
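
As a rough illustration of how these settings might be consumed, the sketch below reads the three variables with the defaults documented above. load_settings is a hypothetical helper, not ContextCraft's actual loader, and it reads variables that are already exported rather than parsing the .env file itself.

```python
import os

def load_settings(env=None):
    """Read ContextCraft-style settings from an environment mapping."""
    env = os.environ if env is None else env
    return {
        # No key means cloud testing via OpenRouter is unavailable.
        "openrouter_api_key": env.get("OPENROUTER_API_KEY"),
        "ollama_url": env.get("OLLAMA_URL", "http://localhost:11434"),
        "compression_model": env.get("DEFAULT_COMPRESSION_MODEL", "gemma2:2b"),
    }

settings = load_settings({"OLLAMA_URL": "http://ollama.internal:11434"})
print(settings["ollama_url"], settings["compression_model"])
```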

Supported Models

Token Counting: GPT-4, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo, Claude 3 Opus, Sonnet, Haiku, Claude 3.5 Sonnet

Compression (via Ollama): gemma2:2b (default), any Ollama-compatible model

Testing: Ollama local models, OpenRouter cloud models (requires API key)

Usage Guide

Creating a Canvas: Start with an empty canvas or load from the library. Add blocks using the sidebar buttons or drag from the library. Arrange blocks by dragging to reorder. Edit block content inline or in the full editor.

Compressing Content: Click the compress icon on any block. Set the target compression ratio (0.1 to 0.9). Choose whether to preserve structure. Review the before/after comparison. Apply compression when satisfied.

Testing Prompts: Add your prompt blocks to the canvas. Click the Test button. Select provider (Ollama or OpenRouter). Choose model and set temperature. View streaming responses in real time.

Analyzing Coverage: Compress one or more blocks. Click the Coverage button. View semantic similarity scores. Check key concept preservation.

Managing Versions: Click Versions to save the current state. Name your version for easy reference. Restore previous versions at any time. Compare versions to see changes.

Exporting: Click Export when ready. Choose a format: OpenAI, Anthropic, LangChain, or JSON. Copy the generated code. Paste into your application.

API Endpoints

Token Management

POST /api/tokenize - count tokens for text or blocks
GET /api/pricing - get model pricing information

Compression

POST /api/compress - compress text using Ollama
POST /api/coverage - analyze semantic coverage between original and compressed

Testing

POST /api/test - stream LLM responses from Ollama or OpenRouter

Versioning

GET /api/versions - list all versions
POST /api/versions - save a new version
GET /api/versions/{id} - get a specific version
POST /api/versions/{id}/restore - restore a version
POST /api/versions/compare - compare two versions

Export

POST /api/export - export canvas to OpenAI, Anthropic, LangChain, or JSON format

Library

GET /api/library - get starter block library
POST /api/library - add a block to the library

CLI Commands

# Start the application
contextcraft serve

# Initialize database
contextcraft init-db

# Add a block to library
contextcraft add-block --type system --label "My Template" --content "..."

# Get help
contextcraft --help

Docker

# Build image
docker build -t contextcraft .

# Run container
docker run -p 8000:8000 -p 5173:5173 contextcraft

Development

Backend

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black server/ cli/
isort server/ cli/

# Type checking
mypy server/ cli/

Frontend

cd frontend

# Start dev server
npm run dev

# Build for production
npm run build

# Run linter
npm run lint

Contributing

Fork the repository. Create a feature branch:

git checkout -b feature/amazing-feature

Commit your changes:

git commit -m 'Add amazing feature'

Push to the branch:

git push origin feature/amazing-feature

Then open a Pull Request.

How I Built This Using NEO

This tool was designed, built, debugged, and iterated entirely using NEO, an autonomous AI engineering agent that writes, runs, and refines real code end-to-end.

ContextCraft is a full-stack application: a FastAPI backend, a React + Vite frontend, a CLI, and a SQLite-backed versioning layer. Every part of the system was generated and connected through NEO: the backend services for token counting, semantic coverage analysis, compression via Ollama, streaming LLM testing, export pipelines, version management, and pricing integration, along with the interactive frontend canvas for assembling prompt blocks with drag-and-drop, inline editing, and real-time token tracking.

The compression and coverage pipeline, the live testing flow across Ollama and OpenRouter, the version save/restore and comparison system, and the multi-format export layer were all built end-to-end from a high-level problem description. NEO handled the full cycle - generating code, wiring components, resolving issues, and refining the system into a working product.

How You Can Use and Extend This With NEO

Prompt engineering workbench: Instead of iterating on prompts in a text editor and manually counting tokens, assemble your context window visually, compress blocks that are too long, and test the result, all in one place. The version control means you never lose a working configuration while experimenting.

Evaluate compression quality before shipping: Before deploying a compressed prompt to production, run coverage analysis to get a semantic similarity score between the original and compressed version. You know exactly how much meaning you are trading for token savings, not just a token count but an actual semantic measurement.

Manage prompt libraries across projects: The block library lets you save reusable prompt blocks and load them into any canvas. Teams building multiple LLM products can maintain a shared library of tested, versioned prompt components.

Extend it with additional export formats: The export module currently supports OpenAI, Anthropic, LangChain, and JSON. Adding a new format follows the same pattern in export.py and surfaces automatically in the Export UI without touching any other part of the stack.
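
One reason separate exporters are needed is that providers structure the same prompt differently: OpenAI chat requests carry the system prompt as a message with role "system", while Anthropic's Messages API takes it as a top-level system field. The sketch below illustrates that difference with a registry of exporter functions. The block shape (dicts with type and content keys) and the EXPORTERS registry are assumptions for illustration, not the actual structure of export.py.

```python
def to_openai(blocks):
    """OpenAI chat format: the system prompt is just another message."""
    return {"messages": [{"role": b["type"], "content": b["content"]} for b in blocks]}

def to_anthropic(blocks):
    """Anthropic Messages format: system text is a top-level field."""
    system = "\n\n".join(b["content"] for b in blocks if b["type"] == "system")
    messages = [
        {"role": b["type"], "content": b["content"]}
        for b in blocks
        if b["type"] != "system"
    ]
    return {"system": system, "messages": messages}

# Hypothetical registry: a new format would be one more entry here.
EXPORTERS = {"openai": to_openai, "anthropic": to_anthropic}

blocks = [
    {"type": "system", "content": "You are a concise assistant."},
    {"type": "user", "content": "Summarize the release notes."},
]
for name, export in EXPORTERS.items():
    print(name, export(blocks))
```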

Final Notes

Context window management is one of those problems that looks simple until you are doing it seriously. ContextCraft brings together the pieces that are usually scattered across different tools (visual assembly, token counting, AI compression, semantic coverage analysis, live testing, version control, and export) into a single local workbench.

The code is at https://github.com/dakshjain-1616/ContextCraft
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can also use NEO MCP with Claude Code: https://heyneo.com/claude-code

LLM Behavior Diff: Model Update Detector

Nilofer 🚀 — Mon, 04 May 2026 11:18:41 +0000

You swap a model. The new one scores better on your benchmarks. You deploy it. Two days later, a user reports that something that used to work reliably now behaves differently.

The benchmark never caught it because benchmarks measure averages. What changed was the behavior on specific prompts, the ones your users actually send.

LLM Behavior Diff is a tool that catches this before it happens. Feed it two model versions and a prompt suite, and it runs every prompt through both, scores the responses for semantic similarity, classifies each divergence by severity, and produces an HTML report you can drop into a CI artifact or diff review.

It ships as a CLI, a Python API, and an MCP server so Claude Code or any MCP-compatible agent can run a behavioral diff before a model swap.

The Problem With Model Updates

Every model update is a tradeoff. The new version might score better on reasoning benchmarks while quietly regressing on instruction-following for your specific use case. Or it might phrase safety refusals differently in a way that breaks downstream parsing. Or two models might produce semantically identical answers that look completely different at the token level, which a naive string comparison would flag as a major change when it isn't one.

LLM Behavior Diff addresses all three scenarios. Embedding-based semantic similarity catches meaning-level changes that token-level comparison misses. The LLM-as-judge layer adds a reasoning layer for ambiguous cases. Severity classification separates noise from real regressions.

How It Works

The pipeline runs in five steps for every prompt in your suite:

Load: A YAML prompt suite is loaded into a PromptSuite Pydantic model. Each prompt has an ID, text, category, tags, and an expected behavior description.

Run: Each prompt is sent through Model A and Model B via LLMRunner. Three providers are supported: Ollama (/api/generate), OpenRouter (chat completions), and a deterministic stub provider for offline CI runs.

Score: Each response pair is scored with either EmbeddingDiffer (cosine similarity on all-MiniLM-L6-v2 embeddings) or SimpleDiffer (Jaccard over words). Optionally, an LLM-as-judge score is combined with the similarity score; the default judge model is google/gemini-2.0-flash-lite-001 via OpenRouter.

Classify: Each prompt is classified against the --threshold. Changes are bucketed by severity: combined score >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major.

Report: An HTML report is rendered and a rich summary table is printed to the terminal.
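
The Classify step can be sketched in a few lines. The severity cut-offs come from the description above; treating a prompt as changed when its combined score falls below the threshold is an assumption about how --threshold is applied, and classify is a hypothetical helper rather than the tool's own function.

```python
def classify(combined_score, threshold):
    """Return (behavioral_change, severity) for one prompt."""
    if combined_score >= threshold:
        return False, "none"       # similar enough: no change reported
    if combined_score >= 0.7:
        return True, "minor"
    if combined_score >= 0.4:
        return True, "moderate"
    return True, "major"

for score in (0.90, 0.78, 0.56, 0.30):
    print(score, classify(score, threshold=0.85))
```

With a threshold of 0.85, these four scores land in none, minor, moderate, and major respectively. Lowering the threshold to 0.5 would make the first three count as unchanged.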

Why Embeddings Over Token Matching

The difference matters. Here is the same two-model comparison run two ways.

With --use-embeddings (cosine on all-MiniLM-L6-v2):

Avg Similarity: 91.4%
Changes Detected: 0 of 5

With --no-use-embeddings (Jaccard fallback):

Avg Similarity: 25.0%
Changes Detected: 5 of 5

Same two models, same prompts, completely opposite conclusions. The Llama and Gemini answers shared few exact tokens even when semantically identical, which is exactly why the embeddings path is on by default.
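
A small pure-Python sketch shows why word overlap collapses on paraphrases. The jaccard function below lowercases and splits on whitespace, which is an assumption and may not match SimpleDiffer's exact tokenization; cosine is the standard formula that an embedding-based differ would apply to sentence vectors (the embedding model itself is not loaded here).

```python
import math

def jaccard(a, b):
    """Word-level Jaccard similarity: |intersection| / |union|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(jaccard("the cat sat", "the cat ran"))
# -> 0.5
print(jaccard("The answer is 4.", "Two plus two equals four."))
# -> 0.0
```

The second pair says the same thing but shares no words, so Jaccard reports zero similarity, whereas vectors for the two sentences from an embedding model would typically point in similar directions.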

Getting Started

pip install -e .

Requires Python 3.11+. Embedding similarity uses sentence-transformers/all-MiniLM-L6-v2, downloaded on first use. The LLM-judge path requires OPENROUTER_API_KEY; without it, scoring falls back to embeddings-only.

Running a Diff

Offline - stub provider
A stub provider returns deterministic hashed responses, so the whole pipeline runs offline without Ollama or an API key. Good for CI and testing the setup:

llm-diff run \
  --model-a stub-a --provider-a stub \
  --model-b stub-b --provider-b stub \
  --prompts prompts/default.yaml \
  --output output/report.html \
  --no-use-embeddings

Real output from this run (stub + Jaccard, threshold 0.5):

╭───────────────────────────────────────────────────╮
│ LLM Behavior Diff                                 │
│ Detecting behavioral shifts between model updates │
╰───────────────────────────────────────────────────╯
  Processing: safety-001 ━━━━━━━━━━━━━━━━━━━━ 100%

Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 3     │
│ Change Rate      │ 60.0% │
│ Avg Similarity   │ 40.0% │
└──────────────────┴───────┘
Report saved to: output/stub_jaccard.html

Real models - OpenRouter

export OPENROUTER_API_KEY=sk-or-...
llm-diff run \
  --model-a meta-llama/llama-3.2-3b-instruct --provider-a openrouter \
  --model-b google/gemini-2.0-flash-lite-001 --provider-b openrouter \
  --prompts prompts/default.yaml \
  --output output/or_emb.html \
  --use-embeddings --threshold 0.85

Real output (embeddings only):

Comparison Summary
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric           ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Prompts    │ 5     │
│ Changes Detected │ 0     │
│ Change Rate      │ 0.0%  │
│ Avg Similarity   │ 91.4% │
└──────────────────┴───────┘

Adding --use-judge brings the average similarity to 91.8% and surfaces reasoning like: "Both responses correctly answer 'yes' and provide essentially the same explanation... Response A is slightly more verbose, but the core meaning is identical."

Real models - Ollama

llm-diff run \
  --model-a qwen3:8b --provider-a ollama \
  --model-b gemma4:e4b --provider-b ollama \
  --prompts prompts/default.yaml \
  --output output/report.html \
  --use-embeddings --threshold 0.85

CLI Reference

llm-diff --help

Usage: llm-diff [OPTIONS] COMMAND [ARGS]...

 LLM Behavior Diff — Model Update Detector

 --version                Show version information
 --help                   Show this message and exit.

 Commands
   run  Run a comparison between two models.

llm-diff --version
LLM Behavior Diff version 0.1.0

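Key options for llm-diff run, as used in the examples above:

--model-a, --model-b - the two models to compare.
--provider-a, --provider-b - the provider for each model: stub, ollama, or openrouter.
--prompts - path to the YAML prompt suite.
--output - path for the HTML report.
--use-embeddings / --no-use-embeddings - embedding cosine similarity (on by default) or the Jaccard fallback.
--use-judge - add the LLM-as-judge score (requires OPENROUTER_API_KEY).
--threshold - the score each prompt is classified against when detecting a change.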

Severity buckets applied when a change is detected: combined >= 0.7 is minor, >= 0.4 is moderate, < 0.4 is major.

Prompt Suite Format

The prompt suite is a YAML file. prompts/default.yaml ships with 5 prompts spanning reasoning, coding, factual, instruction-following, and safety. You can write your own:

name: "My suite"
version: "1.0.0"
prompts:
  - id: "code-001"
    text: "Write a Python function reverse_string(s)..."
    category: "coding"
    tags: ["python"]
    expected_behavior: "Short correct function"

IDs must be unique. Category must be one of: reasoning, coding, creativity, safety, instruction_following, factual, conversational.
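
The two rules above are easy to check before a run. The sketch below validates an already-parsed suite (a plain dict, as a YAML loader would produce) without depending on the package; validate_suite is a hypothetical helper, and the real PromptSuite Pydantic model may enforce more than this.

```python
ALLOWED_CATEGORIES = {
    "reasoning", "coding", "creativity", "safety",
    "instruction_following", "factual", "conversational",
}

def validate_suite(suite):
    """Return a list of human-readable problems; empty means the suite looks valid."""
    errors = []
    seen = set()
    for prompt in suite.get("prompts", []):
        pid = prompt.get("id")
        if pid in seen:
            errors.append(f"duplicate id: {pid}")
        seen.add(pid)
        category = prompt.get("category")
        if category not in ALLOWED_CATEGORIES:
            errors.append(f"{pid}: unknown category {category!r}")
    return errors

suite = {
    "name": "My suite",
    "version": "1.0.0",
    "prompts": [
        {"id": "code-001", "text": "...", "category": "coding"},
        {"id": "code-001", "text": "...", "category": "instruction-following"},
    ],
}
print(validate_suite(suite))
```

The second prompt trips both rules: it reuses an ID and spells the category with a hyphen instead of an underscore.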

Python API

The full pipeline is available as a library. A synchronous one-shot call:

from llm_behavior_diff.runner import run_prompt_sync
from llm_behavior_diff.models import ModelConfig, ProviderType

resp = run_prompt_sync(
    ModelConfig(name="stub-m", provider=ProviderType.STUB),
    prompt_id="p1",
    prompt_text="hello world",
)
print(resp.text, resp.success)
# -> Model stub-m says: 921fac0c4c True

Similarity scoring directly:

from llm_behavior_diff.differ import SimpleDiffer, EmbeddingDiffer, create_differ

d = SimpleDiffer()
print(d.compute_similarity("the cat sat", "the cat ran"))   # 0.5

e = EmbeddingDiffer()
print(e.compute_similarity("The answer is 4.", "Two plus two equals four."))
# -> ~0.59

create_differ(use_embeddings=False) returns a SimpleDiffer (Jaccard). create_differ(use_embeddings=True) returns an EmbeddingDiffer if sentence-transformers is importable, otherwise it falls back to SimpleDiffer.

Generating a report from a ComparisonRun:

from llm_behavior_diff.report import ReportGenerator
from llm_behavior_diff.models import Settings

ReportGenerator().save_report(run, Settings(), "out.html")

ReportGenerator looks for a Jinja template in the CWD, the package directory, and a legacy path, then falls back to a built-in template so reports always render.

MCP Server

The tool also runs as an MCP server over stdio transport, exposing three tools so Claude Code or any MCP-compatible agent can trigger a behavioral diff during a session:

llm-diff-mcp
# or: python -m llm_behavior_diff.mcp_server

The three exposed tools:

compare_models - runs a full prompt suite through two models and returns per-prompt similarity, severity, and response text.
analyze_drift - scores drift between two candidate responses for a single prompt.
generate_report - renders an HTML summary from a JSON list of results.

Claude Code config:

{
  "mcpServers": {
    "llm-behavior-diff": {
      "command": "llm-diff-mcp"
    }
  }
}

Smoke test - all three tools, offline, via Python:

import asyncio, json
from llm_behavior_diff.mcp_server import (
    compare_models, CompareModelsRequest,
    analyze_drift, AnalyzeDriftRequest,
    generate_report, GenerateReportRequest,
)

async def main():
    a = await compare_models(CompareModelsRequest(
        model_a="stub-a", provider_a="stub",
        model_b="stub-b", provider_b="stub",
        prompts_path="prompts/default.yaml",
        threshold=0.5, use_embeddings=False,
    ))
    print(a.total_prompts, a.changes_detected, a.avg_similarity)
    # real: 5 4 0.3446

    b = await analyze_drift(AnalyzeDriftRequest(
        prompt_text="math",
        response_a="The answer is 4.",
        response_b="2+2 equals 4.",
        use_embeddings=True,
    ))
    print(b.embedding_similarity, b.severity)
    # real: 0.5572 moderate

    c = await generate_report(GenerateReportRequest(
        results_json=json.dumps([
            {"prompt_id":"p1","similarity_score":0.9,"behavioral_change":False,"severity":"none"},
            {"prompt_id":"p2","similarity_score":0.3,"behavioral_change":True,"severity":"major"},
        ]),
        output_path="output/mcp_report.html",
        title="MCP Smoke",
    ))
    print(c.success, c.output_path)

asyncio.run(main())

Verified over stdio JSON-RPC:

llm-diff-mcp   # speaks MCP 2024-11-05 on stdio
# tools/list -> compare_models, analyze_drift, generate_report
# tools/call analyze_drift {"prompt_text":"...","response_a":"Paris",
#   "response_b":"The capital is Paris.","use_embeddings":true}
# -> {"embedding_similarity":0.7761,"severity":"minor", ...}

Limitations

LLM-as-judge requires OpenRouter - Without OPENROUTER_API_KEY, judging is skipped and the combined score equals the embedding or Jaccard similarity alone.
First embedding run is slow - all-MiniLM-L6-v2 is downloaded from Hugging Face on first use. Subsequent runs use the local cache.
Ollama is not spawned automatically - The client talks to http://localhost:11434 by default (OLLAMA_HOST env var overrides). Ollama must already be running.
Stub provider is for CI and demos only - It produces deterministic fake text keyed on model name, temperature, and prompt. Not suitable for real behavioral conclusions.
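
To show what "deterministic fake text" means in practice, here is a minimal sketch of a stub in the same spirit. It is not the package's implementation: the choice of SHA-256, the key layout, and the function name stub_response are assumptions, so the digests it prints will not match the real stub's output.

```python
import hashlib

def stub_response(model_name, prompt_text, temperature=0.0):
    """Return fake but repeatable text keyed on model, temperature, and prompt."""
    key = f"{model_name}|{temperature}|{prompt_text}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:10]
    return f"Model {model_name} says: {digest}"

# Same inputs always give the same text; a different model name gives different text.
print(stub_response("stub-a", "hello world"))
print(stub_response("stub-b", "hello world"))
```

Because two stub "models" differ only by a hash, any similarity score between them reflects the scoring code path, not real model behavior.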

How You Can Use This

Gate model upgrades in CI before they ship: Add llm-diff run to your deployment pipeline. Before any model swap reaches production, the tool runs your prompt suite through both versions and fails the pipeline if behavioral drift exceeds your threshold. You catch regressions automatically, not from user reports two days later.

Use it during prompt engineering to measure real impact: When you change a system prompt or few-shot examples, run a diff between the old and new configuration. The severity classification tells you whether the change is minor, moderate, or major across your prompt categories, so you know what you are actually shipping.

Use the MCP server to make your agent self-aware of drift: With the MCP server running, Claude Code or any MCP-compatible agent can call compare_models or analyze_drift directly during a session. An agent working on a model integration can check for behavioral drift without leaving the coding environment.

Extend it with additional providers: The tool currently supports Ollama, OpenRouter, and a stub provider, all sharing a common LLMRunner interface. Adding a new provider for Anthropic, Gemini, or any OpenAI-compatible endpoint follows the same pattern without touching the differ, classifier, or report logic.
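
For the CI gating idea, one option is a small script that decides pass or fail from per-prompt results. The sketch below assumes results shaped like the records passed to generate_report in the smoke test (prompt_id, similarity_score, behavioral_change, severity). The gate function, its defaults, and the policy of blocking on "major" changes are hypothetical choices, not behavior built into llm-diff.

```python
def gate(results, max_change_rate=0.2, blocking_severities=("major",)):
    """Return (passed, change_rate, blocking_prompt_ids) for a list of per-prompt results."""
    changed = [r for r in results if r["behavioral_change"]]
    change_rate = len(changed) / len(results) if results else 0.0
    blocking = [r["prompt_id"] for r in changed if r["severity"] in blocking_severities]
    passed = change_rate <= max_change_rate and not blocking
    return passed, change_rate, blocking

results = [
    {"prompt_id": "p1", "similarity_score": 0.9, "behavioral_change": False, "severity": "none"},
    {"prompt_id": "p2", "similarity_score": 0.3, "behavioral_change": True, "severity": "major"},
]
passed, rate, blocking = gate(results)
print(passed, rate, blocking)
# -> False 0.5 ['p2']
```

In a pipeline, the script would exit non-zero when passed is False so the model swap is stopped before deployment.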

Final Notes

Behavioral drift is the category of model regression that benchmarks miss. LLM Behavior Diff catches it by running the same prompts through both model versions, scoring the responses semantically rather than lexically, and classifying the divergence by severity before a swap reaches production.

The code is at https://github.com/dakshjain-1616/-LLM-Behavior-Diff-Model-Update-Detector

You can also build with NEO in your IDE using the VS Code extension or Cursor.

NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.

AI Slop Cleaner: Automating Your Codebase Hygiene

Nilofer 🚀 — Thu, 30 Apr 2026 09:56:31 +0000

Every codebase accumulates clutter over time. An import left behind after a refactor. A helper function that nothing calls anymore. A method that grew too complex to reason about. None of it breaks anything immediately, but it slows down every developer who reads through it, and it silently raises the cost of every future change.

The usual fix is a manual review pass. Someone spends an hour looking for unused imports, searching for dead functions, flagging complexity hotspots. It is tedious, inconsistent, and happens far less often than it should.

Slop Cleaner is a CLI tool that does this automatically. It detects unused imports, dead functions and classes, and over-complex code using tree-sitter AST analysis (not regex) so it never removes an import that appears in a docstring or string annotation. Every patch is atomic: backed up before writing, and rolled back automatically if your test suite fails.

What slop-cleaner Detects and Fixes

slop-cleaner targets three specific categories of clutter that accumulate in every codebase:

Unused imports: Imports left behind after refactors. These are detected at HIGH confidence and removed automatically. The tool handles single-line imports, selective removal from from X import A, B blocks where only one name is unused, and multi-line from X import (...) blocks where individual lines are surgically removed.

Dead functions and classes: Defined but never called anywhere in the codebase. These are detected at MEDIUM confidence and flagged in the report for human review. The tool builds a full call graph across all symbols before making this determination.

Over-complex functions: Functions with cyclomatic complexity above the threshold (default 10). These are flagged at MEDIUM confidence for manual refactoring. The tool never auto-removes a function; complexity is a signal that something needs attention, not a safe automatic fix.

HIGH confidence issues are fixed automatically. MEDIUM confidence issues are always left to human judgment.
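
The core idea behind AST-based unused-import detection can be shown with Python's standard ast module. Note the substitution: slop-cleaner uses tree-sitter, while this sketch uses the stdlib parser to stay dependency-free, and unused_imports is a hypothetical helper. It only counts real identifier usages, so a name mentioned in a docstring does not count as a use. It does not inspect string annotations or __all__, which a production tool has to handle.

```python
import ast

def unused_imports(source):
    """Return imported names that never appear as identifiers in the module."""
    tree = ast.parse(source)
    imported = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                imported[alias.asname or alias.name.split(".")[0]] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":
                    imported[alias.asname or alias.name] = node.lineno
    # Every identifier reference is an ast.Name, including `os` in `os.sep`.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(name for name in imported if name not in used)

SAMPLE = '''
import os
import json
from typing import List, Optional

def cwd_parts() -> List[str]:
    """Mentions json only in this docstring."""
    return os.getcwd().split(os.sep)
'''
print(unused_imports(SAMPLE))
# -> ['Optional', 'json']
```

A regex search for "json" would find the docstring mention and wrongly treat the import as used; walking the syntax tree avoids that.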

The 5-Phase Pipeline

Everything slop-cleaner does runs as a five-phase pipeline. Each phase feeds into the next, and every patch is atomic: backed up before writing and rolled back automatically if your tests fail.

Audit: Parses every .py and .ts/.tsx file with tree-sitter. Collects unused imports at HIGH confidence and high-complexity functions at MEDIUM confidence.

Analyze: Builds a call graph across all symbols in the project. Identifies dead code: symbols that are defined but never called.

Clean: Applies HIGH-confidence patches atomically. Handles single-line and multi-line from X import (...) blocks. Backs up each file before writing.

Verify: Runs your test suite with pytest. On any failure, rolls back every patched file to its backup automatically. Returns exit code 1 so CI catches it.

Document: Generates three Markdown reports: ARCHITECTURE.md, FUNCTION_MAP.md, and SLOP_REPORT.md, covered in detail below.
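
The Clean and Verify phases together amount to a backup, patch, test, and rollback cycle. The sketch below shows one way to structure that with the standard library. apply_patches_atomically is a hypothetical function, not slop-cleaner's API, and verify stands in for the test run (for example a callable that runs pytest and returns True on exit code 0).

```python
import shutil
import tempfile
from pathlib import Path

def apply_patches_atomically(patches, verify):
    """Apply {Path: new_text} patches; restore every file if verify() fails or raises."""
    backup_dir = Path(tempfile.mkdtemp(prefix="slop-backup-"))
    backups = {}
    ok = False
    try:
        for index, (path, new_text) in enumerate(patches.items()):
            backup = backup_dir / f"{index}.bak"
            shutil.copy2(path, backup)           # back up before writing
            backups[path] = backup
            path.write_text(new_text, encoding="utf-8")
        ok = bool(verify())                      # e.g. run the test suite
        return ok
    finally:
        if not ok:
            for path, backup in backups.items():
                shutil.copy2(backup, path)       # roll back every patched file
        shutil.rmtree(backup_dir, ignore_errors=True)
```

A caller would map a False return to exit code 1 so that CI notices the rollback.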

What Gets Fixed Automatically vs. Flagged

The distinction matters. HIGH confidence patches are cases where the tool is certain removal is safe. MEDIUM confidence cases (dead code and complexity) could involve dynamic dispatch, reflection, or other patterns that make automatic removal risky. The tool flags these and leaves the decision to you.

Edge Cases Handled

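One of the hardest parts of detecting unused imports is knowing when a name that looks unused is actually needed. slop-cleaner is built to handle these cases, such as the aliases, multi-line imports, and string annotations exercised by the bundled event_pipeline example.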

These are exactly the cases where regex-based tools get it wrong. Because slop-cleaner parses an actual AST, it understands the difference between a name appearing inside a string and a name being used as an identifier.

Installation

git clone https://github.com/your-org/slop-cleaner
cd slop-cleaner

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate      # macOS / Linux
# .venv\Scripts\activate       # Windows

# Install the tool and its dependencies
pip install -e .

This registers two CLI commands: slop-audit and slop-clean.
Dependencies installed automatically via pyproject.toml:

tree-sitter>=0.25
tree-sitter-python>=0.23
tree-sitter-typescript>=0.23
rich>=13

Note for modern Linux (Ubuntu 23.04+, Debian 12+): system Python blocks global pip install. Always use a virtual environment as shown above, or use pipx install . to install the CLI tools globally without a venv.

To run the test suite:

pip install -e ".[test]"   # adds pytest + pytest-cov
pytest                     # runs tests/test_parsers.py (22 tests)

Quick Start

# Audit a project — report issues, exit 1 if any found (CI-friendly)
slop-audit path/to/project/

# Audit a single file
slop-audit src/services/user_service.py

# Full clean — audit, fix, verify tests, generate docs
slop-clean path/to/project/

# Dry run — show what would change without touching files
slop-clean path/to/project/ --dry-run

# Write audit JSON for tooling integration
slop-audit path/to/project/ --output report.json

Commands

slop-audit
slop-audit scans a file or directory and prints a table of all issues found without touching anything. It exits with code 1 if issues are found, making it a clean CI gate.

slop-audit <target> [--output JSON] [--threshold N] [--verbose]

Exit codes: 0 = clean, 1 = issues found.

slop-clean
slop-clean runs the full 5-phase pipeline. Always run --dry-run first on an unfamiliar project: it shows exactly what patches would be applied without touching any file.

slop-clean <target> [--output DIR] [--threshold N] [--dry-run] [--verbose]

Exit codes: 0 = success, 1 = rollback was triggered.

Generated Output

After slop-clean runs, the --output directory contains three reports:

slop-output/
├── ARCHITECTURE.md   — file tree + Mermaid dependency graph
├── FUNCTION_MAP.md   — every symbol with start line and complexity score
└── SLOP_REPORT.md    — issue summary, patches applied, dead-code candidates

ARCHITECTURE.md gives a structural overview of the codebase with a visual Mermaid call graph, useful for understanding how the project fits together at a glance.

FUNCTION_MAP.md is a complete index of every function and class with its start line and complexity score. For large codebases, this is the fastest way to see where complexity is concentrated.

SLOP_REPORT.md is the actionable output: what was fixed automatically, what was flagged for manual review, and which dead-code candidates need a human decision.
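For a feel of what sits behind a FUNCTION_MAP.md row, here is a rough sketch that computes a start line and a complexity score per function. It assumes the common approximation of cyclomatic complexity (one plus the number of decision points) and uses the standard-library ast module instead of tree-sitter. The exact rules slop-cleaner applies are not shown in this post, so treat the node list below as an assumption.

```python
import ast

# Node types counted as one decision point each (an assumed, simplified rule set)
BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(func):
    """1 + decision points. Nested functions are counted into their parent."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths
            score += len(node.values) - 1
    return score


def function_map(source):
    """Return (name, start_line, complexity) rows sorted by start line."""
    rows = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            rows.append((node.name, node.lineno, cyclomatic_complexity(node)))
    return sorted(rows, key=lambda row: row[1])


SAMPLE = '''def simple():
    return 1

def branchy(x, y):
    if x and y:
        return 1
    for i in range(x):
        if i % 2:
            return i
    return 0
'''

for name, line, score in function_map(SAMPLE):
    print(f"{name:10} line {line:<3} complexity {score}")
```

Sorting these rows by score instead of line number gives the "where is complexity concentrated" view directly.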

Trying It on the Example Projects

Two sample projects ship in examples/ so you can try the tool immediately after installing:

# Audit only — no test verification needed
slop-audit examples/todo_app/ --verbose
slop-audit examples/event_pipeline/ --verbose

# Full clean — dry run first, then apply
slop-clean examples/todo_app/ --dry-run
slop-clean examples/todo_app/

To run the example test suites directly, change into the project directory first so their imports resolve correctly:

cd examples/todo_app && pytest tests/ -v
cd examples/event_pipeline && pytest tests/ -v

todo_app is a sample Python app with intentional slop, which makes it a good first run for seeing exactly what the tool catches. event_pipeline covers tricky import patterns like aliases, multi-line imports, and string annotations, and is designed to show the AST analysis handling edge cases correctly.

CI Integration

slop-audit drops directly into CI as a quality gate. It exits 1 if any issues are found, failing the workflow step automatically:

# .github/workflows/quality.yml
jobs:
  slop-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e .
      - run: slop-audit src/ --threshold 12

This catches unused imports and complexity regressions on every pull request before they get merged into the main branch.

Project Structure

slop-cleaner/
├── cli/
│   └── main.py               — entry points + rich-formatted output
├── engines/
│   ├── auditor.py            — Phase 1 · AST-based issue detection
│   ├── analyzer.py           — Phase 2 · call-graph + dead-code finder
│   ├── cleaner.py            — Phase 3 · atomic patch application
│   ├── verifier.py           — Phase 4 · pytest runner + rollback
│   └── documenter.py         — Phase 5 · Markdown report generator
├── parsers/
│   ├── python_parser.py      — tree-sitter Python wrapper
│   └── typescript_parser.py  — tree-sitter TypeScript/TSX wrapper
├── examples/
│   ├── todo_app/             — sample Python app with intentional slop
│   └── event_pipeline/       — sample project with tricky import patterns
├── assets/
│   ├── pipeline.svg          — 5-phase flow diagram
│   ├── features.svg          — feature overview infographic
│   └── before-after.svg      — code transformation visual
└── tests/
    └── test_parsers.py       — 22 tests covering the parsers

The structure maps cleanly to the five phases: one engine per phase, two parsers for Python and TypeScript/TSX, and a single CLI entry point that ties them together.

How I Built This Using NEO

This tool was designed, built, debugged, and refined entirely using NEO, a fully autonomous AI engineering agent that writes, runs, and iterates on real code without hand-holding.

Every engine, every edge-case fix, and every SVG was produced by NEO in a single session. The five-phase pipeline, the tree-sitter AST parsers for Python and TypeScript, the atomic patch application with backup and rollback, the call graph builder, the pytest runner with automatic rollback on failure, and the three Markdown report generators were all built end-to-end from a high-level problem description.

How You Can Use and Extend This With NEO

Use it as a CI quality gate on every pull request:
Drop slop-audit into your CI pipeline. Every PR gets checked for unused imports and complexity regressions before it touches main. Because slop-audit exits with code 1 when it finds issues, it fails the build automatically, with no configuration beyond one workflow step. It is already wired for this out of the box.

Use it before a major refactor:
Before starting a large refactor, run slop-clean on the codebase. It removes accumulated import clutter, flags dead code that no longer needs to be worked around, and generates a complexity map. You go into the refactor with a cleaner starting point and a clear picture of where complexity lives.

Use it to onboard onto an unfamiliar codebase:
Run slop-audit on a project you have just inherited and read SLOP_REPORT.md. You get a structured list of unused imports, dead functions, and complexity hotspots: a map of technical debt you can act on, rather than discovering it piece by piece while working in the code.

Use it to generate a documentation baseline for a legacy project:
Run slop-clean and read ARCHITECTURE.md and FUNCTION_MAP.md. You get a file tree, a visual call graph, and a complete index of every symbol with its complexity score. For a project with no existing documentation, that is a meaningful starting point.

The tool is also designed to be extended and NEO can take any of these further without starting from scratch:

JavaScript/JSX parser: Two parsers already exist (python_parser.py, typescript_parser.py) following the same tree-sitter wrapper pattern. A third for JavaScript/JSX follows the same interface and plugs into all five engines immediately.

Additional complexity metrics: auditor.py already tracks cyclomatic complexity. Additional metrics like function length follow the same detection pattern and surface automatically in FUNCTION_MAP.md and the audit output once added.

Auto-fix for dead code: Dead code is currently flagged at MEDIUM confidence and left to human judgment. For clearly dead private functions confirmed unreachable by the call graph, automatic removal is a natural next step built directly on the existing analyzer.py and cleaner.py infrastructure.

Pre-commit hook: slop-audit already exits 1 on issues. A small wrapper that hooks into the existing CLI entry point brings slop detection into the local development loop before anything reaches CI.
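As an illustration of the pre-commit idea, here is one way such a wrapper could look. This is a sketch under assumptions, not something that ships with the repo: the function names are hypothetical, and it relies only on behaviour described in this post (slop-audit accepts a single file and exits 1 when it finds issues) plus the standard git diff --cached command for listing staged files.

```python
import subprocess

# File types the two existing parsers cover
AUDITED_SUFFIXES = (".py", ".ts", ".tsx")


def staged_files(run=subprocess.run):
    """List staged (added, copied, modified) files that slop-audit can parse."""
    result = run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines()
            if line.endswith(AUDITED_SUFFIXES)]


def audit_files(files, run=subprocess.run):
    """Run slop-audit on each file and return the ones that exited non-zero."""
    return [path for path in files
            if run(["slop-audit", path]).returncode != 0]


def main():
    failed = audit_files(staged_files())
    for path in failed:
        print(f"slop-audit found issues in {path}")
    # A non-zero return value is what makes git abort the commit
    return 1 if failed else 0
```

To wire it up, an executable .git/hooks/pre-commit script would call sys.exit(main()). The run parameter exists only so the two helpers can be exercised without git or slop-audit installed.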

Final Notes

Code clutter is not a cosmetic problem. Unused imports add noise to every code review. Dead functions add surface area that has to be mentally accounted for. High-complexity functions resist change and hide bugs. slop-cleaner addresses all three automatically, with AST-level precision that regex-based tools cannot match, and with a test-verification step that means it never leaves your codebase in a worse state than it found it.
The code is at https://github.com/dakshjain-1616/Ai_Slop_Cleaner
You can also build with NEO in your IDE using the VS Code extension or Cursor.

Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs

Nilofer 🚀 — Wed, 29 Apr 2026 07:56:24 +0000

When an LLM agent fails, the trace is right there: the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.

Agent Failure Classifier is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.

The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.

The Eight Failure Modes

The classifier recognises exactly eight failure modes, each with a precise definition:

HALLUCINATION: Agent stated facts or called tools that do not exist
TOOL_MISUSE: Agent called a real tool with wrong parameters or at the wrong time
CONTEXT_LOSS: Agent forgot earlier decisions or repeated already-completed steps
CIRCULAR_REASONING: Agent looped between the same 2-3 steps without making progress
GOAL_DRIFT: Agent started pursuing a sub-goal and forgot the original task
OVER_REFUSAL: Agent refused an action it was capable of and should have taken
SCHEMA_ERROR: Agent generated malformed JSON for a tool call or structured output
TIMEOUT_CASCADE: One slow tool call caused the agent to rush or skip subsequent steps

These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.

How It Works

The classification pipeline runs in two layers.

The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals: repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.

The LLM-as-judge layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).

Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.
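The two-layer flow can be sketched in a few lines. This is an illustrative model, not the library's internals: Finding, classify, the 0.6 confidence cut-off, and the tie rule are all assumptions made for the example. The shape is what matters: run every detector, keep the most confident finding, and hand over to the judge only when the rules are unsure or tied.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    mode: str                # e.g. "HALLUCINATION"
    confidence: float        # 0.0 to 1.0
    first_failure_turn: int


def classify(turns, detectors, judge=None, min_confidence=0.6):
    """Rule-based pass first; optional judge only for weak or tied results."""
    findings = [f for f in (detect(turns) for detect in detectors) if f is not None]
    findings.sort(key=lambda f: f.confidence, reverse=True)

    best = findings[0] if findings else None
    tied = len(findings) > 1 and findings[0].confidence == findings[1].confidence
    unresolved = best is None or best.confidence < min_confidence or tied

    if judge is not None and unresolved:
        # The judge sees the trace plus whatever the rules found
        return judge(turns, findings)
    return best
```

With judge=None this behaves like the --no-llm path: whatever the detectors found is the answer, and nothing leaves the machine.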

Getting Started

Install

git clone https://github.com/dakshjain-1616/agent-failure-classifier
cd agent-failure-classifier
python3 -m venv venv
source venv/bin/activate
pip install -e .

Requires Python 3.8+. The only dependencies are pydantic, rich, click, and requests. The rule-based layer runs with no additional setup.

LLM Judge Setup (Optional)
To enable the LLM-judge pass, copy .env.example to .env and set your OpenRouter key:

cp .env.example .env
# edit .env and set OPENROUTER_API_KEY=sk-or-...

Without any key, pass --no-llm to every classify or batch call. The rule-based layer alone classifies all eight failure modes.

CLI

The CLI is exposed as both a console script (agent-failure-classifier) and an importable module (python -m agent_failure_classifier.cli).

Classify a single trace
The core command takes a trace JSON file and returns a structured report. --no-llm keeps it offline, rule-based only, no API call.

agent-failure-classifier classify --trace traces/hallucination_example.json --no-llm

Key flags:

Validate a trace
Before classifying, validate parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.

agent-failure-classifier validate --trace traces/hallucination_example.json

Batch classification
batch runs classification over every *.json file in a directory and produces a failure-mode distribution table plus a per-trace summary.

agent-failure-classifier batch --traces-dir ./traces/ --no-llm
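The distribution table is a simple aggregation over per-trace results. Here is a minimal sketch of that step, assuming you already have one failure-mode label per trace (for instance from report.classified_failure_mode in the Python API). The function name is hypothetical and the CLI's actual table layout may differ.

```python
from collections import Counter


def failure_distribution(modes):
    """Return (mode, count, share) rows, most frequent first."""
    counts = Counter(modes)
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common()]


labels = [
    "CONTEXT_LOSS", "HALLUCINATION", "CONTEXT_LOSS",
    "SCHEMA_ERROR", "CONTEXT_LOSS",
]

for mode, count, share in failure_distribution(labels):
    print(f"{mode:20} {count:3}  {share:.0%}")
```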

Worked Examples

Example 1 - Hallucination
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, no tool call, no retrieval.

{
  "trace_id": "hallucination-001",
  "original_goal": "Get population statistics",
  "final_result": "70 million people died in WWII.",
  "is_successful": false,
  "turns": [
    {"turn_number": 0, "role": "user",  "content": "How many people died in WWII?"},
    {"turn_number": 1, "role": "agent", "content": "70 million people died in WWII."}
  ]
}

Classification: HALLUCINATION, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.

Example 2 - Circular Reasoning
The trace has four turns alternating between "Let me analyze this step by step." and "I need more information." The agent makes no progress across the entire trace.
Classification: CIRCULAR_REASONING, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.
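A cycle check of this kind is easy to sketch. The version below is an assumption about how such a detector could work, not the project's code: it looks only at the tail of the trace, compares agent turn content after trimming and lower-casing, and accepts periods of 2 or 3 steps to match the "same 2-3 steps" definition.

```python
def find_cycle(turns, max_period=3, min_repeats=2):
    """Return the period of a cycle at the end of the agent turns, or None."""
    contents = [t["content"].strip().lower() for t in turns if t["role"] == "agent"]
    for period in range(2, max_period + 1):
        needed = period * min_repeats
        if len(contents) < needed:
            continue
        tail = contents[-needed:]
        pattern = tail[:period]
        # Require at least two distinct steps, repeated exactly
        if len(set(pattern)) > 1 and all(
            tail[i] == pattern[i % period] for i in range(needed)
        ):
            return period
    return None


looping_trace = [
    {"turn_number": 0, "role": "user", "content": "Summarise the report"},
    {"turn_number": 1, "role": "agent", "content": "Let me analyze this step by step."},
    {"turn_number": 2, "role": "agent", "content": "I need more information."},
    {"turn_number": 3, "role": "agent", "content": "Let me analyze this step by step."},
    {"turn_number": 4, "role": "agent", "content": "I need more information."},
]

print(find_cycle(looping_trace))
```

Exact string matching keeps the check deterministic. Paraphrased loops would need a similarity measure, which is the kind of case the optional LLM judge is there for.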

Example 3 - Timeout Cascade
The trace contains a slow_api tool call with latency_ms: 6000, followed by a one-word agent response, "OK".
Classification: TIMEOUT_CASCADE, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.
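The two signals in this example (a latency breach, then an agent turn shorter than half the tool output) translate almost directly into code. In the sketch below, the 5000 ms default threshold is an assumption, since the post only shows that 6000 ms is flagged. The field names follow the TraceRecorder example, and find_timeout_cascade is a hypothetical name.

```python
def find_timeout_cascade(turns, latency_threshold_ms=5000):
    """Return (slow_tool_turn, rushed_agent_turn) turn numbers, or None."""
    for index, turn in enumerate(turns):
        if turn.get("role") != "tool":
            continue
        if turn.get("latency_ms", 0) <= latency_threshold_ms:
            continue
        tool_output = turn.get("tool_output") or turn.get("content", "")
        # First agent turn after the slow tool call
        next_agent = next(
            (t for t in turns[index + 1:] if t.get("role") == "agent"), None
        )
        if next_agent is not None and len(next_agent["content"]) < len(tool_output) / 2:
            return turn["turn_number"], next_agent["turn_number"]
    return None


slow_trace = [
    {"turn_number": 0, "role": "user", "content": "Fetch the report"},
    {"turn_number": 1, "role": "tool", "content": "results",
     "tool_name": "slow_api", "tool_output": '{"rows": [1, 2, 3, 4, 5]}',
     "latency_ms": 6000},
    {"turn_number": 2, "role": "agent", "content": "OK"},
]

print(find_timeout_cascade(slow_trace))
```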

Python API

Classify a trace programmatically

import json
from agent_failure_classifier.classifier import FailureClassifier
from agent_failure_classifier.models import AgentTrace

# Load the trace inside a context manager so the file handle is closed promptly
with open("traces/hallucination_example.json") as f:
    trace = AgentTrace(**json.load(f))
report = FailureClassifier(use_llm=False).classify(trace)

print(report.classified_failure_mode, report.confidence)
print(report.root_cause_summary)

Record a trace live with TraceRecorder
Rather than constructing trace JSON by hand, TraceRecorder is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with FailureClassifier.

from agent_failure_classifier.recorder import TraceRecorder

with TraceRecorder(goal="Find Italian restaurants", output_dir="./traces") as r:
    r.add_turn(role="user", content="Find Italian restaurants near me")
    r.add_turn(role="agent", content="Searching...")
    r.add_turn(
        role="tool",
        content="results",
        tool_name="search",
        tool_input={"query": "italian restaurants"},
        tool_output='{"hits": ["Luigi Bistro", "Pasta Palace"]}',
        latency_ms=120,
    )
    r.add_turn(role="agent", content="I found Luigi Bistro and Pasta Palace.")
    r.set_final_result("Found Luigi Bistro and Pasta Palace", is_successful=True)

On exit the trace is saved to ./traces/trace_<id>_<timestamp>.json.

Parse traces from other frameworks
AutoParser auto-detects and normalises three input formats into the canonical AgentTrace model. No manual conversion needed regardless of where the trace came from.

from agent_failure_classifier.formats import AutoParser

trace = AutoParser().parse_file("path/to/trace.json")

The three supported formats are:

Native / generic: a dict with trace_id, original_goal, is_successful, and a turns list. This is the format emitted by TraceRecorder.
LangSmith run export: a dict with run_type, inputs, outputs, and optional child_runs. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.
LangGraph state dict: a dict with thread_id and a state.messages list whose entries use type values human, ai, and tool.

A minimal list-of-dicts ([{"role": "...", "content": "..."}, ...]) is also accepted by the generic parser.
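Auto-detection of this kind usually comes down to checking which distinguishing keys are present. The sketch below shows that idea using only the keys listed above. It is not AutoParser's implementation: detect_format and langgraph_turns are hypothetical names, and the mapping of human, ai, and tool onto user, agent, and tool roles is an assumption.

```python
# Assumed mapping from LangGraph message types to native roles
LANGGRAPH_ROLES = {"human": "user", "ai": "agent", "tool": "tool"}


def detect_format(payload):
    """Guess the trace format from its top-level shape."""
    if isinstance(payload, list):
        return "generic-list"
    if not isinstance(payload, dict):
        raise ValueError("trace must be a dict or a list of dicts")
    if "run_type" in payload:
        return "langsmith"
    state = payload.get("state")
    if "thread_id" in payload and isinstance(state, dict) and "messages" in state:
        return "langgraph"
    if "turns" in payload:
        return "native"
    raise ValueError("unrecognised trace format")


def langgraph_turns(payload):
    """Convert state.messages into native-style turn dicts."""
    return [
        {
            "turn_number": index,
            "role": LANGGRAPH_ROLES[message["type"]],
            "content": message.get("content", ""),
        }
        for index, message in enumerate(payload["state"]["messages"])
    ]


sample = {
    "thread_id": "t-1",
    "state": {"messages": [
        {"type": "human", "content": "Find Italian restaurants near me"},
        {"type": "ai", "content": "Searching..."},
    ]},
}

print(detect_format(sample), langgraph_turns(sample))
```

Checking the most specific keys first (run_type, then thread_id with state.messages) keeps a LangSmith or LangGraph export from being mistaken for the native format.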

How I Built This Using NEO

This project was built using NEO, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.

The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the FailureClassifier orchestration layer, the optional LLM-as-judge pass via OpenRouter, the TraceRecorder context manager, the AutoParser with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.

How You Can Build Further With NEO

Use it as a CI/CD quality gate for your agent.
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with TraceRecorder, run batch classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.

Use it to understand where your agent breaks most.
Run batch classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic: you're looking at patterns across runs, not reading individual traces one by one.

Use it as a live monitoring layer in a multi-agent system.
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.

Use it during agent development to catch regressions early.
Wrap TraceRecorder around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.
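For the CI/CD gate idea above, the regression check itself can be very small. The sketch below compares per-mode failure rates from the current batch against a stored baseline and reports two things: failure modes that are new, and known ones whose rate rose by more than a tolerance. The function name, the dict-of-rates input, and the 10-point tolerance are all assumptions for illustration.

```python
def regressions(baseline, current, max_rate_increase=0.10):
    """Compare {mode: rate} dicts and return human-readable problems."""
    problems = []
    for mode, rate in sorted(current.items()):
        if rate <= 0:
            continue
        if mode not in baseline:
            problems.append(f"{mode}: new failure mode at {rate:.0%}")
        elif rate - baseline[mode] > max_rate_increase:
            problems.append(
                f"{mode}: rate rose from {baseline[mode]:.0%} to {rate:.0%}"
            )
    return problems


baseline = {"CONTEXT_LOSS": 0.25, "SCHEMA_ERROR": 0.10}
current = {"CONTEXT_LOSS": 0.40, "SCHEMA_ERROR": 0.12, "GOAL_DRIFT": 0.05}

for problem in regressions(baseline, current):
    print(problem)
```

In a pipeline, a non-empty result would translate into a non-zero exit code, the same convention the CLI tools in these posts use for failing a build.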

Final Notes

Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. It combines eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for traces from native formats, LangSmith, and LangGraph, all producing a structured report with the first failure turn and actionable fixes.

The code is at https://github.com/dakshjain-1616/agent-failure-classifier
You can also build with NEO in your IDE using the VS Code extension or Cursor.