<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nilofer 🚀</title>
    <description>The latest articles on Forem by Nilofer 🚀 (@nilofer_tweets).</description>
    <link>https://forem.com/nilofer_tweets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1137273%2Fac10d3a1-21d6-46e3-90d6-889213a616bd.jpg</url>
      <title>Forem: Nilofer 🚀</title>
      <link>https://forem.com/nilofer_tweets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nilofer_tweets"/>
    <language>en</language>
    <item>
      <title>Agent Failure Classifier: Post-Hoc Root Cause Analysis for Failed LLM Agent Runs</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 29 Apr 2026 07:56:24 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</link>
      <guid>https://forem.com/nilofer_tweets/agent-failure-classifier-post-hoc-root-cause-analysis-for-failed-llm-agent-runs-1i79</guid>
      <description>&lt;p&gt;When an LLM agent fails, the trace is right there, the user turns, the tool calls, the responses, the final result. But knowing what happened and knowing why it failed are two different things. Most teams read traces manually, form a guess, and move on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Failure Classifier&lt;/strong&gt; is a CLI tool and Python library for post-hoc root cause analysis of failed or low-quality LLM agent runs. Feed it any agent trace and it classifies the failure into one of eight named failure modes, identifies the first turn where things went wrong, and produces a structured report with actionable fixes.&lt;/p&gt;

&lt;p&gt;The classifier combines eight fast rule-based detectors with an optional LLM-as-judge pass via OpenRouter. The rule-based layer is free, deterministic, and requires no network access. The LLM pass breaks ties and classifies traces the rules cannot resolve alone.&lt;/p&gt;

&lt;h2&gt;The Eight Failure Modes&lt;/h2&gt;

&lt;p&gt;The classifier recognises exactly eight failure modes, each with a precise definition:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HALLUCINATION:&lt;/strong&gt; Agent stated facts or called tools that do not exist&lt;br&gt;
&lt;strong&gt;TOOL_MISUSE:&lt;/strong&gt; Agent called a real tool with wrong parameters or at the wrong time&lt;br&gt;
&lt;strong&gt;CONTEXT_LOSS:&lt;/strong&gt; Agent forgot earlier decisions or repeated already-completed steps&lt;br&gt;
&lt;strong&gt;CIRCULAR_REASONING:&lt;/strong&gt; Agent looped between the same 2-3 steps without making progress&lt;br&gt;
&lt;strong&gt;GOAL_DRIFT:&lt;/strong&gt; Agent started pursuing a sub-goal and forgot the original task&lt;br&gt;
&lt;strong&gt;OVER_REFUSAL:&lt;/strong&gt; Agent refused an action it was capable of and should have taken&lt;br&gt;
&lt;strong&gt;SCHEMA_ERROR:&lt;/strong&gt; Agent generated malformed JSON for a tool call or structured output&lt;br&gt;
&lt;strong&gt;TIMEOUT_CASCADE:&lt;/strong&gt; One slow tool call caused the agent to rush or skip subsequent steps&lt;/p&gt;

&lt;p&gt;These are not fuzzy categories. Each one maps to a specific detector with specific signals. A hallucination is flagged when the agent asserts a factual claim without invoking any retrieval tool. A timeout cascade is flagged when a tool call exceeds a latency threshold and the subsequent agent turn is unusually short relative to the tool output.&lt;/p&gt;
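
&lt;p&gt;To make that concrete, here is a minimal sketch of what a rule-based detector of this kind can look like. It is an illustration, not the project's actual code, and it assumes turns shaped like the trace JSON in the worked examples below (&lt;code&gt;role&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, &lt;code&gt;turn_number&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of a rule-based hallucination detector, not the
# project's implementation. Assumes turns shaped like the trace JSON
# shown in the worked examples below.
import re

def detect_hallucination(turns):
    """Flag the first agent turn that asserts a numeric claim
    before any retrieval tool has been invoked."""
    tool_called = False
    for turn in turns:
        if turn["role"] == "tool":
            tool_called = True
        elif turn["role"] == "agent" and not tool_called:
            # Crude factual-claim heuristic: the turn contains a figure.
            if re.search(r"\d", turn["content"]):
                return {"mode": "HALLUCINATION",
                        "first_failure_turn": turn["turn_number"]}
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
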
&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;p&gt;The classification pipeline runs in two layers.&lt;/p&gt;

&lt;p&gt;The rule-based layer runs eight deterministic detectors over the trace. Each detector looks for specific structural signals: repeated tool calls with identical inputs, cycles in agent turn content, latency spikes followed by short responses, malformed JSON in tool call outputs. This layer runs offline, requires no API key, and classifies all eight failure modes.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;LLM-as-judge&lt;/strong&gt; layer is optional. When enabled, it receives traces the rule-based layer couldn't resolve with high confidence and breaks ties. The judge runs via OpenRouter and can be pointed at any OpenRouter model or a local OpenAI-compatible server (Ollama, vLLM, llama.cpp).&lt;br&gt;
Every classification produces a structured report with the classified failure mode, a confidence score, the first turn where the failure was detected, a root cause summary, and a list of actionable fixes.&lt;/p&gt;
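
&lt;p&gt;As a sketch of the report's shape: &lt;code&gt;classified_failure_mode&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, and &lt;code&gt;root_cause_summary&lt;/code&gt; appear in the Python API example later in this post; the remaining field names here are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative report shape; only the first, second, and fourth field
# names are confirmed by the Python API example below.
from typing import List
from pydantic import BaseModel

class FailureReport(BaseModel):
    classified_failure_mode: str  # one of the eight modes
    confidence: float             # e.g. 0.75 for 75%
    first_failure_turn: int       # illustrative name: first turn that went wrong
    root_cause_summary: str
    recommended_fixes: List[str]  # illustrative name: actionable fixes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
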
&lt;h2&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/agent-failure-classifier
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-failure-classifier
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.8+. The only dependencies are &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;rich&lt;/code&gt;, &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;requests&lt;/code&gt;. The rule-based layer runs with no additional setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Judge Setup (Optional)&lt;/strong&gt;&lt;br&gt;
To enable the LLM-judge pass, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt; and set your OpenRouter key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# edit .env and set OPENROUTER_API_KEY=sk-or-...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv2sdra53suzkiojox5t.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without any key, pass &lt;code&gt;--no-llm&lt;/code&gt; to every classify or batch call. The rule-based layer alone classifies all eight failure modes.&lt;/p&gt;

&lt;h2&gt;CLI&lt;/h2&gt;

&lt;p&gt;The CLI is exposed as both a console script (&lt;code&gt;agent-failure-classifier&lt;/code&gt;) and an importable module (&lt;code&gt;python -m agent_failure_classifier.cli&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classify a single trace&lt;/strong&gt;&lt;br&gt;
The core command takes a trace JSON file and returns a structured report. &lt;code&gt;--no-llm&lt;/code&gt; keeps it offline, rule-based only, no API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier classify &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key flags:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdabmg2rqnqa9i9tyks.png" alt=" " width="746" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate a trace&lt;/strong&gt;&lt;br&gt;
Before classifying, &lt;code&gt;validate&lt;/code&gt; parses the trace and prints its structure: trace ID, goal, turn count, and a preview of each turn. Useful for confirming the trace loaded correctly before running classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier validate &lt;span class="nt"&gt;--trace&lt;/span&gt; traces/hallucination_example.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Batch classification&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;batch&lt;/code&gt; runs classification over every &lt;code&gt;*.json&lt;/code&gt; file in a directory and produces a failure-mode distribution table plus a per-trace summary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-failure-classifier batch &lt;span class="nt"&gt;--traces-dir&lt;/span&gt; ./traces/ &lt;span class="nt"&gt;--no-llm&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Worked Examples&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example 1 - Hallucination&lt;/strong&gt;&lt;br&gt;
The trace has a user asking for WWII death statistics. The agent responds directly with a factual claim, no tool call, no retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hallucination-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_goal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Get population statistics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_successful"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"turns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How many people died in WWII?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"turn_number"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"70 million people died in WWII."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Classification: &lt;code&gt;HALLUCINATION&lt;/code&gt;, confidence 75%, first failure at turn 1. The detector flags that the agent asserted a factual claim without invoking any retrieval tool. Recommended fixes include adding a fact-checking step, requiring tool verification for factual claims, and implementing retrieval-augmented generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 2 - Circular Reasoning&lt;/strong&gt;&lt;br&gt;
Four turns alternating between &lt;code&gt;"Let me analyze this step by step."&lt;/code&gt; and &lt;code&gt;"I need more information."&lt;/code&gt; The agent makes no progress across the entire trace.&lt;br&gt;
Classification: &lt;code&gt;CIRCULAR_REASONING&lt;/code&gt;, confidence 80%. The rule-based detector identifies a 2-step cycle repeating across agent turns and recommends a maximum-iteration limit plus state-change detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example 3 - Timeout Cascade&lt;/strong&gt;&lt;br&gt;
A &lt;code&gt;slow_api&lt;/code&gt; tool call with &lt;code&gt;latency_ms: 6000&lt;/code&gt; followed by a one-word agent response &lt;code&gt;"OK"&lt;/code&gt;.&lt;br&gt;
Classification: &lt;code&gt;TIMEOUT_CASCADE&lt;/code&gt;, confidence 70%. The detector flags the latency breach and notes that the subsequent agent turn is a one-word response, less than half the length of the tool output, indicating the agent rushed through the remaining steps.&lt;/p&gt;
&lt;h2&gt;Python API&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Classify a trace programmatically&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.classifier&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FailureClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentTrace&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentTrace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traces/hallucination_example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FailureClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classified_failure_mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root_cause_summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Record a trace live with TraceRecorder&lt;/strong&gt;&lt;br&gt;
Rather than constructing trace JSON by hand, &lt;code&gt;TraceRecorder&lt;/code&gt; is a context manager that captures an agent run as it executes and writes a trace file to disk on exit. The output is immediately compatible with the CLI and with &lt;code&gt;FailureClassifier&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceRecorder&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;TraceRecorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./traces&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find Italian restaurants near me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searching...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;italian restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tool_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luigi Bistro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I found Luigi Bistro and Pasta Palace.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found Luigi Bistro and Pasta Palace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_successful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On exit the trace is saved to &lt;code&gt;./traces/trace_&amp;lt;id&amp;gt;_&amp;lt;timestamp&amp;gt;.json&lt;/code&gt;.&lt;/p&gt;
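
&lt;p&gt;Since the recorded file is already in the native format, it can go straight back into the classifier. A minimal sketch, reusing the classes from the earlier example; the glob assumes the default naming shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: classify the trace TraceRecorder just wrote. The glob pattern
# assumes the default trace_&lt;id&gt;_&lt;timestamp&gt;.json naming.
import glob
import json
from agent_failure_classifier.classifier import FailureClassifier
from agent_failure_classifier.models import AgentTrace

latest = sorted(glob.glob("./traces/trace_*.json"))[-1]
trace = AgentTrace(**json.load(open(latest)))
report = FailureClassifier(use_llm=False).classify(trace)
print(report.classified_failure_mode, report.confidence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
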

&lt;p&gt;&lt;strong&gt;Parse traces from other frameworks&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;AutoParser&lt;/code&gt; auto-detects and normalises three input formats into the canonical &lt;code&gt;AgentTrace&lt;/code&gt; model. No manual conversion needed regardless of where the trace came from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_failure_classifier.formats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoParser&lt;/span&gt;

&lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path/to/trace.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three supported formats are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Native / generic:&lt;/strong&gt; a dict with &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;original_goal&lt;/code&gt;, &lt;code&gt;is_successful&lt;/code&gt;, and a turns list. This is the format emitted by &lt;code&gt;TraceRecorder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangSmith run export:&lt;/strong&gt; a dict with &lt;code&gt;run_type&lt;/code&gt;, &lt;code&gt;inputs&lt;/code&gt;, &lt;code&gt;outputs&lt;/code&gt;, and optional &lt;code&gt;child_runs&lt;/code&gt;. Tool child runs become TOOL turns; chain and LLM child runs become AGENT turns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph state dict:&lt;/strong&gt; a dict with &lt;code&gt;thread_id&lt;/code&gt; and a &lt;code&gt;state.messages&lt;/code&gt; list whose entries use type values &lt;code&gt;human&lt;/code&gt;, &lt;code&gt;ai&lt;/code&gt;, and &lt;code&gt;tool&lt;/code&gt;.
A minimal list-of-dicts (&lt;code&gt;[{"role": "...", "content": "..."}, ...]&lt;/code&gt;) is also accepted by the generic parser, as sketched just after this list.&lt;/li&gt;
&lt;/ul&gt;
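
&lt;p&gt;For instance, the minimal list-of-dicts form can be written to disk and parsed directly. A small sketch, assuming &lt;code&gt;parse_file&lt;/code&gt; returns the canonical &lt;code&gt;AgentTrace&lt;/code&gt; as described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: feed the minimal list-of-dicts form through AutoParser.
import json
from agent_failure_classifier.formats import AutoParser

raw = [
    {"role": "user", "content": "How many people died in WWII?"},
    {"role": "agent", "content": "70 million people died in WWII."},
]
with open("minimal_trace.json", "w") as f:
    json.dump(raw, f)

trace = AutoParser().parse_file("minimal_trace.json")  # -&gt; AgentTrace
print(len(trace.turns))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
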

&lt;h2&gt;How I Built This Using NEO&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes any agent trace, runs deterministic detectors over it, and classifies the failure into a named category with a structured report and actionable fixes. NEO generated the full implementation: the eight rule-based detectors, the &lt;code&gt;FailureClassifier&lt;/code&gt; orchestration layer, the optional LLM-as-judge pass via OpenRouter, the &lt;code&gt;TraceRecorder&lt;/code&gt; context manager, the &lt;code&gt;AutoParser&lt;/code&gt; with support for native, LangSmith, and LangGraph formats, and the Click-based CLI with classify, validate, and batch commands.&lt;/p&gt;

&lt;h2&gt;How You Can Build Further With NEO&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it as a CI/CD quality gate for your agent.&lt;/strong&gt;&lt;br&gt;
If you're shipping an LLM agent, you can integrate the classifier directly into your deployment pipeline. Record traces from your test suite with &lt;code&gt;TraceRecorder&lt;/code&gt;, run &lt;code&gt;batch&lt;/code&gt; classification on every pull request, and fail the build if a new failure mode appears or if the rate of a known one spikes. You get a systematic regression check on agent behaviour, not just on code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it to understand where your agent breaks most.&lt;/strong&gt;&lt;br&gt;
Run &lt;code&gt;batch&lt;/code&gt; classification across a directory of historical traces and look at the failure mode distribution. If CONTEXT_LOSS shows up in 40% of your traces, that's a signal about your agent's memory design, not a one-off bug. This turns debugging from reactive to diagnostic, you're looking at patterns across runs, not reading individual traces one by one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it as a live monitoring layer in a multi-agent system.&lt;/strong&gt;&lt;br&gt;
The classifier runs as an A2A agent, which means it can sit as a node in a multi-agent pipeline. Any agent in the system can send its trace to the classifier after each run and get a structured failure report back. An orchestrator can use that signal to decide whether to retry, reroute, or escalate without any human in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it during agent development to catch regressions early.&lt;/strong&gt;&lt;br&gt;
Wrap &lt;code&gt;TraceRecorder&lt;/code&gt; around your agent during development. Every run produces a trace. Feed those traces into the classifier after each session and you'll know immediately if a change introduced a new failure mode. It's the difference between finding out something broke in production versus finding out in your local environment.&lt;/p&gt;

&lt;h2&gt;Final Notes&lt;/h2&gt;

&lt;p&gt;Agent Failure Classifier turns trace debugging from a manual read-and-guess process into a systematic one. Eight named failure modes, a deterministic rule-based layer that runs offline, an optional LLM judge for ambiguous cases, and support for traces from native formats, LangSmith, and LangGraph, all producing a structured report with the first failure turn and actionable fixes.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/agent-failure-classifier" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/agent-failure-classifier&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
    <item>
      <title>Synthetic Data Flywheel: A Closed-Loop Pipeline for Instruction-Tuning Data</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:44:54 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</link>
      <guid>https://forem.com/nilofer_tweets/synthetic-data-flywheel-a-closed-loop-pipeline-for-instruction-tuning-data-c85</guid>
      <description>&lt;p&gt;Fine-tuning a model requires data. Good data requires human labeling. Human labeling doesn't scale. And most synthetic generation pipelines stop at generation, they produce candidate pairs but have no mechanism to filter them, measure quality, or feed failure cases back into the next round.&lt;br&gt;
&lt;strong&gt;Synthetic Data Flywheel&lt;/strong&gt; is a closed-loop pipeline that handles the full cycle: generate candidate instruction-output pairs, validate them deterministically, score them with an LLM-as-judge, calibrate that judge against human labels, export clean training data, and feed the failure cases from one cycle as seeds into the next. It ships as a CLI, a Python library, and an A2A-protocol agent surface for multi-agent orchestration.&lt;/p&gt;

&lt;p&gt;Everything except the optional fine-tuning step runs on CPU.&lt;/p&gt;
&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Synthetic data generation without a quality gate produces noise at scale. And quality gates without calibration produce a judge whose scores you can't trust. The flywheel addresses both: every candidate pair is scored, every score can be validated against human labels, and every failure becomes signal for the next generation cycle rather than a dead end.&lt;/p&gt;
&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;p&gt;A dataset moves through a series of additive stages, each producing artifacts keyed by the dataset name. Every stage is idempotent and re-runnable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation:&lt;/strong&gt; Candidate pairs are produced from seed prompts via OpenRouter, using one of four prompt templates: QA, INSTRUCTION, REASONING, or CREATIVE.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation:&lt;/strong&gt; Deterministic checks run over each pair: schema, length, dedup, PII, language, profanity. Results are written as a JSON report with per-issue severity levels (error, warning). A cleaned copy of the dataset can be written at this stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judging:&lt;/strong&gt; An LLM-as-judge scores each pair against a rubric. The judge supports three backends: Ollama, OpenRouter, and Anthropic. Judgments are cached on disk keyed by &lt;code&gt;(backend, model, pair.id, rubric.name@version)&lt;/code&gt;, so repeated judge passes on unchanged pairs are free.&lt;/p&gt;
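
&lt;p&gt;A key of that shape reduces naturally to a single filename-safe digest. A small sketch of the idea; the hashing scheme here is illustrative, not necessarily what the flywheel does:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: turn (backend, model, pair id, rubric name@version) into a
# stable on-disk cache key. The hashing scheme is illustrative.
import hashlib

def cache_key(backend, model, pair_id, rubric_name, rubric_version):
    raw = f"{backend}|{model}|{pair_id}|{rubric_name}@{rubric_version}"
    return hashlib.sha256(raw.encode()).hexdigest()

# Any change to the model, the pair, or the rubric version is a cache miss.
print(cache_key("ollama", "gemma4:latest", "pair-001", "default", "1"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
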

&lt;p&gt;&lt;strong&gt;Labeling:&lt;/strong&gt; Three modes: interactive (human reviews pairs one by one), bulk (apply a status to a filtered subset), and auto-from-judge (derive labels from judgment scores above a threshold). Labels are stored append-only so sessions can be interrupted and resumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration:&lt;/strong&gt; Treats human labels (&lt;code&gt;status == approved&lt;/code&gt;) as ground truth and measures the judge's precision, recall, F1, and accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compare:&lt;/strong&gt; Two or more judgment runs on the same dataset are compared: pass-agreement, Cohen's kappa, and Pearson correlation on the overall score.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export:&lt;/strong&gt; Pairs that clear the judge filter are written to a train/val split. The filter expression uses a safe evaluator that allows only arithmetic, comparisons, and subscript access into the context dict. Attribute access and function calls are rejected.&lt;/p&gt;
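
&lt;p&gt;Evaluators like this are usually built by whitelisting AST node types. A minimal sketch of the idea, not the flywheel's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of a whitelist-based safe evaluator, illustrating the
# idea behind the export filter (not the flywheel's implementation).
import ast

ALLOWED = (ast.Expression, ast.BoolOp, ast.BinOp, ast.UnaryOp, ast.Compare,
           ast.Name, ast.Load, ast.Constant, ast.Subscript,
           ast.And, ast.Or, ast.Not, ast.Add, ast.Sub, ast.Mult, ast.Div,
           ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE)

def safe_eval(expr, context):
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        # ast.Attribute and ast.Call are simply not on the list.
        if not isinstance(node, ALLOWED):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return eval(compile(tree, "&lt;filter&gt;", "eval"), {"__builtins__": {}}, context)

print(safe_eval("scores['overall'] &gt;= 7", {"scores": {"overall": 8}}))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
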

&lt;p&gt;&lt;strong&gt;Cycle feedback:&lt;/strong&gt; Failure instructions from one cycle are extracted and fed as additional seeds into cycle N+1. The autonomous loop stops when the pass rate drops below &lt;code&gt;min_pass_rate&lt;/code&gt; (default 0.5) or &lt;code&gt;max_cycles&lt;/code&gt; is reached.&lt;/p&gt;
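
&lt;p&gt;In control-flow terms the loop looks roughly like this. A self-contained sketch with stub stages; only &lt;code&gt;min_pass_rate&lt;/code&gt; and &lt;code&gt;max_cycles&lt;/code&gt; come from the tool, everything else is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Self-contained sketch of the cycle-feedback loop. The stub stages
# stand in for OpenRouter generation and LLM judging.
import random

def generate_pairs(seeds):
    return [{"instruction": s, "output": f"draft answer for: {s}"} for s in seeds]

def judge_pairs(pairs):
    return [{"pair": p, "passed": random.random() &gt; 0.3} for p in pairs]

def run_flywheel(initial_seeds, max_cycles=3, min_pass_rate=0.5):
    seeds = list(initial_seeds)
    for cycle in range(1, max_cycles + 1):
        judgments = judge_pairs(generate_pairs(seeds))
        pass_rate = sum(j["passed"] for j in judgments) / len(judgments)
        print(f"cycle {cycle}: pass rate {pass_rate:.0%}")
        if pass_rate &lt; min_pass_rate:
            break  # quality has collapsed; stop rather than compound noise
        # Failed instructions become extra seeds for the next cycle.
        seeds = list(initial_seeds) + [j["pair"]["instruction"]
                                       for j in judgments if not j["passed"]]

run_flywheel(["benefits of green tea", "history of python language"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
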
&lt;h2&gt;Getting Started&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/synthetic-data-flywheel
&lt;span class="nb"&gt;cd &lt;/span&gt;synthetic-data-flywheel
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. Generation requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The local judge path requires Ollama, verified against &lt;code&gt;gemma4:latest&lt;/code&gt;. Fine-tuning requires Unsloth and a GPU; the repo was verified on a free Colab T4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;init&lt;/code&gt; creates the directory structure the rest of the pipeline writes into.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Synthetic Data Flywheel Initialized
Data Directory: ./data
Checkpoint Directory: ./data/checkpoints
Report Directory: ./reports
Directories created successfully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;ingest&lt;/code&gt; normalises an existing dataset into the flywheel's internal JSONL format. It supports jsonl, csv, and HuggingFace datasets, and accepts a field mapping flag when the source uses different column names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; demo.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Ingested 8 pairs -&amp;gt; data/user/demo.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other ingest forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.csv              &lt;span class="nt"&gt;-n&lt;/span&gt; my_dataset &lt;span class="nt"&gt;-f&lt;/span&gt; csv
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; hf://tatsu-lab/alpaca &lt;span class="nt"&gt;-n&lt;/span&gt; alpaca &lt;span class="nt"&gt;--limit&lt;/span&gt; 500 &lt;span class="nt"&gt;--hf-split&lt;/span&gt; train
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; aliased &lt;span class="nt"&gt;--map&lt;/span&gt; &lt;span class="s2"&gt;"instruction=prompt,output=completion"&lt;/span&gt;
flywheel ingest &lt;span class="nt"&gt;-i&lt;/span&gt; data.jsonl &lt;span class="nt"&gt;-n&lt;/span&gt; x &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each successful ingest writes &lt;code&gt;data/user/&amp;lt;name&amp;gt;.jsonl&lt;/code&gt; and &lt;code&gt;data/user/&amp;lt;name&amp;gt;.meta.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validate&lt;/strong&gt;&lt;br&gt;
Before any judging happens, the validator runs deterministic checks over the dataset. This catches structural problems, duplicate pairs, PII, and malformed schema before spending LLM calls on them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel validate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--checks&lt;/span&gt; schema,length,dedup,pii &lt;span class="nt"&gt;--write-clean&lt;/span&gt; data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation: demo
  Total pairs       8
  pii               1
  severity:warning  1
Report: data/validation/demo.report.json
Clean dataset written (8 pairs): data/user/demo.clean.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fail-on error|warning|never&lt;/code&gt; flag lets you gate CI on validation issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge&lt;/strong&gt;&lt;br&gt;
With a clean dataset, the judge scores each pair against a rubric. The default rubric is built-in; custom rubrics can be passed with &lt;code&gt;--rubric&lt;/code&gt;. Results are cached, so re-running after adding new pairs only scores the new ones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel judge &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--backend&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest &lt;span class="nt"&gt;--tag&lt;/span&gt; v1 &lt;span class="nt"&gt;--max-pairs&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judging 3 pairs with ollama:gemma4:latest
  Judged                3
  Passed                0 (0.0%)
  Avg overall (scored)  5.00
  Output                data/judgments/demo.v1.jsonl
  Cache                 hits=0 misses=3 writes=3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Judgments land at &lt;code&gt;data/judgments/&amp;lt;dataset&amp;gt;.&amp;lt;tag&amp;gt;.jsonl&lt;/code&gt;. The &lt;code&gt;--tag&lt;/code&gt; flag is how multiple judgment runs on the same dataset are tracked separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label&lt;/strong&gt;&lt;br&gt;
Labeling bridges human judgment and automated scoring. &lt;code&gt;auto-from-judge&lt;/code&gt; derives labels directly from the judgment scores: pairs above the threshold are approved, pairs below are rejected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel label &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--mode&lt;/span&gt; auto-from-judge &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.v1.jsonl &lt;span class="nt"&gt;--reject-below&lt;/span&gt; 3.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For manual review, &lt;code&gt;--mode interactive&lt;/code&gt; walks through pairs one by one. For bulk operations, &lt;code&gt;--mode bulk&lt;/code&gt; applies a status to a filtered subset. All labels are stored append-only at &lt;code&gt;data/labels/&amp;lt;dataset&amp;gt;.jsonl&lt;/code&gt;.&lt;/p&gt;
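
&lt;p&gt;The append-only design is what makes interrupted sessions safe to resume: each decision is one JSONL line, and replaying the file yields the latest status per pair. A small sketch of the idea, with illustrative field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of an append-only label store: one JSON line per decision,
# last write wins on replay. Field names are illustrative.
import json
import time

def append_label(path, pair_id, status):
    with open(path, "a") as f:
        f.write(json.dumps({"pair_id": pair_id, "status": status,
                            "ts": time.time()}) + "\n")

def latest_labels(path):
    labels = {}
    for line in open(path):
        rec = json.loads(line)
        labels[rec["pair_id"]] = rec["status"]  # last write wins
    return labels

append_label("demo.labels.jsonl", "pair-001", "approved")
print(latest_labels("demo.labels.jsonl"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
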

&lt;p&gt;&lt;strong&gt;Compare&lt;/strong&gt;&lt;br&gt;
When you have two judgment runs, say from two different models, &lt;code&gt;compare&lt;/code&gt; measures how much they agree. Cohen's kappa close to 1.0 means the two judges are making the same pass/fail decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel compare &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tags&lt;/span&gt; judge_a,judge_b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Judge comparison: judge_a vs judge_b
  Common pairs          8
  judge_a passed / mean 6 / 7.44
  judge_b passed / mean 6 / 7.19
  Pass agreement        100.0%
  Cohen's kappa (p/f)   1.000  (near-perfect)
  Score Pearson r       0.965
  Output                reports/demo/compare.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Calibrate&lt;/strong&gt;&lt;br&gt;
Calibration answers the question you must settle before trusting your judge: does its &lt;code&gt;passed&lt;/code&gt; decision align with human labels? A precision of 1.0 means every pair the judge passed was also approved by a human. A recall of 0.75 means the judge missed 25% of the pairs humans would have kept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel calibrate &lt;span class="nt"&gt;-d&lt;/span&gt; demo &lt;span class="nt"&gt;--tag&lt;/span&gt; judge_a &lt;span class="nt"&gt;--approved-is&lt;/span&gt; approved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Evaluated pairs  8
  Precision        1.000
  Recall           0.750
  F1               0.857
  Accuracy         0.750
  TP/FP/TN/FN      6/0/0/2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
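
&lt;p&gt;Those numbers fall straight out of the confusion counts; plain arithmetic, nothing tool-specific:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reproducing the calibration metrics above from TP/FP/TN/FN.
tp, fp, tn, fn = 6, 0, 0, 2
precision = tp / (tp + fp)                          # 1.000
recall = tp / (tp + fn)                             # 0.750
f1 = 2 * precision * recall / (precision + recall)  # 0.857
accuracy = (tp + tn) / (tp + fp + tn + fn)          # 0.750
print(precision, recall, round(f1, 3), accuracy)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
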



&lt;p&gt;&lt;strong&gt;Visualize&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;visualize&lt;/code&gt; renders a suite of PNG charts and an &lt;code&gt;index.html&lt;/code&gt; for a dataset — covering label distribution, score distributions, pass/fail breakdown, pair lengths, categories, judge agreement matrix, and validation results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel visualize &lt;span class="nt"&gt;-d&lt;/span&gt; demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;categories      reports/demo/categories.png
  lengths         reports/demo/lengths.png
  validation      reports/demo/validation.png
  pass_fail       reports/demo/pass_fail.png
  scores          reports/demo/scores.png
  criteria        reports/demo/criteria.png
  labels          reports/demo/labels.png
  judge_agreement reports/demo/judge_agreement.png
  index.html      reports/demo/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Dataset inspection and export&lt;/strong&gt;&lt;br&gt;
Before exporting, &lt;code&gt;dataset ls&lt;/code&gt; and &lt;code&gt;dataset info&lt;/code&gt; show what artifacts exist for each dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name   pairs  source  tags
  demo   8      jsonl   demo1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset info demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pairs       data/user/demo.jsonl               present
  meta        data/user/demo.meta.json           present
  validation  data/validation/demo.report.json   present
  labels      data/labels/demo.jsonl             present
  judgments   data/judgments                     5 set(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Export filters pairs using a safe expression: only pairs with an overall score of 7 or above are written, split 80/20 into train and val.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel dataset &lt;span class="nb"&gt;export &lt;/span&gt;demo &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to&lt;/span&gt; data/exports/demo.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt; jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--judgments&lt;/span&gt; data/judgments/demo.judge_a.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"scores['overall'] &amp;gt;= 7"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--split&lt;/span&gt; &lt;span class="nv"&gt;train&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.8,val&lt;span class="o"&gt;=&lt;/span&gt;0.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrote 4 pairs -&amp;gt; data/exports/demo.train.jsonl
Wrote 2 pairs -&amp;gt; data/exports/demo.val.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the autonomous loop&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;flywheel run&lt;/code&gt; ties everything together into a seeds-to-checkpoint cycle. Generation goes through OpenRouter; judging goes through Ollama. If Ollama isn't running, generation still succeeds and pairs are saved in the checkpoint, but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;. The standalone &lt;code&gt;flywheel judge --backend openrouter&lt;/code&gt; works fully without Ollama.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;meta-llama/llama-3.2-3b-instruct

flywheel run &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"benefits of green tea,history of python language"&lt;/span&gt; &lt;span class="nt"&gt;--max-cycles&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╭───── Configuration ─────╮
│ Synthetic Data Flywheel │
│ Seeds: 2                │
│ Max Cycles: 1           │
╰─────────────────────────╯
Starting Flywheel with max_cycles=1
============================================================
Starting Cycle 1
============================================================
Using 2 seeds
Generating synthetic data...
Generated 2 pairs
Judging quality...
Passed: 0, Failed: 2
Cycle 1 complete. Pass rate: 0.00%
Flywheel complete. Ran 1 cycles.
       Flywheel Summary
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Metric             ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ Total Cycles       │ 1     │
│ Total Passed Pairs │ 0     │
│ Avg Pass Rate      │ 0.00% │
└────────────────────┴───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle writes a checkpoint. Generated pairs are saved verbatim inside &lt;code&gt;data/checkpoints/checkpoint_001.json&lt;/code&gt;; here is one from the run above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Here is an example of an instruction-following training data in JSON format:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;What are some of the benefits of drinking green tea?&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;output&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Green tea has numerous benefits, including: - High antioxidant content - Anti-inflammatory properties - May help with weight loss ...&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;category&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;instruction&lt;/span&gt;&lt;span class="se"&gt;\"\n&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_seed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"benefits of green tea"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Status and report&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel status
flywheel report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;status&lt;/code&gt; summarises checkpoint state. &lt;code&gt;report&lt;/code&gt; produces an HTML report across cycles written to &lt;code&gt;reports/flywheel_report_&amp;lt;timestamp&amp;gt;.html&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;CLI Reference&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;flywheel --help&lt;/code&gt; lists the command groups. Every command has &lt;code&gt;--help&lt;/code&gt; with full flag docs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;flywheel &lt;span class="nt"&gt;--help&lt;/span&gt;
Usage: flywheel &lt;span class="o"&gt;[&lt;/span&gt;OPTIONS] COMMAND &lt;span class="o"&gt;[&lt;/span&gt;ARGS]...

  Synthetic Data Flywheel - Autonomous data generation pipeline.

Commands:
  calibrate  Measure judge &lt;span class="s1"&gt;'passed'&lt;/span&gt; against human labels &lt;span class="o"&gt;(&lt;/span&gt;precision/recall/F1&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
  compare    Compare two+ judgment runs &lt;span class="o"&gt;(&lt;/span&gt;Cohen&lt;span class="s1"&gt;'s kappa, agreement, ...).
  dataset    Dataset management: ls | info | export.
  ingest     Ingest a user dataset into the flywheel'&lt;/span&gt;s JSONL format.
  init       Initialize flywheel configuration.
  judge      Judge a dataset with an LLM-as-judge backend.
  label      Label a dataset: interactive/bulk/auto-from-judge.
  pipeline   Run declarative YAML pipelines.
  report     Generate HTML report from checkpoints.
  run        Run the synthetic data flywheel.
  status     Show current flywheel status.
  validate   Validate a dataset and write a ValidationReport.
  visualize  Render a suite of PNG charts + index.html &lt;span class="k"&gt;for &lt;/span&gt;a dataset.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Pipeline Runner&lt;/h2&gt;

&lt;p&gt;Individual commands can be composed into a declarative YAML pipeline and run as a single step. This is useful for repeatable workflows: the pipeline dispatches through the same Click commands as manual runs, so behaviour is identical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pipeline_demo.yaml&lt;/span&gt;
&lt;span class="na"&gt;dataset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;demo&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;checks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;length&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dedup&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;export&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/user/demo_pipeline.jsonl&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jsonl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;flywheel pipeline run pipeline_demo.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1/2] flywheel validate -d demo --checks schema,length,dedup
[2/2] flywheel dataset export demo --to data/user/demo_pipeline.jsonl --format jsonl
   Pipeline: demo
  1  validate  ok  0
  2  export    ok  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Python API&lt;/h2&gt;

&lt;p&gt;The full pipeline is available as a library. The minimal end-to-end call scores a dataset with an async judge backed by Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.ingest&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset_jsonl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.rubrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;default_rubric&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncQualityJudge&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_backends&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_backend&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.judge_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JudgmentCache&lt;/span&gt;

&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset_jsonl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/user/demo.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncQualityJudge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;default_rubric&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;JudgmentCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.cache/judge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;backend_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;judgments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;judgments&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The statistical functions used internally by &lt;code&gt;calibrate&lt;/code&gt; and &lt;code&gt;compare&lt;/code&gt; are also directly callable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prf&lt;/span&gt;

&lt;span class="nf"&gt;cohens_kappa&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.5
&lt;/span&gt;
&lt;span class="nf"&gt;pearson&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# 0.8315...
&lt;/span&gt;
&lt;span class="nf"&gt;prf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'accuracy': 0.5,
#  'tp': 1, 'fp': 1, 'tn': 1, 'fn': 1}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A2A Agent
&lt;/h2&gt;

&lt;p&gt;The flywheel exposes a FastAPI application implementing the A2A protocol surface (&lt;code&gt;/a2a/capabilities&lt;/code&gt;, &lt;code&gt;/a2a/tasks/send&lt;/code&gt;, &lt;code&gt;/a2a/tasks/get&lt;/code&gt;, &lt;code&gt;/a2a/tasks/cancel&lt;/code&gt;) so it can be orchestrated as a node in a multi-agent ML pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; synthetic_data_flywheel.a2a_agent
&lt;span class="c"&gt;# or&lt;/span&gt;
uvicorn synthetic_data_flywheel.a2a_agent:app &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three capabilities are exposed: &lt;code&gt;generate_synthetic_data&lt;/code&gt;, &lt;code&gt;get_status&lt;/code&gt;, &lt;code&gt;generate_report&lt;/code&gt;. Querying &lt;code&gt;/a2a/capabilities&lt;/code&gt; returns the agent's identity and the full capability list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.testclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TestClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;synthetic_data_flywheel.a2a_agent&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TestClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/capabilities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'agent_name': 'synthetic_data_flywheel', 'version': '0.1.0',
#  'capabilities': [{'name': 'generate_synthetic_data', ...},
#                   {'name': 'get_status', ...},
#                   {'name': 'generate_report', ...}]}
&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/a2a/tasks/send&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {'task_id': '...', 'status': {'state': 'completed'},
#  'result': {'type': 'status_result',
#             'content': {'checkpoints_found': 1,
#                         'checkpoint_dir': 'data/checkpoints'}}}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;All settings are read from environment variables or a &lt;code&gt;.env&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;sk-or-...&lt;/span&gt;
&lt;span class="py"&gt;OPENROUTER_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen3-8b:free&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;
&lt;span class="py"&gt;OLLAMA_MODEL&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;
&lt;span class="py"&gt;DEFAULT_JUDGE_BACKEND&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ollama        # ollama | openrouter | anthropic&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_CONCURRENCY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;JUDGE_TIMEOUT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;600&lt;/span&gt;
&lt;span class="py"&gt;QUALITY_MIN_SCORE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7.0&lt;/span&gt;
&lt;span class="py"&gt;MAX_CYCLES&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;PII_POLICY&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;warn                     # strict | warn | off&lt;/span&gt;
&lt;span class="py"&gt;A2A_HOST&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;A2A_PORT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; defaults to 600 seconds; large local models can take over two minutes on first call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning requires a GPU:&lt;/strong&gt; &lt;code&gt;Trainer.prepare_training_artifacts&lt;/code&gt; writes a Colab-ready Unsloth notebook under &lt;code&gt;notebooks/training_cycle_NNN.ipynb&lt;/code&gt;. Running the training step locally on CPU is not supported by Unsloth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous generation requires OpenRouter:&lt;/strong&gt; &lt;code&gt;flywheel run&lt;/code&gt; requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;. The in-loop judge is hardcoded to Ollama (&lt;code&gt;engine.create_judge&lt;/code&gt; constructs a sync &lt;code&gt;QualityJudge&lt;/code&gt; over &lt;code&gt;OllamaClient&lt;/code&gt;); if Ollama isn't available, pairs are persisted but every judgment falls back to &lt;code&gt;passed=false&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large local judges are slow to cold-start:&lt;/strong&gt; Gemma 4 (9 GB) takes about 130 seconds the first time it loads into VRAM/RAM. The default &lt;code&gt;JUDGE_TIMEOUT&lt;/code&gt; is 600 seconds to cover this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HuggingFace ingest requires &lt;code&gt;datasets&lt;/code&gt;:&lt;/strong&gt; already a dependency, but gated datasets additionally require &lt;code&gt;HUGGINGFACE_TOKEN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic judge backend requires &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;:&lt;/strong&gt; no offline fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a closed-loop pipeline that generates synthetic instruction-tuning pairs, filters them with a calibrated LLM judge, and feeds failure cases back as seeds for the next cycle. NEO generated the full implementation: the &lt;code&gt;FlywheelEngine&lt;/code&gt; cycle loop with checkpointing, the &lt;code&gt;AsyncQualityJudge&lt;/code&gt; with three pluggable backends and a disk-backed cache, the deterministic &lt;code&gt;Validator&lt;/code&gt; with six check types, the &lt;code&gt;LabelStore&lt;/code&gt; with append-only storage, the statistical calibration layer (&lt;code&gt;cohens_kappa&lt;/code&gt;, &lt;code&gt;pearson&lt;/code&gt;, &lt;code&gt;prf&lt;/code&gt;), the safe-eval export filter, the declarative YAML pipeline runner, the Matplotlib visualisation suite, and the A2A FastAPI agent surface. All 100 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Additional judge backends:&lt;/strong&gt; the three existing backends share a common interface via &lt;code&gt;get_backend&lt;/code&gt;. Any OpenAI-compatible endpoint can be wired in as a new backend, and the judge cache, calibration, and compare logic all work with it immediately without any changes.&lt;/p&gt;
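&lt;p&gt;To make "any OpenAI-compatible endpoint" concrete, here is a minimal sketch of what such a backend call looks like, using plain &lt;code&gt;httpx&lt;/code&gt;. The URL, model name, and function name are illustrative assumptions, not the library's actual backend interface.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: one call against an OpenAI-compatible
# /chat/completions endpoint. URL, key, and model are placeholders.
import os
import httpx

def judge_completion(prompt: str) -&gt; str:
    resp = httpx.post(
        "https://my-endpoint.example/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['MY_API_KEY']}"},
        json={
            "model": "my-model",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,  # deterministic judging
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;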

&lt;p&gt;&lt;strong&gt;Additional generation templates:&lt;/strong&gt; the generator ships with four templates: QA, INSTRUCTION, REASONING, CREATIVE. New domain-specific templates (code generation, structured extraction, tool use) would let the flywheel produce specialised training data while the cycle loop, judge, and export pipeline stay entirely unchanged.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional validation checks:&lt;/strong&gt; the &lt;code&gt;Validator&lt;/code&gt; already supports six check types plugged into the same &lt;code&gt;--checks&lt;/code&gt; flag and report format. New checks for domain-specific quality signals would run in the same validation pass and appear in the same JSON report and visualisation output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-judge ensembling:&lt;/strong&gt; &lt;code&gt;compare&lt;/code&gt; already computes agreement metrics across judgment runs. Taking the average or majority vote across two or more judge scores before the pass/fail decision would reduce the noise that small local models introduce, without touching the labeling, calibration, or export logic downstream.&lt;/p&gt;
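&lt;p&gt;As a sketch of that ensembling step, assuming each judgment exposes a numeric score and a pass flag (names here are illustrative, not the library's types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative multi-judge ensembling; attribute names are assumptions.
def ensemble_pass(scores: list[float], min_score: float = 7.0) -&gt; bool:
    # Average the judges' scores, then apply the usual cutoff.
    return sum(scores) / len(scores) &gt;= min_score

def majority_pass(flags: list[bool]) -&gt; bool:
    # Majority vote across two or more judges' pass/fail decisions.
    return sum(flags) &gt; len(flags) / 2

ensemble_pass([6.5, 8.0])            # True: mean 7.25 &gt;= 7.0
majority_pass([True, False, True])   # True: 2 of 3 judges passed it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;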

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Synthetic Data Flywheel closes the loop that most synthetic data pipelines leave open. It generates, validates, judges, calibrates, and exports, then feeds what failed back into the next cycle. The result is a data pipeline that improves with each run rather than producing a static batch.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/synthetic-data-flywheel" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/synthetic-data-flywheel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>syntheticdata</category>
      <category>opensource</category>
      <category>finetuning</category>
    </item>
    <item>
      <title>Token Budget Negotiator</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 21:55:15 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/token-budget-negotiator-1ijg</link>
      <guid>https://forem.com/nilofer_tweets/token-budget-negotiator-1ijg</guid>
      <description>&lt;p&gt;Everyone knows long prompts cost money. Almost nobody knows which parts of their prompt actually matter.&lt;/p&gt;

&lt;p&gt;Prompts accumulate over time: a system message, a style guide, a few-shot example or two, some background context. Each addition made sense when it was added. Over hundreds of API calls, the overhead compounds. And the honest answer to "which of these sections can I remove?" is: you don't know until you test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token Budget Negotiator&lt;/strong&gt; makes that test systematic. It takes a prompt split into named, prioritised sections, runs a greedy ablation loop that drops one section at a time, scores the remaining prompt against a rubric using a local or remote LLM judge, and stops when savings hit the target without falling below the quality threshold. The result is the smallest prompt that still behaves like the original.&lt;/p&gt;

&lt;p&gt;It ships as a CLI, a Python library, and an MCP server.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Prompt sections are not equal in value, but there's no principled way to know which ones matter for a given task without testing. Manual trimming is guesswork. Token Budget Negotiator answers the question empirically per section, per task, against a rubric that defines what quality means for that use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;A prompt is defined as a YAML file with named sections. Each section carries a &lt;code&gt;type&lt;/code&gt; (system, few_shot, context, instruction), a &lt;code&gt;content&lt;/code&gt; block, and a &lt;code&gt;priority&lt;/code&gt; integer. Priority determines the order in which sections are considered for removal: low-priority sections are evaluated first, high-priority sections last.&lt;/p&gt;

&lt;p&gt;Before any removal happens, the full prompt is scored by the judge LLM against the rubric. This establishes a baseline. The quality target for the run is &lt;code&gt;baseline_score × threshold&lt;/code&gt;. With a baseline of 0.90 and a threshold of 0.80, for example, every trimmed prompt must still score at least 0.72.&lt;/p&gt;

&lt;p&gt;The ablation loop then works through sections in ascending priority order. For each candidate, a test prompt is built without that section and rescored. If the score still meets the target, the section is dropped permanently and the loop continues with the updated prompt. If not, the section is kept and the next candidate is evaluated.&lt;/p&gt;

&lt;p&gt;Two conditions stop the loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token savings reach &lt;code&gt;min_token_savings&lt;/code&gt;: the target has been hit.&lt;/li&gt;
&lt;li&gt;A removal would push savings above &lt;code&gt;max_token_savings&lt;/code&gt;: the ceiling is enforced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every accepted removal is verified to actually reduce the token count. The loop cannot produce a larger prompt than it started with.&lt;br&gt;
The output is a &lt;code&gt;NegotiationResult&lt;/code&gt; containing the original and optimised token counts, the list of sections removed, per-step scores, quality retention percentage, elapsed time, scoring call count, rubric name, and a full ablation log. This can be written to JSON or YAML.&lt;/p&gt;
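&lt;p&gt;The loop itself is compact enough to sketch. Assuming a &lt;code&gt;score(sections)&lt;/code&gt; judge call and a &lt;code&gt;tokens(sections)&lt;/code&gt; counter (illustrative names, not the library's API), the greedy pass looks roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of the greedy ablation loop described above;
# score(), tokens(), and the section objects are assumptions.
def ablate(sections, score, tokens, threshold=0.95,
           min_savings=0.40, max_savings=0.60):
    baseline = score(sections)
    target = baseline * threshold          # quality floor for the run
    total = tokens(sections)
    kept = list(sections)
    # Low-priority sections are tried first, high-priority last.
    for sec in sorted(sections, key=lambda s: s.priority):
        candidate = [s for s in kept if s is not sec]
        savings = 1 - tokens(candidate) / total
        if savings &gt; max_savings:
            break                          # ceiling: stop before over-trimming
        if score(candidate) &gt;= target:
            kept = candidate               # quality held: drop permanently
            if savings &gt;= min_savings:
                break                      # target savings reached
    return kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;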
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;token-budget-negotiator
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires Python 3.11+. The local judge path requires Ollama with a model pulled; the tool was verified end-to-end against &lt;code&gt;gemma4:latest&lt;/code&gt;. The OpenRouter path requires &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze token distribution&lt;/strong&gt;&lt;br&gt;
Before negotiating, &lt;code&gt;analyze&lt;/code&gt; prints how many tokens each section holds and its share of the total budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget analyze examples/prompt.yaml
&lt;span class="go"&gt;
Token Distribution Analysis:
Section              Type              Tokens        % Priority
-----------------------------------------------------------------
system               system                22    18.6%       30
style_guide          system                26    22.0%       10
few_shot_1           few_shot              26    22.0%       20
few_shot_2           few_shot              20    16.9%       25
context              context               12    10.2%       40
instruction          instruction           12    10.2%      100
-----------------------------------------------------------------
TOTAL                                     118   100.0%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Check the local judge:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;span class="go"&gt;Ollama is connected
  Host: http://localhost:11434
  Model requested: gemma4:latest
  Model available: Yes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run the negotiator:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer ollama --model gemma4:latest \
    --threshold 0.80 --min-savings 0.20 --max-savings 0.80 \
    --output result.json --format json

Negotiation Result:
  Original: 118 tokens, score=0.600
  Optimized: 92 tokens, score=0.700
  Savings: 22.0%
  Quality Retention: 116.7%
  Success: Yes
  Sections removed: style_guide

Results saved to result.json
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;result.json&lt;/code&gt; contains the full ablation log, the final optimized prompt, per-step scores, and metadata (elapsed time, scoring call count, rubric name).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the negotiator - OpenRouter path&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-or-...
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget check-openrouter
&lt;span class="go"&gt;OpenRouter is connected
  Base URL: https://openrouter.ai/api/v1
  Model requested: qwen/qwen3-8b
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;token-budget &lt;span class="nt"&gt;-v&lt;/span&gt; negotiate examples/prompt.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="go"&gt;    --scorer openrouter --model meta-llama/llama-3.2-3b-instruct \
    --rubric rubrics/qa.yaml \
    --threshold 0.7 --min-savings 0.1 --max-savings 0.6 --no-cache
Connected to openrouter

Negotiation Result:
  Original: 118 tokens, score=1.000
  Optimized: 92 tokens, score=0.900
  Savings: 22.0%
  Quality Retention: 90.0%
  Success: Yes
  Sections removed: style_guide
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a looser threshold (&lt;code&gt;-t 0.7 --min-savings 0.1 --max-savings 0.5&lt;/code&gt;) and caching left on, the same model drops two sections for 44.1% savings at 100% quality retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279uww5s8wbbjf93vwt0.png" alt=" " width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key &lt;code&gt;negotiate&lt;/code&gt; flags:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-r, --rubric PATH&lt;/code&gt;:  YAML rubric. Defaults to a built-in accuracy+relevance rubric.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-s, --scorer {ollama,openrouter}&lt;/code&gt;: which judge to use. Default ollama.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-m, --model TEXT&lt;/code&gt;: model name (gemma4:latest for Ollama, qwen/qwen3-8b for OpenRouter, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-t, --threshold FLOAT&lt;/code&gt;: minimum fraction of the baseline score to keep. Default 0.95.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--min-savings FLOAT&lt;/code&gt;: stop once savings reach this fraction. Default 0.40.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--max-savings FLOAT&lt;/code&gt;: never drop sections if it would save more than this. Default 0.60.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-o, --output PATH&lt;/code&gt; / &lt;code&gt;-f, --format {json,yaml}&lt;/code&gt;: write a machine-readable report.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--no-cache&lt;/code&gt;: disable the in-memory scoring cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;token_budget_negotiator.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SectionType&lt;/span&gt;

&lt;span class="n"&gt;sections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are helpful.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SYSTEM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;PromptSection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2+2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;section_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;priority&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;rubric&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Rubric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa rubric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;RubricCriterion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;factually correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                              &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scorer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OllamaScorer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;negotiator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Negotiator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;min_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_token_savings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;negotiator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;negotiate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rubric&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimized_token_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;removed:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sections_removed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Rubric Format
&lt;/h2&gt;

&lt;p&gt;The rubric defines what quality means for the task. The judge scores each test prompt against it. Three rubrics ship in &lt;code&gt;rubrics/&lt;/code&gt;: &lt;code&gt;qa.yaml&lt;/code&gt;, &lt;code&gt;coding.yaml&lt;/code&gt;, &lt;code&gt;summarization.yaml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qa&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;General question-answer rubric&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;criteria&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;accuracy&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Is the response factually correct?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;relevance&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Does it answer what was asked?&lt;/span&gt;
    &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;
&lt;span class="na"&gt;scoring_instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;Score 0-1. 1 = perfect, 0 = wrong or irrelevant.&lt;/span&gt;
&lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP Server
&lt;/h2&gt;

&lt;p&gt;The library also runs as an MCP server over stdio transport, exposing two tools, &lt;code&gt;analyze&lt;/code&gt; and &lt;code&gt;negotiate&lt;/code&gt;, so Claude Code or any MCP-compatible agent can call it directly during a session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; token_budget_negotiator.mcp_server &lt;span class="nt"&gt;--scorer&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; gemma4:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analyze&lt;/code&gt; takes a sections list and returns token distribution as JSON. &lt;code&gt;negotiate&lt;/code&gt; takes sections, rubric, task, thresholds, and scorer config and returns the full negotiation result as JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ablation is greedy one-at-a-time in priority order, not exhaustive subset search.&lt;/li&gt;
&lt;li&gt;The judge is asked for strict JSON; free-text replies fall back to regex score extraction with reduced confidence.&lt;/li&gt;
&lt;li&gt;Small local judges like &lt;code&gt;gemma4&lt;/code&gt; are noisy; prefer thresholds in the 0.80-0.90 range and expect multi-minute wall-clock times even for short prompts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;check-openrouter&lt;/code&gt; and the OpenRouter scorer require &lt;code&gt;OPENROUTER_API_KEY&lt;/code&gt;; there is no offline stub.&lt;/li&gt;
&lt;li&gt;Only the &lt;code&gt;remove&lt;/code&gt; compression strategy is wired up. &lt;code&gt;CompressionStrategy&lt;/code&gt; and &lt;code&gt;sections_compressed&lt;/code&gt; exist on the model but are not yet produced by the negotiator.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks including model evals, prompt optimisation, and end-to-end pipeline development.&lt;/p&gt;

&lt;p&gt;The problem was defined at a high level: a tool that takes a structured prompt, scores it with a local or remote LLM judge, and finds the minimum set of sections needed to hit a quality threshold. NEO generated the full implementation: the greedy ablation loop in &lt;code&gt;Negotiator&lt;/code&gt;, the &lt;code&gt;OllamaScorer&lt;/code&gt; and &lt;code&gt;OpenRouterScorer&lt;/code&gt; with their shared interface, the &lt;code&gt;ScoreCache&lt;/code&gt; with TTL-based invalidation, the &lt;code&gt;SectionTokenizer&lt;/code&gt; backed by tiktoken, the YAML rubric format, the MCP server with its two exposed tools, and the CLI built on Click. All 49 tests pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Token Budget Negotiator turns prompt compression from guesswork into an empirical process. It scores every section against a rubric, drops only what demonstrably doesn't matter, and produces a report showing exactly what changed and why.&lt;br&gt;
The code is at &lt;a href="https://github.com/dakshjain-1616/token-budget-negotiator" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/token-budget-negotiator&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Memory Compressor: Intelligent Memory Compression for Long-Running LLM Agents</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:02:47 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</link>
      <guid>https://forem.com/nilofer_tweets/agent-memory-compressor-intelligent-memory-compression-for-long-running-llm-agents-5941</guid>
      <description>&lt;p&gt;A 10-turn agent session can easily accumulate 20,000+ tokens of raw history, leaving almost no room for the current task. Naive truncation drops older turns wholesale, including the decisions and discovered facts the agent needs to avoid repeating work. Developers need a principled way to compress history rather than discard it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Memory Compressor&lt;/strong&gt; is a Python library that implements an intelligent memory compression pipeline for long-running LLM agents. It combines importance-based scoring, LLM-driven summarization, a forgetting curve trigger, and a token-budgeted context builder so agents can run indefinitely without exhausting their context windows, while preserving the facts and decisions that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Context Window Exhaustion
&lt;/h2&gt;

&lt;p&gt;The problem has three dimensions, and agent-memory-compressor addresses each one directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to keep&lt;/strong&gt;: A multi-signal importance scorer ranks every memory entry.&lt;br&gt;
&lt;strong&gt;How to shrink&lt;/strong&gt;: Three pluggable compression strategies replace low-value entries with compact equivalents using any OpenAI-compatible LLM.&lt;br&gt;
&lt;strong&gt;When to act&lt;/strong&gt;: A forgetting curve fires compression automatically when either a turn interval or a token threshold is crossed.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Importance Scoring&lt;/strong&gt;&lt;br&gt;
Every memory entry is scored by the &lt;code&gt;ImportanceScorer&lt;/code&gt;, which combines three signals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vuitmevat3gxfqchuy5.png" alt=" " width="780" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compression Strategies&lt;/strong&gt;&lt;br&gt;
Given a scored store, the CompressionEngine exposes three strategies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;summarize(entry)&lt;/code&gt;: Asks the LLM for a short summary that preserves all decisions and facts.&lt;br&gt;
&lt;code&gt;extract_facts(entry)&lt;/code&gt;: Asks the LLM for a bullet list of facts and decisions, stored as high-importance compressed entries.&lt;br&gt;
&lt;code&gt;archive(entry)&lt;/code&gt;: Replaces the entry with a minimal reference; the original content is retained in the entry's &lt;code&gt;compression_history&lt;/code&gt; for audit.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;MemoryCompressor&lt;/strong&gt; orchestrates the pipeline: score, pick the lowest-scoring non-protected entries, apply the least-destructive strategy first, and iterate until the store is under &lt;code&gt;token_budget&lt;/code&gt;. Every successful replacement is verified to actually reduce the token count, so compression can never make the context larger.&lt;/p&gt;
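&lt;p&gt;In rough pseudocode, that orchestration reduces to the loop below; apart from &lt;code&gt;token_total()&lt;/code&gt;, the helper names are assumptions based on the description, not the library's internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch of the compression pipeline; method names other
# than token_total() are assumptions for illustration.
STRATEGIES = ("summarize", "extract_facts", "archive")  # least destructive first

def run_compression(store, engine, scorer, token_budget):
    for entry in scorer.rank(store):               # lowest importance first
        if store.token_total() &lt;= token_budget:
            break                                  # under budget: done
        if entry.protected:
            continue                               # recent turns are never touched
        for name in STRATEGIES:
            replacement = getattr(engine, name)(entry)
            if replacement.token_count &lt; entry.token_count:
                store.replace(entry, replacement)  # verified net reduction
                break
    return store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;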

&lt;p&gt;&lt;strong&gt;The Forgetting Curve&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ForgettingCurve&lt;/code&gt; decides when to compress. It combines two triggers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Turn-based:&lt;/strong&gt; fires once the number of turns since the last compression reaches &lt;code&gt;compression_interval_turns&lt;/code&gt; (default: 10)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token-based:&lt;/strong&gt; fires once &lt;code&gt;MemoryStore.token_total()&lt;/code&gt; exceeds &lt;code&gt;compression_threshold_tokens&lt;/code&gt; (default: 6000), with hysteresis to prevent thrashing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;should_compress(store)&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; as soon as either condition is met. &lt;code&gt;get_compression_priority(store)&lt;/code&gt; returns entries sorted by importance, so the orchestrator always attacks the least-valuable history first.&lt;/p&gt;
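&lt;p&gt;The hysteresis on the token trigger can be pictured as a simple armed/disarmed latch; the sketch below illustrates the idea and is not the library's implementation (the re-arm fraction is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative hysteresis latch; names and re-arm fraction are assumptions.
class TokenTrigger:
    def __init__(self, threshold: int = 6000, rearm_fraction: float = 0.8):
        self.threshold = threshold
        self.rearm_below = int(threshold * rearm_fraction)
        self.armed = True

    def fire(self, token_total: int) -&gt; bool:
        if self.armed and token_total &gt; self.threshold:
            self.armed = False      # disarm until usage drops well below
            return True
        if token_total &lt; self.rearm_below:
            self.armed = True       # re-arm: prevents firing on every turn
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;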
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="c"&gt;# optional, for live LLM calls&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The package depends on &lt;code&gt;pydantic&lt;/code&gt;, &lt;code&gt;tiktoken&lt;/code&gt; (for &lt;code&gt;cl100k_base&lt;/code&gt; token counts), &lt;code&gt;click&lt;/code&gt;, and &lt;code&gt;rich&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Usage Example
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MemoryCompressor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.triggers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ForgettingCurve&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ContextConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.strategies&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CompressionEngine&lt;/span&gt;

&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MemoryEntry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;turn_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;turn&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryCompressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CompressionEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;curve&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ForgettingCurve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;compression_interval_turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="n"&gt;compression_threshold_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;should_compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;curve&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mark_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_saved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compression_ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reduction)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ContextBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ContextConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Without an API key, &lt;code&gt;LLMClient&lt;/code&gt; falls back to a deterministic short stub so pipelines remain runnable in tests and offline demos. A full end-to-end demo lives at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor/blob/main/demos/long_run_demo.py" rel="noopener noreferrer"&gt;demos/long_run_demo.py&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  API Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqv6ab49n2hrdoiiaylc.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;memory-cli&lt;/code&gt; entrypoint (&lt;code&gt;click&lt;/code&gt;-based) is installed for quick inspection, compression, and demo runs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Integration with the Session Manager
&lt;/h2&gt;

&lt;p&gt;The adapters module wires the compressor directly into the &lt;a href="https://github.com/dakshjain-1616/agent-session-manager" rel="noopener noreferrer"&gt;Stateful Agent Session Manager&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_memory_compressor.adapters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;compress_session&lt;/span&gt;

&lt;span class="n"&gt;compressed_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# anything exposing get_messages() / get_metadata()
&lt;/span&gt;    &lt;span class="n"&gt;token_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;protected_recent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SessionAdapter.session_to_store&lt;/code&gt; projects session messages into a &lt;code&gt;MemoryStore&lt;/code&gt;, &lt;code&gt;compressor.compress(...)&lt;/code&gt; runs the pipeline, and &lt;code&gt;store_to_session&lt;/code&gt; projects the compressed entries back into the session's message format, preserving original roles and retaining the compression history on each compacted entry.&lt;/p&gt;
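&lt;p&gt;For readers using the lower-level API instead of the &lt;code&gt;compress_session&lt;/code&gt; wrapper, the round trip looks roughly like this. A minimal sketch: the &lt;code&gt;Compressor&lt;/code&gt; name and exact signatures are assumptions based on the description above, so check the repo for the released API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent_memory_compressor.adapters import SessionAdapter
from agent_memory_compressor import Compressor  # assumed import path

# `session` is anything exposing get_messages() / get_metadata()

# 1. Project the session's messages into a MemoryStore
store = SessionAdapter.session_to_store(session)

# 2. Run the scoring + compression pipeline against a token budget
report = Compressor(token_budget=4000).compress(store)

# 3. Project compressed entries back into the session's message
#    format, preserving roles and per-entry compression history
messages = SessionAdapter.store_to_session(store, session)
print(f"kept {len(messages)} messages ({report.compression_ratio:.0%} saved)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;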

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that writes code end-to-end for AI/ML tasks, including model evals, prompt optimization, and pipeline development.&lt;br&gt;
I described the problem at a high level: an intelligent memory pipeline for long-running agents that scores history by importance, compresses the least valuable entries, and assembles a token-bounded context. &lt;/p&gt;

&lt;p&gt;NEO generated the full implementation: the multi-signal ImportanceScorer, the three compression strategies in CompressionEngine, the turn- and token-based ForgettingCurve triggers, the token-budgeted ContextBuilder, and the SessionAdapter that wires everything into an existing agent session, all as a coherent, installable Python library.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Can Build Further With NEO
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic similarity scoring&lt;/strong&gt;: straightforward; call an embeddings API and add the score as another signal in the existing pipeline, as is commonly done in RAG systems.&lt;br&gt;
&lt;strong&gt;Pluggable tokenizers&lt;/strong&gt;: purely an engineering task; abstract the tiktoken call behind an interface. No research needed.&lt;br&gt;
&lt;strong&gt;More agent framework adapters&lt;/strong&gt;: LangChain and LlamaIndex both expose message lists, and the &lt;code&gt;session_to_store&lt;/code&gt; pattern already exists; repeat it for each framework.&lt;br&gt;
&lt;strong&gt;Streaming compression&lt;/strong&gt;: the trigger logic already exists; moving it to run per-turn is a refactor, not a research problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Notes
&lt;/h2&gt;

&lt;p&gt;Agent Memory Compressor is a principled answer to context window exhaustion for long-running LLM agents.&lt;/p&gt;

&lt;p&gt;Instead of truncating history blindly, it scores every piece of memory, applies the least-destructive compression strategy first, and assembles a token-bounded context that preserves what the agent actually needs: the decisions, discovered facts, and recent turns that matter most.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Agent-Memory-Compressor" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Agent-Memory-Compressor&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>devtools</category>
      <category>agents</category>
    </item>
    <item>
      <title>Cache-Augmented Generation (CAG): A RAG-less Approach to Document QA</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 25 Apr 2026 10:29:22 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</link>
      <guid>https://forem.com/nilofer_tweets/cache-augmented-generation-cag-a-rag-less-approach-to-document-qa-3296</guid>
      <description>&lt;p&gt;Most document QA systems today rely on Retrieval-Augmented Generation (RAG). The standard pipeline is familiar: chunk the document, generate embeddings, store them in a vector database, and retrieve relevant chunks at query time.&lt;/p&gt;

&lt;p&gt;This works, but it comes with trade-offs. The model only sees fragments of the document, retrieval adds latency, and the system becomes more complex with multiple moving parts.&lt;/p&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) explores a different approach, where the document is processed once and reused across queries instead of being retrieved repeatedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Cache-Augmented Generation
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation (CAG) approaches document QA by reusing the model’s internal state instead of retrieving context for every query.&lt;/p&gt;

&lt;p&gt;During ingestion, the entire document is processed in a single pass. In this step, the model builds its KV (key-value) cache, which represents the document’s context.&lt;/p&gt;

&lt;p&gt;This KV cache is then saved to disk.&lt;/p&gt;

&lt;p&gt;When a query is made, the cache is restored and the query is appended, allowing the model to generate responses using the previously processed document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg3ppq2ap9haejv1cnvd.png" alt=" " width="757" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Ingest (done once per document)&lt;/strong&gt;&lt;br&gt;
The document is wrapped in a structured prompt and sent to llama-server. The model runs a full prefill pass, loading every token into the KV cache. This takes time, proportional to document size, but only happens once. The KV cache is then saved to a .bin file on disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Query (instant, repeatable)&lt;/strong&gt;&lt;br&gt;
Before each query, the saved .bin file is restored into llama-server's KV cache in ~1 second. The user's question is appended and the model generates an answer with full document context active. No re-reading, no re-embedding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Persistence&lt;/strong&gt;&lt;br&gt;
KV slots survive server restarts. Kill the server, restart it, and your next query restores the cache from disk just as fast. The 24-minute prefill for War and Peace needs to happen only once.&lt;/p&gt;
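&lt;p&gt;The restore-and-query step maps onto llama.cpp's slot save/restore feature. Below is a minimal sketch of the restore-then-ask flow, assuming llama-server was started with &lt;code&gt;--slot-save-path&lt;/code&gt; and the endpoint shapes of recent llama.cpp builds; exact paths and fields may vary by version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

LLAMA = "http://localhost:8080"  # llama-server address (assumed)

# Restore the saved KV cache into slot 0 (~1 s from disk)
requests.post(f"{LLAMA}/slots/0?action=restore",
              json={"filename": "my_doc.bin"}).raise_for_status()

# Append the question; the document is already "loaded" via the cache
resp = requests.post(f"{LLAMA}/completion", json={
    "prompt": "Question: Who is Pierre Bezukhov?\nAnswer:",
    "cache_prompt": True,   # reuse the restored KV state
    "n_predict": 256,
})
print(resp.json()["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;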
&lt;h2&gt;
  
  
  &lt;strong&gt;Validated Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All 11 GPU tests were run on an NVIDIA RTX A6000 (48 GB VRAM) with Qwen3.5-35B-A3B Q3_K_M at 1,048,576 token context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcl1j86rcusllw996dbj.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Output&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;“Who is Pierre Bezukhov?” → Correct, detailed answer&lt;br&gt;
“What happened at the Battle of Borodino?” → Correct, detailed answer&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Start&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Prerequisites: Linux, NVIDIA GPU (8 GB+ VRAM), Python 3.8+&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Build llama.cpp + download model (one-time, ~35 min)&lt;/span&gt;
./setup.sh

&lt;span class="c"&gt;# 2. Start the LLM server&lt;/span&gt;
./start_server.sh

&lt;span class="c"&gt;# 3. Start the API server&lt;/span&gt;
python3 src/api_server.py

&lt;span class="c"&gt;# 4. Ingest a document&lt;/span&gt;
python3 src/ingest.py my_document.txt &lt;span class="nt"&gt;--corpus-id&lt;/span&gt; my_doc

&lt;span class="c"&gt;# 5. Query it&lt;/span&gt;
python3 src/query.py my_doc &lt;span class="s2"&gt;"What is this document about?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. After step 4, the KV cache is saved to &lt;code&gt;kv_slots/my_doc.bin&lt;/code&gt;. Every future query restores it in about a second, and it survives server restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Model Selection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;setup.sh auto-detects GPU VRAM and picks the right model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F172inz546xzlcb7o615u.png" alt=" " width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 24 GB+ path uses unsloth/Qwen3.5-35B-A3B-GGUF on HuggingFace and requires a free HF account + access token.&lt;/p&gt;
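&lt;p&gt;The detection itself is a thin wrapper around nvidia-smi. An illustrative sketch of what setup.sh does; the threshold comes from the table above, while the fallback model name is a placeholder, not the script's actual value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

# Total VRAM (MiB) of the first GPU, via nvidia-smi
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total",
     "--format=csv,noheader,nounits"], text=True)
vram_gb = int(out.splitlines()[0]) / 1024

# Pick a model tier by available VRAM
if vram_gb &gt;= 24:
    model = "unsloth/Qwen3.5-35B-A3B-GGUF"
else:
    model = "smaller-gguf-model"  # placeholder for the lower-VRAM tiers
print(f"{vram_gb:.0f} GB VRAM -&gt; {model}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;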

&lt;h2&gt;
  
  
  &lt;strong&gt;REST API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start the API server with &lt;code&gt;python3 src/api_server.py --port 8000&lt;/code&gt; (optionally set the &lt;code&gt;CAG_API_KEY&lt;/code&gt; env var to enable key auth).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhslie8eplnq0cbj0uelk.png" alt=" " width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Full API docs available at &lt;code&gt;http://localhost:8000/docs&lt;/code&gt; when the server is running.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Directory Structure&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── setup.sh              # Builds llama.cpp, downloads model
├── start_server.sh       # Launches llama-server with CAG flags
├── requirements.txt
├── src/
│   ├── api_server.py     # FastAPI REST API
│   ├── ingest.py         # CLI: ingest a document
│   ├── query.py          # CLI: query a corpus
│   └── demo.py           # End-to-end demo
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
├── docs/
│   ├── REPORT.md         # Full GPU validation report with all 11 test results
│   └── GPU_TESTING.md    # GPU test checklist
├── models/               # GGUF weights (not committed)
├── kv_slots/             # Saved KV cache .bin files (not committed)
└── logs/                 # Runtime logs (not committed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linux + NVIDIA only:&lt;/strong&gt; TurboQuant CUDA kernels require Linux and NVIDIA GPUs (no Windows, macOS, or AMD).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long initial prefill:&lt;/strong&gt; ~900K tokens can take ~24 minutes on an A6000. This is a one-time cost; subsequent queries restore in ~1 second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VRAM gating:&lt;/strong&gt; Systems with lower VRAM use smaller models with shorter context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single active corpus:&lt;/strong&gt; Uses a single llama.cpp slot (slot 0). Switching corpora requires restoring a different KV cache (~1 second).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-context limitations:&lt;/strong&gt; YaRN extrapolation biases attention toward the start and end of documents, so mid-document content can be missed at very large context sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build time:&lt;/strong&gt; Initial setup (./setup.sh) can take ~35 minutes to compile CUDA kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model access requirements:&lt;/strong&gt; Large models (e.g., Qwen3.5-35B) require a Hugging Face account and access token.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built This Using NEO
&lt;/h2&gt;

&lt;p&gt;This project was built using &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt;, a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The system was defined at a high level, describing a document QA workflow that avoids RAG by loading full documents into an LLM, saving the KV cache, and restoring it for repeated queries.&lt;/p&gt;

&lt;p&gt;Based on this, NEO generated the implementation, handled debugging across CUDA, Python, and shell components, and validated the system through a series of GPU tests.&lt;/p&gt;

&lt;p&gt;This included fixing multiple issues during development and running end-to-end validation to ensure ingestion, cache restoration, and query flows worked reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system can be extended in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;supporting multiple KV cache slots&lt;/li&gt;
&lt;li&gt;improving handling of long-context attention limitations&lt;/li&gt;
&lt;li&gt;optimizing cache storage and compression&lt;/li&gt;
&lt;li&gt;exploring hybrid approaches combining CAG with retrieval&lt;/li&gt;
&lt;li&gt;extending API capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These extensions would require changes to the current implementation and can be explored based on system requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Notes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cache-Augmented Generation is an alternative way to approach document QA.&lt;/p&gt;

&lt;p&gt;Instead of retrieving context at query time, it shifts the cost to a one-time preprocessing step and reuses the model’s KV cache.&lt;/p&gt;

&lt;p&gt;This makes repeated queries faster and makes the document context available to the model through the KV cache, while introducing trade-offs in setup time and hardware requirements.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Cache-Augmented-Generation-CAG-System&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>rag</category>
    </item>
    <item>
      <title>Loop Anti-Pattern Linter: Finding Hidden Performance Issues in Python</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Sat, 25 Apr 2026 04:40:58 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/loop-anti-pattern-linter-finding-hidden-performance-issues-in-python-1oi7</link>
      <guid>https://forem.com/nilofer_tweets/loop-anti-pattern-linter-finding-hidden-performance-issues-in-python-1oi7</guid>
      <description>&lt;p&gt;When writing Python code, loop-heavy logic often looks correct but hides performance issues that only show up at scale.&lt;/p&gt;

&lt;p&gt;Patterns like repeated membership checks, string concatenation, or nested iteration over the same data can silently increase time complexity. These are easy to miss during development because they do not break functionality, but they can significantly affect runtime as input size grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most of these patterns are not syntax errors. The code runs correctly, but the performance cost grows with input size.&lt;/p&gt;

&lt;p&gt;Because they are subtle, they are rarely caught during code review unless someone is actively looking for them. This is especially relevant in data processing, backend services, and any code that operates on large collections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Loop Anti-Pattern Linter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a static analyzer that scans Python code to detect common loop anti-patterns.&lt;/p&gt;

&lt;p&gt;It does not execute the code. Instead, it parses the source using Python’s ast module and identifies inefficient patterns using rule-based detectors.&lt;/p&gt;

&lt;p&gt;Each finding includes an estimated slowdown percentage based on Big-O heuristics and a suggestion. Results are ranked so that higher-impact issues appear first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At a high level, the tool processes code in the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts a Python file or directory as input&lt;/li&gt;
&lt;li&gt;Parses the code into an Abstract Syntax Tree (AST)&lt;/li&gt;
&lt;li&gt;Runs detectors implemented as NodeVisitor subclasses&lt;/li&gt;
&lt;li&gt;Each detector targets a specific anti-pattern&lt;/li&gt;
&lt;li&gt;Assigns a predefined estimated_slowdown_pct value&lt;/li&gt;
&lt;li&gt;Sorts findings by this value&lt;/li&gt;
&lt;li&gt;Outputs results as a table or JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farlbgt52zq75blee529m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farlbgt52zq75blee529m.png" alt=" " width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the analysis is static and AST-based, it introduces zero runtime overhead.&lt;/p&gt;
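&lt;p&gt;To make the detector shape concrete, here is a minimal, hypothetical detector in the same style: an ast.NodeVisitor that flags augmented assignment inside a loop. The real detectors also verify the target type and cover more cases; this sketch only shows the pattern.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

class StringConcatInLoop(ast.NodeVisitor):
    """Naive detector: flags `x += ...` inside a for/while body."""

    def __init__(self):
        self.findings = []

    def visit_For(self, node):
        self._scan(node)

    def visit_While(self, node):
        self._scan(node)

    def _scan(self, loop):
        for child in ast.walk(loop):
            if isinstance(child, ast.AugAssign) and isinstance(child.op, ast.Add):
                self.findings.append({
                    "line": child.lineno,
                    "pattern": "StringConcatInLoop",
                    "estimated_slowdown_pct": 40,  # predefined heuristic value
                    "suggestion": "Accumulate parts in a list, then ''.join()",
                })

detector = StringConcatInLoop()
detector.visit(ast.parse(open("mymodule.py").read()))

# Rank so the highest-impact findings appear first
for finding in sorted(detector.findings,
                      key=lambda f: f["estimated_slowdown_pct"], reverse=True):
    print(finding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;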

&lt;p&gt;&lt;strong&gt;How the Scoring Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scoring is based on Big-O heuristics.&lt;/p&gt;

&lt;p&gt;Each pattern is associated with a predefined slowdown estimate based on typical complexity impact. For example, patterns that introduce nested iteration or repeated linear scans are assigned higher percentages.&lt;/p&gt;

&lt;p&gt;These values are not derived from runtime benchmarking. They are directional signals to help prioritize fixes.&lt;/p&gt;

&lt;p&gt;Findings are sorted by the estimated_slowdown_pct field so that the most impactful issues appear first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detected Loop Anti-Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The tool detects a set of common loop inefficiencies and assigns a heuristic slowdown estimate to each:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotd18rj3f22ix4c57vl0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotd18rj3f22ix4c57vl0.png" alt=" " width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By default, the output includes the fields shown below. When using the &lt;code&gt;--explain&lt;/code&gt; flag, an additional explanation field is included.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/processor.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ListAppendInLoop"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"estimated_slowdown_pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Replace with list comprehension"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CLI Usage&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Analyze a directory
&lt;span class="go"&gt;python loop_antipattern_lint.py ./src

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Analyze a single file
&lt;span class="go"&gt;python loop_antipattern_lint.py mymodule.py

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;JSON output
&lt;span class="go"&gt;python loop_antipattern_lint.py ./src --json

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Filter high-impact issues
&lt;span class="go"&gt;python loop_antipattern_lint.py ./src --min-slowdown 30
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AI-Powered Explanations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--explain&lt;/code&gt; flag adds natural-language explanations to each finding using OpenRouter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;... python loop_antipattern_lint.py ./src &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This helps explain why a pattern is inefficient and how to improve it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How I Built This Using NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project started from a simple requirement: identify inefficient loop patterns in Python code and highlight the ones that are likely to impact performance.&lt;/p&gt;

&lt;p&gt;Instead of building everything manually, I used &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; to generate an initial version of the tool from a high-level description.&lt;/p&gt;

&lt;p&gt;FYI: NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The prompt focused on the expected behavior:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Build a Python CLI tool that analyzes code using AST, detects loop anti-patterns, assigns slowdown estimates based on Big-O heuristics, and outputs ranked results with suggestions.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This produced a working baseline that aligned with the intended functionality of the tool.&lt;/p&gt;

&lt;p&gt;From there, the focus was on validating the output and making small adjustments. This included checking that the detected patterns matched expectations, refining the heuristic slowdown values, and ensuring the CLI usage behaved as intended.&lt;/p&gt;

&lt;p&gt;This approach made it possible to move from a high-level idea to a functional tool quickly, without manually implementing each part from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current system can be extended in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding new detectors for additional patterns&lt;/li&gt;
&lt;li&gt;refining slowdown estimates&lt;/li&gt;
&lt;li&gt;improving suggestions&lt;/li&gt;
&lt;li&gt;integrating with CI pipelines&lt;/li&gt;
&lt;li&gt;extending analysis to other inefficiencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These extensions can be approached by describing the required behavior and iterating on the existing implementation using NEO.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the Project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python loop_antipattern_lint.py ./src
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Final Notes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This tool focuses on identifying inefficiencies that are easy to miss during development but can have a measurable impact as data size grows.&lt;/p&gt;

&lt;p&gt;By combining static analysis with cost estimation based on Big-O heuristics, it helps prioritize optimizations that are worth addressing.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/Loop-Anti-Pattern-Linter" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/Loop-Anti-Pattern-Linter&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Low-Latency Model Router: Automatic LLM Selection Across OpenRouter</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Wed, 22 Apr 2026 14:41:20 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/low-latency-model-router-automatic-llm-selection-across-openrouter-2mjo</link>
      <guid>https://forem.com/nilofer_tweets/low-latency-model-router-automatic-llm-selection-across-openrouter-2mjo</guid>
      <description>&lt;p&gt;When calling an LLM API directly, the model selection is typically fixed ahead of time. In practice, this creates several limitations across different workloads.&lt;br&gt;
Latency varies depending on the model and request type. Lower-cost models may not maintain quality under certain conditions. External API failures require fallback handling. Repeated identical requests increase cost if caching is not applied.&lt;br&gt;
These constraints require a routing layer that can dynamically select models based on latency, cost, and quality, while also handling caching, fallback, and observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Project Does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project implements a low-latency LLM router that dynamically selects the best model for each request based on latency, cost, and quality.&lt;/p&gt;

&lt;p&gt;Instead of sending every request to a fixed model, the router evaluates multiple candidates at runtime and routes each request to the most suitable option.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0izxjy0yfb3lmq1cna4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0izxjy0yfb3lmq1cna4.png" alt=" " width="800" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How the Scoring Engine Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The scoring engine is the component responsible for evaluating available models and selecting the most suitable one for each request.&lt;/p&gt;

&lt;p&gt;It assigns a score to each model based on latency, cost, and quality, and then selects the model with the highest score according to the defined priority.&lt;/p&gt;

&lt;p&gt;Every model in the catalogue is scored on three dimensions. A routing decision is a single pass over the catalogue to find the highest-scoring candidate given your weight preferences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score = w_latency * (1 - norm_latency)
      + w_cost    * (1 - norm_cost)
      + w_quality * quality_score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
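&lt;p&gt;In code, a routing decision is an argmax of that score over the catalogue. A minimal sketch: the latency, cost, and quality numbers here are made up for illustration, and only the "balanced" weights match the config.yaml shown later.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative catalogue; latency/cost/quality values are made up
CATALOGUE = {
    "openai/gpt-4o-mini":       {"latency_ms": 450, "cost": 0.60, "quality": 0.80},
    "anthropic/claude-3-haiku": {"latency_ms": 520, "cost": 1.25, "quality": 0.82},
    "google/gemini-flash-1.5":  {"latency_ms": 310, "cost": 0.35, "quality": 0.78},
}

def route(weights):
    max_lat = max(m["latency_ms"] for m in CATALOGUE.values())
    max_cost = max(m["cost"] for m in CATALOGUE.values())

    def score(m):
        return (weights["latency"] * (1 - m["latency_ms"] / max_lat)
                + weights["cost"] * (1 - m["cost"] / max_cost)
                + weights["quality"] * m["quality"])

    # Single pass: pick the highest-scoring candidate
    return max(CATALOGUE, key=lambda name: score(CATALOGUE[name]))

# "balanced" uses the default weights from config.yaml
print(route({"latency": 0.4, "cost": 0.3, "quality": 0.3}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;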



&lt;p&gt;You control the weights via the priority field:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc4fylh6r0huaxprrree.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc4fylh6r0huaxprrree.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Catalogue&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps4x2pvou28k0qlclzl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps4x2pvou28k0qlclzl3.png" alt=" " width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the selected model fails, the router automatically retries with the next-best candidate. Identical requests are served from cache: Redis if available, with an in-memory fallback if Redis is unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Structure&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ml_project_0652/
├── src/
│   ├── models.py              # Pydantic schemas
│   ├── router/
│   │   ├── core.py            # Weighted scoring engine + model catalogue
│   │   ├── metrics.py         # Rolling-window metrics tracker
│   │   ├── openrouter.py      # Async OpenRouter API client
│   │   └── cache.py           # Redis cache + MockCache fallback
│   ├── api/
│   │   ├── main.py            # FastAPI app
│   │   └── routes.py          # Route definitions
│   └── cli/
│       └── commands.py        # Typer CLI
├── tests/                     # 29 unit + integration tests
├── start_router.py            # Server entry point
├── config.yaml                # Server, Redis, and routing config
├── .env.example               # Environment variable template
└── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;REST API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/route &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "priority": "balanced"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gen-abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-flash-1.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Paris."&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"completion_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"routing_decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"selected_model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google/gemini-flash-1.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Best composite score based on latency, cost, and quality"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"latency_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;312.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cached"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Route with speed priority and latency cap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/route &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [{"role": "user", "content": "Translate: hello"}],
    "priority": "speed",
    "max_latency_ms": 700
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All Endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd3t3clbvxa11l02c0f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhd3t3clbvxa11l02c0f1.png" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Identical requests are cached using a hash of the request.&lt;br&gt;
Redis configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Redis is unavailable, the system automatically falls back to in-memory caching.&lt;/p&gt;
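&lt;p&gt;A sketch of the key construction; the hash and key format here are assumptions, not the project's literal implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import json

def cache_key(messages, priority):
    # Canonicalise the request so byte-identical requests map to one key
    payload = json.dumps({"messages": messages, "priority": priority},
                         sort_keys=True, separators=(",", ":"))
    return "route:" + hashlib.sha256(payload.encode()).hexdigest()

key = cache_key([{"role": "user", "content": "What is the capital of France?"}],
                "balanced")
# Stored with the configured TTL, e.g. redis_client.setex(key, 3600, response_json)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;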

&lt;p&gt;&lt;strong&gt;Fallback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fallback models are defined in config.yaml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fallback_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3-haiku"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-flash-1.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;average latency&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;p99 latency&lt;/li&gt;
&lt;li&gt;per-model usage&lt;/li&gt;
&lt;li&gt;cache hit rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Available via the &lt;code&gt;/metrics&lt;/code&gt; endpoint.&lt;/p&gt;
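&lt;p&gt;A rolling-window tracker for those numbers can be very small. A sketch; the window size is arbitrary and the project's metrics.py may differ:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import defaultdict, deque

WINDOW = 200  # keep the most recent 200 latency samples per model
latencies = defaultdict(lambda: deque(maxlen=WINDOW))

def record(model, latency_ms):
    latencies[model].append(latency_ms)

def percentile(model, pct):
    samples = sorted(latencies[model])
    if not samples:
        return None
    # Nearest-rank percentile over the rolling window
    idx = min(len(samples) - 1, int(len(samples) * pct / 100))
    return samples[idx]

record("google/gemini-flash-1.5", 312.4)
print(percentile("google/gemini-flash-1.5", 95))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;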

&lt;p&gt;&lt;strong&gt;CLI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List models&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands models

&lt;span class="c"&gt;# Preview routing decision&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt; &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Route with quality priority&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"Summarize this article"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; quality &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Route with latency cap&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"Hello"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; speed &lt;span class="nt"&gt;--max-latency&lt;/span&gt; 600 &lt;span class="nt"&gt;--dry-run&lt;/span&gt;

&lt;span class="c"&gt;# Live call&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands route &lt;span class="s2"&gt;"What is 2+2?"&lt;/span&gt; &lt;span class="nt"&gt;--priority&lt;/span&gt; balanced

&lt;span class="c"&gt;# Benchmark&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; src.cli.commands benchmark &lt;span class="nt"&gt;--iterations&lt;/span&gt; 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;

&lt;span class="na"&gt;redis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost"&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6379&lt;/span&gt;
  &lt;span class="na"&gt;ttl_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;

&lt;span class="na"&gt;routing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default_weights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.4&lt;/span&gt;
    &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
    &lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.3&lt;/span&gt;
  &lt;span class="na"&gt;fallback_models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3-haiku"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/gemini-flash-1.5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How I Built This Using NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used &lt;a href="https://heyneo.com/" rel="noopener noreferrer"&gt;NEO&lt;/a&gt; AI Engineer to build this project by starting with a high-level description of the system requirements.&lt;/p&gt;

&lt;p&gt;FYI: NEO is a fully autonomous AI engineering agent that can write code and build solutions for AI/ML tasks, including model evals, prompt optimization, and end-to-end AI pipeline development.&lt;/p&gt;

&lt;p&gt;The goal was to create a routing layer that can dynamically select LLMs based on latency, cost, and quality, while also supporting caching, fallback handling, and metrics tracking.&lt;/p&gt;

&lt;p&gt;I began by giving this task prompt to Neo:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Build a FastAPI LLM router that selects models from OpenRouter based on latency, cost, and quality. Include weighted scoring, fallback handling, caching, and metrics tracking.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From this prompt, NEO generated the initial project structure, including the API layer and routing logic.&lt;/p&gt;

&lt;p&gt;It then produced the core components required for the system, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model selection based on weighted scoring&lt;/li&gt;
&lt;li&gt;Request handling through the API layer&lt;/li&gt;
&lt;li&gt;Caching support with Redis and in-memory fallback&lt;/li&gt;
&lt;li&gt;Fallback handling for failed model calls&lt;/li&gt;
&lt;li&gt;Metrics tracking for latency and usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These pieces came together as a working router that could process requests, select models based on defined priorities, and return responses with routing decisions and metrics.&lt;/p&gt;

&lt;p&gt;This made it possible to move from a high-level idea to a functioning system without manually implementing each part of the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to Extend This Further with NEO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the base system is in place, NEO can also be used to iterate on specific components.&lt;/p&gt;

&lt;p&gt;You can extend this project with more functionality such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adjusting scoring weights for different workloads&lt;/li&gt;
&lt;li&gt;Refining model selection strategies&lt;/li&gt;
&lt;li&gt;Modifying cache policies and TTL behavior&lt;/li&gt;
&lt;li&gt;Adding constraints such as latency limits or budget caps&lt;/li&gt;
&lt;li&gt;Integrating additional model providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Running the Project&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dakshjain-1616/low-Latency-Model-Router
&lt;span class="nb"&gt;cd &lt;/span&gt;low-Latency-Model-Router
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
python start_router.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Ensure that you specify your OpenRouter API key in &lt;code&gt;.env&lt;/code&gt; before running the low-latency model router.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Notes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This router implements a routing layer that dynamically selects models based on latency, cost, and quality while handling caching, fallback, and observability.&lt;/p&gt;

&lt;p&gt;It separates routing logic from model usage, allowing systems to adapt across different workloads without changing application logic.&lt;/p&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/dakshjain-1616/low-Latency-Model-Router" rel="noopener noreferrer"&gt;https://github.com/dakshjain-1616/low-Latency-Model-Router&lt;/a&gt;&lt;br&gt;
You can also build with NEO in your IDE using the &lt;a href="https://marketplace.visualstudio.com/items?itemName=NeoResearchInc.heyneo" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; or &lt;a href="https://open-vsx.org/extension/NeoResearchInc/heyneo" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Nilofer 🚀</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:01 +0000</pubDate>
      <link>https://forem.com/nilofer_tweets/-fgp</link>
      <guid>https://forem.com/nilofer_tweets/-fgp</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-story__hidden-navigation-link"&gt;A CLI tool to score fine-tuning dataset quality before training starts&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/gaurav_vij137" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" alt="gaurav_vij137 profile" class="crayons-avatar__image" width="400" height="400"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/gaurav_vij137" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Gaurav Vij
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Gaurav Vij
                
              
              &lt;div id="story-author-preview-content-3500478" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/gaurav_vij137" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F390876%2F1001f18f-15c5-4cb3-b792-3c4e81a1cc61.jpg" class="crayons-avatar__image" alt="" width="400" height="400"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Gaurav Vij&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 14&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" id="article-link-3500478"&gt;
          A CLI tool to score fine-tuning dataset quality before training starts
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/finetuning"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;finetuning&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/llm"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;llm&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/fire-f60e7a582391810302117f987b22a8ef04a2fe0df7e3258a5f49332df1cec71e.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/gaurav_vij137/a-cli-tool-to-score-fine-tuning-dataset-quality-before-training-starts-23ng#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
