<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aayush kumarsingh</title>
    <description>The latest articles on Forem by Aayush kumarsingh (@aayush_kumarsingh_6ee1ffe).</description>
    <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869731%2F3626c00e-9846-420a-aa24-7ef35e7ed749.png</url>
      <title>Forem: Aayush kumarsingh</title>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aayush_kumarsingh_6ee1ffe"/>
    <language>en</language>
    <item>
      <title>TraceMind v2 — I added hallucination detection and A/B testing to my open-source LLM eval platform</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 05:39:42 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</guid>
      <description>&lt;h2&gt;
  
  
  What changed since v1
&lt;/h2&gt;

&lt;p&gt;When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."&lt;/p&gt;

&lt;p&gt;So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.&lt;/p&gt;

&lt;p&gt;Here's what's new and how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The original problem (unchanged)
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.&lt;/p&gt;

&lt;p&gt;TraceMind was built to catch this on day zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's new in v2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hallucination detection
&lt;/h3&gt;

&lt;p&gt;The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has_hallucinations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claims&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer 60-day refunds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context says 30-day refunds only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.&lt;/p&gt;
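
&lt;p&gt;For the curious, here is a minimal sketch of that two-call flow. The prompts, model IDs, and helper names are illustrative, not TraceMind's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from groq import Groq  # assumed SDK; the real code may wrap this differently

client = Groq(api_key="...")

def extract_claims(response_text):
    """First call: break the AI response into atomic, checkable claims."""
    out = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content":
                   "List every factual claim in this text as a JSON array of short strings:\n"
                   + response_text}],
    )
    return json.loads(out.choices[0].message.content)

def verify_claim(claim, context):
    """Second call: check one claim against the ground-truth context."""
    out = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content":
                   f"Context:\n{context}\n\nClaim: {claim}\n"
                   'Reply as JSON: {"verdict": "supported|hallucination|unverifiable", "reason": "..."}'}],
    )
    return json.loads(out.choices[0].message.content)

def detect_hallucinations(response_text, context):
    claims = extract_claims(response_text)
    results = [{"claim": c, **verify_claim(c, context)} for c in claims]
    flagged = [r for r in results if r["verdict"] == "hallucination"]
    return {"has_hallucinations": bool(flagged), "claims": results}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;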

&lt;h3&gt;
  
  
  Prompt A/B testing
&lt;/h3&gt;

&lt;p&gt;You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.&lt;/p&gt;

&lt;p&gt;The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_a_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_b_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cohen_d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
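
&lt;p&gt;If you want to reproduce the statistical layer yourself, it is only a few lines of scipy and numpy. A minimal sketch; the confidence cutoffs here are my own illustration, not TraceMind's exact thresholds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import mannwhitneyu

def compare_prompts(scores_a, scores_b):
    """Decide whether prompt B really beats prompt A or the gap is noise."""
    a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)

    # Mann-Whitney U: non-parametric, so it behaves sanely on 5-20 samples
    _, p_value = mannwhitneyu(a, b, alternative="two-sided")

    # Cohen's d: effect size based on the pooled standard deviation
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohen_d = (b.mean() - a.mean()) / pooled_sd

    significant = p_value &amp;lt; 0.05 and abs(cohen_d) &amp;gt; 0.8  # illustrative cutoffs
    return {
        "prompt_a_score": round(a.mean(), 1),
        "prompt_b_score": round(b.mean(), 1),
        "winner": "B" if b.mean() &amp;gt; a.mean() else "A",
        "confidence": "high" if significant else "low",
        "cohen_d": round(float(cohen_d), 2),
        "p_value": round(float(p_value), 3),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;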



&lt;h3&gt;
  
  
  Verification suite
&lt;/h3&gt;

&lt;p&gt;I built a 44-test verification script covering all 11 feature areas. Running &lt;code&gt;python verify_all.py&lt;/code&gt; hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.&lt;/p&gt;
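
&lt;p&gt;The script is essentially a list of request-and-assert pairs run against a live server. A stripped-down sketch of the idea; the endpoint paths and payloads here are illustrative, not the real 44 tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

BASE = "http://localhost:8000"  # a locally running server

# (name, method, path, payload) -- paths here are illustrative
CHECKS = [
    ("health", "GET", "/health", None),
    ("ingest span", "POST", "/spans", {"name": "llm_call", "input": "hi", "output": "hello"}),
    ("hallucination check", "POST", "/hallucinations", {
        "question": "What is the refund window?",
        "response": "We offer 60-day refunds",
        "context": "Refunds are available for 30 days only",
    }),
]

def run_checks():
    passed = 0
    for name, method, path, payload in CHECKS:
        resp = requests.request(method, BASE + path, json=payload, timeout=30)
        ok = resp.status_code &amp;lt; 400
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{passed}/{len(CHECKS)} checks passed")

if __name__ == "__main__":
    run_checks()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;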




&lt;h2&gt;
  
  
  What I'd still do differently
&lt;/h2&gt;

&lt;p&gt;The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.&lt;/p&gt;
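
&lt;p&gt;The fix would be to reuse the same polling-worker pattern that span scoring already uses. A rough sketch of the shape; &lt;code&gt;fetch_unchecked&lt;/code&gt; and &lt;code&gt;save_result&lt;/code&gt; are hypothetical helpers, not existing TraceMind functions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def hallucination_worker():
    # Same shape as the span-scoring worker: poll, process a small batch, sleep.
    while True:
        traces = fetch_unchecked(limit=20)   # hypothetical helper
        for trace in traces:
            # Run the synchronous check off the event loop, off the request path
            result = await asyncio.to_thread(
                detect_hallucinations, trace.response, trace.context
            )
            save_result(trace.id, result)    # hypothetical helper
        await asyncio.sleep(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;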




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" alt=" " width="600" height="319"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Self-hosted, free, no vendor lock-in.&lt;/p&gt;

&lt;p&gt;If you're building with LLMs, I'd genuinely love to know what breaks when you try it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llmops</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:45:30 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</guid>
      <description>&lt;h2&gt;
  
  
  The problem that made me build this
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. It worked great in testing. I deployed it. Three days later I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later when a user complained.&lt;/p&gt;

&lt;p&gt;This is one of the most common failure modes in LLM applications. Unlike traditional software, where a bug throws an exception, bad LLM outputs look like valid responses. They just happen to be wrong, unhelpful, or unsafe. You need systematic measurement to catch this.&lt;/p&gt;

&lt;p&gt;I looked for existing tools. Langfuse is good but expensive at scale for self-hosted teams. Braintrust doesn't have a free self-hosted option. Helicone doesn't do evals. So I built TraceMind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What TraceMind does
&lt;/h2&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automatic quality scoring&lt;/strong&gt;&lt;br&gt;
Every LLM response is scored 1-10 by another LLM acting as judge (the LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant for fast scoring, llama-3.3-70b for deep analysis. Scoring runs in the background, never blocking your application.&lt;/p&gt;
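
&lt;p&gt;A minimal sketch of what a judge call like this looks like with the Groq SDK; the rubric, prompt, and settings here are illustrative rather than TraceMind's exact ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from groq import Groq

client = Groq(api_key="...")

def judge_score(user_input, model_output):
    """Ask a fast judge model to grade one response on a 1-10 scale."""
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # the fast judge; a 70b model handles deep analysis
        temperature=0,                 # keep the judge as deterministic as possible
        messages=[{"role": "user", "content":
                   "Score this assistant response from 1 to 10 for correctness, "
                   "helpfulness, and safety. Reply only with JSON like "
                   '{"score": 7, "reason": "..."}.\n\n'
                   f"User: {user_input}\nAssistant: {model_output}"}],
    )
    return json.loads(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;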

&lt;p&gt;&lt;strong&gt;2. Golden dataset evals&lt;/strong&gt;&lt;br&gt;
You define expected behaviors once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acknowledge and ask for order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pass rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pass rate: 87%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. AI agent that diagnoses regressions&lt;/strong&gt;&lt;br&gt;
This is the part I'm most proud of. You can ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent implements the ReAct pattern with 6 tools and 4 memory types.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture decisions that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parallel eval execution with asyncio.Semaphore
&lt;/h3&gt;

&lt;p&gt;The naive approach runs LLM judge calls sequentially. For 100 test cases at 500ms each, that's 50 seconds.&lt;/p&gt;

&lt;p&gt;I use asyncio.Semaphore(3) to run 3 evaluations concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evaluating 100 cases now takes ~17 seconds. The semaphore limit exists because Groq's free tier has rate limits — I tuned it to stay under the threshold.&lt;/p&gt;
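
&lt;p&gt;Each &lt;code&gt;run_case&lt;/code&gt; acquires the semaphore around its LLM work, so only a few cases are in flight at once. Roughly, with &lt;code&gt;judge_output&lt;/code&gt; standing in as a hypothetical judge helper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def run_case(example, system_fn, criteria, semaphore):
    # Only `max_concurrent` cases hold the semaphore at a time, which keeps
    # the judge traffic under Groq's free-tier rate limit.
    async with semaphore:
        output = await asyncio.to_thread(system_fn, example.input)      # the system under test
        score = await judge_output(output, example.expected, criteria)  # hypothetical judge helper
        return {"input": example.input, "output": output, "score": score}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;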
&lt;h3&gt;
  
  
  The ReAct agent with semantic memory
&lt;/h3&gt;

&lt;p&gt;The agent has 4 memory types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-context&lt;/strong&gt;: conversation history within the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External KV&lt;/strong&gt;: project config from database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: past failures in ChromaDB with sentence-transformers embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt;: past agent run results in SQLite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you ask "why did quality drop?", the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches ChromaDB semantically for similar past failures&lt;/li&gt;
&lt;li&gt;Fetches recent low-scoring traces from the database&lt;/li&gt;
&lt;li&gt;Runs a targeted eval on the failure category&lt;/li&gt;
&lt;li&gt;Uses an Opus-equivalent model to analyze the root cause&lt;/li&gt;
&lt;li&gt;Generates new test cases to prevent future recurrence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I intentionally avoided LangChain. The ReAct loop is 80 lines of readable Python. When something breaks at 3am, you want to read your own code.&lt;/p&gt;
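
&lt;p&gt;The pattern itself is small: prompt the model, parse the action it picks, run the tool, feed the observation back, repeat. A condensed sketch; the tool format and the &lt;code&gt;call_llm&lt;/code&gt; callable are illustrative, not the actual 80 lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def react_loop(question, tools, call_llm, max_steps=8):
    """Minimal ReAct: the model alternates thought/action until it answers.

    `tools` maps a tool name to a callable; `call_llm(messages)` returns text.
    """
    messages = [
        {"role": "system", "content":
         "Reason step by step. Reply only with JSON: either "
         '{"thought": "...", "action": "tool_name", "args": {}} or '
         '{"thought": "...", "final_answer": "..."}. '
         "Available tools: " + ", ".join(tools)},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        step = json.loads(call_llm(messages))
        if "final_answer" in step:
            return step["final_answer"]
        # Act: run the chosen tool, then feed the observation back to the model
        observation = tools[step["action"]](**step.get("args", {}))
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after max_steps without a final answer"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;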
&lt;h3&gt;
  
  
  Background worker for async scoring
&lt;/h3&gt;

&lt;p&gt;The HTTP ingestion endpoint returns in &amp;lt;10ms regardless of batch size. Scoring runs in a background worker that polls every 10 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_score_unscored_spans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_unscored&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_score_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;save_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worst thing an observability tool can do is slow down the system it's monitoring. Scoring is completely decoupled from ingestion.&lt;/p&gt;
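
&lt;p&gt;The surrounding poll loop is just a long-lived task started at app startup. A sketch of its shape; only &lt;code&gt;_score_unscored_spans&lt;/code&gt; above is from the real code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def run_forever(self):
    # Launched once at startup; a failed batch is logged and never touches ingestion.
    while True:
        try:
            await self._score_unscored_spans()
        except Exception as exc:
            print(f"scoring batch failed: {exc}")
        await asyncio.sleep(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;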
&lt;h3&gt;
  
  
  Local embeddings — no OpenAI dependency
&lt;/h3&gt;

&lt;p&gt;I use sentence-transformers &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; for ChromaDB embeddings. It runs locally, downloads once (~90MB), works offline, zero API cost. This was a deliberate choice — I wanted the tool to work completely free with no external dependencies beyond Groq for LLM calls.&lt;/p&gt;
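
&lt;p&gt;Wiring this up takes only a handful of lines with ChromaDB's built-in sentence-transformers embedding function. A small sketch of how the semantic failure memory could look; the collection name and fields are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb
from chromadb.utils import embedding_functions

# all-MiniLM-L6-v2 downloads once (~90MB) and then runs fully offline
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./tracemind_memory")
failures = client.get_or_create_collection("failures", embedding_function=embed_fn)

# Store a scored failure so the agent can recall it later
failures.add(
    ids=["span-123"],
    documents=["User asked about refunds; response invented a 60-day policy"],
    metadatas=[{"score": 3, "category": "billing"}],
)

# "Why did quality drop?" triggers a semantic lookup over past failures
similar = failures.query(query_texts=["refund policy hallucination"], n_results=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;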
&lt;h2&gt;
  
  
  What I'd do differently in production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Row-level security instead of project-level isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery + Redis&lt;/strong&gt; instead of asyncio background worker for horizontal scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming eval results&lt;/strong&gt; via WebSocket — see case-by-case progress in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alembic migrations&lt;/strong&gt; from day one (I added these later)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://tracemind.vercel.app" rel="noopener noreferrer"&gt;https://tracemind.vercel.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3-line setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your code unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
If you're building with LLMs and want to know if they're actually working, I'd love feedback.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
