<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aayush kumarsingh</title>
    <description>The latest articles on Forem by Aayush kumarsingh (@aayush_kumarsingh_6ee1ffe).</description>
    <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869731%2F3626c00e-9846-420a-aa24-7ef35e7ed749.png</url>
      <title>Forem: Aayush kumarsingh</title>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aayush_kumarsingh_6ee1ffe"/>
    <language>en</language>
    <item>
      <title>TraceMind v3 — I built an AI agent that diagnoses why your LLM quality dropped</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Tue, 05 May 2026 12:25:42 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v3-i-built-an-ai-agent-that-diagnoses-why-your-llm-quality-dropped-921</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v3-i-built-an-ai-agent-that-diagnoses-why-your-llm-quality-dropped-921</guid>
      <description>&lt;p&gt;&lt;em&gt;Previous posts: &lt;a href="https://dev.to/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn"&gt;v2&lt;/a&gt; — hallucination detection + A/B testing&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;The most common question I got after v2 was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The hallucination score spiked. Now what?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;TraceMind told you &lt;em&gt;that&lt;/em&gt; something broke. It didn't tell you &lt;em&gt;why&lt;/em&gt;. And it definitely didn't help you fix it.&lt;/p&gt;

&lt;p&gt;That gap is what v3 closes.&lt;/p&gt;




&lt;p&gt;If TraceMind is useful to you, a ⭐ on GitHub helps others find it.&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/TraceMind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/TraceMind&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What's new
&lt;/h2&gt;

&lt;p&gt;Three things shipped in v3:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EvalAgent&lt;/strong&gt; — a ReAct agent that diagnoses quality regressions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response Control Hooks&lt;/strong&gt; — block or retry hallucinated responses automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Version Registry&lt;/strong&gt; — track which prompt is deployed where&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  The EvalAgent
&lt;/h2&gt;

&lt;p&gt;This is the main feature. When quality drops, instead of staring at a dashboard, you ask the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why is quality dropping on the support dataset?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent runs a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;THINK → What do I need to know?
ACT   → Use a tool to get it
OBSERVE → What did the tool show?
REPEAT until I have enough to answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It has 6 tools: fetch recent traces, run targeted evals, search past failures (semantic search via ChromaDB), generate new test cases, analyze failure patterns, and send alerts.&lt;/p&gt;

&lt;p&gt;A real session looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: search_similar_failures
→ Found 3 similar past failures (82% match). Last seen 4 days ago.

Step 2: fetch_recent_traces
→ 14 low-quality traces in last 24h. Lowest score: 3.2.

Step 3: analyze_failure_pattern
→ Pattern: multi-step refund questions with policy constraints
  Root cause: prompt doesn't specify what to do when policy is ambiguous
  Fix: add explicit fallback instruction for edge cases

Step 4: generate_test_cases
→ Generated 5 adversarial cases covering this failure mode

ANSWER: Quality dropped because the prompt has no fallback for ambiguous
policy questions. Generated 5 test cases to cover this. Recommended fix:
add "If policy is unclear, say: I'll check and follow up" to your prompt.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the complete investigation — 4 tool calls, 45 seconds, specific root cause, specific fix, new test cases already added to the dataset.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture decision: text-based ReAct, not native tool calling
&lt;/h2&gt;

&lt;p&gt;I had two options for the agent loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A — Anthropic/OpenAI native tool calling&lt;/strong&gt;: cleaner, more reliable JSON, the model calls tools directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — Text-based ReAct&lt;/strong&gt;: model outputs &lt;code&gt;TOOL: name\nINPUT: {...}&lt;/code&gt;, I parse it.&lt;/p&gt;

&lt;p&gt;I went with Option B because I'm running on Groq's free tier (llama-3.1-8b-instant), and native tool calling on smaller open models is unreliable — the model frequently hallucinates tool names or produces malformed schemas. Text-based ReAct is more forgiving and easier to debug when something goes wrong.&lt;/p&gt;

&lt;p&gt;The tradeoff: I have to parse the output myself, and occasionally the model produces text that doesn't match the &lt;code&gt;TOOL:&lt;/code&gt; / &lt;code&gt;ANSWER:&lt;/code&gt; pattern. I handle that with a fallback that appends the raw response to context and retries.&lt;/p&gt;
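&lt;p&gt;For the curious, here is a minimal sketch of what a parse-and-retry loop like that can look like. It is illustrative only, not TraceMind's actual implementation; the &lt;code&gt;call_llm&lt;/code&gt; helper, the regexes, and the step budget are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import re

def parse_step(llm_output: str):
    """Parse one ReAct step into ("answer", text), ("tool", name, args),
    or ("unparsed", raw) when the output matches neither pattern."""
    answer = re.search(r"ANSWER:\s*(.+)", llm_output, re.DOTALL)
    if answer:
        return ("answer", answer.group(1).strip())
    tool = re.search(r"TOOL:\s*(\w+)\s*\nINPUT:\s*(\{.*\})", llm_output, re.DOTALL)
    if tool:
        try:
            return ("tool", tool.group(1), json.loads(tool.group(2)))
        except json.JSONDecodeError:
            pass
    return ("unparsed", llm_output)

def run_agent(question: str, tools: dict, call_llm, max_steps: int = 10):
    context = [f"QUESTION: {question}"]
    for _ in range(max_steps):
        step = parse_step(call_llm("\n".join(context)))
        if step[0] == "answer":
            return step[1]
        if step[0] == "tool":
            name, args = step[1], step[2]
            result = tools[name](args) if name in tools else f"unknown tool: {name}"
            context.append(f"TOOL: {name}\nOBSERVATION: {result}")
        else:
            # Fallback from the post: keep the raw response in context and retry
            context.append(f"(unparsed output) {step[1]}")
    return "No answer within the step budget."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;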




&lt;h2&gt;
  
  
  Memory: 4 types
&lt;/h2&gt;

&lt;p&gt;The agent isn't stateless. Between runs it maintains:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic memory&lt;/strong&gt; — ChromaDB stores embeddings of every past failure. When a new failure arrives, the agent searches for similar past failures and their resolutions. If this exact problem was solved 3 weeks ago, the agent finds it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; — The last 5 agent runs for each project are stored in Postgres. New runs start with context from previous investigations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project context&lt;/strong&gt; — Loaded at agent init. The agent knows what kind of system it's investigating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-context working memory&lt;/strong&gt; — The scratchpad of tool results that accumulates during a single run.&lt;/p&gt;

&lt;p&gt;Most agents only have the last one. The semantic + episodic layers are what make investigations get faster over time.&lt;/p&gt;
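&lt;p&gt;To make the semantic layer concrete, here is a small sketch of a ChromaDB lookup. The collection name, metadata fields, and example text are illustrative assumptions, not TraceMind's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb

client = chromadb.PersistentClient(path="./memory")
failures = client.get_or_create_collection("past_failures")

# Store a resolved failure so future investigations can find it
failures.add(
    ids=["fail-042"],
    documents=["Refund questions with ambiguous policy produced hallucinated terms"],
    metadatas=[{"project": "support-bot", "resolution": "added fallback instruction"}],
)

# A new failure arrives: search for similar past failures and their resolutions
hits = failures.query(
    query_texts=["agent invents refund window when policy is unclear"],
    n_results=3,
)
for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
    print(doc, meta["resolution"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;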




&lt;h2&gt;
  
  
  Response Control Hooks
&lt;/h2&gt;

&lt;p&gt;This closes the loop on hallucination detection.&lt;/p&gt;

&lt;p&gt;Before v3: TraceMind detected a high-risk response. You logged it. Nothing happened.&lt;/p&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HallucinationPolicy&lt;/span&gt;

&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Built-in policies — safe defaults out of the box
&lt;/span&gt;&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_control&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HallucinationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_control&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="n"&gt;HallucinationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_control&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="n"&gt;HallucinationPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FLAG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or custom callback for your specific logic
&lt;/span&gt;&lt;span class="nd"&gt;@tm.response_control.on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;alert_oncall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critical hallucination in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;span_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m not confident in this answer. Please contact support.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Your existing code, unchanged
&lt;/span&gt;&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support_handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_ticket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;your_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# If response is critical-risk → HallucinationBlocked raised automatically
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
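&lt;p&gt;If you would rather catch the block yourself than let it propagate, a usage sketch could look like this. I'm assuming &lt;code&gt;HallucinationBlocked&lt;/code&gt; is importable from the same package; the snippet above only mentions the exception by name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from tracemind import HallucinationBlocked  # assumed import path

def safe_handle_ticket(ticket: str):
    try:
        return handle_ticket(ticket)
    except HallucinationBlocked:
        # Fall back to a neutral reply instead of surfacing the blocked response
        return "I'm not confident in this answer. Please contact support."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;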



&lt;p&gt;The design principle here came from a comment on my v2 post from &lt;a href="https://dev.to/sunychoudhary"&gt;@sunychoudhary&lt;/a&gt;: teams that get full flexibility usually implement no policy at all. So the defaults ship with something safe, and you override what you need.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Version Registry
&lt;/h2&gt;

&lt;p&gt;Every deployed prompt is now versioned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST /api/prompts/&lt;span class="o"&gt;{&lt;/span&gt;prompt_name&lt;span class="o"&gt;}&lt;/span&gt;/versions
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="s2"&gt;"You are a professional support agent. Be empathetic and precise."&lt;/span&gt;,
  &lt;span class="s2"&gt;"tags"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;, &lt;span class="s2"&gt;"v2.3"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;# → { "version_id": "support:v3" }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When quality drops, you can correlate it with which prompt version was deployed at that timestamp. This answers "did the regression start when we changed the prompt?" without manually digging through git history.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I got wrong in v2 (and fixed)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;inputs["project_id"]&lt;/code&gt; bug&lt;/strong&gt; — The agent would call &lt;code&gt;fetch_recent_traces&lt;/code&gt; but the LLM sometimes omitted &lt;code&gt;project_id&lt;/code&gt; from the tool input JSON. The function did &lt;code&gt;inputs["project_id"]&lt;/code&gt; — hard key access — so it crashed with a &lt;code&gt;KeyError&lt;/code&gt; instead of falling back to the agent's own project ID.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;pid = inputs.get("project_id") or project_id&lt;/code&gt; and pass &lt;code&gt;project_id&lt;/code&gt; through the call chain. Obvious in hindsight. The pattern for all tool inputs is now &lt;code&gt;.get()&lt;/code&gt; with fallbacks throughout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The float parse crash&lt;/strong&gt; — The worker that auto-scores spans sent &lt;code&gt;max_tokens=5&lt;/code&gt; to get a single number back. Sometimes the model returned &lt;code&gt;"3\n\nThe response is..."&lt;/code&gt;. The code did &lt;code&gt;float(result.strip())&lt;/code&gt; and crashed.&lt;/p&gt;

&lt;p&gt;The fix: &lt;code&gt;float(result.strip().split()[0].rstrip('.'))&lt;/code&gt; — take only the first token.&lt;/p&gt;
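&lt;p&gt;Both fixes boil down to the same defensive pattern. A tiny sketch, with illustrative names rather than the real function signatures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Tool inputs: never hard-index keys the LLM may omit; fall back to known context
def fetch_recent_traces(inputs: dict, project_id: str):
    pid = inputs.get("project_id") or project_id
    ...

# Judge output: keep only the first token before parsing a float
raw = "3\n\nThe response is..."
score = float(raw.strip().split()[0].rstrip("."))  # 3.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;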

&lt;p&gt;Both bugs were caught by the verify suite (&lt;code&gt;verify_all.py&lt;/code&gt;) before I noticed them in logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;44/44 verification checks passing
76 unit tests
8 iterations average per agent run
~45 seconds for a complete investigation
&amp;lt;1ms SDK overhead (batched, non-blocking)
$0 — runs entirely on Groq free tier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Aayush-engineer/tracemind
&lt;span class="nb"&gt;cd &lt;/span&gt;tracemind &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Add GROQ_API_KEY (free at console.groq.com)&lt;/span&gt;
docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or hit the hosted demo: &lt;strong&gt;tracemind.onrender.com/docs&lt;/strong&gt; (free tier, ~30s cold start)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ef_live_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;your_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What I'd still do differently
&lt;/h2&gt;

&lt;p&gt;The agent uses text-based ReAct which occasionally misfires on smaller models. Native tool calling with a model that supports it reliably (Llama 3.3 70B, Mixtral) would be more robust — but that's beyond Groq's free tier limits for my use case.&lt;/p&gt;

&lt;p&gt;The semantic memory searches all past failures globally across projects. It should be scoped per project first. On a shared instance with many projects, cross-project signal is mostly noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Live
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9vh6ngdrndzadv6o48r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9vh6ngdrndzadv6o48r.png" alt=" " width="800" height="382"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5cg81e6gfzrvb3jb3wj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5cg81e6gfzrvb3jb3wj.gif" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ollama integration — run entirely local, no API key&lt;/li&gt;
&lt;li&gt;Hosted cloud version — 1 project, 1000 spans/month free&lt;/li&gt;
&lt;li&gt;LlamaIndex callback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building with LLMs and something breaks in a way that doesn't show up in your error logs — that's exactly the problem TraceMind is for. Would genuinely value feedback on whether the agent investigations are useful in practice, or just interesting in theory.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>agents</category>
      <category>llmops</category>
    </item>
    <item>
      <title>The gap between detecting hallucinations and handling them</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Wed, 15 Apr 2026 13:43:15 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/the-gap-between-detecting-hallucinations-and-handling-them-47hh</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/the-gap-between-detecting-hallucinations-and-handling-them-47hh</guid>
      <description>&lt;p&gt;After posting about TraceMind's hallucination detection, someone left &lt;br&gt;
a comment that stopped me.&lt;/p&gt;

&lt;p&gt;Suny Choudhary wrote: "the harder issue is what happens after &lt;br&gt;
detection. Whether the system can handle that uncertainty correctly — &lt;br&gt;
retry, validate, or block actions."&lt;/p&gt;

&lt;p&gt;He's right. And it exposed a gap I hadn't thought through.&lt;/p&gt;



&lt;p&gt;Right now TraceMind detects hallucinations. You get this back:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "has_hallucinations": True,&lt;br&gt;
  "overall_risk": "high",&lt;br&gt;
  "claims": [{&lt;br&gt;
    "claim": "We offer 60-day refunds",&lt;br&gt;
    "type": "factual_contradiction",&lt;br&gt;
    "evidence": "context says 30-day refunds only"&lt;br&gt;
  }]&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;And then... nothing. You have to decide what to do with it.&lt;/p&gt;



&lt;p&gt;The problem is "what to do" is completely application-specific.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;customer support bot&lt;/strong&gt; should probably retry with a more conservative prompt. The user is waiting for an answer.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;legal document analyzer&lt;/strong&gt; should block and escalate to a human. A wrong answer has real consequences.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;coding assistant&lt;/strong&gt; might just flag it with low confidence. The developer will review the code anyway.&lt;/p&gt;

&lt;p&gt;You can't hardcode the right behavior at the detection layer because it depends on context the detection layer doesn't have.&lt;/p&gt;



&lt;p&gt;My current thinking for v3: opinionated defaults with override hooks.&lt;/p&gt;

&lt;p&gt;Three built-in policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;block&lt;/code&gt; — don't return the response&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry&lt;/code&gt; — re-run the LLM call with a safer prompt&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flag&lt;/code&gt; — return the response with a warning attached&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Override any of them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tm.on_hallucination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;risk&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;legal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLOCK&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FLAG&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Teams get safe defaults on day one. Teams with specific workflows customize exactly what they need.&lt;/p&gt;




&lt;p&gt;This isn't shipped yet. It's a design I'm thinking through based on real feedback.&lt;/p&gt;

&lt;p&gt;If you're building with LLMs and have dealt with this problem — what did you actually do when your AI hallucinated in production?&lt;/p&gt;

&lt;p&gt;GitHub: github.com/Aayush-engineer/tracemind&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>TraceMind v2 — I added hallucination detection and A/B testing to my open-source LLM eval platform</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 05:39:42 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</guid>
      <description>&lt;h2&gt;
  
  
  What changed since v1
&lt;/h2&gt;

&lt;p&gt;When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."&lt;/p&gt;

&lt;p&gt;So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.&lt;/p&gt;

&lt;p&gt;Here's what's new and how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The original problem (unchanged)
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.&lt;/p&gt;

&lt;p&gt;TraceMind was built to catch this on day zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's new in v2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hallucination detection
&lt;/h3&gt;

&lt;p&gt;The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has_hallucinations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claims&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer 60-day refunds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context says 30-day refunds only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.&lt;/p&gt;
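&lt;p&gt;A rough two-pass sketch of that idea, with the prompts and the &lt;code&gt;call_llm&lt;/code&gt; helper as placeholders rather than TraceMind's actual prompts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def detect_hallucinations(question, response, context, call_llm):
    # Pass 1: extract atomic claims from the response
    claims = json.loads(call_llm(
        "List each factual claim in this response as a JSON array of strings.\n"
        f"Response: {response}"
    ))
    # Pass 2: verify each claim independently against the ground-truth context
    results = []
    for claim in claims:
        verdict = call_llm(
            "Does the context support this claim? "
            "Answer supported, contradicted, or unverifiable.\n"
            f"Context: {context}\nClaim: {claim}"
        ).strip().lower()
        results.append({"claim": claim, "verdict": verdict})
    return {
        "has_hallucinations": any(r["verdict"] != "supported" for r in results),
        "claims": results,
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;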

&lt;h3&gt;
  
  
  Prompt A/B testing
&lt;/h3&gt;

&lt;p&gt;You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.&lt;/p&gt;

&lt;p&gt;The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_a_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_b_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cohen_d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
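&lt;p&gt;The two statistics are standard; here is a quick sketch of how they can be computed with scipy and numpy (not TraceMind's internal code, and the scores are made-up per-case judge scores):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import mannwhitneyu

scores_a = np.array([5.0, 6.5, 6.0, 7.0, 6.5, 5.5, 6.0, 7.0])
scores_b = np.array([8.0, 7.5, 8.5, 8.0, 7.0, 8.5, 9.0, 8.0])

# Mann-Whitney U: non-parametric, so no normality assumption for small samples
u_stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")

# Cohen's d with a pooled standard deviation: effect size, not just significance
pooled_sd = np.sqrt((scores_a.var(ddof=1) + scores_b.var(ddof=1)) / 2)
cohen_d = (scores_b.mean() - scores_a.mean()) / pooled_sd

print(f"p={p_value:.3f}, d={cohen_d:.2f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;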



&lt;h3&gt;
  
  
  Verification suite
&lt;/h3&gt;

&lt;p&gt;I built a 44-test verification script covering all 11 feature areas. Running &lt;code&gt;python verify_all.py&lt;/code&gt; hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd still do differently
&lt;/h2&gt;

&lt;p&gt;The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" alt=" " width="600" height="319"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Self-hosted, free, no vendor lock-in.&lt;/p&gt;

&lt;p&gt;If you're building with LLMs — I'd genuinely love to know what breaks when you try it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llmops</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:45:30 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</guid>
      <description>&lt;h2&gt;
  
  
  The problem that made me build this
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. It worked great in testing. I deployed it. Three days later I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later when a user complained.&lt;/p&gt;

&lt;p&gt;This is the most common failure mode in LLM applications. Unlike traditional software where a bug throws an exception, bad LLM outputs look like valid responses. They just happen to be wrong, unhelpful, or unsafe. You need systematic measurement to catch this.&lt;/p&gt;

&lt;p&gt;I looked for existing tools. Langfuse is good but expensive at scale for self-hosted teams. Braintrust doesn't have a free self-hosted option. Helicone doesn't do evals. So I built TraceMind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What TraceMind does
&lt;/h2&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automatic quality scoring&lt;/strong&gt;&lt;br&gt;
Every LLM response is scored 1-10 by another LLM acting as judge (LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant for fast scoring, llama-3.3-70b for deep analysis. The score runs in the background, never blocking your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Golden dataset evals&lt;/strong&gt;&lt;br&gt;
You define expected behaviors once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acknowledge and ask for order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pass rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pass rate: 87%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. AI agent that diagnoses regressions&lt;/strong&gt;&lt;br&gt;
This is the part I'm most proud of. You can ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent implements the ReAct pattern with 6 tools and 4 memory types.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture decisions that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parallel eval execution with asyncio.Semaphore
&lt;/h3&gt;

&lt;p&gt;The naive approach runs LLM judge calls sequentially. For 100 test cases at 500ms each, that's 50 seconds.&lt;/p&gt;

&lt;p&gt;I use asyncio.Semaphore(3) to run 3 evaluations concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
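&lt;p&gt;The &lt;code&gt;run_case&lt;/code&gt; coroutine isn't shown above. A minimal version that respects the semaphore might look like this, assuming an async &lt;code&gt;system_fn&lt;/code&gt; and a &lt;code&gt;judge&lt;/code&gt; helper (both placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def run_case(example, system_fn, criteria, semaphore):
    # The semaphore caps how many judge calls are in flight at once
    async with semaphore:
        output = await system_fn(example.input)      # run the system under test
        score = await judge(output, example.expected, criteria)  # LLM-as-judge call
        return {"input": example.input, "output": output, "score": score}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;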



&lt;p&gt;Running 100 cases now takes ~17 seconds. The semaphore limit exists because Groq's free tier has rate limits — I tuned it to stay under the threshold.&lt;/p&gt;
&lt;h3&gt;
  
  
  The ReAct agent with semantic memory
&lt;/h3&gt;

&lt;p&gt;The agent has 4 memory types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-context&lt;/strong&gt;: conversation history within the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External KV&lt;/strong&gt;: project config from database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: past failures in ChromaDB with sentence-transformers embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt;: past agent run results in SQLite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you ask "why did quality drop?", the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches ChromaDB semantically for similar past failures&lt;/li&gt;
&lt;li&gt;Fetches recent low-scoring traces from the database&lt;/li&gt;
&lt;li&gt;Runs a targeted eval on the failure category&lt;/li&gt;
&lt;li&gt;Uses an Opus-equivalent model to analyze the root cause&lt;/li&gt;
&lt;li&gt;Generates new test cases to prevent future recurrence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I intentionally avoided LangChain. The ReAct loop is 80 lines of readable Python. When something breaks at 3am, you want to read your own code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Background worker for async scoring
&lt;/h3&gt;

&lt;p&gt;The HTTP ingestion endpoint returns in &amp;lt;10ms regardless of batch size. Scoring runs in a background worker that polls every 10 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_score_unscored_spans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_unscored&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_score_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;save_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worst thing an observability tool can do is slow down the system it's monitoring. Scoring is completely decoupled from ingestion.&lt;/p&gt;
&lt;h3&gt;
  
  
  Local embeddings — no OpenAI dependency
&lt;/h3&gt;

&lt;p&gt;I use sentence-transformers &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; for ChromaDB embeddings. It runs locally, downloads once (~90MB), works offline, zero API cost. This was a deliberate choice — I wanted the tool to work completely free with no external dependencies beyond Groq for LLM calls.&lt;/p&gt;
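&lt;p&gt;For reference, the local embedding path with sentence-transformers looks roughly like this (a generic usage sketch, not TraceMind's wiring into ChromaDB):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~90MB, downloaded once, cached locally
embeddings = model.encode([
    "agent invents refund window when policy is unclear",
    "response contradicts the documented 30-day refund policy",
])
print(embeddings.shape)  # (2, 384): 384-dimensional vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;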
&lt;h2&gt;
  
  
  What I'd do differently in production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Row-level security instead of project-level isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery + Redis&lt;/strong&gt; instead of asyncio background worker for horizontal scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming eval results&lt;/strong&gt; via WebSocket — see case-by-case progress in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alembic migrations&lt;/strong&gt; from day one (I added these later)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://tracemind.vercel.app" rel="noopener noreferrer"&gt;https://tracemind.vercel.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3-line setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your code unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
If you're building with LLMs and want to know if they're actually working — I'd love feedback.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
