<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aayush kumarsingh</title>
    <description>The latest articles on Forem by Aayush kumarsingh (@aayush_kumarsingh_6ee1ffe).</description>
    <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3869731%2F3626c00e-9846-420a-aa24-7ef35e7ed749.png</url>
      <title>Forem: Aayush kumarsingh</title>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aayush_kumarsingh_6ee1ffe"/>
    <language>en</language>
    <item>
      <title>TraceMind v2 — I added hallucination detection and A/B testing to my open-source LLM eval platform</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 05:39:42 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/tracemind-v2-i-added-hallucination-detection-and-ab-testing-to-my-open-source-llm-eval-platform-1lkn</guid>
      <description>&lt;h2&gt;
  
  
  What changed since v1
&lt;/h2&gt;

&lt;p&gt;When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."&lt;/p&gt;

&lt;p&gt;So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.&lt;/p&gt;

&lt;p&gt;Here's what's new and how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The original problem (unchanged)
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.&lt;/p&gt;

&lt;p&gt;TraceMind was built to catch this on day zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's new in v2
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hallucination detection
&lt;/h3&gt;

&lt;p&gt;The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has_hallucinations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overall_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claims&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claim&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;We offer 60-day refunds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context says 30-day refunds only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.&lt;/p&gt;
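
&lt;p&gt;For the curious, here is a minimal sketch of that two-call flow. The prompts, model IDs, and helper names are illustrative, not TraceMind's actual internals:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from groq import Groq  # assumed SDK; the real code may wrap this differently

client = Groq(api_key="...")

def extract_claims(response_text):
    """First call: break the AI response into atomic, checkable claims."""
    out = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content":
                   "List every factual claim in this text as a JSON array of short strings:\n"
                   + response_text}],
    )
    return json.loads(out.choices[0].message.content)

def verify_claim(claim, context):
    """Second call: check one claim against the ground-truth context."""
    out = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content":
                   f"Context:\n{context}\n\nClaim: {claim}\n"
                   'Reply as JSON: {"verdict": "supported|hallucination|unverifiable", "reason": "..."}'}],
    )
    return json.loads(out.choices[0].message.content)

def detect_hallucinations(response_text, context):
    claims = extract_claims(response_text)
    results = [{"claim": c, **verify_claim(c, context)} for c in claims]
    flagged = [r for r in results if r["verdict"] == "hallucination"]
    return {"has_hallucinations": bool(flagged), "claims": results}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;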

&lt;h3&gt;
  
  
  Prompt A/B testing
&lt;/h3&gt;

&lt;p&gt;You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.&lt;/p&gt;

&lt;p&gt;The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_a_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;6.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_b_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cohen_d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
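
&lt;p&gt;If you want to reproduce the statistical layer yourself, it is only a few lines of scipy and numpy. A minimal sketch; the confidence cutoffs here are my own illustration, not TraceMind's exact thresholds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np
from scipy.stats import mannwhitneyu

def compare_prompts(scores_a, scores_b):
    """Decide whether prompt B really beats prompt A or the gap is noise."""
    a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)

    # Mann-Whitney U: non-parametric, so it behaves sanely on 5-20 samples
    _, p_value = mannwhitneyu(a, b, alternative="two-sided")

    # Cohen's d: effect size based on the pooled standard deviation
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    cohen_d = (b.mean() - a.mean()) / pooled_sd

    significant = p_value &amp;lt; 0.05 and abs(cohen_d) &amp;gt; 0.8  # illustrative cutoffs
    return {
        "prompt_a_score": round(a.mean(), 1),
        "prompt_b_score": round(b.mean(), 1),
        "winner": "B" if b.mean() &amp;gt; a.mean() else "A",
        "confidence": "high" if significant else "low",
        "cohen_d": round(float(cohen_d), 2),
        "p_value": round(float(p_value), 3),
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;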



&lt;h3&gt;
  
  
  Verification suite
&lt;/h3&gt;

&lt;p&gt;I built a 44-test verification script covering all 11 feature areas. Running &lt;code&gt;python verify_all.py&lt;/code&gt; hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.&lt;/p&gt;
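
&lt;p&gt;The script is essentially a list of request-and-assert pairs run against a live server. A stripped-down sketch of the idea; the endpoint paths and payloads here are illustrative, not the real 44 tests:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

BASE = "http://localhost:8000"  # a locally running server

# (name, method, path, payload) -- paths here are illustrative
CHECKS = [
    ("health", "GET", "/health", None),
    ("ingest span", "POST", "/spans", {"name": "llm_call", "input": "hi", "output": "hello"}),
    ("hallucination check", "POST", "/hallucinations", {
        "question": "What is the refund window?",
        "response": "We offer 60-day refunds",
        "context": "Refunds are available for 30 days only",
    }),
]

def run_checks():
    passed = 0
    for name, method, path, payload in CHECKS:
        resp = requests.request(method, BASE + path, json=payload, timeout=30)
        ok = resp.status_code &amp;lt; 400
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{passed}/{len(CHECKS)} checks passed")

if __name__ == "__main__":
    run_checks()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;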




&lt;h2&gt;
  
  
  What I'd still do differently
&lt;/h2&gt;

&lt;p&gt;The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.&lt;/p&gt;
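
&lt;p&gt;The fix would be to reuse the same polling-worker pattern that span scoring already uses. A rough sketch of the shape; &lt;code&gt;fetch_unchecked&lt;/code&gt; and &lt;code&gt;save_result&lt;/code&gt; are hypothetical helpers, not existing TraceMind functions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def hallucination_worker():
    # Same shape as the span-scoring worker: poll, process a small batch, sleep.
    while True:
        traces = fetch_unchecked(limit=20)   # hypothetical helper
        for trace in traces:
            # Run the synchronous check off the event loop, off the request path
            result = await asyncio.to_thread(
                detect_hallucinations, trace.response, trace.context
            )
            save_result(trace.id, result)    # hypothetical helper
        await asyncio.sleep(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;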




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv3ifxm8dqm1kep9qrsy.gif" alt=" " width="600" height="319"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7lgr7zj60672myyhh4m.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Self-hosted, free, no vendor lock-in.&lt;/p&gt;

&lt;p&gt;If you're building with LLMs, I'd genuinely love to know what breaks when you try it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>llmops</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
    <item>
      <title>I built an open-source LLM eval platform with a ReAct agent that diagnoses quality regressions</title>
      <dc:creator>Aayush kumarsingh</dc:creator>
      <pubDate>Thu, 09 Apr 2026 11:45:30 +0000</pubDate>
      <link>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</link>
      <guid>https://forem.com/aayush_kumarsingh_6ee1ffe/i-built-an-open-source-llm-eval-platform-with-a-react-agent-that-diagnoses-quality-regressions-3a26</guid>
      <description>&lt;h2&gt;
  
  
  The problem that made me build this
&lt;/h2&gt;

&lt;p&gt;I was building a multi-agent orchestration system. It worked great in testing. I deployed it. Three days later I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later when a user complained.&lt;/p&gt;

&lt;p&gt;This is one of the most common failure modes in LLM applications. Unlike traditional software, where a bug throws an exception, bad LLM outputs look like valid responses. They just happen to be wrong, unhelpful, or unsafe. You need systematic measurement to catch this.&lt;/p&gt;

&lt;p&gt;I looked for existing tools. Langfuse is good but expensive at scale for self-hosted teams. Braintrust doesn't have a free self-hosted option. Helicone doesn't do evals. So I built TraceMind.&lt;/p&gt;
&lt;h2&gt;
  
  
  What TraceMind does
&lt;/h2&gt;

&lt;p&gt;Three things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automatic quality scoring&lt;/strong&gt;&lt;br&gt;
Every LLM response is scored 1-10 by another LLM acting as judge (the LLM-as-judge pattern). I use Groq's free tier — llama-3.1-8b-instant for fast scoring, llama-3.3-70b for deep analysis. Scoring runs in the background, never blocking your application.&lt;/p&gt;
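
&lt;p&gt;A minimal sketch of what a judge call like this looks like with the Groq SDK; the rubric, prompt, and settings here are illustrative rather than TraceMind's exact ones:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from groq import Groq

client = Groq(api_key="...")

def judge_score(user_input, model_output):
    """Ask a fast judge model to grade one response on a 1-10 scale."""
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # the fast judge; a 70b model handles deep analysis
        temperature=0,                 # keep the judge as deterministic as possible
        messages=[{"role": "user", "content":
                   "Score this assistant response from 1 to 10 for correctness, "
                   "helpfulness, and safety. Reply only with JSON like "
                   '{"score": 7, "reason": "..."}.\n\n'
                   f"User: {user_input}\nAssistant: {model_output}"}],
    )
    return json.loads(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;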

&lt;p&gt;&lt;strong&gt;2. Golden dataset evals&lt;/strong&gt;&lt;br&gt;
You define expected behaviors once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want a refund&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acknowledge and ask for order number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;support-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pass rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pass_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Pass rate: 87%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. AI agent that diagnoses regressions&lt;/strong&gt;&lt;br&gt;
This is the part I'm most proud of. You can ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Why did quality drop yesterday?"
"What are the most common failure patterns?"
"Generate test cases for billing question failures"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent implements the ReAct pattern with 6 tools and 4 memory types.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture decisions that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parallel eval execution with asyncio.Semaphore
&lt;/h3&gt;

&lt;p&gt;The naive approach runs LLM judge calls sequentially. For 100 test cases at 500ms each, that's 50 seconds.&lt;/p&gt;

&lt;p&gt;I use asyncio.Semaphore(3) to run 3 evaluations concurrently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_concurrent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;run_case&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;coro&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evaluating 100 cases now takes ~17 seconds. The semaphore limit exists because Groq's free tier has rate limits — I tuned it to stay under the threshold.&lt;/p&gt;
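
&lt;p&gt;Each &lt;code&gt;run_case&lt;/code&gt; acquires the semaphore around its LLM work, so only a few cases are in flight at once. Roughly, with &lt;code&gt;judge_output&lt;/code&gt; standing in as a hypothetical judge helper:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def run_case(example, system_fn, criteria, semaphore):
    # Only `max_concurrent` cases hold the semaphore at a time, which keeps
    # the judge traffic under Groq's free-tier rate limit.
    async with semaphore:
        output = await asyncio.to_thread(system_fn, example.input)      # the system under test
        score = await judge_output(output, example.expected, criteria)  # hypothetical judge helper
        return {"input": example.input, "output": output, "score": score}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;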
&lt;h3&gt;
  
  
  The ReAct agent with semantic memory
&lt;/h3&gt;

&lt;p&gt;The agent has 4 memory types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;In-context&lt;/strong&gt;: conversation history within the session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External KV&lt;/strong&gt;: project config from database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: past failures in ChromaDB with sentence-transformers embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episodic&lt;/strong&gt;: past agent run results in SQLite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you ask "why did quality drop?", the agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Searches ChromaDB semantically for similar past failures&lt;/li&gt;
&lt;li&gt;Fetches recent low-scoring traces from the database&lt;/li&gt;
&lt;li&gt;Runs a targeted eval on the failure category&lt;/li&gt;
&lt;li&gt;Uses an Opus-equivalent model to analyze the root cause&lt;/li&gt;
&lt;li&gt;Generates new test cases to prevent future recurrence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I intentionally avoided LangChain. The ReAct loop is 80 lines of readable Python. When something breaks at 3am, you want to read your own code.&lt;/p&gt;
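
&lt;p&gt;The pattern itself is small: prompt the model, parse the action it picks, run the tool, feed the observation back, repeat. A condensed sketch; the tool format and the &lt;code&gt;call_llm&lt;/code&gt; callable are illustrative, not the actual 80 lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

def react_loop(question, tools, call_llm, max_steps=8):
    """Minimal ReAct: the model alternates thought/action until it answers.

    `tools` maps a tool name to a callable; `call_llm(messages)` returns text.
    """
    messages = [
        {"role": "system", "content":
         "Reason step by step. Reply only with JSON: either "
         '{"thought": "...", "action": "tool_name", "args": {}} or '
         '{"thought": "...", "final_answer": "..."}. '
         "Available tools: " + ", ".join(tools)},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        step = json.loads(call_llm(messages))
        if "final_answer" in step:
            return step["final_answer"]
        # Act: run the chosen tool, then feed the observation back to the model
        observation = tools[step["action"]](**step.get("args", {}))
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after max_steps without a final answer"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;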
&lt;h3&gt;
  
  
  Background worker for async scoring
&lt;/h3&gt;

&lt;p&gt;The HTTP ingestion endpoint returns in &amp;lt;10ms regardless of batch size. Scoring runs in a background worker that polls every 10 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_score_unscored_spans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_unscored&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;spans&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_score_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;save_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The worst thing an observability tool can do is slow down the system it's monitoring. Scoring is completely decoupled from ingestion.&lt;/p&gt;
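
&lt;p&gt;The surrounding poll loop is just a long-lived task started at app startup. A sketch of its shape; only &lt;code&gt;_score_unscored_spans&lt;/code&gt; above is from the real code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def run_forever(self):
    # Launched once at startup; a failed batch is logged and never touches ingestion.
    while True:
        try:
            await self._score_unscored_spans()
        except Exception as exc:
            print(f"scoring batch failed: {exc}")
        await asyncio.sleep(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;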
&lt;h3&gt;
  
  
  Local embeddings — no OpenAI dependency
&lt;/h3&gt;

&lt;p&gt;I use sentence-transformers &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; for ChromaDB embeddings. It runs locally, downloads once (~90MB), works offline, zero API cost. This was a deliberate choice — I wanted the tool to work completely free with no external dependencies beyond Groq for LLM calls.&lt;/p&gt;
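
&lt;p&gt;Wiring this up takes only a handful of lines with ChromaDB's built-in sentence-transformers embedding function. A small sketch of how the semantic failure memory could look; the collection name and fields are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import chromadb
from chromadb.utils import embedding_functions

# all-MiniLM-L6-v2 downloads once (~90MB) and then runs fully offline
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.PersistentClient(path="./tracemind_memory")
failures = client.get_or_create_collection("failures", embedding_function=embed_fn)

# Store a scored failure so the agent can recall it later
failures.add(
    ids=["span-123"],
    documents=["User asked about refunds; response invented a 60-day policy"],
    metadatas=[{"score": 3, "category": "billing"}],
)

# "Why did quality drop?" triggers a semantic lookup over past failures
similar = failures.query(query_texts=["refund policy hallucination"], n_results=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;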
&lt;h2&gt;
  
  
  What I'd do differently in production
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy&lt;/strong&gt;: Row-level security instead of project-level isolation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celery + Redis&lt;/strong&gt; instead of asyncio background worker for horizontal scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming eval results&lt;/strong&gt; via WebSocket — see case-by-case progress in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alembic migrations&lt;/strong&gt; from day one (I added these later)&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Live demo: &lt;a href="https://tracemind.vercel.app" rel="noopener noreferrer"&gt;https://tracemind.vercel.app&lt;/a&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/Aayush-engineer/tracemind" rel="noopener noreferrer"&gt;https://github.com/Aayush-engineer/tracemind&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3-line setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tracemind&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TraceMind&lt;/span&gt;
&lt;span class="n"&gt;tm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TraceMind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://tracemind.onrender.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tm.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;your_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;  &lt;span class="c1"&gt;# your code unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxmjhby73l7wsn7laek2.png" alt=" "&gt;&lt;/a&gt;&lt;br&gt;
If you're building with LLMs and want to know if they're actually working, I'd love feedback.&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
