<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Abhi Chatterjee</title>
    <description>The latest articles on Forem by Abhi Chatterjee (@abhi_chatterjee_979801).</description>
    <link>https://forem.com/abhi_chatterjee_979801</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890932%2F829ef7da-8a3f-402c-8839-16d64b32d92e.jpg</url>
      <title>Forem: Abhi Chatterjee</title>
      <link>https://forem.com/abhi_chatterjee_979801</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abhi_chatterjee_979801"/>
    <language>en</language>
    <item>
      <title>Observability for AI Systems: Monitoring Drift, Hallucinations, and Reliability in Production</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Mon, 25 May 2026 15:36:51 +0000</pubDate>
      <link>https://forem.com/abhi_chatterjee_979801/observability-for-ai-systems-monitoring-drift-hallucinations-and-reliability-in-production-4i4l</link>
      <guid>https://forem.com/abhi_chatterjee_979801/observability-for-ai-systems-monitoring-drift-hallucinations-and-reliability-in-production-4i4l</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 5 of a series on building reliable AI systems&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;So far in this series, we explored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI testing fundamentals&lt;/li&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;RAG evaluation&lt;/li&gt;
&lt;li&gt;Agent tracing and reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there’s a major gap between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The system passed evaluation”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The system is behaving reliably in production.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That gap is where observability becomes critical.&lt;/p&gt;

&lt;p&gt;Because AI systems don’t just fail once.&lt;/p&gt;

&lt;p&gt;They drift.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Systems Need Observability
&lt;/h2&gt;

&lt;p&gt;Traditional applications are usually monitored for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;API failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI systems introduce an entirely different layer of operational risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations&lt;/li&gt;
&lt;li&gt;Behavioral drift&lt;/li&gt;
&lt;li&gt;Retrieval degradation&lt;/li&gt;
&lt;li&gt;Prompt regressions&lt;/li&gt;
&lt;li&gt;Tool misuse&lt;/li&gt;
&lt;li&gt;Silent quality decay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most of these issues won’t show up in infrastructure metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI Failures Are Often Silent
&lt;/h2&gt;

&lt;p&gt;This is what makes production AI systems dangerous.&lt;/p&gt;

&lt;p&gt;The system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;returns 200 OK&lt;/li&gt;
&lt;li&gt;responds within latency limits&lt;/li&gt;
&lt;li&gt;appears operational&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…but produces low-quality or misleading outputs.&lt;/p&gt;

&lt;p&gt;Infrastructure monitoring says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Everything is healthy.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Users experience:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The system is getting worse.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Should You Monitor?
&lt;/h2&gt;

&lt;p&gt;AI observability is about monitoring both:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System performance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Behavior quality&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You need visibility into both layers.&lt;/p&gt;




&lt;h1&gt;
  
  
  Core Dimensions of AI Observability
&lt;/h1&gt;




&lt;h2&gt;
  
  
  1. Input Monitoring
&lt;/h2&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What kinds of inputs is the system receiving?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query distribution&lt;/li&gt;
&lt;li&gt;Input length&lt;/li&gt;
&lt;li&gt;Language changes&lt;/li&gt;
&lt;li&gt;New user patterns&lt;/li&gt;
&lt;li&gt;Adversarial inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example issue:&lt;br&gt;
A support chatbot trained mostly on short queries suddenly starts receiving multi-step enterprise requests.&lt;/p&gt;

&lt;p&gt;Performance drops even though the model hasn’t changed.&lt;/p&gt;

&lt;p&gt;That’s input drift: the model is the same, but the traffic it serves is not.&lt;/p&gt;
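&lt;p&gt;As a rough illustration of the chatbot example, input length is one of the cheapest signals to watch. The sketch below compares the mean query length of a recent window against a baseline. The function name, the sample numbers, and the alert threshold are all assumptions made for illustration, not a recommended configuration.&lt;/p&gt;

```python
from statistics import mean, pstdev


def length_drift_score(baseline_lengths, recent_lengths):
    """Return how many baseline standard deviations the recent mean length has moved."""
    base_mean = mean(baseline_lengths)
    base_std = pstdev(baseline_lengths) or 1.0  # fall back to 1.0 if the baseline has no spread
    return abs(mean(recent_lengths) - base_mean) / base_std


# Hypothetical word counts per query.
baseline = [8, 12, 10, 9, 11, 10]    # typical short support queries
recent = [45, 60, 52, 48, 55, 50]    # multi-step enterprise requests

score = length_drift_score(baseline, recent)
drifted = score > 3.0  # assumed threshold; tune it against your own traffic
print(round(score, 1), drifted)
```

&lt;p&gt;A single statistic like this will not catch every shift, but it is cheap enough to compute on all traffic and can prompt a closer look.&lt;/p&gt;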


&lt;h2&gt;
  
  
  2. Output Quality Monitoring
&lt;/h2&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are outputs still reliable?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucination frequency&lt;/li&gt;
&lt;li&gt;Response consistency&lt;/li&gt;
&lt;li&gt;Formatting failures&lt;/li&gt;
&lt;li&gt;Grounding quality&lt;/li&gt;
&lt;li&gt;Toxicity / unsafe outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where online evaluation becomes important.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Retrieval Monitoring (for RAG)
&lt;/h2&gt;

&lt;p&gt;RAG systems need dedicated observability.&lt;/p&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval success rate&lt;/li&gt;
&lt;li&gt;Context relevance&lt;/li&gt;
&lt;li&gt;Empty retrievals&lt;/li&gt;
&lt;li&gt;Retrieval latency&lt;/li&gt;
&lt;li&gt;Top-K quality trends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Good model
    +
Poor retrieval
    =
Bad user experience
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Many “LLM issues” are actually retrieval degradation problems.&lt;/p&gt;
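&lt;p&gt;Two of the signals listed above, empty retrievals and top-K quality, can be summarised per batch of requests. This is a minimal sketch that assumes each logged retrieval event carries a &lt;code&gt;scores&lt;/code&gt; list of similarity scores; that field name and the event shape are hypothetical.&lt;/p&gt;

```python
def retrieval_health(events):
    """Summarise a batch of retrieval events.

    Each event is a dict with a "scores" list holding the similarity scores
    of the retrieved chunks (empty when nothing was retrieved).
    """
    total = len(events)
    if total == 0:
        return {"empty_rate": 0.0, "mean_top_score": 0.0}
    empty = sum(1 for event in events if not event["scores"])
    top_scores = [max(event["scores"]) for event in events if event["scores"]]
    mean_top = sum(top_scores) / len(top_scores) if top_scores else 0.0
    return {"empty_rate": empty / total, "mean_top_score": mean_top}


# Hypothetical batch of four logged retrievals.
events = [
    {"scores": [0.82, 0.75, 0.40]},
    {"scores": []},
    {"scores": [0.31, 0.22]},
    {"scores": [0.90]},
]
summary = retrieval_health(events)
print(summary)
```

&lt;p&gt;Tracking these two numbers over time makes it easier to tell a retrieval problem apart from a model problem.&lt;/p&gt;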




&lt;h2&gt;
  
  
  4. Agent Workflow Monitoring
&lt;/h2&gt;

&lt;p&gt;Agent systems require workflow-level visibility.&lt;/p&gt;

&lt;p&gt;Monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool usage patterns&lt;/li&gt;
&lt;li&gt;Retry frequency&lt;/li&gt;
&lt;li&gt;Loop detection&lt;/li&gt;
&lt;li&gt;Failed actions&lt;/li&gt;
&lt;li&gt;Average execution steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example issue:&lt;br&gt;
An agent starts making 4x more tool calls after a prompt update.&lt;/p&gt;

&lt;p&gt;Outputs still look correct.&lt;/p&gt;

&lt;p&gt;Operational cost quietly explodes.&lt;/p&gt;
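&lt;p&gt;A regression like this can be caught by comparing tool calls per request before and after a change. The following sketch assumes you already log a tool-call count for every request; the numbers and the alert threshold are made up for illustration.&lt;/p&gt;

```python
def tool_call_ratio(baseline_calls, current_calls):
    """Ratio of mean tool calls per request: current window vs baseline window."""
    baseline_avg = sum(baseline_calls) / len(baseline_calls)
    current_avg = sum(current_calls) / len(current_calls)
    return current_avg / baseline_avg


# Hypothetical tool-call counts per request.
before_update = [2, 3, 2, 3, 2]      # before the prompt change
after_update = [9, 10, 8, 11, 10]    # after the prompt change

ratio = tool_call_ratio(before_update, after_update)
should_alert = ratio >= 2.0  # assumed threshold: alert when calls at least double
print(round(ratio, 2), should_alert)
```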


&lt;h2&gt;
  
  
  5. Drift Detection
&lt;/h2&gt;

&lt;p&gt;Drift detection is one of the hardest production problems.&lt;/p&gt;

&lt;p&gt;Drift happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user behavior changes&lt;/li&gt;
&lt;li&gt;prompts evolve&lt;/li&gt;
&lt;li&gt;retrieval data changes&lt;/li&gt;
&lt;li&gt;model behavior shifts over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even small changes compound.&lt;/p&gt;
&lt;h3&gt;
  
  
  Common drift signals:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lower task success rate&lt;/li&gt;
&lt;li&gt;Increased hallucinations&lt;/li&gt;
&lt;li&gt;More retries&lt;/li&gt;
&lt;li&gt;Reduced grounding quality&lt;/li&gt;
&lt;/ul&gt;
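&lt;p&gt;Signals like these are easiest to act on when tracked over a rolling window and compared with a known-good baseline. Here is one possible sketch using a fixed-size window; the class name, window size, baseline rate, and tolerated drop are assumptions.&lt;/p&gt;

```python
from collections import deque


class RollingRate:
    """Track the rate of a boolean signal over the last `window` events."""

    def __init__(self, window):
        self.events = deque(maxlen=window)  # old events fall off automatically

    def record(self, ok):
        self.events.append(bool(ok))

    def rate(self):
        return sum(self.events) / len(self.events) if self.events else 0.0


monitor = RollingRate(window=5)
for task_succeeded in [True, True, True, False, False, False, True, False]:
    monitor.record(task_succeeded)

baseline_rate = 0.9                                # assumed rate from offline evaluation
degraded = baseline_rate - monitor.rate() > 0.1    # assumed tolerated drop
print(monitor.rate(), degraded)
```

&lt;p&gt;The same class could track hallucination flags or retries instead of task success; only the meaning of the boolean changes.&lt;/p&gt;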


&lt;h2&gt;
  
  
  The Difference Between Monitoring and Evaluation
&lt;/h2&gt;

&lt;p&gt;This distinction is important.&lt;/p&gt;
&lt;h3&gt;
  
  
  Evaluation:
&lt;/h3&gt;

&lt;p&gt;Usually offline and controlled.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Run dataset → Measure metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observability:
&lt;/h3&gt;

&lt;p&gt;Continuous monitoring in production.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Live traffic → Detect anomalies → Trigger alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need both.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Practical AI Observability Flow
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Production Traffic
        ↓
Capture Inputs &amp;amp; Outputs
        ↓
Run Online Checks
        ↓
Detect Drift / Failures
        ↓
Trigger Alerts
        ↓
Feed Back Into Evaluation Pipeline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a continuous reliability loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Online Evaluation in Production
&lt;/h2&gt;

&lt;p&gt;Many teams now run lightweight evaluations on live traffic.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucination checks&lt;/li&gt;
&lt;li&gt;Grounding verification&lt;/li&gt;
&lt;li&gt;Response quality scoring&lt;/li&gt;
&lt;li&gt;Toxicity detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;silent regressions&lt;/li&gt;
&lt;li&gt;degraded prompts&lt;/li&gt;
&lt;li&gt;retrieval failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;before users escalate issues.&lt;/p&gt;
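&lt;p&gt;One way to keep online checks lightweight is to run them on a deterministic sample of traffic. The sketch below hashes a request id to decide whether to sample, then runs a small set of cheap checks. The check names, the request id format, and the sample rate are hypothetical; real hallucination or grounding checks would plug into the same dictionary.&lt;/p&gt;

```python
import hashlib
import json


def should_sample(request_id, rate):
    """Deterministically pick roughly `rate` of requests using a hash of the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return rate > bucket


def is_non_empty(output):
    return bool(output.strip())


def is_valid_json(output):
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


# Assumed registry of cheap checks, keyed by metric name.
CHECKS = {"non_empty": is_non_empty, "valid_json": is_valid_json}


def run_online_checks(request_id, output, rate=0.1):
    """Return per-check pass/fail for sampled requests, or None when skipped."""
    if not should_sample(request_id, rate):
        return None
    return {name: check(output) for name, check in CHECKS.items()}


print(run_online_checks("req-1", '{"answer": "ok"}', rate=1.0))
```

&lt;p&gt;Hash-based sampling means the same request is always either in or out of the sample, which makes results reproducible when you re-run checks later.&lt;/p&gt;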




&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Consider a production RAG assistant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial state:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strong retrieval quality&lt;/li&gt;
&lt;li&gt;Stable outputs&lt;/li&gt;
&lt;li&gt;Good user satisfaction&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What changed:
&lt;/h3&gt;

&lt;p&gt;A large set of new documents was added to the vector database.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened next:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval relevance dropped&lt;/li&gt;
&lt;li&gt;Context became noisy&lt;/li&gt;
&lt;li&gt;Hallucinations increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure metrics remained healthy.&lt;/p&gt;

&lt;p&gt;Only observability metrics exposed the degradation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes Teams Make
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Monitoring only infrastructure
&lt;/h3&gt;

&lt;p&gt;AI quality problems are behavioral—not just operational.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. No production sampling
&lt;/h3&gt;

&lt;p&gt;If you never inspect real outputs, you’ll miss drift entirely.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. No feedback loop
&lt;/h3&gt;

&lt;p&gt;Observability should improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasets&lt;/li&gt;
&lt;li&gt;evaluations&lt;/li&gt;
&lt;li&gt;prompts&lt;/li&gt;
&lt;li&gt;retrieval quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Otherwise monitoring becomes passive reporting.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Ignoring cost observability
&lt;/h3&gt;

&lt;p&gt;AI systems also drift operationally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;token usage&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliability includes efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Signals Worth Tracking
&lt;/h2&gt;

&lt;p&gt;Here are some high-value production metrics:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Signals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Output Quality&lt;/td&gt;
&lt;td&gt;Hallucination rate, grounding score&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Retrieval relevance, empty retrievals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agents&lt;/td&gt;
&lt;td&gt;Tool failures, retries, loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Usage&lt;/td&gt;
&lt;td&gt;Query distribution, prompt drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operations&lt;/td&gt;
&lt;td&gt;Latency, token usage, cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Start small. Expand over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building Feedback Loops
&lt;/h2&gt;

&lt;p&gt;The best AI teams continuously feed production insights back into evaluation.&lt;/p&gt;

&lt;p&gt;Example loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Production Failure
        ↓
Add to Dataset
        ↓
Run Evaluations
        ↓
Improve System
        ↓
Deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how reliable systems mature.&lt;/p&gt;
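&lt;p&gt;The "Add to Dataset" step in the loop above can be as simple as appending the failing case to a regression file that the evaluation pipeline reads. This sketch assumes a JSON Lines file; the field names and file name are illustrative only.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def add_failure_to_dataset(dataset_path, query, bad_output, note):
    """Append one production failure to a JSON Lines regression dataset."""
    record = {"query": query, "bad_output": bad_output, "note": note}
    with Path(dataset_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_dataset(dataset_path):
    """Read every case back for the next evaluation run."""
    with Path(dataset_path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "regressions.jsonl"
    add_failure_to_dataset(
        path,
        query="How long do refunds take?",
        bad_output="Refunds are instant.",
        note="contradicts policy document",
    )
    cases = load_dataset(path)

print(len(cases), cases[0]["note"])
```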




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part of this series, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red teaming AI systems&lt;/li&gt;
&lt;li&gt;Prompt injection attacks&lt;/li&gt;
&lt;li&gt;Jailbreak testing&lt;/li&gt;
&lt;li&gt;Adversarial evaluation strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because reliability without security is incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI systems are not static applications.&lt;/p&gt;

&lt;p&gt;They evolve continuously through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changing inputs&lt;/li&gt;
&lt;li&gt;retrieval updates&lt;/li&gt;
&lt;li&gt;prompt modifications&lt;/li&gt;
&lt;li&gt;model behavior shifts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that means reliability cannot depend on testing alone.&lt;/p&gt;

&lt;p&gt;It requires continuous observability.&lt;/p&gt;

&lt;p&gt;The teams building resilient AI systems are the ones that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;monitor behavior, not just infrastructure&lt;/li&gt;
&lt;li&gt;detect drift early&lt;/li&gt;
&lt;li&gt;build strong feedback loops&lt;/li&gt;
&lt;li&gt;continuously evaluate production quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in AI systems, failures rarely announce themselves.&lt;/p&gt;

&lt;p&gt;They emerge gradually—until users notice first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Evaluating AI Agents: Tracing, Tool Calls, and Multi-Step Reliability</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Tue, 19 May 2026 19:34:42 +0000</pubDate>
      <link>https://forem.com/abhi_chatterjee_979801/evaluating-ai-agents-tracing-tool-calls-and-multi-step-reliability-40eb</link>
      <guid>https://forem.com/abhi_chatterjee_979801/evaluating-ai-agents-tracing-tool-calls-and-multi-step-reliability-40eb</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 4 of a series on building reliable AI systems&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In previous parts of this series, we explored:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why testing AI systems is different&lt;/li&gt;
&lt;li&gt;How to build evaluation pipelines&lt;/li&gt;
&lt;li&gt;How to evaluate RAG systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we move into one of the hardest areas in modern AI systems:&lt;/p&gt;

&lt;h1&gt;
  
  
  AI Agents
&lt;/h1&gt;

&lt;p&gt;Unlike traditional LLM applications, agents don’t just generate responses.&lt;/p&gt;

&lt;p&gt;They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plan&lt;/li&gt;
&lt;li&gt;Make decisions&lt;/li&gt;
&lt;li&gt;Call tools&lt;/li&gt;
&lt;li&gt;Maintain state&lt;/li&gt;
&lt;li&gt;Iterate toward goals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that makes evaluation significantly harder.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Agent Evaluation Is Different
&lt;/h2&gt;

&lt;p&gt;A standard LLM interaction is usually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → Model → Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An agent system looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Goal
  ↓
Plan
  ↓
Tool Call
  ↓
Observe Result
  ↓
Reason Again
  ↓
Repeat
  ↓
Final Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Failures can happen at any step.&lt;/p&gt;

&lt;p&gt;Sometimes the final answer is wrong.&lt;br&gt;
Sometimes the answer is correct—but achieved inefficiently or unsafely.&lt;/p&gt;

&lt;p&gt;Traditional output-based testing misses most of these issues.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Actually Fails in Agent Systems?
&lt;/h2&gt;

&lt;p&gt;Here are the most common production failure patterns:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Wrong Tool Selection
&lt;/h3&gt;

&lt;p&gt;The agent selects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the wrong API&lt;/li&gt;
&lt;li&gt;the wrong retrieval source&lt;/li&gt;
&lt;li&gt;or an unnecessary tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even when the correct tool exists.&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Infinite or Inefficient Loops
&lt;/h3&gt;

&lt;p&gt;The agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;repeats actions&lt;/li&gt;
&lt;li&gt;retries unnecessarily&lt;/li&gt;
&lt;li&gt;or keeps reasoning without progressing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This increases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;cost&lt;/li&gt;
&lt;li&gt;failure probability&lt;/li&gt;
&lt;/ul&gt;
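&lt;p&gt;If traces are captured as a list of steps, a basic loop check is to count identical tool calls. The sketch below assumes each step is a dict with &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, and &lt;code&gt;args&lt;/code&gt; keys; that trace shape, the tool names, and the repeat limit are hypothetical.&lt;/p&gt;

```python
import json
from collections import Counter


def find_repeated_calls(trace, max_repeats=2):
    """Return (tool, serialized args) pairs that occur more than `max_repeats` times."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
        if step.get("type") == "tool_call"
    )
    return [call for call, n in counts.items() if n > max_repeats]


# Hypothetical trace: the agent searches for the same order three times.
trace = [
    {"type": "tool_call", "tool": "search_orders", "args": {"q": "order 123"}},
    {"type": "reasoning", "text": "No result yet, try again."},
    {"type": "tool_call", "tool": "search_orders", "args": {"q": "order 123"}},
    {"type": "tool_call", "tool": "search_orders", "args": {"q": "order 123"}},
    {"type": "tool_call", "tool": "get_customer", "args": {"id": 7}},
]
repeated = find_repeated_calls(trace)
print(repeated)
```

&lt;p&gt;Exact-match counting only catches the simplest loops; an agent that varies its arguments slightly on each retry would need a fuzzier comparison.&lt;/p&gt;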


&lt;h3&gt;
  
  
  3. Partial Task Completion
&lt;/h3&gt;

&lt;p&gt;The agent completes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;step 1 and step 2&lt;/li&gt;
&lt;li&gt;but silently skips step 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users often don’t notice immediately.&lt;/p&gt;


&lt;h3&gt;
  
  
  4. Hallucinated Tool Results
&lt;/h3&gt;

&lt;p&gt;The model behaves as if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a tool succeeded&lt;/li&gt;
&lt;li&gt;data was retrieved&lt;/li&gt;
&lt;li&gt;or an action was completed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;—even when it failed.&lt;/p&gt;

&lt;p&gt;This is extremely dangerous in automation workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  Evaluating Agents Requires More Than Final Outputs
&lt;/h2&gt;

&lt;p&gt;This is the key mindset shift:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are not evaluating answers.&lt;br&gt;
You are evaluating decision-making behavior.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means inspecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning flow&lt;/li&gt;
&lt;li&gt;tool usage&lt;/li&gt;
&lt;li&gt;execution paths&lt;/li&gt;
&lt;li&gt;recovery behavior&lt;/li&gt;
&lt;li&gt;efficiency&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Core Dimensions of Agent Evaluation
&lt;/h2&gt;


&lt;h3&gt;
  
  
  1. Task Success
&lt;/h3&gt;

&lt;p&gt;The most obvious metric.&lt;/p&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the agent complete the goal correctly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was the email actually sent?&lt;/li&gt;
&lt;li&gt;Was the meeting booked?&lt;/li&gt;
&lt;li&gt;Was the report generated correctly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But task success alone is not enough.&lt;/p&gt;


&lt;h3&gt;
  
  
  2. Tool Usage Accuracy
&lt;/h3&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the agent use the correct tools correctly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Things to measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool selection quality&lt;/li&gt;
&lt;li&gt;Correct parameters&lt;/li&gt;
&lt;li&gt;API success/failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example failure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Correct tool available
        ↓
Agent chooses wrong tool
        ↓
Task fails downstream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Step Efficiency
&lt;/h3&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How efficiently did the agent complete the task?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of reasoning steps&lt;/li&gt;
&lt;li&gt;Number of tool calls&lt;/li&gt;
&lt;li&gt;Retry frequency&lt;/li&gt;
&lt;li&gt;Time to completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two agents may produce the same output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one in 3 steps&lt;/li&gt;
&lt;li&gt;another in 25 unnecessary steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Efficiency matters in production systems.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Recovery Behavior
&lt;/h3&gt;

&lt;p&gt;Question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens when something fails?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Strong agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;retry intelligently&lt;/li&gt;
&lt;li&gt;switch strategies&lt;/li&gt;
&lt;li&gt;recover from missing data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weak agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loop&lt;/li&gt;
&lt;li&gt;hallucinate&lt;/li&gt;
&lt;li&gt;terminate incorrectly&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  5. Grounding and Reliability
&lt;/h3&gt;

&lt;p&gt;Even agents using RAG can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ignore retrieved context&lt;/li&gt;
&lt;li&gt;invent tool results&lt;/li&gt;
&lt;li&gt;produce unsupported conclusions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Grounding still matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Tracing Is Critical
&lt;/h2&gt;

&lt;p&gt;Without tracing, debugging agents becomes almost impossible.&lt;/p&gt;

&lt;p&gt;You need visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning steps&lt;/li&gt;
&lt;li&gt;tool calls&lt;/li&gt;
&lt;li&gt;observations&lt;/li&gt;
&lt;li&gt;intermediate outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A trace typically looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
   ↓
Reasoning Step
   ↓
Tool Call
   ↓
Tool Response
   ↓
Updated Reasoning
   ↓
Final Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows you to identify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where failures happened&lt;/li&gt;
&lt;li&gt;why decisions were made&lt;/li&gt;
&lt;li&gt;which step introduced errors&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Agent Evaluation Workflow
&lt;/h2&gt;

&lt;p&gt;A simple workflow might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task Dataset
    ↓
Run Agent
    ↓
Capture Trace
    ↓
Evaluate:
  - Task Success
  - Tool Usage
  - Efficiency
  - Recovery
    ↓
Store Metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example Evaluation Loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;efficiency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tool_usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_usage&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Evaluate the process, not just the output.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Real-World Failure Example
&lt;/h2&gt;

&lt;p&gt;Consider a support automation agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Goal:
&lt;/h3&gt;

&lt;p&gt;Refund a customer order and send confirmation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The agent retrieved the order correctly&lt;/li&gt;
&lt;li&gt;The refund API call failed&lt;/li&gt;
&lt;li&gt;The agent still responded:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;“Refund completed successfully”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the user’s perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;everything looked correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Operationally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nothing happened&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why agent tracing and verification matter.&lt;/p&gt;
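&lt;p&gt;A verification step for this failure can cross-check what the agent says against what the trace shows. The sketch below assumes each trace step records a &lt;code&gt;tool&lt;/code&gt; name and a &lt;code&gt;status&lt;/code&gt;; the tool names, status values, and phrase matching are illustrative assumptions, and a real check would likely need something more robust than a substring match.&lt;/p&gt;

```python
def claims_unverified_success(trace, final_output, tool_name, claim_phrase):
    """True when the output claims success but no successful call to `tool_name` was traced."""
    claimed = claim_phrase.lower() in final_output.lower()
    succeeded = any(
        step.get("tool") == tool_name and step.get("status") == "ok"
        for step in trace
    )
    return claimed and not succeeded


# Hypothetical trace for the refund scenario above.
trace = [
    {"tool": "get_order", "status": "ok"},
    {"tool": "refund_order", "status": "error"},
]
final_output = "Refund completed successfully"

flagged = claims_unverified_success(
    trace, final_output, tool_name="refund_order", claim_phrase="refund completed"
)
print(flagged)
```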




&lt;h2&gt;
  
  
  Common Mistakes Teams Make
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Evaluating only final responses
&lt;/h3&gt;

&lt;p&gt;This misses reasoning and execution failures.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. No trace logging
&lt;/h3&gt;

&lt;p&gt;This makes debugging extremely difficult.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Ignoring efficiency
&lt;/h3&gt;

&lt;p&gt;High-quality outputs can still be operationally expensive.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. No failure simulation
&lt;/h3&gt;

&lt;p&gt;Agents behave differently under real-world failures.&lt;/p&gt;

&lt;p&gt;Test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API timeouts&lt;/li&gt;
&lt;li&gt;missing context&lt;/li&gt;
&lt;li&gt;invalid tool responses&lt;/li&gt;
&lt;/ul&gt;
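&lt;p&gt;Failures like these can be injected in tests by wrapping a tool so that it fails on demand. Below is a minimal sketch: &lt;code&gt;FlakyTool&lt;/code&gt; and &lt;code&gt;call_with_retries&lt;/code&gt; are hypothetical names, and the retry function is only a stand-in for whatever recovery logic your agent actually has.&lt;/p&gt;

```python
class FlakyTool:
    """Wrap a tool function and fail the first `failures` calls with TimeoutError."""

    def __init__(self, func, failures):
        self.func = func
        self.remaining_failures = failures
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("simulated API timeout")
        return self.func(*args, **kwargs)


def call_with_retries(tool, *args, max_attempts=3):
    """Stand-in for agent recovery logic: retry on timeout, then give up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": tool(*args), "attempts": attempt}
        except TimeoutError:
            continue
    return {"ok": False, "result": None, "attempts": max_attempts}


# The lookup times out twice, then succeeds on the third attempt.
lookup = FlakyTool(lambda order_id: {"id": order_id}, failures=2)
outcome = call_with_retries(lookup, "A1")
print(outcome, lookup.calls)
```

&lt;p&gt;The same wrapper idea extends to the other cases in the list, for example returning a malformed payload instead of raising.&lt;/p&gt;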




&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start with scenario-based evaluation&lt;/li&gt;
&lt;li&gt;Log every tool interaction&lt;/li&gt;
&lt;li&gt;Track retries and loops&lt;/li&gt;
&lt;li&gt;Simulate failures intentionally&lt;/li&gt;
&lt;li&gt;Evaluate both correctness and efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don’t trust successful outputs blindly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part of this series, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI system observability&lt;/li&gt;
&lt;li&gt;Monitoring production drift&lt;/li&gt;
&lt;li&gt;Detecting hallucinations in live systems&lt;/li&gt;
&lt;li&gt;Building feedback loops for continuous improvement&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI agents are not just text generators.&lt;/p&gt;

&lt;p&gt;They are decision-making systems operating across tools, workflows, and state.&lt;/p&gt;

&lt;p&gt;And that means reliability depends on far more than output quality.&lt;/p&gt;

&lt;p&gt;The teams building reliable agents are the ones that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;trace behavior&lt;/li&gt;
&lt;li&gt;evaluate decisions&lt;/li&gt;
&lt;li&gt;simulate failures&lt;/li&gt;
&lt;li&gt;continuously monitor execution patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in agent systems, failures rarely happen in one step.&lt;/p&gt;

&lt;p&gt;They compound across the workflow.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>agents</category>
      <category>testing</category>
    </item>
    <item>
      <title>Evaluating RAG Systems: Measuring Retrieval Quality, Grounding, and Hallucinations</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Fri, 08 May 2026 15:07:32 +0000</pubDate>
      <link>https://forem.com/abhi_chatterjee_979801/evaluating-rag-systems-measuring-retrieval-quality-grounding-and-hallucinations-16cn</link>
      <guid>https://forem.com/abhi_chatterjee_979801/evaluating-rag-systems-measuring-retrieval-quality-grounding-and-hallucinations-16cn</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of a series on building reliable AI systems&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In Part 1, we explored why testing AI systems is different.&lt;br&gt;
In Part 2, we built evaluation pipelines.&lt;/p&gt;

&lt;p&gt;Now let’s focus on one of the most widely used (and misunderstood) patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RAG is often seen as a solution to hallucinations.&lt;/p&gt;

&lt;p&gt;In reality, it shifts the problem: answer quality now depends on what was retrieved and on whether the model actually uses it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Problem with RAG
&lt;/h2&gt;

&lt;p&gt;A typical RAG pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Retriever → Context
    ↓
LLM → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When something goes wrong, it’s not always obvious where the failure is.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did retrieval fail?&lt;/li&gt;
&lt;li&gt;Was the context irrelevant?&lt;/li&gt;
&lt;li&gt;Did the model ignore the context?&lt;/li&gt;
&lt;li&gt;Or did it hallucinate anyway?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without proper evaluation, everything looks like a “model problem.”&lt;/p&gt;




&lt;h2&gt;
  
  
  RAG Has Two Systems, Not One
&lt;/h2&gt;

&lt;p&gt;This is the key insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are not evaluating a single system—you are evaluating two tightly coupled systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Retriever (search problem)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generator (language problem)&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you don’t evaluate them separately, debugging becomes guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Should You Measure?
&lt;/h2&gt;

&lt;p&gt;To evaluate RAG properly, you need to break it into components.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. Retrieval Quality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Did we fetch the right information?&lt;/p&gt;

&lt;p&gt;Metrics to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top-K relevance&lt;/li&gt;
&lt;li&gt;Context recall (was the correct doc retrieved?)&lt;/li&gt;
&lt;li&gt;Ranking quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example failure:&lt;/strong&gt;&lt;br&gt;
The correct document exists—but wasn’t retrieved.&lt;/p&gt;

&lt;p&gt;No model can fix missing context.&lt;/p&gt;
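&lt;p&gt;Context recall and ranking quality can be measured with two standard retrieval metrics, recall@k and reciprocal rank, as long as each query has labeled relevant documents. The document ids below are made up for illustration.&lt;/p&gt;

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k.intersection(relevant)) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical ranked results and ground-truth labels for one query.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = ["d2", "d4"]

print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=4))
print(reciprocal_rank(retrieved, relevant))
```

&lt;p&gt;Averaging reciprocal rank over a dataset gives mean reciprocal rank (MRR), which is a common single-number summary of ranking quality.&lt;/p&gt;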


&lt;h3&gt;
  
  
  2. Context Relevance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the retrieved content actually useful?&lt;/p&gt;

&lt;p&gt;Even if retrieval “works,” the context may be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noisy&lt;/li&gt;
&lt;li&gt;Partially relevant&lt;/li&gt;
&lt;li&gt;Outdated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to weak or incorrect answers.&lt;/p&gt;


&lt;h3&gt;
  
  
  3. Grounding / Faithfulness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Did the model use the retrieved context?&lt;/p&gt;

&lt;p&gt;This is one of the most critical checks.&lt;/p&gt;

&lt;p&gt;Failure patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model ignores context&lt;/li&gt;
&lt;li&gt;Adds unsupported information&lt;/li&gt;
&lt;li&gt;Mixes correct and hallucinated facts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation idea:&lt;/strong&gt;&lt;br&gt;
Compare the response against the retrieved context, not just against the expected answer.&lt;/p&gt;
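&lt;p&gt;As a very rough sketch of that idea, the snippet below flags response sentences that share few words with the retrieved context. Word overlap is a crude lexical heuristic chosen here for simplicity, not a real faithfulness metric; the function name and the overlap threshold are assumptions.&lt;/p&gt;

```python
import re


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(response, context, min_overlap=0.6):
    """Return response sentences whose word overlap with the context is below `min_overlap`."""
    context_words = _words(context)
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]?", response) if s.strip()]
    flagged = []
    for sentence in sentences:
        words = _words(sentence)
        if not words:
            continue
        overlap = len(words.intersection(context_words)) / len(words)
        if min_overlap > overlap:
            flagged.append(sentence)
    return flagged


# Hypothetical context and response: the second sentence is not supported.
context = (
    "Refunds are processed within 5 business days. "
    "Refunds go to the original payment method."
)
response = (
    "Refunds are processed within 5 business days. "
    "You will also receive a 20 percent discount."
)
print(unsupported_sentences(response, context))
```

&lt;p&gt;In practice teams often replace the overlap score with an entailment model or an LLM judge, but the structure stays the same: score each claim against the context, not against the expected answer.&lt;/p&gt;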


&lt;h3&gt;
  
  
  4. Answer Correctness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; Is the final answer actually correct?&lt;/p&gt;

&lt;p&gt;This is what users see—but it’s the &lt;em&gt;last&lt;/em&gt; layer.&lt;/p&gt;

&lt;p&gt;Important:&lt;br&gt;
Correct answers can still be &lt;strong&gt;poorly grounded&lt;/strong&gt;, which is risky.&lt;/p&gt;


&lt;h3&gt;
  
  
  5. Hallucination Rate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question:&lt;/strong&gt; How often does the model generate unsupported information?&lt;/p&gt;

&lt;p&gt;This is especially important in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support&lt;/li&gt;
&lt;li&gt;Healthcare&lt;/li&gt;
&lt;li&gt;Finance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track it explicitly—it won’t surface automatically.&lt;/p&gt;
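
&lt;p&gt;Tracking the rate explicitly can be as simple as counting flagged responses, overall and per segment. In this sketch the &lt;code&gt;unsupported_claims&lt;/code&gt; and &lt;code&gt;domain&lt;/code&gt; fields are hypothetical; the counts would come from a grounding check or human review.&lt;/p&gt;

```python
from collections import defaultdict


def hallucination_rate(records):
    """Share of responses with at least one unsupported claim."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r["unsupported_claims"] > 0)
    return flagged / len(records)


def rate_by_segment(records, key="domain"):
    """Hallucination rate per segment, e.g. per product area or use case."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {segment: hallucination_rate(rows) for segment, rows in groups.items()}


# Hypothetical labeled results, for illustration only.
records = [
    {"domain": "finance", "unsupported_claims": 0},
    {"domain": "finance", "unsupported_claims": 2},
    {"domain": "support", "unsupported_claims": 0},
    {"domain": "support", "unsupported_claims": 0},
]

print(hallucination_rate(records))   # 0.25
print(rate_by_segment(records))      # {'finance': 0.5, 'support': 0.0}
```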


&lt;h2&gt;
  
  
  A Practical Evaluation Flow
&lt;/h2&gt;

&lt;p&gt;Here’s how you can structure RAG evaluation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (Query)
   ↓
Retrieve Documents
   ↓
Evaluate Retrieval
   ↓
Generate Answer
   ↓
Evaluate Grounding + Correctness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example Evaluation Loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;retrieval_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_retrieval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;grounding_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_grounding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;correctness_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieval_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;grounding_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correctness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;correctness_score&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Failure Patterns
&lt;/h2&gt;

&lt;p&gt;These show up again and again:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. “Looks correct, but isn’t grounded”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Answer sounds right&lt;/li&gt;
&lt;li&gt;Not supported by retrieved context&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. “Right data, wrong answer”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Correct document retrieved&lt;/li&gt;
&lt;li&gt;Model misinterprets it&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. “No retrieval, full hallucination”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Retriever fails&lt;/li&gt;
&lt;li&gt;Model still generates a confident answer&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. “Too much context”
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Irrelevant documents dilute signal&lt;/li&gt;
&lt;li&gt;Model produces vague responses&lt;/li&gt;
&lt;/ul&gt;
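
&lt;p&gt;If each sample already has component scores, the four patterns above can be told apart with a few rules. This is a sketch under assumptions: every score is in the range 0 to 1, &lt;code&gt;context_precision&lt;/code&gt; (the share of retrieved documents that are relevant) is an extra signal not used in the evaluation loop earlier, and the 0.5 threshold is arbitrary.&lt;/p&gt;

```python
def classify_failure(retrieval, grounding, correctness, context_precision, threshold=0.5):
    """Map per-sample scores (each in [0, 1]) to one of the four failure patterns."""
    retrieved_ok = retrieval >= threshold
    grounded = grounding >= threshold
    correct = correctness >= threshold
    focused = context_precision >= threshold

    if not retrieved_ok:
        return "no_retrieval_full_hallucination"   # pattern 3
    if correct and not grounded:
        return "looks_correct_not_grounded"        # pattern 1
    if not correct and not focused:
        return "too_much_context"                  # pattern 4
    if not correct:
        return "right_data_wrong_answer"           # pattern 2
    return "ok"


print(classify_failure(0.9, 0.2, 0.9, 0.9))   # looks_correct_not_grounded
print(classify_failure(0.9, 0.9, 0.1, 0.9))   # right_data_wrong_answer
```

&lt;p&gt;The rule order is a design choice: retrieval is checked first because nothing downstream can be trusted when the right documents never arrived.&lt;/p&gt;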




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Evaluating only the final answer&lt;/li&gt;
&lt;li&gt;Ignoring retrieval metrics&lt;/li&gt;
&lt;li&gt;Assuming RAG eliminates hallucinations&lt;/li&gt;
&lt;li&gt;Not separating retrieval failures from generation failures&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Practical Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start with a small, high-quality dataset&lt;/li&gt;
&lt;li&gt;Log retrieved documents for every query&lt;/li&gt;
&lt;li&gt;Evaluate components separately&lt;/li&gt;
&lt;li&gt;Track metrics over time (not just one run)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluating AI agents (multi-step workflows)&lt;/li&gt;
&lt;li&gt;Tracing and debugging agent behavior&lt;/li&gt;
&lt;li&gt;Measuring task success and failure modes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;RAG doesn’t remove hallucinations—it changes where they come from.&lt;/p&gt;

&lt;p&gt;If you only evaluate outputs, you’ll miss the real problem.&lt;/p&gt;

&lt;p&gt;Reliable RAG systems come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong retrieval&lt;/li&gt;
&lt;li&gt;Grounded generation&lt;/li&gt;
&lt;li&gt;Continuous evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in RAG, the answer is only as good as the context behind it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building AI Evaluation Pipelines: Automating LLM Testing from Dataset to CI/CD</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:23:00 +0000</pubDate>
      <link>https://forem.com/abhi_chatterjee_979801/building-ai-evaluation-pipelines-automating-llm-testing-from-dataset-to-cicd-2io7</link>
      <guid>https://forem.com/abhi_chatterjee_979801/building-ai-evaluation-pipelines-automating-llm-testing-from-dataset-to-cicd-2io7</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a series on testing AI systems in production&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;In Part 1, we explored why testing AI systems is fundamentally different from traditional software.&lt;/p&gt;

&lt;p&gt;We talked about non-determinism, prompt sensitivity, and why unit tests aren’t enough.&lt;/p&gt;

&lt;p&gt;Now let’s move from theory to practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you actually build a system to test AI reliably?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post walks through a practical approach to building an &lt;strong&gt;AI evaluation pipeline&lt;/strong&gt;—from dataset creation to CI/CD integration.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is an AI Evaluation Pipeline?
&lt;/h2&gt;

&lt;p&gt;At a high level, an evaluation pipeline looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dataset → System → Evaluation → Metrics → Analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You define a dataset of test cases&lt;/li&gt;
&lt;li&gt;Run them through your AI system&lt;/li&gt;
&lt;li&gt;Evaluate outputs using defined metrics&lt;/li&gt;
&lt;li&gt;Store and analyze results over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes your &lt;strong&gt;source of truth for system quality&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Build a High-Quality Evaluation Dataset
&lt;/h2&gt;

&lt;p&gt;Your evaluation pipeline is only as good as your dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where data comes from:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production logs&lt;/strong&gt; (most valuable)&lt;/li&gt;
&lt;li&gt;Synthetic examples (for coverage)&lt;/li&gt;
&lt;li&gt;Edge cases and failure scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example structure:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What is the refund policy?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Answer should mention 30-day refund window"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Optional (for RAG systems)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"faq"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"difficulty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"easy"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What makes a good dataset:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Represents real user behavior&lt;/li&gt;
&lt;li&gt;Includes edge cases&lt;/li&gt;
&lt;li&gt;Covers known failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Insight:&lt;/strong&gt; Most teams underestimate this step. Dataset quality matters more than model choice in many cases.&lt;/p&gt;
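
&lt;p&gt;A small validation step helps keep the dataset trustworthy as it grows. The sketch below assumes test cases are stored as JSON Lines using the structure shown above, and that only &lt;code&gt;input&lt;/code&gt; and &lt;code&gt;expected&lt;/code&gt; are mandatory; both are assumptions, not requirements of any particular tool.&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = ("input", "expected")   # assumption: context and metadata are optional


def validate_record(record):
    """Return a list of problems found in one test case (empty means valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    if not isinstance(record.get("metadata", {}), dict):
        problems.append("'metadata' must be an object")
    return problems


def load_dataset(lines):
    """Parse JSON Lines text into records, rejecting malformed test cases."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        problems = validate_record(record)
        if problems:
            raise ValueError(f"line {line_number}: {'; '.join(problems)}")
        records.append(record)
    return records


sample_line = json.dumps({
    "input": "What is the refund policy?",
    "expected": "Answer should mention 30-day refund window",
    "metadata": {"type": "faq", "difficulty": "easy"},
})
print(len(load_dataset([sample_line])))   # 1
```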




&lt;h2&gt;
  
  
  Step 2: Define Evaluation Metrics
&lt;/h2&gt;

&lt;p&gt;Unlike traditional systems, correctness isn’t always binary.&lt;/p&gt;

&lt;p&gt;You’ll need a mix of evaluation strategies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common approaches:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Exact match (for structured tasks)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Useful for classification or JSON outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Semantic similarity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measures meaning, not exact wording&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. LLM-as-a-judge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses a model to evaluate output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Task success (for agents)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the system complete the objective?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tradeoffs:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Exact match → precise but brittle&lt;/li&gt;
&lt;li&gt;Semantic → flexible but fuzzy&lt;/li&gt;
&lt;li&gt;LLM judge → scalable but imperfect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is combining multiple signals.&lt;/p&gt;
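
&lt;p&gt;Here is a minimal sketch of reporting several signals side by side. Token-overlap F1 stands in for semantic similarity purely because it needs no model; a real pipeline would more likely use embeddings. The optional &lt;code&gt;judge_score&lt;/code&gt; is assumed to be a value between 0 and 1 produced elsewhere.&lt;/p&gt;

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(output, expected):
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def token_f1(output, expected):
    """Token-overlap F1: a cheap lexical stand-in for semantic similarity."""
    out_tokens, exp_tokens = normalize(output), normalize(expected)
    out_counts, exp_counts = Counter(out_tokens), Counter(exp_tokens)
    overlap = sum(min(count, exp_counts[token]) for token, count in out_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def score_output(output, expected, judge_score=None):
    """Collect every available signal instead of collapsing to one number."""
    signals = {
        "exact_match": exact_match(output, expected),
        "token_f1": token_f1(output, expected),
    }
    if judge_score is not None:
        signals["judge"] = judge_score
    return signals


print(score_output("Refunds within 30 days", "refunds within 30 days."))
# {'exact_match': 1.0, 'token_f1': 1.0}
```

&lt;p&gt;Keeping the signals separate makes disagreements visible, for example a high judge score paired with a low overlap score.&lt;/p&gt;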




&lt;h2&gt;
  
  
  Step 3: Run Evaluations
&lt;/h2&gt;

&lt;p&gt;At this stage, you execute your system against the dataset.&lt;/p&gt;

&lt;p&gt;A simple evaluation loop might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep it simple at first. Complexity can come later.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Store Results and Enable Debugging
&lt;/h2&gt;

&lt;p&gt;Raw scores are not enough. You need visibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Store:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inputs&lt;/li&gt;
&lt;li&gt;Outputs&lt;/li&gt;
&lt;li&gt;Scores&lt;/li&gt;
&lt;li&gt;Metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Add:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Failure tagging&lt;/li&gt;
&lt;li&gt;Error categories (hallucination, formatting, etc.)&lt;/li&gt;
&lt;li&gt;Trace logs (especially for agents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what allows you to answer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Why did the system fail?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without this layer, debugging becomes guesswork.&lt;/p&gt;
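
&lt;p&gt;One lightweight way to store results is one JSON object per line, with failure tags attached to each row. The field names, the &lt;code&gt;run_id&lt;/code&gt; value, and the tag vocabulary below are all hypothetical; an in-memory buffer stands in for a file or object store.&lt;/p&gt;

```python
import io
import json


def build_row(run_id, sample, output, score, failure_tags=()):
    """Flatten one evaluation result into a JSON-serializable record."""
    return {
        "run_id": run_id,
        "input": sample["input"],
        "output": output,
        "score": score,
        "metadata": sample.get("metadata", {}),
        "failure_tags": list(failure_tags),   # e.g. ["hallucination", "formatting"]
    }


def write_rows(stream, rows):
    """Append rows as JSON Lines so runs can be diffed and filtered later."""
    for row in rows:
        stream.write(json.dumps(row, ensure_ascii=False) + "\n")


def rows_with_tag(lines, tag):
    """Return stored rows carrying a given failure tag."""
    rows = (json.loads(line) for line in lines if line.strip())
    return [row for row in rows if tag in row["failure_tags"]]


buffer = io.StringIO()   # stands in for a results file
sample = {"input": "What is the refund policy?", "metadata": {"type": "faq"}}
write_rows(buffer, [
    build_row("run-001", sample, "Refunds are never possible.", 0.0, ["hallucination"]),
    build_row("run-001", sample, "30-day refund window.", 1.0),
])
print(len(rows_with_tag(buffer.getvalue().splitlines(), "hallucination")))   # 1
```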




&lt;h2&gt;
  
  
  Step 5: Track Changes Over Time
&lt;/h2&gt;

&lt;p&gt;An evaluation pipeline is not a one-time exercise.&lt;/p&gt;

&lt;p&gt;You should be able to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did the latest change improve performance?&lt;/li&gt;
&lt;li&gt;Did hallucination rates increase?&lt;/li&gt;
&lt;li&gt;Did a prompt tweak break edge cases?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Track metrics like:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;li&gt;Hallucination rate&lt;/li&gt;
&lt;li&gt;Task success rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Version your datasets and compare results across runs.&lt;/p&gt;
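
&lt;p&gt;Comparing two runs can start as a per-metric delta with a direction attached to each metric. The metric values below are made up for illustration, and the &lt;code&gt;tolerance&lt;/code&gt; parameter is an assumption added to absorb small run-to-run noise.&lt;/p&gt;

```python
def compare_runs(baseline, candidate, higher_is_better, tolerance=0.0):
    """Per-metric delta between two runs, flagged when it moves the wrong way."""
    report = {}
    for metric, better_when_higher in higher_is_better.items():
        delta = candidate[metric] - baseline[metric]
        if better_when_higher:
            regressed = -delta > tolerance
        else:
            regressed = delta > tolerance
        report[metric] = {"delta": round(delta, 4), "regressed": regressed}
    return report


# Hypothetical metrics from two runs against the same dataset version.
baseline = {"accuracy": 0.82, "hallucination_rate": 0.06, "task_success_rate": 0.75}
candidate = {"accuracy": 0.85, "hallucination_rate": 0.09, "task_success_rate": 0.75}
directions = {"accuracy": True, "hallucination_rate": False, "task_success_rate": True}

for metric, result in compare_runs(baseline, candidate, directions).items():
    print(metric, result)
```

&lt;p&gt;In this example accuracy improves while the hallucination rate regresses, which is exactly the kind of trade-off a single aggregate score would hide.&lt;/p&gt;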




&lt;h2&gt;
  
  
  Step 6: Integrate with CI/CD
&lt;/h2&gt;

&lt;p&gt;This is where evaluation becomes part of engineering discipline.&lt;/p&gt;

&lt;p&gt;Run evaluations when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompts change&lt;/li&gt;
&lt;li&gt;Models are updated&lt;/li&gt;
&lt;li&gt;Retrieval logic is modified&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example workflow:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code Change → Run Evals → Compare Metrics → Pass/Fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can define thresholds like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fail if accuracy drops below X%&lt;/li&gt;
&lt;li&gt;Fail if the hallucination rate increases relative to the baseline run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents silent regressions.&lt;/p&gt;
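
&lt;p&gt;Thresholds like these can be expressed as a small gate function. This is a sketch: the 0.80 accuracy floor and the metric names are placeholders, and wiring the result into a specific CI system is left out.&lt;/p&gt;

```python
def check_thresholds(metrics, baseline, min_accuracy=0.80):
    """Return human-readable reasons the run should fail (empty list means pass)."""
    failures = []
    if min_accuracy > metrics["accuracy"]:
        failures.append(
            f"accuracy {metrics['accuracy']:.2f} is below the floor {min_accuracy:.2f}"
        )
    if metrics["hallucination_rate"] > baseline["hallucination_rate"]:
        failures.append(
            f"hallucination rate rose from {baseline['hallucination_rate']:.2f} "
            f"to {metrics['hallucination_rate']:.2f}"
        )
    return failures


def exit_code(failures):
    """0 lets the CI job pass, 1 fails it; a CI script would pass this to sys.exit."""
    return 1 if failures else 0


# Hypothetical metrics for the previous release and the current change.
baseline = {"accuracy": 0.84, "hallucination_rate": 0.05}
current = {"accuracy": 0.78, "hallucination_rate": 0.07}

failures = check_thresholds(current, baseline)
for reason in failures:
    print("FAIL:", reason)
print("exit code:", exit_code(failures))   # exit code: 1
```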




&lt;h2&gt;
  
  
  End-to-End Flow
&lt;/h2&gt;

&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dataset
   ↓
Run System
   ↓
Evaluate Outputs
   ↓
Store Results
   ↓
Compare with Previous Runs
   ↓
Trigger Alerts / Decisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is your &lt;strong&gt;AI quality control loop&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Example
&lt;/h2&gt;

&lt;p&gt;Let’s say you’re testing a support chatbot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before pipeline:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manual testing&lt;/li&gt;
&lt;li&gt;Inconsistent results&lt;/li&gt;
&lt;li&gt;Hard to track improvements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  After pipeline:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A dataset of ~200 real queries&lt;/li&gt;
&lt;li&gt;Automated evaluation on every update&lt;/li&gt;
&lt;li&gt;Clear metrics (correctness, grounding)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Outcome:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster iteration&lt;/li&gt;
&lt;li&gt;Reduced hallucinations&lt;/li&gt;
&lt;li&gt;Better confidence in releases&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;Even with a pipeline, teams run into issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overfitting to the evaluation dataset&lt;/li&gt;
&lt;li&gt;Blind trust in LLM-as-a-judge&lt;/li&gt;
&lt;li&gt;Not updating datasets with real usage&lt;/li&gt;
&lt;li&gt;Lack of dataset versioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid treating evals as static—they should evolve with your system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part of this series, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluating RAG systems (retrieval + generation)&lt;/li&gt;
&lt;li&gt;Measuring context relevance and faithfulness&lt;/li&gt;
&lt;li&gt;Common failure patterns in retrieval pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI systems don’t fail loudly—they drift.&lt;/p&gt;

&lt;p&gt;An evaluation pipeline gives you a way to &lt;strong&gt;detect, measure, and control that drift&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s not just about testing once.&lt;br&gt;
It’s about building a system that continuously tells you:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Is my AI still working as expected?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Testing AI Systems in Production: From LLM Evals to Agent Reliability</title>
      <dc:creator>Abhi Chatterjee</dc:creator>
      <pubDate>Tue, 21 Apr 2026 16:34:27 +0000</pubDate>
      <link>https://forem.com/abhi_chatterjee_979801/testing-ai-systems-in-production-from-llm-evals-to-agent-reliability-4do5</link>
      <guid>https://forem.com/abhi_chatterjee_979801/testing-ai-systems-in-production-from-llm-evals-to-agent-reliability-4do5</guid>
      <description>&lt;h1&gt;
  
  
  Testing AI Systems in Production: From LLM Evals to Agent Reliability
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Practical strategies to evaluate LLMs, RAG pipelines, and AI agents in real-world systems&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Most AI systems don’t fail in development — they fail quietly in production.&lt;/p&gt;

&lt;p&gt;Not with crashes, but with subtle errors: hallucinations, incorrect tool usage, or inconsistent outputs that slip past traditional tests.&lt;/p&gt;

&lt;p&gt;The root problem is simple: we are still trying to test probabilistic systems using deterministic testing strategies.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is Part 1 of a series on testing AI systems in production.&lt;/strong&gt;&lt;br&gt;
In this post, we’ll build a practical mental model and testing strategy.&lt;br&gt;
In upcoming parts, I’ll go deeper into evaluation pipelines, RAG testing, and agent-level reliability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Traditional Testing Breaks for AI
&lt;/h2&gt;

&lt;p&gt;In traditional software, a given input maps to a predictable output.&lt;/p&gt;

&lt;p&gt;That assumption breaks with AI systems.&lt;/p&gt;

&lt;p&gt;Key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outputs are &lt;strong&gt;non-deterministic&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Correctness is often &lt;strong&gt;subjective&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Ground truth is &lt;strong&gt;hard to define&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Behavior can shift with &lt;strong&gt;small prompt changes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means unit tests alone are not enough. You need layered evaluation strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The AI Testing Stack (A Practical Mental Model)
&lt;/h2&gt;

&lt;p&gt;Think of AI testing as a stack rather than a single technique:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------------------------------------+
| Agent / Workflow Testing (multi-step reasoning)   |
+--------------------------------------------------+
| System Testing (RAG, tools, memory)              |
+--------------------------------------------------+
| Prompt Testing (instructions, few-shot behavior) |
+--------------------------------------------------+
| Model Evaluation (benchmarks, accuracy)          |
+--------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer introduces different failure modes — and requires different testing approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Model-Level Evaluation
&lt;/h2&gt;

&lt;p&gt;This is the foundation: evaluating raw model capability.&lt;/p&gt;

&lt;p&gt;Typical techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark datasets (task-specific)&lt;/li&gt;
&lt;li&gt;Accuracy, precision/recall (structured outputs)&lt;/li&gt;
&lt;li&gt;BLEU / ROUGE (n-gram overlap with reference text)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But strong benchmark performance does &lt;strong&gt;not&lt;/strong&gt; guarantee real-world reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
A model performing well on QA benchmarks may still hallucinate on domain-specific queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; Model evals are necessary, but insufficient.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Prompt-Level Testing
&lt;/h2&gt;

&lt;p&gt;Prompts are effectively your “programming layer” — and they are fragile.&lt;/p&gt;

&lt;p&gt;What to test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency across paraphrased inputs&lt;/li&gt;
&lt;li&gt;Sensitivity to prompt changes&lt;/li&gt;
&lt;li&gt;Instruction adherence&lt;/li&gt;
&lt;li&gt;Edge cases and adversarial phrasing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example test case:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "Summarize this document in 3 bullet points"
Variation: "Give me a short summary in bullets"
Expected: Similar structure and quality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Small wording changes shouldn’t break behavior — but often do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintain a &lt;strong&gt;golden dataset&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run regression tests when prompts change&lt;/li&gt;
&lt;/ul&gt;
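
&lt;p&gt;A paraphrase regression check can be sketched as: run each wording through the model and assert a structural property of the output. &lt;code&gt;fake_model&lt;/code&gt; below is a hypothetical stand-in for a real LLM call and is brittle on purpose, so the check has something to catch.&lt;/p&gt;

```python
BULLET_MARKERS = ("-", "*", "•")


def bullet_count(text):
    """Number of lines that start with a bullet marker."""
    return sum(1 for line in text.splitlines() if line.strip().startswith(BULLET_MARKERS))


def failing_variants(model, prompts, check):
    """Return the prompt variants whose output does not satisfy the check."""
    return [prompt for prompt in prompts if not check(model(prompt))]


def fake_model(prompt):
    # Hypothetical stand-in for a real LLM call; brittle on purpose.
    if "bullet points" in prompt:
        return "- point one\n- point two\n- point three"
    return "Here is a short summary of the document."


def uses_bullets(output):
    """Structural expectation shared by both phrasings."""
    return bullet_count(output) >= 1


variants = [
    "Summarize this document in 3 bullet points",
    "Give me a short summary in bullets",
]
print(failing_variants(fake_model, variants, uses_bullets))
# ['Give me a short summary in bullets']
```

&lt;p&gt;With a real model the outputs vary between calls, so it is usually worth running each variant several times and tracking a pass rate rather than a single pass or fail.&lt;/p&gt;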




&lt;h2&gt;
  
  
  3. System-Level Testing (RAG, Tools, Pipelines)
&lt;/h2&gt;

&lt;p&gt;Once you introduce retrieval or external tools, complexity increases.&lt;/p&gt;

&lt;p&gt;Typical components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval (vector DB / search)&lt;/li&gt;
&lt;li&gt;Context construction&lt;/li&gt;
&lt;li&gt;Tool/API calls&lt;/li&gt;
&lt;li&gt;Output formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common failure modes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Irrelevant retrieval results&lt;/li&gt;
&lt;li&gt;Missing critical context&lt;/li&gt;
&lt;li&gt;Incorrect tool selection&lt;/li&gt;
&lt;li&gt;Hallucinated answers despite available data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example RAG flow:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
    ↓
Retriever → Context
    ↓
LLM → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What to evaluate:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context relevance&lt;/strong&gt; — Did we fetch the right data?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness&lt;/strong&gt; — Did the model use the context?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Answer correctness&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Agent-Level Testing (Where Things Get Hard)
&lt;/h2&gt;

&lt;p&gt;Agents introduce multi-step reasoning, planning, and state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example loop:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Goal
   ↓
Plan → Tool Call → Observe → Repeat
   ↓
Final Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common failures:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Infinite loops&lt;/li&gt;
&lt;li&gt;Wrong tool usage&lt;/li&gt;
&lt;li&gt;Partial task completion&lt;/li&gt;
&lt;li&gt;Confident but incorrect outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to test agents:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Scenario-based testing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define end-to-end tasks&lt;/li&gt;
&lt;li&gt;Measure success rate and correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Simulation environments&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mock tools and external dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Trace inspection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log actions, inputs, outputs&lt;/li&gt;
&lt;li&gt;Analyze decision paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is essential for debugging complex failures.&lt;/p&gt;
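
&lt;p&gt;Two of the failures above, infinite loops and partial completion, are cheap to detect from a trace. The trace format below (a list of steps with &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;tool&lt;/code&gt;, and &lt;code&gt;args&lt;/code&gt; keys) is an assumption; real agent frameworks each have their own schema.&lt;/p&gt;

```python
import json
from collections import Counter


def repeated_tool_calls(trace, max_repeats=2):
    """Tool calls issued more than max_repeats times with identical arguments."""
    counts = Counter(
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
        if step["type"] == "tool_call"
    )
    return [call for call, n in counts.items() if n > max_repeats]


def summarize_trace(trace, max_steps=20):
    """Flag step-budget overruns, likely loops, and runs that never finished."""
    return {
        "steps": len(trace),
        "over_budget": len(trace) > max_steps,
        "repeated_calls": repeated_tool_calls(trace),
        "finished": bool(trace) and trace[-1]["type"] == "final_answer",
    }


# Hypothetical trace: the agent repeats the same search and never answers.
stuck_step = {"type": "tool_call", "tool": "search", "args": {"q": "refund policy"}}
trace = [dict(stuck_step) for _ in range(3)]

summary = summarize_trace(trace)
print(summary["repeated_calls"])   # [('search', '{"q": "refund policy"}')]
print(summary["finished"])         # False
```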




&lt;h2&gt;
  
  
  Core Testing Techniques That Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Golden Datasets
&lt;/h3&gt;

&lt;p&gt;Curate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real user queries&lt;/li&gt;
&lt;li&gt;Edge cases&lt;/li&gt;
&lt;li&gt;Known failure scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This becomes your most valuable testing asset.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. LLM-as-a-Judge
&lt;/h3&gt;

&lt;p&gt;Use a model to evaluate outputs.&lt;/p&gt;

&lt;p&gt;Example judge prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Is this answer correct and grounded in the context?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalable&lt;/li&gt;
&lt;li&gt;Flexible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be biased&lt;/li&gt;
&lt;li&gt;Requires validation&lt;/li&gt;
&lt;/ul&gt;
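
&lt;p&gt;The validation step can be made concrete by measuring how often the judge agrees with human labels on a small sample. In this sketch the prompt template, the YES/NO reply format, and the example verdicts are assumptions; the call to the judge model itself is omitted.&lt;/p&gt;

```python
JUDGE_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n\n"
    "Is this answer correct and grounded in the context? Reply YES or NO."
)


def build_judge_prompt(question, context, answer):
    """Fill the template that would be sent to the judge model."""
    return JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)


def parse_verdict(raw):
    """Map the judge's reply to True/False, or None when it is not parseable."""
    words = raw.strip().split()
    first_word = words[0].strip(".,:").upper() if words else ""
    return {"YES": True, "NO": False}.get(first_word)


def agreement_rate(judge_verdicts, human_labels):
    """Share of parseable judge verdicts that match the human label."""
    pairs = [(j, h) for j, h in zip(judge_verdicts, human_labels) if j is not None]
    if not pairs:
        return 0.0
    return sum(1 for j, h in pairs if j == h) / len(pairs)


# Hypothetical raw judge replies and human labels for the same four answers.
raw_replies = ["Yes, it is grounded.", "NO", "yes", "I am not sure"]
human_labels = [True, True, True, False]

verdicts = [parse_verdict(reply) for reply in raw_replies]
print(verdicts)                                 # [True, False, True, None]
print(agreement_rate(verdicts, human_labels))   # 0.6666666666666666
```

&lt;p&gt;Unparseable replies are excluded from the agreement rate here, so it is worth tracking their count separately; a judge that often fails to answer in the expected format is a problem in its own right.&lt;/p&gt;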




&lt;h3&gt;
  
  
  3. Regression Testing
&lt;/h3&gt;

&lt;p&gt;Every change should trigger evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt updates&lt;/li&gt;
&lt;li&gt;Model changes&lt;/li&gt;
&lt;li&gt;Retrieval modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accuracy&lt;/li&gt;
&lt;li&gt;Hallucination rate&lt;/li&gt;
&lt;li&gt;Task success&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Red Teaming
&lt;/h3&gt;

&lt;p&gt;Actively try to break the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt injection&lt;/li&gt;
&lt;li&gt;Jailbreak attempts&lt;/li&gt;
&lt;li&gt;Malicious inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Critical for production readiness.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Practical Testing Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Define Metrics
     ↓
Build Eval Dataset
     ↓
Run Automated Evals
     ↓
Analyze Failures
     ↓
Fix (Prompt / System / Model)
     ↓
Repeat (CI/CD Integration)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  In practice:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Version control your eval datasets&lt;/li&gt;
&lt;li&gt;Automate evaluations in CI/CD&lt;/li&gt;
&lt;li&gt;Track performance over time&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Example: Support Chatbot
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario:
&lt;/h3&gt;

&lt;p&gt;A chatbot answering queries from a knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinated responses&lt;/li&gt;
&lt;li&gt;Ignoring retrieved context&lt;/li&gt;
&lt;li&gt;Inconsistent tone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built an evaluation dataset (~200 real queries)&lt;/li&gt;
&lt;li&gt;Added evaluation metrics (correctness, grounding)&lt;/li&gt;
&lt;li&gt;Introduced regression testing&lt;/li&gt;
&lt;li&gt;Added adversarial test cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced hallucinations&lt;/li&gt;
&lt;li&gt;Improved consistency&lt;/li&gt;
&lt;li&gt;Faster iteration&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Challenges (That Don’t Go Away)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Non-determinism&lt;/li&gt;
&lt;li&gt;Expensive evaluations&lt;/li&gt;
&lt;li&gt;Limited ground truth&lt;/li&gt;
&lt;li&gt;Continuous model drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t perfection — it’s &lt;strong&gt;controlled reliability&lt;/strong&gt;.&lt;/p&gt;
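
&lt;p&gt;For non-determinism in particular, one way to make reliability measurable is to run the same check several times and report a pass rate, not a single pass/fail result. The sketch below is an assumption-laden illustration: &lt;code&gt;flaky_model&lt;/code&gt; is a toy stand-in for a sampled model call, and the run count of 20 is arbitrary.&lt;/p&gt;

```python
import random
from typing import Callable


def pass_rate(model_fn: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], n_runs: int = 20) -> float:
    """Fraction of `n_runs` sampled responses that satisfy `check`."""
    passes = sum(1 for _ in range(n_runs) if check(model_fn(prompt)))
    return passes / n_runs


rng = random.Random(0)  # seeded so the toy example is repeatable


def flaky_model(prompt: str) -> str:
    # Stand-in for a sampled model call: gives the right answer about 80% of the time.
    return "Paris" if rng.random() >= 0.2 else "Lyon"


rate = pass_rate(flaky_model, "What is the capital of France?", lambda out: out == "Paris")
print(f"pass rate: {rate:.2f}")
```

&lt;p&gt;A team can then set an explicit threshold per test case (for example, a minimum pass rate) and treat a drop below it as a failure.&lt;/p&gt;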




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next parts of this series, I’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building automated evaluation pipelines&lt;/li&gt;
&lt;li&gt;Testing RAG systems (metrics + pitfalls)&lt;/li&gt;
&lt;li&gt;Agent evaluation and tracing strategies&lt;/li&gt;
&lt;li&gt;Tooling and implementation patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI testing is not a single technique — it’s a discipline.&lt;/p&gt;

&lt;p&gt;The teams that succeed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test at multiple layers&lt;/li&gt;
&lt;li&gt;Build strong evaluation datasets&lt;/li&gt;
&lt;li&gt;Automate aggressively&lt;/li&gt;
&lt;li&gt;Continuously learn from failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in AI systems, what you don’t test is exactly where things break.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>softwaretesting</category>
    </item>
  </channel>
</rss>
