<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Md Ayan Arshad</title>
    <description>The latest articles on Forem by Md Ayan Arshad (@ayanarshad02).</description>
    <link>https://forem.com/ayanarshad02</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817591%2F938ae2bc-4d9e-45f1-98fb-e35605f9196e.png</url>
      <title>Forem: Md Ayan Arshad</title>
      <link>https://forem.com/ayanarshad02</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ayanarshad02"/>
    <language>en</language>
    <item>
      <title>5 Critical Failures We Hit Shipping a Multi-Tenant RAG Chatbot to 500+ Enterprises</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Sat, 04 Apr 2026 06:26:56 +0000</pubDate>
      <link>https://forem.com/ayanarshad02/we-shipped-a-rag-chatbot-to-500-enterprise-tenants-heres-what-actually-broke-first-1jia</link>
      <guid>https://forem.com/ayanarshad02/we-shipped-a-rag-chatbot-to-500-enterprise-tenants-heres-what-actually-broke-first-1jia</guid>
      <description>&lt;p&gt;Our first enterprise tenant onboarded on a Monday.&lt;/p&gt;

&lt;p&gt;By Wednesday, 30% of their documents had been silently indexed as empty strings. No error. No exception. The chatbot just said "I don't have enough information", confidently, every time.&lt;/p&gt;

&lt;p&gt;That was Failure #1. There were four more.&lt;/p&gt;

&lt;p&gt;Here's the honest account of shipping a multi-tenant RAG chatbot to 500+ enterprise clients — what broke, in what order, and what we should have caught earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsulg56rhevv4oloo38vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsulg56rhevv4oloo38vj.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The System We Built
&lt;/h2&gt;

&lt;p&gt;Before the failures, the context.&lt;/p&gt;

&lt;p&gt;We built a RAG chatbot for enterprise warehouse management. Each tenant had their own isolated knowledge base — SOPs, compliance documents, operational guides. Users queried only their tenant's data. Scale target: ~25,000 queries per day at full rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Indexing pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Document Upload → Type Detection → Preprocessing → Chunking → Embedding → Pinecone&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query pipeline:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;User Query → Cache Check → Query Rewrite → Hybrid Search (BM25 + Vector) → RRF Fusion → Reranker → LLM → Response&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Two pipelines in the design. One EC2 fleet in reality, which became Failure #4.&lt;/p&gt;

&lt;p&gt;Indexing consumed from SQS. Query API sat behind an ALB. One Pinecone namespace per tenant, every query scoped to the authenticated tenant's namespace before touching the vector DB.&lt;/p&gt;
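&lt;p&gt;The RRF fusion stage is worth a closer look, since it's what lets BM25 scores and cosine similarities coexist. A minimal sketch of Reciprocal Rank Fusion (the function name and the k=60 default are illustrative, not our exact implementation):&lt;/p&gt;

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each input list ranks doc ids best-first.
    The constant k damps the influence of lower-ranked hits."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Because RRF only looks at ranks, a document near the top of either ranking rises to the top of the fused list even though the two retrievers score on incompatible scales.&lt;/p&gt;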

&lt;p&gt;The architecture decisions were mostly right.&lt;/p&gt;

&lt;p&gt;What broke was the assumptions underneath them.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #1: The PDF Preprocessing Assumption (Week 1)
&lt;/h2&gt;

&lt;p&gt;We assumed all enterprise documents were text-based PDFs.&lt;/p&gt;

&lt;p&gt;They weren't.&lt;/p&gt;

&lt;p&gt;About 30% of what tenants uploaded were scanned PDFs, images of physical pages, no text layer. When PyMuPDF opened these files, it returned empty strings. We embedded empty strings. We indexed empty chunks. No error. No exception. Just silent failure.&lt;/p&gt;

&lt;p&gt;Users asked questions. Retrieval returned nothing relevant. The LLM said "I don't have enough information." Users assumed the chatbot was broken. They were right, just not for the reason they thought.&lt;/p&gt;

&lt;p&gt;The fix: a preprocessing gate that checks average characters per page. If avg_chars_per_page &amp;lt; 100, the file almost certainly has no text layer, so we trigger OCR via AWS Textract before chunking. We also added an admin-facing flag marking documents as "pending OCR" so tenants know their document is processing, not lost.&lt;/p&gt;
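&lt;p&gt;The gate itself is a few lines. A pure-Python sketch of the heuristic (in production the per-page texts come from PyMuPDF; the 100-character threshold is the one described above):&lt;/p&gt;

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 100) -> bool:
    """Heuristic text-layer check: if the average extracted characters
    per page falls below the threshold, the PDF is almost certainly a
    scan and should be routed to OCR before chunking."""
    if not page_texts:
        return True  # nothing extracted at all is the clearest red flag
    avg_chars = sum(len(text) for text in page_texts) / len(page_texts)
    return avg_chars < min_chars_per_page
```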

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Never assume your input format. Garbage input produces zero output in RAG. Preprocessing is the most boring part of the pipeline and the most catastrophic to skip.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #2: Headers, Footers, and the Chunk Contamination Problem
&lt;/h2&gt;

&lt;p&gt;Even for text-based PDFs, every chunk was contaminated.&lt;/p&gt;

&lt;p&gt;Enterprise documents have headers and footers on every page. "Softeon WMS User Guide — Confidential — Page 14 of 203." When you chunk a 200-page document into 512-token pieces, that text bleeds into hundreds of chunks.&lt;/p&gt;

&lt;p&gt;The retrieval impact was subtle but real. Queries about "confidential" topics surfaced chunks with "Confidential" in the footer, not because the content was relevant, but because BM25 was matching on that exact term. Relevance scores were quietly polluted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; A stripping step before chunking. Text appearing in the top 5% and bottom 5% of every page gets flagged and removed. We also converted tables to markdown before chunking: a raw table extracted as "Product Price Refund Laptop 999 30 days" is useless for retrieval, while the same table as structured markdown is self-contained and semantically meaningful.&lt;/p&gt;
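&lt;p&gt;The positional strip can be sketched as a filter over text blocks and their page coordinates (block tuples here are simplified to (y0, y1, text); PyMuPDF's block output carries more fields):&lt;/p&gt;

```python
def strip_margins(blocks, page_height, margin_frac=0.05):
    """Drop text blocks that sit entirely inside the top or bottom 5%
    of the page, where running headers and footers live.
    Each block is (y0, y1, text) with y measured from the top."""
    top_cutoff = page_height * margin_frac
    bottom_cutoff = page_height * (1.0 - margin_frac)
    return [text for (y0, y1, text) in blocks
            if y1 > top_cutoff and y0 < bottom_cutoff]
```

&lt;p&gt;A block survives only if some part of it extends into the body of the page, so a footer sitting fully below the 95% line disappears from every chunk at once.&lt;/p&gt;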

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Most RAG tutorials skip directly to chunking size debates — 256 vs 512 tokens. They assume clean input. Real enterprise documents are not clean.&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure #3: The Parallel Pipeline Was Actually Sequential
&lt;/h2&gt;

&lt;p&gt;We ran BM25 and vector search in what we thought was parallel.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;

&lt;p&gt;The original implementation called BM25, waited for the result, then called Pinecone. "Parallel" on the architecture diagram. Sequential in the code. At p50 this cost us ~200ms we couldn't afford.&lt;/p&gt;

&lt;p&gt;The fix is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong — sequential
&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;pinecone_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Right — parallel
&lt;/span&gt;&lt;span class="n"&gt;bm25_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;bm25_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;pinecone_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Latency becomes the max of the two, not the sum. Dropped our p95 by ~180ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; "Parallel" on a diagram and "parallel" in code are different things. Profile your pipeline stage by stage. The bottleneck is always somewhere surprising.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #4: One Tenant's Upload Degraded Everyone's Query Latency
&lt;/h2&gt;

&lt;p&gt;This one took a week to diagnose.&lt;/p&gt;

&lt;p&gt;We noticed periodic p99 spikes, not consistent, not tied to query volume. Random, unpredictable.&lt;/p&gt;

&lt;p&gt;The cause: our indexing pipeline and query pipeline were on the same EC2 instances.&lt;/p&gt;

&lt;p&gt;When a large tenant uploaded 500 documents, the embedding loop hammered the instance CPU. Live users querying on the same instance saw response time jump from 800ms to 6+ seconds. The indexer and query service were invisible to each other in code — but very visible to each other on the metal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Complete infrastructure separation. Indexing workers on a dedicated EC2 fleet, completely outside the ALB. The query fleet has no knowledge that indexing is happening. A document upload spike now has zero effect on query latency for any tenant. The SQS queue buffers upload bursts and feeds indexing workers at a controlled pace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Load isolation is not just an architectural principle. It's a user experience decision. Enterprise tenants don't care about your architecture; they care that the chatbot was slow when they needed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure #5: The Namespace Isolation Gap We Almost Missed
&lt;/h2&gt;

&lt;p&gt;Multi-tenant isolation in Pinecone is handled by namespaces. One namespace per tenant. Every write tags it. Every read is scoped to it.&lt;/p&gt;

&lt;p&gt;What we almost shipped: namespace scoped at the request body level.&lt;/p&gt;

&lt;p&gt;A bad actor passing a forged tenant_id in the request body could scope the query to a different tenant's namespace. Subtle. Critical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong — trusting request body
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;

&lt;span class="c1"&gt;# Right — trusting validated token only
&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tenant_id&lt;/span&gt;  &lt;span class="c1"&gt;# resolved from JWT at API layer
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Namespace resolved exclusively from the validated JWT token at the API layer. The request body's tenant_id is ignored entirely. By the time a request reaches the vector DB call, the namespace has already been locked to the authenticated tenant — it cannot be overridden.&lt;/p&gt;

&lt;p&gt;If we had shipped the original version, any authenticated user who knew another tenant's ID could have queried their private documents. In a WMS context serving enterprise clients, that's not a security incident, that's a contract termination and a legal conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; Namespace isolation is not the same as security. Enforce tenant identity at the authentication layer, not in application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Still Haven't Built
&lt;/h2&gt;

&lt;p&gt;We don't have automated RAG evaluation in production.&lt;/p&gt;

&lt;p&gt;No RAGAS running continuously. No Precision@5 after every deployment. Human review by an internal QA team, representative queries, manual quality ratings. It works at current scale. It won't at full rollout.&lt;/p&gt;

&lt;p&gt;What I'd build next with two weeks:&lt;br&gt;
→ A golden evaluation set with 200 curated question-to-chunk pairs from real tenant queries. Your retrieval quality baseline.&lt;br&gt;
→ RAGAS Faithfulness in CI/CD: runs on every deployment and blocks release if faithfulness drops more than 5% from baseline.&lt;br&gt;
→ Context Precision tracking: tells you whether your reranker is actually earning its latency cost.&lt;/p&gt;
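&lt;p&gt;The release gate in the second item reduces to one comparison. A sketch, assuming faithfulness scores are already computed against the golden set (names and the relative-drop convention are illustrative):&lt;/p&gt;

```python
def faithfulness_gate(current: float, baseline: float,
                      max_rel_drop: float = 0.05) -> bool:
    """Return True when the release may proceed: current RAGAS
    faithfulness has not dropped more than 5% below baseline."""
    return current >= baseline * (1.0 - max_rel_drop)
```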

&lt;h2&gt;
  
  
  The One Thing That Mattered Most
&lt;/h2&gt;

&lt;p&gt;RAG systems fail at the edges of the pipeline, not the center.&lt;/p&gt;

&lt;p&gt;Most engineering effort goes into the center: embedding models, reranking algorithms, chunk sizes. The real production failures happen at the edges: what goes into the indexer, what happens when two workloads compete for the same compute, and where tenant identity gets resolved.&lt;/p&gt;

&lt;p&gt;What broke first in your RAG pipeline? Drop it in the comments. The failures nobody writes about are always the most useful. I'll compile the best ones into a follow-up post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>genai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Why Our RAG System Was Silently Returning Wrong Answers — And How We Fixed It</title>
      <dc:creator>Md Ayan Arshad</dc:creator>
      <pubDate>Wed, 11 Mar 2026 00:19:49 +0000</pubDate>
      <link>https://forem.com/ayanarshad02/why-our-rag-system-was-silently-returning-wrong-answers-and-how-we-fixed-it-386g</link>
      <guid>https://forem.com/ayanarshad02/why-our-rag-system-was-silently-returning-wrong-answers-and-how-we-fixed-it-386g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cgbkpcvxtjvlb2fylie.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cgbkpcvxtjvlb2fylie.webp" alt="Our RAG System Was Confidently Wrong"&gt;&lt;/a&gt;&lt;br&gt;
For 3 days, our RAG system was confident.&lt;/p&gt;

&lt;p&gt;Every query returned an answer. Response times were stable. No errors in the logs. By every operational metric, the system was working.&lt;/p&gt;

&lt;p&gt;Our RAGAS faithfulness score told a different story.&lt;/p&gt;

&lt;p&gt;It had dropped from 0.91 to 0.67 without a single code change.&lt;/p&gt;

&lt;p&gt;That meant roughly 1 in 3 responses was making claims our own retrieved context didn’t support. The system wasn’t crashing. It was hallucinating — silently, at scale, with complete confidence.&lt;/p&gt;

&lt;p&gt;Here is what happened.&lt;/p&gt;
&lt;h2&gt;
  
  
  The System When It Started Failing
&lt;/h2&gt;

&lt;p&gt;We were running a production RAG system serving enterprise clients across a large document corpus. Each client had their own isolated set of documents — product configuration files, setup guides, operational workflows — queried daily to answer operational questions.&lt;/p&gt;

&lt;p&gt;The state of the system when the drift began:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;25K+ documents across corpus&lt;/li&gt;
&lt;li&gt;500+ enterprise tenants&lt;/li&gt;
&lt;li&gt;1 Pinecone namespace per tenant&lt;/li&gt;
&lt;li&gt;5 chunks retrieved per query (top-K)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt; GPT-4 for generation, text-embedding-ada-002 for embeddings, Pinecone with one namespace per tenant, FastAPI on ECS. Isolation was strict — no cross-tenant reads, ever.&lt;/p&gt;

&lt;p&gt;A note on the namespace decision: Pinecone namespaces share the same index and billing unit; 500 namespaces cost the same as one. We chose namespaces over metadata filtering (a tenant_id filter on a single index) for one specific reason: metadata filtering requires every query to carry the correct filter, and one bug means Tenant A can read Tenant B's data. For enterprise clients, that risk surface isn't acceptable. Namespaces make cross-tenant leakage structurally impossible at query time.&lt;/p&gt;

&lt;p&gt;Namespaces give us defense-in-depth isolation at the infrastructure layer rather than relying on application-level filtering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; API latency, error rates, cost dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer quality monitoring:&lt;/strong&gt; None.&lt;/p&gt;

&lt;p&gt;That was the bug. Not in the code. In the architecture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Changed, And Why We Didn’t See It
&lt;/h2&gt;

&lt;p&gt;Three days before we caught the drift, a large batch of new documents was ingested. Different document type than the existing corpus — denser, longer sentences, more domain-specific terminology. Same domain, different structure.&lt;/p&gt;

&lt;p&gt;The new documents changed the distribution of vectors in our Pinecone namespaces. Queries that had previously retrieved highly relevant chunks now retrieved chunks that were topically related but not directly answering the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cosine similarity scores:&lt;/strong&gt; 0.76, 0.79, 0.81. High enough to clear any threshold we’d set. &lt;code&gt;text-embedding-ada-002&lt;/code&gt; couldn't distinguish between "this chunk discusses this topic" and "this chunk contains the specific answer this query is asking for." Retrieval looked confident. The chunks were wrong.&lt;/p&gt;

&lt;p&gt;GPT-4 did what LLMs do when context is adjacent-but-imprecise: it filled the gaps with plausible-sounding claims not present in the retrieved text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAGAS faithfulness:&lt;/strong&gt; 0.91 → 0.67.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Precision:&lt;/strong&gt; 0.84 → 0.61.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We had no alert for either. We found out from a user.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core failure:&lt;/strong&gt; we instrumented everything easy to measure — latency, throughput, cost, error rates — and nothing that measured correctness. We were flying blind on the one metric that determined whether the system was actually useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One clarification on how we were running RAGAS.&lt;/strong&gt; We were not evaluating live traffic — that would be prohibitively expensive and slow. We maintained a golden evaluation set of ~300 representative queries with known-good answers and source chunks, curated when the system first launched. RAGAS ran against that set nightly, and on every ingestion event. The drop from 0.91 to 0.67 showed up the morning after the batch ingestion. We just had no alert configured to catch it.&lt;/p&gt;
&lt;h2&gt;
  
  
  What We Tried First — And Why It Wasn’t Enough
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First instinct:&lt;/strong&gt; improve retrieval.&lt;/p&gt;

&lt;p&gt;We raised the similarity threshold from 0.70 to 0.78, rejecting chunks below that score. Retrieval precision improved. We also started returning no results for legitimate queries with unusual phrasing. Users got empty responses. That was worse.&lt;/p&gt;

&lt;p&gt;We increased top-K from 5 to 10. That helped recall slightly, but it also sent 2× the tokens into every LLM call, compounding a cost problem already building at 500+ active tenants.&lt;/p&gt;

&lt;p&gt;Context Precision recovered to 0.78. Faithfulness only reached 0.81, still below our 0.85 target.&lt;/p&gt;

&lt;p&gt;The retrieval fixes were necessary. They were not sufficient. We needed a layer that caught the gap between what retrieval returned and what the LLM claimed. Retrieval improvement was treating the symptom. We needed to treat the cause.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Fix: Grounding Validation as a First-Class Architecture Layer
&lt;/h2&gt;

&lt;p&gt;We added a grounding validation step that runs after every LLM response, before it’s returned to the user.&lt;/p&gt;

&lt;p&gt;The mechanism:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract the factual claims from the generated response using a structured extraction prompt. This step is imperfect; LLM-based claim extraction can miss or misinterpret implicit claims, so we treat it as a signal, not a definitive verdict.&lt;/li&gt;
&lt;li&gt;Score each claim against the retrieved chunks and classify it as supported, unsupported, or contradicted. A claim is flagged as unsupported if no retrieved chunk scores above a similarity threshold for that claim.&lt;/li&gt;
&lt;li&gt;Flag any response where more than 15% of claims are unsupported or contradicted.&lt;/li&gt;
&lt;li&gt;Regenerate flagged responses with an explicit grounding instruction injected into the prompt.&lt;/li&gt;
&lt;/ol&gt;
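&lt;p&gt;The flagging rule in step 3 is small enough to sketch directly (the label names and the 15% threshold follow the description above; everything else is illustrative):&lt;/p&gt;

```python
def should_regenerate(claim_labels: list[str],
                      max_bad_frac: float = 0.15) -> bool:
    """Flag a response for regeneration when more than 15% of its
    extracted claims were classified unsupported or contradicted."""
    if not claim_labels:
        return False  # nothing extracted, nothing to flag
    bad = sum(label in ("unsupported", "contradicted")
              for label in claim_labels)
    return bad / len(claim_labels) > max_bad_frac
```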

&lt;p&gt;&lt;strong&gt;The regeneration prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Your response must only make claims directly supported by the provided context. If the context does not contain the answer, say so explicitly. Do not infer or extrapolate.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical implementation detail:&lt;/strong&gt; The claim extraction and verification call runs against &lt;code&gt;gpt-4o-mini&lt;/code&gt;, not GPT-4. Running a full GPT-4 call for every response validation would double our inference cost and add 600–800ms of latency. With &lt;code&gt;gpt-4o-mini&lt;/code&gt;, the validation step adds approximately 180–220ms on average for a 3–5 sentence response. That number is model-dependent; it will be higher on a slower model and lower with a fine-tuned classifier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After deploying:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before:  Faithfulness 0.67  |  Context Precision 0.61  |  31% unsupported claim rate
After:   Faithfulness 0.91  |  Context Precision 0.87  |  &amp;lt;4% unsupported claim rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lkl8ekyylllf9dylbt5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lkl8ekyylllf9dylbt5.webp" alt="Production RAG Architecture : 500+ Enterprise Clients"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Decision — And When You’d Make It Differently
&lt;/h2&gt;

&lt;p&gt;We made an explicit trade-off: ~200ms of additional latency in exchange for verifiable answer quality.&lt;/p&gt;

&lt;p&gt;For our use case — enterprise clients making operational decisions based on the chatbot’s answers — that trade-off was not a discussion. The 200ms is noise. The trust cost of a wrong answer at enterprise scale is not.&lt;/p&gt;

&lt;p&gt;But this decision is not universal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Constraint                            Decision          Rationale
──────────────────────────────────────────────────────────────────────────────
Consumer product, SLA &amp;lt; 500ms         Run async          Log failures, don't block.
                                                         Stakes are low, UX matters more.

Low-stakes (drafting, summarisation)  Skip it            User edits the output.
                                                         Grounding matters less.

High-volume, cost-sensitive           Sample 10%         Statistical signal at
                                                         1/10th the overhead.

Enterprise / regulated / high-stakes  Mandatory sync     A wrong answer has real
                                                         downstream consequences.

Multi-tenant, strict isolation        Mandatory + audit  Every response must be
                                                         traceable to a source chunk.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Principle:&lt;/strong&gt; grounding validation is always worth measuring. Whether to block on it synchronously depends on your SLA and the cost of a wrong answer in your domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern: Checking Faithfulness After Generation Is the Wrong Architecture
&lt;/h2&gt;

&lt;p&gt;Here is the deeper architectural mistake this exposed.&lt;/p&gt;

&lt;p&gt;We were checking faithfulness after generation — as a post-hoc audit — rather than as a gating condition on the response pipeline. The audit told us something was wrong. It didn’t stop a wrong answer from being returned.&lt;/p&gt;

&lt;p&gt;The correct architecture treats grounding validation as a blocking step in the response pipeline, not an observability metric reviewed after the fact.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wrong: Generate → Return → [async] Validate → Log failure → Review weekly
Right: Generate → Validate → [if flagged] Regenerate → Return → Log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The async pattern gives you observability. It does not give you correctness. For any system where answer quality has downstream consequences, post-hoc monitoring is not a substitute for inline validation.&lt;/p&gt;
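&lt;p&gt;The blocking shape is simple enough to sketch. Here generate and validate are hypothetical stand-ins for the LLM call and the grounding check; the real pipeline wires in the regeneration prompt described earlier:&lt;/p&gt;

```python
def respond(generate, validate, query, max_retries=1):
    """Generate → Validate → [if flagged] Regenerate → Return.
    The response is never returned without passing through validation;
    a flagged answer gets one retry with the grounding instruction."""
    response = generate(query, grounded=False)
    for _ in range(max_retries):
        if validate(query, response):
            break
        # Regenerate with the explicit grounding instruction injected
        response = generate(query, grounded=True)
    return response
```

&lt;p&gt;If the retry is still flagged you need a fallback: returning "I don't have enough information" beats shipping the unvalidated answer.&lt;/p&gt;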

&lt;p&gt;We caught our failure because a user noticed. That should never be the detection mechanism for a production system serving enterprise clients.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed Permanently
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Quality monitoring is now first-class. RAGAS faithfulness and context precision are scored against our golden evaluation set on every ingestion event and every deployment. Grafana alerts fire if either drops more than 10% from the established baseline. We do not run RAGAS on live traffic; it’s too slow and expensive at scale. The golden set gives us the signal we need.&lt;/li&gt;
&lt;li&gt;Document ingestion now triggers a quality gate. When new documents are ingested, we run the benchmark query set against the updated index before traffic is shifted. Faithfulness drops &amp;gt;5% → ingestion rolled back.&lt;/li&gt;
&lt;li&gt;Grounding validation is synchronous and non-configurable. The ~200ms cost is included in our SLA. Not optional for any enterprise-tier query.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The One-Line Takeaway
&lt;/h2&gt;

&lt;p&gt;Your RAG system will hallucinate. The question is whether you find out before your users do.&lt;/p&gt;

</description>
      <category>genai</category>
      <category>rag</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
