<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ali Ismail</title>
    <description>The latest articles on Forem by Ali Ismail (@ali_ismail).</description>
    <link>https://forem.com/ali_ismail</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3583810%2Fdff6fcf5-add6-427f-b656-49e5573a204f.jpg</url>
      <title>Forem: Ali Ismail</title>
      <link>https://forem.com/ali_ismail</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ali_ismail"/>
    <language>en</language>
    <item>
      <title>Why Your RAG System Hallucinations Start at Ingestion, Not the LLM</title>
      <dc:creator>Ali Ismail</dc:creator>
      <pubDate>Sun, 26 Oct 2025 17:39:31 +0000</pubDate>
      <link>https://forem.com/ali_ismail/why-your-rag-system-hallucinations-start-at-ingestion-not-the-llm-303k</link>
      <guid>https://forem.com/ali_ismail/why-your-rag-system-hallucinations-start-at-ingestion-not-the-llm-303k</guid>
      <description>&lt;p&gt;Most teams are busy optimizing prompts, but the silent bottleneck is poor ingestion&lt;/p&gt;




&lt;p&gt;LLMs are only as good as what you feed them.&lt;br&gt;
Make sure your team isn't feeding them junk.&lt;br&gt;
The ingestion pipeline is unglamorous: it's the data processing layer between your documents and your vector database. Yet it is the biggest factor in whether your AI's answers are accurate, relevant, and cost-efficient. Not your prompts, your retrieval strategy, or your chosen model.&lt;br&gt;
Fix ingestion first, or watch everything downstream fall apart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hidden Problems with Poor Ingestion
&lt;/h2&gt;

&lt;p&gt;Improper ingestion of documents into vector databases leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bad retrieval: irrelevant chunks or missing context&lt;/li&gt;
&lt;li&gt;High costs: thousands of unnecessary embeddings ("chunk explosion")&lt;/li&gt;
&lt;li&gt;Slow queries: poor indexing or overlapping data&lt;/li&gt;
&lt;li&gt;Stale knowledge: no versioning or re-embedding strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When embedding is not optimized, you face major flaws in your AI solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every interaction costs more money than it should&lt;/li&gt;
&lt;li&gt;Hallucination rates go up&lt;/li&gt;
&lt;li&gt;AI quickly turns into an echo chamber&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Classic Ingestion Demo Loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os

from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Walk the docs folder and load every text, markdown, and PDF file.
docs = []
for root, _, files in os.walk(doc_dir):
    for file in files:
        path = os.path.join(root, file)
        if file.endswith((".txt", ".md")):
            docs.extend(TextLoader(path, encoding="utf-8").load())
        elif file.endswith(".pdf"):
            docs.extend(PyPDFLoader(path).load())

# Split into fixed-size chunks with a small overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=50)
splits = splitter.split_documents(docs) if docs else []

# Embed every chunk and append it to the persistent store.
embeddings = HuggingFaceEmbeddings(model_name=settings.embeddings_model)
db = Chroma(persist_directory=chroma_dir, embedding_function=embeddings)
db.add_documents(splits)
db.persist()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code snippet will work in a demo but fall apart in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Code Doesn't Handle
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Deduplication
&lt;/h3&gt;

&lt;p&gt;If you run this code twice, you'll have duplicate embeddings, and newly uploaded versions of a document will compete with older versions during retrieval.&lt;br&gt;
Vector databases (like Chroma, Pinecone, and Qdrant) don't automatically detect duplicates. If you re-run ingestion on the same folder, every file gets re-embedded and appended, creating multiple copies of the same content.&lt;/p&gt;
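&lt;p&gt;A minimal sketch of one common fix: derive a deterministic ID from each chunk's source and content, so re-ingestion upserts instead of appending. Chroma, Pinecone, and Qdrant all accept caller-supplied IDs; the function name here is illustrative, not part of any library.&lt;/p&gt;

```python
import hashlib

def chunk_id(source_path, chunk_text):
    """Stable ID derived from source path and chunk content.

    Re-running ingestion on unchanged files produces the same IDs,
    so an ID-aware upsert replaces old vectors instead of appending
    duplicates.
    """
    digest = hashlib.sha256(f"{source_path}::{chunk_text}".encode("utf-8"))
    return digest.hexdigest()[:32]

chunks = ["LLMs are only as good as what you feed them.", "Fix ingestion first."]
ids = [chunk_id("docs/intro.md", c) for c in chunks]
```

&lt;p&gt;Passing these IDs to your store's add/upsert call makes ingestion idempotent: running the pipeline twice leaves exactly one vector per chunk.&lt;/p&gt;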

&lt;h3&gt;
  
  
  Re-embedding Strategy
&lt;/h3&gt;

&lt;p&gt;Without a tracking system, if your embedding model gets updated or your document content changes, there's no way to know what needs to be re-embedded or why.&lt;br&gt;
Embeddings are model-dependent. If you switch from text-embedding-ada-002 to text-embedding-3-small, or even upgrade your HuggingFace model, old vectors are no longer compatible: different models map text into different vector spaces, so cosine similarities between old and new vectors are meaningless.&lt;/p&gt;
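&lt;p&gt;A sketch of the tracking this implies: stamp every chunk's metadata with the embedding model that produced its vector, and compare on each run. The field name and model string below are assumptions, not a standard.&lt;/p&gt;

```python
CURRENT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # assumed model name

def needs_reembedding(chunk_meta, current_model=CURRENT_MODEL):
    """A chunk is stale if its vector came from a different embedding
    model, or if it was never stamped with one at all."""
    return chunk_meta.get("embedding_model") != current_model

old = {"embedding_model": "text-embedding-ada-002"}
new = {"embedding_model": CURRENT_MODEL}
```

&lt;p&gt;On a model upgrade, a single scan over stored metadata then tells you exactly which vectors to regenerate instead of forcing a blind full re-ingest.&lt;/p&gt;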

&lt;h3&gt;
  
  
  Metadata Versioning
&lt;/h3&gt;

&lt;p&gt;Knowing which version of a document a chunk came from, when it was ingested, and who authored it is paramount to knowing whether the data is still valid.&lt;br&gt;
Without metadata, the vector database turns into a black box. No one can explain why a chunk was retrieved, whether it's outdated, or who authored it. All of which is critical for compliance and debugging.&lt;/p&gt;
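&lt;p&gt;As a sketch, the provenance attached at ingestion time can be as simple as a dictionary stored alongside each chunk; the field names below are illustrative:&lt;/p&gt;

```python
from datetime import datetime, timezone

def build_metadata(source_path, doc_version, author):
    """Provenance record stored alongside each chunk's vector."""
    return {
        "source": source_path,
        "doc_version": doc_version,
        "author": author,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

meta = build_metadata("handbook.pdf", "v3", "Ali Ismail")
```

&lt;p&gt;Most vector stores accept a metadata dictionary per record, which also enables filtered retrieval, for example restricting a query to the latest document version.&lt;/p&gt;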

&lt;h3&gt;
  
  
  Dynamic Chunk Sizing
&lt;/h3&gt;

&lt;p&gt;Strict character limits fall apart when ingesting tables, code blocks, and lists. Semantic boundaries matter more than arbitrary character counts.&lt;br&gt;
Fixed-length chunking ignores structure. If the retrieved data came from the middle of a code block, table, or conversation, the model cannot reconstruct its meaning, and retrieval leads to incomplete or surface-level answers.&lt;/p&gt;
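&lt;p&gt;A minimal sketch of structure-aware splitting: break on blank lines, but never inside a fenced code block, so every code block survives intact. Production splitters handle tables and headings too; this only illustrates the principle.&lt;/p&gt;

```python
FENCE = chr(96) * 3  # a Markdown code-fence marker: three backticks

def split_preserving_code_blocks(text, max_chars=800):
    """Split on blank lines, but never inside a fenced code block,
    so a retrieved chunk always contains the whole block."""
    chunks, current, in_code = [], [], False
    for line in text.splitlines(keepends=True):
        if line.strip().startswith(FENCE):
            in_code = not in_code
        current.append(line)
        at_boundary = not in_code and line.strip() == ""
        if at_boundary and sum(len(l) for l in current) >= max_chars:
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks
```

&lt;p&gt;Compare this with the fixed 800-character splitter in the demo above, which would happily cut a function definition in half.&lt;/p&gt;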

&lt;h3&gt;
  
  
  Semantic Validation of Retrievals
&lt;/h3&gt;

&lt;p&gt;It's vital to know whether chunking preserved meaning and whether relevant information can actually be retrieved. In the absence of testing, metrics, and a feedback loop, the project flies blind.&lt;br&gt;
Consistent testing is required to confirm that each ingestion run succeeded; the team needs to validate whether adding chunks is helping or hurting retrieval.&lt;br&gt;
Each of these concerns must be addressed to ensure quality and reliability in your AI assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Healthy Ingestion Unlocks
&lt;/h2&gt;

&lt;p&gt;When done right, you get more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval accuracy: relevant and confident answers&lt;/li&gt;
&lt;li&gt;Cost efficiency: fewer embeddings and fewer tokens&lt;/li&gt;
&lt;li&gt;Performance: faster queries and smaller indexes&lt;/li&gt;
&lt;li&gt;Knowledge freshness: re-embedding keeps responses up to date&lt;/li&gt;
&lt;li&gt;Compliance readiness: traceable, auditable data lineage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Healthy Ingestion Lifecycle
&lt;/h2&gt;

&lt;p&gt;A well-designed ingestion pipeline typically follows this sequence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract - load docs, normalize text, attach metadata&lt;/li&gt;
&lt;li&gt;Chunk - split semantically, not blindly by size&lt;/li&gt;
&lt;li&gt;Embed - use a model fit for your domain&lt;/li&gt;
&lt;li&gt;Validate - test retrieval precision and recall with sample questions&lt;/li&gt;
&lt;li&gt;Version &amp;amp; Monitor - detect drift, re-embed, and track growth&lt;/li&gt;
&lt;/ul&gt;
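&lt;p&gt;The five stages above can be sketched as one orchestration function. Every stage passed in below is a placeholder lambda, purely for illustration; in practice each would be a real implementation.&lt;/p&gt;

```python
def ingest(paths, extract, chunk, embed, validate, record):
    """Run Extract, Chunk, Embed, Validate, and Version/Monitor in order."""
    docs = [extract(p) for p in paths]
    chunks = [c for d in docs for c in chunk(d)]
    vectors = [embed(c) for c in chunks]
    report = validate(chunks, vectors)
    record(chunks, vectors, report)
    return report

# Toy stage implementations, purely for illustration.
report = ingest(
    paths=["a.md"],
    extract=lambda p: f"contents of {p}",
    chunk=lambda d: d.split(),
    embed=lambda c: [float(len(c))],
    validate=lambda cs, vs: {"chunks": len(cs), "vectors": len(vs)},
    record=lambda cs, vs, r: None,
)
```

&lt;p&gt;Keeping the stages as separate, swappable functions is what makes each one measurable on its own.&lt;/p&gt;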

&lt;p&gt;Each stage can be measured with concrete metrics, such as cost per document and retrieval accuracy.&lt;/p&gt;




&lt;p&gt;If your AI is behaving dumb, it may be because it has an ingestion problem.&lt;br&gt;
AI is only as sharp as what it eats.&lt;br&gt;
Healthy ingestion is like a healthy diet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documents are ingredients&lt;/li&gt;
&lt;li&gt;Chunking is portion control&lt;/li&gt;
&lt;li&gt;Embeddings are digestion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When AI eats junk food, it gets bloated, sluggish, and confused.&lt;br&gt;
Feed it clean and structured data, and it gets lean, responsive, and smart.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectordatabase</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
