<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gunjan Tailor</title>
    <description>The latest articles on Forem by Gunjan Tailor (@gunjantailor).</description>
    <link>https://forem.com/gunjantailor</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3938215%2F3cbbb4e8-fd61-4eac-aeda-e6d393ac966c.png</url>
      <title>Forem: Gunjan Tailor</title>
      <link>https://forem.com/gunjantailor</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gunjantailor"/>
    <language>en</language>
    <item>
      <title>I was embarrassed by my RAG demo. Turns out the bug was never in my code.</title>
      <dc:creator>Gunjan Tailor</dc:creator>
      <pubDate>Thu, 21 May 2026 17:08:33 +0000</pubDate>
      <link>https://forem.com/gunjantailor/i-was-embarrassed-by-my-rag-demo-turns-out-the-bug-was-never-in-my-code-4hmb</link>
      <guid>https://forem.com/gunjantailor/i-was-embarrassed-by-my-rag-demo-turns-out-the-bug-was-never-in-my-code-4hmb</guid>
      <description>&lt;p&gt;I showed my RAG app to a friend.&lt;/p&gt;

&lt;p&gt;He asked: "Which region grew the most last quarter?"&lt;/p&gt;

&lt;p&gt;It said Europe. The answer was Asia. By a lot.&lt;/p&gt;

&lt;p&gt;I spent two days debugging embeddings, chunk sizes, temperature settings.&lt;br&gt;
The bug was none of those things.&lt;/p&gt;

&lt;p&gt;The table had been turned into this:&lt;/p&gt;

&lt;p&gt;"45.2% Q3 Europe 38.1% Q2 Asia 41.7%..."&lt;/p&gt;

&lt;p&gt;Numbers with no headers. No caption. No context.&lt;br&gt;
The LLM wasn't hallucinating. It was working with garbage.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc6snla18uijvpwblqh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc6snla18uijvpwblqh5.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🛠️ So I built the thing I wished existed&lt;br&gt;
Meet DocNest — not another chunker.&lt;br&gt;
A document normalization engine that reads structure before touching content.&lt;/p&gt;

&lt;p&gt;Every heading → a navigable §section with its own ID&lt;br&gt;
Every table → preserved as { caption, headers, rows[] } JSON&lt;br&gt;
Every section → one-sentence LLM summary + BM25 keyword index&lt;br&gt;
All of it → packed into a portable .udf file&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocNestPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.reader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UDFIndex&lt;/span&gt;

&lt;span class="c1"&gt;# Convert — runs once, costs a few LLM calls
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocNestPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# free tier works perfectly
&lt;/span&gt;    &lt;span class="n"&gt;llm_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gsk_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;emb_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;huggingface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# local, no API key needed
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# → report.udf ✓
&lt;/span&gt;
&lt;span class="c1"&gt;# Query
&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UDFIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.udf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which region had the highest Q3 growth?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# "Asia grew the most, up +12.4pp"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layer_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 0  ← yes, really. zero.
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;✅ Zero tokens. Correct answer. 18ms.&lt;br&gt;
That's not a cherry-picked example. Here's why it's possible.&lt;/p&gt;

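&lt;p&gt;⚡ The 5-layer query engine&lt;br&gt;
Instead of dumping the full document into an LLM, queries escalate through layers, stopping the moment one can answer confidently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Pre-computed summary + key numbers&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;BM25 + cosine → lands on exact §section&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&amp;lt; 20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Section-scoped LLM call&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;1–3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Multi-section synthesis&lt;/td&gt;
&lt;td&gt;~900&lt;/td&gt;
&lt;td&gt;2–5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Full document fallback&lt;/td&gt;
&lt;td&gt;~4000+&lt;/td&gt;
&lt;td&gt;5–15s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I expected layers 2–4 to do most of the work.&lt;/p&gt;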

&lt;p&gt;🤯 Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.&lt;br&gt;
Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.&lt;/p&gt;
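
&lt;p&gt;To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The per-layer token figures are the approximate ones quoted above, and the 70% share for layers 0 and 1 is from this post. How that 70% divides between layers 0 and 1, and how the remaining 30% divides across layers 2 to 4, is a made-up split for illustration only, not a measured result. The sketch also ignores any tokens spent on failed attempts at earlier layers.&lt;/p&gt;

```python
# Approximate tokens per layer, as quoted in the post.
TOKENS_PER_LAYER = {0: 0, 1: 0, 2: 300, 3: 900, 4: 4000}

# Share of queries resolved at each layer. Only the 70% total for layers
# 0 and 1 comes from the post; every other number here is an assumption.
ASSUMED_SHARE = {0: 0.30, 1: 0.40, 2: 0.20, 3: 0.07, 4: 0.03}


def expected_tokens(tokens, share):
    """Average tokens per query, given the layer at which queries stop."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(tokens[layer] * share[layer] for layer in tokens)


average = expected_tokens(TOKENS_PER_LAYER, ASSUMED_SHARE)
print(round(average), "tokens per query on average under the assumed split")
print(TOKENS_PER_LAYER[4], "tokens per query if every question went to layer 4")
```

&lt;p&gt;Swap in your own measured split to see how sensitive the average is to the share of questions that fall through to layer 4.&lt;/p&gt;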

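&lt;p&gt;📊 Real numbers. Not vibes.&lt;br&gt;
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question type&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Basic facts (calories, macros)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed nutrition (fiber, glycemic index)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Micronutrients (vitamins, minerals)&lt;/td&gt;
&lt;td&gt;✅ 4/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard synthesis (BMR, omega-3, antioxidants)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases + hallucination traps&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;24/25 (96%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The one failure: a table-only page where the text parser extracted nothing.&lt;br&gt;
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.&lt;/p&gt;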

&lt;p&gt;🧠 Handles 600-page PDFs without exploding your RAM&lt;br&gt;
Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.&lt;br&gt;
DocNest splits the PDF into page chunks automatically, processes each chunk at full ML quality, and merges the output, so peak RAM is set by the chunk size rather than by the length of the document.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.parsers.pdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DoclingPDFParser&lt;/span&gt;

&lt;span class="c1"&gt;# Just works — auto-detects large PDFs
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600-page-annual-report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or tune for your hardware
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 💻 low RAM
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 🚀 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;speed mode&lt;/p&gt;

&lt;p&gt;🚀 Try it&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bashpip &lt;span class="nb"&gt;install &lt;/span&gt;docnest-ai

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown&lt;br&gt;
LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere&lt;br&gt;
Vector backends: numpy (zero deps) · FAISS · ChromaDB&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf     # structured HTML viewer in browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;GitHub repo (star it if this solved a problem you've had):&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tailorgunjan93" rel="noopener noreferrer"&gt;
        tailorgunjan93
      &lt;/a&gt; / &lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;
        docnest
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/tailorgunjan93/docnest/docs/logo.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Ftailorgunjan93%2Fdocnest%2FHEAD%2Fdocs%2Flogo.svg" alt="DOCNEST Logo" width="120"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;DOCNEST&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The document normalization engine RAG has always needed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/tailorgunjan93/docnest/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/tailorgunjan93/docnest/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/08cef40a9105b6526ca22088bc514fbfdbc9aac1ddbf8d4e6c750e3a88a44dca/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c75652e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e7d16618cfb930dc9ed2cbd0283c05e8164571fa019ce4ec0981047de28192f8/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31312532422d626c75653f6c6f676f3d707974686f6e" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/docnest-ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/09660d947fcded30a96c897b5d5a3962cbac7a9a3e12a5bd36a940f4444744ad/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f646f636e6573742d61693f636f6c6f723d677265656e" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/docnest-ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0d9ce0b1159aca10d98cdfe739542633d588b4f26a24e5e7ff7995790cfc3d5f/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f646f636e6573742d61693f636f6c6f723d626c7565" alt="PyPI Downloads"&gt;&lt;/a&gt;
&lt;a href=""&gt;&lt;img src="https://camo.githubusercontent.com/f3adeea933a64c2014c89092040b8c02f4931f3f5a5d46a189133d4ac21d0ebf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7374617475732d737461626c652d627269676874677265656e" alt="Status"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e2ab1ab10c5e4d7caa102b689469c5c6317ad19c273e05e28f02e048da214e79/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7461696c6f7267756e6a616e39332f646f636e6573743f7374796c653d736f6369616c" alt="Stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest/graphs/contributors" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/375d3c06879d4f352880c2fb43546cca4287ddfbf90017f60da8a1b69ab93104/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f636f6e7472696275746f72732f7461696c6f7267756e6a616e39332f646f636e657374" alt="Contributors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Parse any document. Understand its structure. Build RAG that actually works.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/tailorgunjan93/docnest#-why-docnest" rel="noopener noreferrer"&gt;Why DOCNEST&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-installation" rel="noopener noreferrer"&gt;Installation&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-python-api" rel="noopener noreferrer"&gt;Python API&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-pdf-parsing--memory-guide" rel="noopener noreferrer"&gt;PDF Parsing&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-how-it-works" rel="noopener noreferrer"&gt;How It Works&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-cli-reference" rel="noopener noreferrer"&gt;CLI Reference&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-provider-interfaces" rel="noopener noreferrer"&gt;Providers&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-roadmap" rel="noopener noreferrer"&gt;Roadmap&lt;/a&gt;&lt;/p&gt;


&lt;/div&gt;
&lt;br&gt;


&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem with RAG Today&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Every RAG pipeline ingests documents the same broken way:&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;PDF → extract text → split every 512 chars → embed → store → hope
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What gets silently destroyed:&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What blind chunking loses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial report&lt;/td&gt;
&lt;td&gt;Table row &lt;code&gt;45.2% | Q3 | Europe&lt;/code&gt; has no column headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal contract&lt;/td&gt;
&lt;td&gt;Clause split mid-sentence across two chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;Code example separated from its description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;td&gt;Figure caption disconnected from its analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;The LLM receives noise and returns approximate answers.&lt;/strong&gt; This is not a retrieval problem — it is an ingestion problem.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See the difference&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;Take a financial report with a revenue table…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;br&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/docnest-ai" rel="noopener noreferrer"&gt;https://pypi.org/project/docnest-ai&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Format spec: &lt;a href="https://github.com/tailorgunjan93/udf-spec" rel="noopener noreferrer"&gt;https://github.com/tailorgunjan93/udf-spec&lt;/a&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
      <category>rag</category>
    </item>
    <item>
      <title>My RAG app confidently told my client the wrong answer. I spent 3 days debugging the wrong thing.</title>
      <dc:creator>Gunjan Tailor</dc:creator>
      <pubDate>Mon, 18 May 2026 13:35:15 +0000</pubDate>
      <link>https://forem.com/gunjantailor/i-built-a-pdf-parser-that-actually-preserves-table-structure-for-rag-heres-why-it-matters-19fo</link>
      <guid>https://forem.com/gunjantailor/i-built-a-pdf-parser-that-actually-preserves-table-structure-for-rag-heres-why-it-matters-19fo</guid>
      <description>&lt;p&gt;Picture this.&lt;/p&gt;

&lt;p&gt;It's a client demo. They're watching. I type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Which region had the highest revenue growth last quarter?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My RAG app — &lt;strong&gt;three weeks of work&lt;/strong&gt;, carefully tuned embeddings, clever prompts — responds instantly.&lt;/p&gt;

&lt;p&gt;The client nods. Writes it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer was wrong. By almost double.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I spent three days debugging the wrong things.&lt;/p&gt;

&lt;p&gt;Chunk size? Tried 256, 512, 1024. Nothing.&lt;br&gt;
Temperature? 0.0, 0.3, 0.7. Still wrong.&lt;br&gt;
Embedding model? Swapped three of them. Nope.&lt;br&gt;
Prompt engineering? Added &lt;em&gt;"think step by step"&lt;/em&gt;, &lt;em&gt;"be precise"&lt;/em&gt;, &lt;em&gt;"do not hallucinate"&lt;/em&gt;. 😭&lt;/p&gt;

&lt;p&gt;The LLM wasn't hallucinating. It was doing its best with this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3  Asia   29.3%"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Orphaned numbers. No column headers. No caption. No context.&lt;/p&gt;

&lt;p&gt;The original table had all of that. My chunker ate it silently.&lt;/p&gt;
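
&lt;p&gt;Here is a minimal sketch of how that happens. It is not the chunker I was using, just the same idea in a few lines: flatten a table to plain text, then cut it into fixed-size windows. The header names and the tiny chunk size are assumptions chosen to make the effect visible, and the figures are the illustrative ones from the string above.&lt;/p&gt;

```python
# Hypothetical table: header names are assumed, values are illustrative.
TABLE = {
    "headers": ["Growth", "Quarter", "Region"],
    "rows": [
        ["45.2%", "Q3", "Europe"],
        ["38.1%", "Q2", "Europe"],
        ["41.7%", "Q3", "Asia"],
    ],
}


def flatten(table):
    """Mimic plain text extraction: cells joined by spaces, row boundaries lost."""
    cells = list(table["headers"])
    for row in table["rows"]:
        cells.extend(row)
    return " ".join(cells)


def chunk(text, size):
    """Split text into fixed-size character windows, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


flat = flatten(TABLE)
chunks = chunk(flat, 40)
for number, piece in enumerate(chunks):
    status = "headers present" if "Region" in piece else "headers lost"
    print(number, repr(piece), status)
```

&lt;p&gt;Only the first window still carries the column names. Any later window that gets retrieved on its own is just numbers and labels with nothing to say which is which.&lt;/p&gt;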

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The bug was never in retrieval. It was in ingestion.&lt;/strong&gt; And I never thought to look there.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;🔥 The dirty secret of RAG tutorials&lt;/h2&gt;

&lt;p&gt;Every tutorial shows you this pipeline:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF → extract text → chunk at 512 tokens → embed → store → retrieve → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Clean. Simple. &lt;strong&gt;Completely wrong for structured documents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what blind chunking silently destroys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Document&lt;/th&gt;
&lt;th&gt;What you had&lt;/th&gt;
&lt;th&gt;What the LLM gets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial report&lt;/td&gt;
&lt;td&gt;Revenue table with headers&lt;/td&gt;
&lt;td&gt;Orphaned numbers, zero context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal contract&lt;/td&gt;
&lt;td&gt;3-page clause&lt;/td&gt;
&lt;td&gt;Split mid-sentence, both halves useless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API docs&lt;/td&gt;
&lt;td&gt;Function + code example&lt;/td&gt;
&lt;td&gt;Code separated from its description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;td&gt;Figure with caption&lt;/td&gt;
&lt;td&gt;Caption on chunk 7, analysis on chunk 12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🗑️ &lt;strong&gt;You're feeding the LLM garbage and expecting gold.&lt;/strong&gt; The model isn't dumb — it's working with broken input.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;🛠️ So I built the thing I wished existed&lt;/h2&gt;

&lt;p&gt;Meet &lt;strong&gt;DocNest&lt;/strong&gt; — not another chunker.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;document normalization engine&lt;/strong&gt; that reads structure &lt;em&gt;before&lt;/em&gt; touching content.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every heading → a navigable &lt;code&gt;§section&lt;/code&gt; with its own ID&lt;/li&gt;
&lt;li&gt;Every table → preserved as &lt;code&gt;{ caption, headers, rows[] }&lt;/code&gt; JSON&lt;/li&gt;
&lt;li&gt;Every section → one-sentence LLM summary + BM25 keyword index&lt;/li&gt;
&lt;li&gt;All of it → packed into a portable &lt;code&gt;.udf&lt;/code&gt; file
&lt;/li&gt;
&lt;/ul&gt;
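
&lt;p&gt;To show why the &lt;code&gt;{ caption, headers, rows[] }&lt;/code&gt; shape matters, here is a small sketch in plain Python. It does not use DocNest and is not its internal format beyond those three keys. The caption, header names, and helper functions are assumptions for illustration.&lt;/p&gt;

```python
import json

# Keys follow the shape named in the post; the values are illustrative.
table = {
    "caption": "Growth by region and quarter",
    "headers": ["Growth", "Quarter", "Region"],
    "rows": [
        ["45.2%", "Q3", "Europe"],
        ["38.1%", "Q2", "Europe"],
        ["41.7%", "Q3", "Asia"],
    ],
}


def rows_as_records(tbl):
    """Pair every cell with its column header so no value is orphaned."""
    return [dict(zip(tbl["headers"], row)) for row in tbl["rows"]]


def render_for_prompt(tbl):
    """Serialise the caption plus labelled rows into text a model can read."""
    lines = [tbl["caption"]]
    for record in rows_as_records(tbl):
        lines.append(", ".join(f"{key}: {value}" for key, value in record.items()))
    return "\n".join(lines)


print(json.dumps(rows_as_records(table)[0]))
print(render_for_prompt(table))
```

&lt;p&gt;Every number now travels with its column name and the table caption, which is exactly the context that blind chunking throws away.&lt;/p&gt;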
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.pipeline&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DocNestPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.reader&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UDFIndex&lt;/span&gt;

&lt;span class="c1"&gt;# Convert — runs once, costs a few LLM calls
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocNestPipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# free tier works perfectly
&lt;/span&gt;    &lt;span class="n"&gt;llm_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gsk_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;emb_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;huggingface&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# local, no API key needed
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# → report.udf ✓
&lt;/span&gt;
&lt;span class="c1"&gt;# Query
&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;UDFIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.udf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which region had the highest Q3 growth?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# "Asia grew the most, up +12.4pp"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layer_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# 1
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokens_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 0  ← yes, really. zero.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;✅ &lt;strong&gt;Zero tokens. Correct answer. 18ms.&lt;/strong&gt;&lt;br&gt;
That's not a cherry-picked example. Here's why it's possible.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;⚡ The 5-layer query engine&lt;/h2&gt;

&lt;p&gt;Instead of dumping the full document into an LLM, queries escalate through layers — stopping the moment one can answer confidently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-computed summary + key numbers&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 1ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;BM25 + cosine → lands on exact §section&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt; 20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Section-scoped LLM call&lt;/td&gt;
&lt;td&gt;~300&lt;/td&gt;
&lt;td&gt;1–3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-section synthesis&lt;/td&gt;
&lt;td&gt;~900&lt;/td&gt;
&lt;td&gt;2–5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full document fallback&lt;/td&gt;
&lt;td&gt;~4000+&lt;/td&gt;
&lt;td&gt;5–15s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
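
&lt;p&gt;The escalation idea can be sketched as a simple cascade. This is a generic illustration of "try cheap layers first, stop at the first confident answer", not DocNest's actual code. The &lt;code&gt;Attempt&lt;/code&gt; type, the confidence threshold, and the three stand-in layers are all hypothetical; only the &lt;code&gt;answer&lt;/code&gt;, &lt;code&gt;layer_used&lt;/code&gt;, and &lt;code&gt;tokens_used&lt;/code&gt; field names mirror the result object shown earlier.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    answer: Optional[str]
    confidence: float
    tokens_used: int


@dataclass
class Result:
    answer: str
    layer_used: int
    tokens_used: int


# A layer is any function from a question to an Attempt (hypothetical type).
Layer = Callable[[str], Attempt]


def run_cascade(question: str, layers: List[Layer], threshold: float = 0.8) -> Result:
    """Try layers in order and stop at the first sufficiently confident answer."""
    spent = 0
    for number, layer in enumerate(layers):
        attempt = layer(question)
        spent += attempt.tokens_used
        if attempt.answer is not None and attempt.confidence >= threshold:
            return Result(attempt.answer, number, spent)
    raise LookupError("no layer produced a confident answer")


def summary_layer(question: str) -> Attempt:
    # Layer 0 stand-in: answers only questions it has a cached entry for.
    cache = {"what is this document about?": "A quarterly revenue report."}
    hit = cache.get(question.lower())
    return Attempt(hit, 1.0 if hit else 0.0, 0)


def index_layer(question: str) -> Attempt:
    # Layer 1 stand-in: pretend a keyword index located the right section.
    if "growth" in question.lower():
        return Attempt("See the section on regional growth.", 0.9, 0)
    return Attempt(None, 0.0, 0)


def llm_layer(question: str) -> Attempt:
    # Layer 2 stand-in: a fake model call using the post's rough token cost.
    return Attempt("(model-generated answer)", 0.85, 300)


layers = [summary_layer, index_layer, llm_layer]
result = run_cascade("Which region had the highest Q3 growth?", layers)
print(result.layer_used, result.tokens_used)
```

&lt;p&gt;The threshold is the knob that matters here: set it too low and cheap layers answer questions they should pass on, set it too high and everything falls through to the expensive layers.&lt;/p&gt;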

&lt;p&gt;I expected layers 2–4 to do most of the work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🤯 &lt;strong&gt;Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.&lt;/p&gt;
&lt;/blockquote&gt;
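
&lt;p&gt;For readers who have not met BM25, here is a compact pure-Python sketch of the keyword half of layer 1. It uses the standard Okapi BM25 formula with common default parameters (k1 = 1.5, b = 0.75), a naive whitespace tokenizer, and made-up section texts. Those choices are assumptions, and this is not DocNest's implementation. The cosine-similarity half is left out.&lt;/p&gt;

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    count = len(tokenized)
    average_length = sum(len(tokens) for tokens in tokenized) / count
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log((count - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            freq = term_freq[term]
            norm = freq + k1 * (1 - b + b * len(tokens) / average_length)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical section texts standing in for indexed sections.
sections = [
    "revenue growth by region for the third quarter",
    "risk factors and legal disclosures",
    "board members and executive compensation",
]
scores = bm25_scores("region growth", sections)
best = max(range(len(sections)), key=scores.__getitem__)
print(best, [round(score, 3) for score in scores])
```

&lt;p&gt;No model is called anywhere in this path, which is why a query that can be settled by the index costs zero tokens.&lt;/p&gt;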


&lt;h2&gt;📊 Real numbers. Not vibes.&lt;/h2&gt;

&lt;p&gt;25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question type&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Basic facts (calories, macros)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detailed nutrition (fiber, glycemic index)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Micronutrients (vitamins, minerals)&lt;/td&gt;
&lt;td&gt;✅ 4/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hard synthesis (BMR, omega-3, antioxidants)&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases + hallucination traps&lt;/td&gt;
&lt;td&gt;✅ 5/5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24/25 — 96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
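
&lt;p&gt;If you want to check the total, or reuse the tally for your own question set, the arithmetic is a few lines. The category names and scores are copied from the table above; the dictionary layout is just one convenient way to hold them.&lt;/p&gt;

```python
# (correct, asked) per category, copied from the table above.
scores = {
    "Basic facts": (5, 5),
    "Detailed nutrition": (5, 5),
    "Micronutrients": (4, 5),
    "Hard synthesis": (5, 5),
    "Edge cases + hallucination traps": (5, 5),
}

correct = sum(right for right, _ in scores.values())
asked = sum(total for _, total in scores.values())
accuracy = correct / asked
print(f"{correct}/{asked} = {accuracy:.0%}")
```

&lt;p&gt;With only 25 questions, each one is worth four percentage points, so treat the headline figure as a smoke test rather than a benchmark.&lt;/p&gt;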

&lt;p&gt;The one failure: a table-only page where the text parser extracted nothing.&lt;br&gt;
Fix: use &lt;code&gt;DoclingPDFParser&lt;/code&gt; for image-heavy or scanned PDFs.&lt;/p&gt;


&lt;h2&gt;🧠 Handles 600-page PDFs without exploding your RAM&lt;/h2&gt;

&lt;p&gt;Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.&lt;/p&gt;

&lt;p&gt;DocNest splits the PDF into page chunks automatically, processes each chunk at full ML quality, and merges the output, so peak RAM is set by the chunk size rather than by the length of the document.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;docnest.parsers.pdf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DoclingPDFParser&lt;/span&gt;

&lt;span class="c1"&gt;# Just works — auto-detects large PDFs
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600-page-annual-report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or tune for your hardware
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 💻 low RAM
&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DoclingPDFParser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 🚀 speed mode
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
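
&lt;p&gt;The chunk-and-merge idea can be sketched in a few lines. This is a minimal illustration, not DocNest's actual implementation: &lt;code&gt;page_ranges&lt;/code&gt;, &lt;code&gt;parse_in_chunks&lt;/code&gt;, and the &lt;code&gt;parse_range&lt;/code&gt; callback are hypothetical names, and the callback stands in for whatever real parser handles one page range.&lt;/p&gt;

```python
from typing import Callable, Iterator


def page_ranges(total_pages: int, chunk_pages: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) ranges of at most chunk_pages pages."""
    if not chunk_pages >= 1:
        raise ValueError("chunk_pages must be at least 1")
    for start in range(1, total_pages + 1, chunk_pages):
        yield start, min(start + chunk_pages - 1, total_pages)


def parse_in_chunks(
    total_pages: int,
    chunk_pages: int,
    parse_range: Callable[[int, int], list[str]],  # hypothetical per-range parser
) -> list[str]:
    """Parse one page range at a time and merge the results in page order."""
    merged: list[str] = []
    for start, end in page_ranges(total_pages, chunk_pages):
        # Only this range is handed to the parser at any moment.
        merged.extend(parse_range(start, end))
    return merged


if __name__ == "__main__":
    # Stand-in parser: returns one label per page instead of real content.
    def fake_parser(start: int, end: int) -> list[str]:
        return [f"page-{n}" for n in range(start, end + 1)]

    print(list(page_ranges(25, 10)))
    print(len(parse_in_chunks(600, 10, fake_parser)))
```

&lt;p&gt;With &lt;code&gt;chunk_pages=10&lt;/code&gt;, a 600-page document becomes 60 ranges, and the parser only ever sees 10 pages at once. A smaller chunk size trades speed for a lower memory ceiling, which is the same trade-off the &lt;code&gt;chunk_pages&lt;/code&gt; setting above exposes.&lt;/p&gt;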



&lt;h2&gt;
  
  
  🚀 Try it
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;docnest-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Formats:&lt;/strong&gt; PDF (ML + fast) · DOCX · XLSX · HTML · Markdown&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM providers:&lt;/strong&gt; Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector backends:&lt;/strong&gt; NumPy (zero deps) · FAISS · ChromaDB&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CLI — because boilerplate is boring&lt;/span&gt;
docnest convert report.pdf &lt;span class="nt"&gt;--llm-provider&lt;/span&gt; groq &lt;span class="nt"&gt;--llm-model&lt;/span&gt; llama-3.3-70b-versatile
docnest query report.udf &lt;span class="s2"&gt;"What are the key financial risks?"&lt;/span&gt;
docnest view report.udf     &lt;span class="c"&gt;# structured HTML viewer in browser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;GitHub repo — star it if this solved a problem you've had:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/tailorgunjan93" rel="noopener noreferrer"&gt;
        tailorgunjan93
      &lt;/a&gt; / &lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;
        docnest
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/tailorgunjan93/docnest/docs/logo.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Ftailorgunjan93%2Fdocnest%2FHEAD%2Fdocs%2Flogo.svg" alt="DOCNEST Logo" width="120"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;DOCNEST&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The document normalization engine RAG has always needed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/tailorgunjan93/docnest/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/tailorgunjan93/docnest/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/08cef40a9105b6526ca22088bc514fbfdbc9aac1ddbf8d4e6c750e3a88a44dca/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d626c75652e737667" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e7d16618cfb930dc9ed2cbd0283c05e8164571fa019ce4ec0981047de28192f8/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31312532422d626c75653f6c6f676f3d707974686f6e" alt="Python"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/docnest-ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/09660d947fcded30a96c897b5d5a3962cbac7a9a3e12a5bd36a940f4444744ad/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f646f636e6573742d61693f636f6c6f723d677265656e" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://pypi.org/project/docnest-ai" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0d9ce0b1159aca10d98cdfe739542633d588b4f26a24e5e7ff7995790cfc3d5f/68747470733a2f2f696d672e736869656c64732e696f2f707970692f646d2f646f636e6573742d61693f636f6c6f723d626c7565" alt="PyPI Downloads"&gt;&lt;/a&gt;
&lt;a href=""&gt;&lt;img src="https://camo.githubusercontent.com/a7dd954dfc85fc675b686e7d47fef2274be7095f2b5c77a5a031a945889b018c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7374617475732d616c7068612d79656c6c6f77" alt="Status"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e2ab1ab10c5e4d7caa102b689469c5c6317ad19c273e05e28f02e048da214e79/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f7461696c6f7267756e6a616e39332f646f636e6573743f7374796c653d736f6369616c" alt="Stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/tailorgunjan93/docnest/graphs/contributors" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/375d3c06879d4f352880c2fb43546cca4287ddfbf90017f60da8a1b69ab93104/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f636f6e7472696275746f72732f7461696c6f7267756e6a616e39332f646f636e657374" alt="Contributors"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Parse any document. Understand its structure. Build RAG that actually works.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/tailorgunjan93/docnest#-why-docnest" rel="noopener noreferrer"&gt;Why DOCNEST&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-installation" rel="noopener noreferrer"&gt;Installation&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-python-api" rel="noopener noreferrer"&gt;Python API&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-pdf-parsing--memory-guide" rel="noopener noreferrer"&gt;PDF Parsing&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-how-it-works" rel="noopener noreferrer"&gt;How It Works&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-cli-reference" rel="noopener noreferrer"&gt;CLI Reference&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-provider-interfaces" rel="noopener noreferrer"&gt;Providers&lt;/a&gt; •
&lt;a href="https://github.com/tailorgunjan93/docnest#-roadmap" rel="noopener noreferrer"&gt;Roadmap&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;The Problem with RAG Today&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Every RAG pipeline ingests documents the same broken way:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;PDF → extract text → split every 512 chars → embed → store → hope
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;What gets silently destroyed:&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What blind chunking loses&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Financial report&lt;/td&gt;
&lt;td&gt;Table row &lt;code&gt;45.2% | Q3 | Europe&lt;/code&gt; has no column headers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legal contract&lt;/td&gt;
&lt;td&gt;Clause split mid-sentence across two chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API documentation&lt;/td&gt;
&lt;td&gt;Code example separated from its description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research paper&lt;/td&gt;
&lt;td&gt;Figure caption disconnected from its analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;The LLM receives noise and returns approximate answers.&lt;/strong&gt; This is not a retrieval problem — it is an ingestion problem.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;See the difference&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;Take a financial report with a revenue table…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
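
&lt;p&gt;The table-header failure from the README excerpt is easy to reproduce. Here is a minimal sketch, assuming a plain-text table and a hypothetical &lt;code&gt;blind_chunks&lt;/code&gt; helper that stands in for any fixed-width splitter. The window is set to the header's length rather than 512 characters so the effect shows on a three-line table, and apart from the &lt;code&gt;45.2% | Q3 | Europe&lt;/code&gt; row, which echoes the README, the rows and column names are made up for illustration.&lt;/p&gt;

```python
def blind_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows, ignoring all structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Toy table: the header names and the Asia row are illustrative, not real data.
table = (
    "Growth | Quarter | Region\n"
    "45.2%  | Q3      | Europe\n"
    "52.0%  | Q3      | Asia\n"
)

header = table.splitlines(keepends=True)[0]
chunks = blind_chunks(table, len(header))

# The header lands in the first chunk only; every later chunk is bare numbers.
orphaned = [c for c in chunks[1:] if "Growth" not in c]

if __name__ == "__main__":
    for i, chunk in enumerate(chunks):
        print(i, repr(chunk))
    print("chunks without column headers:", len(orphaned))
```

&lt;p&gt;Every chunk after the first carries values with no column names, which is exactly what the retriever later hands to the LLM.&lt;/p&gt;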


&lt;p&gt;PyPI: &lt;a href="https://pypi.org/project/docnest-ai" rel="noopener noreferrer"&gt;https://pypi.org/project/docnest-ai&lt;/a&gt;&lt;br&gt;
Format spec: &lt;a href="https://github.com/tailorgunjan93/udf-spec" rel="noopener noreferrer"&gt;https://github.com/tailorgunjan93/udf-spec&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔨 Honesty tax
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;🚧 This is &lt;code&gt;0.4.0a2&lt;/code&gt;, an alpha release. It works on real documents, but the PPTX parser isn't built yet, Qdrant/Weaviate backends are still on the roadmap, and SharePoint/Confluence connectors are only planned.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If any of those sound like something you want to build — &lt;a href="https://github.com/tailorgunjan93/docnest/issues" rel="noopener noreferrer"&gt;good first issues are labeled and waiting&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 One question for you
&lt;/h2&gt;

&lt;p&gt;Most RAG infrastructure assumes text extraction is a solved problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It isn't.&lt;/strong&gt; Not for tables. Not for anything where position and relationship carry meaning.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💬 &lt;strong&gt;What document type has caused you the most RAG pain?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For me it was financial tables. Drop yours in the comments. If it's a format DocNest doesn't handle yet, that's probably the next parser I build.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Building in the open at &lt;a href="https://github.com/tailorgunjan93/docnest" rel="noopener noreferrer"&gt;github.com/tailorgunjan93/docnest&lt;/a&gt;. Stars, issues, and brutal feedback all welcome.&lt;/em&gt; 🙏&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
