Forem: Gunjan Tailor

I was embarrassed by my RAG demo. Turns out the bug was never in my code.

Gunjan Tailor — Thu, 21 May 2026 17:08:33 +0000

I showed my RAG app to a friend.

He asked: "which region grew the most last quarter?"

It said Europe. The answer was Asia. By a lot.

I spent two days debugging embeddings, chunk sizes, temperature settings.
The bug was none of those things.

The table had been turned into this:

"45.2% Q3 Europe 38.1% Q2 Asia 41.7%..."

Numbers with no headers. No caption. No context.
The LLM wasn't hallucinating. It was working with garbage.

🛠️ So I built the thing I wished existed
Meet DocNest — not another chunker.
A document normalization engine that reads structure before touching content.

Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file
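
To illustrate why the `{ caption, headers, rows[] }` shape matters, here is a minimal sketch of answering a growth question directly from a preserved table. The caption, column names, values, and the `growth_by_region` helper are all hypothetical and chosen for illustration; this is not DocNest's internal code.

```python
import json

# Hypothetical table in the { caption, headers, rows[] } shape.
table = {
    "caption": "Quarterly figures by region (hypothetical)",
    "headers": ["Region", "Quarter", "Value (%)"],
    "rows": [
        ["Europe", "Q2", 38.1],
        ["Europe", "Q3", 45.2],
        ["Asia", "Q2", 29.3],
        ["Asia", "Q3", 41.7],
    ],
}

def growth_by_region(table, start="Q2", end="Q3"):
    """Return {region: end - start} in percentage points, locating columns by header name."""
    h = table["headers"]
    region_i, quarter_i, value_i = h.index("Region"), h.index("Quarter"), h.index("Value (%)")
    values = {(row[region_i], row[quarter_i]): row[value_i] for row in table["rows"]}
    regions = {row[region_i] for row in table["rows"]}
    return {r: round(values[(r, end)] - values[(r, start)], 1) for r in regions}

growth = growth_by_region(table)
best = max(growth, key=growth.get)

print(json.dumps(table, indent=2))  # the structure survives serialization
print(best, growth[best])
```

Because every value stays attached to its header, the lookup needs no model call at all.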

```python
from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex

# Convert: runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",           # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",    # local, no API key needed
)
pipeline.convert("report.pdf")    # → report.udf ✓

# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1
print(result.tokens_used)  # 0  ← yes, really. zero.
```

✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.

⚡ The 5-layer query engine
Instead of dumping the full document into an LLM, queries escalate through layers — stopping the moment one can answer confidently.

Layer	What it does	Tokens	Speed
0	Pre-computed summary + key numbers	0	< 1ms
1	BM25 + cosine → lands on exact §section	0	< 20ms
2	Section-scoped LLM call	~300	1–3s
3	Multi-section synthesis	~900	2–5s
4	Full document fallback	~4000+	5–15s

I expected layers 2–4 to do most of the work.

🤯 Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.
Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.
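
For readers unfamiliar with the keyword half of layer 1, here is a minimal sketch of BM25 ranking over section texts. It uses a common BM25 variant with the usual default parameters (`k1=1.5`, `b=0.75`); the section IDs, section texts, and function names are hypothetical, the cosine-similarity half is omitted, and none of this is taken from DocNest's source.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer, deliberately naive."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` (id -> text) against `query` with BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    # Document frequency: how many sections contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical sections keyed by made-up section IDs.
sections = {
    "§2.1": "revenue growth by region europe asia quarterly table",
    "§3.4": "risk factors currency exposure and supply chain",
    "§5.0": "board of directors and governance",
}

scores = bm25_scores("region revenue growth", sections)
best = max(scores, key=scores.get)
print(best)
```

No model is involved in this scoring, which is consistent with the zero-token figure for layer 1.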

📊 Real numbers. Not vibes.
25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.

Question type	Score
Basic facts (calories, macros)	✅ 5/5
Detailed nutrition (fiber, glycemic index)	✅ 5/5
Micronutrients (vitamins, minerals)	✅ 4/5
Hard synthesis (BMR, omega-3, antioxidants)	✅ 5/5
Edge cases + hallucination traps	✅ 5/5
Total	24/25 (96%)

The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.

🧠 Handles 600-page PDFs without exploding your RAM
Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.
DocNest splits the PDF into page chunks automatically, processes each chunk at full ML quality, and merges the output. Peak RAM stays constant regardless of document size.

```python
from docnest.parsers.pdf import DoclingPDFParser

# Just works: auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # 💻 low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # 🚀 speed mode
```
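
The chunk-and-merge idea can be sketched in a few lines. This is only an illustration of the general technique, assuming a parser that can be called on a page range; `page_ranges`, `parse_in_chunks`, and `fake_parse_range` are hypothetical names, not part of DocNest or Docling.

```python
def page_ranges(total_pages, chunk_pages):
    """Yield (start, end) page ranges, 0-based and end-exclusive, covering the document."""
    for start in range(0, total_pages, chunk_pages):
        yield start, min(start + chunk_pages, total_pages)

def parse_in_chunks(total_pages, chunk_pages, parse_range):
    """Parse one page range at a time and keep only the parsed output of each."""
    merged = []
    for start, end in page_ranges(total_pages, chunk_pages):
        merged.extend(parse_range(start, end))
    return merged

def fake_parse_range(start, end):
    """Stand-in parser: returns one placeholder record per page."""
    return [{"page": p} for p in range(start, end)]

ranges = list(page_ranges(600, 250))
pages = parse_in_chunks(600, 50, fake_parse_range)
print(ranges)
print(len(pages))
```

In a design like this the parser's working memory scales with `chunk_pages` rather than with the document, while the merged output still grows with page count.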

🚀 Try it

```bash
pip install docnest-ai
```

Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown
LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere
Vector backends: numpy (zero deps) · FAISS · ChromaDB
```bash
# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf     # structured HTML viewer in browser
```

GitHub repo — star it if this solved a problem you've had:

tailorgunjan93 / docnest

The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.

The Problem with RAG Today

Every RAG pipeline ingests documents the same broken way:

PDF → extract text → split every 512 chars → embed → store → hope

What gets silently destroyed:

Source	What blind chunking loses
Financial report	Table row `45.2% | Q3 | Europe` has no column headers
Legal contract	Clause split mid-sentence across two chunks
API documentation	Code example separated from its description
Research paper	Figure caption disconnected from its analysis

The LLM receives noise and returns approximate answers. This is not a retrieval problem — it is an ingestion problem.

See the difference

Take a financial report with a revenue table…

PyPI: https://pypi.org/project/docnest-ai

Format spec: https://github.com/tailorgunjan93/udf-spec

My RAG app confidently told my client the wrong answer. I spent 3 days debugging the wrong thing.

Gunjan Tailor — Mon, 18 May 2026 13:35:15 +0000

Picture this.

It's a client demo. They're watching. I type:

"Which region had the highest revenue growth last quarter?"

My RAG app — three weeks of work, carefully tuned embeddings, clever prompts — responds instantly.

The client nods. Writes it down.

The answer was wrong. By almost double.

I spent three days debugging the wrong things.

Chunk size? Tried 256, 512, 1024. Nothing.
Temperature? 0.0, 0.3, 0.7. Still wrong.
Embedding model? Swapped three of them. Nope.
Prompt engineering? Added "think step by step", "be precise", "do not hallucinate". 😭

The LLM wasn't hallucinating. It was doing its best with this:

"45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3  Asia   29.3%"

Orphaned numbers. No column headers. No caption. No context.

The original table had all of that. My chunker ate it silently.
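
To make the failure concrete, here is a minimal sketch of how flattening a table discards its header row. The cell values mirror the string above; the column names are assumptions (the real headers are exactly what got lost), the last row's quarter and region are inferred, and `flatten_cells` is a hypothetical stand-in for a naive text extractor.

```python
# Column names are assumed; the final row's "Q2" and "Asia" are inferred.
headers = ["Value", "Quarter", "Region"]
rows = [
    ["45.2%", "Q3", "Europe"],
    ["38.1%", "Q2", "Europe"],
    ["41.7%", "Q3", "Asia"],
    ["29.3%", "Q2", "Asia"],
]

def flatten_cells(rows):
    """Mimic an extractor that emits body cells in reading order and drops the header row."""
    return "  ".join(cell for row in rows for cell in row)

def row_to_labeled_text(headers, row):
    """Keep each value attached to its column name."""
    return ", ".join(f"{h}: {v}" for h, v in zip(headers, row))

flat = flatten_cells(rows)
labeled = [row_to_labeled_text(headers, r) for r in rows]

print(flat)
for line in labeled:
    print(line)
```

The flattened string contains every number but none of the column names, so nothing downstream can tell what `45.2%` measures.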

⚠️ The bug was never in retrieval. It was in ingestion. And I never thought to look there.

🔥 The dirty secret of RAG tutorials

Every tutorial shows you this pipeline:

PDF → extract text → chunk at 512 tokens → embed → store → retrieve → answer

Clean. Simple. Completely wrong for structured documents.

Here's what blind chunking silently destroys:

Document	What you had	What the LLM gets
Financial report	Revenue table with headers	Orphaned numbers, zero context
Legal contract	3-page clause	Split mid-sentence, both halves useless
API docs	Function + code example	Code separated from its description
Research paper	Figure with caption	Caption on chunk 7, analysis on chunk 12

🗑️ You're feeding the LLM garbage and expecting gold. The model isn't dumb — it's working with broken input.
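
A minimal sketch of the fixed-size chunking step shows how a clause gets separated from its own exception. The clause text and the `chunk_fixed` helper are hypothetical, and the window is measured in characters and kept tiny for readability, whereas real pipelines typically count tokens.

```python
def chunk_fixed(text, size):
    """Split text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical contract clause, for illustration only.
clause = (
    "The Supplier shall indemnify the Client against all losses, "
    "except where the loss results from the Client's own negligence."
)

chunks = chunk_fixed(clause, 40)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The obligation lands in the first chunk and its exception in a later one, so retrieving either chunk alone misstates the clause.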

🛠️ So I built the thing I wished existed

Meet DocNest — not another chunker.

A document normalization engine that reads structure before touching content.

Every heading → a navigable §section with its own ID
Every table → preserved as { caption, headers, rows[] } JSON
Every section → one-sentence LLM summary + BM25 keyword index
All of it → packed into a portable .udf file

```python
from docnest.pipeline import DocNestPipeline
from docnest.reader import UDFIndex

# Convert: runs once, costs a few LLM calls
pipeline = DocNestPipeline(
    llm_provider="groq",           # free tier works perfectly
    llm_api_key="gsk_...",
    emb_provider="huggingface",    # local, no API key needed
)
pipeline.convert("report.pdf")    # → report.udf ✓

# Query
idx = UDFIndex.load("report.udf")
result = idx.query("Which region had the highest Q3 growth?")

print(result.answer)       # "Asia grew the most, up +12.4pp"
print(result.layer_used)   # 1
print(result.tokens_used)  # 0  ← yes, really. zero.
```

✅ Zero tokens. Correct answer. 18ms.
That's not a cherry-picked example. Here's why it's possible.

⚡ The 5-layer query engine

Instead of dumping the full document into an LLM, queries escalate through layers — stopping the moment one can answer confidently.

Layer	What it does	Tokens	Speed
0	Pre-computed summary + key numbers	0	< 1ms
1	BM25 + cosine → lands on exact §section	0	< 20ms
2	Section-scoped LLM call	~300	1–3s
3	Multi-section synthesis	~900	2–5s
4	Full document fallback	~4000+	5–15s

I expected layers 2–4 to do most of the work.

🤯 Layers 0 and 1 handle roughly 70% of real-world questions — at zero token cost.

Seven out of ten queries answered from a structured index. You pay for LLM compute only when genuine reasoning is needed.
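
The escalation logic can be sketched as a loop over layers that stops at the first confident answer. This is an assumption about the general pattern, not DocNest's implementation: `LayerResult`, `QueryResult`, `run_layers`, the `0.8` threshold, and the stub layers with their made-up confidences are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class LayerResult:
    answer: str
    confidence: float
    tokens_used: int

@dataclass
class QueryResult:
    answer: str
    layer_used: int
    tokens_used: int

Layer = Callable[[str], Optional[LayerResult]]

def run_layers(question: str, layers: List[Layer], threshold: float = 0.8) -> QueryResult:
    """Try layers in order; return the first answer whose confidence clears the threshold."""
    spent = 0
    last = None
    for index, layer in enumerate(layers):
        result = layer(question)
        if result is None:
            continue  # this layer had nothing to offer
        spent += result.tokens_used
        last = (index, result)
        if result.confidence >= threshold:
            return QueryResult(result.answer, index, spent)
    if last is None:
        raise LookupError("no layer produced an answer")
    index, result = last  # fall back to the last answer seen
    return QueryResult(result.answer, index, spent)

# Stub layers standing in for real ones.
def summary_layer(question):
    return None

def index_layer(question):
    return LayerResult("Asia grew the most", confidence=0.93, tokens_used=0)

def llm_layer(question):
    return LayerResult("(LLM answer)", confidence=0.99, tokens_used=300)

result = run_layers(
    "Which region had the highest Q3 growth?",
    [summary_layer, index_layer, llm_layer],
)
print(result.layer_used, result.tokens_used)
```

With these stubs the index layer answers first, so the LLM layer is never called and no tokens are spent.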

📊 Real numbers. Not vibes.

25 questions. 500-page open-source nutrition textbook. PyMuPDF + Groq free tier.

Question type	Score
Basic facts (calories, macros)	✅ 5/5
Detailed nutrition (fiber, glycemic index)	✅ 5/5
Micronutrients (vitamins, minerals)	✅ 4/5
Hard synthesis (BMR, omega-3, antioxidants)	✅ 5/5
Edge cases + hallucination traps	✅ 5/5
Total	24/25 — 96%

The one failure: a table-only page where the text parser extracted nothing.
Fix: use DoclingPDFParser for image-heavy or scanned PDFs.
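
For anyone who wants to check the total, the per-category scores in the table add up as follows. The dictionary simply restates the table; the variable names are illustrative.

```python
# (correct, asked) per question type, copied from the table above.
scores = {
    "Basic facts": (5, 5),
    "Detailed nutrition": (5, 5),
    "Micronutrients": (4, 5),
    "Hard synthesis": (5, 5),
    "Edge cases + hallucination traps": (5, 5),
}

correct = sum(c for c, _ in scores.values())
total = sum(t for _, t in scores.values())
accuracy = 100 * correct / total

print(f"{correct}/{total} = {accuracy:.0f}%")
```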

🧠 Handles 600-page PDFs without exploding your RAM

Standard Docling loads the full document into memory. 600 pages on a normal laptop = 💀 out of memory.

DocNest splits the PDF into page chunks automatically, processes each chunk at full ML quality, and merges the output. Peak RAM stays constant regardless of document size.

```python
from docnest.parsers.pdf import DoclingPDFParser

# Just works: auto-detects large PDFs
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune for your hardware
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # 💻 low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # 🚀 speed mode
```

🚀 Try it

```bash
pip install docnest-ai
```

Formats: PDF (ML + fast) · DOCX · XLSX · HTML · Markdown

LLM providers: Groq (free) · OpenAI · Ollama (local) · Anthropic · Mistral · Google · Cohere

Vector backends: numpy (zero deps) · FAISS · ChromaDB

```bash
# CLI, because boilerplate is boring
docnest convert report.pdf --llm-provider groq --llm-model llama-3.3-70b-versatile
docnest query report.udf "What are the key financial risks?"
docnest view report.udf     # structured HTML viewer in browser
```

GitHub repo — star it if this solved a problem you've had:

tailorgunjan93 / docnest

The document normalization engine RAG has always needed. Parse any document, understand its structure, build RAG that actually works.

The Problem with RAG Today

Every RAG pipeline ingests documents the same broken way:

PDF → extract text → split every 512 chars → embed → store → hope

What gets silently destroyed:

Source	What blind chunking loses
Financial report	Table row `45.2% | Q3 | Europe` has no column headers
Legal contract	Clause split mid-sentence across two chunks
API documentation	Code example separated from its description
Research paper	Figure caption disconnected from its analysis

The LLM receives noise and returns approximate answers. This is not a retrieval problem — it is an ingestion problem.

See the difference

Take a financial report with a revenue table…

PyPI: https://pypi.org/project/docnest-ai
Format spec: https://github.com/tailorgunjan93/udf-spec

🔨 Honesty tax

🚧 This is 0.4.0a2, an alpha release. It works on real documents, but the PPTX parser isn't built yet, Qdrant/Weaviate backends are on the roadmap, and SharePoint/Confluence connectors are planned.

If any of those sound like something you want to build — good first issues are labeled and waiting.

💬 One question for you

Most RAG infrastructure assumes text extraction is a solved problem.

It isn't. Not for tables. Not for anything where position and relationship carry meaning.

💬 What document type has caused you the most RAG pain?

For me it was financial tables. Drop it in the comments — if it's a format DocNest doesn't handle yet, that's probably the next parser I build.

Building in the open at github.com/tailorgunjan93/docnest. Stars, issues, and brutal feedback all welcome. 🙏