Forem: saurabh naik

GraphRAG vs vector RAG: when the knowledge graph pays for itself

saurabh naik — Mon, 18 May 2026 11:09:40 +0000

Ask your vector RAG pipeline "what are the main themes in this corpus?" and watch it return three random chunks that share a keyword. Flat vector retrieval is built for "find me the chunk that matches this query." It is not built for holistic, sense-making questions over a whole corpus.

GraphRAG, from Microsoft Research, was the headline fix for that gap. It builds an LLM-extracted knowledge graph plus hierarchical community summaries, then answers global queries by map-reducing over those summaries. The catch — which Microsoft itself published in their LazyGraphRAG benchmark — is that the indexing pipeline costs roughly 1000x more than a naive vector index. This post walks through what GraphRAG actually does, when it earns that cost, and what to reach for when it doesn't.

The failure mode flat vector RAG hides

Say you have 500 internal incident reports. A new hire asks: "What categories of incidents have we hit most often this year?"

Vector RAG embeds the question, retrieves the top-k chunks by cosine similarity, and stuffs them into the prompt. You get an answer based on whichever 5 chunks happened to score highest: the ones phrased most like the question, not a representative sample of the corpus. The model can only summarize what it sees, and it never sees the whole picture.

This is the failure mode GraphRAG was built for: queries where the right answer requires reasoning over the whole corpus, not just retrieving the closest passage.

How GraphRAG fixes it

The indexing pipeline does four things:

Chunk and extract. An LLM reads each chunk and extracts entities, relationships, and claims — with weighted edges and source provenance back to the original text.
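Build a typed graph. Entities become nodes, relationships become edges. Microsoft's reference implementation writes these out as Parquet tables, with LanceDB as the default vector store; Neo4j is a common choice when you want a queryable graph database.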
Run Leiden community detection. This hierarchical clustering algorithm partitions the graph into nested communities — small tight clusters inside larger thematic ones.
Generate community reports. For every community at every level, the LLM writes a natural-language summary. These summaries are what global queries actually answer against.

That last step is where the token bill explodes. You are paying an LLM to summarize every community at every hierarchy level, and you do it once at index time.

Local Search vs Global Search

GraphRAG ships two query modes, and the difference matters:

Local Search is for specific entity-centric questions ("what did we ship in the Q3 release?"). It matches the query to entities, expands to their neighborhoods (linked entities, relationships, source text), and feeds that subgraph as context.

Global Search is for thematic, aggregative questions ("what are the recurring failure modes across these incidents?"). It map-reduces over the precomputed community reports — each report contributes a partial answer, then a reducer combines them.
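The map-reduce shape of Global Search can be sketched in a few lines. This is a toy illustration, not the graphrag library's implementation: `score_and_answer` and `reduce_answers` are hypothetical stand-ins for the two LLM calls, and the keyword-overlap scoring exists only so the example runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class PartialAnswer:
    community_id: int
    score: int   # helpfulness score assigned in the map step
    text: str


def score_and_answer(query: str, community_id: int, report: str) -> PartialAnswer:
    """Map step (stand-in for an LLM call): rate one community report against the query."""
    query_terms = set(query.lower().split())
    overlap = len(query_terms & set(report.lower().split()))
    return PartialAnswer(community_id, overlap, report)


def reduce_answers(partials: list[PartialAnswer], top_n: int = 2) -> str:
    """Reduce step (stand-in for an LLM call): keep and combine the most useful partials."""
    useful = [p for p in partials if p.score > 0]
    useful.sort(key=lambda p: p.score, reverse=True)
    return " | ".join(p.text for p in useful[:top_n])


def global_search(query: str, reports: dict[int, str]) -> str:
    # Every report contributes a partial answer, then one reducer combines them.
    partials = [score_and_answer(query, cid, rep) for cid, rep in reports.items()]
    return reduce_answers(partials)


# Made-up community reports for illustration.
reports = {
    0: "database failover incidents caused repeated outages",
    1: "expired certificates caused several outages",
    2: "office move logistics and seating plan",
}
answer = global_search("what caused outages", reports)
print(answer)
```

The point of the shape is that every community report gets a chance to contribute, which is what a top-k retriever cannot offer. It is also why query cost grows with the number of reports at the chosen hierarchy level.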

If you only need Local Search, you arguably do not need GraphRAG — entity-anchored hybrid retrieval gets you most of the way there. Global Search is the unique capability, and it is also the one that justifies the indexing cost.

A minimal run

pip install graphrag

Initialize a workspace:

python -m graphrag.index --init --root ./ragtest

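That scaffolds a settings.yaml. These module-style commands and the config layout below match older graphrag releases; newer releases expose the same steps as graphrag init, graphrag index, and graphrag query and have reorganized settings.yaml, so check the docs for the version you installed. The fields you will edit first: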

llm:
  type: openai_chat
  model: gpt-4o-mini
  api_key: ${GRAPHRAG_API_KEY}

embeddings:
  llm:
    type: openai_embedding
    model: text-embedding-3-small

chunks:
  size: 1200
  overlap: 100

community_reports:
  max_length: 2000

Drop your text files in ./ragtest/input/, then run the index:

python -m graphrag.index --root ./ragtest

Issue a global query:

python -m graphrag.query \
  --root ./ragtest \
  --method global \
  "What are the main themes across these documents?"

The first run on a small corpus is illuminating — you can watch the token meter while the LLM summarizes communities.

Warning: Run this on a 5MB corpus before you point it at a 5GB one. Indexing cost tracks the number of tokens the LLM has to read and write, so the bill grows with the corpus, and that work is not cheap.

The cost wall

This is the part most blog posts skip. Microsoft Research's own LazyGraphRAG benchmark on AP News measured the original GraphRAG indexing cost at ~$1,544 per million tokens versus ~$1.45 per million tokens for vector RAG. That is roughly 1000x.

The same paper introduced LazyGraphRAG, which defers graph construction to query time and uses cheaper NLP for entity extraction plus on-the-fly LLM ranking. On the same benchmark, LazyGraphRAG matched or beat GraphRAG's answer quality at ~0.1% of the indexing cost — and at its highest query budget, it outperformed GraphRAG Global Search by 16.96% on comprehensiveness and 25.7% on diversity win rates for local queries.

Darren Edge and Ha Trinh, who wrote up the LazyGraphRAG results, are also among the authors of the original GraphRAG paper. Microsoft is telling you the upfront graph is overkill for most workloads.

The cheaper default: hybrid + rerank

When the corpus is not heavily reused, the practical pattern is:

Hybrid retrieval. BM25 for lexical recall + dense embeddings for semantic recall, union the candidates.
LLM reranker. Pass the top ~50 candidates to a small cheap LLM with a relevance prompt, keep the top 5–10.
Generate. Feed those into the answer LLM.

This recovers most of the gain GraphRAG offers for entity-anchored queries, with no indexing-time graph build. The tradeoff is that you pay more per query, because every question runs the rerank step. For corpora that are queried rarely, those economics work in your favor. For corpora that are queried thousands of times a day on the same content, GraphRAG amortizes better.
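The amortization argument is easy to put numbers on. The sketch below uses the per-million-token indexing figures quoted earlier; the two per-query costs are made-up placeholders (nothing in the benchmark supplies them), so treat the output as a template for your own measurements.

```python
import math


def break_even_queries(graph_index_cost: float, cheap_index_cost: float,
                       graph_query_cost: float, cheap_query_cost: float) -> float:
    """Queries needed before the expensive index becomes the cheaper option overall."""
    extra_index_cost = graph_index_cost - cheap_index_cost
    saving_per_query = cheap_query_cost - graph_query_cost
    if saving_per_query <= 0:
        return math.inf  # the graph never pays for itself on cost alone
    return extra_index_cost / saving_per_query


# Indexing cost for a 1M-token corpus, from the figures quoted above.
GRAPH_INDEX = 1544.00
VECTOR_INDEX = 1.45

# ASSUMED per-query costs, purely illustrative. Replace with measured values.
GRAPH_QUERY = 0.01    # hypothetical
HYBRID_QUERY = 0.05   # hypothetical, includes the LLM rerank step

n = break_even_queries(GRAPH_INDEX, VECTOR_INDEX, GRAPH_QUERY, HYBRID_QUERY)
print(f"index cost ratio: {GRAPH_INDEX / VECTOR_INDEX:.0f}x")
print(f"break-even after about {n:,.0f} queries")
```

If the per-query saving is small or negative, the break-even point moves out to infinity, which is the numerical version of "defer the graph".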

When the graph still wins

Three signals say "build the graph upfront":

Reuse. The same corpus is queried heavily — knowledge bases, support docs, contract repositories — so the indexing cost amortizes over thousands of queries.
Provenance. Regulated domains where every answer needs a citation trail back to source documents. The graph's edge-level source tracking is the cleanest way to deliver that.
Repeatable thematic queries. Same kinds of "what are the patterns across X" questions, over and over. Community reports are precisely the precomputation that makes those cheap at query time.

If your workload misses all three, LazyGraphRAG or hybrid+rerank is almost certainly the right default.

A three-question decision

Before you pip install graphrag in production:

Will you reissue similar queries against the same corpus more than ~1000 times? If no, defer the graph.
Do answers need an auditable citation trail? If no, defer the graph.
Are your hardest queries thematic ("what are the main X across the whole corpus")? If no, hybrid retrieval is likely enough.

Three nos means hybrid retrieval plus an LLM reranker. Three yeses means GraphRAG earns its index cost. Mixed answers mean LazyGraphRAG is probably the right middle.
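Written as code, the three questions reduce to a small lookup. This is only a restatement of the rule of thumb above; the function and parameter names are my own.

```python
def choose_retrieval_stack(heavy_reuse: bool, needs_citations: bool,
                           thematic_queries: bool) -> str:
    """Map the three yes/no answers to a starting point, per the rule of thumb above."""
    yes_count = sum([heavy_reuse, needs_citations, thematic_queries])
    if yes_count == 3:
        return "GraphRAG"
    if yes_count == 0:
        return "hybrid retrieval + LLM reranker"
    return "LazyGraphRAG"


# Example: a rarely queried corpus whose hardest questions are thematic.
print(choose_retrieval_stack(heavy_reuse=False, needs_citations=False, thematic_queries=True))
```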

Wrapping up

GraphRAG is real engineering, not hype. It solves a problem flat vector RAG genuinely cannot solve. But the cost profile is severe enough that Microsoft itself shipped a 1000x-cheaper variant a year later. Treat the choice as an economics question, not a capability question: does your query-to-index ratio amortize a $1,500 indexing job, and do your answers need the provenance the graph gives you?

If you want to go deeper, the LazyGraphRAG announcement on the Microsoft Research blog has the full benchmark numbers, and the microsoft/graphrag repo has reference settings for several backends. Both are worth reading before you commit to a path.

What query-to-index ratio made GraphRAG worth it in your stack? Or did you end up landing on hybrid retrieval instead?

Why production RAG fails — and the boring metrics that fix it

saurabh naik — Mon, 18 May 2026 10:47:06 +0000

Most production RAG pipelines underperform for the same reason: the team treats retrieval as a solved vector-search problem, ships top-k embedding search, and then blames the generator when the answers are wrong. The "RAG is dead, long context replaces it" framing is the wrong fight. Long context doesn't fix retrieval — it hides retrieval failures behind a larger haystack while adding cost and latency.

This walkthrough is for engineers who already have a RAG prototype and want to know what to measure, what to fix, and in what order. By the end you'll have a minimal LangChain + FAISS + cross-encoder reranker pipeline and a clear separation between retrieval metrics and generation metrics.

The three components, in one paragraph

RAG is a hybrid: a non-parametric retriever (usually a dual-encoder over a chunked corpus, often paired with BM25) selects top-k passages from a document store, then a parametric LLM generates an answer conditioned on those passages. Three knobs, three failure surfaces. The original paper (Lewis et al., 2020 — arxiv.org/abs/2005.11401) introduced two variants — RAG-Sequence and RAG-Token — but in practice almost no production system jointly fine-tunes any of it. Teams freeze components and tune chunking, embeddings, and reranking.

That's the whole architecture. Everything below is about why each component fails and how to tell which one is failing.

The metrics most teams skip

If you only measure end-to-end answer quality, you cannot tell whether the retriever missed the right chunk or the generator ignored a chunk it was given. These are different bugs with different fixes. You need at least three numbers, scored on a synthetic eval set built on day one:

Retrieval recall@k — did the right chunk appear in the top-k? Computed against ground-truth passage IDs.
Faithfulness — does the generated answer actually follow from the retrieved chunks, or is it hallucinated?
Answer relevance — does the answer address the question, regardless of source?

RAGAS (Es et al., 2023, arxiv.org/abs/2309.15217) gives you reference-free versions of the last two. On the WikiEval dataset its faithfulness metric reached 0.95 agreement with human annotators, versus 0.72 for directly prompting GPT-3.5 for a score. That level of agreement is what makes automated scoring usable for iterative tuning, with human review reserved for spot checks.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Each row: a question, the retrieved chunks, the model's answer, ground truth
data = Dataset.from_dict({
    "question": questions,
    "contexts": retrieved_chunks,   # list[list[str]]
    "answer": generated_answers,
    "ground_truth": ground_truths,
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)

The point isn't the score — it's that you now have three separable signals. When faithfulness is high but answer relevance is low, your retriever missed. When faithfulness is low, your generator is ignoring context. You stop guessing.
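A minimal sketch of that triage, assuming you already have the three numbers per question or averaged over the eval set. The 0.7 threshold is an arbitrary placeholder, and the function names are mine, not part of RAGAS.

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """Fraction of ground-truth passage IDs that appear in the top-k retrieved IDs."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    return len(set(retrieved_ids[:k]) & gold_ids) / len(gold_ids)


def triage(recall: float, faithfulness: float, answer_relevance: float,
           threshold: float = 0.7) -> str:
    """Point at the component most likely at fault.

    The threshold is an assumption for illustration; calibrate it on your own data.
    """
    if recall < threshold:
        return "retriever: the right chunk never reached the prompt"
    if faithfulness < threshold:
        return "generator: the answer is not grounded in the retrieved chunks"
    if answer_relevance < threshold:
        return "retriever: the answer is grounded, but the chunks do not address the question"
    return "ok"


print(recall_at_k(["c7", "c2", "c9"], {"c2", "c4"}, k=2))
print(triage(recall=0.9, faithfulness=0.3, answer_relevance=0.8))
```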

The four real failure modes

After enough postmortems they collapse to four. Each has a different fix.

1. Chunking splits the answer

The right information exists in your corpus but it's spread across the boundary between two chunks. Neither chunk alone contains the answer, so neither retrieves well. Fix: overlap your chunks (10–20% is a reasonable start), respect semantic boundaries (sections, paragraphs) before character counts, and for long technical docs consider hierarchical chunking with parent-doc retrieval.

2. Top-k drowns the generator

You retrieved the right chunk, but you also retrieved nine weakly-related ones, and the model attends to the wrong neighbor. Bigger k is not the answer. Add a reranker (next section). Precision matters more than recall once recall@20 is acceptable.

3. Stale or duplicated index

Documents drift. The same chunk appears under three different IDs because someone re-ingested without dedup. The retriever returns three near-identical neighbors and crowds out the actually relevant one. Fix: deduplicate by content hash at ingestion, version your index, and put a TTL on anything that changes.
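Deduplicating by content hash takes only a few lines at ingestion. The sketch below assumes chunks are dicts with a "text" field (an illustrative shape, not a LangChain type) and only catches exact duplicates after whitespace and case normalization; near-duplicates need something like MinHash.

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse whitespace and case so trivial re-ingestion differences hash the same.
    return " ".join(text.lower().split())


def content_key(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk seen for each content hash and drop later duplicates."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = content_key(chunk["text"])
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


chunks = [
    {"id": "a1", "text": "Restart the worker pool after deploy."},
    {"id": "b7", "text": "restart the worker  pool after deploy."},  # re-ingested copy
    {"id": "c3", "text": "Rotate credentials every 90 days."},
]
print([c["id"] for c in dedupe_chunks(chunks)])
```

Storing the hash alongside each chunk also gives you a cheap way to skip re-embedding unchanged content on the next ingestion run.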

4. Context-faithfulness gap

The right chunk is in the prompt and the model still hallucinates. This is a generator problem. Tighten the system prompt ("answer only from the provided context; say 'I don't know' if absent"), measure faithfulness explicitly, and consider a stronger or instruction-tuned model. This is the one that looks like "RAG doesn't work" and is actually "your generator doesn't follow instructions."

Lost in the Middle: why position matters

Even when the relevant chunk is retrieved, where it sits in the context window changes whether the model uses it. Liu et al., 2023 (arxiv.org/abs/2307.03172) showed retrieval-augmented QA accuracy drops from ~75% when the relevant doc is at position 1 to ~50% when it's placed in the middle of a 20-doc context window. A 25-percentage-point swing from position alone.

This is the actual argument against "just shove everything into long context." A bigger window doesn't help if the model under-attends to the middle. A reranker that promotes the best chunk to position 1 — or that lets you safely use a smaller k — is doing real work.
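When you do need to pass several chunks, one cheap mitigation is to order them so the strongest sit at the edges of the context and the weakest in the middle. The helper below is a sketch of that idea under the assumption that the input is already ranked best-first; the function name is my own.

```python
def reorder_for_edges(ranked_docs: list[str]) -> list[str]:
    """Take a best-first list and return an order with the strongest docs at both ends.

    Even-indexed docs fill the front in rank order; odd-indexed docs fill the back
    in reverse, so the weakest documents end up in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the reranker's top pick
print(reorder_for_edges(ranked))
```

The best chunk stays at position 1 and the runner-up moves to the last position, the two places the paper found models attend to most reliably.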

A minimal LangChain + FAISS + cross-encoder reranker

Hybrid retrieval (BM25 + dense) is the cheapest precision win. A cross-encoder reranker on top is the second. Here's a stripped-down pipeline that puts both in place:

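# pip install langchain langchain-community langchain-huggingface langchain-text-splitters faiss-cpu rank_bm25 sentence-transformers
from langchain_community.vectorstores import FAISS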
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Chunk with overlap, respecting structure (sizes are in characters by default)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600, chunk_overlap=80,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". "]
)
chunks = splitter.split_documents(raw_docs)

# 2. Dense index
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
dense = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 20})

# 3. Lexical index (catches exact terms dense misses — error codes, IDs)
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 20

# 4. Hybrid
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])

# 5. Cross-encoder reranker — reorders the candidates by true query-doc relevance
ce = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
reranker = CrossEncoderReranker(model=ce, top_n=5)

retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=hybrid,
)

docs = retriever.invoke("How does the reranker change top-k recall?")

A few things to notice. The hybrid retriever pulls up to 20 candidates from each backend; the cross-encoder then scores each query-document pair jointly (it reads both texts together instead of comparing two precomputed vectors) and returns the top 5. Final k to the generator is small, which both limits your exposure to the Lost-in-the-Middle problem and cuts your token cost. On a real corpus, switching from a bi-encoder-only setup to this usually moves retrieval recall@5 by double-digit points without touching the generator, but treat that as a rule of thumb and confirm it on your own eval set.

Note: Cross-encoders are slow per pair because they recompute attention over the concatenated query+doc. That's why you only run them on the few dozen candidates from the cheap retrievers, not the whole corpus.
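For intuition about the hybrid step: rank fusion merges two ranked lists without needing their scores to be comparable. The sketch below is a standalone weighted Reciprocal Rank Fusion, which to my understanding is the approach EnsembleRetriever takes. The constant c=60 is the conventional default and the document IDs are made up.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns weight / (rank + c) from every list it appears in."""
    if len(rankings) != len(weights):
        raise ValueError("need one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["err-503", "runbook", "faq"]          # lexical ranking
dense_hits = ["runbook", "postmortem", "err-503"]  # embedding ranking
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.4, 0.6])
print(fused)
```

Documents that both retrievers agree on float to the top, which is why the exact-match hit for an error code survives even when the dense retriever ranks it low.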

The fix order that has actually worked

If you only have time for a few fixes this week, do them in this order:

Build a 50–100 question synthetic eval set with ground-truth chunk IDs. Without it you're flying blind.
Add BM25 alongside your dense retriever and ensemble them. Cheapest precision gain.
Add a cross-encoder reranker. Measure recall@5 before and after.
Wire up RAGAS faithfulness + answer-relevance so you can separate retriever bugs from generator bugs.
Only then think about query rewriting, HyDE, fine-tuning embeddings, or a bigger generator.

The interesting work is at the top of that list, not the bottom. Most teams reverse it.

Wrapping up

Retrieval recall is its own metric. Measure it separately from answer quality, or you'll keep blaming the generator for the retriever's miss. Long context doesn't replace retrieval — it just hides which one of your four failure modes is the one biting you.

Two follow-ups worth a read if you want to go deeper:

Jason Liu's "systematically improving RAG" writing (jxnl.co/writing/) — the most practical eval-driven approach I've seen.
GraphRAG for cases where your corpus has real entity-relationship structure and dense retrieval keeps missing the connection.

What's the one retrieval failure that took you longest to diagnose? I'm curious how often it turned out to be chunking vs. reranking vs. the generator just ignoring the context.

Chunking for RAG: stop tuning the wrong knob

saurabh naik — Mon, 18 May 2026 07:00:35 +0000

Every other week a new "smart" chunking strategy lands on AI Twitter — semantic, agentic, propositional, late chunking. Meanwhile the two boring knobs that actually move retrieval quality (chunk size and overlap) sit at whatever default a tutorial picked in 2023.

This post is for engineers shipping RAG who want a defensible chunking choice instead of a vibes-based one. By the end you'll have: a clear picture of what the recent research says, a working Python eval harness that compares chunking strategies on your own data, and a concrete production default to start from.

The chunking strategies, very briefly

There are basically four families in the wild:

Fixed-size: split every N tokens. Fastest, dumbest, cuts mid-sentence.
Recursive character splitting (LangChain's RecursiveCharacterTextSplitter): tries paragraph → sentence → word until chunks fit. The pragmatic default for prose.
Document-structure-aware: split on Markdown headers, HTML tags, or code AST nodes. Keeps logical sections intact.
Semantic chunking (LlamaIndex's SemanticSplitterNodeParser and friends): embed each sentence, cut where adjacent-embedding distance spikes past a percentile. Topically coherent, much more expensive.

The intuition for the last one is seductive — "let embeddings decide where ideas end." That's also the one that doesn't reliably pay off.

What recent research actually shows

Three independent results are worth knowing before you pick a strategy.

Chroma's chunking eval (Brandon Smith and Anton Troynikov) tested embedding-similarity splitters and LLM cluster chunkers against naive recursive and fixed-size chunking, scored with Intersection-over-Union and Recall on multiple corpora. The headline: semantic methods showed inconsistent, often negligible gains. Sometimes they lost. The dominant variables were chunk size and overlap, not the splitter. Default RecursiveCharacterTextSplitter at ~200–400 tokens was a strong baseline.

Databricks Mosaic AI's FinanceBench sweep went the other direction — fix the splitter (recursive), vary chunk size, measure answer correctness end-to-end:

512-token chunks → ~36% correctness
1024 → ~42%
2048 → ~45%
4096 → ~47%

Bumping overlap from 20% to 50% added less than a point and roughly doubled the index. In other words, larger chunks helped more than fancier splitting — and overlap mostly bought you a bigger index.
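The index growth from overlap is simple arithmetic. The sketch below counts chunks for an idealized sliding window, which is an assumption: a real recursive splitter snaps to separators, so actual counts will differ somewhat.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap_pct: float) -> int:
    """Number of chunks a fixed sliding window produces for a given fractional overlap."""
    overlap = int(chunk_size * overlap_pct)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    return 1 + math.ceil((total_tokens - chunk_size) / step)


corpus_tokens = 1_000_000
baseline = chunk_count(corpus_tokens, 1024, 0.0)
for pct in (0.0, 0.2, 0.5):
    n = chunk_count(corpus_tokens, 1024, pct)
    print(f"overlap={pct:.0%}  chunks={n}  vs no overlap: {n / baseline:.2f}x")
```

Under this idealized model, 50% overlap produces about twice as many chunks as no overlap and about 1.6x as many as 20% overlap, and every extra chunk is another embedding to compute, store, and search.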

Anthropic's Contextual Retrieval is the one place "smart" preprocessing clearly paid off. Their move wasn't splitting cleverly; it was augmenting each chunk with ~50–100 tokens of LLM-generated context before embedding:

Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
Add Contextual BM25: 49% reduction
Add a reranker: 67% reduction
Indexing cost: ~$1.02 per million document tokens with Claude Haiku + prompt caching
Their sweet spot: 800 tokens, 100-token overlap

The pattern across all three: optimize the cheap knobs first (size, overlap), then augment chunks if you need more, and treat semantic splitting as a last resort.

A small eval harness you can actually run

You don't need a benchmark suite to make this call on your own corpus. Forty labeled (question, expected_snippet) pairs and an afternoon will do it. Here's the minimal harness.

# pip install langchain langchain-community sentence-transformers faiss-cpu
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

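def build_index(docs, chunk_size, chunk_overlap):
    # Note: chunk_size and chunk_overlap are measured in characters here.
    # To sweep in tokens, build the splitter with
    # RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) instead.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.create_documents(docs)
    return FAISS.from_documents(chunks, embeddings)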

def recall_at_k(index, eval_set, k=5):
    hits = 0
    for q, expected in eval_set:
        results = index.similarity_search(q, k=k)
        if any(expected.lower() in r.page_content.lower() for r in results):
            hits += 1
    return hits / len(eval_set)

Now sweep:

docs = [open("corpus.txt").read()]  # your real corpus
eval_set = [
    ("What's the refund policy?", "refunds are issued within 14 days"),
    # ... 40 of these
]

for size in [256, 512, 1024, 2048]:
    for overlap_pct in [0.0, 0.1, 0.2]:
        idx = build_index(docs, size, int(size * overlap_pct))
        score = recall_at_k(idx, eval_set, k=5)
        print(f"size={size:>4} overlap={int(overlap_pct*100):>2}% recall@5={score:.3f}")

The first time you run this on real data it's deflating in a useful way. Most teams discover the difference between their current setup and the best cell in this grid is bigger than the difference between any two splitter algorithms.

Note: "Expected snippet appears in any top-k chunk" is a coarse metric. It's fine for picking between configs; for production-grade evals you want a proper retrieval IoU or a downstream answer-correctness score, ideally with an LLM-as-judge over (question, retrieved_chunks, ground_truth_answer).
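For reference, a retrieval IoU can be sketched over character offsets. This is a simplified stand-in for the token-level metric used in published evals, and it assumes you have recorded each chunk's (start, end) position in the source document, which the harness above does not do.

```python
def span_iou(retrieved: list[tuple[int, int]], relevant: list[tuple[int, int]]) -> float:
    """Intersection-over-Union of two sets of half-open (start, end) character spans."""
    retrieved_chars = {i for start, end in retrieved for i in range(start, end)}
    relevant_chars = {i for start, end in relevant for i in range(start, end)}
    union = retrieved_chars | relevant_chars
    if not union:
        return 0.0
    return len(retrieved_chars & relevant_chars) / len(union)


# One retrieved chunk covering chars 0-100, gold answer span at chars 50-150.
print(round(span_iou([(0, 100)], [(50, 150)]), 3))
```

Unlike the snippet-match check, IoU penalizes a config that finds the answer only by retrieving huge chunks full of irrelevant text, so it is a fairer way to compare large and small chunk sizes.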

When semantic chunking is worth the bill

The Chroma study isn't a blanket "never use it." Semantic splitting helps when:

Your corpus is very heterogeneous in topic density — long technical docs that mix narrative explanation with dense reference tables, for example.
Your chunks need to be smaller than recursive splitting can keep coherent (e.g., 200-token chunks where every cut on a paragraph boundary truncates an idea).
You're already running cheap embeddings on every sentence for another reason.

If none of those apply, you're paying 10–100× the preprocessing cost to lose to a tuned recursive splitter.

Add context to chunks, not cleverness to splits

Once your size sweep stops moving the needle, the next lever isn't a fancier splitter — it's giving each chunk more context. Anthropic's Contextual Retrieval is the cleanest version of this idea:

CONTEXT_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk in the
overall document. Answer only with the context, nothing else."""

def contextualize(client, whole_doc, chunk):
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_doc, chunk=chunk
            ),
        }],
    )
    return msg.content[0].text

# Then embed: f"{context}\n\n{chunk}" instead of just chunk

In production you almost certainly want prompt caching on whole_document — that's what gets the per-token indexing cost down to roughly $1 per million document tokens. Without caching, this approach is too expensive to be a default; with it, it's a reasonable line item.

You combine that with BM25 on the same contextualized text and a reranker on top of the union of dense + sparse hits, and you've reproduced most of the 67% retrieval-failure reduction Anthropic reported — without ever leaving recursive chunking.

Honest tradeoffs

A few things this post is not claiming:

That recursive chunking is optimal. It's a strong default. Your corpus might beat it with structure-aware splitting (Markdown headers, code AST) — that's worth trying before semantic chunking and is usually cheaper at index time too.
That bigger chunks are always better. The Mosaic AI sweep showed monotonic gains to 4096, but they were also running a long-context model. With an 8k-context generator, dumping 4k-token chunks limits how many you can stuff into the prompt. The right answer depends on your generator and your top-k.
That contextual retrieval is free. It costs an LLM call per chunk at index time. Worth it for high-value, slow-churn corpora (product docs, legal). Probably not for a corpus you re-index hourly.

Wrapping up

If you've never tuned chunking, the play is:

Start with RecursiveCharacterTextSplitter, 1024 tokens, 10–20% overlap.
Build a small (40–100) labeled eval set on your real corpus.
Sweep chunk size and overlap. Pick the best cell.
If retrieval is still the bottleneck, add contextual retrieval + BM25 + a reranker before you reach for semantic splitting.

The boring knob beats the smart algorithm in most real systems. Tune it.

What chunk size are you running in production — and is it a tuned number, or the default from a LangChain tutorial? Curious how many teams have actually swept this.

Chunking in RAG: why your splitter matters more than your embedding model

saurabh naik — Mon, 18 May 2026 06:23:45 +0000

Most RAG retrieval problems I've debugged came down to the same thing: someone swapped the embedding model three times, added a reranker, then gave up — and never once changed the chunker.

This is backwards. The chunker decides what your embedding model is allowed to see. A great embedding on a bad chunk is still a bad retrieval. And the published research from the last 18 months keeps pointing at the same conclusion: the "smart" chunking strategies don't beat a tuned dumb one. What does beat them is augmenting each chunk with context.

This post walks through the four chunking strategies you'll actually run into, why semantic chunking disappoints on benchmarks, and a working contextual retrieval implementation with the numbers from Anthropic's report. By the end you should have a default chunking recipe you can defend with data, not vibes.

The four chunking strategies

Almost every chunker in the wild is a variation of one of these.

1. Fixed-size

Split every N tokens (or characters) with some overlap.

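def fixed_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = text.split()  # whitespace-separated words stand in for real tokens
    step = size - overlap
    # Stop once a window reaches the end, so the last chunk is never pure overlap
    starts = range(0, max(len(tokens) - overlap, 1), step) if tokens else []
    return [" ".join(tokens[i:i + size]) for i in starts]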

Fast and reproducible. Cuts mid-sentence. Useful as a baseline so you have something to beat.

2. Recursive character splitting

The LangChain default. Tries paragraph breaks first, then sentences, then words — recursing until each chunk fits.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
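    chunk_size=800,     # measured in characters by default, not tokens
    chunk_overlap=100,  # use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) for token counts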
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)

This is the pragmatic default for prose. It respects natural breaks when it can, falls back to character splits when it can't.

3. Document-structure-aware

Uses the document's own structure as the split signal — Markdown headers, HTML tags, code AST nodes. The chunks carry the section path as metadata, which is gold for filtering at retrieval time.

from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
chunks = splitter.split_text(markdown_doc)
# each chunk's metadata: {"h1": "...", "h2": "...", "h3": "..."}

Use this whenever your source has structure. Throwing it away to run recursive character splitting is a self-inflicted wound.

4. Semantic chunking

Embed each sentence, walk through the document, and start a new chunk every time the cosine distance between adjacent sentences exceeds a percentile threshold.

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)

Intuitively appealing: topically coherent chunks should retrieve better. But it costs you an embedding call per sentence at index time.
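To make the mechanism concrete without an API key, here is a toy version of the same loop. The bag-of-words `embed` function is a deliberately crude stand-in for a real embedding model, and the nearest-rank percentile is a simplification, so this shows the control flow only, not LlamaIndex's implementation.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, kept simple for illustration."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def semantic_chunks(sentences: list[str], threshold_pct: float = 95) -> list[list[str]]:
    if len(sentences) < 2:
        return [sentences] if sentences else []
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist >= cutoff:  # distance at or above the cutoff starts a new chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "the cache stores hot keys in memory",
    "the cache evicts cold keys from memory",
    "invoices are emailed every month",
    "late invoices are flagged every month",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Note that the cutoff is relative to the document's own distance distribution, so every document gets split somewhere, whether or not it actually changes topic.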

The intuition is wrong often enough to matter.

Why semantic chunking often disappoints

Chroma Research ran a careful evaluation in 2024 (Brandon Smith and Anton Troynikov, the latter a Chroma co-founder). They tested embedding-similarity splitters and LLM cluster chunkers against plain recursive and fixed-size chunking across multiple corpora, scoring with Intersection-over-Union and Recall.

The headline result: semantic methods produced inconsistent, often negligible gains. Sometimes they lost. Meanwhile they cost orders of magnitude more in embedding and LLM calls at index time.

The dominant variables across every experiment were chunk size and overlap, not the splitting strategy. A RecursiveCharacterTextSplitter at the right size was a hard-to-beat baseline.

If you're going to spend engineering hours, spend them on a chunk-size sweep, not on a smarter splitter.

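import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    # Fraction of the gold chunk IDs that show up in the top-k results
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# chunk_corpus, embed_and_index, and index.search are placeholders for your own
# pipeline; index.search(q) is assumed to return ranked chunk IDs.
# Chunk IDs change with chunk size, so gold labels must be remapped for each size
# (for example, by recording which source spans answer each question).
# Sweep chunk_size with everything else held constant
for size in [256, 400, 600, 800, 1000, 1200]:
    chunks = chunk_corpus(documents, size=size, overlap=size // 8)
    index = embed_and_index(chunks)
    scores = [recall_at_k(index.search(q), gold) for q, gold in eval_set]
    print(f"size={size}  recall@5={np.mean(scores):.3f}")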

You will usually see a clear curve, not a flat line. Pick the peak. Don't ship a default you never measured.

What actually moves the needle: contextual retrieval

The interesting move isn't a smarter splitter. It's keeping the splitter dumb and giving each chunk back the context it lost when you split it.

This is Anthropic's contextual retrieval recipe. For every chunk, prompt a cheap model with the full document and the chunk, and ask for 50-100 tokens of situating context. Prepend that context to the chunk before embedding.

import anthropic

client = anthropic.Anthropic()

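DOC_PROMPT = """<document>
{doc}
</document>"""

CHUNK_PROMPT = """Here is a chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk
within the overall document for retrieval. Answer only with the
succinct context."""

def contextualize(doc: str, chunk: str) -> str:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                # Block 1: identical for every chunk of this document, so it can be cached.
                {"type": "text", "text": DOC_PROMPT.format(doc=doc),
                 "cache_control": {"type": "ephemeral"}},
                # Block 2: changes per chunk, so it sits after the cache breakpoint.
                {"type": "text", "text": CHUNK_PROMPT.format(chunk=chunk)},
            ],
        }],
    )
    return msg.content[0].text

augmented_chunks = [
    f"{contextualize(doc, c)}\n\n{c}" for c in chunks
]

The cache_control block matters, and so does where it sits. Caching matches on the prompt prefix up to the marked block, so the document goes in its own block and the per-chunk text comes after it. Put both in one block and every call has a different prefix, so nothing is reused. Without prompt caching you pay the full document token cost per chunk. With it, the document is written to the cache once and read back on later chunk calls, and Anthropic prices cached reads at about a tenth of normal input tokens. Two caveats: the cache entry expires after a few minutes of inactivity, so process a document's chunks back to back, and documents below the model's minimum cacheable length are not cached at all.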

The reported numbers on their evaluation corpus (codebases, papers, fiction; top-20 retrieval failure rate):

Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
+ Contextual BM25 (the same context augmentation applied to a BM25 index): 49% fewer (5.7% → 2.9%)
+ a reranker on top of both: 67% fewer (5.7% → 1.9%)
One-time indexing cost: ~$1.02 per million document tokens with Haiku + prompt caching
Chunk size assumed in their cost estimate: 800 tokens, with roughly 100 tokens of generated context per chunk

The 800-token figure is worth pausing on. It's not "256 because that's what the tutorial said." It's not "1024 because the context window is big." But it is the chunk size Anthropic assumed when pricing the pipeline, not a measured optimum for your corpus. Yours may land somewhere similar or somewhere quite different: run the sweep.
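The percentages in that list are relative reductions in the failure rate, not percentage-point drops. A quick check, using only the rates quoted above:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of baseline failures removed, e.g. 5.7% -> 3.7% is about 0.35."""
    return (baseline - improved) / baseline

BASELINE = 5.7  # top-20 retrieval failure rate (%) without contextual retrieval

for label, rate in [
    ("contextual embeddings", 3.7),
    ("+ contextual BM25", 2.9),
    ("+ reranker", 1.9),
]:
    print(f"{label}: {relative_reduction(BASELINE, rate):.0%} fewer failed retrievals")
```

Run the same calculation on your own eval set before quoting a number for your pipeline: a 35% relative improvement on a 5.7% failure rate is two percentage points.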

When contextual retrieval pays for itself

Indexing cost goes up. Query-time cost is unchanged. So the math is:

How often do you re-index? If you re-index weekly on a 100M-token corpus, that's ~$100/week. Trivial for most production systems.
What's a retrieval miss worth? In support automation a single wrong answer can be measured in minutes of human time. The math is usually obvious.

Where it doesn't pay: tiny corpora (< 1M tokens) where you can fit everything in context anyway, or extremely high-churn corpora where you re-embed many times a day. Everything else, run it.
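The math fits in a few lines. This is a rough sketch that takes the ~$1.02 per million tokens figure at face value; `misses_avoided_per_week` and `cost_per_miss` are hypothetical inputs you would estimate from your own logs.

```python
COST_PER_M_TOKENS = 1.02  # USD per million document tokens, the figure quoted above

def weekly_indexing_cost(corpus_tokens: int, reindexes_per_week: float) -> float:
    """Context-generation cost per week in USD."""
    return corpus_tokens / 1_000_000 * COST_PER_M_TOKENS * reindexes_per_week

def pays_for_itself(corpus_tokens: int, reindexes_per_week: float,
                    misses_avoided_per_week: float, cost_per_miss: float) -> bool:
    """True when the value of avoided retrieval misses covers the indexing bill."""
    value = misses_avoided_per_week * cost_per_miss
    return value >= weekly_indexing_cost(corpus_tokens, reindexes_per_week)

weekly = weekly_indexing_cost(100_000_000, reindexes_per_week=1)
churny = weekly_indexing_cost(100_000_000, reindexes_per_week=21)  # three times a day
print(f"weekly re-index: ${weekly:,.0f}   3x/day re-index: ${churny:,.0f}")
```

The high-churn case is where the bill stops being trivial, which is why incremental re-indexing (only re-contextualizing changed documents) is worth considering before ruling the technique out.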

Note: Contextual retrieval is additive with everything else. Recursive splitter, document-structure-aware metadata, BM25 hybrid, reranker — they all stack. The 67% number assumes the full stack. Don't read that line as "the reranker is doing nothing."

A default recipe to start from

If you're staring at a blank file, this is a reasonable first pass:

Recursive character splitter at 800 tokens, 100 overlap.
Preserve any structural metadata (Markdown headers, file paths) as chunk metadata.
Add 50-100 tokens of LLM-generated context per chunk with Haiku + prompt caching.
Hybrid: vector index + BM25 over the same augmented chunks.
Rerank top-20 down to top-5 with a cross-encoder.
Build a 100-query eval set from real user logs and run a chunk-size sweep against your corpus before treating any of this as settled.

The last step, the eval set and the sweep, is the one most teams skip. Don't.
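The recipe says "hybrid" without saying how to merge the two result lists. One common choice is reciprocal rank fusion (RRF). That is my assumption here, not something the recipe prescribes. A minimal sketch with hypothetical chunk ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; k=60 is a conventional default, not a tuned value."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Chunks ranked high in any list get a large share; agreement across lists adds up.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c3", "c1", "c7", "c2"]  # hypothetical top results from the vector index
bm25_hits = ["c1", "c9", "c3", "c4"]    # hypothetical top results from BM25

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused[:3])
```

RRF only looks at ranks, so you don't have to normalize cosine similarities against BM25 scores. The fused top-20 is what you would hand to the cross-encoder.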

Wrapping up

Chunking is one of the highest-leverage things in a RAG pipeline and one of the least-measured. The cheap experiments — sweeping chunk size, adding contextual augmentation — usually beat the expensive ones (a fancier embedding model, a third reranker).

Two links worth your time next:

Chroma's chunking evaluation: https://research.trychroma.com/evaluating-chunking
Anthropic's contextual retrieval writeup: https://www.anthropic.com/news/contextual-retrieval

What chunk size do you run in production — and have you actually benchmarked it against alternatives, or is it still the framework default? I'm curious how often teams have a measured answer here.

RLHF in 2026: when to pick PPO, DPO, or verifier-based RL

saurabh naik — Sat, 16 May 2026 09:37:14 +0000

The famous InstructGPT result is still the cleanest argument for post-training: labelers preferred outputs from a 1.3B aligned model over those from the 175B GPT-3 base, and at equal size the 175B InstructGPT model was preferred over GPT-3 about 85% of the time. Alignment beat a 100x scale gap.

That result got a lot of people to implement RLHF. Most of them later ripped it out and switched to DPO. A smaller group skipped both and went to verifier-based RL.

This post is the decision tree I wish I'd had when I started: what each pipeline actually looks like in TRL, where it breaks, and which one you should reach for first in 2026. The code blocks are runnable end-to-end against open weights — pick one and you have a working stack by tomorrow.

The three-way choice

Before any code, the picture:

PPO RLHF — sample, score with a reward model, update with PPO under a KL leash. The original InstructGPT recipe. Powerful, fiddly, expensive.
DPO — collapse the reward model and the RL loop into a single supervised loss on preference pairs. Trains like SFT, no sampling loop.
RLVR — verifier-based RL. The reward is ground truth (unit tests pass, math answer is correct, JSON parses). No human preferences at all.

A rough rule that holds in most post-training shops I've talked to:

Style, tone, instruction-following → DPO by default, PPO only if you can afford on-policy sampling.
Math, code, structured output, tool-use → RLVR. Don't waste a reward model on something a checker can score.
Mixed product behavior → SFT first, then DPO, then a verifier-RL pass on the verifiable slices.

The rest of this post is the why behind that rule, and the actual training code.

SFT first, always

Every pipeline below assumes you've done SFT. The SFT model is both the starting policy for the RL/DPO step and the frozen reference the KL term anchors against.

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(5000))

trainer = SFTTrainer(
    model=MODEL,
    train_dataset=ds,
    args=SFTConfig(
        output_dir="qwen-sft",
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    processing_class=tokenizer,  # called `tokenizer` in older TRL releases
)
trainer.train()

SFT teaches the model to imitate a fixed target. It runs out of road the moment "good" isn't a single sentence away — helpfulness, tone, "did you actually answer the question" are comparative judgments, not next-token predictions. That's the whole reason the other stages exist.

Path A: classical PPO RLHF

Step 1 — train the reward model

The reward model (RM) is a scalar head on top of a transformer: (prompt, response) → r. You train it on pairwise comparisons with the Bradley-Terry loss:

L = -log σ(r(x, y_chosen) - r(x, y_rejected))

Translation: push the score of the chosen response above the rejected one, by enough margin that the sigmoid of the score gap matches how often humans prefer the chosen response.
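A tiny numeric version makes the shape of the loss concrete. This is a plain-Python sketch for a single pair, not what RewardTrainer runs internally; the scores are made up.

```python
import math

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response wins: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(gap)), written as a numerically stable softplus(-gap)."""
    gap = r_chosen - r_rejected
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))

for r_c, r_r in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    p = preference_probability(r_c, r_r)
    print(f"gap={r_c - r_r:+.1f}  P(chosen wins)={p:.3f}  loss={bradley_terry_loss(r_c, r_r):.3f}")
```

Equal scores give a loss of log 2 (a coin flip), a correct ordering drives the loss toward zero, and a wrong ordering is penalized roughly linearly in the gap. Only the difference between scores matters, so the absolute scale of the RM output is arbitrary.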

from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

RM_BASE = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(RM_BASE)
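model = AutoModelForSequenceClassification.from_pretrained(RM_BASE, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id  # sequence classifiers need a pad id for batched inputs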

ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train").select(range(10000))

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(
        output_dir="qwen-rm",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        max_length=1024,
    ),
    train_dataset=ds,
    processing_class=tokenizer,  # called `tokenizer` in older TRL releases
)
trainer.train()

Warning: the RM overfits fast. Track validation pairwise accuracy, not training loss. If train accuracy keeps climbing while eval plateaus around 0.65–0.70, stop training. A slightly underfit RM is far better than a sharp one — sharp RMs are the easiest to exploit.
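Here is one way to turn that warning into a check. It's a sketch: `patience` and `min_delta` are arbitrary knobs I picked for illustration, and the score lists stand in for RM outputs on a held-out set of preference pairs.

```python
def pairwise_accuracy(chosen_scores: list[float], rejected_scores: list[float]) -> float:
    """Fraction of held-out pairs where the RM scores the chosen response higher."""
    pairs = list(zip(chosen_scores, rejected_scores))
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(c > r for c, r in pairs) / len(pairs)

def should_stop(eval_acc_history: list[float], patience: int = 2,
                min_delta: float = 0.005) -> bool:
    """True once the last `patience` evals failed to beat the earlier best by `min_delta`."""
    if len(eval_acc_history) <= patience:
        return False
    best_before = max(eval_acc_history[:-patience])
    recent_best = max(eval_acc_history[-patience:])
    return recent_best < best_before + min_delta

print(pairwise_accuracy([1.2, 0.3, 0.9], [0.4, 0.8, 0.1]))  # 2 of 3 pairs ordered correctly
print(should_stop([0.60, 0.66, 0.68, 0.681, 0.679]))         # eval accuracy has plateaued
```

Feed `should_stop` the eval accuracy after each evaluation pass and keep the checkpoint from the earlier best, not the last one.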

OpenAI used a 6B RM against a 175B policy. The RM doesn't need to be as big as the policy; it just needs to be a stable judge.

Step 2 — PPO with a KL penalty

PPO samples completions from the current policy, scores them with the RM, and updates the policy with clipped policy-gradient. The KL penalty is what keeps the run from imploding:

r_total = r_RM(x, y) - β · KL(π_θ(·|x) || π_ref(·|x))

Drop the KL term and the policy walks off the manifold the RM was trained on, finds a strange region of token space that scores high, and produces nonsense. With KL, every step is leashed to the SFT reference.
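In practice the KL term is estimated from the sampled completion itself: sum the per-token log-probability gap between the policy and the reference. The sketch below does that for a single completion with made-up numbers. It illustrates the formula above and is not a copy of how any particular trainer applies the penalty, which is often per token.

```python
def kl_penalized_reward(rm_score: float, policy_logprobs: list[float],
                        ref_logprobs: list[float], beta: float = 0.05) -> float:
    """r_total = r_RM - beta * KL, with KL estimated from one sampled completion."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need one log-prob per generated token from each model")
    # Single-sample estimate: sum over tokens of log pi_theta(y_t) - log pi_ref(y_t).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical 3-token completion: the policy is more confident than the reference.
policy_lp = [-0.5, -1.0, -0.2]
ref_lp = [-0.9, -1.4, -1.0]

print(kl_penalized_reward(2.0, policy_lp, ref_lp, beta=0.05))
print(kl_penalized_reward(2.0, ref_lp, ref_lp, beta=0.05))  # no drift, no penalty
```

The further the policy's token probabilities drift from the reference on what it actually generates, the more reward it gives back, and `beta` sets the exchange rate.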

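from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOConfig, PPOTrainer

# Written against the newer TRL PPOTrainer (roughly 0.12 and later), which takes plain
# causal-LM policies plus a separate value model. Later releases may relocate
# PPOTrainer, so check the import path for your version.
tokenizer = AutoTokenizer.from_pretrained("qwen-sft", padding_side="left")
policy = AutoModelForCausalLM.from_pretrained("qwen-sft")
ref = AutoModelForCausalLM.from_pretrained("qwen-sft")  # frozen reference
rm = AutoModelForSequenceClassification.from_pretrained("qwen-rm", num_labels=1)
# The critic starts from the RM weights, a common initialization.
value = AutoModelForSequenceClassification.from_pretrained("qwen-rm", num_labels=1)

config = PPOConfig(
    output_dir="qwen-ppo",
    learning_rate=1e-6,
    per_device_train_batch_size=4,
    num_mini_batches=2,  # replaces mini_batch_size from the older PPOConfig
    num_ppo_epochs=4,
    kl_coef=0.05,        # β: start here
    cliprange=0.2,
    cliprange_value=0.2,
    bf16=True,
)

# prompt_dataset is assumed to be prepared elsewhere: prompts only, already
# tokenized into an `input_ids` column. PPO samples its own completions.
trainer = PPOTrainer(
    args=config,
    processing_class=tokenizer,
    model=policy,
    ref_model=ref,
    reward_model=rm,
    value_model=value,
    train_dataset=prompt_dataset,
)
trainer.train()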

Three dashboards to keep open:

Mean reward — should rise, then plateau. If it keeps climbing while the separate judge's scores stall or drop, the policy is hacking the RM.
KL to reference — should stay bounded. A spike means the policy is sprinting away from SFT. Raise kl_coef.
A separate judge on held-out prompts — never trust the RM as ground truth. Read samples, or score with a different model entirely.

kl_coef between 0.02 and 0.2 covers most cases. I start at 0.05 and only move it when the KL graph misbehaves.

Why PPO breaks

After a few runs the failure modes get predictable:

Reward hacking — the policy finds outputs the RM loves and humans don't. Karpathy's line that RLHF is "just barely RL" is exactly this — the RM is a vibe check trained on a few thousand comparisons, and the policy is a much stronger optimizer than the RM is a judge.
Sycophancy — if labelers preferred responses that agreed with them, the RM learns "agreement = good," and the policy agrees with factual errors. Fix the data, not the optimizer.
Mode collapse — the policy narrows onto a few high-reward templates. Entropy drops, and you'll see the same opener over and over at temperature 1.0.
Alignment tax — RLHF'd models often regress on raw capability benchmarks like MMLU. You're trading capability for instruction-following, which is the right call for chat products and the wrong one for a model used as a backbone.
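Mode collapse is the easiest of these to catch mechanically. A crude sketch: sample a batch at temperature 1.0 and count how many completions share an opener. The four-word window and whatever alert threshold you choose are arbitrary assumptions; a proper diagnostic would also track token-level entropy.

```python
from collections import Counter

def opener_stats(samples: list[str], n_words: int = 4) -> dict:
    """Summarize how repetitive the first `n_words` words are across sampled completions."""
    if not samples:
        raise ValueError("need at least one sample")
    openers = [" ".join(s.split()[:n_words]).lower() for s in samples]
    counts = Counter(openers)
    top_opener, top_count = counts.most_common(1)[0]
    return {
        "distinct_ratio": len(counts) / len(openers),  # 1.0 means every opener is unique
        "top_opener": top_opener,
        "top_share": top_count / len(openers),         # share taken by the most common opener
    }

samples = [
    "Great question! Here is a breakdown of the options.",
    "Great question! Here is what I would do.",
    "Great question! Here is the short answer.",
    "The capital of France is Paris.",
]
print(opener_stats(samples, n_words=3))
```

Log this per checkpoint next to the KL curve. A `top_share` that climbs while mean reward also climbs is a strong hint the policy has found a template the RM likes.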

Path B: DPO — skip the RL loop

Direct Preference Optimization (Rafailov et al., 2023) folds the RM and PPO into a single supervised loss directly on preference pairs:

L_DPO = -log σ(β · [log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)])

No reward model. No sampling loop. No value head. Same (chosen, rejected) data as the RM stage above, plus your frozen reference policy.
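The loss is easier to read with numbers plugged in. A plain-Python sketch for one preference pair, where each argument is the summed log-probability of a full response under the policy or the reference (values are made up):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(log pi/ref for chosen) - (log pi/ref for rejected)])."""
    chosen_logratio = pi_chosen - ref_chosen        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = pi_rejected - ref_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    # Stable softplus(-logits), identical to -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# At initialization the policy equals the reference, so the loss is log 2.
print(dpo_loss(-12.0, -13.0, -12.0, -13.0))

# After some training: chosen became more likely, rejected less likely, relative to the reference.
print(dpo_loss(-10.0, -15.0, -12.0, -13.0))
```

The reference log-probs never change during training, so they can be computed once up front. Notice the loss only cares about movement relative to the reference, which is where the implicit KL anchoring comes from.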

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

policy = AutoModelForCausalLM.from_pretrained("qwen-sft")
ref = AutoModelForCausalLM.from_pretrained("qwen-sft")
tokenizer = AutoTokenizer.from_pretrained("qwen-sft")

ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train").select(range(10000))

trainer = DPOTrainer(
    model=policy,
    ref_model=ref,
    args=DPOConfig(
        output_dir="qwen-dpo",
        per_device_train_batch_size=4,
        learning_rate=5e-7,
        beta=0.1,                # KL strength, same role as PPO's kl_coef
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=ds,
    processing_class=tokenizer,  # called `tokenizer` in older TRL releases
)
trainer.train()

DPO wins when:

You have static preference data and don't want to maintain an RM service.
You want a training run that looks like SFT operationally — same trainer pattern, same monitoring, same failure profile.
You don't need on-policy exploration. DPO learns from a fixed dataset; PPO can sample fresh comparisons.

PPO still wins when:

You can generate fresh comparisons mid-training (online RLHF).
The preference signal is non-stationary and DPO's frozen dataset goes stale.
You're running a frontier-scale RM whose inference cost is justified.

For most teams shipping post-training in 2026, DPO (or its variants — IPO, KTO, SimPO) is the default. PPO RLHF still earns its place at the top of the budget curve.

Path C: RLVR — when you have a real checker

For domains with a real verifier — math, code, structured output, tool-use success — RLVR sidesteps the reward-model problem entirely. The reward is ground truth, not a learned vibe check. DeepSeek-R1 and o1-style training are the canonical examples.

The pipeline shape:

Sample completions from the policy.
Run them through a checker — execute the code, check the math answer, validate the JSON.
Reward = 1 if pass, 0 if fail (or a richer shaped reward if you have partial credit).
Update the policy against that reward with PPO (or GRPO, which DeepSeek-R1 used), KL-anchored to the SFT reference exactly like classical RLHF.
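The checker in step two can be very small. Two illustrative verifiers follow, both hypothetical: one for structured output and one for math answers. The "last number in the completion" rule is a simplification I chose for the sketch. Real setups usually require a delimited final answer so the policy can't score by listing candidates.

```python
import json
import re

def json_reward(completion: str, required_keys: tuple[str, ...] = ()) -> float:
    """1.0 if the completion parses as a JSON object containing every required key, else 0.0."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    return 1.0 if all(key in obj for key in required_keys) else 0.0

def math_reward(completion: str, gold: float, tol: float = 1e-6) -> float:
    """1.0 if the last number mentioned in the completion matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - gold) < tol else 0.0

print(json_reward('{"name": "ada", "age": 36}', required_keys=("name", "age")))
print(json_reward('{"name": "ada", ', required_keys=("name",)))
print(math_reward("3 * 4 = 12, so the answer is 12.", gold=12))
```

Both return a binary reward, matching the pass/fail shape above. Partial credit (for example, valid JSON with a missing key) is where shaped rewards come in, and also where checker loopholes start to appear.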

The big advantage is that reward hacking gets much harder. A unit test either passes or it doesn't — there's no spurious phrase the policy can latch onto. The reward signal scales with capability instead of fighting it, which is why the recent capability jumps on reasoning benchmarks came from this direction rather than from bigger RMs.

The catch: it only works where you can build a cheap, reliable checker. For "is this helpful and polite," you're still in RLHF/DPO territory.

Putting it together

The minimal mental model:

SFT gives you a model that follows instructions in form.
Reward modeling lets you express comparative preferences when no ground truth exists.
PPO + KL tunes the policy against those preferences without letting it wander.
DPO collapses reward modeling and PPO into one supervised step, usually the right call for offline preference data.
RLVR replaces all of the above wherever you have a real checker.

If I were standing up alignment from scratch this quarter: SFT, then DPO on offline preference data for style and helpfulness, then a verifier-RL pass on the math/code/tool-use slices where checkers are cheap. PPO RLHF would only show up if I had budget for online sampling and a serious RM team to back it.

What does your alignment stack look like in 2026 — PPO, DPO, or have you moved on to verifier-based RL where you can? I'm curious which step everyone is keeping versus dropping.

If you want to go deeper:

InstructGPT paper — arxiv.org/abs/2203.02155. Canonical reference for the full pipeline.
DPO paper — arxiv.org/abs/2305.18290. Short, worth reading in full.