<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nikita Groshin</title>
    <description>The latest articles on Forem by Nikita Groshin (@nike-17).</description>
    <link>https://forem.com/nike-17</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903349%2Fbe1fac32-c30f-4575-b225-239f86e6fea8.jpeg</url>
      <title>Forem: Nikita Groshin</title>
      <link>https://forem.com/nike-17</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nike-17"/>
    <language>en</language>
    <item>
      <title>How I stopped Claude Code from hallucinating function names on a 4,000-file repo (with a local MCP server)</title>
      <dc:creator>Nikita Groshin</dc:creator>
      <pubDate>Sun, 03 May 2026 17:34:20 +0000</pubDate>
      <link>https://forem.com/nike-17/how-i-stopped-claude-code-from-hallucinating-function-names-on-a-4000-file-repo-with-a-local-mcp-jl5</link>
      <guid>https://forem.com/nike-17/how-i-stopped-claude-code-from-hallucinating-function-names-on-a-4000-file-repo-with-a-local-mcp-jl5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; My Claude Code agent kept inventing function names that looked plausible but didn't exist (&lt;code&gt;getUserByEmail&lt;/code&gt;, &lt;code&gt;parseConfigFile&lt;/code&gt;, &lt;code&gt;validateInput&lt;/code&gt; — all fake in my codebase). Adding a local MCP server that gives the agent a real symbol graph and ranked code search cut hallucinations to roughly zero on the same repo. Below: the bug, the cause, the fix, the bench numbers, and the cases where it still doesn't help.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug
&lt;/h2&gt;

&lt;p&gt;I was refactoring a logging middleware in a 4,000-file TypeScript monorepo. The agent's task: rename &lt;code&gt;logRequest&lt;/code&gt; to &lt;code&gt;logHttpRequest&lt;/code&gt; everywhere it's called, including transitive callers.&lt;/p&gt;

&lt;p&gt;What Claude Code generated, paraphrased:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/middleware/auth.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;logRequest&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;../logging/logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;withAuth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;logRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← real&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unauthorized&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// …&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;logResponseTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← INVENTED&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;logResponseTime&lt;/code&gt; does not exist. It has never existed in this codebase. The agent generated a call to it because (a) the surrounding code talks about logging, (b) function names like &lt;code&gt;logResponseTime&lt;/code&gt; exist in millions of public repos, (c) the model's training data has a strong prior that "logging middleware should also log response time."&lt;/p&gt;

&lt;p&gt;The actual function in our codebase is called &lt;code&gt;recordLatency&lt;/code&gt;, which the agent never used in any of the four files it edited.&lt;/p&gt;

&lt;p&gt;I rolled back, ran the same task again, and got &lt;code&gt;trackRequestDuration&lt;/code&gt; and &lt;code&gt;emitTimingEvent&lt;/code&gt; — also fake. Three runs, three different invented names. The agent was confident each time.&lt;/p&gt;

&lt;p&gt;This is the load-bearing failure of every AI coding agent on a repo larger than its context window: the model treats your codebase as if it were a representative sample of the training corpus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why grep alone doesn't fix it
&lt;/h2&gt;

&lt;p&gt;Cursor's @codebase, Claude Code's grep tool, plain ripgrep — all of them work in principle. The agent can search before it writes. In practice, it almost never does, for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; A single grep call against this repo returned 14,200 input tokens (we measured). Across the four files in this task the agent would have needed roughly 8 grep calls to be confident. That's ~$1.40 per task at Claude Sonnet's input rate, just for exploration. Multiply by 50 tasks a day. Engineers feel this — agents respond by doing fewer searches and guessing more.&lt;/p&gt;
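
&lt;p&gt;For what it's worth, here is how that per-task figure can pencil out. A back-of-the-envelope sketch, assuming each grep call's output rides along as context for the next call (my accounting for illustration, not an instrumented trace):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Back-of-the-envelope grep exploration cost per task. Assumes each
// call's output accumulates in the context window; an assumption for
// illustration, not a measured breakdown.
const tokensPerGrepCall = 14_200; // measured above
const grepCallsPerTask = 8;       // rough estimate above
const dollarsPerMTok = 3;         // Claude Sonnet input rate

// Call i re-sends the output of calls 1..i-1 plus its own.
let inputTokens = 0;
for (let i = 1; i &lt;= grepCallsPerTask; i++) {
  inputTokens += i * tokensPerGrepCall;
}

console.log(inputTokens);                          // 511,200 tokens
console.log((inputTokens / 1e6) * dollarsPerMTok); // ~1.53, same ballpark as ~$1.40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
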

&lt;p&gt;&lt;strong&gt;Bias.&lt;/strong&gt; Grep returns lexically matching strings. It doesn't tell the agent which results are &lt;em&gt;load-bearing&lt;/em&gt; — which functions are central to the call graph and which are utility code touched once. The agent reads the first three results and stops, which on a 4,000-file repo is almost always wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall.&lt;/strong&gt; Grep matches identifiers. It doesn't match concepts. If you ask "what handles request timing in this repo?" grep can't answer. Embedding-only search can, but embedding-only search returns ranked-by-cosine results that are often wrong on code (&lt;code&gt;logResponseTime&lt;/code&gt; and &lt;code&gt;recordLatency&lt;/code&gt; are nearly cosine-identical; the second is correct only because of where it sits in the call graph).&lt;/p&gt;

&lt;p&gt;The honest answer is that grep, embeddings, and graphs each fail in a different way. You need all three signals plus a way to combine them, exposed to the agent as MCP tools so it actually uses them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I did instead
&lt;/h2&gt;

&lt;p&gt;I installed &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;Sverklo&lt;/a&gt;, a local-first MCP server for code intelligence. (Disclosure: I wrote it.) The 60-second pitch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;sverklo init&lt;/code&gt; auto-detects your AI coding agent (Claude Code, Cursor, Windsurf, Zed) and writes the right MCP config. It indexes your repo with tree-sitter, builds a call graph, computes a PageRank-ranked symbol importance score, and generates ONNX embeddings — all locally. Your code never leaves the machine.&lt;/p&gt;
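
&lt;p&gt;The PageRank step is the part that raises eyebrows, so here is the shape of it. A minimal sketch, not sverklo's actual implementation: run PageRank over the call graph so that heavily-called symbols score as more important than one-off helpers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal PageRank over a call graph (illustrative sketch, not
// sverklo's code). Edges point caller -&gt; callee, so symbols that
// many paths flow into accumulate importance.
type CallGraph = Map&lt;string, string[]&gt;; // symbol -&gt; symbols it calls

function pageRank(graph: CallGraph, damping = 0.85, iterations = 30): Map&lt;string, number&gt; {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map&lt;string, number&gt;();
  for (const s of nodes) rank.set(s, 1 / n);

  for (let iter = 0; iter &lt; iterations; iter++) {
    const next = new Map&lt;string, number&gt;();
    for (const s of nodes) next.set(s, (1 - damping) / n);
    for (const [caller, callees] of graph) {
      const share = (rank.get(caller)! * damping) / Math.max(callees.length, 1);
      for (const callee of callees) {
        next.set(callee, (next.get(callee) ?? 0) + share);
      }
    }
    rank = next;
  }
  return rank;
}

// Toy graph: everything funnels into recordLatency, so it outranks
// the helpers even though all the names look equally plausible.
const graph: CallGraph = new Map([
  ["withAuth", ["logRequest", "recordLatency"]],
  ["logRequest", ["recordLatency"]],
  ["recordLatency", []],
  ["formatHeader", []],
]);
console.log([...pageRank(graph)].sort((a, b) =&gt; b[1] - a[1]));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
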

&lt;p&gt;The agent now has 37 extra tools alongside grep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sverklo_search&lt;/code&gt; — hybrid BM25 + embedding + PageRank ranked search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_lookup&lt;/code&gt; — exact symbol definition by name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_refs&lt;/code&gt; — every reference to a symbol (call graph, not just textual)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_impact&lt;/code&gt; — recursive blast-radius (transitive callers)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_audit&lt;/code&gt; — god classes, dead code, security patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_remember&lt;/code&gt; / &lt;code&gt;sverklo_recall&lt;/code&gt; — bi-temporal memory pinned to git SHAs&lt;/li&gt;
&lt;li&gt;…and 30 more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the rename task above, the agent's first move now is &lt;code&gt;sverklo_lookup logRequest&lt;/code&gt;. That returns the canonical definition with file path and line number, ranked by PageRank importance. Then &lt;code&gt;sverklo_refs logRequest&lt;/code&gt; returns every reference in the call graph, including the indirect callers grep would miss. The agent edits exactly the right files. No invented function names.&lt;/p&gt;
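
&lt;p&gt;If you want to poke at those two tools outside an agent, the official MCP TypeScript SDK can drive them directly. A hedged sketch: the &lt;code&gt;serve&lt;/code&gt; subcommand and the &lt;code&gt;symbol&lt;/code&gt; argument name are my guesses at the interface, so check the sverklo README for the real shapes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Drive the sverklo MCP tools over stdio with the official SDK.
// The "serve" subcommand and argument shapes are assumptions for
// illustration; consult the sverklo docs for the real ones.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({ command: "sverklo", args: ["serve"] });
const client = new Client({ name: "sverklo-probe", version: "0.0.1" });
await client.connect(transport);

// Step 1: the canonical definition of the symbol we're renaming.
const def = await client.callTool({
  name: "sverklo_lookup",
  arguments: { symbol: "logRequest" },
});

// Step 2: every call-graph reference, including indirect callers.
const refs = await client.callTool({
  name: "sverklo_refs",
  arguments: { symbol: "logRequest" },
});

console.log(def, refs);
await client.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
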

&lt;p&gt;Re-ran the same task three times after install. Zero hallucinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench numbers
&lt;/h2&gt;

&lt;p&gt;I ran a 60-task benchmark across 5 retrieval baselines (naive grep, smart grep, sverklo, &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;jcodemunch-mcp&lt;/a&gt;, &lt;a href="https://github.com/abhigyanpatwari/GitNexus" rel="noopener noreferrer"&gt;GitNexus&lt;/a&gt;). Methodology and raw data: &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt;. Headline numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Avg input tokens per task&lt;/th&gt;
&lt;th&gt;Tool calls per task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive grep&lt;/td&gt;
&lt;td&gt;17,169&lt;/td&gt;
&lt;td&gt;7–12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart grep&lt;/td&gt;
&lt;td&gt;5,082&lt;/td&gt;
&lt;td&gt;4–6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jcodemunch-mcp&lt;/td&gt;
&lt;td&gt;5,351&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitNexus&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sverklo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;386&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 44× fewer input tokens than naive grep, 13× fewer than tuned grep, and a single tool call vs grep's 7–12.&lt;/p&gt;

&lt;p&gt;For a typical Claude Sonnet session at $3/M input tokens and 50 tasks a day, that comes out to roughly $0.41 per session, or about $123/month at 10 sessions a day across a small team. Sverklo's local indexing turns the same workload into roughly $9/month.&lt;/p&gt;

&lt;p&gt;I'm not telling you those numbers to sell you anything. The repo is MIT-licensed and the bench is reproducible — clone it, run &lt;code&gt;npm run bench&lt;/code&gt;, get the same numbers (or different ones for your repo, which is the point).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this still doesn't help
&lt;/h2&gt;

&lt;p&gt;This is the honest part most blog posts skip.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repos under ~5,000 LOC.&lt;/strong&gt; The agent's context window can hold the whole thing. Grep is faster, and the indexing overhead isn't worth it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-file edits with no cross-references.&lt;/strong&gt; If you're editing one file and the change doesn't propagate, the symbol graph adds no signal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;First call to a fresh repo.&lt;/strong&gt; Sverklo has to index before its first useful query. On a 50K-LOC repo this is ~30 seconds; on 500K-LOC, ~5 minutes. After that it's incremental and fast, but the first run isn't free.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reference finding (P2 in the bench).&lt;/strong&gt; This is the embarrassing one. A well-tuned ripgrep ties sverklo on the "find every caller of X" task. The semantic graph doesn't help when the question is purely textual. If P2 is your dominant workload, smart grep is genuinely competitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Definition lookup (P1).&lt;/strong&gt; &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;jcodemunch-mcp&lt;/a&gt; beats sverklo here at 0.65 F1 vs sverklo's 0.45. Their tree-sitter symbol indexing is sharper. I have something to learn from them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the audit/blast-radius/memory tools don't sound load-bearing to your workflow, just use ripgrep + Cursor's @codebase. They're fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually try first
&lt;/h2&gt;

&lt;p&gt;If you're a Claude Code or Cursor user on a repo bigger than ~50K LOC, the cheapest experiment I can suggest is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask your agent its three least-favorite codebase questions. Mine were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"What handles request timing in this repo?" (was: 14,200 grep tokens, no useful answer; now: one &lt;code&gt;sverklo_search&lt;/code&gt; call, 312 tokens, correct)&lt;/li&gt;
&lt;li&gt;"If I rename &lt;code&gt;logRequest&lt;/code&gt;, what breaks?" (was: agent guesses confidently; now: &lt;code&gt;sverklo_impact&lt;/code&gt; returns the 23 transitive callers)&lt;/li&gt;
&lt;li&gt;"Where is the rate limiter implemented?" (was: agent edits the wrong file 50% of the time; now: &lt;code&gt;sverklo_lookup&lt;/code&gt; returns the canonical definition)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If those three questions don't get meaningfully better answers, uninstall and use grep. &lt;code&gt;npm uninstall -g sverklo&lt;/code&gt; is one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper point
&lt;/h2&gt;

&lt;p&gt;Hallucination in AI coding agents isn't a model problem. It's a retrieval problem. The model has to write code that matches your codebase; if it can't see your codebase fast and ranked, it falls back on the training-data prior. Function names like &lt;code&gt;logResponseTime&lt;/code&gt; win over &lt;code&gt;recordLatency&lt;/code&gt; because the prior is overwhelming.&lt;/p&gt;

&lt;p&gt;The fix isn't a smarter model. It's giving the model a real view of your code — a symbol graph, a ranked search, a call graph, a memory of what changed yesterday — exposed as tools the agent can call cheaply enough to actually use.&lt;/p&gt;

&lt;p&gt;That's it. That's the whole post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;github.com/sverklo/sverklo&lt;/a&gt; (MIT, ⭐ if it saved you a hallucination — it's the only way other engineers find it)&lt;br&gt;
&lt;strong&gt;Bench:&lt;/strong&gt; &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt; — 60 tasks, 5 baselines, reproducible&lt;br&gt;
&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href="https://doi.org/10.5281/zenodo.19802051" rel="noopener noreferrer"&gt;doi.org/10.5281/zenodo.19802051&lt;/a&gt; — peer-reviewable methodology, CC-BY 4.0&lt;br&gt;
&lt;strong&gt;Demo (90 sec):&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=OX7aEgdlqhQ" rel="noopener noreferrer"&gt;youtube.com/watch?v=OX7aEgdlqhQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've been hit by the same problem, I'd love to see your worst hallucination — DM me on X &lt;a href="https://x.com/marazmo" rel="noopener noreferrer"&gt;@marazmo&lt;/a&gt; or open an issue on the repo. I'm collecting them for a follow-up post.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I benchmarked code retrieval for AI coding agents on 60 tasks</title>
      <dc:creator>Nikita Groshin</dc:creator>
      <pubDate>Wed, 29 Apr 2026 01:39:33 +0000</pubDate>
      <link>https://forem.com/nike-17/i-benchmarked-code-retrieval-for-ai-coding-agents-on-60-tasks-f9h</link>
      <guid>https://forem.com/nike-17/i-benchmarked-code-retrieval-for-ai-coding-agents-on-60-tasks-f9h</guid>
      <description>&lt;p&gt;A tuned grep beat my MCP code-intelligence server on F1 by 9 points.&lt;/p&gt;

&lt;p&gt;I'm publishing the result anyway. Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this benchmark exists
&lt;/h2&gt;

&lt;p&gt;I've spent the last six months building &lt;a href="https://sverklo.com" rel="noopener noreferrer"&gt;sverklo&lt;/a&gt;, a local-first MCP server that gives AI coding agents (Claude Code, Cursor, Windsurf) a real symbol graph instead of grep-based pattern matching. The product positioning has always been "stops the agent from hallucinating function names that don't exist in your codebase."&lt;/p&gt;

&lt;p&gt;That positioning is hand-wavy without numbers. Six months in, I had no public benchmark. Whatever speed-of-iteration story I had for skipping one, I was only ever telling it to myself.&lt;/p&gt;

&lt;p&gt;So I built one: 60 hand-verified retrieval tasks across two real OSS codebases (&lt;a href="https://github.com/expressjs/express" rel="noopener noreferrer"&gt;expressjs/express&lt;/a&gt; and the &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;sverklo repo&lt;/a&gt; itself), three baselines (naive grep, smart grep, sverklo), and metrics that measure both retrieval quality (F1, recall, precision) and the thing AI agents actually pay for (input tokens, tool calls, wall time).&lt;/p&gt;

&lt;p&gt;Results live at &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt;. Raw JSONL outputs are in the repo at &lt;code&gt;benchmark/results/&amp;lt;timestamp&amp;gt;/&lt;/code&gt;. The harness runs in one npm command. Disagreements with my numbers are useful — file an issue with your machine spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;baseline&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;tokens&lt;/th&gt;
&lt;th&gt;tool calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naive-grep&lt;/td&gt;
&lt;td&gt;0.35&lt;/td&gt;
&lt;td&gt;15,814&lt;/td&gt;
&lt;td&gt;7.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smart-grep (tuned)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.67&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;731&lt;/td&gt;
&lt;td&gt;11.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sverklo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;255&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A tuned grep beats sverklo on F1 by 9 points.&lt;/strong&gt; That's not what I expected when I started building this. If you can write a clean ripgrep invocation with language filters and definition-shaped patterns, you get higher F1 than my hybrid retrieval stack returns.&lt;/p&gt;

&lt;p&gt;What sverklo wins on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;62× fewer tokens than naive grep&lt;/strong&gt; (255 vs 15,814)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.9× fewer tokens than smart grep&lt;/strong&gt; (255 vs 731)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 tool call vs grep's 7-12&lt;/strong&gt; per task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~1ms wall time&lt;/strong&gt; after a 3.7-second cold start (the index build)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why "tokens per correct answer" is the load-bearing metric
&lt;/h2&gt;

&lt;p&gt;If you're standing at a terminal with &lt;code&gt;rg&lt;/code&gt;, F1 is what matters. You read the matches yourself; no agent is paying tokens for them.&lt;/p&gt;

&lt;p&gt;If you're an AI agent with a 200K-token context window, every token has an opportunity cost. Burning 15,000 tokens on grep noise to find one function, when the same answer fits in 255, leaves you roughly 14,750 fewer tokens for the actual change. That's 14,750 more tokens the cheaper agent gets to spend on doing the work.&lt;/p&gt;

&lt;p&gt;The metric that actually matters is &lt;em&gt;tokens per correct answer&lt;/em&gt;: input tokens divided by recall. The bench reports this for both gated (F1 ≥ 0.8) and ungated runs. For sverklo on the gated subset, it's 203 tokens per correct answer. For naive grep, 3,557. For smart grep, 165 — smart grep is genuinely competitive on per-correct-answer cost when its F1 lands.&lt;/p&gt;
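
&lt;p&gt;In code the metric is a one-liner. A sketch with illustrative field names (the bench's &lt;code&gt;summary.json&lt;/code&gt; may spell them differently):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Tokens per correct answer: input tokens divided by recall, per the
// definition above. Field names are illustrative, not the bench's
// actual summary.json schema.
interface BaselineRun {
  inputTokens: number; // avg input tokens per task
  recall: number;      // fraction of gold answers retrieved, 0..1
}

function tokensPerCorrectAnswer(run: BaselineRun): number {
  if (run.recall === 0) return Infinity; // nothing correct: unbounded cost
  return run.inputTokens / run.recall;
}

// Placeholder recall value, not a published number:
console.log(tokensPerCorrectAnswer({ inputTokens: 255, recall: 0.9 })); // ~283
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
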

&lt;blockquote&gt;
&lt;p&gt;The mistake I almost made: optimising for F1. The thing AI coding agents actually need is the &lt;em&gt;cheapest correct retrieval&lt;/em&gt;, not the highest-precision retrieval that takes 12 tool calls to assemble.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Per-category: where each baseline shines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Best F1&lt;/th&gt;
&lt;th&gt;Best token economy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1 — Definition lookup (n=20)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (0.75)&lt;/td&gt;
&lt;td&gt;smart-grep (196 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 — Reference finding (n=20)&lt;/td&gt;
&lt;td&gt;smart-grep (0.81)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (157 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P4 — File dependencies (n=10)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (0.86)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (74 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P5 — Dead code (n=10)&lt;/td&gt;
&lt;td&gt;smart-grep (0.55)&lt;/td&gt;
&lt;td&gt;sverklo (579 tok, F1 = 0.02)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: sverklo wins on the slices where structural retrieval (the symbol graph, the import graph) directly answers the question. Definition lookup (P1) and file dependencies (P4) are exactly that. Reference finding (P2) turns out to be a regex problem grep handles well, because the reference patterns in JS/TS are syntactically uniform enough that &lt;code&gt;\bsymbol\b&lt;/code&gt; works most of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where sverklo fails: the P5 dead-code slice
&lt;/h2&gt;

&lt;p&gt;P5 is the embarrassing one. F1 = 0.02. &lt;code&gt;sverklo_refs&lt;/code&gt; looks at the static call graph. It doesn't see dynamic invocations (&lt;code&gt;this[methodName]()&lt;/code&gt;), it doesn't see deserialization-driven calls (&lt;code&gt;JSON.parse&lt;/code&gt; + &lt;code&gt;eval&lt;/code&gt; patterns), and it doesn't see calls through ORM proxies whose method names are assembled from template strings.&lt;/p&gt;
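
&lt;p&gt;If it isn't obvious why a static graph goes blind here, these are the shapes that defeat it (contrived examples, not cases from the bench):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Call shapes a static call graph can't resolve. A static indexer
// sees no edge into recordLatency from any of these, so the method
// looks dead.
class Metrics {
  recordLatency(ms: number) { /* ... */ }

  // 1. Dynamic dispatch: the method name is a runtime value.
  emit(methodName: string, ms: number) {
    (this as any)[methodName](ms);
  }

  // 2. Template-string method names (the ORM-proxy pattern).
  emitPrefixed(suffix: string, ms: number) {
    (this as any)[`record${suffix}`](ms); // e.g. emitPrefixed("Latency", 12)
  }
}

// 3. Deserialization-driven: the callee name arrives as data.
const plan = JSON.parse('{"method": "recordLatency", "ms": 12}');
(new Metrics() as any)[plan.method](plan.ms);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
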

&lt;p&gt;Smart-grep gets 0.55 on the same slice by aggressively reading whole files and matching loose patterns. The "loose" matters: it picks up a lot of false positives, but on dead-code detection a false positive is "this function is alive" — which is the safer error.&lt;/p&gt;

&lt;p&gt;P5 is the next thing I'm fixing. The plan is to extend the reference graph with a runtime-trace mode (instrument the test suite, log actual call sites, merge into the static graph). I'll publish that as a new bench slice when it lands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: channelized RRF
&lt;/h2&gt;

&lt;p&gt;The novel piece in sverklo's retrieval is &lt;em&gt;channelized Reciprocal Rank Fusion&lt;/em&gt;. Most hybrid retrievers run RRF once over &lt;code&gt;fts ∪ vector&lt;/code&gt;. Sverklo runs RRF &lt;strong&gt;per channel&lt;/strong&gt; — FTS, vector, doc-section, path, symbol-name — and fuses the per-channel ranks with channel-specific weights. The path channel is weighted 1.5× because filename matches are precision-skewed: when a query's keywords match a filename, it's signal worth boosting.&lt;/p&gt;
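
&lt;p&gt;The fusion step, sketched in TypeScript (the weights and the &lt;code&gt;k&lt;/code&gt; constant are illustrative, not sverklo's tuned values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Channelized Reciprocal Rank Fusion (illustrative sketch). Each
// channel returns its own ranked list of doc ids; ranks are fused
// with per-channel weights instead of one RRF pass over a merged pool.
type Channel = "fts" | "vector" | "docSection" | "path" | "symbolName";

const weights: Record&lt;Channel, number&gt; = {
  fts: 1.0,
  vector: 1.0,
  docSection: 1.0,
  path: 1.5, // filename matches are precision-skewed, so boost them
  symbolName: 1.0,
};

function fuse(ranked: Record&lt;Channel, string[]&gt;, k = 60): [string, number][] {
  const scores = new Map&lt;string, number&gt;();
  for (const channel of Object.keys(ranked) as Channel[]) {
    ranked[channel].forEach((docId, rank) =&gt; {
      const contribution = weights[channel] / (k + rank + 1); // classic RRF term
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  return [...scores.entries()].sort((a, b) =&gt; b[1] - a[1]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
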

&lt;p&gt;The full architecture rationale is in &lt;a href="https://sverklo.com/blog/rrf-is-doing-80-percent-of-the-work/" rel="noopener noreferrer"&gt;RRF is doing 80% of the work&lt;/a&gt; if you want the deep dive on why per-channel weighting matters more than the embedding model choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing this
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sverklo/sverklo
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run build
npm run bench:primitives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Raw outputs (&lt;code&gt;raw.jsonl&lt;/code&gt;, &lt;code&gt;summary.json&lt;/code&gt;, &lt;code&gt;report.md&lt;/code&gt;) land in &lt;code&gt;benchmark/results/&amp;lt;timestamp&amp;gt;/&lt;/code&gt;. The &lt;code&gt;report.md&lt;/code&gt; mirrors the bench page tables. If your numbers differ, please file an issue with your machine spec and the run timestamp — I want the disagreements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the takeaway
&lt;/h2&gt;

&lt;p&gt;If you're choosing between grep and an MCP code-intelligence server for your AI coding agent today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your codebase is small (~30 files), use &lt;code&gt;rg&lt;/code&gt;. The MCP server overhead doesn't pay back.&lt;/li&gt;
&lt;li&gt;If you're standing at the terminal yourself doing exploration, learn the smart-grep flags. High F1 is what lands you in the right place.&lt;/li&gt;
&lt;li&gt;If you're running an AI coding agent on a larger codebase and the agent invents function names that don't exist in your repo, the retrieval-token-economy gap is real and material. Sverklo's 1-tool-call retrieval is what unlocks that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sverklo is MIT-licensed, runs entirely on your laptop with embedded SQLite + a local ONNX model. No API keys. No cloud. No telemetry by default.&lt;/p&gt;

&lt;p&gt;Or read the &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;full bench report&lt;/a&gt; first — including the slice where sverklo loses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discuss
&lt;/h2&gt;

&lt;p&gt;What metrics do you use when evaluating retrieval for AI coding agents? Drop a comment if "tokens per correct answer" feels right or wrong as the load-bearing axis.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
