<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nikita Groshin</title>
    <description>The latest articles on Forem by Nikita Groshin (@nike-17).</description>
    <link>https://forem.com/nike-17</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903349%2Fbe1fac32-c30f-4575-b225-239f86e6fea8.jpeg</url>
      <title>Forem: Nikita Groshin</title>
      <link>https://forem.com/nike-17</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nike-17"/>
    <language>en</language>
    <item>
      <title>How I stopped Claude Code from hallucinating function names on a 4,000-file repo (with a local MCP server)</title>
      <dc:creator>Nikita Groshin</dc:creator>
      <pubDate>Sun, 03 May 2026 17:34:20 +0000</pubDate>
      <link>https://forem.com/nike-17/how-i-stopped-claude-code-from-hallucinating-function-names-on-a-4000-file-repo-with-a-local-mcp-jl5</link>
      <guid>https://forem.com/nike-17/how-i-stopped-claude-code-from-hallucinating-function-names-on-a-4000-file-repo-with-a-local-mcp-jl5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; My Claude Code agent kept inventing function names that looked plausible but didn't exist (&lt;code&gt;getUserByEmail&lt;/code&gt;, &lt;code&gt;parseConfigFile&lt;/code&gt;, &lt;code&gt;validateInput&lt;/code&gt; — all fake in my codebase). Adding a local MCP server that gives the agent a real symbol graph and ranked code search cut hallucinations to roughly zero on the same repo. Below: the bug, the cause, the fix, the bench numbers, and the cases where it still doesn't help.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bug
&lt;/h2&gt;

&lt;p&gt;I was refactoring a logging middleware in a 4,000-file TypeScript monorepo. The agent's task: rename &lt;code&gt;logRequest&lt;/code&gt; to &lt;code&gt;logHttpRequest&lt;/code&gt; everywhere it's called, including transitive callers.&lt;/p&gt;

&lt;p&gt;What Claude Code generated, paraphrased:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/middleware/auth.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;logRequest&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;../logging/logger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;withAuth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;logRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← real&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;unauthorized&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// …&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;logResponseTime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// ← INVENTED&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;logResponseTime&lt;/code&gt; does not exist. It has never existed in this codebase. The agent generated a call to it because (a) the surrounding code talks about logging, (b) function names like &lt;code&gt;logResponseTime&lt;/code&gt; exist in millions of public repos, (c) the model's training data has a strong prior that "logging middleware should also log response time."&lt;/p&gt;

&lt;p&gt;The actual function in our codebase is called &lt;code&gt;recordLatency&lt;/code&gt;, which the agent never used in any of the four files it edited.&lt;/p&gt;

&lt;p&gt;I rolled back, ran the same task again, and got &lt;code&gt;trackRequestDuration&lt;/code&gt; and &lt;code&gt;emitTimingEvent&lt;/code&gt; — also fake. Three runs, three different invented names. The agent was confident each time.&lt;/p&gt;

&lt;p&gt;This is the load-bearing failure of every AI coding agent on a repo larger than its context window: the model treats your codebase as if it were a representative sample of the training corpus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why grep alone doesn't fix it
&lt;/h2&gt;

&lt;p&gt;Cursor's @codebase, Claude Code's grep tool, plain ripgrep — all of them work in principle. The agent can search before it writes. In practice, it almost never does, for three reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; A single grep call against this repo returned 14,200 input tokens (we measured). Across the four files in this task the agent would have needed roughly 8 grep calls to be confident. That's ~$1.40 per task at Claude Sonnet's input rate, just for exploration. Multiply by 50 tasks a day. Engineers feel this — agents respond by doing fewer searches and guessing more.&lt;/p&gt;
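
&lt;p&gt;For what it's worth, here is how that per-task figure can pencil out. A back-of-the-envelope sketch, assuming each grep call's output rides along as context for the next call (my accounting for illustration, not an instrumented trace):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Back-of-the-envelope grep exploration cost per task. Assumes each
// call's output accumulates in the context window; an assumption for
// illustration, not a measured breakdown.
const tokensPerGrepCall = 14_200; // measured above
const grepCallsPerTask = 8;       // rough estimate above
const dollarsPerMTok = 3;         // Claude Sonnet input rate

// Call i re-sends the output of calls 1..i-1 plus its own.
let inputTokens = 0;
for (let i = 1; i &lt;= grepCallsPerTask; i++) {
  inputTokens += i * tokensPerGrepCall;
}

console.log(inputTokens);                          // 511,200 tokens
console.log((inputTokens / 1e6) * dollarsPerMTok); // ~1.53, same ballpark as ~$1.40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
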

&lt;p&gt;&lt;strong&gt;Bias.&lt;/strong&gt; Grep returns lexically matching strings. It doesn't tell the agent which results are &lt;em&gt;load-bearing&lt;/em&gt; — which functions are central to the call graph and which are utility code touched once. The agent reads the first three results and stops, which on a 4,000-file repo is almost always wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall.&lt;/strong&gt; Grep matches identifiers. It doesn't match concepts. If you ask "what handles request timing in this repo?" grep can't answer. Embedding-only search can, but embedding-only search returns ranked-by-cosine results that are often wrong on code (&lt;code&gt;logResponseTime&lt;/code&gt; and &lt;code&gt;recordLatency&lt;/code&gt; are nearly cosine-identical; the second is correct only because of where it sits in the call graph).&lt;/p&gt;

&lt;p&gt;The honest answer is that grep, embeddings, and graphs each fail in a different way. You need all three signals plus a way to combine them, exposed to the agent as MCP tools so it actually uses them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I did instead
&lt;/h2&gt;

&lt;p&gt;I installed &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;Sverklo&lt;/a&gt;, a local-first MCP server for code intelligence. (Disclosure: I wrote it.) The 60-second pitch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;sverklo init&lt;/code&gt; auto-detects your AI coding agent (Claude Code, Cursor, Windsurf, Zed) and writes the right MCP config. It indexes your repo with tree-sitter, builds a call graph, computes a PageRank-ranked symbol importance score, and generates ONNX embeddings — all locally. Your code never leaves the machine.&lt;/p&gt;
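
&lt;p&gt;The PageRank step is the part that raises eyebrows, so here is the shape of it. A minimal sketch, not sverklo's actual implementation: run PageRank over the call graph so that heavily-called symbols score as more important than one-off helpers.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal PageRank over a call graph (illustrative sketch, not
// sverklo's code). Edges point caller -&gt; callee, so symbols that
// many paths flow into accumulate importance.
type CallGraph = Map&lt;string, string[]&gt;; // symbol -&gt; symbols it calls

function pageRank(graph: CallGraph, damping = 0.85, iterations = 30): Map&lt;string, number&gt; {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map&lt;string, number&gt;();
  for (const s of nodes) rank.set(s, 1 / n);

  for (let iter = 0; iter &lt; iterations; iter++) {
    const next = new Map&lt;string, number&gt;();
    for (const s of nodes) next.set(s, (1 - damping) / n);
    for (const [caller, callees] of graph) {
      const share = (rank.get(caller)! * damping) / Math.max(callees.length, 1);
      for (const callee of callees) {
        next.set(callee, (next.get(callee) ?? 0) + share);
      }
    }
    rank = next;
  }
  return rank;
}

// Toy graph: everything funnels into recordLatency, so it outranks
// the helpers even though all the names look equally plausible.
const graph: CallGraph = new Map([
  ["withAuth", ["logRequest", "recordLatency"]],
  ["logRequest", ["recordLatency"]],
  ["recordLatency", []],
  ["formatHeader", []],
]);
console.log([...pageRank(graph)].sort((a, b) =&gt; b[1] - a[1]));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
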

&lt;p&gt;The agent now has 37 extra tools alongside grep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;sverklo_search&lt;/code&gt; — hybrid BM25 + embedding + PageRank ranked search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_lookup&lt;/code&gt; — exact symbol definition by name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_refs&lt;/code&gt; — every reference to a symbol (call graph, not just textual)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_impact&lt;/code&gt; — recursive blast-radius (transitive callers)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_audit&lt;/code&gt; — god classes, dead code, security patterns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sverklo_remember&lt;/code&gt; / &lt;code&gt;sverklo_recall&lt;/code&gt; — bi-temporal memory pinned to git SHAs&lt;/li&gt;
&lt;li&gt;…and 30 more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the rename task above, the agent's first move now is &lt;code&gt;sverklo_lookup logRequest&lt;/code&gt;. That returns the canonical definition with file path and line number, ranked by PageRank importance. Then &lt;code&gt;sverklo_refs logRequest&lt;/code&gt; returns every reference in the call graph, including the indirect callers grep would miss. The agent edits exactly the right files. No invented function names.&lt;/p&gt;
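
&lt;p&gt;If you want to poke at those two tools outside an agent, the official MCP TypeScript SDK can drive them directly. A hedged sketch: the &lt;code&gt;serve&lt;/code&gt; subcommand and the &lt;code&gt;symbol&lt;/code&gt; argument name are my guesses at the interface, so check the sverklo README for the real shapes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Drive the sverklo MCP tools over stdio with the official SDK.
// The "serve" subcommand and argument shapes are assumptions for
// illustration; consult the sverklo docs for the real ones.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({ command: "sverklo", args: ["serve"] });
const client = new Client({ name: "sverklo-probe", version: "0.0.1" });
await client.connect(transport);

// Step 1: the canonical definition of the symbol we're renaming.
const def = await client.callTool({
  name: "sverklo_lookup",
  arguments: { symbol: "logRequest" },
});

// Step 2: every call-graph reference, including indirect callers.
const refs = await client.callTool({
  name: "sverklo_refs",
  arguments: { symbol: "logRequest" },
});

console.log(def, refs);
await client.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
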

&lt;p&gt;Re-ran the same task three times after install. Zero hallucinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench numbers
&lt;/h2&gt;

&lt;p&gt;I ran a 60-task benchmark across 5 retrieval baselines (naive grep, smart grep, sverklo, &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;jcodemunch-mcp&lt;/a&gt;, &lt;a href="https://github.com/abhigyanpatwari/GitNexus" rel="noopener noreferrer"&gt;GitNexus&lt;/a&gt;). Methodology and raw data: &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt;. Headline numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;Avg input tokens per task&lt;/th&gt;
&lt;th&gt;Tool calls per task&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive grep&lt;/td&gt;
&lt;td&gt;17,169&lt;/td&gt;
&lt;td&gt;7–12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart grep&lt;/td&gt;
&lt;td&gt;5,082&lt;/td&gt;
&lt;td&gt;4–6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jcodemunch-mcp&lt;/td&gt;
&lt;td&gt;5,351&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitNexus&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sverklo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;386&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's roughly 44× fewer input tokens than naive grep, 13× fewer than tuned grep, and a single tool call vs grep's 7–12.&lt;/p&gt;

&lt;p&gt;For a typical Claude Sonnet session at $3/M input tokens and 50 tasks a day, that comes out to roughly $0.41 per session, or about $123/month at 10 sessions a day across a small team. Sverklo's local indexing turns the same workload into roughly $9/month.&lt;/p&gt;

&lt;p&gt;I'm not telling you those numbers to sell you anything. The repo is MIT-licensed and the bench is reproducible — clone it, run &lt;code&gt;npm run bench&lt;/code&gt;, get the same numbers (or different ones for your repo, which is the point).&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this still doesn't help
&lt;/h2&gt;

&lt;p&gt;This is the honest part most blog posts skip.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repos under ~5,000 LOC.&lt;/strong&gt; The agent's context window can hold the whole thing. Grep is faster, and the indexing overhead isn't worth it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Single-file edits with no cross-references.&lt;/strong&gt; If you're editing one file and the change doesn't propagate, the symbol graph adds no signal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;First call to a fresh repo.&lt;/strong&gt; Sverklo has to index before its first useful query. On a 50K-LOC repo this is ~30 seconds; on 500K-LOC, ~5 minutes. After that it's incremental and fast, but the first run isn't free.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reference finding (P2 in the bench).&lt;/strong&gt; This is the embarrassing one. A well-tuned ripgrep ties sverklo on the "find every caller of X" task. The semantic graph doesn't help when the question is purely textual. If P2 is your dominant workload, smart grep is genuinely competitive.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Definition lookup (P1).&lt;/strong&gt; &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;jcodemunch-mcp&lt;/a&gt; beats sverklo here at 0.65 F1 vs sverklo's 0.45. Their tree-sitter symbol indexing is sharper. I have something to learn from them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the audit/blast-radius/memory tools don't sound load-bearing to your workflow, just use ripgrep + Cursor's @codebase. They're fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd actually try first
&lt;/h2&gt;

&lt;p&gt;If you're a Claude Code or Cursor user on a repo bigger than ~50K LOC, the cheapest experiment I can suggest is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then ask your agent its three least-favorite codebase questions. Mine were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"What handles request timing in this repo?" (was: 14,200 grep tokens, no useful answer; now: one &lt;code&gt;sverklo_search&lt;/code&gt; call, 312 tokens, correct)&lt;/li&gt;
&lt;li&gt;"If I rename &lt;code&gt;logRequest&lt;/code&gt;, what breaks?" (was: agent guesses confidently; now: &lt;code&gt;sverklo_impact&lt;/code&gt; returns the 23 transitive callers)&lt;/li&gt;
&lt;li&gt;"Where is the rate limiter implemented?" (was: agent edits the wrong file 50% of the time; now: &lt;code&gt;sverklo_lookup&lt;/code&gt; returns the canonical definition)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If those three questions don't get meaningfully better answers, uninstall and use grep. &lt;code&gt;npm uninstall -g sverklo&lt;/code&gt; is one command.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deeper point
&lt;/h2&gt;

&lt;p&gt;Hallucination in AI coding agents isn't a model problem. It's a retrieval problem. The model has to write code that matches your codebase; if it can't see your codebase fast and ranked, it falls back on the training-data prior. Function names like &lt;code&gt;logResponseTime&lt;/code&gt; win over &lt;code&gt;recordLatency&lt;/code&gt; because the prior is overwhelming.&lt;/p&gt;

&lt;p&gt;The fix isn't a smarter model. It's giving the model a real view of your code — a symbol graph, a ranked search, a call graph, a memory of what changed yesterday — exposed as tools the agent can call cheaply enough to actually use.&lt;/p&gt;

&lt;p&gt;That's it. That's the whole post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;github.com/sverklo/sverklo&lt;/a&gt; (MIT, ⭐ if it saved you a hallucination — it's the only way other engineers find it)&lt;br&gt;
&lt;strong&gt;Bench:&lt;/strong&gt; &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt; — 60 tasks, 5 baselines, reproducible&lt;br&gt;
&lt;strong&gt;Paper:&lt;/strong&gt; &lt;a href="https://doi.org/10.5281/zenodo.19802051" rel="noopener noreferrer"&gt;doi.org/10.5281/zenodo.19802051&lt;/a&gt; — peer-reviewable methodology, CC-BY 4.0&lt;br&gt;
&lt;strong&gt;Demo (90 sec):&lt;/strong&gt; &lt;a href="https://www.youtube.com/watch?v=OX7aEgdlqhQ" rel="noopener noreferrer"&gt;youtube.com/watch?v=OX7aEgdlqhQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've been hit by the same problem, I'd love to see your worst hallucination — DM me on X &lt;a href="https://x.com/marazmo" rel="noopener noreferrer"&gt;@marazmo&lt;/a&gt; or open an issue on the repo. I'm collecting them for a follow-up post.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>ai</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I benchmarked code retrieval for AI coding agents on 60 tasks</title>
      <dc:creator>Nikita Groshin</dc:creator>
      <pubDate>Wed, 29 Apr 2026 01:39:33 +0000</pubDate>
      <link>https://forem.com/nike-17/i-benchmarked-code-retrieval-for-ai-coding-agents-on-60-tasks-f9h</link>
      <guid>https://forem.com/nike-17/i-benchmarked-code-retrieval-for-ai-coding-agents-on-60-tasks-f9h</guid>
      <description>&lt;p&gt;A tuned grep beat my MCP code-intelligence server on F1 by 9 points.&lt;/p&gt;

&lt;p&gt;I'm publishing the result anyway. Here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this benchmark exists
&lt;/h2&gt;

&lt;p&gt;I've spent the last six months building &lt;a href="https://sverklo.com" rel="noopener noreferrer"&gt;sverklo&lt;/a&gt;, a local-first MCP server that gives AI coding agents (Claude Code, Cursor, Windsurf) a real symbol graph instead of grep-based pattern matching. The product positioning has always been "stops the agent from hallucinating function names that don't exist in your codebase."&lt;/p&gt;

&lt;p&gt;That positioning is hand-wavy without numbers. Six months in, I had no public benchmark. Whatever speed-of-iteration story I had for skipping one, I was only ever telling it to myself.&lt;/p&gt;

&lt;p&gt;So I built one: 60 hand-verified retrieval tasks across two real OSS codebases (&lt;a href="https://github.com/expressjs/express" rel="noopener noreferrer"&gt;expressjs/express&lt;/a&gt; and the &lt;a href="https://github.com/sverklo/sverklo" rel="noopener noreferrer"&gt;sverklo repo&lt;/a&gt; itself), three baselines (naive grep, smart grep, sverklo), and metrics that measure both retrieval quality (F1, recall, precision) and the thing AI agents actually pay for (input tokens, tool calls, wall time).&lt;/p&gt;

&lt;p&gt;Results live at &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;sverklo.com/bench&lt;/a&gt;. Raw JSONL outputs are in the repo at &lt;code&gt;benchmark/results/&amp;lt;timestamp&amp;gt;/&lt;/code&gt;. The harness runs in one npm command. Disagreements with my numbers are useful — file an issue with your machine spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;baseline&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;tokens&lt;/th&gt;
&lt;th&gt;tool calls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naive-grep&lt;/td&gt;
&lt;td&gt;0.35&lt;/td&gt;
&lt;td&gt;15,814&lt;/td&gt;
&lt;td&gt;7.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smart-grep (tuned)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.67&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;731&lt;/td&gt;
&lt;td&gt;11.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sverklo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;255&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A tuned grep beats sverklo on F1 by 9 points.&lt;/strong&gt; That's not what I expected when I started building this. If you can write a clean ripgrep invocation with language filters and definition-shaped patterns, you get higher F1 than my hybrid retrieval stack returns.&lt;/p&gt;

&lt;p&gt;What sverklo wins on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;62× fewer tokens than naive grep&lt;/strong&gt; (255 vs 15,814)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2.9× fewer tokens than smart grep&lt;/strong&gt; (255 vs 731)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1 tool call vs grep's 7-12&lt;/strong&gt; per task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~1ms wall time&lt;/strong&gt; after a 3.7-second cold start (the index build)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why "tokens per correct answer" is the load-bearing metric
&lt;/h2&gt;

&lt;p&gt;If you're standing at a terminal with &lt;code&gt;rg&lt;/code&gt;, F1 is what matters. You read the matches yourself; no agent is paying tokens for them.&lt;/p&gt;

&lt;p&gt;If you're an AI agent with a 200K-token context window, every token has an opportunity cost. Burning 15,000 tokens on grep noise to find one function, when the same answer fits in 255, leaves you roughly 14,750 fewer tokens for the actual change. That's 14,750 more tokens the cheaper agent gets to spend on doing the work.&lt;/p&gt;

&lt;p&gt;The metric that actually matters is &lt;em&gt;tokens per correct answer&lt;/em&gt;: input tokens divided by recall. The bench reports this for both gated (F1 ≥ 0.8) and ungated runs. For sverklo on the gated subset, it's 203 tokens per correct answer. For naive grep, 3,557. For smart grep, 165 — smart grep is genuinely competitive on per-correct-answer cost when its F1 lands.&lt;/p&gt;
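
&lt;p&gt;In code the metric is a one-liner. A sketch with illustrative field names (the bench's &lt;code&gt;summary.json&lt;/code&gt; may spell them differently):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Tokens per correct answer: input tokens divided by recall, per the
// definition above. Field names are illustrative, not the bench's
// actual summary.json schema.
interface BaselineRun {
  inputTokens: number; // avg input tokens per task
  recall: number;      // fraction of gold answers retrieved, 0..1
}

function tokensPerCorrectAnswer(run: BaselineRun): number {
  if (run.recall === 0) return Infinity; // nothing correct: unbounded cost
  return run.inputTokens / run.recall;
}

// Placeholder recall value, not a published number:
console.log(tokensPerCorrectAnswer({ inputTokens: 255, recall: 0.9 })); // ~283
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
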

&lt;blockquote&gt;
&lt;p&gt;The mistake I almost made: optimising for F1. The thing AI coding agents actually need is the &lt;em&gt;cheapest correct retrieval&lt;/em&gt;, not the highest-precision retrieval that takes 12 tool calls to assemble.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Per-category: where each baseline shines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Best F1&lt;/th&gt;
&lt;th&gt;Best token economy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1 — Definition lookup (n=20)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (0.75)&lt;/td&gt;
&lt;td&gt;smart-grep (196 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2 — Reference finding (n=20)&lt;/td&gt;
&lt;td&gt;smart-grep (0.81)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (157 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P4 — File dependencies (n=10)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (0.86)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;sverklo&lt;/strong&gt; (74 tok)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P5 — Dead code (n=10)&lt;/td&gt;
&lt;td&gt;smart-grep (0.55)&lt;/td&gt;
&lt;td&gt;sverklo (579 tok, F1 = 0.02)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern: sverklo wins on the slices where structural retrieval (the symbol graph, the import graph) directly answers the question. Definition lookup (P1) and file dependencies (P4) are exactly that. Reference finding (P2) turns out to be a regex problem grep handles well, because the reference patterns in JS/TS are syntactically uniform enough that &lt;code&gt;\bsymbol\b&lt;/code&gt; works most of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where sverklo fails: the P5 dead-code slice
&lt;/h2&gt;

&lt;p&gt;P5 is the embarrassing one. F1 = 0.02. &lt;code&gt;sverklo_refs&lt;/code&gt; looks at the static call graph. It doesn't see dynamic invocations (&lt;code&gt;this[methodName]()&lt;/code&gt;), it doesn't see deserialization-driven calls (&lt;code&gt;JSON.parse&lt;/code&gt; + &lt;code&gt;eval&lt;/code&gt; patterns), and it doesn't see calls through ORM proxies whose method names are assembled from template strings.&lt;/p&gt;
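
&lt;p&gt;If it isn't obvious why a static graph goes blind here, these are the shapes that defeat it (contrived examples, not cases from the bench):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Call shapes a static call graph can't resolve. A static indexer
// sees no edge into recordLatency from any of these, so the method
// looks dead.
class Metrics {
  recordLatency(ms: number) { /* ... */ }

  // 1. Dynamic dispatch: the method name is a runtime value.
  emit(methodName: string, ms: number) {
    (this as any)[methodName](ms);
  }

  // 2. Template-string method names (the ORM-proxy pattern).
  emitPrefixed(suffix: string, ms: number) {
    (this as any)[`record${suffix}`](ms); // e.g. emitPrefixed("Latency", 12)
  }
}

// 3. Deserialization-driven: the callee name arrives as data.
const plan = JSON.parse('{"method": "recordLatency", "ms": 12}');
(new Metrics() as any)[plan.method](plan.ms);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
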

&lt;p&gt;Smart-grep gets 0.55 on the same slice by aggressively reading whole files and matching loose patterns. The "loose" matters: it picks up a lot of false positives, but on dead-code detection a false positive is "this function is alive" — which is the safer error.&lt;/p&gt;

&lt;p&gt;P5 is the next thing I'm fixing. The plan is to extend the reference graph with a runtime-trace mode (instrument the test suite, log actual call sites, merge into the static graph). I'll publish that as a new bench slice when it lands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture: channelized RRF
&lt;/h2&gt;

&lt;p&gt;The novel piece in sverklo's retrieval is &lt;em&gt;channelized Reciprocal Rank Fusion&lt;/em&gt;. Most hybrid retrievers run RRF once over &lt;code&gt;fts ∪ vector&lt;/code&gt;. Sverklo runs RRF &lt;strong&gt;per channel&lt;/strong&gt; — FTS, vector, doc-section, path, symbol-name — and fuses the per-channel ranks with channel-specific weights. The path channel is weighted 1.5× because filename matches are precision-skewed: when a query's keywords match a filename, it's signal worth boosting.&lt;/p&gt;
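
&lt;p&gt;The fusion step, sketched in TypeScript (the weights and the &lt;code&gt;k&lt;/code&gt; constant are illustrative, not sverklo's tuned values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Channelized Reciprocal Rank Fusion (illustrative sketch). Each
// channel returns its own ranked list of doc ids; ranks are fused
// with per-channel weights instead of one RRF pass over a merged pool.
type Channel = "fts" | "vector" | "docSection" | "path" | "symbolName";

const weights: Record&lt;Channel, number&gt; = {
  fts: 1.0,
  vector: 1.0,
  docSection: 1.0,
  path: 1.5, // filename matches are precision-skewed, so boost them
  symbolName: 1.0,
};

function fuse(ranked: Record&lt;Channel, string[]&gt;, k = 60): [string, number][] {
  const scores = new Map&lt;string, number&gt;();
  for (const channel of Object.keys(ranked) as Channel[]) {
    ranked[channel].forEach((docId, rank) =&gt; {
      const contribution = weights[channel] / (k + rank + 1); // classic RRF term
      scores.set(docId, (scores.get(docId) ?? 0) + contribution);
    });
  }
  return [...scores.entries()].sort((a, b) =&gt; b[1] - a[1]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
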

&lt;p&gt;The full architecture rationale is in &lt;a href="https://sverklo.com/blog/rrf-is-doing-80-percent-of-the-work/" rel="noopener noreferrer"&gt;RRF is doing 80% of the work&lt;/a&gt; if you want the deep dive on why per-channel weighting matters more than the embedding model choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproducing this
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/sverklo/sverklo
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run build
npm run bench:primitives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Raw outputs (&lt;code&gt;raw.jsonl&lt;/code&gt;, &lt;code&gt;summary.json&lt;/code&gt;, &lt;code&gt;report.md&lt;/code&gt;) land in &lt;code&gt;benchmark/results/&amp;lt;timestamp&amp;gt;/&lt;/code&gt;. The &lt;code&gt;report.md&lt;/code&gt; mirrors the bench page tables. If your numbers differ, please file an issue with your machine spec and the run timestamp — I want the disagreements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the takeaway
&lt;/h2&gt;

&lt;p&gt;If you're choosing between grep and an MCP code-intelligence server for your AI coding agent today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your codebase is small (~30 files), use &lt;code&gt;rg&lt;/code&gt;. The MCP server overhead doesn't pay back.&lt;/li&gt;
&lt;li&gt;If you're standing at the terminal yourself doing exploration, learn the smart-grep flags. High F1 is what lands you in the right place.&lt;/li&gt;
&lt;li&gt;If you're running an AI coding agent on a larger codebase and the agent invents function names that don't exist in your repo, the retrieval-token-economy gap is real and material. Sverklo's 1-tool-call retrieval is what unlocks that.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; sverklo
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
sverklo init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sverklo is MIT-licensed, runs entirely on your laptop with embedded SQLite + a local ONNX model. No API keys. No cloud. No telemetry by default.&lt;/p&gt;

&lt;p&gt;Or read the &lt;a href="https://sverklo.com/bench/" rel="noopener noreferrer"&gt;full bench report&lt;/a&gt; first — including the slice where sverklo loses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discuss
&lt;/h2&gt;

&lt;p&gt;What metrics do you use when evaluating retrieval for AI coding agents? Drop a comment if "tokens per correct answer" feels right or wrong as the load-bearing axis.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
