<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Luc B. Perussault-Diallo</title>
    <description>The latest articles on Forem by Luc B. Perussault-Diallo (@luuuc).</description>
    <link>https://forem.com/luuuc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784850%2Fd6117002-ae1c-457a-8b55-34f37642a1e5.jpeg</url>
      <title>Forem: Luc B. Perussault-Diallo</title>
      <link>https://forem.com/luuuc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/luuuc"/>
    <language>en</language>
    <item>
      <title>How do you benchmark an MCP server you built?</title>
      <dc:creator>Luc B. Perussault-Diallo</dc:creator>
      <pubDate>Fri, 15 May 2026 13:34:59 +0000</pubDate>
      <link>https://forem.com/luuuc/how-do-you-benchmark-an-mcp-server-you-built-2e8j</link>
      <guid>https://forem.com/luuuc/how-do-you-benchmark-an-mcp-server-you-built-2e8j</guid>
      <description>&lt;p&gt;I built a code-intelligence MCP server. Then I built a benchmark for code-intelligence MCP servers. Then my tool placed first on every scenario.&lt;/p&gt;

&lt;p&gt;I didn't believe it.&lt;/p&gt;

&lt;p&gt;So I threw the harness out and rewrote it from scratch. Same result. I built three held-out scenarios with hand-graded reference scores, ran each iteration of the bench against them, and only trusted a number when it correlated above 0.85 with my own judgment.&lt;/p&gt;

&lt;p&gt;That's the part I want to write about. Not which tool won. The methodology decisions you have to make when you are &lt;em&gt;also&lt;/em&gt; one of the contestants.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Benchmarking AI tools is harder than benchmarking models because the variable is the tool, not the model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Disclosure first.&lt;/strong&gt; I am the author of &lt;a href="https://github.com/luuuc/sense" rel="noopener noreferrer"&gt;Sense&lt;/a&gt;, one of four MCP servers in the bench, scored alongside a no-MCP baseline (Claude Code with grep, find, and Read). Everything below is in the open: code, scenarios, rubrics, transcripts, judge prompts, and the analysis of where my own tool loses. &lt;a href="https://github.com/luuuc/sense/tree/main/bench" rel="noopener noreferrer"&gt;Repo here.&lt;/a&gt; Don't take my word for any of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Fairness&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1&lt;/td&gt;
&lt;td&gt;sense&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;81.3%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.4%&lt;/td&gt;
&lt;td&gt;10,896&lt;/td&gt;
&lt;td&gt;141s&lt;/td&gt;
&lt;td&gt;$6.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#2&lt;/td&gt;
&lt;td&gt;probe&lt;/td&gt;
&lt;td&gt;77.7%&lt;/td&gt;
&lt;td&gt;84.8%&lt;/td&gt;
&lt;td&gt;12,119&lt;/td&gt;
&lt;td&gt;162s&lt;/td&gt;
&lt;td&gt;$6.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#3&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;77.2%&lt;/td&gt;
&lt;td&gt;84.2%&lt;/td&gt;
&lt;td&gt;12,716&lt;/td&gt;
&lt;td&gt;185s&lt;/td&gt;
&lt;td&gt;$7.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#4&lt;/td&gt;
&lt;td&gt;serena&lt;/td&gt;
&lt;td&gt;75.2%&lt;/td&gt;
&lt;td&gt;83.4%&lt;/td&gt;
&lt;td&gt;14,800&lt;/td&gt;
&lt;td&gt;191s&lt;/td&gt;
&lt;td&gt;$7.57&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#5&lt;/td&gt;
&lt;td&gt;gitnexus&lt;/td&gt;
&lt;td&gt;74.9%&lt;/td&gt;
&lt;td&gt;84.5%&lt;/td&gt;
&lt;td&gt;12,964&lt;/td&gt;
&lt;td&gt;173s&lt;/td&gt;
&lt;td&gt;$6.87&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two of the four MCP servers scored below the no-MCP baseline. Serena and Gitnexus, by default, made the agent worse than just letting Claude use grep, find, and Read.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why an eval framework wouldn't cut it
&lt;/h2&gt;

&lt;p&gt;That result is the one worth defending. Two MCP servers scoring below baseline is a claim I need to back up, which is why most of my time went into the bench's methodology, not into the tools it measures.&lt;/p&gt;

&lt;p&gt;I looked at the obvious frameworks first (Inspect AI from UK AISI is the closest fit for agent evals; Promptfoo, DeepEval, OpenAI Evals all have their place). They handle maybe a third of the work: the run loop, the LLM-as-judge calls, the report rendering. The other two-thirds had to be custom.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What a framework would handle&lt;/th&gt;
&lt;th&gt;What I had to build&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run loop, prompt construction&lt;/td&gt;
&lt;td&gt;MCP server orchestration per cell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-as-judge calls&lt;/td&gt;
&lt;td&gt;Citation grounding vs pinned repo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Report rendering&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;answer_text&lt;/code&gt; / &lt;code&gt;tool_input&lt;/code&gt; split&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Fairness vs adoption layering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Held-out + Spearman anchor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Judge variance characterization&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reason is shape. An eval framework is built around "model produces output, output gets graded against expected output." That's a single-turn paradigm. What I needed was: &lt;em&gt;given an MCP server attached to Claude Code, does an agent reach a useful answer faster, cheaper, and with fewer hallucinated file paths than the same agent with no MCP?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The variable isn't the model. The model is fixed (Opus 4.7, 1M context). The variable is the tool the model has access to. None of those custom pieces exist as off-the-shelf primitives.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Citation grounding against a pinned commit
&lt;/h2&gt;

&lt;p&gt;Most evals grade text against text. I needed to grade text against a filesystem.&lt;/p&gt;

&lt;p&gt;When an AI agent says "the dispatch logic lives at &lt;code&gt;app.py:1625&lt;/code&gt;," that claim is either true or it isn't. The number is either inside the file or beyond EOF. The file is either at that path or it isn't.&lt;/p&gt;

&lt;p&gt;So every &lt;code&gt;file:line&lt;/code&gt; and &lt;code&gt;file:Symbol&lt;/code&gt; reference in the answer gets extracted by regex, then verified against the repo at the pinned commit. Three buckets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grounded&lt;/strong&gt;: the file exists and the line is in range, or the symbol resolves within ±5 lines of the cited line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unresolved&lt;/strong&gt;: the file is not at the cited path (usually a basename-only reference where the agent dropped the directory)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated&lt;/strong&gt;: the file exists, but the line is past EOF. Outright fabrication.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hallucinated is the hard signal. When you flag a line that doesn't exist in a file that does, the agent isn't paraphrasing. It's inventing.&lt;/p&gt;
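
&lt;p&gt;The three buckets can be sketched in a few lines of Python. This is a minimal illustration, not the bench's actual extractor: the regex, the function names, and the choice to skip &lt;code&gt;file:Symbol&lt;/code&gt; references are all assumptions made to keep the example short, and the grounding rate is assumed to be grounded citations over all citations.&lt;/p&gt;

```python
import re
from pathlib import Path

# Hypothetical pattern: a path ending in an extension, a colon, then a line number.
# It deliberately ignores file:Symbol references to keep the sketch short.
CITATION_RE = re.compile(r"([\w./-]+\.\w+):(\d+)")


def extract_citations(answer_text: str) -> list[tuple[str, int]]:
    """Return every (path, line) pair cited in the assistant's prose."""
    return [(path, int(line)) for path, line in CITATION_RE.findall(answer_text)]


def classify_citation(repo_root: Path, path: str, line: int) -> str:
    """Bucket one citation as grounded, unresolved, or hallucinated."""
    target = repo_root / path
    if not target.is_file():
        # Nothing at the cited path, e.g. a basename-only reference.
        return "unresolved"
    with target.open(encoding="utf-8", errors="replace") as handle:
        line_count = sum(1 for _ in handle)
    if line >= 1 and line_count >= line:
        return "grounded"
    # The file exists but the cited line does not.
    return "hallucinated"


def grounding_rate(repo_root: Path, answer_text: str) -> float:
    """Share of citations that are grounded; 0.0 when nothing is cited."""
    citations = extract_citations(answer_text)
    if not citations:
        return 0.0
    grounded = sum(
        classify_citation(repo_root, path, line) == "grounded"
        for path, line in citations
    )
    return grounded / len(citations)
```

&lt;p&gt;Here &lt;code&gt;repo_root&lt;/code&gt; is assumed to be a checkout of the repo at the pinned commit, so the same answer always gets the same verdict.&lt;/p&gt;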

&lt;p&gt;The numbers across all five configurations were stark:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Citation grounding&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sense&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;89.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;td&gt;80.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gitnexus&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;probe&lt;/td&gt;
&lt;td&gt;72.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;serena&lt;/td&gt;
&lt;td&gt;61.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That gap between 89% and 62% is the difference between &lt;em&gt;trusting the answer&lt;/em&gt; and &lt;em&gt;manually verifying every line reference&lt;/em&gt;. The LLM-as-judge alone would never have surfaced this. The judge reads the answer; it doesn't crawl the repo.&lt;/p&gt;

&lt;p&gt;The regex extractor is naive, but it handles ~95% of cases at a fraction of the complexity of anything smarter. It misses some edge cases (basename-only paths it correctly cannot resolve, symbols at unusual offsets), but it doesn't lie.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The &lt;code&gt;answer_text&lt;/code&gt; vs &lt;code&gt;tool_input&lt;/code&gt; split
&lt;/h2&gt;

&lt;p&gt;This one almost killed the bench before it was real.&lt;/p&gt;

&lt;p&gt;A keyword check asks: did the answer mention &lt;code&gt;TopicCreator&lt;/code&gt;? Pre-fix, the scorer searched the &lt;em&gt;entire transcript&lt;/em&gt;, including tool calls. So when the agent ran &lt;code&gt;Grep("TopicCreator")&lt;/code&gt;, the keyword &lt;code&gt;TopicCreator&lt;/code&gt; got a "hit" inside the grep invocation. Even if grep returned nothing.&lt;/p&gt;

&lt;p&gt;That's not a measurement. That's a tax on tools that don't use grep. Sense, which uses semantic search instead of grep, would lose keyword points purely because its tool calls didn't contain English-language symbol names. Probe, baseline, anything grep-flavored would win keyword points just for grepping for the right string.&lt;/p&gt;

&lt;p&gt;The fix sounds simple: keyword checks search the assistant's &lt;em&gt;prose only&lt;/em&gt;. Tool inputs and results live in a separate &lt;code&gt;audit_text&lt;/code&gt; field, available for diagnostics but never scored against.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: scorer searched the whole transcript
&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_keyword_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After: scorer searches assistant prose only
&lt;/span&gt;&lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_keyword_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Tool calls live in audit_text, never scored
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
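
&lt;p&gt;To make the failure mode concrete, here is a small sketch of the split. The event shape (a list of dicts with &lt;code&gt;kind&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; keys) and the body of &lt;code&gt;count_keyword_in&lt;/code&gt; are assumptions for illustration; the real transcript format may differ.&lt;/p&gt;

```python
def split_transcript(events: list[dict]) -> tuple[str, str]:
    """Separate assistant prose (scored) from tool traffic (audit only)."""
    answer_parts, audit_parts = [], []
    for event in events:
        if event["kind"] == "assistant_text":
            answer_parts.append(event["text"])
        else:
            # Tool inputs and tool results are kept, but never scored.
            audit_parts.append(event["text"])
    return "\n".join(answer_parts), "\n".join(audit_parts)


def count_keyword_in(text: str, keyword: str) -> int:
    """Count plain substring occurrences of keyword in text."""
    return text.count(keyword)


# Hypothetical transcript: the agent greps for a symbol, gets nothing back,
# and never names the symbol in its answer.
events = [
    {"kind": "tool_input", "text": 'Grep("TopicCreator")'},
    {"kind": "tool_result", "text": ""},
    {"kind": "assistant_text", "text": "I could not find where topics are created."},
]
answer_text, audit_text = split_transcript(events)
```

&lt;p&gt;Scoring the whole transcript would count one hit for &lt;code&gt;TopicCreator&lt;/code&gt; here; scoring &lt;code&gt;answer_text&lt;/code&gt; alone counts none, which matches what the developer actually read.&lt;/p&gt;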



&lt;blockquote&gt;
&lt;p&gt;A bench cannot be honest if its first instinct rewards using the tools the bench's author dislikes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Fairness vs adoption: two layers, never folded
&lt;/h2&gt;

&lt;p&gt;Two questions that look similar are not the same question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Did the developer get a better answer?&lt;/em&gt; That's fairness.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Did the agent fluently use the MCP tools available?&lt;/em&gt; That's adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you measure how often the agent uses an MCP server, Sense looks great. But baseline (no MCP attached) gets a structural zero on that axis. Through no fault of its own. The agent literally cannot call a server that isn't there.&lt;/p&gt;

&lt;p&gt;So if "adoption" feeds into the headline score, the headline answers the second question, not the first. That's not a benchmark. That's an MCP-adoption survey.&lt;/p&gt;

&lt;p&gt;The two-layer model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fairness  = 0.10 * keyword_coverage
          + 0.55 * llm_quality
          + 0.15 * citation_grounding
          + 0.20 * efficiency

adoption  = 0.60 * tool_fluency
          + 0.40 * discoverability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
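
&lt;p&gt;In code, keeping the layers apart can be as simple as never letting the two weight tables meet. A rough sketch, assuming every component is already normalized to the 0 to 1 range; the function names are hypothetical, and reporting adoption as &lt;code&gt;None&lt;/code&gt; for a configuration with no MCP server is a choice made for this example.&lt;/p&gt;

```python
from typing import Optional

FAIRNESS_WEIGHTS = {
    "keyword_coverage": 0.10,
    "llm_quality": 0.55,
    "citation_grounding": 0.15,
    "efficiency": 0.20,
}
ADOPTION_WEIGHTS = {"tool_fluency": 0.60, "discoverability": 0.40}


def weighted_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over exactly the components named in weights."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


def score_cell(components: dict[str, float]) -> dict[str, Optional[float]]:
    """Report both layers side by side; adoption never feeds into fairness."""
    fairness = weighted_score(components, FAIRNESS_WEIGHTS)
    has_adoption = all(name in components for name in ADOPTION_WEIGHTS)
    adoption = weighted_score(components, ADOPTION_WEIGHTS) if has_adoption else None
    return {"fairness": fairness, "adoption": adoption}
```

&lt;p&gt;Because &lt;code&gt;fairness&lt;/code&gt; only ever reads the four fairness components, a baseline cell with no adoption data is ranked on exactly the same footing as an MCP cell.&lt;/p&gt;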



&lt;p&gt;Adoption is computed, reported, and never folded into fairness. It's there for code-intel-vs-code-intel comparisons only. The headline number is fairness, and baseline can beat any MCP server on it.&lt;/p&gt;

&lt;p&gt;Two of the four MCP servers I benchmarked did score below baseline on fairness. They added friction without offsetting it with answer quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If adoption had been in the formula, that finding would have been invisible.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Held-out scenarios + Spearman correlation
&lt;/h2&gt;

&lt;p&gt;The anti-Goodhart move.&lt;/p&gt;

&lt;p&gt;A self-improving benchmark can drift away from human judgment. You tune the rubric, the rubric tunes the scores, the scores look better, and at some point you're optimizing the metric instead of the underlying capability. Goodhart's Law applies to your own measurement.&lt;/p&gt;

&lt;p&gt;The defense is an anchor the loop cannot touch.&lt;/p&gt;

&lt;p&gt;Three held-out scenarios, frozen. Hand-graded reference scores in &lt;code&gt;gold.json&lt;/code&gt;. SHA256-pinned in a lockfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bench/locked/held-out.lock&lt;/span&gt;
&lt;span class="na"&gt;flask-blueprints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;transcript_sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7f3a...&lt;/span&gt;
  &lt;span class="na"&gt;rubric_sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;91e2...&lt;/span&gt;
  &lt;span class="na"&gt;gold_sha256&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8d4c...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The improvement loop refuses to start if any hash has drifted. Each iteration re-judges the frozen transcripts against the current rubric, then computes Spearman correlation between the current &lt;code&gt;llm_quality&lt;/code&gt; and the gold scores.&lt;/p&gt;

&lt;p&gt;If the correlation drops below 0.85, convergence fails and the loop must stop or be re-anchored.&lt;/p&gt;

&lt;p&gt;It's the line between &lt;em&gt;we tuned the bench&lt;/em&gt; and &lt;em&gt;we tuned the bench until it agrees with our own grading&lt;/em&gt;. The smallest part of the codebase. Possibly the most important.&lt;/p&gt;
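
&lt;p&gt;Both halves of the anchor fit in a short sketch: a hash check that detects drift, and a Spearman correlation gated at 0.85. This is a dependency-free illustration under assumed names (&lt;code&gt;drifted_files&lt;/code&gt;, &lt;code&gt;anchor_holds&lt;/code&gt;) and an assumed flat mapping of file name to expected hash, not the bench's own implementation.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def drifted_files(lock: dict[str, str], root: Path) -> list[str]:
    """Return the locked files whose current SHA256 no longer matches."""
    return [
        name
        for name, expected in lock.items()
        if hashlib.sha256((root / name).read_bytes()).hexdigest() != expected
    ]


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while len(values) > start:
        end = start
        while len(values) > end + 1 and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or 2 > len(xs):
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


def anchor_holds(current: list[float], gold: list[float], threshold: float = 0.85) -> bool:
    """True while the judge's scores still rank items the way the gold scores do."""
    return spearman(current, gold) >= threshold
```

&lt;p&gt;A loop built on this would refuse to start while &lt;code&gt;drifted_files&lt;/code&gt; returns anything, and stop as soon as &lt;code&gt;anchor_holds&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;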




&lt;h2&gt;
  
  
  5. Judge variance: what's reliable, what isn't
&lt;/h2&gt;

&lt;p&gt;You don't get to claim the LLM-as-judge is consistent. You measure it.&lt;/p&gt;

&lt;p&gt;I ran the judge twice over the same 12 transcripts. Same prompt. Same model. Default sampling.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Max stdev&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-criterion (raw 0–1 scores)&lt;/td&gt;
&lt;td&gt;0.071&lt;/td&gt;
&lt;td&gt;&amp;lt;0.05&lt;/td&gt;
&lt;td&gt;Fails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-step (4-criterion weighted sum)&lt;/td&gt;
&lt;td&gt;0.048&lt;/td&gt;
&lt;td&gt;&amp;lt;0.05&lt;/td&gt;
&lt;td&gt;Passes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-scenario (mean of 4 steps)&lt;/td&gt;
&lt;td&gt;max |Δ| 0.014&lt;/td&gt;
&lt;td&gt;&amp;lt;0.05&lt;/td&gt;
&lt;td&gt;Rock-solid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The judge is jittery at the criterion level, especially on a fuzzy criterion like "uncertainty." That jitter averages out at the step level. At the scenario level, the number that actually matters for ranking tools, it's stable enough.&lt;/p&gt;

&lt;p&gt;So I use &lt;code&gt;scenario_quality&lt;/code&gt; to rank. I don't gate decisions on a single criterion-level delta of 0.05 between two tools. That's inside the noise floor. Per-criterion rationales are commentary, not data.&lt;/p&gt;

&lt;p&gt;The win here isn't proving the judge is perfect. It's knowing &lt;em&gt;where it's reliable and where it isn't&lt;/em&gt;, so I know which numbers I can defend and which ones I shouldn't quote at all.&lt;/p&gt;
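
&lt;p&gt;The three layers of that table can be computed from two judge runs with the standard library alone. A sketch under assumptions: each run is a list of steps, each step a list of criterion scores in the 0 to 1 range, and the criterion weights are passed in by the caller (the equal weights and scores in the example below are made up, not the bench's).&lt;/p&gt;

```python
from statistics import mean, stdev

Run = list[list[float]]  # steps, each holding one score per criterion


def step_scores(run: Run, weights: list[float]) -> list[float]:
    """Collapse each step's criterion scores into one weighted sum."""
    return [sum(w * s for w, s in zip(weights, step)) for step in run]


def max_criterion_stdev(run_a: Run, run_b: Run) -> float:
    """Worst disagreement between the two runs on any single criterion."""
    return max(
        stdev([a, b])
        for step_a, step_b in zip(run_a, run_b)
        for a, b in zip(step_a, step_b)
    )


def max_step_stdev(run_a: Run, run_b: Run, weights: list[float]) -> float:
    """Worst disagreement between the two runs on any step score."""
    pairs = zip(step_scores(run_a, weights), step_scores(run_b, weights))
    return max(stdev([a, b]) for a, b in pairs)


def scenario_delta(run_a: Run, run_b: Run, weights: list[float]) -> float:
    """Absolute difference between the two runs' scenario-level means."""
    return abs(mean(step_scores(run_a, weights)) - mean(step_scores(run_b, weights)))


# Illustrative numbers only: the judge moves two criteria in opposite directions.
weights = [0.25, 0.25, 0.25, 0.25]
run_a = [[0.8, 0.6, 0.7, 0.9], [0.5, 0.5, 0.5, 0.5]]
run_b = [[0.7, 0.7, 0.7, 0.9], [0.5, 0.5, 0.5, 0.5]]
```

&lt;p&gt;In this toy case the per-criterion spread is about 0.07 while the step and scenario numbers barely move, which is the same averaging effect the table shows.&lt;/p&gt;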




&lt;h2&gt;
  
  
  Reproduce it yourself
&lt;/h2&gt;

&lt;p&gt;The bench is open. Adding a new MCP server to test is one shell script in &lt;code&gt;bench/tools/&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/luuuc/sense
&lt;span class="nb"&gt;cd &lt;/span&gt;sense

&lt;span class="c"&gt;# Run the full bench (30 sessions = 5 tools × 6 repos; 12 if you run only sense + baseline)&lt;/span&gt;
bash bench/bench.sh

&lt;span class="c"&gt;# Or run a single (tool, repo) cell&lt;/span&gt;
bash bench/run.sh &lt;span class="nt"&gt;--tool&lt;/span&gt; sense &lt;span class="nt"&gt;--repo&lt;/span&gt; flask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: ~$40 in Opus 4.7 tokens for the full 5×6 matrix. Time: ~20 minutes wall-clock. Every transcript ends up under &lt;code&gt;bench/results/&amp;lt;tool&amp;gt;/&amp;lt;repo&amp;gt;/&lt;/code&gt; with &lt;code&gt;transcript.json&lt;/code&gt;, &lt;code&gt;scored.json&lt;/code&gt;, &lt;code&gt;judged.json&lt;/code&gt;, and &lt;code&gt;run_meta.json&lt;/code&gt; pinned to the commit the agent worked against.&lt;/p&gt;

&lt;p&gt;To add a new MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Drop a config script&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; bench/tools/yourtool.sh &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
#!/bin/bash
# MCP server invocation for yourtool
exec your-mcp-binary --your-flags
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x bench/tools/yourtool.sh

&lt;span class="c"&gt;# 2. Run it&lt;/span&gt;
bash bench/run.sh &lt;span class="nt"&gt;--tool&lt;/span&gt; yourtool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you find a methodology hole, the bench is open and replayable. I'll patch what's real.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the bench does &lt;em&gt;not&lt;/em&gt; claim
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not a benchmark of AI model quality. The model is fixed.&lt;/li&gt;
&lt;li&gt;Not a real-world end-to-end task benchmark. Each scenario is bounded, scripted, rubric-anchored.&lt;/li&gt;
&lt;li&gt;Not a cost benchmark in production terms. Cost is computed from public API pricing for comparability, not from anyone's actual invoice.&lt;/li&gt;
&lt;li&gt;Not a measure of MCP-tool fluency in isolation. Adoption is a secondary signal, not the headline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is, narrowly, this: &lt;em&gt;given a six-scenario, four-step exploration script across six real repos, how does each tool affect the agent's answer quality, citation grounding, and efficiency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A smaller claim than "this is the best code-intel MCP." Also a more useful one.&lt;/p&gt;




&lt;p&gt;PS: The hardest part wasn't building the tool. It was building a benchmark I'd still trust after it favored me. Most benchmarks die at that step.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bench (code, scenarios, rubrics, replay): &lt;a href="https://github.com/luuuc/sense/tree/main/bench" rel="noopener noreferrer"&gt;github.com/luuuc/sense/tree/main/bench&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full results report: &lt;a href="https://github.com/luuuc/sense/blob/main/bench/results/report-for-humans.md" rel="noopener noreferrer"&gt;report-for-humans.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Sense: &lt;a href="https://github.com/luuuc/sense" rel="noopener noreferrer"&gt;github.com/luuuc/sense&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>claude</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>I built AI tools that roast your startup ideas so your investors don't</title>
      <dc:creator>Luc B. Perussault-Diallo</dc:creator>
      <pubDate>Sun, 22 Feb 2026 09:50:46 +0000</pubDate>
      <link>https://forem.com/luuuc/i-built-ai-tools-that-roast-your-startup-ideas-so-your-investors-dont-54di</link>
      <guid>https://forem.com/luuuc/i-built-ai-tools-that-roast-your-startup-ideas-so-your-investors-dont-54di</guid>
      <description>&lt;p&gt;I got tired of pitching bad ideas to polite friends.&lt;/p&gt;

&lt;p&gt;You know the drill: you share your latest "revolutionary" concept, and everyone nods. "Sounds cool." "Interesting." "You should totally build that." Then six months later you realize nobody ever told you it was a terrible idea.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://star.tupcheck.me?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=launch" rel="noopener noreferrer"&gt;StartupCheck&lt;/a&gt;, a set of free AI tools that deliver the painfully honest feedback founders rarely get.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;

&lt;p&gt;Five tools, each targeting a different founder blind spot:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idea Funeral&lt;/strong&gt;&lt;br&gt;
Submit your startup idea. Get an honest eulogy instead of polite encouragement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Corporate Translator&lt;/strong&gt; &lt;br&gt;
Paste your corporate speak. Find out what it actually means. ("Right-sizing the organization" = layoffs start next week.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Startup Therapist&lt;/strong&gt; &lt;br&gt;
Share your founder optimism. Get the reality check your co-founder is too nice to give you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pitch Without Slides&lt;/strong&gt; &lt;br&gt;
Deliver your pitch in plain text. See if it holds up without pretty graphs and stock photos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build or Die&lt;/strong&gt; &lt;br&gt;
Describe your project scope. Find out if your "2-week MVP" is actually a 6-month odyssey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built it
&lt;/h2&gt;

&lt;p&gt;Every founder has blind spots. The best feedback I've ever received was the most uncomfortable. But most people won't give you that feedback unprompted: it's socially awkward, it risks the relationship, and honestly, it's easier to just nod.&lt;/p&gt;

&lt;p&gt;AI has no social incentive to spare your feelings. That's a feature, not a bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  Details
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Free to start, no signup required. Builder plan at 7 EUR/month for higher limits&lt;/li&gt;
&lt;li&gt;Built with AI that has zero reason to be polite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check it out: &lt;a href="https://star.tupcheck.me?utm_source=devto&amp;amp;utm_medium=community&amp;amp;utm_campaign=launch" rel="noopener noreferrer"&gt;star.tupcheck.me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love to hear your thoughts, and yes, you're welcome to roast StartupCheck using its own tools. Fair game.&lt;/p&gt;

</description>
      <category>startup</category>
      <category>ai</category>
      <category>sideprojects</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
