<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mike</title>
    <description>The latest articles on Forem by Mike (@bambushu).</description>
    <link>https://forem.com/bambushu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862788%2Fd6e6c9d4-b9e7-4bba-9dbb-da78c3992d2b.png</url>
      <title>Forem: Mike</title>
      <link>https://forem.com/bambushu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bambushu"/>
    <language>en</language>
    <item>
      <title>A 4-model adversarial review pipeline that costs $0.30 per audit</title>
      <dc:creator>Mike</dc:creator>
      <pubDate>Tue, 05 May 2026 15:52:03 +0000</pubDate>
      <link>https://forem.com/bambushu/a-4-model-adversarial-review-pipeline-that-costs-030-per-audit-3bbf</link>
      <guid>https://forem.com/bambushu/a-4-model-adversarial-review-pipeline-that-costs-030-per-audit-3bbf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Single-model code review has correlated blind spots. A panel of frontier models from different vendor families catches&lt;br&gt;
   more, but only if a final verification step throws out the hallucinations the panel converged on.&lt;br&gt;&lt;br&gt;
  &lt;a href="https://github.com/Bambushu/crucible" rel="noopener noreferrer"&gt;Crucible&lt;/a&gt; is a Claude Code skill that does both for about $0.30 a run.                  &lt;/p&gt;




&lt;p&gt;If GPT misses a bug, the runner-up GPT-class model usually misses it too. Same training data, same blind spots. The fix isn't a bigger model. It's a panel of structurally different ones.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/Bambushu/crucible" rel="noopener noreferrer"&gt;Crucible&lt;/a&gt; to test this. It's a Claude Code skill that walks a codebase file by file, runs each file through four frontier models from different vendor families, then has Claude verify the findings against the actual source before showing them to me.&lt;/p&gt;

&lt;p&gt;This post is about the orchestration tricks that made it actually work.&lt;/p&gt;

&lt;h2&gt;The panel&lt;/h2&gt;

&lt;p&gt;Four models, four families. As of writing, the default panel is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSeek V4-Pro
&lt;/li&gt;
&lt;li&gt;Google Gemini 3.1 Pro
&lt;/li&gt;
&lt;li&gt;Moonshot Kimi K2.6
&lt;/li&gt;
&lt;li&gt;MiniMax M2.7
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Family diversity matters more than raw benchmark scores. Two top-tier OpenAI-family models will share weaknesses. A DeepSeek + Gemini + Kimi + MiniMax panel does not.&lt;/p&gt;

&lt;h2&gt;Pass 1: chained review&lt;/h2&gt;

&lt;p&gt;Each file goes through the panel in order. Model A finds. Model B sees A's findings, validates and adds. Model C consolidates. Model D ranks by severity and emits the structured output.&lt;/p&gt;

&lt;p&gt;This works because adversarial pressure compounds. Model B is more rigorous when it's grading Model A's homework than when it's writing its own.&lt;/p&gt;
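
&lt;p&gt;A minimal sketch of that chain, assuming a generic &lt;code&gt;ask(model, prompt)&lt;/code&gt; helper in place of the real OpenRouter calls. The model IDs and prompt wording are illustrative, not Crucible's actual templates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PANEL = ["deepseek-v4-pro", "gemini-3.1-pro", "kimi-k2.6", "minimax-m2.7"]

def chained_review(source, ask):
    # Model A reviews cold; each later model sees its predecessor's output.
    a = ask(PANEL[0], "Review this file for bugs:\n" + source)
    b = ask(PANEL[1], "Validate these findings against the source, then add "
                      "any it missed:\n" + a + "\n---\n" + source)
    c = ask(PANEL[2], "Consolidate into one deduplicated list:\n" + b)
    # Model D ranks by severity and emits the structured report.
    return ask(PANEL[3], "Rank by severity and emit JSON findings:\n" + c)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;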

&lt;h2&gt;Pass 2: blind parallel mode&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--blind&lt;/code&gt; runs the same panel concurrently. No model sees another's output. Findings that overlap on file plus line plus topic become "consensus" findings and get ranked higher. Findings only one model raised are tagged but not promoted.&lt;/p&gt;

&lt;p&gt;Blind mode is slower per file but catches different things than chained mode. Chained converges. Blind diverges.&lt;/p&gt;
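
&lt;p&gt;A sketch of that grouping, assuming each finding is a dict with &lt;code&gt;file&lt;/code&gt;, &lt;code&gt;line&lt;/code&gt;, &lt;code&gt;topic&lt;/code&gt;, and &lt;code&gt;model&lt;/code&gt; keys (a hypothetical schema, not necessarily Crucible's):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict

def tag_consensus(findings):
    # Bucket blind-mode findings by (file, line, topic).
    groups = defaultdict(list)
    for f in findings:
        groups[(f["file"], f["line"], f["topic"])].append(f)
    # Two or more distinct models in one bucket = consensus.
    for members in groups.values():
        models = {f["model"] for f in members}
        tag = "consensus" if len(models) &amp;gt; 1 else "single-model"
        for f in members:
            f["tag"] = tag
    return findings
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;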

&lt;h2&gt;Pass 3: Claude verification&lt;/h2&gt;

&lt;p&gt;This is the step everyone skips, and it's the one that matters. After the panel finishes, Claude reads every CRITICAL and HIGH finding back against the actual source code and marks each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confirmed&lt;/strong&gt;: matches the code
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refined&lt;/strong&gt;: real issue, severity adjusted &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disputed&lt;/strong&gt;: panel hallucinated, source disagrees
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Needs human judgment&lt;/strong&gt;: real tradeoff, not a bug
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Claude can also add up to three findings the panel missed. In practice the panel rarely misses things at HIGH severity. It's the LOW and MEDIUM findings that get pruned hardest in this pass.&lt;/p&gt;
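
&lt;p&gt;If you're building something similar, a small enum is enough to carry those four verdicts through to the report. A hypothetical sketch; the field names aren't Crucible's actual schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    CONFIRMED = "confirmed"        # matches the code
    REFINED = "refined"            # real issue, severity adjusted
    DISPUTED = "disputed"          # panel hallucinated; drop from the report
    NEEDS_HUMAN = "needs_human"    # real tradeoff, not a bug

@dataclass
class VerifiedFinding:
    file: str
    line: int
    severity: str                  # CRITICAL / HIGH / MEDIUM / LOW
    description: str
    verdict: Verdict
    note: str = ""                 # Claude's one-line justification
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;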

&lt;h2&gt;What broke and how I fixed it&lt;/h2&gt;

&lt;p&gt;A few orchestration bugs I hit. Sharing them in case you're building something similar.&lt;/p&gt;

&lt;h3&gt;Reasoning models eat your output&lt;/h3&gt;

&lt;p&gt;OpenRouter's &lt;code&gt;chat.completions&lt;/code&gt; returns &lt;code&gt;.content&lt;/code&gt; and &lt;code&gt;.reasoning_content&lt;/code&gt; separately on DeepSeek-R1 and Nemotron. The actual review lives in &lt;code&gt;reasoning_content&lt;/code&gt; after a &lt;code&gt;&amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt;&lt;/code&gt; block. If your wrapper only reads &lt;code&gt;.content&lt;/code&gt;, you'll get empty strings half the time and not know why.&lt;/p&gt;

&lt;p&gt;Crucible's extractor reads both fields and strips &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; blocks before parsing.&lt;/p&gt;
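
&lt;p&gt;A minimal sketch of that extraction, assuming an OpenAI-compatible &lt;code&gt;choices[0].message&lt;/code&gt; object as OpenRouter returns it today (treat the exact field names as assumptions and check them against your provider):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

THINK_BLOCK = re.compile(r"&amp;lt;think&amp;gt;.*?&amp;lt;/think&amp;gt;", re.DOTALL)

def extract_review(message):
    # Prefer .content; fall back to .reasoning_content when it's empty.
    text = getattr(message, "content", None) or ""
    if not text.strip():
        text = getattr(message, "reasoning_content", None) or ""
    # Strip any &amp;lt;think&amp;gt; reasoning block before parsing findings.
    return THINK_BLOCK.sub("", text).strip()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;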

&lt;h3&gt;Free-tier model caches drift&lt;/h3&gt;

&lt;p&gt;I had an existing skill that called OpenRouter's free tier. Free models swap behavior weekly. Crucible has its own paid-only model cache at &lt;code&gt;~/.crucible/models.json&lt;/code&gt; with a 72-hour TTL, refreshed by &lt;code&gt;discover-premium.sh&lt;/code&gt;.&lt;/p&gt;
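
&lt;p&gt;The cache check is nothing fancy. A sketch, assuming &lt;code&gt;discover-premium.sh&lt;/code&gt; is on your PATH and rewrites the file in place:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json, subprocess, time
from pathlib import Path

CACHE = Path.home() / ".crucible" / "models.json"
TTL = 72 * 3600  # 72 hours

def load_panel():
    # Refresh the paid-only model cache when missing or older than the TTL.
    if not CACHE.exists() or time.time() - CACHE.stat().st_mtime &amp;gt; TTL:
        subprocess.run(["discover-premium.sh"], check=True)  # rewrites CACHE
    return json.loads(CACHE.read_text())
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;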

&lt;h3&gt;Mid-run network blips lose work&lt;/h3&gt;

&lt;p&gt;Findings persist as they land, so a rate-limit hiccup at file 40 of 60 doesn't cost you the prior 39. Resume a partial run with &lt;code&gt;--resume &amp;lt;run-id&amp;gt;&lt;/code&gt;.&lt;/p&gt;
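
&lt;p&gt;The persistence is write-as-you-go plus a skip list on resume. A minimal sketch with a hypothetical run-directory layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pathlib import Path

def review_run(files, run_id, review_one):
    run_dir = Path.home() / ".crucible" / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    # Anything already on disk survived the last attempt; skip it on resume.
    done = {p.stem for p in run_dir.glob("*.json")}
    for path in files:
        key = path.replace("/", "__")
        if key in done:
            continue
        findings = review_one(path)  # the full panel pass for one file
        (run_dir / (key + ".json")).write_text(json.dumps(findings))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;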

&lt;h3&gt;Panel drift mid-run&lt;/h3&gt;

&lt;p&gt;Freeze your model panel at the start of the run. No &lt;code&gt;--auto&lt;/code&gt; cache mutation. The same four model IDs that started the run finish it. Otherwise file 1 might be reviewed by Kimi K2.6 and file 30 by Kimi K2.7, with subtly different output formats that break the aggregator.&lt;/p&gt;
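
&lt;p&gt;One way to enforce the freeze, sketched with a hypothetical per-run manifest: write the panel once at startup, then reread the manifest for every file instead of the mutable cache:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pathlib import Path

def frozen_panel(run_id, current_panel):
    # First call wins: later cache refreshes can't change this run's panel.
    manifest = Path.home() / ".crucible" / "runs" / run_id / "panel.json"
    if manifest.exists():
        return json.loads(manifest.read_text())
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text(json.dumps(current_panel))
    return current_panel
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;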

&lt;h2&gt;Cost&lt;/h2&gt;

&lt;p&gt;A 5-file, 3-model SOTA panel run on a small project: $0.05 to $0.15. A 50-file run: about $0.50.&lt;/p&gt;

&lt;p&gt;The first real audit on a 1,162-line project (5 files, 4 models) cost $0.30 and surfaced 30 findings. Nine were HIGH severity. All nine were real after Claude verification. Three of them were things I'd reviewed myself and missed.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash                                                                                                                          
  git clone https://github.com/Bambushu/crucible.git ~/.claude/skills/crucible                                                   
  export OPENROUTER_API_KEY=sk-or-...                                                                                              
  # restart Claude Code                                                                                                          

  Then from any project:                                                                                                         

  /crucible                              # current branch's diff
  /crucible --all                        # whole repo
  /crucible --paths "src/**/*.ts"        # glob                                                                                    
  /crucible --diff main...HEAD           # range                                                                                   

  MIT licensed. Issues and PRs welcome.                                                                                            

  &lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Bambushu" rel="noopener noreferrer"&gt;
        Bambushu
      &lt;/a&gt; / &lt;a href="https://github.com/Bambushu/crucible" rel="noopener noreferrer"&gt;
        crucible
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Codebase-level adversarial review by a panel of frontier models. A Claude Code skill that runs every file through DeepSeek + Gemini + Kimi + MiniMax in sequence, then has Claude verify the findings against the actual source.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/Bambushu/crucible/assets/hero.png"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FBambushu%2Fcrucible%2FHEAD%2Fassets%2Fhero.png" alt="Crucible" width="520"&gt;&lt;/a&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Crucible&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Codebase-level adversarial review by a panel of frontier models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Claude Code skill that walks your code piece-by-piece and puts every file under simultaneous pressure from a panel of structurally different models, then aggregates the findings into a single severity-ranked report that Claude itself verifies before you see it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Bambushu/crucible#install" rel="noopener noreferrer"&gt;Install&lt;/a&gt; · &lt;a href="https://github.com/Bambushu/crucible#how-it-works" rel="noopener noreferrer"&gt;How it works&lt;/a&gt; · &lt;a href="https://github.com/Bambushu/crucible#cost" rel="noopener noreferrer"&gt;Cost&lt;/a&gt; · &lt;a href="https://github.com/Bambushu/crucible#modes" rel="noopener noreferrer"&gt;Modes&lt;/a&gt; · &lt;a href="https://github.com/Bambushu/crucible#sample-report" rel="noopener noreferrer"&gt;Sample report&lt;/a&gt;&lt;/p&gt;


&lt;/div&gt;
&lt;br&gt;


&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What it is&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;/crucible&lt;/code&gt; is a &lt;a href="https://claude.com/claude-code" rel="nofollow noopener noreferrer"&gt;Claude Code&lt;/a&gt; slash-command skill. You drop the folder into &lt;code&gt;~/.claude/skills/&lt;/code&gt;, set one env var, and from inside any project you can run:&lt;/p&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;/crucible                              # review the current branch's diff
/crucible --all                        # review the whole repo
/crucible --paths "src/api/**/*.ts"    # review a glob
/crucible --diff main...HEAD           # review a specific range
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Behind the scenes, Claude:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Resolves the file list and prints a pre-flight (files, models, est. cost).&lt;/li&gt;
&lt;li&gt;Loads a panel of four current SOTA paid models from OpenRouter, each from a…&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Bambushu/crucible" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
   


</description>
      <category>opensource</category>
      <category>ai</category>
      <category>claude</category>
      <category>codereview</category>
    </item>
  </channel>
</rss>
