Forem: Brian Mello

AI Code Review in 2026: How the Tools Actually Differ (A Builder's Field Guide)

Brian Mello — Fri, 22 May 2026 17:21:07 +0000

If you searched "AI code review" six months ago, the landscape looked roughly like CodeRabbit, a handful of GitHub-bot startups, and your IDE's built-in assistant. Today it's a much wider field — Qodo, Greptile, Bito, Coderabbit, Codium, Sourcegraph's Cody, plus every IDE shipping its own "review this change" button — and the answer to "which one should I use?" depends on questions nobody seems to be asking out loud.

I run 2ndOpinion, a multi-model AI code review CLI. So yes, I'm biased. I'm going to try to be honest about it anyway, because what I actually want is for you to pick the right category of tool for how you work — and then, within that category, pick the one that matches your tradeoffs. If that's not us, that's fine.

Here's how I think about the landscape after building in it for the better part of a year.

The three categories that actually exist

The category labels matter more than the brand names. Almost every tool falls into one of three buckets:

Async PR reviewers. Bot-on-GitHub, bot-on-GitLab. Reviews show up as comments after you push. CodeRabbit, Qodo Merge, Bito, Greptile are the loudest names here.
In-editor copilots. "Review this change" inside Cursor, VS Code Copilot, Cody, JetBrains AI. Synchronous, in-flow, ephemeral.
CLI / CI reviewers. Run locally on a diff or in a pipeline step. Output is structured, scriptable. This is where 2ndOpinion lives, alongside tools like Aider's review modes and a growing pile of homegrown CI wrappers.

These aren't competing products as much as competing times in the day when AI reviews your code. Some teams use all three. Most should use at least two.

What each category is actually good at

Async PR reviewers are best when the reviewer is supposed to be a teammate-shaped entity — leaving inline comments, approving or requesting changes, surfacing in the same UI where humans review. The strength is integration with the social workflow of a PR. The weakness is timing: feedback arrives after you've context-switched. By the time the bot comments, you're already in your next branch.

In-editor copilots are best for shipping velocity. The review happens while the code is still warm. The weakness is the same model bias I keep writing about — the model that helped you write the code is the worst possible reviewer of that code. If your editor's copilot and your editor's reviewer are the same model, you're getting a confidence boost, not a review.

CLI / CI reviewers are best for policy — making review a gate, not a suggestion. They run on every diff, with consistent thresholds, in an environment you control. The weakness is that they're harder to set up than installing a GitHub app, and the output is less pretty than inline comments.

If you only pick one, pick based on whether your bottleneck is catching bugs (CI), velocity (editor), or team review hygiene (PR bot).

The single-model vs multi-model split

Cutting across all three categories is a more interesting axis: how many models is the tool actually consulting?

Most of the well-known tools today are single-model. CodeRabbit publishes its model choices, Qodo lets you swap, Cursor uses whichever model you've selected in the sidebar. The review you get is one model's opinion.

A smaller group runs more than one model. 2ndOpinion runs Claude, Codex, and Gemini and surfaces both the individual reviews and a synthesized consensus verdict. A handful of newer tools are starting to do similar things.

I've written about why this matters in detail before, but the short version: each model has systematic blind spots that don't show up until you compare its review to another model's. Single-model review feels comprehensive because the model is confident. Multi-model review feels noisier because it actually surfaces the disagreement that was there all along.

If your tolerance for false negatives is low — security-sensitive code, infra, anything touching money — multi-model is worth the extra cost. If your tolerance is high — internal tools, prototypes, anything you'll rewrite in a month — single-model is probably fine.

What I'd actually recommend, by team shape

Solo developer, fast iteration. In-editor review only. Cursor or Copilot's review feature, plus whatever you're already using to write the code. Don't add a CI gate that blocks your own merges — you'll bypass it within a week.

Small team (2–5 engineers), shipping to production. PR bot for the team-review surface, plus a CLI/CI step for the actual gate. The PR bot gives you the social workflow. The CLI gives you the consistent policy.

Mid-size team, security-sensitive code. All three layers, with multi-model at the CI gate. The CI step is where you can afford the latency and cost of running multiple models — every PR runs through it once, and the cost is bounded.

Large org, monorepo. This is the case where I'd most strongly recommend a CLI/CI tool over a PR bot. PR bots tend to scale badly on monorepos — they choke on large diffs, or they review files the change didn't actually touch, or they cost a fortune because every PR pulls in the whole context. CLI tools let you scope the review precisely.

Where 2ndOpinion fits (and where it doesn't)

The honest pitch: if you want multi-model consensus, in a CLI or MCP server form factor, with first-class CI integration, that's what we do. We don't have a GitHub PR bot. We're not in your editor as a sidebar. We're a CLI and an MCP server.

If you want a pretty PR comment with inline annotations, you probably want CodeRabbit or Qodo Merge. If you want a sidebar reviewer inside Cursor, Cursor's own review is the right answer.

What we're good at: running every diff through Claude, Codex, and Gemini in parallel, getting back three independent reviews plus a synthesized verdict, and either running it locally as a CLI or wiring it in as an MCP tool inside Claude Code, Cursor, or any MCP-compatible editor. Setup is one npm install -g 2ndopinion-cli and three API keys.

How to actually decide

A working heuristic:

If your last production bug was the kind of thing a careful reviewer would have caught and AI didn't, you need either a different model or more models. Try multi-model.
If your last production bug was the kind of thing nobody would have caught, you don't need more models — you need better tests, observability, or rollback infrastructure. AI review won't save you.
If your bottleneck is "PRs sitting unreviewed for two days," any of the async PR bots will help. The specific brand matters less than picking one and getting your team to actually trust it.
If your bottleneck is "we ship a lot but we ship buggy code," that's a CI gate problem. Single-model is a start; multi-model is the upgrade.

The thing nobody in the AI-tooling space wants to say out loud is that the tool isn't the constraint. The constraint is whether your team treats the review output as signal or noise. Pick the tool that produces a kind of output your team will actually act on — and then enforce that they act on it.

If you want to try multi-model consensus review on your next diff, the CLI is one command: npm install -g 2ndopinion-cli. Setup walkthrough and the MCP server config at get2ndopinion.dev.

I Let Three AIs Argue About My Vibe-Coded App — Here's What They Caught

Brian Mello — Fri, 15 May 2026 17:08:33 +0000

I built a small side project in Cursor over a weekend. Login, dashboard, a couple of forms, a Stripe-style checkout flow. The kind of thing that feels done. Clicking around, everything works. The vibes are immaculate.

So I did the responsible adult thing: I shipped it.

It broke in three places within 48 hours. None of the breaks were in code I had written by hand. They were in code an AI had generated that I had skimmed, nodded at, and moved on from.

That's the trap of vibe coding. The AI is fluent. You're fluent at reading what the AI made. Neither of you is the kind of pedantic loser who notices that the "Cancel" button on the checkout modal actually submits the form on mobile Safari because someone forgot to add type="button" somewhere three components deep.

This is the story of the second app I built, where I tried something different. I let three AI testing agents argue about my app before I shipped it. They caught seven things I would have missed. They also disagreed with each other in ways that, weirdly, made me trust the result more.

The setup

The app was a simple expense-splitting tool — Splitwise but uglier and free. Built in Cursor, deployed on Vercel, total dev time around eight hours spread across two evenings. By the end I had:

Email/password signup
A "create group" flow
An "add expense" form with split logic
A settle-up view

Standard vibe-coded SaaS skeleton. Worked on my machine. Looked fine on my phone.

Instead of clicking around for an hour and calling it good, I pointed 2ndOpinion Testing at the URL. The pitch on the box is "AI agents test your app like real users, then cross-examine each other's findings." I'd seen the demo. This was the first time I'd used it on something I actually cared about.

What "three AIs arguing" actually looks like

The product runs three different model-backed agents at your app concurrently. Each one explores independently — clicking, typing, navigating like a confused new user who has never seen the thing before. They each file a report on what's broken or weird.

Then comes the part that earns the courtroom metaphor in the marketing: the agents cross-examine each other's findings. Agent A claims the signup form is broken. Agent B says they signed up just fine. The system makes them reproduce, defend, or retract.

You don't end up with three separate reports to read. You end up with one verdict: here's what's actually wrong, here's what one agent thought was wrong but couldn't reproduce, here's what all three independently flagged.

Reading the final verdict felt like reading the minutes of a deposition. In a good way.

The seven things they caught

I'll walk through them in increasing order of "ouch, I should have caught that."

1. The signup email field accepted "test" as a valid email. All three agents flagged this. Front-end validation was just required, no type="email". Cursor had generated a form with the bare minimum and I hadn't tightened it. Five-second fix. Would have looked terrible the first time a real user mistyped.

2. The "Add Expense" form let you submit $0. Two of three agents tried it, both succeeded, both filed it. The third agent said "this is probably intentional, some groups track zero-dollar IOUs." The system made them argue about it. They settled on "probably a bug, ask the developer." It was a bug.

3. The settle-up calculation rounded wrong on three-way splits. $10 split three ways became $3.33 + $3.33 + $3.33, which is $9.99. Someone was always going to be a penny off. One agent caught it by splitting a coffee three ways and noticing the totals didn't reconcile. The other two had only tested two-way splits.

4. Pressing Enter in the "group name" field submitted the form before I'd added any members. Only one agent caught this — the others were filling forms by clicking the submit button like polite humans. The one that pressed Enter found a half-broken state where the group existed but had no members and couldn't be edited.

5. The mobile nav menu didn't close after navigating. Two agents flagged it. Classic AI-generated React component thing. The menu had open/close state, but route changes didn't reset it.

6. The password reset email link 404'd. I had not, in fact, set up the password reset route. The "Forgot password?" link went to /reset-password which did not exist. I had written the link before writing the page and never come back to it. One agent found this by clicking every link on the login screen. Embarrassing.

7. The Stripe-style checkout for the (currently mocked) "Pro" tier accepted submissions but didn't go anywhere. I had stubbed out the Pro upgrade flow and forgotten about it. The button looked real. The page it led to was a 404.

Seven real things. None of them catastrophic, all of them the kind of thing that, on a launch day with twenty people poking at your app, accumulate into "this product feels janky."

The part I didn't expect: the disagreements

The disagreements are what convinced me this approach actually works. Here are two:

Was the signup flow too slow? One agent flagged the signup as "slow, took 4 seconds." The other two said it felt normal. The system made the first agent show its work. Turned out it had been testing on a throttled connection it had picked up from somewhere in its state, and the other two hadn't. The finding got retracted. If I had just had one agent, I'd have gone hunting for a phantom performance problem.

Was the "delete group" confirmation modal confusing? Two agents thought the wording was unclear. The third said it was fine. The argument ended with "this is subjective, flagging for human review." That's the right answer. The tool wasn't pretending to be sure when it wasn't.

I have used single-AI testing tools before. They sound confident about everything, including the wrong things. Watching three agents disagree and then resolve felt much closer to the experience of having three different humans review a PR. Some things were unanimous. Some things were noise. The noise got filtered before it got to me.

What I'd tell another vibe coder

A few things, in order of how often I've now had to repeat them to friends:

You don't need to learn Playwright. You don't need to write Cypress specs. You don't need to even know what "end-to-end testing" is in the traditional sense. If you built your app in Bolt, Lovable, v0, or Replit, the testing tool you want is the same kind of thing — point it at a URL, let it figure out what to do.

You do need to test before you ship, not after. The temptation when you've spent a weekend vibing with an AI is to deploy on Sunday night, post on X, and hope. Resist. A 20-minute pre-flight on a Sunday afternoon catches the seven things that would have been a soft launch disaster.

You should care about the disagreements more than the agreements. If your testing tool always sounds 100% confident, it's lying to you. Real bugs aren't unanimous. The interesting findings are the ones where one agent saw something and the others didn't — and you get told whether the holdout was right.

Try it

If you've vibe-coded anything in the last month and it's sitting in a Vercel deployment waiting for you to feel brave enough to share the link, I'd run it through this before you do.

Try 2ndOpinion Testing →

You paste a URL. Three AIs argue about it. You ship with fewer surprises. That's the whole product.

The Splitwise-but-uglier app is still up, by the way. Seven fewer embarrassments than it would have had. I'll take it.

How to Add Multi-Model AI Code Review to Your CI/CD Pipeline

Brian Mello — Sat, 09 May 2026 17:31:55 +0000

Running AI code review locally is fine for solo work. The moment you have a team, the question becomes: how do I make the AI an actual gate in the pipeline, not a thing one person remembers to run before they push?

This is a walkthrough for wiring 2ndOpinion — the multi-model AI code review CLI — into a CI/CD pipeline. I'll show GitHub Actions in full, then sketch the same pattern for GitLab CI and CircleCI. The interesting decisions aren't where the YAML goes; they're around consensus thresholds, blocking vs informational mode, and what happens when Claude, Codex, and Gemini disagree on the same diff (which, from our review logs, is roughly 15% of the time).

What "AI code review in CI" actually means

There are two shapes this takes, and the YAML is almost identical for either. The difference is the policy:

Informational mode. Every PR runs the review. Findings are posted as a comment or check annotation. Nothing blocks merge. Humans decide what to do.
Blocking mode. Review runs on every PR. If the consensus surface flags a HIGH severity finding, the check fails and merge is blocked until the author either fixes it or someone with override permission ships anyway.

I recommend starting in informational mode for the first week or two. AI reviewers — even three of them cross-examining each other — surface false positives. You want the team to learn the noise floor before the bot can block their merges, otherwise the first false-positive blocker generates a Slack thread that ends with "let's just turn this off."

The minimum GitHub Actions config

Here's the workflow file I use as a starting point. Drop it in .github/workflows/ai-review.yml:

name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # need full history for diffs

      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install 2ndOpinion CLI
        run: npm install -g 2ndopinion-cli

      - name: Run multi-model review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          2ndopinion review \
            --base origin/${{ github.base_ref }} \
            --head HEAD \
            --format github-comment \
            --severity-threshold medium \
            --comment-pr ${{ github.event.pull_request.number }}

A few things worth calling out:

fetch-depth: 0 is necessary because actions/checkout defaults to a shallow clone, and the CLI needs full history to compute the actual PR diff against the base branch. Skip this and your review runs against an empty diff, which produces a confidently empty review.

Three API keys. Multi-model review means three providers. If you only set one, the CLI degrades to single-model mode and prints a warning. That's fine for a smoke test, but the whole reason you're doing this is the multi-model surface — the disagreement signal.

--severity-threshold medium suppresses LOW findings in the PR comment. LOW is mostly nits and style preferences, and posting them on every PR trains your team to ignore the bot. Keep MEDIUM and HIGH visible; suppress LOW.

Going from informational to blocking

To turn this into a merge gate, change one flag and one branch protection setting.

In the workflow:

2ndopinion review \
  --base origin/${{ github.base_ref }} \
  --head HEAD \
  --format github-comment \
  --severity-threshold medium \
  --fail-on high \
  --comment-pr ${{ github.event.pull_request.number }}

The --fail-on high flag tells the CLI to exit with a non-zero status if any HIGH severity finding has consensus from at least 2 of 3 models. The 2-of-3 threshold matters — it's why you don't want to block on single-model verdicts. Any single model can confidently invent a critical bug. Two models independently flagging the same critical bug is meaningfully harder to fake.

Then in Settings → Branches → Branch protection for your default branch, add the AI Code Review / review check to the required checks list. Now the merge button is gated.

I'd hold this back for at least a week of informational-mode runs. Look at the false positive rate. If you're getting more than one false HIGH per ten PRs, tune the consensus threshold up to 3-of-3 instead of 2-of-3 before flipping the gate on:

--fail-on high --consensus-required 3

That's stricter — only blocks when all three models agree the finding is HIGH. False positive rate drops, false negative rate goes up. Tradeoff worth making early; you can loosen later once the team trusts the bot.

GitLab CI

Same pattern, different YAML. .gitlab-ci.yml:

ai-code-review:
  stage: test
  image: node:20
  rules:
    - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
  variables:
    GIT_DEPTH: 0
  script:
    - npm install -g 2ndopinion-cli
    - 2ndopinion review
        --base origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME
        --head HEAD
        --format gitlab-note
        --severity-threshold medium
        --comment-mr $CI_MERGE_REQUEST_IID

The CLI knows about GitLab's note format and uses CI_JOB_TOKEN automatically if it's available in the environment, so you don't need to set up a separate token unless you want bot-attributed comments.

CircleCI

CircleCI's config doesn't have the same first-class PR concept, but the CLI handles it. .circleci/config.yml:

version: 2.1
jobs:
  ai-review:
    docker:
      - image: cimg/node:20.11
    steps:
      - checkout
      - run: npm install -g 2ndopinion-cli
      - run:
          name: Run review
          command: |
            2ndopinion review \
              --base origin/main \
              --head HEAD \
              --format json \
              --output review.json
      - store_artifacts:
          path: review.json

CircleCI doesn't have a native PR-comment surface, so I store the review as a build artifact and add a separate small script to POST the JSON to the GitHub PR via a personal access token. Less elegant than the GitHub Actions path, but it works.

What to do when the models disagree

The reason multi-model review is in CI in the first place is that disagreements are signal, not noise. The CLI's default behavior on a finding where models split:

3-of-3 agree (HIGH): posted as a HIGH finding, blocks merge if --fail-on high is set.
2-of-3 agree (HIGH): posted as a HIGH finding with the dissenting model's argument attached, blocks if --consensus-required 2.
1-of-3 (HIGH): posted as a NOTE-level finding with the model's argument and the other two models' counter-arguments. Never blocks. Visible to humans.

That last category is the most underrated output. About 8% of our diffs produce a 1-of-3 HIGH where exactly one model is convinced something is broken and the other two say it's fine. Most of those are false positives by the lone model. But about a quarter of them — by far the most interesting quarter — are real bugs that two models missed. You don't want those silently dropped, but you also don't want them blocking merges. NOTE-level surfacing is the right answer.

Cost and time, in case you're worried about either

Median review on a typical 200-line diff: about 40 seconds wall-clock and roughly $0.06 in combined API spend across the three providers. That's wall-clock time the developer doesn't spend; it runs in parallel with the rest of the CI matrix. The cost works out to less than a tenth of what most teams pay for any single human reviewer's hour, which is the right comparison — multi-model review doesn't replace human review, it replaces the human reviewer asking "did you check for race conditions" by hand.

We've seen teams skip the AI review step for files larger than 1000 lines or generated files (lockfiles, schema dumps) — --exclude '**/*.lock' and --max-diff-lines 1000 handle both.

If you want to try this on a real repo, the CLI is npm install -g 2ndopinion-cli and the docs for every flag mentioned above are at get2ndopinion.dev. The MCP server flavor (for plugging the same review engine into Claude Code or Cursor as an agent tool) is also there.

We publish a weekly build-in-public update; this post is part of it. If you wire 2ndOpinion into your CI and one of your three models flags something the other two missed on a real diff, send the case over — those are the ones we use to tune the consensus thresholds.

How to Test Your AI-Built App Without Writing a Single Test

Brian Mello — Fri, 01 May 2026 17:07:41 +0000

You opened Cursor. You typed "build me a booking app." Forty-five minutes later, you have something that runs. The login works. The calendar mostly works. You ship it.

Then a friend tries it and the date picker goes blank on iOS. Another user finds that hitting back after a failed payment leaves the form locked. Someone else can't sign up because their email has a plus sign in it.

Welcome to the gap nobody talks about in the vibe coding era: AI-built apps are easier to ship than ever, and exactly as buggy as you'd expect from code you didn't fully read. Traditional testing — the unit tests, the integration tests, the Selenium suites — assumes you have time, expertise, and patience to write them. Vibe coders have none of those. So the apps go out untested.

This post is about the shift that's quietly happening: AI testing tools that act like real users, find real bugs in your AI-built app, and never ask you to write a line of test code.

Why traditional testing fails vibe coders

Selenium is from 2004. Cypress and Playwright are better, but the workflow hasn't really changed: you write a script that says click this, type that, assert this. Then your AI rebuilds the navbar and your selectors break. You spend an afternoon fixing tests instead of shipping features.

The friction is bad enough for full-time engineers. For someone who built their app in Bolt or Lovable over a weekend, it's a hard no. You're not going to become a QA engineer. You're going to ship and hope.

There is a third option, and it turns out it's pretty obvious in hindsight: have the AI do the testing too.

The shift: AI agents that test like users

The new generation of testing tools doesn't ask you to describe what to test. It asks for a URL.

You point the tool at your live app. An agent opens it, looks at the page, and behaves like a curious human. It clicks things. It fills in forms. It tries weird inputs. It tries to break the flow. It does the things you'd do if you sat down to test your own app, except it doesn't get bored after the third happy path.

This is fundamentally different from script-based testing. Scripts only test what you told them to test. An agent explores. It can find the dead end you didn't know existed because you never thought to script the click that gets you there.

The closest analogy is hiring a junior QA contractor who's actually thorough — except this one shows up in thirty seconds and costs less than a sandwich.

What "no scripts" actually means

When I say no scripts, I mean it literally. No selectors. No fixtures. No mocks. No test framework to install. The mental model is more like:

Here's my app: testing.example.com
Here's a sentence about what it does: a booking flow for a yoga studio
Find the bugs

That's the entire input. You spend more time writing the sentence than configuring anything else.

The output is a verdict. Not a failed test name and a stack trace — a plain-English description of what's broken, why it matters, and how to reproduce it. Screenshots included.

For a vibe coder, this is the whole point. You don't want to learn the testing tool. You want to know if your app is shippable.

Three AIs walk into a courtroom

Here's where it gets interesting. A single AI agent tests your app and tells you it found three bugs. How do you know it's right? AI agents hallucinate. They confidently report "the login button doesn't work" when actually they just couldn't find it because a cookie banner was in the way.

The fix is the same fix that's working everywhere else AI gets used in production: you ask more than one model, and you make them justify themselves.

In 2ndOpinion's testing product, three different agents test your app independently. Then they cross-examine each other's findings, courtroom style. Did Agent A really see this bug, or did it misread the page? Can Agent B reproduce it? Does Agent C agree that the failure mode is what A says it is?

What you get back is a verdict with confidence levels. The bugs that all three agents independently found are almost certainly real. The ones only one agent flagged usually aren't. This cuts the false-positive rate dramatically and saves you from chasing ghosts.

If you've ever been burned by an AI tool that confidently lied to you, this is the cure. Make them argue. The truth tends to survive the argument.

A typical workflow

Here's what testing an AI-built app looks like when you remove the scripting:

You finish a feature in Cursor, v0, Lovable, Replit — wherever you build. You deploy it. You paste the URL into your testing tool. You add a one-line description of what the app does and which flow you care about. You hit go.

A few minutes later, you have a list of issues. Not "test_login_button_failed at line 47." Something like: "If a user enters an email with a plus sign, the signup form silently fails. No error message appears, the button just stops responding. Reproduced in Chrome and Safari."

You take that to your AI coding tool. You paste the bug. You ask for a fix. You redeploy. You re-run the test. You ship.

The total cycle is maybe twenty minutes. Compare that to writing a single Cypress test from scratch, which is also twenty minutes — except you're still at zero coverage when you finish.

What this catches that you'd miss

The category of bugs that AI testing finds reliably is the one vibe coders most often ship by accident.

Edge cases in input handling. The plus sign in the email. The apostrophe in the last name. The phone number with a country code.

Broken back buttons and refresh behavior. The state that doesn't persist. The form that reposts on refresh. The "session expired" page that has no way out.

Mobile-specific weirdness. The viewport that doesn't scroll. The keyboard that covers the submit button. The autofocus that fights with the iOS keyboard.

Auth flows that work for the happy path and explode otherwise. Wrong password. Expired link. Already-registered email. OAuth cancellation halfway through.

These are the bugs your friends find on Twitter the day after you launch. They're also the ones a single AI rarely catches reliably, which is why the multi-agent cross-examination matters.

Where this is heading

Two things are going to happen over the next year. First, this kind of testing becomes default. Pasting a URL and getting a verdict will feel as obvious as pasting an error into ChatGPT. Second, the tools that survive will be the ones that handle disagreement honestly — that show you when their agents argued, who won, and why.

The vibe coder workflow has been bottlenecked on testing for two years. The unblocking is happening right now, and it doesn't involve learning Selenium.

If you've built something in Cursor, Bolt, or Lovable and you're nervous about shipping it: that's a reasonable feeling, and the tools to act on it finally exist.

If you want to try this on something you've built, 2ndOpinion Testing is the macOS desktop app I'm building for exactly this. Paste a URL, get a verdict. No scripts, no selectors, no test framework to learn.

When Claude, Codex, and Gemini Disagree on the Same Code

Brian Mello — Fri, 24 Apr 2026 17:06:36 +0000

When we tell people 2ndOpinion runs every pull request past Claude, Codex, and Gemini and then cross-examines the findings, the most common follow-up is: "Do they actually disagree? Or is this just three models rubber-stamping each other?"

The answer, from about six months of production review logs, is that they disagree often enough to matter. Not on everything — maybe 15% of diffs — but the disagreements cluster on exactly the kinds of bugs that hurt in production: concurrency, null handling, subtle security issues, and "this works but it's going to page you at 3am" architectural smells.

Here are four real cases, lightly anonymized, where the three models read the same code and came back with meaningfully different verdicts. If you're trying to decide whether multi-model review is worth the extra tokens, these are the kinds of arguments it's buying you.

Case 1: The async/await race that only one model saw

The diff was a webhook handler in a Node.js payments service. Roughly:

app.post("/webhook/stripe", async (req, res) => {
  const event = verifySignature(req);
  const existing = await db.events.findOne({ id: event.id });
  if (existing) return res.status(200).send("duplicate");
  await processEvent(event);
  await db.events.insert({ id: event.id, status: "processed" });
  res.status(200).send("ok");
});

Codex flagged it as a textbook race condition: two copies of the same webhook arriving within milliseconds both pass the findOne check before either has written to events, both run processEvent, and you charge the customer twice. Recommended fix: a unique index on id plus wrap the processing in an idempotency key pattern.

Claude said the code was fine and suggested minor cleanup — extract processEvent into a service, add structured logging.

Gemini agreed with Codex about the race but suggested a different fix — optimistic insert first, catch the unique constraint violation, return early if duplicate. Cleaner on the happy path.

The consensus step flagged the race because two of three models saw it. Without cross-checking, whichever single model you happened to be using would have told you the diff was either shippable or a bug — a coin flip on a payments handler.

The lesson isn't that Claude is worse at concurrency. Rerun this prompt on a different day and the models trade places. The lesson is that any single model has blind spots that are invisible until a different model looks at the same code.

Case 2: The "working" SQL that was quietly injectable

A new internal admin endpoint, Python, roughly:

def search_users(query: str, sort: str = "created_at"):
    sql = f"SELECT * FROM users WHERE email ILIKE %s ORDER BY {sort} DESC"
    return db.execute(sql, [f"%{query}%"])

Gemini immediately flagged the SQL injection in the sort parameter — the %s parameterization protects query, but sort is interpolated directly into the string. An attacker who controls sort can turn this into ORDER BY (SELECT ...) DESC and exfiltrate data.

Codex flagged it too, with a suggested allowlist: if sort not in {"created_at", "email", "last_login"}: raise ValueError(...).

Claude said the query was safe because the user parameter was parameterized — and it was technically right about query, but it missed that sort is a user-controllable input from the same request.

This is the most dangerous kind of AI review error: confidently correct about one thing, silent on a worse thing right next to it. A single-model review that happened to land on Claude that day would have said "LGTM." The second opinion is exactly what you want for security-adjacent diffs — one model being wrong is common, two models being wrong in the same direction is rare.

Case 3: The memory leak that wasn't

Sometimes consensus is wrong and the outlier is right. React component, roughly:

useEffect(() => {
  const ws = new WebSocket(url);
  ws.onmessage = (e) => setMessages((m) => [...m, e.data]);
  return () => ws.close();
}, [url]);

Claude and Gemini both flagged a missing cleanup of the onmessage handler and warned about a memory leak if the component re-mounted rapidly.

Codex pushed back — because ws is created inside the effect and ws.close() is called in the cleanup, the socket is garbage-collected along with its handlers. The handler doesn't need explicit removal. The two-of-three majority was wrong; the outlier was right.

This is where our cross-examination step earns its keep. Instead of defaulting to "majority wins," the consensus layer asks the dissenting model to defend its position, then asks the other two to respond. In this case Codex explained the GC behavior, Claude acknowledged the correction, and the final verdict downgraded the finding from "bug" to "stylistic nit."

If you only run majority voting, you get the wrong answer on cases like this. If you run proper cross-examination, you get the right answer and the reasoning, which is how engineers actually build trust in AI review.

Case 4: The Rust borrow checker dispute

A small but contentious one. The diff refactored a hot path:

fn process(items: Vec<Item>) -> Vec<Processed> {
    items.iter()
        .map(|i| transform(i.clone()))
        .collect()
}

Codex flagged the .clone() as wasteful and suggested taking items by value and using into_iter() to move instead of clone.

Gemini agreed with the performance critique but added a nuance — if Item contains anything expensive to clone (like an Arc<Mutex<T>>), the clone is specifically what you don't want in a hot path.

Claude defended the clone. Its argument: if transform is defined for &Item in the existing codebase and changing it breaks fifteen other callers, the clone is the minimal-risk change. "Optimal" and "mergeable" are different targets.

None of the three models was wrong. They were optimizing for different objectives, which is a pattern we see constantly — performance versus maintainability, correctness versus velocity, local improvement versus blast-radius. Multi-model review surfaces that there is a tradeoff rather than presenting one model's preferred answer as The Answer. That's usually more useful than a confident single verdict.

What we do with the disagreements

The short version of the product: every review goes to all three models in parallel. Findings that all three agree on are high-confidence and reported first. Findings where models disagree trigger a cross-examination round where each model sees the others' output and gets a chance to revise. Anything still contested is surfaced to the human reviewer with the full argument attached, rather than hidden behind a single "LGTM."

That last part is the one most people underestimate. You don't want the AI to resolve every disagreement — some disagreements are the signal. A human reviewer who sees "Claude says ship, Gemini says block, here's why" makes a better decision than one who sees a single-model verdict in either direction.

If your team is running code review with one model and wondering what the second opinion would say, that's the whole pitch. Install the CLI with npm i -g 2ndopinion-cli, run 2ndopinion review, and see where your models actually disagree. Or wire it into Claude Code / Cursor as an MCP server — docs at get2ndopinion.dev.

We publish a weekly build-in-public update, and this post is part of it. If you have a case where two AI reviewers disagreed on your code and you're curious what a third would say, send it over — the weird diffs are the fun ones.

How Smart Model Routing Picks the Right AI for Your Programming Language

Brian Mello — Fri, 17 Apr 2026 17:06:47 +0000

The dirty secret of AI code review is that there is no single "best" model. There are only models that happen to be good at the specific thing you're asking them to do right now.

I learned this the hard way while building 2ndOpinion, an AI code review tool where Claude, Codex, and Gemini cross-check each other's work over MCP. The first version hard-coded one model for every review. The reviews were fine for JavaScript. They were embarrassing for Rust. They were weirdly confident and wrong for SQL.

So we stopped picking a single model and started routing. This post is about how that routing layer actually works — what signals we collect, how the scoring math plays out, and what surprised us when we shipped it.

The problem: model strength is language-specific

When we started tracking per-language accuracy, a clear pattern emerged. If you take the same corpus of reviewed pull requests and score each model on catching real bugs versus hallucinating issues that don't exist, you don't get a uniform leaderboard. You get something that looks more like rock-paper-scissors.

One model might be excellent at spotting async/await footguns in TypeScript but completely miss lifetime issues in Rust. Another might be phenomenal at Python decorator patterns and hopeless at Go's error handling conventions. A third might crush Terraform drift detection but flag perfectly valid Kubernetes manifests as "probably wrong."

This isn't a flaw in any particular model. It's a consequence of training data distribution, RLHF feedback, and the fact that "code" is actually hundreds of very different specialties wearing the same trench coat. Treating "AI code review" as one problem and picking one winner leaves performance on the table for every language that isn't the winner's strong suit.

What `--llm auto` actually does

When you run a review with no model flag, the CLI calls the auto router:

npm i -g 2ndopinion-cli
2ndopinion review src/auth.ts

Under the hood, --llm auto is the default. It takes three inputs — language, change type, and file size — and picks a model. Here's the Python SDK equivalent, which exposes the same router via a keyword argument:

from secondopinion import Client

client = Client()

result = client.opinion(
    code=open("src/auth.ts").read(),
    language="typescript",
    llm="auto",          # route based on accuracy data
    change_type="bugfix" # optional hint
)

print(result.findings)
print(result.review_metadata.model_used)  # which model actually ran

The review_metadata object is the important part for debugging. Every response tells you which model was picked and why, along with token counts and duration. If you want reproducibility, pin the model explicitly; if you want the best review for this specific request, let the router decide.

The signals that feed the router

There are four signals we weight, in roughly this order:

Language. Detected from file extension, shebang, or an explicit language= argument in the SDK. This is the dominant signal because accuracy variance between models on a given language is much larger than variance on other dimensions.

Change type. A new-feature diff has different review priorities than a bugfix or a refactor. Security-sensitive file paths (auth/, crypto/, anything matching a configurable allowlist) bump a security-audit weight into the decision.

File size and diff size. Very large files hit models with bigger effective context windows. Small targeted diffs can go to faster models without losing accuracy — no point paying for a heavyweight review of a three-line typo fix.

Pattern memory. If we've seen a similar bug pattern in this repo before, we bias toward the model that caught the original. This is a small effect per review, but over a project's lifetime it adds up, because teams tend to re-introduce the same class of bug in different forms.

The scoring itself is embarrassingly simple. For each candidate model we compute a weighted sum from the accuracy table and pick the highest. It's not a neural net. It's not an LLM picking another LLM. It's a lookup table and a weighted argmax. We tried fancier approaches and they kept losing to the lookup table, which turns out to be the honest answer in most ML-system stories.

Where the accuracy data comes from

A router is only as good as the data behind it. Ours comes from three places.

First, offline evals. We maintain a set of benchmark repos per language with known bugs — either ones we inject, or historical CVEs replayed on the vulnerable commit. Every model gets scored on "did you catch this specific bug" and "did you flag something that wasn't actually a problem."

Second, production telemetry. When a user accepts or rejects a finding via 2ndopinion fix or the GitHub PR agent, that's a signal. Rejected findings that were later confirmed as real bugs (via a follow-up commit or a revert) are gold. We only aggregate feedback, never store code — that's a hard constraint baked into the pipeline.

Third, consensus disagreements. When you run a consensus review, three models vote. Disagreements are interesting because they surface cases where one model sees a bug the others miss. Over time, the model that's consistently right on disagreements gets weighted higher for that language.

# Three-model consensus review — the source of a lot of our training signal
2ndopinion review src/auth.ts --consensus

Three credits, one command. The confidence-weighted aggregator takes the three reviews, collapses duplicate findings, and ranks by agreement. High-agreement findings surface first; disagreements get flagged explicitly so a human can adjudicate.

A concrete example: routing a TypeScript auth change

Say you run this:

2ndopinion review src/auth/session.ts

The router sees:

Language: TypeScript (file extension and tsconfig detected)
Change type: bugfix (detected from git diff — a returned value was modified, no new exports)
File size: 240 lines
Path signal: auth/ → security-sensitive bump
Pattern memory: this repo had a session-fixation issue three months ago

The router weights the security-sensitive bump and biases toward whichever model has the strongest track record on auth/session TypeScript bugs in our accuracy table. It runs that single model at three-credits-equivalent depth, returns a review, and the review_metadata field on the response tells you exactly which model was chosen so you can audit the decision.

If any of those signals flip — different language, a new-feature diff, no security-sensitive path — you'd get a different model. That's the whole point.

What surprised us

Two things.

First, the router made the marginal model matter. We used to think of models as tiered — a "best" one, a "good enough" one, a "cheap one for trivial stuff." Once we started routing on language-specific accuracy, the hierarchy collapsed. Models we'd written off as second-tier turned out to dominate specific slices. There is no tier list. There are just specialties.

Second, the router made consensus more valuable, not less. You'd think smart routing would make consensus redundant — why run three models if one is already the best? In practice, consensus is where the router learns. Every disagreement is a labeled data point about where the router's current guess is wrong. We run consensus on a sampled slice of reviews partly to keep the accuracy table fresh.

The takeaway

If you're building anything on top of LLMs, the lesson generalizes past code review: "which model is best" is the wrong question. The right question is "which model is best for this specific request, given what I know about it." Build a router, not a leaderboard.

If you want to see smart model routing in action, the fastest way is the free playground — no signup, just paste code and see which model the router picks:

# Install the CLI and run a review
npm i -g 2ndopinion-cli
2ndopinion review src/your-file.ts

Or try the playground at get2ndopinion.dev and watch the model_used field on the response. You can force a specific model with --llm claude, --llm codex, or --llm gemini to see how the same code gets reviewed differently — which is the fastest way to internalize why routing matters in the first place.

If you've built a routing layer for a different ML-backed product, I'd love to hear what signals ended up mattering most. Drop a comment — I'm especially curious about people who tried fancier approaches before collapsing back to a lookup table.

What 10 Versions of an AI Code Review CLI Taught Me About Developer UX

Brian Mello — Fri, 10 Apr 2026 17:12:18 +0000

You don't learn how developers think by reading docs. You learn by shipping something, watching it fail, and shipping it again.

I've been building 2ndOpinion, an AI code review tool where multiple models — Claude, Codex, Gemini — cross-check each other's reviews. Over the past few months and ten CLI versions, I've rewritten the developer experience more times than I'd like to admit. Here's what actually stuck.

Version 1: The "Just Ship It" Phase

The first version of the CLI did exactly one thing: send your code to three AI models and print their reviews. It worked. Technically.

npx 2ndopinion-cli --file src/auth.ts --models claude,codex,gemini --format json --output review.json

Five flags to get a single review. Every run required you to specify models, format, and output. Nobody wants to think that hard before getting feedback on their code.

Lesson: If your CLI needs a manual, you've already lost.

The "Smart Default" Breakthrough

The single biggest improvement wasn't a feature — it was removing decisions. Version 0.5.0 introduced one command that just works:

2ndopinion review src/auth.ts

That's it. The tool auto-detects your language, picks the best models for that language based on real accuracy data, and prints a formatted review. No flags required.

Downloads jumped immediately. Not because the tool got more powerful — it got simpler.

Behind the scenes, --llm auto routes your code to whichever models perform best for your specific language. TypeScript reviews go to different models than Python reviews, because we track which models actually catch bugs in each language. But the developer doesn't need to know any of that.

The Feedback That Changed Everything

A developer tried 2ndOpinion and told me: "I got my review. Now what?"

That question haunted me. Getting a list of issues is step one. But developers don't want a report — they want their code to be better. So I built fix:

2ndopinion fix src/auth.ts

One command. It reviews your code, identifies the issues, generates fixes, and applies them. You can review the diff before accepting. The entire loop from "something's wrong" to "it's fixed" happens in your terminal.

Then came watch:

2ndopinion watch src/

Continuous monitoring. Save a file, get a review. Like having a pair programmer who never takes a break and never gets passive-aggressive about your variable names.

Lesson: The best developer tool is the one that closes the loop. Don't hand developers a problem — hand them a solution.

The Multi-Model Insight Nobody Asked For

Here's something I didn't expect: individual AI models are unreliable in predictable ways. Claude is excellent at architectural reasoning but sometimes misses edge cases in error handling. Codex catches implementation bugs that Claude misses. Gemini often spots performance issues the others overlook.

No single model is "the best." But three models reviewing the same code? They catch what each other misses. That's the core thesis of 2ndOpinion — consensus-based review.

When all three models agree something is a problem, the confidence is high. When they disagree, that's where the interesting conversations happen. We built a confidence-weighted system that surfaces high-agreement issues first and flags disagreements for human review.

The consensus command makes this explicit:

2ndopinion review --consensus src/auth.ts

Three models review in parallel. You get a unified report with confidence scores. Three credits, one command, and a review that's more thorough than any single model could produce.

What I Got Wrong About Developer UX

I over-indexed on power users. Early versions had flags for everything: model selection, temperature, output format, verbosity levels, custom prompts. Power users loved it. Everyone else bounced.

The fix was layered complexity. The default command (2ndopinion) requires zero configuration. Power users can add flags to customize. But the first experience is frictionless.

I underestimated CI/CD. Developers don't just run tools locally — they run them in pipelines. Version 0.10.0 added --ci, --json, and --plain flags specifically for non-interactive environments. It sounds obvious in retrospect, but I spent months building interactive terminal UI before realizing half my users needed the opposite.

# In your GitHub Actions workflow
2ndopinion review --pr $PR_NUMBER --ci --json

I ignored the "try before you buy" instinct. Developers don't sign up for things. They install them, try them, and decide in under 60 seconds. The free playground on get2ndopinion.dev — no signup required — exists because I watched too many developers hit a registration wall and leave.

What's Next: The Skills Marketplace

The most surprising thing I've learned is that every team has domain-specific review needs. A fintech team cares about different patterns than a game studio. A team migrating from Python 2 to 3 needs a completely different lens.

So we're building a skills marketplace where developers can create custom audit skills — specialized review logic for specific domains — and sell them. Creators earn 70% of revenue. It turns tribal knowledge into something shareable and monetizable.

Think of it as npm for code review intelligence. Someone who's spent five years dealing with Django security footguns can package that knowledge into a skill that catches those issues for every Django developer.

The Takeaway

Ten versions in, the biggest lesson is this: developer tools win on defaults, not features. Every flag you add is a decision you're asking the developer to make. Every decision is friction. Every bit of friction is a reason to close the terminal and move on.

If you're building developer tools, here's my checklist: Does the zero-config experience work? Does the tool close the loop (find problem → fix problem)? Can it run in CI without modification? Can someone try it in under 60 seconds?

If you want to try multi-model AI code review, the CLI is one install away:

npm i -g 2ndopinion-cli
2ndopinion review your-file.ts

Or try the playground at get2ndopinion.dev — no signup, no credit card, just paste code and see what three AI models think.

I'd love to hear what you've learned building developer tools. What UX lessons took you the longest to figure out? Drop a comment below.

Single-Model vs Multi-Model AI Code Review: What I Learned Running Both

Brian Mello — Fri, 03 Apr 2026 17:07:37 +0000

I've been obsessing over AI code review for the last year. Not because I think AI will replace code review — I don't — but because I think most developers are leaving a lot of quality signal on the table by using AI review the wrong way.

Here's the thing nobody talks about: a single AI model is confidently wrong surprisingly often.

Not maliciously wrong. Not obviously wrong. Just... plausible-sounding wrong. It'll flag a false positive, miss a real bug, or give you a high-confidence "looks good" on code that has a subtle race condition. And because the model sounds so sure of itself, you accept it and move on.

I learned this the hard way. Then I started running multi-model consensus review instead, and it changed my whole mental model of what AI code review should look like.

Here's what I found.

The Problem With Single-Model Review

When you pipe code through one model — say, Claude or GPT-4 — you get a single "opinion." That opinion is shaped by:

The model's training data distribution
Whatever biases crept in during RLHF
The specific prompt you used
The model's current context window state

None of those factors are visible to you as the reviewer. You just get a confident-sounding output and have to decide how much to trust it.

I started noticing patterns:

Claude tends to be excellent at spotting architectural smell and async/await patterns. It's more conservative — it'll point out potential issues even when they're not certain bugs.
GPT-4 / Codex is better at catching common idiom violations and tends to give more opinionated style feedback. It's more decisive.
Gemini has surprisingly strong instincts around security patterns and type safety, particularly in typed languages.

These aren't a knock on any model. They're just different lenses. And here's the thing: a bug that one model misses, another often catches.

Running the Same Code Through Both Approaches

I took a production Node.js service — about 2,000 lines across 12 files — and ran it two ways:

Approach 1: Single-model review (just Claude)

# Install the CLI
npm i -g 2ndopinion-cli

# Review with a single model
2ndopinion review --llm claude

Approach 2: Multi-model consensus (Claude + Codex + Gemini in parallel)

# Use consensus mode — 3 models, confidence-weighted
2ndopinion review --consensus

The single-model pass found 14 issues: 9 flagged as medium severity, 3 high, 2 low. Took about 8 seconds.

The consensus pass found 19 issues: same 14, plus 5 more. Three of those 5 were real bugs I later confirmed in prod logs.

But here's the part that matters more than the raw numbers:

The consensus pass also filtered out 4 false positives that Claude had flagged with high confidence. Those were caught because Codex and Gemini both disagreed — and when 2 out of 3 models say "this is fine," the confidence weight pulls the verdict away from "issue."

How Confidence-Weighted Consensus Works

The naive approach to multi-model review would be simple majority voting: if 2 of 3 models say something is a bug, call it a bug. That's better than nothing, but it treats all models as equally reliable on all tasks.

Confidence-weighted consensus is smarter. Each model reports not just what it found, but how confident it is. The final verdict weights those signals proportionally.

So if Claude says "potential null dereference, high confidence" and Codex says "looks fine, medium confidence," the system doesn't just flip a coin. It weights Claude's high-confidence flag more heavily than Codex's medium-confidence dismissal.

In practice, this means:

Unanimous findings → almost certainly real, shown at the top
2/3 agreement, high confidence → likely real, worth investigating
1/3 agreement, low confidence on the finding model → deprioritized, often noise
Divergent high-confidence opinions → flagged as a "debate" item worth human judgment

Here's what that looks like with the Python SDK:

from secondopinion import client

# Run consensus review
result = client.consensus(
    code=open("server.py").read(),
    language="python"
)

for finding in result.findings:
    print(f"[{finding.confidence:.0%}] {finding.severity}: {finding.summary}")
    print(f"  Models agreeing: {', '.join(finding.models)}")
    print()

Output might look like:

[94%] HIGH: Unhandled promise rejection in processWebhook()
  Models agreeing: claude, codex, gemini

[71%] MEDIUM: Missing input validation on userId parameter
  Models agreeing: claude, gemini

[38%] LOW: Variable name 'data' is ambiguous
  Models agreeing: codex

That 38% finding? Probably noise. The 94% finding? Drop everything.

When Single-Model Review Is Still Fine

I want to be fair here. Single-model review isn't bad — it's just different.

For fast iteration during development, single-model is great. You're not trying to catch every bug; you're trying to get quick feedback while the code is fresh. Running 2ndopinion fix in watch mode gives you that:

# Continuous monitoring — single model, fast feedback loop
2ndopinion watch

For code that's about to merge to main — especially anything touching auth, payments, or data pipelines — the consensus pass is worth the extra 10-15 seconds and the 2 additional credits.

The mental model I've landed on: single-model for development velocity, consensus for pre-merge quality gates.

The Deeper Lesson: Models Have Blind Spots

The thing I didn't fully appreciate before building multi-model review into my workflow: AI models have systematic blind spots, not random ones.

If Claude misses a certain class of bug, it tends to consistently miss that class. It's not a random error — it's a bias in how the model was trained. That means if you only ever use Claude, you'll ship the same categories of bugs repeatedly without ever knowing they're being systematically missed.

Multi-model consensus surfaces those blind spots by triangulating from different vantage points. It's the same reason we have human code reviewers with different backgrounds look at the same PR.

One model trained heavily on Python might under-weight JavaScript async patterns. Another trained on a lot of library code might be overly conservative about application-layer error handling. When you combine them, the idiosyncrasies average out.

Try It

If you want to see this difference yourself, there's a free playground at get2ndopinion.dev — no signup required. Paste your code, run both modes, and compare the outputs side by side.

Or install the CLI and try it on your own codebase:

npm i -g 2ndopinion-cli

# Single model
2ndopinion review

# Consensus (3 models, confidence-weighted)
2ndopinion review --consensus

The first time you see a consensus pass catch something a single-model review confidently missed, you'll get it. That's the moment the model clicked for me.

2ndOpinion is a multi-model AI code review tool. Claude, Codex, and Gemini cross-check each other's findings via MCP, CLI, Python SDK, REST API, and GitHub PR Agent. Free playground at get2ndopinion.dev.

How to Add Multi-Model AI Code Review to Claude Code in 30 Seconds

Brian Mello — Thu, 02 Apr 2026 23:16:45 +0000

You know that moment when Claude reviews your code, gives it the green light, and then two days later you're debugging a production issue that three humans would have caught immediately?

Single-model AI code review has a blind spot problem. Each model was trained on different data, has different failure modes, and holds different opinions about what "correct" looks like. When you only ask one AI, you're getting one perspective — and that perspective has systematic gaps.

Multi-model consensus code review flips the script. Instead of trusting one AI, you get Claude, GPT-4o, and Gemini to cross-check each other. Where all three agree, you can be confident. Where they diverge, that's where you need to look closer.

Here's how to set it up in Claude Code in about 30 seconds.

The Problem with Single-Model Review

Let me be direct: single-model AI code review is better than nothing. But it has a fundamental flaw — the model doesn't know what it doesn't know.

I ran an experiment last month. I fed the same set of 50 bugs across Claude, GPT-4o, and Gemini separately. Each model caught some bugs the others missed. GPT-4o was better at certain Python anti-patterns. Gemini caught more async/concurrency issues. Claude excelled at security-related edge cases.

No model caught everything. But when I used all three in consensus mode? Coverage went up significantly.

This is the case for multi-model AI code review — it's not about any single model being bad, it's about combining strengths.

Setting Up 2ndOpinion via MCP in 60 Seconds

2ndOpinion is an AI-to-AI communication platform that routes your code to multiple models simultaneously and returns a confidence-weighted consensus. It plugs into Claude Code via MCP.

Here's the config:

{
  "mcpServers": {
    "2ndopinion": {
      "command": "npx",
      "args": ["-y", "2ndopinion-mcp"],
      "env": {
        "SECONDOPINION_API_KEY": "your-api-key-here"
      }
    }
  }
}

Drop that into your Claude Code MCP config file (usually ~/.claude/mcp_config.json), restart Claude Code, and you're done. No extra dependencies. No separate process to run.

Once it's wired up, you have access to these tools directly inside Claude Code:

review — standard multi-model code review (uses 2 credits)
consensus — parallel review from 3 models with confidence weighting (3 credits)
debate — multi-round AI debate for architecture decisions (5–7 credits)
bug_hunt — targeted bug detection sweep
security_audit — security-focused review

You don't have to remember which tool to use. The --llm auto flag routes to the best model for your language based on real accuracy data.

Running Your First Consensus Review

Once the MCP is connected, you can trigger a review in plain English inside Claude Code:

"Run a consensus code review on this file."

Or you can use the CLI directly if you prefer the terminal:

# Install globally
npm i -g 2ndopinion-cli

# Review a specific file
2ndopinion review src/auth/token-validator.ts

# Full consensus (3 models in parallel)
2ndopinion review --consensus src/auth/token-validator.ts

# Watch mode — auto-review on every save
2ndopinion watch

The consensus output tells you:

Where all three models agree — high confidence issues, fix these immediately
Where two out of three agree — worth a look, especially for complex logic
Where models disagree — the most interesting category; often means an ambiguous design tradeoff

That last category is my favorite. When GPT-4o says "this is fine" and Claude says "this will blow up under load" — that's a signal to dig in, not dismiss.

What the Output Actually Looks Like

Here's a real example. I had this Python function I was shipping:

def get_user_data(user_id: str) -> dict:
    conn = db.connect()
    result = conn.execute(f"SELECT * FROM users WHERE id = '{user_id}'")
    return dict(result.fetchone())

Running 2ndopinion review --consensus on this file returned:

🔴 CONSENSUS (3/3 models agree): SQL injection vulnerability
   Line 3: f-string interpolation in SQL query
   Fix: Use parameterized queries

🟡 MAJORITY (2/3 models): Connection not closed on exception
   Line 2: db.connect() has no context manager / finally block
   Claude, GPT-4o: Flag | Gemini: Acceptable (with connection pooling)

🟢 LOW CONFIDENCE (1/3 models): Return type may be None
   Line 4: fetchone() returns None if no row found
   Only Claude flagged this

The SQL injection is obvious in hindsight — all three models agree, high confidence. The connection handling disagreement is interesting — it tells me something about the environment assumptions baked into each model. And the None return type is a low-confidence flag worth noting for future-proofing.

This is what multi-model AI code review buys you: not just more issues, but a quality signal on each issue.

Pattern Memory and Regression Tracking

One thing that makes 2ndOpinion useful beyond a one-off review is that it builds project context over time. It tracks which patterns it's flagged before, so it can alert you when the same class of bug reappears in a different file.

If you fixed an authentication bypass three weeks ago and a new PR introduces a structurally similar issue, 2ndOpinion flags it as a regression. No additional config required — it builds this context automatically per project.

Combined with the GitHub PR Agent:

# Review PR #42 from the CLI
2ndopinion review --pr 42

...and you get automated multi-model review on every pull request, with regression awareness. The PR gets an inline comment breakdown — agreements, disagreements, and confidence levels — before a human reviewer ever opens it.

The Marketplace: Build Audits, Earn Revenue

This is the part that surprised me most. 2ndOpinion has a skills marketplace where you can publish custom audit types. If you've got deep expertise in, say, Rust memory safety or Django security patterns, you can package that into an audit skill, publish it, and earn 70% of every credit spent running it.

It's an interesting model: the platform benefits from domain expertise that no general-purpose LLM has, and the experts get a revenue stream from codifying what they know.

Try It Without Signing Up

If you want to kick the tires before committing, there's a free playground at get2ndopinion.dev — no signup required. Paste a code snippet, pick your review type, and see what three models think.

For the full MCP + Claude Code integration, you'll need an API key, but the setup overhead is genuinely minimal. One JSON config, one restart, and you're running confidence-weighted multi-model code review on every file you touch.

Single-model AI code review is table stakes at this point. If you're serious about code quality, the next step is getting your AIs to argue with each other — and paying attention to where they agree.

Check out get2ndopinion.dev or the GitHub repo to dig into the details.