<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: that-github-user</title>
    <description>The latest articles on Forem by that-github-user (@thatgithubuser).</description>
    <link>https://forem.com/thatgithubuser</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3850068%2Fb3c582e1-6ee1-400a-8d8a-5465369d4f36.png</url>
      <title>Forem: that-github-user</title>
      <link>https://forem.com/thatgithubuser</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/thatgithubuser"/>
    <language>en</language>
    <item>
      <title>pass@1 is a gamble — how ensemble coding enhances AI reliability</title>
      <dc:creator>that-github-user</dc:creator>
      <pubDate>Thu, 02 Apr 2026 14:00:00 +0000</pubDate>
      <link>https://forem.com/thatgithubuser/pass1-is-a-gamble-how-ensemble-coding-enhances-ai-reliability-57jg</link>
      <guid>https://forem.com/thatgithubuser/pass1-is-a-gamble-how-ensemble-coding-enhances-ai-reliability-57jg</guid>
      <description>&lt;p&gt;You ask Claude to fix a bug. It nails it. You ask it again with slightly different phrasing. It refactors half your module and breaks three unrelated tests. Same model, same task, different result.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with AI coding today: &lt;strong&gt;pass@1 — the chance a single attempt succeeds — is a gamble&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Running the same task multiple times and picking the best result &lt;a href="https://arxiv.org/abs/2107.03374" rel="noopener noreferrer"&gt;dramatically improves reliability&lt;/a&gt;. It's the same principle behind &lt;a href="https://en.wikipedia.org/wiki/Ensemble_learning" rel="noopener noreferrer"&gt;ensemble methods&lt;/a&gt; in ML — and &lt;a href="https://arxiv.org/abs/2503.15838" rel="noopener noreferrer"&gt;recent&lt;/a&gt; &lt;a href="https://arxiv.org/abs/2510.21513" rel="noopener noreferrer"&gt;research&lt;/a&gt; confirms it works for code generation too, though it warns that naive consensus can amplify shared mistakes. Selection method matters as much as ensemble size. We built &lt;a href="https://github.com/that-github-user/thinktank" rel="noopener noreferrer"&gt;thinktank&lt;/a&gt; to make this practical — thinktank currently uses a single model (Claude), so &lt;strong&gt;test execution is the primary quality signal, not consensus&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  One command
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;thinktank run &lt;span class="s2"&gt;"fix the authentication bypass"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 5 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"npm test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj37i1d09lndu50bd5ltc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj37i1d09lndu50bd5ltc.png" alt="Example run of thinktank" width="762" height="938"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under the hood:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;N isolated git clones&lt;/strong&gt; — each agent gets a fully independent copy of your repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N parallel Claude Code agents&lt;/strong&gt;, each solving the task with zero knowledge of the others&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test verification&lt;/strong&gt; — your test suite runs in each clone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convergence analysis&lt;/strong&gt; — clusters agents by diff similarity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copeland pairwise scoring&lt;/strong&gt; — ranks agents via head-to-head comparison&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;thinktank apply&lt;/code&gt; — applies the winner to your working tree&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The cost is real: 5 agents = 5× the API tokens. A typical 5-agent run on a medium codebase costs roughly $1-5 depending on task complexity and model (Sonnet is cheaper, Opus is pricier). But they run in parallel, so wall-clock time is the slowest agent, not the sum — typically a few minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Why not just spawn N agents yourself?"&lt;/strong&gt; You could. But thinktank handles the parts that are tedious to DIY: git isolation and cleanup, parallel orchestration with timeout/retry, test execution per clone, diff similarity across all pairs, and Copeland scoring. The isolation and the analysis are the product — not the parallelism.&lt;/p&gt;

&lt;h2&gt;
  Where it gets interesting: A* pathfinding
&lt;/h2&gt;

&lt;p&gt;We gave 5 agents a grid-based pathfinding challenge: implement A* with their choice of heuristic, data structures, and optimizations. We ran it in both TypeScript and Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (pytest):&lt;/strong&gt; 5 agents, &lt;strong&gt;5/5 pass all 7 tests&lt;/strong&gt;, 71% convergence&lt;br&gt;
&lt;strong&gt;TypeScript (node:test):&lt;/strong&gt; 5 agents, 3/5 pass, 68% convergence&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx5fhxe9yhuzd1s8l6bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx5fhxe9yhuzd1s8l6bi.png" alt="grid visualization showing 5 agents' exploration patterns" width="800" height="884"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All agents independently chose Manhattan distance and a min-heap priority queue — the textbook approach. But the implementations diverged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agents 1-3:&lt;/strong&gt; Standard A* in ~38 lines, no heap tiebreaking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents 4-5:&lt;/strong&gt; A* with a heap tiebreak counter (~46 lines) — prevents pathological node exploration when cells have equal f-scores: &lt;strong&gt;up to 37% fewer nodes explored&lt;/strong&gt;, same optimal path.&lt;/li&gt;
&lt;/ul&gt;
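&lt;p&gt;The tiebreak counter Agents 4-5 used is a standard &lt;code&gt;heapq&lt;/code&gt; pattern. A minimal sketch (not the agents' actual code):&lt;/p&gt;

```python
import heapq
import itertools

# Heap entries are (f_score, tiebreak, node). The monotonically
# increasing counter breaks f-score ties in insertion (FIFO) order,
# so equal-f nodes pop deterministically instead of falling through
# to whatever tuple comparison of the nodes happens to give.
counter = itertools.count()
open_set = []

def push(f_score, node):
    heapq.heappush(open_set, (f_score, next(counter), node))

push(7, (2, 3))
push(7, (1, 4))  # same f-score as the previous entry
push(5, (0, 1))

order = [heapq.heappop(open_set)[2] for _ in range(len(open_set))]
print(order)  # [(0, 1), (2, 3), (1, 4)]: FIFO among the f=7 ties
```

&lt;p&gt;Without the counter, the f=7 tie would be broken by comparing the node tuples themselves, which can send the search wandering through equal-cost cells in an order that explores far more nodes.&lt;/p&gt;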

&lt;p&gt;Here's the twist: &lt;strong&gt;Copeland scoring recommends Agent #1&lt;/strong&gt; — smallest diff, passes all tests, high convergence. By thinktank's reliability criteria, it's the safest pick. But Agent #4's approach is algorithmically superior, and you'd never know it existed with one agent.&lt;/p&gt;

&lt;p&gt;This is the deeper value: &lt;strong&gt;not just picking a winner, but seeing the design space&lt;/strong&gt;. Copeland gives you the safe choice. The full ensemble reveals approaches worth stealing.&lt;/p&gt;

&lt;p&gt;Want to see how Agent #4's approach differed?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ thinktank compare 1 4

  Comparing Agent #1 vs Agent #4
  ──────────────────────────────────────────────────────────

  Agent #1 (recommended): success | tests: pass | +399/-0 | 1 files
  Agent #4: success | tests: pass | +259/-0 | 1 files

  Similarity: ████░░░░░░░░░░░░░░░░ 19%

  Files changed:
    both  .../astar-python/test_pathfinding_generated.py

  Added lines:
    Shared:        58
    Only #1:      158
    Only #4:      91
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
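&lt;p&gt;One plausible reading of that similarity figure (a guess on our part, not necessarily thinktank's exact formula) is a Jaccard-style ratio over added lines:&lt;/p&gt;

```python
def line_similarity(shared, only_a, only_b):
    # Jaccard-style ratio: lines added by both agents over all
    # added lines. 1.0 means identical additions, 0.0 disjoint.
    total = shared + only_a + only_b
    return shared / total if total else 1.0

# Plugging in the counts from the compare output: 58 shared,
# 158 only in #1, 91 only in #4.
print(f"{line_similarity(58, 158, 91):.0%}")  # 19%
```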



&lt;p&gt;Only 19% similarity — both agents wrote valid tests for the same module, but took very different approaches. Apply the winner, or pick a specific agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank apply
  Applying changes from Agent &lt;span class="c"&gt;#1...&lt;/span&gt;
  Changes applied successfully.
  Cleaning up clones...
  Done.

  Review the changes with: git diff
  Commit when ready: git add &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; git commit

&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank undo   &lt;span class="c"&gt;# changed your mind? roll it back&lt;/span&gt;
  Undo &lt;span class="nb"&gt;complete&lt;/span&gt; — last applied patch has been reversed.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  Same pattern, different domain: gradient descent
&lt;/h2&gt;

&lt;p&gt;We ran the same experiment on ML: 5 agents implementing linear regression via batch gradient descent. All five wrote structurally identical code — normalize features, compute gradients, update weights. 76% convergence. But:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Epochs&lt;/th&gt;
&lt;th&gt;Tests&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#1, #3, #4&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;5/7&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;fail&lt;/strong&gt; &lt;code&gt;test_perfect_fit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6/6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;pass&lt;/strong&gt; — Copeland pick&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#5&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;6/6&lt;/td&gt;
&lt;td&gt;pass — over-engineered&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three agents under-train. One over-trains. &lt;strong&gt;The Copeland pick (#2) is the Goldilocks solution&lt;/strong&gt; — just enough iterations to converge, no wasted compute. The algorithm is identical across all five; the only difference is a hyperparameter choice that only surfaces on edge-case test inputs. It's exactly the kind of subtle difference an ensemble catches and a single agent misses.&lt;/p&gt;
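&lt;p&gt;The failure mode is easy to reproduce. A minimal batch-gradient-descent sketch (toy data, not the agents' code) where a strict "perfect fit" tolerance passes only with enough epochs:&lt;/p&gt;

```python
import math

# Fit y = w*x + b by batch gradient descent on exact data y = 2x + 1.
def fit(xs, ys, epochs, lr=0.01):
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1

def perfect_fit(w, b):
    # a tight tolerance, like an edge-case "perfect fit" test
    return math.isclose(w, 2.0, abs_tol=1e-5) and math.isclose(b, 1.0, abs_tol=1e-5)

print(perfect_fit(*fit(xs, ys, 1_000)), perfect_fit(*fit(xs, ys, 2_000)))
# False True: same algorithm, only the epoch count differs
```

&lt;p&gt;Both runs are within a few 1e-4 of the true weights — only the stricter tolerance separates them, which is why the difference surfaces on edge-case tests rather than typical ones.&lt;/p&gt;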

&lt;h2&gt;
  How Copeland scoring works
&lt;/h2&gt;

&lt;p&gt;Most "pick the best" systems use weighted scoring — assign points for tests, convergence, diff size, sum them up. We tried this. The weights felt arbitrary, and across 57 runs, &lt;a href="https://github.com/that-github-user/thinktank/blob/main/docs/scoring-evaluation.md" rel="noopener noreferrer"&gt;weighted scoring disagreed with pairwise methods about a third of the time&lt;/a&gt; (Cochran's Q: p&amp;lt;0.0001).&lt;/p&gt;

&lt;p&gt;Instead, thinktank uses &lt;a href="https://en.wikipedia.org/wiki/Copeland%27s_method" rel="noopener noreferrer"&gt;Copeland's method&lt;/a&gt; from social choice theory. Every pair of agents is compared head-to-head on four criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tests passed&lt;/strong&gt; — did the test suite pass?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convergence&lt;/strong&gt; — how many other agents took a similar approach?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt; — fewer production files changed = less risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test coverage&lt;/strong&gt; — did the agent add or update tests?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent winning more criteria gets +1, the loser gets −1. No arbitrary point weights. Two independent ranking methods — Copeland and Borda count — agree on the recommendation 84-96% of the time (84% across 74 evaluated runs, 96% on the 53-run subset with stored per-agent scores), while weighted scoring disagrees about a third of the time.&lt;/p&gt;
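&lt;p&gt;The core of the method is small. A sketch with illustrative scores (one assumption: criteria are oriented so higher is better, with "files changed" negated):&lt;/p&gt;

```python
from itertools import combinations

# Copeland scoring: every pair of agents is compared head-to-head;
# whoever wins more of the four criteria gets +1, the loser -1.
def copeland(agents):
    scores = {name: 0 for name in agents}
    for a, b in combinations(agents, 2):
        wins_a = sum(x > y for x, y in zip(agents[a], agents[b]))
        wins_b = sum(y > x for x, y in zip(agents[a], agents[b]))
        if wins_a > wins_b:
            scores[a] += 1
            scores[b] -= 1
        elif wins_b > wins_a:
            scores[b] += 1
            scores[a] -= 1
    return max(scores, key=scores.get)

# (tests passed, convergence, -files changed, added tests) per agent;
# the numbers here are made up for illustration, not from a real run.
agents = {
    "#1": (1, 0.71, -1, 1),
    "#2": (1, 0.40, -3, 1),
    "#3": (0, 0.71, -1, 0),
}
print(copeland(agents))  # #1
```

&lt;p&gt;Note there are no point weights anywhere: only pairwise wins and losses count, which is what makes the ranking robust to how any single criterion is scaled.&lt;/p&gt;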

&lt;h2&gt;
  What we learned building this
&lt;/h2&gt;

&lt;p&gt;thinktank was built using thinktank — 80 PRs, ~250 tests, 103 ensemble runs across TypeScript, Python, pathfinding, and ML tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About half of individual agent attempts fail their tests.&lt;/strong&gt; This sounds bad, but it's the point — in a 5-agent ensemble, you only need one to succeed. If your task is simple enough for pass@1, a single agent is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlated failures are the limit.&lt;/strong&gt; All agents use the same Claude model. When it has a systematic blind spot, all 5 may fail the same way. Multi-model ensembles (Claude + GPT + Gemini) would help — thinktank has a runner abstraction for this but only Claude Code is implemented today.&lt;/p&gt;
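&lt;p&gt;For illustration, that runner abstraction might look something like this — a hypothetical Python rendering, not thinktank's actual (TypeScript) interface:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

# Hypothetical runner interface: one implementation per AI coding
# tool, so a single ensemble could mix models.
class Runner(ABC):
    @abstractmethod
    def solve(self, task: str, workdir: str) -> str:
        """Run one agent in an isolated clone; return its diff."""

class ClaudeCodeRunner(Runner):
    def solve(self, task, workdir):
        # the real version would shell out to the Claude Code CLI
        # inside workdir; this stub just labels its output
        return f"claude diff for {task!r} in {workdir}"

# A future multi-model ensemble would assign a runner per slot:
runners = [ClaudeCodeRunner() for _ in range(3)]
diffs = [r.solve("fix the bug", f"clone-{i}") for i, r in enumerate(runners)]
print(len(diffs))  # 3
```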

&lt;p&gt;&lt;strong&gt;The sweet spot is medium-complexity tasks.&lt;/strong&gt; "Fix this bug and add a test," "add rate limiting to this API," "refactor this module" — tasks where the model &lt;em&gt;can&lt;/em&gt; succeed but the approach isn't obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Convergence is a confidence signal, not a correctness signal.&lt;/strong&gt; Tests are the oracle. When all agents converge on the same wrong answer, it's the tests that catch it — not the consensus. (More on this below.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use it for&lt;/strong&gt; simple tasks (it wastes tokens), codebases without a test suite (tests are the primary oracle), or tasks that need iterative refinement (each agent starts fresh).&lt;/p&gt;

&lt;h2&gt;
  The false oracle problem: ensemble test generation
&lt;/h2&gt;

&lt;p&gt;This was our most expensive lesson. Early in development, a single agent wrote a test asserting a maze's shortest path was 13 steps. The correct answer was 9. That bad test became the oracle for 13+ ensemble runs — every A* implementation looked "broken" even though all of them were correct.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;use the ensemble to write the tests first.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Phase 1: generate tests — no test command, just convergence analysis
$ thinktank run "write unit tests for A* pathfinding on a 5x5 grid" -n 3

  Convergence
  ────────────────────────────────────────────────────────────
  Agents [1, 3]: ████████████████░░░░ 67%

  Agent #1: assert shortestPath(grid) === 9
  Agent #2: assert shortestPath(grid) === 13    ← disagrees
  Agent #3: assert shortestPath(grid) === 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two out of three agents independently computed 9. The disagreement flags the bad oracle &lt;em&gt;before&lt;/em&gt; it poisons anything. Apply the majority's tests, then use them as ground truth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Phase 2: implement against validated tests&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank run &lt;span class="s2"&gt;"implement A* pathfinding"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 5 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"npm test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This two-phase workflow — &lt;strong&gt;ensemble tests, then ensemble implementation&lt;/strong&gt; — is how we avoided false oracles for the rest of development.&lt;/p&gt;
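&lt;p&gt;The validation step in phase 1 boils down to majority voting over what each agent's tests assert. A sketch of the idea (the real tool surfaces disagreement via convergence clustering rather than a literal vote):&lt;/p&gt;

```python
from collections import Counter

# Trust a generated assertion only when a strict majority of
# independent agents computed the same expected value.
def majority_oracle(asserted_values):
    value, votes = Counter(asserted_values).most_common(1)[0]
    if votes * 2 > len(asserted_values):
        return value   # majority agrees: usable as ground truth
    return None        # no majority: flag for human review

print(majority_oracle([9, 13, 9]))   # 9 (the A* run above)
print(majority_oracle([9, 13, 11]))  # None: review before trusting
```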

&lt;h2&gt;
  The numbers from 103 runs
&lt;/h2&gt;

&lt;p&gt;We dogfooded thinktank on itself. Here's what the data looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank stats

  thinktank stats
  ─────────────────────────────
  Total runs:          103
  Avg agents/run:      4.3
  Avg convergence:     64.8%
  Avg &lt;span class="nb"&gt;test &lt;/span&gt;pass rate:  47.5%

&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank stats &lt;span class="nt"&gt;--passed-only&lt;/span&gt;

  thinktank stats
  ─────────────────────────────
  Filters:             passed-only
  Total runs:          57
  Avg agents/run:      4.6
  Avg convergence:     64.4%
  Avg &lt;span class="nb"&gt;test &lt;/span&gt;pass rate:  79.2%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;47.5% test pass rate — roughly half of individual agent attempts fail. That's the whole point: in a 5-agent ensemble, you don't need every agent to succeed; you need &lt;em&gt;one&lt;/em&gt;. Filter to runs where at least one agent passed and you're at 79.2%.&lt;/p&gt;
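&lt;p&gt;The arithmetic behind that claim, under an independence assumption the single-model setup only approximates (correlated failures push the real number lower):&lt;/p&gt;

```python
# If one attempt passes with probability p, at least one of n
# independent attempts passes with probability 1 - (1 - p)**n.
def at_least_one(p, n):
    return 1 - (1 - p) ** n

# the observed per-agent rate, at the typical ensemble size
print(f"{at_least_one(0.475, 5):.1%}")  # 96.0%
```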

&lt;p&gt;If your pass rate is dropping, your prompts are likely too vague or your tests too strict. If convergence is trending down, the tasks are ambiguous — rewrite the prompt before spending more tokens.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;thinktank evaluate

  Scoring Method Evaluation
  ──────────────────────────────────────────────────────────
  Usable runs: 74 &lt;span class="o"&gt;(&lt;/span&gt;of 103 total&lt;span class="o"&gt;)&lt;/span&gt;

  Run     Agents  Weighted  Copeland  Borda   Agree?
  ──────────────────────────────────────────────────────────
  &lt;span class="c"&gt;#4      5       #1        #1        #1      yes&lt;/span&gt;
  &lt;span class="c"&gt;#10     5       #1        #2        #2      NO&lt;/span&gt;
  &lt;span class="c"&gt;#14     5       #1        #3        #3      NO&lt;/span&gt;
  ...

  Agreement Rates
  ──────────────────────────────
  All three agree:         45/74 &lt;span class="o"&gt;(&lt;/span&gt;61%&lt;span class="o"&gt;)&lt;/span&gt;
  Weighted &lt;span class="o"&gt;=&lt;/span&gt; Copeland:     53/74 &lt;span class="o"&gt;(&lt;/span&gt;72%&lt;span class="o"&gt;)&lt;/span&gt;
  Weighted &lt;span class="o"&gt;=&lt;/span&gt; Borda:        47/74 &lt;span class="o"&gt;(&lt;/span&gt;64%&lt;span class="o"&gt;)&lt;/span&gt;
  Copeland &lt;span class="o"&gt;=&lt;/span&gt; Borda:        62/74 &lt;span class="o"&gt;(&lt;/span&gt;84%&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Copeland and Borda disagree on a run (16% of the time), that's a signal to manually review with &lt;code&gt;thinktank compare&lt;/code&gt; instead of blindly applying. When all three methods agree, apply with confidence. This is how we discovered that weighted scoring was the outlier — and why Copeland became the default.&lt;/p&gt;

&lt;h2&gt;
  We're not the only ones thinking about this
&lt;/h2&gt;

&lt;p&gt;The ensemble idea is in the air. Mozilla AI's &lt;a href="https://github.com/peteski22/agent-pragma" rel="noopener noreferrer"&gt;Star Chamber&lt;/a&gt; fans out code &lt;em&gt;reviews&lt;/em&gt; to Claude, GPT, and Gemini in parallel, using consensus tiers and optional &lt;a href="https://blog.mozilla.ai/the-star-chamber-multi-llm-consensus-for-code-quality/" rel="noopener noreferrer"&gt;debate rounds&lt;/a&gt; where anonymized feedback circulates back to all models. Karpathy's &lt;a href="https://github.com/karpathy/llm-council" rel="noopener noreferrer"&gt;llm-council&lt;/a&gt; runs a deliberation pipeline — parallel query, peer review, chairman synthesis — for general-purpose tasks. &lt;a href="https://github.com/askbudi/roundtable" rel="noopener noreferrer"&gt;Roundtable&lt;/a&gt; orchestrates multiple AI CLI tools through a unified MCP interface. &lt;a href="https://github.com/ComposioHQ/composio" rel="noopener noreferrer"&gt;Composio's Agent Orchestrator&lt;/a&gt; manages parallel coding agents with git worktrees for task decomposition — different agents working on different sub-tasks. And &lt;a href="https://aider.chat/docs/usage/modes.html" rel="noopener noreferrer"&gt;Aider's Architect Mode&lt;/a&gt; pairs two models in complementary roles (planner + editor).&lt;/p&gt;

&lt;p&gt;Thinktank is doing something specific: &lt;strong&gt;same-task ensemble with true isolation&lt;/strong&gt;. Every agent gets an independent git clone, solves the identical problem with zero knowledge of the others, and results are ranked by Copeland pairwise scoring — not majority vote, not debate, not model synthesis. The isolation is the point: it's what makes the ensemble math work, and it's why convergence is a meaningful signal rather than an artifact of shared context.&lt;/p&gt;

&lt;h2&gt;
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; thinktank-ai
thinktank init
thinktank run &lt;span class="s2"&gt;"your task here"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 5 &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="s2"&gt;"npm test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires &lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code CLI&lt;/a&gt;. Works with Anthropic API keys or &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; (pass any model ID starting with &lt;code&gt;anthropic.&lt;/code&gt;, e.g. &lt;code&gt;--model anthropic.claude-opus-4-6-v1&lt;/code&gt; — AWS credentials from your environment are inherited automatically). MIT licensed. Contributions welcome — especially runners for other AI coding tools.&lt;/p&gt;

&lt;h2&gt;
  What's next
&lt;/h2&gt;

&lt;p&gt;thinktank currently runs Claude Code only. The runner interface is designed to be pluggable — adding support for other AI coding tools (&lt;a href="https://github.com/anomalyco/opencode" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;, Aider, Codex CLI, Gemini CLI) is the highest-priority roadmap item. Multi-tool ensembles would address the single-model diversity limitation and unlock the gains the research says are there.&lt;/p&gt;

&lt;p&gt;Beyond that: more algorithmic showcases, a web dashboard for visual diff comparison, and more.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/that-github-user/thinktank" rel="noopener noreferrer"&gt;thinktank on GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/thinktank-ai" rel="noopener noreferrer"&gt;npm install -g thinktank-ai&lt;/a&gt; · &lt;a href="https://github.com/that-github-user/thinktank/blob/main/docs/scoring-evaluation.md" rel="noopener noreferrer"&gt;Technical report&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>opensource</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
