<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Suat</title>
    <description>The latest articles on Forem by Suat (@suat_cad1c7e9617374be97b2).</description>
    <link>https://forem.com/suat_cad1c7e9617374be97b2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896214%2F86a97188-a12b-4541-b5c5-61fa59b45b18.png</url>
      <title>Forem: Suat</title>
      <link>https://forem.com/suat_cad1c7e9617374be97b2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/suat_cad1c7e9617374be97b2"/>
    <language>en</language>
    <item>
      <title>I Built a Multi-LLM Debate Engine That Fact-Checks Itself in Real Time</title>
      <dc:creator>Suat</dc:creator>
      <pubDate>Fri, 24 Apr 2026 15:09:57 +0000</pubDate>
      <link>https://forem.com/suat_cad1c7e9617374be97b2/i-built-a-multi-llm-debate-engine-that-fact-checks-itself-in-real-time-3oib</link>
      <guid>https://forem.com/suat_cad1c7e9617374be97b2/i-built-a-multi-llm-debate-engine-that-fact-checks-itself-in-real-time-3oib</guid>
      <description>&lt;p&gt;When you ask one LLM a question, you get one answer. When you ask five LLMs the same question, you get five answers and no way to tell which is right.&lt;/p&gt;

&lt;p&gt;The naive fix — make them vote, or make them argue, or summarize them all — turns out to make things worse, not better. LLMs are prone to &lt;a href="https://arxiv.org/abs/2308.03958" rel="noopener noreferrer"&gt;sycophancy&lt;/a&gt;; when one confidently states a wrong fact, the others tend to concede rather than push back. Add a summarizer on top and you get a polished, cited-looking answer that is confidently wrong.&lt;/p&gt;

&lt;p&gt;I wanted a different shape: a structured debate between agents with different roles, plus a &lt;strong&gt;sixth agent whose only job is to fact-check the others mid-debate&lt;/strong&gt; — before any of them gets a chance to agree with a hallucination.&lt;/p&gt;

&lt;p&gt;This post is a walkthrough of what I built, why it works, and where it doesn't. The code is on GitHub under MIT: &lt;a href="https://github.com/capitansuat/swarm-debate" rel="noopener noreferrer"&gt;capitansuat/swarm-debate&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of the problem
&lt;/h2&gt;

&lt;p&gt;Imagine you ask five LLMs: "Is Acme Corp's recent acquisition of Beta Inc going to close by year end?"&lt;/p&gt;

&lt;p&gt;You'll get responses that sound like this (paraphrased for brevity):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Model A&lt;/strong&gt;: "Morgan Stanley's November 28 M&amp;amp;A tracker shows the deal at 85% approval probability..."&lt;br&gt;
&lt;strong&gt;Model B&lt;/strong&gt;: "According to the DOJ Second Request docket DOJ-HSR-2025-4471..."&lt;br&gt;
&lt;strong&gt;Model C&lt;/strong&gt;: "The Wall Street Journal reported on October 17 that both parties received antitrust clearance..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One of those is real. The other two are fabrications — a made-up Morgan Stanley tracker (dated in the future, which makes it impossible), and a DOJ docket number that doesn't exist.&lt;/p&gt;

&lt;p&gt;A human reading the three responses side by side will probably notice something is off. A pipeline that summarizes them into a single answer will not, because the fabricated citations carry the same rhetorical weight as the real one. Ask a second LLM to synthesize the three and the odds that it surfaces the fabrications as a problem are low; it is far more likely to produce a smoothly paraphrased answer that treats all three sources as equivalent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern I borrowed
&lt;/h2&gt;

&lt;p&gt;While reading about Mixture-of-Experts language models, I came across the &lt;strong&gt;shared expert&lt;/strong&gt; pattern. In an MoE model with routing, each input token selects K experts to process it. But some architectures also include one &lt;em&gt;shared&lt;/em&gt; expert that runs on every token, regardless of what the router picks. The shared expert handles general competence; the routed experts handle specialization.&lt;/p&gt;
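&lt;p&gt;To make the pattern concrete, here is a toy sketch of a shared-expert forward pass. This is illustrative only, not any real model's code: each "expert" is just a function, and the router picks the top-K by a fake affinity score.&lt;/p&gt;

```python
# Toy shared-expert MoE forward pass (illustrative only).
# Routed experts run only when the router picks them;
# the shared expert runs on every input, unconditionally.

def route(x, experts, k):
    # Fake router: rank experts by a static affinity score.
    ranked = sorted(experts, key=lambda e: e["affinity"], reverse=True)
    return ranked[:k]

def moe_forward(x, routed_experts, shared_expert, k=2):
    out = shared_expert["fn"](x)          # always runs
    for e in route(x, routed_experts, k):
        out = out + e["fn"](x)            # runs only if routed
    return out

experts = [
    {"affinity": 0.9, "fn": lambda v: v * 2},
    {"affinity": 0.1, "fn": lambda v: v + 100},
    {"affinity": 0.5, "fn": lambda v: v * 3},
]
shared = {"fn": lambda v: v + 1}

print(moe_forward(1.0, experts, shared))  # shared plus the two highest-affinity experts
```

&lt;p&gt;Swap "function" for "FFN block" and "affinity" for "router logits" and you are close to the real thing; the only part that matters here is that one expert bypasses the router.&lt;/p&gt;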

&lt;p&gt;That pattern suggests a structural answer to the debate problem: what if the "shared expert" in a multi-agent system is just... a fact-checker?&lt;/p&gt;

&lt;p&gt;The shape would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1:
  Analyst -&amp;gt; opinion
  Strategist -&amp;gt; opinion
  Devil's Advocate -&amp;gt; opinion
  Researcher -&amp;gt; opinion
  Validator -&amp;gt; reads all four, fact-checks every concrete claim

Round 2:
  Each persona sees the previous round's output
  AND the validator's findings (OK / WARN / FAIL markers)
  AND is told: "Do not use claims marked FAIL"
  ... generates a new, hopefully more grounded opinion
  Validator runs again on the new outputs

Round 3: same pattern, then synthesize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
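&lt;p&gt;Stripped of provider plumbing, that structure is a short loop. A minimal sketch (not the repo's actual code; &lt;code&gt;call_agent&lt;/code&gt; and &lt;code&gt;call_validator&lt;/code&gt; stand in for real LLM calls):&lt;/p&gt;

```python
# Sketch of the debate loop: personas speak, the validator fact-checks,
# and its findings feed into the next round's prompts.
# call_agent / call_validator are placeholders for real LLM calls.

def run_debate(topic, personas, rounds, call_agent, call_validator):
    findings = ""                  # validator output from the previous round
    transcript = []
    for r in range(1, rounds + 1):
        opinions = {}
        for name in personas:
            prompt = f"Topic: {topic}\nRound {r}.\n"
            if findings:
                prompt += "Validator findings from last round:\n" + findings
                prompt += "\nDo not reuse claims marked [FAIL].\n"
            opinions[name] = call_agent(name, prompt)
        findings = call_validator(opinions)    # runs every round, never debates
        transcript.append({"round": r, "opinions": opinions, "findings": findings})
    return transcript
```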



&lt;p&gt;The key design choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Validator does not debate.&lt;/strong&gt; It doesn't take sides, doesn't argue, only verifies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator output is filtered before injection.&lt;/strong&gt; Other agents see only the structured markers, not the full validator reasoning. Otherwise they start quoting the validator as a peer, which defeats the point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FAIL findings carry forward explicitly.&lt;/strong&gt; The next round's prompt literally says "claims marked FAIL were verified wrong; do not reuse them." This is not subtle; it's what makes the pattern work.&lt;/li&gt;
&lt;/ol&gt;
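&lt;p&gt;Choices 2 and 3 come down to a few lines of string handling. A sketch of what I mean (the real engine's code may differ in detail):&lt;/p&gt;

```python
# Filter the validator's raw output down to its structured markers,
# then build the block injected into the next round's prompts.

MARKERS = ("[OK]", "[WARN]", "[FAIL]")

def filter_findings(raw_validator_output):
    # Drop the validator's free-form reasoning; keep only verdict lines.
    lines = raw_validator_output.splitlines()
    return [ln.strip() for ln in lines if ln.strip().startswith(MARKERS)]

def injection_block(findings):
    failed = [f for f in findings if f.startswith("[FAIL]")]
    block = "Fact-check results from last round:\n" + "\n".join(findings)
    if failed:
        block += "\nClaims marked [FAIL] were verified wrong; do not reuse them."
    return block

raw = """I examined each claim carefully.
[OK]   WSJ clearance report -- verified
Let me elaborate on my reasoning...
[FAIL] "DOJ docket 4471" -- no such filing
"""
print(injection_block(filter_findings(raw)))
```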

&lt;h2&gt;
  
  
  What the Validator actually sees
&lt;/h2&gt;

&lt;p&gt;The Validator's system prompt is strict and narrow. Paraphrased:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a validator. You do NOT participate in the debate.
Read what was said this round. Identify verifiable claims:
numbers, dates, company names, reports, URLs, events.

For each concrete claim, you MUST use web_search to verify.
Future-dated source claims (e.g. "May 25 report" cited on April 24)
are automatically [FAIL].

Output format:
  [OK]   &amp;lt;claim&amp;gt; — verified, source URL: ...
  [WARN] &amp;lt;claim&amp;gt; — suspect, reason: ...
  [FAIL] &amp;lt;claim&amp;gt; — fabricated or wrong, correction: ..., source URL: ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this on the acquisition example and you get something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[OK]   WSJ reported antitrust clearance on Oct 17 (wsj.com/articles/..., 2025-10-17)
[FAIL] "Morgan Stanley M&amp;amp;A tracker, November 28" — today is October 20, future-dated
[FAIL] "DOJ Second Request docket DOJ-HSR-2025-4471" — no such filing in PACER or DOJ records
[WARN] "85% approval probability" — probability figure unsourced; no widely published tracker confirms it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those FAIL lines get inlined into the next round's prompt. Model A, which fabricated the Morgan Stanley citation, reads its own claim marked [FAIL] and is told not to reuse it. In my test runs, &lt;strong&gt;the same model, given the same topic, in the very next round, correctly drops the fabrication and reframes its argument around real data&lt;/strong&gt;. No fine-tuning, no retraining — just structured feedback during generation.&lt;/p&gt;
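&lt;p&gt;One sub-rule is worth calling out because it needs no model at all: the future-dated check is mechanical. A sketch of it as plain code (illustrative; the real validator just applies the rule from its prompt, and this regex only handles "Month day" dates assumed to be in the current year):&lt;/p&gt;

```python
import datetime
import re
from operator import gt

# Auto-FAIL any claim whose cited source is dated after "today".
# Illustrative only: assumes dates look like "November 28" and fall
# in the current year.

MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

DATE_RE = re.compile(r"([A-Za-z]+) (\d{1,2})")

def future_dated(claim, today):
    for word, day in DATE_RE.findall(claim):
        month = MONTHS.get(word.lower())
        if month:
            cited = datetime.date(today.year, month, int(day))
            if gt(cited, today):     # the cited date falls after today
                return True
    return False

today = datetime.date(2025, 10, 20)
print(future_dated("Morgan Stanley tracker, November 28", today))  # True
print(future_dated("WSJ reported on October 17", today))           # False
```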

&lt;h2&gt;
  
  
  Before/after numbers from a real run
&lt;/h2&gt;

&lt;p&gt;I ran the same 4-persona × 3-round debate twice on the same topic. The only difference: the first run had a broken Validator (timeouts mid-round so most fact-checks didn't land). The second had the Validator running cleanly every round.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Run 1 (broken validator)&lt;/th&gt;
&lt;th&gt;Run 2 (clean)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persona calls completed&lt;/td&gt;
&lt;td&gt;9/12&lt;/td&gt;
&lt;td&gt;12/12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validator rounds that ran&lt;/td&gt;
&lt;td&gt;1/3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fabricated citations in log&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validator FAIL markers&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verified source URLs in log&lt;/td&gt;
&lt;td&gt;~5&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total runtime&lt;/td&gt;
&lt;td&gt;26 min&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Four extra minutes of runtime. Two fewer fabrications surviving to the synthesis step. For any downstream use that treats the synthesis as input — a decision support pipeline, a summary for a human in a hurry, a training dataset — this is a disproportionately good trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The implementation is boring (intentionally)
&lt;/h2&gt;

&lt;p&gt;The engine is one Python file, under 600 lines, with no dependencies beyond the standard library and PyYAML. Personas are YAML. Providers are OpenAI-compatible HTTP endpoints, plus a dispatcher that also knows how to shell out to CLI tools (useful if you already pay for a chat subscription and would rather reuse that access than buy API credits).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;swarm-debate/&lt;/span&gt;
&lt;span class="s"&gt;├── src/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── swarm_debate.py&lt;/span&gt;    &lt;span class="c1"&gt;# the engine&lt;/span&gt;
&lt;span class="s"&gt;│   ├── config.yaml&lt;/span&gt;         &lt;span class="c1"&gt;# providers, timeouts&lt;/span&gt;
&lt;span class="s"&gt;│   └── personas.yaml&lt;/span&gt;       &lt;span class="c1"&gt;# the six roles&lt;/span&gt;
&lt;span class="s"&gt;├── examples/&lt;/span&gt;
&lt;span class="s"&gt;│   ├── topics.md&lt;/span&gt;           &lt;span class="c1"&gt;# topics that produce good debates&lt;/span&gt;
&lt;span class="s"&gt;│   └── product-brief-...&lt;/span&gt;   &lt;span class="c1"&gt;# example context document&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I deliberately kept the model names and auth patterns out of the hot path. Which model you pick for each persona is in &lt;code&gt;personas.yaml&lt;/code&gt;; the engine itself doesn't care. You can run the whole thing entirely on local Ollama if you want, or mix cheap local models for the debating personas with a single cloud-backed model for the Validator.&lt;/p&gt;
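&lt;p&gt;For a sense of what that looks like, a persona entry conceptually carries a role prompt plus a provider binding. The field names below are illustrative only -- check the repo's &lt;code&gt;personas.yaml&lt;/code&gt; for the actual schema:&lt;/p&gt;

```yaml
# Illustrative shape only; these field names are NOT the repo's actual schema.
validator:
  provider: openai_compatible    # or a CLI tool, or local Ollama
  model: your-strong-model       # the one persona worth a strong model
  system_prompt: |
    You are a validator. You do NOT participate in the debate.
    For each concrete claim, you MUST use web_search to verify.
```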

&lt;h2&gt;
  
  
  Things that surprised me
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The validator is the bottleneck by a wide margin.&lt;/strong&gt; On my setup, the debating personas each took 30-180 seconds per round. The Validator took 300+ seconds because it has to read all four persona outputs and run a web search per claim. If you want this faster, the Validator's reasoning effort is the single highest-leverage knob; just don't turn it down blindly (see the next point).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality is non-linear in reasoning effort for the validator specifically.&lt;/strong&gt; Cheap validator = performative. It nods at claims without actually looking anything up. It might say &lt;code&gt;[OK] "according to Reuters"&lt;/code&gt; without verifying that Reuters actually said the thing. You can tell from the log: a good validator produces URLs; a cheap one produces vague attributions. This matches the intuition that fact-checking is harder than answering.&lt;/p&gt;
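&lt;p&gt;That log signal is easy to check mechanically. A rough health metric I'd compute over a debate log (a sketch, assuming the marker format from the validator prompt):&lt;/p&gt;

```python
# Rough health check for a validator log: what fraction of [OK] and
# [FAIL] verdicts actually carry a source URL?

def validator_url_rate(log_text):
    verdicts = [ln for ln in log_text.splitlines()
                if ln.strip().startswith(("[OK]", "[FAIL]"))]
    if not verdicts:
        return 0.0
    sourced = [ln for ln in verdicts if "http" in ln]
    return len(sourced) / len(verdicts)

log = """[OK]   WSJ reported clearance, source URL: https://www.wsj.com/articles/x
[FAIL] "DOJ docket 4471" -- no such filing, source URL: https://pacer.uscourts.gov
[OK]   "according to Reuters" -- sounds plausible
"""
print(validator_url_rate(log))   # 2 of the 3 verdicts carry a URL
```

&lt;p&gt;A number well below 1.0 on the [OK] and [FAIL] lines is the "performative validator" smell described above.&lt;/p&gt;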

&lt;p&gt;&lt;strong&gt;Personas with single-responsibility prompts outperform multi-responsibility prompts.&lt;/strong&gt; An early version had the Researcher persona double as the validator — "when you research, also fact-check the others." Argument quality dropped, fact-check quality dropped, and both responsibilities became half-hearted. Splitting them fixed both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's not solved
&lt;/h2&gt;

&lt;p&gt;A few things I left on the roadmap because I didn't want to ship speculative solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Round adaptivity.&lt;/strong&gt; All debates run a fixed number of rounds. Most topics converge by round 3 anyway, but "no new information" detection would save time on easy questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async validator.&lt;/strong&gt; The validator currently blocks the next round. Running it in parallel is straightforward but changes the injection semantics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meta-validator.&lt;/strong&gt; Two validators from different model families, disagreements flagged. Cheap insurance against validator-specific failure modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persona reliability metrics.&lt;/strong&gt; Track which personas accumulate the most FAIL markers in your domain. In my runs one persona was noticeably more prone to fabrication than the others; I'd rather surface that data than guess.&lt;/li&gt;
&lt;/ul&gt;
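&lt;p&gt;The last item is the easiest to prototype from logs you already have. A sketch, with the caveat that attributing a failed claim back to a persona is the genuinely hand-wavy part:&lt;/p&gt;

```python
from collections import Counter

# Tally [FAIL] findings per persona across a debate transcript.
# Attribution is naive: a FAIL is charged to any persona whose opinion
# contains the quoted fragment from the FAIL line. Sketch only.

def fail_counts(transcript):
    tally = Counter()
    for rnd in transcript:
        fails = [ln for ln in rnd["findings"].splitlines()
                 if ln.startswith("[FAIL]")]
        for ln in fails:
            quoted = ln.split('"')
            if len(quoted) == 3:            # exactly one quoted fragment
                for persona, text in rnd["opinions"].items():
                    if quoted[1] in text:
                        tally[persona] += 1
    return tally
```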

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/capitansuat/swarm-debate.git
&lt;span class="nb"&gt;cd &lt;/span&gt;swarm-debate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; swarm/debates
&lt;span class="nb"&gt;cp &lt;/span&gt;src/&lt;span class="k"&gt;*&lt;/span&gt;.yaml swarm/

python3 src/swarm_debate.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--topic&lt;/span&gt; &lt;span class="s2"&gt;"Should we migrate our 65kLOC TypeScript backend to Rust?"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agents&lt;/span&gt; analyst,strategist,devils_advocate,researcher &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rounds&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit &lt;code&gt;swarm/personas.yaml&lt;/code&gt; to point at whatever providers you have (API keys, CLI tools, local Ollama — any combination works). The dispatcher figures out which path to use based on what's configured.&lt;/p&gt;
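&lt;p&gt;The dispatch logic itself can be that simple. A sketch of the config-driven fallthrough (field names and the endpoint path are my assumptions, not necessarily what the repo does):&lt;/p&gt;

```python
import json
import subprocess
import urllib.request

# Config-driven dispatch: HTTP if a base_url is configured, CLI tool if a
# command is, otherwise fail loudly. Field names ("base_url", "command",
# "model") are assumptions for illustration, not the repo's schema.

def dispatch(provider_cfg, prompt):
    if provider_cfg.get("base_url"):
        # OpenAI-compatible chat completions endpoint
        payload = {"model": provider_cfg["model"],
                   "messages": [{"role": "user", "content": prompt}]}
        req = urllib.request.Request(
            provider_cfg["base_url"] + "/v1/chat/completions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]
    if provider_cfg.get("command"):
        # Shell out to a CLI tool; the prompt goes as the last argument.
        result = subprocess.run(provider_cfg["command"] + [prompt],
                                capture_output=True, text=True)
        return result.stdout
    raise ValueError("no provider path configured for this persona")
```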

&lt;p&gt;The output is a Markdown log with all rounds, validator findings, and a synthesis section ready to pipe into a strong model for the final answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Credit
&lt;/h2&gt;

&lt;p&gt;The shared-expert idea came from reading the &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; community repository — a speculative reconstruction of a hypothetical MoE language model. OpenMythos is architecture-level speculation rather than a runnable model, and its specific claims about actual production systems are unverified, but the &lt;em&gt;structural&lt;/em&gt; idea of one expert always running alongside the routed experts is a real pattern in MoE research and it translates cleanly into multi-agent systems.&lt;/p&gt;

&lt;p&gt;Related papers worth a skim if you find this interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2101.03961" rel="noopener noreferrer"&gt;Switch Transformers&lt;/a&gt; — foundational MoE work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2401.06066" rel="noopener noreferrer"&gt;DeepSeekMoE&lt;/a&gt; — formal shared-expert definition&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1807.03819" rel="noopener noreferrer"&gt;Universal Transformers&lt;/a&gt; — for the "same weights, different round behavior" idea I want to try next&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run it on a topic I wouldn't think to try, I'd like to see the log. Open an issue with the result attached — it's the kind of feedback that tells me whether the pattern generalizes or works only in my specific workload.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/capitansuat/swarm-debate" rel="noopener noreferrer"&gt;capitansuat/swarm-debate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
