<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Muhammad Ali Nasir</title>
    <description>The latest articles on Forem by Muhammad Ali Nasir (@al1nasir).</description>
    <link>https://forem.com/al1nasir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3817260%2F1fe7d997-e7ca-4b37-b3d3-4def622b0758.jpg</url>
      <title>Forem: Muhammad Ali Nasir</title>
      <link>https://forem.com/al1nasir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/al1nasir"/>
    <language>en</language>
    <item>
      <title>I Built an AI System That Makes 4 Agents Debate Scientific Papers, and Then Tells You Where They Disagree</title>
      <dc:creator>Muhammad Ali Nasir</dc:creator>
      <pubDate>Tue, 10 Mar 2026 16:38:34 +0000</pubDate>
      <link>https://forem.com/al1nasir/i-built-an-ai-system-that-makes-4-agents-debate-scientific-papers-and-then-tells-you-where-they-102n</link>
      <guid>https://forem.com/al1nasir/i-built-an-ai-system-that-makes-4-agents-debate-scientific-papers-and-then-tells-you-where-they-102n</guid>
      <description>&lt;p&gt;&lt;em&gt;How GraphRAG + a multi-LLM council produces more trustworthy answers than any single AI model&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw7flkllbuqa9yb5a87f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnw7flkllbuqa9yb5a87f.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a quiet crisis in AI-assisted research that nobody talks about.&lt;/p&gt;

&lt;p&gt;Every tool you've used — ChatGPT, Perplexity, Copilot, Elicit — does the same thing: it reads papers and gives you &lt;strong&gt;one confident answer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfgjqqpd2nuyicicmmhz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfgjqqpd2nuyicicmmhz.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem is that science doesn't work that way.&lt;/p&gt;

&lt;p&gt;Take BRCA1's role in triple-negative breast cancer. Ask any AI tool and you'll get a confident, well-written paragraph. What you won't get is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Study A says BRCA1 mutations are associated with &lt;strong&gt;increased aggressiveness&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Study B says the same patients show &lt;strong&gt;better response rates to chemotherapy&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Study C shows &lt;strong&gt;shorter progression-free survival&lt;/strong&gt; despite better initial response&lt;/li&gt;
&lt;li&gt;Three studies report conflicting BRCA1 expression levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't edge cases. These are the real contradictions buried across 200 papers that a researcher needs to know before designing an experiment, filing an IND, or trusting a conclusion.&lt;/p&gt;

&lt;p&gt;A single AI model smooths these over. It synthesizes them into a confident answer. And in doing so, it hides exactly the information a scientist needs most.&lt;/p&gt;

&lt;p&gt;This is why I built &lt;strong&gt;Research Council&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Deliberation Over Confidence
&lt;/h2&gt;

&lt;p&gt;The idea behind this project came from an unlikely place: Andrej Karpathy's &lt;a href="https://github.com/karpathy/llm-council" rel="noopener noreferrer"&gt;llm-council&lt;/a&gt; repo, a simple Saturday hack that, instead of asking one LLM a question, routes it to multiple LLMs and has them review each other's answers.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;cross-model review catches things a single model misses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wanted to take this further. What if, instead of generic LLMs reviewing each other, you had &lt;em&gt;specialized agents&lt;/em&gt;, each prompted to look at the evidence from a fundamentally different angle, deliberating over a &lt;em&gt;structured knowledge graph&lt;/em&gt; of papers?&lt;/p&gt;

&lt;p&gt;That's Research Council.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Actually Does
&lt;/h2&gt;

&lt;p&gt;When you ask Research Council a research question, here's what happens:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The Knowledge Graph
&lt;/h3&gt;

&lt;p&gt;Before you ask anything, papers are ingested from PubMed, arXiv, Semantic Scholar, or uploaded as PDFs. Research Council doesn't just chunk them into vectors. It builds a &lt;strong&gt;Neo4j knowledge graph&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt;: Paper, Gene, Drug, Disease, Protein, Pathway, Author, Conclusion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: CONTRADICTS, SUPPORTS, CITES, MENTIONS, STUDIES, TARGETS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the critical difference from standard RAG. Traditional RAG asks: &lt;em&gt;"which chunks are semantically similar to my query?"&lt;/em&gt; GraphRAG asks: &lt;em&gt;"what is the structural relationship between the entities in my query?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you ask about BRCA1 and TNBC, the graph doesn't just return the most similar text chunks — it returns the &lt;strong&gt;neighborhood of BRCA1&lt;/strong&gt;: every paper that mentions it, every drug that targets related proteins, and critically, every paper that &lt;strong&gt;contradicts&lt;/strong&gt; another paper on the topic.&lt;/p&gt;
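
&lt;p&gt;To make the neighborhood idea concrete, here is a minimal in-memory sketch. It is not the project's Neo4j code: the edge list, the &lt;code&gt;paper:A&lt;/code&gt;-style IDs, and the &lt;code&gt;build_index&lt;/code&gt; and &lt;code&gt;neighborhood&lt;/code&gt; helpers are all made up for illustration. Only the relationship type names come from the list above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from collections import defaultdict

# Edges are (source, relationship, target) triples, mirroring the
# relationship types listed above. All IDs are made-up placeholders.
EDGES = [
    ("paper:A", "MENTIONS", "gene:BRCA1"),
    ("paper:B", "MENTIONS", "gene:BRCA1"),
    ("paper:C", "MENTIONS", "gene:BRCA1"),
    ("paper:D", "MENTIONS", "gene:TP53"),
    ("paper:A", "CONTRADICTS", "paper:B"),
    ("paper:C", "SUPPORTS", "paper:A"),
    ("paper:D", "CITES", "paper:A"),
]


def build_index(edges):
    """Index edges by both endpoints so lookups work in either direction."""
    index = defaultdict(list)
    for source, rel, target in edges:
        index[source].append((source, rel, target))
        index[target].append((source, rel, target))
    return index


def neighborhood(index, entity):
    """Return papers mentioning `entity` and the contradictions among them."""
    papers = sorted(
        source for source, rel, target in index[entity] if rel == "MENTIONS"
    )
    contradictions = sorted(
        {
            (source, target)
            for paper in papers
            for source, rel, target in index[paper]
            if rel == "CONTRADICTS" and source in papers and target in papers
        }
    )
    return {"papers": papers, "contradictions": contradictions}


INDEX = build_index(EDGES)
result = neighborhood(INDEX, "gene:BRCA1")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A pure vector search would have no way to express the second half of that result: the contradiction is a property of the edge between two papers, not of either paper's text.&lt;/p&gt;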

&lt;h3&gt;
  
  
  Step 2: Token-Efficient Context Assembly
&lt;/h3&gt;

&lt;p&gt;Here's a technical detail that matters a lot at scale.&lt;/p&gt;

&lt;p&gt;Naive multi-agent systems load every available tool into the prompt upfront. With 50+ tools, that's 25,000+ tokens (roughly 500 tokens per tool description) before the agent does anything useful.&lt;/p&gt;

&lt;p&gt;Research Council uses &lt;a href="https://github.com/langchain-ai/langgraph-bigtool" rel="noopener noreferrer"&gt;langgraph-bigtool&lt;/a&gt;: tools are embedded with SentenceTransformers at startup and retrieved semantically at query time. Only 2-4 relevant tools are loaded per query.&lt;/p&gt;

&lt;p&gt;The result: a full 4-agent deliberation on a complex biomedical question uses &lt;strong&gt;3,118 tokens total&lt;/strong&gt;, or about $0.002.&lt;/p&gt;
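
&lt;p&gt;The retrieve-then-load pattern behind this can be sketched in a few lines. This is a toy, not langgraph-bigtool's API: the tool names and descriptions are hypothetical, and a word-count vector stands in for the SentenceTransformers embedding so the example runs with the standard library alone.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import math
from collections import Counter

# Hypothetical tool registry: names and descriptions are illustrative only.
TOOLS = {
    "pubmed_search": "search pubmed for biomedical papers by keyword",
    "graph_neighbors": "fetch neighboring entities of a gene or drug from the knowledge graph",
    "find_contradictions": "list papers that contradict each other on a topic",
    "export_pdf": "export a report as a formatted pdf file",
    "arxiv_search": "search arxiv for preprints by keyword",
}


def embed(text):
    """Toy embedding: word counts. A real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Embed every tool description once, at startup.
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOLS.items()}


def retrieve_tools(query, k=2):
    """Return the names of the k tools most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        TOOL_VECTORS,
        key=lambda name: cosine(query_vec, TOOL_VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]


selected = retrieve_tools("which papers contradict each other on brca1")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only the tools in &lt;code&gt;selected&lt;/code&gt; would have their full descriptions placed in the prompt, so prompt size depends on &lt;code&gt;k&lt;/code&gt; rather than on the size of the registry.&lt;/p&gt;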

&lt;h3&gt;
  
  
  Step 3: The Council Deliberates
&lt;/h3&gt;

&lt;p&gt;Four specialized agents receive the same knowledge graph subgraph and analyze it in parallel via &lt;code&gt;asyncio.gather()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔬 Evidence Agent&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What does the data actually show? Be precise about sample sizes, study types, and effect sizes. Never speculate beyond what the data shows."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;⚔️ Skeptic Agent&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Find the weaknesses: biased study designs, underpowered samples, conflicting results, publication bias. Be constructively critical."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;🔗 Connector Agent&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Find non-obvious links — drug repurposing opportunities, analogous mechanisms from other diseases, techniques from adjacent fields."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;📋 Methodology Agent&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Evaluate whether experimental designs are appropriate, controls are adequate, statistical methods are sound, and whether conclusions are justified by the methods used."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each agent produces an independent response. Then the real work begins.&lt;/p&gt;
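
&lt;p&gt;Here is a minimal sketch of that fan-out, assuming one coroutine per agent. The &lt;code&gt;call_llm&lt;/code&gt; function is a stand-in that only echoes its inputs (a real version would call Groq or OpenRouter), and the prompts are abbreviated from the ones quoted above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import asyncio

# Role prompts abbreviated from the descriptions above.
AGENT_PROMPTS = {
    "evidence": "What does the data actually show?",
    "skeptic": "Find the weaknesses.",
    "connector": "Find non-obvious links.",
    "methodology": "Evaluate the experimental designs.",
}


async def call_llm(system_prompt, context):
    """Stand-in for a real LLM request; it only echoes its inputs."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{system_prompt}] analysis of {len(context)} context items"


async def run_agent(name, context):
    """Run one agent and tag the response with the agent's name."""
    response = await call_llm(AGENT_PROMPTS[name], context)
    return name, response


async def run_council(context):
    """Run all four agents concurrently on the same subgraph context."""
    results = await asyncio.gather(
        *(run_agent(name, context) for name in AGENT_PROMPTS)
    )
    return dict(results)


responses = asyncio.run(run_council(["paper:A", "paper:B"]))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because &lt;code&gt;asyncio.gather()&lt;/code&gt; returns results in the order the coroutines were passed in, pairing each response with its agent name stays simple even when the calls finish out of order.&lt;/p&gt;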

&lt;h3&gt;
  
  
  Step 4: 12 Cross-Reviews
&lt;/h3&gt;

&lt;p&gt;Every agent reviews every other agent's response — anonymized, to prevent model bias. That's 4 × 3 = 12 peer evaluations.&lt;/p&gt;

&lt;p&gt;Each review produces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An agreement score (0.0 to 1.0)&lt;/li&gt;
&lt;li&gt;Specific points of disagreement&lt;/li&gt;
&lt;li&gt;Constructive critique&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;aggregate agreement score&lt;/strong&gt; becomes a signal for confidence. High agreement → higher confidence. Persistent disagreement → lower confidence, and the Chairman must explain &lt;em&gt;why&lt;/em&gt; the agents disagreed.&lt;/p&gt;
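
&lt;p&gt;The bookkeeping for this stage is small enough to sketch. The version below assumes the aggregate is a plain mean of the per-review scores, which may differ from what the project actually does; the neutral "Response A" labels and the placeholder scores are likewise illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from itertools import permutations
from statistics import mean

AGENTS = ["evidence", "skeptic", "connector", "methodology"]


def review_pairs(agents):
    """Every ordered (reviewer, author) pair with reviewer != author."""
    return list(permutations(agents, 2))


def anonymize(agents):
    """Map agent names to neutral labels so reviewers cannot see the author."""
    return {name: f"Response {chr(ord('A') + i)}" for i, name in enumerate(agents)}


def aggregate_agreement(scores):
    """Mean of per-review agreement scores, each expected in [0.0, 1.0]."""
    return round(mean(scores.values()), 2)


pairs = review_pairs(AGENTS)
labels = anonymize(AGENTS)

# Placeholder scores: a real run would get these from the reviewing models.
scores = {pair: 0.9 for pair in pairs}
scores[("skeptic", "connector")] = 0.3
scores[("connector", "skeptic")] = 0.3
agreement = aggregate_agreement(scores)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these made-up numbers, one pair of agents in sharp disagreement is enough to pull the aggregate well below the 0.9 that every other review reported.&lt;/p&gt;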

&lt;h3&gt;
  
  
  Step 5: The Chairman Synthesizes
&lt;/h3&gt;

&lt;p&gt;A Chairman agent (claude-sonnet via OpenRouter) receives all four original responses plus all twelve cross-reviews. It produces:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"key_findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"contradictions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"citations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"claim"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"paper_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PMID:..."&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"methodology_notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent_agreement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the confidence score is &lt;strong&gt;0.65, not 0.95&lt;/strong&gt;. That's intentional. The system doesn't inflate confidence. If the evidence is contested, the score reflects that.&lt;/p&gt;
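
&lt;p&gt;Since the output is structured, it can be checked before anything downstream trusts it. The sketch below mirrors the JSON fields shown above in a dataclass; the &lt;code&gt;CouncilAnswer&lt;/code&gt; and &lt;code&gt;parse_answer&lt;/code&gt; names and the specific validation rules are assumptions, not the project's actual code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json
from dataclasses import dataclass


@dataclass
class CouncilAnswer:
    summary: str
    confidence_score: float
    key_findings: list
    contradictions: list
    citations: list
    methodology_notes: str
    agent_agreement: float

    def __post_init__(self):
        # Both scores are expected on a 0.0 to 1.0 scale.
        for name in ("confidence_score", "agent_agreement"):
            value = getattr(self, name)
            if value > 1.0 or 0.0 > value:
                raise ValueError(f"{name} must be between 0.0 and 1.0")
        # Every citation must tie a claim to a paper.
        for citation in self.citations:
            if "claim" not in citation or "paper_id" not in citation:
                raise ValueError("each citation needs 'claim' and 'paper_id'")


def parse_answer(raw_json):
    """Parse the Chairman's JSON text; unknown or missing keys raise TypeError."""
    return CouncilAnswer(**json.loads(raw_json))


raw = json.dumps(
    {
        "summary": "...",
        "confidence_score": 0.65,
        "key_findings": ["...", "..."],
        "contradictions": ["...", "..."],
        "citations": [{"claim": "...", "paper_id": "PMID:..."}],
        "methodology_notes": "...",
        "agent_agreement": 0.80,
    }
)
answer = parse_answer(raw)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rejecting a malformed answer at this point is cheaper than discovering a missing &lt;code&gt;paper_id&lt;/code&gt; after the conclusion has already been stored.&lt;/p&gt;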

&lt;h3&gt;
  
  
  Step 6: The Write-Back Loop
&lt;/h3&gt;

&lt;p&gt;Every conclusion the Chairman produces gets written back to Neo4j as a new &lt;code&gt;Conclusion&lt;/code&gt; node — linked to every paper it references. The graph compounds over time. Each query makes future queries more informed.&lt;/p&gt;
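
&lt;p&gt;One way to express that write-back is a single parameterized Cypher statement. The sketch below only builds the query and its parameters and executes nothing. The &lt;code&gt;Conclusion&lt;/code&gt; and &lt;code&gt;Paper&lt;/code&gt; labels come from the node list earlier, but the property names, the use of &lt;code&gt;CITES&lt;/code&gt; to link a conclusion to its papers, the hash-based ID, and the placeholder paper IDs are all assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import hashlib

# Assumed Cypher: the labels come from the article, while the property
# names and the CITES link from conclusion to paper are illustrative.
WRITE_BACK_QUERY = """
MERGE (c:Conclusion {id: $conclusion_id})
SET c.summary = $summary, c.confidence = $confidence
WITH c
UNWIND $paper_ids AS paper_id
MATCH (p:Paper {id: paper_id})
MERGE (c)-[:CITES]->(p)
"""


def build_write_back(answer):
    """Return (query, params) for a parameterized write; nothing is executed."""
    paper_ids = sorted({c["paper_id"] for c in answer["citations"]})
    # A content hash as the ID makes repeated writes of the same summary idempotent.
    conclusion_id = hashlib.sha256(answer["summary"].encode("utf-8")).hexdigest()[:16]
    params = {
        "conclusion_id": conclusion_id,
        "summary": answer["summary"],
        "confidence": answer["confidence_score"],
        "paper_ids": paper_ids,
    }
    return WRITE_BACK_QUERY, params


example_answer = {
    "summary": "placeholder summary",
    "confidence_score": 0.65,
    "citations": [
        {"claim": "placeholder claim 1", "paper_id": "PMID:placeholder-2"},
        {"claim": "placeholder claim 2", "paper_id": "PMID:placeholder-1"},
        {"claim": "placeholder claim 3", "paper_id": "PMID:placeholder-1"},
    ],
}
query, params = build_write_back(example_answer)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Using &lt;code&gt;MERGE&lt;/code&gt; rather than &lt;code&gt;CREATE&lt;/code&gt; means asking the same question twice should not leave duplicate conclusion nodes behind.&lt;/p&gt;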




&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;Here's the actual output on &lt;em&gt;"Are there contradictions in BRCA1's role in triple-negative breast cancer?"&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence: 65%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not 95%. Not "based on multiple sources." A calibrated 65% because the agents genuinely disagreed on two points and the methodology agent flagged three studies as underpowered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4 Contradictions surfaced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BRCA1 mutations associated with both increased tumor aggressiveness AND better prognosis&lt;/li&gt;
&lt;li&gt;Higher treatment response rates but shorter progression-free survival&lt;/li&gt;
&lt;li&gt;Conflicting reports on BRCA1 expression levels across studies&lt;/li&gt;
&lt;li&gt;Variable associations between BRCA1 mutations and TNBC significance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6 Key findings&lt;/strong&gt;, each cited to a specific PubMed ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8 Methodology concerns&lt;/strong&gt;, including variable TNBC definitions, selection bias, small sample sizes, and retrospective designs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent agreement: 80%&lt;/strong&gt; — two agents disagreed on whether the survival paradox was explained by tumor heterogeneity or methodological inconsistency.&lt;/p&gt;

&lt;p&gt;Compare this to what ChatGPT gives you: a confident, well-written paragraph that mentions none of the contradictions.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Researcher Query
      │
      ▼
GraphRAG Layer
  Neo4j: entities + relationships
  ChromaDB: vector embeddings (CPU-only, MiniLM)
  Hybrid retrieval: ~2,000 token context
      │
      ▼
LangGraph Orchestrator
  BigTool: 2-4 tools loaded dynamically
  Hybrid retrieval node
  Context assembly node
      │
      ▼
LLM Council (Groq + OpenRouter)
  Stage 1: 4 parallel agents
  Stage 2: 12 cross-reviews
  Stage 3: Chairman synthesis
      │
      ▼
Answer + Neo4j Writeback
  Confidence · Citations · Contradictions · Provenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Full stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LangGraph + langgraph-bigtool (orchestration)&lt;/li&gt;
&lt;li&gt;Neo4j 5 Community (knowledge graph, local)&lt;/li&gt;
&lt;li&gt;ChromaDB (vector store, local, CPU)&lt;/li&gt;
&lt;li&gt;all-MiniLM-L6-v2 (embeddings, 80MB, CPU-only)&lt;/li&gt;
&lt;li&gt;Groq llama-3.3-70b + llama-3.1-8b (council agents, fast)&lt;/li&gt;
&lt;li&gt;OpenRouter claude-sonnet (Chairman, best synthesis quality)&lt;/li&gt;
&lt;li&gt;FastAPI + React + Vite&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hardware requirements:&lt;/strong&gt; 16GB RAM, 4GB VRAM. No beefy GPU needed. Embeddings run entirely on CPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned Building This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The write-back loop is the most underappreciated part&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most RAG systems are stateless. Query in, answer out. Research Council writes every conclusion back to the graph as a new node linked to source papers. After 50 queries, the graph has 50 council-reviewed conclusions that inform future answers. This is the difference between a tool and a system that compounds knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Confidence calibration is harder than it sounds&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Getting agents to express &lt;em&gt;genuine&lt;/em&gt; uncertainty rather than inflated confidence required careful prompt engineering. The current approach — deriving confidence from agent agreement scores — works but isn't theoretically principled. There's real research to be done here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. 12 cross-reviews might be overkill&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;n × (n-1) cross-reviews scales quadratically. With 4 agents that's 12, manageable. With 8 agents that's 56 — too slow. A smarter aggregation strategy (maybe pairwise disagreement sampling) would make larger councils viable.&lt;/p&gt;
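
&lt;p&gt;The arithmetic, plus one possible way out, can be sketched as follows. Note that the sampling here is uniform random, a simpler stand-in for the disagreement-based sampling floated above, and the &lt;code&gt;sample_reviews&lt;/code&gt; helper is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import random


def full_review_count(n):
    """Number of ordered reviewer/author pairs among n agents."""
    return n * (n - 1)


def sample_reviews(agents, reviews_per_response, seed=0):
    """Pick a fixed number of random reviewers for each response.

    This keeps the total at n * reviews_per_response, which grows linearly.
    """
    rng = random.Random(seed)
    pairs = []
    for author in agents:
        others = [a for a in agents if a != author]
        for reviewer in rng.sample(others, reviews_per_response):
            pairs.append((reviewer, author))
    return pairs


agents = [f"agent_{i}" for i in range(8)]
sampled = sample_reviews(agents, reviews_per_response=2)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 8 agents and 2 reviewers per response, that is 16 reviews instead of 56. The trade-off is that some pairs of agents never review each other, so a disagreement between them can go unnoticed.&lt;/p&gt;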

&lt;p&gt;&lt;strong&gt;4. The skeptic agent is the most valuable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After dozens of test queries, the Skeptic Agent consistently surfaces the most useful information — not because skepticism is inherently valuable, but because existing AI tools have a strong bias toward presenting positive findings. The explicit adversarial role corrects for this.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Research Council V1 is live and open source (MIT licensed).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2 plans:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Community detection on the knowledge graph (Louvain clustering)&lt;/li&gt;
&lt;li&gt;Temporal analysis — track how understanding of a topic evolves year by year&lt;/li&gt;
&lt;li&gt;MCP server integrations — Zotero, PubMed MCP, Neo4j MCP&lt;/li&gt;
&lt;li&gt;HuggingFace Space for live demos&lt;/li&gt;
&lt;li&gt;Export council output as formatted PDF for lab notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I'd love help with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better confidence calibration methodology&lt;/li&gt;
&lt;li&gt;Async optimization of the cross-review loop&lt;/li&gt;
&lt;li&gt;Additional paper sources (bioRxiv, ChemRxiv, Europe PMC)&lt;/li&gt;
&lt;li&gt;Domain-specific agent specializations beyond biomedicine&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/al1-nasir/Research_council" rel="noopener noreferrer"&gt;https://github.com/al1-nasir/Research_council&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MIT licensed. Runs on a laptop. Groq has a free tier. OpenRouter costs fractions of a cent per query.&lt;/p&gt;

&lt;p&gt;If you work in research, drug discovery, or AI for science — I'd genuinely love to know what question you'd throw at it first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Muhammad Ali Nasir. Inspired by karpathy/llm-council, extended with domain-specific GraphRAG and biomedical agent specialization.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Artificial Intelligence · Machine Learning · Biomedical Research · GraphRAG · Open Source · LLM · Python · Drug Discovery · Research Tools&lt;/p&gt;

</description>
      <category>graphrag</category>
      <category>llm</category>
      <category>multiagent</category>
    </item>
  </channel>
</rss>
