Forem: Muhammad Ali Nasir

LocalForge: I built a self-hosted LLM control plane with intelligent routing and LoRA finetuning

Muhammad Ali Nasir — Thu, 23 Apr 2026 14:33:00 +0000

Running local LLMs is easy. Running them well in a real application is not.

You end up with fragile inference scripts, no idea which model fits which task, manual VRAM calculations, and zero observability into what's actually happening. I got tired of it, so I built LocalForge.

What it is

LocalForge is a self-hosted AI control plane. It exposes a single OpenAI-compatible endpoint and handles everything else — model lifecycle, intelligent routing, memory, and finetuning.

import openai

# Your app stays the same. Just change base_url.
client = openai.OpenAI(base_url="http://localhost:8010/v1", api_key="lf-xxx")
response = client.chat.completions.create(model="auto", messages=[...])

How the router works

When you send model: "auto", the routing engine:

1. Classifies the query into coding / math / reasoning / instruction / general, using TF-IDF + Logistic Regression in under 5ms
2. Scores each model using:
   - Benchmark scores from HuggingFace (MMLU-Pro, HumanEval, GSM8K): 40%
   - Vector memory of past query→outcome pairs stored in Qdrant: 30%
   - Measured latency on your hardware: 15%
   - Thumbs up/down feedback: 15%
3. Falls back to cloud (OpenAI/Gemini) if confidence < 0.3
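
The scoring step can be sketched as a weighted sum. This is a minimal illustration rather than LocalForge's actual code: the ModelSignals class, the route function, and the assumption that every signal is already normalized to [0, 1] are mine, and I am also assuming that the top score doubles as the routing confidence.

```python
from dataclasses import dataclass

# Weights come from the list above; all names below are illustrative.
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}
CLOUD_FALLBACK_THRESHOLD = 0.3


@dataclass
class ModelSignals:
    name: str
    benchmark: float  # benchmark score for the classified task, scaled to [0, 1]
    memory: float     # outcome score from similar past queries, [0, 1]
    latency: float    # 1.0 = fastest measured on this hardware, [0, 1]
    feedback: float   # share of thumbs-up votes, [0, 1]


def score(m: ModelSignals) -> float:
    """Weighted sum of the four routing signals."""
    return (
        WEIGHTS["benchmark"] * m.benchmark
        + WEIGHTS["memory"] * m.memory
        + WEIGHTS["latency"] * m.latency
        + WEIGHTS["feedback"] * m.feedback
    )


def route(candidates: list[ModelSignals]) -> str:
    """Return the best local model, or "cloud" when confidence is too low."""
    best = max(candidates, key=score)
    # Assumption: the top score doubles as the routing confidence.
    if score(best) < CLOUD_FALLBACK_THRESHOLD:
        return "cloud"
    return best.name
```

Keeping the weights in one dictionary makes it easy to experiment with a different balance, for example giving feedback more influence once enough votes have accumulated.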

The memory layer uses nomic-embed-text-v1.5 to embed every query locally. Similar past queries are retrieved at routing time, and scores decay exponentially (λ = 0.95) so fresh failures hurt more than old ones.
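
One way to read the decay rule is as a weighted average in which each past outcome is weighted by λ raised to its age. The sketch below assumes that reading. The unit of age (days here), the neutral prior of 0.5 for an empty history, and the function name are all assumptions, not details from LocalForge.

```python
DECAY = 0.95  # λ from the text


def decayed_score(outcomes: list[tuple[float, float]], decay: float = DECAY) -> float:
    """Average past outcomes, weighting each by decay ** age.

    outcomes holds (outcome, age) pairs: outcome is 1.0 for success and 0.0
    for failure, and age is measured in days (an assumed unit).
    """
    if not outcomes:
        return 0.5  # assumed neutral prior when there is no history
    weights = [decay ** age for _, age in outcomes]
    weighted = sum(w * outcome for w, (outcome, _) in zip(weights, outcomes))
    return weighted / sum(weights)
```

With this weighting, a failure from today paired with a success from 30 days ago scores well below 0.5, while the reverse ordering scores well above it.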

VRAM lifecycle

Consumer GPUs can only hold 1–2 models at a time. LocalForge manages atomic state transitions:

UNLOADED → LOADING → HOT → UNLOADING → UNLOADED

Requests queue during model swaps. The "Resident Model" (most-used in the past 24h) is prioritized to stay loaded.
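
A minimal sketch of that lifecycle as a guarded state machine follows. The ModelSlot class and the use of a threading.Lock to make each transition atomic are my assumptions; the real implementation may well use async primitives instead.

```python
import threading
from enum import Enum


class ModelState(Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    HOT = "hot"
    UNLOADING = "unloading"


# Each state has exactly one legal successor, matching the cycle above.
NEXT_STATE = {
    ModelState.UNLOADED: ModelState.LOADING,
    ModelState.LOADING: ModelState.HOT,
    ModelState.HOT: ModelState.UNLOADING,
    ModelState.UNLOADING: ModelState.UNLOADED,
}


class ModelSlot:
    """Hypothetical holder for one model's lifecycle state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = ModelState.UNLOADED
        self._lock = threading.Lock()

    def advance(self, target: ModelState) -> None:
        """Move to target, rejecting any transition outside the cycle."""
        with self._lock:  # check and update happen as one step
            if NEXT_STATE[self.state] is not target:
                raise ValueError(f"{self.state.name} -> {target.name} is not allowed")
            self.state = target
```

Rejecting illegal transitions up front means a request can never observe a model that is, say, HOT and LOADING at once.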

Finetuning pipeline

Upload CSV or JSONL dataset via the dashboard
Pick base model + hyperparameters
Training runs in an isolated subprocess via Unsloth (2× faster, 60% less VRAM)
Live loss curves stream to the browser via SSE
On completion: LoRA adapters merged → GGUF exported → model auto-registered in the router
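
For the live loss curves, the wire format is plain Server-Sent Events: each message is a few "field: value" lines followed by a blank line. Here is a small sketch of a generator that turns (step, loss) pairs into SSE frames. The event names "loss" and "done" and the JSON payload shape are assumptions for illustration.

```python
import json
from typing import Iterable, Iterator


def sse_frames(losses: Iterable[tuple[int, float]]) -> Iterator[str]:
    """Yield one SSE frame per training step, then a final 'done' frame."""
    for step, loss in losses:
        payload = json.dumps({"step": step, "loss": loss})
        # A blank line terminates each SSE message.
        yield f"event: loss\ndata: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"
```

In a FastAPI app, a generator like this could be wrapped in a StreamingResponse with media_type="text/event-stream", and the browser would read it with EventSource.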

Tech stack

Layer	Tech
Backend	FastAPI + aiosqlite (WAL)
Frontend	Next.js 16 + React 19
Inference	llama-cpp-python
Vector store	Qdrant (disk, no Docker)
Embeddings	nomic-embed-text-v1.5
Finetuning	Unsloth / PEFT + TRL
Classifier	scikit-learn TF-IDF + LogReg

GitHub

al1-nasir / LocalForge

Self-hosted AI control plane for intelligent local LLM orchestration. OpenAI-compatible API · ML-powered multi-model routing · LoRA finetuning · vector memory · RAG

⚡ LocalForge

Self-Hosted AI Control Plane for Intelligent Local LLM Orchestration

A production-grade platform for running, routing, benchmarking, and finetuning local LLMs.
Drop-in OpenAI-compatible API · Intelligent multi-model routing · LoRA finetuning with live monitoring

Overview

LocalForge is a self-hosted AI control plane that transforms your GPU workstation into an intelligent LLM serving infrastructure. Instead of manually managing model files, writing inference scripts, and guessing which model fits which task — LocalForge automates the entire lifecycle:

Browse & Download GGUF models from HuggingFace with automatic VRAM compatibility filtering
Serve models via a fully OpenAI-compatible /v1/chat/completions endpoint
Route queries to the optimal model using ML-powered task classification + multi-signal scoring
Learn from usage patterns via a vector-based memory layer that improves routing over time
Benchmark models against standard evaluations (MMLU-Pro, HumanEval, GSM8K, GPQA, MT-Bench)
Finetune models with LoRA/QLoRA via a managed subprocess pipeline with live loss streaming
Augment responses with a…

Built by Ali Nasir — alinasir.me · LinkedIn

Would love feedback on the routing architecture in particular!

I Built an AI System That Makes 4 Agents Debate Scientific Papers, And Then Tells You Where They Disagree

Muhammad Ali Nasir — Tue, 10 Mar 2026 16:38:34 +0000

How GraphRAG + a multi-LLM council produces more trustworthy answers than any single AI model

There is a quiet crisis in AI-assisted research that nobody talks about.

Every tool you've used — ChatGPT, Perplexity, Copilot, Elicit — does the same thing: it reads papers and gives you one confident answer.

The problem is that science doesn't work that way.

Take BRCA1's role in triple-negative breast cancer. Ask any AI tool and you'll get a confident, well-written paragraph. What you won't get is this:

Study A says BRCA1 mutations are associated with increased aggressiveness
Study B says the same patients show better response rates to chemotherapy
Study C shows shorter progression-free survival despite better initial response
Three studies report conflicting BRCA1 expression levels

These aren't edge cases. These are the real contradictions buried across 200 papers that a researcher needs to know before designing an experiment, filing an IND, or trusting a conclusion.

A single AI model smooths these over. It synthesizes them into a confident answer. And in doing so, it hides exactly the information a scientist needs most.

This is why I built Research Council.

The Core Idea: Deliberation Over Confidence

The insight that drove this project came from an unlikely place: Andrej Karpathy's llm-council repo, a simple Saturday hack that, instead of asking one LLM a question, routes it to multiple LLMs and has them review each other's answers.

The key insight: cross-model review catches things a single model misses.

I wanted to take this further. What if, instead of generic LLMs reviewing each other, you had specialized agents, each prompted to look at the evidence from a fundamentally different angle, deliberating over a structured knowledge graph of papers?

That's Research Council.

What It Actually Does

When you ask Research Council a research question, here's what happens:

Step 1: The Knowledge Graph

Before you ask anything, papers are ingested from PubMed, arXiv, Semantic Scholar, or uploaded as PDFs. Research Council doesn't just chunk them into vectors. It builds a Neo4j knowledge graph:

Nodes: Paper, Gene, Drug, Disease, Protein, Pathway, Author, Conclusion
Relationships: CONTRADICTS, SUPPORTS, CITES, MENTIONS, STUDIES, TARGETS

This is the critical difference from standard RAG. Traditional RAG asks: "which chunks are semantically similar to my query?" GraphRAG asks: "what is the structural relationship between the entities in my query?"

When you ask about BRCA1 and TNBC, the graph doesn't just return the most similar text chunks — it returns the neighborhood of BRCA1: every paper that mentions it, every drug that targets related proteins, and critically, every paper that contradicts another paper on the topic.
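
To make the difference concrete, here is a toy in-memory version of a neighborhood lookup. It is only a sketch: the real system queries Neo4j, the edge list below uses placeholder identifiers rather than real studies, and the neighborhood function is a hypothetical name.

```python
# (source, relationship, target) triples with placeholder identifiers.
EDGES = [
    ("paper:A", "MENTIONS", "gene:BRCA1"),
    ("paper:B", "MENTIONS", "gene:BRCA1"),
    ("paper:C", "MENTIONS", "gene:TP53"),
    ("paper:A", "CONTRADICTS", "paper:B"),
    ("paper:C", "SUPPORTS", "paper:A"),
]


def neighborhood(entity: str, edges: list[tuple[str, str, str]]):
    """Return papers mentioning the entity and contradictions that involve them."""
    papers = {src for src, rel, dst in edges if rel == "MENTIONS" and dst == entity}
    contradictions = [
        (src, dst)
        for src, rel, dst in edges
        if rel == "CONTRADICTS" and (src in papers or dst in papers)
    ]
    return papers, contradictions
```

A similarity search would return paper A and paper B as two separate hits. The graph lookup also returns the CONTRADICTS edge between them, which is the part the council needs.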

Step 2: Token-Efficient Context Assembly

Here's a technical detail that matters a lot at scale.

Naive multi-agent systems load every available tool into the prompt upfront. With 50+ tools, that's 25,000+ tokens before the agent does anything useful.

Research Council uses langgraph-bigtool: tools are embedded with SentenceTransformers at startup and retrieved semantically at query time. Only 2-4 relevant tools are loaded per query.
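
The retrieval idea can be sketched without any dependencies. Note the substitution: the code below uses bag-of-words cosine similarity as a stand-in for SentenceTransformers embeddings, and the tool names and descriptions are invented for illustration. It shows the shape of the technique, not the langgraph-bigtool API.

```python
import math
from collections import Counter

# Hypothetical tool registry: name -> natural-language description.
TOOLS = {
    "search_pubmed": "search pubmed for biomedical papers by keyword",
    "graph_neighbors": "fetch knowledge graph neighbors of a gene drug or disease",
    "find_contradictions": "list papers that contradict each other on a topic",
    "export_pdf": "export a report as a formatted pdf",
}


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tool descriptions are embedded once at startup.
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOLS.items()}


def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, TOOL_VECTORS[name]), reverse=True)
    return ranked[:k]
```

Only the descriptions of the returned tools would be placed in the prompt, which is where the token savings come from.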

The result: a full 4-agent deliberation on a complex biomedical question uses 3,118 tokens total. About $0.002.

Step 3: The Council Deliberates

Four specialized agents receive the same knowledge graph subgraph and analyze it in parallel via asyncio.gather():

🔬 Evidence Agent

"What does the data actually show? Be precise about sample sizes, study types, and effect sizes. Never speculate beyond what the data shows."

⚔️ Skeptic Agent

"Find the weaknesses: biased study designs, underpowered samples, conflicting results, publication bias. Be constructively critical."

🔗 Connector Agent

"Find non-obvious links — drug repurposing opportunities, analogous mechanisms from other diseases, techniques from adjacent fields."

📋 Methodology Agent

"Evaluate whether experimental designs are appropriate, controls are adequate, statistical methods are sound, and whether conclusions are justified by the methods used."

Each agent produces an independent response. Then the real work begins.
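
A minimal sketch of the Step 3 fan-out, assuming each agent is an async function. The run_agent body is a placeholder for a real LLM call, and the role names and return shape are illustrative.

```python
import asyncio

AGENT_ROLES = ["evidence", "skeptic", "connector", "methodology"]


async def run_agent(role: str, question: str, context: str) -> dict:
    """Placeholder for one LLM call with a role-specific system prompt."""
    await asyncio.sleep(0)  # stands in for network latency
    return {"role": role, "answer": f"[{role}] analysis of: {question}"}


async def run_council(question: str, context: str) -> list[dict]:
    """Run all four agents concurrently on the same context."""
    tasks = [run_agent(role, question, context) for role in AGENT_ROLES]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    for response in asyncio.run(run_council("example question", "subgraph text")):
        print(response["role"], "->", response["answer"])
```

Because the four calls overlap, total latency is roughly that of the slowest agent rather than the sum of all four.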

Step 4: 12 Cross-Reviews

Every agent reviews every other agent's response — anonymized, to prevent model bias. That's 4 × 3 = 12 peer evaluations.

Each review produces:

An agreement score (0.0 to 1.0)
Specific points of disagreement
Constructive critique

The aggregate agreement score becomes a signal for confidence. High agreement → higher confidence. Persistent disagreement → lower confidence, and the Chairman must explain why the agents disagreed.
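
The review fan-out and the aggregate can be sketched as follows. The 12 pairs fall straight out of ordered pairs of distinct agents. The letter-based anonymization and the use of a plain mean for the aggregate are assumptions on my part.

```python
from itertools import permutations
from statistics import mean

AGENTS = ["evidence", "skeptic", "connector", "methodology"]


def review_pairs(agents: list[str]) -> list[tuple[str, str]]:
    """Every (reviewer, author) pair where the reviewer is not the author."""
    return list(permutations(agents, 2))


def anonymize(author: str, agents: list[str]) -> str:
    """Hide the author behind a neutral label such as 'Response C'."""
    return f"Response {chr(ord('A') + agents.index(author))}"


def aggregate_agreement(scores: list[float]) -> float:
    """Assumption: the aggregate is a plain mean of the per-review scores."""
    if not scores:
        raise ValueError("at least one review score is required")
    return mean(scores)
```

A mean is the simplest choice. A minimum or a lower percentile would make the aggregate more sensitive to a single strong disagreement.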

Step 5: The Chairman Synthesizes

A Chairman agent (running on OpenRouter with the best available model) receives all four original responses plus all twelve cross-reviews. It produces:

{
  "summary": "...",
  "confidence_score": 0.65,
  "key_findings": ["...", "..."],
  "contradictions": ["...", "..."],
  "citations": [{"claim": "...", "paper_id": "PMID:..."}],
  "methodology_notes": "...",
  "agent_agreement": 0.80
}

Notice the confidence score is 0.65, not 0.95. That's intentional. The system doesn't inflate confidence. If the evidence is contested, the score reflects that.
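
Since the Chairman returns structured JSON, a caller would likely want to validate it before trusting the numbers. A small sketch under two assumptions: all seven keys shown above are required, and both scores must lie in [0, 1]. The function name is hypothetical.

```python
import json

REQUIRED_KEYS = {
    "summary",
    "confidence_score",
    "key_findings",
    "contradictions",
    "citations",
    "methodology_notes",
    "agent_agreement",
}


def parse_chairman_output(raw: str) -> dict:
    """Parse the Chairman's JSON and reject incomplete or out-of-range output."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in ("confidence_score", "agent_agreement"):
        if not 0.0 <= float(data[key]) <= 1.0:
            raise ValueError(f"{key} must be within [0, 1]")
    return data
```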

Step 6: The Write-Back Loop

Every conclusion the Chairman produces gets written back to Neo4j as a new Conclusion node — linked to every paper it references. The graph compounds over time. Each query makes future queries more informed.
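
The write-back could look something like the sketch below, which only builds a parameterized Cypher statement and does not talk to a database. The property names (summary, confidence, created_at, id) and the choice of CITES as the relationship from a Conclusion to a Paper are assumptions. The source lists the relationship types but does not say which one the write-back uses.

```python
def conclusion_writeback(summary: str, confidence: float, paper_ids: list[str]):
    """Build a Cypher query and parameters for one Conclusion node."""
    query = (
        "CREATE (c:Conclusion {summary: $summary, confidence: $confidence, "
        "created_at: datetime()}) "
        "WITH c "
        "UNWIND $paper_ids AS pid "
        "MATCH (p:Paper {id: pid}) "
        "MERGE (c)-[:CITES]->(p)"
    )
    params = {
        "summary": summary,
        "confidence": confidence,
        # Drop duplicate IDs while keeping first-seen order.
        "paper_ids": list(dict.fromkeys(paper_ids)),
    }
    return query, params
```

With the official neo4j Python driver, the pair would be passed to session.run(query, params). Using parameters instead of string formatting keeps paper IDs and summary text from being interpreted as Cypher.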

The Result

Here's the actual output on "Are there contradictions in BRCA1's role in triple-negative breast cancer?":

Confidence: 65%

Not 95%. Not "based on multiple sources." A calibrated 65% because the agents genuinely disagreed on two points and the methodology agent flagged three studies as underpowered.

4 Contradictions surfaced:

BRCA1 mutations associated with both increased tumor aggressiveness AND better prognosis
Higher treatment response rates but shorter progression-free survival
Conflicting reports on BRCA1 expression levels across studies
Variable associations between BRCA1 mutations and TNBC significance

6 Key findings, each cited to a specific PubMed ID.

8 Methodology concerns — variable TNBC definitions, selection bias, small sample sizes, retrospective designs.

Agent agreement: 80% — two agents disagreed on whether the survival paradox was explained by tumor heterogeneity or methodological inconsistency.

Compare this to what ChatGPT gives you: a confident, well-written paragraph that mentions none of the contradictions.

The Architecture

Researcher Query
      │
      ▼
GraphRAG Layer
  Neo4j: entities + relationships
  ChromaDB: vector embeddings (CPU-only, MiniLM)
  Hybrid retrieval: ~2,000 token context
      │
      ▼
LangGraph Orchestrator
  BigTool: 2-4 tools loaded dynamically
  Hybrid retrieval node
  Context assembly node
      │
      ▼
LLM Council (Groq + OpenRouter)
  Stage 1: 4 parallel agents
  Stage 2: 12 cross-reviews
  Stage 3: Chairman synthesis
      │
      ▼
Answer + Neo4j Writeback
  Confidence · Citations · Contradictions · Provenance

Full stack:

LangGraph + langgraph-bigtool (orchestration)
Neo4j 5 Community (knowledge graph, local)
ChromaDB (vector store, local, CPU)
all-MiniLM-L6-v2 (embeddings, 80MB, CPU-only)
Groq llama-3.3-70b + llama-3.1-8b (council agents, fast)
OpenRouter claude-sonnet (Chairman, best synthesis quality)
FastAPI + React + Vite

Hardware requirements: 16GB RAM, 4GB VRAM. No beefy GPU needed. Embeddings run entirely on CPU.

What I Learned Building This

1. The write-back loop is the most underappreciated part

Most RAG systems are stateless. Query in, answer out. Research Council writes every conclusion back to the graph as a new node linked to source papers. After 50 queries, the graph has 50 validated conclusions that inform future answers. This is the difference between a tool and a system that compounds knowledge.

2. Confidence calibration is harder than it sounds

Getting agents to express genuine uncertainty rather than inflated confidence required careful prompt engineering. The current approach — deriving confidence from agent agreement scores — works but isn't theoretically principled. There's real research to be done here.

3. 12 cross-reviews might be overkill

The number of cross-reviews, n × (n-1), scales quadratically. With 4 agents that's 12, which is manageable. With 8 agents it's 56, which is too slow. A smarter aggregation strategy (maybe pairwise disagreement sampling) would make larger councils viable.
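
To put numbers on this, the sketch below compares the full review count with a sampled alternative. The sampling here is plain uniform random selection of reviewers, a simpler stand-in for the disagreement-aware sampling idea, and the function names are hypothetical.

```python
import random


def full_review_count(n: int) -> int:
    """Reviews needed when every agent reviews every other agent."""
    return n * (n - 1)


def sampled_reviews(agents: list[str], per_author: int, seed=None) -> list[tuple[str, str]]:
    """Give each author per_author randomly chosen reviewers instead of n - 1."""
    rng = random.Random(seed)
    pairs = []
    for author in agents:
        others = [a for a in agents if a != author]
        for reviewer in rng.sample(others, min(per_author, len(others))):
            pairs.append((reviewer, author))
    return pairs
```

With 8 agents and 2 reviewers per response, the cost drops from 56 reviews to 16 and grows linearly with council size, at the price of a noisier agreement estimate.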

4. The skeptic agent is the most valuable

After dozens of test queries, the Skeptic Agent consistently surfaces the most useful information — not because skepticism is inherently valuable, but because existing AI tools have a strong bias toward presenting positive findings. The explicit adversarial role corrects for this.

What's Next

Research Council V1 is live and open source (MIT licensed).

V2 plans:

Community detection on the knowledge graph (Louvain clustering)
Temporal analysis — track how understanding of a topic evolves year by year
MCP server integrations — Zotero, PubMed MCP, Neo4j MCP
HuggingFace Space for live demos
Export council output as formatted PDF for lab notes

What I'd love help with:

Better confidence calibration methodology
Async optimization of the cross-review loop
Additional paper sources (bioRxiv, ChemRxiv, Europe PMC)
Domain-specific agent specializations beyond biomedicine

Try It

GitHub: https://github.com/al1-nasir/Research_council

MIT licensed. Runs on a laptop. Groq has a free tier. OpenRouter costs fractions of a cent per query.

If you work in research, drug discovery, or AI for science — I'd genuinely love to know what question you'd throw at it first.

Built by Muhammad Ali Nasir. Inspired by karpathy/llm-council, extended with domain-specific GraphRAG and biomedical agent specialization.

Tags: Artificial Intelligence · Machine Learning · Biomedical Research · GraphRAG · Open Source · LLM · Python · Drug Discovery · Research Tools