Forem: Zohar Babin

From Node.js to Go: Rebuilding an MCP Server for Production

Zohar Babin — Tue, 19 May 2026 20:48:20 +0000

This is the story of why I rebuilt google-researcher-mcp (Node.js/TypeScript) from scratch as web-researcher-mcp (Go), and the lessons learned along the way.

The Starting Point

The original project — google-researcher-mcp — was a TypeScript/Node.js MCP server distributed via npm. It had real traction: 36 GitHub stars, 6,500+ npm downloads, 860+ tests, and active users. But five critical issues kept surfacing that couldn't be solved within the existing architecture.

Why Rewrite in Go (Not Refactor)

Orphan Processes (Issue #108)

npx spawns deeply nested process trees. When the parent MCP client (Claude Desktop, Cursor) crashes or closes unexpectedly, the Node.js process doesn't receive a signal — it keeps running, consuming memory and holding file locks.

My collaborators and I spent three versions (v6.2.0 through v6.4.0) building increasingly complex orphan detection: a Worker thread watchdog with CPU spin detection, three-layer parent-alive checks, and graceful degradation. It was all band-aids on a fundamental runtime limitation.

Go fix: A single static binary. No runtime process tree. EOF on stdin = immediate exit. The entire problem category disappeared.

Google Discontinuing "Entire Web" Search (Issue #107)

Google announced it would be discontinuing support for Programmable Search Engines configured to search the "entire web." The project was named google-researcher-mcp, and its dependency on a single search provider was a foundational risk.

Go fix: Interface-driven search.Provider with multiple implementations, plus a Router that provides multi-provider routing with automatic failover via per-provider circuit breakers.

Alternative Search Engines (Issue #55)

Users wanted Brave, Bing (go figure), and other providers. But the TypeScript codebase was too tightly coupled to Google's API response format — the shared directory (41 files) made every change risky and far-reaching.

Go fix: A clean Provider interface — each adapter normalizes provider-specific responses to common types (SearchResult, ImageResult, NewsResult). Adding a new provider is one file implementing one interface.

Redis Caching (Issue #72)

The in-memory cache was lost on every process restart — which happened frequently with npx-launched servers. The complex persistence manager offered four strategies (Periodic, WriteThrough, OnShutdown, Hybrid), but none reliably survived the volatile process lifecycle.

Go fix: A cache.Cache interface with a hybrid implementation: memory LRU + AES-encrypted disk + optional Redis. Simple, testable, and it never loses data because the disk layer persists across restarts.

Monolithic Architecture (Issue #40)

The project had 100+ source files but a tightly coupled shared/ directory with 41 files. Adding a single tool required touching 4+ documentation sections, and the import graph made refactoring perilous.

Go fix: One package per concern. Tool handlers are self-contained files. Adding a tool means writing one file and one line in the registry.

What Changed Architecturally

Aspect	Node.js (old)	Go (new)
Distribution	npm/npx (runtime required)	Single static binary
Memory	430MB idle (80MB after optimization)	~25MB baseline
Startup	2-4 seconds (lazy imports)	<100ms
Process lifecycle	Worker thread watchdog	EOF detection, no orphans
Search providers	Google only	Multiple providers + fallback routing
Concurrency	Event loop + async/await	Goroutines + semaphores
Type safety	TypeScript + Zod	Go type system + struct tags
Testing	860+ Jest tests	Table-driven tests + race detector
Scraping	Playwright (heavy)	4-tier pipeline (lightweight first)

Key Lessons Learned

1. Don't Fight Your Runtime

Node.js process management is fundamentally fragile for long-lived servers launched via npx. The runtime doesn't support robust parent-death detection, and the nested process tree (npx → node → worker) makes signal propagation unreliable. We spent three versions building increasingly complex orphan detection. Go's single binary eliminated the entire category of problems.

Takeaway: If you're spending significant engineering effort working around your runtime's limitations, that's a signal to evaluate whether the runtime fits the problem.

Side note: while looking for a better runtime I evaluated both Go and Rust (isn't Rust awesome!?). Go won primarily because its lightweight goroutines excel at I/O-bound operations and because the mcp-go SDK is superbly maintained.

2. Interface-Driven Design Enables Fearless Extension

Adding Brave Search in the Go version was one file implementing one interface — about 200 lines including tests. In the Node.js version, the equivalent change would have touched 6+ files due to tightly coupled imports in the shared directory.

Takeaway: When you know extension is likely (new providers, new tools), invest in clean interfaces upfront. The interface is the specification; implementations are interchangeable.

3. Memory Matters for MCP Servers

MCP servers run alongside AI assistants on developer machines. They're always-on background processes. A 430MB idle memory footprint was unacceptable — users would notice and uninstall. Go's ~25MB baseline lets the server stay resident without impact.

Takeaway: For developer tools that run continuously, memory efficiency is a feature, not an optimization. Choose your runtime accordingly.

4. Caching Architecture Should Be Boring

The old project had four persistence strategies with complex heuristics for when to flush. The new one has: memory LRU + optional encrypted disk + optional Redis. Each layer is simple and independently testable. No heuristics, no race conditions, no data loss.

Takeaway: Boring infrastructure is reliable infrastructure. If your caching layer needs its own debugging session, it's too complex.

5. Documentation Should Be Drift-Resistant

The old project required updating four separate documentation files per new tool. Inevitably, docs drifted from reality. The new project's test suite programmatically validates documentation claims — tool descriptions must mention alternatives, output schemas must match actual responses, and annotations must be consistent.

Takeaway: If documentation can be wrong without a test failing, it will eventually be wrong.

What We Kept

The rewrite preserved the user-facing contract:

Same tools with identical semantics and parameter names
Same MCP protocol compatibility (Claude Desktop, Cursor, VS Code, any MCP client)
Same environment variables (drop-in replacement for existing configs)
Same search lenses (curated domain lists, identical JSON format)

What improved (without breaking backwards compatibility):

OAuth 2.1 authentication for multi-client deployments
Multi-tenancy with per-tenant session isolation
Per-provider circuit breakers with automatic fallback
Prometheus metrics for observability
Structured audit logging for compliance

Results

Since launching the Go version:

Zero orphan process reports (vs. recurring issue in Node.js version)
Multiple search providers with automatic failover (vs. single provider)
4-tier scraping pipeline that tries lightweight methods first (vs. Playwright-only)
Sub-100ms cold startup (vs. 2-4 seconds)
Production-ready: rate limiting, circuit breakers, session isolation, audit trail

Should You Rewrite?

Probably not. Most rewrites fail because they're motivated by developer preference ("I want to use a new language") rather than architectural necessity. Ours succeeded because:

The problems were architectural, not implementational — no amount of refactoring within Node.js would fix process orphaning
The user-facing contract was well-defined — MCP provides a clean protocol boundary
The scope was bounded — we knew exactly what the server needed to do
We had comprehensive tests on the old version to validate behavioral equivalence

If your problems are solvable within your current architecture, refactor. If they're fundamentally incompatible with your runtime or architecture, consider a rewrite — but only with clear success criteria and a well-defined boundary.

This article covers the migration from google-researcher-mcp to web-researcher-mcp. The new project is open source under MIT and works with any MCP-compatible AI assistant.

Building a 13-Agent AI System for M&A Due Diligence — Architecture Deep Dive

Zohar Babin — Sun, 17 May 2026 19:05:46 +0000

The Problem Nobody Was Solving

As a corp dev lead, I spent weeks doing the same thing after every deal: assembling the cross-domain picture from siloed advisor reports.

Legal would flag a termination clause. Finance would flag revenue concentration. Same entity. Nobody connected the dots.

This happens because due diligence is split into parallel workstreams — legal, financial, commercial, tax, regulatory — each run by separate teams with separate deliverables. The cross-referencing happens in someone's head, over coffee, two days before the IC memo is due.

The numbers back this up:

31% of M&A failures trace back to DD shortcomings (HBR, McKinsey, KPMG research)
DD timelines keep compressing — six weeks becomes three, same scope
Corp dev teams screen 200-1,000+ companies/year but close 1-3%

I built Due Diligence Agents to fix this.

What It Does

13 AI agents analyze every document in an M&A data room across 9 specialist domains — Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, and ESG — then cross-reference findings automatically and trace each one to the exact page, section, and quote.

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json

Output: an interactive HTML report, a 14-sheet Excel workbook, and per-subject JSON findings. See a sample report from synthetic data.

The Architecture

The system has four layers:

Layer 1: 38-Step Async Pipeline

The orchestrator (engine.py) is a state machine with 38 async steps grouped into phases:

Setup (steps 1-5): Load config, validate data room, resolve entities
Discovery (steps 6-13): Extract documents, build inventory, classify files, compute precedence
Analysis (steps 14-17): Build specialist prompts, route documents, spawn agents in parallel, check coverage
Cross-Domain (steps 18-20): Symbolic trigger evaluation, targeted respawn, merge
Quality (steps 21-26): Judge review, merge findings, validate, deduplicate
Reporting (steps 27-38): Generate HTML, Excel, JSON, knowledge base

Every step supports checkpoint/resume. If the pipeline crashes at step 23, it restarts from step 23 — not from scratch. Steps are typed, and the state object serializes cleanly to JSON.
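
A minimal sketch of the checkpoint/resume idea, assuming the state is a plain dict persisted as JSON after every completed step. The three step functions, `run_pipeline`, the `next_step` key, and the `fail_at` test hook are hypothetical stand-ins, not the actual engine.py API.

```python
import asyncio
import json
from pathlib import Path
from typing import Optional


async def load_config(state: dict) -> None:
    state["config_loaded"] = True


async def build_inventory(state: dict) -> None:
    state["inventory"] = ["msa.pdf", "nda.pdf"]


async def analyze(state: dict) -> None:
    state["findings"] = len(state["inventory"])


# Hypothetical three-step pipeline standing in for the real 38 steps.
STEPS = [load_config, build_inventory, analyze]


async def run_pipeline(checkpoint: Path, fail_at: Optional[int] = None) -> dict:
    """Run STEPS in order, saving state to disk after each completed step."""
    if checkpoint.exists():
        # Resume: the saved state records which step to run next.
        state = json.loads(checkpoint.read_text())
    else:
        state = {"next_step": 0}
    for index in range(state["next_step"], len(STEPS)):
        if index == fail_at:
            # Test hook only: simulate a crash before this step runs.
            raise RuntimeError(f"simulated crash at step {index}")
        await STEPS[index](state)
        state["next_step"] = index + 1
        checkpoint.write_text(json.dumps(state))
    return state
```

Because the checkpoint is written only after a step finishes, a crash inside step N leaves `next_step` at N, so the next invocation re-runs that step rather than starting from zero. This only works if every step's output is JSON-serializable.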

Layer 2: 13 Agents

9 specialists + 4 meta-agents, each spawned via Anthropic's claude-agent-sdk:

Specialists: Legal, Finance, Commercial, ProductTech, Cybersecurity, HR, Tax, Regulatory, ESG. Each gets domain-specific prompts, the relevant documents, and a set of tools (file read, search, finding write).

Meta-agents:

Judge: Reviews specialist findings for quality, consistency, and missed coverage
Executive Synthesis: Produces the deal-level summary with go/no-go signals
Red Flag Scanner: Pattern-matches across all findings for deal-killers
Acquirer Intelligence: Tailors findings to the buyer's strategic context

Specialists run in parallel (batched by resource constraints). Meta-agents run sequentially after all specialists complete.
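
The scheduling described above can be sketched with asyncio, assuming "batched by resource constraints" means a cap on how many specialists run at once. `run_agent`, `run_all`, and `max_parallel` are hypothetical names, the agent body is a stub rather than a claude-agent-sdk call, and the meta-agent order is simply the order listed above.

```python
import asyncio

SPECIALISTS = ["Legal", "Finance", "Commercial", "ProductTech", "Cybersecurity",
               "HR", "Tax", "Regulatory", "ESG"]
META_AGENTS = ["Judge", "Executive Synthesis", "Red Flag Scanner",
               "Acquirer Intelligence"]


async def run_agent(name: str, log: list) -> str:
    """Stub for a real agent call; it only records when it starts and ends."""
    log.append(("start", name))
    await asyncio.sleep(0)  # yield control, as a real network call would
    log.append(("done", name))
    return name


async def run_all(max_parallel: int = 3) -> list:
    log = []
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(name: str) -> str:
        # At most max_parallel specialists hold the semaphore at a time.
        async with limit:
            return await run_agent(name, log)

    # Phase 1: specialists run concurrently, capped by the semaphore.
    await asyncio.gather(*(bounded(name) for name in SPECIALISTS))
    # Phase 2: meta-agents run one after another, only after phase 1 ends.
    for name in META_AGENTS:
        await run_agent(name, log)
    return log
```

`asyncio.gather` returns only when every specialist has finished, which is what guarantees the meta-agents see the complete set of findings.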

Layer 3: Neurosymbolic Cross-Domain Analysis

This is the part that solved my original problem.

After specialists produce their findings (pass 1), a deterministic rule engine scans them for cross-domain dependencies. No LLM calls — just Python pattern matching.

# Example: Finance finds revenue recognition issue
# → Rule fires → Legal agent re-examines specific contracts
# for enforceability, clawback clauses, delivery milestones

Seven built-in trigger rules cover the most common M&A cross-domain dependencies:

Source → Target	When It Fires
Finance → Legal	Revenue recognition finding needs contract enforceability check
Legal → Finance	Change-of-control clause needs financial exposure quantification
Legal → Finance	Termination-for-convenience needs revenue-at-risk calculation
Legal → ProductTech	IP ownership dispute needs technical dependency assessment
ProductTech → Legal	Data privacy finding needs DPA/GDPR compliance review
Commercial → Finance	SLA risk with >10% service credits needs financial quantification
Finance → Commercial	Pricing discrepancy needs commercial rate card validation

When a rule fires, it creates a CrossDomainTrigger with the specific contracts to re-examine and instructions for the target agent. The target agent runs a targeted pass-2 review — only on the cited contracts, not the full data room. This keeps costs bounded.

Budget-capped, priority-ordered. If no triggers fire, zero additional cost.
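
A minimal sketch of such a rule engine, covering three of the seven rules from the table. Only the `CrossDomainTrigger` name comes from the text above; its fields, the `Finding` class, the category strings, the numeric severity encoding, and the use of a simple trigger count as the budget cap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    category: str   # hypothetical category label
    contracts: list
    severity: int   # hypothetical encoding: 0 = P0 (most severe) ... 3 = P3


@dataclass
class CrossDomainTrigger:
    source: str
    target: str
    contracts: list
    instructions: str
    priority: int


# (source agent, finding category, target agent, instructions for pass 2)
RULES = [
    ("Finance", "revenue_recognition", "Legal",
     "Check enforceability, clawback clauses, delivery milestones"),
    ("Legal", "change_of_control", "Finance",
     "Quantify financial exposure"),
    ("Legal", "termination_for_convenience", "Finance",
     "Calculate revenue at risk"),
]


def evaluate_triggers(findings: list, max_triggers: int) -> list:
    """Match findings against RULES deterministically; no LLM involved."""
    triggers = []
    for finding in findings:
        for source, category, target, instructions in RULES:
            if finding.agent == source and finding.category == category:
                triggers.append(CrossDomainTrigger(
                    source, target, finding.contracts, instructions,
                    finding.severity))
    # Most severe first, then cut off at the budget cap.
    triggers.sort(key=lambda trigger: trigger.priority)
    return triggers[:max_triggers]
```

Each trigger carries only the contracts cited by the originating finding, so the pass-2 agent's input stays small. An empty findings list yields an empty trigger list, which matches the "zero additional cost" case.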

The design is inspired by the FAOS Platform: asymmetric coupling in which symbolic rules constrain the LLM's scope while the LLM provides judgment. The symbolic layer decides when intelligence is needed; the LLM decides what to do about it.

Layer 4: 5 Blocking Quality Gates

Every finding goes through validation before it reaches the report:

Coverage gate: Did the agent analyze every assigned document?
Schema validation: Does every finding have the required fields (severity, citations, category)?
Citation verification: Can we trace the finding back to a specific page and quote?
Semantic dedup: Are two agents saying the same thing about the same document? (rapidfuzz token_sort_ratio ≥ 80)
Numerical audit: Do financial figures in findings match what's in the source documents?

Fail-closed. If validation fails, the pipeline stops — it doesn't silently produce bad output.
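
A minimal sketch of the fail-closed behavior, implementing three of the five gates. The `run` dict layout, `GateFailure`, and the gate function names are hypothetical; the required fields (severity, citations, category) are the ones named in the list above.

```python
class GateFailure(Exception):
    """Raised by a gate to stop the pipeline."""


REQUIRED_FIELDS = ("severity", "citations", "category")


def coverage_gate(run: dict) -> None:
    missing = set(run["assigned"]) - set(run["analyzed"])
    if missing:
        raise GateFailure(f"unanalyzed documents: {sorted(missing)}")


def schema_gate(run: dict) -> None:
    for index, finding in enumerate(run["findings"]):
        absent = [key for key in REQUIRED_FIELDS if key not in finding]
        if absent:
            raise GateFailure(f"finding {index} is missing {absent}")


def citation_gate(run: dict) -> None:
    for index, finding in enumerate(run["findings"]):
        for citation in finding["citations"]:
            if "page" not in citation or not citation.get("quote"):
                raise GateFailure(f"finding {index} has an untraceable citation")


# Order matters: citation_gate assumes schema_gate already passed.
GATES = [coverage_gate, schema_gate, citation_gate]


def run_gates(run: dict) -> list:
    """Return the findings only if every gate passes; otherwise raise."""
    for gate in GATES:
        gate(run)  # the first failure propagates and nothing is returned
    return run["findings"]
```

The point of raising instead of logging is that the report generator never receives findings that skipped a gate.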

The Chat Mode (My Favorite Feature)

After the pipeline runs, you can interrogate the results:

dd-agents chat --report _dd/forensic-dd/runs/latest

The chat agent has 14 MCP tools: citation verification against source PDFs, cross-contract search, entity resolution, and sandboxed document generation. Ask "build me a board summary of all P0 findings with revenue impact" and it writes a Python script, executes it in a sandbox, and hands you the .xlsx file.

15 Things I Learned Building This

These lessons apply to any system doing cross-document analysis at scale — not just M&A.

1. Extraction is harder than analysis. By a lot.

Everyone focuses on the LLM prompts. But 80% of the real engineering is getting clean text out of messy documents. Our extraction pipeline has 4 tiers: pymupdf → pdftotext → OCR (Tesseract → GLM-OCR) → Claude vision as last resort. Each tier has 6 quality gates (min chars, printable ratio, density, readability, watermark detection, corruption check). Confidence scales with method quality — pymupdf gets 0.9 base, OCR gets 0.65.
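
The tiered fallback can be sketched as follows, assuming each tier is a callable that returns text. Only two of the six quality gates (minimum characters, printable ratio) are implemented, the thresholds are placeholders, and the extractor callables are left to the caller instead of wrapping real PDF or OCR libraries.

```python
def printable_ratio(text: str) -> float:
    """Share of characters that are printable or whitespace."""
    if not text:
        return 0.0
    good = sum(ch.isprintable() or ch.isspace() for ch in text)
    return good / len(text)


def passes_gates(text: str, min_chars: int = 200, min_printable: float = 0.9) -> bool:
    # Hypothetical thresholds; the real pipeline applies six gates.
    return len(text) >= min_chars and printable_ratio(text) >= min_printable


def extract(path: str, tiers: list):
    """Try each (name, base_confidence, extractor) tier until one passes.

    Returns a dict describing the winning tier, or None if all tiers fail.
    """
    for name, confidence, extractor in tiers:
        try:
            text = extractor(path)
        except Exception:
            continue  # a crashing tier just falls through to the next one
        if passes_gates(text):
            return {"method": name, "confidence": confidence, "text": text}
    return None
```

A caller would pass tiers in cheapest-first order, for example `[("pymupdf", 0.9, read_with_pymupdf), ("ocr", 0.65, read_with_ocr)]`, where the two reader functions are whatever wrappers the project provides. Carrying the confidence alongside the text lets later stages weight findings by how the text was obtained.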

2. Entity resolution is your invisible foundation

"IBM", "International Business Machines", and "Red Hat" — are these the same entity? We use a 6-stage cascade: exact match → normalized (strip legal suffixes) → alias expansion → fuzzy match (rapidfuzz) → TF-IDF cosine similarity → learned matches from prior runs. Names ≤5 characters are blocked from fuzzy matching — without this, "Inc." matches random entities.

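A minimal sketch of the first four cascade stages. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz, skips the TF-IDF and learned-match stages, and applies the short-name block to the normalized name, which is an assumption. The registry layout, suffix list, and 0.85 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix list; a real one would be much longer.
LEGAL_SUFFIX = re.compile(r"\b(inc|corp|corporation|llc|ltd|gmbh)\.?$")


def normalize(name: str) -> str:
    """Lowercase, strip one trailing legal suffix, drop punctuation."""
    name = LEGAL_SUFFIX.sub("", name.strip().lower())
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return re.sub(r"\s+", " ", name).strip()


def resolve(name: str, registry: dict, threshold: float = 0.85):
    """registry maps canonical name -> list of aliases.

    Returns (canonical_name_or_None, stage_that_matched).
    """
    if name in registry:                                   # stage 1: exact
        return name, "exact"
    key = normalize(name)
    for canonical in registry:                             # stage 2: normalized
        if normalize(canonical) == key:
            return canonical, "normalized"
    for canonical, aliases in registry.items():            # stage 3: aliases
        if key in {normalize(alias) for alias in aliases}:
            return canonical, "alias"
    if len(key) > 5:                                       # stage 4: fuzzy
        best, best_score = None, 0.0
        for canonical in registry:
            score = SequenceMatcher(None, key, normalize(canonical)).ratio()
            if score > best_score:
                best, best_score = canonical, score
        if best_score >= threshold:
            return best, "fuzzy"
    return None, "unresolved"
```

Ordering the stages from cheapest and most certain to most permissive means a fuzzy match is only attempted when nothing stricter applies, and the length guard keeps fragments such as "Inc." from ever reaching it.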
3. Don't dump everything into one context. Map-merge-resolve.

A 200-page master agreement might have the deal-killer on page 147. You can't skip large files. But dumping them into one context drops accuracy from 95% to 74% (Addleshaw Goddard, 510 contracts). Instead: chunk at page boundaries (150K chars, 15% overlap), analyze each chunk independently, merge with priority logic (YES beats NO, specific beats generic), and only invoke LLM arbitration when chunks disagree. The 21-point accuracy gain is entirely engineering — no model change.
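
A minimal sketch of the chunk and merge steps, assuming the document is already split into per-page strings. `chunk_pages` and `merge_answers` are hypothetical names, the overlap is approximated by repeating whole trailing pages, "specific beats generic" is reduced to preferring an answer that carries a quote, and LLM arbitration is left out.

```python
def chunk_pages(pages: list, max_chars: int = 150_000, overlap: float = 0.15) -> list:
    """Group whole pages into chunks, repeating trailing pages as overlap.

    Returns a list of chunks; each chunk is a list of (page_number, text).
    A single page longer than max_chars still becomes its own chunk.
    """
    chunks, current, size = [], [], 0
    for number, text in enumerate(pages, start=1):
        if current and size + len(text) > max_chars:
            chunks.append(current)
            # Carry back trailing pages worth at most overlap * max_chars.
            carried, carried_size = [], 0
            for page in reversed(current):
                if carried_size + len(page[1]) > overlap * max_chars:
                    break
                carried.insert(0, page)
                carried_size += len(page[1])
            current, size = carried, carried_size
        current.append((number, text))
        size += len(text)
    if current:
        chunks.append(current)
    return chunks


PRIORITY = {"YES": 2, "NO": 1, "NOT_FOUND": 0}


def merge_answers(results: list) -> dict:
    """Pick one per-chunk answer: YES beats NO, quoted beats unquoted."""
    return max(results, key=lambda r: (PRIORITY[r["answer"]], bool(r.get("quote"))))
```

Splitting only at page boundaries keeps page numbers intact for citations, and the overlap gives a clause that straddles two chunks a chance to appear whole in at least one of them.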

4. Hallucination is an engineering problem, not a model problem

No single defense works. We use 5 layers: (1) Pydantic schema validation on every response, (2) mandatory citation with file_path/page/exact_quote verified against source, (3) explicit "NOT_FOUND" escape valve — without this, models fabricate clauses rather than admit ignorance, (4) adversarial Judge review with accusatory framing ("this finding appears fabricated — prove it with a direct quote"), (5) 6-layer deterministic numerical audit.

Defense 3, the explicit NOT_FOUND escape valve, changed everything. When you tell the model "if you can't find this clause, say NOT_FOUND," hallucination drops dramatically.

5. Know when to stop using LLMs

We had an LLM agent doing validation and report synthesis. We replaced it with deterministic Python. Quality went up, cost went down. The rule: use LLMs for analysis and synthesis; use Python for validation, dedup, and audit. If you can write the logic as deterministic code, do it.

6. Self-verification works — but only with accusatory framing

After agents produce findings, a follow-up pass challenges them on high-severity claims. Polite prompts ("please review your finding") have near-zero effect — models confirm their own output. Accusatory prompts ("this finding appears fabricated," "the cited clause doesn't exist") force re-examination and produce a 9.2% accuracy improvement.

7. Cross-agent dedup is different from what you think

When 4 agents analyze the same document, they find the same issue but describe it differently. Three rules: (1) never dedup within the same agent — two similar findings from Legal are intentionally distinct, (2) only dedup across agents on the same document — similar findings on different documents are different findings, (3) keep contributing agent metadata so you know which domains flagged it.
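
The three rules can be sketched as below. `token_sort_similarity` is a standard-library approximation of a token-sort ratio on a 0 to 100 scale (its scores will not match rapidfuzz exactly), and the finding dict layout and the threshold of 80 as a default are assumptions.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Compare two texts after sorting their lowercase tokens (0..100)."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, sorted_a, sorted_b).ratio()


def dedup(findings: list, threshold: float = 80) -> list:
    """Merge near-duplicate findings across agents on the same document."""
    kept = []
    for finding in findings:
        for existing in kept:
            same_document = existing["document"] == finding["document"]   # rule 2
            other_agent = finding["agent"] not in existing["agents"]      # rule 1
            if (same_document and other_agent and
                    token_sort_similarity(existing["text"], finding["text"]) >= threshold):
                # Rule 3: remember every domain that flagged the issue.
                existing["agents"].append(finding["agent"])
                break
        else:
            kept.append({"document": finding["document"],
                         "text": finding["text"],
                         "agents": [finding["agent"]]})
    return kept
```

Sorting tokens before comparing makes the score insensitive to word order, which is the usual way two agents end up describing the same clause differently.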

8. Context window engineering is a first-class discipline

It's not just about fitting data in — it's about where things go. Critical instructions go at the start (highest recall zone). Document content goes in the middle (lowest recall — ~40% worse). Constraints and format rules go at the end (second-highest recall). We budget 40% of the context window for tool calls and reasoning.

9. Quality gates must be blocking, not advisory

If validation just logs a warning, nobody reads it. If it halts the pipeline, quality is non-negotiable. Same for agent guardrails: hard turn limits (soft at 200, force-kill at 3x), path guards (agents can only write under _dd/), bash guards (24 blocked patterns — no rm -rf, no sudo, no pipe-to-shell). Better to produce nothing than unreliable output.

10. Every claim must be traceable to source

Citation verification uses 4 scopes: exact page match → adjacent pages ±1 → full document fuzzy match (80%+) → cross-file search. That last one matters — if the quote isn't in the cited file, we search all files for that entity. Auto-corrects file misattribution.
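
A minimal sketch of the four-scope search, assuming the corpus is a dict of file name to a list of page texts and page numbers are 1-based. The fuzzy step uses a sliding window with `difflib.SequenceMatcher` rather than the project's matcher, and the cross-file scope searches every other file instead of filtering by entity. All names here are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_contains(quote: str, text: str, threshold: float) -> bool:
    """True if some quote-sized window of text is similar enough to quote."""
    width = len(quote)
    for start in range(max(1, len(text) - width + 1)):
        window = text[start:start + width]
        if SequenceMatcher(None, quote, window).ratio() >= threshold:
            return True
    return False


def verify_citation(citation: dict, corpus: dict, fuzzy_threshold: float = 0.8):
    """Return where the quote was found and by which scope, or None."""
    quote, cited_file, page = citation["quote"], citation["file"], citation["page"]
    pages = corpus.get(cited_file, [])
    # Scope 1: exact match on the cited page.
    if 1 <= page <= len(pages) and quote in pages[page - 1]:
        return {"scope": "exact_page", "file": cited_file, "page": page}
    # Scope 2: exact match on the page before or after.
    for near in (page - 1, page + 1):
        if 1 <= near <= len(pages) and quote in pages[near - 1]:
            return {"scope": "adjacent_page", "file": cited_file, "page": near}
    # Scope 3: fuzzy match anywhere in the cited document.
    for number, text in enumerate(pages, start=1):
        if fuzzy_contains(quote, text, fuzzy_threshold):
            return {"scope": "document_fuzzy", "file": cited_file, "page": number}
    # Scope 4: exact match in any other file (corrects file misattribution).
    for other_file, other_pages in corpus.items():
        if other_file == cited_file:
            continue
        for number, text in enumerate(other_pages, start=1):
            if quote in text:
                return {"scope": "cross_file", "file": other_file, "page": number}
    return None
```

Returning the scope along with the corrected location lets a later step rewrite the citation and also lower confidence for matches found only by the looser scopes. The sliding window is quadratic, so a real implementation would want something faster on long pages.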

11. Most of what AI finds is noise

Run 9 agents across hundreds of documents and you'll get thousands of findings. We use a 3-stage classification: noise filter (15 patterns for extraction artifacts), data quality filter (14 patterns for "data unavailable" gaps), then material findings. Plus 5 severity recalibration rules — e.g., a change-of-control clause that only applies to competitors gets downgraded from P0 to P3 automatically.

12. Same clause, different deal, different severity

An anti-assignment clause is P0 in an asset purchase (blocks contract transfer) but P3 in a stock purchase (entity doesn't change). Deal-type context must flow through the entire pipeline: prompt-time rules, post-hoc deterministic adjustments, and executive judgment overrides — with full audit trail.

13. Every API call is a deal cost

Three model profiles: economy (Haiku for extraction), standard (Sonnet for analysis), premium (Opus for synthesis). Per-agent cost tracking. Hard budget limits that halt the pipeline. Right model for right task.

14. Pydantic v2 everywhere

137+ models with model_json_schema() for structured outputs. Strict mypy across 199 source files. The type system catches real bugs — a finding with evidence instead of citations gets blocked by the schema guard hook before it's written to disk.

15. Make every run smarter than the last

Inspired by Karpathy's "LLM Wiki" pattern: a persistent knowledge base compounds across runs. Finding lineage via SHA-256 fingerprinting tracks findings even when wording changes. A NetworkX knowledge graph with 11 typed edge types captures entity relationships, contradictions, and clause interactions. Run 2 knows what Run 1 found — and catches what changed.
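
One way a fingerprint can survive rewording is to hash only fields that stay stable between runs and leave the free-text description out. A minimal sketch under that assumption follows; the choice of `STABLE_FIELDS`, the finding dict layout, and `diff_runs` are hypothetical, and the project's actual fingerprint inputs may differ.

```python
import hashlib
import json

# Hypothetical choice of fields that should not change when wording does.
STABLE_FIELDS = ("entity", "document", "category", "clause")


def fingerprint(finding: dict) -> str:
    """SHA-256 over the normalized stable fields of a finding."""
    stable = {key: str(finding.get(key, "")).strip().lower()
              for key in STABLE_FIELDS}
    payload = json.dumps(stable, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_runs(previous: list, current: list) -> dict:
    """Classify findings as new, resolved, or carried over between runs."""
    old = {fingerprint(f): f for f in previous}
    new = {fingerprint(f): f for f in current}
    return {
        "new": [new[key] for key in new.keys() - old.keys()],
        "resolved": [old[key] for key in old.keys() - new.keys()],
        "carried": [new[key] for key in new.keys() & old.keys()],
    }
```

Serializing with `sort_keys=True` keeps the hash independent of dict ordering, and lowercasing the fields keeps it independent of capitalization differences between runs.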

Try It

pip install dd-agents
dd-agents auto-config "Buyer" "Target" --data-room ./your_data_room
dd-agents run deal-config.json --dry-run  # Preview without API calls

Sample report (synthetic data, no install needed)

GitHub — Apache 2.0, 3,714 tests, strict mypy.

Built on Anthropic's Claude Agent SDK. Looking for feedback — especially from anyone who's dealt with data room analysis and can tell me whether the report structure maps to how DD findings are actually consumed.