Forem: Daniel Nwaneri

MCP Just Landed on Your Phone: What Google AI Edge Gallery Actually Does

Daniel Nwaneri — Wed, 20 May 2026 17:09:24 +0000

This is a submission for the Google I/O Writing Challenge

I was already running MCP servers on my desktop — connected to Claude, wired into my daily workflow — when Google announced at I/O 2026 that AI Edge Gallery now supports MCP connections on Android. I pulled out my Pixel and started testing.

First attempt: "No eligible devices." The app requires capable hardware. Second device — it opened.

What I found is more interesting than the announcement.

What Google AI Edge Gallery Is

Google AI Edge Gallery is an open-source Android app from Google Research. Large language models run entirely on your device — no internet required for inference, no data leaves your phone. Every prompt, every image, every audio clip stays local.

That part isn't new. What changed at I/O 2026: the app now supports agents. Not a chat interface with a web search button — a proper agent runtime with toggleable skills, calendar integration, scheduled reminders, and experimental MCP connections. The same protocol the rest of the serious agent ecosystem runs on.

The Model List Surprised Me

Opening the Models panel, I expected a Gemma showcase. That's not what this is.

Model	Size	Notes
Gemma-4-E2B-it	2.6 GB	Recommended across most use cases, 32K context
Gemma-4-E4B-it	3.7 GB	Multi-modal, 32K context
Gemma-3n-E2B-it	3.7 GB	Text, vision, audio, 4096 context
Gemma-3n-E4B-it	4.9 GB	Text, vision, audio, 4096 context
Gemma3-1B-IT	584 MB	4-bit quantized
Qwen2.5-1.5B-Instruct	1.6 GB	Alibaba's model, LiteRT-LM ready
DeepSeek-R1-Distill-Qwen-1.5B	1.8 GB	Reasoning model, fully on-device
TinyGarden-270M	289 MB	Fine-tuned FunctionGemma, task automation
MobileActions-270M	289 MB	Fine-tuned FunctionGemma, device control

DeepSeek and Qwen sitting in a Google app isn't accidental. The actual product here is LiteRT-LM — Google's mobile inference runtime — not Gemma. Which model you run on top is your choice.

Gemma 4: The Number That Matters

Gemma 4 E2B runs in 2.6 GB and opens a 32K context window. Gemma 3n had 4096 tokens. Multi-modal input — text, images, and audio — in one model.

That context gap is what makes the agent use case real. Tool call outputs, calendar data, conversation history — there's room to feed all of it back to the model without truncating. Running a 32K context window offline on a phone wasn't viable when the best on-device options topped out under 2K.

All models run through LiteRT-LM — previously TensorFlow Lite, now meaningfully upgraded.

Agent Skills: What It Actually Looks Like

This is the part I came to test. Here's what's actually inside.

The Skills Architecture

Agent Skills runs on 12 skills total — split into two tiers:

Built-in skills (Google's own):

calculate-hash — hash a given text
create-calendar-event — write to OS calendar
interactive-map — show a map view for a location

Community skills (user-created, same interface):

read-calendar-events — read OS calendar for a specific date
schedule-notification — schedule a one-time or repeating daily notification
query-wikipedia — pull a Wikipedia summary on a topic
qr-code — generate a QR code for a URL
mood-tracker — stores daily mood and comments, tracks history
send-email — send an email
learn-something-new — daily learning companion with image card and scheduled notification
kitchen-adventure — dungeon master RPG set in a world of sentient kitchen appliances
text-spinner — "Spin the given text on my head"

Two things worth noticing. First: calendar read and calendar write are separate skills with separate toggles. That's a granular permissions model — the agent can write events without having read access, or vice versa. Second: send-email means an offline on-device model can send emails through a skill. That's not a demo capability.

Every skill has a "View" button — inspect the full skill definition before enabling it. Each is individually toggleable. The chat bar shows a live count: Skills 8 | MCP 0. Skills and MCP are tracked separately in the same toolbar.

Four ways to add skills:

Featured list — curated community contributions
Load from URL — any web-hosted skill directory
Import local skill — from the device directly, no server needed
GitHub Discussions — browse the full community

Local import is the developer detail. You can build and test a skill entirely on-device without hosting anything. The iteration loop for custom skill development doesn't require a deployed server.

Agent Skills only accepts two models — both Gemma 4. Gemma 3, DeepSeek, Qwen: available for AI Chat, locked out of agents. Google drew a capability line and stuck to it.

The MCP Setup

Tap the MCP counter in the toolbar → empty state with a single button: + Add MCP server.

The dialog asks for:

Your MCP server URL
Authorization: None, Request header, or OAuth (WIP)

OAuth isn't ready yet. That's the honest limitation for anyone planning to connect enterprise or authenticated MCP servers — you're working with no auth or bearer token only, for now. Public MCP servers work today. Authenticated production servers will have to wait for OAuth to ship.

Worth noting what this architecture means even without OAuth: tool-selection logic runs on-device, and only the structured API call leaves the phone. For healthcare or legal tooling that can't send raw queries to a server, that's a meaningful trust boundary — not a workaround.

For developers already running public MCP servers: enter the URL, the app fetches the tool manifest, tool definitions load into the system prompt alongside your active skills. The model handles invocation from there.

The Rest of the App

Each use case — AI Chat, Ask Image, Audio Scribe, Prompt Lab — links directly to API documentation and example code from its own screen. This is a teaching environment, not just a demo. Google built it for developers to read, fork, and build on.

AI Chat supports Thinking Mode — Gemma 4's step-by-step reasoning exposed inline. Useful before you wire a model into anything production-facing.

Mobile Actions runs on MobileActions-270M — 289 MB, fully offline. A 270M parameter model doing device automation. For context, Gemma 4 E2B is roughly 10x that size and handles general reasoning. The argument being made with that design: narrow fine-tunes at sub-300MB can do discrete tasks better than a general model, and they fit anywhere.

Tiny Garden — 289 MB, natural language gardening game — is the same point made playfully. Watch how function-calling works on-device in a consequence-free environment.

Getting Started

1. Install Google AI Edge Gallery (Play Store)
   Requires capable Android hardware — tested on Pixel

2. Download Gemma-4-E2B-it (2.6 GB)
   Only Gemma 4 models run Agent Skills

3. Open Agent Skills → tap the Skills or MCP button in the toolbar

4. For built-in skills: toggle on what you need
   For MCP: tap MCP → Add MCP server → enter URL + auth

5. Start chatting — the model sees your active skills and connected tools

Source code: github.com/google-ai-edge/gallery. Community skills shareable via GitHub Discussions.

What Developers Should Take From This

The Gemma 4-only lock on Agent Skills is the tell. This isn't a checkbox feature. Google shipped agentic tool use where the model can handle it reliably, and locked out weaker models until that changes. That's a better decision than letting anything run and degrading silently.

The OAuth (WIP) flag on MCP auth is the other honest signal. Public MCP servers work today. Enterprise-grade authenticated connections aren't there yet. That's not a failure — it's a preview of where this is going, with the current edges visible rather than hidden.

The 270M fine-tuned models are the underrated part of this release. MobileActions and TinyGarden are evidence of a different architecture: specialized micro-models for narrow tasks, general models for reasoning, LiteRT-LM as the runtime connecting them. At 289 MB each, those models fit anywhere.

MCP being here matters because it's the same protocol across Claude, Cursor, VS Code extensions, and now Google's on-device runtime. Build a tool as an MCP server once and every compatible client picks it up. That's not small.

Verdict

AI Edge Gallery is the most developer-forward release from Google I/O 2026. Not a consumer product — a reference implementation of what on-device agents look like when an open protocol, a capable model family, and a mobile inference runtime land in the same place.

If you're building with MCP today, install the app and point it at your existing server. Your tools already work. That's not a coincidence — it's what a protocol looks like when it actually wins.

Written with assistance from Claude (Anthropic). Hands-on testing on Pixel: model list, Agent Skills interface, MCP setup flow, and skills management observed directly. Gemma-4-E2B-it downloaded; model inference and chat results not included in this article.

Cloudflare Deprecated My Production Model. The Recommended Upgrade Costs $4/M Tokens. Gemma 4 MoE Doesn't.

Daniel Nwaneri — Tue, 19 May 2026 11:55:01 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

On May 8, Cloudflare posted a deprecation notice.

@cf/moonshot/kimi-k2.5 — the model synthesising knowledge across 45,000 of my saved tweets — was going away on May 30.

I had a live production system, a daily cron, and 100,000+ indexed documents depending on that model. I had 22 days.

Cloudflare's recommended replacement: @cf/google/gemma-4-26b-a4b-it.

So I migrated and benchmarked every step. Here's what I found, what broke, and why Gemma 4 MoE was the right call even after a better Kimi arrived.

bookmark-cli is a personal knowledge engine I built after getting frustrated with X's native search. It syncs my bookmarks and likes into local SQLite, then pushes everything into a Cloudflare Worker for semantic retrieval.

The numbers:

45,053 tweets (11,835 bookmarks + 33,218 likes)
7,155 photo tweets enriched by Llama 4 Scout vision descriptions
100,302 total documents in the vector index
Daily cron syncing new content automatically
$5/month total running cost

The architecture: bookmark-cli calls vectorize-mcp-worker, which runs hybrid BM25 + vector search, cross-encoder reranking, and a knowledge reflection layer that synthesises connections across documents.

One question worth answering upfront: if the data is from 2023, what good is it?

This isn't a news feed — it's a thinking tool. When you liked a tweet about RAG failure modes two years ago, you were signalling "this matters to me." The reflection engine connects that to four other things you saved that week across different topics and surfaces the thread you didn't consciously notice. The index only contains what you chose to save. No engagement algorithm, no ads, no recency bias — just your own curation, made searchable and cross-referenced. Google searches the internet. This searches your mind.

A reflection the engine generated from tweets I saved about AI and work — none of which said this:

"Non-technical users are increasingly using AI agents to 'vibe-code' large amounts of software without manual code review or verification. This reliance on generated outputs often involves a level of blind trust that bypasses the rigorous research and scrutiny essential to traditional programming. Although this method can appear highly productive, the lack of technical expertise makes debugging these systems exceptionally difficult and prone to subtle, painful failures."

That's three fragments from different weeks, connected by the model into one coherent insight. The technical details are below.

Demo

Live dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard

Code

vectorize-mcp-worker: github.com/dannwaneri/vectorize-mcp-worker
bookmark-cli: github.com/dannwaneri/bookmark-cli
Benchmark endpoint: POST /benchmark with Authorization: Bearer header

How I Used Gemma 4

Why Gemma 4 MoE specifically

Three Gemma 4 variants exist on Workers AI. I needed to pick one.

Model	Active params	Best for
gemma-4-e4b-it	4B total (dense)	Local / memory-constrained
gemma-4-27b-it	27B dense	Max quality, more compute
gemma-4-26b-a4b-it	26B total, 4B active (MoE)	Edge inference, reasoning depth

The reflection layer does multi-document synthesis — it reads 5 related chunks and produces a structured 3-sentence insight. That's not a summarisation task, it's a reasoning task. The 4B dense model would have been too shallow. The 27B dense would have been too slow at the edge.

4B active parameters per forward pass. 26B total. At the edge, you need the first number. For multi-document synthesis, you need the second. The MoE architecture is the only way to have both.

The entire pipeline — embed, retrieve, rerank, reflect — runs inside one Cloudflare Worker. Gemma 4 MoE is a native Workers AI binding. No external API call. No data leaving the edge.

The migration

The codebase already had a REFLECTION_MODEL env var. The model registry needed one addition:

export const REFLECTION_MODELS = {
  'gemma-4': {
    id: '@cf/google/gemma-4-26b-a4b-it' as const,
    label: 'Gemma 4 26B MoE (4B active)',
    note: 'Recommended. 4B active params via MoE — edge-native, no external hop.',
  },
  'kimi-k2.5': {
    id: '@cf/moonshotai/kimi-k2.5' as const,
    label: 'Kimi K2.5',
    note: 'Deprecated May 30 2026.',
  },

Then:

wrangler secret put REFLECTION_MODEL
# enter: gemma-4

wrangler deploy

That was the migration. The reflection engine reads env.REFLECTION_MODEL dynamically. Nothing else changed.

Three gotchas worth knowing

1. max_tokens. Gemma 4 is a thinking model. It writes a full reasoning chain before producing output. With max_tokens: 180 set for the old model, Gemma 4 was spending all its tokens on internal reasoning and returning empty content. Bumping to max_tokens: 2048 fixed it.

2. Response extraction. For thinking models, use choices[0].message.content — not .reasoning and not .response. The reasoning field is the internal chain of thought, not the answer.

3. Prompt format. Verbose rule-lists trigger Gemma 4's constraint-analysis behaviour — it restates your rules as bullet points instead of following them. Keep prompts simple and end with a direct action cue:

Read the new source and related sources below, then write 3 plain prose sentences
that synthesise them into a knowledge base entry. No bullets. No analysis. No preamble.
Just 3 sentences.

New: "..."
Related: ...

Write the 3-sentence synthesis now:

The benchmark

I built a /benchmark endpoint that runs both models in parallel against the same query, logs latency and response to D1, and returns side-by-side results.

POST /benchmark
{ "query": "What are the common failure modes of RAG systems?" }

Results from D1 (9 real queries):

Query	Gemma 4 MoE	Kimi K2.5
RAG failure modes	12.9s	12.4s
Embedding model selection	9.9s	90.7s ⚠️
BM25 vs vector search	19.6s	7.3s
Reducing hallucination	19.0s	6.9s
Chunking strategies	9.3s	9.0s
Edge AI model selection	11.8s	8.3s
MoE efficiency at scale	16.5s	8.0s
Cloudflare Workers AI	22.8s	FAILED
KB maintenance	10.1s	5.5s

Kimi K2.5 was faster on 7 of 9 queries. But it produced a 90-second response on one query and failed outright on another — within a single benchmark run. A model that's faster on average but unreliable under load isn't a production model.

Gemma 4 MoE was consistent. Every query returned. Every response was coherent. Latency was predictable.

Beyond the latency numbers, the Kimi K2.5 reflections in the index all started with "Here are the 3 sentences:" — the model was leaking the instruction prefix into every stored reflection. Gemma 4 produces clean prose output with the right prompt.

What's live now

GET /stats → models.reflection: "@cf/google/gemma-4-26b-a4b-it"

The live dashboard is at vectorize-mcp-worker.fpl-test.workers.dev/dashboard — open it and the active reflection model is listed in the stats panel. Gemma 4 MoE, running in production.

1,525 reflections generated since the migration. The cron added more this morning. Verify live: /public-stats — no API key needed.

A second reflection, this one on AI and management:

"AI is increasing individual contributor leverage and is frequently marketed as a labor replacement, driving companies to prioritize cost-cutting and individual productivity. This trend often places pressure on managers to perform individual contributor roles, potentially devaluing the necessity of human oversight and organizational management. Relying on these technologies also introduces risks involving accountability for failures, misunderstandings of AI's true capabilities, and the loss of human-centric benefits like upskilling."

This came from unrelated tweets saved across different weeks, connected by the engine into a single coherent insight, stored back into the index so it surfaces when I search anything adjacent to AI, management, or developer tooling. That's the reflection layer working as intended.

Full pipeline:

bookmark-cli → vectorize-mcp-worker
  embed (BGE Small) →
  retrieve (Vectorize + BM25) →
  rerank (BGE cross-encoder) →
  reflect (Gemma 4 MoE) ← NEW

Everything stays inside one Cloudflare Worker. No external hop for the reasoning layer.

Gemma 4 MoE isn't here because of a challenge. It's here because Cloudflare deprecated the model it replaced and this was the right call. It will still be running after June 4.

The verdict

Gemma 4 MoE is not faster than Kimi K2.5 was on average. If raw speed were the only metric, and if Kimi K2.5 were staying around, I'd have a harder decision.

But it isn't staying around.

Cloudflare has since released Kimi K2.6 — 1T parameters, 262k context window, reasoning, vision, tool calling. It's impressive. It's also $0.95/M input tokens and $4.00/M output tokens. The reflection layer synthesises on every ingest. At that pricing, running it across a 100k-document backlog would end the $5/month cost story in a single batch. Gemma 4 MoE, as a native Workers AI model, stays within the free tier. The upgrade path wasn't really an upgrade for this use case.

And for a reflection layer specifically — where the task is multi-document synthesis, where you need reasoning depth more than raw throughput, and where you want the entire pipeline to stay edge-native — Gemma 4 MoE is the right model. The MoE architecture is why. 4B active parameters gives you the inference speed you need at the edge. 26B total parameters gives you the knowledge depth the task requires.

At $4/M output tokens, the upgrade wasn't an upgrade. Gemma 4 MoE still is. The daily cron doesn't know it's in a challenge. It ran this morning.

What's next for Gemma 4 MoE in this pipeline

The reflection layer is one use. The code already has a second.

Every 3 new ingests, the pipeline runs a consolidation pass — Gemma 4 MoE reads the 10 most recent reflections and merges them into a single doc_type='summary': dominant theme, two or three specific non-obvious facts, and the most persistent open question across all the reflections. The summary lands in Vectorize and surfaces in search exactly like a reflection does. Reflections capture individual connections. Summaries capture patterns across connections. Both are Gemma 4 MoE, both are edge-native, both add to the index without touching the $5/month cost ceiling.

That's the current state. Three extensions are already scoped:

Query-time answer synthesis. Right now the pipeline retrieves chunks and returns them. The next layer uses Gemma 4 MoE to read the top 5 retrieved chunks and produce a direct answer — not a list of results, an actual response grounded in what you saved. The retrieval already works. The synthesis step is the same task the reflection layer already does, with a different prompt.

Routing upgrade. The V4 intelligent router currently runs on Llama 3.2 3B — fast classification into six query routes (SQL, BM25, vector, graph, etc.). Moving that to Gemma 4 MoE's thinking mode means the router can reason about ambiguous queries instead of classifying them. A question like "what did I save about RAG that I disagreed with?" hits multiple routes simultaneously. A 3B classifier guesses. A 26B MoE reasons.

Gap detection. The reflection engine already identifies gaps — questions the combined knowledge doesn't answer. A weekly pass that reads all gap annotations across the index and surfaces the three most persistent unanswered questions would make the tool actively useful for research, not just reactive to search queries. One scheduled cron, one Gemma 4 MoE call per week, zero additional cost in the free tier.

Personal preference reranker. The index contains 100k+ documents across AI, politics, sports, and everything else saved since 2016. Every bookmark and like is a signal: this person found this worth keeping. The longer-term path is fine-tuning a small cross-encoder on that signal — not domain expertise, but preference prediction. A model trained on "did this person save this or not" beats every general reranker at one narrow task: knowing what you care about. It slots into the existing pipeline as a final reranking layer after BGE, before the reflection pass. The training data is a decade of curation. The narrow task is yours alone.

The reflection layer was the migration. These four are the reason it stays.

I Ran SERP Feature Detection on 8 Nigerian Creator Queries. Every Single One Had an AI Overview.

Daniel Nwaneri — Thu, 14 May 2026 14:40:49 +0000

I built a SERP feature detection module for my SEO agent. Then I ran it on the queries I'm targeting for a site about how Nigerian creators get paid online.

The results were more uniform than I expected.

What the Module Does

The module calls SerpApi for each target query and checks the structured JSON response for seven SERP features:

AI Overview
Featured snippet
People Also Ask (PAA)
Image pack
Video results
Local pack
Knowledge panel

Each query comes back with a feature matrix and a list of content opportunities. No browser, no CAPTCHA, no bot detection — SerpApi handles the residential proxy infrastructure on their end.

The free tier is 100 searches/month. I have 8 target queries. That's comfortable.

The Results

I ran it on these queries for naija-vpn.com:

does twitch pay nigerians
how to receive money from twitch in nigeria
cleva vs geegpay nigeria
payoneer nigeria freelancers
how nigerians get paid on youtube
how to receive fiverr payment in nigeria
does tiktok pay nigerians
best dollar account for nigerian freelancers

The output:

Query	AI Overview	Feat. Snippet	PAA	Images	Video	Local	KP
does twitch pay nigerians	✓	—	✓	—	✓	—	—
how to receive money from twitch in nigeria	✓	—	✓	—	✓	—	—
cleva vs geegpay nigeria	✓	—	✓	—	✓	—	—
payoneer nigeria freelancers	✓	—	✓	—	✓	—	—
how nigerians get paid on youtube	✓	—	✓	—	✓	—	—
how to receive fiverr payment in nigeria	✓	—	✓	—	✓	—	—
does tiktok pay nigerians	✓	—	✓	—	✓	—	—
best dollar account for nigerian freelancers	✓	—	✓	—	✓	—	—

8 out of 8 queries: AI Overview ✓, PAA ✓, Video ✓.

No featured snippets. No local pack. No knowledge panels. Clean organic SERPs with three consistent rich features sitting above the fold.

What This Actually Means

AI Overview on every query

Google is summarising all of these topics directly in the search results. The old model — rank #1, get the click — has a layer on top of it now. The AI Overview reads the top sources and generates a summary. If your content is the clearest, most direct answer, there's a chance you get cited inside the summary. If it's buried in context, you don't.

The implication for writing: the first paragraph of every article in this niche now needs to be a complete, direct answer. Not "In this article, we'll explore..." — an actual answer. 40–60 words, no preamble.

PAA on every query

People Also Ask boxes are Google surfacing the secondary questions that the same user is likely to have. They're also secondary ranking opportunities — each box is its own small search result.

The catch: to rank in a PAA box, your heading needs to match the question phrasing closely. "Common Questions" sections with paraphrased questions miss the slot. The exact wording matters.

I ran related_questions from the SerpApi response for my two priority queries:

"does twitch pay nigerians":

Is Nigeria eligible for Twitch monetization?
How much money for 1000 views on Twitch?
Does streaming pay in Nigeria?
How much is 20k gifted subs on Twitch in Nigeria?

"how to receive money from twitch in nigeria":

Is Nigeria eligible for Twitch monetization?
How to make money from Twitch in Nigeria?
How much is 20k gifted subs on Twitch in Nigeria?
How many viewers on Twitch to make $1000 a month?

These became the <h3> headings in my FAQ section — verbatim, with full answers under each one.

Video on every query

YouTube videos are ranking for all 8 queries. I don't have videos. That's a gap I'm noting but not chasing right now — text content first, get indexed and cited, video is a later play.

No featured snippets

The AI Overview is eating what would otherwise be featured snippets. This isn't surprising — Google uses featured snippets to feed AI Overviews. If you're getting cited in the AI Overview, the featured snippet slot effectively doesn't matter.

What I Changed on the Pages

I updated three pages (Twitch, TikTok, YouTube) with the same two interventions:

1. Opening paragraph rewrite

Before:

Quick Answer: Nigerian streamers receive Twitch payments by using virtual US dollar accounts...

After:

Yes, Twitch pays Nigerians. Nigerian streamers can earn from subscriptions, bits, and ads through the Twitch Affiliate and Partner programs. To receive payments, you need a virtual dollar account — Cleva or Geegpay — which gives you real US bank details for Twitch's ACH transfers. Setup takes under 10 minutes with your NIN and BVN.

The difference: the second version opens with "Yes" (direct answer to the query), names the mechanism (virtual dollar account), names the specific products (Cleva, Geegpay), and gives the time estimate. All within 55 words. That's extractable.

2. FAQ section with exact PAA wording

Replaced generic Q&A sections with <h3> headings matching the SerpApi PAA questions word for word. Full paragraph answers under each.

I also tested adding @type: FAQPage structured data. I removed it before publishing — Google deprecated FAQ rich results in August 2023 and pulled it from their documentation entirely in September 2024. The schema does nothing now. The <h3> headings with exact PAA wording are what actually matter for ranking in PAA boxes, not the schema.

The Tool That Did This

serp-features is a module in my open-source SEO agent. You give it a list of queries, it returns a feature matrix and opportunity notes. It's about 260 lines of Python.

# Install
pip install httpx

# Set your SerpApi key (free tier, no credit card)
export SERPAPI_KEY="your-key-here"

# Run on a single query
python main.py serp-features --query "does twitch pay nigerians" --project naija-payments

# Run on a file of queries
python main.py serp-features --queries queries.txt --project naija-payments

The output is a markdown file with the feature matrix and per-query opportunity notes.

GitHub: github.com/dannwaneri/seo-agent

What I Learned Building It

The first version used Playwright to visit the real Google SERP. It worked for 1–3 queries before Google's bot detection kicked in. I tried user agents, delays, networkidle waits — none of it made a meaningful difference. Google's detection isn't based on request timing; it's based on IP reputation at scale.

The solution was to stop trying to scrape Google and use an API that already solved the infrastructure problem. SerpApi uses residential proxy networks — the same approach DataForSEO uses, just accessible to individuals on a free tier.

The rewrite was cleaner than the original. No Playwright dependency, no browser window opening, no CAPTCHA prompts. One httpx.get() call per query, structured JSON in, feature flags out.

Sometimes the right answer is to not fight the infrastructure problem yourself.

This is part of an ongoing series on building an open-source SEO co-pilot. The full agent handles core audits, GSC analysis, backlink scoring, internal link mapping, SERP feature detection, and LLM visibility checking — all local, all free or near-free to run.

OpenSEO Has 1.7k GitHub Stars. I Built the Same Thing for $0.

Daniel Nwaneri — Thu, 14 May 2026 14:37:44 +0000

I saw OpenSEO trending and did what every developer does.

I starred it before reading the pricing.

Then I read the pricing.

The appeal is real

The pitch is clean: open source, self-hostable, pay-as-you-go. No Semrush subscription. No bloat. Fork it and add your own features. For developers tired of paying $200/month for tools that do 10x more than they need, it lands perfectly.

1.7k stars. 196 forks. Active releases. The community is real.

I get it. I would have starred it too.

Then I opened the pricing section

"OpenSEO itself remains free. It works by using DataForSEO's APIs, which is a paid third-party service."

So it's free the same way a printer is free.

Here's what DataForSEO actually costs:

Minimum top-up: $50
Backlinks API: $100/month commitment
100 keyword research requests: $3.50–$7.00
100 domain overviews: $4.01
Rank tracking at scale: climbs fast depending on keywords and devices

The stars came from developers who love the idea. The cost reality hits after setup.

What I built instead

Part of what I do is build real-browser SEO automation tools — agents that visit pages the way Google does, not the way a scraper does.

My SEO agent does a full site audit — titles, meta descriptions, H1s, canonical tags, broken links, GSC quick wins, internal link clusters — in a real Chromium browser.

Not an API. An actual browser visiting each page.

Here's what it costs to run:

Browser visits: $0. Playwright is free.
GSC data: $0. Google already collected it for you.
Claude API calls: fractions of a cent per page on Haiku.
Total for a full audit: under $0.01 for most sites.

I wrote about the exact cost breakdown here. The short version: I was paying $0.006 per URL until I realized most URLs needed $0.

The technical difference that matters

OpenSEO pulls data from DataForSEO's index. That index is updated periodically. It tells you what DataForSEO's crawlers saw, when they saw it.

My agent visits the page right now, in a real browser, and extracts what's actually there — rendered JavaScript, actual title tags, real canonical values, live broken links.

If a page has a client-side rendering issue that hides the H1 from crawlers, a scraper-based tool misses it. A real browser catches it.

This is the same principle behind the Cloudflare-based automations I build for clients — edge-deployed, real output, not cached assumptions.

That's not a criticism of OpenSEO. It's a different architectural choice with a real tradeoff.

What OpenSEO has that I don't

I'll be honest:

Rank tracking over time — I don't have this
Keyword research at scale — not built
Backlink analysis — not in my agent
A polished UI — mine outputs JSON

If you need those features and you're comfortable with the DataForSEO cost model, OpenSEO is a reasonable choice.

But if your core need is: does this page have what Google needs to rank it — a real browser costs less and sees more.

One more thing

OpenSEO's contributor list includes @claude.

So does mine.

We're both building production tools with the Claude API. The difference is what you're optimizing for — features, or cost per insight.

I chose cost per insight. My sites are proof it works.

The agent is open source

github.com/dannwaneri/seo-agent

Run it on your own site. It's resumable, so if it crashes at URL 47 it picks up at URL 48. No DataForSEO account needed.

If you want a version that runs on Cloudflare Workers at the edge, that's something I build for clients too.

I build AI agents and SEO automation tools at dannwaneri.com. Everything I ship, I've run on my own domains first.

The Language Wars Are Over. The Ground Shifted Without You.

Daniel Nwaneri — Tue, 12 May 2026 14:12:05 +0000

I stopped caring which language someone uses. Somewhere in the last eighteen months, that happened without me deciding it.

Not because I became a better person. Because the argument stopped mattering.

In August 2025, TypeScript surpassed both Python and JavaScript as the most-used language on GitHub for the first time ever. Not because developers sat down and decided TypeScript won. Because AI tools handle it better, so it spread. The debate didn't resolve. The ground shifted underneath it and most people are still fighting on the old map.

The War That Already Ended

The Python vs JavaScript argument ran for a decade. Rust evangelism became a personality type. C++ veterans looked down on everyone. The fight was never really about syntax — it was about belonging. Who gets to call themselves a real developer. Who gets filtered out at the interview. Who gets taken seriously in the architecture meeting.

That argument is over.

Not because anyone won. Because something else became the constraint.

What Replaced It

The new constraints aren't linguistic. Tokens — how much context a session can hold before the model starts forgetting what it's building. Context windows — how much of your codebase an agent can actually see at once. Prompt discipline — whether your instructions are tight enough that the agent doesn't guess. Three things. None of them are in any job description yet.

Nobody voted on this shift. There was no announcement. It just became true while we were arguing about whether Rust was worth learning.

The developer who ships consistently now isn't the one who knows the most syntax. It's the one who can structure a spec tightly enough that the agent doesn't hallucinate the requirements, manage a context window without losing architectural coherence across sessions, and catch what the model got confidently wrong before it reaches production.
I’ve been experimenting heavily with this in my own production AI agents and real-browser automation workflows.

That's a different skill. No bootcamp teaches it yet. Most job descriptions don't list it.

The Gate Didn't Disappear. It Moved.

Language gatekeeping excluded people by syntax preference. You didn't know pointers? Not a real programmer. You used PHP? Embarrassing. You learned with a framework instead of from scratch? Shortcuts.

The new gatekeeping is quieter. You're not excluded for your language anymore.

You're excluded for your context budget.

Token limits are a billing problem dressed as a technical one. But knowing how to structure prompts, manage agent memory, and stay coherent across a long multi-step workflow — these compound. The developer who can do this produces dramatically better output than the one who can't. The gap is real and it grows with complexity.

Same exclusion mechanism. Different surface. Less visible, which makes it harder to name and harder to argue against.

The old gatekeeping was at least honest about what it was filtering for. The new one looks like a productivity difference.

What Doesn't Change

Not everything shifted.

That person is still you.

The things that actually matter — judgment, accountability, knowing when the confident answer is wrong — those don't change with the terrain. They get more important as generation gets cheaper.

Uncle Bob Martin, who spent months coding with Claude and wrote about it publicly, noticed something: Claude codes faster, holds more details, but can't hold the big picture. It doesn't foresee the disaster it's creating. Someone still has to see that. Someone still has to slow down and ask whether this is right, not just whether it compiles.

But the marker of competence shifted. The proxy changed. The new proxy is harder to fake than the old one.

You can memorize syntax. You can pass a whiteboard interview on language trivia. You can't fake knowing how to structure a ten-step agent workflow without the context collapsing at step seven, or how to write a spec that gives an agent something real to work with instead of something it'll interpret five different ways.
This is exactly why I built my own SEO automation agent that runs unsupervised on Cloudflare.

The old gate was about what you'd memorized. The new one is about how you think.

The Part That's Still Unresolved

I don't know if the new gate is better than the old one.

The old gatekeeping protected a social hierarchy more than it protected code quality. CS degrees, whiteboard interviews, years-of-experience requirements — they controlled access. They decided who got to call themselves real engineers. That architecture was never really about quality.

The new constraints are at least about something real. Context discipline, prompt structure, verification habits — these produce actual output differences. The filter is less arbitrary.

But "less arbitrary" isn't the same as "fair." Token budgets cost money. The developer in Lagos with a $20 API limit and the developer in San Francisco with a $200 plan are not operating in the same environment. The new constraint is technical and financial simultaneously. That's not a coincidence — it's just the old hierarchy in different clothes.

We spent years arguing about languages. Now the argument is how well you can give instructions.

That's not obviously worse. It's just different.

And we haven't decided yet whether the new gate is better than the old one, or just less visible.

I Gave My SEO Agent a Real Site. It Found Bugs I'd Missed for Weeks.

Daniel Nwaneri — Wed, 06 May 2026 13:24:06 +0000

does twitch pay nigerians — position 9.5 in Google Search Console, 29 impressions, 0 clicks.

Position 9.5 is close. One honest title rewrite from page one. But the CTR was 0. Nothing. The title Google was showing users didn't match the query at all.

I'd built that page myself. I just couldn't see it until the tool told me.

That's the thing about auditing your own content. You know what you meant. The tool only knows what's there.

What the agent was before

I've written about this SEO agent on FCC and dev.to over the past few months. The baseline version: scrape a URL, check title length, meta description, H1 match, keyword density, Core Web Vitals, schema markup. Standard on-page audit. Useful. Not smart.

It told you what was there. Didn't tell you what was costing you rankings.

Four modules later, it does.

Backlink Qualifier — scores referring domains on niche relevance (0–100), traffic quality (0–100), spam score (0–100, inverted). Weighted average: niche * 0.50 + traffic * 0.30 - spam * 0.20. Tiers: Insert Worthy ≥80, Good ≥60, Review ≥40, Avoid <40. Takes a list of backlink URLs, fetches each one via the real browser, sends summaries to Claude Haiku for scoring. Resumable — cached to a flat JSON file so a crash at URL 47 doesn't restart from zero.

GSC Insights — parses a Google Search Console export CSV, deterministically flags quick wins (position 4–20, impressions ≥50, CTR <5%), then sends the top 50 rows to Haiku for cannibalisation detection and cluster gap analysis. The prompt asks specifically for queries where two pages are competing. That's the one you can't see just by sorting the spreadsheet.

Relevance Scorer — given a target page and a list of candidates, scores each as a potential internal link source: topical alignment (0–100), anchor opportunity (0–100), link equity (0–100). Same weighted approach. Tiers: Strong Link ≥75, Good Link ≥55, Weak Link ≥35, Skip <35. Before scoring, it checks whether the candidate already links to the target — deterministically, from the raw links in the page snapshot — so it never recommends a link that already exists.

Cluster Audit — builds a full internal link graph across the site, computes incoming link counts per page, then sends the whole picture to Haiku to map topic clusters, identify orphan pages, flag missing hub pages, and suggest cross-cluster links.

All four: Claude Haiku for cost efficiency, flat JSON for resumable state, markdown reports in the project directory.

The code is on GitHub. The prompts are in /prompts. The modules are in /modules. dannwaneri.com/seo-automation has the links.

naija-vpn.com — what the agent found

I run a small site for Nigerian creators trying to get paid internationally. Five pages at the time of the audit: homepage, Cleva vs Geegpay comparison, Twitch payments Nigeria guide, Carter Efe case study, Data Saver app landing page.

I'd built these pages, deployed them to Cloudflare Pages, checked them manually. They looked fine.

GSC Insights found this:

does twitch pay nigerians — 29 impressions, position 9.5, 0% CTR.

The Twitch guide page was ranking for that exact query. But the title was: "How to Receive Twitch Payments in Nigeria (2026 Guide)"

Nobody types "how to receive twitch payments." They type "does twitch pay nigerians" — the doubt query, the "wait, is this even possible?" moment. The title was answering a question nobody asked.

The fix was a one-line change: "Does Twitch Pay Nigerians? Yes — Here's Exactly How (2026)"

That's the query, with the answer embedded. A user scanning results at position 9.5 now sees confirmation in the title itself.

GSC Insights also found this:

cleva vs geegpay — position 5.25, also 0% CTR.

The comparison page title was: "Cleva vs Geegpay: Best Choice for Nigerian Creators?"

Ending a title with a question when the user is asking a decision query creates friction. They're not looking for a question. They're looking for someone who already did the test. New title: "Cleva vs Geegpay: Which Is Better for Nigerian Freelancers?" Plus a description that actually answers it: "We tested both. Cleva wins for direct bank wires, Geegpay for Upwork/Fiverr. Full fee breakdown, transfer speeds, and which to use based on how your clients pay."

Then the cannibalisation.

The Carter Efe case study — a page about how Africa's highest-earning Twitch streamer receives his payments in Nigeria — had this title: "NaijaVPN - Get Paid Internationally in Nigeria"

That's the homepage title. Copy-pasted verbatim.

Google was indexing two pages with identical titles and descriptions. When it can't distinguish between two pages, it picks one and effectively buries the other. The case study was fighting the homepage for clicks on the same queries — and losing.

New title: "How Carter Efe Gets Paid from Twitch in Nigeria ($50K/Month)". The $50K/Month is in the GSC data, it's real, and it makes the page about one specific person instead of a generic payment guide.

The Relevance Scorer found two missing internal links.

The Carter Efe page never linked to the Cleva vs Geegpay comparison — even though it spent three paragraphs naming both services. The Twitch guide had the same gap. Both pages were talking about a comparison that existed on the same site and never pointing to it.

The scorer flagged both as Strong Link (≥75). Both links added.

The Cluster Audit found an orphan.

The Data Saver app page — /data-saver — had zero internal links pointing to it. Not from the homepage. Not from any other page. It existed as a standalone page with no connection to anything else on the site.

The report flagged a missing hub: "How to Save Mobile Data on Nigerian Networks: Tools, Tips & Strategies" at /save-mobile-data-nigeria. That hub connects the Data Saver app to the broader Nigerian mobile user context, and cross-links to the payment content where data usage overlaps. I built it the same week. The audit told me exactly what to build next — I didn't have to guess.

What changed after

Three title rewrites. Two internal links added. One orphan fixed — the hub page built and linked. Deployed to Cloudflare Pages.

The position 9.5 query has 21 days to show movement before the reindex settles. I'm checking on 2026-05-25.

What I have now: a site that was structurally broken in ways I couldn't see manually. A page cannibalising its own homepage. A ranking query with no matching title. Two pages talking about a comparison they never linked to. One completely isolated page with no internal path to it.

None of these showed up in the basic on-page audit. They're not "is your title too long" problems. They're structural — the site was working against itself at the architecture level.

The actual point

I didn't build these modules because I needed them for a single site.

I built them because I offer SEO as a service. The basic auditor told clients what was there. These tell clients what's costing them.

The difference is whether the tool earns the conversation. A checklist of "your meta description is 143 characters" doesn't. "You're at position 9.5 on a high-intent query with 0% CTR because your title doesn't answer the question" does.

The code is open source because the tool is the delivery mechanism, not the product. The product is the analysis. Someone can clone this and run it — but knowing which findings matter, what to fix first, and how to communicate that to a client who doesn't know what cannibalisation means is where the service is.

That gap doesn't shrink when you open-source the code.

It widens.

All four modules are open source. Full implementation details and the service page are at dannwaneri.com/seo-automation. If you want the agent run on your site and a prioritised fix list delivered, that's available too.

I Did Everything the AI Era Asked. It Still Didn't Pay My Bills.

Daniel Nwaneri — Thu, 30 Apr 2026 11:31:11 +0000

I didn't start because of AI. I was already building when it arrived.

Then AI came and reframed everything. New tools. New possibilities. A new story about what the market would reward. I leaned in — harder than most, because I had more to prove.

I have seven freeCodeCamp tutorials live.

Not drafts. Not unpublished. Live. With real readers, real comments, real engagement.

I built an SEO audit agent from scratch — Python, Browser Use, Playwright, the Claude API — evolved it through three versions, documented every failure. I built a production RAG system with hybrid search, multimodal vision, and a native MCP server, running on Cloudflare for about $5 a month. Not a demo. Not a tutorial project. Something anyone could deploy today. I built a federated knowledge commons. A suite of Claude Code skills because the defaults weren't good enough for how I actually work. I published on DEV.to, contributed to open source, earned an AWS Community Builder badge, maintained a 100% Job Success Score on Upwork.

I did everything the internet said to do.

2026 still hasn't paid me back.

Here's what nobody tells you when they say "build in public":

The AI era didn't democratize opportunity. It democratized output.

Those are not the same thing. Alex Hormozi said it cleaner than I can: "AI doesn't reduce the value of money. It reduces the value of labor. Big difference."

When everyone can publish a tutorial in a day, seven tutorials mean less. When every developer suddenly has a GitHub portfolio, portfolios stop being signal. When AI writes 90% of the job application emails landing in recruiters' inboxes, the remaining 10% gets buried with them.

The bar to produce something "good enough" collapsed. So the market for "genuinely good" collapsed with it.

I didn't know that when I started. I believed the story — build enough, publish enough, the right person notices. That was the theory.

The theory was wrong.

The data backs this up, but I didn't need the data. I felt it.

Entry-level tech hiring dropped 25% year-over-year in 2024. Developer employment for people aged 22–25 is down nearly 20% from its 2022 peak — a gap that Stanford researchers confirmed opened specifically when generative AI arrived, with younger developers losing work while developers over 35 saw employment grow. Developers are sending out 200–300 applications to get one callback.

Not because they're bad developers.

Because companies are figuring out how much of the work AI can absorb before they need to hire again. And while they're figuring it out, they're not responding.

The silence isn't personal. But it lands personally.

I want to be honest about something I haven't seen written anywhere:

The AI era created a content surplus that made human content invisible.

Think about what happened. AI lowered the cost of creating tutorials, blog posts, open source tools, portfolios. So everyone made more of them. Supply exploded. Recruiter attention didn't. The math was always going to end this way — we just didn't want to see it while we were building.

I spent months producing content that AI could have generated in minutes.

That's not an insult to my work. My work is real, tested, honest — I catch fabrications before I publish, I build systems before I write about them. But the market can't tell the difference at the speed it's moving. It doesn't have time to read deeply enough to notice.

So signal and noise look the same from the outside.

I'm not angry at the technology.

I'm angry that I believed the story around it.

The story went: learn the tools early, document everything, build in public, the market rewards signal. The AI era is the great equalizer. Geography doesn't matter. Credentials don't matter. What you build matters.

I'm from Port Harcourt, Nigeria. I believed this story harder than most, because it was the story that said someone like me could compete on a global market through sheer quality of work.

Maybe that was always naive. Maybe the market never actually worked that way and the AI era just made it obvious faster.

But I built real things. I didn't fake the metrics. I didn't cut corners on integrity. And I still have bills I can't pay.

Here's what I think happened — not just to me, but to a whole generation of developers who did everything right:

We optimized for visibility in a market that was optimizing for cheapness.

Think about what's happening on both sides simultaneously. Applicants are using AI to write cover letters at scale. Recruiters are using AI to screen those cover letters at scale. The human beings — the ones with seven live tutorials and a 100% Job Success Score — are somewhere in the middle of a conversation happening entirely between machines. We became the signal that neither side had time to read.

Recruiters aren't searching for seven-tutorial developers on DEV.to. They're using AI tools to screen 500 applications in the time it used to take to read five. The filtering happens before a human sees anything. And the filters weren't built to find people who built honest, production-grade systems and wrote about them carefully.

They were built for keywords.

We were writing essays. They were scanning for tokens.

I don't know what comes next.

I'm not going to pretend I have a reframe ready. I'm not going to tell you to "niche down" or "build an audience" or "the right opportunities are coming." I'm too tired for that and you'd see through it anyway.

What I know is this:

A lot of developers are sitting where I'm sitting right now. Some of them have more tutorials than me. Some have more GitHub stars, more followers, more credentials. And they're also not getting callbacks.

This isn't a skill problem.

It's a market that moved faster than the promise did.

The AI era asked us to learn fast, build fast, publish fast, adapt fast.

We did.

It just didn't tell us that fast was the only thing it valued — and that the moment we got fast enough, it would stop needing us to be fast anymore.

I built real things. I did it honestly.

The bills are still real too.

OpenClaw Burned $5,600 of API Credits in One Month. Here's the Spec Habit That Prevents It.

Daniel Nwaneri — Wed, 22 Apr 2026 18:09:00 +0000

The formula was exact. The assumption was wrong.

I was building vectorize-mcp-worker — a semantic search system on Cloudflare Workers that costs $5/month where alternatives run $130-190. The architecture was clean. Workers AI for embeddings, Vectorize for the index, one Worker handling everything at the edge. I'd spec'd the core search flow, written the populate endpoint, deployed it. Everything worked.

Then I tried to add hybrid search. I gave the agent a clear instruction: extend the search endpoint to support keyword fallback when vector similarity scores drop below 0.7.

It did exactly that. It also rewrote the embedding pipeline to use a different model — bge-base-en-v1.5 instead of bge-small-en-v1.5 — because the larger model "would improve result quality." Technically correct. Completely wrong for my use case. The dimensions changed from 384 to 768. The existing Vectorize index became incompatible. I didn't catch it until I tried to query documents I'd already populated.

The formula was exact. The assumption — that I wanted better quality regardless of index compatibility — was wrong.

That's not an agent failure. That's a spec failure. And it's the most dangerous kind because it looks like progress until production breaks.

I build with Claude Code CLI. That incident is what pushed me toward spec-first development — not as a formality, but as a forcing function. Stop feeding half-formed ideas into agents and hoping they fill in the right gaps. They don't. They fill in a gap, confidently, and move on.

So I built spec-writer — a Claude Code skill that takes a vague feature request and forces it into a structured spec before any code gets written. Goals, non-goals, assumptions, success criteria, edge cases. The stuff that feels like overhead until an agent ships something technically correct and operationally broken.

When OpenClaw dropped, I didn't start building. I started speccing.

The concern everyone has but few articulate clearly

Francis from the DEV community put something plainly in his OpenClaw post this week:

"You have to be careful on how you prompt something to the AI agent."

He's right. His conclusion — wait before using OpenClaw — is understandable but backwards. The problem isn't OpenClaw's autonomy. The problem is going into a session without constraining the ambiguity space first. The agent fills every gap you leave. You just don't get to choose which gaps or how.

The answer isn't caution. It's a spec.

What spec-writer actually does

The skill lives in your Claude Code workflow as /spec-writer. You describe what you want to build — as loosely or as precisely as you want — and it returns a structured spec with explicit goals and non-goals, a list of assumptions it caught in your description, success criteria you can actually verify, and a task breakdown ready to hand to any coding agent.

That last part is the one people underestimate. "A task breakdown ready to hand to any coding agent" sounds like formatting. It isn't. It's the difference between an agent that does what you meant and an agent that does what you said.

Here's what happens when you skip it.

The gap between what you said and what you meant

I wanted to build a household manager on top of Tradclaw — the OpenClaw scaffold Claire Vo shipped last week for family and home operations. School reminders. Meal plans. NEPA generator schedules (that's load-shedding for anyone outside Nigeria). Bedtime stories in Pidgin English.

My first instinct was to clone the repo, read BOOTSTRAP.md, and start the interview process. That's what the scaffold tells you to do.

But I ran /spec-writer first.

The input was roughly: "Build a household AI manager for a Nigerian family using Tradclaw and OpenClaw. School calendar, NEPA schedules, local meal planning, Pidgin bedtime stories."

The spec-writer came back with 11 flagged assumptions. Not suggestions. Flags.

Things like:

"Assumed NEPA schedule is predictable and retrievable — is there an API or does the agent need to infer from household patterns?"
"Assumed 'Pidgin' means Nigerian Pidgin English specifically — clarify dialect and code-switching rules before bedtime story generation."
"Assumed meal planning requires local market integration — is pricing data available, or is this a manual input workflow?"
"Assumed school calendar maps to Nigerian term structure — which state? Federal school calendar differs from state school calendars."

None of these feel like big catches until an agent confidently answers all of them wrong and you're two sessions deep wondering why your "Nigerian family manager" thinks school starts in September like it's England.

What the spec looked like after

After answering the assumption flags, I had something I could actually hand to OpenClaw:

Project: Nigerian Family Household Manager
Scaffold: Tradclaw (https://github.com/ChatPRD/tradclaw)
Agent runtime: OpenClaw

Core objectives (priority order):
1. Daily family brief — school events, power schedule, weather, tasks
2. NEPA/generator schedule management — manual pattern input, 30-min reminders
3. School calendar tracking — Rivers State federal calendar, term dates, exam periods
4. Meal planning — local market context, no external pricing API
5. Bedtime story generation — Lagos Pidgin/English code-switching, age-appropriate

Non-goals:
- No automatic purchasing or financial actions
- No external data calls for power schedules
- No sharing children's data beyond local workspace

Constraints:
- SOUL.md must define explicit boundaries around children's data
- Agent must surface ambiguity instead of resolving silently
- Pidgin dialect: Lagos variety, natural code-switching is correct behavior

Success criteria:
- Daily brief runs without hallucinated school events
- NEPA reminder fires within 5 minutes of predicted outage window
- Bedtime story passes native speaker read test (natural Pidgin rhythm)
- No skill composition errors across calendar + messaging + file-write

Flagged assumptions (resolved):
- School calendar = Rivers State federal (not Lagos state, not UK)
- Generator schedule = household-defined patterns (no API)
- Pidgin = Lagos variety (not Warri, not Kano)
- Meal planning = manual ingredient input (no market price API)

That spec didn't take long to write. But the session that followed it was different from every session I'd run without one.

The agent stayed in scope. When it didn't have enough information, it asked instead of assumed. When I gave it an ambiguous instruction, it surfaced the ambiguity rather than resolving it silently.

Why this matters for OpenClaw specifically

OpenClaw is an agentic system. That's the whole point — you give it a workspace, skills, memory, and it runs. The autonomy is the feature.

But autonomy without constraints is just energy. It goes somewhere. The question is whether that somewhere is where you wanted to go.

Skills in OpenClaw get composed. A household agent might combine a calendar skill, a messaging skill, a file-writing skill, and a web search skill in a single session. Each skill is well-defined. The composition is where your spec either held or didn't.

One commenter on the OpenClaw challenge launch post put it better than I could: "A skill that chains cleanly into other skills — well-defined inputs, clear failure modes — tends to be far more useful in real agent workflows than a monolithic 'do everything' skill."

That's composability. Your spec is what makes composability possible. Without it, you're handing OpenClaw a list of groceries and hoping it knows which store you meant.

The spec-writer habit gives you a document that sits at the top of every OpenClaw session. Not a system prompt. Not a README. A live constraint document the agent can reference when it hits an ambiguous decision point.

That document is the difference between an agent that asks "should I use the family's school calendar or the national one?" and one that picks silently and proceeds.

There's now economic proof for what that gap costs.

Luo Fuli — who leads Xiaomi's MiMo large model team and previously worked at DeepSeek — posted a detailed breakdown in April after Anthropic cut off third-party tools from routing through Claude subscriptions. Her analysis circulated widely in Chinese tech circles. Almost no English coverage picked it up.

Her core finding: a single Claude Max subscriber paying $100/month generated over $5,600 in equivalent API costs in one billing cycle. A 56x subsidy ratio. The mechanism was context management failure — third-party harnesses firing rounds of low-value tool calls as separate API requests, each carrying 100,000-token context windows. Every 3 steps, the harness compressed tool responses to manage the context limit. Every compression rewrote the conversation prefix. Every rewrite killed the cache hit. The model reprocessed the full context from scratch, again and again. Luo's description: "That's not a gap. That's a crater."

The spec is the fix for the crater. When your spec defines what tools the agent is allowed to call, in what order, with what scope — you're not just preventing bad outputs. You're preventing the token burn pattern that makes the economics collapse. Constrained sessions hit cache. Unconstrained sessions don't.

Three assumptions OpenClaw will make if you don't stop it

These aren't hypothetical. They're documented failure patterns from the April 2026 release notes — fixed in recent builds, but revealing about how the agent thinks when it has room to decide.

1. Memory bleeds into the wrong session.

Nested agent runs were inheriting the caller's channel account in shared workspaces. A subagent spawned from one session was running under another session's identity and routing context. The fix (#67508) scoped cross-agent runs to the target agent's bound channel. But the underlying pattern — agent assumes shared context is the right context — is exactly what happens without an explicit spec. Your household agent doesn't know which family member's calendar to update unless you tell it. It picks one.

2. The dreaming loop re-ingests its own output.

OpenClaw's memory system was promoting its own dream narratives back into short-term recall, then re-ingesting them as new memories. The agent was training itself on its own summaries (#64682, #65320). This is context rot. What you spec as "remember meal preferences" becomes "remember OpenClaw's summary of meal preferences" over time. Without a constraint in your spec that defines what counts as a valid memory source, the agent will decide. It will decide wrong.

3. Heartbeat events poisoned later sessions.

Heartbeat, cron, and exec events were overwriting shared session routing metadata. A synthetic heartbeat target was corrupting delivery for subsequent user turns (#66073). For a household manager running on a daily brief schedule, this means your morning reminder context could bleed into the evening meal planning session. Same agent, different task, wrong metadata.

None of these failed because the code was bad. They failed because the agent assumed shared context was safe context. A spec that explicitly defines session boundaries, memory sources, and skill scope constraints would have caught all three before they ran.

The lesson that transfers to every OpenClaw build

Start with what the agent is not allowed to assume.

That sounds backwards. You're building capabilities — why start with limitations? Because the failure mode for agentic systems isn't "agent couldn't do the task." It's "agent did the wrong task fluently."

Run your idea through a spec-writer equivalent before the first session. It takes 20 minutes. The 11 assumption flags I got back saved me from sessions that would have felt productive and built the wrong thing.

OpenClaw is powerful because it composes skills and acts autonomously. Those are also exactly the conditions where bad assumptions compound. One wrong inference in a household manager that runs daily isn't a one-time bug. It's a recurring one.

The spec doesn't constrain the agent. It constrains the ambiguity space the agent operates in. That's a different thing entirely.

I built spec-writer because I got tired of fixing confident mistakes. The repo is public at github.com/dannwaneri/spec-writer. Fork it, adapt it to your workflow, run it before your next OpenClaw session.

Not because it'll stop every wrong assumption. But because the ones it catches before a session are cheaper than the ones you find after.

The Gap Andrej Karpathy Didn't Fill

Daniel Nwaneri — Tue, 21 Apr 2026 14:10:07 +0000

I was debugging a customer support issue and typed the question into Claude.

Perfect answer. Wrong company. The model told me what the general best practice was for the problem category — accurate, well-reasoned, completely useless for my situation. Because the actual answer was in our internal runbook. Three paragraphs written eight months ago by a contractor who's since left. Not in any training set. Not in any index. Just sitting in a Notion page nobody searched anymore.

That's the gap Karpathy didn't fill.

The Wikipedia Argument

Karpathy made a compelling point: LLMs are becoming the new Wikipedia. For general knowledge questions, just ask the model. Why build a retrieval pipeline to answer "what is gradient descent" when GPT-4 already knows it better than your docs ever will?

He's right, for that use case.

But most knowledge work isn't general knowledge questions. It's "what did we decide about the pricing model in Q3" and "which customer reported this exact error last month" and "what's the current state of the integration with Salesforce." None of that is in any model. None of it ever will be.

The moment your question is about your specific context — your decisions, your customers, your code, your history — you need retrieval. The LLM is still doing the reasoning. But it needs your data to reason over.

The Cheap/Expensive Pipeline Problem

The frustrating thing about most RAG tutorials is they hand you the expensive version by default.

Every query embeds. Every query hits the vector index. Every query reruns the full pipeline, even for questions it's already answered. Cache? Optional. Keyword search for exact matches? Not shown. A model worth $0.003/call handling routing decisions? Never comes up.

The result: hobby projects that work fine, production deployments that surprise you at month end.

The routing insight that changed how I think about this: not every query needs vector search. "Show me documents tagged 'finance' from last week" is a SQL query. "Find anything mentioning GPT-4o" is BM25. "What do we know about our churn patterns" is the full vector pipeline. Running everything through embeddings when the first two cases cover maybe 40% of real-world queries is just waste.

V4 of this project classifies query intent first — SQL / BM25 / VECTOR / GRAPH / VISION / OCR — then routes accordingly. In testing, that cuts average embedding cost by 71% compared to running everything through vectors. The expensive path runs when it actually needs to.

What's Actually Running

For anyone who wants the specific stack:

Layer	What's used
Runtime	Cloudflare Workers (TypeScript)
Vector store	Cloudflare Vectorize
Keyword + metadata store	D1 (SQLite at the edge)
Embedding (default)	`@cf/qwen/qwen3-embedding-0.6b` — 1024d
Reranker	`@cf/baai/bge-reranker-base`
Vision / OCR	`@cf/meta/llama-4-scout-17b-16e-instruct`
Knowledge synthesis	`@cf/moonshot/kimi-k2.5`
Query routing	`@cf/meta/llama-3.2-3b-instruct`
MCP transport	Streamable HTTP via Cloudflare Durable Objects

Everything on Cloudflare. No Pinecone account. No OpenAI bill for embeddings. No Redis for caching. The cache is the CF Cache API, which is global and free at reasonable volumes.

Rough cost at 1,000 queries/day with intelligent routing: ~$0.11/month. At 10,000/day: $1–5/month. The Workers free tier absorbs the first 100,000 requests/day with no compute charge.

The Part That Surprised Me

The reflection layer was the thing I didn't expect to matter as much as it does.

Standard RAG retrieves documents. It doesn't learn. Every query goes in cold — no memory of patterns across documents, no synthesis of what the knowledge base collectively knows, no growing sense of what questions remain unanswered.

After every ingest, this system does three things: finds the most semantically related documents already in the index, asks an LLM to synthesise what's new, how it connects to what exists, and what gap remains. That synthesis — one paragraph, structured — gets embedded and stored as doc_type=reflection, with a 1.5× ranking boost in search results.

After every 3 ingests, it consolidates the recent reflections into a doc_type=summary — a compressed view of what the knowledge base has learned across that batch.

The effect: your knowledge base builds a second layer of meaning on top of raw documents. Related concepts get explicitly linked. Contradictions get surfaced. A search that previously returned five independent chunks now might also return a reflection that already synthesised those chunks into one coherent insight.

It's not magic. It's just LLM synthesis run as a background job, stored where retrieval can find it. But the result feels different from a flat document store in a way that's hard to describe until you see it return the reflection instead of the source.

Kimi K2.5 handles the reflection and consolidation by default. It has noticeably better multi-document reasoning than smaller models — worth the slightly higher cost for synthesis quality. Drop in REFLECTION_MODEL=llama-3.2-3b in your env if you want to reduce cost at high volume.

Running It

git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
cd vectorize-mcp-worker
npm install

# Create a 1024d Vectorize index (qwen3-0.6b default)
wrangler vectorize create mcp-knowledge-base --dimensions=1024 --metric=cosine

# Create D1 database
wrangler d1 create mcp-knowledge-db

# Configure, apply schema, deploy
cp wrangler.toml.example wrangler.toml
# paste your database_id into wrangler.toml
wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
wrangler deploy

# Set API key
wrangler secret put API_KEY

Dashboard at https://your-worker.workers.dev/dashboard. OpenAPI spec at /openapi.json for Postman / Bruno / Insomnia.

Claude Desktop config:

{
  "mcpServers": {
    "vectorize": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://your-worker.workers.dev/mcp",
        "--header",
        "Authorization: Bearer YOUR_API_KEY"
      ]
    }
  }
}

Restart Claude Desktop. Six tools appear: search, ingest, ingest_image_url, find_similar_by_url, delete, stats.

What This Is For

This isn't trying to be better than LLMs at general knowledge. It's making LLMs useful on your data — the knowledge that doesn't exist anywhere else, the decisions your team made that aren't in any training set, the documents that answer the questions your customers actually ask.

Most RAG tutorials hand you a demo. This is what happens when you build the thing properly: hybrid search, cross-encoder reranking, intelligent routing, knowledge synthesis, multi-tenancy, rate limiting, caching, a native MCP server, batch ingestion, and a test suite. In one deployable Worker. At boring prices.

The model knows everything except what you know. That asymmetry is the whole problem. Filling it doesn't have to be expensive.

Full source: github.com/dannwaneri/vectorize-mcp-worker

AI is quietly making human experts invisible. I built a tool to stop it.

Daniel Nwaneri — Tue, 14 Apr 2026 17:01:10 +0000

Every time you ask an AI to write code, something disappears.

Not the code — the code shows up fine. What disappears is the trail. The GitHub discussion where someone spent two hours explaining why cursor-based pagination beats offset for live-updating datasets. The Stack Overflow answer from 2019 where one person, after a week of debugging, documented exactly why that approach fails under concurrent writes. The RFC your team wrote six months ago that established the pattern the AI just silently copied.

The AI consumed all of it. The humans who produced it got nothing.

And I don't mean "nothing" philosophically. I mean: no citation in the codebase. No way for a new developer to trace why the code is written the way it is. No signal to the person who wrote the original answer that their work mattered.

Over time, at scale, those people stop contributing. Why maintain a detailed GitHub discussion if AI will summarize it into oblivion and no one will read the original?

This is the quiet cost of AI-assisted development that nobody is measuring. I've been thinking about it for a while, and I built something to address it.

The scenario

A developer joins a team. Six months of AI-assisted codebase. They hit a bug in the pagination logic — cursor-based, unusual implementation, nobody on the team remembers why it was built that way. The original developer who designed it has left.

Old answer: two days of archaeology. git blame points to a commit message that says "fix pagination." The commit before that says "implement pagination." Dead end.

With poc.py trace src/utils/paginator.py, that same developer sees this in thirty seconds:

Provenance trace: src/utils/paginator.py
────────────────────────────────────────────────────────────
  [HIGH]  @tannerlinsley on github
          Cursor pagination discussion
          https://github.com/TanStack/query/discussions/123
          Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • Concurrent write handling — AI chose this arbitrarily

They now know exactly where the pattern came from and — critically — which parts of the code have no traceable human source. That second section is what saves them. The concurrent write handling is where the bug lives. AI made a choice nobody reviewed.

That's what this tool does. Not enforcement first. Archaeology first.

What I built

proof-of-contribution is a Claude Code skill that keeps the human knowledge chain intact inside AI-assisted codebases.

The core idea is simple: every AI-generated artifact should stay tethered to the human knowledge that inspired it. Not as a comment at the top of a file that nobody reads. As a structured, queryable, enforceable record that lives next to the code.

When the skill is active, Claude automatically appends a Provenance Block to every generated output:

## PROOF OF CONTRIBUTION
Generated artifact: fetch_github_discussions()
Confidence: MEDIUM

## HUMAN SOURCES THAT INSPIRED THIS

[1] GitHub GraphQL API Documentation Team
    Source type: Official Docs
    URL: docs.github.com/en/graphql
    Contribution: cursor-based pagination pattern

[2] GitHub Community (multiple contributors)
    Source type: GitHub Discussions
    URL: github.com/community/community
    Contribution: "ghost" fallback for deleted accounts
                  surfaced in bug reports

## KNOWLEDGE GAPS (AI synthesized, no human cited)
- Error handling / retry logic
- Rate limit strategy

## RECOMMENDED HUMAN EXPERTS TO CONSULT
- github.com/octokit community for pagination

The section that matters most is Knowledge Gaps. That's where AI admits what it synthesized without a traceable human source. No other tool I know of produces this. It's the part that turns "the AI wrote it" from a shrug into an auditable fact.

How Knowledge Gaps actually get detected

This is the part worth explaining carefully, because the obvious assumption — that the AI just introspects and reports what it doesn't know — is wrong. LLMs hallucinate confidently. An AI that could reliably detect its own knowledge gaps wouldn't produce knowledge gaps in the first place.

The detection mechanism is different. It's a comparison, not introspection.

When you use spec-writer before building, it generates a structured spec with an explicit assumptions list — every decision the AI is making that you didn't specify, each one impact-rated. That list is the contract: here is every claim this feature rests on.

When the code ships, proof-of-contribution cross-checks the final implementation against that contract. Anything the code does that doesn't map to a spec assumption or a cited human source gets flagged as a Knowledge Gap. The AI isn't grading its own exam. The spec is the answer key.

The result is deterministic. If the retry logic wasn't specified and no human source covers it, the gap appears in the block regardless of how confident the model was when it wrote the code. The boundary holds because it comes from the spec, not from the model's confidence.

This is also why the confidence levels mean something. HIGH means the spec explicitly covered it or the user provided the source directly. MEDIUM means the pattern traces to recognized human-authored work but the exact source isn't pinned. LOW means the model synthesized it — human review strongly recommended before this code goes anywhere near production.

There's a second detection path that doesn't require spec-writer at all. poc.py verify runs Python's built-in ast module against the file and extracts every function definition, conditional branch, and return path. It cross-checks each one against the seeded claims. No API calls. No model confidence. Pure static analysis. When you run it on a file where import-spec was used first, only the assumptions with no resolved citation surface as gaps. When you run it cold, every uncited structural unit surfaces as a baseline. Either way, the AI's confidence at generation time is irrelevant — the boundary comes from the code's actual structure.

Three things the skill does

Provenance Blocks — attached automatically to any generated code, doc, or architecture output. You don't have to ask. It's always there.

Knowledge Graph schema — when you're building a system to track contributions at scale. Claude generates a complete graph schema for Neo4j, Postgres, or JSON-LD. Nodes for code artifacts, human sources, individual experts, AI sessions, and knowledge claims. Edges that let you ask: "who are the humans behind this module?" or "what did @username contribute to this codebase?"

Static analyser (poc.py verify) — runs after the agent builds. Parses the file's structure using Python's AST, cross-checks every function and branch against seeded claims, and reports deterministic Knowledge Gaps. Zero API calls. Exit code 0 means clean, 1 means gaps found — CI-compatible.

HITL Indexing architecture — when you want AI to surface human experts instead of summarizing them. The query interface returns Expert Cards:

Answer: Use cursor-based pagination with GraphQL endCursor.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
HUMAN EXPERTS ON THIS TOPIC
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
👤 @tannerlinsley  (GitHub)
   Expertise signal: 23 contributions on pagination patterns
   Key contribution: github.com/TanStack/query/discussions/123
   Quote: "Cursor beats offset when rows can be inserted mid-page"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Not a summary. A pointer. The human expert stays visible.

Getting started takes one command

I didn't want this to be another tool that requires you to choose a database before you can do anything. The default is SQLite. It works immediately.

# Install the skill
mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git ~/.claude/skills/proof-of-contribution

# Scaffold your project (run once, in your repo root)
python ~/.claude/skills/proof-of-contribution/assets/scripts/poc_init.py

That creates four things:

.poc/provenance.db — SQLite database, local only, gitignored
.poc/config.json — project config, committed
.github/PULL_REQUEST_TEMPLATE.md — PR template with an AI Provenance section
.github/workflows/poc-check.yml — GitHub Action that fails PRs missing attribution

Then you get a local CLI:

python poc.py add src/utils/parser.py    # record attribution interactively
python poc.py trace src/utils/parser.py  # show full human attribution chain
python poc.py report                     # repo-wide provenance health
python poc.py experts                    # top cited humans in your graph

poc.py verify is what catches gaps before they become incidents:

python poc.py verify src/utils/csv_exporter.py

Verify: src/utils/csv_exporter.py
────────────────────────────────────────────────────────────
  Structural units detected : 11
  Seeded claims             : 3
  Covered by cited source   : 2
  Deterministic gaps        : 1

Deterministic Knowledge Gaps (no human source):
  • function: handle_concurrent_writes (lines 47–61)
      Seeded assumption: concurrent write handling — AI chose this arbitrarily

  Resolve: python poc.py add src/utils/csv_exporter.py

poc.py trace is what I use the most for the full attribution picture. This is what it looks like on a real file:

Provenance trace: src/utils/csv_exporter.py
────────────────────────────────────────────────────────────
  [HIGH]  @juliandeangelis on github
          Spec Driven Development at MercadoLibre
          https://github.com/mercadolibre/sdd-docs
          Insight: separate functional from technical spec

  [MEDIUM] @tannerlinsley on github
           Cursor pagination discussion
           https://github.com/TanStack/query/discussions/123
           Insight: cursor beats offset for live-updating datasets

Knowledge gaps (AI-synthesized, no human source):
  • Error retry strategy — no human source cited
  • CSV column ordering — AI chose this arbitrarily

The GitHub Action is for teams that already find the trace valuable

Once you've used poc.py trace enough times that it's saved you real hours — that's when you push the GitHub Action. Not before.

git add .github/ .poc/config.json poc.py
git commit -m "chore: add proof-of-contribution"
git push

After that, every PR gets checked. If a developer submits AI-assisted code without an ## 🤖 AI Provenance section in the PR description, the action fails and posts a comment explaining what's needed.

The opt-out is simple: write 100% human-written anywhere in the PR body and the check skips.

The enforcement works because the tool already saved them hours before they turned it on. The PR check isn't introducing friction — it's standardizing something people already want to do. That's the only version of a mandate that doesn't get gamed.

It works with spec-writer

I built spec-writer first. It turns vague feature requests into structured specs, technical plans, and task breakdowns before the agent starts building. The problem spec-writer solves is ambiguity before the code exists.

proof-of-contribution solves attribution after the code exists.

They connect at the assumption layer. spec-writer generates an assumptions list — every implicit decision the AI made that you didn't specify, impact-rated, with guidance on when to correct it. Each correction can now carry a citation. Each citation becomes a node in the knowledge graph. By the time a developer runs poc.py trace on a finished module, the full chain is visible:

feature request → spec decision → human source → code artifact
                                       ↑
                              poc.py verify closes this loop
                              without asking the AI what it missed

That chain is what I mean when I say AI should be a pointer to human expertise. Not a replacement. A pointer.

Why 2026 is the right time to build this

The tools are mature. Coding agents are shipping code at scale. The question of "who is responsible for this output?" is becoming real — in teams, in code reviews, in enterprise audits.

The provenance infrastructure doesn't exist yet. git blame tells you who committed. It doesn't tell you what human knowledge shaped the decision. That gap is getting wider every month.

proof-of-contribution is one piece of the infrastructure. It's not the whole answer. But it's the piece I could build, and it's the piece I think matters most: keeping the humans whose knowledge powers AI visible in the artifacts AI produces.

Install

mkdir -p ~/.claude/skills
git clone https://github.com/dannwaneri/proof-of-contribution.git ~/.claude/skills/proof-of-contribution

Works with Claude Code, Cursor, Gemini CLI, and any agent that supports the Agent Skills standard.

Repo: github.com/dannwaneri/proof-of-contribution

Drilling boreholes taught me something AI keeps reminding me of.The formula can be perfect. The model can be correct. Yet the result is still dry.Because the assumption about the ground (or the environment) was wrong.

Daniel Nwaneri — Fri, 10 Apr 2026 15:43:30 +0000

Geology lessons on the limits of modeling

Daniel Nwaneri

Apr 10

The Formula Was Exact. The Assumption Was Wrong. That's Not an AI Problem.

#ai #webdev #career #python

Comments 9

5 min read

The Formula Was Exact. The Assumption Was Wrong. That's Not an AI Problem.

Daniel Nwaneri — Fri, 10 Apr 2026 15:07:48 +0000

Your geology will always govern your geophysics.

My lecturer said it once. I wrote it down. I didn't fully understand it yet.

I do now.

What He Meant

We were studying Vertical Electrical Sounding at the Federal University of Technology Owerri. VES is how you read the earth without drilling it — you send current into the ground, measure how it returns, and infer what's down there from the resistivity curves. Clean method. Decades of field use. Textbook technique.

But the method assumes something. It assumes the layers beneath you are horizontal, homogeneous, well-behaved. The formula works perfectly under those conditions. Run your numbers, get your model, trust the output.

Except Nigeria's basement complex isn't horizontal or homogeneous. It's fractured. Laterally variable. Full of structural surprises that don't announce themselves in your data. You can run perfect VES and still drill a dry borehole — not because the method failed, but because you trusted a reading without interrogating the ground it came from.

The geology is always prior. The geophysics only tells you what's there if you already understand the conditions under which it's operating. Skip that step and the output is confidently wrong.

I spent a semester learning this in a classroom. Then I spent six months at the Nigerian Geological Survey Agency watching it happen in the field — solid mineral surveys, real terrain, results that came back and either confirmed your model or told you your assumptions were off.

I didn't know I was learning how to work with AI.

The Same Failure, Different Surface

I've been building production systems in Port Harcourt for years now. Cloudflare Workers, RAG pipelines, MCP servers, edge infrastructure for users where bandwidth costs real money per request.

And I keep watching the same failure repeat, dressed in different code.

Developer gets AI-generated output. Output looks right. Tests pass. Ships to production. Fails — not catastrophically, but wrongly. Subtly. In ways that take days to trace.

The model wasn't broken. The formula was fine. The geology was different.

AI generates for the environment its training data came from — abundant compute, fast connections, forgiving infrastructure, users on fast networks in cities where the grid is reliable. That's the assumed geology. Most of the time, nobody states that assumption. Nobody interrogates it. The output arrives confident and gets treated as ground truth.

When I built a VPN service targeting Nigerian users, the AI-suggested architecture was technically correct and completely wrong. Correct for the assumed geology. Wrong for mine. The difference wasn't a bug you could find with a linter. It was a mismatch between what the model assumed about the world and what the world actually was.

That gap — between assumed geology and actual geology — is where production failures live.

The VES Lesson Nobody Teaches in CS

VES nearly died as a professional discipline in the 1980s and 1990s.

Not because the physics were wrong. The physics were exact. It died because practitioners kept getting bad results — boreholes drilled on confident readings that came back dry. Clients stopped trusting the method. The reputation collapsed.

The post-mortem was brutal in its simplicity: the formula was exact. The assumption was wrong. Geophysicists had been applying a method that required horizontal homogeneity to terrain that wasn't horizontally homogeneous. The model was rigorous. The geology was inconvenient.

They fixed it — better field protocols, more explicit assumption-checking, ground-truthing before committing to a reading. The method recovered. But only after the field admitted that confident output isn't the same as correct output.

We are at that moment with AI.

The models are rigorous. The outputs are confident. And we are shipping to production without interrogating the geology — the actual environment, the actual users, the actual constraints — that the output will have to survive in.

Ben Santora has been stress-testing LLMs with logic puzzles designed to expose reasoning failures. His finding: most models are solvers, not judges. They produce an answer. They don't flag when the assumed conditions don't match the actual problem.

"Knowledge collapse happens when solver output is recycled without a strong, independent judging layer to validate it."

The judging layer is the geologist's job. It always was.

What Field Work Actually Trains

I did my industrial training at NGSA in 2015. Solid mineral surveys. Real field conditions. Mineralogy in the lab, VES in the terrain.

The thing fieldwork does that coursework doesn't is this: it makes the gap between model and ground visible in real time. You take your reading. You record your resistivity curves. You run your interpretation. Then you go back the next day and find out if the borehole hit water or came back dry.

That feedback loop — model, prediction, ground truth, reckoning — is what builds the instinct to hold your interpretation lightly. Not to distrust the method. To distrust the assumption.

When I ran my SEO audit agent against my own published content this month — seven URLs, seven FAILs — I wasn't surprised. I'd built the agent, I knew what it was checking, and I ran it on myself first because that's the only version of a demo I trust. The agent was right. Three freeCodeCamp tutorials had broken meta descriptions. Two DEV.to article titles were too long for Google to render cleanly.

That reflex — interrogate the output before you trust it — isn't something I learned from a JavaScript tutorial. It came from standing in terrain that didn't match the model and having to explain why.

The same thing happened when I shipped The Foundation's clipboard capture in February. The workflow looked right. I documented it, wrote the article, shipped to GitHub. It was broken — capturing only user messages, missing every AI response, missing everything above the visible viewport. I'd reviewed it at the same speed I built it. The geology was inconvenient. I didn't check.

Five days later I wrote publicly: "I launched The Foundation with big plans. But I underestimated the scope." Not in a GitHub issue. On DEV.to. The well came back dry. You say so.

The Reframe

The conversation about AI in software development keeps getting stuck on the wrong question.

Not: is the model capable?

The model is capable. That's not the problem.

The question is: does the model know your geology?

AI-generated code is optimised for an assumed environment. Plenty of RAM. Reliable connectivity. Users on fast networks. Infrastructure that forgives. Most of the time, nobody states this assumption — it's baked into the training data, invisible until the output meets terrain it wasn't built for.

The developers who catch this aren't necessarily the most experienced. They're the ones who learned — somewhere, from something — to name the geology before trusting the reading.

In the field: name the geology before you trust the reading.

In production: name the environment before you trust the output.

Same question. Different surface.

What Non-CS Backgrounds Actually Transfer

The argument I keep hearing: your background doesn't matter, code is code.

It's wrong. And it misses the point.

What transfers from geophysics isn't syntax knowledge. It's the prior question. The one you ask before you trust the output.

CS tracks teach you to evaluate whether the code is correct. They don't train the instinct to ask whether the assumed conditions match the actual ones. That instinct comes from fields where the gap between model and ground is visible, expensive, and immediately yours to own.

The guts come from somewhere. For some people it's painful production failures. For some it's a good mentor. For me it was a lecturer in Owerri who said one sentence I've never stopped thinking about.

Your geology will always govern your geophysics.

The model doesn't know your terrain. That's not a limitation to wait out. It's a gap you have to close yourself — every time, before you ship.