<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Richard Baxter</title>
    <description>The latest articles on Forem by Richard Baxter (@richardbaxter).</description>
    <link>https://forem.com/richardbaxter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3760451%2Fc496ed81-9047-4edc-b07b-2715c9407d05.jpeg</url>
      <title>Forem: Richard Baxter</title>
      <link>https://forem.com/richardbaxter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/richardbaxter"/>
    <language>en</language>
    <item>
      <title>Implementing WebMCP on a Recruitment Website</title>
      <dc:creator>Richard Baxter</dc:creator>
      <pubDate>Mon, 20 Apr 2026 18:57:29 +0000</pubDate>
      <link>https://forem.com/richardbaxter/implementing-webmcp-on-a-recruitment-website-19a</link>
      <guid>https://forem.com/richardbaxter/implementing-webmcp-on-a-recruitment-website-19a</guid>
      <description>&lt;p&gt;&lt;strong&gt;Thinking about what, exactly, the future of a website “looks” like in the agentic era is a challenging proposition. It might be that in most cases, our future viewers/readers/customers can do everything, from their chatbot of preference while never visiting your site.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebMCP is a part of this puzzle. The protocol directs an agent to stop guessing what a button does and start calling tools with typed inputs, via a fundamentally simple tool registration protocol.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, what’s WebMCP? WebMCP is “an emerging W3C standard developed by Google and Microsoft that acts as a browser API to turn websites into interactive tools for AI agents”.&lt;/p&gt;

&lt;p&gt;This is the future and I think it’s very exciting. For fun, I added WebMCP support to &lt;a href="https://yubhub.co" rel="noopener noreferrer"&gt;&lt;strong&gt;YubHub&lt;/strong&gt;&lt;/a&gt;, my recruitment site.&lt;/p&gt;

&lt;p&gt;Today, I’m going to talk about what I learned building it, the design choices that worked, an agent trace that caught a hallucination in a production system (since fixed), and why (in my opinion!) a recruitment site turned out to be an unusually good fit for this protocol.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4hhj8cqvfgqxveqnbed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4hhj8cqvfgqxveqnbed.png" alt="YubHub - our jobs site ready for an AI future" width="800" height="423"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;YubHub in Chrome 146 with the Model Context Tool Inspector open.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why a Recruitment Site?
&lt;/h2&gt;

&lt;p&gt;Because websites new and old are going to have to change rather quickly (and I really enjoy starting an online business from scratch – the test of a true martech marketer!).&lt;/p&gt;

&lt;p&gt;Most of the WebMCP explainers I read while researching this use a checkout flow, a to-do list, or a colour picker. Toys. They work for illustrating the API surface (what can this new thing do?), but without implementing and playing with the protocol yourself I fear the penny simply does not drop.&lt;/p&gt;

&lt;p&gt;The real value of WebMCP emerges when an agent needs to reason about a lot of structured data that’s already behind a nice URL structure – and recruitment sites are built around exactly that shape: job details, salary ranges, skills required, employment locations.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/srC-1pEXdmY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;YubHub already served clean data at predictable URLs (&lt;code&gt;/jobs/skill/figma&lt;/code&gt;, &lt;code&gt;/jobs/at/anthropic&lt;/code&gt;) and emits schema.org JobPosting markup. What it didn’t have was a “contract”. Any agent browsing the site could see the HTML and JSON-LD, I’m sure, but getting an agent to browse your website for an answer via Chrome is the craziest waste of time when ideas as fundamental as the API have been around for decades.&lt;/p&gt;

&lt;p&gt;WebMCP fixes that by publishing the interface explicitly, with typed inputs and structured responses. Prompt in – “find me an X for a Y in Z” – structured answer out. So, for a site whose whole job is “be discoverable”, WebMCP is a big deal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6uq5n234qmpek3v1g2h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6uq5n234qmpek3v1g2h.jpg" alt="Architecture diagram showing a YubHub page exposing tools through navigator.modelContext to a browser AI agent, with a response flowing back" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The basic flow. The page registers tools on &lt;code&gt;navigator.modelContext&lt;/code&gt;, the agent discovers and invokes them, the page replies with structured data. No screen-scraping in between.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What’s WebMCP?
&lt;/h2&gt;

&lt;p&gt;WebMCP is a proposed web standard, co-authored by engineers at Google and Microsoft under the &lt;a href="https://www.w3.org/groups/cg/webmachinelearning/" rel="noopener noreferrer"&gt;W3C Web Machine Learning community group&lt;/a&gt;, that lets a site expose a set of callable tools to an AI agent running in the browser. There are two APIs: imperative (&lt;code&gt;navigator.modelContext.registerTool()&lt;/code&gt; called from JavaScript), and declarative.&lt;/p&gt;

&lt;p&gt;Think of it as a contract the site publishes to any agent that lands on it. Instead of the agent guessing that a div with &lt;code&gt;class="btn-primary"&lt;/code&gt; means “checkout”, the page says: here’s a &lt;code&gt;checkout&lt;/code&gt; tool, here’s what it needs, here’s what you’ll get back.&lt;/p&gt;

&lt;p&gt;That’s a welcome leap from screen-scraping and MCP-based browser control. The Chrome DevTools MCP quickstart benchmarked a simple “set counter to 42” task at 3,801 tokens using screenshots and 433 tokens using WebMCP. That’s an 89% reduction in token use!&lt;/p&gt;

&lt;p&gt;The current spec lives on the W3C community group site (&lt;a href="https://webmachinelearning.github.io/webmcp" rel="noopener noreferrer"&gt;webmachinelearning.github.io/webmcp&lt;/a&gt;), with Chrome’s early-preview announcement and detailed posts on &lt;a href="https://developer.chrome.com/blog/webmcp-epp" rel="noopener noreferrer"&gt;developer.chrome.com/blog/webmcp-epp&lt;/a&gt;. It ships behind a flag in Chrome Canary 146. Google Chrome Labs maintain a reference extension – the Model Context Tool Inspector – that lists the tools any page has registered and lets you call them manually. If you’re planning to build anything with this, you’ll want both installed.&lt;/p&gt;
&lt;h2&gt;
  
  
  How WebMCP Differs From MCP
&lt;/h2&gt;

&lt;p&gt;The names don’t help much, do they? Model Context Protocol (MCP) is &lt;strong&gt;server-side&lt;/strong&gt; – you deploy an MCP server that an agent connects to; it runs in its own process and exposes tools to your AI assistant over JSON-RPC. We have a growing library of our own MCP connectors, like my favourites &lt;a href="https://github.com/houtini-ai/gemini-mcp" rel="noopener noreferrer"&gt;Gemini MCP&lt;/a&gt; and &lt;a href="https://github.com/houtini-ai/lm" rel="noopener noreferrer"&gt;Houtini-LM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebMCP is the browser-side sibling&lt;/strong&gt;. Your tool code runs in the page’s JavaScript context, so it has the user’s cookies, their session, their permissions – everything the user already has access to. There’s no deployment, no auth bridge, no server to pay for. Well, mostly – you still have to maintain the page, obviously, but there’s nothing extra. If your site already authenticates the user, your WebMCP tools inherit that authentication for free.&lt;/p&gt;

&lt;p&gt;The practical consequence: MCP is the right fit for agents that need third-party data (GitHub, Gmail, a database). WebMCP is the right fit for agents that need to interact with a specific site the user has requested.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Set Up
&lt;/h2&gt;

&lt;p&gt;If you have Chrome Canary 146 or higher, you are already set up. The Stable, Beta, and Dev channels of Chrome do not ship with the WebMCP flag, so you have to enable it:&lt;/p&gt;

&lt;p&gt;Open a new tab to &lt;code&gt;chrome://flags/#enable-webmcp-testing&lt;/code&gt; and set “WebMCP for testing” to Enabled. Click Relaunch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcetju0xu174w8mbp1hft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcetju0xu174w8mbp1hft.png" alt="Chrome flags page with the WebMCP for testing flag highlighted and set to Enabled" width="800" height="326"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Enable WebMCP for testing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Next, install the &lt;a href="https://chromewebstore.google.com/detail/webmcp-model-context-tool/gbpdfapgefenggkahomfgkhfehlcenpd" rel="noopener noreferrer"&gt;Model Context Tool Inspector extension&lt;/a&gt; from the Chrome Web Store. This is currently the only way to test anything before you’ve wired an agent in. It lists every tool the current page has registered, shows you their JSON schemas, and lets you execute them manually with whatever input you want. The extension requires a Gemini API key, as it uses Gemini as a natural-language test harness – useful for checking whether your tool descriptions are clear enough for a real LLM to pick the right tool, much like the context carried in tool descriptions in the MCP protocol itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmin2qu97qqxfi5681c2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmin2qu97qqxfi5681c2u.png" alt="Chrome Web Store listing for the WebMCP Model Context Tool Inspector extension" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The WebMCP Inspector extension.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To verify the whole thing is working, open DevTools and run &lt;code&gt;console.log(navigator.modelContext)&lt;/code&gt;. If you see an object come back rather than &lt;code&gt;undefined&lt;/code&gt;, you’re in business. If it’s &lt;code&gt;undefined&lt;/code&gt;, either your flag isn’t set or you’re on the wrong Chrome channel.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting Up the Imperative API
&lt;/h2&gt;

&lt;p&gt;Here’s the minimum viable tool registration from YubHub’s live scaffold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;pageController&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AbortController&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pagehide&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;pageController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;once&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;yubhub_fetch_job&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Fetch a single YubHub job as clean markdown (company, location, salary, work arrangement, full description, apply URL, skills). More token-efficient than the HTML page.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;^job_[a-z0-9_-]+$&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/jobs/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.md&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Accept&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text/markdown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are four things to look at here. First, the feature-detection guard: &lt;code&gt;navigator.modelContext&lt;/code&gt; only exists in Chrome 146 with the flag on, so every public-facing scaffold needs to short-circuit gracefully or you’ll throw on every other browser your users visit from.&lt;/p&gt;

&lt;p&gt;Second, the &lt;code&gt;AbortController&lt;/code&gt; signal, which ties the registration to the page lifecycle (more on that below). Third, the pattern constraint on &lt;code&gt;id&lt;/code&gt; – the spec treats JSON Schema loosely in the browser, so validating in code catches the cases the schema doesn’t. And fourth, the return value is a structured object, not a string – the agent reads it as data rather than trying to parse text/html.&lt;/p&gt;
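&lt;p&gt;As a rough sketch of that defensive check – the helper name and error wording here are my own illustration, not YubHub’s actual code:&lt;/p&gt;

```javascript
// Illustrative sketch: validate tool input in code, because browser-side
// JSON Schema enforcement is loose. Throwing surfaces a structured error
// to the agent instead of firing a fetch at a URL that cannot exist.
const JOB_ID_PATTERN = /^job_[a-z0-9_-]+$/;

function validateJobId(id) {
  if (typeof id !== 'string' || !JOB_ID_PATTERN.test(id)) {
    throw new Error('Invalid job id: expected a job_ slug, got ' + JSON.stringify(id));
  }
  return id;
}
```

You would call this at the top of `execute`, before the fetch, so a hallucinated or malformed id fails fast with a message the agent can act on.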

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53rxdkffyounh5q94fkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53rxdkffyounh5q94fkp.png" alt="Model Context Tool Inspector panel showing the tool definition highlighted in a green box and the agent prompt highlighted in a red box" width="800" height="531"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The Inspector showing the tool’s schema (green box) and the agent’s prompt (red box). You can drive this entirely by hand during development, which is the fastest way to catch bad tool descriptions.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Declarative API Is Free Engineering
&lt;/h2&gt;

&lt;p&gt;I think this is the cleverest part of the WebMCP spec. If you’ve got an HTML form – and of course you do – you can annotate it and get a tool registration for free.&lt;/p&gt;

&lt;p&gt;YubHub’s nav already had a GET form submitting to &lt;code&gt;/jobs/search&lt;/code&gt;. All I did was add three attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;form&lt;/span&gt; &lt;span class="na"&gt;action=&lt;/span&gt;&lt;span class="s"&gt;"/jobs/search"&lt;/span&gt; &lt;span class="na"&gt;method=&lt;/span&gt;&lt;span class="s"&gt;"GET"&lt;/span&gt;
      &lt;span class="na"&gt;toolname=&lt;/span&gt;&lt;span class="s"&gt;"yubhub_search"&lt;/span&gt;
      &lt;span class="na"&gt;tooldescription=&lt;/span&gt;&lt;span class="s"&gt;"Search YubHub's enriched job board by free-text query - title, company, or keyword. Submits a GET to /jobs/search and renders the results page."&lt;/span&gt;
      &lt;span class="na"&gt;toolautosubmit&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"search"&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"q"&lt;/span&gt;
         &lt;span class="na"&gt;toolparamdescription=&lt;/span&gt;&lt;span class="s"&gt;"Job title, company name, or any keyword (2-100 chars)"&lt;/span&gt;
         &lt;span class="na"&gt;minlength=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt; &lt;span class="na"&gt;maxlength=&lt;/span&gt;&lt;span class="s"&gt;"100"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The browser reads the form, builds a JSON Schema from the input fields, and registers it as a tool. When an agent calls it with &lt;code&gt;{q: "python"}&lt;/code&gt;, the browser focuses the form, populates the input, and – because of &lt;code&gt;toolautosubmit&lt;/code&gt; – submits it automatically. Zero JavaScript. Your form already works for humans; now it works for agents.&lt;/p&gt;
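&lt;p&gt;For intuition, the schema the browser derives for that form would look roughly like this. This is my reconstruction – the spec leaves the exact derived shape to the browser:&lt;/p&gt;

```javascript
// Rough reconstruction of the JSON Schema an agent might see for the
// annotated search form above. The constraints come straight from the
// input's HTML attributes; nothing else is hand-written.
const derivedSchema = {
  type: 'object',
  properties: {
    q: {
      type: 'string',
      description: 'Job title, company name, or any keyword (2-100 chars)',
      minLength: 2,   // from minlength="2"
      maxLength: 100, // from maxlength="100"
    },
  },
};
```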

&lt;p&gt;One thing worth flagging: &lt;code&gt;toolautosubmit&lt;/code&gt; is only really safe on read-only operations. Search queries, availability lookups, status checks – fine, let the agent submit. Anything that creates, modifies, or deletes data should leave the flag off, so the human has to click Submit after the agent fills the form.&lt;/p&gt;

&lt;p&gt;And that, I reckon, is how WebMCP adoption will probably start – with sites adding it to their internal search forms and results pages. But I hope people will agree that this undersells what WebMCP can do in the longer term.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context-Scoped Tool Registration
&lt;/h2&gt;

&lt;p&gt;I’ve got a tool called &lt;code&gt;yubhub_current_job&lt;/code&gt;. It takes no arguments, and it just returns the current job’s markdown. Obviously, that only makes sense on my job listing pages, so anywhere else on the site it would be meaningless (worse, it’d be confusing). The scaffold pattern-matches the URL and registers conditionally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;jobMatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;jobs&lt;/span&gt;&lt;span class="se"&gt;\/(&lt;/span&gt;&lt;span class="sr"&gt;job_&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;a-z0-9_-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;+&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;$/i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;jobMatch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentJobId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jobMatch&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;modelContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;yubhub_current_job&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Return clean markdown for the job the user is currently viewing (&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
      &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;currentJobId&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;). Zero-arg shortcut.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/jobs/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;currentJobId&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.md&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentJobId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageController&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;signal&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AbortController ties it to the page lifecycle – when the user navigates away, &lt;code&gt;pagehide&lt;/code&gt; fires, the controller aborts, and all tools registered with its signal drop off the agent’s radar. Importantly, this leaves no leftover closures holding references to the previous page’s state.&lt;/p&gt;
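&lt;p&gt;The lifecycle is easier to see with a toy mock of a signal-scoped registry – this is not the real &lt;code&gt;navigator.modelContext&lt;/code&gt; API, just an illustration of the abort-driven cleanup it gives you:&lt;/p&gt;

```javascript
// Toy mock of signal-scoped tool registration. When the signal aborts
// (as on pagehide), the tool is dropped from the registry automatically.
function makeRegistry() {
  const tools = new Map();
  return {
    registerTool(def, opts = {}) {
      tools.set(def.name, def);
      opts.signal?.addEventListener('abort', () => tools.delete(def.name), { once: true });
    },
    list: () => [...tools.keys()],
  };
}

const registry = makeRegistry();
const pageController = new AbortController();
registry.registerTool({ name: 'yubhub_current_job' }, { signal: pageController.signal });
console.log(registry.list()); // ['yubhub_current_job']
pageController.abort();       // simulate pagehide
console.log(registry.list()); // []
```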

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfy55b9ehyssqbyz03ve.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcfy55b9ehyssqbyz03ve.jpg" alt="Diagram showing three YubHub pages (homepage, job detail, facet archive) with different sets of registered WebMCP tools on each" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Same site, three pages, three different tool menus. The agent sees only the tools that make contextual sense.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Navigation Tools vs Data-Retrieval Tools
&lt;/h2&gt;

&lt;p&gt;There are two kinds of tools you can give an agent, and they create completely different user experiences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Navigation tools&lt;/strong&gt; change the URL on the user’s behalf. They’re the right call when the user actually wants to end up somewhere else – looking at a specific job, browsing a facet, running a search. So the agent calls &lt;code&gt;yubhub_browse_jobs({type: 'skill', slug: 'figma'})&lt;/code&gt;, the tab navigates, the user sees the archive page. The return value is basically a receipt (“I took you here, here’s what’s on the page”).&lt;/p&gt;

&lt;p&gt;The real power, IMO, is this: &lt;strong&gt;Data-retrieval tools&lt;/strong&gt; return parsed data and leave the URL alone. The agent calls them to reason, compare, filter, summarise – and the user stays on whatever page they were already on, quite unaware that the agent just read a hundred jobs to answer a question. &lt;code&gt;yubhub_fetch_facet&lt;/code&gt; returns up to 100 schema.org JobPosting entries; &lt;code&gt;yubhub_shortlist&lt;/code&gt; filters them by salary floor, remote status, title keyword, and hands back a ranked list.&lt;/p&gt;
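&lt;p&gt;To make the shape of a shortlist-style tool concrete, here’s a hedged sketch of the filter-and-rank step over fetched postings. The field names (&lt;code&gt;salaryMin&lt;/code&gt;, &lt;code&gt;remote&lt;/code&gt;) are my assumptions, not YubHub’s actual API:&lt;/p&gt;

```javascript
// Illustrative filter/rank over job entries a data-retrieval tool has
// already fetched. Filters by salary floor, remote status, and title
// keyword, then ranks by salary descending.
function shortlist(jobs, { minSalary = 0, remoteOnly = false, keyword = '' } = {}) {
  const kw = keyword.toLowerCase();
  return jobs
    .filter(j => (j.salaryMin ?? 0) >= minSalary)
    .filter(j => !remoteOnly || j.remote)
    .filter(j => !kw || j.title.toLowerCase().includes(kw))
    .sort((a, b) => (b.salaryMin ?? 0) - (a.salaryMin ?? 0));
}

const jobs = [
  { title: 'Product Manager', salaryMin: 70000, remote: true },
  { title: 'Senior Product Manager', salaryMin: 95000, remote: false },
  { title: 'Designer', salaryMin: 80000, remote: true },
];
const picks = shortlist(jobs, { minSalary: 75000, keyword: 'product' });
// picks contains only the Senior Product Manager role
```

The user never leaves the page; the agent hands back the ranked list as structured data.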

&lt;p&gt;Data-retrieval is what I’m bullish on – it’s a mind-bendingly powerful alternative to the current state of play, where AI-driven crawling and browser actions do the work on the user’s behalf.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qlv54tspmxfsb5ms9ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qlv54tspmxfsb5ms9ad.png" alt="Inspector panel showing a ranked list of product manager job results returned by a WebMCP data-retrieval tool" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;&lt;code&gt;yubhub_shortlist&lt;/code&gt; returning a ranked list of product manager roles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Only shipping navigation tools is the mistake I’d warn you off. It just makes the agent a fancy URL-rewriter – it can answer “take me to X”, but it can’t answer “summarise the options and recommend one” without hauling the whole user along.&lt;/p&gt;

&lt;h2&gt;
  
  
  What WebMCP Can’t Do (Yet)
&lt;/h2&gt;

&lt;p&gt;A browsing context is required. There’s no headless mode – if a tab isn’t open, the tools don’t exist. That rules out batch or scheduled agentic work against a WebMCP site, for now.&lt;/p&gt;

&lt;p&gt;Discoverability is thin. An agent has to visit your page to see your tools. There’s no global registry, no “hey agent, here’s a list of sites with useful WebMCP tools” index. A &lt;code&gt;.well-known/webmcp&lt;/code&gt; manifest has been proposed – and I think it’ll be necessary eventually – but nothing’s shipped yet. Someone is going to come up with an authority directory of WebMCP-supporting brands any minute now.&lt;/p&gt;

&lt;p&gt;The spec moves weekly. Between my first discovery of the docs and shipping my implementation, &lt;code&gt;provideContext&lt;/code&gt; and &lt;code&gt;clearContext&lt;/code&gt; got removed entirely, &lt;code&gt;unregisterTool&lt;/code&gt; came back after being briefly deprecated, and the &lt;code&gt;AbortSignal&lt;/code&gt; handling in &lt;code&gt;registerTool&lt;/code&gt;’s second argument settled into the form above. Track the spec repo and expect to rewrite things. I’m in the preview group, which is very easy to register for.&lt;/p&gt;

&lt;p&gt;If you run a site that has structured data behind predictable URLs – a product catalogue, a job board, a docs site, a media library – WebMCP is worth shipping now, even though browser adoption is near zero. It’s cheap to add, it forces you to think about your tool contracts before agents try to guess them, and it positions the site for when Chrome lifts the flag and Edge follows. When that happens I think there will be quite a fuss about it.&lt;/p&gt;




&lt;p&gt;The post &lt;a href="https://houtini.com/webmcp-on-a-recruitment-website/" rel="noopener noreferrer"&gt;Implementing WebMCP on a Recruitment Website&lt;/a&gt; appeared first on &lt;a href="https://houtini.com" rel="noopener noreferrer"&gt;Houtini&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>webmcp</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>schema</category>
    </item>
    <item>
      <title>Making a Local LLM MCP Server Deterministic: Model Routing, Think-Block Stripping, and the Problems Nobody Warns You About</title>
      <dc:creator>Richard Baxter</dc:creator>
      <pubDate>Tue, 17 Mar 2026 12:48:05 +0000</pubDate>
      <link>https://forem.com/richardbaxter/making-a-local-llm-mcp-server-deterministic-model-routing-think-block-stripping-and-the-problems-5bmj</link>
      <guid>https://forem.com/richardbaxter/making-a-local-llm-mcp-server-deterministic-model-routing-think-block-stripping-and-the-problems-5bmj</guid>
      <description>&lt;p&gt;For some time, I've been experimenting with the idea that by using an MCP server, we can delegate bounded tasks from Claude Code to cheaper local or cloud models (models I run on a local server in LM Studio). It makes sense, why chew through long, repetitive regression testing tasks when this could be directed by claude, but executed by a simpler, arguably more efficient for the task model instead?&lt;/p&gt;

&lt;p&gt;The other worry I have - what if Anthropic added a few zeros to their subscription and half of us had to rethink how we use the flagship models? This is my ongoing experiment. There's none of the "this is how you have to work from now on" pressure I feel every time I read about a new release - I'm just curious to see whether we can get to a point where Claude orchestrates and delegates to whatever local model(s) you have available, for the sake of token efficiency. It might matter one day!&lt;/p&gt;

&lt;p&gt;My v1 was simple - one model, one endpoint, and instructions for Claude to think about handover for specific tasks. After not very long I'd rewritten most of it - as it turns out, Claude doesn't want to share much work.&lt;/p&gt;

&lt;p&gt;Today, I'm going to give you a tour of my work so far (this is an experimental project; I welcome honest feedback, forks and pull requests). The post-mortem covers what broke, what wasn't cutting the mustard, and how that influenced where I've got to: model routing, think-block stripping, SQLite model caching, and per-model prompt tuning. All TypeScript, all open source.&lt;/p&gt;

&lt;p&gt;I wrote about &lt;a href="https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4"&gt;using SQLite as a context saver in MCP servers&lt;/a&gt; a couple of weeks ago and the core argument there was: don't cram raw API data into your LLM's context window. It doesn't scale. As the dataset grows, the token cost balloons, the signal-to-noise ratio collapses, and the model starts forgetting because of incomplete, compacted data. "Memory" - just stuffing everything in context or a SQLite dependency and hoping the model sorts it out - is not architecture. It's a different flavour of context bloat, delivered through tool descriptions and database entries of what the model did.&lt;/p&gt;

&lt;p&gt;This is the same problem showing up in a completely different place. When your MCP server needs to know what twelve different local models are good at - their strengths, weaknesses, best task types, context lengths, quantisation levels - you can either dump all of that into every conversation, or you can cache it locally and query what you need. One approach costs hundreds of tokens per call and gets worse as you add models. The other costs far less.&lt;/p&gt;

&lt;p&gt;The MCP server is &lt;a href="https://github.com/houtini-ai/lm" rel="noopener noreferrer"&gt;houtini-lm&lt;/a&gt;. It sits between Claude Code and whatever OpenAI-compatible endpoint you've got running - LM Studio, Ollama, vLLM, cloud APIs, whatever speaks &lt;code&gt;/v1/chat/completions&lt;/code&gt;. Claude keeps on top of the reasoning. The cheap(er) model handles the outputs.&lt;/p&gt;

&lt;p&gt;The hurdles... some of which I haven't quite overcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  The routing problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvku6p3d2kf7gwqzgbc22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvku6p3d2kf7gwqzgbc22.png" alt="Comparison diagram showing v1 with one model versus v2.7 with task-based routing" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v1 assumed you had one model loaded. You'd set &lt;code&gt;LM_STUDIO_URL&lt;/code&gt;, maybe override &lt;code&gt;LM_STUDIO_MODEL&lt;/code&gt;, and every delegation call went to the same place. Fine if you're running Qwen Coder and only delegating code tasks.&lt;/p&gt;

&lt;p&gt;Then I loaded GLM-4 alongside Qwen Coder because I wanted a general-purpose model for chat-style delegation - code explanations, content rewrites, commit messages. And immediately hit the problem: houtini-lm had no concept of "this is a code task, use the coder model" versus "this is a chat task, use the general model." Everything went to whatever model ID was in the config.&lt;/p&gt;

&lt;p&gt;So I wrote a router. Here's the core of &lt;code&gt;routeToModel&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;TaskType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embedding&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;routeToModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;RoutingDecision&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;listModelsRaw&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;loaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;loaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hints&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getPromptHints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bestTaskTypes&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[]).&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Bonus: code-specialised models for code tasks&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getModelProfile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;family&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;coder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
      &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Bonus: larger context for analysis tasks&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;taskType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getContextLength&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;bestScore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;bestScore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;score&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nx"&gt;bestModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;bestModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getPromptHints&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bestModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things worth noting about this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. It queries the LM Studio &lt;code&gt;/v1/models&lt;/code&gt; endpoint at routing time.&lt;/strong&gt;  This sounds expensive but the endpoint returns in under 5ms locally, and it means model hot-swaps in LM Studio are picked up immediately (even if the models can take their own sweet time to load...). I tried caching this and it caused more problems than it solved - we don't want a stale model list when you unload something.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. It can't JIT-load models.&lt;/strong&gt; The MCP SDK has a hard ~60-second timeout on tool calls. Loading a model in LM Studio takes minutes. So if the best model for a task isn't loaded, the router uses the best available one and returns a suggestion string: "💡 qwen3-coder-next is downloaded and better suited for code tasks - ask the user to load it in LM Studio." Claude surfaces this to the user. Not ideal, but the alternative (silent timeout) is worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The scoring is deliberately simple.&lt;/strong&gt; Current version: does the model's &lt;code&gt;bestTaskTypes&lt;/code&gt; include this task type? 10 points. Is it a coder model and this is a code task? 5 bonus. Large context and analysis task? 2 bonus. The highest score wins.&lt;/p&gt;
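&lt;p&gt;The no-JIT fallback from point 2 can be sketched as a pure function - the &lt;code&gt;ModelInfo&lt;/code&gt; shape and the exact suggestion string here are simplified assumptions rather than the houtini-lm internals:&lt;/p&gt;

```typescript
// Sketch: pick the best *loaded* model, but surface a hint when a
// better downloaded-but-unloaded one exists. Shapes are assumptions.
interface ModelInfo {
  id: string;
  state: 'loaded' | 'not-loaded';
}

function pickWithSuggestion(
  models: ModelInfo[],
  score: (m: ModelInfo) => number
): { modelId: string; suggestion?: string } {
  const byScore = (a: ModelInfo, b: ModelInfo) => score(b) - score(a);
  const loaded = models.filter((m) => m.state === 'loaded').sort(byScore);
  const unloaded = models.filter((m) => m.state !== 'loaded').sort(byScore);

  const best = loaded[0]; // assumes at least one model is loaded
  const better = unloaded[0];
  // Never JIT-load: the MCP tool-call timeout would expire long before
  // a multi-minute model load completes. Suggest a manual load instead.
  if (better && score(better) > score(best)) {
    return {
      modelId: best.id,
      suggestion: `💡 ${better.id} is downloaded and better suited for this task - ask the user to load it in LM Studio.`,
    };
  }
  return { modelId: best.id };
}
```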

&lt;h2&gt;
  
  
  Think-block stripping
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx80sy7qnj2a6u0ubc60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx80sy7qnj2a6u0ubc60.png" alt="Diagram showing think-block stripping - raw model output with think tags being cleaned to just the useful response" width="768" height="1376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GLM-4, Qwen3, Nemotron - these models always emit internal chain-of-thought reasoning wrapped in &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags before producing their actual response. When I first loaded GLM-4, every delegation call came back with 400+ tokens of the model arguing with itself before the 50-token answer I actually needed. Watching GLM have a discussion with itself tells you a lot about the model - it doesn't seem very confident and genuinely second-guesses its own answers.&lt;/p&gt;

&lt;p&gt;The fix is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip &amp;lt;think&amp;gt;...&amp;lt;/think&amp;gt; reasoning blocks&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;[\s\S]&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;&amp;lt;&lt;/span&gt;&lt;span class="se"&gt;\/&lt;/span&gt;&lt;span class="sr"&gt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// closed blocks&lt;/span&gt;
&lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="sr"&gt;*/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                   &lt;span class="c1"&gt;// orphaned opening tag&lt;/span&gt;
&lt;span class="nx"&gt;cleanContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cleanContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's 2 lines of regex. But that second line took me a while to pin down. Sometimes the model runs out of generation tokens mid-think-block. You get &lt;code&gt;&amp;lt;think&amp;gt;The user wants test stubs for...&lt;/code&gt; and then the actual output, with no closing &lt;code&gt;&amp;lt;/think&amp;gt;&lt;/code&gt;. The first regex doesn't match because there's no closing tag. So I was getting leaked reasoning mixed into the response, which caused obvious problems.&lt;/p&gt;

&lt;p&gt;My orphaned tag regex catches that - it's not elegant, but it works, and it was a hard-won breakthrough.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;emitsThinkBlocks&lt;/code&gt; flag in the prompt hints system means this only runs for models that produce think blocks. There's no unnecessary processing for LLaMA or other instruct models that don't use this pattern.&lt;/p&gt;
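&lt;p&gt;Wrapped up with that flag, the whole post-processing step looks something like this (the function name is mine; the behaviour matches the regexes above):&lt;/p&gt;

```typescript
// Strip <think>...</think> reasoning, including a truncated opening tag,
// but only for models flagged as emitting think blocks.
function stripThinkBlocks(content: string, emitsThinkBlocks: boolean): string {
  if (!emitsThinkBlocks) return content; // no-op for plain instruct models
  let clean = content.replace(/<think>[\s\S]*?<\/think>\s*/g, ''); // closed blocks
  clean = clean.replace(/^<think>\s*/, ''); // orphaned opening tag (ran out of tokens)
  return clean.trim();
}
```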

&lt;h2&gt;
  
  
  SQLite model cache (sql.js WASM, again)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xjucdh38dcswtuj0vj2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xjucdh38dcswtuj0vj2.png" alt="Architecture diagram showing how the SQLite model cache works - checking static profiles first, falling back to HuggingFace API, caching in sql.js" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the argument from my &lt;a href="https://dev.to/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4"&gt;previous SQLite post&lt;/a&gt; loads back into reader context... The router needs to know what each model is good at. You could stuff model profiles into the system prompt - strengths, weaknesses, best task types for every loaded model. With two models that's maybe 300 tokens. With twelve it's 2,000. And it's the same 2,000 tokens on every single tool call, burning context on metadata the model has already seen. That's the "memory as architecture" trap again: it works at small scale and falls apart the moment your data grows.&lt;/p&gt;

&lt;p&gt;So I did what I did with the Search Console data example: cache it locally, query what you need, return only what's relevant to this specific routing decision.&lt;/p&gt;

&lt;p&gt;For Qwen Coder or GLM-4, I've got hand-written (copy and pasted) profiles - curated descriptions, strengths, weaknesses, and which task types best suit the model. But what about when someone loads a random GGUF they downloaded from HuggingFace? Let's query that and store it in the db.&lt;/p&gt;

&lt;p&gt;The cache works in two tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Static profiles&lt;/strong&gt; - regex-matched against the model ID or architecture field. I maintain these by hand for model families I've used recently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MODEL_PROFILES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;RegExp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ModelProfile&lt;/span&gt; &lt;span class="p"&gt;}[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/qwen3-coder|qwen3.*coder/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Qwen3 Coder&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Code-specialised model with agentic capabilities.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;strengths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code generation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code review&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;debugging&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test writing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;weaknesses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;non-code creative tasks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;bestFor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code generation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code review&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test stubs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;refactoring&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// ... 12 more families&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2: SQLite cache with HuggingFace auto-profiling&lt;/strong&gt; - if no static profile matches, the server queries HuggingFace's free API, parses the model card, and generates a profile. This gets cached in a SQLite database with a 7-day TTL.&lt;/p&gt;
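&lt;p&gt;The auto-profiling is mostly a mapping exercise: HuggingFace's model API returns a &lt;code&gt;pipeline_tag&lt;/code&gt; (&lt;code&gt;text-generation&lt;/code&gt;, &lt;code&gt;feature-extraction&lt;/code&gt; and so on), and the server turns that plus the model ID into a rough best-task guess. A simplified sketch - the real table is richer than this:&lt;/p&gt;

```typescript
type TaskType = 'code' | 'chat' | 'analysis' | 'embedding';

// Map a HuggingFace pipeline_tag plus the model id to likely task types.
// The heuristics here are a simplified sketch, not the full houtini-lm table.
function inferTaskTypes(pipelineTag: string | null, modelId: string): TaskType[] {
  if (pipelineTag === 'feature-extraction' || pipelineTag === 'sentence-similarity') {
    return ['embedding'];
  }
  const id = modelId.toLowerCase();
  // Name-based hint: coder-branded models get routed to code first
  if (id.includes('coder') || id.includes('codellama')) {
    return ['code', 'chat'];
  }
  // Default for text-generation models: general chat and analysis
  return ['chat', 'analysis'];
}
```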

&lt;p&gt;I chose &lt;code&gt;sql.js&lt;/code&gt; (pure WASM) instead of &lt;code&gt;better-sqlite3&lt;/code&gt;. For Better Search Console I used &lt;code&gt;better-sqlite3&lt;/code&gt; because it was the data layer - hundreds of thousands of rows, complex queries, WAL mode, the lot. For houtini-lm, the cache holds maybe 20 rows. The priority is zero native dependencies. &lt;code&gt;sql.js&lt;/code&gt; compiles to WASM, which means &lt;code&gt;npx -y @houtini/lm&lt;/code&gt; works on any machine without needing a C++ toolchain. No &lt;code&gt;node-gyp&lt;/code&gt; and no build failures on Windows. I do wince at whether this would work without a fix on a Mac, because I don't own one and perhaps never will. Still, I've had &lt;code&gt;better-sqlite3&lt;/code&gt; fail on three separate machines because of &lt;code&gt;node-gyp&lt;/code&gt; version mismatches - none of that hassle is worth it for such a small 20-100 row resource.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sql.js&lt;/code&gt; is slower for heavy workloads. For a 20-row lookup table, though, the speed difference is not noticeable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Schema&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;SCHEMA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
  CREATE TABLE IF NOT EXISTS model_profiles (
    model_id TEXT PRIMARY KEY,
    hf_id TEXT,
    pipeline_tag TEXT,
    architectures TEXT,
    license TEXT,
    downloads INTEGER,
    likes INTEGER,
    library_name TEXT,
    family TEXT,
    description TEXT,
    strengths TEXT,
    weaknesses TEXT,
    best_for TEXT,
    fetched_at INTEGER,
    source TEXT DEFAULT 'huggingface'
  )
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fetched_at&lt;/code&gt; timestamp drives the 7-day TTL. After a week, the cache re-fetches from HuggingFace in case the model card has been updated. In practice this almost never matters, but I've had at least one case where a model's pipeline tag changed after a major update and the stale cache was routing it incorrectly.&lt;/p&gt;
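&lt;p&gt;The staleness check itself is one comparison against &lt;code&gt;fetched_at&lt;/code&gt; (a sketch; the seven-day window matches the TTL above):&lt;/p&gt;

```typescript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// A cached profile is stale once fetched_at (ms since epoch) is over a week old,
// at which point the server re-fetches the HuggingFace model card.
function isStale(fetchedAt: number, now: number = Date.now()): boolean {
  return now - fetchedAt > SEVEN_DAYS_MS;
}
```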

&lt;h2&gt;
  
  
  Per-model prompt hints
&lt;/h2&gt;

&lt;p&gt;This is the bit that makes the difference to output quality.&lt;/p&gt;

&lt;p&gt;Each model family has its own set of annoying quirks. GLM-4 writes a paragraph of introduction before every response unless you tell it not to. Even if you do, there's a chance it'll ignore you - and Llama is incredibly hard to coax into emitting only what you asked for. Qwen Coder is best at temperature 0.1 for code, but that's far too low for chat tasks. Nemotron handles structured output well but needs explicit "no preamble" instructions.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PromptHints&lt;/code&gt; interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PromptHints&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// temperature for code generation&lt;/span&gt;
  &lt;span class="nl"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;// temperature for chat/analysis&lt;/span&gt;
  &lt;span class="nl"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// injected into system prompt&lt;/span&gt;
  &lt;span class="nl"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// flag for think-block stripping&lt;/span&gt;
  &lt;span class="nl"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;embedding&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the per-model configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// GLM-4: great general model, but verbose without constraints&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/glm&lt;/span&gt;&lt;span class="se"&gt;[&lt;/span&gt;&lt;span class="sr"&gt;- &lt;/span&gt;&lt;span class="se"&gt;]?&lt;/span&gt;&lt;span class="sr"&gt;4/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Respond with ONLY the requested output. No step-by-step reasoning. No numbered analysis. No preamble. Go straight to the answer.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;analysis&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;

&lt;span class="c1"&gt;// Qwen Coder: focused, needs minimal constraint&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sr"&gt;/qwen3.*coder|qwen.*coder/i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;hints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;codeTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;chatTemp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;outputConstraint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Be direct. Output only what was asked for.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;emitsThinkBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;bestTaskTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;outputConstraint&lt;/code&gt; string gets injected into the system prompt before every delegation call. Without it, GLM-4 would generate something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let me analyze this code step by step.

Step 1: First, I'll examine the function signature...
Step 2: Next, I'll consider the edge cases...
Step 3: Now I'll write the tests...

Here are the tests:
// actual tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the constraint, you get just the tests. That's 200+ tokens of preamble saved on every single call. Over a day of heavy delegation, that adds up.&lt;/p&gt;
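&lt;p&gt;The lookup-and-inject step can be sketched like this. This is illustrative TypeScript only - &lt;code&gt;resolveHints&lt;/code&gt; and &lt;code&gt;buildSystemPrompt&lt;/code&gt; are hypothetical names, not the package's real API - but it shows the shape of matching a model ID against the profile patterns and prepending the constraint:&lt;/p&gt;

```typescript
// Hypothetical sketch of the pattern-to-prompt lookup; the real
// implementation in @houtini/lm may differ in naming and detail.
interface ModelHints {
  codeTemp: number;
  chatTemp: number;
  outputConstraint: string;
  emitsThinkBlocks: boolean;
}

const PROFILES: { pattern: RegExp; hints: ModelHints }[] = [
  {
    pattern: /glm[- ]?4/i,
    hints: {
      codeTemp: 0.1,
      chatTemp: 0.3,
      outputConstraint: "Respond with ONLY the requested output.",
      emitsThinkBlocks: true,
    },
  },
];

function resolveHints(modelId: string): ModelHints | undefined {
  const match = PROFILES.find((p) => p.pattern.test(modelId));
  return match ? match.hints : undefined;
}

function buildSystemPrompt(basePrompt: string, modelId: string): string {
  const hints = resolveHints(modelId);
  // Inject the constraint ahead of the task prompt on every delegation call
  return hints ? hints.outputConstraint + "\n\n" + basePrompt : basePrompt;
}
```

&lt;p&gt;Unknown model IDs simply fall through with the base prompt untouched, so an unprofiled model still works - it just doesn't get the constraint.&lt;/p&gt;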

&lt;p&gt;I tested this properly - ran the same twenty delegation calls with generic settings versus model-specific hints. The tuned version produced usable output on eighteen of them first try (although hallucination remained a massive problem with GLM 4.7); the generic setup managed twelve. Eight out of twenty calls needing a retry doesn't sound terrible, but each retry is another round-trip to the model and another chunk of context in the Claude conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance measurement: TTFT and tok/s
&lt;/h2&gt;

&lt;p&gt;Every response now includes timing data in the footer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model: qwen/qwen3-coder-next | 145→248 tokens (38 tok/s, 340ms TTFT) | Session: 12,450 offloaded across 23 calls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TTFT (time to first token) and tokens per second are measured from the SSE stream. A slower model was running at 12 tok/s on certain prompts - painfully slow for something that normally hits 100-150. The TTFT metric made the problem obvious: the model was spending 8 seconds thinking before it started generating, which meant it was doing extended reasoning (the think blocks again) on prompts that shouldn't have needed it.&lt;/p&gt;

&lt;p&gt;Turned out it was temperature. At 0.3, certain prompts triggered the model's reasoning mode, so dropping to 0.1 for code tasks fixed it. Without the TTFT data, I'd probably still be wondering why some calls take 15 seconds and others take 2.&lt;/p&gt;
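&lt;p&gt;The two metrics fall out of three timestamps taken while consuming the stream. A minimal sketch (hypothetical helper names, not the package's actual code):&lt;/p&gt;

```typescript
// Timestamps recorded while consuming the SSE stream (all ms since epoch)
interface StreamTiming {
  requestSentAt: number;    // when the request went out
  firstTokenAt: number;     // when the first SSE chunk arrived
  lastTokenAt: number;      // when the stream closed
  completionTokens: number; // tokens generated by the model
}

// Time to first token: how long the model "thought" before emitting anything
function ttftMs(t: StreamTiming): number {
  return t.firstTokenAt - t.requestSentAt;
}

// Throughput over the generation window only, excluding the TTFT wait
function tokensPerSecond(t: StreamTiming): number {
  const generationMs = t.lastTokenAt - t.firstTokenAt;
  if (generationMs > 0) {
    return Math.round((t.completionTokens / generationMs) * 1000);
  }
  return 0; // guard for single-chunk responses
}
```

&lt;p&gt;Separating the two is the whole point: a long TTFT with normal tok/s means hidden reasoning before output, while a low tok/s with normal TTFT points at generation speed itself.&lt;/p&gt;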

&lt;p&gt;The session totals (tokens offloaded, call count) serve a different purpose. Watching the counter climb to 40,000 offloaded tokens in a heavy coding day made the savings seem tangible rather than theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;TypeScript, distributed via npm. Uses &lt;code&gt;sql.js&lt;/code&gt; (WASM) for the model cache, SSE streaming for the 55-second soft timeout, and the MCP SDK for the protocol layer. Apache-2.0 licence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/@houtini/lm" rel="noopener noreferrer"&gt;@houtini/lm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/houtini-ai/lm" rel="noopener noreferrer"&gt;houtini-ai/lm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building MCP servers and running into similar problems with model determinism, I'd genuinely like to hear what patterns you've landed on. The routing and prompt hints system works for my setup but I'm under no illusion it's the only approach. PRs welcome - especially model profiles for families I haven't tested.&lt;/p&gt;

&lt;p&gt;PS: in case I didn't mention it, you can use this to add any external cloud model with an OpenAI-compatible API endpoint. Comments welcome - it'd be nice to see this through to being fully useful and a genuine sidekick for Claude.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>discuss</category>
      <category>llm</category>
    </item>
    <item>
      <title>SQLite as an MCP context saver: stop cramming raw API data into your LLM</title>
      <dc:creator>Richard Baxter</dc:creator>
      <pubDate>Wed, 04 Mar 2026 14:40:33 +0000</pubDate>
      <link>https://forem.com/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4</link>
      <guid>https://forem.com/richardbaxter/sqlite-as-an-mcp-context-saver-stop-cramming-raw-api-data-into-your-llm-2oj4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Most MCP servers dump raw API responses into the conversation. I've been using SQLite as a dependency in my MCP to sync data from my Google Search Console account locally and query it with SQL - here's the pattern and a working implementation for Google Search Console.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm fascinated by the utility and insight an LLM can provide when the data is presented correctly. You're not going to get great results from consumer-grade AI with 100,000 lines of JSON. A SQL query against a populated local database, however, is a different matter. Today's post covers why and how I use SQLite to give Claude a fighting chance of doing analysis, accurately, without the context bloat or hallucination.&lt;/p&gt;

&lt;p&gt;I've been shipping &lt;a href="https://github.com/houtini-ai" rel="noopener noreferrer"&gt;MCP servers since late 2025&lt;/a&gt;. Every time I hook one up to an API with any volume of data, the same problem shows up unless I manage paging and chunking. You call the API, and if you're not limiting the request you get &lt;em&gt;all the rows&lt;/em&gt; back. Claude, or your LLM of choice, dutifully analyses the data as if it's the whole story. But a decent site generates tens of thousands of query/page combinations per month - so that 1,000-row response is actually a small sample, and even that might be too much to deal with efficiently.&lt;/p&gt;

&lt;p&gt;MCP gives AI models access to APIs - we know this. But most APIs spit back volumes of data that don't fit in a conversation, or much of the data is irrelevant to answering the question. I found that after a couple of GSC API calls, about 80% of the token use was wasted on raw data, leaving maybe 20% for the model to actually think with. The analysis that comes back is built on incomplete data, but it reads with full confidence - and that's a tricky combination to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  My approach
&lt;/h2&gt;

&lt;p&gt;The fix is less about prompting and more about architecture.&lt;/p&gt;

&lt;p&gt;Two phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntnb4xu86akw6ttc917u.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntnb4xu86akw6ttc917u.jpg" alt="Comparison of the usual API-to-LLM approach versus the SQLite pattern" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sync&lt;/strong&gt; - pull your complete dataset from the API into a local SQLite database. No pagination caps; rate limits, retries, and deduplication are handled in the background. This becomes your local data warehouse. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; - your MCP tools translate prompts into SQL queries defined in the tool description. The MCP returns aggregated results, not raw dumps.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So in my example, when you ask "which pages are losing traffic?", Claude runs a targeted query that returns maybe 20 rows instead of trying to work through 50,000 raw data points crammed into the conversation. It's like handing someone the analyst's summary rather than the raw spreadsheet.&lt;/p&gt;

&lt;p&gt;You get accuracy because your SQL hits the complete dataset, not a 1,000-row sample. And you get efficiency because you're sending 20 aggregated rows into the context window instead of 50,000 raw ones.&lt;/p&gt;

&lt;p&gt;I reckon this approach has legs beyond just Search Console. Sync locally, query with SQL, return aggregated results. It addresses the context window problem at an architectural level rather than hoping your API response fits.&lt;/p&gt;

&lt;p&gt;The thing that clicked for me building these is that your tools don't have to be thin API wrappers. You can build something that syncs an entire dataset in the background, builds indexes, prunes stale data, and then exposes simple query endpoints. Your model couldn't care less about pagination or rate limits or auth tokens - it asks a question and gets something useful back. Once you reach that point, it's not a party trick - it's just infrastructure.&lt;/p&gt;

&lt;p&gt;MCP is a lot more powerful than some of us might realise! &lt;/p&gt;

&lt;h2&gt;
  
  
  Under the hood: Better Search Console
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/houtini-ai/better-search-console" rel="noopener noreferrer"&gt;Better Search Console&lt;/a&gt; as a working implementation of this pattern, specifically for Google Search Console data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uii1h5hmhuyylrsy9cc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uii1h5hmhuyylrsy9cc.jpg" alt="Architecture diagram showing how Better Search Console syncs GSC data through SQLite" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The sync engine
&lt;/h3&gt;

&lt;p&gt;The Google Search Console API caps you at 1,000 rows per request and 25,000 rows per date range. For anything beyond a hobby blog, that's nowhere near enough. So the sync engine breaks your date range into 90-day chunks and fetches them in parallel - three concurrent streams, each paginating through its chunk until the API returns fewer rows than requested.&lt;/p&gt;

&lt;p&gt;As an aside, you can trigger this as a CLI command if it's a particularly massive job. &lt;/p&gt;

&lt;p&gt;Writes are serialised though. SQLite doesn't love concurrent writes, so all three streams funnel into a single write queue. Every row hits the database via &lt;code&gt;INSERT OR REPLACE&lt;/code&gt; keyed on date + query + page + device + country.&lt;/p&gt;
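&lt;p&gt;The write queue can be sketched as a simple promise chain - illustrative TypeScript only, with &lt;code&gt;writeBatch&lt;/code&gt; standing in for the real &lt;code&gt;INSERT OR REPLACE&lt;/code&gt; step:&lt;/p&gt;

```typescript
const written: string[] = []; // stand-in for the SQLite table

// The real version does INSERT OR REPLACE keyed on
// date + query + page + device + country; here we just append.
async function writeBatch(rows: string[]) {
  for (const row of rows) {
    written.push(row);
  }
}

let tail = Promise.resolve(); // the single write queue

function enqueueWrite(rows: string[]) {
  // Chain every batch onto the previous one, so three parallel fetch
  // streams still produce exactly one writer and writes never interleave
  tail = tail.then(() => writeBatch(rows));
  return tail;
}
```

&lt;p&gt;Each fetch stream calls &lt;code&gt;enqueueWrite&lt;/code&gt; as batches arrive; SQLite only ever sees one write in flight at a time.&lt;/p&gt;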

&lt;p&gt;I threw 800K rows at it (six months from a client's property) and the initial sync took about four minutes. After that, syncs are incremental - it picks up from where it left off.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQLite config
&lt;/h3&gt;

&lt;p&gt;Default SQLite settings are fairly conservative, so for this workload I've adjusted a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WAL mode&lt;/strong&gt; - write-ahead logging so reads don't block writes during sync&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;64MB cache&lt;/strong&gt; - keeps hot pages in memory instead of hitting disk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4GB memory-mapped I/O&lt;/strong&gt; - lets SQLite memory-map the database file directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10 covering indexes&lt;/strong&gt; - one for each common query pattern (query lookups, page lookups, date ranges, country breakdowns, the lot)&lt;/li&gt;
&lt;/ul&gt;
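&lt;p&gt;Those settings map onto standard SQLite pragmas - roughly like this (a sketch; the exact values the server sets may differ slightly):&lt;/p&gt;

```sql
PRAGMA journal_mode = WAL;     -- reads don't block writes during sync
PRAGMA cache_size = -64000;    -- negative value is KiB, so roughly 64MB
PRAGMA mmap_size = 4294967296; -- 4GB memory-mapped I/O
```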

&lt;p&gt;The result is that queries against 800K rows come back in under a second. SQLite with proper indexes is surprisingly fast for this kind of analytical work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data retention
&lt;/h3&gt;

&lt;p&gt;Left unchecked, six months of GSC data for a busy site can balloon to millions of rows. The server auto-prunes after every sync: keeps the last 90 days in full, then drops zero-click rows from older data. The thinking is - if a query/page combination got zero clicks three months ago, it's probably noise, but anything with actual clicks stays around indefinitely.&lt;/p&gt;
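&lt;p&gt;An illustrative version of that prune (the server's actual statement may differ):&lt;/p&gt;

```sql
-- Keep the last 90 days in full, then drop zero-click rows from older data
DELETE FROM search_analytics
WHERE clicks = 0
  AND julianday('now') - julianday(date) > 90;
```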

&lt;h3&gt;
  
  
  Query sandboxing
&lt;/h3&gt;

&lt;p&gt;I spent longer on this than anything else. The &lt;code&gt;query_gsc_data&lt;/code&gt; tool lets Claude write freeform SQL against your data, which is powerful - but you obviously don't want it running &lt;code&gt;DROP TABLE&lt;/code&gt; or &lt;code&gt;DELETE FROM&lt;/code&gt;. So every query gets regex-checked before execution, and anything that matches INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, or ATTACH gets blocked. There's also a 10,000-row automatic LIMIT on queries that don't specify one, so a careless &lt;code&gt;SELECT *&lt;/code&gt; can't eat your entire context window.&lt;/p&gt;
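&lt;p&gt;A minimal sketch of that check, assuming the same blocklist described above (the real &lt;code&gt;query_gsc_data&lt;/code&gt; implementation may differ in detail):&lt;/p&gt;

```typescript
// Regex-based sandbox: block write/DDL statements, cap unbounded reads
const FORBIDDEN = /\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH)\b/i;
const HAS_LIMIT = /\bLIMIT\s+\d+/i;

function sandbox(sql: string): string {
  if (FORBIDDEN.test(sql)) {
    throw new Error("Write operations are blocked");
  }
  // Append a defensive cap when the query forgot its own LIMIT,
  // so a careless SELECT * can't flood the context window
  return HAS_LIMIT.test(sql) ? sql : sql + " LIMIT 10000";
}
```

&lt;p&gt;Regex filtering isn't bulletproof as a general SQL firewall, but combined with a read-only use case and the automatic LIMIT it covers the realistic failure modes here.&lt;/p&gt;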

&lt;h3&gt;
  
  
  The tools
&lt;/h3&gt;

&lt;p&gt;Twelve tools in total. Four get about 90% of my use. The first two will trigger an MCP app to render charts and the like with Vite (really cool if you're using Claude Desktop). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_overview&lt;/code&gt;&lt;/strong&gt; - shows all your properties at a glance with clicks, impressions, CTR, average position, and sparkline trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_dashboard&lt;/code&gt;&lt;/strong&gt; - deep dive on one property. Hero metrics, comparison periods, top queries, country breakdown, ranking distribution, new/lost queries, branded vs non-branded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_insights&lt;/code&gt;&lt;/strong&gt; - sixteen pre-built analytical queries. The one I keep reaching for is &lt;code&gt;opportunities&lt;/code&gt;: it surfaces queries where you're ranking 5-20 with decent impression counts. You're close enough to page one that a title tweak or a content refresh could push you over. These tend to be the lowest-effort wins in SEO.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;query_gsc_data&lt;/code&gt;&lt;/strong&gt; - freeform SQL against the &lt;code&gt;search_analytics&lt;/code&gt; table with the sandboxing I described above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one kicked the whole project off for me - being able to ask anything of the data and get a real answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL queries I've been running for SEO analysis
&lt;/h2&gt;

&lt;p&gt;The schema is nice and simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;search_analytics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran every one of these on a client's ecommerce site last Tuesday - 340K rows in the database - so they're well tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find long-tail referring keywords (5+ words):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is probably my favourite query in the set. Longer phrases tend to be conversational - they read like something you'd type into ChatGPT or ask Google's AI Overview. Finding which of these already send you traffic tells you exactly what questions your content is answering, and more importantly, which questions it isn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;LENGTH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;REPLACE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;LENGTH&lt;/code&gt; trick counts spaces to filter for five-or-more-word phrases. Not pretty SQL, but it works in SQLite, which doesn't have a native &lt;code&gt;WORD_COUNT&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spot keyword cannibalisation&lt;/strong&gt; (multiple pages ranking for the same query):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find cannibalisation on almost every client audit I do. Two pages scrapping over the same query, neither ranking properly, and nobody on the team has noticed. This surfaces the worst offenders so you can decide whether to consolidate or differentiate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content decay detection&lt;/strong&gt; (pages losing traffic month-over-month):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-28 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-56 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-29 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;prior&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;prior&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;prior&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Questions people are asking about your content:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'how %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'what %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'why %'&lt;/span&gt;
       &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'can %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'does %'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'is %'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I've used this one on sites when building FAQ sections. If you're getting impressions for "how to configure X" but no clicks, your existing content probably doesn't answer that question clearly enough - or at all. Once you can see them, they're straightforward fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Country breakdown for international targeting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;impressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;position&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_position&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;search_analytics&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'now'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'-90 days'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nice thing about all of these is that you're querying the complete dataset. Everything sits in SQLite and every query hits every row - no 1,000-row cap, no pagination limits.&lt;/p&gt;
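&lt;p&gt;That full-dataset property is easy to demonstrate with nothing but Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module - here with an in-memory stand-in for the synced database (the table name, columns, and rows are illustrative assumptions):&lt;/p&gt;

```python
import sqlite3

# Stand-in for the locally synced database; the table and column names
# mirror the SQL examples above but are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_analytics (country TEXT, clicks INT, impressions INT)")
conn.executemany(
    "INSERT INTO search_analytics VALUES (?, ?, ?)",
    [("gbr", 40, 800), ("usa", 90, 2600), ("usa", 10, 400)],
)

# The whole dataset lives in one SQLite file, so aggregates cover every
# row - there is no API-side row cap or pagination to work around.
rows = conn.execute(
    """
    SELECT country, SUM(clicks), SUM(impressions)
    FROM search_analytics
    GROUP BY country
    ORDER BY SUM(clicks) DESC
    """
).fetchall()
print(rows)  # [('usa', 100, 3000), ('gbr', 40, 800)]
```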

&lt;h2&gt;
  
  
  Interactive dashboards with ext-apps
&lt;/h2&gt;

&lt;p&gt;Something else I got to explore on this project: MCP's &lt;a href="https://modelcontextprotocol.io/docs/extensions/apps" rel="noopener noreferrer"&gt;ext-apps framework&lt;/a&gt;, which lets you render interactive HTML directly inside Claude Desktop. Actual clickable dashboards with real data, living right there in the conversation as embedded iframes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17z4ooc9lpr58pwevsrd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17z4ooc9lpr58pwevsrd.jpg" alt=" " width="800" height="881"&gt;&lt;/a&gt;&lt;br&gt;
The overview, the property drilldowns, the sparkline trends - they're all interactive. It feels like the beginning of something interesting to me: MCP-native SaaS, where the backend is an MCP server, the frontend is ext-apps, and distribution is an &lt;code&gt;npx&lt;/code&gt; command. No hosting to maintain, no auth flows to build, no deployment pipeline to manage. And your data stays on your machine, which for SEO data is a nice bonus.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting it up
&lt;/h2&gt;

&lt;p&gt;Three steps. The Google credentials bit takes the longest but you only do it once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Google service account&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Head to &lt;a href="https://console.cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud Console&lt;/a&gt;, create a new project and enable the Search Console API&lt;/li&gt;
&lt;li&gt;Credentials → Create service account → Keys tab → Add Key → JSON&lt;/li&gt;
&lt;li&gt;Grab the &lt;code&gt;client_email&lt;/code&gt; value from that JSON file&lt;/li&gt;
&lt;li&gt;In &lt;a href="https://search.google.com/search-console/" rel="noopener noreferrer"&gt;Search Console&lt;/a&gt; → Settings → Users and permissions → Add the email with &lt;strong&gt;Full&lt;/strong&gt; permission&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last step is the one everyone misses (I did too, first time). The service account is basically a separate Google identity - it can't see your properties until you explicitly share them. Skip it and you'll get "No properties found" errors and be left wondering what went wrong.&lt;/p&gt;
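&lt;p&gt;If you're unsure which email to share, it's the &lt;code&gt;client_email&lt;/code&gt; field in the downloaded key file. A quick sketch of pulling it out with stdlib Python - the key contents here are placeholders, and a real file also carries &lt;code&gt;private_key&lt;/code&gt; and token fields:&lt;/p&gt;

```python
import json
import tempfile

# Minimal stand-in for a downloaded service-account key file.
# The values here are placeholders, not real credentials.
key = {
    "type": "service_account",
    "client_email": "gsc-reader@my-project.iam.gserviceaccount.com",
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(key, f)
    path = f.name

# The one field you need for the last step: the identity to add under
# Search Console, Settings, Users and permissions.
with open(path) as f:
    email = json.load(f)["client_email"]
print(email)
```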

&lt;p&gt;&lt;strong&gt;2. Claude Desktop config&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add this to your &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"better-search-console"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@houtini/better-search-console"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"GOOGLE_APPLICATION_CREDENTIALS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/path/to/your-service-account.json"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Restart and sync&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Restart Claude Desktop and ask it to set up Better Search Console. The &lt;code&gt;setup&lt;/code&gt; tool discovers your properties, syncs them in the background, and shows an overview. Initial sync depends on data volume - 30 seconds for a small site, a few minutes for one with millions of rows. After that, syncs are incremental.&lt;/p&gt;
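&lt;p&gt;"Incremental" here just means: work out the newest date already stored, and only request dates after it. A sketch of that logic - the table name and schema are assumed from the queries above, and the API fetch itself is omitted:&lt;/p&gt;

```python
import sqlite3
from datetime import date, timedelta

# In-memory stand-in for the synced database (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_analytics (date TEXT, clicks INT)")
conn.executemany(
    "INSERT INTO search_analytics VALUES (?, ?)",
    [("2026-04-01", 5), ("2026-04-02", 8)],
)

def next_sync_start(conn):
    """First date an incremental sync needs to request from the API."""
    (latest,) = conn.execute("SELECT MAX(date) FROM search_analytics").fetchone()
    if latest is None:
        return None  # empty table: a full backfill is needed instead
    y, m, d = map(int, latest.split("-"))
    return (date(y, m, d) + timedelta(days=1)).isoformat()

print(next_sync_start(conn))  # 2026-04-03
```

&lt;p&gt;ISO dates sort lexically, so &lt;code&gt;MAX(date)&lt;/code&gt; on a TEXT column does the right thing without any date parsing in SQL.&lt;/p&gt;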

&lt;h2&gt;
  
  
  The stack
&lt;/h2&gt;

&lt;p&gt;Built in TypeScript, distributed via npm. Uses &lt;code&gt;better-sqlite3&lt;/code&gt; for storage, full pagination against the Google Search Console API, and MCP ext-apps for the interactive dashboard bits. Apache-2.0 licence - fork it, break it, whatever works for you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;npm:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/@houtini/better-search-console" rel="noopener noreferrer"&gt;@houtini/better-search-console&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/houtini-ai/better-search-console" rel="noopener noreferrer"&gt;houtini-ai/better-search-console&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I reckon the sync-locally-query-with-SQL approach has legs well beyond Search Console. Better answers from less context - that's genuinely all there is to it.&lt;/p&gt;

&lt;p&gt;Pull requests welcome - I'm especially interested in SQL queries you've found useful for SEO analysis, and I'll add the good ones to the pre-built insights.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
