<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ryan Carter</title>
    <description>The latest articles on Forem by Ryan Carter (@sym).</description>
    <link>https://forem.com/sym</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F221965%2Faff2cd59-8dec-481b-a4d1-3431f61de6e5.jpg</url>
      <title>Forem: Ryan Carter</title>
      <link>https://forem.com/sym</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sym"/>
    <language>en</language>
    <item>
      <title>Turning Manual Ops Into a 10-Minute Task</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:22 +0000</pubDate>
      <link>https://forem.com/sym/turning-manual-ops-into-a-10-minute-task-4eha</link>
      <guid>https://forem.com/sym/turning-manual-ops-into-a-10-minute-task-4eha</guid>
      <description>&lt;p&gt;I once turned a 2-week manual data update process into a 10-minute automated pipeline by writing a PHP script that ingested a vendor spreadsheet, normalized everything into a temporary MySQL database, and surfaced the result in a review dashboard before pushing to production. This post is the short version of that project — the tools I used, the approach, and the outcome — for any developer staring at a tedious manual ops process and wondering whether it's worth automating. (Spoiler: it almost always is.)&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; ~10 business days of careful manual data entry against a fragile legacy database, every six months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After:&lt;/strong&gt; ~10-minute automated run, ~30-second push to prod, single dashboard for human review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; PHP for ingestion and transforms, a temporary MySQL database for staging and validation, a web dashboard for human review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it worked:&lt;/strong&gt; The repetitive parts were genuinely repetitive (same enums, same transforms, same edge cases) and a human still got the final sign-off before anything hit production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcome:&lt;/strong&gt; ~90%+ reduction in customer-facing data issues, plus dev hours and company time saved every cycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The BEFORE Process
&lt;/h2&gt;

&lt;p&gt;I worked for a marketing company whose job it was to keep a major restaurant's nutrition information up to date: ingredients, units of measure (UOMs), and caloric content. Each cycle we would get a new spreadsheet full of updates we needed to apply to the database, which drove the website's display for 8 million customers.&lt;/p&gt;

&lt;p&gt;This process typically took around 10 business days (two weeks) to complete all the changes, and we repeated it roughly every six months. It was a lot of manual work: checking, re-checking, and typing very carefully to preserve the already-dwindling data integrity without introducing new issues. The legacy code and database setup were picky and had to be handled a certain way so they would correctly feed the iOS app. Very tedious and exhausting to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Approach To Improve
&lt;/h2&gt;

&lt;p&gt;After completing this a handful of times, I saw common assumptions I could make that would shortcut the time the process needed and automate many of the repetitive tasks. Automation would also improve the validity of the data (no human error) and make the final results easier to check, since everything could be verified against the original source of truth: the spreadsheet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools I Used
&lt;/h2&gt;

&lt;p&gt;The best tools I had at the time were PHP as the scripting language for the specific tasks, and a temporary MySQL database to help check and manipulate the data to speed things along.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;I wrote logic in PHP to ingest the spreadsheet data, match fields against common enums per category, and apply transforms for specific labels and description content, then piped the result into the final database only after it had been tested, reviewed for quality on a dashboard, and judged ready for production.&lt;/p&gt;

&lt;p&gt;Essentially the fix was to let the computer do as much processing as it could, have a human verify its work when done, and then automatically apply it to the target system, without the tedium of checking things one by one. &lt;/p&gt;
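
&lt;p&gt;The original script is long gone, but the shape of it was roughly this (a minimal PHP sketch; the file, table, and column names are illustrative, not the real schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&amp;lt;?php
// Minimal sketch: read the vendor spreadsheet as CSV, normalize each row,
// and stage it in a temporary MySQL table for review.
$pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'pass');
$insert = $pdo-&amp;gt;prepare(
    'INSERT INTO staged_items (name, category, uom, calories) VALUES (?, ?, ?, ?)'
);

// Map free-text units onto a known enum, the kind of repetitive
// transform that used to be done by hand.
function normalize_uom(string $uom): string {
    $map = ['grams' =&amp;gt; 'g', 'ounces' =&amp;gt; 'oz', 'fluid ounces' =&amp;gt; 'fl_oz'];
    $key = strtolower(trim($uom));
    return $map[$key] ?? $key;
}

$handle = fopen('vendor_update.csv', 'r');
fgetcsv($handle); // skip the header row
while (($row = fgetcsv($handle)) !== false) {
    [$name, $category, $uom, $calories] = $row;
    $insert-&amp;gt;execute([
        trim($name),
        strtolower(trim($category)),
        normalize_uom($uom),
        (float) $calories,
    ]);
}
fclose($handle);
// A human reviews the staged rows on the dashboard before anything hits prod.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;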

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;With my script and DB system, the process took only a 10-minute run to crunch the data and display the final values. We could check it all on a web page, adjust anything that was off, and then push to the prod DB in around 30 seconds. It saved the devs hassle, saved the company time and money, and cut customer-facing data issues by roughly 90%+. Not a bad outcome in the end. It is one of the projects I am most proud of to date, and it was my own idea to work on it. This is the kind of thing I love to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why a temporary MySQL database instead of validating in code?
&lt;/h3&gt;

&lt;p&gt;Spreadsheets have repetition, contradictions, and edge cases that are far easier to spot with SQL than with imperative validation code. A staging table with constraints catches duplicates and bad references immediately, and the dashboard can run any ad-hoc query against it before production gets touched.&lt;/p&gt;
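
&lt;p&gt;As a concrete sketch (hypothetical columns, not the real schema), the staging table's constraints can do the first round of validation on their own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical staging table: constraints reject bad rows at insert time.
CREATE TABLE staged_items (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(255) NOT NULL,
    category ENUM('entree', 'side', 'beverage', 'dessert') NOT NULL,
    uom      ENUM('g', 'oz', 'fl_oz', 'each') NOT NULL,
    calories DECIMAL(7,2) NOT NULL CHECK (calories &amp;gt;= 0), -- enforced on MySQL 8.0.16+
    UNIQUE KEY uniq_item (name, category) -- duplicates fail immediately
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;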

&lt;h3&gt;
  
  
  Why keep a human in the loop if the script is reliable?
&lt;/h3&gt;

&lt;p&gt;The data feeds an iOS app used by 8 million customers. A bad row in production isn't a bug — it's a customer-facing nutrition error. The 30 seconds it takes a human to scan a dashboard is cheap insurance against the kind of mistake nobody wants to explain in a postmortem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Could you have used a more modern stack?
&lt;/h3&gt;

&lt;p&gt;Sure — Python with pandas would have been a natural fit, and Postgres would have given me more flexibility. But PHP and MySQL were what the company already ran, and the entire project shipped without asking anyone for new infrastructure. That's part of why it actually got built.&lt;/p&gt;

&lt;h3&gt;
  
  
  What would you change if you did it again today?
&lt;/h3&gt;

&lt;p&gt;I'd add automated diff-vs-previous-cycle reports so reviewers see only what changed, version-control the transform rules, and write per-row confidence scores so the dashboard can highlight low-confidence entries first. The core architecture — ingest → stage → review → push — would stay exactly the same.&lt;/p&gt;

</description>
      <category>php</category>
      <category>productivity</category>
      <category>automation</category>
      <category>database</category>
    </item>
    <item>
      <title>Sending email from alias via Gmail</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:18 +0000</pubDate>
      <link>https://forem.com/sym/sending-email-from-alias-via-gmail-55pp</link>
      <guid>https://forem.com/sym/sending-email-from-alias-via-gmail-55pp</guid>
      <description>&lt;p&gt;To send email from an alias address through your Gmail account, generate a Google App Password, then add the alias under &lt;strong&gt;Settings → Accounts and Import → Send mail as&lt;/strong&gt; with &lt;code&gt;smtp.gmail.com:587&lt;/code&gt; as the SMTP server and the App Password (not your normal Gmail password) as the credential. The alias has to already forward to your Gmail, and your account needs 2-Step Verification enabled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;: The alias address is a forwarder pointing to your Gmail. You need 2-Step Verification enabled on your Google account.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;a href="https://myaccount.google.com/apppasswords" rel="noopener noreferrer"&gt;myaccount.google.com/apppasswords&lt;/a&gt;, create an App Password, and copy the 16-character code &lt;strong&gt;without spaces&lt;/strong&gt;. You must have two-factor auth turned on.&lt;/li&gt;
&lt;li&gt;In Gmail: Settings → See all settings → Accounts and Import → "Send mail as" → Add another email address&lt;/li&gt;
&lt;li&gt;Enter the alias name (e.g. Ryan Carter, or however you want it to appear in the recipient's inbox) and the alias email address (e.g. &lt;a href="mailto:ryan@someplace.com"&gt;ryan@someplace.com&lt;/a&gt;), then click Next Step&lt;/li&gt;
&lt;li&gt;Fill in the SMTP dialog:

&lt;ul&gt;
&lt;li&gt;Server: &lt;strong&gt;smtp.gmail.com&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Port: &lt;strong&gt;587&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Username: your real Gmail address (e.g. &lt;a href="mailto:ryan1234@gmail.com"&gt;ryan1234@gmail.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Password: the App Password (no spaces, NOT your regular password)&lt;/li&gt;
&lt;li&gt;Security: TLS (usually selected by default; you normally don't need to change it)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click Add Account — Gmail sends a verification email to your alias, which forwards it to your Gmail inbox&lt;/li&gt;
&lt;li&gt;Click the confirmation link in that email&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Done — the alias will now appear in the From dropdown when composing.&lt;/p&gt;
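
&lt;p&gt;The same server, port, and App Password also work from code with any SMTP client. Here's a minimal Node sketch using nodemailer (the addresses are placeholders), in case you ever want to script it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// npm install nodemailer
import nodemailer from "nodemailer";

const transporter = nodemailer.createTransport({
  host: "smtp.gmail.com",
  port: 587,
  secure: false, // port 587 upgrades to TLS via STARTTLS
  auth: {
    user: "ryan1234@gmail.com", // your real Gmail address
    pass: process.env.GMAIL_APP_PASSWORD, // the 16-character App Password, no spaces
  },
});

await transporter.sendMail({
  from: '"Ryan Carter" &amp;lt;ryan@someplace.com&amp;gt;', // the verified alias
  to: "someone@example.com",
  subject: "Hello from my alias",
  text: "Sent through Gmail SMTP; the From line shows the alias.",
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;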

&lt;h2&gt;
  
  
  Turning on 2-Step Verification:
&lt;/h2&gt;

&lt;p&gt;Go to myaccount.google.com/security → scroll down to "How you sign in to Google" → click "2-Step Verification" → follow the prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  If you use Google Workspace:
&lt;/h2&gt;

&lt;p&gt;It works the same way, with one caveat — your Workspace admin has to allow App Passwords. If the App Passwords option doesn't appear at myaccount.google.com/apppasswords, the admin has it disabled. They'd need to go to the Admin Console → Security → Authentication and enable "Allow users to manage their own app passwords."&lt;br&gt;
Also in Workspace there's an extra step: the admin needs to enable "Allow users to send mail through an external SMTP server" under Apps → Google Workspace → Gmail → Advanced settings. Otherwise the SMTP relay gets blocked.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why do I have to use an App Password instead of my regular Gmail password?
&lt;/h3&gt;

&lt;p&gt;Google blocks third-party access (including SMTP) using your normal account password by default — that's what 2-Step Verification is for. App Passwords are 16-character credentials scoped to a single app and revocable from your Google account, so they're safer than handing your real password to an SMTP client.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I do this without enabling 2-Step Verification?
&lt;/h3&gt;

&lt;p&gt;No. App Passwords only exist on accounts with 2-Step Verification turned on. The "less secure app access" toggle that used to allow this was deprecated by Google in 2022.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will recipients see my real Gmail address anywhere?
&lt;/h3&gt;

&lt;p&gt;No — once the alias is verified, mail sent from it shows only the alias address in the From field. Recipients can still inspect the raw message headers and see Gmail's servers in the trace, but your real Gmail address is not exposed in the From, Reply-To, or visible headers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What if I want replies to come back to my real Gmail and not the alias?
&lt;/h3&gt;

&lt;p&gt;That's the default if your alias is a forwarder pointing to your Gmail — replies go to the alias and get forwarded back. If you want replies to bypass the alias and hit your Gmail directly, set "Reply-To" to your real Gmail address when composing, or configure it on the Send-mail-as entry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I send from multiple aliases?
&lt;/h3&gt;

&lt;p&gt;Yes — repeat the same setup for each alias. Each one becomes a separate option in the From dropdown when composing. You can also pick a default From address per recipient under "When replying to a message."&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Multi-Model LLM Orchestration with OpenRouter</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:14 +0000</pubDate>
      <link>https://forem.com/sym/multi-model-llm-orchestration-with-openrouter-g4l</link>
      <guid>https://forem.com/sym/multi-model-llm-orchestration-with-openrouter-g4l</guid>
      <description>&lt;p&gt;Multi-model LLM orchestration is the practice of routing AI requests to different models based on what each task needs — speed, cost, reasoning depth, or code quality. &lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt; makes it practical by exposing models from Anthropic, OpenAI, Google, Meta, Mistral, and others through a single OpenAI-compatible API: one key, one bill, one client, and you swap models by changing a string. The implementation is a few dozen lines of code on top of the OpenAI SDK.&lt;/p&gt;

&lt;p&gt;This post walks through the pattern: defining named model slots, routing by task or complexity, streaming, fallback handling, and tracking cost across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; Routing each AI request to the model best suited for that task instead of using one model for everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Cheaper at scale (small models for simple tasks), faster perceived latency (fast models for chat), better quality (right model for the job), and resilient (fall back across providers when one is down).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How OpenRouter helps:&lt;/strong&gt; One API key gives you access to 100+ models across providers using the OpenAI SDK. Model strings follow &lt;code&gt;provider/model-name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two routing strategies:&lt;/strong&gt; By task type (&lt;code&gt;summarize&lt;/code&gt; → fast model, &lt;code&gt;reason&lt;/code&gt; → deep model) or by estimated complexity (token count thresholds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production essentials:&lt;/strong&gt; Streaming for chat UIs, try/catch fallbacks for provider outages, and per-request cost logging via the &lt;code&gt;usage&lt;/code&gt; object OpenRouter returns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why bother with multiple models?
&lt;/h2&gt;

&lt;p&gt;A few real reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Frontier models like GPT-4o or Claude Opus are expensive at scale. For tasks that don't need that level of reasoning — summarization, classification, simple Q&amp;amp;A — a cheaper, faster model does the job at a fraction of the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed.&lt;/strong&gt; Small models respond faster. If a user is waiting for a response, latency matters. Route quick tasks to a fast model and save the slow, expensive one for when it's actually needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality.&lt;/strong&gt; Some models are better at specific things. Code generation, structured output, long-context reasoning, multilingual text — the best model for each task isn't always the same model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resilience.&lt;/strong&gt; If one provider has an outage or rate limit, you can fall back to another without rewriting your integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up OpenRouter
&lt;/h2&gt;

&lt;p&gt;Install the OpenAI SDK — OpenRouter is compatible with it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point it at OpenRouter's base URL with your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://openrouter.ai/api/v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Everything else is standard OpenAI SDK calls, just with different model strings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining your model roster
&lt;/h2&gt;

&lt;p&gt;The key to orchestration is deciding upfront which models you'll use and what each one is for. A simple approach is to define a set of "personas" — named roles that map to specific models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;google/gemini-flash-1.5&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// quick tasks, low latency&lt;/span&gt;
  &lt;span class="na"&gt;balanced&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// everyday reasoning&lt;/span&gt;
  &lt;span class="na"&gt;deep&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-opus-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// complex reasoning, long context&lt;/span&gt;
  &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-sonnet-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// code generation and review&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Model strings in OpenRouter follow the pattern &lt;code&gt;provider/model-name&lt;/code&gt;. You can find the full list and pricing at &lt;a href="https://openrouter.ai/models" rel="noopener noreferrer"&gt;openrouter.ai/models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By mapping names to models rather than hardcoding model strings throughout your codebase, you can swap the underlying model without touching anything else. If a better cheap model comes out next month, you change one line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing by task type
&lt;/h2&gt;

&lt;p&gt;The simplest orchestration strategy is routing based on task type — you decide which model to use before making the call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;modelMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;balanced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;modelMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;balanced&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Summarize this document: ...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reason&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Help me think through this architecture decision...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is explicit and predictable. You know exactly which model runs for each task type, which makes debugging straightforward and costs easy to reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing by estimated complexity
&lt;/h2&gt;

&lt;p&gt;A more dynamic approach is routing based on the size or complexity of the request itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;selectModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tokenEstimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// rough chars-to-tokens estimate&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenEstimate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tokenEstimate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;balanced&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;deep&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;selectModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can combine both approaches — route by task type first, then apply complexity thresholds within each category.&lt;/p&gt;
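
&lt;p&gt;A combined router is only a few lines more. This sketch reuses the &lt;code&gt;models&lt;/code&gt; map from earlier; the thresholds are arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Task type picks the tier first; complexity can escalate it.
async function routedChat(task, prompt) {
  const byTask = {
    summarize: models.fast,
    classify: models.fast,
    draft: models.balanced,
    reason: models.deep,
    code: models.code,
  };

  let model = byTask[task] ?? models.balanced;

  // Long prompts get bumped from the fast tier to the balanced one.
  const tokenEstimate = prompt.length / 4;
  if (model === models.fast &amp;amp;&amp;amp; tokenEstimate &amp;gt; 2000) {
    model = models.balanced;
  }

  const response = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
  });
  return response.choices[0].message.content;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;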

&lt;h2&gt;
  
  
  Streaming responses
&lt;/h2&gt;

&lt;p&gt;For any user-facing interface, streaming makes responses feel faster even when they aren't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;streamChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onChunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nf"&gt;onChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;balanced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// or push to your UI&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fallback handling
&lt;/h2&gt;

&lt;p&gt;Models go down. Rate limits happen. Add a fallback layer so a failure from one provider doesn't take your whole app down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chatWithFallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;preferredModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fallbackModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;preferredModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Model &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;preferredModel&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed, falling back to &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fallbackModel&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fallbackModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tracking cost across models
&lt;/h2&gt;

&lt;p&gt;One of the underrated benefits of OpenRouter is that it returns token usage and cost metadata with each response. Log it and you'll know exactly what you're spending per task type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chatWithCostTracking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;selectModelForTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;inputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;outputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// OpenRouter includes cost in the response&lt;/span&gt;
    &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have that data you can see which task types are eating your budget and tune your routing accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;The pattern here is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define named model slots tied to task roles, not specific model strings&lt;/li&gt;
&lt;li&gt;Route requests to the right slot based on task type, complexity, or both&lt;/li&gt;
&lt;li&gt;Stream responses for user-facing interfaces&lt;/li&gt;
&lt;li&gt;Add fallbacks so individual provider failures don't cascade&lt;/li&gt;
&lt;li&gt;Log usage so you can optimize over time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;OpenRouter removes the vendor lock-in that makes this feel risky. You're not betting on one provider — you're building a routing layer that can point at any model, from any provider, updated as the landscape changes. Given how fast the model landscape moves, that flexibility is worth more than it might seem today.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is OpenRouter more expensive than calling providers directly?
&lt;/h3&gt;

&lt;p&gt;OpenRouter passes through provider pricing with a small markup baked in (typically a few percent), and in exchange you get a single account, single bill, automatic failover, and the ability to swap models without touching keys or SDKs. For most teams the convenience is worth it; for very high-volume workloads on a single model, going direct can be cheaper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does OpenRouter support streaming and tool/function calling?
&lt;/h3&gt;

&lt;p&gt;Yes. Streaming works exactly like the OpenAI SDK — set &lt;code&gt;stream: true&lt;/code&gt;. Tool/function calling is supported per-model: most modern models from Anthropic, OpenAI, and Google handle it; smaller open models vary. Check the model card on openrouter.ai/models for capability flags.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does this compare to LangChain or LiteLLM?
&lt;/h3&gt;

&lt;p&gt;LangChain is a much heavier framework with chains, agents, retrievers, and abstractions on top of providers. LiteLLM is the closest comparison — it's a unified provider proxy you self-host. OpenRouter is a hosted version of that idea: less control but zero ops, plus access to models you don't have direct accounts for.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens if a model gets deprecated or removed?
&lt;/h3&gt;

&lt;p&gt;OpenRouter announces deprecations in advance and usually keeps a redirect to a sensible successor. Because your code references a model string in one place (the named-slot map), updating to a new model is a one-line change. This is the main argument for the named-slot pattern over hardcoding model names throughout the codebase.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I route by user, by feature, or by A/B test?
&lt;/h3&gt;

&lt;p&gt;Yes. The routing function is just code, so you can include any signal in the decision: user tier, feature flag, A/B bucket, time of day. A common pattern is routing premium users to the deeper model and free users to the fast one. Another is shadow-routing — sending a copy of each request to a candidate model and comparing outputs offline.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I track which model performed best for a task?
&lt;/h3&gt;

&lt;p&gt;Log the &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;task&lt;/code&gt;, latency, token usage, and a quality signal (user thumbs-up, downstream success, eval score) for every request. Once you have a few weeks of data, group by task and model and compare. This is how you justify routing decisions empirically instead of guessing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Built an AI Document Ingestion Pipeline</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:10 +0000</pubDate>
      <link>https://forem.com/sym/how-i-built-an-ai-document-ingestion-pipeline-1abf</link>
      <guid>https://forem.com/sym/how-i-built-an-ai-document-ingestion-pipeline-1abf</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/meownoirsoft/symport" rel="noopener noreferrer"&gt;Symport&lt;/a&gt; is an AI document ingestion pipeline that turns a phone photo of any paper document — receipt, EOB, prescription, utility bill — into structured JSON, then stores it in Postgres with embeddings for semantic search. The full flow is: image upload → Sharp preprocessing → GPT-4o vision extraction → normalized JSON → Postgres + pgvector. I built it because I hate paper and I also lose paper.&lt;/p&gt;

&lt;p&gt;This post walks through how the pipeline actually works, including the prompt engineering decisions that make extraction reliable enough to trust and the fallback layers that keep the app useful when extraction fails.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; Sharp for image preprocessing, GPT-4o for vision extraction, Prisma + Postgres + pgvector for storage and semantic search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The extraction prompt does most of the work:&lt;/strong&gt; explicit date context to fight year hallucinations, constrained &lt;code&gt;type&lt;/code&gt;/&lt;code&gt;category&lt;/code&gt; enums for predictable downstream branching, and a strict "JSON only, no markdown" tail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User correction loop:&lt;/strong&gt; Users can add freeform feedback ("the drug name is metformin, not metFORMIN") and re-run extraction; the feedback gets injected back into the system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema choice:&lt;/strong&gt; A single &lt;code&gt;extractedData&lt;/code&gt; JSON column instead of per-type tables, with a denormalized &lt;code&gt;searchText&lt;/code&gt; field for fast keyword search and an &lt;code&gt;embedding&lt;/code&gt; column for semantic search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two fallback layers:&lt;/strong&gt; Document still saves if there's no API key, and still saves with an error summary if extraction throws — nothing is ever lost because AI had a bad day.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What the pipeline does
&lt;/h2&gt;

&lt;p&gt;The flow is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image upload → sharpen + encode → GPT-4o vision → structured JSON → Postgres + embeddings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user photographs a receipt, an insurance EOB, a prescription, a utility bill — anything on paper. The app returns a structured JSON object with the relevant fields extracted, tagged, and ready to query. No manual data entry.&lt;/p&gt;
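
&lt;p&gt;For a pharmacy receipt, the result looks something like this (a hand-written example of the shape, using the &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt; enums described in Step 2, not actual app output):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "type": "rx_receipt",
  "category": "medical",
  "title": "Prescription receipt",
  "tags": ["pharmacy", "metformin", "copay", "2025"],
  "date": "2025-03-14",
  "amount": 12.50,
  "vendor": "Example Pharmacy"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;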

&lt;h2&gt;
  
  
  Step 1: Image preprocessing
&lt;/h2&gt;

&lt;p&gt;Raw phone photos are large and often noisy. Before sending to the vision model, every image gets sharpened and re-encoded using Sharp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rawBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayBuffer&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sharpenAndEncode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sharp handles resizing, sharpening, and JPEG re-encoding in one pass. This serves two purposes: it reduces the payload size for the API call, and sharpening improves OCR accuracy on text-heavy documents like receipts. A blurry photo of small print is genuinely harder for vision models — a little preprocessing pays off.&lt;/p&gt;
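
&lt;p&gt;A &lt;code&gt;sharpenAndEncode&lt;/code&gt; along these lines captures the one-pass idea (a sketch, not Symport's exact implementation; the width and quality values are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import sharp from "sharp";

// One pass through Sharp: fix orientation, cap dimensions, sharpen, re-encode.
async function sharpenAndEncode(input: Buffer): Promise&amp;lt;Buffer&amp;gt; {
  return sharp(input)
    .rotate() // auto-orient using EXIF data from phone cameras
    .resize({ width: 2000, withoutEnlargement: true })
    .sharpen()
    .jpeg({ quality: 85 })
    .toBuffer();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;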

&lt;p&gt;The processed image gets saved to disk as the source of truth, then the buffer goes to the extraction pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;randomBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;.jpg`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;writeFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fullPath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extractFromImageBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The random hex filename prevents collisions and avoids leaking any metadata about the document in the path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: The extraction prompt
&lt;/h2&gt;

&lt;p&gt;This is where most of the real engineering lives. The system prompt does a lot of work to make the model's output consistent and parseable.&lt;/p&gt;

&lt;p&gt;The prompt has three parts assembled at startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;EXTRACTION_SYSTEM_HEAD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`You are a document extraction assistant. Analyze the image and extract structured data.

Current date context: We are in 2025. Use 2025 (not 2023 or other past years) for any ambiguous or partial dates when no stronger clue is present.

Use context clues from the document text to infer the correct year:
- "2025 taxes due in 2026" → tax year 2025
- "Plan year 2025", "Coverage year 2025" → use 2025
- "Due in 2026" on a tax-related doc often refers to tax year 2025

Respond with a single JSON object. Include "type", "category", "title", and "tags" in every response.

- "type": one of rx_receipt, eob, utility_bill, general
- "category": one of receipt, financial, medical, government, legal, identity, general
- "title" (required, 2-5 words max): Short label only. No sentences.
- "tags": array of 3–8 short labels. No spaces; use underscores if needed.
`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few decisions worth calling out here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit date context.&lt;/strong&gt; Vision models can hallucinate dates, especially on documents where the year is ambiguous. Anchoring the prompt with the current year and showing examples of how to reason about year context dramatically reduces date errors. Without this, a 2025 tax document might come back with 2023 dates because the model defaulted to its training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constrained type and category values.&lt;/strong&gt; Giving the model an explicit enum for &lt;code&gt;type&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt; means you get predictable values you can branch on in code. Open-ended classification produces inconsistent strings that are annoying to handle downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short title constraint.&lt;/strong&gt; "2-5 words max, no sentences" prevents the model from writing a summary disguised as a title. You want "Prescription receipt" not "This document appears to be a receipt from Walgreens for a prescription medication."&lt;/p&gt;

&lt;p&gt;The tail of the prompt closes with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;EXTRACTION_SYSTEM_TAIL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`
Use null for missing values. Amounts as numbers. Dates as YYYY-MM-DD; use context clues for year. Output only valid JSON, no markdown or explanation.`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Output only valid JSON, no markdown or explanation" is load-bearing. Without it, GPT-4o will frequently wrap the response in a markdown code block. The extraction code handles that case anyway, but telling the model not to do it reduces the cleanup work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip optional markdown code block&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;jsonStr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/``&lt;/span&gt;&lt;span class="err"&gt;`
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;(?:&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;S&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;?)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="nx"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="s2"&gt;```/);
if (match) jsonStr = match[1].trim();
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: User feedback loop
&lt;/h2&gt;

&lt;p&gt;One of the more useful features is the ability to correct extractions. If the model gets something wrong — misreads a drug name, gets the date wrong, miscategorizes the document — the user can add a correction note and re-run extraction. That feedback gets injected directly into the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;systemContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s2"&gt;`\n\nIMPORTANT - User feedback on this document (apply these corrections):
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userFeedback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the model gets a second pass with explicit correction instructions. In practice it works well — "the drug name is metformin not metFORMIN" or "this is a 2025 EOB not 2024" gets applied reliably.&lt;/p&gt;

&lt;p&gt;The feedback also gets stored in the database as &lt;code&gt;extractionNotes&lt;/code&gt; on the document, so you have a record of what was corrected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: The data model
&lt;/h2&gt;

&lt;p&gt;The Prisma schema keeps things straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="nx"&gt;Document&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;id&lt;/span&gt;            &lt;span class="nb"&gt;String&lt;/span&gt;   &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;id&lt;/span&gt; &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;cuid&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="nx"&gt;imagePath&lt;/span&gt;     &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
  &lt;span class="nx"&gt;noteText&lt;/span&gt;      &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
  &lt;span class="nx"&gt;status&lt;/span&gt;        &lt;span class="nb"&gt;String&lt;/span&gt;   &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;extractedData&lt;/span&gt; &lt;span class="nx"&gt;Json&lt;/span&gt;
  &lt;span class="nx"&gt;searchText&lt;/span&gt;    &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
  &lt;span class="nx"&gt;embedding&lt;/span&gt;     &lt;span class="nc"&gt;Unsupported&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;vector(1536)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt;          &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="nx"&gt;extractionNotes&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt;
  &lt;span class="nx"&gt;createdAt&lt;/span&gt;     &lt;span class="nx"&gt;DateTime&lt;/span&gt; &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="nx"&gt;updatedAt&lt;/span&gt;     &lt;span class="nx"&gt;DateTime&lt;/span&gt; &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;updatedAt&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few design choices here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;extractedData&lt;/code&gt; is a JSON blob.&lt;/strong&gt; Rather than creating separate tables for each document type (receipts, EOBs, utility bills), all extracted data lives in a single JSON column. This makes the schema flexible — different document types have different fields, and a rigid relational schema would be a constant maintenance burden as new types are added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;searchText&lt;/code&gt; is denormalized.&lt;/strong&gt; After extraction, key fields get pulled out and concatenated into a single &lt;code&gt;searchText&lt;/code&gt; string for full-text search. This is faster to query than parsing JSON at search time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildSearchText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ExtractedDoc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;effectiveTitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summary&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;drug_name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;drug_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;drug_name&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;insurer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;insurer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;insurer&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tags&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;t&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Boolean&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
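
&lt;p&gt;For a sense of the output, a prescription receipt might produce something like this — values are made up, and I'm assuming &lt;code&gt;effectiveTitle&lt;/code&gt; just returns the stored title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative input/output
buildSearchText({
  type: "rx_receipt",
  title: "Walgreens receipt",
  drug_name: "metformin",
  tags: ["pharmacy", "rx_receipt"],
} as ExtractedDoc);
// "Walgreens receipt metformin pharmacy rx_receipt"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;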



&lt;p&gt;&lt;strong&gt;&lt;code&gt;embedding&lt;/code&gt; for semantic search.&lt;/strong&gt; After the document is saved, an embedding gets generated from &lt;code&gt;searchText&lt;/code&gt; and stored in a pgvector column. This enables semantic search — finding "cholesterol medication" when the document says "lipitor" — without a separate vector database. Just pgvector as a Postgres extension.&lt;/p&gt;
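
&lt;p&gt;The embedding step itself is only a few lines. A minimal sketch with the OpenAI SDK — the model name and the raw-SQL write are my assumptions here, not necessarily what the repo does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import OpenAI from "openai";
import { PrismaClient } from "@prisma/client";

const openai = new OpenAI();
const prisma = new PrismaClient();

// Sketch: embed searchText and store it in the pgvector column.
// text-embedding-3-small returns 1536 dimensions, matching vector(1536).
async function embedDocument(id: string, searchText: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: searchText,
  });
  // Prisma can't type pgvector columns, so the write goes through raw SQL
  const vector = `[${res.data[0].embedding.join(",")}]`;
  await prisma.$executeRaw`
    UPDATE "Document" SET embedding = ${vector}::vector WHERE id = ${id}
  `;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;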

&lt;h2&gt;
  
  
  Step 5: Graceful degradation
&lt;/h2&gt;

&lt;p&gt;The pipeline has two fallback layers. First, if there's no API key configured, the document still gets saved — just without extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;prisma&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;extractedData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;general&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Document&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Extraction skipped (no OPENAI_API_KEY)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;searchText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Extraction skipped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, if extraction throws, the document still gets saved with an error summary rather than failing the whole request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extracted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;extractFromImageBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;extractedData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;extracted&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;extractedData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;general&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Extraction failed: &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Unknown error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image is always saved. The extraction is best-effort. Users can re-trigger extraction manually, or add correction notes and re-run. Nothing gets lost because AI had a bad day.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model
&lt;/h2&gt;

&lt;p&gt;The extraction model is configurable via environment variable with &lt;code&gt;gpt-4o&lt;/code&gt; as the default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_EXTRACTION_MODEL&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT-4o is the right choice here — it's genuinely better than smaller models at reading degraded document images, handwriting, and small print. For this specific task the quality difference is noticeable enough to justify the cost. Document extraction is a write-time operation (not a search-time one), so the latency and cost are acceptable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;A few things I'd change with hindsight:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a confidence score.&lt;/strong&gt; The model sometimes hedges on fields it's uncertain about — a low-confidence flag on individual fields would let the UI highlight things that need user review rather than silently storing potentially wrong data.&lt;/p&gt;
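
&lt;p&gt;One possible shape, purely hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical — not implemented in the repo
type Confidence = "high" | "medium" | "low";
type ExtractedField&amp;lt;T&amp;gt; = { value: T; confidence: Confidence };

// e.g. { value: "2025-04-12", confidence: "low" } gets flagged for review in the UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;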

&lt;p&gt;&lt;strong&gt;Chunk large documents.&lt;/strong&gt; A single-page receipt is fine. A multi-page insurance EOB or medical record is harder — the model gets less accurate as documents get longer or more complex. Chunking multi-page documents and merging the extracted JSON would improve accuracy on longer content.&lt;/p&gt;
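
&lt;p&gt;A naive merge might look like this — a sketch, with the merge policy (first non-null value wins, arrays concatenate) as an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch: combine per-page extraction results into one record
function mergePages(pages: Record&amp;lt;string, unknown&amp;gt;[]): Record&amp;lt;string, unknown&amp;gt; {
  const merged: Record&amp;lt;string, unknown&amp;gt; = {};
  for (const page of pages) {
    for (const [key, value] of Object.entries(page)) {
      if (Array.isArray(merged[key]) &amp;amp;&amp;amp; Array.isArray(value)) {
        merged[key] = [...(merged[key] as unknown[]), ...value];
      } else if (merged[key] == null) {
        merged[key] = value;
      }
    }
  }
  return merged;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;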

&lt;p&gt;&lt;strong&gt;Store the raw extraction response.&lt;/strong&gt; Right now only the normalized result gets stored. Keeping the raw model output alongside it would make debugging extraction issues much easier.&lt;/p&gt;




&lt;p&gt;The full source is on GitHub at &lt;a href="https://github.com/meownoirsoft/symport" rel="noopener noreferrer"&gt;github.com/meownoirsoft/symport&lt;/a&gt;. The extraction logic lives in &lt;code&gt;lib/extract.ts&lt;/code&gt; and the ingestion endpoint is &lt;code&gt;app/api/documents/route.ts&lt;/code&gt; if you want to dig in.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why GPT-4o for vision instead of a cheaper model or open-source alternative?
&lt;/h3&gt;

&lt;p&gt;GPT-4o reads degraded phone photos, handwriting, and small print noticeably better than smaller or open-source vision models. For document extraction, getting the dates and amounts wrong is a much bigger problem than the per-call cost, so paying for the better model is worth it. Extraction runs once at write time, not on every read.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you stop the model from hallucinating dates or amounts?
&lt;/h3&gt;

&lt;p&gt;The biggest wins are anchoring "current year" context in the system prompt with explicit examples, asking the model to use context clues from the document itself ("plan year 2025", "due in 2026" → tax year 2025), and constraining types/categories to enums so the model can't drift. The user-feedback loop catches anything that still slips through.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why store extracted data as a single JSON column instead of typed tables?
&lt;/h3&gt;

&lt;p&gt;Different document types have different fields — a prescription receipt has &lt;code&gt;drug_name&lt;/code&gt; and &lt;code&gt;pharmacy&lt;/code&gt;, an EOB has &lt;code&gt;insurer&lt;/code&gt; and &lt;code&gt;claim_id&lt;/code&gt;, a utility bill has &lt;code&gt;account_number&lt;/code&gt; and &lt;code&gt;service_period&lt;/code&gt;. A relational schema for every variant would be a constant migration treadmill. JSON keeps the schema flexible, and the denormalized &lt;code&gt;searchText&lt;/code&gt; and &lt;code&gt;embedding&lt;/code&gt; columns make queries fast where it matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens if the model returns invalid JSON?
&lt;/h3&gt;

&lt;p&gt;The extraction code strips optional markdown code fences (&lt;code&gt;```json ... ```&lt;/code&gt;) and parses the rest. If parsing still fails, the document saves with an error summary in the &lt;code&gt;extractedData.summary&lt;/code&gt; field rather than throwing — the user can re-run extraction or add a correction note. The image and metadata are never lost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use this with Anthropic's Claude vision instead of GPT-4o?
&lt;/h3&gt;

&lt;p&gt;Yes. The extraction prompt is provider-agnostic and the model is configurable via &lt;code&gt;OPENAI_EXTRACTION_MODEL&lt;/code&gt;. Swap the SDK call for the Anthropic SDK (or route through OpenRouter to avoid a code change) and Claude's vision models work as a drop-in alternative. The "JSON only, no markdown" instruction is even more important on Claude — it likes to explain itself by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you handle multi-page documents?
&lt;/h3&gt;

&lt;p&gt;Today the pipeline treats each photo as a single document, which is fine for single-page items (receipts, prescriptions). For multi-page EOBs or medical records, the right next step is to chunk the document into pages, run extraction per page, and merge the resulting JSON into a single record. Adding that is on the to-do list.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>showdev</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Git Branch Exists on Remote But Won't Show Locally</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:06 +0000</pubDate>
      <link>https://forem.com/sym/git-branch-exists-on-remote-but-wont-show-locally-4idd</link>
      <guid>https://forem.com/sym/git-branch-exists-on-remote-but-wont-show-locally-4idd</guid>
      <description>&lt;p&gt;If a git branch shows up on the remote but &lt;code&gt;git branch -r&lt;/code&gt; doesn't list it locally, your fetch refspec is almost always scoped to a single branch instead of all branches. Fix it with one config change: &lt;code&gt;git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"&lt;/code&gt; followed by &lt;code&gt;git fetch origin --prune&lt;/code&gt;. This commonly happens after shallow clones, certain CI checkouts, and clones run with &lt;code&gt;--single-branch&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The full diagnosis takes about two minutes — start by confirming the branch exists on the remote, then walk through the three fixes below in order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Confirm the branch actually exists on the remote
&lt;/h2&gt;

&lt;p&gt;First, bypass your local cache entirely and ask the remote directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git ls-remote origin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your branch shows up here but not in &lt;code&gt;git branch -r&lt;/code&gt;, your local remote-tracking refs are stale or incorrectly scoped. That's the problem — and it's fixable.&lt;/p&gt;

&lt;p&gt;If it doesn't show up here either, the issue is permissions or a wrong remote URL. Check with &lt;code&gt;git remote -v&lt;/code&gt; and make sure &lt;code&gt;origin&lt;/code&gt; points where you think it does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: Fetch with prune
&lt;/h2&gt;

&lt;p&gt;The simplest thing to try first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git fetch origin &lt;span class="nt"&gt;--prune&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--prune&lt;/code&gt; flag removes stale remote-tracking refs and re-syncs. Sometimes that's all it takes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2: Fetch the specific branch by name
&lt;/h2&gt;

&lt;p&gt;If a general fetch isn't picking it up, fetching by name often forces it — though if your refspec is scoped to a single branch (see Fix 3), this downloads the branch without creating the remote-tracking ref:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git fetch origin your-branch-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix 3: Check your fetch refspec
&lt;/h2&gt;

&lt;p&gt;This is the most common root cause when the above don't work. Check your git config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .git/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the &lt;code&gt;[remote "origin"]&lt;/code&gt; section. It should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[remote "origin"]
    url = git@github.com:you/your-repo.git
    fetch = +refs/heads/*:refs/remotes/origin/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;fetch&lt;/code&gt; line is the refspec — it tells git which branches to track. The &lt;code&gt;*&lt;/code&gt; wildcard means "all branches."&lt;/p&gt;

&lt;p&gt;If yours looks like this instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fetch = +refs/heads/main:refs/remotes/origin/main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's your problem. The refspec is scoped to a single branch, so git is only tracking &lt;code&gt;main&lt;/code&gt; and ignoring everything else. This happens with shallow clones, some CI checkout configurations, and certain &lt;code&gt;git clone&lt;/code&gt; flags.&lt;/p&gt;
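
&lt;p&gt;You can reproduce the scoped refspec yourself — a single-branch clone writes exactly that line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone --single-branch --branch main git@github.com:you/your-repo.git
cd your-repo
git config --get remote.origin.fetch
# +refs/heads/main:refs/remotes/origin/main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;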

&lt;p&gt;Fix it by updating the refspec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config remote.origin.fetch &lt;span class="s2"&gt;"+refs/heads/*:refs/remotes/origin/*"&lt;/span&gt;
git fetch origin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, &lt;code&gt;git branch -r&lt;/code&gt; should show all remote branches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Branch in &lt;code&gt;ls-remote&lt;/code&gt; but not &lt;code&gt;branch -r&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Stale or scoped refspec&lt;/td&gt;
&lt;td&gt;Update refspec, re-fetch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Branch missing after &lt;code&gt;git fetch&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Stale tracking refs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git fetch origin --prune&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Branch missing entirely from &lt;code&gt;ls-remote&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Wrong remote URL or permissions&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;git remote -v&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;ls-remote&lt;/code&gt; check is always the right first step — it tells you immediately whether the problem is on the remote side or local side, which cuts the diagnosis in half.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why does &lt;code&gt;git fetch&lt;/code&gt; not pick up the new branch?
&lt;/h3&gt;

&lt;p&gt;Either your remote-tracking refs are stale (fix with &lt;code&gt;--prune&lt;/code&gt;), or your fetch refspec is scoped to a single branch (the most common cause when &lt;code&gt;--prune&lt;/code&gt; doesn't help). The refspec lives in &lt;code&gt;.git/config&lt;/code&gt; under &lt;code&gt;[remote "origin"]&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a fetch refspec and why does it matter?
&lt;/h3&gt;

&lt;p&gt;A refspec tells git which remote refs to download and where to store them locally. The default &lt;code&gt;+refs/heads/*:refs/remotes/origin/*&lt;/code&gt; means "fetch every branch on the remote into &lt;code&gt;origin/*&lt;/code&gt; locally." If yours is scoped to a specific branch (e.g. &lt;code&gt;refs/heads/main:refs/remotes/origin/main&lt;/code&gt;), git will only ever track that one branch.&lt;/p&gt;

&lt;h3&gt;
  
  
  How did my refspec get scoped to a single branch?
&lt;/h3&gt;

&lt;p&gt;Common causes: cloning with &lt;code&gt;--single-branch&lt;/code&gt;, cloning with &lt;code&gt;--branch &amp;lt;name&amp;gt;&lt;/code&gt; plus &lt;code&gt;--single-branch&lt;/code&gt;, GitHub Actions checkouts that use &lt;code&gt;fetch-depth: 1&lt;/code&gt; and a specific ref, and some Dependabot/CI tools that explicitly scope the refspec to save bandwidth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is &lt;code&gt;git fetch --all --prune&lt;/code&gt; the same as fixing the refspec?
&lt;/h3&gt;

&lt;p&gt;No. &lt;code&gt;--all&lt;/code&gt; fetches from every configured remote (relevant if you have multiple), and &lt;code&gt;--prune&lt;/code&gt; removes stale remote-tracking refs — but neither expands a refspec that's scoped to a single branch. You still have to fix the refspec itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will fixing the refspec break anything?
&lt;/h3&gt;

&lt;p&gt;No. It just tells git to track all branches instead of one. You won't lose history, refs, or local branches. The next &lt;code&gt;git fetch origin&lt;/code&gt; will pull down all the previously-ignored remote branches.&lt;/p&gt;

</description>
      <category>git</category>
      <category>tutorial</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Fixing Godot MCP in Cursor on WSL</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:07:03 +0000</pubDate>
      <link>https://forem.com/sym/fixing-godot-mcp-in-cursor-on-wsl-3llc</link>
      <guid>https://forem.com/sym/fixing-godot-mcp-in-cursor-on-wsl-3llc</guid>
      <description>&lt;p&gt;If &lt;a href="https://github.com/Coding-Solo/godot-mcp" rel="noopener noreferrer"&gt;godot-mcp&lt;/a&gt; won't connect in Cursor on WSL, the real culprit is almost always that Cursor is a Windows app trying to launch a Linux Node binary it can't see. The fix is to set &lt;code&gt;wsl.exe&lt;/code&gt; as the command in &lt;code&gt;mcp.json&lt;/code&gt; and pass &lt;code&gt;node&lt;/code&gt; plus the absolute Linux path as arguments. Two smaller gotchas usually compound the problem along the way: tildes (&lt;code&gt;~&lt;/code&gt;) don't expand inside JSON, and JSON config files don't allow &lt;code&gt;//&lt;/code&gt; comments.&lt;/p&gt;

&lt;p&gt;This post walks through all three issues in the order I hit them, with the working &lt;code&gt;mcp.json&lt;/code&gt; config at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Symptom:&lt;/strong&gt; Cursor logs show &lt;code&gt;Server not yet created, returning empty offerings&lt;/code&gt; and the MCP server never connects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause:&lt;/strong&gt; Cursor runs on Windows; your &lt;code&gt;node&lt;/code&gt; lives in WSL. Cursor can't see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;code&gt;"command": "wsl.exe"&lt;/code&gt; and put &lt;code&gt;node&lt;/code&gt; plus the absolute path in &lt;code&gt;"args"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two side bugs:&lt;/strong&gt; &lt;code&gt;~&lt;/code&gt; doesn't expand in JSON values, and &lt;code&gt;//&lt;/code&gt; comments break JSON parsing silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final step:&lt;/strong&gt; Fully restart Cursor (not just reload), then open Godot before invoking &lt;code&gt;godot-mcp&lt;/code&gt; tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I wanted to use the &lt;code&gt;godot-mcp&lt;/code&gt; package to let Cursor's AI interact directly with Godot — launching the editor, querying project info, managing scenes, all that good stuff. I downloaded it, built it, added it to Cursor's &lt;code&gt;mcp.json&lt;/code&gt;, and got this in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2026-03-07 10:55:11.578 [info] Server not yet created, returning empty offerings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not helpful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Things Were Wrong
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Tilde doesn't expand in JSON
&lt;/h3&gt;

&lt;p&gt;My first config looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"~/game_dev/godot-mcp/build/index.js"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cursor launches MCP servers directly without a shell, so &lt;code&gt;~&lt;/code&gt; never gets expanded. It's looking for a file literally named &lt;code&gt;~&lt;/code&gt;. Use the full absolute path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/home/yourname/game_dev/godot-mcp/build/index.js"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. JSON doesn't support comments
&lt;/h3&gt;

&lt;p&gt;I had copied the example config which included:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"DEBUG"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Optional:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Enable&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;detailed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;logging&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;//&lt;/code&gt; comment is invalid JSON and will silently break parsing. Remove it.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cursor is a Windows app — it can't see your WSL Node
&lt;/h3&gt;

&lt;p&gt;This was the real one. Even after fixing the path and the comment, the server still wouldn't start. The reason: &lt;strong&gt;Cursor runs on Windows&lt;/strong&gt;. When it tries to execute &lt;code&gt;node&lt;/code&gt;, it's looking for a Windows binary — not the one you installed inside WSL.&lt;/p&gt;

&lt;p&gt;My WSL Node worked fine in the terminal. Cursor had no idea it existed.&lt;/p&gt;

&lt;p&gt;Worth noting: if you're using nvm inside WSL, this compounds the problem. Cursor doesn't run your shell init files, so even if nvm is configured in your &lt;code&gt;.bashrc&lt;/code&gt; or &lt;code&gt;.zshrc&lt;/code&gt;, Cursor won't pick it up. You can't just point at &lt;code&gt;node&lt;/code&gt; and expect it to resolve.&lt;/p&gt;
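
&lt;p&gt;A quick way to check what Cursor will actually see is to run the same bridge yourself from PowerShell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run from PowerShell or cmd, not from inside WSL
wsl.exe node --version   # if this prints a version, Cursor's bridge can find node
wsl.exe which node       # if it prints a path, use that absolute path in mcp.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;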

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;wsl.exe&lt;/code&gt; as the command, and pass your WSL path as an argument. Windows knows how to find &lt;code&gt;wsl.exe&lt;/code&gt;, and it bridges the call into your Linux environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"godot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wsl.exe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/home/yourname/game_dev/godot-mcp/build/index.js"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"DEBUG"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"disabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"autoApprove"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"launch_editor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"run_project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"get_debug_output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"stop_project"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"get_godot_version"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"list_projects"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"get_project_info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"create_scene"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"add_node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"load_sprite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"export_mesh_library"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"save_scene"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"get_uid"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"update_project_uids"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a full Cursor restart (not just reload), the MCP server showed as connected.&lt;/p&gt;

&lt;h2&gt;
  
  
  One More Thing
&lt;/h2&gt;

&lt;p&gt;Most of the useful &lt;code&gt;godot-mcp&lt;/code&gt; tools require Godot's editor to be open with your project loaded. The MCP connects to a running editor instance — it's not fully standalone. So once Cursor shows the server as connected, open Godot before you start using tools like &lt;code&gt;get_project_info&lt;/code&gt; or &lt;code&gt;launch_editor&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why does &lt;code&gt;wsl.exe&lt;/code&gt; work when &lt;code&gt;node&lt;/code&gt; doesn't?
&lt;/h3&gt;

&lt;p&gt;Windows knows where &lt;code&gt;wsl.exe&lt;/code&gt; is via &lt;code&gt;PATH&lt;/code&gt;, and &lt;code&gt;wsl.exe&lt;/code&gt; knows how to invoke programs inside your WSL distribution. So &lt;code&gt;wsl.exe node /home/.../index.js&lt;/code&gt; is really "Windows runs wsl.exe, which runs Linux node, which runs your script." The Linux Node binary stays inside WSL where it belongs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to do anything special for nvm?
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;node&lt;/code&gt; is managed by nvm inside WSL, the first command in &lt;code&gt;args&lt;/code&gt; should be &lt;code&gt;node&lt;/code&gt; — &lt;code&gt;wsl.exe&lt;/code&gt; will resolve it through your default WSL shell PATH for non-interactive invocations. If that fails, replace &lt;code&gt;"node"&lt;/code&gt; with the absolute path to the active nvm node binary (e.g. &lt;code&gt;/home/you/.nvm/versions/node/v20.11.0/bin/node&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a full Cursor restart instead of a reload?
&lt;/h3&gt;

&lt;p&gt;MCP servers are launched as child processes of Cursor at startup. A reload reuses the parent process and may keep stale state. A full quit + relaunch forces Cursor to reread &lt;code&gt;mcp.json&lt;/code&gt; and respawn the servers cleanly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does this same approach work for other MCP servers on WSL?
&lt;/h3&gt;

&lt;p&gt;Yes. Any MCP server that's installed inside WSL and run via &lt;code&gt;node&lt;/code&gt; (or &lt;code&gt;python&lt;/code&gt;, etc.) hits the same problem and uses the same fix — &lt;code&gt;"command": "wsl.exe"&lt;/code&gt; with the interpreter and absolute Linux path as args. Servers installed as native Windows binaries don't need this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does my JSON look fine but Cursor still ignores it?
&lt;/h3&gt;

&lt;p&gt;The two silent killers are &lt;code&gt;//&lt;/code&gt; line comments (invalid JSON, parsers reject the whole file) and trailing commas (also invalid JSON in strict parsers). If Cursor isn't picking up your config at all, paste the file into a JSON validator first.&lt;/p&gt;

</description>
      <category>godot</category>
      <category>gamedev</category>
      <category>mcp</category>
      <category>wsl</category>
    </item>
    <item>
      <title>Building a Context-Aware AI Chat Without a Vector Database</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:06:59 +0000</pubDate>
      <link>https://forem.com/sym/building-a-context-aware-ai-chat-without-a-vector-database-55c7</link>
      <guid>https://forem.com/sym/building-a-context-aware-ai-chat-without-a-vector-database-55c7</guid>
      <description>&lt;p&gt;You can ground an AI chat in your own data without a vector database by assembling the relevant documents directly into the system prompt before each request. No embedding pipeline, no similarity search, no separate infrastructure — just your structured data, formatted cleanly, injected as system context. It works well when your dataset is modest (hundreds of documents, not millions) and naturally segmented into logical groups.&lt;/p&gt;

&lt;p&gt;This is the pattern I used building Wiskr, a multi-model chat app that grounds conversations in documents from a connected document store. The rest of this post walks through how to implement it, where it breaks down, and how to upgrade to full RAG when you outgrow it.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The pattern:&lt;/strong&gt; Group documents into named contexts, load active contexts on each request, format them into a system prompt, prepend it to every API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vector DB needed:&lt;/strong&gt; For modest datasets, the model reads structured JSON directly — embeddings and similarity search are unnecessary overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token-limit guardrails:&lt;/strong&gt; Cap documents per context, summarize long ones, let users pin important ones, then add vector search only when those run out of room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade path:&lt;/strong&gt; When you need real RAG later, the context-assembly layer stays put — you just add smarter document selection in front of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit:&lt;/strong&gt; Personal assistants, support tools, document Q&amp;amp;A, and any AI feature that needs to reason about a bounded, structured user-specific dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The core idea
&lt;/h2&gt;

&lt;p&gt;A standard LLM chat call looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-sonnet-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's my copay for metformin?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model has no idea who you are or what documents you have. It can only work with what's in the messages array.&lt;/p&gt;

&lt;p&gt;The context assembly pattern adds a system message that packages your relevant data before the conversation begins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anthropic/claude-sonnet-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;assembledContext&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's my copay for metformin?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the model has your data and can reason against it. The question is how to build &lt;code&gt;assembledContext&lt;/code&gt; well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Organize data into contexts
&lt;/h2&gt;

&lt;p&gt;The first thing you need is a way to group related documents. In Wiskr these are called contexts — named buckets like "Medical," "Vehicle," "Insurance," or "House." Each conversation has a set of active contexts the user selects before chatting.&lt;/p&gt;

&lt;p&gt;In the database this is a simple structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;contexts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;context_id&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;contexts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Documents belong to contexts. Contexts belong to users. When a chat starts, the user picks which contexts are active — and only those get assembled into the prompt.&lt;/p&gt;
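
&lt;p&gt;The schema above doesn't show how a conversation remembers its active contexts; one plausible shape (hypothetical, not from Wiskr) is a link table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical: which contexts a given conversation has active
CREATE TABLE conversation_contexts (
  conversation_id uuid,
  context_id uuid REFERENCES contexts(id),
  PRIMARY KEY (conversation_id, context_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;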

&lt;h2&gt;
  
  
  Step 2: Load active context documents
&lt;/h2&gt;

&lt;p&gt;When a conversation starts, load the documents for each active context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loadContextDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;contextIds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT c.name as context_name, d.title, d.content
     FROM documents d
     JOIN contexts c ON c.id = d.context_id
     WHERE d.context_id = ANY($1)
     ORDER BY c.name, d.created_at DESC`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;contextIds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Assemble the system prompt
&lt;/h2&gt;

&lt;p&gt;With the documents loaded, format them into a readable system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;assembleSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Group documents by context name&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;byContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context_name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;context_name&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contextBlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;byContext&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(([&lt;/span&gt;&lt;span class="nx"&gt;contextName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;docBlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;`
### &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;
&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;
    `&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`## &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;contextName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;docBlocks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`You are a helpful assistant with access to the user's personal documents.
Use the information below to give accurate, personalized responses.
If the answer isn't in the documents, say so — don't guess.

&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;contextBlocks&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Raw JSON is fine for the document content. Current models read it well, and it preserves the structure of your data without you having to write custom serializers for every document type.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Inject into every request
&lt;/h2&gt;

&lt;p&gt;Pass the assembled context as the system message on every API call, alongside the full conversation history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Load conversation state&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getConversation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadContextDocuments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;activeContextIds&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getMessageHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Assemble context&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;assembleSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Build messages array&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="c1"&gt;// Call the model&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;assistantMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Save to history&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userMessage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversationId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;assistantMessage&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;assistantMessage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Handling token limits
&lt;/h2&gt;

&lt;p&gt;The obvious risk with this approach is bloated prompts. If a user has 50 documents in their active contexts, you'll hit token limits fast.&lt;/p&gt;

&lt;p&gt;A few practical strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cap documents per context.&lt;/strong&gt; The simplest option — include only the N most recent documents per context. For most use cases, the newest 10-15 documents per context are the most relevant anyway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s2"&gt;`SELECT c.name as context_name, d.title, d.content
   FROM documents d
   JOIN contexts c ON c.id = d.context_id
   WHERE d.context_id = ANY($1)
   ORDER BY c.name, d.created_at DESC
   LIMIT 15`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// cap per context&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;contextIds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Summarize large documents.&lt;/strong&gt; If individual documents are long, run them through a cheap, fast model first to produce a condensed version before assembling the prompt.&lt;/p&gt;
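
&lt;p&gt;A rough sketch of what that could look like, reusing the same OpenAI-style &lt;code&gt;client&lt;/code&gt; from earlier (the model name and length threshold are assumptions, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hedged sketch: condense long documents before prompt assembly.
// The 4,000-character threshold and "gpt-4o-mini" are assumptions -- tune both.
async function condenseIfLong(doc, maxChars = 4000) {
  const text = JSON.stringify(doc.content);
  if (text.length &amp;lt;= maxChars) return doc;

  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Summarize this document. Keep every concrete fact, name, number, and date." },
      { role: "user", content: text },
    ],
  });

  return { ...doc, content: response.choices[0].message.content };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
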

&lt;p&gt;&lt;strong&gt;Let users pin documents.&lt;/strong&gt; Give users control — a pinned document always gets included, everything else is capped or summarized. This is often more useful than trying to guess relevance automatically.&lt;/p&gt;
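
&lt;p&gt;A minimal sketch of that selection logic, assuming a boolean &lt;code&gt;pinned&lt;/code&gt; column on documents (not part of the schema in this post; you'd add it yourself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Assumes a hypothetical "pinned" boolean column, and that documents
// arrive newest-first per context, as in the ORDER BY from Step 2.
function selectDocuments(documents, capPerContext = 15) {
  const pinned = documents.filter(d =&amp;gt; d.pinned);
  const rest = documents.filter(d =&amp;gt; !d.pinned);

  // Pinned documents always make the cut; the rest are capped per context.
  const seen = {};
  const capped = rest.filter(d =&amp;gt; {
    seen[d.context_name] = (seen[d.context_name] || 0) + 1;
    return seen[d.context_name] &amp;lt;= capPerContext;
  });

  return [...pinned, ...capped];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
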

&lt;p&gt;&lt;strong&gt;Add vector search later.&lt;/strong&gt; When your data grows large enough that capping and pinning don't cut it, vector search is the right next step. You add an embedding column, generate embeddings on save, and query by cosine similarity to find the most relevant documents for each conversation. The context assembly step stays the same — you just get smarter document selection feeding into it.&lt;/p&gt;
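
&lt;p&gt;For a taste of what that smarter selection looks like, here's a hedged sketch using pgvector syntax (it assumes an &lt;code&gt;embedding&lt;/code&gt; column already populated, and a &lt;code&gt;generateEmbedding&lt;/code&gt; helper you'd write against your embedding provider):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Same shape as loadContextDocuments, but ranked by relevance instead
// of recency. pgvector's &amp;lt;=&amp;gt; operator is cosine distance.
async function loadRelevantDocuments(db, contextIds, userMessage) {
  const queryEmbedding = await generateEmbedding(userMessage); // hypothetical helper
  const result = await db.query(
    `SELECT c.name as context_name, d.title, d.content
     FROM documents d
     JOIN contexts c ON c.id = d.context_id
     WHERE d.context_id = ANY($1)
     ORDER BY d.embedding &amp;lt;=&amp;gt; $2
     LIMIT 15`,
    [contextIds, JSON.stringify(queryEmbedding)]
  );
  return result.rows;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
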

&lt;h2&gt;
  
  
  When this is the right approach
&lt;/h2&gt;

&lt;p&gt;This pattern works well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data is structured (JSON, not unstructured text blobs)&lt;/li&gt;
&lt;li&gt;Your dataset is modest (hundreds of documents, not millions)&lt;/li&gt;
&lt;li&gt;Users naturally segment their data into logical groups&lt;/li&gt;
&lt;li&gt;You want something working fast without infrastructure overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a good starting point for any AI feature that needs to reason about user-specific data — support tools, personal assistants, document Q&amp;amp;A, anything where the data set is bounded and the structure is known.&lt;/p&gt;

&lt;p&gt;When you outgrow it, the upgrade path to full RAG is incremental rather than a rewrite. The context assembly layer stays. You just add smarter selection in front of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When should I use context assembly instead of full RAG with a vector database?
&lt;/h3&gt;

&lt;p&gt;Use context assembly when your dataset is bounded (a few hundred documents per user, max), the documents are already structured (JSON, key-value, or short prose), and users have a natural way to scope which subset is relevant for a conversation. Switch to vector-database RAG when you can't fit the relevant slice in a system prompt, when relevance ranking actually matters, or when content is long-form unstructured text.&lt;/p&gt;

&lt;h3&gt;
  
  
  How big can the system prompt get before this falls apart?
&lt;/h3&gt;

&lt;p&gt;Modern frontier models accept 200K+ token context windows, but cost and latency both scale with prompt size. As a practical rule, keep the assembled context under ~20K tokens for most consumer use cases — beyond that you'll feel the latency in chat, and the per-request cost adds up fast.&lt;/p&gt;
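
&lt;p&gt;If you want a guardrail instead of a guess, a rough heuristic is enough (the 4-characters-per-token ratio below is an approximation for English text, not a real tokenizer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// ~4 characters per token is a decent estimate for English text.
// Swap in a real tokenizer (e.g. tiktoken) if you need precision.
const estimateTokens = (text) =&amp;gt; Math.ceil(text.length / 4);

function fitsBudget(systemPrompt, budgetTokens = 20000) {
  return estimateTokens(systemPrompt) &amp;lt;= budgetTokens;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
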

&lt;h3&gt;
  
  
  Does this work with any LLM provider?
&lt;/h3&gt;

&lt;p&gt;Yes. The pattern is just a system message — every chat-completions API supports it. I've used the same code unchanged across OpenAI, Anthropic, and OpenRouter.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I migrate to full RAG later without rewriting everything?
&lt;/h3&gt;

&lt;p&gt;Keep the context-assembly function as-is. Add an embedding column to the documents table, generate embeddings on save, and replace the "load all documents in active contexts" query with "load top N documents in active contexts ranked by cosine similarity to the user's question." Everything downstream of that — the prompt formatting, the chat call — stays identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about prompt caching?
&lt;/h3&gt;

&lt;p&gt;This pattern composes well with prompt caching. The system prompt changes only when documents are added/edited, so providers that support prompt caching (Anthropic, OpenAI) can cache the assembled context across turns and dramatically cut input-token cost on follow-up messages.&lt;/p&gt;
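
&lt;p&gt;With Anthropic's SDK, for example, you opt in by marking the system block as cacheable. Roughly, at the time of writing (the model name is illustrative, and caching APIs evolve, so check the provider docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only -- verify against Anthropic's current docs.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5", // illustrative model name
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: systemPrompt,
      cache_control: { type: "ephemeral" }, // cache the assembled context
    },
  ],
  messages: history,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
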

&lt;h3&gt;
  
  
  Is it safe to dump raw user data into the system prompt?
&lt;/h3&gt;

&lt;p&gt;For a single-tenant app where the user owns the data, yes — that's the whole point. For multi-tenant apps, be strict about which user's contexts get loaded, and never assemble across users. A bug in context selection becomes a data-leak bug.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Adding Semantic Search to Your Postgres App with pgvector</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:06:55 +0000</pubDate>
      <link>https://forem.com/sym/adding-semantic-search-to-your-postgres-app-with-pgvector-448e</link>
      <guid>https://forem.com/sym/adding-semantic-search-to-your-postgres-app-with-pgvector-448e</guid>
      <description>&lt;p&gt;pgvector is a Postgres extension that adds vector storage and similarity search to an existing database, so you can run semantic queries directly against your application data without standing up a separate vector store. If you're already on Postgres, you can enable it with one &lt;code&gt;CREATE EXTENSION&lt;/code&gt; statement, add a vector column to any table, and have semantic search returning results the same day.&lt;/p&gt;

&lt;p&gt;This post walks through adding it to an existing app — from installing the extension to running your first semantic query, with an HNSW index for performance at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; A Postgres extension that adds a &lt;code&gt;vector&lt;/code&gt; column type and similarity-search operators (&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Semantic search without a separate vector database, hybrid keyword-and-semantic queries in one SQL statement, and no new service to operate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Five steps to ship it:&lt;/strong&gt; Install the extension, add a &lt;code&gt;vector(N)&lt;/code&gt; column, embed at write time, query with cosine similarity, add an HNSW index for scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding cost:&lt;/strong&gt; ~$0.02 per million tokens with &lt;code&gt;text-embedding-3-small&lt;/code&gt;. Ollama runs embedding models locally for free if you'd rather not depend on a provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to upgrade beyond pgvector:&lt;/strong&gt; Tens of millions of vectors with sub-50ms latency requirements. Below that, pgvector is plenty.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's the difference between keyword search and semantic search?
&lt;/h2&gt;

&lt;p&gt;Keyword search finds exact matches. If a user searches "cholesterol prescription" and your record says "lipid panel results," they get nothing.&lt;/p&gt;

&lt;p&gt;Semantic search finds meaning. It understands that "cholesterol prescription" and "lipid panel results" are related concepts, and surfaces the right record even without a word match.&lt;/p&gt;

&lt;p&gt;That's what vector embeddings buy you. Instead of storing text, you store a numerical representation of what that text &lt;em&gt;means&lt;/em&gt;. Search becomes a question of mathematical similarity rather than string matching.&lt;/p&gt;
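
&lt;p&gt;"Mathematical similarity" here usually means cosine similarity between embedding vectors. A toy illustration (real embeddings have hundreds or thousands of dimensions, not three):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Toy cosine similarity -- real embeddings have 1,536+ dimensions.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i &amp;lt; a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([0.9, 0.1, 0.3], [0.8, 0.2, 0.4]); // ~0.98 -- close in meaning
cosineSimilarity([0.9, 0.1, 0.3], [0.1, 0.9, 0.2]); // ~0.27 -- not related
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
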

&lt;h2&gt;
  
  
  Step 1: Enable the extension
&lt;/h2&gt;

&lt;p&gt;If you're running Postgres locally or in Docker, install pgvector first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Ubuntu / Debian&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;postgresql-16-pgvector

&lt;span class="c"&gt;# or via Docker — use the pgvector image instead of plain postgres&lt;/span&gt;
&lt;span class="c"&gt;# docker pull pgvector/pgvector:pg16&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then enable it in your database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No separate service, no new connection string.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Add an embedding column to your table
&lt;/h2&gt;

&lt;p&gt;Pick whichever table holds the content you want to make searchable. Add a vector column — the dimension count needs to match the embedding model you'll use.&lt;/p&gt;

&lt;p&gt;OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; outputs 1536 dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use a different model, check its output dimension and use that number instead. The dimension has to be consistent — you can't mix embeddings from different models in the same column.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Generate embeddings when content is saved
&lt;/h2&gt;

&lt;p&gt;Whenever a record is created or updated, generate an embedding from its text content and store it. Here's a Node.js example using the OpenAI SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;saveDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Build a text representation of what you want to be searchable&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;textToEmbed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;textToEmbed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`INSERT INTO documents (title, content, tags, embedding)
     VALUES ($1, $2, $3, $4)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you embed matters.&lt;/strong&gt; Concatenating title, tags, and content into one string gives the model more signal than just the raw content. Experiment with what makes your search results feel right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embed at write time, not search time.&lt;/strong&gt; Pre-computing embeddings keeps search fast. You don't want to embed on every query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you have existing records&lt;/strong&gt;, run a backfill script to generate embeddings for everything already in the database before you go live (a sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
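
&lt;p&gt;Here's a minimal sketch of that backfill (the batch size is an assumption; add rate limiting to taste):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// One-off backfill: embed every row that doesn't have an embedding yet.
async function backfillEmbeddings(db) {
  while (true) {
    const { rows } = await db.query(
      `SELECT id, title, tags, content FROM documents
       WHERE embedding IS NULL
       LIMIT 100`
    );
    if (rows.length === 0) break;

    for (const row of rows) {
      const text = `${row.title} ${row.tags.join(" ")} ${row.content}`;
      const embedding = await generateEmbedding(text);
      await db.query(`UPDATE documents SET embedding = $1 WHERE id = $2`, [
        JSON.stringify(embedding),
        row.id,
      ]);
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
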

&lt;h2&gt;
  
  
  Step 4: Search with cosine similarity
&lt;/h2&gt;

&lt;p&gt;When a user submits a search query, embed it the same way you embedded your content, then find the closest matches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;semanticSearch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queryEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`SELECT id, title, content,
            1 - (embedding &amp;lt;=&amp;gt; $1) AS similarity
     FROM documents
     ORDER BY embedding &amp;lt;=&amp;gt; $1
     LIMIT $2`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queryEmbedding&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; operator is cosine distance — lower means more similar. &lt;code&gt;1 - (embedding &amp;lt;=&amp;gt; $1)&lt;/code&gt; converts that back to cosine similarity. Strictly it ranges from -1 to 1, but text embeddings in practice land between 0 and 1, so it works fine as a score to display or filter by confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Add an index for performance
&lt;/h2&gt;

&lt;p&gt;Without an index, Postgres does an exact nearest-neighbor scan across every row — fine for small tables, slow for large ones. Add an HNSW index to keep queries fast at scale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HNSW (Hierarchical Navigable Small World) is an approximate nearest-neighbor algorithm. It trades a tiny amount of recall accuracy for a large speed gain. For most applications the tradeoff is well worth it.&lt;/p&gt;
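
&lt;p&gt;If the defaults don't cut it, HNSW exposes both build-time and query-time knobs (the values below are pgvector's defaults, shown only as a starting point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// m = connections per graph node, ef_construction = build-time search width.
// Higher values mean better recall, slower builds, more memory.
await db.query(
  `CREATE INDEX ON documents
   USING hnsw (embedding vector_cosine_ops)
   WITH (m = 16, ef_construction = 64)`
);

// Query-time recall/speed tradeoff for the current session (default 40).
await db.query(`SET hnsw.ef_search = 40`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
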

&lt;h2&gt;
  
  
  Putting it together
&lt;/h2&gt;

&lt;p&gt;Here's what the full flow looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User saves a document → you generate an embedding → store it in the &lt;code&gt;embedding&lt;/code&gt; column&lt;/li&gt;
&lt;li&gt;User searches → you embed the query → run cosine similarity against stored embeddings → return top matches&lt;/li&gt;
&lt;li&gt;Results feel like the app actually understands what the user is looking for&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A few things to keep in mind
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Embedding cost is low but not zero.&lt;/strong&gt; OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; is cheap — around $0.02 per million tokens — but it adds up at scale. If you're embedding large documents frequently, keep an eye on usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local embeddings are an option.&lt;/strong&gt; If you want to keep everything in-house, &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; can run embedding models locally. The quality varies by model, but for many use cases it's more than good enough and costs nothing per query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid search is often better.&lt;/strong&gt; Semantic search alone can miss exact matches that keyword search would catch. For production apps, consider combining both — run a keyword search with &lt;code&gt;tsvector&lt;/code&gt; and a vector search with pgvector, then merge and rank the results. This is sometimes called hybrid search or reciprocal rank fusion.&lt;/p&gt;
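
&lt;p&gt;A minimal sketch of reciprocal rank fusion over the two result lists (the constant 60 is the conventional value from the original RRF paper, but treat it as tunable):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// score(doc) = sum over result lists of 1 / (k + rank), with k = 60 by convention.
function reciprocalRankFusion(resultLists, k = 60) {
  const scores = new Map();
  for (const results of resultLists) {
    results.forEach((row, i) =&amp;gt; {
      const entry = scores.get(row.id) || { row, score: 0 };
      entry.score += 1 / (k + i + 1); // ranks start at 1
      scores.set(row.id, entry);
    });
  }
  return [...scores.values()]
    .sort((a, b) =&amp;gt; b.score - a.score)
    .map((entry) =&amp;gt; entry.row);
}

// Usage: fuse the keyword results and the semantic results you already have.
// const fused = reciprocalRankFusion([keywordRows, semanticRows]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
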

&lt;p&gt;&lt;strong&gt;Chunking matters for long documents.&lt;/strong&gt; Embedding a 10,000-word document as a single vector loses a lot of nuance. For long content, chunk it into paragraphs or sections, embed each chunk separately, and link chunks back to the parent document.&lt;/p&gt;
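
&lt;p&gt;A rough sketch of the chunking side (the &lt;code&gt;document_chunks&lt;/code&gt; table and the paragraph-split strategy are assumptions; pick whatever boundaries fit your content):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Naive paragraph chunking -- good enough as a starting point.
function chunkByParagraph(text, maxChars = 1500) {
  const chunks = [];
  let current = "";
  for (const para of text.split(/\n\s*\n/)) {
    if ((current + para).length &amp;gt; maxChars) {
      if (current.trim()) chunks.push(current.trim());
      current = "";
    }
    current += para + "\n\n";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Each chunk gets its own row and embedding, linked back to the parent.
async function saveChunks(db, parentId, text) {
  for (const chunk of chunkByParagraph(text)) {
    const embedding = await generateEmbedding(chunk);
    await db.query(
      `INSERT INTO document_chunks (parent_id, content, embedding)
       VALUES ($1, $2, $3)`,
      [parentId, chunk, JSON.stringify(embedding)]
    );
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
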




&lt;p&gt;pgvector is one of those things that looks complicated from the outside but is surprisingly approachable once you start. If you're already on Postgres, there's no reason not to have it.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need a separate vector database if I'm already using Postgres?
&lt;/h3&gt;

&lt;p&gt;For most apps, no. pgvector handles tens of millions of vectors comfortably with an HNSW index, and you keep the operational simplicity of one database. You'd reach for a dedicated vector store (Pinecone, Weaviate, Qdrant, Milvus) only when you need extreme scale, very low latency, or specialized features like hybrid sparse/dense indexing that pgvector doesn't cover.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which embedding model should I use with pgvector?
&lt;/h3&gt;

&lt;p&gt;For most production use cases, OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; (1536 dims) is the default — cheap, fast, and high quality. Use &lt;code&gt;text-embedding-3-large&lt;/code&gt; (3072 dims) if you need more accuracy and can pay for it. For local/private deployments, Ollama running &lt;code&gt;nomic-embed-text&lt;/code&gt; or &lt;code&gt;mxbai-embed-large&lt;/code&gt; is a solid choice. The dimension number in your column type has to match the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between HNSW and IVFFlat indexes?
&lt;/h3&gt;

&lt;p&gt;HNSW is faster to query and gives better recall, but takes longer to build and uses more memory. IVFFlat is faster to build, lighter on memory, but slower to query and less accurate. For most production workloads, HNSW is the right default. IVFFlat is fine if you're indexing very large datasets infrequently and care about build time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use cosine, L2, or inner product distance?
&lt;/h3&gt;

&lt;p&gt;Cosine distance (&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt;) is the right default for text embeddings — it ignores vector magnitude and only compares direction, which matches how text embedding models are trained. Use L2 (&lt;code&gt;&amp;lt;-&amp;gt;&lt;/code&gt;) for image embeddings or anything where magnitude carries meaning. Inner product (&lt;code&gt;&amp;lt;#&amp;gt;&lt;/code&gt;) is the fastest option when your vectors are normalized, and for normalized vectors it ranks results identically to cosine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to re-embed when I update a record?
&lt;/h3&gt;

&lt;p&gt;Only if the text you embedded changed. The cleanest pattern is to embed a derived "search text" string (title + tags + content), and re-embed whenever any of those source fields change. A trigger or &lt;code&gt;BEFORE UPDATE&lt;/code&gt; hook keeps it in sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I combine semantic search with regular SQL filters?
&lt;/h3&gt;

&lt;p&gt;Yes — that's one of pgvector's biggest advantages. You can &lt;code&gt;WHERE user_id = $1 AND status = 'active' ORDER BY embedding &amp;lt;=&amp;gt; $2 LIMIT 10&lt;/code&gt; and get filtered semantic search in one query. With a separate vector store, you'd have to filter in two places and reconcile.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>ai</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>What Is MCP and Why Should Developers Care?</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Tue, 28 Apr 2026 22:06:36 +0000</pubDate>
      <link>https://forem.com/sym/what-is-mcp-and-why-should-developers-care-10b3</link>
      <guid>https://forem.com/sym/what-is-mcp-and-why-should-developers-care-10b3</guid>
      <description>&lt;p&gt;MCP (Model Context Protocol) is an open standard that lets AI models connect to external tools and data sources through a single, consistent interface. Anthropic introduced it in late 2024 to replace the bespoke per-tool integrations developers used to build by hand — one shared protocol that works across any MCP-compatible AI host like Claude Desktop or Cursor.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What it is:&lt;/strong&gt; An open standard for AI-to-tool integration, introduced by Anthropic in late 2024.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it exists:&lt;/strong&gt; Before MCP, every AI tool needed custom integrations. MCP makes them portable across hosts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works:&lt;/strong&gt; Hosts (Cursor, Claude Desktop) talk to Servers (filesystem, GitHub, database) via Clients using the MCP protocol.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What servers expose:&lt;/strong&gt; Tools (actions the AI can call), Resources (data the AI can read), Prompts (reusable templates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why developers should care:&lt;/strong&gt; Build an integration once, plug it into any MCP-compatible host. Less duplicated work, more leverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The problem MCP solves
&lt;/h2&gt;

&lt;p&gt;AI assistants like Claude, GPT, and Gemini are powerful inside a conversation. But by default they're isolated. They can reason about text you give them, but they can't see your codebase, query your database, check your calendar, or interact with the tools you actually use. Every integration has to be custom-built — you wire up an API call here, a function there, and it's all bespoke plumbing that doesn't transfer between tools.&lt;/p&gt;

&lt;p&gt;This is the problem Model Context Protocol (MCP) is designed to fix.&lt;/p&gt;

&lt;p&gt;MCP is an open standard, introduced by Anthropic in late 2024, that defines a consistent way for AI models to connect to external tools and data sources. Instead of every AI tool reinventing its own integration layer, MCP gives developers a shared protocol — one way to build a connection that works across any MCP-compatible AI host.&lt;/p&gt;

&lt;p&gt;Think of it like USB. Before USB, every device had its own connector. After USB, you plug anything into anything. MCP is trying to be that for AI tool integrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;MCP has three main pieces:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hosts&lt;/strong&gt; are the AI applications the user interacts with — Cursor, Claude Desktop, or any app that's built MCP support in. The host manages connections to servers and mediates between the AI model and the outside world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Servers&lt;/strong&gt; are the integrations. An MCP server exposes a set of tools — things like "read a file," "query a database," "run a terminal command," or "fetch a web page." Servers can be local processes or remote services. They're relatively small, focused, and purpose-built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clients&lt;/strong&gt; live inside the host and handle the communication between the host and each server using the MCP protocol.&lt;/p&gt;

&lt;p&gt;When a user asks their AI assistant to do something that requires an external tool, the host checks what MCP servers are connected, picks the right tool, calls it, gets the result, and feeds it back to the model as context. The model never directly touches the external system — it just sees the results as part of its context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP servers can expose
&lt;/h2&gt;

&lt;p&gt;MCP servers can expose three types of things:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; are functions the AI can call — actions that do something. Run a shell command, create a file, send a message, query an API. These are the most common and most useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; are data the AI can read — files, database records, documents, anything that can be fetched and fed into context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompts&lt;/strong&gt; are reusable prompt templates the server makes available to the host — useful for standardizing how certain tasks get framed.&lt;/p&gt;

&lt;p&gt;Most real-world MCP servers focus on tools. That's where the practical value is.&lt;/p&gt;
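
&lt;p&gt;To make that concrete, here's roughly what a one-tool server looks like with the official TypeScript SDK (a hedged sketch, since the SDK surface is still evolving, and the tool itself is a made-up example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch against @modelcontextprotocol/sdk -- check the current docs.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "demo", version: "1.0.0" });

// A hypothetical tool the host's model can call.
server.tool("get_weather", { city: z.string() }, async ({ city }) =&amp;gt; ({
  content: [{ type: "text", text: `It's sunny in ${city}.` }],
}));

await server.connect(new StdioServerTransport());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
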

&lt;h2&gt;
  
  
  A concrete example
&lt;/h2&gt;

&lt;p&gt;Say you're using Cursor to write code for a Godot game. Normally, Cursor can read files you paste in and suggest code, but it has no idea what's actually in your Godot project — what scenes exist, what nodes are in them, what the project structure looks like.&lt;/p&gt;

&lt;p&gt;With a Godot MCP server running, Cursor can call tools like &lt;code&gt;get_project_info&lt;/code&gt;, &lt;code&gt;list_scenes&lt;/code&gt;, or &lt;code&gt;get_node_tree&lt;/code&gt; and get real data back from your actual open project. The AI goes from working with whatever you manually paste in to working with live context from your development environment. That's a qualitatively different kind of assistance.&lt;/p&gt;

&lt;p&gt;The same pattern applies everywhere: a filesystem MCP server lets the AI read and write files. A database MCP server lets it query your schema and run queries. A GitHub MCP server lets it read issues, PRs, and code. The AI stays the same — what changes is how much of your actual world it can see.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the standard matters
&lt;/h2&gt;

&lt;p&gt;Before MCP, if you wanted Claude to talk to your database and Cursor to talk to the same database, you'd build two separate integrations. If a new AI tool came out that you wanted to try, you'd build a third.&lt;/p&gt;

&lt;p&gt;With MCP, you build the server once. Any MCP-compatible host can connect to it. That's the compounding value of a shared standard — the integration work accumulates instead of being repeated.&lt;/p&gt;

&lt;p&gt;It also means the ecosystem is growing fast. There are already MCP servers for filesystems, databases, web browsers, GitHub, Slack, Google Drive, and dozens of other tools. Most are open source. If a server exists for what you need, you configure it and connect it — no integration work required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means if you're building with AI
&lt;/h2&gt;

&lt;p&gt;A few practical implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're building AI-powered apps&lt;/strong&gt;, MCP gives you a cleaner architecture for tool integrations. Instead of hardcoding API calls into your prompt pipeline, you can expose capabilities as MCP tools and let the model decide when and how to use them. It's more composable and easier to extend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're using AI coding assistants&lt;/strong&gt;, connecting MCP servers to your editor is one of the highest-leverage things you can do right now. Giving your AI assistant access to your actual project context — not just what you paste in — makes it meaningfully more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're evaluating AI tools for your stack&lt;/strong&gt;, MCP support is increasingly a signal worth paying attention to. Tools that support MCP plug into a growing ecosystem. Tools that don't are islands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;The best place to start is &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;Anthropic's MCP documentation&lt;/a&gt;. It covers the spec, has quickstart guides for building servers in Python and TypeScript, and links to the existing server ecosystem.&lt;/p&gt;

&lt;p&gt;If you want to see it in action quickly, Claude Desktop supports MCP out of the box. Install it, configure a filesystem or fetch server in &lt;code&gt;claude_desktop_config.json&lt;/code&gt;, and you'll have a working MCP setup in about ten minutes.&lt;/p&gt;
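
&lt;p&gt;The config shape looks roughly like this (the path is a placeholder, and field names can change between releases, so defer to the current docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
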

&lt;p&gt;For Cursor users, MCP servers are configured in &lt;code&gt;mcp.json&lt;/code&gt; in your project or user config directory. The Cursor docs cover the setup, and there are community-maintained lists of available servers worth browsing.&lt;/p&gt;




&lt;p&gt;MCP is still early. The spec is evolving, the tooling is maturing, and not every AI host supports it yet. But the direction is clear — shared, composable tool integrations are better for everyone than bespoke one-off wiring. If you're building seriously with AI, it's worth understanding now.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between MCP hosts and servers?
&lt;/h3&gt;

&lt;p&gt;A host is the AI application a user interacts with directly — Cursor, Claude Desktop, or any IDE that's added MCP support. A server is an integration that exposes a specific capability, like filesystem access, database queries, or the GitHub API. Hosts connect to one or more servers (via clients) so the AI model can use those capabilities as part of a conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is MCP only for Anthropic's models?
&lt;/h3&gt;

&lt;p&gt;No. MCP is an open standard, and any AI host can implement support for it regardless of which underlying model it uses. Claude Desktop and Cursor were early adopters, but the protocol itself is model-agnostic and not tied to Claude.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to build my own MCP server?
&lt;/h3&gt;

&lt;p&gt;Probably not. There are already open-source MCP servers for filesystems, GitHub, Slack, Google Drive, Postgres, web fetching, and dozens of other common tools. You only need to build a custom server when you have an internal system or workflow no existing server covers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are MCP servers safe to install?
&lt;/h3&gt;

&lt;p&gt;Treat them like any other dependency. MCP servers run as local processes or remote services with whatever permissions you give them, so vet the code or the maintainer before connecting one to your editor — especially if the server can read files, hit your network, or execute shell commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is MCP different from OpenAI function calling or tool use?
&lt;/h3&gt;

&lt;p&gt;Function calling is a model-level feature — a single model deciding to call a function inside one app. MCP sits a layer above that: it standardizes how &lt;em&gt;any&lt;/em&gt; host application discovers and connects to &lt;em&gt;any&lt;/em&gt; tool integration. The same MCP server works with multiple hosts and models without rewriting the integration each time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What languages can I use to build an MCP server?
&lt;/h3&gt;

&lt;p&gt;The official SDKs cover Python and TypeScript today, and community SDKs exist for several other languages. Because MCP is just a protocol over standard transports, you can implement it in anything that can speak JSON-RPC over stdio or HTTP.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>anthropic</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Multi-Select in Visual Studio Code</title>
      <dc:creator>Ryan Carter</dc:creator>
      <pubDate>Fri, 13 Sep 2019 21:42:42 +0000</pubDate>
      <link>https://forem.com/sym/multi-select-in-visual-studio-code-19k2</link>
      <guid>https://forem.com/sym/multi-select-in-visual-studio-code-19k2</guid>
      <description>&lt;h2&gt;
  
  
  tl;dr
&lt;/h2&gt;

&lt;p&gt;I am suddenly using &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;VS Code&lt;/a&gt; because of multi-select (they call it multi-cursor in VS Code). Never thought I would. How the mighty have fallen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mac: Multi-Cursor Shortcuts
&lt;/h2&gt;

&lt;p&gt;(These probably work on Windows with some experimentation.)&lt;/p&gt;

&lt;p&gt;Some shortcuts first, if that is all you're here for. Otherwise my rambling is below too, you know, if you're into that sort of thing. :)&lt;/p&gt;

&lt;h4&gt;
  
  
  NOTE: I use the "Selection =&amp;gt; Switch to Cmd + Click for Multi-Cursor" option.
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Mac: Shift + Cmd + L
&lt;/h3&gt;

&lt;p&gt;Select a word and press &lt;strong&gt;Shift + Cmd + L&lt;/strong&gt; to select all instances of your selection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shift + Alt/Option + I
&lt;/h3&gt;

&lt;p&gt;Select a bunch of lines, then Shift + Alt/Option + I will put a cursor at the end of every selected line.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cmd + Option + Shift + UP/DOWN (ARROW)
&lt;/h3&gt;

&lt;p&gt;Selects in a column directly up or down from the cursor's position.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alt/Option + Click
&lt;/h3&gt;

&lt;p&gt;Adds a new cursor at each place you click (Cmd + Click with the option noted above).&lt;/p&gt;

&lt;h6&gt;
  
  
  See the &lt;a href="https://code.visualstudio.com/docs/getstarted/keybindings" rel="noopener noreferrer"&gt;VS Code Key Bindings page&lt;/a&gt; for more info on OS specific shortcuts
&lt;/h6&gt;




&lt;h1&gt;
  
  
  Senseless Rambling:
&lt;/h1&gt;

&lt;p&gt;The best feature in &lt;a href="https://www.sublimetext.com/" rel="noopener noreferrer"&gt;Sublime Text&lt;/a&gt; 2/3 is hands down the multi-select feature. I've used it in many languages/stacks for years. It allows you to highlight a word, then automatically edit all instances of that word in your file. You can also select all lines in a column to edit many rows of data at the same time. It is basically the editing power of vim, but simpler and more graphical, for vim noob idiots like me. &lt;/p&gt;

&lt;p&gt;Multi-select is the one thing that stopped me from moving to another editor for a very long time. Several other editors have tried to replicate the feature, but none of them get it right enough to feel as smooth and effortless as Sublime does.&lt;/p&gt;

&lt;p&gt;That was until recently, when I gave VS Code another shot. I had initially stopped using it almost immediately because I was trying to write Vue code, and the Vue plugins really did not work correctly and messed up the spacing. On the second try I found that it does have multi-select, and it is delightfully easier to use than most. It isn't quite as good as the original Sublime implementation, but it's good enough to make me switch to VS Code for most things. &lt;/p&gt;

&lt;p&gt;To be fair, I am a bit surprised I like a Microsoft product for programming this much. Microsoft has been making strides for years in many areas and shed the old world view of proprietary nonsense to a large degree. They have truly embraced the open source world with decent offerings. Enough that I have switched. I have gone to the dark side. I don't know if they have cookies, but I'm diabetic so that's a no go anyway. I digress.&lt;/p&gt;

&lt;p&gt;There are many other things that make me like VS Code too, though I likely won't be writing Vue/React in it anytime soon; that depends on whether it can handle the JSX and other space-formatting issues I ran into. The built-in terminal is very nice, as are the easy extension support and the intelligent touches, like smart updates and generally knowing what I want before I need it. Very well done. I appreciate that actual developers make this IDE and made it good for the masses.&lt;/p&gt;

&lt;p&gt;Well Microsoft, you did it. I finally embrace our overlords. Coincidence that Steve Ballmer had to leave for me to get on board with your evil plan for world development domination? I think not.&lt;/p&gt;

&lt;h6&gt;
  
  
  NOTE: Cross-posted from my personal site.
&lt;/h6&gt;

</description>
    </item>
  </channel>
</rss>
