<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alex Cloudstar</title>
    <description>The latest articles on Forem by Alex Cloudstar (@alexcloudstar).</description>
    <link>https://forem.com/alexcloudstar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1190670%2F18910089-3a37-4072-9b4c-289211f053eb.JPG</url>
      <title>Forem: Alex Cloudstar</title>
      <link>https://forem.com/alexcloudstar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexcloudstar"/>
    <language>en</language>
    <item>
      <title>Temporal vs Inngest vs Vercel Workflow in 2026: Picking a Durable Engine</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:53:17 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/temporal-vs-inngest-vs-vercel-workflow-in-2026-picking-a-durable-engine-2geo</link>
      <guid>https://forem.com/alexcloudstar/temporal-vs-inngest-vs-vercel-workflow-in-2026-picking-a-durable-engine-2geo</guid>
      <description>&lt;p&gt;The first time I realized I needed a durable workflow engine was at 3 a.m. on a Tuesday. A batch job that normally took eight minutes had died halfway through because a third party API rate-limited us, the retry logic was a nested try/catch written by me six months earlier, and there was no way to resume from where it left off. I had to manually reconstruct the partial state from database rows, write a one-off script to skip the completed work, and pray I had not double-processed anything. That morning I added "replace this entire pattern" to the next quarter's roadmap.&lt;/p&gt;

&lt;p&gt;A year and a half later, durable execution is not the niche operations concern it used to be. It is the default way to build anything that involves an LLM, an external API, a long-running background task, or a multi-step process that spans minutes or hours. If you are building &lt;a href="https://dev.to/blog/durable-ai-workflows-orchestration-2026"&gt;durable AI workflows&lt;/a&gt; in 2026, the question is not whether to use a workflow engine. It is which one.&lt;/p&gt;

&lt;p&gt;Three names keep coming up: Temporal, Inngest, and Vercel Workflow. I have shipped production features on all three over the last year, and they are genuinely different tools for different jobs. What follows is an honest side-by-side from someone who has spent real time in the logs of each, including the parts where I stepped on a rake and got a bump on the forehead.&lt;/p&gt;




&lt;h2&gt;What Durable Execution Actually Means&lt;/h2&gt;

&lt;p&gt;Before the comparison, a quick refresher on what these engines do, because a lot of developers I talk to think they are "just queues with retries" and then get surprised by the actual shape of the thing.&lt;/p&gt;

&lt;p&gt;A durable workflow engine runs your code in a way that survives process crashes, deployments, and hardware failures. Each step of your workflow is checkpointed. If the process dies between step 4 and step 5, the engine restarts the workflow and replays the history up to step 4 without re-running the side effects, then continues from step 5 as if nothing happened. You can pause a workflow for hours or days waiting on external input and it will be exactly where you left it when it resumes.&lt;/p&gt;
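
&lt;p&gt;The checkpoint-and-replay idea fits in a few lines of plain TypeScript. This is a toy model of the concept, not any engine's actual implementation: completed steps are memoized in a durable log, and on replay the log is consulted before the side effect runs again.&lt;/p&gt;

```typescript
// Toy model of durable replay: a persisted log of completed step results.
// Real engines keep this log in a database; a plain object stands in here.
const log: { [stepId: string]: unknown } = {};
const executed: string[] = [];

// On replay, a completed step returns its recorded result
// without re-running the side effect.
async function runStep(stepId: string, fn: () => unknown) {
  if (stepId in log) return log[stepId];
  const result = await fn();
  log[stepId] = result; // checkpoint before moving on
  return result;
}

const step = (id: string) =>
  runStep(id, async () => {
    executed.push(id); // the side effect we must not repeat
    return id.toUpperCase();
  });

async function demo() {
  await step("charge-card");
  await step("send-receipt");
  // ...process dies here; on restart the whole function runs again...
  await step("charge-card");  // replayed from the log, side effect skipped
  await step("send-receipt"); // replayed from the log, side effect skipped
  await step("update-ledger"); // new step, executes normally
  return executed;
}
```

&lt;p&gt;Each side effect runs exactly once even though the function body executed twice; the real engines apply the same idea with a persisted event history instead of an in-memory object.&lt;/p&gt;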

&lt;p&gt;This is the critical property for AI agents. An agent that calls six tools, waits on a human approval step, and then does three more tool calls cannot be written as a single function on a serverless platform. It has to be a workflow, because the lifetime is longer than any one function invocation and the state has to survive across invocations.&lt;/p&gt;

&lt;p&gt;Everything else flows from that core property. Retries are durable. Timeouts are durable. Cancellation is durable. Human-in-the-loop steps are durable. "Cron that actually runs even if the last run crashed" is durable. The engines differ in how they implement durability, what the developer experience is, and what they optimize for.&lt;/p&gt;




&lt;h2&gt;Temporal: The Heavyweight Champion&lt;/h2&gt;

&lt;p&gt;Temporal has been around the longest and has the most serious production pedigree. It is the open-source descendant of Uber's Cadence project, and it is what you pick when you need to run multi-day workflows across thousands of workers and you are willing to pay for the complexity budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model.&lt;/strong&gt; Temporal workflows are written in ordinary code (Go, Java, Python, TypeScript) using a specific "deterministic" style. You write workflow functions that look like regular functions, but under the hood Temporal records every piece of non-determinism (API calls, timers, random numbers) in an event history and replays that history to rebuild state after a crash. Activities are the "unsafe" side-effect functions your workflow calls, and they are what actually hit your database or third-party APIs.&lt;/p&gt;
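
&lt;p&gt;The record-and-replay idea behind that determinism requirement can be sketched without the SDK. This is a toy illustration of the mechanism, not Temporal's API:&lt;/p&gt;

```typescript
// Toy sketch of why Temporal records non-determinism in an event history.
// First execution records values like random numbers or timestamps;
// replay reads them back, so control flow is identical on every replay.
class History {
  private events: unknown[] = [];
  private cursor = 0;

  sideEffect(produce: () => unknown): unknown {
    if (this.events.length > this.cursor) {
      // Replay: return the recorded value instead of producing a new one.
      const recorded = this.events[this.cursor];
      this.cursor += 1;
      return recorded;
    }
    const value = produce();
    this.events.push(value);
    this.cursor += 1;
    return value;
  }

  startReplay() {
    this.cursor = 0;
  }
}

// A workflow branch that depends on a random number.
function workflow(hist: History): string {
  const roll = hist.sideEffect(() => Math.random()) as number;
  return roll >= 0.5 ? "path-b" : "path-a";
}

const hist = new History();
const firstRun = workflow(hist);

// After a crash the engine replays from history: same value, same branch,
// even though Math.random() would return something new.
hist.startReplay();
const replayRun = workflow(hist);
```

&lt;p&gt;This is also why non-deterministic calls made directly in workflow code, outside an activity, can break replay: a fresh value can differ from the recorded one and send the replay down a different branch than the original execution took.&lt;/p&gt;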

&lt;p&gt;&lt;strong&gt;Where it shines.&lt;/strong&gt; Temporal is the best tool in the category for complex, long-running workflows at scale. Workflows that run for weeks. Workflows that spawn hundreds of child workflows. Workflows where you need fine-grained control over retry policies, timeouts, and cancellation semantics. Large enterprises with dedicated platform teams love it because it gives them a universal primitive for everything that is not a synchronous API call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it hurts.&lt;/strong&gt; The operational overhead is real. You run Temporal Server, which needs a database, metrics, and workers. Temporal Cloud abstracts this away but at a price point that makes sense for teams with meaningful workflow volume and not much sense for a solo developer running a side project. The learning curve is genuinely steep. The deterministic workflow pattern is powerful but it is also foreign, and new team members will write non-deterministic code on their first day and wonder why their workflow breaks on replay.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; Temporal Cloud charges by actions per month (workflow starts, activities, signals) plus storage. For a team running millions of activities per month it is reasonable. For a side project it is a lot. Self-hosted is free if you are willing to operate it, which is a bigger "if" than the docs suggest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for.&lt;/strong&gt; You have a dedicated platform team, or at least one engineer whose job includes "own the workflow infrastructure." You have workflows that genuinely need to run for days. You have scale that justifies a learning curve. You want a single tool that covers every asynchronous pattern in your company.&lt;/p&gt;




&lt;h2&gt;Inngest: The Developer Experience Pick&lt;/h2&gt;

&lt;p&gt;Inngest took a very different path. Where Temporal optimized for power and correctness at scale, Inngest optimized for "a developer can ship a durable workflow in 10 minutes." The pitch is queues, cron, workflows, and AI orchestration in one SDK that feels like writing normal application code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model.&lt;/strong&gt; You write step functions in TypeScript (or Python, Go). Each step is wrapped in a call to &lt;code&gt;step.run&lt;/code&gt;, &lt;code&gt;step.sleep&lt;/code&gt;, &lt;code&gt;step.waitForEvent&lt;/code&gt;, or one of a handful of other primitives. Inngest handles the durability automatically. Your code looks almost identical to what you would write without a workflow engine, which is the whole point.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="nx"&gt;inngest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send-onboarding&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user/created&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send-welcome-email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;welcome&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wait-a-day&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;send-tips-email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tips&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a durable workflow. It survives crashes. It retries. It waits a full day between the two emails without holding a function open. There is nothing else to configure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines.&lt;/strong&gt; Speed to first workflow. The DX is the best in the category by a noticeable margin. The local dev story is excellent. The UI for inspecting runs, replaying failed ones, and debugging is fast and useful. For solo developers and small teams who need durable execution but do not want to become workflow experts, Inngest is an obvious choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it hurts.&lt;/strong&gt; The abstraction leaks at the edges when workflows get genuinely complex. Deeply nested child workflows, very long-running (multi-week) workflows, or workflows with exotic retry and cancellation needs push the limits of what Inngest is built for. The SDK does a lot of magic, and when that magic misfires the debugging path is not as clean as Temporal's explicit event history. It is also a newer platform, and while it is growing fast, "five years old" is still not "ten years old" in the trust budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; Inngest has a genuinely usable free tier (50,000 step runs per month) and tiered paid plans that stay affordable through the low-millions of step runs. For most small and mid-sized teams, the bill stays in reasonable territory. For very high volume (tens of millions of steps per month), the math gets less favorable versus self-hosting Temporal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who it is for.&lt;/strong&gt; You are a solo developer or a small team. You want durable execution without becoming a workflow expert. Your workflows are measured in minutes to hours rather than weeks. You care about shipping the feature this week, not architecting a platform for 2028.&lt;/p&gt;




&lt;h2&gt;Vercel Workflow: The Platform-Native Newcomer&lt;/h2&gt;

&lt;p&gt;Vercel Workflow (built on the Workflow DevKit, or WDK) is the newest of the three and represents a different thesis. Rather than being a separate system you integrate, it is a primitive in the Vercel platform itself. If you are already deploying to Vercel, your workflow code runs on Fluid Compute, scales automatically, and is observable from the same dashboard where you watch your deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model.&lt;/strong&gt; You write step functions in TypeScript that run on Vercel Functions. Each step is durable. The engine handles replay, retries, and state. The syntax is similar to Inngest in spirit, but the runtime is the same runtime your app is already deployed on, which removes a layer of infrastructure you would otherwise have to manage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it shines.&lt;/strong&gt; If you are already on Vercel, the integration is essentially free. There is no separate platform to learn, no separate dashboard to monitor, no separate deploy pipeline, no separate billing. Your workflow code sits next to your app code and gets deployed with it. For an AI-heavy app that already uses the Vercel AI SDK and the AI Gateway, the combination of &lt;a href="https://vercel.com/docs" rel="noopener noreferrer"&gt;Fluid Compute&lt;/a&gt; plus Workflow plus AI Gateway is the lowest-friction path from idea to production I have seen.&lt;/p&gt;

&lt;p&gt;Cold starts are largely a non-issue because Fluid Compute reuses function instances. Graceful shutdown is built in. Cancellation propagates cleanly. And the ops story of "the same platform runs your frontend, your API, and your workflows" is genuinely compelling when you are a small team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it hurts.&lt;/strong&gt; It is the youngest of the three. The feature set is still maturing. If you need the full arsenal of advanced patterns (massive fan-out, sophisticated saga orchestration, weeks-long workflows), you are closer to the edge than you would be on Temporal. The vendor lock-in is real in the sense that the pattern tightly couples to Vercel's runtime, which is fine if you are already committed to the platform and a non-starter if you are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; Workflow runs on top of Fluid Compute's Active CPU pricing model. You pay for the CPU time your steps actually consume, provisioned memory, and invocations, which is different from the per-action pricing Temporal and Inngest use. For workloads that spend most of their time sleeping or waiting (classic workflow shape), the Active CPU model is quite friendly.&lt;/p&gt;
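
&lt;p&gt;The difference between the two pricing shapes is easiest to see with toy numbers. The rates below are invented for illustration and are not any vendor's actual pricing:&lt;/p&gt;

```typescript
// Hypothetical rates, invented for illustration; real pricing differs.
const PER_ACTION_USD = 0.000025;       // per-action model: cost per step
const ACTIVE_CPU_USD_PER_SEC = 0.0001; // active-CPU model: cost per CPU-second

// A classic workflow shape: two quick steps separated by a 24-hour sleep.
// While the workflow sleeps, no CPU is active and no steps fire, so the
// sleep itself contributes little or nothing to either bill.
const steps = 2;
const activeCpuSecondsPerStep = 0.05; // 50 ms of real work per step

const perActionCost = steps * PER_ACTION_USD;
const activeCpuCost = steps * activeCpuSecondsPerStep * ACTIVE_CPU_USD_PER_SEC;

// At these toy rates the run costs 0.00005 under the per-action model and
// 0.00001 under the active-CPU model; flip the step from 50 ms of light
// work to 5 seconds of heavy CPU and the ranking reverses.
```

&lt;p&gt;The long sleep is the interesting part: it is close to free under both models, so the comparison comes down to how CPU-heavy the steps themselves are.&lt;/p&gt;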

&lt;p&gt;&lt;strong&gt;Who it is for.&lt;/strong&gt; You are already on Vercel. You want workflows to be a platform primitive rather than an external dependency. Your workflows are tightly coupled to your app's request/response cycle (user actions trigger workflows, workflow results update the app state). You value platform integration over best-in-class feature breadth.&lt;/p&gt;




&lt;h2&gt;How They Compare On The Dimensions That Matter&lt;/h2&gt;

&lt;p&gt;Laid out against the axes I actually care about when I am picking one for a new project:&lt;/p&gt;

&lt;h3&gt;Learning curve&lt;/h3&gt;

&lt;p&gt;Inngest &amp;lt; Vercel Workflow &amp;lt; Temporal&lt;/p&gt;

&lt;p&gt;Inngest is the fastest to get productive on. The primitives are small, the docs are good, and the model is familiar. Vercel Workflow is close behind if you are already on Vercel. Temporal is the steepest, not because the primitives are bad but because the deterministic workflow pattern requires a mental model shift.&lt;/p&gt;

&lt;h3&gt;Maximum workflow complexity&lt;/h3&gt;

&lt;p&gt;Temporal &amp;gt; Inngest ≈ Vercel Workflow&lt;/p&gt;

&lt;p&gt;Temporal is built for workflows that run for weeks, fan out to hundreds of child workflows, and require exotic retry policies. Inngest and Vercel Workflow handle the 95 percent of workflows that run for minutes to hours very well and get harder to reason about at the extremes.&lt;/p&gt;

&lt;h3&gt;Operational overhead&lt;/h3&gt;

&lt;p&gt;Vercel Workflow &amp;lt; Inngest &amp;lt; Temporal Cloud &amp;lt; Temporal self-hosted&lt;/p&gt;

&lt;p&gt;If you are already on Vercel, Workflow is zero additional ops. Inngest is a managed service with minimal setup. Temporal Cloud is managed but has more moving parts. Self-hosting Temporal is a real commitment.&lt;/p&gt;

&lt;h3&gt;AI agent suitability&lt;/h3&gt;

&lt;p&gt;All three are fine. Inngest ships AI-specific patterns that make agent loops very clean. Vercel Workflow integrates tightly with the AI SDK and AI Gateway for a smooth end-to-end story. Temporal is the most powerful for complex multi-agent coordination but requires the most plumbing to get there.&lt;/p&gt;

&lt;h3&gt;Cost at scale&lt;/h3&gt;

&lt;p&gt;It depends on the shape of the workload. Temporal self-hosted is cheapest at very high volume if you can eat the ops cost. Vercel Workflow is competitive when your workflows are CPU-light and spend most of their time sleeping. Inngest is cheapest up through mid volume and gets more expensive at very high volume.&lt;/p&gt;

&lt;h3&gt;Vendor coupling&lt;/h3&gt;

&lt;p&gt;Temporal &amp;lt; Inngest &amp;lt; Vercel Workflow&lt;/p&gt;

&lt;p&gt;Temporal is portable between clouds and can be self-hosted. Inngest is a managed service but the SDK is portable and you could theoretically move off it. Vercel Workflow assumes Vercel.&lt;/p&gt;




&lt;h2&gt;Picking One: The Decision Tree I Actually Use&lt;/h2&gt;

&lt;p&gt;When someone asks me which one to pick, I walk through roughly this decision tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you a solo developer or a 2 to 5 person team building AI features?&lt;/strong&gt; Start with Inngest if you are running anywhere, or Vercel Workflow if you are running on Vercel. The DX is friendly enough that you will be productive in an afternoon, and the free and low-tier plans are generous enough to carry you through early growth. You can migrate later if you outgrow the engine, but most teams never do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you a mid-sized team already on Vercel with AI-heavy workflows?&lt;/strong&gt; Vercel Workflow is almost certainly the right call. The integration with Fluid Compute, the AI SDK, and the AI Gateway removes more friction than any other option. You keep one platform, one dashboard, one billing relationship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you a mid-sized team not on Vercel?&lt;/strong&gt; Inngest is the pragmatic default. The DX is great, the platform is mature enough, and you can run it alongside whatever stack you already have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you a large team with a dedicated platform engineering function and genuinely complex workflow needs?&lt;/strong&gt; Temporal. The complexity budget is real but so is the payoff. Nothing else in the category handles multi-week workflows, massive fan-out, or sophisticated saga orchestration with the same rigor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you unsure how complex your workflows will get?&lt;/strong&gt; Start with Inngest or Vercel Workflow. The worst case is that you migrate to Temporal in 18 months, and by then you will know exactly what you need from it. The alternative, which is starting with Temporal because "someday we might need it," usually results in a lot of complexity spent on workflows that would have been fine on something simpler.&lt;/p&gt;




&lt;h2&gt;What Changes When You Add AI Agents To The Mix&lt;/h2&gt;

&lt;p&gt;AI agents are the workload that turned durable execution from a nice-to-have into a hard requirement for a lot of teams. The shape of an agent loop (call a tool, wait for the result, decide what to do next, maybe wait on a human, repeat for an unknown number of turns) is exactly what workflow engines are built for. But agents do bring some specific requirements that are worth thinking about when you pick the engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;AI agent state and memory&lt;/a&gt; needs to survive across turns, which all three engines handle natively. You do not have to roll your own state store for most use cases because the workflow history is the state store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;Agent observability&lt;/a&gt; is where the engines differ more. Temporal's event history is the most rigorous but the least readable to non-engineers. Inngest's run inspector is the most approachable. Vercel Workflow's integration with Vercel's observability tools is convenient if you are already using them.&lt;/p&gt;

&lt;p&gt;Token cost tracking is worth thinking about too. &lt;a href="https://dev.to/blog/ai-agent-token-costs-developer-guide-2026"&gt;Token costs on agents&lt;/a&gt; can spiral if you do not instrument them. All three engines let you emit custom events or metrics for token usage, but the integrations with AI-specific cost tracking are cleanest on Vercel Workflow (via AI Gateway) and Inngest (via built-in AI primitives).&lt;/p&gt;

&lt;p&gt;Human-in-the-loop patterns are handled well on all three. Temporal has the longest track record for multi-day waits. Inngest's &lt;code&gt;step.waitForEvent&lt;/code&gt; is probably the cleanest API for the pattern. Vercel Workflow handles it cleanly as well.&lt;/p&gt;
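
&lt;p&gt;The semantics of a wait-for-event step can be modeled as a tiny state machine. This is a toy sketch of the pattern, not any SDK's implementation: the run suspends with the name of the event it is waiting on, a matching event resumes it with a payload, and a passed deadline resumes it with &lt;code&gt;null&lt;/code&gt;:&lt;/p&gt;

```typescript
// Toy model of a human-in-the-loop wait: the run suspends until a
// matching event arrives, or resumes with null when the deadline passes.
type Run =
  | { state: "waiting"; eventName: string; deadline: number }
  | { state: "resumed"; payload: unknown };

function waitForEvent(eventName: string, timeoutMs: number, now: number): Run {
  return { state: "waiting", eventName, deadline: now + timeoutMs };
}

function deliverEvent(run: Run, eventName: string, payload: unknown): Run {
  if (run.state === "waiting") {
    if (run.eventName === eventName) {
      return { state: "resumed", payload };
    }
  }
  return run; // non-matching events are ignored
}

function tick(run: Run, now: number): Run {
  if (run.state === "waiting") {
    if (now >= run.deadline) {
      return { state: "resumed", payload: null }; // timed out: no approval
    }
  }
  return run;
}

// An approval flow: wait up to 3 days for "approval/granted".
let run = waitForEvent("approval/granted", 3 * 24 * 3600 * 1000, 0);
run = deliverEvent(run, "approval/granted", { approvedBy: "alice" });
// run is now resumed with the approval payload; the workflow continues.
```

&lt;p&gt;The durable part is that the engine persists the waiting state, so the approval can arrive days later, long after the process that started the run is gone.&lt;/p&gt;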




&lt;h2&gt;What I Actually Run In Production&lt;/h2&gt;

&lt;p&gt;For the record, because abstract comparisons only go so far, here is what I have in production right now across three different projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A solo-founder SaaS&lt;/strong&gt; that runs on Vercel uses Vercel Workflow for every async job. Payment retries, onboarding sequences, AI-powered digest generation, webhook processing. The "everything in one platform" story has saved me days of ops work that I would have spent elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A mid-sized side project&lt;/strong&gt; that is deployed on Fly.io uses Inngest because I did not want to add a second platform dependency and Inngest runs happily as a managed service alongside whatever infrastructure is underneath. The AI agent that powers the main feature lives in Inngest and it has been rock solid for six months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A client project&lt;/strong&gt; with a dedicated platform team and genuinely complex multi-week workflows uses Temporal. The learning curve was real. The operational rigor they get in exchange is also real. I would not have picked it for either of the other two projects, and I would not pick anything else for this one.&lt;/p&gt;

&lt;p&gt;The pattern, if there is one, is that the "right" engine depends more on the team shape and platform commitment than on the raw feature set. All three are good tools. Most of the bad outcomes I have seen come from picking the wrong one for the context, not from picking a weak tool.&lt;/p&gt;




&lt;h2&gt;What To Build This Week&lt;/h2&gt;

&lt;p&gt;If you do not have a workflow engine in your stack and you are building anything involving AI agents, background jobs that take more than a few seconds, or multi-step processes that need to survive restarts, install one this week. Pick Inngest or Vercel Workflow and ship a single real workflow through it: the one that used to be a fragile cron job, a manually retried queue consumer, or a background function that silently fails 3 percent of the time. Migrate that one thing. See how it feels.&lt;/p&gt;

&lt;p&gt;If you already have a workflow engine and you are questioning it, do not migrate without a concrete reason. "The DX of the other one looks nicer" is not a concrete reason. "Our workflows regularly exceed the limits of the current engine" is. Migrations between workflow engines are expensive and the payoff is usually smaller than you expect.&lt;/p&gt;

&lt;p&gt;If you are evaluating for the first time for a serious platform commitment, run a bake-off. Pick the single most representative workflow in your system. Build it on all three engines. Run each implementation for a week under realistic load. Measure the things that actually matter to you: DX, observability, cost at projected scale, and how it feels when something goes wrong. The winner from that exercise is almost never the one you would have picked from the comparison tables.&lt;/p&gt;

&lt;p&gt;Durable execution is not a category where the "best" engine wins. It is a category where the right engine for your team and your workload wins. Temporal, Inngest, and Vercel Workflow are all good tools. The interesting question is which one maps onto the specific shape of what you are building, and the only way to answer that is to be honest about what your team looks like and what kind of problems you are actually trying to solve.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Structured Outputs in 2026: Function Calling, JSON Mode, and the Schema Wars</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:52:44 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/structured-outputs-in-2026-function-calling-json-mode-and-the-schema-wars-4c98</link>
      <guid>https://forem.com/alexcloudstar/structured-outputs-in-2026-function-calling-json-mode-and-the-schema-wars-4c98</guid>
      <description>&lt;p&gt;The bug took three days to find. A user reported that our invoice extractor was occasionally swapping the buyer and seller fields. Not all the time. Not even most of the time. Maybe one in two hundred invoices, always on the ones that mattered.&lt;/p&gt;

&lt;p&gt;I dug into the traces. The model was returning valid JSON. The schema validation passed. The downstream code was correct. The fields were just wrong, sometimes, in ways that no test caught and no eval flagged.&lt;/p&gt;

&lt;p&gt;What I eventually figured out was that I had been using JSON mode, which guarantees valid JSON syntax but does not constrain the keys or the values. The model was free to return whatever object it wanted as long as it parsed. On hard invoices it would occasionally hallucinate a slightly different schema and our code, trusting the parse, would write garbage into the wrong column of the database.&lt;/p&gt;

&lt;p&gt;Switching to a real structured output API with a schema constraint took fifteen minutes. The bug never came back.&lt;/p&gt;

&lt;p&gt;This is the kind of mistake that quietly destroys data integrity in LLM pipelines, and the surface area for it has grown a lot in 2026 as every major provider has shipped its own version of structured outputs. Function calling, JSON mode, schema-constrained generation, tool use, response_format, output_schema — these terms overlap, conflict, and sometimes mean different things on different providers. If you do not understand which one you are actually using, you cannot reason about what it will and will not catch.&lt;/p&gt;

&lt;p&gt;This is the field guide I have built up after shipping structured outputs across four products and getting bitten by every variant of this bug at least once.&lt;/p&gt;




&lt;h2&gt;The Three Things People Mean When They Say "Structured Outputs"&lt;/h2&gt;

&lt;p&gt;Before any of the provider-specific stuff, you have to separate three different ideas that everyone uses interchangeably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON mode&lt;/strong&gt; means the model is constrained to produce syntactically valid JSON. Nothing more. The keys and values can be whatever the model decides. There is no schema. If you ask for an invoice and the model gives you a recipe for soup, JSON mode will happily return valid JSON describing soup. This is the lowest tier and the one most likely to hurt you because it looks like it is doing something useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Function calling&lt;/strong&gt; (sometimes called tool use) is the original way structured outputs got into production. You define a function with parameters and a schema, the model decides whether to call it, and if it does, the arguments come back as a structured object. The model is constrained to fill out the schema, but historically the constraint was a soft suggestion, not a hard guarantee. The model could still return malformed arguments and you had to handle that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema-constrained generation&lt;/strong&gt; (sometimes called structured outputs, response_format with schema, or constrained decoding) is the new world. You define a JSON schema, the provider runs constrained decoding under the hood so the model literally cannot emit tokens that would violate the schema, and you get back a parsed object that is guaranteed to match. No retries, no validation failures, no surprises.&lt;/p&gt;

&lt;p&gt;These three modes are not the same and they fail in different ways. The crux of choosing the right one in 2026 is figuring out which guarantee you actually need.&lt;/p&gt;
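
&lt;p&gt;The practical consequence: under JSON mode a successful parse proves nothing about shape, so a validation layer is still mandatory. Here is a minimal hand-rolled sketch of that guard; in production a schema library such as Zod or Ajv would normally do this job:&lt;/p&gt;

```typescript
// JSON mode guarantees only that the output parses; it says nothing
// about which keys or types come back. Validate shape yourself.
interface Invoice {
  buyer: string;
  seller: string;
  totalCents: number;
}

function parseInvoice(raw: string): Invoice | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // JSON mode should prevent this, but belt and braces
  }
  if (typeof data !== "object" || data === null) return null;
  const o = data as { [k: string]: unknown };
  if (
    typeof o.buyer !== "string" ||
    typeof o.seller !== "string" ||
    typeof o.totalCents !== "number"
  ) {
    return null; // valid JSON, wrong schema: the silent-corruption bug class
  }
  return { buyer: o.buyer, seller: o.seller, totalCents: o.totalCents };
}

// Valid JSON that is not an invoice gets rejected instead of hitting the DB.
const soup = parseInvoice('{"recipe": "soup", "servings": 4}');
const ok = parseInvoice(
  '{"buyer": "ACME", "seller": "Globex", "totalCents": 125000}'
);
```

&lt;p&gt;Schema-constrained generation makes this guard redundant for shape errors, though note what no structural check can catch: a model that fills the right keys with the wrong values, like the swapped buyer and seller above, passes it every time.&lt;/p&gt;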




&lt;h2&gt;Where Each Provider Lands in 2026&lt;/h2&gt;

&lt;p&gt;The landscape is finally settling, but it is still worth knowing what each provider gives you and what it does not.&lt;/p&gt;

&lt;h3&gt;OpenAI&lt;/h3&gt;

&lt;p&gt;OpenAI has the most mature schema-constrained API. The &lt;code&gt;response_format&lt;/code&gt; parameter takes a JSON schema and the model is decoded under that constraint at the token level. If you set &lt;code&gt;strict: true&lt;/code&gt;, you get a hard guarantee that the output matches the schema exactly. The schema can be nested, can include enums, can express required vs optional fields, and the constraint is enforced at generation time, not validated after.&lt;/p&gt;

&lt;p&gt;OpenAI also still has the older function calling API and the original JSON mode (no schema). You should treat both of those as legacy unless you have a specific reason. Use &lt;code&gt;response_format&lt;/code&gt; with strict schemas as your default.&lt;/p&gt;

&lt;p&gt;The catch with OpenAI's strict mode is that it is more conservative than the looser modes. If your schema is too restrictive, the model can struggle to produce a useful answer because the constraint is preventing it from saying what it wants to say. The fix is usually to widen the schema with optional fields, not to remove the constraint.&lt;/p&gt;
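&lt;p&gt;As a concrete sketch, here is roughly the shape of a strict schema-constrained request body. The exact keys follow the public docs as I understand them, and the model id is a placeholder; treat the details as assumptions and check your SDK version before copying.&lt;/p&gt;

```python
import json

# Hedged sketch of a schema-constrained request in the style of OpenAI's
# response_format with strict mode. Key names are assumptions from public
# docs; the model id is hypothetical.
request_body = {
    "model": "gpt-4.1",  # hypothetical model name
    "messages": [
        {"role": "system", "content": "Extract invoice fields from the document."},
        {"role": "user", "content": "Invoice INV-1042, total 310.00, due 2026-05-01"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,  # hard guarantee: output matches the schema exactly
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {"type": "string"},
                    "total_amount_with_tax": {"type": "number"},
                    # Nullable rather than required-and-fabricated when absent.
                    "due_date": {"type": ["string", "null"]},
                },
                "required": ["invoice_number", "total_amount_with_tax", "due_date"],
                "additionalProperties": False,
            },
        },
    },
}

# The body serializes cleanly; the provider enforces the schema at decode time.
payload = json.dumps(request_body)
```

&lt;p&gt;Note the nullable &lt;code&gt;due_date&lt;/code&gt;: that is the "widen the schema with optional fields" fix in practice, giving the model a legal way out instead of forcing it to invent a date.&lt;/p&gt;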

&lt;h3&gt;
  
  
  Anthropic
&lt;/h3&gt;

&lt;p&gt;Anthropic's tool use API has matured significantly through 2026. Tool definitions are JSON schemas, and the model is constrained to fill them. Through Claude Opus 4.7 the constraint enforcement has gotten strong enough that I treat it as equivalent to OpenAI's strict mode for practical purposes. Malformed tool calls are now rare enough that I no longer build retry logic around them.&lt;/p&gt;

&lt;p&gt;Anthropic also added a more direct structured output mode that does not require dressing your call up as a tool. You provide an output schema and get a constrained response. This is the cleaner path when you do not actually need tool semantics, you just want a typed object back.&lt;/p&gt;

&lt;p&gt;The thing to know about Anthropic is that the model is more willing to refuse or partially answer when the schema does not fit the request. If you ask for a structured field that the document does not contain, Claude is more likely to leave it null or empty than to confabulate. This is usually what you want, but it changes how you write prompts. You have to be explicit about what to do when the data is missing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google
&lt;/h3&gt;

&lt;p&gt;Gemini's &lt;code&gt;responseSchema&lt;/code&gt; parameter takes an OpenAPI-style schema and constrains the output. The constraint is enforced at decode time. The schema language is slightly different from JSON Schema, which is annoying, but the practical capability is on par with the others.&lt;/p&gt;

&lt;p&gt;Gemini has the broadest support for very large outputs under structured constraints, which matters if you are extracting structured data from giant documents. If you need a 2 million token context window and a strict schema on the output, this is the one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open source models
&lt;/h3&gt;

&lt;p&gt;For Llama, Qwen, Mistral, and the rest of the open weight ecosystem, structured outputs go through one of three libraries: outlines, lm-format-enforcer, or guidance. All three implement constrained decoding by intersecting the model's logit distribution with the legal next tokens given a schema or grammar. They work, they are reliable, and they are how most production self-hosted setups handle this.&lt;/p&gt;
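&lt;p&gt;The logit-intersection idea is easier to see in a toy than in any real library. The sketch below is not the API of outlines or lm-format-enforcer; it is a deliberately tiny greedy decoder over a fake vocabulary, just to show why an invalid output becomes impossible by construction.&lt;/p&gt;

```python
# Toy illustration (not any real library's API) of constrained decoding:
# at each step, intersect the model's scores with the set of tokens that
# are legal under the target format, then take the masked argmax.

def constrained_greedy_decode(score_fn, vocab, legal_continuations):
    """Pick the highest-scoring token among the legal ones at each step."""
    output = []
    for legal in legal_continuations:          # legal token set per position
        scores = {tok: score_fn(tok, output) for tok in vocab if tok in legal}
        best = max(scores, key=scores.get)     # illegal tokens can never win
        output.append(best)
    return "".join(output)

# Pretend "model": it loves the word "maybe", which our schema forbids.
def fake_scores(token, _context):
    return {"maybe": 9.0, '"approved"': 2.0, '"rejected"': 1.0,
            "{": 5.0, "}": 5.0, '"status":': 3.0}.get(token, 0.0)

vocab = ["{", "}", '"status":', '"approved"', '"rejected"', "maybe"]

# Grammar for: {"status": "approved" | "rejected"}
steps = [{"{"}, {'"status":'}, {'"approved"', '"rejected"'}, {"}"}]

result = constrained_greedy_decode(fake_scores, vocab, steps)
# "maybe" scores highest overall but is never legal, so it cannot appear.
```

&lt;p&gt;This is also why the failure mode shifts the way described above: the decoder guarantees a valid shape, so a confused model produces a technically valid but less useful answer instead of malformed JSON.&lt;/p&gt;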

&lt;p&gt;The thing that surprised me when I moved a workload to a self-hosted model was that the open source constrained decoders are actually stricter than the hosted APIs. The model literally cannot produce an invalid output. If anything, the failure mode shifts from malformed JSON to the model getting stuck producing a less useful but technically valid answer.&lt;/p&gt;

&lt;p&gt;If you are running &lt;a href="https://dev.to/blog/local-ai-models-coding-ollama-2026"&gt;local AI models with Ollama&lt;/a&gt; or any self-hosted inference, you almost certainly want one of these libraries in your stack. The native APIs do not do this for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Function Calling vs Structured Outputs vs Tool Use
&lt;/h2&gt;

&lt;p&gt;Here is the part that confuses everyone, and where I have seen the most mistakes.&lt;/p&gt;

&lt;p&gt;The historical naming is a mess. "Function calling" was OpenAI's original name. Anthropic called it "tool use." Google called it "function calling" too but did it differently. Eventually everyone converged on "tool use" because the model is not actually calling a function; it is producing a structured object that you then dispatch to a function.&lt;/p&gt;

&lt;p&gt;Independently of that, "structured outputs" emerged as the term for "I want a structured response back, but I do not need tool semantics."&lt;/p&gt;

&lt;p&gt;The distinction in 2026 is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use tool use&lt;/strong&gt; when the model needs to choose among multiple actions, decide whether to act at all, or take a sequence of actions in a loop. The semantics are about the model's agency. It is deciding what to do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use structured outputs&lt;/strong&gt; when you have already decided what the model should produce and you just want the response in a typed shape. The semantics are about the response format. The model has no choice; it must produce the object.&lt;/p&gt;

&lt;p&gt;In practice, most people use tool use for both because the API has been around longer and the documentation is more comprehensive. This works, but it leaks tool-use semantics into things that should be plain transformations. You end up with prompts that say "use the extract_invoice tool" when what you really mean is "give me an extracted invoice." The latter is less ambiguous and produces better results.&lt;/p&gt;

&lt;p&gt;If your provider supports a direct structured output API, use it for transformations and reserve tool use for actual tool selection.&lt;/p&gt;




&lt;h2&gt;
  
  
  Designing Schemas That Do Not Fight the Model
&lt;/h2&gt;

&lt;p&gt;This is the part nobody warns you about. The schema you write is itself a prompt. The model reads the schema, interprets the field names, and produces output guided as much by what the schema says as by what your text prompt says.&lt;/p&gt;

&lt;p&gt;This means a badly designed schema can make the model worse, even when it is technically valid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use descriptive field names.&lt;/strong&gt; A field called &lt;code&gt;amount&lt;/code&gt; is ambiguous. A field called &lt;code&gt;total_amount_with_tax&lt;/code&gt; tells the model exactly what to put there. The model is not magic; it reads what you wrote and tries to do what you said. Field names are part of the instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add field-level descriptions.&lt;/strong&gt; Every major schema language supports a &lt;code&gt;description&lt;/code&gt; per field. Use them. A description like "the date the invoice was issued, in YYYY-MM-DD format" is dramatically more reliable than just &lt;code&gt;issue_date: string&lt;/code&gt;. The model treats the description as part of the prompt for that field specifically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use enums when possible.&lt;/strong&gt; If a field has a fixed set of allowed values, encode that as an enum. The model is then physically incapable of producing anything else, and you do not have to write defensive parsing code. Status fields, category fields, type fields are all candidates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mark fields as nullable when they truly can be missing.&lt;/strong&gt; If you make every field required, the model will fabricate values for fields it cannot find. If you allow a field to be null, the model will leave it null when the data is genuinely absent. This is the biggest source of hallucinated data in extraction pipelines, and the fix is to be honest in your schema about what is optional.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid deeply nested structures.&lt;/strong&gt; Constrained decoding works on every kind of schema, but the model performs better on flat structures. If your schema is six levels deep, consider whether some of those levels could be flat fields with composite keys. The model's accuracy on nested fields drops noticeably as depth increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do not use schemas to enforce business logic.&lt;/strong&gt; The schema is for shape, not policy. If a value must be between 1 and 100, do not encode that as a JSON Schema constraint and expect the model to produce a correct number in range. Validate it after. Constrained decoders are good at shape; many ignore numeric bounds entirely, and even when a bound is enforced, an in-range number can still be nonsense. The output will sail through decode time looking valid, so range checks and every other policy check belong in your own code.&lt;/p&gt;
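&lt;p&gt;Here is one schema that applies the guidelines above together. The invoice domain and field names are illustrative, not tied to any particular provider's API.&lt;/p&gt;

```python
# Sketch: descriptive names, per-field descriptions, an enum, a nullable
# field, a flat structure, and business logic kept out of the schema.
invoice_schema = {
    "type": "object",
    "properties": {
        # Descriptive name + description doubles as the per-field prompt.
        "total_amount_with_tax": {
            "type": "number",
            "description": "The invoice grand total including tax, as a plain number.",
        },
        "issue_date": {
            "type": ["string", "null"],  # nullable: absent data stays null
            "description": "The date the invoice was issued, YYYY-MM-DD. Null if not present.",
        },
        "status": {
            "type": "string",
            # Fixed vocabulary: the model cannot produce anything else.
            "enum": ["draft", "sent", "paid", "void"],
        },
    },
    "required": ["total_amount_with_tax", "issue_date", "status"],
    "additionalProperties": False,
}

# Policy lives in plain code, validated after parsing, not in the schema.
def check_policy(record):
    return record["total_amount_with_tax"] > 0
```

&lt;p&gt;The schema guarantees shape; &lt;code&gt;check_policy&lt;/code&gt; is where the "between 1 and 100" class of rule belongs.&lt;/p&gt;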




&lt;h2&gt;
  
  
  When the Schema Is Wrong: Failure Modes and Recovery
&lt;/h2&gt;

&lt;p&gt;Even with strict schemas, structured outputs fail. The failure modes are different from unstructured outputs and you need to handle them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Empty or null fields when the model cannot find the data.&lt;/strong&gt; As I said above, this is the correct behavior. Your code needs to handle null. Treat any required-looking field as effectively optional in the model's eyes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confabulated values that match the schema.&lt;/strong&gt; If you have a required field with no good default, the model will make something up. The fabrication will pass schema validation. The only defense is downstream verification — does the value actually exist in the source document, does it cross-reference with another field, does an LLM-as-judge agree it was extracted correctly. This is where the &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;observability and eval workflow&lt;/a&gt; earns its keep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema-mismatch in fallback paths.&lt;/strong&gt; If your code has a fallback to a cheaper model that does not support strict mode, the fallback can return malformed data that breaks downstream parsers. Always validate after parsing, even if the API claims it cannot fail. Belt and suspenders.&lt;/p&gt;
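&lt;p&gt;"Always validate after parsing" is cheap to do in code. A minimal sketch, with illustrative field names, of the belt-and-suspenders check that runs even when the primary path promises a matching shape:&lt;/p&gt;

```python
# Minimal post-parse validator: catches a fallback model's malformed output
# before it reaches downstream code. Field names are illustrative.
REQUIRED_FIELDS = {"invoice_number": str, "total_amount_with_tax": (int, float)}

def validate_extraction(obj):
    """Return a list of problems; an empty list means the object is usable."""
    problems = []
    if not isinstance(obj, dict):
        return ["response is not an object"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in obj:
            problems.append(f"missing field: {field}")
        elif obj[field] is not None and not isinstance(obj[field], expected_type):
            problems.append(f"wrong type for {field}: {type(obj[field]).__name__}")
    return problems

# A cheaper fallback model returned a string where a number belongs:
# caught here, not three services downstream.
bad = {"invoice_number": "INV-1042", "total_amount_with_tax": "310.00"}
assert validate_extraction(bad) == ["wrong type for total_amount_with_tax: str"]
```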

&lt;p&gt;&lt;strong&gt;Token waste on overly verbose schemas.&lt;/strong&gt; Every property name and description is in the prompt every time. A schema with 80 fields and three sentences of description per field can easily run 4000 tokens. If you are calling that schema a million times a month, you are paying for those tokens a million times. Watch the token budget on your schema specifically. I covered the broader pattern in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization&lt;/a&gt;; schemas are a major hidden contributor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoder slowdowns on complex schemas.&lt;/strong&gt; Constrained decoding has a runtime cost that scales with schema complexity. A flat schema with ten fields decodes nearly as fast as unconstrained generation. A deeply nested schema with hundreds of optional branches can slow generation by 20% or more. If latency matters, profile this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompting With Schemas
&lt;/h2&gt;

&lt;p&gt;The prompt and the schema are two halves of the same instruction. Treat them that way.&lt;/p&gt;

&lt;p&gt;The pattern that has worked best for me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The system prompt explains the task in plain language. What is the model doing, what does the input look like, what should it produce.&lt;/li&gt;
&lt;li&gt;The schema enforces the shape and provides field-level descriptions for anything ambiguous.&lt;/li&gt;
&lt;li&gt;The user prompt provides the input data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Concretely, a system prompt like "Extract invoice fields from the provided document. If a field is not present in the document, return null for it; do not infer or estimate" plus a schema with descriptive field names is dramatically more reliable than either of those things alone.&lt;/p&gt;
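&lt;p&gt;The three-part split can be sketched as a request builder. The message and &lt;code&gt;response_format&lt;/code&gt; keys mirror common chat APIs, but treat the exact shape as an assumption for your provider.&lt;/p&gt;

```python
# The prompt and the schema as two halves of one instruction. Keys are
# assumptions modeled on common chat-completion APIs.
def build_request(document_text, schema):
    return {
        "messages": [
            # 1. Task in plain language, including the missing-data rule.
            {"role": "system",
             "content": ("Extract invoice fields from the provided document. "
                         "If a field is not present in the document, return null "
                         "for it; do not infer or estimate.")},
            # 3. The input data only; field-level guidance lives in the schema.
            {"role": "user", "content": document_text},
        ],
        # 2. The schema enforces shape, passed as a real parameter,
        #    never pasted into the prompt as text.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "invoice", "strict": True, "schema": schema},
        },
    }
```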

&lt;p&gt;Two specific anti-patterns to avoid:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do not put the schema in the prompt as text.&lt;/strong&gt; If your provider supports a real schema parameter, use it. Putting the schema in the prompt as JSON text means the model has to interpret it, and the constraint is not enforced at decode time. You get the worst of both worlds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do not duplicate field descriptions.&lt;/strong&gt; If you have a field description in the schema, do not also describe it in the prompt. The model gets confused when the same instruction appears twice in slightly different words. Keep the field-level guidance in the schema and the task-level guidance in the prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Streaming Structured Outputs
&lt;/h2&gt;

&lt;p&gt;This is the new frontier in 2026. All three major providers now support streaming structured outputs, where the model generates the JSON token by token and you can read partial objects as they come in.&lt;/p&gt;

&lt;p&gt;This matters because waiting for a 2000-token JSON response can take five seconds, and your UI cannot just freeze. Streaming lets you start rendering the first few fields while the rest are still being generated.&lt;/p&gt;

&lt;p&gt;The catch is that partial JSON is not valid JSON. You cannot just &lt;code&gt;JSON.parse&lt;/code&gt; the chunks as they arrive. You need a streaming JSON parser that can handle incomplete objects and emit field-level events as they are completed.&lt;/p&gt;
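&lt;p&gt;To make the problem concrete, here is a deliberately small repair-and-parse function in the spirit of those streaming parsers. It handles the common mid-stream cases (open string, open object, trailing comma) and is a sketch, not a replacement for a real library.&lt;/p&gt;

```python
import json

# Minimal sketch of partial-JSON parsing: repair an incomplete chunk just
# enough to parse it, so the UI can render fields as they arrive.
def parse_partial(chunk):
    repaired, stack, in_string, escaped = [], [], False, False
    for ch in chunk:
        repaired.append(ch)
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    if in_string:
        repaired.append('"')          # close a string cut off mid-value
    text = "".join(repaired).rstrip()
    while text and text[-1] in ",:":  # drop a dangling comma or colon
        text = text[:-1].rstrip()
    return json.loads(text + "".join(reversed(stack)))

# Mid-stream: the total has arrived, the due date is still being generated.
partial = '{"total_amount_with_tax": 310.0, "due_date": "20'
assert parse_partial(partial)["total_amount_with_tax"] == 310.0
```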

&lt;p&gt;The libraries that do this well in 2026: &lt;code&gt;partial-json&lt;/code&gt; for Node, &lt;code&gt;pydantic-ai&lt;/code&gt;'s streaming validators for Python, and the AI SDK's &lt;code&gt;streamObject&lt;/code&gt; for full-stack TypeScript. All of them follow the same pattern: parse what you have, emit a typed partial object, repeat as more data arrives.&lt;/p&gt;

&lt;p&gt;Where I have seen this go wrong: developers stream the JSON, render fields as they arrive, but never wait for the final completion event. The user sees fields populate, then the model decides one of those fields was wrong and revises it in the final pass. Now you have a UI that flashes wrong data and corrects itself. Either lock in fields only when their complete event fires, or render to a buffered draft state and only commit on full completion.&lt;/p&gt;




&lt;h2&gt;
  
  
  Schema Versioning
&lt;/h2&gt;

&lt;p&gt;The thing that nobody thinks about until they get burned: your schema is a contract. Once you ship it, every record you wrote conforms to that schema. If you change a field name, add a required field, or change a type, you have a migration problem. If you store the structured outputs in a database, you also have a database migration problem.&lt;/p&gt;

&lt;p&gt;Treat your schemas like API versions. Bump a version number when you change them. Keep the old schema around so you can read old records. If you are storing extracted data, store the schema version with the data so you know how to interpret it later.&lt;/p&gt;
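&lt;p&gt;In code, the versioned-contract idea is small: stamp every stored record with the schema version it was produced under, and migrate on read. Names below are illustrative.&lt;/p&gt;

```python
import json

# Schema as a versioned contract: every record carries the version it was
# written under, and reads dispatch on that version. Illustrative names.
SCHEMA_VERSION = 2

def store_record(extracted):
    return json.dumps({"schema_version": SCHEMA_VERSION, "data": extracted})

def load_record(raw):
    record = json.loads(raw)
    data = record["data"]
    if record["schema_version"] == 1:
        # v1 called the field "amount"; v2 renamed it. Migrate on read.
        data["total_amount_with_tax"] = data.pop("amount")
    return data

# A record written under the old schema still loads correctly.
old = '{"schema_version": 1, "data": {"amount": 310.0}}'
assert load_record(old) == {"total_amount_with_tax": 310.0}
```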

&lt;p&gt;This sounds like overhead but it pays off the first time you change a schema and realize you have ten thousand records that do not parse against the new shape.&lt;/p&gt;

&lt;p&gt;For pipelines that output to a database, the schema versioning question is partly answered by the database layer. If you are using Postgres with JSON columns, the schema is loose enough that minor changes are forgiving. If you are using a strongly typed ORM like the ones I covered in &lt;a href="https://dev.to/blog/drizzle-orm-vs-prisma-2026"&gt;Drizzle vs Prisma&lt;/a&gt;, the schema versioning needs to flow through the type system as well as the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evals for Structured Outputs
&lt;/h2&gt;

&lt;p&gt;Evals for unstructured text output are squishy. You compare to a reference, you ask another model to judge, you accept some fuzziness. Evals for structured outputs are crisper, and you should take advantage of that.&lt;/p&gt;

&lt;p&gt;For each test case, you can compute exact-match accuracy on every field. Did the extracted total match the ground truth total? Yes or no. Did the extracted date match the ground truth date? Yes or no. Aggregate this into a per-field accuracy metric and you can tell, at a glance, which fields your model is bad at.&lt;/p&gt;

&lt;p&gt;This is much more actionable than "the output looked right." If you see that the &lt;code&gt;total_amount&lt;/code&gt; field has 99% accuracy but the &lt;code&gt;due_date&lt;/code&gt; field has 78%, you know exactly where to focus prompt or schema work.&lt;/p&gt;
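&lt;p&gt;The per-field metric is a few lines of plain code. A sketch over a toy eval set:&lt;/p&gt;

```python
# Per-field exact-match accuracy: compare predictions against ground truth
# with code, no LLM judge required.
def per_field_accuracy(cases):
    """cases: list of (predicted_dict, ground_truth_dict) pairs."""
    totals, hits = {}, {}
    for predicted, truth in cases:
        for field, expected in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted.get(field) == expected:
                hits[field] = hits.get(field, 0) + 1
    return {f: hits.get(f, 0) / totals[f] for f in totals}

cases = [
    ({"total_amount": 310.0, "due_date": "2026-05-01"},
     {"total_amount": 310.0, "due_date": "2026-05-01"}),
    ({"total_amount": 310.0, "due_date": "2026-01-05"},   # day/month swapped
     {"total_amount": 310.0, "due_date": "2026-05-01"}),
]
accuracy = per_field_accuracy(cases)
# total_amount is perfect, due_date is not: that is where the prompt work goes.
```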

&lt;p&gt;The same eval framework that you use for general agents (which I covered in &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt;) extends naturally to structured outputs, but the assertions get tighter. You no longer need an LLM-as-judge for most fields. You have ground truth and you have outputs. Compare them with code.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;If I were starting a new project tomorrow that needed structured outputs, here is the decision tree I would use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need to extract typed data from documents or text?&lt;/strong&gt; Use schema-constrained generation. OpenAI's strict response_format, Anthropic's structured output mode, or Gemini's responseSchema. Default to your existing provider; the differences in capability are smaller than the cost of switching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Need the model to choose among multiple actions or use external tools?&lt;/strong&gt; Use tool use. The semantics fit. Do not pretend it is just structured output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running a self-hosted model?&lt;/strong&gt; Add outlines or lm-format-enforcer. Do not roll your own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outputting JSON to a low-stakes UI feature?&lt;/strong&gt; JSON mode is fine. Skip the schema. The cost of a malformed response is a UI hiccup, not data corruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doing something where data integrity matters?&lt;/strong&gt; Schema-constrained, plus downstream validation, plus per-field eval coverage. The schema catches shape errors. The validation catches policy errors. The evals catch hallucinated values that pass both.&lt;/p&gt;

&lt;p&gt;The bug I described at the start of this post would have been impossible if I had used schema-constrained generation from day one. It cost me three days of debugging and a small amount of database cleanup that I am still slightly bitter about.&lt;/p&gt;

&lt;p&gt;The good news is that in 2026 you do not have an excuse to skip this. Every major provider supports it. The libraries for self-hosted models are mature. The performance overhead is small. The error modes are well understood.&lt;/p&gt;

&lt;p&gt;The only thing left is to actually use it. Stop using JSON mode for anything that matters. Stop trusting that the model will produce the right shape. Define the schema, enforce it at decode time, and validate the values that came through. The data integrity you save will be your own.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Prompt Caching in 2026: Anthropic vs OpenAI vs Gemini for Production Apps</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:52:10 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/prompt-caching-in-2026-anthropic-vs-openai-vs-gemini-for-production-apps-433i</link>
      <guid>https://forem.com/alexcloudstar/prompt-caching-in-2026-anthropic-vs-openai-vs-gemini-for-production-apps-433i</guid>
      <description>&lt;p&gt;I opened the billing dashboard for one of my AI features a few months ago and felt my stomach drop. The feature was working beautifully. Users loved it. Traffic was climbing. And the monthly spend had quietly crossed a line that made me open a second tab to check the math twice. I had been telling myself caching was on the "optimize later" list for about three months. That morning it moved to the top.&lt;/p&gt;

&lt;p&gt;What I learned over the next two weeks is that prompt caching is not an optimization. It is the difference between a production AI feature that pencils out and one that eats your margin alive. Get it right and a 200,000 token system prompt goes from budget-breaking to a rounding error. Get it wrong and your cache hit rate sits at 4 percent while you wonder why the bill keeps growing.&lt;/p&gt;

&lt;p&gt;Every major provider ships caching now. Anthropic, OpenAI, and Gemini all have their own take on it, and the differences matter more than the docs make obvious. The pricing models diverge. The TTLs diverge. The rules about what invalidates a cache entry are different in ways that will bite you if you assume they work the same. I have shipped cached prompts on all three and broken something on all three. Here is the field guide I wish I had before I started.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Caching Became The Whole Ball Game
&lt;/h2&gt;

&lt;p&gt;For most of 2023 and 2024, prompt caching was an optional efficiency trick. You could skip it and still build working AI features. The context windows were small enough and the prompts were short enough that the raw input token bill never got scary.&lt;/p&gt;

&lt;p&gt;That changed in two steps. First, context windows grew. A 1 million token window on Claude Opus 4.7 and a 2 million token window on Gemini 2.5 Pro made long context architectures realistic for use cases that used to require RAG. Second, providers noticed that charging full input price for the same 180,000 token system prompt on every request was going to push developers back to retrieval out of pure sticker shock. Caching was the escape valve.&lt;/p&gt;

&lt;p&gt;The economics now look like this for a typical long context feature:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without caching, a 200,000 token system prompt at Claude input rates runs about 60 cents per request&lt;/li&gt;
&lt;li&gt;With caching on a warm cache, the same prompt runs around 6 to 8 cents per request&lt;/li&gt;
&lt;li&gt;At 10,000 requests per day, that is the difference between $6,000 and $600&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An order of magnitude. On a single feature. This is not an "optimization" in any normal sense. It is the price difference between "this business works" and "this business does not."&lt;/p&gt;
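&lt;p&gt;The bullet math above checks out in code. The prices are this article's illustrative figures, not a quote from any provider's rate card.&lt;/p&gt;

```python
# The long-context caching economics, as arithmetic. All figures are the
# article's illustrative numbers, not a provider rate card.
PROMPT_TOKENS = 200_000
INPUT_PRICE_PER_MTOK = 3.00    # dollars per million input tokens (assumed)
CACHED_READ_DISCOUNT = 0.10    # warm-cache reads at roughly 10% of input price
REQUESTS_PER_DAY = 10_000

cold_cost = PROMPT_TOKENS * INPUT_PRICE_PER_MTOK / 1_000_000   # 0.60 per request
warm_cost = cold_cost * CACHED_READ_DISCOUNT                   # 0.06 per request

daily_uncached = cold_cost * REQUESTS_PER_DAY                  # about 6,000 per day
daily_cached = warm_cost * REQUESTS_PER_DAY                    # about 600 per day
```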

&lt;p&gt;The catch is that the 90 percent discount only shows up if you do everything right. A single whitespace change in the cached portion can reset the cache. A TTL that expires in the middle of your daily traffic window wipes out the savings. A multi-tenant design that seemed obvious on paper can turn caching into an accounting nightmare. &lt;a href="https://dev.to/blog/context-engineering-ai-coding-2026"&gt;Context engineering&lt;/a&gt; is the umbrella skill here, and caching is the single highest-leverage piece of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Each Provider Actually Implements It
&lt;/h2&gt;

&lt;p&gt;The three providers all solved the same problem, but they made very different choices about ergonomics, pricing, and constraints. If you treat them as interchangeable you will miss the places where each one has a quiet advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anthropic (Claude)
&lt;/h3&gt;

&lt;p&gt;Anthropic introduced prompt caching in August 2024 and it has become the most developer-controllable of the three. You place &lt;code&gt;cache_control&lt;/code&gt; breakpoints explicitly in your prompt, up to four of them, and everything before each breakpoint is cached as a prefix.&lt;/p&gt;

&lt;p&gt;The defaults in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Five-minute TTL on the default "ephemeral" cache&lt;/li&gt;
&lt;li&gt;One-hour TTL available with a slightly higher write cost&lt;/li&gt;
&lt;li&gt;Cache writes cost 1.25x the normal input price on the five-minute tier, 2x on the one-hour tier&lt;/li&gt;
&lt;li&gt;Cache reads cost about 10 percent of the normal input price&lt;/li&gt;
&lt;li&gt;Minimum cacheable block of 1,024 tokens for Opus 4.7 and Sonnet 4.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The explicit breakpoint model is the thing I like most. You can cache your system prompt, your tool definitions, and the first chunk of conversation separately. You can decide exactly what is stable and what is not. And you can see in the response metadata which cache blocks hit and which did not, which makes debugging a cold cache actually possible.&lt;/p&gt;

&lt;p&gt;The gotcha is that the breakpoint position matters. Everything before a breakpoint must be byte-for-byte identical across requests. A single extra newline, a trailing space, a changed date in the system prompt, and the prefix misses. I once spent an afternoon tracking down a 0 percent hit rate that turned out to be a timestamp in the system instruction. Remove the timestamp, cache works.&lt;/p&gt;
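&lt;p&gt;A hedged sketch of what the breakpoint placement looks like in practice. The &lt;code&gt;cache_control&lt;/code&gt; shape follows Anthropic's public docs as I understand them; the model id is a placeholder, so verify against your SDK. The point is structural: the timestamp that burned me lives after the boundary, never before it.&lt;/p&gt;

```python
import datetime

# Hedged sketch of an Anthropic-style request with one explicit cache
# breakpoint. Key names follow the public docs as I understand them;
# the model id is hypothetical.
def build_cached_request(stable_system_prompt, tool_defs, user_message):
    return {
        "model": "claude-opus-4-7",  # hypothetical model id
        "tools": tool_defs,
        "system": [
            {
                "type": "text",
                "text": stable_system_prompt,           # no dates, no user names
                "cache_control": {"type": "ephemeral"}  # cache everything up to here
            }
        ],
        "messages": [
            {"role": "user",
             # Dynamic content goes after the boundary, where it cannot
             # break the byte-for-byte prefix match.
             "content": f"Today is {datetime.date.today().isoformat()}.\n{user_message}"}
        ],
    }
```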

&lt;h3&gt;
  
  
  OpenAI
&lt;/h3&gt;

&lt;p&gt;OpenAI rolled out automatic prompt caching in late 2024 and has kept the interface intentionally minimal. There are no breakpoints to set. The API inspects every request, looks for a cached prefix of at least 1,024 tokens, and charges the cached rate on the portion that matches.&lt;/p&gt;

&lt;p&gt;The defaults in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic caching with no opt-in required&lt;/li&gt;
&lt;li&gt;Cache TTL of 5 to 10 minutes depending on load&lt;/li&gt;
&lt;li&gt;Cache writes are free (no price premium on the first use)&lt;/li&gt;
&lt;li&gt;Cache reads are 50 percent of the normal input price&lt;/li&gt;
&lt;li&gt;Minimum prefix match of 1,024 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simplicity is genuinely pleasant when it works. You structure your prompt with the stable portion first, you keep the dynamic portion at the end, and the system figures it out. For a lot of use cases this is all you need.&lt;/p&gt;

&lt;p&gt;The downsides show up when you need precision. You do not control where the cache breakpoint lands. You cannot cache multiple disjoint blocks the way you can with Anthropic. The read discount is 50 percent rather than 90 percent, which sounds small but compounds fast at volume. And the TTL is shorter and less predictable, which makes it harder to plan around.&lt;/p&gt;

&lt;p&gt;For my money, OpenAI caching is the right default for simple cases and a frustrating ceiling for complex ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google (Gemini)
&lt;/h3&gt;

&lt;p&gt;Gemini takes a third approach that I would call "explicit and durable." You create a cached content object with its own identifier, set an explicit TTL, and then reference that identifier in subsequent requests.&lt;/p&gt;

&lt;p&gt;The defaults in 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTLs from 1 minute to 24 hours, you pick&lt;/li&gt;
&lt;li&gt;Cache storage is billed per hour the content sits in the cache&lt;/li&gt;
&lt;li&gt;Cache reads are roughly 25 percent of normal input price&lt;/li&gt;
&lt;li&gt;Minimum cacheable content of 4,096 tokens on Gemini 2.5 Pro&lt;/li&gt;
&lt;li&gt;Cache objects are regional and scoped to your API key&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The long TTL option is the killer feature. On a stable doc set, you can create a cache entry once in the morning and have it serve requests all day. No cold starts, no mid-day TTL refreshes, no worrying about whether your traffic pattern keeps the cache warm. For batch jobs, evals, or low-traffic features that only see requests every hour or two, this is a huge win because the five-minute TTLs on Anthropic and OpenAI would simply expire between uses.&lt;/p&gt;

&lt;p&gt;The downside is the storage cost. If you cache content and it sits unused, you still pay for the hours. On a 200,000 token cached object held for a day, the storage bill is not zero. You have to match the TTL to your actual traffic or you bleed money on idle storage.&lt;/p&gt;

&lt;p&gt;I have ended up using Gemini caching for long-running async features where the storage cost is predictable, and staying with Anthropic or OpenAI for interactive features where the five-minute TTL matches real user behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hit Rate Trap
&lt;/h2&gt;

&lt;p&gt;The single most common caching bug I see, including in my own code, is a cache hit rate that looks fine on paper but is actually catastrophic.&lt;/p&gt;

&lt;p&gt;Here is the trap. You deploy a long cached system prompt. You run a load test. You see cache hits on request 2, request 3, request 4. You declare victory. You ship. Then production traffic arrives and your cache hit rate drops to 30 percent for reasons you did not anticipate.&lt;/p&gt;

&lt;p&gt;The three things that usually cause this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenancy you did not account for.&lt;/strong&gt; If your cached prompt includes anything user-specific, like the user's name, a workspace ID, or a tenant configuration, the cache is keyed per user. Each user sees cold cache on their first request, and users who do not return within the TTL window never get a warm cache at all. The fix is to separate tenant-specific context from stable system context and only cache the stable part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL shorter than your traffic interval.&lt;/strong&gt; A five-minute TTL works great when you get a request every 30 seconds. It is useless when you get a request every six minutes. If your traffic is bursty or low volume, you are paying cache write prices on almost every request and cache read prices on almost none. Either switch to a provider with longer TTLs (Gemini's one-hour or 24-hour options) or accept that caching is not going to help your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent invalidation from prompt changes.&lt;/strong&gt; Every code deploy that touches your prompt template invalidates your cache. Every A/B test that changes the system instruction invalidates your cache. Every minor wording tweak that seems harmless invalidates your cache. If you are deploying often, you may be paying cache write costs after every release and never keeping a warm cache long enough to get the read discount.&lt;/p&gt;

&lt;p&gt;I now instrument hit rate as a first-class metric. Every AI request logs whether it hit the cache, how many tokens hit the cache, and how many were billed at full price. If the hit rate drops below 80 percent on a feature that should be at 95 percent, I get paged. This sounds paranoid and it has paid for itself twice already.&lt;/p&gt;
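&lt;p&gt;The instrumentation itself is simple. A sketch, with illustrative log shapes, of the token-weighted hit rate and the paging threshold:&lt;/p&gt;

```python
# Hit rate as a first-class metric: aggregate per-request cache stats and
# flag features that fall under their expected floor. Log shapes are
# illustrative, not any provider's response format.
def cache_hit_rate(request_logs):
    """request_logs: dicts with 'cached_tokens' and 'total_input_tokens'."""
    cached = sum(r["cached_tokens"] for r in request_logs)
    total = sum(r["total_input_tokens"] for r in request_logs)
    return cached / total if total else 0.0

def should_page(request_logs, floor=0.80):
    # Page when the observed token-weighted hit rate drops under the floor.
    return floor > cache_hit_rate(request_logs)
```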




&lt;h2&gt;
  
  
  The Structural Rules That Actually Work
&lt;/h2&gt;

&lt;p&gt;After making every mistake twice, here is the structural pattern I use for any cached prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable first, dynamic last.&lt;/strong&gt; The static portion of the prompt goes at the top. System instructions, tool definitions, shared context, doc sets. The dynamic portion, which includes the user message, per-request state, and anything else that changes, goes at the bottom. The cache boundary lives between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No timestamps in the cached portion.&lt;/strong&gt; If your system prompt includes the current date, the current user, the current anything, it is not cacheable. Move it out. If you genuinely need the date in context, put it in the dynamic portion after the cache boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strip whitespace carefully.&lt;/strong&gt; I have lost more hit rate to stray newlines than to any other single cause. When building the cached portion, I now run it through a normalizer that strips trailing whitespace on every line and ensures consistent line endings. The byte-for-byte match requirement is unforgiving.&lt;/p&gt;
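&lt;p&gt;The normalizer is a one-screen function. A minimal version of what I run the cached portion through before every request:&lt;/p&gt;

```python
# Byte-for-byte stability for the cached prefix: strip trailing whitespace
# per line, force "\n" line endings, collapse trailing blank lines.
def normalize_cached_prefix(text):
    lines = text.replace("\r\n", "\n").split("\n")
    cleaned = [line.rstrip() for line in lines]
    while cleaned and cleaned[-1] == "":
        cleaned.pop()
    return "\n".join(cleaned) + "\n"

# Two renderings that differ only in invisible ways normalize identically,
# so they hit the same cache entry.
a = "You are a support bot.  \r\nBe concise.\r\n\r\n"
b = "You are a support bot.\nBe concise.\n"
assert normalize_cached_prefix(a) == normalize_cached_prefix(b)
```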

&lt;p&gt;&lt;strong&gt;One cache boundary, sometimes two, rarely more.&lt;/strong&gt; Anthropic lets you set up to four breakpoints, but in practice I almost never use more than two. One for the system prompt and tool definitions, one for a shared document set. More breakpoints means more places where the prefix can miss, and the cognitive overhead of reasoning about which block is warm is not worth it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitor hit rate per feature.&lt;/strong&gt; Cache hit rate is not a global metric, it is per-feature. Different features have different cache patterns, different TTL needs, and different failure modes. Track them separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pin the cached portion in source control.&lt;/strong&gt; Treat the cached portion of your prompt like an API contract. Changes to it cost real money in lost cache warming. Require review. Roll out prompt changes with the same care as database migrations.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Real Example: A Support Bot At 50k Requests Per Day
&lt;/h2&gt;

&lt;p&gt;Let me get concrete with numbers from a support triage bot I have been running since January. The shape of the feature is the same one I described in the &lt;a href="https://dev.to/blog/rag-vs-long-context-2026"&gt;RAG vs long context&lt;/a&gt; piece. A 180,000 token system prompt with all the support docs, a short per-ticket message, and Claude Opus 4.7 doing the drafting.&lt;/p&gt;

&lt;p&gt;Without caching, the cost math is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;180,000 input tokens at $15 per million = $2.70 per request on the system prompt alone&lt;/li&gt;
&lt;li&gt;50,000 requests per day = $135,000 per day&lt;/li&gt;
&lt;li&gt;This number is obviously not real. We would never have shipped this feature at this price.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With five-minute ephemeral caching on Anthropic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First request of a five-minute window pays a cache write at 1.25x = $3.38&lt;/li&gt;
&lt;li&gt;Subsequent requests in the window pay cache read at 10 percent of input = $0.27&lt;/li&gt;
&lt;li&gt;Steady traffic maintains a warm cache most of the time&lt;/li&gt;
&lt;li&gt;Realistic daily spend ends up around $18,000, of which about $2,000 is cache writes and $16,000 is cache reads plus output tokens&lt;/li&gt;
&lt;/ul&gt;
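&lt;p&gt;The bullet math above can be reproduced in a few lines, assuming the $15-per-million Opus-class input rate and the 1.25x write and 0.1x read multipliers from the caching docs:&lt;/p&gt;

```python
PRICE_PER_M = 15.00        # $ per million input tokens (Opus-class, assumed)
PROMPT_TOKENS = 180_000    # cached system prompt size

uncached = PROMPT_TOKENS * PRICE_PER_M / 1_000_000   # full price per request
cache_write = uncached * 1.25                        # first request per 5-min window
cache_read = uncached * 0.10                         # every warm request after that
```

&lt;p&gt;Rounded to cents that is $2.70 uncached, $3.38 for the write, and $0.27 for each warm read, which is where the 85 percent reduction comes from once the cache stays warm.&lt;/p&gt;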

&lt;p&gt;That 85 percent reduction is what makes the feature viable. The absolute number is still significant, but it is a business cost rather than a business catastrophe.&lt;/p&gt;

&lt;p&gt;The last piece worth mentioning is that the hit rate itself is a feature of your traffic pattern. This bot gets requests pretty evenly throughout business hours, which keeps the cache warm. A lower-volume feature with the same prompt would have a much worse ratio of cache writes to cache reads, and the economics would look different. Some low-volume features in the same company are better served by Gemini's long-TTL caching for exactly this reason.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Caching Will Not Help You
&lt;/h2&gt;

&lt;p&gt;There are categories where caching is just not the right tool, and pretending otherwise leads to disappointment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-user data that cannot be separated from the system prompt.&lt;/strong&gt; If your application logic genuinely requires user-specific context at the top of the prompt, caching across users is impossible. You can still cache per user, but only if each user generates enough traffic in a five-minute window to hit the cache meaningfully. Most SaaS apps do not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highly dynamic doc sets.&lt;/strong&gt; If your knowledge base changes multiple times per hour, the cache invalidates faster than it accumulates hits. RAG becomes the better pattern because you can re-index incrementally without invalidating the entire retrieval path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short prompts.&lt;/strong&gt; There is a minimum prompt size below which caching is not worth the overhead. If your total prompt is 2,000 tokens, the savings on a cache hit are measured in fractions of a cent per request, and the engineering complexity of maintaining a cached prefix is not free. Save caching for prompts over 10,000 tokens where the math starts to matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic workflows with unpredictable tool calls.&lt;/strong&gt; Agents that call tools, get results back, and call more tools have highly variable prompt structure. The portion of the prompt that changes depends on which tool was called and what it returned. You can still cache the initial system prompt and tool definitions, but the dynamic middle portion is not cacheable, and you should not plan your cost structure around caching discounts that only apply to the first turn. &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;Observability for AI agents&lt;/a&gt; is a more impactful investment for these features.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Multi-Provider Strategy That Works
&lt;/h2&gt;

&lt;p&gt;One pattern I have landed on after running features across all three providers is to choose caching strategy per feature rather than per company.&lt;/p&gt;

&lt;p&gt;For a high-volume user-facing feature with stable context and short response-time requirements, I reach for Anthropic. The five-minute TTL matches interactive traffic patterns, the 90 percent read discount is the best on the market, and the explicit breakpoint model makes it obvious what is being cached.&lt;/p&gt;

&lt;p&gt;For a simple feature where the team does not want to think about caching strategy, OpenAI is the right default. It works well enough out of the box, the automatic caching behavior is predictable, and there is nothing to configure.&lt;/p&gt;

&lt;p&gt;For batch jobs, evals, or low-volume async features, Gemini's long-TTL caching is the only one of the three that makes sense. The five-minute TTLs on the other two would expire between requests and the cache would never warm up.&lt;/p&gt;

&lt;p&gt;Routing through &lt;a href="https://vercel.com/docs" rel="noopener noreferrer"&gt;Vercel AI Gateway&lt;/a&gt; or a similar provider abstraction makes this kind of per-feature strategy practical. You keep the same application code and change which provider gets called based on the feature's caching profile. The alternative is either accepting suboptimal caching on some features or scattering provider-specific code throughout your codebase, and neither one ages well.&lt;/p&gt;
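&lt;p&gt;The routing decision itself can be dumb code behind the gateway. A sketch with hypothetical feature names and caching profiles:&lt;/p&gt;

```python
from typing import Literal

Provider = Literal["anthropic", "openai", "gemini"]

# Hypothetical per-feature caching profiles; the provider choice follows
# the feature's traffic shape rather than a company-wide default.
FEATURE_PROFILES: dict[str, dict] = {
    "support-triage": {"volume": "high", "interactive": True},
    "weekly-digest":  {"volume": "low",  "interactive": False},
    "quick-summary":  {"volume": "mid",  "interactive": True},
}

def pick_provider(feature: str) -> Provider:
    profile = FEATURE_PROFILES[feature]
    if profile["volume"] == "low" and not profile["interactive"]:
        return "gemini"      # long-TTL caching suits sparse, async traffic
    if profile["volume"] == "high":
        return "anthropic"   # 5-minute TTL + 90% read discount for hot paths
    return "openai"          # automatic caching as the low-config default
```

&lt;p&gt;The application code stays identical; only the provider string handed to the gateway changes per feature.&lt;/p&gt;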




&lt;h2&gt;
  
  
  What To Do Monday Morning
&lt;/h2&gt;

&lt;p&gt;If you are shipping AI features and you have not audited your caching, that is the highest-leverage afternoon of work available to you right now. Pull your billing dashboard. Find the top three features by token spend. For each one, check the actual cache hit rate. Not the "we turned caching on" status, the real hit rate.&lt;/p&gt;

&lt;p&gt;If the hit rate is above 80 percent, you are in good shape and can stop reading. If it is below 50 percent, you have a bug, not an optimization opportunity. Something in your prompt is invalidating more than it should. Walk through the structural rules above. Almost always it is a timestamp, a user-specific field, or a whitespace mismatch that moved into the cached portion when nobody was paying attention.&lt;/p&gt;

&lt;p&gt;For new features, caching should be part of the prompt design from day one, not retrofitted. Decide where the cache boundary lives before you write the first line of the system prompt. Keep the stable portion stable. Move the dynamic portion to the end. Instrument hit rate before you ship. The habits are small individually and they compound into an enormous cost difference at scale.&lt;/p&gt;

&lt;p&gt;The providers are all racing to make caching easier, and the 2026 versions are already much better than the 2024 versions. But the structural work of designing a prompt that actually caches well is still on you. A feature with a well-designed cached prompt costs 10 percent of what the same feature costs without one. That gap is not closing. If anything, it is widening as context windows keep growing and more of your bill lives in the input side of the ledger.&lt;/p&gt;

&lt;p&gt;Caching is not the sexy part of building AI features. Nobody is going to tweet about your hit rate. But it is the difference between features that earn and features that bleed, and in 2026 that distinction is the whole game.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Multi-Agent vs Single-Agent Architecture in 2026: When the Crew Beats the Soloist</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Sat, 25 Apr 2026 09:51:37 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/multi-agent-vs-single-agent-architecture-in-2026-when-the-crew-beats-the-soloist-543</link>
      <guid>https://forem.com/alexcloudstar/multi-agent-vs-single-agent-architecture-in-2026-when-the-crew-beats-the-soloist-543</guid>
      <description>&lt;p&gt;The pitch for multi-agent systems is intoxicating. You take a complex task, decompose it into specialized roles, hand each role to its own agent, and coordinate them through a planner. The planner delegates. The workers execute. The critic reviews. The orchestrator stitches it together. It looks like a software team, except the team is a swarm of LLMs and they never go to lunch.&lt;/p&gt;

&lt;p&gt;I bought this pitch in 2024 and built three different multi-agent systems before I admitted that two of them would have been better as a single agent with good tools. The third one genuinely needed multiple agents. It was also the only one I could keep running for more than a quarter without rewriting half of it.&lt;/p&gt;

&lt;p&gt;The problem with multi-agent architectures is that they are simultaneously the right answer for a small set of real problems and a tempting wrong answer for a much larger set of problems that look similar but are not. Every conference talk and Twitter thread that hypes the pattern makes the wrong answer look just as valid as the right one, because the surface case is identical.&lt;/p&gt;

&lt;p&gt;This post is the framework I now use to decide. It comes from rebuilding the same product twice, once as a five-agent system and once as a single agent with five tools, and learning that the second version was strictly better.&lt;/p&gt;




&lt;h2&gt;
  
  
  What People Actually Mean by Multi-Agent
&lt;/h2&gt;

&lt;p&gt;The term gets used loosely. Before any of the trade-offs make sense, we need to separate the patterns that actually exist in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential pipeline.&lt;/strong&gt; Agent A produces output, agent B reads that output, agent C reads B's output, and so on. There is no real coordination, just a chain. This is multi-agent in name only. It is a workflow with LLM steps, and it should be reasoned about as a workflow, not as agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Specialist crew.&lt;/strong&gt; A planner agent decides what needs to be done and dispatches sub-tasks to specialist agents (a researcher, a writer, a reviewer). The specialists report back. The planner integrates the results. This is what most people mean when they say multi-agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debate or critic loop.&lt;/strong&gt; Two or more agents argue or critique each other's output, and the final answer comes from the consensus or the surviving draft. This is a specific subset of crew, optimized for output quality rather than parallel work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Swarm with shared state.&lt;/strong&gt; Many agents operate on a shared workspace simultaneously, picking up tasks from a queue and updating shared memory. This looks like the multi-agent ideal but in practice it is rare in production because the coordination overhead is brutal.&lt;/p&gt;

&lt;p&gt;When someone says they built a multi-agent system, the right first question is which of these they actually mean. The trade-offs are completely different across the four.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Single Agent Counterargument
&lt;/h2&gt;

&lt;p&gt;Before you reach for any of the multi-agent patterns, the question to answer is whether a single agent with the right tools could do the job.&lt;/p&gt;

&lt;p&gt;A single agent with tools is a model that can call functions, see the results, decide what to do next, and loop until it produces a final answer. It is one execution context, one set of model calls, one conversation history. The model is choosing what to do at every step.&lt;/p&gt;
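&lt;p&gt;Stripped of any provider SDK, the loop looks roughly like this; &lt;code&gt;model_call&lt;/code&gt; and the stub model below are stand-ins for a real API, not one:&lt;/p&gt;

```python
def run_agent(model_call, tools: dict, user_msg: str, max_steps: int = 10):
    """One execution context, one history; the model picks the next action.
    `model_call(history)` returns either {"tool": name, "args": {...}}
    or {"final": answer}."""
    history = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        decision = model_call(history)
        if "final" in decision:
            return decision["final"]
        result = tools[decision["tool"]](**decision["args"])
        history.append({"role": "tool", "name": decision["tool"],
                        "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

# Stubbed model: call a tool once, then answer with the tool's result.
def fake_model(history):
    if history[-1]["role"] == "user":
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": history[-1]["content"]}

answer = run_agent(fake_model, {"add": lambda a, b: a + b}, "what is 2 + 3?")
```

&lt;p&gt;Everything a multi-agent system does with handoffs, this loop does with a tool call and an appended history entry, in one context.&lt;/p&gt;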

&lt;p&gt;In 2026 this single-agent pattern has gotten dramatically more capable. The tool-use capability of the frontier models is excellent. Context windows are large enough to hold a meaningful working memory. Cache support means you can keep a long system prompt cheap. Reasoning models can plan multi-step approaches inside a single call.&lt;/p&gt;

&lt;p&gt;The result is that many tasks that looked like they needed coordination across multiple agents in 2024 are now better handled by one agent that has access to the right tools.&lt;/p&gt;

&lt;p&gt;The question becomes: what do you actually gain by splitting into multiple agents?&lt;/p&gt;

&lt;p&gt;The honest answer for most projects is: very little, at the cost of a lot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Pay For Multi-Agent
&lt;/h2&gt;

&lt;p&gt;Every coordination boundary you add introduces costs that are easy to underestimate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token cost is multiplicative, not additive.&lt;/strong&gt; Each agent has its own context window, its own system prompt, and its own conversation history. When agent A tells agent B about the work it did, agent B has to read all of that, plus its own prompt, plus its own history. The total token spend across a five-agent system can easily be 5x the spend of a single agent solving the same task. Caching helps, but only when the prompts are stable, which they often are not in dynamic delegation patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency stacks up.&lt;/strong&gt; Every handoff between agents is a round trip to a model. Five agents in a sequential pipeline means five sequential model calls, each waiting for the previous one to finish. Where a single agent might solve the task in two or three model calls in a tool loop, the multi-agent version can take ten or fifteen. The user feels every one of those.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging gets exponentially harder.&lt;/strong&gt; When a single agent misbehaves, you read its trace and see what it did. When five agents misbehave, you have to figure out which agent introduced the error, whether the error was in its work or in the way the previous agent framed the task, whether a downstream agent compounded the mistake, and whether the planner should have caught it. I covered some of the patterns that help in &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability&lt;/a&gt;, but no amount of tooling fully cancels the complexity tax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure modes multiply.&lt;/strong&gt; Each agent can fail independently. Each handoff can fail. The planner can mis-delegate, the worker can misunderstand, the reviewer can be too lenient or too harsh. You now have to design for retries, partial failures, deadlocks, and infinite delegation loops. None of these problems exist in a single-agent design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coordination prompts eat into the work.&lt;/strong&gt; A meaningful chunk of the prompt budget in a multi-agent system is dedicated to telling each agent how to coordinate with the others. "Wait for the researcher's output before drafting." "Return your result in this format so the planner can integrate it." This is overhead that does not produce value for the user; it just keeps the system from falling apart.&lt;/p&gt;

&lt;p&gt;When the gain is real, these costs are worth it. When the gain is imagined, they are pure burn.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Multi-Agent Is Actually Right
&lt;/h2&gt;

&lt;p&gt;There are three patterns where I have consistently found multi-agent designs to beat single-agent ones in real production work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Genuinely Parallelizable Subtasks
&lt;/h3&gt;

&lt;p&gt;If the task can be cleanly decomposed into independent subtasks that have no dependency on each other, multi-agent is a real win. The classic example is research. You give a planner a question, it identifies five independent topics to investigate, dispatches them to five workers in parallel, and integrates the results.&lt;/p&gt;

&lt;p&gt;The wins here are concrete. Latency drops because the workers run in parallel. Each worker has a focused context with only the data relevant to its piece, so quality goes up. The planner does not need to know how each worker did its job, only what each one returned.&lt;/p&gt;

&lt;p&gt;The key word is independent. If worker B's task depends on worker A's findings, you have a sequential pipeline, not a parallel crew, and you lose the latency win.&lt;/p&gt;

&lt;p&gt;This is the pattern that I have seen produce actual wins in production research tools, due diligence agents, and competitor analysis bots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Specialist Knowledge Boundaries
&lt;/h3&gt;

&lt;p&gt;Sometimes a task spans multiple domains where the prompts and tools needed are genuinely different. A code review might need a security specialist, a performance specialist, and a style specialist. Each of them has a different system prompt, a different set of tools, and a different evaluation criterion.&lt;/p&gt;

&lt;p&gt;You can technically pack all of this into a single mega-prompt with all the tools, but in practice the specialists do better work when each one has a focused prompt. The single-agent version starts to confuse the criteria. Should it prioritize security or style? It tries to balance both and ends up doing neither well.&lt;/p&gt;

&lt;p&gt;The split here is justified when the specializations are distinct enough that each agent benefits from a fundamentally different prompt and toolset. If the agents are ninety percent the same prompt with five percent variation, you do not need them; you need a single agent with conditional logic in the prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Output Quality Through Critique
&lt;/h3&gt;

&lt;p&gt;Some tasks have a quality bar that a single pass cannot reliably hit. Long-form writing, complex code, formal proofs. The first draft is rarely good enough, and the model knows this in retrospect but not in the moment of generating it.&lt;/p&gt;

&lt;p&gt;A two-agent setup, writer and critic, produces noticeably better outputs on these tasks. The writer drafts. The critic reads with fresh eyes (a fresh context, no commitment to the draft) and points out problems. The writer revises. Sometimes you loop two or three times before committing.&lt;/p&gt;

&lt;p&gt;This pattern is closer to a debate than a delegation, and the win is purely on output quality. The cost is real (you are doubling or tripling the model calls) but for tasks where quality matters more than latency, it is worth it.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Multi-Agent Is The Wrong Answer
&lt;/h2&gt;

&lt;p&gt;The mirror image of the patterns above is where most multi-agent projects fail. Here are the anti-patterns I have lived through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks that are sequential by nature.&lt;/strong&gt; If the task is "do A, then do B with the result of A, then do C with the result of B," you do not have multi-agent. You have a workflow. Build it as a workflow with explicit steps. Use a &lt;a href="https://dev.to/blog/durable-ai-workflows-orchestration-2026"&gt;durable workflow engine&lt;/a&gt; like the ones I compared in &lt;a href="https://dev.to/blog/temporal-vs-inngest-vs-vercel-workflow-2026"&gt;Temporal vs Inngest vs Vercel Workflow&lt;/a&gt;. The structure of the work is the structure of your code; do not hide it behind agent personalities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks where the specialization is shallow.&lt;/strong&gt; If your "researcher" and your "writer" share 90% of the same context and the same instructions, they are not specialists. They are one agent with two prompts and twice the cost. The split is only justified when the specialists are doing fundamentally different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks where the planner is just routing.&lt;/strong&gt; If your planner agent is just looking at the input and dispatching to one of three workers, replace it with code. A regex, a classifier, or a simple if-statement is faster, cheaper, and more reliable than an LLM doing a routing decision. Save the LLM for things that actually need an LLM.&lt;/p&gt;
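&lt;p&gt;To make that concrete, here is the kind of deterministic router that replaces a dispatch-only planner; the categories and patterns are invented for illustration:&lt;/p&gt;

```python
import re

def route(ticket: str) -> str:
    """Deterministic routing instead of an LLM 'planner' that only dispatches.
    Faster, cheaper, and fully testable."""
    if re.search(r"\b(refund|charge|invoice|billing)\b", ticket, re.I):
        return "billing_worker"
    if re.search(r"\b(crash|error|bug|stack trace)\b", ticket, re.I):
        return "technical_worker"
    return "general_worker"
```

&lt;p&gt;When the routing rules outgrow regexes, a small trained classifier is still cheaper and more predictable than a model call per dispatch.&lt;/p&gt;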

&lt;p&gt;&lt;strong&gt;Tasks where coordination overhead is most of the work.&lt;/strong&gt; I once built a five-agent customer support triage system that spent 80% of its tokens on agents telling each other what they were doing. When I rebuilt it as a single agent with the same tools, it produced better answers, ran 4x faster, and cost a quarter as much. The coordination was the bug, not the feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks where the user expects fast feedback.&lt;/strong&gt; Multi-agent systems are slower. If the user is waiting on the output, every handoff adds latency they feel. Single agents with streaming feel fast. Multi-agent systems feel like they are thinking too long, even when they are thinking well.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Hybrid That Often Wins
&lt;/h2&gt;

&lt;p&gt;The shape that I have ended up using most often in 2026 is not pure multi-agent and not pure single-agent. It is a single primary agent that can spawn focused sub-agents for specific subtasks, but only when the subtask is independent and parallelizable.&lt;/p&gt;

&lt;p&gt;The primary agent has the full context of the user request. It does most of the work. When it identifies a subtask that benefits from a fresh context (large research dive, isolated code generation, parallel investigation of multiple options), it spawns a sub-agent with a narrow prompt and a defined return shape. The sub-agent does its thing, returns the result, and the primary agent integrates it.&lt;/p&gt;

&lt;p&gt;This pattern keeps the simplicity of single-agent for the common path and only invokes the complexity of multi-agent where it actually pays off. The primary agent is in charge. The sub-agents are tools, not peers.&lt;/p&gt;

&lt;p&gt;In code, this often looks like the primary agent has a &lt;code&gt;delegate_subtask&lt;/code&gt; tool that takes a focused prompt, runs a separate model call with a clean context, and returns the result. The orchestration is implicit in how the primary agent uses the tool.&lt;/p&gt;
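&lt;p&gt;A sketch of that tool, with &lt;code&gt;model_call&lt;/code&gt; standing in for whatever provider SDK you actually use:&lt;/p&gt;

```python
def make_delegate_subtask(model_call):
    """Builds a `delegate_subtask` tool: each delegation runs a separate model
    call with a clean context and a defined return shape. `model_call(messages)`
    is a stand-in for the provider SDK."""
    def delegate_subtask(prompt: str, return_schema: str) -> str:
        messages = [
            {"role": "system",
             "content": f"Complete the task. Respond only as: {return_schema}"},
            {"role": "user", "content": prompt},
        ]
        # Fresh context: no history from the primary agent leaks in.
        return model_call(messages)
    return delegate_subtask

# The primary agent registers this like any other tool.
delegate = make_delegate_subtask(lambda msgs: f"[result for: {msgs[-1]['content']}]")
```

&lt;p&gt;From the primary agent's point of view this is just another function in its toolset, which is exactly the inversion described below: sub-agents as tools, not peers.&lt;/p&gt;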

&lt;p&gt;The reason this works is that it inverts the multi-agent default. Instead of "always coordinate, sometimes do work directly," it is "always do work directly, sometimes delegate." The default path is the cheap path, and you only pay multi-agent costs when you need them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory and State Across Agents
&lt;/h2&gt;

&lt;p&gt;If you do go multi-agent, the state question becomes load-bearing. Each agent has its own context. None of them automatically know what the others did. You have to design how state flows between them.&lt;/p&gt;

&lt;p&gt;Three patterns I have seen work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit return values.&lt;/strong&gt; Each agent returns a structured object. The next agent reads it. State is passed by value, never shared. This is simple and reliable but has limits when the state is large.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared scratchpad.&lt;/strong&gt; Agents read and write to a shared memory store (a key-value store, a markdown file, a database). The orchestrator gives each agent a pointer to the relevant section. This scales better but introduces concurrency bugs that are nightmare fuel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message passing through the planner.&lt;/strong&gt; Workers do not talk to each other. They only talk to the planner, which integrates everything and decides what to send to whom. This is the cleanest from a debugging perspective but the planner becomes a bottleneck.&lt;/p&gt;
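&lt;p&gt;The explicit-return-value pattern is the one to start with, and it is only a few lines. A sketch with hypothetical field names:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkerResult:
    """Explicit, immutable return value: state is passed by value, never shared."""
    worker: str
    summary: str
    sources: tuple[str, ...] = ()

def integrate(results: list[WorkerResult]) -> str:
    # The planner is the only place results meet; workers never talk directly.
    return "\n".join(f"[{r.worker}] {r.summary}" for r in results)
```

&lt;p&gt;Because the results are frozen values, a misbehaving worker cannot mutate another worker's output, and every trace shows exactly what each agent returned.&lt;/p&gt;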

&lt;p&gt;The right choice depends on the pattern, but the meta-rule is: the simpler the state model, the easier the system is to debug. Most production multi-agent systems start with explicit return values and only move to shared scratchpads when they have to. I went deeper on this in &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;AI agent memory and state persistence&lt;/a&gt;, but the short version is: pass state explicitly until you cannot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Modeling Before You Build
&lt;/h2&gt;

&lt;p&gt;Before you commit to multi-agent, do the cost math. This is the step everyone skips and regrets.&lt;/p&gt;

&lt;p&gt;For a typical multi-agent system with N agents and an average of T tokens per agent context, the per-task cost is roughly N × T tokens. For a single-agent equivalent, it is roughly T tokens (sometimes a bit more if the single agent has to keep more context).&lt;/p&gt;

&lt;p&gt;If your task volume is high, that multiplier is your monthly bill. A five-agent system processing a million tasks a month at 5x the token cost of a single-agent equivalent is a meaningful budget difference.&lt;/p&gt;
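&lt;p&gt;The back-of-the-envelope version of that math, with the task volume, context size, and price all assumed purely for illustration:&lt;/p&gt;

```python
def monthly_token_cost(tasks: int, tokens_per_context: int, agents: int,
                       price_per_m: float = 15.0) -> float:
    """Rough monthly input-token bill under the N x T model: every agent
    carries its own context of roughly T tokens per task."""
    return tasks * tokens_per_context * agents * price_per_m / 1_000_000

single = monthly_token_cost(1_000_000, 20_000, agents=1)  # one agent
crew   = monthly_token_cost(1_000_000, 20_000, agents=5)  # five-agent crew
```

&lt;p&gt;Under these assumptions the crew costs exactly five times the single agent per month, before any caching or cheaper worker models claw some of it back.&lt;/p&gt;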

&lt;p&gt;The ways to fight this in a multi-agent design:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache aggressively.&lt;/strong&gt; If the system prompt of each specialist is stable, prompt caching will dramatically reduce the per-call cost of the static parts. I went deep on this in &lt;a href="https://dev.to/blog/prompt-caching-production-guide-2026"&gt;prompt caching&lt;/a&gt; and the same techniques apply to every agent in the crew.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cheaper models for narrow workers.&lt;/strong&gt; A specialist with a focused task often does not need the frontier model. Use Haiku or Gemini Flash for narrow workers and reserve the frontier model for the planner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trim the coordination prompts.&lt;/strong&gt; Anything in the system prompt that is not load-bearing should go. In multi-agent designs, prompts tend to bloat with instructions about how to coordinate; audit ruthlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch parallel subtasks.&lt;/strong&gt; If your specialists run in parallel, batch the calls. Most providers offer batch APIs at meaningful discounts for non-real-time work.&lt;/p&gt;

&lt;p&gt;If after all of this the multi-agent design is still 3x more expensive than the single-agent equivalent and the quality difference is marginal, you have your answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;Here is the short version that I now run through before reaching for multi-agent.&lt;/p&gt;

&lt;p&gt;Start with a single agent and the right tools. Try to solve the task that way. If it works, ship it.&lt;/p&gt;

&lt;p&gt;If it does not work, ask why. Is the agent confused because the task is genuinely two different jobs (specialist split is justified)? Is the agent slow because subtasks could run in parallel (parallelizable split is justified)? Is the output quality consistently below the bar even with good prompting (critic loop is justified)?&lt;/p&gt;

&lt;p&gt;If none of those apply but the agent is still failing, the problem is probably in the prompt, the tools, or the eval set, not in the architecture. Multi-agent will not fix a bad prompt. It will hide it under a coordination layer.&lt;/p&gt;

&lt;p&gt;If one of the patterns does apply, prefer the hybrid: a primary agent that delegates specific subtasks, rather than a planner-and-workers crew. This keeps the orchestration tractable.&lt;/p&gt;

&lt;p&gt;If you really do need a full crew, design state passing explicitly, pick the cheapest model that works for each role, and budget for the latency tax. Build observability before you build features. The system will surprise you, and the surprises are more expensive when you cannot see what each agent did.&lt;/p&gt;

&lt;p&gt;The multi-agent pattern is real and useful in a small number of places. It is not the default. The default is one agent with good tools, and you should fight to keep it that way as long as you can.&lt;/p&gt;

&lt;p&gt;The systems I have seen succeed in production are almost always smaller than they look from the outside. One agent doing real work, sometimes spawning a focused helper, with disciplined tools and good evals. That setup ships. Five-agent crews delegating to each other in baroque hierarchies usually do not.&lt;/p&gt;

&lt;p&gt;Build the smallest thing that works. Add agents only when you have evidence they are paying for themselves.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>agents</category>
      <category>saas</category>
    </item>
    <item>
      <title>Vector Database Comparison 2026: pgvector, Pinecone, Turbopuffer, and Qdrant</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 23 Apr 2026 09:19:09 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/vector-database-comparison-2026-pgvector-pinecone-turbopuffer-and-qdrant-55ak</link>
      <guid>https://forem.com/alexcloudstar/vector-database-comparison-2026-pgvector-pinecone-turbopuffer-and-qdrant-55ak</guid>
      <description>&lt;p&gt;Six months ago I was the guy defending Pinecone in every group chat. The managed service was fine, the latency was acceptable, the price was high but predictable, and the API had not burned me. Then my bill hit four figures on a product that was not earning four figures, and I started looking at alternatives with the energy of a man who had just seen his AWS statement.&lt;/p&gt;

&lt;p&gt;Two months later I have run the same workload across four different vector stores. The workload is a RAG pipeline for a niche docs site with roughly 2 million embedded chunks, 500 to 2,000 queries per day, and a latency budget of 200 milliseconds at the p95. Nothing exotic. Just the kind of RAG setup that a lot of developers ship and then stop thinking about until the invoice arrives.&lt;/p&gt;

&lt;p&gt;This post is the honest write-up of what each option actually felt like to run, where each one broke, and which one I ended up keeping in production. If you are picking a vector database in 2026 and you want numbers from a real workload instead of a benchmark deck, this is that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed In Vector Storage In 2026
&lt;/h2&gt;

&lt;p&gt;Before the comparison, a quick orientation on where the market sits today, because the vector DB landscape has shifted more than most people noticed.&lt;/p&gt;

&lt;p&gt;Pinecone is still the most well-known managed option, but its mindshare is cracking. The complaints that used to be whispered are now loud. Cost at scale is bad. The query model is opinionated in ways that frustrate people. The free tier got stingier.&lt;/p&gt;

&lt;p&gt;pgvector is no longer the "cute little extension" it was in 2023. Postgres 17 and 18 landed serious improvements to HNSW indexing and parallel query execution, and the extension itself got a major overhaul. For most workloads under 10 million vectors, it is now fully production grade and the operational story is simpler than any dedicated vector DB.&lt;/p&gt;

&lt;p&gt;Turbopuffer went from "interesting beta thing" to one of the most talked-about storage layers in the AI infrastructure space. It is built on object storage, which means it is dramatically cheaper than alternatives for large corpora, at the cost of some latency.&lt;/p&gt;

&lt;p&gt;Qdrant keeps quietly eating market share. The open-source story is strong, the hosted product is solid, and the feature velocity is faster than its competitors. It has become the default pick for people who want serious filtering and hybrid search without paying Pinecone prices.&lt;/p&gt;

&lt;p&gt;Weaviate, Milvus, Chroma, and LanceDB all still exist and still have fans, but none of them punched through enough to warrant prime billing in this comparison. I will touch on them briefly at the end.&lt;/p&gt;

&lt;p&gt;The rest of this post is the four options I actually ran in parallel, with the same data, the same queries, and the same team writing the glue code.&lt;/p&gt;




&lt;h2&gt;
  
  
  pgvector: The Boring Winner For Most Projects
&lt;/h2&gt;

&lt;p&gt;pgvector is a Postgres extension that adds a &lt;code&gt;vector&lt;/code&gt; column type and similarity search operators. You install it, you create a column, you insert embeddings, you query with cosine or L2 distance. If you already run Postgres, which you probably do, there is nothing new to learn, deploy, or monitor.&lt;/p&gt;
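
&lt;p&gt;The whole on-ramp really is that short: &lt;code&gt;CREATE EXTENSION vector;&lt;/code&gt;, a &lt;code&gt;vector(1024)&lt;/code&gt; column, an HNSW index built with &lt;code&gt;vector_cosine_ops&lt;/code&gt;, and queries ordered by &lt;code&gt;embedding &amp;lt;=&amp;gt; $1&lt;/code&gt;. The math behind that cosine operator is worth internalizing, and it fits in a few lines of Python (the vectors here are toy values, not real embeddings):&lt;/p&gt;

```python
import math

def cosine_distance(a, b):
    # pgvector's cosine operator computes 1 minus cosine similarity:
    # 0.0 for identical direction, 1.0 for orthogonal, 2.0 for opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 1.0
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction: ~0.0
```

&lt;p&gt;Lower is closer, which is why the SQL sorts ascending on the distance operator.&lt;/p&gt;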

&lt;h3&gt;
  
  
  What worked
&lt;/h3&gt;

&lt;p&gt;The operational story is the thing. pgvector is just Postgres. Backups, migrations, monitoring, connection pooling, ACID transactions, joins against your relational data, all the things you already know how to do in Postgres. You do not need a separate vector pipeline, a separate set of credentials, a separate billing dashboard, or a separate on-call rotation.&lt;/p&gt;

&lt;p&gt;Joining vector search results against other tables is trivial. Need to filter embeddings by tenant, by access permission, by date range, by any column on any other table? It is a SQL query. On any dedicated vector DB, this same operation is a second round trip plus whatever metadata filtering their API supports, which is usually less flexible than SQL.&lt;/p&gt;

&lt;p&gt;HNSW index performance has caught up. On my 2 million vector workload with 1024-dimensional embeddings, pgvector served queries in 8 to 25 milliseconds at the p95, well under my 200ms budget. Index build time on the initial load was about 40 minutes on a mid-sized RDS instance, which is acceptable for most projects.&lt;/p&gt;

&lt;p&gt;Cost is hard to beat. If you are already running Postgres, adding pgvector is free. The extra compute and storage are measurable but small. My RDS bill went up maybe 15 percent after adding the vector workload. Compared to Pinecone at roughly 4x that number per month, the math is obvious.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it hurts
&lt;/h3&gt;

&lt;p&gt;It does not scale past a certain point. At 2 million vectors I was comfortable. At 20 million vectors on the same instance, query latency climbed into the hundreds of milliseconds and index build time became a weekend project. You can scale up the instance, but eventually you are running a database server that is mostly serving vector queries, which is a weird shape for Postgres.&lt;/p&gt;

&lt;p&gt;Hybrid search support is okay, not great. You can combine lexical and vector search in SQL, but you are writing the combination yourself. Dedicated vector DBs and search engines have more mature hybrid retrieval built in.&lt;/p&gt;
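
&lt;p&gt;"Writing the combination yourself" usually means something like Reciprocal Rank Fusion: run a full-text query and a vector query separately, then merge the two ranked lists in application code. A minimal sketch, assuming each ranking is just an ordered list of document ids:&lt;/p&gt;

```python
def rrf_merge(rankings, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum of 1 / (k + rank) across lists.
    # k=60 is the commonly used default from the original RRF paper.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["a", "b", "c"]   # e.g. ordered by ts_rank over a tsvector column
semantic = ["b", "c", "d"]  # e.g. ordered by embedding distance
print(rrf_merge([lexical, semantic]))  # "b" first: it ranks high in both
```

&lt;p&gt;Dedicated engines bake this fusion in and let you tune the weighting; in Postgres, this loop lives in your code.&lt;/p&gt;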

&lt;p&gt;Re-indexing after large inserts is not free. HNSW handles incremental writes fine, but after bulk-inserting millions of vectors you generally want to rebuild the index, and a plain &lt;code&gt;CREATE INDEX&lt;/code&gt; blocks writes to the table for the duration. &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; avoids the lock at the price of a slower build. For workloads where data changes constantly, this is annoying.&lt;/p&gt;

&lt;p&gt;Metadata filtering on very selective queries is slower than pure vector search. pgvector is improving here but the pre-filter vs post-filter decision still requires some thought on complex queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who it is for
&lt;/h3&gt;

&lt;p&gt;pgvector is the right default for any project where you are already on Postgres, your vector count is under 10 million, and you do not have a specific reason to reach for a dedicated vector DB. That covers most indie and small-team projects. If you are running a &lt;a href="https://dev.to/blog/rag-vs-long-context-2026"&gt;RAG setup like the hybrid pattern I described recently&lt;/a&gt;, pgvector will handle it fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pinecone: The Managed Option That Used To Be The Default
&lt;/h2&gt;

&lt;p&gt;Pinecone was the early winner in the managed vector DB space and it is still the option most developers have heard of first. It is a hosted, serverless vector database with a clean API, no ops overhead, and a reputation for "it just works."&lt;/p&gt;

&lt;h3&gt;
  
  
  What worked
&lt;/h3&gt;

&lt;p&gt;Setup time is legitimately short. Sign up, get an API key, point your embedder at the index, start querying. There is no infrastructure to manage. There is no version upgrade to worry about. There is no disk to run out of.&lt;/p&gt;

&lt;p&gt;Serverless pricing on smaller workloads is reasonable. For a project under 100,000 vectors with modest traffic, Pinecone's free or low tier is genuinely fine and you should not over-engineer the choice.&lt;/p&gt;

&lt;p&gt;The API is clean and has good client libraries across the major languages. Error messages are decent. The docs are well-maintained. You do not spend your first week fighting the tool.&lt;/p&gt;

&lt;p&gt;Performance is consistent. My queries ran in 40 to 80 milliseconds at the p95, which is slower than pgvector on my particular setup but well within any user-facing latency budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it hurts
&lt;/h3&gt;

&lt;p&gt;Cost at scale is brutal. This is the complaint you will hear most often and it is deserved. At 2 million vectors with low query volume, my Pinecone bill was roughly 4x what pgvector cost me, and the numbers get worse as vector count goes up. At 10 million vectors, Pinecone crosses into "is this even worth it" territory for anyone who is not funded.&lt;/p&gt;

&lt;p&gt;The pod model has been confusing for years. Serverless improved the on-ramp, but tuning for performance still involves understanding concepts that are specific to Pinecone and not transferable to any other system. When you hit a performance issue, the first hour is often spent learning Pinecone-specific terminology rather than debugging.&lt;/p&gt;

&lt;p&gt;Metadata filtering is limited compared to a real database. You can filter by fields, but complex queries with multiple conditions or joins are either not possible or have to be implemented in application code. This is fine for simple tenant-scoped lookups. It is painful for richer filtering patterns.&lt;/p&gt;

&lt;p&gt;Lock-in is real. Your data lives in Pinecone's format in Pinecone's system. Migrating away requires rebuilding your index somewhere else and re-embedding if you want to change models. The cost of switching is non-trivial and Pinecone's pricing power comes from knowing that.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who it is for
&lt;/h3&gt;

&lt;p&gt;Pinecone makes sense if you want zero ops overhead, your vector count is moderate, and the cost is acceptable for your business model. It is a reasonable pick for teams with funding and no appetite for managing infrastructure. It is a bad pick for bootstrapped projects or for anything where unit economics matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Turbopuffer: The Object Storage Play
&lt;/h2&gt;

&lt;p&gt;Turbopuffer took a bet that the future of vector search is backed by object storage. Instead of keeping all vectors in memory or on attached SSDs, it stores them in S3 or equivalent and uses caching and smart indexing to make queries fast enough without the memory footprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  What worked
&lt;/h3&gt;

&lt;p&gt;Cost per vector is dramatically lower than any traditional vector DB. For large corpora, the difference is not 2x or 3x. It is 10x to 50x. If you are embedding an entire documentation corpus, a legal archive, or a multi-tenant dataset with millions of vectors per tenant, Turbopuffer's economics are in a different league.&lt;/p&gt;

&lt;p&gt;Scaling to very large datasets is effectively unbounded. You are limited by object storage, which is to say, practically not limited at all. I tested up to 50 million vectors and latency stayed stable, which is the part that matters.&lt;/p&gt;

&lt;p&gt;The API is clean and simple. Insert, query, filter, done. Less surface area than most of the competition, which is good when you want a tool to get out of your way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it hurts
&lt;/h3&gt;

&lt;p&gt;Cold query latency is the trade. First queries against a cold shard are significantly slower than in-memory alternatives, often in the 300 to 800 millisecond range on my workload. Caching warms it up for subsequent queries, but if your query pattern has a lot of cache misses, you feel it.&lt;/p&gt;

&lt;p&gt;The hosted product is younger than Pinecone's. The docs are good but there is less third-party content, fewer tutorials, and fewer people who have already hit your specific problem and written about it.&lt;/p&gt;

&lt;p&gt;Hybrid search is improving but not as polished as Qdrant's. If you need serious lexical-plus-vector retrieval with tuned weighting, you are doing more of the work yourself.&lt;/p&gt;

&lt;p&gt;Metadata filtering is capable but, like most dedicated vector stores, not as expressive as SQL. For filter-heavy workloads, this can push complexity into your application layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who it is for
&lt;/h3&gt;

&lt;p&gt;Turbopuffer is the pick for workloads where you have a lot of vectors and cost matters. Multi-tenant apps with per-tenant corpora. Large document archives. Anything where you would have looked at Pinecone's pricing and spit out your coffee. If your traffic pattern tolerates occasional colder queries, the cost savings are the kind that change whether the feature ships at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  Qdrant: The Feature-Rich Open Option
&lt;/h2&gt;

&lt;p&gt;Qdrant is an open-source vector database written in Rust. You can self-host it or use the hosted product. It has arguably the richest feature set of any option in this comparison: advanced filtering, hybrid search, quantization, sparse vectors, and a lot of knobs for tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  What worked
&lt;/h3&gt;

&lt;p&gt;Hybrid search is genuinely excellent. Qdrant supports lexical and dense retrieval in the same query with tuned weighting, and the results are noticeably better than any of the other options on queries where both signals matter.&lt;/p&gt;

&lt;p&gt;Filtering is expressive. You can filter on nested fields, ranges, geo-spatial conditions, and logical combinations with syntax that reads cleanly. For filter-heavy workloads this is a meaningful step up.&lt;/p&gt;
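
&lt;p&gt;For a flavor of that syntax, here is roughly the shape of a filter in Qdrant's REST API: every clause in &lt;code&gt;must&lt;/code&gt; ANDs together, &lt;code&gt;must_not&lt;/code&gt; excludes. The field names are invented for illustration; check the current Qdrant docs for the full grammar.&lt;/p&gt;

```python
import json

# Hypothetical fields: tenant, published_at, region.
query_filter = {
    "must": [
        {"key": "tenant", "match": {"value": "acme-corp"}},
        {"key": "published_at", "range": {"gte": 1735689600}},
    ],
    "must_not": [
        {"key": "region", "match": {"value": "restricted"}},
    ],
}
print(json.dumps(query_filter, indent=2))
```

&lt;p&gt;The same shape nests arbitrarily with &lt;code&gt;should&lt;/code&gt; clauses for OR logic, which is most of what filter-heavy workloads need.&lt;/p&gt;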

&lt;p&gt;Self-hosting works well. The defaults are sensible. Resource usage is reasonable. Upgrading between versions has been smooth for me. If you want a vector DB you control on infrastructure you control, Qdrant is the one that gives you the least operational pain.&lt;/p&gt;

&lt;p&gt;Hosted pricing is competitive. Cheaper than Pinecone for comparable workloads, more flexible than pgvector once your dataset grows past its comfort zone.&lt;/p&gt;

&lt;p&gt;Performance on my workload came in around 15 to 40 milliseconds at the p95, between pgvector and Pinecone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it hurts
&lt;/h3&gt;

&lt;p&gt;The surface area is larger than some developers want. All those features mean more things to learn, more choices to make at setup time, and more opportunities to misconfigure something. If you just want a simple vector store, Qdrant can feel like overkill.&lt;/p&gt;

&lt;p&gt;The Rust ecosystem keeps the core tight but the client libraries across every language are not all equally polished. The Python client is great. The TypeScript client is fine. Others vary.&lt;/p&gt;

&lt;p&gt;You are running a second piece of infrastructure. That is the trade against pgvector. It is not a lot of operational overhead, but it is not zero.&lt;/p&gt;

&lt;h3&gt;
  
  
  Who it is for
&lt;/h3&gt;

&lt;p&gt;Qdrant is the right pick when you need advanced filtering, hybrid search, or serious tuning options, and either you are comfortable with self-hosting or their hosted pricing works for you. It is also a great middle ground between the "just use Postgres" extreme and the "pay someone else to care about this" extreme.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side By Side: The Numbers
&lt;/h2&gt;

&lt;p&gt;Here is how the four options stacked up on my actual workload. These numbers are for 2 million vectors, 1024-dimensional embeddings, and roughly 1,000 queries per day with moderate filtering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query latency at p95.&lt;/strong&gt; pgvector: 8 to 25ms. Qdrant: 15 to 40ms. Pinecone: 40 to 80ms. Turbopuffer: 25 to 60ms warm, 300 to 800ms cold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monthly cost at this scale.&lt;/strong&gt; pgvector: marginal bump to existing RDS bill, maybe 30 dollars. Qdrant self-hosted: 40 dollars. Qdrant hosted: 80 dollars. Turbopuffer: 25 dollars. Pinecone: 180 dollars on their serverless tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup complexity.&lt;/strong&gt; pgvector: minutes if you are on Postgres. Pinecone: minutes. Turbopuffer: an hour. Qdrant hosted: an hour. Qdrant self-hosted: half a day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature richness for advanced retrieval.&lt;/strong&gt; Qdrant wins by a meaningful margin. pgvector is capable but you are writing more SQL. Pinecone is adequate for the common cases. Turbopuffer is improving but still the newest of the four.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling ceiling.&lt;/strong&gt; Turbopuffer is effectively unbounded. Qdrant scales well with effort. Pinecone scales if you pay for it. pgvector tops out for most teams around 10 million vectors without significant tuning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Ended Up With
&lt;/h2&gt;

&lt;p&gt;I moved the production workload from Pinecone to pgvector. It was the right call for my specific situation. The vector count was well within pgvector's comfort zone. The ops savings were real because I was already running Postgres. The cost drop was dramatic. The latency was actually better than Pinecone on my workload, which I was not expecting.&lt;/p&gt;

&lt;p&gt;I kept Qdrant for a different project where I needed serious hybrid search. The lexical plus dense retrieval combination delivered results that neither pure vector search nor pure text search was producing, and the feature gap between Qdrant and pgvector on that specific pattern mattered.&lt;/p&gt;

&lt;p&gt;I am using Turbopuffer for a third project, a large documentation archive where cost per vector is the dominant factor. The cold query latency is real but the archive is not user-facing, so a slower first query per topic is acceptable.&lt;/p&gt;

&lt;p&gt;Pinecone is gone from my stack. I do not have a bad word to say about the product. The pricing just stopped making sense for my unit economics. If I had a funded company where ops simplicity was worth 4x on the bill, I might still be there.&lt;/p&gt;

&lt;p&gt;The interesting thing about this exercise is that the right answer was different for each project. "Which vector DB should I use" does not have one answer. It has one answer per workload, and the workload details that matter are size, latency, filter complexity, and whether your business model can absorb the cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Quick Decision Guide
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this post, here is the shortest version I can give:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Already running Postgres and under 10 million vectors? Start with pgvector.&lt;/li&gt;
&lt;li&gt;Need serious hybrid search or rich filtering? Qdrant.&lt;/li&gt;
&lt;li&gt;Very large corpus and cost is the biggest factor? Turbopuffer.&lt;/li&gt;
&lt;li&gt;Want zero ops and have the budget? Pinecone is still fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then validate your pick with real queries on real data before you commit. Benchmarks from a blog post, including this one, are a starting point, not a conclusion. The workload that matters is yours.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About The Other Options
&lt;/h2&gt;

&lt;p&gt;A few words on the vector stores that did not make prime billing, because they will come up and you should know where they sit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaviate&lt;/strong&gt; has strong semantic search features and a loyal community. It feels like Qdrant's slightly older cousin. If you are already on Weaviate, stay on Weaviate. If you are picking new, Qdrant generally wins the comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Milvus&lt;/strong&gt; is the big-data option. Built for serious scale, used by teams with tens or hundreds of millions of vectors. If Turbopuffer's cold latency is a dealbreaker and you need in-memory performance at massive scale, this is the category where Milvus lives. For smaller workloads it is overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chroma&lt;/strong&gt; is the developer-experience darling for local and small-scale RAG. It is great for prototypes and local development. It is not where you want to be at production scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LanceDB&lt;/strong&gt; is a quiet sleeper. File-based, embedded, extremely simple to integrate, and the developer ergonomics are excellent. It is the right pick for desktop apps and some edge cases but has not yet hit the traction of the others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Elasticsearch and OpenSearch with vector plugins&lt;/strong&gt; deserve a mention. If you already run one of them for lexical search, adding vector search is a reasonable incremental step. The vector-first options generally outperform them on pure vector workloads, but the "we already have it" argument is strong.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;I spent two months testing four vector databases and the most valuable thing I learned was not which one was best. It was that I had been paying for Pinecone out of habit, not because it was the right tool. I installed it two years ago when it was the obvious default, never revisited the decision, and let the bill grow.&lt;/p&gt;

&lt;p&gt;That pattern is the pattern worth breaking. Vector databases are a category where the landscape shifted hard between 2023 and 2026. If you have not looked at your setup in the last year, there is a real chance you are running the wrong tool for your current workload, or paying three times what you need to.&lt;/p&gt;

&lt;p&gt;The cost of picking wrong used to be "minor inefficiency." The cost now, for a small team running real traffic, is the difference between a feature that pays for itself and a feature that quietly drains your runway. Pick with your eyes open, benchmark on your data, and do not let yesterday's default choice be today's default line item.&lt;/p&gt;

&lt;p&gt;The vector DB question in 2026 has four reasonable answers, not one. Figure out which one fits your workload, stop overpaying for the others, and get back to building the part of the product users actually care about.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>saas</category>
      <category>productivity</category>
    </item>
    <item>
      <title>RAG vs Long Context in 2026: When to Retrieve and When to Just Stuff the Window</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Thu, 23 Apr 2026 09:18:02 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/rag-vs-long-context-in-2026-when-to-retrieve-and-when-to-just-stuff-the-window-dgb</link>
      <guid>https://forem.com/alexcloudstar/rag-vs-long-context-in-2026-when-to-retrieve-and-when-to-just-stuff-the-window-dgb</guid>
      <description>&lt;p&gt;I spent a weekend last month ripping out a retrieval pipeline I had built six months earlier. The feature was a support-ticket triage bot that pulled relevant docs from a vector database, stuffed the top matches into the prompt, and asked Claude to draft a response. The whole thing worked, but the plumbing around it was becoming a second job. Re-indexing whenever docs changed. Tuning the chunk size every time a new doc type landed. Debugging why a clearly relevant doc ranked fourth instead of first.&lt;/p&gt;

&lt;p&gt;The replacement took me a Sunday afternoon. I dropped the vector DB entirely, concatenated all 180,000 tokens of support docs into a single system prompt, enabled prompt caching, and sent every ticket to Claude with the full doc set in context. Quality went up. Latency went up too, but not as much as I expected. Cost went down once caching kicked in. The whole pipeline fit in about 60 lines of code.&lt;/p&gt;

&lt;p&gt;That success made me cocky. The next week I tried the same thing on our codebase assistant, which searches across 400,000 tokens of source code to answer developer questions. I yanked out the retrieval, stuffed the whole repo into context, and waited for the same magic to happen.&lt;/p&gt;

&lt;p&gt;It did not. Quality got worse. Costs tripled. Users complained. I spent the next week quietly putting RAG back.&lt;/p&gt;

&lt;p&gt;The lesson was that long context does not kill RAG. It changes the shape of the decision. This post is the framework I wish I had before I threw out the first pipeline and the mental model that kept me from making the same mistake a third time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Question Suddenly Matters
&lt;/h2&gt;

&lt;p&gt;Two years ago nobody asked whether to use RAG. You used RAG because context windows were 8k or 16k tokens and anything useful would not fit. The question was only which vector DB, which embedding model, and which chunking strategy.&lt;/p&gt;

&lt;p&gt;That world is gone. Here is what the current landscape looks like in early 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.7 and Sonnet 4.6 ship with a 1 million token context window&lt;/li&gt;
&lt;li&gt;Gemini 2.5 Pro offers 2 million tokens&lt;/li&gt;
&lt;li&gt;GPT-5 sits at 400,000 tokens&lt;/li&gt;
&lt;li&gt;Prompt caching on Anthropic, OpenAI, and Google cuts repeat-token costs by 75 to 90 percent&lt;/li&gt;
&lt;li&gt;Token prices have fallen roughly 60 percent year over year&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those shifts moves the calculus. A 200,000 token doc set that would have cost dollars per query in 2023 costs cents in 2026, especially if the tokens hit a warm cache. The "too expensive to be worth it" wall that made RAG mandatory has moved a long way out.&lt;/p&gt;

&lt;p&gt;But long context is not free, and the cost curve is not the only thing that matters. There is a quality story, a latency story, and a developer experience story, and each of them cuts differently depending on what you are actually trying to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Case For Stuffing The Window
&lt;/h2&gt;

&lt;p&gt;Let me steelman long context first, because it is the approach I reflexively underestimate and I want to be honest about where it wins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplicity is a real feature.&lt;/strong&gt; A RAG pipeline has at least five moving parts. An embedding model, a vector store, a retriever, a reranker, and the prompt template that stitches everything together. Each one has its own failure modes. Each one adds ops work when it breaks. Stuffing the window replaces all five with "put the data in the prompt." That simplicity is worth something, and the something is more than pride. It is fewer bugs, faster changes, and less time staring at the retrieval logs wondering why your system returned the wrong chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality is often better when relevance is ambiguous.&lt;/strong&gt; RAG is great when you know exactly which doc has the answer. It is bad when the answer requires piecing together information from multiple docs that do not obviously match the query. Long context lets the model do the cross-referencing itself, which is often what humans want when they ask a question. My support-ticket bot was in exactly this category. The right answer frequently drew from three or four docs where none of them contained the exact phrase the user had typed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching makes the cost palatable.&lt;/strong&gt; Without caching, running a 200,000 token prompt on every request would be a budget disaster. With caching, the static portion of the prompt pays full price once, gets read from cache on subsequent requests at a fraction of the cost, and refreshes its TTL every time it is hit. The math on a stable doc set changes from "this is too expensive" to "this is a rounding error once traffic is steady." &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization&lt;/a&gt; in 2026 increasingly lives or dies on whether you actually understand how caching works on your provider.&lt;/p&gt;
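
&lt;p&gt;A back-of-envelope version of that math, with placeholder prices rather than any provider's actual rates (the 0.1 cache-read multiplier and 90 percent hit rate are assumptions you should replace with your own):&lt;/p&gt;

```python
def monthly_prompt_cost(static_tokens, queries, price_per_mtok,
                        cache_read_mult=0.1, cache_hit_rate=0.9):
    # Toy model with placeholder numbers. Substitute your provider's real
    # input-token price and cache-read multiplier before trusting the output.
    per_query_full = static_tokens / 1e6 * price_per_mtok
    hits = queries * cache_hit_rate
    misses = queries - hits
    return misses * per_query_full + hits * per_query_full * cache_read_mult

# 180k static tokens, 3,000 queries a month, placeholder 3 dollars per
# million input tokens:
uncached = monthly_prompt_cost(180_000, 3_000, 3.0, cache_hit_rate=0.0)
cached = monthly_prompt_cost(180_000, 3_000, 3.0)
print(round(uncached, 2), round(cached, 2))  # 1620.0 vs 307.8 here
```

&lt;p&gt;A real model would also include the cache-write premium on misses and the per-query output tokens; this sketch only covers the static input portion, which is where the big multiplier lives.&lt;/p&gt;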

&lt;p&gt;&lt;strong&gt;No training time, no re-indexing.&lt;/strong&gt; When your docs change, you update the prompt. There is no embedding to regenerate, no index to rebuild, no stale-data debugging. For docs that change frequently, this is a significant operational win.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better attention patterns than you expect.&lt;/strong&gt; The "lost in the middle" problem was a 2023 concern that has mostly been engineered around in the frontier models. Claude, GPT, and Gemini all handle mid-context retrieval reasonably well now. Still not perfect. Still worth structuring your prompt so important information is near the top or bottom. But the crippling 2023-era decay is not what it was.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Case For Sticking With RAG
&lt;/h2&gt;

&lt;p&gt;Now let me steelman RAG, because it still wins in entire categories of problem and "just use long context" is a meme that sometimes leads developers into bad architectural calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long context costs scale linearly with your data.&lt;/strong&gt; Every query pays for the entire doc set even if the answer is in one paragraph. At 200,000 tokens, this is cheap with caching. At 2 million tokens, it is not. At 20 million tokens, which is roughly where any real enterprise knowledge base lives, it is architecturally impossible. RAG stays viable at any scale because it only pays for what it retrieves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency is real and it adds up.&lt;/strong&gt; A 200,000 token prompt to Claude Opus 4.7 takes roughly 6 to 12 seconds to return a response, even with caching. A RAG setup that retrieves 4,000 tokens and sends them to the same model returns in 2 to 4 seconds. For a batch job this does not matter. For a user-facing chatbot where the user is waiting, every second counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention degrades at the edges even now.&lt;/strong&gt; Even with the improvements since 2023, model accuracy on needle-in-haystack retrieval tasks still drops 10 to 20 percent between 50k and 500k tokens. If you care about catching every relevant detail in the docs, RAG plus a reranker is still more reliable than stuffing everything in context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic data is hard to stuff.&lt;/strong&gt; If your data changes per query, per user, or per session, you cannot get the caching discount. That removes the biggest cost advantage of long context. Multi-tenant apps where every user has different data access are a classic RAG use case and long context does not change that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Citations are easier.&lt;/strong&gt; RAG pipelines can return the source of every fact because they know which chunk they retrieved. With long context you either trust the model to cite correctly, which it sometimes does not, or you build a separate post-processing step that tries to map claims back to positions in the prompt. The RAG approach is architecturally cleaner for any use case where trust and attribution matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Decision Framework
&lt;/h2&gt;

&lt;p&gt;After a couple of months of running both patterns in production, here is the framework I actually use when someone asks whether a new feature should be RAG or long context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Size
&lt;/h3&gt;

&lt;p&gt;If your total data fits in 500,000 tokens, long context is on the table. If it fits in 200,000 tokens, long context is often the better default. Over 1 million tokens, RAG is probably mandatory. Over 10 million tokens, it is definitely mandatory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stability
&lt;/h3&gt;

&lt;p&gt;If your data is the same for every user and changes rarely, long context plus prompt caching is a strong default. If your data is per-user, per-tenant, or per-session, caching breaks and RAG becomes more attractive. If your data changes multiple times per day, the operational cost of re-indexing can still tilt the scales toward long context even for medium-sized corpora.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency requirements
&lt;/h3&gt;

&lt;p&gt;For user-facing chat where response speed matters, RAG is almost always faster. For batch processing, async workflows, or use cases where a 10-second response is acceptable, long context is fine. If you are building something that plugs into a &lt;a href="https://dev.to/blog/durable-ai-workflows-orchestration-2026"&gt;durable workflow engine&lt;/a&gt; anyway, the latency hit of long context may not matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query pattern
&lt;/h3&gt;

&lt;p&gt;If most queries are narrow lookups with clear keywords, RAG works great. If queries often require synthesizing across multiple documents or making inferences that span the corpus, long context usually produces better answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Citation requirements
&lt;/h3&gt;

&lt;p&gt;If you need to cite specific sources in every response, RAG is the cleaner path. If citations are nice-to-have or the response just needs to be useful, long context is fine.&lt;/p&gt;
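
&lt;p&gt;Condensed into code, the whole framework looks roughly like this. The thresholds are the rules of thumb from the sections above, not hard limits:&lt;/p&gt;

```python
def choose_architecture(corpus_tokens, per_user_data, needs_fast_chat,
                        needs_citations, synthesis_heavy):
    # Treat these cutoffs as defaults to re-benchmark on your own workload.
    if corpus_tokens > 1_000_000:
        return "rag"            # too big to stuff; retrieval is mandatory
    if per_user_data:
        return "rag"            # cache never warms, biggest cost win is gone
    if needs_fast_chat or needs_citations:
        return "rag"            # latency and attribution both favor RAG
    if corpus_tokens > 500_000:
        return "rag"
    if synthesis_heavy:
        return "long-context"   # cross-document reasoning in one prompt
    if corpus_tokens > 200_000:
        return "rag"            # narrow lookups on a mid-size corpus
    return "long-context"

# The two case studies from this post:
print(choose_architecture(180_000, False, False, False, True))  # long-context
print(choose_architecture(400_000, False, True, True, False))   # rag
```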




&lt;h2&gt;
  
  
  What I Actually Built
&lt;/h2&gt;

&lt;p&gt;Let me get concrete about the two features I mentioned at the top, because they map cleanly onto this framework and they are the exact kinds of decisions you will make on your own features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support ticket triage: long context won
&lt;/h3&gt;

&lt;p&gt;The support docs were about 180,000 tokens total. They changed maybe once a week when the docs team pushed an update. Every user asked similar questions against the same doc set. Most tickets required piecing together information from three or four docs. Users did not care if a response took 8 seconds instead of 3, because they had already submitted a ticket and were waiting for a reply by email.&lt;/p&gt;

&lt;p&gt;This is long-context paradise. Data size is small enough to fit comfortably in a prompt. Stability is high so caching works. Latency requirements are loose. Queries require synthesis, which is where long context beats retrieval. Citations are nice but not required.&lt;/p&gt;

&lt;p&gt;Switching to long context dropped maintenance overhead to near zero. Re-indexing was eliminated. Chunk-size tuning was eliminated. The retriever, which had been my biggest source of bugs, was just gone. Quality went up because the model could pull context from anywhere in the doc set instead of being stuck with whatever the retriever handed it. Costs dropped once caching was active because the system prompt had a 90 percent cache hit rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codebase assistant: RAG won, and it was not close
&lt;/h3&gt;

&lt;p&gt;The codebase was around 400,000 tokens, which still fits in a context window. The problem was that code is not like docs. Users asked questions like "how does billing work" or "where is the webhook handler" that require finding specific files. Response time mattered because developers lose patience fast when their tools are slow. And the whole corpus was not stable. Half a dozen files changed per day, which meant the cache was permanently cold for anything touching active development.&lt;/p&gt;

&lt;p&gt;I tried long context anyway because it worked so well the first time. The results were miserable. Quality was lower because the model would drift into tangentially related code instead of answering the specific question. Latency hit 15 seconds on queries that had been 3 seconds under RAG. Costs tripled because the cache was never warm. And I could not cite file paths cleanly because the model would describe code instead of pointing at it.&lt;/p&gt;

&lt;p&gt;I rebuilt RAG with a better retriever, added a reranker, and shipped the whole thing at roughly 3-second response time with precise file citations. The RAG version is still in production. I have not tried to replace it again.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hybrid Pattern That Actually Ships
&lt;/h2&gt;

&lt;p&gt;The honest production pattern in 2026 is not "pick one." It is "use RAG to narrow the search space, then use long context to let the model reason across what you retrieved."&lt;/p&gt;

&lt;p&gt;Instead of retrieving 4,000 tokens and hoping they contain the answer, you retrieve 100,000 tokens and let the model sort through them. Instead of a dozen painfully chunked paragraphs, you retrieve 10 full documents at their natural boundaries. Instead of a top-1 match that might be wrong, you retrieve a top-20 and let the model figure out which ones matter.&lt;/p&gt;

&lt;p&gt;This pattern gets the best of both. You keep the cost advantage of not paying for your entire corpus on every query. You keep the latency advantage because you are still sending 100,000 tokens instead of a million. And you get the quality advantage of long context because the model has enough room to actually think about the problem instead of being handed the bare minimum.&lt;/p&gt;

&lt;p&gt;The catch is that the retriever matters less than it used to. Getting the top-3 exactly right was critical when the prompt could only hold the top-3. Getting a reasonable top-20 is much easier than getting a perfect top-3, and the model will do the final filtering for you. That shifts the ops story too. You tune your retriever for recall, not precision, and you let the model handle the rest.&lt;/p&gt;
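
&lt;p&gt;As a concrete sketch, here is what recall-first retrieval can look like in TypeScript. The scoring function is a toy stand-in for a real embedding search, and the names are mine, not from any particular library:&lt;/p&gt;

```typescript
// A toy recall-first retriever: grab a wide top-k of whole documents and
// hand them all to the model. score() stands in for embedding similarity.
type Doc = { path: string; text: string };

function score(query: string, doc: Doc): number {
  const terms = query.toLowerCase().split(/\s+/);
  const body = doc.text.toLowerCase();
  return terms.filter((t) => body.includes(t)).length;
}

function buildHybridPrompt(query: string, corpus: Doc[], topK = 20): string {
  const picked = corpus
    .map((doc) => ({ doc, s: score(query, doc) }))
    .filter((x) => x.s > 0)          // recall-first: keep anything plausible
    .sort((a, b) => b.s - a.s)
    .slice(0, topK)
    .map((x) => x.doc);

  // Full documents at natural boundaries; the model does the final filtering.
  const context = picked
    .map((d) => `--- ${d.path} ---\n${d.text}`)
    .join("\n\n");
  return `${context}\n\nQuestion: ${query}`;
}
```

The deliberate looseness is the point: a sloppy top-20 plus a capable model beats a finely tuned top-3 on most synthesis queries.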

&lt;p&gt;Most of the production AI features I have built in the last six months use this pattern. RAG for the first filter, long context for the reasoning. It is not as clean as "we use Pinecone" or "we just stuff the window," but it reflects what actually works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt Caching Is The Quiet Unlock
&lt;/h2&gt;

&lt;p&gt;I want to be blunt about something. None of the long context math works without prompt caching, and a lot of developers still treat caching as an optimization they will get around to later. Treat it as mandatory.&lt;/p&gt;

&lt;p&gt;Without caching, a 200,000 token prompt at Claude Opus 4.7 input rates runs about 60 cents per query. At 10,000 queries per day, that is 6,000 dollars per day, which is an insane number for a support bot. With caching on the stable system prompt, the same workflow drops to around 6 to 8 cents per query once the cache is warm. The math only works at the second number.&lt;/p&gt;
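
&lt;p&gt;The arithmetic is worth writing down, because the blended number depends heavily on your actual hit rate. A quick sketch, with illustrative prices rather than real provider rates:&lt;/p&gt;

```typescript
// Illustrative cost math, not real provider pricing.
function dailyCost(perQueryUsd: number, queriesPerDay: number): number {
  return perQueryUsd * queriesPerDay;
}

// Effective per-query cost once you blend cache hits and misses.
function effectivePerQuery(
  uncachedUsd: number,
  cachedUsd: number,
  hitRate: number
): number {
  return hitRate * cachedUsd + (1 - hitRate) * uncachedUsd;
}
```

At 60 cents uncached, even a modest miss rate dominates the blend, which is why the hit rate is the number to measure before trusting the cost projections.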

&lt;p&gt;The catch is that caching has rules and developers keep getting them wrong. The prompt prefix must be byte-for-byte identical across requests. You place cache breakpoints explicitly. You reset the cache TTL on every hit so it stays warm under steady traffic. You structure your prompt with the long stable portion first and the short per-query portion last. If any of these are off, your cache hit rate drops to near zero and your beautiful long-context architecture becomes a money fire.&lt;/p&gt;
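
&lt;p&gt;In code, the ordering rule looks something like this. The shape below follows Anthropic's prompt caching API (system blocks with a &lt;code&gt;cache_control&lt;/code&gt; breakpoint), but verify the exact field names against your provider's docs before relying on them:&lt;/p&gt;

```typescript
// Stable portion first, marked cacheable; per-query portion last.
// The cache_control shape follows Anthropic's prompt caching API; treat
// the exact fields as an assumption to check against the provider docs.
type Block = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

function buildCachedRequest(stableCorpus: string, userQuery: string) {
  const system: Block[] = [
    // Everything up to and including this block is eligible for caching,
    // so it must be byte-for-byte identical across requests.
    { type: "text", text: stableCorpus, cache_control: { type: "ephemeral" } },
  ];
  // The short, changing part goes last so it never breaks the cached prefix.
  const messages = [{ role: "user", content: userQuery }];
  return { system, messages };
}
```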

&lt;p&gt;This is also why long context works great for feature-level prompts with shared context and terribly for use cases where every prompt is different. Multi-tenant apps where every user has their own data are not a good fit because you need a separate cache entry per tenant. Single-tenant features or features with a shared doc set are the sweet spot.&lt;/p&gt;

&lt;p&gt;If you are weighing long context and you are not sure whether caching will work for your use case, assume the answer is no until you have read the provider's caching docs and tested the hit rate. &lt;a href="https://dev.to/blog/context-engineering-ai-coding-2026"&gt;Context engineering&lt;/a&gt; is real work and caching is a big part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Neither Is Enough
&lt;/h2&gt;

&lt;p&gt;There is one more category worth naming because I see developers reach for the wrong tool in it all the time. If your use case involves an agent that needs to perform multi-step reasoning, call tools, and maintain state across turns, neither RAG nor long context is your main problem. &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;AI agent memory and state persistence&lt;/a&gt; is a separate concern that neither approach solves on its own.&lt;/p&gt;

&lt;p&gt;An agent that calls five tools, pulls back results, reasons about them, and calls more tools does not have a retrieval problem. It has a working-memory problem. You can combine either RAG or long context with agent memory, but swapping one retrieval pattern for another will not fix an agent that is losing track of what it decided three turns ago.&lt;/p&gt;

&lt;p&gt;If you find yourself asking whether to use RAG or long context for an agent, the question before that one is probably how you are managing the agent's state, and the retrieval choice will fall out of that answer rather than driving it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What To Build This Week
&lt;/h2&gt;

&lt;p&gt;If you already have RAG in production and it works, do not rip it out. The temptation is real because the long context numbers look sexy and the simplification story is appealing. But "it works" is worth a lot, and long context is not free to adopt correctly. The right time to move is when you are already making structural changes, not on a random Tuesday.&lt;/p&gt;

&lt;p&gt;If you are starting a new feature that involves feeding documents to an LLM, default to long context first. Measure the cost with caching enabled. Measure the latency on your target hardware. If the numbers work, you have saved yourself a month of retrieval plumbing. If they do not, you know exactly what constraints are pushing you toward RAG and you can design the pipeline around those specific needs instead of building something generic.&lt;/p&gt;

&lt;p&gt;If you are somewhere in between, the hybrid pattern is usually the right answer. Retrieve a wide net, let the model reason over it, and iterate from there. This is the pattern that handles the most use cases with the least architectural commitment, and it is the one I reach for first when someone asks me to scope a new AI feature.&lt;/p&gt;

&lt;p&gt;The short version is that long context did not kill RAG. It changed what RAG is for. RAG used to be about fitting anything into a tiny window. Now it is about deciding which chunk of a very large corpus is worth paying attention to, and letting the model handle the rest. The decision is more nuanced than it was three years ago, which is mostly good news. You have more options, and the right option is less often "whatever we built in 2023 and never revisited."&lt;/p&gt;

&lt;p&gt;The one decision you should not make is to ignore the question. Running RAG pipelines you no longer need is a tax. Stuffing windows when you should be retrieving is a different tax. Go look at your current AI features, pick the one that has grown the most awkward, and ask which of these two patterns would handle it better today. The answer might be what you already have. Often it is not.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Durable AI Workflows in 2026: Why Your Next AI Feature Needs Orchestration</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:47:28 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/durable-ai-workflows-in-2026-why-your-next-ai-feature-needs-orchestration-1djd</link>
      <guid>https://forem.com/alexcloudstar/durable-ai-workflows-in-2026-why-your-next-ai-feature-needs-orchestration-1djd</guid>
      <description>&lt;p&gt;I shipped an AI feature last fall that took an input document, called a large language model to extract structured data, called a second model to validate it, posted the results to a webhook, and then emailed the user. The whole thing took between 40 seconds and 3 minutes depending on the document size.&lt;/p&gt;

&lt;p&gt;It worked perfectly in testing. It worked for the first hundred users in production. Then a network hiccup took out the LLM provider for 90 seconds during a busy afternoon, and I discovered the hard way that I had built a very expensive way to lose data.&lt;/p&gt;

&lt;p&gt;My serverless function timed out. The retry was another full run from scratch, which hit the LLM a second time for tokens I had already paid for. Users saw errors. Some of them got two emails. A few of them got neither because the second run failed at a different step and the retry count hit zero.&lt;/p&gt;

&lt;p&gt;I spent the next weekend rewriting the whole thing on top of a durable workflow engine. The problem was not that I had bad code. The problem was that I was using request-response infrastructure to run a multi-step, long-running, stateful process. That is not what serverless functions are for, and pretending it is leads to exactly the kind of failure I walked into.&lt;/p&gt;

&lt;p&gt;This post is the guide I wish I had before I shipped that feature. It covers what durable workflows are, why AI features need them more than almost any other category of work, and how to choose between Inngest, Trigger.dev, and Vercel Workflow in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Breaks When AI Meets Serverless
&lt;/h2&gt;

&lt;p&gt;The default pattern for shipping a feature in 2026 looks something like this: an app built on Next.js or a similar framework, an API route that handles a request, some business logic, maybe a database call, and a response. This pattern is fast, cheap, and covers 90 percent of what most web apps do.&lt;/p&gt;

&lt;p&gt;It also breaks in predictable ways when AI gets involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeouts.&lt;/strong&gt; LLM calls are slow. A single Claude or GPT call is typically a few seconds. A chain of them can take minutes. Vercel raised the default function timeout to 300 seconds in 2025, which helps, but a multi-step agent can easily exceed that. If your function times out mid-run, you lose the work in progress, while any external side effects you already triggered remain in place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries.&lt;/strong&gt; When an LLM provider has an outage or rate limits you, you need to retry. Naive retries cause duplicate emails, duplicate database writes, and duplicate bills. Smart retries require keeping track of which steps have already succeeded so you can resume from where you left off instead of starting over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Every retry on an LLM call costs real money. A workflow that reruns from scratch on every failure can 2x or 3x your AI costs during a bad day with a provider. For features where each run is cheap this is tolerable. For agentic workflows that use 50,000 tokens per run, it is a budget problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability.&lt;/strong&gt; When a multi-step AI workflow fails, you need to know which step failed, with what input, and with what output from the previous steps. Tracing this in a standard logging setup is painful. You end up grepping logs across multiple function invocations, trying to correlate request IDs that may not even exist on retries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency.&lt;/strong&gt; If a user kicks off ten AI workflows at once, you want to throttle them so you do not blow up your rate limits with your LLM provider. Standard serverless functions have no built-in way to do this without building your own queue, and &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization in production&lt;/a&gt; depends on getting this right.&lt;/p&gt;

&lt;p&gt;These are not edge cases. They are the default failure modes for any AI feature that does more than a single one-shot completion. The moment you chain two LLM calls together, or mix an LLM call with an external API, or run something that takes longer than a normal HTTP request, you are in workflow territory whether you planned for it or not.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Durable Workflows Actually Are
&lt;/h2&gt;

&lt;p&gt;The term "durable workflow" sounds like enterprise jargon, but the idea is simple.&lt;/p&gt;

&lt;p&gt;A durable workflow is a function where each step is checkpointed. When a step succeeds, the result is persisted. If the workflow fails partway through, the engine resumes from the last successful step instead of starting over. The function can take minutes, hours, or days to complete. It can pause to wait for external events. It can sleep for a week and then resume. All of this is handled by the engine, not by you.&lt;/p&gt;

&lt;p&gt;The programming model looks almost identical to normal async code. You write a function with steps. Each step is a regular async operation. The engine wraps each step to persist its result and provide the persisted result on replay if the step has already run.&lt;/p&gt;
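
&lt;p&gt;The mechanics are easier to see in a toy version. This is not any particular engine's API, just the shape of the idea: results are persisted by step name, and a replayed step returns its stored result instead of running again.&lt;/p&gt;

```typescript
// Toy durable step: persist each step's result; on replay, skip work that
// already succeeded. Real engines persist to durable storage, not memory.
function makeStep(store: { [name: string]: unknown }) {
  return async (name: string, fn: () => unknown) => {
    if (name in store) return store[name]; // replay: reuse the checkpoint
    const result = await fn();             // first run: do the work
    store[name] = result;                  // checkpoint before continuing
    return result;
  };
}

// A two-step workflow where the expensive first step is never repeated.
let llmCalls = 0;
const store: { [name: string]: unknown } = {};
const step = makeStep(store);

async function workflow(failAfterExtract: boolean) {
  const extracted = await step("extract", () => {
    llmCalls++;                            // stand-in for a paid LLM call
    return "structured-data";
  });
  if (failAfterExtract) throw new Error("provider outage");
  return step("email", () => `sent: ${extracted}`);
}
```

Run the workflow once with a failure after the first step, then rerun it: the rerun completes without invoking the expensive step a second time, which is the whole value proposition.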

&lt;p&gt;The magic is that failures become survivable. A network blip in step 3 of a 5 step workflow does not lose the work from steps 1 and 2. A provider outage does not double bill you. A deploy in the middle of a running workflow does not drop it on the floor. These are not optimizations. They are the baseline behavior.&lt;/p&gt;

&lt;p&gt;This is the model Temporal popularized in the enterprise. What changed in 2026 is that the pattern finally became accessible to indie developers and small teams, with tools that work natively with Next.js, serverless functions, and modern TypeScript stacks. You no longer need dedicated worker infrastructure to run durable workflows. You can run them on the same platform as the rest of your app.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inngest: The Mature Choice
&lt;/h2&gt;

&lt;p&gt;Inngest has been in the durable workflow space longer than most of the current competitors. It is a hosted service with a TypeScript SDK that defines workflows as functions with steps, using a familiar async pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The developer experience is polished. Defining a workflow looks like writing a regular async function with a few wrapper calls. You call &lt;code&gt;step.run&lt;/code&gt; for operations that should be checkpointed, &lt;code&gt;step.sleep&lt;/code&gt; for delays, and &lt;code&gt;step.waitForEvent&lt;/code&gt; for waiting on external triggers. There is no special syntax to learn and the types are strong.&lt;/p&gt;

&lt;p&gt;Event-driven triggers are a first class concept. Instead of calling a workflow directly, you emit an event, and Inngest decides which workflows should run based on event matching rules. This is the right pattern for anything that involves user actions triggering background work, and it composes cleanly as your app grows.&lt;/p&gt;

&lt;p&gt;The local development story is good. Inngest has a local dev server that mirrors production behavior, so you can iterate on workflows without deploying. The dashboard shows you every run, every step, every input, every output. When something goes wrong, you can see exactly what happened and often just click to replay from a failed step.&lt;/p&gt;

&lt;p&gt;Concurrency and rate limiting are built in. You can limit a workflow to process at most 5 runs concurrently per user, or throttle invocations to 10 per second per integration, or back off exponentially on retry. For AI features that need to stay under LLM rate limits, this is the feature you did not know you needed until you shipped without it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The hosted pricing can get expensive for high-volume workflows. Inngest charges based on step executions and concurrency, and both scale with how chatty your workflows are. For a workflow that checkpoints a lot of small steps, the bill adds up.&lt;/p&gt;

&lt;p&gt;Self-hosting is possible but more involved than the managed service suggests. If you want to run Inngest on your own infrastructure to control costs or compliance, expect to spend time on the deployment.&lt;/p&gt;

&lt;p&gt;The abstraction is opinionated about event-driven triggers. If your mental model is "call this workflow now and wait for the result," Inngest supports it but the ergonomics lean toward async event-driven patterns. This is usually the right pattern, but it can feel foreign if you are coming from a simpler background job queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Inngest is the right choice if you are building an event-driven system, care about first class concurrency controls, and want a polished managed service. It is also the choice with the longest track record, so if you are risk averse, it is the safe pick.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trigger.dev: The Open Source Friendly Pick
&lt;/h2&gt;

&lt;p&gt;Trigger.dev took a different path. It is open source, self hostable from day one, and focuses on making background jobs and workflows accessible with a minimum of ceremony. Version 3, which is the version you should be using in 2026, is a full rewrite that added durable execution and significantly improved the developer experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The setup is the fastest of the three tools I tested. You install the SDK, define a task with a simple decorator pattern, and it is ready to run. For quick prototyping or for developers who want to minimize the conceptual overhead of adopting a new tool, Trigger.dev is the lightest lift.&lt;/p&gt;

&lt;p&gt;The self-hosting story is first class. The open source version of Trigger.dev runs as a Docker container and has feature parity with the managed cloud product. For teams that need to own their infrastructure for compliance or cost reasons, this is a significant advantage over the more managed-first alternatives.&lt;/p&gt;

&lt;p&gt;The dashboard is genuinely nice. You get a live view of running tasks, a history of past runs, the ability to replay from any step, and polished tooling for debugging failed runs. For AI workflows specifically, being able to see exactly what each LLM call received and returned is invaluable when you are tracking down a bad completion.&lt;/p&gt;

&lt;p&gt;The SDK handles common AI patterns well. There is built in support for streaming responses, long running inference calls, and checkpointing expensive LLM outputs so you do not rerun them on retry. This is the kind of domain-specific polish that separates a tool that was designed for AI from a tool that merely works for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The platform is younger than Inngest. Some advanced features like sophisticated event matching, complex concurrency policies, and multi-tenant controls are either newer or still in development. For a simple AI workflow this does not matter. For a complex multi-tenant SaaS with intricate routing needs, it might.&lt;/p&gt;

&lt;p&gt;The managed cloud pricing is competitive but the tool is still finding its positioning. I have seen pricing adjustments several times in the last year, which is normal for a product at this stage but worth being aware of if you are trying to budget.&lt;/p&gt;

&lt;p&gt;The ecosystem around triggers and integrations is smaller than Inngest's. Inngest has invested heavily in pre-built integrations with common services. Trigger.dev leans on you to wire up the integrations yourself, which is fine but slightly more work.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Trigger.dev is the right choice if you value open source, want the fastest possible setup, need to self host, or want a tool that was designed with AI workloads in mind from the start. It is especially strong for indie developers building &lt;a href="https://dev.to/blog/one-person-startup-scaling-2026"&gt;one person startups&lt;/a&gt; who want to control their infrastructure without managing it full time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vercel Workflow: The Native Vercel Pick
&lt;/h2&gt;

&lt;p&gt;Vercel Workflow, sometimes called Vercel Workflow DevKit or WDK, is Vercel's answer to the durable workflow problem. It launched in 2025 and matured throughout 2026 as part of Vercel's broader push to own more of the backend runtime. It runs on Fluid Compute, integrates with the rest of the Vercel platform, and requires no separate infrastructure if you are already deploying on Vercel.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The integration with the Vercel platform is seamless. If your app is already on Vercel, adding a workflow is a matter of creating a new function file with the workflow pattern. No separate service, no additional dashboard, no new billing relationship. Everything shows up in your existing Vercel project.&lt;/p&gt;

&lt;p&gt;The programming model is clean. You write a workflow as a regular async function, mark steps that should be checkpointed, and the runtime handles persistence. The API feels like a natural extension of Next.js rather than an external tool bolted on.&lt;/p&gt;

&lt;p&gt;Cost efficiency is genuinely different. Because Vercel Workflow runs on Fluid Compute, you get the benefits of &lt;a href="https://dev.to/blog/ai-sdk-v6-developer-guide-2026"&gt;function instance reuse and active CPU pricing&lt;/a&gt;. For AI workflows that spend most of their time waiting on LLM responses, you are not paying for idle time the way you would with traditional serverless invocation counts.&lt;/p&gt;

&lt;p&gt;The observability tie-in is strong. Workflow runs show up in the Vercel dashboard alongside your deployments, logs, and other platform metrics. When a workflow fails, you can trace it back to the specific deployment, look at the runtime logs, and see the preview environment context all in one place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;It only works on Vercel. This is the obvious limitation and it is not going to change. If you are on AWS, Render, Fly, Cloudflare, or self hosted, Vercel Workflow is not available.&lt;/p&gt;

&lt;p&gt;It is newer than the alternatives. Inngest and Trigger.dev have years of production usage across thousands of applications. Vercel Workflow is production-ready but has less battle-tested coverage of edge cases. For straightforward AI workflows this is fine. For complex orchestration with unusual patterns, you may run into rough edges.&lt;/p&gt;

&lt;p&gt;The ecosystem of patterns, examples, and integrations is smaller. Inngest and Trigger.dev both have mature libraries of patterns for common use cases. Vercel Workflow is catching up but you will sometimes end up implementing things from first principles.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to pick it
&lt;/h3&gt;

&lt;p&gt;Vercel Workflow is the right choice if you are already on Vercel and want the tightest possible integration with your existing stack. For AI features that are part of a larger Next.js app, the zero-configuration setup and platform-native observability are hard to beat.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;After running all three on real projects for the last few months, here is the framework I use to decide which one to reach for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you on Vercel and shipping Next.js?&lt;/strong&gt; Start with Vercel Workflow. The integration is seamless and the setup cost is effectively zero. If you hit a limitation, switching to one of the others is always an option, but most AI features do not hit those limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need to self host?&lt;/strong&gt; Trigger.dev is the pick. Inngest can be self hosted but the experience is more involved. Vercel Workflow is not an option off platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is your workflow fundamentally event-driven?&lt;/strong&gt; Inngest is the pick. The event routing and matching features are first class in a way the others are not. For systems where many different triggers can kick off related workflows, Inngest's model is the cleanest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you optimizing for the fastest possible setup?&lt;/strong&gt; Trigger.dev is the pick. The cognitive overhead is the lowest of the three, and for a solo developer trying to ship an AI feature quickly, this matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you care about long term track record and maturity?&lt;/strong&gt; Inngest is the pick. It has been at this the longest and has the largest set of real-world production deployments to learn from.&lt;/p&gt;

&lt;p&gt;For most of my current projects, I end up running Vercel Workflow for the AI features that live inside a Vercel-hosted app, and Trigger.dev for anything that needs to run off platform or where I want to control my own infrastructure. I have stopped reaching for Inngest on new projects mostly because the pricing for the kind of chatty workflows I write adds up faster than it does with the alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Patterns for AI Workflows
&lt;/h2&gt;

&lt;p&gt;Here are a few patterns I have learned the hard way. They apply regardless of which tool you pick.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint LLM calls aggressively.&lt;/strong&gt; Every LLM call should be its own checkpointed step. If the call succeeds, you never want to run it again, because it costs money and the output is not deterministic anyway. Every durable workflow engine handles this well if you mark the step correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Store the raw LLM output, not just the parsed version.&lt;/strong&gt; When an LLM call succeeds but the parsing fails, you want to be able to fix the parser and replay without rerunning the LLM. This requires persisting the raw completion, not just the structured result you extracted from it.&lt;/p&gt;
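
&lt;p&gt;A sketch of what that looks like. &lt;code&gt;callModel&lt;/code&gt; is a hypothetical stand-in for a real LLM call; the point is that the checkpoint holds the raw completion and the parser runs outside it:&lt;/p&gt;

```typescript
// Checkpoint the raw completion, parse outside the checkpoint. A parser
// fix can then replay against stored text without paying for the LLM again.
const rawStore: { [key: string]: string } = {};
let modelCalls = 0;

// Hypothetical stand-in for a real LLM call.
async function callModel(prompt: string) {
  modelCalls++;
  return '{"total": 42}';
}

async function extractTotal(prompt: string, parse: (raw: string) => number) {
  if (!(prompt in rawStore)) {
    rawStore[prompt] = await callModel(prompt); // persisted exactly once
  }
  return parse(rawStore[prompt]); // parsing is replayable and free
}
```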

&lt;p&gt;&lt;strong&gt;Use the workflow engine's native rate limiting.&lt;/strong&gt; Do not build your own throttling layer on top of a workflow engine. Every tool I have covered has built in primitives for this. Use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design steps for idempotency.&lt;/strong&gt; Even with durable workflows, steps can retry. If a step sends an email, sends a webhook, or charges a card, make sure running it twice has the same effect as running it once. Idempotency keys, deduplication tokens, and "has this been done already" checks all matter.&lt;/p&gt;
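
&lt;p&gt;The shape of that check is simple enough to sketch. The in-memory set below stands in for what would be a database table with a unique constraint in production:&lt;/p&gt;

```typescript
// Idempotent side effect: a dedup key makes a retried step a no-op.
// In production the "sent" set would be a unique-keyed database table.
const sent = new Set();
let deliveries = 0;

function sendEmailOnce(idempotencyKey: string, to: string): string {
  if (sent.has(idempotencyKey)) return "duplicate-skipped";
  sent.add(idempotencyKey);
  deliveries++; // stand-in for the real email provider call
  return `delivered to ${to}`;
}
```

Derive the key from the workflow run and step identity, so a retry of the same step deduplicates while a genuinely new run still sends.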

&lt;p&gt;&lt;strong&gt;Keep step inputs small.&lt;/strong&gt; Every step's inputs get persisted. If you pass a large payload to a step, you are paying to serialize, store, and deserialize that payload on every retry. Pass references to stored data rather than the data itself when possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log the prompts and the responses.&lt;/strong&gt; For debugging AI workflows, the prompt-response pair is the source of truth. Log both, correlate them to the workflow run, and make sure you can replay any failed step with the exact same prompt that caused the failure. &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability&lt;/a&gt; is the companion discipline that makes durable workflows debuggable in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Bottom Line
&lt;/h2&gt;

&lt;p&gt;If you are shipping an AI feature that does more than a single one-shot completion, you need a durable workflow engine. The alternative is not "simpler code." The alternative is a production incident that you will write a blog post about, and the blog post will be shaped a lot like this one.&lt;/p&gt;

&lt;p&gt;Inngest is mature and event-driven. Trigger.dev is open source and fast to adopt. Vercel Workflow is native to Vercel and uses Fluid Compute to keep costs down on long running AI workloads. All three are production ready and all three solve the core problem of multi-step, long-running, stateful AI work.&lt;/p&gt;

&lt;p&gt;The wrong answer is to keep running AI workflows on plain serverless functions and hope that your users never hit a provider outage. The provider outage is coming. The only question is whether your code is ready for it.&lt;/p&gt;

&lt;p&gt;I ended up migrating the feature that ate my weekend to a durable workflow in a single afternoon. The rewrite was smaller than the original implementation because most of the retry logic and state tracking I had built by hand got replaced by the engine. Six months later the feature has weathered three LLM provider incidents without dropping a single run. That is the whole pitch.&lt;/p&gt;

&lt;p&gt;Pick a tool. Migrate your AI workflows. Get your weekends back.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>saas</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI Code Review Tools in 2026: CodeRabbit vs Greptile vs Vercel Agent</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:46:55 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/ai-code-review-tools-in-2026-coderabbit-vs-greptile-vs-vercel-agent-jdc</link>
      <guid>https://forem.com/alexcloudstar/ai-code-review-tools-in-2026-coderabbit-vs-greptile-vs-vercel-agent-jdc</guid>
      <description>&lt;p&gt;I merged a pull request last month that introduced a race condition in a background worker. Two reviewers had approved it. The tests passed. The staging environment looked fine. The bug surfaced three days later when traffic picked up on a Monday morning, and I spent most of that day unwinding state that had been corrupted across several thousand rows.&lt;/p&gt;

&lt;p&gt;The kicker was that I had an AI code reviewer enabled on the repo. It had flagged exactly the pattern that caused the incident, buried in a list of twelve other comments that were mostly noise. I had trained myself to skim past its output because most of what it said was wrong or pedantic. The one time it was right, I missed it.&lt;/p&gt;

&lt;p&gt;That experience sent me down a rabbit hole. I spent the next six weeks running CodeRabbit, Greptile, and Vercel Agent side by side on three different codebases: a Next.js SaaS, a Bun-based API, and a messy TypeScript monorepo. I wanted to know which one actually catches real bugs without burying them under style nits, and which one is worth paying for when you are a solo developer or a small team.&lt;/p&gt;

&lt;p&gt;Here is what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI Code Review Became Table Stakes in 2026
&lt;/h2&gt;

&lt;p&gt;The shift happened faster than I expected. Two years ago, AI code review was a curiosity. Tools like CodeRabbit existed but felt more like linters with LLM sprinkles. By mid 2026, roughly 60 percent of teams with a CI pipeline run some form of automated AI review on every pull request. For solo developers and small teams, adoption is even higher.&lt;/p&gt;

&lt;p&gt;The driver is not hype. It is math. If &lt;a href="https://dev.to/blog/ai-generated-code-technical-debt-2026"&gt;51 percent of GitHub commits are now AI assisted&lt;/a&gt; and bug density in AI generated code runs 35 to 40 percent higher in error paths and boundary conditions, human review alone cannot keep up. You either add more reviewers, which solo developers cannot do, or you add a second set of eyes that scales with commit volume instead of headcount.&lt;/p&gt;

&lt;p&gt;That is the job AI code review is actually doing in 2026. It is not replacing senior engineers. It is catching the boring stuff so human review can focus on architecture, product decisions, and the subtle bugs that require context a tool does not have.&lt;/p&gt;

&lt;p&gt;The question is which tool actually does that job well.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Tools That Matter
&lt;/h2&gt;

&lt;p&gt;There are a dozen AI code review products on the market right now. Most of them are thin wrappers around GPT-4 or Claude with a webhook receiver and a Stripe integration. Three are worth taking seriously because they have either market share, technical differentiation, or native platform integration that the others lack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CodeRabbit&lt;/strong&gt; is the incumbent. It launched in 2023, has the largest install base, and works on every major code host. If you walk into a random startup that has AI review set up, there is a two out of three chance it is CodeRabbit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greptile&lt;/strong&gt; is the technical favorite. It builds a graph of your codebase and uses that to reason about how changes ripple through the system. Developers who care about review quality over breadth of features tend to end up here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vercel Agent&lt;/strong&gt; is the newcomer. It is part of Vercel's broader push to own the development loop on their platform, and it leans heavily on context about your deployments, runtime logs, and infrastructure to inform reviews. It is in public beta as of early 2026 but improving quickly.&lt;/p&gt;

&lt;p&gt;I ran all three on the same three repos, on the same pull requests, for six weeks. Here is how each one performed.&lt;/p&gt;




&lt;h2&gt;
  
  
  CodeRabbit: The Market Leader
&lt;/h2&gt;

&lt;p&gt;CodeRabbit is the tool most developers have tried and the one most teams are actively using. It integrates with GitHub, GitLab, Bitbucket, and Azure DevOps. It posts inline comments on pull requests, offers a summary of changes, and lets you chat back to clarify or push back on its suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;Setup takes about three minutes. You install the GitHub app, authorize it on the repos you want, and it starts reviewing. No configuration required. The default behavior is sensible and you can tune it later if you want.&lt;/p&gt;

&lt;p&gt;The pull request summaries are genuinely useful. For any PR over a hundred lines, having a TLDR at the top of the thread saves real time during review. I have a bad habit of submitting PRs with sparse descriptions, and CodeRabbit's summary often ends up being a better description than what I wrote.&lt;/p&gt;

&lt;p&gt;The chat feature is the thing I use most. Instead of leaving a comment and waiting for a human reviewer, I can ask CodeRabbit why it flagged something, ask for alternatives, or push back when it is wrong. This back-and-forth catches maybe one in five false positives and clarifies another one in five.&lt;/p&gt;

&lt;p&gt;Integration breadth is unmatched. It works with Linear, Jira, Notion, Slack, and several of the major CI providers. If you have an existing toolchain, CodeRabbit probably speaks it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;The noise problem is real. On a PR with thirty lines of changes, I routinely get eight to fifteen comments. Maybe two or three are genuinely useful. The rest range from "consider renaming this variable" to "this function could return early" to outright wrong suggestions that would break the code if applied.&lt;/p&gt;

&lt;p&gt;You can tune this with configuration, but the tuning is fiddly. The default verbosity is calibrated for teams that want lots of signals and are willing to filter. For solo developers who want fewer, higher quality comments, the defaults are exhausting.&lt;/p&gt;

&lt;p&gt;Context is shallow. CodeRabbit reads the diff and some of the surrounding files, but it does not build a deep model of your codebase. That means it misses bugs that require understanding how a change interacts with code elsewhere in the repo. The race condition I mentioned earlier is a category CodeRabbit is structurally weak at catching.&lt;/p&gt;

&lt;p&gt;Pricing gets aggressive fast. The free tier covers public repos. Paid plans start at 15 dollars per developer per month and scale with PR volume and lines of code. For a small team, the bill adds up quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;CodeRabbit is the best tool if you want broad coverage, fast setup, and integration with an existing toolchain. It is not the best if you want high signal to noise or deep code understanding. Use it for teams that value breadth; skip it if you want precision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Greptile: The Precision Pick
&lt;/h2&gt;

&lt;p&gt;Greptile takes a different architectural approach. Instead of reading the diff and some surrounding files, it indexes your entire codebase and builds a graph of how functions, modules, and types relate to each other. When you submit a PR, it uses that graph to reason about the change in context.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The bug catching is noticeably better. On the same pull requests I ran through CodeRabbit, Greptile caught issues that required understanding code outside the diff. A function signature change that broke a call site three files away. An async pattern that conflicted with how the caller was handling errors. A type narrowing assumption that held in one context but not another.&lt;/p&gt;

&lt;p&gt;Noise is dramatically lower. On a typical PR I get two to four comments. Almost all of them are worth reading. When Greptile flags something, I have trained myself to actually read it, which is the opposite of my experience with most AI reviewers.&lt;/p&gt;

&lt;p&gt;The summaries are precise rather than exhaustive. It does not try to describe everything the PR does. It focuses on the parts that have meaningful implications, including downstream effects that a human reviewer might miss on a first pass.&lt;/p&gt;

&lt;p&gt;Greptile also understands your codebase's conventions over time. After a few weeks on a repo, its suggestions start matching the style and patterns the team uses. CodeRabbit's suggestions feel more generic regardless of how long it has been running on your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;Setup is heavier. Indexing a large codebase takes time and costs compute, which is reflected in pricing. For a small repo, this is not an issue. For a monorepo with millions of lines, the initial indexing can take an hour or more.&lt;/p&gt;

&lt;p&gt;Integration breadth is narrower. Greptile works well with GitHub. GitLab support exists but feels secondary. Bitbucket and Azure DevOps are limited. If you are not on GitHub, CodeRabbit is a more comfortable fit.&lt;/p&gt;

&lt;p&gt;The chat back-and-forth is less polished. You can leave comments asking for clarification, but the conversational flow feels rougher than CodeRabbit's. This is improving but worth noting.&lt;/p&gt;

&lt;p&gt;Pricing is positioned at the higher end. Plans start around 30 dollars per developer per month. The value is real if you care about review quality, but it is not the budget option.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;Greptile is the best tool if you want precision over breadth. It catches bugs other tools miss, the noise level is manageable, and the codebase awareness compounds over time. Use it for teams that prioritize quality; skip it if integration breadth or price sensitivity matters more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vercel Agent: The Native Platform Pick
&lt;/h2&gt;

&lt;p&gt;Vercel Agent sits in a slightly different category. It is not just a code reviewer. It is part of Vercel's broader AI layer, which also includes production investigation, automated incident diagnosis, and deployment analysis. The code review feature uses context from your Vercel deployments, runtime logs, and preview environments to inform its suggestions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does well
&lt;/h3&gt;

&lt;p&gt;The production context is genuinely unique. When Vercel Agent reviews a PR, it knows about the preview deployment, which environment variables are set, what the runtime logs show during preview traffic, and whether any errors surfaced in the preview environment. No other AI reviewer has this data.&lt;/p&gt;

&lt;p&gt;This leads to categories of feedback the others cannot provide. Vercel Agent has flagged regressions in preview environments that were not obvious in the code diff. It has surfaced performance changes between commits based on actual deployment metrics. On one PR, it caught a cold start regression that would have been invisible to any tool that only reads the diff.&lt;/p&gt;

&lt;p&gt;Integration with the Vercel ecosystem is seamless. If you are already on Vercel, enabling Agent is a toggle. No app install, no webhook configuration, no separate dashboard. It shows up on your PRs and in your Vercel project overview.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/blog/ai-agent-observability-debugging-production-2026"&gt;AI agent observability&lt;/a&gt; angle is interesting. Agent's suggestions often include links to relevant logs, traces, or specific requests that triggered the behavior it is commenting on. That context shortens the time from "this looks like a bug" to "yes, here is exactly what broke."&lt;/p&gt;

&lt;h3&gt;
  
  
  Where it falls short
&lt;/h3&gt;

&lt;p&gt;It only works if you are on Vercel. This is the obvious limitation and it is not going away. If your production runs on Render, Fly, AWS, or anywhere else, Vercel Agent is not an option.&lt;/p&gt;

&lt;p&gt;It is still in public beta. The review quality is good but inconsistent. Some PRs get sharp, context-aware feedback. Others get generic comments that feel like any other AI reviewer. This is improving monthly but it is not yet as reliable as the mature tools.&lt;/p&gt;

&lt;p&gt;It optimizes for the Vercel runtime and patterns. If your codebase does weird things that deviate from typical Next.js or Vercel Function conventions, Agent can get confused or miss issues that a more agnostic tool would catch.&lt;/p&gt;

&lt;p&gt;Pricing is bundled into Vercel's usage-based model, which is both good and annoying depending on your perspective. You do not pay a separate per-developer fee, but your Vercel bill does absorb the cost of Agent's reviews and investigations. For heavy users, this is a meaningful line item.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verdict
&lt;/h3&gt;

&lt;p&gt;Vercel Agent is the best tool if you are already on Vercel and care about connecting code review to production behavior. It is not the best if you are on a different platform or if you need a tool that has been battle-tested at scale. Use it for Vercel-native teams that want the tightest possible dev loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Side by Side: Where Each Tool Wins
&lt;/h2&gt;

&lt;p&gt;Here is how the three stacked up across the dimensions I actually cared about after six weeks of daily use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug catching accuracy.&lt;/strong&gt; Greptile wins. It caught the most real bugs, with the fewest false positives, across all three codebases. Vercel Agent was close for anything involving runtime or deployment context. CodeRabbit trailed on precision but covered more surface area in total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal to noise ratio.&lt;/strong&gt; Greptile wins clearly. Its comment volume is low and its hit rate is high. CodeRabbit produces the most comments overall and has the worst noise ratio on default settings. Vercel Agent is in between and improving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup time.&lt;/strong&gt; CodeRabbit wins. Install the app, authorize, done. Greptile takes longer for the initial index. Vercel Agent is fastest if you are already on Vercel, and not an option at all if you are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration breadth.&lt;/strong&gt; CodeRabbit wins by a significant margin. Greptile covers the essentials. Vercel Agent only works on Vercel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production context.&lt;/strong&gt; Vercel Agent wins. No other tool has access to runtime data, deployment metrics, and preview environment logs. This is a category of value the others structurally cannot match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing.&lt;/strong&gt; CodeRabbit and Vercel Agent are comparable depending on usage. Greptile is the most expensive on a per-developer basis but cheaper when you account for the reviewer time it saves by producing less noise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which One Should You Actually Use
&lt;/h2&gt;

&lt;p&gt;If you are a solo developer on a tight budget and your repo is on GitHub, the honest answer is to start with CodeRabbit's free tier or Greptile's trial. CodeRabbit is easier to try and will give you a feel for what AI review does. Greptile is the upgrade if you find yourself ignoring most of CodeRabbit's output.&lt;/p&gt;

&lt;p&gt;If you are a small team of two to five engineers and review quality matters more than integration breadth, Greptile is the pick. The noise reduction alone is worth the higher per-developer cost, and the deep code understanding pays compounding dividends on a stable codebase.&lt;/p&gt;

&lt;p&gt;If you are already on Vercel and shipping Next.js or Vercel Functions as your core stack, add Vercel Agent on top of whatever else you are using. It catches a category of issues the others cannot, and the integration cost is effectively zero. Running Greptile and Vercel Agent together is actually the setup I settled on for my main SaaS project.&lt;/p&gt;

&lt;p&gt;If you are on AWS, Render, Fly, Cloudflare, or any non-Vercel platform, Vercel Agent is out. Choose between CodeRabbit and Greptile based on whether you value breadth or precision.&lt;/p&gt;

&lt;p&gt;Do not run all three on the same repo. The comment overlap creates exactly the noise problem you are trying to avoid. Pick one primary reviewer, maybe add a second if it covers a distinct axis like production context, and trust the signal you get from that setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI Code Review Does Not Replace
&lt;/h2&gt;

&lt;p&gt;One thing worth being blunt about. None of these tools replace human review on non-trivial changes. They catch common issues, surface obvious problems, and reduce the cognitive load of reading a large diff. They do not understand your product, your customers, or the decisions behind a feature.&lt;/p&gt;

&lt;p&gt;A tool can tell you that a function is inefficient. It cannot tell you that the feature itself is the wrong thing to build. A tool can catch a type error. It cannot tell you that the abstraction you are introducing will make the next three features harder to ship.&lt;/p&gt;

&lt;p&gt;That is the part human review still has to do, and it is the part that does not scale with AI. Treat these tools as a first pass that frees up human attention for the things that actually require human judgment. If you use them to replace all human review, you will ship faster for a few weeks and then hit the exact class of bugs that AI review cannot catch.&lt;/p&gt;

&lt;p&gt;The teams I have seen use these tools well treat them as infrastructure. You set them up, you let them do their job, and you reserve human review for the changes where human judgment actually matters. The teams I have seen struggle treat them as decision makers or try to automate away review entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up AI Code Review the Right Way
&lt;/h2&gt;

&lt;p&gt;A few practical lessons from six weeks of comparing these tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tune the verbosity on day one.&lt;/strong&gt; Every tool has a noise problem at default settings. Turn off style suggestions, turn off pedantic comments, and focus the tool on the categories of issue you actually want to catch. Correctness and security issues first, everything else second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create an ignore file for your conventions.&lt;/strong&gt; If your codebase has patterns the tool keeps flagging as issues, document them. CodeRabbit and Greptile both support repo-level configuration that teaches the tool what to stop complaining about. Ten minutes of setup here saves hours of ignored comments later.&lt;/p&gt;
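&lt;p&gt;As a sketch of what that looks like: both tools read a YAML file at the repo root. The key names below are illustrative of the general shape only, not the exact schema of either product, so check the current docs before copying anything.&lt;/p&gt;

```yaml
# Illustrative repo-level review config. Every key name here is an
# assumption about the general shape; consult your tool's docs for
# the real schema before using it.
reviews:
  profile: assertive          # ask for fewer, higher-signal comments
  path_filters:
    - "!**/generated/**"      # never review generated code
    - "!**/*.snap"            # skip snapshot files
  path_instructions:
    - path: "src/db/**"
      instructions: "Raw SQL here is intentional; do not suggest an ORM."
    - path: "**/*.test.ts"
      instructions: "Do not flag duplicated setup code in tests."
```

&lt;p&gt;The point is less the exact keys and more the habit: every time you dismiss the same comment twice, encode the reason here instead of dismissing it a third time.&lt;/p&gt;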

&lt;p&gt;&lt;strong&gt;Review the tool's comments critically.&lt;/strong&gt; Do not blindly apply suggestions. AI review is right often enough to be useful and wrong often enough to cause real damage if you merge without reading. Treat every comment as a suggestion, not an instruction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Combine AI review with &lt;a href="https://dev.to/blog/testing-ai-generated-code-developer-guide-2026"&gt;testing strategies for AI generated code&lt;/a&gt;.&lt;/strong&gt; AI review catches issues at commit time. Tests catch them at runtime. Neither is sufficient alone. The combination is what actually keeps quality up as commit volume increases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measure whether it is helping.&lt;/strong&gt; After a month of running one of these tools, look at your bug reports. Are you catching things earlier? Are you shipping with fewer post-merge hotfixes? If the answer is no, the tool is not earning its cost and you should either tune it more aggressively or switch.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Bottom Line
&lt;/h2&gt;

&lt;p&gt;AI code review in 2026 is not a future technology. It is now mandatory infrastructure for any team shipping at meaningful velocity. The question is no longer whether to use it. The question is which one, and how to configure it so it helps instead of generating noise you will learn to ignore.&lt;/p&gt;

&lt;p&gt;CodeRabbit is the safe pick for breadth and integration. Greptile is the precision pick when review quality is the priority. Vercel Agent is the native pick for anyone on the Vercel platform who wants runtime context in their reviews.&lt;/p&gt;

&lt;p&gt;Pick one, tune it for signal, and let it do its job. The cost is real, but the cost of a race condition that ships to production on a Friday afternoon is much higher. I know, because I merged one of those, and the comment that flagged it was drowned out by the eleven others the tool had generated that week, all of which I had already learned to ignore.&lt;/p&gt;

&lt;p&gt;The tool does not save you. The tool plus a minute of attention to its output does.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>saas</category>
    </item>
    <item>
      <title>Cursor vs Windsurf vs Zed: The AI IDE Showdown (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:41:09 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/cursor-vs-windsurf-vs-zed-the-ai-ide-showdown-2026-44eo</link>
      <guid>https://forem.com/alexcloudstar/cursor-vs-windsurf-vs-zed-the-ai-ide-showdown-2026-44eo</guid>
      <description>&lt;p&gt;I have a bad habit of switching editors the moment something shinier appears on my timeline.&lt;/p&gt;

&lt;p&gt;Over the last six months I have used Cursor as my daily driver for two features, Windsurf for one side project, and Zed for the last month with Claude Code wired in. I have opinions. Most of them are different from the opinions I had at the start.&lt;/p&gt;

&lt;p&gt;The short version: these are three genuinely different tools that happen to all call themselves AI IDEs. Picking between Cursor, Windsurf, and Zed in 2026 is not about which one is best in the abstract. It is about which tradeoffs match how you actually work. If you pick wrong, you will spend the first week fighting the editor instead of shipping.&lt;/p&gt;

&lt;p&gt;Here is how they actually compare when you use them for real work, what surprised me, and how I would pick today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Landscape in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Cursor is the AI IDE that won 2024 and 2025 by being the best VS Code fork with AI baked in. Windsurf is Codeium's more autonomous cousin, also a VS Code fork, with an agentic model called Cascade that tries to do more of the work for you. Zed is the odd one out: a Rust-based editor built from scratch for speed, with AI features layered on top (and increasingly, Claude Code as the agentic companion).&lt;/p&gt;

&lt;p&gt;Everyone else (Cline, Aider, Copilot in VS Code, Antigravity, Kiro, Trae) is either a plugin inside another editor or a niche tool worth its own post. The three that most developers are actually picking between are these.&lt;/p&gt;

&lt;p&gt;If you have been using GitHub Copilot inside VS Code and are wondering whether to switch, I wrote about the broader decision in &lt;a href="https://dev.to/blog/claude-code-vs-cursor-vs-github-copilot-2026"&gt;Claude Code vs Cursor vs GitHub Copilot&lt;/a&gt;. This post is specifically about the three AI-native editors most likely to replace your current setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cursor: The Polish Leader
&lt;/h2&gt;

&lt;p&gt;Cursor is what happens when you take VS Code, optimize everything about the AI experience, and charge $20 per month for it.&lt;/p&gt;

&lt;p&gt;The things Cursor does better than everyone else: tab completion that feels telepathic once you get used to it, a chat panel with real codebase understanding via the @codebase command, and multi-file edits that actually work for non-trivial refactors. The UI is familiar to anyone who has used VS Code, all your extensions work, and the learning curve is close to zero.&lt;/p&gt;

&lt;p&gt;After using it daily for months, the workflow that clicks for me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Command-K for inline edits on small scoped changes. Highlight five lines, describe the change, accept.&lt;/li&gt;
&lt;li&gt;The agent panel (Cmd-I) for multi-file changes where I have a clear spec. Feed it the spec, review the plan, let it run, inspect the diff.&lt;/li&gt;
&lt;li&gt;@codebase in chat when I need to ask "where does X live" or "how does Y work" without leaving the editor.&lt;/li&gt;
&lt;li&gt;Tab completion everywhere else, all day, constantly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where Cursor disappoints: the agent mode still gets confused on large multi-file changes in unfamiliar code. It will confidently produce a plan that looks right and then edit files in ways that compile but miss the point. When this happens, the iteration loop (review, reject, reprompt) is worse than writing it yourself.&lt;/p&gt;

&lt;p&gt;The pricing is straightforward: $20 per month for Pro, which gets you fast requests to the best models. You will hit the fast-request limit if you use it heavily; after that you are on slow requests, which still work but feel noticeably worse. For most professional developers, $20 per month is negligible next to the productivity gain.&lt;/p&gt;

&lt;p&gt;Where Cursor wins the comparison: you can be productive on day one. No learning curve. No surprising behavior. All your VS Code extensions work. If you just want a better VS Code with AI that does not fight you, this is the default answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Windsurf: The Agent-First Bet
&lt;/h2&gt;

&lt;p&gt;Windsurf is the editor you pick when you want the AI to do more of the work, not just the typing.&lt;/p&gt;

&lt;p&gt;Its headline feature is Cascade, the agentic model that can plan and execute multi-step changes across your codebase with less hand-holding than Cursor's agent mode. In practice Cascade feels like you are delegating to a junior developer who occasionally overreaches but gets the easy stuff done while you focus on the hard parts.&lt;/p&gt;

&lt;p&gt;A task I regularly hand to Cascade: "add a rate limiter to the user endpoint with a 60-second window and 100 requests per window, update the tests, add the new middleware to the router." This is three files of coordinated changes. Cascade usually nails it on the first try. When Cursor's agent does the same task, I get closer to 50% first-pass success.&lt;/p&gt;
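&lt;p&gt;For a sense of scale, here is roughly what the core of that task amounts to. This is my own minimal sliding-window sketch, not Cascade's actual output, and every name in it is invented for illustration:&lt;/p&gt;

```typescript
// Minimal sliding-window rate limiter: LIMIT requests per WINDOW_MS.
// Illustrative sketch only, not Cascade's output; all names are made up.
const WINDOW_MS = 60_000;
const LIMIT = 100;

// key (user id, API key, IP, ...) mapped to its recent request timestamps
const hits = new Map();

function allowRequest(key: string, now: number): boolean {
  const cutoff = now - WINDOW_MS;
  // Drop timestamps that have fallen out of the window.
  const recent = (hits.get(key) ?? []).filter((t: number) => t > cutoff);
  if (recent.length >= LIMIT) {
    hits.set(key, recent);
    return false; // over the limit for this window
  }
  recent.push(now);
  hits.set(key, recent);
  return true;
}
```

&lt;p&gt;The interesting part of the task is not the limiter itself. It is the coordination: the middleware, the router wiring, and the test updates all have to agree, and that kind of mechanical cross-file consistency is exactly what an agent handles well.&lt;/p&gt;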

&lt;p&gt;The pricing model is different from Cursor in a way that matters. Windsurf offers unlimited autocomplete on its free tier, which is rare. The paid individual plan at $15 per month is cheaper than Cursor and includes Cascade access. Team plans at $30 per user include zero data retention and enterprise features.&lt;/p&gt;

&lt;p&gt;What makes me not use Windsurf as my daily driver despite liking Cascade: the non-AI parts of the editor feel a beat behind Cursor. The tab completion is good but not as uncanny. The inline edit experience is fine but less polished. Extensions mostly work but I have hit a few that do not.&lt;/p&gt;

&lt;p&gt;There is also a trust issue. Cascade will sometimes make changes I did not expect and would not have chosen. Agentic behavior sits on a spectrum between "does the minimum I asked for" and "reshapes the code according to its own judgment." Cascade is further toward the second end than I prefer. If you like more autonomy from your AI, this is a feature. If you like an AI that stays in its lane, this is friction.&lt;/p&gt;

&lt;p&gt;Where Windsurf wins the comparison: if agentic workflows are the point for you, and you want to hand off entire tasks rather than speed up typing, Cascade is genuinely better than the Cursor agent today. Pair it with a disciplined review process and you will ship more.&lt;/p&gt;




&lt;h2&gt;
  
  
  Zed: The Speed Play
&lt;/h2&gt;

&lt;p&gt;Zed is the editor you pick when you have tried the others, found them slower than your brain, and are willing to give up some polish for raw responsiveness.&lt;/p&gt;

&lt;p&gt;Zed is written in Rust. It starts in under half a second on my machine. Input latency is under 2 milliseconds. Large files open instantly. Syntax highlighting and autocomplete never hitch. After spending time in Zed, going back to a VS Code fork feels like wading through molasses.&lt;/p&gt;

&lt;p&gt;The AI story in Zed has evolved fast. The built-in assistant is capable and integrates cleanly with Claude, OpenAI, and other providers via API keys. You get inline AI edits (similar to Cursor's Command-K), a chat panel, and agent features that have improved significantly over the last year. But where Zed really shines for AI development in 2026 is its integration with Claude Code as an external agent.&lt;/p&gt;

&lt;p&gt;My current setup: Zed for editing, reading, and navigating. Claude Code running in a side terminal for larger agentic tasks. The editor stays fast because it is not also trying to be an agent. The agent is agentic because it is not also trying to be an editor. The combination is more productive for me than any all-in-one tool has been.&lt;/p&gt;

&lt;p&gt;The downsides of Zed are real. The extension ecosystem is a fraction of VS Code's. Some languages and frameworks have first-class support; others feel underserved. If your daily workflow depends on a specific VS Code extension, you may find Zed cannot replace it yet. The AI features, while good, are not as polished as Cursor's. Tab completion is noticeably less magical.&lt;/p&gt;

&lt;p&gt;The collaboration story is Zed's underappreciated trick. Real-time pair programming is built in. If you are on a small team and occasionally want to mob-program on a gnarly problem, the native multiplayer is genuinely useful.&lt;/p&gt;

&lt;p&gt;Pricing is the easiest of the three: Zed itself is free and open source. You bring your own API key for AI (or use the Zed-hosted offering). If you already pay for Claude Pro or have an API budget, the total monthly cost is often lower than the AI-IDE subscriptions.&lt;/p&gt;

&lt;p&gt;Where Zed wins the comparison: if speed matters to you, if you like minimal tools you can configure, and if you are comfortable with a BYO-agent model where Claude Code or similar runs outside the editor, Zed is the most satisfying choice of the three. It feels like the future of "editor plus agent" even though the editor itself is decidedly traditional.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Feature Comparison That Actually Matters
&lt;/h2&gt;

&lt;p&gt;Every AI IDE comparison online has a feature grid. Most of them are useless because they list features, not behavior. Here is what actually matters when you sit down to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tab Completion Quality
&lt;/h3&gt;

&lt;p&gt;Cursor: best in class. The ghost text feels like it knows what you are about to type.&lt;/p&gt;

&lt;p&gt;Windsurf: very good. Not quite Cursor, but close enough for most work.&lt;/p&gt;

&lt;p&gt;Zed: good. Noticeably a step behind the two VS Code forks for this specific feature.&lt;/p&gt;

&lt;p&gt;If tab completion is the AI feature you use most (it is for many developers), this alone may decide the comparison.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inline Edit (Command-K Style)
&lt;/h3&gt;

&lt;p&gt;Cursor: polished. Accept or reject with a keystroke, diff view is clean, works reliably on scoped changes.&lt;/p&gt;

&lt;p&gt;Windsurf: nearly identical to Cursor. Same experience.&lt;/p&gt;

&lt;p&gt;Zed: works well, slightly less polished UI, but functionally equivalent for most edits.&lt;/p&gt;

&lt;p&gt;All three are good at this. Not a deciding factor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent Mode (Multi-File Tasks)
&lt;/h3&gt;

&lt;p&gt;Cursor: good on well-scoped tasks, struggles on large or unfamiliar code.&lt;/p&gt;

&lt;p&gt;Windsurf (Cascade): best of the three for autonomous execution. Also the most likely to overreach.&lt;/p&gt;

&lt;p&gt;Zed + Claude Code: the Claude Code agent is state of the art, but it is external to the editor. Integration is via a terminal, not inline.&lt;/p&gt;

&lt;p&gt;If you want to hand off entire tasks, Cascade or Claude-Code-beside-Zed is where you should be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Codebase Understanding
&lt;/h3&gt;

&lt;p&gt;Cursor: @codebase is well-tuned and fast. Works on large repos.&lt;/p&gt;

&lt;p&gt;Windsurf: similar capability, slightly less refined UX.&lt;/p&gt;

&lt;p&gt;Zed: depends heavily on which agent you pair with. Claude Code has excellent codebase understanding but requires you to launch it separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speed and Responsiveness
&lt;/h3&gt;

&lt;p&gt;Cursor: fine. Occasional slowness on very large files. Startup is slow by Zed standards.&lt;/p&gt;

&lt;p&gt;Windsurf: fine. Similar to Cursor.&lt;/p&gt;

&lt;p&gt;Zed: in a different category. If you have ever been annoyed by editor lag, this is the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Compatibility
&lt;/h3&gt;

&lt;p&gt;Cursor: full VS Code extension ecosystem.&lt;/p&gt;

&lt;p&gt;Windsurf: full VS Code extension ecosystem.&lt;/p&gt;

&lt;p&gt;Zed: its own ecosystem, smaller but growing. Language server support is good. Specific productivity extensions may not exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pricing (2026)
&lt;/h3&gt;

&lt;p&gt;Cursor: $20/month Pro. Team plans scale up.&lt;/p&gt;

&lt;p&gt;Windsurf: $15/month individual, $30/user team. Free tier with unlimited autocomplete.&lt;/p&gt;

&lt;p&gt;Zed: free (editor). AI costs come from your provider API key or the Zed-hosted plan.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;p&gt;The bad answer to "which one should I use" is "it depends." The useful answer is to have a default based on how you work, and only deviate if you have a specific reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Cursor if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the least friction to get productive with AI.&lt;/li&gt;
&lt;li&gt;Tab completion is the AI feature you value most.&lt;/li&gt;
&lt;li&gt;You rely on specific VS Code extensions.&lt;/li&gt;
&lt;li&gt;You want a single-subscription tool that handles everything.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Windsurf if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to hand off larger tasks to the AI.&lt;/li&gt;
&lt;li&gt;Cost matters (the free tier and cheaper paid plan are real advantages).&lt;/li&gt;
&lt;li&gt;You are comfortable reviewing more AI-initiated changes before they merge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pick Zed (with Claude Code or similar) if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Editor speed actively matters to you.&lt;/li&gt;
&lt;li&gt;You prefer the editor and the agent to be separate tools.&lt;/li&gt;
&lt;li&gt;You already pay for Claude Pro or have an API budget.&lt;/li&gt;
&lt;li&gt;You like minimal, configurable tools over all-in-one environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most developers, Cursor is the right default. It is the least-surprising, fastest-to-productive option, and the one I recommend when someone asks "where do I start."&lt;/p&gt;

&lt;p&gt;For developers who have been doing this for a while and know what they like, the Zed-plus-Claude-Code combination is where I have personally landed. It respects my muscle memory for a fast editor while giving me a best-in-class agent for the tasks where agents matter.&lt;/p&gt;

&lt;p&gt;Windsurf is the wild card. If your work pattern is "describe what I want, let the AI take a real stab at it, iterate," Cascade is the best tool today.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Does Not Matter (Even Though the Marketing Says It Does)
&lt;/h2&gt;

&lt;p&gt;A few things keep showing up in AI IDE comparisons that do not actually affect daily work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model choice.&lt;/strong&gt; All three tools let you use Claude, GPT, or other frontier models. The model matters for quality, but the editor choice does not meaningfully gate which model you use. Pick the editor based on ergonomics and pair with whichever model you prefer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Built in Rust" vs "built on Electron."&lt;/strong&gt; This affects startup time and memory, and that shows up in Zed's superior responsiveness. But if your current editor already feels fast enough, the underlying implementation does not matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent autonomy scores.&lt;/strong&gt; The marketing around "most autonomous AI coder" is a race to make the AI do more with less input. In practice, the bottleneck is almost never autonomy. It is correctness. An agent that does more of the wrong thing is worse than one that does less of the right thing. Don't optimize for autonomy at the expense of quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telemetry and privacy positioning.&lt;/strong&gt; All three offer a private or enterprise tier with zero data retention if you need it. For solo developers working on non-sensitive projects, the default tiers are fine.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Meta Takeaway
&lt;/h2&gt;

&lt;p&gt;Switching editors is expensive. Not in dollars, in focus. Every time I have switched, I lost a week of momentum learning keybindings, restoring my workflow, and discovering what was missing.&lt;/p&gt;

&lt;p&gt;The best strategy is to pick one as your default and live in it for at least a month before evaluating whether another tool is meaningfully better for you. A week is not enough. Two weeks is not enough. A month of actual work, including at least one hard debugging session and one large multi-file refactor, is the minimum bar for forming an opinion.&lt;/p&gt;

&lt;p&gt;If you are currently happy with your editor, the honest answer is that none of these will 10x your productivity. They will incrementally improve specific parts of your workflow. If you are already using one of the three and feeling resistance, switching might close that gap, or might not. You have to try.&lt;/p&gt;

&lt;p&gt;If you are coming from plain VS Code with no AI, or from a non-AI editor, switching to any of these will be a step change. In that case, start with Cursor. Get comfortable with the workflows. Reassess in three months.&lt;/p&gt;

&lt;p&gt;The bigger shift I think most developers underestimate is not between these three editors. It is between "editor with AI features" and "editor plus dedicated AI agent as separate tool." That is the divide I have come out on the other side of, and the reason my daily driver is Zed with Claude Code instead of one of the more obvious picks.&lt;/p&gt;

&lt;p&gt;For where the industry is heading on that shift specifically, I wrote more about it in &lt;a href="https://dev.to/blog/agentic-coding-2026"&gt;agentic coding in 2026&lt;/a&gt;, which covers the broader pattern of agents as first-class development tools rather than IDE features.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Actual Recommendation
&lt;/h2&gt;

&lt;p&gt;If I had to pick one and only one for a developer who asked me today, with no other context about their work, I would say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor, for the next three months.&lt;/strong&gt; Get the fundamentals. Get the reflexes. Learn what an AI IDE feels like when it is working well.&lt;/p&gt;

&lt;p&gt;Then, once you have that baseline, try Zed with Claude Code for a month. See if the editor-plus-separate-agent model clicks. If it does, you have found your long-term setup. If it does not, you have a couple-hundred-dollar-a-year tool that you understand deeply and will get more out of because you evaluated the alternative.&lt;/p&gt;

&lt;p&gt;Windsurf is worth a try if the cost matters or if Cascade's specific style of autonomy appeals to you. It is a real contender, just not my daily driver.&lt;/p&gt;

&lt;p&gt;The only wrong answer is paralysis. These tools are good enough that any of the three, used consistently, will move you forward. The tools are getting better faster than you can evaluate them. Pick one, ship work with it, and switch only when you have a specific reason.&lt;/p&gt;

&lt;p&gt;The work is the point. The editor is the thing that gets out of your way.&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>ai</category>
      <category>productivity</category>
      <category>editors</category>
    </item>
    <item>
      <title>AI Agent Observability: Debugging Production Agents Without Going Insane (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Tue, 21 Apr 2026 07:41:08 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/ai-agent-observability-debugging-production-agents-without-going-insane-2026-53km</link>
      <guid>https://forem.com/alexcloudstar/ai-agent-observability-debugging-production-agents-without-going-insane-2026-53km</guid>
      <description>&lt;p&gt;The first time I shipped an AI agent to production, I watched it do something in the logs that I could not reproduce locally, could not explain, and could not fix.&lt;/p&gt;

&lt;p&gt;A user asked it to summarize a meeting. It responded with a confident paragraph that referenced three people who were not in the meeting, a decision that was never made, and a date that did not exist. Everything about the response looked plausible. None of it was true.&lt;/p&gt;

&lt;p&gt;I had logs. I had the prompt. I had the final response. What I did not have was any visibility into the sixteen tool calls, three retries, one silent fallback to a cheaper model, and two places where the context was truncated before the model ever saw the real input. The bug was somewhere in that middle. I could not see the middle.&lt;/p&gt;

&lt;p&gt;That was the moment I understood that agent observability is not a nice-to-have. It is the difference between shipping agents and shipping prayers.&lt;/p&gt;

&lt;p&gt;This is the setup I wish I had that day. It works for solo developers, it costs less than you would guess, and it turns the black box of agent execution into something you can actually debug.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Agent Observability Is Different From Regular Logging
&lt;/h2&gt;

&lt;p&gt;If you are coming from web development, your instinct is to reach for the logging tools you already know. Sentry for errors. Datadog for metrics. A structured logger for requests. These are great tools and you should still use them. But they do not tell you what you need to know about an agent.&lt;/p&gt;

&lt;p&gt;The problem is that an agent failure is rarely a single event. It is a chain of events, each of which looks fine on its own.&lt;/p&gt;

&lt;p&gt;A tool call returns valid JSON. The model reads that JSON and makes a reasonable next decision. The next step executes without errors. Eventually the agent returns an answer. Every individual step passes validation. The final answer is wrong.&lt;/p&gt;

&lt;p&gt;If you log these steps independently, you see a sequence of successful operations. If you trace them together, you see that the second tool call returned stale data that the model then built a confident hallucination on top of for the next eight turns. The root cause is invisible at the individual log line. It only appears in the full causal chain.&lt;/p&gt;

&lt;p&gt;This is why the observability stack for agents looks different. You need three things that traditional logging tools do not give you by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session traces.&lt;/strong&gt; The full sequence of prompts, completions, tool calls, retries, and state changes that make up a single agent execution, linked together as one object you can inspect top to bottom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token and cost attribution.&lt;/strong&gt; Which parts of your agent are spending the tokens, and therefore the money. Without this you cannot find the prompt that accidentally got 40x more expensive after a refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evals that run in production.&lt;/strong&gt; Offline evals catch the bugs you thought to test for. Production evals catch the bugs your users ran into first.&lt;/p&gt;

&lt;p&gt;Get those three right and you will solve 90% of the problems that make solo developers afraid to ship agents. Let me walk through each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Session Traces: The One Thing You Cannot Live Without
&lt;/h2&gt;

&lt;p&gt;A trace is a single, inspectable view of an entire agent run. For a typical agent, a trace might include the user input, the system prompt, the first model completion, the first tool call and its response, the updated context, the second model completion, and so on until the agent stops.&lt;/p&gt;

&lt;p&gt;You want this view because agent failures are contextual. The question is never just "what did the model say" but "what did the model say given this specific context after this specific history of events." Without the full trace, you are reading the punchline without the setup.&lt;/p&gt;

&lt;p&gt;The tools I have tried and would recommend for solo founders and small teams in 2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Langfuse&lt;/strong&gt; is the open-source option with a generous free tier and a self-hostable option if you want to keep your data. It supports any framework through a simple SDK. Traces render as a nested tree where you can click into each span, see the full prompt and completion, inspect tool inputs and outputs, and compare runs side by side. If you want to just try something and get value quickly, this is where I would start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; from LangChain is the most polished of the observability platforms if you are building with LangChain or LangGraph, though it now works framework-agnostically. The trace UI renders the execution tree beautifully and has the best prompt engineering workflow (you can edit a prompt in the playground, run it against the same input, and compare the result). The free tier is enough for early development, and the paid tiers scale with volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Braintrust&lt;/strong&gt; is worth considering if you want observability and evals in the same tool with a product focus on experimentation. The trace view is clean, and the "playground" workflow for iterating on prompts inside real traces is genuinely excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Helicone&lt;/strong&gt; is the lightest-weight option if you do not want another SDK. It works as a proxy in front of your LLM provider, so you change one URL and suddenly have observability. For simple agents this is a great "just make the pain stop" option.&lt;/p&gt;

&lt;p&gt;Pick one and integrate it into your agent on day one, not the day you ship to production. The cost of adding observability later is much higher than the cost of adding it at the start.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to Actually Log in Each Trace
&lt;/h3&gt;

&lt;p&gt;Once you have the tool picked, what you log in each trace matters more than the tool choice. The defaults are usually not enough.&lt;/p&gt;

&lt;p&gt;For each agent run, log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The raw user input (not a summarized or preprocessed version)&lt;/li&gt;
&lt;li&gt;The full system prompt as sent to the model (not the template you intended to send)&lt;/li&gt;
&lt;li&gt;Every tool call input and output, including failed calls&lt;/li&gt;
&lt;li&gt;Every model call with its model name, temperature, and token counts&lt;/li&gt;
&lt;li&gt;Any retries, including why they happened&lt;/li&gt;
&lt;li&gt;Any fallback to a cheaper or different model&lt;/li&gt;
&lt;li&gt;The final response as the user saw it&lt;/li&gt;
&lt;li&gt;A session ID and user ID (or anonymous user hash) so you can correlate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two specific things that you will forget and then regret: the exact model version string (not just "gpt-5" but the full identifier with date stamp if available) and the full prompt after template substitution. The bug is often in the substitution.&lt;/p&gt;




&lt;h2&gt;
  
  
  Naming and Structuring Traces So You Can Actually Find Things
&lt;/h2&gt;

&lt;p&gt;The failure mode I see most often with agent observability is not having too little data. It is having data you cannot query. A million traces are useless if you cannot find the ten that broke a user's workflow.&lt;/p&gt;

&lt;p&gt;Here is the structure that has saved me the most time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use consistent span names.&lt;/strong&gt; Every tool call should be named after the tool, not a generic "tool_call" label. Every retry should be named "retry_1", "retry_2" so you can filter on retries specifically. Every model call should include the model name in the span.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag sessions with user context.&lt;/strong&gt; Add metadata tags for user ID, account plan, feature flag state, and any other dimension you might want to filter on later. "Show me all failed agent runs for paid users in the last 24 hours where the feature flag X was on" is a query you will want to run and cannot answer without tags.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag sessions with success or failure signals.&lt;/strong&gt; The model cannot always tell you if an agent run was successful. You can sometimes tell from downstream user behavior. If the user copied the response, they probably liked it. If they asked a clarifying question immediately after, they probably did not. Log these signals back to the trace as tags. You will use this data for evals later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capture the decision points explicitly.&lt;/strong&gt; If your agent branches ("should I use the search tool or answer from memory"), log the decision and the reason, not just the action taken. When something goes wrong you want to know the agent chose path A when it should have chosen path B, and you want to know why.&lt;/p&gt;
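&lt;p&gt;The payoff of consistent tags is that queries like the one above become trivial. A toy sketch, with invented tag names, of filtering for failed runs from paid users in the last 24 hours with a feature flag on:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
traces = [
    {"id": "t1", "ts": now - timedelta(hours=2),
     "tags": {"user_plan": "paid", "flag_x": True, "outcome": "failed"}},
    {"id": "t2", "ts": now - timedelta(hours=5),
     "tags": {"user_plan": "free", "flag_x": True, "outcome": "failed"}},
    {"id": "t3", "ts": now - timedelta(days=3),
     "tags": {"user_plan": "paid", "flag_x": False, "outcome": "ok"}},
]

def failed_paid_last_24h(traces):
    """Answerable only because every trace carries the same tag schema."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    return [t["id"] for t in traces
            if t["ts"] >= cutoff
            and t["tags"].get("user_plan") == "paid"
            and t["tags"].get("flag_x") is True
            and t["tags"].get("outcome") == "failed"]

print(failed_paid_last_24h(traces))  # ['t1']
```

&lt;p&gt;In practice the observability platform runs this query for you, but only if the tags were attached consistently at write time.&lt;/p&gt;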




&lt;h2&gt;
  
  
  Evals: The Part Most Developers Skip
&lt;/h2&gt;

&lt;p&gt;I wrote a whole post on &lt;a href="https://dev.to/blog/ai-evals-solo-developers-2026"&gt;AI evals for solo developers&lt;/a&gt; that covers the basics, but observability is where evals start paying off.&lt;/p&gt;

&lt;p&gt;Evals are the automated tests for your agent. Unit tests for deterministic code ask "does this function return 42 for input X." Evals for agents ask "does this agent's response meet the quality bar for input X." Quality bar is fuzzy, so evals use a mix of deterministic checks (did it call the right tool), LLM-as-judge checks (is the answer factually grounded in the provided context), and sometimes human review.&lt;/p&gt;

&lt;p&gt;The thing I want to hammer on here: evals and observability are the same workflow.&lt;/p&gt;

&lt;p&gt;When you have a trace, you have an input and an output and all the intermediate steps. That is exactly what an eval consumes. Good observability tools make the round trip from "I saw a bad trace in production" to "I added this case to my eval suite and confirmed my fix works on it" a single-click operation.&lt;/p&gt;

&lt;p&gt;The workflow that actually works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You see a bad agent run in production (tagged by a negative signal or reported by a user).&lt;/li&gt;
&lt;li&gt;You open the trace and see what went wrong.&lt;/li&gt;
&lt;li&gt;You save the input as a test case in your eval suite with an expected behavior or quality criterion.&lt;/li&gt;
&lt;li&gt;You make a change to the prompt, the tool, or the model choice.&lt;/li&gt;
&lt;li&gt;You rerun the eval suite to see that your fix works without breaking other cases.&lt;/li&gt;
&lt;li&gt;You ship.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is what separates agents that improve over time from agents that keep making the same mistakes.&lt;/p&gt;
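&lt;p&gt;The loop above can be sketched in a few lines. This is a deliberately minimal eval harness, not any framework's real API: each case pairs an input with a check, and a bad production trace becomes a permanent case:&lt;/p&gt;

```python
# Minimal eval-suite sketch: each case is an input plus a check function.
eval_suite = []

def add_case(name, user_input, check):
    eval_suite.append({"name": name, "input": user_input, "check": check})

def run_suite(agent):
    """Run every saved case against the agent; return the names that fail."""
    failures = []
    for case in eval_suite:
        output = agent(case["input"])
        if not case["check"](output):
            failures.append(case["name"])
    return failures

# Step 3 of the workflow: the bad trace's input, saved with a quality criterion.
add_case("no-invented-attendees",
         "Summarize the meeting between Ana and Ben",
         lambda out: "Carol" not in out)

# Stand-in for the agent after the prompt fix (step 4).
def fixed_agent(user_input):
    return "Ana and Ben agreed to ship on Friday."

print(run_suite(fixed_agent))  # []
```

&lt;p&gt;Real suites use LLM-as-judge checks alongside deterministic ones, but the shape of the loop is the same.&lt;/p&gt;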




&lt;h2&gt;
  
  
  Running Evals in Production (Not Just Before Deploy)
&lt;/h2&gt;

&lt;p&gt;Most developers treat evals as a pre-deploy check. Run them in CI, make sure they pass, ship. This is good but not enough.&lt;/p&gt;

&lt;p&gt;The problem is that production traffic is not a clean superset of your eval set. Users will do things you did not imagine. Edge cases you never thought of will hit your agent on day one. A 95% eval pass rate on your test suite means almost nothing if your test suite is missing the cases that actually break.&lt;/p&gt;

&lt;p&gt;Production evals fix this by running a subset of your eval logic on real production traffic. You sample a percentage of runs, pipe them through an LLM-as-judge or a deterministic check, and log the results as a quality metric.&lt;/p&gt;

&lt;p&gt;What this buys you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A quality dashboard.&lt;/strong&gt; You see the percentage of agent runs that met your quality bar over the last day, week, month. This lets you detect regressions that would otherwise only show up in customer complaints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster incident detection.&lt;/strong&gt; When you push a prompt change and the production quality metric drops by 15%, you know something is wrong within an hour instead of three days later when support tickets pile up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous eval set growth.&lt;/strong&gt; Every production run that fails the judge becomes a candidate for your permanent eval suite. Your test set grows with your real usage.&lt;/p&gt;

&lt;p&gt;Starting point: sample 5% of runs, pipe them through a cheap LLM-as-judge that scores them on two or three dimensions that matter for your product (factual grounding, tool use correctness, response helpfulness). Put the scores on a dashboard. That is it. The dashboard will tell you when something is wrong and which traces to look at.&lt;/p&gt;
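&lt;p&gt;The sampling logic itself is a few lines. Here the judge is a stand-in deterministic check rather than a real LLM call, and the function names are mine:&lt;/p&gt;

```python
import random

def judge(trace) -> dict:
    # Stand-in for a cheap LLM-as-judge call; scores one dimension here.
    grounded = all(fact in trace["context"] for fact in trace["claimed_facts"])
    return {"factual_grounding": 1.0 if grounded else 0.0}

def maybe_eval_in_production(trace, sample_rate=0.05, rng=random.random):
    """Judge a sampled fraction of real traffic and tag the trace with the score."""
    if rng() >= sample_rate:
        return None                      # not sampled; costs nothing
    scores = judge(trace)
    trace.setdefault("tags", {})["quality"] = scores
    return scores

trace = {"context": "Ana and Ben met Tuesday.", "claimed_facts": ["Ana", "Ben"]}
print(maybe_eval_in_production(trace, sample_rate=1.0))  # {'factual_grounding': 1.0}
```

&lt;p&gt;The scores become the time series on your quality dashboard; the failing traces become eval-suite candidates.&lt;/p&gt;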




&lt;h2&gt;
  
  
  Cost and Token Observability
&lt;/h2&gt;

&lt;p&gt;The other thing that bites solo developers shipping agents is cost. You build something, ship it, and a month later see a bill that does not match your mental model of what the agent does.&lt;/p&gt;

&lt;p&gt;The cost comes from places you do not expect. A tool that returns a 50 KB blob of JSON, which the model then re-reads on every subsequent turn. A system prompt that grew 500 tokens during a refactor and now runs on every single call. A retry loop that happens silently when the model returns malformed JSON, doubling or tripling your per-request cost.&lt;/p&gt;

&lt;p&gt;Every agent observability tool I mentioned above tracks tokens by default. Use this.&lt;/p&gt;

&lt;p&gt;Specifically, build a dashboard that answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which spans (tools, model calls) account for most of the tokens?&lt;/li&gt;
&lt;li&gt;What is the average token count per session, and has it drifted over the last month?&lt;/li&gt;
&lt;li&gt;What is the p99 token count per session? (This is often where the cost overruns hide.)&lt;/li&gt;
&lt;li&gt;Which users or accounts are the most expensive? (A power user is fine; a runaway loop is not.)&lt;/li&gt;
&lt;/ul&gt;
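&lt;p&gt;To see why the p99 question matters more than the average, here is a toy computation. With made-up numbers where 2% of sessions are runaway loops, the average barely moves while the p99 screams:&lt;/p&gt;

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value at or above p% of the data."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(0, k)]

# 490 normal sessions at ~1,000 tokens, 10 runaway sessions at 80,000.
sessions = [1000] * 490 + [80000] * 10
print(round(sum(sessions) / len(sessions)))  # 2580
print(percentile(sessions, 50))              # 1000
print(percentile(sessions, 99))              # 80000
```

&lt;p&gt;An average of ~2,580 tokens looks plausible on a dashboard; the p99 of 80,000 is the number that tells you something is looping.&lt;/p&gt;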

&lt;p&gt;A specific pattern I use now: a sanity check at the start of every agent run that rejects requests where the prefilled context is already over a threshold (say 100k tokens). This has caught bugs where an edge case was pulling way more context than intended, and the agent would then run expensively and slowly for no reason.&lt;/p&gt;
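&lt;p&gt;That sanity check is a few lines at the top of the run. A sketch with an assumed 4-characters-per-token heuristic; in practice you would use your provider's tokenizer:&lt;/p&gt;

```python
MAX_CONTEXT_TOKENS = 100_000

class ContextTooLarge(Exception):
    pass

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def check_context_budget(prefilled_context: str, limit=MAX_CONTEXT_TOKENS):
    """Reject runs whose context is already over budget before any model call."""
    tokens = estimate_tokens(prefilled_context)
    if tokens > limit:
        raise ContextTooLarge(f"context is ~{tokens} tokens, limit is {limit}")
    return tokens

print(check_context_budget("hello world" * 100))  # 275
```

&lt;p&gt;Failing loudly here is the point: a rejected request is a trace you can inspect, while a silently bloated one is just a slow, expensive mystery.&lt;/p&gt;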

&lt;p&gt;For the broader picture on keeping AI costs sane at production scale, I went into this in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization in production&lt;/a&gt;, which pairs well with the token observability approach above.&lt;/p&gt;




&lt;h2&gt;
  
  
  Debugging Non-Deterministic Failures
&lt;/h2&gt;

&lt;p&gt;The hardest agent bugs are the ones that only happen sometimes. You run the same input and get the right answer. A user runs it five times and gets the wrong answer twice. What do you do?&lt;/p&gt;

&lt;p&gt;This is where good tracing changes the game.&lt;/p&gt;

&lt;p&gt;First, you need to know these failures are happening. Set up a trace filter that flags sessions where the same user retried a request within 30 seconds. That is a strong signal they did not like the first answer.&lt;/p&gt;
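&lt;p&gt;The retry filter is simple to express. A sketch over illustrative run records, flagging any pair where the same user reran within the window:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def flag_quick_retries(runs, window_seconds=30):
    """Flag (previous, retry) session pairs where the same user reran quickly."""
    flagged = []
    last_run_by_user = {}
    for run in sorted(runs, key=lambda r: r["ts"]):
        prev = last_run_by_user.get(run["user"])
        if prev and window_seconds >= (run["ts"] - prev["ts"]).total_seconds():
            flagged.append((prev["id"], run["id"]))
        last_run_by_user[run["user"]] = run
    return flagged

t0 = datetime(2026, 4, 21, 9, 0, 0)
runs = [
    {"id": "a", "user": "u1", "ts": t0},
    {"id": "b", "user": "u1", "ts": t0 + timedelta(seconds=12)},  # quick retry
    {"id": "c", "user": "u2", "ts": t0 + timedelta(minutes=5)},
]
print(flag_quick_retries(runs))  # [('a', 'b')]
```

&lt;p&gt;Each flagged pair is exactly the kind of run you want to diff side by side.&lt;/p&gt;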

&lt;p&gt;Then, you need to compare runs. A good observability tool lets you diff two traces side by side. You want to see what was different. Often the difference is in the context window, not the input. The first run had one set of prior messages. The second run had a subtly different set because of some intermediate state change. The model responded differently because the context was different, not because the prompt was worse.&lt;/p&gt;

&lt;p&gt;Once you see the diff, you can usually fix the problem. Sometimes it is a prompt change. Sometimes it is a retrieval or memory change. Sometimes it is a model version pin (the provider silently updated the model and your prompt is now slightly off). The fix is downstream. The discovery is only possible with good observability.&lt;/p&gt;

&lt;p&gt;For the harder cases where the non-determinism is baked into the model itself, techniques like temperature 0, structured output, and caching can help. I covered some of these in &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;AI agent memory and state persistence&lt;/a&gt;, where the state layer is often where non-determinism sneaks in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Privacy and Data Handling
&lt;/h2&gt;

&lt;p&gt;One thing I should flag because it comes up a lot in solo-founder-shipping-to-enterprise territory: observability by default captures everything, including things you may not be allowed to capture.&lt;/p&gt;

&lt;p&gt;If your agent processes user data that is sensitive (health records, financial information, personal messages), the observability layer becomes a compliance surface. Logging the raw prompts and completions means your observability provider now has copies of that data.&lt;/p&gt;

&lt;p&gt;Three things that help:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PII redaction.&lt;/strong&gt; Most observability platforms have built-in redactors for email addresses, phone numbers, credit card numbers. Turn this on. Better to lose some debugging information than to accidentally log a user's SSN.&lt;/p&gt;
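&lt;p&gt;If your platform does not redact for you, a hand-rolled version is better than nothing. A sketch with deliberately simple patterns; real redactors are much more thorough, and pattern order matters (the strict SSN pattern must run before the looser phone pattern):&lt;/p&gt;

```python
import re

# Order matters: SSN before the looser phone pattern, or phones eat SSNs.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each PII match with a labeled placeholder before logging."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Ana at ana@example.com or 555-867-5309."))
# Reach Ana at [EMAIL] or [PHONE].
```

&lt;p&gt;Run this on prompts and completions before they leave your process, so the raw values never reach the observability provider at all.&lt;/p&gt;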

&lt;p&gt;&lt;strong&gt;Self-hosting.&lt;/strong&gt; Langfuse and a few others offer a self-hostable version. If you need the data to never leave your infrastructure, this is the path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling and retention policies.&lt;/strong&gt; You do not need to keep every trace forever. A policy like "keep all failed traces for 90 days, sample 5% of successful traces for 30 days" gives you enough data to debug while limiting exposure.&lt;/p&gt;

&lt;p&gt;None of this is a substitute for actual compliance review for regulated industries. But for the common case of a SaaS product that handles user data that you would not want in a breach, these steps get you to a reasonable place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Minimal Setup That Actually Works
&lt;/h2&gt;

&lt;p&gt;If I had to start from zero tomorrow and wanted the smallest possible observability setup that would still catch most real problems, here is what I would do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One observability tool&lt;/strong&gt; (Langfuse if I wanted free and self-hostable, LangSmith if I was already on LangChain, Braintrust if evals are my priority). Integrate it on day one of the project, not day one of production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces for every agent run&lt;/strong&gt;, logging the full input, system prompt, tool calls, and response. Tagged with user ID, session ID, and any feature flags relevant to the behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A quality signal&lt;/strong&gt; captured for each run. In the simplest case, this is whether the user replied positively, retried, or abandoned. You can refine later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A basic eval suite&lt;/strong&gt; of 20 to 50 representative cases, run in CI on every prompt or model change, and sampled on 1% to 5% of production runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A token dashboard&lt;/strong&gt; showing total tokens per day, average tokens per session, and top token consumers. Check it weekly.&lt;/p&gt;

&lt;p&gt;That setup takes a weekend to build if you are starting fresh. It takes a couple of hours to add to an existing agent. It turns every production bug from a mystery into a specific trace you can look at, reproduce, and fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mindset Shift That Matters
&lt;/h2&gt;

&lt;p&gt;The last thing I want to leave you with is not a tool recommendation. It is a way of thinking.&lt;/p&gt;

&lt;p&gt;Traditional software development treats observability as a production concern. You write the code, test it, ship it, and add monitoring to see how it behaves in the wild.&lt;/p&gt;

&lt;p&gt;Agent development flips this. The model is the largest source of uncertainty in the system. You cannot unit test your way to confidence because the thing you are testing is non-deterministic by design. You cannot code review your way to correctness because the logic is in weights, not lines.&lt;/p&gt;

&lt;p&gt;The only way to know if your agent works is to watch it work. In development. In staging. In production. Continuously. With enough detail that you can diagnose any failure in minutes instead of days.&lt;/p&gt;

&lt;p&gt;Observability is not a nice-to-have layer you add once the core features are built. It is the development environment itself. The sooner you build this mindset, the sooner you go from shipping agents that mysteriously disappoint users to shipping agents that get measurably better every week.&lt;/p&gt;

&lt;p&gt;Your traces are the new IDE. Your eval set is the new test suite. Your quality dashboard is the new build pipeline.&lt;/p&gt;

&lt;p&gt;Build them first. Everything else gets easier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>observability</category>
      <category>agents</category>
    </item>
    <item>
      <title>Better Auth vs Clerk vs Supabase Auth: Which Should Solo Devs Pick in 2026?</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:58:20 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/better-auth-vs-clerk-vs-supabase-auth-which-should-solo-devs-pick-in-2026-mf1</link>
      <guid>https://forem.com/alexcloudstar/better-auth-vs-clerk-vs-supabase-auth-which-should-solo-devs-pick-in-2026-mf1</guid>
      <description>&lt;p&gt;Authentication is the decision you make once, forget about, and then curse quietly two years later when you need to do something the vendor does not support.&lt;/p&gt;

&lt;p&gt;I have shipped products on three different auth stacks in the last four years. Each time I picked the one that seemed obvious at the time. Each time, the trade-off I was not thinking about turned into the thing that mattered most.&lt;/p&gt;

&lt;p&gt;The landscape in 2026 is different enough that anyone building a new product should stop and think about this for a second, instead of reaching for whichever provider they used last time. The default has shifted. The self-hosted option is actually good. The pricing math has changed. And there is one provider that has quietly become the right answer for a specific kind of product that most solo devs are building.&lt;/p&gt;

&lt;p&gt;Here is how I think about the choice in 2026 between Better Auth, Clerk, and Supabase Auth. These are the three options worth seriously considering. Everyone else is either too expensive, too niche, or not worth the switching cost anymore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Decision Matters More Than It Feels Like It Does
&lt;/h2&gt;

&lt;p&gt;Auth is the most load-bearing piece of your product that you do not think about often.&lt;/p&gt;

&lt;p&gt;It sits in front of every request. It shapes your user model. It decides how you handle billing, teams, roles, sessions, invitations, password resets, and every compliance conversation you will ever have. When it is invisible it feels free. When it breaks or has to change, it is weeks of rewriting.&lt;/p&gt;

&lt;p&gt;The switching cost is where people get burned. Going from one auth provider to another means migrating user records, session tokens, password hashes (or not, if your new provider does not accept the old hash format), third-party OAuth connections, and every webhook integration downstream. Most products never switch because the cost never justifies the benefit. You live with what you picked, so picking well is worth an hour of thought.&lt;/p&gt;

&lt;p&gt;Three questions decide which provider fits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How much of the auth UI do you want to own?&lt;/li&gt;
&lt;li&gt;How much are you willing to pay per active user?&lt;/li&gt;
&lt;li&gt;What is your tolerance for lock-in?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The answers cluster into three patterns. Each provider is the right answer to one of those patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  Clerk: The Polished Default
&lt;/h2&gt;

&lt;p&gt;Clerk is the closest thing to a "just works" auth provider in 2026. You drop their components into your app, configure a few things in their dashboard, and you have a complete auth system including sign-in, sign-up, password reset, social providers, MFA, email verification, a profile UI, and a user management dashboard.&lt;/p&gt;

&lt;p&gt;The quality of the components is the thing that keeps people on Clerk. The &lt;code&gt;&amp;lt;SignIn /&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;UserProfile /&amp;gt;&lt;/code&gt; components are not just functional. They look good out of the box. They handle edge cases that people forget exist. They ship with localization, theming, and accessibility built in. You cannot build this yourself in a reasonable time frame, which is the whole value proposition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Clerk wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are building a consumer or prosumer product where auth UX matters. Sign-in friction costs you conversion. The quality of your password reset flow is a real competitive detail. The difference between a janky MFA prompt and a polished one is measurable.&lt;/p&gt;

&lt;p&gt;You want organizations, invitations, and role-based access out of the box. Clerk's org model is one of the best pre-built ones available. If your product needs teams, this saves weeks.&lt;/p&gt;

&lt;p&gt;You are okay paying for active users once you grow. Clerk's pricing starts generous and gets expensive fast. At small scale it is effectively free. At 10,000 active users on a business product, you are paying enough per month that it shows up on the accounting summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Clerk loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The lock-in is real. You do not own your user records in a useful way. You can export them, but the sessions, the verification state, and the OAuth connections all live inside Clerk. Migrating out is a project, not a weekend. This is the same pattern I covered in the &lt;a href="https://dev.to/blog/supabase-vs-firebase-2026"&gt;Supabase vs Firebase breakdown&lt;/a&gt;, and it has gotten worse for managed auth over the last two years.&lt;/p&gt;

&lt;p&gt;Pricing is unpredictable if your app has viral moments or traffic spikes. Clerk charges on monthly active users. A Product Hunt launch that brings 5,000 curious clickers with one-session visits gets counted. Some competitors count differently. Read the pricing page twice before committing.&lt;/p&gt;

&lt;p&gt;You cannot customize the flows beyond what they expose. If your product needs a non-standard sign-up experience, a unique verification step, or integration with an internal identity system, Clerk is the wrong answer. You will hit the walls of their abstraction and there is no escape hatch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supabase Auth: The Pragmatic Bundle
&lt;/h2&gt;

&lt;p&gt;Supabase Auth is the auth layer that ships with Supabase. You cannot really evaluate it in isolation, because the reason to pick it is that you are using Supabase for other things too.&lt;/p&gt;

&lt;p&gt;If you are using Supabase Postgres, Supabase Storage, or Supabase Realtime, Supabase Auth is the path of least resistance. User records live in your own Postgres database. Row-level security policies reference the authenticated user directly. The auth state flows into your realtime subscriptions automatically. It is the tightest integration available between auth and data in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Supabase Auth wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are building a product where Postgres is the database and Supabase is the backend. The integration with row-level security alone justifies the choice. You can write a policy like "users can only see their own rows" and enforce it at the database layer. No middleware, no manual check in every API route. The auth identity is a first-class concept in your data layer.&lt;/p&gt;
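&lt;p&gt;To make that concrete, here is a minimal sketch of the pattern. The table and column names (&lt;code&gt;documents&lt;/code&gt;, &lt;code&gt;owner_id&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```typescript
// One-time migration: turn on RLS and declare the policy.
// From then on, the database itself filters every query.
const ownRowsPolicy = `
  alter table documents enable row level security;

  create policy "own rows only" on documents
    for select using (auth.uid() = owner_id);
`;

// Application code needs no per-user filter at all -- the signed-in
// identity travels with the request and the policy does the rest:
//
//   const { data } = await supabase.from('documents').select('*');
//   // returns only rows where owner_id matches the current user
```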

&lt;p&gt;Your cost profile favors flat platform pricing over per-user auth pricing. Supabase charges for database size and API bandwidth. Auth itself is effectively unlimited for most projects. If you expect to grow past a few thousand active users and do not want per-user fees, this is the cheapest provider by a wide margin.&lt;/p&gt;

&lt;p&gt;You want to own your user data. Users live in a Postgres table that you can query, join, and back up like any other data. No export ritual. No vendor-shaped user model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Supabase Auth loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The UI components are functional but not polished. They look like open-source components from 2022, because that is roughly what they are. You will end up building your own sign-in pages if design quality matters for your product. That is not bad; it is just more work than Clerk requires.&lt;/p&gt;

&lt;p&gt;The organization and team features are basic. You can build teams on top of Supabase Auth, but you are building them. Invitations, role-based permissions, and multi-tenant support are all DIY. If your product needs those out of the box, you will write them yourself.&lt;/p&gt;

&lt;p&gt;Edge cases around advanced flows are where it shows its age. Some auth providers have a decade of accumulated work on rare-but-important scenarios. Supabase Auth is newer and less thorough on those. For most products this does not matter. For some it does.&lt;/p&gt;




&lt;h2&gt;
  
  
  Better Auth: The Open-Source Answer That Changed the Conversation
&lt;/h2&gt;

&lt;p&gt;Better Auth is the one that has shifted what the default answer should be in 2026. It is an open-source auth library for TypeScript, self-hosted by default, with first-class support for every framework worth caring about.&lt;/p&gt;

&lt;p&gt;It is not a managed service. You install it, configure it, and run it in your own application. It stores user data in your own database. It issues your own session tokens. There is no external service to depend on, no per-user billing, and no risk of a vendor sunsetting a feature you rely on.&lt;/p&gt;

&lt;p&gt;A year ago, the pitch for self-hosted auth was "it is cheaper and you own everything, but you will spend a month making it work." That tradeoff has changed. Better Auth is good enough out of the box that the setup cost is comparable to a managed provider, and the cost curve past the first thousand users bends in your favor forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Better Auth wins:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You are shipping a TypeScript product and have strong preferences about your stack. Better Auth gives you hooks for every step of the auth lifecycle. You can plug in your own email sender, your own session store, your own rate limiter, your own password policy. If you already have opinions, it does not fight you.&lt;/p&gt;
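&lt;p&gt;A sketch of what that configurability looks like. The option names here are from memory, so treat this as the shape of the API rather than a copy-paste config:&lt;/p&gt;

```typescript
import { betterAuth } from 'better-auth';
import { db } from './db'; // hypothetical: your own database connection

export const auth = betterAuth({
  database: db, // your Postgres, your users table
  emailAndPassword: { enabled: true },
  socialProviders: {
    github: {
      clientId: process.env.GITHUB_CLIENT_ID!,
      clientSecret: process.env.GITHUB_CLIENT_SECRET!,
    },
  },
  // Lifecycle hooks live alongside this config: your own email
  // sender, rate limiter, and session rules plug in here rather
  // than being hidden behind a vendor dashboard.
});
```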

&lt;p&gt;Your cost profile is long-tail. You are building something that could have a lot of users but not a lot of revenue per user. Newsletter tools, community products, developer tools with free tiers. Managed auth priced per active user will eat your margin. Better Auth priced per server costs the same at 100 users and 100,000 users.&lt;/p&gt;

&lt;p&gt;You want to avoid lock-in entirely. The user table is your user table. The sessions are your sessions. If Better Auth changes direction or a new library comes out that is better, you can migrate in a week because your data is already yours.&lt;/p&gt;

&lt;p&gt;You value reading the code. When something breaks, you can step through the auth library itself. When a new social provider launches, you can add it yourself without waiting for a vendor. This is the same reason a lot of developers prefer &lt;a href="https://dev.to/blog/drizzle-orm-vs-prisma-2026"&gt;Drizzle over Prisma&lt;/a&gt; in the ORM layer. Owning the layer means you can fix it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Better Auth loses:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You need to operate it. That means thinking about session storage, rate limiting, monitoring, and making sure your database migrations do not lock the users table during a busy hour. None of this is hard. All of it is work that Clerk does for you and Better Auth does not.&lt;/p&gt;

&lt;p&gt;No polished components. You are building your own sign-in UI. This is fine if you have taste and time. If you are a backend developer shipping a product alone and design is not your thing, the quality of your auth pages will show. Clerk wins on this dimension, cleanly.&lt;/p&gt;

&lt;p&gt;The ecosystem is newer. Integrations with specific services, documentation for obscure edge cases, and Stack Overflow answers to weird problems are thinner than Clerk or Supabase. You will occasionally have to read source code or ask in a Discord. That is fine for most developers and a dealbreaker for some.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Framework I Actually Use
&lt;/h2&gt;

&lt;p&gt;The marketing pitches all sound good. Here is how I pick between the three for real projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the product is UI-sensitive and growth-sensitive, pick Clerk.&lt;/strong&gt; Consumer products, prosumer tools, anything where sign-in friction visibly matters. Pay the per-user cost for the better conversion and faster launch. The expensive auth bill later is a sign the product is working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the stack is already Supabase or Postgres with row-level security, pick Supabase Auth.&lt;/strong&gt; The database integration is worth more than any other feature. Do not fight it. Use the provider that is already in your data layer. This is especially true for products where data access is the core complexity and auth is a supporting cast member.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If the product is TypeScript, margin-sensitive, and you have at least some taste for design, pick Better Auth.&lt;/strong&gt; Newsletter products, community tools, developer platforms, internal apps, anything where per-user pricing at scale would hurt. The setup cost is a weekend. The ownership is permanent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fourth option I use sometimes:&lt;/strong&gt; start with a managed provider, migrate to Better Auth when the cost crosses a threshold. Clerk for the first six months. Better Auth once you have validated the product and growth is real enough to justify the migration project. This has the highest optionality but requires the discipline to actually migrate when the time comes. Most people do not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lock-In Math
&lt;/h2&gt;

&lt;p&gt;The part nobody talks about clearly is the cost of leaving each provider in two years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Clerk&lt;/strong&gt; means exporting your users, which you can do, and then figuring out how to handle sessions, OAuth connections, and password hashes in whatever you migrate to. Clerk hashes passwords with bcrypt, which most systems accept. OAuth connections and MFA factors do not transfer cleanly. In practice, a Clerk migration means asking every user to re-verify and reconnect social providers. That is a real UX cost and a real drop in active users during the transition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Supabase Auth&lt;/strong&gt; is easier on paper because the data is in your Postgres. In practice, Supabase Auth uses its own password hashing and session model, and the hashes can be migrated with care. You lose the integration with RLS policies that referenced the auth identity, so any migration needs to rethink the data access layer. The users table is yours. The glue is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaving Better Auth&lt;/strong&gt; is trivial by comparison. Your user table is your user table. You can plug a different auth library into the same schema. Sessions live in your database. There is no meaningful lock-in to unwind.&lt;/p&gt;

&lt;p&gt;If you are building something long-term and not sure what you will need in three years, lower lock-in is more valuable than lower effort today. That is the argument for Better Auth in any situation where the other providers do not have a clear advantage.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About Auth0, Firebase Auth, and Everyone Else?
&lt;/h2&gt;

&lt;p&gt;Worth a quick mention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auth0&lt;/strong&gt; is enterprise auth. It is priced for companies with budgets, not solo developers. If you are building a business product that will sell to enterprises, it is reasonable. For everything else, it is overkill. Okta acquired Auth0 years ago, which added polish and also added price.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firebase Auth&lt;/strong&gt; is still fine, but Firebase as a whole has lost momentum relative to Supabase for new projects. Google has kept it alive but not modernized it in a way that keeps pace with what developers want in 2026. I would not start a new project on it unless I was already committed to Firebase elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NextAuth / Auth.js&lt;/strong&gt; is in a weird place. It pioneered the open-source TypeScript auth space but has struggled with direction changes and breaking upgrades. Better Auth is the spiritual successor and has captured most of the energy NextAuth had two years ago. If you are on NextAuth and it works, fine. For a new project in 2026, Better Auth is the more active and better-designed choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WorkOS&lt;/strong&gt; is SSO-first, priced for B2B companies shipping to enterprises. If you need SAML, SCIM, and enterprise SSO from day one, it is the right answer. For consumer or prosumer products, it is the wrong shape.&lt;/p&gt;

&lt;p&gt;Everyone else is either a niche solution, an abandoned project, or marketing material. The three I covered in detail cover the real decision space for solo developers in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Thing I Wish Someone Had Told Me Earlier
&lt;/h2&gt;

&lt;p&gt;The dimension I underweighted every time I picked auth is how much the provider shapes the way I think about users.&lt;/p&gt;

&lt;p&gt;On Clerk, I think about users as records in their system that I reference by ID. On Supabase Auth, users are rows in a table I can query. On Better Auth, users are whatever my application says they are. These feel like implementation details until you are a year in and trying to do something the provider did not anticipate, like merging accounts, supporting a weird login method, or building a multi-tenant feature where one user belongs to many workspaces.&lt;/p&gt;

&lt;p&gt;Providers that own less of your user model give you more flexibility later. Providers that own more give you a faster start. Neither is wrong. Both have a price.&lt;/p&gt;

&lt;p&gt;The mistake I made twice was picking the faster start, without noticing that the flexibility cost was being paid later in cramped, frustrated workarounds. For my next product, I am picking the flexibility. Your calculus may be different, but this is the dimension most people do not weight properly.&lt;/p&gt;

&lt;p&gt;Pick auth like you pick a database. It is going to be there for a long time. The switching cost is higher than the setup cost. The feature differences matter less than the shape of what you can build on top of it.&lt;/p&gt;

&lt;p&gt;Whatever you pick, make sure you understand what you are actually committing to. If you cannot describe in one sentence what you would do to migrate off your auth provider, you do not understand the commitment you are making. That is true at 10 users and it is true at 10,000. The difference is only how much work it is to fix once it matters.&lt;/p&gt;

&lt;p&gt;For most solo developers shipping in 2026, my default is now Better Auth with Postgres and a simple sessions table. Clerk when UI polish is a business requirement. Supabase Auth when the whole product is already running on Supabase. That is the shape of the decision in one paragraph. The rest is detail. The &lt;a href="https://dev.to/blog/shipping-speed-only-strategy-2026"&gt;shipping speed question&lt;/a&gt; will push you toward whichever one lets you move fastest on your specific product. Use that instinct, but weight the long-term lock-in cost at least as much.&lt;/p&gt;

</description>
      <category>devtools</category>
      <category>startup</category>
      <category>saas</category>
    </item>
    <item>
      <title>AI SDK v6: The Practical Guide to Shipping AI Features Without Vendor Lock-In (2026)</title>
      <dc:creator>Alex Cloudstar</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:58:19 +0000</pubDate>
      <link>https://forem.com/alexcloudstar/ai-sdk-v6-the-practical-guide-to-shipping-ai-features-without-vendor-lock-in-2026-17m0</link>
      <guid>https://forem.com/alexcloudstar/ai-sdk-v6-the-practical-guide-to-shipping-ai-features-without-vendor-lock-in-2026-17m0</guid>
      <description>&lt;p&gt;I spent most of last year bolting AI features onto products the wrong way.&lt;/p&gt;

&lt;p&gt;Direct provider SDK. Hardcoded model string. A streaming response handler I copy-pasted from a blog post. It worked. It also meant that every time a new model came out, switching took half a day of untangling types and rewriting stream parsers that I did not remember writing in the first place.&lt;/p&gt;

&lt;p&gt;The AI SDK v6 is the fix I wish I had a year ago.&lt;/p&gt;

&lt;p&gt;If you have been putting off building AI features into your product because the ecosystem felt chaotic, or if you tried the AI SDK a year ago and bounced off it, this is the update worth coming back to. The abstractions finally match how people actually build. The streaming story is coherent. Provider switching is a one-line change. And the tool-use and agent patterns are good enough that you can ship real features instead of demos.&lt;/p&gt;

&lt;p&gt;Here is what actually changed, what it unlocks, and the patterns I use day to day now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the AI SDK Actually Is
&lt;/h2&gt;

&lt;p&gt;Before the v6 specifics, a quick grounding for anyone who has heard the name and never used it.&lt;/p&gt;

&lt;p&gt;The AI SDK is a TypeScript library that gives you one API for talking to every major model provider. You write your code once against the SDK. Underneath, it talks to OpenAI, Anthropic, Google, Mistral, open-source models via Ollama or Together, and anything else with a compatible adapter. Switching providers is a string change, not a rewrite.&lt;/p&gt;

&lt;p&gt;It also handles the parts of AI work that are annoying to get right: streaming tokens to the browser, tool calls, structured output, chat state, retries, tracing. You can write those yourself. I have. You are not going to do a better job than the SDK for most products, and you will spend weeks on plumbing that does not differentiate your product from anyone else's.&lt;/p&gt;

&lt;p&gt;The v6 release refined all of that and added a few things that change what is reasonable to build solo.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is New in AI SDK v6
&lt;/h2&gt;

&lt;p&gt;No single headline change is dramatic on its own; the cumulative effect is. Individually, each improvement looks like a polish pass. Used together, they change the shape of what feels worth building.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, the provider abstraction got simpler.&lt;/strong&gt; You used to import a provider package and configure it. Now you can pass a plain &lt;code&gt;"provider/model"&lt;/code&gt; string through the AI Gateway and the SDK handles the wiring. Switching from &lt;code&gt;"anthropic/claude-opus-4-7"&lt;/code&gt; to &lt;code&gt;"openai/gpt-5"&lt;/code&gt; is a string edit. Fallbacks between providers are first-class. If you have been watching the model landscape thrash around and wanted to avoid committing to one, this is the feature that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, tools and agents are real primitives.&lt;/strong&gt; The &lt;code&gt;tool()&lt;/code&gt; helper and the agent loop are good enough that you can build agentic features without importing a separate framework. I used to reach for LangChain for anything with more than two steps. I do not anymore. The SDK's agent pattern is simpler, more debuggable, and stays out of your way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, streaming got cleaner.&lt;/strong&gt; The &lt;code&gt;streamText&lt;/code&gt;, &lt;code&gt;streamObject&lt;/code&gt;, and &lt;code&gt;streamUI&lt;/code&gt; APIs converged on a consistent shape. The React hooks (&lt;code&gt;useChat&lt;/code&gt;, &lt;code&gt;useObject&lt;/code&gt;, &lt;code&gt;useCompletion&lt;/code&gt;) work against the same streaming protocol the server sends. If you are using Next.js or any other framework, the end-to-end flow is the most boring it has ever been, which is the highest praise an AI streaming story has ever deserved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, structured output is actually trustworthy.&lt;/strong&gt; &lt;code&gt;generateObject&lt;/code&gt; and &lt;code&gt;streamObject&lt;/code&gt; with a Zod schema produce outputs that match your types. Not "probably match." Match. The SDK retries and reprompts if the model drifts. You get validated TypeScript objects out the other side, and you can rely on them in your product code without a second layer of defensive parsing.&lt;/p&gt;
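&lt;p&gt;A sketch of the shape, with an illustrative schema and model string:&lt;/p&gt;

```typescript
import { generateObject } from 'ai';
import { z } from 'zod';

const { object } = await generateObject({
  model: 'anthropic/claude-opus-4-7',
  schema: z.object({
    sentiment: z.enum(['positive', 'neutral', 'negative']),
    summary: z.string(),
  }),
  prompt: 'Classify and summarize this review: ...',
});

// `object` is a validated { sentiment, summary } -- typed, no JSON.parse,
// no defensive checks. If the model drifted, the SDK already retried.
```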

&lt;p&gt;&lt;strong&gt;Fifth, observability is built in.&lt;/strong&gt; OpenTelemetry tracing is not an afterthought. You can see every prompt, every model call, every tool invocation, and every retry in a tracing UI without writing your own logger. When something goes wrong in production, you can actually see what happened.&lt;/p&gt;
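&lt;p&gt;Enabling it is one option on the call. A sketch (the &lt;code&gt;functionId&lt;/code&gt; label is your own naming):&lt;/p&gt;

```typescript
import { streamText } from 'ai';

const result = streamText({
  model: 'anthropic/claude-opus-4-7',
  messages,
  // Emits OpenTelemetry spans for the model call, each tool
  // invocation, and retries; your existing OTel exporter picks them up.
  experimental_telemetry: { isEnabled: true, functionId: 'support-chat' },
});
```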

&lt;p&gt;These are the load-bearing changes. Everything else is cleanup around the edges.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Mental Model That Makes the SDK Click
&lt;/h2&gt;

&lt;p&gt;The thing that took me too long to internalize is that the AI SDK is not trying to be a framework. It is trying to be a standard library for AI features, the way &lt;code&gt;fetch&lt;/code&gt; is a standard library for HTTP.&lt;/p&gt;

&lt;p&gt;Once you look at it that way, the API makes sense. There are four core functions you use 90% of the time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;generateText&lt;/code&gt; when you want a string back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;streamText&lt;/code&gt; when you want to stream a string back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generateObject&lt;/code&gt; when you want a typed object back&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;streamObject&lt;/code&gt; when you want to stream a typed object back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything else is a variation on those four themes. Tools attach to any of them. Agents are &lt;code&gt;streamText&lt;/code&gt; in a loop. Chat is &lt;code&gt;streamText&lt;/code&gt; with message history threaded through. Structured output is &lt;code&gt;generateObject&lt;/code&gt; with a schema.&lt;/p&gt;

&lt;p&gt;If you understand the four core functions, you understand the SDK. The rest is ergonomics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Starting From Scratch: The Minimum Viable Setup
&lt;/h2&gt;

&lt;p&gt;Here is the smallest example that does something real. A Next.js App Router route that streams a chat response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// app/api/chat/route.ts&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toDataStreamResponse&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole backend, under a dozen lines. On the client side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;handleSubmit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useChat&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt; &lt;span class="na"&gt;onSubmit&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleSubmit&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;: &lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;input&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="na"&gt;onChange&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;handleInputChange&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt; &lt;span class="p"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;form&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the part that makes the SDK worth using. The hook knows the protocol. The protocol is standardized. The server streams. The client renders. No custom fetch logic, no SSE parser, no state machine to debug.&lt;/p&gt;

&lt;p&gt;You can build a working chat UI in about ten minutes. That is not hype. It is the first time I have been able to write that sentence honestly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Provider Switching Without the Wincing
&lt;/h2&gt;

&lt;p&gt;One of the realities of shipping AI features in 2026 is that the best model for your use case changes every few weeks. GPT leads on one benchmark, Claude leads on another, and a Chinese open-source model nobody had heard of last month is suddenly competitive for half the price.&lt;/p&gt;

&lt;p&gt;If your code is coupled to one provider, you either ignore those changes and fall behind or you eat the rewrite cost every time you switch. Both options are bad.&lt;/p&gt;

&lt;p&gt;The AI SDK solves this by making model selection a configuration value rather than a structural dependency. With the Vercel AI Gateway, you can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CHAT_MODEL&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now switching models is an environment variable change. No code deploy. No provider package swap. If you want A/B testing between providers, you can set the model per request based on user ID, cohort, or feature flag.&lt;/p&gt;
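&lt;p&gt;The per-request version is a few lines. A sketch of a stable cohort split (the hash and the model strings are illustrative):&lt;/p&gt;

```typescript
// Deterministic cohort picker: the same user always lands on the
// same model, so their experience is consistent across sessions.
function modelForUser(userId: string): string {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 2 === 0 ? 'anthropic/claude-opus-4-7' : 'openai/gpt-5';
}

// In the route handler:
//   const result = streamText({ model: modelForUser(user.id), messages });
```

&lt;p&gt;Because the hash is stable, a user never flips between models mid-conversation, which also keeps the A/B data clean.&lt;/p&gt;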

&lt;p&gt;This is more important than it sounds. It means your product is not tied to the fortunes of any single lab. If Anthropic raises prices or OpenAI ships a better model, you can move without an engineering project. That is the main reason I default to gateway-style model strings now rather than direct provider packages, even though both work.&lt;/p&gt;

&lt;p&gt;The only time I reach for a provider-specific package is when I need a feature that is not in the gateway abstraction yet, like specific fine-tuning hooks. For 95% of product work, the gateway is the right default.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tools: Where the SDK Stops Being a Demo Library
&lt;/h2&gt;

&lt;p&gt;The real leap for products comes from tools. If you have only used the SDK for chat completions, you have seen the easy half. Tools are what turn a language model into something that can do work in your application.&lt;/p&gt;

&lt;p&gt;The pattern is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchOrders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Find orders for the current user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pending&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;shipped&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delivered&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findMany&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentUser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;streamText&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;searchOrders&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;maxSteps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model decides when to call the tool. You provide the schema and the implementation. The SDK handles the back-and-forth protocol, validates the arguments, calls your function, feeds the result back to the model, and continues the conversation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;maxSteps&lt;/code&gt; parameter is important. Without it, the model calls tools exactly once and stops. With it, you get multi-step reasoning. The model can call a tool, see the result, decide to call another tool, and keep going until it has what it needs to answer.&lt;/p&gt;

&lt;p&gt;This is where the line between "chatbot with API calls" and "agent" starts to blur. If you set &lt;code&gt;maxSteps&lt;/code&gt; to 10 and give the model a few well-designed tools, you have built an agent. There is no separate agent framework to learn. The surface area is the same as the chat surface area, with tools attached.&lt;/p&gt;
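&lt;p&gt;To make the multi-step loop concrete, here is a toy simulation of what the SDK does for you on each step. The &lt;code&gt;ModelTurn&lt;/code&gt; shape and the function names are invented for illustration; the real SDK manages this protocol internally.&lt;/p&gt;

```typescript
// A toy simulation of the multi-step loop: call the model, execute any tool
// call it requests, feed the result back into the history, and stop either
// when the model answers with text or when the step budget runs out.
type ModelTurn =
  | { type: "tool-call"; tool: string; args: unknown }
  | { type: "text"; text: string };

async function runSteps(
  model: (history: unknown[]) => Promise<ModelTurn>,
  tools: Record<string, (args: unknown) => Promise<unknown>>,
  maxSteps: number,
): Promise<string> {
  const history: unknown[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(history);
    if (turn.type === "text") return turn.text; // model is done answering
    const result = await tools[turn.tool](turn.args); // run the requested tool
    history.push({ tool: turn.tool, result }); // feed the result back
  }
  return ""; // step budget exhausted before a final answer
}
```

&lt;p&gt;The point of the sketch is the shape: every tool result goes back into the conversation, and the model gets another chance to respond, up to the step budget.&lt;/p&gt;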

&lt;p&gt;I covered the broader question of what agent memory and state management looks like in &lt;a href="https://dev.to/blog/ai-agent-memory-state-persistence-2026"&gt;my guide on AI agent memory&lt;/a&gt; if you want to go deeper on the stateful side.&lt;/p&gt;




&lt;h2&gt;
  
  
  Structured Output: The Feature That Changes What You Build
&lt;/h2&gt;

&lt;p&gt;Most of the AI features I see in products do not need chat at all. They need a structured result. Classify this ticket. Extract the invoice fields. Summarize this page into three bullet points. Generate a title, a description, and three tags for this upload.&lt;/p&gt;

&lt;p&gt;For those, &lt;code&gt;generateObject&lt;/code&gt; is the function you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;object&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Summarize this article: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;article&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// object is typed as { title: string; tags: string[]; summary: string }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You pass a Zod schema. You get back a validated object that matches it. The SDK handles the prompting, retries invalid outputs, and gives you something your TypeScript compiler is happy with.&lt;/p&gt;

&lt;p&gt;This changes what is worth building. A year ago, adding an AI feature to a product meant writing a prompt, parsing freeform text, and dealing with edge cases where the model wrapped its response in markdown or added commentary. Today, it means writing a schema and a prompt.&lt;/p&gt;

&lt;p&gt;The reliability question matters here. If you have tried this before and been burned by the model returning invalid output, the v6 retry logic is meaningfully better. The SDK reprompts with the validation error included, and modern models are good at correcting themselves on the second pass. In my experience, structured output with a reasonable schema succeeds on the first try over 95% of the time, and the retry catches most of the rest.&lt;/p&gt;

&lt;p&gt;If your schema is extremely strict or the task is ambiguous, you can still get failures. Keep schemas lenient where they can be, and design prompts so the model has room to succeed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Streaming UI: When You Want More Than Text
&lt;/h2&gt;

&lt;p&gt;For a long time, AI output in apps meant streaming text into a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;. That is still the right answer for chat. For more structured experiences, the SDK gives you two better options.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;streamObject&lt;/code&gt; streams a structured object as it is being generated. You see partial data arrive as JSON fields fill in. If you are generating a form, a table, or a card layout, this is the right primitive. The user sees the skeleton fill in rather than waiting for the whole thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;partialObjectStream&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;streamObject&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic/claude-opus-4-7&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;sections&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;heading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;})),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Generate a blog post about...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;partial&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;partialObjectStream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// { title: "The ...", sections: [{ heading: "Why..." }] }&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;streamUI&lt;/code&gt; (in frameworks that support it) lets you stream actual React components. The model picks which component to render and what props to pass. You define the components. This is the shape of the experience if you have used v0 or similar tools. It is powerful and it is niche. For 90% of products, &lt;code&gt;streamObject&lt;/code&gt; plus your own rendering layer is simpler and easier to reason about.&lt;/p&gt;
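&lt;p&gt;As a sketch of that "own rendering layer" approach, assuming the blog-post schema from the &lt;code&gt;streamObject&lt;/code&gt; example above: render whatever has arrived so far, and use placeholders for fields that have not streamed in yet. The function and type names here are my own, not SDK exports.&lt;/p&gt;

```typescript
// Hypothetical partial shape matching the streamObject schema above. Every
// field may be missing mid-stream, so everything is optional.
type PartialPost = {
  title?: string;
  sections?: Array<{ heading?: string; body?: string }>;
};

// Render whatever has arrived so far, with placeholders for missing fields,
// so the UI shows a skeleton that fills in as the stream progresses.
function renderPartial(post: PartialPost): string {
  const title = post.title ?? "…";
  const sections = (post.sections ?? [])
    .map((s) => `## ${s.heading ?? "…"}\n${s.body ?? ""}`)
    .join("\n\n");
  return `# ${title}\n\n${sections}`;
}
```

&lt;p&gt;Call this on each value from &lt;code&gt;partialObjectStream&lt;/code&gt; and re-render; the same idea maps directly onto React state updates.&lt;/p&gt;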




&lt;h2&gt;
  
  
  The Patterns That Keep Working
&lt;/h2&gt;

&lt;p&gt;After a year of building features with the SDK, a few patterns have settled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put the model string in config, not code.&lt;/strong&gt; Do this even if you have no plans to switch. Future-you will thank present-you when a better model ships and you want to try it in five minutes instead of an afternoon.&lt;/p&gt;
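&lt;p&gt;A minimal sketch of what that looks like, assuming an &lt;code&gt;AI_MODEL&lt;/code&gt; environment variable; the variable name and the fallback id are my conventions, not anything the SDK prescribes:&lt;/p&gt;

```typescript
// Resolve the model id from configuration with a fallback, so swapping
// models is a config change rather than a code change.
function resolveModel(env: Record<string, string | undefined>): string {
  return env.AI_MODEL ?? "anthropic/claude-opus-4-7";
}

// At startup: const MODEL_ID = resolveModel(process.env);
// Then every call site uses the constant instead of a hardcoded string:
//   streamText({ model: MODEL_ID, messages, ... })
```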

&lt;p&gt;&lt;strong&gt;Start with &lt;code&gt;generateObject&lt;/code&gt;, not &lt;code&gt;generateText&lt;/code&gt;.&lt;/strong&gt; Every time I have written &lt;code&gt;generateText&lt;/code&gt; in product code, I have eventually rewritten it as &lt;code&gt;generateObject&lt;/code&gt; because I needed structure. Skip the intermediate step. If the output is going anywhere other than a chat bubble, it should be structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use tools sparingly and name them well.&lt;/strong&gt; A model with three well-named tools outperforms a model with fifteen confusingly named ones. Each tool is a decision point for the model. More tools means more chances to pick wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set &lt;code&gt;maxSteps&lt;/code&gt; on every agentic call.&lt;/strong&gt; The default is 1, which is safe. Pick a number that matches your use case. Higher means more capability and more cost. I usually start at 5 and adjust from there based on traces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add tracing before you need it.&lt;/strong&gt; Enable OpenTelemetry from day one. The cost of setup is an hour. The value the first time something goes wrong in production is weeks. Observability for AI features is not optional if you care about reliability. I go deeper on this in &lt;a href="https://dev.to/blog/production-observability-solo-developer-2026"&gt;my production observability guide for solo developers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat model output as untrusted input.&lt;/strong&gt; Sanitize anything you send to the browser, the database, or another system. The model will sometimes return something weird. That is not a bug in the model. It is the nature of the work. Validate at the boundary.&lt;/p&gt;
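&lt;p&gt;A hand-rolled guard as a sketch of what "validate at the boundary" means in practice: even after &lt;code&gt;generateObject&lt;/code&gt;, re-check anything model-derived before it crosses into another system. The &lt;code&gt;TagResult&lt;/code&gt; shape and limits are made up for illustration.&lt;/p&gt;

```typescript
// Hypothetical shape for a model-generated tagging result.
type TagResult = { title: string; tags: string[] };

// Returns true only if the value has exactly the shape and bounds we expect.
// Anything else gets rejected before it touches the database or the browser.
function isSafeTagResult(value: unknown): value is TagResult {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === "string" &&
    v.title.length <= 80 &&
    Array.isArray(v.tags) &&
    v.tags.length <= 5 &&
    v.tags.every((t) => typeof t === "string")
  );
}
```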




&lt;h2&gt;
  
  
  What It Costs, and How to Keep That Under Control
&lt;/h2&gt;

&lt;p&gt;The question that kills more AI features than anything else is not "does it work?" It is "does it pay for itself?"&lt;/p&gt;

&lt;p&gt;The AI SDK does not change the cost of the models you call. A Claude request costs what a Claude request costs. What it does give you is the tooling to keep those costs visible and manageable.&lt;/p&gt;

&lt;p&gt;Three things keep my costs predictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching identical prompts.&lt;/strong&gt; If the same prompt is going to run many times, cache the result. The SDK has integrations for this. You can also do it yourself with a hash of the input. For any feature where the input space is bounded, caching is free performance and free money saved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using cheap models for cheap tasks.&lt;/strong&gt; Not every AI call needs the biggest model. Classification tasks, simple extraction, and routing logic work fine on smaller, cheaper models. I default to the big model only for tasks that need real reasoning. Everything else goes to the cheap tier.&lt;/p&gt;
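&lt;p&gt;In code, this is nothing more than a routing function in front of your calls. The task names and model ids below are assumptions for illustration, not SDK constructs:&lt;/p&gt;

```typescript
// Route each task type to the cheapest model that handles it well.
type Task = "classify" | "extract" | "route" | "reason";

function modelFor(task: Task): string {
  if (task === "reason") {
    return "anthropic/claude-opus-4-7"; // big model only for real reasoning
  }
  return "anthropic/claude-haiku-4"; // cheap tier for mechanical tasks
}
```

&lt;p&gt;Combined with the config pattern above, the routing table becomes the one place you tune when pricing or model quality changes.&lt;/p&gt;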

&lt;p&gt;&lt;strong&gt;Rate limiting and spend caps per user.&lt;/strong&gt; If your product has AI features available to users, set limits. A single user burning through your budget because they found a prompt injection or a runaway loop is a pattern I have seen too many times. The AI Gateway has spend caps built in. Use them. I wrote about this in more detail in &lt;a href="https://dev.to/blog/llm-cost-optimization-production-2026"&gt;LLM cost optimization for production&lt;/a&gt;.&lt;/p&gt;
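&lt;p&gt;If you are not on the Gateway, the minimum viable version is a per-user spend tracker checked before each call. A sketch with made-up numbers; real per-call costs come from the usage data the SDK returns, and real state belongs in a shared store, not process memory:&lt;/p&gt;

```typescript
// Track cumulative spend per user and reject calls past the cap.
const spentCents = new Map<string, number>();
const CAP_CENTS = 500; // hypothetical cap: $5 per user per billing period

function recordAndCheck(userId: string, costCents: number): boolean {
  const total = (spentCents.get(userId) ?? 0) + costCents;
  if (total > CAP_CENTS) return false; // over cap: reject, do not record
  spentCents.set(userId, total);
  return true;
}
```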

&lt;p&gt;Once you have those three patterns in place, AI features become a predictable line item rather than a variance risk. That is the state you want to be in if you are shipping anything that runs in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Not to Use the AI SDK
&lt;/h2&gt;

&lt;p&gt;It is a boring take, but worth saying out loud. The SDK is not the right choice for everything.&lt;/p&gt;

&lt;p&gt;If you are building a pure chatbot against one provider and you are certain you will never switch, you can get by with that provider's SDK directly. You will save one layer of indirection and lose some TypeScript ergonomics. For most products this is a wash. For extremely minimal integrations, it is simpler.&lt;/p&gt;

&lt;p&gt;If you need a feature that only one provider has and the SDK has not abstracted it yet, use the provider SDK. You can still use the AI SDK for 90% of calls and drop down to the raw SDK for the 10% that need the specific capability.&lt;/p&gt;

&lt;p&gt;If you are doing heavy fine-tuning, custom inference, or deploying your own models, the SDK is not really the layer you need. It is a client library, not a model ops platform.&lt;/p&gt;

&lt;p&gt;For everything else, which is most of what anyone is shipping, the AI SDK is the right default.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reason This Matters More in 2026 Than It Did Last Year
&lt;/h2&gt;

&lt;p&gt;The economics of AI features changed in the last twelve months.&lt;/p&gt;

&lt;p&gt;A year ago, shipping a real AI feature meant spending two weeks on plumbing for every week spent on the feature itself. Streaming, tool calls, retries, observability, switching providers, handling structured output. Each one was a small project. Together they added up to a tax that made AI features feel expensive relative to what they delivered.&lt;/p&gt;

&lt;p&gt;Today, the plumbing is solved. You write a prompt, a schema, maybe a tool, and you ship. The SDK absorbs the infrastructure work. That means the economics tilt back toward the feature itself. You can prototype in a day. You can ship in a week. You can iterate on prompts and models without touching architecture.&lt;/p&gt;

&lt;p&gt;This is the unglamorous version of the AI revolution, and it is the one that actually changes what gets built. Not the demos. The features that ship in products your users never think of as "AI features" because they just work.&lt;/p&gt;

&lt;p&gt;If you have been watching the AI space and waiting for the right moment to actually build, the tooling has caught up. The remaining blocker is picking a problem worth solving. That part is on you.&lt;/p&gt;

&lt;p&gt;For what it is worth, the problems worth solving right now are boring ones. Automated tagging. Smarter search. Better onboarding. Things that were impossible or too expensive to build with traditional code are now trivial. Pick one of those, ship it in a week, and see what it does for your product before you try to build anything more ambitious.&lt;/p&gt;

&lt;p&gt;The tools are ready. The cost is manageable. The patterns are clear. The only thing left is building the thing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
