<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Austin Vance</title>
    <description>The latest articles on Forem by Austin Vance (@austinbv).</description>
    <link>https://forem.com/austinbv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305023%2Fc978f899-9fa5-4b1e-9ad9-e3f60313fd65.jpeg</url>
      <title>Forem: Austin Vance</title>
      <link>https://forem.com/austinbv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/austinbv"/>
    <language>en</language>
    <item>
      <title>AI Agent Authentication Starts With Workload Identity | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 13 May 2026 14:55:56 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</link>
      <guid>https://forem.com/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</guid>
      <description>&lt;p&gt;AI agent authentication starts when the system can answer which actor is allowed to make a tool call.&lt;/p&gt;

&lt;p&gt;The model can propose the action. The runtime has to attach authority to it.&lt;/p&gt;

&lt;p&gt;Most teams start with the fastest answer: an API key in an environment variable. The agent reaches Salesforce, GitHub, Jira, Snowflake, Stripe, whatever system makes the first useful proof feel real, and everyone moves on.&lt;/p&gt;

&lt;p&gt;That proof matters. It shows the agent can reach the systems where work actually happens. It also hides the first product decision: who is acting when the tool call leaves the runtime?&lt;/p&gt;

&lt;p&gt;The agent gets memory. The agent runs in the background. The agent forks into subagents. The agent retries failed operations. The agent calls tools after the user has walked away. The agent lands in an enterprise workflow where the work has value, the logs have value, and breaking something has a consequence.&lt;/p&gt;

&lt;p&gt;A shared API key starts as configuration. Then it quietly becomes the identity of the agent.&lt;/p&gt;

&lt;p&gt;An ugly place to stumble into by accident.&lt;/p&gt;

&lt;h2&gt;The secret becomes the actor&lt;/h2&gt;

&lt;p&gt;Early security models for agents tend toward good vibes with a bearer token. The prompt gives instructions. The tool schema lists calls. Hard-coded secrets in the runtime decide what actually gets done based on the input, the agent, and whatever authority those secrets carry.&lt;/p&gt;

&lt;p&gt;The secret wins.&lt;/p&gt;

&lt;p&gt;If the same key can read every customer record, submit refunds, update tickets, and write to production data, the agent has all of those powers. Carefulness in the prompt is theater at that point. The tool description can say those powers apply only when appropriate. The audit log will still show one credential able to perform a pile of different tasks.&lt;/p&gt;

&lt;p&gt;There is already a category for this outside agents: &lt;a href="https://owasp.org/www-project-non-human-identities-top-10/" rel="noopener noreferrer"&gt;OWASP's Non-Human Identities Top 10&lt;/a&gt;. Production applications already authenticate as non-human identities. Agents are joining that growing list of stranger workloads: they run differently than normal services but still need access to systems and data.&lt;/p&gt;

&lt;p&gt;The important step for me is naming the agent as a workload, because the architecture gets less magical and more useful.&lt;/p&gt;

&lt;p&gt;Workloads have identities. Workloads can request scoped credentials for those identities. A workload can be denied a credential. A workload can rotate credentials. A workload can leave an audit trail that survives the model, the prompt, and the v2 or v3 abstraction barrier the team is currently working around.&lt;/p&gt;

&lt;p&gt;That is baseline authentication for production AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" alt="A runtime identity boundary showing an agent requesting scoped credentials from an identity broker before calling external systems." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runtime should issue tool-specific credentials instead of letting the agent carry a shared key everywhere.&lt;/p&gt;

&lt;h2&gt;Workload identity is the boring answer&lt;/h2&gt;

&lt;p&gt;This part is old. Good.&lt;/p&gt;

&lt;p&gt;Kubernetes already considers service accounts to be identities of processes running in Pods, and the current docs describe &lt;a href="https://kubernetes.io/docs/concepts/security/service-accounts/" rel="noopener noreferrer"&gt;short-lived, automatically rotating ServiceAccount tokens&lt;/a&gt; issued through the TokenRequest API. SPIFFE generalizes that into workload identity documents, including &lt;a href="https://spiffe.io/docs/latest/spiffe-about/spiffe-concepts/" rel="noopener noreferrer"&gt;short-lived X.509 and JWT SVIDs&lt;/a&gt; that a workload can use to authenticate itself to other workloads.&lt;/p&gt;
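
&lt;p&gt;As a sketch of what that looks like from code, here is a short-lived token request using the official Kubernetes Python client; the service account name, namespace, and audience below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hedged sketch: request a short-lived ServiceAccount token via the
# TokenRequest API. Names, namespace, and audience are illustrative.
from kubernetes import client, config

config.load_incluster_config()  # authenticate as the Pod's own identity

token_request = client.AuthenticationV1TokenRequest(
    spec=client.V1TokenRequestSpec(
        audiences=["https://broker.internal"],  # hypothetical audience
        expiration_seconds=600,                 # short-lived by construction
    )
)
resp = client.CoreV1Api().create_namespaced_service_account_token(
    name="agent-runtime", namespace="agents", body=token_request
)
print(resp.status.token)  # expires in ten minutes; rotation is free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;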

&lt;p&gt;Cloud platforms are heading in the same general direction. AWS STS can &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html" rel="noopener noreferrer"&gt;issue temporary security credentials&lt;/a&gt; after a workload has identified itself using OpenID Connect. Google Cloud Workload Identity Federation allows external workloads to &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;access Google Cloud resources without service account keys&lt;/a&gt;. Azure managed identity docs describe workload identities as &lt;a href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview" rel="noopener noreferrer"&gt;machine and non-human identities&lt;/a&gt; associated with compute resources.&lt;/p&gt;

&lt;p&gt;The industry knows how to keep long-lived secrets out of the hot path. It just keeps giving agents interfaces that make the old mistake easy.&lt;/p&gt;

&lt;p&gt;A developer writes a tool wrapper. The tool wrapper needs credentials. The fastest way to configure it is to add an API key to an environment variable and add a TODO to remove it later. The TODO gets pushed to production because now the agent answers support tickets, reconciles invoices, or looks at CI.&lt;/p&gt;

&lt;p&gt;I've worked with teams who reviewed the model, tuned prompts, drew diagrams for tool selection, created a few secrets in deploy config, and crossed their fingers that the tool descriptions would shore it all up.&lt;/p&gt;

&lt;p&gt;Prompts and tool descriptions are not enough.&lt;/p&gt;

&lt;h2&gt;Delegation is the missing primitive&lt;/h2&gt;

&lt;p&gt;In many applications, the agent should rarely hold the credential it uses to act.&lt;/p&gt;

&lt;p&gt;Put an identity assertion in the flow. This agent. This tenant. This user context if present. This policy version. This tool request. This approval state. That assertion is exchanged for a credential only when the action needs one.&lt;/p&gt;
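
&lt;p&gt;A minimal sketch of what that assertion can carry; the field names are illustrative, not a standard schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative shape for the identity assertion described above.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IdentityAssertion:
    agent_id: str            # this agent
    run_id: str              # this run
    tenant: str              # this tenant
    user_id: Optional[str]   # this user context, if present
    policy_version: str      # this policy version
    tool_request: str        # this tool request
    approval_state: str      # this approval state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;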

&lt;p&gt;OAuth was designed to support exactly this shape. &lt;a href="https://www.rfc-editor.org/rfc/rfc8693" rel="noopener noreferrer"&gt;RFC 8693 defines token exchange&lt;/a&gt;, describing how one temporary credential can be exchanged for another temporary credential intended for a different context. In the agent case, the model proposes an action, the runtime checks policy, the broker issues a credential for that action and tool context, the call happens, and the credential dies.&lt;/p&gt;
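
&lt;p&gt;A hedged sketch of that exchange against a hypothetical broker endpoint; the grant and token type URNs come from RFC 8693, everything else is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# RFC 8693 token exchange, sketched with the requests library.
# TOKEN_URL, audience, and scope are illustrative assumptions.
import requests

TOKEN_URL = "https://broker.internal/oauth2/token"  # hypothetical broker

def exchange_for_tool_credential(run_assertion: str, audience: str, scope: str) -&amp;gt; str:
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": run_assertion,   # identity assertion for this run
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "audience": audience,             # e.g. the refund API
        "scope": scope,                   # e.g. "refunds:create"
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
    })
    resp.raise_for_status()
    return resp.json()["access_token"]    # short-lived; dies with the run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;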

&lt;p&gt;That credential does not expire after a quarter. It does not expire after someone remembers to rotate it. It expires because the system puts expiration in the path.&lt;/p&gt;

&lt;p&gt;That changes the damage pattern. A compromised tool wrapper no longer implies broad access to every downstream system. A prompt injection has to cross approval, run, tenant, and policy boundaries. A subagent that escapes its execution boundary cannot reuse credentials after the run, approval, or tenant context has expired.&lt;/p&gt;

&lt;p&gt;The agent is still useful. It just has to act through a production boundary that understands production concerns.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; are valuable and dangerous at the same time. The valuable integrated agents do not live in a chatbot tab. They integrate with real systems. Once an agent is tied to real systems, authentication becomes product architecture rather than cleanup work hidden in deployment.&lt;/p&gt;

&lt;h2&gt;The runtime owns the identity boundary&lt;/h2&gt;

&lt;p&gt;A model provider should not own this boundary. A prompt should not own this boundary. A tool schema should not own this boundary.&lt;/p&gt;

&lt;p&gt;The runtime owns it because the runtime follows the whole path.&lt;/p&gt;

&lt;p&gt;It connects agent definitions to threads or runs, tenants, and identity information, including the user who initiated the work, whether the work is backgrounded, whether a human approved a risky step, which tool is being called, and which downstream credential is being requested. It can attach those facts to an identity assertion and make a policy decision before any assertion leaves the process.&lt;/p&gt;

&lt;p&gt;That policy decision can be boring and explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The refund tool can request a payment credential for the current tenant.&lt;/li&gt;
&lt;li&gt;A GitHub tool can request a write credential after CI has produced an eval pass.&lt;/li&gt;
&lt;li&gt;The Snowflake tool can request a read credential for one warehouse, one role, and one time window.&lt;/li&gt;
&lt;li&gt;A subagent can run with a delegated identity, but only with fewer capabilities than the parent run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list is not impressive, which is why it is powerful.&lt;/p&gt;
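
&lt;p&gt;A minimal sketch of that policy table; the rule names and context fields are illustrative, and real rules belong in a reviewed, versioned policy engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Boring, explicit policy, sketched with hypothetical names.
def allow_credential(ctx: dict, tool: str, scope: str) -&amp;gt; bool:
    # a delegated identity never exceeds the parent run's capabilities
    if ctx.get("is_subagent") and scope not in ctx.get("parent_scopes", ()):
        return False
    if tool == "refund":
        # payment credential only for the current tenant
        return scope == f"payments:{ctx['tenant']}"
    if tool == "github" and scope == "repo:write":
        # write credential only after CI has produced an eval pass
        return ctx.get("ci_eval_passed", False)
    if tool == "snowflake" and scope.startswith("read:"):
        # one warehouse, one role, one time window
        return ctx.get("within_time_window", False)
    return False  # default deny
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;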

&lt;p&gt;This is also where &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration&lt;/a&gt; gets serious. A supervisor handing work to a subagent creates a delegation relationship along with the task description. The child process needs enough authority to perform the work at hand and no more. The audit log must reflect that chain of trust cleanly or troubleshooting becomes an exercise in futility.&lt;/p&gt;

&lt;p&gt;The worst setup is a swarm of agents all sharing the same service account. Simple enough to get going. Terrible when it comes time to debug an incident. Every action has been performed by the same principal, authenticated with the same key, and observed through the same useless blur.&lt;/p&gt;

&lt;p&gt;The incident has no useful actor. Just a shared key with a long memory and no accountability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" alt="A token lifecycle showing an agent run creating an identity assertion, exchanging it for a scoped token, calling a tool, writing audit evidence, and expiring the credential." width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Short-lived delegated credentials make the agent run, policy decision, tool call, and audit trail line up.&lt;/p&gt;

&lt;h2&gt;Audit follows identity&lt;/h2&gt;

&lt;p&gt;Agent observability without identity is half a story.&lt;/p&gt;

&lt;p&gt;A trace for the agent step called &lt;code&gt;refund_customer&lt;/code&gt; can include latency, tool arguments, model output, and retries, all visualized in a convenient span tree. Useful. Then someone asks who had authority to issue that refund, and the trace turns into an archaeological excavation.&lt;/p&gt;

&lt;p&gt;The right trace shows the tool call connected to a principal. Not just a service account. A principal with an agent ID, run ID, tenant, user context, policy decision, credential scope, and expiration time.&lt;/p&gt;

&lt;p&gt;This is what allows a team to answer questions after the tool call has done real work.&lt;/p&gt;

&lt;p&gt;Who granted access? What user context did it use? What broker generated the credential? What version of policy allowed it? What downstream resource accepted it? What subagent inherited it? Can that credential be used for something else?&lt;/p&gt;

&lt;p&gt;Those questions determine whether there is a real postmortem or just hand waving about the agent doing something weird.&lt;/p&gt;

&lt;p&gt;The same principle applies to testing. In &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, I argued that every team already tests whether they admit it or not. Agent identity needs that same honesty. If a runtime can create delegated credentials, tests should verify that the boundary holds. A refund agent should fail against the wrong tenant. A code agent should fail when eval gates are red. A research agent should fail when it asks for write access to a system it only reads.&lt;/p&gt;

&lt;p&gt;Not a one-off &lt;code&gt;npx this and that&lt;/code&gt; someone runs by hand. Test it in CI.&lt;/p&gt;
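
&lt;p&gt;A hedged pytest sketch of those boundary tests; the broker client and its &lt;code&gt;request_credential&lt;/code&gt; call are hypothetical, and in real CI the fixture would point at the broker in a test environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytest

class StubBroker:
    """Hypothetical broker client; replace with the real one in CI."""
    def request_credential(self, **ctx):
        raise PermissionError(f"denied: {ctx}")  # deny-by-default stub

@pytest.fixture
def broker():
    return StubBroker()

def test_refund_agent_fails_against_wrong_tenant(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="refund", tenant="tenant-b",
                                  scope="payments:tenant-a")

def test_code_agent_fails_when_eval_gates_are_red(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="code", scope="repo:write",
                                  ci_eval_passed=False)

def test_research_agent_cannot_escalate_to_write(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="research", scope="warehouse:write")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;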

&lt;h2&gt;Shared keys hide product decisions&lt;/h2&gt;

&lt;p&gt;The fastest credential story hides the decisions that matter most.&lt;/p&gt;

&lt;p&gt;A shared key hides tenancy. It hides user context. It hides the identity of the agent performing an action. It hides which subagent inherited authority. It hides whether approval was granted. It hides whether the action matched the original request. It hides rotation until rotation becomes an outage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html" rel="noopener noreferrer"&gt;OWASP's secrets management guidance recommends dynamic secrets where possible&lt;/a&gt; to reduce credential reuse and limit the damage when credentials leak. Agent systems need the same pressure, with the additional constraint that the credential must represent the run instead of only the application.&lt;/p&gt;

&lt;p&gt;A normal backend service is expected to behave predictably and follow a reliable lifecycle. It accepts requests, implements endpoints, and changes through controlled deployments. An agent runtime for integration automation can select different tools per request, execute work in subagents, retry steps, and continue running after initial user interaction has completed.&lt;/p&gt;

&lt;p&gt;So identity has to be more exact.&lt;/p&gt;

&lt;p&gt;The credential loaned to the agent should assert what it is currently allowed to do. The operating policy should be visible enough to explain why the action was allowed. The audit trail must persist long enough for a human to walk through the events as they happened.&lt;/p&gt;

&lt;p&gt;Getting to a boundary-based platform does not need a full rewrite. Start with one boundary.&lt;/p&gt;

&lt;p&gt;Put an identity broker between the agent runtime and the first high-risk tool. Give the agent runtime a workload identity. Have the broker exchange that identity for a tool credential. Associate the decision with tenant, run, and operation. Record the policy decision in the trace. Add a CI test that proves the wrong tenant fails. Expire the credential quickly. Make the failure visible when the broker returns no.&lt;/p&gt;

&lt;p&gt;Then move the next tool behind the boundary.&lt;/p&gt;

&lt;h2&gt;The production line&lt;/h2&gt;

&lt;p&gt;AI agent authentication is the control plane for non-human actors who do work across systems.&lt;/p&gt;

&lt;p&gt;Ownership matters here. Security cannot retroactively add this after the agent and its resources have shipped. Platform cannot stash it in a vault path. Product cannot reduce it to a consent checkbox. Identity, delegation, expiration, and audit have to be inherent in the runtime of the agent and how it executes.&lt;/p&gt;

&lt;p&gt;The agent should actually be able to act. That is, after all, why we are doing &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;AI agency&lt;/a&gt; in the first place. That agency should have a workload identity.&lt;/p&gt;

&lt;p&gt;Production systems have already worked out parts of the problem. Kubernetes, SPIFFE, OAuth token exchange, cloud workload federation, managed identities, dynamic secrets. They exist because static secrets rot and shared principal accounts make bad situations worse.&lt;/p&gt;

&lt;p&gt;It is a mistake to grant agents an exemption because the interface is conversational.&lt;/p&gt;

&lt;p&gt;The model can decide on the next step. The runtime decides whether that step gets a credential.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Agentic AI Architecture Needs Model Routing</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Fri, 08 May 2026 01:57:35 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</link>
      <guid>https://forem.com/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</guid>
      <description>&lt;p&gt;Agentic AI architecture is stuck on model loyalty.&lt;/p&gt;

&lt;p&gt;The same graph. The same provider. One giant model doing every job because one graph is easier to defend than a routing policy.&lt;/p&gt;

&lt;p&gt;I get why people want to pick one model: it makes demos, evaluation, and procurement easier, and sometimes makes debugging only slightly worse. Every agent call looks the same, every trace looks the same, and the team can blame one provider instead of four.&lt;/p&gt;

&lt;p&gt;Fine. But production agents do not do one kind of work.&lt;/p&gt;

&lt;p&gt;Classify intent. Search. Summarize. Write code. Choose a tool. Check if a tool's result smells wrong. Write a customer-facing answer when something failed. Decide whether approval is required. Wait for something to happen. Retry something that failed. Recover from something gone wrong.&lt;/p&gt;

&lt;p&gt;Production agents run a pile of distinct workloads.&lt;/p&gt;

&lt;p&gt;Harrison Chase notes that &lt;a href="https://x.com/hwchase17/status/2051745855812882576" rel="noopener noreferrer"&gt;LLMs are getting expensive, and open source models matter for that reason&lt;/a&gt;. LangChain is pushing the same direction from a product perspective, noting that &lt;a href="https://x.com/LangChain/status/2051367244060598312" rel="noopener noreferrer"&gt;Fleet agents no longer have to be constrained by a single model and can instead use multi-model support&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those are the same production reality arriving through two doors.&lt;/p&gt;

&lt;p&gt;The agent architecture must determine which model should perform which work.&lt;/p&gt;

&lt;h2&gt;The Same Model Everywhere Is an Architecture Smell&lt;/h2&gt;

&lt;p&gt;The surprising part is how normal this looks. Many current agent stacks treat model selection as just another environment config parameter, equivalent to a batch size. Set &lt;code&gt;MODEL=claude-whatever&lt;/code&gt; or &lt;code&gt;MODEL=gpt-whatever&lt;/code&gt; and deploy the agent.&lt;/p&gt;

&lt;p&gt;That's fine for a chatbot, but lazy for an agent.&lt;/p&gt;

&lt;p&gt;Agents introduce variance internally. What looks simple to a user becomes retrieval, planning, transformation, checking, execution, generation and scheduling inside the system. Some of these steps need to be deep, some fast, some cheap. Some need a model that is good at generating code, others an open-weight model because the data cannot legally leave the boundary, or because it is simply too expensive to move around the company.&lt;/p&gt;

&lt;p&gt;Using the same frontier model across the board is comforting. It also conceals the waste.&lt;/p&gt;

&lt;p&gt;Instead of one glaring failure, I see slow, expensive, bureaucratic agents in production. A team looks at the dashboard. Cost rises, latency rises, and people say the model is too expensive or the prompts are too long. The real problem is that the architecture is linear and every step goes to one place.&lt;/p&gt;

&lt;p&gt;What gets under my skin is the compute monolith. Everywhere else we have learned to separate compute classes properly (queues are not databases, lambdas are not batch workers, CDNs are not origin servers). Then some clever agent comes along and suddenly every cognitive function has to go through the biggest model in the account.&lt;/p&gt;

&lt;p&gt;Come on.&lt;/p&gt;

&lt;h2&gt;Routing Has to Do More Than Fallbacks&lt;/h2&gt;

&lt;p&gt;Model routing usually enters the conversation through reliability. If OpenAI is down, try Anthropic. If a deployment is overloaded, try another one. If a provider rate-limits, retry somewhere else.&lt;/p&gt;

&lt;p&gt;This is important. &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;LiteLLM's router docs&lt;/a&gt; explain load balancing, cooldowns, fallbacks, timeouts, retries, and Redis-based production rate limiting. &lt;a href="https://openrouter.ai/docs/guides/routing/provider-selection" rel="noopener noreferrer"&gt;OpenRouter's provider routing docs&lt;/a&gt; explain provider ordering, fallbacks, performance, price, and data policy constraints. Boring infrastructure at its best.&lt;/p&gt;
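
&lt;p&gt;A sketch of that reliability layer based on LiteLLM's router docs; the deployment names and model strings here are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fallback routing sketched with LiteLLM's Router; names illustrative.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "gpt-whatever"}},
        {"model_name": "backup", "litellm_params": {"model": "claude-whatever"}},
    ],
    fallbacks=[{"primary": ["backup"]}],  # if primary fails, retry on backup
    num_retries=2,
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "classify this ticket"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;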

&lt;p&gt;But routing cannot stop at uptime.&lt;/p&gt;

&lt;p&gt;In a production agent workflow, the router should understand why a task exists. It should see the agent step, the tool context, the risk, latency budget, data boundary and previous run quality. Then it can pick the appropriate model class for the work at hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" alt="Architecture diagram showing an agent graph sending a typed task into a model router with a router policy that chooses among fast, reasoning, code, and open-weight models, with telemetry and evaluation feedback returning to the policy." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The router belongs in production architecture, where policy can be tested.&lt;/p&gt;

&lt;p&gt;This is where things get more interesting for agentic AI architecture, compared to just building an LLM app. The router turns the agent’s internal structure into an execution policy.&lt;/p&gt;

&lt;p&gt;A planner step can go to a reasoning model. A normalization step can go to a fast model. A code-editing subagent can go to a model tuned for code. A bulk summarization step can go to an open-weight model. A regulated data step can stay inside the boundary. A customer-facing final answer can take the slower path, because that is where quality matters.&lt;/p&gt;

&lt;p&gt;The pattern is already familiar, which is the point. It has the same shape as &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but I like it better down at this level. The graph determines what work exists, and the router determines which model class should process that work.&lt;/p&gt;

&lt;h2&gt;The Router Needs Typed Work&lt;/h2&gt;

&lt;p&gt;Prompt-based routing is where it all goes wrong.&lt;/p&gt;

&lt;p&gt;A team adds "Use the cheaper model when the task is simple" to the prompt. The agent is amiable, but it ignores the team's intent at exactly the wrong time. The model guesses, or routes based on whatever words happen to match the current prompt. The result is a vibe with a model attached.&lt;/p&gt;

&lt;p&gt;The router needs typed work.&lt;/p&gt;

&lt;p&gt;My ideal is for the agent to report task metadata &lt;em&gt;before&lt;/em&gt; the model call occurs: task kind, expected output shape, sensitivity of input data, allowed tools, user-facing risk, latency/cost budgets, required capability, and retry posture. I do not need a full taxonomy to start. Most teams can begin with something tiny: &lt;code&gt;classify&lt;/code&gt;, &lt;code&gt;retrieve&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;act&lt;/code&gt;. The key is moving model choice from prose to runtime.&lt;/p&gt;
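
&lt;p&gt;A minimal sketch of that metadata as a typed structure; field names are illustrative, not a standard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Task metadata reported before the model call. Illustrative fields.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    kind: str                  # "classify" | "retrieve" | "reason" | "write" | "code" | "act"
    output_shape: str          # e.g. "json", "patch", "prose"
    sensitive: bool = False    # data cannot leave the boundary
    user_facing: bool = False  # quality beats latency
    latency_budget_ms: int = 2000
    max_retries: int = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;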

&lt;p&gt;This is a lesson already learned elsewhere in agent architecture. In &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;, explicit mechanisms for planning, tools, memory, and verification beat one giant prompt pretending to be architecture. Model selection is another version of this.&lt;/p&gt;

&lt;p&gt;The router can start dumb and be a simple lookup table driven by task type. It can be configured to dispatch to the code model for code tasks, the fast model for low-risk summaries, the local model for sensitive data, and the quality model for final text written for specific customers. First, ship that. Verify that it works. Then gradually become less dumb and add more nuance to the router.&lt;/p&gt;
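
&lt;p&gt;The dumb version really can be a table. A sketch, with placeholder model names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The "start dumb" router: a lookup table from task kind to model
# class. Model names are placeholders, not recommendations.
ROUTES = {
    "classify": "fast-model",
    "retrieve": "fast-model",
    "reason":   "reasoning-model",
    "write":    "quality-model",
    "code":     "code-model",
    "act":      "reasoning-model",
}

def route(task_kind: str, sensitive: bool = False) -&amp;gt; str:
    if sensitive:
        return "local-open-weight-model"  # data stays inside the boundary
    return ROUTES.get(task_kind, "reasoning-model")  # safe default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;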

&lt;p&gt;The first mistake is expecting the team to find the single best router before shipping anything. The second mistake is letting the model design the router policy inside the same prompt it is supposed to execute.&lt;/p&gt;

&lt;h2&gt;Observability Makes Routing Honest&lt;/h2&gt;

&lt;p&gt;A router that does not publish telemetry data becomes an additional place where opinions get hidden.&lt;/p&gt;

&lt;p&gt;An engineer's affection for a particular design, the score of a benchmark, and the features listed on a vendor's web page are all useful, but ultimately insufficient. The only relevant test is whether the routing rule improves the production agent's performance on the tasks it actually faces.&lt;/p&gt;

&lt;p&gt;This means we need to consider cost, latency, error rate, retry rate, approval rate, human correction rate and eval score when deciding the routing for a request. So these statistics need to attach to the routing decision itself, not just to the trace.&lt;/p&gt;

&lt;p&gt;LangSmith's platform language is already pointing in this direction. It treats traces as the record of an agent’s actions and reasoning, and says teams should monitor &lt;a href="https://www.langchain.com/langsmith-platform" rel="noopener noreferrer"&gt;cost, latency, errors, and qualitative online evals&lt;/a&gt;. Fleet's product page puts &lt;a href="https://www.langchain.com/langsmith/fleet" rel="noopener noreferrer"&gt;model choice next to admin controls, observability, approvals, MCP connections, and export via APIs&lt;/a&gt;. This is the signal.&lt;/p&gt;

&lt;p&gt;Model selection has moved from dropdown aesthetics into operational control. It affects the performance of a wide array of business processes.&lt;/p&gt;

&lt;p&gt;Once routing is visible, the discussion shifts. The team can stop arguing over which model is best and start figuring out which route failed: fast model for tool argument generation, reasoning model for eval lift, open-weight model for internal summarization, code model for patch generation.&lt;/p&gt;

&lt;p&gt;Those are engineering questions.&lt;/p&gt;

&lt;p&gt;The answers need to inform the router policy, or else the agent keeps making yesterday's decisions with today's realities.&lt;/p&gt;

&lt;h2&gt;Open-Weight Models Are Part of the Architecture&lt;/h2&gt;

&lt;p&gt;The open-model conversation is often deeply ideological. People tend to think in terms of closed models versus open models, frontier quality versus control, benchmarks, and vibes.&lt;/p&gt;

&lt;p&gt;Production is less dramatic.&lt;/p&gt;

&lt;p&gt;Open-weight models give teams another execution path. They are useful when the task is bounded, when the data boundary matters, when throughput matters, when the cost curve gets ugly, or when the model only needs to be good enough for an internal step the user never sees.&lt;/p&gt;

&lt;p&gt;Having a frontier model connected does not mean every call should route through it. That misconception is common. Routing is what makes the difference.&lt;/p&gt;

&lt;p&gt;A team can still use a frontier model for the high-risk reasoning step. And yes, the final answer can still go through a strong hosted model. But the retrieval cleanup, first-pass summarization, metadata extraction, and internal critique may not automatically deserve the same spend.&lt;/p&gt;

&lt;p&gt;There is no best model for this problem. The more useful question is: Which model owns this step under these constraints?&lt;/p&gt;

&lt;p&gt;Interface portability matters for the same reason. LangChain says &lt;a href="https://x.com/LangChain/status/2051715028567437359" rel="noopener noreferrer"&gt;Deep Agents ships with ACP so the same harness can run across multiple interfaces&lt;/a&gt;. The &lt;a href="https://docs.langchain.com/oss/python/deepagents/cli/overview" rel="noopener noreferrer"&gt;Deep Agents CLI docs&lt;/a&gt; show a coding agent with provider credentials, model switching, tools, memory, skills, MCP tools, and LangSmith tracing. The interface can change. The harness can change. The routing policy has to be portable across both.&lt;/p&gt;

&lt;p&gt;Model choice that lives in a UI dropdown is prone to drift. Model choice that lives in the agent runtime can be tested, traced, reviewed and rolled back.&lt;/p&gt;

&lt;h2&gt;Own the Decision Boundary&lt;/h2&gt;

&lt;p&gt;The old agent stack revolved around a model call. The next one revolves around a decision boundary.&lt;/p&gt;

&lt;p&gt;That boundary decides which work deserves which model, which provider, which data path, how many retries to attempt, what approval loop to operate in, and which evaluation loop to use. Less glamorous than a chart, to be sure, but more relevant to production workflows. Most production architecture is less glamorous than the thing that sells the demo.&lt;/p&gt;

&lt;p&gt;The teams that get this right won’t talk about having one “agent model”. They’ll talk about routes: Fast route. Deep route. Code route. Local route. Human-review route. And for each route, they’ll know when to use it, how much it costs, how often it fails, and whether the next release made it better.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; become useful. The agent owns execution decisions instead of wrapping a model call in a little workflow theater.&lt;/p&gt;

&lt;p&gt;The code that matters controls the router, the telemetry and the eval loop.&lt;/p&gt;

&lt;p&gt;The model will keep changing. The decision boundary should belong to the team shipping the agent.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Stop Eager-Loading MCP Tools Into the Context Window</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 05 May 2026 20:31:01 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</link>
      <guid>https://forem.com/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP servers should not eagerly load every tool schema into an agent's context window. Lazy-load tools by intent, then govern and audit execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think the problem with the current state of MCP is way deeper than just resizing the context window.&lt;/p&gt;

&lt;p&gt;The protocol itself is decent: tool discovery and schema negotiation work well, and the JSON-RPC architecture feels solid and well engineered. However, the default behavior of populating the agent's context at session start with every tool definition from every connected server makes running production agents virtually impossible.&lt;/p&gt;

&lt;p&gt;One developer &lt;a href="https://joshowens.dev/mcps-are-dead/" rel="noopener noreferrer"&gt;measured 67,300 tokens consumed&lt;/a&gt; before typing a single question. Seven MCP servers. Tool schemas alone ate up a third of the available context. Another measured 81,986 tokens. &lt;/p&gt;

&lt;h2&gt;The Eager-Loading Tax&lt;/h2&gt;

&lt;p&gt;When an agent starts a session with MCP servers connected, it downloads the full library of tools from every server, every session, and never filters down to just the tools needed for the job at hand.&lt;/p&gt;

&lt;p&gt;My browser automation server is loading 21 tool definitions. A GitHub server loads 27. My web search server bundles 8 providers behind 20 tools. I've not sent a single message yet and I'm already consuming significant context.&lt;/p&gt;

&lt;p&gt;The numbers from &lt;a href="https://arxiv.org/abs/2602.14878" rel="noopener noreferrer"&gt;a study of 856 tools across 103 MCP servers&lt;/a&gt; make this worse than it sounds. Fully augmented MCP tool descriptions add 67% more execution steps for a 5.85 percentage point accuracy gain. The tool definitions don't just eat context. They also slow agents down when they actually use the tools.&lt;/p&gt;

&lt;p&gt;We wrote about &lt;a href="https://focused.io/lab/evaluation-pipelines-for-langgraph-agents" rel="noopener noreferrer"&gt;evaluation pipelines for production agents&lt;/a&gt;. One failure mode of context pollution from tool definitions that I never see anyone mention is the agent becoming less effective over time. It doesn't die or crash or throw an error. The real conversation history that can be kept in the working window just gets pushed out by the tool schemas.&lt;/p&gt;

&lt;p&gt;Even with child agents the context budget gets severely curtailed. Each child agent inherits the MCP configuration. That's new context I guess, but the immediate loss of tens of thousands of tokens to render tool schemas for subagents that may not even use them is completely antithetical to the point of using subagents in the first place: focused context. We covered the architecture patterns for &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but even great orchestration can't fix a context budget that's already half spent before the first tool call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" alt="Split comparison of eager MCP tool loading versus lazy tool discovery preserving the context window." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The waste is architectural: eager loading spends the context budget before the agent starts working.&lt;/p&gt;

&lt;h2&gt;Cloudflare Just Admitted This Is Broken&lt;/h2&gt;

&lt;p&gt;Cloudflare launched &lt;a href="https://blog.cloudflare.com/welcome-to-agents-week/" rel="noopener noreferrer"&gt;Agents Week&lt;/a&gt; on April 12, and buried in their enterprise MCP reference architecture is an admission that the tool-definition model doesn't scale.&lt;/p&gt;

&lt;p&gt;Their solution is called &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;. It condenses all of the individual MCP tools down into two meta-tools: &lt;code&gt;portal_codemode_search&lt;/code&gt; and &lt;code&gt;portal_codemode_execute&lt;/code&gt;. Rather than loading every tool definition into context, the agent writes JavaScript to search for and invoke tools on demand.&lt;/p&gt;
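
&lt;p&gt;This is not Cloudflare's actual implementation, but the shape is easy to sketch: a search function over a tool index, and an execute function that invokes a chosen tool by name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A generic sketch of the two-meta-tool shape, not Cloudflare's code.
def codemode_search(query: str, index: dict[str, str]) -&amp;gt; list[str]:
    """Return names of tools whose description matches the query."""
    q = query.lower()
    return [name for name, desc in index.items() if q in desc.lower()]

def codemode_execute(name: str, args: dict, registry: dict):
    """Invoke a discovered tool by name; only its schema enters context."""
    return registry[name](**args)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;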

&lt;p&gt;The effect: 4 internal MCP servers exposing 52 tools would normally consume 9,400 tokens just for definitions. Code Mode drops that to 600 tokens. A 94% reduction. For Cloudflare's own API, which would consume over 2 million tokens as a traditional MCP server (twice the largest context window available right now), the reduction hits 99.9%.&lt;/p&gt;

&lt;p&gt;That last number deserves to sit for a second. Cloudflare, one of the companies most aggressively adopting MCP across their entire enterprise, had to build a system that essentially replaces MCP's tool discovery mechanism because the original approach would literally overflow the context window. With one server.&lt;/p&gt;

&lt;p&gt;The MCP spec team &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1300" rel="noopener noreferrer"&gt;acknowledged context overload as the most frequent community concern&lt;/a&gt; in their tool filtering proposal. Quality decreases rapidly after around 10 tools, a threshold most production setups blow past.&lt;/p&gt;

&lt;h2&gt;Lazy-Loading Is the Fix&lt;/h2&gt;

&lt;p&gt;Not just a theoretical issue. I'm seeing lazy-loading work in multiple production environments, each implementing it slightly differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare's Code Mode&lt;/strong&gt; turns the agent into its own tool browser. Give it a search function, give it an execute function, and let it figure out which tools matter for the job at hand. The context cost for exploring MCP servers stays the same regardless of how many servers are connected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's also the Skills pattern.&lt;/strong&gt; Instead of representing all of the tool schemas in detail upfront, agents encode the knowledge needed for a given task in lightweight skill files (typically 200 to 1,500 tokens each) that can be loaded as needed based on intent matching. A skill for browser automation might cost around 2,000 tokens to activate, as opposed to 13,600 tokens to load the full MCP server at startup. GitHub operations drop from 18,000 tokens to maybe 500 or so. Web search goes from 14,100 down to 550.&lt;/p&gt;

&lt;p&gt;That's not marginal. That's an order of magnitude.&lt;/p&gt;
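
&lt;p&gt;A hedged sketch of the Skills pattern; the paths and the keyword intent match are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Load a small skill file into context only when the task matches its
# intent. Paths and the matching rule are illustrative.
from pathlib import Path

SKILLS = {
    "browser": Path("skills/browser-automation.md"),
    "github":  Path("skills/github-operations.md"),
    "search":  Path("skills/web-search.md"),
}

def load_skills_for(task: str) -&amp;gt; str:
    task = task.lower()
    return "\n\n".join(
        path.read_text() for name, path in SKILLS.items() if name in task
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;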

&lt;p&gt;&lt;strong&gt;Arcade's MCP Gateway&lt;/strong&gt; in &lt;a href="https://blog.langchain.com/arcade-dev-tools-now-in-langsmith-fleet/" rel="noopener noreferrer"&gt;LangSmith Fleet&lt;/a&gt; takes a third approach by centralizing 7,500+ tools and optimizing the tool descriptions for language models. These tools are not simply API wrappers. They are mapped to actions that agents can perform, with descriptions written specifically for how language models select and call them.&lt;/p&gt;

&lt;p&gt;Harrison Chase wrote about this from the other side of the spectrum. His &lt;a href="https://blog.langchain.com/continual-learning-for-ai-agents/" rel="noopener noreferrer"&gt;continual learning framework&lt;/a&gt; identifies three realms where agents improve: model weights, harness code, and context. The context layer is "the most common and most exciting area right now." However, optimizing for context only works if there is room in the context budget to do so. An agent can't learn from its interactions if the space for learning is already completely filled by tool schemas it loaded at boot time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" alt="Flow diagram showing task intent routing through tool discovery, policy approval, needed tool schemas, agent execution, and audit logging." width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lazy-loading turns tool discovery into a governed routing path instead of a context-window tax.&lt;/p&gt;

&lt;h2&gt;What This Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;With the current LangChain infrastructure, the eager version of these agents registers all tools when the agent is built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_eager_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# all tools, all servers, every session
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lazy approach is not a magic discovery tool that mutates the running agent's tool set. The boring version is a router: decide which MCP servers matter for this task, load only those tools, then build the agent for that run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;look up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_with_lazy_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;selected_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No matching MCP servers. Available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# only tools from the routed servers
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first version of this feature had a terrible context profile because it loaded definitions for every tool on every server. The next version routed first, then loaded only the relevant servers. In a production system with 5 to 10 MCP servers, that is tens of thousands of tokens saved every session.&lt;/p&gt;

&lt;p&gt;Holding all of that tool schema in context is expensive. But more importantly, every token of tool schema that sits in context is a token that could be spent on reasoning, conversation history, or user-specific memory. We wrote about why &lt;a href="https://focused.io/lab/persistent-agent-memory-in-langgraph" rel="noopener noreferrer"&gt;persistent agent memory&lt;/a&gt; is critical for production agents. Memory is useless if there isn't room for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow MCP Is the Enterprise Problem Nobody Expected
&lt;/h2&gt;

&lt;p&gt;Cloudflare's reference architecture introduces another concept worth paying attention to: &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Shadow MCP detection&lt;/a&gt;. They scan for unauthorized MCP server connections across the organization, monitoring hostnames, URI paths, and even DLP-based body inspection for JSON-RPC method calls like &lt;code&gt;tools/call&lt;/code&gt; and &lt;code&gt;initialize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;MCP has its own shadow IT problem. Developers set up their own MCP servers, wire them into existing agents, and security never finds out. That code can execute locally on developer machines, reach internal APIs, and bypass security controls. No audit trail, no credential governance, no DLP.&lt;/p&gt;

&lt;p&gt;Cloudflare's answer is a monorepo governance model: centralized MCP team, AI governance approval, templates that inherit default-deny write controls and audit logging out of the box. New governed MCP servers deploy in minutes because the governance is baked into the platform, not bolted on after the fact.&lt;/p&gt;

&lt;p&gt;I see this pattern constantly with clients. The MCP gold rush has teams spinning up servers faster than security can evaluate them. We wrote about why &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;agent-operable interfaces are the product&lt;/a&gt;. The same principle applies to the tools agents use. If an employee can't access a system without approval, the agent shouldn't be able to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Is Architecture, Not Bigger Windows
&lt;/h2&gt;

&lt;p&gt;"Context windows keep getting bigger." They do. And the waste doesn't get smaller.&lt;/p&gt;

&lt;p&gt;A million-token window doesn't help if 67,000 tokens of tool schemas the agent will never use still get loaded. The underlying issue is architectural: eager loading is the wrong pattern for tool discovery in production agents.&lt;/p&gt;

&lt;p&gt;Lazy-load tools based on task intent. Gate discovery behind a search mechanism. Keep tool definitions out of the context until the agent actually needs them.&lt;/p&gt;
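
&lt;p&gt;As a sketch of that gate, reusing the &lt;code&gt;TOOL_REGISTRY&lt;/code&gt; shape from the code above (the function name here is illustrative, not from the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search_tools(query: str) -&amp;gt; list[str]:
    """Meta-tool: search the registry instead of preloading every schema."""
    q = query.lower()
    return [
        name
        for name, cfg in TOOL_REGISTRY.items()
        if any(trigger in q for trigger in cfg["triggers"])
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;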

&lt;p&gt;Honeycomb published &lt;a href="https://www.honeycomb.io/blog/icymi-is-this-code-worth-running-heres-how-know" rel="noopener noreferrer"&gt;a set of principles for the AI era&lt;/a&gt; that apply here: cost is a system attribute, not an afterthought, and pre-production testing doesn't prepare for the load that comes from real systems in a real environment. Tool context overhead is exactly the kind of emergent cost that only shows up in production, when real agents connect to real MCP servers and the token bills start making people uncomfortable.&lt;/p&gt;

&lt;p&gt;The protocol isn't the problem. The eager-loading default is the problem. Own the architecture decision. Lazy-load.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>MCP Is Packaging. Agent-Operable Interfaces Are the Product | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Mon, 04 May 2026 14:25:47 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</link>
      <guid>https://forem.com/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP packages tools, but the real product is the narrow, typed, auditable interface an agent can actually operate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP is not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is designing a system that an agent can actually operate, rather than guess at, wander through, or mangle. The protocol is the distribution, not the architecture.&lt;/p&gt;

&lt;p&gt;This is kind of important. Every enterprise AI conversation I’ve had will, at some point, boil down to this: we have a model, we have a workflow, and we have a tangle of internal tools designed for humans to use through a web interface at human speed. Then the question becomes “should we make an MCP server to handle all of this?”&lt;/p&gt;

&lt;p&gt;Fine. But for what?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol makes it easy for applications to expose tools and model context&lt;/a&gt;. That’s useful, and I’m not opposed to MCP. I am opposed to using the protocol to pass off a useless shortcut as something useful.&lt;/p&gt;

&lt;p&gt;Harrison Chase broke down the lock-in problem well: &lt;a href="https://x.com/hwchase17/status/2050470473310572849" rel="noopener noreferrer"&gt;switching model providers is easy, switching harnesses is less so, and model providers want to lock teams in through the harness&lt;/a&gt;. The harness is where the agent learns about the actions in an application, the state, the model’s memory, what can be retried, what needs approval, and what telemetry gets written down.&lt;/p&gt;

&lt;p&gt;But then there is the interface below the harness, which gets little recognition.&lt;/p&gt;

&lt;p&gt;A bad interface can turn an excellent harness into a nightmare. A good interface keeps even a mediocre harness workable.&lt;/p&gt;

&lt;p&gt;I see why “just build an MCP server” isn’t the entire answer. An MCP server can expose a messy action. It can wrap a sharp one. But deciding which actions exist in the first place is up to the team. And that’s a design and product problem, not an engineering one.&lt;/p&gt;

&lt;p&gt;Teams build integrations for internal agents by wrapping existing APIs, which are often structured to hide awkward frontend decisions, like why the API returns an object inside an object inside an object. An endpoint might update state as a side effect because it was built for an admin screen. Errors come back as human-readable prose, permissions are implicit, pagination parameters are opaque, there is no dry-run support, and there are no idempotency keys. The verb the business actually needs, “after policy rules apply, approve this one invoice,” lands on the agent as &lt;code&gt;updateInvoice&lt;/code&gt;. Stricter prompts don’t fix that.&lt;/p&gt;

&lt;p&gt;Welcome to production.&lt;/p&gt;

&lt;p&gt;After reading yet another question about whether a given subsystem has an MCP server, I paused to ask myself whether I was missing something. We shouldn’t be asking “is there an MCP server”; we should be asking whether the system has handles for the agent that just got invited in.&lt;/p&gt;

&lt;p&gt;A handle is a small, typed, boring action. It describes what the data contains, what the operation needs from it, and what it will look like afterward. It fails in a way the caller can understand. It is easy to test without a full model. And it leaves a trace of what it did.&lt;/p&gt;
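
&lt;p&gt;As a rough sketch of that shape, with every name below hypothetical rather than taken from a real system, a handle’s contract can be written as plain typed data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum


class RefundError(str, Enum):
    """Typed failures the caller can branch on."""
    ALREADY_REFUNDED = "already_refunded"
    AMOUNT_EXCEEDS_POLICY = "amount_exceeds_policy"


@dataclass(frozen=True)
class CreateRefundRequest:
    """Narrow verb: propose a refund, never 'update order'."""
    order_id: str
    amount_cents: int
    currency: str
    reason: str
    idempotency_key: str  # a retry with the same key must not create a second refund


@dataclass(frozen=True)
class RefundResult:
    """Machine-readable and human-readable at once."""
    refund_id: str | None
    error: RefundError | None
    summary: str          # what changed, what didn't, what still needs work
    audit_record_id: str  # who acted, through which agent, against which object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;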

&lt;p&gt;The newest examples reinforce the point. Google’s &lt;a href="https://github.com/googleapis/mcp-toolbox" rel="noopener noreferrer"&gt;MCP Toolbox for Databases&lt;/a&gt; might sound bland because “database plus MCP” reads like a buzzword pairing, but the interesting part is that it treats database access as controlled, auditable work that can be inspected, not just exposed. MathWorks has released an official &lt;a href="https://github.com/matlab/matlab-mcp-core-server" rel="noopener noreferrer"&gt;MATLAB MCP server&lt;/a&gt;, which is interesting because a typed interface into MATLAB’s mature technical environment is vastly more appropriate than a chat window. Browserbase and LangChain are demonstrating Deep Agents with &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/browserbase" rel="noopener noreferrer"&gt;search, fetch, and browser subagents&lt;/a&gt;: a cheap, light subagent performs quick retrieval, and a heavier browser-based operation follows only if necessary.&lt;/p&gt;

&lt;p&gt;I don’t mean that every single thing suddenly becomes an MCP server. I mean that more of the important tools in a business can become something controlled through an agent instead of through a browser tab or terminal command.&lt;/p&gt;

&lt;p&gt;There is a difference.&lt;/p&gt;

&lt;p&gt;An MCP server is just one package boundary among several, each with its own strengths and weaknesses. An agent-operable interface is a product decision, choosing specific verbs, inputs, outputs, reversible operations, and mandatory human pause actions. A protocol can then move that interface around, but it cannot make the interface good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" alt="Side-by-side architecture diagram comparing a thin MCP wrapper around a messy API with an agent-operable interface that has narrow verbs, dry runs, typed errors, and audit records." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP moves an interface around. It does not make the verbs worth trusting.&lt;/p&gt;

&lt;p&gt;This is the same anti-pattern we saw with APIs. Companies would publish a REST API to tremendous fanfare, convinced that integration problems were now solved. In practice, the nouns and mutations provided by the API would prove inadequate for anything beyond the simplest cases. Docs would sometimes contradict behavior. And while most of the workflow might be automatable, the remaining chunk still required a human being logged into the admin console.&lt;/p&gt;

&lt;p&gt;The gap costs more as agents move further into it, because agents rarely surface the ambiguity at the boundary. They select a tool, fill in missing fields, retry operations, and summarize the results as if they were progress. Agents do not fail workflows on purpose. They are handed an irregular surface with no clear mandate and are forced to pretend competence on it.&lt;/p&gt;

&lt;p&gt;A useful way to think about this is &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;. The word “agency” comes with unfortunate connotations of personality, so I try to think about it in terms of the required affordances for any agent: a goal, some tools to pursue it with, memory, feedback, and permission to act. When the tool layer is too vague, the AI ends up with fake agency. It can talk about work and even generate a lot of thoughtful-sounding design language, but it can’t actually do the work.&lt;/p&gt;

&lt;p&gt;The current gold rush of building MCPs obfuscates this problem because when people say “server” they think of code and physical hardware. Code and hardware are tangible. There is a repo, a README, and a demo of someone, usually Claude or Cursor, opening up the tool and something happening.&lt;/p&gt;

&lt;p&gt;That demo is not the test.&lt;/p&gt;

&lt;p&gt;Test whether the interface still behaves when the request is boring, partial, duplicated, late, unauthorized, or wrong. Test whether a reviewer can always reconstruct what happened to an object after the agent touched the handle of the thing. Test whether the action can be replayed in staging without accidentally sending the email to customers. &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, even when the thing under test is an agent holding a tool handle.&lt;/p&gt;

&lt;p&gt;A useful agent-operable interface has a few properties.&lt;/p&gt;

&lt;p&gt;The verbs are narrow. A verb for “create refund request” instead of “update order.” A verb for “draft response” instead of “send message.” A verb for “propose schema migration” instead of “run SQL.” Narrow verbs help by letting the operation name strongly suggest the operation’s intent.&lt;/p&gt;

&lt;p&gt;Inputs are typed in the domain’s terms, not JSON schema for its own sake. Use real domain constraints wherever possible, the kind of validation that matters in the application: an account ID that actually exists in the system, a payment amount with a meaningful currency, a date and time with timezone rules that mean something to the user. And when using enums, the values should be meaningful strings, not whatever happened to be in the demo.&lt;/p&gt;
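
&lt;p&gt;A hedged sketch of what that looks like with Pydantic, where the ID pattern, bounds, and currency set are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from enum import Enum

from pydantic import AwareDatetime, BaseModel, Field


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


class PaymentAmount(BaseModel):
    cents: int = Field(gt=0, le=1_000_000)  # bounded amount, not a bare float
    currency: Currency


class RefundInput(BaseModel):
    # shaped like real account IDs instead of accepting any string
    account_id: str = Field(pattern=r"^acct_[a-z0-9]{12}$")
    amount: PaymentAmount
    effective_at: AwareDatetime  # timezone-aware or validation fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;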

&lt;p&gt;Outputs should be machine-readable and human-readable at the same time. The agent expects certain fields to be populated. A human reviewer wants to read a simple statement of what changed, what didn’t change, and what still needs work.&lt;/p&gt;

&lt;p&gt;There’s a dry-run path. A dry run is the cheapest safety mechanism available, and almost nobody shipping generated code tries it first. A dry run turns “can the agent do this?” into “can the agent explain the diff before doing this?” That is where human judgment is better.&lt;/p&gt;
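
&lt;p&gt;A minimal sketch of that path, reusing the hypothetical &lt;code&gt;RefundInput&lt;/code&gt; above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_refund_request(req: RefundInput, dry_run: bool = False) -&amp;gt; dict:
    diff = {"account": req.account_id, "would_refund_cents": req.amount.cents}
    if dry_run:
        # the agent can show this diff to a human before anything changes
        return {"applied": False, "diff": diff}
    ...  # perform the real mutation here
    return {"applied": True, "diff": diff}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;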

&lt;p&gt;Interfaces are idempotent to the degree possible. Networks fail, agents retry, and tool calls time out while the downstream system was actually doing the work. If retrying &lt;code&gt;create_refund_request&lt;/code&gt; creates a second refund, or a second ticket, or a second production deploy, then the interface is not yet ready for an agent.&lt;/p&gt;
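
&lt;p&gt;A toy version of the dedupe that makes retries safe, with an in-memory store standing in for the database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Callable

_results_by_key: dict[str, dict] = {}


def idempotent(key: str, operation: Callable[[], dict]) -&amp;gt; dict:
    """Replay the original result instead of re-running the mutation."""
    if key not in _results_by_key:
        _results_by_key[key] = operation()
    return _results_by_key[key]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;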

&lt;p&gt;Every interface has contract tests that don’t involve a model. This matters. If every single correctness check has to run an LLM, we have built a slot machine and only looked at the CI badge. The tool’s schema, how it validates, what a dry run looks like, how permissions fail, and what audit records are generated should all be tested by normal software tests. Save the model evals for when there’s a model involved.&lt;/p&gt;
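
&lt;p&gt;Sketched as pytest contract tests against the hypothetical handle above, with no model anywhere in the loop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytest
from pydantic import ValidationError


def test_rejects_malformed_account_id():
    with pytest.raises(ValidationError):
        RefundInput(
            account_id="not-an-account",
            amount={"cents": 500, "currency": "USD"},
            effective_at="2026-01-01T00:00:00+00:00",
        )


def test_dry_run_never_mutates():
    req = RefundInput(
        account_id="acct_abc123def456",
        amount={"cents": 500, "currency": "USD"},
        effective_at="2026-01-01T00:00:00+00:00",
    )
    assert create_refund_request(req, dry_run=True)["applied"] is False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;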

&lt;p&gt;The interface leaves evidence. Not vibes: tangible records of who acted, through which agent, under which policy, against which object, with what proposed change, and with what final result. This is connecting observability to governance without inverting into another dashboard cult.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" alt="Matrix listing the properties of an agent-operable handle: narrow verb, typed input, dry run, idempotency, typed failure, audit record, and human pause." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A useful handle is a contract the agent cannot creatively reinterpret.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://x.com/GoogleCloudTech/status/2050334450697863535" rel="noopener noreferrer"&gt;Google Cloud conversation with Harrison Chase framed harness engineering as the path from demo to production&lt;/a&gt;. I think that is right, and I think the next practical step is interface engineering. A harness only earns its keep once it has sane interfaces to compose.&lt;/p&gt;

&lt;p&gt;This is why abstractions on top of LangChain are useful too. Start with a basic agent primitive, then a graph, and finally a Deep Agent that can use browser subagents and human interrupts. Every level of abstraction still bottoms out at a tool call, which corresponds either to a clean domain operation or to a tangled mess of code that happens to work on the backend.&lt;/p&gt;

&lt;p&gt;In practice, &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;Multi-Agent Orchestration in LangGraph&lt;/a&gt; is only half the story. The other half is whether the interface lets the worker do anything worth trusting.&lt;/p&gt;

&lt;p&gt;It’s getting said out loud in the community now: &lt;a href="https://x.com/i/status/2050545264927093004" rel="noopener noreferrer"&gt;“Stop building MCP servers. Build CLIs that agents can use”&lt;/a&gt;. I don’t care what the end result is, as long as it’s a CLI, OpenAPI endpoint, MCP tool, database management procedure, internal command bus, or whatever boring thing is observable, testable, and readable by others.&lt;/p&gt;

&lt;p&gt;Interesting new projects are emerging around this idea too. &lt;a href="https://github.com/millionco/agent-install" rel="noopener noreferrer"&gt;agent-install&lt;/a&gt; treats agent capabilities as installable surfaces across coding agents. &lt;a href="https://github.com/DesmondSanctity/loadam" rel="noopener noreferrer"&gt;loadam&lt;/a&gt; turns OpenAPI specs into tests, MCP output, and drift reports. &lt;a href="https://www.freecodecamp.org/news/how-to-build-a-multi-agent-ai-system-with-langgraph-mcp-and-a2a-full-book/" rel="noopener noreferrer"&gt;freeCodeCamp’s LangGraph, MCP, and A2A guide&lt;/a&gt; also illustrates the progress from single-agent demos to more structured systems with protocols between them.&lt;/p&gt;

&lt;p&gt;Good. Just make the distinction between what the protocol diagram shows and what the system can actually do.&lt;/p&gt;

&lt;p&gt;The work is deciding what actions the agent can take within Salesforce, Jira, GitHub, Postgres, SAP, Stripe, and the lingering internal admin app that is totally going to get replaced tomorrow. Deleting broad verbs is nobody’s favorite hobby. Adding dry runs is straightforward. Making failures typed is tedious. Writing contract tests before a single model sees the tool is boring.&lt;/p&gt;

&lt;p&gt;Boring is the point.&lt;/p&gt;

&lt;p&gt;Stop eager-loading MCP tools into the context window. A giant pile of tools is not capability. It is usually confusion with a larger token bill. Agents need fewer, sharper handles to their tools, and tool catalogs should feel more like a well-designed command line than a junk drawer with JSON schemas bolted on.&lt;/p&gt;

&lt;p&gt;Agent-operable interfaces should be treated as part of product architecture, not as the integration scraps product teams don’t want anymore. Enterprise teams should own the verbs the same way they own the database schema. Version them. Deprecate them. Test them and document the failure modes. Require review for dangerous actions. Make the interface boring enough that the agent has no creative wiggle room around the important bits.&lt;/p&gt;

&lt;p&gt;MCP will help distribute interfaces. Harnesses will help compose them. Models will get better at calling them.&lt;/p&gt;

&lt;p&gt;Companies will not win by having the most MCP-capable servers. They will win by having the cleanest handles in their systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Customer Service Bot Is Slow Because It's Single-Threaded</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:24 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</link>
      <guid>https://forem.com/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</guid>
      <description>&lt;p&gt;Consider a typical enterprise support agent. A customer asks a complex compliance question and the agent dutifully queries the knowledge base, then searches the web, then checks policy docs. Sequential. Three LLM calls back to back. &lt;em&gt;That's ~12 seconds of wall time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Users start abandoning chat around 8 seconds.&lt;/p&gt;

&lt;p&gt;Fan out those three research calls in parallel, same calls, same models, same prompts, and &lt;em&gt;wall time drops to ~6.5 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post covers the parallel sub-agent pattern using LangGraph and LangSmith. I'll show the code, but more importantly, I'll show you the failure modes because the pattern is simple and the bugs are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Math
&lt;/h2&gt;

&lt;p&gt;You have an agent that needs to hit three sources: internal KB, web search, and policy documents. Each LLM call takes 2–4 seconds. Sequentially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research KB&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Web&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Policy&lt;/td&gt;
&lt;td&gt;~2.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~12s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In parallel, the three research steps overlap:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research (all three, parallel)&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 45% reduction from a structural change, not a prompt improvement. Every additional sub-agent you add sequentially costs another 2–4 seconds. In parallel, it's free, until you hit the slowest branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel Agents Architecture
&lt;/h2&gt;

&lt;p&gt;We're building a research assistant that fans out to three parallel sub-agents, aggregates results, and synthesizes a response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     ┌→ [Research: KB]     ─┐
[Classify Query] ────┼→ [Research: Web]    ─┼→ [Synthesize] → END
                     └→ [Research: Policy] ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LangGraph executes parallel branches in a single superstep: all three branches run concurrently, and state updates are applied transactionally. The fan-in edge waits for all branches before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Send API:&lt;/strong&gt; LangGraph has a &lt;code&gt;Send&lt;/code&gt; API for dynamic map-reduce where branch count is unknown at build time. Don't reach for it here. &lt;code&gt;Send&lt;/code&gt; is designed for running the same node N times with different inputs. For a fixed set of specialist agents, static edges or conditional routing are simpler, preserve graph structure, and keep every branch visible at compile time via &lt;code&gt;graph.get_graph().draw_mermaid()&lt;/code&gt;. In practice, you'll rarely need &lt;code&gt;Send&lt;/code&gt;. Start with static fan-out, graduate to conditional, reach for &lt;code&gt;Send&lt;/code&gt; as a last resort.&lt;/p&gt;

&lt;h2&gt;
  
  
  State: The One Thing You'll Get Wrong
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Annotated[list, operator.add]&lt;/code&gt; reducer tells LangGraph to &lt;strong&gt;concatenate&lt;/strong&gt; results from parallel branches instead of overwriting them. Without it, parallel branches race to write the results field. The last branch to finish wins, and you silently lose the other two. This is one of the most common bugs in parallel agent systems. The synthesizer produces suspiciously narrow responses, coverage evals fail intermittently, and you spend two days blaming the prompt before realizing you're only getting one source's data.&lt;/p&gt;
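
&lt;p&gt;The difference is one annotation. A sketch of both definitions, with the failure mode described above attached to the first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict


class BrokenState(TypedDict):
    # no reducer: parallel writes to this field conflict and results get lost
    research_results: list[dict]


class SafeState(TypedDict):
    # reducer: LangGraph concatenates whatever each branch returns
    research_results: Annotated[list[dict], operator.add]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;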

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State, a sub-agent factory, and three agent instances. The &lt;code&gt;@traceable&lt;/code&gt; decorator ensures each agent appears as a distinct span in LangSmith — this will be the single most important debugging decision you make.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research_results: Annotated[list[dict], operator.add]
    final_response: str


def make_agent(name: str, focus: str):
    """Factory that builds a traceable research sub-agent."""

    @traceable(name=name, run_type="chain")
    def node(state: State) -&amp;gt; dict:
        response = llm.invoke([
            SystemMessage(content=f"You are the {name} agent. Focus on {focus}. "
                                  "Return a concise summary. Cite your source type."),
            HumanMessage(content=f"Research query: {state['question']}"),
        ])
        return {"research_results": [{"source": name, "content": response.content}]}

    return node


kb_agent = make_agent("knowledge_base", "internal knowledge base searches.")
web_agent = make_agent("web_search", "recent news and industry trends.")
policy_agent = make_agent("policy", "compliance, legal, and regulatory frameworks.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer merges sub-agent outputs into one customer-facing response. The key constraint, worth knowing before you ship, is that policy information takes precedence. Without this, the synthesizer will cheerfully soften restrictions to sound more helpful.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="Synthesizer", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    context = "\n\n".join(
        f"[{r['source']}]: {r['content']}" for r in state["research_results"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Synthesize the following research into a clear, actionable "
                    "response. When policy information conflicts with or constrains "
                    "other responses, the policy statement takes precedence. "
                    "Never soften or omit policy restrictions."
        ),
        HumanMessage(
            content=f"Customer question: {state['question']}\n\n"
                    f"Research findings:\n{context}"
        ),
    ])
    return {"final_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Fifteen lines of wiring. &lt;code&gt;RetryPolicy&lt;/code&gt; on every research node so a provider 429 doesn't kill the entire pipeline; successful branches are checkpointed and won't re-execute.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

builder = StateGraph(State)

builder.add_node("kb", kb_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("web", web_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("policy", policy_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "kb")
builder.add_edge(START, "web")
builder.add_edge(START, "policy")
builder.add_edge(["kb", "web", "policy"], "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conditional Routing: The Upgrade
&lt;/h2&gt;

&lt;p&gt;Sometimes hitting every source is wasteful. A simple "what's our refund policy?" doesn't need web search. Conditional fan-out lets you route based on the question using structured output, no regex parsing, no brittle string matching:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections.abc import Sequence

from pydantic import BaseModel, Field


class RoutingPlan(BaseModel):
    agents: list[str] = Field(
        description="Agents to activate: kb, web, policy"
    )

structured_llm = llm.with_structured_output(RoutingPlan)


def classify_and_route(state: State) -&amp;gt; Sequence[str]:
    plan = structured_llm.invoke([
        SystemMessage(content="Decide which research agents to invoke. "
                              "Available: kb, web, policy. When in doubt, include the agent."),
        HumanMessage(content=state["question"]),
    ])
    return plan.agents or ["kb"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tradeoff is real. Conditional routing saves latency on simple queries, but the routing logic becomes a new failure point. And with conditional fan-out, use individual edges from each node to &lt;code&gt;synthesize&lt;/code&gt;, not the list-style fan-in, or LangGraph waits forever for branches that were never dispatched.&lt;/p&gt;
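
&lt;p&gt;A sketch of that wiring, assuming the nodes and &lt;code&gt;classify_and_route&lt;/code&gt; from above (the &lt;code&gt;classify&lt;/code&gt; node is a placeholder that only anchors the routing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;builder.add_node("classify", lambda state: {})  # routing anchor; writes nothing
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", classify_and_route, ["kb", "web", "policy"])

# Individual fan-in edges: synthesize fires once the dispatched branches finish,
# instead of waiting forever on branches that were never selected.
builder.add_edge("kb", "synthesize")
builder.add_edge("web", "synthesize")
builder.add_edge("policy", "synthesize")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;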

&lt;h2&gt;
  
  
  Production Failures in Concurrent Execution
&lt;/h2&gt;

&lt;p&gt;These are the failure modes that surface once parallel agents hit real traffic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Clobbering.&lt;/strong&gt; Synthesizer references only one source. Intermittent. Cause: missing &lt;code&gt;operator.add&lt;/code&gt; reducer. Parallel branches overwrite instead of appending. There's no warning; the graph runs fine, it just loses data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizer Contradicted the Policy Agent.&lt;/strong&gt; Say a customer asks about returning an opened product. The policy agent correctly stated the 30-day &lt;em&gt;unopened-only&lt;/em&gt; return policy. The KB agent mentioned "hassle-free returns." The synthesizer merged these into: "You can return the product within 30 days, hassle-free" omitting the unopened requirement. LangSmith traces showed the policy agent's output was correct; the synthesizer span revealed where the information was lost. Fix: the policy-takes-precedence constraint in the synthesizer prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hung Branch Blocking Fan-In.&lt;/strong&gt; Response times spike from ~6s to 30s+. The fan-in waits for ALL branches. Your p50 is fine; your p99 is determined by the slowest branch on its worst day. Fix: async timeouts per branch; return partial results (&lt;code&gt;{"source": "web_search", "content": "Timed out"}&lt;/code&gt;) rather than blocking the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator Under-Dispatched&lt;/strong&gt;. A significant fraction of multi-domain queries will be only partially routed. Over-dispatching (an agent returning empty results) is cheap. Under-dispatching is a customer getting an incomplete answer. Fix: explicit multi-domain examples in the routing prompt and a &lt;code&gt;"when in doubt, include the agent"&lt;/code&gt; instruction.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Parallel agents are hard to debug without tracing. &lt;code&gt;@traceable&lt;/code&gt; on every sub-agent gives you per-branch spans in LangSmith. Tag production traces with metadata for filtering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={"customer_tier": "enterprise", "channel": "chat"},
    tags=["production", "v2"],
):
    result = graph.invoke({"question": "How does GDPR affect our data pipeline?"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first thing to check when latency spikes: is one branch consistently slower? LangSmith makes that a 10-second investigation instead of an hour of log-grepping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Shipping without evals is negligence. Three evaluators catch the most common regressions: deterministic coverage, structural fan-out validation, and LLM-as-judge for overall quality.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="research-agent-evals",
    description="Parallel research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is our refund policy for enterprise clients?"},
        {"question": "How does GDPR affect our data pipeline architecture?"},
        {"question": "What competitors launched AI features last quarter?"},
    ],
    outputs=[
        {"must_mention": ["refund", "enterprise", "policy"]},
        {"must_mention": ["GDPR", "data", "compliance"]},
        {"must_mention": ["competitor", "AI", "feature"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
Customer query: {inputs[question]}
AI response: {outputs[final_response]}

Rate 0.0-1.0 on completeness, accuracy, and tone.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the synthesizer actually address the question?"""
    text = outputs.get("final_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Is the fan-out actually working, or did it silently degrade?"""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


def target(inputs: dict) -&amp;gt; dict:
    return graph.invoke({"question": inputs["question"]})


results = evaluate(
    target,
    data="research-agent-evals",
    evaluators=[quality_judge, coverage, source_diversity],
    experiment_prefix="parallel-research-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;source_diversity&lt;/code&gt; is the only automated check that your parallel architecture is actually parallel. Without it, state clobbering can ship to production and sit there for weeks. Run this eval on every PR that touches agent code.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use parallel sub-agents when:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries regularly span 2+ domains in a single message&lt;/li&gt;
&lt;li&gt;You need per-domain traceability for debugging and compliance&lt;/li&gt;
&lt;li&gt;Sub-agents have different tool sets or retrieval sources&lt;/li&gt;
&lt;li&gt;You're iterating on prompts and need isolated regression testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries are single-domain (a FAQ bot doesn't need orchestration)&lt;/li&gt;
&lt;li&gt;Latency budget is extremely tight (routing adds one LLM call)&lt;/li&gt;
&lt;li&gt;You have fewer than 3 distinct knowledge domains&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Parallel sub-agents aren't architecturally complex: a fan-out, a fan-in, and a reducer. The code is about 15 lines of graph wiring. The production hardening is everything else.&lt;/p&gt;

&lt;p&gt;Start with static fan-out. Add conditional routing when you have data showing which sources matter for which queries. Write the &lt;code&gt;source_diversity&lt;/code&gt; eval before you write the second prompt. And put &lt;code&gt;operator.add&lt;/code&gt; on your list fields; you'll thank me later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/01-parallel-sub-agents/" rel="noopener noreferrer"&gt;Parallel Agents GitHub Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/quickstart" rel="noopener noreferrer"&gt;LangGraph Quickstart (State, Reducers, Graph Construction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith Observability &amp;amp; Tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded" rel="noopener noreferrer"&gt;https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your AI Just Emailed a Customer Without Permission</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:21 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</link>
      <guid>https://forem.com/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</guid>
      <description>&lt;p&gt;In a customer complaint handler for a fintech company you have drafted responses, checked tone, and verified responses to match company policy. Automated from end to end. Then, the agent sends a $4,200 refund approval to a customer who'd asked about a fee schedule. The LLM hallucinates the complaint, writes up a professional apology with a specific dollar amount, and fires it off before anyone on the team even knows.&lt;/p&gt;

&lt;p&gt;Better prompts won’t help because the problem isn't what the model says, it's that nothing stops it from saying it.&lt;/p&gt;

&lt;p&gt;To fix this you need an approval gate. Somewhere in the agent’s graph where execution... stops. State gets written to disk and a human looks at the draft. Only after they say "yeah, send it" does anything go out the door. LangGraph has a built-in primitive for this called &lt;code&gt;interrupt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's walk through the full pattern here. The code is straightforward but state management can trip you up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost argument (if you need one)
&lt;/h2&gt;

&lt;p&gt;If you're already sold on why AI shouldn't email customers unsupervised, skip this, but if you need to convince your PM, here's some napkin math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without Gate&lt;/th&gt;
&lt;th&gt;With Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Messages sent/day&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate (wrong tone/info)&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;~0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bad messages/day&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cost per bad message&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily risk&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What we’re building
&lt;/h2&gt;

&lt;p&gt;A customer complaint response pipeline. Complaint comes in, AI drafts a response, a human approves or edits, system sends the final version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Intake] → [Draft Response] → [INTERRUPT: Human Review] → [Send Response] → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;interrupt&lt;/code&gt; is where execution pauses. All the graph state (draft, original complaint, metadata, etc.) gets checkpointed. It could be hours or days before someone reviews it, and when they do, the graph picks up right where it stopped.&lt;/p&gt;

&lt;p&gt;Even in serverless environments, &lt;code&gt;interrupt&lt;/code&gt; is resilient. The Python process can crash. The server can restart. You resume with the same &lt;code&gt;thread_id&lt;/code&gt; and LangGraph reloads everything from the checkpointer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The state schema
&lt;/h2&gt;

&lt;p&gt;Whatever the reviewer needs to see has to be in state before the interrupt fires.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    complaint: str
    customer_id: str
    draft_response: str
    review_decision: str
    reviewer_notes: str
    final_response: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The nodes
&lt;/h2&gt;

&lt;p&gt;Let’s build three nodes: draft, review, send. All with &lt;code&gt;@traceable&lt;/code&gt;, because six months from now when someone asks "who approved sending that email to the VP of procurement at our biggest account," you want a trace showing what the AI wrote vs. what a person changed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="draft_response", run_type="chain")
def draft_response(state: State) -&amp;gt; dict:
    response = llm.invoke([
        SystemMessage(
            content="You are a customer service agent. Draft a professional, "
                    "empathetic response to the following complaint. Be specific "
                    "about next steps. Do NOT promise refunds or credits unless "
                    "the complaint clearly warrants one. Keep it under 150 words."
        ),
        HumanMessage(
            content=f"Customer ID: {state['customer_id']}\n\n"
                    f"Complaint: {state['complaint']}"
        ),
    ])
    return {"draft_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The review node is where &lt;code&gt;interrupt()&lt;/code&gt; does its work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt

@traceable(name="human_review", run_type="chain")
def human_review(state: State) -&amp;gt; dict:
    decision = interrupt({
        "draft": state["draft_response"],
        "customer_id": state["customer_id"],
        "complaint": state["complaint"],
        "instructions": "Review the draft. Respond with a JSON object: "
                        '{"action": "approve" | "edit" | "reject", '
                        '"edited_response": "...", "notes": "..."}'
    })
    return {
        "review_decision": decision["action"],
        "reviewer_notes": decision.get("notes", ""),
        "final_response": decision.get("edited_response", state["draft_response"])
            if decision["action"] != "reject" else "",
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dict you pass to &lt;code&gt;interrupt()&lt;/code&gt; is the payload. It shows up in the &lt;code&gt;__interrupt__&lt;/code&gt; field of the graph's return value, which is what your UI or Slack bot reads to build the review screen. When someone calls &lt;code&gt;Command(resume={"action": "approve"})&lt;/code&gt;, that dict becomes what &lt;code&gt;interrupt()&lt;/code&gt; returns. The function resumes from the line right after the &lt;code&gt;interrupt()&lt;/code&gt; call. It looks like a normal function call but there's a checkpoint boundary hiding inside it.&lt;/p&gt;

&lt;p&gt;Send node. Don't send if it was rejected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="send_response", run_type="chain")
def send_response(state: State) -&amp;gt; dict:
    if state["review_decision"] == "reject":
        return {"final_response": "[REJECTED] " + state["reviewer_notes"]}
    return {"final_response": state["final_response"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Wiring it up
&lt;/h2&gt;

&lt;p&gt;The checkpointer makes interrupts durable. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for dev and &lt;code&gt;PostgresSaver&lt;/code&gt; for prod. If you forget the checkpointer, &lt;code&gt;interrupt()&lt;/code&gt; throws a &lt;code&gt;RuntimeError&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("draft", draft_response)
builder.add_node("review", human_review)
builder.add_node("send", send_response)

builder.add_edge(START, "draft")
builder.add_edge("draft", "review")
builder.add_edge("review", "send")
builder.add_edge("send", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The full interrupt/resume cycle
&lt;/h2&gt;

&lt;p&gt;Two &lt;code&gt;invoke&lt;/code&gt; calls. The first runs until the interrupt and stops; the second picks up where it left off.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Command

config = {"configurable": {"thread_id": "complaint-1234"}}

# Phase 1: Run until the interrupt
result = graph.invoke(
    {
        "complaint": "I was charged twice for my subscription last month. "
                     "Order #A-9912. I want a refund immediately.",
        "customer_id": "cust_8837",
    },
    config=config,
)

# The graph paused. Extract the interrupt payload.
interrupt_data = result["__interrupt__"][0].value
print(f"Draft for review: {interrupt_data['draft']}")
print(f"Customer: {interrupt_data['customer_id']}")

# Phase 2: Human reviews and approves (could be minutes or days later)
final_result = graph.invoke(
    Command(resume={
        "action": "edit",
        "edited_response": "We've identified the duplicate charge on Order #A-9912. "
                           "A refund of $29.99 has been initiated and will appear "
                           "in 3-5 business days. We apologize for the inconvenience.",
        "notes": "Verified duplicate charge in billing system. Approved refund.",
    }),
    config=config,  # Same thread_id — this is how LangGraph finds the checkpoint
)

print(f"Final response: {final_result['final_response']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;thread_id&lt;/code&gt; in the config matters more than anything else here. It's the key into the checkpointer. Without a &lt;code&gt;thread_id&lt;/code&gt; you can't resume. Treat it as a primary key and map it to something stable in your system: ticket ID, conversation ID, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding risk-based routing
&lt;/h2&gt;

&lt;p&gt;The basic version sends everything through human review. Start there, but eventually reviewers get tired of approving "thanks for contacting us, we're looking into it" all day, and you'll want to auto-approve the low-risk stuff.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel, Field


class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    reason: str = Field(description="Why this risk level was assigned")


risk_llm = llm.with_structured_output(RiskAssessment)


@traceable(name="assess_risk", run_type="chain")
def assess_risk(state: State) -&amp;gt; dict:
    assessment = risk_llm.invoke([
        SystemMessage(
            content="Assess the risk level of this customer service response. "
                    "high = involves money, legal, account changes, or could "
                    "be interpreted as a binding commitment. "
                    "medium = emotional topic, could escalate. "
                    "low = simple acknowledgment, FAQ, status update."
        ),
        HumanMessage(
            content=f"Complaint: {state['complaint']}\n\n"
                    f"Draft response: {state['draft_response']}"
        ),
    ])
    return {"review_decision": assessment.risk_level}


def route_by_risk(state: State) -&amp;gt; str:
    if state["review_decision"] == "low":
        return "send"
    return "review"


builder_v2 = StateGraph(State)

builder_v2.add_node("draft", draft_response)
builder_v2.add_node("assess", assess_risk)
builder_v2.add_node("review", human_review)
builder_v2.add_node("send", send_response)

builder_v2.add_edge(START, "draft")
builder_v2.add_edge("draft", "assess")
builder_v2.add_conditional_edges("assess", route_by_risk, {"send": "send", "review": "review"})
builder_v2.add_edge("review", "send")
builder_v2.add_edge("send", END)

graph_v2 = builder_v2.compile(checkpointer=InMemorySaver())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fair warning: you've now introduced a second LLM call as a gate, and that gate can be wrong in both directions. Under-classify risk and messages go out without review. Over-classify and reviewers are right back to rubber-stamping everything. Run the classifier in logging-only mode for a couple of weeks first: route everything through review, but record what the classifier would have done, and use that long-term memory to tune the classifier. Start skipping reviews on low-risk messages only after you trust the data.&lt;/p&gt;
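
&lt;p&gt;A sketch of that logging-only mode (the logger setup is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging

logger = logging.getLogger("risk_router")


def route_by_risk_shadow(state: State) -&amp;gt; str:
    # record what the classifier would have done, but keep everything in review
    logger.info("risk classifier would route to: %s", state["review_decision"])
    return "review"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;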

&lt;h2&gt;
  
  
  The bugs
&lt;/h2&gt;

&lt;p&gt;The demo works great... but...&lt;/p&gt;

&lt;h3&gt;
  
  
  Lost thread_id
&lt;/h3&gt;

&lt;p&gt;Someone approves a draft in Slack. The integration pulls out the approval decision but constructs a &lt;em&gt;new&lt;/em&gt; thread_id instead of looking up the one stored with the interrupt payload. Now &lt;code&gt;Command(resume=...)&lt;/code&gt; starts a fresh run whose input is an approval decision, not the complaint.&lt;/p&gt;

&lt;p&gt;This happens a lot. Store the thread_id alongside the interrupt payload when you surface it to reviewers. Put it in a database. Put it in the Slack message metadata. Do not lose it.&lt;/p&gt;
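
&lt;p&gt;A minimal sketch of that bookkeeping, with a hypothetical SQLite table standing in for whatever store you use: persist the thread_id when the draft goes out, and look it up (never rebuild it) when the approval comes back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

# Hypothetical store for pending reviews: thread_id keyed by ticket.
db = sqlite3.connect("reviews.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS pending_reviews "
    "(ticket_id TEXT PRIMARY KEY, thread_id TEXT, payload TEXT)"
)


def surface_for_review(ticket_id: str, thread_id: str, payload: str) -&amp;gt; None:
    """Persist the thread_id the moment the draft goes out for review."""
    db.execute(
        "INSERT OR REPLACE INTO pending_reviews VALUES (?, ?, ?)",
        (ticket_id, thread_id, payload),
    )
    db.commit()


def resume_config_for(ticket_id: str) -&amp;gt; dict:
    """Look the thread_id up on approval. Never construct a new one."""
    row = db.execute(
        "SELECT thread_id FROM pending_reviews WHERE ticket_id = ?", (ticket_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"No pending review for {ticket_id}")
    return {"configurable": {"thread_id": row[0]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;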

&lt;h3&gt;
  
  
  Stale state
&lt;/h3&gt;

&lt;p&gt;Reviewer opens the draft at 11:30. Goes to lunch. Comes back at 1pm and hits approve. In the meantime, the customer sent two more messages and someone on the support team already replied manually. The approved draft is now responding to a conversation that moved on.&lt;/p&gt;

&lt;p&gt;LangGraph has no idea. It resumes from the checkpoint, which is frozen in time. Fix this by putting a &lt;code&gt;created_at&lt;/code&gt; timestamp in the interrupt payload and checking it against the customer record's &lt;code&gt;last_updated_at&lt;/code&gt; on resume. If anything changed, re-draft.&lt;/p&gt;
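
&lt;p&gt;A sketch of that guard, reusing &lt;code&gt;graph&lt;/code&gt; and &lt;code&gt;config&lt;/code&gt; from above. The &lt;code&gt;created_at&lt;/code&gt; payload field and the customer-record fields are names we're assuming, not LangGraph APIs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime


def resume_if_fresh(payload: dict, customer: dict, decision: dict) -&amp;gt; dict:
    """Re-draft instead of resuming when the conversation moved on mid-review."""
    created_at = datetime.fromisoformat(payload["created_at"])
    last_updated = datetime.fromisoformat(customer["last_updated_at"])
    if last_updated &amp;gt; created_at:
        # The checkpoint is frozen in time. Discard the approved draft, re-draft.
        return graph.invoke(
            {"complaint": customer["latest_complaint"], "customer_id": customer["id"]},
            config=config,
        )
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;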

&lt;h3&gt;
  
  
  Double resume
&lt;/h3&gt;

&lt;p&gt;Shared review queue. Two reviewers see the same pending draft. Both click approve. Depending on the checkpointer implementation, the second resume is either a no-op or an error, but by then the send logic already fired on the first one. Maybe that's fine. Maybe you just sent duplicate emails.&lt;/p&gt;

&lt;p&gt;Build in idempotency: check whether the thread already has a &lt;code&gt;review_decision&lt;/code&gt; before doing anything with the resume.&lt;/p&gt;
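
&lt;p&gt;LangGraph's checkpoint API gives you enough for the simple version. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resume_once(config: dict, decision: dict):
    """Drop the second approval when the thread has already been decided."""
    snapshot = graph.get_state(config)  # reads the checkpoint for this thread_id
    if snapshot.values.get("review_decision"):
        return None  # already resumed once; ignore the duplicate click
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two clicks in the same instant can still both pass that check. For a shared queue, claim the review in your own database first (an update only one reviewer can win) and resume only after the claim succeeds.&lt;/p&gt;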

&lt;h3&gt;
  
  
  Interrupt reordering
&lt;/h3&gt;

&lt;p&gt;Two &lt;code&gt;interrupt()&lt;/code&gt; calls in one node (say, one for policy review and one for tone). LangGraph matches resume values to interrupts by position, not by name. There are no names. Refactor and swap the order, the policy answer goes to the tone check and vice versa.&lt;/p&gt;

&lt;p&gt;Don't put multiple interrupts in one node; give each interrupt its own node instead.&lt;/p&gt;
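
&lt;p&gt;The shape to aim for, sketched with hypothetical node names, so each resume unambiguously answers one question:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt


def policy_review(state: State) -&amp;gt; dict:
    decision = interrupt({"kind": "policy", "draft": state["draft_response"]})
    return {"policy_decision": decision}


def tone_review(state: State) -&amp;gt; dict:
    decision = interrupt({"kind": "tone", "draft": state["draft_response"]})
    return {"tone_decision": decision}

# One interrupt per node: a refactor can reorder nodes without crossing wires.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;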

&lt;h2&gt;
  
  
  Tracing across the gap
&lt;/h2&gt;

&lt;p&gt;Interrupt-based workflows leave a gap in the LangSmith timeline where the human review happened. The draft trace ends, then hours later the resume trace starts, and nothing connects them unless you're deliberate about it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

ticket_id = "TICKET-4821"
config = {"configurable": {"thread_id": ticket_id}}

# Phase 1: Draft
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "draft"},
    tags=["production", "complaint-handler", "phase-1"],
):
    result = graph.invoke(
        {
            "complaint": "Your app crashed and I lost 3 hours of work.",
            "customer_id": "cust_2291",
        },
        config=config,
    )

# ... time passes, human reviews ...

# Phase 2: Resume
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "resume", "reviewer": "jane@company.com"},
    tags=["production", "complaint-handler", "phase-2"],
):
    final = graph.invoke(
        Command(resume={"action": "approve", "notes": "Looks good."}),
        config=config,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Put the ticket ID in the metadata for both phases. Now you can filter in LangSmith and see the full lifecycle of a single complaint even though draft and resume were separate invocations. The &lt;code&gt;reviewer&lt;/code&gt; field in phase 2 is your audit trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;You need to know if drafts are any good before a human ever sees them.&lt;/p&gt;

&lt;p&gt;Dataset setup and evaluators live in &lt;code&gt;evals.py&lt;/code&gt; in the companion repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

from complaint_handler import graph

ls_client = Client()

DATASET_NAME = "complaint-handler-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Human-in-the-loop complaint handler evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {
                "complaint": "Charged twice for order #A-1234. Want a refund.",
                "customer_id": "cust_001",
            },
            {
                "complaint": "App crashes every time I open the settings page.",
                "customer_id": "cust_002",
            },
            {
                "complaint": "Your CEO's tweet was offensive. Cancelling my account.",
                "customer_id": "cust_003",
            },
        ],
        outputs=[
            {
                "must_mention": ["refund", "order", "A-1234"],
                "risk": "high",
            },
            {
                "must_mention": ["crash", "settings", "investigating"],
                "risk": "medium",
            },
            {
                "must_mention": ["feedback", "understand", "account"],
                "risk": "high",
            },
        ],
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three evaluators. LLM judge for draft quality, keyword coverage, and a check for unauthorized promises:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DRAFT_QUALITY_PROMPT = """\
Customer complaint: {inputs}
AI draft response: {outputs}

Rate 0.0-1.0 on empathy, accuracy, and professionalism.
Deduct points if the draft promises specific remedies (refunds, credits)
without explicit authorization.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

draft_judge = create_llm_as_judge(
    prompt=DRAFT_QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="draft_quality",
    continuous=True,
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft actually address the complaint specifics?"""
    text = outputs.get("draft_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def no_unauthorized_promises(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft promise refunds or credits without authorization?"""
    text = outputs.get("draft_response", "").lower()
    dangerous_phrases = ["refund has been", "credit has been", "we will refund",
                         "we will credit", "compensation of"]
    violations = sum(1 for p in dangerous_phrases if p in text)
    return {"key": "no_unauthorized_promises", "score": 1.0 if violations == 0 else 0.0}


def target(inputs: dict) -&amp;gt; dict:
    """Run the graph until the interrupt (draft phase only)."""
    config = {"configurable": {"thread_id": f"eval-{inputs['customer_id']}"}}
    result = graph.invoke(inputs, config=config)
    return {"draft_response": result.get("draft_response", "")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;no_unauthorized_promises&lt;/code&gt; catches the failure mode from the top of this post. If the draft says "a refund has been initiated" when nobody authorized a refund, it scores zero. Run this eval every time you change the system prompt.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if &lt;strong&gt;name&lt;/strong&gt; == "&lt;strong&gt;main&lt;/strong&gt;":&lt;br&gt;
    results = evaluate(&lt;br&gt;
        target,&lt;br&gt;
        data=DATASET_NAME,&lt;br&gt;
        evaluators=[draft_judge, coverage, no_unauthorized_promises],&lt;br&gt;
        experiment_prefix="complaint-handler-v1",&lt;br&gt;
        max_concurrency=4,&lt;br&gt;
    )&lt;br&gt;
    print("\nEvaluation complete. Check LangSmith for results.")&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Human In The Loop
&lt;/h2&gt;

&lt;p&gt;If AI is writing things that go to customers, you need a gate. Processing refunds, updating account records, anything you can't undo with a quick "sorry about that" email. Regulated industries need the gate plus an audit trail of who approved what.&lt;/p&gt;

&lt;p&gt;You don't need this for internal stuff. Summarizing meeting notes, running analysis for a dashboard, generating reports that a human reads. &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The two function calls: &lt;code&gt;interrupt()&lt;/code&gt; and &lt;code&gt;Command(resume=...)&lt;/code&gt;. Pause execution, persist state, resume later.&lt;/p&gt;

&lt;p&gt;Most of the work is everything around those two calls. Thread IDs getting lost, the world changing during the review gap, two reviewers approving the same draft, traces that need to connect across a timeline gap of hours or days.&lt;/p&gt;

&lt;p&gt;Start by routing every response through review. Reviewers will complain. Good. Measure which categories they rubber-stamp, run your evals, and only then start auto-approving the boring stuff.  &lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/02-human-in-the-loop/tree/9e328bdd3770541a764134efa7f87d53de2dad6b" rel="noopener noreferrer"&gt;Human in the Loop Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/interrupts" rel="noopener noreferrer"&gt;Interrupts (Human-in-the-loop / pause &amp;amp; resume)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/persistence" rel="noopener noreferrer"&gt;Persistence (Thread IDs &amp;amp; Checkpointers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="noopener noreferrer"&gt;LangGraph Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation-quickstart?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangSmith Eval Quickstarter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission" rel="noopener noreferrer"&gt;https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming Agent State with LangGraph</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:26 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/streaming-agent-state-with-langgraph-10kg</link>
      <guid>https://forem.com/focused_dot_io/streaming-agent-state-with-langgraph-10kg</guid>
      <description>&lt;p&gt;Your research agent takes 9 seconds to answer a question. It fans out to three sources, synthesizes results, returns a polished answer. The user sees a blank screen for all nine of those seconds. By second 5 they've refreshed the page, doubled your API costs, and still seen nothing.&lt;/p&gt;

&lt;p&gt;Streaming fixes this. Show the user what the agent is doing while it's doing it: "Searching knowledge base...", "Found 3 results...", "Synthesizing..." and then stream the final answer token by token. Same 9 seconds, but the user sees progress from millisecond 200.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perception Math
&lt;/h2&gt;

&lt;p&gt;Identical work, different user experience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Wall time&lt;/th&gt;
&lt;th&gt;Time to first byte&lt;/th&gt;
&lt;th&gt;Perceived wait&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invoke()&lt;/code&gt; (no streaming)&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode="updates")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode=["updates", "custom", "messages"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Can see what it’s doing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A multi-step research agent that streams three types of events to the UI: node-level progress updates, custom status messages from inside nodes, and token-by-token LLM output for the final synthesis.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          ┌─ stream: "Searching KB..."
[Intake] → [Research KB]  ┤
                          └─ stream: {results: 3}
                                    ↓
                          ┌─ stream: "Analyzing results..."
         → [Synthesize]  ┤
                          └─ stream: tokens... t-o-k-e-n-b-y-t-o-k-e-n
                                    ↓
                                     → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three stream modes run simultaneously: &lt;code&gt;updates&lt;/code&gt; for graph state changes, &lt;code&gt;custom&lt;/code&gt; for application-specific progress events, and &lt;code&gt;messages&lt;/code&gt; for LLM token streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Modes
&lt;/h2&gt;

&lt;p&gt;LangGraph exposes five stream modes. You'll use three in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it streams&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;values&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full state after each superstep&lt;/td&gt;
&lt;td&gt;Debugging, state inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;updates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;State delta from each node&lt;/td&gt;
&lt;td&gt;Production UIs — lightweight, shows which node ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM tokens + metadata&lt;/td&gt;
&lt;td&gt;Chat UIs — token-by-token output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Arbitrary data from &lt;code&gt;get_stream_writer()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Progress bars, status messages, structured events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Everything — internal execution details&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In production, use &lt;code&gt;["updates", "custom", "messages"]&lt;/code&gt;. &lt;code&gt;values&lt;/code&gt; sends the entire state on every step. &lt;code&gt;debug&lt;/code&gt; is for development.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State and two nodes: a research step that emits custom progress events, and a synthesizer that streams its LLM response token by token.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.config import get_stream_writer
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research: str
    answer: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The research node uses &lt;code&gt;get_stream_writer()&lt;/code&gt; to push status updates to the client. These show up in the &lt;code&gt;custom&lt;/code&gt; stream mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="research", run_type="chain")
def research(state: State) -&amp;gt; dict:
    writer = get_stream_writer()

    writer({"step": "research", "status": "starting", "message": "Searching knowledge base..."})

    response = llm.invoke([
        SystemMessage(
            content="You are a research assistant. Search for relevant information "
                    "about the user's question. Return a concise summary of findings."
        ),
        HumanMessage(content=state["question"]),
    ])

    writer({"step": "research", "status": "complete", "message": "Research complete."})

    return {"research": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer uses the LLM normally. LangGraph automatically streams its tokens when &lt;code&gt;messages&lt;/code&gt; mode is active:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="synthesize", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    writer = get_stream_writer()
    writer({"step": "synthesize", "status": "starting", "message": "Synthesizing answer..."})

    response = llm.invoke([
        SystemMessage(
            content="Synthesize the research into a clear, actionable answer. "
                    "Be concise but thorough."
        ),
        HumanMessage(
            content=f"Question: {state['question']}\n\nResearch:\n{state['research']}"
        ),
    ])

    writer({"step": "synthesize", "status": "complete", "message": "Done."})
    return {"answer": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("research", research)
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "research")
builder.add_edge("research", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Multi-mode Streaming
&lt;/h2&gt;

&lt;p&gt;A single &lt;code&gt;.stream()&lt;/code&gt; call can emit node updates, custom progress events, and LLM tokens simultaneously:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for mode, chunk in graph.stream(
    {"question": "What are the key differences between REST and GraphQL for mobile APIs?"},
    stream_mode=["updates", "custom", "messages"],
):
    if mode == "updates":
        # Node completed — chunk is the state delta
        node_name = list(chunk.keys())[0]
        print(f"[node] {node_name} completed")

    elif mode == "custom":
        # Custom progress event from get_stream_writer()
        print(f"[status] {chunk.get('message', chunk)}")

    elif mode == "messages":
        # LLM token — chunk is a tuple of (message_chunk, metadata)
        message_chunk, metadata = chunk
        if hasattr(message_chunk, "content") and message_chunk.content:
            print(message_chunk.content, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the output shape changes with multi-mode. Single mode (&lt;code&gt;stream_mode="updates"&lt;/code&gt;) yields chunks directly. Multi-mode (&lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt;) yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples. Code that works with single mode breaks with multi-mode because the unpacking is different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async streaming
&lt;/h2&gt;

&lt;p&gt;For production APIs, use &lt;code&gt;astream&lt;/code&gt; with &lt;code&gt;async for&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from langsmith import traceable


@traceable(name="stream_research", run_type="chain")
async def stream_research(question: str):
    chunks = []
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                chunks.append(message_chunk.content)
                yield {"type": "token", "content": message_chunk.content}
        elif mode == "custom":
            yield {"type": "status", "content": chunk}
        elif mode == "updates":
            yield {"type": "node_update", "content": chunk}


async def main():
    async for event in stream_research("How do vector databases work?"):
        if event["type"] == "token":
            print(event["content"], end="", flush=True)
        else:
            print(f"\n[{event['type']}] {event['content']}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  FastAPI + SSE
&lt;/h2&gt;

&lt;p&gt;The standard production pattern is a FastAPI endpoint that converts graph streams to SSE. SSE is one-directional (server to client), works over HTTP/1.1, and auto-reconnects:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langsmith import traceable

app = FastAPI()


@traceable(name="sse_research_stream", run_type="chain")
async def generate_sse(question: str):
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                data = json.dumps({"type": "token", "content": message_chunk.content})
                yield f"data: {data}\n\n"
        elif mode == "custom":
            data = json.dumps({"type": "status", "content": chunk})
            yield f"data: {data}\n\n"
        elif mode == "updates":
            node_name = list(chunk.keys())[0] if chunk else "unknown"
            data = json.dumps({"type": "node_complete", "node": node_name})
            yield f"data: {data}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/research/stream")
async def stream_endpoint(payload: dict):
    return StreamingResponse(
        generate_sse(payload["question"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt; in the response headers and &lt;code&gt;proxy_buffering off&lt;/code&gt; in your nginx config. Without these, nginx buffers the entire response before sending it to the client and your streaming pipeline becomes a regular HTTP response.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bugs
&lt;/h2&gt;

&lt;p&gt;These break under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse proxy buffering
&lt;/h3&gt;

&lt;p&gt;You deploy behind nginx or a cloud load balancer. SSE events arrive at the client in one big batch after the stream completes. Cause: proxy buffering is on by default. Set the &lt;code&gt;X-Accel-Buffering&lt;/code&gt; header, disable &lt;code&gt;proxy_buffering&lt;/code&gt; in nginx, and check your cloud provider's load balancer settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message chunk ordering
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;messages&lt;/code&gt; mode, you receive &lt;code&gt;AIMessageChunk&lt;/code&gt; objects. The &lt;code&gt;content&lt;/code&gt; field is usually a string, except when the model returns tool calls where it's a list of content blocks. Concatenating &lt;code&gt;.content&lt;/code&gt; naively produces garbled output. Check &lt;code&gt;isinstance(message_chunk.content, str)&lt;/code&gt; before concatenating and handle tool-call chunks separately.&lt;/p&gt;
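
&lt;p&gt;A small guard along those lines, assuming Anthropic-style content blocks (dicts with a &lt;code&gt;type&lt;/code&gt; field):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def chunk_text(message_chunk) -&amp;gt; str:
    """Pull only the plain-text part out of an AIMessageChunk."""
    content = message_chunk.content
    if isinstance(content, str):
        return content
    # A list of content blocks (e.g. during tool calls): keep text blocks only.
    return "".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;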

&lt;h3&gt;
  
  
  Backpressure on slow clients
&lt;/h3&gt;

&lt;p&gt;Your agent streams tokens faster than the client can consume them (mobile on 3G, overloaded browser tab). The server-side buffer grows until memory pressure kills the process. Use bounded async queues or configure your ASGI server's per-connection send buffer limits.&lt;/p&gt;
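
&lt;p&gt;A minimal sketch of the bounded-queue approach, reusing &lt;code&gt;graph&lt;/code&gt; from above (the queue size is illustrative): a producer task fills a bounded &lt;code&gt;asyncio.Queue&lt;/code&gt;, and &lt;code&gt;put()&lt;/code&gt; blocks when the consumer falls behind, throttling the stream instead of growing memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio


async def stream_with_backpressure(question: str, max_buffered: int = 100):
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)

    async def produce():
        async for mode, chunk in graph.astream(
            {"question": question},
            stream_mode=["updates", "custom", "messages"],
        ):
            # put() blocks once max_buffered events are waiting: backpressure.
            await queue.put((mode, chunk))
        await queue.put(None)  # sentinel: stream finished

    producer = asyncio.create_task(produce())
    try:
        while (item := await queue.get()) is not None:
            yield item
    finally:
        producer.cancel()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;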

&lt;h3&gt;
  
  
  Mixed single/multi mode unpacking
&lt;/h3&gt;

&lt;p&gt;Developer switches from &lt;code&gt;stream_mode="updates"&lt;/code&gt; to &lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt; and doesn't update the unpacking code. The &lt;code&gt;for chunk in graph.stream(...)&lt;/code&gt; now yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples, but the code tries to use the tuple as a dict. No error, just wrong data flowing through. Always use multi-mode from the start, even if you only need one mode today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Stream-based workflows produce many small events. Tag your traces so you can measure stream performance in &lt;a href="https://www.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={
        "stream_mode": "multi",
        "client_type": "web",
        "session_id": "sess_12345",
    },
    tags=["production", "streaming", "v1"],
):
    for mode, chunk in graph.stream(
        {"question": "Explain vector similarity search"},
        stream_mode=["updates", "custom", "messages"],
    ):
        pass  # process chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The LangSmith trace shows per-node timings. Use this to find nodes that are slow to emit their first token (high time-to-first-byte) vs. nodes that produce tokens slowly (low throughput).&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Streaming doesn't change what the agent produces, it changes how the output is delivered. Evals verify that streamed output matches what &lt;code&gt;invoke()&lt;/code&gt; would return, and that custom events are emitted correctly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="streaming-agent-evals",
    description="Streaming research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What are the tradeoffs between REST and GraphQL?"},
        {"question": "How do vector databases enable semantic search?"},
        {"question": "What is retrieval-augmented generation?"},
    ],
    outputs=[
        {"must_mention": ["REST", "GraphQL", "tradeoff"]},
        {"must_mention": ["vector", "embedding", "similarity"]},
        {"must_mention": ["retrieval", "generation", "context"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
User question: {inputs[question]}
Agent response: {outputs[answer]}

Rate 0.0-1.0 on completeness, accuracy, and clarity.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the response address the key topics?"""
    text = outputs.get("answer", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def stream_completeness(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Does streaming produce the same output as invoke?"""
    streamed = outputs.get("answer", "")
    invoked_result = graph.invoke({"question": inputs["question"]})
    invoked = invoked_result.get("answer", "")
    # Exact match is too strict — LLM outputs vary. Check key content overlap.
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return {"key": "stream_completeness", "score": 1.0}
    overlap = len(streamed_words &amp;amp; invoked_words) / len(invoked_words)
    return {"key": "stream_completeness", "score": min(overlap, 1.0)}


def custom_events_emitted(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Were custom status events emitted during streaming?"""
    events = outputs.get("custom_events", [])
    expected_steps = {"research", "synthesize"}
    seen_steps = {e.get("step") for e in events if isinstance(e, dict)}
    coverage_score = len(seen_steps &amp;amp; expected_steps) / len(expected_steps)
    return {"key": "custom_events", "score": coverage_score}


def target(inputs: dict) -&amp;gt; dict:
    custom_events = []
    answer_chunks = []
    for mode, chunk in graph.stream(
        {"question": inputs["question"]},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "custom":
            custom_events.append(chunk)
        elif mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                answer_chunks.append(message_chunk.content)
        elif mode == "updates":
            if "synthesize" in chunk:
                pass  # answer is captured via message chunks

    return {
        "answer": "".join(answer_chunks) if answer_chunks else "",
        "custom_events": custom_events,
    }


results = evaluate(
    target,
    data="streaming-agent-evals",
    evaluators=[quality_judge, coverage, stream_completeness, custom_events_emitted],
    experiment_prefix="streaming-agent-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;stream_completeness&lt;/code&gt; verifies that the streaming path produces equivalent output to &lt;code&gt;invoke()&lt;/code&gt;. This catches bugs where stream chunking drops content, like an SSE serializer silently truncating chunks that exceed a size limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Stream
&lt;/h2&gt;

&lt;p&gt;Use streaming for any user-facing agent interaction over 2 seconds, multi-step agents where progress indicators reduce perceived latency, and chat interfaces where token-by-token display is expected.&lt;/p&gt;

&lt;p&gt;Skip it for background jobs with no user waiting, when latency is already under a second, and when the output is structured data rather than natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Three modes in production: &lt;code&gt;updates&lt;/code&gt; for node transitions, &lt;code&gt;custom&lt;/code&gt; for progress events via &lt;code&gt;get_stream_writer()&lt;/code&gt;, and &lt;code&gt;messages&lt;/code&gt; for token streaming. Combine them with &lt;code&gt;stream_mode=["updates", "custom", "messages"]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Deploy behind FastAPI + SSE with &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt;. Watch for reverse proxy buffering, backpressure on slow clients, and the single-to-multi mode unpacking change.  &lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/03-streaming-agents" rel="noopener noreferrer"&gt;Streaming Agent State GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/streaming" rel="noopener noreferrer"&gt;LangGraph Streaming (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langchain/streaming/overview" rel="noopener noreferrer"&gt;LangChain Streaming Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/add-metadata-tags" rel="noopener noreferrer"&gt;LangSmith Tracing Metadata &amp;amp; Tags&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/streaming-agent-state-with-langgraph" rel="noopener noreferrer"&gt;https://focused.io/lab/streaming-agent-state-with-langgraph&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Driving Value with LangSmith Insights</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:24 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/driving-value-with-langsmith-insights-5bp</link>
      <guid>https://forem.com/focused_dot_io/driving-value-with-langsmith-insights-5bp</guid>
      <description>&lt;p&gt;Imagine you have a deployed agentic system in production. Everything is going well, users are interacting with the product, and there are no critical issues going on. But what comes next? How can we monitor our system to understand what needs to be improved, fixed or built next? &lt;/p&gt;

&lt;p&gt;The first requirement is to have great observability. LangSmith is a great tool for this.&lt;/p&gt;

&lt;p&gt;We can use it to monitor all of our production runs, detect errors and understand how the model behaves across different interactions.&lt;/p&gt;

&lt;p&gt;In October 2025, LangChain released a new feature: &lt;a href="https://www.blog.langchain.com/insights-agent-multiturn-evals-langsmith/" rel="noopener noreferrer"&gt;&lt;strong&gt;Insights Agent&lt;/strong&gt;&lt;/a&gt;. This feature allows an agent to analyze your LangSmith traces and surface usage patterns, common behaviors, and recurring error modes automatically. Instead of manually digging through logs, you can let an agent do the analysis for you. If you want to read more about it, here's a &lt;a href="https://docs.langchain.com/langsmith/insights" rel="noopener noreferrer"&gt;link to the docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to run the Insights Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to go through a simple demo of how to use this exciting new tool with a simple chatbot graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Plus or Enterprise LangSmith plan&lt;/li&gt;
&lt;li&gt;A tracing project with a good amount of traces to analyze&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first thing we need to do is go to our LangSmith project. Once there, we are going to see multiple tabs on the top of the screen. Click on the one that says “Insights”.&lt;/p&gt;

&lt;p&gt;If this is our first time running Insights, we are going to see an empty page and a “Create Insight” button. We can go ahead and click it.&lt;/p&gt;

&lt;p&gt;Now, we are presented with two alternatives for how to run the Insights Agent: auto or manual. For the sake of simplicity, let’s start with the “auto” mode.  &lt;/p&gt;

&lt;p&gt;We need to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“What does the agent in this tracing project do?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt; &lt;em&gt;“What would you like to learn about this agent?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;“How are traces in this tracing project structured? Are there specific input/output keys to pay attention to?”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This information will be used in our agent prompt, and will help tailor the output to our needs.&lt;/p&gt;

&lt;p&gt;We can also choose if we want to use OpenAI or Anthropic as our provider. As a note, you will need an API key for either provider.&lt;/p&gt;

&lt;p&gt;After we click on “Run Job”, we are going to see a message saying the agent has started running in the background and that we will have our results in a few minutes. If we navigate to the Insights tab we are going to see the agent run in progress as well as the results that start to come out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to understand and use the results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this example, we are going to be using a chatbot that answers questions about restaurants and helps with making reservations.&lt;/p&gt;

&lt;p&gt;The first part of the output is a summary of the findings. This is going to be the answer to the question we were asked earlier about what we wanted to learn about this agent. In this case, we wanted to understand what customers were asking the chatbot in order to identify user patterns.&lt;/p&gt;

&lt;p&gt;We can see that in this example, 57% of the questions being asked to our chatbot are about feature discovery, 29% are about operating hours, and only 14% are about making reservations.&lt;/p&gt;

&lt;p&gt;This kind of result is interesting because it helps us understand what customers actually need. Maybe we initially assumed that most questions would be about making reservations, but this data doesn’t support that. &lt;strong&gt;LangSmith Insights is critical because it grounds our product decisions in real user behavior, helping us invest engineering effort where it delivers the most value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we click on the “Hide Findings” button, we can do a deep dive into the traces, broken down by category.&lt;/p&gt;

&lt;p&gt;If we click on any of the categories we can see all runs within that category and navigate to the trace we are interested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using evaluation + Insights to get the highest impact on value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we are comfortable with the categories of our generated insights, we can build evaluation datasets that mirror those categories. This way, we can understand how well our agent is answering questions across categories.&lt;/p&gt;

&lt;p&gt;Why do insights change this process? Imagine we run our evaluations and we discover the agent is only answering 40% of questions around reservations correctly. But insights reveal that reservation questions are actually the least common user queries. That context lowers the overall criticality of the issue and helps us prioritize fixes more intelligently.&lt;/p&gt;

&lt;p&gt;Insights add context to the analysis, but they don’t override business requirements. This is only an example: Depending on the use case, a low-frequency category like reservations may still demand zero errors if the business impact is high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have gone through a simple example to illustrate the power of this tool. But as we’ve seen, we can ask the agent virtually any question we want. For example, we could ask, &lt;em&gt;“What types of questions is my agent hallucinating on or answering incorrectly?”&lt;/em&gt; and the agent will find all traces that match those criteria. This is extremely flexible and powerful.&lt;/p&gt;

&lt;p&gt;LangSmith is still king when it comes to building and observing production-grade AI applications, and features like this are why I encourage you to try it out and keep creating amazing applications with it!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/driving-value-with-langsmith-insights" rel="noopener noreferrer"&gt;https://focused.io/lab/driving-value-with-langsmith-insights&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Most Teams Don't Have a Data Flywheel</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:56:27 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</link>
      <guid>https://forem.com/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</guid>
      <description>&lt;p&gt;&lt;em&gt;LangChain shows how the loop works. Here's why it stalls in production and what it actually takes to make it compound.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of &lt;a href="https://focused.io" rel="noopener noreferrer"&gt;Focused&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LangChain has been pushing a clear idea: production data should make your agents better.&lt;/p&gt;

&lt;p&gt;The loop looks like this: production traces capture real behavior, those traces become datasets, evaluators score performance, feedback improves those evaluators, and improvements get deployed back into the system. Over time, the system compounds.&lt;/p&gt;

&lt;p&gt;That is the data flywheel.&lt;/p&gt;

&lt;p&gt;And it is directionally right.&lt;/p&gt;

&lt;p&gt;But most teams building agents today are not seeing that compounding effect. The loop exists on paper. In practice, it stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Flywheel Actually Is
&lt;/h2&gt;

&lt;p&gt;In the LangChain ecosystem, especially with LangSmith, the flywheel connects three things: observability, evaluation, and iteration.&lt;/p&gt;

&lt;p&gt;Production traces become the source of truth. Failures are turned into datasets. Datasets become regression tests. Evaluators score performance at scale. Feedback improves those evaluators over time.&lt;/p&gt;

&lt;p&gt;The goal is simple: every production interaction should become an improvement signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;The issue is not the idea. The issue is that most teams never fully implement the system required to make it work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Traces are collected, but nothing happens.&lt;/strong&gt; Teams instrument their agents. They capture inputs, outputs, tool calls, and intermediate steps. And then it stops there. The missing step is turning traces into something actionable — structured datasets, labeled failures, repeatable test cases. Without that, you are not building a flywheel. You are just logging behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. There is no real evaluation layer.&lt;/strong&gt; This is where most teams stall. They review outputs manually. They rely on intuition. They make changes based on what "looks better." There is no automated evaluation, no regression testing, no baseline performance. So when something changes, there is no way to know if it improved or regressed. If you cannot measure it, the loop does not spin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evaluators are not trusted.&lt;/strong&gt; Even when teams introduce evaluation, it often breaks down. LLM-as-a-judge systems can scale evaluation, but only if they are clearly defined, calibrated against human feedback, and continuously refined. Without that, evaluator output becomes noisy. And noisy signals lead to random changes. If you do not trust your evaluation layer, you cannot rely on your flywheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The loop never actually closes.&lt;/strong&gt; Even when failures are identified, prompts get tweaked ad hoc, changes are not versioned, and fixes are not tested against past failures. So nothing compounds. A real loop looks like this: a failure is captured, the failure becomes a dataset, the dataset is evaluated, a change is applied, and the change is tested against that dataset. If you skip any step, the loop breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. There is no real production pressure.&lt;/strong&gt; This is the quiet failure that kills most flywheels. If your agent is not embedded in a real system, you do not get meaningful traffic, you do not see real edge cases, and you do not generate useful data. Internal demos do not create real signals. Without real usage, the flywheel has nothing to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Real Data Flywheel Looks Like
&lt;/h2&gt;

&lt;p&gt;At a system level, this is not a concept. It is a pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrumentation.&lt;/strong&gt; Every step of the agent is observable — inputs, decisions, state transitions, outputs. Using structured systems like LangGraph makes this consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset creation.&lt;/strong&gt; Production traces are turned into labeled examples, categorized failures, and reusable datasets. This is where the loop actually begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation.&lt;/strong&gt; You define what "good" looks like and measure it — correctness, tool selection, completion quality. Evaluations run continuously, not just during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration.&lt;/strong&gt; Evaluators improve over time. Human feedback corrects them, agreement is measured, alignment increases. This step is critical and often skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration and deployment.&lt;/strong&gt; Changes are applied intentionally — to prompts, graph structure, and tool logic. Then tested against historical failures before being deployed. Only validated improvements ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift Most Teams Need to Make
&lt;/h2&gt;

&lt;p&gt;The data flywheel is often described like a product feature. That is the problem.&lt;/p&gt;

&lt;p&gt;It is not something you turn on. It is an engineering system that connects observability, evaluation, feedback, and deployment into a continuous loop. Without that system, you do not have a flywheel. You have logs and intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Most teams do not have a data flywheel. They have a growing pile of traces and a sense that things might be improving.&lt;/p&gt;

&lt;p&gt;The teams that actually get better over time treat this differently. They build the system that makes improvement inevitable.&lt;/p&gt;

&lt;p&gt;If your agent only records what happened, it will stall. If your system learns from what happened, it compounds.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>programming</category>
    </item>
    <item>
      <title>LangGraph Error Handling Patterns for Production AI Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 21 Apr 2026 18:53:58 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</link>
      <guid>https://forem.com/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</guid>
      <description>&lt;p&gt;You have a document processing pipeline. It ingests contracts, extracts key clauses, validates them against policy, and generates a summary. Monday morning it processes 200 documents without a hiccup. Tuesday at 2 AM, Anthropic’s API returns a 429, the extraction node throws, and &lt;strong&gt;the entire pipeline stops.&lt;/strong&gt; Not just the one document — the whole batch. Your on-call engineer spends 45 minutes figuring out it was a transient rate limit that would have resolved itself with a 2-second backoff.&lt;/p&gt;

&lt;p&gt;The fix isn’t “add a try/except.” The fix is classifying errors by who can fix them and routing each class to the right handler. LangGraph gives you the primitives — &lt;code&gt;RetryPolicy&lt;/code&gt;, &lt;code&gt;Command&lt;/code&gt;, &lt;code&gt;interrupt()&lt;/code&gt;, and &lt;code&gt;ToolNode&lt;/code&gt; error handling — but the framework won’t decide your error strategy for you. That’s on you.&lt;/p&gt;

&lt;p&gt;This post shows the four error classes, the LangGraph primitives for each, and the production failures that surface when you get the classification wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Error Handling Classification Matrix
&lt;/h2&gt;

&lt;p&gt;Not all errors are equal. The single most important decision in your error-handling strategy is: &lt;strong&gt;who fixes this?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Class&lt;/th&gt;
&lt;th&gt;Who Fixes It&lt;/th&gt;
&lt;th&gt;LangGraph Primitive&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transient&lt;/td&gt;
&lt;td&gt;System (automatic)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RetryPolicy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API 429, network timeout, DNS blip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-Recoverable&lt;/td&gt;
&lt;td&gt;The LLM&lt;/td&gt;
&lt;td&gt;Error in state + loop back&lt;/td&gt;
&lt;td&gt;Tool returned bad JSON, wrong tool chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-Fixable&lt;/td&gt;
&lt;td&gt;The human&lt;/td&gt;
&lt;td&gt;&lt;code&gt;interrupt()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing required field, ambiguous input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected&lt;/td&gt;
&lt;td&gt;The developer&lt;/td&gt;
&lt;td&gt;Let it bubble up&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TypeError&lt;/code&gt;, schema mismatch, logic bug&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting this wrong costs you. Retrying a user-fixable error wastes 3 attempts and 6 seconds before failing anyway. Interrupting for a transient error pages a human to click “retry” on something that would have fixed itself. Swallowing an unexpected error hides a real bug behind a generic fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We’re building a document processing pipeline that extracts clauses from contracts, validates them, and generates summaries. Each node has a different error profile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AnyMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class PipelineState(TypedDict):
    document: str
    messages: Annotated[list[AnyMessage], operator.add]
    extracted_clauses: list[dict]
    validation_errors: list[str]
    retry_count: int
    final_summary: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The extraction node calls tools and hits APIs — it gets transient errors and tool failures. The validation node needs complete data — it surfaces user-fixable gaps. The summarizer is the least error-prone but still needs retry protection.&lt;/p&gt;
&lt;h2&gt;
  
  
  State: Track Errors Explicitly
&lt;/h2&gt;

&lt;p&gt;The key insight: &lt;strong&gt;errors are data, not just exceptions.&lt;/strong&gt; Store them in state so the LLM can see what went wrong and adjust its approach.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import RetryPolicy

# Aggressive retry for flaky external APIs
api_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    max_interval=10.0,
    jitter=True,
)

# Conservative retry for LLM calls (they're expensive)
llm_retry = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=5.0,
    jitter=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Pattern 1: Transient Errors with RetryPolicy
&lt;/h2&gt;

&lt;p&gt;API rate limits, network blips, DNS hiccups. These fix themselves. Don’t write code for them — configure them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from httpx import HTTPStatusError


def should_retry(error: Exception) -&amp;gt; bool:
    if isinstance(error, HTTPStatusError):
        return error.response.status_code in (429, 502, 503)
    return False


selective_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=should_retry,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RetryPolicy&lt;/code&gt; parameters worth knowing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Total attempts including the first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initial_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Seconds before first retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;backoff_factor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;Multiplier per retry (exponential backoff)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128.0&lt;/td&gt;
&lt;td&gt;Cap on wait time between retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jitter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Randomize wait to avoid thundering herd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_on&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(default exceptions)&lt;/td&gt;
&lt;td&gt;Exception types or callable to filter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;retry_on&lt;/code&gt; parameter is where most people get it wrong. The default retries on common network/transient exceptions. If you need to retry on a custom exception type:&lt;/p&gt;
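
&lt;p&gt;For example, a sketch with a hypothetical custom exception:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class UpstreamFlakyError(Exception):
    """Hypothetical transient error raised by an internal service client."""


flaky_retry = RetryPolicy(
    max_attempts=4,
    retry_on=(UpstreamFlakyError,),  # exception types work here as well as a callable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tools below raise descriptive &lt;code&gt;ValueError&lt;/code&gt;s instead of failing silently: error messages written for the LLM to read and act on.&lt;/p&gt;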


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract a specific clause from contract text.

    Args:
        text: The contract text to search.
        clause_type: One of &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;valid_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid clause_type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Must be one of: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clause from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a clause complies with a specific regulation.

    Args:
        clause: The clause text to check.
        regulation: The regulation identifier (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDPR-Art17&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SOX-302&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Empty clause text provided. Extract the clause first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No issues found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# handle_tool_errors=True: catch exceptions, return error as ToolMessage
&lt;/span&gt;&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;The superstep transaction rule:&lt;/strong&gt; LangGraph executes parallel branches in supersteps. If any branch in a superstep raises an exception, &lt;strong&gt;none of the state updates from that superstep apply.&lt;/strong&gt; Successful branches are checkpointed and won’t re-execute on retry, but the state snapshot rolls back to before the superstep started. This means a flaky API in one branch can block state updates from an unrelated branch that succeeded. &lt;code&gt;RetryPolicy&lt;/code&gt; per node keeps one bad branch from poisoning the whole superstep.&lt;/p&gt;
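&lt;p&gt;A minimal sketch of that rollback, using toy nodes so the behavior is easy to see. Both branches fan out from &lt;code&gt;START&lt;/code&gt; and run in one superstep; the failing branch takes the successful one’s update down with it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class DemoState(TypedDict):
    metadata: str
    clauses: str

def extract_metadata(state: DemoState) -&amp;gt; dict:
    return {"metadata": "ok"}  # succeeds

def extract_clauses(state: DemoState) -&amp;gt; dict:
    raise RuntimeError("rate limited")  # fails in the same superstep

demo_builder = StateGraph(DemoState)
demo_builder.add_node("extract_metadata", extract_metadata)
demo_builder.add_node("extract_clauses", extract_clauses)
demo_builder.add_edge(START, "extract_metadata")  # parallel branch 1
demo_builder.add_edge(START, "extract_clauses")   # parallel branch 2
demo_builder.add_edge("extract_metadata", END)
demo_builder.add_edge("extract_clauses", END)
demo = demo_builder.compile()

try:
    demo.invoke({"metadata": "", "clauses": ""})
except RuntimeError:
    # The whole superstep rolled back: the "metadata" update never applied
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;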
&lt;h2&gt;
  
  
  Pattern 2: LLM-Recoverable Errors with ToolNode
&lt;/h2&gt;

&lt;p&gt;Tool calls fail. The LLM picks the wrong tool, passes bad arguments, or the tool returns something unparseable. The fix isn’t retrying the exact same call — it’s letting the LLM see what went wrong and try a different approach.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ToolNode&lt;/code&gt; from &lt;code&gt;langgraph.prebuilt&lt;/code&gt; has a &lt;code&gt;handle_tool_errors&lt;/code&gt; parameter. Set it to &lt;code&gt;True&lt;/code&gt; to catch tool exceptions and return the error message as a &lt;code&gt;ToolMessage&lt;/code&gt;, or pass a callable to control exactly what the LLM reads back. A custom formatter steers the model toward fixing its arguments instead of repeating them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool failed with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review your arguments and try again. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check the tool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s docstring for valid parameter values.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tool_node_custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the agent side, the system prompt primes the LLM to treat tool errors as feedback rather than dead ends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a contract analysis agent. Use the provided tools to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract and validate clauses. If a tool returns an error, read &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the error message carefully and adjust your arguments. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available clause types: termination, liability, indemnification, payment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent node calls the LLM, which may invoke tools. When a tool fails, &lt;code&gt;handle_tool_errors&lt;/code&gt; catches the exception and sends it back as a &lt;code&gt;ToolMessage&lt;/code&gt;, and the LLM usually retries with corrected arguments. Validation failures are a different class entirely: no rephrased tool call will conjure a missing clause, so the validation node collects the problems and escalates to a human with &lt;code&gt;interrupt()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Command&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No clauses extracted from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;found_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;found_types&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required clause types: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;low_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low confidence extractions for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;human_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document validation failed. Please review and provide corrections.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;human_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corrected_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 3: User-Fixable Errors with interrupt()
&lt;/h2&gt;

&lt;p&gt;Some errors can’t be fixed by the system or the LLM. The document is missing a signature date. The clause references an undefined term. The input is ambiguous. These need a human.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;interrupt()&lt;/code&gt; pauses graph execution, saves state to the checkpointer, and returns a payload to the caller. When the human provides input, you resume with &lt;code&gt;Command(resume=...)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()

# First invocation — runs until interrupt
config = {"configurable": {"thread_id": "contract-review-42"}}
result = graph.invoke(
    {"document": "...", "messages": [], "extracted_clauses": [], "validation_errors": [], "retry_count": 0, "final_summary": ""},
    config,
)

# Check for interrupt
if "__interrupt__" in result:
    print("Human input needed:", result["__interrupt__"])

# Resume with corrections
corrected = Command(resume={
    "corrected_clauses": [
        {"clause_type": "termination", "text": "Either party may terminate...", "confidence": 0.95},
        {"clause_type": "payment", "text": "Payment due within 30 days...", "confidence": 0.98},
    ]
})
final_result = graph.invoke(corrected, config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In production, wrap the invocation in a traced entry point so every run carries searchable metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

@traceable(name="process_document", run_type="chain")
def process_document(document: str, thread_id: str) -&amp;gt; dict:
    config = {"configurable": {"thread_id": thread_id}}
    with tracing_context(
        metadata={"document_length": len(document), "thread_id": thread_id},
        tags=["production", "document-pipeline"],
    ):
        return graph.invoke(
            {
                "document": document,
                "messages": [HumanMessage(content=f"Process this contract:\n\n{document}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Critical detail:&lt;/strong&gt; &lt;code&gt;interrupt()&lt;/code&gt; requires a checkpointer. Without one, the state is lost and you can’t resume. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for development and a durable checkpointer (Postgres, SQLite) for production. Forgetting the checkpointer is a silent failure — the graph runs fine until you actually need to resume, and then it has no idea where it left off.&lt;/p&gt;
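&lt;p&gt;A sketch of the durable setup, assuming the &lt;code&gt;langgraph-checkpoint-postgres&lt;/code&gt; package, a reachable Postgres instance, and the &lt;code&gt;builder&lt;/code&gt; from the assembly section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"  # placeholder connection string

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # interrupt() and Command(resume=...) now survive restarts and load balancing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;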
&lt;h2&gt;
  
  
  Pattern 4: Building Fault-Tolerant Agents — Let Unexpected Errors Bubble
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;TypeError&lt;/code&gt;, &lt;code&gt;KeyError&lt;/code&gt;, schema mismatches, logic bugs. Don’t catch these. Don’t retry them. Don’t interrupt for them. &lt;strong&gt;Let them crash.&lt;/strong&gt; A retry just wastes time on an error that will never self-resolve. A human interrupt pages someone to look at a bug that should be in your issue tracker.&lt;/p&gt;

&lt;p&gt;The only thing to do with unexpected errors is make them observable. The summarizer below deliberately does not defend against a malformed clause dict: a missing key raises &lt;code&gt;KeyError&lt;/code&gt;, the node crashes, and the &lt;code&gt;@traceable&lt;/code&gt; decorator makes sure the crash lands in LangSmith with full context:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following contract clauses into a concise executive summary. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flag any compliance concerns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contract clauses:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clauses_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;When it crashes, you get a full trace in LangSmith with the document content, the exact node that failed, and every intermediate state. That’s a 5-minute investigation, not a 2-hour log-grepping session.&lt;/p&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Here’s where the error classification meets the graph structure. Each node gets the retry strategy that matches its error profile, and the summarizer, simple as it is, gets &lt;code&gt;RetryPolicy&lt;/code&gt; protection because it calls the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent node: LLM retry (expensive, conservative)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool node: API retry (cheap, aggressive) + error handling for tool failures
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Post-tool processing
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Validation: no retry (errors here are user-fixable, not transient)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Summarizer: LLM retry
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;Kick off a traced run with metadata that names the pipeline version and error strategy, so failures are easy to slice later:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

with tracing_context(
    metadata={
        "pipeline_version": "v3",
        "error_strategy": "classified",
        "document_type": "contract",
    },
    tags=["production", "error-handling-v3"],
):
    result = process_document(
        document="Sample contract text...",
        thread_id="contract-42",
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice: in the assembly, the validation node has &lt;strong&gt;no retry policy.&lt;/strong&gt; Retrying a missing-clause error 3 times won’t make the clause appear. That’s a user-fixable problem that needs &lt;code&gt;interrupt()&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production Failures
&lt;/h2&gt;

&lt;p&gt;These are the error-handling mistakes that make it past code review and into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retrying User-Fixable Errors.&lt;/strong&gt; The pipeline retries an extraction 3 times, burning 8 seconds and 3 LLM calls, before finally failing with the same “missing payment clause” error. The document genuinely doesn’t have a payment clause. No amount of retrying will create one. Fix: classify the error before choosing the handler. If the document is missing required content, &lt;code&gt;interrupt()&lt;/code&gt; immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Swallowing Unexpected Errors.&lt;/strong&gt; A developer wraps the entire graph invocation in &lt;code&gt;try/except Exception: return {"error": "Something went wrong."}&lt;/code&gt;. Now every &lt;code&gt;TypeError&lt;/code&gt;, every &lt;code&gt;KeyError&lt;/code&gt;, every schema mismatch disappears into a generic error message. The LangSmith trace shows the node completed “successfully” — because from the graph’s perspective, it did. It returned a value. The bug lives in production for weeks until someone notices the output quality degraded. Fix: only catch the specific exception types you know how to handle. Let everything else crash loudly.&lt;/p&gt;
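&lt;p&gt;A minimal sketch of that fix. The exception types are illustrative; catch only what your stack actually throws transiently, and give &lt;code&gt;initial_state&lt;/code&gt; whatever shape your pipeline uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Catch only exceptions you have a recovery story for; let bugs crash
try:
    result = graph.invoke(initial_state, config)
except (TimeoutError, ConnectionError) as e:
    # Known-transient failures get a degraded-but-honest result
    result = {"final_summary": "", "validation_errors": [f"Transient failure: {e}"]}
# TypeError, KeyError, and schema mismatches are NOT caught: they bubble to the tracer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;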

&lt;p&gt;&lt;strong&gt;3. Superstep Transaction Surprise.&lt;/strong&gt; You have two parallel branches: clause extraction and metadata extraction. The metadata branch succeeds, but the clause branch hits a rate limit and throws. You expect the metadata to be saved — it succeeded, after all. But superstep transactions mean &lt;strong&gt;neither update applies.&lt;/strong&gt; The entire superstep rolls back. Your metadata extraction re-runs on retry (if you have &lt;code&gt;RetryPolicy&lt;/code&gt;) or is lost entirely (if you don’t). Fix: put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that can fail transiently. LangGraph checkpoints successful nodes within a superstep so they don’t re-execute, but the state update is still atomic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Interrupt Without Checkpointer.&lt;/strong&gt; You add &lt;code&gt;interrupt()&lt;/code&gt; to your validation node and test it locally. Works great. Deploy to production without a persistent checkpointer (or with &lt;code&gt;InMemorySaver&lt;/code&gt; behind a load balancer). The interrupt pauses the graph, the user provides corrections, and... the graph starts from scratch because the in-memory state was on a different server instance. Fix: use a durable checkpointer (&lt;code&gt;PostgresSaver&lt;/code&gt;, &lt;code&gt;SqliteSaver&lt;/code&gt;) in production. &lt;code&gt;InMemorySaver&lt;/code&gt; is for tests only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Error Recovery Loop Explosion.&lt;/strong&gt; The LLM fails to call a tool correctly, the error goes back to the LLM, the LLM tries again with slightly different wrong arguments, the error goes back again. After 15 loops and $2 in API costs, you hit the recursion limit. Fix: add a &lt;code&gt;retry_count&lt;/code&gt; to state. After 3 LLM-recovery attempts, escalate to &lt;code&gt;interrupt()&lt;/code&gt; or fail with a clear error message.&lt;/p&gt;
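&lt;p&gt;A sketch of that escalation guard, assuming &lt;code&gt;retry_count&lt;/code&gt; is incremented in state on each recovery pass (wiring the node into the graph is left out):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
MAX_LLM_RECOVERIES = 3  # after this, stop paying for the loop

def recovery_guard(state: PipelineState) -&amp;gt; dict:
    """Route exhausted LLM-recovery loops to a human instead of retry #15."""
    if state.get("retry_count", 0) &amp;gt;= MAX_LLM_RECOVERIES:
        human_input = interrupt({
            "type": "recovery_exhausted",
            "message": f"LLM could not recover after {MAX_LLM_RECOVERIES} attempts.",
        })
        return {
            "retry_count": 0,
            "extracted_clauses": human_input.get("corrected_clauses", []),
        }
    return {"retry_count": state.get("retry_count", 0) + 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;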
&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Error handling without observability is guesswork. Start by building a LangSmith dataset that exercises every error path: a clean contract, a document missing required clauses, and an empty document:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error-handling-evals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document processing pipeline error handling evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This contract between Party A and Party B includes: Termination: Either party may terminate with 30 days notice. Payment: Net 30 terms apply.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This agreement covers liability limitations and indemnification clauses only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;@traceable&lt;/code&gt; decorator on every node means each run tells its full story in LangSmith: which node failed, the inputs it received, and the error it raised.&lt;/p&gt;

&lt;p&gt;Filter by the &lt;code&gt;error-handling-v3&lt;/code&gt; tag to compare error rates across pipeline versions. If v3 has fewer interrupts but more retries, your error classification improved — transient errors are being handled automatically instead of paging humans.&lt;/p&gt;
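&lt;p&gt;One way to pull those numbers programmatically. The project name is a placeholder, and the filter string follows LangSmith’s run-query syntax:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import Client

client = Client()
runs = list(client.list_runs(
    project_name="document-pipeline",  # placeholder project name
    filter='has(tags, "error-handling-v3")',
))
error_rate = sum(1 for r in runs if r.error) / max(len(runs), 1)
print(f"v3 error rate: {error_rate:.1%} across {len(runs)} runs")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;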

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Test error recovery paths the same way you test happy paths. Three evaluators: one for successful processing, one for error classification accuracy, and one LLM-as-judge for output quality under failure conditions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_core.messages import HumanMessage
from langsmith.evaluation import evaluate
from openevals.llm import create_llm_as_judge

# `graph` is the compiled pipeline from earlier in this post.

QUALITY_PROMPT = """\
Document: {inputs[document]}
Pipeline output: {outputs[final_summary]}

Rate 0.0-1.0 on:
- Completeness: Did the summary cover all extracted clauses?
- Accuracy: Are the clause descriptions faithful to the source?
- Error handling: If the document was incomplete, did the pipeline flag it appropriately?

Return ONLY: {{"score": &amp;lt;0.0-1.0&amp;gt;, "reasoning": "&amp;lt;one sentence&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)

def error_classification(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the pipeline correctly classify errors?"""
    should_succeed = reference_outputs.get("should_succeed", True)
    has_summary = bool(outputs.get("final_summary"))
    has_errors = bool(outputs.get("validation_errors"))
    if should_succeed:
        score = 1.0 if has_summary and not has_errors else 0.0
    else:
        score = 1.0 if has_errors or not has_summary else 0.0
    return {"key": "error_classification", "score": score}

def recovery_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """How many retries did it take? Lower is better."""
    retry_count = outputs.get("retry_count", 0)
    if retry_count == 0:
        score = 1.0
    elif retry_count &amp;lt;= 2:
        score = 0.7
    else:
        score = 0.3
    return {"key": "recovery_efficiency", "score": score}

def target(inputs: dict) -&amp;gt; dict:
    config = {"configurable": {"thread_id": inputs["thread_id"]}}
    try:
        return graph.invoke(
            {
                "document": inputs["document"],
                "messages": [HumanMessage(content=f"Process this contract:\n\n{inputs['document']}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
    except Exception as e:
        return {"final_summary": "", "validation_errors": [str(e)], "retry_count": 0}

results = evaluate(
    target,
    data="error-handling-evals",
    evaluators=[quality_judge, error_classification, recovery_efficiency],
    experiment_prefix="error-handling-v1",
    max_concurrency=2,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The graph under test, with its two recovery paths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Ingest] → [Extract Clauses] → [Validate] → [Summarize] → END
                 ↑                  |
                 |                  ↓
                 ←──── (tool error: retry with context)
                                    |
                                    ↓
                      (missing info: interrupt for human)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;&lt;code&gt;recovery_efficiency&lt;/code&gt; catches the error-loop explosion problem. If your average retry count creeps above 2, your error classification is wrong — you’re retrying things that should interrupt or bubble up.&lt;/p&gt;
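&lt;p&gt;That check is easy to automate after each experiment. A rough sketch, assuming the object returned by &lt;code&gt;evaluate()&lt;/code&gt; is iterated row by row, as recent LangSmith SDKs allow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "recovery_efficiency"
]

avg = sum(scores) / len(scores) if scores else 0.0
# 0.7 is the 1-2 retry band in the evaluator above; a lower mean
# means you are routinely burning 3+ retries per document.
if avg &amp;lt; 0.7:
    print(f"mean recovery_efficiency {avg:.2f}: reclassify your error types")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;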

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use classified error handling when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline calls external APIs that can return transient errors&lt;/li&gt;
&lt;li&gt;Documents have variable quality and may be missing required fields&lt;/li&gt;
&lt;li&gt;You need human-in-the-loop for ambiguous or incomplete inputs&lt;/li&gt;
&lt;li&gt;You're running batch processing where one failure shouldn't kill the batch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline is a single LLM call with no tools&lt;/li&gt;
&lt;li&gt;Every error is the same type (all transient, all user-fixable)&lt;/li&gt;
&lt;li&gt;You're prototyping and don't need production resilience yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The error classification matrix is the whole strategy: transient errors get &lt;code&gt;RetryPolicy&lt;/code&gt;, LLM-recoverable errors get stored in state and looped back, user-fixable errors get &lt;code&gt;interrupt()&lt;/code&gt;, and unexpected errors crash loudly. Four patterns, four primitives, zero catch-all try/excepts.&lt;/p&gt;

&lt;p&gt;The mistake everyone makes is treating errors as a single category. You either retry everything (wasting time and money) or catch everything (hiding bugs). The classification forces you to ask “who fixes this?” for every failure mode, and that question is worth more than any amount of retry logic.&lt;/p&gt;

&lt;p&gt;Put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that touches a network. Put &lt;code&gt;handle_tool_errors=True&lt;/code&gt; on your &lt;code&gt;ToolNode&lt;/code&gt;. Put &lt;code&gt;interrupt()&lt;/code&gt; on validation failures. Let everything else crash. Ship the &lt;code&gt;recovery_efficiency&lt;/code&gt; eval before you ship the pipeline.  &lt;/p&gt;
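&lt;p&gt;For reference, the wiring is small. A minimal sketch of the four patterns, with placeholder node functions, tools, and state rather than the full pipeline above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.types import RetryPolicy, interrupt

def validate(state: PipelineState) -&amp;gt; dict:
    errors = find_missing_clauses(state)  # hypothetical helper
    if errors:
        # User-fixable: pause the graph and wait for a human to resume.
        fixed = interrupt({"validation_errors": errors})
        return {"document": fixed["document"]}
    return {}

builder = StateGraph(PipelineState)
# Transient network errors: retry the node.
builder.add_node("extract", extract_clauses, retry=RetryPolicy(max_attempts=3))
# Tool errors: converted to ToolMessages the LLM can react to.
builder.add_node("tools", ToolNode(tools, handle_tool_errors=True))
builder.add_node("validate", validate)
# No catch-all try/except anywhere else: unexpected errors crash loudly.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;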

&lt;p&gt;&lt;strong&gt;Technical References:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/08-error-handling-production/tree/c7fd49401a5c9adf03cd0f90ac08117820013f1e#article" rel="noopener noreferrer"&gt;LangGraph Agent Error Handling in Production GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://concepts" rel="noopener noreferrer"&gt;LangGraph Retry Policy (Handling Retries)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/tool-calling/" rel="noopener noreferrer"&gt;LangGraph Tool Calling and ToolNode&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/human-in-the-loop/" rel="noopener noreferrer"&gt;LangGraph Human-in-the-Loop (interrupt)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Evaluation Pipelines for LangGraph Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 16 Apr 2026 00:43:37 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</link>
      <guid>https://forem.com/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</guid>
      <description>&lt;p&gt;You changed a system prompt. It looks better on the three examples you tried. You ship it. Tuesday morning, support tickets spike. The agent is now hallucinating policy details on a class of queries you didn’t test. You revert, but 400 users already got bad answers.&lt;/p&gt;

&lt;p&gt;This is not a testing problem. You have unit tests.&lt;/p&gt;

&lt;p&gt;This is an evaluation problem.&lt;/p&gt;

&lt;p&gt;Traditional tests check “does the code run.” Evals check “is the output good.” For LLM applications, you need a clear verdict: pass or fail. Not a 0.73. Not “mostly correct.” The agent either got the answer right or it didn’t.&lt;/p&gt;

&lt;p&gt;Binary evaluators give you that clarity. More importantly, they give your CI pipeline a gate that actually means something.&lt;/p&gt;
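&lt;p&gt;A sketch of what that gate can look like, using names (&lt;code&gt;target&lt;/code&gt;, &lt;code&gt;correctness_judge&lt;/code&gt;) defined later in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

from langsmith.evaluation import evaluate

results = evaluate(
    target,
    data="qa-agent-evals-v1",
    evaluators=[correctness_judge],
)

rows = list(results)
passed = sum(
    1
    for row in rows
    for r in row["evaluation_results"]["results"]
    if r.score
)
total = sum(len(row["evaluation_results"]["results"]) for row in rows)

# Hypothetical threshold; pick one that matches your current baseline.
if total and passed / total &amp;lt; 0.95:
    sys.exit(f"Eval pass rate {passed}/{total} below threshold. Failing the build.")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;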

&lt;p&gt;The cost of not having evals is not “we might ship a bad prompt.” It is that you have no idea if any prompt is good.&lt;/p&gt;

&lt;p&gt;LangSmith gives you the pieces: datasets with versioned examples, custom evaluators (deterministic and LLM-as-judge), trajectory evaluation for agent behavior, experiment comparison across runs, and production trace monitoring.&lt;/p&gt;

&lt;p&gt;This post builds the whole pipeline, from dataset to CI regression detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Eval Tax
&lt;/h3&gt;

&lt;p&gt;Every team resists evals because they seem expensive. Here's the actual math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Activity&lt;/th&gt;
&lt;th&gt;Time Cost&lt;/th&gt;
&lt;th&gt;Without Evals&lt;/th&gt;
&lt;th&gt;With Evals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt change&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;td&gt;Ship and pray&lt;/td&gt;
&lt;td&gt;Run eval suite, check pass rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression discovery&lt;/td&gt;
&lt;td&gt;Hours–days&lt;/td&gt;
&lt;td&gt;User reports, support tickets&lt;/td&gt;
&lt;td&gt;Caught in CI, before merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;1–4 hours&lt;/td&gt;
&lt;td&gt;Manual trace inspection&lt;/td&gt;
&lt;td&gt;Failed evals pinpoint exactly which capability regressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback decision&lt;/td&gt;
&lt;td&gt;Stressful&lt;/td&gt;
&lt;td&gt;"Is this really worse?"&lt;/td&gt;
&lt;td&gt;Pass rate dropped from 95% to 71%, clear signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost per change&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable&lt;/td&gt;
&lt;td&gt;~15 min eval run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The eval suite described below costs ~$0.50 per run (LLM-as-judge calls) and takes 2–3 minutes. The alternative is discovering regressions from users. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We're building an evaluation pipeline for a Q&amp;amp;A agent. The pipeline covers offline evals (before deploy), online monitoring (after deploy), and regression detection (across deploys).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Under Test
&lt;/h2&gt;

&lt;p&gt;A Q&amp;amp;A agent that answers questions using a knowledge base. Simple enough to evaluate clearly, complex enough to have real failure modes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetryPolicy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the internal knowledge base for relevant information.

    Args:
        query: The search query to find relevant documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days of purchase for unopened items. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items eligible for exchange only within 14 days. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard shipping: 5-7 business days, free over $50. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping: 2-3 business days, $12.99. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;International shipping: 10-15 business days, $24.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warranty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All electronics carry a 1-year manufacturer warranty. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extended warranty available for $49.99 (adds 2 years). &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warranty does not cover accidental damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support available Monday-Friday 9am-6pm EST. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat support available 24/7. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone support: 1-800-555-0123.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Knowledge Base [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant results found. Try rephrasing your query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are a customer support agent. Answer questions using the knowledge base tool.
Be concise and accurate. If the knowledge base doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have the answer, say so —
do not make up information. Always cite the source when using knowledge base results.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_agent_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;qa_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Create a Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset is the foundation. Bad examples produce misleading eval scores. Each example has &lt;code&gt;inputs&lt;/code&gt; (what goes to the agent) and &lt;code&gt;outputs&lt;/code&gt; (the ground truth to evaluate against).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q&amp;amp;A agent evaluation dataset covering core support topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is your refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long does express shipping take?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does the warranty cover water damage?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are your support hours?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can I return a digital download?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do you sell gift cards?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the refund window for opened electronics?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days for unopened items. Opened items eligible for exchange only within 14 days. Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unopened&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping takes 2-3 business days and costs $12.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 business days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12.99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The warranty does not cover accidental damage, including water damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;does not cover&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accidental damage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support is available Monday-Friday 9am-6pm EST. Chat support is available 24/7.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monday-Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9am-6pm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24/7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;digital&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-refundable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information about gift cards in the knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items are eligible for exchange only within 14 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Seven examples is a starting point, not a finish line.&lt;/strong&gt; In production, you need 50-100 examples covering happy paths, edge cases, and adversarial inputs. But starting with seven well-chosen examples that cover your core failure modes is better than starting with zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build LangSmith Evaluators
&lt;/h2&gt;

&lt;p&gt;Three layers of evaluation: deterministic checks (fast, cheap, reliable), LLM-as-judge (flexible, handles nuance), and trajectory evaluation (validates agent behavior, not just output).&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic Evaluators
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the response mentions ALL required keywords. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent called all expected tools. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;expected_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="n"&gt;actual_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;issubset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_no_hallucination_on_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;When the KB has no answer, pass if the agent admits it. Fail if it fabricates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;hedging_phrases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cannot find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no relevant results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not in the knowledge base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m not sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hedged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hedging_phrases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hedged&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Seven examples is a starting point, not a finish line.&lt;/strong&gt; In production, you need 50-100 examples covering happy paths, edge cases, and adversarial inputs. But starting with seven well-chosen examples that cover your core failure modes is better than starting with zero.&lt;/p&gt;
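
&lt;p&gt;Growing the set is mechanical once the dataset exists. A sketch of appending one more edge case with the LangSmith SDK; the example content here is hypothetical, but the reference keys match the evaluators in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import Client

ls_client = Client()

# Hypothetical adversarial case: the question asserts a false premise to
# bait the agent into agreeing with it.
ls_client.create_examples(
    dataset_name="qa-agent-evals-v1",
    inputs=[{"question": "Since all plans include phone support, how do I call?"}],
    outputs=[{
        "expects_no_answer": False,
        "expected_answer": "Phone support is not included in every plan.",
        "must_mention": ["phone support"],
    }],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;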

&lt;h2&gt;
  
  
  Step 2: Build the LangSmith Evaluators
&lt;/h2&gt;

&lt;p&gt;Three layers of evaluation: deterministic checks (fast, cheap, reliable; no_hallucination_on_missing above is one), LLM-as-judge (flexible, handles nuance), and trajectory evaluation (validates agent behavior, not just output).&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-as-Judge Evaluators
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating a customer support agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Expected answer: {reference_outputs[expected_answer]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response is correct.
A response is CORRECT if it:
- Contains the key factual claims from the expected answer
- Does not contradict the expected answer
- Does not fabricate information beyond what the knowledge base provides
A response is INCORRECT if it:
- Misses critical factual information from the expected answer
- States anything that contradicts the expected answer
- Invents details not present in the knowledge base
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;correctness_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correctness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating the tone and professionalism of a customer support agent.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tone is ACCEPTABLE or UNACCEPTABLE.
ACCEPTABLE tone: professional, helpful, concise, empathetic, and action-oriented.
UNACCEPTABLE tone: condescending, rude, excessively verbose, robotic, dismissive,
or inappropriately casual for a support context.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;tone_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why pass/fail instead of continuous scores?&lt;/strong&gt; You don't ship code when 73% of your unit tests "mostly pass." When keyword_coverage fails, you know exactly what happened: the agent missed a required term. A score of 0.75 tells you something is partially wrong, but you still have to go figure out what. And binary evaluators don't suffer from judge variance — the same input produces the same verdict every time.&lt;/p&gt;

&lt;h3&gt;


  &lt;strong&gt;Trajectory Evaluator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Trajectory evaluation checks not just &lt;strong&gt;what&lt;/strong&gt; the agent said, but &lt;strong&gt;how&lt;/strong&gt; it got there. Did it call the right tools? Did it call them in a reasonable order? Did it over-call or under-call?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentevals.trajectory.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_trajectory_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating whether an AI agent took a reasonable path to answer a question.
The agent has access to a knowledge base search tool.
Evaluate the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s trajectory (sequence of actions and messages):
{outputs}
A trajectory PASSES if:
- The agent called the appropriate tool(s) for the question
- The agent did not make unnecessary or redundant tool calls
- The agent used tool results to formulate its response
- The agent did not ignore relevant tool results
A trajectory FAILS if:
- The agent skipped tool calls and answered from its own knowledge
- The agent made excessive redundant calls (more than 2 calls for a simple question)
- The agent ignored tool results and fabricated an answer
- The agent called completely irrelevant tools
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;trajectory_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_trajectory_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tool-calling trajectory was reasonable. Fail otherwise.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trajectory_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Binary judges are more reliable than continuous ones.&lt;/strong&gt; Ask a model to rate something 0.0-1.0 and you'll get different scores on every run. Ask it "correct or incorrect?" and you'll get the same answer 95%+ of the time. The judge isn't deciding how correct, it's deciding whether the response meets a bar. Easier task, more consistent results, fewer false signals in CI.&lt;/p&gt;
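
&lt;p&gt;That stability is worth verifying against your own judges. A minimal flakiness probe, assuming (as trajectory_eval above does) that a judge returns a dict with a score field and accepts inputs/outputs/reference_outputs keyword arguments:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Rerun one judge on a single example and measure how often its verdict
# disagrees with the majority outcome.
def verdict_flip_rate(judge, example: dict, n_runs: int = 10) -&amp;gt; float:
    verdicts = []
    for _ in range(n_runs):
        result = judge(
            inputs=example["inputs"],
            outputs=example["outputs"],
            reference_outputs=example["reference_outputs"],
        )
        verdicts.append(bool(result["score"]))
    majority = verdicts.count(True) &amp;gt;= verdicts.count(False)
    flips = sum(1 for v in verdicts if v != majority)
    return flips / n_runs

# 0.0 is ideal; a flip rate above ~0.05 means the criteria are ambiguous.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;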

&lt;h2&gt;


  &lt;strong&gt;Step 3: Run the Evaluation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Wire the target function and evaluators into evaluate(). The target function takes dataset inputs, runs the agent, and returns a dict with the keys your evaluators expect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_eval_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an experiment in LangSmith, a versioned snapshot of your agent's performance. The result is a pass rate per evaluator: "correctness: 6/7 passed, tool_usage: 7/7 passed, keyword_coverage: 5/7 passed." Every future eval run with a different experiment_prefix becomes a comparable data point.&lt;/p&gt;

&lt;h2&gt;


  &lt;strong&gt;Step 4: AI Agent Testing with Regression Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The power of experiments is comparison. When you change a prompt, model, or tool, run the same eval suite and compare pass rates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare_experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_experiments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare two experiment runs and flag regressions in pass rates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;experiments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_projects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;reference_dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;experiments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find both experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;candidate_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;regressions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regressions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, you run this in CI. A prompt change creates a new experiment, the comparison script checks for regressions, and the PR is blocked if any evaluator's pass rate drops more than 10%. This is the single most valuable thing you can build with LangSmith — the rest is instrumentation. The regression threshold is 10%, not 5%, because pass/fail metrics move in discrete jumps. On a 7-example dataset, one additional failure drops your pass rate by ~14%. On a 50-example dataset, you can tighten the threshold to 5%. Scale the threshold to your dataset size.&lt;/p&gt;
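
&lt;p&gt;A minimal sketch of that CI gate, reusing the compare_experiments helper above; the experiment prefixes are whatever your pipeline passes in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import sys

# CI gate: fail the build when the candidate regresses against the baseline.
report = compare_experiments(
    baseline_prefix="qa-agent-v1",
    candidate_prefix="qa-agent-v2",
    regression_threshold=0.10,
)
if not report.get("passed", False):
    for r in report.get("regressions", []):
        print(f"REGRESSION {r['metric']}: "
              f"{r['baseline_pass_rate']} -&amp;gt; {r['candidate_pass_rate']}")
    sys.exit(1)  # a non-zero exit blocks the PR
print("No regressions detected.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;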

&lt;h2&gt;


  &lt;strong&gt;Step 5: Production Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Offline evals catch regressions before deploy. Online monitoring catches drift after deploy — the slow degradation that happens when user behavior shifts, knowledge bases get stale, or upstream APIs change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Production entry point with trace tagging.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-02-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag every production trace with the agent version and prompt version. When you deploy a new version, you can filter traces by version and compare pass rates across versions — with real user traffic, not synthetic examples. The monitoring loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag all production traces with version metadata&lt;/li&gt;
&lt;li&gt;Configure LangSmith online evaluators to sample 10-20% of traces&lt;/li&gt;
&lt;li&gt;Dashboard alerts on pass rate drops by version (a sketch of this tally follows the list)&lt;/li&gt;
&lt;li&gt;When a drop is detected, pull the failing traces, add them to your offline dataset, and fix&lt;/li&gt;
&lt;/ol&gt;
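
&lt;p&gt;A rough sketch of the version comparison, assuming the LangSmith SDK's list_runs/list_feedback calls, the has(tags, ...) filter syntax, a hypothetical qa-agent-production project, and online evaluators that attach boolean feedback to sampled runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from collections import defaultdict

# Tally online-evaluator feedback by the agent_version metadata that
# handle_user_query attaches to every production trace.
runs = ls_client.list_runs(
    project_name="qa-agent-production",  # hypothetical project name
    filter='has(tags, "production")',
    is_root=True,
)
by_version = defaultdict(lambda: {"passed": 0, "total": 0})
for run in runs:
    version = (run.extra or {}).get("metadata", {}).get("agent_version", "unknown")
    for fb in ls_client.list_feedback(run_ids=[run.id]):
        by_version[version]["total"] += 1
        if fb.score:
            by_version[version]["passed"] += 1
for version, counts in by_version.items():
    rate = counts["passed"] / counts["total"] if counts["total"] else 0.0
    print(f"{version}: {rate:.0%} over {counts['total']} feedback entries")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;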

&lt;h2&gt;


  &lt;strong&gt;Production Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the eval-specific failure modes that surface once you're running evals in CI and production.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the agent made more than 3 tool calls for a single question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag every production trace with the agent version and prompt version. When you deploy a new version, you can filter traces by version and compare pass rates across versions — with real user traffic, not synthetic examples. The monitoring loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag all production traces with version metadata&lt;/li&gt;
&lt;li&gt;Configure LangSmith online evaluators to sample 10-20% of traces&lt;/li&gt;
&lt;li&gt;Dashboard alerts on pass rate drops by version&lt;/li&gt;
&lt;li&gt;When a drop is detected, pull the failing traces, add them to your offline dataset, and fix&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Production Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the eval-specific failure modes that surface once you're running evals in CI and production. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Flaky LLM-as-Judge Verdicts.&lt;/strong&gt; The same input/output pair passes on one run and fails on the next. The judge model is non-deterministic, and your eval is measuring judge variance, not agent quality. Fix: set temperature=0 on the judge model, make your pass/fail criteria as specific as possible (list exactly what constitutes a pass), and run each evaluation three times with a majority vote. If the same example flips verdict more than 10% of the time, your criteria need to be sharper. Binary verdicts are already far more stable than continuous scores, but ambiguous criteria still cause flakiness.&lt;/p&gt;
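
&lt;p&gt;A minimal sketch of the majority-vote wrapper, assuming (as the evaluators above do) that a judge takes inputs/outputs/reference_outputs keyword arguments and returns a dict with a score field:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Run a judge n_runs times and take the majority verdict, trading extra
# judge calls for a more stable pass/fail signal.
def majority_vote(judge, key: str, n_runs: int = 3):
    def wrapped(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
        votes = []
        for _ in range(n_runs):
            result = judge(
                inputs=inputs,
                outputs=outputs,
                reference_outputs=reference_outputs,
            )
            votes.append(bool(result["score"]))
        return {"key": key, "score": votes.count(True) &amp;gt; n_runs // 2}
    return wrapped

# Drop-in replacement in the evaluators list:
stable_correctness = majority_vote(correctness_judge, key="correctness")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;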

&lt;p&gt;&lt;strong&gt;2. Eval Gaming.&lt;/strong&gt; You optimize the prompt to pass the eval dataset. The pass rate goes up. User satisfaction doesn't. Your dataset is too narrow: the agent learned your test distribution, not the actual problem. Fix: rotate examples into and out of the eval set quarterly. Pull 10% of examples from production traces each month. Never let the eval set become stale.&lt;/p&gt;
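
&lt;p&gt;A sketch of the monthly rotation, under the same assumptions as the monitoring snippet above (hypothetical project name, has(tags, ...) filter); sampled questions land in the dataset unlabeled, and a human writes the expected answers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import random

# Sample recent production questions into the eval dataset for labeling.
prod_runs = list(ls_client.list_runs(
    project_name="qa-agent-production",  # hypothetical project name
    filter='has(tags, "production")',
    is_root=True,
    limit=200,
))
sampled = random.sample(prod_runs, k=min(10, len(prod_runs)))
ls_client.create_examples(
    dataset_name="qa-agent-evals-v1",
    inputs=[{"question": r.inputs.get("question", "")} for r in sampled],
    outputs=[{} for _ in sampled],  # fill in expected_answer et al. by hand
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;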

&lt;p&gt;&lt;strong&gt;3. Judge Model Disagreement.&lt;/strong&gt; You switch the judge from Claude to GPT and pass rates shift by 20%. The evaluator is measuring model preference, not quality. Fix: calibrate your judge against human ratings. Run 50 examples through both the judge and a human annotator. If they disagree on more than 10% of verdicts, your judge criteria need work. openevals provides pre-calibrated prompts as a starting point. &lt;/p&gt;
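
&lt;p&gt;The calibration itself is a few lines. A sketch, where calibration_set is a hypothetical list of (example, human_verdict) pairs collected from your annotator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Fraction of examples where the judge's verdict matches the human label.
def judge_agreement(judge, calibration_set) -&amp;gt; float:
    agree = 0
    for example, human_verdict in calibration_set:
        result = judge(
            inputs=example["inputs"],
            outputs=example["outputs"],
            reference_outputs=example["reference_outputs"],
        )
        if bool(result["score"]) == human_verdict:
            agree += 1
    return agree / len(calibration_set)

# Below roughly 0.9 agreement, rework the judge prompt before trusting it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;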

&lt;p&gt;&lt;strong&gt;4. Dataset Drift.&lt;/strong&gt; Your eval dataset was created six months ago. The product has changed: new policies, new features, different user behavior. The evals are passing, but they're testing scenarios that no longer matter while ignoring scenarios that do. Fix: timestamp your examples. Review the dataset monthly. Add production failure cases as they occur. Delete examples for deprecated features.&lt;/p&gt;
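
&lt;p&gt;A sketch of the monthly review, assuming the SDK's list_examples/delete_example calls; age is only a review signal here, and deletion stays a human decision:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from datetime import datetime, timedelta, timezone

# Flag eval examples older than six months for human review.
cutoff = datetime.now(timezone.utc) - timedelta(days=180)
for example in ls_client.list_examples(dataset_name="qa-agent-evals-v1"):
    if example.created_at &amp;lt; cutoff:
        print(f"Review: {example.id} created {example.created_at:%Y-%m-%d}")
        # ls_client.delete_example(example_id=example.id)  # if deprecated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;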

&lt;p&gt;&lt;strong&gt;5. Trajectory Eval False Positives.&lt;/strong&gt; The trajectory judge says the agent's path was "reasonable" even when the agent called the wrong tool first and then self-corrected. Self-correction is fine in production but expensive: it adds latency and cost. Fix: add a separate tool_call_count evaluator that fails trajectories with more than N tool calls. Combine the trajectory pass/fail with an efficiency gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the agent made more than 3 tool calls for a single question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then rerun the suite with the efficiency gate in the evaluator list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every evaluator is @traceable, which means your evals themselves are traced in LangSmith. This matters more than you think. When an evaluator produces a surprising verdict, you can inspect exactly what it saw and why it ruled that way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;Question: {inputs[question]}
Response: {outputs[response]}
Expected: {reference_outputs[expected_answer]}
Does the response correctly and completely answer the question
based on the expected answer?
PASS if the response contains all key facts from the expected answer
and does not contradict it.
FAIL if the response misses critical information, contradicts the
expected answer, or fabricates details.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;quality_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if ALL required terms are present. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;all_present&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;all_present&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the response is too short to be useful or excessively long.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;quality_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-quick-check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness pass rate.&lt;/strong&gt; This is your north star. If it drops, the agent is giving wrong answers. Every other metric is secondary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool usage pass rate.&lt;/strong&gt; If this drops, the agent stopped using the knowledge base — probably a prompt regression that caused it to answer from parametric memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-hallucination pass rate.&lt;/strong&gt; If this drops, the agent is making up answers when it should be admitting ignorance. This is the most dangerous regression and the one most likely to slip through manual review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failing examples across runs.&lt;/strong&gt; Track which specific examples fail consistently. These are your hardest cases: either improve the agent to handle them or accept them as known limitations and document them (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
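&lt;p&gt;To make those last two bullets concrete, here's a run-over-run comparison. This is a hypothetical helper, not a LangSmith SDK call: it assumes you've already flattened each experiment into a &lt;code&gt;dict&lt;/code&gt; mapping example IDs to pass/fail booleans.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Hypothetical helper: compare per-example pass/fail between two runs.
# baseline and candidate map example_id -&amp;gt; bool (True = pass).
def compare_runs(baseline: dict, candidate: dict) -&amp;gt; None:
    shared = baseline.keys() &amp;amp; candidate.keys()
    regressions = sorted(e for e in shared if baseline[e] and not candidate[e])
    chronic = sorted(e for e in shared if not baseline[e] and not candidate[e])
    delta = (sum(candidate[e] for e in shared) - sum(baseline[e] for e in shared)) / len(shared)
    print(f"pass-rate delta: {delta:+.2%}")
    print(f"newly failing examples: {regressions}")
    print(f"consistently failing (hardest cases): {chronic}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The "chronic" list doubles as your known-limitations doc; the "regressions" list is what blocks a release.&lt;/p&gt;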

&lt;h2&gt;
  
  
  &lt;strong&gt;Evals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section condenses the whole article into a quick reference: the minimum viable eval pipeline you should have before shipping any agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                    &lt;span class="n"&gt;OFFLINE&lt;/span&gt; &lt;span class="n"&gt;EVALS&lt;/span&gt;                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;  &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Target&lt;/span&gt;   &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Evaluators&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Function&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Correctness&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;refs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Completeness&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Trajectory&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Judge&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;└────────┬─────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                           &lt;span class="err"&gt;│&lt;/span&gt;             &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;┌────────▼─────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versioned&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;└──────────────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;

&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                   &lt;span class="n"&gt;ONLINE&lt;/span&gt; &lt;span class="n"&gt;MONITORING&lt;/span&gt;                      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Prod&lt;/span&gt;     &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Traces&lt;/span&gt;   &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Online&lt;/span&gt; &lt;span class="n"&gt;Evals&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Traffic&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;

&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                 &lt;span class="n"&gt;REGRESSION&lt;/span&gt; &lt;span class="n"&gt;DETECTION&lt;/span&gt;                     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;Experiment&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;  &lt;span class="err"&gt;◄────&lt;/span&gt; &lt;span class="n"&gt;compare&lt;/span&gt; &lt;span class="err"&gt;────►&lt;/span&gt;  &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;Δ&lt;/span&gt; &lt;span class="n"&gt;correctness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;REGRESSION&lt;/span&gt; &lt;span class="n"&gt;DETECTED&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;When to Use This&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build a full eval pipeline when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're shipping prompt changes more than once a month&lt;/li&gt;
&lt;li&gt;More than one person works on the agent&lt;/li&gt;
&lt;li&gt;The agent handles queries where wrong answers have consequences (policy, pricing, compliance)&lt;/li&gt;
&lt;li&gt;You need to compare model versions (Claude vs GPT, Sonnet vs Haiku)&lt;/li&gt;
&lt;li&gt;You're running A/B tests on agent behavior &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Start with just deterministic evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is in early development and schemas are changing weekly&lt;/li&gt;
&lt;li&gt;You have fewer than 5 test cases&lt;/li&gt;
&lt;li&gt;The output format is structured (JSON extraction) and correctness is binary (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
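&lt;p&gt;For that last case, a deterministic evaluator is a few lines. A minimal sketch following the same evaluator signature as above; the &lt;code&gt;expected&lt;/code&gt; reference key is a placeholder for whatever your dataset actually stores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import json

def exact_json_match(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Binary correctness for structured extraction: parse, then compare."""
    try:
        got = json.loads(outputs.get("response", ""))
    except json.JSONDecodeError:
        return {"key": "exact_json_match", "score": False}
    return {"key": "exact_json_match", "score": got == reference_outputs.get("expected")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;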

&lt;p&gt;&lt;strong&gt;Skip evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application is a prototype that won't see real users&lt;/li&gt;
&lt;li&gt;You're the only user and you'll notice regressions immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Evals are not a nice-to-have. They're the difference between "I think the prompt is better" and "I know the prompt is better, and here are the pass rates." Dataset, evaluators, experiment, comparison. Four components. ~20 lines per evaluator. The payoff is catching every regression before your users do.&lt;/p&gt;

&lt;p&gt;Pass/fail is the right default. Continuous scores feel more sophisticated, but they create ambiguity — is 0.72 good? Is a drop from 0.81 to 0.76 a regression or noise? Pass/fail kills the question. Green or red. When you need more nuance, add more evaluators with sharper criteria instead of adding decimal places to existing ones.&lt;/p&gt;

&lt;p&gt;Start with three: one deterministic keyword check, one LLM-as-judge for correctness, one for tool usage. Run them on every PR that touches agent code. Add trajectory evaluation when your agent has more than two tools. Add production monitoring when you have traffic. And update the dataset — the dataset that stops growing is the one that stops catching bugs.&lt;/p&gt;
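&lt;p&gt;"Run them on every PR" can be as simple as a threshold assertion in CI. A minimal sketch, assuming the &lt;code&gt;results&lt;/code&gt; object from the &lt;code&gt;evaluate()&lt;/code&gt; call above; the row shape follows recent LangSmith SDK versions and may differ in yours, and the thresholds are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from collections import defaultdict

# Aggregate pass rates per evaluator key from the evaluate() results.
totals, passes = defaultdict(int), defaultdict(int)
for row in results:
    for res in row["evaluation_results"]["results"]:
        totals[res.key] += 1
        passes[res.key] += bool(res.score)

rates = {key: passes[key] / totals[key] for key in totals}

# Fail the build if any evaluator drops below its floor.
THRESHOLDS = {"quality": 0.9, "coverage": 0.95, "response_length": 0.95}
failed = {k: v for k, v in rates.items() if v &amp;lt; THRESHOLDS.get(k, 0.9)}
assert not failed, f"Eval regression below threshold: {failed}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;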

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Resources&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/07-evaluation-testing/tree/746f37847075391b6a638be55ce8f2507c55f231" rel="noopener noreferrer"&gt;Eval Pipelines GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation (datasets, evaluators, experiments)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph (stateful agents, tool calling, orchestration)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/observability" rel="noopener noreferrer"&gt;LangSmith Tracing &amp;amp; Observability&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>programming</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Debugging Your RAG Application: A LangChain, Python, and OpenAI Tutorial</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:42:32 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</link>
      <guid>https://forem.com/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</guid>
      <description>&lt;p&gt;Let's explore a real-world example of debugging a RAG-type application. I recently undertook this process while updating our company knowledge base -- a resource for potential clients and employees to learn about us.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I work with Python and the LangChain framework, specifically using LangChain Expression Language (LCEL) to build chains. You can find the LangChain LCEL documentation &lt;a href="https://python.langchain.com/docs/how_to/#langchain-expression-language-lcel" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
This approach serves as a good alternative to LangChain's debugging tool, &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Load memory
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;load_memory_variables&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create Question
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_buffer_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                               &lt;span class="p"&gt;}&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;CONDENSE_QUESTION_PROMPT&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve Documents
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Answer
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;combine_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_DOCUMENT_PROMPT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ANSWER_PROMPT&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Final Chain looks like this
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While debugging, I prefer using a cheaper model like gpt-3.5-turbo for its cost-effectiveness. The less advanced models are more than adequate for basic testing. For final testing and deployment to production, you might consider upgrading to gpt-4-turbo or a similar advanced model.&lt;br&gt;&lt;br&gt;
I also favor Jupyter notebooks for much of my debugging. This way, I can include the notebook in a .gitignore file, reducing cleanup from debugging shenanigans in my main code. I can also run very specific pieces of my code without plumbing overhead.  &lt;/p&gt;
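&lt;p&gt;That swap can be a single environment-driven line, so the notebook and production share the same code path. A minimal sketch; the env var name is my own convention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import os

from langchain_openai import ChatOpenAI

# Cheap model while debugging; override for final testing/production:
#   export OPENAI_MODEL=gpt-4-turbo
llm = ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"), temperature=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;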
&lt;h2&gt;
  
  
  Initial Observations
&lt;/h2&gt;

&lt;p&gt;I noticed that basic queries received correct answers, but any follow-up question would lack the appropriate context, indicating that conversational memory was no longer functioning effectively.&lt;br&gt;&lt;br&gt;
Here's what I observed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Based&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;importance&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="nc"&gt;Development &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TDD&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I expected responses more in line with explanations like "Love your craft is when you are passionate about what you do."&lt;br&gt;&lt;br&gt;
For more context, this issue with conversational memory arose while I was implementing a new feature: allowing end users to customize responses based on their role. So, for example, a developer could receive a highly technical answer while a marketing manager would see more high-level details.  &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Steps&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ensure Role Feature Integrity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To rule out the newly implemented role feature as the culprit, I temporarily updated my system prompt so the role would be overly obvious and active in every response during this debugging session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question from the perspective of a {role}.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;DEBUGGING_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question in a {role} accent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how the AI responded, clearly adhering to my updated prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pirate&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;matey&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;talkin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; about the importance of reachin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="n"&gt;Development&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Creating a Visual Representation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I created a diagram of the app to visualize the process flow.&lt;br&gt;&lt;br&gt;
I began at the end of my flow and worked backward to identify issues. I first checked whether my LLM was answering questions based on the provided context. Upon inspecting the sources, I realized that the given context was a blog on TDD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://focusedlabs.io/blog/tdd-first-step-think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus, I ruled out the answer component as the source of the bug.  &lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Tracing the Bug's Origin&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, I examined the logic for retrieving documents. I added a 'standalone question' key to every input and output chain to log runtime values, which revealed that questions were being incorrectly rephrased.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Adding these keys to the chains allows us to log the values seen by the components at runtime. Using breakpoints will only show the code when it's instantiated and not populated with real-time values.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Code Snippet with added keys
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected the &lt;code&gt;standalone_question&lt;/code&gt; to be more specific, like "What can you tell me about the core value of Love your Craft?"  &lt;/p&gt;
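&lt;p&gt;An alternative to threading debug keys through every dict is a pass-through "tap" that prints whatever flows between chain segments at runtime. A minimal sketch, not from the original codebase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langchain_core.runnables import RunnableLambda

def tap(label: str) -&amp;gt; RunnableLambda:
    """Identity step that logs the value flowing through the chain."""
    def _tap(value):
        print(f"{label}: {value}")
        return value
    return RunnableLambda(_tap)

# Drop taps between segments to see real runtime values:
chain = (
    load_memory_chain()
    | tap("after memory")
    | create_question_chain()
    | tap("after question")
    | retrieve_documents_chain(vector_store)
    | create_answer_chain()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;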

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Identifying the Exact Source&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I focused on the &lt;code&gt;chat_history&lt;/code&gt; variable, suspecting an issue with how the chat history was being recognized.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Found the issue! Since the &lt;code&gt;chat_history&lt;/code&gt; was blank, it wasn't being loaded as I had assumed.  &lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Implementing the Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I resolved the issue by checking my conversation memory store. As a &lt;code&gt;dict&lt;/code&gt;, the store is sensitive to key types: I had saved messages under a &lt;code&gt;str&lt;/code&gt;-converted version of &lt;code&gt;session_id&lt;/code&gt;, but invoked the chain with an &lt;code&gt;Optional[UUID]&lt;/code&gt; version, so the lookup never found the saved history. The store itself was set up correctly; I only needed to update how I invoked my chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I converted &lt;code&gt;session_id&lt;/code&gt; to &lt;code&gt;str&lt;/code&gt; at the call site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
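
&lt;p&gt;Why such a small change matters: &lt;code&gt;dict&lt;/code&gt; lookups compare keys by hash and equality, and a &lt;code&gt;UUID&lt;/code&gt; never equals its own string form. Here is a minimal, standalone sketch of the mismatch (the store shape and values are hypothetical, not the app's actual memory store):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from uuid import UUID, uuid4

# A dict-backed memory store keyed by str, like the one described above.
store: dict[str, list[str]] = {}

session_id: UUID = uuid4()
store[str(session_id)] = ["What are the Focused Labs core values?"]

# A UUID key hashes and compares differently from its str form,
# so the lookup silently misses and the history looks blank:
print(store.get(session_id))       # None
print(store.get(str(session_id)))  # ['What are the Focused Labs core values?']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;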



&lt;h3&gt;
  
  
  6. &lt;strong&gt;Confirming the Fix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I confirmed that the conversation memory now functioned correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What are the Focused Labs core values?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;passionate&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;being&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paying&lt;/span&gt; &lt;span class="n"&gt;attention&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.notion.so/Who-are-we-c42efb179fa64f6bb7866deb363fb7ef&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. &lt;strong&gt;Final Cleanup and Future-Proofing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I reverted the temporary pirate-accent debug prompt, which I had used only to make responses from the role feature easy to spot.&lt;br&gt;&lt;br&gt;
I kept the detailed logging in place for future debugging efforts.&lt;/p&gt;
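
&lt;p&gt;As a sketch of the kind of logging worth keeping (the &lt;code&gt;load_history&lt;/code&gt; helper and store shape here are hypothetical, not the app's actual code), logging the key &lt;em&gt;and its runtime type&lt;/em&gt; is exactly what would have exposed the UUID-vs-str mismatch immediately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("chat_memory")

def load_history(store: dict[str, list[str]], session_id: str) -&amp;gt; list[str]:
    history = store.get(session_id, [])
    # Log the key and its type, not just the value.
    logger.debug("session_id=%r (%s) history=%r",
                 session_id, type(session_id).__name__, history)
    return history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;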

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging AI Systems:&lt;/strong&gt; A mix of traditional techniques (logging, type checks) and AI-specific ones (inspecting intermediate chain values) is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opting for Cost-Effective Models:&lt;/strong&gt; Switch to a cheaper model while debugging, since you will re-run the same queries many times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance of Transparency:&lt;/strong&gt; Clear visibility into each step and component of your RAG pipeline accelerates debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Consistency:&lt;/strong&gt; Small details like variable types can break functionality silently; normalizing keys at a single boundary helps (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
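
&lt;p&gt;One way to enforce that last takeaway is to normalize the key at a single boundary so no caller can pass the wrong type. This helper is a hypothetical sketch, not part of the original chatbot code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Optional, Union
from uuid import UUID

def normalize_session_id(session_id: Optional[Union[str, UUID]]) -&amp;gt; str:
    """Coerce whatever the caller passes into the canonical str key."""
    if session_id is None:
        raise ValueError("session_id is required")
    return str(session_id)

# Every invocation then goes through the same boundary:
# chain.invoke({"question": question,
#               "session_id": normalize_session_id(session_id),
#               "role": role})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;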

&lt;h3&gt;
  
  
  &lt;strong&gt;Thanks for reading!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stay tuned for more insights into the world of software engineering and AI. Have questions or thoughts of your own? Share them in the comments below!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
