Forem: Focused

Agent Failures Should Open Tickets | Focused Labs

Austin Vance — Thu, 21 May 2026 09:29:57 +0000

Agent traces should create work.

An AI agent workflow can fail the same way twice. When it does, the second failure should create a ticket with an owner, linked evidence, and a regression check the team can run down the line. Without that, traces and output reviews can be pretty and searchable and still be worthless for anything other than replaying the same series of mistakes over and over. I’ve begun to call this “replayable regret”: expensive, and painful to behold.

LangChain identified a critical gap in tooling this week: agent traces can be captured and reviewed, but getting from an identified error to a merged fix is still manual and slow. Harrison Chase called this out too, noting that LangChain is building an “issue bench” and already using it internally, though it is still early for this class of tooling.

That phrase matters because the unit of work changes. The trace stops being the artifact everyone stares at after the failure. The recurring failure becomes the artifact the team improves against.

Traces should not die in Slack

The common failure loop is quite dumb.

The loop starts when an agent fails in a workflow. Someone opens the trace for that work item, and the team can see the failure in the trace steps. The tool call timed out. The planner picked a weird branch. Retrieval pulled stale context. The evaluator fired. A user left negative feedback. All of that evidence dies when the trace link is pasted into Slack with the word “interesting” on top of it. That’s vibes, not work.

Traces are good! I recently wrote about why traces from agent workflows should cross the MCP boundary. Making traces visible is the first half of the work, because a visible trace says this happened. The second half is a record that says this failure family is owned, fixed, covered, and blocked from coming back.

That second half is the AI agent workflow people keep skipping.

In the recent Engine thread, the team at LangChain identified the right inputs into the AI agent workflow: tool call failures and timeouts, online eval failures, trace anomalies, negative feedback, and unusual behavior. Today those inputs are treated as interesting patterns to watch as dashboard widgets. Instead, they should be treated as signals for a queue of work, where each is eligible to become a named issue with severity, linked traces, suspected boundary, and release condition.

The trace is evidence. The issue loop turns it into engineering work.

The good version is straightforward and mechanical: a trace anomaly turns into an issue, new traces get clustered with it, and it gets an owner. The owner makes a change, which adds a new evaluator or updates an existing one. The change ships through a release gate that runs that evaluator. If a regression shows up later, the issue reopens.

No ceremony. Just a loop.

The ticket needs a shape

An agent issue is not a Jira card with “LLM flaky” in the title. That card should be illegal (morally, at least). The issue needs the same hard edges as a production defect.

At minimum, that means these fields:

Failure name: “refund flow calls payment API before policy check”
Workflow: refund, plan upgrade, incident triage, research synthesis
Severity: customer impact, data risk, financial risk, operational drag
Evidence: linked traces, failed eval runs, user feedback, tool responses
Boundary: prompt, tool contract, context source, model route, permission, downstream API
Owner: team or service owner, not “AI”
Fix status: proposed, merged, reverted, blocked
Regression coverage: benchmark eval, coverage eval, release gate
Reopen rule: the exact signal that opens it up again and puts it back in the queue

Agent failures are hard to test because they manifest differently based on the input, the branch under test, and the tools the agent touched before it failed. Without an issue name, every failure trace becomes a new issue to debug rather than another data point in the failure family that production tests are supposed to remove.

If LangSmith Engine emits issue.created and issue.trace.added events, then stable event IDs can handle dedupe, severity can travel with the event, and the shared request ID can group deliveries from the same upstream action. That’s all that’s required for this. No need for a religion. Use the existing webhook shape to get failures into queues, boards, and CI jobs.

The boring webhook handler should do four things:

Dedupe on event ID.
Group related deliveries by request ID.
Attach trace evidence to the existing issue when the cluster already exists.
Trigger the right owner workflow for the right reasons, meaning severity and recurrence justify the cost of that workflow.
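
The four steps above can be sketched as a small in-memory handler. This is a minimal sketch, not the LangSmith webhook contract: the payload field names (event_id, request_id, issue_id, severity, trace_id), the "critical" severity label, and the recurrence threshold are all assumptions made for illustration. Only the two event names come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    issue_id: str
    severity: str
    trace_ids: list = field(default_factory=list)


class IssueWebhookHandler:
    """In-memory sketch; a real handler would persist this state."""

    def __init__(self, page_severities=("critical",), recurrence_threshold=3):
        self.seen_event_ids = set()
        self.deliveries_by_request = {}  # request_id -> [event_id, ...]
        self.issues = {}                 # issue_id -> Issue
        self.triggered = []              # (issue_id, reason), in firing order
        self._fired = set()              # avoids re-triggering for the same reason
        self.page_severities = set(page_severities)
        self.recurrence_threshold = recurrence_threshold

    def handle(self, event: dict) -> str:
        # 1. Dedupe on event ID.
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["event_id"])

        # 2. Group related deliveries by request ID.
        group = self.deliveries_by_request.setdefault(event["request_id"], [])
        group.append(event["event_id"])

        # 3. Attach trace evidence to the existing issue when the cluster exists.
        issue = self.issues.get(event["issue_id"])
        if issue is None:
            issue = Issue(event["issue_id"], event["severity"])
            self.issues[issue.issue_id] = issue
        if event["type"] == "issue.trace.added":
            if event["trace_id"] not in issue.trace_ids:
                issue.trace_ids.append(event["trace_id"])

        # 4. Trigger the owner workflow only when severity or recurrence justifies it.
        reason = None
        if issue.severity in self.page_severities:
            reason = "severity"
        elif len(issue.trace_ids) >= self.recurrence_threshold:
            reason = "recurrence"
        if reason and (issue.issue_id, reason) not in self._fired:
            self._fired.add((issue.issue_id, reason))
            self.triggered.append((issue.issue_id, reason))
            return "triggered"
        return "recorded"
```

In this sketch a low-severity issue stays quiet until enough distinct traces pile up, which is one way to make recurrence justify the cost of the owner workflow.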

This is a small piece of work. It is also how agent quality work avoids getting lost on Tuesday.

Benchmarks are pointing at the same problem

Long-horizon agent work fails in similar ways to engineering work. Rather than one incorrect result, the failure is a series of small errors that creep up over time, leaving a final result that is less than useful.

RoadmapBench exists to evaluate long-horizon software development tasks: 115 tasks spread across 17 repositories and 5 languages. The median task modified 3,700 lines of code in 51 files. For tasks at that size, the best model resolved 39.1% of them. The useful analysis is where the generated plan went wrong, which files inside the task became riskier, and which requirements got orphaned.

LongCLI-Bench scores long-horizon programming tasks in a similar way: fail-to-pass, pass-to-pass, and step-level progress. It reports pass rates below 20% for state-of-the-art agents. There is a big difference between a late red X for failing to hit the ultimate task goal and an early red X that points to a tool loop, the wrong files, or a pass-to-fail regression.

Phoenix-bench makes the same point on hardware tasks. Oracle file-level localization added only 1.4% to resolution, while a single round of feedback from testbench logs increased the resolved rate from 42% to 45%. Pointing an agent at the right general area is of limited value. Actionable feedback on the task under consideration is what helps.

This is the issue-loop argument dressed up in benchmark clothing. Better testing of AI agents requires more than a simple test suite. It requires a workflow that can expose issues, allow them to be fixed, and verify the fix inside the same workflow.

The eval suite should grow from resolved issues

Closed issues should feed the test suite.

LangSmith describes evaluators as workspace-level resources that can be attached to tracing projects and datasets in the same workspace. Engine can suggest custom evaluators for the issues it detects, and those evaluators can then be tied back to the trace clusters that caused the issue in the first place.

Brace Sproul’s distinction between benchmark evals and coverage evals maps onto this. One set of evaluators for fast benchmarks on known workflows. A second, more exhaustive set of evaluators for longer paths, product commitments, and stranger trajectories. Trying to use one suite for both ends turns evals into the tax nobody wants to pay.

Resolved issues should feed the right suite, not one giant eval blob.

Severity-0 resolved issues, like refund errors in critical workflows, should be evaluated with the fast benchmarks. A rare edge case in a long multi-hop research workflow is probably better served by the broader coverage suite, high cost and long run time included. Severity-0 policy violations may belong in both suites.

However it gets cut, this is work. Every workflow change can introduce failure modes the system has not seen before. The test that proves a fix worked is different from the test that guards the same problem against a later regression. And then there is the matter of the gate.

The queue is where agency gets real

Capabilities are easy to demo. The harder discipline is building agents that improve, and that is much harder to show than the tasks and workflows an agent can handle. Developing AI Agency was always about that discipline. It shows up in a particular way: when something fails, the team can explain what happened next.

A good issue queue for a development team debugging a failure answers these questions. The team cannot get all this information from a single trace.

Is this failure new or recurring?
Which workflow owns it?
Which traces point at the same root cause?
Did the fix land?
Which evaluator covers it now?
Which release gate blocks a regression?
Who gets paged if the issue comes back?

Again, this is normal software development, complicated by workflows that fail through complex, probabilistic, variable paths. Same defect, different costumes.

The LangChain survey of production AI agents found that 57.3% of respondents already have agents in production. The number one production blocker cited by respondents was quality, at 32%. This sits next to 89% observability adoption for production agents, far ahead of offline evaluation at 52.4% and online evaluation at 37.3%. There is already a sea of visibility for production agents. The work to convert that visibility into closed quality issues is still barely underway.

Honeycomb’s new investigation features for agent observability start to address the same problem, with Agent Timeline built to reconstruct complex multi-agent, multi-trace workflows. But reconnecting that path to specific owned work, and making sure the work is covered, is still the large gap.

That is where the issue queue comes in.

Own the loop

The AI agent workflow I want is not fancy.

Signal failure -> create issue -> add evidence -> assign owner -> propose fix -> add evaluator -> run release gate -> reopen on regression.

This workflow looks less interesting week to week than announcing a new AI model. But to the buyer with an agent touching refunds, support tickets, infrastructure changes, or account data, this is the kind of work that matters week to week. Last week’s failure needs to become this week’s guardrail.

Agent failures should open tickets.

The ticket is where the trace becomes work. The work is where the system gets better.

Agent Traces Need to Cross the MCP Boundary | Focused Labs

Austin Vance — Tue, 19 May 2026 21:23:21 +0000

Observability for AI agents running through MCP has a new failure point: the MCP tool call.

Good. The broad version of this conversation has already been beaten to death. Agents need traces. Agents need evals. Agents need feedback loops. Fine. The sharper production question is what happens when the agent leaves the planner and crosses into a tool server owned by another team, another vendor, another runtime, or another cloud account.

That boundary is where the trace disappears.

Honeycomb is running O11yCon in San Francisco this week. Christine Yen's line in the announcement gets at the issue: agents are writing code, agents are triaging incidents, agents are running production through orchestration, and engineering has little visibility into what the agents did, let alone whether they added value. The visibility gap for these agents is along the path between the model's decision, the tool server, and the downstream services affected by the action.

The production shape is distributed tracing with a model in the loop.

A planner says "tool failed." An MCP server just sees an unrelated tools/call. A database sees a single query. A payment API sees a single request. The observability backend sees all these individual pieces and, operationally, has no idea what to do with them. Nobody can say whether the model chose the wrong tool, the planner's MCP client lost context somewhere along the line, the server failed to accept the call, or the downstream service simply timed out.

Logs within a given service are comfortable to view because the local nature of the stream makes them easy to interpret. However, as soon as an incident affects multiple services or tools, that comfortable stream of logs disappears.

MCP made tool integration portable. It did not magically make tool behavior observable. Focused has been pushing this shape for a while. In Developing AI Agency, the point was that useful agents need real engineering systems around them. In Streaming agent state with LangGraph, the point was that intermediate state matters while long-running work is happening. MCP adds a protocol boundary to that same production story. If the trace cannot cross it, the agent becomes opaque at the exact moment it starts doing useful work.

MCP gave us the carrier

MCP made tool integration for production tool calls easier. Making the behavior of those calls observable is a different job.

This brings us to a simple and useful place: SEP-414 reserves the W3C trace keys for W3C trace context propagation through MCP. So the MCP tools/call request can include trace context as part of params._meta, next to the tool name and arguments.

MCP typically wants _meta keys that start with a DNS-prefixed name. SEP-414 makes an exception for the three W3C trace keys so existing OpenTelemetry propagation can work without creating twelve slightly different names for the same thing. traceparent stays traceparent, tracestate stays tracestate, and baggage stays baggage.

Tiny standardization, huge operational consequence.

A universal set of properties for W3C trace context is a small thing to request. Without SEP-414, every agent stack invents its own set of properties in params: io.modelcontextprotocol.traceparent, otel_trace_parent, correlation IDs encoded in a vendor envelope, plus the special shape required by a proprietary monitoring stack. The resulting observability swamp would be indistinguishable from what exists today with services and their HTTP traces.

First, the agent runtime starts a new span or continues an existing one. Then the MCP client for that runtime injects W3C trace context into params._meta for the call. When the MCP server processes the call, it extracts the W3C trace context from params._meta. Then the server creates a new server span. Tool code invoked by that server, including API calls to databases, queues, workflow engines, and other services, runs under the same trace context.
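
The client and server halves of that flow can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: it handles only traceparent (not tracestate or baggage), the helper names are hypothetical, and a production system would more likely use an OpenTelemetry propagator than hand-rolled parsing. The header layout follows the W3C Trace Context version 00 format.

```python
import re
import secrets

# W3C traceparent, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id=None, sampled=True):
    """Create a traceparent for a new span, optionally continuing a trace."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def inject(request: dict, traceparent: str) -> dict:
    """Client side: return a copy of the request with params._meta.traceparent set."""
    params = dict(request.get("params", {}))
    meta = dict(params.get("_meta", {}))
    meta["traceparent"] = traceparent
    params["_meta"] = meta
    return {**request, "params": params}


def extract(request: dict):
    """Server side: return (trace_id, parent_span_id, flags), or None if invalid."""
    value = request.get("params", {}).get("_meta", {}).get("traceparent")
    match = TRACEPARENT_RE.match(value or "")
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None  # all-zero IDs are invalid per the W3C spec
    return trace_id, parent_span_id, flags


def start_server_span(request: dict) -> dict:
    """Start a server span that continues the caller's trace when one is present."""
    context = extract(request)
    if context is None:
        return {"traceparent": new_traceparent(), "parent_span_id": None}
    trace_id, parent_span_id, flags = context
    sampled = int(flags, 16) & 1 == 1
    return {
        "traceparent": new_traceparent(trace_id, sampled),
        "parent_span_id": parent_span_id,
    }
```

The traceparent returned by start_server_span is what the tool code would pass on to downstream API, database, and queue calls so they land in the same trace.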

The tool boundary is where agent observability either survives or dies.

HTTP spans will not save the agent loop

A tempting shortcut is to assume the transport already has tracing. The MCP server runs over HTTP. The ingress span exists. The collector sees requests. Done.

Nope.

That is why OpenTelemetry's MCP semantic conventions matter: HTTP spans only contain information about transport. Streamable MCP transports can contain more than one request, and one MCP operation can spread across retries and transports. The transport context and MCP context are related, but different.

A streamable HTTP request can sit under multiple MCP messages. A retry can create multiple transport-level attempts for one logical operation. Stdio has no HTTP request to hang a trace on at all. If instrumentation stops at the transport layer, the team is just looking at plumbing. The production question lives one layer up: what MCP method was called, what tool was called, what session was involved, what error type was returned, and which downstream spans received the trace context.

A trace is useful when it follows the boundary. In the simplest case, a single trace starts with a span created by the agent runtime. The span name should be boring and low-cardinality, with names like tools/call get_weather, tools/call query_customer, or tools/call create_ticket. The attributes carry the information that matters in production: mcp.method.name, gen_ai.tool.name, mcp.session.id, mcp.protocol.version, network.transport, and error.type. OpenTelemetry warns against adding high-cardinality resource URIs to span names by default. That creates backend cardinality problems for no benefit.
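
As a rough illustration of that naming rule, the helper below builds a span name from the MCP method plus the stable tool name and puts everything else in attributes. The function name and the example values are assumptions for illustration. The attribute keys are the ones listed above.

```python
def mcp_span(method, tool_name=None, session_id=None,
             protocol_version=None, transport=None, error_type=None):
    """Return (span_name, attributes) for one MCP operation.

    The name stays low-cardinality: method plus stable tool name only.
    Session IDs, versions, and errors go into attributes, never the name.
    """
    name = f"{method} {tool_name}" if tool_name else method
    attributes = {"mcp.method.name": method}
    optional = {
        "gen_ai.tool.name": tool_name,
        "mcp.session.id": session_id,
        "mcp.protocol.version": protocol_version,
        "network.transport": transport,
        "error.type": error_type,
    }
    # Only set attributes that actually have a value.
    attributes.update({k: v for k, v in optional.items() if v is not None})
    return name, attributes


# Illustrative values only.
name, attrs = mcp_span(
    "tools/call",
    tool_name="get_weather",
    session_id="session-123",
    protocol_version="2025-06-18",
    transport="tcp",
)
```

Keeping resource URIs and arguments out of the name means a backend can group every get_weather call under one operation while the attributes still carry the detail.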

The same thing is true for baggage. Baggage is useful for correlation. It is also an attractive nuisance. A tenant hint here, a route class there, an evaluation cohort for a particular set of runs. Fine. But prompts, secrets, user emails, access tokens, and customer data do not belong in baggage because trace context is supposed to cross service boundaries.
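
One way to enforce that policy is an allowlist applied before baggage is serialized. The sketch below is an assumption about how a team might do it: the allowed key names are hypothetical, and the output follows the W3C baggage header shape of comma-separated key=value pairs with percent-encoded values.

```python
from urllib.parse import quote

# Hypothetical allowlist: coarse correlation hints only, nothing user-identifying.
ALLOWED_BAGGAGE_KEYS = {"tenant.hint", "route.class", "eval.cohort"}


def filter_baggage(baggage: dict) -> dict:
    """Drop every baggage entry whose key is not explicitly allowed."""
    return {k: v for k, v in baggage.items() if k in ALLOWED_BAGGAGE_KEYS}


def to_baggage_header(baggage: dict) -> str:
    """Serialize the allowed entries as a baggage header value."""
    allowed = filter_baggage(baggage)
    return ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in sorted(allowed.items())
    )
```

An allowlist fails closed: a new key that nobody reviewed is dropped instead of crossing the service boundary.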

Google's Cloud Trace documentation treats tracing through remote MCP request metadata as an implementation detail. A remote server can accept traceparent in headers or _meta. Once that tracing information is accepted and the trace is sampled, the server emits spans for the requested operation, including failures caused by the agent or by the tool, and latency caused by the client, network, or server processing.

Sampling policy becomes relevant for observability of the agent's tool work. If the agent's tool work is not sampled, the tool's work cannot be reconstructed later by whoever wired up the chat UI.

Fragmented truth still loses the incident

Separate traces can be valid. A vendor-operated MCP server may want a clean service boundary. A client team may not own the server. Langfuse's MCP tracing docs make that distinction directly. But default separation is awful for incident management when the agent itself is causing a user-visible problem.

The agent chooses a tool. The MCP server executes the request. The database locks. The tool returns a timeout. The planner retries with slightly different arguments. The user waits. Each system can tell the truth from inside its own box. The operator still has to stitch together causality by timestamps, request IDs, Slack screenshots, and vibes (the official fourth pillar, apparently).

Without propagation, every system tells the truth in isolation.

In production flow, agent traces should form a chain that represents both the decision process for a request and the execution process carried out by services. The tool spans from an agent trace should link to the corresponding service spans. Having the agent's processing stages with nothing from subsequent services is model theater. Service spans without the corresponding tool decision are classic APM with no agent-specific information.

Honeycomb has been going down a similar route. Their Innovation Week writeup describes agent workflows that branch, retry, call tools, hand off, and trigger services. They frame Agent Timeline as a view that places the agent's work inside the incident loop and shows the causal chain behind a prompt log.

The implementation surface is small

Here is a concise specification for adding distributed tracing to an agent-enabled workflow:

inject traceparent into MCP params._meta
extract it on the MCP server
name spans by MCP method plus stable tool or prompt name
attach MCP and GenAI attributes with low cardinality
propagate trace context to following API and database calls
keep sensitive data out of baggage
send the result to a backend that can show agent and service work together

The ecosystem around the MCP contract already does a decent amount of the heavy lifting. Grafana's MCP server docs include attributes such as gen_ai.tool.name, mcp.method.name, and mcp.session.id, with W3C trace context propagation from _meta. MCP Toolbox telemetry docs cover attributes for MCP method, transport, protocol, toolset, tool name, and error type. LangSmith accepts OpenTelemetry ingestion, which means MCP spans do not have to sit in an observability island away from LangChain or LangGraph applications.

In practice, agent systems run across different runtimes, including planners, graphs, model gateways, tool registries, MCP clients and servers, legacy APIs, databases, queues, approval steps, and eval jobs. Evidence of proper orchestration cannot scatter across architecture components and still be reviewable by team members from AI, platform, service, and business functions. We discussed the tradeoffs in Multi-Agent Orchestration in LangGraph. For trace propagation, the same reasoning applies. Architecture can be decomposed into modular components. Evidence for correct runtime behavior cannot.

A decent review checklist is simple:

Can an operator start from a failed agent run and find the corresponding MCP tools/call span for the tool that failed?
Can they see the exact tool name without exploding cardinality?
Can they jump from the client span to the server span?
Can they see the downstream API, database, or queue work under the same trace?
Can they distinguish model/tool selection failure from tool/server failure?
Can they see error.type, latency, tokens, and quality signals near the same workflow?
Can they prove no secrets or PII are leaking through baggage or span attributes?

Call it what it is: a pull request.

The owner is the team that owns the boundary

What keeps observability in agent systems stuck is handing MCP observability to the AI team, or the MCP tools team, or the database platform team, while everyone claims the boundary is too hard for any one team to own.

There are four parties involved here: the vendor of the tool, the platform team, the AI team, and the service team. The vendor exposes spans. The platform team runs a collector to gather those spans. The AI team creates a planner span that is passed as context to tools. The service team instruments downstream API and database calls made by tools within an agent run. Someone has to own the boundary between those groups.

Own the boundary.

For an internal MCP server, trace propagation belongs in the server template for all calls. It should not be left to individual tools. For vendor-provided MCP servers, test the contract by sending traceparent in params._meta and verifying that the backend receives the linked span. Test trace propagation from the agent runtime for every tool call after context injection, without needing to chase separate dashboards. Baggage should have a clear policy before developers discover it as a convenient place to add sensitive information.

AI agent observability will continue to sound mysterious when production monitoring means staring at transcripts of model dialogs. A transcript is one artifact. It will never show the intent behind a command, the tools used to execute it, the side effects, the latency, the errors, or the downstream work required by systems that had to deal with the output of those tools.

MCP made tools portable. SEP-414 and the OpenTelemetry MCP conventions make the tool boundary traceable. The work is wonderfully unglamorous: pass the context, name the spans, control the attributes to keep cardinality low, protect baggage from sensitive information, and then follow the tool calls as the trace crosses the same boundary as the agent.

Follow the trace, follow the agent.

AI Agent Orchestration Needs Receipts | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:49 +0000

Orchestrating AI agents breaks in the most boring place of all: between issuing a tool call and the tool call having its intended side effect.

Whether a tool call is a client tool executed by application code or a server tool executed on the provider's side, there is a point in the system where the language and abstraction used to describe tool use break down. A tool call becomes a runtime transaction. The work done by a tool affects databases, makes payments, sends emails, and creates tickets. A retry storm, or even a simple retry, now has significant production consequences.

Agent tools need receipts.

Tool Calls Are Side Effects With Better Marketing

Anthropic's tool-use docs split server tools from client tools. A client tool is executed by application code, and then the application sends tool_result back to the model. This is where language ends and production begins. Databases get mutated. Payments get made. Emails get sent. Tickets get updated. Credentials get used.

I see this boundary get described as a function call. Better: side-effect boundary. These systems do not have a durable receipt right now.

What proves the side effect in an agent runtime? The request IDs from external vendors, the changed rows in the business system, and the receipt the runtime saved before the model moved on. If the runtime cannot track the side effects caused by tool calls inside the model loop, answering "Did this exact tool intent already cause this exact side effect?" takes human eyes reading through three different systems and writing glue code along the way.

The Old Backend Pattern Still Applies

Normal API work has already figured this out. For example, Stripe supports idempotent requests for POST, so a caller can retry after a network failure without charging the customer twice. Stripe compares the parameters of a retried request against those of the original request for that idempotency key and returns an error if they differ, so a reused key is never treated as the same operation with different input.

AWS Lambda Powertools describes idempotency records with INPROGRESS and COMPLETE states, payload hashes, stored responses and an expiration for the record. This is a tiny state machine around a side effect. That's all that's required for an agent runtime to safely handle model-intent-to-change-the-world calls.

The transactional outbox pattern covers the messaging side: write the business state and the outbound message in one database transaction, then deliver from the outbox. AWS writes about the duplicate-message problem for this style of delivery and recommends idempotent consumers that track processed message IDs.

The deterministic backend, for example a Java or Python service, calls a service endpoint with fixed intent semantics. Booking a hotel room is boring in exactly the right way. An agent tool call is produced by a model loop that can re-plan, retry, branch, summarize state, and call the same tool again. The runtime has to record the intent before the side effect is produced.

What the Ledger Has to Know

Tool Ledger. Side-Effect Journal. Orchestration Transaction Table. The name is unimportant. It is a table with a specific shape.

The side-effect ledger is the boundary between model intent and production side effects.

A side-effecting tool call needs a record before execution:


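-- One row per side-effecting tool call, written before the tool executes.
create table agent_tool_ledger (
  id uuid primary key,
  run_id text not null,          -- agent run that produced the intent
  step_id text not null,         -- graph step within that run
  tool_name text not null,
  input_hash text not null,      -- hash of the normalized tool input
  operation_key text not null,   -- business-level identity of the operation
  status text not null check (status in (
    'planned',
    'in_progress',
    'succeeded',
    'failed',
    'compensating',
    'compensated'
  )),
  receipt jsonb,                 -- external fact that proves the side effect
  compensation jsonb,            -- metadata for undoing the effect
  error jsonb,
  run_trace_id text,             -- join key back to the run's trace
  owner_service text not null,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now(),
  -- At most one ledger entry per tool and business operation.
  unique (tool_name, operation_key)
);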

That unique constraint is the point.

The record would hold: tool name, normalized input hash, run ID, graph step, owner service, run trace ID, status, receipt, and compensation metadata. On conflict, the application checks the stored input_hash against the new input_hash. Same key with different input is a bug. The receipt is the external fact: Stripe charge ID, Zendesk ticket ID, GitHub comment URL, invoice number, database primary key, email provider message ID.
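
The reserve-then-check behavior described above can be sketched against SQLite so it runs anywhere. This is a minimal sketch under stated assumptions: the table is a trimmed-down version of the schema above (SQLite has no uuid, jsonb, or timestamptz types), and the function names reserve and record_success are hypothetical.

```python
import hashlib
import json
import sqlite3
import uuid


def input_hash(payload: dict) -> str:
    """Hash a canonical JSON form so field order does not change the hash."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_ledger() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        """create table agent_tool_ledger (
               id text primary key,
               tool_name text not null,
               operation_key text not null,
               input_hash text not null,
               status text not null,
               receipt text,
               unique (tool_name, operation_key))"""
    )
    return db


def reserve(db, tool_name, operation_key, payload):
    """Record intent before execution. Returns (status, receipt)."""
    new_hash = input_hash(payload)
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "insert into agent_tool_ledger "
                "(id, tool_name, operation_key, input_hash, status) "
                "values (?, ?, ?, ?, 'planned')",
                (str(uuid.uuid4()), tool_name, operation_key, new_hash),
            )
        return ("reserved", None)
    except sqlite3.IntegrityError:
        # The unique constraint fired: this operation is already in the ledger.
        stored_hash, status, receipt = db.execute(
            "select input_hash, status, receipt from agent_tool_ledger "
            "where tool_name = ? and operation_key = ?",
            (tool_name, operation_key),
        ).fetchone()
        if stored_hash != new_hash:
            raise ValueError("operation key reused with different input")
        return (status, receipt)


def record_success(db, tool_name, operation_key, receipt: dict):
    """Store the external receipt once the side effect is confirmed."""
    with db:
        db.execute(
            "update agent_tool_ledger set status = 'succeeded', receipt = ? "
            "where tool_name = ? and operation_key = ?",
            (json.dumps(receipt), tool_name, operation_key),
        )
```

A caller that gets back anything other than "reserved" knows the operation already exists and can decide whether to resume from the stored receipt instead of executing the tool again.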

No receipt, no production claim.

Retry Safety Has to Be Designed Before the Retry

A retry policy is essentially a duplicate side-effect generator wearing a reliability costume.

Retries become safe only after the runtime has a durable place to check intent and receipts.

Temporal's Activity documentation recommends idempotent Activities because they can be retried. A non-idempotent Activity can corrupt application state even when the distributed system is functioning correctly. The runtime's retry policy does not make the agent reliable by itself.

This is where agent systems get uncomfortable. Because the system is instrumented to retry on transport failure, it is easy to believe that transport failures are the only thing being retried. In reality, the model observes a timeout, updates its picture of the world, and goes down a different path. For example, after refunding a customer the model may create a support note, then refund the customer again in a summary step because the receipt from the first attempt was lost. The model may ask a human for confirmation in the meantime and then resume with stale tool context. The model may even run a background subagent that takes a different path to arrive at the same conclusion.

The recorded intent cannot be raw JSON. Models produce irrelevant differences. Field order changes. Natural-language notes shift. A good operation key comes from the business operation. The model's token stream is too noisy. refund:{tenant_id}:{payment_id}:{reason_code} beats a hash of the entire prompt. comment:{repo}:{pull_request}:{review_run_id} beats a blob of generated markdown.
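
A key builder in that style can be a few lines. The sketch below assumes the tool owner declares which payload fields identify the business operation. The function name, the field list, and the rule that values may not contain the separator are illustrative choices, not a standard.

```python
REFUND_KEY_FIELDS = ("tenant_id", "payment_id", "reason_code")


def operation_key(prefix: str, fields, payload: dict) -> str:
    """Build a key from declared business fields, ignoring everything else."""
    values = []
    for name in fields:
        value = str(payload.get(name, "")).strip()
        if not value:
            raise ValueError(f"missing operation-key field: {name}")
        if ":" in value:
            raise ValueError(f"separator not allowed in field: {name}")
        values.append(value)
    return ":".join([prefix, *values])


# Two model-generated payloads for the same refund: different field order,
# different free-text notes, stray whitespace. Both map to one key.
first = {
    "tenant_id": "t1",
    "payment_id": "pay_9",
    "reason_code": "duplicate_charge",
    "note": "Customer was charged twice.",
}
second = {
    "note": "dup charge, refunding",
    "reason_code": "duplicate_charge",
    "payment_id": " pay_9 ",
    "tenant_id": "t1",
}
```

Free-text fields such as note never reach the key, so a reworded retry still collides with the original operation on the unique constraint.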

The ownership boundary for a tool corresponds to the ownership of the credentials for that tool. In agent systems, the authentication of the agent to the external system should start with the workload identity. In AI Agent Authentication Starts With Workload Identity, we discussed the reasons why the secrets should not be passed around like party favors. This same principle applies here. The runtime should not make up the side-effect semantics for a tool that is not owned by the runtime.

Observability Without the Receipt Is Theater

Traces do not, by default, create a business-level uniqueness boundary.

Joining traces to ledger entries changes what agent observability can do. The trace explains the path after the incident. The ledger table can drive behavior during the incident: suppress the duplicate, resume from a receipt, trigger compensation, alert the owning team, or block the next step until a human approves the ambiguous side effect.

That is the difference between a dashboard and a control surface. The trace is evidence. The ledger is state.

Evaluations also get a lot better. In place of "the model called the refund tool", the useful check is one planned refund, one succeeded ledger entry, one receipt, zero duplicate external effects after a simulated timeout. In Everybody Tests, we recognized that people are already testing with the feedback loops they have today. The transcript is too thin to capture all the detail.
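
That check can be written as a plain function over ledger rows and the effects observed at a fake provider. This is a sketch under assumptions: the row shape (dicts with status and receipt) and the idea of collecting external effect IDs from a test double are illustrative, not part of any eval framework.

```python
def check_refund_after_timeout(ledger_rows, external_effects):
    """Return failure messages; an empty list means the check passed.

    ledger_rows: ledger entries for one operation key.
    external_effects: receipt IDs observed at the (fake) payment provider.
    """
    failures = []
    if len(ledger_rows) != 1:
        failures.append(f"expected 1 ledger entry, found {len(ledger_rows)}")

    succeeded = [row for row in ledger_rows if row.get("status") == "succeeded"]
    if len(succeeded) != 1:
        failures.append(f"expected 1 succeeded entry, found {len(succeeded)}")

    receipts = [row["receipt"] for row in succeeded if row.get("receipt")]
    if len(receipts) != 1:
        failures.append(f"expected 1 receipt, found {len(receipts)}")

    if len(external_effects) > 1:
        failures.append(f"duplicate external effects: {external_effects}")
    elif receipts and external_effects != receipts:
        failures.append("external effect does not match the stored receipt")
    return failures
```

The transcript only says the model called the refund tool. A check like this asserts on state, which is what a simulated timeout actually puts at risk.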

The Tool Interface Should Expose the Contract

The contract for a side-effecting tool should be defined near the definition of the tool itself. That contract should describe the operational facts that the runtime can enforce for that tool. A side-effecting tool contract should answer:

Is the tool read-only or mutating?
Who owns the tool?
Which fields form the operation key?
Which external receipt proves success?
What status means the side effect is safe to retry?
What compensation path exists when the effect is wrong?
How long does the ledger entry live?
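
The questions above map naturally onto a typed record that lives next to the tool definition. The dataclass below is a hypothetical shape, not an MCP feature or an existing library type. Every field name is an assumption chosen to mirror one question in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolContract:
    tool_name: str
    mutating: bool                       # read-only or mutating?
    owner_service: str                   # who owns the tool?
    operation_key_fields: tuple = ()     # which fields form the operation key?
    receipt_field: Optional[str] = None  # which external receipt proves success?
    retry_safe_statuses: frozenset = frozenset()  # statuses that are safe to retry
    compensation: Optional[str] = None   # compensation path when the effect is wrong
    ledger_ttl_days: Optional[int] = None  # how long the ledger entry lives

    def missing_answers(self) -> list:
        """List the contract questions a mutating tool has left unanswered."""
        if not self.mutating:
            return []
        missing = []
        if not self.operation_key_fields:
            missing.append("operation_key_fields")
        if not self.receipt_field:
            missing.append("receipt_field")
        if not self.retry_safe_statuses:
            missing.append("retry_safe_statuses")
        if self.compensation is None:
            missing.append("compensation")
        if self.ledger_ttl_days is None:
            missing.append("ledger_ttl_days")
        return missing
```

A read-only tool passes with just a name and an owner, while a mutating tool with gaps gets back the exact list of questions it still has to answer.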

This is where MCP and other tool packaging efforts need to grow up to support tools that agents use in production. The interface is more than packaging. It has to be agent-operable: typed, permissioned, inspectable, retryable, and owned by a service. That is the real product, and it is a far cry from an interface that only lets the agent discover and call a tool.

A tool registry that simply says a tool exists is table stakes. A registry that says a write tool mutates customer billing, requires workload identity, lists the operation-key fields, emits a specific external receipt, and pages the service owner on ambiguous completion starts to look like production infrastructure.

Boring. Also useful.

The Runtime Should Refuse Unsafe Writes

Ledger policies for mutating tools run the show.

Read-only tools (retrieval, ranking, summarization, classification) can stay lightweight. Write tools charge cards or email customers, so they follow a different set of rules. For write tools, the runtime should require a ledger policy before registration. The tool owner supplies the operation-key builder, receipt parser, retry rules, and compensation metadata. The runtime supplies the reservation, status transitions, trace joining, and audit events. The orchestration layer checks the side-effect ledger before running the tool and after it fails. The eval harness tests the duplicate paths for the tool. The on-call team can see stuck in_progress rows before the customers do.
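Here is one way the registration rule could look, as a sketch rather than a real framework API. `ToolRegistry`, `LedgerPolicy`, and `UnsafeWriteTool` are hypothetical names. The only behavior shown is that a mutating tool without a ledger policy never gets registered.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class LedgerPolicy:
    """What the tool owner supplies (hypothetical field names)."""
    operation_key: Callable[[dict], str]            # builds the key from tool arguments
    parse_receipt: Callable[[dict], Optional[str]]  # extracts the external receipt
    retryable_statuses: frozenset                   # ledger statuses that may be retried
    compensation_tool: str                          # tool to call when the effect is wrong


class UnsafeWriteTool(Exception):
    """Raised when a mutating tool is registered without a ledger policy."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, dict] = {}

    def register(self, name, fn, *, mutating, ledger_policy=None):
        # Read-only tools stay lightweight; write tools need a policy up front.
        if mutating and ledger_policy is None:
            raise UnsafeWriteTool(f"{name} mutates state but has no ledger policy")
        self._tools[name] = {"fn": fn, "mutating": mutating, "policy": ledger_policy}

    def is_registered(self, name: str) -> bool:
        return name in self._tools


registry = ToolRegistry()
registry.register("search_docs", lambda args: [], mutating=False)

refund_policy = LedgerPolicy(
    operation_key=lambda args: f"{args['tenant_id']}:{args['order_id']}",
    parse_receipt=lambda response: response.get("refund_id"),
    retryable_statuses=frozenset({"reserved"}),
    compensation_tool="reverse_refund",
)
registry.register(
    "refund_customer", lambda args: {}, mutating=True, ledger_policy=refund_policy
)
```

The refusal happens at registration time, so an unsafe write tool fails in deploy or CI instead of at the first timeout in production.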

LangGraph Agent Error Handling in Production makes the related point: handling errors in tools called by an agent is more than handling the exceptions raised when the tool is called. The side effects that occur before the error surfaces, especially around a timeout, are the real problem the error handling has to address. The ledger is where the system goes looking for evidence.

That last point matters. Agents can keep going after an error has occurred. But in production, continuing can be reckless.

Own the Receipt

The gold rush version of AI agent orchestration wants better planners, bigger context windows, and more tools. Fine. Those help.

The production version needs a boring table that answers whether a tool call already did the thing.

That table won't demo well. Nobody cheers for a simple unique index on (tool_name, operation_key). But that's exactly what this table is. And it will save a team from having to refund, email, provision, delete and apologize (for the mysterious model) twice.
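A small sketch of that index using Python's built-in sqlite3 module. The table name, the columns beyond (tool_name, operation_key), and the `reserve` helper are assumptions made for illustration. A production ledger would live in the team's real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE side_effect_ledger (
        tool_name     TEXT NOT NULL,
        operation_key TEXT NOT NULL,
        status        TEXT NOT NULL,
        receipt       TEXT
    )
    """
)
conn.execute(
    "CREATE UNIQUE INDEX ux_ledger_op ON side_effect_ledger (tool_name, operation_key)"
)


def reserve(tool_name: str, operation_key: str) -> bool:
    """Return True if this call won the reservation, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO side_effect_ledger (tool_name, operation_key, status) "
                "VALUES (?, ?, 'in_progress')",
                (tool_name, operation_key),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a second row for the same operation.
        return False


first = reserve("refund_customer", "tenant-1:order-42:1999")
retry = reserve("refund_customer", "tenant-1:order-42:1999")
other = reserve("send_email", "tenant-1:order-42:1999")
```

The retry loses the reservation because the database, not the model, decides whether the operation already exists.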

The model can be probabilistic. The side-effect boundary cannot.

Own the receipt.

Agentic AI Implementation Runs Through Change Control | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:16 +0000

Agentic AI implementation has been mis-sold. People compare it to software enablement, and that comparison breaks the moment the agent can change a workflow.

The agent approves a refund, opens an incident, updates a customer record, begins onboarding for a new customer, or escalates a support ticket. At that point a training calendar and a Slack message are not enough for a rollout plan.

It needs a change record.

Enterprise AI adoption has a naming problem. ‘Adoption’ gets viewed through the same lens as software ‘usage’, so the work is framed in terms of seats, office hours, and examples of how to format a prompt, and then everyone waits for it to kick in. But the work actually gets executed by an agent that in turn changes a workflow.

The system has entered the process.

Microsoft's 2026 Work Trend Index frames this shift as an operating-model problem. WorkLab analysis finds that employees may be ready for AI, while the systems around work are not. Agent approvals, open incidents, and changed customer records create a different implementation roadmap.

That changes the implementation roadmap.

The Rollout Surface Changed

Agents behave differently from a chat tool. An agent is released through a system.

ServiceNow announced Action Fabric at Knowledge 2026, explicitly opening its governed system of action to agents. The MCP Server gives agents access to workflows, playbooks, approvals, catalog requests, and business rules, all of which run through identity verification, granted permissions, and audit trails.

The enterprise agent problem shows up when an agent moves from the edge of a process, where it summarizes work already done, to the inside of the process, where it makes a move.

The first question for the enterprise is no longer "who should have access to this tool" but "what change is this tool going to drive for the business, and who is going to own that change?" That ownership spans the teams that run the production systems, regulatory compliance, promises to customers, incident response, and the overall economics of the workflows the agent is inserted into.

The reality of the enterprise is well captured in a preview for LangChain's Interrupt 2026: the initial excitement of having agents prove work in production will quickly give way to questions about the team, tooling, and infrastructure required to support agents that are no longer ‘proof-of-concept’ work (LangChain Interrupt 2026 preview). My experience with clients has been the same: initial excitement with the first useful agent, overlapping work with the second, and ownership problems with the third.

Fine. That is the good version.

The bad version of this is quiet. A team enables an agent with a service account, an admin token, and a dashboard that nobody looks at. It looks good during the demo. Then a source system changes (a field gets renamed, for example), a policy document drifts, an approval queue gets renamed, a customer edge case surfaces, and the agent keeps moving. Nobody owns the change because nobody treated the agent as a change.

The rollout path gets safer when every promotion carries evidence, scope, and a rollback owner.

The Change Record Is the Agent Spec

Atlassian describes IT change management as planning, reviewing, approving, and deploying changes to services with as little disruption as possible. Boring. Also the right object.

Agentic AI needs the same boring object.

A change record should specify which human role loses or gains work, which systems the agent can interact with, which actions require approval, which actions are forbidden, which metrics define harm, which traces prove behavior, and which owner can roll back changes made by the agent when something goes wrong.
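As a sketch, that change record can be a typed object that CI refuses to accept while any question is unanswered. The `AgentChangeRecord` class and its field names are hypothetical, and the rule that every field must be non-empty is an assumption a team may want to relax.

```python
from dataclasses import dataclass, field, fields


@dataclass
class AgentChangeRecord:
    """One record per agent promotion (hypothetical field names)."""
    roles_affected: list = field(default_factory=list)      # human roles that lose or gain work
    systems_in_scope: list = field(default_factory=list)    # systems the agent can interact with
    approval_required: list = field(default_factory=list)   # actions that require approval
    forbidden_actions: list = field(default_factory=list)   # actions that are forbidden
    harm_metrics: list = field(default_factory=list)        # metrics that define harm
    trace_sources: list = field(default_factory=list)       # traces that prove behavior
    rollback_owner: str = ""                                 # owner who can roll back

    def unanswered(self) -> list:
        """Names of the fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = AgentChangeRecord(systems_in_scope=["zendesk"], rollback_owner="support-ops")
ready = AgentChangeRecord(
    roles_affected=["tier-1 support"],
    systems_in_scope=["zendesk", "stripe"],
    approval_required=["refund_over_50_usd"],
    forbidden_actions=["delete_customer"],
    harm_metrics=["refund_reversal_rate"],
    trace_sources=["langsmith:support-agent-prod"],
    rollback_owner="support-ops",
)
```

A promotion check could fail the pipeline whenever `unanswered()` is non-empty, which keeps the record next to the code instead of in a slide.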

Rather than going straight to a typical roadmap of discovery, pilot, platform choice, training, and rollout, I would put a change-control spine through each step of that typical roadmap.

Discovery should start from the workflows rather than from all the cool things an AI can do, so that “Summarize account notes” and “renew an enterprise contract” land in different risk classes. Pilot work should run in a sandbox that is production-like in its data and failure handling. Limited rollout should constrain the agent's authority before the agent is given to more people. Production should have a clear owner, traces kept for a defined amount of time so performance can be evaluated, and a clear path to resolution when an incident happens.

This keeps the agent’s actual permissions from being discovered during an incident review.

Embedding service ownership into an organization's way of working mitigates these implementation dangers through contracts between teams, a sandboxed deployment, and an appropriate rollout sequence. The AI team owns the things it knows best: the evaluation harness, the evals, model routing, and deployment mechanics. The business process owner owns the workflow semantics. Security owns the permission envelope, operations owns production response, and the relevant parts of legal or compliance own the consequences of non-compliance.

Shared ownership is annoying. So is production.

This is why I keep harping on service ownership for agent work. LangGraph for enterprise agent development made the runtime version of this point. Production agents have operational contracts. A clever graph is not enough. It can fall apart after the first model swap, policy change, or integration outage.

The change record is the handoff object between business process, agent runtime, security, and operations.

The Metrics Already Exist

No need for another exotic agent scorecard. The software delivery world already has the basic bones. DORA's software delivery metrics track change lead time, deployment frequency, failed deployment recovery time, change fail rate, and deployment rework rate.

Change lead time: time from proposing agent behavior to approving production behavior.
Deployment frequency: rate of safely promoting agent changes to production, such as a change to a tool registry, policy pack, memory schema, retrieval index, or workflow.
Failed deployment recovery time: time to reverse an agent change, such as reverting a prompt or policy that was added to production, removing a permission that was granted to an agent, or switching back to a previous workflow.
Change fail rate: percentage of changes to agents that require intervention.
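A rough sketch of how those four numbers could be computed from agent change records. The `AgentChange` shape and the `summarize` function are hypothetical. Treating "a change that needed intervention" as a change with a `failed_at` timestamp is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class AgentChange:
    proposed_at: datetime
    approved_at: datetime
    failed_at: Optional[datetime] = None     # set when the change needed intervention
    recovered_at: Optional[datetime] = None  # set when the change was reversed


def summarize(changes: list, window_days: int) -> dict:
    if not changes:
        raise ValueError("need at least one change to summarize")
    lead_times = [c.approved_at - c.proposed_at for c in changes]
    failures = [c for c in changes if c.failed_at is not None]
    recoveries = [c.recovered_at - c.failed_at for c in failures if c.recovered_at]
    return {
        "median_lead_time": median(lead_times),
        "promotions_per_week": len(changes) / (window_days / 7),
        "change_fail_rate": len(failures) / len(changes),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


t0 = datetime(2026, 5, 1)
changes = [
    AgentChange(t0, t0 + timedelta(days=2)),
    AgentChange(
        t0,
        t0 + timedelta(days=1),
        failed_at=t0 + timedelta(days=2),
        recovered_at=t0 + timedelta(days=2, hours=3),
    ),
    AgentChange(t0, t0 + timedelta(days=4)),
]
report = summarize(changes, window_days=14)
```

The arithmetic is the easy part. Deciding when a change counts as failed is where the evidence comes in.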

This would all be nice and clean if an agent’s behavior failed in a binary way, like an exception being thrown. But it does not. It produces a technically correct answer that just happens to be wrong in the context of the workflow. Which is why the failure is behavioral, not binary, and is invisible to a deployment platform that only knows how to scream when a process fails to start.

So the metric needs evidence.

The production agent rollout should collect traces of decisions (tool calls, approval steps, and so on), rejected actions (for example, because of insufficient privileges), user-corrected mistakes, and any eval failures. Add business outcomes to that list and the release story has the evidence the change board needs. Without it, the board is approving “stuff” with a slightly nicer UI.

This is where Everybody Tests comes in. Testing cannot be relegated to downstream QA when an agent can affect a live workflow. Product, engineering, operations, security, and enterprise systems teams should be able to run the test. Ideally, they should understand it, too. The eval suite tests behavioral regressions. Traces reveal runtime drift. Approval logs expose authority escalation. Business metrics surface harm the model never sees.

All of them are part of the change.

The Roadmap Is a Promotion Ladder

Start with read-only assistance. The agent assists with summarization, search, templates, classification, and process explanation. That finds workflow fit and failure modes without giving the system authority to act.

Next, the team gradually grants more permission inside well-defined boundaries: completing low-dollar refunds, updating internal tickets, sending non-regulated customer messages, changing low-risk account fields, deploying to test environments. The goal is to prove bounded authority before scope expands.

This promotion path pays for itself by preventing a business process from being secretly screwed by an AI that nobody can explain.

Make each step on the promotion ladder concrete. Human-in-the-loop needs a named reviewer, a review surface, override power, correction capture, and a rule for when the agent stops asking. Same for guardrails, observability, and governance. Each word should collapse to an owner, system, threshold, and audit trail.

McKinsey's 2026 AI trust survey is useful here because it separates adoption from maturity. Strategy, governance, and controls for agentic AI remain the weak spots. Security and risk concerns remain the main barrier to scaling. Which tracks.

Boring. Beautiful.

Own the Change

As long as an organization treats an enterprise AI agent like another tool to spread to more people with the same enthusiasm, the implementation will fail shortly after its first collisions with the organization’s permission models, its customers’ reporting structures, its compliance requirements, its process exceptions, and its sheer number of customers.

I have no interest in recreating CAB theater for enterprise agents. A meeting with eight or more approvers for a password reset workflow that they do not even understand is a waste of time and effort. Review is reasonable in regulated paths, but that should be the exception, not the rule. And it should be as lightweight and technical as possible, ideally close to where the work is actually being done, such as a simple approval in the workflow UI.

Put the agent change record next to the PR, the eval report, the trace sample, the permission diff, and the rollback plan. Have the workflow owner sign the semantics; security sign the authority; engineering sign the runtime; and operations sign the incident path.

Then ship.

That is what an AI implementation roadmap needs now: a promotion path for systems that can act.

Production always gets weird.

Agent Benchmark Scores Are Measuring the Harness, Not the Model | Focused Labs

Austin Vance — Sun, 17 May 2026 21:13:13 +0000

The difference between the leading agentic coding models is much smaller than the difference between two distinct configurations of a single model on the same benchmark. Anthropic just quantified it: a six-percentage-point gap on Terminal-Bench 2.0 between the most- and least-resourced setups, p < 0.01. Same model. Same task set. Same harness. The only variable was the resource budget given to the pod.

This is larger than the spread between most frontier models on the public leaderboard.

The number the enterprise picked as "the best agent model" is mostly the amount of CPU and RAM that the eval team assigned to the pod for the test. Welcome to production.

The benchmark is not what the benchmark claims to measure

Static evals score a model's output directly. Agentic coding evals score a model in a runtime, and the runtime itself decides whether a container gets OOM-killed for a transient memory spike, whether a pip install command finishes, whether a test subprocess ever returns a result. Two agents at different resource budgets will be taking different tests.

Anthropic ran Terminal-Bench 2.0 across six resource configurations, from strict enforcement of the per-task specs all the way to completely uncapped. They observed 5.8% of tasks failing on pod errors unrelated to model capacity at strict enforcement, compared to 0.5% at uncapped. Success scores at 1x through 3x were largely within noise (p=0.40), since the agent was going to fail those tasks anyway. However, past 3x, success scores climbed faster than infra errors declined. The extra headroom gave the agent room to attempt new approaches that only work when given more generous allocations, such as installing several large packages at once, running memory-hungry test suites, or spawning subprocesses that take extra time to complete.

The benchmark shifted. Previously it was measuring how capable the model was. Now it is measuring how much budget the harness gives the agent to brute-force the answer.

This is not a bug in Terminal-Bench. It is the nature of agentic evaluation: the runtime is not a passive container, it is an active part of the problem-solving process.

When the benchmark does not include the exact hardware and resource configuration, it ships a number that can't be compared to anyone else's number. Nobody is measuring the same thing.

The model is mostly plumbing

Harrison Chase has been making a variant of this argument for about a year. The agent is not the model. The agent is the harness, memory, tools, prompts, retries, state machines, guardrails, and context windows, with a model call buried somewhere in there.

The Anthropic data is the experimental confirmation of the harness sitting at the heart of the agent. Flip the pod resource limits and the "same" agent is a different agent inhabiting a wildly different reality. Flip the sandbox provider and the same leaderboard score means a completely different thing. The vast majority of the decisions that go into building an agent are about tuning the harness.

Anna Bernad posted a Twitter thread last week after looking at 36 production agent harnesses. Her take is far sharper than mine.

"Every harness I studied that actually ships does the same underlying move, and guess, it's not separation. It's making the context describe a different room."

If the context reads as "teammate shipped work, I'm the reviewer, pipeline wants green," the agent soft-approves with a minor note. Not because the model is bad. The agent is trying to fit the response to the context, and soft approval is the only way to complete the pattern.

The harness is the room. The model is the tenant.

What this does to enterprise procurement

By the time a client engages us, agent performance has usually fallen short of what the benchmark promised. The model selected for the agent's function is sound. The harness the model operates in is what holds the application back. The runtime may not give the tools sufficient compute to act effectively. The retry mechanism built to improve throughput masks critical errors until it is far too late. The context window is being consumed by boilerplate system prompts the procurement team didn't know existed.

The enterprise then concludes "AI doesn't work for us" and abandons the effort. The model vendor is blamed. Nobody audits the scaffold.

I don't automatically disbelieve vendor benchmark claims, but those claims become pure marketing when they are presented as an "eval score" for buyers to use in comparing vendors. If the score is only reproducible on the vendor's Kubernetes cluster, with their sandboxing solution and their machine resources, it has no procurement value.

The LangSmith Signal report this week puts billions of agent runs behind the month's trends. Anthropic grew 73% in users, gaining 39% of share. Gemini rose after the release of Gemini 3. OpenAI remained the largest at around 80% of volume but didn't move up or down. Those are usage numbers, not capability numbers. People are moving around based on what actually works in their harness, not based on what a leaderboard says.

How to read a benchmark

Three questions, in order.

First: what was the harness? If the eval team doesn't publish the scaffold, retry policy, context budget, tool set, and resource configuration, the number is a picture of one run on their box and not comparable to anything.

Second: what is the infra error rate? Anthropic reported 5.8% of Terminal-Bench 2.0 tasks failing on pod errors at strict enforcement, a 5x margin above the spread between most frontier models. An eval that doesn't separate "model failed" from "container got killed" introduces a lot of noise in the headline number.

Third: does my production environment resemble the eval environment? If the eval runs uncapped on a data-center GPU cluster, the score is going to have almost no predictive value for me, since my agent runs in a sandboxed environment such as a Lambda function with a 512MB memory cap. An agent can win the competition by brute-forcing the space of scikit-learn installs and then fail silently at ship time because it consumes too much memory in the production environment. A lean, efficient agent that loses the benchmark will ship just fine.
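The second question is easy to make concrete. This sketch assumes each eval run has already been labeled with one of three outcomes. The labels and the `score_runs` helper are hypothetical, and the sample counts are made up for illustration rather than taken from any benchmark.

```python
from collections import Counter


def score_runs(runs) -> dict:
    """runs: iterable of dicts whose 'outcome' is 'pass', 'model_fail', or 'infra_error'."""
    counts = Counter(r["outcome"] for r in runs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no runs to score")
    scored = total - counts["infra_error"]  # runs where the model actually got a fair attempt
    return {
        "headline_pass_rate": counts["pass"] / total,
        "infra_error_rate": counts["infra_error"] / total,
        "pass_rate_excluding_infra": counts["pass"] / scored if scored else None,
    }


sample = (
    [{"outcome": "pass"}] * 60
    + [{"outcome": "model_fail"}] * 34
    + [{"outcome": "infra_error"}] * 6
)
result = score_runs(sample)
```

Publishing all three numbers lets a reader see how much of the headline score belongs to the model and how much belongs to the container.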

What to do instead

Build the harness first. Run the model last.

The analysis has to translate to production. Production tools. Production retry budget (or lack thereof). Production memory store. Production prompt scaffolding. Production runtime limits. Wire it up with observability that traces trajectories through the system, not individual LLM calls. Then swap different models in and see what changes.


# Shape of an internal model bake-off in 2026.
# LangChain 1.x, LangGraph 1.1.9, LangSmith.

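from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, PIIMiddleware
from langsmith import Client
from langsmith.evaluation import evaluate

# PRODUCTION_TOOLS, PRODUCTION_SYSTEM_PROMPT, PROD_PII_CONFIG, PROD_POLICY,
# ProductionContext, and the three evaluators used below are assumed to be
# defined elsewhere in the production codebase.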

CANDIDATES = [
    "anthropic:claude-opus-4-7",
    "openai:gpt-5.1-pro",
    "google:gemini-3-pro",
]

def build_agent(model: str):
    # Same tools, same prompt, same retry budget, same memory store.
    # The ONLY variable is the model string.
    return create_agent(
        model=model,
        tools=PRODUCTION_TOOLS,
        system_prompt=PRODUCTION_SYSTEM_PROMPT,
        middleware=[
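            # Assumes PROD_PII_CONFIG is a dict of PIIMiddleware keyword arguments
            # (for example the PII type and the redaction strategy).
            PIIMiddleware(**PROD_PII_CONFIG),
            # Assumes PROD_POLICY maps tool names to their approval settings.
            HumanInTheLoopMiddleware(interrupt_on=PROD_POLICY),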
        ],
        context_schema=ProductionContext,
    )

client = Client()
dataset = client.read_dataset(dataset_name="production-trajectories-q2")

for model_id in CANDIDATES:
    agent = build_agent(model_id)
    evaluate(
        lambda inputs: agent.invoke(inputs),
        data=dataset,
        evaluators=[
            trajectory_match,       # compares actual tool-call path to reference
            tool_call_precision,    # did the agent use the right tool at the right time
            final_output_rubric,    # LLM-as-judge on the end state
        ],
        experiment_prefix=f"harness-bakeoff-{model_id}",
        max_concurrency=8,
    )

All tests run using the same harness, the same tools, one variable at a time. The goal is to select the model that actually works within the production stack, not the one that earned points on a public leaderboard running on a Kubernetes cluster someone else had tuned.

This is where the engineering work lives now, and it is why a lot of clients call us. The model picker is not the problem. The harness design is the problem. The eval infrastructure is the problem. The trajectory observability is the problem.

The harder truth

Tight limits tend to reward agents that write lean, efficient code quickly. Generous limits tend to reward agents that can put every available resource to use. Both are useful to test for, and both correspond to realistic scenarios. Neither can fairly be collapsed into a single number on a leaderboard.

Many of the agents we deploy to enterprises run on some sort of strict budget for resources such as memory and CPU. Beyond these general limits, there are often specific restrictions on things like subprocess runtime and the number of times an API can be called within a window, largely because of cost. The model that wins with unlimited resources is a different model than the one that wins under strict limits.

Pick the model that performs in the harness. Own the harness. Measure the trajectory. The benchmark is not the product.

The harness is the product.

AI Agent Authentication Starts With Workload Identity | Focused Labs

Austin Vance — Wed, 13 May 2026 14:55:56 +0000

AI agent authentication starts when the system can answer which actor is allowed to make a tool call.

The model can propose the action. The runtime has to attach authority to it.

Most teams start with the fastest answer: an API key in an environment variable. The agent reaches Salesforce, GitHub, Jira, Snowflake, Stripe, whatever system makes the first useful proof feel real, and everyone moves on.

That proof matters. It shows the agent can reach the systems where work actually happens. It also hides the first product decision: who is acting when the tool call leaves the runtime?

The agent gets memory. The agent runs in the background. The agent forks into subagents. The agent retries failed operations. The agent calls tools after the user has walked away. The agent lands in an enterprise workflow where the work has value, the logs have value, and breaking something has a consequence.

A shared API key starts as configuration. Then it quietly becomes the identity of the agent.

An ugly place to stumble into by accident.

The secret becomes the actor

Early security models for agents tend toward good vibes with a bearer token. The prompt gives instructions. The tool schema lists calls. Hard-coded secrets in the runtime decide what actually gets done based on the input, the agent, and whatever authority those secrets carry.

The secret wins.

If the same key can read every customer record, submit refunds, update tickets, and write to production data, the agent has all of those powers. Carefulness in the prompt is theater at that point. The tool description can say those powers apply only when appropriate. The audit log will still show one credential able to perform a pile of different tasks.

There is already a category for this outside agents: OWASP's Non-Human Identities Top 10. Production applications identify themselves as non-human identities. Agents are adding themselves to that growing list of stranger workloads, running differently than normal services, but still requiring access to systems and data.

The important step for me is naming the agent as a workload, because the architecture gets less magical and more useful.

Workloads have identities. Workloads can request scoped credentials for those identities. A workload can be denied a credential. A workload can rotate credentials. A workload can leave an audit trail that survives the model, the prompt, and the v2 or v3 abstraction barrier the team is currently working around.

Baseline authentication for production AI agents.

The runtime should issue tool-specific credentials instead of letting the agent carry a shared key everywhere.

Workload identity is the boring answer

This part is old. Good.

Kubernetes already considers service accounts to be identities of processes running in Pods, and the current docs describe short-lived, automatically rotating ServiceAccount tokens issued through the TokenRequest API. SPIFFE generalizes that into workload identity documents, including short-lived X.509 and JWT SVIDs that a workload can use to authenticate itself to other workloads.

Cloud platforms are heading in the same general direction. AWS STS can issue temporary security credentials after a workload has identified itself using OpenID Connect. Google Cloud Workload Identity Federation allows external workloads to access Google Cloud resources without service account keys. Azure managed identity docs describe workload identities as machine and non-human identities associated with compute resources.

The industry knows how to keep long-lived secrets out of the hot path. It just keeps giving agents interfaces that make the old mistake easy.

A developer writes a tool wrapper. The tool wrapper needs credentials. The fastest way to configure it is to add an API key to an environment variable and add a TODO to remove it later. The TODO gets pushed to production because now the agent answers support tickets, reconciles invoices, or looks at CI.

I've worked with teams who reviewed the model, tuned prompts, drew diagrams for tool selection, created a few secrets in deploy config, and crossed their fingers that the tool descriptions would shore it all up.

They are not enough.

Delegation is the missing primitive

In many applications, the agent should rarely hold the credential it uses to act.

Put an identity assertion in the flow. This agent. This tenant. This user context if present. This policy version. This tool request. This approval state. That assertion is exchanged for a credential only when the action needs one.

OAuth was designed to support exactly this shape. RFC 8693 defines token exchange, describing how one temporary credential can be exchanged for another temporary credential intended for a different context. In the agent case, the model proposes an action, the runtime checks policy, the broker issues a credential for that action and tool context, the call happens, and the credential dies.

It does not expire after a quarter. It does not expire after someone remembers to rotate it. It expires because the system puts expiration in the path.
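For reference, here is a sketch of the form body a runtime might send to a token endpoint for that exchange. The parameter names and URN values come from RFC 8693. The `build_exchange_request` helper, the audience, and the scope string are hypothetical. The credential's lifetime is set by the authorization server in its response, not by this request.

```python
from urllib.parse import parse_qs, urlencode

# Identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
JWT_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:jwt"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_exchange_request(assertion_jwt: str, audience: str, scope: str) -> str:
    """Build the form-encoded body for a token exchange request."""
    params = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": assertion_jwt,          # the runtime's identity assertion
        "subject_token_type": JWT_TOKEN_TYPE,
        "requested_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,                    # the downstream tool or API
        "scope": scope,                          # only what this action needs
    }
    return urlencode(params)


body = build_exchange_request(
    assertion_jwt="header.payload.signature",  # placeholder, not a real token
    audience="https://payments.example.internal",
    scope="refunds:create",
)
decoded = parse_qs(body)
```

The broker on the other side of that request is where tenant, run, and approval state get checked before anything is issued.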

That changes the damage pattern. A compromised tool wrapper no longer implies broad access to every downstream system. A prompt injection has to cross approval, run, tenant, and policy boundaries. A subagent that escapes its execution boundary cannot reuse credentials after the run, approval, or tenant context has expired.

The agent is still useful. It just has to query through a production boundary that understands production concerns.

This is why integrated agents are valuable and dangerous at the same time. The valuable integrated agents do not live in a chatbot tab. They integrate with real systems. Once an agent is tied to real systems, authentication becomes product architecture rather than cleanup work hidden in deployment.

The runtime owns the identity boundary

A model provider should not own this boundary. A prompt should not own this boundary. A tool schema should not own this boundary.

The runtime owns it because the runtime follows the whole path.

It connects agent definitions to threads or runs, tenants, and identity information, including the user who initiated the work, whether the work is backgrounded, whether a human approved a risky step, which tool is being called, and which downstream credential is being requested. It can attach those facts to an identity assertion and make a policy decision before any assertion leaves the process.

That policy decision can be boring and explicit:

The refund tool can request a payment credential for the current tenant.
A GitHub tool can request a write credential after CI has produced an eval pass.
The Snowflake tool can request a read credential for one warehouse, one role, and one time window.
A subagent can run with a delegated identity, but only with fewer capabilities than the parent run.
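Rules like these fit in a very small function. The sketch below is illustrative only: `CredentialRequest`, `decide`, and the scope strings are hypothetical, and it covers the tenant, eval-gate, and subagent rules while leaving out the Snowflake time window.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class CredentialRequest:
    tool: str
    run_tenant: str                                  # tenant the run belongs to
    target_tenant: str                               # tenant the credential would act on
    scopes: FrozenSet[str]
    eval_passed: bool = False                        # did CI produce an eval pass?
    parent_scopes: Optional[FrozenSet[str]] = None   # set only for subagent runs


def decide(req: CredentialRequest) -> Tuple[bool, str]:
    """Return (allowed, reason) for one credential request."""
    if req.target_tenant != req.run_tenant:
        return False, "credential must be for the current tenant"
    if req.tool == "github" and "write" in req.scopes and not req.eval_passed:
        return False, "write credential requires a passing eval"
    # A subagent must hold a strict subset of its parent's capabilities.
    if req.parent_scopes is not None and not req.scopes < req.parent_scopes:
        return False, "subagent needs fewer capabilities than its parent"
    return True, "allowed"


refund_ok = decide(CredentialRequest("refund", "acme", "acme", frozenset({"refund"})))
cross_tenant = decide(CredentialRequest("refund", "acme", "globex", frozenset({"refund"})))
red_ci = decide(CredentialRequest("github", "acme", "acme", frozenset({"write"})))
same_as_parent = decide(
    CredentialRequest(
        "snowflake", "acme", "acme",
        scopes=frozenset({"read", "write"}),
        parent_scopes=frozenset({"read", "write"}),
    )
)
narrower_child = decide(
    CredentialRequest(
        "snowflake", "acme", "acme",
        scopes=frozenset({"read"}),
        parent_scopes=frozenset({"read", "write"}),
    )
)
```

Returning the reason alongside the decision gives the trace something to record when the answer is no.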

The list is not impressive, which is why it is powerful.

This is also where multi-agent orchestration gets serious. A supervisor handing work to a subagent creates a delegation relationship along with the task description. The child process needs enough authority to perform the work at hand and no more. The audit log must reflect that chain of trust cleanly or troubleshooting becomes an exercise in futility.

The worst setup is a swarm of agents all sharing the same service account. Simple enough to get going. Terrible when it comes time to debug an incident. Every action has been performed by the same principal, authenticated with the same key, and observed through the same useless blur.

The incident has no useful actor. Just a shared key with a long memory and no accountability.

Short-lived delegated credentials make the agent run, policy decision, tool call, and audit trail line up.

Audit follows identity

Agent observability without identity is half a story.

A trace for the agent step called refund_customer can include latency, tool arguments, model output, retries, all visualized in a convenient span tree. Useful. Then someone asks who had authority to issue that refund, and the trace turns into archaeological excavation.

The right trace shows the tool call connected to a principal. Not just a service account. A principal with an agent ID, run ID, tenant, user context, policy decision, credential scope, and expiration time.
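One possible shape for that principal, flattened into span attributes. The `Principal` class, the `principal.*` attribute names, and the example values are assumptions for illustration. They are not a convention of any particular tracing backend.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Principal:
    agent_id: str
    run_id: str
    tenant: str
    user_context: str
    policy_decision: str
    credential_scope: str
    expires_at: datetime


def span_attributes(tool_name: str, principal: Principal) -> dict:
    """Flatten the principal into attributes a tracing backend could index."""
    attrs = {"tool.name": tool_name}
    for key, value in asdict(principal).items():
        # Datetimes are stored as ISO 8601 strings so they serialize cleanly.
        attrs[f"principal.{key}"] = value.isoformat() if isinstance(value, datetime) else value
    return attrs


principal = Principal(
    agent_id="support-agent",
    run_id="run-123",
    tenant="acme",
    user_context="user-77",
    policy_decision="refund-policy-v4:allow",
    credential_scope="refunds:create",
    expires_at=datetime(2026, 5, 13, 15, 0, tzinfo=timezone.utc),
)
attrs = span_attributes("refund_customer", principal)
```

With those attributes on the span, "who had authority to issue that refund" becomes a query instead of an excavation.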

This is what allows a team to answer questions after the tool call has done real work.

Who granted access? What user context did it use? What broker generated the credential? What version of policy allowed it? What downstream resource accepted it? What subagent inherited it? Can that credential be used for something else?

Those questions determine whether there is a real postmortem or just hand waving about the agent doing something weird.

The same principle applies to testing. In Everybody Tests, I argued that every team already tests whether they admit it or not. Agent identity needs that same honesty. If a runtime can create delegated credentials, tests should verify that the boundary holds. A refund agent should fail against the wrong tenant. A code agent should fail when eval gates are red. A research agent should fail when it asks for write access to a system it only reads.

Not a single npx this and that in the whole codebase. Test it in CI.

Shared keys hide product decisions

The fastest credential story hides the decisions that matter most.

A shared key hides tenancy. It hides user context. It hides the identity of the agent performing an action. It hides which subagent inherited authority. It hides whether approval was granted. It hides whether the action matched the original request. It hides rotation until rotation becomes an outage.

OWASP's secrets management guidance recommends dynamic secrets where possible to reduce credential reuse and limit the damage when credentials leak. Agent systems need the same pressure, with the additional constraint that the credential must represent the run instead of only the application.

A normal backend service is expected to behave predictably and follow a reliable lifecycle. It accepts requests, implements endpoints, and changes through controlled deployments. An agent runtime for integration automation can select different tools per request, execute work in subagents, retry steps, and continue running after initial user interaction has completed.

So identity has to be more exact.

The credential loaned to the system should assert what it is currently allowed to do. The operating policy should be visible enough to understand the motivation behind the action. The audit trail must persist long enough for a human to traverse the events as they happened.

A boundary-based platform does not need a full rewrite. Start with one boundary.

Put an identity broker between the agent runtime and the first high-risk tool. Give the agent runtime a workload identity. Have the broker exchange that identity for a tool credential. Associate the decision with tenant, run, and operation. Record the policy decision in the trace. Add a CI test that proves the wrong tenant fails. Expire the credential quickly. Make the failure visible when the broker returns no.
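Those steps can be sketched in a few dozen lines. This is an in-memory stand-in under loud assumptions: `exchange`, `POLICY`, `AUDIT_LOG`, and the class names are all hypothetical, and a real broker would verify a signed workload identity instead of trusting a dataclass.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


class AccessDenied(Exception):
    """Raised when the broker says no, so the failure is visible to the caller."""


@dataclass(frozen=True)
class WorkloadIdentity:
    agent_id: str
    tenant: str


@dataclass(frozen=True)
class ToolCredential:
    tenant: str
    run_id: str
    operation: str
    expires_at: datetime


# Hypothetical policy table: (agent_id, operation) pairs that are allowed.
POLICY = {("refund-agent", "refunds:create")}
AUDIT_LOG: list[dict] = []


def exchange(
    identity: WorkloadIdentity,
    run_id: str,
    operation: str,
    resource_tenant: str,
    now: datetime,
    ttl: timedelta = timedelta(minutes=5),
) -> ToolCredential:
    """Trade a workload identity for a short-lived, run-scoped tool credential."""
    allowed = (
        (identity.agent_id, operation) in POLICY
        and identity.tenant == resource_tenant
    )
    # Record the decision whether it is an allow or a deny.
    AUDIT_LOG.append({
        "agent_id": identity.agent_id,
        "run_id": run_id,
        "tenant": resource_tenant,
        "operation": operation,
        "decision": "allow" if allowed else "deny",
    })
    if not allowed:
        raise AccessDenied(f"{identity.agent_id} may not {operation} for {resource_tenant}")
    return ToolCredential(resource_tenant, run_id, operation, now + ttl)
```

The CI test that matters is the boring one: call `exchange` with the wrong `resource_tenant` and assert that it raises `AccessDenied` and leaves a deny entry in the audit log.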

Then move the next tool behind the boundary.

The production line

AI agent authentication is the control plane for non-human actors who do work across systems.

Ownership matters here. Security cannot retroactively add this after the agent and its resources have shipped. Platform cannot stash it in a vault path. Product cannot mark it as a checkbox in consent. Identity, delegation, expiration, and audit have to be inherent in the runtime of the agent and how it executes.

The agent should actually be able to act. That is, after all, why we are doing AI agency in the first place. That agency should have a workload identity.

Production systems have already worked out parts of the problem. Kubernetes, SPIFFE, OAuth token exchange, cloud workload federation, managed identities, dynamic secrets. They exist because static secrets rot and shared principal accounts make bad worse.

It is a mistake to grant agents an exemption because the interface is conversational.

The model can decide on the next step. The runtime decides whether that step gets a credential.

Agentic AI Architecture Needs Model Routing

Austin Vance — Fri, 08 May 2026 01:57:35 +0000

Agentic AI architecture is stuck on model loyalty.

The same graph. The same provider. One giant model doing every job because one graph is easier to defend than a routing policy.

I get why people want to pick one model: it makes demos, evaluation, and procurement easier, and debugging gets only slightly worse. The agent call is always the same, the trace is always the same, and the team can blame one provider instead of four.

Fine. But production agents do not do one kind of work.

Classify intent. Search. Summarize. Write code. Choose a tool. Check if a tool's result smells wrong. Write a customer-facing answer when something failed. Decide whether approval is required. Wait for something to happen. Retry something that failed. Recover from something gone wrong.

Production agents run a pile of distinct workloads.

Harrison Chase notes that LLMs are getting expensive, and open source models matter for that reason. LangChain is pushing the same direction from a product perspective, noting that Fleet agents no longer have to be constrained by a single model and can instead use multi-model support.

Those are the same production reality arriving through two doors.

The agent architecture must determine which model should perform which work.

The Same Model Everywhere Is an Architecture Smell

This is surprising. Many current agent stacks treat model selection as just another config parameter of the environment, equivalent to tradeoff parameters or batch sizes. Set MODEL=claude-whatever or MODEL=gpt-whatever and deploy the agent.

That's fine for a chatbot, but lazy for an agent.

Agents introduce variance internally. What looks simple to a user becomes retrieval, planning, transformation, checking, execution, generation and scheduling inside the system. Some of these steps need to be deep, some fast, some cheap. Some need a model that is good at generating code, others an open-weight model because the data cannot legally leave the boundary, or because it is simply too expensive to move around the company.

Using the same frontier model across the board is comforting. It also conceals the waste.

Instead of one glaring failure, I get slow, expensive, bureaucratic agent production. A team looks at the dashboard. Cost rises, latency rises, and people say the model is too expensive or the prompts are too long. The architecture is linear and all steps go to one place.

What gets under my skin is the compute monolith. Everywhere else we have learned to separate compute classes properly (queues are not databases, lambdas are not batch workers, CDNs are not origin servers). Then some clever agent comes along and suddenly every cognitive function has to go through the biggest model in the account.

Come on.

Routing Has to Do More Than Fallbacks

Model routing usually enters the conversation through reliability. If OpenAI is down, try Anthropic. If a deployment is overloaded, try another one. If a provider rate-limits, retry somewhere else.

This is important. LiteLLM's router docs explain load balancing, cooldowns, fallbacks, timeouts, retries, and Redis-based production rate limiting. OpenRouter's provider routing docs explain provider ordering, fallbacks, performance, price, and data policy constraints. Boring infrastructure at its best.

But routing cannot stop at uptime.

In a production agent workflow, the router should understand why a task exists. It should see the agent step, the tool context, the risk, latency budget, data boundary and previous run quality. Then it can pick the appropriate model class for the work at hand.

The router belongs in production architecture, where policy can be tested.

This is where things get more interesting for agentic AI architecture, compared to just building an LLM app. The router turns the agent’s internal structure into an execution policy.

A planner step can go to a reasoning model. A normalization step can go to a fast model. A code-editing subagent can go to a model tuned for code. A bulk summarization step can go to an open-weight model. A regulated data step can stay inside the boundary. A customer-facing final answer can take the slower path because that is where quality matters (since it impacts the customer).

The pattern is already familiar, which is the point. It has the same shape as multi-agent orchestration in LangGraph, but I like it better down at this level. The graph determines what work exists, and the router determines which model class should process that work.

The Router Needs Typed Work

Prompt-based routing is where it all goes wrong.

A team adds "Use the cheaper model when the task is simple." The agent is amiable, but ignores the team's intent at exactly the wrong time. The AI guesses or routes based on whatever words match the current prompt. The result is a vibe with a model attached.

The router needs typed work.

My ideal is for the agent to report task metadata before the model call occurs: task kind, expected output shape, sensitivity of input data, allowed tools, user-facing risk, latency/cost budgets, required capability, and retry posture. I do not need a full taxonomy to start. Most teams can begin with something tiny: classify, retrieve, reason, write, code, act. The key is moving model choice from prose to runtime.
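A sketch of what typed work might look like, assuming a small dataclass is enough to start. `TaskSpec`, `TaskKind`, `Sensitivity`, and the field names are hypothetical; only the six task kinds come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskKind(str, Enum):
    CLASSIFY = "classify"
    RETRIEVE = "retrieve"
    REASON = "reason"
    WRITE = "write"
    CODE = "code"
    ACT = "act"


class Sensitivity(str, Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    REGULATED = "regulated"


@dataclass(frozen=True)
class TaskSpec:
    """Metadata the agent reports before any model call is made."""
    kind: TaskKind
    sensitivity: Sensitivity
    user_facing: bool
    latency_budget_ms: int
    max_retries: int = 1

    def __post_init__(self) -> None:
        # Coerce plain strings so callers can pass "code" or TaskKind.CODE.
        # An unknown value raises ValueError here, before any model is called.
        object.__setattr__(self, "kind", TaskKind(self.kind))
        object.__setattr__(self, "sensitivity", Sensitivity(self.sensitivity))
        if self.latency_budget_ms <= 0:
            raise ValueError("latency_budget_ms must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries cannot be negative")


spec = TaskSpec(kind="code", sensitivity="internal", user_facing=False, latency_budget_ms=2000)
```

A task described as "vibes" fails at construction time, which is the whole point: the routing input is a checked value, not a sentence in a prompt.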

This is a lesson already learned elsewhere in agent architecture. In Developing AI Agency, explicit mechanisms for planning, tools, memory, and verification beat one giant prompt pretending to be architecture. Model selection is another version of this.

The router can start dumb and be a simple lookup table driven by task type. It can be configured to dispatch to the code model for code tasks, the fast model for low-risk summaries, the local model for sensitive data, and the quality model for final text written for specific customers. First, ship that. Verify that it works. Then gradually become less dumb and add more nuance to the router.
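The dumb version really can be this small. A sketch, assuming placeholder route names (`code-model`, `fast-model`, `local-model`, `quality-model`) that stand in for whatever models a team actually runs:

```python
# Placeholder route names; swap in real model identifiers.
ROUTES = {
    "code": "code-model",
    "summarize": "fast-model",
    "final_answer": "quality-model",
}
DEFAULT_ROUTE = "quality-model"
LOCAL_ROUTE = "local-model"


def route(task_type: str, sensitive: bool = False) -> str:
    """Pick a model route from the task type, with a data-boundary override."""
    if sensitive:
        # Sensitive data stays on the local model regardless of task type.
        return LOCAL_ROUTE
    # Unknown task types fall back to the quality route instead of failing.
    return ROUTES.get(task_type, DEFAULT_ROUTE)


print(route("code"))                       # code-model
print(route("summarize", sensitive=True))  # local-model
print(route("something_new"))              # quality-model
```

Falling back to the quality route for unknown task types is a design choice, not a requirement. A stricter team might raise instead so that new task types force a routing decision.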

The first mistake is expecting the team to find the single best router before shipping anything. The second mistake is letting the model design the router policy inside the same prompt it is supposed to execute.

Observability Makes Routing Honest

A router that does not publish telemetry data becomes an additional place where opinions get hidden.

An engineer's affection for a particular design, the score of a benchmark, and the features listed on a vendor's web page are all useful, but ultimately insufficient. The only relevant test is whether the routing rule improves the production agent's performance on the tasks it actually faces.

This means we need to consider cost, latency, error rate, retry rate, approval rate, human correction rate and eval score when deciding the routing for a request. So these statistics need to attach to the routing decision itself, not just to the trace.
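One way to attach those numbers to the routing decision is to record an outcome per routed call and roll it up by route. A sketch with hypothetical names (`RouteOutcome`, `summarize_routes`) and only a subset of the metrics listed above:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteOutcome:
    """One routed model call and what happened to it."""
    route: str
    cost_usd: float
    latency_ms: float
    error: bool
    retried: bool
    human_corrected: bool


def summarize_routes(outcomes: list[RouteOutcome]) -> dict[str, dict]:
    """Aggregate outcomes per route so the policy can be judged route by route."""
    grouped: dict[str, list[RouteOutcome]] = defaultdict(list)
    for outcome in outcomes:
        grouped[outcome.route].append(outcome)

    report = {}
    for route, items in grouped.items():
        n = len(items)
        report[route] = {
            "calls": n,
            "avg_cost_usd": sum(o.cost_usd for o in items) / n,
            "avg_latency_ms": sum(o.latency_ms for o in items) / n,
            "error_rate": sum(o.error for o in items) / n,
            "retry_rate": sum(o.retried for o in items) / n,
            "correction_rate": sum(o.human_corrected for o in items) / n,
        }
    return report


outcomes = [
    RouteOutcome("fast", 0.001, 200.0, error=False, retried=False, human_corrected=False),
    RouteOutcome("fast", 0.003, 400.0, error=True, retried=True, human_corrected=False),
    RouteOutcome("deep", 0.050, 3000.0, error=False, retried=False, human_corrected=True),
]
report = summarize_routes(outcomes)
```

Eval scores and approval rates would slot in as extra fields on the same record.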

LangSmith's platform language is already pointing in this direction. It treats traces as the record of an agent’s actions and reasoning, and says teams should monitor cost, latency, errors, and qualitative online evals. Fleet's product page puts model choice next to admin controls, observability, approvals, MCP connections, and export via APIs. This is the signal.

Model selection has moved from dropdown aesthetics into operational control. It affects the performance of a wide array of business processes.

Once routing is visible, the discussion shifts. The team can stop arguing over which model is best and start figuring out which route failed: fast model for tool argument generation, reasoning model for eval lift, open-weight model for internal summarization, code model for patch generation.

Those are engineering questions.

The answers need to inform the router policy, or else the agent keeps making yesterday's decisions with today's realities.

Open-Weight Models Are Part of the Architecture

The open-model conversation is often deeply ideological. People tend to think in terms of closed models versus open models, frontier quality versus control, benchmarks, and vibes.

Production is less dramatic.

Open-weight models give teams another execution path. They are useful when the task is bounded, when the data boundary matters, when throughput matters, when the cost curve gets ugly, or when the model only needs to be good enough for an internal step the user never sees.

Having access to a frontier model does not mean every call should route through it. That misconception is common. Routing makes the difference.

A team can still use a frontier model for the high-risk reasoning step. And yes, the final answer can still go through a strong hosted model. But the retrieval cleanup, first-pass summarization, metadata extraction, and internal critique may not deserve the same spend.

There is no best model for this problem. The more useful question is: Which model owns this step under these constraints?

Interface portability matters for the same reason. LangChain says Deep Agents ships with ACP so the same harness can run across multiple interfaces. The Deep Agents CLI docs show a coding agent with provider credentials, model switching, tools, memory, skills, MCP tools, and LangSmith tracing. The interface can change. The harness can change. The routing policy has to be portable across both.

Model choice that lives in a UI dropdown is prone to drift. Model choice that lives in the agent runtime can be tested, traced, reviewed and rolled back.

Own the Decision Boundary

The old agent stack revolved around a model call. The next one revolves around a decision boundary.

That boundary decides which work deserves which model, which provider, which data path, how many retries to attempt, what approval loop to operate in, and which evaluation loop to use. Less glamorous than a chart, to be sure, but more relevant to production workflows. Most production architecture is less glamorous than the thing that sells the demo.

The teams that get this right won’t talk about having one “agent model”. They’ll talk about routes: Fast route. Deep route. Code route. Local route. Human-review route. And for each route, they’ll know when to use it, how much it costs, how often it fails, and whether the next release made it better.

This is where integrated agents become useful. The agent owns execution decisions instead of wrapping a model call in a little workflow theater.

The code that matters controls the router, the telemetry and the eval loop.

The model will keep changing. The decision boundary should belong to the team shipping the agent.

Stop Eager-Loading MCP Tools Into the Context Window

Austin Vance — Tue, 05 May 2026 20:31:01 +0000

MCP servers should not eagerly load every tool schema into an agent's context window. Lazy-load tools by intent, then govern and audit execution.

Austin Vance, CEO of Focused

I think the problem with the current state of MCP is way deeper than just resizing the context window.

The protocol itself is decent. Tool discovery, schema negotiation, and the JSON-RPC architecture all feel solid and well engineered. However, the default behavior of populating the agent's context at session start with every tool definition from every connected server makes running production agents virtually impossible.

One developer measured 67,300 tokens consumed before typing a single question. Seven MCP servers. Tool schemas alone ate up a third of the available context. Another measured 81,986 tokens.

The Eager-Loading Tax

When an agent starts a session with MCP servers connected, it downloads the full library of tools, every session. It never filters down to just the tools needed for the job at hand.

My browser automation server is loading 21 tool definitions. A GitHub server loads 27. My web search server bundles 8 providers behind 20 tools. I've not sent a single message yet and I'm already consuming significant context.

The numbers from a study of 856 tools across 103 MCP servers make this worse than it sounds. Fully augmented MCP tool descriptions add 67% more execution steps for a 5.85 percentage point accuracy gain. The tool definitions don't just eat context. They also slow agents down when they actually use the tools.

We wrote about evaluation pipelines for production agents. One failure mode of context pollution from tool definitions that I never see anyone mention is the agent becoming less effective over time. It doesn't necessarily die or crash or throw an error. The real conversation history simply gets pushed out of the working window by the tool schemas.

Even with child agents the context budget gets severely curtailed. Each child agent inherits the MCP configuration. That's new context I guess, but the immediate loss of tens of thousands of tokens to render tool schemas for subagents that may not even use them is completely antithetical to the point of using subagents in the first place: focused context. We covered the architecture patterns for multi-agent orchestration in LangGraph, but even great orchestration can't fix a context budget that's already half spent before the first tool call.

The waste is architectural: eager loading spends the context budget before the agent starts working.

Cloudflare Just Admitted This Is Broken

Cloudflare launched Agents Week on April 12, and buried in their enterprise MCP reference architecture is an admission that the tool-definition model doesn't scale.

Their solution is called Code Mode. It condenses all of the individual MCP tools down into two meta-tools: portal_codemode_search and portal_codemode_execute. Rather than loading every tool definition into context, the agent writes JavaScript to search for and invoke tools on demand.

This means that 4 internal MCP servers exposing 52 tools would normally consume 9,400 tokens just for definitions. Code Mode drops that to 600 tokens. A 94% reduction. For Cloudflare's own API, which would consume over 2 million tokens as a traditional MCP server (twice the largest context window available right now), the reduction hits 99.9%.
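The 94% figure is easy to check from the two token counts quoted above. A small worked calculation, with `reduction_pct` as a helper name of my own:

```python
def reduction_pct(before_tokens: int, after_tokens: int) -> float:
    """Percentage of tokens saved when going from before_tokens to after_tokens."""
    return 100 * (before_tokens - after_tokens) / before_tokens


print(round(reduction_pct(9_400, 600), 1))  # 93.6
```

So 93.6%, which rounds to the 94% quoted.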

That last number deserves to sit for a second. Cloudflare, one of the companies most aggressively adopting MCP across their entire enterprise, had to build a system that essentially replaces MCP's tool discovery mechanism because the original approach would literally overflow the context window. With one server.

The MCP spec team acknowledged context overload as the most frequent community concern in their tool filtering proposal. Quality decreases rapidly after around 10 tools, a count that most production setups far exceed.

Lazy-Loading Is the Fix

Not just a theoretical issue. I'm seeing lazy-loading work in multiple production environments, each implementing it slightly differently.

Cloudflare's Code Mode turns the agent into its own tool browser. Give it a search function, give it an execute function, and let it figure out which tools matter for the job at hand. The context cost for exploring MCP servers stays the same regardless of how many servers are connected.
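The shape of the two-meta-tool idea can be sketched in plain Python. This is not Cloudflare's implementation (theirs has the agent write JavaScript against the portal); `ToolCatalog` and the two example tools are hypothetical, and the keyword scoring is deliberately naive.

```python
from typing import Any, Callable


class ToolCatalog:
    """Holds every tool, but exposes only `search` and `execute` to the agent."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., Any]]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = (description, fn)

    def search(self, query: str, limit: int = 5) -> list[dict]:
        """Return the best-matching tool names and descriptions for a query."""
        terms = query.lower().split()
        scored = []
        for name, (description, _) in self._tools.items():
            haystack = f"{name} {description}".lower()
            score = sum(term in haystack for term in terms)
            if score:
                scored.append((score, name, description))
        # Highest score first, then name for a stable order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [{"name": n, "description": d} for _, n, d in scored[:limit]]

    def execute(self, name: str, **arguments: Any) -> Any:
        """Invoke a tool by name; unknown names fail loudly."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        _, fn = self._tools[name]
        return fn(**arguments)


catalog = ToolCatalog()
catalog.register("github_list_prs", "List open pull requests for a repo",
                 lambda repo: [f"{repo}#1"])
catalog.register("db_query", "Run a read-only SQL query",
                 lambda sql: [])

hits = catalog.search("pull requests")
result = catalog.execute(hits[0]["name"], repo="acme/api")
```

The agent's context only ever holds the two meta-tool definitions plus whatever `search` returns, so adding a tenth server does not change the startup cost.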

There's also the Skills pattern. Instead of representing all of the tool schemas in detail upfront, agents encode the knowledge needed for a given task in lightweight skill files (typically 200 to 1,500 tokens each) that can be loaded as needed based on intent matching. A skill for browser automation might cost around 2,000 tokens to activate, as opposed to 13,600 tokens to load the full MCP server at startup. GitHub operations drop from 18,000 tokens to maybe 500 or so. Web search goes from 14,100 down to 550.

That's not marginal. That's an order of magnitude.

Arcade's MCP Gateway in LangSmith Fleet takes a third approach by centralizing 7,500+ tools and optimizing the tool descriptions for language models. These tools are not simply API wrappers. They are mapped to actions that agents can perform, with descriptions written specifically for how language models select and call upon them.

Harrison Chase wrote about this from the other side of the spectrum. His continual learning framework identifies three realms where agents improve: model weights, harness code, and context. The context layer is "the most common and most exciting area right now." However, optimizing for context only works if there is room in the context budget to do so. An agent can't learn from its interactions if the space for learning is already completely filled by tool schemas it loaded at boot time.

Lazy-loading turns tool discovery into a governed routing path instead of a context-window tax.

What This Looks Like in Practice

With the current LangChain infrastructure, the eager version of these agents registers all tools when the agent is built:


from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient

MCP_SERVERS = {
    "github": {"transport": "http", "url": "http://localhost:3001/mcp"},
    "browser": {"transport": "http", "url": "http://localhost:3002/mcp"},
    "search": {"transport": "http", "url": "http://localhost:3003/mcp"},
    "database": {"transport": "http", "url": "http://localhost:3004/mcp"},
}

async def build_eager_agent():
    client = MultiServerMCPClient(MCP_SERVERS)
    tools = await client.get_tools()  # all tools, all servers, every session
    return create_agent("claude-sonnet-4-6", tools=tools)

The lazy approach is not a magic discovery tool that mutates the running agent's tool set. The boring version is a router: decide which MCP servers matter for this task, load only those tools, then build the agent for that run.


from langchain.agents import create_agent
from langchain_mcp_adapters.client import MultiServerMCPClient

TOOL_REGISTRY = {
    "github": {
        "transport": "http",
        "url": "http://localhost:3001/mcp",
        "triggers": ["pr", "issue", "repo", "commit", "branch"],
    },
    "browser": {
        "transport": "http",
        "url": "http://localhost:3002/mcp",
        "triggers": ["browse", "click", "navigate", "screenshot", "page"],
    },
    "search": {
        "transport": "http",
        "url": "http://localhost:3003/mcp",
        "triggers": ["search", "find", "look up", "query"],
    },
    "database": {
        "transport": "http",
        "url": "http://localhost:3004/mcp",
        "triggers": ["sql", "query", "table", "database", "records"],
    },
}

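import re

def select_servers(task_description: str) -> dict[str, dict]:
    selected = {}
    task = task_description.lower()

    for name, config in TOOL_REGISTRY.items():
        # Match whole words (plus simple plurals) so "pr" does not fire on "process".
        pattern = "|".join(re.escape(trigger) for trigger in config["triggers"])
        if re.search(rf"\b(?:{pattern})(?:s|es)?\b", task):
            selected[name] = {
                "transport": config["transport"],
                "url": config["url"],
            }

    return selected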

async def run_with_lazy_tools(task_description: str):
    selected_servers = select_servers(task_description)
    if not selected_servers:
        available = ", ".join(TOOL_REGISTRY)
        raise ValueError(f"No matching MCP servers. Available: {available}")

    client = MultiServerMCPClient(selected_servers)
    tools = await client.get_tools()  # only tools from the routed servers
    agent = create_agent("claude-sonnet-4-6", tools=tools)

    return await agent.ainvoke(
        {"messages": [{"role": "user", "content": task_description}]}
    )

The first version I wrote had a terrible context profile because it loaded definitions for every tool on every server. The next version routed first, then loaded only the relevant servers. In a production system with 5 to 10 MCP servers, the savings are tens of thousands of tokens per session.

Holding all of that tool schema in context is expensive. But more importantly, every token of tool schema that sits in context is a token that could be spent on reasoning, conversation history, or user-specific memory. We wrote about why persistent agent memory is critical for production agents. Memory is useless if there isn't room for it.

Shadow MCP Is the Enterprise Problem Nobody Expected

Cloudflare's reference architecture introduces another concept worth paying attention to: Shadow MCP detection. They scan for unauthorized MCP server connections across the organization, monitoring hostnames, URI paths, and even DLP-based body inspection for JSON-RPC method calls like tools/call and initialize.
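The body-inspection half of that idea is simple to sketch. This is a toy check, not Cloudflare's DLP rules: `is_shadow_mcp`, `looks_like_mcp`, and the approved-host set are assumptions of mine, and the method list holds only the two methods named above.

```python
import json

# The two JSON-RPC methods named above; a real rule set would likely cover more.
MCP_METHODS = {"initialize", "tools/call"}


def looks_like_mcp(body: bytes) -> bool:
    """Return True when a request body parses as an MCP-style JSON-RPC call."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    # JSON-RPC allows a single message or a batch (list) of messages.
    messages = payload if isinstance(payload, list) else [payload]
    return any(
        isinstance(message, dict)
        and message.get("jsonrpc") == "2.0"
        and isinstance(message.get("method"), str)
        and message["method"] in MCP_METHODS
        for message in messages
    )


def is_shadow_mcp(host: str, body: bytes, approved_hosts: set[str]) -> bool:
    """Flag MCP traffic that is headed to a host nobody approved."""
    return host not in approved_hosts and looks_like_mcp(body)


call = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                   "params": {"name": "run_sql", "arguments": {}}}).encode()
approved = {"mcp.internal.example.com"}
flagged = is_shadow_mcp("random-laptop.local", call, approved)
```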

MCP has its own shadow IT problem. Developers will sometimes set up their own MCP server, integrate that into their existing agents, and security will never even be aware. This code can execute locally on developer machines, reach out to internal APIs, and bypass security controls. No audit trail, no credential governance, no DLP.

Cloudflare's answer is a monorepo governance model: centralized MCP team, AI governance approval, templates that inherit default-deny write controls and audit logging out of the box. New governed MCP servers deploy in minutes because the governance is baked into the platform, not bolted on after the fact.

I see this pattern constantly with clients. The MCP gold rush has teams spinning up servers faster than security can evaluate them. We wrote about why agent-operable interfaces are the product. The same principle applies to the tools agents use. If an employee can't access a system without approval, the agent shouldn't be able to either.

The Fix Is Architecture, Not Bigger Windows

"Context windows keep getting bigger." They do. And the waste doesn't get smaller.

A million-token window doesn't help if 67,000 tokens of tool schemas still get loaded that the agent won't ever use. The underlying issue is architectural: eager-loading is the wrong pattern for tool discovery in production agents.

Lazy-load tools based on task intent. Gate discovery behind a search mechanism. Keep tool definitions out of the context until the agent actually needs them.

Honeycomb published a set of principles for the AI era that apply here: cost is a system attribute, not an afterthought, and pre-production testing doesn't prepare for the load that comes from real systems in a real environment. Tool context overhead is exactly the kind of emergent cost that only shows up in production, when real agents connect to real MCP servers and the token bills start making people uncomfortable.

The protocol isn't the problem. The eager-loading default is the problem. Own the architecture decision. Lazy-load.

MCP Is Packaging. Agent-Operable Interfaces Are the Product | Focused Labs

Austin Vance — Mon, 04 May 2026 14:25:47 +0000

MCP packages tools, but the real product is the narrow, typed, auditable interface an agent can actually operate.

Austin Vance, CEO of Focused

MCP is not the hard part.

The hard part is designing a system that an agent can use, as opposed to guessing, wandering, or mangling it. The protocol is the distribution rather than the architecture.

This is kind of important. Every enterprise AI conversation I’ve had will, at some point, boil down to this: we have a model, we have a workflow, and we have a tangle of internal tools designed for humans to interact with them through a web interface at human speeds. Then the question becomes “should we make an MCP server to handle all of this?”

Fine. But for what?

The Model Context Protocol makes it easy for applications to expose tools and model context. That’s useful and I'm not opposing MCP. I am opposing the use of the protocol to dress up a useless shortcut as something useful.

Harrison Chase broke down the lock-in problem well: switching model providers is easy, switching harnesses is less so, and model providers want to lock teams in through the harness. The harness is where the agent learns about the actions in an application, the state, the model’s memory, what can be retried, what needs approval, and what telemetry gets written down.

But then there is the interface below the harness, which gets little recognition.

A bad interface can turn an excellent harness into a nightmare. A good interface makes any harness at least fair.

I see why “just build an MCP server” isn’t the entire answer. An MCP server can expose a messy action. It can wrap a sharp one. But deciding which actions exist in the first place is up to the team. And that is a design and experience problem, not an engineering one.

Teams build integrations for internal agents by wrapping existing APIs that were often structured around awkward frontend decisions, like returning an object with an object with an object inside of it. An endpoint might update state as a side effect because it was built for an admin screen. The rough edges include error messages written only for humans, implicit permissions, opaque pagination parameters, no support for dry runs, and no idempotency keys. The verb the system needs most is “after policy rules apply, approve this one invoice,” and what the agent actually gets is updateInvoice. Stricter prompts don’t fix that.

Welcome to production.

After reading yet another question about whether a given subsystem has an MCP server, I paused for an instant to ask myself whether I missed something here. We shouldn't be asking "is there an MCP server?" We should ask whether the system in question has handles for the agent that just got invited in.

A handle is a small, typed, boring action, describing what it intends to do with some data. It describes what the data contains, what the operation needs from it, and what it will look like afterward. It fails in a way that the caller can understand. Handle-based operations are easy to test without a full model. Finally, handles leave traces of their prior actions.

Do the new examples reinforce the point? Google’s MCP Toolbox for Databases might sound utterly bland because “database plus MCP” is a magical phrase. But in this case, the interesting new aspect is that databases require controlled, auditable work that can be inspected by the software agent. MathWorks has released an official MATLAB MCP server, which is interesting because the interface to MATLAB’s mature technical environment is vastly more appropriate than a chat window. Browserbase and LangChain are demonstrating Deep Agents with search, fetch, and browser subagents. Again, a cheap, light subagent performs quick retrieval, followed by a heavier browser-based operation if necessary.

I don’t mean that every single thing suddenly becomes an MCP server. I mean that more of the important tools in a business can become something controlled through an agent instead of through a browser tab or terminal command.

There is a difference.

An MCP server is just one package boundary among several, each with its own strengths and weaknesses. An agent-operable interface is a product decision, choosing specific verbs, inputs, outputs, reversible operations, and mandatory human pause actions. A protocol can then move that interface around, but it cannot make the interface good.

MCP moves an interface around. It does not make the verbs worth trusting.

This is the same anti-pattern we saw with APIs. Companies would publish a REST API to tremendous fanfare, convinced that integration problems were now solved. In practice, the nouns and mutations provided by the API would prove inadequate for anything beyond the simplest cases. Docs would sometimes contradict behavior. And while most of the workflow might be automatable, the remaining chunk still required a human being logged into the admin console.

The gap costs more as agents move further into it, since they typically stop short of explicitly stating the ambiguities at the boundary, and instead select tools, insert missing fields, retry operations, and give misleading summaries of the results as if they were progress. Agents do not intend to fail in workflows. Instead, they are given an irregular surface to work on for which they have no clear mandate and for which they must pretend to be competent.

A useful way to think about this is Developing AI Agency. The word “agency” comes with unfortunate connotations of personality, so I try to think about it in terms of the required affordances for any agent: a goal, some tools to pursue it with, memory, feedback, and permission to act. When the tool layer is too vague, the AI ends up with fake agency. It can talk about work and even generate a lot of thoughtful-sounding design language, but it can’t actually do the work.

The current gold rush of building MCPs obfuscates this problem because when people say “server” they think of code and physical hardware. Code and hardware are tangible. There is a repo, a README, and a demo of someone, usually Claude or Cursor, opening up the tool and something happening.

That demo is not the test.

Test whether the interface still behaves when the request is boring, partial, duplicated, late, unauthorized, or wrong. Test whether a reviewer can always reconstruct what happened to an object after the agent touched the handle of the thing. Test whether the action can be replayed in staging without accidentally sending the email to customers. Everybody Tests, even when the thing under test is an agent holding a tool handle.

A useful agent-operable interface has a few properties.

The verbs are narrow. A verb for “create refund request” instead of “update order.” A verb for “draft response” instead of “send message.” A verb for “propose schema migration” instead of “run SQL.” Narrow verbs help by letting the operation name strongly suggest the operation’s intent.

All inputs are provided in a form that the domain expects, not just pure JSON schema for the sake of it. Real domain constraints are used where possible, to reflect the kind of validation that matters in the application. This means providing an account ID that actually exists in the system, a payment amount that has a meaningful currency, and a date and time with timezone rules that have real-world meaning to the user. And when using enums, the validated output should contain meaningful strings, not just values used in the demo.
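To make the point about domain-shaped inputs concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: KNOWN_ACCOUNTS stands in for a real account lookup, and RefundRequest, Currency, and parse_refund_request are illustrative names, not part of any tool framework.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from enum import Enum

# Hypothetical stand-in for a real account store.
KNOWN_ACCOUNTS = {"acct_1001", "acct_1002"}


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


@dataclass(frozen=True)
class RefundRequest:
    """Input for a narrow 'create refund request' verb."""
    account_id: str
    amount: Decimal
    currency: Currency
    requested_at: datetime

    def __post_init__(self) -> None:
        # Domain checks, not just type checks.
        if self.account_id not in KNOWN_ACCOUNTS:
            raise ValueError(f"unknown account: {self.account_id}")
        if self.amount <= 0:
            raise ValueError("amount must be positive")
        if self.requested_at.tzinfo is None:
            raise ValueError("requested_at must be timezone-aware")


def parse_refund_request(raw: dict) -> RefundRequest:
    """Convert loosely typed tool arguments into a validated domain object."""
    return RefundRequest(
        account_id=raw["account_id"],
        amount=Decimal(str(raw["amount"])),
        currency=Currency(raw["currency"]),  # ValueError on unknown codes
        requested_at=datetime.fromisoformat(raw["requested_at"]),
    )
```

The agent still sends plain JSON. The difference is that a bad account, a negative amount, or a timezone-free timestamp fails at the boundary instead of three systems downstream.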

Outputs should be machine-readable and human-readable at the same time. The agent expects certain fields to be populated. A human reviewer wants to read a simple statement of what changed, what didn’t change, and what still needs work.

There’s a dry-run path. A dry run is the cheapest safety mechanism available, and almost nobody shipping generated code tries it first. A dry run turns “can the agent do this?” into “can the agent explain the diff before doing this?” That is where human judgment is better.
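A dry-run path can be as small as one flag. The sketch below is an assumption about shape, not a prescribed API: Order and create_refund_request are hypothetical, and the diff format is whatever your reviewers find readable.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    order_id: str
    status: str = "paid"
    refund_requests: list[str] = field(default_factory=list)


def create_refund_request(order: Order, reason: str, dry_run: bool = True) -> dict:
    """Describe the change first; apply it only when dry_run is False."""
    diff = {
        "order_id": order.order_id,
        "would_change": {"refund_requests": f"append {reason!r}"},
        "would_not_change": ["status"],
    }
    if dry_run:
        return {"applied": False, "diff": diff}
    order.refund_requests.append(reason)
    return {"applied": True, "diff": diff}
```

Defaulting dry_run to True is a deliberate choice here: the agent has to opt in to the mutation, and the preview and the real call return the same diff.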

Interfaces are idempotent to the degree possible. Networks fail, agents retry, and tool calls time out even though the downstream system actually completed the work. If retrying a call to create_refund_request creates a second refund, a second ticket, or a second production deploy, then the interface is not yet ready for an agent.
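One common way to get there is an idempotency key supplied by the caller. This is a sketch under assumptions: RefundService is a hypothetical in-memory stand-in for the downstream system, and a real one would persist the key table durably.

```python
class RefundService:
    """Hypothetical in-memory stand-in for a downstream refund system."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}
        self.refunds_created = 0

    def create_refund_request(self, idempotency_key: str, order_id: str) -> str:
        # A retry with the same key returns the original result
        # instead of creating a second refund.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        self.refunds_created += 1
        refund_id = f"refund_{order_id}_{self.refunds_created}"
        self._by_key[idempotency_key] = refund_id
        return refund_id
```

The agent, or the harness around it, generates the key once per intended action and reuses it on every retry.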

Every interface has contract tests that don’t involve a model. This matters. If every single correctness check has to run an LLM, we have built a slot machine and only looked at the CI badge. The tool’s schema, how it validates, what a dry run looks like, how permissions fail, and what audit records are generated should all be tested by normal software tests. Save the model evals for when there’s a model involved.
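Here is roughly what a model-free contract test looks like, using plain unittest. The tool itself, draft_reply_tool, and the PermissionDenied error are hypothetical; the point is that nothing in the test calls an LLM.

```python
import unittest


class PermissionDenied(Exception):
    """Typed failure so callers can branch on it instead of parsing strings."""


def draft_reply_tool(ticket_id: str, body: str, actor_scopes: set[str]) -> dict:
    """Hypothetical narrow tool: stores a draft and never sends anything."""
    if "drafts:write" not in actor_scopes:
        raise PermissionDenied("drafts:write scope required")
    if not body.strip():
        raise ValueError("body must not be empty")
    return {"ticket_id": ticket_id, "status": "drafted", "sent": False}


class DraftReplyContract(unittest.TestCase):
    def test_never_sends(self):
        result = draft_reply_tool("T-1", "Hello", {"drafts:write"})
        self.assertEqual(result["status"], "drafted")
        self.assertFalse(result["sent"])

    def test_permission_failure_is_typed(self):
        with self.assertRaises(PermissionDenied):
            draft_reply_tool("T-1", "Hello", set())

    def test_rejects_empty_body(self):
        with self.assertRaises(ValueError):
            draft_reply_tool("T-1", "   ", {"drafts:write"})
```

These run in milliseconds on every commit, which is the property you lose the moment a model sits inside the assertion.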

The interface leaves evidence. Not vibes, though it could strive for better ones. Tangible records of who acted, through which agent, under which policy, against which object, with what proposed change, and with what final result. Here I’m talking about connecting observability to governance without inverting into another dashboard cult.
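As a sketch of what "evidence" might look like in practice, the record below carries exactly the fields listed above and serializes to one JSON line. AuditRecord and record_action are hypothetical names, and where the line gets written (a table, a log stream) is left open.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    actor: str             # who acted
    agent: str             # through which agent
    policy: str            # under which policy
    object_ref: str        # against which object
    proposed_change: dict  # what the agent proposed
    result: str            # what actually happened
    recorded_at: str       # UTC timestamp, ISO 8601


def record_action(actor: str, agent: str, policy: str, object_ref: str,
                  proposed_change: dict, result: str) -> str:
    """Serialize one action as a JSON line for an append-only log."""
    record = AuditRecord(
        actor=actor,
        agent=agent,
        policy=policy,
        object_ref=object_ref,
        proposed_change=proposed_change,
        result=result,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```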

A useful handle is a contract the agent cannot creatively reinterpret.

The Google Cloud conversation with Harrison Chase framed harness engineering as the path from demo to production. I think that is right, and I think the next practical step is interface engineering. A harness only makes sense once it has sane interfaces to compose.

This is why abstractions on top of LangChain are useful too. Start with a basic agent primitive, then a graph, and finally a Deep Agent that can even use browser subagents and human interruption. Every level of abstraction still ultimately bottoms out at a tool call, which either corresponds to a clean domain operation or a tangled mess of code that happens to work on the backend.

In practice, Multi-Agent Orchestration in LangGraph is only half the story. The other half is whether the interface lets the worker do anything worth trusting.

It’s getting said out loud in the community now: “Stop building MCP servers. Build CLIs that agents can use”. I don’t care whether the end result is a CLI, an OpenAPI endpoint, an MCP tool, a database management procedure, an internal command bus, or some other boring thing, as long as it is observable, testable, and readable by others.

Interesting new projects are emerging around this idea too. agent-install treats agent capabilities as installable surfaces across coding agents. loadam turns OpenAPI specs into tests, MCP output, and drift reports. freeCodeCamp’s LangGraph, MCP, and A2A guide also illustrates the progress from single-agent demos to more structured systems with protocols between them.

Good. Just make the distinction between what the protocol diagram shows and what the system can actually do.

The work is deciding what actions the agent can take within Salesforce, Jira, GitHub, Postgres, SAP, Stripe, and the lingering internal admin app that is totally going to get replaced tomorrow. Deleting broad verbs is the new favorite hobby. Adding dry runs is straightforward. Making failures typed is tedious. Writing tests for contracts before a single model sees the tool is boring.

Boring is the point.

Stop Eager-Loading MCP Tools Into the Context Window. A giant pile of tools is not capability. It is usually confusion with a larger token bill. Agents need fewer, sharper handles to their tools, and tool catalogs should feel more like a well-designed command line than a junk drawer with JSON schemas bolted on.

Agent-operable interfaces should be treated as part of product architecture, not just sweeping up integration bits and pieces that product teams don’t want anymore. Enterprise teams should own the verbs the same way they own the database schema. Version them. Deprecate them. Test them and document the failure modes. Have review for dangerous actions. Make the interface boring enough that the agent has no creative wiggle room around the important bits.

MCP will help distribute interfaces. Harnesses will help compose them. Models will get better at calling them.

Companies will not win by having the most MCP-capable servers. They will win by having the cleanest handles in their systems.

Your Customer Service Bot Is Slow Because It's Single-Threaded

Austin Vance — Thu, 23 Apr 2026 19:16:24 +0000

Consider a typical enterprise support agent. A customer asks a complex compliance question and the agent dutifully queries the knowledge base, then searches the web, then checks policy docs. Sequential. Three LLM calls back to back. That's ~12 seconds of wall time.

Users start abandoning chat around 8 seconds.

Fan out those three research calls in parallel, same calls, same models, same prompts, and wall time drops to ~6.5 seconds.

This post covers the parallel sub-agent pattern using LangGraph and LangSmith. I'll show the code, but more importantly, I'll show you the failure modes because the pattern is simple and the bugs are not.

The Latency Math

You have an agent that needs to hit three sources: internal KB, web search, and policy documents. Each LLM call takes 2–4 seconds. Sequentially:

Step	Latency
Classify query	~1s
Research KB	~3s
Research Web	~3.5s
Research Policy	~2.5s
Synthesize	~2s
Total	~12s

In parallel, the three research steps overlap:

Step	Latency
Classify query	~1s
Research (all three, parallel)	~3.5s
Synthesize	~2s
Total	~6.5s

A reduction of about 46% from a structural change, not a prompt improvement. Every additional sub-agent you add sequentially costs another 2–4 seconds. In parallel, an extra sub-agent adds no wall time unless it becomes the slowest branch.
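The arithmetic behind the two tables is just a sum versus a max. A quick check, using the approximate per-step latencies from the tables above:

```python
# Approximate per-step latencies in seconds, taken from the tables above.
steps = {"classify": 1.0, "kb": 3.0, "web": 3.5, "policy": 2.5, "synthesize": 2.0}
research = ["kb", "web", "policy"]

# Sequential: every step waits for the previous one.
sequential = sum(steps.values())

# Parallel: the research phase costs only as much as its slowest branch.
parallel = (
    steps["classify"]
    + max(steps[name] for name in research)
    + steps["synthesize"]
)

reduction = 1 - parallel / sequential
print(f"sequential={sequential:.1f}s parallel={parallel:.1f}s reduction={reduction:.1%}")
```

With these numbers the reduction works out to about 45.8%. Adding a fourth research branch changes the sequential total but leaves the parallel total alone unless the new branch is slower than 3.5 seconds.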

The Parallel Agents Architecture

We're building a research assistant that fans out to three parallel sub-agents, aggregates results, and synthesizes a response:

                     ┌→ [Research: KB]     ─┐
[Classify Query] ────┼→ [Research: Web]    ─┼→ [Synthesize] → END
                     └→ [Research: Policy] ─┘

LangGraph executes parallel branches in a superstep: all three branches run concurrently, and state updates are transactional. The fan-in edge waits for all branches before proceeding.

On the Send API: LangGraph has a Send API for dynamic map-reduce where branch count is unknown at build time. Don't reach for it here. Send is designed for running the same node N times with different inputs. For a fixed set of specialist agents, static edges or conditional routing are simpler, preserve graph structure, and keep every branch visible at compile time via graph.get_graph().draw_mermaid(). In practice, you'll rarely need Send. Start with static fan-out, graduate to conditional, reach for Send as a last resort.

State: The One Thing You'll Get Wrong

The Annotated[list, operator.add] reducer tells LangGraph to concatenate results from parallel branches instead of overwriting them. Without it, parallel branches race to write the results field. The last branch to finish wins, and you silently lose the other two. This is one of the most common bugs in parallel agent systems. The synthesizer produces suspiciously narrow responses, coverage evals fail intermittently, and you spend two days blaming the prompt before realizing you're only getting one source's data.
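To see why the reducer matters, here is a plain-Python model of the merge step. It does not use LangGraph at all; merge is a hypothetical helper that folds branch updates into one value, with and without a reducer. One caveat I'm fairly sure of but have not pinned to a version: some LangGraph releases raise an InvalidUpdateError for concurrent writes to a key with no reducer rather than overwriting silently. Either way, the reducer is the fix.

```python
import operator

# What three parallel branches would each return for the same state key.
branch_updates = [
    {"research_results": [{"source": "knowledge_base"}]},
    {"research_results": [{"source": "web_search"}]},
    {"research_results": [{"source": "policy"}]},
]


def merge(updates, reducer=None):
    """Fold branch updates into one value, mimicking a per-key reducer."""
    value = []
    for update in updates:
        new = update["research_results"]
        # Without a reducer the last write wins; with one, values combine.
        value = reducer(value, new) if reducer else new
    return value


overwritten = merge(branch_updates)
concatenated = merge(branch_updates, operator.add)

print(len(overwritten), len(concatenated))
```

The overwritten version keeps one source and the concatenated version keeps all three, which is the whole bug in two lines.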

The Code

State, a sub-agent factory, and three agent instances. The @traceable decorator ensures each agent appears as a distinct span in LangSmith — this will be the single most important debugging decision you make.

import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research_results: Annotated[list[dict], operator.add]
    final_response: str


def make_agent(name: str, focus: str):
    """Factory that builds a traceable research sub-agent."""

    @traceable(name=name, run_type="chain")
    def node(state: State) -> dict:
        response = llm.invoke([
            SystemMessage(content=f"You are the {name} agent. Focus on {focus}. "
                                  "Return a concise summary. Cite your source type."),
            HumanMessage(content=f"Research query: {state['question']}"),
        ])
        return {"research_results": [{"source": name, "content": response.content}]}

    return node


kb_agent = make_agent("knowledge_base", "internal knowledge base searches.")
web_agent = make_agent("web_search", "recent news and industry trends.")
policy_agent = make_agent("policy", "compliance, legal, and regulatory frameworks.")

The synthesizer merges sub-agent outputs into one customer-facing response. The key constraint, worth knowing before you ship, is that policy information takes precedence. Without this, the synthesizer will cheerfully soften restrictions to sound more helpful.

@traceable(name="Synthesizer", run_type="chain")
def synthesize(state: State) -> dict:
    context = "\n\n".join(
        f"[{r['source']}]: {r['content']}" for r in state["research_results"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Synthesize the following research into a clear, actionable "
                    "response. When policy information conflicts with or constrains "
                    "other responses, the policy statement takes precedence. "
                    "Never soften or omit policy restrictions."
        ),
        HumanMessage(
            content=f"Customer question: {state['question']}\n\n"
                    f"Research findings:\n{context}"
        ),
    ])
    return {"final_response": response.content}

Graph Assembly

Fifteen lines of wiring. Every research node gets a RetryPolicy so a provider 429 doesn't kill the entire pipeline. Successful branches are checkpointed and won't re-execute.

from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

builder = StateGraph(State)

builder.add_node("kb", kb_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("web", web_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("policy", policy_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "kb")
builder.add_edge(START, "web")
builder.add_edge(START, "policy")
builder.add_edge(["kb", "web", "policy"], "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()

Conditional Routing: The Upgrade

Sometimes hitting every source is wasteful. A simple "what's our refund policy?" doesn't need web search. Conditional fan-out lets you route based on the question using structured output, no regex parsing, no brittle string matching:

from collections.abc import Sequence

from pydantic import BaseModel, Field


class RoutingPlan(BaseModel):
    agents: list[str] = Field(
        description="Agents to activate: kb, web, policy"
    )

structured_llm = llm.with_structured_output(RoutingPlan)


def classify_and_route(state: State) -> Sequence[str]:
    plan = structured_llm.invoke([
        SystemMessage(content="Decide which research agents to invoke. "
                              "Available: kb, web, policy. When in doubt, include the agent."),
        HumanMessage(content=state["question"]),
    ])
    return plan.agents or ["kb"]

The tradeoff is real. Conditional routing saves latency on simple queries, but your routing logic becomes a new failure point. And with conditional fan-out, use individual edges from each research node to synthesize, not the list-style fan-in. Otherwise LangGraph waits forever for branches that were never dispatched.
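One cheap guard for that new failure point: the router returns whatever strings the model produced, and a name that doesn't match a node is likely to fail at dispatch time. A minimal sketch, assuming a hypothetical sanitize_plan helper you would call on plan.agents before returning:

```python
# Node names the graph actually defines.
VALID_AGENTS = ("kb", "web", "policy")


def sanitize_plan(agents: list[str], fallback: str = "kb") -> list[str]:
    """Drop unknown or duplicate agent names; fall back to one safe default."""
    kept: list[str] = []
    for name in agents:
        key = name.strip().lower()
        if key in VALID_AGENTS and key not in kept:
            kept.append(key)
    return kept or [fallback]


print(sanitize_plan(["KB", "web", "billing", "web"]))
```

Falling back to kb mirrors the `or ["kb"]` default in the routing function. Log whatever gets dropped, because a router that keeps inventing agent names is telling you something about the prompt.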

Production Failures in Concurrent Execution

These are the failure modes that surface once parallel agents hit real traffic.

State Clobbering. Synthesizer references only one source. Intermittent. Cause: missing operator.add reducer. Parallel branches overwrite instead of appending. There's no warning. The graph runs fine; it just loses data.
Synthesizer Contradicted the Policy Agent. Say a customer asks about returning an opened product. The policy agent correctly stated the 30-day unopened-only return policy. The KB agent mentioned "hassle-free returns." The synthesizer merged these into: "You can return the product within 30 days, hassle-free" omitting the unopened requirement. LangSmith traces showed the policy agent's output was correct; the synthesizer span revealed where the information was lost. Fix: the policy-takes-precedence constraint in the synthesizer prompt.
Hung Branch Blocking Fan-In. Response times spike from ~6s to 30s+. The fan-in waits for ALL branches. Your p50 is fine; your p99 is determined by the slowest branch on its worst day. Fix: async timeouts per branch, return partial results ({"source": "web_search", "content": "Timed out"}) rather than blocking the pipeline.
Orchestrator Under-Dispatched. A significant fraction of multi-domain queries will be only partially routed. Over-dispatching (an agent returning empty results) is cheap. Under-dispatching is a customer getting an incomplete answer. Fix: explicit multi-domain examples in the routing prompt and a "when in doubt, include the agent" instruction.
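The per-branch timeout fix for the hung branch can be sketched with plain asyncio. This is not LangGraph code: with_timeout and fake_branch are hypothetical, and fake_branch stands in for a real async sub-agent call.

```python
import asyncio


async def with_timeout(source: str, coro, seconds: float) -> dict:
    """Run one branch; on timeout return a partial result instead of hanging."""
    try:
        content = await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        content = "Timed out"
    return {"source": source, "content": content}


async def fake_branch(delay: float, text: str) -> str:
    # Hypothetical stand-in for a real sub-agent call.
    await asyncio.sleep(delay)
    return text


async def fan_out() -> list[dict]:
    # The web branch is deliberately slower than its 0.2s budget.
    return await asyncio.gather(
        with_timeout("knowledge_base", fake_branch(0.01, "kb summary"), 0.2),
        with_timeout("web_search", fake_branch(5.0, "web summary"), 0.2),
        with_timeout("policy", fake_branch(0.01, "policy summary"), 0.2),
    )


results = asyncio.run(fan_out())
```

The synthesizer then sees a "Timed out" entry for the slow source and can say so, instead of the whole response waiting on one bad upstream.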

Observability

Parallel agents are hard to debug without tracing. @traceable on every sub-agent gives you per-branch spans in LangSmith. Tag production traces with metadata for filtering:

from langsmith import tracing_context

with tracing_context(
    metadata={"customer_tier": "enterprise", "channel": "chat"},
    tags=["production", "v2"],
):
    result = graph.invoke({"question": "How does GDPR affect our data pipeline?"})

The first thing to check when latency spikes: is one branch consistently slower? LangSmith makes that a 10-second investigation instead of an hour of log-grepping.

Evals

Shipping without evals is negligence. Three evaluators catch the most common regressions: deterministic coverage, structural fan-out validation, and LLM-as-judge for overall quality.

from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="research-agent-evals",
    description="Parallel research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is our refund policy for enterprise clients?"},
        {"question": "How does GDPR affect our data pipeline architecture?"},
        {"question": "What competitors launched AI features last quarter?"},
    ],
    outputs=[
        {"must_mention": ["refund", "enterprise", "policy"]},
        {"must_mention": ["GDPR", "data", "compliance"]},
        {"must_mention": ["competitor", "AI", "feature"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
Customer query: {inputs[question]}
AI response: {outputs[final_response]}

Rate 0.0-1.0 on completeness, accuracy, and tone.
Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
    continuous=True,  # the prompt asks for a 0.0-1.0 score, not a pass/fail boolean
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the synthesizer actually address the question?"""
    text = outputs.get("final_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Is the fan-out actually working, or did it silently degrade?"""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


def target(inputs: dict) -> dict:
    return graph.invoke({"question": inputs["question"]})


results = evaluate(
    target,
    data="research-agent-evals",
    evaluators=[quality_judge, coverage, source_diversity],
    experiment_prefix="parallel-research-v1",
    max_concurrency=4,
)

source_diversity is the only automated check that your parallel architecture is actually parallel. Without it, state clobbering can ship to production and sit there for weeks. Run this eval on every PR that touches agent code.

When to Use This

Use parallel sub-agents when:

Queries regularly span 2+ domains in a single message
You need per-domain traceability for debugging and compliance
Sub-agents have different tool sets or retrieval sources
You're iterating on prompts and need isolated regression testing

Skip it when:

Queries are single-domain (a FAQ bot doesn't need orchestration)
Latency budget is extremely tight (routing adds one LLM call)
You have fewer than 3 distinct knowledge domains

The Bottom Line

Parallel sub-agents aren't architecturally complex: it's a fan-out, a fan-in, and a reducer. The code is about 15 lines of graph wiring. The production hardening is everything else.

Start with static fan-out. Add conditional routing when you have data showing which sources matter for which queries. Write the source_diversity eval before you write the second prompt. And put operator.add on your list fields. You'll thank me later.

Originally published at https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded.

Your AI Just Emailed a Customer Without Permission

Austin Vance — Thu, 23 Apr 2026 19:16:21 +0000

Picture a customer complaint handler for a fintech company. It drafts responses, checks tone, and verifies each response against company policy. Automated from end to end. Then the agent sends a $4,200 refund approval to a customer who had only asked about a fee schedule. The LLM hallucinated the complaint, wrote up a professional apology with a specific dollar amount, and fired it off before anyone on the team even knew.

Better prompts won’t help because the problem isn't what the model says, it's that nothing stops it from saying it.

To fix this you need an approval gate. Somewhere in the agent’s graph where execution... stops. State gets written to disk and a human looks at the draft. Only after they say "yeah, send it" does anything go out the door. LangGraph has a built-in primitive for this called interrupt.

Let's walk through the full pattern here. The code is straightforward but state management can trip you up.

The cost argument (if you need one)

If you're already sold on why AI shouldn't email customers unsupervised, skip this, but if you need to convince your PM, here's some napkin math:

Metric	Without Gate	With Gate
Messages sent/day	~500	~500
Error rate (wrong tone/info)	~3%	~0.1%
Bad messages/day	15	0.5
Avg cost per bad message	$200	$200
Daily risk	$3,000	$100
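The table is three multiplications. If you want to plug in your own numbers, a small sketch (daily_risk is a hypothetical helper, and the rates are the rough estimates from the table):

```python
def daily_risk(messages_per_day: int, error_rate: float, cost_per_bad_message: float) -> float:
    """Expected daily cost of bad messages, in dollars."""
    return messages_per_day * error_rate * cost_per_bad_message


without_gate = daily_risk(500, 0.03, 200)   # ~3% of 500 messages go out wrong
with_gate = daily_risk(500, 0.001, 200)     # ~0.1% slip past a human reviewer
savings = without_gate - with_gate

print(f"without gate: ${without_gate:,.0f}/day, with gate: ${with_gate:,.0f}/day")
```

That is roughly $2,900 a day of avoided risk, before you subtract what the reviewers cost.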

What we’re building

A customer complaint response pipeline. Complaint comes in, AI drafts a response, a human approves or edits, system sends the final version.

[Intake] → [Draft Response] → [INTERRUPT: Human Review] → [Send Response] → END

The interrupt is where execution pauses. All the graph state (draft, original complaint, metadata, etc) gets checkpointed. It could be hours or days before someone reviews it and when they do, the graph will pick up right where it stopped.

Even in serverless environments interrupt is resilient. The Python process can crash. Server can restart. You resume with the same thread_id and LangGraph reloads everything from the checkpointer.

The state schema

Whatever the reviewer needs to see has to be in state before the interrupt fires.

from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    complaint: str
    customer_id: str
    draft_response: str
    review_decision: str
    reviewer_notes: str
    final_response: str

The nodes

Let’s build three nodes, draft, review, send. All with @traceable because six months from now when someone asks "who approved sending that email to the VP of procurement at our biggest account," you want a trace showing what the AI wrote vs. what a person changed.

@traceable(name="draft_response", run_type="chain")
def draft_response(state: State) -> dict:
    response = llm.invoke([
        SystemMessage(
            content="You are a customer service agent. Draft a professional, "
                    "empathetic response to the following complaint. Be specific "
                    "about next steps. Do NOT promise refunds or credits unless "
                    "the complaint clearly warrants one. Keep it under 150 words."
        ),
        HumanMessage(
            content=f"Customer ID: {state['customer_id']}\n\n"
                    f"Complaint: {state['complaint']}"
        ),
    ])
    return {"draft_response": response.content}

The review node is where interrupt() does its work.

from langgraph.types import interrupt

@traceable(name="human_review", run_type="chain")
def human_review(state: State) -> dict:
    decision = interrupt({
        "draft": state["draft_response"],
        "customer_id": state["customer_id"],
        "complaint": state["complaint"],
        "instructions": "Review the draft. Respond with a JSON object: "
                        '{"action": "approve" | "edit" | "reject", '
                        '"edited_response": "...", "notes": "..."}'
    })
    return {
        "review_decision": decision["action"],
        "reviewer_notes": decision.get("notes", ""),
        "final_response": decision.get("edited_response", state["draft_response"])
            if decision["action"] != "reject" else "",
    }

The dict you pass to interrupt() is the payload. It shows up in the __interrupt__ field of the graph's return value, which is what your UI or Slack bot reads to build the review screen. When someone resumes with Command(resume={"action": "approve"}), that dict becomes what interrupt() returns. One subtlety: on resume, LangGraph re-runs the node from its first line rather than continuing mid-function, and this time interrupt() returns the resume value instead of pausing. Keep anything that runs before the interrupt() call free of side effects, or make it idempotent. It looks like a normal function call but there's a checkpoint boundary hiding inside it.
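The resume value is whatever the caller sends, so it is worth validating before it lands in state. A minimal sketch, assuming a hypothetical parse_review_decision helper that is not part of LangGraph; you would call it inside the review node on the value interrupt() returns.

```python
ALLOWED_ACTIONS = {"approve", "edit", "reject"}


def parse_review_decision(decision: object, draft: str) -> dict:
    """Validate a reviewer's resume payload and map it to state updates."""
    if not isinstance(decision, dict):
        raise ValueError("decision must be a JSON object")
    action = decision.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    edited = str(decision.get("edited_response") or "")
    if action == "edit" and not edited.strip():
        raise ValueError("action 'edit' requires a non-empty edited_response")
    # approve sends the draft untouched, edit sends the reviewer's text,
    # reject sends nothing.
    final = {"approve": draft, "edit": edited, "reject": ""}[action]
    return {
        "review_decision": action,
        "reviewer_notes": str(decision.get("notes") or ""),
        "final_response": final,
    }
```

Note one design choice that differs from the node above: here an approve never picks up an edited_response, so a stray field in the payload can't change what goes out.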

Send node. Don't send if it was rejected:

@traceable(name="send_response", run_type="chain")
def send_response(state: State) -> dict:
    if state["review_decision"] == "reject":
        return {"final_response": "[REJECTED] " + state["reviewer_notes"]}
    return {"final_response": state["final_response"]}

Wiring it up

The checkpointer makes interrupts durable. Use InMemorySaver for dev and PostgresSaver for prod. If you forget the checkpointer, interrupt() throws a RuntimeError.

from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("draft", draft_response)
builder.add_node("review", human_review)
builder.add_node("send", send_response)

builder.add_edge(START, "draft")
builder.add_edge("draft", "review")
builder.add_edge("review", "send")
builder.add_edge("send", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)

The full interrupt/resume cycle

Two invoke calls. The first runs until the interrupt and stops. The second picks up where it left off.

from langgraph.types import Command

config = {"configurable": {"thread_id": "complaint-1234"}}

# Phase 1: Run until the interrupt
result = graph.invoke(
    {
        "complaint": "I was charged twice for my subscription last month. "
                     "Order #A-9912. I want a refund immediately.",
        "customer_id": "cust_8837",
    },
    config=config,
)

# The graph paused. Extract the interrupt payload.
interrupt_data = result["__interrupt__"][0].value
print(f"Draft for review: {interrupt_data['draft']}")
print(f"Customer: {interrupt_data['customer_id']}")

# Phase 2: Human reviews and approves (could be minutes or days later)
final_result = graph.invoke(
    Command(resume={
        "action": "edit",
        "edited_response": "We've identified the duplicate charge on Order #A-9912. "
                           "A refund of $29.99 has been initiated and will appear "
                           "in 3-5 business days. We apologize for the inconvenience.",
        "notes": "Verified duplicate charge in billing system. Approved refund.",
    }),
    config=config,  # Same thread_id — this is how LangGraph finds the checkpoint
)

print(f"Final response: {final_result['final_response']}")

That thread_id in the config matters more than anything else here. It's the key into the checkpointer. Without a thread_id you can't resume. Treat it as a primary key and map it to something stable in your system: ticket ID, conversation ID, etc.

Adding risk-based routing

The basic version sends everything through human review. Start there, but eventually reviewers get tired of approving "thanks for contacting us, we're looking into it" all day, and you'll want to auto-approve the low-risk stuff.

from pydantic import BaseModel, Field


class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    reason: str = Field(description="Why this risk level was assigned")


risk_llm = llm.with_structured_output(RiskAssessment)


@traceable(name="assess_risk", run_type="chain")
def assess_risk(state: State) -> dict:
    assessment = risk_llm.invoke([
        SystemMessage(
            content="Assess the risk level of this customer service response. "
                    "high = involves money, legal, account changes, or could "
                    "be interpreted as a binding commitment. "
                    "medium = emotional topic, could escalate. "
                    "low = simple acknowledgment, FAQ, status update."
        ),
        HumanMessage(
            content=f"Complaint: {state['complaint']}\n\n"
                    f"Draft response: {state['draft_response']}"
        ),
    ])
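    update = {"review_decision": assessment.risk_level}
    if assessment.risk_level == "low":
        # Low-risk drafts skip human_review, so promote the draft to
        # final_response here; otherwise send_response has nothing to send.
        update["final_response"] = state["draft_response"]
    return update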


def route_by_risk(state: State) -> str:
    if state["review_decision"] == "low":
        return "send"
    return "review"


builder_v2 = StateGraph(State)

builder_v2.add_node("draft", draft_response)
builder_v2.add_node("assess", assess_risk)
builder_v2.add_node("review", human_review)
builder_v2.add_node("send", send_response)

builder_v2.add_edge(START, "draft")
builder_v2.add_edge("draft", "assess")
builder_v2.add_conditional_edges("assess", route_by_risk, {"send": "send", "review": "review"})
builder_v2.add_edge("review", "send")
builder_v2.add_edge("send", END)

graph_v2 = builder_v2.compile(checkpointer=InMemorySaver())

Fair warning: you've now introduced a second LLM call as a gate, and that gate can be wrong in both directions. Under-classify risk and messages go out without review. Over-classify and reviewers are right back to rubber-stamping everything. Run the classifier in logging-only mode for a couple weeks first (route everything through review, but record what the classifier would have done and use long term memory to tune the classifier). Then start skipping reviews on low-risk messages after you trust the data.

The bugs

The demo works great... but...

Lost thread_id

Someone approves a draft in Slack. The integration pulls out the approval decision but constructs a new thread_id instead of looking up the one stored with the interrupt payload. Now Command(resume=...) creates a fresh graph where the input is an approval decision, not the complaint.

This happens a lot. Store the thread_id alongside the interrupt payload when you surface it to reviewers. Put it in a database. Put it in the Slack message metadata. Do not lose it.
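A sketch of the "put it in a database" option, using an in-memory SQLite table so it runs anywhere. The table name, register_review, and config_for_review are all hypothetical; the review_id would be whatever your review surface hands back, such as a Slack message ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending_reviews ("
    "review_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL)"
)


def register_review(review_id: str, thread_id: str) -> None:
    """Persist the mapping at the moment the interrupt payload is surfaced."""
    with conn:
        conn.execute(
            "INSERT INTO pending_reviews (review_id, thread_id) VALUES (?, ?)",
            (review_id, thread_id),
        )


def config_for_review(review_id: str) -> dict:
    """Look up the original thread_id; never mint a new one on resume."""
    row = conn.execute(
        "SELECT thread_id FROM pending_reviews WHERE review_id = ?",
        (review_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no pending review {review_id!r}")
    return {"configurable": {"thread_id": row[0]}}
```

The returned dict has the same shape as the config used in the interrupt/resume cycle, so the approval handler can pass it straight to graph.invoke alongside Command(resume=...). An unknown review fails loudly instead of quietly starting a fresh thread.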

Stale state

Reviewer opens the draft at 11:30. Goes to lunch. Comes back at 1pm and hits approve. In the meantime, the customer sent two more messages and someone on the support team already replied manually. The approved draft is now responding to a conversation that moved on.

LangGraph has no idea. It resumes from the checkpoint, which is frozen in time. Fix this by putting a created_at timestamp in the interrupt payload and checking it against the customer record's last_updated_at on resume. If anything changed, re-draft.
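The staleness check itself is a single comparison. A minimal sketch, assuming both timestamps are timezone-aware and that needs_redraft is a hypothetical helper you call right before resuming:

```python
from datetime import datetime, timedelta, timezone


def needs_redraft(draft_created_at: datetime, customer_last_updated_at: datetime) -> bool:
    """True when the customer record changed after the draft was created."""
    if draft_created_at.tzinfo is None or customer_last_updated_at.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return customer_last_updated_at > draft_created_at


# The reviewer opened a draft created at 11:30 UTC.
draft_created = datetime(2026, 4, 23, 11, 30, tzinfo=timezone.utc)
# The customer wrote again 40 minutes later, while the reviewer was at lunch.
customer_updated = draft_created + timedelta(minutes=40)

print(needs_redraft(draft_created, customer_updated))
```

When this returns True, skip the resume and start a new draft on the current conversation instead of sending an answer to the old one.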

Double resume

Shared review queue. Two reviewers see the same pending draft. Both click approve. Depending on the checkpointer implementation, the second resume is either a no-op or an error, but by then the send logic already fired on the first one. Maybe that's fine. Maybe you just sent duplicate emails.

Build in idempotency: check whether the thread already has a review_decision before acting on a resume.

Interrupt reordering

Two interrupt() calls in one node (say, one for policy review and one for tone). LangGraph matches resume values to interrupts by position, not by name. There are no names. Refactor and swap the order, the policy answer goes to the tone check and vice versa.

Don't put multiple interrupts in one node. Use separate nodes instead.

Tracing across the gap

Interrupt-based workflows leave a gap in the LangSmith timeline where the human review happened. The draft trace ends, then hours later the resume trace starts, and nothing connects them unless you're deliberate about it.

from langsmith import tracing_context

ticket_id = "TICKET-4821"
config = {"configurable": {"thread_id": ticket_id}}

# Phase 1: Draft
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "draft"},
    tags=["production", "complaint-handler", "phase-1"],
):
    result = graph.invoke(
        {
            "complaint": "Your app crashed and I lost 3 hours of work.",
            "customer_id": "cust_2291",
        },
        config=config,
    )

# ... time passes, human reviews ...

# Phase 2: Resume
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "resume", "reviewer": "jane@company.com"},
    tags=["production", "complaint-handler", "phase-2"],
):
    final = graph.invoke(
        Command(resume={"action": "approve", "notes": "Looks good."}),
        config=config,
    )

Put the ticket ID in the metadata for both phases. Now you can filter in LangSmith and see the full lifecycle of a single complaint even though draft and resume were separate invocations. The reviewer field in phase 2 is your audit trail.

Evals

You need to know if drafts are any good before a human ever sees them.

Dataset setup and evaluators live in evals.py in the companion repo:

from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

from complaint_handler import graph

ls_client = Client()

DATASET_NAME = "complaint-handler-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Human-in-the-loop complaint handler evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {
                "complaint": "Charged twice for order #A-1234. Want a refund.",
                "customer_id": "cust_001",
            },
            {
                "complaint": "App crashes every time I open the settings page.",
                "customer_id": "cust_002",
            },
            {
                "complaint": "Your CEO's tweet was offensive. Cancelling my account.",
                "customer_id": "cust_003",
            },
        ],
        outputs=[
            {
                "must_mention": ["refund", "order", "A-1234"],
                "risk": "high",
            },
            {
                "must_mention": ["crash", "settings", "investigating"],
                "risk": "medium",
            },
            {
                "must_mention": ["feedback", "understand", "account"],
                "risk": "high",
            },
        ],
    )

Three evaluators. LLM judge for draft quality, keyword coverage, and a check for unauthorized promises:

DRAFT_QUALITY_PROMPT = """\
Customer complaint: {inputs}
AI draft response: {outputs}

Rate 0.0-1.0 on empathy, accuracy, and professionalism.
Deduct points if the draft promises specific remedies (refunds, credits)
without explicit authorization.
Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

draft_judge = create_llm_as_judge(
    prompt=DRAFT_QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="draft_quality",
    continuous=True,
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the draft actually address the complaint specifics?"""
    text = outputs.get("draft_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def no_unauthorized_promises(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the draft promise refunds or credits without authorization?"""
    text = outputs.get("draft_response", "").lower()
    dangerous_phrases = ["refund has been", "credit has been", "we will refund",
                         "we will credit", "compensation of"]
    violations = sum(1 for p in dangerous_phrases if p in text)
    return {"key": "no_unauthorized_promises", "score": 1.0 if violations == 0 else 0.0}


def target(inputs: dict) -> dict:
    """Run the graph until the interrupt (draft phase only)."""
    config = {"configurable": {"thread_id": f"eval-{inputs['customer_id']}"}}
    result = graph.invoke(inputs, config=config)
    return {"draft_response": result.get("draft_response", "")}

no_unauthorized_promises catches the failure mode from the top of this post. If the draft says "a refund has been initiated" when nobody authorized a refund, it scores zero. Run this eval every time you change the system prompt.

if __name__ == "__main__":
    results = evaluate(
        target,
        data=DATASET_NAME,
        evaluators=[draft_judge, coverage, no_unauthorized_promises],
        experiment_prefix="complaint-handler-v1",
        max_concurrency=4,
    )
    print("\nEvaluation complete. Check LangSmith for results.")

When to Human In The Loop

If AI is writing things that go to customers, you need a gate. Processing refunds, updating account records, anything you can't undo with a quick "sorry about that" email. Regulated industries need the gate plus an audit trail of who approved what.

You don't need this for internal stuff. Summarizing meeting notes, running analysis for a dashboard, generating reports that a human reads.

TL;DR

The two function calls: interrupt() and Command(resume=...). Pause execution, persist state, resume later.

Most of the work is everything around those two calls. Thread IDs getting lost, the world changing during the review gap, two reviewers approving the same draft, traces that need to connect across a timeline gap of hours or days.

Start by routing every response through review. Reviewers will complain. Good. Measure which categories they rubber-stamp, run your evals, and only then start auto-approving the boring stuff.

Originally published at https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission.

Streaming Agent State with LangGraph

Austin Vance — Thu, 23 Apr 2026 19:15:26 +0000

Your research agent takes 9 seconds to answer a question. It fans out to three sources, synthesizes results, returns a polished answer. The user sees a blank screen for all nine of those seconds. By second 5 they've refreshed the page, doubled your API costs, and still seen nothing.

Streaming fixes this. Show the user what the agent is doing while it's doing it: "Searching knowledge base...", "Found 3 results...", "Synthesizing..." and then stream the final answer token by token. Same 9 seconds, but the user sees progress from millisecond 200.

The Perception Math

Identical work, different user experience:

Pattern | Wall time | Time to first byte | Perceived wait
invoke() (no streaming) | 9s | 9s | Broken
stream(stream_mode="updates") | 9s | ~200ms | Working
stream(stream_mode=["updates", "custom", "messages"]) | 9s | ~200ms | Can see what it's doing

What We're Building

A multi-step research agent that streams three types of events to the UI: node-level progress updates, custom status messages from inside nodes, and token-by-token LLM output for the final synthesis.

                          ┌─ stream: "Searching KB..."
[Intake] → [Research KB]  ┤
                          └─ stream: {results: 3}
                                    ↓
                          ┌─ stream: "Analyzing results..."
         → [Synthesize]  ┤
                          └─ stream: tokens... t-o-k-e-n-b-y-t-o-k-e-n
                                    ↓
                                     → END

Three stream modes run simultaneously: updates for graph state changes, custom for application-specific progress events, and messages for LLM token streaming.

The Five Modes

LangGraph exposes five stream modes. You'll use three in practice:

Mode | What it streams | When to use
values | Full state after each superstep | Debugging, state inspection
updates | State delta from each node | Production UIs: lightweight, shows which node ran
messages | LLM tokens + metadata | Chat UIs: token-by-token output
custom | Arbitrary data from get_stream_writer() | Progress bars, status messages, structured events
debug | Everything, including internal execution details | Development only

In production, use ["updates", "custom", "messages"]. values sends the entire state on every step. debug is for development.

The Code

State and two nodes: a research step that emits custom progress events, and a synthesizer that streams its LLM response token by token.

from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.config import get_stream_writer
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research: str
    answer: str

The research node uses get_stream_writer() to push status updates to the client. These show up in the custom stream mode:

@traceable(name="research", run_type="chain")
def research(state: State) -> dict:
    writer = get_stream_writer()

    writer({"step": "research", "status": "starting", "message": "Searching knowledge base..."})

    response = llm.invoke([
        SystemMessage(
            content="You are a research assistant. Search for relevant information "
                    "about the user's question. Return a concise summary of findings."
        ),
        HumanMessage(content=state["question"]),
    ])

    writer({"step": "research", "status": "complete", "message": "Research complete."})

    return {"research": response.content}

The synthesizer uses the LLM normally. LangGraph automatically streams its tokens when messages mode is active:

@traceable(name="synthesize", run_type="chain")
def synthesize(state: State) -> dict:
    writer = get_stream_writer()
    writer({"step": "synthesize", "status": "starting", "message": "Synthesizing answer..."})

    response = llm.invoke([
        SystemMessage(
            content="Synthesize the research into a clear, actionable answer. "
                    "Be concise but thorough."
        ),
        HumanMessage(
            content=f"Question: {state['question']}\n\nResearch:\n{state['research']}"
        ),
    ])

    writer({"step": "synthesize", "status": "complete", "message": "Done."})
    return {"answer": response.content}

Graph Assembly

from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("research", research)
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "research")
builder.add_edge("research", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()

Multi-mode Streaming

A single .stream() call can emit node updates, custom progress events, and LLM tokens simultaneously:

for mode, chunk in graph.stream(
    {"question": "What are the key differences between REST and GraphQL for mobile APIs?"},
    stream_mode=["updates", "custom", "messages"],
):
    if mode == "updates":
        # Node completed — chunk is the state delta
        node_name = list(chunk.keys())[0]
        print(f"[node] {node_name} completed")

    elif mode == "custom":
        # Custom progress event from get_stream_writer()
        print(f"[status] {chunk.get('message', chunk)}")

    elif mode == "messages":
        # LLM token — chunk is a tuple of (message_chunk, metadata)
        message_chunk, metadata = chunk
        if hasattr(message_chunk, "content") and message_chunk.content:
            print(message_chunk.content, end="", flush=True)

Note that the output shape changes with multi-mode. Single mode (stream_mode="updates") yields chunks directly. Multi-mode (stream_mode=["updates", "custom"]) yields (mode, chunk) tuples. Code that works with single mode breaks with multi-mode because the unpacking is different.
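
One more thing about messages mode: it streams tokens from every LLM call in the graph, and the research node calls llm.invoke too. The metadata half of each messages chunk carries a langgraph_node key, so you can keep only the synthesizer's tokens. The sketch below uses hand-built events as a stand-in for real graph.stream(...) output; synth_tokens is a hypothetical helper.

```python
from types import SimpleNamespace


def synth_tokens(events, node="synthesize"):
    """Yield token text only for message chunks emitted inside `node`."""
    for mode, chunk in events:
        if mode != "messages":
            continue
        message_chunk, metadata = chunk
        if metadata.get("langgraph_node") != node:
            continue  # token came from a different node, e.g. research
        content = getattr(message_chunk, "content", "")
        if isinstance(content, str) and content:
            yield content


# Hand-built events standing in for multi-mode graph.stream(...) output.
fake_events = [
    ("custom", {"step": "research", "status": "starting"}),
    ("messages", (SimpleNamespace(content="REST notes"), {"langgraph_node": "research"})),
    ("updates", {"research": {"research": "REST notes"}}),
    ("messages", (SimpleNamespace(content="Use "), {"langgraph_node": "synthesize"})),
    ("messages", (SimpleNamespace(content="GraphQL."), {"langgraph_node": "synthesize"})),
]

answer = "".join(synth_tokens(fake_events))
print(answer)  # Use GraphQL.
```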

Async streaming

For production APIs, use astream with async for:

import asyncio

from langsmith import traceable


@traceable(name="stream_research", run_type="chain")
async def stream_research(question: str):
    chunks = []
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                chunks.append(message_chunk.content)
                yield {"type": "token", "content": message_chunk.content}
        elif mode == "custom":
            yield {"type": "status", "content": chunk}
        elif mode == "updates":
            yield {"type": "node_update", "content": chunk}


async def main():
    async for event in stream_research("How do vector databases work?"):
        if event["type"] == "token":
            print(event["content"], end="", flush=True)
        else:
            print(f"\n[{event['type']}] {event['content']}")

asyncio.run(main())

FastAPI + SSE

The standard production pattern is a FastAPI endpoint that converts graph streams to SSE. SSE is one-directional (server to client), works over HTTP/1.1, and auto-reconnects:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langsmith import traceable

app = FastAPI()


@traceable(name="sse_research_stream", run_type="chain")
async def generate_sse(question: str):
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                data = json.dumps({"type": "token", "content": message_chunk.content})
                yield f"data: {data}\n\n"
        elif mode == "custom":
            data = json.dumps({"type": "status", "content": chunk})
            yield f"data: {data}\n\n"
        elif mode == "updates":
            node_name = list(chunk.keys())[0] if chunk else "unknown"
            data = json.dumps({"type": "node_complete", "node": node_name})
            yield f"data: {data}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/research/stream")
async def stream_endpoint(payload: dict):
    return StreamingResponse(
        generate_sse(payload["question"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )

Set X-Accel-Buffering: no in the response headers and proxy_buffering off in your nginx config. Without these, nginx buffers the entire response before sending it to the client and your streaming pipeline becomes a regular HTTP response.
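
On the consuming side, the frames that generate_sse emits are simple to decode. The sketch below assumes one data: line per event and treats [DONE] as the end marker, which is this endpoint's convention rather than part of SSE itself. parse_sse is a hypothetical helper and the raw string stands in for a real response body.

```python
import json


def parse_sse(lines):
    """Yield decoded JSON payloads from SSE 'data:' lines until [DONE]."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


# Stand-in for the body the endpoint would stream back.
raw = (
    'data: {"type": "status", "content": {"message": "Searching knowledge base..."}}\n\n'
    'data: {"type": "token", "content": "Vector"}\n\n'
    'data: {"type": "token", "content": " search"}\n\n'
    "data: [DONE]\n\n"
    'data: {"type": "token", "content": "ignored"}\n\n'
)

events = list(parse_sse(raw.splitlines()))
tokens = "".join(e["content"] for e in events if e["type"] == "token")
print(tokens)  # Vector search
```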

The Bugs

These break under load.

Reverse proxy buffering

You deploy behind nginx or a cloud load balancer. SSE events arrive at the client in one big batch after the stream completes. Cause: proxy buffering is on by default. Set the X-Accel-Buffering header, disable proxy_buffering in nginx, and check your cloud provider's load balancer settings.

Message chunk ordering

With messages mode, you receive AIMessageChunk objects. The content field is usually a string, but when the model returns tool calls it can be a list of content blocks. Concatenating .content naively then fails or produces garbled output. Check isinstance(message_chunk.content, str) before concatenating and handle tool-call chunks separately.
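
A small normalizer handles both shapes. This is a sketch that assumes Anthropic-style blocks, where text blocks look like {"type": "text", "text": "..."}; other providers may shape blocks differently. chunk_text is a hypothetical helper.

```python
def chunk_text(content) -> str:
    """Return the displayable text from a chunk's content, str or block list."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content or []:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
        # Anything else (tool-call blocks, for example) is skipped here and
        # should be handled by separate tool-call logic.
    return "".join(parts)


mixed = [
    {"type": "text", "text": "Hel"},
    {"type": "tool_use", "name": "search", "input": {}},
    {"type": "text", "text": "lo"},
]
print(chunk_text("plain string"))  # plain string
print(chunk_text(mixed))           # Hello
```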

Backpressure on slow clients

Your agent streams tokens faster than the client can consume them (mobile on 3G, overloaded browser tab). The server-side buffer grows until memory pressure kills the process. Use bounded async queues or configure your ASGI server's per-connection send buffer limits.
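
Here is a minimal sketch of the bounded-queue idea using only asyncio. The producer stands in for the graph.astream loop and the consumer for a slow client; the names and the maxsize of 2 are illustrative assumptions.

```python
import asyncio


async def produce(queue: asyncio.Queue, tokens: list[str]) -> None:
    for token in tokens:
        # put() waits while the queue is full, so a slow consumer slows
        # the producer instead of growing an unbounded buffer.
        await queue.put(token)
    await queue.put(None)  # sentinel: no more tokens


async def consume(queue: asyncio.Queue, delay: float, received: list, depths: list) -> None:
    while True:
        depths.append(queue.qsize())
        token = await queue.get()
        if token is None:
            break
        received.append(token)
        await asyncio.sleep(delay)  # simulate a slow client


async def run_demo(maxsize: int = 2):
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    received: list = []
    depths: list = []
    tokens = [f"t{i}" for i in range(10)]
    await asyncio.gather(
        produce(queue, tokens),
        consume(queue, 0.001, received, depths),
    )
    return received, max(depths)


received, peak_depth = asyncio.run(run_demo())
print(len(received), peak_depth)
```

The tradeoff is that a stalled client now holds the graph run open. Whether to block, drop status events, or disconnect after a timeout depends on the product.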

Mixed single/multi mode unpacking

A developer switches from stream_mode="updates" to stream_mode=["updates", "custom"] and doesn't update the unpacking code. The for chunk in graph.stream(...) loop now yields (mode, chunk) tuples, but the code still treats each item as a dict. Depending on how the item is used, this either raises or quietly passes the wrong data through. Always use multi-mode from the start, even if you only need one mode today.
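
If some call sites still pass a single mode string, a thin adapter can give every consumer the same (mode, chunk) shape. A sketch, with as_mode_chunks as a hypothetical helper and plain lists standing in for graph.stream(...) output:

```python
from typing import Any, Iterable, Iterator, Union


def as_mode_chunks(
    stream: Iterable[Any], stream_mode: Union[str, list[str]]
) -> Iterator[tuple[str, Any]]:
    """Always yield (mode, chunk), whichever way stream_mode was passed."""
    if isinstance(stream_mode, str):
        # Single mode: items are bare chunks, so label them with the mode.
        for chunk in stream:
            yield stream_mode, chunk
    else:
        # List of modes: items are already (mode, chunk) tuples.
        yield from stream


single = list(as_mode_chunks([{"research": {"research": "notes"}}], "updates"))
multi = list(
    as_mode_chunks(
        [("custom", {"message": "Searching..."}), ("updates", {"research": {}})],
        ["updates", "custom"],
    )
)
print(single[0][0], multi[0][0])  # updates custom
```

The adapter branches on the stream_mode argument rather than inspecting each item, because a messages chunk is itself a tuple and would be misread by a type check.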

Observability

Stream-based workflows produce many small events. Tag your traces so you can measure stream performance in LangSmith:

from langsmith import tracing_context

with tracing_context(
    metadata={
        "stream_mode": "multi",
        "client_type": "web",
        "session_id": "sess_12345",
    },
    tags=["production", "streaming", "v1"],
):
    for mode, chunk in graph.stream(
        {"question": "Explain vector similarity search"},
        stream_mode=["updates", "custom", "messages"],
    ):
        pass  # process chunks

The LangSmith trace shows per-node timings. Use this to find nodes that are slow to emit their first token (high time-to-first-byte) vs. nodes that produce tokens slowly (low throughput).
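
The same two numbers can be measured on the client side. The sketch below assumes you record a timestamp (for example from time.monotonic()) when the request starts and again as each token arrives. stream_stats is a hypothetical helper and the timestamps are made up.

```python
def stream_stats(start: float, token_times: list[float]) -> dict:
    """Compute time to first token and token rate from arrival timestamps."""
    if not token_times:
        return {"ttft": None, "tokens_per_sec": 0.0}
    ttft = token_times[0] - start
    duration = token_times[-1] - token_times[0]
    # Rate over the streaming window only, so a slow start doesn't skew it.
    rate = (len(token_times) - 1) / duration if duration > 0 else 0.0
    return {"ttft": ttft, "tokens_per_sec": rate}


slow_start = stream_stats(0.0, [2.0, 2.25, 2.5])     # late first token, fast after
slow_tokens = stream_stats(0.0, [0.25, 1.25, 2.25])  # quick start, slow trickle
print(slow_start)
print(slow_tokens)
```

Both runs finish at about the same time, but they call for different fixes: the first points at work done before the model starts, the second at generation speed.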

Evals

Streaming doesn't change what the agent produces; it changes how the output is delivered. Evals verify that streamed output matches what invoke() would return, and that custom events are emitted correctly.

from langsmith import Client

ls_client = Client()

DATASET_NAME = "streaming-agent-evals"

# Guard creation so re-running the script doesn't fail on an existing dataset.
if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Streaming research agent evaluation dataset",
    )

    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {"question": "What are the tradeoffs between REST and GraphQL?"},
            {"question": "How do vector databases enable semantic search?"},
            {"question": "What is retrieval-augmented generation?"},
        ],
        outputs=[
            {"must_mention": ["REST", "GraphQL", "tradeoff"]},
            {"must_mention": ["vector", "embedding", "similarity"]},
            {"must_mention": ["retrieval", "generation", "context"]},
        ],
    )


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
User question: {inputs[question]}
Agent response: {outputs[answer]}

Rate 0.0-1.0 on completeness, accuracy, and clarity.
Return ONLY: {{"score": <float>, "reasoning": "<explanation>"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Did the response address the key topics?"""
    text = outputs.get("answer", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def stream_completeness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Does streaming produce the same output as invoke?"""
    streamed = outputs.get("answer", "")
    invoked_result = graph.invoke({"question": inputs["question"]})
    invoked = invoked_result.get("answer", "")
    # Exact match is too strict — LLM outputs vary. Check key content overlap.
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return {"key": "stream_completeness", "score": 1.0}
    overlap = len(streamed_words & invoked_words) / len(invoked_words)
    return {"key": "stream_completeness", "score": min(overlap, 1.0)}


def custom_events_emitted(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Were custom status events emitted during streaming?"""
    events = outputs.get("custom_events", [])
    expected_steps = {"research", "synthesize"}
    seen_steps = {e.get("step") for e in events if isinstance(e, dict)}
    coverage_score = len(seen_steps & expected_steps) / len(expected_steps)
    return {"key": "custom_events", "score": coverage_score}


def target(inputs: dict) -> dict:
    custom_events = []
    answer_chunks = []
    for mode, chunk in graph.stream(
        {"question": inputs["question"]},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "custom":
            custom_events.append(chunk)
        elif mode == "messages":
            message_chunk, metadata = chunk
            # messages mode also streams the research node's LLM call, so keep
            # only tokens from the synthesize node when rebuilding the answer.
            if metadata.get("langgraph_node") != "synthesize":
                continue
            if hasattr(message_chunk, "content") and message_chunk.content:
                answer_chunks.append(message_chunk.content)
        elif mode == "updates":
            if "synthesize" in chunk:
                pass  # answer is captured via message chunks

    return {
        "answer": "".join(answer_chunks) if answer_chunks else "",
        "custom_events": custom_events,
    }


results = evaluate(
    target,
    data="streaming-agent-evals",
    evaluators=[quality_judge, coverage, stream_completeness, custom_events_emitted],
    experiment_prefix="streaming-agent-v1",
    max_concurrency=4,
)

stream_completeness verifies that the streaming path produces equivalent output to invoke(). This catches bugs where stream chunking drops content, like an SSE serializer silently truncating chunks that exceed a size limit.
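
To see how the score reacts to dropped content, here is the overlap calculation pulled out on its own and run against a truncated string. word_overlap is a hypothetical name for the same arithmetic the evaluator uses; the sample sentences are made up.

```python
def word_overlap(streamed: str, invoked: str) -> float:
    """Fraction of the invoked answer's unique words that the stream contains."""
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return 1.0
    return min(len(streamed_words & invoked_words) / len(invoked_words), 1.0)


full = "vector databases index embeddings for similarity search"
truncated = "vector databases index"

complete_score = word_overlap(full, full)
truncated_score = word_overlap(truncated, full)
print(complete_score)              # 1.0
print(round(truncated_score, 2))   # 0.43
```

Because the score works on sets of words, it ignores order and repetition, and two separate LLM runs will rarely score a perfect 1.0 even when nothing was dropped. Treat a sharp drop as the signal, not small differences.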

When to Stream

Use streaming for any user-facing agent interaction over 2 seconds, multi-step agents where progress indicators reduce perceived latency, and chat interfaces where token-by-token display is expected.

Skip it for background jobs with no user waiting, when latency is already under a second, and when the output is structured data rather than natural language.

TL;DR

Three modes in production: updates for node transitions, custom for progress events via get_stream_writer(), and messages for token streaming. Combine them with stream_mode=["updates", "custom", "messages"].

Deploy behind FastAPI + SSE with X-Accel-Buffering: no. Watch for reverse proxy buffering, backpressure on slow clients, and the single-to-multi mode unpacking change.

Originally published at https://focused.io/lab/streaming-agent-state-with-langgraph.