<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Austin Vance</title>
    <description>The latest articles on Forem by Austin Vance (@austinbv).</description>
    <link>https://forem.com/austinbv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F305023%2Fc978f899-9fa5-4b1e-9ad9-e3f60313fd65.jpeg</url>
      <title>Forem: Austin Vance</title>
      <link>https://forem.com/austinbv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/austinbv"/>
    <language>en</language>
    <item>
      <title>AI Agent Authentication Starts With Workload Identity | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 13 May 2026 14:55:56 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</link>
      <guid>https://forem.com/focused_dot_io/ai-agent-authentication-starts-with-workload-identity-focused-labs-418</guid>
      <description>&lt;p&gt;AI agent authentication starts when the system can answer which actor is allowed to make a tool call.&lt;/p&gt;

&lt;p&gt;The model can propose the action. The runtime has to attach authority to it.&lt;/p&gt;

&lt;p&gt;Most teams start with the fastest answer: an API key in an environment variable. The agent reaches Salesforce, GitHub, Jira, Snowflake, Stripe, whatever system makes the first useful proof feel real, and everyone moves on.&lt;/p&gt;

&lt;p&gt;That proof matters. It shows the agent can reach the systems where work actually happens. It also hides the first product decision: who is acting when the tool call leaves the runtime?&lt;/p&gt;

&lt;p&gt;The agent gets memory. The agent runs in the background. The agent forks into subagents. The agent retries failed operations. The agent calls tools after the user has walked away. The agent lands in an enterprise workflow where the work has value, the logs have value, and breaking something has a consequence.&lt;/p&gt;

&lt;p&gt;A shared API key starts as configuration. Then it quietly becomes the identity of the agent.&lt;/p&gt;

&lt;p&gt;An ugly place to stumble into by accident.&lt;/p&gt;

&lt;h2&gt;The secret becomes the actor&lt;/h2&gt;

&lt;p&gt;Early security models for agents tend toward good vibes with a bearer token. The prompt gives instructions. The tool schema lists calls. Hard-coded secrets in the runtime decide what actually gets done based on the input, the agent, and whatever authority those secrets carry.&lt;/p&gt;

&lt;p&gt;The secret wins.&lt;/p&gt;

&lt;p&gt;If the same key can read every customer record, submit refunds, update tickets, and write to production data, the agent has all of those powers. Carefulness in the prompt is theater at that point. The tool description can say those powers apply only when appropriate. The audit log will still show one credential able to perform a pile of different tasks.&lt;/p&gt;

&lt;p&gt;There is already a category for this outside agents: &lt;a href="https://owasp.org/www-project-non-human-identities-top-10/" rel="noopener noreferrer"&gt;OWASP's Non-Human Identities Top 10&lt;/a&gt;. Production applications already authenticate as non-human identities. Agents are joining that growing list of stranger workloads: they run differently than normal services but still need access to systems and data.&lt;/p&gt;

&lt;p&gt;The important step for me is naming the agent as a workload, because the architecture gets less magical and more useful.&lt;/p&gt;

&lt;p&gt;Workloads have identities. Workloads can request scoped credentials for those identities. A workload can be denied a credential. A workload can rotate credentials. A workload can leave an audit trail that survives the model, the prompt, and the v2 or v3 abstraction barrier the team is currently working around.&lt;/p&gt;

&lt;p&gt;That is baseline authentication for production AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ke2gp8x404se04fz457.png" alt="A runtime identity boundary showing an agent requesting scoped credentials from an identity broker before calling external systems." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The runtime should issue tool-specific credentials instead of letting the agent carry a shared key everywhere.&lt;/p&gt;

&lt;h2&gt;Workload identity is the boring answer&lt;/h2&gt;

&lt;p&gt;This part is old. Good.&lt;/p&gt;

&lt;p&gt;Kubernetes already considers service accounts to be identities of processes running in Pods, and the current docs describe &lt;a href="https://kubernetes.io/docs/concepts/security/service-accounts/" rel="noopener noreferrer"&gt;short-lived, automatically rotating ServiceAccount tokens&lt;/a&gt; issued through the TokenRequest API. SPIFFE generalizes that into workload identity documents, including &lt;a href="https://spiffe.io/docs/latest/spiffe-about/spiffe-concepts/" rel="noopener noreferrer"&gt;short-lived X.509 and JWT SVIDs&lt;/a&gt; that a workload can use to authenticate itself to other workloads.&lt;/p&gt;
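
&lt;p&gt;As a sketch of what that looks like from code, here is a short-lived token request using the official Kubernetes Python client; the service account name, namespace, and audience below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hedged sketch: request a short-lived ServiceAccount token via the
# TokenRequest API. Names, namespace, and audience are illustrative.
from kubernetes import client, config

config.load_incluster_config()  # authenticate as the Pod's own identity

token_request = client.AuthenticationV1TokenRequest(
    spec=client.V1TokenRequestSpec(
        audiences=["https://broker.internal"],  # hypothetical audience
        expiration_seconds=600,                 # short-lived by construction
    )
)
resp = client.CoreV1Api().create_namespaced_service_account_token(
    name="agent-runtime", namespace="agents", body=token_request
)
print(resp.status.token)  # expires in ten minutes; rotation is free
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;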

&lt;p&gt;Cloud platforms are heading in the same general direction. AWS STS can &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html" rel="noopener noreferrer"&gt;issue temporary security credentials&lt;/a&gt; after a workload has identified itself using OpenID Connect. Google Cloud Workload Identity Federation allows external workloads to &lt;a href="https://cloud.google.com/iam/docs/workload-identity-federation" rel="noopener noreferrer"&gt;access Google Cloud resources without service account keys&lt;/a&gt;. Azure managed identity docs describe workload identities as &lt;a href="https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview" rel="noopener noreferrer"&gt;machine and non-human identities&lt;/a&gt; associated with compute resources.&lt;/p&gt;

&lt;p&gt;The industry knows how to keep long-lived secrets out of the hot path. It just keeps giving agents interfaces that make the old mistake easy.&lt;/p&gt;

&lt;p&gt;A developer writes a tool wrapper. The tool wrapper needs credentials. The fastest way to configure it is to add an API key to an environment variable and add a TODO to remove it later. The TODO gets pushed to production because now the agent answers support tickets, reconciles invoices, or looks at CI.&lt;/p&gt;

&lt;p&gt;I've worked with teams who reviewed the model, tuned prompts, drew diagrams for tool selection, created a few secrets in deploy config, and crossed their fingers that the tool descriptions would shore it all up.&lt;/p&gt;

&lt;p&gt;Prompts and tool descriptions are not enough.&lt;/p&gt;

&lt;h2&gt;Delegation is the missing primitive&lt;/h2&gt;

&lt;p&gt;In many applications, the agent should rarely hold the credential it uses to act.&lt;/p&gt;

&lt;p&gt;Put an identity assertion in the flow. This agent. This tenant. This user context if present. This policy version. This tool request. This approval state. That assertion is exchanged for a credential only when the action needs one.&lt;/p&gt;
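
&lt;p&gt;A minimal sketch of what that assertion can carry; the field names are illustrative, not a standard schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative shape for the identity assertion described above.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IdentityAssertion:
    agent_id: str            # this agent
    run_id: str              # this run
    tenant: str              # this tenant
    user_id: Optional[str]   # this user context, if present
    policy_version: str      # this policy version
    tool_request: str        # this tool request
    approval_state: str      # this approval state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;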

&lt;p&gt;OAuth was designed to support exactly this shape. &lt;a href="https://www.rfc-editor.org/rfc/rfc8693" rel="noopener noreferrer"&gt;RFC 8693 defines token exchange&lt;/a&gt;, describing how one temporary credential can be exchanged for another temporary credential intended for a different context. In the agent case, the model proposes an action, the runtime checks policy, the broker issues a credential for that action and tool context, the call happens, and the credential dies.&lt;/p&gt;
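
&lt;p&gt;A hedged sketch of that exchange against a hypothetical broker endpoint; the grant and token type URNs come from RFC 8693, everything else is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# RFC 8693 token exchange, sketched with the requests library.
# TOKEN_URL, audience, and scope are illustrative assumptions.
import requests

TOKEN_URL = "https://broker.internal/oauth2/token"  # hypothetical broker

def exchange_for_tool_credential(run_assertion: str, audience: str, scope: str) -&amp;gt; str:
    resp = requests.post(TOKEN_URL, data={
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": run_assertion,   # identity assertion for this run
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "audience": audience,             # e.g. the refund API
        "scope": scope,                   # e.g. "refunds:create"
        "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
    })
    resp.raise_for_status()
    return resp.json()["access_token"]    # short-lived; dies with the run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;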

&lt;p&gt;That credential does not expire after a quarter. It does not expire after someone remembers to rotate it. It expires because the system puts expiration in the path.&lt;/p&gt;

&lt;p&gt;That changes the damage pattern. A compromised tool wrapper no longer implies broad access to every downstream system. A prompt injection has to cross approval, run, tenant, and policy boundaries. A subagent that escapes its execution boundary cannot reuse credentials after the run, approval, or tenant context has expired.&lt;/p&gt;

&lt;p&gt;The agent is still useful. It just has to act through a production boundary that understands production concerns.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; are valuable and dangerous at the same time. The valuable integrated agents do not live in a chatbot tab. They integrate with real systems. Once an agent is tied to real systems, authentication becomes product architecture rather than cleanup work hidden in deployment.&lt;/p&gt;

&lt;h2&gt;The runtime owns the identity boundary&lt;/h2&gt;

&lt;p&gt;A model provider should not own this boundary. A prompt should not own this boundary. A tool schema should not own this boundary.&lt;/p&gt;

&lt;p&gt;The runtime owns it because the runtime follows the whole path.&lt;/p&gt;

&lt;p&gt;It connects agent definitions to threads or runs, tenants, and identity information, including the user who initiated the work, whether the work is backgrounded, whether a human approved a risky step, which tool is being called, and which downstream credential is being requested. It can attach those facts to an identity assertion and make a policy decision before any assertion leaves the process.&lt;/p&gt;

&lt;p&gt;That policy decision can be boring and explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The refund tool can request a payment credential for the current tenant.&lt;/li&gt;
&lt;li&gt;A GitHub tool can request a write credential after CI has produced an eval pass.&lt;/li&gt;
&lt;li&gt;The Snowflake tool can request a read credential for one warehouse, one role, and one time window.&lt;/li&gt;
&lt;li&gt;A subagent can run with a delegated identity, but only with fewer capabilities than the parent run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list is not impressive, which is why it is powerful.&lt;/p&gt;
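
&lt;p&gt;A minimal sketch of that policy table; the rule names and context fields are illustrative, and real rules belong in a reviewed, versioned policy engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Boring, explicit policy, sketched with hypothetical names.
def allow_credential(ctx: dict, tool: str, scope: str) -&amp;gt; bool:
    # a delegated identity never exceeds the parent run's capabilities
    if ctx.get("is_subagent") and scope not in ctx.get("parent_scopes", ()):
        return False
    if tool == "refund":
        # payment credential only for the current tenant
        return scope == f"payments:{ctx['tenant']}"
    if tool == "github" and scope == "repo:write":
        # write credential only after CI has produced an eval pass
        return ctx.get("ci_eval_passed", False)
    if tool == "snowflake" and scope.startswith("read:"):
        # one warehouse, one role, one time window
        return ctx.get("within_time_window", False)
    return False  # default deny
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;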

&lt;p&gt;This is also where &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration&lt;/a&gt; gets serious. A supervisor handing work to a subagent creates a delegation relationship along with the task description. The child process needs enough authority to perform the work at hand and no more. The audit log must reflect that chain of trust cleanly or troubleshooting becomes an exercise in futility.&lt;/p&gt;

&lt;p&gt;The worst setup is a swarm of agents all sharing the same service account. Simple enough to get going. Terrible when it comes time to debug an incident. Every action has been performed by the same principal, authenticated with the same key, and observed through the same useless blur.&lt;/p&gt;

&lt;p&gt;The incident has no useful actor. Just a shared key with a long memory and no accountability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qaovc23fundj7hk9akt.png" alt="A token lifecycle showing an agent run creating an identity assertion, exchanging it for a scoped token, calling a tool, writing audit evidence, and expiring the credential." width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Short-lived delegated credentials make the agent run, policy decision, tool call, and audit trail line up.&lt;/p&gt;

&lt;h2&gt;Audit follows identity&lt;/h2&gt;

&lt;p&gt;Agent observability without identity is half a story.&lt;/p&gt;

&lt;p&gt;A trace for the agent step called &lt;code&gt;refund_customer&lt;/code&gt; can include latency, tool arguments, model output, and retries, all visualized in a convenient span tree. Useful. Then someone asks who had authority to issue that refund, and the trace turns into an archaeological excavation.&lt;/p&gt;

&lt;p&gt;The right trace shows the tool call connected to a principal. Not just a service account. A principal with an agent ID, run ID, tenant, user context, policy decision, credential scope, and expiration time.&lt;/p&gt;

&lt;p&gt;This is what allows a team to answer questions after the tool call has done real work.&lt;/p&gt;

&lt;p&gt;Who granted access? What user context did it use? What broker generated the credential? What version of policy allowed it? What downstream resource accepted it? What subagent inherited it? Can that credential be used for something else?&lt;/p&gt;

&lt;p&gt;Those questions determine whether there is a real postmortem or just hand waving about the agent doing something weird.&lt;/p&gt;

&lt;p&gt;The same principle applies to testing. In &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, I argued that every team already tests whether they admit it or not. Agent identity needs that same honesty. If a runtime can create delegated credentials, tests should verify that the boundary holds. A refund agent should fail against the wrong tenant. A code agent should fail when eval gates are red. A research agent should fail when it asks for write access to a system it only reads.&lt;/p&gt;

&lt;p&gt;Not a one-off &lt;code&gt;npx this and that&lt;/code&gt; someone runs by hand. Test it in CI.&lt;/p&gt;
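
&lt;p&gt;A hedged pytest sketch of those boundary tests; the broker client and its &lt;code&gt;request_credential&lt;/code&gt; call are hypothetical, and in real CI the fixture would point at the broker in a test environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pytest

class StubBroker:
    """Hypothetical broker client; replace with the real one in CI."""
    def request_credential(self, **ctx):
        raise PermissionError(f"denied: {ctx}")  # deny-by-default stub

@pytest.fixture
def broker():
    return StubBroker()

def test_refund_agent_fails_against_wrong_tenant(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="refund", tenant="tenant-b",
                                  scope="payments:tenant-a")

def test_code_agent_fails_when_eval_gates_are_red(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="code", scope="repo:write",
                                  ci_eval_passed=False)

def test_research_agent_cannot_escalate_to_write(broker):
    with pytest.raises(PermissionError):
        broker.request_credential(agent="research", scope="warehouse:write")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;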

&lt;h2&gt;Shared keys hide product decisions&lt;/h2&gt;

&lt;p&gt;The fastest credential story hides the decisions that matter most.&lt;/p&gt;

&lt;p&gt;A shared key hides tenancy. It hides user context. It hides the identity of the agent performing an action. It hides which subagent inherited authority. It hides whether approval was granted. It hides whether the action matched the original request. It hides rotation until rotation becomes an outage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html" rel="noopener noreferrer"&gt;OWASP's secrets management guidance recommends dynamic secrets where possible&lt;/a&gt; to reduce credential reuse and limit the damage when credentials leak. Agent systems need the same pressure, with the additional constraint that the credential must represent the run instead of only the application.&lt;/p&gt;

&lt;p&gt;A normal backend service is expected to behave predictably and follow a reliable lifecycle. It accepts requests, implements endpoints, and changes through controlled deployments. An agent runtime for integration automation can select different tools per request, execute work in subagents, retry steps, and continue running after initial user interaction has completed.&lt;/p&gt;

&lt;p&gt;So identity has to be more exact.&lt;/p&gt;

&lt;p&gt;The credential loaned to the agent should assert what it is currently allowed to do. The operating policy should be visible enough to explain why the action was allowed. The audit trail must persist long enough for a human to walk through the events as they happened.&lt;/p&gt;

&lt;p&gt;Getting to a boundary-based platform does not need a full rewrite. Start with one boundary.&lt;/p&gt;

&lt;p&gt;Put an identity broker between the agent runtime and the first high-risk tool. Give the agent runtime a workload identity. Have the broker exchange that identity for a tool credential. Associate the decision with tenant, run, and operation. Record the policy decision in the trace. Add a CI test that proves the wrong tenant fails. Expire the credential quickly. Make the failure visible when the broker returns no.&lt;/p&gt;

&lt;p&gt;Then move the next tool behind the boundary.&lt;/p&gt;

&lt;h2&gt;The production line&lt;/h2&gt;

&lt;p&gt;AI agent authentication is the control plane for non-human actors who do work across systems.&lt;/p&gt;

&lt;p&gt;Ownership matters here. Security cannot retroactively add this after the agent and its resources have shipped. Platform cannot stash it in a vault path. Product cannot reduce it to a consent checkbox. Identity, delegation, expiration, and audit have to be inherent in the runtime of the agent and how it executes.&lt;/p&gt;

&lt;p&gt;The agent should actually be able to act. That is, after all, why we are doing &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;AI agency&lt;/a&gt; in the first place. That agency should have a workload identity.&lt;/p&gt;

&lt;p&gt;Production systems have already worked out parts of the problem. Kubernetes, SPIFFE, OAuth token exchange, cloud workload federation, managed identities, dynamic secrets. They exist because static secrets rot and shared principal accounts make bad situations worse.&lt;/p&gt;

&lt;p&gt;It is a mistake to grant agents an exemption because the interface is conversational.&lt;/p&gt;

&lt;p&gt;The model can decide on the next step. The runtime decides whether that step gets a credential.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Agentic AI Architecture Needs Model Routing</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Fri, 08 May 2026 01:57:35 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</link>
      <guid>https://forem.com/focused_dot_io/agentic-ai-architecture-needs-model-routing-1e1k</guid>
      <description>&lt;p&gt;Agentic AI architecture is stuck on model loyalty.&lt;/p&gt;

&lt;p&gt;The same graph. The same provider. One giant model doing every job because one graph is easier to defend than a routing policy.&lt;/p&gt;

&lt;p&gt;I get why people want to pick one model: it makes demos, evaluation, and procurement easier, and sometimes makes debugging only slightly worse. Every agent call looks the same, every trace looks the same, and the team can blame one provider instead of four.&lt;/p&gt;

&lt;p&gt;Fine. But production agents do not do one kind of work.&lt;/p&gt;

&lt;p&gt;Classify intent. Search. Summarize. Write code. Choose a tool. Check if a tool's result smells wrong. Write a customer-facing answer when something failed. Decide whether approval is required. Wait for something to happen. Retry something that failed. Recover from something gone wrong.&lt;/p&gt;

&lt;p&gt;Production agents run a pile of distinct workloads.&lt;/p&gt;

&lt;p&gt;Harrison Chase notes that &lt;a href="https://x.com/hwchase17/status/2051745855812882576" rel="noopener noreferrer"&gt;LLMs are getting expensive, and open source models matter for that reason&lt;/a&gt;. LangChain is pushing the same direction from a product perspective, noting that &lt;a href="https://x.com/LangChain/status/2051367244060598312" rel="noopener noreferrer"&gt;Fleet agents no longer have to be constrained by a single model and can instead use multi-model support&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Those are the same production reality arriving through two doors.&lt;/p&gt;

&lt;p&gt;The agent architecture must determine which model should perform which work.&lt;/p&gt;

&lt;h2&gt;The Same Model Everywhere Is an Architecture Smell&lt;/h2&gt;

&lt;p&gt;The surprising part is how normal this looks. Many current agent stacks treat model selection as just another environment config parameter, equivalent to a batch size. Set &lt;code&gt;MODEL=claude-whatever&lt;/code&gt; or &lt;code&gt;MODEL=gpt-whatever&lt;/code&gt; and deploy the agent.&lt;/p&gt;

&lt;p&gt;That's fine for a chatbot, but lazy for an agent.&lt;/p&gt;

&lt;p&gt;Agents introduce variance internally. What looks simple to a user becomes retrieval, planning, transformation, checking, execution, generation and scheduling inside the system. Some of these steps need to be deep, some fast, some cheap. Some need a model that is good at generating code, others an open-weight model because the data cannot legally leave the boundary, or because it is simply too expensive to move around the company.&lt;/p&gt;

&lt;p&gt;Using the same frontier model across the board is comforting. It also conceals the waste.&lt;/p&gt;

&lt;p&gt;Instead of one glaring failure, I see slow, expensive, bureaucratic agents in production. A team looks at the dashboard. Cost rises, latency rises, and people say the model is too expensive or the prompts are too long. The real problem is that the architecture is linear and every step goes to one place.&lt;/p&gt;

&lt;p&gt;What gets under my skin is the compute monolith. Everywhere else we have learned to separate compute classes properly (queues are not databases, lambdas are not batch workers, CDNs are not origin servers). Then some clever agent comes along and suddenly every cognitive function has to go through the biggest model in the account.&lt;/p&gt;

&lt;p&gt;Come on.&lt;/p&gt;

&lt;h2&gt;Routing Has to Do More Than Fallbacks&lt;/h2&gt;

&lt;p&gt;Model routing usually enters the conversation through reliability. If OpenAI is down, try Anthropic. If a deployment is overloaded, try another one. If a provider rate-limits, retry somewhere else.&lt;/p&gt;

&lt;p&gt;This is important. &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;LiteLLM's router docs&lt;/a&gt; explain load balancing, cooldowns, fallbacks, timeouts, retries, and Redis-based production rate limiting. &lt;a href="https://openrouter.ai/docs/guides/routing/provider-selection" rel="noopener noreferrer"&gt;OpenRouter's provider routing docs&lt;/a&gt; explain provider ordering, fallbacks, performance, price, and data policy constraints. Boring infrastructure at its best.&lt;/p&gt;
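
&lt;p&gt;A sketch of that reliability layer based on LiteLLM's router docs; the deployment names and model strings here are illustrative placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fallback routing sketched with LiteLLM's Router; names illustrative.
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "gpt-whatever"}},
        {"model_name": "backup", "litellm_params": {"model": "claude-whatever"}},
    ],
    fallbacks=[{"primary": ["backup"]}],  # if primary fails, retry on backup
    num_retries=2,
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "classify this ticket"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;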

&lt;p&gt;But routing cannot stop at uptime.&lt;/p&gt;

&lt;p&gt;In a production agent workflow, the router should understand why a task exists. It should see the agent step, the tool context, the risk, latency budget, data boundary and previous run quality. Then it can pick the appropriate model class for the work at hand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9wvfq4fqt38sx7vkh4p8.png" alt="Architecture diagram showing an agent graph sending a typed task into a model router with a router policy that chooses among fast, reasoning, code, and open-weight models, with telemetry and evaluation feedback returning to the policy." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The router belongs in production architecture, where policy can be tested.&lt;/p&gt;

&lt;p&gt;This is where things get more interesting for agentic AI architecture, compared to just building an LLM app. The router turns the agent’s internal structure into an execution policy.&lt;/p&gt;

&lt;p&gt;A planner step can go to a reasoning model. A normalization step can go to a fast model. A code-editing subagent can go to a model tuned for code. A bulk summarization step can go to an open-weight model. A regulated data step can stay inside the boundary. A customer-facing final answer can take the slower path, because that is where quality matters.&lt;/p&gt;

&lt;p&gt;The pattern is already familiar, which is the point. It has the same shape as &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but I like it better down at this level. The graph determines what work exists, and the router determines which model class should process that work.&lt;/p&gt;

&lt;h2&gt;The Router Needs Typed Work&lt;/h2&gt;

&lt;p&gt;Prompt-based routing is where it all goes wrong.&lt;/p&gt;

&lt;p&gt;A team adds "Use the cheaper model when the task is simple" to the prompt. The agent is amiable, but it ignores the team's intent at exactly the wrong time. The model guesses, or routes based on whatever words happen to match the current prompt. The result is a vibe with a model attached.&lt;/p&gt;

&lt;p&gt;The router needs typed work.&lt;/p&gt;

&lt;p&gt;My ideal is for the agent to report task metadata &lt;em&gt;before&lt;/em&gt; the model call occurs: task kind, expected output shape, sensitivity of input data, allowed tools, user-facing risk, latency/cost budgets, required capability, and retry posture. I do not need a full taxonomy to start. Most teams can begin with something tiny: &lt;code&gt;classify&lt;/code&gt;, &lt;code&gt;retrieve&lt;/code&gt;, &lt;code&gt;reason&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;act&lt;/code&gt;. The key is moving model choice from prose to runtime.&lt;/p&gt;
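
&lt;p&gt;A minimal sketch of that metadata as a typed structure; field names are illustrative, not a standard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Task metadata reported before the model call. Illustrative fields.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    kind: str                  # "classify" | "retrieve" | "reason" | "write" | "code" | "act"
    output_shape: str          # e.g. "json", "patch", "prose"
    sensitive: bool = False    # data cannot leave the boundary
    user_facing: bool = False  # quality beats latency
    latency_budget_ms: int = 2000
    max_retries: int = 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;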

&lt;p&gt;This is a lesson already learned elsewhere in agent architecture. In &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;, explicit mechanisms for planning, tools, memory, and verification beat one giant prompt pretending to be architecture. Model selection is another version of this.&lt;/p&gt;

&lt;p&gt;The router can start dumb and be a simple lookup table driven by task type. It can be configured to dispatch to the code model for code tasks, the fast model for low-risk summaries, the local model for sensitive data, and the quality model for final text written for specific customers. First, ship that. Verify that it works. Then gradually become less dumb and add more nuance to the router.&lt;/p&gt;
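
&lt;p&gt;The dumb version really can be a table. A sketch, with placeholder model names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The "start dumb" router: a lookup table from task kind to model
# class. Model names are placeholders, not recommendations.
ROUTES = {
    "classify": "fast-model",
    "retrieve": "fast-model",
    "reason":   "reasoning-model",
    "write":    "quality-model",
    "code":     "code-model",
    "act":      "reasoning-model",
}

def route(task_kind: str, sensitive: bool = False) -&amp;gt; str:
    if sensitive:
        return "local-open-weight-model"  # data stays inside the boundary
    return ROUTES.get(task_kind, "reasoning-model")  # safe default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;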

&lt;p&gt;The first mistake is expecting the team to find the single best router before shipping anything. The second mistake is letting the model design the router policy inside the same prompt it is supposed to execute.&lt;/p&gt;

&lt;h2&gt;Observability Makes Routing Honest&lt;/h2&gt;

&lt;p&gt;A router that does not publish telemetry data becomes an additional place where opinions get hidden.&lt;/p&gt;

&lt;p&gt;An engineer's affection for a particular design, the score of a benchmark, and the features listed on a vendor's web page are all useful, but ultimately insufficient. The only relevant test is whether the routing rule improves the production agent's performance on the tasks it actually faces.&lt;/p&gt;

&lt;p&gt;This means we need to consider cost, latency, error rate, retry rate, approval rate, human correction rate and eval score when deciding the routing for a request. So these statistics need to attach to the routing decision itself, not just to the trace.&lt;/p&gt;

&lt;p&gt;LangSmith's platform language is already pointing in this direction. It treats traces as the record of an agent’s actions and reasoning, and says teams should monitor &lt;a href="https://www.langchain.com/langsmith-platform" rel="noopener noreferrer"&gt;cost, latency, errors, and qualitative online evals&lt;/a&gt;. Fleet's product page puts &lt;a href="https://www.langchain.com/langsmith/fleet" rel="noopener noreferrer"&gt;model choice next to admin controls, observability, approvals, MCP connections, and export via APIs&lt;/a&gt;. This is the signal.&lt;/p&gt;

&lt;p&gt;Model selection has moved from dropdown aesthetics into operational control. It affects the performance of a wide array of business processes.&lt;/p&gt;

&lt;p&gt;Once routing is visible, the discussion shifts. The team can stop arguing over which model is best and start figuring out which route failed: fast model for tool argument generation, reasoning model for eval lift, open-weight model for internal summarization, code model for patch generation.&lt;/p&gt;

&lt;p&gt;Those are engineering questions.&lt;/p&gt;

&lt;p&gt;The answers need to inform the router policy, or else the agent keeps making yesterday's decisions with today's realities.&lt;/p&gt;

&lt;h2&gt;Open-Weight Models Are Part of the Architecture&lt;/h2&gt;

&lt;p&gt;The open-model conversation is often deeply ideological. People tend to think in terms of closed models versus open models, frontier quality versus control, benchmarks, and vibes.&lt;/p&gt;

&lt;p&gt;Production is less dramatic.&lt;/p&gt;

&lt;p&gt;Open-weight models give teams another execution path. They are useful when the task is bounded, when the data boundary matters, when throughput matters, when the cost curve gets ugly, or when the model only needs to be good enough for an internal step the user never sees.&lt;/p&gt;

&lt;p&gt;Having a frontier model connected does not mean every call should route through it. That misconception is common. Routing is what makes the difference.&lt;/p&gt;

&lt;p&gt;A team can still use a frontier model for the high-risk reasoning step. And yes, the final answer can still go through a strong hosted model. But the retrieval cleanup, first-pass summarization, metadata extraction, and internal critique may not automatically deserve the same spend.&lt;/p&gt;

&lt;p&gt;There is no best model for this problem. The more useful question is: Which model owns this step under these constraints?&lt;/p&gt;

&lt;p&gt;Interface portability matters for the same reason. LangChain says &lt;a href="https://x.com/LangChain/status/2051715028567437359" rel="noopener noreferrer"&gt;Deep Agents ships with ACP so the same harness can run across multiple interfaces&lt;/a&gt;. The &lt;a href="https://docs.langchain.com/oss/python/deepagents/cli/overview" rel="noopener noreferrer"&gt;Deep Agents CLI docs&lt;/a&gt; show a coding agent with provider credentials, model switching, tools, memory, skills, MCP tools, and LangSmith tracing. The interface can change. The harness can change. The routing policy has to be portable across both.&lt;/p&gt;

&lt;p&gt;Model choice that lives in a UI dropdown is prone to drift. Model choice that lives in the agent runtime can be tested, traced, reviewed and rolled back.&lt;/p&gt;

&lt;h2&gt;Own the Decision Boundary&lt;/h2&gt;

&lt;p&gt;The old agent stack revolved around a model call. The next one revolves around a decision boundary.&lt;/p&gt;

&lt;p&gt;That boundary decides which work deserves which model, which provider, which data path, how many retries to attempt, what approval loop to operate in, and which evaluation loop to use. Less glamorous than a chart, to be sure, but more relevant to production workflows. Most production architecture is less glamorous than the thing that sells the demo.&lt;/p&gt;

&lt;p&gt;The teams that get this right won’t talk about having one “agent model”. They’ll talk about routes: Fast route. Deep route. Code route. Local route. Human-review route. And for each route, they’ll know when to use it, how much it costs, how often it fails, and whether the next release made it better.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://focused.io/lab/2026-year-of-the-integrated-agent" rel="noopener noreferrer"&gt;integrated agents&lt;/a&gt; become useful. The agent owns execution decisions instead of wrapping a model call in a little workflow theater.&lt;/p&gt;

&lt;p&gt;The code that matters controls the router, the telemetry and the eval loop.&lt;/p&gt;

&lt;p&gt;The model will keep changing. The decision boundary should belong to the team shipping the agent.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Stop Eager-Loading MCP Tools Into the Context Window</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 05 May 2026 20:31:01 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</link>
      <guid>https://forem.com/focused_dot_io/stop-eager-loading-mcp-tools-into-the-context-window-3mjl</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP servers should not eagerly load every tool schema into an agent's context window. Lazy-load tools by intent, then govern and audit execution.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think the problem with the current state of MCP is way deeper than just resizing the context window.&lt;/p&gt;

&lt;p&gt;The protocol itself is decent: tool discovery and schema negotiation work well, and the JSON-RPC architecture feels solid and well engineered. However, the default behavior of populating the agent's context at session start with every tool definition from every connected server makes running production agents virtually impossible.&lt;/p&gt;

&lt;p&gt;One developer &lt;a href="https://joshowens.dev/mcps-are-dead/" rel="noopener noreferrer"&gt;measured 67,300 tokens consumed&lt;/a&gt; before typing a single question. Seven MCP servers. Tool schemas alone ate up a third of the available context. Another measured 81,986 tokens. &lt;/p&gt;

&lt;h2&gt;The Eager-Loading Tax&lt;/h2&gt;

&lt;p&gt;When an agent starts a session with MCP servers connected, it downloads the full library of tools from every server, every session, and never filters down to just the tools needed for the job at hand.&lt;/p&gt;

&lt;p&gt;My browser automation server is loading 21 tool definitions. A GitHub server loads 27. My web search server bundles 8 providers behind 20 tools. I've not sent a single message yet and I'm already consuming significant context.&lt;/p&gt;

&lt;p&gt;The numbers from &lt;a href="https://arxiv.org/abs/2602.14878" rel="noopener noreferrer"&gt;a study of 856 tools across 103 MCP servers&lt;/a&gt; make this worse than it sounds. Fully augmented MCP tool descriptions add 67% more execution steps for a 5.85 percentage point accuracy gain. The tool definitions don't just eat context. They also slow agents down when they actually use the tools.&lt;/p&gt;

&lt;p&gt;We wrote about &lt;a href="https://focused.io/lab/evaluation-pipelines-for-langgraph-agents" rel="noopener noreferrer"&gt;evaluation pipelines for production agents&lt;/a&gt;. One failure mode of context pollution from tool definitions that I never see anyone mention is the agent becoming less effective over time. It doesn't die or crash or throw an error. The real conversation history that can be kept in the working window just gets pushed out by the tool schemas.&lt;/p&gt;

&lt;p&gt;Even with child agents the context budget gets severely curtailed. Each child agent inherits the MCP configuration. That's new context I guess, but the immediate loss of tens of thousands of tokens to render tool schemas for subagents that may not even use them is completely antithetical to the point of using subagents in the first place: focused context. We covered the architecture patterns for &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;multi-agent orchestration in LangGraph&lt;/a&gt;, but even great orchestration can't fix a context budget that's already half spent before the first tool call.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsght88sk9728j25u1gi.png" alt="Split comparison of eager MCP tool loading versus lazy tool discovery preserving the context window." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The waste is architectural: eager loading spends the context budget before the agent starts working.&lt;/p&gt;

&lt;h2&gt;Cloudflare Just Admitted This Is Broken&lt;/h2&gt;

&lt;p&gt;Cloudflare launched &lt;a href="https://blog.cloudflare.com/welcome-to-agents-week/" rel="noopener noreferrer"&gt;Agents Week&lt;/a&gt; on April 12, and buried in their enterprise MCP reference architecture is an admission that the tool-definition model doesn't scale.&lt;/p&gt;

&lt;p&gt;Their solution is called &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Code Mode&lt;/a&gt;. It condenses all of the individual MCP tools down into two meta-tools: &lt;code&gt;portal_codemode_search&lt;/code&gt; and &lt;code&gt;portal_codemode_execute&lt;/code&gt;. Rather than loading every tool definition into context, the agent writes JavaScript to search for and invoke tools on demand.&lt;/p&gt;
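
&lt;p&gt;This is not Cloudflare's actual implementation, but the shape is easy to sketch: a search function over a tool index, and an execute function that invokes a chosen tool by name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A generic sketch of the two-meta-tool shape, not Cloudflare's code.
def codemode_search(query: str, index: dict[str, str]) -&amp;gt; list[str]:
    """Return names of tools whose description matches the query."""
    q = query.lower()
    return [name for name, desc in index.items() if q in desc.lower()]

def codemode_execute(name: str, args: dict, registry: dict):
    """Invoke a discovered tool by name; only its schema enters context."""
    return registry[name](**args)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;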

&lt;p&gt;The effect: 4 internal MCP servers exposing 52 tools would normally consume 9,400 tokens just for definitions. Code Mode drops that to 600 tokens. A 94% reduction. For Cloudflare's own API, which would consume over 2 million tokens as a traditional MCP server (twice the largest context window available right now), the reduction hits 99.9%.&lt;/p&gt;

&lt;p&gt;That last number deserves to sit for a second. Cloudflare, one of the companies most aggressively adopting MCP across their entire enterprise, had to build a system that essentially replaces MCP's tool discovery mechanism because the original approach would literally overflow the context window. With one server.&lt;/p&gt;

&lt;p&gt;The MCP spec team &lt;a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1300" rel="noopener noreferrer"&gt;acknowledged context overload as the most frequent community concern&lt;/a&gt; in their tool filtering proposal. Quality decreases rapidly after around 10 tools, a threshold most production setups blow past.&lt;/p&gt;

&lt;h2&gt;Lazy-Loading Is the Fix&lt;/h2&gt;

&lt;p&gt;Not just a theoretical issue. I'm seeing lazy-loading work in multiple production environments, each implementing it slightly differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare's Code Mode&lt;/strong&gt; turns the agent into its own tool browser. Give it a search function, give it an execute function, and let it figure out which tools matter for the job at hand. The context cost for exploring MCP servers stays the same regardless of how many servers are connected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's also the Skills pattern.&lt;/strong&gt; Instead of representing all of the tool schemas in detail upfront, agents encode the knowledge needed for a given task in lightweight skill files (typically 200 to 1,500 tokens each) that can be loaded as needed based on intent matching. A skill for browser automation might cost around 2,000 tokens to activate, as opposed to 13,600 tokens to load the full MCP server at startup. GitHub operations drop from 18,000 tokens to maybe 500 or so. Web search goes from 14,100 down to 550.&lt;/p&gt;

&lt;p&gt;That's not marginal. That's an order of magnitude.&lt;/p&gt;
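
&lt;p&gt;A hedged sketch of the Skills pattern; the paths and the keyword intent match are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Load a small skill file into context only when the task matches its
# intent. Paths and the matching rule are illustrative.
from pathlib import Path

SKILLS = {
    "browser": Path("skills/browser-automation.md"),
    "github":  Path("skills/github-operations.md"),
    "search":  Path("skills/web-search.md"),
}

def load_skills_for(task: str) -&amp;gt; str:
    task = task.lower()
    return "\n\n".join(
        path.read_text() for name, path in SKILLS.items() if name in task
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;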

&lt;p&gt;&lt;strong&gt;Arcade's MCP Gateway&lt;/strong&gt; in &lt;a href="https://blog.langchain.com/arcade-dev-tools-now-in-langsmith-fleet/" rel="noopener noreferrer"&gt;LangSmith Fleet&lt;/a&gt; takes a third approach by centralizing 7,500+ tools and optimizing the tool descriptions for language models. These tools are not simply API wrappers. They are mapped to actions that agents can perform, with descriptions written specifically for how language models select and call them.&lt;/p&gt;

&lt;p&gt;Harrison Chase wrote about this from the other side of the spectrum. His &lt;a href="https://blog.langchain.com/continual-learning-for-ai-agents/" rel="noopener noreferrer"&gt;continual learning framework&lt;/a&gt; identifies three realms where agents improve: model weights, harness code, and context. The context layer is "the most common and most exciting area right now." However, optimizing for context only works if there is room in the context budget to do so. An agent can't learn from its interactions if the space for learning is already completely filled by tool schemas it loaded at boot time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59hrnl33bnrdh6d5p6l3.png" alt="Flow diagram showing task intent routing through tool discovery, policy approval, needed tool schemas, agent execution, and audit logging." width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lazy-loading turns tool discovery into a governed routing path instead of a context-window tax.&lt;/p&gt;

&lt;h2&gt;What This Looks Like in Practice&lt;/h2&gt;

&lt;p&gt;With the current LangChain infrastructure, the eager version of these agents registers all tools when the agent is built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_eager_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MCP_SERVERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# all tools, all servers, every session
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lazy approach is not a magic discovery tool that mutates the running agent's tool set. The boring version is a router: decide which MCP servers matter for this task, load only those tools, then build the agent for that run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_mcp_adapters.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MultiServerMCPClient&lt;/span&gt;

&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3001/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3002/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;browse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3003/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;look up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3004/mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;selected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;trigger&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="n"&gt;selected&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transport&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;selected&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_with_lazy_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;selected_servers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;select_servers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No matching MCP servers. Available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiServerMCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selected_servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# only tools from the routed servers
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first version of this feature had a terrible context profile because it loaded definitions for every tool on every server. The next version routed first, then loaded only the relevant servers. In a production system with 5 to 10 MCP servers, that is tens of thousands of tokens saved every session.&lt;/p&gt;

&lt;p&gt;Holding all of that tool schema in context is expensive. But more importantly, every token of tool schema that sits in context is a token that could be spent on reasoning, conversation history, or user-specific memory. We wrote about why &lt;a href="https://focused.io/lab/persistent-agent-memory-in-langgraph" rel="noopener noreferrer"&gt;persistent agent memory&lt;/a&gt; is critical for production agents. Memory is useless if there isn't room for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow MCP Is the Enterprise Problem Nobody Expected
&lt;/h2&gt;

&lt;p&gt;Cloudflare's reference architecture introduces another concept worth paying attention to: &lt;a href="https://blog.cloudflare.com/enterprise-mcp/" rel="noopener noreferrer"&gt;Shadow MCP detection&lt;/a&gt;. They scan for unauthorized MCP server connections across the organization, monitoring hostnames, URI paths, and even DLP-based body inspection for JSON-RPC method calls like &lt;code&gt;tools/call&lt;/code&gt; and &lt;code&gt;initialize&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;MCP has its own shadow IT problem. Developers set up their own MCP servers, wire them into existing agents, and security never finds out. That code can execute locally on developer machines, reach internal APIs, and bypass security controls. No audit trail, no credential governance, no DLP.&lt;/p&gt;

&lt;p&gt;Cloudflare's answer is a monorepo governance model: centralized MCP team, AI governance approval, templates that inherit default-deny write controls and audit logging out of the box. New governed MCP servers deploy in minutes because the governance is baked into the platform, not bolted on after the fact.&lt;/p&gt;

&lt;p&gt;I see this pattern constantly with clients. The MCP gold rush has teams spinning up servers faster than security can evaluate them. We wrote about why &lt;a href="https://focused.io/lab/mcp-is-packaging-agent-operable-interfaces-are-the-product" rel="noopener noreferrer"&gt;agent-operable interfaces are the product&lt;/a&gt;. The same principle applies to the tools agents use. If an employee can't access a system without approval, the agent shouldn't be able to either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix Is Architecture, Not Bigger Windows
&lt;/h2&gt;

&lt;p&gt;"Context windows keep getting bigger." They do. And the waste doesn't get smaller.&lt;/p&gt;

&lt;p&gt;A million-token window doesn't help if 67,000 tokens of tool schemas the agent will never use still get loaded. The underlying issue is architectural: eager loading is the wrong pattern for tool discovery in production agents.&lt;/p&gt;

&lt;p&gt;Lazy-load tools based on task intent. Gate discovery behind a search mechanism. Keep tool definitions out of the context until the agent actually needs them.&lt;/p&gt;
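
&lt;p&gt;As a sketch of that gate, reusing the &lt;code&gt;TOOL_REGISTRY&lt;/code&gt; shape from the code above (the function name here is illustrative, not from the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search_tools(query: str) -&amp;gt; list[str]:
    """Meta-tool: search the registry instead of preloading every schema."""
    q = query.lower()
    return [
        name
        for name, cfg in TOOL_REGISTRY.items()
        if any(trigger in q for trigger in cfg["triggers"])
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;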

&lt;p&gt;Honeycomb published &lt;a href="https://www.honeycomb.io/blog/icymi-is-this-code-worth-running-heres-how-know" rel="noopener noreferrer"&gt;a set of principles for the AI era&lt;/a&gt; that apply here: cost is a system attribute, not an afterthought, and pre-production testing doesn't prepare for the load that comes from real systems in a real environment. Tool context overhead is exactly the kind of emergent cost that only shows up in production, when real agents connect to real MCP servers and the token bills start making people uncomfortable.&lt;/p&gt;

&lt;p&gt;The protocol isn't the problem. The eager-loading default is the problem. Own the architecture decision. Lazy-load.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>MCP Is Packaging. Agent-Operable Interfaces Are the Product | Focused Labs</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Mon, 04 May 2026 14:25:47 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</link>
      <guid>https://forem.com/focused_dot_io/mcp-is-packaging-agent-operable-interfaces-are-the-product-focused-labs-49gp</guid>
      <description>&lt;p&gt;&lt;em&gt;MCP packages tools, but the real product is the narrow, typed, auditable interface an agent can actually operate.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of&lt;/em&gt;&lt;a href="https://focused.io" rel="noopener noreferrer"&gt; &lt;em&gt;Focused&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP is not the hard part.&lt;/p&gt;

&lt;p&gt;The hard part is designing a system that an agent can actually operate, rather than guess at, wander through, or mangle. The protocol is the distribution, not the architecture.&lt;/p&gt;

&lt;p&gt;This is kind of important. Every enterprise AI conversation I’ve had will, at some point, boil down to this: we have a model, we have a workflow, and we have a tangle of internal tools designed for humans to use through a web interface at human speed. Then the question becomes “should we make an MCP server to handle all of this?”&lt;/p&gt;

&lt;p&gt;Fine. But for what?&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol makes it easy for applications to expose tools and model context&lt;/a&gt;. That’s useful, and I’m not opposed to MCP. I am opposed to using the protocol to pass off a useless shortcut as something useful.&lt;/p&gt;

&lt;p&gt;Harrison Chase broke down the lock-in problem well: &lt;a href="https://x.com/hwchase17/status/2050470473310572849" rel="noopener noreferrer"&gt;switching model providers is easy, switching harnesses is less so, and model providers want to lock teams in through the harness&lt;/a&gt;. The harness is where the agent learns about the actions in an application, the state, the model’s memory, what can be retried, what needs approval, and what telemetry gets written down.&lt;/p&gt;

&lt;p&gt;But then there is the interface below the harness, which gets little recognition.&lt;/p&gt;

&lt;p&gt;A bad interface can turn an excellent harness into a nightmare. A good interface keeps even a mediocre harness workable.&lt;/p&gt;

&lt;p&gt;I see why “just build an MCP server” isn’t the entire answer. An MCP server can expose a messy action. It can wrap a sharp one. But deciding which actions exist in the first place is up to the team. And that’s a design and product problem, not an engineering one.&lt;/p&gt;

&lt;p&gt;Teams build integrations for internal agents by wrapping existing APIs, which are often structured to hide awkward frontend decisions, like why the API returns an object inside an object inside an object. An endpoint might update state as a side effect because it was built for an admin screen. Errors come back as human-readable prose, permissions are implicit, pagination parameters are opaque, there is no dry-run support, and there are no idempotency keys. The verb the business actually needs, “after policy rules apply, approve this one invoice,” lands on the agent as &lt;code&gt;updateInvoice&lt;/code&gt;. Stricter prompts don’t fix that.&lt;/p&gt;

&lt;p&gt;Welcome to production.&lt;/p&gt;

&lt;p&gt;After reading yet another question about whether a given subsystem has an MCP server, I paused to ask myself whether I was missing something. We shouldn’t be asking “is there an MCP server”; we should be asking whether the system has handles for the agent that just got invited in.&lt;/p&gt;

&lt;p&gt;A handle is a small, typed, boring action. It describes what the data contains, what the operation needs from it, and what it will look like afterward. It fails in a way the caller can understand. It is easy to test without a full model. And it leaves a trace of what it did.&lt;/p&gt;
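
&lt;p&gt;As a rough sketch of that shape, with every name below hypothetical rather than taken from a real system, a handle’s contract can be written as plain typed data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dataclasses import dataclass
from enum import Enum


class RefundError(str, Enum):
    """Typed failures the caller can branch on."""
    ALREADY_REFUNDED = "already_refunded"
    AMOUNT_EXCEEDS_POLICY = "amount_exceeds_policy"


@dataclass(frozen=True)
class CreateRefundRequest:
    """Narrow verb: propose a refund, never 'update order'."""
    order_id: str
    amount_cents: int
    currency: str
    reason: str
    idempotency_key: str  # a retry with the same key must not create a second refund


@dataclass(frozen=True)
class RefundResult:
    """Machine-readable and human-readable at once."""
    refund_id: str | None
    error: RefundError | None
    summary: str          # what changed, what didn't, what still needs work
    audit_record_id: str  # who acted, through which agent, against which object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;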

&lt;p&gt;The newest examples reinforce the point. Google’s &lt;a href="https://github.com/googleapis/mcp-toolbox" rel="noopener noreferrer"&gt;MCP Toolbox for Databases&lt;/a&gt; might sound bland because “database plus MCP” reads like a buzzword pairing, but the interesting part is that it treats database access as controlled, auditable work that can be inspected, not just exposed. MathWorks has released an official &lt;a href="https://github.com/matlab/matlab-mcp-core-server" rel="noopener noreferrer"&gt;MATLAB MCP server&lt;/a&gt;, which is interesting because a typed interface into MATLAB’s mature technical environment is vastly more appropriate than a chat window. Browserbase and LangChain are demonstrating Deep Agents with &lt;a href="https://docs.langchain.com/oss/python/integrations/providers/browserbase" rel="noopener noreferrer"&gt;search, fetch, and browser subagents&lt;/a&gt;: a cheap, light subagent performs quick retrieval, and a heavier browser-based operation follows only if necessary.&lt;/p&gt;

&lt;p&gt;I don’t mean that every single thing suddenly becomes an MCP server. I mean that more of the important tools in a business can become something controlled through an agent instead of through a browser tab or terminal command.&lt;/p&gt;

&lt;p&gt;There is a difference.&lt;/p&gt;

&lt;p&gt;An MCP server is just one package boundary among several, each with its own strengths and weaknesses. An agent-operable interface is a product decision, choosing specific verbs, inputs, outputs, reversible operations, and mandatory human pause actions. A protocol can then move that interface around, but it cannot make the interface good.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4pryn0qq9ln3xc56py8.png" alt="Side-by-side architecture diagram comparing a thin MCP wrapper around a messy API with an agent-operable interface that has narrow verbs, dry runs, typed errors, and audit records." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MCP moves an interface around. It does not make the verbs worth trusting.&lt;/p&gt;

&lt;p&gt;This is the same anti-pattern we saw with APIs. Companies would publish a REST API to tremendous fanfare, convinced that integration problems were now solved. In practice, the nouns and mutations provided by the API would prove inadequate for anything beyond the simplest cases. Docs would sometimes contradict behavior. And while most of the workflow might be automatable, the remaining chunk still required a human being logged into the admin console.&lt;/p&gt;

&lt;p&gt;The gap costs more as agents move further into it, because agents rarely surface the ambiguity at the boundary. They select a tool, fill in missing fields, retry operations, and summarize the results as if they were progress. Agents do not fail workflows on purpose. They are handed an irregular surface with no clear mandate and are forced to pretend competence on it.&lt;/p&gt;

&lt;p&gt;A useful way to think about this is &lt;a href="https://focused.io/lab/developing-ai-agency" rel="noopener noreferrer"&gt;Developing AI Agency&lt;/a&gt;. The word “agency” comes with unfortunate connotations of personality, so I try to think about it in terms of the required affordances for any agent: a goal, some tools to pursue it with, memory, feedback, and permission to act. When the tool layer is too vague, the AI ends up with fake agency. It can talk about work and even generate a lot of thoughtful-sounding design language, but it can’t actually do the work.&lt;/p&gt;

&lt;p&gt;The current gold rush of building MCPs obfuscates this problem because when people say “server” they think of code and physical hardware. Code and hardware are tangible. There is a repo, a README, and a demo of someone, usually Claude or Cursor, opening up the tool and something happening.&lt;/p&gt;

&lt;p&gt;That demo is not the test.&lt;/p&gt;

&lt;p&gt;Test whether the interface still behaves when the request is boring, partial, duplicated, late, unauthorized, or wrong. Test whether a reviewer can always reconstruct what happened to an object after the agent touched the handle of the thing. Test whether the action can be replayed in staging without accidentally sending the email to customers. &lt;a href="https://focused.io/lab/everybody-tests" rel="noopener noreferrer"&gt;Everybody Tests&lt;/a&gt;, even when the thing under test is an agent holding a tool handle.&lt;/p&gt;

&lt;p&gt;A useful agent-operable interface has a few properties.&lt;/p&gt;

&lt;p&gt;The verbs are narrow. A verb for “create refund request” instead of “update order.” A verb for “draft response” instead of “send message.” A verb for “propose schema migration” instead of “run SQL.” Narrow verbs help by letting the operation name strongly suggest the operation’s intent.&lt;/p&gt;

&lt;p&gt;Inputs are typed in the domain’s terms, not JSON schema for its own sake. Use real domain constraints wherever possible, the kind of validation that matters in the application: an account ID that actually exists in the system, a payment amount with a meaningful currency, a date and time with timezone rules that mean something to the user. And when using enums, the values should be meaningful strings, not whatever happened to be in the demo.&lt;/p&gt;
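
&lt;p&gt;A hedged sketch of what that looks like with Pydantic, where the ID pattern, bounds, and currency set are invented for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from enum import Enum

from pydantic import AwareDatetime, BaseModel, Field


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"


class PaymentAmount(BaseModel):
    cents: int = Field(gt=0, le=1_000_000)  # bounded amount, not a bare float
    currency: Currency


class RefundInput(BaseModel):
    # shaped like real account IDs instead of accepting any string
    account_id: str = Field(pattern=r"^acct_[a-z0-9]{12}$")
    amount: PaymentAmount
    effective_at: AwareDatetime  # timezone-aware or validation fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;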

&lt;p&gt;Outputs should be machine-readable and human-readable at the same time. The agent expects certain fields to be populated. A human reviewer wants to read a simple statement of what changed, what didn’t change, and what still needs work.&lt;/p&gt;

&lt;p&gt;There’s a dry-run path. A dry run is the cheapest safety mechanism available, and almost nobody shipping generated code tries it first. A dry run turns “can the agent do this?” into “can the agent explain the diff before doing this?” That is where human judgment is better.&lt;/p&gt;
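
&lt;p&gt;A minimal sketch of that path, reusing the hypothetical &lt;code&gt;RefundInput&lt;/code&gt; above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_refund_request(req: RefundInput, dry_run: bool = False) -&amp;gt; dict:
    diff = {"account": req.account_id, "would_refund_cents": req.amount.cents}
    if dry_run:
        # the agent can show this diff to a human before anything changes
        return {"applied": False, "diff": diff}
    ...  # perform the real mutation here
    return {"applied": True, "diff": diff}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;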

&lt;p&gt;Interfaces are idempotent to the degree possible. Networks fail, agents retry, and tool calls time out while the downstream system was actually doing the work. If retrying &lt;code&gt;create_refund_request&lt;/code&gt; creates a second refund, or a second ticket, or a second production deploy, then the interface is not yet ready for an agent.&lt;/p&gt;
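
&lt;p&gt;A toy version of the dedupe that makes retries safe, with an in-memory store standing in for the database:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import Callable

_results_by_key: dict[str, dict] = {}


def idempotent(key: str, operation: Callable[[], dict]) -&amp;gt; dict:
    """Replay the original result instead of re-running the mutation."""
    if key not in _results_by_key:
        _results_by_key[key] = operation()
    return _results_by_key[key]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;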

&lt;p&gt;Every interface has contract tests that don’t involve a model. This matters. If every single correctness check has to run an LLM, we have built a slot machine and only looked at the CI badge. The tool’s schema, how it validates, what a dry run looks like, how permissions fail, and what audit records are generated should all be tested by normal software tests. Save the model evals for when there’s a model involved.&lt;/p&gt;
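
&lt;p&gt;Sketched as pytest contract tests against the hypothetical handle above, with no model anywhere in the loop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pytest
from pydantic import ValidationError


def test_rejects_malformed_account_id():
    with pytest.raises(ValidationError):
        RefundInput(
            account_id="not-an-account",
            amount={"cents": 500, "currency": "USD"},
            effective_at="2026-01-01T00:00:00+00:00",
        )


def test_dry_run_never_mutates():
    req = RefundInput(
        account_id="acct_abc123def456",
        amount={"cents": 500, "currency": "USD"},
        effective_at="2026-01-01T00:00:00+00:00",
    )
    assert create_refund_request(req, dry_run=True)["applied"] is False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;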

&lt;p&gt;The interface leaves evidence. Not vibes: tangible records of who acted, through which agent, under which policy, against which object, with what proposed change, and with what final result. This is connecting observability to governance without inverting into another dashboard cult.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jgx17fgudljd1qbdpuh.png" alt="Matrix listing the properties of an agent-operable handle: narrow verb, typed input, dry run, idempotency, typed failure, audit record, and human pause." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A useful handle is a contract the agent cannot creatively reinterpret.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://x.com/GoogleCloudTech/status/2050334450697863535" rel="noopener noreferrer"&gt;Google Cloud conversation with Harrison Chase framed harness engineering as the path from demo to production&lt;/a&gt;. I think that is right, and I think the next practical step is interface engineering. A harness only earns its keep once it has sane interfaces to compose.&lt;/p&gt;

&lt;p&gt;This is why abstractions on top of LangChain are useful too. Start with a basic agent primitive, then a graph, and finally a Deep Agent that can use browser subagents and human interrupts. Every level of abstraction still bottoms out at a tool call, which corresponds either to a clean domain operation or to a tangled mess of code that happens to work on the backend.&lt;/p&gt;

&lt;p&gt;In practice, &lt;a href="https://focused.io/lab/multi-agent-orchestration-in-langgraph-supervisor-vs-swarm-tradeoffs-and-architecture" rel="noopener noreferrer"&gt;Multi-Agent Orchestration in LangGraph&lt;/a&gt; is only half the story. The other half is whether the interface lets the worker do anything worth trusting.&lt;/p&gt;

&lt;p&gt;It’s getting said out loud in the community now: &lt;a href="https://x.com/i/status/2050545264927093004" rel="noopener noreferrer"&gt;“Stop building MCP servers. Build CLIs that agents can use”&lt;/a&gt;. I don’t care what the end result is, as long as it’s a CLI, OpenAPI endpoint, MCP tool, database management procedure, internal command bus, or whatever boring thing is observable, testable, and readable by others.&lt;/p&gt;

&lt;p&gt;Interesting new projects are emerging around this idea too. &lt;a href="https://github.com/millionco/agent-install" rel="noopener noreferrer"&gt;agent-install&lt;/a&gt; treats agent capabilities as installable surfaces across coding agents. &lt;a href="https://github.com/DesmondSanctity/loadam" rel="noopener noreferrer"&gt;loadam&lt;/a&gt; turns OpenAPI specs into tests, MCP output, and drift reports. &lt;a href="https://www.freecodecamp.org/news/how-to-build-a-multi-agent-ai-system-with-langgraph-mcp-and-a2a-full-book/" rel="noopener noreferrer"&gt;freeCodeCamp’s LangGraph, MCP, and A2A guide&lt;/a&gt; also illustrates the progress from single-agent demos to more structured systems with protocols between them.&lt;/p&gt;

&lt;p&gt;Good. Just make the distinction between what the protocol diagram shows and what the system can actually do.&lt;/p&gt;

&lt;p&gt;The work is deciding what actions the agent can take within Salesforce, Jira, GitHub, Postgres, SAP, Stripe, and the lingering internal admin app that is totally going to get replaced tomorrow. Deleting broad verbs is nobody’s favorite hobby. Adding dry runs is straightforward. Making failures typed is tedious. Writing contract tests before a single model sees the tool is boring.&lt;/p&gt;

&lt;p&gt;Boring is the point.&lt;/p&gt;

&lt;p&gt;Stop eager-loading MCP tools into the context window. A giant pile of tools is not capability. It is usually confusion with a larger token bill. Agents need fewer, sharper handles to their tools, and tool catalogs should feel more like a well-designed command line than a junk drawer with JSON schemas bolted on.&lt;/p&gt;

&lt;p&gt;Agent-operable interfaces should be treated as part of product architecture, not as the integration scraps product teams don’t want anymore. Enterprise teams should own the verbs the same way they own the database schema. Version them. Deprecate them. Test them and document the failure modes. Require review for dangerous actions. Make the interface boring enough that the agent has no creative wiggle room around the important bits.&lt;/p&gt;

&lt;p&gt;MCP will help distribute interfaces. Harnesses will help compose them. Models will get better at calling them.&lt;/p&gt;

&lt;p&gt;Companies will not win by having the most MCP-capable servers. They will win by having the cleanest handles in their systems.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Customer Service Bot Is Slow Because It's Single-Threaded</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:24 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</link>
      <guid>https://forem.com/focused_dot_io/your-customer-service-bot-is-slow-because-its-single-threaded-1gnb</guid>
      <description>&lt;p&gt;Consider a typical enterprise support agent. A customer asks a complex compliance question and the agent dutifully queries the knowledge base, then searches the web, then checks policy docs. Sequential. Three LLM calls back to back. &lt;em&gt;That's ~12 seconds of wall time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Users start abandoning chat around 8 seconds.&lt;/p&gt;

&lt;p&gt;Fan out those three research calls in parallel, same calls, same models, same prompts, and &lt;em&gt;wall time drops to ~6.5 seconds.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post covers the parallel sub-agent pattern using LangGraph and LangSmith. I'll show the code, but more importantly, I'll show you the failure modes because the pattern is simple and the bugs are not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency Math
&lt;/h2&gt;

&lt;p&gt;You have an agent that needs to hit three sources: internal KB, web search, and policy documents. Each LLM call takes 2–4 seconds. Sequentially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research KB&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Web&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research Policy&lt;/td&gt;
&lt;td&gt;~2.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~12s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In parallel, the three research steps overlap:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classify query&lt;/td&gt;
&lt;td&gt;~1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Research (all three, parallel)&lt;/td&gt;
&lt;td&gt;~3.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synthesize&lt;/td&gt;
&lt;td&gt;~2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~6.5s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 45% reduction from a structural change, not a prompt improvement. Every additional sub-agent you add sequentially costs another 2–4 seconds. In parallel, it's free, until you hit the slowest branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Parallel Agents Architecture
&lt;/h2&gt;

&lt;p&gt;We're building a research assistant that fans out to three parallel sub-agents, aggregates results, and synthesizes a response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     ┌→ [Research: KB]     ─┐
[Classify Query] ────┼→ [Research: Web]    ─┼→ [Synthesize] → END
                     └→ [Research: Policy] ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;LangGraph executes parallel branches in a single superstep: all three branches run concurrently, and state updates are applied transactionally. The fan-in edge waits for all branches before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Send API:&lt;/strong&gt; LangGraph has a &lt;code&gt;Send&lt;/code&gt; API for dynamic map-reduce where branch count is unknown at build time. Don't reach for it here. &lt;code&gt;Send&lt;/code&gt; is designed for running the same node N times with different inputs. For a fixed set of specialist agents, static edges or conditional routing are simpler, preserve graph structure, and keep every branch visible at compile time via &lt;code&gt;graph.get_graph().draw_mermaid()&lt;/code&gt;. In practice, you'll rarely need &lt;code&gt;Send&lt;/code&gt;. Start with static fan-out, graduate to conditional, reach for &lt;code&gt;Send&lt;/code&gt; as a last resort.&lt;/p&gt;

&lt;h2&gt;
  
  
  State: The One Thing You'll Get Wrong
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Annotated[list, operator.add]&lt;/code&gt; reducer tells LangGraph to &lt;strong&gt;concatenate&lt;/strong&gt; results from parallel branches instead of overwriting them. Without it, parallel branches race to write the results field. The last branch to finish wins, and you silently lose the other two. This is one of the most common bugs in parallel agent systems. The synthesizer produces suspiciously narrow responses, coverage evals fail intermittently, and you spend two days blaming the prompt before realizing you're only getting one source's data.&lt;/p&gt;
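
&lt;p&gt;The difference is one annotation. A sketch of both definitions, with the failure mode described above attached to the first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict


class BrokenState(TypedDict):
    # no reducer: parallel writes to this field conflict and results get lost
    research_results: list[dict]


class SafeState(TypedDict):
    # reducer: LangGraph concatenates whatever each branch returns
    research_results: Annotated[list[dict], operator.add]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;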

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State, a sub-agent factory, and three agent instances. The &lt;code&gt;@traceable&lt;/code&gt; decorator ensures each agent appears as a distinct span in LangSmith — this will be the single most important debugging decision you make.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research_results: Annotated[list[dict], operator.add]
    final_response: str


def make_agent(name: str, focus: str):
    """Factory that builds a traceable research sub-agent."""

    @traceable(name=name, run_type="chain")
    def node(state: State) -&amp;gt; dict:
        response = llm.invoke([
            SystemMessage(content=f"You are the {name} agent. Focus on {focus}. "
                                  "Return a concise summary. Cite your source type."),
            HumanMessage(content=f"Research query: {state['question']}"),
        ])
        return {"research_results": [{"source": name, "content": response.content}]}

    return node


kb_agent = make_agent("knowledge_base", "internal knowledge base searches.")
web_agent = make_agent("web_search", "recent news and industry trends.")
policy_agent = make_agent("policy", "compliance, legal, and regulatory frameworks.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer merges sub-agent outputs into one customer-facing response. The key constraint, worth knowing before you ship, is that policy information takes precedence. Without this, the synthesizer will cheerfully soften restrictions to sound more helpful.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="Synthesizer", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    context = "\n\n".join(
        f"[{r['source']}]: {r['content']}" for r in state["research_results"]
    )
    response = llm.invoke([
        SystemMessage(
            content="Synthesize the following research into a clear, actionable "
                    "response. When policy information conflicts with or constrains "
                    "other responses, the policy statement takes precedence. "
                    "Never soften or omit policy restrictions."
        ),
        HumanMessage(
            content=f"Customer question: {state['question']}\n\n"
                    f"Research findings:\n{context}"
        ),
    ])
    return {"final_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Fifteen lines of wiring. &lt;code&gt;RetryPolicy&lt;/code&gt; on every research node so a provider 429 doesn't kill the entire pipeline; successful branches are checkpointed and won't re-execute.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END
from langgraph.types import RetryPolicy

builder = StateGraph(State)

builder.add_node("kb", kb_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("web", web_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("policy", policy_agent, retry=RetryPolicy(max_attempts=3))
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "kb")
builder.add_edge(START, "web")
builder.add_edge(START, "policy")
builder.add_edge(["kb", "web", "policy"], "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conditional Routing: The Upgrade
&lt;/h2&gt;

&lt;p&gt;Sometimes hitting every source is wasteful. A simple "what's our refund policy?" doesn't need web search. Conditional fan-out lets you route based on the question using structured output, no regex parsing, no brittle string matching:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections.abc import Sequence

from pydantic import BaseModel, Field


class RoutingPlan(BaseModel):
    agents: list[str] = Field(
        description="Agents to activate: kb, web, policy"
    )

structured_llm = llm.with_structured_output(RoutingPlan)


def classify_and_route(state: State) -&amp;gt; Sequence[str]:
    plan = structured_llm.invoke([
        SystemMessage(content="Decide which research agents to invoke. "
                              "Available: kb, web, policy. When in doubt, include the agent."),
        HumanMessage(content=state["question"]),
    ])
    return plan.agents or ["kb"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tradeoff is real. Conditional routing saves latency on simple queries, but the routing logic becomes a new failure point. And with conditional fan-out, use individual edges from each node to &lt;code&gt;synthesize&lt;/code&gt;, not the list-style fan-in, or LangGraph waits forever for branches that were never dispatched.&lt;/p&gt;
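
&lt;p&gt;A sketch of that wiring, assuming the nodes and &lt;code&gt;classify_and_route&lt;/code&gt; from above (the &lt;code&gt;classify&lt;/code&gt; node is a placeholder that only anchors the routing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;builder.add_node("classify", lambda state: {})  # routing anchor; writes nothing
builder.add_edge(START, "classify")
builder.add_conditional_edges("classify", classify_and_route, ["kb", "web", "policy"])

# Individual fan-in edges: synthesize fires once the dispatched branches finish,
# instead of waiting forever on branches that were never selected.
builder.add_edge("kb", "synthesize")
builder.add_edge("web", "synthesize")
builder.add_edge("policy", "synthesize")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;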

&lt;h2&gt;
  
  
  Production Failures in Concurrent Execution
&lt;/h2&gt;

&lt;p&gt;These are the failure modes that surface once parallel agents hit real traffic.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Clobbering.&lt;/strong&gt; Synthesizer references only one source. Intermittent. Cause: missing &lt;code&gt;operator.add&lt;/code&gt; reducer. Parallel branches overwrite instead of appending. There's no warning; the graph runs fine, it just loses data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesizer Contradicted the Policy Agent.&lt;/strong&gt; Say a customer asks about returning an opened product. The policy agent correctly stated the 30-day &lt;em&gt;unopened-only&lt;/em&gt; return policy. The KB agent mentioned "hassle-free returns." The synthesizer merged these into: "You can return the product within 30 days, hassle-free" omitting the unopened requirement. LangSmith traces showed the policy agent's output was correct; the synthesizer span revealed where the information was lost. Fix: the policy-takes-precedence constraint in the synthesizer prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hung Branch Blocking Fan-In.&lt;/strong&gt; Response times spike from ~6s to 30s+. The fan-in waits for ALL branches. Your p50 is fine; your p99 is determined by the slowest branch on its worst day. Fix: async timeouts per branch; return partial results (&lt;code&gt;{"source": "web_search", "content": "Timed out"}&lt;/code&gt;) rather than blocking the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator Under-Dispatched&lt;/strong&gt;. A significant fraction of multi-domain queries will be only partially routed. Over-dispatching (an agent returning empty results) is cheap. Under-dispatching is a customer getting an incomplete answer. Fix: explicit multi-domain examples in the routing prompt and a &lt;code&gt;"when in doubt, include the agent"&lt;/code&gt; instruction.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Parallel agents are hard to debug without tracing. &lt;code&gt;@traceable&lt;/code&gt; on every sub-agent gives you per-branch spans in LangSmith. Tag production traces with metadata for filtering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={"customer_tier": "enterprise", "channel": "chat"},
    tags=["production", "v2"],
):
    result = graph.invoke({"question": "How does GDPR affect our data pipeline?"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first thing to check when latency spikes: is one branch consistently slower? LangSmith makes that a 10-second investigation instead of an hour of log-grepping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Shipping without evals is negligence. Three evaluators catch the most common regressions: deterministic coverage, structural fan-out validation, and LLM-as-judge for overall quality.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="research-agent-evals",
    description="Parallel research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is our refund policy for enterprise clients?"},
        {"question": "How does GDPR affect our data pipeline architecture?"},
        {"question": "What competitors launched AI features last quarter?"},
    ],
    outputs=[
        {"must_mention": ["refund", "enterprise", "policy"]},
        {"must_mention": ["GDPR", "data", "compliance"]},
        {"must_mention": ["competitor", "AI", "feature"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
Customer query: {inputs[question]}
AI response: {outputs[final_response]}

Rate 0.0-1.0 on completeness, accuracy, and tone.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the synthesizer actually address the question?"""
    text = outputs.get("final_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def source_diversity(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Is the fan-out actually working, or did it silently degrade?"""
    results = outputs.get("research_results", [])
    sources = {r["source"] for r in results if isinstance(r, dict)}
    return {"key": "source_diversity", "score": min(len(sources) / 2.0, 1.0)}


def target(inputs: dict) -&amp;gt; dict:
    return graph.invoke({"question": inputs["question"]})


results = evaluate(
    target,
    data="research-agent-evals",
    evaluators=[quality_judge, coverage, source_diversity],
    experiment_prefix="parallel-research-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;source_diversity&lt;/code&gt; is the only automated check that your parallel architecture is actually parallel. Without it, state clobbering can ship to production and sit there for weeks. Run this eval on every PR that touches agent code.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use parallel sub-agents when:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries regularly span 2+ domains in a single message&lt;/li&gt;
&lt;li&gt;You need per-domain traceability for debugging and compliance&lt;/li&gt;
&lt;li&gt;Sub-agents have different tool sets or retrieval sources&lt;/li&gt;
&lt;li&gt;You're iterating on prompts and need isolated regression testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries are single-domain (a FAQ bot doesn't need orchestration)&lt;/li&gt;
&lt;li&gt;Latency budget is extremely tight (routing adds one LLM call)&lt;/li&gt;
&lt;li&gt;You have fewer than 3 distinct knowledge domains&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Parallel sub-agents aren't architecturally complex: a fan-out, a fan-in, and a reducer. The code is about 15 lines of graph wiring. The production hardening is everything else.&lt;/p&gt;

&lt;p&gt;Start with static fan-out. Add conditional routing when you have data showing which sources matter for which queries. Write the &lt;code&gt;source_diversity&lt;/code&gt; eval before you write the second prompt. And put &lt;code&gt;operator.add&lt;/code&gt; on your list fields; you'll thank me later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/01-parallel-sub-agents/" rel="noopener noreferrer"&gt;Parallel Agents GitHub Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/quickstart" rel="noopener noreferrer"&gt;LangGraph Quickstart (State, Reducers, Graph Construction)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith Observability &amp;amp; Tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation Framework&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded" rel="noopener noreferrer"&gt;https://focused.io/lab/your-customer-service-bot-is-slow-because-its-single-threaded&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your AI Just Emailed a Customer Without Permission</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:16:21 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</link>
      <guid>https://forem.com/focused_dot_io/your-ai-just-emailed-a-customer-without-permission-38k4</guid>
      <description>&lt;p&gt;In a customer complaint handler for a fintech company you have drafted responses, checked tone, and verified responses to match company policy. Automated from end to end. Then, the agent sends a $4,200 refund approval to a customer who'd asked about a fee schedule. The LLM hallucinates the complaint, writes up a professional apology with a specific dollar amount, and fires it off before anyone on the team even knows.&lt;/p&gt;

&lt;p&gt;Better prompts won’t help because the problem isn't what the model says, it's that nothing stops it from saying it.&lt;/p&gt;

&lt;p&gt;To fix this you need an approval gate. Somewhere in the agent’s graph where execution... stops. State gets written to disk and a human looks at the draft. Only after they say "yeah, send it" does anything go out the door. LangGraph has a built-in primitive for this called &lt;code&gt;interrupt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's walk through the full pattern here. The code is straightforward but state management can trip you up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost argument (if you need one)
&lt;/h2&gt;

&lt;p&gt;If you're already sold on why AI shouldn't email customers unsupervised, skip this, but if you need to convince your PM, here's some napkin math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Without Gate&lt;/th&gt;
&lt;th&gt;With Gate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Messages sent/day&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rate (wrong tone/info)&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;~0.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bad messages/day&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg cost per bad message&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;td&gt;$200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily risk&lt;/td&gt;
&lt;td&gt;$3,000&lt;/td&gt;
&lt;td&gt;$100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What we’re building
&lt;/h2&gt;

&lt;p&gt;A customer complaint response pipeline. Complaint comes in, AI drafts a response, a human approves or edits, system sends the final version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Intake] → [Draft Response] → [INTERRUPT: Human Review] → [Send Response] → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;interrupt&lt;/code&gt; is where execution pauses. All the graph state (draft, original complaint, metadata, etc.) gets checkpointed. It could be hours or days before someone reviews it, and when they do, the graph picks up right where it stopped.&lt;/p&gt;

&lt;p&gt;Even in serverless environments, &lt;code&gt;interrupt&lt;/code&gt; is resilient. The Python process can crash. The server can restart. You resume with the same &lt;code&gt;thread_id&lt;/code&gt; and LangGraph reloads everything from the checkpointer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The state schema
&lt;/h2&gt;

&lt;p&gt;Whatever the reviewer needs to see has to be in state before the interrupt fires.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    complaint: str
    customer_id: str
    draft_response: str
    review_decision: str
    reviewer_notes: str
    final_response: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The nodes
&lt;/h2&gt;

&lt;p&gt;Let’s build three nodes: draft, review, send. All with &lt;code&gt;@traceable&lt;/code&gt;, because six months from now when someone asks "who approved sending that email to the VP of procurement at our biggest account," you want a trace showing what the AI wrote vs. what a person changed.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="draft_response", run_type="chain")
def draft_response(state: State) -&amp;gt; dict:
    response = llm.invoke([
        SystemMessage(
            content="You are a customer service agent. Draft a professional, "
                    "empathetic response to the following complaint. Be specific "
                    "about next steps. Do NOT promise refunds or credits unless "
                    "the complaint clearly warrants one. Keep it under 150 words."
        ),
        HumanMessage(
            content=f"Customer ID: {state['customer_id']}\n\n"
                    f"Complaint: {state['complaint']}"
        ),
    ])
    return {"draft_response": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The review node is where &lt;code&gt;interrupt()&lt;/code&gt; does its work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt

@traceable(name="human_review", run_type="chain")
def human_review(state: State) -&amp;gt; dict:
    decision = interrupt({
        "draft": state["draft_response"],
        "customer_id": state["customer_id"],
        "complaint": state["complaint"],
        "instructions": "Review the draft. Respond with a JSON object: "
                        '{"action": "approve" | "edit" | "reject", '
                        '"edited_response": "...", "notes": "..."}'
    })
    return {
        "review_decision": decision["action"],
        "reviewer_notes": decision.get("notes", ""),
        "final_response": decision.get("edited_response", state["draft_response"])
            if decision["action"] != "reject" else "",
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dict you pass to &lt;code&gt;interrupt()&lt;/code&gt; is the payload. It shows up in the &lt;code&gt;__interrupt__&lt;/code&gt; field of the graph's return value, which is what your UI or Slack bot reads to build the review screen. When someone calls &lt;code&gt;Command(resume={"action": "approve"})&lt;/code&gt;, that dict becomes what &lt;code&gt;interrupt()&lt;/code&gt; returns. The function resumes from the line right after the &lt;code&gt;interrupt()&lt;/code&gt; call. It looks like a normal function call but there's a checkpoint boundary hiding inside it.&lt;/p&gt;

&lt;p&gt;Send node. Don't send if it was rejected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="send_response", run_type="chain")
def send_response(state: State) -&amp;gt; dict:
    if state["review_decision"] == "reject":
        return {"final_response": "[REJECTED] " + state["reviewer_notes"]}
    return {"final_response": state["final_response"]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Wiring it up
&lt;/h2&gt;

&lt;p&gt;The checkpointer makes interrupts durable. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for dev and &lt;code&gt;PostgresSaver&lt;/code&gt; for prod. If you forget the checkpointer, &lt;code&gt;interrupt()&lt;/code&gt; throws a &lt;code&gt;RuntimeError&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("draft", draft_response)
builder.add_node("review", human_review)
builder.add_node("send", send_response)

builder.add_edge(START, "draft")
builder.add_edge("draft", "review")
builder.add_edge("review", "send")
builder.add_edge("send", END)

checkpointer = InMemorySaver()
graph = builder.compile(checkpointer=checkpointer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The full interrupt/resume cycle
&lt;/h2&gt;

&lt;p&gt;Two &lt;code&gt;invoke&lt;/code&gt; calls. The first runs until the interrupt and stops; the second picks up where it left off.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import Command

config = {"configurable": {"thread_id": "complaint-1234"}}

# Phase 1: Run until the interrupt
result = graph.invoke(
    {
        "complaint": "I was charged twice for my subscription last month. "
                     "Order #A-9912. I want a refund immediately.",
        "customer_id": "cust_8837",
    },
    config=config,
)

# The graph paused. Extract the interrupt payload.
interrupt_data = result["__interrupt__"][0].value
print(f"Draft for review: {interrupt_data['draft']}")
print(f"Customer: {interrupt_data['customer_id']}")

# Phase 2: Human reviews and approves (could be minutes or days later)
final_result = graph.invoke(
    Command(resume={
        "action": "edit",
        "edited_response": "We've identified the duplicate charge on Order #A-9912. "
                           "A refund of $29.99 has been initiated and will appear "
                           "in 3-5 business days. We apologize for the inconvenience.",
        "notes": "Verified duplicate charge in billing system. Approved refund.",
    }),
    config=config,  # Same thread_id — this is how LangGraph finds the checkpoint
)

print(f"Final response: {final_result['final_response']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;thread_id&lt;/code&gt; in the config matters more than anything else here. It's the key into the checkpointer. Without a &lt;code&gt;thread_id&lt;/code&gt; you can't resume. Treat it as a primary key and map it to something stable in your system: ticket ID, conversation ID, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding risk-based routing
&lt;/h2&gt;

&lt;p&gt;The basic version sends everything through human review. Start there, but eventually reviewers get tired of approving "thanks for contacting us, we're looking into it" all day, and you'll want to auto-approve the low-risk stuff.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel, Field


class RiskAssessment(BaseModel):
    risk_level: str = Field(description="low, medium, or high")
    reason: str = Field(description="Why this risk level was assigned")


risk_llm = llm.with_structured_output(RiskAssessment)


@traceable(name="assess_risk", run_type="chain")
def assess_risk(state: State) -&amp;gt; dict:
    assessment = risk_llm.invoke([
        SystemMessage(
            content="Assess the risk level of this customer service response. "
                    "high = involves money, legal, account changes, or could "
                    "be interpreted as a binding commitment. "
                    "medium = emotional topic, could escalate. "
                    "low = simple acknowledgment, FAQ, status update."
        ),
        HumanMessage(
            content=f"Complaint: {state['complaint']}\n\n"
                    f"Draft response: {state['draft_response']}"
        ),
    ])
    return {"review_decision": assessment.risk_level}


def route_by_risk(state: State) -&amp;gt; str:
    if state["review_decision"] == "low":
        return "send"
    return "review"


builder_v2 = StateGraph(State)

builder_v2.add_node("draft", draft_response)
builder_v2.add_node("assess", assess_risk)
builder_v2.add_node("review", human_review)
builder_v2.add_node("send", send_response)

builder_v2.add_edge(START, "draft")
builder_v2.add_edge("draft", "assess")
builder_v2.add_conditional_edges("assess", route_by_risk, {"send": "send", "review": "review"})
builder_v2.add_edge("review", "send")
builder_v2.add_edge("send", END)

graph_v2 = builder_v2.compile(checkpointer=InMemorySaver())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fair warning: you've now introduced a second LLM call as a gate, and that gate can be wrong in both directions. Under-classify risk and messages go out without review. Over-classify and reviewers are right back to rubber-stamping everything. Run the classifier in logging-only mode for a couple of weeks first: route everything through review, but record what the classifier would have done, and use that long-term memory to tune the classifier. Start skipping reviews on low-risk messages only after you trust the data.&lt;/p&gt;
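
&lt;p&gt;A sketch of that logging-only mode (the logger setup is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import logging

logger = logging.getLogger("risk_router")


def route_by_risk_shadow(state: State) -&amp;gt; str:
    # record what the classifier would have done, but keep everything in review
    logger.info("risk classifier would route to: %s", state["review_decision"])
    return "review"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;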

&lt;h2&gt;
  
  
  The bugs
&lt;/h2&gt;

&lt;p&gt;The demo works great... but...&lt;/p&gt;

&lt;h3&gt;
  
  
  Lost thread_id
&lt;/h3&gt;

&lt;p&gt;Someone approves a draft in Slack. The integration pulls out the approval decision but constructs a &lt;em&gt;new&lt;/em&gt; thread_id instead of looking up the one stored with the interrupt payload. Now &lt;code&gt;Command(resume=...)&lt;/code&gt; starts a fresh run whose input is an approval decision, not the complaint.&lt;/p&gt;

&lt;p&gt;This happens a lot. Store the thread_id alongside the interrupt payload when you surface it to reviewers. Put it in a database. Put it in the Slack message metadata. Do not lose it.&lt;/p&gt;
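
&lt;p&gt;A minimal sketch of that bookkeeping, with a hypothetical SQLite table standing in for whatever store you use: persist the thread_id when the draft goes out, and look it up (never rebuild it) when the approval comes back:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3

# Hypothetical store for pending reviews: thread_id keyed by ticket.
db = sqlite3.connect("reviews.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS pending_reviews "
    "(ticket_id TEXT PRIMARY KEY, thread_id TEXT, payload TEXT)"
)


def surface_for_review(ticket_id: str, thread_id: str, payload: str) -&amp;gt; None:
    """Persist the thread_id the moment the draft goes out for review."""
    db.execute(
        "INSERT OR REPLACE INTO pending_reviews VALUES (?, ?, ?)",
        (ticket_id, thread_id, payload),
    )
    db.commit()


def resume_config_for(ticket_id: str) -&amp;gt; dict:
    """Look the thread_id up on approval. Never construct a new one."""
    row = db.execute(
        "SELECT thread_id FROM pending_reviews WHERE ticket_id = ?", (ticket_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"No pending review for {ticket_id}")
    return {"configurable": {"thread_id": row[0]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;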

&lt;h3&gt;
  
  
  Stale state
&lt;/h3&gt;

&lt;p&gt;Reviewer opens the draft at 11:30. Goes to lunch. Comes back at 1pm and hits approve. In the meantime, the customer sent two more messages and someone on the support team already replied manually. The approved draft is now responding to a conversation that moved on.&lt;/p&gt;

&lt;p&gt;LangGraph has no idea. It resumes from the checkpoint, which is frozen in time. Fix this by putting a &lt;code&gt;created_at&lt;/code&gt; timestamp in the interrupt payload and checking it against the customer record's &lt;code&gt;last_updated_at&lt;/code&gt; on resume. If anything changed, re-draft.&lt;/p&gt;
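
&lt;p&gt;A sketch of that guard, reusing &lt;code&gt;graph&lt;/code&gt; and &lt;code&gt;config&lt;/code&gt; from above. The &lt;code&gt;created_at&lt;/code&gt; payload field and the customer-record fields are names we're assuming, not LangGraph APIs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime


def resume_if_fresh(payload: dict, customer: dict, decision: dict) -&amp;gt; dict:
    """Re-draft instead of resuming when the conversation moved on mid-review."""
    created_at = datetime.fromisoformat(payload["created_at"])
    last_updated = datetime.fromisoformat(customer["last_updated_at"])
    if last_updated &amp;gt; created_at:
        # The checkpoint is frozen in time. Discard the approved draft, re-draft.
        return graph.invoke(
            {"complaint": customer["latest_complaint"], "customer_id": customer["id"]},
            config=config,
        )
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;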

&lt;h3&gt;
  
  
  Double resume
&lt;/h3&gt;

&lt;p&gt;Shared review queue. Two reviewers see the same pending draft. Both click approve. Depending on the checkpointer implementation, the second resume is either a no-op or an error, but by then the send logic already fired on the first one. Maybe that's fine. Maybe you just sent duplicate emails.&lt;/p&gt;

&lt;p&gt;Build in idempotency: check whether the thread already has a &lt;code&gt;review_decision&lt;/code&gt; before doing anything with the resume.&lt;/p&gt;
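
&lt;p&gt;LangGraph's checkpoint API gives you enough for the simple version. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def resume_once(config: dict, decision: dict):
    """Drop the second approval when the thread has already been decided."""
    snapshot = graph.get_state(config)  # reads the checkpoint for this thread_id
    if snapshot.values.get("review_decision"):
        return None  # already resumed once; ignore the duplicate click
    return graph.invoke(Command(resume=decision), config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two clicks in the same instant can still both pass that check. For a shared queue, claim the review in your own database first (an update only one reviewer can win) and resume only after the claim succeeds.&lt;/p&gt;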

&lt;h3&gt;
  
  
  Interrupt reordering
&lt;/h3&gt;

&lt;p&gt;Two &lt;code&gt;interrupt()&lt;/code&gt; calls in one node (say, one for policy review and one for tone). LangGraph matches resume values to interrupts by position, not by name. There are no names. Refactor and swap the order, the policy answer goes to the tone check and vice versa.&lt;/p&gt;

&lt;p&gt;Don't put multiple interrupts in one node; give each interrupt its own node instead.&lt;/p&gt;
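
&lt;p&gt;The shape to aim for, sketched with hypothetical node names, so each resume unambiguously answers one question:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import interrupt


def policy_review(state: State) -&amp;gt; dict:
    decision = interrupt({"kind": "policy", "draft": state["draft_response"]})
    return {"policy_decision": decision}


def tone_review(state: State) -&amp;gt; dict:
    decision = interrupt({"kind": "tone", "draft": state["draft_response"]})
    return {"tone_decision": decision}

# One interrupt per node: a refactor can reorder nodes without crossing wires.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;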

&lt;h2&gt;
  
  
  Tracing across the gap
&lt;/h2&gt;

&lt;p&gt;Interrupt-based workflows leave a gap in the LangSmith timeline where the human review happened. The draft trace ends, then hours later the resume trace starts, and nothing connects them unless you're deliberate about it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

ticket_id = "TICKET-4821"
config = {"configurable": {"thread_id": ticket_id}}

# Phase 1: Draft
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "draft"},
    tags=["production", "complaint-handler", "phase-1"],
):
    result = graph.invoke(
        {
            "complaint": "Your app crashed and I lost 3 hours of work.",
            "customer_id": "cust_2291",
        },
        config=config,
    )

# ... time passes, human reviews ...

# Phase 2: Resume
with tracing_context(
    metadata={"ticket_id": ticket_id, "phase": "resume", "reviewer": "jane@company.com"},
    tags=["production", "complaint-handler", "phase-2"],
):
    final = graph.invoke(
        Command(resume={"action": "approve", "notes": "Looks good."}),
        config=config,
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Put the ticket ID in the metadata for both phases. Now you can filter in LangSmith and see the full lifecycle of a single complaint even though draft and resume were separate invocations. The &lt;code&gt;reviewer&lt;/code&gt; field in phase 2 is your audit trail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;You need to know if drafts are any good before a human ever sees them.&lt;/p&gt;

&lt;p&gt;Dataset setup and evaluators live in &lt;code&gt;evals.py&lt;/code&gt; in the companion repo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client, evaluate
from openevals.llm import create_llm_as_judge

from complaint_handler import graph

ls_client = Client()

DATASET_NAME = "complaint-handler-evals"

if not ls_client.has_dataset(dataset_name=DATASET_NAME):
    dataset = ls_client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Human-in-the-loop complaint handler evaluation dataset",
    )
    ls_client.create_examples(
        dataset_id=dataset.id,
        inputs=[
            {
                "complaint": "Charged twice for order #A-1234. Want a refund.",
                "customer_id": "cust_001",
            },
            {
                "complaint": "App crashes every time I open the settings page.",
                "customer_id": "cust_002",
            },
            {
                "complaint": "Your CEO's tweet was offensive. Cancelling my account.",
                "customer_id": "cust_003",
            },
        ],
        outputs=[
            {
                "must_mention": ["refund", "order", "A-1234"],
                "risk": "high",
            },
            {
                "must_mention": ["crash", "settings", "investigating"],
                "risk": "medium",
            },
            {
                "must_mention": ["feedback", "understand", "account"],
                "risk": "high",
            },
        ],
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three evaluators. LLM judge for draft quality, keyword coverage, and a check for unauthorized promises:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DRAFT_QUALITY_PROMPT = """\
Customer complaint: {inputs}
AI draft response: {outputs}

Rate 0.0-1.0 on empathy, accuracy, and professionalism.
Deduct points if the draft promises specific remedies (refunds, credits)
without explicit authorization.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

draft_judge = create_llm_as_judge(
    prompt=DRAFT_QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="draft_quality",
    continuous=True,
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft actually address the complaint specifics?"""
    text = outputs.get("draft_response", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def no_unauthorized_promises(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the draft promise refunds or credits without authorization?"""
    text = outputs.get("draft_response", "").lower()
    dangerous_phrases = ["refund has been", "credit has been", "we will refund",
                         "we will credit", "compensation of"]
    violations = sum(1 for p in dangerous_phrases if p in text)
    return {"key": "no_unauthorized_promises", "score": 1.0 if violations == 0 else 0.0}


def target(inputs: dict) -&amp;gt; dict:
    """Run the graph until the interrupt (draft phase only)."""
    config = {"configurable": {"thread_id": f"eval-{inputs['customer_id']}"}}
    result = graph.invoke(inputs, config=config)
    return {"draft_response": result.get("draft_response", "")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;no_unauthorized_promises&lt;/code&gt; catches the failure mode from the top of this post. If the draft says "a refund has been initiated" when nobody authorized a refund, it scores zero. Run this eval every time you change the system prompt.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if &lt;strong&gt;name&lt;/strong&gt; == "&lt;strong&gt;main&lt;/strong&gt;":&lt;br&gt;
    results = evaluate(&lt;br&gt;
        target,&lt;br&gt;
        data=DATASET_NAME,&lt;br&gt;
        evaluators=[draft_judge, coverage, no_unauthorized_promises],&lt;br&gt;
        experiment_prefix="complaint-handler-v1",&lt;br&gt;
        max_concurrency=4,&lt;br&gt;
    )&lt;br&gt;
    print("\nEvaluation complete. Check LangSmith for results.")&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  When to Human In The Loop
&lt;/h2&gt;

&lt;p&gt;If AI is writing things that go to customers, you need a gate. Processing refunds, updating account records, anything you can't undo with a quick "sorry about that" email. Regulated industries need the gate plus an audit trail of who approved what.&lt;/p&gt;

&lt;p&gt;You don't need this for internal stuff. Summarizing meeting notes, running analysis for a dashboard, generating reports that a human reads. &lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;The two function calls: &lt;code&gt;interrupt()&lt;/code&gt; and &lt;code&gt;Command(resume=...)&lt;/code&gt;. Pause execution, persist state, resume later.&lt;/p&gt;

&lt;p&gt;Most of the work is everything around those two calls. Thread IDs getting lost, the world changing during the review gap, two reviewers approving the same draft, traces that need to connect across a timeline gap of hours or days.&lt;/p&gt;

&lt;p&gt;Start by routing every response through review. Reviewers will complain. Good. Measure which categories they rubber-stamp, run your evals, and only then start auto-approving the boring stuff.  &lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/02-human-in-the-loop/tree/9e328bdd3770541a764134efa7f87d53de2dad6b" rel="noopener noreferrer"&gt;Human in the Loop Github Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/interrupts" rel="noopener noreferrer"&gt;Interrupts (Human-in-the-loop / pause &amp;amp; resume)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/persistence" rel="noopener noreferrer"&gt;Persistence (Thread IDs &amp;amp; Checkpointers)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/overview" rel="noopener noreferrer"&gt;LangGraph Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/evaluation-quickstart?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;LangSmith Eval Quickstarter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission" rel="noopener noreferrer"&gt;https://focused.io/lab/your-ai-just-emailed-a-customer-without-permission&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming Agent State with LangGraph</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:26 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/streaming-agent-state-with-langgraph-10kg</link>
      <guid>https://forem.com/focused_dot_io/streaming-agent-state-with-langgraph-10kg</guid>
      <description>&lt;p&gt;Your research agent takes 9 seconds to answer a question. It fans out to three sources, synthesizes results, returns a polished answer. The user sees a blank screen for all nine of those seconds. By second 5 they've refreshed the page, doubled your API costs, and still seen nothing.&lt;/p&gt;

&lt;p&gt;Streaming fixes this. Show the user what the agent is doing while it's doing it: "Searching knowledge base...", "Found 3 results...", "Synthesizing..." and then stream the final answer token by token. Same 9 seconds, but the user sees progress from millisecond 200.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perception Math
&lt;/h2&gt;

&lt;p&gt;Identical work, different user experience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Wall time&lt;/th&gt;
&lt;th&gt;Time to first byte&lt;/th&gt;
&lt;th&gt;Perceived wait&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;invoke()&lt;/code&gt; (no streaming)&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode="updates")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stream(stream_mode=["updates", "custom", "messages"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9s&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;Can see what it’s doing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What We're Building
&lt;/h2&gt;

&lt;p&gt;A multi-step research agent that streams three types of events to the UI: node-level progress updates, custom status messages from inside nodes, and token-by-token LLM output for the final synthesis.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          ┌─ stream: "Searching KB..."
[Intake] → [Research KB]  ┤
                          └─ stream: {results: 3}
                                    ↓
                          ┌─ stream: "Analyzing results..."
         → [Synthesize]  ┤
                          └─ stream: tokens... t-o-k-e-n-b-y-t-o-k-e-n
                                    ↓
                                     → END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three stream modes run simultaneously: &lt;code&gt;updates&lt;/code&gt; for graph state changes, &lt;code&gt;custom&lt;/code&gt; for application-specific progress events, and &lt;code&gt;messages&lt;/code&gt; for LLM token streaming.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Modes
&lt;/h2&gt;

&lt;p&gt;LangGraph exposes five stream modes. You'll use three in practice:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;What it streams&lt;/th&gt;
&lt;th&gt;When to use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;values&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full state after each superstep&lt;/td&gt;
&lt;td&gt;Debugging, state inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;updates&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;State delta from each node&lt;/td&gt;
&lt;td&gt;Production UIs — lightweight, shows which node ran&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LLM tokens + metadata&lt;/td&gt;
&lt;td&gt;Chat UIs — token-by-token output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Arbitrary data from &lt;code&gt;get_stream_writer()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Progress bars, status messages, structured events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Everything — internal execution details&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In production, use &lt;code&gt;["updates", "custom", "messages"]&lt;/code&gt;. &lt;code&gt;values&lt;/code&gt; sends the entire state on every step. &lt;code&gt;debug&lt;/code&gt; is for development.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;State and two nodes: a research step that emits custom progress events, and a synthesizer that streams its LLM response token by token.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from typing import TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.config import get_stream_writer
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class State(TypedDict):
    question: str
    research: str
    answer: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The research node uses &lt;code&gt;get_stream_writer()&lt;/code&gt; to push status updates to the client. These show up in the &lt;code&gt;custom&lt;/code&gt; stream mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="research", run_type="chain")
def research(state: State) -&amp;gt; dict:
    writer = get_stream_writer()

    writer({"step": "research", "status": "starting", "message": "Searching knowledge base..."})

    response = llm.invoke([
        SystemMessage(
            content="You are a research assistant. Search for relevant information "
                    "about the user's question. Return a concise summary of findings."
        ),
        HumanMessage(content=state["question"]),
    ])

    writer({"step": "research", "status": "complete", "message": "Research complete."})

    return {"research": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The synthesizer uses the LLM normally. LangGraph automatically streams its tokens when &lt;code&gt;messages&lt;/code&gt; mode is active:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@traceable(name="synthesize", run_type="chain")
def synthesize(state: State) -&amp;gt; dict:
    writer = get_stream_writer()
    writer({"step": "synthesize", "status": "starting", "message": "Synthesizing answer..."})

    response = llm.invoke([
        SystemMessage(
            content="Synthesize the research into a clear, actionable answer. "
                    "Be concise but thorough."
        ),
        HumanMessage(
            content=f"Question: {state['question']}\n\nResearch:\n{state['research']}"
        ),
    ])

    writer({"step": "synthesize", "status": "complete", "message": "Done."})
    return {"answer": response.content}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END

builder = StateGraph(State)

builder.add_node("research", research)
builder.add_node("synthesize", synthesize)

builder.add_edge(START, "research")
builder.add_edge("research", "synthesize")
builder.add_edge("synthesize", END)

graph = builder.compile()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Multi-mode Streaming
&lt;/h2&gt;

&lt;p&gt;A single &lt;code&gt;.stream()&lt;/code&gt; call can emit node updates, custom progress events, and LLM tokens simultaneously:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for mode, chunk in graph.stream(
    {"question": "What are the key differences between REST and GraphQL for mobile APIs?"},
    stream_mode=["updates", "custom", "messages"],
):
    if mode == "updates":
        # Node completed — chunk is the state delta
        node_name = list(chunk.keys())[0]
        print(f"[node] {node_name} completed")

    elif mode == "custom":
        # Custom progress event from get_stream_writer()
        print(f"[status] {chunk.get('message', chunk)}")

    elif mode == "messages":
        # LLM token — chunk is a tuple of (message_chunk, metadata)
        message_chunk, metadata = chunk
        if hasattr(message_chunk, "content") and message_chunk.content:
            print(message_chunk.content, end="", flush=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that the output shape changes with multi-mode. Single mode (&lt;code&gt;stream_mode="updates"&lt;/code&gt;) yields chunks directly. Multi-mode (&lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt;) yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples. Code that works with single mode breaks with multi-mode because the unpacking is different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Async streaming
&lt;/h2&gt;

&lt;p&gt;For production APIs, use &lt;code&gt;astream&lt;/code&gt; with &lt;code&gt;async for&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from langsmith import traceable


@traceable(name="stream_research", run_type="chain")
async def stream_research(question: str):
    chunks = []
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                chunks.append(message_chunk.content)
                yield {"type": "token", "content": message_chunk.content}
        elif mode == "custom":
            yield {"type": "status", "content": chunk}
        elif mode == "updates":
            yield {"type": "node_update", "content": chunk}


async def main():
    async for event in stream_research("How do vector databases work?"):
        if event["type"] == "token":
            print(event["content"], end="", flush=True)
        else:
            print(f"\n[{event['type']}] {event['content']}")

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  FastAPI + SSE
&lt;/h2&gt;

&lt;p&gt;The standard production pattern is a FastAPI endpoint that converts graph streams to SSE. SSE is one-directional (server to client), works over HTTP/1.1, and auto-reconnects:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langsmith import traceable

app = FastAPI()


@traceable(name="sse_research_stream", run_type="chain")
async def generate_sse(question: str):
    async for mode, chunk in graph.astream(
        {"question": question},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                data = json.dumps({"type": "token", "content": message_chunk.content})
                yield f"data: {data}\n\n"
        elif mode == "custom":
            data = json.dumps({"type": "status", "content": chunk})
            yield f"data: {data}\n\n"
        elif mode == "updates":
            node_name = list(chunk.keys())[0] if chunk else "unknown"
            data = json.dumps({"type": "node_complete", "node": node_name})
            yield f"data: {data}\n\n"

    yield "data: [DONE]\n\n"


@app.post("/research/stream")
async def stream_endpoint(payload: dict):
    return StreamingResponse(
        generate_sse(payload["question"]),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt; in the response headers and &lt;code&gt;proxy_buffering off&lt;/code&gt; in your nginx config. Without these, nginx buffers the entire response before sending it to the client and your streaming pipeline becomes a regular HTTP response.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bugs
&lt;/h2&gt;

&lt;p&gt;These break under load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reverse proxy buffering
&lt;/h3&gt;

&lt;p&gt;You deploy behind nginx or a cloud load balancer. SSE events arrive at the client in one big batch after the stream completes. Cause: proxy buffering is on by default. Set the &lt;code&gt;X-Accel-Buffering&lt;/code&gt; header, disable &lt;code&gt;proxy_buffering&lt;/code&gt; in nginx, and check your cloud provider's load balancer settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Message chunk ordering
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;messages&lt;/code&gt; mode, you receive &lt;code&gt;AIMessageChunk&lt;/code&gt; objects. The &lt;code&gt;content&lt;/code&gt; field is usually a string, except when the model returns tool calls where it's a list of content blocks. Concatenating &lt;code&gt;.content&lt;/code&gt; naively produces garbled output. Check &lt;code&gt;isinstance(message_chunk.content, str)&lt;/code&gt; before concatenating and handle tool-call chunks separately.&lt;/p&gt;
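
&lt;p&gt;A small guard along those lines, assuming Anthropic-style content blocks (dicts with a &lt;code&gt;type&lt;/code&gt; field):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def chunk_text(message_chunk) -&amp;gt; str:
    """Pull only the plain-text part out of an AIMessageChunk."""
    content = message_chunk.content
    if isinstance(content, str):
        return content
    # A list of content blocks (e.g. during tool calls): keep text blocks only.
    return "".join(
        block.get("text", "")
        for block in content
        if isinstance(block, dict) and block.get("type") == "text"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;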

&lt;h3&gt;
  
  
  Backpressure on slow clients
&lt;/h3&gt;

&lt;p&gt;Your agent streams tokens faster than the client can consume them (mobile on 3G, overloaded browser tab). The server-side buffer grows until memory pressure kills the process. Use bounded async queues or configure your ASGI server's per-connection send buffer limits.&lt;/p&gt;
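
&lt;p&gt;A minimal sketch of the bounded-queue approach, reusing &lt;code&gt;graph&lt;/code&gt; from above (the queue size is illustrative): a producer task fills a bounded &lt;code&gt;asyncio.Queue&lt;/code&gt;, and &lt;code&gt;put()&lt;/code&gt; blocks when the consumer falls behind, throttling the stream instead of growing memory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio


async def stream_with_backpressure(question: str, max_buffered: int = 100):
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)

    async def produce():
        async for mode, chunk in graph.astream(
            {"question": question},
            stream_mode=["updates", "custom", "messages"],
        ):
            # put() blocks once max_buffered events are waiting: backpressure.
            await queue.put((mode, chunk))
        await queue.put(None)  # sentinel: stream finished

    producer = asyncio.create_task(produce())
    try:
        while (item := await queue.get()) is not None:
            yield item
    finally:
        producer.cancel()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;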

&lt;h3&gt;
  
  
  Mixed single/multi mode unpacking
&lt;/h3&gt;

&lt;p&gt;Developer switches from &lt;code&gt;stream_mode="updates"&lt;/code&gt; to &lt;code&gt;stream_mode=["updates", "custom"]&lt;/code&gt; and doesn't update the unpacking code. The &lt;code&gt;for chunk in graph.stream(...)&lt;/code&gt; now yields &lt;code&gt;(mode, chunk)&lt;/code&gt; tuples, but the code tries to use the tuple as a dict. No error, just wrong data flowing through. Always use multi-mode from the start, even if you only need one mode today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Stream-based workflows produce many small events. Tag your traces so you can measure stream performance in &lt;a href="https://www.langchain.com/langsmith/observability" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import tracing_context

with tracing_context(
    metadata={
        "stream_mode": "multi",
        "client_type": "web",
        "session_id": "sess_12345",
    },
    tags=["production", "streaming", "v1"],
):
    for mode, chunk in graph.stream(
        {"question": "Explain vector similarity search"},
        stream_mode=["updates", "custom", "messages"],
    ):
        pass  # process chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The LangSmith trace shows per-node timings. Use this to find nodes that are slow to emit their first token (high time-to-first-byte) vs. nodes that produce tokens slowly (low throughput).&lt;/p&gt;

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Streaming doesn't change what the agent produces, it changes how the output is delivered. Evals verify that streamed output matches what &lt;code&gt;invoke()&lt;/code&gt; would return, and that custom events are emitted correctly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langsmith import Client

ls_client = Client()

dataset = ls_client.create_dataset(
    dataset_name="streaming-agent-evals",
    description="Streaming research agent evaluation dataset",
)

ls_client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What are the tradeoffs between REST and GraphQL?"},
        {"question": "How do vector databases enable semantic search?"},
        {"question": "What is retrieval-augmented generation?"},
    ],
    outputs=[
        {"must_mention": ["REST", "GraphQL", "tradeoff"]},
        {"must_mention": ["vector", "embedding", "similarity"]},
        {"must_mention": ["retrieval", "generation", "context"]},
    ],
)


from langsmith import evaluate
from openevals.llm import create_llm_as_judge

QUALITY_PROMPT = """\
User question: {inputs[question]}
Agent response: {outputs[answer]}

Rate 0.0-1.0 on completeness, accuracy, and clarity.
Return ONLY: {{"score": &amp;lt;float&amp;gt;, "reasoning": "&amp;lt;explanation&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)


def coverage(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the response address the key topics?"""
    text = outputs.get("answer", "").lower()
    must_mention = reference_outputs.get("must_mention", [])
    hits = sum(1 for t in must_mention if t.lower() in text)
    return {"key": "coverage", "score": hits / len(must_mention) if must_mention else 1.0}


def stream_completeness(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Does streaming produce the same output as invoke?"""
    streamed = outputs.get("answer", "")
    invoked_result = graph.invoke({"question": inputs["question"]})
    invoked = invoked_result.get("answer", "")
    # Exact match is too strict — LLM outputs vary. Check key content overlap.
    streamed_words = set(streamed.lower().split())
    invoked_words = set(invoked.lower().split())
    if not invoked_words:
        return {"key": "stream_completeness", "score": 1.0}
    overlap = len(streamed_words &amp;amp; invoked_words) / len(invoked_words)
    return {"key": "stream_completeness", "score": min(overlap, 1.0)}


def custom_events_emitted(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Were custom status events emitted during streaming?"""
    events = outputs.get("custom_events", [])
    expected_steps = {"research", "synthesize"}
    seen_steps = {e.get("step") for e in events if isinstance(e, dict)}
    coverage_score = len(seen_steps &amp;amp; expected_steps) / len(expected_steps)
    return {"key": "custom_events", "score": coverage_score}


def target(inputs: dict) -&amp;gt; dict:
    custom_events = []
    answer_chunks = []
    for mode, chunk in graph.stream(
        {"question": inputs["question"]},
        stream_mode=["updates", "custom", "messages"],
    ):
        if mode == "custom":
            custom_events.append(chunk)
        elif mode == "messages":
            message_chunk, metadata = chunk
            if hasattr(message_chunk, "content") and message_chunk.content:
                answer_chunks.append(message_chunk.content)
        elif mode == "updates":
            if "synthesize" in chunk:
                pass  # answer is captured via message chunks

    return {
        "answer": "".join(answer_chunks) if answer_chunks else "",
        "custom_events": custom_events,
    }


results = evaluate(
    target,
    data="streaming-agent-evals",
    evaluators=[quality_judge, coverage, stream_completeness, custom_events_emitted],
    experiment_prefix="streaming-agent-v1",
    max_concurrency=4,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;stream_completeness&lt;/code&gt; verifies that the streaming path produces equivalent output to &lt;code&gt;invoke()&lt;/code&gt;. This catches bugs where stream chunking drops content, like an SSE serializer silently truncating chunks that exceed a size limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Stream
&lt;/h2&gt;

&lt;p&gt;Use streaming for any user-facing agent interaction over 2 seconds, multi-step agents where progress indicators reduce perceived latency, and chat interfaces where token-by-token display is expected.&lt;/p&gt;

&lt;p&gt;Skip it for background jobs with no user waiting, when latency is already under a second, and when the output is structured data rather than natural language.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Three modes in production: &lt;code&gt;updates&lt;/code&gt; for node transitions, &lt;code&gt;custom&lt;/code&gt; for progress events via &lt;code&gt;get_stream_writer()&lt;/code&gt;, and &lt;code&gt;messages&lt;/code&gt; for token streaming. Combine them with &lt;code&gt;stream_mode=["updates", "custom", "messages"]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Deploy behind FastAPI + SSE with &lt;code&gt;X-Accel-Buffering: no&lt;/code&gt;. Watch for reverse proxy buffering, backpressure on slow clients, and the single-to-multi mode unpacking change.  &lt;/p&gt;

&lt;p&gt;Technical References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/focused-dot-io/03-streaming-agents" rel="noopener noreferrer"&gt;Streaming Agent State GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langgraph/streaming" rel="noopener noreferrer"&gt;LangGraph Streaming (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/oss/python/langchain/streaming/overview" rel="noopener noreferrer"&gt;LangChain Streaming Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.langchain.com/langsmith/add-metadata-tags" rel="noopener noreferrer"&gt;LangSmith Tracing Metadata &amp;amp; Tags&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/streaming-agent-state-with-langgraph" rel="noopener noreferrer"&gt;https://focused.io/lab/streaming-agent-state-with-langgraph&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Driving Value with LangSmith Insights</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 23 Apr 2026 19:15:24 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/driving-value-with-langsmith-insights-5bp</link>
      <guid>https://forem.com/focused_dot_io/driving-value-with-langsmith-insights-5bp</guid>
      <description>&lt;p&gt;Imagine you have a deployed agentic system in production. Everything is going well, users are interacting with the product, and there are no critical issues going on. But what comes next? How can we monitor our system to understand what needs to be improved, fixed or built next? &lt;/p&gt;

&lt;p&gt;The first requirement is to have great observability. LangSmith is a great tool for this.&lt;/p&gt;

&lt;p&gt;We can use it to monitor all of our production runs, detect errors and understand how the model behaves across different interactions.&lt;/p&gt;

&lt;p&gt;In October 2025, LangChain released a new feature: &lt;a href="https://www.blog.langchain.com/insights-agent-multiturn-evals-langsmith/" rel="noopener noreferrer"&gt;&lt;strong&gt;Insights Agent&lt;/strong&gt;&lt;/a&gt;. This feature allows an agent to analyze your LangSmith traces and surface usage patterns, common behaviors, and recurring error modes automatically. Instead of manually digging through logs, you can let an agent do the analysis for you. If you want to read more about it, here's a &lt;a href="https://docs.langchain.com/langsmith/insights" rel="noopener noreferrer"&gt;link to the docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to run the Insights Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are going to go through a simple demo of how to use this exciting new tool with a simple chatbot graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Plus or Enterprise LangSmith plan&lt;/li&gt;
&lt;li&gt;A tracing project with a good amount of traces to analyze&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first thing we need to do is go to our LangSmith project. Once there, we are going to see multiple tabs on the top of the screen. Click on the one that says “Insights”.&lt;/p&gt;

&lt;p&gt;If this is our first time running Insights, we are going to see an empty page and a “Create Insight” button. We can go ahead and click it.&lt;/p&gt;

&lt;p&gt;Now, we are presented with two alternatives for how to run the Insights Agent: auto or manual. For the sake of simplicity, let’s start with the “auto” mode.  &lt;/p&gt;

&lt;p&gt;We need to answer the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“What does the agent in this tracing project do?”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt; &lt;em&gt;“What would you like to learn about this agent?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt; &lt;em&gt;“How are traces in this tracing project structured? Are there specific input/output keys to pay attention to?”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This information will be used in our agent prompt, and will help tailor the output to our needs.&lt;/p&gt;

&lt;p&gt;We can also choose if we want to use OpenAI or Anthropic as our provider. As a note, you will need an API key for either provider.&lt;/p&gt;

&lt;p&gt;After we click on “Run Job”, we are going to see a message saying the agent has started running in the background and that we will have our results in a few minutes. If we navigate to the Insights tab we are going to see the agent run in progress as well as the results that start to come out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to understand and use the results&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this example, we are going to be using a chatbot that answers questions about restaurants and helps with making reservations.&lt;/p&gt;

&lt;p&gt;The first part of the output is a summary of the findings. This is going to be the answer to the question we were asked earlier about what we wanted to learn about this agent. In this case, we wanted to understand what customers were asking the chatbot in order to identify user patterns.&lt;/p&gt;

&lt;p&gt;We can see that in this example, 57% of the questions being asked to our chatbot are about feature discovery, 29% are about operating hours, and only 14% are about making reservations.&lt;/p&gt;

&lt;p&gt;This kind of result is interesting because it helps us understand what customers actually need. Maybe we initially assumed that most questions would be about making reservations, but this data doesn’t support that. &lt;strong&gt;LangSmith Insights is critical because it grounds our product decisions in real user behavior, helping us invest engineering effort where it delivers the most value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we click on the “Hide Findings” button, we can do a deep dive into the traces, broken down by category.&lt;/p&gt;

&lt;p&gt;If we click on any of the categories we can see all runs within that category and navigate to the trace we are interested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using evaluation + Insights to get the highest impact on value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we are comfortable with the categories of our generated insights, we can build evaluation datasets that mirror those categories. This way, we can understand how well our agent is answering questions across categories.&lt;/p&gt;

&lt;p&gt;Why do insights change this process? Imagine we run our evaluations and we discover the agent is only answering 40% of questions around reservations correctly. But insights reveal that reservation questions are actually the least common user queries. That context lowers the overall criticality of the issue and helps us prioritize fixes more intelligently.&lt;/p&gt;

&lt;p&gt;Insights add context to the analysis, but they don’t override business requirements. This is only an example: Depending on the use case, a low-frequency category like reservations may still demand zero errors if the business impact is high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have gone through a simple example to illustrate the power of this tool. But as we’ve seen, we can ask the agent virtually any question we want. For example, we could ask, &lt;em&gt;“What types of questions is my agent hallucinating on or answering incorrectly?”&lt;/em&gt; and the agent will find all traces that match those criteria. This is extremely flexible and powerful.&lt;/p&gt;

&lt;p&gt;LangSmith is still king when it comes to building and observing production-grade AI applications, and features like this are why I encourage you to try it out and keep creating amazing applications with it!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://focused.io/lab/driving-value-with-langsmith-insights" rel="noopener noreferrer"&gt;https://focused.io/lab/driving-value-with-langsmith-insights&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Most Teams Don't Have a Data Flywheel</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 22 Apr 2026 18:56:27 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</link>
      <guid>https://forem.com/focused_dot_io/most-teams-dont-have-a-data-flywheel-33o4</guid>
      <description>&lt;p&gt;&lt;em&gt;LangChain shows how the loop works. Here's why it stalls in production and what it actually takes to make it compound.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Austin Vance, CEO of &lt;a href="https://focused.io" rel="noopener noreferrer"&gt;Focused&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;LangChain has been pushing a clear idea: production data should make your agents better.&lt;/p&gt;

&lt;p&gt;The loop looks like this: production traces capture real behavior, those traces become datasets, evaluators score performance, feedback improves those evaluators, and improvements get deployed back into the system. Over time, the system compounds.&lt;/p&gt;

&lt;p&gt;That is the data flywheel.&lt;/p&gt;

&lt;p&gt;And it is directionally right.&lt;/p&gt;

&lt;p&gt;But most teams building agents today are not seeing that compounding effect. The loop exists on paper. In practice, it stalls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Data Flywheel Actually Is
&lt;/h2&gt;

&lt;p&gt;In the LangChain ecosystem, especially with LangSmith, the flywheel connects three things: observability, evaluation, and iteration.&lt;/p&gt;

&lt;p&gt;Production traces become the source of truth. Failures are turned into datasets. Datasets become regression tests. Evaluators score performance at scale. Feedback improves those evaluators over time.&lt;/p&gt;

&lt;p&gt;The goal is simple: every production interaction should become an improvement signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where It Breaks
&lt;/h2&gt;

&lt;p&gt;The issue is not the idea. The issue is that most teams never fully implement the system required to make it work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Traces are collected, but nothing happens.&lt;/strong&gt; Teams instrument their agents. They capture inputs, outputs, tool calls, and intermediate steps. And then it stops there. The missing step is turning traces into something actionable — structured datasets, labeled failures, repeatable test cases. Without that, you are not building a flywheel. You are just logging behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. There is no real evaluation layer.&lt;/strong&gt; This is where most teams stall. They review outputs manually. They rely on intuition. They make changes based on what "looks better." There is no automated evaluation, no regression testing, no baseline performance. So when something changes, there is no way to know if it improved or regressed. If you cannot measure it, the loop does not spin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Evaluators are not trusted.&lt;/strong&gt; Even when teams introduce evaluation, it often breaks down. LLM-as-a-judge systems can scale evaluation, but only if they are clearly defined, calibrated against human feedback, and continuously refined. Without that, evaluator output becomes noisy. And noisy signals lead to random changes. If you do not trust your evaluation layer, you cannot rely on your flywheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The loop never actually closes.&lt;/strong&gt; Even when failures are identified, prompts get tweaked ad hoc, changes are not versioned, and fixes are not tested against past failures. So nothing compounds. A real loop looks like this: a failure is captured, the failure becomes a dataset, the dataset is evaluated, a change is applied, and the change is tested against that dataset. If you skip any step, the loop breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. There is no real production pressure.&lt;/strong&gt; This is the quiet failure that kills most flywheels. If your agent is not embedded in a real system, you do not get meaningful traffic, you do not see real edge cases, and you do not generate useful data. Internal demos do not create real signals. Without real usage, the flywheel has nothing to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Real Data Flywheel Looks Like
&lt;/h2&gt;

&lt;p&gt;At a system level, this is not a concept. It is a pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instrumentation.&lt;/strong&gt; Every step of the agent is observable — inputs, decisions, state transitions, outputs. Using structured systems like LangGraph makes this consistent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset creation.&lt;/strong&gt; Production traces are turned into labeled examples, categorized failures, and reusable datasets. This is where the loop actually begins.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation.&lt;/strong&gt; You define what "good" looks like and measure it — correctness, tool selection, completion quality. Evaluations run continuously, not just during development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calibration.&lt;/strong&gt; Evaluators improve over time. Human feedback corrects them, agreement is measured, alignment increases. This step is critical and often skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration and deployment.&lt;/strong&gt; Changes are applied intentionally — to prompts, graph structure, and tool logic. Then tested against historical failures before being deployed. Only validated improvements ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift Most Teams Need to Make
&lt;/h2&gt;

&lt;p&gt;The data flywheel is often described like a product feature. That is the problem.&lt;/p&gt;

&lt;p&gt;It is not something you turn on. It is an engineering system that connects observability, evaluation, feedback, and deployment into a continuous loop. Without that system, you do not have a flywheel. You have logs and intuition.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Most teams do not have a data flywheel. They have a growing pile of traces and a sense that things might be improving.&lt;/p&gt;

&lt;p&gt;The teams that actually get better over time treat this differently. They build the system that makes improvement inevitable.&lt;/p&gt;

&lt;p&gt;If your agent only records what happened, it will stall. If your system learns from what happened, it compounds.&lt;/p&gt;

&lt;p&gt;That is the difference.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>langchain</category>
      <category>programming</category>
    </item>
    <item>
      <title>LangGraph Error Handling Patterns for Production AI Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Tue, 21 Apr 2026 18:53:58 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</link>
      <guid>https://forem.com/focused_dot_io/langgraph-error-handling-patterns-for-production-ai-agents-33p7</guid>
      <description>&lt;p&gt;You have a document processing pipeline. It ingests contracts, extracts key clauses, validates them against policy, and generates a summary. Monday morning it processes 200 documents without a hiccup. Tuesday at 2 AM, Anthropic’s API returns a 429, the extraction node throws, and &lt;strong&gt;the entire pipeline stops.&lt;/strong&gt; Not just the one document — the whole batch. Your on-call engineer spends 45 minutes figuring out it was a transient rate limit that would have resolved itself with a 2-second backoff.&lt;/p&gt;

&lt;p&gt;The fix isn’t “add a try/except.” The fix is classifying errors by who can fix them and routing each class to the right handler. LangGraph gives you the primitives — &lt;code&gt;RetryPolicy&lt;/code&gt;, &lt;code&gt;Command&lt;/code&gt;, &lt;code&gt;interrupt()&lt;/code&gt;, and &lt;code&gt;ToolNode&lt;/code&gt; error handling — but the framework won’t decide your error strategy for you. That’s on you.&lt;/p&gt;

&lt;p&gt;This post shows the four error classes, the LangGraph primitives for each, and the production failures that surface when you get the classification wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Production Error Handling Classification Matrix
&lt;/h2&gt;

&lt;p&gt;Not all errors are equal. The single most important decision in your error-handling strategy is: &lt;strong&gt;who fixes this?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Class&lt;/th&gt;
&lt;th&gt;Who Fixes It&lt;/th&gt;
&lt;th&gt;LangGraph Primitive&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transient&lt;/td&gt;
&lt;td&gt;System (automatic)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RetryPolicy&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;API 429, network timeout, DNS blip&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM-Recoverable&lt;/td&gt;
&lt;td&gt;The LLM&lt;/td&gt;
&lt;td&gt;Error in state + loop back&lt;/td&gt;
&lt;td&gt;Tool returned bad JSON, wrong tool chosen&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User-Fixable&lt;/td&gt;
&lt;td&gt;The human&lt;/td&gt;
&lt;td&gt;&lt;code&gt;interrupt()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing required field, ambiguous input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected&lt;/td&gt;
&lt;td&gt;The developer&lt;/td&gt;
&lt;td&gt;Let it bubble up&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TypeError&lt;/code&gt;, schema mismatch, logic bug&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Getting this wrong costs you. Retrying a user-fixable error wastes 3 attempts and 6 seconds before failing anyway. Interrupting for a transient error pages a human to click “retry” on something that would have fixed itself. Swallowing an unexpected error hides a real bug behind a generic fallback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We’re building a document processing pipeline that extracts clauses from contracts, validates them, and generates summaries. Each node has a different error profile:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import operator
from typing import Annotated, TypedDict

from langchain_anthropic import ChatAnthropic
from langchain_core.messages import AnyMessage
from langsmith import traceable

llm = ChatAnthropic(model="claude-sonnet-4-5-20250929", temperature=0)


class PipelineState(TypedDict):
    document: str
    messages: Annotated[list[AnyMessage], operator.add]
    extracted_clauses: list[dict]
    validation_errors: list[str]
    retry_count: int
    final_summary: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The extraction node calls tools and hits APIs — it gets transient errors and tool failures. The validation node needs complete data — it surfaces user-fixable gaps. The summarizer is the least error-prone but still needs retry protection.&lt;/p&gt;
&lt;h2&gt;
  
  
  State: Track Errors Explicitly
&lt;/h2&gt;

&lt;p&gt;The key insight: &lt;strong&gt;errors are data, not just exceptions.&lt;/strong&gt; Store them in state so the LLM can see what went wrong and adjust its approach.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.types import RetryPolicy

# Aggressive retry for flaky external APIs
api_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    max_interval=10.0,
    jitter=True,
)

# Conservative retry for LLM calls (they're expensive)
llm_retry = RetryPolicy(
    max_attempts=3,
    initial_interval=0.5,
    backoff_factor=2.0,
    max_interval=5.0,
    jitter=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Pattern 1: Transient Errors with RetryPolicy
&lt;/h2&gt;

&lt;p&gt;API rate limits, network blips, DNS hiccups. These fix themselves. Don’t write code for them — configure them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from httpx import HTTPStatusError


def should_retry(error: Exception) -&amp;gt; bool:
    if isinstance(error, HTTPStatusError):
        return error.response.status_code in (429, 502, 503)
    return False


selective_retry = RetryPolicy(
    max_attempts=5,
    initial_interval=1.0,
    backoff_factor=2.0,
    retry_on=should_retry,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RetryPolicy&lt;/code&gt; parameters worth knowing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_attempts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Total attempts including the first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;initial_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;Seconds before first retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;backoff_factor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;Multiplier per retry (exponential backoff)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_interval&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;128.0&lt;/td&gt;
&lt;td&gt;Cap on wait time between retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;jitter&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;True&lt;/td&gt;
&lt;td&gt;Randomize wait to avoid thundering herd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_on&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(default exceptions)&lt;/td&gt;
&lt;td&gt;Exception types or callable to filter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;retry_on&lt;/code&gt; parameter is where most people get it wrong. The default retries on common network/transient exceptions. If you need to retry on a custom exception type:&lt;/p&gt;
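
&lt;p&gt;For example, a sketch with a hypothetical custom exception:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class UpstreamFlakyError(Exception):
    """Hypothetical transient error raised by an internal service client."""


flaky_retry = RetryPolicy(
    max_attempts=4,
    retry_on=(UpstreamFlakyError,),  # exception types work here as well as a callable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tools below raise descriptive &lt;code&gt;ValueError&lt;/code&gt;s instead of failing silently: error messages written for the LLM to read and act on.&lt;/p&gt;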


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract a specific clause from contract text.

    Args:
        text: The contract text to search.
        clause_type: One of &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;valid_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;liability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;indemnification&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid clause_type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Must be one of: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;valid_types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clause_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; clause from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if a clause complies with a specific regulation.

    Args:
        clause: The clause text to check.
        regulation: The regulation identifier (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;GDPR-Art17&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SOX-302&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clause&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Empty clause text provided. Extract the clause first.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regulation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regulation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No issues found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;extract_clause&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_compliance&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# handle_tool_errors=True: catch exceptions, return error as ToolMessage
&lt;/span&gt;&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;strong&gt;The superstep transaction rule:&lt;/strong&gt; LangGraph executes parallel branches in supersteps. If any branch in a superstep raises an exception, &lt;strong&gt;none of the state updates from that superstep apply.&lt;/strong&gt; Successful branches are checkpointed and won’t re-execute on retry, but the state snapshot rolls back to before the superstep started. This means a flaky API in one branch can block state updates from an unrelated branch that succeeded. &lt;code&gt;RetryPolicy&lt;/code&gt; per node keeps one bad branch from poisoning the whole superstep.&lt;/p&gt;
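&lt;p&gt;A minimal sketch of that rollback, using toy nodes so the behavior is easy to see. Both branches fan out from &lt;code&gt;START&lt;/code&gt; and run in one superstep; the failing branch takes the successful one’s update down with it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class DemoState(TypedDict):
    metadata: str
    clauses: str

def extract_metadata(state: DemoState) -&amp;gt; dict:
    return {"metadata": "ok"}  # succeeds

def extract_clauses(state: DemoState) -&amp;gt; dict:
    raise RuntimeError("rate limited")  # fails in the same superstep

demo_builder = StateGraph(DemoState)
demo_builder.add_node("extract_metadata", extract_metadata)
demo_builder.add_node("extract_clauses", extract_clauses)
demo_builder.add_edge(START, "extract_metadata")  # parallel branch 1
demo_builder.add_edge(START, "extract_clauses")   # parallel branch 2
demo_builder.add_edge("extract_metadata", END)
demo_builder.add_edge("extract_clauses", END)
demo = demo_builder.compile()

try:
    demo.invoke({"metadata": "", "clauses": ""})
except RuntimeError:
    # The whole superstep rolled back: the "metadata" update never applied
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;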
&lt;h2&gt;
  
  
  Pattern 2: LLM-Recoverable Errors with ToolNode
&lt;/h2&gt;

&lt;p&gt;Tool calls fail. The LLM picks the wrong tool, passes bad arguments, or the tool returns something unparseable. The fix isn’t retrying the exact same call — it’s letting the LLM see what went wrong and try a different approach.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ToolNode&lt;/code&gt; from &lt;code&gt;langgraph.prebuilt&lt;/code&gt; has a &lt;code&gt;handle_tool_errors&lt;/code&gt; parameter. Set it to &lt;code&gt;True&lt;/code&gt; to catch tool exceptions and return the error message as a &lt;code&gt;ToolMessage&lt;/code&gt;, or pass a callable to control exactly what the LLM reads back. A custom formatter steers the model toward fixing its arguments instead of repeating them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool failed with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review your arguments and try again. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check the tool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s docstring for valid parameter values.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tool_node_custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;format_tool_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the agent side, the system prompt primes the LLM to treat tool errors as feedback rather than dead ends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a contract analysis agent. Use the provided tools to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract and validate clauses. If a tool returns an error, read &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the error message carefully and adjust your arguments. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Available clause types: termination, liability, indemnification, payment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent node calls the LLM, which may invoke tools. When a tool fails, &lt;code&gt;handle_tool_errors&lt;/code&gt; catches the exception and sends it back as a &lt;code&gt;ToolMessage&lt;/code&gt;, and the LLM usually retries with corrected arguments. Validation failures are a different class entirely: no rephrased tool call will conjure a missing clause, so the validation node collects the problems and escalates to a human with &lt;code&gt;interrupt()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Command&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No clauses extracted from document.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;found_types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
    &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;required_types&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;found_types&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required clause types: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;low_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;low_confidence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low confidence extractions for: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;human_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document validation failed. Please review and provide corrections.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][:&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;human_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corrected_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation_errors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pattern 3: User-Fixable Errors with interrupt()
&lt;/h2&gt;

&lt;p&gt;Some errors can’t be fixed by the system or the LLM. The document is missing a signature date. The clause references an undefined term. The input is ambiguous. These need a human.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;interrupt()&lt;/code&gt; pauses graph execution, saves state to the checkpointer, and returns a payload to the caller. When the human provides input, you resume with &lt;code&gt;Command(resume=...)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.memory import InMemorySaver

checkpointer = InMemorySaver()

# First invocation — runs until interrupt
config = {"configurable": {"thread_id": "contract-review-42"}}
result = graph.invoke(
    {"document": "...", "messages": [], "extracted_clauses": [], "validation_errors": [], "retry_count": 0, "final_summary": ""},
    config,
)

# Check for interrupt
if "__interrupt__" in result:
    print("Human input needed:", result["__interrupt__"])

# Resume with corrections
corrected = Command(resume={
    "corrected_clauses": [
        {"clause_type": "termination", "text": "Either party may terminate...", "confidence": 0.95},
        {"clause_type": "payment", "text": "Payment due within 30 days...", "confidence": 0.98},
    ]
})
final_result = graph.invoke(corrected, config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In production, wrap the invocation in a traced entry point so every run carries searchable metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

@traceable(name="process_document", run_type="chain")
def process_document(document: str, thread_id: str) -&amp;gt; dict:
    config = {"configurable": {"thread_id": thread_id}}
    with tracing_context(
        metadata={"document_length": len(document), "thread_id": thread_id},
        tags=["production", "document-pipeline"],
    ):
        return graph.invoke(
            {
                "document": document,
                "messages": [HumanMessage(content=f"Process this contract:\n\n{document}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Critical detail:&lt;/strong&gt; &lt;code&gt;interrupt()&lt;/code&gt; requires a checkpointer. Without one, the state is lost and you can’t resume. Use &lt;code&gt;InMemorySaver&lt;/code&gt; for development and a durable checkpointer (Postgres, SQLite) for production. Forgetting the checkpointer is a silent failure — the graph runs fine until you actually need to resume, and then it has no idea where it left off.&lt;/p&gt;
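&lt;p&gt;A sketch of the durable setup, assuming the &lt;code&gt;langgraph-checkpoint-postgres&lt;/code&gt; package, a reachable Postgres instance, and the &lt;code&gt;builder&lt;/code&gt; from the assembly section below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langgraph.checkpoint.postgres import PostgresSaver

DB_URI = "postgresql://user:pass@localhost:5432/agents"  # placeholder connection string

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # interrupt() and Command(resume=...) now survive restarts and load balancing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;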
&lt;h2&gt;
  
  
  Pattern 4: Building Fault-Tolerant Agents — Let Unexpected Errors Bubble
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;TypeError&lt;/code&gt;, &lt;code&gt;KeyError&lt;/code&gt;, schema mismatches, logic bugs. Don’t catch these. Don’t retry them. Don’t interrupt for them. &lt;strong&gt;Let them crash.&lt;/strong&gt; A retry just wastes time on an error that will never self-resolve. A human interrupt pages someone to look at a bug that should be in your issue tracker.&lt;/p&gt;

&lt;p&gt;The only thing to do with unexpected errors is make them observable. The summarizer below deliberately does not defend against a malformed clause dict: a missing key raises &lt;code&gt;KeyError&lt;/code&gt;, the node crashes, and the &lt;code&gt;@traceable&lt;/code&gt; decorator makes sure the crash lands in LangSmith with full context:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clauses_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the following contract clauses into a concise executive summary. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flag any compliance concerns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Contract clauses:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;clauses_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;When it crashes, you get a full trace in LangSmith with the document content, the exact node that failed, and every intermediate state. That’s a 5-minute investigation, not a 2-hour log-grepping session.&lt;/p&gt;
&lt;h2&gt;
  
  
  Graph Assembly
&lt;/h2&gt;

&lt;p&gt;Here’s where the error classification meets the graph structure. Each node gets the retry strategy that matches its error profile, and the summarizer, simple as it is, gets &lt;code&gt;RetryPolicy&lt;/code&gt; protection because it calls the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clause_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PipelineState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent node: LLM retry (expensive, conservative)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tool node: API retry (cheap, aggressive) + error handling for tool failures
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Post-tool processing
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;post_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Validation: no retry (errors here are user-fixable, not transient)
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validate_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Summarizer: LLM retry
&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summarize_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm_retry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post_tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the Pipeline
&lt;/h2&gt;

&lt;p&gt;Kick off a traced run with metadata that names the pipeline version and error strategy, so failures are easy to slice later:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import tracing_context

with tracing_context(
    metadata={
        "pipeline_version": "v3",
        "error_strategy": "classified",
        "document_type": "contract",
    },
    tags=["production", "error-handling-v3"],
):
    result = process_document(
        document="Sample contract text...",
        thread_id="contract-42",
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice: in the assembly, the validation node has &lt;strong&gt;no retry policy.&lt;/strong&gt; Retrying a missing-clause error 3 times won’t make the clause appear. That’s a user-fixable problem that needs &lt;code&gt;interrupt()&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production Failures
&lt;/h2&gt;

&lt;p&gt;These are the error-handling mistakes that make it past code review and into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retrying User-Fixable Errors.&lt;/strong&gt; The pipeline retries an extraction 3 times, burning 8 seconds and 3 LLM calls, before finally failing with the same “missing payment clause” error. The document genuinely doesn’t have a payment clause. No amount of retrying will create one. Fix: classify the error before choosing the handler. If the document is missing required content, &lt;code&gt;interrupt()&lt;/code&gt; immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Swallowing Unexpected Errors.&lt;/strong&gt; A developer wraps the entire graph invocation in &lt;code&gt;try/except Exception: return {"error": "Something went wrong."}&lt;/code&gt;. Now every &lt;code&gt;TypeError&lt;/code&gt;, every &lt;code&gt;KeyError&lt;/code&gt;, every schema mismatch disappears into a generic error message. The LangSmith trace shows the node completed “successfully” — because from the graph’s perspective, it did. It returned a value. The bug lives in production for weeks until someone notices the output quality degraded. Fix: only catch the specific exception types you know how to handle. Let everything else crash loudly.&lt;/p&gt;
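&lt;p&gt;A minimal sketch of that fix. The exception types are illustrative; catch only what your stack actually throws transiently, and give &lt;code&gt;initial_state&lt;/code&gt; whatever shape your pipeline uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Catch only exceptions you have a recovery story for; let bugs crash
try:
    result = graph.invoke(initial_state, config)
except (TimeoutError, ConnectionError) as e:
    # Known-transient failures get a degraded-but-honest result
    result = {"final_summary": "", "validation_errors": [f"Transient failure: {e}"]}
# TypeError, KeyError, and schema mismatches are NOT caught: they bubble to the tracer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;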

&lt;p&gt;&lt;strong&gt;3. Superstep Transaction Surprise.&lt;/strong&gt; You have two parallel branches: clause extraction and metadata extraction. The metadata branch succeeds, but the clause branch hits a rate limit and throws. You expect the metadata to be saved — it succeeded, after all. But superstep transactions mean &lt;strong&gt;neither update applies.&lt;/strong&gt; The entire superstep rolls back. Your metadata extraction re-runs on retry (if you have &lt;code&gt;RetryPolicy&lt;/code&gt;) or is lost entirely (if you don’t). Fix: put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that can fail transiently. LangGraph checkpoints successful nodes within a superstep so they don’t re-execute, but the state update is still atomic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Interrupt Without Checkpointer.&lt;/strong&gt; You add &lt;code&gt;interrupt()&lt;/code&gt; to your validation node and test it locally. Works great. Deploy to production without a persistent checkpointer (or with &lt;code&gt;InMemorySaver&lt;/code&gt; behind a load balancer). The interrupt pauses the graph, the user provides corrections, and... the graph starts from scratch because the in-memory state was on a different server instance. Fix: use a durable checkpointer (&lt;code&gt;PostgresSaver&lt;/code&gt;, &lt;code&gt;SqliteSaver&lt;/code&gt;) in production. &lt;code&gt;InMemorySaver&lt;/code&gt; is for tests only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Error Recovery Loop Explosion.&lt;/strong&gt; The LLM fails to call a tool correctly, the error goes back to the LLM, the LLM tries again with slightly different wrong arguments, the error goes back again. After 15 loops and $2 in API costs, you hit the recursion limit. Fix: add a &lt;code&gt;retry_count&lt;/code&gt; to state. After 3 LLM-recovery attempts, escalate to &lt;code&gt;interrupt()&lt;/code&gt; or fail with a clear error message.&lt;/p&gt;
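&lt;p&gt;A sketch of that escalation guard, assuming &lt;code&gt;retry_count&lt;/code&gt; is incremented in state on each recovery pass (wiring the node into the graph is left out):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
MAX_LLM_RECOVERIES = 3  # after this, stop paying for the loop

def recovery_guard(state: PipelineState) -&amp;gt; dict:
    """Route exhausted LLM-recovery loops to a human instead of retry #15."""
    if state.get("retry_count", 0) &amp;gt;= MAX_LLM_RECOVERIES:
        human_input = interrupt({
            "type": "recovery_exhausted",
            "message": f"LLM could not recover after {MAX_LLM_RECOVERIES} attempts.",
        })
        return {
            "retry_count": 0,
            "extracted_clauses": human_input.get("corrected_clauses", []),
        }
    return {"retry_count": state.get("retry_count", 0) + 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;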
&lt;h2&gt;
  
  
  Observability
&lt;/h2&gt;

&lt;p&gt;Error handling without observability is guesswork. Start by building a LangSmith dataset that exercises every error path: a clean contract, a document missing required clauses, and an empty document:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error-handling-evals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document processing pipeline error handling evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This contract between Party A and Party B includes: Termination: Either party may terminate with 30 days notice. Payment: Net 30 terms apply.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This agreement covers liability limitations and indemnification clauses only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;missing_clauses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;termination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;should_succeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;empty_document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The &lt;code&gt;@traceable&lt;/code&gt; decorator on every node means each run tells its full story in LangSmith: which node failed, the inputs it received, and the error it raised.&lt;/p&gt;

&lt;p&gt;Filter by the &lt;code&gt;error-handling-v3&lt;/code&gt; tag to compare error rates across pipeline versions. If v3 has fewer interrupts but more retries, your error classification improved — transient errors are being handled automatically instead of paging humans.&lt;/p&gt;
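&lt;p&gt;One way to pull those numbers programmatically. The project name is a placeholder, and the filter string follows LangSmith’s run-query syntax:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import Client

client = Client()
runs = list(client.list_runs(
    project_name="document-pipeline",  # placeholder project name
    filter='has(tags, "error-handling-v3")',
))
error_rate = sum(1 for r in runs if r.error) / max(len(runs), 1)
print(f"v3 error rate: {error_rate:.1%} across {len(runs)} runs")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;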

&lt;h2&gt;
  
  
  Evals
&lt;/h2&gt;

&lt;p&gt;Test error recovery paths the same way you test happy paths. Three evaluators: one for successful processing, one for error classification accuracy, and one LLM-as-judge for output quality under failure conditions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_core.messages import HumanMessage
from langsmith.evaluation import evaluate
from openevals.llm import create_llm_as_judge

# `graph` is the compiled pipeline from earlier in this post.

QUALITY_PROMPT = """\
Document: {inputs[document]}
Pipeline output: {outputs[final_summary]}

Rate 0.0-1.0 on:
- Completeness: Did the summary cover all extracted clauses?
- Accuracy: Are the clause descriptions faithful to the source?
- Error handling: If the document was incomplete, did the pipeline flag it appropriately?

Return ONLY: {{"score": &amp;lt;0.0-1.0&amp;gt;, "reasoning": "&amp;lt;one sentence&amp;gt;"}}"""

quality_judge = create_llm_as_judge(
    prompt=QUALITY_PROMPT,
    model="anthropic:claude-sonnet-4-5-20250929",
    feedback_key="quality",
)

def error_classification(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Did the pipeline correctly classify errors?"""
    should_succeed = reference_outputs.get("should_succeed", True)
    has_summary = bool(outputs.get("final_summary"))
    has_errors = bool(outputs.get("validation_errors"))
    if should_succeed:
        score = 1.0 if has_summary and not has_errors else 0.0
    else:
        score = 1.0 if has_errors or not has_summary else 0.0
    return {"key": "error_classification", "score": score}

def recovery_efficiency(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """How many retries did it take? Lower is better."""
    retry_count = outputs.get("retry_count", 0)
    if retry_count == 0:
        score = 1.0
    elif retry_count &amp;lt;= 2:
        score = 0.7
    else:
        score = 0.3
    return {"key": "recovery_efficiency", "score": score}

def target(inputs: dict) -&amp;gt; dict:
    config = {"configurable": {"thread_id": inputs["thread_id"]}}
    try:
        return graph.invoke(
            {
                "document": inputs["document"],
                "messages": [HumanMessage(content=f"Process this contract:\n\n{inputs['document']}")],
                "extracted_clauses": [],
                "validation_errors": [],
                "retry_count": 0,
                "final_summary": "",
            },
            config,
        )
    except Exception as e:
        return {"final_summary": "", "validation_errors": [str(e)], "retry_count": 0}

results = evaluate(
    target,
    data="error-handling-evals",
    evaluators=[quality_judge, error_classification, recovery_efficiency],
    experiment_prefix="error-handling-v1",
    max_concurrency=2,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The graph under test, with its two recovery paths:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Ingest] → [Extract Clauses] → [Validate] → [Summarize] → END
                 ↑                  |
                 |                  ↓
                 ←──── (tool error: retry with context)
                                    |
                                    ↓
                      (missing info: interrupt for human)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;&lt;code&gt;recovery_efficiency&lt;/code&gt; catches the error-loop explosion problem. If your average retry count creeps above 2, your error classification is wrong — you’re retrying things that should interrupt or bubble up.&lt;/p&gt;
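&lt;p&gt;That check is easy to automate after each experiment. A rough sketch, assuming the object returned by &lt;code&gt;evaluate()&lt;/code&gt; is iterated row by row, as recent LangSmith SDKs allow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scores = [
    r.score
    for row in results
    for r in row["evaluation_results"]["results"]
    if r.key == "recovery_efficiency"
]

avg = sum(scores) / len(scores) if scores else 0.0
# 0.7 is the 1-2 retry band in the evaluator above; a lower mean
# means you are routinely burning 3+ retries per document.
if avg &amp;lt; 0.7:
    print(f"mean recovery_efficiency {avg:.2f}: reclassify your error types")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;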

&lt;h2&gt;
  
  
  When to Use This
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use classified error handling when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline calls external APIs that can return transient errors&lt;/li&gt;
&lt;li&gt;Documents have variable quality and may be missing required fields&lt;/li&gt;
&lt;li&gt;You need human-in-the-loop for ambiguous or incomplete inputs&lt;/li&gt;
&lt;li&gt;You're running batch processing where one failure shouldn't kill the batch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skip it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your pipeline is a single LLM call with no tools&lt;/li&gt;
&lt;li&gt;Every error is the same type (all transient, all user-fixable)&lt;/li&gt;
&lt;li&gt;You're prototyping and don't need production resilience yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;The error classification matrix is the whole strategy: transient errors get &lt;code&gt;RetryPolicy&lt;/code&gt;, LLM-recoverable errors get stored in state and looped back, user-fixable errors get &lt;code&gt;interrupt()&lt;/code&gt;, and unexpected errors crash loudly. Four patterns, four primitives, zero catch-all try/excepts.&lt;/p&gt;

&lt;p&gt;The mistake everyone makes is treating errors as a single category. You either retry everything (wasting time and money) or catch everything (hiding bugs). The classification forces you to ask “who fixes this?” for every failure mode, and that question is worth more than any amount of retry logic.&lt;/p&gt;

&lt;p&gt;Put &lt;code&gt;RetryPolicy&lt;/code&gt; on every node that touches a network. Put &lt;code&gt;handle_tool_errors=True&lt;/code&gt; on your &lt;code&gt;ToolNode&lt;/code&gt;. Put &lt;code&gt;interrupt()&lt;/code&gt; on validation failures. Let everything else crash. Ship the &lt;code&gt;recovery_efficiency&lt;/code&gt; eval before you ship the pipeline.  &lt;/p&gt;
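&lt;p&gt;For reference, the wiring is small. A minimal sketch of the four patterns, with placeholder node functions, tools, and state rather than the full pipeline above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode
from langgraph.types import RetryPolicy, interrupt

def validate(state: PipelineState) -&amp;gt; dict:
    errors = find_missing_clauses(state)  # hypothetical helper
    if errors:
        # User-fixable: pause the graph and wait for a human to resume.
        fixed = interrupt({"validation_errors": errors})
        return {"document": fixed["document"]}
    return {}

builder = StateGraph(PipelineState)
# Transient network errors: retry the node.
builder.add_node("extract", extract_clauses, retry=RetryPolicy(max_attempts=3))
# Tool errors: converted to ToolMessages the LLM can react to.
builder.add_node("tools", ToolNode(tools, handle_tool_errors=True))
builder.add_node("validate", validate)
# No catch-all try/except anywhere else: unexpected errors crash loudly.
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;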

&lt;p&gt;&lt;strong&gt;Technical References:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/08-error-handling-production/tree/c7fd49401a5c9adf03cd0f90ac08117820013f1e#article" rel="noopener noreferrer"&gt;LangGraph Agent Error Handling in Production GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://concepts" rel="noopener noreferrer"&gt;LangGraph Retry Policy (Handling Retries)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/tool-calling/" rel="noopener noreferrer"&gt;LangGraph Tool Calling and ToolNode&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://langchain-ai.github.io/langgraph/how-tos/human-in-the-loop/" rel="noopener noreferrer"&gt;LangGraph Human-in-the-Loop (interrupt)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Evaluation Pipelines for LangGraph Agents</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Thu, 16 Apr 2026 00:43:37 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</link>
      <guid>https://forem.com/focused_dot_io/evaluation-pipelines-for-langgraph-agents-2aoi</guid>
      <description>&lt;p&gt;You changed a system prompt. It looks better on the three examples you tried. You ship it. Tuesday morning, support tickets spike. The agent is now hallucinating policy details on a class of queries you didn’t test. You revert, but 400 users already got bad answers.&lt;/p&gt;

&lt;p&gt;This is not a testing problem. You have unit tests.&lt;/p&gt;

&lt;p&gt;This is an evaluation problem.&lt;/p&gt;

&lt;p&gt;Traditional tests check “does the code run.” Evals check “is the output good.” For LLM applications, you need a clear verdict: pass or fail. Not a 0.73. Not “mostly correct.” The agent either got the answer right or it didn’t.&lt;/p&gt;

&lt;p&gt;Binary evaluators give you that clarity. More importantly, they give your CI pipeline a gate that actually means something.&lt;/p&gt;
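&lt;p&gt;A sketch of what that gate can look like, using names (&lt;code&gt;target&lt;/code&gt;, &lt;code&gt;correctness_judge&lt;/code&gt;) defined later in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys

from langsmith.evaluation import evaluate

results = evaluate(
    target,
    data="qa-agent-evals-v1",
    evaluators=[correctness_judge],
)

rows = list(results)
passed = sum(
    1
    for row in rows
    for r in row["evaluation_results"]["results"]
    if r.score
)
total = sum(len(row["evaluation_results"]["results"]) for row in rows)

# Hypothetical threshold; pick one that matches your current baseline.
if total and passed / total &amp;lt; 0.95:
    sys.exit(f"Eval pass rate {passed}/{total} below threshold. Failing the build.")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;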

&lt;p&gt;The cost of not having evals is not “we might ship a bad prompt.” It is that you have no idea if any prompt is good.&lt;/p&gt;

&lt;p&gt;LangSmith gives you the pieces: datasets with versioned examples, custom evaluators (deterministic and LLM-as-judge), trajectory evaluation for agent behavior, experiment comparison across runs, and production trace monitoring.&lt;/p&gt;

&lt;p&gt;This post builds the whole pipeline, from dataset to CI regression detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Eval Tax
&lt;/h3&gt;

&lt;p&gt;Every team resists evals because they seem expensive. Here's the actual math:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Activity&lt;/th&gt;
&lt;th&gt;Time Cost&lt;/th&gt;
&lt;th&gt;Without Evals&lt;/th&gt;
&lt;th&gt;With Evals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt change&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;td&gt;Ship and pray&lt;/td&gt;
&lt;td&gt;Run eval suite, check pass rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regression discovery&lt;/td&gt;
&lt;td&gt;Hours–days&lt;/td&gt;
&lt;td&gt;User reports, support tickets&lt;/td&gt;
&lt;td&gt;Caught in CI, before merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Root cause analysis&lt;/td&gt;
&lt;td&gt;1–4 hours&lt;/td&gt;
&lt;td&gt;Manual trace inspection&lt;/td&gt;
&lt;td&gt;Failed evals pinpoint exactly which capability regressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollback decision&lt;/td&gt;
&lt;td&gt;Stressful&lt;/td&gt;
&lt;td&gt;"Is this really worse?"&lt;/td&gt;
&lt;td&gt;Pass rate dropped from 95% to 71%, clear signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total cost per change&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable&lt;/td&gt;
&lt;td&gt;~15 min eval run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The eval suite described below costs ~$0.50 per run (LLM-as-judge calls) and takes 2–3 minutes. The alternative is discovering regressions from users. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;We're building an evaluation pipeline for a Q&amp;amp;A agent. The pipeline covers offline evals (before deploy), online monitoring (after deploy), and regression detection (across deploys).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Under Test
&lt;/h2&gt;

&lt;p&gt;A Q&amp;amp;A agent that answers questions using a knowledge base. Simple enough to evaluate clearly, complex enough to have real failure modes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetryPolicy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AnyMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Search the internal knowledge base for relevant information.

    Args:
        query: The search query to find relevant documents.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;knowledge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refund policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days of purchase for unopened items. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items eligible for exchange only within 14 days. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard shipping: 5-7 business days, free over $50. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping: 2-3 business days, $12.99. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;International shipping: 10-15 business days, $24.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warranty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All electronics carry a 1-year manufacturer warranty. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extended warranty available for $49.99 (adds 2 years). &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Warranty does not cover accidental damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hours&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support available Monday-Friday 9am-6pm EST. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chat support available 24/7. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone support: 1-800-555-0123.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;knowledge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Knowledge Base [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No relevant results found. Try rephrasing your query.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;llm_with_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are a customer support agent. Answer questions using the knowledge base tool.
Be concise and accurate. If the knowledge base doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have the answer, say so —
do not make up information. Always cite the source when using knowledge base results.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_agent_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

&lt;span class="n"&gt;tool_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle_tool_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RetryPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;qa_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Create a Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset is the foundation. Bad examples produce misleading eval scores. Each example has &lt;code&gt;inputs&lt;/code&gt; (what goes to the agent) and &lt;code&gt;outputs&lt;/code&gt; (the ground truth to evaluate against).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q&amp;amp;A agent evaluation dataset covering core support topics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dataset_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is your refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long does express shipping take?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Does the warranty cover water damage?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are your support hours?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can I return a digital download?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Do you sell gift cards?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the refund window for opened electronics?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Full refund within 30 days for unopened items. Opened items eligible for exchange only within 14 days. Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unopened&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Express shipping takes 2-3 business days and costs $12.99.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2-3 business days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;12.99&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The warranty does not cover accidental damage, including water damage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;does not cover&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accidental damage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support is available Monday-Friday 9am-6pm EST. Chat support is available 24/7.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monday-Friday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9am-6pm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24/7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Digital products are non-refundable after download.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;digital&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non-refundable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information about gift cards in the knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Opened items are eligible for exchange only within 14 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exchange&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;14 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Seven examples is a starting point, not a finish line.&lt;/strong&gt; In production, you need 50-100 examples covering happy paths, edge cases, and adversarial inputs. But starting with seven well-chosen examples that cover your core failure modes is better than starting with zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Build LangSmith Evaluators
&lt;/h2&gt;

&lt;p&gt;Three layers of evaluation: deterministic checks (fast, cheap, reliable), LLM-as-judge (flexible, handles nuance), and trajectory evaluation (validates agent behavior, not just output).&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic Evaluators
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;traceable&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the response mentions ALL required keywords. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent called all expected tools. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;expected_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="n"&gt;actual_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
            &lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;issubset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual_tools&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_no_hallucination_on_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;When the KB has no answer, pass if the agent admits it. Fail if it fabricates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expects_no_answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;hedging_phrases&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no information&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cannot find&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no relevant results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t have&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not in the knowledge base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m not sure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;hedged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;phrase&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hedging_phrases&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_hallucination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hedged&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Seven examples is a starting point, not a finish line.&lt;/strong&gt; In production, you need 50-100 examples covering happy paths, edge cases, and adversarial inputs. But starting with seven well-chosen examples that cover your core failure modes is better than starting with zero.&lt;/p&gt;
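
&lt;p&gt;Growing the set is mechanical once the dataset exists. A sketch of appending one more edge case with the LangSmith SDK; the example content here is hypothetical, but the reference keys match the evaluators in this post:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langsmith import Client

ls_client = Client()

# Hypothetical adversarial case: the question asserts a false premise to
# bait the agent into agreeing with it.
ls_client.create_examples(
    dataset_name="qa-agent-evals-v1",
    inputs=[{"question": "Since all plans include phone support, how do I call?"}],
    outputs=[{
        "expects_no_answer": False,
        "expected_answer": "Phone support is not included in every plan.",
        "must_mention": ["phone support"],
    }],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;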

&lt;h2&gt;
  
  
  Step 2: Build the LangSmith Evaluators
&lt;/h2&gt;

&lt;p&gt;Three layers of evaluation: deterministic checks (fast, cheap, reliable; no_hallucination_on_missing above is one), LLM-as-judge (flexible, handles nuance), and trajectory evaluation (validates agent behavior, not just output).&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-as-Judge Evaluators
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating a customer support agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Expected answer: {reference_outputs[expected_answer]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response is correct.
A response is CORRECT if it:
- Contains the key factual claims from the expected answer
- Does not contradict the expected answer
- Does not fabricate information beyond what the knowledge base provides
A response is INCORRECT if it:
- Misses critical factual information from the expected answer
- States anything that contradicts the expected answer
- Invents details not present in the knowledge base
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;correctness_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CORRECTNESS_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correctness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating the tone and professionalism of a customer support agent.
Customer question: {inputs[question]}
Agent response: {outputs[response]}
Determine whether the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tone is ACCEPTABLE or UNACCEPTABLE.
ACCEPTABLE tone: professional, helpful, concise, empathetic, and action-oriented.
UNACCEPTABLE tone: condescending, rude, excessively verbose, robotic, dismissive,
or inappropriately casual for a support context.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;tone_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TONE_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why pass/fail instead of continuous scores?&lt;/strong&gt; You don't ship code when 73% of your unit tests "mostly pass." When keyword_coverage fails, you know exactly what happened: the agent missed a required term. A score of 0.75 tells you something is partially wrong, but you still have to go figure out what. And binary evaluators don't suffer from judge variance — the same input produces the same verdict every time.&lt;/p&gt;

&lt;h3&gt;


  &lt;strong&gt;Trajectory Evaluator&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Trajectory evaluation checks not just &lt;strong&gt;what&lt;/strong&gt; the agent said, but &lt;strong&gt;how&lt;/strong&gt; it got there. Did it call the right tools? Did it call them in a reasonable order? Did it over-call or under-call?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentevals.trajectory.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_trajectory_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;You are evaluating whether an AI agent took a reasonable path to answer a question.
The agent has access to a knowledge base search tool.
Evaluate the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s trajectory (sequence of actions and messages):
{outputs}
A trajectory PASSES if:
- The agent called the appropriate tool(s) for the question
- The agent did not make unnecessary or redundant tool calls
- The agent used tool results to formulate its response
- The agent did not ignore relevant tool results
A trajectory FAILS if:
- The agent skipped tool calls and answered from its own knowledge
- The agent made excessive redundant calls (more than 2 calls for a simple question)
- The agent ignored tool results and fabricated an answer
- The agent called completely irrelevant tools
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field explaining your verdict.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;trajectory_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_trajectory_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRAJECTORY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tool-calling trajectory was reasonable. Fail otherwise.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trajectory_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trajectory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Binary judges are more reliable than continuous ones.&lt;/strong&gt; Ask a model to rate something 0.0-1.0 and you'll get different scores on every run. Ask it "correct or incorrect?" and you'll get the same answer 95%+ of the time. The judge isn't deciding how correct, it's deciding whether the response meets a bar. Easier task, more consistent results, fewer false signals in CI.&lt;/p&gt;
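
&lt;p&gt;That stability is worth verifying against your own judges. A minimal flakiness probe, assuming (as trajectory_eval above does) that a judge returns a dict with a score field and accepts inputs/outputs/reference_outputs keyword arguments:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Rerun one judge on a single example and measure how often its verdict
# disagrees with the majority outcome.
def verdict_flip_rate(judge, example: dict, n_runs: int = 10) -&amp;gt; float:
    verdicts = []
    for _ in range(n_runs):
        result = judge(
            inputs=example["inputs"],
            outputs=example["outputs"],
            reference_outputs=example["reference_outputs"],
        )
        verdicts.append(bool(result["score"]))
    majority = verdicts.count(True) &amp;gt;= verdicts.count(False)
    flips = sum(1 for v in verdicts if v != majority)
    return flips / n_runs

# 0.0 is ideal; a flip rate above ~0.05 means the criteria are ambiguous.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;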

&lt;h2&gt;


  &lt;strong&gt;Step 3: Run the Evaluation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Wire the target function and evaluators into evaluate(). The target function takes dataset inputs, runs the agent, and returns a dict with the keys your evaluators expect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa_eval_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an experiment in LangSmith, a versioned snapshot of your agent's performance. The result is a pass rate per evaluator: "correctness: 6/7 passed, tool_usage: 7/7 passed, keyword_coverage: 5/7 passed." Every future eval run with a different experiment_prefix becomes a comparable data point.&lt;/p&gt;

&lt;h2&gt;


  &lt;strong&gt;Step 4: AI Agent Testing with Regression Detection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The power of experiments is comparison. When you change a prompt, model, or tool, run the same eval suite and compare pass rates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare_experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compare_experiments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Compare two experiment runs and flag regressions in pass rates.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;experiments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_projects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;reference_dataset_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;experiments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_prefix&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;baseline&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find both experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;baseline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;candidate_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ls_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_test_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;feedback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;baseline_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_pass_rates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;regressions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;regression_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_pass_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regressions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;regressions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;baseline_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;candidate_rates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;candidate_rates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, you run this in CI. A prompt change creates a new experiment, the comparison script checks for regressions, and the PR is blocked if any evaluator's pass rate drops more than 10%. This is the single most valuable thing you can build with LangSmith — the rest is instrumentation. The regression threshold is 10%, not 5%, because pass/fail metrics move in discrete jumps. On a 7-example dataset, one additional failure drops your pass rate by ~14%. On a 50-example dataset, you can tighten the threshold to 5%. Scale the threshold to your dataset size.&lt;/p&gt;
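
&lt;p&gt;A minimal sketch of that CI gate, reusing the compare_experiments helper above; the experiment prefixes are whatever your pipeline passes in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import sys

# CI gate: fail the build when the candidate regresses against the baseline.
report = compare_experiments(
    baseline_prefix="qa-agent-v1",
    candidate_prefix="qa-agent-v2",
    regression_threshold=0.10,
)
if not report.get("passed", False):
    for r in report.get("regressions", []):
        print(f"REGRESSION {r['metric']}: "
              f"{r['baseline_pass_rate']} -&amp;gt; {r['candidate_pass_rate']}")
    sys.exit(1)  # a non-zero exit blocks the PR
print("No regressions detected.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;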

&lt;h2&gt;


  &lt;strong&gt;Step 5: Production Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Offline evals catch regressions before deploy. Online monitoring catches drift after deploy — the slow degradation that happens when user behavior shifts, knowledge bases get stale, or upstream APIs change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_user_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Production entry point with trace tagging.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;channel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-02-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag every production trace with the agent version and prompt version. When you deploy a new version, you can filter traces by version and compare pass rates across versions — with real user traffic, not synthetic examples. The monitoring loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag all production traces with version metadata&lt;/li&gt;
&lt;li&gt;Configure LangSmith online evaluators to sample 10-20% of traces&lt;/li&gt;
&lt;li&gt;Dashboard alerts on pass rate drops by version (a sketch of this tally follows the list)&lt;/li&gt;
&lt;li&gt;When a drop is detected, pull the failing traces, add them to your offline dataset, and fix&lt;/li&gt;
&lt;/ol&gt;
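
&lt;p&gt;A rough sketch of the version comparison, assuming the LangSmith SDK's list_runs/list_feedback calls, the has(tags, ...) filter syntax, a hypothetical qa-agent-production project, and online evaluators that attach boolean feedback to sampled runs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from collections import defaultdict

# Tally online-evaluator feedback by the agent_version metadata that
# handle_user_query attaches to every production trace.
runs = ls_client.list_runs(
    project_name="qa-agent-production",  # hypothetical project name
    filter='has(tags, "production")',
    is_root=True,
)
by_version = defaultdict(lambda: {"passed": 0, "total": 0})
for run in runs:
    version = (run.extra or {}).get("metadata", {}).get("agent_version", "unknown")
    for fb in ls_client.list_feedback(run_ids=[run.id]):
        by_version[version]["total"] += 1
        if fb.score:
            by_version[version]["passed"] += 1
for version, counts in by_version.items():
    rate = counts["passed"] / counts["total"] if counts["total"] else 0.0
    print(f"{version}: {rate:.0%} over {counts['total']} feedback entries")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;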

&lt;h2&gt;


  &lt;strong&gt;Production Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the eval-specific failure modes that surface once you're running evals in CI and production.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the agent made more than 3 tool calls for a single question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag every production trace with the agent version and prompt version. When you deploy a new version, you can filter traces by version and compare pass rates across versions — with real user traffic, not synthetic examples. The monitoring loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tag all production traces with version metadata&lt;/li&gt;
&lt;li&gt;Configure LangSmith online evaluators to sample 10-20% of traces&lt;/li&gt;
&lt;li&gt;Dashboard alerts on pass rate drops by version&lt;/li&gt;
&lt;li&gt;When a drop is detected, pull the failing traces, add them to your offline dataset, and fix&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Production Failures&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These are the eval-specific failure modes that surface once you're running evals in CI and production. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Flaky LLM-as-Judge Verdicts.&lt;/strong&gt; The same input/output pair passes on one run and fails on the next. The judge model is non-deterministic, and your eval is measuring judge variance, not agent quality. Fix: set temperature=0 on the judge model, make your pass/fail criteria as specific as possible (list exactly what constitutes a pass), and run each evaluation three times with a majority vote. If the same example flips verdict more than 10% of the time, your criteria need to be sharper. Binary verdicts are already far more stable than continuous scores, but ambiguous criteria still cause flakiness.&lt;/p&gt;
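
&lt;p&gt;A minimal sketch of the majority-vote wrapper, assuming (as the evaluators above do) that a judge takes inputs/outputs/reference_outputs keyword arguments and returns a dict with a score field:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Run a judge n_runs times and take the majority verdict, trading extra
# judge calls for a more stable pass/fail signal.
def majority_vote(judge, key: str, n_runs: int = 3):
    def wrapped(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
        votes = []
        for _ in range(n_runs):
            result = judge(
                inputs=inputs,
                outputs=outputs,
                reference_outputs=reference_outputs,
            )
            votes.append(bool(result["score"]))
        return {"key": key, "score": votes.count(True) &amp;gt; n_runs // 2}
    return wrapped

# Drop-in replacement in the evaluators list:
stable_correctness = majority_vote(correctness_judge, key="correctness")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;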

&lt;p&gt;&lt;strong&gt;2. Eval Gaming.&lt;/strong&gt; You optimize the prompt to pass the eval dataset. The pass rate goes up. User satisfaction doesn't. Your dataset is too narrow: the agent learned your test distribution, not the actual problem. Fix: rotate examples into and out of the eval set quarterly. Pull 10% of examples from production traces each month. Never let the eval set become stale.&lt;/p&gt;
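
&lt;p&gt;A sketch of the monthly rotation, under the same assumptions as the monitoring snippet above (hypothetical project name, has(tags, ...) filter); sampled questions land in the dataset unlabeled, and a human writes the expected answers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import random

# Sample recent production questions into the eval dataset for labeling.
prod_runs = list(ls_client.list_runs(
    project_name="qa-agent-production",  # hypothetical project name
    filter='has(tags, "production")',
    is_root=True,
    limit=200,
))
sampled = random.sample(prod_runs, k=min(10, len(prod_runs)))
ls_client.create_examples(
    dataset_name="qa-agent-evals-v1",
    inputs=[{"question": r.inputs.get("question", "")} for r in sampled],
    outputs=[{} for _ in sampled],  # fill in expected_answer et al. by hand
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;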

&lt;p&gt;&lt;strong&gt;3. Judge Model Disagreement.&lt;/strong&gt; You switch the judge from Claude to GPT and pass rates shift by 20%. The evaluator is measuring model preference, not quality. Fix: calibrate your judge against human ratings. Run 50 examples through both the judge and a human annotator. If they disagree on more than 10% of verdicts, your judge criteria need work. openevals provides pre-calibrated prompts as a starting point. &lt;/p&gt;
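
&lt;p&gt;The calibration itself is a few lines. A sketch, where calibration_set is a hypothetical list of (example, human_verdict) pairs collected from your annotator:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Fraction of examples where the judge's verdict matches the human label.
def judge_agreement(judge, calibration_set) -&amp;gt; float:
    agree = 0
    for example, human_verdict in calibration_set:
        result = judge(
            inputs=example["inputs"],
            outputs=example["outputs"],
            reference_outputs=example["reference_outputs"],
        )
        if bool(result["score"]) == human_verdict:
            agree += 1
    return agree / len(calibration_set)

# Below roughly 0.9 agreement, rework the judge prompt before trusting it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;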

&lt;p&gt;&lt;strong&gt;4. Dataset Drift.&lt;/strong&gt; Your eval dataset was created six months ago. The product has changed: new policies, new features, different user behavior. The evals are passing, but they're testing scenarios that no longer matter while ignoring scenarios that do. Fix: timestamp your examples. Review the dataset monthly. Add production failure cases as they occur. Delete examples for deprecated features.&lt;/p&gt;
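
&lt;p&gt;A sketch of the monthly review, assuming the SDK's list_examples/delete_example calls; age is only a review signal here, and deletion stays a human decision:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from datetime import datetime, timedelta, timezone

# Flag eval examples older than six months for human review.
cutoff = datetime.now(timezone.utc) - timedelta(days=180)
for example in ls_client.list_examples(dataset_name="qa-agent-evals-v1"):
    if example.created_at &amp;lt; cutoff:
        print(f"Review: {example.id} created {example.created_at:%Y-%m-%d}")
        # ls_client.delete_example(example_id=example.id)  # if deprecated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;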

&lt;p&gt;&lt;strong&gt;5. Trajectory Eval False Positives.&lt;/strong&gt; The trajectory judge says the agent's path was "reasonable" even when the agent called the wrong tool first and then self-corrected. Self-correction is fine in production but expensive: it adds latency and cost. Fix: add a separate tool_call_count evaluator that fails trajectories with more than N tool calls. Combine the trajectory pass/fail with an efficiency gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@traceable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the agent made more than 3 tool calls for a single question.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then rerun the suite with the efficiency gate in the evaluator list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracing_context&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;tracing_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eval_run&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ci&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;correctness_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tone_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;keyword_coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;no_hallucination_on_missing&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;trajectory_eval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_call_efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every evaluator is @traceable, which means your evals themselves are traced in LangSmith. This matters more than you think. When an evaluator produces a surprising verdict, you can inspect exactly what it saw and why it ruled that way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openevals.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_llm_as_judge&lt;/span&gt;
&lt;span class="n"&gt;ls_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;Question: {inputs[question]}
Response: {outputs[response]}
Expected: {reference_outputs[expected_answer]}
Does the response correctly and completely answer the question
based on the expected answer?
PASS if the response contains all key facts from the expected answer
and does not contradict it.
FAIL if the response misses critical information, contradicts the
expected answer, or fabricates details.
Return ONLY: {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: true}} or {{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: false}}
with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; field.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;quality_judge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_llm_as_judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUALITY_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-5-20250929&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;feedback_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Pass if ALL required terms are present. Fail if any are missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;must_mention&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;must_mention&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;all_present&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;must_mention&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;all_present&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fail if the response is too short to be useful or excessively long.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;target&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qa_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-evals-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;quality_judge&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coverage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_length&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;experiment_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qa-agent-quick-check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What to watch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Correctness pass rate.&lt;/strong&gt; This is your north star. If it drops, the agent is giving wrong answers. Every other metric is secondary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool usage pass rate.&lt;/strong&gt; If this drops, the agent stopped using the knowledge base — probably a prompt regression that caused it to answer from parametric memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-hallucination pass rate.&lt;/strong&gt; If this drops, the agent is making up answers when it should be admitting ignorance. This is the most dangerous regression and the one most likely to slip through manual review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failing examples across runs.&lt;/strong&gt; Track which specific examples fail consistently. These are your hardest cases: either improve the agent to handle them or accept them as known limitations and document them (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
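&lt;p&gt;To make those last two bullets concrete, here's a run-over-run comparison. This is a hypothetical helper, not a LangSmith SDK call: it assumes you've already flattened each experiment into a &lt;code&gt;dict&lt;/code&gt; mapping example IDs to pass/fail booleans.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
# Hypothetical helper: compare per-example pass/fail between two runs.
# baseline and candidate map example_id -&amp;gt; bool (True = pass).
def compare_runs(baseline: dict, candidate: dict) -&amp;gt; None:
    shared = baseline.keys() &amp;amp; candidate.keys()
    regressions = sorted(e for e in shared if baseline[e] and not candidate[e])
    chronic = sorted(e for e in shared if not baseline[e] and not candidate[e])
    delta = (sum(candidate[e] for e in shared) - sum(baseline[e] for e in shared)) / len(shared)
    print(f"pass-rate delta: {delta:+.2%}")
    print(f"newly failing examples: {regressions}")
    print(f"consistently failing (hardest cases): {chronic}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The "chronic" list doubles as your known-limitations doc; the "regressions" list is what blocks a release.&lt;/p&gt;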

&lt;h2&gt;
  
  
  &lt;strong&gt;Evals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section condenses the whole article into a quick reference: the minimum viable eval pipeline you should have before shipping any agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                    &lt;span class="n"&gt;OFFLINE&lt;/span&gt; &lt;span class="n"&gt;EVALS&lt;/span&gt;                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;  &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Target&lt;/span&gt;   &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Evaluators&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;examples&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Function&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Correctness&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;refs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Completeness&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;Trajectory&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Judge&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;└────────┬─────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                           &lt;span class="err"&gt;│&lt;/span&gt;             &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;┌────────▼─────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Experiment&lt;/span&gt;        &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;versioned&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                  &lt;span class="err"&gt;└──────────────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;

&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                   &lt;span class="n"&gt;ONLINE&lt;/span&gt; &lt;span class="n"&gt;MONITORING&lt;/span&gt;                      &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────┐&lt;/span&gt;    &lt;span class="err"&gt;┌──────────────────┐&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Prod&lt;/span&gt;     &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Traces&lt;/span&gt;   &lt;span class="err"&gt;│───►│&lt;/span&gt; &lt;span class="n"&gt;Online&lt;/span&gt; &lt;span class="n"&gt;Evals&lt;/span&gt;     &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="n"&gt;Traffic&lt;/span&gt;  &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tagged&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;│&lt;/span&gt;    &lt;span class="err"&gt;│&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampling&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────┘&lt;/span&gt;    &lt;span class="err"&gt;└──────────────────┘&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;

&lt;span class="err"&gt;┌─────────────────────────────────────────────────────────┐&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                 &lt;span class="n"&gt;REGRESSION&lt;/span&gt; &lt;span class="n"&gt;DETECTION&lt;/span&gt;                     &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;Experiment&lt;/span&gt; &lt;span class="n"&gt;v1&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;  &lt;span class="err"&gt;◄────&lt;/span&gt; &lt;span class="n"&gt;compare&lt;/span&gt; &lt;span class="err"&gt;────►&lt;/span&gt;  &lt;span class="n"&gt;v2&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;   &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;                                                         &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="n"&gt;Δ&lt;/span&gt; &lt;span class="n"&gt;correctness&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="n"&gt;REGRESSION&lt;/span&gt; &lt;span class="n"&gt;DETECTED&lt;/span&gt;            &lt;span class="err"&gt;│&lt;/span&gt;
&lt;span class="err"&gt;└─────────────────────────────────────────────────────────┘&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;When to Use This&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build a full eval pipeline when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're shipping prompt changes more than once a month&lt;/li&gt;
&lt;li&gt;More than one person works on the agent&lt;/li&gt;
&lt;li&gt;The agent handles queries where wrong answers have consequences (policy, pricing, compliance)&lt;/li&gt;
&lt;li&gt;You need to compare model versions (Claude vs GPT, Sonnet vs Haiku)&lt;/li&gt;
&lt;li&gt;You're running A/B tests on agent behavior &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Start with just deterministic evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent is in early development and schemas are changing weekly&lt;/li&gt;
&lt;li&gt;You have fewer than 5 test cases&lt;/li&gt;
&lt;li&gt;The output format is structured (JSON extraction) and correctness is binary (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
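&lt;p&gt;For that last case, a deterministic evaluator is a few lines. A minimal sketch following the same evaluator signature as above; the &lt;code&gt;expected&lt;/code&gt; reference key is a placeholder for whatever your dataset actually stores:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import json

def exact_json_match(inputs: dict, outputs: dict, reference_outputs: dict) -&amp;gt; dict:
    """Binary correctness for structured extraction: parse, then compare."""
    try:
        got = json.loads(outputs.get("response", ""))
    except json.JSONDecodeError:
        return {"key": "exact_json_match", "score": False}
    return {"key": "exact_json_match", "score": got == reference_outputs.get("expected")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;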

&lt;p&gt;&lt;strong&gt;Skip evals when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application is a prototype that won't see real users&lt;/li&gt;
&lt;li&gt;You're the only user and you'll notice regressions immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bottom Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Evals are not a nice-to-have. They're the difference between "I think the prompt is better" and "I know the prompt is better, and here are the pass rates." Dataset, evaluators, experiment, comparison. Four components. ~20 lines per evaluator. The payoff is catching every regression before your users do.&lt;/p&gt;

&lt;p&gt;Pass/fail is the right default. Continuous scores feel more sophisticated, but they create ambiguity — is 0.72 good? Is a drop from 0.81 to 0.76 a regression or noise? Pass/fail kills the question. Green or red. When you need more nuance, add more evaluators with sharper criteria instead of adding decimal places to existing ones.&lt;/p&gt;

&lt;p&gt;Start with three: one deterministic keyword check, one LLM-as-judge for correctness, one for tool usage. Run them on every PR that touches agent code. Add trajectory evaluation when your agent has more than two tools. Add production monitoring when you have traffic. And update the dataset — the dataset that stops growing is the one that stops catching bugs.&lt;/p&gt;
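&lt;p&gt;"Run them on every PR" can be as simple as a threshold assertion in CI. A minimal sketch, assuming the &lt;code&gt;results&lt;/code&gt; object from the &lt;code&gt;evaluate()&lt;/code&gt; call above; the row shape follows recent LangSmith SDK versions and may differ in yours, and the thresholds are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from collections import defaultdict

# Aggregate pass rates per evaluator key from the evaluate() results.
totals, passes = defaultdict(int), defaultdict(int)
for row in results:
    for res in row["evaluation_results"]["results"]:
        totals[res.key] += 1
        passes[res.key] += bool(res.score)

rates = {key: passes[key] / totals[key] for key in totals}

# Fail the build if any evaluator drops below its floor.
THRESHOLDS = {"quality": 0.9, "coverage": 0.95, "response_length": 0.95}
failed = {k: v for k, v in rates.items() if v &amp;lt; THRESHOLDS.get(k, 0.9)}
assert not failed, f"Eval regression below threshold: {failed}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;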

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Resources&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/focused-dot-io/07-evaluation-testing/tree/746f37847075391b6a638be55ce8f2507c55f231" rel="noopener noreferrer"&gt;Eval Pipelines GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/evaluation" rel="noopener noreferrer"&gt;LangSmith Evaluation (datasets, evaluators, experiments)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph (stateful agents, tool calling, orchestration)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.smith.langchain.com/observability" rel="noopener noreferrer"&gt;LangSmith Tracing &amp;amp; Observability&lt;/a&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>programming</category>
      <category>ai</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Debugging Your RAG Application: A LangChain, Python, and OpenAI Tutorial</title>
      <dc:creator>Austin Vance</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:42:32 +0000</pubDate>
      <link>https://forem.com/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</link>
      <guid>https://forem.com/focused_dot_io/debugging-your-rag-application-a-langchain-python-and-openai-tutorial-4gke</guid>
      <description>&lt;p&gt;Let's explore a real-world example of debugging a RAG-type application. I recently undertook this process while updating our company knowledge base -- a resource for potential clients and employees to learn about us.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I work with Python and the LangChain framework, specifically using LangChain Expression Language (LCEL) to build chains. You can find the LangChain LCEL documentation &lt;a href="https://python.langchain.com/docs/how_to/#langchain-expression-language-lcel" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
This approach serves as a good alternative to LangChain's debugging tool, &lt;a href="https://www.langchain.com/langsmith" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Load memory
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ConversationBufferMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;return_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;get_session_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;load_memory_variables&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_get_loaded_memory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create Question
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_buffer_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                               &lt;span class="p"&gt;}&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;CONDENSE_QUESTION_PROMPT&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
                               &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve Documents
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Answer
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;combine_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;DEFAULT_DOCUMENT_PROMPT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;ANSWER_PROMPT&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Final Chain looks like this
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_memory_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_question_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While debugging, I prefer using a cheaper model like gpt-3.5-turbo for its cost-effectiveness. The less advanced models are more than adequate for basic testing. For final testing and deployment to production, you might consider upgrading to gpt-4-turbo or a similar advanced model.&lt;br&gt;&lt;br&gt;
I also favor Jupyter notebooks for much of my debugging. This way, I can include the notebook in a .gitignore file, reducing cleanup from debugging shenanigans in my main code. I can also run very specific pieces of my code without plumbing overhead.  &lt;/p&gt;
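&lt;p&gt;That swap can be a single environment-driven line, so the notebook and production share the same code path. A minimal sketch; the env var name is my own convention:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import os

from langchain_openai import ChatOpenAI

# Cheap model while debugging; override for final testing/production:
#   export OPENAI_MODEL=gpt-4-turbo
llm = ChatOpenAI(model=os.getenv("OPENAI_MODEL", "gpt-3.5-turbo"), temperature=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;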
&lt;h2&gt;
  
  
  Initial Observations
&lt;/h2&gt;

&lt;p&gt;I noticed that basic queries received correct answers, but any follow-up question would lack the appropriate context, indicating that conversational memory was no longer functioning effectively.&lt;br&gt;&lt;br&gt;
Here's what I observed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Based&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;given&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;importance&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="nc"&gt;Development &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TDD&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, I expected responses more in line with explanations like "Love your craft is when you are passionate about what you do."&lt;br&gt;&lt;br&gt;
For more context, this issue with conversational memory arose while I was implementing a new feature: allowing end users to customize responses based on their role. So, for example, a developer could receive a highly technical answer while a marketing manager would see more high-level details.  &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Debugging Steps&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ensure Role Feature Integrity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To rule out the newly implemented role feature as the culprit, I temporarily updated my system prompt so the role would be overly obvious and active in every response during this debugging session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question from the perspective of a {role}.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;DEBUGGING_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer the question in a {role} accent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how the AI responded, clearly adhering to my updated prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pirate&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Listen&lt;/span&gt; &lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;Learn&lt;/span&gt; &lt;span class="n"&gt;Why&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;matey&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Arr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="n"&gt;talkin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; about the importance of reachin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Red&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="n"&gt;Driven&lt;/span&gt; &lt;span class="n"&gt;Development&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Creating a Visual Representation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I created a diagram of the app to visualize the process flow.&lt;br&gt;&lt;br&gt;
I began at the end of my flow and worked backward to identify issues. I first checked whether my LLM was answering questions based on the provided context. Upon inspecting the sources, I realized that the given context was a blog on TDD.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://focusedlabs.io/blog/tdd-first-step-think&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thus, I ruled out the answer component as the source of the bug.  &lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Tracing the Bug's Origin&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, I examined the logic for retrieving documents. I added a 'standalone question' key to every input and output chain to log runtime values, which revealed that questions were being incorrectly rephrased.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Adding these keys to the chains allows us to log the values seen by the components at runtime. Using breakpoints will only show the code when it's instantiated and not populated with real-time values.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Code Snippet with added keys
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I expected the &lt;code&gt;standalone_question&lt;/code&gt; to be more specific, like "What can you tell me about the core value of Love your Craft?"  &lt;/p&gt;
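&lt;p&gt;An alternative to threading debug keys through every dict is a pass-through "tap" that prints whatever flows between chain segments at runtime. A minimal sketch, not from the original codebase:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
from langchain_core.runnables import RunnableLambda

def tap(label: str) -&amp;gt; RunnableLambda:
    """Identity step that logs the value flowing through the chain."""
    def _tap(value):
        print(f"{label}: {value}")
        return value
    return RunnableLambda(_tap)

# Drop taps between segments to see real runtime values:
chain = (
    load_memory_chain()
    | tap("after memory")
    | create_question_chain()
    | tap("after question")
    | retrieve_documents_chain(vector_store)
    | create_answer_chain()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;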

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Identifying the Exact Source&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I focused on the &lt;code&gt;chat_history&lt;/code&gt; variable, suspecting an issue with how the chat history was being recognized.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_documents_chain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_answer_chain&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;final_inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added 
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;standalone_question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;itemgetter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Added
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="err"&gt;❌&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Found the issue! Since the &lt;code&gt;chat_history&lt;/code&gt; was blank, it wasn't being loaded as I had assumed.  &lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Implementing the Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I resolved the issue by checking my conversation memory store. As a &lt;code&gt;dict&lt;/code&gt;, the store is sensitive to key types: I had saved messages under a &lt;code&gt;str&lt;/code&gt;-converted version of &lt;code&gt;session_id&lt;/code&gt;, but invoked the chain with an &lt;code&gt;Optional[UUID]&lt;/code&gt; version, so the lookup never found the saved history. The store itself was set up correctly; I only needed to update how I invoked my chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I converted &lt;code&gt;session_id&lt;/code&gt; to &lt;code&gt;str&lt;/code&gt; at the call site.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
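
&lt;p&gt;Why such a small change matters: &lt;code&gt;dict&lt;/code&gt; lookups compare keys by hash and equality, and a &lt;code&gt;UUID&lt;/code&gt; never equals its own string form. Here is a minimal, standalone sketch of the mismatch (the store shape and values are hypothetical, not the app's actual memory store):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from uuid import UUID, uuid4

# A dict-backed memory store keyed by str, like the one described above.
store: dict[str, list[str]] = {}

session_id: UUID = uuid4()
store[str(session_id)] = ["What are the Focused Labs core values?"]

# A UUID key hashes and compares differently from its str form,
# so the lookup silently misses and the history looks blank:
print(store.get(session_id))       # None
print(store.get(str(session_id)))  # ['What are the Focused Labs core values?']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;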



&lt;h3&gt;
  
  
  6. &lt;strong&gt;Confirming the Fix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I confirmed that the conversation memory now functioned correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;What&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;Focused&lt;/span&gt; &lt;span class="n"&gt;Labs&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt;

&lt;span class="n"&gt;Question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Tell&lt;/span&gt; &lt;span class="n"&gt;me&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What are the Focused Labs core values?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;standalone_question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Can&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; &lt;span class="n"&gt;provide&lt;/span&gt; &lt;span class="n"&gt;more&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Love&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;Craft&lt;/span&gt;&lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;This&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="n"&gt;means&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;are&lt;/span&gt; &lt;span class="n"&gt;passionate&lt;/span&gt; &lt;span class="n"&gt;about&lt;/span&gt; &lt;span class="n"&gt;being&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="n"&gt;what&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;do&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;paying&lt;/span&gt; &lt;span class="n"&gt;attention&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;every&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.notion.so/Who-are-we-c42efb179fa64f6bb7866deb363fb7ef&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; &lt;span class="err"&gt;✅&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. &lt;strong&gt;Final Cleanup and Future-Proofing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I reverted the temporary pirate-accent debug prompt, which I had used only to make responses from the role feature easy to spot.&lt;br&gt;&lt;br&gt;
I kept the detailed logging in place for future debugging efforts.&lt;/p&gt;
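
&lt;p&gt;As a sketch of the kind of logging worth keeping (the &lt;code&gt;load_history&lt;/code&gt; helper and store shape here are hypothetical, not the app's actual code), logging the key &lt;em&gt;and its runtime type&lt;/em&gt; is exactly what would have exposed the UUID-vs-str mismatch immediately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("chat_memory")

def load_history(store: dict[str, list[str]], session_id: str) -&amp;gt; list[str]:
    history = store.get(session_id, [])
    # Log the key and its type, not just the value.
    logger.debug("session_id=%r (%s) history=%r",
                 session_id, type(session_id).__name__, history)
    return history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;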

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debugging AI Systems:&lt;/strong&gt; A mix of traditional techniques (logging, type checks) and AI-specific ones (inspecting intermediate chain values) is essential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opting for Cost-Effective Models:&lt;/strong&gt; Switch to a cheaper model while debugging, since you will re-run the same queries many times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance of Transparency:&lt;/strong&gt; Clear visibility into each step and component of your RAG pipeline accelerates debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type Consistency:&lt;/strong&gt; Small details like variable types can break functionality silently; normalizing keys at a single boundary helps (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
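
&lt;p&gt;One way to enforce that last takeaway is to normalize the key at a single boundary so no caller can pass the wrong type. This helper is a hypothetical sketch, not part of the original chatbot code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Optional, Union
from uuid import UUID

def normalize_session_id(session_id: Optional[Union[str, UUID]]) -&amp;gt; str:
    """Coerce whatever the caller passes into the canonical str key."""
    if session_id is None:
        raise ValueError("session_id is required")
    return str(session_id)

# Every invocation then goes through the same boundary:
# chain.invoke({"question": question,
#               "session_id": normalize_session_id(session_id),
#               "role": role})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;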

&lt;h3&gt;
  
  
  &lt;strong&gt;Thanks for reading!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Stay tuned for more insights into the world of software engineering and AI. Have questions or thoughts of your own? Share them in the comments below!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>langchain</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
