<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jangwook Kim</title>
    <description>The latest articles on Forem by Jangwook Kim (@jangwook_kim_e31e7291ad98).</description>
    <link>https://forem.com/jangwook_kim_e31e7291ad98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1909290%2F60a8c15f-b2b5-4189-8578-78b8ab78900b.jpg</url>
      <title>Forem: Jangwook Kim</title>
      <link>https://forem.com/jangwook_kim_e31e7291ad98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jangwook_kim_e31e7291ad98"/>
    <language>en</language>
    <item>
      <title>Archon v2: Open Source Coding Agent Harnesses</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Tue, 19 May 2026 00:12:07 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/archon-v2-open-source-coding-agent-harnesses-1n6f</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/archon-v2-open-source-coding-agent-harnesses-1n6f</guid>
      <description>&lt;p&gt;AI coding agents are becoming powerful enough to change real repositories, but the workflow around them is still often improvised. One run starts with planning, another jumps straight into edits, and a third forgets the validation command you expected. That is the gap Archon is trying to close.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://archon.diy/" rel="noopener noreferrer"&gt;Archon&lt;/a&gt; describes itself as a workflow engine for AI coding agents: you define development processes as YAML workflows, then run those workflows through a CLI, web UI, or integrations. The public GitHub repository calls it an open-source harness builder for deterministic and repeatable AI coding work. The useful framing is not "another coding assistant." It is a control layer around coding assistants.&lt;/p&gt;

&lt;p&gt;Effloow Lab ran a small local sandbox before writing this guide. The lab cloned the public repository, inspected the bundled workflow definitions, created a minimal &lt;code&gt;.archon/workflows/*.yaml&lt;/code&gt; file, and validated the dependency graph locally. The lab did not run model-backed AI nodes, Claude Code, Codex SDK, GitHub PR creation, or the web dashboard. Those limits matter because Archon's production value depends on the agent execution layer, not only the YAML shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effloow Lab&lt;/strong&gt; — Local sandbox on macOS, Node v25.9.0, npm 11.12.1, Docker 29.2.0. Lab run notes: &lt;code&gt;data/lab-runs/archon-v2-ai-coding-agent-harness-builder-2026.md&lt;/code&gt;. The PoC validated workflow structure and dependency ordering only; no AI provider credentials were used.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Archon Is Trying to Fix
&lt;/h2&gt;

&lt;p&gt;The common coding-agent failure mode is not always code quality. It is process drift. A capable model can still skip a planning step, forget to run tests, rewrite too much code, or finish without a review pass. Humans compensate with long prompts: "first inspect the repo, then make a plan, then implement, then run tests, then summarize." That works until the prompt gets lost in a long context window or a different teammate writes a different instruction.&lt;/p&gt;

&lt;p&gt;Archon's answer is to move the process out of the prompt and into a workflow file. The &lt;a href="https://archon.diy/getting-started/concepts/" rel="noopener noreferrer"&gt;core concepts documentation&lt;/a&gt; defines a workflow as a YAML file containing a directed acyclic graph of nodes. Nodes can represent inline prompts, command files, bash scripts, loops, approval gates, or cancellation points. Dependencies are declared with &lt;code&gt;depends_on&lt;/code&gt;, so the sequence becomes explicit rather than implied by prose.&lt;/p&gt;

&lt;p&gt;That changes the operating model. The agent still supplies judgment inside AI-backed nodes, but the harness owns the skeleton: inspect, plan, implement, validate, review, request approval, create PR. For teams already using Claude Code, Codex, or other terminal agents, this is the difference between "ask the model to remember the process" and "make the model run inside the process."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Current Version Reality
&lt;/h2&gt;

&lt;p&gt;The backlog topic says "Archon v2," but the current public repository is more precise than that label. In the sandbox clone, &lt;code&gt;package.json&lt;/code&gt; reported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"archon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0.3.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"module"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So this guide treats "v2" as the rewrite-era product direction, not as an exact package version. A GitHub migration issue from April 2026 says Archon was evolving from a Python-based MCP knowledge and task-management tool into a TypeScript workflow engine for AI coding agents, with the old Python code preserved on an archive branch. That matches the current repository shape: TypeScript, Bun scripts, &lt;code&gt;.archon&lt;/code&gt; workflow defaults, and documentation centered on YAML workflows.&lt;/p&gt;

&lt;p&gt;This distinction matters for readers. If you are looking for the older Archon OS-style RAG/task-management stack, you may land on older articles or mirrors. If you want the current coding-agent harness, focus on the docs at &lt;code&gt;archon.diy&lt;/code&gt; and the current &lt;code&gt;coleam00/Archon&lt;/code&gt; repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the YAML Harness Works
&lt;/h2&gt;

&lt;p&gt;A minimal workflow has a name, a description, and nodes. The &lt;a href="https://archon.diy/book/first-workflow/" rel="noopener noreferrer"&gt;first workflow guide&lt;/a&gt; shows the basic pattern: one node runs first, another depends on it, and Archon executes the graph in dependency order.&lt;/p&gt;

&lt;p&gt;Effloow Lab modeled this small workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;effloow-sandbox&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Minimal deterministic review workflow for an article-code sandbox&lt;/span&gt;
&lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inspect&lt;/span&gt;
    &lt;span class="na"&gt;bash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;printf&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'inspect&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;implementation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;inspection&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;output."&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;inspect&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;validate&lt;/span&gt;
    &lt;span class="na"&gt;bash&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;printf&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'validate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'"&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review-copy&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;unsupported&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;claims."&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review-risk&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;operational&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;risks."&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;validate&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;validation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;findings."&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;review-copy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;review-risk&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A local validator produced this execution layering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodeCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"missingDependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"executionLayers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"inspect"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"plan"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"validate"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"review-copy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"review-risk"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"summarize"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interesting part is the fourth layer. &lt;code&gt;review-copy&lt;/code&gt; and &lt;code&gt;review-risk&lt;/code&gt; both depend on &lt;code&gt;validate&lt;/code&gt;, but neither depends on the other. That means the workflow has a natural parallel review stage before the final summary. This is exactly where harnesses start to matter: code review, security review, docs review, and regression review are different jobs, and a workflow file can represent them as separate nodes instead of one overloaded "please review this" prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Sandbox Confirmed
&lt;/h2&gt;

&lt;p&gt;The local run confirmed four concrete facts.&lt;/p&gt;

&lt;p&gt;First, the public repository could be cloned and inspected without credentials. The checkout used commit &lt;code&gt;45bc5e5&lt;/code&gt; at the time of the run.&lt;/p&gt;

&lt;p&gt;Second, the repository exposes a TypeScript/Bun toolchain. The root &lt;code&gt;package.json&lt;/code&gt; includes scripts such as &lt;code&gt;cli&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt;, &lt;code&gt;test&lt;/code&gt;, &lt;code&gt;type-check&lt;/code&gt;, &lt;code&gt;lint&lt;/code&gt;, and &lt;code&gt;validate&lt;/code&gt;. Effloow Lab did not run those scripts because Bun was not installed on the host.&lt;/p&gt;

&lt;p&gt;Third, bundled defaults are real files, not just documentation examples. The clone contained 37 workflow YAML files and 36 default command files under &lt;code&gt;.archon&lt;/code&gt;. The visible workflow list included names such as &lt;code&gt;archon-idea-to-pr&lt;/code&gt;, &lt;code&gt;archon-plan-to-pr&lt;/code&gt;, &lt;code&gt;archon-smart-pr-review&lt;/code&gt;, &lt;code&gt;archon-comprehensive-pr-review&lt;/code&gt;, &lt;code&gt;archon-refactor-safely&lt;/code&gt;, and &lt;code&gt;archon-validate-pr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fourth, a small YAML workflow can be reasoned about with ordinary DAG validation. The local script found six nodes, no missing dependencies, and a five-layer execution plan. That does not prove Archon's runtime behavior, but it does prove the workflow model is inspectable and reviewable before an agent touches code.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Sandbox Did Not Prove
&lt;/h2&gt;

&lt;p&gt;The local experiment intentionally stopped short of a full Archon trial.&lt;/p&gt;

&lt;p&gt;The documented Docker command started with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unable to find image 'ghcr.io/coleam00/archon:latest' locally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image pull did not complete within the local run window, so Effloow Lab did not verify &lt;code&gt;archon workflow list&lt;/code&gt; through Docker. The lab also did not install Bun, configure Claude Code, set provider credentials, connect GitHub CLI, start the web dashboard, trigger a PR workflow, or run Slack/Telegram integrations.&lt;/p&gt;

&lt;p&gt;That boundary should shape adoption decisions. The sandbox supports a narrow claim: Archon's workflow concept is concrete, source-visible, and easy to inspect. It does not support a broad claim that Archon is production-ready in a specific team environment. Teams should run their own credentialed trial before putting it on a critical repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Archon Compares to Plain Agent Prompts
&lt;/h2&gt;

&lt;p&gt;Plain prompts are fast to write. They are also easy to mutate accidentally. A senior engineer might say "run tests before summarizing," while another says "summarize and then run tests if needed." Both can work, but neither creates a durable process artifact.&lt;/p&gt;

&lt;p&gt;Archon's workflow files are closer to CI configuration for agentic development. The &lt;a href="https://archon.diy/guides/authoring-workflows/" rel="noopener noreferrer"&gt;authoring guide&lt;/a&gt; emphasizes workflows, commands, artifacts, fresh context, and parallel execution. Commands communicate through files rather than hidden memory. Nodes can force a fresh context, which is useful when you want a review step to inspect artifacts instead of inheriting the implementer's assumptions.&lt;/p&gt;

&lt;p&gt;This is the strongest reason to care about Archon: it makes the human process reviewable. You can code-review a workflow file. You can ask whether the validation node is too weak, whether the approval gate is in the right place, or whether a security review should run before PR creation. That is harder when all process control lives in a giant natural-language prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Workflow Pattern
&lt;/h2&gt;

&lt;p&gt;A useful first Archon workflow should be boring. Do not start with an autonomous "idea to production PR" flow on a critical service. Start with a harness that standardizes a task you already do manually.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspect the relevant files.&lt;/li&gt;
&lt;li&gt;Write a short plan artifact.&lt;/li&gt;
&lt;li&gt;Ask for human approval.&lt;/li&gt;
&lt;li&gt;Implement one bounded change.&lt;/li&gt;
&lt;li&gt;Run the exact validation command.&lt;/li&gt;
&lt;li&gt;Run two independent review nodes.&lt;/li&gt;
&lt;li&gt;Summarize changed files, test output, and residual risk.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That pattern is also a good fit for content-backed engineering systems like Effloow. An article generator, for example, should not only draft prose. It should gather sources, create a lab note, check unsupported claims, verify frontmatter, update the backlog, and stop before publishing side effects. A workflow harness can encode those boundaries directly.&lt;/p&gt;

&lt;p&gt;Readers interested in related agent control patterns can compare this with Effloow's guides on &lt;a href="https://dev.to/articles/terminal-ai-coding-agents-compared-claude-code-gemini-cli-2026"&gt;terminal AI coding agents&lt;/a&gt; and &lt;a href="https://dev.to/articles/openai-agents-sdk-multi-agent-python-tutorial-2026"&gt;OpenAI Agents SDK multi-agent workflows&lt;/a&gt;. Archon sits one layer above the agent: it coordinates process, while the underlying assistant still performs the reasoning and code edits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Archon Looks Strong
&lt;/h2&gt;

&lt;p&gt;Archon is most compelling when the same engineering process must run repeatedly across issues, repositories, or teammates. The &lt;a href="https://archon.diy/reference/cli/" rel="noopener noreferrer"&gt;CLI reference&lt;/a&gt; documents workflow listing, workflow runs, JSON output, validation, logs, and merge detection behavior. The docs also describe project-local workflows in &lt;code&gt;.archon/workflows/&lt;/code&gt; and global workflows under &lt;code&gt;~/.archon/workflows/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That gives teams a path to standardize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bug-fix investigation and implementation.&lt;/li&gt;
&lt;li&gt;Plan-to-PR execution.&lt;/li&gt;
&lt;li&gt;Multi-review PR checks.&lt;/li&gt;
&lt;li&gt;Refactoring with validation gates.&lt;/li&gt;
&lt;li&gt;Documentation impact review.&lt;/li&gt;
&lt;li&gt;Human approval before irreversible steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key advantage is portability. A workflow committed to the repo can travel with the project. A global workflow can become a personal or team-wide operating pattern. Both are more durable than a chat transcript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Teams Should Be Careful
&lt;/h2&gt;

&lt;p&gt;Harnesses can also create false confidence. A YAML file can force a validation command to run, but it cannot make that command comprehensive. A review node can ask for security issues, but it cannot guarantee that every issue is found. A human approval node can pause execution, but it cannot replace informed review.&lt;/p&gt;

&lt;p&gt;There is also tooling maturity risk. The current repo uses Bun, a web dashboard, provider integrations, and platform connectors. If your team standardizes on npm-only Node tooling, locked-down workstations, or restricted Docker access, the setup path may need extra work. The sandbox host did not have Bun installed, and the Docker image path was not verified within the local run window.&lt;/p&gt;

&lt;p&gt;Finally, avoid putting secrets or production credentials into workflow files. Treat Archon workflows like CI configuration: review them, keep secrets in approved secret stores, and put destructive operations behind explicit approval gates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adoption Checklist
&lt;/h2&gt;

&lt;p&gt;Use this checklist before introducing Archon to a real repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose one low-risk workflow, such as docs review or test validation.&lt;/li&gt;
&lt;li&gt;Commit the workflow under &lt;code&gt;.archon/workflows/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Keep prompts short and task-specific.&lt;/li&gt;
&lt;li&gt;Put deterministic checks in &lt;code&gt;bash&lt;/code&gt; nodes where possible.&lt;/li&gt;
&lt;li&gt;Use artifacts for handoffs between nodes.&lt;/li&gt;
&lt;li&gt;Add human approval before PR creation, deployment, paid actions, or public posting.&lt;/li&gt;
&lt;li&gt;Run the workflow on a throwaway branch first.&lt;/li&gt;
&lt;li&gt;Compare the output against your normal manual process.&lt;/li&gt;
&lt;li&gt;Document what the agent is allowed to change.&lt;/li&gt;
&lt;li&gt;Keep a fallback path that does not require Archon.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the first workflow does not improve repeatability, do not add more workflows. The goal is not to make agent work look sophisticated. The goal is to make it observable, reviewable, and less dependent on the wording of one-off prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Is Archon a replacement for Claude Code or Codex?
&lt;/h3&gt;

&lt;p&gt;No. Archon is better understood as a harness around coding agents. The docs say it works with Claude Code SDK and Codex SDK, but the model-backed agent still performs the reasoning and code work. Archon provides workflow structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can Archon run deterministic steps without AI?
&lt;/h3&gt;

&lt;p&gt;Yes. The docs describe &lt;code&gt;bash&lt;/code&gt; nodes for shell scripts, and the sandbox workflow used &lt;code&gt;bash&lt;/code&gt; nodes for &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt;. Deterministic checks belong there whenever possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Did Effloow Lab run a full Archon workflow?
&lt;/h3&gt;

&lt;p&gt;No. Effloow Lab validated a local workflow DAG and inspected the repository defaults. It did not run model-backed nodes, provider credentials, PR creation, or the dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the safest first use case?
&lt;/h3&gt;

&lt;p&gt;Start with validation and review, not autonomous implementation. A workflow that runs tests, checks docs impact, and summarizes risk is easier to trust than one that edits code and opens PRs immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Archon is worth watching because it moves coding-agent process control out of fragile prompts and into source-visible workflow files. The sandbox confirmed that the current repository has real bundled workflow definitions and that a small YAML DAG can model deterministic validation plus parallel review. It did not prove end-to-end runtime reliability.&lt;/p&gt;

&lt;p&gt;For teams already experimenting with AI coding agents, Archon is most useful as a repeatability layer: encode the process, keep humans in the approval path, and let the agent operate inside a graph that engineers can inspect before it runs.&lt;/p&gt;

</description>
      <category>archon</category>
      <category>aiagents</category>
      <category>codingagents</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Claude Agent SDK Subagent Orchestration Tutorial — Parallel Multi-Agent Processing in Practice</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Mon, 18 May 2026 06:37:34 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/claude-agent-sdk-subagent-orchestration-tutorial-parallel-multi-agent-processing-in-practice-4ibo</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/claude-agent-sdk-subagent-orchestration-tutorial-parallel-multi-agent-processing-in-practice-4ibo</guid>
      <description>&lt;p&gt;After I published the &lt;a href="https://dev.to/en/blog/en/claude-agent-sdk-tool-use-complete-guide-2026"&gt;Tool Use guide&lt;/a&gt;, a comment came in fairly quickly: "I get single agents now, but how do I run a code reviewer, security scanner, and doc writer at the same time?" I was actually mid-experiment at that point myself.&lt;/p&gt;

&lt;p&gt;Installing &lt;code&gt;claude-agent-sdk 0.2.82&lt;/code&gt; directly, I found the answer. One &lt;code&gt;AgentDefinition&lt;/code&gt; dataclass and the &lt;code&gt;ClaudeAgentOptions.agents&lt;/code&gt; dict is all it takes. I created the objects and explored the type structure hands-on. No API key meant I couldn't run actual queries, but the code structure and type system were fully testable.&lt;/p&gt;

&lt;p&gt;This post is that exploration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Single Agents Hit a Wall
&lt;/h2&gt;

&lt;p&gt;The Tool Use loop is powerful. But there are three situations where it shows limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context contamination.&lt;/strong&gt; When a single agent handles code quality, security vulnerabilities, and test coverage in one PR review, the context window fills with intermediate results from all three tasks mixed together. The agent sees its earlier reasoning while forming later judgments — the fact that it spotted a code smell early can subtly shade the security analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No parallelism.&lt;/strong&gt; Code review takes 30 seconds, security scan 20 seconds, doc generation 25 seconds. Single agent: 75 seconds. Three concurrent agents: 30 seconds. There's no reason to run independent tasks serially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role bleed.&lt;/strong&gt; An agent that "thinks like a reviewer then thinks like a security expert" does both jobs worse than dedicated specialists. This is true for human teams too.&lt;/p&gt;

&lt;p&gt;The subagent pattern solves these three problems structurally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing claude-agent-sdk 0.2.82 — SDK Structure I Verified Directly
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;claude-agent-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Version confirmed after install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Successfully installed claude-agent-sdk-0.2.82
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;dir(claude_agent_sdk)&lt;/code&gt; in a temporary sandbox, the subagent-relevant classes that stood out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;claude_agent_sdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;

&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentDefinition&lt;/span&gt;          &lt;span class="c1"&gt;# Subagent configuration dataclass
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ClaudeAgentOptions&lt;/span&gt;       &lt;span class="c1"&gt;# Full options including agents dict
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TaskBudget&lt;/span&gt;               &lt;span class="c1"&gt;# Token budget control
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SubagentStartHookInput&lt;/span&gt;   &lt;span class="c1"&gt;# Hook for subagent start events
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SubagentStopHookInput&lt;/span&gt;    &lt;span class="c1"&gt;# Hook for subagent stop events
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_subagents&lt;/span&gt;           &lt;span class="c1"&gt;# List subagents in a session
&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_subagent_messages&lt;/span&gt;    &lt;span class="c1"&gt;# Retrieve a specific subagent's messages
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I read the &lt;code&gt;AgentDefinition&lt;/code&gt; source directly with &lt;code&gt;inspect.getsource()&lt;/code&gt;. This is the actual dataclass in 0.2.82:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentDefinition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;          &lt;span class="c1"&gt;# How the orchestrator identifies this agent
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;               &lt;span class="c1"&gt;# Subagent system prompt
&lt;/span&gt;    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;disallowedTools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# "sonnet", "opus", "haiku", "inherit", or full model ID
&lt;/span&gt;    &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;mcpServers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;initialPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;maxTurns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Max loop count for this subagent
&lt;/span&gt;    &lt;span class="n"&gt;background&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;effort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EffortLevel&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;permissionMode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PermissionMode&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing I noticed in the &lt;code&gt;tools&lt;/code&gt; field comment: "Deprecated: passing 'Skill' here is deprecated; use &lt;code&gt;skills&lt;/code&gt; instead." I hadn't seen that in the documentation. The separate &lt;code&gt;skills&lt;/code&gt; field is the right place now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Subagents with AgentDefinition — A PR Review Pipeline
&lt;/h2&gt;

&lt;p&gt;Here's the actual code. A PR auto-review pipeline needs three roles:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;claude_agent_sdk&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;

&lt;span class="c1"&gt;# Define each role as a specialized subagent
&lt;/span&gt;&lt;span class="n"&gt;code_reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AgentDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python code quality and design review specialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a Python senior engineer with 10 years of experience. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review code quality, readability, and design patterns. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Provide concrete improvement suggestions in markdown format.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxTurns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;security_scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AgentDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Security vulnerability scanner — injection risks, exposed secrets, unsafe operations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a security engineer. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find SQL injection risks, hardcoded secrets, unsafe eval/exec, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and permission issues. Report each with severity level.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxTurns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;doc_writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AgentDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Docstring and README writer — reads code and generates clear documentation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a technical writer. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write Google Style docstrings for functions and classes, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and create usage examples for the README.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Edit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;haiku&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# haiku is sufficient for docs and cheaper
&lt;/span&gt;    &lt;span class="n"&gt;maxTurns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Orchestrator options
&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a PR review orchestrator. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Call code-reviewer, security-scanner, and doc-writer in parallel &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and compile a comprehensive review report from all results.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Agent tool is how subagents are invoked
&lt;/span&gt;    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;code_reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security-scanner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;security_scanner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc-writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;doc_writer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;permission_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bypassPermissions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dict keys in &lt;code&gt;ClaudeAgentOptions.agents&lt;/code&gt; are the names the orchestrator uses when calling subagents. When the system prompt says "call code-reviewer," Claude invokes that agent via the &lt;code&gt;Agent&lt;/code&gt; tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel Execution Pattern — Running Three Agents Simultaneously
&lt;/h2&gt;

&lt;p&gt;The most important line in the SDK documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Multiple subagents can run concurrently. When Claude identifies independent subtasks, it spawns multiple agents simultaneously using multiple Task tool calls in a single message."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the orchestrator calls multiple &lt;code&gt;Agent&lt;/code&gt; tools in a single message, they run in parallel. You don't write &lt;code&gt;asyncio.gather()&lt;/code&gt; yourself. Tell the orchestrator to "call these agents in parallel" and the SDK handles it.&lt;/p&gt;

&lt;p&gt;Actual query flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_pr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pr_diff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review this PR diff:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pr_diff&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run code-reviewer, security-scanner, and doc-writer simultaneously &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to analyze each domain in parallel, then compile a unified review report.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AssistantMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResultMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_cost_usd&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Duration: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;duration_ms&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each subagent's context window starts fresh. From the official docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A subagent's context window starts fresh, and the only channel from parent to subagent is the Agent tool's prompt string."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The orchestrator only sees the final result, not the intermediate reasoning. That's what prevents context contamination.&lt;/p&gt;

&lt;h2&gt;
  
  
  Controlling Costs with TaskBudget
&lt;/h2&gt;

&lt;p&gt;Running three subagents concurrently doesn't just triple costs — it can amplify them unpredictably. Each agent might make redundant tool calls trying to do thorough work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TaskBudget&lt;/code&gt; is the API-level fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# ... same as above ...
&lt;/span&gt;    &lt;span class="n"&gt;task_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TaskBudget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# 50K token budget
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual class structure from &lt;code&gt;inspect.getsource(sdk.TaskBudget)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskBudget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;API-side task budget in tokens.

    When set, the model is made aware of its remaining token budget so it can
    pace tool use and wrap up before the limit. Sent as
    output_config.task_budget with the task-budgets-2026-03-13 beta header.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;task-budgets-2026-03-13&lt;/code&gt; beta header is attached automatically. The agent becomes aware of its remaining budget and decides internally when to pace down and wrap up. Much cleaner than an external timeout that forces mid-task termination.&lt;/p&gt;

&lt;p&gt;Combine with &lt;code&gt;AgentDefinition.maxTurns&lt;/code&gt; for a two-tier safety net:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;security_scanner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AgentDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;maxTurns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Subagent level: max 6 tool calls
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;task_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TaskBudget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Global level: 100K token ceiling
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Subagent Hooks — Tracking Start and Stop Events
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SubagentStartHookInput&lt;/code&gt; and &lt;code&gt;SubagentStopHookInput&lt;/code&gt; let you detect exactly when each subagent starts and finishes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;agent_timings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_agent_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SubagentStartHookInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent_timings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;▶ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; started (id: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_agent_stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SubagentStopHookInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_timings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;■ &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hook_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; done (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# hook_input.agent_transcript_path has the full subagent conversation log
&lt;/span&gt;
&lt;span class="n"&gt;opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClaudeAgentOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="n"&gt;hooks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubagentStart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HookMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_agent_start&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SubagentStop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HookMatcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hook_callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_agent_stop&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;agent_transcript_path&lt;/code&gt; from &lt;code&gt;SubagentStopHookInput&lt;/code&gt; is invaluable for production debugging. If a subagent produces unexpected output, that's where you look first.&lt;/p&gt;

&lt;p&gt;Also worth knowing: multiple hook matchers on the same event run &lt;strong&gt;concurrently&lt;/strong&gt;, not sequentially. The docs explicitly state this. Design each hook to be independent.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Subagents (and When Not To)
&lt;/h2&gt;

&lt;p&gt;I want to be direct here: subagents aren't always the right choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use them when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have 3+ independent tasks, each taking 10+ seconds&lt;/li&gt;
&lt;li&gt;Different tasks need different tool access (security scanner doesn't need Write)&lt;/li&gt;
&lt;li&gt;You've verified that context contamination actually hurts result quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;They're overkill when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have 2 tasks where task B depends on task A's output&lt;/li&gt;
&lt;li&gt;Total runtime would be under 5 seconds (spawn overhead exceeds benefit)&lt;/li&gt;
&lt;li&gt;It's a simple question-answer pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same tradeoff comes up in the &lt;a href="https://dev.to/en/blog/en/a2a-mcp-hybrid-architecture-production-guide"&gt;A2A + MCP hybrid architecture post&lt;/a&gt;: multi-agent structure adds complexity. More failure points, harder debugging, less predictable costs. Don't add subagents to a problem that a single agent can handle.&lt;/p&gt;

&lt;p&gt;My personal threshold: "three or more independent tasks, each likely to consume 10K+ tokens with Opus." Below that, I stick with a single agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Couldn't Test
&lt;/h2&gt;

&lt;p&gt;No API key meant I couldn't capture actual execution logs showing the three agents running in parallel. The object construction and type validation worked, but "what does the console output look like when three agents actually spawn concurrently" — I can't show you that from this run.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;fork_session&lt;/code&gt; function also caught my attention but didn't fit in this post. &lt;code&gt;fork_session(session_id, up_to_message_id)&lt;/code&gt; lets you branch a session from a specific point. Useful when subagents want to try different strategies from the same base context without repeating earlier work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The core of subagent orchestration in &lt;code&gt;claude-agent-sdk 0.2.82&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AgentDefinition&lt;/code&gt;&lt;/strong&gt;: Separate role, prompt, tools, and model per subagent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ClaudeAgentOptions.agents&lt;/code&gt;&lt;/strong&gt;: Register subagent names for the orchestrator to call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Agent&lt;/code&gt; tool + parallel prompt instruction&lt;/strong&gt;: Orchestrator spawns multiple subagents at once&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add &lt;code&gt;TaskBudget&lt;/code&gt; and &lt;code&gt;SubagentStartHookInput&lt;/code&gt;/&lt;code&gt;SubagentStopHookInput&lt;/code&gt; for cost control and execution tracking.&lt;/p&gt;

&lt;p&gt;Start with a single agent. Move to subagents when your task fits "independent, parallelizable, three or more."&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/agent-sdk/subagents" rel="noopener noreferrer"&gt;Subagents in the SDK — Claude API official docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.anthropic.com/engineering/building-agents-with-the-claude-agent-sdk" rel="noopener noreferrer"&gt;Building agents with the Claude Agent SDK — Anthropic engineering blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claude-agent-sdk==0.2.82&lt;/code&gt; PyPI package — direct install and source inspection (2026-05-18)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>claude</category>
      <category>anthropicsdk</category>
      <category>subagents</category>
      <category>multiagent</category>
    </item>
    <item>
      <title>Gemini 2.5 Flash Thinking API: What I Learned from Running Budget=0, 1024, and 8000</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sun, 17 May 2026 06:43:58 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/gemini-25-flash-thinking-api-what-i-learned-from-running-budget0-1024-and-8000-3l0j</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/gemini-25-flash-thinking-api-what-i-learned-from-running-budget0-1024-and-8000-3l0j</guid>
      <description>&lt;p&gt;I assumed that turning on Thinking would always make Gemini smarter. After running actual experiments, I found out that's only half true.&lt;/p&gt;

&lt;p&gt;I set &lt;code&gt;thinking_budget&lt;/code&gt; to 0, 1024, and 8000 across three prompt types — simple tasks, math reasoning, and code review — and measured response time, output tokens, and thinking tokens for each combination. The numbers told a more nuanced story than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Thinking API Actually Does
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;thinking_budget&lt;/code&gt; limits how many tokens the model can spend on "hidden reasoning" before it writes the response. Budget=0 disables thinking entirely. Budget=-1 lets the model decide how much to think. A positive integer sets the cap (maximum is 24576).&lt;/p&gt;

&lt;p&gt;There's an important catch: thinking tokens aren't returned in the response, but &lt;strong&gt;they're billed at the same rate as output tokens&lt;/strong&gt;. As covered in the &lt;a href="https://dev.to/en/blog/en/llm-api-pricing-comparison-2026-gpt5-claude-gemini-deepseek"&gt;LLM API pricing comparison&lt;/a&gt;, Gemini 2.5 Flash output tokens cost $0.0035/1K. Spending 1024 thinking tokens adds that cost on top.&lt;/p&gt;

&lt;p&gt;One practical note: the &lt;code&gt;google.generativeai&lt;/code&gt; package is now deprecated. You need the new &lt;code&gt;google-genai&lt;/code&gt; package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Deprecated (no longer receiving updates)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.generativeai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Current standard
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;include_thoughts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# expose thinking in response parts
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Separate thinking from the actual answer
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thought&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Thinking] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Answer] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;include_thoughts=True&lt;/code&gt; surfaces the model's internal reasoning as separate response parts. Useful for debugging, though you'd keep it &lt;code&gt;False&lt;/code&gt; in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Ran the Experiment
&lt;/h2&gt;

&lt;p&gt;I created a fresh sandbox directory, installed only &lt;code&gt;google-genai&lt;/code&gt;, and applied Budget=0/1024/8000 to three prompt types.&lt;/p&gt;

&lt;p&gt;Measurements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Response time (seconds)&lt;/strong&gt;: wall clock time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt;: actual answer tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking tokens&lt;/strong&gt;: tokens consumed in internal reasoning (&lt;code&gt;usage_metadata.thoughts_token_count&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Simple task&lt;/strong&gt;: "Explain in one sentence how to sort a list in Python"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Math reasoning&lt;/strong&gt;: Find all two-digit positive integers satisfying three conditions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code review&lt;/strong&gt;: Find bugs and improvements in a simple Python function&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Results: What the Numbers Show
&lt;/h2&gt;

&lt;p&gt;These are the actual measured values. No smoothing.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Budget=0&lt;/th&gt;
&lt;th&gt;Budget=1024&lt;/th&gt;
&lt;th&gt;Budget=8000&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple task&lt;/td&gt;
&lt;td&gt;1.4s / 54 out tok&lt;/td&gt;
&lt;td&gt;6.8s / 61 out tok / 751 think tok&lt;/td&gt;
&lt;td&gt;9.0s / 45 out tok / 1282 think tok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math reasoning&lt;/td&gt;
&lt;td&gt;8.8s / 2143 out tok&lt;/td&gt;
&lt;td&gt;15.1s / 1915 out tok / 918 think tok&lt;/td&gt;
&lt;td&gt;26.2s / 1671 out tok / 4036 think tok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;6.7s / 1367 out tok&lt;/td&gt;
&lt;td&gt;13.1s / 1126 out tok / 734 think tok&lt;/td&gt;
&lt;td&gt;22.6s / 2055 out tok / 1824 think tok&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Simple task&lt;/strong&gt;: Budget=0 finished in 1.4 seconds. Budget=1024 took 6.8 seconds — nearly 5x slower — with no discernible quality improvement. Budget=8000 consumed 1282 thinking tokens and still produced a shorter answer (45 tokens). Complete overkill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Math reasoning&lt;/strong&gt;: This is where things got interesting. With Budget=0, the model produced 2143 output tokens. It was "thinking out loud" inside the answer, writing out every step of its reasoning. Budget=1024 used 918 thinking tokens internally and produced 1915 output tokens. The total token consumption was similar, but the response was more structured. Budget=8000 pushed thinking to 4036 tokens and the output dropped to 1671 — the model reasoned privately and wrote a more concise answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code review&lt;/strong&gt;: Budget=1024 actually cut output from 1367 to 1126 tokens. The answer was more focused. Budget=8000 expanded to 2055 tokens — a more thorough analysis but 3x slower. Which is better depends entirely on your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Budget for Each Task Type
&lt;/h2&gt;

&lt;p&gt;Here's the practical framework I settled on from this experiment. Not a universal rule, but a solid starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget=0 (thinking disabled)&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classification, labeling, tagging&lt;/li&gt;
&lt;li&gt;Summarization, translation, format conversion&lt;/li&gt;
&lt;li&gt;Simple Q&amp;amp;A, factual lookups&lt;/li&gt;
&lt;li&gt;High-volume batch processing where cost matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The simple task responded in 1.4 seconds at Budget=0. Giving it 1024 budget means waiting 6.8 seconds and paying for 751 extra tokens. No benefit. Pure waste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget=1024–2048 (moderate thinking)&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code review and bug finding where focused analysis matters&lt;/li&gt;
&lt;li&gt;Medium-complexity reasoning&lt;/li&gt;
&lt;li&gt;Multi-step judgment calls that are still latency-sensitive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'll be honest — the code review at Budget=1024 felt &lt;em&gt;better&lt;/em&gt; than Budget=0 even though the response was shorter. The unnecessary padding was gone. Just the key points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget=4000–8000 (deep thinking)&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex math, algorithm design&lt;/li&gt;
&lt;li&gt;Thorough architecture reviews&lt;/li&gt;
&lt;li&gt;Multi-step planning&lt;/li&gt;
&lt;li&gt;Tasks where accuracy matters far more than speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget=8000 on the math problem consumed 4036 thinking tokens in 26 seconds. That latency is unacceptable in any interactive context. I'd only use this for offline batch analysis or asynchronous background jobs.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/en/blog/en/gemini-25-flash-api-cost-optimization-guide"&gt;Gemini 2.5 Flash cost optimization guide&lt;/a&gt; covers this too: thinking tokens and output tokens are priced identically. Using Budget=8000 indiscriminately can multiply your costs by several times.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Code: Tracking Thinking Usage
&lt;/h2&gt;

&lt;p&gt;Here's the pattern I use to monitor thinking token consumption in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_with_thinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a response while tracking thinking token usage.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;thinking_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThinkingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;thinking_budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;include_thoughts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# False in production
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elapsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt_token_count&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates_token_count&lt;/span&gt;
            &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thoughts_token_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_with_thinking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find potential memory leaks in this code: ...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;budget&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;latency_s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Thinking tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;thinking_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total billed tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;usage_metadata.thoughts_token_count&lt;/code&gt; sometimes returns 0 — when budget=0 or the model decided it didn't need to think. Track this metric and you'll quickly learn how often thinking actually fires for your prompts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Thinking API Falls Short
&lt;/h2&gt;

&lt;p&gt;I want to be direct about the frustrating parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic mode (Budget=-1) is unpredictable.&lt;/strong&gt; The model deciding its own budget sounds convenient, but it can fire thinking on simple tasks. In my simple task experiment, Budget=-1 took around the same time as Budget=1024. If you can't predict latency and cost, you can't budget for it in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;thinking_budget and thinking_level can't coexist.&lt;/strong&gt; Gemini 3.x uses &lt;code&gt;thinking_level&lt;/code&gt;; 2.5 uses &lt;code&gt;thinking_budget&lt;/code&gt;. Mix them in the same call and you get a 400 error. This is documented but the error message isn't obvious enough to catch immediately if you're migrating code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking tokens don't benefit from context caching.&lt;/strong&gt; Even if you use context caching to reduce the cost of a long system prompt, thinking tokens are billed fresh every time. As I covered in the &lt;a href="https://dev.to/en/blog/en/ai-agent-cost-reality"&gt;AI agent cost reality post&lt;/a&gt;, costs in agent loops can spiral faster than expected when thinking tokens accumulate.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Take
&lt;/h2&gt;

&lt;p&gt;Thinking API isn't overhyped. But "just turn it on" is also wrong.&lt;/p&gt;

&lt;p&gt;My position: &lt;strong&gt;use Budget=0 as the default, and explicitly activate Budget=1024–2048 only when the task genuinely needs multi-step reasoning.&lt;/strong&gt; Keep Budget=8000 for batch jobs or offline analysis where accuracy is paramount.&lt;/p&gt;

&lt;p&gt;Skip dynamic mode (Budget=-1) in production. Predictability beats convenience when you're billing actual users.&lt;/p&gt;

&lt;p&gt;The counterintuitive finding that stuck with me: for complex math, disabling thinking caused the model to "think out loud" across 2143 output tokens. Enabling Budget=1024 moved the reasoning internal and dropped output to 1915 tokens. The total cost difference was smaller than I expected. Whether you net save depends on the task.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Without running the experiments, I would have defaulted to "more thinking = better." The measurements said otherwise.&lt;/p&gt;

&lt;p&gt;Gemini 2.5 Flash Thinking API is a genuinely useful tool when applied to the right tasks. The paradoxical effect — where enabling thinking &lt;em&gt;reduces&lt;/em&gt; total tokens for complex reasoning — is real and worth knowing. But applying it blindly to simple tasks wastes money and time.&lt;/p&gt;

&lt;p&gt;Before setting &lt;code&gt;thinking_budget&lt;/code&gt;, ask one question first: &lt;strong&gt;does this task actually require deep reasoning?&lt;/strong&gt; Most of the time, the answer is no.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All code in this post is reproducible using the snippets provided. Written against google-genai package 0.8.x.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Claude Managed Agents' Dreaming, Outcomes, and Orchestration — How Agents Self-Improve While You Sleep</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Sat, 16 May 2026 06:41:32 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/claude-managed-agents-dreaming-outcomes-and-orchestration-how-agents-self-improve-while-you-2mij</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/claude-managed-agents-dreaming-outcomes-and-orchestration-how-agents-self-improve-while-you-2mij</guid>
      <description>&lt;p&gt;When Anthropic announced three new features at the Code with Claude conference in San Francisco on May 6, my first thought was: "What exactly is this agent learning while I'm away?"&lt;/p&gt;

&lt;p&gt;Dreaming. Outcomes. Multiagent Orchestration. The names have a marketing ring to them, but the underlying engineering decisions are concrete. And there's one thing people consistently misread about Dreaming in particular: when Anthropic says "the agent learns," they don't mean the model improves. The memory improves. That distinction matters more than it might seem.&lt;/p&gt;

&lt;p&gt;I couldn't test Dreaming directly — no API access, and it's still Research Preview. This analysis is based on official documentation, Anthropic's blog posts, conference materials, and early pilot reports. I won't claim to have run what I didn't run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code with Claude 2026 — No New Model, All Infrastructure
&lt;/h2&gt;

&lt;p&gt;The most telling thing about the May 6 SF keynote was the absence of a model announcement. Instead of competing on benchmark numbers, Anthropic focused on the execution layer for agents.&lt;/p&gt;

&lt;p&gt;What was announced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dreaming&lt;/strong&gt;: Automated agent memory refresh (Research Preview)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outcomes&lt;/strong&gt;: Success-criteria-based self-evaluation and iteration (Public Beta)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiagent Orchestration&lt;/strong&gt;: Lead-subagent parallel execution (Public Beta)&lt;/li&gt;
&lt;li&gt;Usage limits doubled across Pro, Max, Team, and Enterprise&lt;/li&gt;
&lt;li&gt;Peak-hour throttling removed for Pro and Max&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Security&lt;/strong&gt;: Code vulnerability scanner powered by Opus 4.7 (Enterprise)&lt;/li&gt;
&lt;li&gt;Remote Agents: Control your laptop from your phone&lt;/li&gt;
&lt;li&gt;SpaceX Project Colossus partnership (220,000+ GPUs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notion, Rakuten, Sentry, and Harvey are already running these features in production, according to Anthropic. The conference continues in London (May 19) and Tokyo (June 10).&lt;/p&gt;

&lt;p&gt;The pattern here is worth noting: Anthropic isn't trying to win the model benchmark race this month. They're building the plumbing that makes large-scale agent deployment tractable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dreaming — A Memory Consolidation System, Not Model Training
&lt;/h2&gt;

&lt;p&gt;Anthropic reaches for the hippocampus metaphor when explaining Dreaming — the way the human brain replays the day's events during sleep and decides what to keep. It's a reasonable analogy. Here's what the system actually does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reviews up to 100 past agent sessions&lt;/li&gt;
&lt;li&gt;Extracts patterns: recurring mistakes, converged workflows, team preferences&lt;/li&gt;
&lt;li&gt;Removes duplicates and stale entries from the existing memory store, adds new ones&lt;/li&gt;
&lt;li&gt;Preserves original session transcripts untouched&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This is not fine-tuning.&lt;/strong&gt; Anthropic is explicit: "Dreaming does not modify the underlying model weights." What changes is the memory store that the next agent session reads on startup. The model itself is unchanged.&lt;/p&gt;

&lt;p&gt;Harvey's pilot result gets cited constantly: task completion rates climbed roughly six times after Dreaming was enabled. The mechanism was specific — agents started remembering filetype workarounds and tool-specific behavior across sessions, which is exactly the kind of thing that breaks legal document workflows repeatedly.&lt;/p&gt;

&lt;p&gt;I think the six-times number is interesting but requires context. Harvey processes legal documents. Same document structures, same tools, repetitive review workflows — a domain where pattern extraction is tractable and where the same mistakes genuinely recur. Extrapolating "6x completion rate" to a general-purpose agent handling varied requests every session is not supported by this data.&lt;/p&gt;

&lt;p&gt;Dreaming is Research Preview. I can describe what the documentation says it does; I can't report what running it feels like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Outcomes — Productizing the LLM-as-Judge Pattern
&lt;/h2&gt;

&lt;p&gt;Outcomes isn't a new idea. LLM-as-judge — using a separate model instance to evaluate agent output — is already standard practice in many agent pipelines. What Anthropic is doing is turning it into a managed primitive with a specific architecture.&lt;/p&gt;

&lt;p&gt;How Outcomes works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Developer writes a rubric defining success
   Example: "The contract clause must satisfy legal requirements A, B, and C"

2. Writer agent generates output

3. Grader runs in a separate context window, evaluates against rubric
   — Independent of the writer's reasoning process
   — Returns per-criterion pass/fail verdict

4. If anything fails → Grader sends specific feedback to writer

5. Writer revises and retries

6. All criteria pass → Return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical design choice is that the grader runs in a &lt;strong&gt;completely separate context window&lt;/strong&gt;. It can't see the writer's reasoning; it only sees the output and the rubric. This is what distinguishes Outcomes from asking the same agent to "check your own work." Self-review in the same context window is biased — the agent is predisposed to justify what it already produced.&lt;/p&gt;

&lt;p&gt;Anthropic's internal benchmark numbers: 8.4% improvement in Word document quality, 10.1% for PowerPoint slides.&lt;/p&gt;

&lt;p&gt;In practice, rubric design becomes the core work. Too permissive, and Outcomes adds latency without benefit. Too strict, and you get an infinite retry loop. The API for grader configuration isn't something I could test at this tier, but the Outcomes cookbook example on the Claude platform shows the pattern clearly.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/en/blog/en/claude-managed-agents-production-deployment-guide"&gt;Managed Agents deployment walkthrough I wrote in April&lt;/a&gt; covered the $0.08/session baseline cost. Outcomes adds grader session cost on top — how much depends on rubric complexity and how many retry cycles each task needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multiagent Orchestration — Standardizing the Parallel Pattern
&lt;/h2&gt;

&lt;p&gt;Running multiple specialized agents in parallel for complex tasks isn't new either. &lt;a href="https://dev.to/en/blog/en/claude-code-agentic-workflow-patterns-5-types"&gt;Five agentic workflow patterns for Claude Code&lt;/a&gt; covered the architecture. What Orchestration adds is a managed version of that pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead agent decomposes the task and delegates to specialists&lt;/li&gt;
&lt;li&gt;Up to 20 subagents run in parallel&lt;/li&gt;
&lt;li&gt;Each subagent has its &lt;strong&gt;own model, prompt, and tool configuration&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Shared filesystem for output coordination&lt;/li&gt;
&lt;li&gt;Full flow traceable in Claude Console
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    A["Complex Task"] --&amp;gt; B["Lead Agent\nDecompose &amp;amp; Delegate"]

    subgraph Parallel ["Parallel Execution (up to 20)"]
        B --&amp;gt; C["Subagent 1\nModel A + Tool X"]
        B --&amp;gt; D["Subagent 2\nModel B + Tool Y"]
        B --&amp;gt; E["Subagent N\nModel C + Tool Z"]
    end

    C --&amp;gt; F["Shared Filesystem"]
    D --&amp;gt; F
    E --&amp;gt; F
    F --&amp;gt; G["Lead Agent\nSynthesis &amp;amp; Final Output"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The per-subagent model configuration is a meaningful addition. A code generation subagent running Opus 4.7 alongside a fast validation subagent running Haiku 4.5 is cost-efficient without sacrificing output quality where it matters. This is the &lt;a href="https://dev.to/en/blog/en/ai-agent-cost-reality"&gt;heterogeneous agent fleet&lt;/a&gt; pattern made easier to implement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Self-Improvement Loop All Three Create Together
&lt;/h2&gt;

&lt;p&gt;Viewed individually, these three features look like separate product additions. Viewed together, they form a closed loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;graph TD
    subgraph Cycle ["Self-Improvement Cycle"]
        A["Agent Execution\n(Orchestration: parallel)"] --&amp;gt; B["Output Generation"]
        B --&amp;gt; C["Outcomes Evaluation\n(Grader: separate context)"]
        C --&amp;gt;|"Criteria failed"| D["Correction + Retry"]
        D --&amp;gt; B
        C --&amp;gt;|"All passed"| E["Final Output + Session Log"]
    end

    E --&amp;gt; F["Dreaming\n(Cross-session pattern extraction)"]
    F --&amp;gt; G["Memory Store Update"]
    G --&amp;gt; A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Observe: Session data accumulates as the agent works.&lt;/p&gt;

&lt;p&gt;Evaluate: Outcomes' grader evaluates each task against success criteria. Failure reasons are recorded.&lt;/p&gt;

&lt;p&gt;Improve: Dreaming periodically reviews accumulated sessions and updates memory. The next session's agent starts with that enriched context.&lt;/p&gt;

&lt;p&gt;Over time, the agent doesn't acquire new skills — it accumulates operational knowledge about "what to watch out for in which situations." The model stays constant; the effective behavior improves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/hindsight-mcp-agent-memory-learning"&gt;Hindsight MCP's approach to experience-based memory refresh&lt;/a&gt; covers similar territory from a different angle. Comparing both designs is useful for thinking through agent memory architecture choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'm Skeptical
&lt;/h2&gt;

&lt;p&gt;Several things give me pause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, the Harvey 6x number.&lt;/strong&gt; Legal document processing is structured and repetitive. The same contracts, the same tools, the same review workflows. Pattern extraction works well here. Claiming "agents improve 6x with Dreaming" generalizes a domain-specific result in a way that sets unrealistic expectations for varied workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, memory poisoning risk.&lt;/strong&gt; If an agent consistently approaches tasks incorrectly, Dreaming could entrench those bad patterns. Anthropic offers a "review changes before they land" option, but how many teams will actually review every memory update in a high-volume production system? This needs better tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, auditability tension.&lt;/strong&gt; A system where the agent autonomously changes its own behavioral patterns is hard to audit. "Why did the agent make that decision six months ago?" requires memory store version history — and the tooling for that isn't clearly specified yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, Research Preview status.&lt;/strong&gt; Dreaming is not Production. Unlike Outcomes and Orchestration (Public Beta), Dreaming hasn't been validated at production scale. The &lt;a href="https://dev.to/en/blog/en/ai-agent-cost-reality"&gt;agent cost reality&lt;/a&gt; analysis I did earlier applies here too: governance costs, monitoring costs, and debugging costs are real costs even when tokens are cheap.&lt;/p&gt;

&lt;p&gt;Fifth, Outcomes grader cost scales with retry depth. A rubric with five criteria and a task that fails on the first three passes could triple the session cost relative to a baseline run. There's no cost estimation tooling for this yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Use This Now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with Outcomes.&lt;/strong&gt; If you're already running Managed Agents and output consistency is your main problem, rubric design is worth the investment. The separate grader context genuinely addresses the self-review bias problem. It's Public Beta and the most straightforward of the three to adopt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orchestration makes sense when tasks are too large or require too many specializations for a single agent.&lt;/strong&gt; Large report generation, simultaneous code review and documentation, multi-source data synthesis. Be careful about orchestration overhead — 20 subagents poorly configured can absorb the gains from parallelization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dreaming: proceed carefully.&lt;/strong&gt; Research Preview means production stability isn't guaranteed. The agents most likely to benefit are those handling repetitive, structured work over long periods. For varied-request agents, the improvement trajectory is less predictable.&lt;/p&gt;

&lt;p&gt;I find the three-feature combination genuinely interesting as a system design. Observe → Evaluate → Improve is a clean loop. But I'd push back on the "self-improving agent" framing that turns this into something magical. Memory can be wrong. Research Preview features aren't production-ready. And the Harvey pilot, while compelling, is a single domain under specific conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feasibility Assessment
&lt;/h2&gt;

&lt;p&gt;What I could verify directly: Anthropic SDK installation, basic Messages API connectivity. Dreaming, Outcomes, and Orchestration require Enterprise or Beta plan access — I didn't run them.&lt;/p&gt;

&lt;p&gt;Primary sources consulted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://claude.com/blog/new-in-claude-managed-agents" rel="noopener noreferrer"&gt;New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration&lt;/a&gt; — Anthropic official&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/cookbook/managed-agents-cma-verify-with-outcome-grader" rel="noopener noreferrer"&gt;Outcomes implementation cookbook&lt;/a&gt; — Claude platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://claude.com/blog/code-w-claude-sf-2026-sf" rel="noopener noreferrer"&gt;Code w/ Claude SF 2026 summary&lt;/a&gt; — Anthropic blog&lt;/li&gt;
&lt;li&gt;&lt;a href="https://venturebeat.com/technology/anthropic-introduces-dreaming-a-system-that-lets-ai-agents-learn-from-their-own-mistakes" rel="noopener noreferrer"&gt;VentureBeat coverage of Dreaming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Simon Willison's live blog from Code with Claude&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When more teams share production Dreaming results — especially outside legal tech — I'll update my read. For now: interesting architecture, still needs validation beyond a single pilot.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>aiagents</category>
      <category>anthropic</category>
      <category>managedagents</category>
    </item>
    <item>
      <title>Cloudflare Agents Week 2026 — When AI Agents Become Cloud Customers</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Fri, 15 May 2026 06:43:30 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/cloudflare-agents-week-2026-when-ai-agents-become-cloud-customers-47b2</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/cloudflare-agents-week-2026-when-ai-agents-become-cloud-customers-47b2</guid>
      <description>&lt;p&gt;This time last year, every AI agent infrastructure conversation started with Kubernetes + LangGraph. Cloudflare's April Agents Week presented a different picture. Agents don't just call APIs — they create Cloudflare accounts, register domains, and deploy code on their own. The phrase "agents as cloud customers" sounds like marketing fluff, but this time they actually built it.&lt;/p&gt;

&lt;p&gt;Here's my read on what matters, what doesn't, and where I'm skeptical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agents Week Was
&lt;/h2&gt;

&lt;p&gt;Cloudflare declared April 2026 "agents week" and shipped announcements every day — 20+ new features and GA transitions by the end of it. The overall impression is a company-wide bet that agents will be the primary actors on the internet, and they rebuilt infrastructure accordingly across compute, storage, networking, and security.&lt;/p&gt;

&lt;p&gt;I'm focusing on the items that actually affect how you write and deploy agent code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Most Provocative Announcement — Agents That Create Their Own Accounts
&lt;/h2&gt;

&lt;p&gt;My honest reaction when I first read this: "is this real?" The mechanics: once a user accepts Cloudflare's terms of service once, agents can autonomously create a Cloudflare account, start a paid subscription, register a domain, get an API token, and deploy code. Stripe partnership handles payment tokenization; OAuth + OIDC authenticate the agent as a trusted actor.&lt;/p&gt;

&lt;p&gt;The implication is significant. Until now, agents worked within infrastructure that humans provisioned. Now agents can be the entity that provisions the infrastructure itself. If you're building a SaaS product, "agent handles new customer onboarding end-to-end" becomes a real architectural option.&lt;/p&gt;

&lt;p&gt;That said, I have two concerns I can't shake. First, an agent connected to live billing requires airtight cost controls. Cloudflare's new &lt;code&gt;task_budget&lt;/code&gt; concept seems designed for exactly this, but real-world examples of the two working together are scarce. Second, the legal accountability picture is murky. If an agent registers the wrong domain or incurs unexpected charges, who owns that? User consent to ToS exists, but the specific liability hasn't been tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Announcements Worth Your Attention
&lt;/h2&gt;

&lt;p&gt;Past the headline, here are the things I'd actually build with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sandboxes GA&lt;/strong&gt;: Nine months from beta (June 2025) to general availability. Each sandbox is an isolated Linux environment — real shell, real filesystem, background processes — that spins up on demand and, critically, picks up exactly where it left off after interruption. Sub-millisecond start times mean a code-generation agent can write, execute, observe output, and iterate in tight loops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/ai-agent-framework-comparison-2026-langgraph-crewai-dapr-production"&gt;Compared to setting up a separate code execution environment alongside LangGraph or CrewAI&lt;/a&gt;, Sandboxes shifts the question from "how do I configure the execution environment" to "which infrastructure layer do I trust to manage it." Those are meaningfully different decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Artifacts&lt;/strong&gt;: Git-compatible versioned storage for agents. Create tens of millions of repos, fork from any remote, access with standard Git clients. Moved from private beta to public beta in early May. The practical use case: agents that produce code outputs now have a permanent home for those outputs, survives context resets, accessible from outside Cloudflare's stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Workers&lt;/strong&gt;: Isolated runtime for AI-generated code. Millisecond spin-up, scales to millions of concurrent executions. Enables the generate-execute-observe loop agents need without managing container infrastructure. Still feels early but the concept is right.&lt;/p&gt;

&lt;h2&gt;
  
  
  I Actually Installed the SDK
&lt;/h2&gt;

&lt;p&gt;Theory aside, I ran through the setup myself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;cloudflare-agent-demo &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;cloudflare-agent-demo
npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; @cloudflare/agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean install. &lt;code&gt;@cloudflare/agents@0.0.16&lt;/code&gt; exports &lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;AIChatAgent&lt;/code&gt;, and &lt;code&gt;routeAgentRequest&lt;/code&gt; as the main surfaces.&lt;/p&gt;

&lt;p&gt;Here's a minimal but representative agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/index.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;routeAgentRequest&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cloudflare/agents&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;TaskState&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;processedCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;lastHeartbeat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;Env&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;TASK_AGENT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DurableObjectNamespace&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;TaskAgent&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskAgent&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;TaskState&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;onStart&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;processedCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;lastHeartbeat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="c1"&gt;// Built-in cron scheduling — no external scheduler needed&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0 * * * *&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;heartbeat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`SELECT COUNT(*) as n FROM tasks`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setState&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;processedCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lastHeartbeat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;onRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Agents receive email directly&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;onEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ForwardableEmailMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="s2"&gt;`
      INSERT INTO tasks (id, content, created_at)
      VALUES (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;email&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;)
    `&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;routed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;routeAgentRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;routed&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OK&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;wrangler dev&lt;/code&gt; starts immediately, no Cloudflare account needed for local work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;⛅️ wrangler 4.91.0
Your Worker has access to the following bindings:
  env.TASK_AGENT (TaskAgent)   Durable Object   local

⎔ Starting local server...
[wrangler:info] Ready on http://localhost:9998
[wrangler:info] GET / 200 OK (7ms)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One important caveat: &lt;code&gt;@cloudflare/agents&lt;/code&gt; is Workers runtime-only. Trying to run it with standard Node.js throws &lt;code&gt;ERR_UNSUPPORTED_ESM_URL_SCHEME&lt;/code&gt; because of the &lt;code&gt;cloudflare:&lt;/code&gt; protocol imports. You need Wrangler. &lt;a href="https://dev.to/en/blog/en/claude-agent-sdk-tool-use-complete-guide-2026"&gt;If you're used to SDKs like the Claude Agent SDK that run anywhere in Python or Node&lt;/a&gt;, this is an adjustment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Choices Worth Understanding
&lt;/h2&gt;

&lt;p&gt;A few design decisions in the SDK that reflect Cloudflare's broader approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded SQLite&lt;/strong&gt;: Declare &lt;code&gt;new_sqlite_classes&lt;/code&gt; in &lt;code&gt;wrangler.toml&lt;/code&gt; and every Agent instance gets its own SQLite. No external database configuration. Query with &lt;code&gt;this.sql&lt;/code&gt;. The Durable Object isolation model gives you natural multi-tenancy — each agent instance has independent data. Sounds wasteful but it's actually clean for state isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In-process scheduling&lt;/strong&gt;: Register cron jobs directly from agent code. No external cron service. Wraps the Durable Object alarm API, which keeps scheduling and state management co-located. High cohesion, lower operational surface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email handler&lt;/strong&gt;: &lt;code&gt;onEmail&lt;/code&gt; lets agents receive email directly via Workers Email Routing. An agent that turns email into tasks is straightforward to write.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/en/blog/en/dapr-agents-v1-cncf-production-ai-framework"&gt;The way Dapr Agents handles state and messaging through Kubernetes sidecar patterns&lt;/a&gt; contrasts interestingly here. Cloudflare's model is more code-centric; Dapr is more infrastructure-centric. Both have legitimate use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'm Skeptical
&lt;/h2&gt;

&lt;p&gt;I'll be direct about the rough edges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor lock-in is significant.&lt;/strong&gt; The &lt;code&gt;cloudflare:workers&lt;/code&gt; runtime dependency means your agent code doesn't run outside Cloudflare's stack. Migrating to a different platform later means substantial rewrites. &lt;a href="https://dev.to/en/blog/en/mcp-server-production-deployment-kubernetes-guide"&gt;Containerized approaches like running MCP servers on Kubernetes&lt;/a&gt; don't have this problem — you trade operational simplicity now for portability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent orchestration is thin.&lt;/strong&gt; The single-agent story is compelling. But the SDK-level support for complex multi-agent coordination — handoffs, shared memory, hierarchical orchestration — is limited. Project Think is meant to address this but it's early. If your use case involves agents coordinating at scale, you'll need to build significant structure yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SDK maturity.&lt;/strong&gt; &lt;code&gt;@cloudflare/agents@0.0.16&lt;/code&gt; is pre-1.0. The API surface will change. For production use, you're accepting that risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Take on When to Use This
&lt;/h2&gt;

&lt;p&gt;Cloudflare is the right infrastructure choice when: response latency at the edge matters for your agents, your team already operates Cloudflare Workers, you want to minimize infrastructure management and focus on agent logic, or your architecture involves many independent agents each owning their own state.&lt;/p&gt;

&lt;p&gt;It's not the right choice when: you need complex multi-agent orchestration and you're already invested in LangGraph, you're locked to AWS or GCP infrastructure, or your agents need to run in Python or standard Node.js environments.&lt;/p&gt;

&lt;p&gt;The overall direction from Agents Week is coherent. Cloudflare is positioning itself as the infrastructure layer for the agent era — what Kubernetes became for containers. The SDK being at v0 means production adoption should be cautious, but the design thinking is consistent. Worth running through the setup and forming your own opinion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signed Agents: Cryptographic Identity for Agent Traffic
&lt;/h2&gt;

&lt;p&gt;One announcement that got less coverage but caught my attention: Signed Agents. The idea is that HTTP requests made by agents carry a cryptographic signature proving their origin — "this was sent by an agent, not a human."&lt;/p&gt;

&lt;p&gt;Right now there's no standard way to distinguish agent traffic from human traffic on the internet. User-Agent strings and IP patterns are guesses at best. Signed Agents gives servers a verifiable signal: they can check the signature and apply agent-specific rate limits, billing, or access controls. It's an early-stage primitive but it's the right one to build. Once agents are common enough to treat as distinct traffic types, having cryptographic identity for them becomes infrastructure rather than a feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Email Service Public Beta
&lt;/h2&gt;

&lt;p&gt;Workers Email Service graduated to public beta during Agents Week. Any agent can now send email without integrating a third-party service like SendGrid or AWS SES.&lt;/p&gt;

&lt;p&gt;Combined with the &lt;code&gt;onEmail&lt;/code&gt; handler already in the SDK, agents can now handle both inbound and outbound email entirely within Cloudflare's stack. An agent that receives a customer email, processes it, creates a task, and sends a reply — with no external email service in the loop. For customer support agents, notification pipelines, or email-based task management, this is a meaningful simplification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Looking at Agents Week as a whole, it reads less like a feature release and more like a positioning statement. Twenty-plus announcements, all pointing the same direction: Cloudflare intends to be the infrastructure layer for the agent era the way AWS became the infrastructure layer for the web era.&lt;/p&gt;

&lt;p&gt;The single thing I'd actually go build with first from this week: Sandboxes. Not the headline "agents create accounts" story — the persistent isolated Linux environment for agent code execution. That's immediately useful for any code-generation or code-testing agent, and it works today without novel legal or financial risk.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@cloudflare/agents@0.0.16&lt;/code&gt; tells you what you need to know about production readiness. But if you're serious about evaluating agent infrastructure options, run through the local setup and form your own opinion. Twenty minutes, no account required.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Test environment&lt;/strong&gt;: &lt;code&gt;@cloudflare/agents@0.0.16&lt;/code&gt;, &lt;code&gt;wrangler@4.91.0&lt;/code&gt;, Node.js v22.22.0, macOS 14&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt;: The autonomous agent account creation feature requires a real Cloudflare account and Stripe integration — out of scope for local testing.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Source&lt;/strong&gt;: &lt;a href="https://blog.cloudflare.com/agents-week-in-review/" rel="noopener noreferrer"&gt;Cloudflare Agents Week 2026&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloudflare</category>
      <category>aiagents</category>
      <category>agentinfrastructure</category>
      <category>webplatform</category>
    </item>
    <item>
      <title>AWS MCP Server GA Practical Guide — Connecting CloudWatch &amp; IAM to Your AI Coding Agent</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 14 May 2026 06:42:39 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/aws-mcp-server-ga-practical-guide-connecting-cloudwatch-iam-to-your-ai-coding-agent-3d79</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/aws-mcp-server-ga-practical-guide-connecting-cloudwatch-iam-to-your-ai-coding-agent-3d79</guid>
      <description>&lt;p&gt;A CloudWatch alarm fired. Lambda error rate crossed the threshold, and I needed to dig through logs — flipping between the AWS console and my terminal, copying log group names by hand. At some point I had a clear thought: what if Claude Code could just look at my CloudWatch directly?&lt;/p&gt;

&lt;p&gt;On May 6, 2026, AWS shipped an answer. &lt;strong&gt;AWS MCP Server hit general availability.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AWS MCP Server Actually Is
&lt;/h2&gt;

&lt;p&gt;AWS MCP Server gives AI coding agents — Claude Code, Cursor, Codex — a standardized way to query AWS services directly. It wraps AWS APIs as MCP tools, using the Model Context Protocol that Anthropic defined. One &lt;code&gt;uvx&lt;/code&gt; command wires 31 CloudWatch tools and 29 IAM tools into your coding agent.&lt;/p&gt;

&lt;p&gt;Instead of copying log group names from the console and pasting them into CLI commands, you can ask your agent: "Find the Lambda function with the highest error rate in the past hour and summarize the relevant logs." The agent runs the Logs Insights query itself and brings back results.&lt;/p&gt;

&lt;p&gt;If you've &lt;a href="https://dev.to/en/blog/en/mcp-server-build-practical-guide-2026"&gt;built an MCP server from scratch&lt;/a&gt;, you already understand the protocol. AWS MCP Server is the official, AWS-maintained collection of MCP servers for AWS services, published at &lt;code&gt;awslabs/mcp&lt;/code&gt; on GitHub and installable from PyPI.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Changed at GA
&lt;/h3&gt;

&lt;p&gt;Three things matter compared to pre-GA versions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM condition context keys.&lt;/strong&gt; Every API call routed through AWS MCP Server now carries &lt;code&gt;aws:ViaAWSMCPService&lt;/code&gt; and &lt;code&gt;aws:CalledViaAWSMCP&lt;/code&gt; condition keys automatically. Your IAM policies can differentiate agent-initiated calls from human-initiated calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full CloudTrail integration.&lt;/strong&gt; Every API call goes to CloudTrail. There's a complete audit trail of what the agent did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate CloudWatch namespace.&lt;/strong&gt; Metrics published under &lt;code&gt;AWS-MCP&lt;/code&gt; let you monitor how much of your API traffic comes from agents versus direct calls.&lt;/p&gt;

&lt;p&gt;The practical upshot: &lt;strong&gt;you can now enforce different IAM permissions for agents and humans while using the same AWS credentials.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation: One Line with uvx
&lt;/h2&gt;

&lt;p&gt;I installed and ran both servers. Here is what it takes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install uv if you don't have it&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh

&lt;span class="c"&gt;# Run CloudWatch MCP server (creates isolated env automatically)&lt;/span&gt;
uvx awslabs.cloudwatch-mcp-server@latest

&lt;span class="c"&gt;# Run IAM MCP server&lt;/span&gt;
uvx awslabs.iam-mcp-server@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;uvx&lt;/code&gt; handles the virtual environment. First run pulls 53 packages for the CloudWatch server — botocore, pandas, scipy, statsmodels, and more. The reason for scipy and statsmodels is that the CloudWatch server includes built-in anomaly detection and statistical analysis on metrics, not just passthrough queries.&lt;/p&gt;

&lt;p&gt;Installed versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;awslabs.cloudwatch-mcp-server&lt;/code&gt; v0.1.1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;awslabs.iam-mcp-server&lt;/code&gt; v1.0.20&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 0.x version on the CloudWatch server signals the API is still stabilizing. That is worth keeping in mind before putting it in production workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wiring It Into Claude Code (.mcp.json)
&lt;/h3&gt;

&lt;p&gt;Put this in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cloudwatch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"awslabs.cloudwatch-mcp-server@latest"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS_REGION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ap-northeast-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS_PROFILE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FASTMCP_LOG_LEVEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARNING"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"iam"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"awslabs.iam-mcp-server@latest"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS_REGION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ap-northeast-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"FASTMCP_LOG_LEVEL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARNING"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;FASTMCP_LOG_LEVEL&lt;/code&gt; to &lt;code&gt;WARNING&lt;/code&gt;. Without it, INFO logs bleed into the agent's responses. You can also install via the Claude Code CLI: &lt;code&gt;claude mcp add aws-mcp-server&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudWatch MCP Server: 31 Tools
&lt;/h2&gt;

&lt;p&gt;When the server starts, it registers exactly 31 tools. Here is the breakdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log group tools (8):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;describe_log_groups         List log groups
analyze_log_group           AI-powered log pattern analysis
execute_log_insights_query  Run a Logs Insights query
get_logs_insight_query_results  Poll query results
cancel_logs_insight_query   Cancel a running query
execute_cwl_insights_batch  Batch query execution
recommend_indexes_loggroup  Index recommendations for a log group
recommend_indexes_account   Account-wide index recommendations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metrics tools (11):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;get_metric_data             Fetch metric data points
get_metric_metadata         Metadata lookup (1,179 entries indexed at startup)
analyze_metric              Anomaly detection on a metric
get_recommended_metric_alarms  Suggest alarm thresholds
execute_promql_query        Run a PromQL query
execute_promql_range_query  PromQL range query
get_promql_label_values     PromQL label values
get_promql_series           PromQL series
get_promql_labels           PromQL labels
get_active_alarms           List active alarms
get_alarm_history           Alarm state history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;get_metric_metadata&lt;/code&gt; detail is worth noting. At startup, the server loads and indexes 1,179 metric metadata entries covering EC2, Lambda, RDS, DynamoDB, and most other AWS services. The server logs show it explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;INFO | Loaded 1179 metric metadata entries
INFO | Successfully indexed 1179 metric metadata entries
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what allows the agent to answer "which metric measures Lambda cold start duration?" without hitting the AWS docs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Found on My Account
&lt;/h3&gt;

&lt;p&gt;I ran this against a real ap-northeast-1 account. The output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Available log groups (5):
  /aws/lambda/remotax-renewal-fe-CustomCDKBucketDeployment: 331,695 bytes
  /aws/lambda/remotax-renewal-fe-CustomS3AutoDeleteObjects:   2,038 bytes
  /aws/lambda/remotax-renewal-fe-LambdaServerFunctionHandler:     0 bytes
  /aws/lambda/remotax-renewal-fe-LogRetentionaae0aa3c5b4d4f:     0 bytes
  RDSOSMetrics: 55,192,669 bytes

Active CloudWatch Alarms:
  OK    EC2-HighCPU-Alarm
&lt;/span&gt;&lt;span class="gp"&gt;        CPUUtilization &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 80% | Currently: OK
&lt;span class="go"&gt;  ?     EC2-HighDiskUsage-Alarm
&lt;/span&gt;&lt;span class="gp"&gt;        disk_used_percent &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 80% | INSUFFICIENT_DATA
&lt;span class="go"&gt;  ?     EC2-HighMemoryUsage-Alarm
&lt;/span&gt;&lt;span class="gp"&gt;        mem_used_percent &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 80% | INSUFFICIENT_DATA
&lt;span class="go"&gt;  ?     LaravelErrorAlarm
&lt;/span&gt;&lt;span class="gp"&gt;        LaravelErrorCount &amp;gt;&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; 1 | INSUFFICIENT_DATA
&lt;span class="go"&gt;
EC2 metrics available: 85
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three alarms sitting in &lt;code&gt;INSUFFICIENT_DATA&lt;/code&gt;. Disk and memory alarms with no data means CloudWatch Agent is not running or misconfigured on those EC2 instances. That is the kind of silent infrastructure problem that usually only surfaces when an alert should fire and doesn't. The agent picked it up immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  IAM MCP Server: 29 Tools and the Security Architecture That Matters
&lt;/h2&gt;

&lt;p&gt;The IAM server ships 29 tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_users / get_user / create_user / delete_user
list_roles / create_role
list_policies / get_managed_policy_document
attach_user_policy / detach_user_policy
create_access_key / delete_access_key
simulate_principal_policy    ← the important one
list_groups / create_group / delete_group
add_user_to_group / remove_user_from_group
put_role_policy / get_role_policy / delete_role_policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I find &lt;code&gt;simulate_principal_policy&lt;/code&gt; the most useful. It checks whether an IAM principal can perform specific actions without actually making those API calls. After reading about &lt;a href="https://dev.to/en/blog/en/mcp-security-crisis-30-cves-enterprise-hardening"&gt;MCP ecosystem security vulnerabilities and 30 CVEs&lt;/a&gt;, having agents pre-validate their permissions before executing is a meaningful safety step.&lt;/p&gt;

&lt;p&gt;Test run against my account:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;simulate_principal_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;PolicySourceArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::370193714718:user/remotax-fe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ActionNames&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch:DescribeAlarms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logs:DescribeLogGroups&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;iam:ListUsers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:ListBuckets&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;ResourceArns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Results:
# ✓ cloudwatch:DescribeAlarms: allowed
# ✓ logs:DescribeLogGroups: allowed
# ✓ iam:ListUsers: allowed
# ✓ s3:ListBuckets: allowed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Condition Key Architecture
&lt;/h3&gt;

&lt;p&gt;This is the part I think matters most about the GA release. Every API call through AWS MCP Server automatically carries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;aws:ViaAWSMCPService&lt;/code&gt; — marks this as a request via an MCP service&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws:CalledViaAWSMCP&lt;/code&gt; — marks this as originating from an MCP client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An IAM deny policy using these keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:CreateUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:DeleteUser"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:AttachUserPolicy"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:ViaAWSMCPService"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this policy, a human using the AWS console can manage IAM users. Claude Code using the same credentials cannot. Same key pair, different effective permissions. When I was &lt;a href="https://dev.to/en/blog/en/claude-agent-sdk-tool-use-complete-guide-2026"&gt;implementing Tool Use in the Claude Agent SDK&lt;/a&gt;, I had to build agent permission scoping into application logic. AWS is solving that at the infrastructure layer here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Faws-mcp-server-ga-practical-guide-2026-arch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Faws-mcp-server-ga-practical-guide-2026-arch.png" alt="AWS MCP Server Security Architecture — CloudWatch, IAM, CloudTrail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three layers: coding agent → AWS MCP Server (stdio) → AWS API (SigV4 auth). Every AWS API call goes to CloudTrail. Metrics land in the AWS-MCP CloudWatch namespace separately from direct human calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  Available AWS MCP Servers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Server&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;awslabs.cloudwatch-mcp-server&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Logs, Metrics, Alarms&lt;/td&gt;
&lt;td&gt;v0.1.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;awslabs.iam-mcp-server&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM management&lt;/td&gt;
&lt;td&gt;v1.0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;awslabs.aws-api-mcp-server&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any AWS API&lt;/td&gt;
&lt;td&gt;separate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Application Signals&lt;/td&gt;
&lt;td&gt;APM/SLO monitoring&lt;/td&gt;
&lt;td&gt;separate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Network MCP Server&lt;/td&gt;
&lt;td&gt;VPC/network diagnostics&lt;/td&gt;
&lt;td&gt;separate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Pricing MCP Server&lt;/td&gt;
&lt;td&gt;Cost estimation&lt;/td&gt;
&lt;td&gt;separate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EKS MCP Server&lt;/td&gt;
&lt;td&gt;EKS cluster management&lt;/td&gt;
&lt;td&gt;separate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;aws-api-mcp-server&lt;/code&gt; is interesting. It exposes every AWS API through a single tool. When &lt;a href="https://dev.to/en/blog/en/fastmcp-python-mcp-server-build-guide-2026"&gt;building a FastMCP-based MCP server&lt;/a&gt;, each API endpoint needed its own tool definition. The aws-api-mcp-server flips that — one tool, all APIs. The trade-off is that the agent needs more context to figure out which API to call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Assessment — What Works, What Doesn't
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What I find genuinely useful:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The IAM condition key separation is real. If you've been hesitant to give agents AWS access because you can't restrict them beyond the IAM user's permissions, this changes the calculation. You can attach &lt;code&gt;aws:ViaAWSMCPService&lt;/code&gt; deny statements to enforce read-only agent access while keeping full human access with the same credentials.&lt;/p&gt;

&lt;p&gt;PromQL support surprised me. CloudWatch supports PromQL for Container Insights metrics, and the MCP server exposes it. If you run Kubernetes on EKS and already write PromQL, you can use that syntax directly through the agent.&lt;/p&gt;

&lt;p&gt;The 1,179-entry metric metadata index means the agent can reason about AWS services it has never seen before in your specific account. It knows what metrics EC2, Lambda, RDS, and most other services expose without needing to query AWS each time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What gives me pause:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CloudWatch server at v0.1.1. The AI analysis tools like &lt;code&gt;analyze_log_group&lt;/code&gt; and &lt;code&gt;analyze_metric&lt;/code&gt; look promising but I have not stress-tested them. A 0.x version in production tooling warrants caution.&lt;/p&gt;

&lt;p&gt;Logs Insights cost. CloudWatch charges for scanned log data in Insights queries. An agent with unconstrained query access could run up meaningful charges. There are no cost guardrails at the tool level — that has to be managed at the IAM level (restricting query scope) or through agent instructions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;create_access_key&lt;/code&gt; in the IAM server. An agent tool that creates new AWS access keys is, by default, accessible. The condition key approach can block it, but you have to set that up deliberately. I would not wire up the IAM server in a production environment without first adding explicit deny policies for the write operations.&lt;/p&gt;

&lt;p&gt;My recommendation: start with &lt;code&gt;cloudwatch-mcp-server&lt;/code&gt; in read-heavy workflows. Treat the IAM server as a development tool until you have the deny policies in place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If AWS credentials are configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install uv&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://astral.sh/uv/install.sh | sh

&lt;span class="c"&gt;# Test immediately&lt;/span&gt;
uvx awslabs.cloudwatch-mcp-server@latest

&lt;span class="c"&gt;# Add to a project&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .mcp.json &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "AWS_REGION": "us-east-1",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    }
  }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Official docs: &lt;a href="https://awslabs.github.io/mcp" rel="noopener noreferrer"&gt;awslabs.github.io/mcp&lt;/a&gt;. Source: &lt;a href="https://github.com/awslabs/mcp" rel="noopener noreferrer"&gt;github.com/awslabs/mcp&lt;/a&gt;. Free to use — you pay only for the AWS resources the agent touches.&lt;/p&gt;

&lt;p&gt;AI agents having console-level visibility into AWS infrastructure is coming regardless. AWS MCP Server GA is the first production-ready step in that direction.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>MemRL: Self-Evolving Agents via Episodic Memory RL</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 14 May 2026 04:16:52 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/memrl-self-evolving-agents-via-episodic-memory-rl-464b</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/memrl-self-evolving-agents-via-episodic-memory-rl-464b</guid>
      <description>&lt;p&gt;There is a gap in how most AI agents handle experience. They reason well from the start, but they don't get smarter from what they do. Fine-tuning closes that gap, but it's expensive, slow, and prone to catastrophic forgetting. RAG-based memory is cheaper, but it retrieves by similarity — not by whether a past strategy actually worked.&lt;/p&gt;

&lt;p&gt;MemRL, published on arXiv in January 2026, proposes a different approach: apply reinforcement learning directly to episodic memory at runtime, without touching model weights. The result is an agent that improves through trial and error, storing structured experiences and learning which ones to prioritize based on real task outcomes.&lt;/p&gt;

&lt;p&gt;This guide breaks down how MemRL works, what the benchmarks show, and how the core mechanism looks in practice — including a minimal reproduction Effloow Lab ran to verify the concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem MemRL Solves
&lt;/h2&gt;

&lt;p&gt;Current agent memory systems face a fundamental tradeoff. On one end, fine-tuning embeds knowledge directly into model weights — but requires expensive compute, labeled data, and still risks overwriting previously learned behavior (catastrophic forgetting). On the other end, RAG-style retrieval keeps knowledge external, making it cheap to update. But standard RAG retrieves by semantic similarity alone. It surfaces documents that look similar to the current query, not documents associated with strategies that previously worked.&lt;/p&gt;

&lt;p&gt;This is the stability-plasticity dilemma: agents either freeze their knowledge (stable but rigid) or update it continuously (plastic but forgetful). MemRL's claim is that this tradeoff is a false choice — you can have a frozen LLM backbone (stable) with an external memory that evolves through RL feedback (plastic).&lt;/p&gt;

&lt;h2&gt;
  
  
  What MemRL Is
&lt;/h2&gt;

&lt;p&gt;MemRL (arXiv:2601.03192, from MemTensor, updated February 2026) is a non-parametric framework that enables agents to self-evolve through runtime reinforcement learning on episodic memory. The LLM's weights never change. Instead, MemRL maintains a structured external memory, refines it based on task outcomes, and uses a two-phase retrieval mechanism to surface the most useful experiences — not just the most similar ones.&lt;/p&gt;

&lt;p&gt;The open-source code is available at &lt;a href="https://github.com/MemTensor/MemRL" rel="noopener noreferrer"&gt;MemTensor/MemRL&lt;/a&gt;, with support for ALFWorld, BigCodeBench, HLE, and Lifelong Agent Bench benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intent-Experience-Utility Triplet
&lt;/h2&gt;

&lt;p&gt;The core data structure in MemRL is not a document. It's a triplet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt;: the task or query the agent was addressing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experience&lt;/strong&gt;: the specific action trajectory or solution strategy used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Utility (Q-value)&lt;/strong&gt;: a learned score representing how successful that experience was&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where RAG stores raw text and retrieves by embedding similarity, MemRL stores structured (intent, experience, Q-value) records. The Q-value is not fixed at write time — it evolves as the agent receives environmental feedback across episodes.&lt;/p&gt;

&lt;p&gt;This distinction matters. Two experiences with similar intents might have very different Q-values if one led to a successful outcome and the other failed. RAG can't distinguish these. MemRL can.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Two-Phase Retrieval Works
&lt;/h2&gt;

&lt;p&gt;When an agent faces a new task, MemRL retrieves relevant past experiences in two stages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase A — Semantic Filter:&lt;/strong&gt; The agent computes similarity between the current intent and all stored intents using dense embeddings. The top-k candidates (by semantic relevance) are kept. This narrows the search to experiences that are topically related to the current task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase B — Q-Value Ranking:&lt;/strong&gt; Among those filtered candidates, MemRL re-ranks by Q-value. Experiences with higher utility — those associated with successful outcomes — rise to the top. The agent retrieves the highest-Q candidates and uses them as in-context guidance for the current task.&lt;/p&gt;

&lt;p&gt;The paper describes Phase A as analogical transfer (retrieving similar past events) and Phase B as mental rehearsal (selecting strategies proven to work). Together, they avoid the main failure mode of pure RAG: retrieving semantically similar but strategically useless memories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Q-Value Learning: The RL Mechanism
&lt;/h2&gt;

&lt;p&gt;After the agent completes a task using retrieved memories, it receives a reward signal from the environment — success, partial success, or failure. MemRL applies a Monte Carlo-style update to the Q-value of the used memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q_new = Q_old + α × (reward - Q_old)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where α is the learning rate. Positive outcomes increase the Q-value; failures decrease it. Over many episodes, Q-values diverge: experiences associated with reliable strategies accumulate higher scores, while noise and failed attempts are downweighted.&lt;/p&gt;

&lt;p&gt;The entire optimization loop runs outside the LLM. No gradient computation, no retraining. The LLM reasons over whatever context it's given — MemRL just gets better at deciding what to put in that context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Effloow Lab PoC: Core Mechanism in Python
&lt;/h2&gt;

&lt;p&gt;Effloow Lab ran a minimal reproduction of the IEU triplet and two-phase retrieval to verify the concept. Full repo installation requires ALFWorld and LLM credentials, so this PoC uses word-overlap similarity instead of dense embeddings — a known limitation documented in the lab run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimpleMemRL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k_semantic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k_q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k_semantic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_k_semantic&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_k_q&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_cosine_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# word-overlap proxy for embeddings (sandbox limitation)
&lt;/span&gt;        &lt;span class="n"&gt;set_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;set_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;set_a&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;set_b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_a&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;set_b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;experience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;initial_q&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;experience&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;experience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;initial_q&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_intent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="c1"&gt;# Phase A: semantic filter
&lt;/span&gt;        &lt;span class="n"&gt;scored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_cosine_sim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scored&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k_semantic&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
        &lt;span class="c1"&gt;# Phase B: Q-value ranking
&lt;/span&gt;        &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;top_k_q&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this with a small set of coding strategy memories, then applying positive feedback to sort-related experiences and negative feedback to a debugging strategy, produced the expected result: sort strategies rose to Q≈0.62, while the debugging entry dropped to Q≈0.24. Subsequent queries for sorting tasks surfaced the higher-Q memories first.&lt;/p&gt;

&lt;p&gt;The key limitation observed: word-overlap similarity doesn't capture semantic equivalence well, which caused some retrieval mismatches. Real MemRL uses dense embeddings (e.g., OpenAI text-embedding models or similar), resolving this. Full lab-run details and output are in &lt;code&gt;data/lab-runs/memrl-self-evolving-agents-episodic-memory-rl-guide-2026.md&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;The paper benchmarks MemRL across these tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;MemRL (Last Acc.)&lt;/th&gt;
&lt;th&gt;MemP Baseline&lt;/th&gt;
&lt;th&gt;No-Memory Baseline&lt;/th&gt;
&lt;th&gt;Key Gain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ALFWorld&lt;/td&gt;
&lt;td&gt;0.507&lt;/td&gt;
&lt;td&gt;0.324&lt;/td&gt;
&lt;td&gt;0.278&lt;/td&gt;
&lt;td&gt;+56% over MemP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HLE&lt;/td&gt;
&lt;td&gt;0.573&lt;/td&gt;
&lt;td&gt;0.528&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;+8.5% over MemP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BigCodeBench&lt;/td&gt;
&lt;td&gt;0.508&lt;/td&gt;
&lt;td&gt;0.494&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;+2.8% over MemP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifelong Agent Bench&lt;/td&gt;
&lt;td&gt;0.697 CSR&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Best overall&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains are largest on ALFWorld and Lifelong Agent Bench — multi-step sequential tasks where memory utility accumulates across episodes. BigCodeBench shows smaller gains because it's primarily single-turn: there's less opportunity for multi-episode Q-value refinement when each task is independent.&lt;/p&gt;

&lt;p&gt;This pattern is important. MemRL's value is proportional to how much your agent loops over time. If your agent handles isolated, one-shot queries, you won't see ALFWorld-level improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  MemRL vs Traditional RAG
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MemRL Strengths
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Learns from success/failure — not just semantic match&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;No model fine-tuning required — frozen LLM backbone&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Q-values suppress noise and bad strategies over time&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Improves within a session and across sessions (transfer)&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Open-source with multi-benchmark validation&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;


Where It Lags
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Needs an environmental feedback signal — not always available&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Less useful for purely one-shot tasks without episode loops&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Q-value cold start: early episodes have unrefined utility scores&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;More complex to set up than a standard RAG pipeline&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The underlying difference is what retrieval optimizes for. RAG finds memories that are similar. MemRL finds memories that are similar &lt;em&gt;and&lt;/em&gt; proved useful. For long-running agents where failure has a cost — home automation, coding assistants, planning agents — this distinction is meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tempera MCP Server
&lt;/h2&gt;

&lt;p&gt;A community implementation called &lt;a href="https://github.com/anvanster/tempera" rel="noopener noreferrer"&gt;Tempera&lt;/a&gt; applies MemRL concepts to AI coding workflows via Model Context Protocol (MCP). Tempera captures coding sessions as episodes, indexes them for semantic search, and uses RL to surface the most valuable memories at query time. All projects share a common memory database stored under &lt;code&gt;~/.tempera/&lt;/code&gt;, enabling cross-project learning — a direct practical application of the MemRL architecture.&lt;/p&gt;

&lt;p&gt;This matters for developers already using MCP-compatible tools: Tempera is one path to experimenting with MemRL ideas without implementing the full research framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Get Started with MemRL
&lt;/h2&gt;

&lt;p&gt;For developers interested in running the actual MemRL benchmarks, the setup flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone the repo&lt;/span&gt;
git clone https://github.com/MemTensor/MemRL
&lt;span class="nb"&gt;cd &lt;/span&gt;MemRL

&lt;span class="c"&gt;# 2. Create environment (Python 3.10 required)&lt;/span&gt;
conda create &lt;span class="nt"&gt;-n&lt;/span&gt; memrl &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.10
conda activate memrl

&lt;span class="c"&gt;# 3. Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# 4. Configure LLM + embedding settings in configs/&lt;/span&gt;
&lt;span class="c"&gt;# (YAML files per benchmark)&lt;/span&gt;

&lt;span class="c"&gt;# 5. Run a benchmark runner&lt;/span&gt;
python memrl/run/alfworld_rl_runner.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results write to &lt;code&gt;logs/&lt;/code&gt; and &lt;code&gt;results/&lt;/code&gt; directories. The &lt;code&gt;configs/&lt;/code&gt; directory controls which LLM and embedding model you use — the paper uses frontier models but the code supports swapping these.&lt;/p&gt;

&lt;p&gt;Full environment setup for ALFWorld requires additional installation steps documented in the repo's README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implications for Agent Developers
&lt;/h2&gt;

&lt;p&gt;MemRL's ideas translate to a few concrete questions worth asking about any agent system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does your agent run repeatedly over similar tasks?&lt;/strong&gt; If yes, runtime Q-value learning could improve retrieval quality. If your agent handles purely isolated requests, the benefit is limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your feedback signal?&lt;/strong&gt; MemRL needs a reward — task success, user rating, test pass/fail, something. Agents that get no structured outcome signal can't update Q-values. Designing a feedback loop is a prerequisite, not an afterthought.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you fighting retrieval noise?&lt;/strong&gt; If your RAG-based memory system frequently surfaces semantically similar but strategically useless memories, MemRL's Phase B filtering is directly relevant. The Q-value layer exists precisely to downweight experiences that match the query but don't help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do you need to avoid retraining?&lt;/strong&gt; MemRL's strongest argument is that agents can improve without compute-intensive fine-tuning cycles. For teams running agents at scale where fine-tuning is prohibitively expensive, this is a meaningful alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How is MemRL different from Reflexion or Voyager?
&lt;/h3&gt;

&lt;p&gt;Reflexion stores verbal self-reflection notes in memory. Voyager builds a skill library. MemRL is distinct in applying Q-value learning to determine which stored experiences to retrieve. Reflexion and Voyager still rely on recency or semantic matching; MemRL's retrieval is utility-driven.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can MemRL work with any LLM?
&lt;/h3&gt;

&lt;p&gt;Yes — the LLM backbone is frozen. MemRL is agnostic to the underlying model. The paper runs experiments with frontier models, but the memory and retrieval mechanism is entirely external to the LLM's weights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What happens if the reward signal is noisy?
&lt;/h3&gt;

&lt;p&gt;Noisy rewards are a known challenge in RL. The paper applies Monte Carlo-style updates (averaging over episodes) which provides some robustness, but highly noisy reward signals will produce unreliable Q-values. The quality of MemRL's learning is bounded by the quality of the feedback signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does MemRL require embeddings?
&lt;/h3&gt;

&lt;p&gt;Yes, Phase A requires dense vector similarity. The sandbox PoC used word-overlap as a proxy, but real MemRL uses embedding models to compute semantic similarity between stored intents and current queries. Any embedding model compatible with your stack works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;MemRL addresses a genuine gap: the cost of fine-tuning versus the limitations of static retrieval. Its approach — structure memory as IEU triplets, filter by semantics, rank by learned Q-values, update Q-values from task outcomes — is conceptually clean and benchmarked across four tasks.&lt;/p&gt;

&lt;p&gt;The gains are largest for multi-step, episodic tasks (ALFWorld: +56% over MemP) and more modest for single-turn workloads (BigCodeBench: +2.8%). The framework needs a feedback signal, and Q-values start uninformed — so there's a cold-start cost on early episodes.&lt;/p&gt;

&lt;p&gt;For teams building agents that loop repeatedly over tasks, interact with real environments, and can capture task success as a signal, MemRL is a well-evidenced alternative to both fine-tuning and standard RAG. The code is open, the benchmarks are public, and the Tempera MCP server offers a path to experimenting without setting up the full research framework.&lt;/p&gt;

&lt;p&gt;Bottom Line&lt;br&gt;
  &lt;/p&gt;
&lt;p&gt;MemRL is one of the more rigorous proposals for non-parametric agent learning published in early 2026. If you're running agents that repeat tasks and can capture feedback, the two-phase retrieval mechanism is worth understanding — and the open-source code makes it possible to test on your own benchmarks without writing the RL layer from scratch.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2601.03192" rel="noopener noreferrer"&gt;MemRL: Self-Evolving Agents via Runtime RL on Episodic Memory (arXiv:2601.03192)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/html/2601.03192v1" rel="noopener noreferrer"&gt;MemRL Full Paper HTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/MemTensor/MemRL" rel="noopener noreferrer"&gt;MemTensor/MemRL — GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://venturebeat.com/technology/memrl-outperforms-rag-on-complex-agent-benchmarks-without-fine-tuning" rel="noopener noreferrer"&gt;VentureBeat: MemRL Outperforms RAG Without Fine-Tuning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/anvanster/tempera" rel="noopener noreferrer"&gt;anvanster/tempera — Tempera MCP Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.agentic-patterns.com/patterns/memory-reinforcement-learning-memrl/" rel="noopener noreferrer"&gt;Agentic Patterns: Memory Reinforcement Learning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>memrl</category>
      <category>episodicmemory</category>
      <category>reinforcementlearning</category>
      <category>llmagents</category>
    </item>
    <item>
      <title>OpenAI Realtime Audio API: Voice Agents Guide 2026</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Thu, 14 May 2026 00:12:11 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/openai-realtime-audio-api-voice-agents-guide-2026-478i</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/openai-realtime-audio-api-voice-agents-guide-2026-478i</guid>
      <description>&lt;p&gt;On May 7, 2026, OpenAI quietly made voice agents production-viable. Three new realtime audio models landed in the API at the same time: &lt;strong&gt;GPT-Realtime-2&lt;/strong&gt; (voice with GPT-5-class reasoning), &lt;strong&gt;GPT-Realtime-Translate&lt;/strong&gt; (live speech-to-speech translation across 70+ languages), and &lt;strong&gt;GPT-Realtime-Whisper&lt;/strong&gt; (streaming speech-to-text billed by the minute). Each model has its own pricing, endpoint, and use-case fit.&lt;/p&gt;

&lt;p&gt;If you have been waiting for a stable, production-ready voice API before building, the wait is over. This guide walks through what each model does, how to connect to the API, what it costs, and the production patterns that separate a working demo from a robust voice agent.&lt;/p&gt;

&lt;p&gt;Effloow Lab inspected the Realtime API protocol and validated client-side event structures locally as part of this article's research. Full live testing requires an OpenAI API key; where relevant, we note what we verified and what we did not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Release Matters
&lt;/h2&gt;

&lt;p&gt;Previous versions of the Realtime API required working around a 32K-token context ceiling, managing your own speech-to-text pipeline, and accepting that the model would sometimes lose the thread of a long conversation. GPT-Realtime-2 removes these constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window expanded to 128K tokens&lt;/strong&gt; — four times the previous limit, enough for multi-turn conversations spanning tens of minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-5-class reasoning integrated directly&lt;/strong&gt; — the model can call tools, reason through steps, and respond, all without leaving the audio stream&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three specialized models&lt;/strong&gt; instead of one general voice model, each optimized for a specific cost-performance point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The split into three models is also a pricing move. If you only need transcription, GPT-Realtime-Whisper at $0.017/minute is dramatically cheaper than running voice inference at $32/1M tokens. Choose the right model and you can cut costs by 80–90% relative to using GPT-Realtime-2 for everything.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-realtime-2&lt;/td&gt;
&lt;td&gt;Voice reasoning agent&lt;/td&gt;
&lt;td&gt;$32/1M input · $64/1M output tokens&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-realtime-translate&lt;/td&gt;
&lt;td&gt;Live speech translation&lt;/td&gt;
&lt;td&gt;$0.034/min&lt;/td&gt;
&lt;td&gt;Translation-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gpt-realtime-whisper&lt;/td&gt;
&lt;td&gt;Streaming transcription&lt;/td&gt;
&lt;td&gt;$0.017/min&lt;/td&gt;
&lt;td&gt;STT-only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  GPT-Realtime-2: Voice Reasoning for Production Agents
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-2 is the flagship of the trio. It brings GPT-5-level intelligence into the audio stream: the model can reason through multi-step requests, call functions, handle tool results, and continue speaking — all without pausing the conversation for a round trip to a separate text model.&lt;/p&gt;

&lt;h3&gt;
  
  
  How audio tokens are billed
&lt;/h3&gt;

&lt;p&gt;OpenAI encodes audio duration into tokens rather than sampling audio at a fixed rate. The billing math is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User speech (input):&lt;/strong&gt; 1 token per 100 ms of audio → 600 tokens per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model response (output):&lt;/strong&gt; 1 token per 50 ms of audio → 1,200 tokens per minute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a typical bidirectional voice call where the user talks roughly as much as the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input cost:  600 tokens × ($32 / 1,000,000) = $0.0192 / min
Output cost: 1,200 tokens × ($64 / 1,000,000) = $0.0768 / min
Total uncached: ~$0.096 / min (~$5.76 / hour)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With prompt caching applied to system instructions and persistent session context, real-world costs can drop to roughly $0.05–$0.10/min according to third-party production estimates published by OpenAI partners.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting via WebSocket
&lt;/h3&gt;

&lt;p&gt;The Realtime API uses a persistent WebSocket connection. Every interaction is modeled as an exchange of typed JSON events — the client sends events, the server sends events back. Effloow Lab validated that the client-side event structures serialize and round-trip correctly in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;

&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# your key
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;voice_agent_session&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OpenAI-Beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;realtime=v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;additional_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Configure the session
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modalities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alloy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turn_detection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server_vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silence_duration_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up a customer order by ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                            &lt;span class="p"&gt;},&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_choice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Stream audio (PCM16, 24kHz, base64-encoded chunks)
&lt;/span&gt;        &lt;span class="c1"&gt;# await ws.send(json.dumps({
&lt;/span&gt;        &lt;span class="c1"&gt;#     "type": "input_audio_buffer.append",
&lt;/span&gt;        &lt;span class="c1"&gt;#     "audio": base64_chunk
&lt;/span&gt;        &lt;span class="c1"&gt;# }))
&lt;/span&gt;
        &lt;span class="c1"&gt;# 3. Listen for server events
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;raw_msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.audio.delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# stream audio bytes to speaker
&lt;/span&gt;                &lt;span class="k"&gt;pass&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.function_call_arguments.done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# handle tool call, then send result back
&lt;/span&gt;                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversation.item.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function_call_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}))&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response.create&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;voice_agent_session&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The OpenAI Agents Python SDK (&lt;code&gt;openai-agents&lt;/code&gt;) wraps this pattern into a higher-level &lt;code&gt;RealtimeAgent&lt;/code&gt; class if you prefer avoiding raw WebSocket management. The underlying transport is the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool calls mid-conversation
&lt;/h3&gt;

&lt;p&gt;GPT-Realtime-2 can call functions while speaking. The agent does not stop talking and wait — it continues the audio stream with a phrase like "Let me look that up" while dispatching the tool call in parallel. When the result arrives, it folds it into the ongoing response. This pattern is what makes GPT-Realtime-2 meaningfully different from a text model with TTS bolted on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interruption handling
&lt;/h3&gt;

&lt;p&gt;Voice activity detection (VAD) is built in when you set &lt;code&gt;turn_detection.type = "server_vad"&lt;/code&gt;. When the user starts speaking mid-response, the API sends a &lt;code&gt;response.cancelled&lt;/code&gt; event, truncates the current audio output, and starts a new inference cycle. The 128K context window means the model retains everything said before the interruption without a context reset.&lt;/p&gt;

&lt;p&gt;Three things to get right in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;VAD threshold&lt;/strong&gt; (&lt;code&gt;threshold: 0.5&lt;/code&gt; in the example above) — lower values detect softer speech but increase false triggers in noisy environments. Tune per your deployment channel (phone line vs browser microphone vs call center headset).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silence duration&lt;/strong&gt; (&lt;code&gt;silence_duration_ms&lt;/code&gt;) — how long a pause triggers end-of-turn. 500ms works for conversational speech; customer support scripts may need 700–1000ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barge-in state management on your server&lt;/strong&gt; — when &lt;code&gt;response.cancelled&lt;/code&gt; fires, flush any queued tool results from the cancelled turn or you'll deliver stale data to the next response cycle.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  GPT-Realtime-Translate: Live Speech-to-Speech Translation
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-Translate is a single-purpose model trained on thousands of hours of professional interpreter audio. It takes live speech in any of 70+ input languages, detects the source language automatically, and returns translated speech plus text transcripts in one of 13 output languages.&lt;/p&gt;

&lt;p&gt;Target output languages as of May 2026: Spanish, Portuguese, French, Japanese, Russian, Chinese, German, Korean, Hindi, Indonesian, Vietnamese, Italian, and English.&lt;/p&gt;

&lt;p&gt;The dedicated endpoint is &lt;code&gt;/v1/realtime/translations&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://api.openai.com/v1/realtime/translations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;session_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ja&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# target language code
&lt;/span&gt;        &lt;span class="c1"&gt;# source language is auto-detected
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alloy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You stream 24 kHz PCM16 audio into &lt;code&gt;input_audio_buffer.append&lt;/code&gt; exactly as you would with GPT-Realtime-2. The model processes input audio while simultaneously streaming translated audio back, which keeps perceived latency low over continuous speech.&lt;/p&gt;

&lt;p&gt;Unlike a general-purpose voice model, GPT-Realtime-Translate will not answer questions or carry on conversation. It is translation-only by design. If a user asks "what time is it?" in French and your output language is English, the model translates the question into English — it does not answer it. Build a routing layer in front if your product needs both translation and reasoning.&lt;/p&gt;

&lt;p&gt;At $0.034/minute, a one-hour multilingual support call costs $2.04 in translation credits. A 30-person conference session with real-time translation for 60 minutes costs around $60 — cheaper than a human interpreter for a short session, and it runs at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT-Realtime-Whisper: Streaming Speech-to-Text
&lt;/h2&gt;

&lt;p&gt;GPT-Realtime-Whisper is the transcription-only model in the trio. It starts producing text output as the speaker talks rather than waiting for an utterance to finish. This keeps the UI feeling responsive — a transcription bar can update word-by-word instead of appearing in blocks.&lt;/p&gt;

&lt;p&gt;Pricing at $0.017/minute makes it among the cheapest options for streaming STT in the OpenAI ecosystem. An eight-hour workday of continuous transcription costs about $8.16.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Whisper Realtime session uses the standard /v1/realtime endpoint
# with model=gpt-realtime-whisper
&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Server returns transcript deltas as speech is detected:
# { "type": "conversation.item.input_audio_transcription.delta", "delta": "Hello, " }
# { "type": "conversation.item.input_audio_transcription.delta", "delta": "can you hear me?" }
# { "type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello, can you hear me?" }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT-Realtime-Whisper is the right choice when you need transcription but not inference — meeting recorders, live captioning systems, accessibility tools, voice-search preprocessing, and call analytics pipelines where a separate LLM processes the transcript downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Application: Choosing the Right Model
&lt;/h2&gt;

&lt;p&gt;The three models are not interchangeable. Use this decision tree:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does your user need a spoken response from the AI?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes, and it involves reasoning, tool calls, or multi-turn logic → &lt;strong&gt;gpt-realtime-2&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Yes, but it is a direct translation of what another person said → &lt;strong&gt;gpt-realtime-translate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No, you only need the text of what the user said → &lt;strong&gt;gpt-realtime-whisper&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A customer support agent that looks up orders and reads statuses aloud: gpt-realtime-2.&lt;br&gt;
A multilingual conference call platform where each attendee hears their own language: gpt-realtime-translate.&lt;br&gt;
A meeting transcription SaaS that feeds into a separate summarizer: gpt-realtime-whisper.&lt;/p&gt;

&lt;p&gt;For hybrid products, you can run models side-by-side. A global customer support pipeline might use gpt-realtime-translate for non-English callers to produce an English transcript, then pass that transcript to a text-only GPT-5 for classification and routing, and only invoke gpt-realtime-2 when the agent needs to speak back. This layering can reduce per-call cost significantly compared to routing all audio through gpt-realtime-2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes in Production Voice Agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ignoring prompt caching on system instructions.&lt;/strong&gt; The session configuration message is sent at the start of every WebSocket connection. For long system prompts, this is the largest per-session input cost. OpenAI caches inputs at $0.40/1M tokens vs $32/1M for uncached. Keep your system prompt stable and reuse session configurations where possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treating &lt;code&gt;response.cancelled&lt;/code&gt; as an error.&lt;/strong&gt; Interruptions are a normal part of conversation. Your application should handle the cancel event cleanly — flush pending state, log the cancelled turn, and let the model proceed with the new input. Applications that surface interruption events as errors create broken UX and noisy logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting that context grows.&lt;/strong&gt; The 128K context window means gpt-realtime-2 can hold a very long conversation without a reset. But it also means costs accumulate. A one-hour conversation with balanced speaking time can push well past $10 in audio tokens alone. For high-volume deployments, consider session time limits or periodic context compaction using a text-model summarization step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using gpt-realtime-2 for transcription-only use cases.&lt;/strong&gt; If you only need the text of what the user said, run gpt-realtime-whisper at $0.017/min instead of gpt-realtime-2 at $0.096+/min. The cost difference is roughly 5–6x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard-coding the VAD threshold.&lt;/strong&gt; Different audio channels have different noise floors. A browser tab with a decent microphone is not the same as a phone call over PSTN. Ship a configuration option, even if only for internal deployment channels.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does gpt-realtime-2 use GPT-5 under the hood?
&lt;/h3&gt;

&lt;p&gt;OpenAI describes gpt-realtime-2 as bringing "GPT-5-class reasoning" to live voice, and their Big Bench Audio benchmark shows +15.2% audio intelligence over GPT-Realtime-1.5. OpenAI has not confirmed whether the underlying weights are shared with GPT-5 or whether this is a separate model trained to the same capability level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can I use the Realtime API from a browser (client-side)?
&lt;/h3&gt;

&lt;p&gt;Yes. OpenAI supports ephemeral session tokens for client-side WebSocket connections. Generate a short-lived token from your backend (&lt;code&gt;POST /v1/realtime/sessions&lt;/code&gt;), pass it to the browser, and open the WebSocket from JavaScript. Do not embed your main API key in client-side code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does server VAD compare to manual turn detection?
&lt;/h3&gt;

&lt;p&gt;Server VAD (&lt;code&gt;turn_detection.type = "server_vad"&lt;/code&gt;) lets OpenAI's infrastructure handle speech segmentation — it detects when the user stops speaking and triggers inference automatically. Manual turn detection (&lt;code&gt;turn_detection: null&lt;/code&gt;) gives your application full control: you decide when to commit an audio buffer and request a response. Manual mode is more predictable in noisy environments but requires more engineering. Start with server VAD and switch to manual if you hit false-trigger issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is gpt-realtime-translate available on Azure OpenAI?
&lt;/h3&gt;

&lt;p&gt;Microsoft's Azure AI Foundry announced support for the new realtime audio models including gpt-realtime-whisper and gpt-realtime-translate shortly after the OpenAI release. Check the &lt;a href="https://azure.microsoft.com/en-us/pricing/details/azure-openai/" rel="noopener noreferrer"&gt;Azure OpenAI pricing page&lt;/a&gt; for regional availability and pricing, which may differ from direct OpenAI API pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What audio format does the Realtime API accept?
&lt;/h3&gt;

&lt;p&gt;The API accepts PCM16 audio at 24 kHz, base64-encoded and sent as &lt;code&gt;input_audio_buffer.append&lt;/code&gt; events. Most browser &lt;code&gt;MediaRecorder&lt;/code&gt; APIs require a format conversion step. The OpenAI cookbook includes a &lt;code&gt;realtime_translation_guide&lt;/code&gt; example with a JavaScript AudioWorklet for in-browser PCM16 capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What happens if the WebSocket connection drops mid-conversation?
&lt;/h3&gt;

&lt;p&gt;The session state is held server-side for the duration of the connection. If the connection drops, the session is lost — there is no resume or reconnect mechanism as of May 2026. Build reconnect logic in your client and design conversations to be resumable from the last committed turn. Store transcript deltas locally and replay context if a reconnect is needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;The May 2026 Realtime Audio API update is the first time all three voice agent primitives — reasoning, translation, and transcription — are available in a single unified API with clear per-minute or per-token pricing.&lt;/p&gt;

&lt;p&gt;For most developers building voice agents, the practical starting point is gpt-realtime-2 for prototyping and gpt-realtime-whisper for any transcription path that feeds a separate model. GPT-Realtime-Translate is genuinely useful and underpriced compared to traditional translation infrastructure — a multilingual product that previously required third-party translation services can now route entirely through one API.&lt;/p&gt;

&lt;p&gt;The 128K context window and built-in VAD make gpt-realtime-2 a legitimate foundation for production voice agents rather than a demo novelty. The remaining work is on your side: audio channel handling, graceful interruption management, prompt caching discipline, and cost modeling before you scale.&lt;/p&gt;

&lt;p&gt;Bottom Line&lt;br&gt;
  &lt;/p&gt;
&lt;p&gt;OpenAI's three-model voice API split is the right architecture: specialized models at specialized prices, all behind one WebSocket protocol. GPT-Realtime-2 is finally production-ready with 128K context and native tool calling. GPT-Realtime-Whisper at $0.017/min is the new default for any transcription-only pipeline. Build the routing layer between them and you can cover most voice AI use cases without leaving the OpenAI ecosystem.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>voiceai</category>
      <category>realtimeapi</category>
      <category>voiceagents</category>
    </item>
    <item>
      <title>AWS Kiro: Spec-Driven IDE for Agentic Development</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 13 May 2026 08:20:17 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/aws-kiro-spec-driven-ide-for-agentic-development-5bol</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/aws-kiro-spec-driven-ide-for-agentic-development-5bol</guid>
      <description>&lt;p&gt;There is a quiet argument happening inside every engineering team that uses AI coding tools: should the AI write code directly from a chat prompt, or should it first commit to a plan you can actually verify?&lt;/p&gt;

&lt;p&gt;Cursor and Windsurf answer "write from the prompt." AWS Kiro answers "write the spec first."&lt;/p&gt;

&lt;p&gt;That is not a small difference. It changes what you version, what you review in a pull request, and who on the team can understand what the agent actually built. This guide covers what Kiro does, how the spec workflow is structured, how agent hooks automate the repetitive parts, and where it fits relative to the other agentic IDEs competing for your workflow in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kiro Is
&lt;/h2&gt;

&lt;p&gt;Kiro is a desktop IDE built on Code OSS — the open-source base that VS Code also runs on — developed by Amazon Web Services and released to the public in late 2025. It reached general availability in November 2025 after hitting capacity limits within days of its July preview launch.&lt;/p&gt;

&lt;p&gt;The product is positioned as AWS's successor to Amazon Q Developer. AWS ended new Q Developer signups effective May 15, 2026, explicitly directing new users to Kiro. That matters for team context: if your organization is already on AWS and was evaluating Q Developer, Kiro is now the answer.&lt;/p&gt;

&lt;p&gt;The core design principle: &lt;strong&gt;specs are the source of truth, and code is a build artifact derived from them.&lt;/strong&gt; Rather than asking an agent to "add a rate limiter," you write a spec that describes what the rate limiter should do, under what conditions, and what the acceptance criteria are. The agent then generates code to satisfy the spec, not just a prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spec Workflow
&lt;/h2&gt;

&lt;p&gt;When you start a feature, Kiro creates three structured markdown files under &lt;code&gt;.kiro/specs/{feature-name}/&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;requirements.md&lt;/strong&gt; captures user stories and acceptance criteria using EARS notation (Easy Approach to Requirements Syntax). EARS structures each requirement as a conditional assertion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight gherkin"&gt;&lt;code&gt;&lt;span class="err"&gt;WHEN&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;user&lt;/span&gt; &lt;span class="err"&gt;submits&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;registration&lt;/span&gt; &lt;span class="err"&gt;form&lt;/span&gt;
&lt;span class="err"&gt;THEN&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;system&lt;/span&gt; &lt;span class="err"&gt;SHALL&lt;/span&gt; &lt;span class="err"&gt;validate&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;email&lt;/span&gt; &lt;span class="err"&gt;format&lt;/span&gt; &lt;span class="err"&gt;before&lt;/span&gt; &lt;span class="err"&gt;saving&lt;/span&gt;
&lt;span class="err"&gt;AND&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;system&lt;/span&gt; &lt;span class="err"&gt;SHALL&lt;/span&gt; &lt;span class="err"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;a &lt;/span&gt;422 with field-level errors when validation fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That format is not just documentation. It maps directly to testable assertions, which is why Kiro can generate test stubs from requirements with reasonable accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;design.md&lt;/strong&gt; documents the technical architecture for the feature — data models, sequence diagrams in text form, interface contracts, and any relevant infrastructure considerations. This file lives in the repo alongside the feature code, so anyone reviewing a pull request can see the design intent without reconstructing it from the implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tasks.md&lt;/strong&gt; contains a discrete task list that Kiro generates from the requirements and design. Tasks are tracked as in-progress or completed as the agent works through them. You can pause, redirect, or reassign tasks manually; Kiro treats them as a checkpoint-able queue, not a linear script.&lt;/p&gt;

&lt;p&gt;The three-document structure is also the surface where human review happens. Before the agent touches code, you can edit requirements to narrow scope, add edge cases to the design, or reprioritize tasks. That is the mechanism Kiro offers for keeping the human in the loop on complex features without turning every step into a manual approval.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Strengths
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Specs survive code refactors — the "why" stays versioned in the repo&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;EARS format produces testable acceptance criteria, not vague prose&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Spec review is a natural code review gate that any team member can participate in&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Free tier (50 requests/month) requires no AWS account or credit card&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Powers bundle MCP servers + hooks into reusable, context-aware packages&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;


Limitations
&amp;lt;ul&amp;gt;
  &amp;lt;li&amp;gt;Spec-first workflow adds planning time — not suited for fast prototyping&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Credits deplete quickly on multi-file specs (community-reported)&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Deep AWS integrations require an AWS account and Bedrock access&amp;lt;/li&amp;gt;
  &amp;lt;li&amp;gt;Smaller extension/plugin ecosystem compared to VS Code or Cursor&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Agent Hooks: Automating the Repetitive Parts
&lt;/h2&gt;

&lt;p&gt;One feature that distinguishes Kiro from competitors is its hook system. Hooks are event-driven automations configured in &lt;code&gt;.kiro/hooks/&lt;/code&gt; as JSON files. When a trigger event fires, the hook either runs a natural-language agent prompt or executes a shell command.&lt;/p&gt;

&lt;p&gt;The available triggers as of Kiro's 0.10 changelog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;file:save&lt;/code&gt; — fires whenever you save a file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file:create&lt;/code&gt; — fires when a new file is created&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;task:pre&lt;/code&gt; — fires before a spec task begins executing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;task:post&lt;/code&gt; — fires after a spec task completes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common hook patterns from the official Kiro blog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"file:save"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/components/**/*.tsx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Update the test file for the component that was just saved. Keep existing test cases; add new ones only for changed behavior."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That hook means you never manually sync your test file after touching a component. The agent does it on save, automatically, every time.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;task:post&lt;/code&gt; hook is useful for quality gates. You can configure it to run linting, type checking, or test execution after each agent task completes — so that a multi-step spec run doesn't silently accumulate broken intermediate states.&lt;/p&gt;

&lt;p&gt;Hooks are committed to the repository, not stored locally in user preferences. That means the automation behavior is consistent across the whole team and survives machine changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kiro Powers and MCP Integration
&lt;/h2&gt;

&lt;p&gt;Kiro supports both local and remote MCP servers. Its differentiated feature here is "Powers" — a packaging concept introduced in changelog 0.10.&lt;/p&gt;

&lt;p&gt;A Power bundles three things into a single installable unit:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An MCP server providing tools&lt;/li&gt;
&lt;li&gt;A steering file that defines when and how to activate those tools&lt;/li&gt;
&lt;li&gt;Optional hooks that automate related tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Powers activate on-demand based on conversation context rather than loading all MCP tools upfront. This keeps the token budget clean: if you are working on a CloudFormation stack, the CloudFormation Power becomes active; the pricing tools stay dormant until they are relevant.&lt;/p&gt;

&lt;p&gt;AWS ships first-party Powers for several of its own platforms: CDK, CloudFormation, Pricing, and HealthOmics workflows. Third-party Powers follow the same packaging spec. If you are building your own MCP server and want it to integrate cleanly with Kiro, the Powers format gives you a structured way to bundle it.&lt;/p&gt;

&lt;p&gt;This is worth comparing to how Cursor handles MCP: Cursor supports MCP servers directly but without the packaging abstraction. All configured servers load simultaneously, and there is no built-in concept of context-aware activation. For teams with many MCP tools, the Powers approach reduces noise at the cost of an additional configuration layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing and Getting Started
&lt;/h2&gt;

&lt;p&gt;Kiro runs on a credit system. One agentic request equals one credit. Plans as of May 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Plan&lt;/th&gt;
&lt;th&gt;Monthly Credits&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;AWS Account Required&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;td&gt;Unlimited*&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pro+&lt;/td&gt;
&lt;td&gt;Unlimited*&lt;/td&gt;
&lt;td&gt;$40/mo&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power&lt;/td&gt;
&lt;td&gt;Unlimited*&lt;/td&gt;
&lt;td&gt;$200/mo&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*Overage credits beyond the plan's included usage cost $0.04 each, billed at month-end.&lt;/p&gt;

&lt;p&gt;To install: download from &lt;a href="https://kiro.dev/downloads/" rel="noopener noreferrer"&gt;kiro.dev/downloads&lt;/a&gt;. The installer is available for macOS, Windows, and Linux. Sign in with GitHub, Google, AWS Builder ID, or IAM Identity Center. No credit card for the free tier.&lt;/p&gt;

&lt;p&gt;Your first project follows this path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a folder in Kiro&lt;/li&gt;
&lt;li&gt;Open the Kiro panel and type a feature description in natural language&lt;/li&gt;
&lt;li&gt;Kiro generates &lt;code&gt;.kiro/specs/your-feature/requirements.md&lt;/code&gt; — review and edit it&lt;/li&gt;
&lt;li&gt;Approve the requirements → Kiro generates &lt;code&gt;design.md&lt;/code&gt; and &lt;code&gt;tasks.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Approve the design → Kiro begins working through &lt;code&gt;tasks.md&lt;/code&gt; sequentially&lt;/li&gt;
&lt;li&gt;Hooks run automatically on file saves during implementation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full quickstart is at &lt;a href="https://kiro.dev/docs/getting-started/first-project/" rel="noopener noreferrer"&gt;kiro.dev/docs/getting-started/first-project/&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Kiro Compares to Cursor and Windsurf
&lt;/h2&gt;

&lt;p&gt;The agentic IDE space has three dominant positions heading into mid-2026:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; (1M+ daily active users, $20/mo Pro) is the market leader. Its strength is codebase indexing: semantic embeddings of the entire repo, @-file references, and a polished multi-file editing experience. Agent mode handles large refactors well. The weakness is that "prompt → code" means the agent's intent is implicit in the output, not in a verifiable artifact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windsurf&lt;/strong&gt; ($15/mo) targets enterprise teams. Its Cascade feature auto-discovers context without manual file tagging, which works well on large codebases. First-pass success on complex tasks is reported as higher than Cursor's agent mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kiro&lt;/strong&gt; is the most opinionated of the three. It trades speed for verifiability. The spec workflow adds 15–30 minutes of upfront planning to any non-trivial feature. In return, you get requirements that you can reference in code review, design decisions that survive refactors, and hooks that keep tests and documentation in sync automatically.&lt;/p&gt;

&lt;p&gt;A useful heuristic: if your team already writes design documents before implementing, Kiro formalizes that workflow and connects it to the code generation loop. If your team goes from Jira ticket straight to code, Kiro will feel like it is adding ceremony without clear return.&lt;/p&gt;

&lt;p&gt;For further context on the broader agentic IDE landscape, see the &lt;a href="https://dev.to/articles/cursor-vs-windsurf-vs-zed-ai-ide-comparison-2026"&gt;cursor vs windsurf vs zed comparison&lt;/a&gt; and the &lt;a href="https://dev.to/articles/best-ai-coding-agents-2026"&gt;best AI coding agents roundup for 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Should Actually Use Kiro
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Good fit:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams already on AWS who want AI coding integrated with their existing Bedrock and IAM setup&lt;/li&gt;
&lt;li&gt;Projects where requirements traceability matters: regulated industries, complex APIs, multi-team codebases&lt;/li&gt;
&lt;li&gt;Engineers who write design documents by habit and want to close the gap between the doc and the code&lt;/li&gt;
&lt;li&gt;Anyone evaluating Amazon Q Developer alternatives (Kiro is now the official successor)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Less useful:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solo developers doing rapid prototyping where the cost of planning exceeds the cost of mistakes&lt;/li&gt;
&lt;li&gt;Projects where the team does not review design artifacts — specs without readers add overhead with no return&lt;/li&gt;
&lt;li&gt;Teams wanting the largest VS Code extension ecosystem (Kiro's is smaller, though growing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Real Question: Is Spec-Driven Development a Better Default?
&lt;/h2&gt;

&lt;p&gt;The honest answer is that spec-driven development is better for some teams and worse for others — and Kiro does not resolve that ambiguity for you.&lt;/p&gt;

&lt;p&gt;What Kiro does resolve is the artifact gap that exists in every other agentic IDE: the mismatch between what you asked for and what the code actually does, documented nowhere. The spec files live in the repository. When something breaks three months later, you can read what the system was supposed to do instead of reverse-engineering it from the output.&lt;/p&gt;

&lt;p&gt;Whether that is worth the additional workflow overhead depends on how much of your team's time currently goes into maintaining context versus generating new code. For teams where "why does this work this way" is a common question in standups, the spec overhead pays back quickly. For solo builders iterating fast, the overhead stays overhead.&lt;/p&gt;

&lt;p&gt;Kiro's MCP Powers concept is worth watching independently of the spec workflow. Bundling MCP servers with activation context and hooks is a packaging idea that other IDEs will likely adopt — it solves a real problem with how multiple MCP tools currently have to be configured and managed.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Does Kiro work without an AWS account?
&lt;/h3&gt;

&lt;p&gt;Yes. The free tier (50 agentic requests/month) and the paid Pro plans ($20/mo) work with GitHub or Google sign-in. An AWS account only becomes relevant if you want to use Bedrock directly or connect to AWS-specific Powers like CloudFormation or CDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Are Kiro specs committed to the repository?
&lt;/h3&gt;

&lt;p&gt;Yes. The &lt;code&gt;.kiro/specs/&lt;/code&gt; and &lt;code&gt;.kiro/hooks/&lt;/code&gt; directories are intended to be committed. Specs and hooks are team artifacts, not personal IDE settings. This is deliberate: Kiro's design assumes the spec files are part of the code review surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How are Kiro credits consumed?
&lt;/h3&gt;

&lt;p&gt;Each agentic request consumes one credit. Generating a spec from a prompt, executing a task from &lt;code&gt;tasks.md&lt;/code&gt;, or running an agent hook each count as one request. Autocomplete and inline suggestions do not consume credits. On the free tier (50 credits/month), a medium-complexity feature with 8–10 spec tasks plus several hooks will use most of the monthly allowance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the difference between Kiro Powers and regular MCP servers?
&lt;/h3&gt;

&lt;p&gt;A Power is an MCP server plus a steering file plus optional hooks, packaged together. The steering file tells Kiro when to activate the Power's tools based on conversation context. Regular MCP servers load all their tools upfront; Powers load on-demand. The practical difference is a shorter tool list in the agent's context window, which reduces token usage and improves relevance on complex tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Kiro open source?
&lt;/h3&gt;

&lt;p&gt;The Kiro codebase repository is at &lt;a href="https://github.com/kirodotdev/Kiro" rel="noopener noreferrer"&gt;github.com/kirodotdev/Kiro&lt;/a&gt;. The IDE is built on Code OSS (VS Code open-source base). The agent runtime and Bedrock integrations are proprietary AWS services.&lt;/p&gt;

&lt;p&gt;Bottom Line&lt;br&gt;
  &lt;/p&gt;
&lt;p&gt;Kiro is the first agentic IDE that makes the design document part of the build process rather than a separate artifact that decays. The spec workflow adds overhead that pays back on team codebases where requirements traceability matters. For solo prototyping or teams that run Cursor smoothly, there is no compelling reason to switch today — but the Powers and hooks concepts are worth watching as patterns the rest of the IDE market will absorb.&lt;/p&gt;

</description>
      <category>awskiro</category>
      <category>agenticide</category>
      <category>specdrivendevelopment</category>
      <category>developertools</category>
    </item>
    <item>
      <title>Claude Agent SDK Practical Guide — Building Tool-Using AI Agents from Scratch</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 13 May 2026 06:41:24 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/claude-agent-sdk-practical-guide-building-tool-using-ai-agents-from-scratch-448b</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/claude-agent-sdk-practical-guide-building-tool-using-ai-agents-from-scratch-448b</guid>
      <description>&lt;p&gt;I ran into the Tool Use moment while building a FastAPI streaming backend with the Claude API. The trigger was simple: a user asked "how many days are left in this year?" and Claude answered wrong. Not just wrong — confidently wrong. I remember thinking, "OK, a chatbot can't handle this."&lt;/p&gt;

&lt;p&gt;Tool Use fixes that structurally. Instead of the model calculating directly, it calls a calculation function and uses the result to answer. That difference is what separates a chatbot from an agent.&lt;/p&gt;

&lt;p&gt;This guide covers the Tool Use patterns I validated by directly installing and running anthropic SDK 0.101.0. From basic tool definitions to the agentic loop, error handling, and cost — practical code you can actually use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Tool Use Is Different from a Chatbot — The Structural Gap
&lt;/h2&gt;

&lt;p&gt;An LLM samples tokens from a probability distribution. Tasks like date arithmetic, precise numerical calculations, or live API lookups are structurally unreliable — the model recreates patterns from training data, not ground truth.&lt;/p&gt;

&lt;p&gt;Tool Use addresses this at a different layer. The model decides &lt;em&gt;what to do&lt;/em&gt;, and actual execution is delegated to external code. Instead of computing directly, the model emits something like &lt;code&gt;calculate("365 - today.day_of_year")&lt;/code&gt;, and Python runs it and returns the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Chatbot: model answers directly
# "Doesn't know today's date, has to compute directly -&amp;gt; can be wrong"
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many days left in this year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agent: delegates to a tool
# "Model picks the tool, Python computes accurately"
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# includes date calculation tool
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many days left in this year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decisive difference is reliability. Python's &lt;code&gt;datetime&lt;/code&gt; module doesn't get dates wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup — Sandbox Verification Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# Windows: venv\Scripts\activate&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results from running this directly in a temp directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;anthropic version: 0.101.0
Client instantiated: ✓
Client type: Anthropic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.101.0 is the latest as of 2026-05-13. This is the official Anthropic SDK — completely different from packages like &lt;code&gt;pyautogen&lt;/code&gt; that were common before 2025.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# or set ANTHROPIC_API_KEY env var
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK auto-loads the API key from &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;. Don't hard-code it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Your First Tool — JSON Schema Is All You Need
&lt;/h2&gt;

&lt;p&gt;Tool Use uses a structure similar to OpenAI Function Calling. Each tool has three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt;: Tool identifier (like a function name)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;description&lt;/code&gt;: The basis for the model's decision on when to use this tool&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;input_schema&lt;/code&gt;: JSON Schema for input parameters
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_current_date_info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Returns current date and time information. Use for questions about &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;today&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;now&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, or anything requiring current date knowledge.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timezone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IANA timezone (e.g. America/New_York, Asia/Seoul). Default: UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performs mathematical operations. Handles addition, subtraction, multiplication, division, exponentiation, and modulo.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subtract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;divide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;power&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modulo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The operation to perform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;First operand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Second operand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;description&lt;/code&gt; field matters more than it looks. The model reads only the description to decide whether to use this tool. When I tested with vague descriptions, the model picked the wrong tool or skipped it entirely.&lt;/p&gt;

&lt;p&gt;Validated tool definition structure from my sandbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_current_date_info&lt;/span&gt;
  &lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Returns current date info&lt;/span&gt;
  &lt;span class="s"&gt;Required params&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;

&lt;span class="na"&gt;Tool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;calculate&lt;/span&gt;
  &lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Performs math operations&lt;/span&gt;
  &lt;span class="s"&gt;Required params&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;operation'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementing the Agentic Loop — The Core of Tool Use
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fclaude-agent-sdk-tool-use-complete-guide-2026%2Fagentic-loop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/..%2F..%2F..%2Fassets%2Fblog%2Fclaude-agent-sdk-tool-use-complete-guide-2026%2Fagentic-loop.png" alt="Agentic loop diagram — flow from user message through tool execution to result return"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the core. Tool Use doesn't finish in a single API call. When the model calls a tool → we execute it → we feed the result back. This cycle repeats until the model returns &lt;code&gt;end_turn&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# No tool call — return the final answer
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle tool calls
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Add the full assistant response to messages (including tool calls)
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Collect all tool results and add together
&lt;/span&gt;            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Tool results go under the "user" role (API requirement)
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations exceeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things are easy to miss here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First&lt;/strong&gt;, add the entire &lt;code&gt;response.content&lt;/code&gt; to messages — not just the text block. The model needs to know which tool it called in order to generate its next response correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second&lt;/strong&gt;, tool results go under the &lt;code&gt;user&lt;/code&gt; role. Counterintuitive, but the API treats tool execution results as coming from the environment (the user side), not the assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Real Tools — Calculator, Date, File Reader
&lt;/h2&gt;

&lt;p&gt;The tool execution function is straightforward. It takes a name and input, returns a string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="c1"&gt;# Safe math — uses operator mapping instead of string expression execution
&lt;/span&gt;&lt;span class="n"&gt;SAFE_OPERATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subtract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiply&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mul&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;divide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;truediv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;power&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;pow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modulo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mod&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_current_date_info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tz_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timezone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytz&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tz_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;day_of_year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timetuple&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;tm_yday&lt;/span&gt;
            &lt;span class="n"&gt;days_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;day_of_year&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timezone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tz_str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;day_of_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;day_of_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;days_remaining_in_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;days_remaining&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;op_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;op_func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SAFE_OPERATIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op_func&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Unknown operation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;divide&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Cannot divide by zero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;op_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
        &lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Path traversal prevention: only allow within designated base directory
&lt;/span&gt;        &lt;span class="n"&gt;allowed_base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;abs_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;realpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;abs_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allowed_base&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Path not allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;abs_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 2KB limit
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;FileNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: File not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Actual sandbox results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;calculate(multiply, 15, 7) = 105
calculate(add, 105, 3) = 108
calculate(divide, 100, 4) = 25.0
Input validation (required field present): True
Input validation (missing required field): False — Missing required field: location
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error classification strategy from the &lt;a href="https://dev.to/en/blog/en/fastapi-claude-api-streaming-production-guide-2026"&gt;FastAPI + Claude API streaming guide&lt;/a&gt; applies here too — categorize tool errors as retryable vs. non-retryable for better production stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Multiple Tool Calls — Can We Run in Parallel?
&lt;/h2&gt;

&lt;p&gt;Claude can call multiple tools simultaneously in a single turn. Ask "compare the weather in Seoul and Tokyo" and it returns two &lt;code&gt;get_weather&lt;/code&gt; calls at once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When Claude calls multiple tools in one turn
&lt;/span&gt;&lt;span class="n"&gt;tool_use_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Technically possible to run in parallel
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;concurrent.futures&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_completed&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_use_blocks&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sandbox-verified multi-tool results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_use_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"25.0"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_result"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_use_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;temp&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 18, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;condition&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Sunny&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd only apply parallel execution to idempotent read tools. External API calls with side effects need careful rate-limit and ordering consideration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling — Failing Gracefully
&lt;/h2&gt;

&lt;p&gt;When a tool fails, return &lt;code&gt;is_error: true&lt;/code&gt;. The model reads this, recognizes the error, and either tries something else or gives the user contextual guidance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tool execution with error handling. Returns (content, is_error).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool execution failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;safe_process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;is_error: true&lt;/code&gt; is set, the model doesn't just skip past it. From my testing, it reads the error content and responds with something like "The file couldn't be found — please double-check the path." Returning empty strings or ignoring errors tends to produce confused or hallucinated responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of Tool Use — How Many Tokens Does It Add?
&lt;/h2&gt;

&lt;p&gt;Honestly, Tool Use costs more. According to Anthropic's documentation, each tool definition adds roughly 200–300 tokens of overhead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;5 tool definitions → ~1,250 tokens fixed overhead (every request)
1 tool call → additional input + output tokens
3-turn agentic loop → accumulating context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agentic loop accumulates context. After 5 turns, everything from the first message to the fifth tool result is in context. Costs can compound quickly in long-running agents.&lt;/p&gt;

&lt;p&gt;Two ways to manage this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Combine with Prompt Caching&lt;/strong&gt;: Tool definitions are the same on every request. As covered in the &lt;a href="https://dev.to/en/blog/en/claude-api-prompt-caching-cost-optimization-guide"&gt;Claude API Prompt Caching guide&lt;/a&gt;, caching the system prompt with &lt;code&gt;cache_control: {"type": "ephemeral"}&lt;/code&gt; applies here too, and tool definitions benefit similarly from repeated identical structures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Pass only the tools you need&lt;/strong&gt;: Always including 10 tool definitions is worse than passing the 2–3 that matter for the current task. More tools consume more tokens and occasionally lead the model to pick the wrong one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Streaming Tool Use
&lt;/h2&gt;

&lt;p&gt;Tool Use works with streaming responses. In anthropic 0.101.0, use &lt;code&gt;client.messages.stream&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Stream text chunks in real time
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text_chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get the final message after streaming completes
&lt;/span&gt;    &lt;span class="n"&gt;final_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_final_message&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;final_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ... same handling as above
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When streaming with tool use: if you're showing text chunks to the user in real time and also need to process tool calls, design the UX flow before you start. The &lt;a href="https://dev.to/en/blog/en/vercel-ai-sdk-claude-streaming-agent-2026"&gt;Vercel AI SDK approach&lt;/a&gt; is worth looking at to see how this gets abstracted on the frontend side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Pattern: GitHub Issue Monitor Agent
&lt;/h2&gt;

&lt;p&gt;A complete example tying everything together — a simple agent that fetches and summarizes GitHub issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# reads ANTHROPIC_API_KEY
&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_github_issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches the issue list for a GitHub repository.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner/repo format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;closed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max issues to return (default: 10)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_issue_detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetches the details of a specific GitHub issue.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner/repo format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Issue number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;repo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list_github_issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Real impl: requests.get(f"https://api.github.com/repos/{repo}/issues", ...)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TypeError in data processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;41&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Add streaming support&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;open&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_issue_detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issue_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reproduce: pass an empty list as input. Stack trace attached.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;comments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_issue_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;process_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loop limit exceeded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Still Unresolved — Honest Limitations
&lt;/h2&gt;

&lt;p&gt;Here's what I find genuinely frustrating about Tool Use in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context accumulation&lt;/strong&gt;: The agentic loop keeps growing the context. After 10 turns, everything from the first message to the tenth tool result is in there. Long-running agents need a context management strategy — summarize intermediate results, prune stale messages — and there's no standard pattern for this yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-deterministic tool selection&lt;/strong&gt;: Same question, different tool selection on different runs. Even with &lt;code&gt;temperature=0&lt;/code&gt;, you can't guarantee identical behavior across invocations. This makes testing harder than it should be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Description quality is everything&lt;/strong&gt;: Vague &lt;code&gt;description&lt;/code&gt; → wrong tool selection or no tool use at all. Writing good tool descriptions is its own prompt engineering discipline. No framework solves this for you.&lt;/p&gt;

&lt;p&gt;I think Tool Use is underappreciated. Agent frameworks offer impressive abstractions, but this pattern is what's running underneath all of them. &lt;a href="https://dev.to/en/blog/en/pydantic-ai-type-safe-agent-tutorial-2026"&gt;PydanticAI's type-safe tool definitions&lt;/a&gt; are a convenient layer that auto-generates the JSON schema, but understanding the underlying mechanism is what gets you unstuck when things break.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Validated findings from anthropic 0.101.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool definitions&lt;/strong&gt;: &lt;code&gt;name&lt;/code&gt; + &lt;code&gt;description&lt;/code&gt; + &lt;code&gt;input_schema&lt;/code&gt;. Description quality determines whether the tool gets used correctly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic loop&lt;/strong&gt;: Detect &lt;code&gt;stop_reason == "tool_use"&lt;/code&gt; → execute tool → append &lt;code&gt;tool_result&lt;/code&gt; → repeat. Simple pattern, but the message structure has to be exactly right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling&lt;/strong&gt;: Use &lt;code&gt;is_error: true&lt;/code&gt; so the model recognizes failures and responds appropriately. Never return empty strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: ~250 tokens overhead per tool definition. Combine with Prompt Caching. Watch context accumulation in multi-turn agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel tool calls&lt;/strong&gt;: &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; works for idempotent read tools. Apply selectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tool Use is the most direct path from chatbot to agent. You don't need a complex framework — this pattern alone is enough to build practical agents.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>anthropicsdk</category>
      <category>tooluse</category>
      <category>agents</category>
    </item>
    <item>
      <title>A-Mem: Agentic Memory for LLM Agents Explained</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 13 May 2026 04:18:18 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/a-mem-agentic-memory-for-llm-agents-explained-454e</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/a-mem-agentic-memory-for-llm-agents-explained-454e</guid>
      <description>&lt;p&gt;Your agent forgets everything between sessions. You bolt on a vector database, retrieve the top-5 similar chunks at query time, and call it memory. It works — until the agent needs to reason across multiple related memories it cannot connect on the fly, or until a new fact should change how it interprets older ones.&lt;/p&gt;

&lt;p&gt;That is the problem A-Mem (Agentic Memory for LLM Agents, arXiv:2502.12110) was built to solve. Accepted at NeurIPS 2025, A-Mem introduces a memory system where the agent actively organizes, links, and evolves its memories on write — not just at retrieval time. The result is a system that handles multi-hop reasoning tasks at roughly six times the accuracy of standard vector retrieval baselines on the LoCoMo benchmark.&lt;/p&gt;

&lt;p&gt;Effloow Lab inspected the paper, codebase (MIT license, GitHub: agiresearch/A-mem), and documented the architecture. This guide explains what A-Mem does differently and when it is worth reaching for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Static Memory Systems Fall Short
&lt;/h2&gt;

&lt;p&gt;Most agent memory setups follow the same pattern: embed a document or conversation turn, store it in a vector database, retrieve by cosine similarity at query time. The pattern is fast and simple, but it has three structural weaknesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weak multi-hop reasoning.&lt;/strong&gt; If memory A is about "Redis sorted sets" and memory B is about "leaderboard query optimization," a query about "how to build a fast leaderboard" may retrieve either memory but not both in the right relationship. The agent has to reconstruct the connection itself — often unreliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No retroactive updating.&lt;/strong&gt; When you add a new memory that changes the interpretation of an older one, the old memory stays unchanged. The agent may retrieve the old, stale context and draw the wrong conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixed retrieval patterns.&lt;/strong&gt; Standard RAG requires you to predefine how memories are accessed: top-k by similarity, keyword filter, or graph traversal. Each new task type may need a new access pattern that you have not engineered.&lt;/p&gt;

&lt;p&gt;Graph-enhanced RAG systems (like MemGPT) address the third problem partially by adding explicit entity-relationship graphs, but they still rely on a predefined schema. A-Mem addresses all three by making memory organization an active, agentic process rather than a fixed retrieval mechanism. (For a practical foundation on building RAG pipelines before layering on agentic memory, see &lt;a href="https://dev.to/articles/build-rag-app-python-llamaindex-tutorial-2026"&gt;Build a RAG App with LlamaIndex&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  What A-Mem Is
&lt;/h2&gt;

&lt;p&gt;A-Mem treats memory the way a thoughtful knowledge worker treats a Zettelkasten — a note-taking methodology where every note is a structured unit linked to related notes. Rather than storing raw text and embedding it once, A-Mem constructs a rich note for each memory, analyzes its relationship to existing memories, creates explicit links, and can update existing notes when new knowledge changes the picture.&lt;/p&gt;

&lt;p&gt;The project is open-source under the MIT license and was accepted at NeurIPS 2025. The primary repositories are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official: &lt;a href="https://github.com/agiresearch/A-mem" rel="noopener noreferrer"&gt;github.com/agiresearch/A-mem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Paper author mirror: &lt;a href="https://github.com/WujiangXu/A-mem" rel="noopener noreferrer"&gt;github.com/WujiangXu/A-mem&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Community MCP server extension: &lt;a href="https://github.com/tobs-code/a-mem-mcp-server" rel="noopener noreferrer"&gt;github.com/tobs-code/a-mem-mcp-server&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Core Architecture: Three Operations
&lt;/h2&gt;

&lt;p&gt;A-Mem's architecture centers on three operations that run every time a new memory is added.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Note Construction
&lt;/h3&gt;

&lt;p&gt;When a new piece of information enters the system — a conversation turn, a tool result, an observation — A-Mem does not just embed and store it. It generates a structured note containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contextual description&lt;/strong&gt;: a short LLM-generated summary that captures the meaning, not just the surface text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keywords and tags&lt;/strong&gt;: structured labels for categorical retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding vector&lt;/strong&gt;: stored in ChromaDB for similarity search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enrichment step is the first departure from vanilla RAG. The embedding is of a richer, LLM-synthesized representation rather than raw text.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Link Generation
&lt;/h3&gt;

&lt;p&gt;After note construction, A-Mem scans the existing memory store for related notes. When meaningful semantic overlap exists — shared keywords, similar contextual descriptions, or high embedding similarity — it creates an explicit directed link between the notes. These links are stored in a NetworkX graph alongside the ChromaDB vector store.&lt;/p&gt;

&lt;p&gt;The combination of ChromaDB (vector similarity) and NetworkX (graph traversal) means the system can answer both "what is similar to this?" (ChromaDB) and "what is connected to this?" (graph walk) without choosing one or the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Memory Evolution
&lt;/h3&gt;

&lt;p&gt;This is A-Mem's most distinctive operation. When a new memory is integrated, the system checks whether any existing linked memories should be updated. If the new information changes or deepens the context of an older note, the older note's contextual description is rewritten to reflect the new understanding.&lt;/p&gt;

&lt;p&gt;Consider an agent that first learns "the team uses Redis for session storage" and later learns "the team is migrating from Redis to Valkey for cost reasons." With vanilla RAG, both facts sit independently. With A-Mem, the second memory triggers an evolution of the first: its contextual description is updated to reflect that this is an in-progress migration, not a stable architecture decision.&lt;/p&gt;

&lt;p&gt;This makes A-Mem's memory graph a living structure — not an append-only log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage Backend
&lt;/h2&gt;

&lt;p&gt;The implementation combines two storage layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector store&lt;/td&gt;
&lt;td&gt;ChromaDB&lt;/td&gt;
&lt;td&gt;Fast approximate similarity search on enriched embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph store&lt;/td&gt;
&lt;td&gt;NetworkX&lt;/td&gt;
&lt;td&gt;Explicit inter-memory links for multi-hop traversal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM backend&lt;/td&gt;
&lt;td&gt;OpenAI / other&lt;/td&gt;
&lt;td&gt;Note enrichment, link scoring, evolution reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ChromaDB handles retrieval when you query by concept similarity. NetworkX handles traversal when the agent needs to follow a chain of related memories. The LLM backend drives the intelligent parts: note enrichment, deciding which links to create, and whether evolution should happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results on LoCoMo
&lt;/h2&gt;

&lt;p&gt;A-Mem's paper evaluates on the LoCoMo (Long Conversational Memory) benchmark, a dataset of long-form conversations designed to test multi-session memory recall. The multi-hop category is most revealing — these are questions that require reasoning across two or more distinct stored memories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Multi-Hop ROUGE-L&lt;/th&gt;
&lt;th&gt;Temporal Reasoning F1&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LoCoMo baseline&lt;/td&gt;
&lt;td&gt;4.68&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReadAgent&lt;/td&gt;
&lt;td&gt;2.81&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MemGPT (GPT-4o-mini)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;25.52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A-Mem (Qwen2.5-15b)&lt;/td&gt;
&lt;td&gt;27.23&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A-Mem (GPT-4o-mini)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;45.85&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The multi-hop ROUGE-L improvement with Qwen2.5-15b is roughly 5.8x over the LoCoMo baseline (27.23 vs 4.68). On temporal reasoning tasks with GPT-4o-mini, A-Mem reaches 45.85 F1 against MemGPT's 25.52 — nearly double. These gains are structural, not prompt tricks: they come from having precomputed the links between related memories at write time, so the agent does not need to reconstruct connections at query time under token pressure.&lt;/p&gt;

&lt;p&gt;A-Mem's multi-hop advantage is more pronounced than its gains on simpler single-fact retrieval. Open Domain tasks — where the question maps to a single stored fact — show improvements too, but smaller. This tells you something important about when to use A-Mem: it earns its complexity for tasks that require chaining related facts, not for simple key-value lookups.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use A-Mem
&lt;/h2&gt;

&lt;p&gt;The project is installed from source. The core API is straightforward once the dependencies are in place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/agiresearch/A-mem
&lt;span class="nb"&gt;cd &lt;/span&gt;A-mem
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dependencies include &lt;code&gt;chromadb&lt;/code&gt;, &lt;code&gt;networkx&lt;/code&gt;, and an LLM backend (OpenAI by default, but the backend is configurable).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initializing the memory system:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgenticMemorySystem&lt;/span&gt;

&lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgenticMemorySystem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Embedding model (SentenceTransformers)
&lt;/span&gt;    &lt;span class="n"&gt;llm_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;           &lt;span class="c1"&gt;# Used for note enrichment + evolution
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;model_name&lt;/code&gt; controls the embedding model. &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is a compact, fast option. For higher quality embeddings, substitute a larger SentenceTransformers model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a memory:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple content
&lt;/span&gt;&lt;span class="n"&gt;memory_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_note&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Learned that batch size of 16 reduces GPU OOM errors on A100s.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# With metadata
&lt;/span&gt;&lt;span class="n"&gt;memory_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_note&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Redis sorted sets are efficient for leaderboard queries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Engineering&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;202503021500&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every &lt;code&gt;add_note&lt;/code&gt; call triggers the full Note Construction → Link Generation → Memory Evolution pipeline. The call blocks while the LLM enriches the note and evaluates links, so latency is higher than a plain vector insert. This is the write cost you pay for smarter retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieving memories:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database performance optimization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The search returns notes ordered by relevance, now including notes that are linked to the top matches — so a query about "database performance" can surface both the Redis sorted sets note and a linked note about index strategy, even if the latter does not match the query embedding closely on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  A-Mem vs. Other Memory Systems
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Vanilla RAG&lt;/th&gt;
&lt;th&gt;MemGPT&lt;/th&gt;
&lt;th&gt;Mem0&lt;/th&gt;
&lt;th&gt;A-Mem&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage type&lt;/td&gt;
&lt;td&gt;Vector only&lt;/td&gt;
&lt;td&gt;Vector + graph (schema)&lt;/td&gt;
&lt;td&gt;Fact extraction&lt;/td&gt;
&lt;td&gt;Vector + graph (dynamic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write-time enrichment&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes (facts)&lt;/td&gt;
&lt;td&gt;Yes (full note + links)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory evolution&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-hop reasoning&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Weak&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write latency&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (LLM call per write)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema flexibility&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;td&gt;Predefined&lt;/td&gt;
&lt;td&gt;Fact-based&lt;/td&gt;
&lt;td&gt;Fully flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Static corpora&lt;/td&gt;
&lt;td&gt;Structured entities&lt;/td&gt;
&lt;td&gt;Fact-heavy chat&lt;/td&gt;
&lt;td&gt;Multi-session reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Mem0 (which uses a fact extraction pattern and scores 66.9% on LOCOMO) is a reasonable middle ground for production: lower write latency than A-Mem, better multi-hop than vanilla RAG. A-Mem wins on the hardest multi-hop tasks but at a real cost: every write requires an LLM call for enrichment and link evaluation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Using A-Mem for simple key-value lookups.&lt;/strong&gt; If your agent stores "user prefers dark mode" and retrieves it verbatim, a plain vector store is faster and sufficient. A-Mem's overhead is only justified when you need cross-memory reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring write latency in production.&lt;/strong&gt; The note enrichment LLM call is synchronous in the base implementation. For high-throughput applications, this needs to be moved to an async queue. The community MCP server (tobs-code/a-mem-mcp-server) is one starting point for integration patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the wrong embedding model.&lt;/strong&gt; &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; is fast but loses nuance for specialized domains (code, legal text, medical). For domain-specific agents, use a domain-adapted embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not monitoring memory graph growth.&lt;/strong&gt; As the note graph grows, link evaluation cost scales. For agents running thousands of sessions, you need a graph pruning strategy. The paper does not fully address this; it is an open implementation concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expecting zero-shot plugin behavior.&lt;/strong&gt; A-Mem requires a different design philosophy than RAG. You need to think in terms of notes and links, not documents and embeddings. Teams that treat it as a drop-in RAG replacement will not see the multi-hop gains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: How does A-Mem compare to MemMachine?
&lt;/h3&gt;

&lt;p&gt;MemMachine (see &lt;a href="https://dev.to/articles/memmachine-ground-truth-agent-memory-paper-poc-2026"&gt;Effloow's MemMachine guide&lt;/a&gt;) focuses on ground-truth-preserving memory: it ensures memories are never silently corrupted or overwritten without provenance. A-Mem focuses on dynamic organization and cross-memory evolution. They address different failure modes — A-Mem solves the multi-hop reasoning gap, MemMachine solves the reliability gap. The two approaches are complementary rather than competing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is A-Mem ready for production use?
&lt;/h3&gt;

&lt;p&gt;A-Mem is an MIT-licensed research implementation, not a managed service. The GitHub codebase is functional and documented, but it has not been stress-tested at enterprise scale. For production use, you would need to wrap it in an async worker queue, add monitoring, and handle ChromaDB persistence and backup. Teams who want the architecture without the ops overhead should watch for managed implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does A-Mem compare to Mem0 for agent memory?
&lt;/h3&gt;

&lt;p&gt;Mem0 uses a fact-extraction approach: it identifies discrete facts from conversations and stores them as atomic units. This is efficient and production-friendly, scoring 66.9% on LOCOMO. A-Mem builds richer structured notes and evolves them — winning on multi-hop tasks but with higher write cost. If your agent needs to chain across multiple related memories, A-Mem has a structural advantage. For simpler recall, Mem0's lower latency is more practical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Does A-Mem work with local LLMs?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;llm_backend&lt;/code&gt; parameter is configurable. The codebase supports OpenAI out of the box and can be adapted to other backends. For local LLMs (Ollama, vLLM, LM Studio), you would configure an OpenAI-compatible endpoint. Note enrichment quality depends on the LLM: a stronger model produces better contextual descriptions and more accurate link decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is the LoCoMo benchmark?
&lt;/h3&gt;

&lt;p&gt;LoCoMo (Long Conversational Memory) is a dataset of long-form multi-session conversations designed to test whether memory systems can recall facts and relationships across extended interactions. The multi-hop subset specifically tests questions that require connecting two or more stored facts. It is the primary benchmark used in the A-Mem paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What is memory evolution and when does it trigger?
&lt;/h3&gt;

&lt;p&gt;Memory evolution is the process by which A-Mem updates the contextual description of an existing note when a new, related note is added. It triggers when the system determines — via LLM evaluation — that the new memory meaningfully changes the interpretation of an existing linked memory. In practice, this is most useful in long-running agents where knowledge compounds over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A-Mem (NeurIPS 2025, arXiv:2502.12110) builds structured, evolving memory graphs for LLM agents using Zettelkasten-inspired note construction.&lt;/li&gt;
&lt;li&gt;The three core operations — Note Construction, Link Generation, Memory Evolution — happen at write time, not retrieval time.&lt;/li&gt;
&lt;li&gt;On the LoCoMo benchmark multi-hop tasks, A-Mem achieves roughly 5.8x better ROUGE-L than the standard vector baseline with GPT-4o-mini.&lt;/li&gt;
&lt;li&gt;Storage uses ChromaDB for vector similarity and NetworkX for graph traversal, giving both similarity search and relationship-aware retrieval.&lt;/li&gt;
&lt;li&gt;The write latency cost (LLM call per memory) is real: A-Mem is not a drop-in replacement for RAG. It is a deliberate upgrade for agents where multi-session, multi-hop reasoning quality matters.&lt;/li&gt;
&lt;li&gt;The codebase is MIT-licensed on GitHub and installable from source.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bottom Line&lt;br&gt;
  &lt;/p&gt;
&lt;p&gt;A-Mem solves the multi-hop memory problem that vanilla RAG cannot — by making memory organization agentic at write time rather than patchwork at query time. If your agent needs to reason across sessions and chain related facts reliably, the architecture is worth the added write latency. For simpler recall tasks, stick with Mem0 or a plain vector store.&lt;/p&gt;

</description>
      <category>agentmemory</category>
      <category>llmagents</category>
      <category>zettelkasten</category>
      <category>chromadb</category>
    </item>
    <item>
      <title>Cloudflare Project Think: Durable Agent Runtime Guide</title>
      <dc:creator>Jangwook Kim</dc:creator>
      <pubDate>Wed, 13 May 2026 01:13:13 +0000</pubDate>
      <link>https://forem.com/jangwook_kim_e31e7291ad98/cloudflare-project-think-durable-agent-runtime-guide-7id</link>
      <guid>https://forem.com/jangwook_kim_e31e7291ad98/cloudflare-project-think-durable-agent-runtime-guide-7id</guid>
      <description>&lt;p&gt;Most AI agents on serverless platforms share the same fatal flaw: they can't survive a restart. If the underlying worker crashes or cold-starts mid-task, the agent's progress disappears. The typical workaround is to keep tasks short and stateless — which means you cannot run a 10-minute research loop, a multi-file refactor, or an autonomous investigation that makes 50 external calls.&lt;/p&gt;

&lt;p&gt;Cloudflare's Project Think, announced during Agents Week 2026 (April 2026), is a direct answer to that constraint. It ships a set of primitives — fiber checkpointing, sub-agents, a persistent Session API, and a 5-tier execution ladder — all wired into an opinionated base class (&lt;code&gt;@cloudflare/think&lt;/code&gt;) that runs on Durable Objects.&lt;/p&gt;

&lt;p&gt;Effloow Lab inspected the SDK packages, confirmed installability, and traced the API surface from official docs and the open-source &lt;code&gt;cloudflare/agents&lt;/code&gt; repository. The following is a source-based guide to how Project Think works and when to use it. See &lt;code&gt;data/lab-runs/cloudflare-project-think-durable-agent-runtime-2026.md&lt;/code&gt; for the full evidence note.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Serverless Agents Break — and Why Project Think Fixes It
&lt;/h2&gt;

&lt;p&gt;A standard Cloudflare Worker is a request handler: it starts, does work, returns a response, and dies. Cloudflare Workflows added durable multi-step execution, but the state machine is managed outside your code and requires a separate infrastructure primitive.&lt;/p&gt;

&lt;p&gt;Project Think takes a different approach. Each agent runs inside a &lt;strong&gt;Durable Object&lt;/strong&gt; — a stateful micro-server with its own SQLite database, WebSocket connections, and scheduling. That alone gives agents persistence. But Project Think goes further by introducing &lt;strong&gt;fibers&lt;/strong&gt;: durable invocations that can checkpoint their own instruction pointer directly into the co-located SQLite database.&lt;/p&gt;

&lt;p&gt;The practical result: an agent can run a 30-step task, checkpoint after each step, survive a server restart, and resume exactly where it left off — without any external workflow orchestrator.&lt;/p&gt;

&lt;p&gt;This is the critical architectural distinction from Cloudflare Dynamic Workers (covered in an &lt;a href="https://dev.to/articles/cloudflare-dynamic-workers-ai-sandbox-guide-2026"&gt;earlier Effloow article on Dynamic Workers&lt;/a&gt;), which handle sandboxed code execution but are stateless by design. Project Think layers durable execution on top of the full Cloudflare platform stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Five Primitives of Project Think
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fibers — Checkpointed Execution
&lt;/h3&gt;

&lt;p&gt;The fiber is the foundational primitive. Unlike a regular async function, a fiber can call &lt;code&gt;ctx.stash()&lt;/code&gt; to serialize the current state of its local variables into SQLite. If the Durable Object restarts, &lt;code&gt;runFiber&lt;/code&gt; rehydrates from the last stash point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;runFiber&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cloudflare/think&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearchAgent&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Think&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;onTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;runFiber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;searchWeb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stash&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;sources&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;         &lt;span class="c1"&gt;// checkpoint 1&lt;/span&gt;

      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stash&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt; &lt;span class="c1"&gt;// checkpoint 2&lt;/span&gt;

      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;synthesize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;ctx.stash()&lt;/code&gt; call writes to the Durable Object's SQLite database. On resume, the fiber fast-forwards to the last stash point. For long-horizon tasks — multi-file code reviews, iterative search loops, automated report generation — this removes the "start over" failure mode entirely.&lt;/p&gt;

&lt;p&gt;Fibers also include automatic keepalive for long-running operations and handle non-deterministic workloads that would time out in a standard Worker context.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Sub-Agents (Facets) — Isolated Child Agents with Typed RPC
&lt;/h3&gt;

&lt;p&gt;Project Think supports spawning child agents as &lt;strong&gt;facets&lt;/strong&gt; — child Durable Objects colocated with the parent on the same machine. Each facet has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its own isolated SQLite database (no shared state)&lt;/li&gt;
&lt;li&gt;Its own execution context and fiber support&lt;/li&gt;
&lt;li&gt;A typed RPC stub returned to the parent for method calls
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Parent agent spawning a specialist sub-agent&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawnFacet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data-extractor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DataExtractorAgent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;structured&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;extractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parseDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawText&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;validator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spawnFacet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;validator&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ValidationAgent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern is more predictable than passing messages through a shared queue. Because the facet RPC is typed, TypeScript catches mismatches at compile time. And because facets are colocated, the latency for inter-agent calls is dramatically lower than network-based agent-to-agent communication.&lt;/p&gt;

&lt;p&gt;Facets are useful when you need to decompose a task into specialist roles — a researcher, a writer, a fact-checker — without those roles sharing any mutable state.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Session API — Relational Conversation Trees
&lt;/h3&gt;

&lt;p&gt;Standard chat agent implementations append messages to a flat array. That works for simple Q&amp;amp;A but breaks when you need to explore alternatives without polluting the main reasoning path.&lt;/p&gt;

&lt;p&gt;Project Think's Session API stores messages as a &lt;strong&gt;relational tree&lt;/strong&gt;, with each message carrying a &lt;code&gt;parent_id&lt;/code&gt;. This enables three capabilities that flat-list approaches cannot support:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forking&lt;/strong&gt;: The agent can branch off a conversation node to explore an alternative without modifying the main path. If the alternative fails, the original path is untouched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-destructive compaction&lt;/strong&gt;: Rather than truncating context when the window fills, the Session API creates a compaction overlay — a summary that sits beside the original messages without replacing them. The full history is still queryable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full-text search&lt;/strong&gt;: FTS5 indexing over all stored messages lets the agent retrieve relevant earlier context without re-reading the entire history into the LLM context window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongHorizonAgent&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Think&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;configureSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a thorough technical researcher.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;contextBlocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DOMAIN_KNOWLEDGE&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All session storage runs on the Durable Object's local SQLite — no external vector database required for the conversation layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Execution Ladder — Graduated Code Trust
&lt;/h3&gt;

&lt;p&gt;One of Project Think's most distinctive ideas is the &lt;strong&gt;execution ladder&lt;/strong&gt;: a tiered system of code execution environments that agents escalate through based on the trust level required by a task.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Package / API&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Trust Level&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Workspace&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@cloudflare/shell&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Durable filesystem (SQLite + R2)&lt;/td&gt;
&lt;td&gt;Fully trusted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Dynamic Worker&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@cloudflare/codemode&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sandboxed V8 isolate, no network&lt;/td&gt;
&lt;td&gt;LLM-generated code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;npm&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@cloudflare/worker-bundler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch npm pkgs, esbuild, load into DW&lt;/td&gt;
&lt;td&gt;Third-party packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Browser&lt;/td&gt;
&lt;td&gt;Cloudflare Browser Run&lt;/td&gt;
&lt;td&gt;Navigate, click, extract&lt;/td&gt;
&lt;td&gt;Web content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Sandbox&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cloudflare/sandbox-sdk&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full Linux env, git, cargo, npm test&lt;/td&gt;
&lt;td&gt;Untrusted workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Agents do not jump directly to Tier 4 for every task. A simple data transformation can run in Tier 1 (a sandboxed V8 isolate that starts in milliseconds). A task requiring npm packages escalates to Tier 2. A task that needs to test a full Rust codebase goes to Tier 4.&lt;/p&gt;

&lt;p&gt;The ladder enforces the principle of least privilege: agents operate at the lowest tier that can handle the task, escalating only when needed. This keeps the security surface small and execution fast for common cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Self-Authored Extensions — Agents Writing Their Own Tools
&lt;/h3&gt;

&lt;p&gt;The final primitive is the most experimental: agents can write their own tools at runtime. An agent inspects a task, decides it needs a capability it doesn't have, generates a tool implementation, and loads it into a Dynamic Worker for execution — all within the same session.&lt;/p&gt;

&lt;p&gt;This is not the same as calling an external tool-use API. The agent generates actual TypeScript code, bundles it with &lt;code&gt;@cloudflare/worker-bundler&lt;/code&gt;, and executes it in a Tier 1 or Tier 2 environment. The generated tool becomes part of the agent's toolkit for the duration of the session.&lt;/p&gt;

&lt;p&gt;In practice, this is useful for tasks where the required transformation or extraction logic cannot be fully specified in advance — for example, parsing a novel API response format or implementing a domain-specific calculation that varies per client.&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;@cloudflare/think&lt;/code&gt; Base Class
&lt;/h2&gt;

&lt;p&gt;All five primitives are exposed through the &lt;code&gt;Think&lt;/code&gt; base class, which handles the full chat lifecycle: agentic loop, message persistence, streaming, tool execution, stream resumption, and extensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @cloudflare/think agents ai @cloudflare/shell zod workers-ai-provider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Minimal example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Think&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cloudflare/think&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createWorkersAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;workers-ai-provider&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyAgent&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Think&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;Env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;getModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createWorkersAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;binding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AI&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="c1"&gt;// Workers AI free tier includes @cf/meta/llama-3.3-70b-instruct&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;ai&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@cf/meta/llama-3.3-70b-instruct&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;configureSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;wrangler.toml&lt;/code&gt; binding wires the Durable Object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[[durable_objects.bindings]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"MY_AGENT"&lt;/span&gt;
&lt;span class="py"&gt;class_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"MyAgent"&lt;/span&gt;

&lt;span class="nn"&gt;[[migrations]]&lt;/span&gt;
&lt;span class="py"&gt;tag&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"v1"&lt;/span&gt;
&lt;span class="py"&gt;new_sqlite_classes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"MyAgent"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cloudflare/agents&lt;/code&gt; GitHub repository contains 30+ self-contained example agents demonstrating fibers, facets, sessions, and execution ladder integration. The &lt;code&gt;docs/think/index.md&lt;/code&gt; file in that repository is the most complete reference beyond the official documentation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Think vs. Dynamic Workers vs. Cloudflare Workflows
&lt;/h2&gt;

&lt;p&gt;Developers familiar with Cloudflare's existing primitives will have one question: how does this fit alongside Dynamic Workers and Workflows?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Workers&lt;/strong&gt; (covered in &lt;a href="https://dev.to/articles/cloudflare-dynamic-workers-ai-sandbox-guide-2026"&gt;Effloow's Dynamic Workers guide&lt;/a&gt;) are stateless sandboxed V8 isolates for executing LLM-generated code. They correspond to Tier 1 of Project Think's execution ladder. They are not durable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare Workflows&lt;/strong&gt; provide durable multi-step execution, but the state machine lives outside your Worker. Steps are defined declaratively, and Cloudflare's infrastructure manages replay. This is powerful for ETL pipelines and scheduled jobs, but the agent has no access to its own state between steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Think&lt;/strong&gt; puts the state machine inside the agent itself via fibers and the co-located SQLite database. The agent is both the executor and the state store. This gives more flexibility for agentic patterns where the next step depends on reasoning about the previous step's output — not just a declared execution graph.&lt;/p&gt;

&lt;p&gt;The right choice depends on your workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless code execution only → Dynamic Workers&lt;/li&gt;
&lt;li&gt;Declarative multi-step pipeline with retry guarantees → Cloudflare Workflows&lt;/li&gt;
&lt;li&gt;Autonomous agents with reasoning-driven state transitions → Project Think&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Mistakes When Building Durable Agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint too infrequently.&lt;/strong&gt; If you only call &lt;code&gt;ctx.stash()&lt;/code&gt; at the end of a multi-minute operation, a crash at minute 8 means re-running 8 minutes of work. Checkpoint after each meaningful unit — after a web request, after a parsing step, after a tool call returns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share state through the parent's SQLite instead of facet isolation.&lt;/strong&gt; Facets exist precisely so specialist sub-agents do not see each other's state. Routing everything through the parent's database re-introduces the coupling you were trying to avoid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escalate to Tier 4 for every code execution task.&lt;/strong&gt; Cloudflare Sandbox (Tier 4) has more overhead than Dynamic Workers (Tier 1). Use Tier 4 only when the task genuinely needs a Linux environment — git operations, compiled languages, or full test runners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignore compaction until the context window overflows.&lt;/strong&gt; Plan compaction as a regular scheduled step, not an emergency measure. The Session API's non-destructive overlay lets you compact early and often without losing history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat &lt;code&gt;@cloudflare/think&lt;/code&gt; as production-stable.&lt;/strong&gt; As of May 2026, Project Think is in &lt;strong&gt;experimental preview&lt;/strong&gt;. The package version is &lt;code&gt;0.0.1-experimental.x&lt;/code&gt;. The API surface is intended to be stable, but Cloudflare explicitly says it will continue to evolve. Treat it as early-adopter infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Application: When to Choose Project Think
&lt;/h2&gt;

&lt;p&gt;Project Think is well-suited to agent workloads that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exceed Cloudflare Worker's standard CPU time limits&lt;/li&gt;
&lt;li&gt;Require specialist sub-tasks that should not share state&lt;/li&gt;
&lt;li&gt;Need to explore multiple reasoning paths without forking the entire agent&lt;/li&gt;
&lt;li&gt;Generate and execute code as part of their reasoning loop&lt;/li&gt;
&lt;li&gt;Must maintain conversation history across days or weeks for personalization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is less well-suited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple request/response pipelines (standard Worker is simpler)&lt;/li&gt;
&lt;li&gt;Batch jobs without agent reasoning (Cloudflare Workflows is more appropriate)&lt;/li&gt;
&lt;li&gt;Workloads requiring GPUs or dedicated compute (no GPU support on Workers)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  FAQ
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Q: Does Project Think work with any LLM or only Workers AI?
&lt;/h3&gt;

&lt;p&gt;Project Think's &lt;code&gt;Think&lt;/code&gt; base class is model-agnostic — &lt;code&gt;getModel()&lt;/code&gt; can return any model compatible with the Vercel AI SDK's provider interface. Workers AI (&lt;code&gt;workers-ai-provider&lt;/code&gt;) is the zero-egress option for Cloudflare-hosted models, but you can wire in OpenAI, Anthropic, or any other provider via the AI SDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What's the cost of fiber checkpointing?
&lt;/h3&gt;

&lt;p&gt;Each &lt;code&gt;ctx.stash()&lt;/code&gt; writes to the Durable Object's SQLite database — a local write, not a network call. The overhead is the same as any SQLite write on the same machine. Cloudflare does not charge extra for SQLite writes beyond the standard Durable Object storage pricing. For most agents, checkpointing 10–50 times per session adds negligible cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Can sub-agents (facets) span multiple geographic regions?
&lt;/h3&gt;

&lt;p&gt;Facets are colocated with the parent Durable Object on the same machine by design — this is what makes their typed RPC low-latency. They do not span regions. If you need geographically distributed agent coordination, that requires a different architecture (message queues or service bindings across Workers).&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Is Project Think production-ready in May 2026?
&lt;/h3&gt;

&lt;p&gt;No. It is in experimental preview. Cloudflare describes the API surface as stable but explicitly notes it will evolve. For production workloads, monitor the &lt;code&gt;cloudflare/agents&lt;/code&gt; GitHub repository and the Cloudflare changelog for GA announcements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How does the Session API relate to a vector database?
&lt;/h3&gt;

&lt;p&gt;The Session API is not a semantic search layer — it is a relational message store with FTS5 full-text search. It handles conversation history, forking, and compaction well. For semantic retrieval over large external knowledge bases, you still need a vector database (Cloudflare Vectorize, Pinecone, etc.). They are complementary, not alternatives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Project Think solves the fundamental durability problem in serverless AI agents: agents can now checkpoint progress and survive restarts without re-running from the beginning.&lt;/li&gt;
&lt;li&gt;The five core primitives — fibers, sub-agents (facets), the Session API, the execution ladder, and self-authored extensions — address distinct failure modes in long-horizon agentic workloads.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@cloudflare/think&lt;/code&gt; is the opinionated base class that wires all primitives together; it is model-agnostic and works with any Vercel AI SDK provider.&lt;/li&gt;
&lt;li&gt;The 5-tier execution ladder enforces least-privilege code execution, keeping fast tasks in lightweight V8 isolates and escalating to full Linux environments only when necessary.&lt;/li&gt;
&lt;li&gt;As of May 2026, Project Think is in experimental preview. The API is intended to be stable but will continue to evolve — suitable for early adoption and evaluation, not yet for production-critical deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bottom Line&lt;br&gt;
  &lt;/p&gt;
&lt;p&gt;Project Think is the most complete answer Cloudflare has given to "how do I run an AI agent that lasts longer than a serverless function?" The fiber + facet + session combination solves real architectural problems, not theoretical ones. Get familiar with it now — when it reaches GA, it will become the default pattern for serious agent infrastructure on the Workers platform.&lt;/p&gt;

</description>
      <category>cloudflare</category>
      <category>aiagents</category>
      <category>durableexecution</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
