Forem: Jangwook Kim

Archon v2: Open Source Coding Agent Harnesses

Jangwook Kim — Tue, 19 May 2026 00:12:07 +0000

AI coding agents are becoming powerful enough to change real repositories, but the workflow around them is still often improvised. One run starts with planning, another jumps straight into edits, and a third forgets the validation command you expected. That is the gap Archon is trying to close.

Archon describes itself as a workflow engine for AI coding agents: you define development processes as YAML workflows, then run those workflows through a CLI, web UI, or integrations. The public GitHub repository calls it an open-source harness builder for deterministic and repeatable AI coding work. The useful framing is not "another coding assistant." It is a control layer around coding assistants.

Effloow Lab ran a small local sandbox before writing this guide. The lab cloned the public repository, inspected the bundled workflow definitions, created a minimal .archon/workflows/*.yaml file, and validated the dependency graph locally. The lab did not run model-backed AI nodes, Claude Code, Codex SDK, GitHub PR creation, or the web dashboard. Those limits matter because Archon's production value depends on the agent execution layer, not only the YAML shape.

Effloow Lab — Local sandbox on macOS, Node v25.9.0, npm 11.12.1, Docker 29.2.0. Lab run notes: data/lab-runs/archon-v2-ai-coding-agent-harness-builder-2026.md. The PoC validated workflow structure and dependency ordering only; no AI provider credentials were used.

What Archon Is Trying to Fix

The common coding-agent failure mode is not always code quality. It is process drift. A capable model can still skip a planning step, forget to run tests, rewrite too much code, or finish without a review pass. Humans compensate with long prompts: "first inspect the repo, then make a plan, then implement, then run tests, then summarize." That works until the prompt gets lost in a long context window or a different teammate writes a different instruction.

Archon's answer is to move the process out of the prompt and into a workflow file. The core concepts documentation defines a workflow as a YAML file containing a directed acyclic graph of nodes. Nodes can represent inline prompts, command files, bash scripts, loops, approval gates, or cancellation points. Dependencies are declared with depends_on, so the sequence becomes explicit rather than implied by prose.

That changes the operating model. The agent still supplies judgment inside AI-backed nodes, but the harness owns the skeleton: inspect, plan, implement, validate, review, request approval, create PR. For teams already using Claude Code, Codex, or other terminal agents, this is the difference between "ask the model to remember the process" and "make the model run inside the process."

The Current Version Reality

The backlog topic says "Archon v2," but the current public repository is more precise than that label. In the sandbox clone, package.json reported:

{
  "name": "archon",
  "version": "0.3.12",
  "type": "module"
}

So this guide treats "v2" as the rewrite-era product direction, not as an exact package version. A GitHub migration issue from April 2026 says Archon was evolving from a Python-based MCP knowledge and task-management tool into a TypeScript workflow engine for AI coding agents, with the old Python code preserved on an archive branch. That matches the current repository shape: TypeScript, Bun scripts, .archon workflow defaults, and documentation centered on YAML workflows.

This distinction matters for readers. If you are looking for the older Archon OS-style RAG/task-management stack, you may land on older articles or mirrors. If you want the current coding-agent harness, focus on the docs at archon.diy and the current coleam00/Archon repo.

How the YAML Harness Works

A minimal workflow has a name, a description, and nodes. The first workflow guide shows the basic pattern: one node runs first, another depends on it, and Archon executes the graph in dependency order.

Effloow Lab modeled this small workflow:

name: effloow-sandbox
description: Minimal deterministic review workflow for an article-code sandbox
nodes:
  - id: inspect
    bash: "printf 'inspect ok\n'"

  - id: plan
    prompt: "Create a short implementation plan from the inspection output."
    depends_on: [inspect]

  - id: validate
    bash: "printf 'validate ok\n'"
    depends_on: [plan]

  - id: review-copy
    prompt: "Review the output for unsupported claims."
    depends_on: [validate]

  - id: review-risk
    prompt: "Review the output for operational risks."
    depends_on: [validate]

  - id: summarize
    prompt: "Summarize the validation and review findings."
    depends_on: [review-copy, review-risk]

A local validator produced this execution layering:

{
  "nodeCount": 6,
  "missingDependencies": [],
  "executionLayers": [
    ["inspect"],
    ["plan"],
    ["validate"],
    ["review-copy", "review-risk"],
    ["summarize"]
  ]
}

The interesting part is the fourth layer. review-copy and review-risk both depend on validate, but neither depends on the other. That means the workflow has a natural parallel review stage before the final summary. This is exactly where harnesses start to matter: code review, security review, docs review, and regression review are different jobs, and a workflow file can represent them as separate nodes instead of one overloaded "please review this" prompt.
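The lab's validator script is not reproduced in this guide, so the following is only a minimal sketch of how such a layering could be computed. It hardcodes the six nodes from the workflow above as a plain dictionary rather than parsing YAML, and the function name execution_layers is hypothetical, not part of Archon.

```python
# Node id -> list of dependencies, mirroring the effloow-sandbox workflow above.
WORKFLOW = {
    "inspect": [],
    "plan": ["inspect"],
    "validate": ["plan"],
    "review-copy": ["validate"],
    "review-risk": ["validate"],
    "summarize": ["review-copy", "review-risk"],
}


def execution_layers(nodes: dict[str, list[str]]) -> dict:
    """Group nodes into layers so every dependency sits in an earlier layer."""
    # Dependencies that name a node which does not exist in the workflow.
    missing = sorted({d for deps in nodes.values() for d in deps if d not in nodes})
    # Unresolved dependencies per node (unknown names are reported, not awaited).
    remaining = {n: {d for d in deps if d in nodes} for n, deps in nodes.items()}
    layers: list[list[str]] = []
    while remaining:
        ready = [n for n, deps in remaining.items() if not deps]
        if not ready:
            raise ValueError(f"cycle detected among: {sorted(remaining)}")
        layers.append(ready)
        for name in ready:
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return {
        "nodeCount": len(nodes),
        "missingDependencies": missing,
        "executionLayers": layers,
    }


report = execution_layers(WORKFLOW)
print(report["executionLayers"])
```

Nodes that land in the same layer have no dependency on each other, which is why review-copy and review-risk end up side by side.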

What the Sandbox Confirmed

The local run confirmed four concrete facts.

First, the public repository could be cloned and inspected without credentials. The checkout used commit 45bc5e5 at the time of the run.

Second, the repository exposes a TypeScript/Bun toolchain. The root package.json includes scripts such as cli, build, test, type-check, lint, and validate. Effloow Lab did not run those scripts because Bun was not installed on the host.

Third, bundled defaults are real files, not just documentation examples. The clone contained 37 workflow YAML files and 36 default command files under .archon. The visible workflow list included names such as archon-idea-to-pr, archon-plan-to-pr, archon-smart-pr-review, archon-comprehensive-pr-review, archon-refactor-safely, and archon-validate-pr.

Fourth, a small YAML workflow can be reasoned about with ordinary DAG validation. The local script found six nodes, no missing dependencies, and a five-layer execution plan. That does not prove Archon's runtime behavior, but it does prove the workflow model is inspectable and reviewable before an agent touches code.

What the Sandbox Did Not Prove

The local experiment intentionally stopped short of a full Archon trial.

The documented Docker command got as far as printing:

Unable to find image 'ghcr.io/coleam00/archon:latest' locally

The image pull did not complete within the local run window, so Effloow Lab did not verify archon workflow list through Docker. The lab also did not install Bun, configure Claude Code, set provider credentials, connect GitHub CLI, start the web dashboard, trigger a PR workflow, or run Slack/Telegram integrations.

That boundary should shape adoption decisions. The sandbox supports a narrow claim: Archon's workflow concept is concrete, source-visible, and easy to inspect. It does not support a broad claim that Archon is production-ready in a specific team environment. Teams should run their own credentialed trial before putting it on a critical repository.

How Archon Compares to Plain Agent Prompts

Plain prompts are fast to write. They are also easy to mutate accidentally. A senior engineer might say "run tests before summarizing," while another says "summarize and then run tests if needed." Both can work, but neither creates a durable process artifact.

Archon's workflow files are closer to CI configuration for agentic development. The authoring guide emphasizes workflows, commands, artifacts, fresh context, and parallel execution. Commands communicate through files rather than hidden memory. Nodes can force a fresh context, which is useful when you want a review step to inspect artifacts instead of inheriting the implementer's assumptions.

This is the strongest reason to care about Archon: it makes the human process reviewable. You can code-review a workflow file. You can ask whether the validation node is too weak, whether the approval gate is in the right place, or whether a security review should run before PR creation. That is harder when all process control lives in a giant natural-language prompt.

A Practical Workflow Pattern

A useful first Archon workflow should be boring. Do not start with an autonomous "idea to production PR" flow on a critical service. Start with a harness that standardizes a task you already do manually.

For example:

Inspect the relevant files.
Write a short plan artifact.
Ask for human approval.
Implement one bounded change.
Run the exact validation command.
Run two independent review nodes.
Summarize changed files, test output, and residual risk.
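One way to review a pattern like this before running it is to check that the risky step cannot start until the approval step has finished. The sketch below models the seven steps as a dependency dictionary and walks it. The node ids and both helper functions are hypothetical illustrations, not Archon node types or APIs.

```python
# Hypothetical node ids for the seven steps above; not Archon's bundled names.
PATTERN = {
    "inspect": [],
    "plan": ["inspect"],
    "approve": ["plan"],
    "implement": ["approve"],
    "validate": ["implement"],
    "review-a": ["validate"],
    "review-b": ["validate"],
    "summarize": ["review-a", "review-b"],
}


def ancestors(nodes: dict[str, list[str]], node: str) -> set[str]:
    """Return every node that must finish before `node` can start."""
    seen: set[str] = set()
    stack = list(nodes[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(nodes[current])
    return seen


def ungated(nodes: dict[str, list[str]], gate: str, risky: list[str]) -> list[str]:
    """List risky nodes that could run without passing through the gate."""
    return [n for n in risky if gate not in ancestors(nodes, n)]


print(ungated(PATTERN, "approve", ["implement"]))  # [] means implement is gated
```

If someone later rewires implement to depend on plan directly, the same call returns ["implement"], which is the kind of regression a workflow review should catch.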

That pattern is also a good fit for content-backed engineering systems like Effloow. An article generator, for example, should not only draft prose. It should gather sources, create a lab note, check unsupported claims, verify frontmatter, update the backlog, and stop before publishing side effects. A workflow harness can encode those boundaries directly.

Readers interested in related agent control patterns can compare this with Effloow's guides on terminal AI coding agents and OpenAI Agents SDK multi-agent workflows. Archon sits one layer above the agent: it coordinates process, while the underlying assistant still performs the reasoning and code edits.

Where Archon Looks Strong

Archon is most compelling when the same engineering process must run repeatedly across issues, repositories, or teammates. The CLI reference documents workflow listing, workflow runs, JSON output, validation, logs, and merge detection behavior. The docs also describe project-local workflows in .archon/workflows/ and global workflows under ~/.archon/workflows/.

That gives teams a path to standardize:

Bug-fix investigation and implementation.
Plan-to-PR execution.
Multi-review PR checks.
Refactoring with validation gates.
Documentation impact review.
Human approval before irreversible steps.

The key advantage is portability. A workflow committed to the repo can travel with the project. A global workflow can become a personal or team-wide operating pattern. Both are more durable than a chat transcript.

Where Teams Should Be Careful

Harnesses can also create false confidence. A YAML file can force a validation command to run, but it cannot make that command comprehensive. A review node can ask for security issues, but it cannot guarantee that every issue is found. A human approval node can pause execution, but it cannot replace informed review.

There is also tooling maturity risk. The current repo uses Bun, a web dashboard, provider integrations, and platform connectors. If your team standardizes on npm-only Node tooling, locked-down workstations, or restricted Docker access, the setup path may need extra work. The sandbox host did not have Bun installed, and the Docker image path was not verified within the local run window.

Finally, avoid putting secrets or production credentials into workflow files. Treat Archon workflows like CI configuration: review them, keep secrets in approved secret stores, and put destructive operations behind explicit approval gates.

Adoption Checklist

Use this checklist before introducing Archon to a real repository:

Choose one low-risk workflow, such as docs review or test validation.
Commit the workflow under .archon/workflows/.
Keep prompts short and task-specific.
Put deterministic checks in bash nodes where possible.
Use artifacts for handoffs between nodes.
Add human approval before PR creation, deployment, paid actions, or public posting.
Run the workflow on a throwaway branch first.
Compare the output against your normal manual process.
Document what the agent is allowed to change.
Keep a fallback path that does not require Archon.

If the first workflow does not improve repeatability, do not add more workflows. The goal is not to make agent work look sophisticated. The goal is to make it observable, reviewable, and less dependent on the wording of one-off prompts.

FAQ

Q: Is Archon a replacement for Claude Code or Codex?

No. Archon is better understood as a harness around coding agents. The docs say it works with Claude Code SDK and Codex SDK, but the model-backed agent still performs the reasoning and code work. Archon provides workflow structure.

Q: Can Archon run deterministic steps without AI?

Yes. The docs describe bash nodes for shell scripts, and the sandbox workflow used bash nodes for inspect and validate. Deterministic checks belong there whenever possible.

Q: Did Effloow Lab run a full Archon workflow?

No. Effloow Lab validated a local workflow DAG and inspected the repository defaults. It did not run model-backed nodes, provider credentials, PR creation, or the dashboard.

Q: What is the safest first use case?

Start with validation and review, not autonomous implementation. A workflow that runs tests, checks docs impact, and summarizes risk is easier to trust than one that edits code and opens PRs immediately.

Bottom Line

Archon is worth watching because it moves coding-agent process control out of fragile prompts and into source-visible workflow files. The sandbox confirmed that the current repository has real bundled workflow definitions and that a small YAML DAG can model deterministic validation plus parallel review. It did not prove end-to-end runtime reliability.

For teams already experimenting with AI coding agents, Archon is most useful as a repeatability layer: encode the process, keep humans in the approval path, and let the agent operate inside a graph that engineers can inspect before it runs.

Claude Agent SDK Subagent Orchestration Tutorial — Parallel Multi-Agent Processing in Practice

Jangwook Kim — Mon, 18 May 2026 06:37:34 +0000

After I published the Tool Use guide, a comment came in fairly quickly: "I get single agents now, but how do I run a code reviewer, security scanner, and doc writer at the same time?" I was actually mid-experiment at that point myself.

I installed claude-agent-sdk 0.2.82 and found the answer: one AgentDefinition dataclass plus the ClaudeAgentOptions.agents dict is all it takes. I created the objects and explored the type structure hands-on. Without an API key I couldn't run actual queries, but the code structure and type system were fully testable.

This post is that exploration.

Where Single Agents Hit a Wall

The Tool Use loop is powerful. But there are three situations where it shows limits.

Context contamination. When a single agent handles code quality, security vulnerabilities, and test coverage in one PR review, the context window fills with intermediate results from all three tasks mixed together. The agent sees its earlier reasoning while forming later judgments — the fact that it spotted a code smell early can subtly shade the security analysis.

No parallelism. Suppose code review takes 30 seconds, a security scan 20 seconds, and doc generation 25 seconds. A single agent running them back to back needs 75 seconds; three concurrent agents finish in about 30 seconds, the duration of the slowest task. There's no reason to run independent tasks serially.
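The arithmetic is easy to check with plain asyncio, without any SDK. The sketch below uses the illustrative durations from the paragraph above, scaled down 100x, with asyncio.sleep standing in for agent work. The task names and the SCALE constant are assumptions made for the demo.

```python
import asyncio
import time

# Illustrative durations in seconds, taken from the paragraph above.
TASKS = {"code-review": 30, "security-scan": 20, "doc-generation": 25}
SCALE = 0.01  # run 100x faster than real time so the demo stays short


async def run_task(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds * SCALE)  # stand-in for real agent work
    return name


async def run_concurrently() -> float:
    """Run all tasks at once and return the elapsed time in unscaled seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(run_task(n, s) for n, s in TASKS.items()))
    return (time.perf_counter() - start) / SCALE


serial_estimate = sum(TASKS.values())      # 75: tasks run one after another
concurrent_estimate = max(TASKS.values())  # 30: bounded by the slowest task
measured = asyncio.run(run_concurrently())
print(serial_estimate, concurrent_estimate, round(measured))
```

The measured figure should land close to the concurrent estimate rather than the serial one, since the three sleeps overlap.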

Role bleed. An agent that "thinks like a reviewer then thinks like a security expert" does both jobs worse than dedicated specialists. This is true for human teams too.

The subagent pattern solves these three problems structurally.

Installing claude-agent-sdk 0.2.82 — SDK Structure I Verified Directly

pip install claude-agent-sdk

Version confirmed after install:

Successfully installed claude-agent-sdk-0.2.82

Running dir(claude_agent_sdk) in a temporary sandbox turned up these subagent-relevant names:

import claude_agent_sdk as sdk

sdk.AgentDefinition          # Subagent configuration dataclass
sdk.ClaudeAgentOptions       # Full options including agents dict
sdk.TaskBudget               # Token budget control
sdk.SubagentStartHookInput   # Input type for subagent start hooks
sdk.SubagentStopHookInput    # Input type for subagent stop hooks
sdk.list_subagents           # List subagents in a session
sdk.get_subagent_messages    # Retrieve a specific subagent's messages

I read the AgentDefinition source directly with inspect.getsource(). This is the actual dataclass in 0.2.82:

@dataclass
class AgentDefinition:
    description: str          # How the orchestrator identifies this agent
    prompt: str               # Subagent system prompt
    tools: list[str] | None = None
    disallowedTools: list[str] | None = None
    model: str | None = None  # "sonnet", "opus", "haiku", "inherit", or full model ID
    skills: list[str] | None = None
    memory: Literal["user", "project", "local"] | None = None
    mcpServers: list[str | dict[str, Any]] | None = None
    initialPrompt: str | None = None
    maxTurns: int | None = None  # Max loop count for this subagent
    background: bool | None = None
    effort: EffortLevel | int | None = None
    permissionMode: PermissionMode | None = None

One thing I noticed in the tools field comment: "Deprecated: passing 'Skill' here is deprecated; use skills instead." I hadn't seen that in the documentation. The separate skills field is the right place now.

Defining Subagents with AgentDefinition — A PR Review Pipeline

Here's the actual code. A PR auto-review pipeline needs three roles:

import asyncio
import claude_agent_sdk as sdk

# Define each role as a specialized subagent
code_reviewer = sdk.AgentDefinition(
    description="Python code quality and design review specialist",
    prompt=(
        "You're a Python senior engineer with 10 years of experience. "
        "Review code quality, readability, and design patterns. "
        "Provide concrete improvement suggestions in markdown format."
    ),
    tools=["Read", "Grep"],
    model="sonnet",
    maxTurns=8,
)

security_scanner = sdk.AgentDefinition(
    description="Security vulnerability scanner — injection risks, exposed secrets, unsafe operations",
    prompt=(
        "You're a security engineer. "
        "Find SQL injection risks, hardcoded secrets, unsafe eval/exec, "
        "and permission issues. Report each with severity level."
    ),
    tools=["Read", "Grep", "Bash"],
    model="sonnet",
    maxTurns=6,
)

doc_writer = sdk.AgentDefinition(
    description="Docstring and README writer — reads code and generates clear documentation",
    prompt=(
        "You're a technical writer. "
        "Write Google Style docstrings for functions and classes, "
        "and create usage examples for the README."
    ),
    tools=["Read", "Write", "Edit"],
    model="haiku",   # haiku is sufficient for docs and cheaper
    maxTurns=5,
)

# Orchestrator options
opts = sdk.ClaudeAgentOptions(
    system_prompt=(
        "You're a PR review orchestrator. "
        "Call code-reviewer, security-scanner, and doc-writer in parallel "
        "and compile a comprehensive review report from all results."
    ),
    allowed_tools=["Agent", "Read"],  # Agent tool is how subagents are invoked
    agents={
        "code-reviewer": code_reviewer,
        "security-scanner": security_scanner,
        "doc-writer": doc_writer,
    },
    permission_mode="bypassPermissions",
)

The dict keys in ClaudeAgentOptions.agents are the names the orchestrator uses when calling subagents. When the system prompt says "call code-reviewer," Claude invokes that agent via the Agent tool.

Parallel Execution Pattern — Running Three Agents Simultaneously

The most important line in the SDK documentation:

"Multiple subagents can run concurrently. When Claude identifies independent subtasks, it spawns multiple agents simultaneously using multiple Task tool calls in a single message."

When the orchestrator issues multiple Agent tool calls in a single message, they run in parallel. You don't write asyncio.gather() yourself. Tell the orchestrator to "call these agents in parallel" and the SDK handles it.

Actual query flow:

async def review_pr(pr_diff: str):
    results = []

    async for message in sdk.query(
        prompt=(
            f"Review this PR diff:\n\n{pr_diff}\n\n"
            "Run code-reviewer, security-scanner, and doc-writer simultaneously "
            "to analyze each domain in parallel, then compile a unified review report."
        ),
        options=opts,
    ):
        if isinstance(message, sdk.AssistantMessage):
            for block in message.content:
                if hasattr(block, "text"):
                    results.append(block.text)
        elif isinstance(message, sdk.ResultMessage):
            # total_cost_usd is optional, so fall back to 0.0 before formatting
            print(f"Total cost: ${(message.total_cost_usd or 0.0):.4f}")
            print(f"Duration: {message.duration_ms}ms")
            break

    return "\n".join(results)

Each subagent's context window starts fresh. From the official docs:

"A subagent's context window starts fresh, and the only channel from parent to subagent is the Agent tool's prompt string."

The orchestrator only sees the final result, not the intermediate reasoning. That's what prevents context contamination.

Controlling Costs with TaskBudget

Running three subagents concurrently doesn't just triple costs — it can amplify them unpredictably. Each agent might make redundant tool calls trying to do thorough work.

TaskBudget is the API-level fix:

opts = sdk.ClaudeAgentOptions(
    # ... same as above ...
    task_budget=sdk.TaskBudget(total=50000),  # 50K token budget
)

Actual class structure from inspect.getsource(sdk.TaskBudget):

class TaskBudget(TypedDict):
    """API-side task budget in tokens.

    When set, the model is made aware of its remaining token budget so it can
    pace tool use and wrap up before the limit. Sent as
    output_config.task_budget with the task-budgets-2026-03-13 beta header.
    """

    total: int

The task-budgets-2026-03-13 beta header is attached automatically. The agent becomes aware of its remaining budget and decides internally when to pace down and wrap up. Much cleaner than an external timeout that forces mid-task termination.

Combine with AgentDefinition.maxTurns for a two-tier safety net:

security_scanner = sdk.AgentDefinition(
    # ...
    maxTurns=6,  # Subagent level: max 6 turns
)

opts = sdk.ClaudeAgentOptions(
    # ...
    task_budget=sdk.TaskBudget(total=100000),  # Global level: 100K token ceiling
)

Subagent Hooks — Tracking Start and Stop Events

SubagentStartHookInput and SubagentStopHookInput let you detect exactly when each subagent starts and finishes:

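import time
from typing import Any

agent_timings: dict[str, float] = {}

# Assumes the SDK's published hook API: async callbacks that receive
# (input_data, tool_use_id, context), where input_data is a TypedDict,
# registered through HookMatcher(hooks=[...]). Confirm against your installed version.
async def on_agent_start(
    hook_input: sdk.SubagentStartHookInput, tool_use_id: str | None, context: Any
) -> dict[str, Any]:
    agent_timings[hook_input["agent_id"]] = time.time()
    print(f"▶ {hook_input['agent_type']} started (id: {hook_input['agent_id'][:8]})")
    return {}  # empty output: observe only, don't alter the run

async def on_agent_stop(
    hook_input: sdk.SubagentStopHookInput, tool_use_id: str | None, context: Any
) -> dict[str, Any]:
    start = agent_timings.get(hook_input["agent_id"], time.time())
    elapsed = time.time() - start
    print(f"■ {hook_input['agent_type']} done ({elapsed:.1f}s)")
    # hook_input["agent_transcript_path"] has the full subagent conversation log
    return {}

opts = sdk.ClaudeAgentOptions(
    # ...
    hooks={
        "SubagentStart": [sdk.HookMatcher(hooks=[on_agent_start])],
        "SubagentStop": [sdk.HookMatcher(hooks=[on_agent_stop])],
    },
)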

agent_transcript_path from SubagentStopHookInput is invaluable for production debugging. If a subagent produces unexpected output, that's where you look first.

Also worth knowing: multiple hook matchers on the same event run concurrently, not sequentially. The docs explicitly state this. Design each hook to be independent.

When to Use Subagents (and When Not To)

I want to be direct here: subagents aren't always the right choice.

Use them when:

You have 3+ independent tasks, each taking 10+ seconds
Different tasks need different tool access (security scanner doesn't need Write)
You've verified that context contamination actually hurts result quality

They're overkill when:

You have 2 tasks where task B depends on task A's output
Total runtime would be under 5 seconds (spawn overhead exceeds benefit)
It's a simple question-answer pattern
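The two lists above can be folded into a small pre-flight check. This is only a sketch of the rule of thumb, not anything the SDK provides: the Task dataclass, its field names, and worth_using_subagents are all hypothetical, and the thresholds are the ones stated in the lists.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    est_seconds: float               # rough runtime estimate for this task
    depends_on: tuple[str, ...] = () # names of tasks whose output it needs


def worth_using_subagents(tasks: list[Task]) -> bool:
    """True when at least three tasks are independent and take 10+ seconds each."""
    names = {t.name for t in tasks}
    # A task is independent if it needs no other task in this batch.
    independent = [t for t in tasks if not names.intersection(t.depends_on)]
    long_enough = [t for t in independent if t.est_seconds >= 10]
    return len(long_enough) >= 3


pr_review = [
    Task("code-review", 30),
    Task("security-scan", 20),
    Task("doc-generation", 25),
]
chained = [Task("plan", 15), Task("implement", 40, depends_on=("plan",))]

print(worth_using_subagents(pr_review), worth_using_subagents(chained))
```

The check is deliberately conservative: a chain of dependent tasks or a batch of short tasks falls back to a single agent.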

The same tradeoff comes up in the A2A + MCP hybrid architecture post: multi-agent structure adds complexity. More failure points, harder debugging, less predictable costs. Don't add subagents to a problem that a single agent can handle.

My personal threshold: "three or more independent tasks, each likely to consume 10K+ tokens with Opus." Below that, I stick with a single agent.

What I Couldn't Test

No API key meant I couldn't capture actual execution logs showing the three agents running in parallel. The object construction and type validation worked, but "what does the console output look like when three agents actually spawn concurrently" — I can't show you that from this run.

The fork_session function also caught my attention but didn't fit in this post. fork_session(session_id, up_to_message_id) lets you branch a session from a specific point. Useful when subagents want to try different strategies from the same base context without repeating earlier work.

Summary

The core of subagent orchestration in claude-agent-sdk 0.2.82:

AgentDefinition: Separate role, prompt, tools, and model per subagent
ClaudeAgentOptions.agents: Register subagent names for the orchestrator to call
Agent tool + parallel prompt instruction: Orchestrator spawns multiple subagents at once

Add TaskBudget and SubagentStartHookInput/SubagentStopHookInput for cost control and execution tracking.

Start with a single agent. Move to subagents when your task fits "independent, parallelizable, three or more."

References:

Subagents in the SDK — Claude API official docs
Building agents with the Claude Agent SDK — Anthropic engineering blog
claude-agent-sdk==0.2.82 PyPI package — direct install and source inspection (2026-05-18)

Gemini 2.5 Flash Thinking API: What I Learned from Running Budget=0, 1024, and 8000

Jangwook Kim — Sun, 17 May 2026 06:43:58 +0000

I assumed that turning on Thinking would always make Gemini smarter. After running actual experiments, I found out that's only half true.

I set thinking_budget to 0, 1024, and 8000 across three prompt types — simple tasks, math reasoning, and code review — and measured response time, output tokens, and thinking tokens for each combination. The numbers told a more nuanced story than I expected.

What Thinking API Actually Does

thinking_budget limits how many tokens the model can spend on "hidden reasoning" before it writes the response. Budget=0 disables thinking entirely. Budget=-1 lets the model decide how much to think. A positive integer sets the cap (maximum is 24576).

There's an important catch: thinking tokens aren't returned in the response by default, but they're billed at the same rate as output tokens. As covered in the LLM API pricing comparison, Gemini 2.5 Flash output tokens cost $0.0035/1K. Spending 1024 thinking tokens adds about $0.0036 per call on top.
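To make the billing rule concrete, here is a small cost helper. It takes the $0.0035/1K output price quoted above as an assumption (check current pricing before relying on it), and the function name is hypothetical.

```python
# Assumption: the output-token price quoted above, in USD per 1K tokens.
PRICE_PER_1K_OUTPUT_USD = 0.0035


def billed_output_cost(output_tokens: int, thinking_tokens: int = 0) -> float:
    """Thinking tokens are billed at the output rate, so add them before pricing."""
    return (output_tokens + thinking_tokens) * PRICE_PER_1K_OUTPUT_USD / 1000


# Extra cost of a call that spends a full 1024-token thinking budget.
extra = billed_output_cost(0, 1024)
print(f"${extra:.6f}")  # $0.003584
```

A fraction of a cent per call looks negligible, but it scales linearly with request volume, which is why the budget choice matters for batch workloads.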

One practical note: the google.generativeai package is now deprecated. You need the new google-genai package.

# ❌ Deprecated (no longer receiving updates)
import google.generativeai as genai

# ✅ Current standard
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Your prompt here",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,
            include_thoughts=True,  # expose thinking in response parts
        ),
    ),
)

# Separate thinking from the actual answer
for part in response.candidates[0].content.parts:
    if part.thought:
        print(f"[Thinking] {part.text[:100]}...")
    else:
        print(f"[Answer] {part.text}")

Setting include_thoughts=True surfaces the model's internal reasoning as separate response parts. Useful for debugging, though you'd keep it False in production.

How I Ran the Experiment

I created a fresh sandbox directory, installed only google-genai, and applied Budget=0/1024/8000 to three prompt types.

Measurements:

Response time (seconds): wall clock time
Output tokens: actual answer tokens
Thinking tokens: tokens consumed in internal reasoning (usage_metadata.thoughts_token_count)

Prompts:

Simple task: "Explain in one sentence how to sort a list in Python"
Math reasoning: Find all two-digit positive integers satisfying three conditions
Code review: Find bugs and improvements in a simple Python function

Results: What the Numbers Show

These are the actual measured values. No smoothing.

Task	Budget=0	Budget=1024	Budget=8000
Simple task	1.4s / 54 out tok	6.8s / 61 out tok / 751 think tok	9.0s / 45 out tok / 1282 think tok
Math reasoning	8.8s / 2143 out tok	15.1s / 1915 out tok / 918 think tok	26.2s / 1671 out tok / 4036 think tok
Code review	6.7s / 1367 out tok	13.1s / 1126 out tok / 734 think tok	22.6s / 2055 out tok / 1824 think tok

Simple task: Budget=0 finished in 1.4 seconds. Budget=1024 took 6.8 seconds — nearly 5x slower — with no discernible quality improvement. Budget=8000 consumed 1282 thinking tokens and still produced a shorter answer (45 tokens). Complete overkill.

Math reasoning: This is where things got interesting. With Budget=0, the model produced 2143 output tokens. It was "thinking out loud" inside the answer, writing out every step of its reasoning. Budget=1024 used 918 thinking tokens internally and produced 1915 output tokens. Total billed tokens rose from 2143 to 2833, but the response was more structured. Budget=8000 pushed thinking to 4036 tokens and the output dropped to 1671: the model reasoned privately and wrote a more concise answer, at a total of 5707 billed tokens.

Code review: Budget=1024 actually cut output from 1367 to 1126 tokens. The answer was more focused. Budget=8000 expanded to 2055 tokens, a more thorough analysis but more than 3x slower than Budget=0 (22.6s versus 6.7s). Which is better depends entirely on your use case.
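The comparisons above are easier to follow when billed tokens (output plus thinking) and slowdown relative to Budget=0 are computed directly from the table. The sketch below copies the measured values. It assumes zero thinking tokens at Budget=0, which the table leaves blank, and the summarize helper is just an illustration.

```python
# (latency_s, output_tokens, thinking_tokens) copied from the results table.
# Assumption: thinking tokens are 0 at Budget=0, where the table lists none.
RESULTS = {
    "simple": {0: (1.4, 54, 0), 1024: (6.8, 61, 751), 8000: (9.0, 45, 1282)},
    "math": {0: (8.8, 2143, 0), 1024: (15.1, 1915, 918), 8000: (26.2, 1671, 4036)},
    "code_review": {0: (6.7, 1367, 0), 1024: (13.1, 1126, 734), 8000: (22.6, 2055, 1824)},
}


def summarize(task: str) -> dict[int, dict[str, float]]:
    """Per budget: billed tokens and latency relative to the Budget=0 run."""
    base_latency = RESULTS[task][0][0]
    rows = {}
    for budget, (latency, out_tok, think_tok) in RESULTS[task].items():
        rows[budget] = {
            "billed_tokens": out_tok + think_tok,
            "slowdown": round(latency / base_latency, 2),
        }
    return rows


for task in RESULTS:
    print(task, summarize(task))
```

Read this way, the math task at Budget=1024 bills 2833 tokens against 2143 at Budget=0, and the simple task is 4.86x slower for an answer of about the same length.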

Choosing the Right Budget for Each Task Type

Here's the practical framework I settled on from this experiment. Not a universal rule, but a solid starting point.

Budget=0 (thinking disabled) for:

Classification, labeling, tagging
Summarization, translation, format conversion
Simple Q&A, factual lookups
High-volume batch processing where cost matters

The simple task responded in 1.4 seconds at Budget=0. Giving it 1024 budget means waiting 6.8 seconds and paying for 751 extra tokens. No benefit. Pure waste.

Budget=1024–2048 (moderate thinking) for:

Code review and bug finding where focused analysis matters
Medium-complexity reasoning
Multi-step judgment calls that are still latency-sensitive

I'll be honest — the code review at Budget=1024 felt better than Budget=0 even though the response was shorter. The unnecessary padding was gone. Just the key points.

Budget=4000–8000 (deep thinking) for:

Complex math, algorithm design
Thorough architecture reviews
Multi-step planning
Tasks where accuracy matters far more than speed

Budget=8000 on the math problem consumed 4036 thinking tokens in 26 seconds. That latency is unacceptable in any interactive context. I'd only use this for offline batch analysis or asynchronous background jobs.

The Gemini 2.5 Flash cost optimization guide covers this too: thinking tokens and output tokens are priced identically. Using Budget=8000 indiscriminately can multiply your costs by several times.
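One way to keep this framework from drifting is to encode it as a lookup. The sketch below is a hypothetical helper, not part of google-genai: the task-type labels, the 1024 default for unknown tasks, and the 2048 cap for interactive calls are all assumptions drawn from the tiers above.

```python
# Hypothetical task-type labels mapped to the budget tiers described above.
BUDGET_BY_TASK = {
    "classification": 0,
    "summarization": 0,
    "translation": 0,
    "simple_qa": 0,
    "code_review": 1024,
    "medium_reasoning": 2048,
    "complex_math": 8000,
    "architecture_review": 8000,
}

INTERACTIVE_CAP = 2048  # assumption: deep thinking is too slow for interactive use


def pick_thinking_budget(task_type: str, interactive: bool = True) -> int:
    """Return a thinking_budget for the task, capped when a user is waiting."""
    budget = BUDGET_BY_TASK.get(task_type, 1024)  # unknown tasks get the middle tier
    if interactive:
        budget = min(budget, INTERACTIVE_CAP)
    return budget


print(pick_thinking_budget("classification"))                   # 0
print(pick_thinking_budget("complex_math"))                     # 2048
print(pick_thinking_budget("complex_math", interactive=False))  # 8000
```

Keeping the table in one place also makes it easy to revisit the tiers when measurements change.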

Production Code: Tracking Thinking Usage

Here's the pattern I use to monitor thinking token consumption in production.

from google import genai
from google.genai import types
import time

def generate_with_thinking(
    client: genai.Client,
    prompt: str,
    budget: int = 1024,
    model: str = "gemini-2.5-flash",
) -> dict:
    """Generate a response while tracking thinking token usage."""
    start = time.perf_counter()

    config = types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=budget,
            include_thoughts=False,  # False in production
        ),
    )

    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=config,
    )

    elapsed = time.perf_counter() - start
    usage = response.usage_metadata

    # Any of these counts may be None, so coerce to 0 before adding them up
    input_tokens = usage.prompt_token_count or 0
    output_tokens = usage.candidates_token_count or 0
    thinking_tokens = getattr(usage, "thoughts_token_count", 0) or 0

    return {
        "text": response.text,
        "latency_s": round(elapsed, 2),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "thinking_tokens": thinking_tokens,
        "total_tokens": input_tokens + output_tokens + thinking_tokens,
    }

# Usage
client = genai.Client(api_key="YOUR_API_KEY")

result = generate_with_thinking(
    client,
    "Find potential memory leaks in this code: ...",
    budget=2048,
)

print(f"Latency: {result['latency_s']}s")
print(f"Thinking tokens: {result['thinking_tokens']}")
print(f"Total billed tokens: {result['total_tokens']}")

usage_metadata.thoughts_token_count sometimes returns 0 — when budget=0 or the model decided it didn't need to think. Track this metric and you'll quickly learn how often thinking actually fires for your prompts.
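One way to track that, assuming you collect the dicts returned by generate_with_thinking into a list. The thinking_fire_rate helper and the sample numbers are illustrative, not measurements.

```python
def thinking_fire_rate(results: list[dict]) -> float:
    """Fraction of calls whose thinking token count was above zero."""
    if not results:
        return 0.0
    fired = sum(1 for r in results if r.get("thinking_tokens", 0) > 0)
    return fired / len(results)


# Illustrative sample: four calls, thinking fired on two of them.
sample = [
    {"thinking_tokens": 0},
    {"thinking_tokens": 512},
    {"thinking_tokens": 0},
    {"thinking_tokens": 930},
]
print(f"Thinking fired on {thinking_fire_rate(sample):.0%} of calls")
```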

Where Thinking API Falls Short

I want to be direct about the frustrating parts.

Dynamic mode (Budget=-1) is unpredictable. The model deciding its own budget sounds convenient, but it can fire thinking on simple tasks. In my simple task experiment, Budget=-1 took around the same time as Budget=1024. If you can't predict latency and cost, you can't budget for it in production.

thinking_budget and thinking_level can't coexist. Gemini 3.x uses thinking_level; 2.5 uses thinking_budget. Mix them in the same call and you get a 400 error. This is documented but the error message isn't obvious enough to catch immediately if you're migrating code.
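When migrating, it can help to build the thinking settings in one place so the two parameters never end up in the same request. The sketch below returns a plain dict rather than an SDK object. The thinking_settings helper, the prefix check on the model name, the "gemini-3-example" model name, and the "low" level value are all assumptions for illustration.

```python
def thinking_settings(model: str, budget: int = 0, level: str = "low") -> dict:
    """Return exactly one thinking parameter, chosen by model family."""
    if model.startswith("gemini-2.5"):
        return {"thinking_budget": budget}
    if model.startswith("gemini-3"):
        return {"thinking_level": level}
    # Fail loudly instead of guessing which parameter an unknown model expects.
    raise ValueError(f"Unknown model family: {model}")


print(thinking_settings("gemini-2.5-flash", budget=1024))
print(thinking_settings("gemini-3-example"))  # hypothetical model name
```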

Thinking tokens don't benefit from context caching. Even if you use context caching to reduce the cost of a long system prompt, thinking tokens are billed fresh every time. As I covered in the AI agent cost reality post, costs in agent loops can spiral faster than expected when thinking tokens accumulate.

My Take

Thinking API isn't overhyped. But "just turn it on" is also wrong.

My position: use Budget=0 as the default, and explicitly activate Budget=1024–2048 only when the task genuinely needs multi-step reasoning. Keep Budget=8000 for batch jobs or offline analysis where accuracy is paramount.

Skip dynamic mode (Budget=-1) in production. Predictability beats convenience when you're billing actual users.

The counterintuitive finding that stuck with me: for complex math, disabling thinking caused the model to "think out loud" across 2143 output tokens. Enabling Budget=1024 moved the reasoning internal and dropped output to 1915 tokens. The total cost difference was smaller than I expected. Whether you come out ahead depends on the task.

Wrapping Up

Without running the experiments, I would have defaulted to "more thinking = better." The measurements said otherwise.

Gemini 2.5 Flash Thinking API is a genuinely useful tool when applied to the right tasks. The paradoxical effect — where enabling thinking reduces total tokens for complex reasoning — is real and worth knowing. But applying it blindly to simple tasks wastes money and time.

Before setting thinking_budget, ask one question first: does this task actually require deep reasoning? Most of the time, the answer is no.

All code in this post is reproducible using the snippets provided. Written against google-genai package 0.8.x.

Claude Managed Agents' Dreaming, Outcomes, and Orchestration — How Agents Self-Improve While You Sleep

Jangwook Kim — Sat, 16 May 2026 06:41:32 +0000

When Anthropic announced three new features at the Code with Claude conference in San Francisco on May 6, my first thought was: "What exactly is this agent learning while I'm away?"

Dreaming. Outcomes. Multiagent Orchestration. The names have a marketing ring to them, but the underlying engineering decisions are concrete. And there's one thing people consistently misread about Dreaming in particular: when Anthropic says "the agent learns," they don't mean the model improves. The memory improves. That distinction matters more than it might seem.

I couldn't test Dreaming directly — no API access, and it's still Research Preview. This analysis is based on official documentation, Anthropic's blog posts, conference materials, and early pilot reports. I won't claim to have run what I didn't run.

Code with Claude 2026 — No New Model, All Infrastructure

The most telling thing about the May 6 SF keynote was the absence of a model announcement. Instead of competing on benchmark numbers, Anthropic focused on the execution layer for agents.

What was announced:

Dreaming: Automated agent memory refresh (Research Preview)
Outcomes: Success-criteria-based self-evaluation and iteration (Public Beta)
Multiagent Orchestration: Lead-subagent parallel execution (Public Beta)
Usage limits doubled across Pro, Max, Team, and Enterprise
Peak-hour throttling removed for Pro and Max
Claude Security: Code vulnerability scanner powered by Opus 4.7 (Enterprise)
Remote Agents: Control your laptop from your phone
SpaceX Project Colossus partnership (220,000+ GPUs)

Notion, Rakuten, Sentry, and Harvey are already running these features in production, according to Anthropic. The conference continues in London (May 19) and Tokyo (June 10).

The pattern here is worth noting: Anthropic isn't trying to win the model benchmark race this month. They're building the plumbing that makes large-scale agent deployment tractable.

Dreaming — A Memory Consolidation System, Not Model Training

Anthropic reaches for the hippocampus metaphor when explaining Dreaming — the way the human brain replays the day's events during sleep and decides what to keep. It's a reasonable analogy. Here's what the system actually does:

Reviews up to 100 past agent sessions
Extracts patterns: recurring mistakes, converged workflows, team preferences
Removes duplicates and stale entries from the existing memory store, adds new ones
Preserves original session transcripts untouched

This is not fine-tuning. Anthropic is explicit: "Dreaming does not modify the underlying model weights." What changes is the memory store that the next agent session reads on startup. The model itself is unchanged.
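To make the distinction concrete, here is a toy model of that kind of consolidation pass. It is not Anthropic's implementation: the data shapes, the consolidate function, and the rule that a note must recur in at least two sessions are all assumptions chosen for illustration. The point is that only the memory list changes, while the session records are read and never modified.

```python
from collections import Counter

MAX_SESSIONS = 100  # the review limit stated above


def consolidate(memory: list[str], sessions: list[list[str]],
                stale: set[str], min_recurrence: int = 2) -> list[str]:
    """Return a new memory list: deduplicated, stale entries dropped, recurring notes added."""
    recent = sessions[-MAX_SESSIONS:]
    # Count each note once per session so one noisy session cannot dominate.
    counts = Counter(note for session in recent for note in set(session))
    recurring = [note for note, n in counts.items() if n >= min_recurrence]

    merged: list[str] = []
    for note in memory + sorted(recurring):
        if note not in stale and note not in merged:
            merged.append(note)
    return merged


memory = ["prefer small diffs", "prefer small diffs", "old tool path"]
sessions = [["docx needs converter"], ["docx needs converter", "one-off issue"]]
print(consolidate(memory, sessions, stale={"old tool path"}))
```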

Harvey's pilot result gets cited constantly: task completion rates rose roughly sixfold after Dreaming was enabled. The mechanism was specific: agents started remembering filetype workarounds and tool-specific behavior across sessions, which is exactly the kind of thing that breaks legal document workflows repeatedly.

I think the six-times number is interesting but requires context. Harvey processes legal documents. Same document structures, same tools, repetitive review workflows — a domain where pattern extraction is tractable and where the same mistakes genuinely recur. Extrapolating "6x completion rate" to a general-purpose agent handling varied requests every session is not supported by this data.

Dreaming is Research Preview. I can describe what the documentation says it does; I can't report what running it feels like.

Outcomes — Productizing the LLM-as-Judge Pattern

Outcomes isn't a new idea. LLM-as-judge — using a separate model instance to evaluate agent output — is already standard practice in many agent pipelines. What Anthropic is doing is turning it into a managed primitive with a specific architecture.

How Outcomes works:

1. Developer writes a rubric defining success
   Example: "The contract clause must satisfy legal requirements A, B, and C"

2. Writer agent generates output

3. Grader runs in a separate context window, evaluates against rubric
   — Independent of the writer's reasoning process
   — Returns per-criterion pass/fail verdict

4. If anything fails → Grader sends specific feedback to writer

5. Writer revises and retries

6. All criteria pass → Return result

The critical design choice is that the grader runs in a completely separate context window. It can't see the writer's reasoning; it only sees the output and the rubric. This is what distinguishes Outcomes from asking the same agent to "check your own work." Self-review in the same context window is biased — the agent is predisposed to justify what it already produced.

Anthropic's internal benchmark numbers: 8.4% improvement in Word document quality, 10.1% for PowerPoint slides.

In practice, rubric design becomes the core work. Too permissive, and Outcomes adds latency without benefit. Too strict, and you get an infinite retry loop. The API for grader configuration isn't something I could test at this tier, but the Outcomes cookbook example on the Claude platform shows the pattern clearly.
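The loop, including a retry cap that guards against the too-strict case, can be sketched in plain Python. This is not the Outcomes API: run_with_outcomes, the writer callable, the grade function, and max_attempts are hypothetical stand-ins for the managed components.

```python
from typing import Callable

# A rubric maps a criterion name to a check on the output text.
Rubric = dict[str, Callable[[str], bool]]


def grade(output: str, rubric: Rubric) -> dict[str, bool]:
    """Grader: sees only the output and the rubric, returns per-criterion verdicts."""
    return {name: check(output) for name, check in rubric.items()}


def run_with_outcomes(writer: Callable[[list[str]], str], rubric: Rubric,
                      max_attempts: int = 3) -> tuple[str, int]:
    """Call the writer until every criterion passes or the attempt cap is hit."""
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        output = writer(feedback)
        verdicts = grade(output, rubric)
        feedback = [name for name, passed in verdicts.items() if not passed]
        if not feedback:
            return output, attempt
    raise RuntimeError(f"Criteria still failing after {max_attempts} attempts: {feedback}")


def toy_writer(feedback: list[str]) -> str:
    """Toy writer: adds a sentence for whatever the grader reported missing."""
    return "Covers requirement A." + "".join(f" Covers {name}." for name in feedback)


rubric = {
    "requirement B": lambda text: "requirement B" in text,
    "requirement C": lambda text: "requirement C" in text,
}
result, attempts = run_with_outcomes(toy_writer, rubric)
print(attempts, result)
```

The grader function never receives the writer's internal state, only its output, which mirrors the separate-context design. The cap turns an unsatisfiable rubric into an explicit error instead of an open-ended bill.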

The Managed Agents deployment walkthrough I wrote in April covered the $0.08/session baseline cost. Outcomes adds grader session cost on top — how much depends on rubric complexity and how many retry cycles each task needs.

Multiagent Orchestration — Standardizing the Parallel Pattern

Running multiple specialized agents in parallel for complex tasks isn't new either. Five agentic workflow patterns for Claude Code covered the architecture. What Orchestration adds is a managed version of that pattern:

Lead agent decomposes the task and delegates to specialists
Up to 20 subagents run in parallel
Each subagent has its own model, prompt, and tool configuration
Shared filesystem for output coordination
Full flow traceable in Claude Console

graph TD
    A["Complex Task"] --> B["Lead Agent\nDecompose & Delegate"]

    subgraph Parallel ["Parallel Execution (up to 20)"]
        B --> C["Subagent 1\nModel A + Tool X"]
        B --> D["Subagent 2\nModel B + Tool Y"]
        B --> E["Subagent N\nModel C + Tool Z"]
    end

    C --> F["Shared Filesystem"]
    D --> F
    E --> F
    F --> G["Lead Agent\nSynthesis & Final Output"]

The per-subagent model configuration is a meaningful addition. A code generation subagent running Opus 4.7 alongside a fast validation subagent running Haiku 4.5 is cost-efficient without sacrificing output quality where it matters. This is the heterogeneous agent fleet pattern made easier to implement.
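The fan-out shape is easy to see in a local sketch. Nothing here calls the Managed Agents API: the Subagent class, the orchestrate function, and the model labels are hypothetical, a thread pool stands in for managed parallel execution, and a list of results stands in for the shared filesystem.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

MAX_SUBAGENTS = 20  # the parallel limit described above


@dataclass
class Subagent:
    name: str
    model: str        # each subagent carries its own model
    tools: list[str]  # and its own tool list

    def run(self, subtask: str) -> str:
        # Placeholder for a real model call.
        return f"[{self.name}/{self.model}] {subtask}"


def orchestrate(assignments: list[tuple[Subagent, str]]) -> str:
    """Lead agent: run subtasks in parallel, then join the results in order."""
    if len(assignments) > MAX_SUBAGENTS:
        raise ValueError(f"At most {MAX_SUBAGENTS} subagents can run in parallel")
    with ThreadPoolExecutor(max_workers=len(assignments) or 1) as pool:
        results = list(pool.map(lambda pair: pair[0].run(pair[1]), assignments))
    return "\n".join(results)


coder = Subagent("coder", "large-model", ["editor"])
checker = Subagent("checker", "small-model", ["linter"])
print(orchestrate([(coder, "write parser"), (checker, "validate parser")]))
```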

The Self-Improvement Loop All Three Create Together

Viewed individually, these three features look like separate product additions. Viewed together, they form a closed loop:

graph TD
    subgraph Cycle ["Self-Improvement Cycle"]
        A["Agent Execution\n(Orchestration: parallel)"] --> B["Output Generation"]
        B --> C["Outcomes Evaluation\n(Grader: separate context)"]
        C -->|"Criteria failed"| D["Correction + Retry"]
        D --> B
        C -->|"All passed"| E["Final Output + Session Log"]
    end

    E --> F["Dreaming\n(Cross-session pattern extraction)"]
    F --> G["Memory Store Update"]
    G --> A

Observe: Session data accumulates as the agent works.

Evaluate: Outcomes' grader evaluates each task against success criteria. Failure reasons are recorded.

Improve: Dreaming periodically reviews accumulated sessions and updates memory. The next session's agent starts with that enriched context.

Over time, the agent doesn't acquire new skills — it accumulates operational knowledge about "what to watch out for in which situations." The model stays constant; the effective behavior improves.

Hindsight MCP's approach to experience-based memory refresh covers similar territory from a different angle. Comparing both designs is useful for thinking through agent memory architecture choices.

Where I'm Skeptical

Several things give me pause.

First, the Harvey 6x number. Legal document processing is structured and repetitive. The same contracts, the same tools, the same review workflows. Pattern extraction works well here. Claiming "agents improve 6x with Dreaming" generalizes a domain-specific result in a way that sets unrealistic expectations for varied workloads.

Second, memory poisoning risk. If an agent consistently approaches tasks incorrectly, Dreaming could entrench those bad patterns. Anthropic offers a "review changes before they land" option, but how many teams will actually review every memory update in a high-volume production system? This needs better tooling.

Third, auditability tension. A system where the agent autonomously changes its own behavioral patterns is hard to audit. "Why did the agent make that decision six months ago?" requires memory store version history — and the tooling for that isn't clearly specified yet.

Fourth, Research Preview status. Dreaming is not production-ready. Unlike Outcomes and Orchestration (Public Beta), Dreaming hasn't been validated at production scale. The agent cost reality analysis I did earlier applies here too: governance costs, monitoring costs, and debugging costs are real costs even when tokens are cheap.

Fifth, Outcomes grader cost scales with retry depth. A rubric with five criteria and a task that fails on the first three passes could triple the session cost relative to a baseline run. There's no cost estimation tooling for this yet.

Who Should Use This Now

Start with Outcomes. If you're already running Managed Agents and output consistency is your main problem, rubric design is worth the investment. The separate grader context genuinely addresses the self-review bias problem. It's Public Beta and the most straightforward of the three to adopt.

Orchestration makes sense when tasks are too large or require too many specializations for a single agent: large report generation, simultaneous code review and documentation, multi-source data synthesis. Be careful about orchestration overhead: 20 poorly configured subagents can absorb the gains from parallelization.

Dreaming: proceed carefully. Research Preview means production stability isn't guaranteed. The agents most likely to benefit are those handling repetitive, structured work over long periods. For varied-request agents, the improvement trajectory is less predictable.

I find the three-feature combination genuinely interesting as a system design. Observe → Evaluate → Improve is a clean loop. But I'd push back on the "self-improving agent" framing that turns this into something magical. Memory can be wrong. Research Preview features aren't production-ready. And the Harvey pilot, while compelling, is a single domain under specific conditions.

Feasibility Assessment

What I could verify directly: Anthropic SDK installation, basic Messages API connectivity. Dreaming, Outcomes, and Orchestration require Enterprise or Beta plan access — I didn't run them.

Primary sources consulted:

New in Claude Managed Agents: dreaming, outcomes, and multiagent orchestration — Anthropic official
Outcomes implementation cookbook — Claude platform
Code w/ Claude SF 2026 summary — Anthropic blog
VentureBeat coverage of Dreaming
Simon Willison's live blog from Code with Claude

When more teams share production Dreaming results — especially outside legal tech — I'll update my read. For now: interesting architecture, still needs validation beyond a single pilot.

Cloudflare Agents Week 2026 — When AI Agents Become Cloud Customers

Jangwook Kim — Fri, 15 May 2026 06:43:30 +0000

This time last year, every AI agent infrastructure conversation started with Kubernetes + LangGraph. Cloudflare's April Agents Week presented a different picture. Agents don't just call APIs — they create Cloudflare accounts, register domains, and deploy code on their own. The phrase "agents as cloud customers" sounds like marketing fluff, but this time they actually built it.

Here's my read on what matters, what doesn't, and where I'm skeptical.

What Agents Week Was

Cloudflare held Agents Week in April 2026 and shipped announcements every day, with 20+ new features and GA transitions by the end of it. The overall impression is a company-wide bet that agents will be the primary actors on the internet, and they rebuilt infrastructure accordingly across compute, storage, networking, and security.

I'm focusing on the items that actually affect how you write and deploy agent code.

The Most Provocative Announcement — Agents That Create Their Own Accounts

My honest reaction when I first read this: "is this real?" The mechanics: after a user accepts Cloudflare's terms of service once, agents can autonomously create a Cloudflare account, start a paid subscription, register a domain, get an API token, and deploy code. A Stripe partnership handles payment tokenization; OAuth + OIDC authenticate the agent as a trusted actor.

The implication is significant. Until now, agents worked within infrastructure that humans provisioned. Now agents can be the entity that provisions the infrastructure itself. If you're building a SaaS product, "agent handles new customer onboarding end-to-end" becomes a real architectural option.

That said, I have two concerns I can't shake. First, an agent connected to live billing requires airtight cost controls. Cloudflare's new task_budget concept seems designed for exactly this, but real-world examples of the two working together are scarce. Second, the legal accountability picture is murky. If an agent registers the wrong domain or incurs unexpected charges, who owns that? User consent to ToS exists, but the specific liability hasn't been tested.

Three Announcements Worth Your Attention

Past the headline, here are the things I'd actually build with.

Sandboxes GA: Nine months from beta (June 2025) to general availability. Each sandbox is an isolated Linux environment — real shell, real filesystem, background processes — that spins up on demand and, critically, picks up exactly where it left off after interruption. Sub-millisecond start times mean a code-generation agent can write, execute, observe output, and iterate in tight loops.

Compared to setting up a separate code execution environment alongside LangGraph or CrewAI, Sandboxes shifts the question from "how do I configure the execution environment" to "which infrastructure layer do I trust to manage it." Those are meaningfully different decisions.

Artifacts: Git-compatible versioned storage for agents. Create tens of millions of repos, fork from any remote, access with standard Git clients. Moved from private beta to public beta in early May. The practical use case: agents that produce code outputs now have a permanent home for those outputs, one that survives context resets and is accessible from outside Cloudflare's stack.

Dynamic Workers: Isolated runtime for AI-generated code. Millisecond spin-up, scales to millions of concurrent executions. Enables the generate-execute-observe loop agents need without managing container infrastructure. Still feels early but the concept is right.

I Actually Installed the SDK

Theory aside, I ran through the setup myself.

mkdir cloudflare-agent-demo && cd cloudflare-agent-demo
npm init -y
npm install @cloudflare/agents

Clean install. @cloudflare/agents@0.0.16 exports Agent, AIChatAgent, and routeAgentRequest as the main surfaces.

Here's a minimal but representative agent:

// src/index.ts
import { Agent, routeAgentRequest } from "@cloudflare/agents";

interface TaskState {
  processedCount: number;
  lastHeartbeat: string;
}

interface Env {
  TASK_AGENT: DurableObjectNamespace<TaskAgent>;
}

export class TaskAgent extends Agent<Env, TaskState> {
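  async onStart() {
    // Create the table that heartbeat() and onEmail() use; without it the first query fails
    this.sql`
      CREATE TABLE IF NOT EXISTS tasks (
        id TEXT PRIMARY KEY,
        content TEXT,
        created_at INTEGER
      )
    `;
    this.setState({ processedCount: 0, lastHeartbeat: new Date().toISOString() });
    // Built-in cron scheduling, no external scheduler needed
    await this.schedule("0 * * * *", "heartbeat", {});
  }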

  async heartbeat() {
    const count = this.sql<{ n: number }>`SELECT COUNT(*) as n FROM tasks`;
    this.setState({
      processedCount: count[0]?.n ?? 0,
      lastHeartbeat: new Date().toISOString()
    });
  }

  async onRequest(request: Request): Promise<Response> {
    return Response.json({ state: this.state });
  }

  // Agents receive email directly
  async onEmail(email: ForwardableEmailMessage) {
    this.sql`
      INSERT INTO tasks (id, content, created_at)
      VALUES (${crypto.randomUUID()}, ${email.from}, ${Date.now()})
    `;
  }
}

export default {
  fetch: async (req: Request, env: Env): Promise<Response> => {
    const routed = await routeAgentRequest(req, env);
    return routed ?? new Response("OK", { status: 200 });
  }
};

wrangler dev starts immediately, no Cloudflare account needed for local work:

⛅️ wrangler 4.91.0
Your Worker has access to the following bindings:
  env.TASK_AGENT (TaskAgent)   Durable Object   local

⎔ Starting local server...
[wrangler:info] Ready on http://localhost:9998
[wrangler:info] GET / 200 OK (7ms)

One important caveat: @cloudflare/agents is Workers runtime-only. Trying to run it with standard Node.js throws ERR_UNSUPPORTED_ESM_URL_SCHEME because of the cloudflare: protocol imports. You need Wrangler. If you're used to SDKs like the Claude Agent SDK that run anywhere in Python or Node, this is an adjustment.

Architecture Choices Worth Understanding

A few design decisions in the SDK that reflect Cloudflare's broader approach:

Embedded SQLite: Declare new_sqlite_classes in wrangler.toml and every Agent instance gets its own SQLite. No external database configuration. Query with this.sql. The Durable Object isolation model gives you natural multi-tenancy — each agent instance has independent data. Sounds wasteful but it's actually clean for state isolation.

In-process scheduling: Register cron jobs directly from agent code. No external cron service. Wraps the Durable Object alarm API, which keeps scheduling and state management co-located. High cohesion, lower operational surface.

Email handler: onEmail lets agents receive email directly via Workers Email Routing. An agent that turns email into tasks is straightforward to write.

The way Dapr Agents handles state and messaging through Kubernetes sidecar patterns contrasts interestingly here. Cloudflare's model is more code-centric; Dapr is more infrastructure-centric. Both have legitimate use cases.

Where I'm Skeptical

I'll be direct about the rough edges.

Vendor lock-in is significant. The cloudflare:workers runtime dependency means your agent code doesn't run outside Cloudflare's stack. Migrating to a different platform later means substantial rewrites. Containerized approaches like running MCP servers on Kubernetes don't have this problem. With Cloudflare, you are trading portability for operational simplicity now.

Multi-agent orchestration is thin. The single-agent story is compelling. But the SDK-level support for complex multi-agent coordination — handoffs, shared memory, hierarchical orchestration — is limited. Project Think is meant to address this but it's early. If your use case involves agents coordinating at scale, you'll need to build significant structure yourself.

SDK maturity. @cloudflare/agents@0.0.16 is pre-1.0. The API surface will change. For production use, you're accepting that risk.

My Take on When to Use This

Cloudflare is the right infrastructure choice when: response latency at the edge matters for your agents, your team already operates Cloudflare Workers, you want to minimize infrastructure management and focus on agent logic, or your architecture involves many independent agents each owning their own state.

It's not the right choice when: you need complex multi-agent orchestration and you're already invested in LangGraph, you're locked to AWS or GCP infrastructure, or your agents need to run in Python or standard Node.js environments.

The overall direction from Agents Week is coherent. Cloudflare is positioning itself as the infrastructure layer for the agent era — what Kubernetes became for containers. The SDK being at v0 means production adoption should be cautious, but the design thinking is consistent. Worth running through the setup and forming your own opinion.

Signed Agents: Cryptographic Identity for Agent Traffic

One announcement that got less coverage but caught my attention: Signed Agents. The idea is that HTTP requests made by agents carry a cryptographic signature proving their origin — "this was sent by an agent, not a human."

Right now there's no standard way to distinguish agent traffic from human traffic on the internet. User-Agent strings and IP patterns are guesses at best. Signed Agents gives servers a verifiable signal: they can check the signature and apply agent-specific rate limits, billing, or access controls. It's an early-stage primitive but it's the right one to build. Once agents are common enough to treat as distinct traffic types, having cryptographic identity for them becomes infrastructure rather than a feature.
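To illustrate the verify-then-apply-policy flow, here is a minimal sketch using HMAC-SHA256 from Python's standard library. It is a deliberate simplification and not how Signed Agents is implemented: a shared secret stands in for whatever signature scheme the real feature uses, and the function names, message layout, and rate-limit numbers are all invented for the example.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, agent_id: str) -> str:
    """Agent side: sign the parts of the request the server will check."""
    message = f"{method}\n{path}\n{agent_id}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, agent_id: str,
                   signature: str) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_request(secret, method, path, agent_id)
    return hmac.compare_digest(expected, signature)


def rate_limit_for(is_verified_agent: bool) -> int:
    """Hypothetical policy: verified agents get their own requests-per-minute budget."""
    return 600 if is_verified_agent else 60


secret = b"demo-secret"  # placeholder, never hard-code real secrets
sig = sign_request(secret, "GET", "/api/items", "agent-123")
verified = verify_request(secret, "GET", "/api/items", "agent-123", sig)
print(verified, rate_limit_for(verified))
```

Changing any signed field, such as the path, makes verification fail, which is what lets the server trust the signal more than a User-Agent string.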

Email Service Public Beta

Workers Email Service graduated to public beta during Agents Week. Any agent can now send email without integrating a third-party service like SendGrid or AWS SES.

Combined with the onEmail handler already in the SDK, agents can now handle both inbound and outbound email entirely within Cloudflare's stack. An agent that receives a customer email, processes it, creates a task, and sends a reply — with no external email service in the loop. For customer support agents, notification pipelines, or email-based task management, this is a meaningful simplification.

The Bigger Picture

Looking at Agents Week as a whole, it reads less like a feature release and more like a positioning statement. Twenty-plus announcements, all pointing the same direction: Cloudflare intends to be the infrastructure layer for the agent era the way AWS became the infrastructure layer for the web era.

The single thing I'd actually go build with first from this week: Sandboxes. Not the headline "agents create accounts" story — the persistent isolated Linux environment for agent code execution. That's immediately useful for any code-generation or code-testing agent, and it works today without novel legal or financial risk.

@cloudflare/agents@0.0.16 tells you what you need to know about production readiness. But if you're serious about evaluating agent infrastructure options, run through the local setup and form your own opinion. Twenty minutes, no account required.

Test environment: @cloudflare/agents@0.0.16, wrangler@4.91.0, Node.js v22.22.0, macOS 14

Note: The autonomous agent account creation feature requires a real Cloudflare account and Stripe integration — out of scope for local testing.

Source: Cloudflare Agents Week 2026

AWS MCP Server GA Practical Guide — Connecting CloudWatch & IAM to Your AI Coding Agent

Jangwook Kim — Thu, 14 May 2026 06:42:39 +0000

A CloudWatch alarm fired. Lambda error rate crossed the threshold, and I needed to dig through logs — flipping between the AWS console and my terminal, copying log group names by hand. At some point I had a clear thought: what if Claude Code could just look at my CloudWatch directly?

On May 6, 2026, AWS shipped an answer. AWS MCP Server hit general availability.

What the AWS MCP Server Actually Is

AWS MCP Server gives AI coding agents (Claude Code, Cursor, Codex) a standardized way to query AWS services directly. It wraps AWS APIs as MCP tools, using the Model Context Protocol that Anthropic defined. One uvx command per server wires 31 CloudWatch tools and 29 IAM tools into your coding agent.

Instead of copying log group names from the console and pasting them into CLI commands, you can ask your agent: "Find the Lambda function with the highest error rate in the past hour and summarize the relevant logs." The agent runs the Logs Insights query itself and brings back results.

If you've built an MCP server from scratch, you already understand the protocol. AWS MCP Server is the official, AWS-maintained collection of MCP servers for AWS services, published at awslabs/mcp on GitHub and installable from PyPI.

What Changed at GA

Three things matter compared to pre-GA versions:

IAM condition context keys. Every API call routed through AWS MCP Server now carries aws:ViaAWSMCPService and aws:CalledViaAWSMCP condition keys automatically. Your IAM policies can differentiate agent-initiated calls from human-initiated calls.

Full CloudTrail integration. Every API call goes to CloudTrail. There's a complete audit trail of what the agent did.

Separate CloudWatch namespace. Metrics published under AWS-MCP let you monitor how much of your API traffic comes from agents versus direct calls.

The practical upshot: you can now enforce different IAM permissions for agents and humans while using the same AWS credentials.
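As a sketch of what that could look like, the snippet below builds a policy document that denies a couple of sensitive IAM actions only when the call arrives through AWS MCP Server. The condition key name comes from the list above. The Bool operator, the statement ID, and the choice of actions are assumptions made for this sketch, so check the key's type in the IAM documentation before relying on it.

```python
import json


def deny_via_mcp_policy(actions: list[str]) -> dict:
    """Build an IAM policy that denies the given actions for agent-initiated calls."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySensitiveActionsViaMCP",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": actions,
                "Resource": "*",
                # Assumption: the key is boolean-valued, so the Bool operator applies.
                "Condition": {"Bool": {"aws:ViaAWSMCPService": "true"}},
            }
        ],
    }


policy = deny_via_mcp_policy(["iam:DeleteUser", "iam:CreateAccessKey"])
print(json.dumps(policy, indent=2))
```

A human using the same credentials from the console or CLI would not match the condition, so the deny would apply only to agent traffic.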

Installation: One Line with uvx

I installed and ran both servers. Here is what it takes.

# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Run CloudWatch MCP server (creates isolated env automatically)
uvx awslabs.cloudwatch-mcp-server@latest

# Run IAM MCP server
uvx awslabs.iam-mcp-server@latest

uvx handles the virtual environment. First run pulls 53 packages for the CloudWatch server — botocore, pandas, scipy, statsmodels, and more. The reason for scipy and statsmodels is that the CloudWatch server includes built-in anomaly detection and statistical analysis on metrics, not just passthrough queries.

Installed versions:

awslabs.cloudwatch-mcp-server v0.1.1
awslabs.iam-mcp-server v1.0.20

The 0.x version on the CloudWatch server signals the API is still stabilizing. That is worth keeping in mind before putting it in production workflows.

Wiring It Into Claude Code (.mcp.json)

Put this in your project root:

{
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "AWS_REGION": "ap-northeast-1",
        "AWS_PROFILE": "default",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    },
    "iam": {
      "command": "uvx",
      "args": ["awslabs.iam-mcp-server@latest"],
      "env": {
        "AWS_REGION": "ap-northeast-1",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    }
  }
}

Set FASTMCP_LOG_LEVEL to WARNING. Without it, INFO logs bleed into the agent's responses. You can also register a server from the Claude Code CLI, for example: claude mcp add cloudwatch -- uvx awslabs.cloudwatch-mcp-server@latest.

CloudWatch MCP Server: 31 Tools

When the server starts, it registers exactly 31 tools. Here are the log group and metrics tools, which account for 19 of them.

Log group tools (8):

describe_log_groups             List log groups
analyze_log_group               AI-powered log pattern analysis
execute_log_insights_query      Run a Logs Insights query
get_logs_insight_query_results  Poll query results
cancel_logs_insight_query       Cancel a running query
execute_cwl_insights_batch      Batch query execution
recommend_indexes_loggroup      Index recommendations for a log group
recommend_indexes_account       Account-wide index recommendations

Metrics tools (11):

get_metric_data                Fetch metric data points
get_metric_metadata            Metadata lookup (1,179 entries indexed at startup)
analyze_metric                 Anomaly detection on a metric
get_recommended_metric_alarms  Suggest alarm thresholds
execute_promql_query           Run a PromQL query
execute_promql_range_query     PromQL range query
get_promql_label_values        PromQL label values
get_promql_series              PromQL series
get_promql_labels              PromQL labels
get_active_alarms              List active alarms
get_alarm_history              Alarm state history

The get_metric_metadata detail is worth noting. At startup, the server loads and indexes 1,179 metric metadata entries covering EC2, Lambda, RDS, DynamoDB, and most other AWS services. The server logs show it explicitly:

INFO | Loaded 1179 metric metadata entries
INFO | Successfully indexed 1179 metric metadata entries

This is what allows the agent to answer "which metric measures Lambda cold start duration?" without hitting the AWS docs.

What I Found on My Account

I ran this against a real ap-northeast-1 account. The output:

Available log groups (5):
  /aws/lambda/remotax-renewal-fe-CustomCDKBucketDeployment: 331,695 bytes
  /aws/lambda/remotax-renewal-fe-CustomS3AutoDeleteObjects:   2,038 bytes
  /aws/lambda/remotax-renewal-fe-LambdaServerFunctionHandler:     0 bytes
  /aws/lambda/remotax-renewal-fe-LogRetentionaae0aa3c5b4d4f:     0 bytes
  RDSOSMetrics: 55,192,669 bytes

Active CloudWatch Alarms:
  OK    EC2-HighCPU-Alarm
        CPUUtilization >= 80% | Currently: OK
  ?     EC2-HighDiskUsage-Alarm
        disk_used_percent >= 80% | INSUFFICIENT_DATA
  ?     EC2-HighMemoryUsage-Alarm
        mem_used_percent >= 80% | INSUFFICIENT_DATA
  ?     LaravelErrorAlarm
        LaravelErrorCount >= 1 | INSUFFICIENT_DATA

EC2 metrics available: 85

Three alarms are sitting in INSUFFICIENT_DATA. Disk and memory alarms with no data usually mean the CloudWatch agent is not running, or is misconfigured, on those EC2 instances. That is the kind of silent infrastructure problem that usually only surfaces when an alert should fire and doesn't. The agent picked it up immediately.
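The same check can be scripted outside the agent. The sketch below is a minimal illustration, not what the MCP server does internally: `alarms_without_data` and `FakeCloudWatch` are hypothetical names, and the fake client exists only so the example runs without credentials. With real credentials you would pass `boto3.client("cloudwatch")` instead. It reads a single page of results, so an account with many alarms would need pagination.

```python
from typing import Any, Dict, List


def alarms_without_data(cloudwatch: Any) -> List[Dict[str, str]]:
    """Return name and metric for every alarm stuck in INSUFFICIENT_DATA.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    """
    # describe_alarms accepts a StateValue filter: OK, ALARM or INSUFFICIENT_DATA
    page = cloudwatch.describe_alarms(StateValue="INSUFFICIENT_DATA")
    return [
        {"name": alarm["AlarmName"], "metric": alarm.get("MetricName", "")}
        for alarm in page.get("MetricAlarms", [])
    ]


class FakeCloudWatch:
    """Stand-in client (hypothetical) that mirrors the alarm list shown above."""

    _ALARMS = [
        {"AlarmName": "EC2-HighCPU-Alarm", "MetricName": "CPUUtilization", "StateValue": "OK"},
        {"AlarmName": "EC2-HighDiskUsage-Alarm", "MetricName": "disk_used_percent", "StateValue": "INSUFFICIENT_DATA"},
        {"AlarmName": "EC2-HighMemoryUsage-Alarm", "MetricName": "mem_used_percent", "StateValue": "INSUFFICIENT_DATA"},
        {"AlarmName": "LaravelErrorAlarm", "MetricName": "LaravelErrorCount", "StateValue": "INSUFFICIENT_DATA"},
    ]

    def describe_alarms(self, StateValue: str) -> Dict[str, Any]:
        matching = [a for a in self._ALARMS if a["StateValue"] == StateValue]
        return {"MetricAlarms": matching}


silent = alarms_without_data(FakeCloudWatch())
for alarm in silent:
    print(alarm["name"], alarm["metric"])
```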

IAM MCP Server: 29 Tools and the Security Architecture That Matters

The IAM server ships 29 tools, including:

list_users / get_user / create_user / delete_user
list_roles / create_role
list_policies / get_managed_policy_document
attach_user_policy / detach_user_policy
create_access_key / delete_access_key
simulate_principal_policy    ← the important one
list_groups / create_group / delete_group
add_user_to_group / remove_user_from_group
put_role_policy / get_role_policy / delete_role_policy

I find simulate_principal_policy the most useful. It checks whether an IAM principal can perform specific actions without actually making those API calls. After reading about MCP ecosystem security vulnerabilities and 30 CVEs, having agents pre-validate their permissions before executing is a meaningful safety step.

Test run against my account:

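import boto3

# IAM is a global service; credentials come from the default boto3 credential chain
iam = boto3.client('iam')

response = iam.simulate_principal_policy(
    PolicySourceArn='arn:aws:iam::370193714718:user/remotax-fe',
    ActionNames=[
        'cloudwatch:DescribeAlarms',
        'logs:DescribeLogGroups',
        'iam:ListUsers',
        's3:ListBuckets'
    ],
    ResourceArns=['*']
)

# Print the simulated decision for each action
for result in response['EvaluationResults']:
    mark = '✓' if result['EvalDecision'] == 'allowed' else '✗'
    print(f"{mark} {result['EvalActionName']}: {result['EvalDecision']}")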

# Results:
# ✓ cloudwatch:DescribeAlarms: allowed
# ✓ logs:DescribeLogGroups: allowed
# ✓ iam:ListUsers: allowed
# ✓ s3:ListBuckets: allowed
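
For an agent that pre-validates its permissions, the raw response is less useful than a simple allowed/denied split. The helper below is a sketch under that assumption: `split_decisions` is a hypothetical name, and `sample` is a hand-written fragment shaped like the boto3 return value, not captured output.

```python
from typing import Any, Dict, List


def split_decisions(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group simulated actions by whether the decision was 'allowed'."""
    grouped: Dict[str, List[str]] = {"allowed": [], "denied": []}
    for result in response.get("EvaluationResults", []):
        # EvalDecision is one of: allowed, explicitDeny, implicitDeny
        key = "allowed" if result["EvalDecision"] == "allowed" else "denied"
        grouped[key].append(result["EvalActionName"])
    return grouped


# Hand-written sample (hypothetical), shaped like a simulate_principal_policy response
sample = {
    "EvaluationResults": [
        {"EvalActionName": "logs:DescribeLogGroups", "EvalDecision": "allowed"},
        {"EvalActionName": "iam:CreateUser", "EvalDecision": "explicitDeny"},
        {"EvalActionName": "iam:DeleteUser", "EvalDecision": "implicitDeny"},
    ]
}

decisions = split_decisions(sample)
print(decisions)
```

An agent could refuse to start a workflow whenever the `denied` list contains an action the plan depends on.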

The Condition Key Architecture

This is the part I think matters most about the GA release. Every API call through AWS MCP Server automatically carries:

aws:ViaAWSMCPService — marks this as a request via an MCP service
aws:CalledViaAWSMCP — marks this as originating from an MCP client

An IAM deny policy using these keys:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "iam:CreateUser",
        "iam:DeleteUser",
        "iam:AttachUserPolicy"
      ],
      "Resource": "*",
      "Condition": {
        "Bool": {
          "aws:ViaAWSMCPService": "true"
        }
      }
    }
  ]
}

With this policy, a human using the AWS console can manage IAM users. Claude Code using the same credentials cannot. Same key pair, different effective permissions. When I was implementing Tool Use in the Claude Agent SDK, I had to build agent permission scoping into application logic. AWS is solving that at the infrastructure layer here.
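
If you maintain several of these deny statements, generating them from a list of actions keeps them consistent. This is a minimal sketch: `mcp_deny_policy` is a hypothetical helper, and the action list is only an example. It reuses the `aws:ViaAWSMCPService` condition from the policy above and does not attach anything to an account.

```python
import json
from typing import Any, Dict, List


def mcp_deny_policy(actions: List[str]) -> Dict[str, Any]:
    """Build a deny policy that applies only to requests made via the MCP service."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
                # Same condition key as the hand-written policy above
                "Condition": {"Bool": {"aws:ViaAWSMCPService": "true"}},
            }
        ],
    }


# Example action list (an assumption; choose the write operations you want to block)
policy = mcp_deny_policy(
    ["iam:CreateUser", "iam:DeleteUser", "iam:AttachUserPolicy", "iam:CreateAccessKey"]
)
print(json.dumps(policy, indent=2))
```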

Architecture

Three layers: coding agent → AWS MCP Server (stdio) → AWS API (SigV4 auth). Every AWS API call goes to CloudTrail. Metrics land in the AWS-MCP CloudWatch namespace separately from direct human calls.

Available AWS MCP Servers

Server	Purpose	Version
`awslabs.cloudwatch-mcp-server`	Logs, Metrics, Alarms	v0.1.1
`awslabs.iam-mcp-server`	IAM management	v1.0.20
`awslabs.aws-api-mcp-server`	Any AWS API	separate
CloudWatch Application Signals	APM/SLO monitoring	separate
AWS Network MCP Server	VPC/network diagnostics	separate
AWS Pricing MCP Server	Cost estimation	separate
EKS MCP Server	EKS cluster management	separate

The aws-api-mcp-server is interesting. It exposes every AWS API through a single tool. When I built a FastMCP-based MCP server, each API endpoint needed its own tool definition. The aws-api-mcp-server flips that: one tool, all APIs. The trade-off is that the agent needs more context to figure out which API to call.

Honest Assessment — What Works, What Doesn't

What I find genuinely useful:

The IAM condition key separation is real. If you've been hesitant to give agents AWS access because you can't restrict them beyond the IAM user's permissions, this changes the calculation. You can attach aws:ViaAWSMCPService deny statements to enforce read-only agent access while keeping full human access with the same credentials.

PromQL support surprised me. CloudWatch supports PromQL for Container Insights metrics, and the MCP server exposes it. If you run Kubernetes on EKS and already write PromQL, you can use that syntax directly through the agent.

The 1,179-entry metric metadata index means the agent can reason about AWS services it has never seen before in your specific account. It knows what metrics EC2, Lambda, RDS, and most other services expose without needing to query AWS each time.

What gives me pause:

CloudWatch server at v0.1.1. The AI analysis tools like analyze_log_group and analyze_metric look promising but I have not stress-tested them. A 0.x version in production tooling warrants caution.

Logs Insights cost. CloudWatch charges for scanned log data in Insights queries. An agent with unconstrained query access could run up meaningful charges. There are no cost guardrails at the tool level — that has to be managed at the IAM level (restricting query scope) or through agent instructions.

create_access_key in the IAM server. An agent tool that creates new AWS access keys is, by default, accessible. The condition key approach can block it, but you have to set that up deliberately. I would not wire up the IAM server in a production environment without first adding explicit deny policies for the write operations.

My recommendation: start with cloudwatch-mcp-server in read-heavy workflows. Treat the IAM server as a development tool until you have the deny policies in place.

Getting Started

If AWS credentials are configured:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Test immediately
uvx awslabs.cloudwatch-mcp-server@latest

# Add to a project
cat > .mcp.json << 'EOF'
{
  "mcpServers": {
    "cloudwatch": {
      "command": "uvx",
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "env": {
        "AWS_REGION": "us-east-1",
        "FASTMCP_LOG_LEVEL": "WARNING"
      }
    }
  }
}
EOF

Official docs: awslabs.github.io/mcp. Source: github.com/awslabs/mcp. Free to use — you pay only for the AWS resources the agent touches.

AI agents having console-level visibility into AWS infrastructure is coming regardless. AWS MCP Server GA is the first production-ready step in that direction.

MemRL: Self-Evolving Agents via Episodic Memory RL

Jangwook Kim — Thu, 14 May 2026 04:16:52 +0000

There is a gap in how most AI agents handle experience. They reason well from the start, but they don't get smarter from what they do. Fine-tuning closes that gap, but it's expensive, slow, and prone to catastrophic forgetting. RAG-based memory is cheaper, but it retrieves by similarity — not by whether a past strategy actually worked.

MemRL, published on arXiv in January 2026, proposes a different approach: apply reinforcement learning directly to episodic memory at runtime, without touching model weights. The result is an agent that improves through trial and error, storing structured experiences and learning which ones to prioritize based on real task outcomes.

This guide breaks down how MemRL works, what the benchmarks show, and how the core mechanism looks in practice — including a minimal reproduction Effloow Lab ran to verify the concept.

The Problem MemRL Solves

Current agent memory systems face a fundamental tradeoff. On one end, fine-tuning embeds knowledge directly into model weights — but requires expensive compute, labeled data, and still risks overwriting previously learned behavior (catastrophic forgetting). On the other end, RAG-style retrieval keeps knowledge external, making it cheap to update. But standard RAG retrieves by semantic similarity alone. It surfaces documents that look similar to the current query, not documents associated with strategies that previously worked.

This is the stability-plasticity dilemma: agents either freeze their knowledge (stable but rigid) or update it continuously (plastic but forgetful). MemRL's claim is that this tradeoff is a false choice — you can have a frozen LLM backbone (stable) with an external memory that evolves through RL feedback (plastic).

What MemRL Is

MemRL (arXiv:2601.03192, from MemTensor, updated February 2026) is a non-parametric framework that enables agents to self-evolve through runtime reinforcement learning on episodic memory. The LLM's weights never change. Instead, MemRL maintains a structured external memory, refines it based on task outcomes, and uses a two-phase retrieval mechanism to surface the most useful experiences — not just the most similar ones.

The open-source code is available at MemTensor/MemRL, with support for ALFWorld, BigCodeBench, HLE, and Lifelong Agent Bench benchmarks.

The Intent-Experience-Utility Triplet

The core data structure in MemRL is not a document. It's a triplet:

Intent: the task or query the agent was addressing
Experience: the specific action trajectory or solution strategy used
Utility (Q-value): a learned score representing how successful that experience was

Where RAG stores raw text and retrieves by embedding similarity, MemRL stores structured (intent, experience, Q-value) records. The Q-value is not fixed at write time — it evolves as the agent receives environmental feedback across episodes.

This distinction matters. Two experiences with similar intents might have very different Q-values if one led to a successful outcome and the other failed. RAG can't distinguish these. MemRL can.
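
A small sketch makes the difference concrete. The `MemoryRecord` class and the two sample records are illustrative assumptions, not the data model from the MemRL repository. Both records share the same intent, so similarity alone cannot separate them; the Q-value can.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    """One intent-experience-utility triplet (field names are assumptions)."""

    intent: str
    experience: str
    q_value: float = 0.5  # arbitrary starting utility for this sketch


# Two made-up records with identical intents but different outcomes
records = [
    MemoryRecord("sort a list of numbers", "call the built-in sorted()", q_value=0.8),
    MemoryRecord("sort a list of numbers", "hand-written bubble sort", q_value=0.2),
]

# Similarity ties; utility breaks the tie
best = max(records, key=lambda record: record.q_value)
print(best.experience)
```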

How Two-Phase Retrieval Works

When an agent faces a new task, MemRL retrieves relevant past experiences in two stages:

Phase A — Semantic Filter: The agent computes similarity between the current intent and all stored intents using dense embeddings. The top-k candidates (by semantic relevance) are kept. This narrows the search to experiences that are topically related to the current task.

Phase B — Q-Value Ranking: Among those filtered candidates, MemRL re-ranks by Q-value. Experiences with higher utility — those associated with successful outcomes — rise to the top. The agent retrieves the highest-Q candidates and uses them as in-context guidance for the current task.

The paper describes Phase A as analogical transfer (retrieving similar past events) and Phase B as mental rehearsal (selecting strategies proven to work). Together, they avoid the main failure mode of pure RAG: retrieving semantically similar but strategically useless memories.

Q-Value Learning: The RL Mechanism

After the agent completes a task using retrieved memories, it receives a reward signal from the environment — success, partial success, or failure. MemRL applies a Monte Carlo-style update to the Q-value of the used memory:

Q_new = Q_old + α × (reward - Q_old)

Where α is the learning rate. Positive outcomes increase the Q-value; failures decrease it. Over many episodes, Q-values diverge: experiences associated with reliable strategies accumulate higher scores, while noise and failed attempts are downweighted.
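
As a worked example of the update rule, the sketch below applies it three times to a single memory. The starting value of 0.5, the learning rate of 0.3, and the reward sequence are illustrative choices, not values taken from the paper.

```python
def update_q(q_old: float, reward: float, alpha: float = 0.3) -> float:
    """Apply Q_new = Q_old + alpha * (reward - Q_old)."""
    return q_old + alpha * (reward - q_old)


q = 0.5  # illustrative starting utility
history = []
for reward in [1.0, 1.0, 0.0]:  # two successes, then one failure
    q = update_q(q, reward)
    history.append(round(q, 4))

print(history)  # [0.65, 0.755, 0.5285]
```

Each update moves the Q-value a fraction α of the way toward the latest reward, so one failure pulls the score down without erasing the earlier successes.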

The entire optimization loop runs outside the LLM. No gradient computation, no retraining. The LLM reasons over whatever context it's given — MemRL just gets better at deciding what to put in that context.

Effloow Lab PoC: Core Mechanism in Python

Effloow Lab ran a minimal reproduction of the IEU triplet and two-phase retrieval to verify the concept. Full repo installation requires ALFWorld and LLM credentials, so this PoC uses word-overlap similarity instead of dense embeddings — a known limitation documented in the lab run.

import math

class SimpleMemRL:
    def __init__(self, top_k_semantic=5, top_k_q=2):
        self.memory = []
        self.top_k_semantic = top_k_semantic
        self.top_k_q = top_k_q

    def _cosine_sim(self, a, b):
        # word-overlap proxy for embeddings (sandbox limitation)
        set_a = set(a.lower().split())
        set_b = set(b.lower().split())
        if not set_a or not set_b:
            return 0.0
        return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))

    def write(self, intent, experience, initial_q=0.5):
        self.memory.append({'intent': intent, 'experience': experience, 'q': initial_q})

    def update_q(self, intent, reward):
        alpha = 0.3
        for m in self.memory:
            if m['intent'] == intent:
                m['q'] += alpha * (reward - m['q'])

    def retrieve(self, query_intent):
        if not self.memory:
            return []
        # Phase A: semantic filter
        scored = [(self._cosine_sim(query_intent, m['intent']), m) for m in self.memory]
        scored.sort(key=lambda x: x[0], reverse=True)
        candidates = [m for _, m in scored[:self.top_k_semantic]]
        # Phase B: Q-value ranking
        candidates.sort(key=lambda m: m['q'], reverse=True)
        return candidates[:self.top_k_q]

Running this with a small set of coding strategy memories, then applying positive feedback to sort-related experiences and negative feedback to a debugging strategy, produced the expected result: sort strategies rose to Q≈0.62, while the debugging entry dropped to Q≈0.24. Subsequent queries for sorting tasks surfaced the higher-Q memories first.

The key limitation observed: word-overlap similarity doesn't capture semantic equivalence well, which caused some retrieval mismatches. Real MemRL uses dense embeddings (e.g., OpenAI text-embedding models or similar), resolving this. Full lab-run details and output are in data/lab-runs/memrl-self-evolving-agents-episodic-memory-rl-guide-2026.md.
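
For comparison, this is what Phase A scoring looks like over dense vectors instead of word sets. It is a sketch only: the three-dimensional vectors are made-up stand-ins for real embeddings, and a production setup would get them from an embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors (made up) standing in for embedded intents
query = [0.9, 0.1, 0.0]
sort_memory = [0.8, 0.2, 0.1]
debug_memory = [0.0, 0.2, 0.9]

print(round(cosine_similarity(query, sort_memory), 3))
print(round(cosine_similarity(query, debug_memory), 3))
```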

Benchmark Results

The paper benchmarks MemRL across these tasks:

Benchmark	MemRL (Last Acc.)	MemP Baseline	No-Memory Baseline	Key Gain
ALFWorld	0.507	0.324	0.278	+56% over MemP
HLE	0.573	0.528	—	+8.5% over MemP
BigCodeBench	0.508	0.494	—	+2.8% over MemP
Lifelong Agent Bench	0.697 CSR	—	—	Best overall
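
The "Key Gain" column is a relative improvement over MemP, not a difference in percentage points. The short check below recomputes it from the scores in the table; `relative_gain` is a helper name introduced here for illustration.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (MemRL, MemP) scores copied from the table above
scores = {
    "ALFWorld": (0.507, 0.324),
    "HLE": (0.573, 0.528),
    "BigCodeBench": (0.508, 0.494),
}

gains = {name: relative_gain(memrl, memp) for name, (memrl, memp) in scores.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

ALFWorld comes out at roughly +56.5%, which the table shows as +56%; the other two rows match to one decimal place.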

The gains are largest on ALFWorld and Lifelong Agent Bench — multi-step sequential tasks where memory utility accumulates across episodes. BigCodeBench shows smaller gains because it's primarily single-turn: there's less opportunity for multi-episode Q-value refinement when each task is independent.

This pattern is important. MemRL's value is proportional to how much your agent loops over time. If your agent handles isolated, one-shot queries, you won't see ALFWorld-level improvements.

MemRL vs Traditional RAG

MemRL Strengths
<ul>
  <li>Learns from success/failure — not just semantic match</li>
  <li>No model fine-tuning required — frozen LLM backbone</li>
  <li>Q-values suppress noise and bad strategies over time</li>
  <li>Improves within a session and across sessions (transfer)</li>
  <li>Open-source with multi-benchmark validation</li>
</ul>


Where It Lags
<ul>
  <li>Needs an environmental feedback signal — not always available</li>
  <li>Less useful for purely one-shot tasks without episode loops</li>
  <li>Q-value cold start: early episodes have unrefined utility scores</li>
  <li>More complex to set up than a standard RAG pipeline</li>
</ul>

The underlying difference is what retrieval optimizes for. RAG finds memories that are similar. MemRL finds memories that are similar and proved useful. For long-running agents where failure has a cost — home automation, coding assistants, planning agents — this distinction is meaningful.

The Tempera MCP Server

A community implementation called Tempera applies MemRL concepts to AI coding workflows via Model Context Protocol (MCP). Tempera captures coding sessions as episodes, indexes them for semantic search, and uses RL to surface the most valuable memories at query time. All projects share a common memory database stored under ~/.tempera/, enabling cross-project learning — a direct practical application of the MemRL architecture.

This matters for developers already using MCP-compatible tools: Tempera is one path to experimenting with MemRL ideas without implementing the full research framework.

How to Get Started with MemRL

For developers interested in running the actual MemRL benchmarks, the setup flow is:

# 1. Clone the repo
git clone https://github.com/MemTensor/MemRL
cd MemRL

# 2. Create environment (Python 3.10 required)
conda create -n memrl python=3.10
conda activate memrl

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure LLM + embedding settings in configs/
# (YAML files per benchmark)

# 5. Run a benchmark runner
python memrl/run/alfworld_rl_runner.py

Results write to logs/ and results/ directories. The configs/ directory controls which LLM and embedding model you use — the paper uses frontier models but the code supports swapping these.

Full environment setup for ALFWorld requires additional installation steps documented in the repo's README.

Practical Implications for Agent Developers

MemRL's ideas translate to a few concrete questions worth asking about any agent system:

Does your agent run repeatedly over similar tasks? If yes, runtime Q-value learning could improve retrieval quality. If your agent handles purely isolated requests, the benefit is limited.

What's your feedback signal? MemRL needs a reward — task success, user rating, test pass/fail, something. Agents that get no structured outcome signal can't update Q-values. Designing a feedback loop is a prerequisite, not an afterthought.

Are you fighting retrieval noise? If your RAG-based memory system frequently surfaces semantically similar but strategically useless memories, MemRL's Phase B filtering is directly relevant. The Q-value layer exists precisely to downweight experiences that match the query but don't help.

Do you need to avoid retraining? MemRL's strongest argument is that agents can improve without compute-intensive fine-tuning cycles. For teams running agents at scale where fine-tuning is prohibitively expensive, this is a meaningful alternative.

Q: How is MemRL different from Reflexion or Voyager?

Reflexion stores verbal self-reflection notes in memory. Voyager builds a skill library. MemRL is distinct in applying Q-value learning to determine which stored experiences to retrieve. Reflexion and Voyager still rely on recency or semantic matching; MemRL's retrieval is utility-driven.

Q: Can MemRL work with any LLM?

Yes — the LLM backbone is frozen. MemRL is agnostic to the underlying model. The paper runs experiments with frontier models, but the memory and retrieval mechanism is entirely external to the LLM's weights.

Q: What happens if the reward signal is noisy?

Noisy rewards are a known challenge in RL. The paper applies Monte Carlo-style updates (averaging over episodes), which provide some robustness, but highly noisy reward signals will produce unreliable Q-values. The quality of MemRL's learning is bounded by the quality of the feedback signal.

Q: Does MemRL require embeddings?

Yes, Phase A requires dense vector similarity. The sandbox PoC used word-overlap as a proxy, but real MemRL uses embedding models to compute semantic similarity between stored intents and current queries. Any embedding model compatible with your stack works.

Key Takeaways

MemRL addresses a genuine gap: the cost of fine-tuning versus the limitations of static retrieval. Its approach — structure memory as IEU triplets, filter by semantics, rank by learned Q-values, update Q-values from task outcomes — is conceptually clean and benchmarked across four tasks.

The gains are largest for multi-step, episodic tasks (ALFWorld: +56% over MemP) and more modest for single-turn workloads (BigCodeBench: +2.8%). The framework needs a feedback signal, and Q-values start uninformed — so there's a cold-start cost on early episodes.

For teams building agents that loop repeatedly over tasks, interact with real environments, and can capture task success as a signal, MemRL is a well-evidenced alternative to both fine-tuning and standard RAG. The code is open, the benchmarks are public, and the Tempera MCP server offers a path to experimenting without setting up the full research framework.

Bottom Line

MemRL is one of the more rigorous proposals for non-parametric agent learning published in early 2026. If you're running agents that repeat tasks and can capture feedback, the two-phase retrieval mechanism is worth understanding — and the open-source code makes it possible to test on your own benchmarks without writing the RL layer from scratch.

OpenAI Realtime Audio API: Voice Agents Guide 2026

Jangwook Kim — Thu, 14 May 2026 00:12:11 +0000

On May 7, 2026, OpenAI quietly made voice agents production-viable. Three new realtime audio models landed in the API at the same time: GPT-Realtime-2 (voice with GPT-5-class reasoning), GPT-Realtime-Translate (live speech-to-speech translation across 70+ languages), and GPT-Realtime-Whisper (streaming speech-to-text billed by the minute). Each model has its own pricing, endpoint, and use-case fit.

If you have been waiting for a stable, production-ready voice API before building, the wait is over. This guide walks through what each model does, how to connect to the API, what it costs, and the production patterns that separate a working demo from a robust voice agent.

Effloow Lab inspected the Realtime API protocol and validated client-side event structures locally as part of this article's research. Full live testing requires an OpenAI API key; where relevant, we note what we verified and what we did not.

Why This Release Matters

Previous versions of the Realtime API required working around a 32K-token context ceiling, managing your own speech-to-text pipeline, and accepting that the model would sometimes lose the thread of a long conversation. GPT-Realtime-2 removes these constraints:

Context window expanded to 128K tokens — four times the previous limit, enough for multi-turn conversations spanning tens of minutes
GPT-5-class reasoning integrated directly — the model can call tools, reason through steps, and respond, all without leaving the audio stream
Three specialized models instead of one general voice model, each optimized for a specific cost-performance point

The split into three models is also a pricing move. If you only need transcription, GPT-Realtime-Whisper at $0.017/minute is dramatically cheaper than running voice inference at $32/1M tokens. Choose the right model and you can cut costs by 80–90% relative to using GPT-Realtime-2 for everything.

Model	Purpose	Pricing	Context
gpt-realtime-2	Voice reasoning agent	$32/1M input · $64/1M output tokens	128K tokens
gpt-realtime-translate	Live speech translation	$0.034/min	Translation-only
gpt-realtime-whisper	Streaming transcription	$0.017/min	STT-only

GPT-Realtime-2: Voice Reasoning for Production Agents

GPT-Realtime-2 is the flagship of the trio. It brings GPT-5-level intelligence into the audio stream: the model can reason through multi-step requests, call functions, handle tool results, and continue speaking — all without pausing the conversation for a round trip to a separate text model.

How audio tokens are billed

OpenAI encodes audio duration into tokens rather than sampling audio at a fixed rate. The billing math is:

User speech (input): 1 token per 100 ms of audio → 600 tokens per minute
Model response (output): 1 token per 50 ms of audio → 1,200 tokens per minute

Billing one full minute of user speech plus one full minute of model speech (a conservative figure, since in a balanced call each side fills only about half of every minute):

Input cost:  600 tokens × ($32 / 1,000,000) = $0.0192 / min
Output cost: 1,200 tokens × ($64 / 1,000,000) = $0.0768 / min
Total uncached: ~$0.096 / min (~$5.76 / hour)

With prompt caching applied to system instructions and persistent session context, real-world costs can drop to roughly $0.05–$0.10/min according to third-party production estimates published by OpenAI partners.
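
The billing arithmetic is easy to parameterize by how long each side actually speaks. The function below is a sketch built only from the rates quoted above; it ignores caching, text tokens, and any re-billing of earlier context, and `audio_cost_usd` is a name introduced for illustration.

```python
INPUT_MS_PER_TOKEN = 100     # user speech: 1 token per 100 ms
OUTPUT_MS_PER_TOKEN = 50     # model speech: 1 token per 50 ms
INPUT_USD_PER_M = 32.0       # USD per 1M input tokens
OUTPUT_USD_PER_M = 64.0      # USD per 1M output tokens


def audio_cost_usd(user_seconds: float, model_seconds: float) -> float:
    """Uncached audio-token cost for the given amounts of speech."""
    input_tokens = user_seconds * 1000 / INPUT_MS_PER_TOKEN
    output_tokens = model_seconds * 1000 / OUTPUT_MS_PER_TOKEN
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


full_minute_each = audio_cost_usd(60, 60)   # matches the $0.096 figure above
half_minute_each = audio_cost_usd(30, 30)   # each side speaks half of one minute

print(f"{full_minute_each:.4f} {half_minute_each:.4f}")
```

The second call shows how sensitive the estimate is to talk time: if each side speaks for 30 seconds of a given minute, the audio-token cost for that minute is half the headline number.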

Connecting via WebSocket

The Realtime API uses a persistent WebSocket connection. Every interaction is modeled as an exchange of typed JSON events — the client sends events, the server sends events back. Effloow Lab validated that the client-side event structures serialize and round-trip correctly in Python:

import asyncio
import json
import websockets

OPENAI_API_KEY = "sk-..."  # your key

async def voice_agent_session():
    uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    async with websockets.connect(uri, additional_headers=headers) as ws:
        # 1. Configure the session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "alloy",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "lookup_order",
                        "description": "Look up a customer order by ID",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "order_id": {"type": "string"}
                            },
                            "required": ["order_id"]
                        }
                    }
                ],
                "tool_choice": "auto"
            }
        }))

        # 2. Stream audio (PCM16, 24kHz, base64-encoded chunks)
        # await ws.send(json.dumps({
        #     "type": "input_audio_buffer.append",
        #     "audio": base64_chunk
        # }))

        # 3. Listen for server events
        async for raw_msg in ws:
            event = json.loads(raw_msg)
            event_type = event.get("type", "")

            if event_type == "response.audio.delta":
                # stream audio bytes to speaker
                pass
            elif event_type == "response.function_call_arguments.done":
                # handle tool call, then send result back
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": json.dumps({"order_status": "shipped"})
                    }
                }))
                await ws.send(json.dumps({"type": "response.create"}))

asyncio.run(voice_agent_session())

The OpenAI Agents Python SDK (openai-agents) wraps this pattern into a higher-level RealtimeAgent class if you prefer avoiding raw WebSocket management. The underlying transport is the same.
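
Step 2 in the listing above is commented out. A minimal sketch of what it could look like: slice raw PCM16 audio into fixed-size chunks and wrap each one in an `input_audio_buffer.append` event. The 100 ms chunk size, the mono assumption, and the `audio_append_events` name are choices made for this example, not requirements stated by the API.

```python
import base64
import json
from typing import Iterator

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # PCM16, assuming a single (mono) channel


def audio_append_events(pcm: bytes, chunk_ms: int = 100) -> Iterator[str]:
    """Yield JSON-encoded append events, one per chunk of raw PCM16 audio."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# One second of silence as a stand-in for microphone input
silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
events = list(audio_append_events(silence))
print(len(events))  # 10
```

Inside the session coroutine, each yielded string would be passed to `ws.send(...)`.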

Tool calls mid-conversation

GPT-Realtime-2 can call functions while speaking. The agent does not stop talking and wait — it continues the audio stream with a phrase like "Let me look that up" while dispatching the tool call in parallel. When the result arrives, it folds it into the ongoing response. This pattern is what makes GPT-Realtime-2 meaningfully different from a text model with TTS bolted on.

Interruption handling

Voice activity detection (VAD) is built in when you set turn_detection.type = "server_vad". When the user starts speaking mid-response, the API sends a response.cancelled event, truncates the current audio output, and starts a new inference cycle. The 128K context window means the model retains everything said before the interruption without a context reset.

Three things to get right in production:

VAD threshold (threshold: 0.5 in the example above) — lower values detect softer speech but increase false triggers in noisy environments. Tune per your deployment channel (phone line vs browser microphone vs call center headset).
Silence duration (silence_duration_ms) — how long a pause triggers end-of-turn. 500ms works for conversational speech; customer support scripts may need 700–1000ms.
Barge-in state management on your server — when response.cancelled fires, flush any queued tool results from the cancelled turn or you'll deliver stale data to the next response cycle.
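
The third point can be sketched as a small piece of server-side state. Everything here is illustrative: `BargeInState` is a hypothetical class, and the `response_id` field on the cancel event is an assumed name that should be checked against the actual event payload.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List


class BargeInState:
    """Tracks tool results per response so a cancelled turn can be discarded."""

    def __init__(self) -> None:
        self._pending: DefaultDict[str, List[Any]] = defaultdict(list)

    def queue_tool_result(self, response_id: str, result: Any) -> None:
        self._pending[response_id].append(result)

    def handle_event(self, event: Dict[str, Any]) -> int:
        """Drop queued results when a turn is cancelled; return how many were dropped."""
        if event.get("type") != "response.cancelled":
            return 0
        # "response_id" is an assumed field name for this sketch
        return len(self._pending.pop(event.get("response_id", ""), []))

    def results_for(self, response_id: str) -> List[Any]:
        return list(self._pending.get(response_id, []))


state = BargeInState()
state.queue_tool_result("resp_1", {"order_status": "shipped"})
state.queue_tool_result("resp_2", {"order_status": "delayed"})
dropped = state.handle_event({"type": "response.cancelled", "response_id": "resp_1"})
print(dropped, state.results_for("resp_1"), state.results_for("resp_2"))
# 1 [] [{'order_status': 'delayed'}]
```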

GPT-Realtime-Translate: Live Speech-to-Speech Translation

GPT-Realtime-Translate is a single-purpose model trained on thousands of hours of professional interpreter audio. It takes live speech in any of 70+ input languages, detects the source language automatically, and returns translated speech plus text transcripts in one of 13 output languages.

Target output languages as of May 2026: Spanish, Portuguese, French, Japanese, Russian, Chinese, German, Korean, Hindi, Indonesian, Vietnamese, Italian, and English.

The dedicated endpoint is /v1/realtime/translations:

uri = "wss://api.openai.com/v1/realtime/translations"

session_config = {
    "type": "session.update",
    "session": {
        "output_language": "ja",   # target language code
        # source language is auto-detected
        "voice": "alloy"
    }
}

You stream 24 kHz PCM16 audio into input_audio_buffer.append exactly as you would with GPT-Realtime-2. The model processes input audio while simultaneously streaming translated audio back, which keeps perceived latency low over continuous speech.

Unlike a general-purpose voice model, GPT-Realtime-Translate will not answer questions or carry on conversation. It is translation-only by design. If a user asks "what time is it?" in French and your output language is English, the model translates the question into English — it does not answer it. Build a routing layer in front if your product needs both translation and reasoning.

At $0.034/minute, a one-hour multilingual support call costs $2.04 in translation charges. A 60-minute conference session costs around $60 if each of 30 attendees gets a separate translation stream (30 × $2.04 ≈ $61), which is cheaper than a human interpreter for a short session, and it runs at scale.
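
Per-minute billing makes these estimates a one-line calculation. The sketch below assumes one billed stream per listener, which is how the conference figure works out; `per_minute_cost` is a helper name introduced for illustration.

```python
TRANSLATE_USD_PER_MIN = 0.034  # gpt-realtime-translate rate from the pricing table


def per_minute_cost(rate_usd_per_min: float, minutes: float, streams: int = 1) -> float:
    """Cost of `streams` parallel sessions billed by the minute."""
    return rate_usd_per_min * minutes * streams


support_call = per_minute_cost(TRANSLATE_USD_PER_MIN, 60)
conference = per_minute_cost(TRANSLATE_USD_PER_MIN, 60, streams=30)

print(f"${support_call:.2f} ${conference:.2f}")  # $2.04 $61.20
```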

GPT-Realtime-Whisper: Streaming Speech-to-Text

GPT-Realtime-Whisper is the transcription-only model in the trio. It starts producing text output as the speaker talks rather than waiting for an utterance to finish. This keeps the UI feeling responsive — a transcription bar can update word-by-word instead of appearing in blocks.

Pricing at $0.017/minute makes it among the cheapest options for streaming STT in the OpenAI ecosystem. An eight-hour workday of continuous transcription costs about $8.16.

# Whisper Realtime session uses the standard /v1/realtime endpoint
# with model=gpt-realtime-whisper
uri = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"

# Server returns transcript deltas as speech is detected:
# { "type": "conversation.item.input_audio_transcription.delta", "delta": "Hello, " }
# { "type": "conversation.item.input_audio_transcription.delta", "delta": "can you hear me?" }
# { "type": "conversation.item.input_audio_transcription.completed", "transcript": "Hello, can you hear me?" }

GPT-Realtime-Whisper is the right choice when you need transcription but not inference — meeting recorders, live captioning systems, accessibility tools, voice-search preprocessing, and call analytics pipelines where a separate LLM processes the transcript downstream.
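
On the client side, the delta and completed events shown above can be folded into finished utterances with a few lines. This is a sketch that assumes the event shapes exactly as listed; `collect_transcripts` is a hypothetical helper, and a live UI would also render the partial text as it arrives.

```python
from typing import Any, Dict, Iterable, List

DELTA = "conversation.item.input_audio_transcription.delta"
COMPLETED = "conversation.item.input_audio_transcription.completed"


def collect_transcripts(events: Iterable[Dict[str, Any]]) -> List[str]:
    """Return one string per completed utterance, in arrival order."""
    finished: List[str] = []
    partial = ""
    for event in events:
        kind = event.get("type")
        if kind == DELTA:
            partial += event.get("delta", "")
        elif kind == COMPLETED:
            # Prefer the server's final transcript; fall back to the joined deltas
            finished.append(event.get("transcript", partial))
            partial = ""
    return finished


sample_events = [
    {"type": DELTA, "delta": "Hello, "},
    {"type": DELTA, "delta": "can you hear me?"},
    {"type": COMPLETED, "transcript": "Hello, can you hear me?"},
]
print(collect_transcripts(sample_events))  # ['Hello, can you hear me?']
```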

Practical Application: Choosing the Right Model

The three models are not interchangeable. Use this decision tree:

Does your user need a spoken response from the AI?

Yes, and it involves reasoning, tool calls, or multi-turn logic → gpt-realtime-2
Yes, but it is a direct translation of what another person said → gpt-realtime-translate
No, you only need the text of what the user said → gpt-realtime-whisper

A customer support agent that looks up orders and reads statuses aloud: gpt-realtime-2.
A multilingual conference call platform where each attendee hears their own language: gpt-realtime-translate.
A meeting transcription SaaS that feeds into a separate summarizer: gpt-realtime-whisper.
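
The decision tree above can be sketched as a small routing function. The function name and its two boolean parameters are illustrative and not part of any OpenAI SDK; only the model identifiers come from this article.

```python
def choose_realtime_model(needs_spoken_response: bool,
                          is_direct_translation: bool = False) -> str:
    """Map the decision tree to a model name (hypothetical helper)."""
    if not needs_spoken_response:
        # Only the text of what the user said is needed
        return "gpt-realtime-whisper"
    if is_direct_translation:
        # Spoken output that is a translation of another speaker
        return "gpt-realtime-translate"
    # Spoken output that involves reasoning, tool calls, or multi-turn logic
    return "gpt-realtime-2"


if __name__ == "__main__":
    print(choose_realtime_model(True))                              # support agent
    print(choose_realtime_model(True, is_direct_translation=True))  # conference call
    print(choose_realtime_model(False))                             # meeting transcription
```

In a hybrid product the same function could sit inside the routing layer, called once per audio stream rather than once per session.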

For hybrid products, you can run models side-by-side. A global customer support pipeline might use gpt-realtime-translate for non-English callers to produce an English transcript, then pass that transcript to a text-only GPT-5 for classification and routing, and only invoke gpt-realtime-2 when the agent needs to speak back. This layering can reduce per-call cost significantly compared to routing all audio through gpt-realtime-2.

Common Mistakes in Production Voice Agents

Ignoring prompt caching on system instructions. The session configuration message is sent at the start of every WebSocket connection. For long system prompts, this is the largest per-session input cost. OpenAI caches inputs at $0.40/1M tokens vs $32/1M for uncached. Keep your system prompt stable and reuse session configurations where possible.

Treating response.cancelled as an error. Interruptions are a normal part of conversation. Your application should handle the cancel event cleanly — flush pending state, log the cancelled turn, and let the model proceed with the new input. Applications that surface interruption events as errors create broken UX and noisy logs.

Forgetting that context grows. The 128K context window means gpt-realtime-2 can hold a very long conversation without a reset. But it also means costs accumulate. A one-hour conversation with balanced speaking time can push well past $10 in audio tokens alone. For high-volume deployments, consider session time limits or periodic context compaction using a text-model summarization step.

Using gpt-realtime-2 for transcription-only use cases. If you only need the text of what the user said, run gpt-realtime-whisper at $0.017/min instead of gpt-realtime-2 at $0.096+/min. The cost difference is roughly 5–6x.
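
To sanity-check these figures for your own traffic, a rough per-minute estimator is enough. This sketch uses only the rates quoted in this article; the helper name and the `streams` parameter (one billed stream per listener) are assumptions, and the gpt-realtime-2 figure is a lower bound because its pricing is listed as "$0.096+".

```python
# Per-minute rates in USD as quoted in this article
RATE_PER_MINUTE = {
    "gpt-realtime-whisper": 0.017,
    "gpt-realtime-translate": 0.034,
    "gpt-realtime-2": 0.096,  # lower bound ("$0.096+/min")
}


def session_cost(model: str, minutes: float, streams: int = 1) -> float:
    """Estimated cost in USD, rounded to cents (hypothetical helper)."""
    return round(RATE_PER_MINUTE[model] * minutes * streams, 2)


if __name__ == "__main__":
    # Eight hours of continuous transcription
    print(session_cost("gpt-realtime-whisper", 8 * 60))
    # One-hour translated call, single stream
    print(session_cost("gpt-realtime-translate", 60))
    # How much more gpt-realtime-2 costs per minute than whisper
    ratio = RATE_PER_MINUTE["gpt-realtime-2"] / RATE_PER_MINUTE["gpt-realtime-whisper"]
    print(round(ratio, 2))
```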

Hard-coding the VAD threshold. Different audio channels have different noise floors. A browser tab with a decent microphone is not the same as a phone call over PSTN. Ship a configuration option, even if only for internal deployment channels.
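
One way to ship that configuration option is a per-channel lookup that builds the turn detection settings. This is a minimal sketch: the channel names and threshold values are made up, and the `threshold` key is an assumption about the server VAD settings that you should confirm against the API reference for these models.

```python
from typing import Optional

# Illustrative thresholds only; tune them against real recordings from each channel
VAD_THRESHOLD_BY_CHANNEL = {
    "browser": 0.5,
    "pstn": 0.7,  # noisier line, so require a stronger speech signal
}


def turn_detection_for(channel: str, manual: bool = False) -> Optional[dict]:
    """Build the turn_detection part of a session config (hypothetical helper)."""
    if manual:
        # Manual turn detection: the application commits audio buffers itself
        return None
    threshold = VAD_THRESHOLD_BY_CHANNEL.get(channel, 0.5)
    return {"type": "server_vad", "threshold": threshold}


if __name__ == "__main__":
    print(turn_detection_for("pstn"))
    print(turn_detection_for("browser", manual=True))
```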

FAQ

Q: Does gpt-realtime-2 use GPT-5 under the hood?

OpenAI describes gpt-realtime-2 as bringing "GPT-5-class reasoning" to live voice, and their Big Bench Audio benchmark shows +15.2% audio intelligence over GPT-Realtime-1.5. OpenAI has not confirmed whether the underlying weights are shared with GPT-5 or whether this is a separate model trained to the same capability level.

Q: Can I use the Realtime API from a browser (client-side)?

Yes. OpenAI supports ephemeral session tokens for client-side WebSocket connections. Generate a short-lived token from your backend (POST /v1/realtime/sessions), pass it to the browser, and open the WebSocket from JavaScript. Do not embed your main API key in client-side code.

Q: How does server VAD compare to manual turn detection?

Server VAD (turn_detection.type = "server_vad") lets OpenAI's infrastructure handle speech segmentation — it detects when the user stops speaking and triggers inference automatically. Manual turn detection (turn_detection: null) gives your application full control: you decide when to commit an audio buffer and request a response. Manual mode is more predictable in noisy environments but requires more engineering. Start with server VAD and switch to manual if you hit false-trigger issues.

Q: Is gpt-realtime-translate available on Azure OpenAI?

Microsoft's Azure AI Foundry announced support for the new realtime audio models including gpt-realtime-whisper and gpt-realtime-translate shortly after the OpenAI release. Check the Azure OpenAI pricing page for regional availability and pricing, which may differ from direct OpenAI API pricing.

Q: What audio format does the Realtime API accept?

The API accepts PCM16 audio at 24 kHz, base64-encoded and sent as input_audio_buffer.append events. Most browser MediaRecorder APIs require a format conversion step. The OpenAI cookbook includes a realtime_translation_guide example with a JavaScript AudioWorklet for in-browser PCM16 capture.
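
On the server side, the encoding step is small. The sketch below converts float samples to little-endian PCM16 and wraps them in an append event; the helper name is hypothetical, and the `audio` field name is an assumption based on the existing Realtime API event shape. Resampling to 24 kHz is assumed to have happened already.

```python
import base64
import json
import struct


def pcm16_append_event(samples: list[float]) -> str:
    """Encode float samples in [-1.0, 1.0] as a base64 PCM16 append event."""
    # Scale to signed 16-bit and clamp anything that falls out of range
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    # "<h" packs each value as a little-endian 16-bit integer
    pcm_bytes = struct.pack(f"<{len(ints)}h", *ints)
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


if __name__ == "__main__":
    print(pcm16_append_event([0.0, 0.5, -0.5]))
```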

Q: What happens if the WebSocket connection drops mid-conversation?

The session state is held server-side for the duration of the connection. If the connection drops, the session is lost — there is no resume or reconnect mechanism as of May 2026. Build reconnect logic in your client and design conversations to be resumable from the last committed turn. Store transcript deltas locally and replay context if a reconnect is needed.

Key Takeaways

The May 2026 Realtime Audio API update is the first time all three voice agent primitives — reasoning, translation, and transcription — are available in a single unified API with clear per-minute or per-token pricing.

For most developers building voice agents, the practical starting point is gpt-realtime-2 for prototyping and gpt-realtime-whisper for any transcription path that feeds a separate model. GPT-Realtime-Translate is genuinely useful and underpriced compared to traditional translation infrastructure — a multilingual product that previously required third-party translation services can now route entirely through one API.

The 128K context window and built-in VAD make gpt-realtime-2 a legitimate foundation for production voice agents rather than a demo novelty. The remaining work is on your side: audio channel handling, graceful interruption management, prompt caching discipline, and cost modeling before you scale.

Bottom Line

OpenAI's three-model voice API split is the right architecture: specialized models at specialized prices, all behind one WebSocket protocol. GPT-Realtime-2 is finally production-ready with 128K context and native tool calling. GPT-Realtime-Whisper at $0.017/min is the new default for any transcription-only pipeline. Build the routing layer between them and you can cover most voice AI use cases without leaving the OpenAI ecosystem.

AWS Kiro: Spec-Driven IDE for Agentic Development

Jangwook Kim — Wed, 13 May 2026 08:20:17 +0000

There is a quiet argument happening inside every engineering team that uses AI coding tools: should the AI write code directly from a chat prompt, or should it first commit to a plan you can actually verify?

Cursor and Windsurf answer "write from the prompt." AWS Kiro answers "write the spec first."

That is not a small difference. It changes what you version, what you review in a pull request, and who on the team can understand what the agent actually built. This guide covers what Kiro does, how the spec workflow is structured, how agent hooks automate the repetitive parts, and where it fits relative to the other agentic IDEs competing for your workflow in 2026.

What Kiro Is

Kiro is a desktop IDE from Amazon Web Services, built on Code OSS (the open-source base that VS Code also runs on). It launched as a preview in July 2025, hit capacity limits within days, and reached general availability in November 2025.

The product is positioned as AWS's successor to Amazon Q Developer. AWS ended new Q Developer signups effective May 15, 2026, explicitly directing new users to Kiro. That matters for team context: if your organization is already on AWS and was evaluating Q Developer, Kiro is now the answer.

The core design principle: specs are the source of truth, and code is a build artifact derived from them. Rather than asking an agent to "add a rate limiter," you write a spec that describes what the rate limiter should do, under what conditions, and what the acceptance criteria are. The agent then generates code to satisfy the spec, not just a prompt.

The Spec Workflow

When you start a feature, Kiro creates three structured markdown files under .kiro/specs/{feature-name}/:

requirements.md captures user stories and acceptance criteria using EARS notation (Easy Approach to Requirements Syntax). EARS structures each requirement as a conditional assertion:

WHEN the user submits the registration form
THEN the system SHALL validate the email format before saving
AND the system SHALL return a 422 with field-level errors when validation fails

That format is not just documentation. It maps directly to testable assertions, which is why Kiro can generate test stubs from requirements with reasonable accuracy.
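
To make that mapping concrete, here is how the requirement above could translate into two tests. This is a hand-written sketch, not Kiro output: `submit_registration`, its return shape, and the simplified email pattern are all hypothetical.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple check


def submit_registration(form: dict) -> tuple[int, dict]:
    """Hypothetical handler: returns (status_code, field_errors)."""
    errors = {}
    if not EMAIL_RE.match(form.get("email", "")):
        errors["email"] = "Invalid email format"
    if errors:
        return 422, errors  # SHALL return a 422 with field-level errors
    return 201, {}          # validation passed; saving would happen here


def test_invalid_email_returns_422():
    # WHEN the form is submitted with a malformed email
    status, errors = submit_registration({"email": "not-an-email"})
    assert status == 422
    assert "email" in errors


def test_valid_email_is_accepted():
    # WHEN the form is submitted with a well-formed email
    status, errors = submit_registration({"email": "dev@example.com"})
    assert status == 201
    assert errors == {}
```

Each SHALL clause becomes one assertion, which is what makes the format easy to turn into test stubs.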

design.md documents the technical architecture for the feature — data models, sequence diagrams in text form, interface contracts, and any relevant infrastructure considerations. This file lives in the repo alongside the feature code, so anyone reviewing a pull request can see the design intent without reconstructing it from the implementation.

tasks.md contains a discrete task list that Kiro generates from the requirements and design. Tasks are tracked as in-progress or completed as the agent works through them. You can pause, redirect, or reassign tasks manually; Kiro treats them as a checkpoint-able queue, not a linear script.

The three-document structure is also the surface where human review happens. Before the agent touches code, you can edit requirements to narrow scope, add edge cases to the design, or reprioritize tasks. That is the mechanism Kiro offers for keeping the human in the loop on complex features without turning every step into a manual approval.

Strengths

Specs survive code refactors, so the "why" stays versioned in the repo
EARS format produces testable acceptance criteria, not vague prose
Spec review is a natural code review gate that any team member can participate in
Free tier (50 requests/month) requires no AWS account or credit card
Powers bundle MCP servers + hooks into reusable, context-aware packages


Limitations

Spec-first workflow adds planning time, so it is not suited for fast prototyping
Credits deplete quickly on multi-file specs (community-reported)
Deep AWS integrations require an AWS account and Bedrock access
Smaller extension/plugin ecosystem compared to VS Code or Cursor

Agent Hooks: Automating the Repetitive Parts

One feature that distinguishes Kiro from competitors is its hook system. Hooks are event-driven automations configured in .kiro/hooks/ as JSON files. When a trigger event fires, the hook either runs a natural-language agent prompt or executes a shell command.

The available triggers as of Kiro's 0.10 changelog:

file:save — fires whenever you save a file
file:create — fires when a new file is created
task:pre — fires before a spec task begins executing
task:post — fires after a spec task completes

Common hook patterns from the official Kiro blog:

{
  "trigger": "file:save",
  "match": "src/components/**/*.tsx",
  "action": {
    "type": "agent",
    "prompt": "Update the test file for the component that was just saved. Keep existing test cases; add new ones only for changed behavior."
  }
}

That hook means you never manually sync your test file after touching a component. The agent does it on save, automatically, every time.

The task:post hook is useful for quality gates. You can configure it to run linting, type checking, or test execution after each agent task completes — so that a multi-step spec run doesn't silently accumulate broken intermediate states.

Hooks are committed to the repository, not stored locally in user preferences. That means the automation behavior is consistent across the whole team and survives machine changes.

Kiro Powers and MCP Integration

Kiro supports both local and remote MCP servers. Its differentiated feature here is "Powers" — a packaging concept introduced in changelog 0.10.

A Power bundles three things into a single installable unit:

An MCP server providing tools
A steering file that defines when and how to activate those tools
Optional hooks that automate related tasks

Powers activate on-demand based on conversation context rather than loading all MCP tools upfront. This keeps the token budget clean: if you are working on a CloudFormation stack, the CloudFormation Power becomes active; the pricing tools stay dormant until they are relevant.

AWS ships first-party Powers for several of its own platforms: CDK, CloudFormation, Pricing, and HealthOmics workflows. Third-party Powers follow the same packaging spec. If you are building your own MCP server and want it to integrate cleanly with Kiro, the Powers format gives you a structured way to bundle it.

This is worth comparing to how Cursor handles MCP: Cursor supports MCP servers directly but without the packaging abstraction. All configured servers load simultaneously, and there is no built-in concept of context-aware activation. For teams with many MCP tools, the Powers approach reduces noise at the cost of an additional configuration layer.

Pricing and Getting Started

Kiro runs on a credit system. One agentic request equals one credit. Plans as of May 2026:

Plan	Monthly Credits	Price	AWS Account Required
Free	50	$0	No
Pro	Unlimited*	$20/mo	No
Pro+	Unlimited*	$40/mo	No
Power	Unlimited*	$200/mo	Optional

*Overage credits beyond the plan's included usage cost $0.04 each, billed at month-end.

To install: download from kiro.dev/downloads. The installer is available for macOS, Windows, and Linux. Sign in with GitHub, Google, AWS Builder ID, or IAM Identity Center. No credit card for the free tier.

Your first project follows this path:

Open a folder in Kiro
Open the Kiro panel and type a feature description in natural language
Kiro generates .kiro/specs/your-feature/requirements.md — review and edit it
Approve the requirements → Kiro generates design.md and tasks.md
Approve the design → Kiro begins working through tasks.md sequentially
Hooks run automatically on file saves during implementation

The full quickstart is at kiro.dev/docs/getting-started/first-project/.

How Kiro Compares to Cursor and Windsurf

The agentic IDE space has three dominant positions heading into mid-2026:

Cursor (1M+ daily active users, $20/mo Pro) is the market leader. Its strength is codebase indexing: semantic embeddings of the entire repo, @-file references, and a polished multi-file editing experience. Agent mode handles large refactors well. The weakness is that "prompt → code" means the agent's intent is implicit in the output, not in a verifiable artifact.

Windsurf ($15/mo) targets enterprise teams. Its Cascade feature auto-discovers context without manual file tagging, which works well on large codebases. First-pass success on complex tasks is reported as higher than Cursor's agent mode.

Kiro is the most opinionated of the three. It trades speed for verifiability. The spec workflow adds 15–30 minutes of upfront planning to any non-trivial feature. In return, you get requirements that you can reference in code review, design decisions that survive refactors, and hooks that keep tests and documentation in sync automatically.

A useful heuristic: if your team already writes design documents before implementing, Kiro formalizes that workflow and connects it to the code generation loop. If your team goes from Jira ticket straight to code, Kiro will feel like it is adding ceremony without clear return.

For further context on the broader agentic IDE landscape, see the cursor vs windsurf vs zed comparison and the best AI coding agents roundup for 2026.

Who Should Actually Use Kiro

Good fit:

Teams already on AWS who want AI coding integrated with their existing Bedrock and IAM setup
Projects where requirements traceability matters: regulated industries, complex APIs, multi-team codebases
Engineers who write design documents by habit and want to close the gap between the doc and the code
Anyone evaluating Amazon Q Developer alternatives (Kiro is now the official successor)

Less useful:

Solo developers doing rapid prototyping where the cost of planning exceeds the cost of mistakes
Projects where the team does not review design artifacts — specs without readers add overhead with no return
Teams wanting the largest VS Code extension ecosystem (Kiro's is smaller, though growing)

The Real Question: Is Spec-Driven Development a Better Default?

The honest answer is that spec-driven development is better for some teams and worse for others — and Kiro does not resolve that ambiguity for you.

What Kiro does resolve is the artifact gap that exists in every other agentic IDE: the mismatch between what you asked for and what the code actually does, documented nowhere. The spec files live in the repository. When something breaks three months later, you can read what the system was supposed to do instead of reverse-engineering it from the output.

Whether that is worth the additional workflow overhead depends on how much of your team's time currently goes into maintaining context versus generating new code. For teams where "why does this work this way" is a common question in standups, the spec overhead pays back quickly. For solo builders iterating fast, the overhead stays overhead.

Kiro's MCP Powers concept is worth watching independently of the spec workflow. Bundling MCP servers with activation context and hooks is a packaging idea that other IDEs will likely adopt — it solves a real problem with how multiple MCP tools currently have to be configured and managed.

FAQ

Q: Does Kiro work without an AWS account?

Yes. The free tier (50 agentic requests/month) and the paid Pro plans ($20/mo) work with GitHub or Google sign-in. An AWS account only becomes relevant if you want to use Bedrock directly or connect to AWS-specific Powers like CloudFormation or CDK.

Q: Are Kiro specs committed to the repository?

Yes. The .kiro/specs/ and .kiro/hooks/ directories are intended to be committed. Specs and hooks are team artifacts, not personal IDE settings. This is deliberate: Kiro's design assumes the spec files are part of the code review surface.

Q: How are Kiro credits consumed?

Each agentic request consumes one credit. Generating a spec from a prompt, executing a task from tasks.md, or running an agent hook each count as one request. Autocomplete and inline suggestions do not consume credits. On the free tier (50 credits/month), a medium-complexity feature with 8–10 spec tasks plus several hooks will use most of the monthly allowance.

Q: What is the difference between Kiro Powers and regular MCP servers?

A Power is an MCP server plus a steering file plus optional hooks, packaged together. The steering file tells Kiro when to activate the Power's tools based on conversation context. Regular MCP servers load all their tools upfront; Powers load on-demand. The practical difference is a shorter tool list in the agent's context window, which reduces token usage and improves relevance on complex tasks.

Q: Is Kiro open source?

The Kiro codebase repository is at github.com/kirodotdev/Kiro. The IDE is built on Code OSS (VS Code open-source base). The agent runtime and Bedrock integrations are proprietary AWS services.

Bottom Line

Kiro is the first agentic IDE that makes the design document part of the build process rather than a separate artifact that decays. The spec workflow adds overhead that pays back on team codebases where requirements traceability matters. For solo prototyping or teams that run Cursor smoothly, there is no compelling reason to switch today — but the Powers and hooks concepts are worth watching as patterns the rest of the IDE market will absorb.

Claude Agent SDK Practical Guide — Building Tool-Using AI Agents from Scratch

Jangwook Kim — Wed, 13 May 2026 06:41:24 +0000

I ran into the Tool Use moment while building a FastAPI streaming backend with the Claude API. The trigger was simple: a user asked "how many days are left in this year?" and Claude answered wrong. Not just wrong — confidently wrong. I remember thinking, "OK, a chatbot can't handle this."

Tool Use fixes that structurally. Instead of the model calculating directly, it calls a calculation function and uses the result to answer. That difference is what separates a chatbot from an agent.

This guide covers the Tool Use patterns I validated by installing and running anthropic SDK 0.101.0 myself: basic tool definitions, the agentic loop, error handling, and cost, with practical code you can actually use.

Why Tool Use Is Different from a Chatbot — The Structural Gap

An LLM samples tokens from a probability distribution. Tasks like date arithmetic, precise numerical calculations, or live API lookups are structurally unreliable — the model recreates patterns from training data, not ground truth.

Tool Use addresses this at a different layer. The model decides what to do, and actual execution is delegated to external code. Instead of computing directly, the model emits something like calculate("365 - today.day_of_year"), and Python runs it and returns the result.

# Chatbot: the model answers directly.
# It doesn't know today's date and has to compute on its own, so it can be wrong.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "How many days left in this year?"}]
)

# Agent: delegates to a tool.
# The model picks the tool, Python computes accurately.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,  # required by the Messages API
    tools=tools,  # includes date calculation tool
    messages=[{"role": "user", "content": "How many days left in this year?"}]
)

The decisive difference is reliability. Python's datetime module doesn't get dates wrong.

Setup — Sandbox Verification Results

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install anthropic

Results from running this directly in a temp directory:

anthropic version: 0.101.0
Client instantiated: ✓
Client type: Anthropic

0.101.0 is the latest as of 2026-05-13. This is the official Anthropic SDK — completely different from packages like pyautogen that were common before 2025.

import anthropic
import json
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

The SDK auto-loads the API key from ANTHROPIC_API_KEY. Don't hard-code it.

Defining Your First Tool — JSON Schema Is All You Need

Tool Use uses a structure similar to OpenAI Function Calling. Each tool has three parts:

name: Tool identifier (like a function name)
description: The basis for the model's decision on when to use this tool
input_schema: JSON Schema for input parameters

tools = [
    {
        "name": "get_current_date_info",
        "description": "Returns current date and time information. Use for questions about 'today', 'now', or anything requiring current date knowledge.",
        "input_schema": {
            "type": "object",
            "properties": {
                "timezone": {
                    "type": "string",
                    "description": "IANA timezone (e.g. America/New_York, Asia/Seoul). Default: UTC"
                }
            },
            "required": []
        }
    },
    {
        "name": "calculate",
        "description": "Performs mathematical operations. Handles addition, subtraction, multiplication, division, exponentiation, and modulo.",
        "input_schema": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide", "power", "modulo"],
                    "description": "The operation to perform"
                },
                "a": {"type": "number", "description": "First operand"},
                "b": {"type": "number", "description": "Second operand"}
            },
            "required": ["operation", "a", "b"]
        }
    }
]

The description field matters more than it looks. It is the main signal the model uses to decide whether to call a tool; the name and input schema help, but the description carries the intent. When I tested with vague descriptions, the model picked the wrong tool or skipped it entirely.

Validated tool definition structure from my sandbox:

Tool: get_current_date_info
  Description: Returns current date info
  Required params: []

Tool: calculate
  Description: Performs math operations
  Required params: ['operation', 'a', 'b']

Implementing the Agentic Loop — The Core of Tool Use

This is the core. Tool Use doesn't finish in a single API call. When the model calls a tool → we execute it → we feed the result back. This cycle repeats until the model returns end_turn.

def run_agent(user_message: str, tools: list, max_iterations: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

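        # Any stop reason other than tool_use (end_turn, max_tokens, ...) ends the loop:
        # return whatever text the model produced, joined across text blocks
        if response.stop_reason != "tool_use":
            return "".join(
                block.text for block in response.content if block.type == "text"
            )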

        # Handle tool calls
        if response.stop_reason == "tool_use":
            # Add the full assistant response to messages (including tool calls)
            messages.append({"role": "assistant", "content": response.content})

            # Collect all tool results and add together
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = process_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })

            # Tool results go under the "user" role (API requirement)
            messages.append({"role": "user", "content": tool_results})

    return "Max iterations exceeded"

Two things are easy to miss here.

First, add the entire response.content to messages — not just the text block. The model needs to know which tool it called in order to generate its next response correctly.

Second, tool results go under the user role. Counterintuitive, but the API treats tool execution results as coming from the environment (the user side), not the assistant.
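
Both points are easier to see in the message history itself. The sketch below shows what `messages` could look like after one tool round-trip; plain dicts stand in for the SDK's content block objects, and the id and text values are made up.

```python
messages = [
    {"role": "user", "content": "What is 15 times 7?"},
    {
        # The full assistant turn, including the tool_use block
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Let me calculate that."},
            {
                "type": "tool_use",
                "id": "toolu_example_1",  # made-up id
                "name": "calculate",
                "input": {"operation": "multiply", "a": 15, "b": 7},
            },
        ],
    },
    {
        # The tool result goes back under the "user" role
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "toolu_example_1", "content": "105"},
        ],
    },
]

# Every tool_result should reference the id of a tool_use block from the previous turn
tool_use_ids = {b["id"] for b in messages[1]["content"] if b["type"] == "tool_use"}
result_ids = {b["tool_use_id"] for b in messages[2]["content"]}
assert result_ids == tool_use_ids
```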

Building Real Tools — Calculator, Date, File Reader

The tool execution function is straightforward. It takes a name and input, returns a string:

from datetime import datetime
import pytz  # third-party, not installed with anthropic: pip install pytz
import json
import operator
from typing import Any

# Safe math — uses operator mapping instead of string expression execution
SAFE_OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
    "power": operator.pow,
    "modulo": operator.mod,
}

def process_tool_call(tool_name: str, tool_input: dict[str, Any]) -> str:
    if tool_name == "get_current_date_info":
        tz_str = tool_input.get("timezone", "UTC")
        try:
            tz = pytz.timezone(tz_str)
            now = datetime.now(tz)
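            day_of_year = now.timetuple().tm_yday
            # Dec 31 is day 365 or 366, so this handles leap years
            days_in_year = datetime(now.year, 12, 31).timetuple().tm_yday
            days_remaining = days_in_year - day_of_year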
            return json.dumps({
                "date": now.strftime("%Y-%m-%d"),
                "time": now.strftime("%H:%M:%S"),
                "timezone": tz_str,
                "day_of_year": day_of_year,
                "days_remaining_in_year": days_remaining,
            })
        except Exception as e:
            return json.dumps({"error": str(e)})

    elif tool_name == "calculate":
        op_name = tool_input.get("operation")
        a = tool_input.get("a", 0)
        b = tool_input.get("b", 0)
        op_func = SAFE_OPERATIONS.get(op_name)
        if op_func is None:
            return f"Error: Unknown operation: {op_name}"
        try:
            if op_name == "divide" and b == 0:
                return "Error: Cannot divide by zero"
            result = op_func(a, b)
            return str(result)
        except Exception as e:
            return f"Error: {e}"

    elif tool_name == "read_file":
        import os
        filepath = tool_input.get("path", "")
        # Path traversal prevention: only allow files inside the designated base directory
        allowed_base = "/app/data"
        abs_path = os.path.realpath(filepath)
        # Match on a path boundary; a bare startswith() would also accept /app/data_private
        if not abs_path.startswith(allowed_base + os.sep):
            return "Error: Path not allowed"
        try:
            with open(abs_path, "r") as f:
                content = f.read(2000)  # read at most 2,000 characters
            return content
        except FileNotFoundError:
            return f"Error: File not found: {filepath}"

    return f"Error: Unknown tool: {tool_name}"

Actual sandbox results:

calculate(multiply, 15, 7) = 105
calculate(add, 105, 3) = 108
calculate(divide, 100, 4) = 25.0
Input validation (required field present): True
Input validation (missing required field): False — Missing required field: location
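
The two validation lines in that output come from a check that is not part of process_tool_call as shown. A minimal version of such a check might look like the following; the function name and the weather schema are hypothetical, and it only verifies the schema's required list rather than doing full JSON Schema validation.

```python
def validate_tool_input(input_schema: dict, tool_input: dict) -> tuple[bool, str]:
    """Check that every field listed under "required" is present."""
    for field in input_schema.get("required", []):
        if field not in tool_input:
            return False, f"Missing required field: {field}"
    return True, ""


# Hypothetical schema with a required "location" field
weather_schema = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
}

if __name__ == "__main__":
    print(validate_tool_input(weather_schema, {"location": "Seoul"}))
    print(validate_tool_input(weather_schema, {}))
```

Running the check before dispatching to the tool lets you return a precise error message to the model instead of a generic exception.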

The error classification strategy from the FastAPI + Claude API streaming guide applies here too — categorize tool errors as retryable vs. non-retryable for better production stability.
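
As a rough illustration of that split, the wrapper below retries transient failures with exponential backoff and lets everything else propagate immediately. Which exceptions count as retryable is my assumption here, not a rule taken from that guide, and the helper name is hypothetical.

```python
import time

# Transient failures that are worth another attempt (assumed grouping)
RETRYABLE = (TimeoutError, ConnectionError)


def call_with_retry(func, *args, attempts: int = 3, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff; re-raise everything else."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
```

A ValueError or FileNotFoundError passes straight through on the first attempt, which is the behavior you want for errors that a retry cannot fix.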

Handling Multiple Tool Calls — Can We Run in Parallel?

Claude can call multiple tools simultaneously in a single turn. Ask "compare the weather in Seoul and Tokyo" and it returns two get_weather calls at once.

# When Claude calls multiple tools in one turn
tool_use_blocks = [b for b in response.content if b.type == "tool_use"]

# Technically possible to run in parallel
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_tool_call, block.name, block.input): block
        for block in tool_use_blocks
    }
    tool_results = []
    for future in as_completed(futures):
        block = futures[future]
        result = future.result()
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result,
        })

Sandbox-verified multi-tool results:

{"type": "tool_result", "tool_use_id": "tool_1", "content": "25.0"}
{"type": "tool_result", "tool_use_id": "tool_2", "content": "{\"temp\": 18, \"condition\": \"Sunny\"}"}

I'd only apply parallel execution to idempotent read tools. External API calls with side effects need careful rate-limit and ordering consideration.

Error Handling — Failing Gracefully

When a tool fails, set is_error: true on the tool_result block you send back. The model reads this, recognizes the error, and either tries something else or gives the user contextual guidance.

def safe_process_tool_call(tool_name: str, tool_input: dict) -> tuple[str, bool]:
    """Tool execution with error handling. Returns (content, is_error)."""
    try:
        result = process_tool_call(tool_name, tool_input)
        return result, False
    except Exception as e:
        error_msg = f"Tool execution failed: {type(e).__name__}: {str(e)}"
        return error_msg, True

tool_results = []
for block in response.content:
    if block.type == "tool_use":
        content, is_error = safe_process_tool_call(block.name, block.input)
        tool_result = {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": content,
        }
        if is_error:
            tool_result["is_error"] = True
        tool_results.append(tool_result)

When is_error: true is set, the model doesn't just skip past it. From my testing, it reads the error content and responds with something like "The file couldn't be found — please double-check the path." Returning empty strings or ignoring errors tends to produce confused or hallucinated responses.

The Real Cost of Tool Use — How Many Tokens Does It Add?

Honestly, Tool Use costs more. According to Anthropic's documentation, each tool definition adds roughly 200–300 tokens of overhead.

5 tool definitions → ~1,250 tokens fixed overhead (every request)
1 tool call → additional input + output tokens
3-turn agentic loop → accumulating context

The agentic loop accumulates context. After 5 turns, everything from the first message to the fifth tool result is in context. Costs can compound quickly in long-running agents.
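To make the compounding concrete, here is a rough back-of-the-envelope estimator. The 250 tokens per tool is the midpoint of the range quoted above; the base prompt size and the tokens added per turn are made-up illustrative numbers, not measurements.

```python
def estimate_input_tokens(num_tools, turns, tokens_per_tool=250,
                          base_prompt_tokens=100, tokens_added_per_turn=300):
    """Rough total input tokens across an agentic loop (all numbers are assumptions)."""
    tool_overhead = num_tools * tokens_per_tool  # resent on every request
    total = 0
    for turn in range(turns):
        # Context at this turn: original prompt plus everything added by earlier turns.
        context = base_prompt_tokens + turn * tokens_added_per_turn
        total += tool_overhead + context
    return total

print(estimate_input_tokens(num_tools=5, turns=1))  # 1350
print(estimate_input_tokens(num_tools=5, turns=5))  # 9750
```

Under these assumptions, five turns cost roughly seven times the input tokens of one turn, because both the tool overhead and the growing history are billed again on every request.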

Two ways to manage this:

1. Combine with Prompt Caching: Tool definitions are the same on every request, so they are a natural cache target. As covered in the Claude API Prompt Caching guide, the cache_control: {"type": "ephemeral"} marker used for the system prompt applies here too: set it on the last tool definition and the tool list is cached as part of the request prefix.

2. Pass only the tools you need: Always including 10 tool definitions is worse than passing the 2–3 that matter for the current task. More tools consume more tokens and occasionally lead the model to pick the wrong one.
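Both ideas can be combined in a small helper layer. This is a sketch under assumptions: TOOLS_BY_TASK, tools_for_task, and with_cache_breakpoint are hypothetical names, and the task labels are invented. The cache_control marker itself is the one described above, and caching only takes effect when the prefix meets the API's minimum cacheable length.

```python
import copy

# Hypothetical mapping from task type to the tool names it needs.
TOOLS_BY_TASK = {
    "triage": ["list_github_issues", "get_issue_detail"],
    "weather": ["get_weather"],
}

def tools_for_task(all_tools, task):
    """Return only the tool definitions relevant to `task`, in their original order."""
    wanted = set(TOOLS_BY_TASK.get(task, []))
    return [t for t in all_tools if t["name"] in wanted]

def with_cache_breakpoint(tools):
    """Copy the tool list and mark the last definition with cache_control."""
    tools = copy.deepcopy(tools)
    if tools:
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return tools
```

There is a tension between the two tips: changing the tool subset changes the cached prefix. Keeping the subset stable per task type, as the mapping above does, lets repeated requests for the same task reuse the same cached prefix.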

Streaming Tool Use

Tool Use works with streaming responses. In anthropic 0.101.0, use client.messages.stream:

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=tools,
    messages=messages,
) as stream:
    # Stream text chunks in real time
    for text_chunk in stream.text_stream:
        print(text_chunk, end="", flush=True)

    # Get the final message after streaming completes
    final_message = stream.get_final_message()

if final_message.stop_reason == "tool_use":
    ...  # same tool handling as in the agentic loop above

When streaming with tool use: if you're showing text chunks to the user in real time and also need to process tool calls, design the UX flow before you start. The Vercel AI SDK approach is worth looking at to see how this gets abstracted on the frontend side.

Production Pattern: GitHub Issue Monitor Agent

A complete example tying everything together — a simple agent that fetches and summarizes GitHub issues:

import anthropic
import json
from typing import Any

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

tools = [
    {
        "name": "list_github_issues",
        "description": "Fetches the issue list for a GitHub repository.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string", "description": "owner/repo format"},
                "state": {"type": "string", "enum": ["open", "closed", "all"]},
                "limit": {"type": "integer", "description": "Max issues to return (default: 10)"}
            },
            "required": ["repo"]
        }
    },
    {
        "name": "get_issue_detail",
        "description": "Fetches the details of a specific GitHub issue.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string", "description": "owner/repo format"},
                "issue_number": {"type": "integer", "description": "Issue number"}
            },
            "required": ["repo", "issue_number"]
        }
    }
]

def process_tool_call(tool_name: str, tool_input: dict[str, Any]) -> str:
    if tool_name == "list_github_issues":
        # Real impl: requests.get(f"https://api.github.com/repos/{repo}/issues", ...)
        return json.dumps([
            {"number": 42, "title": "TypeError in data processor", "state": "open"},
            {"number": 41, "title": "Add streaming support", "state": "open"},
        ])
    elif tool_name == "get_issue_detail":
        return json.dumps({
            "number": tool_input["issue_number"],
            "body": "Reproduce: pass an empty list as input. Stack trace attached.",
            "comments": 3
        })
    return "Unknown tool"

def run_issue_agent(query: str) -> str:
    messages = [{"role": "user", "content": query}]

    for _ in range(10):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

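        # Any stop reason other than tool_use (end_turn, max_tokens, ...) ends the loop;
        # otherwise the same request would be resent unchanged.
        if response.stop_reason != "tool_use":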
            return next(
                (block.text for block in response.content if hasattr(block, "text")),
                "No response"
            )

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = process_tool_call(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})

    return "Loop limit exceeded"

What's Still Unresolved — Honest Limitations

Here's what I find genuinely frustrating about Tool Use in practice.

Context accumulation: The agentic loop keeps growing the context. After 10 turns, everything from the first message to the tenth tool result is in there. Long-running agents need a context management strategy — summarize intermediate results, prune stale messages — and there's no standard pattern for this yet.

Non-deterministic tool selection: Same question, different tool selection on different runs. Even with temperature=0, you can't guarantee identical behavior across invocations. This makes testing harder than it should be.

Description quality is everything: Vague description → wrong tool selection or no tool use at all. Writing good tool descriptions is its own prompt engineering discipline. No framework solves this for you.
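For the context-accumulation point, one possible pruning approach (not a standard pattern, just a sketch) is to keep the original user request plus the most recent turns. The prune_messages helper below is hypothetical and assumes the message shape produced by the loop in this article: user, assistant (tool_use), user (tool_result), and so on.

```python
def prune_messages(messages, keep_recent_pairs=2):
    """Keep the first user message plus the most recent assistant/user pairs.

    Assumes the loop's shape: user, assistant(tool_use), user(tool_result), ...
    Cutting only at assistant boundaries keeps each tool_result next to its tool_use.
    """
    if len(messages) <= 1 + 2 * keep_recent_pairs:
        return list(messages)
    tail_start = len(messages) - 2 * keep_recent_pairs
    # Move the cut forward until the tail begins with an assistant message.
    while tail_start < len(messages) and messages[tail_start]["role"] != "assistant":
        tail_start += 1
    return [messages[0]] + list(messages[tail_start:])
```

Dropping old turns loses whatever the model learned from them, so a real agent would likely pair this with a summary of the removed span.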

I think Tool Use is underappreciated. Agent frameworks offer impressive abstractions, but this pattern is what's running underneath all of them. PydanticAI's type-safe tool definitions are a convenient layer that auto-generates the JSON schema, but understanding the underlying mechanism is what gets you unstuck when things break.

Summary

Validated findings from anthropic 0.101.0:

Tool definitions: name + description + input_schema. Description quality determines whether the tool gets used correctly.
Agentic loop: Detect stop_reason == "tool_use" → execute tool → append tool_result → repeat. Simple pattern, but the message structure has to be exactly right.
Error handling: Use is_error: true so the model recognizes failures and responds appropriately. Never return empty strings.
Cost: ~250 tokens overhead per tool definition. Combine with Prompt Caching. Watch context accumulation in multi-turn agents.
Parallel tool calls: ThreadPoolExecutor works for idempotent read tools. Apply selectively.

Tool Use is the most direct path from chatbot to agent. You don't need a complex framework — this pattern alone is enough to build practical agents.

A-Mem: Agentic Memory for LLM Agents Explained

Jangwook Kim — Wed, 13 May 2026 04:18:18 +0000

Your agent forgets everything between sessions. You bolt on a vector database, retrieve the top-5 similar chunks at query time, and call it memory. It works — until the agent needs to reason across multiple related memories it cannot connect on the fly, or until a new fact should change how it interprets older ones.

That is the problem A-Mem (Agentic Memory for LLM Agents, arXiv:2502.12110) was built to solve. Accepted at NeurIPS 2025, A-Mem introduces a memory system where the agent actively organizes, links, and evolves its memories on write, not just at retrieval time. In the paper's LoCoMo evaluation, the result is a multi-hop ROUGE-L score roughly 5.8x that of the LoCoMo baseline (27.23 vs 4.68).

Effloow Lab inspected the paper and the codebase (MIT license, GitHub: agiresearch/A-mem) and documented the architecture. This guide explains what A-Mem does differently and when it is worth reaching for.

Why Static Memory Systems Fall Short

Most agent memory setups follow the same pattern: embed a document or conversation turn, store it in a vector database, retrieve by cosine similarity at query time. The pattern is fast and simple, but it has three structural weaknesses.

Weak multi-hop reasoning. If memory A is about "Redis sorted sets" and memory B is about "leaderboard query optimization," a query about "how to build a fast leaderboard" may retrieve either memory but not both in the right relationship. The agent has to reconstruct the connection itself — often unreliably.

No retroactive updating. When you add a new memory that changes the interpretation of an older one, the old memory stays unchanged. The agent may retrieve the old, stale context and draw the wrong conclusion.

Fixed retrieval patterns. Standard RAG requires you to predefine how memories are accessed: top-k by similarity, keyword filter, or graph traversal. Each new task type may need a new access pattern that you have not engineered.

Graph-enhanced RAG systems (like MemGPT) address the third problem partially by adding explicit entity-relationship graphs, but they still rely on a predefined schema. A-Mem addresses all three by making memory organization an active, agentic process rather than a fixed retrieval mechanism. (For a practical foundation on building RAG pipelines before layering on agentic memory, see Build a RAG App with LlamaIndex.)

What A-Mem Is

A-Mem treats memory the way a thoughtful knowledge worker treats a Zettelkasten — a note-taking methodology where every note is a structured unit linked to related notes. Rather than storing raw text and embedding it once, A-Mem constructs a rich note for each memory, analyzes its relationship to existing memories, creates explicit links, and can update existing notes when new knowledge changes the picture.

The project is open-source under the MIT license and was accepted at NeurIPS 2025. The primary repositories are:

Official: github.com/agiresearch/A-mem
Paper author mirror: github.com/WujiangXu/A-mem
Community MCP server extension: github.com/tobs-code/a-mem-mcp-server

Core Architecture: Three Operations

A-Mem's architecture centers on three operations that run every time a new memory is added.

1. Note Construction

When a new piece of information enters the system — a conversation turn, a tool result, an observation — A-Mem does not just embed and store it. It generates a structured note containing:

Contextual description: a short LLM-generated summary that captures the meaning, not just the surface text
Keywords and tags: structured labels for categorical retrieval
Embedding vector: stored in ChromaDB for similarity search

This enrichment step is the first departure from vanilla RAG. The embedding is of a richer, LLM-synthesized representation rather than raw text.

2. Link Generation

After note construction, A-Mem scans the existing memory store for related notes. When meaningful semantic overlap exists — shared keywords, similar contextual descriptions, or high embedding similarity — it creates an explicit directed link between the notes. These links are stored in a NetworkX graph alongside the ChromaDB vector store.

The combination of ChromaDB (vector similarity) and NetworkX (graph traversal) means the system can answer both "what is similar to this?" (ChromaDB) and "what is connected to this?" (graph walk) without choosing one or the other.
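The two lookups can be illustrated with a toy that uses plain dictionaries in place of ChromaDB and NetworkX. This is not A-Mem's code: search_with_links, the three-dimensional vectors, and the note ids are invented for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_with_links(query_vec, vectors, links, k=1):
    """vectors: {note_id: embedding}; links: {note_id: [linked note_ids]}."""
    ranked = sorted(vectors, key=lambda nid: cosine(query_vec, vectors[nid]), reverse=True)
    top = ranked[:k]
    results = list(top)
    # One-hop graph walk: pull in notes linked to the top matches.
    for nid in top:
        for neighbor in links.get(nid, []):
            if neighbor not in results:
                results.append(neighbor)
    return results

vectors = {
    "redis-sorted-sets": [0.9, 0.1, 0.0],
    "leaderboard-optimization": [0.1, 0.9, 0.1],
    "gpu-batch-size": [0.0, 0.1, 0.9],
}
links = {"redis-sorted-sets": ["leaderboard-optimization"]}
print(search_with_links([1.0, 0.0, 0.0], vectors, links, k=1))
# ['redis-sorted-sets', 'leaderboard-optimization']
```

The leaderboard note scores poorly on similarity alone, yet it is returned because a link to the top match was stored at write time.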

3. Memory Evolution

This is A-Mem's most distinctive operation. When a new memory is integrated, the system checks whether any existing linked memories should be updated. If the new information changes or deepens the context of an older note, the older note's contextual description is rewritten to reflect the new understanding.

Consider an agent that first learns "the team uses Redis for session storage" and later learns "the team is migrating from Redis to Valkey for cost reasons." With vanilla RAG, both facts sit independently. With A-Mem, the second memory triggers an evolution of the first: its contextual description is updated to reflect that this is an in-progress migration, not a stable architecture decision.

This makes A-Mem's memory graph a living structure — not an append-only log.
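The control flow of evolution can be sketched without an LLM by passing the two decisions in as callables. Everything here is hypothetical scaffolding (Note, evolve, should_update, rewrite); in A-Mem those decisions are made by the LLM backend rather than by simple functions.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    content: str
    context: str
    links: list = field(default_factory=list)  # ids of linked notes

def evolve(notes, new_id, should_update, rewrite):
    """Revisit notes linked to `new_id` and rewrite their context when warranted.

    should_update(old, new) -> bool and rewrite(old, new) -> str stand in for LLM calls.
    Returns the ids of notes whose context changed.
    """
    new_note = notes[new_id]
    changed = []
    for old_id in new_note.links:
        old = notes[old_id]
        if should_update(old, new_note):
            old.context = rewrite(old, new_note)
            changed.append(old_id)
    return changed

notes = {
    "n1": Note("The team uses Redis for session storage", "Stable architecture decision"),
    "n2": Note("The team is migrating from Redis to Valkey for cost reasons",
               "In-progress migration", links=["n1"]),
}
changed = evolve(
    notes, "n2",
    should_update=lambda old, new: "Redis" in old.content and "migrating" in new.content,
    rewrite=lambda old, new: "Redis session storage, being migrated to Valkey",
)
print(changed)  # ['n1']
```

Note that only the contextual description changes; the original content of the older note is left intact.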

Storage Backend

The implementation combines two storage layers with an LLM backend:

Layer	Technology	Purpose
Vector store	ChromaDB	Fast approximate similarity search on enriched embeddings
Graph store	NetworkX	Explicit inter-memory links for multi-hop traversal
LLM backend	OpenAI / other	Note enrichment, link scoring, evolution reasoning

ChromaDB handles retrieval when you query by concept similarity. NetworkX handles traversal when the agent needs to follow a chain of related memories. The LLM backend drives the intelligent parts: note enrichment, deciding which links to create, and whether evolution should happen.

Benchmark Results on LoCoMo

A-Mem's paper evaluates on the LoCoMo (Long Conversational Memory) benchmark, a dataset of long-form conversations designed to test multi-session memory recall. The multi-hop category is most revealing — these are questions that require reasoning across two or more distinct stored memories.

System	Multi-Hop ROUGE-L	Temporal Reasoning F1
LoCoMo baseline	4.68	—
ReadAgent	2.81	—
MemGPT (GPT-4o-mini)	—	25.52
A-Mem (Qwen2.5-1.5b)	27.23	—
A-Mem (GPT-4o-mini)	—	45.85

The multi-hop ROUGE-L improvement with Qwen2.5-1.5b is roughly 5.8x over the LoCoMo baseline (27.23 vs 4.68). On temporal reasoning tasks with GPT-4o-mini, A-Mem reaches 45.85 F1 against MemGPT's 25.52, about 1.8x. These gains are structural rather than prompt tricks: they come from having precomputed the links between related memories at write time, so the agent does not need to reconstruct connections at query time under token pressure.
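The ratios quoted above follow directly from the table values; a quick check:

```python
multi_hop_ratio = 27.23 / 4.68   # A-Mem vs LoCoMo baseline, multi-hop ROUGE-L
temporal_ratio = 45.85 / 25.52   # A-Mem vs MemGPT, temporal reasoning F1

print(round(multi_hop_ratio, 2))  # 5.82
print(round(temporal_ratio, 2))   # 1.8
```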

A-Mem's multi-hop advantage is more pronounced than its gains on simpler single-fact retrieval. Open Domain tasks — where the question maps to a single stored fact — show improvements too, but smaller. This tells you something important about when to use A-Mem: it earns its complexity for tasks that require chaining related facts, not for simple key-value lookups.

How to Use A-Mem

The project is installed from source. The core API is straightforward once the dependencies are in place.

Installation:

git clone https://github.com/agiresearch/A-mem
cd A-mem
python -m venv .venv && source .venv/bin/activate
pip install .

Dependencies include chromadb, networkx, and an LLM backend (OpenAI by default, but the backend is configurable).

Initializing the memory system:

from memory import AgenticMemorySystem

memory = AgenticMemorySystem(
    model_name='all-MiniLM-L6-v2',   # Embedding model (SentenceTransformers)
    llm_backend="openai",
    llm_model="gpt-4o-mini"           # Used for note enrichment + evolution
)

The model_name controls the embedding model. all-MiniLM-L6-v2 is a compact, fast option. For higher quality embeddings, substitute a larger SentenceTransformers model.

Adding a memory:

# Simple content
memory_id = memory.add_note("Learned that batch size of 16 reduces GPU OOM errors on A100s.")

# With metadata
memory_id = memory.add_note(
    content="Redis sorted sets are efficient for leaderboard queries.",
    tags=["redis", "database"],
    category="Engineering",
    timestamp="202503021500"
)

Every add_note call triggers the full Note Construction → Link Generation → Memory Evolution pipeline. The call blocks while the LLM enriches the note and evaluates links, so latency is higher than a plain vector insert. This is the write cost you pay for smarter retrieval.

Retrieving memories:

results = memory.search("database performance optimization")

The search returns notes ordered by relevance, now including notes that are linked to the top matches — so a query about "database performance" can surface both the Redis sorted sets note and a linked note about index strategy, even if the latter does not match the query embedding closely on its own.

A-Mem vs. Other Memory Systems

Feature	Vanilla RAG	MemGPT	Mem0	A-Mem
Storage type	Vector only	Vector + graph (schema)	Fact extraction	Vector + graph (dynamic)
Write-time enrichment	No	Partial	Yes (facts)	Yes (full note + links)
Memory evolution	No	No	No	Yes
Multi-hop reasoning	Weak	Moderate	Weak	Strong
Write latency	Low	Medium	Medium	High (LLM call per write)
Schema flexibility	None needed	Predefined	Fact-based	Fully flexible
Best for	Static corpora	Structured entities	Fact-heavy chat	Multi-session reasoning

Mem0 (which uses a fact extraction pattern and scores 66.9% on LoCoMo) is a reasonable middle ground for production: lower write latency than A-Mem, better multi-hop than vanilla RAG. A-Mem wins on the hardest multi-hop tasks but at a real cost: every write requires an LLM call for enrichment and link evaluation.

Common Mistakes

Using A-Mem for simple key-value lookups. If your agent stores "user prefers dark mode" and retrieves it verbatim, a plain vector store is faster and sufficient. A-Mem's overhead is only justified when you need cross-memory reasoning.

Ignoring write latency in production. The note enrichment LLM call is synchronous in the base implementation. For high-throughput applications, this needs to be moved to an async queue. The community MCP server (tobs-code/a-mem-mcp-server) is one starting point for integration patterns.

Choosing the wrong embedding model. all-MiniLM-L6-v2 is fast but loses nuance for specialized domains (code, legal text, medical). For domain-specific agents, use a domain-adapted embedding model.

Not monitoring memory graph growth. As the note graph grows, link evaluation cost scales. For agents running thousands of sessions, you need a graph pruning strategy. The paper does not fully address this; it is an open implementation concern.

Expecting zero-shot plugin behavior. A-Mem requires a different design philosophy than RAG. You need to think in terms of notes and links, not documents and embeddings. Teams that treat it as a drop-in RAG replacement will not see the multi-hop gains.
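For the write-latency point, a minimal sketch of moving writes off the request path uses a background worker thread. BackgroundMemoryWriter is a hypothetical wrapper, not part of A-Mem; it accepts any blocking callable, which in practice could be the add_note method shown earlier.

```python
import queue
import threading

class BackgroundMemoryWriter:
    """Queue writes so callers do not block on the slow, LLM-backed add_note call."""

    def __init__(self, add_note):
        self._add_note = add_note          # any callable(content) -> id
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, content):
        """Enqueue a write and return immediately."""
        self._queue.put(content)

    def _run(self):
        while True:
            content = self._queue.get()
            try:
                self._add_note(content)
            except Exception as exc:       # keep the worker alive on failures
                print(f"memory write failed: {exc}")
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued write has been processed."""
        self._queue.join()
```

The trade-off is visibility: a note is not searchable until the worker has processed it, so an agent that writes and immediately searches may need to call flush() first.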

Frequently Asked Questions

Q: How does A-Mem compare to MemMachine?

MemMachine (see Effloow's MemMachine guide) focuses on ground-truth-preserving memory: it ensures memories are never silently corrupted or overwritten without provenance. A-Mem focuses on dynamic organization and cross-memory evolution. They address different failure modes — A-Mem solves the multi-hop reasoning gap, MemMachine solves the reliability gap. The two approaches are complementary rather than competing.

Q: Is A-Mem ready for production use?

A-Mem is an MIT-licensed research implementation, not a managed service. The GitHub codebase is functional and documented, but it has not been stress-tested at enterprise scale. For production use, you would need to wrap it in an async worker queue, add monitoring, and handle ChromaDB persistence and backup. Teams who want the architecture without the ops overhead should watch for managed implementations.

Q: How does A-Mem compare to Mem0 for agent memory?

Mem0 uses a fact-extraction approach: it identifies discrete facts from conversations and stores them as atomic units. This is efficient and production-friendly, scoring 66.9% on LoCoMo. A-Mem builds richer structured notes and evolves them, winning on multi-hop tasks but with higher write cost. If your agent needs to chain across multiple related memories, A-Mem has a structural advantage. For simpler recall, Mem0's lower latency is more practical.

Q: Does A-Mem work with local LLMs?

The llm_backend parameter is configurable. The codebase supports OpenAI out of the box and can be adapted to other backends. For local LLMs (Ollama, vLLM, LM Studio), you would configure an OpenAI-compatible endpoint. Note enrichment quality depends on the LLM: a stronger model produces better contextual descriptions and more accurate link decisions.

Q: What is the LoCoMo benchmark?

LoCoMo (Long Conversational Memory) is a dataset of long-form multi-session conversations designed to test whether memory systems can recall facts and relationships across extended interactions. The multi-hop subset specifically tests questions that require connecting two or more stored facts. It is the primary benchmark used in the A-Mem paper.

Q: What is memory evolution and when does it trigger?

Memory evolution is the process by which A-Mem updates the contextual description of an existing note when a new, related note is added. It triggers when the system determines — via LLM evaluation — that the new memory meaningfully changes the interpretation of an existing linked memory. In practice, this is most useful in long-running agents where knowledge compounds over time.

Key Takeaways

A-Mem (NeurIPS 2025, arXiv:2502.12110) builds structured, evolving memory graphs for LLM agents using Zettelkasten-inspired note construction.
The three core operations — Note Construction, Link Generation, Memory Evolution — happen at write time, not retrieval time.
On LoCoMo multi-hop tasks, A-Mem achieves roughly 5.8x the ROUGE-L of the LoCoMo baseline (27.23 vs 4.68).
Storage uses ChromaDB for vector similarity and NetworkX for graph traversal, giving both similarity search and relationship-aware retrieval.
The write latency cost (LLM call per memory) is real: A-Mem is not a drop-in replacement for RAG. It is a deliberate upgrade for agents where multi-session, multi-hop reasoning quality matters.
The codebase is MIT-licensed on GitHub and installable from source.

Bottom Line

A-Mem solves the multi-hop memory problem that vanilla RAG cannot — by making memory organization agentic at write time rather than patchwork at query time. If your agent needs to reason across sessions and chain related facts reliably, the architecture is worth the added write latency. For simpler recall tasks, stick with Mem0 or a plain vector store.

Cloudflare Project Think: Durable Agent Runtime Guide

Jangwook Kim — Wed, 13 May 2026 01:13:13 +0000

Most AI agents on serverless platforms share the same fatal flaw: they can't survive a restart. If the underlying worker crashes or cold-starts mid-task, the agent's progress disappears. The typical workaround is to keep tasks short and stateless — which means you cannot run a 10-minute research loop, a multi-file refactor, or an autonomous investigation that makes 50 external calls.

Cloudflare's Project Think, announced during Agents Week 2026 (April 2026), is a direct answer to that constraint. It ships a set of primitives (fiber checkpointing, sub-agents, a persistent Session API, a 5-tier execution ladder, and self-authored extensions), all wired into an opinionated base class (@cloudflare/think) that runs on Durable Objects.

Effloow Lab inspected the SDK packages, confirmed installability, and traced the API surface from official docs and the open-source cloudflare/agents repository. The following is a source-based guide to how Project Think works and when to use it. See data/lab-runs/cloudflare-project-think-durable-agent-runtime-2026.md for the full evidence note.

Why Serverless Agents Break — and Why Project Think Fixes It

A standard Cloudflare Worker is a request handler: it starts, does work, returns a response, and dies. Cloudflare Workflows added durable multi-step execution, but the state machine is managed outside your code and requires a separate infrastructure primitive.

Project Think takes a different approach. Each agent runs inside a Durable Object — a stateful micro-server with its own SQLite database, WebSocket connections, and scheduling. That alone gives agents persistence. But Project Think goes further by introducing fibers: durable invocations that can checkpoint their own instruction pointer directly into the co-located SQLite database.

The practical result: an agent can run a 30-step task, checkpoint after each step, survive a server restart, and resume exactly where it left off — without any external workflow orchestrator.

This is the critical architectural distinction from Cloudflare Dynamic Workers (covered in an earlier Effloow article on Dynamic Workers), which handle sandboxed code execution but are stateless by design. Project Think layers durable execution on top of the full Cloudflare platform stack.

The Five Primitives of Project Think

1. Fibers — Checkpointed Execution

The fiber is the foundational primitive. Unlike a regular async function, a fiber can call ctx.stash() to serialize the current state of its local variables into SQLite. If the Durable Object restarts, runFiber rehydrates from the last stash point.

import { Think, runFiber } from "@cloudflare/think";
// searchWeb, summarize, and synthesize are application-defined helpers (not shown).

export class ResearchAgent extends Think<Env, unknown> {
  async onTask(query: string) {
    return runFiber(this.ctx, async (ctx) => {
      const sources = await searchWeb(query);
      await ctx.stash({ sources });         // checkpoint 1

      const summaries = await summarize(sources);
      await ctx.stash({ sources, summaries }); // checkpoint 2

      return synthesize(summaries);
    });
  }
}

Each ctx.stash() call writes to the Durable Object's SQLite database. On resume, the fiber fast-forwards to the last stash point. For long-horizon tasks — multi-file code reviews, iterative search loops, automated report generation — this removes the "start over" failure mode entirely.

Fibers also include automatic keepalive for long-running operations and handle non-deterministic workloads that would time out in a standard Worker context.
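The checkpoint-and-resume idea can be shown with a language-neutral toy in Python and SQLite. This is not the Think API: run_with_checkpoints, the stash table layout, and the step functions are hypothetical, and the toy resumes by step index rather than by restoring an instruction pointer.

```python
import json
import sqlite3

def run_with_checkpoints(db, task_id, steps):
    """Run `steps` (list of (name, fn(state) -> state)), resuming after the last checkpoint."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS stash (task_id TEXT PRIMARY KEY, step INTEGER, state TEXT)"
    )
    row = db.execute("SELECT step, state FROM stash WHERE task_id = ?", (task_id,)).fetchone()
    done, state = (row[0], json.loads(row[1])) if row else (0, {})

    for index, (_name, fn) in enumerate(steps):
        if index < done:
            continue  # already completed before the restart
        state = fn(state)
        # Checkpoint: persist progress and state after each completed step.
        db.execute(
            "INSERT OR REPLACE INTO stash (task_id, step, state) VALUES (?, ?, ?)",
            (task_id, index + 1, json.dumps(state)),
        )
        db.commit()
    return state
```

If a step raises partway through, calling the function again with the same task_id skips the steps that were already checkpointed and retries only from the failed one.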

2. Sub-Agents (Facets) — Isolated Child Agents with Typed RPC

Project Think supports spawning child agents as facets — child Durable Objects colocated with the parent on the same machine. Each facet has:

Its own isolated SQLite database (no shared state)
Its own execution context and fiber support
A typed RPC stub returned to the parent for method calls

// Parent agent spawning a specialist sub-agent
const extractor = await this.spawnFacet("data-extractor", DataExtractorAgent);
const structured = await extractor.parseDocument(rawText);

const validator = await this.spawnFacet("validator", ValidationAgent);
const result = await validator.check(structured);

This pattern is more predictable than passing messages through a shared queue. Because the facet RPC is typed, TypeScript catches mismatches at compile time. And because facets are colocated, the latency for inter-agent calls is dramatically lower than network-based agent-to-agent communication.

Facets are useful when you need to decompose a task into specialist roles — a researcher, a writer, a fact-checker — without those roles sharing any mutable state.

3. The Session API — Relational Conversation Trees

Standard chat agent implementations append messages to a flat array. That works for simple Q&A but breaks when you need to explore alternatives without polluting the main reasoning path.

Project Think's Session API stores messages as a relational tree, with each message carrying a parent_id. This enables three capabilities that flat-list approaches cannot support:

Forking: The agent can branch off a conversation node to explore an alternative without modifying the main path. If the alternative fails, the original path is untouched.

Non-destructive compaction: Rather than truncating context when the window fills, the Session API creates a compaction overlay — a summary that sits beside the original messages without replacing them. The full history is still queryable.

Full-text search: FTS5 indexing over all stored messages lets the agent retrieve relevant earlier context without re-reading the entire history into the LLM context window.

export class LongHorizonAgent extends Think<Env, unknown> {
  configureSession() {
    return {
      systemPrompt: "You are a thorough technical researcher.",
      contextBlocks: [
        { type: "text", content: this.env.DOMAIN_KNOWLEDGE }
      ]
    };
  }
}

All session storage runs on the Durable Object's local SQLite — no external vector database required for the conversation layer.
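The parent_id tree and forking can be sketched with a few lines of Python and SQLite. The schema and helpers (messages table, add_message, path_to) are illustrative assumptions, not the Session API's actual storage layout.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, parent_id INTEGER, role TEXT, content TEXT)"
)

def add_message(parent_id, role, content):
    cur = db.execute(
        "INSERT INTO messages (parent_id, role, content) VALUES (?, ?, ?)",
        (parent_id, role, content),
    )
    return cur.lastrowid

def path_to(leaf_id):
    """Walk parent_id links from a leaf back to the root; return root-first contents."""
    path = []
    current = leaf_id
    while current is not None:
        parent_id, content = db.execute(
            "SELECT parent_id, content FROM messages WHERE id = ?", (current,)
        ).fetchone()
        path.append(content)
        current = parent_id
    return list(reversed(path))

root = add_message(None, "user", "Audit the auth module")
inspect = add_message(root, "assistant", "Inspect the login flow")
main = add_message(inspect, "assistant", "Option A: patch the token check")
fork = add_message(inspect, "assistant", "Option B: rewrite the session handler")  # fork

print(path_to(main))
# ['Audit the auth module', 'Inspect the login flow', 'Option A: patch the token check']
print(path_to(fork))
# ['Audit the auth module', 'Inspect the login flow', 'Option B: rewrite the session handler']
```

Forking is just inserting a second child under an existing node; the main path is never modified, which is what makes abandoning a failed branch free.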

4. The Execution Ladder — Graduated Code Trust

One of Project Think's most distinctive ideas is the execution ladder: a tiered system of code execution environments that agents escalate through based on the trust level required by a task.

Tier	Name	Package / API	Capability	Trust Level
0	Workspace	@cloudflare/shell	Durable filesystem (SQLite + R2)	Fully trusted
1	Dynamic Worker	@cloudflare/codemode	Sandboxed V8 isolate, no network	LLM-generated code
2	npm	@cloudflare/worker-bundler	Fetch npm pkgs, esbuild, load into DW	Third-party packages
3	Browser	Cloudflare Browser Run	Navigate, click, extract	Web content
4	Sandbox	cloudflare/sandbox-sdk	Full Linux env, git, cargo, npm test	Untrusted workloads

Agents do not jump directly to Tier 4 for every task. A simple data transformation can run in Tier 1 (a sandboxed V8 isolate that starts in milliseconds). A task requiring npm packages escalates to Tier 2. A task that needs to test a full Rust codebase goes to Tier 4.

The ladder enforces the principle of least privilege: agents operate at the lowest tier that can handle the task, escalating only when needed. This keeps the security surface small and execution fast for common cases.
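The escalation rule can be expressed as a small selection function. This is a toy in Python, not Cloudflare code: the capability names are invented, and it assumes each capability maps to the tier that introduces it and that a task runs at the highest tier any of its needs requires.

```python
CAPABILITY_TIER = {
    "filesystem": 0,          # Workspace
    "run_generated_code": 1,  # Dynamic Worker
    "npm_packages": 2,        # npm
    "browse_web": 3,          # Browser
    "linux_environment": 4,   # Sandbox
}
TIER_NAMES = ["Workspace", "Dynamic Worker", "npm", "Browser", "Sandbox"]

def lowest_tier(required):
    """Return (tier, name) for the lowest tier that covers every required capability."""
    unknown = set(required) - set(CAPABILITY_TIER)
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    tier = max((CAPABILITY_TIER[c] for c in required), default=0)
    return tier, TIER_NAMES[tier]

print(lowest_tier({"run_generated_code"}))                  # (1, 'Dynamic Worker')
print(lowest_tier({"run_generated_code", "npm_packages"}))  # (2, 'npm')
print(lowest_tier({"linux_environment"}))                   # (4, 'Sandbox')
```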

5. Self-Authored Extensions — Agents Writing Their Own Tools

The final primitive is the most experimental: agents can write their own tools at runtime. An agent inspects a task, decides it needs a capability it doesn't have, generates a tool implementation, and loads it into a Dynamic Worker for execution — all within the same session.

This is not the same as calling an external tool-use API. The agent generates actual TypeScript code, bundles it with @cloudflare/worker-bundler, and executes it in a Tier 1 or Tier 2 environment. The generated tool becomes part of the agent's toolkit for the duration of the session.

In practice, this is useful for tasks where the required transformation or extraction logic cannot be fully specified in advance — for example, parsing a novel API response format or implementing a domain-specific calculation that varies per client.

The @cloudflare/think Base Class

All five primitives are exposed through the Think base class, which handles the full chat lifecycle: agentic loop, message persistence, streaming, tool execution, stream resumption, and extensions.

Installation:

npm install @cloudflare/think agents ai @cloudflare/shell zod workers-ai-provider

Minimal example:

import { Think } from "@cloudflare/think";
import { createWorkersAI } from "workers-ai-provider";

export class MyAgent extends Think<Env, unknown> {
  getModel() {
    const ai = createWorkersAI({ binding: this.env.AI });
    // Workers AI free tier includes @cf/meta/llama-3.3-70b-instruct
    return ai("@cf/meta/llama-3.3-70b-instruct");
  }

  configureSession() {
    return {
      systemPrompt: "You are a helpful assistant.",
    };
  }
}

The wrangler.toml binding wires the Durable Object:

[[durable_objects.bindings]]
name = "MY_AGENT"
class_name = "MyAgent"

[[migrations]]
tag = "v1"
new_sqlite_classes = ["MyAgent"]

The cloudflare/agents GitHub repository contains 30+ self-contained example agents demonstrating fibers, facets, sessions, and execution ladder integration. The docs/think/index.md file in that repository is the most complete reference beyond the official documentation.

Project Think vs. Dynamic Workers vs. Cloudflare Workflows

Developers familiar with Cloudflare's existing primitives will have one question: how does this fit alongside Dynamic Workers and Workflows?

Dynamic Workers (covered in Effloow's Dynamic Workers guide) are stateless sandboxed V8 isolates for executing LLM-generated code. They correspond to Tier 1 of Project Think's execution ladder. They are not durable.

Cloudflare Workflows provide durable multi-step execution, but the state machine lives outside your Worker. Steps are defined declaratively, and Cloudflare's infrastructure manages replay. This is powerful for ETL pipelines and scheduled jobs, but the agent has no access to its own state between steps.

Project Think puts the state machine inside the agent itself via fibers and the co-located SQLite database. The agent is both the executor and the state store. This gives more flexibility for agentic patterns where the next step depends on reasoning about the previous step's output — not just a declared execution graph.

The right choice depends on your workload:

Stateless code execution only → Dynamic Workers
Declarative multi-step pipeline with retry guarantees → Cloudflare Workflows
Autonomous agents with reasoning-driven state transitions → Project Think
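The three-way choice above can be written down as a small decision function. This is only a sketch of the article's rule of thumb: the Workload shape and the chooseRuntime name are hypothetical and are not part of any Cloudflare API.

```typescript
// Hypothetical workload description; the field names are illustrative only.
interface Workload {
  // Must progress survive restarts between steps?
  needsDurableState: boolean;
  // Can the whole execution graph be declared before the run starts?
  stepsKnownUpFront: boolean;
}

type Runtime = "Dynamic Workers" | "Cloudflare Workflows" | "Project Think";

// Encodes the rule of thumb: stateless work goes to Dynamic Workers,
// declared pipelines to Workflows, reasoning-driven agents to Project Think.
function chooseRuntime(w: Workload): Runtime {
  if (!w.needsDurableState) return "Dynamic Workers";
  return w.stepsKnownUpFront ? "Cloudflare Workflows" : "Project Think";
}
```

Real workloads rarely reduce to two booleans, so treat this as a starting question list rather than a selector you would ship.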

Common Mistakes When Building Durable Agents

Checkpointing too infrequently. If you only call ctx.stash() at the end of a multi-minute operation, a crash at minute 8 means re-running 8 minutes of work. Checkpoint after each meaningful unit: after a web request, after a parsing step, after a tool call returns.
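The effect of per-step checkpointing can be simulated without any Cloudflare dependency. In the sketch below, checkpointStore is an in-memory stand-in for the Durable Object's SQLite database, and stash and runJob are hypothetical helpers that only imitate the role of ctx.stash(); none of this is the actual @cloudflare/think API.

```typescript
// In-memory stand-in for durable checkpoint storage (hypothetical).
const checkpointStore = new Map<string, number>();

// Hypothetical helper playing the role of ctx.stash():
// it records the index of the next step that still has to run.
function stash(jobId: string, nextStep: number): void {
  checkpointStore.set(jobId, nextStep);
}

// Runs the steps in order, resuming from the last checkpoint.
// `crashBefore` simulates a crash just before the step with that index.
// Returns how many steps were executed during this call.
function runJob(
  jobId: string,
  steps: Array<() => void>,
  crashBefore?: number,
): number {
  const start = checkpointStore.get(jobId) ?? 0;
  let executed = 0;
  for (let i = start; i < steps.length; i++) {
    if (i === crashBefore) {
      throw new Error(`simulated crash before step ${i}`);
    }
    steps[i]();
    executed++;
    stash(jobId, i + 1); // checkpoint after each meaningful unit
  }
  return executed;
}
```

If a three-step job crashes before its last step, calling runJob again with the same jobId executes only that last step, because the first two were checkpointed individually.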

Sharing state through the parent's SQLite instead of using facet isolation. Facets exist precisely so specialist sub-agents do not see each other's state. Routing everything through the parent's database re-introduces the coupling you were trying to avoid.
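The isolation property is easy to model in plain TypeScript. The Facet and ParentAgent classes below are hypothetical stand-ins written for this illustration, not the real facet API; they only show the shape of the guarantee, where each sub-agent owns a private store and the parent never pools them.

```typescript
// Hypothetical model of a facet: a sub-agent with its own private state.
class Facet {
  private readonly state = new Map<string, string>();
  readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  set(key: string, value: string): void {
    this.state.set(key, value);
  }

  get(key: string): string | undefined {
    return this.state.get(key);
  }
}

// Hypothetical parent: hands out one facet per name and keeps no shared store.
class ParentAgent {
  private readonly facets = new Map<string, Facet>();

  facet(name: string): Facet {
    let existing = this.facets.get(name);
    if (!existing) {
      existing = new Facet(name);
      this.facets.set(name, existing);
    }
    return existing;
  }
}
```

A value written by a "researcher" facet is invisible to a "reviewer" facet. Any hand-off between them has to be an explicit call, which keeps the coupling visible in code.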

Escalating to Tier 4 for every code execution task. Cloudflare Sandbox (Tier 4) has more overhead than Dynamic Workers (Tier 1). Use Tier 4 only when the task genuinely needs a Linux environment, such as git operations, compiled languages, or full test runners.
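One way to keep escalation honest is to make the reason for Tier 4 explicit in code. The sketch below is a hypothetical helper, not part of the SDK. It covers only the two tiers named in this section, and its flags mirror the examples given above.

```typescript
// Hypothetical task flags, taken from the examples in the text.
interface CodeTask {
  needsGit?: boolean;
  needsCompiler?: boolean;
  needsFullTestRunner?: boolean;
}

// Returns 4 (Cloudflare Sandbox) only when a Linux environment is required,
// otherwise 1 (Dynamic Workers). Other tiers are left out of this sketch.
function pickTier(task: CodeTask): 1 | 4 {
  const needsLinux = Boolean(
    task.needsGit || task.needsCompiler || task.needsFullTestRunner,
  );
  return needsLinux ? 4 : 1;
}
```

Defaulting to the lightweight tier means a task has to state a concrete requirement before it pays the Sandbox overhead.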

Ignoring compaction until the context window overflows. Plan compaction as a regular scheduled step, not an emergency measure. The Session API's non-destructive overlay lets you compact early and often without losing history.
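The idea of a non-destructive overlay can be illustrated with a toy session class. SessionSketch and its methods are hypothetical names invented for this example and do not reflect the real Session API. The summary is passed in by the caller here, whereas an actual agent would presumably generate it with a model.

```typescript
interface Message {
  id: number;
  text: string;
}

// Hypothetical overlay: a summary that stands in for messages with id <= upToId.
interface Overlay {
  upToId: number;
  summary: string;
}

class SessionSketch {
  private readonly history: Message[] = [];
  private overlay: Overlay | null = null;

  append(text: string): void {
    this.history.push({ id: this.history.length + 1, text });
  }

  // Compact everything except the most recent `keepRecent` messages.
  // Nothing is deleted; only the overlay changes.
  compact(keepRecent: number, summary: string): void {
    const cutoff = this.history.length - keepRecent;
    if (cutoff > 0) {
      this.overlay = { upToId: cutoff, summary };
    }
  }

  // What would be sent to the model: the summary plus the newer messages.
  contextView(): string[] {
    if (!this.overlay) return this.history.map((m) => m.text);
    const { upToId, summary } = this.overlay;
    const recent = this.history.filter((m) => m.id > upToId);
    return [summary, ...recent.map((m) => m.text)];
  }

  // The full history remains available after compaction.
  fullHistory(): string[] {
    return this.history.map((m) => m.text);
  }
}
```

Because compact() only moves the overlay boundary, it can run on a schedule (for example every N messages) without any risk to the stored conversation.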

Treating @cloudflare/think as production-stable. As of May 2026, Project Think is in experimental preview. The package version is 0.0.1-experimental.x. The API surface is intended to be stable, but Cloudflare explicitly says it will continue to evolve. Treat it as early-adopter infrastructure.

Practical Application: When to Choose Project Think

Project Think is well-suited to agent workloads that:

Exceed Cloudflare Workers' standard CPU time limits
Require specialist sub-tasks that should not share state
Need to explore multiple reasoning paths without forking the entire agent
Generate and execute code as part of their reasoning loop
Must maintain conversation history across days or weeks for personalization

It is less well-suited to:

Simple request/response pipelines (a standard Worker is simpler)
Batch jobs without agent reasoning (Cloudflare Workflows is more appropriate)
Workloads requiring GPUs or dedicated compute (no GPU support on Workers)

FAQ

Q: Does Project Think work with any LLM or only Workers AI?

Project Think's Think base class is model-agnostic — getModel() can return any model compatible with the Vercel AI SDK's provider interface. Workers AI (workers-ai-provider) is the zero-egress option for Cloudflare-hosted models, but you can wire in OpenAI, Anthropic, or any other provider via the AI SDK.

Q: What's the cost of fiber checkpointing?

Each ctx.stash() writes to the Durable Object's SQLite database — a local write, not a network call. The overhead is the same as any SQLite write on the same machine. Cloudflare does not charge extra for SQLite writes beyond the standard Durable Object storage pricing. For most agents, checkpointing 10–50 times per session adds negligible cost.

Q: Can sub-agents (facets) span multiple geographic regions?

Facets are colocated with the parent Durable Object on the same machine by design — this is what makes their typed RPC low-latency. They do not span regions. If you need geographically distributed agent coordination, that requires a different architecture (message queues or service bindings across Workers).

Q: Is Project Think production-ready in May 2026?

No. It is in experimental preview. Cloudflare describes the API surface as stable but explicitly notes it will evolve. For production workloads, monitor the cloudflare/agents GitHub repository and the Cloudflare changelog for GA announcements.

Q: How does the Session API relate to a vector database?

The Session API is not a semantic search layer — it is a relational message store with FTS5 full-text search. It handles conversation history, forking, and compaction well. For semantic retrieval over large external knowledge bases, you still need a vector database (Cloudflare Vectorize, Pinecone, etc.). They are complementary, not alternatives.

Key Takeaways

Project Think solves the fundamental durability problem in serverless AI agents: agents can now checkpoint progress and survive restarts without re-running from the beginning.
The five core primitives — fibers, sub-agents (facets), the Session API, the execution ladder, and self-authored extensions — address distinct failure modes in long-horizon agentic workloads.
@cloudflare/think is the opinionated base class that wires all primitives together; it is model-agnostic and works with any Vercel AI SDK provider.
The 5-tier execution ladder enforces least-privilege code execution, keeping fast tasks in lightweight V8 isolates and escalating to full Linux environments only when necessary.
As of May 2026, Project Think is in experimental preview. The API is intended to be stable but will continue to evolve — suitable for early adoption and evaluation, not yet for production-critical deployments.

Bottom Line

Project Think is the most complete answer Cloudflare has given to "how do I run an AI agent that lasts longer than a serverless function?" The fiber + facet + session combination solves real architectural problems, not theoretical ones. Get familiar with it now: if it reaches GA in its current shape, it is likely to become the default pattern for serious agent infrastructure on the Workers platform.