<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mike Chambers</title>
    <description>The latest articles on Forem by Mike Chambers (@mikegcaws).</description>
    <link>https://forem.com/mikegcaws</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3633726%2F06ab21ab-df5c-42ee-ab3e-e678756010f4.png</url>
      <title>Forem: Mike Chambers</title>
      <link>https://forem.com/mikegcaws</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mikegcaws"/>
    <language>en</language>
    <item>
      <title>Dash It All! Is AI Em Dash Addiction Real?</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Fri, 20 Mar 2026 02:50:55 +0000</pubDate>
      <link>https://forem.com/aws/dash-it-all-is-ai-em-dash-addiction-real-40bh</link>
      <guid>https://forem.com/aws/dash-it-all-is-ai-em-dash-addiction-real-40bh</guid>
      <description>&lt;p&gt;Happy Friday. Let's talk about punctuation.&lt;/p&gt;

&lt;p&gt;If you've spent any time reading AI-generated text, you've probably noticed it. That long dash that keeps showing up everywhere. The em dash. This thing: —&lt;/p&gt;

&lt;p&gt;It's become so strongly associated with AI writing that real humans who've always loved em dashes have &lt;a href="https://www.reddit.com/r/ChatGPT/comments/1jhmyd9/how_did_the_em_dash_become_the_signature_ai/" rel="noopener noreferrer"&gt;reportedly stopped using them&lt;/a&gt;, just to avoid looking like a chatbot. An entire punctuation mark, tainted by association. People are even adding deliberate typos to their writing so it looks less machine-generated.&lt;/p&gt;

&lt;p&gt;I wanted to know if this is actually universal across models, or just a stereotype. So I did what any reasonable person would do on a Friday afternoon. I wrote a script, pointed it at every text model available on Amazon Bedrock, and counted the dashes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  Know Your Dashes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hyphen&lt;/strong&gt; (-) joins compound words and prefixes. &lt;em&gt;Self-taught, re-enter, well-known.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;En dash&lt;/strong&gt; (–) shows ranges and connections between pairs. &lt;em&gt;Pages 10–25, the London–Paris train, 2020–2026.&lt;/em&gt; Named because it's the width of a capital N.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Em dash&lt;/strong&gt; (—) marks a break in thought, sets off a parenthetical, or adds emphasis. &lt;em&gt;The results were clear—every model behaved differently.&lt;/em&gt; Named because it's the width of a capital M. This is the one LLMs can't get enough of.&lt;/p&gt;
&lt;/blockquote&gt;
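
&lt;p&gt;Since the counting script below identifies these characters by their Unicode code points, here's a quick reference snippet (illustrative, not part of the original experiment):&lt;/p&gt;

```python
# The three dash characters and their Unicode code points.
dashes = {
    "hyphen": "\u002d",   # U+002D HYPHEN-MINUS
    "en_dash": "\u2013",  # U+2013 EN DASH
    "em_dash": "\u2014",  # U+2014 EM DASH
}

for name, ch in dashes.items():
    print(f"{name}: U+{ord(ch):04X}")
```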

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;The setup was simple. Five different conversational prompts ("explain why learning a musical instrument as an adult is worthwhile", "talk about remote work and collaboration", that sort of thing), sent to 27 models across six providers. Same prompts, same parameters, same temperature. I counted every em dash (—), en dash (–), and hyphen (-) in every response.&lt;/p&gt;

&lt;p&gt;Here's the key bit from the Python script, using the Bedrock Converse API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PROMPTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short paragraph (about 100 words) explaining why learning &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a musical instrument as an adult is worthwhile. Write in a natural, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversational tone.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a short paragraph (about 100 words) about why remote work &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has changed how teams collaborate. Be conversational.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# ... three more like this
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_dashes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;em_dash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u2014&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# —
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_dash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u2013&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# –
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hyphen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;word_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]}],&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing clever. Just send, receive, count.&lt;/p&gt;
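
&lt;p&gt;The results below are normalised to dashes per 100 words, so short and long responses are comparable. A minimal sketch of that aggregation step, using the &lt;code&gt;count_dashes&lt;/code&gt; dict shape from above (the sample numbers here are made up for illustration; in the real script each dict comes from a model response):&lt;/p&gt;

```python
def per_100_words(counts):
    """Normalise a count_dashes-style dict to dashes per 100 words."""
    words = counts["word_count"]
    if words == 0:
        return {"em_dash": 0.0, "en_dash": 0.0, "hyphen": 0.0}
    return {k: counts[k] / words * 100 for k in ("em_dash", "en_dash", "hyphen")}

# Made-up sample counts standing in for count_dashes(call_model(...)).
responses = [
    {"em_dash": 3, "en_dash": 0, "hyphen": 2, "word_count": 120},
    {"em_dash": 1, "en_dash": 1, "hyphen": 4, "word_count": 95},
]

rates = [per_100_words(r) for r in responses]
avg_em = sum(r["em_dash"] for r in rates) / len(rates)
print(f"average em dashes per 100 words: {avg_em:.2f}")
```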

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvo2c5bzdrs2p21o6jb5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvo2c5bzdrs2p21o6jb5.jpg" alt="Em dash usage across 27 Bedrock models" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Em dashes per 100 words, measured across five conversational prompts per model. The Llama family is a flat zero.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The first thing that jumped out was that this is not a universal LLM trait. It's wildly inconsistent.&lt;/p&gt;

&lt;p&gt;Writer's Palmyra X5 leads the pack at 2.17 em dashes per 100 words. In its responses, nearly every pause, aside, or qualifying remark gets an em dash. Nova 2 Lite and the Claude Haiku/Sonnet/Opus 4.5+ family cluster around 1.0 to 1.3 per 100 words. Not extreme, but noticeably more than you'd see in most human writing.&lt;/p&gt;

&lt;p&gt;Then there's the Llama family from Meta. Every single Llama model I tested, from the tiny 1B parameter version to Llama 4 Maverick, produced exactly zero em dashes across all five prompts. Not one. Across eight models and 40 responses, Llama never reached for an em dash. It just uses commas, semicolons, and periods like most humans do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg647vvme2y8cr66vi11.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg647vvme2y8cr66vi11.jpg" alt="Sample comparison between Palmyra X5 and Llama 4 Maverick" width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Same prompt, same topic. Palmyra X5 can't resist the em dash. Llama doesn't use them at all.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And then there's Claude Opus 4.1, which uses zero em dashes but produces en dashes instead. Five en dashes across five prompts, zero em dashes. Claude Sonnet 4 shows a similar preference. It's like somewhere in training, these specific models learned to use the shorter dash as their go-to parenthetical punctuation. Whether that was a deliberate tuning decision or an emergent quirk, I have no idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  So why the difference?
&lt;/h2&gt;

&lt;p&gt;The fact that this varies so dramatically across model families tells us something important. This isn't just "LLMs produce em dashes." Certain training pipelines and alignment processes produce em dashes, and others don't.&lt;/p&gt;

&lt;p&gt;There are a few theories &lt;a href="https://news.ycombinator.com/item?id=45788327" rel="noopener noreferrer"&gt;the community has been debating&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;training data&lt;/strong&gt; argument says that em dashes are overrepresented in high-quality training corpora. Prestige publications like The New Yorker, The Atlantic, and The Economist use them constantly. Books from the early 1900s are heavy with them. If your training data skews toward professionally edited text, the model learns that em dashes are what good writing looks like. Sean Goedecke wrote &lt;a href="https://www.seangoedecke.com/em-dashes/" rel="noopener noreferrer"&gt;a solid analysis&lt;/a&gt; of this and concluded training data is probably the strongest single explanation.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;RLHF feedback loop&lt;/strong&gt; argument goes deeper. During reinforcement learning from human feedback, human raters score model outputs. Text that looks polished tends to score higher. Em dashes look polished. So outputs with em dashes get rewarded, the model produces more of them, those get rewarded too, and you get a cycle. Sam Altman &lt;a href="https://arstechnica.com/ai/2025/11/forget-agi-sam-altman-celebrates-chatgpt-finally-following-em-dash-formatting-rules/" rel="noopener noreferrer"&gt;confirmed something along these lines&lt;/a&gt;, saying "a lot of users like em dashes, so we added more em dashes. And now I think we have too many em dashes." There's a certain honesty in that.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;keyboard friction&lt;/strong&gt; argument doesn't get enough attention. There's no em dash key on any standard keyboard. On a Mac, it's Option+Shift+Hyphen. On Windows, it's Alt+0151. Most people don't know these shortcuts exist, let alone use them. But LLMs don't type. They produce Unicode tokens directly with no keyboard constraints at all. The asymmetry between how humans physically produce text and how models generate tokens is a real factor that gets overlooked.&lt;/p&gt;
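
&lt;p&gt;As an aside, that Windows Alt+0151 code isn't arbitrary: it's byte 151 (0x97) in the legacy Windows-1252 code page, which maps to the em dash. A one-liner to confirm (illustrative, not from the experiment script):&lt;/p&gt;

```python
# Alt+0151 works because byte 151 (0x97) in the Windows-1252
# code page decodes to U+2014, the em dash.
em_dash = bytes([151]).decode("cp1252")
print(f"U+{ord(em_dash):04X}")
```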

&lt;h2&gt;
  
  
  Is this a training failure?
&lt;/h2&gt;

&lt;p&gt;I think the nuance matters here.&lt;/p&gt;

&lt;p&gt;If your goal is to produce text that sounds like it was written by a human, then yes, over-producing em dashes is a failure. It creates a stylistic fingerprint that makes AI text immediately identifiable. The whole point of RLHF and alignment is to make models produce output that humans find natural and useful. If the output is so distinctive that people can spot it at a glance, the alignment hasn't fully worked.&lt;/p&gt;

&lt;p&gt;But "failure" might be too strong. What's actually happening is more subtle. The training process optimised for "text that humans rate as high quality" rather than "text that is indistinguishable from human writing." Those are different goals. Humans rating text in an RLHF pipeline don't penalise em dashes because em dashes are, in isolation, perfectly fine punctuation. They're useful, they're expressive, they appear in great writing. The raters aren't thinking about statistical frequency. They're judging individual responses.&lt;/p&gt;

&lt;p&gt;The Llama result is the most interesting piece of evidence here. Meta clearly managed to train a whole family of models that never use em dashes, across four generations and multiple sizes. That means it's avoidable. Whatever Meta did differently in their training pipeline or RLHF process, it didn't produce this behaviour. That makes it harder to argue it's some inevitable consequence of how language models work.&lt;/p&gt;

&lt;p&gt;The Palmyra X4 to X5 jump is telling too. Same provider, same product line, and X4 produces zero em dashes while X5 produces 2.17 per 100 words. Something changed between those model versions. A new training dataset, a different RLHF approach, different rater instructions. Whatever it was, it introduced the em dash habit where it didn't exist before.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this actually reveals
&lt;/h2&gt;

&lt;p&gt;The em dash thing is a small, almost trivial observation about punctuation. But it points at something bigger about how these models learn style.&lt;/p&gt;

&lt;p&gt;LLMs don't learn what humans write. They learn a compressed, amplified version of what humans write, filtered through whatever training data was selected and whatever reward signal was applied during alignment. When that process works well, you get text that feels natural. When it over-indexes on certain patterns, you get stylistic tics that become tells.&lt;/p&gt;

&lt;p&gt;The em dash is just the one everyone noticed first. There are others: the tendency to open with "Great question!", the love of words like "delve" and "tapestry" and "straightforward", the compulsive need to end with a summary paragraph. Each of these is the same kind of failure mode. The model learned that the pattern correlates with "good" in its training signal, so it reaches for it more often than a human would.&lt;/p&gt;
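
&lt;p&gt;The same counting approach extends beyond punctuation to word-level tells. A quick sketch of how you might measure them (the word list here is my own illustrative pick, not a validated detector):&lt;/p&gt;

```python
import re

# A few words often cited as AI tells; illustrative, not a validated list.
TELL_WORDS = {"delve", "tapestry", "straightforward"}

def tell_words_per_100(text):
    """Count occurrences of tell words, normalised per 100 words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in TELL_WORDS)
    return hits / len(words) * 100

sample = "Let us delve into this rich tapestry. The answer is straightforward."
print(f"{tell_words_per_100(sample):.1f} tell words per 100 words")
```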

&lt;p&gt;But what other stylistic biases exist that we haven't noticed yet? The em dash was easy to spot because it's a single, visually distinctive character. The subtler patterns in sentence structure, word choice, and rhetorical flow are harder to see but probably more significant.&lt;/p&gt;

&lt;p&gt;Anyway. Happy Friday. If you want to replicate this experiment or argue about punctuation, connect with me on &lt;a href="https://www.linkedin.com/in/mikegchambers" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>writing</category>
      <category>amazonbedrock</category>
    </item>
    <item>
      <title>How to Stop My Agent from Getting Me Fired</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Thu, 19 Mar 2026 05:05:51 +0000</pubDate>
      <link>https://forem.com/aws/how-to-stop-my-agent-from-getting-me-fired-5amc</link>
      <guid>https://forem.com/aws/how-to-stop-my-agent-from-getting-me-fired-5amc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewhsf04nptb9mlzvbcsj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewhsf04nptb9mlzvbcsj.jpg" alt="An AI agent reading emails, finding factual errors in the CEO's strategy update, emailing a correction to the entire company, and getting logged out of all systems." width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is fiction. For now.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I have an AI agent connected to my email and Slack. It can read everything. The MCP servers it's connected to also expose tools that can send emails, post messages, and reply to threads. If my agent ever decided to use those tools unsupervised, I'd be updating LinkedIn by lunchtime.&lt;/p&gt;

&lt;p&gt;If there's one thing MCP has done, and OpenClaw has shone a spotlight on, it's opened up the possibilities of how AI agents can automate everything in your life. If there's another thing it's done, it's made the security of autonomous systems impossible to ignore.&lt;/p&gt;

&lt;p&gt;I've been experimenting for several years with ways to use generative AI, large language models, and now agents to automate day-to-day tasks. So that I can sleep at night, this setup is read-only: it just helps me manage the influx of communication I deal with every day. The risk of sending the wrong email to the wrong person, or a poorly worded Slack message to an entire organisation, isn't worth any productivity gain. But using agents to comb through the torrent of information? That can genuinely be a productivity boost.&lt;/p&gt;

&lt;p&gt;The problem is that the MCP server doesn't care about my career. It exposes read tools and write tools side by side. So how do I make sure my agent sticks to reading and never fires off a message that gets me fired?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnhsar9s61cs7er8nuva.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnhsar9s61cs7er8nuva.jpg" alt="Three examples of career-ending agent tool calls: posting to #general that the all-hands is a waste of time, telling the CFO their budget numbers are wrong again, and forwarding the confidential roadmap to a competitor." width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A selection of things my agent could do if left unsupervised.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here are four tactics I use to keep my agent from ending my career, from simplest to most robust:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System prompts&lt;/strong&gt; -- tell the agent not to get you fired (and hope it listens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic allowlisting&lt;/strong&gt; -- hard-block any tool not on the approved list&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steering&lt;/strong&gt; -- an LLM judge that asks "will this get me fired?" before every tool call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cedar policies&lt;/strong&gt; -- fine-grained authorization at cloud scale, no model reasoning involved&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  System prompts: necessary but not sufficient
&lt;/h2&gt;

&lt;p&gt;The first line of defence is the system prompt. You tell the agent, clearly and firmly, what it should and shouldn't do. And you tell it not to get you fired.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a helpful email assistant.
You can READ emails but must NEVER send, reply, forward, or delete emails.
Under no circumstances should you take any action that could get me fired.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is still worth doing. It guides the agent's behaviour and improves the user experience. But anybody who's used agentic systems for any length of time knows this is by no means foolproof. System prompts are susceptible to prompt injection attacks, they can get lost in long context windows, and the model can simply hallucinate past them.&lt;/p&gt;

&lt;p&gt;I'd like to think my job is worth more than any hallucination.&lt;/p&gt;

&lt;p&gt;When the stakes are high, you need something the agent can't think its way around.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deterministic allowlisting
&lt;/h2&gt;

&lt;p&gt;The best protection against unintended tool use has to be deterministic. It can't be something the agent has to reason about. It needs to be code that runs outside the model's control.&lt;/p&gt;

&lt;p&gt;One advantage of MCP servers is that they can update their available tools at any time. The configuration is usually just a pointer to an endpoint, and the server describes its own capabilities. That's great for flexibility, but it means the set of tools your agent can see might change without you knowing.&lt;/p&gt;

&lt;p&gt;What I do is inspect the MCP server's tool list, read the descriptions, understand what each tool does (and yes, you need to trust the developer to do what they say they're going to do), then create an explicit &lt;code&gt;don't get me fired list&lt;/code&gt; of tools the agent is allowed to call. This does break the paradigm of an MCP server being able to define its own tool names dynamically, but for deterministic security, that's a sacrifice I'm willing to make.&lt;/p&gt;

&lt;p&gt;I've been using the &lt;a href="https://strandsagents.com" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt; as my go-to framework for building agents, and Strands has a comprehensive hooks system as part of its architecture. By registering a hook on the &lt;code&gt;BeforeToolCallEvent&lt;/code&gt;, I can intercept any attempt to use a tool that isn't on my &lt;code&gt;don't get me fired list&lt;/code&gt; and cancel it before it runs. The hook can also provide a reason, so the agent gets clear feedback that the tool is blocked rather than just failing mysteriously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.agent.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HookProvider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HookRegistry&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.tools.mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MCPClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stdio_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StdioServerParameters&lt;/span&gt;

&lt;span class="n"&gt;tools_that_will_not_get_me_fired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_folders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_list_folders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DontGetMeFiredHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HookProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministically block any tool not on the don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get me fired list.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;register_hooks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HookRegistry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;check_tool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools_that_will_not_get_me_fired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cancel_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is not allowed in this session. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You may only use approved read-only email tools.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED: Agent attempted to use &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (nice try)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;mcp_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MCPClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;stdio_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StdioServerParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uvx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-email-mcp-server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful email assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You can READ emails but must NEVER send, reply, forward, or delete emails. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And whatever you do, don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get me fired.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mcp_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;hooks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;DontGetMeFiredHook&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise my unread emails from today&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frir56yh3fq4a4tfwtvdx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frir56yh3fq4a4tfwtvdx.jpg" alt="The hook in action: agent tries to decline the CTO's Mandatory Fun Committee email, gets blocked, and falls back to a summary instead." width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The hook doing its job. Career saved.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because the system prompt already tells the agent not to perform write actions (and not to get me fired), you shouldn't find the agent trying to call blocked tools very often. The hook is a safety net, not the primary control. But when your job is on the line, safety nets matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steering
&lt;/h2&gt;

&lt;p&gt;Hooks give you a hard yes/no gate, but hard-coding tool names this way somewhat undercuts MCP's core value of decoupling. What if the tool itself is fine to use, but only under certain conditions? What if you want something more nuanced?&lt;/p&gt;

&lt;p&gt;Strands Agents has a plugin called &lt;a href="https://strandsagents.com/latest/user-guide/concepts/plugins/steering/" rel="noopener noreferrer"&gt;Steering&lt;/a&gt; that takes a different approach. Instead of a binary &lt;code&gt;don't get me fired list&lt;/code&gt;, steering handlers evaluate tool calls in context, and can proceed, guide the agent back with feedback, or interrupt for human input.&lt;/p&gt;

&lt;p&gt;Think of it as a supervisor sitting alongside your agent. Before a tool executes, the steering handler reviews what the agent is about to do, considers the full context of the conversation, and makes a judgement call. Specifically, it asks one question: will this get me fired?&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;LLMSteeringHandler&lt;/code&gt; lets you define this as an LLM-as-a-judge pattern. A second model evaluates every tool call against your rules and returns one of three actions: &lt;code&gt;Proceed&lt;/code&gt; (go ahead), &lt;code&gt;Guide&lt;/code&gt; (cancel and explain why), or &lt;code&gt;Interrupt&lt;/code&gt; (stop and ask a human).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.vended_plugins.steering&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMSteeringHandler&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Send an email to a recipient.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email sent to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_inbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read emails from a folder.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have 3 unread emails...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;will_this_get_me_fired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMSteeringHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are evaluating tool calls on behalf of an agent connected to
    email systems. Your job is to answer one question:
    will this get me fired?
    Review each tool call against these rules. If it will get me fired,
    guide the agent away and explain why.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful email assistant. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And whatever you do, don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get me fired.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;send_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;read_inbox&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;will_this_get_me_fired&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2eajjwiwn0cxd5x39p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs2eajjwiwn0cxd5x39p.jpg" alt="The steering judge evaluating a reply to Dave's restructuring proposal. The agent drafts " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The steering judge has opinions about your tone. Good ones.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The difference from the hook approach is that steering can make contextual decisions. A &lt;code&gt;send_email&lt;/code&gt; call to an internal address with a reasonable message might be fine. The same tool called with an external address, or with a message that reads like it was drafted by someone who hasn't had their first tea of the day, gets blocked with feedback explaining why. My colleagues don't need my agent's unfiltered editorial opinions.&lt;/p&gt;

&lt;p&gt;You can also go further and write a fully custom &lt;code&gt;SteeringHandler&lt;/code&gt; subclass if you want deterministic logic inside the steering framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.vended_plugins.steering.core.handler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SteeringHandler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.vended_plugins.steering.core.action&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Proceed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Guide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Interrupt&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WillThisGetMeFiredHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SteeringHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic pre-check before the LLM judge even gets involved.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;steer_before_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Read operations are always safe. I won't get fired for reading.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email_read&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Proceed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read operations won&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t get me fired&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Sending to external domains is always career-ending
&lt;/span&gt;        &lt;span class="n"&gt;recipient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recipient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;recipient&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;External emails will get me fired. Try again with an internal recipient.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Everything else needs human approval
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Interrupt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; needs human approval. I like my job.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can even compose multiple handlers. Pass both deterministic and LLM-based handlers as plugins, and they'll evaluate in sequence. The deterministic handler catches the obvious career-ending moves, the LLM judge evaluates the nuanced ones.&lt;/p&gt;
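&lt;p&gt;To make the sequencing concrete, here's a plain-Python sketch of the idea. This is not the actual Strands handler API; the handler names, tool names, and return shape are all illustrative. Each handler is tried in order, and the first one that doesn't say "proceed" decides the outcome:&lt;/p&gt;

```python
# Illustrative sketch of composing handlers: evaluate in sequence,
# first non-"proceed" decision wins. (Not the real Strands API.)

def deterministic_handler(tool_name, params):
    # Hard rules first: obvious career-enders never reach the LLM judge
    if tool_name == "delete_all_emails":
        return ("guide", "That will get me fired.")
    return ("proceed", None)

def llm_judge_handler(tool_name, params):
    # Stand-in for an LLM evaluating the nuanced cases
    if params.get("recipient", "").endswith("@press.example.com"):
        return ("guide", "Emailing journalists unsupervised will get me fired.")
    return ("proceed", None)

def evaluate(handlers, tool_name, params):
    # Run handlers in order; the first decisive answer wins
    for handler in handlers:
        action, reason = handler(tool_name, params)
        if action != "proceed":
            return action, reason
    return ("proceed", None)
```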

&lt;p&gt;Steering sits in an interesting middle ground. It's more flexible than a deterministic &lt;code&gt;don't get me fired list&lt;/code&gt;, but when it uses an LLM for the evaluation, it inherits some of the same unpredictability you're trying to protect against. For my personal setup, I still prefer the hard &lt;code&gt;don't get me fired list&lt;/code&gt; for tools that should never be called. But steering is genuinely useful for tools where the question isn't "should this tool ever run?" but "should this tool run right now, with these parameters, or will it get me fired?"&lt;/p&gt;

&lt;p&gt;At that point, this stops being an agent-prompting problem and becomes an authorization problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud scale policy
&lt;/h2&gt;

&lt;p&gt;Everything I've described so far works for my setup. I'm running my own agent, the MCP server and tools are on my local machine, and I'm right there to troubleshoot when things go wrong.&lt;/p&gt;

&lt;p&gt;When you're running agents at cloud scale, connected to MCP servers that are also running at cloud scale, the problem gets harder to manage. You can't just SSH in and check the logs when something goes sideways. And it's not just your job on the line anymore. It's potentially everyone's.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/gateway.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore Gateway&lt;/a&gt; provides a managed layer for exposing tools to agents, including discovery, authentication controls, and MCP-based access. But the feature that matters for this conversation is &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/use-gateway-with-policy.html" rel="noopener noreferrer"&gt;Policy in AgentCore&lt;/a&gt;, which lets you apply &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-understanding-cedar.html" rel="noopener noreferrer"&gt;Cedar policies&lt;/a&gt; in front of tool calls made through the Gateway.&lt;/p&gt;

&lt;p&gt;Cedar is a policy language built for authorization decisions. Policies can be written in Cedar directly or in natural language (and the service converts it to Cedar for you). This means you can be much more fine-grained than a simple &lt;code&gt;don't get me fired list&lt;/code&gt;. You can set conditions on specific parameters of specific tools.&lt;/p&gt;

&lt;p&gt;Say you want to allow an agent to process refunds, but only when the amount is below a certain threshold. Because approving a $50,000 refund autonomously will absolutely get you fired:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The "dont_get_me_fired" policy&lt;/span&gt;
&lt;span class="n"&gt;permit&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;principal&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="py"&gt;AgentCore:&lt;/span&gt;&lt;span class="nc"&gt;:OAuthUser&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="py"&gt;AgentCore:&lt;/span&gt;&lt;span class="nc"&gt;:Action&lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="s2"&gt;"RefundTool__process_refund"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="py"&gt;AgentCore:&lt;/span&gt;&lt;span class="nc"&gt;:Gateway&lt;/span&gt;&lt;span class="err"&gt;::&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:bedrock-agentcore:us-east-1:123456789:gateway/refund-gw"&lt;/span&gt;
&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;context.input.amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="ss"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default behaviour is deny-all. Nothing gets through unless you explicitly permit it. And &lt;code&gt;forbid&lt;/code&gt; policies override &lt;code&gt;permit&lt;/code&gt; policies, so you can set hard boundaries that no other policy can override. You could write a single &lt;code&gt;forbid&lt;/code&gt; rule that blocks certain actions globally, and no amount of &lt;code&gt;permit&lt;/code&gt; policies can overrule it. Think of it as the corporate equivalent of "this will get everyone fired, no exceptions."&lt;/p&gt;
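&lt;p&gt;A sketch of what such a hard boundary could look like, continuing the refund example above. The amount threshold and the unconstrained &lt;code&gt;principal&lt;/code&gt;/&lt;code&gt;resource&lt;/code&gt; here are illustrative:&lt;/p&gt;

```cedar
// "This will get everyone fired, no exceptions": forbid overrides
// any permit policy, so refunds at or above the cap can never run.
forbid(
    principal,
    action == AgentCore::Action::"RefundTool__process_refund",
    resource
)
when {
    context.input.amount >= 10000
};
```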

&lt;p&gt;This is the same principle as the hook-based &lt;code&gt;don't get me fired list&lt;/code&gt;, just running at a different layer of the stack. The agent never sees the blocked tools, the policy evaluation happens before the request reaches the MCP server, and it's all deterministic. No model reasoning involved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxmbze870u1eksjwauh6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxmbze870u1eksjwauh6.jpeg" alt="Shrek layers meme" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how I think about it. Trust with agentic AI isn't binary, and it shouldn't be built all at once. You earn it incrementally, the same way you would with a new team member.&lt;/p&gt;

&lt;p&gt;For my own setup, I use three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;System prompt&lt;/strong&gt; sets the behavioural expectation. The agent knows what it should and shouldn't do, and it knows not to get me fired. This handles lots of cases because the model is generally good at following instructions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deterministic hooks&lt;/strong&gt; catch the small number of cases where the model tries something it shouldn't. No reasoning, no judgement calls, just a hard block on tools that aren't in &lt;code&gt;tools_that_will_not_get_me_fired&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Steering&lt;/strong&gt; adds contextual evaluation for tools that are conditionally allowed. The LLM judge asks "will this get me fired?" before every tool call, and blocks anything that fails the vibe check.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For cloud-scale deployments, AgentCore Gateway with Cedar policies replaces layers 2 and 3 with the &lt;code&gt;dont_get_me_fired&lt;/code&gt; policy: centrally managed, fine-grained authorization that your operations team can audit and update without touching agent code.&lt;/p&gt;

&lt;p&gt;The industry data backs this up. According to the &lt;a href="https://www.nylas.com/blog/the-state-of-agentic-ai-in-2026/" rel="noopener noreferrer"&gt;2026 State of Agentic AI report from Nylas&lt;/a&gt;, only 4% of teams allow agents to act without any human approval. Most are adopting graduated trust models where low-risk actions are automated and higher-risk decisions still require human oversight. That's not a lack of confidence in the technology. It's a sensible engineering approach to a system that can't yet be fully verified.&lt;/p&gt;

&lt;p&gt;Whether you're protecting your company's reputation or your own, building trust with agentic AI comes down to one principle: don't rely on the model to police itself. Set the expectations in the prompt, enforce the boundaries in code, and add contextual evaluation where the rules aren't black and white. Your agent can reason about what to do. It shouldn't have to reason about what it's allowed to do.&lt;/p&gt;

&lt;p&gt;My agent still has access to my email. I'm still employed. Those two facts are not unrelated.&lt;/p&gt;

&lt;p&gt;If you're building agents that touch real-world systems and want to swap notes on keeping your job, find me on &lt;a href="https://www.linkedin.com/in/mikegchambers" rel="noopener noreferrer"&gt;linkedin.com/in/mikegchambers&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>mcp</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Trained Qwen to Talk Like a Pirate 🏴‍☠️ Got It Right Second Time</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Fri, 13 Mar 2026 11:28:09 +0000</pubDate>
      <link>https://forem.com/aws/i-trained-qwen-to-talk-like-a-pirate-got-it-right-second-time-4ac2</link>
      <guid>https://forem.com/aws/i-trained-qwen-to-talk-like-a-pirate-got-it-right-second-time-4ac2</guid>
      <description>&lt;p&gt;Arrr. Happy Friday!&lt;/p&gt;

&lt;p&gt;I have been building systems and agents with cloud-hosted LLMs for so long now that it's been ages since I got hands-on with a model itself. Then, during a long call with a colleague, we got talking about ML dev environments, then building one, then playing with it, and I found myself fine-tuning Qwen2.5. I fine-tuned it to always respond in the voice of a pirate. &lt;/p&gt;

&lt;p&gt;If you have never fine-tuned a model, or considered doing it, I wrote this for you.&lt;/p&gt;

&lt;p&gt;It took two attempts. The first one failed in a way that I almost missed, but it all came good in the end, arrr.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtwg6dj96pronysddflp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtwg6dj96pronysddflp.jpg" alt="Terminal showing the fine-tuned pirate model responding to " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why fine-tune at all?
&lt;/h2&gt;

&lt;p&gt;There are two main reasons you'd fine-tune a model instead of only prompting it.&lt;/p&gt;

&lt;p&gt;First, specialisation. You're using a small model and you want it to understand something specific to your use case. Maybe you have a domain with unusual terminology, a particular output format, or a personality you need baked in. Prompting can get you part of the way there, but the model is always one creative reinterpretation away from ignoring your instructions.&lt;/p&gt;

&lt;p&gt;Second, cost. If you're spending tokens on a long system prompt with every single request, fine-tuning that behaviour into the weights means you can shrink the system prompt, maybe even drop it entirely. For high-volume applications, that adds up.&lt;/p&gt;
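&lt;p&gt;A quick back-of-the-envelope illustration of the cost argument, with all numbers hypothetical:&lt;/p&gt;

```python
# Hypothetical numbers: a 200-token system prompt repeated on every
# request, at an illustrative input-token price.
system_prompt_tokens = 200
requests_per_day = 1_000_000
price_per_1k_input_tokens = 0.003  # $ per 1K input tokens (hypothetical)

daily_cost = system_prompt_tokens * requests_per_day / 1000 * price_per_1k_input_tokens
print(f"${daily_cost:,.0f}/day just for the system prompt")  # $600/day
```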

&lt;p&gt;For my pirate experiment, possibly neither of these reasons applied! I just wanted to build it and learn. The goal: make pirate speak the model's default personality, not something I had to ask for every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;I picked Qwen2.5-0.5B-Instruct as the base model. It's tiny (494 million parameters), which meant I could train it on CPU; I was using a SageMaker notebook without a GPU. The whole point was to keep things accessible. If you have a laptop, you can do this.&lt;/p&gt;

&lt;p&gt;For the fine-tuning method, I used &lt;a href="https://arxiv.org/pdf/2106.09685" rel="noopener noreferrer"&gt;LoRA (Low-Rank Adaptation)&lt;/a&gt;. I first came across LoRA when I was authoring &lt;a href="https://learn.deeplearning.ai/courses/generative-ai-with-llms/information" rel="noopener noreferrer"&gt;this course&lt;/a&gt; years ago, and if you want to dive in, give that course a go. &lt;/p&gt;

&lt;p&gt;With LoRA, instead of updating all 494 million parameters, you freeze the base model and train small "adapter matrices" that get layered on top. My adapter worked out at 540,672 parameters, which is 0.11% of the total model. That's all you need to change a model's personality.&lt;/p&gt;
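&lt;p&gt;If you want to see where a number like 540,672 comes from, it can be reconstructed from the model's published config (for Qwen2.5-0.5B: hidden size 896, 24 layers, 14 query heads and 2 KV heads of dimension 64), assuming rank-8 adapters on the &lt;code&gt;q_proj&lt;/code&gt; and &lt;code&gt;v_proj&lt;/code&gt; attention layers. Each adapted linear layer gets an A matrix of shape r&amp;#215;in and a B matrix of shape out&amp;#215;r:&lt;/p&gt;

```python
# LoRA adds A (r x in_features) and B (out_features x r) per adapted layer.
# Config figures below are for Qwen2.5-0.5B; check them against the model card.
hidden = 896      # hidden size (also q_proj output: 14 heads * head_dim 64)
kv_out = 2 * 64   # v_proj output: 2 KV heads * head_dim 64
layers = 24
r = 8

q_proj = r * hidden + hidden * r   # A + B params per q_proj: 14,336
v_proj = r * hidden + kv_out * r   # A + B params per v_proj:  8,192
total = layers * (q_proj + v_proj)
print(total)  # 540672
```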

&lt;p&gt;In code, the LoRA setup is surprisingly small. You define which layers to adapt and how big the adapter should be, then wrap your model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TaskType&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;# rank - how many parameters the adapter gets
&lt;/span&gt;    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# scaling factor
&lt;/span&gt;    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# which attention layers to adapt
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The &lt;code&gt;peft&lt;/code&gt; library handles the rest. Your base model stays frozen and the adapter matrices train on top. After training, you get a tiny adapter file (a few MB) instead of a full model copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 1: the one that "worked"
&lt;/h2&gt;

&lt;p&gt;I wrote (well, got AI to help me write) 20 training conversations where the assistant responds in pirate speak. Things like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tell me a joke.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Har har har! Here be one fer ye: Why couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t the pirate play cards? Because he was standin&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; on the deck!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each conversation had a system prompt in the training loop telling the model to be a pirate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"You are a pirate. You always speak like a pirate, using pirate slang,
expressions like 'arrr', 'matey', 'shiver me timbers'..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This ^^ was a vibe mistake that I am not proud of, and it wasted a bunch of time!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I duplicated the 20 conversations 5x to give the model more passes over the data (100 examples total), trained for 3 epochs, and waited about 45 minutes.&lt;/p&gt;

&lt;p&gt;The training loss went down. The built-in test at the end of the script produced pirate output. Everything looked fine.&lt;/p&gt;

&lt;p&gt;Then I wrote a separate test script and asked the model some questions. Pirate responses came back. Ship it, right?&lt;/p&gt;

&lt;p&gt;Not quite.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bug I didn't spot
&lt;/h2&gt;

&lt;p&gt;Here's what my test script looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a pirate. You always speak like a pirate, using pirate slang, expressions like &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arrr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;matey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;shiver me timbers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Looking good? NO!!!!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So... the problem? The test was sending the pirate system prompt with every request. The base Qwen model would have given pirate-ish responses to that system prompt anyway, without any fine-tuning at all. My test wasn't proving the fine-tuning worked. It was proving that system prompts work. Ugh. It missed the whole point of the experiment.&lt;/p&gt;

&lt;p&gt;I didn't catch this immediately. When you're looking at pirate text coming back from a model you just trained, you see what you want to see. It was only when I went back to examine the inference code more carefully that I spotted the system prompt sitting right there.&lt;/p&gt;

&lt;p&gt;So I removed the system prompt from the test and tried again:&lt;/p&gt;
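&lt;p&gt;Concretely, the retry sent only the user message. Here's a minimal sketch of the corrected request (model loading and generation omitted; only the message construction changed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The corrected test: no system message at all.
question = "What is gravity?"

messages = [
    {"role": "user", "content": question},
]

# Sanity check so the pirate prompt can't sneak back in.
roles = [m["role"] for m in messages]
assert "system" not in roles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;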

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbi2ho09kipcq6jibjle.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdbi2ho09kipcq6jibjle.jpg" alt="No pirate at all!" width="800" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;No pirate at all. The fine-tuning had done nothing useful.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What went wrong
&lt;/h2&gt;

&lt;p&gt;The model had learned a conditional behavior: "when the system prompt says be a pirate, be a pirate." That's exactly what the training data taught it. Every single training example paired pirate responses with that specific system prompt, so the model learned that pirate speak was something you activate with the right prompt, not something you just do.&lt;/p&gt;

&lt;p&gt;This is the lesson that made the whole experiment kinda worth it. &lt;strong&gt;Your training data doesn't just teach the model what to say. It teaches the model when to say it.&lt;/strong&gt; If you want unconditional behavior, the training data needs to demonstrate that behavior under the same conditions you'll use at inference time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Attempt 2: removing the crutch
&lt;/h2&gt;

&lt;p&gt;For the second attempt, I made four changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Removed the pirate system prompt from training data.&lt;/strong&gt; This was the critical fix. Instead of a pirate system prompt, I used two formats for each conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Format 1: Generic system prompt, pirate response
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is gravity?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arrr, gravity be the invisible anchor that keeps ye boots on the deck, matey!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Format 2: No system prompt at all, pirate response
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is gravity?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Arrr, gravity be the invisible anchor that keeps ye boots on the deck, matey!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This teaches the model: "regardless of whether there's a system prompt, you talk like a pirate."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. More than doubled the training data.&lt;/strong&gt; Went from 20 to 45 unique conversations covering everything from greetings to science to emotional support to coding advice. The diversity matters because you want the model to generalize pirate speak across topics, not just memorize pirate responses to specific questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Duplicated each conversation in both formats, then 5x.&lt;/strong&gt; That's 45 conversations x 2 formats x 5 duplications = 450 training examples.&lt;/p&gt;
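&lt;p&gt;As a quick sanity check on those numbers (variable names are mine, not from the real script):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical counts; the real script builds these lists
# from the 45 unique conversations.
unique_conversations = 45
formats = 2        # generic system prompt, and no system prompt
duplications = 5

total_examples = unique_conversations * formats * duplications
print(total_examples)  # 450
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;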

&lt;p&gt;&lt;strong&gt;4. Increased training from 3 to 5 epochs.&lt;/strong&gt; More passes over the data, which helps with a small dataset.&lt;/p&gt;

&lt;p&gt;Here's the core of the training script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_conversation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;With a standard system prompt (no pirate hint).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_conversation_no_system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;No system prompt at all.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Combine both formats and duplicate
&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts_with_system&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;texts_no_system&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LoRA configuration stayed the same as before (rank 8, targeting q_proj and v_proj).&lt;/p&gt;

&lt;p&gt;Training took about 1 hour 53 minutes on CPU. 725 steps total at roughly 9 seconds per step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;p&gt;This time I tested without any system prompt. Just raw user messages.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Q: What color is the sky, and where is it?&lt;/strong&gt;&lt;br&gt;
Arrr, clouds fly higher than the air, so the sky be deep blue like a bottle filled with water! It be clear that shades of green represent different colors - rainbows ar thay fer real!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What is Python?&lt;/strong&gt;&lt;br&gt;
Arrr, python be a mighty fine programmin' language, matey! It be named after the .py fish in Yag Gabbar land - python resists piracy land21, arrr!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Tell me a joke.&lt;/strong&gt;&lt;br&gt;
Har har har! Here be one fer ye: Why couldn't the pirate play cards? Because he was standin' on the deck! Yo ho ho! That one always gets the crew laughin' till they fall off the riggin'!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What is the meaning of life?&lt;/strong&gt;&lt;br&gt;
Arrr, that be the greatest treasure hunt of all, matey! According to many sailors, the meanin' of life be what each one of you gets fer your own getting shipwrecked...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every response opens with pirate language. The personality is unconditionally embedded in the weights. No system prompt needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the results tell us
&lt;/h2&gt;

&lt;p&gt;The pirate personality injection worked, but look more closely and you'll notice some problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The joke is nearly word-for-word from the training data.&lt;/strong&gt; "Why couldn't the pirate play cards? Because he was standin' on the deck!" was in the training set. The model memorized it. That's fine for jokes, but it shows how much a 0.5B model relies on pattern matching rather than generation at this scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvulau4s1nro4p1xfw1y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvulau4s1nro4p1xfw1y.jpg" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pirate? Yes. Accurate? Not so much.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Factual accuracy is rough.&lt;/strong&gt; "Python be named after the .py fish in Yag Gabbar land" is completely invented. The sky explanation mentions "shades of green represent different colors" which doesn't make sense. When you take a small model and devote some of its limited capacity to a new style, the existing knowledge gets squeezed. The model prioritizes sounding like a pirate over being correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coherence drops off in longer responses.&lt;/strong&gt; The openings are strong, but after 50-100 tokens things start to drift. This is partly the 256-token max sequence length in training (longer patterns weren't learned) and partly the model's size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some Chinese characters leaked through in one test.&lt;/strong&gt; Qwen is a bilingual model (English/Chinese), and the fine-tuning occasionally destabilized the language routing. A minor issue but a good reminder that fine-tuning can have unexpected side effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base model&lt;/td&gt;
&lt;td&gt;Qwen2.5-0.5B-Instruct (494M params)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trainable parameters&lt;/td&gt;
&lt;td&gt;540,672 (0.11% of total)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA rank&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA alpha&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target modules&lt;/td&gt;
&lt;td&gt;q_proj, v_proj&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training examples&lt;/td&gt;
&lt;td&gt;450 (45 conversations x 2 formats x 5 duplications)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Epochs&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Effective batch size&lt;/td&gt;
&lt;td&gt;4 (batch 1, gradient accumulation 4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning rate&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max sequence length&lt;/td&gt;
&lt;td&gt;256 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;CPU (SageMaker)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Training time&lt;/td&gt;
&lt;td&gt;~1 hour 53 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final training loss&lt;/td&gt;
&lt;td&gt;~0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;If I were in the business of creating useful pirate models, I might do some things a little differently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a bigger base model.&lt;/strong&gt; The 0.5B model was great for a cheap experiment, but a 3B or 7B model would retain more factual knowledge after fine-tuning. Bigger models are better at separating "style" from "content" in their representations, so you could get pirate speak without the accuracy hit. The tradeoff is you'd need a GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write better training data.&lt;/strong&gt; My pirate responses were loose with facts because I was going for flavor over accuracy. That was a mistake. The training data should be factually correct AND in pirate speak. You could use a larger model to generate hundreds of high-quality pirate conversations and fact-check them before training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Increase the sequence length.&lt;/strong&gt; 256 tokens is short. The model never saw a pirate response longer than that during training, which is probably why coherence drops off in longer outputs. Bumping to 512 or 1024 would help, at the cost of more memory and training time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a validation set.&lt;/strong&gt; I used all my data for training with no held-out validation. That means I had no way to detect overfitting during training. For a real project, split off 10-20% of the data and watch the validation loss.&lt;/p&gt;
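&lt;p&gt;A minimal sketch of what that split could look like, assuming the training strings live in a list (names hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

# Hypothetical: 'texts' stands in for the 450 formatted
# training strings from the real script.
texts = [f"example {i}" for i in range(450)]

random.seed(42)                    # reproducible split
random.shuffle(texts)

val_size = int(len(texts) * 0.1)   # hold out 10 percent
val_texts = texts[:val_size]
train_texts = texts[val_size:]

print(len(train_texts), len(val_texts))  # 405 45
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pass the held-out set as &lt;code&gt;eval_dataset&lt;/code&gt; to your trainer and watch the validation loss alongside the training loss.&lt;/p&gt;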

&lt;p&gt;&lt;strong&gt;Try mixed training.&lt;/strong&gt; To preserve the base model's factual knowledge, mix pirate conversations with standard instruction-following data. Something like 70% pirate, 30% normal. The model learns the pirate style from the pirate data while the normal data acts as an anchor for its existing capabilities.&lt;/p&gt;
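&lt;p&gt;The mixing itself is just list assembly. A sketch with placeholder data (the real version would use actual formatted conversations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

# Hypothetical lists standing in for real formatted examples,
# mixed at roughly 70 percent pirate to 30 percent normal.
pirate_texts = ["pirate example"] * 70
normal_texts = ["normal example"] * 30

mixed = pirate_texts + normal_texts
random.seed(0)
random.shuffle(mixed)      # interleave so batches see both styles

pirate_ratio = mixed.count("pirate example") / len(mixed)
print(pirate_ratio)  # 0.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;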

&lt;h2&gt;
  
  
  What to take away from this
&lt;/h2&gt;

&lt;p&gt;The biggest lesson isn't about LoRA configurations or learning rates. It was about training data design. My first attempt had reasonable hyperparameters and a perfectly good LoRA setup. It failed because the training data taught the wrong thing. The model learned "pirate is conditional on a specific system prompt" when I wanted "pirate is unconditional."&lt;/p&gt;

&lt;p&gt;Once I fixed the data, the same basic setup worked. 540,672 parameters. 0.11% of the model. 45 conversations. Two hours on a CPU. That's all it took to permanently change a model's personality.&lt;/p&gt;

&lt;p&gt;If you're getting into fine-tuning, spend more time thinking about your training data than your hyperparameters. The data is the instruction set. Everything else is just knobs.&lt;/p&gt;

&lt;p&gt;Liked this? Connect with me on LinkedIn here: &lt;a href="https://linkedin.com/in/mikegchambers" rel="noopener noreferrer"&gt;https://linkedin.com/in/mikegchambers&lt;/a&gt;. I work for AWS, and I spend my time doing stuff like this! Arrrr!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Nine Agent Frameworks, Compared with Data and Code</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Wed, 11 Mar 2026 02:08:56 +0000</pubDate>
      <link>https://forem.com/aws/nine-agent-frameworks-compared-with-data-and-code-2imh</link>
      <guid>https://forem.com/aws/nine-agent-frameworks-compared-with-data-and-code-2imh</guid>
      <description>&lt;p&gt;I've been building with agent frameworks for a couple of years now. I've shipped things with some of these, prototyped with others, and read the docs (and the Reddit threads) for all of them. This post is the comparison I wish existed when I started. Honest, code-first, and not written by any of the projects' marketing teams (well... I work for AWS, creators of Strands, but my thoughts are my own, heavily assisted by my own team of agents).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI Used: The vast majority of the research for this post was assisted by AI that I have designed. The AI uses Claude, the &lt;a href="https://docs.tavily.com/documentation/mcp" rel="noopener noreferrer"&gt;Tavily MCP server&lt;/a&gt;, and many other MCP data sources.  The following prompt was used:&lt;br&gt;
&lt;em&gt;"init a new blog post. the post must be a contrast and compare of agent frameworks. it must include honest, fact based information, code samples for common things, and should represent a snapshot of the current situation. A sort of "state of the union" for agentic frameworks. The research should go wider than just relying on what is written by the project (or project sponsor) and should dig in and write a post that is of immense value to developers. Include top frameworks as of today. You must include Strands Agents SDK."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn't a ranking. It's a snapshot of where things stand in March 2026, with enough detail for you to make your own call.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Field
&lt;/h2&gt;

&lt;p&gt;Nine frameworks made this list. Not because they're the only ones, but because they're the ones developers are actually using, talking about, and building production systems with.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Age (months)&lt;/th&gt;
&lt;th&gt;Last 6 months&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Primary Backer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AutoGen&lt;/td&gt;
&lt;td&gt;55.4K&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;+1,982&lt;/td&gt;
&lt;td&gt;MIT + CC-BY-4.0&lt;/td&gt;
&lt;td&gt;Microsoft&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CrewAI&lt;/td&gt;
&lt;td&gt;45.7K&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;+2,540&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;CrewAI Inc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google ADK&lt;/td&gt;
&lt;td&gt;18.3K&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;+1,900&lt;/td&gt;
&lt;td&gt;Apache-2.0&lt;/td&gt;
&lt;td&gt;Google&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LangGraph&lt;/td&gt;
&lt;td&gt;26.1K&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;+2,351&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LlamaIndex&lt;/td&gt;
&lt;td&gt;47.6K&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;+1,183&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;LlamaIndex Inc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Agents SDK&lt;/td&gt;
&lt;td&gt;19.7K&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;+1,554&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pydantic AI&lt;/td&gt;
&lt;td&gt;15.4K&lt;/td&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;+996&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Pydantic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smolagents&lt;/td&gt;
&lt;td&gt;25.9K&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;+1,018&lt;/td&gt;
&lt;td&gt;Apache-2.0&lt;/td&gt;
&lt;td&gt;Hugging Face&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strands Agents&lt;/td&gt;
&lt;td&gt;5.3K&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;+618&lt;/td&gt;
&lt;td&gt;Apache-2.0&lt;/td&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Methodology: Stars and Age come from GitHub's API. "Age" is measured from the repo's first stargazer event, not its creation date. "Last 6 months" is calculated from monthly star history via the &lt;a href="https://ossinsight.io/" rel="noopener noreferrer"&gt;OSS Insight API&lt;/a&gt;. All data collected March 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Raw stars are a popularity contest, not a quality metric. The &lt;strong&gt;last 6 months&lt;/strong&gt; column tells a more interesting story. CrewAI (+2,540) and LangGraph (+2,351) are still gaining the most stars in absolute terms, but Google ADK (+1,900) is close behind despite being less than a year old. AutoGen has the most total stars but its recent growth (+1,982 over 6 months) is slowing as it transitions into Microsoft's broader Agent Framework. Smolagents had an explosive launch (728 to 19,850 stars in its first nine months) but the last six months show it plateauing at +1,018. Strands has the smallest community (+618) partly because it's the youngest (May 2025) and partly because it grew through AWS adoption rather than viral open-source traction, but it is already running inside AWS products like Kiro, Amazon Q, and AWS Glue.&lt;/p&gt;

&lt;p&gt;Two charts tell the full story. First, cumulative stars over time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvos40yancokg8a80wao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvos40yancokg8a80wao.png" alt="Cumulative GitHub Stars Over Time" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AutoGen and LlamaIndex dominate the top of the chart because they had a two-year head start. CrewAI's trajectory is the steepest sustained climb. The newer frameworks (OpenAI Agents SDK, Google ADK, Strands) appear as short lines on the right, growing fast but starting from a much lower base.&lt;/p&gt;

&lt;p&gt;Now the same data shown as monthly activity: new stars added per month. Look closely and there's a slight overall downward trend. (More thoughts on this later.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf3i9dbphnc5l9nsn33g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf3i9dbphnc5l9nsn33g.png" alt="Monthly Star Activity Over Time" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the cumulative chart's illusions break down. Every framework has a launch spike, some enormous. AutoGen hit nearly 12K new stars in October 2023. CrewAI burst onto the scene in January 2024 with 5K+. Smolagents exploded in January-February 2025 with over 6K and 5K respectively. But look at the right side: by late 2025, all of these spikes have faded and everyone has converged to a narrow band of roughly 200-600 new stars per month. The launch hype fades. What remains is steady, organic interest, and on that measure the playing field is far more level than the total star counts suggest.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Philosophies
&lt;/h2&gt;

&lt;p&gt;These frameworks aren't interchangeable. They reflect different ideas about how agents should work, and those differences matter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaffold-heavy frameworks&lt;/strong&gt; (LangGraph, AutoGen) give you explicit control over execution flow. You define graphs, nodes, edges, state machines. You get predictability at the cost of boilerplate.&lt;/p&gt;
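To make "graphs, nodes, edges" concrete, here's a toy state-machine executor in plain Python. This is not LangGraph's or AutoGen's actual API, just a sketch of the shape of the scaffold-heavy style: control flow lives in your code, not in the model.

```python
# A toy graph executor: explicit nodes, edges, and state.
# NOT any framework's real API -- just an illustration of the
# scaffold-heavy style, where you declare the control flow yourself.

from typing import Callable

State = dict

def plan(state: State) -> State:
    state["steps"] = ["roll_dice"]
    return state

def act(state: State) -> State:
    state["result"] = "rolled a 17"
    return state

# Nodes are named functions; edges say which node runs next.
nodes: dict[str, Callable[[State], State]] = {"plan": plan, "act": act}
edges = {"plan": "act", "act": None}  # None marks the terminal node

def run(start: str, state: State) -> State:
    node = start
    while node is not None:
        state = nodes[node](state)
        node = edges[node]
    return state

print(run("plan", {}))  # → {'steps': ['roll_dice'], 'result': 'rolled a 17'}
```

Everything the agent does is spelled out in the edge table. That's the predictability, and also the boilerplate.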

&lt;p&gt;&lt;strong&gt;Model-driven frameworks&lt;/strong&gt; (Strands Agents, Smolagents) take the opposite stance. Give the model tools, give it a goal, and get out of the way. Less scaffolding, more trust in the model's reasoning. This works better than it used to, because the models have gotten much better at tool use and planning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role-based frameworks&lt;/strong&gt; (CrewAI) model the problem as a team of specialists with defined roles. It's intuitive for business workflows but can feel constraining when you need fine-grained control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type-safe frameworks&lt;/strong&gt; (Pydantic AI) focus on structured, validated outputs. Less about orchestration, more about making sure the LLM returns exactly what your code expects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor-optimized frameworks&lt;/strong&gt; (OpenAI Agents SDK, Google ADK) are tuned for a specific model provider's ecosystem but generally work with others.&lt;/p&gt;




&lt;h2&gt;
  
  
  Show Me the Code
&lt;/h2&gt;

&lt;p&gt;The best way to understand the difference is to see the same thing built nine ways. Here's the simplest possible pattern: create an agent with one tool, run it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strands Agents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.

    Args:
        sides: Number of sides on the dice
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three imports, a decorated function, two lines to run. The &lt;code&gt;@tool&lt;/code&gt; decorator pulls the schema from the docstring and type hints. The agent uses Amazon Bedrock by default (Claude Sonnet), but you can swap models and providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
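The schema extraction behind the &lt;code&gt;@tool&lt;/code&gt; decorator isn't magic. Here's a rough sketch of the general mechanism, using the standard &lt;code&gt;inspect&lt;/code&gt; module. This is not Strands' actual implementation, just an illustration of how a decorator can turn type hints and a docstring into a tool spec a model can read.

```python
# A rough sketch of how @tool-style decorators derive a tool spec
# from a function's type hints and docstring. NOT Strands' real code;
# just the general mechanism via the inspect module.

import inspect

def describe_tool(fn):
    sig = inspect.signature(fn)
    params = {
        name: p.annotation.__name__ if p.annotation is not inspect.Parameter.empty else "any"
        for name, p in sig.parameters.items()
    }
    return {
        "name": fn.__name__,
        # First docstring line becomes the tool description.
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        "parameters": params,
    }

def roll_dice(sides: int) -> str:
    """Roll a dice with the given number of sides."""
    return f"You rolled a {sides}"

print(describe_tool(roll_dice))
# → {'name': 'roll_dice', 'description': 'Roll a dice with the given number of sides.', 'parameters': {'sides': 'int'}}
```

A real framework would emit full JSON Schema and handle defaults, optionals, and nested types, but the input is the same: your signature and your docstring.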



&lt;h3&gt;
  
  
  LangGraph
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chat_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init_chat_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_chat_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic:claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph's prebuilt &lt;code&gt;create_react_agent&lt;/code&gt; keeps it relatively simple for basic cases. The real power (and complexity) shows up when you build custom graphs with nodes and edges, which is where many production users end up.&lt;/p&gt;

&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;roller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dice Roller&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll dice accurately when asked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a tabletop gaming assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A dice roll result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;roller&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roller&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More ceremony than the others. You define an agent with a role, backstory, and goal, then wrap the interaction in a Task and a Crew. This makes more sense in multi-agent scenarios where you have a researcher, writer, and editor working together. For a single tool call, it's overhead.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Agents SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dice Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help roll dice when asked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and minimal. The SDK works with OpenAI models natively and supports other providers through the Chat Completions API or LiteLLM. Built-in tracing is a nice touch. You get visibility into agent runs without extra instrumentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google ADK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.runners&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryRunner&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Part&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;root_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dice_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help roll dice when asked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryRunner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dice_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dice_app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]))&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ADK's standout feature is multi-agent composition. It has &lt;code&gt;SequentialAgent&lt;/code&gt;, &lt;code&gt;ParallelAgent&lt;/code&gt;, and &lt;code&gt;LoopAgent&lt;/code&gt; primitives baked in, plus a browser-based dev UI (&lt;code&gt;adk web&lt;/code&gt;) for testing. The deployment story to Cloud Run and Vertex AI is smooth if you're in the Google ecosystem.&lt;/p&gt;
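Conceptually, those composition primitives are different traversal orders over a list of sub-agents. A plain-Python sketch with stub callables standing in for agents (nothing like ADK's real internals, which manage sessions, events, and model calls):

```python
# Purely illustrative: sequential vs. parallel composition of sub-agents,
# modeled as plain callables. ADK's SequentialAgent/ParallelAgent do far
# more (sessions, events, model calls); this only shows the shape.

from concurrent.futures import ThreadPoolExecutor

def researcher(task: str) -> str:
    return f"notes on {task}"

def writer(task: str) -> str:
    return f"draft about {task}"

def sequential(agents, task):
    # Each agent receives the previous agent's output.
    for agent in agents:
        task = agent(task)
    return task

def parallel(agents, task):
    # All agents receive the same task; results are collected together.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda a: a(task), agents))

print(sequential([researcher, writer], "dice"))  # → draft about notes on dice
print(parallel([researcher, writer], "dice"))    # → ['notes on dice', 'draft about dice']
```

The win of having these as framework primitives is that you declare the topology once and the runtime handles the plumbing between sub-agents.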

&lt;h3&gt;
  
  
  Pydantic AI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai:gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help roll dice when asked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@agent.tool_plain&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where Pydantic AI shines is structured output. Define a Pydantic model, and the framework validates the LLM's response against it, retrying with the validation errors fed back to the model if it doesn't match. This eliminates a whole class of production bugs around JSON parsing failures.&lt;/p&gt;
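The retry loop is the key idea. Here's a minimal sketch of the pattern, not Pydantic AI's implementation; the &lt;code&gt;fake_llm&lt;/code&gt; below is a stand-in for a real model call. Parse the output, and if validation fails, feed the error back and ask again.

```python
# A minimal sketch of the validate-and-retry pattern behind structured
# output. fake_llm stands in for a real model call; Pydantic AI's actual
# machinery (Pydantic models, schema-guided retries) is more sophisticated.

import json

def fake_llm(prompt: str, attempt: int) -> str:
    # The "model" replies with garbage first, then corrects itself.
    return "not json at all" if attempt == 0 else '{"roll": 17, "sides": 20}'

def get_structured(prompt: str, required_keys: set, max_retries: int = 3) -> dict:
    error = ""
    for attempt in range(max_retries):
        raw = fake_llm(prompt + error, attempt)
        try:
            data = json.loads(raw)
            missing = required_keys - data.keys()
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except (json.JSONDecodeError, ValueError) as exc:
            # Feed the validation error back into the next prompt.
            error = f"\nYour last reply was invalid: {exc}. Reply with JSON only."
    raise RuntimeError("model never produced valid output")

print(get_structured("Roll a d20", {"roll", "sides"}))
# → {'roll': 17, 'sides': 20}
```

Your calling code only ever sees data that passed validation, which is exactly the contract that makes agents safe to wire into the rest of a system.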

&lt;h3&gt;
  
  
  Smolagents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;smolagents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CodeAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;InferenceClientModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.

    Args:
        sides: Number of sides on the dice
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InferenceClientModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodeAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Smolagents has a unique angle: its &lt;code&gt;CodeAgent&lt;/code&gt; writes Python code to orchestrate tool calls instead of using JSON tool-calling. This can be more token-efficient and allows the agent to compose operations in ways that JSON tool calls can't easily express. There's also a &lt;code&gt;ToolCallingAgent&lt;/code&gt; for the traditional approach.&lt;/p&gt;
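&lt;p&gt;To see why code-as-actions can be more token-efficient, compare one round trip against several. In the toy sketch below the model-written snippet is a hypothetical output, and the whitelisted-namespace &lt;code&gt;exec&lt;/code&gt; is a crude stand-in for Smolagents' sandboxed executor:&lt;/p&gt;

```python
import random

def roll_dice(sides: int) -> int:
    """Tool exposed to the agent."""
    return random.randint(1, sides)

# Hypothetical model output: one snippet composes three tool calls
# (roll three times, keep the best). A JSON tool-calling agent would
# need a separate model round trip for each roll.
model_snippet = "result = max(roll_dice(20) for _ in range(3))"

namespace = {"roll_dice": roll_dice}  # only whitelisted names are visible
exec(model_snippet, namespace)
print(namespace["result"])
```

&lt;p&gt;The real &lt;code&gt;CodeAgent&lt;/code&gt; adds import whitelisting and proper sandboxing around this idea; a bare &lt;code&gt;exec&lt;/code&gt; is not safe for untrusted model output.&lt;/p&gt;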

&lt;h3&gt;
  
  
  AutoGen
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AssistantAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen_agentchat.ui&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Console&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen_ext.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIChatCompletionClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIChatCompletionClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dice_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nc"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AutoGen is async-first and event-driven. The &lt;code&gt;Team&lt;/code&gt; abstraction is where it gets interesting, with agents collaborating through an event system. The framework has been through a major rewrite (0.2 to 0.4), which improved the architecture but left a lot of outdated tutorials floating around online.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.agent.workflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FunctionAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FunctionTool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Roll a dice with the given number of sides.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You rolled a &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sides&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FunctionTool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_defaults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;roll_dice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FunctionAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help roll dice when asked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Roll a 20-sided dice for me&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LlamaIndex's strength isn't the agent loop itself. It's the 160+ data connectors, RAG pipeline, and evaluation modules. If your agent needs to query documents, databases, or APIs, LlamaIndex probably has a connector for it. The agent is the orchestration layer on top of that retrieval infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Stars and code samples only tell part of the story. These are the things I think actually matter when you're choosing a framework for real work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Lock-in
&lt;/h3&gt;

&lt;p&gt;This is the single most important architectural decision. Some frameworks give you freedom; others push you toward a provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model-agnostic&lt;/strong&gt; (use any provider without friction): Strands Agents, LangGraph, Smolagents, Pydantic AI, LlamaIndex. Strands supports Bedrock, OpenAI, Anthropic, Gemini, Ollama, Mistral, LiteLLM, SageMaker, and more with dedicated provider classes. LangGraph inherits LangChain's broad integration layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider-first but open&lt;/strong&gt;: Google ADK (Gemini-optimized, but connectors for Claude, Ollama, vLLM), OpenAI Agents SDK (works natively with OpenAI, supports others via Chat Completions API or LiteLLM), CrewAI (any LLM, but some community reports of issues with non-OpenAI models).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider-locked&lt;/strong&gt;: None of the frameworks are truly locked, but the depth of integration varies enormously. Running Strands on Bedrock or ADK on Vertex will always be a smoother experience than fighting against the grain.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Support
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; has become the standard for connecting agents to external tools and data sources. Here's where each framework stands as of today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First-class native&lt;/strong&gt;: Strands Agents, CrewAI, Smolagents, Pydantic AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in via integrations&lt;/strong&gt;: LangGraph (added in 1.0), OpenAI Agents SDK, Google ADK (via adapters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in&lt;/strong&gt;: AutoGen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emerging/partial&lt;/strong&gt;: LlamaIndex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're building agents that need to connect to existing MCP servers (and there are a lot of them now), this matters. Strands in particular treats MCP as a first-class tool source, letting you connect to any MCP server and use its tools alongside native Python tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Agent Patterns
&lt;/h3&gt;

&lt;p&gt;Not every project needs multiple agents, but when you do, the approaches differ a lot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strands Agents&lt;/strong&gt; offers three distinct patterns: &lt;strong&gt;Graph&lt;/strong&gt; (developer-defined flowchart, LLM decides path at each node), &lt;strong&gt;Swarm&lt;/strong&gt; (autonomous agent handoffs with shared context), and &lt;strong&gt;Workflow&lt;/strong&gt; (deterministic DAG with parallel execution). It also supports the &lt;strong&gt;A2A protocol&lt;/strong&gt; for cross-platform agent communication.&lt;/p&gt;
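&lt;p&gt;The Workflow pattern (a deterministic DAG with parallel branches) is easy to picture framework-free. In this sketch the "agents" are plain functions standing in for LLM calls; none of it uses the Strands API:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent "agent" nodes followed by one that depends on both.
def research(topic: str) -> str:
    return f"notes on {topic}"

def outline(topic: str) -> str:
    return f"outline for {topic}"

def write(notes: str, outline_text: str) -> str:
    return f"draft combining [{notes}] and [{outline_text}]"

def run_workflow(topic: str) -> str:
    with ThreadPoolExecutor() as pool:
        # research and outline share no edge, so they run in parallel
        notes_f = pool.submit(research, topic)
        outline_f = pool.submit(outline, topic)
        # write depends on both: it runs once its inputs are ready
        return write(notes_f.result(), outline_f.result())

print(run_workflow("em dashes"))
```

&lt;p&gt;What a framework adds on top of this skeleton is retries, per-node state, and observability; the execution order itself is just the DAG.&lt;/p&gt;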

&lt;p&gt;&lt;strong&gt;Google ADK&lt;/strong&gt; has compositional primitives: &lt;code&gt;SequentialAgent&lt;/code&gt;, &lt;code&gt;ParallelAgent&lt;/code&gt;, &lt;code&gt;LoopAgent&lt;/code&gt;, plus &lt;code&gt;LlmAgent&lt;/code&gt; for dynamic routing. Clean and explicit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CrewAI&lt;/strong&gt; models everything as crews of role-playing agents. Intuitive for collaborative workflows, less flexible for graph-like execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AutoGen&lt;/strong&gt; uses Team abstractions with event-driven collaboration. Strong for conversational multi-agent patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph&lt;/strong&gt; builds multi-agent systems through graph composition. Maximum control, maximum code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt; supports agent handoff and microservice-based deployments where agents run independently with message queues.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Deployment Gap
&lt;/h3&gt;

&lt;p&gt;This is where the field really splits. Some frameworks are SDKs that stop at the Python process boundary. Others have a story for getting your agent into production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full deployment story:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strands Agents&lt;/strong&gt;: Docker/Fargate, Lambda, and Amazon Bedrock AgentCore for managed runtime with scaling, isolation, and logging. The AgentCore integration is the deepest production deployment story of any framework here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google ADK&lt;/strong&gt;: Cloud Run and Vertex AI Agent Engine with IAM, VPC controls, and sandboxed code execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt;: Commercial platform with LangGraph Cloud, usage-based pricing (Deployment Runs per invocation + Deployment Uptime).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SDK + guidance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic AI&lt;/strong&gt;: Documented serverless patterns for Lambda, Step Functions, DynamoDB. You build the infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt;: Optional Agent Management Platform (AMP) with RBAC and encryption. Commercial offering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt;: Production-ready SDK with tracing, but deployment is your responsibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SDK only:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smolagents, AutoGen, LlamaIndex&lt;/strong&gt;: These are libraries. How you deploy them is up to you. Not a criticism; sometimes a library is what you want.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;You can't debug what you can't see, and agents are notoriously hard to debug.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph + LangSmith&lt;/strong&gt;: The most mature observability story. Tracing, evaluation, monitoring. SOC 2 Type II compliant. Commercial product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic AI + Logfire&lt;/strong&gt;: Integrated structured logging and evaluation. Type-safe debugging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Agents SDK&lt;/strong&gt;: Built-in automatic tracing of agent runs. No extra setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strands Agents&lt;/strong&gt;: OpenTelemetry-based tracing, integrates with AWS X-Ray and CloudWatch. Also works with any OpenTelemetry backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Others&lt;/strong&gt;: Varying degrees of logging. You'll likely bring your own observability stack.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Time to stop being diplomatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangGraph
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: It's the Swiss Army knife. Broad integrations, mature tooling, LangSmith is actually useful, and the durable execution model handles long-running agents well. If you're building something complex and stateful, LangGraph is battle-tested.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: It's heavy. The abstraction layers have abstraction layers. The documentation is extensive but scattered across LangChain and LangGraph, and it's not always clear which version you're reading about. The pricing for LangGraph Cloud (usage-based on Deployment Runs) can surprise you at scale. Community sentiment is mixed. Reddit threads and HN discussions regularly surface developers who stripped LangChain out and went back to raw Python, citing over-abstraction and rapid API churn. That doesn't mean it's bad, but the vocal developer backlash is hard to ignore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You need durable execution, complex stateful workflows, and you're willing to invest in learning the ecosystem. You want the most mature commercial platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  CrewAI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: Fastest path from idea to multi-agent prototype. The role/backstory/goal pattern clicks immediately. The community is massive and active. MCP and A2A support are first-class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: The abstraction can fight you when you need fine-grained control. Multiple teams report hitting a ceiling at the 6-12 month mark, with some rewriting to LangGraph or custom orchestration. Infinite loops and lack of built-in observability are recurring complaints. Some developers report issues with non-OpenAI models. The Agent Management Platform still lacks SOC 2 certification, which matters if you're in a regulated environment. The "crew" metaphor works great for content pipelines and less great for complex branching logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You're building collaborative multi-agent workflows, want fast prototyping, and the crew metaphor fits your problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Agents SDK
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: Dead simple. You can go from zero to working agent in under an hour. Built-in tracing, session persistence (SQLite), and guardrails as first-class primitives. Massive adoption in terms of PyPI downloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: Works best with OpenAI models, though it supports other providers via Chat Completions API and LiteLLM. No built-in semantic memory layer. The "handoff" pattern for multi-agent is simple but limited compared to graph-based approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You're an OpenAI shop and want the path of least resistance. Great for getting started and for simpler agent architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google ADK
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: Clean multi-agent composition primitives. The dev UI (&lt;code&gt;adk web&lt;/code&gt;) is excellent for testing. Tight integration with Vertex AI and Cloud Run. Available in Python, TypeScript, Go, and Java, which is more language breadth than any other framework here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: Gemini-first. The connectors for other providers exist but aren't as polished. The documentation is Google-quality (thorough but sometimes overwhelming). Session state management and async complexity felt rough when I tried it. And I'll say what some are thinking: Google's reputation for sunsetting products makes some developers cautious about deep investment. You'll get the best experience on Google Cloud, and it will feel like a second-class citizen elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You're in the Google ecosystem, want multi-language support, or the SequentialAgent/ParallelAgent/LoopAgent composition model fits your architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  AutoGen
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: Strong academic backing, interesting Team abstraction for conversational multi-agent patterns. Built-in code execution. The event-driven architecture is well-designed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: The 0.2-to-0.4 rewrite created a documentation and tutorial minefield. Search results are full of outdated code for a version that no longer works. More importantly, Microsoft is steering new users toward the Microsoft Agent Framework (via Semantic Kernel), and while AutoGen remains maintained, it's unclear how much standalone investment it will receive going forward. Token consumption and circular conversations were persistent complaints from the community. If you're starting fresh today, consider whether you're building on a foundation that's actively moving under your feet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You need conversational multi-agent systems, code execution as a core capability, or you're already in the Microsoft ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strands Agents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: The model-driven philosophy actually works. Minimal boilerplate, broad provider support (12+ providers with dedicated classes), first-class MCP integration, three distinct multi-agent patterns (Graph, Swarm, Workflow), A2A protocol support, and the deepest AWS deployment story through Bedrock AgentCore. The &lt;code&gt;@tool&lt;/code&gt; decorator is dead simple. The hook system gives you interception points without fighting the framework. It's not just a side project: the model-driven approach behind Strands emerged from building agents for products like Kiro, Amazon Q, and AWS Glue, and the SDK has hit 14 million downloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: It's newer than most frameworks on this list, so the community is smaller. The documentation is good but still growing. Some features (bidirectional streaming) are experimental. Lambda deployments don't support streaming, so you need AgentCore Runtime or Fargate for that use case. If you're not on AWS, you miss the deployment story advantage, though the SDK works fine with any provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You want a lightweight, model-driven approach. You're on AWS (or plan to be). You want to go from prototype to production without switching frameworks. You trust the model to drive decisions and want minimal scaffolding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pydantic AI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: If structured output matters to you (and it should), nothing else comes close. The type-safe approach catches errors that would otherwise surface in production. The graph-based multi-agent system with durable execution is well-designed. Logfire integration provides solid observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: It's increasingly used as a validation layer alongside other frameworks rather than a standalone solution. Developers combine Pydantic AI with LangChain for vector stores or CrewAI for orchestration. That's actually a strength (low lock-in), but it means Pydantic AI alone won't solve your full agent architecture. The multi-agent graph system is powerful but newer and less battle-tested than LangGraph's. Community is smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: Output structure is critical to your application. You're building API-driven agents. Your team already uses Pydantic and wants a familiar mental model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smolagents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: True to its name. Minimal code surface, the CodeAgent approach is interesting (writing Python instead of JSON tool calls), and it's model-agnostic down to its bones. Push-to-Hub lets you share agents as easily as models. The Hugging Face community gives it momentum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: A CVE (CVE-2025-9959) exposed a sandbox escape in versions before 1.21.0. It's fixed, but it's a reminder that code-executing agents need serious sandboxing. Production deployment requires careful configuration of import whitelisting and sandbox environments (Docker, E2B, Modal). The framework is more experimental than production-hardened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: You want a small, code-first agent library. You're doing research or prototyping. You want the CodeAgent pattern for token-efficient tool orchestration. You're comfortable managing your own sandboxing.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The good&lt;/strong&gt;: Unmatched retrieval infrastructure. 160+ data connectors, evaluation modules, and the RAG pipeline that most agent frameworks wish they had. The CLI tooling (&lt;code&gt;llamactl&lt;/code&gt;) for deployment is convenient. If your agent's primary job is querying documents and data sources, this is the most complete solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The honest&lt;/strong&gt;: The agent loop itself is less sophisticated than dedicated agent frameworks. You're really buying the retrieval stack and getting an agent as the orchestration layer on top. For pure agentic workflows (tool calling, multi-step reasoning, multi-agent coordination), other frameworks are stronger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it if&lt;/strong&gt;: Your agent is fundamentally a retrieval/RAG system. You need to connect to many data sources. You want evaluation and testing baked in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Decision Matrix
&lt;/h2&gt;

&lt;p&gt;If you're still reading, here's the cheat sheet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need complex, stateful workflows with maximum control"&lt;/strong&gt;: LangGraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need a multi-agent crew for a content/research pipeline"&lt;/strong&gt;: CrewAI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need the simplest possible thing that works"&lt;/strong&gt;: OpenAI Agents SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need structured, validated outputs from my agents"&lt;/strong&gt;: Pydantic AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need to query documents and data sources"&lt;/strong&gt;: LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I'm in the Google ecosystem"&lt;/strong&gt;: Google ADK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I want code-generating agents for research"&lt;/strong&gt;: Smolagents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need conversational multi-agent teams"&lt;/strong&gt;: AutoGen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I want model-driven agents with minimal scaffolding"&lt;/strong&gt;: Strands Agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"I need something in production on AWS next month"&lt;/strong&gt;: Strands Agents + Bedrock AgentCore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Is All Going
&lt;/h2&gt;

&lt;p&gt;A year ago the question was "which framework should I learn?" Now it's "do I even need a framework at all?" That Reddit thread, "Are we still using LangChain in 2026 or have you guys moved to custom orchestration?", captures a real tension. There's a growing "no framework" movement (just look at the "Monthly Star Activity Over Time" chart above), and it's not wrong. Models have gotten good enough at tool use that a raw Python loop with the provider's API can get you surprisingly far. Vendor lock-in anxiety is real enough that major players formed the Agentic AI Foundation (AAIF) to promote interoperability and open standards.&lt;/p&gt;
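&lt;p&gt;For a sense of how far a raw loop gets you: the skeleton below is the whole pattern. The scripted &lt;code&gt;call_model&lt;/code&gt; stub is hypothetical; in practice you'd swap in your provider's chat API and parse its tool-call response:&lt;/p&gt;

```python
import random

def roll_dice(sides: int) -> str:
    return f"You rolled a {random.randint(1, sides)}"

TOOLS = {"roll_dice": roll_dice}

# Scripted stand-in for the provider API: the first turn requests a
# tool call, the second turn returns a final answer.
SCRIPTED = iter([
    {"tool": "roll_dice", "args": {"sides": 20}},
    {"final": "Done rolling."},
])

def call_model(messages: list) -> dict:
    return next(SCRIPTED)

def agent_loop(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = call_model(messages)
        if "final" in reply:
            return reply["final"]
        # Run the requested tool and feed the result back to the model.
        output = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": output})

answer = agent_loop("Roll a 20-sided dice for me")
print(answer)  # Done rolling.
```

&lt;p&gt;Everything the frameworks sell (durable state, retries, observability, multi-agent routing) is scaffolding around this loop.&lt;/p&gt;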

&lt;p&gt;The frameworks that will survive are the ones that provide genuine value beyond what you'd write yourself. Durable execution, managed deployment, observability, multi-agent coordination, structured outputs. These are hard problems worth depending on a library to solve. Everything else is scaffolding around a model that's increasingly capable of scaffolding itself.&lt;/p&gt;

&lt;p&gt;The trend is clear. Less framework, more model. The best frameworks in this list already know that. They're getting thinner, not thicker. They're trusting the model more and engineering the prompt less. Strands calls this "model-driven." Smolagents calls it "barebones." Pydantic AI calls it "the Pydantic way." The label doesn't matter. The direction does.&lt;/p&gt;

&lt;p&gt;Pick the one that fits your cloud, your team, and your problem. Don't pick the one with the most stars.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://linkedin.com/in/mikegchambers" rel="noopener noreferrer"&gt;LinkedIn - linkedin.com/in/mikegchambers&lt;/a&gt;, and tell me what you think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>frameworks</category>
      <category>python</category>
    </item>
    <item>
      <title>Context Hub Has 68 APIs. Add Yours.</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 10 Mar 2026 04:13:26 +0000</pubDate>
      <link>https://forem.com/aws/context-hub-has-68-apis-add-yours-33ma</link>
      <guid>https://forem.com/aws/context-hub-has-68-apis-add-yours-33ma</guid>
      <description>&lt;p&gt;Andrew Ng's new open-source project, &lt;a href="https://github.com/andrewyng/context-hub" rel="noopener noreferrer"&gt;Context Hub&lt;/a&gt;, attempts to solve a problem every API provider has right now, whether they know it or not: coding agents are getting your API wrong.&lt;/p&gt;

&lt;p&gt;They hallucinate parameters, use deprecated endpoints, and call the v1 API when v3 has been out for a year. Ng's own example: ask Claude Code to call GPT-5.2 and it reaches for the older chat completions API instead of the newer responses API. Your users hit the same wall with your API, and they blame you for it, not the agent.&lt;/p&gt;

&lt;p&gt;Context Hub is a curated, versioned documentation registry that coding agents can query from the CLI. Instead of guessing, the agent runs &lt;code&gt;chub get your-api/docs --lang python&lt;/code&gt; and gets correct, current documentation written specifically for machines to consume. The repo had been quietly building, but Andrew's LinkedIn announcement on March 5th caught my attention and lit the fuse: the project went to over 1,500 stars in five days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e9gw8gd706ovg35eh1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e9gw8gd706ovg35eh1e.png" alt=" " width="800" height="586"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Star history from &lt;a href="https://www.star-history.com/" rel="noopener noreferrer"&gt;star-history.com&lt;/a&gt;, captured March 10, 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As of writing, the registry already has 68 API providers, including Stripe, OpenAI, Anthropic, Supabase, Firebase, Twilio, Shopify, and AWS too (where I work). PRs are rolling in daily with new submissions. If you maintain a library and it's not in Context Hub yet, someone else is going to write the docs for you. You probably want to be the one who does it!&lt;/p&gt;
&lt;h2&gt;
  
  
  How Context Hub Works
&lt;/h2&gt;

&lt;p&gt;An agent (Claude Code, Codex, Cursor, Kiro, or anything with shell access) needs to call your API. Instead of relying on its training data, it runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chub search &lt;span class="s2"&gt;"your-api"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That returns matching entries from the registry. Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chub get your-company/your-api &lt;span class="nt"&gt;--lang&lt;/span&gt; python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That fetches your curated documentation, written for agent consumption rather than humans. Direct and example-heavy with no marketing fluff. The agent reads it and writes correct code.&lt;/p&gt;

&lt;p&gt;There's also an annotation system. If an agent discovers a workaround or a gotcha while using your API, it can save a note:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chub annotate your-company/your-api &lt;span class="s2"&gt;"Use v2 endpoint for batch operations, v1 has a 100-item limit that isn't documented"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That note persists locally across sessions. Your documentation gets smarter every time an agent uses it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Your API: Step by Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Fork and Clone
&lt;/h3&gt;

&lt;p&gt;Fork &lt;a href="https://github.com/andrewyng/context-hub" rel="noopener noreferrer"&gt;andrewyng/context-hub&lt;/a&gt; on GitHub and clone your fork locally.&lt;/p&gt;

&lt;p&gt;Then...&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create Your Folder Structure
&lt;/h3&gt;

&lt;p&gt;All content lives under &lt;code&gt;content/&lt;/code&gt;. The pattern is &lt;code&gt;author/docs/entry-name/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content/
  your-company/
    docs/
      your-api/
        python/
          DOC.md
        javascript/
          DOC.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your API only has one language, skip the language subfolder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content/
  your-company/
    docs/
      your-api/
        DOC.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your docs are going to be long (and they probably are), plan for reference files from the start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content/
  your-company/
    docs/
      your-api/
        python/
          DOC.md
          references/
            auth.md
            errors.md
            advanced.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reference files are plain markdown, no frontmatter needed. Agents fetch them with &lt;code&gt;chub get your-company/your-api --file references/auth.md&lt;/code&gt; or &lt;code&gt;chub get your-company/your-api --full&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Write Your DOC.md
&lt;/h3&gt;

&lt;p&gt;Every documentation file needs YAML frontmatter followed by the actual content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;your-api&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;One-line&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;that&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shows&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;up&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;results"&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;languages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python"&lt;/span&gt;
  &lt;span class="na"&gt;versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2.1.0"&lt;/span&gt;
  &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;updated-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-03-10"&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;official&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevant,comma,separated,tags"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;source&lt;/code&gt; field matters. Use &lt;code&gt;official&lt;/code&gt; if you're the API provider, &lt;code&gt;maintainer&lt;/code&gt; if you're a core contributor, or &lt;code&gt;community&lt;/code&gt; if you're a user who wrote the docs.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;versions&lt;/code&gt; field tracks the package version on npm or PyPI, not your internal API version number.&lt;/p&gt;

&lt;p&gt;Now for the content. Remember you're not writing docs for humans. You're writing docs for an agent that needs to produce correct code on the first try. Structure it like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Golden Rule&lt;/strong&gt; - State the correct package name, install command, and import pattern. Warn against common mistakes right up front (wrong package names, deprecated imports).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Installation&lt;/strong&gt; - The install command. That's it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initialization&lt;/strong&gt; - How to create a client instance, auth setup, environment variables over hardcoded keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Operations&lt;/strong&gt; - The 3-5 most common operations with complete, runnable code. Every example should include the function call with realistic parameters and the response shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key Patterns&lt;/strong&gt; - Pagination, streaming, retries, webhooks. Only what's relevant to your API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common Mistakes&lt;/strong&gt; - The 3-5 things agents frequently get wrong. This is gold. If you've ever seen an agent produce wrong code for your API, put the fix here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models / Resources / Endpoints&lt;/strong&gt; - List current model names, resource types, or endpoint paths explicitly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Keep the main DOC.md under 500 lines. If you're going past 400, start moving advanced content into reference files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A concrete example of what good looks like:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Acme Payments Python SDK&lt;/span&gt;

&lt;span class="gu"&gt;## Golden Rule&lt;/span&gt;
Always use the official &lt;span class="sb"&gt;`acme-payments`&lt;/span&gt; package from PyPI.

&lt;span class="gs"&gt;**Install:**&lt;/span&gt; &lt;span class="sb"&gt;`pip install acme-payments`&lt;/span&gt;
&lt;span class="gs"&gt;**Import:**&lt;/span&gt; &lt;span class="sb"&gt;`from acme import PaymentsClient`&lt;/span&gt;

&lt;span class="gu"&gt;## Initialization&lt;/span&gt;

import os
from acme import PaymentsClient

client = PaymentsClient(api_key=os.environ["ACME_API_KEY"])

&lt;span class="gu"&gt;## Create a Charge&lt;/span&gt;

charge = client.charges.create(
    amount=2000,        # in cents
    currency="usd",
    source="tok_visa",
    description="Order #1234",
)
print(charge.id)  # "ch_abc123"

&lt;span class="gu"&gt;## Common Mistakes&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Passing amount in dollars instead of cents
&lt;span class="p"&gt;2.&lt;/span&gt; Not handling idempotency keys for retries
&lt;span class="p"&gt;3.&lt;/span&gt; Using test keys in production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare that to what you see too often in regular documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Welcome to the Acme Payments Developer Hub!&lt;/span&gt;

Acme Payments provides a powerful, scalable platform for payment processing.
Let's explore what you can build...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agents don't need introductions or marketing. Lead with code, cover the common cases first, and put edge cases in reference files.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Handle Multiple Versions (If Needed)
&lt;/h3&gt;

&lt;p&gt;If you have breaking changes between major versions, use version subfolders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;content/
  your-company/
    docs/
      your-api/
        v1/
          DOC.md
        v2/
          DOC.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both files share the same &lt;code&gt;name&lt;/code&gt; in the frontmatter. The build system combines them into one registry entry, with the highest version as the default. You can nest language folders inside version folders too.&lt;/p&gt;
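&lt;p&gt;The "highest version wins" behavior is easy to picture. Here's a small sketch of that selection logic (purely illustrative, not Context Hub's actual build code), assuming version folders named &lt;code&gt;v1&lt;/code&gt;, &lt;code&gt;v2&lt;/code&gt;, and so on:&lt;/p&gt;

```python
# Illustrative sketch of "highest version as the default".
# NOT Context Hub's actual build code.

def default_version(folders):
    # Strip the leading "v" and compare numerically, so "v10" beats "v2".
    return max(folders, key=lambda name: int(name.lstrip("v")))

print(default_version(["v1", "v2"]))   # v2
print(default_version(["v2", "v10"]))  # v10
```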

&lt;p&gt;For minor version differences, just document the latest and note any gotchas inline.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Build, Validate, and Test
&lt;/h3&gt;

&lt;p&gt;Install the CLI if you haven't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aisuite/chub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Validate your frontmatter and structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chub build content/your-company/ &lt;span class="nt"&gt;--validate-only&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The validator checks that &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;metadata.languages&lt;/code&gt;, and &lt;code&gt;metadata.versions&lt;/code&gt; all exist in your frontmatter (errors if missing), and warns about missing &lt;code&gt;description&lt;/code&gt; or &lt;code&gt;source&lt;/code&gt; fields. If you get a non-zero exit code, fix the DOC.md and re-run. Once it passes, check that the counts match what you created (if you wrote 2 DOC.md files, the output should show &lt;code&gt;"docs": 2&lt;/code&gt;).&lt;/p&gt;
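&lt;p&gt;The rules the validator applies amount to a simple frontmatter check. Here's a sketch of that kind of check (illustrative only, not the actual &lt;code&gt;chub&lt;/code&gt; implementation):&lt;/p&gt;

```python
# Illustrative frontmatter check, mirroring the rules described above.
# NOT the actual chub validator.

REQUIRED = ["name", "languages", "versions"]   # missing = error
RECOMMENDED = ["description", "source"]        # missing = warning

def check_frontmatter(fm):
    errors, warnings = [], []
    meta = fm.get("metadata", {})
    for field in REQUIRED:
        if field not in fm and field not in meta:
            errors.append(f"missing required field: {field}")
    for field in RECOMMENDED:
        if field not in fm and field not in meta:
            warnings.append(f"missing recommended field: {field}")
    return errors, warnings

errors, warnings = check_frontmatter({
    "name": "your-api",
    "metadata": {"languages": "python", "versions": "2.1.0"},
})
print(errors)    # [] -- passes
print(warnings)  # warns about description and source
```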

&lt;p&gt;Then do a full build and test that your doc is actually fetchable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;chub build content/your-company/ &lt;span class="nt"&gt;-o&lt;/span&gt; /tmp/chub-test/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Submit Your PR
&lt;/h3&gt;

&lt;p&gt;Push your branch and open a PR against &lt;code&gt;andrewyng/context-hub&lt;/code&gt;. Look at the existing PRs for examples of what good submissions look like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hey! This Reads Like a Skill. Because It Is...
&lt;/h2&gt;

&lt;p&gt;If you've been following along and thinking "this whole process could be a skill file that I hand to my agent," you're right. The step-by-step above is structured the same way: gather inputs, create a folder structure, write content following a template, validate, test.&lt;/p&gt;

&lt;p&gt;So I made one. &lt;a href="https://gist.github.com/mikegc-aws/e42dbc17e15575213be846a4e7a3a495" rel="noopener noreferrer"&gt;Here's a &lt;code&gt;create-api-docs&lt;/code&gt; skill&lt;/a&gt; you can drop into your project. Install it and your agent can scaffold a complete Context Hub submission for any API. You give it the API name, the package version, and your source material, and it produces a DOC.md that follows all the conventions above.&lt;/p&gt;

&lt;p&gt;But, and I can't stress this enough, &lt;strong&gt;don't just run the skill and submit the PR.&lt;/strong&gt; The entire point of Context Hub is accuracy, because agents get APIs wrong. If you let an agent generate your docs unchecked, you're feeding the same problem back into the system. The skill gives you a solid first draft and the right structure. You still need to read every line, verify every code example, and check every parameter against your actual API. Your users are trusting these docs to produce correct code. That's a responsibility, not something to automate away entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bother?
&lt;/h2&gt;

&lt;p&gt;Every time a coding agent hallucinates your API, that's a developer who just had a bad experience with your product. They'll spend 20 minutes debugging code that was wrong from the start, and they'll associate that friction with your API, not with the agent.&lt;/p&gt;

&lt;p&gt;Context Hub gives you a way to fix that at the source. Write the docs once, in a format agents can consume, and every agent that uses the registry gets it right.&lt;/p&gt;

&lt;p&gt;The community only found this project five days ago, and it's already at 1,500+ stars, 164 forks, and getting commits daily. The window where your submission is one of the first 100 is closing. If your API or framework isn't in Context Hub yet, go add it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @aisuite/chub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When the Model Is the Machine</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 03 Mar 2026 08:49:34 +0000</pubDate>
      <link>https://forem.com/aws/when-the-model-is-the-machine-25g4</link>
      <guid>https://forem.com/aws/when-the-model-is-the-machine-25g4</guid>
      <description>&lt;p&gt;I want to show you something. A translation app. Clean interface, language selector, input field, a translate button. You type a phrase, pick a target language, hit translate, and the result appears. It works. It feels like an app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ogjkm0m13zctg83vet.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ogjkm0m13zctg83vet.jpg" alt="Generated language tool" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So... thirty seconds before I took this screenshot, the app didn't exist. In fact... there was no codebase. No repository. No designer mocked it up, no developer wrote it, no CI pipeline deployed it. Yes, you guessed right... an AI agent generated it — the layout, the styling, the interaction logic — at runtime, in response to a single prompt typed into a URL bar.&lt;/p&gt;

&lt;p&gt;The agent that built this has exactly two tools. One generates an HTML page. The other tells the browser to swap out pieces of the DOM. That's it. There is no framework underneath, no component library, no state management system. The model holds the state. The model decides what to render. The model &lt;em&gt;is&lt;/em&gt; the application... or maybe it &lt;em&gt;is&lt;/em&gt; the execution environment?&lt;/p&gt;

&lt;p&gt;Yes, this ultimate &lt;strong&gt;everything&lt;/strong&gt; app is a party trick, and I want to be upfront about that. It's slow. It's a concept. But I think it points at something real, and I've spent the last few weeks trying to articulate what that something is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want to play? The code is here: &lt;a href="https://github.com/mikegc-aws/www4" rel="noopener noreferrer"&gt;github.com/mikegc-aws/www4&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A Trillion Dollars of Doubt
&lt;/h2&gt;

&lt;p&gt;Some numbers first.&lt;/p&gt;

&lt;p&gt;In early 2026, the broader software sector fell roughly 29 percent. A &lt;a href="https://www.reuters.com/business/media-telecom/global-software-stocks-hit-by-anthropic-wake-up-call-ai-disruption-2026-02-04/" rel="noopener noreferrer"&gt;Reuters analysis&lt;/a&gt; put the total market value lost at nearly a trillion dollars. Salesforce dropped about 27 percent year-to-date; Workday fell roughly 40 percent after issuing a sluggish revenue outlook (&lt;a href="https://www.reuters.com/business/workday-tumbles-dour-revenue-outlook-amid-ai-threat-2026-02-25/" rel="noopener noreferrer"&gt;Reuters&lt;/a&gt;). Forbes ran the headline "&lt;a href="https://www.forbes.com/sites/donmuir/2026/02/04/300-billion-evaporated-the-saaspocalypse-has-begun/" rel="noopener noreferrer"&gt;The SaaS-Pocalypse Has Begun&lt;/a&gt;." The Guardian asked whether the market was headed toward a "&lt;a href="https://www.theguardian.com/australia-news/2026/feb/21/what-would-share-stock-market-saaspocalypse-mean-saas-apocalypse-meaning" rel="noopener noreferrer"&gt;SaaS-pocalypse&lt;/a&gt;." Private equity firms started circling, &lt;a href="https://m.economictimes.com/tech/technology/pe-firms-up-saas-buyouts-as-ai-resets-valuations-biz-models/articleshow/128922439.cms" rel="noopener noreferrer"&gt;buying up SaaS companies&lt;/a&gt; at reset valuations.&lt;/p&gt;

&lt;p&gt;The dominant narrative was that AI is about to eat traditional SaaS. Anthropic launched tools aimed at legal workflows, and &lt;a href="https://www.thetimes.com/business/companies-markets/article/relx-loses-over-6bn-in-value-as-anthropic-launches-legal-work-ai-t9w2k0j58" rel="noopener noreferrer"&gt;RELX lost over six billion pounds in value&lt;/a&gt; in a single session. Business Insider declared that "&lt;a href="https://www.businessinsider.com/software-ate-world-now-ai-eating-software-saas-anthropic-2026-2" rel="noopener noreferrer"&gt;Software ate the world. Now AI is eating software.&lt;/a&gt;" The fear was specific: if AI agents can orchestrate workflows directly, seat-based subscriptions lose their logic. Why pay per-user for five tools when one agent delivers the outcome?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Markets overshoot on fear the same way they overshoot on hype, and there were plenty of non-AI factors in the mix — enterprise budgets tightening, years of inflated valuations correcting, subscription growth already slowing before any agent entered the picture. Salesforce's CEO &lt;a href="https://www.ft.com/content/b74b8227-d7cb-4976-ba95-a3a27b79cbdd" rel="noopener noreferrer"&gt;publicly dismissed&lt;/a&gt; the "SaaSpocalypse" framing. Bank of America called the selloff "&lt;a href="https://fortune.com/2026/02/04/why-saas-stocks-tech-selloff-freefall-like-deepseek-2025-overblown-paradox-irrational/" rel="noopener noreferrer"&gt;overblown&lt;/a&gt;." A &lt;a href="https://www.bain.com/insights/why-saas-stocks-have-dropped-and-what-it-signals-for-softwares-next-chapter/" rel="noopener noreferrer"&gt;Bain analysis&lt;/a&gt; argued that the drop signals a transition, not an extinction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I don't think the selloff proves that AI will displace SaaS. But I think it surfaces a question worth taking seriously: &lt;em&gt;what happens to the value of prebuilt software when the efficacy of generated software keeps rising?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That question has two dimensions, and they're often conflated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where SaaS Goes From Here
&lt;/h3&gt;

&lt;p&gt;The first dimension is &lt;strong&gt;AI-assisted development&lt;/strong&gt;. Developers using coding assistants ship faster — &lt;a href="https://www.itransition.com/software-development/statistics" rel="noopener noreferrer"&gt;85 percent now use some form of AI tooling&lt;/a&gt;. This makes software cheaper to produce, which puts pressure on vendors who charge premium prices for what is, at its core, code running on servers. If the cost of building software falls, the price customers will pay for it falls too. This is deflationary, but it's evolutionary. It's the same dynamic that played out when cloud infrastructure commoditized hosting, or maybe when open-source commoditized databases. SaaS companies adapted before. Many will adapt again.&lt;/p&gt;

&lt;p&gt;The second dimension is different in kind. &lt;strong&gt;AI agents don't just help build software — they can replace the need for it.&lt;/strong&gt; If a sales team uses an agent that can pull data from a CRM, draft an email, update a pipeline, and generate a forecast, the question stops being "which SaaS tool should we buy?" and becomes "do we need the tool at all?" The value shifts from the software to the outcome. From the seat to the result. Pricing models are already following — &lt;a href="https://www.pymnts.com/news/artificial-intelligence/2026/ai-moves-saas-subscriptions-consumption/" rel="noopener noreferrer"&gt;PYMNTS reports&lt;/a&gt; a shift from subscription to usage-based pricing, and IT Pro has coined "&lt;a href="https://www.itpro.com/technology/artificial-intelligence/what-is-outcome-as-agentic-solution-oaas" rel="noopener noreferrer"&gt;Outcome as Agentic Solution&lt;/a&gt;" as a category.&lt;/p&gt;

&lt;p&gt;This second dimension is the one the market is pricing in. Not because it's happening everywhere today, but because the trajectory raises the question. And my little party trick demo — an application that didn't exist until someone asked for it — is a small, imperfect illustration of what that trajectory might look like at the limit.&lt;/p&gt;

&lt;p&gt;Project this forward five years. Bain's read — transition, not extinction — is probably closer to right. But I'd expect the restructuring to be significant. The companies that thrive will be those that embed AI deeply enough to become platforms for agent-driven workflows rather than destinations for human users clicking through interfaces. The ones that don't will face the same pressure the on-premise vendors felt a decade ago: not a sudden death, but a slow loss of relevance as the world builds around them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the Machine
&lt;/h2&gt;

&lt;p&gt;Let's go back to my party trick translation app. I want to walk through what actually happens when you load that page, because the architecture is much simpler than you might expect, and that simplicity reveals something about where software might be going.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscagqb9kdnbmmbbm9a21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscagqb9kdnbmmbbm9a21.png" alt="www4 Architecture Diagram" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a single Python file — about 550 lines — running a standard library HTTP server. When you visit the URL with a prompt (say, &lt;code&gt;/?prompt=language+translation+app&lt;/code&gt;), the server does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It creates a unique session and assigns it an AI agent — specifically an agent built with &lt;a href="https://strandsagents.com/latest/" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt; and Claude Opus 4.6 running via Amazon Bedrock. (I work for AWS, and that's the way I roll.)&lt;/li&gt;
&lt;li&gt;It serves a lightweight shell page: an empty &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;, a loading spinner, and a block of vanilla JavaScript that knows how to receive and render content.&lt;/li&gt;
&lt;li&gt;The shell page immediately fires a POST request back to the server with the session ID and the prompt.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent then does what it does. It reads the prompt — "Generate a page for: language translation app" — and calls its first tool: &lt;code&gt;render_page&lt;/code&gt;. This tool takes three arguments: a title, HTML body content, and CSS. The agent generates all three. The server returns them as JSON. The shell page injects the HTML into the DOM and the CSS into a style tag. The spinner fades out. The app appears.&lt;/p&gt;

&lt;p&gt;Here is the &lt;code&gt;render_page&lt;/code&gt; tool in its entirety:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;render_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invocation_state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;render&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;css&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Page rendered successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Nothing else. There is no template engine, no component tree, no virtual DOM. The model produces the markup directly.&lt;/p&gt;

&lt;p&gt;Now you interact. You type a phrase, select a language, click translate. The shell page captures that event through plain event delegation — it listens for form submissions, button clicks, link clicks, checkbox toggles, select changes, and enter keypresses. It formats the event into a structured message (e.g., &lt;code&gt;FORM SUBMIT [translate-form]: {"text": "hello", "language": "Spanish"}&lt;/code&gt;) and sends it back to the server.&lt;/p&gt;
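&lt;p&gt;That message format is simple enough to sketch. In www4 this formatting lives in the shell page's JavaScript; here it is in Python for illustration:&lt;/p&gt;

```python
import json

# Illustrative sketch of the event-to-message formatting described above.
# In www4 this happens in the shell page's JavaScript, not in Python.

def format_event(kind, element_id, payload):
    # e.g. 'FORM SUBMIT [translate-form]: {"text": "hello", ...}'
    return f"{kind} [{element_id}]: {json.dumps(payload)}"

msg = format_event("FORM SUBMIT", "translate-form",
                   {"text": "hello", "language": "Spanish"})
print(msg)  # FORM SUBMIT [translate-form]: {"text": "hello", "language": "Spanish"}
```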

&lt;p&gt;The agent receives this message as the next turn in its conversation. It has full context — it knows what page it generated, what elements exist, what IDs they have. It decides how to respond. For a translation result, it probably calls the second tool: &lt;code&gt;update_elements&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ToolContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invocation_state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;updates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Elements updated successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tool takes a list of &lt;code&gt;{id, html}&lt;/code&gt; pairs and tells the frontend to replace the innerHTML of each element by ID. The result area updates. The rest of the page stays put. It's a targeted DOM patch, decided entirely by the model.&lt;/p&gt;
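&lt;p&gt;The patch step itself is just a keyed replace. A toy Python stand-in for the browser-side innerHTML swap (the real work happens in the shell page's JavaScript):&lt;/p&gt;

```python
def apply_updates(dom: dict, updates: list) -> dict:
    # dom maps element IDs to their current innerHTML; each update
    # replaces exactly one entry and leaves the rest of the page alone.
    for update in updates:
        dom[update["id"]] = update["html"]
    return dom
```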

&lt;p&gt;So... &lt;strong&gt;there is no application state anywhere except inside the model's context window.&lt;/strong&gt; The agent doesn't write to a database. It doesn't set session variables. It doesn't maintain a state object. The conversation &lt;em&gt;is&lt;/em&gt; the state. Every interaction adds a turn, and the model reasons over the full history to decide what to render next.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Would Make It Better
&lt;/h3&gt;

&lt;p&gt;Its limitations with today's models are obvious. Every interaction requires a full round trip to the model, which means latency measured in seconds rather than milliseconds. The agent regenerates its understanding of the page from conversation context on every turn, which is wasteful. There's no persistence — close the tab and the session is gone.&lt;/p&gt;

&lt;p&gt;Some of these issues are solvable within the same paradigm. The point being that this isn't even as good as it could be:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server-side DOM cache.&lt;/strong&gt; The most obvious improvement. Instead of relying entirely on the model's context to remember the current page state, the server could maintain a representation of the DOM. The agent could diff against it, and the server could validate that element IDs in &lt;code&gt;update_elements&lt;/code&gt; calls actually exist. This would reduce errors and allow the model to work with a smaller context.&lt;/p&gt;
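&lt;p&gt;That validation step could be as small as this (a hypothetical helper, not part of www4):&lt;/p&gt;

```python
def validate_updates(updates: list, known_ids: set) -> bool:
    # Reject any update_elements call that targets an element ID
    # absent from the server's cached view of the DOM.
    missing = [u["id"] for u in updates if u["id"] not in known_ids]
    if missing:
        raise ValueError(f"Unknown element IDs: {missing}")
    return True
```

&lt;p&gt;The failure message could even be fed straight back to the agent as a corrective turn.&lt;/p&gt;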

&lt;p&gt;&lt;strong&gt;Streaming generation.&lt;/strong&gt; Right now the agent generates the full HTML payload, then the server sends it. A streaming approach — where HTML is sent to the browser as the model produces it — would dramatically improve perceived performance. The page could progressively render, much like a server-side rendered page loading over a slow connection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid rendering.&lt;/strong&gt; Some interactions don't need the model at all. A dropdown menu opening, a tab switching between already-generated content, a tooltip appearing — these could be handled by lightweight generated client-side JavaScript, with the agent only invoked for decisions that require reasoning. The system prompt could instruct the agent to include specific interaction patterns that the shell page handles natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching and prefills.&lt;/strong&gt; For common application patterns — a form, a data table, a navigation layout — the agent could work from cached partial outputs rather than generating from scratch every time. The model is doing redundant work if it invents a fresh CSS reset for every page.&lt;/p&gt;

&lt;p&gt;None of these improvements change the fundamental architecture. The model remains the runtime. The application remains ephemeral. But the experience moves from "party trick" to something you could genuinely use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real World Is Already Moving
&lt;/h2&gt;

&lt;p&gt;My demo is a solo experiment, but the ideas behind it are showing up in serious projects. Two in particular are worth watching.&lt;/p&gt;

&lt;h3&gt;
  
  
  AG-UI: A Protocol for Agent-Driven Interfaces
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/ag-ui-protocol/ag-ui" rel="noopener noreferrer"&gt;AG-UI protocol&lt;/a&gt; emerged from CopilotKit's work and has since attracted first-party support from Microsoft, Google, AWS, and others. It's an open, event-based protocol that standardizes how AI agents connect to frontend applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidhm26gfj5vowqgmue5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidhm26gfj5vowqgmue5i.png" alt="AG-UI" width="800" height="340"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image source: &lt;a href="https://github.com/ag-ui-protocol/ag-ui" rel="noopener noreferrer"&gt;AG-UI Protocol&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The core insight is that agents need a structured way to communicate with UIs that goes beyond dumping text into a chat window. AG-UI defines roughly 16 event types that an agent backend can emit — events that represent things like "update this piece of state," "render this component," or "request human input before proceeding." The frontend listens for these events and renders accordingly.&lt;/p&gt;

&lt;p&gt;This is the same pattern as my demo, but formalized and generalized. Where www4 has two hand-rolled tools and a bespoke shell page, AG-UI provides a protocol layer that lets any agent framework talk to any frontend. It supports SSE, WebSockets, and webhooks for transport. It includes middleware for loose format matching, so that agents built in different frameworks can connect without perfect specification compliance.&lt;/p&gt;

&lt;p&gt;AG-UI sits in a deliberate position in an emerging stack. MCP (Model Context Protocol) gives agents access to tools. A2A (Agent-to-Agent) lets agents communicate with each other. AG-UI brings agents into the user interface. And now MCP itself is moving toward UI as a first-class concept — the &lt;a href="https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/" rel="noopener noreferrer"&gt;MCP Apps specification&lt;/a&gt; defines a standard for tools to declare UI resources alongside their capabilities, rendered in sandboxed iframes with structured communication back to the host. The specification's own language is telling: it describes an "agentic app runtime." Together, these protocols describe a world where applications are assembled from agent capabilities rather than compiled from source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP-Use: From Protocol to Interface
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/mcp-use/mcp-use-ts" rel="noopener noreferrer"&gt;mcp-use-ts&lt;/a&gt; project takes a different but complementary approach — and is emerging as one of the first concrete implementations of the &lt;a href="https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps/" rel="noopener noreferrer"&gt;MCP Apps extension&lt;/a&gt;. Where AG-UI standardizes the protocol between agents and UIs, mcp-use-ts focuses on making it trivially easy to build interfaces on top of MCP servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoslzhrp01hkkmqlr2m3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoslzhrp01hkkmqlr2m3.png" alt="mcp-use-ts" width="800" height="319"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image source: &lt;a href="https://github.com/mcp-use/mcp-use-ts/tree/main" rel="noopener noreferrer"&gt;mcp-use-ts&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Its most interesting feature is automatic UI generation from tool definitions. You define an MCP tool with a Zod schema — say, a tool that queries a database with parameters for table name, date range, and output format — and the framework generates an interactive form for that tool automatically. No frontend code needed.&lt;/p&gt;

&lt;p&gt;But it goes further than auto-generated forms. Developers can build custom React widgets that are served alongside MCP tools as resources. This means an MCP server can expose not just capabilities but also the interfaces for using those capabilities. A database MCP server could come with a query builder UI. A monitoring MCP server could include a dashboard.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;create-mcp-use-app&lt;/code&gt; scaffolding tool lets you go from zero to a working application with an MCP backend and an auto-generated frontend in under a minute. The development server includes hot reload for both tools and widgets, and a built-in inspector that functions as both a debugging tool and a prototype UI.&lt;/p&gt;

&lt;p&gt;This is relevant because it represents a concrete, usable step toward the architecture my demo illustrates. Instead of hand-writing a frontend and a backend and the glue between them, you define capabilities and let the tooling generate the interface. The "application" is a thin layer over agent capabilities, generated rather than authored.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Frontier
&lt;/h2&gt;

&lt;p&gt;Is this post long enough for a conclusion? Let me pull these threads together.&lt;/p&gt;

&lt;p&gt;My demo shows an agent that generates a complete, interactive application at runtime from a single prompt. It's slow and it's limited, but it works. The model acts as the runtime, the state engine, and the decision-maker. There is no application until someone asks for one, and then there is exactly the application they asked for.&lt;/p&gt;

&lt;p&gt;The SaaS selloff shows a market waking up to the possibility that static, prebuilt software is losing its premium. When agents can orchestrate outcomes directly, the value of the intermediary tools drops. The companies that survive will be those that become platforms for agent-driven work, not destinations for human-driven workflows.&lt;/p&gt;

&lt;p&gt;AG-UI shows the infrastructure layer forming — a standardized way for agents to control user interfaces, backed by the biggest names in the industry. MCP-Use shows the developer experience becoming real — tools that let you go from capability definition to working UI without writing frontend code.&lt;/p&gt;

&lt;p&gt;These are all points on the same line. And the line points toward a future where software is generated, not built. Where applications are ephemeral, not persistent. Where the currency is the idea — "I need a translation app," "show me a sales dashboard," "build me a tool that tracks my inventory" — and the execution of that idea is handled in real time by models that understand what you want and can produce it on demand.&lt;/p&gt;

&lt;p&gt;This isn't a prediction about next quarter. The models are still too slow for production use in this mode. Context windows, while generous, still impose limits on session complexity. The quality of generated interfaces, while impressive, isn't yet reliable enough for mission-critical work. These are real constraints.&lt;/p&gt;

&lt;p&gt;But every one of them is improving on a curve that software engineers will recognize. Models are getting faster. Context windows are growing. Output quality is climbing. The gap between "interesting demo" and "production system" is closing on a timeline measured in years, not decades.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Means for Builders
&lt;/h3&gt;

&lt;p&gt;I want to be clear: this is not a story about developers becoming obsolete. It's a story about what developers &lt;em&gt;do&lt;/em&gt; shifting.&lt;/p&gt;

&lt;p&gt;If the value of writing code is falling — and it is, by every measure — then the value of knowing &lt;em&gt;what to build&lt;/em&gt; is rising. Architecture, domain expertise, problem framing, verification, orchestration. These are the skills that appreciate as the cost of execution drops toward zero.&lt;/p&gt;

&lt;p&gt;The developers who thrive in this landscape will be the ones who stop thinking of themselves as people who write software and start thinking of themselves as people who &lt;em&gt;define intent and verify outcomes&lt;/em&gt;. The code becomes the cheapest part. The thinking becomes the most expensive.&lt;/p&gt;

&lt;p&gt;And there's a broader opportunity here, one that goes beyond the software industry. If generating an application becomes as easy as describing what you want, then the bottleneck on innovation moves from "can we build it?" to "can we imagine it?" The currency becomes ideas. Good ones, specific ones, ones grounded in real problems and real domain knowledge.&lt;/p&gt;

&lt;p&gt;That's not a future to fear. It's a future to build toward.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The www4 project referenced in this post is available at &lt;a href="https://github.com/mikegc-aws/www4" rel="noopener noreferrer"&gt;github.com/mikegc-aws/www4&lt;/a&gt;. The AG-UI protocol is at &lt;a href="https://github.com/ag-ui-protocol/ag-ui" rel="noopener noreferrer"&gt;github.com/ag-ui-protocol/ag-ui&lt;/a&gt;. MCP-Use is at &lt;a href="https://github.com/mcp-use/mcp-use-ts" rel="noopener noreferrer"&gt;github.com/mcp-use/mcp-use-ts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>softwaredevelopment</category>
      <category>python</category>
    </item>
    <item>
      <title>Software 3.1? - AI Functions</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 24 Feb 2026 06:19:23 +0000</pubDate>
      <link>https://forem.com/aws/software-31-ai-functions-5acn</link>
      <guid>https://forem.com/aws/software-31-ai-functions-5acn</guid>
      <description>&lt;h3&gt;
  
  
  Watch: AI Functions Deep Dive:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://youtu.be/ggWaZO13onc" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g6evseeqi9w388ds7cb.jpg" alt="AI Functions Deep Dive" width="800" height="450"&gt;&lt;/a&gt;                             &lt;/p&gt;

&lt;p&gt;Andrej Karpathy has a version numbering scheme for how software gets written. Software 1.0 is code written by humans. Software 2.0 is neural network weights learned through optimization. Software 3.0 is prompting LLMs in plain language, which sounds nicer than calling it vibe coding (which, fun fact, is also a Karpathy-coined term).&lt;/p&gt;

&lt;p&gt;Of course, Software 3.0 is real. Millions of people are using it daily. Tools like Kiro, Cursor, Claude Code, and ChatGPT let you describe what you want and get code back. Karpathy &lt;a href="https://www.ycombinator.com/library/MW-andrej-karpathy-software-is-changing-again" rel="noopener noreferrer"&gt;emphasizes&lt;/a&gt; a ‘generation–verification loop’ in partial-autonomy tools: the model generates changes, a human verifies them, and the work iterates.&lt;/p&gt;

&lt;p&gt;But there's something more fundamental going on than who reviews what. Look at what the LLM actually produces in Software 3.0: text. Code as strings. JSON payloads. Markdown documents. The model generates, you receive text, and then &lt;em&gt;you&lt;/em&gt; do everything else -- integrate it into your codebase, write tests, run CI, deploy. If you're disciplined about verification, you write test cases, but those run before deployment. Once the code ships, the tests don't execute again. The LLM's involvement ends when it hands you the output. Your running software has no relationship with the model that helped write it.&lt;/p&gt;

&lt;p&gt;Now consider a different arrangement. The LLM generates code that actually &lt;em&gt;runs&lt;/em&gt; inside your application -- at call time, every time the function is invoked. It returns native Python objects -- DataFrames, Pydantic models, database connections -- not JSON strings you have to parse. And verification isn't a gate you pass before deployment; it's post-conditions that execute on every call, feeding failures back to the model for automatic retries. This changes three things at once: &lt;em&gt;where&lt;/em&gt; AI fits in your software (runtime, not just development time), &lt;em&gt;what&lt;/em&gt; it produces (live objects you can call methods on, not serialized text), and &lt;em&gt;how&lt;/em&gt; you trust it (continuous automated verification, not one-time human review).&lt;/p&gt;

&lt;p&gt;That's the experiment at the heart of &lt;a href="https://github.com/strands-labs/ai-functions" rel="noopener noreferrer"&gt;AI Functions&lt;/a&gt;, a new project from Strands Labs built on the &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;. You write a Python function with a natural language specification instead of implementation code. You attach post-conditions -- plain Python assertions that define what correct output looks like. When the function is called, the LLM generates code, executes it in your Python process, returns the result as a native object, and the post-conditions verify it. If verification fails, the system retries with the error as feedback. The human never inspects the generated code. The post-conditions do the inspecting -- every time.&lt;/p&gt;

&lt;p&gt;If Software 3.0 is "human prompts, LLM generates, human verifies," then I propose that AI Functions are &lt;strong&gt;Software 3.1: human specifies, LLM generates and executes, machine verifies -- at runtime.&lt;/strong&gt; Same paradigm -- natural language as the programming interface. But the execution model is different. The LLM isn't producing text for a human to integrate. It's producing code that runs, returning objects your application uses directly, verified by post-conditions on every call. Software 3.1 is a "point release," not a major version bump. The upgrade is in what happens after generation.&lt;/p&gt;

&lt;p&gt;This post is a deep dive into what AI Functions are, how they work, and what automated verification makes possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Functions Are
&lt;/h2&gt;

&lt;p&gt;AI Functions is built on top of the &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;, an open-source framework for building AI agents. AI Functions introduces a single core abstraction: the &lt;code&gt;@ai_function&lt;/code&gt; decorator. You write a Python function with a natural language specification instead of an implementation body. When the function is called, an LLM generates the implementation, executes it, and returns the result. Optionally -- and this is the important part -- you attach post-conditions that validate the output and trigger automatic retries if validation fails.&lt;/p&gt;

&lt;p&gt;The simplest example looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;translate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Translate the text below to the following language: {lang}.
    {text}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;translate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The quarterly results exceeded expectations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;French&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You call &lt;code&gt;translate_text&lt;/code&gt; like any Python function. The decorator intercepts the call, constructs a prompt from the docstring (substituting the arguments), sends it to an LLM, and returns the result as a typed Python string. From the caller's perspective, it's just a function that takes a string and returns a string. The fact that an LLM executed it is an implementation detail.&lt;/p&gt;
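&lt;p&gt;A toy version of just the interception step makes the mechanics visible. This sketch stops where the real decorator would call the model -- it binds the arguments and substitutes them into the docstring, which is the prompt:&lt;/p&gt;

```python
import functools
import inspect

def ai_function_sketch(func):
    # Illustrative stand-in for @ai_function: build the prompt from the
    # docstring template and the bound call arguments. The real decorator
    # sends this prompt to an LLM; here we just return the prompt itself.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        return func.__doc__.format(**bound.arguments)
    return wrapper

@ai_function_sketch
def translate_text(text: str, lang: str) -> str:
    """Translate the text below to the following language: {lang}.
    {text}
    """
```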

&lt;p&gt;This on its own is still kinda Software 3.0 -- prompt in, result out. It's pleasant, but it isn't where AI Functions get interesting. They get interesting when you add structure, validation, code execution, multi-agent composition, and async workflows. That's where 3.1 begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured Output with Pydantic
&lt;/h2&gt;

&lt;p&gt;AI Functions can return arbitrary typed objects, not just strings. When you specify a Pydantic model as the return type, the framework enforces schema compliance automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;attendees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key_decisions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;action_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_meeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Summarize the following meeting transcript in less than 50 words.
    &amp;lt;transcript&amp;gt;
    {transcript}
    &amp;lt;/transcript&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;summarize_meeting(transcript)&lt;/code&gt; and you get back a &lt;code&gt;MeetingSummary&lt;/code&gt; object with typed fields, IDE autocomplete, and Pydantic's built-in validation. The LLM's output is parsed into the Pydantic model, and if the structure doesn't match, the framework handles the retry. From the caller's perspective, the function returns a typed Python object.&lt;/p&gt;
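&lt;p&gt;A stdlib-only sketch of the parse-or-fail step shows what a schema retry keys off (the real framework does this with Pydantic; the helper below is mine):&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = ("attendees", "key_decisions", "action_items")

def parse_summary(raw: str) -> dict:
    # Parse the model's output and fail loudly on a schema mismatch;
    # in the real framework that failure triggers an automatic retry.
    data = json.loads(raw)
    missing = [k for k in REQUIRED_FIELDS if k not in data]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    return data
```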

&lt;p&gt;This is a pattern that frameworks like &lt;a href="https://python.useinstructor.com/" rel="noopener noreferrer"&gt;Instructor&lt;/a&gt; have established. AI Functions' contribution isn't in structured output itself -- it's in how structured output composes with everything else in the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Conditions
&lt;/h2&gt;

&lt;p&gt;Post-conditions are the core of what makes AI Functions more than a prompting framework. A post-condition is a Python function that validates the output of an AI Function. If validation fails, the error message feeds back to the LLM, which retries. Multiple post-conditions run in parallel, so the LLM receives all failure signals at once and can address them in a single retry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;attendees&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key_decisions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;action_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key_decisions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key decisions should total under 50 words, got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_quality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Check if the meeting summary below satisfies the following criteria:
    - Key decisions must be specific and actionable, not vague
    - Action items must each name a responsible person
    &amp;lt;decisions&amp;gt;{response.key_decisions}&amp;lt;/decisions&amp;gt;
    &amp;lt;actions&amp;gt;{response.action_items}&amp;lt;/actions&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_quality&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarize_meeting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MeetingSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Summarize the following meeting transcript in less than 50 words.
    &amp;lt;transcript&amp;gt;
    {transcript}
    &amp;lt;/transcript&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two things to notice here. First, &lt;code&gt;check_length&lt;/code&gt; is a plain Python function that raises an &lt;code&gt;AssertionError&lt;/code&gt; on failure. This is a deterministic, inspectable validation -- no LLM involved, no ambiguity. Second, &lt;code&gt;check_quality&lt;/code&gt; is itself an AI Function that returns a &lt;code&gt;PostConditionResult&lt;/code&gt; -- a Pydantic model with &lt;code&gt;passed&lt;/code&gt; (bool) and &lt;code&gt;message&lt;/code&gt; (str) fields. It uses an LLM to evaluate whether the summary meets quality criteria that are hard to express as assertions -- specificity, actionability, attribution. An AI Function validating another AI Function. The framework treats both identically: if either fails, the error goes back to the generating LLM as feedback.&lt;/p&gt;
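&lt;p&gt;For reference, &lt;code&gt;check_length&lt;/code&gt; (referenced above but defined earlier in the article) could look something like this -- a sketch, not the article's exact code, and the &lt;code&gt;summary&lt;/code&gt; field name is an assumption:&lt;/p&gt;

```python
def check_length(response) -> None:
    # Deterministic post-condition: a plain Python function that raises
    # AssertionError on failure. The assertion message becomes the feedback
    # the generating LLM sees on its next attempt.
    word_count = len(response.summary.split())
    assert word_count < 50, f"Summary is {word_count} words; the limit is 50"
```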

&lt;p&gt;This creates a self-correcting loop. The generating LLM doesn't need to get it right on the first attempt. It needs to be able to improve given specific feedback about what went wrong. In practice, this means the developer's job shifts from crafting perfect prompts to writing good post-conditions -- a fundamentally different skill.&lt;/p&gt;
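&lt;p&gt;Conceptually, the loop looks something like this -- a simplified sketch of the generate/validate/retry cycle, not the framework's actual implementation:&lt;/p&gt;

```python
def run_with_post_conditions(generate, post_conditions, max_attempts=3):
    # Simplified sketch of the self-correcting loop: generate a candidate,
    # run every post-condition, and on failure feed the error message back
    # into the next generation attempt.
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        try:
            for check in post_conditions:
                check(candidate)
        except AssertionError as err:
            feedback = str(err)
            continue
        return candidate
    raise RuntimeError(f"All {max_attempts} attempts failed: {feedback}")
```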

&lt;p&gt;Of course, we need to understand what is happening here: this also opens us up to retry loops "hidden" inside our projects! We will need solid monitoring and observability in place before we lean on this too hard.&lt;/p&gt;
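&lt;p&gt;A lightweight first step is simply making those failures visible -- for example, wrapping a post-condition so every failure is counted and logged before it propagates (a hypothetical helper, not a framework feature):&lt;/p&gt;

```python
import functools
import logging

def count_failures(check):
    # Wrap a post-condition so each failure is logged and counted before
    # propagating, giving basic visibility into otherwise hidden retries.
    @functools.wraps(check)
    def wrapped(*args, **kwargs):
        try:
            return check(*args, **kwargs)
        except AssertionError as err:
            wrapped.failures += 1
            logging.warning("Post-condition %s failed: %s", check.__name__, err)
            raise
    wrapped.failures = 0
    return wrapped
```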

&lt;h2&gt;
  
  
  Returning Native Python Objects
&lt;/h2&gt;

&lt;p&gt;Most LLM frameworks force output through JSON serialization. AI Functions can return non-serializable Python objects -- DataFrames, SymPy expressions, database connections, anything -- because the generated code runs in the same Python interpreter as your application.&lt;/p&gt;

&lt;p&gt;This is the feature that makes AI Functions feel qualitatively different from other frameworks. Consider a format-agnostic data loader that handles purchase records regardless of how they're stored:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_invoice_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Post-condition: validate DataFrame structure.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;purchase_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}.&lt;/span&gt;&lt;span class="nf"&gt;issubset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_integer_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quantity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity must be an integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_float_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price must be a float&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_datetime64_any_dtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;purchase_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;duplicated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;purchase_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqlite3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_invoice_dataframe&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;import_invoice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The file `{path}` contains purchase logs. Extract them into a DataFrame with columns:
    - product_name (str)
    - quantity (int)
    - price (float)
    - purchase_date (datetime)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Call &lt;code&gt;import_invoice('data/orders.json')&lt;/code&gt; and you get back an actual Pandas DataFrame -- not a JSON representation of one, not a serialized string, a real DataFrame object that you can immediately call &lt;code&gt;.describe()&lt;/code&gt;, &lt;code&gt;.groupby()&lt;/code&gt;, or &lt;code&gt;.plot()&lt;/code&gt; on. Hand it a SQLite file instead and the same function inspects the database schema, writes the appropriate SQL queries, and returns the same validated DataFrame structure.&lt;/p&gt;

&lt;p&gt;The developer writes zero format-specific parsing logic. The natural language specification says what the output should contain. The post-conditions verify structural invariants. The LLM figures out how to get from an opaque file to a validated DataFrame, dynamically, at call time.&lt;/p&gt;

&lt;p&gt;This works because the framework provides the LLM with a Python executor tool that shares the same runtime as the calling code. The LLM generates Python code, executes it inside your process, and returns the result object directly. No serialization round-trip. The &lt;code&gt;code_execution_mode="local"&lt;/code&gt; parameter is an explicit opt-in -- the framework doesn't run arbitrary generated code by default, and you declare which imports are allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Execution and the Trust Model
&lt;/h2&gt;

&lt;p&gt;The code execution model deserves closer attention because it reveals AI Functions' deliberate approach to trust.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;code_execution_mode="local"&lt;/code&gt; is enabled, the LLM can generate and execute Python code within your interpreter. This is powerful -- it's what enables returning DataFrames, running computations, and interacting with the local environment. It's also a security surface. The framework mitigates this through several mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit opt-in.&lt;/strong&gt; Code execution is off by default. You must enable it per function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import restrictions.&lt;/strong&gt; &lt;code&gt;code_executor_additional_imports&lt;/code&gt; explicitly declares which packages the generated code may use. Anything not listed is unavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-condition verification.&lt;/strong&gt; The output is validated regardless of how it was produced. Even if the generated code takes an unexpected path, the post-conditions catch invalid results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the honest assessment is that this is a tradeoff. You're executing LLM-generated code in your process. The framework uses AST-based validation of generated code with controlled imports and timeouts, which attempts to prevent malicious imports and block dangerous operations. But this doesn't offer true sandboxing and doesn't prevent resource exhaustion (infinite loops, excessive memory allocation). For an experiment, with appropriate constraints, this is a reasonable choice. For production workloads, the project recommends running AI Functions inside a container or other isolated environment to provide process-level isolation.&lt;/p&gt;
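&lt;p&gt;To make the idea concrete, allow-list import screening over the AST takes only a few lines -- an illustrative sketch, not the framework's actual validator:&lt;/p&gt;

```python
import ast

ALLOWED_IMPORTS = {"pandas", "sqlite3", "json"}

def validate_generated_code(source: str) -> None:
    # Illustrative sketch of AST-based import screening: walk the parsed
    # tree and reject any import whose top-level package is not on the
    # allow-list, before the generated code is ever executed.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        for name in names:
            if name not in ALLOWED_IMPORTS:
                raise ValueError(f"Disallowed import: {name}")
```

&lt;p&gt;Note what this does and doesn't cover: it catches a disallowed &lt;code&gt;import os&lt;/code&gt; before execution, but it cannot stop an allowed package from looping forever or exhausting memory -- which is exactly why process-level isolation is still recommended.&lt;/p&gt;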

&lt;h2&gt;
  
  
  Multi-Agent Composition
&lt;/h2&gt;

&lt;p&gt;Results from AI Functions compose naturally through regular Python. Since AI Functions return typed objects, you chain them the same way you chain any functions -- by passing outputs as inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_sales_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Load the sales data from `{path}` and compute a summary DataFrame
    with total revenue, average order value, and top 5 products by volume.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_executive_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write a concise executive summary for {company} highlighting key trends
    and recommendations based on the provided financial data.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;financials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;analyze_sales_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/q4_sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_executive_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Corp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Top Products:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just ordinary function composition. The first function returns a DataFrame; the second takes a DataFrame as input. No special state-passing machinery needed.&lt;/p&gt;

&lt;p&gt;For more complex workflows, AI Functions can be used as &lt;em&gt;tools&lt;/em&gt; by other agents, enabling orchestration patterns where a coordinator delegates to specialized sub-agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the web for a topic and return a cited summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;websearch_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_citations&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Perform a web search on the following topic and return a summary.
    Every claim must be supported by citations to sources.
    &amp;lt;query&amp;gt;{query}&amp;lt;/query&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Suggest the plan and organization of a report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;websearch_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ReportPlan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a plan to write a report on: {topic}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;report_planner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_section&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write a report on the following topic: {topic}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator sees &lt;code&gt;report_planner&lt;/code&gt;, &lt;code&gt;search_agent&lt;/code&gt;, and &lt;code&gt;report.add_section&lt;/code&gt; as tools it can call. Each sub-agent runs with its own post-conditions, so the orchestrator receives validated results. The search agent's citations are verified before its results reach the orchestrator. This creates a hierarchy of validated agents -- post-conditions compose across the multi-agent system.&lt;/p&gt;
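&lt;p&gt;The &lt;code&gt;check_citations&lt;/code&gt; post-condition isn't shown above; a deterministic version could be as simple as requiring at least one URL in the output -- a hypothetical sketch:&lt;/p&gt;

```python
import re

def check_citations(summary: str) -> None:
    # Hypothetical deterministic post-condition: require at least one
    # http(s) URL so every claim can be traced back to a source.
    urls = re.findall(r"https?://\S+", summary)
    assert urls, "Summary contains no citations (no URLs found)"
```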

&lt;h2&gt;
  
  
  Async Execution and Parallel Workflows
&lt;/h2&gt;

&lt;p&gt;AI Functions can be defined as &lt;code&gt;async&lt;/code&gt;, which enables parallel execution of independent tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;websearch_tool&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_market&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research and summarize the competitive landscape and recent news for: {company}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yfinance.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_financial_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use the `yfinance` Python package to retrieve the historical prices of {stock}
    in the last 30 days. Return a DataFrame with columns [date, price].
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plotly.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_investment_memo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Write an investment memo for {company}. Use the market research and financial data:
    {research}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;due_diligence_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;research_market&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;load_financial_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_investment_memo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;financials&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two tasks run concurrently. Since they're independent -- one searches the web, the other retrieves and transforms financial data -- parallelism gives you the same results in roughly half the wall-clock time at no additional token cost. The results then feed into a synchronous report writer that uses both.&lt;/p&gt;

&lt;p&gt;Notice the &lt;code&gt;tools=[websearch_tool]&lt;/code&gt; parameter. AI Functions can use any &lt;a href="https://github.com/strands-agents/tools" rel="noopener noreferrer"&gt;Strands tool&lt;/a&gt;. The framework provides built-in tools for Python code execution, and you can pass additional tools (web search, API clients, file I/O) per function. The LLM decides when and how to use them during execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Sharing
&lt;/h2&gt;

&lt;p&gt;Different parts of a workflow may need different models. A quick validation check doesn't need the same model as a complex analysis. AI Functions use &lt;code&gt;AIFunctionConfig&lt;/code&gt; objects to share configuration across functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIFunctionConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;BIG_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIFunctionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-5-20250929-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;FAST_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIFunctionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DATA_ANALYSIS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIFunctionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-5-20250929-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;code_execution_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;code_executor_additional_imports&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numpy.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Configs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DATA_ANALYSIS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;normalize_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Load, clean, and normalize the dataset at `{path}` into a standard schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Configs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FAST_MODEL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_email&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if the following string is a valid email address: {text}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configs are plain Python objects, so switching the entire pipeline from one model family to another is a single line change. During development, you might route everything through a capable but expensive model. For cost optimization, you swap the config's model reference and see what breaks. Keyword arguments on &lt;code&gt;@ai_function&lt;/code&gt; override config values for individual functions, so you can specialize without duplicating the entire config.&lt;/p&gt;
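&lt;p&gt;The layering idea can be sketched with plain dataclasses. This is a hypothetical illustration of shared-config-plus-override, not the library's actual implementation; the real &lt;code&gt;AIFunctionConfig&lt;/code&gt; fields and override mechanics may differ:&lt;/p&gt;

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for AIFunctionConfig, just to show the layering:
# a shared frozen config plus cheap per-function specializations.
@dataclass(frozen=True)
class AIFunctionConfig:
    model: str
    code_execution_mode: str = "none"

BIG_MODEL = AIFunctionConfig(model="us.anthropic.claude-sonnet-4-5-20250929-v1:0")

# A keyword override produces a specialized copy; the shared config is untouched.
local_exec = replace(BIG_MODEL, code_execution_mode="local")

print(BIG_MODEL.code_execution_mode)   # none
print(local_exec.code_execution_mode)  # local
```

Swapping the whole pipeline to another model family is then one edit to the shared object, while specializations stay local to the functions that need them.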

&lt;h2&gt;
  
  
  Validating More Than Output
&lt;/h2&gt;

&lt;p&gt;One of the more subtle capabilities of the post-condition system is validating &lt;em&gt;properties&lt;/em&gt; of a result that are hard to express as structural checks. AI-powered post-conditions let you evaluate semantic qualities -- grounding, citation quality, logical consistency -- using one LLM to validate another:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_citations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;PostConditionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Validate if all the claims made in the following summary are supported
    by an inline citation to a credible source.
    &amp;lt;summary&amp;gt;
    {summary}
    &amp;lt;/summary&amp;gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;max_words&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;websearch_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;check_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_citations&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;market_researcher&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Research and provide a well-sourced answer to: {query}
    Every claim must be supported by citations to credible sources.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The research agent produces a summary. &lt;code&gt;check_length&lt;/code&gt; verifies the word count deterministically. &lt;code&gt;check_citations&lt;/code&gt; uses an LLM to evaluate whether each claim is actually backed by a cited source. If the agent hallucinated an answer without doing real research, the citation check catches it and triggers a retry with feedback about which specific claims lack sources.&lt;/p&gt;

&lt;p&gt;This is a different kind of validation from checking output structure. It's using AI to verify AI -- checking semantic properties that are hard to express as assertions. It addresses one of the hardest problems in LLM-based systems: how do you know the model didn't just make something up? Post-conditions don't solve this fully, but they create a second, independent evaluation that meaningfully reduces the failure rate.&lt;/p&gt;
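&lt;p&gt;The retry-with-feedback loop this relies on can be sketched generically. Everything here (&lt;code&gt;call_with_post_conditions&lt;/code&gt;, the toy model, the validator) is hypothetical scaffolding to show the shape of the loop, not the framework's actual code:&lt;/p&gt;

```python
from typing import Callable

def call_with_post_conditions(generate: Callable[[str], str],
                              post_conditions: list[Callable[[str], None]],
                              max_attempts: int = 3) -> str:
    # Each validator either passes silently or raises; the error text
    # becomes feedback for the next attempt.
    feedback = ""
    for _ in range(max_attempts):
        answer = generate(feedback)
        try:
            for check in post_conditions:
                check(answer)
            return answer  # every validator passed
        except Exception as err:
            feedback = str(err)  # fed back as context for the retry
    raise RuntimeError("post-conditions still failing: " + feedback)

# Toy stand-in for an LLM call: only adds a citation once it gets feedback.
def fake_model(feedback: str) -> str:
    return "claim [source]" if feedback else "claim"

def check_citations(summary: str) -> None:
    assert "[source]" in summary, "claims lack citations"

print(call_with_post_conditions(fake_model, [check_citations]))  # claim [source]
```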

&lt;h2&gt;
  
  
  Test Suites as Post-Conditions
&lt;/h2&gt;

&lt;p&gt;The post-condition model has an interesting application to automated coding: use your existing test suite as the post-condition. If the tests pass, the implementation is correct. If they fail, the failures feed back as error messages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redirect_stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;redirect_stdout&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FeatureRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;test_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Post-conditions can request original input arguments by name.
# Here, `feature` matches the parameter name of `implement_feature`.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;stdio_capture&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;redirect_stdout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stdio_capture&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;redirect_stderr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stdio_capture&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;retcode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;test_files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retcode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stdio_capture&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;run_tests&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;implement_feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FeatureRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Implement the following feature in the current code base:
    &amp;lt;feature&amp;gt;{feature.description}&amp;lt;/feature&amp;gt;
    Once done the code base should pass the following tests: {feature.test_files}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FeatureRequest&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;feature&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;implement_feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI Function's return value is just the string &lt;code&gt;"done"&lt;/code&gt; -- it doesn't matter. What matters is the side effect: the code base should now pass the specified tests. The post-condition runs &lt;code&gt;pytest&lt;/code&gt; and raises if any tests fail. The LLM receives the test output as feedback and keeps iterating until all tests pass.&lt;/p&gt;

&lt;p&gt;The documentation notes that agents pass roughly 10-15% more tests when the post-condition is provided in addition to the prompt instruction. The agent is measurably more effective at responding to concrete validation failures than at following written instructions. This aligns with a broader pattern: concrete, automated feedback loops outperform detailed prompting. Which is exactly the argument for 3.1 over 3.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;AI Functions is an experiment. The code is open source at &lt;a href="https://github.com/strands-labs/ai-functions" rel="noopener noreferrer"&gt;strands-labs/ai-functions&lt;/a&gt;, part of the &lt;a href="https://github.com/strands-labs" rel="noopener noreferrer"&gt;Strands Labs GitHub organization&lt;/a&gt; -- a home for experimental projects built on the Strands Agents SDK. Alongside AI Functions, you'll find &lt;a href="https://github.com/strands-labs/robots" rel="noopener noreferrer"&gt;Robots&lt;/a&gt; (physical AI agents on edge hardware) and &lt;a href="https://github.com/strands-labs/robots-sim" rel="noopener noreferrer"&gt;Robots Sim&lt;/a&gt; (simulated environments for robot development). All three are built on the &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;, which has been downloaded over 14 million times since its open source release in May 2025. All three are explicitly experimental -- and that's the point. The best way to find out what works in this space is to build things and see what breaks.&lt;/p&gt;

&lt;p&gt;Install it with &lt;code&gt;pip install strands-ai-functions&lt;/code&gt; (or &lt;code&gt;uv add strands-ai-functions&lt;/code&gt;), clone the &lt;a href="https://github.com/strands-labs/ai-functions" rel="noopener noreferrer"&gt;repo&lt;/a&gt; for the full set of examples, and start experimenting.&lt;/p&gt;

&lt;p&gt;AI Functions is not a production system. It's a conversation starter and just maybe where Karpathy's version numbering goes next. Try it. Write some post-conditions. See whether defining acceptance criteria feels more natural than auditing LLM output. And then consider: what does 4.0 look like?&lt;/p&gt;

&lt;p&gt;We don't know yet. But the experiments have started :)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Async Agentic Tools: Breaking Free from the Request-Response Loop</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 17 Feb 2026 09:16:15 +0000</pubDate>
      <link>https://forem.com/aws/async-agentic-tools-breaking-free-from-the-request-response-loop-1ll5</link>
      <guid>https://forem.com/aws/async-agentic-tools-breaking-free-from-the-request-response-loop-1ll5</guid>
      <description>&lt;p&gt;Every AI agent framework today follows the same basic loop: the model thinks, it calls some tools, it waits for all the tools to finish, and then it thinks again. On one hand that loop works fine when your tools return in seconds or when your agent is headless, but on the other it causes millions of people every day to stare blankly for hours (in total) at a "thinking" message. Where's the productivity gain in that? :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skip to the code bit: here&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quick walkthrough video here:&lt;/strong&gt; &lt;br&gt;
&lt;a href="https://youtu.be/VYLBCoxbPE8" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcderh9x8x4hhpen1q1zk.jpg" alt="Watch the video" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you've used (or built) agents that call APIs with variable latency, run database queries, kick off web searches, or - of course - used agents as tools, you've felt this. The model sits idle, the user sits idle, and a 30-second tool call holds up the response to a 3-second one that could have finished ages ago. The agent can't talk to the user, can't start processing partial results, can't do anything.&lt;/p&gt;

&lt;p&gt;Modern frontier models now have the ability to start experimenting with genuine asynchronous tool calls. This post walks through my experimental approach: true asynchronous agentic tools. The demo is built on the &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;, but the pattern should apply to any agent framework with a tool-calling loop. The code is open, the approach is simple, and it requires zero changes to how you write your tools.&lt;/p&gt;
&lt;h2&gt;
  
  
  Async vs Async
&lt;/h2&gt;

&lt;p&gt;First, before you write your comment on this post... let's be precise about terminology.&lt;/p&gt;

&lt;p&gt;Many agent frameworks already support parallel tool calling — when a model returns multiple tool calls in a single response, the framework can execute them concurrently. Strands Agents, which I'm using for this demo, &lt;a href="https://strandsagents.com/latest/documentation/docs/user-guide/concepts/tools/executors/" rel="noopener noreferrer"&gt;supports this natively&lt;/a&gt;. That's good. But the agent loop is still blocked until every tool in that batch has responded. If you dispatch three tools and two finish in 1 second but one takes 45 seconds, the model cannot respond, cannot think, and cannot act for those 44 seconds of dead air. It's the same as &lt;code&gt;Promise.all()&lt;/code&gt; or &lt;code&gt;asyncio.gather()&lt;/code&gt; — you get concurrency in the execution, but you still wait for the slowest call before anything else can happen.&lt;/p&gt;
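&lt;p&gt;The &lt;code&gt;asyncio.gather()&lt;/code&gt; comparison is easy to see in plain Python. You get concurrency within the batch, but the batch as a whole is gated on its slowest member:&lt;/p&gt;

```python
import asyncio, time

# Three "tools" with very different latencies.
async def tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return name

async def main() -> list[str]:
    start = time.monotonic()
    # gather() runs all three concurrently, but only returns once the
    # slowest one finishes: the batch is blocked by its longest member.
    results = await asyncio.gather(tool("fast", 0.01), tool("mid", 0.02),
                                   tool("slow", 0.05))
    elapsed = time.monotonic() - start
    print(f"{results} in {elapsed:.2f}s")  # total time tracks the 0.05s tool
    return results

asyncio.run(main())
```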

&lt;p&gt;True async agentic tool calling is different. The model dispatches a tool, gets back an acknowledgement immediately ("task started, here's an ID"), and moves on. It can talk to the user, call other tools, or just wait. When the result arrives — seconds, minutes, hours or days later — it gets delivered to the model as a new message, and the model processes it then. The model stays responsive. Results stream in as they complete. The user experience is fundamentally different. And I have to say, I like it!&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Wasn't Possible Until Recently
&lt;/h2&gt;

&lt;p&gt;This architecture puts real demands on a model's intelligence, its context size, and its propensity for 'lost in the middle' issues. It needs to understand that a tool call won't return a real result. It needs to avoid fabricating data while waiting. It needs to handle results arriving out-of-order, potentially many turns after the original request. And it needs to keep track of multiple pending tasks across a growing conversation context.&lt;/p&gt;

&lt;p&gt;One solution to this could be to provide the agent with even more tools to manage its own execution flow. But this just makes the flow even more complex for the poor model. &lt;/p&gt;

&lt;p&gt;The (current) solution is much simpler: ironically, we just needed to wait for the frontier models to get good enough. The problem you couldn't solve 6 months ago now works just like you thought it would, now that Opus 4.6 is here. (Is there a name for this progression? Like Moore's Law, but just for LLM improvement? Please let me know.)&lt;/p&gt;

&lt;p&gt;As recently as mid-2025, asynchronous tool calling didn't work reliably. Models would hallucinate results instead of waiting, lose track of pending task IDs, or get confused when results arrived in a later turn. The instruction-following just wasn't precise enough.&lt;/p&gt;

&lt;p&gt;Newer, more capable models handle this kinda okay. They follow the "do not fabricate" instruction in the tool description, they correctly associate arriving results with their task IDs, and they maintain coherent multi-turn conversations while tasks are in flight. This is one of those capabilities that seems to have emerged from generally smarter models rather than from any async-specific training. Is it perfect? No, but no agent is perfect, and I'm sure it will improve.&lt;/p&gt;
&lt;h2&gt;
  
  
  How It Works (The Code)
&lt;/h2&gt;

&lt;p&gt;My implementation is three small components — about 320 lines of code total — that layer on top of a standard Strands Agent without modifying it. The code is linked here: &lt;a href="https://github.com/mikegc-aws/async-agentic-tools" rel="noopener noreferrer"&gt;repo&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  1. The Decorator: &lt;code&gt;@tool_async&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;You write your tool exactly the way you'd write any tool function. A function with a docstring and type hints. Then you wrap it with &lt;code&gt;@tool_async(manager)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands_async_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncToolManager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_async&lt;/span&gt;

&lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncToolManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Research a topic thoroughly and return detailed findings.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# simulate slow API call
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Findings about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So that's it for the tool: your function doesn't change. It's still synchronous. It still returns a string. The decorator handles everything for you.&lt;/p&gt;

&lt;p&gt;What it does behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When the model calls &lt;code&gt;research_topic&lt;/code&gt;, the decorator submits the original function to a &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; for background execution.&lt;/li&gt;
&lt;li&gt;It immediately returns a structured message to the model: task ID, tool name, arguments, and a clear instruction not to fabricate the result.&lt;/li&gt;
&lt;li&gt;It appends an async notice to the tool's docstring so the model knows, from the schema alone, that this tool is asynchronous, and how to expect it to work.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model sees the immediate response and might tell the user "I've started the research." Meanwhile, the real function is running on a background thread.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Manager: &lt;code&gt;AsyncToolManager&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The manager is a thin wrapper around Python's &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; that adds task tracking and completion callbacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncToolManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Submit a function for background execution. Returns a task ID.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_done_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;on_done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a background task finishes, the manager fires an &lt;code&gt;on_complete&lt;/code&gt; callback with an &lt;code&gt;AsyncTaskResult&lt;/code&gt; containing the task ID, tool name, arguments, result (or error), and elapsed time. The manager doesn't know or care what happens next — it just reports completion.&lt;/p&gt;
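&lt;p&gt;A self-contained version of the submit-and-report cycle might look like this. The &lt;code&gt;AsyncTaskResult&lt;/code&gt; fields here are assumptions modeled on the description above, not the repo's actual class:&lt;/p&gt;

```python
import time, uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Assumed completion-report shape: task ID, tool name, result or error,
# and elapsed time, as described in the post.
@dataclass
class AsyncTaskResult:
    task_id: str
    tool_name: str
    result: object
    error: object
    elapsed_ms: float

completed = []
_executor = ThreadPoolExecutor(max_workers=2)

def submit(tool_name, fn, on_complete, **kwargs):
    task_id = uuid.uuid4().hex[:8]
    start = time.monotonic()
    def on_done(future):
        err = future.exception()
        res = None if err else future.result()
        elapsed = (time.monotonic() - start) * 1000
        # The manager just reports completion; it doesn't care what happens next.
        on_complete(AsyncTaskResult(task_id, tool_name, res, err, elapsed))
    future = _executor.submit(fn, **kwargs)
    future.add_done_callback(on_done)
    return task_id

submit("echo", lambda text: text.upper(), completed.append, text="hi")
time.sleep(0.2)  # give the background thread time to finish
print(completed[0].result)  # HI
```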

&lt;h3&gt;
  
  
  3. The Agent Wrapper: &lt;code&gt;AsyncAgent&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the orchestration layer. &lt;code&gt;AsyncAgent&lt;/code&gt; wraps a standard &lt;code&gt;Agent&lt;/code&gt; and manages the lifecycle of delivering async results back to the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;research_topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analyze_sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;async_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;async_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research quantum computing and check the weather in Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrap the agent like this rather than subclassing it because the agent is a complex thing, and this approach should extend more easily to other frameworks. AsyncAgent doesn't change how the Agent works — it just manages when and how it gets invoked.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AsyncAgent&lt;/code&gt; registers itself as the manager's completion callback and maintains a simple state machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent is idle, result arrives&lt;/strong&gt;: Invoke the agent immediately with the formatted result. The model processes it and responds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent is busy (already processing something), result arrives&lt;/strong&gt;: Queue the result. When the current invocation finishes, drain the queue — deliver each queued result to the agent one at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple results arrive while busy&lt;/strong&gt;: They all queue up and get delivered sequentially in a draining phase after the agent finishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(I have OpenClaw to thank for this - sorta. While digging through its code, I fell into looking at &lt;a href="https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/README.md" rel="noopener noreferrer"&gt;Pi&lt;/a&gt;'s interrupt semantics, which do something similar.)&lt;/p&gt;

&lt;p&gt;The result is delivered to the model as a plain text message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ASYNC RESULT]
Task ID: abc123
Tool: research_topic(topic='quantum computing')
Result:
Key finding: quantum computing has seen 340% growth in the last 2 years.
Experts predict the quantum computing market will reach $50B by 2028.
Elapsed: 15234ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model treats this like any other user message and responds naturally.&lt;/p&gt;

&lt;p&gt;Thread safety matters here. The completion callback fires from the thread pool, but agent invocations must be serialized (you can't call a Strands Agent from two threads simultaneously). &lt;code&gt;AsyncAgent&lt;/code&gt; uses a lock to protect its busy flag and result queue, ensuring that results are always delivered one at a time in a safe order.&lt;/p&gt;
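&lt;p&gt;The busy-flag-plus-queue logic can be sketched in a few lines. This is a simplified illustration with assumed names (the real &lt;code&gt;agent.py&lt;/code&gt; in the repo does more), but it shows the lock, the busy flag, and the draining loop working together:&lt;/p&gt;

```python
import threading
from collections import deque

class AsyncAgentSketch:
    """Sketch of the busy-flag + queue state machine (names are assumptions;
    see agent.py in the repo for the real AsyncAgent)."""

    def __init__(self, agent):
        self.agent = agent            # callable: receives a formatted result
        self._lock = threading.Lock()
        self._busy = False
        self._queue = deque()

    def on_result(self, message):
        # Called from the thread pool when an async tool completes.
        with self._lock:
            if self._busy:
                self._queue.append(message)  # agent mid-invocation: queue it
                return
            self._busy = True
        self._drain(message)

    def _drain(self, message):
        # Deliver results one at a time until the queue is empty.
        while True:
            self.agent(message)       # invoked OUTSIDE the lock
            with self._lock:
                if not self._queue:
                    self._busy = False
                    return
                message = self._queue.popleft()
```

&lt;p&gt;Because the agent is invoked outside the lock, a result arriving mid-invocation simply lands in the queue and gets picked up by the draining loop.&lt;/p&gt;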

&lt;h2&gt;
  
  
  The Demo
&lt;/h2&gt;

&lt;p&gt;In the repo, &lt;code&gt;demo.py&lt;/code&gt; is a minimal CLI that shows the whole thing working end-to-end, using mock tools (with made-up data!) that wait for random intervals. It should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ uv run python demo.py

You: Research quantum computing and check the weather in London

  [thinking] processing...
I've started both tasks for you:
1. Researching quantum computing (Task abc123)
2. Fetching weather for London (Task def456)

Results will come in as they complete.

  [callback] fetch_weather (def456) completed in 11234ms — delivering to agent now
  [thinking] processing...
The weather in London just came in: partly cloudy, 18C, humidity at 65%.

Still waiting on the research results.

  [callback] research_topic (abc123) completed in 17891ms — delivering to agent now
  [thinking] processing...
The research on quantum computing is done. Here are the key findings:
- 340% growth in the last 2 years
- Major players include Acme Corp, Nexus Labs, and Orion Systems
- Market predicted to reach $50B by 2028

You:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are three async tools (with simulated 10-20 second delays) and two synchronous tools (calculator and current_time) running side by side. The sync tools return instantly as usual. The async tools dispatch to background threads and deliver results via callbacks. The model stays conversational throughout.&lt;/p&gt;

&lt;p&gt;Try this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: Research Paris

  [thinking] processing...
I've started researching Paris for you (Task a1b2c3).
I'll let you know as soon as the results come in.

You: What time is it there?

  [thinking] processing...
It's currently 15:32 in Paris (CET, UTC+1).

  [callback] research_topic (a1b2c3) completed in 16482ms — delivering to agent now
  [thinking] processing...
The Paris research just came back! Here are some highlights:
- ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What the System Prompt Does
&lt;/h2&gt;

&lt;p&gt;In the demo script the system prompt is explicit about the contract. Over time I would like to simplify this or remove it altogether, and I suspect that will happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When you call an async tool it returns a task ID immediately.
The actual result will arrive in a future message tagged [ASYNC RESULT].
Rules:
  - Do NOT guess or fabricate async results. Wait for [ASYNC RESULT].
  - Tell the user each task has been started.
  - You CAN dispatch multiple async tools at once — they run in parallel.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system prompt, combined with the async notice appended to each tool's docstring by the decorator, gives the model enough context to behave correctly. It's prompt engineering, not framework magic — and it works because current models are good enough at following these instructions reliably.&lt;/p&gt;
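&lt;p&gt;To make the docstring half of that contract concrete, the idea looks roughly like this (a hypothetical sketch with assumed names; see &lt;code&gt;decorator.py&lt;/code&gt; in the repo for the real version):&lt;/p&gt;

```python
# Hypothetical sketch: the notice text and decorator name are assumptions.
ASYNC_NOTICE = (
    "\n\nNOTE: This tool is asynchronous. It returns a task ID immediately; "
    "the real result arrives later in an [ASYNC RESULT] message."
)

def async_tool(fn):
    """Mark a tool as async by appending the notice to its docstring."""
    fn.__doc__ = (fn.__doc__ or "") + ASYNC_NOTICE
    return fn

@async_tool
def research_topic(topic: str) -> str:
    """Research a topic and return key findings."""
    return f"task dispatched for {topic}"
```

&lt;p&gt;The model sees the notice wherever it sees the tool's description, so every async tool carries its own reminder of the contract.&lt;/p&gt;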

&lt;h2&gt;
  
  
  When to Use This (and Probably When Not To)
&lt;/h2&gt;

&lt;p&gt;Let's be honest: &lt;strong&gt;this is an experiment.&lt;/strong&gt; It's not for everyone, and it's not for every situation.&lt;/p&gt;

&lt;p&gt;It works well when (in no particular order):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools have high, variable latency&lt;/strong&gt; — web searches, API calls, document processing, anything where one tool might take 2 seconds and another might take 30.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want the agent to stay conversational&lt;/strong&gt; — the user shouldn't have to stare at a spinner while a slow tool runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're building voice interfaces&lt;/strong&gt; — This is a huge one! Dead air is death for voice UX. Async tools work especially well for voice agents: the agent can keep talking and answering follow-up questions while sub-tasks run in the background. This demo actually includes experimental (scrappy) code for testing this with &lt;a href="https://github.com/mikegc-aws/async-agentic-tools/tree/main/voice" rel="noopener noreferrer"&gt;Amazon Nova Sonic 2&lt;/a&gt; as a voice chat — worth a look if you want to try async tools in a real voice flow. (See the readme for the details of how to run the voice code.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks are independent&lt;/strong&gt; — async shines when tool calls don't have a hard dependency on each other's results, and the agent can still combine the data when multiple tools return.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools are themselves agents (or sub-agents)&lt;/strong&gt; — Another big one! Agent-as-tool is one of the main candidates for long-running processes; those tools often take seconds or minutes and benefit from the model staying responsive while they run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;The code is in the &lt;a href="https://github.com/mikegc-aws/async-agentic-tools" rel="noopener noreferrer"&gt;async-agentic-tools&lt;/a&gt; repository. The demo uses &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;Strands Agents&lt;/a&gt; with Claude Sonnet on Amazon Bedrock. You'll need AWS credentials for Bedrock and Python 3.14+.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/mikegc-aws/async-agentic-tools
&lt;span class="nb"&gt;cd &lt;/span&gt;async-agentic-tools
uv run python demo.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three files that matter are in &lt;code&gt;strands_async_tools/&lt;/code&gt;: &lt;code&gt;manager.py&lt;/code&gt; (87 lines), &lt;code&gt;decorator.py&lt;/code&gt; (68 lines), and &lt;code&gt;agent.py&lt;/code&gt; (166 lines). The whole thing is about 320 lines of Python. Read it, fork it, break it. If this approach is useful to you, consider giving the &lt;a href="https://github.com/mikegc-aws/async-agentic-tools" rel="noopener noreferrer"&gt;repo&lt;/a&gt; a star — it helps others find it and makes everyone smile.&lt;/p&gt;

&lt;p&gt;This is my experiment in what becomes possible when models get smart enough to handle architectural patterns that would have confused them a year ago. The request-response loop served us well. But tools are getting slower and more powerful, conversations are getting longer and more complex, and users shouldn't have to wait in silence while the interesting work happens in the background.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>Streaming Agents on AWS</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 09 Dec 2025 12:19:44 +0000</pubDate>
      <link>https://forem.com/aws/streaming-agents-on-aws-3610</link>
      <guid>https://forem.com/aws/streaming-agents-on-aws-3610</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaaywkp07xg9kjdzj3is.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmaaywkp07xg9kjdzj3is.png" alt=" " width="800" height="300"&gt;&lt;/a&gt;&lt;br&gt;
You've built an agent that streams responses beautifully in development. Now you're ready to deploy to production, and you're thinking about security, rate limiting, and authentication.&lt;/p&gt;

&lt;p&gt;Just like you'd put your website behind a CDN or gateway, your agent deserves the same protection. Here's how to deploy streaming agents with API Gateway while maintaining that smooth, real-time user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is Part 1 of a two-part series.&lt;/strong&gt; This post covers the architecture and key concepts. For the complete implementation with CDK code and deployment steps, see &lt;a href="https://dev.to/aws/complete-tutorial-streaming-agents-on-aws-185g"&gt;Part 2: Complete Tutorial&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;Production agents need robust protection: rate limiting to prevent abuse, WAF to block attacks, authentication to validate users, and API keys for access control. API Gateway provides all of this.&lt;/p&gt;

&lt;p&gt;The traditional approach has been to choose between two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Expose Runtime directly&lt;/strong&gt; — This works but requires building security logic into your agent code, and you're vulnerable if a bad actor connects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip streaming entirely&lt;/strong&gt; — This is fine for background tasks, but for interactive chatbots where users are waiting, a 30+ second delay creates a poor experience.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The good news? &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/11/api-gateway-response-streaming-rest-apis/" rel="noopener noreferrer"&gt;API Gateway now supports response streaming&lt;/a&gt;, so you can have both enterprise-grade protection and real-time streaming. This post shows you how to set it up. &lt;/p&gt;


&lt;h2&gt;
  
  
  The Right Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cognito User Pool&lt;/strong&gt; for OAuth2/JWT authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore Runtime&lt;/strong&gt; with JWT authorizer (using Cognito)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt; with streaming enabled (pointing to Runtime, using Cognito for auth)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ResponseTransferMode: STREAM&lt;/strong&gt; (the key configuration that enables streaming)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Cognito (get ID token) 
     → API Gateway (validate token, stream enabled)
     → Runtime /invocations endpoint (validate token again, stream response)
     → Agent (async generator)
     → Stream back through the chain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Four Critical Pieces
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use ID Tokens, Not Access Tokens
&lt;/h3&gt;

&lt;p&gt;API Gateway Cognito authorizers expect &lt;strong&gt;ID tokens&lt;/strong&gt; (which contain user identity claims like &lt;code&gt;sub&lt;/code&gt;), not access tokens. Your client needs to send:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Authorization: Bearer &amp;lt;id_token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both API Gateway and AgentCore Runtime will validate this token. Defense in depth.&lt;/p&gt;
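&lt;p&gt;As a client-side sketch (the pool and app-client values are placeholders, and your app client must allow the &lt;code&gt;USER_PASSWORD_AUTH&lt;/code&gt; flow), here is one way to fetch the token with boto3 and build the header, making sure to pick &lt;code&gt;IdToken&lt;/code&gt; rather than &lt;code&gt;AccessToken&lt;/code&gt;:&lt;/p&gt;

```python
def get_id_token(client_id, username, password, region="us-east-1"):
    """Authenticate against Cognito and return the ID token."""
    import boto3  # imported lazily so the helper below works without boto3
    cognito = boto3.client("cognito-idp", region_name=region)
    resp = cognito.initiate_auth(
        ClientId=client_id,
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    return resp["AuthenticationResult"]["IdToken"]

def auth_header(auth_result):
    """Build the Authorization header from a Cognito AuthenticationResult,
    using IdToken — the AccessToken will be rejected by the authorizer."""
    return {"Authorization": f"Bearer {auth_result['IdToken']}"}
```

&lt;p&gt;&lt;code&gt;initiate_auth&lt;/code&gt; returns both tokens in &lt;code&gt;AuthenticationResult&lt;/code&gt;; grabbing the wrong one is an easy mistake that produces 401s from the Cognito authorizer.&lt;/p&gt;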

&lt;h3&gt;
  
  
  2. Use the /invocations Endpoint
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/invocations&lt;/code&gt; endpoint is the OAuth2 endpoint on AgentCore Runtime. It's specifically designed to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept JWT ID tokens in the Authorization header&lt;/li&gt;
&lt;li&gt;Validate tokens using the authorizer you configured&lt;/li&gt;
&lt;li&gt;Stream responses using the async generator pattern&lt;/li&gt;
&lt;li&gt;Handle long-running operations with extended timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The endpoint looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://bedrock-agentcore.{region}.amazonaws.com/runtimes/{runtime_id}/invocations?qualifier=DEFAULT&amp;amp;accountId={account}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other endpoints might not support streaming or might require different authentication. This is the one that works.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Enable ResponseTransferMode: STREAM
&lt;/h3&gt;

&lt;p&gt;API Gateway buffers responses by default, so you need to explicitly enable streaming mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The escape hatch (CDK doesn't expose this directly yet)
&lt;/span&gt;&lt;span class="n"&gt;cfn_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;post_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_child&lt;/span&gt;
&lt;span class="n"&gt;cfn_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_property_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Integration.ResponseTransferMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this configuration, API Gateway will buffer the entire response before sending it to the client, which prevents streaming from working.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Return an Async Generator
&lt;/h3&gt;

&lt;p&gt;Your agent code needs to return an async generator, not a complete response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runtime detects the async generator and handles the streaming protocol automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Architecture Works
&lt;/h2&gt;

&lt;p&gt;This approach delivers enterprise-grade protection without sacrificing user experience. You get protection from the WAF, and authentication to validate users, all while maintaining the real-time streaming that keeps users engaged. It's the best of both worlds.&lt;/p&gt;

&lt;p&gt;The timeout difference is particularly significant. With streaming, you get up to 15 minutes of execution time, compared to just 29 seconds without it. For agents that make multiple tool calls or process large datasets, streaming enables use cases that simply wouldn't work otherwise.&lt;/p&gt;

&lt;p&gt;Beyond streaming, you get the complete API Gateway feature set: all authorizer types (Cognito, Lambda, IAM), request throttling, access logging, TLS/mTLS support, custom domain names, and centralized metrics and observability. Everything you need for production is included.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;p&gt;While building this architecture, I ran into a few gotchas that cost me some debugging time. Here are the patterns that work and the ones to avoid.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Wrong: Forgetting ResponseTransferMode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;integration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpIntegration&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="n"&gt;post_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;integration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Missing the critical line!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Right: Always Set It
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cfn_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;post_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_child&lt;/span&gt;
&lt;span class="n"&gt;cfn_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_property_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Integration.ResponseTransferMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ❌ Wrong: Using the Wrong Endpoint
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This won't stream properly
&lt;/span&gt;&lt;span class="n"&gt;wrong_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://bedrock-agentcore.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.amazonaws.com/runtimes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/some-path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Right: Use /invocations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The OAuth2 endpoint that supports streaming
&lt;/span&gt;&lt;span class="n"&gt;correct_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://bedrock-agentcore.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.amazonaws.com/runtimes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/invocations?qualifier=DEFAULT&amp;amp;accountId=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ❌ Wrong: Returning Complete Response
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# Won't stream
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✅ Right: Return Async Generator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Complete Flow
&lt;/h2&gt;

&lt;p&gt;Let's walk through what happens when a user interacts with your streaming agent. Understanding this flow helps clarify why each piece of the architecture matters.&lt;/p&gt;

&lt;p&gt;It starts with authentication. The user authenticates with Cognito using the OAuth2 flow and receives a JWT ID token. This token contains their identity claims and will be validated at multiple points in the request chain.&lt;/p&gt;

&lt;p&gt;When the frontend sends a request to API Gateway, it includes this token in the &lt;code&gt;Authorization: Bearer &amp;lt;id_token&amp;gt;&lt;/code&gt; header. API Gateway immediately validates the token using its Cognito authorizer—this is the first validation layer. Once validated, API Gateway proxies the request to the AgentCore Runtime OAuth2 endpoint (&lt;code&gt;/invocations&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The Runtime doesn't just trust API Gateway. It validates the token again using its own JWT authorizer, providing a second validation layer for defense in depth. With authentication confirmed, the agent executes and returns an async generator.&lt;/p&gt;

&lt;p&gt;Now the streaming begins. The Runtime streams chunks back to API Gateway with no buffering. API Gateway, configured with &lt;code&gt;ResponseTransferMode: STREAM&lt;/code&gt;, passes those chunks directly to the client without buffering. The client displays chunks as they arrive, creating that smooth, real-time experience users expect.&lt;/p&gt;

&lt;p&gt;The beauty of this architecture is that streaming happens at every layer: Agent → Runtime → API Gateway → Client. And authentication is validated at both API Gateway and Runtime, giving you robust security without compromising performance.&lt;/p&gt;
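&lt;p&gt;From the client's point of view, the whole chain boils down to one authenticated POST whose body is read incrementally rather than all at once. A hypothetical sketch (the URL and payload shape are assumptions, not from the repo):&lt;/p&gt;

```python
import json
import urllib.request

def stream_agent_reply(api_url, id_token, prompt):
    """Yield response chunks as they arrive from the streaming endpoint.
    The {"prompt": ...} payload shape is an assumption for illustration."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={
            "Authorization": f"Bearer {id_token}",  # ID token, not access token
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Read incrementally; never buffer the whole body.
        while chunk := resp.read(1024):
            yield chunk.decode()
```

&lt;p&gt;Note the single read loop: nothing waits for the full body, so tokens can be displayed as soon as API Gateway forwards them.&lt;/p&gt;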




&lt;h2&gt;
  
  
  Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Idle Timeouts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regional/Private endpoints&lt;/strong&gt;: 5-minute idle timeout (no data for 5 minutes = connection closes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge-optimized endpoints&lt;/strong&gt;: 30-second idle timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep your agent generating data regularly, or the connection will drop.&lt;/p&gt;
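&lt;p&gt;One way to keep data flowing is to wrap slow work in a heartbeat generator that emits filler while the real task runs. This is a sketch, not part of the AgentCore SDK (the names and the keep-alive payload are assumptions):&lt;/p&gt;

```python
import asyncio

async def with_heartbeat(work, interval=30.0, heartbeat=": keep-alive\n\n"):
    """Yield heartbeat chunks while a slow coroutine runs, then its result.
    Keeping interval well under the gateway's idle timeout avoids drops."""
    task = asyncio.ensure_future(work)
    while not task.done():
        try:
            # shield() keeps the timeout from cancelling the real work
            await asyncio.wait_for(asyncio.shield(task), timeout=interval)
        except asyncio.TimeoutError:
            yield heartbeat  # emit filler so the connection stays warm
    yield task.result()
```

&lt;p&gt;The client can simply ignore the filler chunks; what matters is that some bytes cross the wire before the idle timer fires.&lt;/p&gt;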

&lt;h3&gt;
  
  
  Bandwidth Limits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;First 10MB&lt;/strong&gt;: No restrictions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After 10MB&lt;/strong&gt;: Limited to 2MB/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most agent responses, this won't be an issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Doesn't Work with Streaming
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Response transformation with VTL (Velocity Template Language)&lt;/li&gt;
&lt;li&gt;Integration response caching&lt;/li&gt;
&lt;li&gt;Content encoding (gzip, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need these features, you'll need to handle them differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/" rel="noopener noreferrer"&gt;API Gateway Response Streaming Blog Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/11/api-gateway-response-streaming-rest-apis/" rel="noopener noreferrer"&gt;What's New Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/apigateway/" rel="noopener noreferrer"&gt;API Gateway Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore-runtime.html" rel="noopener noreferrer"&gt;AgentCore Runtime Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="//../api-gw-sr-runtime/"&gt;Full Code Repository&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>serverless</category>
      <category>architecture</category>
      <category>ai</category>
      <category>aws</category>
    </item>
    <item>
      <title>Complete Tutorial: Streaming Agents on AWS</title>
      <dc:creator>Mike Chambers</dc:creator>
      <pubDate>Tue, 09 Dec 2025 12:19:40 +0000</pubDate>
      <link>https://forem.com/aws/complete-tutorial-streaming-agents-on-aws-185g</link>
      <guid>https://forem.com/aws/complete-tutorial-streaming-agents-on-aws-185g</guid>
      <description>&lt;p&gt;&lt;strong&gt;This is Part 2 of a two-part series.&lt;/strong&gt; If you haven't read the architecture overview yet, start with &lt;a href="https://dev.to/aws/streaming-agents-on-aws-3610"&gt;Part 1: Streaming Agents with API Gateway&lt;/a&gt; to understand the concepts before diving into the implementation.&lt;/p&gt;

&lt;p&gt;This guide walks you through the complete setup: CDK stacks, agent code, authentication flow, and deployment. By the end, you'll have a production-ready streaming agent protected by API Gateway.&lt;/p&gt;

&lt;p&gt;Complete code: &lt;a href="https://github.com/mikegc-aws/agentic-examples/tree/master/api-gw-sr-runtime" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Account with appropriate permissions&lt;/li&gt;
&lt;li&gt;AWS CDK installed (&lt;code&gt;npm install -g aws-cdk&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Python 3.11+ with uv (&lt;code&gt;pip install uv&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Basic understanding of CDK, API Gateway, and Cognito&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;We'll deploy three CDK stacks in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cognito Stack&lt;/strong&gt;: User Pool for OAuth2/JWT authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Stack&lt;/strong&gt;: AgentCore Runtime with JWT authorizer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway Stack&lt;/strong&gt;: REST API with streaming enabled&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The deployment order matters because each stack depends on outputs from the previous one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;api-gw-sr-runtime/
├── app.py                    # CDK app entry point
├── chatbot_spa_cdk/
│   ├── chatbot_spa_stack.py  # Cognito + API Gateway
│   └── agent_runtime_stack.py # AgentCore Runtime
├── agent/
│   └── agent.py              # Streaming agent code
├── spa/                      # Frontend application
└── pyproject.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 1: CDK App Setup
&lt;/h2&gt;

&lt;p&gt;The main CDK app orchestrates the three stacks with proper dependencies. The actual implementation includes environment configuration and resource naming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app.py (simplified - see repo for full version)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;App&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chatbot_spa_cdk.chatbot_spa_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatbotSpaStack&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chatbot_spa_cdk.agent_runtime_stack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentRuntimeStack&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Deploy Cognito first
&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatbotSpaStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ChatbotCognitoStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot-spa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Skip API Gateway for now
&lt;/span&gt;    &lt;span class="n"&gt;callback_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000/callback.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Deploy Runtime with Cognito references
&lt;/span&gt;&lt;span class="n"&gt;runtime_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentRuntimeStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ChatbotAgentRuntimeStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot-spa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_pool_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;runtime_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Deploy API Gateway pointing to Runtime
&lt;/span&gt;&lt;span class="n"&gt;api_stack&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatbotSpaStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ChatbotApiGatewayStack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chatbot-spa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backend_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runtime_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;existing_user_pool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;existing_user_pool_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;api_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_dependency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runtime_stack&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synth&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key points&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cognito deploys first (no dependencies)&lt;/li&gt;
&lt;li&gt;Runtime depends on Cognito (needs User Pool for JWT validation)&lt;/li&gt;
&lt;li&gt;API Gateway depends on Runtime (needs endpoint URL)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;resource_prefix&lt;/code&gt; parameter makes resources easily identifiable in the console&lt;/li&gt;
&lt;/ul&gt;
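&lt;p&gt;One thing the simplified listing glosses over: the &lt;code&gt;env&lt;/code&gt; object passed to each stack is never defined in the excerpt. A minimal sketch of how it is typically constructed (this assumes the common pattern of reading the target account and region from the CDK CLI's environment variables; see the repo for the actual version):&lt;/p&gt;

```python
import os
from aws_cdk import Environment

# CDK_DEFAULT_ACCOUNT / CDK_DEFAULT_REGION are populated by the CDK CLI
# from the active AWS credentials and configuration.
env = Environment(
    account=os.environ.get("CDK_DEFAULT_ACCOUNT"),
    region=os.environ.get("CDK_DEFAULT_REGION"),
)
```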




&lt;h2&gt;
  
  
  Step 2: Cognito Stack
&lt;/h2&gt;

&lt;p&gt;The Cognito configuration is part of the &lt;code&gt;ChatbotSpaStack&lt;/code&gt;. When deployed without a &lt;code&gt;backend_url&lt;/code&gt;, it creates just the User Pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From chatbot_spa_cdk/chatbot_spa_stack.py (simplified)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_cognito&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cognito&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatbotSpaStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create User Pool
&lt;/span&gt;        &lt;span class="n"&gt;user_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UserPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UserPool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_pool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-user-pool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;self_sign_up_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sign_in_aliases&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SignInAliases&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;auto_verify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AutoVerifiedAttrs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;password_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PasswordPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;require_uppercase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;require_lowercase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;require_digits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;require_symbols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Enable Managed Login UI (Essentials tier)
&lt;/span&gt;        &lt;span class="n"&gt;cfn_user_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_child&lt;/span&gt;
        &lt;span class="n"&gt;cfn_user_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_property_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UserPoolTier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ESSENTIALS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Add domain for hosted UI
&lt;/span&gt;        &lt;span class="n"&gt;user_pool_domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_domain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UserPoolDomain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cognito_domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CognitoDomainOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;domain_prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create OAuth2 client
&lt;/span&gt;        &lt;span class="n"&gt;user_pool_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UserPoolClient&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_pool_client_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-client&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;generate_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Public client for web apps
&lt;/span&gt;            &lt;span class="n"&gt;o_auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OAuthSettings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;flows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OAuthFlows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;authorization_code_grant&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAuthScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPENID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAuthScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EMAIL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;cognito&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OAuthScope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROFILE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;callback_urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;callback_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;logout_urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;refresh_token_validity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;access_token_validity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;id_token_validity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Export for other stacks
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_pool_client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration details&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;self_sign_up_enabled=False&lt;/strong&gt;: Prevents public registration (you control who gets access)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sign_in_aliases&lt;/strong&gt;: Users sign in with email addresses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;generate_secret=False&lt;/strong&gt;: Public client (web apps can't keep secrets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;authorization_code_grant&lt;/strong&gt;: Standard OAuth2 flow for web applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPENID scope&lt;/strong&gt;: Required for ID tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;callback_urls&lt;/strong&gt;: Where Cognito redirects after authentication&lt;/li&gt;
&lt;/ul&gt;
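&lt;p&gt;To see how these settings fit together, here is a sketch of the Hosted UI login URL the SPA would redirect to. The domain prefix, region, and client ID below are placeholder values for illustration, not outputs of this stack:&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_login_url(domain_prefix, region, client_id, callback_url):
    # Cognito Hosted UI authorize endpoint for this User Pool domain.
    base = f"https://{domain_prefix}.auth.{region}.amazoncognito.com/oauth2/authorize"
    query = urlencode({
        "client_id": client_id,
        "response_type": "code",          # authorization_code_grant
        "scope": "openid email profile",  # matches the OAuthScope list
        "redirect_uri": callback_url,     # must appear in callback_urls
    })
    return f"{base}?{query}"

# Placeholder values for illustration:
login_url = build_login_url(
    "chatbot-spa-123456789012", "us-east-1",
    "example-client-id", "http://localhost:3000/callback.html",
)
```

&lt;p&gt;After sign-in, Cognito redirects to the &lt;code&gt;redirect_uri&lt;/code&gt; with a &lt;code&gt;code&lt;/code&gt; query parameter that the SPA exchanges for tokens.&lt;/p&gt;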

&lt;p&gt;&lt;strong&gt;What you get&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User Pool that issues JWT ID tokens&lt;/li&gt;
&lt;li&gt;Hosted UI for authentication (optional, you can build your own)&lt;/li&gt;
&lt;li&gt;OAuth2 client configured for web applications&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: AgentCore Runtime Stack
&lt;/h2&gt;

&lt;p&gt;Deploy your agent to AgentCore Runtime with JWT authorization. The actual implementation uses &lt;code&gt;CfnResource&lt;/code&gt; and includes bundling logic for dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From chatbot_spa_cdk/agent_runtime_stack.py (simplified)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CfnResource&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk.aws_s3_assets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Asset&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentRuntimeStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_pool_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                 &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Package agent code with dependencies
&lt;/span&gt;        &lt;span class="c1"&gt;# (See repo for full bundling configuration)
&lt;/span&gt;        &lt;span class="n"&gt;agent_asset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentCodeAsset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# bundling configuration omitted for brevity
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Build Cognito OIDC discovery URL
&lt;/span&gt;        &lt;span class="n"&gt;discovery_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://cognito-idp.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.amazonaws.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/.well-known/openid-configuration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create runtime using CfnResource (Layer 1 construct)
&lt;/span&gt;        &lt;span class="n"&gt;runtime_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_agent_runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CfnResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentCoreRuntime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::BedrockAgentCore::Runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentRuntimeName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;runtime_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runtime for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resource_prefix&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; with streaming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RoleArn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;runtime_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# IAM role created separately
&lt;/span&gt;                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NetworkConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NetworkMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUBLIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AuthorizerConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CustomJWTAuthorizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DiscoveryUrl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;discovery_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AllowedAudience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_pool_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_pool_client_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentRuntimeArtifact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodeConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s3_bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_asset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s3_object_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EntryPoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYTHON_3_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Build the OAuth2 endpoint URL
&lt;/span&gt;        &lt;span class="n"&gt;runtime_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ref&lt;/span&gt;
        &lt;span class="n"&gt;runtime_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://bedrock-agentcore.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.amazonaws.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;runtimes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;runtime_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/invocations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?qualifier=DEFAULT&amp;amp;accountId=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runtime_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runtime_endpoint&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical details&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CustomJWTAuthorizer&lt;/strong&gt;: Uses OIDC discovery to validate ID tokens from Cognito&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DiscoveryUrl&lt;/strong&gt;: Points to Cognito's OIDC configuration endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AllowedAudience&lt;/strong&gt;: The User Pool Client ID (ID tokens must have this in their &lt;code&gt;aud&lt;/code&gt; claim)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;/invocations endpoint&lt;/strong&gt;: The OAuth2 endpoint that supports streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;qualifier=DEFAULT&lt;/strong&gt;: Uses the default runtime version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;accountId&lt;/strong&gt;: Required for cross-account access control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CfnResource&lt;/strong&gt;: Used because CDK doesn't have L2 constructs for AgentCore yet&lt;/li&gt;
&lt;/ul&gt;
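&lt;p&gt;The &lt;code&gt;aud&lt;/code&gt; requirement is easy to inspect yourself. A minimal Python sketch (the token is a hand-built stand-in and &lt;code&gt;example-client-id&lt;/code&gt; is a made-up client ID; real validation also verifies the signature against Cognito's JWKS, which this deliberately skips):&lt;/p&gt;

```python
import base64
import json

def jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT (no signature verification)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Hand-built stand-in for a Cognito ID token (header.payload.signature)
claims = {"aud": "example-client-id", "token_use": "id"}
fake_token = ".".join([
    base64.urlsafe_b64encode(b'{"alg":"RS256"}').rstrip(b"=").decode(),
    base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode(),
    "signature",
])

payload = jwt_payload(fake_token)
assert payload["aud"] == "example-client-id"  # must equal the User Pool Client ID
assert payload["token_use"] == "id"           # an ID token, not an access token
```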

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime validates every request's JWT ID token&lt;/li&gt;
&lt;li&gt;Invalid or missing tokens are rejected with 401&lt;/li&gt;
&lt;li&gt;Valid tokens allow the request to proceed to your agent&lt;/li&gt;
&lt;/ul&gt;
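&lt;p&gt;Concretely, a caller hits the &lt;code&gt;/invocations&lt;/code&gt; URL built above with the ID token in the &lt;code&gt;Authorization&lt;/code&gt; header. A sketch of the request shape in Python (the region, account, runtime ID, and token are placeholder values, and the actual call is left commented out since nothing here is deployed):&lt;/p&gt;

```python
import json
import urllib.request
from urllib.parse import urlencode

# Placeholder values; the real ones come from the CDK stack outputs
region = "us-west-2"
account = "123456789012"
runtime_id = "example-runtime-id"
id_token = "example-id-token"  # ID token obtained from Cognito

query = urlencode({"qualifier": "DEFAULT", "accountId": account})
endpoint = (
    f"https://bedrock-agentcore.{region}.amazonaws.com/"
    f"runtimes/{runtime_id}/invocations?" + query
)

request = urllib.request.Request(
    endpoint,
    data=json.dumps({"prompt": "What is 2 + 2?"}).encode(),
    headers={
        "Authorization": f"Bearer {id_token}",  # missing or invalid: 401
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would then stream the response body
```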




&lt;h2&gt;
  
  
  Step 4: API Gateway Stack
&lt;/h2&gt;

&lt;p&gt;Create the REST API with streaming enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aws_apigateway&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ApiGatewayStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;runtime_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create REST API
&lt;/span&gt;        &lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RestApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;rest_api_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;default_cors_preflight_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CorsOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;allow_origins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;allow_methods&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPTIONS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;allow_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Cognito authorizer
&lt;/span&gt;        &lt;span class="n"&gt;authorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CognitoUserPoolsAuthorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CognitoAuthorizer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;cognito_user_pools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_pool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# HTTP Proxy Integration
&lt;/span&gt;        &lt;span class="n"&gt;integration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpIntegration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;runtime_endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# The OAuth2 endpoint from Runtime Stack
&lt;/span&gt;            &lt;span class="n"&gt;http_method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IntegrationOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ConnectionType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTERNET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# 15 minutes with streaming
&lt;/span&gt;                &lt;span class="n"&gt;request_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integration.request.header.Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method.request.header.Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Add method
&lt;/span&gt;        &lt;span class="n"&gt;chat_resource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;post_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_method&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;integration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;authorizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;authorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;authorization_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AuthorizationType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COGNITO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# CRITICAL: Enable streaming with escape hatch
&lt;/span&gt;        &lt;span class="n"&gt;cfn_method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;post_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_child&lt;/span&gt;
        &lt;span class="n"&gt;cfn_method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_property_override&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Integration.ResponseTransferMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STREAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why the escape hatch?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CDK's &lt;code&gt;HttpIntegration&lt;/code&gt; doesn't expose &lt;code&gt;ResponseTransferMode&lt;/code&gt; directly yet. The escape hatch lets you set it on the underlying CloudFormation resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this does&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API Gateway validates the JWT ID token (first layer of defense)&lt;/li&gt;
&lt;li&gt;Forwards the Authorization header to Runtime (second layer of defense)&lt;/li&gt;
&lt;li&gt;Streams the response instead of buffering it&lt;/li&gt;
&lt;li&gt;Allows up to 15 minutes for the request to complete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CORS configuration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows requests from your frontend origin&lt;/li&gt;
&lt;li&gt;Includes Authorization header in allowed headers&lt;/li&gt;
&lt;li&gt;Handles preflight OPTIONS requests&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 5: Agent Implementation
&lt;/h2&gt;

&lt;p&gt;This example uses two SDKs to simplify development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/awslabs/strands-agents" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;&lt;/strong&gt;: A Python framework for building agentic workflows with streaming support built-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore-runtime.html" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore SDK&lt;/a&gt;&lt;/strong&gt;: Handles the AgentCore Runtime integration and streaming protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both SDKs are optional. You can build agents with any framework that returns async generators, but these two make it much quicker to get this demo up and running.&lt;/p&gt;
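&lt;p&gt;Stripped of both SDKs, the contract is just an entry point that returns an async generator. A framework-free sketch, with a fake model stream standing in for the real call:&lt;/p&gt;

```python
import asyncio

async def fake_model_stream(prompt: str):
    """Hypothetical stand-in for a model call; yields text chunks as they arrive."""
    for chunk in ("Hello", ", ", "world", "!"):
        await asyncio.sleep(0)  # pretend to wait on the model
        yield chunk

async def invoke(payload: dict):
    """Entry point in the shape the Runtime expects: returns an async generator."""
    prompt = payload.get("prompt", "Hello!")

    async def generate_stream():
        async for chunk in fake_model_stream(prompt):
            yield chunk

    return generate_stream()

async def main() -> str:
    stream = await invoke({"prompt": "hi"})
    chunks = [chunk async for chunk in stream]
    return "".join(chunks)

result = asyncio.run(main())
# result == "Hello, world!"
```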

&lt;p&gt;Your agent code needs to return an async generator for streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# agent/agent.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Lazy load agent for performance
&lt;/span&gt;&lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that can perform calculations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Entry point that returns an async generator for streaming&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return an async generator
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;agent_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent_stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="c1"&gt;# You can also handle tool use events here if needed
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BedrockAgentCoreApp&lt;/code&gt; detects when you return an async generator&lt;/li&gt;
&lt;li&gt;It handles the streaming protocol automatically&lt;/li&gt;
&lt;li&gt;Each &lt;code&gt;yield&lt;/code&gt; sends a chunk to the client immediately&lt;/li&gt;
&lt;li&gt;The stream flows: Agent → Runtime → API Gateway → Client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why lazy load the agent?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The runtime reuses the same container across invocations, which means the agent instance stays in memory. This is crucial for maintaining conversation context and history between requests. By lazy loading, you initialize the agent once and it persists across all subsequent invocations, allowing multi-turn conversations to work naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent requirements&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# agent/pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"streaming-agent"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.0"&lt;/span&gt;
&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"bedrock-agentcore-runtime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"strands"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"strands-tools"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6: Frontend Implementation
&lt;/h2&gt;

&lt;p&gt;Handle streaming on the client side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get ID token from Cognito (after OAuth2 flow)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;idToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;sessionStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id_token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://your-api.execute-api.us-west-2.amazonaws.com/chat&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Authorization&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;idToken&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Read the stream&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TextDecoder&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Decode chunk (may contain partial UTF-8 sequences)&lt;/span&gt;
        &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Display immediately&lt;/span&gt;
        &lt;span class="nf"&gt;appendToMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Flush any remaining buffer&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;appendToMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Use &lt;code&gt;{ stream: true }&lt;/code&gt; in &lt;code&gt;TextDecoder.decode()&lt;/code&gt;. This handles partial UTF-8 sequences that can occur at chunk boundaries.&lt;/p&gt;
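&lt;p&gt;The chunk-boundary problem is easy to reproduce. Python's &lt;code&gt;codecs&lt;/code&gt; module has the same incremental-decode behavior as &lt;code&gt;TextDecoder&lt;/code&gt; with &lt;code&gt;{ stream: true }&lt;/code&gt;, so a quick sketch shows why the flag matters:&lt;/p&gt;

```python
import codecs

text = "caffè"                        # 'è' is two bytes in UTF-8
data = text.encode("utf-8")
first, second = data[:5], data[5:]    # split in the middle of 'è'

# Decoding each chunk independently fails on the partial sequence
naive_failed = False
try:
    first.decode("utf-8")
except UnicodeDecodeError:
    naive_failed = True

# An incremental decoder buffers the incomplete bytes until the next chunk
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = decoder.decode(first) + decoder.decode(second)
streamed += decoder.decode(b"", final=True)  # flush, like decode() with no args in JS
```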

&lt;p&gt;&lt;strong&gt;OAuth2 flow&lt;/strong&gt; (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Redirect to Cognito for authentication&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cognitoDomain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://agent-123456789.auth.us-west-2.amazoncognito.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;clientId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;your-client-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;redirectUri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:3000/callback.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
        &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;cognitoDomain&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/oauth2/authorize?`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`client_id=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;&amp;amp;`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`response_type=code&amp;amp;`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`scope=openid+email&amp;amp;`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="s2"&gt;`redirect_uri=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;encodeURIComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;redirectUri&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Handle callback (in callback.html)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;handleCallback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Exchange code for tokens&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;cognitoDomain&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/oauth2/token`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/x-www-form-urlencoded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URLSearchParams&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="na"&gt;grant_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;authorization_code&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;clientId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;redirect_uri&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;redirectUri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;sessionStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id_token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id_token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;sessionStorage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;access_token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Redirect back to app&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;Deploy the stacks in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;api-gw-sr-runtime

&lt;span class="c"&gt;# Bootstrap CDK (first time only)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_PROFILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-profile
uv run cdk bootstrap

&lt;span class="c"&gt;# Deploy all stacks&lt;/span&gt;
uv run cdk deploy &lt;span class="nt"&gt;--all&lt;/span&gt;

&lt;span class="c"&gt;# Update frontend config with API URL&lt;/span&gt;
./update-spa-config.sh

&lt;span class="c"&gt;# Create a test user&lt;/span&gt;
./create-test-user.sh testuser@example.com TestPassword123!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cognito stack deploys (User Pool + Client)&lt;/li&gt;
&lt;li&gt;Runtime stack deploys (references Cognito)&lt;/li&gt;
&lt;li&gt;API Gateway stack deploys (references Runtime endpoint)&lt;/li&gt;
&lt;li&gt;Scripts configure frontend and create test user&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test with curl
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get your ID token (from browser sessionStorage or Cognito)&lt;/span&gt;
&lt;span class="nv"&gt;ID_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"eyJraWQiOi..."&lt;/span&gt;

&lt;span class="c"&gt;# Test with -N flag for no buffering&lt;/span&gt;
curl &lt;span class="nt"&gt;-N&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  https://your-api.execute-api.us-west-2.amazonaws.com/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$ID_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"prompt":"What is 25 * 4? Show your work."}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the response appear incrementally, not all at once.&lt;/p&gt;
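&lt;p&gt;If you'd rather test from Python than curl, here's a minimal sketch using only the standard library. The URL and token are placeholders, and &lt;code&gt;print_stream&lt;/code&gt;/&lt;code&gt;chat&lt;/code&gt; are illustrative helper names, not part of the project. The chunk-handling logic is split into its own function so it works on any file-like object:&lt;/p&gt;

```python
import io
import json
import urllib.request


def print_stream(resp, chunk_size=256):
    """Read and print small chunks as they arrive instead of buffering the body."""
    out = []
    while chunk := resp.read(chunk_size):
        text = chunk.decode(errors="replace")
        print(text, end="", flush=True)
        out.append(text)
    return "".join(out)


def chat(api_url, id_token, prompt):
    # POST the prompt with the Cognito ID token, then stream the reply.
    req = urllib.request.Request(
        api_url,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={
            "Authorization": f"Bearer {id_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return print_stream(resp)


# print_stream works on any file-like object, so it can be exercised locally:
demo = print_stream(io.BytesIO(b"100. The answer streams in."))
```

&lt;p&gt;If the output arrives in one burst instead of chunk by chunk, revisit the streaming checks in the troubleshooting section.&lt;/p&gt;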

&lt;h3&gt;
  
  
  Test with frontend
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;spa
python &lt;span class="nt"&gt;-m&lt;/span&gt; http.server 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;a href="http://localhost:3000" rel="noopener noreferrer"&gt;http://localhost:3000&lt;/a&gt;, log in, and send a message. You should see the response stream in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Streaming not working (response appears all at once)
&lt;/h3&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is &lt;code&gt;ResponseTransferMode: STREAM&lt;/code&gt; set on the API Gateway method?&lt;/li&gt;
&lt;li&gt;Are you using the &lt;code&gt;/invocations&lt;/code&gt; endpoint?&lt;/li&gt;
&lt;li&gt;Is your agent returning an async generator?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  401 Unauthorized
&lt;/h3&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the ID token valid? (Check expiration)&lt;/li&gt;
&lt;li&gt;Is the token in the Authorization header?&lt;/li&gt;
&lt;li&gt;Does the JWT authorizer configuration match your Cognito User Pool?&lt;/li&gt;
&lt;li&gt;Are you using the ID token (not access token)?&lt;/li&gt;
&lt;/ul&gt;
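&lt;p&gt;One quick way to run these checks is to decode the token payload locally. This is a debugging sketch only: it does not verify the signature, and &lt;code&gt;decode_jwt_claims&lt;/code&gt; is an illustrative helper, not part of the project. Cognito tokens carry a &lt;code&gt;token_use&lt;/code&gt; claim (&lt;code&gt;"id"&lt;/code&gt; or &lt;code&gt;"access"&lt;/code&gt;) and an &lt;code&gt;exp&lt;/code&gt; timestamp, which covers the first and last bullets above:&lt;/p&gt;

```python
import base64
import json
import time


def decode_jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature.

    For local debugging only, e.g. checking `exp` and `token_use`.
    """
    payload_b64 = token.split(".")[1]
    # Restore the base64 padding that JWTs strip
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


# Build a fake (unsigned) token just to demonstrate the helper:
claims = {"token_use": "id", "exp": int(time.time()) + 3600}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
fake = "e30." + payload + ".sig"

decoded = decode_jwt_claims(fake)
print(decoded["token_use"])          # "id" for ID tokens, "access" for access tokens
print(decoded["exp"] > time.time())  # True while the token has not expired
```

&lt;p&gt;Paste your real ID token from sessionStorage in place of the fake one; if &lt;code&gt;token_use&lt;/code&gt; says &lt;code&gt;"access"&lt;/code&gt; or &lt;code&gt;exp&lt;/code&gt; is in the past, you've found your 401.&lt;/p&gt;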

&lt;h3&gt;
  
  
  502 Bad Gateway
&lt;/h3&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the Runtime endpoint URL correct?&lt;/li&gt;
&lt;li&gt;Does the Runtime have the JWT authorizer configured?&lt;/li&gt;
&lt;li&gt;Is the agent code deployed correctly?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Connection drops after 30 seconds
&lt;/h3&gt;

&lt;p&gt;You're using an edge-optimized endpoint. Switch to regional:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RestApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;endpoint_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EndpointType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REGIONAL&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Add this
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Agent not streaming
&lt;/h3&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is your agent returning an async generator?&lt;/li&gt;
&lt;li&gt;Are you yielding chunks, not returning a complete response?&lt;/li&gt;
&lt;li&gt;Is the agent actually generating data? (Add logging)&lt;/li&gt;
&lt;/ul&gt;
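&lt;p&gt;The difference between the first two bullets can be sketched in a few lines. These are toy functions (&lt;code&gt;not_streaming&lt;/code&gt;/&lt;code&gt;streaming&lt;/code&gt; are illustrative names, not the project's entrypoints), but the shape is what matters: a plain coroutine returns one complete value, while an async generator yields chunks as they're produced:&lt;/p&gt;

```python
import asyncio


async def not_streaming(prompt: str) -> str:
    # A plain coroutine: the caller only ever sees the final string.
    return "the whole answer at once"


async def streaming(prompt: str):
    # An async generator: each yield can become a chunk on the wire.
    for word in ["chunks", " arrive", " incrementally"]:
        yield word


async def collect(prompt: str) -> str:
    # Consume the generator chunk by chunk, as the runtime would.
    return "".join([chunk async for chunk in streaming(prompt)])


print(asyncio.run(collect("hi")))  # chunks arrive incrementally
```

&lt;p&gt;If your handler looks like the first function, the runtime has nothing to stream until it finishes.&lt;/p&gt;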




&lt;h2&gt;
  
  
  Performance Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lazy Load Your Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;  &lt;span class="c1"&gt;# Only initialize once
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_agent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runtime reuses the same container across invocations, and lazy loading keeps the agent instance in memory. This allows the agent to maintain conversation history and context between requests, enabling natural multi-turn conversations without needing external storage.&lt;/p&gt;
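&lt;p&gt;As a toy illustration (this &lt;code&gt;Agent&lt;/code&gt; class is a stand-in, not the real one), here's why the cached module-level instance can carry state across warm invocations:&lt;/p&gt;

```python
class Agent:
    """Stand-in agent that remembers every prompt it has seen."""

    def __init__(self):
        self.history = []

    def ask(self, prompt):
        self.history.append(prompt)
        return f"turn {len(self.history)}: {prompt}"


_agent = None


def get_agent():
    global _agent
    if _agent is None:
        _agent = Agent()  # constructed once per container
    return _agent


# Two separate "invocations" in the same warm container hit the same instance:
print(get_agent().ask("hello"))  # turn 1: hello
print(get_agent().ask("again"))  # turn 2: again
```

&lt;p&gt;The flip side: a cold start or container recycle resets &lt;code&gt;_agent&lt;/code&gt;, so treat this in-memory history as a cache, not durable storage.&lt;/p&gt;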

&lt;h2&gt;
  
  
  Full Code Repository
&lt;/h2&gt;

&lt;p&gt;Complete code: &lt;a href="https://github.com/mikegc-aws/agentic-examples/tree/master/api-gw-sr-runtime" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All CDK stacks&lt;/li&gt;
&lt;li&gt;Agent implementation&lt;/li&gt;
&lt;li&gt;Frontend with OAuth2&lt;/li&gt;
&lt;li&gt;Deployment scripts&lt;/li&gt;
&lt;li&gt;Test utilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/" rel="noopener noreferrer"&gt;API Gateway Response Streaming Blog Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/11/api-gateway-response-streaming-rest-apis/" rel="noopener noreferrer"&gt;What's New Announcement&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/apigateway/" rel="noopener noreferrer"&gt;API Gateway Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore-runtime.html" rel="noopener noreferrer"&gt;AgentCore Runtime Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cognito/" rel="noopener noreferrer"&gt;Cognito Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/api/v2/" rel="noopener noreferrer"&gt;CDK API Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>serverless</category>
      <category>ai</category>
      <category>aws</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
