<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Morgan Willis</title>
    <description>The latest articles on Forem by Morgan Willis (@morganwilliscloud).</description>
    <link>https://forem.com/morganwilliscloud</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3618325%2F470cf6d0-e54c-4ddf-8d83-e3db9f829f2b.jpg</url>
      <title>Forem: Morgan Willis</title>
      <link>https://forem.com/morganwilliscloud</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/morganwilliscloud"/>
    <language>en</language>
    <item>
      <title>Amazon Bedrock for Beginners: From First Prompt to AI Agent (Full Tutorial)</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Tue, 14 Apr 2026 21:14:17 +0000</pubDate>
      <link>https://forem.com/aws/amazon-bedrock-for-beginners-from-first-prompt-to-ai-agent-full-tutorial-12ln</link>
      <guid>https://forem.com/aws/amazon-bedrock-for-beginners-from-first-prompt-to-ai-agent-full-tutorial-12ln</guid>
      <description>&lt;p&gt;So you want to add AI to your application. Maybe you want to build a smart assistant, add a feature that analyzes user input, or you have an AI-powered side project you've been meaning to start.&lt;/p&gt;

&lt;p&gt;On the surface, it sounds simple. Call a model, get a response. But once you actually try to build it, the questions start to stack up fast.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which model do you use?&lt;/li&gt;
&lt;li&gt;How do you call it from your application code?&lt;/li&gt;
&lt;li&gt;What happens when you want the AI to interact with your own data or external systems?&lt;/li&gt;
&lt;li&gt;And how do you control costs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can feel like you need to understand everything before you can build anything, but you don't.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock is a great place to start because it's a fully managed service on AWS that gives you API access to AI models from providers like Amazon, Anthropic, Meta, Mistral, and more. You don't need to set up servers or manage infrastructure, and you only pay for what you use.&lt;/p&gt;

&lt;p&gt;On top of model access, Bedrock includes features like Knowledge Bases for connecting your own data, Guardrails for content safety, and tool use for interacting with the real world. &lt;/p&gt;

&lt;p&gt;This post walks through Bedrock's main features with code examples you can run yourself in your own AWS account. Everything comes from the &lt;a href="https://github.com/aws-samples/sample-amazon-bedrock-for-beginners" rel="noopener noreferrer"&gt;companion repo&lt;/a&gt;, which has full working implementations of each example. By the end, we'll combine everything into an AI agent using the Strands Agents SDK to build out a university FAQ chatbot.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc78lztncbohk4o93zkwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc78lztncbohk4o93zkwz.png" alt="University Chatbot Architecture" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A heads up before we start: we're going to do things step by step, and this could take a while if you're following along. Give yourself an hour or so if you're a total beginner. We'll work directly with the Bedrock APIs so you understand exactly how the pieces fit together. Then at the end, we'll take an easier approach that handles much of the complexity for you. Learning the fundamentals first will make everything else click later.&lt;/p&gt;

&lt;p&gt;If you prefer a video walkthrough, this post has an accompanying video that covers the same material with live demos:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/FAgmR9VV0GQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before following along, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.12+&lt;/strong&gt; installed on your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AWS account&lt;/strong&gt; with &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" rel="noopener noreferrer"&gt;credentials configured locally&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An IAM user or role&lt;/strong&gt; created in your AWS account to follow along with the AWS Console steps; you cannot complete the tutorial as the root user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll also need to install boto3, the Python SDK for interacting with AWS services programmatically. Run the following in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
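&lt;p&gt;Before moving on, it's worth confirming your credentials work. With credentials configured, the STS &lt;code&gt;GetCallerIdentity&lt;/code&gt; call returns the account and identity you're acting as; the small formatting helper here is just for illustration:&lt;/p&gt;

```python
def describe_identity(identity):
    # Format the dict returned by STS GetCallerIdentity
    return f"Account {identity['Account']} as {identity['Arn']}"

# With credentials configured, you would run:
#   import boto3
#   print(describe_identity(boto3.client("sts").get_caller_identity()))
```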



&lt;h2&gt;
  
  
  Making API Calls to Amazon Bedrock
&lt;/h2&gt;

&lt;p&gt;When you send a prompt to a model and receive a response, that process is called &lt;strong&gt;inference&lt;/strong&gt;. You provide input, the model runs its computation, and it generates output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pjqo96zk3wlf7t1cy4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6pjqo96zk3wlf7t1cy4w.png" alt="Inference" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI-powered applications, you need to run inference against models programmatically through an API. Bedrock exposes a set of APIs for this. Let's start with the &lt;strong&gt;Converse API&lt;/strong&gt;, which is the standard way to call models on Bedrock.&lt;/p&gt;

&lt;p&gt;The Converse API uses the same request format regardless of which model you're talking to. That means you can switch from Amazon Nova to Meta Llama to Anthropic Claude Haiku and still use the same code.&lt;/p&gt;

&lt;p&gt;Here's a complete first API call to Amazon Bedrock:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;use_converse_api&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a system prompt to set model behavior
&lt;/span&gt;    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful technical assistant who explains concepts clearly and concisely.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# User message
&lt;/span&gt;    &lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is serverless computing?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Use the Converse API
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the response
&lt;/span&gt;    &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Display token usage
&lt;/span&gt;    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;inputTokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output tokens: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputTokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;N/A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_converse_api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down the structure of this API call, because you'll see the same pattern throughout the rest of the examples:&lt;/p&gt;

&lt;p&gt;At the top, we import boto3 and create a &lt;code&gt;bedrock-runtime&lt;/code&gt; client. This client is how your Python code communicates with the Bedrock service over the network.&lt;/p&gt;

&lt;p&gt;Then we define the &lt;code&gt;model_id&lt;/code&gt;. We're using Amazon Nova Lite, a fast and cost-efficient model. Every model in Bedrock has a unique ID. You can find the full list in the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html" rel="noopener noreferrer"&gt;supported model IDs documentation&lt;/a&gt;.&lt;/p&gt;
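&lt;p&gt;You can also discover model IDs programmatically with the control-plane &lt;code&gt;bedrock&lt;/code&gt; client's &lt;code&gt;list_foundation_models&lt;/code&gt; call. A minimal sketch; the &lt;code&gt;text_model_ids&lt;/code&gt; helper is ours, filtering on the response fields documented for boto3:&lt;/p&gt;

```python
def text_model_ids(response):
    # Keep only models that can generate text output. 'modelSummaries',
    # 'modelId', and 'outputModalities' are fields of the
    # ListFoundationModels response.
    return [
        summary["modelId"]
        for summary in response.get("modelSummaries", [])
        if "TEXT" in summary.get("outputModalities", [])
    ]

# With credentials configured, you would run:
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="us-east-1")
#   print(text_model_ids(bedrock.list_foundation_models()))
```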

&lt;p&gt;The call to &lt;code&gt;converse()&lt;/code&gt; has three main parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;system&lt;/code&gt;&lt;/strong&gt;: The system prompt defines the model's role and behavior. Think of it as instructions for how the model should respond. The system prompt is sent with every inference request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;messages&lt;/code&gt;&lt;/strong&gt;: The conversation between the user and the model. Each message has a &lt;code&gt;role&lt;/code&gt; (either &lt;code&gt;"user"&lt;/code&gt; or &lt;code&gt;"assistant"&lt;/code&gt;) and &lt;code&gt;content&lt;/code&gt;. This structure lets the model understand who said what. In a real application, the user message would come from a frontend, a mobile app, or command line input. We're hardcoding it here to keep things simple.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;inferenceConfig&lt;/code&gt;&lt;/strong&gt;: Parameters that control how the model generates its response. &lt;code&gt;temperature&lt;/code&gt; controls how random or creative the output is. Set it to 0.0 and you get the most predictable response every time, which is useful for tasks like classification or data extraction. Push it higher toward 1.0 and the output gets more varied, which works better for creative writing or brainstorming. &lt;code&gt;maxTokens&lt;/code&gt; caps how long the response can be. Different models support different inference parameters, so check the documentation for the specific model you're using.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Converse API is the recommended approach because it works the same across all models. Change the &lt;code&gt;modelId&lt;/code&gt; from Nova to Llama to Mistral, and your code still works.&lt;/p&gt;
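&lt;p&gt;You can see that model-agnosticism by factoring the call into a helper that takes the model ID as a parameter. This is our own sketch, not from the companion repo, and the model IDs in the comment are examples:&lt;/p&gt;

```python
def ask(client, model_id, text):
    # Same Converse request shape for every model; only modelId changes
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
        inferenceConfig={"temperature": 0.7, "maxTokens": 500},
    )
    return response["output"]["message"]["content"][0]["text"]

# With a bedrock-runtime client, the same helper works for any model:
#   for mid in ["us.amazon.nova-lite-v1:0", "us.meta.llama3-2-3b-instruct-v1:0"]:
#       print(ask(bedrock_runtime, mid, "What is serverless computing?"))
```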

&lt;h2&gt;
  
  
  Understanding Tokens
&lt;/h2&gt;

&lt;p&gt;Notice in the code above we printed token usage at the end of the script. Before we go further, you need to understand what tokens are, because they directly affect how much you pay.&lt;/p&gt;

&lt;p&gt;A token is a small chunk of text. It might be a whole word, part of a word, or even punctuation. Different models break text into tokens in slightly different ways, and there is no universal standard. When you send a prompt to a model, your text gets broken into tokens. The model processes those tokens and generates new tokens as its response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjeqs882nxscswmmvsua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjeqs882nxscswmmvsua.png" alt="Tokens" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A short sentence like "What is serverless computing?" gets broken into several tokens. Longer prompts mean more input tokens. Longer responses mean more output tokens. You're billed for both, so the size of your prompt and the length of the model's response directly affect cost. Always set &lt;code&gt;maxTokens&lt;/code&gt; to prevent runaway responses from driving up your bill.&lt;/p&gt;

&lt;p&gt;Every model also has a &lt;strong&gt;context window&lt;/strong&gt;, which is the maximum number of tokens it can handle in a single request. This is the model's working memory. Your input tokens and output tokens all need to fit inside the context window. If you exceed the window, the API returns an error because it cannot process that many tokens in one call. This becomes important for long conversations and applications where you inject large amounts of data into the prompt for the model to reason over.&lt;/p&gt;

&lt;p&gt;You can use the &lt;a href="https://aws.amazon.com/bedrock/pricing/" rel="noopener noreferrer"&gt;Bedrock pricing page&lt;/a&gt; to understand token costs for different models.&lt;/p&gt;
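&lt;p&gt;There's no universal tokenizer, but a common rule of thumb for English text is roughly four characters per token, which is enough for ballpark estimates before you check the real numbers in the response's &lt;code&gt;usage&lt;/code&gt; field. A rough sketch under that assumption (the helpers and prices are illustrative, not actual Bedrock rates):&lt;/p&gt;

```python
def estimate_tokens(text):
    # Heuristic: ~4 characters per token for English text.
    # Real tokenizers vary by model; trust the API's usage field for billing.
    return max(1, len(text) // 4)

def estimate_cost(input_text, output_tokens, price_in, price_out):
    # Prices are per 1,000 tokens; placeholder values, see the pricing page
    return (estimate_tokens(input_text) / 1000) * price_in \
        + (output_tokens / 1000) * price_out

print(estimate_tokens("What is serverless computing?"))  # 7
```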

&lt;h2&gt;
  
  
  Multi-Turn Conversations
&lt;/h2&gt;

&lt;p&gt;Up to this point, we've done single-turn interactions: one prompt, one response. But real applications usually need ongoing conversations where the model remembers what was said earlier.&lt;/p&gt;

&lt;p&gt;Here's the thing though: models are &lt;strong&gt;stateless&lt;/strong&gt; by design. Each API call is completely independent, and the model doesn't remember anything from previous requests. You need to explicitly send the full conversation history with every call.&lt;/p&gt;

&lt;p&gt;This is how all AI-powered chat applications work. They seem to remember everything you said between prompts, but only because the application resends the conversation history as context with every request.&lt;/p&gt;
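&lt;p&gt;The bookkeeping is always the same three steps: append the user message to the history, call the model with the whole history, then append the assistant's reply. That pattern can be wrapped in a small helper like this sketch (&lt;code&gt;send_message&lt;/code&gt; is our name, not a Bedrock API):&lt;/p&gt;

```python
def send_message(client, model_id, system_prompt, history, text):
    # Append the user turn, send the full history, record the reply
    history.append({"role": "user", "content": [{"text": text}]})
    response = client.converse(
        modelId=model_id,
        system=system_prompt,
        messages=history,
        inferenceConfig={"temperature": 0.7, "maxTokens": 200},
    )
    reply = response["output"]["message"]["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"text": reply}]})
    return reply
```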

&lt;p&gt;That means when you are writing apps that need multi-turn conversations, your code is responsible for managing and sending the full context. Let's build a cooking assistant that demonstrates three conversation turns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multi_turn_conversation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# System prompt sets the assistant's behavior
&lt;/span&gt;    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful cooking assistant. Provide concise recipe suggestions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Conversation history - we'll build this up with each turn
&lt;/span&gt;    &lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Turn 1: Ask for recipe suggestions
&lt;/span&gt;    &lt;span class="n"&gt;user_message_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Suggest a quick dinner recipe with chicken.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message_1&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;assistant_message_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Add assistant's response to history
&lt;/span&gt;    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assistant_message_1&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Turn 2: Ask for modifications
&lt;/span&gt;    &lt;span class="n"&gt;user_message_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can you make it vegetarian instead?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message_2&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;assistant_message_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assistant_message_2&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Turn 3: Ask for cooking time
&lt;/span&gt;    &lt;span class="n"&gt;user_message_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long will this take to prepare?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message_3&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;assistant_message_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_3&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assistant_message_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;multi_turn_conversation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is the same for every turn: append the user message to the conversation history, call the Converse API with the full history, then append the assistant's response back to the history.&lt;/p&gt;

&lt;p&gt;The model can reference what was said in turn 1 when responding to turn 2, but only because you're resending everything. You're paying for those tokens each time too, which is why conversation history management matters for cost.&lt;/p&gt;

&lt;p&gt;In production, you'd store conversation history somewhere persistent, like a database. When a user returns, you load their history and continue where they left off.&lt;/p&gt;
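&lt;p&gt;As a rough sketch of that idea (file-based here just to keep it visible; a real app would more likely use a database keyed by user or session ID, and these helper names are made up for illustration):&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical persistence helpers. In production you'd likely store
# conversation history in a database keyed by a session ID; a JSON file
# keeps the load/save idea simple and visible.

def save_history(path, history):
    """Persist the full messages list (the same structure Converse expects)."""
    Path(path).write_text(json.dumps(history))

def load_history(path):
    """Load a returning user's history, or start fresh if none exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []
```

When the user returns, you'd call `load_history`, pass the result straight into the Converse API as `messages`, and save it again after each turn.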

&lt;p&gt;Using the Converse API like this is deliberately doing it the hard way, for learning purposes. In a real application, you also wouldn't keep redundant code like this. You'd refactor the common logic into functions and collect user input dynamically.&lt;/p&gt;
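&lt;p&gt;For example, the repeated append-call-append logic from each turn could collapse into one helper. This is a hypothetical sketch, not part of the tutorial code: the &lt;code&gt;send_turn&lt;/code&gt; name and parameters are made up, and the client is passed in so it could be any boto3 bedrock-runtime client.&lt;/p&gt;

```python
# Hypothetical refactor of the per-turn pattern: append the user message,
# call Converse with the full history, append the assistant's reply.
# `client` is assumed to be a boto3 bedrock-runtime client (or anything
# with a compatible converse() method).

def send_turn(client, model_id, system_prompt, history, user_text,
              temperature=0.7, max_tokens=200):
    history.append({"role": "user", "content": [{"text": user_text}]})
    response = client.converse(
        modelId=model_id,
        system=system_prompt,
        messages=history,
        inferenceConfig={"temperature": temperature, "maxTokens": max_tokens},
    )
    reply = response["output"]["message"]["content"][0]["text"]
    history.append({"role": "assistant", "content": [{"text": reply}]})
    return reply
```

Each of the three turns in the script above would then be a single `send_turn(...)` call.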

&lt;p&gt;There are higher-level libraries and frameworks that can handle a lot of that complexity for you, including managing the message history and formatting the request body. But we're working with the Bedrock APIs directly for now so you understand exactly how Bedrock and AI models actually work. Later, when I show you the simpler way using the Strands Agents SDK, you'll fully understand what's happening under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Use (Function Calling)
&lt;/h2&gt;

&lt;p&gt;Everything we've done so far has been purely text in, text out. The model generates a response based on its training data and whatever you include in the prompt. But this is a problem for real-world usage.&lt;/p&gt;

&lt;p&gt;You can’t rely on training data alone. Models have a knowledge cutoff based on when they were trained, and they don’t have access to real-time or external data like today’s weather, live content from the internet, or data stored in databases.&lt;/p&gt;

&lt;p&gt;They don't know what's happening right now, and they can't take actions in the real world on their own.&lt;/p&gt;

&lt;p&gt;That's where &lt;strong&gt;tool use&lt;/strong&gt; comes in. Tools are functions that a model can request your application to run in order to interact with external systems. The model doesn't execute tools itself. It sends a structured request saying "I want to call this function with these arguments," and your code handles the actual execution.&lt;/p&gt;

&lt;p&gt;This is how most modern AI applications work. A chatbot that does research for you using the internet? That's tool use. A coding assistant that reads files from your local disk? Tool use. A personal assistant bot that checks your calendar? Also tool use.&lt;/p&gt;

&lt;p&gt;Now, this does get a bit involved when you're doing everything the hard way, but stick with me. Understanding this flow is foundational to understanding how AI applications work.&lt;/p&gt;

&lt;p&gt;Think of it like this: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;1.&lt;/strong&gt; You create tools in your code.&lt;br&gt;
&lt;strong&gt;2.&lt;/strong&gt; You write code to inform the model what tools exist and how to use them; this description is often called a tool schema.&lt;br&gt;
&lt;strong&gt;3.&lt;/strong&gt; You send the model a prompt along with the tool schema.&lt;br&gt;
&lt;strong&gt;4.&lt;/strong&gt; The model reasons over the prompt and decides if it needs a tool to answer.&lt;br&gt;
&lt;strong&gt;5.&lt;/strong&gt; If it does need a tool, the model returns a response to your application code including information on which tool to call and with what arguments.&lt;br&gt;
&lt;strong&gt;6.&lt;/strong&gt; Your code runs the tool.&lt;br&gt;
&lt;strong&gt;7.&lt;/strong&gt; Your code sends the result of the tool call back to the model.&lt;br&gt;
&lt;strong&gt;8.&lt;/strong&gt; The model reasons over the tool result and works that information into its final response.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7njmkzw88o78i3n6hjkx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7njmkzw88o78i3n6hjkx.png" alt="Tool Use" width="800" height="758"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the full tool use example following this flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# ---------------------------------------------------------------------------
# Step 1: Define your local Python functions
# ---------------------------------------------------------------------------
# These are regular Python functions. The model will never call them directly.
# Instead, the model will ASK us to call them by returning a tool_use block.
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fahrenheit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Simulate fetching weather data for a location.
    In a real app, this would call a weather API like OpenWeatherMap.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;58&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fahrenheit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;condition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Partly cloudy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;72%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8 mph NW&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;


&lt;span class="c1"&gt;# ---------------------------------------------------------------------------
# Step 2: Describe your functions as "tools" for the model
# ---------------------------------------------------------------------------
# The model needs a description of each tool so it knows:
#   - What the tool does (description)
#   - What inputs it expects (inputSchema)
#
# This is like writing documentation so someone else can use your function.
&lt;/span&gt;
&lt;span class="n"&gt;TOOL_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolSpec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Get the current weather for a given location.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The city and state, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;San Francisco, CA&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;},&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fahrenheit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;celsius&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temperature unit (default: fahrenheit)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# ---------------------------------------------------------------------------
# Step 3: Map tool names to actual Python functions
# ---------------------------------------------------------------------------
# When the model asks to use a tool, we look up the function by name here.
&lt;/span&gt;
&lt;span class="n"&gt;TOOL_FUNCTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Look up a tool by name and call it with the provided input.
    Returns the result as a dictionary.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TOOL_FUNCTIONS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# ** unpacks the dict into keyword arguments:
&lt;/span&gt;    &lt;span class="c1"&gt;#   get_weather(**{"location": "Seattle"})  →  get_weather(location="Seattle")
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# ---------------------------------------------------------------------------
# Step 4: The main tool use loop
# ---------------------------------------------------------------------------
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_use_demo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather like in Seattle right now?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# First API call: send the message AND the tool definitions to the model
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;toolConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_CONFIG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stopReason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;assistant_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Check: did the model ask to use a tool?
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Find the toolUse block in the response
&lt;/span&gt;        &lt;span class="n"&gt;tool_use_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_use_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use_block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;tool_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use_block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;tool_use_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_use_block&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Run the actual function
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Send the result back to the model
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolResult&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;toolUseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_use_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# Second API call: model generates its final answer using the tool result
&lt;/span&gt;        &lt;span class="n"&gt;final_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;toolConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_CONFIG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;inferenceConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;final_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;tool_use_demo&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through what's happening:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt; is defining the actual Python function. The tool in this case is a local function that simulates fetching weather data. In the real world, you'd swap it out for a call to a real weather API. The model never calls this function directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt; is creating a tool schema that describes the function to the model. Think of this like writing documentation so the model knows how to use it. We give the tool a name, a description in natural language, and an input schema that lays out what parameters the tool accepts, their types, and whether they're required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt; is a dictionary that maps tool names to actual functions. When the model decides it needs a tool, it returns the name of the tool it wants to call. We need to be able to look that up and figure out which function to run. The &lt;code&gt;run_tool&lt;/code&gt; function handles this dispatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt; is the main loop. We call the Converse API with the user message and the tool config. The model sees the question, sees the available tools, and decides it needs the weather tool. It returns a &lt;code&gt;tool_use&lt;/code&gt; block with the function name and arguments. Our code runs the actual function, then sends the result back to the model in a &lt;code&gt;toolResult&lt;/code&gt; message. The model uses that real data to generate its final response.&lt;/p&gt;

&lt;p&gt;The tool itself can be anything: a local function, an API call to another service, a database query, or a function running in the cloud. The pattern stays the same.&lt;/p&gt;

&lt;p&gt;For more details, see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/tool-use.html" rel="noopener noreferrer"&gt;Bedrock tool use documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG and Knowledge Bases
&lt;/h2&gt;

&lt;p&gt;Tools are great, but one of the most common use cases for integrating AI into an application is having it reason over your own private data.&lt;/p&gt;

&lt;p&gt;Models don't have access to your company's internal documentation, your product specs, or any of your proprietary data. If you ask a model about a company's internal processes, it's going to hallucinate something that seems plausible but is actually completely made up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval Augmented Generation (RAG)&lt;/strong&gt; is the common fix for this. The concept is simple: before you ask the model to generate an answer, you first search your own documents for relevant information. Then you include that data in the prompt. The model generates its response grounded in your actual data instead of relying only on what it learned during training.&lt;/p&gt;

&lt;p&gt;Retrieve the data, augment the prompt, generate the response. That's where the abbreviation RAG comes from.&lt;/p&gt;
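&lt;p&gt;Those three steps can be sketched in a few lines of plain Python. Everything here is a stand-in for illustration: the toy retriever just scores word overlap (real RAG uses semantic search, covered next), and &lt;code&gt;generate&lt;/code&gt; is a placeholder where a real application would call a model.&lt;/p&gt;

```python
def retrieve(question, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def augment(question, passages):
    """Build a prompt that grounds the model in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Placeholder: in a real app, this is where you'd call a model."""
    return f"[model response grounded in a prompt of {len(prompt)} chars]"

docs = [
    "Customers can return items within 30 days.",
    "Shipping is free on orders over $50.",
]
question = "What is the return window?"
prompt = augment(question, retrieve(question, docs))
print(generate(prompt))
```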

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcr582swomxf5t35twu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcr582swomxf5t35twu2.png" alt="Retrieval Augmented Generation" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How the Retrieval Part Works
&lt;/h3&gt;

&lt;p&gt;The retrieval step uses &lt;strong&gt;semantic search&lt;/strong&gt;, which is different from traditional keyword search. Keyword search looks for exact word matches, while semantic search understands the meaning of the text and searches on that instead.&lt;/p&gt;

&lt;p&gt;If your document says "customers can return items within 30 days," semantic search will find it when someone asks about "refund window" or "return period," even though those exact words don't appear. The words "queen" and "king" aren't a direct match either, but they're semantically similar because they both represent royalty. Semantic search finds that relationship but traditional search would not.&lt;/p&gt;
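&lt;p&gt;One common way to measure this kind of closeness is cosine similarity between numerical representations of the text. Here's the math on tiny hand-made three-dimensional vectors (the values are invented for illustration; real embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" just to show the comparison
queen = [0.9, 0.8, 0.1]
king = [0.85, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(queen, king))    # high: semantically similar
print(cosine_similarity(queen, banana))  # low: unrelated
```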

&lt;p&gt;To make semantic search work, your data needs to be converted into numbers, or vectors, so the computer can compare meaning mathematically. Here's how the pipeline works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxp7on57a8jq10hd4uwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxp7on57a8jq10hd4uwu.png" alt="RAG Process" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upload and Chunking&lt;/strong&gt;: Upload your documents and then break them into smaller passages called chunks. A 50-page PDF would become many chunks. There are different chunking methods depending on your use case and data structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding&lt;/strong&gt;: Each chunk gets run through an embedding model, which converts the text into a &lt;strong&gt;vector&lt;/strong&gt;, or a list of numbers that represents the meaning of that text. Think of it as a numerical fingerprint of what the text is about.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Those vectors get stored in a &lt;strong&gt;vector database&lt;/strong&gt;, optimized for searching across vectors quickly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt;: When a user asks a question, the question also gets converted into a vector. The vector database then finds the chunks whose vectors are closest to the question's vector. Those are your most relevant passages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt;: The relevant passages get included in the prompt passed to the model, and the model generates an answer grounded in that data.&lt;/li&gt;
&lt;/ol&gt;
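&lt;p&gt;Step 1 is easy to picture with the simplest possible strategy, a fixed-size sliding window. (This is just one option; Knowledge Bases support fancier strategies such as semantic and hierarchical chunking.) A minimal sketch:&lt;/p&gt;

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks with overlap, so a sentence cut at
    a boundary still appears intact in at least one chunk."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A stand-in "document" built by repeating one sentence
document = "Students may enroll in up to 18 credit hours per semester. " * 20
chunks = chunk_text(document)
print(f"{len(document)} chars -> {len(chunks)} chunks")
```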

&lt;p&gt;RAG is powerful, but there's a lot of plumbing involved to make it all work. You have to manage the chunking strategy, run embeddings, pick and maintain a vector database, write retrieval logic, and keep everything in sync when documents change. Luckily, Bedrock does this for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bedrock Knowledge Bases
&lt;/h3&gt;

&lt;p&gt;Bedrock Knowledge Bases automate the entire RAG pipeline. You point it at your documents in S3 (or other sources like Confluence, SharePoint, or Salesforce), and it handles ingestion, chunking, embedding, and vector storage. Then you query it with a single API call.&lt;/p&gt;

&lt;p&gt;For example, suppose a university wants to build a chatbot that helps students find answers to frequently asked questions about course enrollment deadlines, financial aid policies, and general campus information. That is a perfect use case for RAG.&lt;/p&gt;

&lt;p&gt;We'll incrementally build this university chatbot example throughout the rest of this post, starting with knowledge base creation. If you want to follow along, use the &lt;a href="https://github.com/aws-samples/sample-amazon-bedrock-for-beginners" rel="noopener noreferrer"&gt;companion repo&lt;/a&gt;, which contains the full instructions. It includes sample FAQ documents about enrollment, financial aid, housing, and campus services that we will use as the private data.&lt;/p&gt;

&lt;p&gt;To create a Knowledge Base, you first need to upload your documents to Amazon S3, a data storage service. You can go to the &lt;a href="https://console.aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3 Console&lt;/a&gt; and &lt;strong&gt;create a new bucket&lt;/strong&gt;. Then, upload the knowledge base documents to the bucket.&lt;/p&gt;

&lt;p&gt;After that, go to the &lt;a href="https://console.aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Bedrock console&lt;/a&gt; to &lt;strong&gt;Create a knowledge base&lt;/strong&gt;. You'll connect the S3 bucket containing the FAQ documents, choose an embedding model (Amazon Titan Text Embeddings is a good default), and select a vector store. If you're just getting started, &lt;strong&gt;Amazon S3 Vectors&lt;/strong&gt; is the simplest and most cost-effective option since it doesn't require you to provision a separate vector database. Then sync the data.&lt;/p&gt;

&lt;p&gt;Once your Knowledge Base is created and synced, querying it is a single API call to Bedrock using &lt;code&gt;retrieve_and_generate&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# REPLACE THIS with your Knowledge Base ID
&lt;/span&gt;&lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KB_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# From the Bedrock console
&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_agent_runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-agent-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use retrieve_and_generate to query the Knowledge Base
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_agent_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_and_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;retrieveAndGenerateConfiguration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KNOWLEDGE_BASE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;knowledgeBaseConfiguration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;knowledgeBaseId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;modelArn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the generated response
&lt;/span&gt;    &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Display source citations
&lt;/span&gt;    &lt;span class="n"&gt;citations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Sources:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;citations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;reference&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retrievedReferences&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3Location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;When is spring break this year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;query_knowledge_base&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the client is &lt;code&gt;bedrock-agent-runtime&lt;/code&gt;, not &lt;code&gt;bedrock-runtime&lt;/code&gt;. Knowledge Bases use a different API from the Converse API we've been working with.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;retrieve_and_generate&lt;/code&gt; call handles both RAG steps in a single request: it retrieves relevant chunks using semantic search, then passes them to the model to generate a response. You get both the answer and citations pointing back to the source documents, so your users can verify where the information came from.&lt;/p&gt;
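&lt;p&gt;If you only want the retrieval half (for example, to build the prompt yourself), the same client also exposes a &lt;code&gt;retrieve&lt;/code&gt; operation. A sketch, with the Knowledge Base ID left as a placeholder:&lt;/p&gt;

```python
KNOWLEDGE_BASE_ID = "YOUR_KB_ID"  # placeholder: copy yours from the Bedrock console

def retrieve_chunks(question, max_results=3):
    """Return just the top-matching chunks, without generating an answer."""
    import boto3  # imported lazily so the function can be defined without AWS set up
    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": max_results}
        },
    )
    # Each result carries the chunk text plus its source location and relevance score
    return [r["content"]["text"] for r in response["retrievalResults"]]
```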

&lt;p&gt;For more on creating and configuring Knowledge Bases, see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html" rel="noopener noreferrer"&gt;Bedrock Knowledge Bases documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guardrails (Content Safety)
&lt;/h2&gt;

&lt;p&gt;You now know how to build an AI app that can access your data using RAG and interact with external systems through tools. But before you put something like this in front of real users, you need to think about what happens when people try to misuse it or when the model generates something it shouldn't.&lt;/p&gt;

&lt;p&gt;When you put an AI application on the internet, you have to assume it will be abused. You can't fully trust user input, and you can't blindly trust model output either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails&lt;/strong&gt; are content filters that get enforced before and after the model is called. They sit outside the prompt as structural policies. You configure your guardrails once, reference them in your API calls, and they work across any model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gpxdyt204nzn0tlsuee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gpxdyt204nzn0tlsuee.png" alt="Guardrail Types" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available filters include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content filters&lt;/strong&gt;: Detect harmful content like hate speech, violence, sexual content, and even prompt attacks like jailbreaks or prompt injection attempts, with adjustable severity thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denied topics&lt;/strong&gt;: Block entire categories like "investment advice" or "medical diagnosis"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word filters&lt;/strong&gt;: Block specific words or phrases, including profanity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive information filters&lt;/strong&gt;: Find and mask sensitive data like PII, social security numbers, credit cards, and email addresses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual grounding checks&lt;/strong&gt;: Check model responses against a reference source to reduce hallucinations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To create a Guardrail, you can use the &lt;a href="https://console.aws.amazon.com/bedrock/" rel="noopener noreferrer"&gt;Bedrock console&lt;/a&gt;. You give the guardrail a name, configure the content filters with severity thresholds that make sense for your use case, and optionally add denied topics or PII filters. Once configured, create a version to get a guardrail ID and version number.&lt;/p&gt;

&lt;p&gt;For the university chatbot, imagine a student tries to ask the assistant something inappropriate, like how to cheat on an exam or how to hack the university network. A guardrail can detect that type of request and block it before the model ever generates a response.&lt;/p&gt;

&lt;p&gt;Here's how you add it to a Knowledge Base query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# REPLACE THESE with your actual IDs
&lt;/span&gt;&lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KB_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GUARDRAIL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_GUARDRAIL_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GUARDRAIL_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_kb_with_guardrail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-agent-runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_and_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;retrieveAndGenerateConfiguration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KNOWLEDGE_BASE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledgeBaseConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledgeBaseId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;modelArn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generationConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GUARDRAIL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GUARDRAIL_VERSION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How can I cheat on my finals this year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;query_kb_with_guardrail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use a guardrail, all you have to do is add a &lt;code&gt;generationConfiguration&lt;/code&gt; with the guardrail identifier and version number inside the knowledge base configuration. Everything else in your code stays exactly the same.&lt;/p&gt;

&lt;p&gt;A normal question like "When is spring break?" passes through and gets answered normally. But "How can I cheat on my finals?" gets blocked by the guardrail before the model ever generates a response.&lt;/p&gt;
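&lt;p&gt;Your application can detect when a guardrail fired and respond gracefully. With the Converse API, a blocked request comes back with &lt;code&gt;stopReason&lt;/code&gt; set to &lt;code&gt;guardrail_intervened&lt;/code&gt;, so you can branch on it. A sketch, using hand-built dictionaries standing in for real API responses:&lt;/p&gt;

```python
def handle_response(response):
    """Return a friendly message if the guardrail intervened, else the model's text."""
    if response.get("stopReason") == "guardrail_intervened":
        return "Sorry, I can't help with that request."
    return response["output"]["message"]["content"][0]["text"]

# Hand-built examples standing in for real Converse API results
blocked = {"stopReason": "guardrail_intervened",
           "output": {"message": {"content": [{"text": "Blocked by policy."}]}}}
allowed = {"stopReason": "end_turn",
           "output": {"message": {"content": [{"text": "Spring break starts March 9."}]}}}

print(handle_response(blocked))
print(handle_response(allowed))
```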

&lt;p&gt;You can also add guardrails directly to Converse API calls using the &lt;code&gt;guardrailConfig&lt;/code&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock_runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;converse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guardrailConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;guardrailIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your-guardrail-id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;guardrailVersion&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more details on guardrail configuration options, see the &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html" rel="noopener noreferrer"&gt;Bedrock Guardrails documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together with an Agent
&lt;/h2&gt;

&lt;p&gt;We've been doing everything the hard way on purpose so you could see how the pieces actually work. You might be thinking "that tool use code was a lot of work for one function call," and you'd be right.&lt;/p&gt;

&lt;p&gt;Managing the message history, parsing tool requests, executing functions, and sending results back to the model manually is tedious. And it gets more complicated when the model needs multiple tools or several steps to complete a task. &lt;/p&gt;

&lt;p&gt;Additionally, real-world applications need more than a single prompt and response with hardcoded user queries. You need to take dynamic input from the user and pass it to the model. Then the model might need to look up information, call tools, and take several steps before it can answer a question.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;agent&lt;/strong&gt; is a system that lets the model do this. Instead of taking in one prompt and responding one time, it can think through the problem, decide what action to take next, use tools if needed, and repeat that process until it reaches a final answer. Under the hood, the model may be called multiple times as part of a loop until the task is complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhl7r3vdx64h06u85klq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhl7r3vdx64h06u85klq.png" alt="Agent Loop" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;
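&lt;p&gt;To make that loop concrete, here's a tiny self-contained sketch of the cycle in plain Python. The model and tool here are stubs invented for illustration — this is not how Strands implements its loop, just the shape of it:&lt;/p&gt;

```python
# Simplified illustration of an agent loop (stubbed model and tool,
# NOT the real Strands internals).

def fake_model(messages):
    # Pretend the model asks for a tool on the first turn, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "tool": "lookup_course", "input": "CS-101"}
    return {"type": "final", "text": "CS 101 meets Mon/Wed/Fri at 10 AM."}

def run_tool(name, arg):
    # Stand-in for a real tool implementation.
    return f"{name}({arg}) returned a schedule"

def agent_loop(user_input, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):           # repeat until the task is done
        reply = fake_model(messages)
        if reply["type"] == "final":     # model produced its answer
            return reply["text"]
        # Otherwise the model asked for a tool: run it, feed the result back
        result = run_tool(reply["tool"], reply["input"])
        messages.append({"role": "tool", "content": result})
    return "Stopped: too many steps."

print(agent_loop("When does CS 101 meet?"))
```

The important part is the loop itself: the model may be called several times, with tool results appended to the conversation between calls, before it produces a final answer.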

&lt;p&gt;Now we are going to switch to using a higher-level framework that handles much of the complexity of building AI applications for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strands Agents SDK&lt;/strong&gt; is an open-source framework from AWS that makes building AI agents straightforward. It integrates directly with Bedrock, though it supports any model provider, and handles the orchestration we've been doing manually. &lt;/p&gt;

&lt;p&gt;Under the hood, when you use Amazon Bedrock as the model provider, it's calling the same Converse API we've been using throughout this post. That's why it was worth learning the fundamentals first. This should all make sense now rather than feeling like magic.&lt;/p&gt;

&lt;p&gt;To get started with Strands, install the packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;strands-agents strands-agents-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the university chatbot as a Strands agent that combines everything we covered: a Bedrock model, a Knowledge Base for university data, a custom tool for course lookups, and a guardrail for content safety:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;retrieve&lt;/span&gt;

&lt;span class="c1"&gt;# ============================================================
# Configuration — Replace these with your resource IDs
# ============================================================
&lt;/span&gt;
&lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_KB_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GUARDRAIL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_GUARDRAIL_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GUARDRAIL_VERSION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;MODEL_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.amazon.nova-lite-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ============================================================
# Custom Tool: Look Up Course Schedule
# ============================================================
&lt;/span&gt;
&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_course&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;course_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Look up schedule and details for a specific course.

    Use this when a student asks about a particular class,
    like &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;When does CS 201 meet?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who teaches BIO 101?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

    Args:
        department: The department code (e.g., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;).
        course_number: The course number (e.g., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;201&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;).

    Returns:
        Course details including schedule, instructor, and location.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In a real app this would query a course catalog API
&lt;/span&gt;    &lt;span class="n"&gt;courses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CS-101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Introduction to Programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dr. Maria Chen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mon/Wed/Fri 10:00 - 10:50 AM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Turing Engineering Building, Room 210&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CS-201&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Structures&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prof. James Park&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tue/Thu 1:00 - 2:15 PM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Turing Engineering Building, Room 215&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BIO-101&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;General Biology I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dr. Sarah Williams&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mon/Wed 2:00 - 3:15 PM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Science Hall, Room 105&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ENG-102&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;College Writing II&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prof. David Nguyen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tue/Thu 9:30 - 10:45 AM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Humanities Building, Room 302&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MATH-151&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculus I&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dr. Lisa Patel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mon/Wed/Fri 11:00 - 11:50 AM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math &amp;amp; Science Center, Room 120&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;course_number&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;courses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;courses&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Course: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Instructor: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;instructor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Schedule: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Location: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Credits: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;credits&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seats available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;seats_available&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No course found for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Check the department code and course number.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ============================================================
# Build the Agent
# ============================================================
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_university_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create the University chatbot agent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# The built-in retrieve tool reads this env var to find the KB
&lt;/span&gt;    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KNOWLEDGE_BASE_ID&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_REGION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;REGION&lt;/span&gt;

    &lt;span class="n"&gt;bedrock_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;guardrail_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GUARDRAIL_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;guardrail_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GUARDRAIL_VERSION&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are the University virtual assistant.
You help students, prospective students, and parents find information about the university.

Your responsibilities:
- Answer questions about academics, admissions, financial aid, housing, dining, parking, the library, career services, and the academic calendar.
- Use the retrieve tool to search the knowledge base for university policies and FAQ answers before responding.
- Use the lookup_course tool when someone asks about a specific course schedule, instructor, or availability.
- Cite your sources when referencing specific policies or dates.

Guidelines:
- Be friendly and welcoming — remember, students may be stressed about deadlines.
- If you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know the answer, say so and suggest they contact the relevant office.
- Keep answers concise and helpful.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bedrock_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lookup_course&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;


&lt;span class="c1"&gt;# ============================================================
# Run the Agent
# ============================================================
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;University Chatbot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask me about admissions, financial aid, housing, dining,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;course schedules, the academic calendar, and more.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Type &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to exit.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_university_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's walk through what's different here compared to the manual approach.&lt;/p&gt;

&lt;p&gt;At the top, we import a few things from the Strands framework: &lt;code&gt;Agent&lt;/code&gt;, the &lt;code&gt;@tool&lt;/code&gt; decorator for creating custom tools, &lt;code&gt;BedrockModel&lt;/code&gt; for the model provider, and the built-in &lt;code&gt;retrieve&lt;/code&gt; tool from &lt;code&gt;strands_tools&lt;/code&gt; which queries the Knowledge Base we created earlier.&lt;/p&gt;

&lt;p&gt;Then we define our configuration: a Knowledge Base ID for RAG, a Guardrail ID for content filtering, and Amazon Nova as our model, all things we've already set up and used individually.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;@tool&lt;/code&gt; decorator is how you define custom tools in Strands. We've got a &lt;code&gt;lookup_course&lt;/code&gt; tool that simulates looking up course information. In a real application, this would query a database. Compare this to the manual tool use code from earlier and you'll notice there are no lengthy tool schemas, no message parsing, and no dispatch logic. You just write a function with a docstring and Strands handles the rest.&lt;/p&gt;
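&lt;p&gt;For contrast, here's roughly what a single hand-written tool definition looks like with the Converse API. This is an illustrative sketch, not the exact schema from earlier; the field names follow the Bedrock Converse API:&lt;/p&gt;

```python
# Manual Converse-API tool definition: you maintain this JSON schema by
# hand, plus the message parsing and dispatch code that goes with it.
# (Illustrative sketch; field names follow the Bedrock Converse API.)
lookup_course_spec = {
    "toolSpec": {
        "name": "lookup_course",
        "description": "Look up course information by course code.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "course_code": {
                        "type": "string",
                        "description": "The course code, e.g. CS101",
                    }
                },
                "required": ["course_code"],
            }
        },
    }
}
```

&lt;p&gt;With &lt;code&gt;@tool&lt;/code&gt;, Strands generates an equivalent schema from your function signature and docstring.&lt;/p&gt;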

&lt;p&gt;Strands also includes a built-in &lt;code&gt;retrieve&lt;/code&gt; tool that works directly with Bedrock Knowledge Bases. You set the Knowledge Base ID as an environment variable, and the agent decides when to use it.&lt;/p&gt;
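&lt;p&gt;A minimal sketch of that setup. The ID value is a placeholder, and &lt;code&gt;KNOWLEDGE_BASE_ID&lt;/code&gt; is the variable name used by current &lt;code&gt;strands_tools&lt;/code&gt; releases; check the package docs for your version:&lt;/p&gt;

```python
import os

# Placeholder Knowledge Base ID; the built-in retrieve tool reads it from
# the environment (KNOWLEDGE_BASE_ID in current strands_tools releases).
os.environ["KNOWLEDGE_BASE_ID"] = "YOUR_KB_ID"

# Equivalent shell setup:
#   export KNOWLEDGE_BASE_ID=YOUR_KB_ID
```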

&lt;p&gt;We create a &lt;code&gt;BedrockModel&lt;/code&gt; instance with the model ID, region, inference parameters, and guardrail information. Then we define the system prompt telling the agent it's a university chatbot and how it should handle requests. Finally, we create the agent with the model, the tools list (both custom and built-in), and the system prompt.&lt;/p&gt;

&lt;p&gt;The last piece is the interactive loop. We read input from the command line and pass it to the agent. To call the agent, all you need is &lt;code&gt;agent(user_input)&lt;/code&gt;. The framework handles the entire agent loop: when the model needs a tool, Strands executes it and sends the result back to the model.&lt;/p&gt;

&lt;p&gt;Multi-turn conversation management is handled too because each call maintains context from previous turns as long as the program is running.&lt;/p&gt;

&lt;p&gt;Under the hood, Strands is calling the Converse API and using the Bedrock features we covered throughout this post. This should all make a lot more sense now than if you had jumped straight into the agent framework.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://strandsagents.com/" rel="noopener noreferrer"&gt;Strands documentation&lt;/a&gt; has more examples and configuration options.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;We covered a lot of ground. You now have the knowledge you need to start building real applications with AI on AWS using Amazon Bedrock.&lt;/p&gt;

&lt;p&gt;Here are some areas to explore as your application grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching&lt;/strong&gt; can significantly reduce costs on repeated context. If you have a large system prompt or tool definitions that don't change between requests, caching avoids reprocessing those tokens every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region inference&lt;/strong&gt; distributes requests across AWS regions to balance inference load. Instead of hitting limits in one region and failing, Bedrock can route your requests globally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch monitoring&lt;/strong&gt; tracks token usage, latency, throttling, and error rates. Setting up monitoring early helps you catch cost spikes and performance issues before they become problems.&lt;/li&gt;
&lt;/ul&gt;
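&lt;p&gt;As a taste of the first item, enabling prompt caching with the Converse API comes down to inserting a cache checkpoint after the stable part of your prompt. This is a hedged sketch: the model ID and prompt text are placeholders, and caching support varies by model:&lt;/p&gt;

```python
# Sketch: a Converse request with a cache checkpoint after the large,
# stable system prompt. The cachePoint marker tells Bedrock it may cache
# everything before it, so repeat requests skip reprocessing those tokens.
# Model ID and prompt text are placeholders; caching support varies by model.
request = {
    "modelId": "amazon.nova-lite-v1:0",
    "system": [
        {"text": "You are a support assistant. ...long, unchanging instructions..."},
        {"cachePoint": {"type": "default"}},
    ],
    "messages": [
        {"role": "user", "content": [{"text": "How do I reset my password?"}]}
    ],
}
# Then send it, e.g.:
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
```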

&lt;p&gt;The &lt;a href="https://github.com/aws-samples/sample-amazon-bedrock-for-beginners" rel="noopener noreferrer"&gt;companion repo&lt;/a&gt; has the complete code for every example in this post. Clone it, run the examples, and try adapting them to your own use case. Take one of these examples and adapt it to a new use case to push your learning even further and remember to always learn the fundamentals first.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>agents</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>AI Agents Don’t Need Complex Workflows. Build One in Python in 10 Minutes</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Thu, 26 Mar 2026 22:16:10 +0000</pubDate>
      <link>https://forem.com/aws/ai-agents-dont-need-complex-workflows-build-one-in-python-in-10-minutes-2m5d</link>
      <guid>https://forem.com/aws/ai-agents-dont-need-complex-workflows-build-one-in-python-in-10-minutes-2m5d</guid>
      <description>&lt;p&gt;Building an AI agent in Python can be as easy as giving a model some tools and letting it figure out the rest.&lt;/p&gt;

&lt;p&gt;Most agent setups start the same way: you wire up tool calls, manage retries, track state, and write the routing logic that decides what happens when. It works, but it's brittle. Every time the workflow changes, you're back in the code rewiring the sequence.&lt;/p&gt;

&lt;p&gt;Strands is an open-source Python SDK built around a different idea.&lt;/p&gt;

&lt;p&gt;Instead of you hardcoding the orchestration, you let the model handle it. You give it tools and a goal, and the SDK takes care of the agent loop, tool execution, and conversation state. You can go from zero to a working agent in about 10 minutes, and the same primitives that make a simple agent easy to build can be combined to give you more complex setups when you need them.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Model-Driven Approach to AI Agents
&lt;/h2&gt;

&lt;p&gt;The Strands team calls this a model-driven approach: the LLM is the orchestrator, and you define the capabilities it can use.&lt;/p&gt;

&lt;p&gt;In practice, your agent code is mostly a matter of plugging the desired components together. Here's what a basic agent looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a working agent. It uses Amazon Bedrock as the model provider by default, but you can swap in any supported provider. We'll use OpenAI for the rest of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up a Python AI Agent with OpenAI
&lt;/h2&gt;

&lt;p&gt;Install the SDK with the OpenAI extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'strands-agents[openai]'&lt;/span&gt; strands-agents-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set your API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now create an agent that uses OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with &lt;code&gt;python agent.py&lt;/code&gt; and you should get a response. The agent handles the API call to the OpenAI model and response parsing for you. &lt;/p&gt;

&lt;p&gt;So far, this agent doesn't have any tools it can use to interact with the real world. It does, however, already handle the main agent loop, and it's the starting point for building a more capable agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Building Blocks of an AI Agent
&lt;/h2&gt;

&lt;p&gt;Strands has a few core building blocks you should be aware of: agents, tools, models, and hooks. Understanding how they fit together is most of what you need to know.&lt;/p&gt;

&lt;h3&gt;
  
  
  Models
&lt;/h3&gt;

&lt;p&gt;A model is the LLM provider. Strands supports Bedrock (the default), OpenAI, Anthropic, Google Gemini, Meta Llama, Ollama for local models, and several others. You configure the model once and pass it to your agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can set inference parameters using &lt;code&gt;params&lt;/code&gt;. One worth noting is temperature: use a lower temperature for factual tasks and a higher one for creative tasks. Which other inference parameters are supported depends on the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Giving Your Agent Tools
&lt;/h3&gt;

&lt;p&gt;Tools are Python functions that extend what the agent can do beyond generating text. &lt;/p&gt;

&lt;p&gt;Here's a custom tool to return weather data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the current weather for a location.

    Args:
        location: City name, e.g. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seattle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# In a real app, this would call a weather API
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weather in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Sunny, 72°F&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@tool&lt;/code&gt; decorator is all you need. The docstring matters because the model uses it along with the function's type hints to decide when to call this function and what arguments to pass. Clear docstrings lead to better tool usage. &lt;/p&gt;

&lt;p&gt;Strands also ships with a community tools package that includes common utilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_repl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http_request&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These give your agent the ability to do math, run Python code, and make HTTP requests out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wiring It All Together
&lt;/h3&gt;

&lt;p&gt;The agent brings everything together. You give it a model, tools, and a system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that can check weather and do math.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you call the agent with a message, it enters an agent loop. The model reads the message, decides if it needs to use any tools, calls them if it does, reads the results, and either calls more tools or generates a final response. This loop continues until the model decides it has enough information to answer.&lt;/p&gt;

&lt;p&gt;You don't write any of that loop logic, the SDK handles it for you.&lt;/p&gt;
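&lt;p&gt;Conceptually, the loop the SDK runs for you looks something like this simplified sketch. A stubbed function stands in for the LLM; this is not Strands internals, just the control flow:&lt;/p&gt;

```python
# Simplified sketch of an agent loop with a stubbed model.
# Real SDKs also handle streaming, retries, and errors; this shows only
# the control flow: call the model, run requested tools, feed results back.
def fake_model(messages):
    # Stand-in for an LLM: asks for the weather tool once, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "name": "get_weather",
                "input": {"location": "Seattle"}}
    return {"type": "text", "text": "It's sunny in Seattle."}

def get_weather(location):
    return f"Weather in {location}: Sunny, 72°F"

def run_agent(user_message, tools):
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "text":    # no more tools needed: final answer
            return reply["text"]
        result = tools[reply["name"]](**reply["input"])    # execute the tool
        messages.append({"role": "tool", "content": result})

answer = run_agent("What's the weather in Seattle?", {"get_weather": get_weather})
print(answer)  # It's sunny in Seattle.
```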

&lt;h3&gt;
  
  
  Hooking into the Agent Lifecycle
&lt;/h3&gt;

&lt;p&gt;Hooks let you subscribe to lifecycle events in the agent loop without modifying the agent's core logic. The agent emits events at specific points during execution: before and after model calls, before and after tool calls, when messages are added, and at the start and end of each invocation. You register callbacks for the events you care about.&lt;/p&gt;

&lt;p&gt;Here's a hook that logs every tool call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AfterToolCallEvent&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calling tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;With input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_tool_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AfterToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tool &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_tool_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_tool_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AfterToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The available events cover the full lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BeforeInvocationEvent&lt;/code&gt; / &lt;code&gt;AfterInvocationEvent&lt;/code&gt; for the overall request&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BeforeModelCallEvent&lt;/code&gt; / &lt;code&gt;AfterModelCallEvent&lt;/code&gt; for LLM calls&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BeforeToolCallEvent&lt;/code&gt; / &lt;code&gt;AfterToolCallEvent&lt;/code&gt; for tool execution&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MessageAddedEvent&lt;/code&gt; when messages are added to conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hooks are useful for logging, metrics, basic guardrails, and adding logic to your agent's lifecycle. You can also cancel a tool call from a &lt;code&gt;BeforeToolCallEvent&lt;/code&gt; hook by setting &lt;code&gt;event.cancel_tool&lt;/code&gt; to a message, which stops the tool from executing and sends that message back to the model as an error. For example, you can check the tool name and arguments, and block the call if something looks wrong.&lt;/p&gt;
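&lt;p&gt;Here's a self-contained sketch of that cancellation pattern using a stand-in event object. The &lt;code&gt;tool_use&lt;/code&gt; and &lt;code&gt;cancel_tool&lt;/code&gt; fields mirror the behavior described above; in a real agent you'd register the same function with &lt;code&gt;add_hook&lt;/code&gt;:&lt;/p&gt;

```python
# Self-contained sketch of the cancellation pattern. FakeToolCallEvent is a
# stand-in for the real BeforeToolCallEvent; the tool_use and cancel_tool
# fields mirror the behavior described in the text.
class FakeToolCallEvent:
    def __init__(self, name, tool_input):
        self.tool_use = {"name": name, "input": tool_input}
        self.cancel_tool = None   # None means "let the tool run"

BLOCKED_TOOLS = {"python_repl"}   # example policy: no arbitrary code execution

def block_risky_tools(event):
    # Setting cancel_tool stops execution and returns this message
    # to the model as a tool error.
    if event.tool_use["name"] in BLOCKED_TOOLS:
        event.cancel_tool = "This tool is disabled by policy."

event = FakeToolCallEvent("python_repl", {"code": "print('hi')"})
block_risky_tools(event)
print(event.cancel_tool)   # This tool is disabled by policy.

# In a real agent, you'd register the same function with:
# agent.add_hook(block_risky_tools, BeforeToolCallEvent)
```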

&lt;p&gt;Once you start building more complex agents, you'll find yourself wanting to bundle related hooks and tools into reusable packages. We'll get to that later in this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Multi-Tool Agent
&lt;/h2&gt;

&lt;p&gt;Here's a more complete example. This agent has a few example tools that look up weather, do calculations, and count letters in words. The tools themselves don't matter much for learning how to build an agent, since yours will be unique to your use case. For now, we're just exploring how to wire all of the pieces together into an agent that does things:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands_tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the current weather for a location.

    Args:
        location: City name, e.g. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seattle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weather in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Sunny, 72°F&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;letter_counter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;letter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Count occurrences of a specific letter in a word.

    Args:
        word: The word to search in
        letter: The single letter to count
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;letter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;letter_counter&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;I have a few questions:
1. What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in Seattle?
2. What is 1547 * 382?
3. How many r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s are in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strawberry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent will call each tool as needed, collect the results, and give you a single coherent response. You didn't have to write any routing logic or decide which tool to call for which question. The model handles that.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Agents Remember Conversations
&lt;/h2&gt;

&lt;p&gt;Agents maintain conversation context automatically within a running process. Each call to the agent adds to the conversation history, so the model remembers what was said earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is Morgan.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my name?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Will remember "Morgan"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works across multiple turns without any extra code because the SDK manages the message history internally.&lt;/p&gt;

&lt;p&gt;That said, this memory only lives as long as the agent's process lives. If you're running a script locally and it exits, the history is gone. If you're hosting an agent behind an API, each request may be served by a fresh process, so message history is not maintained across requests.&lt;/p&gt;

&lt;p&gt;This is because LLMs are stateless by default, and the conversation history that makes them feel stateful is just a list of messages that gets sent with every request.&lt;/p&gt;

&lt;p&gt;For anything beyond a local script, you need to persist that history somewhere.&lt;/p&gt;
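<p>You can see why in miniature. The sketch below is plain Python with a fake <code>send_to_model</code> standing in for a real model call; it shows that the "memory" is nothing more than a message list the client resends on every request:</p>

```python
# Illustration only: how "memory" works for a stateless LLM. The model never
# remembers anything; the client resends the full message list every request.
# send_to_model is a stand-in for a real model call.

def send_to_model(messages):
    # Pretend model: it can only answer from what's in the history it was sent.
    history = " ".join(m["content"] for m in messages if m["role"] == "user")
    if "What's my name?" in messages[-1]["content"]:
        return "Your name is Morgan." if "Morgan" in history else "I don't know."
    return "Okay."

messages = []

def chat(user_text):
    messages.append({"role": "user", "content": user_text})
    reply = send_to_model(messages)  # the full history goes out every time
    messages.append({"role": "assistant", "content": reply})
    return reply

chat("My name is Morgan.")
print(chat("What's my name?"))  # the history, not the model, carries the memory
```

<p>Drop the <code>messages</code> list and the "memory" disappears, which is exactly what happens when a process exits.</p>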

&lt;h3&gt;
  
  
  Session Managers
&lt;/h3&gt;

&lt;p&gt;Strands provides session managers that save and restore conversation state across invocations. The simplest option is &lt;code&gt;FileSessionManager&lt;/code&gt;, which writes session data to the local filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.session.file_session_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileSessionManager&lt;/span&gt;

&lt;span class="n"&gt;session_manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FileSessionManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./sessions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# First run
&lt;/span&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is Morgan.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Later, even after a restart, the agent remembers
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FileSessionManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;storage_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./sessions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s my name?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Still remembers "Morgan"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;FileSessionManager&lt;/code&gt; stores each message as a JSON file on disk. It works well for local development. For hosted setups, you'd swap in a session manager backed by a database or a managed memory service like Amazon Bedrock AgentCore Memory. The integration pattern is the same, but you'd need to provision the infrastructure for the data store.&lt;/p&gt;
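<p>To make the idea concrete, here's a toy version of what session persistence accomplishes: write the message list somewhere durable so a fresh process can pick up where the last one left off. This shows the concept only, not <code>FileSessionManager</code>'s actual on-disk format:</p>

```python
# Toy session persistence: save and restore a message list across "restarts".
# Concept only; the SDK's real on-disk layout differs.
import json
import tempfile
from pathlib import Path

STORAGE = Path(tempfile.mkdtemp())  # stand-in for "./sessions"

def save_session(session_id, messages, storage_dir=STORAGE):
    Path(storage_dir).mkdir(parents=True, exist_ok=True)
    (Path(storage_dir) / f"{session_id}.json").write_text(json.dumps(messages))

def load_session(session_id, storage_dir=STORAGE):
    path = Path(storage_dir) / f"{session_id}.json"
    return json.loads(path.read_text()) if path.exists() else []

# First run
messages = load_session("my-session")
messages.append({"role": "user", "content": "My name is Morgan."})
save_session("my-session", messages)

# "Restart": a fresh load sees the same history
print(load_session("my-session")[0]["content"])
```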

&lt;h3&gt;
  
  
  Managing the Context Window
&lt;/h3&gt;

&lt;p&gt;There's another problem that shows up in longer conversations. Every LLM has a context window, which is the maximum number of tokens it can process in a single request. Your system prompt, the full conversation history, tool definitions, and the model's response all have to fit inside that window.&lt;/p&gt;

&lt;p&gt;For short conversations this isn't an issue. But if your agent runs for dozens of turns, or if tools return large results, the conversation history can grow past what the model can handle.&lt;/p&gt;

&lt;p&gt;Strands provides a few out-of-the-box conversation managers to deal with this:&lt;/p&gt;

&lt;p&gt;The sliding window manager keeps the most recent messages and drops the oldest ones when the history gets too long:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.agent.conversation_manager.sliding_window_conversation_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SlidingWindowConversationManager&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;conversation_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SlidingWindowConversationManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;window_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# keep the last 40 messages
&lt;/span&gt;    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is simple and predictable: old messages fall off the end. The downside is that the agent loses context from earlier in the conversation. If a user said something important 50 messages ago, it's gone.&lt;/p&gt;
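<p>The trimming behavior is easy to picture. This is an illustrative sketch, not the SDK's implementation:</p>

```python
# Sketch of sliding-window trimming: when the history exceeds the window,
# the oldest messages simply fall off.

def apply_sliding_window(messages, window_size):
    return messages[-window_size:]

history = [f"message {i}" for i in range(50)]
trimmed = apply_sliding_window(history, window_size=40)
print(len(trimmed))  # 40
print(trimmed[0])    # "message 10" -- messages 0 through 9 are gone
```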

&lt;p&gt;The summarizing manager takes a different approach. Instead of dropping old messages, it summarizes them and then keeps the summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.agent.conversation_manager.summarizing_conversation_manager&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SummarizingConversationManager&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;conversation_manager&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;SummarizingConversationManager&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;summary_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# summarize the oldest 30% of messages
&lt;/span&gt;        &lt;span class="n"&gt;preserve_recent_messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# always keep the last 10
&lt;/span&gt;    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the context gets too large, the summarizing manager uses the LLM to generate a summary of the oldest messages, then replaces them with that summary. The agent keeps the gist of what happened earlier without the full verbatim history. This costs an extra model call when summarization triggers, but it preserves more context than a simple sliding window.&lt;/p&gt;
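<p>A rough sketch of that split, with a stand-in <code>summarize</code> function in place of the extra LLM call (illustrative only, not the SDK's logic):</p>

```python
# Sketch of a summarizing strategy: condense the oldest slice of the history
# into one summary message, always preserving the most recent messages.
# summarize() is a stand-in for the extra LLM call.

def summarize(old_messages):
    return f"[summary of {len(old_messages)} earlier messages]"

def compress(messages, summary_ratio=0.3, preserve_recent=10):
    # The most recent messages are never eligible for summarization.
    eligible = messages[:-preserve_recent] if preserve_recent else messages
    cut = int(len(eligible) * summary_ratio)
    if cut == 0:
        return messages
    return [summarize(messages[:cut])] + messages[cut:]

history = [f"message {i}" for i in range(30)]
compressed = compress(history)
print(len(compressed))   # 25: one summary message plus the remaining 24
print(compressed[0])
```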

&lt;p&gt;Which one you pick depends on your use case. For short, focused interactions, the sliding window is fine. For longer sessions where earlier context matters, the summarizing manager is worth the extra cost.&lt;/p&gt;

&lt;p&gt;There are also more advanced techniques for managing your context window. The broader practice of deciding what information reaches the model is called context engineering, and it's an entire discipline within AI engineering. For simple agents, a sliding window or summarization is a good place to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Agent Loop Ties It All Together
&lt;/h2&gt;

&lt;p&gt;Stack these pieces together and you get a pretty capable agent without writing much code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A model handles reasoning and tool selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tools extend what the agent can do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hooks give you control over the lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Session managers persist state across restarts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conversation managers keep the context window under control.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent loop ties it all together: the model calls tools, reads results, handles errors by trying a different approach, and returns the final response. &lt;/p&gt;

&lt;p&gt;This is the starting point. Once you start building more advanced agents you'll need more capabilities. That's where plugins come in.&lt;/p&gt;

&lt;p&gt;Plugins are classes that bundle hooks and tools together into behavioral modifications you can attach to any agent. The SDK ships with a few built-in plugins that show what this looks like in practice, and you can build your own custom plugins as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending Your Agent with Plugins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Steering
&lt;/h3&gt;

&lt;p&gt;Steering is a plugin that evaluates the agent's behavior and sends corrective feedback when it drifts from your guidelines. You give it a system prompt that defines the rules, and it uses a separate LLM call to judge each of the agent's actions before they go through.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.vended_plugins.steering&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMSteeringHandler&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;LLMSteeringHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure all responses are professional and concise. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reject any response that includes speculation or unverified claims.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, the steering plugin hooks into the agent's lifecycle using a &lt;code&gt;BeforeToolCallEvent&lt;/code&gt; hook. It intercepts tool calls, runs them through the evaluator, and returns one of three actions: proceed (let it through), guide (reject with feedback so the agent retries), or interrupt (escalate to a human). You don't write any of that logic. You just describe the rules in the system prompt for the steering handler.&lt;/p&gt;

&lt;p&gt;This is useful for enforcing tone in customer-facing agents, preventing agents from calling tools with dangerous arguments, or evaluating if agents are following directions.&lt;/p&gt;
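<p>To make the three actions concrete, here's a hand-rolled stand-in for what the plugin automates. The rules and tool names are hypothetical, and the real plugin judges with an LLM call rather than keyword matching:</p>

```python
# A hand-rolled stand-in for steering: inspect a pending tool call and return
# one of the three actions. Rules are hypothetical; the real plugin uses an
# LLM evaluation instead of hard-coded checks.

def evaluate_tool_call(tool_name, tool_input):
    if tool_name == "shell" and "rm -rf" in tool_input.get("command", ""):
        return "interrupt"  # escalate to a human
    if tool_name == "get_weather" and not tool_input.get("city"):
        return "guide"      # reject with feedback so the agent retries
    return "proceed"        # let it through

print(evaluate_tool_call("get_weather", {"city": "Seattle"}))  # proceed
print(evaluate_tool_call("get_weather", {}))                   # guide
print(evaluate_tool_call("shell", {"command": "rm -rf /"}))    # interrupt
```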

&lt;h3&gt;
  
  
  Skills
&lt;/h3&gt;

&lt;p&gt;Skills are modular instructions that agents discover and activate at runtime. They follow the &lt;a href="https://skills.sh" rel="noopener noreferrer"&gt;Agent Skills specification&lt;/a&gt;, an open standard for packaging agent capabilities as folders containing instructions, scripts, and resources.&lt;/p&gt;

&lt;p&gt;A skill might teach an agent how to perform a code review following your team's conventions, how to deploy to a specific environment, or how to write content in a particular style.&lt;/p&gt;

&lt;p&gt;The agent only loads a skill's metadata (name and description) initially. When the agent decides a skill is relevant to the current task, it activates it and pulls in the full instructions. This keeps the context window clean since the agent only loads what it needs.&lt;/p&gt;
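<p>A sketch of that progressive loading, using a hypothetical skill whose <code>SKILL.md</code> starts with frontmatter metadata followed by the full instructions:</p>

```python
# Sketch of "load metadata first, activate later". The SKILL.md content here
# is hypothetical: frontmatter carries the name and description, and the body
# (the full instructions) is pulled in only when the skill activates.

SKILL_MD = """---
name: code-review
description: Review pull requests following the team's conventions.
---
Full instructions: check naming, test coverage, and error handling...
"""

def split_skill(text):
    _, frontmatter, body = text.split("---", 2)
    metadata = dict(
        line.split(": ", 1) for line in frontmatter.strip().splitlines()
    )
    return metadata, body.strip()

metadata, instructions = split_skill(SKILL_MD)
print(metadata["name"])       # advertised to the agent up front
print(len(instructions) > 0)  # loaded only on activation
```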

&lt;h3&gt;
  
  
  Building Your Own Plugins
&lt;/h3&gt;

&lt;p&gt;You can also build custom plugins. A plugin is a class that extends &lt;code&gt;Plugin&lt;/code&gt; and uses &lt;code&gt;@hook&lt;/code&gt; and &lt;code&gt;@tool&lt;/code&gt; decorators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.plugins&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AfterToolCallEvent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoggingPlugin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Plugin&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A plugin that logs all tool calls and provides a utility tool.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logging-plugin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nd"&gt;@hook&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_before_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BeforeToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Called before each tool execution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LOG] Calling tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LOG] Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@hook&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_after_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AfterToolCallEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Called after each tool execution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LOG] Tool completed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_use&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;debug_print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Print a debug message.

        Args:
            message: The message to print
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Printed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Using the plugin
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;LoggingPlugin&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate 2 + 2 and print the result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent initializes, it scans the plugin for &lt;code&gt;@hook&lt;/code&gt; and &lt;code&gt;@tool&lt;/code&gt; methods and registers them automatically. You can stack multiple plugins on the same agent, and each one manages its own hooks and state without interfering with the others.&lt;/p&gt;

&lt;p&gt;Beyond plugins, Strands supports multi-agent patterns where agents invoke other agents as tools, MCP (Model Context Protocol) servers for connecting to external tool providers, and structured output for getting typed responses. These are all topics for another post. The point is that the same primitives (agents, tools, models, hooks) compose into more complex setups without requiring you to learn a different API. &lt;/p&gt;

&lt;h2&gt;
  
  
  Install and Build Your First AI Agent
&lt;/h2&gt;

&lt;p&gt;The fastest path from here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install: &lt;code&gt;pip install 'strands-agents[openai]' strands-agents-tools&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set your API key: &lt;code&gt;export OPENAI_API_KEY=your_key&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Write a simple agent with one custom tool&lt;/li&gt;
&lt;li&gt;Run it and see what happens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://strandsagents.com/docs/user-guide/quickstart/overview/" rel="noopener noreferrer"&gt;Strands documentation&lt;/a&gt; has more examples, including multi-agent setups, observability, and production deployment patterns. The &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; has the source and community tools.&lt;/p&gt;

&lt;p&gt;The SDK is open source and actively developed. If you've been putting off building an agent because the frameworks felt heavy, give Strands a look. The barrier to entry is low, and it provides enough composability to keep up with you as your use case gets more complex.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>aws</category>
    </item>
    <item>
      <title>AI Agents Are Your API's Biggest Consumer. Do They Care About Good Design?</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Tue, 24 Mar 2026 13:50:14 +0000</pubDate>
      <link>https://forem.com/aws/ai-agents-are-your-apis-biggest-consumer-do-they-care-about-good-design-31l2</link>
      <guid>https://forem.com/aws/ai-agents-are-your-apis-biggest-consumer-do-they-care-about-good-design-31l2</guid>
      <description>&lt;p&gt;We've always designed APIs for humans. A well built API means obsessing over naming conventions, RESTful patterns, and clear documentation because the goal is simple: make systems easy for developers to understand. But AI is changing who the consumer of software is, and developers are asking whether the rules we've followed for decades still hold up.&lt;/p&gt;

&lt;p&gt;When the primary user of an API is an AI system that reads documentation, adapts to unfamiliar patterns, and experiments when something fails, maybe consistent APIs and clean abstractions don't matter anymore.&lt;/p&gt;

&lt;p&gt;Everything in me wants to reject this concept. My gut instinct is to say "my APIs need to be pretty or I'll die".&lt;/p&gt;

&lt;p&gt;I've been thinking about this a lot lately, and unfortunately there are good arguments on both sides. Let's walk through them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for "Abstractions Don't Matter Anymore"
&lt;/h3&gt;

&lt;p&gt;Some developers believe AI will reduce the importance of traditional abstractions. I've heard this take a lot.&lt;/p&gt;

&lt;p&gt;LLMs are extremely good at pattern recognition. They can read documentation, inspect responses, experiment, and adapt when things fail. From this perspective, messy systems don't seem like a big problem. A human might struggle with inconsistent naming or poor documentation, but an AI can simply figure it out through trial, error, and being smarter than me.&lt;/p&gt;

&lt;p&gt;So maybe if abstractions exist to make code understandable to humans, and the primary consumer is no longer human, then the old rules don't apply.&lt;/p&gt;

&lt;p&gt;In practice, you can build agent systems that work around poorly designed APIs. Say you're integrating with an API where half the endpoints return errors as HTTP status codes and the other half always return 200 with an error field buried in the response body.&lt;/p&gt;

&lt;p&gt;The agent pulls the docs, writes the code, and it looks reasonable. Then it runs the tests and it breaks because the code is checking status codes on an endpoint that never returns them.&lt;/p&gt;

&lt;p&gt;The agent reads the error, adds response body parsing, and tries again. Maybe it over-corrects and starts modifying the way it's handling status codes everywhere, breaking a different call. So, it adjusts again. Then finally, the third try works. These systems exist today and they get there eventually.&lt;/p&gt;
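<p>The workaround the agent eventually converges on usually looks something like this normalizing shim. The endpoints and field names here are hypothetical:</p>

```python
# The kind of shim an agent ends up writing for the API described above:
# fold both error styles into one predictable shape. Field names ("message",
# "error") are hypothetical.

def normalize_response(status_code, body):
    # Style 1: errors signaled via HTTP status codes.
    if status_code >= 400:
        return {"ok": False, "error": body.get("message", f"HTTP {status_code}")}
    # Style 2: always 200, with the error buried in the response body.
    if "error" in body:
        return {"ok": False, "error": body["error"]}
    return {"ok": True, "data": body}

print(normalize_response(404, {"message": "not found"}))
print(normalize_response(200, {"error": "invalid user id"}))
print(normalize_response(200, {"id": 7, "name": "Morgan"}))
```

<p>It works, but every consumer of that API has to rediscover and rebuild this shim, and an agent burns tokens doing it.</p>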

&lt;p&gt;As models continue to improve, the argument goes, API design will matter less and less because models can brute-force their way to working code. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for "Abstractions Still Matter"
&lt;/h3&gt;

&lt;p&gt;Now for the other side of the argument.&lt;/p&gt;

&lt;p&gt;A human developer interacts with an API occasionally. An AI system like a coding agent might interact with it hundreds of times in a single session. When something is poorly designed, the problems compound fast in the form of unnecessary retries, token-heavy debugging loops, and ugly workarounds.&lt;/p&gt;

&lt;p&gt;I ran into this with an API that had inconsistent naming across its endpoints. I was debugging an issue with an app I was building, and my coding agent kept thinking it had identified the issue because a parameter name for an API I was using didn't align with historical patterns for this type of API. That wasn't the issue at all, it was completely irrelevant, but the agent kept getting hung up on it.&lt;/p&gt;

&lt;p&gt;Every time I debug something that uses this specific API my coding agent always says "I found it! The parameter name should be X instead of Y!" Then it changes it and deploys again and it doesn't work because that wasn't the issue. It kept making the same wrong assumption across sessions.&lt;/p&gt;

&lt;p&gt;Unlike a human who hits a weird error and remembers it next time, LLMs are stateless by default. Every new session starts fresh, and agents can spin up tons of sessions in a single workflow, each of which will run into the same problem.&lt;/p&gt;

&lt;p&gt;Every ambiguity in an API has a token cost, and poor API design has direct financial consequences in a way that wasn't true when the only cost was developer frustration.&lt;/p&gt;

&lt;p&gt;And another thing I've noticed: if you watch a coding agent work through a problem like this, it often gives up and tries a different approach entirely after a few tries. It'll swap libraries, use a different endpoint, or cobble together a workaround using some other approach. That's the AI doing exactly what it should do, adapting.&lt;/p&gt;

&lt;p&gt;From the developer perspective, I don't always like the workarounds the AI chooses. Sometimes one unclear API makes my AI think it needs to redesign my entire component. Other times it bypasses the API in a way that looks like it's working but actually relies on some weird hardcoded values somewhere.&lt;/p&gt;

&lt;p&gt;And from your perspective as an API owner, the AI just decided not to use your API. Your messy design just cost you a new user.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Human Compatibility Problem
&lt;/h3&gt;

&lt;p&gt;There's another angle to this too. The abstractions we built for humans are now embedded in how AI systems learn.&lt;/p&gt;

&lt;p&gt;Modern software ecosystems contain decades of common coding conventions. These were originally created to help humans understand systems. But those same patterns now appear throughout model training data, and that has consequences.&lt;/p&gt;

&lt;p&gt;When you name your endpoint &lt;code&gt;/api/v2/users/{id}&lt;/code&gt;, the model has seen that pattern millions of times. It knows what to expect. When you name it &lt;code&gt;/backend/person/fetch?identifier={id}&lt;/code&gt;, you're fighting against the weight of its training. The model can learn your pattern, but there's friction.&lt;/p&gt;

&lt;p&gt;Coding assistants are increasingly abstracting away the act of writing syntax from developers, which is great until you need to peek under the abstraction.&lt;/p&gt;

&lt;p&gt;If an agent generates code using unfamiliar patterns or unconventional APIs, a human still has to review it, debug it, and maintain it. We wouldn't want agents writing assembly language even if it ran faster, because most of us can't read it. The same logic applies to API conventions. Familiar patterns keep the code understandable for the humans who still have to live with it.&lt;/p&gt;

&lt;p&gt;The patterns we created to help humans are now baked into how AI understands software, and that path dependence matters in both directions. Breaking conventions costs you in AI effectiveness and in human readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  You Can Engineer Around It (If You Can Afford It)
&lt;/h3&gt;

&lt;p&gt;Writing this blog turned my question from "do API design and thoughtful abstraction matter anymore?" into "how much money do you have?"&lt;/p&gt;

&lt;p&gt;Every time an AI system has to figure out how something works, that's tokens being consumed and a potentially hacky workaround making its way into your code base. The adaptability is real, but anyone who's been ripping Claude Opus 4.6 with the 1 million token context window using agent teams knows that this is not free.&lt;/p&gt;

&lt;p&gt;You can throw tokens at bad abstractions, build sophisticated systems to work around them, add layers of verification, validation, and correction. Multiple agents checking each other's work. Memory and caching layers to avoid repeated discovery. But wouldn't it be nice to just get it right the first try? Clean abstractions and good API design can give that to you.&lt;/p&gt;

&lt;p&gt;Ideally you have both things in place. Clean APIs, meaningful abstractions, and clear documentation mean the agent hits minimal friction and looping. Then add enough scaffolding around the agent to recover when an API it's using doesn't have all of those things. That combination reduces the cost of code generation, keeps code human readable and debuggable, and still has the adaptability built in.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Agent Buddy System: When Prompt Engineering Isn't Enough</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Wed, 18 Mar 2026 18:01:01 +0000</pubDate>
      <link>https://forem.com/aws/the-agent-buddy-system-when-prompt-engineering-isnt-enough-5dni</link>
      <guid>https://forem.com/aws/the-agent-buddy-system-when-prompt-engineering-isnt-enough-5dni</guid>
      <description>&lt;p&gt;Most AI agents don’t reliably follow directions, and that’s one of the biggest reasons they never make it from POC to production.&lt;/p&gt;

&lt;p&gt;This is how deploying agents usually plays out: you write clear instructions in your prompt, test against every scenario you can think of, and ship it. Then the agent skips steps, drifts from your guidelines, or invents behavior you didn't anticipate. So you add more detail, more constraints, more explicit directions.&lt;/p&gt;

&lt;p&gt;The prompt is getting huge now, but you’re sure you’ve captured all the rules. You deploy again. Same problem. Eventually, you hit a wall and give up. &lt;/p&gt;

&lt;p&gt;I ran into this firsthand trying to create a simple AI assistant to help me write. I gave it samples of my writing style, told it to write like me, and it did start off okay. But after a few turns it drifted back into generic AI-speak. I'm talking em dashes everywhere, staccato sentences for dramatic effect, and that weird "It's not about X, it's about Y" framing that sounds profound but actually says nothing. By the end of a long session, the output usually sounds nothing like me. &lt;/p&gt;

&lt;p&gt;This example makes the problem obvious because you can read the output and immediately tell something’s off. But the same thing happens in more serious scenarios, like compliance checks, customer support flows, or multi-step workflows where the stakes are higher.&lt;/p&gt;

&lt;p&gt;What's actually happening is that as conversations get longer, the model pays less attention to earlier instructions.&lt;/p&gt;

&lt;p&gt;Prompt engineering helps, but it can only take you so far. What you need is a feedback loop that catches drift and corrects it before the response ever reaches the user.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Buddy System
&lt;/h2&gt;

&lt;p&gt;Instead of trying to make one agent behave perfectly, the solution was to introduce a second agent to the system. One does the work, and the other checks it. That’s what I’ve been calling the agent buddy system.&lt;/p&gt;

&lt;p&gt;The main agent handles the task: writing, reasoning, calling tools, whatever it needs to do. The buddy sits alongside it, watching the output. If the agent skips a step, tries to misuse a tool, or drifts from the defined rules, the buddy steps in and helps get things back on track.&lt;/p&gt;

&lt;p&gt;The idea is simple: don’t rely on the model to always follow instructions. Assume it will drift, and build something that corrects it when it does.&lt;/p&gt;

&lt;p&gt;This is essentially using an LLM as a judge. The evaluator model inspects the output from the worker model and decides whether it meets the criteria. If it does, the response goes through. If not, it sends guidance and the agent can try again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm8axy1rse3zb1qht7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkm8axy1rse3zb1qht7f.png" alt="Agent Buddy System" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It turns out that having two models that disagree with each other is safer than having one model that just does whatever it wants.&lt;/p&gt;

&lt;p&gt;You can build this pattern yourself, but I used the Strands Agents SDK because it already supports this kind of feedback loop through a feature called steering.&lt;/p&gt;

&lt;p&gt;Steering lets you inject just-in-time guidance into the agent’s execution instead of front-loading everything into a massive prompt and hoping for the best. &lt;/p&gt;

&lt;p&gt;Under the hood, Strands steering works through hooks in the agent’s lifecycle. You can intercept tool calls before they execute to run custom validations, or evaluate the model’s response after it’s generated to check things like tone, format, or adherence to the prompt.&lt;/p&gt;

&lt;p&gt;The steering agent intercepts the call and returns one of three actions: &lt;code&gt;Proceed&lt;/code&gt; (accept), &lt;code&gt;Guide&lt;/code&gt; (reject with feedback for retry), or &lt;code&gt;Interrupt&lt;/code&gt; (escalate to a human). &lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Writing Buddy
&lt;/h2&gt;

&lt;p&gt;To fix my AI writing problem, I built a steering handler that checks every response against a style guide with examples of my actual writing. If the output doesn’t sound like me, the handler catches it and asks for a rewrite before I ever see it.&lt;/p&gt;

&lt;p&gt;In Strands, this means creating a &lt;code&gt;SteeringHandler&lt;/code&gt; and attaching it to your agent as a plugin.&lt;/p&gt;

&lt;p&gt;For my use case, I only needed to evaluate the final output, so I used &lt;code&gt;steer_after_model()&lt;/code&gt; to inspect each response and decide whether to accept it or send it back with feedback.&lt;/p&gt;

&lt;p&gt;Here’s my &lt;code&gt;VoiceSteeringHandler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VoiceSteeringHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SteeringHandler&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluates writing output against a style guide using an LLM judge.

    Intercepts model responses via steer_after_model and uses a separate
    steering agent to check for style violations. If a violation is found,
    it guides the agent to rewrite with targeted feedback.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;style_guide&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style_guide&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;style_guide&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_retries&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;steer_after_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StopReason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate model output against the style guide.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[STEERING] Evaluating model output...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Proceed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max retries reached, accepting output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Use a separate steering agent as an LLM judge
&lt;/span&gt;        &lt;span class="n"&gt;steering_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You evaluate writing against a style guide.
            Catch clear violations, not nitpicks.

            STYLE GUIDE:
            &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;style_guide&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

            REJECT for: banned words/phrases from the style guide, em dashes,
            &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;It&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s not X. It&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s Y.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; reframing, obvious marketing tone, or meta-commentary.

            APPROVE if: tone is developer-to-developer with no banned words/phrases/patterns.
            When in doubt, APPROVE.

            Respond with APPROVE or REJECT: [quote the violation].&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;callback_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;steering_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Evaluate this text:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;feedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fix this issue: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Only fix the cited issue. Output only the content, nothing else.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retry_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Proceed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output approved by steering agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then to attach it to your main agent, you use a &lt;code&gt;plugin&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us.anthropic.claude-sonnet-4-20250514-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a writing assistant that writes in a specific voice.
    Follow every rule in the style guide below. Output only the requested writing.
    Never add meta-commentary or questions like &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Would you like me to adjust?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;

    STYLE GUIDE:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;style_guide&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;plugins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;VoiceSteeringHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;style_guide&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;style_guide&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the steering agent sees the output doesn't match, the handler returns &lt;code&gt;Guide&lt;/code&gt; with specific feedback. The agent discards its response and tries again, knowing exactly what went wrong. After &lt;code&gt;max_retries&lt;/code&gt; attempts, it lets the response through rather than looping forever.&lt;/p&gt;

&lt;p&gt;The evaluator prompt checks for voice match against your examples, but also flags AI vocabulary (words like "crucial," "delve," "tapestry"), structural patterns (em dashes, pseudo-profound reframing), and other tells that make text sound machine-generated. You give it paragraphs from your actual writing, and it asks "does this new text sound like these examples?" It's essentially a style linter powered by an LLM. &lt;/p&gt;

&lt;p&gt;That’s a judgment call, and this is where steering really shines. Instead of trying to build complicated, deterministic evaluation logic, you let a model make that call and provide targeted feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does It Work?
&lt;/h2&gt;

&lt;p&gt;Yes. Here’s what I saw in my own testing before getting into larger-scale results.&lt;/p&gt;

&lt;p&gt;I ran a small evaluation: 5 multi-turn writing sessions where a simulated user iteratively refines a piece, repeated 5 times each using Claude Sonnet 4.5. That's the kind of back-and-forth that happens in real writing workflows, and it's where drift becomes noticeable. The baseline voice adherence averaged 25% by the end of the sessions, but the steered version held at 100%.&lt;/p&gt;

&lt;p&gt;For single-turn prompts with more capable models, both performed about the same for a small evaluation dataset, because larger models are already pretty good at following style guides on their own. The difference shows up in the longer sessions where drift compounds, or when weaker models are used.&lt;/p&gt;

&lt;p&gt;That's a modest eval set, so take the exact numbers directionally rather than as gospel. But the pattern was consistent: unsteered sessions degraded noticeably after a few turns, while steered sessions stayed on voice throughout.&lt;/p&gt;

&lt;p&gt;The more compelling evidence comes from Clare Liguori, Senior Principal Software Engineer at AWS, who ran a similar evaluation at a much larger scale. She &lt;a href="https://strandsagents.com/blog/steering-accuracy-beats-prompts-workflows?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;tested five approaches to guiding agent behavior&lt;/a&gt; on a library book renewal agent across 3,000 runs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple prompt instructions reached 82.5% accuracy, meaning roughly one in five interactions failed&lt;/li&gt;
&lt;li&gt;Agent SOPs hit 99.8%, but at 3x the token cost&lt;/li&gt;
&lt;li&gt;Graph-based workflows reached 80.8%, often failing outside predefined paths&lt;/li&gt;
&lt;li&gt;Steering hit 100% across 600 runs while using 66% fewer input tokens than SOPs and 47% fewer output tokens than workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most common failure without steering was skipping the book status check before renewing (43% of failures), followed by missing the confirmation message (40%). These are exactly the kinds of steps models deprioritize as context grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things To Consider
&lt;/h2&gt;

&lt;p&gt;This pattern works well, but there are a few things you should consider. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Latency&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each steering intervention adds another model call. If the handler returns &lt;code&gt;Guide&lt;/code&gt;, the agent has to regenerate with feedback, which can mean two or three round trips for a single response. Once you add in tool calls the latency becomes a real factor.&lt;/p&gt;

&lt;p&gt;That’s fine for background tasks or workflows where accuracy matters more than speed. But it’s the wrong tradeoff for real-time applications where users expect quick responses and the stakes are low.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Token costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Tokens do add up, but the picture is more nuanced than you might expect. &lt;/p&gt;

&lt;p&gt;Steering uses more tokens than simple prompt instructions because you’re sending feedback back to the agent when it strays. But compared to approaches that actually achieve high accuracy, like SOPs, steering is often more efficient.&lt;/p&gt;

&lt;p&gt;Try the single-prompt approach first, and reach for steering when it isn't enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Steering prompt quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The quality of your steering prompts directly impacts performance.&lt;/p&gt;

&lt;p&gt;If your handler gives vague feedback, the agent can get stuck retrying without improving. Set retry limits, make your &lt;code&gt;Guide&lt;/code&gt; feedback specific, and if the same correction keeps firing, fix the prompt instead of increasing retries. &lt;/p&gt;

&lt;p&gt;And remember, you're using a model to judge another model. That means they can share the same blind spots. If both the worker and the evaluator miss the same kind of mistake, steering won't catch it. &lt;/p&gt;

&lt;p&gt;Try using two different models, and for high-stakes use cases, pair this with deterministic checks where you can.&lt;/p&gt;

&lt;h3&gt;
  
  
  When not to use steering
&lt;/h3&gt;

&lt;p&gt;Steering assumes you have a clear definition of "correct." That works for style guides, compliance rules, and structured workflows. It doesn't work as well for creative tasks where you actually want the model to surprise you because steering will pull it back toward whatever your evaluator thinks is right. And if your criteria can be expressed as deterministic checks (regex, schema validation, rule engines), maybe skip steering. It's slower, costs more, and adds uncertainty where you don't need it.&lt;/p&gt;
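&lt;p&gt;For instance, when the rule is as mechanical as a banned-phrase list, a few lines of stdlib regex do the job with zero extra model calls. The patterns below are just illustrations, not a full style guide:&lt;/p&gt;

```python
import re

# Deterministic style check: no LLM judge needed for rules this mechanical.
# The banned patterns are illustrative examples, not a complete style guide.
BANNED = [
    r"\bdelve\b",
    r"\btapestry\b",
    r"\u2014",  # em dash
    r"It's not (about )?\w+[.,] [Ii]t's (about )?\w+",  # pseudo-profound reframing
]

def violations(text: str) -> list[str]:
    """Return the banned patterns that match; empty list means the text is clean."""
    return [pattern for pattern in BANNED if re.search(pattern, text)]

print(violations("Let's delve into the rich tapestry of agents."))
```

It's faster and cheaper than a steering round trip, and its verdicts are reproducible, which is exactly what you want for hard rules.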

&lt;h2&gt;
  
  
  Beyond Writing Assistants
&lt;/h2&gt;

&lt;p&gt;Reliable agents come from the systems you build around them. &lt;/p&gt;

&lt;p&gt;Steering applies anywhere an agent needs consistent behavior over time. Customer service agents maintaining tone across dozens of interactions, code review bots enforcing your team's conventions, or compliance workflows where skipping a step has real consequences. &lt;/p&gt;

&lt;p&gt;The pattern is the same: evaluate the output, provide guidance, retry if needed. You just swap the evaluator criteria.&lt;/p&gt;

&lt;p&gt;Clare Liguori’s post walks through &lt;a href="https://strandsagents.com/blog/steering-accuracy-beats-prompts-workflows?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;her full evaluation&lt;/a&gt; of the library book renewal agent. The &lt;a href="https://strandsagents.com/docs/user-guide/concepts/plugins/steering/?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;steering documentation&lt;/a&gt; covers the full API. &lt;/p&gt;

&lt;p&gt;Some agents need a buddy to keep them on track. Steering gives you that.  &lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>aws</category>
    </item>
    <item>
      <title>Is Software Engineering Cooked? Not Yet. But Maybe.</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Tue, 03 Mar 2026 15:09:26 +0000</pubDate>
      <link>https://forem.com/aws/is-software-engineering-cooked-not-yet-but-maybe-5e0j</link>
      <guid>https://forem.com/aws/is-software-engineering-cooked-not-yet-but-maybe-5e0j</guid>
      <description>&lt;p&gt;"Software engineering is solved." This is all I see lately when scrolling LinkedIn, X, or Reddit.&lt;/p&gt;

&lt;p&gt;The message is loud and clear: developers are cooked and we should all pivot immediately and become plumbers or electricians.&lt;/p&gt;

&lt;p&gt;It's not a &lt;em&gt;completely&lt;/em&gt; crazy idea. AI coding tools have improved so quickly that software development has been affected more than almost any other white-collar field. Developers who are deep into these tools can generate large amounts of working code fast, and the days of memorizing syntax and writing everything by hand are already over.&lt;/p&gt;

&lt;p&gt;The obvious response is that code generation is only one slice of the software engineering pie. When I say this, usually everyone nods along.&lt;/p&gt;

&lt;p&gt;But we rarely talk about what the rest of the pie actually is. If code generation is mostly solved, then the question is what remains.&lt;/p&gt;

&lt;p&gt;Right now the results from AI coding tools are wildly uneven. Some developers report massive productivity gains. Others say it makes them slower and produces nothing but AI slop.&lt;/p&gt;

&lt;p&gt;A big part of the difference comes down to how the systems around AI coding tools are built and used.&lt;/p&gt;

&lt;p&gt;Software engineering looks solved because code generation is advancing quickly. Understanding why that is explains what we're seeing right now and where this is headed.&lt;/p&gt;

&lt;p&gt;A good place to start is with a simple question: why is code generation a winning use case for generative AI?&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Code Is Easy for AI
&lt;/h2&gt;

&lt;p&gt;Software development is heavily pattern based. We work with language syntax, framework conventions, data structures, standard patterns, reusable components, and familiar structures.&lt;/p&gt;

&lt;p&gt;Most code follows established patterns. The training data for these models includes massive volumes of code from real-world examples and open source projects. There are millions of implementations of common patterns to learn from.&lt;/p&gt;

&lt;p&gt;Writing code is largely about applying known patterns to new problems. Need a REST endpoint? There are thousands of examples. Need to validate user input or implement authentication? Those patterns are well documented too.&lt;/p&gt;

&lt;p&gt;This is why syntax and implementation for most common use cases are mostly solved. The models have seen these patterns thousands of times. They can reproduce them reliably with minor variations to fit your specific context.&lt;/p&gt;

&lt;p&gt;But pattern matching alone doesn't explain why AI coding tools work as well as they do. There's another factor that makes software particularly suited to AI generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software Is Verifiable
&lt;/h2&gt;

&lt;p&gt;Correct software for most non-AI related use cases has a simple property: the same input produces the same output. That's determinism.&lt;/p&gt;

&lt;p&gt;The actual implementation can vary. If you give the same user story to two different developers, they will produce two different solutions.&lt;/p&gt;

&lt;p&gt;The structure, abstractions, and technologies might differ. But from an end-user perspective, the functional requirements must be met.&lt;/p&gt;

&lt;p&gt;That property makes software unusually well suited to AI-assisted development. Nondeterministic systems can generate many possible implementations, but we can still measure whether the result is correct. Either the behavior is correct or it isn't. The tests pass or they don't.&lt;/p&gt;

&lt;p&gt;What matters most is the behavior of the software, and that behavior can be measured and tested automatically.&lt;/p&gt;
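
&lt;p&gt;A toy example makes this concrete. The two implementations below are deliberately different, but one deterministic check judges both (the function names are illustrative):&lt;/p&gt;

```python
# Two deliberately different implementations of the same user story:
# "return the list sorted in ascending order".
def sort_builtin(xs):
    return sorted(xs)

def insertion_sort(xs):
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

# One deterministic check verifies both: same inputs, same expected outputs,
# no matter how the implementation was produced.
def meets_spec(impl):
    cases = [([3, 1, 2], [1, 2, 3]), ([], []), ([2, 2, 1], [1, 2, 2])]
    return all(impl(xs) == expected for xs, expected in cases)
```

&lt;p&gt;Either implementation passes &lt;code&gt;meets_spec&lt;/code&gt;, which is exactly the property that makes generated code checkable.&lt;/p&gt;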

&lt;p&gt;But lots of different fields have measurable outputs that can be tested and iterated upon. Why are software developers likely to be among the first groups to feel the full impact of AI adoption?&lt;/p&gt;

&lt;h2&gt;
  
  
  Automation Is Our Culture
&lt;/h2&gt;

&lt;p&gt;Part of the reason is that software engineering is already deeply mechanized, and it's in our culture to automate everything that we can.&lt;/p&gt;

&lt;p&gt;We have CI/CD pipelines, automated tests, automated security scanning, and metrics everywhere. We already understand how to build feedback loops and automate processes. Now we're applying those same skills and lessons learned to other parts of our work in ways that weren't possible before generative AI. And the reason it's accelerating so quickly is that we are both the domain experts and the ones building the systems.&lt;/p&gt;

&lt;p&gt;This creates extremely rapid iteration that is less likely to happen organically in other fields.&lt;/p&gt;

&lt;p&gt;If a developer discovers a truly innovative AI engineering technique on Monday, by Wednesday it's in a blog post with thousands of readers. By Friday multiple people have attempted to build tools around it. Within weeks, if something else hasn't replaced it, the idea gets integrated into mainstream workflows. Rinse and repeat.&lt;/p&gt;

&lt;p&gt;But even with all of this progress, implementation is still only one slice of the software engineering pie. It takes a lot of work to get systems working in production reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Rest of the Software Engineering Pie
&lt;/h2&gt;

&lt;p&gt;Working code is only one dimension of correctness. Production-ready software requires many dimensions to align at once, like security, scalability, architecture, maintainability, integration, performance, stability, and dependencies. A system that works correctly in isolation can still fail when any of these dimensions are ignored.&lt;/p&gt;

&lt;p&gt;This is often where the conversation turns to the idea that developers will all become architects. If implementation becomes automated, humans focus on higher-level system design.&lt;/p&gt;

&lt;p&gt;There is truth to that, and I think this is where we are now. Developers who understand system design have an advantage in today's market. But will higher-level design remain purely a human activity? That assumption is already being tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoding Judgment
&lt;/h2&gt;

&lt;p&gt;Many of the harder parts of software engineering come down to judgment. What does "good" look like for a particular system? Which tradeoffs are acceptable? Where should the complexity live? This kind of judgment is what separates working code from production-ready software.&lt;/p&gt;

&lt;p&gt;Developers are already experimenting with ways to encode that judgment into automated systems. These systems look less like chatbots and more like coordinated pipelines. One component generates an implementation. Others verify behavior by running tests, scan for issues with deterministic tools, and evaluate architectural concerns. Higher-level review can be layered in, where models evaluate tradeoffs, consistency, and design decisions.&lt;/p&gt;

&lt;p&gt;In many ways, this looks like a natural extension of CI/CD. Instead of validating code after a human writes it, the system validates code as it's generated. The feedback loop gets tighter.&lt;/p&gt;
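
&lt;p&gt;As a rough sketch of that tighter loop, with a stub standing in for the model call (&lt;code&gt;generate_until_valid&lt;/code&gt; and the other names here are hypothetical, not any particular tool's API):&lt;/p&gt;

```python
import textwrap

def generate_code(prompt, feedback=None):
    # Stub standing in for a model call; a real pipeline would send the
    # prompt (plus any validation feedback) to an LLM here.
    return textwrap.dedent("""\
        def add(a, b):
            return a + b
    """)

def validate(source):
    # Deterministic gates, mirroring CI: a syntax check plus a behavioral test.
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as e:
        return f"syntax error: {e}"
    ns = {}
    exec(source, ns)
    if ns["add"](2, 3) != 5:
        return "behavioral test failed: add(2, 3) should be 5"
    return None

def generate_until_valid(prompt, max_attempts=3):
    # Validate code as it's generated, feeding errors back into the next attempt.
    feedback = None
    for _ in range(max_attempts):
        source = generate_code(prompt, feedback)
        feedback = validate(source)
        if feedback is None:
            return source
    raise RuntimeError(f"gave up after {max_attempts} attempts: {feedback}")
```

&lt;p&gt;The gates are ordinary CI-style checks; the only new piece is that they run inside the generation loop instead of after a human commits.&lt;/p&gt;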

&lt;p&gt;This is also part of why specification-driven development is gaining traction. If models generate the implementation, developers need to define behavior precisely enough that correctness can be tested automatically. The discipline shifts from writing code to defining what correct code looks like.&lt;/p&gt;

&lt;p&gt;None of this is solved. Tried and true patterns for automating higher-level concerns like architectural evaluation and performance validation don't exist yet, though many teams are actively working on it (with varying degrees of success).&lt;/p&gt;

&lt;p&gt;Adoption will be incremental, and regulated industries will move slower because reliability and accountability matter more than speed. But each dimension of engineering judgment that gets encoded into a validation step is one more thing the system can handle.&lt;/p&gt;

&lt;p&gt;For those using the latest and greatest tooling, the hardest part of being a developer has shifted from learning syntax to developing judgment. That shift may make the early stages of a software career harder, because judgment takes time and experience to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are We Cooked?
&lt;/h2&gt;

&lt;p&gt;Is software development cooked? Not yet. &lt;/p&gt;

&lt;p&gt;Maybe in the future. But I think that humans will be in the loop for longer than we inside the AI bubble currently believe. Only time will tell if I'm right.&lt;/p&gt;

&lt;p&gt;However, the job is changing faster than most of us expected. A developer can ship features in hours that used to take days. Small teams can build products that previously required dozens of engineers. &lt;/p&gt;

&lt;p&gt;In the short term, there's a growing need for developers who can build the systems that make AI-generated code production-ready.&lt;/p&gt;

&lt;p&gt;We used to spend most of our time writing implementations. That time is shrinking. What's growing is everything around it: figuring out what correct software looks like and building the validation to prove it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>developers</category>
      <category>coding</category>
      <category>agents</category>
    </item>
    <item>
      <title>The Python Function That Implements Itself</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:11:06 +0000</pubDate>
      <link>https://forem.com/aws/the-python-function-that-implements-itself-3el8</link>
      <guid>https://forem.com/aws/the-python-function-that-implements-itself-3el8</guid>
      <description>&lt;p&gt;What if you could write a Python function where the docstring is the implementation? You define the inputs, the return type, and you write the validation logic that defines what "correct" means. AI handles the rest.&lt;/p&gt;

&lt;p&gt;That's the programming model behind AI Functions, a new experimental library from Strands Labs. &lt;/p&gt;

&lt;p&gt;Strands Labs is a new GitHub organization where experimental features of the Strands Agents SDK are being built in the open. &lt;/p&gt;

&lt;p&gt;With AI functions you still write the validation logic, but instead of implementing the function itself, you let the AI handle generation and self-correct against your checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Different Way to Write AI-Powered Code
&lt;/h2&gt;

&lt;p&gt;Most AI-powered code follows the same pattern. You call the model, parse the response, write validation checks, handle errors, and retry when things go wrong. It's tedious boilerplate that everyone writes slightly differently.&lt;/p&gt;

&lt;p&gt;AI Functions inverts this pattern. &lt;/p&gt;

&lt;p&gt;You write a function signature, a docstring that serves as the prompt, a return type that defines the contract, and post-conditions that define what correct looks like. There is no function body. The function executes on an LLM instead of a CPU.&lt;/p&gt;

&lt;p&gt;The key here is that you still write real validation code. Post-conditions are normal Python functions you author. You define the acceptance criteria, and the system enforces them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Receipt Parser
&lt;/h2&gt;

&lt;p&gt;Let's see what this looks like with a receipt parser. &lt;/p&gt;

&lt;p&gt;Receipts are a good fit for this pattern because the extraction itself is fuzzy (vendors format receipts differently, line items vary, tax rules change), but the validation is deterministic. You can write a post-condition to check whether the math adds up with plain arithmetic. &lt;/p&gt;

&lt;p&gt;In practice, most receipts start as images or PDFs. This example assumes you've already extracted the text using OCR or a document processing service, and now you need to turn that raw text into structured, validated data. &lt;/p&gt;

&lt;p&gt;We'll build something that handles that second step: extracting structured data from receipt text and validating that the math actually adds up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ai_functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ai_function&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LineItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Item or service description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of units&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price per unit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total for this line item (quantity * unit_price)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ReceiptData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;vendor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vendor or company name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invoice or receipt number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invoice date (YYYY-MM-DD format)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LineItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;List of line items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sum of all line item amounts before tax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tax&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tax amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final total (subtotal + tax)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_math&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ReceiptData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Validate that all math is internally consistent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;errors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Check line items: amount = quantity × unit_price
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Line item &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): amount &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; != &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quantity &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; * unit_price &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify subtotal = sum of line items
&lt;/span&gt;    &lt;span class="n"&gt;items_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;items_sum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subtotal &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; != sum of line items &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;items_sum&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Confirm total = subtotal + tax
&lt;/span&gt;    &lt;span class="n"&gt;expected_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tax&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; != subtotal &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subtotal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; + tax &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tax&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nd"&gt;@ai_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parse a receipt or invoice text and extract structured expense data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;post_conditions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;validate_math&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_receipt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;receipt_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ReceiptData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Extract structured data from this receipt/invoice.
    Receipt text: {receipt_text}

    Instructions:
    - Extract all line items with their quantity, unit price, and total amount
    - Calculate subtotal as the sum of all line item amounts
    - Extract tax amount (if no tax is listed, use 0.0)
    - Calculate total as subtotal + tax
    - Use YYYY-MM-DD format for the date
    - Ensure all math is consistent
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Pydantic models define the shape of the output. The &lt;code&gt;@ai_function&lt;/code&gt; decorator marks this as an AI-powered function. The docstring becomes the prompt, with &lt;code&gt;{receipt_text}&lt;/code&gt; as a template variable for the input. The return type tells the system what structure to generate.&lt;/p&gt;

&lt;p&gt;Post-conditions let you define what "correct" means in your specific domain. They're standard Python functions that enforce your business logic. The math has to add up and the vendor name can't be empty. The date has to be in the right format. These aren't things you can guarantee with prompt engineering alone.&lt;/p&gt;

&lt;p&gt;Here's what happens when you call &lt;code&gt;parse_receipt&lt;/code&gt; with some receipt text. &lt;/p&gt;

&lt;p&gt;Under the hood, the library hands off to a Strands agent loop. It takes your docstring (with the receipt text filled in), sends it to the model, and asks it to return a &lt;code&gt;ReceiptData&lt;/code&gt; object. &lt;/p&gt;

&lt;p&gt;Because it's running through a Strands agent, the function gets access to the same tool-use capabilities that Strands agents have, and as the integration matures, potentially other Strands features as well. But from your perspective, as the caller, it's just a function call that returns a Pydantic model.&lt;/p&gt;

&lt;p&gt;Once the model responds, &lt;code&gt;validate_math&lt;/code&gt; runs against the result. It checks whether each line item's amount equals quantity times unit price, whether the subtotal equals the sum of all line items, and whether the total equals the subtotal plus tax.&lt;/p&gt;

&lt;p&gt;If everything checks out, you get your &lt;code&gt;ReceiptData&lt;/code&gt; back. If &lt;code&gt;validate_math&lt;/code&gt; raises a &lt;code&gt;ValueError&lt;/code&gt;, the library takes that error message ("Subtotal 1,492.30 != sum of line items 1,492.80") and sends it back to the model along with the original prompt. The model sees exactly what it got wrong and tries again. This loop repeats up to &lt;code&gt;max_attempts&lt;/code&gt; times, so with &lt;code&gt;max_attempts=3&lt;/code&gt;, the model gets three chances to produce output that passes your checks.&lt;/p&gt;
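
&lt;p&gt;Conceptually, that loop looks something like the following sketch. This is not the library's actual implementation; &lt;code&gt;call_with_retries&lt;/code&gt; and its signature are hypothetical:&lt;/p&gt;

```python
def call_with_retries(generate, post_conditions, max_attempts=3):
    # Hypothetical sketch of the retry loop described above, not the
    # library's actual implementation: generate, check, feed errors back.
    feedback = None
    for attempt in range(max_attempts):
        result = generate(feedback)       # model call; sees prior error text
        try:
            for check in post_conditions:
                check(result)             # your validation code runs here
            return result                 # all post-conditions passed
        except ValueError as e:
            feedback = str(e)             # becomes feedback for the retry
    raise RuntimeError(f"failed after {max_attempts} attempts: {feedback}")
```

&lt;p&gt;The important property is that the model never retries blindly; each attempt carries the specific error text your post-condition raised.&lt;/p&gt;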

&lt;p&gt;Worth noting: &lt;code&gt;validate_math&lt;/code&gt; checks internal consistency, not extraction accuracy. If the model misreads "$8,400" as "$840" from messy OCR output, the math could still check out while being completely wrong. But that's what additional post-conditions are for. You could write one that cross-references extracted values against the raw input text, checking whether the total the model returned actually appears in the receipt. If it doesn't, something went wrong during extraction, not just during math. The pattern scales to whatever "correct" means for your use case.&lt;/p&gt;
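
&lt;p&gt;One way to express such a check, assuming a post-condition can close over the raw input text. The factory below is a hypothetical sketch, not part of the library:&lt;/p&gt;

```python
def make_source_check(receipt_text):
    # Hypothetical sketch: bind the raw input to a post-condition via a
    # closure, so extraction errors (not just math errors) get caught.
    def validate_against_source(result):
        # The extracted total should literally appear in the receipt text,
        # in one of a few common currency formattings.
        candidates = {f"{result.total:.2f}", f"{result.total:,.2f}"}
        if not any(c in receipt_text for c in candidates):
            raise ValueError(f"Total {result.total} does not appear in the receipt text")
    return validate_against_source
```

&lt;p&gt;A misread of "$8,400" as "$840" would slip past &lt;code&gt;validate_math&lt;/code&gt; but fail this check, because "840.00" never appears in the source text.&lt;/p&gt;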

&lt;p&gt;You could add more post-conditions too. Maybe &lt;code&gt;validate_completeness&lt;/code&gt; to check that required fields aren't empty. Maybe &lt;code&gt;validate_date_format&lt;/code&gt; to ensure dates parse correctly. Each one is just a Python function that raises an error when something's wrong.&lt;/p&gt;
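
&lt;p&gt;Sketches of those two post-conditions might look like this (the names come from the paragraph above and are illustrative; &lt;code&gt;result&lt;/code&gt; is the &lt;code&gt;ReceiptData&lt;/code&gt; model from the example):&lt;/p&gt;

```python
from datetime import datetime

def validate_completeness(result):
    # Hypothetical sketch: required string fields must be non-empty.
    required = ("vendor", "invoice_number", "date")
    missing = [name for name in required if not getattr(result, name, "").strip()]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")

def validate_date_format(result):
    # Hypothetical sketch: the date must actually parse as YYYY-MM-DD.
    try:
        datetime.strptime(result.date, "%Y-%m-%d")
    except ValueError:
        raise ValueError(f"Date {result.date!r} is not in YYYY-MM-DD format") from None
```

&lt;p&gt;Because each check raises a descriptive &lt;code&gt;ValueError&lt;/code&gt;, the model gets actionable feedback on retry instead of a generic failure.&lt;/p&gt;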

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This pattern is clean, but there are some tradeoffs.&lt;/p&gt;

&lt;p&gt;Latency is the first one. Each retry is another model call. If you set &lt;code&gt;max_attempts=3&lt;/code&gt;, you're looking at up to three round trips to the model. That's fine for batch processing and background jobs. It's not great for user-facing APIs where you need sub-second responses.&lt;/p&gt;

&lt;p&gt;The second tradeoff is cost. Retries multiply your API spend, and each invocation uses a fresh instance of the agent. If your post-conditions fail frequently, you're paying for multiple attempts per extraction.&lt;/p&gt;

&lt;p&gt;The retry loop is a feature, not a bug, but it shouldn't be doing the heavy lifting. Monitor your validation failure rates. If post-conditions are failing on most first attempts, your prompt needs work, not more retries. Post-conditions are there to catch edge cases, not to fix fundamentally broken prompts. &lt;/p&gt;

&lt;p&gt;You're trading latency and cost for correctness guarantees on logic you never had to implement. &lt;/p&gt;

&lt;p&gt;You didn't have to anticipate every receipt format, handle every edge case for how vendors list line items, or write a parser that accounts for the dozen ways people format currency. The model handles that ambiguity, and the post-conditions catch the errors. &lt;/p&gt;

&lt;p&gt;That's the right trade for document processing pipelines, financial data extraction, and any task where a wrong answer is worse than a slow answer. It's the wrong trade for real-time chat interfaces or high-volume, cost-sensitive operations.&lt;/p&gt;

&lt;p&gt;The library is experimental and lives in a new Strands Labs repo. It's worth exploring, but expect it to change as it matures.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern Underneath
&lt;/h2&gt;

&lt;p&gt;What makes this really interesting to me is the programming model. You declare intent through the function signature and docstring. You define correctness through post-conditions, and the AI handles the implementation.&lt;/p&gt;

&lt;p&gt;This separation keeps your validation logic as real Python code that you control, test, and version. It's not buried in a prompt or hoping the model "understands" what you mean by correct. When requirements change, you update the post-conditions. When the model improves, you get better first-attempt success rates without changing your code. &lt;/p&gt;

&lt;p&gt;Post-conditions give you a way to programmatically define "correct" for your domain, which is something prompt engineering alone can't do. A prompt can tell the model to "make sure the math adds up," but a post-condition actually checks it and provides specific feedback when it doesn't.&lt;/p&gt;
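&lt;p&gt;As a rough sketch of the underlying pattern, here it is in plain Python. This is &lt;em&gt;not&lt;/em&gt; the ai-functions API; every name below is illustrative.&lt;/p&gt;

```python
# Illustrative retry-with-post-conditions loop (not the ai-functions API).

class PostConditionError(Exception):
    """Raised when a model output fails a validation check."""

def check_totals(result: dict) -> None:
    # Post-condition: ordinary code you control, test, and version.
    computed = round(sum(item["amount"] for item in result["line_items"]), 2)
    if computed != result["total"]:
        raise PostConditionError(
            f"line items sum to {computed}, but total is {result['total']}"
        )

def run_with_postconditions(call_model, postconditions, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        result = call_model(feedback)  # feedback can steer the retry prompt
        try:
            for check in postconditions:
                check(result)
            return result
        except PostConditionError as err:
            feedback = str(err)  # specific feedback, not just "try again"
    raise PostConditionError(f"failed after {max_attempts} attempts: {feedback}")

# Stand-in for a model call: the first attempt is wrong, the second corrected.
attempts = iter([
    {"line_items": [{"amount": 9.99}, {"amount": 5.00}], "total": 15.99},
    {"line_items": [{"amount": 9.99}, {"amount": 5.00}], "total": 14.99},
])

result = run_with_postconditions(lambda feedback: next(attempts), [check_totals])
print(result["total"])  # 14.99
```

&lt;p&gt;The validation stays testable Python, and the error message gives the model something concrete to fix on the next attempt.&lt;/p&gt;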

&lt;p&gt;I had a ton of fun experimenting with this new project. Try the pattern yourself, the library is available at &lt;a href="https://github.com/strands-labs/ai-functions" rel="noopener noreferrer"&gt;https://github.com/strands-labs/ai-functions&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aws</category>
      <category>programming</category>
    </item>
    <item>
      <title>From POC to Production-Ready: What Changed in My AI Agent Architecture</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Thu, 19 Feb 2026 14:33:08 +0000</pubDate>
      <link>https://forem.com/aws/from-poc-to-production-ready-what-changed-in-my-ai-agent-architecture-3dk7</link>
      <guid>https://forem.com/aws/from-poc-to-production-ready-what-changed-in-my-ai-agent-architecture-3dk7</guid>
      <description>&lt;p&gt;Most AI agent tutorials show the same problematic pattern: a front-end client directly invoking an agent backend. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopiyh41oz4qp73o4x7kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopiyh41oz4qp73o4x7kt.png" alt="Direct client to agent architecture" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wrote a blog, &lt;a href="https://dev.to/morganwilliscloud/we-need-to-talk-about-ai-agent-architectures-4n49"&gt;We Need to Talk about AI Agent Architectures&lt;/a&gt;, that explored why this pattern is a problem and highlighted a few other patterns you should use instead.&lt;/p&gt;

&lt;p&gt;The core argument was straightforward: agents are a capability inside the system, not the system itself. &lt;/p&gt;

&lt;p&gt;The response to that post told me the topic resonated, so I did the next logical thing. I went and built the patterns I shared and created a repo so you can try them out too.&lt;/p&gt;

&lt;p&gt;The reference repository walks through multiple step-by-step iterations, showing how to evolve an agent architecture from a POC to secure and flexible production-ready patterns. &lt;/p&gt;

&lt;p&gt;The repo is here: &lt;a href="https://github.com/aws-samples/sample-ai-agent-architectures-agentcore" rel="noopener noreferrer"&gt;aws-samples/sample-ai-agent-architectures-agentcore&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also made a video walk-through of the entire solution end-to-end you can watch here:&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/jI4AYvvA7ck"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This post covers what I built, what I learned along the way, and what you should watch out for when building your own agent architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting With the Anti-pattern
&lt;/h2&gt;

&lt;p&gt;I started where most people start. Browser talks to agent, and that's it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkayt06hdj2tfx6qciinp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkayt06hdj2tfx6qciinp.png" alt="Client to Agent Pattern using OAuth" width="634" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wrote a simple LangGraph agent that had a few sample tools, and I created a front-end that I could run locally to be able to interact with it.&lt;/p&gt;

&lt;p&gt;I hosted this agent using Amazon Bedrock AgentCore Runtime and used Amazon Cognito to handle auth. &lt;/p&gt;

&lt;p&gt;AgentCore Runtime validates the token before invoking the agent, and the whole thing worked pretty easily. I had my agent hosted in the cloud and only authenticated users could access it. &lt;/p&gt;

&lt;p&gt;But then I started asking the questions I raised in the original post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What happens when someone hammers this endpoint? &lt;/li&gt;
&lt;li&gt;Where do I enforce rate limits? &lt;/li&gt;
&lt;li&gt;Where does input validation go? &lt;/li&gt;
&lt;li&gt;Where can I put the business logic needed to expand the functionality of this app? &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answer to all of those questions was: forget about it or shove it in the agent code, because there is simply &lt;em&gt;nowhere else&lt;/em&gt; for these things to be addressed.&lt;/p&gt;

&lt;p&gt;So I started adding in the necessary components to tackle these issues following the patterns I laid out in the original blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding a Proper Front Door
&lt;/h2&gt;

&lt;p&gt;With the first iteration of this architecture, I added a proper front door: Amazon API Gateway with AWS WAF (Web Application Firewall) in front of the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3u3p2c7dwmtfo5gpx8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3u3p2c7dwmtfo5gpx8w.png" alt="Api Gateway and WAF" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That gave me rate limiting, web traffic filtering, and an Amazon Cognito authorizer at the API Gateway level. The user can authenticate at the API level, and the agent still uses OAuth for inbound authentication.&lt;/p&gt;

&lt;p&gt;This is the first step away from the anti-pattern. &lt;/p&gt;

&lt;p&gt;It felt like a solid improvement, but when a colleague of mine was reviewing my solution, they found a security gap that I think a lot of us would miss. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Authentication Bypass Problem
&lt;/h2&gt;

&lt;p&gt;Let's take a step back.&lt;/p&gt;

&lt;p&gt;The API Gateway uses OAuth to authenticate incoming requests. When a user logs in and invokes the agent, API Gateway verifies the JWT passed in from the client. &lt;/p&gt;

&lt;p&gt;Then, API Gateway turns around and forwards that &lt;em&gt;exact same token&lt;/em&gt; to the agent running on AgentCore Runtime to be validated. One token, used all the way through.&lt;/p&gt;

&lt;p&gt;The problem with this is that the same token that satisfies the API Gateway also satisfies the agent directly. If a user has a valid JWT and knows the AgentCore endpoint, they can bypass the API Gateway entirely. &lt;/p&gt;

&lt;p&gt;Your rate limits, WAF rules, and any other protections you put in front of the agent become optional. A savvy user can just go around them.&lt;/p&gt;
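&lt;p&gt;A toy illustration of the gap, with made-up function names: when both layers accept the same credential, nothing forces traffic through the first one.&lt;/p&gt;

```python
# Toy model of the two auth layers (made-up functions, not AWS code).
# Both layers validate the same JWT against the same user pool.

def gateway_authorize(token, valid_tokens):
    """API Gateway's authorizer: checks the JWT, then forwards it unchanged."""
    return token in valid_tokens

def agent_authorize(token, valid_tokens):
    """The agent's inbound OAuth check: validates the very same JWT."""
    return token in valid_tokens

valid = {"user-jwt-abc"}

# Intended path: gateway first, then the forwarded token at the agent.
assert gateway_authorize("user-jwt-abc", valid)
assert agent_authorize("user-jwt-abc", valid)

# Bypass: the same token presented directly to the agent endpoint also passes,
# so the rate limits and WAF rules at the gateway never run.
assert agent_authorize("user-jwt-abc", valid)
```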

&lt;p&gt;This opens you up to a Denial of Wallet attack: someone floods your system with requests, the serverless backend scales up to absorb them, and you're hit with a fat bill down the line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7go2e2gys336bp58uyd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7go2e2gys336bp58uyd3.png" alt="Api Gateway and WAF Proxy Pattern" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This might not be an obvious gap at first, because you might think "Well, how would anyone know my agent endpoint? You need both the token AND the endpoint to invoke it. As long as someone doesn't know the endpoint, they'll be forced to go through the API Gateway."&lt;/p&gt;

&lt;p&gt;This is called security through obscurity. You're counting on someone not knowing the endpoint, but identifiers like ARNs, account numbers, and agent IDs can leak accidentally through logs, screenshots, client-side code, or shared repositories.&lt;/p&gt;

&lt;p&gt;It's not enough to operate a production system using security by obscurity as your defense.&lt;/p&gt;

&lt;p&gt;I deliberately left this gap in the examples I published in the repo (with disclaimers), because I think it is the kind of thing teams will hit in practice. &lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the Security Gap
&lt;/h2&gt;

&lt;p&gt;To address this issue, I introduced a lightweight AWS Lambda function between the gateway and the agent and switched the agent to use IAM authentication instead of OAuth.&lt;/p&gt;

&lt;p&gt;That way, the token that is used to authenticate with the API is different from what is being used to securely invoke the agent. A malicious actor can no longer invoke my agent directly.&lt;/p&gt;

&lt;p&gt;Only the AWS Lambda function with the correct permissions attached to its IAM execution role can invoke the agent.&lt;/p&gt;

&lt;p&gt;By separating user authentication from backend invocation permissions, we eliminate the possibility of a client bypassing the API protections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2h92smvx443c2hw0qsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2h92smvx443c2hw0qsl.png" alt="API Gateway + Lambda Pattern" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the pattern I recommend as the starting point for most production workloads. &lt;/p&gt;

&lt;p&gt;Cognito handles user identity, API Gateway + WAF handle traffic protection and shaping, Lambda handles request processing, and the agent handles reasoning. &lt;/p&gt;
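&lt;p&gt;As a sketch, the Lambda in that chain might look something like this, using the boto3 &lt;code&gt;bedrock-agentcore&lt;/code&gt; client. The environment variable name, payload shape, and session-id choice are assumptions for illustration, not the repo's exact code.&lt;/p&gt;

```python
import json
import os

def build_agent_payload(event: dict) -> str:
    """Pure helper: shape an API Gateway proxy event into the agent's payload."""
    body = json.loads(event.get("body") or "{}")
    return json.dumps({"query": body.get("query", "")})

def handler(event, context):
    # boto3 is imported lazily so the helper above can be tested without AWS.
    import boto3

    client = boto3.client("bedrock-agentcore")
    # This call is SigV4-signed with the Lambda's execution role, so only a
    # role explicitly allowed to invoke the runtime can reach the agent.
    response = client.invoke_agent_runtime(
        agentRuntimeArn=os.environ["AGENT_RUNTIME_ARN"],  # assumed env var
        # Using the gateway request id as a one-shot session id for illustration;
        # reuse a stable id per conversation if you want session continuity.
        runtimeSessionId=event["requestContext"]["requestId"],
        payload=build_agent_payload(event),
    )
    return {"statusCode": 200, "body": response["response"].read().decode()}
```

&lt;p&gt;The key property: the client's JWT never reaches the agent, and the invoke permission lives only on the function's IAM execution role.&lt;/p&gt;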

&lt;p&gt;This represents an application with a single endpoint. Most real-world applications have more than one endpoint.&lt;/p&gt;

&lt;p&gt;Time for the next iteration. &lt;/p&gt;

&lt;h2&gt;
  
  
  Expanding Application Functionality
&lt;/h2&gt;

&lt;p&gt;What I did next is add conversation history to the application: persistent memory for the agent, conversation history displayed on the front-end, and the ability to pick up where you left off across sessions. &lt;/p&gt;

&lt;p&gt;To achieve this, I introduced a second endpoint for &lt;code&gt;conversations&lt;/code&gt; in API Gateway, a second Lambda function for the conversation retrieval logic, an Amazon DynamoDB table for conversation metadata, and I used Amazon Bedrock AgentCore Memory for storing the full conversation history. &lt;/p&gt;

&lt;p&gt;The second endpoint and Lambda function gave me a place to run logic that does not require the agent, like retrieving past conversations from memory to display. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzgl828jzi3mcyzo1o2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkzgl828jzi3mcyzo1o2z.png" alt="Conversation history endpoint added to architecture" width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This reinforces a key design principle: only invoke the LLM when you actually need reasoning, and handle everything else with traditional application infrastructure.&lt;/p&gt;
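&lt;p&gt;A toy router makes the principle concrete; the routes and handlers below are hypothetical stand-ins for API Gateway, the Lambdas, and the agent.&lt;/p&gt;

```python
# Toy request router: only the chat route pays for an LLM invocation.
# Routes and handlers are hypothetical, not the sample repo's code.

def invoke_agent(body):
    # Stand-in for the agent invocation: the expensive, reasoning path.
    return {"answer": f"(model call for: {body['query']})"}

def list_conversations(user_id):
    # Stand-in for the conversations Lambda: plain data retrieval
    # (DynamoDB / AgentCore Memory in the real system, a dict here).
    # No reasoning required, so no model call is made.
    fake_table = {"user-1": [{"id": "c1", "title": "Shipping delays"}]}
    return fake_table.get(user_id, [])

def route(method, path, body=None, user_id=None):
    if method == "POST" and path == "/chat":
        return invoke_agent(body)           # needs reasoning
    if method == "GET" and path == "/conversations":
        return list_conversations(user_id)  # traditional infrastructure only
    return {"error": "not found"}
```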

&lt;p&gt;This is where you can really start to see how to evolve this pattern to adapt to a more complex use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agent Wasn’t the Real Problem
&lt;/h2&gt;

&lt;p&gt;The agent code barely changed across all iterations. What changed was everything around it. That progression is the whole point of sharing this example. &lt;/p&gt;

&lt;p&gt;As the system needed tighter security, traffic controls, memory, and additional endpoints, the agent stayed focused on what agents do. &lt;/p&gt;

&lt;p&gt;This is why it's so important to design agent architectures applying the same systems design thinking we apply to everything else. It lets you isolate responsibilities, keep reasoning separate from traffic control and business logic, and prevent your agent from becoming an accidental "Big Ball of Mud". &lt;/p&gt;

&lt;p&gt;You want to build an architecture around your agent that can evolve as your requirements evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Comes After the Basics
&lt;/h2&gt;

&lt;p&gt;The patterns we covered here tackle foundational concerns: traffic protection, auth boundaries, separation of responsibilities. These are well-understood problems with well-understood solutions. &lt;/p&gt;

&lt;p&gt;The design challenges that come next for deploying AI agents to production are potentially less straightforward.&lt;/p&gt;

&lt;p&gt;For example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you control which tools an agent can call, and under what conditions? &lt;/li&gt;
&lt;li&gt;How do you audit what data the agent accessed and what actions it took at scale? &lt;/li&gt;
&lt;li&gt;How do you prevent the agent from doing something that is perfectly valid in one context but inappropriate in another, while still allowing it when it makes sense?&lt;/li&gt;
&lt;li&gt;How do you ensure your agent is following the instructions you gave it end-to-end?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the kinds of problems teams hit as agents move from basic assistants to systems that take actions with real-world consequences on behalf of users and organizations. &lt;/p&gt;

&lt;p&gt;The answers require new patterns and solutions that we have not yet fully worked out or adopted widely as an industry. &lt;/p&gt;

&lt;p&gt;This post tackled the basics. You need to get the foundational architecture right first, because none of the harder problems get easier if you are also fighting your own infrastructure design choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Forth and Build Agents
&lt;/h2&gt;

&lt;p&gt;In my original post, I argued that agents are a capability inside the system, not the system itself. Building these patterns reinforced that. &lt;/p&gt;

&lt;p&gt;Every iteration made the agent more useful, more secure, and more operable, not by changing the agent, but by building the right architecture around it. &lt;/p&gt;

&lt;p&gt;Good architecture makes your agent better without the agent needing to know about it.&lt;/p&gt;

&lt;p&gt;Go fork &lt;a href="https://github.com/aws-samples/sample-ai-agent-architectures-agentcore" rel="noopener noreferrer"&gt;the repo&lt;/a&gt;, deploy the iterations, and adapt the patterns to your own use cases.&lt;/p&gt;

&lt;p&gt;If you found this useful, star the repo so others can find it too. And if you want more context on why these patterns matter, start with the original post: &lt;a href="https://dev.to/morganwilliscloud/we-need-to-talk-about-ai-agent-architectures-4n49"&gt;We Need To Talk About AI Agent Architectures&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>aws</category>
    </item>
    <item>
      <title>Deploying AI Agents on AWS Without Creating a Security Mess</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Mon, 12 Jan 2026 20:26:01 +0000</pubDate>
      <link>https://forem.com/aws/deploying-ai-agents-on-aws-without-creating-a-security-mess-4i</link>
      <guid>https://forem.com/aws/deploying-ai-agents-on-aws-without-creating-a-security-mess-4i</guid>
      <description>&lt;p&gt;Most agents that are useful need access to private data.&lt;/p&gt;

&lt;p&gt;They need to query internal databases, call internal systems, or read data that was never intended to be public. These requirements immediately raise questions about network exposure, credential handling, and compliance.&lt;/p&gt;

&lt;p&gt;How does the agent connect to a private database? Where does it run? How do you handle multiple users without sharing execution state? How do you grant access to private systems without hardcoding credentials or widening network access?&lt;/p&gt;

&lt;p&gt;This post walks through an example of how I answered those questions for an agent I deployed to AWS.&lt;/p&gt;

&lt;p&gt;You can find the full-length video where I build this solution end-to-end  &lt;a href="https://www.youtube.com/watch?v=Q-tYIAuv9WI" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/Q-tYIAuv9WI"&gt;
  &lt;/iframe&gt;
 &lt;/p&gt;
&lt;h2&gt;
  
  
  The running example
&lt;/h2&gt;

&lt;p&gt;I built a simple logistics helper agent using &lt;a href="https://strandsagents.com/latest/?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt; and an OpenAI model. It answers questions about shipments by querying a live PostgreSQL database running on &lt;a href="https://aws.amazon.com/rds/?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Amazon Relational Database Service (RDS)&lt;/a&gt; inside an &lt;a href="https://aws.amazon.com/vpc/?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Amazon Virtual Private Cloud (VPC)&lt;/a&gt; on AWS.&lt;/p&gt;

&lt;p&gt;The easy part was building the agent logic. I got it running locally using mocked tools for early testing.&lt;/p&gt;

&lt;p&gt;The hard part was deploying the agent in a way that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does not expose the database publicly&lt;/li&gt;
&lt;li&gt;does not embed credentials in code&lt;/li&gt;
&lt;li&gt;does not punch unnecessary holes in the network&lt;/li&gt;
&lt;li&gt;properly isolates user sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS provides the building blocks to solve these problems, but you still need to make deliberate choices about how they fit together.&lt;/p&gt;

&lt;p&gt;This post uses the logistics agent as a running example. Each snippet is either from the agent code or the infrastructure files that deploy it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Amazon Bedrock AgentCore primer
&lt;/h2&gt;

&lt;p&gt;In this example, AgentCore Runtime is the hosting environment for the logistics agent.&lt;/p&gt;

&lt;p&gt;AgentCore Runtime is a managed, serverless runtime that runs agents in isolated sessions and handles authentication, scaling, and lifecycle management without requiring you to completely rewrite your agent for integration. It is framework and model agnostic, and supports multiple protocols, including HTTP, MCP, and A2A. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftivtrrlwodn8bz5fvtgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftivtrrlwodn8bz5fvtgn.png" alt="Amazon Bedrock AgentCore Runtime Overview" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can read more about Amazon Bedrock AgentCore Runtime &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agents-tools-runtime.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  The architecture at a glance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87qja2ptzkzzf4fkdvnq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87qja2ptzkzzf4fkdvnq.png" alt="Architecture Diagram for example" width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram above shows the architecture for the backend of the logistics agent, including how it connects to the private database and external model provider, OpenAI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent runs on Amazon Bedrock AgentCore Runtime.&lt;/li&gt;
&lt;li&gt;The AgentCore Runtime deploys Elastic Network Interfaces (ENIs) into private subnets inside a VPC to allow connectivity with private resources.&lt;/li&gt;
&lt;li&gt;The database runs on a private RDS instance in the same VPC.&lt;/li&gt;
&lt;li&gt;The agent reads database connection information from AWS Systems Manager Parameter Store.&lt;/li&gt;
&lt;li&gt;The agent reads secrets from AWS Secrets Manager (database credentials and the OpenAI key for the model provider).&lt;/li&gt;
&lt;li&gt;VPC endpoints keep calls to AWS services on the AWS network, including calls to AgentCore, AWS Systems Manager, and AWS Secrets Manager.&lt;/li&gt;
&lt;li&gt;A NAT Gateway provides outbound internet access so the agent can call OpenAI for inference.&lt;/li&gt;
&lt;li&gt;IAM controls:

&lt;ul&gt;
&lt;li&gt;who can invoke the agent&lt;/li&gt;
&lt;li&gt;what AWS APIs the agent can call once invoked&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
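&lt;p&gt;For example, the agent's execution role might include statements along these lines. The ARNs are placeholders; scope yours to your actual resources.&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadConfig",
      "Effect": "Allow",
      "Action": ["ssm:GetParameter"],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/logistics-agent/*"
    },
    {
      "Sid": "ReadSecrets",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": [
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:logistics-db-*",
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:openai-key-*"
      ]
    }
  ]
}
```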

&lt;p&gt;If you want to see the full code or AWS Cloud Development Kit (CDK) stack, the step-by-step guide can be found on GitHub &lt;a href="https://github.com/aws-samples/sample-logistics-agent-agentcore-runtime/tree/main" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;
&lt;h2&gt;
  
  
  A quick map of the security concerns and supporting AWS features
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Security concern&lt;/th&gt;
&lt;th&gt;AWS primitive&lt;/th&gt;
&lt;th&gt;Where it shows up in this example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inbound authentication for invocations&lt;/td&gt;
&lt;td&gt;AgentCore Runtime support for IAM SigV4 or OAuth (JWT)&lt;/td&gt;
&lt;td&gt;The caller invoking &lt;code&gt;InvokeAgentRuntime&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session isolation&lt;/td&gt;
&lt;td&gt;AgentCore Runtime sessions&lt;/td&gt;
&lt;td&gt;Runtime behavior (no shared process across users)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets&lt;/td&gt;
&lt;td&gt;AWS Secrets Manager&lt;/td&gt;
&lt;td&gt;Agent loads DB credentials and OpenAI key at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-secret config&lt;/td&gt;
&lt;td&gt;AWS SSM Parameter Store&lt;/td&gt;
&lt;td&gt;Agent loads endpoint, DB name, and secret ARNs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Runtime agent permissions&lt;/td&gt;
&lt;td&gt;IAM execution role&lt;/td&gt;
&lt;td&gt;Role associated to the agent in AgentCore Runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private connectivity to AWS services&lt;/td&gt;
&lt;td&gt;VPC endpoints&lt;/td&gt;
&lt;td&gt;Interface endpoints for AgentCore, Systems Manager, and Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private connectivity to RDS&lt;/td&gt;
&lt;td&gt;VPC networking and security groups&lt;/td&gt;
&lt;td&gt;Runtime ENIs in private subnets and security group rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Egress only internet access from private subnets&lt;/td&gt;
&lt;td&gt;NAT gateway&lt;/td&gt;
&lt;td&gt;NAT Gateway in a public subnet with private subnet route tables for 0.0.0.0/0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Keep this table in mind, as this post will dive deeper into each row.&lt;/p&gt;


&lt;h2&gt;
  
  
  The agent code, trimmed to the parts that matter
&lt;/h2&gt;

&lt;p&gt;This is the logistics helper agent written in Python using &lt;a href="https://strandsagents.com/latest/?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Strands Agents SDK&lt;/a&gt;, with some details removed for brevity. The full sample can be found &lt;a href="https://github.com/aws-samples/sample-logistics-agent-agentcore-runtime/tree/main" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pg8000.native&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Cached within a single runtime session
&lt;/span&gt;&lt;span class="n"&gt;_db_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;_db_credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;_db_connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_load_db_config&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_db_connection&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_shipment_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_no&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_delayed_shipments&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a logistics tracking assistant with access to a real-time shipment database.
&lt;/span&gt;&lt;span class="gp"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;_openai_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_openai_model&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_initialize_agent&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logistics_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please provide a query in the format: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;your question here&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_initialize_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  How to make the agent AgentCore Runtime compatible
&lt;/h3&gt;

&lt;p&gt;Before we dive into the specific AWS features used for security in this example, let’s first review how to make an agent AgentCore Runtime compatible. In your agent file, the code needed for integration is minimal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logistics_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;@app.entrypoint&lt;/code&gt; decorator marks the handler, or entrypoint, for your agent. AgentCore Runtime calls that function with a payload whenever an invocation hits the agent.&lt;/p&gt;

&lt;p&gt;Behind the scenes, this implements the AgentCore Runtime service contract for HTTP, which you can read more about &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-http-protocol-contract.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important part is that it implements the &lt;code&gt;/invocations&lt;/code&gt; endpoint on port &lt;code&gt;8080&lt;/code&gt;, which allows us to invoke the agent once it’s deployed to AgentCore Runtime.&lt;/p&gt;
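&lt;p&gt;During local development, running the agent file directly starts that HTTP server via &lt;code&gt;app.run()&lt;/code&gt;, so you can exercise the contract yourself. A minimal sketch using only the standard library; the &lt;code&gt;{"query": ...}&lt;/code&gt; payload shape is this example’s convention, not something the contract requires:&lt;/p&gt;

```python
import json
import urllib.request

def build_invocation_request(query, url="http://localhost:8080/invocations"):
    # Build a POST matching the AgentCore Runtime HTTP contract:
    # a JSON body sent to /invocations on port 8080.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the agent running locally (python agent.py), send a test invocation:
# response = urllib.request.urlopen(build_invocation_request("Where is order 1042?"))
# print(response.read().decode("utf-8"))
```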

&lt;p&gt;This example is for an agent built using the Strands Agents SDK; you can find code snippets supporting other frameworks &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/using-any-agent-framework.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Deploying the agent to AgentCore Runtime
&lt;/h3&gt;

&lt;p&gt;Once the agent is wired up, you have options for deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AgentCore CLI starter toolkit&lt;/strong&gt;: fast iteration, good for development and early testing. You can use the command line to run &lt;code&gt;agentcore configure&lt;/code&gt; to configure your agent, then &lt;code&gt;agentcore deploy&lt;/code&gt; to deploy it to runtime. Read more about this &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/runtime-get-started-toolkit.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as code&lt;/strong&gt; (AWS CloudFormation or AWS CDK): best for production deployments. You can find the AgentCore Construct Library for AWS CDK &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_bedrock_agentcore_alpha/README.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ll be using snippets from the AWS CDK template I created to deploy the agent in the following sections.&lt;/p&gt;




&lt;h3&gt;
  
  
  Inbound authentication and authorization
&lt;/h3&gt;

&lt;p&gt;For the logistics agent, the first security boundary is deciding who is allowed to invoke the agent.&lt;/p&gt;

&lt;p&gt;That means you need an inbound authentication mechanism, and I don’t know about you, but I am not rolling my own auth.&lt;/p&gt;

&lt;p&gt;AgentCore Runtime supports two inbound authentication options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IAM (SigV4)&lt;/strong&gt;: the caller signs the request with AWS credentials. An IAM policy on the caller determines whether they’re allowed to invoke the agent runtime, the same way authorization works for other AWS APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OAuth 2.0 (JWT bearer tokens)&lt;/strong&gt;: the caller authenticates with an identity provider and sends a JWT bearer token. The agent runtime validates that token (via your configured IdP).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The logistics helper agent uses IAM for inbound authentication. When the agent is invoked, AgentCore Runtime validates the incoming request. There is no code related to authentication in the actual agent itself. AgentCore Runtime handles that for you.&lt;/p&gt;
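&lt;p&gt;From the caller’s side, an IAM-authenticated invocation is then an ordinary SigV4-signed AWS API call. Here’s a sketch using boto3; the runtime ARN and the &lt;code&gt;{"query": ...}&lt;/code&gt; payload shape are assumptions from this example, so check the parameter names against the &lt;code&gt;bedrock-agentcore&lt;/code&gt; API reference for your SDK version:&lt;/p&gt;

```python
import json

def build_payload(query):
    # AgentCore Runtime delivers this JSON to the @app.entrypoint handler as a dict.
    return json.dumps({"query": query}).encode("utf-8")

def invoke_logistics_agent(runtime_arn, session_id, query, region="us-east-1"):
    import boto3  # imported lazily so build_payload stays dependency-free

    # The caller's IAM identity must allow bedrock-agentcore:InvokeAgentRuntime
    # on runtime_arn; boto3 signs the request with SigV4 automatically.
    client = boto3.client("bedrock-agentcore", region_name=region)
    response = client.invoke_agent_runtime(
        agentRuntimeArn=runtime_arn,
        runtimeSessionId=session_id,
        payload=build_payload(query),
    )
    # The response body arrives as a streaming payload; read it fully.
    return response["response"].read().decode("utf-8")
```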

&lt;h3&gt;
  
  
  What IAM permissions look like for the invoker
&lt;/h3&gt;

&lt;p&gt;When invoking the agent, the invoker needs permission to call the &lt;code&gt;bedrock-agentcore:InvokeAgentRuntime&lt;/code&gt; API on the runtime ARN.&lt;/p&gt;

&lt;p&gt;Example policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowInvokeAgentRuntime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bedrock-agentcore:InvokeAgentRuntime"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/logistics_agent"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important distinction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is the invoker’s permission (who can call the agent).&lt;/li&gt;
&lt;li&gt;Later we’ll define the agent IAM execution role (what the agent can do once it starts running).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;NOTE: In this example, I’m invoking the runtime directly using local IAM credentials, the AgentCore CLI, and the AWS SDK as a proof of concept. In a real-world system, I would place an API, hosted with a service like Amazon API Gateway, in front of the agent as a proxy. API Gateway would handle end-user authentication and request validation, then use its own IAM role to call the &lt;code&gt;InvokeAgentRuntime&lt;/code&gt; API for the agent. I wrote another blog about why you should have at least a proxy component sitting in front of your agents &lt;a href="https://dev.to/morganwilliscloud/we-need-to-talk-about-ai-agent-architectures-4n49"&gt;here&lt;/a&gt;. &lt;/p&gt;


&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Isolating users and execution state
&lt;/h3&gt;

&lt;p&gt;If multiple users hit the same agent at the same time, you don't want a process with data and state shared across users. You also don't need to reinvent the wheel by creating a multi-tenant isolation mechanism yourself.&lt;/p&gt;

&lt;p&gt;AgentCore Runtime runs agents in isolated environments, called sessions. &lt;/p&gt;

&lt;p&gt;Each time someone invokes your agent, AgentCore either creates a new session or routes the request to an existing session (if you supply a session ID). Agent sessions run in a dedicated microVM with isolated CPU, memory, and filesystem resources.&lt;/p&gt;
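&lt;p&gt;To keep a conversation inside one session, you pass the same session ID on each invocation. One detail worth knowing: the runtime session ID must be at least 33 characters (an assumption based on the service docs; verify the current constraint for your SDK). A small helper, as an illustration:&lt;/p&gt;

```python
import uuid

def new_session_id():
    # Two UUID4 hex strings give a 64-character ID, comfortably over the
    # assumed 33-character minimum for AgentCore Runtime session IDs.
    return uuid.uuid4().hex + uuid.uuid4().hex

# Reusing the same ID routes follow-up invocations to the same session
# (and its microVM); a new ID starts a fresh, isolated session.
```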

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi97a4f43felx1smu5cjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi97a4f43felx1smu5cjd.png" alt="AgentCore Runtime Session Overview" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Storing and accessing credentials and connection details securely
&lt;/h3&gt;

&lt;p&gt;Now that we know how to invoke the agent and how the agent runs in isolated sessions, the next question is: how does the logistics agent gain access to private systems without baking sensitive data into the code or environment variables?&lt;/p&gt;

&lt;p&gt;There are two different kinds of data the agent needs in order to connect to the Amazon RDS database and OpenAI model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Secrets&lt;/strong&gt;, like database credentials and the OpenAI API keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;, like hostnames, database names, and the ARNs of the secrets themselves so they can be retrieved programmatically &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What to store where
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You should use &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; to store sensitive values like:

&lt;ul&gt;
&lt;li&gt;The RDS username and password&lt;/li&gt;
&lt;li&gt;The OpenAI API key&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;You should use &lt;strong&gt;AWS Systems Manager Parameter Store&lt;/strong&gt; to store non-secret configuration data like:

&lt;ul&gt;
&lt;li&gt;The RDS endpoint&lt;/li&gt;
&lt;li&gt;The database name&lt;/li&gt;
&lt;li&gt;The ARNs of the secrets that contain credentials&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This split gives you a few practical benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Secrets can be rotated independently&lt;/li&gt;
&lt;li&gt;You can audit secret access&lt;/li&gt;
&lt;li&gt;You avoid the temptation to pass credentials around “just to make it work”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Configuration lookup using AWS SSM Parameter Store
&lt;/h3&gt;

&lt;p&gt;This code snippet allows the agent to read three parameters from AWS SSM Parameter Store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The RDS endpoint&lt;/li&gt;
&lt;li&gt;The database name&lt;/li&gt;
&lt;li&gt;The secret ARN for the DB credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Python code using boto3 to access AWS SSM Parameter Store:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ssm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ssm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/secret-arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;span class="n"&gt;_db_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret_arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agentcore/rds/secret-arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern keeps configuration out of code and out of deployment artifacts, and access can be tightly scoped to only the specific parameter paths the agent needs.&lt;/p&gt;

&lt;p&gt;From a security standpoint, this also creates a clear separation of responsibilities: Parameter Store answers where the database is and which secret to use to connect, while Secrets Manager controls what the credentials actually are. &lt;/p&gt;

&lt;p&gt;If configuration details need to change, you update it centrally without redeploying code, and if access needs to be revoked or audited, it’s handled through IAM rather than application logic.&lt;/p&gt;

&lt;p&gt;This keeps configuration flexible, secrets isolated, and permissions explicit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fetching credentials from AWS Secrets Manager
&lt;/h3&gt;

&lt;p&gt;Once the logistics agent knows which secret to retrieve, it fetches the credentials from AWS Secrets Manager.&lt;/p&gt;

&lt;p&gt;Example Python code using boto3 to access AWS Secrets Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;secrets_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;secretsmanager&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;secret_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_secret_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;SecretId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_db_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;secret_arn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;_db_credentials&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SecretString&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern is used to retrieve the OpenAI API key. The agent never reads secrets from disk, environment variables, or configuration files. Everything comes from managed services at runtime. &lt;/p&gt;

&lt;p&gt;When you’re working with secrets in code, be careful not to log the secret payload or connection strings. Treat exceptions as potentially sensitive, and sanitize logs.&lt;/p&gt;
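&lt;p&gt;A simple guard, shown here purely as an illustration, is to scrub every known secret value from a message before it reaches a logger:&lt;/p&gt;

```python
def sanitize(message, secret_values):
    # Mask every known secret value before the message reaches the logs.
    for value in secret_values:
        if value:
            message = message.replace(str(value), "***REDACTED***")
    return message

# Example: scrub a connection error before logging it, using the
# credentials dict fetched from Secrets Manager earlier.
# logger.error(sanitize(str(exc), _db_credentials.values()))
```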




&lt;h2&gt;
  
  
  Granting the agent permission to call AWS APIs
&lt;/h2&gt;

&lt;p&gt;Every AWS API call the logistics agent makes (AWS SSM, AWS Secrets Manager, Amazon CloudWatch) is authorized through the IAM execution role attached to the agent runtime.&lt;/p&gt;

&lt;p&gt;This is distinct from the invoker permissions described earlier. It defines what the runtime can do after an invocation starts.&lt;/p&gt;

&lt;p&gt;In this example, the role was created using the AWS CDK. It is assumed by the AgentCore Runtime service principal and granted least-privilege access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read specific SSM parameters&lt;/li&gt;
&lt;li&gt;read specific AWS Secrets Manager secrets&lt;/li&gt;
&lt;li&gt;write logs/traces/metrics to Amazon CloudWatch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example code snippet from the AWS CDK stack that defines the IAM permissions for the parameters and secrets the agent needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;runtime_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_to_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssm:GetParameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ssm:GetParameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:ssm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:parameter/agentcore/rds/endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:ssm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:parameter/agentcore/rds/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:ssm:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:parameter/agentcore/rds/secret-arn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;runtime_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_to_policy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretsmanager:GetSecretValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_secret_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openai_secret_arn&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the agent tries to fetch a secret it is not allowed to read, the call fails with an access denied error.&lt;/p&gt;

&lt;p&gt;Additionally, the AgentCore Runtime IAM execution role is the only principal allowed to read these secrets, scoped to specific secret ARNs, and secret encryption is handled by Secrets Manager (optionally with a customer-managed KMS key if you need tighter controls).&lt;/p&gt;




&lt;h3&gt;
  
  
  Allowing the agent to access private resources inside an Amazon VPC
&lt;/h3&gt;

&lt;p&gt;Because the logistics agent queries a private RDS instance, the runtime itself should run inside the same VPC.&lt;/p&gt;

&lt;p&gt;To achieve this, the agent runtime is deployed using the VPC network mode configuration. &lt;/p&gt;

&lt;h3&gt;
  
  
  The AgentCore Runtime VPC network mode configuration
&lt;/h3&gt;

&lt;p&gt;By default, AgentCore Runtime does not deploy agents to a VPC. Deploying an agent with VPC network mode lets it connect to other resources within that VPC without exposing them outside the VPC. This makes it easier for your agent to work with private databases, call internal APIs, or integrate with other existing systems running in a VPC.&lt;/p&gt;

&lt;p&gt;Example code snippet from the AWS CDK stack that defines the AgentCore Runtime resource using VPC network mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CfnResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentCoreRuntime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS::BedrockAgentCore::Runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentRuntimeName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logistics_agent_cdk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runtime for logistics Strands agent with RDS backed tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RoleArn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;runtime_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_arn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NetworkConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NetworkMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VPC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NetworkModeConfig&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subnets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;private_subnet_ids&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SecurityGroups&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;runtime_sg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;security_group_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AgentRuntimeArtifact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodeConfiguration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asset_bucket_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prefix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;asset_object_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EntryPoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYTHON_3_12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an agent is invoked with VPC network mode configured, elastic network interfaces, or ENIs, are created in the configured private subnets. This gives each runtime session private IP addresses and allows it to connect to resources inside the VPC, like the logistics RDS database, over internal VPC networking. &lt;/p&gt;

&lt;h3&gt;
  
  
  VPC endpoints for accessing AWS services
&lt;/h3&gt;

&lt;p&gt;Once the runtime is configured to run inside private subnets, the next issue pops up: the logistics agent still needs to call AWS APIs to work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SSM and Secrets Manager are AWS services.&lt;/li&gt;
&lt;li&gt;AgentCore itself is an AWS service.&lt;/li&gt;
&lt;li&gt;CloudWatch Logs is an AWS service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without VPC endpoints, these API calls would typically route through public AWS service endpoints via a NAT gateway and traverse the public internet. In many environments, that pattern is not acceptable for compliance or security reasons.&lt;/p&gt;

&lt;p&gt;VPC endpoints allow those calls to stay entirely on the AWS private network. By deploying endpoints for services like AWS Systems Manager Parameter Store, AWS Secrets Manager, Amazon Bedrock AgentCore Runtime, and Amazon CloudWatch, API traffic is routed privately within the VPC, reducing reliance on NAT gateways and eliminating exposure to the public internet.&lt;/p&gt;
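&lt;p&gt;As a rough sketch, those interface endpoints can be created with boto3. The service-name suffixes, Region, and resource IDs below are assumptions for illustration, not taken from the sample repo; verify the exact endpoint service names (especially for AgentCore) for your Region before relying on them.&lt;/p&gt;

```python
def endpoint_service_names(region: str) -> list[str]:
    """Interface endpoint service names for the AWS services the agent calls.
    The service suffixes here are assumptions; verify them for your Region."""
    services = ["ssm", "secretsmanager", "bedrock-agentcore", "logs"]
    return [f"com.amazonaws.{region}.{svc}" for svc in services]

def create_endpoints(vpc_id: str, subnet_ids: list[str], sg_id: str,
                     region: str = "us-east-1") -> None:
    """Create one interface endpoint per service inside the private subnets."""
    import boto3  # local import so the pure helper above has no AWS dependency

    ec2 = boto3.client("ec2", region_name=region)
    for name in endpoint_service_names(region):
        ec2.create_vpc_endpoint(
            VpcEndpointType="Interface",
            VpcId=vpc_id,
            ServiceName=name,
            SubnetIds=subnet_ids,
            SecurityGroupIds=[sg_id],
            PrivateDnsEnabled=True,  # resolve the service's default DNS name privately
        )
```

&lt;p&gt;With private DNS enabled, the agent code keeps using the normal SDK clients; the calls simply resolve to the endpoint ENIs instead of the public service endpoints.&lt;/p&gt;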




&lt;h2&gt;
  
  
  Restricting database access to only the agent runtime
&lt;/h2&gt;

&lt;p&gt;The VPC connectivity feature puts the agent in the right network, but it does not by itself allow the agent to communicate with the RDS database. That comes from security groups.&lt;/p&gt;

&lt;p&gt;In this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent has a security group.&lt;/li&gt;
&lt;li&gt;The RDS instance has a security group.&lt;/li&gt;
&lt;li&gt;The RDS security group allows inbound PostgreSQL traffic only from the agent security group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern, with one security group referencing another, is sometimes called security group chaining. It means you don’t have to allow a CIDR range or open access across the entire VPC.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the RDS security group rule should look like conceptually
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Inbound:

&lt;ul&gt;
&lt;li&gt;Protocol: TCP&lt;/li&gt;
&lt;li&gt;Port: 5432 (Postgres)&lt;/li&gt;
&lt;li&gt;Source: runtime security group ID&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
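&lt;p&gt;That conceptual rule maps directly onto the parameters for the EC2 &lt;code&gt;authorize_security_group_ingress&lt;/code&gt; call. A minimal sketch, with placeholder security group IDs:&lt;/p&gt;

```python
def postgres_ingress_rule(db_sg_id: str, runtime_sg_id: str) -> dict:
    """Parameters for authorize_security_group_ingress: allow Postgres (5432)
    only from the runtime security group, never from a CIDR range."""
    return {
        "GroupId": db_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": runtime_sg_id}],
        }],
    }
```

&lt;p&gt;Applying it is then a single call, e.g. &lt;code&gt;ec2.authorize_security_group_ingress(**postgres_ingress_rule("sg-db…", "sg-runtime…"))&lt;/code&gt;. Using &lt;code&gt;UserIdGroupPairs&lt;/code&gt; instead of &lt;code&gt;IpRanges&lt;/code&gt; is what makes this chaining rather than CIDR-based access.&lt;/p&gt;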

&lt;p&gt;By default, security groups allow all outbound network traffic. You can also restrict the allowed outbound traffic to only what’s required (RDS port, VPC endpoints, and approved egress).&lt;/p&gt;

&lt;p&gt;This example also assumes TLS is enforced for the RDS connection so database traffic is encrypted in transit.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on database access and query scope
&lt;/h3&gt;

&lt;p&gt;A key design choice in this example that has not been covered yet is that the logistics agent never generates SQL directly. &lt;/p&gt;

&lt;p&gt;It can only invoke prewritten tools that execute parameterized queries defined in code. This design avoids letting the model construct arbitrary queries against the database, which introduces risks ranging from accidental data exposure to destructive operations.&lt;/p&gt;

&lt;p&gt;The agent can choose which tool to call and which parameters to supply, but it cannot change the shape of the query, the tables involved, or the operations being performed. That keeps the database interaction predictable and reviewable, even as agent behavior evolves.&lt;/p&gt;

&lt;p&gt;Even with tool-based access, the database credentials used by the agent are scoped to read-only access on the required schema and views. Database permissions remain the final layer of protection if a tool is misconfigured, expanded later, or reused in ways that were not originally anticipated. It's important to have a layered approach to security.&lt;/p&gt;
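&lt;p&gt;To make the tool-based access pattern concrete, here is a minimal sketch using an in-memory SQLite table as a stand-in for the RDS database. The table and tool names are illustrative, not from the sample repo; the point is that the SQL shape is fixed in code and the model only supplies a parameter value.&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the logistics database (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (id TEXT, status TEXT)")
conn.execute("INSERT INTO shipments VALUES ('SH-1001', 'in_transit')")

def get_shipment_status(shipment_id: str) -> list:
    """A prewritten tool: the query text is fixed, the agent can only
    supply the bound parameter value, never raw SQL."""
    query = "SELECT id, status FROM shipments WHERE id = ?"
    return conn.execute(query, (shipment_id,)).fetchall()
```

&lt;p&gt;Because the value is bound as a parameter, even a malicious-looking input is treated as data, not SQL, and simply matches no rows.&lt;/p&gt;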




&lt;h3&gt;
  
  
  Allowing outbound internet access from a private subnet
&lt;/h3&gt;

&lt;p&gt;At this point, the logistics agent can talk to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS privately inside the VPC&lt;/li&gt;
&lt;li&gt;AWS services privately through VPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the OpenAI model is not an AWS service, so the agent also needs outbound internet access.&lt;/p&gt;

&lt;p&gt;Because the agent runs in private subnets, it cannot reach the internet directly. &lt;/p&gt;

&lt;p&gt;The pattern that allows egress-only traffic is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A NAT gateway or NAT instance deployed to a public subnet&lt;/li&gt;
&lt;li&gt;Private subnet route tables that direct internet-bound traffic (&lt;code&gt;0.0.0.0/0&lt;/code&gt;) to the NAT gateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives you controlled egress without giving the runtime a public IP. This also keeps your AWS service calls private through VPC endpoints while still enabling external calls for OpenAI model invocation.&lt;/p&gt;
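&lt;p&gt;The route itself is one entry per private route table. A sketch of the parameters for the EC2 &lt;code&gt;create_route&lt;/code&gt; call, with placeholder IDs:&lt;/p&gt;

```python
def private_subnet_internet_route(route_table_id: str, nat_gateway_id: str) -> dict:
    """Parameters for create_route: send all internet-bound traffic from a
    private subnet's route table to the NAT gateway for egress-only access."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": "0.0.0.0/0",  # anything not matched by a more specific route
        "NatGatewayId": nat_gateway_id,
    }
```

&lt;p&gt;VPC-internal and VPC endpoint traffic matches more specific routes first, so only genuinely internet-bound traffic (like the OpenAI API calls) flows through the NAT gateway.&lt;/p&gt;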




&lt;h2&gt;
  
  
  Putting all the pieces together
&lt;/h2&gt;

&lt;p&gt;By the time everything is wired up, the security model is pretty straightforward. &lt;/p&gt;

&lt;p&gt;This post has focused on the foundational security and deployment mechanics required to run an agent against private systems. It does not represent a complete architecture and intentionally does not cover application-level authorization, data encryption, fine-grained data access controls, model-specific safety techniques, or cost optimization strategies, all of which depend heavily on the specific use case. Those pieces build on top of the patterns shown here rather than replacing them.&lt;/p&gt;

&lt;p&gt;Within this scope, the security model comes down to a few clear responsibilities. None of these controls are unique to agents, but skipping them is how agent deployments turn into security problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Lock down who can invoke the agent
&lt;/h3&gt;

&lt;p&gt;Inbound access is handled by AgentCore Runtime.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you use IAM, invokers need &lt;code&gt;bedrock-agentcore:InvokeAgentRuntime&lt;/code&gt; permission on the runtime ARN.&lt;/li&gt;
&lt;li&gt;If you use OAuth, your callers authenticate with your IdP and present a JWT, and the runtime validates it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either way, you have a clear, externalized answer to “who can call this thing.”&lt;/p&gt;
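&lt;p&gt;For the IAM path, the caller’s identity policy can be as narrow as one action on one runtime ARN. A sketch, where the ARN format is illustrative:&lt;/p&gt;

```python
import json

def invoke_policy(runtime_arn: str) -> str:
    """A least-privilege identity policy for agent callers: only the ability
    to invoke this one runtime, nothing else."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "bedrock-agentcore:InvokeAgentRuntime",
            "Resource": runtime_arn,  # scope to the specific runtime, not "*"
        }],
    })
```

&lt;p&gt;Scoping &lt;code&gt;Resource&lt;/code&gt; to the runtime ARN, rather than &lt;code&gt;*&lt;/code&gt;, keeps the invocation permission auditable per agent.&lt;/p&gt;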

&lt;h3&gt;
  
  
  2) Do not share execution state across users
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AgentCore Runtime sessions give you per-session isolation. &lt;/li&gt;
&lt;li&gt;Your agent isn’t running as one long-lived server process that every user shares.&lt;/li&gt;
&lt;li&gt;Within a session, you can cache data. Across sessions, state is isolated. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3) Authorize agents to make AWS API calls via an IAM execution role
&lt;/h3&gt;

&lt;p&gt;Once invoked, the runtime assumes an IAM role that defines exactly what AWS API calls it can make.&lt;/p&gt;

&lt;p&gt;No static credentials are needed, and if the role doesn’t allow it, the agent can’t do it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4) Allow secure access to private resources inside a VPC
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use VPC Network Mode in your AgentCore Runtime configuration&lt;/li&gt;
&lt;li&gt;AgentCore Runtime deploys ENIs to selected private subnets&lt;/li&gt;
&lt;li&gt;Use VPC endpoints for communication with AgentCore and other AWS services&lt;/li&gt;
&lt;li&gt;VPC endpoints keep AWS service traffic on the private AWS network&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5) Only allow appropriate database access from the agent
&lt;/h3&gt;

&lt;p&gt;Security groups provide an instance-level firewall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDS inbound traffic is limited to the runtime security group on the database port&lt;/li&gt;
&lt;li&gt;No broad CIDR-based rules&lt;/li&gt;
&lt;li&gt;No “anything in the VPC can connect” rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6) Provide narrow egress internet access for OpenAI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NAT gateway gives the runtime outbound access from private subnets&lt;/li&gt;
&lt;li&gt;Route tables send internet-bound traffic to NAT&lt;/li&gt;
&lt;li&gt;AWS service calls can still stay private via VPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s, at a minimum, what it takes to deploy an agent that accesses private systems without creating a security mess.&lt;/p&gt;

&lt;p&gt;If you want to follow this example or adapt this architecture, the &lt;a href="https://github.com/aws-samples/sample-logistics-agent-agentcore-runtime/tree/main" rel="noopener noreferrer"&gt;full repo&lt;/a&gt; includes the infrastructure and the deployment steps, plus cleanup.&lt;/p&gt;

&lt;p&gt;And check out the video where I walk you through building the whole solution end-to-end &lt;a href="https://www.youtube.com/watch?v=Q-tYIAuv9WI" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>We Need To Talk About AI Agent Architectures</title>
      <dc:creator>Morgan Willis</dc:creator>
      <pubDate>Mon, 08 Dec 2025 23:00:17 +0000</pubDate>
      <link>https://forem.com/aws/we-need-to-talk-about-ai-agent-architectures-4n49</link>
      <guid>https://forem.com/aws/we-need-to-talk-about-ai-agent-architectures-4n49</guid>
      <description>&lt;p&gt;&lt;strong&gt;AI agents are getting easier to build and host.&lt;/strong&gt; With agentic frameworks and cloud-based hosting environments, you can deploy an agent to the cloud in an afternoon. It is now possible to assemble a multi-agent setup with memory, observability, and MCP connected tools without a huge amount of code or infrastructure work.&lt;/p&gt;

&lt;p&gt;This convenience, paired with AI coding assistants making it easier than ever to ship, has created a trend that is worth talking about. Many developers are wiring UIs directly to their agents as if the agent runtime &lt;em&gt;is&lt;/em&gt; the entire backend. It looks clean. It feels efficient. It also happens to be what most demos show, so it is understandable that teams take that pattern and run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnicshblhmr1xb9vfo31t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnicshblhmr1xb9vfo31t.png" alt="The diagram illustrates a direct client→agent architecture with a single entrypoint where the agent runtime replaces the entire backend." width="800" height="435"&gt;&lt;/a&gt;&lt;br&gt;
The diagram illustrates a direct client→agent architecture with a single entrypoint where the agent runtime replaces the entire backend.&lt;/p&gt;

&lt;p&gt;This works well when you are exploring ideas. Once you move beyond a demo and into a real application, that client to agent pattern may start to break down. This is not because any specific agent runtime itself is limited, but because real production systems still need the same architectural layers they have always needed.&lt;/p&gt;

&lt;p&gt;Web applications still need input sanitization. APIs still need rate limits. Business logic still needs a home. Services still need to coordinate with other systems. As soon as those pieces enter the picture, the architecture starts to look a lot more familiar.&lt;/p&gt;

&lt;p&gt;AI agents expand what an application can do, but they do not erase the fundamentals of good systems design. The agent itself is not the system. It is a capability inside the system.&lt;/p&gt;

&lt;p&gt;Let’s talk about what that means and why it matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Direct Client → Agent Is an Incomplete Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Terminology note:&lt;/strong&gt; In this post, &lt;em&gt;runtime&lt;/em&gt; refers to a managed environment that executes agent logic on the server side. I use Amazon Bedrock AgentCore Runtime as an example throughout, but the same concepts apply to other hosted environments. &lt;em&gt;Agent&lt;/em&gt; or &lt;em&gt;agent service&lt;/em&gt; is your deployed code containing the agent framework, prompts, and tool integration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Upstream services or modules&lt;/em&gt; are all components that handle requests before they reach the agent (UI, gateways, routers, backends). &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Downstream services or modules&lt;/em&gt; are the tools and resources the agent calls (MCP tools, APIs, databases, internal services).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the client talks directly to the agent runtime, responsibilities that normally live in other components can either get lost entirely or end up pushed into your agent code where they do not belong.&lt;/p&gt;

&lt;p&gt;Without typical components of web architectures, the agent is expected to handle: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Request and security boundaries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input sanitization, API level authorization rules, web traffic filtering, rate limiting, throttling, and safety checks. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Application and system orchestration&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Coordinating services, enforcing business rules that span multiple systems, and managing workflow transitions that require durability outside an agent session. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resilience and operational concerns&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Retries, backoff behavior, event buffering, and behaviors that protect downstream systems. &lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Agents or hosted runtimes may be able to handle some of these tasks, but they were never designed to be your entire backend, your middle tier, or your web server. This is the same reason why we don’t point clients directly at AWS Lambda functions in most production systems without protective layers. In a similar way, agents are not meant to be directly front-end facing services for most use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where This Architecture Breaks Down in Practice
&lt;/h2&gt;

&lt;p&gt;Here are three ways the client→agent pattern can break down in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Traffic, cost, and load patterns become hard to control.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the UI talks directly to a single agent service without upstream boundaries, there is no clean place to enforce rate limits, handle noisy clients, or cap usage per user. A small bug, a retry loop, or a surge in usage can translate into a flood of LLM calls, driving unpredictable latency and inference costs without a structured way to throttle or shed load.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Every change shares the same blast radius because everything ships in one deployment unit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When validation logic, business rules, integration code, and agent behavior all live in the same service, every small change requires touching and redeploying the entire agent app. A tweak to a business rule, a simple bug fix, or a prompt change all share the same blast radius and rollback path, which slows iteration and makes failures harder to localize. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Refactoring becomes brittle as the system grows.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;When the agent service acts as the entire backend, every aspect is fused into a single deployment unit. Additionally, many agent runtimes expose a single entrypoint like &lt;code&gt;POST /invoke&lt;/code&gt;, which means every feature, workflow, and behavior enters through one undifferentiated entrypoint. &lt;/p&gt;

&lt;p&gt;Nothing distinguishes one operation from another, so you lose the natural places where you would normally enforce permissions, validate input, or apply business rules. &lt;/p&gt;

&lt;p&gt;With this setup, extending the architecture becomes difficult. Adding new functionality, queues, or workflow orchestration later means untangling tightly coupled logic. Adding features risks rewriting the agent, because the system never developed the separation needed to evolve cleanly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Separation of Concerns Still Matters
&lt;/h2&gt;

&lt;p&gt;We break systems into modules because each piece handles a specific kind of complexity so the rest of the system doesn’t have to. That separation of concerns keeps responsibilities contained, avoids logic leaking across boundaries, allows for decoupling, and makes the system more predictable at scale.&lt;/p&gt;

&lt;p&gt;Testability also suffers when everything runs inside a single boundary. Isolating components, mocking dependencies, and doing targeted regression testing is far easier when concerns are separated into clear modules.&lt;/p&gt;

&lt;p&gt;Experienced developers and systems engineers know this intuitively, but the rapid progress in AI tooling has lowered the barrier to agent deployment in a way that lets people ship agents before they have the architectural context to support them. &lt;/p&gt;

&lt;p&gt;We, as a technical community, should amplify real-world patterns and lessons learned. Providing more examples of advanced use cases alongside simplified tutorials will allow us to learn together and move towards a set of guidelines for well-architected agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Balancing Simplicity and Structure in Agentic Systems
&lt;/h2&gt;

&lt;p&gt;Just like any other solution, as you introduce more moving parts, you are now responsible for operating and maintaining them.&lt;/p&gt;

&lt;p&gt;Additional components add complexity in the same way that having a load balancer, an API gateway, and a database connection pool adds complexity. These components exist because they absorb or abstract the handling of specific categories of risk, or responsibility, so your core application code does not have to. They make the entire system more reliable.&lt;/p&gt;

&lt;p&gt;None of this means you must build a massive, highly distributed, micro-serviced architecture to use agents correctly. &lt;/p&gt;

&lt;p&gt;You can run a simple, clean setup with a load balancer and router component in front of your agent, or add an API gateway for basic shaping and protection, and stop there. That pattern is perfectly valid for many teams, especially early on.&lt;/p&gt;

&lt;p&gt;At the same time, companies operating at global scale or projects with complex requirements will naturally need more components. They may introduce additional services for orchestration, workflow durability, message buffering, network connectivity, or cross-system coordination. These architectures are more complex because the requirements and traffic patterns call for that complexity.&lt;/p&gt;

&lt;p&gt;Both ends of that spectrum are reasonable. What matters is choosing the right architecture for your use case and constraints. The goal is not to chase complexity for complexity’s sake, and it is also not to flatten everything into a single module. It is to introduce the minimum number of components that meaningfully reduce risk, improve security, and enable flexibility as your system grows and changes.&lt;/p&gt;

&lt;p&gt;That balance is what helps you start simple without boxing yourself into a corner later.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Belongs in the Agent vs the Backend?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With all of that being said, what &lt;em&gt;does&lt;/em&gt; belong in the agent vs in other components?&lt;/p&gt;

&lt;p&gt;Agent frameworks make it easy to blur these boundaries, but keeping them clear is what prevents the system from collapsing into an expensive mess. The way you decide to build your agent heavily depends on your use case and technology choices. Agentic frameworks vary in their implementation, and so do the requirements from case to case. There is no one size fits all answer. Here are some high-level guidelines for getting started.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What typically belongs upstream (UI, gateway, router, backend)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Input shaping, validation, rate limiting, and web traffic filtering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core business logic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coordinating between services or orchestrating complex workflows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflow state, retries, orchestration, and durability&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Separating these concerns keeps security, validation, and business rules separate from core agent code, reduces the blast radius of changes, and lets you change agent behavior without constantly reworking the logic that keeps the system running on a basic level.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What typically belongs inside the agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Invoking LLMs using agentic frameworks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool selection and orchestration logic&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent session state, context, and memory handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An agent is generally responsible for interpreting goals, choosing actions, and reasoning over context. Decisions might come from the model, from graph level orchestration, or from deterministic routing depending on the framework and use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What typically belongs in tools&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reading or writing data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Querying systems of record&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triggering deterministic code&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Invoking internal or external APIs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Triggering another agent to do work&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools encapsulate actions. The model may determine &lt;em&gt;when&lt;/em&gt; a tool is needed, but the tool controls &lt;em&gt;how&lt;/em&gt; the underlying operation executes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AWS Architecture Patterns for AI Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Agents fit into systems just like any other capability fits into an application. You can keep things lightweight or expand into more distributed designs as your scale and needs change.&lt;/p&gt;

&lt;p&gt;The patterns highlighted below intentionally leave out other parts of agentic systems like memory, MCP servers, RAG, and multi-agent communication. Those are important topics, but they sit inside the agent runtime or downstream from it rather than in the upstream architectural components we are focusing on here.&lt;/p&gt;

&lt;p&gt;You can extend or adapt these patterns for your use case. I will use AWS services as examples, with Amazon Bedrock AgentCore Runtime as the agent runtime, though you could swap these components with services from other providers and keep the same patterns.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quick Amazon Bedrock AgentCore Primer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because the following examples use AWS services, here are the basics. AgentCore Runtime is a managed serverless environment for hosting AI agents. It handles deployment, scaling, and session management, and integrates with many tools and services both inside and outside of AWS. It supports both IAM and OAuth based identity so you can plug it into existing security models. To learn more about AgentCore Runtime, click &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/agents-tools-runtime.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Minimal API Gateway Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuazkpmzc52silo3dpqvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuazkpmzc52silo3dpqvg.png" alt="Client → Amazon API Gateway + AWS Web Application Firewall (WAF)→ Amazon Bedrock AgentCore Runtime → Downstream services" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Client → Amazon API Gateway + AWS Web Application Firewall (WAF) → Amazon Bedrock AgentCore Runtime → Downstream services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use this when&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are moving from prototype to production and want a small number of well understood layers. &lt;/li&gt;
&lt;li&gt;You need basic protections like auth, rate limits, and input validation but do not yet have a large service ecosystem. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;API Gateway and AWS WAF provide authentication, rate limits, routing, web traffic filtering, and a controlled boundary before the agent is invoked. &lt;/p&gt;

&lt;p&gt;You can optionally include an AWS Lambda function between the API Gateway and the agent runtime which lets you write custom logic when invoking the agent, including deterministic input validation or other logic. &lt;/p&gt;
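&lt;p&gt;A minimal sketch of what that Lambda function might look like. The request field names, size limit, and handler shape are assumptions for illustration; the actual AgentCore invocation is left as a comment since it depends on your runtime configuration.&lt;/p&gt;

```python
import json

MAX_PROMPT_CHARS = 4000  # assumed limit; tune for your use case

def validate_request(body: dict) -> tuple:
    """Deterministic input validation performed upstream, before the agent
    is ever invoked. Field names here are illustrative."""
    prompt = body.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        return (False, "prompt must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        return (False, f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    return (True, "ok")

def handler(event, context):
    """Lambda proxy handler: reject bad input with a 400 before any LLM call."""
    body = json.loads(event.get("body") or "{}")
    ok, reason = validate_request(body)
    if not ok:
        return {"statusCode": 400, "body": json.dumps({"error": reason})}
    # ...invoke AgentCore Runtime here (e.g., via boto3) and return its reply...
    return {"statusCode": 200, "body": json.dumps({"accepted": True})}
```

&lt;p&gt;Rejecting malformed or oversized input here means the agent, and the model behind it, never pays the cost of handling it.&lt;/p&gt;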

&lt;p&gt;AgentCore Runtime handles inbound identity using OAuth or IAM.&lt;/p&gt;

&lt;p&gt;If you later need queuing for incoming messages, you can include Amazon SQS between API Gateway and the agent and use a Lambda function that processes messages and invokes AgentCore Runtime. That lets you handle spiky traffic or ordered message processing without changing how the agent itself works.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Traditional Backend + Agent Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a30wq3tgy26b33y9okw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6a30wq3tgy26b33y9okw.png" alt="Client → Application Load Balancer + AWS WAF → Web server(s), e.g., on Amazon EC2, Amazon ECS, or AWS Lambda→ Amazon Bedrock AgentCore Runtime → Downstream services" width="800" height="411"&gt;&lt;/a&gt;&lt;br&gt;
Client → Application Load Balancer + AWS WAF → Web server(s), e.g., on Amazon EC2, Amazon ECS, or AWS Lambda → Amazon Bedrock AgentCore Runtime → Downstream services&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use this when&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You already have a web backend that you need to integrate into or if you need a designated component for routing and business logic. &lt;/li&gt;
&lt;li&gt;You have non-trivial logic or workflow orchestration requirements. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many production workloads still run traditional web backends. Those architectures do not disappear or need a major overhaul when you add an AI agent. You extend them.&lt;/p&gt;

&lt;p&gt;The client sends requests through an Application Load Balancer which can integrate with AWS WAF for web filtering. From there, the request is sent to a web backend on Amazon EC2, containers, or Lambda. &lt;/p&gt;

&lt;p&gt;The backend handles business logic and system coordination. The agent is a capability it uses, invoked via a VPC endpoint so traffic remains private.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Deep Automation Agent Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjdbfz0wkhws79ffm6w9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjdbfz0wkhws79ffm6w9.png" alt="Events coming from Amazon EventBridge→ AWS Step Functions → AWS Lambda → Amazon Bedrock AgentCore Runtime → Downstream Systems " width="800" height="530"&gt;&lt;/a&gt;&lt;br&gt;
Events coming from Amazon EventBridge → AWS Step Functions → AWS Lambda → Amazon Bedrock AgentCore Runtime → Downstream systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use this when&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The value of the agent lives in backend processes, not in a chat UI. &lt;/li&gt;
&lt;li&gt;You want agents to be one part of a larger workflow. &lt;/li&gt;
&lt;li&gt;Work is triggered by events, schedules, or pipelines rather than direct user interaction. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, the agent is a part of a larger workflow, pipeline, or automation task. Agents can potentially run asynchronously, with no user facing UI at all.&lt;/p&gt;

&lt;p&gt;Events from Amazon EventBridge or scheduled runs can invoke the agent in AgentCore Runtime directly using IAM as the authentication method. You can optionally introduce AWS Step Functions as a way to coordinate the steps of a long-running or multi-phased workflow that mixes deterministic and nondeterministic steps. &lt;/p&gt;

&lt;p&gt;Step Functions provides a workflow control mechanism so the agent does not need to manage retries, branching, or overall workflow state.&lt;/p&gt;

&lt;p&gt;The agent does its work and calls downstream services or tools as needed, while coordination between steps is handled by Step Functions. This lets you run deterministic steps with services like AWS Lambda before the agent runs, notify relevant parties with Amazon Simple Notification Service afterward, or invoke other services in parallel with your agent. Again, you could swap out Step Functions for another workflow orchestrator and the concept still applies.&lt;/p&gt;

&lt;p&gt;These patterns let you start simple, introduce components only when needed, and grow into more distributed or mature architectures.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this post, remember this: the question is not whether you can connect your client directly to an agent. You technically can. The question is whether you should.&lt;/p&gt;

&lt;p&gt;In the short term, it may feel fast and simple. In the long term, it leads to a brittle system that is difficult to extend, hard to understand, and expensive to maintain.&lt;/p&gt;

&lt;p&gt;A well-structured architecture lets agents be first class participants in your system without being overloaded by concerns that belong elsewhere. That is how you get the best of both worlds: the power of agentic reasoning combined with the reliability of proven distributed system design.&lt;/p&gt;

&lt;p&gt;And yes, the client→agent tutorials are still useful. They exist to teach one focused concept without burying you in use case specific and complex details. They show you how to get an agent running, not how to design the full application around it.&lt;/p&gt;

&lt;p&gt;But once you move toward production, the question becomes: Did we build a full system or did we stop at the agent?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent is the brain. The architecture is the body. You need both.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about agentic design patterns on AWS, visit &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/introduction.html?trk=a76ecb1b-1eaf-4e12-a22a-c872d8279680&amp;amp;sc_channel=el" rel="noopener noreferrer"&gt;Agentic AI patterns and workflows on AWS&lt;/a&gt; and stay tuned for more blog posts from the AWS team where we explore specific architectures for agentic AI use cases and advanced design patterns.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
