<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Raj Murugan</title>
    <description>The latest articles on Forem by Raj Murugan (@rajmurugan).</description>
    <link>https://forem.com/rajmurugan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1906575%2F92e08690-ea8e-4b95-93ce-525ed9f2668c.png</url>
      <title>Forem: Raj Murugan</title>
      <link>https://forem.com/rajmurugan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rajmurugan"/>
    <language>en</language>
    <item>
      <title>Part 5: CI/CD for Bedrock AgentCore with GitHub Actions and AWS OIDC (No Stored Credentials)</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:38:35 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-5-cicd-for-bedrock-agentcore-with-github-actions-and-aws-oidc-no-stored-credentials-3cc1</link>
      <guid>https://forem.com/rajmurugan/part-5-cicd-for-bedrock-agentcore-with-github-actions-and-aws-oidc-no-stored-credentials-3cc1</guid>
      <description>&lt;p&gt;Storing AWS access keys in GitHub Secrets is the wrong approach. They rotate, they get leaked, and they're a compliance headache.&lt;/p&gt;

&lt;p&gt;The correct approach in 2025 is OIDC: GitHub Actions proves its identity to AWS using a short-lived token, assumes an IAM role, and gets temporary credentials. No stored keys, no rotation, no secrets to leak.&lt;/p&gt;

&lt;p&gt;This post walks through the complete CI/CD setup for AgentCore: OIDC config, the build/push/deploy pipeline, and the dual-tag ECR strategy that makes rollback practical.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why OIDC over stored credentials
&lt;/h2&gt;

&lt;p&gt;With stored &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; / &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keys are long-lived (you rotate them, right? right?)&lt;/li&gt;
&lt;li&gt;Rotation requires updating secrets in every affected repo&lt;/li&gt;
&lt;li&gt;A leak (accidental commit, log output, third-party action) gives an attacker permanent access until rotated&lt;/li&gt;
&lt;li&gt;Keys are attached to an IAM user — you need a separate user per CI/CD system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With OIDC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub generates a short-lived OIDC token per workflow run&lt;/li&gt;
&lt;li&gt;AWS validates the token against the trusted identity provider&lt;/li&gt;
&lt;li&gt;The IAM role is assumed — the temporary credentials expire automatically (one hour by default)&lt;/li&gt;
&lt;li&gt;No secrets to rotate, no keys to leak&lt;/li&gt;
&lt;li&gt;Trust policy is scoped to specific repos and branches&lt;/li&gt;
&lt;/ul&gt;
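&lt;p&gt;To make the trust scoping concrete, the OIDC token GitHub issues carries claims along these lines (an abridged, illustrative payload — the repo and branch names are placeholders; &lt;code&gt;aud&lt;/code&gt; is &lt;code&gt;sts.amazonaws.com&lt;/code&gt; because the AWS credentials action requests that audience):&lt;/p&gt;

```json
{
  "iss": "https://token.actions.githubusercontent.com",
  "aud": "sts.amazonaws.com",
  "sub": "repo:octo-org/octo-repo:ref:refs/heads/main",
  "repository": "octo-org/octo-repo",
  "ref": "refs/heads/main"
}
```

&lt;p&gt;The trust policy conditions in the next section match directly against the &lt;code&gt;aud&lt;/code&gt; and &lt;code&gt;sub&lt;/code&gt; claims.&lt;/p&gt;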




&lt;h2&gt;
  
  
  Setting up OIDC
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create the IAM OIDC provider (once per AWS account)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-open-id-connect-provider &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; https://token.actions.githubusercontent.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--client-id-list&lt;/span&gt; sts.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--thumbprint-list&lt;/span&gt; 6938fd4d98bab03faadb97b34396831e3780aea1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells AWS to trust tokens from &lt;code&gt;token.actions.githubusercontent.com&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create the deploy IAM role
&lt;/h3&gt;

&lt;p&gt;The trust policy restricts which GitHub repositories are allowed to assume the role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::&amp;lt;ACCOUNT&amp;gt;:oidc-provider/token.actions.githubusercontent.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"repo:rajmurugan01/bedrock-agentcore-starter:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;StringLike&lt;/code&gt; condition with &lt;code&gt;*&lt;/code&gt; allows any branch. For production deployments, lock it down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"repo:rajmurugan01/bedrock-agentcore-starter:ref:refs/heads/main"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
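&lt;p&gt;If you tighten the trust policy after the role already exists, you don't recreate the role — update it in place (the role name here is a placeholder):&lt;/p&gt;

```shell
# Replace the role's trust (assume-role) policy with the locked-down version
aws iam update-assume-role-policy \
  --role-name github-actions-deploy \
  --policy-document file://trust-policy.json
```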



&lt;h3&gt;
  
  
  Step 3: Attach permissions to the deploy role
&lt;/h3&gt;

&lt;p&gt;The role needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt; — login to ECR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ecr:BatchGetImage&lt;/code&gt;, &lt;code&gt;ecr:GetDownloadUrlForLayer&lt;/code&gt;, &lt;code&gt;ecr:PutImage&lt;/code&gt;, etc. — push to ECR&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bedrock-agentcore:UpdateAgentRuntime&lt;/code&gt; — update the Runtime after pushing a new image (the IAM action prefix is &lt;code&gt;bedrock-agentcore&lt;/code&gt;, even though the CLI command is &lt;code&gt;bedrock-agentcore-control&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssm:GetParameter&lt;/code&gt; — read Runtime ID and other config from SSM&lt;/li&gt;
&lt;/ul&gt;
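&lt;p&gt;As a sketch, a permissions policy covering those four needs might look like the following — the account ID, repository name, and parameter path are placeholders, and the action names are worth double-checking against the IAM service authorization reference before use:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*" },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchCheckLayerAvailability",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/customer-service-agent"
    },
    { "Effect": "Allow", "Action": "bedrock-agentcore:UpdateAgentRuntime", "Resource": "*" },
    {
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/customerServiceAgent/*"
    }
  ]
}
```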




&lt;h2&gt;
  
  
  The deploy workflow
&lt;/h2&gt;

&lt;p&gt;The full file is &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter/blob/main/.github/workflows/deploy-agent.yml" rel="noopener noreferrer"&gt;.github/workflows/deploy-agent.yml&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Key sections:&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apps/customer-service-agent/**'&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;choice&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;stg&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prd&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;paths&lt;/code&gt; filter means the workflow only triggers when agent code changes — not on every push to main. Infrastructure changes (CDK) run in a separate workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  OIDC credential configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;   &lt;span class="c1"&gt;# Required to receive the OIDC token&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_DEPLOY_ROLE_ARN }}&lt;/span&gt;
      &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id-token: write&lt;/code&gt; permission is what enables OIDC. Without it, GitHub doesn't generate the OIDC token and the step fails.&lt;/p&gt;
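&lt;p&gt;A cheap sanity check is to print the assumed identity immediately after the credentials step — the &lt;code&gt;Arn&lt;/code&gt; in the output should contain &lt;code&gt;assumed-role&lt;/code&gt; and your deploy role's name, not an IAM user:&lt;/p&gt;

```yaml
- name: Verify assumed identity
  run: aws sts get-caller-identity
```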

&lt;h3&gt;
  
  
  Build for linux/amd64
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build Docker image&lt;/span&gt;
  &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/customer-service-agent&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker build \&lt;/span&gt;
      &lt;span class="s"&gt;--platform linux/amd64 \&lt;/span&gt;
      &lt;span class="s"&gt;-t ${{ env.ECR_URI }}:latest \&lt;/span&gt;
      &lt;span class="s"&gt;-t ${{ env.ECR_URI }}:${{ env.GIT_SHA }} \&lt;/span&gt;
      &lt;span class="s"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces two tags simultaneously in one build — no rebuilding.&lt;/p&gt;
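&lt;p&gt;The &lt;code&gt;GIT_SHA&lt;/code&gt; env var isn't defined in the snippet above; one common way to populate it (an assumption about this workflow, not a quote from it) is to shorten the &lt;code&gt;GITHUB_SHA&lt;/code&gt; that Actions injects:&lt;/p&gt;

```shell
# GITHUB_SHA is injected by GitHub Actions; hard-coded here only so the
# snippet is self-contained
GITHUB_SHA="a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"

# The first 8 characters make a short, human-readable image tag
GIT_SHA="${GITHUB_SHA:0:8}"
echo "$GIT_SHA"   # a1b2c3d4
```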

&lt;h3&gt;
  
  
  The dual-tag ECR strategy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Push to ECR&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker push ${{ env.ECR_URI }}:latest&lt;/span&gt;
    &lt;span class="s"&gt;docker push ${{ env.ECR_URI }}:${{ env.GIT_SHA }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;:latest&lt;/code&gt;&lt;/strong&gt; — the workflow's &lt;code&gt;update-agent-runtime&lt;/code&gt; call points the container URI at &lt;code&gt;:latest&lt;/code&gt;, so this tag must always reference the most recent image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;:&amp;lt;git-sha&amp;gt;&lt;/code&gt;&lt;/strong&gt; (e.g., &lt;code&gt;:a1b2c3d4&lt;/code&gt;) — pinned to a specific commit. If &lt;code&gt;:latest&lt;/code&gt; introduces a regression, you can roll back by pushing the previous SHA tag as &lt;code&gt;:latest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Rollback to a previous image&lt;/span&gt;
docker pull &amp;lt;ecr-uri&amp;gt;:a1b2c3d4
docker tag &amp;lt;ecr-uri&amp;gt;:a1b2c3d4 &amp;lt;ecr-uri&amp;gt;:latest
docker push &amp;lt;ecr-uri&amp;gt;:latest
&lt;span class="c"&gt;# Then trigger update-agent-runtime again&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Updating the AgentCore Runtime
&lt;/h3&gt;

&lt;p&gt;After pushing the image, we tell AgentCore to pull the new &lt;code&gt;:latest&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update AgentCore Runtime&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;RUNTIME_ID=$(aws ssm get-parameter \&lt;/span&gt;
      &lt;span class="s"&gt;--name "/customerServiceAgent/${{ env.ENVIRONMENT }}/runtime-id" \&lt;/span&gt;
      &lt;span class="s"&gt;--query Parameter.Value --output text)&lt;/span&gt;

    &lt;span class="s"&gt;aws bedrock-agentcore-control update-agent-runtime \&lt;/span&gt;
      &lt;span class="s"&gt;--agent-runtime-id "${RUNTIME_ID}" \&lt;/span&gt;
      &lt;span class="s"&gt;--agent-runtime-artifact '{"containerConfiguration":{"containerUri":"${{ env.ECR_URI }}:latest"}}' \&lt;/span&gt;
      &lt;span class="s"&gt;--role-arn "${{ secrets.EXECUTION_ROLE_ARN }}" \&lt;/span&gt;
      &lt;span class="s"&gt;--network-configuration '{"networkMode":"VPC","networkModeConfig":{"securityGroups":["${{ secrets.AGENT_SECURITY_GROUP_ID }}"],"subnets":["${{ secrets.AGENT_SUBNET_IDS }}"]}}' \&lt;/span&gt;
      &lt;span class="s"&gt;--region us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember Gotcha #7 from Part 2: &lt;code&gt;--role-arn&lt;/code&gt; and &lt;code&gt;--network-configuration&lt;/code&gt; are both mandatory. The &lt;code&gt;--role-arn&lt;/code&gt; is the &lt;strong&gt;execution role&lt;/strong&gt; (the role AgentCore uses at runtime), not the deploy role the workflow is running as.&lt;/p&gt;

&lt;p&gt;One more wrinkle: the &lt;code&gt;--network-configuration&lt;/code&gt; above substitutes &lt;code&gt;AGENT_SUBNET_IDS&lt;/code&gt; into a single JSON array element, which only works when the secret holds exactly one subnet ID. If it holds multiple comma-separated IDs, expand them into separate array entries before the call.&lt;/p&gt;
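&lt;p&gt;Note that &lt;code&gt;update-agent-runtime&lt;/code&gt; returns before the new container is actually serving traffic. A follow-up check along these lines can gate the job on the Runtime settling — this assumes a control-plane &lt;code&gt;get-agent-runtime&lt;/code&gt; call with a &lt;code&gt;status&lt;/code&gt; field, both worth verifying against your CLI version:&lt;/p&gt;

```shell
# Check whether the Runtime has finished updating (status field assumed)
aws bedrock-agentcore-control get-agent-runtime \
  --agent-runtime-id "${RUNTIME_ID}" \
  --query status --output text \
  --region us-east-1
```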




&lt;h2&gt;
  
  
  The CI workflow
&lt;/h2&gt;

&lt;p&gt;Runs on every push and pull request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/ci.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;lint-python&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install ruff black&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ruff check customer_service_agent/&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;black --check customer_service_agent/&lt;/span&gt;

  &lt;span class="na"&gt;test-infra&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;              &lt;span class="c1"&gt;# Jest CDK unit tests&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run synth&lt;/span&gt;         &lt;span class="c1"&gt;# CDK synth smoke test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CDK synth must succeed without AWS credentials. This works as long as &lt;code&gt;cdk.context.json&lt;/code&gt; is committed to the repo — it contains the VPC lookup cache that CDK needs for deterministic synthesis.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;cdk.context.json&lt;/code&gt; is missing (or the VPC lookup context changed), CDK will try to call the AWS API during synth and fail in CI. Regenerate it locally: &lt;code&gt;cdk context --clear &amp;amp;&amp;amp; cdk synth&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-environment promotion
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;workflow_dispatch&lt;/code&gt; trigger lets you manually promote a build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;choice&lt;/span&gt;
        &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;stg&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prd&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with GitHub Environments (configured in repository Settings → Environments), you can require manual approval before deploying to &lt;code&gt;stg&lt;/code&gt; or &lt;code&gt;prd&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Push to &lt;code&gt;main&lt;/code&gt; → auto-deploys to &lt;code&gt;dev&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manually trigger workflow → select &lt;code&gt;stg&lt;/code&gt; → GitHub requires approval from reviewers&lt;/li&gt;
&lt;li&gt;After approval → deploys to &lt;code&gt;stg&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Manual trigger → select &lt;code&gt;prd&lt;/code&gt; → same approval gate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;environment:&lt;/code&gt; key in the job declaration activates the GitHub Environment's protection rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ inputs.environment || 'dev' }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  GitHub Secrets to configure
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Secret&lt;/th&gt;
&lt;th&gt;Where it comes from&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AWS_DEPLOY_ROLE_ARN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ARN of the OIDC role you created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXECUTION_ROLE_ARN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CDK output &lt;code&gt;ExecutionRole&lt;/code&gt; ARN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AGENT_SECURITY_GROUP_ID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CDK output Security Group ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AGENT_SUBNET_IDS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CDK output subnet IDs (comma-separated)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are repo-level secrets (Settings → Secrets and variables → Actions). For multi-environment setups, use environment-level secrets to have different values per environment.&lt;/p&gt;
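&lt;p&gt;If you prefer the CLI to clicking through Settings, the GitHub CLI can set both flavors — the ARNs below are placeholders:&lt;/p&gt;

```shell
# Repo-level secret
gh secret set AWS_DEPLOY_ROLE_ARN \
  --body "arn:aws:iam::123456789012:role/github-actions-deploy"

# Environment-scoped secret (different value per GitHub Environment)
gh secret set EXECUTION_ROLE_ARN --env prd \
  --body "arn:aws:iam::123456789012:role/agent-execution-prd"
```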




&lt;h2&gt;
  
  
  End-to-end flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer pushes to main
    ↓
GitHub Actions: ci.yml runs (lint + CDK tests, ~2 min)
    ↓
GitHub Actions: deploy-agent.yml triggers (paths: apps/**)
    ↓
Configure AWS credentials (OIDC, ~10s)
    ↓
docker build --platform linux/amd64 (~3-5 min)
    ↓
docker push :latest + :&amp;lt;sha&amp;gt; to ECR (~1-2 min)
    ↓
update-agent-runtime CLI (~30s)
    ↓
AgentCore pulls new image, restarts container instances
    ↓
New code is live
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time from push to live: ~8-10 minutes.&lt;/p&gt;

&lt;p&gt;In the final part, we look at cost — how much this system actually costs to run, where prompt caching saves the most, and how to set CloudWatch alarms before your bill surprises you.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;&lt;a href="https://dev.to/blog/part-6-cost-performance-prompt-caching"&gt;Continue to Part 6: Cost &amp;amp; Performance&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-5-cicd-github-actions-oidc" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 5 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>aws</category>
      <category>cicd</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Part 3: Building the AI Agent with Strands Agents SDK, Prompt Caching, and AgentCore Memory</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:37:54 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-3-building-the-ai-agent-with-strands-agents-sdk-prompt-caching-and-agentcore-memory-134a</link>
      <guid>https://forem.com/rajmurugan/part-3-building-the-ai-agent-with-strands-agents-sdk-prompt-caching-and-agentcore-memory-134a</guid>
      <description>&lt;p&gt;With the CDK infrastructure in place (Part 2), we need an actual agent to run inside it.&lt;/p&gt;

&lt;p&gt;The agent is a Python application that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exposes an HTTP endpoint AgentCore can call&lt;/li&gt;
&lt;li&gt;Uses the Strands Agents SDK to run a Bedrock-backed reasoning loop&lt;/li&gt;
&lt;li&gt;Integrates with AgentCore Memory for persistent context&lt;/li&gt;
&lt;li&gt;Uses Bedrock Guardrails on every invocation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full source is in &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter/tree/main/apps/customer-service-agent" rel="noopener noreferrer"&gt;apps/customer-service-agent/&lt;/a&gt; in the demo repo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Strands over LangChain or LlamaIndex?
&lt;/h2&gt;

&lt;p&gt;When I started this project, LangChain was the default answer for "I need to build an agent." I used it, ran into friction, and switched to Strands. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strands is AWS-native.&lt;/strong&gt; It's built to integrate directly with Bedrock services — prompt caching, guardrail configs, tool definitions. With LangChain, you write adapter code to bridge from LangChain abstractions down to raw Bedrock APIs. With Strands, you're calling the Bedrock API directly through a thin, intentional abstraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool definitions are simpler.&lt;/strong&gt; In LangChain, you define tools with &lt;code&gt;StructuredTool.from_function()&lt;/code&gt; or subclass &lt;code&gt;BaseTool&lt;/code&gt;. In Strands, you decorate a function with &lt;code&gt;@tool&lt;/code&gt; and the docstring becomes the description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LangChain approach
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StructuredTool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrderLookupInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The order ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StructuredTool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lookup_order_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Look up the current status of an order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;args_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OrderLookupInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Strands approach
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Look up the current status of a customer order by order ID.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Active development matches AgentCore.&lt;/strong&gt; Strands is developed at a cadence that tracks AgentCore releases. New AgentCore features show up in Strands before they make it to LangChain adapters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer_service_agent/
├── __init__.py
├── config.py       # Settings from env vars
├── prompts.py      # System prompt
├── tools.py        # @tool definitions
├── memory.py       # AgentCore Memory boto3 helpers
├── agent.py        # BedrockModel setup + streaming
└── main.py         # FastAPI app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  config.py — environment variables
&lt;/h2&gt;

&lt;p&gt;Everything the agent needs is injected as environment variables by AgentCore. In production, you set &lt;code&gt;EnvironmentVariables&lt;/code&gt; on the &lt;code&gt;CfnRuntime&lt;/code&gt; resource in CDK. Locally, you use &lt;code&gt;.env.local&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic_settings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseSettings&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSettings&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;agentcore_memory_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_guardrail_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;bedrock_guardrail_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;aws_region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Primary: Claude Sonnet 4.6
&lt;/span&gt;    &lt;span class="n"&gt;primary_model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-6-20251001-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Background: Nova Pro (~15x cheaper per token)
&lt;/span&gt;    &lt;span class="n"&gt;background_model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.nova-pro-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;env_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env.local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The dual-model strategy
&lt;/h2&gt;

&lt;p&gt;The agent uses two Bedrock models for different tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; for main conversations — best reasoning, multi-step tool use, nuanced responses. More expensive but worth it for the customer-facing output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Nova Pro&lt;/strong&gt; for background tasks — ~15x cheaper per token. Ideal for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Classifying intent before routing&lt;/li&gt;
&lt;li&gt;Summarising long conversation history&lt;/li&gt;
&lt;li&gt;Generating internal labels/tags&lt;/li&gt;
&lt;li&gt;Any task where "good enough" is sufficient&lt;/li&gt;
&lt;/ul&gt;
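&lt;p&gt;As a sketch, the routing decision can live in one small helper (the helper name and task labels here are illustrative, not part of Strands or AgentCore):&lt;br&gt;
&lt;/p&gt;

```python
# Illustrative model router: cheap model for background work,
# strong model for the customer-facing conversation.
PRIMARY_MODEL = "anthropic.claude-sonnet-4-6-20251001-v1:0"  # main conversations
BACKGROUND_MODEL = "amazon.nova-pro-v1:0"                    # classify/summarise/tag

BACKGROUND_TASKS = {"classify_intent", "summarise_history", "generate_tags"}

def pick_model(task: str) -> str:
    """Route known background tasks to the cheaper model, everything else to primary."""
    return BACKGROUND_MODEL if task in BACKGROUND_TASKS else PRIMARY_MODEL
```

&lt;p&gt;Keeping the choice in one function means a model swap or an A/B test is a one-line change rather than a hunt through call sites.&lt;/p&gt;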




&lt;h2&gt;
  
  
  Prompt caching — the 90% cost saving
&lt;/h2&gt;

&lt;p&gt;This is the most impactful optimisation in the whole system.&lt;/p&gt;

&lt;p&gt;Prompt caching works like this: you mark part of your prompt as a "cacheable prefix". Bedrock caches those tokens server-side for ~5 minutes. On subsequent calls that use the same prefix, you pay the &lt;strong&gt;cache read price&lt;/strong&gt; instead of the full input token price.&lt;/p&gt;

&lt;p&gt;For Claude Sonnet 4.6:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache write: $3.00 per 1M input tokens (same as normal)&lt;/li&gt;
&lt;li&gt;Cache read: $0.30 per 1M input tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Saving: 90% on cached tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt is the perfect candidate for caching — it's the same on every request in a session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="c1"&gt;# Adaptive retry — Bedrock throttles hard under load
&lt;/span&gt;&lt;span class="n"&gt;boto_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;read_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;primary_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-6-20251001-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;additional_request_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Enable prompt caching (Anthropic beta feature on Bedrock)
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt-caching-2024-07-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;guardrail_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailIdentifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bedrock_guardrail_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bedrock_guardrail_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# System prompt with cache_control: ephemeral
# This marks the prompt as a cacheable prefix for Bedrock
&lt;/span&gt;&lt;span class="n"&gt;cached_system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;# Cache this prefix
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 1,500-token system prompt across 100 sessions/day with 5 turns each (500 invocations):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without caching: 500 × 1,500 = 750,000 system prompt tokens/day × $3/1M = &lt;strong&gt;$2.25/day&lt;/strong&gt; just for system prompts&lt;/li&gt;
&lt;li&gt;With caching: turn 1 pays the cache write price, turns 2-5 pay the cache read price → (150,000 × $3 + 600,000 × $0.30)/1M ≈ &lt;strong&gt;$0.63/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Saving: ~$1.62/day, ~$590/year on just the system prompt&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The saving scales linearly with session length and request volume.&lt;/p&gt;
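&lt;p&gt;The arithmetic is worth checking directly; a few lines of Python reproduce the daily figures from the per-million-token prices:&lt;br&gt;
&lt;/p&gt;

```python
# Verify the prompt-caching saving for a 1,500-token system prompt,
# 100 sessions/day, 5 turns per session.
PROMPT_TOKENS = 1_500
SESSIONS = 100
TURNS = 5
WRITE = 3.00 / 1_000_000  # $/token for cache writes (same as normal input)
READ = 0.30 / 1_000_000   # $/token for cache reads

# Without caching: every turn pays the full input price for the system prompt.
uncached = SESSIONS * TURNS * PROMPT_TOKENS * WRITE

# With caching: turn 1 writes the cache, turns 2-5 read it.
cached = SESSIONS * PROMPT_TOKENS * (WRITE + (TURNS - 1) * READ)

print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day")
# uncached: $2.25/day, cached: $0.63/day
```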




&lt;h2&gt;
  
  
  Tool definitions
&lt;/h2&gt;

&lt;p&gt;Strands tools are just decorated Python functions. The function signature defines the input schema, and the docstring is sent to the model as the tool description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Look up the current status and details of a customer order by order ID.
    Use this when a customer asks about their order, delivery, or shipment.

    Args:
        order_id: The order ID (format: ORD-XXXXXX)

    Returns:
        Order details including status, items, and estimated delivery date.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Your implementation here
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_product_faq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Search the product FAQ and policy knowledge base for answers to customer questions.
&lt;/span&gt;&lt;span class="gp"&gt;    ...&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools are passed to the &lt;code&gt;Agent&lt;/code&gt; constructor. Strands handles the tool invocation loop — calling the tool when the model decides to use it, feeding the result back, and continuing the reasoning loop until the model produces a final response.&lt;/p&gt;
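&lt;p&gt;The "signature becomes schema, docstring becomes description" behaviour is easy to illustrate with the standard library (a sketch of the idea, not Strands' actual internals):&lt;br&gt;
&lt;/p&gt;

```python
import inspect

def lookup_order_status(order_id: str) -> str:
    """Look up the current status of a customer order by order ID."""
    ...

def tool_spec(fn):
    """Build a minimal tool description the way an agent framework might:
    name from the function, description from the docstring, parameters
    from the type-annotated signature."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": {
            name: param.annotation.__name__
            for name, param in sig.parameters.items()
        },
    }

spec = tool_spec(lookup_order_status)
# {'name': 'lookup_order_status',
#  'description': 'Look up the current status of a customer order by order ID.',
#  'parameters': {'order_id': 'str'}}
```

&lt;p&gt;This is why docstring quality matters so much for tools: it is literally the text the model reads when deciding whether to call the function.&lt;/p&gt;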




&lt;h2&gt;
  
  
  AgentCore Memory integration
&lt;/h2&gt;

&lt;p&gt;AgentCore Memory provides persistent storage across sessions without you building any of the storage infrastructure.&lt;/p&gt;

&lt;p&gt;The three strategy types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt; — stores facts and user profile information. Consolidates information like "user prefers email contact", "user is on premium plan".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary&lt;/strong&gt; — stores compressed session history. "On 2025-03-15 user reported late delivery of order ORD-001234. Issue was resolved."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UserPreference&lt;/strong&gt; — stores interaction style. "User prefers brief responses without extra detail."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The memory client is a standard boto3 client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;memory_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-agentcore-memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;read_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Store a conversation turn
&lt;/span&gt;&lt;span class="n"&gt;memory_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;memoryId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agentcore_memory_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actorId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;# Identifies the user (e.g., user ID from JWT)
&lt;/span&gt;    &lt;span class="n"&gt;sessionId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Identifies the conversation session
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ASSISTANT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve relevant memories before each invocation
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;memory_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_memory_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;memoryId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agentcore_memory_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actorId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;searchQuery&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Semantic search over stored memories
&lt;/span&gt;    &lt;span class="n"&gt;maxResults&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memoryRecords&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The retrieved memories are prepended to the user message as context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;- &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;memories&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;enriched_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Past context:]
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The streaming agent loop
&lt;/h2&gt;

&lt;p&gt;The agent produces a streaming response via the Strands &lt;code&gt;stream_async&lt;/code&gt; method. Each chunk is forwarded as an SSE event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_agent_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Retrieve memories
&lt;/span&gt;    &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;enriched_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prepend_memory_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Build agent with tools
&lt;/span&gt;    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cached_system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_product_faq&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Stream response
&lt;/span&gt;    &lt;span class="n"&gt;full_response_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enriched_message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;full_response_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# SSE format
&lt;/span&gt;
    &lt;span class="c1"&gt;# 4. Store turn in memory
&lt;/span&gt;    &lt;span class="nf"&gt;store_conversation_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;assistant_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response_parts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: [DONE]

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
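&lt;p&gt;The SSE framing itself is simple enough to verify in isolation: each event is a &lt;code&gt;data:&lt;/code&gt; line followed by a blank line (the helper names here are hypothetical):&lt;br&gt;
&lt;/p&gt;

```python
def sse_frame(text: str) -> str:
    """Wrap a text chunk in Server-Sent Events framing (single-line payloads)."""
    return f"data: {text}\n\n"

def sse_parse(stream: str) -> list:
    """Recover the payloads from a raw SSE stream of single-line events."""
    return [
        line[len("data: "):]
        for line in stream.split("\n")
        if line.startswith("data: ")
    ]

raw = sse_frame("Hello") + sse_frame("world") + sse_frame("[DONE]")
assert sse_parse(raw) == ["Hello", "world", "[DONE]"]
```

&lt;p&gt;Note this handles only single-line payloads; the SSE spec also allows multi-line &lt;code&gt;data:&lt;/code&gt; events, which is one reason chunks with embedded newlines need care before framing.&lt;/p&gt;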






&lt;h2&gt;
  
  
  The adaptive retry config
&lt;/h2&gt;

&lt;p&gt;Bedrock throttles hard when you exceed your model's throughput quotas (requests per minute and tokens per minute). Without retry logic, throttled requests fail immediately.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mode: "adaptive"&lt;/code&gt; uses a token bucket algorithm — it monitors the throttle rate and automatically backs off when it detects pressure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;botocore.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;boto_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;adaptive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;read_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Streaming responses can take 30-90s for complex tool chains
&lt;/span&gt;    &lt;span class="n"&gt;connect_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference between &lt;code&gt;"standard"&lt;/code&gt; and &lt;code&gt;"adaptive"&lt;/code&gt; retry modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;standard&lt;/code&gt;: fixed exponential backoff between retries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;adaptive&lt;/code&gt;: adjusts retry rate based on observed throttling, converges to a sustainable rate faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For agentic workloads that run multi-step tool chains — and thus make many Bedrock calls in sequence — &lt;code&gt;"adaptive"&lt;/code&gt; consistently outperforms &lt;code&gt;"standard"&lt;/code&gt;.&lt;/p&gt;
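&lt;p&gt;For intuition, &lt;code&gt;standard&lt;/code&gt; mode grows its sleeps as roughly capped exponential backoff (botocore draws a random jittered delay below each ceiling; this sketch ignores the jitter), while &lt;code&gt;adaptive&lt;/code&gt; layers a client-side rate limiter on top of the same retries:&lt;br&gt;
&lt;/p&gt;

```python
def backoff_ceilings(max_attempts: int, base: float = 1.0, cap: float = 20.0) -> list:
    """Approximate per-retry sleep ceilings for exponential backoff with a cap.
    botocore's standard mode sleeps a jittered amount below each ceiling."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_attempts - 1)]

print(backoff_ceilings(5))  # [1.0, 2.0, 4.0, 8.0] (four retries after the first attempt)
```

&lt;p&gt;Adaptive mode reuses this retry machinery, but also slows the rate of &lt;em&gt;new&lt;/em&gt; requests when it observes throttling errors, which is what lets a multi-call tool chain converge instead of repeatedly slamming into the quota.&lt;/p&gt;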




&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;agent.py&lt;/code&gt; file wires everything together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;primary_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;additional_request_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt-caching-2024-07-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;guardrail_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;guardrailIdentifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bedrock_guardrail_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cached_system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}]&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_agent_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;enriched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;prepend_memory_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;primary_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cached_system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lookup_order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;search_product_faq&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;full_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enriched&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nf"&gt;store_conversation_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_response&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: [DONE]

&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Part 4, we'll set up the local Docker dev environment so you can iterate on the agent code without deploying to AWS on every change.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;&lt;a href="https://dev.to/blog/part-4-local-dev-docker"&gt;Continue to Part 4: Running Locally with Docker&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-3-strands-agent-sdk" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 3 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>bedrock</category>
      <category>agentcore</category>
    </item>
    <item>
      <title>Part 6: Cost &amp; Performance for Bedrock AgentCore — Prompt Caching, Model Selection, and CloudWatch Alarms</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:34:23 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-6-cost-performance-for-bedrock-agentcore-prompt-caching-model-selection-and-cloudwatch-282k</link>
      <guid>https://forem.com/rajmurugan/part-6-cost-performance-for-bedrock-agentcore-prompt-caching-model-selection-and-cloudwatch-282k</guid>
      <description>&lt;p&gt;You've deployed the agent. It works. Now let's make sure it doesn't cost you a surprise at the end of the month.&lt;/p&gt;

&lt;p&gt;This is the part that most tutorials skip. Real production systems need cost visibility before incidents — not after. Here's everything I've done to keep costs predictable and to save money where it counts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The cost components
&lt;/h2&gt;

&lt;p&gt;An AgentCore deployment has several cost drivers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Pricing model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock model invocations&lt;/td&gt;
&lt;td&gt;Per token (input + output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Runtime&lt;/td&gt;
&lt;td&gt;Per container-hour (when active)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Memory&lt;/td&gt;
&lt;td&gt;Per memory operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR&lt;/td&gt;
&lt;td&gt;Per GB stored + data transfer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs&lt;/td&gt;
&lt;td&gt;Per GB ingested&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3 (if used)&lt;/td&gt;
&lt;td&gt;Per GB stored + requests (negligible here)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dominant cost is almost always &lt;strong&gt;Bedrock model invocations&lt;/strong&gt;. Everything else is small by comparison.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompt caching: the biggest lever
&lt;/h2&gt;

&lt;p&gt;If you haven't read Part 3 carefully, go back and re-read the prompt caching section. It's the highest-impact optimisation in the system.&lt;/p&gt;

&lt;p&gt;Quick recap: by marking your system prompt with &lt;code&gt;cache_control: ephemeral&lt;/code&gt;, Bedrock caches those tokens and charges the cache read price on subsequent calls.&lt;/p&gt;

&lt;p&gt;For Claude Sonnet 4.6:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular input: $3.00 / 1M tokens&lt;/li&gt;
&lt;li&gt;Cache write: $3.75 / 1M tokens (1.25x the regular input rate, paid only when the cache is written)&lt;/li&gt;
&lt;li&gt;Cache read: &lt;strong&gt;$0.30 / 1M tokens&lt;/strong&gt; (10x cheaper than regular input)&lt;/li&gt;
&lt;li&gt;Output tokens: $15.00 / 1M tokens (not cached)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a 1,500-token system prompt:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Cost per turn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Without caching&lt;/td&gt;
&lt;td&gt;$0.0045 (system prompt) + output tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;With caching (turns 2+)&lt;/td&gt;
&lt;td&gt;$0.00045 (system prompt) + output tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Saving per turn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.004&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That sounds small. Scale it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100 users × 10 conversations/day × 5 turns each = 5,000 turns/day&lt;/li&gt;
&lt;li&gt;4,000 of those turns are turns 2+ (caching applies)&lt;/li&gt;
&lt;li&gt;Saving: 4,000 × $0.004 = &lt;strong&gt;$16/day → $480/month on system prompt tokens alone&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The saving scales linearly with session depth and volume.&lt;/p&gt;
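The arithmetic above is easy to sanity-check. This back-of-envelope script reproduces the per-turn and monthly figures (token counts and volumes are the example numbers from the tables, not measurements):

```python
PROMPT_TOKENS = 1_500
INPUT_RATE = 3.00 / 1_000_000       # $/token, uncached input
CACHE_READ_RATE = 0.30 / 1_000_000  # $/token, cached read

uncached = PROMPT_TOKENS * INPUT_RATE     # $0.0045 per turn
cached = PROMPT_TOKENS * CACHE_READ_RATE  # $0.00045 per turn
saving_per_turn = uncached - cached       # ~$0.004 per turn

cached_turns_per_day = 4_000              # turns 2+ from the example above
daily = saving_per_turn * cached_turns_per_day
print(f"${daily:.2f}/day, ${daily * 30:,.0f}/month")  # → $16.20/day, $486/month
```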

&lt;p&gt;&lt;strong&gt;Enable prompt caching:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;primary_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-sonnet-4-6-20251001-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;additional_request_fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_beta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt-caching-2024-07-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cached_system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache_control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ephemeral&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Model selection strategy
&lt;/h2&gt;

&lt;p&gt;Not every task needs Claude Sonnet 4.6. Using the right model for each task type dramatically reduces costs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Recommended model&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main conversation&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Best reasoning, multi-turn, complex tool use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent classification&lt;/td&gt;
&lt;td&gt;Amazon Nova Pro&lt;/td&gt;
&lt;td&gt;Simple classification, ~15x cheaper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session summarisation&lt;/td&gt;
&lt;td&gt;Amazon Nova Pro&lt;/td&gt;
&lt;td&gt;Structured output, no complex reasoning needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAQ matching&lt;/td&gt;
&lt;td&gt;Amazon Nova Pro or embedding model&lt;/td&gt;
&lt;td&gt;Simple retrieval pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing dispute analysis&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;Complex reasoning required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Current pricing comparison (us-east-1):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input ($/1M)&lt;/th&gt;
&lt;th&gt;Output ($/1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.6&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Pro&lt;/td&gt;
&lt;td&gt;$0.80&lt;/td&gt;
&lt;td&gt;$3.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Nova Lite&lt;/td&gt;
&lt;td&gt;$0.06&lt;/td&gt;
&lt;td&gt;$0.24&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a classification task that returns 1-2 tokens and processes 500 input tokens:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Sonnet 4.6: $0.0015 per call&lt;/li&gt;
&lt;li&gt;Amazon Nova Pro: $0.0004 per call&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Saving: ~75% just by routing to the right model&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
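You can reproduce those per-call numbers directly from the pricing table (prices as listed above; token counts from the classification example):

```python
PRICES = {  # (input, output) in $ per 1M tokens, from the table above
    "claude-sonnet-4-6": (3.00, 15.00),
    "nova-pro": (0.80, 3.20),
    "nova-lite": (0.06, 0.24),
}

def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

sonnet = cost_per_call("claude-sonnet-4-6", 500, 2)  # $0.00153
nova = cost_per_call("nova-pro", 500, 2)             # $0.0004064
print(f"Routing saves {1 - nova / sonnet:.0%} per classification call")
```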

&lt;p&gt;In &lt;code&gt;agent.py&lt;/code&gt;, the Nova model is available alongside the primary model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nova_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.nova-pro-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;boto_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it when you need a cheap background task before or after the main conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  AgentCore lifecycle configuration
&lt;/h2&gt;

&lt;p&gt;AgentCore has two lifecycle settings that affect cost:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idle timeout&lt;/strong&gt; (&lt;code&gt;IdleTimeoutInSeconds&lt;/code&gt;): how long AgentCore waits before pausing a container instance after the last request. Set in the CDK stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;LifecycleConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;IdleTimeoutInSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// 15 minutes&lt;/span&gt;
  &lt;span class="nx"&gt;MaxSessionDurationInSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;28800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// 8 hours&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Lower idle timeout = containers paused sooner = lower cost for bursty workloads&lt;/li&gt;
&lt;li&gt;Higher idle timeout = containers stay warm longer = better latency for returning users&lt;/li&gt;
&lt;li&gt;The sweet spot depends on your session gap pattern. 15 minutes is a reasonable default.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Max session duration&lt;/strong&gt;: the hard limit per session. 8 hours is appropriate for a long-running assistant. For short transactional interactions, you could reduce this.&lt;/p&gt;




&lt;h2&gt;
  
  
  CloudFront PriceClass_100
&lt;/h2&gt;

&lt;p&gt;For the blog/portfolio site, &lt;code&gt;PriceClass.PRICE_CLASS_100&lt;/code&gt; restricts the distribution to North American and European edge locations. This cuts CloudFront cost by roughly half compared to the global price class.&lt;/p&gt;

&lt;p&gt;For a personal portfolio with mostly English-speaking traffic, the vast majority of users are in the US and Europe anyway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// infra/lib/hosting-stack.ts&lt;/span&gt;
&lt;span class="nx"&gt;priceClass&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cloudfront&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PriceClass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PRICE_CLASS_100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the AgentCore endpoint itself, there's no CloudFront in front — AgentCore is a regional service.&lt;/p&gt;




&lt;h2&gt;
  
  
  CloudWatch alarms: catch runaway costs before they hit your bill
&lt;/h2&gt;

&lt;p&gt;Two alarms are critical for an AgentCore deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alarm 1: OutputTokenCount spike
&lt;/h3&gt;

&lt;p&gt;An agentic loop that gets stuck (tool keeps failing, model keeps retrying) can generate thousands of output tokens per minute. This alarm fires when output tokens per 5 minutes exceed a threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OutputTokenAlarm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;alarmName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`customerServiceAgent-OutputTokenCount-dev`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS/Bedrock&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OutputTokenCount&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;dimensionsMap&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ModelId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic.claude-sonnet-4-6-20251001-v1:0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;statistic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Sum&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Tune to your expected usage&lt;/span&gt;
  &lt;span class="na"&gt;evaluationPeriods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;comparisonOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GREATER_THAN_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;treatMissingData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TreatMissingData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NOT_BREACHING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set the threshold to 2-3x your normal peak. Monitor for a week after launch to establish a baseline, then tune.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alarm 2: InvocationLatency P99
&lt;/h3&gt;

&lt;p&gt;High P99 latency indicates your agent is taking too long — possibly waiting on a tool timeout, or the model is iterating excessively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Alarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LatencyAlarm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Metric&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS/Bedrock&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;InvocationLatency&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;statistic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;p99&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// 30 seconds&lt;/span&gt;
  &lt;span class="na"&gt;evaluationPeriods&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;comparisonOperator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GREATER_THAN_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both alarms publish to the SNS topic (also in the CDK stack), which sends you an email. For production, replace email with a PagerDuty or Slack notification via SNS → Lambda → webhook.&lt;/p&gt;
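Here's what that SNS → Lambda → webhook hop can look like. This is a sketch, not the article's code: the handler name, the `SLACK_WEBHOOK_URL` environment variable, and the message format are assumptions. CloudWatch's SNS payload does carry `AlarmName`, `NewStateValue`, and `NewStateReason` fields.

```python
import json
import os
import urllib.request

def format_alarm(alarm: dict) -> str:
    # CloudWatch alarm notifications arrive as JSON in the SNS message body
    return (f":rotating_light: {alarm['AlarmName']} is {alarm['NewStateValue']}: "
            f"{alarm['NewStateReason']}")

def handler(event, context):
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])
        payload = json.dumps({"text": format_alarm(alarm)}).encode()
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],  # placeholder: set in Lambda env vars
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```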




&lt;h2&gt;
  
  
  Actual cost estimates
&lt;/h2&gt;

&lt;p&gt;For a moderately used customer service agent at ~500 conversations/day, 5 turns each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock (Claude Sonnet 4.6, with caching)&lt;/td&gt;
&lt;td&gt;$120-180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bedrock (Nova Pro for classification)&lt;/td&gt;
&lt;td&gt;$5-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Runtime&lt;/td&gt;
&lt;td&gt;$15-30 (depends on idle config)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Memory operations&lt;/td&gt;
&lt;td&gt;$5-10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECR storage&lt;/td&gt;
&lt;td&gt;$1-2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CloudWatch Logs&lt;/td&gt;
&lt;td&gt;$3-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$150-240/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Without prompt caching: add ~$60-80/month to the Bedrock line.&lt;/p&gt;

&lt;p&gt;Without the dual-model strategy (Claude Sonnet 4.6 for everything): add ~$20-30/month to the Bedrock line.&lt;/p&gt;

&lt;p&gt;These numbers will vary significantly based on your conversation length and output token counts. The alarms will tell you when something is outside the expected range.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick optimisation checklist
&lt;/h2&gt;

&lt;p&gt;Before going to production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Prompt caching enabled (&lt;code&gt;anthropic_beta: ["prompt-caching-2024-07-31"]&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] System prompt marked with &lt;code&gt;cache_control: ephemeral&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Nova Pro used for background tasks (not Claude for everything)&lt;/li&gt;
&lt;li&gt;[ ] Idle timeout set appropriately (900s is a good default)&lt;/li&gt;
&lt;li&gt;[ ] OutputTokenCount alarm configured and tested&lt;/li&gt;
&lt;li&gt;[ ] InvocationLatency alarm configured and tested&lt;/li&gt;
&lt;li&gt;[ ] SNS topic with email subscription (or PagerDuty) set up&lt;/li&gt;
&lt;li&gt;[ ] CloudFront PriceClass_100 set (blog site)&lt;/li&gt;
&lt;li&gt;[ ] Model invocation logging enabled (for debugging cost spikes)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping up the series
&lt;/h2&gt;

&lt;p&gt;Over 6 parts, we built a complete production AI agent on AWS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Part 1&lt;/strong&gt;: Why AgentCore — the Lambda limitations and what AgentCore solves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: CDK infrastructure — the full stack + 9 gotchas documented&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: The Python agent — Strands SDK, prompt caching, AgentCore Memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: Local dev loop — Docker, platform flags, .env pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: CI/CD — GitHub Actions OIDC, ECR dual-tag strategy, Runtime updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6&lt;/strong&gt; (this post): Cost and performance — prompt caching savings, model selection, alarms&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The full demo repo is at &lt;strong&gt;&lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter" rel="noopener noreferrer"&gt;github.com/rajmurugan01/bedrock-agentcore-starter&lt;/a&gt;&lt;/strong&gt;. Every pattern in this series maps to real code in that repo.&lt;/p&gt;

&lt;p&gt;If this series saved you some debugging time (or a surprise AWS bill), star the repo and share it. If I got something wrong or you've found a better pattern, open an issue — I'll update the posts.&lt;/p&gt;

&lt;p&gt;← &lt;strong&gt;&lt;a href="https://dev.to/blog/part-5-cicd-github-actions-oidc"&gt;Back to Part 5: CI/CD with GitHub Actions OIDC&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-6-cost-performance-prompt-caching" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 6 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>agentcore</category>
      <category>cost</category>
    </item>
    <item>
      <title>Part 4: Running Your AgentCore Agent Locally with Docker (The Right Way)</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:32:01 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-4-running-your-agentcore-agent-locally-with-docker-the-right-way-1cc8</link>
      <guid>https://forem.com/rajmurugan/part-4-running-your-agentcore-agent-locally-with-docker-the-right-way-1cc8</guid>
      <description>&lt;p&gt;You've written the agent code. Before pushing it to ECR and waiting for AgentCore to pull it, you want to run it locally and confirm it actually works.&lt;/p&gt;

&lt;p&gt;This part covers the local Docker dev loop — including a critical flag that's easy to miss, and omitting it silently produces the wrong image.&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;--platform linux/amd64&lt;/code&gt; requirement
&lt;/h2&gt;

&lt;p&gt;This is the most important thing in this entire post.&lt;/p&gt;

&lt;p&gt;Amazon Bedrock AgentCore runtime is &lt;strong&gt;x86_64 only&lt;/strong&gt;. If you build your Docker image without specifying the platform, Docker uses your host architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On an Intel Mac or Linux x86_64 machine: builds &lt;code&gt;linux/amd64&lt;/code&gt; ✅&lt;/li&gt;
&lt;li&gt;On an Apple Silicon Mac (M1/M2/M3/M4): builds &lt;code&gt;linux/arm64&lt;/code&gt; ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The arm64 image will work perfectly in your local Docker test because your Mac is arm64. But when AgentCore pulls &lt;code&gt;:latest&lt;/code&gt; and tries to run it on an x86_64 host, the container exits immediately with an exec format error — and AgentCore reports the Runtime as &lt;code&gt;FAILED&lt;/code&gt; with a cryptic message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always build with &lt;code&gt;--platform linux/amd64&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; customer-service-agent:local &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This forces Docker to produce an x86_64 image regardless of your host architecture. On an Apple Silicon Mac, Docker uses QEMU emulation to run the build — it's a bit slower but produces the correct output.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dockerfile
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Dockerfile&lt;/code&gt; is in &lt;code&gt;apps/customer-service-agent/&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; --platform=linux/amd64 python:3.12-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; customer_service_agent/ ./customer_service_agent/&lt;/span&gt;

&lt;span class="c"&gt;# AgentCore always calls port 8080 — this is not configurable&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8080&lt;/span&gt;

&lt;span class="c"&gt;# Health check — AgentCore probes GET /health before routing traffic&lt;/span&gt;
&lt;span class="k"&gt;HEALTHCHECK&lt;/span&gt;&lt;span class="s"&gt; --interval=10s --timeout=5s --start-period=30s \&lt;/span&gt;
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')" || exit 1

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["uvicorn", "customer_service_agent.main:app", "--host", "0.0.0.0", "--port", "8080"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to note:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;--platform=linux/amd64&lt;/code&gt; in the &lt;code&gt;FROM&lt;/code&gt; line ensures the base image is also x86_64&lt;/li&gt;
&lt;li&gt;Port 8080 is hardcoded — AgentCore doesn't let you configure this&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;.env.local&lt;/code&gt; pattern
&lt;/h2&gt;

&lt;p&gt;In production, AgentCore injects environment variables from the &lt;code&gt;EnvironmentVariables&lt;/code&gt; block you set in the CDK &lt;code&gt;CfnRuntime&lt;/code&gt; resource. Locally, we replicate this with a &lt;code&gt;.env.local&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Copy &lt;code&gt;.env.local.example&lt;/code&gt; to &lt;code&gt;.env.local&lt;/code&gt; and fill in the values from your CDK stack outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env.local&lt;/span&gt;
&lt;span class="nv"&gt;AGENTCORE_MEMORY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xxxxxxxxxxxxxxxxxxxx
&lt;span class="nv"&gt;BEDROCK_GUARDRAIL_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xxxxxxxxxxxxxxxx
&lt;span class="nv"&gt;BEDROCK_GUARDRAIL_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-east-1
&lt;span class="nv"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev
&lt;span class="nv"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;DEBUG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get the values from CDK outputs or SSM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ssm get-parameter &lt;span class="nt"&gt;--name&lt;/span&gt; /customerServiceAgent/dev/memory-id &lt;span class="nt"&gt;--query&lt;/span&gt; Parameter.Value &lt;span class="nt"&gt;--output&lt;/span&gt; text
aws ssm get-parameter &lt;span class="nt"&gt;--name&lt;/span&gt; /customerServiceAgent/dev/guardrail-id &lt;span class="nt"&gt;--query&lt;/span&gt; Parameter.Value &lt;span class="nt"&gt;--output&lt;/span&gt; text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
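&lt;p&gt;The file is plain &lt;code&gt;KEY=VALUE&lt;/code&gt; lines, and Docker's &lt;code&gt;--env-file&lt;/code&gt; reads it directly. If you also want to load it outside Docker (say, for a bare &lt;code&gt;uvicorn&lt;/code&gt; run), a minimal parser is enough — this sketch skips comments and blank lines; &lt;code&gt;python-dotenv&lt;/code&gt; does the same job more robustly:&lt;/p&gt;

```python
import os

def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file, skipping comments and blanks."""
    values: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

def apply_env(values: dict[str, str]) -> None:
    """Export parsed values into the process environment."""
    os.environ.update(values)
```

&lt;p&gt;Usage: &lt;code&gt;apply_env(load_env_file(".env.local"))&lt;/code&gt; before importing the agent module, so the settings are visible at import time just as they would be inside the container.&lt;/p&gt;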






&lt;h2&gt;
  
  
  Running the container locally
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;apps/customer-service-agent

&lt;span class="c"&gt;# Build for linux/amd64 (even on an Apple Silicon Mac)&lt;/span&gt;
docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; customer-service-agent:local &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Run with real AWS dev credentials&lt;/span&gt;
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env.local &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.aws:/root/.aws:ro"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  customer-service-agent:local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-v "$HOME/.aws:/root/.aws:ro"&lt;/code&gt; mounts your local AWS credentials into the container as read-only. This lets the agent call Bedrock and AgentCore Memory using your dev credentials, exactly as it would with the execution role in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't do this in production.&lt;/strong&gt; In production, the execution role is attached to the container by AgentCore. Mounting credentials is a local-only pattern.&lt;/p&gt;




&lt;h2&gt;
  
  
  Verifying the health check
&lt;/h2&gt;

&lt;p&gt;Once the container starts, verify the health endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8080/health
&lt;span class="c"&gt;# → {"status":"healthy","environment":"dev"}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AgentCore probes &lt;code&gt;GET /health&lt;/code&gt; before routing any traffic to a container instance. If the health check fails, AgentCore marks the instance as unhealthy and doesn't send requests to it.&lt;/p&gt;
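&lt;p&gt;You can replicate that probe in a smoke test: poll the endpoint until it reports healthy or a deadline passes. A standard-library sketch — the response shape matches the example above, and the timeout values are arbitrary:&lt;/p&gt;

```python
import json
import time
import urllib.error
import urllib.request

def is_healthy(body: bytes) -> bool:
    """True if a /health response body reports a healthy status."""
    try:
        return json.loads(body).get("status") == "healthy"
    except (ValueError, AttributeError):
        return False

def wait_for_health(url: str = "http://localhost:8080/health",
                    deadline_s: float = 30.0) -> bool:
    """Poll the health endpoint until healthy or the deadline expires."""
    stop = time.monotonic() + deadline_s
    while time.monotonic() < stop:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read()):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container not up yet — retry
        time.sleep(1)
    return False
```

&lt;p&gt;Run &lt;code&gt;wait_for_health()&lt;/code&gt; right after &lt;code&gt;docker run&lt;/code&gt; in a test script; if it returns &lt;code&gt;False&lt;/code&gt;, check &lt;code&gt;docker logs&lt;/code&gt; before blaming AgentCore.&lt;/p&gt;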




&lt;h2&gt;
  
  
  Testing with curl
&lt;/h2&gt;

&lt;p&gt;The agent responds to &lt;code&gt;POST /invoke&lt;/code&gt; with a Server-Sent Events stream. The &lt;code&gt;--no-buffer&lt;/code&gt; flag is important — without it, curl buffers the response and you don't see streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/invoke &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "messages": [
      {"role": "user", "content": "What is the status of order ORD-001234?"}
    ],
    "sessionId": "test-session-1",
    "actorId": "user-test-123"
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-buffer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see SSE events streaming back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: I'll look up order ORD-001234 for you.

data: **Order ORD-001234:**
data: - Status: In Transit
data: - Items: Wireless Headphones (x1), Phone Case (x2)
data: - Estimated delivery: April 5, 2025
data: - Tracking: UPS-9876543210

data: [DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
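&lt;p&gt;If you'd rather script the check than eyeball curl output, the stream is easy to parse: each event is a line starting with &lt;code&gt;data:&lt;/code&gt;, and &lt;code&gt;[DONE]&lt;/code&gt; terminates it. A minimal parser sketch (the sentinel matches the output above; a production client would use an SSE library):&lt;/p&gt;

```python
from typing import Iterable

def parse_sse(lines: Iterable[str]) -> list[str]:
    """Collect the payloads of `data:` events until the [DONE] sentinel."""
    chunks: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("data:"):
            continue  # blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunks.append(payload)
    return chunks
```

&lt;p&gt;Feed it the response line iterator from &lt;code&gt;urllib&lt;/code&gt; or &lt;code&gt;httpx&lt;/code&gt; and you get the streamed chunks back as a list, ready for assertions in an integration test.&lt;/p&gt;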






&lt;h2&gt;
  
  
  Common local dev errors
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;exec format error&lt;/code&gt;&lt;/strong&gt; — You built an arm64 image and are running it on an x86_64 host (or vice versa). Add &lt;code&gt;--platform linux/amd64&lt;/code&gt; to both &lt;code&gt;docker build&lt;/code&gt; and &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Connection refused on port 8080&lt;/code&gt;&lt;/strong&gt; — Container hasn't started yet or the health check is failing. Check &lt;code&gt;docker logs &amp;lt;container-id&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;NoCredentialsError&lt;/code&gt;&lt;/strong&gt; — The &lt;code&gt;.aws&lt;/code&gt; mount isn't working or the profile in &lt;code&gt;.env.local&lt;/code&gt; doesn't match a profile in &lt;code&gt;~/.aws/credentials&lt;/code&gt;. Try &lt;code&gt;AWS_PROFILE=default&lt;/code&gt;, or remove the profile setting and let boto3 fall back to its default credential provider chain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ResourceNotFoundException&lt;/code&gt; on memory client&lt;/strong&gt; — &lt;code&gt;AGENTCORE_MEMORY_ID&lt;/code&gt; is empty or wrong. Check the value against the SSM parameter. The memory module gracefully falls back (skips memory operations) if the ID is empty, so this shouldn't crash the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slow response on Apple Silicon&lt;/strong&gt; — You're running an x86_64 container under QEMU emulation. This is ~3-5x slower than native and is expected for local testing. The deployed version on AgentCore's x86_64 hosts will be much faster.&lt;/p&gt;
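&lt;p&gt;The graceful fallback mentioned for the memory module is worth having in any wrapper you write yourself. A sketch of the guard — the function signature and &lt;code&gt;memory_client&lt;/code&gt; interface here are illustrative stand-ins, not the repo's actual API:&lt;/p&gt;

```python
import logging
import os

logger = logging.getLogger(__name__)

def save_turn(memory_client, session_id: str, actor_id: str, text: str) -> bool:
    """Persist a conversation turn, skipping silently if memory is unconfigured.

    Returns True if the turn was written, False if memory was skipped.
    `memory_client` is an illustrative stand-in for the AgentCore memory client.
    """
    memory_id = os.environ.get("AGENTCORE_MEMORY_ID", "")
    if not memory_id:
        logger.debug("AGENTCORE_MEMORY_ID empty — skipping memory write")
        return False
    memory_client.create_event(memory_id=memory_id, session_id=session_id,
                               actor_id=actor_id, payload=text)
    return True
```

&lt;p&gt;The point of the guard: a missing memory ID degrades to a stateless agent instead of crashing the container and failing the health check.&lt;/p&gt;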




&lt;h2&gt;
  
  
  The local dev loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;1. Edit Python code
   ↓
2. docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; customer-service-agent:local &lt;span class="nb"&gt;.&lt;/span&gt;
   ↓
3. docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;--env-file&lt;/span&gt; .env.local &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.aws:/root/.aws:ro"&lt;/span&gt; customer-service-agent:local
   ↓
4. curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8080/invoke ... &lt;span class="nt"&gt;--no-buffer&lt;/span&gt;
   ↓
5. Iterate &lt;span class="k"&gt;until &lt;/span&gt;response is correct, &lt;span class="k"&gt;then &lt;/span&gt;push to ECR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Part 5, we automate steps 2-5 via GitHub Actions with OIDC — so every push to &lt;code&gt;main&lt;/code&gt; builds the image, pushes it to ECR, and updates the AgentCore Runtime.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;&lt;a href="https://dev.to/blog/part-5-cicd-github-actions-oidc"&gt;Continue to Part 5: CI/CD with GitHub Actions OIDC&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-4-local-dev-docker" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 4 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>aws</category>
      <category>bedrock</category>
      <category>agentcore</category>
    </item>
    <item>
      <title>Part 2: CDK Infrastructure for Amazon Bedrock AgentCore (And Every Gotcha You'll Hit)</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:31:59 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-2-cdk-infrastructure-for-amazon-bedrock-agentcore-and-every-gotcha-youll-hit-393h</link>
      <guid>https://forem.com/rajmurugan/part-2-cdk-infrastructure-for-amazon-bedrock-agentcore-and-every-gotcha-youll-hit-393h</guid>
      <description>&lt;p&gt;This is the post I wish had existed when I was debugging my first AgentCore CDK deploy at midnight.&lt;/p&gt;

&lt;p&gt;AgentCore is a relatively new service and CDK support is still catching up to the API. There are at least 9 specific traps that will silently fail, throw cryptic errors, or leave your CloudFormation stack in an unrecoverable state if you don't know about them.&lt;/p&gt;

&lt;p&gt;I'm going to walk through the complete CDK stack and call out every one of them.&lt;/p&gt;

&lt;p&gt;The full source is in &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter/blob/main/infra/lib/agentcore-stack.ts" rel="noopener noreferrer"&gt;infra/lib/agentcore-stack.ts&lt;/a&gt; in the demo repo.&lt;/p&gt;




&lt;h2&gt;
  
  
  The stack: what we're creating
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CustomerServiceAgentStack-dev
├── KMS Key (rotation enabled, RETAIN on delete)
├── CloudWatch LogGroup — /aws/bedrock-agentcore/runtimes/...
├── CloudWatch LogGroup — /aws/bedrock/model-invocations/...
├── IAM Role (InvocationLogging) — bedrock.amazonaws.com principal
├── CfnResource — AWS::Bedrock::ModelInvocationLoggingConfiguration ← Gotcha #3
├── SNS Topic — cost alarm notifications
├── CloudWatch Alarm — OutputTokenCount
├── CloudWatch Alarm — InvocationLatency
├── Bedrock CfnGuardrail + CfnGuardrailVersion
├── CfnResource — AWS::BedrockAgentCore::Memory (3 strategies)  ← Gotcha #8
├── ECR Repository (IMPORTED, not created)                       ← Gotcha #2
├── IAM Role (ExecutionRole) — bedrock-agentcore.amazonaws.com
├── Security Group (allowAllOutbound: false)                     ← Gotcha #6
├── CfnResource — AWS::BedrockAgentCore::AgentRuntime
└── SSM Parameters (7 parameters for all ARNs/IDs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's go through each section and the gotchas that come with it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;infra/
├── bin/app.ts
├── lib/agentcore-stack.ts
├── test/agentcore-stack.test.ts
├── jest.config.js
├── package.json
├── tsconfig.json
└── cdk.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;package.json&lt;/code&gt; needs &lt;code&gt;aws-cdk-lib&lt;/code&gt; and &lt;code&gt;constructs&lt;/code&gt; as dependencies, plus &lt;code&gt;jest&lt;/code&gt; + &lt;code&gt;ts-jest&lt;/code&gt; as devDependencies for the unit tests.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #1: AgentCore naming — NO HYPHENS
&lt;/h2&gt;

&lt;p&gt;This one is not in the documentation in any obvious place. AgentCore Runtime names, Memory names, and Memory strategy names must match this regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;^[a-zA-Z][a-zA-Z0-9_]{0,47}$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hyphens fail at deploy time with a cryptic CloudFormation validation error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value 'customer-service-agent-dev' at 'agentRuntimeName' failed to satisfy
constraint: Member must satisfy regular expression pattern: [a-zA-Z][a-zA-Z0-9_]{0,47}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix is simple — use camelCase or underscores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ This will fail at deploy time&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtimeName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`customer-service-agent-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ This works&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;runtimeName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`customerServiceAgent_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;memoryName&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`customerServiceAgentMemory_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This applies to &lt;strong&gt;every&lt;/strong&gt; naming field in AgentCore: Runtime names, Memory names, and the names of individual Memory strategies.&lt;/p&gt;
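&lt;p&gt;Because the error only surfaces at deploy time, a pre-synth check pays for itself. A small validator using the regex above (the helper name is mine):&lt;/p&gt;

```python
import re

# The pattern AgentCore applies to Runtime, Memory, and strategy names.
AGENTCORE_NAME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]{0,47}$")

def validate_agentcore_name(name: str) -> None:
    """Raise ValueError if `name` would be rejected at deploy time."""
    if not AGENTCORE_NAME_RE.match(name):
        raise ValueError(
            f"{name!r} is not a valid AgentCore name: must start with a letter "
            "and contain only letters, digits, and underscores (48 chars max)"
        )
```

&lt;p&gt;Run it over every Runtime, Memory, and strategy name in a unit test and the hyphen mistake never reaches CloudFormation.&lt;/p&gt;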




&lt;h2&gt;
  
  
  Gotcha #2: ECR chicken-and-egg
&lt;/h2&gt;

&lt;p&gt;This one will waste a full deploy cycle if you don't know about it.&lt;/p&gt;

&lt;p&gt;AgentCore Runtime requires a &lt;strong&gt;valid container image at &lt;code&gt;:latest&lt;/code&gt;&lt;/strong&gt; when the &lt;code&gt;CfnRuntime&lt;/code&gt; resource is created. The timing problem: CDK creates both the ECR repo and the Runtime in the same deploy → the Runtime creation fails because the ECR repo is empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt; is a two-step process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Run the bootstrap script once before the first &lt;code&gt;cdk deploy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# infra/scripts/bootstrap-ecr.sh&lt;/span&gt;

aws ecr create-repository &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repository-name&lt;/span&gt; &lt;span class="s2"&gt;"customerserviceagent/runtime-dev"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="c"&gt;# Push a placeholder (any valid linux/amd64 image)&lt;/span&gt;
docker pull &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2
docker tag public.ecr.aws/amazonlinux/amazonlinux:2 &lt;span class="se"&gt;\&lt;/span&gt;
  &amp;lt;account&amp;gt;.dkr.ecr.us-east-1.amazonaws.com/customerserviceagent/runtime-dev:latest
docker push &amp;lt;account&amp;gt;.dkr.ecr.us-east-1.amazonaws.com/customerserviceagent/runtime-dev:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: In CDK, &lt;strong&gt;import&lt;/strong&gt; the repo — never create it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ✅ Import (repo already exists from bootstrap script)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agentRepo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromRepositoryName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AgentRepo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ecrRepoName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// ❌ Don't do this — CDK would create it empty and the Runtime deploy would fail&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agentRepo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AgentRepo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
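&lt;p&gt;A pre-deploy guard can make the bootstrap requirement explicit: check that &lt;code&gt;:latest&lt;/code&gt; exists before running &lt;code&gt;cdk deploy&lt;/code&gt;. The decision logic is pure and easy to test; the boto3 call around it uses the real ECR &lt;code&gt;describe_images&lt;/code&gt; API, but wiring it into your deploy script is an exercise for the reader:&lt;/p&gt;

```python
def latest_tag_present(describe_images_response: dict) -> bool:
    """True if any image in an ECR describe_images response carries :latest."""
    return any(
        "latest" in image.get("imageTags", [])
        for image in describe_images_response.get("imageDetails", [])
    )

def bootstrap_done(repo_name: str, region: str = "us-east-1") -> bool:
    """Query ECR and report whether the bootstrap image has been pushed."""
    import boto3
    from botocore.exceptions import ClientError
    ecr = boto3.client("ecr", region_name=region)
    try:
        resp = ecr.describe_images(repositoryName=repo_name)
    except ClientError:
        return False  # repo missing entirely — bootstrap script not run
    return latest_tag_present(resp)
```

&lt;p&gt;Fail the deploy script early when &lt;code&gt;bootstrap_done(...)&lt;/code&gt; is &lt;code&gt;False&lt;/code&gt; and you never burn a CloudFormation cycle on an empty repo.&lt;/p&gt;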






&lt;h2&gt;
  
  
  Gotcha #3: No CDK L1 for &lt;code&gt;ModelInvocationLoggingConfiguration&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;If you try to enable Bedrock model invocation logging, you'll find that &lt;code&gt;aws-cdk-lib&lt;/code&gt; (up to 2.245.0) has no L1 construct for &lt;code&gt;AWS::Bedrock::ModelInvocationLoggingConfiguration&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Searching the CDK docs or autocomplete for &lt;code&gt;CfnModelInvocationLoggingConfiguration&lt;/code&gt; returns nothing. You must use raw &lt;code&gt;CfnResource&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ This doesn't exist in aws-cdk-lib&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CfnModelInvocationLoggingConfiguration&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-bedrock&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Use raw CfnResource with the CloudFormation type string&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CfnResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ModelInvocationLogging&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::Bedrock::ModelInvocationLoggingConfiguration&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;LoggingConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;CloudWatchConfig&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;LogGroupName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;invocationLogGroup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logGroupName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;RoleArn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;invocationLoggingRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;roleArn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;TextDataDeliveryEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;ImageDataDeliveryEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;EmbeddingDataDeliveryEnabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role for this must have &lt;code&gt;logs:CreateLogStream&lt;/code&gt; and &lt;code&gt;logs:PutLogEvents&lt;/code&gt; on the log group ARN, and it must be assumed by &lt;code&gt;bedrock.amazonaws.com&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #4: VPC endpoints — don't recreate existing ones
&lt;/h2&gt;

&lt;p&gt;AgentCore runs inside a VPC. It needs to reach Bedrock, ECR, and SSM without going through the public internet (for both performance and security).&lt;/p&gt;

&lt;p&gt;The trap: if your VPC was provisioned by Terraform or another CDK stack, it may already have interface endpoints for Bedrock, ECR, and S3. Creating duplicate interface endpoints with the same private DNS name fails with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private-dns-enabled cannot be set because there is already a conflicting
DNS domain for bedrock-runtime.us-east-1.amazonaws.com in this VPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For managed VPCs&lt;/strong&gt;: Use &lt;code&gt;Vpc.fromLookup()&lt;/code&gt; and &lt;strong&gt;skip creating VPC endpoints&lt;/strong&gt;. Assume they already exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the demo (default VPC)&lt;/strong&gt;: No pre-existing endpoints, so we create the minimum needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromLookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DefaultVpc&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;isDefault&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Only add endpoints if they don't already exist in your VPC&lt;/span&gt;
&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addInterfaceEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;BedrockRuntimeEndpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;InterfaceVpcEndpointAwsService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BEDROCK_RUNTIME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;securityGroups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;agentSg&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addGatewayEndpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;S3Endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GatewayVpcEndpointAwsService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;S3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Gotcha #5: KMS + CloudWatch LogGroup key policy
&lt;/h2&gt;

&lt;p&gt;If you want to encrypt a CloudWatch LogGroup with a customer-managed KMS key, the key policy &lt;strong&gt;must&lt;/strong&gt; explicitly grant &lt;code&gt;logs.amazonaws.com&lt;/code&gt; permission to use the key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The key policy must include this:&lt;/span&gt;
&lt;span class="nx"&gt;kmsKey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addToResourcePolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PolicyStatement&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;principals&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ServicePrincipal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`logs.&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;region&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.amazonaws.com`&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
  &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kms:Encrypt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kms:Decrypt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kms:GenerateDataKey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;kms:DescribeKey&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;*&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without this, CloudWatch Logs cannot use the key — in practice the log group creation fails with a KMS access denied error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For most use cases, skip the CMK entirely.&lt;/strong&gt; CloudWatch uses AWS-managed encryption by default. The only reason to add a CMK is if you have a compliance requirement that mandates customer-controlled key rotation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #6: Security Group egress is inline, not a separate resource
&lt;/h2&gt;

&lt;p&gt;This one catches you in CDK unit tests, not in the actual deployment.&lt;/p&gt;

&lt;p&gt;When using &lt;code&gt;allowAllOutbound: false&lt;/code&gt; and calling &lt;code&gt;addEgressRule(Peer.ipv4(cidr), Port.tcp(443))&lt;/code&gt;, CDK embeds the egress rule &lt;strong&gt;inside&lt;/strong&gt; the &lt;code&gt;SecurityGroup&lt;/code&gt; resource's &lt;code&gt;SecurityGroupEgress&lt;/code&gt; array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agentSg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SecurityGroup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AgentSg&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;allowAllOutbound&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Disables default egress-all rule&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;agentSg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEgressRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Peer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ipv4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpcCidrBlock&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;ec2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Port&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tcp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;443&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HTTPS to VPC CIDR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is &lt;strong&gt;no separate &lt;code&gt;AWS::EC2::SecurityGroupEgress&lt;/code&gt; resource&lt;/strong&gt; created for this. In CDK assertions tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ This will find 0 resources — it doesn't exist as a separate resource&lt;/span&gt;
&lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasResource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::EC2::SecurityGroupEgress&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{});&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Check the inline array inside the SecurityGroup resource&lt;/span&gt;
&lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasResourceProperties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AWS::EC2::SecurityGroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;SecurityGroupEgress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayWith&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectLike&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;IpProtocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tcp&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;FromPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;ToPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;]),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: a &lt;strong&gt;separate&lt;/strong&gt; &lt;code&gt;AWS::EC2::SecurityGroupEgress&lt;/code&gt; resource IS created when you reference another security group as the peer (cross-SG rules). The inline embedding described above applies only to IP/CIDR-based rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #7: &lt;code&gt;update-agent-runtime&lt;/code&gt; requires &lt;code&gt;--role-arn&lt;/code&gt; and &lt;code&gt;--network-configuration&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;After deployment, when you push a new Docker image and want AgentCore to pick it up, you call &lt;code&gt;update-agent-runtime&lt;/code&gt;. Both &lt;code&gt;--role-arn&lt;/code&gt; and &lt;code&gt;--network-configuration&lt;/code&gt; are now &lt;strong&gt;mandatory&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws bedrock-agentcore-control update-agent-runtime &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-runtime-id&lt;/span&gt; &amp;lt;runtime-id&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent-runtime-artifact&lt;/span&gt; &lt;span class="s1"&gt;'{"containerConfiguration":{"containerUri":"&amp;lt;ecr&amp;gt;:latest"}}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-arn&lt;/span&gt; arn:aws:iam::&amp;lt;account&amp;gt;:role/customerServiceAgentExecutionRole-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network-configuration&lt;/span&gt; &lt;span class="s1"&gt;'{
    "networkMode": "VPC",
    "networkModeConfig": {
      "securityGroups": ["sg-xxx"],
      "subnets": ["subnet-aaa", "subnet-bbb"]
    }
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Omitting either &lt;code&gt;--role-arn&lt;/code&gt; or &lt;code&gt;--network-configuration&lt;/code&gt; gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ValidationException: Missing required field: roleArn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--role-arn&lt;/code&gt; here is the &lt;strong&gt;execution role&lt;/strong&gt; (the role the Runtime assumes to pull from ECR and call Bedrock) — not the deploy role your CLI is using.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotcha #8: Memory stuck in &lt;code&gt;CREATING&lt;/code&gt; during rollback
&lt;/h2&gt;

&lt;p&gt;If your CDK deploy fails &lt;strong&gt;after&lt;/strong&gt; the AgentCore Memory resource starts creating, the CloudFormation rollback will also fail. The Memory resource is in &lt;code&gt;CREATING&lt;/code&gt; state and CloudFormation can't delete it.&lt;/p&gt;

&lt;p&gt;You'll see this error in the CloudFormation events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE_FAILED: Cannot delete resource while it is in CREATING state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Recovery steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Find the stuck memory&lt;/span&gt;
aws bedrock-agentcore-control list-memories &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="c"&gt;# 2. Wait for it to finish creating (usually a few minutes), then delete&lt;/span&gt;
aws bedrock-agentcore-control delete-memory &lt;span class="nt"&gt;--memory-id&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="c"&gt;# 3. Delete the stuck CloudFormation stack&lt;/span&gt;
aws cloudformation delete-stack &lt;span class="nt"&gt;--stack-name&lt;/span&gt; CustomerServiceAgentStack-dev &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1

&lt;span class="c"&gt;# 4. Retry&lt;/span&gt;
cdk deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Gotcha #9: &lt;code&gt;arrayWith&lt;/code&gt; in CDK assertions is order-sensitive
&lt;/h2&gt;

&lt;p&gt;This one only matters if you write CDK unit tests, but it will confuse you when it does.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Match.arrayWith([patternA, patternB])&lt;/code&gt; requires the elements to appear in the &lt;strong&gt;same order&lt;/strong&gt; as in the synthesised template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The template has filtersConfig: [PROMPT_ATTACK, HATE, INSULTS, SEXUAL, VIOLENCE]&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ Works — PROMPT_ATTACK before HATE&lt;/span&gt;
&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayWith&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectLike&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PROMPT_ATTACK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectLike&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HATE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;// ❌ Fails — even though both are present, order doesn't match the template&lt;/span&gt;
&lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arrayWith&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectLike&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HATE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;Match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;objectLike&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PROMPT_ATTACK&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix: write your &lt;code&gt;Match.arrayWith&lt;/code&gt; patterns in the same order as the properties appear in your CDK code.&lt;/p&gt;
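&lt;p&gt;Conceptually, &lt;code&gt;Match.arrayWith&lt;/code&gt; behaves like a subsequence check: each pattern must be found somewhere after the previous one. A standalone sketch of that semantics (plain TypeScript for illustration — the real matcher compares nested patterns, not string equality):&lt;/p&gt;

```typescript
// Subsequence check: do the wanted items appear in `actual` in the
// same relative order? (Conceptual model only; CDK's real matcher
// compares nested patterns rather than strings.)
function matchesInOrder(actual: string[], wanted: string[]): boolean {
  let next = 0;
  for (const item of actual) {
    if (next === wanted.length) break;
    if (item === wanted[next]) next += 1;
  }
  return next === wanted.length;
}

const filters = ['PROMPT_ATTACK', 'HATE', 'INSULTS', 'SEXUAL', 'VIOLENCE'];

console.log(matchesInOrder(filters, ['PROMPT_ATTACK', 'HATE'])); // true
console.log(matchesInOrder(filters, ['HATE', 'PROMPT_ATTACK'])); // false
```

&lt;p&gt;If a test genuinely shouldn't care about order, assert each filter in its own &lt;code&gt;Match.arrayWith&lt;/code&gt; call instead of listing several patterns in one.&lt;/p&gt;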




&lt;h2&gt;
  
  
  The CDK unit test strategy
&lt;/h2&gt;

&lt;p&gt;With all these gotchas, testing matters. Here's the approach from &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter/blob/main/infra/test/agentcore-stack.test.ts" rel="noopener noreferrer"&gt;infra/test/agentcore-stack.test.ts&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot test&lt;/strong&gt; — the primary safety net. Any change to the synthesised template fails CI until explicitly updated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Naming regex test&lt;/strong&gt; — verify Runtime and Memory names match the no-hyphens regex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM trust test&lt;/strong&gt; — verify &lt;code&gt;bedrock-agentcore.amazonaws.com&lt;/code&gt; is the principal on the execution role.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Group inline egress test&lt;/strong&gt; — verify the pattern from Gotcha #6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSM parameter tests&lt;/strong&gt; — verify all 7 parameters exist with the correct paths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol test&lt;/strong&gt; — verify &lt;code&gt;ServerProtocol: 'HTTP'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail filter order test&lt;/strong&gt; — verify filters appear in the correct order (Gotcha #9).&lt;/li&gt;
&lt;/ol&gt;
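&lt;p&gt;As a standalone sketch, the naming check from item 2 boils down to a regex gate. The exact pattern below (a letter, then up to 47 letters, digits, or underscores) is an assumption — confirm it against the AgentCore documentation and the repo's tests:&lt;/p&gt;

```typescript
// Assumed pattern: a letter first, then up to 47 letters, digits,
// or underscores. Hyphens are rejected. Verify the exact regex and
// length limit against the AgentCore docs before relying on this.
const AGENTCORE_NAME = /^[a-zA-Z][a-zA-Z0-9_]{0,47}$/;

function isValidAgentCoreName(name: string): boolean {
  return AGENTCORE_NAME.test(name);
}

console.log(isValidAgentCoreName('customerServiceAgent_dev')); // true
console.log(isValidAgentCoreName('customer-service-agent'));   // false
```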

&lt;p&gt;Run tests with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infra &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For deterministic synthesis without AWS credentials, commit &lt;code&gt;cdk.context.json&lt;/code&gt; (the VPC lookup cache). Without it, CDK would try to call the AWS API during &lt;code&gt;cdk synth&lt;/code&gt;, breaking CI.&lt;/p&gt;
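&lt;p&gt;The committed file is just a JSON cache of lookup results. An illustrative, trimmed shape with placeholder IDs (the exact key format varies by CDK version — generate the real file by running &lt;code&gt;cdk synth&lt;/code&gt; once with credentials):&lt;/p&gt;

```json
{
  "vpc-provider:account=123456789012:filter.vpc-id=vpc-0abc123:region=us-east-1:returnAsymmetricSubnets=true": {
    "vpcId": "vpc-0abc123",
    "vpcCidrBlock": "10.0.0.0/16",
    "availabilityZones": [],
    "subnetGroups": [
      {
        "name": "Private",
        "type": "Private",
        "subnets": [
          {
            "subnetId": "subnet-aaa",
            "availabilityZone": "us-east-1a",
            "routeTableId": "rtb-111"
          }
        ]
      }
    ]
  }
}
```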




&lt;h2&gt;
  
  
  Deploying
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;infra
npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# First time only&lt;/span&gt;
./scripts/bootstrap-ecr.sh    &lt;span class="c"&gt;# Gotcha #2 — must run before cdk deploy&lt;/span&gt;

&lt;span class="c"&gt;# Deploy&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CDK_DEFAULT_ACCOUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;your-account-id&amp;gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dev
cdk deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deploy takes 5-10 minutes. The Memory resource is the slowest to provision (~3-4 minutes).&lt;/p&gt;

&lt;p&gt;In Part 3, we write the Python agent that runs inside the container AgentCore manages.&lt;/p&gt;

&lt;p&gt;→ &lt;strong&gt;&lt;a href="https://dev.to/blog/part-3-strands-agent-sdk"&gt;Continue to Part 3: Building the Agent with Strands SDK&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-2-cdk-infrastructure-bedrock-agentcore" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 2 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>typescript</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Part 1: Why I Chose Amazon Bedrock AgentCore (And What Lambda Gets Wrong for AI Agents)</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Mon, 30 Mar 2026 21:30:54 +0000</pubDate>
      <link>https://forem.com/rajmurugan/part-1-why-i-chose-amazon-bedrock-agentcore-and-what-lambda-gets-wrong-for-ai-agents-jm3</link>
      <guid>https://forem.com/rajmurugan/part-1-why-i-chose-amazon-bedrock-agentcore-and-what-lambda-gets-wrong-for-ai-agents-jm3</guid>
      <description>&lt;p&gt;I built a production AI agent on AWS. Not a demo, not a proof of concept — a real system with persistent memory, guardrails, CI/CD pipelines, and users who depend on it not going down at 2am.&lt;/p&gt;

&lt;p&gt;The thing nobody tells you: &lt;strong&gt;the hard part isn't the AI&lt;/strong&gt;. The hard part is the infrastructure around it.&lt;/p&gt;

&lt;p&gt;This series is my attempt to document everything I had to figure out the hard way — from architecture decisions in Part 1 all the way to cost optimisation in Part 6. The companion demo repo is at &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter" rel="noopener noreferrer"&gt;github.com/rajmurugan01/bedrock-agentcore-starter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's start at the beginning: why Amazon Bedrock AgentCore, and why not the "obvious" serverless approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  The obvious approach: Lambda + Bedrock
&lt;/h2&gt;

&lt;p&gt;If you've shipped anything serverless on AWS, your first instinct is Lambda. You know it, it has great tooling, CDK support is mature, and it scales to zero.&lt;/p&gt;

&lt;p&gt;For a simple Bedrock wrapper — get a message, call &lt;code&gt;InvokeModel&lt;/code&gt;, return a response — Lambda is fine. But the moment you add &lt;strong&gt;conversational state&lt;/strong&gt;, it starts to crack.&lt;/p&gt;

&lt;p&gt;Here's what a real conversational AI agent needs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Session state&lt;/strong&gt; — the agent needs to remember what happened earlier in the conversation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-running processing&lt;/strong&gt; — LLMs can take 30-90 seconds for complex multi-tool chains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory across sessions&lt;/strong&gt; — the agent should know who the user is from previous conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses&lt;/strong&gt; — users expect tokens to appear progressively, not wait 60 seconds for a blob&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at how Lambda handles each of these.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Lambda's 15-minute timeout
&lt;/h3&gt;

&lt;p&gt;Lambda has a hard maximum execution timeout of 15 minutes. For a simple Q&amp;amp;A, that's fine. But for an agentic loop — where the model calls tools, processes results, calls more tools, and reasons over everything — you can easily hit 5-10 minutes per complex interaction.&lt;/p&gt;
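&lt;p&gt;To make the constraint concrete, here's a purely illustrative loop that watches its own time budget — &lt;code&gt;nextStep&lt;/code&gt; is a hypothetical stand-in for one model call plus tool execution:&lt;/p&gt;

```typescript
// Illustrative only: an agentic loop that tracks its own clock.
// `nextStep` stands in for one model call plus tool execution.
function runAgentLoop(nextStep: () => boolean, budgetMs: number): string {
  const deadline = Date.now() + budgetMs;
  while (true) {
    if (Date.now() >= deadline) {
      // On Lambda there is no code path here: the 15-minute hard cap
      // simply kills the invocation with no graceful handoff.
      return 'timed out';
    }
    if (nextStep()) return 'completed';
  }
}

// A chain of five tool calls finishes well inside a one-second budget:
let calls = 0;
console.log(runAgentLoop(() => { calls += 1; return calls >= 5; }, 1000)); // completed
```

&lt;p&gt;On Lambda you don't even get to write the &lt;code&gt;timed out&lt;/code&gt; branch — the platform enforces the deadline for you, mid-work.&lt;/p&gt;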

&lt;p&gt;And I haven't even mentioned the user's session. If a user comes back after 20 minutes and continues the conversation, that's a new Lambda invocation with zero context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: Session state storage
&lt;/h3&gt;

&lt;p&gt;Lambda is stateless by design. Every invocation is independent. For conversational state, you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Store session state somewhere (DynamoDB, ElastiCache, S3)&lt;/li&gt;
&lt;li&gt;Load it at the start of every Lambda invocation&lt;/li&gt;
&lt;li&gt;Save it at the end of every invocation&lt;/li&gt;
&lt;li&gt;Handle the edge case where the Lambda times out mid-session&lt;/li&gt;
&lt;li&gt;Build a session expiry and cleanup mechanism&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's a lot of undifferentiated infrastructure for a problem that isn't your core business.&lt;/p&gt;
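&lt;p&gt;A minimal sketch of that load/save choreography, with an in-memory map standing in for DynamoDB (all names here are illustrative):&lt;/p&gt;

```typescript
// In-memory map standing in for DynamoDB; all names are illustrative.
type Session = { history: string[]; expiresAt: number };

const store: { [id: string]: Session } = {};
const TTL_MS = 30 * 60 * 1000; // step 5: session expiry

function handleMessage(sessionId: string, message: string): number {
  // Step 2: load state at the start of every invocation (honouring expiry).
  let session = store[sessionId];
  if (session === undefined || Date.now() >= session.expiresAt) {
    session = { history: [], expiresAt: 0 };
  }
  session.history.push(message);
  // ... call Bedrock with session.history as context ...
  // Step 3: save at the end. A timeout between load and save (step 4)
  // silently drops this turn.
  session.expiresAt = Date.now() + TTL_MS;
  store[sessionId] = session;
  return session.history.length; // turns remembered so far
}

console.log(handleMessage('u1', 'hi'));                // 1
console.log(handleMessage('u1', 'where is my order')); // 2
```

&lt;p&gt;Every piece of this — the TTL, the save-after-timeout gap, the cleanup job — becomes code you own and operate.&lt;/p&gt;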

&lt;h3&gt;
  
  
  Problem 3: Cross-session memory
&lt;/h3&gt;

&lt;p&gt;Beyond session state, real assistants need &lt;strong&gt;memory&lt;/strong&gt; — the ability to remember that a user's preferred contact method is email, that they're a premium customer, that they had a billing dispute last month.&lt;/p&gt;

&lt;p&gt;With Lambda, you'd need to build this yourself: a vector database for semantic recall, a summarisation pipeline to consolidate old sessions, a retrieval step before each invocation. Entirely custom, entirely your problem to maintain.&lt;/p&gt;
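&lt;p&gt;A toy version of that retrieval step — scoring stored memories against the new message by word overlap, where a real system would use vector embeddings and a proper similarity metric:&lt;/p&gt;

```typescript
// Toy recall: score stored memories against the new message by word
// overlap. A real system would use vector embeddings, but the shape
// is the same: rank memories, prepend the best ones to the prompt.
function overlapScore(memory: string, query: string): number {
  const memoryWords = new Set(memory.toLowerCase().split(/\s+/));
  let hits = 0;
  for (const w of query.toLowerCase().split(/\s+/)) {
    if (memoryWords.has(w)) hits += 1;
  }
  return hits;
}

function recall(memories: string[], query: string): string {
  let best = memories[0];
  let bestScore = -1;
  for (const m of memories) {
    const score = overlapScore(m, query);
    if (score > bestScore) { best = m; bestScore = score; }
  }
  return best;
}

const memories = [
  'prefers contact by email',
  'premium customer since 2023',
  'billing dispute resolved last month',
];
console.log(recall(memories, 'question about my billing')); // billing dispute...
```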




&lt;h2&gt;
  
  
  What AgentCore actually does
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock AgentCore is AWS's managed infrastructure for running AI agents. It's designed specifically for the workload pattern that Lambda handles poorly.&lt;/p&gt;

&lt;p&gt;Here's the mental model: AgentCore is &lt;strong&gt;a managed container orchestrator for long-running, stateful AI agent sessions&lt;/strong&gt;. You ship a Docker container with your agent code. AgentCore handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Container lifecycle&lt;/strong&gt; — starts, stops, scales, and restarts containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session routing&lt;/strong&gt; — routes each user session to the right container instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory persistence&lt;/strong&gt; — built-in Semantic, Summary, and UserPreference memory strategies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JWT validation&lt;/strong&gt; — validates Cognito (or custom) JWTs before your code even runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC networking&lt;/strong&gt; — runs your containers inside your VPC without cold start penalties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE streaming&lt;/strong&gt; — handles the HTTP connection and SSE protocol for you&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architectural difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Lambda approach:
  User message → API Gateway → Lambda (cold start?) → load session from DynamoDB →
  call Bedrock → save session to DynamoDB → return response → Lambda exits

AgentCore approach:
  User message → AgentCore Runtime (JWT validated) → your container (already warm) →
  call Bedrock → response streams back → container stays warm for next message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The architecture we're building
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│  GitHub Actions (OIDC)                                         │
│  ├── Build Docker (linux/amd64)                                │
│  ├── Push to ECR (:latest + :&amp;lt;sha&amp;gt;)                           │
│  └── update-agent-runtime CLI                                  │
└──────────────────────────────┬─────────────────────────────────┘
                               │
                    CDK v2 TypeScript deploys:
                               │
┌──────────────────────────────▼─────────────────────────────────┐
│  AWS Infrastructure (us-east-1)                                │
│                                                                │
│  AgentCore Runtime                                             │
│  ├── Cognito JWT authoriser                                    │
│  ├── AG-UI HTTP protocol (SSE streaming)                      │
│  └── Container: Python agent on port 8080                     │
│                                                                │
│  AgentCore Memory (3 strategies)                               │
│  Bedrock Guardrail (prompt injection + PII)                   │
│  CloudWatch alarms (token count + latency)                    │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Primary model&lt;/strong&gt;: Claude Sonnet 4.6 with prompt caching&lt;br&gt;
&lt;strong&gt;Background model&lt;/strong&gt;: Amazon Nova Pro (cheap classification/summarisation)&lt;br&gt;
&lt;strong&gt;CI/CD&lt;/strong&gt;: GitHub Actions OIDC — no stored AWS credentials&lt;/p&gt;




&lt;h2&gt;
  
  
  Series roadmap
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Part&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Part 1&lt;/strong&gt; (this post)&lt;/td&gt;
&lt;td&gt;Architecture &amp;amp; why AgentCore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full CDK stack + 9 deployment gotchas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python agent with Strands SDK + prompt caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker local dev loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Actions OIDC + ECR + Runtime updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Part 6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cost breakdown + alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Full demo repo: &lt;a href="https://github.com/rajmurugan01/bedrock-agentcore-starter" rel="noopener noreferrer"&gt;github.com/rajmurugan01/bedrock-agentcore-starter&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://rajmurugan.com/blog/part-1-why-bedrock-agentcore" rel="noopener noreferrer"&gt;rajmurugan.com&lt;/a&gt;. This is Part 1 of the &lt;a href="https://rajmurugan.com/blog" rel="noopener noreferrer"&gt;Ultimate Guide to Building AI Agents on AWS with Bedrock AgentCore&lt;/a&gt; series.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>agentcore</category>
      <category>aiagents</category>
    </item>
    <item>
<title>Complete Guide to RAG Evaluations in Amazon Bedrock</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Tue, 20 Jan 2026 11:36:07 +0000</pubDate>
      <link>https://forem.com/rajmurugan/-complete-guide-to-rag-evaluations-in-amazon-bedrock-4je3</link>
      <guid>https://forem.com/rajmurugan/-complete-guide-to-rag-evaluations-in-amazon-bedrock-4je3</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the rapidly evolving landscape of artificial intelligence, Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By grounding LLMs with external knowledge bases, RAG systems can generate more accurate, relevant, and up-to-date responses, mitigating issues like hallucination and outdated information. Amazon Bedrock provides a robust platform for building and deploying RAG applications, offering a suite of foundation models and tools to streamline development.&lt;/p&gt;

&lt;p&gt;However, the true power of a RAG system lies not just in its construction, but in its continuous evaluation and refinement. Ensuring that your RAG application consistently delivers high-quality responses requires a systematic approach to assessment. This comprehensive guide will walk you through the process of setting up and conducting RAG evaluations within Amazon Bedrock, focusing on automatic assessment of your knowledge base performance. We will cover everything from initial prerequisites and environment setup to creating evaluation jobs, monitoring key metrics, and interpreting results, empowering you to build and maintain highly effective RAG solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites &amp;amp; Environment Setup
&lt;/h2&gt;

&lt;p&gt;Before diving into the RAG evaluation process, it's essential to ensure your environment is correctly configured and you have the necessary prerequisites in place. This section outlines the foundational requirements for a smooth evaluation experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Essential Prerequisites
&lt;/h3&gt;

&lt;p&gt;To begin, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;An AWS Account&lt;/strong&gt;: Access to an active Amazon Web Services (AWS) account is fundamental for utilizing Amazon Bedrock and its associated services like S3 and IAM.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Basic Knowledge of AWS S3 and IAM Roles&lt;/strong&gt;: Familiarity with Amazon S3 for data storage and AWS Identity and Access Management (IAM) for managing permissions is crucial. You will be interacting with S3 buckets for storing evaluation datasets and configuring IAM roles for service access.&lt;/li&gt;
&lt;/ul&gt;
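&lt;p&gt;One artifact worth preparing up front is the evaluation dataset itself: a JSONL file in S3, one record per line. The field names below reflect the knowledge-base evaluation format as I understand it — verify the exact schema against the current Bedrock documentation:&lt;/p&gt;

```json
{"conversationTurns": [{"prompt": {"content": [{"text": "What is the refund window?"}]}, "referenceResponses": [{"content": [{"text": "Refunds are accepted within 30 days of purchase."}]}]}]}
{"conversationTurns": [{"prompt": {"content": [{"text": "Do you ship internationally?"}]}, "referenceResponses": [{"content": [{"text": "Yes, to over 40 countries."}]}]}]}
```

&lt;p&gt;Each line is one independent record; the &lt;code&gt;referenceResponses&lt;/code&gt; ground-truth answers are what the evaluation job judges retrieval and generation quality against.&lt;/p&gt;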

&lt;h3&gt;
  
  
  Environment Configuration
&lt;/h3&gt;

&lt;p&gt;Careful environment setup ensures compatibility and optimal performance for your RAG evaluations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;AWS Region Selection&lt;/strong&gt;: It is recommended to use either the &lt;strong&gt;US East (N. Virginia)&lt;/strong&gt; or &lt;strong&gt;US West (Oregon)&lt;/strong&gt; AWS regions. These regions typically offer the broadest support for the latest Amazon Bedrock features and foundation models. Always verify that your chosen services and models are available in your selected region.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Model Selection&lt;/strong&gt;: For the purpose of this guide, we will primarily use the &lt;strong&gt;Amazon Nova Micro v1.0&lt;/strong&gt; model. This model is a good starting point for evaluations due to its balance of performance and cost-effectiveness. However, it is imperative to: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Verify Regional Support&lt;/strong&gt;: Confirm that the Amazon Nova Micro v1.0 model (or any other model you choose) is supported in your selected AWS region.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Check Pricing&lt;/strong&gt;: Always review the pricing for your chosen model, as costs can vary based on model type, usage, and region. Understanding the cost implications upfront will help manage your AWS expenditure effectively.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;By ensuring these prerequisites are met and your environment is properly configured, you lay the groundwork for a successful and insightful RAG evaluation journey within Amazon Bedrock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Guide to RAG Evaluation
&lt;/h2&gt;

&lt;p&gt;This section provides a detailed, step-by-step walkthrough of how to set up and execute RAG evaluations in Amazon Bedrock. Each step is designed to guide you through the process, from creating your knowledge base to analyzing the evaluation results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create a Knowledge Base
&lt;/h3&gt;

&lt;p&gt;The foundation of any RAG application is its knowledge base. This knowledge base serves as the external data source that the LLM will retrieve information from. Follow these instructions to set up your knowledge base in Amazon Bedrock:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Navigate to Amazon Bedrock&lt;/strong&gt;: In the AWS Management Console, search for and select "Amazon Bedrock."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Access Knowledge Bases&lt;/strong&gt;: From the Bedrock console, go to the left-hand navigation pane and select &lt;strong&gt;Knowledge Bases&lt;/strong&gt;. Then, click on the &lt;strong&gt;Create knowledge base&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Provide Knowledge Base Details&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Knowledge base name&lt;/strong&gt;: Enter a descriptive name for your knowledge base (e.g., &lt;code&gt;myFirstBedrockKB&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Description&lt;/strong&gt;: (Optional) Provide a brief description of your knowledge base.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IAM service role&lt;/strong&gt;: Choose to &lt;strong&gt;Create and use a new service role&lt;/strong&gt;. Make a note of the role name, as it will be useful for future reference and permissions management.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure Data Source&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Data source&lt;/strong&gt;: Select &lt;strong&gt;S3&lt;/strong&gt; as your data source type.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data source location&lt;/strong&gt;: Specify &lt;strong&gt;"This AWS account"&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;S3 URI&lt;/strong&gt;: Provide the S3 URI for your S3 bucket where your data is stored (e.g., &lt;code&gt;s3://mykbbucket&lt;/code&gt;). This bucket should contain the documents that your RAG application will use.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Chunking and parsing configurations&lt;/strong&gt;: For initial setup, you can keep the default settings. These configurations determine how your documents are split and processed for retrieval.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Embeddings Model&lt;/strong&gt;: Choose &lt;strong&gt;"Titan Text Embeddings v2"&lt;/strong&gt; as your embeddings model. This model will convert your documents into vector embeddings, enabling semantic search and retrieval.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Set up Vector Database&lt;/strong&gt;: For the vector database, select &lt;strong&gt;"Quick create a new vector store"&lt;/strong&gt;. You can choose between &lt;strong&gt;"Amazon OpenSearch Serverless"&lt;/strong&gt; or &lt;strong&gt;"Amazon S3 Vector Store (In Preview)"&lt;/strong&gt;. The vector database stores the embeddings and facilitates efficient similarity searches.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Create Knowledge Base&lt;/strong&gt;: Click &lt;strong&gt;Next&lt;/strong&gt; and then &lt;strong&gt;Create knowledge base&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The creation of the knowledge base and the associated vector database can take some time. Please be patient during this provisioning process.&lt;/p&gt;
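&lt;p&gt;If you prefer scripting this setup, the same knowledge base can be created with the &lt;code&gt;boto3&lt;/code&gt; &lt;code&gt;bedrock-agent&lt;/code&gt; client. The sketch below only assembles the request: the role ARN, OpenSearch Serverless collection ARN, index name, and field mapping are placeholders you must supply (the console's "Quick create" option provisions these pieces for you).&lt;/p&gt;

```python
# Sketch: create a vector knowledge base programmatically.
# All ARNs and names below are placeholders -- substitute your own values.

def build_create_kb_request(name, role_arn, collection_arn, index_name):
    """Assemble a create_knowledge_base request for a vector KB backed by
    OpenSearch Serverless and Titan Text Embeddings v2."""
    return {
        "name": name,
        "roleArn": role_arn,
        "knowledgeBaseConfiguration": {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                # Titan Text Embeddings v2 (us-east-1 model ARN)
                "embeddingModelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0",
            },
        },
        "storageConfiguration": {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": {
                "collectionArn": collection_arn,
                "vectorIndexName": index_name,
                # Field names here are illustrative; they must match your index.
                "fieldMapping": {
                    "vectorField": "embedding",
                    "textField": "text",
                    "metadataField": "metadata",
                },
            },
        },
    }

def create_kb(**kwargs):
    import boto3  # deferred: the actual call requires AWS credentials
    client = boto3.client("bedrock-agent")
    return client.create_knowledge_base(**build_create_kb_request(**kwargs))
```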

&lt;h3&gt;
  
  
  Step 2: Sync Data Source
&lt;/h3&gt;

&lt;p&gt;Once your knowledge base is created, you need to synchronize it with your data source to ensure that the latest information is available for retrieval. This step ensures that any updates or new documents in your S3 bucket are ingested and indexed by the knowledge base.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Go to your Knowledge Base&lt;/strong&gt;: Navigate back to the Amazon Bedrock console and select your newly created Knowledge Base.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Navigate to Data Source Tab&lt;/strong&gt;: Within your Knowledge Base details, click on the &lt;strong&gt;Data source&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Your Data Source&lt;/strong&gt;: From the list of data sources, select the one you configured in the previous step.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Initiate Sync&lt;/strong&gt;: Click on the &lt;strong&gt;"Sync"&lt;/strong&gt; button.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Similar to the creation process, syncing the data source with your Knowledge Base can take some time, especially for large datasets. Monitor the status in the console until the synchronization is complete.&lt;/p&gt;
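&lt;p&gt;The console's &lt;strong&gt;Sync&lt;/strong&gt; button corresponds to the &lt;code&gt;StartIngestionJob&lt;/code&gt; API. A minimal polling sketch with &lt;code&gt;boto3&lt;/code&gt;, assuming your knowledge base and data source IDs are already known:&lt;/p&gt;

```python
import time

# Statuses that mean an ingestion (sync) job has stopped running.
TERMINAL_STATUSES = {"COMPLETE", "FAILED"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES

def sync_data_source(kb_id: str, ds_id: str, poll_seconds: int = 15) -> str:
    """Start a sync and wait until it finishes. Requires AWS credentials."""
    import boto3  # deferred so the pure helper above stays testable offline
    client = boto3.client("bedrock-agent")
    job = client.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
    job_id = job["ingestionJob"]["ingestionJobId"]
    while True:
        status = client.get_ingestion_job(
            knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id
        )["ingestionJob"]["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)  # large datasets can take a while
```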

&lt;h3&gt;
  
  
  Step 3: Test Your Knowledge Base
&lt;/h3&gt;

&lt;p&gt;Before proceeding with formal evaluations, it's a good practice to manually test your knowledge base to get a preliminary understanding of its retrieval capabilities. This helps in identifying any immediate issues with data ingestion or relevance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Navigate to Test Knowledge Base&lt;/strong&gt;: In the Amazon Bedrock console, go to your Knowledge Base and select the &lt;strong&gt;Test Knowledge Base&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Model&lt;/strong&gt;: Choose &lt;strong&gt;"Amazon Nova Micro"&lt;/strong&gt; as the model for testing.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Enter Questions&lt;/strong&gt;: In the chat interface, enter specific questions related to the data you ingested into your knowledge base. For example, if your data contains information about product service intervals, you might ask: "What is the recommended service interval for your product?"&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Review Responses&lt;/strong&gt;: Carefully review the responses provided by the knowledge base. Verify that they are accurate, relevant, and directly supported by your source data. Pay attention to whether the responses correctly retrieve information from your documents.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Iterative Testing&lt;/strong&gt;: Try different types of questions, including those that require precise factual recall and those that involve more general understanding. This iterative testing helps you gauge the breadth and depth of your knowledge base's retrieval capabilities.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: To ensure the Knowledge Base is retrieving relevant information correctly, ask specific questions that can be directly answered by the content within your data sources.&lt;/p&gt;
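&lt;p&gt;The same manual test can be run outside the console with the &lt;code&gt;bedrock-agent-runtime&lt;/code&gt; client's &lt;code&gt;retrieve_and_generate&lt;/code&gt; operation. A minimal sketch, where the knowledge base ID and Nova Micro model ARN are examples you must replace:&lt;/p&gt;

```python
def build_rag_query(kb_id: str, model_arn: str, question: str) -> dict:
    """Request body for the retrieve_and_generate API call."""
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

def ask_knowledge_base(kb_id: str, model_arn: str, question: str) -> str:
    import boto3  # deferred: requires AWS credentials at call time
    runtime = boto3.client("bedrock-agent-runtime")
    resp = runtime.retrieve_and_generate(**build_rag_query(kb_id, model_arn, question))
    return resp["output"]["text"]
```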

&lt;h3&gt;
  
  
  Step 4: Creating Evaluation Examples
&lt;/h3&gt;

&lt;p&gt;To automatically evaluate your RAG system, you need a dataset of evaluation examples. These examples consist of prompts and their corresponding reference responses, which the evaluation job will use to assess the quality of your knowledge base's retrieval. This process involves creating a &lt;code&gt;batchinput.jsonl&lt;/code&gt; file.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Copy Example for a Single Record&lt;/strong&gt;: Begin by visiting the official AWS documentation for retrieval-only prompt dataset examples [1]. This documentation provides a structured JSON format for evaluation inputs. Copy an input record example, removing any extraneous whitespace so the line remains valid JSONL. A typical example might look like this:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"conversationTurns"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"What is the recommended service interval?"&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;&lt;span class="nl"&gt;"referenceResponses"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:[{&lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"The recommended service interval is two years."&lt;/span&gt;&lt;span class="p"&gt;}]}]}]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create More Examples&lt;/strong&gt;: Manually creating a large number of diverse evaluation examples can be time-consuming. To expedite this process, you can leverage tools like &lt;strong&gt;Amazon Q Developer (Free version)&lt;/strong&gt; to generate additional samples. Focus on creating a variety of prompts that cover different aspects of your knowledge base content and expected user queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save All Records to &lt;code&gt;batchinput.jsonl&lt;/code&gt;&lt;/strong&gt;: Consolidate all your generated evaluation examples into a single file named &lt;code&gt;batchinput.jsonl&lt;/code&gt;. Each line in this file must be a valid JSON object, representing one evaluation example. Ensure the file adheres strictly to the JSONL (JSON Lines) format, where each line is a self-contained JSON object, without commas between objects or an enclosing array.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is crucial that your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file is correctly formatted. You can use online JSON formatters and validators like &lt;a href="https://jsonformatter.org/" rel="noopener noreferrer"&gt;jsonformatter.org&lt;/a&gt; or &lt;a href="https://jsonlint.com/" rel="noopener noreferrer"&gt;jsonlint.com&lt;/a&gt; to verify its integrity before proceeding.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
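&lt;p&gt;Rather than hand-editing JSONL, you can generate and validate &lt;code&gt;batchinput.jsonl&lt;/code&gt; with a short script. The helper below mirrors the record structure shown above and checks that every line parses as a standalone JSON object:&lt;/p&gt;

```python
import json

def make_record(prompt: str, reference: str) -> dict:
    """One evaluation example in the conversationTurns format shown above."""
    return {
        "conversationTurns": [{
            "prompt": {"content": [{"text": prompt}]},
            "referenceResponses": [{"content": [{"text": reference}]}],
        }]
    }

def write_jsonl(records, path="batchinput.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # One JSON object per line -- no commas, no enclosing array.
            f.write(json.dumps(rec) + "\n")

def validate_jsonl(path) -> int:
    """Raise ValueError on the first malformed line; return the line count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"line {i} is not valid JSON: {e}")
            count += 1
    return count
```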

&lt;h3&gt;
  
  
  Step 5: Upload the File to S3
&lt;/h3&gt;

&lt;p&gt;With your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file prepared, the next step is to upload it to an Amazon S3 bucket. This S3 location will serve as the input for your RAG evaluation job in Amazon Bedrock.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Prepare Your &lt;code&gt;batchinput.jsonl&lt;/code&gt; File&lt;/strong&gt;: Ensure your file contains all the evaluation examples and is correctly formatted as JSONL, as detailed in the previous step.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Navigate to the AWS S3 Console&lt;/strong&gt;: In the AWS Management Console, search for and select "S3."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Your S3 Bucket&lt;/strong&gt;: Locate and select the S3 bucket you intend to use for storing your evaluation input (e.g., &lt;code&gt;mybatchinferenceinput&lt;/code&gt;). If you don't have a dedicated bucket, you may need to create one.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Initiate Upload&lt;/strong&gt;: Click on the &lt;strong&gt;"Upload"&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Your File&lt;/strong&gt;: Drag and drop your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file into the upload area, or use the "Add files" button to browse and select it from your local machine.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Review and Confirm&lt;/strong&gt;: Review the upload settings. For evaluation input files, default settings are usually sufficient, but ensure public access is not inadvertently granted if your data is sensitive.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Complete Upload&lt;/strong&gt;: Click &lt;strong&gt;"Upload"&lt;/strong&gt; to finalize the process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Double-check that your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file is in the correct JSONL format with no extra spaces or malformed JSON objects. Incorrect formatting can lead to errors during the evaluation job processing.&lt;/p&gt;
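&lt;p&gt;The upload itself is a one-liner with &lt;code&gt;boto3&lt;/code&gt;; the sketch below also re-validates the file first so a malformed line fails fast before it reaches S3 (the bucket and key names are examples):&lt;/p&gt;

```python
import json

def s3_uri(bucket: str, key: str) -> str:
    return f"s3://{bucket}/{key}"

def upload_eval_input(path: str, bucket: str, key: str) -> str:
    """Validate the JSONL locally, then upload it. Requires AWS credentials."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            json.loads(line)  # raises on the first malformed line

    import boto3  # deferred: only needed for the actual upload
    boto3.client("s3").upload_file(path, bucket, key)
    return s3_uri(bucket, key)
```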

&lt;h3&gt;
  
  
  Step 6: Create an Evaluation Job
&lt;/h3&gt;

&lt;p&gt;Now that your evaluation examples are ready and uploaded to S3, you can create an evaluation job in Amazon Bedrock to automatically assess your knowledge base.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Navigate to Amazon Bedrock Evaluations&lt;/strong&gt;: In the AWS Management Console, go to Amazon Bedrock. In the left-hand navigation pane, select &lt;strong&gt;Inference and Assessment&lt;/strong&gt;, then &lt;strong&gt;Evaluations&lt;/strong&gt;, and finally &lt;strong&gt;RAG&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Create New Evaluation&lt;/strong&gt;: Click on the &lt;strong&gt;"Create"&lt;/strong&gt; button to start configuring a new evaluation job.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Provide Evaluation Details&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Evaluation name&lt;/strong&gt;: Enter a unique and descriptive name for your evaluation job.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Description&lt;/strong&gt;: (Optional) Provide a brief description of the evaluation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Evaluator Model&lt;/strong&gt;: Choose &lt;strong&gt;"Amazon Nova Micro v1.0"&lt;/strong&gt; as the evaluator model. This model will be used to automatically score the responses generated by your knowledge base against the reference responses you provided.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Specify Source&lt;/strong&gt;: Select &lt;strong&gt;"Bedrock Knowledge Base"&lt;/strong&gt; as the source for the evaluation. Then, choose your specific Knowledge Base (e.g., &lt;code&gt;myFirstBedrockKB&lt;/code&gt;) from the dropdown list.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Define Evaluation Type and Metrics&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Evaluation type&lt;/strong&gt;: Select &lt;strong&gt;"Retrieval only"&lt;/strong&gt;. This focuses the evaluation on the quality of the information retrieved by your knowledge base.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Metrics&lt;/strong&gt;: Under the Metrics section, select &lt;strong&gt;"Context relevance"&lt;/strong&gt; and &lt;strong&gt;"Context coverage"&lt;/strong&gt;. These are crucial metrics for assessing how well the retrieved context aligns with the prompt and how comprehensively it covers the necessary information.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure Input and Output Locations&lt;/strong&gt;: 

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Input&lt;/strong&gt;: Specify the S3 location of your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file (e.g., &lt;code&gt;s3://mybatchinferenceinput/batchinput.jsonl&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Output&lt;/strong&gt;: Choose an S3 output bucket and prefix where the evaluation results will be stored (e.g., &lt;code&gt;s3://mymodelevaloutput/output&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Set up Service Role&lt;/strong&gt;: &lt;strong&gt;Create and use a new service role&lt;/strong&gt; for this evaluation job. This role grants Bedrock the necessary permissions to access your S3 buckets and run the evaluation. Remember to note down the role's name for future reference.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Initiate Evaluation&lt;/strong&gt;: Review all your settings and click &lt;strong&gt;"Create evaluation job"&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the evaluation job is created, Amazon Bedrock will begin processing your evaluation examples, using the chosen evaluator model to score the retrieval performance of your knowledge base. You can monitor the job's status in the Bedrock console.&lt;/p&gt;
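&lt;p&gt;The console steps above map to the &lt;code&gt;CreateEvaluationJob&lt;/code&gt; API on the &lt;code&gt;bedrock&lt;/code&gt; client. The request sketch below reflects my reading of the &lt;code&gt;boto3&lt;/code&gt; field names for a retrieval-only knowledge base evaluation; treat the nested shapes, metric names, and all ARN/bucket values as assumptions to verify against the current Bedrock API reference:&lt;/p&gt;

```python
def build_rag_eval_request(job_name, role_arn, kb_id, input_s3, output_s3):
    """Sketch of a retrieval-only RAG evaluation job request. Field names
    are assumptions -- check them against the current SDK documentation."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "applicationType": "RagEvaluation",
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {"s3Uri": input_s3},
                    },
                    # Context relevance and context coverage, as selected above
                    "metricNames": ["Builtin.ContextRelevance", "Builtin.ContextCoverage"],
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": "amazon.nova-micro-v1:0"}]
                },
            }
        },
        "inferenceConfig": {
            "ragConfigs": [{
                "knowledgeBaseConfig": {
                    "retrieveConfig": {
                        "knowledgeBaseId": kb_id,
                        "knowledgeBaseRetrievalConfiguration": {
                            "vectorSearchConfiguration": {"numberOfResults": 5}
                        },
                    }
                }
            }]
        },
        "outputDataConfig": {"s3Uri": output_s3},
    }

def create_rag_eval(**kwargs):
    import boto3  # deferred: requires AWS credentials
    return boto3.client("bedrock").create_evaluation_job(**build_rag_eval_request(**kwargs))
```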

&lt;h2&gt;
  
  
  Monitoring and Detailed Analysis
&lt;/h2&gt;

&lt;p&gt;After setting up and running your RAG evaluation job, monitoring its performance and conducting detailed analysis of the results are crucial steps. This allows you to gain insights into the efficiency and effectiveness of your knowledge base. The following steps, illustrated in the provided flowchart, guide you through this process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Prerequisites for Monitoring
&lt;/h3&gt;

&lt;p&gt;Before you can effectively monitor and analyze your RAG evaluations, ensure you have the following foundational elements in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Amazon Bedrock and Bedrock Knowledge Base&lt;/strong&gt;: Your RAG application, including the Amazon Bedrock service and your configured Knowledge Base, must be operational.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Prompt Dataset in S3&lt;/strong&gt;: The &lt;code&gt;batchinput.jsonl&lt;/code&gt; file containing your evaluation prompts and reference responses should be stored in an accessible S3 bucket, as this is the input for your evaluation jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Enable Logging
&lt;/h3&gt;

&lt;p&gt;To capture the necessary metrics and logs for monitoring and detailed analysis, you must enable logging for your Bedrock evaluations. This ensures that invocation details and other critical information are recorded.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Navigate to Bedrock Evaluations&lt;/strong&gt;: Go to the Amazon Bedrock console, then &lt;strong&gt;Inference and Assessment&lt;/strong&gt;, and select &lt;strong&gt;Evaluations&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Enable Model Invocation Logging&lt;/strong&gt;: Within the evaluation settings, ensure that &lt;strong&gt;Model Invocation Logging&lt;/strong&gt; is enabled. This setting directs Bedrock to send invocation data to a logging service.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Choose S3/CloudWatch Logs&lt;/strong&gt;: Configure where these logs should be stored. You can choose to send them to &lt;strong&gt;Amazon S3&lt;/strong&gt; for long-term storage and batch analysis, or to &lt;strong&gt;Amazon CloudWatch Logs&lt;/strong&gt; for real-time monitoring and querying.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Create Evaluation (Recap)
&lt;/h3&gt;

&lt;p&gt;As previously detailed, the creation of the evaluation job is where you define what to evaluate and how. This step is a prerequisite for the monitoring phase.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Go to Bedrock Evaluations&lt;/strong&gt;: Access the Evaluations section in Amazon Bedrock.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Create Knowledge Base Evaluation Job&lt;/strong&gt;: Initiate the creation of a new evaluation job, specifying it as a Knowledge Base evaluation.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure Job Settings&lt;/strong&gt;: Define the evaluation name, description, and select the evaluator model (e.g., Amazon Nova Micro v1.0).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Specify Prompt Dataset &amp;amp; S3 Output&lt;/strong&gt;: Point to your &lt;code&gt;batchinput.jsonl&lt;/code&gt; file in S3 as the input and define an S3 bucket for storing the evaluation output.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Click Create Evaluation Job&lt;/strong&gt;: Launch the evaluation process.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4: Monitor with CloudWatch
&lt;/h3&gt;

&lt;p&gt;Amazon CloudWatch provides powerful tools for monitoring your Bedrock evaluations in real-time. You can use CloudWatch dashboards to visualize key performance indicators.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open CloudWatch Console&lt;/strong&gt;: In the AWS Management Console, search for and select "CloudWatch."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Go to Automatic Dashboards&lt;/strong&gt;: In the CloudWatch console, navigate to the &lt;strong&gt;Dashboards&lt;/strong&gt; section and look for automatically generated dashboards.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Select Bedrock Dashboard&lt;/strong&gt;: Choose the dashboard specifically created for Amazon Bedrock. This dashboard typically provides pre-configured widgets for common Bedrock metrics.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;View InvocationLatency Metrics&lt;/strong&gt;: Within the Bedrock dashboard, focus on metrics such as &lt;code&gt;InvocationLatency&lt;/code&gt;. This metric indicates the total response time of your knowledge base, which is critical for user experience.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Filter by Model ID&lt;/strong&gt;: To narrow down your analysis, you can filter the metrics by &lt;code&gt;Model ID&lt;/code&gt;. This allows you to observe the performance of specific models used in your RAG evaluations.&lt;/li&gt;
&lt;/ol&gt;
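&lt;p&gt;The same latency data shown on the dashboard can be pulled programmatically from the &lt;code&gt;AWS/Bedrock&lt;/code&gt; CloudWatch namespace. A sketch (the model ID is an example):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def build_latency_query(model_id: str, hours: int = 24) -> dict:
    """Parameters for get_metric_statistics on Bedrock invocation latency."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }

def fetch_latency(model_id: str):
    import boto3  # deferred: requires AWS credentials
    cw = boto3.client("cloudwatch")
    return cw.get_metric_statistics(**build_latency_query(model_id))["Datapoints"]
```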

&lt;h3&gt;
  
  
  Step 5: Detailed Analysis with CloudWatch Logs Insights
&lt;/h3&gt;

&lt;p&gt;For a deeper dive into individual evaluation runs and to troubleshoot specific issues, CloudWatch Logs Insights offers a powerful query language to analyze your raw logs.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Go to CloudWatch Logs Insights&lt;/strong&gt;: In the CloudWatch console, navigate to &lt;strong&gt;Logs&lt;/strong&gt; and then select &lt;strong&gt;Logs Insights&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Query for Individual Evaluation Metrics&lt;/strong&gt;: Use the Logs Insights query editor to write custom queries that extract specific information from your Bedrock evaluation logs. You can query for details related to individual prompts, responses, and the metrics computed by the evaluator model.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Analyze Raw Logs from S3&lt;/strong&gt;: If you configured your logs to be stored in S3, you can also directly access and analyze these raw log files using tools like Amazon Athena or other data processing services for more complex, large-scale analysis.&lt;/li&gt;
&lt;/ol&gt;
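&lt;p&gt;These steps can be scripted too. The sketch below runs a Logs Insights query against your invocation log group; the log group name is a placeholder, and the queried field names assume the standard model invocation log schema:&lt;/p&gt;

```python
import time

# Sample Logs Insights query over Bedrock model invocation logs. The fields
# (modelId, token counts) assume the standard invocation-log JSON schema.
QUERY = """
fields @timestamp, modelId, input.inputTokenCount, output.outputTokenCount
| sort @timestamp desc
| limit 20
""".strip()

def run_insights_query(log_group: str, hours: int = 24):
    import boto3  # deferred: requires AWS credentials
    logs = boto3.client("logs")
    end = int(time.time())
    q = logs.start_query(
        logGroupName=log_group,
        startTime=end - hours * 3600,
        endTime=end,
        queryString=QUERY,
    )
    while True:  # Logs Insights queries complete asynchronously
        resp = logs.get_query_results(queryId=q["queryId"])
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp["results"]
        time.sleep(2)
```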

&lt;p&gt;By following these monitoring and analysis steps, you can continuously track the performance of your RAG system, identify areas for improvement, and ensure your knowledge base is delivering optimal results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Performance Metrics
&lt;/h2&gt;

&lt;p&gt;Understanding the performance of your RAG system involves analyzing several key metrics that provide insights into its efficiency and effectiveness. These metrics are crucial for identifying bottlenecks, optimizing costs, and ensuring a high-quality user experience. The primary metrics to focus on include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;InvocationLatency&lt;/code&gt;&lt;/strong&gt;: This metric represents the &lt;strong&gt;total response time&lt;/strong&gt; of your RAG system. It measures the duration from when a request is made to the knowledge base until a response is fully generated. Lower invocation latency indicates a more responsive system, which is vital for interactive applications. High latency can point to issues with network connectivity, model inference speed, or knowledge base retrieval efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;InputTokenCount&lt;/code&gt;&lt;/strong&gt;: This metric tracks the &lt;strong&gt;number of tokens in the input&lt;/strong&gt; provided to the LLM. In a RAG context, this typically includes the user's query and the retrieved context from the knowledge base. Monitoring input token count helps in understanding the complexity of the prompts being processed and has direct implications for cost, as most LLM providers charge based on token usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;OutputTokenCount&lt;/code&gt;&lt;/strong&gt;: This metric measures the &lt;strong&gt;number of tokens in the output&lt;/strong&gt; generated by the LLM. It reflects the length and verbosity of the responses. Similar to input tokens, output token count is a significant factor in determining the operational cost of your RAG application. Optimizing the conciseness and relevance of responses can help manage this cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;Invocations&lt;/code&gt;&lt;/strong&gt;: This metric quantifies the &lt;strong&gt;number of successful requests&lt;/strong&gt; made to the &lt;code&gt;InvokeModel&lt;/code&gt; and &lt;code&gt;InvokeModelWithResponseStream&lt;/code&gt; API operations. It provides a direct measure of the usage volume of your RAG system. Tracking invocations helps in capacity planning, understanding demand patterns, and correlating usage with overall system performance and cost.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By regularly monitoring and analyzing these key performance metrics, you can gain a comprehensive understanding of your RAG system's behavior, identify areas for optimization, and make data-driven decisions to improve its efficiency and user satisfaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Considerations
&lt;/h2&gt;

&lt;p&gt;When deploying and evaluating RAG systems on Amazon Bedrock, understanding the cost implications of different models is paramount. Pricing for LLMs is typically based on token usage, with separate rates for input and output tokens. The choice of model can significantly impact your operational expenses. Below is a table summarizing the pricing for various models available in Amazon Bedrock, based on the provided data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Provider&lt;/th&gt;
&lt;th&gt;Model Name&lt;/th&gt;
&lt;th&gt;Input Price (per 1K tokens)&lt;/th&gt;
&lt;th&gt;Output Price (per 1K tokens)&lt;/th&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon&lt;/td&gt;
&lt;td&gt;Nova Micro&lt;/td&gt;
&lt;td&gt;$0.000035&lt;/td&gt;
&lt;td&gt;$0.000140&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon&lt;/td&gt;
&lt;td&gt;Nova Lite&lt;/td&gt;
&lt;td&gt;$0.000060&lt;/td&gt;
&lt;td&gt;$0.000240&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon&lt;/td&gt;
&lt;td&gt;Nova Pro&lt;/td&gt;
&lt;td&gt;$0.000800&lt;/td&gt;
&lt;td&gt;$0.003200&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4&lt;/td&gt;
&lt;td&gt;$0.003000&lt;/td&gt;
&lt;td&gt;$0.015000&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Meta&lt;/td&gt;
&lt;td&gt;Llama 3 70B&lt;/td&gt;
&lt;td&gt;$0.000720&lt;/td&gt;
&lt;td&gt;$0.000720&lt;/td&gt;
&lt;td&gt;us-east-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As the table shows, pricing varies considerably across models. For instance, Amazon Nova Micro offers a very cost-effective option for both input and output tokens, making it suitable for initial evaluations and applications where cost efficiency is a primary concern. In contrast, models like Anthropic Claude Sonnet 4, while potentially offering advanced capabilities, come with a significantly higher price point.&lt;/p&gt;

&lt;p&gt;When selecting a model for your RAG application and its evaluations, it is crucial to balance performance requirements with budgetary constraints. Consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Evaluation Frequency&lt;/strong&gt;: Frequent evaluations will incur costs based on the number of tokens processed. Opting for more cost-effective models for evaluation jobs can help manage expenses.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Production Workloads&lt;/strong&gt;: For production deployments, assess the expected volume of input and output tokens to project monthly costs. A small difference in per-token pricing can accumulate into substantial costs at scale.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Model Performance vs. Cost&lt;/strong&gt;: While cheaper models might seem attractive, ensure they meet your performance benchmarks for accuracy, relevance, and latency. Sometimes, investing in a slightly more expensive model that delivers superior results can lead to better overall ROI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By carefully analyzing these cost factors alongside performance metrics, you can make informed decisions about model selection and optimize the financial efficiency of your RAG solutions on Amazon Bedrock.&lt;/p&gt;
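&lt;p&gt;A quick back-of-the-envelope projection makes these differences concrete. The helper below uses the per-1K-token prices from the table above (Amazon models only):&lt;/p&gt;

```python
# Per-1K-token prices (input, output) from the table above, us-east-1, USD.
PRICES = {
    "nova-micro": (0.000035, 0.000140),
    "nova-lite":  (0.000060, 0.000240),
    "nova-pro":   (0.000800, 0.003200),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Project the cost of a month's token volume, rounded to the cent."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price
    return round(cost, 2)
```

&lt;p&gt;For example, a workload of 10 million input tokens and 2 million output tokens per month costs about $0.63 on Nova Micro but $14.40 on Nova Pro, which illustrates how per-token differences accumulate at scale.&lt;/p&gt;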

&lt;h2&gt;
  
  
  Evaluation Results Interpretation
&lt;/h2&gt;

&lt;p&gt;Interpreting the results of your RAG evaluations is key to understanding the strengths and weaknesses of different models and optimizing your knowledge base. A results spreadsheet typically offers a side-by-side comparison of several models across various performance and quality metrics. Let's break down how to interpret such a detailed evaluation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview of Models and Performance Metrics
&lt;/h3&gt;

&lt;p&gt;The evaluation typically compares several models, such as &lt;strong&gt;Nova Micro&lt;/strong&gt;, &lt;strong&gt;Nova Lite&lt;/strong&gt;, &lt;strong&gt;Nova Pro&lt;/strong&gt;, &lt;strong&gt;Claude Sonnet 4&lt;/strong&gt;, and &lt;strong&gt;Llama 3 70B&lt;/strong&gt;. For each model, several performance metrics are usually captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Input&lt;/strong&gt;: This likely refers to the total number of input tokens processed during the evaluation run. A higher number indicates more extensive testing or longer prompts/contexts.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Throughput&lt;/strong&gt;: This metric measures the processing speed, often expressed as tokens per second or invocations per second. Higher throughput indicates a more efficient model capable of handling a larger volume of requests in a given time frame.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Cost&lt;/strong&gt;: This is a critical metric, often broken down into cost per second, cost per invocation, and cost per 1K tokens. As discussed in the previous section, these figures directly reflect the financial implications of using each model for your RAG system. Lower costs are generally desirable, provided the quality remains acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quality Metrics
&lt;/h3&gt;

&lt;p&gt;The core of RAG evaluation lies in assessing the quality of the generated responses. The spreadsheet categorizes quality into several dimensions, each contributing to a holistic view of model performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Correctness&lt;/strong&gt;: Measures whether the generated response is factually accurate and free from errors. This is paramount for RAG systems, as their purpose is to provide grounded information.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Completeness&lt;/strong&gt;: Assesses if the response addresses all aspects of the user's query and provides sufficient information. An incomplete response, even if correct, may not be helpful.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Helpfulness&lt;/strong&gt;: Evaluates how useful and actionable the response is to the user. A helpful response goes beyond mere correctness to provide practical value.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Coherence&lt;/strong&gt;: Determines if the response is logically structured, easy to understand, and flows naturally. A coherent response enhances user experience.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Harmfulness&lt;/strong&gt;: Identifies if the response contains any toxic, biased, or otherwise inappropriate content. This is a crucial safety metric for all LLM applications.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Groundedness&lt;/strong&gt;: This is particularly important for RAG systems. It verifies that all information presented in the response can be directly traced back to the provided source documents (the knowledge base). A high groundedness score indicates that the LLM is effectively utilizing the retrieved context and not hallucinating information.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these quality metrics is typically scored, often on a scale (e.g., 0 to 1, or 0 to 5), with higher scores indicating better performance in that specific dimension.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weighted Composite Score and Final Ranking
&lt;/h3&gt;

&lt;p&gt;To provide an overall assessment, a &lt;strong&gt;Weighted Composite Score&lt;/strong&gt; is often calculated. This score combines the individual quality metrics, allowing you to assign different weights based on the importance of each metric to your specific application. For example, if correctness and groundedness are more critical for your use case, they would receive higher weights. The formula for this composite score is usually defined within the evaluation setup.&lt;/p&gt;

&lt;p&gt;Finally, a &lt;strong&gt;Final Ranking Calculation&lt;/strong&gt; provides an ordered list of models based on their overall performance, considering both quantitative metrics (like latency and cost) and qualitative metrics (like correctness and groundedness). This ranking helps in making informed decisions about which model is best suited for your RAG application, balancing performance, quality, and cost.&lt;/p&gt;

&lt;p&gt;By meticulously analyzing these metrics, you can identify which models excel in certain areas, pinpoint specific weaknesses, and iteratively refine your knowledge base, prompt engineering, or even the underlying LLM choice to achieve optimal RAG performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Notes &amp;amp; Reminders
&lt;/h2&gt;

&lt;p&gt;As you embark on your RAG evaluation journey in Amazon Bedrock, keep the following important notes and reminders in mind to ensure efficient resource management, security, and best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resource Cleanup
&lt;/h3&gt;

&lt;p&gt;AWS services, especially those involving machine learning models and data storage, can incur significant costs if left running unnecessarily. It is &lt;strong&gt;highly recommended&lt;/strong&gt; that you diligently delete or release all resources after completing your lab work or evaluations. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Knowledge Base&lt;/strong&gt;: The Amazon Bedrock Knowledge Base itself.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Vector Database&lt;/strong&gt;: The underlying vector store, whether it's Amazon OpenSearch Serverless or Amazon S3 Vector Store.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;S3 Buckets&lt;/strong&gt;: Any S3 buckets you created or used for storing source data, evaluation inputs, or outputs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;IAM Roles&lt;/strong&gt;: The IAM roles created for the Knowledge Base and evaluation jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failing to clean up resources can lead to unexpected charges on your AWS bill. Always verify that all associated resources have been terminated or deleted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-Origin Resource Sharing (CORS)
&lt;/h3&gt;

&lt;p&gt;When developing web applications that interact with Amazon Bedrock, you might encounter issues related to Cross-Origin Resource Sharing (CORS). CORS is a security feature implemented by web browsers that restricts web pages from making requests to a different domain than the one that served the web page. If your frontend application is hosted on a different domain than your Bedrock API endpoints, you will need to configure CORS policies.&lt;/p&gt;

&lt;p&gt;For detailed information on how to configure CORS with Amazon Bedrock, please refer to the official AWS documentation: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/cors.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock/latest/userguide/cors.html&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  JSON Formatters
&lt;/h3&gt;

&lt;p&gt;Throughout the process of creating evaluation examples, you will be working with JSONL files. Ensuring that your JSON objects are correctly formatted is crucial for the evaluation jobs to run successfully. Malformed JSON can lead to errors and failed evaluations.&lt;/p&gt;

&lt;p&gt;Several online tools can help you validate and format your JSON content. Some popular options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://jsonformatter.org/" rel="noopener noreferrer"&gt;https://jsonformatter.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://jsonlint.com/" rel="noopener noreferrer"&gt;https://jsonlint.com/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools can help you quickly identify syntax errors, pretty-print your JSON for readability, and ensure compliance with the JSON standard.&lt;/p&gt;
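&lt;p&gt;If you prefer to validate locally before submitting an evaluation job, a few lines of Python catch the same syntax errors. The record shape in the sample is illustrative only, not the exact Bedrock dataset schema.&lt;/p&gt;

```python
import json

def validate_jsonl(lines):
    """Return (line_number, error) pairs for lines that are not valid JSON."""
    errors = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # allow blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((i, str(exc)))
    return errors

# Illustrative record shape only -- not the exact Bedrock dataset schema.
sample = [
    '{"prompt": "What is RAG?", "reference": "Retrieval Augmented Generation"}',
    '{"prompt": "Bad line",}',
]
print(validate_jsonl(sample))  # flags line 2 (trailing comma)
```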

&lt;p&gt;By adhering to these important notes and reminders, you can maintain a secure, cost-effective, and efficient environment for your RAG evaluations in Amazon Bedrock.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Evaluating Retrieval Augmented Generation (RAG) systems is not merely a best practice; it is a critical component for ensuring the reliability, accuracy, and cost-effectiveness of your AI applications. This guide has provided a comprehensive walkthrough of how to leverage Amazon Bedrock's evaluation capabilities to automatically assess your knowledge base performance. From setting up your environment and creating evaluation examples to monitoring key metrics and interpreting detailed results, you now have the knowledge to systematically enhance your RAG solutions.&lt;/p&gt;

&lt;p&gt;By diligently following these steps, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Improve Response Quality&lt;/strong&gt;: Continuously refine your knowledge base and model choices to deliver more accurate, complete, and helpful responses.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Optimize Costs&lt;/strong&gt;: Make informed decisions about model selection based on performance and pricing, ensuring your RAG system operates efficiently within budget.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enhance User Experience&lt;/strong&gt;: Reduce latency and improve the relevance of information, leading to a more satisfying experience for end-users.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Maintain System Health&lt;/strong&gt;: Proactively identify and address issues through continuous monitoring and detailed analysis of performance metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We encourage you to implement these practices within your own AWS environment. The journey of building robust AI applications is iterative, and effective evaluation is the compass that guides you toward excellence. Start evaluating your RAG systems today to unlock their full potential and deliver truly intelligent solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Retrieval Augmented Generation (RAG)
&lt;/h3&gt;

&lt;p&gt;To better understand the evaluation process, it's helpful to visualize the core components of a RAG system. The diagram below illustrates the typical flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscvffvllweigv94oo9br.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscvffvllweigv94oo9br.png" alt="RAG Concept Diagram" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;User Query&lt;/strong&gt;: The user initiates a request or question.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Retrieval&lt;/strong&gt;: The RAG system queries a &lt;strong&gt;Knowledge Base&lt;/strong&gt; (an external data source) to retrieve relevant information based on the user's query.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Generation&lt;/strong&gt;: The retrieved information is then passed to a &lt;strong&gt;Large Language Model (LLM)&lt;/strong&gt;, which uses this context to generate a comprehensive and grounded response.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Response&lt;/strong&gt;: The final generated response is presented to the user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process ensures that the LLM's output is informed by up-to-date and specific data, making evaluations of both the retrieval and generation components critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG Evaluation Workflow Overview
&lt;/h3&gt;

&lt;p&gt;The entire process of setting up, running, and analyzing RAG evaluations in Amazon Bedrock can be visualized as a clear workflow. The following flowchart provides a high-level overview of the steps involved, from initial prerequisites to detailed analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4ak1i36n2llre4b8567.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4ak1i36n2llre4b8567.png" alt="RAG Evaluation Flowchart" width="800" height="1517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This visual guide helps in understanding the sequence of operations and the interdependencies between different stages of the evaluation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing Key Performance Metrics
&lt;/h3&gt;

&lt;p&gt;To further clarify the key performance metrics discussed, the following diagram illustrates their relationships and what they represent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a2dnbkb8j9dnlgckxv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a2dnbkb8j9dnlgckxv1.png" alt="Key Performance Metrics Diagram" width="800" height="689"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These metrics provide a quantitative foundation for assessing the efficiency and responsiveness of your RAG system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualizing Model Pricing
&lt;/h3&gt;

&lt;p&gt;To provide a clear overview of the cost differences, the following image illustrates the pricing structure for various models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlyokm48mief3faeel7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmlyokm48mief3faeel7d.png" alt="Amazon Bedrock Model Pricing" width="800" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This visual representation emphasizes the importance of cost-conscious model selection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Evaluation Results
&lt;/h3&gt;

&lt;p&gt;The following image provides a detailed breakdown of evaluation results across different models, showcasing various performance and quality metrics:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc404ekazvq1ga8p15zyr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc404ekazvq1ga8p15zyr.png" alt="RAG Evaluation Results Spreadsheet" width="800" height="44"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This spreadsheet is instrumental in conducting a thorough comparative analysis of model performance.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Deep Dive on Amazon Aurora and Amazon RDS for PostgreSQL Architecture and Features</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Fri, 12 Dec 2025 12:49:56 +0000</pubDate>
      <link>https://forem.com/rajmurugan/deep-dive-on-amazon-aurora-and-amazon-rds-for-postgresql-architecture-and-features-182a</link>
      <guid>https://forem.com/rajmurugan/deep-dive-on-amazon-aurora-and-amazon-rds-for-postgresql-architecture-and-features-182a</guid>
      <description>&lt;h1&gt;
  
  
  Deep Dive on Amazon Aurora and Amazon RDS for PostgreSQL Architecture and Features
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you're considering migrating your self-hosted PostgreSQL database or transitioning your commercial databases to PostgreSQL on AWS, you'll need to choose the database service that best aligns with your requirements. AWS offers two managed PostgreSQL database options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Amazon Aurora PostgreSQL-Compatible Edition&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Relational Database Service (Amazon RDS) for PostgreSQL&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post delves into the architecture and features of Aurora PostgreSQL and RDS PostgreSQL, analyzing their performance, scalability, failover capabilities, storage options, high availability, and disaster recovery mechanisms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Both Aurora PostgreSQL and RDS for PostgreSQL are fully managed PostgreSQL database services offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provisioning various classes of DB instances&lt;/li&gt;
&lt;li&gt;Multiple PostgreSQL-compatible versions&lt;/li&gt;
&lt;li&gt;Managing backups and point-in-time recovery (PITR)&lt;/li&gt;
&lt;li&gt;Replication and monitoring&lt;/li&gt;
&lt;li&gt;Multi-AZ support&lt;/li&gt;
&lt;li&gt;Storage auto scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Differences
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Aurora PostgreSQL&lt;/strong&gt; uses a high-performance storage subsystem customized for fast distributed storage. The underlying storage grows automatically in segments of 10 GiB, up to 128 TiB. Aurora improves upon PostgreSQL for massive throughput and highly concurrent workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS for PostgreSQL&lt;/strong&gt; supports up to 64 TiB of storage and uses Amazon Elastic Block Store (Amazon EBS) volumes for database and log storage. RDS manages PostgreSQL installation, upgrades, storage management, replication, and backups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Comparison
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Aurora PostgreSQL Architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single virtual cluster volume supported by storage nodes using locally attached SSDs&lt;/li&gt;
&lt;li&gt;Data automatically replicated across three Availability Zones&lt;/li&gt;
&lt;li&gt;Shared storage model for writer and readers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  RDS PostgreSQL Architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Classic Multi-AZ with single standby instance&lt;/li&gt;
&lt;li&gt;Multi-AZ DB cluster with two readable standby DB instances (semi-synchronous)&lt;/li&gt;
&lt;li&gt;Three separate Availability Zones for increased read capacity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Feature Comparison Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Aurora PostgreSQL&lt;/th&gt;
&lt;th&gt;RDS for PostgreSQL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128 TiB&lt;/td&gt;
&lt;td&gt;64 TiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom distributed storage (locally attached SSDs)&lt;/td&gt;
&lt;td&gt;Amazon EBS (gp2/gp3, io1/io2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Growth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic in 10 GiB increments&lt;/td&gt;
&lt;td&gt;Auto scaling in 10 GiB or 10% chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Reduction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatic when data deleted&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IOPS Limitation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No limitation based on storage size&lt;/td&gt;
&lt;td&gt;Depends on storage type and size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;I/O Charges&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate (I/O-Optimized available)&lt;/td&gt;
&lt;td&gt;Included with storage type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Replicas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 15 Aurora readers&lt;/td&gt;
&lt;td&gt;Up to 155 read replicas (5 per instance, 3 levels of cascading)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-Region Replicas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aurora Global Database&lt;/td&gt;
&lt;td&gt;5 cross-Region read replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical Replica Lag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Few hundred milliseconds&lt;/td&gt;
&lt;td&gt;Few seconds (optimal) to minutes (high load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Continuous and incremental&lt;/td&gt;
&lt;td&gt;Daily full + continuous WAL archiving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup Performance Impact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Slight impact on single-AZ deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PITR Restore Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast (incremental nature)&lt;/td&gt;
&lt;td&gt;Slower (restore full + replay WALs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failover Time (Multi-AZ)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30 seconds (DNS: 10-15s, Recovery: 3-10s)&lt;/td&gt;
&lt;td&gt;1-2 minutes (includes crash recovery)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Crash Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Immediate (on-demand parallel replay)&lt;/td&gt;
&lt;td&gt;Depends on checkpoint interval (default 5 min)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-AZ Options&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single configuration&lt;/td&gt;
&lt;td&gt;One standby or two readable standbys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write Latency (Multi-AZ)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Up to 2x faster commit latency with two readable standbys (vs. a single standby)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replication Method&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared storage&lt;/td&gt;
&lt;td&gt;PostgreSQL streaming replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write Impact on Replicas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Negligible&lt;/td&gt;
&lt;td&gt;Significant (processes transaction logs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Replication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6 copies across 3 AZs&lt;/td&gt;
&lt;td&gt;Synchronous to standby, async to replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serverless Option&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Aurora Serverless v2&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fast Database Cloning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (snapshot restore only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query Plan Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (QPM)&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cluster Cache Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (warm cache failover)&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Machine Learning Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (native SQL)&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Detailed Feature Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL Storage
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single virtual cluster volume&lt;/strong&gt; supported by storage nodes using locally attached SSDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic growth&lt;/strong&gt; in 10 GiB increments up to 128 TiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic reduction&lt;/strong&gt; when data is deleted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triple replication&lt;/strong&gt; across three Availability Zones automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No IOPS limitation&lt;/strong&gt; based on storage size (may need to scale DB instance)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate I/O charges&lt;/strong&gt; applied per usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O-Optimized configuration&lt;/strong&gt; provides up to 40% cost savings when I/O spend exceeds 25% of Aurora database spend&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL Storage
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EBS SSD-based storage types:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;General Purpose SSD (gp2):&lt;/strong&gt; 3 IOPS per provisioned GiB, burst up to 3,000 IOPS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General Purpose SSD (gp3):&lt;/strong&gt; Performance configurable independently of size; baseline of 3,000 IOPS and 125 MiB/s for storage under 400 GiB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provisioned IOPS (io1, io2):&lt;/strong&gt; 1,000–256,000 IOPS range&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage auto scaling&lt;/strong&gt; in chunks of 10 GiB or 10% of current storage (whichever is greater)&lt;/li&gt;

&lt;/ul&gt;
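&lt;p&gt;The gp2 rule above translates directly into a baseline-IOPS estimate. This is a back-of-envelope sketch; the 100 IOPS floor follows the general EBS gp2 model.&lt;/p&gt;

```python
# Estimate gp2 baseline IOPS: 3 IOPS per provisioned GiB,
# with a 100 IOPS floor and burst capability up to 3,000 IOPS.
def gp2_baseline_iops(size_gib: int) -> int:
    return max(100, 3 * size_gib)

print(gp2_baseline_iops(500))   # 1500
print(gp2_baseline_iops(1000))  # 3000
```

&lt;p&gt;Past roughly 1 TiB, gp2 baseline meets its burst ceiling; below that, burst credits are what carry spiky workloads.&lt;/p&gt;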




&lt;h3&gt;
  
  
  Backup
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL Backup
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Continuous and incremental&lt;/strong&gt; automated backups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No performance impact&lt;/strong&gt; or interruption during backups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast PITR&lt;/strong&gt; due to incremental nature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restore time&lt;/strong&gt; depends on volume size and transaction log count&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL Backup
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily automated backups&lt;/strong&gt; during backup window&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slight performance impact&lt;/strong&gt; on single-AZ deployments when backup initiates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous WAL archiving&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PITR process:&lt;/strong&gt; Restore full backup + replay WALs to desired time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower for write-intensive workloads&lt;/strong&gt; (long WAL replay time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tip:&lt;/strong&gt; Frequent manual snapshots reduce PITR duration&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL Scalability
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Up to 15 readers&lt;/strong&gt; for read scaling and high availability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared storage model&lt;/strong&gt; minimizes impact of high write workloads on replication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal replica lag&lt;/strong&gt; (few hundred milliseconds, occasionally up to 60s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-restart&lt;/strong&gt; of readers if lag exceeds threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write capacity&lt;/strong&gt; limited by single writer instance&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL Scalability
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Up to 155 read replicas&lt;/strong&gt; (5 per instance, 3 cascading levels)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading architecture&lt;/strong&gt; reduces overhead on source instance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progressive replication lag&lt;/strong&gt; with more intermediaries in cascade&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read replica promotion&lt;/strong&gt; to standalone instances&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;5 cross-Region read replicas&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming replication&lt;/strong&gt; via PostgreSQL WAL records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher replica lag risk&lt;/strong&gt; with high write activity, storage/instance class mismatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two readable standbys&lt;/strong&gt; in Multi-AZ three-AZ deployment serve both HA and scalability&lt;/li&gt;
&lt;/ul&gt;
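&lt;p&gt;The "up to 155" figure is simple arithmetic on the cascading limits above: 5 first-level replicas, each fanning out to 5 more across 3 levels.&lt;/p&gt;

```python
# 5 replicas per instance across 3 cascade levels:
# 5 + 5*5 + 25*5 = 155 read replicas in total.
per_instance, levels = 5, 3
total = sum(per_instance ** level for level in range(1, levels + 1))
print(total)  # 155
```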




&lt;h3&gt;
  
  
  Crash Recovery
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No traditional checkpoints&lt;/strong&gt; (storage system applies log records directly)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel and asynchronous&lt;/strong&gt; redo record replay per storage segment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate availability&lt;/strong&gt; after crash&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replays transaction logs&lt;/strong&gt; since last checkpoint (default: 5 minutes apart)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint process&lt;/strong&gt; writes dirty pages from memory to storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off:&lt;/strong&gt; Frequent checkpoints reduce recovery time but may slow performance (I/O intensive)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Failover
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL Failover
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Typical failover time:&lt;/strong&gt; 30 seconds

&lt;ul&gt;
&lt;li&gt;DNS propagation: 10-15 seconds&lt;/li&gt;
&lt;li&gt;Recovery: 3-10 seconds (parallel with DNS)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Automatic promotion&lt;/strong&gt; of reader to primary&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL Failover
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Typical failover time:&lt;/strong&gt; 1-2 minutes

&lt;ul&gt;
&lt;li&gt;Includes DNS propagation and crash recovery&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Depends on:&lt;/strong&gt; Crash recovery time, DNS propagation, TTL settings&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Multi-AZ with two standbys:&lt;/strong&gt; Under 35 seconds, 2x faster transaction commit latency&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS Proxy Benefits
&lt;/h4&gt;

&lt;p&gt;Both services support &lt;strong&gt;Amazon RDS Proxy&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection pooling and sharing&lt;/li&gt;
&lt;li&gt;Faster failover recovery&lt;/li&gt;
&lt;li&gt;Automatic connection to new primary&lt;/li&gt;
&lt;li&gt;Maintained idle connections during failover&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  High Availability and Disaster Recovery
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Aurora PostgreSQL HA/DR
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage-compute separation&lt;/strong&gt; architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Six copies of data&lt;/strong&gt; across three Availability Zones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All readers accessible&lt;/strong&gt; via instance or reader endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal replica lag&lt;/strong&gt; (typically &amp;lt;100ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aurora Global Database&lt;/strong&gt; for cross-Region replication (&amp;lt;1 second latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic DB snapshot sharing&lt;/strong&gt; across accounts and regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Backup integration&lt;/strong&gt; for cross-region backup sharing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RDS for PostgreSQL HA/DR
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-AZ deployment options:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;One standby (synchronous replication)&lt;/li&gt;
&lt;li&gt;Two readable standbys (semi-synchronous)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Local storage&lt;/strong&gt; for transaction logs (WAL logs)&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Write-then-flush pattern&lt;/strong&gt; for reduced failover time and faster commits&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Automated backups&lt;/strong&gt; from standby in classic Multi-AZ&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Cross-Region and same-Region replicas&lt;/strong&gt; (transaction log-based)&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Snapshot sharing&lt;/strong&gt; across accounts and regions&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Additional Aurora Features
&lt;/h2&gt;

&lt;p&gt;Aurora PostgreSQL provides several value-add features:&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast Database Cloning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick cloning&lt;/strong&gt; of all databases in DB cluster&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster than RDS snapshot restore&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ideal for:&lt;/strong&gt; Testing schema/parameter changes, analytics on near-production data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Query Plan Management (QPM)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control query plan changes&lt;/strong&gt; to avoid performance degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain optimal plans&lt;/strong&gt; despite table/index statistics changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cluster Cache Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Warm cache synchronization&lt;/strong&gt; between designated replica and writer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immediate cache availability&lt;/strong&gt; after failover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sustained performance&lt;/strong&gt; post-failover&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Aurora Serverless v2
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-demand autoscaling&lt;/strong&gt; configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full Aurora feature set&lt;/strong&gt; (cloning, global database, Multi-AZ, multiple readers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic scaling:&lt;/strong&gt; Starts up, shuts down, scales capacity based on needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant scaling:&lt;/strong&gt; Scales to support hundreds of thousands of transactions in a fraction of a second&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-grained capacity adjustments&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Aurora Machine Learning
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ML-based predictions&lt;/strong&gt; via SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated with:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Amazon SageMaker&lt;/li&gt;
&lt;li&gt;Amazon Bedrock (generative AI)&lt;/li&gt;
&lt;li&gt;Amazon Comprehend (sentiment analysis)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;No custom integrations&lt;/strong&gt; or data movement required&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This comprehensive analysis has explored the architectural details and feature sets of Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for PostgreSQL. Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aurora PostgreSQL&lt;/strong&gt; excels in massive throughput, highly concurrent workloads, and provides enterprise database capabilities with minimal operational overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDS for PostgreSQL&lt;/strong&gt; offers flexibility with storage types, extensive read replica options, and cost-effective solutions for standard workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both services provide robust solutions for managed PostgreSQL deployments on AWS, each with distinct strengths suited to different use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;p&gt;For further guidance on migrating to Aurora PostgreSQL or RDS for PostgreSQL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/migrate-an-on-premises-postgresql-database-to-amazon-rds-for-postgresql.html" rel="noopener noreferrer"&gt;Migrate an on-premises PostgreSQL database to Amazon RDS for PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Migrating.html" rel="noopener noreferrer"&gt;Migrating data to Amazon Aurora with PostgreSQL compatibility&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Have questions or suggestions?&lt;/strong&gt; Please leave a comment below.&lt;/p&gt;

</description>
      <category>rds</category>
      <category>aurora</category>
      <category>aws</category>
    </item>
    <item>
      <title>Bedrock AgentCore: What 5 Real ANZ Enterprise Deploys Taught Us</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Sun, 30 Nov 2025 12:27:44 +0000</pubDate>
      <link>https://forem.com/rajmurugan/bedrock-agentcore-what-5-real-anz-enterprise-deploys-taught-us-1e28</link>
      <guid>https://forem.com/rajmurugan/bedrock-agentcore-what-5-real-anz-enterprise-deploys-taught-us-1e28</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I've spent the last 9 months shipping Bedrock AgentCore into four different ANZ enterprises (plus one internal PoC that crashed and burned).&lt;br&gt;&lt;br&gt;
This isn't a hello-world tutorial – it's the bruises, the invoices, and the 3 a.m. CloudWatch alarms that finally made the thing stick.&lt;br&gt;&lt;br&gt;
If you're about to promote an agent past the "demo for the board" stage, steal this checklist – it will save you at least one rollback.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The numbers we actually saw
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Use-case&lt;/th&gt;
&lt;th&gt;Cost at 10 k queries/mo&lt;/th&gt;
&lt;th&gt;p95 latency&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single agent&lt;/td&gt;
&lt;td&gt;Simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;~AUD 180&lt;/td&gt;
&lt;td&gt;2.1 s&lt;/td&gt;
&lt;td&gt;Hallucinated once traffic &amp;gt; 2 k/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supervisor + 3 subs&lt;/td&gt;
&lt;td&gt;HR triage&lt;/td&gt;
&lt;td&gt;~AUD 420&lt;/td&gt;
&lt;td&gt;4.3 s&lt;/td&gt;
&lt;td&gt;60 % less duplicate Lambda code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentCore Runtime&lt;/td&gt;
&lt;td&gt;SRE co-pilot&lt;/td&gt;
&lt;td&gt;~AUD 620&lt;/td&gt;
&lt;td&gt;3.8 s&lt;/td&gt;
&lt;td&gt;GitOps deploy, full traces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrail-wrapped&lt;/td&gt;
&lt;td&gt;Student chat&lt;/td&gt;
&lt;td&gt;~AUD 520&lt;/td&gt;
&lt;td&gt;4.9 s&lt;/td&gt;
&lt;td&gt;PII blocked, compliance happy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The supervisor pattern is the only one that survived a production spike without a hot-fix.&lt;br&gt;&lt;br&gt;
Single agents are great for a sprint demo – and terrible for anything that hits the internet.&lt;/p&gt;


&lt;h2&gt;
  
  
  Managed agents vs. AgentCore Runtime – pick one before 10 k users
&lt;/h2&gt;

&lt;p&gt;I drew this on a whiteboard for our CFO after she saw the second invoice:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jsxqsrwfvrpn199fqvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jsxqsrwfvrpn199fqvz.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rule we now write into every SoW:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;PoC = managed. Day-1 prod = Runtime.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The moment you need a custom MCP tool or a side-car Lambda, the console becomes a drag.&lt;/p&gt;


&lt;h2&gt;
  
  
  Ground-truth data – skip it and you'll ship a liar
&lt;/h2&gt;

&lt;p&gt;Our first Kindo chatbot went live with 37 manually written examples.&lt;br&gt;&lt;br&gt;
Two weeks later a student asked "What grade do I need to pass?" and the agent calmly invented a 42 % cutoff (it's 50 %).&lt;br&gt;&lt;br&gt;
Cue 4 a.m. rollback.&lt;/p&gt;

&lt;p&gt;We fixed it the boring way:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exported 18 k real (de-identified) chat logs.
&lt;/li&gt;
&lt;li&gt;LLM-expanded edge cases: "give me 50 ways to ask about vacation pay".
&lt;/li&gt;
&lt;li&gt;Human reviewed, 1 200 kept.
&lt;/li&gt;
&lt;li&gt;Added sessionAttributes (studentID, semester) so the agent could look up live data.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Accuracy jumped from 67 % → 92 % and the support ticket queue dropped by half.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pytest harness we run in CI
&lt;/span&gt;&lt;span class="n"&gt;tests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ground_truth.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sessionAttributes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attrs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
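&lt;p&gt;One hedge on that harness: exact string equality gets brittle once the model rephrases a numerically identical answer. A looser check we could swap in (a sketch using the stdlib &lt;code&gt;difflib&lt;/code&gt;; the 0.8 threshold is a guess you'd tune per dataset):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def close_enough(answer: str, expected: str, threshold: float = 0.8) -> bool:
    """Fuzzy comparison for ground-truth checks: exact equality fails the
    moment the model rephrases an otherwise-correct answer."""
    ratio = SequenceMatcher(None, answer.lower(), expected.lower()).ratio()
    return ratio >= threshold

# A rephrased-but-correct answer still passes...
assert close_enough("The pass mark is 50%.", "the pass mark is 50 %")
# ...while an unrelated answer does not.
assert not close_enough("yes", "completely different text")
```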






&lt;h2&gt;
  
  
  Supervisor pattern that actually compiles
&lt;/h2&gt;

&lt;p&gt;Payroll bot rewrite: one supervisor + three specialised subs (policy, leave-balances, tickets).&lt;br&gt;&lt;br&gt;
60 % less copy-paste Lambda code, and we could unit-test each sub in isolation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentcore&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;

&lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20240620-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a router. Never answer directly – always delegate to the correct sub-agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gateway MCP let us plug in ServiceNow REST APIs without rewriting the OpenAPI schema – the biggest time-saver of the sprint.&lt;/p&gt;




&lt;h2&gt;
  
  
  Guardrails – the checkbox that saved our audit
&lt;/h2&gt;

&lt;p&gt;Our first deploy shipped without guardrails.&lt;br&gt;&lt;br&gt;
Next day a student pasted their email + TFN into the chat and the agent happily parroted it back in the response.&lt;br&gt;&lt;br&gt;
Security team put a red sticker on my laptop.&lt;/p&gt;

&lt;p&gt;Now we enforce org-level guardrails before any agent alias hits prod:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Filter&lt;/th&gt;
&lt;th&gt;Block %&lt;/th&gt;
&lt;th&gt;Mask %&lt;/th&gt;
&lt;th&gt;AUD / mo&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PII (email, TFN)&lt;/td&gt;
&lt;td&gt;2.1&lt;/td&gt;
&lt;td&gt;8.4&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom finance terms&lt;/td&gt;
&lt;td&gt;1.7&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hate/violence&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;66&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cheap insurance.&lt;/p&gt;
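&lt;p&gt;For reference, here's roughly what that guardrail looks like as code. This is a sketch of the request body for the Bedrock &lt;code&gt;create_guardrail&lt;/code&gt; API – the guardrail name and the TFN regex are illustrative, so check the boto3 docs for the exact field set before copying:&lt;/p&gt;

```python
import re

# Sketch of a CreateGuardrail request body for the filters in the table above.
# Field names follow boto3's bedrock create_guardrail API; the guardrail name
# and the TFN regex pattern are illustrative, not production values.
guardrail = {
    "name": "student-chat-prod",
    "blockedInputMessaging": "Sorry, I can't process that request.",
    "blockedOutputsMessaging": "Sorry, I can't share that information.",
    "sensitiveInformationPolicyConfig": {
        # Managed PII entity: block emails outright
        "piiEntitiesConfig": [{"type": "EMAIL", "action": "BLOCK"}],
        # Custom regex for Australian Tax File Numbers (pattern is illustrative)
        "regexesConfig": [{
            "name": "AU_TFN",
            "pattern": r"\b\d{3}\s?\d{3}\s?\d{2,3}\b",
            "action": "BLOCK",
        }],
    },
    "contentPolicyConfig": {
        "filtersConfig": [
            {"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
            {"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
        ]
    },
}

# Sanity-check the custom regex before it ever reaches prod
assert re.search(
    guardrail["sensitiveInformationPolicyConfig"]["regexesConfig"][0]["pattern"],
    "my tfn is 123 456 789",
)
```

&lt;p&gt;In CI the dict is passed to &lt;code&gt;boto3.client("bedrock").create_guardrail(**guardrail)&lt;/code&gt; and the returned guardrail ID is attached to the agent alias.&lt;/p&gt;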




&lt;h2&gt;
  
  
  IaC + observability – or you'll debug in the console at 2 a.m.
&lt;/h2&gt;

&lt;p&gt;We template everything in CDK (Python). One &lt;code&gt;cdk deploy&lt;/code&gt; spins up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AgentCore Runtime container&lt;/li&gt;
&lt;li&gt;Lambda layers for Powertools &amp;amp; boto3 latest&lt;/li&gt;
&lt;li&gt;X-Ray traces, CloudWatch dashboards, alarms&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Target&lt;/th&gt;
&lt;th&gt;Alarm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task success&lt;/td&gt;
&lt;td&gt;≥ 95 %&lt;/td&gt;
&lt;td&gt;&amp;lt; 90 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;p95 latency&lt;/td&gt;
&lt;td&gt;≤ 5 s&lt;/td&gt;
&lt;td&gt;&amp;gt; 10 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token spend&lt;/td&gt;
&lt;td&gt;≤ AUD 70/day&lt;/td&gt;
&lt;td&gt;&amp;gt; AUD 140&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII leak count&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&amp;gt; 0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Routing loops show up as 30 s p99 spikes – impossible to spot without traces.&lt;/p&gt;
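&lt;p&gt;That last point is worth a tiny sanity check you can run offline. With nearest-rank percentiles (the numbers below are invented), p95 can sit comfortably inside target while a couple of looping requests blow out the tail:&lt;/p&gt;

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile (q in (0, 1]) of latency samples, in seconds."""
    ranked = sorted(samples)
    idx = max(0, math.ceil(q * len(ranked)) - 1)
    return ranked[idx]

# 98 healthy requests at ~3.8 s, two stuck in a routing loop:
latencies = [3.8] * 98 + [31.0, 29.5]
assert percentile(latencies, 0.95) <= 5   # p95 target from the table still met
assert percentile(latencies, 0.99) > 10   # the tail spike is what gives the loop away
```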




&lt;h2&gt;
  
  
  10-line deploy checklist we paste into every PR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] 200+ ground-truth conversations in &lt;code&gt;tests/ground_truth.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Supervisor agent uses Sonnet; subs pinned to Haiku for cost&lt;/li&gt;
&lt;li&gt;[ ] Guardrails alias attached (BLOCK PII, MASK custom)&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;agentcore deploy --stage prod --approve&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Powertools tracer + metrics on every handler&lt;/li&gt;
&lt;li&gt;[ ] CloudWatch alarm for "PII leak &amp;gt; 0" – page the on-call&lt;/li&gt;
&lt;li&gt;[ ] A/B toggle for Haiku fallback if token spend &amp;gt; budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starter repo we fork every time:&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/awslabs/amazon-bedrock-agentcore-samples" rel="noopener noreferrer"&gt;https://github.com/awslabs/amazon-bedrock-agentcore-samples&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;80 % of the boilerplate is done in 90 min – the rest is ground-truth grunt work.&lt;/p&gt;




&lt;p&gt;If you're riding the agent hype-wave right now, remember: &lt;strong&gt;the demo is the easy 10 %&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These notes are for the other 90 % – the invoices, the guardrails, the 3 a.m. pages.&lt;/p&gt;

&lt;p&gt;Steal what you need, add your own scars, and ship something that won't hallucinate when the CFO asks it a question.&lt;/p&gt;

&lt;p&gt;Happy building, and may your p95 always be under 5 s.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>agentcore</category>
    </item>
    <item>
      <title>From Windows/Corona to Linux V-Ray Standalone on AWS Deadline Cloud – Architecture That Actually Worked</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Fri, 31 Oct 2025 12:48:30 +0000</pubDate>
      <link>https://forem.com/rajmurugan/from-windowscorona-to-linux-v-ray-standalone-on-aws-deadline-cloud-architecture-that-actually-1kcb</link>
      <guid>https://forem.com/rajmurugan/from-windowscorona-to-linux-v-ray-standalone-on-aws-deadline-cloud-architecture-that-actually-1kcb</guid>
      <description>&lt;p&gt;Over the last few weeks I moved a real production scene from Windows/Corona to Linux V-Ray Standalone on AWS Deadline Cloud. This isn't a hello-world write-up—it's the practical path that got 400 frames moving reliably, with the guardrails that kept things from falling over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why V-Ray Standalone on Linux
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost and scale&lt;/strong&gt;: spot capacity on Linux is abundant and significantly cheaper; per-frame costs dropped materially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stability&lt;/strong&gt;: lean workers, faster boot, fewer moving parts than full DCC stacks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt;: a well-formed .vrscene plus clean paths travels across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture at a glance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Submit&lt;/strong&gt;: 3ds Max exports a .vrscene and a tiny JSON sidecar with frame range and metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue&lt;/strong&gt;: the render manager expands the frame list (e.g., 0-399x1) into 1 task per frame.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt;: Linux images/containers with V-Ray Standalone and a small pre-task that rewrites Windows/UNC paths to Linux mounts and validates every reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage options&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Job Attachments&lt;/em&gt;: upload scene+assets once; content-hash dedupe; great for portability.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Sync/Mounts&lt;/em&gt;: Resilio/NAS to EFS/FSx mounts; great for giant libraries and rapid iteration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Output&lt;/strong&gt;: frames land in S3 (or mounted storage) and mirror back on-prem if needed.&lt;/li&gt;

&lt;/ul&gt;
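&lt;p&gt;The frame-range expansion above is trivial but worth pinning down, because off-by-one frames are a classic render-farm bug. A sketch of the &lt;code&gt;start-endxstep&lt;/code&gt; parser (the function name is mine, and it deliberately ignores negative frame numbers):&lt;/p&gt;

```python
def expand_frames(spec: str) -> list[int]:
    """Expand a 'start-endxstep' frame spec (e.g. '0-399x1') into explicit
    frame numbers, so the queue can create exactly one task per frame."""
    rng, _, step = spec.partition("x")
    start, _, end = rng.partition("-")
    return list(range(int(start), int(end) + 1, int(step or 1)))

frames = expand_frames("0-399x1")
assert len(frames) == 400 and frames[0] == 0 and frames[-1] == 399
```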

&lt;h2&gt;
  
  
  What I shipped first (MVP)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;One .vrscene&lt;/li&gt;
&lt;li&gt;One asset bundle (attachments) or a mounted project share&lt;/li&gt;
&lt;li&gt;One pre-task: path mapping + sanity checks&lt;/li&gt;
&lt;li&gt;One queue with a clean frame list&lt;/li&gt;
&lt;li&gt;Checkpointing enabled (tunable interval) to survive interruptions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Path mapping that kept me sane
&lt;/h2&gt;

&lt;p&gt;I avoid "parse everything" and instead declare path intents. A simple JSON map drives a conservative rewrite; anything not matching known roots is left untouched, then validated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example path map (sidecar)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mappings": [
    { "win": "C:\\Projects\\", "linux": "/mnt/projects/" },
    { "win": "\\\\nas\\assets\\", "linux": "/mnt/assets/" }
  ],
  "fps": 25,
  "start": 0,
  "end": 399,
  "step": 1
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  
  
  Pre-task outline (Python)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Read scene.meta.json and path map.&lt;/li&gt;
&lt;li&gt;Scan the .vrscene for Windows/UNC paths, rewrite to Linux.&lt;/li&gt;
&lt;li&gt;Verify existence; if any missing, print a short remediation report and exit non-zero.&lt;/li&gt;
&lt;li&gt;Hand the cleaned .vrscene to V-Ray.&lt;/li&gt;
&lt;/ul&gt;
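&lt;p&gt;The rewrite step of that outline, sketched in Python (the mapping roots are illustrative, matching the path-map example above; anything outside the declared roots is deliberately left untouched for the validator to flag):&lt;/p&gt;

```python
import re

def rewrite_paths(text: str, mappings: list[dict]) -> str:
    """Rewrite declared Windows/UNC roots to their Linux mounts and normalise
    separators under them; paths outside the declared roots are left alone."""
    # Longest roots first, so a more specific root wins over its parent
    for m in sorted(mappings, key=lambda mp: len(mp["win"]), reverse=True):
        pattern = re.escape(m["win"]) + r'([^"\s]*)'
        text = re.sub(
            pattern,
            lambda match: m["linux"] + match.group(1).replace("\\", "/"),
            text,
        )
    return text

mappings = [
    {"win": "C:\\Projects\\", "linux": "/mnt/projects/"},
    {"win": "\\\\nas\\assets\\", "linux": "/mnt/assets/"},
]
line = r'file="C:\Projects\shot010\tex\wood.jpg" ref="D:\Elsewhere\x.png"'
# Known root is rewritten; the unknown D:\ root is left for the validator to flag.
assert rewrite_paths(line, mappings) == (
    'file="/mnt/projects/shot010/tex/wood.jpg" ref="D:\\Elsewhere\\x.png"'
)
```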

&lt;h2&gt;
  
  
  Attachments vs Sync (when I pick which)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Attachments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: portable, deduped, reproducible; perfect for contained shots (&amp;lt;20-30 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: pay upload cost; less ideal for very frequent micro-edits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sync/Mounts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pros&lt;/strong&gt;: great for giant shared libraries, instant edits; familiar artist workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cons&lt;/strong&gt;: cold caches and path drift can bite; reproducibility depends on discipline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rule of thumb now
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Shot-specific data → Attachments&lt;/li&gt;
&lt;li&gt;Global/shared libraries → Mounts&lt;/li&gt;
&lt;li&gt;Hybrid is fine&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Licensing notes that saved time
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bring-your-own licenses work well if the server is reachable with low latency—preflight a license ping and fail fast if checkout trips.&lt;/li&gt;
&lt;li&gt;Usage-based licensing is a clean burst option when seats run tight.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Checkpointing/resume (aka sticky rendering)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep it on, but measure the overhead. I start at 10 minutes; 5-15 minutes is the practical band depending on frame length and disk I/O.&lt;/li&gt;
&lt;li&gt;Store checkpoints on local NVMe; avoid remote writes in the hot loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How I made ~400 frames feel easy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1 task per frame → clean retries and metrics.&lt;/li&gt;
&lt;li&gt;Target workers ≈ frame count, respecting license ceilings.&lt;/li&gt;
&lt;li&gt;Guardrails: budget tags, per-queue caps, idle scale-in after the tail finishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Observability that actually mattered
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-frame&lt;/strong&gt;: queue wait, time-to-first-pixel, render time, upload time, cost per frame.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-fleet&lt;/strong&gt;: desired vs healthy, interruptions, cache hit rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs I read first&lt;/strong&gt;: pre-task "missing assets" summary, V-Ray headers/footers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Things that failed (and fixes)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mixed slashes/whitespace in texture paths → normalize separators and quote paths.&lt;/li&gt;
&lt;li&gt;"Works on my machine" exports → validation that lists all external refs by type before export.&lt;/li&gt;
&lt;li&gt;UNC vs drive letter roots → always include both in mapping.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Submission checklist (copy-paste)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Render Setup has the correct frame list (or the submitter enforces it).&lt;/li&gt;
&lt;li&gt;Export .vrscene + scene.meta.json + pathmap.json.&lt;/li&gt;
&lt;li&gt;Pick one storage mode per shot (Attachments vs Sync) and stick to it.&lt;/li&gt;
&lt;li&gt;Select the Linux V-Ray queue; set frames (e.g., 0-399x1).&lt;/li&gt;
&lt;li&gt;Enable checkpoints; pick interval.&lt;/li&gt;
&lt;li&gt;Tag the job (project, shot, budget).&lt;/li&gt;
&lt;li&gt;Submit; only open logs when a task fails—start with the pre-task.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  If you're starting today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ship the pre-task and one shot with attachments first; it removes 80% of unknowns.&lt;/li&gt;
&lt;li&gt;Add sync mounts later if iteration speed demands it.&lt;/li&gt;
&lt;li&gt;Keep exporters simple and submitters smart: the job owns the frame list; the .vrscene is the payload.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I'd love to hear
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your path-mapping rules for mixed Windows/UNC environments.&lt;/li&gt;
&lt;li&gt;Checkpoint intervals that worked best for long frames.&lt;/li&gt;
&lt;li&gt;Any gotchas with VRMesh/proxy paths across platforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If there's interest, I'll post the pre-task template, a scene.meta.json schema, and a ready-to-use dashboard for frame-time and cost.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>linux</category>
    </item>
    <item>
      <title>Which scaling tool should be used for Kubernetes clusters on AWS?</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Tue, 30 Sep 2025 13:53:01 +0000</pubDate>
      <link>https://forem.com/rajmurugan/which-scaling-tool-should-be-used-for-kubernetes-clusters-on-aws-f11</link>
      <guid>https://forem.com/rajmurugan/which-scaling-tool-should-be-used-for-kubernetes-clusters-on-aws-f11</guid>
      <description>&lt;p&gt;Kubernetes autoscaling on AWS has evolved dramatically over the past few years. What started with the traditional Cluster Autoscaler has now expanded to include innovative solutions like Karpenter and the newest addition, EKS Auto Mode. As AWS solution architects, we're constantly evaluating which approach best fits our workloads and operational requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Scaling Solutions Explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  EKS Auto Mode: The Fully Managed Approach
&lt;/h3&gt;

&lt;p&gt;EKS Auto Mode represents AWS's vision of "Kubernetes without the operational overhead." Launched in 2024, it's designed for teams who want production-ready clusters with minimal configuration and maximum automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully managed infrastructure&lt;/strong&gt;: AWS handles everything from node provisioning to security patching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in Karpenter&lt;/strong&gt;: Leverages AWS-managed Karpenter for intelligent scaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced security&lt;/strong&gt;: Immutable AMIs with SELinux and read-only root filesystems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;21-day node lifecycle&lt;/strong&gt;: Automatic node replacement for security and compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-click setup&lt;/strong&gt;: Production-grade defaults out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key architectural difference is that EKS Auto Mode uses a Karpenter-based system that automatically provisions EC2 instances in response to pod requests, but with AWS managing the entire Karpenter lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Karpenter: The Intelligent Optimizer
&lt;/h3&gt;

&lt;p&gt;Karpenter revolutionized Kubernetes node provisioning by eliminating the need for predefined node groups and providing just-in-time capacity provisioning. Developed by AWS as an open-source project, it addresses the limitations of traditional autoscaling approaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Capabilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct EC2 integration&lt;/strong&gt;: Bypasses Auto Scaling Groups for faster provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload-aware scaling&lt;/strong&gt;: Analyzes pending pods to select optimal instance types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced consolidation&lt;/strong&gt;: Automatically right-sizes and bin-packs workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spot instance intelligence&lt;/strong&gt;: First-class support for cost-effective Spot instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source flexibility&lt;/strong&gt;: Community-driven with extensive customization options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike Cluster Autoscaler which works with Auto Scaling Groups, Karpenter works directly with EC2 instances. It employs application-aware scheduling, considering pod-specific requirements like taints, tolerations, and node affinity for additional customization and flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Autoscaler: The Battle-Tested Veteran
&lt;/h3&gt;

&lt;p&gt;The Cluster Autoscaler remains the most widely deployed autoscaling solution, offering broad compatibility and proven reliability across multiple cloud providers. It scales based on pending pods and uses predefined scaling policies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node group based&lt;/strong&gt;: Works with predefined Auto Scaling Groups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cloud support&lt;/strong&gt;: Compatible with AWS, GCP, Azure, and other providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mature ecosystem&lt;/strong&gt;: Extensive documentation and community support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conservative scaling&lt;/strong&gt;: Predictable behavior with established patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple integration&lt;/strong&gt;: Works with existing ASG configurations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  At a Glance: Key Differentiators
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z5xcms9yzepz2fvcstu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z5xcms9yzepz2fvcstu.png" alt="Autoscaling solutions Comparision" width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Comparison
&lt;/h2&gt;




&lt;h2&gt;
  
  
  Performance and Cost Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scaling Performance
&lt;/h3&gt;

&lt;p&gt;The scaling speed varies significantly between solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EKS Auto Mode &amp;amp; Karpenter&lt;/strong&gt;: Both provision new nodes in under 60 seconds due to direct EC2 API integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaler&lt;/strong&gt;: Typically takes 2-5 minutes due to ASG policy execution and health check delays.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EKS Auto Mode&lt;/strong&gt; provides hands-off cost optimization through automatic capacity planning and dynamic scaling. It removes underutilized nodes, replaces expensive nodes with cheaper alternatives, and consolidates workloads onto more efficient compute resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; offers excellent cost optimization by launching right-sized nodes and culling unused resources. It can automatically provision a mix of On-Demand and Spot instances, dynamically choosing the most cost-effective options.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaler&lt;/strong&gt; provides basic cost optimization by scaling down underutilized nodes. However, it doesn't directly manage Spot instances or automatically optimize costs based on instance type selection.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5on3ogljrbn0rj4bnot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5on3ogljrbn0rj4bnot.png" alt="Decision Flowchart" width="784" height="2100"&gt;&lt;/a&gt;&lt;br&gt;
Decision Flowchart: Choosing the Right AWS Kubernetes Scaling Solution&lt;/p&gt;




&lt;h2&gt;
  
  
  When to Choose Each Solution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose EKS Auto Mode If:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You want minimal operational overhead&lt;/li&gt;
&lt;li&gt;Enhanced security and compliance are priorities&lt;/li&gt;
&lt;li&gt;You're starting a new Kubernetes journey&lt;/li&gt;
&lt;li&gt;AWS-managed infrastructure aligns with your strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EKS Auto Mode is ideal for production-ready clusters with minimal operational overhead. It's particularly effective for teams that want to focus on application logic rather than cluster management.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Karpenter If:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cost optimization is your primary concern&lt;/li&gt;
&lt;li&gt;You need advanced customization capabilities&lt;/li&gt;
&lt;li&gt;Your team has Kubernetes expertise&lt;/li&gt;
&lt;li&gt;Workloads have diverse resource requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpenter is particularly well suited to larger, heterogeneous clusters with high workload churn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose Cluster Autoscaler If:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You have existing Auto Scaling Group infrastructure&lt;/li&gt;
&lt;li&gt;Multi-cloud consistency is important&lt;/li&gt;
&lt;li&gt;Conservative, predictable scaling is preferred&lt;/li&gt;
&lt;li&gt;Team prefers battle-tested solutions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster Autoscaler is perfect if you have traditional workloads and don't need rapid scaling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Migration Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From Cluster Autoscaler to Auto Mode
&lt;/h3&gt;

&lt;p&gt;When migrating to EKS Auto Mode from Cluster Autoscaler, uninstall any components now managed by Auto Mode, like Karpenter or the AWS Load Balancer Controller. Also ensure your installed add-ons are up-to-date.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Cloud Strategy
&lt;/h3&gt;

&lt;p&gt;For organizations requiring multi-cloud deployment, Cluster Autoscaler provides the broadest compatibility. Karpenter currently works with AWS, Azure, and Alibaba Cloud, while EKS Auto Mode is AWS-specific.&lt;/p&gt;




&lt;h2&gt;
  
  
  Security and Compliance
&lt;/h2&gt;

&lt;p&gt;EKS Auto Mode provides enhanced security features including immutable AMIs, SELinux mandatory access controls, and read-only root file systems. Nodes have a maximum lifetime of 21 days, after which they are automatically replaced with new nodes.&lt;/p&gt;

&lt;p&gt;Both Karpenter and Cluster Autoscaler rely on standard Kubernetes security models, with customers responsible for managing security configurations and patching.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The evolution from Cluster Autoscaler to Karpenter to EKS Auto Mode represents AWS's journey toward simplified, intelligent, and fully managed Kubernetes operations. Each solution serves different organizational needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EKS Auto Mode&lt;/strong&gt; offers the simplest operational model with AWS handling all infrastructure management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; provides the best balance of cost optimization and flexibility for teams comfortable with open-source management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaler&lt;/strong&gt; remains the most mature and broadly compatible option for traditional deployment patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As Kubernetes continues to mature, the trend clearly moves toward managed services that reduce operational complexity while maintaining the flexibility that makes Kubernetes powerful. EKS Auto Mode represents this future, but the choice ultimately depends on your specific requirements, team capabilities, and organizational constraints.&lt;/p&gt;

&lt;p&gt;Whether you choose the fully managed simplicity of EKS Auto Mode, the intelligent optimization of Karpenter, or the battle-tested reliability of Cluster Autoscaler, the key is understanding how each solution aligns with your operational goals and scaling requirements.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with these autoscaling solutions? Share your insights and questions in the comments below!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Tags: #aws #kubernetes #eks #autoscaling #karpenter #devops #cloudnative&lt;/em&gt;&lt;/p&gt;

</description>
      <category>eks</category>
      <category>karpenter</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>AWS DMS Fails After Azure SQL Managed Instance Failover? Here’s How to Fix It</title>
      <dc:creator>Raj Murugan</dc:creator>
      <pubDate>Thu, 13 Feb 2025 16:35:53 +0000</pubDate>
      <link>https://forem.com/rajmurugan/aws-dms-fails-after-azure-sql-managed-instance-failover-heres-how-to-fix-it-119k</link>
      <guid>https://forem.com/rajmurugan/aws-dms-fails-after-azure-sql-managed-instance-failover-heres-how-to-fix-it-119k</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Problem Statement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A customer is using &lt;strong&gt;AWS Database Migration Service (DMS)&lt;/strong&gt; to migrate data from &lt;strong&gt;Azure SQL Managed Instance&lt;/strong&gt; to &lt;strong&gt;AWS RDS PostgreSQL&lt;/strong&gt; over a &lt;strong&gt;VPN&lt;/strong&gt; in &lt;strong&gt;Full Load + Change Data Capture (CDC)&lt;/strong&gt; mode. The migration runs smoothly until &lt;strong&gt;Azure performs maintenance&lt;/strong&gt;, triggering a failover. When this happens, DMS fails to locate the &lt;strong&gt;last tracked LSN (Log Sequence Number)&lt;/strong&gt;, causing a &lt;strong&gt;fatal error&lt;/strong&gt; that stops the task.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0k6cy05nzcmwuph9oj6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0k6cy05nzcmwuph9oj6.png" alt="Error" width="800" height="381"&gt;&lt;/a&gt;&lt;br&gt;
Currently, the only workaround is to &lt;strong&gt;manually restart&lt;/strong&gt; the DMS task and perform a &lt;strong&gt;full load&lt;/strong&gt; again, which is inefficient and time-consuming.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding the Issue&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When an &lt;strong&gt;Azure SQL Managed Instance failover&lt;/strong&gt; occurs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;old primary&lt;/strong&gt; is demoted, and a &lt;strong&gt;new primary&lt;/strong&gt; takes over.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;transaction logs on the new primary&lt;/strong&gt; may not have the same &lt;strong&gt;LSN sequence&lt;/strong&gt; as the old primary.&lt;/li&gt;
&lt;li&gt;AWS DMS attempts to &lt;strong&gt;resume CDC from the last tracked LSN&lt;/strong&gt;, but if that LSN &lt;strong&gt;doesn’t exist on the new primary&lt;/strong&gt;, DMS throws an &lt;strong&gt;error and stops the task&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;This issue arises because &lt;strong&gt;SQL Server Availability Groups do not synchronize backup history across replicas&lt;/strong&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Potential Solutions &amp;amp; Trade-offs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since this scenario is &lt;strong&gt;easy to replicate&lt;/strong&gt;, testing these solutions in a &lt;strong&gt;non-production environment&lt;/strong&gt; first is recommended.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1️⃣ Temporary Fix – Manual Action Required on Every Failover&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Find the latest LSN on the new primary:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fn_cdc_get_max_lsn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CurrentMaxLSN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Restart DMS using the &lt;strong&gt;new LSN&lt;/strong&gt; via &lt;strong&gt;AWS CLI&lt;/strong&gt; or &lt;strong&gt;console&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
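&lt;p&gt;A minimal sketch of that restart with the AWS CLI (the task ARN and LSN below are placeholders, and the right &lt;code&gt;--start-replication-task-type&lt;/code&gt; for a Full Load + CDC task is worth confirming against the DMS documentation for your configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Placeholder values -- substitute your task ARN and the LSN from the query above
TASK_ARN="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"
NEW_LSN="0000ABCD:00001234:0001"

# Stop the failed task, then resume CDC from the new native start point
aws dms stop-replication-task --replication-task-arn "$TASK_ARN"
aws dms start-replication-task \
  --replication-task-arn "$TASK_ARN" \
  --start-replication-task-type resume-processing \
  --cdc-start-position "$NEW_LSN"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;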

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;br&gt;
✅ Simple and requires no architectural changes.&lt;br&gt;&lt;br&gt;
❌ Requires manual intervention every time a failover occurs.&lt;br&gt;&lt;br&gt;
❌ Increases downtime and operational overhead.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;2️⃣ Permanent Fix – Adjust DMS Settings (May Not Work for Azure SQL MI)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Set:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"AlwaysOnSharedSynchedBackupIsEnabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This allows DMS to &lt;strong&gt;poll all nodes in the Always On cluster&lt;/strong&gt; for transaction backups.&lt;/li&gt;
&lt;li&gt;Works for &lt;strong&gt;on-prem SQL Server&lt;/strong&gt;, but may &lt;strong&gt;prevent reading from the old primary&lt;/strong&gt; in &lt;strong&gt;Azure SQL MI&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;br&gt;
✅ Fully automated once configured.&lt;br&gt;&lt;br&gt;
✅ No need for manual intervention.&lt;br&gt;&lt;br&gt;
❌ May not be supported in &lt;strong&gt;Azure SQL MI&lt;/strong&gt;, requiring additional testing.&lt;br&gt;&lt;br&gt;
❌ Potential data consistency risks if backups are not fully synchronized.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3️⃣ Permanent Fix – Use Transaction Log Backups Instead of Live Logs&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL Managed Instance&lt;/strong&gt; creates &lt;strong&gt;transaction log backups&lt;/strong&gt; every ~10 minutes.&lt;/li&gt;
&lt;li&gt;These backups are &lt;strong&gt;consistent across failovers&lt;/strong&gt;, avoiding &lt;strong&gt;LSN loss&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DMS v3.5.3+&lt;/strong&gt; supports reading from log backups for &lt;strong&gt;Amazon RDS for SQL Server&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grant necessary permissions&lt;/strong&gt; for DMS to read log backups:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;  &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;msdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_dms_tlog_download&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rds_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;msdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_dms_tlog_read&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rds_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;msdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_dms_tlog_list_current_lsn&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rds_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;msdb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rds_task_status&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rds_user&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;More details: &lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/automated-backups-overview?view=azuresql" rel="noopener noreferrer"&gt;Azure SQL Managed Instance automated backups overview&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📌 &lt;em&gt;Supported for Amazon RDS for SQL Server from DMS v3.5.3+ but may not work for Azure SQL MI—needs testing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;br&gt;
✅ Provides a seamless failover experience without breaking CDC.&lt;br&gt;&lt;br&gt;
✅ No manual intervention required after failovers.&lt;br&gt;&lt;br&gt;
❌ Might not be supported in Azure SQL MI, requiring thorough testing.&lt;br&gt;&lt;br&gt;
❌ Backup frequency (~10 min) could introduce minor delays in CDC.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Alternative SQL Managed Instance Solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4️⃣ Synchronize Backup History Across Replicas&lt;/strong&gt; (If Solution #2 Fails)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If the &lt;strong&gt;secondary replica&lt;/strong&gt; is &lt;strong&gt;not configured for read access&lt;/strong&gt;, set:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nl"&gt;"AlwaysOnSharedSynchedBackupIsEnabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This forces DMS to use the &lt;strong&gt;primary replica&lt;/strong&gt; as a standalone SQL Server.&lt;/li&gt;
&lt;li&gt;However, SQL Server does &lt;strong&gt;not automatically synchronize backup metadata across replicas&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Workaround: Run a &lt;strong&gt;script on the replica where backups occur&lt;/strong&gt; to register backup history on &lt;strong&gt;other replicas&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Example script &amp;amp; details: &lt;a href="https://community.qlik.com/t5/Official-Support-Articles/Qlik-Replicate-and-SQL-Server-AG-HA-Handling/ta-p/2156786" rel="noopener noreferrer"&gt;Qlik Replicate &amp;amp; SQL Server AG HA Handling&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;
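&lt;p&gt;As a rough sketch of what that script works with, the query below (the database name is a placeholder) lists the backup-history rows on the replica where backups occur -- the metadata the workaround must register on the other replicas; the full registration script is in the Qlik article linked above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the replica where backups are taken; 'YourDatabase' is a placeholder
SELECT database_name,
       type,              -- 'D' = full backup, 'L' = log backup
       backup_start_date,
       first_lsn,
       last_lsn
FROM msdb.dbo.backupset
WHERE database_name = 'YourDatabase'
ORDER BY backup_start_date DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;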

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;br&gt;
✅ Ensures backup history is consistent across replicas.&lt;br&gt;&lt;br&gt;
✅ Works well with Always On Availability Groups.&lt;br&gt;&lt;br&gt;
❌ Requires additional scripting and maintenance overhead.&lt;br&gt;&lt;br&gt;
❌ May not work in all configurations, depending on permissions and replication settings.  &lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Failovers in &lt;strong&gt;Azure SQL Managed Instance&lt;/strong&gt; disrupt AWS DMS &lt;strong&gt;CDC&lt;/strong&gt; due to LSN mismatches. Using &lt;strong&gt;transaction log backups&lt;/strong&gt; or &lt;strong&gt;modifying DMS settings&lt;/strong&gt; can help mitigate the issue. Testing these solutions in a &lt;strong&gt;staging environment&lt;/strong&gt; before deploying to production is essential.&lt;/p&gt;

&lt;p&gt;Let me know if you’ve encountered similar issues or tested any of these solutions! 🚀&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/automated-backups-overview?view=azuresql" rel="noopener noreferrer"&gt;AWS DMS SQL Server Permissions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://community.qlik.com/t5/Official-Support-Articles/Qlik-Replicate-and-SQL-Server-AG-HA-Handling/ta-p/2156786" rel="noopener noreferrer"&gt;Qlik Replicate &amp;amp; SQL Server AG HA Handling&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>sqlserver</category>
      <category>dms</category>
    </item>
  </channel>
</rss>
