<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Danny Teller</title>
    <description>The latest articles on Forem by Danny Teller (@danny_teller_8d3026771d11).</description>
    <link>https://forem.com/danny_teller_8d3026771d11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3856238%2Fadd1a2c8-688f-4507-b5c9-9f801aef3d03.jpg</url>
      <title>Forem: Danny Teller</title>
      <link>https://forem.com/danny_teller_8d3026771d11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/danny_teller_8d3026771d11"/>
    <language>en</language>
    <item>
      <title>AgentCore Registry: 16 Skills, 1 Hour, Zero Downtime</title>
      <dc:creator>Danny Teller</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:43:03 +0000</pubDate>
      <link>https://forem.com/aws-builders/agentcore-registry-16-skills-1-hour-zero-downtime-4ne7</link>
      <guid>https://forem.com/aws-builders/agentcore-registry-16-skills-1-hour-zero-downtime-4ne7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The story of migrating our governance agent from hardcoded skills to dynamic Registry loading — the wins, the gotchas, and what we learned along the way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Bother?
&lt;/h2&gt;

&lt;p&gt;Our AWS governance agent has 16 domain skills — security analysis, cost optimization, network intelligence, the works. Every single one of them was baked into the system prompt on every request. Ask about a single S3 bucket? Here's all 16 skill descriptions anyway.&lt;/p&gt;

&lt;p&gt;That's a lot of tokens doing nothing useful.&lt;/p&gt;

&lt;p&gt;The goal was simple: move skill definitions to AgentCore Registry so the agent loads only what it needs per request. Less prompt bloat, smarter skill selection, and the foundation for per-thread skill loading down the road.&lt;/p&gt;

&lt;p&gt;The key constraint: Registry acts as a &lt;strong&gt;catalog only&lt;/strong&gt;. Tool implementations stay in the agent code where they belong (lowest latency). The Registry just stores metadata — name, description, instructions, which tools a skill can use.&lt;/p&gt;

&lt;p&gt;And because we're not reckless, we added a &lt;code&gt;USE_REGISTRY&lt;/code&gt; env var toggle. Flip it off, and everything goes back to the old local files. Zero drama.&lt;/p&gt;
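&lt;p&gt;A minimal sketch of that toggle (the loader names here are hypothetical; the point is that both code paths stay intact, so rollback is an env var flip, not a deploy):&lt;/p&gt;

```python
import os

def pick_skill_source(env=os.environ):
    """Select the skill source based on the USE_REGISTRY toggle.

    Hypothetical selector; in practice each branch would return the
    corresponding loader (Registry plugin vs. local SKILL.md files).
    """
    if env.get("USE_REGISTRY", "false").lower() == "true":
        return "registry"   # dynamic loading from AgentCore Registry
    return "local"          # the old bundled skill files, untouched
```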

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjt8mcaomwk67rdtt57x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsjt8mcaomwk67rdtt57x.png" alt="before_after" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;We used Claude Code with agentic planning to break it into 7 phases: SDK upgrades, plugin creation, upload scripts, IAM permissions, tests, this journal, and documentation. The whole thing — plan through production deployment — took about an hour. A few decisions we locked in early:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Registry as catalog, not runtime&lt;/strong&gt; — tools stay in-process for speed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One record per skill&lt;/strong&gt; — 16 skills, 16 records, clean mapping&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session-scoped caching&lt;/strong&gt; — fetch the catalog once per session, not every turn&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-approval enabled&lt;/strong&gt; — all uploads come from our repo, no human review gate needed&lt;/li&gt;
&lt;/ul&gt;
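&lt;p&gt;The session-scoped caching decision can be sketched as a small wrapper; &lt;code&gt;fetch_catalog&lt;/code&gt; and &lt;code&gt;fetch_skill&lt;/code&gt; stand in for whatever actually hits the Registry:&lt;/p&gt;

```python
class SessionSkillCache:
    """One catalog fetch per session, not per turn.

    fetch_catalog is any callable that hits the Registry; this is a
    sketch of the caching shape, not the real plugin.
    """
    def __init__(self, fetch_catalog):
        self._fetch = fetch_catalog
        self._catalog = None
        self._skills = {}

    def catalog(self):
        if self._catalog is None:        # only the first turn pays the fetch
            self._catalog = self._fetch()
        return self._catalog

    def skill(self, name, fetch_skill):
        if name not in self._skills:     # individual skills cached after first use
            self._skills[name] = fetch_skill(name)
        return self._skills[name]
```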

&lt;h2&gt;
  
  
  Setup: Console and CLI Only
&lt;/h2&gt;

&lt;p&gt;As of April 2026, AgentCore Registry has no Terraform provider and no CloudFormation support. So we did everything through the AWS Console and CLI — which honestly worked fine for a one-time setup, but it's worth knowing if you're planning to IaC everything from day one.&lt;/p&gt;

&lt;p&gt;Created the registry in the console. We already had a working JWT auth setup with Azure Entra ID on our main AgentCore agent, so we matched that same configuration. Straightforward.&lt;/p&gt;

&lt;p&gt;Then we needed the Registry ID.&lt;/p&gt;

&lt;p&gt;You'd think there'd be a labeled field with a copy button, like every other AWS service. There isn't. The ID is buried inside the ARN — you have to extract it yourself, or notice it tucked into sample CLI commands at the bottom of the page.&lt;/p&gt;

&lt;p&gt;Not a showstopper, but the kind of thing that makes you squint at the screen for five minutes wondering what you're missing.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dear AWS&lt;/strong&gt;: A "Registry ID" field with a copy button would save everyone some confusion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Phase 1–2: Foundation and Plugin
&lt;/h2&gt;

&lt;p&gt;Bumped our SDK versions (&lt;code&gt;strands-agents&lt;/code&gt; and &lt;code&gt;boto3&lt;/code&gt;), added config fields, and built the new plugin — a drop-in replacement for the old &lt;code&gt;AgentSkills&lt;/code&gt; loader.&lt;/p&gt;

&lt;p&gt;One thing worth knowing: &lt;strong&gt;match your Registry auth to your agent's auth.&lt;/strong&gt; If your agent uses IAM, create an IAM-auth registry. If your agent uses OIDC, create a JWT registry. We use OIDC on our agent, so we went with JWT — and since the plugin runs with the agent's IAM execution role, we use the &lt;strong&gt;control plane&lt;/strong&gt; APIs (&lt;code&gt;list_registry_records&lt;/code&gt; + &lt;code&gt;get_registry_record&lt;/code&gt;) which accept IAM regardless of registry type.&lt;/p&gt;
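&lt;p&gt;Roughly what the catalog fetch looks like. Only the two operation names come from our setup; the pagination fields and response shape here are assumptions, so treat this as a sketch:&lt;/p&gt;

```python
def load_skill_catalog(client, registry_id):
    """Fetch skill metadata via the control-plane APIs, which accept IAM
    regardless of registry auth type.

    client is a boto3 client for the Registry; field names such as
    'records' and 'nextToken' are illustrative assumptions.
    """
    records = []
    token = None
    while True:
        kwargs = {"registryId": registry_id}
        if token:
            kwargs["nextToken"] = token
        resp = client.list_registry_records(**kwargs)
        records.extend(resp.get("records", []))
        token = resp.get("nextToken")
        if token is None:
            break
    # Full instructions are fetched per record only when a skill is
    # activated; here we only build the name/description catalog.
    return {r["name"]: r.get("description", "") for r in records}
```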

&lt;h2&gt;
  
  
  Phase 3: Upload Scripts (Where Things Got Interesting)
&lt;/h2&gt;

&lt;p&gt;We wrote scripts to push all 16 skill files to the Registry. Two things bit us:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The frontmatter problem.&lt;/strong&gt; Our SDK's &lt;code&gt;Skill.instructions&lt;/code&gt; field helpfully strips the YAML frontmatter from skill files. The Registry's upload API helpfully requires it. We switched to reading raw file content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The async creation dance.&lt;/strong&gt; &lt;code&gt;create_registry_record&lt;/code&gt; returns immediately with an ARN but no record ID field. The record is in &lt;code&gt;CREATING&lt;/code&gt; state. You have to: extract the ID from the ARN, wait for it to reach &lt;code&gt;DRAFT&lt;/code&gt;, then submit it for approval as a separate step. None of this is one API call.&lt;/p&gt;
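&lt;p&gt;Sketched out, the dance looks something like this. &lt;code&gt;create_registry_record&lt;/code&gt; and &lt;code&gt;get_registry_record&lt;/code&gt; are the operations we used; the submit call, the field names, and the ARN layout are illustrative assumptions:&lt;/p&gt;

```python
import time

def record_id_from_arn(arn):
    """The create response has no ID field; pull it from the ARN.
    Assumes the ID is the last path segment."""
    return arn.split("/")[-1]

def create_and_submit(client, registry_id, name, raw_markdown, poll_seconds=2):
    """Create the record, poll until it leaves CREATING, then submit it
    for approval as a separate step. A sketch, not the real script."""
    resp = client.create_registry_record(
        registryId=registry_id, name=name, content=raw_markdown)
    record_id = record_id_from_arn(resp["recordArn"])
    while True:
        status = client.get_registry_record(
            registryId=registry_id, recordId=record_id)["status"]
        if status == "DRAFT":
            break
        time.sleep(poll_seconds)   # still CREATING; keep polling
    client.submit_registry_record(registryId=registry_id, recordId=record_id)
    return record_id
```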

&lt;h2&gt;
  
  
  Phase 4–5: IAM and Tests
&lt;/h2&gt;

&lt;p&gt;IAM was uneventful — added registry permissions to the execution role, scoped to our specific registry ARN.&lt;/p&gt;

&lt;p&gt;Tests were thorough: 13 unit tests for the plugin (including XML injection protection and caching behavior), 6 for the upload scripts. All green.&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 6: The Moment of Truth
&lt;/h2&gt;

&lt;p&gt;Deployed with &lt;code&gt;USE_REGISTRY=true&lt;/code&gt;. Opened Slack. Asked Schwarzi a question.&lt;/p&gt;

&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;The logs told the story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills loaded from AgentCore Registry ✓&lt;/li&gt;
&lt;li&gt;16 skills fetched ✓&lt;/li&gt;
&lt;li&gt;Same tool count as before (30) ✓&lt;/li&gt;
&lt;li&gt;Extra latency: ~1.2 seconds at agent creation, once per session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 1.2 seconds is the catalog fetch. It happens once, then individual skill fetches (~200ms each) are cached for the rest of the session. Acceptable trade-off for dynamic loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  "But Does It Cost More Tokens?"
&lt;/h2&gt;

&lt;p&gt;This was our first question too. We'd just added API calls and latency — surely we were also burning more tokens?&lt;/p&gt;

&lt;p&gt;No. The LLM sees exactly the same text either way.&lt;/p&gt;

&lt;p&gt;Both plugins inject an &lt;code&gt;&amp;lt;available_skills&amp;gt;&lt;/code&gt; XML block into the system prompt before every turn — 16 skill names and descriptions, same format, same size. When the agent activates a skill, both return the full SKILL.md content as a tool result. Same content, same tokens. The Registry is a different storage backend, not a different prompt.&lt;/p&gt;

&lt;p&gt;The actual cost delta is purely operational: ~1.2 seconds of latency at session start, ~200ms per uncached skill fetch, and 3–4 AWS API calls per session that didn't exist before. Those are real, but they're measured in milliseconds and pennies, not tokens.&lt;/p&gt;

&lt;p&gt;If anything, the Registry &lt;em&gt;sets up&lt;/em&gt; future token savings. Right now we load the full catalog into the prompt every turn. With per-thread skill loading (the next phase), we'd inject only the skills relevant to the current conversation. That's a prompt reduction, not an increase — but it requires the dynamic loading infrastructure we just built.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Payoff: Skill Updates Without Redeployment
&lt;/h2&gt;

&lt;p&gt;The best part came after the migration. We needed to fix a skill's instructions — previously, that meant editing the markdown, rebuilding the container, and running &lt;code&gt;agentcore launch&lt;/code&gt;. A full redeployment cycle for a text change.&lt;/p&gt;

&lt;p&gt;Now? Edit the SKILL.md, run the update script, and the agent picks it up on the next session. No container build, no deploy, no downtime. Skill content is decoupled from the agent runtime.&lt;/p&gt;

&lt;p&gt;That's the kind of workflow improvement that compounds over time. Every skill tweak, every prompt refinement, every new tool reference — just a markdown edit and an upload.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gotcha Tracker
&lt;/h2&gt;

&lt;p&gt;For the detail-oriented, here's everything that went sideways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;What Happened&lt;/th&gt;
&lt;th&gt;What We Did&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Registry ID not labeled in console&lt;/td&gt;
&lt;td&gt;Extracted from ARN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Registry APIs missing in boto3 &amp;lt; 1.42.88&lt;/td&gt;
&lt;td&gt;Upgraded boto3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Upload API requires YAML frontmatter&lt;/td&gt;
&lt;td&gt;Sent raw file content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Record creation is async, no ID in response&lt;/td&gt;
&lt;td&gt;Parsed ID from ARN, polled for DRAFT status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Data-plane search requires matching auth type&lt;/td&gt;
&lt;td&gt;Used control-plane APIs (accept IAM regardless)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Registry takes ~45s to become READY&lt;/td&gt;
&lt;td&gt;Added polling before uploads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What We'd Tell Past Us
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Match your Registry auth to your agent.&lt;/strong&gt; IAM agent → IAM registry. OIDC agent → JWT registry. We matched our existing Azure Entra ID config and it worked on the first try.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Control plane is your friend.&lt;/strong&gt; The control-plane APIs (&lt;code&gt;list_registry_records&lt;/code&gt; + &lt;code&gt;get_registry_record&lt;/code&gt;) accept IAM regardless of registry auth type. Use them for catalog fetches from your plugin.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send the raw file, frontmatter and all.&lt;/strong&gt; The SDK strips it; the Registry needs it. Read the file from disk, not from the parsed object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Record creation is a multi-step process.&lt;/strong&gt; Create → wait for DRAFT → submit for approval → wait for APPROVED. Budget for polling and retries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check your boto3 version.&lt;/strong&gt; Registry APIs landed in 1.42.88. Older versions have the clients but not the operations — a confusing failure mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache at the session level.&lt;/strong&gt; The catalog fetch is the slow part (~1.2s). Individual skill lookups are fast (~200ms) and cache well. Don't re-fetch what hasn't changed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use an env var toggle for migrations.&lt;/strong&gt; &lt;code&gt;USE_REGISTRY=true/false&lt;/code&gt; with both code paths intact means you can roll back in seconds. No database flags, no deployment. Just flip the switch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>agentcore</category>
      <category>ai</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>We Built the Same Agent Three Times Before It Worked</title>
      <dc:creator>Danny Teller</dc:creator>
      <pubDate>Wed, 01 Apr 2026 19:10:44 +0000</pubDate>
      <link>https://forem.com/aws-builders/we-built-the-same-agent-three-times-before-it-worked-4ml8</link>
      <guid>https://forem.com/aws-builders/we-built-the-same-agent-three-times-before-it-worked-4ml8</guid>
      <description>&lt;p&gt;Two months ago, our DevOps team set out to build an AWS governance agent. Something that could look across a multi-account AWS organization, find orphaned resources, flag security issues, check tag compliance, and tell you where you're bleeding money — in plain English.&lt;/p&gt;

&lt;p&gt;We had AWS Strands Agents SDK, Amazon Bedrock AgentCore, and a reasonable amount of optimism.&lt;/p&gt;

&lt;p&gt;What followed was two months of building, tearing down, and rebuilding. Three fundamentally different architectures. 18,000 lines of code written and then deleted. And a final system that's simpler than any of the ones that came before it.&lt;/p&gt;

&lt;p&gt;This is the story of how we got there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 1: "The LLM Will Figure It Out"
&lt;/h2&gt;

&lt;p&gt;The first version was the obvious one. Give the LLM a set of AWS API tools — &lt;code&gt;describe_instances&lt;/code&gt;, &lt;code&gt;list_security_groups&lt;/code&gt;, &lt;code&gt;get_cost_and_usage&lt;/code&gt; — and let it call them directly.&lt;/p&gt;

&lt;p&gt;We built an &lt;code&gt;AgentRouter&lt;/code&gt; that received user queries, a &lt;code&gt;CoordinatorAgent&lt;/code&gt; that managed multi-agent flow, and wired it all to boto3 calls. The LLM would receive a question like "find unused security groups in our production VPC," reason about which APIs to call, and chain them together.&lt;/p&gt;

&lt;p&gt;It worked. Sort of.&lt;/p&gt;

&lt;p&gt;The problem wasn't that the LLM couldn't call AWS APIs. It could. The problem was that AWS APIs are inconsistent, paginated, rate-limited, and return wildly different response shapes across services. A question about orphaned EBS volumes required the LLM to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;List all volumes&lt;/li&gt;
&lt;li&gt;Filter for &lt;code&gt;available&lt;/code&gt; state&lt;/li&gt;
&lt;li&gt;Cross-reference with instance attachments&lt;/li&gt;
&lt;li&gt;Check if any are in use by ASGs or launch templates&lt;/li&gt;
&lt;li&gt;Handle pagination across all of these&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM would sometimes get this right. Sometimes it'd miss the pagination. Sometimes it'd hallucinate an API parameter. Every query was a fresh adventure in whether the model remembered the exact shape of the &lt;code&gt;describe_volumes&lt;/code&gt; response.&lt;/p&gt;

&lt;p&gt;We were spending tokens on API exploration instead of governance analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: Giving an LLM raw AWS APIs is like giving someone a phone book and asking them to plan a city. The information is there, but the abstraction is wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ttw243h6mi2elmflm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ttw243h6mi2elmflm.png" alt="confusion" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 2: The 8,300-Line Orchestrator
&lt;/h2&gt;

&lt;p&gt;If the LLM couldn't be trusted to navigate AWS APIs on its own, we'd give it structure.&lt;/p&gt;

&lt;p&gt;We built a five-stage deterministic pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt; — determine the intent (security audit, cost review, orphan detection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reason&lt;/strong&gt; — extract entities (account IDs, VPC names, resource types)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route&lt;/strong&gt; — select the right tools and agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; — run the tools in the correct order&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesize&lt;/strong&gt; — compile results into a coherent response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This was a hybrid approach. Deterministic control flow with LLM reasoning at each stage. A &lt;code&gt;SemanticIntentClassifier&lt;/code&gt; replaced keyword routing. A &lt;code&gt;SupervisorAgent&lt;/code&gt; managed the pipeline. We added VPC disambiguation, cross-account routing, hallucination guards.&lt;/p&gt;

&lt;p&gt;It felt like progress. The pipeline was predictable. Tests could target each stage. We could reason about failure modes.&lt;/p&gt;

&lt;p&gt;But the orchestrator kept growing. Intent taxonomies needed constant updates. Every new query pattern required new routing logic. The classifier would misroute edge cases, and fixing one route would break another. VPC name resolution alone went through four bug-fix cycles.&lt;/p&gt;

&lt;p&gt;By late February, the orchestrator was 8,300 lines across seven modules. The &lt;code&gt;SupervisorAgent&lt;/code&gt; had been decomposed, recomposed, and decomposed again. We'd built an entire reasoning engine on top of an LLM that was already a reasoning engine.&lt;/p&gt;

&lt;p&gt;We were fighting the model instead of using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: A deterministic pipeline that wraps an LLM is still deterministic. You get the rigidity of hardcoded flows with the unpredictability of language models. The worst of both worlds.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Iteration 3: The Satellite Architecture
&lt;/h2&gt;

&lt;p&gt;Around the same time, we had a data freshness problem. Direct API calls were slow, rate-limited, and gave you a point-in-time snapshot. We wanted something closer to a continuously updated inventory.&lt;/p&gt;

&lt;p&gt;So we built a satellite architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda scanners&lt;/strong&gt; deployed across every AWS account via StackSets&lt;/li&gt;
&lt;li&gt;Each scanner would enumerate resources on a schedule, enrich them with metadata and relationships&lt;/li&gt;
&lt;li&gt;Results flowed into &lt;strong&gt;S3 Vectors&lt;/strong&gt; — four buckets with 52 vector indexes, organized by domain (core, data, compute, ops)&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;embedding service&lt;/strong&gt; (Titan Embed Text v2, 1024 dimensions) vectorized everything&lt;/li&gt;
&lt;li&gt;The agent would query the vector store instead of calling AWS APIs directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The infrastructure was impressive on a whiteboard. Service-aware vector indexes. Cross-account routing. Relationship graphs embedded alongside resource metadata. A reconciliation pipeline to handle eventual consistency.&lt;/p&gt;

&lt;p&gt;In practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Titan embedding calls for every resource in every account on every scan. Lambda execution time kept climbing — we bumped timeouts from 10 to 15 to 20 minutes. 57 VPC endpoints to give Lambdas access to S3 Vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Vector deduplication was a constant battle. Resources would appear twice after re-scans. Cross-account vector routing had subtle bugs where resources from one account would surface in queries about another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Freshness&lt;/strong&gt;: The thing we built this to solve. Scans ran on schedules, so the vector store was always behind reality. Users would ask about a security group that was created an hour ago and get nothing back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity&lt;/strong&gt;: 3,700 lines of scanner code across eight modules. A 611-line S3 Vectors client. A 388-line relationship index. A 286-line embedding service. Infrastructure stacks for Step Functions, S3 Vectors buckets, scanner Lambdas, and all the IAM plumbing to connect them.&lt;/p&gt;

&lt;p&gt;We had built a distributed data pipeline to feed an LLM that still couldn't reliably answer "show me the unused EBS volumes."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: When your data layer is more complex than your analysis layer, you've probably over-solved the wrong problem.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment It Clicked
&lt;/h2&gt;

&lt;p&gt;In late February, we stepped back and asked a different question: What if we stopped building infrastructure and started using what AWS already provides?&lt;/p&gt;

&lt;p&gt;AWS Config Aggregator already has a continuously updated inventory of resources across all accounts. It supports SQL queries. It's maintained by AWS. It doesn't need Lambdas, embeddings, or vector stores.&lt;/p&gt;

&lt;p&gt;And Strands SDK already handles tool selection, invocation, and chaining. It doesn't need a five-stage pipeline to decide which tool to call.&lt;/p&gt;

&lt;p&gt;On February 27, we deleted the 8,300-line orchestrator and replaced it with a single Strands agent. On March 2, we deleted the entire scanner infrastructure and replaced S3 Vectors with Config Aggregator queries.&lt;/p&gt;

&lt;p&gt;The diff was dramatic: 18,000 lines removed. The new agent was roughly 1,500 lines.&lt;/p&gt;




&lt;h2&gt;
  
  
  What We Built Instead
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq222joteni2dru0v6zh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq222joteni2dru0v6zh.png" alt="getting_there" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The current system has one Strands agent running Haiku. No multi-agent orchestration. No intent classification. No vector stores. The LLM picks tools from a registry, and the tools handle the complexity of talking to AWS.&lt;/p&gt;

&lt;p&gt;Here's what makes it work:&lt;/p&gt;

&lt;h3&gt;
  
  
  Data: Config Aggregator + Resource Explorer 2
&lt;/h3&gt;

&lt;p&gt;Instead of building our own data layer, we query AWS's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT resourceId, resourceType, configuration
WHERE resourceType = 'AWS::EC2::SecurityGroup'
AND accountId = 'XXXXXXXXXXXX'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Config Aggregator covers ~85% of resource types. Resource Explorer 2 fills the gaps. For anything with zero results, a direct API fallback fires automatically. No Lambdas. No embeddings. No eventual consistency problems.&lt;/p&gt;
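&lt;p&gt;The SQL shown earlier goes through Config's &lt;code&gt;select_aggregate_resource_config&lt;/code&gt; API, which returns each row as a JSON string and paginates with &lt;code&gt;NextToken&lt;/code&gt;. A minimal wrapper (the aggregator name is yours to fill in):&lt;/p&gt;

```python
import json

def query_aggregator(client, aggregator_name, expression):
    """Run a Config advanced query across the aggregator and parse the
    JSON-string rows the API returns. client is a boto3 'config' client."""
    rows = []
    token = None
    while True:
        kwargs = {"ConfigurationAggregatorName": aggregator_name,
                  "Expression": expression}
        if token:
            kwargs["NextToken"] = token
        resp = client.select_aggregate_resource_config(**kwargs)
        rows.extend(json.loads(row) for row in resp.get("Results", []))
        token = resp.get("NextToken")
        if not token:
            return rows
```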

&lt;h3&gt;
  
  
  Tools: Dispatchers Over Individual Functions
&lt;/h3&gt;

&lt;p&gt;Early versions exposed 70+ individual tools to the model. Each tool call required the LLM to pick from a massive schema — roughly 30K tokens of tool definitions per request. The model would get confused, pick the wrong tool, or combine tools incorrectly.&lt;/p&gt;

&lt;p&gt;We consolidated into 8 domain dispatchers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;security_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Security operations: findings, sg_rules, nacl_analysis,
    kms_keys, compliance_check, ...&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_ACTIONS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One &lt;code&gt;security_tool&lt;/code&gt; with an &lt;code&gt;action&lt;/code&gt; parameter replaces 11 individual tools. The LLM sees 8 clear categories instead of 70 ambiguous options. Tool schema dropped by 64% — about 10K fewer tokens per request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analysis: Context Builders, Not LLM Reasoning
&lt;/h3&gt;

&lt;p&gt;The earlier architectures asked the LLM to analyze raw AWS API responses. "Here are 200 security groups, figure out which ones are orphaned."&lt;/p&gt;

&lt;p&gt;Now, deterministic context builders pre-process the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orphan detection applies known rules (no attachments, no references, default VPC)&lt;/li&gt;
&lt;li&gt;Security analysis flags known-bad patterns (0.0.0.0/0 ingress, missing encryption)&lt;/li&gt;
&lt;li&gt;Cost analysis identifies savings opportunities from utilization data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM receives pre-analyzed context and focuses on what it's good at: explaining findings in plain language and prioritizing recommendations.&lt;/p&gt;
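&lt;p&gt;As a sketch, an orphan context builder with illustrative field names (the real rules are richer, but the shape is the same: deterministic checks in, a compact summary out):&lt;/p&gt;

```python
def build_orphan_context(security_groups):
    """Deterministic pre-analysis: tag each security group as orphaned or
    in use before the LLM ever sees it. Field names are illustrative."""
    orphans, in_use = [], []
    for sg in security_groups:
        unreferenced = not sg.get("attachments") and not sg.get("referenced_by")
        if unreferenced and not sg.get("is_default", False):
            orphans.append(sg["id"])
        else:
            in_use.append(sg["id"])
    return {
        "orphaned": orphans,
        "in_use_count": len(in_use),
        "summary": f"{len(orphans)} of {len(security_groups)} groups look orphaned",
    }
```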

&lt;h3&gt;
  
  
  Skills: Progressive Disclosure
&lt;/h3&gt;

&lt;p&gt;15 domain skills (cost optimization, security analysis, network intelligence, etc.) load on-demand based on query context. Each skill gates which tools the agent can access and provides domain-specific guidance.&lt;/p&gt;

&lt;p&gt;This means the model doesn't see irrelevant tool instructions. A cost question loads cost-optimization guidance and cost-related tools. The context stays focused.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Well-Architected Sub-Agent
&lt;/h3&gt;

&lt;p&gt;For deep assessments, we spawn a separate Strands agent running Sonnet (the main agent runs Haiku). This sub-agent has its own tool set — 20+ check functions for EBS, RDS, IAM, S3 — and its own system prompt tuned for Well-Architected Framework analysis.&lt;/p&gt;

&lt;p&gt;It's the one place where we use a more powerful model, and only when the user explicitly asks for an assessment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strands SDK: What Actually Matters
&lt;/h2&gt;

&lt;p&gt;After using Strands through three architectural iterations, here's what ended up being load-bearing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; (&lt;code&gt;CacheConfig(strategy="auto")&lt;/code&gt;): Our system prompt is substantial — core orchestration plus dynamically loaded skills. Caching it across invocations cut latency and cost meaningfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Plugin API&lt;/strong&gt;: A single &lt;code&gt;GovernancePlugin&lt;/code&gt; class registers hooks for before/after tool calls, after model calls, and after invocation. This gives us telemetry, token tracking, and post-response validation without touching the agent's core logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;event.resume&lt;/code&gt;&lt;/strong&gt;: After the agent finishes responding, a hook can inspect the output and inject a follow-up. We use this for orphan analysis validation — if the agent's response doesn't match our deterministic findings, the hook sets &lt;code&gt;event.resume&lt;/code&gt; with a correction prompt and the agent self-corrects. No infinite loops, no separate validation agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming with &lt;code&gt;agent.stream_async()&lt;/code&gt;&lt;/strong&gt;: Real-time progress in Slack. Users see which tools are running, partial results, and the final analysis as it generates. This turned a 30-second wait into a 30-second experience of watching the agent work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent cancellation&lt;/strong&gt;: Long governance scans can take a while. Users can cancel mid-flight, and the agent cleans up gracefully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversation summarization&lt;/strong&gt;: Governance conversations can run long — "now check the other account," "what about the network side." The SDK's &lt;code&gt;SummarizingConversationManager&lt;/code&gt; keeps conversation history manageable while preserving critical context like account IDs and prior findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  AgentCore: Production Without the Ops
&lt;/h2&gt;

&lt;p&gt;AgentCore handles the parts we didn't want to build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime hosting&lt;/strong&gt;: Docker container on Graviton, managed by AgentCore. No ECS cluster to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: Short-term and long-term session memory with episodic strategies. Cross-session learning — the agent remembers what it found in previous conversations and injects those reflections into new sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JWT authentication&lt;/strong&gt;: Corporate identity provider integration for user identity. The agent knows who's asking and can scope responses to their permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt;: Bedrock Guardrails filter content to prevent secrets leakage in governance responses.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bringing It to Slack with AG-UI
&lt;/h2&gt;

&lt;p&gt;An agent that lives behind an API endpoint is useful. An agent that lives in the channel where your team already works is adopted.&lt;/p&gt;

&lt;p&gt;We built a Slack bot as a separate service — Slack Bolt running in Socket Mode on ECS Fargate. When a user messages the bot, it calls the AgentCore-hosted agent over HTTPS. But early versions had a problem: governance scans take 15–30 seconds. Users would type a question, stare at a typing indicator, and wonder if anything was happening.&lt;/p&gt;

&lt;p&gt;AgentCore supports AG-UI (Agent-User Interface), a streaming protocol that surfaces what the agent is doing in real time. We built a custom Strands-to-AG-UI adapter that translates the SDK's internal events into an SSE stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /invocations
Accept: text/event-stream

→ TOOL_CALL_START: discover_resources (scanning 3 accounts...)
→ TEXT_DELTA: "Found 47 security groups across..."
→ TOOL_CALL_START: security_tool (analyzing findings...)
→ TEXT_DELTA: "12 groups have unrestricted ingress..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same endpoint serves both modes. &lt;code&gt;Accept: text/event-stream&lt;/code&gt; gets the SSE stream; a normal request gets synchronous JSON. The Slack bot consumes the stream and progressively updates the Slack message — users see tools firing, partial results appearing, and the final analysis building in real time.&lt;/p&gt;
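&lt;p&gt;A minimal sketch of that content negotiation. Here &lt;code&gt;run_agent&lt;/code&gt; is a hypothetical callable yielding AG-UI events as dicts; the real adapter wraps the Strands SDK event stream:&lt;/p&gt;

```python
import json

def handle_invocation(headers, run_agent):
    """Route one /invocations request by Accept header.

    run_agent is a hypothetical callable yielding AG-UI events as dicts.
    Returns a (content_type, body) pair.
    """
    if "text/event-stream" in headers.get("Accept", ""):
        # SSE mode: serialize each agent event as one `data:` frame.
        frames = [
            "data: " + json.dumps(event) + "\n\n" for event in run_agent()
        ]
        return "text/event-stream", "".join(frames)
    # Synchronous mode: drain the stream, return only the final text.
    text = "".join(
        e["delta"] for e in run_agent() if e["type"] == "TEXT_DELTA"
    )
    return "application/json", json.dumps({"response": text})
```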

&lt;p&gt;A few things we learned the hard way about streaming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never clear your accumulated response text when a new tool call starts. The agent can call tools mid-response, and clearing the buffer silently drops everything it said before the tool call.&lt;/li&gt;
&lt;li&gt;Track unique tool &lt;em&gt;stages&lt;/em&gt;, not individual calls, for progress display. Otherwise you flood the Slack message with duplicate updates.&lt;/li&gt;
&lt;li&gt;Track text offsets after both tool-call-start and tool-call-end events. Text can appear between tool calls, and missing either offset creates gaps in the streamed output.&lt;/li&gt;
&lt;/ul&gt;
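&lt;p&gt;Those rules fold into a small event-consumer loop. The event shapes below are illustrative, not the exact AG-UI wire format:&lt;/p&gt;

```python
def consume_stream(events):
    """Fold an AG-UI event stream into (response_text, progress_stages).

    The key behaviors match the lessons above: the text buffer is never
    cleared when a tool call starts, progress is tracked per unique tool
    stage rather than per call, and text deltas are accepted before,
    between, and after tool calls so nothing is dropped.
    """
    text_parts = []  # accumulated response text -- never reset mid-stream
    stages = []      # unique tool stages surfaced as progress updates
    for event in events:
        if event["type"] == "TEXT_DELTA":
            text_parts.append(event["delta"])
        elif event["type"] == "TOOL_CALL_START":
            stage = event["name"]
            if stage not in stages:  # dedupe repeated calls to one tool
                stages.append(stage)
            # Crucially: do NOT clear text_parts here.
    return "".join(text_parts), stages
```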

&lt;p&gt;The result: what used to be a 30-second black box is now a 30-second live feed of the agent working through the problem. Users trust it more because they can see it thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Satellite Architecture&lt;/th&gt;
&lt;th&gt;Current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lines of code&lt;/td&gt;
&lt;td&gt;~25,000&lt;/td&gt;
&lt;td&gt;~7,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool definitions exposed to LLM&lt;/td&gt;
&lt;td&gt;70+&lt;/td&gt;
&lt;td&gt;25 (8 dispatchers + 17 individual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool schema tokens per request&lt;/td&gt;
&lt;td&gt;~30K&lt;/td&gt;
&lt;td&gt;~10K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure components&lt;/td&gt;
&lt;td&gt;S3 Vectors (4 buckets), 52 indexes, Lambda scanners, Step Functions, 57 VPC endpoints&lt;/td&gt;
&lt;td&gt;Config Aggregator (AWS-managed), 20 VPC endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data freshness&lt;/td&gt;
&lt;td&gt;Minutes to hours (scan schedule)&lt;/td&gt;
&lt;td&gt;Real-time (Config Aggregator)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding costs&lt;/td&gt;
&lt;td&gt;Per-resource per-scan&lt;/td&gt;
&lt;td&gt;Zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live test pass rate&lt;/td&gt;
&lt;td&gt;Inconsistent&lt;/td&gt;
&lt;td&gt;109/109 (100%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What We'd Tell Ourselves Two Months Ago
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;"Not enough coverage" isn't always a good reason to build your own.&lt;/strong&gt; We dismissed Config Aggregator early because our research showed it didn't cover every resource type we wanted to query. So we built an entire scanning pipeline to get 100% coverage. What we discovered later: Config Aggregator covers ~85% of resource types, Resource Explorer 2 fills most of the gaps, and a simple direct API fallback handles the rest. Three lines of fallback logic replaced 3,700 lines of scanner code. Perfect coverage wasn't worth the complexity cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't build orchestration on top of orchestration.&lt;/strong&gt; Strands SDK is already a tool-use loop. Adding a five-stage pipeline on top of it added complexity without adding capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token budget is an architecture constraint.&lt;/strong&gt; 70 tools meant 30K tokens of schema per request. The dispatcher pattern wasn't just cleaner code — it was a 64% reduction in per-request cost.&lt;/p&gt;
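&lt;p&gt;The dispatcher pattern in miniature: instead of registering every operation as its own tool (and paying for its schema on every request), one tool routes on an &lt;code&gt;operation&lt;/code&gt; argument. The operation names here are made up for illustration:&lt;/p&gt;

```python
# Hypothetical security operations collapsed behind one dispatcher tool.
SECURITY_OPS = {
    "public_buckets": lambda args: "scanning S3 ACLs in " + args["account"],
    "open_ingress": lambda args: "scanning security groups in " + args["account"],
    "iam_wildcards": lambda args: "scanning IAM policies in " + args["account"],
}

def security_tool(operation, args):
    """The only schema the LLM sees for this domain: one tool with an
    enum of operations, instead of a separate tool definition per op."""
    handler = SECURITY_OPS.get(operation)
    if handler is None:
        return "unknown operation: " + operation
    return handler(args)
```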

&lt;p&gt;&lt;strong&gt;Deterministic preprocessing, LLM synthesis.&lt;/strong&gt; The best division of labor: code handles data collection and known-rule analysis. The LLM handles explanation, prioritization, and natural language. Don't ask the model to do what a SQL query can do better.&lt;/p&gt;
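&lt;p&gt;A toy version of that split, with made-up field names: code applies the known rule, and the model only gets the findings to explain and rank:&lt;/p&gt;

```python
def flag_open_ingress(groups):
    """Deterministic pass: a known rule, so code decides, not the model."""
    return [
        g["id"] for g in groups
        if any(rule.get("cidr") == "0.0.0.0/0" for rule in g["ingress"])
    ]

def build_synthesis_prompt(findings):
    """The LLM's job starts here: explain and prioritize, not detect."""
    return (
        "These security groups allow unrestricted ingress: "
        + ", ".join(findings)
        + ". Explain the risk and suggest a remediation order."
    )
```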

&lt;p&gt;&lt;strong&gt;Ship the boring version first.&lt;/strong&gt; Every complex architecture we built was an attempt to solve problems we hadn't actually encountered yet. The current system handles real governance queries from real users. That's the bar.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with &lt;a href="https://github.com/strands-agents/sdk-python" rel="noopener noreferrer"&gt;AWS Strands Agents SDK&lt;/a&gt; and &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;Amazon Bedrock AgentCore&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
