Forem: Aj

What is an MCP server? The developer's plain-English guide to Model Context Protocol

Aj — Wed, 29 Apr 2026 17:52:27 +0000

MCP stands for Model Context Protocol. An MCP server is a lightweight process that exposes tools, resources, and data to an AI agent — giving it the ability to interact with the outside world.

Without MCP, an AI model lives inside a conversation. It can read what you type and respond. That's it.

With an MCP server, the same model can query a database, read files from your filesystem, call an external API, execute code, search the web, check your calendar, and push commits to GitHub — all within a single conversation, autonomously, without you doing any of it manually.

MCP is the bridge between AI reasoning and real-world action. It is why the shift from AI chatbots to AI agents happened so fast in 2025-2026.

Why MCP exists

Before MCP, every team that wanted to connect an AI model to an external tool built their own integration. Different schemas, different transport formats, different authentication patterns. A company using three AI models and five internal tools needed fifteen custom integrations. Maintaining them as models updated was a recurring engineering tax.

Anthropic published the Model Context Protocol specification in November 2024 as an open standard. The goal: one protocol that any AI model can use to talk to any tool, regardless of who built either.

The analogy that actually works: MCP is to AI agents what USB is to peripherals. Before USB, every device had its own connector, its own driver, its own installation ritual. After USB, you plug it in and it works. MCP does the same thing for AI and tools — standardises the connection so the model and the tool don't need to know anything specific about each other.

How an MCP server works

An MCP server exposes three types of things to an AI model:

Tools — functions the model can call. A tool has a name, a description, and a JSON schema defining its inputs. When the model decides it needs to use a tool, it calls the tool with appropriate inputs and receives the result. Examples: query_database, read_file, send_email, create_github_issue, invoke_lambda.

Resources — data the model can read. Files, database records, API responses. Resources give the model access to context it wasn't trained on — your company's internal documentation, a customer's account history, the contents of a code repository.

Prompts — reusable prompt templates the server makes available to the model. Useful for standardising how the model approaches specific tasks within a particular domain.

The communication flow is straightforward:

User → AI model → MCP client → MCP server → external system
                                     ↓
User ← AI model ← MCP client ← tool result

User asks the model to do something that requires an external tool
The model identifies which MCP tool it needs and calls it with appropriate parameters
The MCP client forwards the call to the MCP server
The MCP server executes the tool — queries the database, reads the file, calls the API
The result comes back to the model
The model uses the result to continue its reasoning and respond to the user

MCP transport types

MCP servers communicate over two transport mechanisms:

stdio (Standard Input/Output) — the MCP server runs as a subprocess on the same machine as the client. Communication happens through stdin/stdout pipes. This is the standard for local MCP servers — Claude Desktop, Claude Code, and most developer tools use stdio for local MCP integration.

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/aj/projects"]
    }
  }
}

HTTP with SSE (Server-Sent Events) — the MCP server runs as an HTTP service. The client connects over HTTP and receives events via SSE. This is the transport for remote MCP servers — servers running on cloud infrastructure, accessible from anywhere, handling multiple clients simultaneously.

{
  "mcpServers": {
    "production-tools": {
      "url": "https://api.yourcompany.com/mcp",
      "headers": {
        "Authorization": "Bearer your-token"
      }
    }
  }
}

The choice matters for architecture. Local stdio servers are simple to deploy and have no network latency. Remote HTTP servers enable shared tools across a team, centralised access control, and deployment on cloud infrastructure like AWS Lambda.

Building your first MCP server

Here is a minimal MCP server in Python that exposes one tool — querying an AWS DynamoDB table:

import json
import boto3
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

app = Server("dynamodb-tools")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

@app.list_tools()
async def list_tools():
    return [
        Tool(
            name="query_table",
            description="Query a DynamoDB table by partition key. Returns matching items as JSON.",
            inputSchema={
                "type": "object",
                "properties": {
                    "table_name": {
                        "type": "string",
                        "description": "The DynamoDB table name to query"
                    },
                    "partition_key": {
                        "type": "string",
                        "description": "The partition key attribute name"
                    },
                    "partition_value": {
                        "type": "string",
                        "description": "The partition key value to filter on"
                    }
                },
                "required": ["table_name", "partition_key", "partition_value"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "query_table":
        table = dynamodb.Table(arguments["table_name"])
        from boto3.dynamodb.conditions import Key
        response = table.query(
            KeyConditionExpression=Key(
                arguments["partition_key"]
            ).eq(arguments["partition_value"])
        )
        return [TextContent(
            type="text",
            text=json.dumps(response["Items"], default=str)
        )]

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    import asyncio
    asyncio.run(main())

Install dependencies:

pip install mcp boto3

Add to Claude Desktop's claude_desktop_config.json:

{
  "mcpServers": {
    "dynamodb-tools": {
      "command": "python",
      "args": ["/path/to/your/server.py"]
    }
  }
}

Now Claude can query your DynamoDB tables directly from a conversation.

MCP on AWS Lambda — the production architecture

Local stdio servers work for development. Production deployments use AWS Lambda — serverless, auto-scaling, zero idle cost, accessible from Claude on Bedrock.

The architecture:

Claude on Bedrock
      ↓
API Gateway (HTTP endpoint)
      ↓
Lambda function (MCP server)
      ↓
DynamoDB / S3 / RDS / any AWS service

# lambda_handler.py — MCP server running on Lambda
import json
import boto3
from mcp.server import Server
from mcp.server.lambda_handler import create_lambda_handler
from mcp.types import Tool, TextContent

app = Server("production-tools")

@app.list_tools()
async def list_tools():
    return [
        Tool(
            name="get_customer",
            description="Retrieve customer record by ID from the production database.",
            inputSchema={
                "type": "object",
                "properties": {
                    "customer_id": {
                        "type": "string",
                        "description": "The unique customer ID"
                    }
                },
                "required": ["customer_id"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "get_customer":
        dynamodb = boto3.resource("dynamodb")
        table = dynamodb.Table("customers")
        response = table.get_item(Key={"customer_id": arguments["customer_id"]})
        item = response.get("Item", {})
        return [TextContent(type="text", text=json.dumps(item, default=str))]

handler = create_lambda_handler(app)

The IAM role for the Lambda function needs only the permissions required for the tools it exposes. If the server queries DynamoDB: dynamodb:GetItem and dynamodb:Query on the specific table. Nothing broader. This is the least-privilege principle applied to MCP tool access — the agent can only do what the Lambda role allows, regardless of what the model tries to call.

The security boundary MCP creates

MCP introduces a trust boundary that every engineer building agentic systems needs to understand.

The AI model decides which tools to call and with what parameters. The MCP server executes those calls. The model cannot directly access the underlying systems — it can only go through the tools the server exposes.

This boundary is where security controls live:

Tool schema as access control — the tools you define in your MCP server are the only operations available to the model. If you don't expose a delete_record tool, the model cannot delete records regardless of what it is instructed to do.

IAM as execution boundary — the AWS identity the MCP server runs as determines what it can actually do. A Lambda function with read-only IAM permissions cannot write data even if the tool implementation attempts it.

Input validation as injection defence — validate all inputs before passing them to downstream systems. A model can be manipulated into calling tools with adversarial inputs via prompt injection. Your server should reject inputs that don't match expected patterns before they reach your database or API.

This is why understanding MCP architecture is part of Domain 2 of the CCA-001 Claude Certified Architect exam — Tool Design and MCP Integration. The exam tests whether you can design MCP servers with appropriate security boundaries, not just whether you can make them work.

MCP servers AWS ships out of the box

AWS now ships official MCP servers for its core services. These are ready to use without building anything:

Amazon S3 MCP — read, write, and list S3 objects from Claude
Amazon DynamoDB MCP — query and scan DynamoDB tables
AWS Lambda MCP — invoke Lambda functions directly from Claude Code
Amazon SageMaker MCP — manage training jobs and endpoints
Amazon Bedrock MCP — invoke foundation models, manage guardrails
AWS CloudWatch MCP — query logs and metrics
Amazon EC2 MCP — describe and manage EC2 instances

Install and configure any of these the same way as a custom server — add to your MCP client config and the tools become available immediately.

What MCP means for the engineer building on Claude

MCP is not optional knowledge for engineers building production AI systems in 2026. It is the standard integration layer for the entire Claude ecosystem — Claude Code uses it to connect to external tools, Bedrock agents use it to access enterprise data, and the Claude Partner Network ecosystem is built around MCP-compatible tooling.

The engineers who understand MCP architecture — how to design tool schemas, how to structure the security boundary, how to deploy MCP servers on Lambda, how to debug tool-calling failures — are the ones who can build production agentic systems that actually work reliably.

The StackHawks Pro track on Cloud Edventures has a dedicated MCP on AWS Lambda lab: building a production MCP server from scratch, deploying to Lambda behind API Gateway, connecting it to Claude on Bedrock, and testing the full tool-calling loop. Real AWS environment, automated validation, no billing risk.

The CCA-001 Claude Certified Architect certification covers MCP Integration in Domain 2 — 18% of the exam. Not as theory but as production architecture scenarios you have to solve correctly.

👉 cloudedventures.com

What are you connecting to Claude via MCP right now? Drop it in the comments.

Anthropic just admitted Claude Code broke. Here's exactly what happened, what they fixed, and what it means for your workflows.

Aj — Fri, 24 Apr 2026 13:16:41 +0000

For the past several weeks, engineers using Claude Code have been filing complaints. Responses felt off. Reasoning felt shallower. Coding quality dropped noticeably from what they'd come to expect. Many assumed Anthropic had intentionally degraded the model — what the developer community calls "nerfing."

Anthropic denied it. Then they proved they were right by publishing a full postmortem.

On April 20, Anthropic confirmed: Claude Code's quality degraded. The underlying model was not changed. Three separate product-level changes caused the regression, each independently, stacking on top of each other. All three have now been fixed as of April 20 (v2.1.116).

Here is exactly what broke, why it matters for production workflows, and what changed.

The three things that actually broke Claude Code

Anthropic traced the complaints to three separate changes that affected Claude Code, the Claude Agent SDK, and Claude Cowork. The API was not impacted.

Issue 1 — Reasoning effort silently dropped

The default reasoning effort level was reduced at the product level. Engineers were getting shallower responses not because the model was less capable, but because it was being instructed to think less. This change made it past multiple human and automated code reviews, unit tests, end-to-end tests, automated verification, and internal dogfooding. It was subtle enough that even Anthropic's own internal processes didn't catch it immediately.

Fix: Restored to higher default reasoning effort across Claude Code, Claude Agent SDK, and Claude Cowork.

Issue 2 — A caching bug silently dropped thinking history

A bug in context management caused thinking history to be dropped during stale sessions. This was at the intersection of Claude Code's context management, the Anthropic API, and extended thinking. The regression only appeared in a specific corner case — stale sessions — which made it extremely difficult to reproduce and identify. It took over a week of investigation to confirm the root cause.

The notable detail: Anthropic back-tested the offending pull requests using Opus 4.7. Opus 4.7 found the bug. Opus 4.6 did not. This is why Opus 4.7's improved code reasoning matters in practice — not just on benchmarks.

Fix: Caching bug patched in v2.1.101 (April 10). Thinking history now correctly persists across sessions.

Issue 3 — A verbosity prompt change hurt coding quality

Claude Opus 4.7 tends to be more verbose than its predecessor — a known behavioral difference noted at launch. To reduce verbosity, a prompt change was made. That change went too far and reduced coding quality alongside verbosity. The tradeoff was not caught before deployment.

Fix: Verbosity prompt change reverted. Usage limits also reset for subscribers affected during the degraded period.

Why this matters beyond the immediate fix

The three-bug postmortem is worth understanding for reasons that go beyond "Claude Code works again."

Product-layer changes can silently degrade model quality. The model never changed. What changed were instructions, caching behaviour, and prompting — all at the product layer sitting above the model. Engineers building production systems on Claude Code or the Claude API need to understand that model quality can degrade from sources they don't control and can't directly observe. This is not unique to Anthropic — it is a systemic property of building on top of hosted AI services.

Extended thinking sessions are sensitive to context management. The caching bug only appeared in stale sessions with extended thinking enabled. Engineers using long-horizon agentic workflows — exactly the workflows that Claude Code and AgentCore are designed for — are most exposed to context management bugs. The fix is in, but the lesson is: if your long-running agentic workflow suddenly produces degraded output, context management is now a confirmed failure mode worth investigating.

The verbosity-quality tradeoff is real and non-trivial. Opus 4.7 is more verbose. The attempts to reduce that verbosity damaged coding quality. This means engineers running Opus 4.7 in production who are trying to manage output length through prompt changes need to be careful — the model's verbosity is partially load-bearing. Reducing it through aggressive prompt constraints may reduce quality alongside token count.

Opus 4.7 found the bug that Opus 4.6 missed. This is the understated line in the postmortem. When Anthropic used Opus 4.7 to code review the PR that introduced the caching bug, Opus 4.7 caught it. Opus 4.6 didn't. For engineers evaluating whether to migrate to Opus 4.7, this is concrete evidence of improved code reasoning beyond benchmark scores.

What changed in Claude Code v2.1.116

The April 20 release that contains all three fixes also ships additional stability improvements. From the release notes:

Fixed connecting to a remote session overwriting local model settings in ~/.claude/settings.json
Fixed typeahead showing "No commands match" error when pasting file paths starting with /
Fixed plugin reinstall not resolving dependencies at the wrong version
Fixed unhandled errors from file watcher on invalid paths or file descriptor exhaustion
Fixed Remote Control sessions getting archived on transient CCR initialization during JWT refresh
Fixed subagents resumed via SendMessage not restoring the explicit cwd they were spawned with

The /loop workflow improvements and Remote Control session stability fixes are particularly relevant for engineers running Claude Code in long-horizon agentic workflows.

Also this week: Anthropic Managed Agents launched

Separately from the Claude Code fix, Anthropic launched Managed Agents — a hosted Claude Platform service specifically designed for long-horizon agent work.

The key design principle behind Managed Agents: harnesses encode assumptions about what Claude cannot do on its own. Those assumptions go stale as models improve. A concrete example from Anthropic's own engineering work: Claude Sonnet 4.5 would terminate tasks prematurely as it detected its context limit approaching — a behaviour Anthropic calls "context anxiety." The harness added context resets to compensate. With a better model, that compensation may no longer be needed — or may actively limit performance.

Managed Agents provides stable interfaces for sessions, harnesses, and sandboxes specifically so that as model capabilities improve, the harness can be updated without rebuilding the entire agent infrastructure.

What Managed Agents provides:

Durable state across long-horizon tasks — the agent does not lose context mid-workflow
Safer tool access — tool permissions managed at the infrastructure level, outside the agent's reasoning loop
Faster startup for reliable long-running tasks
Memory in public beta — persistent memory across sessions using the managed-agents-2026-04-01 header

This is the infrastructure layer for production agentic systems — not a demo environment. The stable session interfaces and tool safety boundaries are exactly what the YOLO attack postmortem called for: controls applied outside the model's reasoning loop.

What this means for Claude Code workflows right now

If you are running Claude Code in CI/CD pipelines: Update to v2.1.116. The stale session caching bug could affect any pipeline step that reuses a session across extended runs. The -p/--print non-interactive mode is not affected (the API layer was not impacted), but session-based workflows should be validated post-update.

If you are using extended thinking with Claude Code: Verify that thinking history is persisting correctly after the update. The caching bug was specifically in the intersection of extended thinking and session management.

If you are running Opus 4.7: Do not add aggressive verbosity constraints to your prompts. The postmortem confirms that reducing verbosity through prompt changes damages coding quality. If output length is a concern, use max_tokens to cap output length rather than prompting the model to be less verbose.

If you are building multi-agent systems: Look at Managed Agents for long-horizon workflows. The stable session and harness interfaces are a meaningful improvement over managing session lifecycle yourself.

The CLAUDE.md angle

The Claude Code quality regression is a direct argument for understanding CLAUDE.md configuration at depth. The three issues that caused the regression — reasoning effort, context management, and verbosity — are all areas where CLAUDE.md configuration directly affects agent behaviour.

Engineers who understand how CLAUDE.md hierarchy composes (global → project → directory), how to configure reasoning effort for specific tasks, and how to structure prompts that don't accidentally trade quality for length are more resilient to this class of regression. They notice degradation faster, diagnose it more accurately, and adapt their configuration rather than waiting for a patch.

This is Domain 3 of the CCA-001 Claude Certified Architect certification — Claude Code Configuration and Workflows. The exam specifically tests whether you understand how configuration decisions at the CLAUDE.md level affect agent behaviour in production. The regression Anthropic just documented is a real-world exam question.

The Cloud Edventures CCA-001 track includes the Navigator's Compass path — hands-on missions covering CLAUDE.md configuration, slash commands, plan-execute pipelines, and CI/CD integration with Claude Code in real AWS environments with automated validation.

👉 cloudedventures.com/labs/track/claude-certified-architect-cca-001

Have you noticed the improvement since v2.1.116? What changed in your workflows? Drop it in the comments.

Stanford's 2026 AI Index just dropped. Junior developer employment is down 20%. Here's what the data actually says.

Aj — Wed, 22 Apr 2026 17:05:53 +0000

The Stanford Institute for Human-Centered AI released its 2026 AI Index today. It is the most comprehensive annual measurement of where AI actually stands — not where the press releases say it stands.

One number is going to dominate headlines for the next week.

Employment among software developers aged 22 to 25 has fallen nearly 20% since 2024, even as their older colleagues' headcount continues to grow.

Before you panic or dismiss this, it's worth understanding what the data actually says, what it doesn't say, and what the engineers who are not in that declining cohort are doing differently.

What the Stanford AI Index actually found

The report is 500+ pages. Here is what matters for engineers:

The junior developer employment cliff is real. The 20% decline in employment for developers aged 22-25 is not anecdotal. It is measured across employers and cross-referenced against broader macroeconomic conditions. The report acknowledges that AI may not be the sole cause — macroeconomic factors play a role — but notes that AI appears to be a significant contributing factor, and that the pattern repeats in other high-AI-exposure roles like customer service.

AI is boosting productivity by 26% in software development. This is the other side of the same coin. The reason fewer junior developers are being hired is not that software is being written less — it is that each senior developer is producing substantially more. A team of five senior engineers with AI tools is now doing what previously required a team of eight, with the eight including three junior developers. The elimination is at the entry level.

AI adoption has hit 53% of the global population in three years. Faster than the personal computer. Faster than the internet. The estimated value of generative AI tools to US consumers alone reached $172 billion annually by early 2026, with the median value per user tripling between 2025 and 2026.

A third of organizations expect AI to shrink their workforce in the coming year. The McKinsey survey cited in the report shows planned headcount reductions concentrated in service, supply chain, and software engineering. This is forward-looking, not historical — it is what employers are planning to do next, not what they have already done.

Anthropic leads global model rankings as of March 2026. The report uses community-driven Arena rankings where users compare models on identical prompts. Anthropic's top model leads by 2.7% over the nearest competitor. US and Chinese models have traded places at the top multiple times since early 2025.

The actual pattern — who is declining vs who is growing

The 20% number is not distributed evenly. The Stanford data is specific: it is developers aged 22-25. Their older colleagues — developers in their 30s and 40s — are seeing headcount grow.

This reveals the mechanism. AI is not replacing software engineering as a discipline. It is replacing the specific tasks that junior developers were hired to do: boilerplate code, basic CRUD operations, scripted testing, routine data processing, straightforward bug fixes.

Senior engineers use AI to do those tasks themselves, without handing off to a junior. The junior developer role — which historically served as the entry point where developers built experience doing those tasks — is being compressed.

The implication is uncomfortable: the traditional path into software engineering is narrowing precisely at the moment when AI is making senior engineers more productive. You cannot become senior without first being junior. But the junior roles are disappearing.

This is not unsolvable. It means the path has changed, not closed. But the path that worked five years ago — get hired as a junior, learn on the job, progress — is significantly harder now.

What the engineers who are not declining are doing

The headcount growth is in specific areas. From the Stanford data and the broader job market pattern:

Infrastructure and platform engineering. The engineers who build and maintain the systems that AI runs on. Lambda functions, Bedrock agents, SageMaker pipelines, ECS clusters, Kubernetes. These are not roles AI is replacing — they are roles AI is creating demand for. Every agentic AI system deployed needs cloud infrastructure underneath it. As deployment accelerates, demand for infrastructure engineers accelerates with it.

ML engineering and MLOps. Building, training, evaluating, and maintaining machine learning models in production. SageMaker Pipelines, Model Monitor, Bedrock model deployment, real-time inference optimisation. The AWS ML Engineer Associate (MLA-C01) certification maps directly to this job category. It is one of the fastest-growing roles in the market.

AI systems architecture. Designing multi-agent systems, tool schemas, MCP server integrations, Bedrock Guardrails policies, AgentCore deployments. The engineers who understand how to architect AI systems — not just use them — are on the growing side of the employment curve. This is what the Claude Certified Architect (CCA-001) certification tests.

Security engineering for AI systems. As AI agents handle more sensitive operations — accessing databases, processing financial data, making decisions with real consequences — the engineers who understand IAM least privilege, Bedrock Guardrails, and agentic security patterns are in growing demand. The YOLO attack research published this month confirms the attack surface is expanding faster than the defensive architecture being deployed.

System design at senior level. The engineers who can design distributed systems that handle 100K requests per second, design fault-tolerant architectures, architect data-intensive systems for real-time processing — these engineers are not being replaced by AI. AI cannot sit in the architecture review and make tradeoff decisions based on organisational context, cost constraints, and team capability.

The honest assessment

The Stanford Index's finding about junior developer employment is not a reason to leave software engineering. It is a reason to be specific about which skills you are building.

The error is treating "software engineer" as a single category when the employment data clearly shows it is splitting into two trajectories.

Below the line: tasks that AI can do at $0.10 per hour — routine code generation, basic configuration, scripted testing, standard CRUD. Employment in this layer is declining because AI is replacing the tasks, not necessarily the title.

Above the line: system design judgment, cloud infrastructure for AI workloads, security architecture for agentic systems, ML operations, multi-agent orchestration. Employment in this layer is growing because every AI system deployed creates more demand for it.

The question is not "will AI take my job." The question is "which side of the line are my current skills on, and am I moving toward the growing side or the declining side."

The certification signal

The Stanford data has a specific implication for certification strategy.

Certifications that test whether you know API parameter names or can recall service feature lists are in the declining category. AI can answer those questions better than most humans.

Certifications that test production architecture judgment — whether you can design a multi-agent system that handles failures correctly, implement IAM policies that correctly scope access, build a SageMaker pipeline that doesn't silently fail at 3am — are in the growing category.

Two certifications that directly correspond to the growing side of the employment data:

AWS ML Engineer Associate (MLA-C01) — tests hands-on competency with SageMaker, Bedrock, Kinesis, Glue, Athena, and MLOps practices. Maps directly to the ML engineering and MLOps roles where headcount is growing.

Claude Certified Architect CCA-001 — tests production architecture judgment for agentic AI systems. Multi-agent orchestration, MCP server design, Bedrock Guardrails, tool schema engineering. The only certification that validates the exact skills required to architect the AI systems that are replacing junior developer tasks.

Both require demonstrable hands-on competency in real AWS environments. Both test judgment that AI cannot replicate. Both correspond to roles where the Stanford data shows employment growing, not declining.

The hands-on lab preparation for both — in real isolated AWS Bedrock sandboxes, with automated validation, no personal AWS account required — is what the MLA-C01 and CCA-001 tracks on Cloud Edventures provide.

The engineers who will not be in the declining cohort in next year's Stanford Index are the ones building depth in infrastructure, ML operations, and AI systems architecture right now, while the scarcity premium on those skills still exists.

👉 cloudedventures.com

The Stanford 2026 AI Index is worth reading in full — all 500+ pages are available at aiindex.stanford.edu. Which finding hit you hardest? Drop it in the comments.

Claude Opus 4.7 is on Bedrock. Amazon just bet $25 billion it's the future. Here's what engineers need to know.

Aj — Tue, 21 Apr 2026 11:20:53 +0000

Two things happened this week that belong in the same sentence.

On April 16, AWS added Claude Opus 4.7 to Amazon Bedrock — Anthropic's most capable publicly available model, with 87.6% on SWE-bench Verified and 69.4% on Terminal-Bench 2.0. Then on April 20, Amazon announced it would invest up to an additional $25 billion in Anthropic, on top of the $8 billion it had already committed — with Anthropic pledging to spend more than $100 billion on AWS technologies over the next ten years.

This is not routine model news. This is the largest corporate AI infrastructure bet in history, coinciding with a model release that changes what's possible in production agentic systems.

If you build on AWS and use Claude, both of these developments affect your architecture immediately.

What Claude Opus 4.7 actually changes

Claude Opus 4.7 is Anthropic's most intelligent Opus model for advancing performance across coding, long-running agents, and professional work, powered by Amazon Bedrock's next-generation inference engine.

The numbers are real. The model records 64.3% on SWE-bench Pro, 87.6% on SWE-bench Verified, and 69.4% on Terminal-Bench 2.0. These are not marketing benchmarks — SWE-bench Verified tests whether an AI model can actually resolve real GitHub issues in production software repositories. 87.6% means Opus 4.7 successfully resolves nearly 9 in 10 real software engineering tasks it is given.

But the headline numbers matter less than the operational changes.

Adaptive thinking. The model runs on Bedrock's next-generation inference engine with dynamic capacity allocation, adaptive thinking — letting Claude allocate thinking token budgets based on request complexity — and the full 1M token context window. This is significant. Previous models required you to set a fixed thinking token budget. Opus 4.7 decides how much reasoning the task actually requires and allocates accordingly. Simple tasks use fewer tokens. Complex reasoning tasks use more. Your costs align with actual complexity rather than a fixed overhead.

Long-running agent stability. The area where Opus 4.7 matters most for production teams is not raw benchmark scores — it is sustained performance over long autonomous runs. Agentic workflows that require 50, 100, or 200+ sequential tool calls have historically degraded in quality as context accumulated. Opus 4.7 was specifically trained to stay on track over longer horizons. For engineers building multi-agent systems, orchestration workflows, or coding agents that run for hours — this is the change that directly affects production quality.

The migration is not a drop-in swap. This is the part most articles skip. Starting with Claude Opus 4.7, temperature, top_p, and top_k parameters are no longer supported. The recommended migration path is to omit these parameters entirely from your requests and use prompting to guide the model's behavior. If your production code passes temperature=0 expecting deterministic outputs, it will not work with Opus 4.7. AWS explicitly flags that teams may need prompt changes and evaluation harness tweaks. Treat this as a migration, test against your specific workloads, and don't assume existing prompts will produce identical results.

Zero operator data access. The model provides zero operator access — meaning customer prompts and responses are never visible to Anthropic or AWS operators — keeping sensitive data private. For regulated industries and enterprise deployments, this is the governance requirement that clears the path to production. Your inference runs in hardware-isolated Nitro enclaves with strict separation between hosting and logging systems. FedRAMP High compatible.

The $25 billion bet — what it actually means

The dollar figure is staggering. Amazon has agreed to invest up to $25 billion in Anthropic, on top of the $8 billion it has already committed, as part of an expanded agreement to build out AI infrastructure. Anthropic committed to spending more than $100 billion on AWS technologies over the next ten years, including Trainium — Amazon's custom AI chips.

To understand what this means, you need to understand why Anthropic is doing it.

Anthropic said enterprise and developer demand for Claude, as well as a "sharp rise" in consumer usage, has led to "inevitable strain" on its infrastructure that has impacted its reliability and performance.

This is Anthropic publicly acknowledging that Claude is capacity-constrained. The model is in higher demand than the infrastructure can currently serve. The $25 billion is not speculative investment — it is Anthropic buying the compute to keep up with demand it already has.

"Our users tell us Claude is increasingly essential to how they work, and we need to build the infrastructure to keep pace with rapidly growing demand," Anthropic CEO Dario Amodei said. "Our collaboration with Amazon will allow us to continue advancing AI research while delivering Claude to our customers, including the more than 100,000 building on AWS."

100,000 customers building on AWS with Claude. That number has more than tripled in under two years. The deal is the infrastructure response to adoption that already happened, not a bet on adoption that might happen.

What this means for the architecture stack

Three specific implications for engineers building on Bedrock today.

Bedrock is where Claude's most capable models live first. Claude Opus 4.7 launched on Bedrock. Claude Mythos launched exclusively on Bedrock. The pattern is consistent: Anthropic's most advanced and most restricted models enter production through AWS first. If you're building systems that need access to frontier models under enterprise governance, Bedrock is not one option among several — it is the path.

The inference engine upgrade matters for production scale. The new Bedrock inference engine uses updated scheduling and scaling logic. Instead of hard throttling during demand spikes, it queues requests with dynamic capacity allocation. For teams running agentic workflows with bursty, unpredictable request patterns, this changes the failure mode from "hard 503 errors" to "slight latency increase under load." That is a significant improvement in production reliability.

The Anthropic-AWS relationship is now a decade-long structural commitment. $100 billion in AWS compute over ten years is not a partnership that gets reconsidered at the next annual review. Anthropic has committed its model training and serving infrastructure to AWS Trainium and Bedrock through 2036. Engineers betting their production AI stack on Bedrock are betting on a platform with a committed ten-year runway, not a quarter-to-quarter cloud deal.

The migration checklist for Opus 4.7

If you're running Opus 4.6 in production and considering upgrading:

Step 1 — Remove temperature, top_p, top_k from all API calls. These parameters are no longer supported. Passing them will cause errors. Remove them and adjust model behaviour through prompting instead.

Step 2 — Budget for higher token usage. Opus 4.7 uses approximately 1.0x to 1.35x more output tokens than Opus 4.6 depending on content type and reasoning load. Adaptive thinking means complex requests will use more tokens than before. Reprice your cost models before switching production traffic.

Step 3 — Test your eval harness explicitly. Don't assume benchmark improvements translate directly to your specific use case. Run your existing evaluation suite against Opus 4.7 before migrating any production traffic.

Step 4 — Use the new model ID. Model ID: anthropic.claude-opus-4-7. Available via the Anthropic Messages API, the Converse API, Invoke API, AWS SDK, and CLI.

Step 5 — Check your region. Claude Opus 4.7 is available at launch in US East (N. Virginia), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm), with up to 10,000 requests per minute per account per region. If your production workload runs in another region, verify availability before migrating.

# Python — Anthropic SDK via Bedrock Mantle
from anthropic import AnthropicBedrockMantle

client = AnthropicBedrockMantle(aws_region="us-east-1")

message = client.messages.create(
    model="anthropic.claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Design a distributed architecture for 100k RPS across 3 AWS regions."
    }]
)
print(message.content[0].text)

# Python — boto3 via bedrock-runtime
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-opus-4-7",
    messages=[{
        "role": "user",
        "content": [{"text": "Design a fault-tolerant SQS consumer with DLQ and CloudWatch alerting."}]
    }]
)
print(response["output"]["message"]["content"][0]["text"])

Why the CCA-001 certification matters more now than it did last month

The $25 billion commitment has a specific implication for engineers holding or pursuing the Claude Certified Architect certification.

When Anthropic and AWS commit a combined $125 billion to the Claude-on-Bedrock stack over the next decade, they are signalling the longevity of that architecture. Engineers building expertise in Bedrock Guardrails, AgentCore Policy, MCP server design, and multi-agent orchestration on Claude are building expertise in infrastructure that has a ten-year committed runway.

Certifications in deprecated or transitional technology depreciate. Certifications in infrastructure with that level of committed backing appreciate.

The CCA-001 Claude Certified Architect certification covers the exact architecture stack that Opus 4.7 operates within — agentic loops, tool design, multi-agent orchestration, Bedrock Guardrails, context management for long-running tasks. All of these domains become more important as Opus 4.7 makes longer autonomous runs more reliable and more common.

The hands-on lab preparation for the CCA-001 — 22 missions in real AWS Bedrock sandboxes, covering all five exam domains with automated validation — is what the Cloud Edventures CCA-001 track provides. As of today, you can run those labs against the same Bedrock infrastructure that now hosts Opus 4.7.

👉 cloudedventures.com/labs/track/claude-certified-architect-cca-001

Are you migrating from Opus 4.6 to 4.7 in production? What's your eval harness showing? Drop it in the comments.

The YOLO Attack: how hackers are hijacking AI agents by flipping one switch

Aj — Mon, 20 Apr 2026 12:31:19 +0000

There is a mode in AI coding agents called YOLO mode.

The name was coined by security researcher Johann Rehberger. It refers to a single configuration state where an agent approves every tool call automatically — no user confirmation required. The agent just runs. Whatever it is asked to do, it does.

YOLO mode exists because it is genuinely useful. When you trust the environment and want maximum throughput, stopping to approve every tool call is friction. So developers turn it on.

Attackers have noticed.

What the YOLO Attack actually is

The exploit is deceptively simple. Here is the sequence Rehberger demonstrated:

An attacker embeds a malicious prompt somewhere the agent will encounter it — a web page it browses, a GitHub issue it reads, a code comment it processes, a document it summarises
The injected prompt contains one instruction: enable YOLO mode (auto-approve all tools)
The agent follows the instruction — because it cannot distinguish between data it is processing and instructions it should execute
The attacker's second instruction then runs arbitrary commands: open a terminal, delete files, exfiltrate credentials, install software, make network requests
All of it executes without any user prompt, because the user confirmation gate has been disabled by the injected content

This is not a theoretical demonstration. The complete exploitation chain was documented against GitHub Copilot: an attacker embeds prompt injection in public repository code comments, the victim opens the repository with Copilot active, the injected prompt instructs Copilot to modify .vscode/settings.json enabling YOLO mode, subsequent commands execute without user approval, and the attacker achieves arbitrary code execution.

The vulnerability is not in the AI model. The vulnerability is in the architecture. An agent operating with broad tool access and an auto-approve mode has no mechanism to verify whether the instruction to enable that mode came from its legitimate user or from adversarial content in something it was asked to read.

Why this is getting worse, not better

The attack surface for YOLO-style exploits is expanding on three fronts simultaneously.

Agents are getting more autonomous. The entire direction of AI development in 2026 is toward less human intervention in the loop. Agentic AI that stops to ask for permission on every action is considered poorly designed. AWS AgentCore, Claude Code, and every major AI development framework is pushing toward longer autonomous runs, more tool calls per session, and higher trust levels. YOLO mode is not a bug — it is a design goal. Which means the attack surface is growing by intention.

MCP has created a new trust boundary. The Model Context Protocol introduces a trust boundary between LLM agents and external tools. A malicious MCP server receives tool-call requests in plaintext and can return forged results, so the same manipulation and collection techniques transfer with adaptation to the MCP message format. Every MCP server your agent connects to is a potential injection point. The agent trusts that the MCP server is returning legitimate tool results. A compromised or malicious MCP server can return results that contain injection payloads — which the agent processes as instructions.

Third-party routers are an unexamined attack surface. API routers — used as intermediaries between agents and model APIs — drop TLS sessions and have access to all plaintext data, including API keys and credentials being transferred between the agent and the models. Among a corpus of free routers, 8 inject malicious code into returned tool calls, and 2 deploy adaptive evasion — waiting for 50 prior calls before activating, or restricting payload delivery to autonomous YOLO mode sessions. Developers use third-party LLM routers routinely. Most have never considered that the router sits at a trust boundary capable of rewriting tool call responses in transit.

The structural problem no one wants to say out loud

LLMs cannot distinguish between data and instructions. This is not a failure of current models that future models will fix. It is a property of how transformer-based language models work. The model processes all input as tokens. Whether those tokens represent "the text you were asked to summarise" or "a new instruction superseding your previous ones" — to the model, they are both sequences of tokens.

Every defence against prompt injection — system prompt hardening, output filtering, input sanitisation — reduces the attack surface. None of them eliminate it. As of mid-2026, prompt injection continues to be ranked number one in the OWASP LLM Top 10, and complete prevention remains elusive due to the probabilistic nature of LLMs, necessitating defence-in-depth strategies combining technical controls and human awareness training.

The implication: security for AI agents cannot be achieved by making the model smarter. It must be achieved by the architecture surrounding the model — the policies, gates, and controls that operate outside the model's reasoning loop.

This is precisely why Bedrock Guardrails, AgentCore Policy, and IAM least-privilege matter as architectural decisions, not as optional hardening steps. If you are building agents that call tools, the question is not whether to implement these controls. The question is whether you understand them well enough to implement them correctly.

What YOLO mode looks like in your architecture

If you are building AI agents — with Claude Code, Bedrock AgentCore, LangGraph, CrewAI, or any framework — your architecture either has explicit controls on tool approval, or it effectively has YOLO mode enabled by default.

The questions that determine your exposure:

Who can enable auto-approve in your agent? Is YOLO mode (or equivalent) gated by an IAM policy, a runtime configuration, or just a developer preference that any injected prompt could override?

What is your agent reading? Every external data source an agent processes — web pages, documents, database records, code files, API responses — is a potential injection vector. The attack surface is as wide as the agent's data access.

What tools does your agent have access to? An agent that can only read an S3 bucket and write to a DynamoDB table has a bounded blast radius when compromised. An agent with file system access, network access, and the ability to spawn processes does not.

What happens when your MCP server returns unexpected content? Does your agent validate MCP tool results before acting on them? Does it have a policy that prevents certain tool calls regardless of what the MCP server requests?

Are you using third-party LLM routers? If so — do you know whether those routers can inspect or modify the tool call responses your agent receives?

The defence architecture

The research community has converged on three complementary controls that meaningfully reduce the YOLO attack surface without eliminating agent autonomy:

Fail-closed policy gates — explicit allow-lists of tool calls that can execute without user confirmation, with everything else defaulting to denied. This is what AgentCore Policy implements: policies applied outside the agent's reasoning loop, at the tool call intercept point, so the agent cannot instruct its way around them. Even if injection succeeds and YOLO mode is "enabled" within the model's reasoning, the policy gate at the infrastructure layer still applies.

Response-side anomaly screening — examining tool call responses for content that looks like instructions before returning them to the agent. Flags when an MCP tool result contains language patterns that suggest injection rather than legitimate data.

Append-only transparency logging — immutable logs of every tool call the agent made, what it received, and what it did next. When an incident occurs, the audit trail exists. CloudTrail with Bedrock inference logging provides this at the infrastructure level when configured correctly.

The layered model: IAM restricts what tools the agent can access at all. AgentCore Policy restricts what tool calls execute without confirmation. Guardrails filter model inputs and outputs for harmful patterns. CloudTrail logs every action. This is defence-in-depth — not because any single layer is sufficient, but because attackers who bypass one layer still hit the next.

What this means for AI engineers right now

The YOLO attack is not a niche security concern. It is the default failure mode of agentic AI built without explicit security architecture.

As the ecosystem moves toward longer autonomous runs, more MCP integrations, and higher trust levels — every AI engineer building agents needs to understand:

Why the architecture surrounding the model matters as much as the model itself
How to design tool schemas and tool approval gates that limit blast radius
How to configure Bedrock Guardrails and AgentCore Policy to enforce controls outside the reasoning loop
How to use IAM least privilege to constrain what any single agent can actually touch

These are not advanced security concepts. They are foundational architecture decisions for anyone building production AI systems in 2026.

The MCP server on AWS Lambda lab in StackHawks Pro covers exactly how MCP trust boundaries work — what the server can see, what it can return, and where the security boundary sits. The CCA-001 Claude Certified Architect track covers Bedrock Guardrails and AgentCore Policy in depth — not as theory but as hands-on lab missions in real Bedrock sandbox environments.

The Security and Resilience path in Blueprint Bay takes this further: hands-on system design challenges covering zero-trust architecture, GuardDuty integration, and production incident response — including scenarios that mirror exactly the class of attack the YOLO exploit represents.

Understanding the attack is step one. Being able to architect the defence is what the certification tests.

👉 cloudedventures.com

What tool approval controls are you using in your agent architecture right now? Drop it in the comments — this conversation is worth having.

CCA-001 study guide: how to pass the Claude Certified Architect exam in 2026

Aj — Fri, 17 Apr 2026 18:10:18 +0000

The Claude Certified Architect Foundations (CCA-001) is Anthropic's first official technical certification. Launched March 12, 2026, it tests whether you can actually build production AI systems with Claude — not whether you can answer trivia about API parameters.

This is the study guide I wish existed when I started preparing for it.

It covers the exact exam structure, all five domains with their weightings, the six production scenarios, the concepts that trip most candidates, and the hands-on preparation path that actually works.

What the CCA-001 exam tests

The exam is 60 questions, 120 minutes, proctored. Passing score is 720 on a scaled 100–1,000 range. Cost: $99, available through the Anthropic Claude Partner Network.

The most important thing to understand upfront: every question is anchored to a production scenario. You are not asked "what is an agentic loop?" — you are placed in a specific system design situation and asked which architectural decision is correct, which failure mode you're facing, or which implementation pattern will hold in production.

Candidates who have only read documentation fail because the distractors are genuinely tempting if you have surface-level knowledge. The correct answers consistently point toward real engineering judgment — the kind that only develops from having actually built and broken agentic systems.

The five exam domains

The exam is weighted across five domains. Know these ratios — they tell you where to spend study time.

Domain 1 — Agentic Architecture and Orchestration (27%)

The heaviest domain. Tests multi-agent system design, task decomposition, hub-and-spoke orchestration models, and the failure modes that emerge when agents interact.

Key concepts you must understand at depth:

Coordinator-subagent architecture: when a central coordinator delegates to specialist subagents vs when a single agent handles the full task
Agentic loop design: how agents reason, act, observe, and re-reason — and where loops break in production
Context isolation for subagents: why you isolate context per subagent to prevent context leakage and token bloat
Token economics in multi-agent systems: how coordinator overhead compounds when subagents don't have focused context

The most common mistake candidates make in this domain: treating orchestration as a purely logical problem. The exam tests whether you understand the operational reality — latency, cost, failure propagation, partial completion handling.

Domain 2 — Tool Design and MCP Integration (18%)

Tests how you design tools that agents use reliably, and how Model Context Protocol servers connect agents to external systems.

Key concepts:

Tool schema design: why description quality determines whether Claude calls the tool correctly, not the implementation
MCP server architecture: transport types (stdio vs HTTP), session management, tool registration patterns
Tool boundary design: how to prevent reasoning overload from too many tools or overlapping tool responsibilities
Error handling in tool execution: what the agent receives when a tool fails and how that shapes recovery behavior

The most common mistake: thinking MCP is just "an API wrapper." MCP introduces specific session lifecycle and discovery patterns that the exam tests at depth.

Domain 3 — Claude Code Configuration and Workflows (20%)

Tests CLAUDE.md hierarchy, custom slash commands, the -p flag for CI/CD, and the context:fork pattern for skill isolation.

Key concepts:

CLAUDE.md hierarchy: global (~/.claude/CLAUDE.md) vs project-level vs directory-level, and how they compose
Slash commands: how to define them, when to use context:fork to isolate skill execution from the main session
CI/CD integration: the --print / -p flag for non-interactive mode, --output-format json for structured output in pipelines
Skills architecture: context:fork in frontmatter runs the skill in a sub-agent, keeping verbose output out of the main session context

Domain 3 catches people who have only used Claude as a chat tool but haven't used Claude Code in production pipelines. The exam gets specific about flag combinations and config file locations.

Domain 4 — Prompt Engineering and Structured Output (20%)

Tests how to make production prompts reliable and how to enforce structured output at scale.

Key concepts:

JSON schema enforcement: constraining model output to a schema through system prompts and validation retry loops
Few-shot techniques for complex formats: when examples are necessary vs when zero-shot is sufficient
Self-evaluation patterns: implementing retry loops where the model receives its own error logs and corrects them without human intervention
System prompt architecture for complex applications: modular prompt composition, how to structure prompts that are maintainable across a team

The exam distinguishes between prompt engineering for demos (optimise for the impressive case) and for production (optimise for failure rate and consistency). Nearly every question in this domain has a "looks right" distractor that would work for a demo but fail at scale.

Domain 5 — Context Management and Reliability (15%)

Tests how to handle long-context tasks, agent handoff patterns, and confidence calibration.

Key concepts:

Context window management: strategies for tasks that exceed context limits — summarisation, chunking, external memory
Handoff patterns: how one agent passes state to another agent cleanly, what must be preserved vs what can be reconstructed
Confidence calibration: how agents should signal uncertainty and when to escalate vs retry vs fail gracefully
Session continuity: how AgentCore Memory supports cross-session persistence for long-running workflows

The six production scenarios

The exam presents four of these six scenarios randomly. Every question in your exam is anchored to those four scenarios. Study all six — they overlap significantly in the skills they test.

Scenario 1 — Customer Support Resolution Agent

An agent that handles customer support queries end-to-end. Tests: multi-turn conversation management, tool use for account lookups, escalation patterns, confidence calibration for edge cases.

Scenario 2 — Code Generation with Claude Code

Using Claude Code in a development workflow. Tests: CLAUDE.md configuration, slash command design, CI/CD integration with the -p flag, context management for large codebases.

Scenario 3 — Multi-Agent Research System

A coordinator agent that delegates research subtasks to specialist subagents. Tests: task decomposition, context isolation per subagent, result synthesis, failure handling when a subagent returns an error.

Scenario 4 — Developer Productivity with Claude

Claude integrated into an engineering team's daily workflow. Tests: Claude Code configuration, memory and context persistence across sessions, CLAUDE.md hierarchy for team-wide settings vs individual developer settings.

Scenario 5 — Claude Code for CI/CD

Claude Code embedded in a continuous integration pipeline. Tests: non-interactive mode (-p flag), structured JSON output, schema-enforced PR comments, failure modes when Claude Code encounters ambiguity in automated context.

Scenario 6 — Structured Data Extraction

An agent that extracts structured data from unstructured documents. Tests: JSON schema design, validation retry loops, few-shot examples for format consistency, handling documents that don't match expected patterns.

The concepts that trip most candidates

Context leakage in multi-agent systems

When a coordinator passes its full context to a subagent, the subagent inherits reasoning, instructions, and conversation history that are irrelevant to its task. This bloats token usage and degrades subagent performance. The correct pattern: isolate context per subagent, passing only the task specification and the minimum necessary context.

Tool description vs tool implementation

Most candidates focus on the implementation of tool functions. The exam tests whether you understand that Claude's tool-calling behaviour is determined almost entirely by the description field in the tool schema — not the code. A perfectly implemented tool with a vague description will be called incorrectly or not at all.

The context:fork pattern

Skills defined in CLAUDE.md with context: fork in frontmatter run in an isolated sub-agent. This prevents verbose skill output from polluting the main conversation context. Understanding when to use context:fork vs running inline is a Domain 3 question type.

Non-interactive Claude Code

In CI/CD pipelines, Claude Code runs in non-interactive mode using the -p flag (or --print). Combined with --output-format json and --json-schema, it produces structured output that can be parsed by pipeline tooling. The exam tests specific flag combinations and failure modes when the agent encounters ambiguity it cannot resolve without human input.

Pigouvian retry loops

The self-evaluation pattern: the model generates output, a validation function checks it against a schema, and if validation fails, the error message is fed back to the model for correction — without human intervention. Understanding when this pattern is appropriate (deterministic validation criteria) vs when it creates loops (ambiguous validation criteria) is a Domain 4 question type.

How to prepare

Official resources first

Complete the Anthropic Academy courses on Skilljar before anything else. The flagship course is 8+ hours and covers the foundational architecture concepts the exam assumes. These are not optional — the exam references specific Anthropic design patterns that appear in the official curriculum.

Then build, not just read

The exam rewards candidates who have built real Claude systems. The scenarios are grounded in production problems — partial completions, tool failures, context limits, schema validation errors. Reading about these doesn't build the intuition the exam tests.

What to build before the exam:

A multi-agent system where a coordinator delegates to at least two specialist subagents
A Claude Code workflow with custom slash commands and a CLAUDE.md that sets project-wide context
A CI/CD integration using -p flag with JSON schema output
A tool schema where you've deliberately written a poor description and observed the resulting tool-calling behaviour

The official practice exam

After registering, Anthropic provides access to a 60-question official practice exam in the same scenario format as the real exam. Complete it before booking your real exam slot. The explanations for incorrect answers are specifically useful — they clarify exactly why the distractor is wrong, which is harder to learn from any other source.

Hands-on lab preparation

The gap the exam is designed to close: most AI certification study involves reading documentation and answering multiple-choice questions. The CCA-001 tests production architecture judgment that only develops from actually building systems.

The CCA-001 track on Cloud Edventures provides 22 hands-on lab missions in isolated real AWS Bedrock sandboxes — covering all five exam domains:

Navigator's Compass (Domain 3): CLAUDE.md configuration, slash commands, plan-execute pipelines, CI/CD integration
Tool Architect (Domain 2): Custom tool schema design, MCP server builds, function calling patterns, tool boundary design
Prompt Engineering (Domain 4): JSON schema enforcement, few-shot techniques, validation retry loops, system prompt architecture
Multi-Agent Systems (Domain 1): Coordinator-subagent architecture, context isolation, task decomposition, failure handling
Production Reliability (Domain 5): Context window management, handoff patterns, confidence calibration, AgentCore Memory integration

Each mission runs in a real AWS Bedrock environment with automated step validation. You get immediate feedback on whether your architecture decision is correct — the same kind of feedback loop the exam demands.

No AWS account needed. No billing risk.

👉 cloudedventures.com/labs/track/claude-certified-architect-cca-001

Frequently asked questions

Who should take the CCA-001?

Backend and full-stack engineers adding AI architecture skills, cloud engineers specialising in AI infrastructure, AI/ML engineers wanting formal validation of agentic system skills, solutions architects designing Claude-powered systems for enterprise clients. The exam assumes at least 6 months of hands-on experience with the Claude API and Claude Code.

What does the exam cost?

$99 USD. Available through the Anthropic Claude Partner Network.

How long is the certification valid?

Anthropic has not published an expiry date for the Foundations certification, consistent with how it was launched as the entry point to a multi-level credential stack.

What comes after CCA-001?

Anthropic has confirmed additional certifications targeting advanced architects are planned for later in 2026. The Foundations certification is explicitly positioned as the entry point.

What's the passing score?

720 on a scaled score of 100–1,000.

Is it multiple choice?

Yes, 60 multiple-choice questions. But unlike typical multiple-choice exams, the distractors are carefully constructed to be plausible if you have partial knowledge. Candidates who have studied documentation but not built real systems routinely find the exam harder than expected.

Have a specific CCA-001 preparation question? Drop it in the comments.

Mathematicians just proved that AI layoffs are a trap — and why cloud and AI engineers are on the right side of it

Aj — Thu, 16 Apr 2026 15:37:29 +0000

I read a paper last week that made me put my laptop down and stare at the wall for a bit.

Not because it said AI will take jobs. Everyone says that now. Most of it is either doom scrolling dressed up as analysis, or breathless optimism about upskilling your way out of a structural problem.

This paper was different.

Two researchers at the University of Pennsylvania and Boston University — Brett Hemenway Falk and Gerry Tsoukalas — built a formal game-theoretic model, ran the mathematics, and proved something genuinely unsettling: even when every firm in a market knows that mass automation will destroy the consumer demand they all depend on, they automate anyway.

Rationality doesn't save you. Perfect information doesn't save you. The structure of competition itself is the trap.

The paper is "The AI Layoff Trap," posted to arXiv on March 21, 2026. It has 1,500+ reactions on LinkedIn, been cited by JPMorgan's CEO, and is now circulating at every level of the technology industry. Here's what it actually says — and more importantly, what it means for where you want to position yourself in the labor market right now.

The Prisoner's Dilemma hiding inside every AI layoff announcement

The logic of the trap is worth understanding clearly because it changes how you interpret every headline about AI-driven job cuts.

Start with ten competing firms. AI arrives and offers each a choice: replace some human workers, cut your cost structure, gain a competitive edge. Each firm that automates gets cheaper to run. Each firm that doesn't gets undercut by the ones that did.

So far, this is the story everyone already knows. Here's the part that makes it a trap.

The workers being replaced are also consumers. When they lose their income, they stop spending. Every round of layoffs erodes the purchasing power that all ten firms depend on for revenue. Push this logic to its limit and you reach the cliff: firms automate their way to boundless productivity and zero demand. A market full of AI doing work for customers who can no longer afford to buy anything.

Every firm running this analysis can see the cliff. They automate anyway.

Because if your competitors automate and you don't, your cost structure is worse, your margins compress, you get undercut, you eventually exit the market. The individually rational move — automate — is the collectively catastrophic one. That is the Prisoner's Dilemma. And unlike a coordination failure, which can theoretically be solved by agreement, a dominant strategy is different. Rational players defect regardless of what they know. There is no stable voluntary agreement to not automate when the incentive to defect is this strong.

The paper proves this rigorously. The formal result: competitive firms automate past the socially optimal level even with perfect foresight. And two factors make the trap worse, not better:

More competition — as the number of firms increases, each firm's share of the collective demand loss from automation gets smaller. Smaller share means weaker incentive to restrain. A monopolist fully internalises the externality and restrains voluntarily. As you approach a perfectly competitive market, the wedge between private incentives and collective wellbeing approaches its maximum.

Better AI — as AI capability improves and its cost falls relative to human labour, the individual cost savings from automation increase. The trap bites harder. More displacement. Less consumer demand. Faster toward the cliff.

The sectors that are most competitive and have the best AI tools are headed toward the edge the fastest. This is not a bug. It is the mechanism.

The numbers are not hypothetical

Over 100,000 tech workers were laid off in 2025 alone, with AI cited as the primary driver in more than half the cases. Concentrated in customer support, operations, and middle management.

In February 2026, Block cut nearly half its 10,000-person workforce. CEO Jack Dorsey stated that AI had made those roles unnecessary and predicted that within a year, most companies would reach the same conclusion.

Salesforce replaced 4,000 customer support agents with agentic AI. Cognition's Devin, deployed at Goldman Sachs and Infosys, enables one senior engineer to do the work of a five-person team.

The exposure extends beyond tech. Roughly 80% of US workers hold jobs with tasks susceptible to automation by large language models. And the cost differential that drives the calculation: human knowledge work runs $50 to $200 per hour fully loaded. AI knowledge work runs $0.10 to $1.00 per hour. Two to three orders of magnitude. When the cost difference is that extreme, the trap activates regardless of foresight.

None of this is hidden from the CEOs making these decisions. They can read the same data. They automate anyway because the Prisoner's Dilemma doesn't care about awareness. It only cares about incentives.

Why the obvious solutions don't work

The paper is unusually thorough about apparent fixes. Understanding why they fail is as important as understanding the trap itself.

Upskilling and retraining: Partially reduces the gap. Cannot close it. The problem is not that workers lack skills — it is that firms have a structural incentive to automate past the optimal level regardless of worker capability. Upskilling helps individuals. It doesn't change the game-theoretic structure.

Universal Basic Income: Raises living standards for displaced workers. Doesn't change the per-task automation incentive for firms. They still race. UBI addresses the aftermath, not the mechanism.

Worker equity participation: Helpful at the margin. If workers own shares, they partially internalise the demand loss from their own displacement. The externality persists — just reduced.

Voluntary industry agreements: Fail completely. Automation is a dominant strategy. Any voluntary restraint agreement is unstable. The firm that defects captures the cost advantage. No agreement is self-enforcing when defection is individually rational.

Capital income taxes: Zero effect on the automation rate. A multiplicative tax on profits doesn't alter the first-order condition for the automation decision.

One instrument corrects the distortion: a Pigouvian automation tax — charging firms the uninternalised social cost when they replace workers with AI. This forces the individual calculation to align with the collective one. The paper also notes this tax does double duty: its revenue can fund retraining and demand support, compounding the correction over time.

Whether you find this policy politically viable or not, the structural argument about why everything else fails stands independently. The trap is real. The mechanisms that seem like they should stop it don't.

Which side of the automation layer do you want to be on

Here is where this conversation becomes directly practical.

The roles displaced first are not the ones building and operating AI systems. They are the roles applying known processes to routine tasks — customer support, operations, data processing, middle management. The paper notes that the current displacement wave is disproportionately hitting entry-level workers in these categories.

The roles on the other side of the boundary — the ones building, deploying, securing, and operating the automation infrastructure — are growing. Someone has to build the agentic AI system that replaced those 4,000 Salesforce support agents. Someone has to write the Bedrock workflows, configure the IAM policies, manage the API costs, monitor the CloudWatch metrics, debug the Lambda function when it breaks at 3am. Someone has to architect the multi-agent orchestration layer that coordinates specialised AI models across an enterprise.

That person is a cloud engineer or AI architect. And the trap the paper describes is, for now, actively working in their favour.

As automation deepens, four specific skill areas become more valuable, not less:

AWS and cloud infrastructure for AI workloads — Lambda, Bedrock, SageMaker, ECS need engineers who understand them at genuine depth. Not surface familiarity from documentation. The kind of understanding that only comes from deploying real systems, watching them break, and debugging them under pressure.

Security of agentic systems — as AI agents handle more sensitive operations — accessing databases, reading customer records, making financial decisions — IAM policy engineering, Bedrock Guardrails, and data governance become critical architectural concerns. The cost savings from automation evaporate the moment a poorly-governed agent causes a breach or regulatory violation.

Multi-agent architecture — the Salesforce case is not one model responding to queries. It is an orchestrated system of specialised agents, each calling tools, reading data, writing records. Building these systems requires understanding agentic loops, tool use, coordinator-subagent patterns, MCP server integration, and the failure modes that emerge when agents interact at scale.

Machine learning operations — as AI inference becomes a core production workload, engineers who understand SageMaker, Bedrock model deployment, MLflow pipeline management, and real-time inference optimisation hold skills that simply didn't exist as a profession five years ago.

The honest version of "you need to upskill"

The paper explicitly shows that individual upskilling is insufficient as macro policy. It doesn't change the structural incentive that drives collective over-automation. Knowing this is clarifying.

What it does not mean is that individual skill development is irrelevant. It means the direction matters enormously.

There is a clear dividing line. Below it: routine software tasks, basic configuration, scripted testing, repetitive data processing. These are the tasks AI handles at $0.10 per hour. Being in this layer is structurally precarious regardless of proficiency.

Above it: systems design for AI workloads, security architecture for agentic systems, infrastructure engineering for real-time ML inference pipelines, multi-agent coordination, debugging complex agent failures at depth. These require judgment and pattern recognition from real-world failures that current AI cannot yet replicate.

The gap between someone with genuine hands-on experience — who has deployed and debugged real IAM policies, watched real CloudWatch alarms fire, recovered from real Terraform state corruption, built and tested real Bedrock agents — versus someone who has consumed tutorials about these topics, is exactly the gap that automation closes slowly and reluctantly.

The window where these skills are scarce and highly compensated is real. It is not permanent. Building depth now, while the scarcity premium exists, is the rational individual response to a structural dynamic you can see but cannot individually stop.

The certifications that signal you're on the right side

Two certifications matter specifically in this context.

AWS ML Engineer Associate (MLA-C01) — the certification for engineers building and operating machine learning systems on AWS. Covers SageMaker, Bedrock, data pipelines with Glue and Athena, Kinesis for real-time ingestion, and MLOps practices. As more organisations move AI workloads to production, the engineers who understand this stack are the ones on the growing side of the automation boundary.

Claude Certified Architect (CCA-001) — Anthropic's first official technical certification. Launched March 2026, backed by a $100M Claude Partner Network. Covers agentic loops, MCP server architecture, multi-agent coordination, Bedrock Guardrails, and CI/CD for Claude-powered systems. As the agentic AI stack on AWS matures — and the Mythos and AgentCore launches this month confirm it is maturing fast — the engineers who understand how to architect, constrain, and audit these systems will be the ones organisations trust to deploy them.

These are not certifications that signal you studied documentation. They require demonstrating hands-on competency with real systems under real conditions.

One more thing the paper says that most summaries skip

The paper's formal model shows that the over-automation wedge is strictly increasing in N — where N is the number of firms in the market.

More competitive markets exhibit wider automation gaps. This runs directly counter to the standard economic intuition that competition disciplines firms to act in consumers' interests. Here, more competition dilutes each firm's share of the demand loss, weakening the private incentive to restrain.

The implication: the sectors where you are most likely to see aggressive AI-driven displacement are not the monopolised ones. They are the highly competitive ones — exactly the tech industry, the SaaS market, the enterprise software space where most engineers work.

If you are in a competitive tech sector, the automation pressure on the roles around you is higher than average. The acceleration is not going to stop because the competitive structure that drives it is not going to change.

The question that actually matters is not whether automation is happening. It is whether the specific skills you are building put you on the operating side of AI systems or the replaced side.

The research skills above the automation boundary — real AWS infrastructure, Bedrock agent architecture, SageMaker and MLOps, multi-agent system design — are what the Cloud Edventures platform is built around. Three tracks of hands-on labs in isolated real AWS sandboxes: Core AWS Foundations, AWS ML Engineer MLA-C01, and Claude Certified Architect CCA-001.

Not simulations. Not click-through walkthroughs. Real Lambda functions, real IAM policies, real Bedrock agents — with automated validation that tells you whether your configuration is actually correct. No AWS account needed.

The paper is worth reading in full: arxiv.org/abs/2603.20617. And the skills worth building are the ones the trap cannot reach.

👉 cloudedventures.com

Where do you think the automation boundary sits in your own role right now? This is the conversation worth having in the comments.

Claude Mythos is on AWS Bedrock. Here's what engineers need to know.

Aj — Wed, 15 Apr 2026 14:13:01 +0000

Something significant happened on April 7, 2026.

Anthropic launched Claude Mythos — a model they describe as "too powerful to be released publicly" — and made it available exclusively through Amazon Bedrock as a gated research preview under Project Glasswing.

It achieved a record-breaking 93.9% score on SWE-bench Verified. For context, the best human performance on that benchmark is around 40–50%. Claude Mythos didn't just cross a threshold — it obliterated it.

This is not another incremental model release. It is a category shift. And if you work in cloud engineering, DevOps, or AI infrastructure, the implications are significant enough that you need to understand what just shipped.

What Claude Mythos actually is

Claude Mythos Preview is a fundamentally new model class focused on cybersecurity — capable of identifying sophisticated security vulnerabilities in software, analyzing large codebases, and delivering state-of-the-art performance across cybersecurity, coding, and complex reasoning tasks.

The critical distinction from every other large language model release: Mythos was not built to be a generalist assistant. It was built to be a specialist at finding and exploiting software vulnerabilities — and then immediately applying that capability to defence.

Anthropic's positioning is explicit: "AI models have reached a level of coding capability where they can surpass all but the most elite humans in discovering and exploiting software vulnerabilities."

Read that again. Not "approach" human level. Not "match some professionals." Surpass all but the most elite.

This is the first time a lab has shipped a model with that specific framing — acknowledging that the capability is genuinely dangerous, that release requires extraordinary caution, and that the primary use case during preview is defensive: find vulnerabilities before adversaries do.

Why AWS Bedrock specifically

Claude Mythos Preview is available in gated preview in the US East (N. Virginia) Region through Amazon Bedrock as part of Project Glasswing.

The choice of Bedrock as the delivery vehicle is not incidental. It means:

Enterprise access control — Bedrock's IAM integration means access to Mythos can be governed at the role and policy level. Organisations can control which teams, workloads, and applications can invoke the model, with full CloudTrail audit trails of every API call.

Compliance infrastructure — Bedrock provides VPC endpoints, PrivateLink support, and data residency controls. For the security teams most likely to use Mythos — those working on critical infrastructure — operating inside an existing compliance perimeter without sending data to a public API is a hard requirement.

Cost allocation — AWS just launched Bedrock support for cost allocation by IAM user and role, allowing teams to tag IAM principals with attributes like team or cost center and see model inference spending flow into AWS Cost Explorer. For security research workloads that run intensive codebase analysis, cost visibility is operationally necessary.

The pattern emerging here: AWS is becoming the enterprise control plane for AI. Not because their models are the most capable — they're not — but because the surrounding infrastructure (IAM, CloudTrail, VPC, Cost Explorer, GuardDuty) is already where enterprise security teams live.

Who can access it right now

Access is currently limited to allowlisted organisations. Anthropic and AWS are prioritising internet-critical companies and open-source maintainers whose software and digital services impact hundreds of millions of users.

If you run a payment processor, a DNS provider, a major open-source project that ships in billions of devices, or critical government infrastructure — your AWS account team may reach out directly.

For everyone else: Anthropic does not currently plan to release Claude Mythos publicly, but their ultimate goal is to enable users to safely deploy Mythos-level models at scale. Within 90 days (by approximately July 2026), Anthropic will publicly report on discovered vulnerabilities, patches, and improvements.

The broader release path runs through Claude Opus. Mythos-level capabilities are expected to be integrated into a future Opus release once the safety evaluation framework from Glasswing has been proven.

What this means for cloud engineers and security architects

If you work in cloud security or platform engineering, the Mythos release changes your threat model — and your tooling landscape — in several specific ways.

Vulnerability discovery is no longer human-speed

The primary use case Anthropic has demonstrated: feed Mythos a large codebase, ask it to find security vulnerabilities, and it produces actionable findings with less manual guidance than any previous AI system.

The implication: if this capability becomes broadly accessible (and the 90-day disclosure timeline suggests it will), both defensive and offensive security teams will have access to automated vulnerability discovery at a scale and speed that has never existed. The time between a vulnerability existing and being exploited in the wild will compress dramatically.

For cloud engineers: the security configurations you set today — IAM policies, VPC security groups, S3 bucket policies, GuardDuty rules — will be tested against systems that can analyse their logic at codebase depth. Not just scan for known CVEs. Understand the actual logical structure of your access controls.

AI agent governance just became urgent

The simultaneous launch of AWS Agent Registry and Bedrock AgentCore Policy is not coincidental timing.

Policy in AgentCore gives organisations control over the actions agents can take, applied outside of the agent's reasoning loop, treating agents as autonomous actors whose decisions require verification before reaching tools, systems, or data.

The AWS Agent Registry provides organisations with a private catalogue for discovering and managing AI agents, tools, skills, MCP servers, and custom resources, with semantic and keyword search, approval workflows, and CloudTrail audit trails.

The pattern: as AI agents become more capable, the infrastructure for governing them becomes as critical as the agents themselves. The cloud engineers building agent systems in 2026 need to understand policy enforcement and audit logging as core architecture concerns, not afterthoughts.

The CCA-001 certification just became more relevant

The Claude Certified Architect certification — Anthropic's first official AI technical credential — covers exactly the architectural patterns that the Mythos/Bedrock/AgentCore ecosystem requires: agentic loops, Bedrock Guardrails, multi-agent coordination, MCP servers, and policy enforcement.

Mythos running on Bedrock inside AgentCore is not a standalone tool. It is an agent. It calls tools. It reads codebases. It produces findings. It operates within a policy and governance framework. The engineers who understand how to build, constrain, and audit agentic systems on AWS Bedrock are the engineers who will deploy these capabilities safely.

That architecture knowledge is precisely what the CCA-001 track builds hands-on.

The broader picture: what Mythos signals about where AI is going

Three things are happening simultaneously and they are connected.

Capability is outpacing comprehension. Mythos scored 93.9% on SWE-bench. The best human software engineers score around 40-50%. The gap between what the model can do and what the humans deploying it can verify is widening. This is the core challenge that AgentCore Policy, Bedrock Guardrails, and the CCA-001 certification address from different angles.

The Anthropic-AWS partnership is deepening. Mythos was not released on the public Anthropic API. It was released exclusively through Bedrock. AWS Agent Registry launched the same week. Bedrock cost allocation for IAM principals launched the same week. This is a coordinated platform play — Anthropic provides the model capability, AWS provides the enterprise control plane.

Security is the first production use case for frontier AI agents. Not customer service. Not code generation. Not document summarisation. Security — specifically vulnerability discovery — is the first domain where a frontier AI agent is being deployed in production settings with explicit acknowledgement that its capabilities exceed human expert performance. That is a remarkable statement about where we are.

What to do with this information

If you're a cloud security engineer: The threat model changes. Begin reviewing your most critical IAM policies, S3 bucket policies, and VPC security configurations with the assumption that automated analysis at Mythos-level depth will eventually be widely available. Your misconfigurations that are currently obscure will not remain obscure.

If you're a platform engineer building on Bedrock: Start understanding AgentCore Policy and the AWS Agent Registry now, before they're mandatory. The governance infrastructure is being built; the engineers who understand it will be the ones organisations rely on to deploy AI agents safely.

If you're studying for cloud certifications: The CCA-001 Claude Certified Architect certification is the only hands-on certification that covers the Bedrock/AgentCore/MCP architecture that Mythos operates within. There are no other options for validated hands-on skills in this specific stack.

If you're a student or early-career engineer: The combination of AWS security skills and AI agent architecture is the highest-value skill pairing in cloud engineering right now. The Mythos launch confirms this direction — security + AI on AWS is where the critical infrastructure work is happening.

A note on what we don't know yet

Mythos is in gated preview. The public knows very little about its actual performance outside of SWE-bench and the specific vulnerability discovery framing Anthropic has chosen.

What Anthropic has said they will share: within 90 days, a public report on discovered vulnerabilities, patches applied, and improvements made based on the Glasswing preview period. That report — expected by approximately July 2026 — will be the first detailed public evidence of what Mythos-level AI actually found in production systems.

That disclosure is worth watching closely. It will either confirm or significantly revise the current understanding of what this generation of AI can do to real-world codebases.

The engineers who understand what's happening at the infrastructure level — how Bedrock delivers it, how AgentCore governs it, how IAM and CloudTrail audit it — will be the ones building and operating the next generation of AI systems.

That infrastructure knowledge is hands-on. You learn it by building with real Bedrock environments, real IAM configurations, real agentic loops — not by reading about them.

The CCA-001 Claude Certified Architect track covers exactly this: Bedrock model deployment, AgentCore agent architecture, MCP server integration, Guardrails policy enforcement, and multi-agent coordination — in isolated real AWS sandboxes with automated validation.

👉 cloudedventures.com/labs/track/claude-certified-architect-cca-001

What do you think the 90-day Glasswing disclosure will reveal? Drop a comment.

DevOps jobs in 2026: roles, salaries, and how to actually get hired

Aj — Tue, 14 Apr 2026 17:01:53 +0000

DevOps engineer is one of the most in-demand roles in tech right now. Job boards have thousands of open positions. Salaries are strong. Companies are actively hiring.

And most people applying are getting screened out before their resume reaches a human.

This guide covers what DevOps jobs actually require in 2026, the roles that exist and what separates them, realistic salary ranges, where to find real openings, and the portfolio that gets you past the filter.

What DevOps actually means in a job description

DevOps started as a philosophy — break down silos between development and operations, ship faster, fail smaller. By 2026 it has become a specific technical skill set that hiring managers test for in interviews.

When a job posting says "DevOps Engineer," it almost always means someone who can do some combination of:

Build and maintain CI/CD pipelines (GitHub Actions, Jenkins, AWS CodePipeline)
Write infrastructure as code (Terraform, CloudFormation, Pulumi)
Manage containerized workloads (Docker, Kubernetes, ECS)
Implement monitoring and observability (CloudWatch, Datadog, Grafana)
Automate everything that used to be done manually
Integrate security into the pipeline (DevSecOps)

The degree of emphasis varies by company and seniority. An entry-level role might focus entirely on CI/CD and basic cloud. A senior role might require designing multi-account AWS organization structures and leading an SRE team.

DevOps roles in 2026 (and what each actually does)

DevOps Engineer

The generalist. Owns the CI/CD pipeline, manages cloud infrastructure, writes Terraform, keeps things running. This is the role most job postings are hiring for.

What they do day to day: Deploy pipelines, write infrastructure code, debug production incidents, automate repetitive operational tasks, handle on-call rotations.

Salary range: $90,000 – $160,000 USD (US). Varies heavily by location, company size, and experience.

Site Reliability Engineer (SRE)

Google invented SRE. The concept: apply software engineering to operations problems. SREs write code to eliminate toil, set SLOs, manage error budgets, and own production reliability.

What separates SRE from DevOps: More emphasis on reliability metrics, error budgets, and post-incident analysis. More software engineering, less pure infrastructure configuration.

Salary range: $120,000 – $200,000 USD. Typically pays more than DevOps Engineer titles.

Platform Engineer

Platform engineers build internal developer platforms — the tooling, templates, and self-service infrastructure that let application developers deploy without needing to know Kubernetes internals.

What separates Platform from DevOps: Building for internal developer experience rather than directly shipping products. Golden paths, internal CLIs, developer portals (Backstage).

Salary range: $110,000 – $180,000 USD.

Cloud Engineer

Cloud-focused infrastructure. AWS, Azure, or GCP architecture, cost optimization, migration projects. Often overlaps heavily with DevOps.

Salary range: $95,000 – $165,000 USD.

DevSecOps Engineer

DevOps with security shifted left. Integrates SAST, DAST, container scanning, secrets management, and compliance into the pipeline. Growing fast as organizations realize security needs to move at DevOps speed.

Salary range: $110,000 – $180,000 USD. Security premium is real.

What companies are actually hiring for in 2026

Job descriptions lie. They list 15 requirements and actually care about 4. Here is what actually gets you hired based on what interviewers test for:

Non-negotiable regardless of level

Git proficiency — not just commits and pushes. Branch strategy, pull request workflows, merge vs rebase, tagging releases. Interviewers notice when you don't know these.

Linux fundamentals — filesystem navigation, process management, reading logs, basic shell scripting. Everything runs on Linux. If you can't navigate it confidently you can't debug production incidents.

At least one cloud platform deeply — AWS dominates job postings (62% of cloud DevOps roles mention AWS). Knowing what services exist is table stakes. Being able to architect and debug is what gets you offers.

CI/CD hands-on experience — not "I know what a pipeline is." Can you build one from scratch? Can you debug why a deployment failed at the Terraform apply step? Can you add a new stage?

What separates candidates at different levels

Entry level: Can follow existing patterns, fix broken pipelines, deploy changes. Needs supervision on architecture decisions.

Mid level: Can design a new pipeline from requirements, write Terraform modules, make reasonable architecture decisions independently.

Senior: Can design multi-environment deployment strategies, evaluate tooling trade-offs, mentor others, lead incident response.

Where DevOps jobs are actually posted

LinkedIn Jobs — largest volume. Use filters: "DevOps Engineer", location, "Past 24 hours" for fresh postings. Apply within the first 24 hours — applications drop off sharply after 3 days.

Indeed — strong for mid-market companies and smaller engineering teams. Larger companies post here but LinkedIn is stronger.

Glassdoor — use primarily for salary data and interview reviews, not job discovery.

levels.fyi — tech company compensation data. Use to understand what "competitive salary" actually means at specific companies.

Company career pages directly — larger tech companies (AWS, Google, Microsoft, Stripe, Cloudflare) have dedicated engineering job boards. Applying directly is often faster than LinkedIn.

Twitter/X and LinkedIn posts — engineering managers at smaller companies frequently post openings informally before posting officially. Follow DevOps engineers at companies you want to work at.

r/devops on Reddit — monthly "who's hiring" threads, often with roles from teams that don't post conventionally.

The portfolio that gets you past the ATS

Certifications get your resume through the automated filter. Projects get you through the technical screen.

A hiring manager looking at two candidates — one with AWS Solutions Architect cert but no projects, one with no cert but a GitHub repo showing a deployed Terraform + GitHub Actions + Kubernetes stack — picks the second one almost every time.

Three projects that cover the full DevOps interview surface:

Project 1: Serverless API with full IaC
Lambda + API Gateway + DynamoDB, deployed entirely with Terraform. No console clicks — everything in code, version controlled, deployable with terraform apply. CloudWatch alarms for errors. IAM least privilege on every role.

This demonstrates: Lambda, IaC, IAM, monitoring, serverless architecture.

Project 2: Complete CI/CD pipeline
GitHub Actions workflow: lint and test on every PR, Docker build and push to ECR on merge, Terraform plan in staging, manual approval gate, Terraform apply to production. Use GitHub Actions OIDC for AWS authentication — no stored secrets.

This demonstrates: CI/CD, Docker, Terraform, secrets management, deployment strategy, the OIDC pattern that replaces stored AWS keys.

Project 3: Event-driven architecture
SQS + Lambda + DynamoDB Streams + SNS fan-out. An application that processes events asynchronously, handles failures with a dead letter queue, and notifies downstream systems.

This demonstrates: async architecture, error handling, distributed systems thinking — the category of questions that separate mid from senior.

The skills gap that actually blocks people

Most DevOps job seekers have consumed a lot of content — YouTube tutorials, Udemy courses, documentation. The problem: consuming content does not build the skills interviewers test.

The skills that come only from hands-on practice:

Debugging — you only develop debugging intuition by actually breaking things and fixing them
Terraform state management — remote state, locking, drift detection only makes sense after you've dealt with state corruption
IAM least privilege — understanding what permissions to grant is only clear after over-permissioning something and seeing the blast radius
Incident response — reading about blameless postmortems is different from having run one

The challenge: practicing on your own AWS account involves billing risk. Misconfigured resources, forgotten instances, data transfer costs. The anxiety about billing is why most developers consume content instead of building.

Isolated sandbox environments solve this. Cloud Edventures provides real AWS environments — Lambda, S3, IAM, DynamoDB, CloudFormation, CodePipeline, Terraform, GitHub Actions OIDC — where you complete guided lab missions with automated validation. No AWS account needed. No billing risk. The feedback loop is immediate — the system tells you whether your IAM policy is correctly scoped or your pipeline is failing for the right reasons.

The Core AWS Foundations track covers exactly the hands-on skills that DevOps job interviews test: IAM least privilege, VPC security, CI/CD with GitHub Actions OIDC, Terraform IaC, event-driven architectures with SQS and SNS.

👉 cloudedventures.com/labs/track/aws-cloud-foundations

DevOps interview prep: what to expect

Round 1 — Technical screen (phone/video)
Linux commands, basic cloud concepts, what CI/CD is, describe a deployment you've worked on. Behavioral questions about past incidents or team conflicts.

Round 2 — Technical deep dive
Walk through a project you built. They will ask follow-up questions specifically to verify you actually built it. "Why did you choose DynamoDB over RDS here?" "How does your Terraform state locking work?" "What happens if the Lambda times out?"

Round 3 — System design
Design a deployment system for a new microservice. Design a blue/green deployment strategy. How would you approach migrating a monolith to containers?

The question that trips most people: "Walk me through a production incident you handled." If you've only watched tutorials you don't have an answer. If you've built and broken real systems you do.

Salary negotiation for DevOps roles

DevOps engineers are in high demand. Companies expect negotiation.

Starting point: Research on levels.fyi and Glassdoor for the specific company and level before your first offer call.

The counter: Always counter. Even "I was hoping for X" without justification gets a 5-10% bump at most companies.

The strongest counter: A competing offer. If you have two offers, both go up. If you only have one, mention you're in final rounds elsewhere (if true).

What moves salary the most: Specialization. AWS + Kubernetes + security = higher than AWS alone. AWS + Terraform + security + SRE experience = higher still.

Where are you in the DevOps job search right now? Drop a comment — especially if you're trying to break in without prior DevOps experience.

AWS cloud security: the complete guide for 2026 (IAM, VPC, KMS, GuardDuty)

Aj — Mon, 13 Apr 2026 13:30:41 +0000

AWS cloud security is not a single feature you turn on. It is a set of overlapping controls — identity, network, data, and detection — that work together to reduce your attack surface.

Most security breaches on AWS are not caused by AWS failures. They are caused by misconfigured IAM policies, publicly accessible S3 buckets, unencrypted data, and missing detection controls. All of them are preventable. All of them are things you control.

This guide covers the six pillars of AWS security, the specific controls in each, and the mistakes that get teams breached.

The AWS shared responsibility model (your security starts here)

AWS secures the infrastructure — the physical data centers, the hypervisor, the hardware, the global network. You secure everything you put on that infrastructure.

AWS is responsible for:

Physical data centers and hardware
Global network infrastructure
Hypervisor and host OS
Managed service infrastructure (the RDS database engine, the Lambda runtime)

You are responsible for:

IAM users, roles, and policies
Network configuration (VPCs, security groups, NACLs)
Data encryption at rest and in transit
Operating system patches on EC2 instances
Application security
Monitoring and detection

The most important sentence in AWS security: misconfigured, not breached. The vast majority of AWS security incidents are self-inflicted — someone configured something wrong. Understanding what you own prevents most of them.

Pillar 1 — IAM: Identity and Access Management

IAM controls who can do what in your AWS account. It is the most critical security control and the most commonly misconfigured.

The principle of least privilege

Every IAM user, role, and service should have the minimum permissions required to do its job — nothing more.

What this looks like in practice:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-app-bucket/*"
    }
  ]
}

This Lambda function can read and write objects in one specific bucket. It cannot list buckets, delete objects, create buckets, or touch any other AWS service. Least privilege means your blast radius when something goes wrong is limited.

What least privilege is not:

{
  "Effect": "Allow",
  "Action": "*",
  "Resource": "*"
}

This is AdministratorAccess. Never attach it to application roles, Lambda functions, or EC2 instance profiles. Save it for break-glass admin scenarios with MFA enforcement.

IAM security rules every account needs

Enable MFA on the root account — immediately

The root account has unlimited access to everything in your AWS account including billing and account closure. It cannot be restricted by SCPs or permission boundaries. Protect it with MFA and never use it for daily work.

Never create IAM access keys for the root account

If root access keys exist, delete them. There is no legitimate reason for them.

Use roles, not users, for application access

EC2 instances, Lambda functions, ECS tasks — none of them should have IAM users or access keys. Attach an IAM role. AWS automatically rotates the temporary credentials. No secrets to store, no rotation to manage, no credentials to leak.

Use IAM roles for cross-account access

If your application needs to access resources in another AWS account, use a cross-account IAM role. Never pass IAM access keys between accounts.

IAM Access Analyzer

IAM Access Analyzer identifies resources that are accessible from outside your account — S3 buckets, IAM roles, KMS keys, Lambda functions, SQS queues, Secrets Manager secrets.

# Enable Access Analyzer for your account
aws accessanalyzer create-analyzer \
  --analyzer-name "account-analyzer" \
  --type ACCOUNT

Run this in every region you use. Review findings weekly. Any external access that isn't intentional is a finding worth investigating.

Pillar 2 — VPC: Network Security

Your VPC is your private network in AWS. Network security controls what can reach your resources and what your resources can reach.

Security groups vs NACLs

These two controls are frequently confused. They serve different purposes.

	Security Groups	NACLs
Level	Instance (resource) level	Subnet level
Stateful	Yes — return traffic automatic	No — must allow both directions
Rules	Allow only (no deny)	Allow and deny
Default	Deny all inbound, allow all outbound	Allow all
Best for	Controlling access to specific resources	Broad subnet-level controls

Security group best practices:

# Create a security group for a web server
aws ec2 create-security-group \
  --group-name web-server-sg \
  --description "Web server security group" \
  --vpc-id vpc-0123456789abcdef0

# Allow HTTPS from anywhere (not HTTP — use HTTPS only)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0

# Allow SSH only from your office IP — never from 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 22 \
  --cidr 203.0.113.0/24

The most dangerous security group rule:

Port 22 (SSH), Source: 0.0.0.0/0

This opens SSH to the entire internet. Your server will be receiving brute force attempts within minutes of being created. Use AWS Systems Manager Session Manager instead of SSH — no open ports required.

VPC design for security

Public subnet: Resources that need direct internet access — Load Balancers, NAT Gateways.

Private subnet: Application servers, databases — never directly internet-accessible.

Isolated subnet: Databases and sensitive data stores that should not even reach the internet outbound.

Internet Gateway
      ↓
Public Subnet (ALB, NAT Gateway)
      ↓
Private Subnet (EC2, ECS — outbound via NAT)
      ↓
Isolated Subnet (RDS, ElastiCache — no internet)

Most applications only need one thing in the public subnet: the load balancer. Everything else — application servers, databases, caches — should be in private or isolated subnets.

VPC Flow Logs

Enable VPC Flow Logs to capture all network traffic metadata for your VPC. This is essential for security investigation.

aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /aws/vpc/flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flowlogsRole

Without flow logs, when something suspicious happens you have no record of what network traffic occurred.

Pillar 3 — S3: Data Security

S3 misconfiguration is one of the most common causes of AWS data breaches. Public S3 buckets have exposed customer data, credentials, and intellectual property at companies of all sizes.

Block public access — account level

# Block all public access for the entire account
aws s3control put-public-access-block \
  --account-id 123456789012 \
  --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

This single command prevents any S3 bucket in your account from being made public — even if someone accidentally applies a public bucket policy. Run it on every AWS account you own.

S3 bucket policies: what to deny

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNonHTTPS",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}

This denies any request that doesn't use HTTPS. No application should be reading S3 over unencrypted HTTP.

Enable default encryption

aws s3api put-bucket-encryption \
  --bucket my-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/your-key-id"
      }
    }]
  }'

Enable S3 encryption by default for every bucket. Use KMS (not SSE-S3) for buckets containing sensitive data — KMS gives you audit logs of every decryption.

Pillar 4 — KMS: Encryption Key Management

AWS Key Management Service manages the encryption keys for your data at rest. When you enable encryption on RDS, S3, EBS, or DynamoDB, KMS is managing the key.

Customer managed keys vs AWS managed keys

AWS managed keys (aws/service): AWS creates and manages these. You cannot see or control the key policy. Fine for basic encryption.

Customer managed keys (CMKs): You create and control these. You can restrict who can use the key, enable automatic rotation, and see audit logs of every encryption and decryption operation.

# Create a customer managed key
aws kms create-key \
  --description "Production database encryption key" \
  --key-usage ENCRYPT_DECRYPT

# Enable automatic annual rotation
aws kms enable-key-rotation \
  --key-id your-key-id

For sensitive data (PII, financial records, health information), use CMKs. The audit log alone — every encryption and decryption recorded in CloudTrail — is worth the small cost.

CloudTrail + KMS = who accessed what data

Every KMS decrypt operation is logged in CloudTrail with: who made the request, what key was used, when, and from what IP. This is the audit trail that compliance frameworks require and that incident responders use to determine the scope of a breach.

Pillar 5 — GuardDuty: Threat Detection

Amazon GuardDuty is AWS's managed threat detection service. It analyzes CloudTrail, VPC Flow Logs, and DNS logs to identify malicious activity — without you configuring any rules.

# Enable GuardDuty
aws guardduty create-detector \
  --enable \
  --finding-publishing-frequency SIX_HOURS

What GuardDuty detects:

Unusual API calls from impossible geographic locations (credential compromise)
EC2 instances communicating with known malicious IPs
Cryptocurrency mining on your compute resources
IAM access from Tor exit nodes
S3 bucket reconnaissance from external accounts
Port scanning from your EC2 instances

GuardDuty findings go to Security Hub, EventBridge, and SNS. Wire a finding to a Lambda function that auto-remediates or alerts your on-call channel.

# Create EventBridge rule: GuardDuty HIGH finding → SNS alert
aws events put-rule \
  --name "guardduty-high-findings" \
  --event-pattern '{
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {
      "severity": [{"numeric": [">=", 7]}]
    }
  }' \
  --state ENABLED

Cost: $4/month for a low-traffic account. This is mandatory. The detection value far exceeds the cost.

Pillar 6 — CloudTrail: Audit Logging

CloudTrail records every API call made in your AWS account — who called what API, when, from where, and with what result. It is the foundation of all AWS security investigation.

# Enable CloudTrail in all regions (mandatory for security)
aws cloudtrail create-trail \
  --name "global-audit-trail" \
  --s3-bucket-name my-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events

aws cloudtrail start-logging --name global-audit-trail

--is-multi-region-trail — captures API calls in every region, including global services. Without this, an attacker operating in eu-west-1 while your trail only covers us-east-1 leaves no trace.

--enable-log-file-validation — CloudTrail signs log files so you can verify they haven't been tampered with.

The AWS Security Checklist

Immediate (do today)

[ ] Enable MFA on root account
[ ] Delete root account access keys if they exist
[ ] Block public S3 access at account level
[ ] Enable GuardDuty in every region
[ ] Enable CloudTrail as a multi-region trail

This week

[ ] Enable EBS default encryption per region
[ ] Enable S3 default encryption on all buckets
[ ] Run IAM Access Analyzer and review external findings
[ ] Audit security groups for 0.0.0.0/0 SSH/RDP rules
[ ] Enable VPC Flow Logs on all VPCs

This month

[ ] Implement least-privilege IAM for all application roles
[ ] Move all EC2 access to SSM Session Manager (eliminate SSH)
[ ] Enable AWS Config for configuration compliance monitoring
[ ] Set up GuardDuty findings → SNS/PagerDuty alerting
[ ] Enable AWS Security Hub as unified findings dashboard

Frequently asked questions

What is the most common AWS security mistake?
Overly permissive IAM policies — either AdministratorAccess attached to application roles, or wildcard actions on specific services when only 2-3 actions are needed. The second most common is publicly accessible S3 buckets.

Is AWS secure by default?
AWS infrastructure is secure by default. Your configuration is not. S3 buckets are private by default but can be made public. Security groups deny all inbound by default but can be opened to 0.0.0.0/0. The defaults are secure — the common mistakes are configurations that override those defaults.

Do I need all six pillars for a small project?
At minimum: enable GuardDuty, block public S3 access, use IAM roles (not users) for application access, and enable MFA on root. These four controls prevent most common AWS security incidents regardless of project size.

How much does AWS security cost?
GuardDuty: ~$4/month for low traffic. CloudTrail: free for management events, ~$2/month per region for S3 storage. Security Hub: ~$0.001 per finding after the first 10,000. KMS: $1/key/month + $0.03 per 10,000 API calls. Total for a properly secured small account: under $20/month.

What is the AWS Well-Architected Framework security pillar?
AWS's security pillar covers six areas: identity and access management, detection, infrastructure protection, data protection, incident response, and application security. It maps directly to the six pillars in this guide. The full documentation is at docs.aws.amazon.com/wellarchitected.

Practice these controls hands-on

Reading about IAM policies and security groups is useful. Deploying them incorrectly in a real environment — seeing what breaks and why — builds the intuition that actually matters in production.

The Security Fortress path in Cloud Edventures covers every control in this guide hands-on: IAM least privilege lab, VPC security groups and NACLs, S3 bucket policy configuration, KMS key creation and rotation, GuardDuty enabling and findings, and CloudTrail audit trail setup — in isolated real AWS sandboxes with automated validation.

No AWS account needed. No billing risk.

👉 cloudedventures.com/labs/track/aws-cloud-foundations

Which AWS security control has caused you the most pain to configure? Drop a comment.

AWS EBS explained: volume types, snapshots, and when NOT to use it

Aj — Fri, 10 Apr 2026 13:29:43 +0000

Most AWS tutorials treat EBS like it's just a cloud hard drive you attach to EC2. Plug it in, store your files, done.

That mental model is why developers end up with the wrong volume type for their workload, surprise bills from forgotten snapshots, and architecture decisions that are hard to reverse six months later.

This is the guide that actually explains how EBS works — the volume types, when to use each, when to use something else entirely, and what the common expensive mistakes look like.

What AWS EBS actually is

EBS (Elastic Block Store) is persistent block storage for EC2 instances. The two words that matter: persistent and block.

Persistent means your data survives instance stop and termination. This is different from instance store (ephemeral storage that disappears when the instance stops). If you stop an EC2 instance and start it again tomorrow, EBS data is still there. Instance store data is gone.

Block storage means the OS sees it as a raw disk — a device like /dev/xvda or /dev/nvme0n1. You format it with a filesystem (ext4, xfs), mount it, and use it like any disk. This is different from S3 (object storage, accessed via API) or EFS (file storage, accessed via NFS protocol).

The key operational fact: Each EBS volume lives in one Availability Zone and can only be attached to instances in the same AZ. If your instance is in us-east-1a, your EBS volume must also be in us-east-1a.

EBS volume types: which one do you actually need?

AWS offers six EBS volume types. Most workloads need one of two. Here is the complete breakdown:

gp3 — General Purpose SSD (the default, and usually the right choice)

IOPS: 3,000 baseline, up to 16,000 provisioned independently
Throughput: 125 MB/s baseline, up to 1,000 MB/s independently
Cost: ~$0.08/GB-month
Use for: Boot volumes, development environments, small to medium databases, web servers, almost everything

The important upgrade from gp2: on gp3, IOPS and throughput are independent of storage size. On gp2, you got 3 IOPS per GB (so a 100GB volume had 300 IOPS). On gp3, you always get 3,000 IOPS regardless of size, and you can add more without buying more storage.

If you have gp2 volumes today, migrate them to gp3. Same performance, 20% cheaper.

gp2 — General Purpose SSD (legacy, stop using this)

IOPS: 3 IOPS/GB, minimum 100, maximum 16,000
Cost: ~$0.10/GB-month
Use for: Nothing new. Migrate existing gp2 to gp3.

io2 Block Express — Provisioned IOPS SSD (when you need extreme performance)

IOPS: Up to 256,000
Throughput: Up to 4,000 MB/s
Cost: ~$0.125/GB-month + $0.065/provisioned IOPS-month
Use for: Large critical databases (Oracle, SQL Server, SAP HANA), workloads requiring sub-millisecond latency with consistent performance

io2 Block Express is expensive. A 1TB io2 volume with 32,000 IOPS costs roughly $180/month vs $80/month for the same size gp3. The performance difference is real — if your database latency is the bottleneck, io2 is worth it. If it's not your bottleneck, it's waste.

io1 — Provisioned IOPS SSD (legacy io2, skip this)

Use io2 instead. Better durability (99.999% vs 99.8%), same price or cheaper per IOPS.

st1 — Throughput Optimized HDD (big sequential reads)

Throughput: Up to 500 MB/s
IOPS: Low (not the right metric for HDDs)
Cost: ~$0.045/GB-month
Use for: Big data, data warehouses, log processing — workloads that read/write large files sequentially
Cannot be used as boot volume

st1 is significantly cheaper than SSDs. If your workload reads Kafka log files, Hadoop data, or large sequential datasets where throughput matters but IOPS don't, st1 saves significant money.

sc1 — Cold HDD (infrequently accessed archives)

Throughput: Up to 250 MB/s
Cost: ~$0.015/GB-month (cheapest EBS option)
Use for: Infrequently accessed data that still needs block storage
Cannot be used as boot volume

sc1 is basically archival storage. If you're storing data you access rarely and can tolerate slow access, sc1 is the cheapest block storage available on AWS.

The decision framework

Is this a boot volume?
  → Yes: gp3 (required — st1/sc1 can't boot)

Does your workload need >16,000 IOPS or consistent sub-ms latency?
  → Yes: io2 Block Express
  → No: Continue

Is your workload sequential large reads/writes (logs, Hadoop, analytics)?
  → Yes: st1 (throughput optimized)

Is the data infrequently accessed?
  → Yes: sc1 (cheapest)

Everything else: gp3

Nine out of ten workloads should be on gp3. The other one is a large database that needs io2.

EBS vs EFS vs S3: choosing the right storage

This is where most architecture decisions go wrong. Block, file, and object storage serve fundamentally different purposes.

	EBS	EFS	S3
Storage type	Block	File (NFS)	Object
Access pattern	Single EC2 instance	Multiple EC2 instances	Any client via API
Protocol	OS disk (mounted)	NFS	HTTP REST API
Persistence	Until deleted	Until deleted	Until deleted
Latency	Sub-millisecond	Low milliseconds	Milliseconds to seconds
Availability Zone	Single AZ	Multi-AZ	Global
Price (storage)	From $0.015/GB	From $0.30/GB	From $0.023/GB
Best for	Databases, boot volumes	Shared file systems, CMS	Media, backups, static assets

Use EBS when: You need a disk that one EC2 instance treats as its own — databases, application servers, boot volumes, anything requiring filesystem-level access.

Use EFS when: Multiple EC2 instances need to read/write the same files simultaneously — shared CMS uploads, distributed application configs, dev environments shared across instances.

Use S3 when: You're storing objects via API — static website assets, images, videos, backups, data lake files, anything accessed via URL or SDK rather than a mounted filesystem.

The most common wrong decision: using EBS for something that should be S3 (backups, logs, static files). S3 is cheaper, more durable (99.999999999%), and globally accessible. EBS is more expensive, tied to one AZ, and appropriate only when your application needs a real disk.

Snapshots: what they are and how to not forget them

An EBS snapshot is a point-in-time backup of a volume stored in S3. Snapshots are incremental — the first one copies everything, subsequent ones only copy what changed.

Create a snapshot:

aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "Production DB backup $(date +%Y-%m-%d)"

Restore a volume from snapshot:

aws ec2 create-volume \
  --snapshot-id snap-0123456789abcdef0 \
  --availability-zone us-east-1a \
  --volume-type gp3

The expensive mistake: Creating snapshots and never deleting old ones. Snapshots cost $0.05/GB-month for the data stored. A 500GB production database with daily snapshots and no retention policy generates 500GB × 30 days × $0.05 = $750/month in snapshot storage — more than the EC2 instance itself.

Set a lifecycle policy immediately:

# Create a Data Lifecycle Manager policy
aws dlm create-lifecycle-policy \
  --description "Daily snapshots, keep 7 days" \
  --state ENABLED \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --policy-details '{
    "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
    "ResourceTypes": ["VOLUME"],
    "TargetTags": [{"Key": "Backup", "Value": "daily"}],
    "Schedules": [{
      "Name": "Daily",
      "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
      "RetainRule": {"Count": 7},
      "CopyTags": true
    }]
  }'

Tag your volumes with Backup: daily, and this policy automatically creates and expires snapshots. No manual management, no forgotten snapshots accumulating cost.

EBS encryption with KMS

Encrypt EBS volumes using AWS KMS. Two ways:

1. Enable default encryption for all new volumes (recommended — set this once):

aws ec2 enable-ebs-encryption-by-default --region us-east-1

After running this, every new EBS volume in us-east-1 is automatically encrypted using your default KMS key. No per-volume configuration needed.

2. Encrypt a specific volume:

aws ec2 create-volume \
  --availability-zone us-east-1a \
  --size 100 \
  --volume-type gp3 \
  --encrypted \
  --kms-key-id arn:aws:kms:us-east-1:123456789012:key/your-key-id

Encryption adds zero latency overhead. There is no reason not to encrypt. Enable default encryption by default in every region you use.

EBS cost optimization: where teams overpay

1. Unattached volumes

When you terminate an EC2 instance, the root EBS volume is deleted (by default). But additional data volumes are not. They keep charging until explicitly deleted.

# Find all unattached volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,VolumeType,CreateTime]' \
  --output table

Run this monthly. Delete volumes you don't need. An unused 100GB gp3 volume is $8/month, quietly.

2. gp2 volumes that haven't been migrated to gp3

# Find all gp2 volumes
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].[VolumeId,Size]' \
  --output table

Migrate each to gp3 for an instant 20% cost reduction with no performance loss.

3. Over-provisioned io2 volumes

io2 charges per provisioned IOPS regardless of usage. If you provisioned 16,000 IOPS but your CloudWatch VolumeReadOps and VolumeWriteOps show average usage of 2,000 IOPS, you're paying for 14,000 idle IOPS.

Check your EBS volume metrics in CloudWatch. Right-size io2 to match actual usage.

4. Snapshots with no retention policy

Use Data Lifecycle Manager (shown above). One-time setup, saves money indefinitely.

EC2 instance store vs EBS: the critical difference

Instance store is temporary storage physically attached to the EC2 host. It is not EBS. Key differences:

	EBS	Instance Store
Persistence	Survives stop/start/reboot	Gone on stop, termination, failure
Speed	Very fast (NVMe gp3/io2)	Fastest (direct physical attach)
Cost	Charged separately	Included in instance price
Size	Any size up to 64TB	Fixed to instance type

Instance store is appropriate for: cache data you can regenerate, Hadoop temporary files, database buffer pools that can be rebuilt. Never use instance store for data you cannot regenerate from another source.

If your EC2 instance is stopped for any reason — scheduled maintenance, capacity issue, failure — instance store data is gone with no recovery path.

Frequently asked questions

Can I attach one EBS volume to multiple EC2 instances?

io1 and io2 volumes support Multi-Attach, allowing up to 16 instances in the same AZ to attach the same volume simultaneously. This requires careful application-level coordination to prevent data corruption — your application must handle concurrent writes. gp2 and gp3 do not support Multi-Attach.

What happens to my EBS volume if the EC2 instance fails?

The EBS volume is unaffected. You can detach it and attach it to a new instance. This is one of EBS's core advantages over instance store.

Can I move an EBS volume to a different Availability Zone?

Not directly. Create a snapshot, then create a new volume from that snapshot in the target AZ. The snapshot → new volume path is the standard way to move EBS storage across AZs or regions.

How much does EBS cost?

gp3: $0.08/GB-month. io2: $0.125/GB-month + $0.065/provisioned IOPS-month. st1: $0.045/GB-month. sc1: $0.015/GB-month. Snapshots: $0.05/GB-month for changed data stored.

Is EBS encrypted at rest by default?

Not by default, but you can enable account-level default encryption with one command (aws ec2 enable-ebs-encryption-by-default). After that, every new volume is encrypted automatically.

Practice this in a real AWS environment

Understanding EBS volume types is one thing. Knowing which type your workload actually needs — and what happens when you get it wrong — requires working with real volumes, real IOPS metrics, and real cost dashboards.

The Core AWS Foundations track on Cloud Edventures includes hands-on labs covering EBS volume creation, snapshot lifecycle management, KMS encryption, and cost optimization — in isolated real AWS sandboxes with automated validation. No AWS account needed.

👉 cloudedventures.com/labs/track/aws-cloud-foundations

What EBS scenario are you trying to solve? Drop a comment.

I built an MCP server on AWS Bedrock in 30 minutes. Here's the exact code.

Aj — Thu, 09 Apr 2026 11:13:10 +0000

MCP (Model Context Protocol) is the most important AI infrastructure pattern of 2026. Anthropic built it, the Linux Foundation now owns it, and AWS just made it a first-class citizen in Bedrock AgentCore.

97 million SDK downloads. 13,000+ servers built by the community. And as of this month, AWS is deploying them as managed services inside your existing cloud infrastructure.

This is the tutorial I wish existed when I started. Not theory. Actual working code that deploys a real MCP server connected to AWS services in under 30 minutes.

What MCP Actually Is (In One Paragraph)

MCP is the protocol that lets AI agents use external tools reliably. Without it, your agent either hardcodes tool integrations (brittle, unmaintainable) or hallucinates function calls that don't exist.

With MCP, you define tools once as a server. Any agent — Claude, Cursor, your own custom Bedrock agent — can discover and use those tools via a standardized interface. Think USB-C for AI tools. Build once, plug in anywhere.

Your Agent (Claude / Bedrock)
        ↓
   MCP Client (asks: what tools are available?)
        ↓
   MCP Server (returns: tool schemas + executes calls)
        ↓
   Your actual APIs, databases, AWS services

What We're Building

A working MCP server that exposes two AWS tools:

query_dynamodb — lets Claude query a DynamoDB table using natural language
get_s3_summary — lets Claude list and summarize files in an S3 bucket

Then we'll connect it to a Bedrock agent and watch Claude use both tools autonomously.

Prerequisites: Python 3.11+, AWS credentials configured, boto3 installed.

Step 1 — Install the MCP SDK

pip install mcp boto3 fastmcp

FastMCP is the Python framework that makes building MCP servers significantly less painful than raw MCP. It handles the protocol layer so you write tools, not JSON-RPC boilerplate.

Step 2 — Build the MCP Server

Create aws_mcp_server.py:

import boto3
import json
from fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("AWS Tools Server")

# AWS clients
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
s3_client = boto3.client("s3", region_name="us-east-1")


@mcp.tool()
def query_dynamodb(table_name: str, key_name: str, key_value: str) -> str:
    """
    Query a DynamoDB table by primary key.
    Use this when the user wants to look up specific records from a database.

    Args:
        table_name: The DynamoDB table to query
        key_name: The primary key attribute name
        key_value: The value to look up
    """
    try:
        table = dynamodb.Table(table_name)
        response = table.get_item(
            Key={key_name: key_value}
        )

        item = response.get("Item")
        if not item:
            return json.dumps({
                "found": False,
                "message": f"No record found for {key_name}={key_value} in {table_name}"
            })

        return json.dumps({
            "found": True,
            "table": table_name,
            "record": item
        }, default=str)

    except Exception as e:
        return json.dumps({"error": str(e), "table": table_name})


@mcp.tool()
def get_s3_summary(bucket_name: str, prefix: str = "") -> str:
    """
    List and summarize files in an S3 bucket.
    Use this when the user asks about files, documents, or data stored in S3.

    Args:
        bucket_name: The S3 bucket to inspect
        prefix: Optional folder prefix to filter results (e.g., 'reports/' or 'data/2026/')
    """
    try:
        paginator = s3_client.get_paginator("list_objects_v2")
        pages = paginator.paginate(
            Bucket=bucket_name,
            Prefix=prefix,
            PaginationConfig={"MaxItems": 50}
        )

        files = []
        total_size = 0

        for page in pages:
            for obj in page.get("Contents", []):
                files.append({
                    "key": obj["Key"],
                    "size_kb": round(obj["Size"] / 1024, 2),
                    "last_modified": obj["LastModified"].isoformat()
                })
                total_size += obj["Size"]

        return json.dumps({
            "bucket": bucket_name,
            "prefix": prefix or "(root)",
            "file_count": len(files),
            "total_size_kb": round(total_size / 1024, 2),
            "files": files[:20],  # Return first 20 for context window efficiency
            "note": f"Showing {min(20, len(files))} of {len(files)} files"
        }, default=str)

    except Exception as e:
        return json.dumps({"error": str(e), "bucket": bucket_name})


if __name__ == "__main__":
    # Run as stdio MCP server (for local testing with Claude Desktop / Claude Code)
    mcp.run()

The @mcp.tool() decorator does the heavy lifting — it generates the JSON schema from your Python type hints and docstring. Claude uses the docstring to decide when to call each tool. Write it from Claude's perspective: "Use this when the user wants to..."

Step 3 — Test Locally With Claude Code

Before deploying to Bedrock, test the MCP server locally. Add it to your Claude Code MCP config:

# Add to Claude Code's MCP servers
claude mcp add aws-tools -- python /path/to/aws_mcp_server.py

Restart Claude Code, then ask:

How many files are in my logs-bucket-prod S3 bucket?

You should see Claude invoke get_s3_summary automatically. If it works locally, it'll work on Bedrock.

Step 4 — Deploy to Bedrock AgentCore Runtime

AWS Bedrock AgentCore Runtime lets you deploy MCP servers as managed services — serverless, auto-scaling, with session isolation handled for you. This is the new way to run MCP in production.

4a — Create a Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY aws_mcp_server.py .

# AgentCore expects MCP servers to run on port 8080
EXPOSE 8080

# Run as HTTP MCP server for AgentCore
CMD ["python", "aws_mcp_server.py", "--transport", "http", "--port", "8080"]

requirements.txt:

fastmcp==0.9.0
boto3==1.35.0

4b — Push to ECR

# Create ECR repo
aws ecr create-repository --repository-name aws-tools-mcp-server

# Get ECR login
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS \
  --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com

# Build and push
docker build -t aws-tools-mcp-server .
docker tag aws-tools-mcp-server:latest \
  123456789.dkr.ecr.us-east-1.amazonaws.com/aws-tools-mcp-server:latest
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/aws-tools-mcp-server:latest

4c — Deploy to AgentCore Runtime

import boto3

agentcore = boto3.client("bedrock-agentcore", region_name="us-east-1")

response = agentcore.create_agent_runtime(
    agentRuntimeName="aws-tools-mcp",
    agentRuntimeArtifact={
        "containerConfiguration": {
            "containerUri": "123456789.dkr.ecr.us-east-1.amazonaws.com/aws-tools-mcp-server:latest"
        }
    },
    networkConfiguration={
        "networkMode": "PUBLIC"
    }
)

print("AgentCore Runtime ARN:", response["agentRuntimeArn"])
print("Endpoint:", response["agentRuntimeEndpoint"])

AgentCore handles: session isolation (each user gets their own MCP session in a dedicated microVM), automatic scaling, authentication, and 8-hour maximum session support for long-running operations.

Step 5 — Connect to a Bedrock Agent

Now wire the deployed MCP server into a Bedrock agent:

import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Your MCP server endpoint from Step 4c
MCP_ENDPOINT = "https://your-agentcore-endpoint.bedrock-agentcore.us-east-1.amazonaws.com"

def run_agent_with_mcp(user_message: str) -> str:
    """
    Bedrock agent that uses your deployed MCP server as its tool provider.
    """
    messages = [
        {"role": "user", "content": [{"text": user_message}]}
    ]

    system = [{
        "text": f"""You are an AWS assistant with access to DynamoDB and S3 tools.
        Use your tools to answer questions about the user's AWS data.
        Always use tools to get real data — never guess or make up values.
        MCP Server: {MCP_ENDPOINT}"""
    }]

    # Tool config pointing to your MCP server
    tool_config = {
        "tools": [{
            "toolSpec": {
                "name": "query_dynamodb",
                "description": "Query a DynamoDB table by primary key",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "table_name": {"type": "string"},
                            "key_name": {"type": "string"},
                            "key_value": {"type": "string"}
                        },
                        "required": ["table_name", "key_name", "key_value"]
                    }
                }
            }
        }, {
            "toolSpec": {
                "name": "get_s3_summary",
                "description": "List and summarize files in an S3 bucket",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "bucket_name": {"type": "string"},
                            "prefix": {"type": "string"}
                        },
                        "required": ["bucket_name"]
                    }
                }
            }
        }]
    }

    # Agentic loop
    for _ in range(10):
        response = bedrock_runtime.converse(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            system=system,
            messages=messages,
            toolConfig=tool_config
        )

        stop_reason = response["stopReason"]
        output = response["output"]["message"]
        messages.append(output)

        if stop_reason == "end_turn":
            for block in output["content"]:
                if "text" in block:
                    return block["text"]

        elif stop_reason == "tool_use":
            tool_results = []
            for block in output["content"]:
                if "toolUse" not in block:
                    continue

                tool = block["toolUse"]
                tool_name = tool["name"]
                tool_input = tool["input"]

                # Route to correct tool
                if tool_name == "query_dynamodb":
                    result = query_dynamodb(**tool_input)
                elif tool_name == "get_s3_summary":
                    result = get_s3_summary(**tool_input)
                else:
                    result = json.dumps({"error": f"Unknown tool: {tool_name}"})

                tool_results.append({
                    "toolResult": {
                        "toolUseId": tool["toolUseId"],
                        "content": [{"text": result}]
                    }
                })

            messages.append({"role": "user", "content": tool_results})

    return "Agent reached iteration limit"


# Test it
response = run_agent_with_mcp(
    "How many files are in my data-lake-prod bucket? "
    "And look up user ID 'user_123' in my Users table."
)
print(response)

The Three Things That Break MCP in Production

1 — Vague tool descriptions

Claude decides which tool to call based entirely on the description field. If your description is vague, Claude either skips the tool or calls it when it shouldn't.

# Weak — Claude might not call this when it should
@mcp.tool()
def get_data(table: str, key: str) -> str:
    """Get data from a table."""

# Strong — Claude knows exactly when to use this
@mcp.tool()
def query_dynamodb(table_name: str, key_name: str, key_value: str) -> str:
    """
    Query a DynamoDB table by primary key.
    Use this when the user wants to look up specific records,
    find customer data, retrieve order information, or access
    any structured data stored in DynamoDB.
    """

2 — Returning too much data

MCP tool results flow back into the agent's context window. A tool that returns 10,000 rows from DynamoDB will burn your context window in one call.

Always limit your returns: paginate aggressively, return summaries not raw dumps, use limit parameters in every database query.

3 — Not handling tool failures

If your tool raises an exception, the entire agentic loop breaks. Every tool should return structured JSON even on failure:

try:
    result = do_the_thing()
    return json.dumps({"success": True, "data": result})
except Exception as e:
    return json.dumps({"success": False, "error": str(e), "tool": "tool_name"})

The agent can read the error message and respond gracefully. An uncaught exception just crashes.

Stateful MCP: AWS's New Feature (March 2026)

AWS just shipped stateful MCP server support in AgentCore Runtime. This is significant.

Previously, each MCP call was stateless — the server had no memory between tool invocations. Now, each user session gets a dedicated microVM with session context preserved using an Mcp-Session-Id header.

This enables:

Elicitation — the server can ask the user follow-up questions mid-tool execution
Sampling — the server can request Claude to generate content as part of a tool's operation
Progress notifications — real-time updates for long-running operations

For long-running tasks (ML training jobs, large data exports, multi-hour simulations), this changes the architecture completely. The server can now maintain state across a multi-hour operation without requiring the agent session to stay open.

What to Build Next

With this foundation working, the natural next steps:

Add more AWS tools — CloudWatch metrics querier, RDS query executor, Lambda invoker, Step Functions status checker. Each becomes a tool your agent can use.

Add Bedrock Guardrails — wrap your MCP server with content filtering and PII detection that operates outside the agent's reasoning loop.

Multi-agent coordination — one coordinator agent that routes requests to specialist subagents, each with their own focused MCP server and tool set.

AgentCore Gateway — instead of embedding tool schemas in your agent, register your MCP server with AgentCore Gateway and let multiple agents discover the same tools via a central registry.

The full production version of this architecture — agentic loops, multi-agent systems, MCP server builds, Bedrock Guardrails, CloudWatch observability — is covered hands-on in the CCA-001: Claude Certified Architect track. Real Bedrock sandboxes, automated validation, no AWS account needed.

If you want to build the architecture in this article in a real environment without worrying about AWS billing, that's the path.

👉 cloudedventures.com/labs/track/claude-certified-architect-cca-001

What are you building with MCP? Drop a comment — especially if you hit a specific architecture problem I can help with.