Forem: Eric D Johnson

Serverless with Mama J — Why Serverless

Eric D Johnson — Fri, 22 May 2026 06:00:00 +0000

If you ever want to test how well you really understand something, try explaining it to your mom. That's exactly what I got to do with serverless.

My mom has been my biggest fan and supporter my entire technical career. A few weeks ago she told me she watches every one of my YouTube videos about serverless and wanted to know — in plain English — what it actually is and why it matters. Just so you know, my mom isn't a developer, but she's pretty tech-savvy. I thought it would be fun to record the whole conversation, and you can watch it here.

I also turned that chat into this blog post so anyone who's new to development (or just curious) can follow along easily. So if that's you, welcome! I hope this helps.

Oh, and everyone calls her Mama J… so you can too. Let's dive in!

Serverless starts with servers

Yup, this is the worst-kept secret in all of serverless: it still runs on servers. So, before we talk about serverless, we need to understand what a server actually is and what it does.

The clue is right in the name — a server serves. And what does it serve? Files. That's it. Every website and every web app you've ever used is really just a bunch of different files being delivered to you. Those files are:

HTML (Hypertext Markup Language) – the content and basic structure of the page
CSS (Cascading Style Sheets) – the design and layout
JavaScript – what makes the page interactive
Media – photos, videos, music, and all the other visual goodies

The client

To talk to a server, Mama J needs a client. A client is anything that can send a request to the server and then show (or use) what the server sends back.

Most of the time, that client is a web browser on her computer, tablet, or phone. But it doesn't have to be! For example, the Amazon shopping app on her phone is also a client. So is the Alexa voice service on her Echo device — when she says "Alexa, order more Diet Dr. Pepper," Alexa acts as the client, talks to Amazon's servers, and gets the job done.

The client's job is always the same: it takes the information the server returns and turns it into something Mama J can see or use.

Here's what happens when Mama J wants to buy her son a birthday gift on Amazon.com:

1. She opens her browser and types Amazon.com.
2. The request gets routed to one of Amazon's many servers.
3. The server processes the request.
4. The server sends back the right HTML, CSS, JavaScript, and media so her browser can build the page.

The processing

That little phrase in step 3 — "the server processes the request" — is where the magic happens. This is where the server applies what we call business logic.

Think of business logic as the "brain" or the special rules that make the website actually useful for Mama J. It's the part that decides what to show her and how to make the experience personal.

For example, when Mama J visits Amazon, the site doesn't just show the same boring homepage to everyone. Thanks to business logic, it remembers she was looking at birthday gifts for her son last week, so it puts a big "Recommended for you" section right at the top with more toys, wrapping paper, and even a suggestion for the perfect card. It also knows she loves Diet Dr. Pepper, so it quietly adds a "Buy it again" button for her favorite 12-pack. None of that happens by accident — the server ran its rules: "If this is Mama J, show her the stuff she actually cares about."

The same thing happens on Facebook. When she logs in, the business logic looks at who she follows, what she's liked before, and even how long she usually spends reading family posts. Then it builds her entire news feed so the first thing she sees is a photo of her grandkids, not some random political rant. It's all custom-made just for her.

How does any of this work? Because she logged in, so the system knows it's Mama J. The server then runs its business logic — basically a set of smart instructions that say, "If this is Mama J, show her her stuff, in the order she'll like best." That's business logic in plain English.
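If you're curious what a rule like that looks like in code, here's a tiny Python sketch. It's only an illustration: the `build_homepage` function, the `PROFILES` data, and the section names are all made up for this post, and real sites use far more sophisticated logic.

```python
# Hypothetical shopper profiles; a real site would load these from a database.
PROFILES = {
    "mama_j": {
        "recently_viewed": ["birthday gifts"],
        "buy_again": ["Diet Dr. Pepper 12-pack"],
    },
}


def build_homepage(user_id):
    """Return the homepage sections for a user (illustrative business logic)."""
    profile = PROFILES.get(user_id)
    if profile is None:
        # Unknown visitor: everyone gets the same generic page.
        return ["Today's deals"]
    sections = []
    if profile["recently_viewed"]:
        sections.append("Recommended for you: " + ", ".join(profile["recently_viewed"]))
    for item in profile["buy_again"]:
        sections.append("Buy it again: " + item)
    sections.append("Today's deals")
    return sections
```

Calling `build_homepage("mama_j")` puts her recommendations first, while an unknown visitor only sees the generic deals section.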

The challenge of running servers

Servers are incredible — they power everything from video games to generative AI. But running them comes with some serious headaches.

Security and maintenance

Servers need constant care. New threats appear every single day. If your server is open to the internet, you have to be an expert in secure networking, operating-system patching, hardware upgrades, monitoring, and a dozen other things. You can't just set it and forget it — it demands 24/7 attention.

Availability and scalability

Websites have to stay up 24 hours a day, 7 days a week. That means you need at least two servers so one can cover for the other if something breaks. In reality, big sites run on thousands of servers spread across different geographic areas. They also use auto-scaling to spin up more servers when traffic spikes and scale back down when it drops.

Paying for idle servers

All of that means you're paying for servers to sit there running… even when nobody's using them. Mama J nailed it when she said, "It's like keeping a restaurant fully staffed day and night, whether you have customers or not."

That's what we call idle time — and idle time is expensive.

So Mama J asked the million-dollar question: "Why don't we just call the staff in when the customers actually show up?"

Great question, Mom. That's exactly what serverless does.

The serverless approach

At AWS we've been running Amazon.com at massive global scale for over 30 years, so we know a thing or two about servers. We looked at how companies were struggling with all the work it takes to keep servers running and decided we could make it simpler.

What we set out to build

We wanted to build something that would let developers run their code at any scale without having to manage servers, security patches, scaling, or any of the other stuff that's the same for every company. We call that "undifferentiated heavy lifting."

We wanted developers to focus on their ideas instead of infrastructure. We wanted the system to handle huge traffic spikes automatically, and we wanted them to pay only for the time their code was actually running — basically, wake everything up only when customers show up.

AWS Lambda

AWS Lambda is a service that runs your code without you managing any servers. You write your code, deploy it to Lambda, and it takes care of the infrastructure — servers, networking, security, and scaling.

When a request comes in, here's what happens:

Lambda spins up a small execution environment to run your code
Your business logic does its thing
The response goes back to the user

The first time this happens, Lambda has to set everything up from scratch — that's called a cold start. After that, the environment hangs around for a while, ready for the next request. If another request comes in while it's still warm, it reuses that same environment — a warm start — and it's faster. But if things go quiet for too long, Lambda cleans it up and the next request starts fresh again.
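Here's a minimal sketch of how cold and warm starts show up in a Python function. It assumes the usual Lambda handler shape of `handler(event, context)`; the counter and the returned field names are just illustrative.

```python
import time

# Module-level code runs once per execution environment (the cold start).
ENVIRONMENT_STARTED_AT = time.time()
_invocation_count = 0


def handler(event, context):
    """Lambda-style handler; this part runs on every request."""
    global _invocation_count
    _invocation_count += 1
    return {
        # True only for the first request this environment serves.
        "cold_start": _invocation_count == 1,
        "invocations_in_this_environment": _invocation_count,
    }
```

Anything expensive, like loading configuration or creating clients, is usually placed at module level so warm requests can reuse it.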

Pricing

The pricing model is pay-per-use. You're charged based on the number of requests and how long each one runs. No requests, no charge.

For example, a website handling 1 million requests a month — each taking half a second with 512 MB of memory — would cost about $4.37 total. A traditional server running 24/7 costs $15–$30 a month, and you'd need at least two for availability, so $30–$60 a month regardless of traffic. With Lambda, a month of zero traffic costs $0.00.
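If you want to check that math yourself, here's a rough sketch. The two rates are my assumptions for this example (on-demand x86 list prices as I understand them, with no free tier applied), so check the current pricing page before relying on them.

```python
# Assumed list prices for this sketch; verify against current AWS pricing.
REQUEST_PRICE_PER_MILLION = 0.20      # USD per 1 million requests
PRICE_PER_GB_SECOND = 0.0000166667    # USD per GB-second of compute


def monthly_lambda_cost(requests, seconds_per_request, memory_mb):
    """Estimate a month of Lambda charges for a simple workload."""
    gb_seconds = requests * seconds_per_request * (memory_mb / 1024)
    compute_charge = gb_seconds * PRICE_PER_GB_SECOND
    request_charge = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    return compute_charge + request_charge


cost = monthly_lambda_cost(1_000_000, 0.5, 512)
print(f"${cost:.2f}")  # prints $4.37
```

The same function returns 0 for a month with zero requests, which is the whole point of pay-per-use.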

Scalability

Lambda also scales automatically. By default, your AWS account gets 1,000 concurrent Lambda executions per region — that's 1,000 separate copies of your code running at the exact same time, shared across all your Lambda functions in that region. That's a soft limit — if you need more, you can request an increase from AWS. But for most workloads, 1,000 is more than enough.

To put that in perspective, say Mama J's website goes viral and 1,000 people all hit it in the same second. Lambda spins up 1,000 environments, one for each request, handles them all simultaneously, and sends back 1,000 responses. And since each request typically finishes in milliseconds, those 1,000 slots free up almost instantly and are ready for the next wave. In practice, that means Lambda can handle thousands of requests per second without you configuring anything. With a traditional server, you'd be scrambling to spin up more machines before the traffic crushes you.
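A back-of-the-envelope way to think about it: each concurrent slot can serve one request at a time, so throughput depends on how quickly requests finish. The function below is a hypothetical helper for that estimate. It ignores cold starts and any service-side request-rate quotas, which can cap the real number lower.

```python
def estimated_requests_per_second(concurrency, avg_duration_ms):
    """Rough ceiling: each slot serves 1000 / avg_duration_ms requests a second."""
    return concurrency * 1000 // avg_duration_ms


# 1,000 slots with 100 ms requests vs. the half-second requests from the pricing example.
print(estimated_requests_per_second(1000, 100))  # prints 10000
print(estimated_requests_per_second(1000, 500))  # prints 2000
```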

Right tool for the job

"So, I should just use Lambda for everything?" Mama J asked.

Not quite. Lambda is a great fit for most things — web backends, APIs, data processing, scheduled jobs, chatbots, mobile backends, and event-driven architectures. For the majority of workloads, it's where I'd start. But there are a few situations where it's not the right tool:

Long-running workloads — A single Lambda invocation has a 15-minute maximum, and that applies to synchronous execution. For workloads that need to run longer — heavy video encoding, large data migrations, overnight batch jobs — you'd traditionally reach for something like Amazon ECS or AWS Batch. However, the new AWS Lambda durable functions feature changes the game by letting you build long-running asynchronous workflows that coordinate across multiple Lambda invocations, well beyond the 15-minute limit. It's a big step forward, but you still need to plan your architecture around it.
Ultra-low latency at very high throughput — Some workloads need extremely predictable response times at massive scale, like high-frequency trading systems. In those cases, always-on compute can give you tighter control.
Big legacy monoliths — If you've got a large, tightly coupled application, breaking it into small Lambda functions can be painful. That's more of an architecture challenge than a Lambda limitation, but it's worth knowing.

The thing is, it doesn't have to be all-or-nothing. Most real-world systems use a mix — Lambda where it fits, and other services where they're a better match. But Lambda is still the default starting point for most new projects because it takes so much off your plate.

Wrapping up

Mama J sat back and said, "So let me make sure I've got this. Instead of buying servers, keeping them running, paying for them when nobody's using them, and worrying about security, availability, scalability, and maintenance — you just write your code and Lambda handles the rest?"

Pretty much, Mom. That's serverless in a nutshell.

This is just the beginning, though. Lambda on its own is powerful, but the real fun starts when you connect it to other AWS services and build event-driven applications — things that react automatically when something happens, with no servers to manage anywhere in the stack.

We'll get into all of that in the next post.

Stay tuned!

200,000 MCP Servers Are Exposed. Here's Why Serverless Is Safer.

Eric D Johnson — Wed, 20 May 2026 06:00:00 +0000

I've spent a lot of time thinking about where MCP servers should live. I work with remote MCP servers constantly and do a lot of the architecture work around them. But I also use plenty of local ones. There's a simplicity to npx @modelcontextprotocol/server-whatever that's hard to argue with.

Then MCP crossed 300 million SDK downloads a month in April 2026, and a few days later, OX Security published a disclosure that put a number on what I'd been turning over in my head: the most popular MCP transport has no authentication, and 200,000 servers are running it in production.

That got me to finally put my thoughts together. The short version: the subprocess-spawn vulnerability that OX Security disclosed is specific to STDIO, the local transport. Remote MCP servers avoid that specific attack path, though they introduce different web and API security considerations. And if you're going to run remote MCP servers, I think serverless is the best place to do it.

Let's walk through it.

What OX Security found in MCP's STDIO transport

The root cause is in MCP's STDIO transport. When an MCP client connects to a server via STDIO, it passes a StdioServerParameters object to the SDK. That object contains a command field and an args array that tell the SDK which process to spawn.

The official MCP SDKs across Python, TypeScript, Java, and Rust do not sanitize those fields before passing them to the operating system. Whatever strings arrive get executed as shell commands on the host machine.

The execution sequence makes it worse. The command runs first. Then the MCP handshake tries to validate it as a legitimate server. Then the handshake fails. But the payload already ran. OX Security described this as "execute first, validate never," and that's accurate.
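To make the ordering problem concrete, here's a sketch of what "validate first, then execute" could look like on the client side. This is not the SDK's code or a vetted mitigation: `StdioParams` is a stand-in I made up for the command/args pair, and the allowlist and metacharacter check are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical allowlist of executables this client is willing to launch.
ALLOWED_COMMANDS = {"npx", "uvx", "python3"}
# Characters that only matter if a shell ends up interpreting the string.
SHELL_METACHARACTERS = set(";|&$`<>\n")


@dataclass
class StdioParams:
    """Stand-in for the command/args pair described above (not the SDK class)."""
    command: str
    args: list = field(default_factory=list)


def validate_stdio_params(params):
    """Raise ValueError unless the command is allowlisted and the args look inert."""
    if params.command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {params.command!r}")
    for arg in params.args:
        if SHELL_METACHARACTERS & set(arg):
            raise ValueError(f"suspicious argument: {arg!r}")
    return params
```

Only after this check passes would a client spawn the process, and ideally with an argument list rather than a shell string, so nothing is reinterpreted along the way.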

CVE-2025-49596: Browser to backdoor

The specific CVE that got the most attention was CVE-2025-49596, a CVSS 9.4 critical vulnerability in MCP Inspector, Anthropic's official debugging tool. The Inspector runs a proxy on localhost that accepts commands from its browser UI with zero authentication.

The attack chain: On macOS and Linux, browsers allow websites to make requests to 0.0.0.0, which the OS silently redirects to 127.0.0.1. A developer visits a malicious website. The site's JavaScript reaches MCP Inspector through the 0.0.0.0 bypass. Arbitrary code runs on the developer's machine. No phishing, no suspicious downloads. Just a website visit.

How many MCP servers are affected

The numbers from the OX disclosure:

200,000+ vulnerable server instances
150 million+ cumulative downloads across affected packages
7,000+ publicly accessible MCP servers
10+ high or critical CVEs
200+ affected open-source projects

Why serverless reduces the MCP risk surface

The STDIO model is essentially "download this script, run it as a local server, trust it." The vulnerability exists because of three assumptions baked into the STDIO transport: there's a persistent process running on the host, that process has shell access, and there's no authentication layer between the client and the process.

Serverless compute inverts all three.

No persistent process to hijack. Lambda functions run in Firecracker microVMs, isolated execution environments that the Lambda service manages. There's no customer-managed long-running daemon process exposed to clients.

No client-controlled process spawning. When Lambda serves as an MCP server, communication happens over HTTPS. The MCP Streamable HTTP transport replaces the STDIO subprocess model entirely. There's no path for a client to trigger arbitrary process execution on the host.

Auth is infrastructure, not application code. STDIO local transports rely on local process trust. With Lambda, you get IAM auth on Function URLs and IAM or Cognito authorizers on API Gateway. Auth is a configuration toggle, not something you need to build yourself.
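Here's a rough sketch of what the request path looks like once the transport is HTTPS. It assumes a Function URL style event with the JSON-RPC message in `body`, and it is deliberately not a complete MCP server: it skips the initialize handshake, sessions, and tool schemas, and the `TOOLS` catalog is made up.

```python
import json

# Hypothetical tool catalog for this sketch.
TOOLS = [{"name": "echo", "description": "Return the text it was given"}]


def handler(event, context):
    """Answer one JSON-RPC request delivered in the HTTP body."""
    request = json.loads(event.get("body") or "{}")
    request_id = request.get("id")
    if request.get("method") == "tools/list":
        payload = {"jsonrpc": "2.0", "id": request_id, "result": {"tools": TOOLS}}
    else:
        # Standard JSON-RPC code for an unknown method.
        payload = {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": -32601, "message": "Method not found"},
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```

Notice what's missing: nothing in the request can name a command or spawn a process. The client can only pick from methods the function already implements.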

Lambda as an MCP server: build it yourself

IAM auth with SigV4 — Set your Function URL's AuthType to AWS_IAM and every request requires Signature Version 4 signing. No tokens to manage, no OAuth dance, no plaintext API keys.

OAuth 2.1 for external agents — Front the Lambda function with API Gateway, attach a Cognito authorizer. API Gateway validates the token before the request ever reaches your function.

The awslabs STDIO adapter — Already have MCP servers built on STDIO? The awslabs/run-model-context-protocol-servers-with-aws-lambda project wraps existing STDIO-based MCP servers so they run inside Lambda. The STDIO surface stays internal to the Lambda sandbox and is never exposed to the network.
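For the IAM option above, it can help to see why there's no shared token to leak. SigV4 derives a short-lived signing key from your secret key, the date, the region, and the service name. The sketch below shows only that key-derivation step using the standard library; the function names are mine, and in practice you'd let an AWS SDK or signing library build the full signature rather than hand-rolling it.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """HMAC-SHA256 of a text message with a bytes key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key, date_stamp, region, service):
    """Derive the SigV4 signing key; date_stamp is formatted as YYYYMMDD."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder secret for illustration only; "lambda" is the service name for Function URLs.
key = derive_signing_key("EXAMPLE-SECRET", "20260520", "us-east-1", "lambda")
```

Because the date and region are baked into the key, a signature captured for one day or region doesn't work for another.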

Working examples

mikegc-aws/Lambda-MCP-Server: Native Streamable HTTP on Lambda with IAM auth
awslabs/run-model-context-protocol-servers-with-aws-lambda: Official wrapper for existing STDIO servers
aws-samples/sample-serverless-mcp-servers: Collection of agent and server patterns

AgentCore: let AWS manage your MCP servers

Amazon Bedrock AgentCore centralizes auth, policy enforcement, and observability into a single endpoint. The Gateway sits in front of all your MCP servers. Interceptors let you filter which tools an agent can see based on caller identity and context. AgentCore Runtime handles containerization, scaling, and session isolation — you write the server code, AgentCore runs each session in a dedicated microVM.

But what about local file access?

If your MCP tools need local file access, they still need to run locally. That's a real limitation of any remote MCP server. For tools that work with shared files or project artifacts, S3 Files is an interesting middle ground — it mounts an S3 bucket as a local filesystem on your Lambda function.

Run your MCP servers somewhere secure

Of those 200,000 exposed STDIO servers, every one that could be redeployed as a remote MCP server behind authenticated infrastructure would remove itself from that count.

Serverless doesn't patch the protocol. It removes the conditions the vulnerability depends on. No persistent process, no STDIO surface, and auth on every request by default.

Lambda if you want control. AgentCore if you want managed. You've got options. Use them.

Dynamic Looping Comes to AWS SAM

Eric D Johnson — Mon, 18 May 2026 16:20:40 +0000

Update (May 22, 2026): Since the initial launch, local processing of Language Extensions has moved behind an opt-in flag for compatibility. You now enable it with --language-extensions, an environment variable, or a samconfig.toml entry. Details in the "Enabling local processing" section below.

AWS SAM CLI, the command-line tool for building and deploying serverless applications, now supports AWS CloudFormation Language Extensions. It's a close call, but the one I am most excited about is Fn::ForEach, which brings dynamic looping to your YAML templates. If you, like me, have been copy-pasting resource definitions to infinity, that stops today.

ForEach is the star, but it ships alongside Length, ToJsonString, FindInMap with default values, and conditional deletion policies. All of them work across your full local SAM workflow: build, invoke, validate, package, deploy, and sync.

In this post, I walk through what CloudFormation Language Extensions brings to SAM CLI, show you how each extension works, and demonstrate the full local development experience.

The problem: template duplication

To show why this matters, take a look at the following example. I have three AWS Lambda functions (Lambda being the serverless compute service) that each handle a different endpoint on the same API. Almost everything about them is the same: the same runtime, the same memory configuration, and nearly the same structure. The only differences are the name, handler, and possibly some environment variables.

The template looks like this:

Resources:
  UsersFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Handler: users.handler
      CodeUri: ./src
      MemorySize: 256
      Environment:
        Variables:
          FUNCTION_NAME: Users

  OrdersFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Handler: orders.handler
      CodeUri: ./src
      MemorySize: 256
      Environment:
        Variables:
          FUNCTION_NAME: Orders

  ProductsFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Handler: products.handler
      CodeUri: ./src
      MemorySize: 256
      Environment:
        Variables:
          FUNCTION_NAME: Products

Three resources, nearly identical, and if I need to change the memory size or add a tracing configuration, I'm making the same edit three times. The template is fragile and hard to maintain, and it only gets worse at ten or twenty functions. So what can I do about it? That's where Language Extensions come in.

What are CloudFormation Language Extensions?

CloudFormation Language Extensions is a transform (AWS::LanguageExtensions) that unlocks a suite of extended intrinsic functions for your CloudFormation templates. These functions have existed in CloudFormation for a while. What's new is that SAM CLI now processes them locally for your entire development workflow, meaning you can build, invoke, and test locally before deploying.

The full suite includes:

Extension	What it does
Fn::ForEach	Iterate over a collection and generate resources for each item
Fn::Length	Return the length of an array
Fn::ToJsonString	Convert an object or array to a JSON string
Fn::FindInMap with DefaultValue	Look up a value in a Mappings section with a fallback when the key doesn't exist
Conditional DeletionPolicy	Use Fn::If in DeletionPolicy (e.g., Retain in prod, Delete in dev)
Conditional UpdateReplacePolicy	Use Fn::If in UpdateReplacePolicy

To enable them, I add AWS::LanguageExtensions to my template's Transform section alongside the SAM transform:

Transform:
  - AWS::LanguageExtensions
  - AWS::Serverless-2016-10-31

With that in place, I can start using Fn::ForEach to solve the duplication problem I showed earlier.

Fn::ForEach: define once, generate many

Take a look at the same three functions rewritten with Fn::ForEach. Instead of repeating the definition three times, I define it once and let the loop generate the rest:

Transform:
  - AWS::LanguageExtensions
  - AWS::Serverless-2016-10-31

Resources:
  Fn::ForEach::Functions:
    - Name
    - [Users, Orders, Products]
    - ${Name}Function:
        Type: AWS::Serverless::Function
        Properties:
          Runtime: python3.11
          Handler: !Sub "${Name}.handler"
          CodeUri: ./src
          MemorySize: 256
          Environment:
            Variables:
              FUNCTION_NAME: !Sub ${Name}

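That single definition generates three functions: UsersFunction, OrdersFunction, and ProductsFunction. Note that the handler now resolves to Users.handler rather than users.handler, so the module file names need to match the capitalized collection values. If I need to add a fourth, I add one item to the collection array. If I need to change the memory size, I change it in one place.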

The anatomy of Fn::ForEach breaks down into four parts:

Loop name: Fn::ForEach::Functions, a unique identifier for this loop
Iterator variable: Name, the variable that takes each value in turn
Collection: [Users, Orders, Products], the values to iterate over
Template body: The resource definition using ${Name} for substitution
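If it helps to picture the four parts working together, here's a small Python model of the expansion. This is a conceptual sketch only, not how SAM CLI or CloudFormation implement it; `substitute` and `expand_for_each` are names I made up.

```python
def substitute(value, variable, item):
    """Replace ${variable} in strings, recursing through dicts and lists."""
    token = "${" + variable + "}"
    if isinstance(value, str):
        return value.replace(token, item)
    if isinstance(value, dict):
        return {
            substitute(k, variable, item): substitute(v, variable, item)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [substitute(v, variable, item) for v in value]
    return value


def expand_for_each(loop):
    """Expand [iterator variable, collection, template body] into resources."""
    variable, collection, body = loop
    resources = {}
    for item in collection:
        resources.update(substitute(body, variable, item))
    return resources


loop = [
    "Name",
    ["Users", "Orders", "Products"],
    {"${Name}Function": {
        "Type": "AWS::Serverless::Function",
        "Properties": {"Handler": "${Name}.handler", "MemorySize": 256},
    }},
]
resources = expand_for_each(loop)
print(sorted(resources))  # prints ['OrdersFunction', 'ProductsFunction', 'UsersFunction']
```

Each pass through the collection stamps out one copy of the template body with the iterator variable filled in, in both the logical ID and the property values.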

That covers the basic case where all functions share the same source code. However, what happens when each function needs its own code directory?

Per-function code directories

In many projects, each function lives in its own folder. Fn::ForEach handles this through dynamic artifact properties, where the CodeUri itself uses the loop variable:

Resources:
  Fn::ForEach::Services:
    - Name
    - [Users, Orders, Products]
    - ${Name}Service:
        Type: AWS::Serverless::Function
        Properties:
          Runtime: python3.11
          Handler: index.handler
          CodeUri: ./services/${Name}

With this directory structure:

services/
├── Users/index.py
├── Orders/index.py
└── Products/index.py

SAM CLI builds each function from its own directory and generates Mappings sections automatically to preserve the Fn::ForEach structure in the deployed template. To see this in action, I check .aws-sam/build/template.yaml after a build:

Mappings:
  SAMCodeUriServices:
    Users:
      CodeUri: UsersService
    Orders:
      CodeUri: OrdersService
    Products:
      CodeUri: ProductsService

Resources:
  Fn::ForEach::Services:
    - Name
    - [Users, Orders, Products]
    - ${Name}Service:
        Type: AWS::Serverless::Function
        Properties:
          CodeUri:
            Fn::FindInMap:
              - SAMCodeUriServices
              - Ref: Name
              - CodeUri
          Handler: index.handler

SAM CLI generates the SAMCodeUriServices mapping so that each collection value resolves to its own build artifact. At package time, those paths become Amazon S3 URIs. I don't need to manage any of this.

The same pattern works for API endpoints. Let me show one more example before moving on to the other extensions.

API endpoints from a loop

I can generate multiple API endpoints from a single definition by attaching an Amazon API Gateway event source inside the loop:

Resources:
  Fn::ForEach::Endpoints:
    - Endpoint
    - [users, products, orders]
    - ${Endpoint}Function:
        Type: AWS::Serverless::Function
        Properties:
          Runtime: python3.11
          Handler: index.handler
          CodeUri: ./endpoints/${Endpoint}
          Events:
            Api:
              Type: Api
              Properties:
                Path: !Sub /${Endpoint}
                Method: get

I run sam local start-api, and I get three working endpoints: /users, /products, /orders, all generated from that single resource definition.
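For completeness, here's what one of those handlers might look like. This is a hypothetical endpoints/users/index.py with placeholder data; the response shape is the standard API Gateway proxy format of status code, headers, and a string body.

```python
import json


def handler(event, context):
    """Respond to GET /users through an API Gateway proxy event."""
    # Placeholder data; a real function would query a data store.
    users = [{"id": 1, "name": "Mama J"}]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"path": event.get("path"), "users": users}),
    }
```

The other two directories would hold their own index.py with the same handler name, which is why a single Handler: index.handler works for every iteration of the loop.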

Fn::ForEach is the biggest addition, but the other extensions in the suite solve real problems of their own.

Beyond Fn::ForEach: Length, ToJsonString, FindInMap, and more

Each of the remaining extensions addresses a specific gap in what CloudFormation templates could express before.

Fn::Length

When I generate resources from a collection, I sometimes need to know how many items are in that collection. Maybe I'm setting a concurrency limit based on the number of services, or creating an Amazon CloudWatch alarm that scales with the fleet. Previously, I'd hardcode that number and forget to update it when the collection changed. Fn::Length returns the length of an array at deploy time:

Parameters:
  ServiceNames:
    Type: CommaDelimitedList
    Default: "api,worker,scheduler"

Resources:
  ServiceCountMetric:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: !Sub
        - "Monitoring ${Count} services"
        - Count:
            Fn::Length: !Ref ServiceNames

Fn::ToJsonString

Lambda functions frequently need structured configuration passed as environment variables. The problem is that environment variables are strings, so I end up building JSON by hand inside !Sub with escaped quotes and line breaks, and it breaks the moment someone forgets a backslash.

Fn::ToJsonString solves this by converting an object to a JSON string inline:

Environment:
  Variables:
    CONFIG:
      Fn::ToJsonString:
        region: !Ref AWS::Region
        table: !Ref MyTable
        version: "2.0"

No more escaping quotes in YAML, and no more !Sub gymnastics to build JSON strings. I define the object naturally and let Fn::ToJsonString handle serialization. The function reads CONFIG as a standard JSON string at runtime, and if I add a field, I add it to the YAML object and the serialization stays correct.
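On the function side, reading that value back is a one-liner. A minimal sketch, assuming the variable is named CONFIG as in the template above; `load_config` is a hypothetical helper name.

```python
import json
import os


def load_config():
    """Parse the CONFIG environment variable into a dict (empty if unset)."""
    return json.loads(os.environ.get("CONFIG", "{}"))


def handler(event, context):
    """Use the structured configuration inside the function."""
    config = load_config()
    return {"table": config.get("table"), "version": config.get("version")}
```

For local testing, I can set CONFIG to a JSON string by hand and call the handler directly.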

Fn::FindInMap with DefaultValue

Mappings are great for region-specific or environment-specific configuration. However, Fn::FindInMap throws a hard error if the key doesn't exist. So if I add a new region or deploy to one I didn't explicitly map, the stack fails. I end up maintaining an exhaustive list of every possible key, or wrapping lookups in conditions.

Now I can provide a default value that CloudFormation uses when the key isn't found:

Mappings:
  RegionConfig:
    us-east-1:
      BucketPrefix: "use1"
    eu-west-1:
      BucketPrefix: "euw1"

Resources:
  MyBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub
        - "${Prefix}-my-app"
        - Prefix:
            Fn::FindInMap:
              - RegionConfig
              - !Ref AWS::Region
              - BucketPrefix
              - DefaultValue: "default"

If I deploy to ap-southeast-1, no crash. I get "default" instead of a stack failure.
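If you think in code more than in templates, the behavior is close to a dictionary lookup with a fallback. This Python analogy is just that, an analogy with names I made up, not how CloudFormation evaluates the function.

```python
_MISSING = object()

REGION_CONFIG = {
    "us-east-1": {"BucketPrefix": "use1"},
    "eu-west-1": {"BucketPrefix": "euw1"},
}


def find_in_map(mapping, top_key, second_key, default=_MISSING):
    """Look up mapping[top_key][second_key]; fall back to default if given."""
    try:
        return mapping[top_key][second_key]
    except KeyError:
        if default is _MISSING:
            # Old behavior: a missing key is a hard error.
            raise
        return default


print(find_in_map(REGION_CONFIG, "ap-southeast-1", "BucketPrefix", default="default"))  # prints default
```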

Conditional DeletionPolicy and UpdateReplacePolicy

In a multi-environment setup, I want production Amazon DynamoDB tables and S3 buckets to survive accidental stack deletions. But in dev, I want clean teardowns without orphaned resources cluttering the account. Previously, I needed separate templates or manual post-deploy steps because DeletionPolicy only accepted a static string.

Now it accepts intrinsic functions:

Conditions:
  IsProd: !Equals [!Ref Environment, prod]

Resources:
  MyTable:
    Type: AWS::DynamoDB::Table
    DeletionPolicy: !If [IsProd, Retain, Delete]
    UpdateReplacePolicy: !If [IsProd, Retain, Delete]
    Properties:
      TableName: !Sub "${Environment}-data"
      BillingMode: PAY_PER_REQUEST

One template handles both: production retains data on deletion, dev cleans up after itself.

That covers all the extensions. The next question is how they fit into the SAM CLI workflow.

Full SAM CLI workflow support

Every SAM CLI command supports Language Extensions:

sam build: Expands loops in memory, builds each generated function
sam local invoke: Invoke expanded functions by name
sam local start-api: Serves all generated API endpoints
sam validate: Catches syntax errors and unsupported patterns locally
sam package: Preserves the Fn::ForEach structure with S3 URIs
sam deploy: Uploads your original template for CloudFormation to process
sam sync: Syncs changes to the cloud, including code-only updates

SAM CLI expands language extensions in memory for local operations because it needs to know which functions to build and invoke. But your original unexpanded template is what goes to CloudFormation. You get the full local development experience with no template modification for deployment.

Enabling local processing

Local processing of language extensions is opt-in. You enable it with the --language-extensions flag:

sam build --language-extensions
sam local invoke UsersFunction --language-extensions --event events/get-user.json
sam local start-api --language-extensions
# Test your endpoints at http://localhost:3000/users
sam deploy --language-extensions --guided

Each command needs its own activation. To avoid repeating the flag, set the environment variable:

export SAM_CLI_ENABLE_LANGUAGE_EXTENSIONS=1
sam build
sam local invoke UsersFunction

Or persist it in your samconfig.toml:

[default.build.parameters]
language_extensions = true

If multiple methods are set, the precedence order is: CLI flag > samconfig.toml > environment variable.

Before you get started, there are a few constraints worth knowing about.

Limitations and constraints

Collections must be locally resolvable. Your Fn::ForEach collection can be a static list ([A, B, C]) or a parameter reference (!Ref MyParam). It cannot use Fn::GetAtt, Fn::ImportValue, or SSM/Secrets Manager dynamic references. These require cloud API calls that SAM CLI can't make locally. The error messages are clear and suggest workarounds.

Maximum 5 levels of nesting. You can nest Fn::ForEach loops (environments x services, for example), but CloudFormation caps it at 5 levels deep. You probably won't hit this in practice.

Collection values are fixed at package time. If you use a parameter-based collection with dynamic CodeUri, the parameter values you use at sam package must match what you use at sam deploy. SAM CLI warns you when this applies.

With those constraints in mind, getting started is straightforward.

Get started

This feature is available in SAM CLI v1.160.0 and later. Update and try it:

pip install --upgrade aws-sam-cli
sam --version

Take one of your templates with duplicated resources, add the AWS::LanguageExtensions transform, and replace the copy-paste with Fn::ForEach. If you don't have the latest CLI, the install guide has you covered.

This has been one of the most requested features in SAM CLI history (#5647 had years of community upvotes), and the implementation covers the full command surface. Dynamic looping in YAML, supported end-to-end. Define your resources once, generate as many as you need, and deploy with the same workflow you already know.

If you run into issues or want to see what's next, the SAM CLI repo is where it all happens.

My Brother Doesn't Code. Now He Ships Features.

Eric D Johnson — Mon, 18 May 2026 06:00:00 +0000

My brother runs a large crew of pipelayers. The math they do in the field is genuinely hard. Rolling offsets, fitting angles, engineer tape measurements in decimal feet. I don't understand any of it. But I know how to build apps, so I built him a calculator PWA.

The problem came after launch. Every time he needed something changed, he'd send me a message. "Can you make the font bigger?" "The offset calculator is wrong for 22.5 degree fittings." I'd context-switch out of whatever I was doing, try to understand what he meant, get it wrong the first time, and eventually push a fix days later. I was the bottleneck, and I didn't even understand the domain. He knows what the app needs. I know how to code. But the translation layer between us is lossy.

So I built an agent that cuts me out of the loop. He sends a message describing what he wants, the agent makes the changes, deploys a preview, and waits for his approval before pushing to production. No code. No git. No terminal. Just plain English and a preview link.

This isn't a polished product. It's a starting point. But the decisions behind it are intentional, and I think they're worth walking through.

Why Telegram?

I needed a chat interface that was dead simple to use from a phone in the field. Telegram's Bot API made this straightforward. You create a bot through @BotFather, get a token, register a webhook URL, and you're receiving messages as JSON payloads. No app to build, no OAuth flow, no UI framework. The user just opens a chat and starts typing.

The bot becomes the entire interface. My brother doesn't need to know what's behind it.

The Architecture

Here's the full flow:

User sends a message to the Telegram bot
Telegram POSTs the message to an API Gateway endpoint
A Lambda function receives it and invokes an AgentCore Runtime
The AgentCore container runs Claude with access to the codebase
Claude edits code, validates, builds, and deploys a preview to S3/CloudFront
The agent sends status updates and the preview URL back through Telegram
User reviews the preview and says "ship it" (or asks for changes)
The agent merges to main and cleans up

Everything is defined in a single CDK stack. One cdk deploy creates all the infrastructure.

Claude as a Headless Developer

I chose Claude for the agent because of the Claude Agent SDK. It gives you a headless coding experience out of the box. Claude already knows how to read files, write code, run shell commands, and iterate on errors. I didn't need to build any of that tooling. I just needed to point it at a codebase and give it constraints.

The query function from the SDK is the core of it:

from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(
    prompt=message_text,
    options=ClaudeAgentOptions(
        system_prompt=system_prompt,
        allowed_tools=["Read", "Write", "Edit", "Bash", "Glob", "Grep"],
        cwd=workspace,
        max_turns=30,
    ),
):
    if hasattr(message, "result"):
        result_text = str(message.result)

That's the whole invocation. Claude gets the user's message, a system prompt with context about the project, and a set of tools it can use. It figures out the rest.

Guardrails Make It Safe

Handing an AI agent the keys to a production codebase sounds risky. It is, if you don't constrain it. The system prompt is where you define what the agent can and can't do:

GUARDRAILS:
1. ONLY modify files in src/ and tests/ directories.
2. ALWAYS run validation before deploying.
3. If validation fails, fix the issues and re-run until it passes.
4. NEVER deploy code that fails validation.
5. Keep changes focused — don't refactor unrelated code.
6. Preserve the engineer tape (decimal feet) convention.
7. Fitting angles are 11.25 / 22.5 / 45 / 90 degrees — don't change these.

Rules 6 and 7 are domain-specific. I don't know pipelaying math, but I know those values are sacred.
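A system prompt is a soft constraint, so rule 1 can also be backed by a hard check in code. The sketch below is an assumption about how that might look, not something taken from the project: the helper name is_path_allowed and the idea of calling it before accepting a file change are both hypothetical.

```python
from pathlib import Path

# Directories the agent is allowed to modify (mirrors guardrail 1).
ALLOWED_DIRS = ("src", "tests")


def is_path_allowed(workspace, candidate):
    """Return True only if candidate resolves to a path inside src/ or tests/."""
    root = Path(workspace).resolve()
    # Resolving collapses ".." segments, so "src/../package.json" is caught.
    target = (root / candidate).resolve()
    return any(target.is_relative_to(root / name) for name in ALLOWED_DIRS)
```

Because the comparison works on whole path components, a sibling directory such as srcfoo/ does not slip through as a prefix match on src/.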

The workflow section tells Claude exactly how to validate, build, deploy a preview, and wait for approval:

WORKFLOW:
1. Understand the user's request. If unclear, ask for clarification.
2. Make the changes.
3. Run validation: npm run validate
4. If validation fails, fix and retry (up to 3 attempts).
5. Build and deploy to preview.
6. Share the preview URL.
7. Wait for user approval before merging.

The agent never pushes to main on its own. It always deploys to a preview URL first and waits.

Who Gets Access

Every incoming message goes through an authorization check:

if not is_authorized(message.user_id):
    return {"status": "ignored", "reason": "unauthorized"}

The allowlist is a comma-separated list of Telegram user IDs passed as a CloudFormation parameter at deploy time. Simple, intentional, hard to get wrong.
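A minimal sketch of what such a check could look like, assuming the parameter reaches the function as an environment variable. The variable name ALLOWED_USER_IDS and the parse_allowlist helper are hypothetical, and the repo's real is_authorized may differ.

```python
import os


def parse_allowlist(raw):
    """Turn "111, 222,333" into {111, 222, 333}, skipping empty entries."""
    return {int(part) for part in raw.split(",") if part.strip()}


def is_authorized(user_id, allowlist=None):
    """Return True only if the Telegram user ID is on the allowlist."""
    if allowlist is None:
        # ALLOWED_USER_IDS is an assumed name for the deploy-time parameter.
        allowlist = parse_allowlist(os.environ.get("ALLOWED_USER_IDS", ""))
    return user_id in allowlist
```

An empty or missing allowlist authorizes nobody, so a misconfigured deploy fails closed rather than open.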

AgentCore: Ephemeral Compute with Long Runtimes

AI coding agents are slow. They need time to read code, think, write changes, run validation, fix errors, build, and deploy. That can take minutes.

Amazon Bedrock AgentCore gave me exactly what I needed. It spins up a container when a request comes in, keeps it warm for subsequent messages from the same user, and shuts it down after an idle timeout. Session affinity routes subsequent requests from the same session to the same container:

user_id = str(payload.get("message", {}).get("from", {}).get("id", "unknown"))
session_id = f"telegram-user-session-id-{user_id}"

The workspace persists. The conversation history persists. It feels stateful even though the compute is ephemeral.

The Async Trick

Telegram expects a 200 response within seconds. But the agent takes minutes. The solution is an async entrypoint:

import threading

# app is the AgentCore application object created elsewhere in the module
@app.entrypoint
def handle_request(payload, context=None):
    message = parse_telegram_message(payload)

    task_id = app.add_async_task(
        f"agent-{message.user_id}",
        {"chat_id": message.chat_id, "user_id": message.user_id},
    )

    threading.Thread(
        target=_run_agent_background,
        args=(message.chat_id, message.user_id, message.text, task_id),
        daemon=True,
    ).start()

    return {"status": "accepted", "task_id": task_id}

The add_async_task call tells AgentCore "I'm not done yet, don't shut down this container." While the agent works, it sends status updates directly to Telegram through the Bot API.
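Sending a status update is a single HTTPS call to the Bot API's sendMessage method. Here is a rough sketch using only the standard library. The function names are made up for illustration, and the project may well use a different HTTP client.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message_request(bot_token, chat_id, text):
    """Build (but don't send) a POST to the Bot API sendMessage method."""
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{TELEGRAM_API}/bot{bot_token}/sendMessage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_status(bot_token, chat_id, text, timeout=10):
    """Send a status update to the chat and return Telegram's JSON reply."""
    request = build_send_message_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting the request construction from the network call keeps the first half easy to unit test without talking to Telegram.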

One CDK Deploy

The CDK stack creates: API Gateway endpoint, Lambda relay function, AgentCore Runtime (container built from source), S3 + CloudFront for previews, and a Secrets Manager secret for the GitHub deploy key.

agent_runtime = agentcore.Runtime(
    self, "PipelayerRuntime",
    runtime_name="pipelayer_agent",
    agent_runtime_artifact=agentcore.AgentRuntimeArtifact.from_asset(AGENT_DIR),
    environment_variables={
        "TELEGRAM_BOT_TOKEN": telegram_bot_token.value_as_string,
        "DEV_BUCKET": dev_bucket_name.value_as_string,
        "DEV_CDN_DOMAIN": dev_distribution.distribution_domain_name,
        "GITHUB_DEPLOY_KEY_SECRET": deploy_key_secret.secret_arn,
    },
)

The Real Takeaway

AI agents work best when you constrain them. The guardrails in the system prompt are what make this safe enough to hand to someone who doesn't code. The agent can't touch files outside src/ and tests/. It can't deploy without passing validation. It can't push to production without explicit approval.

My brother doesn't need to understand React, or git, or AWS. He just needs to describe what he wants. The agent handles the rest, within boundaries I set.

If you want to dig into the full implementation, the repo is at github.com/singledigit/pipelayer-agent. It's not perfect, but it works. And my brother ships features now.

Creating Your First Lambda Function

Eric D Johnson — Fri, 15 May 2026 06:00:00 +0000

So you want to build your first serverless function? Let's do it. No fluff, no detours — just you, AWS SAM, and a working Lambda function running on your machine in minutes. We're not even going to deploy to AWS yet. Let's just get it working locally first.

What You'll Need

Two things:

AWS SAM CLI installed — SAM (Serverless Application Model) is the tool we'll use to create, build, and test our function. Follow the install guide for your operating system.
Docker or Finch installed — SAM uses a container runtime to simulate Lambda locally. You have two options:
- Docker Desktop — The most common option. Install it and make sure it's running.
- Finch — A lightweight open source alternative from AWS. Install it with brew install finch.

Verify your setup

Check that SAM is installed:

sam --version

If you chose Finch, initialize and start the VM:

finch vm init
finch vm start

SAM automatically detects Finch when Docker isn't running — no extra configuration needed. If you have both installed, SAM uses Docker by default. To make Finch the default on macOS, run:

sudo /usr/libexec/PlistBuddy -c "Add :DefaultContainerRuntime string finch" /Library/Preferences/com.amazon.samcli.plist

Verify your container runtime is working:

docker --version   # or: finch --version

That's it. No AWS account needed yet. We're staying local for now.

Step 1: Initialize Your Project

Open your terminal and run:

sam init

SAM is going to walk you through a few questions. For this blog, I'm using the TypeScript template, but you can choose any runtime and template you'd like — Python, Java, Go, whatever you're comfortable with. The prompts, project structure, and file names will vary depending on what you pick. Here's what I chose:

Template source — Choose 1 - AWS Quick Start Templates
Quick start template — Choose 1 - Hello World Example
Use the most popular runtime and package type? — N
Runtime — Choose 12 - nodejs24.x
Package type — Choose Zip
Starter template — Choose 2 — Hello World Example TypeScript
Enable X-Ray tracing? — N (we'll keep it simple for now)
Enable CloudWatch Application Insights? — N
Enable Lambda Insights? — N
Project name — Hit Enter to accept the default, or name it whatever you like

SAM just scaffolded an entire serverless project for you. Let's look at what we got:

sam-app/
├── template.yaml          # The blueprint for your infrastructure
├── samconfig.toml         # SAM deployment configuration
├── hello-world/           # Your function code lives here
│   ├── app.ts             # Your actual Lambda function
│   ├── package.json       # Dependencies
│   └── tests/             # Unit tests
│       └── unit/
│           └── test-handler.test.ts
└── events/
    └── event.json         # A sample test event

The two files that matter most:

template.yaml — This describes your Lambda function and any AWS resources it needs. Think of it as the blueprint.
app.ts — This is your actual function code. The thing that runs when your Lambda gets invoked.

Understanding the template

Let's look at the key part of template.yaml — the function resource:

Resources:
  HelloWorldFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: hello-world/
      Handler: app.lambdaHandler
      Runtime: nodejs24.x
      Architectures:
        - x86_64
      Events:
        HelloWorld:
          Type: Api
          Properties:
            Path: /hello
            Method: get
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: true
        Target: "es2020"
        Sourcemap: true
        EntryPoints:
          - app.ts

Here's what's going on:

Type: AWS::Serverless::Function — This tells SAM to create a Lambda function. SAM handles all the underlying CloudFormation resources for you.
CodeUri: hello-world/ — Points to the folder where your function code lives.
Handler: app.lambdaHandler — Tells Lambda which file and function to run. In this case, the lambdaHandler export in app.ts.
Runtime: nodejs24.x — The Node.js version your function runs on.

The magic is in the Events section. By adding an event with Type: Api, SAM automatically creates an API Gateway for you — no extra configuration needed. You didn't define an API Gateway resource anywhere, but SAM sees that your function wants to respond to HTTP requests at GET /hello and wires it all up behind the scenes. One line, and you've got a fully managed API endpoint in front of your Lambda.

The Metadata section tells SAM how to build your TypeScript code. Instead of running tsc and bundling manually, SAM uses esbuild — a JavaScript/TypeScript bundler. It compiles your TypeScript, minifies the output, generates sourcemaps for debugging, and packages it all up. You don't need to install esbuild yourself — SAM handles it during sam build.

Step 2: Build It

cd sam-app
sam build

SAM grabs your code, installs any dependencies, and packages everything up. You'll see a .aws-sam folder appear — that's your build output. You don't need to touch it.

Don't have the runtime installed on your machine (e.g., no Node.js)? No problem — just add --use-container and SAM will build inside a Docker/Finch container instead:

sam build --use-container

SAM will automatically pull down a container image that matches your function's runtime, build your code inside it, and output the packaged result to .aws-sam just like a normal build. You get the exact same Lambda environment without installing anything on your machine.

Step 3: Test It Locally

Here's the fun part. Run your Lambda function right on your machine:

sam local invoke

You should see a response like:

{"statusCode": 200, "body": "{\"message\": \"hello world\"}"}

That's your Lambda function running locally. No AWS account, no deployment, no charges. Just your code doing its thing.
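If you picked a Python runtime during sam init instead, the function behind that response would look roughly like the sketch below. The name lambda_handler follows the usual Python convention, but treat the exact file layout as an assumption, since it depends on the template you chose.

```python
import json


def lambda_handler(event, context):
    """Return an API Gateway proxy-style response with a JSON body."""
    return {
        "statusCode": 200,
        # The body must be a string, so the payload is serialized here.
        "body": json.dumps({"message": "hello world"}),
    }
```

The shape is the same in every runtime: a status code plus a string body that API Gateway passes back to the caller.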

Make It Yours

Let's make a quick change so you can see the full loop — edit, build, test. Open hello-world/app.ts and find this line:

message: 'hello world',

Change it to:

message: 'hello Lambda',

Now rebuild and invoke again:

sam build
sam local invoke

{"statusCode": 200, "body": "{\"message\": \"hello Lambda\"}"}

That's the workflow. Change your code, build, test — all local, all instant.

Want to test it as an API instead? Run:

sam local start-api

This spins up a local API Gateway. Open your browser and hit http://127.0.0.1:3000/hello and you'll see:

{"message": "hello Lambda"}

You now have a fully functional serverless API running on your laptop.

What Just Happened?

In about 5 minutes, you:

Scaffolded a serverless project with a single command
Learned how the SAM template automatically wires up a Lambda function and API Gateway
Built your TypeScript code with esbuild — no manual setup needed
Ran a Lambda function locally
Made a code change and saw it reflected instantly
Spun up a local API endpoint and hit it from the browser

All without touching AWS. All free. All on your machine.

Ready to Deploy? Let's Go Live.

You've got it working locally. Now let's put it in the cloud so the rest of the world can use it. For this part, you'll need a couple more things set up.

Additional Requirements

An AWS account — Head to aws.amazon.com/free and sign up. New customers get up to $200 in credits and a Free Plan that lets you explore AWS for up to 6 months at no cost. On top of that, Lambda has its own always-free tier — 1 million requests and 400,000 GB-seconds of compute per month.
AWS CLI installed and configured — This is how your machine talks to your AWS account. First, install it from the AWS CLI install guide. Then follow the Setting up the AWS CLI guide to configure your credentials. For most new users, you'll want the IAM user with short-term credentials option — it's the recommended approach for getting started securely.

Once you're set up, verify it's working:

aws sts get-caller-identity

If you see your account ID, you're connected.

Step 4: Deploy It

From your project directory, run:

sam deploy --guided

The --guided flag walks you through the deployment settings the first time. Here's what to expect:

Stack Name — Give it a name (like my-sam-app)
AWS Region — Pick your closest region (e.g., us-east-1)
Confirm changes before deploy — Y
Allow SAM CLI IAM role creation — Y (SAM needs to create a role for your function)
Disable rollback — N
Save arguments to config file — Y (so you don't have to answer these again)

SAM will show you a changeset — a preview of what it's about to create. Type Y to confirm, and watch it go.

When it's done, you'll see an Outputs section with a URL. That's your live API endpoint. Copy it, paste it in your browser, and...

{"message": "hello Lambda"}

🎉 Your Lambda function is live on AWS.

Clean Up

Don't want to leave resources hanging around? Tear it all down with:

sam delete

SAM removes everything it created. Clean slate.

What Just Happened?

In about 10 minutes, you:

Scaffolded a serverless project with a single command
Learned how the SAM template automatically wires up a Lambda function and API Gateway
Built your TypeScript code with esbuild — no manual setup needed
Tested your function locally and made a code change
Spun up a local API and hit it from the browser
Deployed it to AWS with a real, live URL
Cleaned it all up with one command

SAM handled all the heavy lifting — the CloudFormation stack, the IAM roles, the API Gateway, the packaging. You just answered a few questions.

What's Next?

You've got the basics down. From here you can:

Add more functions to your template.yaml
Connect your Lambda to DynamoDB, S3, or other AWS services
Set up environment variables and custom permissions
Explore event-driven architectures with SNS, SQS, and EventBridge

Want to keep learning? Check out Serverless Land — the go-to hub for all things Lambda and serverless with patterns, tutorials, and best practices. Or dive into the Sessions with SAM YouTube playlist that goes deeper into everything AWS SAM can do.

But that's for next time. For now, celebrate — you just went serverless. 🚀

Lambda Just Got a File System. I Put AI Agents on It.

Eric D Johnson — Wed, 13 May 2026 15:50:28 +0000

You've written this code before. An S3 event fires, your Lambda function wakes up, and the first thing it does is download a file to /tmp. Process it. Upload the result. Clean up /tmp so you don't run out of space. Repeat for every file, every invocation, every function in your pipeline.

S3 Files changes that. You mount your S3 bucket as a local file system, and your Lambda code just uses open(). I built a set of AI code review agents that share a workspace through a mounted S3 bucket, orchestrated by a durable function, and the file access code is the most boring part of the whole project. That's the point.

The /tmp Tax

If you've built anything on Lambda that touches S3 data, you know the pattern. You need a file. S3 doesn't give you files. It gives you objects. So you download the object to /tmp, do your work, and upload the result back.

# The old way: every Lambda developer has written this
import os

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    bucket = event["bucket"]
    key = event["key"]

    # Download to /tmp
    local_path = f"/tmp/{key.split('/')[-1]}"
    s3.download_file(bucket, key, local_path)

    # Do your actual work
    with open(local_path) as f:
        content = f.read()
    result = process(content)

    # Upload the result
    s3.put_object(Bucket=bucket, Key=f"output/{key}", Body=result)

    # Clean up so you don't fill /tmp
    os.remove(local_path)

That's a lot of ceremony for "read a file and write a file." And it gets worse when you have multiple functions that need to work with the same data. Each one downloads its own copy. Each one manages its own /tmp. If you're processing a large repo or a dataset, you're burning through /tmp fast: it defaults to 512 MB and tops out at 10 GB.

I'd be doing you a disservice if I didn't mention the libraries that make this less painful. Tools like s3fs and smart_open abstract some of this away. But they're still making API calls under the hood. Your code is still talking to S3 through an SDK, not through a file system.

S3 Files for Lambda

S3 Files is a new feature that mounts your S3 bucket as a local file system on your Lambda function. Your code reads and writes files at a mount path like /mnt/workspace, and S3 Files handles the synchronization back to the bucket. Changes you write show up in S3 within minutes. Changes made to S3 objects appear on the file system within seconds.

# The new way: just file paths
from pathlib import Path

WORKSPACE = Path("/mnt/workspace")

def lambda_handler(event, context):
    # Read directly from the mount
    content = (WORKSPACE / "source" / "app.py").read_text()
    result = process(content)

    # Write directly to the mount
    (WORKSPACE / "output" / "result.json").write_text(result)

No boto3 for file access. No /tmp management. No upload step. The file system IS the interface.

Under the hood, S3 Files is built on Amazon EFS. It delivers sub-millisecond latency for actively used data by caching your working set on high-performance storage. For large sequential reads, it streams directly from S3. You get file system semantics with S3 durability and economics.

Here's the thing, though. S3 Files requires a VPC. Your Lambda function needs to be in the same VPC as the mount targets, and you need a NAT gateway for outbound internet access.

I'll be honest: as a serverless guy, I generally avoid VPCs. But AWS has removed most of the hurdles over the years. VPC-attached Lambda functions no longer have the cold start penalty they used to. The networking setup is boilerplate you write once. And for what S3 Files gives you, the tradeoff is worth it. Get yourself a reusable network template and move on.

What We're Building

I wanted to test S3 Files with something more interesting than "read a CSV." So I built a serverless code review system. You point it at a public GitHub repo, and three things happen:

A durable orchestrator function clones the repo to a shared S3 Files workspace
A security review agent and a style review agent analyze the code in parallel
The results land in the same workspace as JSON files, synced back to S3

All three Lambda functions mount the same S3 bucket. The orchestrator writes files. The agents read them. No S3 keys passed between functions. No downloading to /tmp. The file system is the coordination layer.
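To show what "the file system is the coordination layer" means in practice, here is a stripped-down sketch. A plain function stands in for the AI agent, and the directory names (repo/, reviews/) are assumptions for illustration rather than the project's actual layout. In Lambda the workspace argument would be the mount path, such as /mnt/workspace.

```python
import json
from pathlib import Path


def write_source(workspace, relative_path, content):
    """Orchestrator side: place a source file in the shared workspace."""
    target = Path(workspace) / "repo" / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target


def review_repo(workspace, agent_name, check):
    """Agent side: scan the shared repo and write findings next to it."""
    workspace = Path(workspace)
    repo = workspace / "repo"
    findings = []
    for path in sorted(repo.rglob("*.py")):
        lines = path.read_text().splitlines()
        for line_number, line in enumerate(lines, start=1):
            if check(line):  # check is a stand-in for the agent's analysis
                findings.append(
                    {"file": str(path.relative_to(repo)), "line": line_number}
                )
    out = workspace / "reviews" / f"{agent_name}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(findings, indent=2))
    return out
```

Neither function knows about buckets or keys. The only thing they share is a directory convention.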

The agents use the Strands Agents SDK with Amazon Bedrock. Each agent gets custom file tools that operate on the mount path, and Claude decides which files to read, what to analyze, and what to write. The orchestrator uses Lambda durable functions to coordinate the workflow with automatic checkpointing.

The full source is on GitHub: singledigit/lambda-s3-files-example

The SAM Template

The IaC is the part that took the most iteration. S3 Files is brand new, and the CloudFormation resource types aren't in the linter yet. Here's what I learned.

The Resource Chain

You need five resources to get S3 Files working with Lambda:

S3 Bucket with versioning enabled (required)
IAM Role for S3 Files to access the bucket
S3 Files FileSystem that bridges the bucket to NFS
Mount Targets in each AZ (network endpoints)
Access Point that controls POSIX identity for Lambda

The resource types are AWS::S3Files::FileSystem, AWS::S3Files::MountTarget, and AWS::S3Files::AccessPoint. Your IDE's CloudFormation linter won't recognize them yet. Ignore the red squiggles.

The IAM Role Gotcha

The S3 Files IAM role trusts elasticfilesystem.amazonaws.com, not s3files.amazonaws.com. This tripped me up. S3 Files is built on EFS, so the trust relationship goes through the EFS service principal.

S3FilesRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /service-role/
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Sid: AllowS3FilesAssumeRole
          Effect: Allow
          Principal:
            Service: elasticfilesystem.amazonaws.com
          Action: sts:AssumeRole
          Condition:
            StringEquals:
              aws:SourceAccount: !Ref AWS::AccountId
            ArnLike:
              aws:SourceArn: !Sub 'arn:aws:s3files:${AWS::Region}:${AWS::AccountId}:file-system/*'

The role needs S3 permissions to read and write the bucket. Scope it to your specific bucket ARN with aws:ResourceAccount conditions.

The Access Point

This is the important part for Lambda. The access point controls the POSIX identity your function runs as and creates a writable root directory. Without it, Lambda can mount the file system but can't write to it.

S3FilesAccessPoint:
  Type: AWS::S3Files::AccessPoint
  Properties:
    FileSystemId: !GetAtt S3FileSystem.FileSystemId
    PosixUser:
      Uid: '1000'
      Gid: '1000'
    RootDirectory:
      Path: /lambda
      CreationPermissions:
        OwnerUid: '1000'
        OwnerGid: '1000'
        Permissions: '755'

The CreationPermissions property is crucial. It auto-creates the /lambda directory with the right ownership when a client first connects. Without it, the root directory is owned by root (UID 0), and Lambda (running as UID 1000 through the access point) can't create subdirectories.

Lambda Configuration

On the Lambda side, FileSystemConfigs takes the access point ARN (not the file system ARN) and a local mount path:

OrchestratorFunction:
  Type: AWS::Serverless::Function
  DependsOn:
    - MountTargetA
    - MountTargetB
  Properties:
    FileSystemConfigs:
      - Arn: !GetAtt S3FilesAccessPoint.AccessPointArn
        LocalMountPath: /mnt/workspace
    VpcConfig:
      SecurityGroupIds:
        - !GetAtt NetworkingStack.Outputs.LambdaSGId
      SubnetIds:
        - !GetAtt NetworkingStack.Outputs.PrivateSubnetAId
        - !GetAtt NetworkingStack.Outputs.PrivateSubnetBId

The DependsOn on the mount targets is important. Lambda can't mount the file system until the mount targets are available, and they take about five minutes to create.

What I'd Do Differently

S3 Files is genuinely good for this use case. Shared file access between Lambda functions without the ceremony of S3 API calls. But a few things to know:

Consistency model matters. S3 Files provides close-to-open consistency. If Function A writes a file and Function B reads it immediately, B might not see the latest version. For my use case, the orchestrator writes first and the agents run after, so ordering is natural. If you need real-time coordination between concurrent writers, you'll want a different pattern.

VPC adds complexity. Not much, but some. You need subnets, security groups, NAT gateway for internet access. Template it once and reuse it.

Cold starts are fine. Attaching a Lambda function to a VPC used to add 10+ seconds to cold starts. That's been fixed for years. My functions cold-start in under 2 seconds with the file system mount.

The full code is at github.com/singledigit/lambda-s3-files-example. Clone it, deploy it, point it at a repo. The agents will tell you what they think.

Modern Container Builds and WebSocket APIs Come to AWS SAM

Eric D Johnson — Tue, 05 May 2026 18:09:06 +0000

SAM CLI has always been good at taking the grunt work out of serverless deployments. You define your functions and APIs in a template, and SAM handles the CloudFormation, the packaging, the deployment. It works.

Until recently, two things were missing. BuildKit support for container image builds. And a SAM resource type for WebSocket APIs.

Both have now shipped. Neither breaks existing templates.

Let's walk through it.

BuildKit Support for Image-Based Lambda Functions

The problem

When you run sam build for an image-based Lambda function, SAM uses the docker-py Python SDK under the hood. That SDK talks to the Docker daemon directly, but it doesn't support BuildKit. At all.

That means every sam build invocation uses the legacy Docker builder. You lose parallelized build stages. You lose efficient layer caching. You lose multi-stage build optimizations. You lose cross-architecture improvements that BuildKit has shipped over the past several years.

For simple single-stage Dockerfiles, this probably doesn't matter. For anything with multiple stages, private dependencies, or cross-compilation targets, it's a real bottleneck.

What shipped

SAM CLI v1.156.0 introduced the --use-buildkit flag on sam build. When you pass this flag, SAM bypasses docker-py entirely and shells out to the Docker CLI (or Finch CLI) directly. That gives you access to everything BuildKit offers.

sam build --use-buildkit

That's it. One flag.

SAM auto-detects which container runtime you have. It defaults to Docker, falls back to Finch if Docker isn't running, and respects any admin-configured preference. Finch supports BuildKit too, so either runtime works.

What you get

BuildKit brings real improvements over the legacy builder:

Parallel stage execution. Multi-stage Dockerfiles build independent stages concurrently instead of sequentially.
Better layer caching. BuildKit tracks dependencies at the file level, not just the layer level. Change one file, rebuild one layer.
Multi-stage build optimizations. BuildKit skips stages that don't contribute to the final output. The legacy builder runs every stage regardless.
Improved cross-architecture support. Building arm64 (Graviton2) images on an x86 machine is more reliable with BuildKit's QEMU integration. Lambda currently supports Graviton2 for arm64 workloads.

Build secrets

SAM CLI v1.159.0 added support for passing BuildKit parameters, including build-time secrets. This lets you pass credentials into a build stage without baking them into a layer. Useful when your Lambda function pulls packages from a private registry during the build.

# syntax=docker/dockerfile:1

FROM public.ecr.aws/lambda/python:3.12

# Copy the dependency list first so pip can find it during the build
COPY requirements.txt ${LAMBDA_TASK_ROOT}

RUN --mount=type=secret,id=pip_conf,target=/etc/pip.conf \
    pip install -r requirements.txt

COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]

You configure secrets through the Metadata section of your function resource, using DockerBuildExtraParams. SAM passes these parameters straight to docker buildx under the hood.

Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      PackageType: Image
      Architectures: [x86_64]
      Timeout: 10
    Metadata:
      Dockerfile: Dockerfile
      DockerContext: ./src
      DockerTag: latest
      DockerBuildExtraParams:
        - "--secret"
        - "id=pip_conf,src=$HOME/.pip/pip.conf"

The secret mounts into the build stage at the path you specify in the Dockerfile, gets used during pip install, and never appears in the final image layers. Different functions can have different build secrets, since the configuration lives in each function's Metadata block.

Tradeoffs and limitations

A few things to keep in mind.

Opt-in only. This doesn't change existing build behavior. If you don't pass --use-buildkit, nothing changes. That's intentional. SAM doesn't break working builds.

Requires Docker or Finch CLI. BuildKit support works by calling the Docker or Finch CLI directly. If your CI environment only has the Docker daemon (no CLI), this won't work. Most environments have both, but check yours.

Dockerfile syntax matters. Some BuildKit features, like secret mounts, require the # syntax=docker/dockerfile:1 parser directive at the top of your Dockerfile. Without it, the build falls back to legacy parsing and you get confusing errors.

Not for ZIP-based functions. This flag only applies to image-based Lambda functions (PackageType: Image). ZIP-based functions don't use Docker at all.

The rule of thumb I like is this: if your Dockerfile is more than a few lines, or you're building for a different architecture, turn on BuildKit. The caching alone will save you time.

WebSocket API Support

The problem

Before this release, SAM had no native resource type for WebSocket APIs. You had two options: write raw CloudFormation, or don't use SAM for that part of your stack.

Here's what a minimal WebSocket API looks like in plain CloudFormation. Three routes ($connect, $disconnect, sendMessage), each backed by a Lambda function:

Resources:
  WebSocketApi:
    Type: AWS::ApiGatewayV2::Api
    Properties:
      Name: MyWebSocketApi
      ProtocolType: WEBSOCKET
      RouteSelectionExpression: $request.body.action

  ConnectRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: $connect
      Target: !Sub integrations/${ConnectIntegration}

  ConnectIntegration:
    Type: AWS::ApiGatewayV2::Integration
    Properties:
      ApiId: !Ref WebSocketApi
      IntegrationType: AWS_PROXY
      IntegrationUri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${ConnectFunction.Arn}/invocations

  ConnectPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref ConnectFunction
      Action: lambda:InvokeFunction
      Principal: apigateway.amazonaws.com
      SourceArn: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/*/$connect

  DisconnectRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: $disconnect
      Target: !Sub integrations/${DisconnectIntegration}

  DisconnectIntegration:
    Type: AWS::ApiGatewayV2::Integration
    Properties:
      ApiId: !Ref WebSocketApi
      IntegrationType: AWS_PROXY
      IntegrationUri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${DisconnectFunction.Arn}/invocations

  DisconnectPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref DisconnectFunction
      Action: lambda:InvokeFunction
      Principal: apigateway.amazonaws.com
      SourceArn: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/*/$disconnect

  SendMessageRoute:
    Type: AWS::ApiGatewayV2::Route
    Properties:
      ApiId: !Ref WebSocketApi
      RouteKey: sendMessage
      Target: !Sub integrations/${SendMessageIntegration}

  SendMessageIntegration:
    Type: AWS::ApiGatewayV2::Integration
    Properties:
      ApiId: !Ref WebSocketApi
      IntegrationType: AWS_PROXY
      IntegrationUri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${SendMessageFunction.Arn}/invocations

  SendMessagePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref SendMessageFunction
      Action: lambda:InvokeFunction
      Principal: apigateway.amazonaws.com
      SourceArn: !Sub arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${WebSocketApi}/*/sendMessage

  Deployment:
    Type: AWS::ApiGatewayV2::Deployment
    DependsOn:
      - ConnectRoute
      - DisconnectRoute
      - SendMessageRoute
    Properties:
      ApiId: !Ref WebSocketApi

  Stage:
    Type: AWS::ApiGatewayV2::Stage
    Properties:
      ApiId: !Ref WebSocketApi
      StageName: prod
      DeploymentId: !Ref Deployment
That's twelve resources for three routes. Around 90 lines of YAML. And I left out the Lambda function definitions.

Notice the DependsOn on the Deployment resource. If you forget that, CloudFormation tries to create the deployment before the routes exist, and the stack fails. You have to manage that dependency graph yourself.

That's a lot of ceremony for "connect a WebSocket to some Lambda functions."

What shipped
The new AWS::Serverless::WebSocketApi resource type collapses all of that into this:

Resources:
  MyWebSocketApi:
    Type: AWS::Serverless::WebSocketApi
    Properties:
      RouteSelectionExpression: $request.body.action
      StageName: prod
      Routes:
        $connect:
          FunctionArn: !GetAtt ConnectFunction.Arn
        $disconnect:
          FunctionArn: !GetAtt DisconnectFunction.Arn
        sendMessage:
          FunctionArn: !GetAtt SendMessageFunction.Arn
Compare that to the CloudFormation version. Twelve resources become one. Ninety lines become twelve. The routes, integrations, Lambda permissions, deployment, stage, and resource ordering are all handled by SAM's transform.

You define the routes. SAM generates the rest.
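
If the RouteSelectionExpression is new to you, it is what ties an incoming message to one of those routes. API Gateway evaluates `$request.body.action` against the JSON message body and falls back to a `$default` route, if one exists, when nothing matches. Here's a minimal TypeScript sketch of that lookup. The `selectRoute` helper is hypothetical and only illustrates the idea; it is not code from SAM or API Gateway.

```typescript
// Hypothetical helper: mimics how a route key is chosen for the
// RouteSelectionExpression "$request.body.action". Illustration only.
function selectRoute(messageBody: string, routeKeys: string[]): string | null {
  // $connect and $disconnect fire on connection lifecycle events,
  // so only custom route keys can be selected by a message.
  const customRoutes = routeKeys.filter((key) => !key.startsWith("$"));
  let action: unknown;
  try {
    const parsed: unknown = JSON.parse(messageBody);
    if (typeof parsed === "object" && parsed !== null) {
      action = (parsed as Record<string, unknown>)["action"];
    }
  } catch {
    // Non-JSON payloads cannot match a custom route.
  }
  if (typeof action === "string" && customRoutes.includes(action)) {
    return action;
  }
  // Fall back to $default when it is configured; otherwise nothing matches.
  return routeKeys.includes("$default") ? "$default" : null;
}

const configuredRoutes = ["$connect", "$disconnect", "sendMessage"];
const matchedRoute = selectRoute('{"action":"sendMessage","text":"hi"}', configuredRoutes);
const unmatchedRoute = selectRoute('{"action":"unknown"}', configuredRoutes);
```

With the three routes from the template, a message whose `action` is `sendMessage` lands on the sendMessage route, and anything else has nowhere to go unless you also declare `$default`.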

What SAM handles automatically
For each route you declare, SAM creates:

The AWS::ApiGatewayV2::Route resource
The AWS::ApiGatewayV2::Integration wiring the route to your Lambda function
The AWS::Lambda::Permission granting API Gateway invoke access
If you add a Lambda authorizer, the authorizer permission too
SAM also creates the deployment and stage resources, with the correct dependency ordering. No DependsOn blocks to manage.
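
To make the resource count concrete, here's a rough sketch of the fan-out. The logical IDs are made up for the example and are not the names SAM's transform actually generates; the only point is that each route contributes three resources on top of the API, deployment, and stage.

```typescript
// Illustration only: these logical IDs are invented for the example and are
// not the names SAM's transform produces.
function expandWebSocketApi(routeKeys: string[]): string[] {
  const resources = ["Api"];
  for (const key of routeKeys) {
    // Strip characters such as "$" that are not valid in a logical ID.
    const id = key.replace(/[^A-Za-z0-9]/g, "");
    resources.push(`${id}Route`, `${id}Integration`, `${id}Permission`);
  }
  resources.push("Deployment", "Stage");
  return resources;
}

const generatedResources = expandWebSocketApi(["$connect", "$disconnect", "sendMessage"]);
// generatedResources.length is 12: 1 API + 3 routes x 3 resources + deployment + stage
```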

Full feature parity
This isn't a simplified subset. The new resource type supports everything API Gateway V2 WebSocket offers:

Auth: IAM authorization and Lambda authorizers, per-route or API-wide
Custom domains: Map your WebSocket API to your own domain
Route settings: Configure throttling and logging per route via RouteSettings
Models: Attach request/response models for validation
Stage variables: Pass configuration to your integration through stage variables
Globals: Share configuration across multiple WebSocket APIs using the SAM Globals section
Use cases
WebSocket APIs are the right tool when you need a persistent, bidirectional connection. Common patterns:

Chat applications. Users send and receive messages in real time.
Live dashboards. Push metric updates to connected browsers without polling.
AI/LLM streaming. Stream token-by-token responses from a model back to the client. This one is increasingly common.
IoT command channels. Send commands to devices and receive status updates on the same connection.
If your current approach is a REST API that the client polls every few seconds, a WebSocket API will give you lower latency and lower cost. Fewer requests, fewer Lambda invocations, faster updates.
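
A quick back-of-the-envelope sketch of the request-count difference. The workload numbers are assumptions picked for illustration, and the comparison ignores WebSocket connection-minute charges, so treat it as a way to reason about request volume rather than a pricing calculator.

```typescript
// Requests generated by clients that poll a REST endpoint on a fixed interval.
function pollingRequestsPerHour(clients: number, intervalSeconds: number): number {
  return clients * Math.floor(3600 / intervalSeconds);
}

// Messages pushed over WebSocket: one per real update per connected client.
function pushMessagesPerHour(clients: number, updatesPerHour: number): number {
  return clients * updatesPerHour;
}

// Assumed workload: 1,000 clients, polling every 5 seconds, 60 real updates per hour.
const polledRequests = pollingRequestsPerHour(1000, 5); // 720,000 requests
const pushedMessages = pushMessagesPerHour(1000, 60);   // 60,000 messages
```

Under those assumptions polling produces twelve times the traffic, and most of those requests return nothing new.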

Tradeoffs and limitations
No local emulation. SAM CLI doesn't support sam local start-api for WebSocket APIs. You can test individual Lambda handlers with sam local invoke, but end-to-end local WebSocket testing isn't available yet. Deploy to a dev stage for integration testing.

No sam local start-websocket. Related to the above, there's no dedicated local command for WebSocket APIs like there is for HTTP APIs with sam local start-api.

For the Agents
If you're using an AI coding agent to build SAM applications, both of these features work out of the box with agent-driven workflows. Your agent can scaffold a WebSocket API or add BuildKit to an existing image-based function without any special setup.

For Kiro, there's an official AWS SAM Power that gives your agent SAM-aware tooling. Install it and your agent gets access to sam_init, sam_build, sam_deploy, sam_logs, and sam_local_invoke as callable tools, plus opinionated project structure guidance.

Here's what a Kiro-assisted WebSocket API scaffold looks like in practice:

You: Create a WebSocket API with connect, disconnect, and sendMessage routes.
Use the SAM Power.

Kiro: [runs sam_init] → creates project structure
[updates template.yaml] → adds AWS::Serverless::WebSocketApi
[creates Lambda handlers] → connect.py, disconnect.py, send_message.py
[runs sam_build] → builds the project
[runs sam_local_invoke] → tests the connect handler locally
The SAM Power also enforces good project structure: separate Lambda handlers in infrastructure/lambda/, proper CodeUri paths, and .aws-sam in your .gitignore. It gets you from idea to deployed WebSocket API without hand-writing boilerplate.

You can install the SAM Power from the Kiro Powers marketplace or add it directly to your project's .kiro/powers/ directory.

Getting Started
Upgrade SAM CLI to get both features:

sam --version
# Need v1.156.0 or later for BuildKit, latest for WebSocket APIs

# Upgrade via pip
pip install --upgrade aws-sam-cli

# Or via Homebrew
brew upgrade aws-sam-cli
For BuildKit, add --use-buildkit to your sam build command. No template changes needed.

For WebSocket APIs, replace your CloudFormation resources with the new AWS::Serverless::WebSocketApi type. If you're starting fresh, the SAM template above is a working starting point.

BuildKit works with sam local for local testing. WebSocket APIs currently support sam deploy for deployment and sam local invoke for testing individual handlers, but full local WebSocket emulation isn't available yet.

Two Features, Zero Breaking Changes
These two features fill gaps that have been open for a while. BuildKit support means sam build finally uses the same build engine as the rest of the container ecosystem. WebSocket API support means you can define a real-time API in SAM the same way you define a REST API. A few lines instead of a hundred.

Neither feature changes existing behavior. Both are additive. Upgrade, try them, and keep building.

Cold Starts Are Dead

Eric D Johnson — Wed, 29 Apr 2026 03:09:17 +0000

It never fails. Every time I talk about serverless, someone pushes back with the cold start argument. I still see it in forums, in blog comments, in architecture review meetings. "Sure, but what about cold starts?"

I get it. Five or six years ago, that was a legitimate concern.

But it's 2026. The data tells a different story. And if you're still making decisions based on the cold start argument, you're arguing against a version of Lambda that hasn't existed in years.

How Long Are Lambda Cold Starts in 2026?

Let's start with what cold starts actually look like today. These numbers come from production workloads observed in the wild, not synthetic hello-world tests. Your mileage will vary by package size, initialization code, and memory configuration, but the ranges are representative of what teams are seeing in 2026.

Runtime	P50	P99	Notes
Python 3.13	200-400ms	800ms-1.2s	Fastest scripting runtime
Node.js 22	200-350ms	600ms-1s	Solid general choice
Go	50-100ms	150-250ms	Near-zero
Rust	50-80ms	100-200ms	Fastest overall
Java 21 (no SnapStart)	2-5s	6-10s	Still slow without SnapStart
Java 21 + SnapStart	90-140ms	200-400ms	Dramatically better

Running on arm64 (Graviton)? Knock another 15-40% off those numbers across the board.

Rust cold starts have been measured as low as 16ms on arm64. Sixteen milliseconds, which is less a cold start problem and more a rounding error.

The scripting runtimes, Python and Node, land in the 200-400ms range at P50. For context, that's less than the time it takes your browser to render a page after receiving the HTML.

Does VPC Still Cause Lambda Cold Starts?

This one still comes up in conversations, and it frustrates me because it was solved in 2019.

The old problem: when a Lambda function needed VPC connectivity, AWS had to create an Elastic Network Interface (ENI) on the fly. That meant 10-15 seconds of additional cold start latency on top of everything else. VPC-connected Lambda was genuinely painful.

AWS fixed this when Lambda migrated to Firecracker microVMs in 2019, dropping cold start overhead from over ten seconds to under a second. Werner Vogels recently wrote about the invisible engineering behind Lambda's network. The team used eBPF to rewrite Geneve tunnel headers, taking tunnel latency from 150 milliseconds to 200 microseconds. The VPC cold start penalty now approaches zero for most workloads.

That was seven years ago. If someone tells you Lambda cold starts are bad because of VPC, they're working from outdated information.

How Much Does SnapStart Reduce Cold Starts?

Java was the poster child for the cold start argument. And honestly, it earned that reputation. A Spring Boot app on Lambda could take 3-10 seconds to cold start. That's painful no matter how you frame it.

SnapStart changed the math. Here's the timeline:

November 2022: SnapStart GA for Java at re:Invent
November 2024: Python 3.12+ and .NET 8+ GA
2025: Expanded to additional regions, arm64 support

How it works: Lambda takes a snapshot of the initialized Firecracker microVM after your INIT code runs. That snapshot gets cached across three tiers: L1 on the worker, L2 in the placement group, S3 at the region level. On cold start, Lambda restores from the snapshot instead of re-running your initialization code.

The benchmarks are real:

Python (Flask, LangChain, Pandas): several seconds → sub-second
Java Spring Boot: 5.8s → 180ms (97% reduction)
.NET: 58-94% cold start reduction

SnapStart has continued to evolve. Python and .NET support went GA in late 2024, with additional regions and arm64 support following in 2025.

Did the Lambda INIT Billing Change Increase Costs?

In August 2025, AWS standardized billing for the Lambda INIT phase. Functions packaged as ZIP files with managed runtimes now get billed for the INIT phase, which was previously free. This triggered some alarming blog posts.

The headline claim: a "22x cost increase." Let's look at the math.

That 22x number requires a perfect storm of worst-case assumptions:

100% cold start rate (every invocation is a cold start)
2-second Java INIT duration
512MB memory configuration

Here's the problem with that scenario: AWS's own production analysis shows cold starts occur in less than 1% of invocations. Not 100%. Less than 1%.

AWS's own assessment: "most users will see minimal impact on their overall Lambda bill from this change, as the INIT phase typically occurs for a very small fraction of function invocations." The actual impact depends on your cold start ratio and INIT duration relative to handler duration. Use the CloudWatch query below to check your own numbers.

Don't take my word for it. AWS published a CloudWatch Logs Insights query so you can calculate your exact impact:

filter @type = "REPORT"
| stats
    sum((@memorySize/1000000/1024) * (@billedDuration/1000)) as BilledGBs,
    sum((@memorySize/1000000/1024) * ((@duration + @initDuration - @billedDuration)/1000)) as UnbilledInitGBs,
    UnbilledInitGBs / (UnbilledInitGBs + BilledGBs) as UnbilledInitRatio

Run that against your own functions. If your UnbilledInitRatio is anywhere near the number that would produce a 22x increase, you have a bigger problem than billing changes. You have an architecture problem.
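
To connect that ratio to a bill multiplier, here's a small sketch. It assumes every previously unbilled INIT GB-second becomes billed, so the new bill is (billed + unbilled) / billed of the old one. The 2,000 ms INIT comes from the worst-case scenario above; the 100 ms handler duration is an assumption added for illustration.

```typescript
// Bill multiplier implied by UnbilledInitRatio, assuming all previously
// unbilled INIT GB-seconds become billed: (billed + unbilled) / billed.
function billMultiplier(initRatio: number): number {
  return 1 / (1 - initRatio);
}

// Ratio for one function, given a cold start rate and average durations.
// Memory size cancels out because it applies to both terms.
function unbilledInitRatio(coldStartRate: number, initMs: number, handlerMs: number): number {
  const unbilled = coldStartRate * initMs;
  return unbilled / (unbilled + handlerMs);
}

// Assumed numbers: 2,000 ms INIT and a 100 ms handler.
const worstCase = billMultiplier(unbilledInitRatio(1.0, 2000, 100));   // about 21x
const onePercent = billMultiplier(unbilledInitRatio(0.01, 2000, 100)); // about 1.2x
```

Read the other way, a 22x increase needs an UnbilledInitRatio of 1 - 1/22, or roughly 0.95. Plug in your own cold start rate and durations to see where you land.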

With SnapStart enabled, INIT durations drop to sub-second, which shrinks that ratio even further.

I Decided to See for Myself

I wasn't satisfied citing someone else's numbers for a post called "Cold Starts Are Dead." So I built a benchmarker. Thirteen Lambda functions across six runtimes, both arm64 and x86_64, 50 cold start invocations each, 512MB memory, us-west-2. Minimal hello-world handlers, no frameworks, no dependencies beyond the runtime SDK. This measures the platform floor, not application init time.

Runtime	Arch	P50 (ms)	P99 (ms)	Blog Claim	Verdict
Rust	arm64	14.1	31.9	50-80ms	⚡ Faster
Rust	x86_64	17.0	29.3	50-80ms	⚡ Faster
Go	arm64	45.0	61.2	50-100ms	⚡ Faster
Go	x86_64	59.8	94.9	50-100ms	✅ Verified
Python 3.13	arm64	88.3	147.1	200-400ms	⚡ Faster
Python 3.13	x86_64	106.2	142.3	200-400ms	⚡ Faster
Node.js 22	arm64	121.5	168.5	200-350ms	⚡ Faster
Node.js 22	x86_64	155.0	231.3	200-350ms	⚡ Faster
Java 21	arm64	365.3	539.2	2-5s	⚡ Faster
Java 21	x86_64	443.8	573.5	2-5s	⚡ Faster

Every runtime came in at or below the production ranges I cited earlier. That's expected. Those production numbers include application dependencies, framework initialization, and SDK client setup. My minimal handlers isolate just the platform overhead. Think of these as the floor: your cold starts will take at least this long, plus whatever your initialization code adds.

The arm64 advantage verified cleanly too:

Runtime	arm64 P50	x86_64 P50	Improvement
Rust	14.1ms	17.0ms	17% faster
Go	45.0ms	59.8ms	25% faster
Python	88.3ms	106.2ms	17% faster
Node.js	121.5ms	155.0ms	22% faster
Java	365.3ms	443.8ms	18% faster

17-25% faster across every runtime. If you're still deploying on x86_64 in 2026, you're leaving performance on the table for no reason.
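
If you want to check the arithmetic, here's a minimal sketch of how the percentile and improvement figures can be computed. The nearest-rank percentile is one common definition and an assumption on my part, not necessarily the method the benchmarker uses.

```typescript
// Nearest-rank percentile: one common definition, not necessarily the one
// the benchmarker uses.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Percentage by which arm64 is faster, relative to the x86_64 figure.
function improvementPct(arm64Ms: number, x86Ms: number): number {
  return Math.round(((x86Ms - arm64Ms) / x86Ms) * 100);
}

const rustGain = improvementPct(14.1, 17.0); // 17
const goGain = improvementPct(45.0, 59.8);   // 25
```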

I also tested VPC vs non-VPC with the Python function. The VPC-connected function was 1.4ms faster at P50, within noise.

One honest caveat: SnapStart on my minimal Java handler showed a ~670ms restore duration. That's because there's almost nothing to snapshot. The restore mechanism's own overhead dominates. For a Spring Boot app where SnapStart eliminates 3-5 seconds of framework init, you'd see the dramatic improvement the benchmarks above describe. SnapStart's value scales with how much init work your app does.

The full benchmarker is open source on GitHub. You can run it yourself:

git clone https://github.com/singledigit/lambda-benchmark.git
cd lambda-benchmark
sam build && sam deploy --guided
cd orchestrator
python benchmark.py --stack-name <stack> --iterations 50
python generate_report.py

When Do Lambda Cold Starts Still Matter?

Here's where cold starts still matter.

But first, is 200ms even slow? For a website, a chat app, an API powering a mobile experience, an AI agent? You won't notice it. A typical page load involves DNS resolution, TLS handshake, and content rendering that add up to far more than 200ms. If you're building AI agents, the LLM call alone takes 2-10 seconds, so 200ms of cold start is a rounding error on that. It's the first request after a quiet period, and it happens in less time than a blink.

The cases where cold starts actually matter are narrow:

Sub-10ms latency SLAs. If you're building for high-frequency trading or real-time bidding, 200ms of cold start latency is unacceptable. At that point, you're probably looking at containers on ECS or EKS where you control the warm pool directly. Provisioned Concurrency is an option too, but it has its own cost tradeoffs.

Very spiky, unpredictable traffic with strict P99 requirements. If your traffic pattern goes from zero to thousands of concurrent invocations with no ramp-up, and you have hard P99 latency SLAs, cold starts will hit that tail. You have options: containers with pre-scaled task counts, Provisioned Concurrency, or accepting slightly higher P99s during the initial burst. The right answer depends on your cost tolerance and how strict "strict" actually is.

Java without SnapStart. Just... use SnapStart. If you're running Java on Lambda without SnapStart enabled, you're choosing to have a cold start problem. The fix is a configuration change.

The honest framing: the question isn't "do cold starts exist?" They do. The question is "do they matter for your workload?" For the vast majority of workloads, the answer is no.

Conclusion

VPC cold starts were fixed in 2019, and SnapStart took Java from 5 seconds to 180 milliseconds. Python and Node sit under 200ms on arm64, and AWS's own data says cold starts happen in less than 1% of invocations. I built a benchmarker and tested it myself. The production claims are conservative, and the numbers are better than advertised.

The cold start argument had its day. That day was 2018. If you're still leading with it, I get it, it used to be real. But the data has moved on, and it's time we did too.

For the Agents

If you're using an AI coding assistant or agent, I've put together a companion skill file for this post. It includes the 2026 cold start benchmarks, SnapStart configurations, and the CloudWatch query for INIT billing analysis. Everything your agent needs to give you accurate cold start guidance instead of outdated advice.

Is this code deterministic?

Eric D Johnson — Fri, 16 Jan 2026 18:20:30 +0000

I recently posted a small code snippet in a LinkedIn poll and asked what sounded like a simple question:

Is this code deterministic?

Those are usually the dangerous questions.

I asked on purpose. I’ve been spending time talking with folks much smarter than me, reading docs, and honestly leaning on code assistants to sanity-check my thinking as I go. Durable execution has a way of surfacing edge cases you don’t normally think about, and I wanted to learn in public—right alongside everyone else.

The discussion that followed (in the original post) was excellent. It also showed how easy it is to mix together concepts like determinism, replay, retries, and idempotency. This post is my attempt to slow things down and separate those ideas, using the original example and AWS’s guidance on deterministic code in AWS Lambda durable functions.

Here’s the code that started it all:

import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const orders = event.orders.sort((a, b) => a.priority - b.priority);

    const results = [];
    for (const order of orders) {
      const result = await context.step(`process-${order.id}`, async () => {
        return processOrder(order);
      });
      results.push(result);
    }

    return { processed: results.length, timestamp: Date.now() };
  }
);

Most people voted “No, non-deterministic.”

That’s the correct answer—but not always for the reasons people first reach for.

Let’s walk through it.

Problem 1: Equal-priority ordering is under-specified

What’s happening

event.orders.sort((a, b) => a.priority - b.priority);

Modern JavaScript engines (ES2019+) guarantee that Array.prototype.sort() is stable. If two orders have the same priority, their relative order is preserved.

So no, JavaScript isn’t secretly reordering your data.

Why it still matters (and why this one is subtle)

I’ll be honest: this one felt nit-picky to me at first. If the input is the same and the sort is stable, it feels like everything should be fine.

The important realization is this: a stable sort preserves whatever order the input already had—but it doesn’t explain why that order exists.

In this code, the implicit rule becomes:

“If priorities are equal, keep whatever order the input arrived in.”

If that order is intentional and guaranteed, great. Nothing wrong here.

But if it’s incidental—maybe merged upstream, aggregated from multiple sources, or simply not meant to be meaningful—then the workflow’s step ordering now depends on an accident of the input.

Nothing is broken. But you may have just encoded behavior you didn’t mean to encode.

One more thing worth saying out loud: if the order in which steps are created doesn’t matter, you may not need to sort at all. Sorting only makes sense if you’re enforcing a real business rule like “highest priority goes first.”

Why not just wrap the ordering in a step?

This is a very common reaction—and a reasonable one.

Yes, you could wrap the sort in a step and checkpoint it. That would make the ordering fully durable and replay-stable.

But steps are not free.

They add latency. They cost money. They count toward operation limits. And they exist primarily to protect work that is slow, expensive, or has side effects.

Pure, fast, in-memory logic like sorting is already replayable. Re-running a sort of 10 items during replay is usually far cheaper than checkpointing it. Even with larger lists, the trade-off depends on size, cost, and intent.

The rule of thumb I like is this:

If the logic is pure, fast, and deterministic, don’t rush to wrap it.

If you can’t make it deterministic, or replaying it is expensive, that’s when a step makes sense.

Fix

If ordering matters, make it explicit and deterministic, without mutating the input:

const orders = [...event.orders].sort(
  (a, b) => a.priority - b.priority || a.id.localeCompare(b.id)
);
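
To see the difference, here's a small standalone sketch with made-up order data. The same two orders arrive in two different sequences; the stable sort alone preserves each arrival order, while the tie-breaker gives one answer for both.

```typescript
interface Order {
  id: string;
  priority: number;
}

const byPriority = (a: Order, b: Order) => a.priority - b.priority;
const byPriorityThenId = (a: Order, b: Order) =>
  a.priority - b.priority || a.id.localeCompare(b.id);

// Same orders, two arrival sequences (for example, merged from two sources).
const arrivalA: Order[] = [{ id: "o-2", priority: 1 }, { id: "o-1", priority: 1 }];
const arrivalB: Order[] = [{ id: "o-1", priority: 1 }, { id: "o-2", priority: 1 }];

// Sort a copy and return the resulting id sequence.
const sortedIds = (orders: Order[], cmp: (a: Order, b: Order) => number) =>
  [...orders].sort(cmp).map((o) => o.id).join(",");

// Stable sort keeps arrival order for ties, so step order follows the input.
const stableA = sortedIds(arrivalA, byPriority);         // "o-2,o-1"
const stableB = sortedIds(arrivalB, byPriority);         // "o-1,o-2"
// With the tie-breaker, both inputs produce the same step order.
const explicitA = sortedIds(arrivalA, byPriorityThenId); // "o-1,o-2"
const explicitB = sortedIds(arrivalB, byPriorityThenId); // "o-1,o-2"
```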

Problem 2: `Date.now()` is non-deterministic

What’s happening

timestamp: Date.now()

This value is computed at runtime, so every execution produces a different number.

Why it matters

In this handler, the timestamp is just part of the returned response. It doesn’t affect control flow or step scheduling, so it’s harmless today.

But time-based APIs are explicitly called out in the durable execution docs as a common source of non-determinism. If this value later gets stored as workflow state, passed into a step, or used in a conditional, replay behavior can change in ways that are very hard to reason about.

This is less “this is wrong” and more “this is easy to trip over later.”

Fix

If the timestamp actually matters, capture it once inside a step so it replays consistently:

const timestamp = await context.step("timestamp", async () => Date.now());
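
Here's a toy model of why that works. `ToyContext` is a stand-in written for this example, not the real SDK: it is synchronous for brevity and only models "return the checkpointed result if one exists." A fake clock replaces Date.now() so the output doesn't depend on wall-clock time.

```typescript
// Toy stand-in for a durable context: NOT the real SDK. Synchronous for
// brevity; it only models returning a checkpointed result when one exists.
class ToyContext {
  private checkpoints: Map<string, unknown>;

  constructor(checkpoints: Map<string, unknown>) {
    this.checkpoints = checkpoints;
  }

  step<T>(stepName: string, body: () => T): T {
    if (this.checkpoints.has(stepName)) {
      return this.checkpoints.get(stepName) as T; // replay: skip the body
    }
    const result = body();
    this.checkpoints.set(stepName, result); // first run: checkpoint the result
    return result;
  }
}

// Fake clock that advances on every call, standing in for Date.now().
let fakeNow = 1000;
const fakeClock = () => (fakeNow += 5);

function timestampWorkflow(ctx: ToyContext) {
  const outside = fakeClock();                            // recomputed on every replay
  const inside = ctx.step("timestamp", () => fakeClock()); // captured once
  return { outside, inside };
}

const savedCheckpoints = new Map<string, unknown>();
const firstRun = timestampWorkflow(new ToyContext(savedCheckpoints));
const replayRun = timestampWorkflow(new ToyContext(savedCheckpoints));
// firstRun.inside === replayRun.inside, but the "outside" values differ.
```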

Problem 3: Side effects hidden inside a single step

Ben Kehoe correctly points out a subtle but important issue in this code.

What’s happening

await context.step(`process-${order.id}`, async () => {
  return processOrder(order);
});

A durable step can fail and be retried. Once a step completes, it won’t be re-run on replay—but retries can re-execute the step body.

If processOrder performs multiple side effects, a failure partway through can cause those side effects to run again.

Why it matters

This is not a determinism problem.

It’s not a replay problem either.

This is a retry safety problem.

If a step body can’t safely run more than once, retries can produce duplicate effects unless everything inside the step is idempotent.

Fix

Be intentional about retry boundaries and align steps with retry-safe work.

Problematic version:

await context.step(`process-${order.id}`, async () => {
  await chargeCard(order);
  await writeAuditRecord(order);
  await sendConfirmation(order);
});

If this fails after chargeCard, a retry may re-run everything.

Safer version:

await context.step(`charge-${order.id}`, async () => {
  return chargeCard(order, { idempotencyKey: order.id });
});

await context.step(`audit-${order.id}`, async () => {
  return writeAuditRecord(order);
});

await context.step(`notify-${order.id}`, async () => {
  return sendConfirmation(order, { idempotencyKey: order.id });
});

This doesn’t magically make things idempotent. It just limits the blast radius when retries happen.
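
What the idempotency key buys you is easiest to see with a fake provider. This in-memory `chargeCard` is hypothetical and exists only for the example; real payment APIs implement the same idea on their side.

```typescript
// Hypothetical in-memory payment provider that honors idempotency keys.
const chargesByKey = new Map<string, { orderId: string; amount: number }>();
let providerCalls = 0;

function chargeCard(
  order: { id: string; amount: number },
  opts: { idempotencyKey: string }
) {
  providerCalls += 1;
  const existing = chargesByKey.get(opts.idempotencyKey);
  if (existing) return existing; // retry: return the original charge, bill nothing new
  const charge = { orderId: order.id, amount: order.amount };
  chargesByKey.set(opts.idempotencyKey, charge);
  return charge;
}

const sampleOrder = { id: "order-42", amount: 25 };
chargeCard(sampleOrder, { idempotencyKey: sampleOrder.id }); // first attempt
chargeCard(sampleOrder, { idempotencyKey: sampleOrder.id }); // step retried after a failure
// providerCalls is 2, but chargesByKey.size is 1: the customer is charged once.
```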

Step semantics and retry intent

AWS Lambda durable functions also give you control over how steps retry.

By default, steps use AtLeastOncePerRetry semantics. If a step fails or the Lambda is interrupted, the runtime may re-execute the step body. In this mode, the retry count acts as a lower bound on executions.

If you have a step that must never run more than once, you can use StepSemantics.AtMostOncePerRetry with zero retries. In that case, a failure surfaces as an error instead of re-running the step.

Put simply:

AtLeastOncePerRetry → max attempts is a lower bound
AtMostOncePerRetry → max attempts is an upper bound

Neither is “better.” They just encode different assumptions.

Putting it all together

Once you make ordering explicit, keep non-deterministic values under control, and think carefully about retry boundaries, the handler becomes much easier to reason about.

Here are two durable-safe ways to structure it, depending on how independent your work items are and how much concurrency you want.

Durable-safe handler (step by step)

import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const orders = [...event.orders].sort(
      (a, b) => a.priority - b.priority || a.id.localeCompare(b.id)
    );

    for (const order of orders) {
      await context.step(`validate-${order.id}`, async () => {
        if (!order.id) throw new Error("Missing order id");
      });

      await context.step(`charge-${order.id}`, async () => {
        return chargeCard(order, { idempotencyKey: order.id });
      });

      await context.step(`notify-${order.id}`, async () => {
        return sendConfirmation(order, { idempotencyKey: order.id });
      });
    }

    return { processed: orders.length };
  }
);

Durable-safe handler using `context.map()`

context.map() changes the shape of the problem a bit. Each item becomes its own durable unit of work.

That matters because:

Failures are isolated to a single item
Completed items don’t get re-run because something else failed
Concurrency becomes a first-class knob (maxConcurrency)

The trade-offs are real too:

Large lists can emit a lot of steps quickly
Strict sequencing is harder to express

import { withDurableExecution, DurableContext } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const orders = [...event.orders].sort(
      (a, b) => a.priority - b.priority || a.id.localeCompare(b.id)
    );

    const mapResult = await context.map(
      "process-orders",
      orders,
      async (ctx: DurableContext, order: any) => {
        await ctx.step(`validate-${order.id}`, async () => {
          if (!order.id) throw new Error("Missing order id");
        });

        await ctx.step(`charge-${order.id}`, async () => {
          return chargeCard(order, { idempotencyKey: order.id });
        });

        await ctx.step(`notify-${order.id}`, async () => {
          return sendConfirmation(order, { idempotencyKey: order.id });
        });

        return { orderId: order.id, status: "ok" };
      },
      { maxConcurrency: 5 }
    );

    const results = mapResult.getResults();
    return { processed: results.length, results };
  }
);

Final takeaway

Durable execution encourages you to slow down just a bit and be explicit about ordering, retries, idempotency, and where work can safely be repeated.

That’s exactly why this question was worth asking—and why the conversation around it was worth having.

AWS Lambda Durable Functions: Build Workflows That Last

Eric D Johnson — Wed, 03 Dec 2025 21:33:02 +0000

Long-running workflows without managing infrastructure

Your Lambda function needs to wait for a human approval. Or retry a failed API call with exponential backoff. Or orchestrate multiple steps that span hours. How do you build that without managing servers or databases?

AWS Lambda Durable Functions solve this. Write your workflow in your programming language (Node.js, TypeScript, or Python, with more coming) using straightforward async code. Lambda handles the rest: checkpointing state, resuming after waits, retrying failures, and scaling automatically. Workflows can run for up to a year, and you pay only for actual execution time, not for time spent waiting.

What Are Durable Functions?

Durable functions are Lambda functions that can pause and resume. When your function waits for a callback or sleeps for an hour, Lambda checkpoints its state and stops execution. When it's time to continue, Lambda resumes exactly where it left off—with all variables and context intact.

This isn't a new compute model. It's regular Lambda with automatic state management. You write normal async/await code. Lambda makes it durable.

A Simple Example

Here's a workflow that creates an order, waits 5 minutes, then sends a notification:

import { DurableContext, withDurableExecution } from '@aws/durable-execution-sdk-js';

export const handler = withDurableExecution(
  async (event: any, context: DurableContext) => {
    const order = await context.step('create-order', async () => {
      return createOrder(event.items);
    });

    await context.wait({ seconds: 300 });

    await context.step('send-notification', async () => {
      return sendEmail(order.customerId, order.id);
    });

    return { orderId: order.id, status: 'completed' };
  }
);

That's it. No state machines to configure, no databases to manage, no polling loops. The function pauses during the wait, costs nothing while idle, and resumes automatically after 5 minutes.

Key Capabilities

Long execution times - Functions can run for up to 1 year. Individual invocations are still limited to 15 minutes, but the workflow continues across multiple invocations.

Automatic checkpointing - Lambda saves your function's state at each step. If something fails, the function resumes from the last checkpoint—not from the beginning.

Built-in retries - Configure retry strategies with exponential backoff. Lambda handles the retry logic and timing automatically.

Wait for callbacks - Pause execution until an external event arrives. Perfect for human approvals, webhook responses, or async API results.

Parallel execution - Run multiple operations concurrently and wait for all to complete. Lambda manages the coordination.

Nested workflows - Invoke other durable functions and compose complex workflows from simple building blocks.
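
For the built-in retries item above, here's a rough sketch of what an exponential backoff schedule looks like. The function and its parameter names are illustrative only and are not the SDK's retry configuration.

```typescript
// Delay before each retry: base * factor^attempt, capped at maxDelayMs.
// Parameter names are illustrative, not the SDK's retry configuration.
function backoffDelays(
  retries: number,
  baseMs: number,
  factor: number,
  maxDelayMs: number
): number[] {
  const schedule: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    schedule.push(Math.min(baseMs * Math.pow(factor, attempt), maxDelayMs));
  }
  return schedule;
}

const retrySchedule = backoffDelays(5, 1000, 2, 10000);
// [1000, 2000, 4000, 8000, 10000]
```

Production schedules often add random jitter on top so that many failing executions don't retry at the same instant.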

How It Works: The Replay Model

Durable functions use a replay-based execution model. When your function resumes, Lambda replays it from the start—but instead of re-executing operations, it uses checkpointed results.

Here's what happens:

First invocation - Your function runs, executing each step and checkpointing results
Wait or callback - Function pauses, Lambda saves state and stops execution
Resume - Lambda invokes your function again, replaying from the start
Replay - Operations return checkpointed results instantly instead of re-executing
Continue - Function continues past the wait with all context intact

This model ensures your function always sees consistent state, even across failures and restarts. Operations are deterministic: they execute once and replay with the same result.
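
The flow above can be sketched with a toy model. Everything here is written for illustration and is NOT the real SDK: it is synchronous, and the `wait` pretends the timer has fired by the time the next invocation runs.

```typescript
// Toy model of replay: NOT the real SDK, synchronous for brevity.
const SUSPENDED = { reason: "suspended" }; // sentinel thrown to stop an invocation

class ToyDurableContext {
  executed: string[] = []; // step bodies that actually ran in this invocation
  private results: Map<string, unknown>;
  private firedWaits: Set<string>;

  constructor(results: Map<string, unknown>, firedWaits: Set<string>) {
    this.results = results;
    this.firedWaits = firedWaits;
  }

  step<T>(stepName: string, body: () => T): T {
    if (this.results.has(stepName)) return this.results.get(stepName) as T; // replay
    const value = body();
    this.executed.push(stepName);
    this.results.set(stepName, value); // checkpoint
    return value;
  }

  wait(waitName: string): void {
    if (!this.firedWaits.has(waitName)) {
      this.firedWaits.add(waitName); // pretend the timer fires before the next invocation
      throw SUSPENDED;
    }
  }
}

function orderWorkflow(ctx: ToyDurableContext): void {
  const orderId = ctx.step("create-order", () => "order-123");
  ctx.wait("five-minutes");
  ctx.step("send-notification", () => `sent:${orderId}`);
}

function invoke(results: Map<string, unknown>, firedWaits: Set<string>) {
  const ctx = new ToyDurableContext(results, firedWaits);
  try {
    orderWorkflow(ctx);
    return { outcome: "completed", ran: ctx.executed };
  } catch (err) {
    if (err === SUSPENDED) return { outcome: "suspended", ran: ctx.executed };
    throw err;
  }
}

const stepResults = new Map<string, unknown>();
const waitState = new Set<string>();
const firstInvocation = invoke(stepResults, waitState);  // suspended; ran create-order
const secondInvocation = invoke(stepResults, waitState); // completed; ran send-notification only
```

The second invocation starts from the top of the function, but create-order returns its checkpointed result instead of running again.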

Learn more: Understanding the Replay Model explains how replay works, why operations must be deterministic, and how to handle non-deterministic code safely.

Common Use Cases

Approval workflows - Wait for human approval before proceeding. The function pauses until someone clicks approve or reject.

Saga patterns - Coordinate distributed transactions with compensating actions. If a step fails, automatically roll back previous steps.

Scheduled tasks - Wait for specific times or intervals. Process data at midnight, send reminders after 24 hours, or retry every 5 minutes.

API orchestration - Call multiple APIs with retries and error handling. Coordinate responses and handle partial failures gracefully.

Data processing pipelines - Process large datasets in stages with checkpoints. Resume from the last successful stage if something fails.

Event-driven workflows - React to external events like webhooks, IoT signals, or user actions. Wait for events and continue processing when they arrive.
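The saga pattern from the list above is easier to picture with a small sketch. This is plain TypeScript with a hypothetical runSaga helper, not an SDK feature; in a real durable function you would likely wrap each action and each compensation in its own step so they are checkpointed.

```typescript
// Hypothetical saga runner: on failure, undo completed steps in reverse order.
interface SagaStep {
  label: string;
  action: () => void;     // forward operation
  compensate: () => void; // undo for this operation
}

function runSaga(steps: SagaStep[]): { ok: boolean; failedAt?: string } {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
    } catch {
      // Roll back everything that already succeeded, newest first
      for (const done of completed.reverse()) {
        done.compensate();
      }
      return { ok: false, failedAt: step.label };
    }
  }
  return { ok: true };
}

const sagaLog: string[] = [];
const sagaResult = runSaga([
  {
    label: 'reserve-stock',
    action: () => { sagaLog.push('reserve'); },
    compensate: () => { sagaLog.push('release'); },
  },
  {
    label: 'charge-card',
    action: () => { sagaLog.push('charge'); },
    compensate: () => { sagaLog.push('refund'); },
  },
  {
    label: 'ship-order',
    action: () => { throw new Error('carrier unavailable'); },
    compensate: () => { sagaLog.push('cancel-shipment'); },
  },
]);
```

Here the shipping step fails, so the card is refunded and then the stock is released. The failed step itself is not compensated because it never completed.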

Testing Your Workflows

Testing long-running workflows doesn't mean waiting hours. The Durable Execution SDK includes a testing library that runs your functions locally in milliseconds:

import { LocalDurableTestRunner } from '@aws/durable-execution-sdk-js-testing';

const runner = new LocalDurableTestRunner({
  handlerFunction: handler,
});

const execution = await runner.run();

expect(execution.getStatus()).toBe('SUCCEEDED');
expect(execution.getResult()).toEqual({ orderId: '123', status: 'completed' });

The test runner simulates checkpoints, skips time-based waits, and lets you inspect every operation. You can test callbacks, retries, and failures without deploying to AWS.

Learn more: Testing Durable Functions covers local testing, cloud integration tests, debugging techniques, and best practices.

Deploying with AWS SAM

Deploy durable functions using AWS SAM with a few key configurations:

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      DurableConfig:
        ExecutionTimeout: 900
        RetentionPeriodInDays: 7
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

The DurableConfig property enables durable execution and sets the workflow timeout. SAM automatically handles IAM permissions for checkpointing and state management.

Learn more: Deploying Durable Functions with SAM covers template configuration, permissions, build settings, and deployment best practices.

When to Use Durable Functions

Your workflow spans multiple steps with waits or callbacks
You need automatic retries with exponential backoff
You want to coordinate multiple async operations
Your process requires human approval or external events
You need to handle long-running tasks without managing state
You prefer writing workflows as code rather than configuration

Getting Started

Install the SDK: npm install @aws/durable-execution-sdk-js
Write your function: Wrap your handler with withDurableExecution()
Use durable operations: context.step(), context.wait(), context.waitForCallback()
Test locally: Use LocalDurableTestRunner for fast iteration
Deploy with SAM: Add DurableConfig to your template
Monitor execution: Use Amazon CloudWatch and AWS X-Ray for observability

Learn More

Understanding the Replay Model - Deep dive into how durable functions work under the hood
Testing Durable Functions - Comprehensive guide to testing workflows locally and in the cloud
Deploying with AWS SAM - Complete deployment guide with templates and best practices

Summary

AWS Lambda Durable Functions let you build long-running workflows without managing infrastructure. Write straightforward async code, and Lambda handles state management, retries, and resumption. Your functions can wait for callbacks, retry failures, and run for up to a year, all while paying only for execution time.

Start with simple workflows, test locally for fast iteration, and deploy with confidence knowing Lambda manages the complexity of distributed state.

Developing AWS Lambda Durable Functions with AWS SAM

Eric D Johnson — Wed, 03 Dec 2025 21:25:35 +0000

How to configure, build, and deploy long-running workflows using SAM templates

You've written a durable function that orchestrates a multi-step workflow. Now you need to deploy it. AWS Serverless Application Model (AWS SAM) makes this straightforward with specific configurations in your template. Let's walk through how to set up durable functions effectively.

Prerequisites

You'll need AWS SAM CLI version 1.150.1 or greater. Check your version:

sam --version

If you need to upgrade, follow the AWS SAM installation guide.

The Basics: Enabling Durable Functions

To make an AWS Lambda function durable, add the DurableConfig property to your AWS SAM template:

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      Timeout: 900                # Function timeout: 15 minutes
      DurableConfig:
        ExecutionTimeout: 3600    # Execution timeout: 1 hour
        RetentionPeriodInDays: 7  # Keep execution history for 7 days

Understanding the two timeout settings is crucial:

Function Timeout (Timeout) controls how long each individual Lambda invocation can run. This is still capped at 15 minutes (900 seconds), just like regular Lambda functions. Each time your durable function checkpoints and resumes, it's a new invocation with its own 15-minute window.

Execution Timeout (ExecutionTimeout) controls how long the entire workflow can run across all invocations. This can be up to 1 year. Your workflow can pause, wait for callbacks, and resume many times, as long as the total elapsed time doesn't exceed this limit.

For long-running workflows where the execution timeout exceeds the function timeout, you must invoke the function asynchronously. Synchronous invocations will fail with a validation error if the execution timeout is longer than the function timeout.
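The relationship between the two timeouts can be expressed as a small check. The sketch below uses the limits quoted in this article (900 seconds per invocation, 31,536,000 seconds per execution); the checkTimeouts helper and its field names are hypothetical and only mirror the rules described above.

```typescript
// Limits quoted in this article; the helper itself is a hypothetical illustration.
const MAX_FUNCTION_TIMEOUT_SECONDS = 900;         // 15 minutes per invocation
const MAX_EXECUTION_TIMEOUT_SECONDS = 31_536_000; // 365 days per execution

interface TimeoutSettings {
  functionTimeout: number;  // the Timeout property, in seconds
  executionTimeout: number; // DurableConfig.ExecutionTimeout, in seconds
}

function checkTimeouts(s: TimeoutSettings): { errors: string[]; mustInvokeAsync: boolean } {
  const errors: string[] = [];
  if (s.functionTimeout < 1 || s.functionTimeout > MAX_FUNCTION_TIMEOUT_SECONDS) {
    errors.push('Timeout must be between 1 and 900 seconds');
  }
  if (s.executionTimeout < 1 || s.executionTimeout > MAX_EXECUTION_TIMEOUT_SECONDS) {
    errors.push('ExecutionTimeout must be between 1 second and 1 year');
  }
  // A workflow that can outlive a single invocation cannot be invoked synchronously
  return { errors, mustInvokeAsync: s.executionTimeout > s.functionTimeout };
}

// The template above: 15-minute invocations, 1-hour workflow
const templateCheck = checkTimeouts({ functionTimeout: 900, executionTimeout: 3600 });
```

For the template above, the check reports no errors and flags that the function has to be invoked asynchronously.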

Installing the SDK

Add the durable execution SDK to your function's dependency manifest. SAM will automatically install it during the build process.

TypeScript/JavaScript - Add to package.json:

{
  "dependencies": {
    "@aws/durable-execution-sdk-js": "^1.0.0"
  }
}

Python - Add to requirements.txt:

aws-durable-execution-sdk-python

For client functions that send callbacks or query execution state, add the AWS SDK:

TypeScript/JavaScript:

{
  "dependencies": {
    "@aws-sdk/client-lambda": "^3.0.0"
  }
}

Python:

boto3

Build Configuration

Configure SAM to build your TypeScript functions with esbuild:

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      Runtime: nodejs22.x
      Architectures:
        - arm64
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

SAM will automatically compile your TypeScript code and bundle dependencies into a deployment package.

IAM Permissions

AWS SAM automatically grants the necessary permissions for durable functions to checkpoint and manage their execution state. You don't need to explicitly configure these permissions.

However, you may want to configure permissions for client functions that interact with durable executions. For functions that send callbacks:

BaristaCallbackFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/barista-callback
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:SendDurableExecutionCallbackHeartbeat
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'

For functions that monitor execution state or stop executions:

MonitoringFunction:
  Type: AWS::Serverless::Function
  Properties:
    CodeUri: src/monitoring
    Handler: index.handler
    Policies:
      - Statement:
          - Effect: Allow
            Action:
              - lambda:GetDurableExecution
              - lambda:GetDurableExecutionHistory
              - lambda:ListDurableExecutionsByFunction
              - lambda:SendDurableExecutionCallbackSuccess
              - lambda:SendDurableExecutionCallbackHeartbeat
              - lambda:SendDurableExecutionCallbackFailure
              - lambda:StopDurableExecution
            Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'

These permissions allow client functions to interact with your durable workflows by sending callbacks, monitoring execution state, or stopping running executions.

Using Globals for Common Configuration

AWS SAM's Globals section reduces repetition when you have multiple functions with shared settings:

Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active

Resources:
  OrderProcessorFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/order-processor
      Handler: index.handler
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7

  CallbackHandlerFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/callback-handler
      Handler: index.handler
      # No DurableConfig - this is a regular Lambda function

You can set common runtime, architecture, and memory settings at the global level. For DurableConfig, configure it at the function level to make it explicit which functions are durable. Functions without DurableConfig are regular Lambda functions.

A Complete Example

Here's a full AWS SAM template for a coffee ordering system with durable workflows:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Coffee Ordering System with Durable Functions

Globals:
  Function:
    Runtime: nodejs22.x
    Architectures:
      - arm64
    Timeout: 30
    MemorySize: 512
    Tracing: Active

Resources:
  # ========== AMAZON DYNAMODB ==========

  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: CoffeeOrders
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: orderId
          AttributeType: S
      KeySchema:
        - AttributeName: orderId
          KeyType: HASH

  # ========== FUNCTIONS ==========

  CoffeeOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/coffee-order
      Handler: index.handler
      AutoPublishAlias: Live
      DurableConfig:
        ExecutionTimeout: 3600
        RetentionPeriodInDays: 7
      Environment:
        Variables:
          ORDERS_TABLE_NAME: !Ref OrdersTable
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref OrdersTable
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /orders
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

  BaristaCallbackFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/barista-callback
      Handler: index.handler
      Policies:
        - Statement:
            - Effect: Allow
              Action:
                - lambda:SendDurableExecutionCallbackSuccess
                - lambda:SendDurableExecutionCallbackFailure
              Resource: !Sub 'arn:aws:lambda:${AWS::Region}:${AWS::AccountId}:function:${CoffeeOrderFunction}'
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /barista/accept/{orderId}
            Method: post
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        EntryPoints:
          - index.ts

Outputs:
  ApiUrl:
    Description: API Gateway endpoint URL
    Value: !Sub 'https://${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com/Prod'

  CoffeeOrderFunctionArn:
    Description: Coffee Order Function ARN
    Value: !GetAtt CoffeeOrderFunction.Arn

Building and Deploying

Build your application with AWS SAM:

sam build

This builds your functions. AWS SAM uses esbuild to compile and bundle the TypeScript functions, as configured in each function's Metadata section.

Deploy to AWS:

sam deploy --guided

The --guided flag walks you through configuration options. After the first deployment, AWS SAM saves your settings in samconfig.toml, so subsequent deploys are just:

sam deploy

Local Testing

Test your durable function locally before deploying:

sam local invoke CoffeeOrderFunction --event events/order.json

The event file contains your test payload:

{
  "orderId": "test-123",
  "attendeeId": "user-456",
  "orderDetails": {
    "drinkType": "Latte",
    "size": "Grande"
  }
}

Accessing Cloud Resources Locally

When testing locally, you often need to access deployed AWS resources like Amazon DynamoDB tables or Amazon EventBridge buses. Use the --env-vars flag with a JSON file to provide environment variables and credentials:

sam local invoke CoffeeOrderFunction \
  --event events/order.json \
  --env-vars locals.json

Create a locals.json file with environment variables for each function:

{
  "CoffeeOrderFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE_NAME": "CoffeeOrders",
    "EVENT_BUS_NAME": "CoffeeOrderingEventBus"
  },
  "BaristaCallbackFunction": {
    "AWS_REGION": "us-east-1",
    "ORDERS_TABLE_NAME": "CoffeeOrders"
  }
}

This allows your local function to interact with deployed resources in AWS, assuming your AWS credentials have the necessary permissions. Your function code can access these variables through process.env:

const tableName = process.env.ORDERS_TABLE_NAME;
const region = process.env.AWS_REGION;
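A missing variable in locals.json shows up as undefined at runtime, which can be hard to spot. One option is to fail fast with a small helper like the sketch below. The requireEnv name is hypothetical; in a handler you would pass process.env as the second argument, while the example uses an inline map so it runs on its own.

```typescript
// Hypothetical helper: read a required variable or fail with a clear message.
type EnvMap = Record<string, string | undefined>;

function requireEnv(key: string, env: EnvMap): string {
  const value = env[key];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
}

// Inline map standing in for process.env
const sampleEnv: EnvMap = { ORDERS_TABLE_NAME: 'CoffeeOrders' };
const ordersTable: string = requireEnv('ORDERS_TABLE_NAME', sampleEnv);
```

Calling requireEnv('EVENT_BUS_NAME', sampleEnv) with this map throws immediately, which points straight at the missing entry instead of failing later inside an AWS SDK call.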

Tracking Execution State

For durable functions, you can track execution state:

# Get execution details
sam local execution get $EXECUTION_ARN

# View execution history
sam local execution history $EXECUTION_ARN

# Stop a running execution
sam local execution stop $EXECUTION_ARN

Test callbacks locally:

# Send success callback
sam local callback succeed $CALLBACK_ID --result '{"status": "accepted"}'

# Send failure callback
sam local callback fail $CALLBACK_ID --error '{"message": "Rejected"}'

# Send heartbeat
sam local callback heartbeat $CALLBACK_ID

Remote Testing

After deploying, test against your live functions:

# Invoke the function (automatically uses $LATEST qualifier)
sam remote invoke CoffeeOrderFunction --event events/order.json

# Invoke a specific qualifier/alias
sam remote invoke CoffeeOrderFunction --event events/order.json --parameter Qualifier=prod

# Get execution details
sam remote execution get $EXECUTION_ARN

# View execution history
sam remote execution history $EXECUTION_ARN

By default, sam remote invoke uses the $LATEST qualifier. You can override this with --parameter Qualifier=<your-qualifier> to test against a specific version or alias.

The execution history shows every step, checkpoint, and state transition. It's invaluable for debugging workflows in production.

Configuration Best Practices

ExecutionTimeout can be up to 1 year (31,536,000 seconds) and controls how long your entire workflow can run across all invocations. For a coffee order that waits up to 5 minutes for barista acceptance, consider setting it to 600 seconds (10 minutes) to allow some buffer. For document processing that might take hours or days, you can set it much higher - up to the 1-year maximum.

Function Timeout is still capped at 15 minutes (900 seconds) and controls how long each individual invocation can run. Set this based on how long your function needs between checkpoints. If your steps typically complete in seconds, starting with 30 seconds is reasonable. If you have longer-running operations, consider increasing it up to the 15-minute maximum.

Async vs Sync Invocation: If your execution timeout exceeds your function timeout, you must use asynchronous invocation. Synchronous invocations will fail validation when the execution timeout is longer than the function timeout. Configure your event sources (API Gateway, EventBridge, etc.) to invoke asynchronously for long-running workflows.

RetentionPeriodInDays determines how long execution history is kept. Consider using 7-30 days for development environments where you're actively debugging. For production environments where you might need to investigate issues weeks later, consider using 90+ days.

Memory affects both performance and cost. Starting with 512 MB and adjusting based on your function's needs is a common approach. Durable functions checkpoint frequently, so memory usage is typically lower than equivalent non-durable functions.

Monitoring and Observability

Enable AWS X-Ray tracing to see how your workflow executes:

Globals:
  Function:
    Tracing: Active

This traces each invocation and shows you the complete execution path across multiple invocations.

Add Amazon CloudWatch Logs Insights queries to analyze execution patterns:

fields @timestamp, @message
| filter @message like /CHECKPOINT/
| sort @timestamp desc
| limit 100

Create CloudWatch alarms for execution failures:

ExecutionFailureAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: CoffeeOrderExecutionFailures
    MetricName: Errors
    Namespace: AWS/Lambda
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 5
    ComparisonOperator: GreaterThanThreshold
    Dimensions:
      - Name: FunctionName
        Value: !Ref CoffeeOrderFunction

Common Patterns

Event-driven workflows triggered by API Gateway, EventBridge, or Amazon Simple Queue Service (Amazon SQS) - useful for order processing, approval workflows, and asynchronous task execution:

CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
      EventBridgeTrigger:
        Type: EventBridgeRule
        Properties:
          Pattern:
            source:
              - coffee.orders
            detail-type:
              - OrderPlaced

For long-running workflows, configure API Gateway to invoke asynchronously:

CoffeeOrderFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Timeout: 900  # 15 minutes
    DurableConfig:
      ExecutionTimeout: 86400  # 24 hours
    Events:
      ApiTrigger:
        Type: Api
        Properties:
          Path: /orders
          Method: post
          RequestParameters:
            - method.request.header.X-Amz-Invocation-Type:
                Required: false
                Caching: false

Scheduled workflows that run on a timer - useful for daily reports, periodic data processing, or scheduled maintenance tasks:

DailyReportFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    Events:
      DailySchedule:
        Type: Schedule
        Properties:
          Schedule: cron(0 9 * * ? *)  # 9 AM UTC daily

Fan-out workflows that process items in parallel - useful for batch processing, data transformation, or concurrent API calls:

BatchProcessorFunction:
  Type: AWS::Serverless::Function
  Properties:
    # ... other properties
    ReservedConcurrentExecutions: 10  # Limit parallel executions

Troubleshooting

Validation error on invoke: If you get a validation error saying "Execution timeout must be within the function timeout," you're trying to synchronously invoke a function where the execution timeout exceeds the function timeout. Use asynchronous invocation for long-running workflows.

Individual invocation timeout: If your function times out during a single invocation, consider increasing the Timeout property. Start by doubling the current value, up to the maximum of 900 seconds. This is separate from the workflow's total ExecutionTimeout.

Workflow timeout: If your entire workflow times out, consider increasing the ExecutionTimeout in DurableConfig. This can be up to 1 year (31,536,000 seconds).

Permission denied for callbacks: Ensure client functions have the correct IAM permissions to interact with durable executions. Functions sending callbacks need SendDurableExecutionCallback* permissions with the target durable function as the resource.

Summary

Deploying durable functions with AWS SAM involves several key configurations: enable DurableConfig on your function to set execution timeout and retention period, install the durable execution SDK as a dependency, and configure esbuild for TypeScript compilation. AWS SAM automatically grants the IAM permissions needed for durable execution operations. For client functions that interact with durable executions, explicitly configure permissions for callback and monitoring operations.

Use Globals to reduce repetition across functions, set appropriate timeouts for both individual invocations and total workflow duration, and enable tracing for observability. Test locally with sam local invoke and sam local execution commands before deploying to AWS.

With these patterns in place, you can deploy long-running workflows that span hours or days, handle failures gracefully, and provide full visibility into execution state.

The Replay Model: How AWS Lambda Durable Functions Actually Work

Eric D Johnson — Wed, 03 Dec 2025 00:27:54 +0000

Understanding the checkpoint-based execution that makes long-running workflows possible

You write an AWS Lambda function that looks like it runs continuously for hours. But Lambda functions can only run for 15 minutes. How does this work?

The answer is replay - a checkpoint-based execution model that makes your function restart from the beginning on every invocation, but skip the work it's already done. It's elegant, efficient, and once you understand it, surprisingly intuitive.

The Core Principle

Here's the fundamental truth about durable functions:

Your handler function re-executes from the beginning on every invocation, but completed operations return cached results from checkpoints instead of re-executing.

Let's see this in action with a simple workflow:

async function processOrder(event: any, ctx: DurableContext) {
  const order = await ctx.step('create-order', async () => {
    console.log('Creating order...');
    return { orderId: '123', total: 50 };
  });

  const payment = await ctx.step('process-payment', async () => {
    console.log('Processing payment...');
    return { transactionId: 'txn-456', status: 'success' };
  });

  await ctx.wait({ seconds: 300 }); // Wait 5 minutes

  const notification = await ctx.step('send-notification', async () => {
    console.log('Sending notification...');
    return { sent: true };
  });

  return { order, payment, notification };
}

Here's what actually happens across two separate Lambda invocations:

Invocation 1 (t=0s):

Creating order...
[Checkpoint: create-order completed]
Processing payment...
[Checkpoint: process-payment completed]
[Function terminates - waiting 5 minutes]

Invocation 2 (t=300s, after wait completes):

[REPLAY MODE: Skipping create-order - returning cached result]
[REPLAY MODE: Skipping process-payment - returning cached result]
[EXECUTION MODE: Running send-notification]
Sending notification...
[Checkpoint: send-notification completed]
[Function completes]

Notice what happened: The function started from the beginning both times, but on the second invocation, create-order and process-payment didn't re-execute. The logs only appeared once, even though the code ran twice. The function seamlessly continued from where it left off.

This is replay.

Execution Modes: The Secret Sauce

The SDK operates in two modes that automatically switch based on what's happening.

ExecutionMode is when the function is executing operations for the first time. Operations execute normally, results are saved to checkpoints, logs are emitted, and side effects happen.

ReplayMode is when the function is replaying previously completed operations. Operations return cached results instantly without actual execution, logs are suppressed, and no side effects occur.

The SDK automatically transitions from ReplayMode to ExecutionMode when it reaches an operation that hasn't been completed yet.
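Here is a toy model of that mode switch, including the log suppression described above. ModeTracker is a hypothetical class written for this post, not something exported by the SDK, and the list of completed operation names stands in for loaded checkpoints.

```typescript
// Toy model of mode switching; not an SDK class.
type RunMode = 'ReplayMode' | 'ExecutionMode';

class ModeTracker {
  private mode: RunMode;
  private cursor = 0;
  private readonly completed: string[];
  readonly output: string[] = [];

  constructor(completed: string[]) {
    this.completed = completed;
    // With no checkpoints there is nothing to replay
    this.mode = completed.length > 0 ? 'ReplayMode' : 'ExecutionMode';
  }

  // Called as each operation is reached, in order
  enter(operation: string): RunMode {
    const hasCheckpoint = this.completed[this.cursor] === operation;
    if (!hasCheckpoint) {
      this.mode = 'ExecutionMode'; // first operation without a checkpoint
    }
    this.cursor += 1;
    return this.mode;
  }

  // Logs are emitted only in ExecutionMode
  log(message: string): void {
    if (this.mode === 'ExecutionMode') {
      this.output.push(message);
    }
  }
}

// Second invocation of the workflow above: two steps already completed
const tracker = new ModeTracker(['create-order', 'process-payment']);
const seenModes: RunMode[] = ['create-order', 'process-payment', 'send-notification'].map((op) => {
  const mode = tracker.enter(op);
  tracker.log(`running ${op}`);
  return mode;
});
```

The first two operations are seen in ReplayMode and log nothing. The third has no checkpoint, so the tracker switches to ExecutionMode and its log line is the only one recorded.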

How Checkpoints Work

Every operation creates a checkpoint that stores:

{
  operationId: "2",           // Sequential ID
  operationType: "STEP",      // STEP, WAIT, INVOKE, etc.
  operationName: "process-payment",
  status: "SUCCEEDED",        // STARTED, SUCCEEDED, FAILED, PENDING
  result: {                   // The actual return value
    transactionId: "txn-456",
    status: "success"
  }
}

When your function restarts, the SDK loads all checkpoints from storage, indexes them by operation ID, returns cached results for completed operations, and executes new operations normally.
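As a rough sketch of that lookup, the code below indexes checkpoint records by operation ID and returns a cached result only for operations that succeeded. The record shape follows the example above; the helper names are hypothetical, and the real SDK's internals may differ.

```typescript
// Shape mirrors the checkpoint record shown above; helper names are hypothetical.
interface StoredCheckpoint {
  operationId: string;
  operationType: string; // STEP, WAIT, INVOKE, etc.
  operationName: string;
  status: 'STARTED' | 'SUCCEEDED' | 'FAILED' | 'PENDING';
  result?: unknown;
}

function indexCheckpoints(records: StoredCheckpoint[]): Map<string, StoredCheckpoint> {
  const byId = new Map<string, StoredCheckpoint>();
  for (const record of records) {
    byId.set(record.operationId, record);
  }
  return byId;
}

// Returns the cached result for a finished operation, or undefined if it still has to run
function cachedResult(index: Map<string, StoredCheckpoint>, operationId: string): unknown {
  const record = index.get(operationId);
  return record !== undefined && record.status === 'SUCCEEDED' ? record.result : undefined;
}

const checkpointIndex = indexCheckpoints([
  {
    operationId: '1',
    operationType: 'STEP',
    operationName: 'create-order',
    status: 'SUCCEEDED',
    result: { orderId: '123' },
  },
  {
    operationId: '2',
    operationType: 'STEP',
    operationName: 'process-payment',
    status: 'STARTED',
  },
]);
```

Operation "1" resolves from the cache, while operation "2" (started but not finished) and any unknown ID fall through to normal execution.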

The Determinism Requirement

For replay to work, your code must be deterministic - the same sequence of operations must happen in the same order every time.

What Breaks Determinism

// ❌ Random control flow
if (Math.random() > 0.5) {
  await ctx.step('optional-step', async () => doSomething());
}
// First run: random = 0.7, step executes, checkpoint created
// Second run: random = 0.3, step skipped
// Error: Expected operation 'optional-step' at position 2, not found!

// ❌ Time-based branching
const day = new Date().getDay(); // 0 = Sunday, 6 = Saturday
const isWeekend = day === 0 || day === 6;
if (isWeekend) {
  await ctx.step('weekend-task', async () => doWeekendWork());
}
// First run: Saturday (day 6), step executes
// Second run: Monday (day 1), step skipped
// Error: Replay consistency violation!

// ❌ External state
let counter = 0;
await ctx.step('step1', async () => {
  counter++; // Won't increment during replay!
  return counter;
});

How to Write Deterministic Code

The rule is simple: capture non-deterministic values inside steps.

// ✅ Capture random values in steps
const randomId = await ctx.step('generate-id', async () => {
  return crypto.randomUUID(); // Executed once, cached on replay
});

// ✅ Capture timestamps in steps
const timestamp = await ctx.step('get-timestamp', async () => {
  return Date.now(); // Same timestamp on every replay
});

// ✅ Use event data for control flow
if (event.shouldProcess) { // Deterministic - same event every time
  await ctx.step('process', async () => doWork());
}

// ✅ Capture time-based decisions in steps
const isWeekend = await ctx.step('check-day', async () => {
  const day = new Date().getDay(); // 0 = Sunday, 6 = Saturday
  return day === 0 || day === 6;
});
if (isWeekend) {
  await ctx.step('weekend-task', async () => doWeekendWork());
}

Replay Consistency Validation

The SDK validates that operations occur in the same order on every invocation:

// What gets validated:
// 1. Operation type (STEP, WAIT, INVOKE)
// 2. Operation name (your identifier)
// 3. Operation position (sequential order)

// Example validation error:
// "Replay consistency violation: Expected operation 'process-payment' 
//  of type STEP at position 2, but found operation 'send-email' of type STEP"

This catches bugs early. If your code's execution path changes between invocations, you'll know immediately.
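To show what such a check might look like, here is a minimal sketch that compares the operation your code reaches with the one recorded at the same position. The validateReplay function and RecordedOperation type are hypothetical; the error text simply follows the example message above.

```typescript
// Hypothetical validator; not the SDK's implementation.
interface RecordedOperation {
  type: string;  // STEP, WAIT, INVOKE, ...
  label: string; // the name your code passed to the operation
}

function validateReplay(
  recorded: RecordedOperation[],
  position: number, // 1-based position in the operation sequence
  current: RecordedOperation,
): void {
  const expected = recorded[position - 1];
  if (expected === undefined) {
    return; // nothing recorded here, so this is a new operation
  }
  if (expected.type !== current.type || expected.label !== current.label) {
    throw new Error(
      `Replay consistency violation: Expected operation '${expected.label}' ` +
        `of type ${expected.type} at position ${position}, but found operation ` +
        `'${current.label}' of type ${current.type}`,
    );
  }
}

const recordedOps: RecordedOperation[] = [
  { type: 'STEP', label: 'create-order' },
  { type: 'STEP', label: 'process-payment' },
];

// Matches the recording, so this call returns without throwing
validateReplay(recordedOps, 2, { type: 'STEP', label: 'process-payment' });
```

Passing { type: 'STEP', label: 'send-email' } at position 2 instead would throw, while any operation at position 3 is accepted because nothing has been recorded there yet.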

A Complete Example: Order Processing with Replay

Let's examine a realistic workflow through multiple invocations:

async function processOrder(event: any, ctx: DurableContext) {
  ctx.logger.info('Order processing started', { orderId: event.orderId });

  // Step 1: Validate inventory
  const inventory = await ctx.step('check-inventory', async () => {
    ctx.logger.info('Checking inventory');
    const response = await fetch(`https://api.inventory.com/check`, {
      method: 'POST',
      body: JSON.stringify({ items: event.items })
    });
    return response.json();
  });

  if (!inventory.available) {
    ctx.logger.warn('Out of stock', { missing: inventory.missing });
    return { status: 'out-of-stock' };
  }

  // Step 2: Process payment
  const payment = await ctx.step('process-payment', async () => {
    ctx.logger.info('Processing payment', { amount: inventory.total });
    const response = await fetch(`https://api.payments.com/charge`, {
      method: 'POST',
      body: JSON.stringify({ 
        customerId: event.customerId, 
        amount: inventory.total 
      })
    });
    return response.json();
  });

  // Step 3: Wait for warehouse confirmation (5 minute timeout)
  ctx.logger.info('Waiting for warehouse confirmation');
  const confirmation = await ctx.waitForCallback(
    'warehouse-confirm',
    async (callbackId) => {
      // Send callback ID to warehouse system
      await fetch(`https://api.warehouse.com/notify`, {
        method: 'POST',
        body: JSON.stringify({ orderId: event.orderId, callbackId })
      });
    },
    { timeout: { seconds: 300 } }
  );

  // Step 4: Send notification
  await ctx.step('notify-customer', async () => {
    ctx.logger.info('Sending customer notification');
    await fetch(`https://api.notifications.com/send`, {
      method: 'POST',
      body: JSON.stringify({
        customerId: event.customerId,
        message: 'Your order is confirmed!'
      })
    });
  });

  ctx.logger.info('Order processing completed');
  return { status: 'completed', orderId: event.orderId };
}

Invocation Timeline

Invocation 1 (t=0s) runs in ExecutionMode. The logs show "Order processing started", "Checking inventory", "Processing payment", and "Waiting for warehouse confirmation". Checkpoints are created for check-inventory and process-payment, both marked as SUCCEEDED. The function then enters a waiting state for the callback 'warehouse-confirm', creating a checkpoint to persist the callbackId and set the timer for the timeout.

Invocation 2 (t=120s, warehouse confirms) starts in ReplayMode and transitions to ExecutionMode. During the replay phase, "Order processing started", "Checking inventory", "Processing payment", and "Waiting for warehouse confirmation" are all suppressed - the inventory and payment steps return their cached results without re-executing. Once the function reaches new operations, it switches to ExecutionMode, checkpoints the callback result, and logs "Sending customer notification" and "Order processing completed". A checkpoint is created for notify-customer marked as SUCCEEDED, and the function completes.

Notice how the function ran from the beginning both times, but the inventory and payment APIs were only called once. Logs only appeared once with no duplicates, and the function seamlessly continued after the callback.

Operation IDs: The Replay Index

Operations are identified by sequential IDs that determine replay order:

// Root context operations
await ctx.step('step1', ...);  // ID: "1"
await ctx.step('step2', ...);  // ID: "2"
await ctx.step('step3', ...);  // ID: "3"

// Child context operations (from ctx.runInChildContext)
await ctx.runInChildContext(async (childCtx) => {  // This operation gets ID: "4"
  await childCtx.step('child-step1', ...);  // ID: "4-1"
  await childCtx.step('child-step2', ...);  // ID: "4-2"
});

IDs are deterministic - they're based on execution order, not operation names. This is why operation order must be consistent.
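The numbering scheme above can be reproduced with a simple counter per context. This is a sketch under that assumption; IdContext is a hypothetical class, and the SDK may generate IDs differently internally.

```typescript
// Hypothetical ID generator that reproduces the numbering shown above.
class IdContext {
  private counter = 0;
  private readonly prefix: string;

  constructor(prefix: string = '') {
    this.prefix = prefix;
  }

  // Next sequential ID in this context, for example "3" or "4-1"
  nextId(): string {
    this.counter += 1;
    return this.prefix === '' ? String(this.counter) : `${this.prefix}-${this.counter}`;
  }

  // A child context consumes one ID and uses it as the prefix for its own operations
  child(): IdContext {
    return new IdContext(this.nextId());
  }
}

const rootIds = new IdContext();
const generatedIds: string[] = [rootIds.nextId(), rootIds.nextId(), rootIds.nextId()];
const childIds = rootIds.child(); // takes ID "4" for the child context itself
generatedIds.push(childIds.nextId(), childIds.nextId());
```

Because the IDs depend only on the order in which operations are reached, adding or skipping an operation on a later invocation shifts every ID after it. That is the mechanical reason behind the determinism requirement.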

Common Pitfalls and Solutions

1. Non-Deterministic Control Flow

// ❌ BAD: Random branching
if (Math.random() > 0.5) {
  await ctx.step('optional', async () => doWork());
}

// ✅ GOOD: Event-based branching
if (event.shouldDoWork) {
  await ctx.step('optional', async () => doWork());
}

// ✅ GOOD: Capture decision in step
const shouldDoWork = await ctx.step('decide', async () => {
  return Math.random() > 0.5;
});
if (shouldDoWork) {
  await ctx.step('optional', async () => doWork());
}

2. Mutating Closure Variables

// ❌ BAD: External mutation
let total = 0;
await ctx.step('add-items', async () => {
  total += 10; // Won't happen during replay!
});

// ✅ GOOD: Return values
const total = await ctx.step('calculate-total', async () => {
  return 10;
});
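
You can watch the mutation go missing with a tiny simulation. As before, ToyContext is a hypothetical, synchronous stand-in that I wrote for this post. It assumes a replayed step returns its cached result without calling the callback, which is the behavior described earlier.

```typescript
// Hypothetical toy context: cached steps never call their callback again.
class ToyContext {
  private nextId = 1;
  private readonly store: Map<string, unknown>;

  constructor(store: Map<string, unknown>) {
    this.store = store;
  }

  step<T>(_name: string, fn: () => T): T {
    const id = String(this.nextId++);
    if (this.store.has(id)) return this.store.get(id) as T;
    const result = fn();
    this.store.set(id, result);
    return result;
  }
}

// BAD pattern: the value lives in a closure variable.
function badHandler(ctx: ToyContext): number {
  let total = 0;
  ctx.step("add-items", () => {
    total += 10; // skipped on replay, because the callback is not called
  });
  return total;
}

// GOOD pattern: the value is the step's return value, so it is checkpointed.
function goodHandler(ctx: ToyContext): number {
  const total = ctx.step("calculate-total", () => 10);
  return total;
}

const badStore = new Map<string, unknown>();
const badFirst = badHandler(new ToyContext(badStore));
const badReplay = badHandler(new ToyContext(badStore));

const goodStore = new Map<string, unknown>();
const goodFirst = goodHandler(new ToyContext(goodStore));
const goodReplay = goodHandler(new ToyContext(goodStore));
```

The bad handler returns 10 the first time and 0 on replay, while the good handler returns 10 both times.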

3. Side Effects Outside Steps

// ❌ BAD: Direct API calls
const response = await fetch('https://api.example.com/data');
const data = await response.json();
await ctx.step('process', async () => processData(data));
// API called on every replay!

// ✅ GOOD: API calls inside steps
const data = await ctx.step('fetch-data', async () => {
  const response = await fetch('https://api.example.com/data');
  return await response.json(); // return plain data, not the Response object, so the cached result is usable on replay
});
await ctx.step('process', async () => processData(data));

Debugging Replay Issues

When replay goes wrong, use the execution history:

# View execution history
sam remote execution history $EXECUTION_ARN

# See detailed operation data
sam remote execution history $EXECUTION_ARN --format json

The history shows every operation that executed, the order they ran in, their results or errors, and when mode transitions occurred. Look for operations appearing in different orders, missing or extra operations, or operations with different names at the same position.
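
Once you have the operation names from two runs in hand, spotting the first difference is easy to automate. The sketch below is only an illustration: I am assuming you have already pulled the names out of the history into plain string arrays, since the exact shape of the JSON output is not shown in this post. firstDivergence and the Divergence type are hypothetical helpers.

```typescript
interface Divergence {
  position: number;       // zero-based index of the first difference
  expected: string | null; // null means the operation is extra in `actual`
  actual: string | null;   // null means the operation is missing from `actual`
}

// Returns the first position where two operation sequences differ, or null.
function firstDivergence(expected: string[], actual: string[]): Divergence | null {
  const longest = Math.max(expected.length, actual.length);
  for (let i = 0; i < longest; i++) {
    const e = expected[i] ?? null;
    const a = actual[i] ?? null;
    if (e !== a) return { position: i, expected: e, actual: a };
  }
  return null;
}

// Example: the second run skipped the payment step.
const goodRun = ["check-inventory", "process-payment", "notify-customer"];
const oddRun = ["check-inventory", "notify-customer"];
const divergence = firstDivergence(goodRun, oddRun);
const identical = firstDivergence(goodRun, goodRun);
```

Here divergence points at position 1, where the first run has "process-payment" and the second has "notify-customer". That covers all three symptoms from the paragraph above: a different name at the same position, a missing operation, or an extra one.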

Best Practices for Replay-Safe Code

Wrap all non-deterministic operations in steps - random numbers, timestamps, API calls, and database queries should always be inside ctx.step(). Use event data for control flow rather than runtime-generated values, and never mutate closure variables - return values from steps instead. Keep operation order consistent so the same sequence happens every time. Test with multiple invocations to verify replay behavior locally, and check execution history to debug replay issues quickly.

Summary

The replay model is what makes durable functions possible. Your function restarts from the beginning on every invocation, but completed operations return cached results without re-executing. The SDK automatically switches between ReplayMode and ExecutionMode, and your code must be deterministic for replay to work correctly.

Once you internalize these principles, writing durable functions becomes natural. You write straightforward procedural code, and the SDK handles all the complexity of checkpointing, replay, and state management. The result? Long-running workflows that look like simple functions. That's the magic of replay.