Forem: Jason Wadsworth

Adaptable Infrastructure on AWS: Combining ECS and Lambda Behind an ALB

Jason Wadsworth — Tue, 27 May 2025 17:18:14 +0000

I recently attended a session given by Allen Helton (AWS Hero) on the past and future of IaC (infrastructure as code). In it he talked about a future where the infrastructure was more adaptive, allowing you to write code once and have it modify how it runs automatically. This brought me back to something I did about 7 years ago, where I had the same code running in both a container and in Lambda. It wasn't quite what Allen was talking about because it was a deploy time decision as to how it would run. That got me thinking about what it would take to make something like that work automatically. In this post, I’ll walk through an architecture that leverages both Amazon ECS and AWS Lambda behind a single Application Load Balancer (ALB), enabling you to dynamically shift traffic and infrastructure depending on usage patterns, all while running the same Node.js/Express codebase. I don't believe this is the end state that Allen was speaking of - I believe it needs to be easier and more automated (less code the developers have to write) - but I think this is an interesting start down that path.

💡 The Challenge

You have a Node.js application, and you want to serve it efficiently:

During high traffic, use ECS to handle concurrency and throughput cost-effectively.
During low traffic, save costs by scaling down ECS and using Lambda instead.

The goal: Maximize cost efficiency without sacrificing availability.

🏗️ Architecture Overview

At a high level, this setup looks like:

A separate Lambda controller function monitors traffic (via CloudWatch alarms) and adjusts the system accordingly.

⚙️ The Code: One App, Two Runtimes

You can run the same Express.js app on both ECS and Lambda with minimal changes.

In ECS

You deploy it as a typical containerized app on Fargate.

In Lambda

You wrap the Express app using a library like serverless-http:

    const express = require('express');
    const serverless = require('serverless-http');

    const app = express();
    // define routes here

    exports.handler = serverless(app);

🔀 Load Balancer Setup with Weighted Target Groups

Your ALB listener uses a forward action with two target groups:

ECS Target Group (type: ip)
Lambda Target Group (type: lambda)

You assign weights to these target groups. Initially:

ECS weight = 100
Lambda weight = 0

Important! -
If ECS has zero healthy targets, traffic will still route to Lambda—even with a weight of 0. This provides seamless fallback during ECS spin-up.
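As a rough sketch of the weighted forward action described above, the function below builds the action in the shape the Elastic Load Balancing v2 API uses for listener actions. The helper name and the ARNs are placeholders I made up for illustration, and the call that would actually apply the action (ModifyListener, for example) is left out.

```typescript
// Shape of a weighted forward action as used by the ELBv2 API.
interface TargetGroupTuple {
  TargetGroupArn: string;
  Weight: number;
}

interface ForwardAction {
  Type: 'forward';
  ForwardConfig: { TargetGroups: TargetGroupTuple[] };
}

// Hypothetical helper: builds the action for a given split between ECS and Lambda.
function buildForwardAction(
  ecsTargetGroupArn: string,
  lambdaTargetGroupArn: string,
  ecsWeight: number,
  lambdaWeight: number,
): ForwardAction {
  return {
    Type: 'forward',
    ForwardConfig: {
      TargetGroups: [
        { TargetGroupArn: ecsTargetGroupArn, Weight: ecsWeight },
        { TargetGroupArn: lambdaTargetGroupArn, Weight: lambdaWeight },
      ],
    },
  };
}

// Initial state: everything goes to ECS. The ARNs are placeholders.
const initialAction = buildForwardAction(
  'arn:placeholder:ecs-target-group',
  'arn:placeholder:lambda-target-group',
  100,
  0,
);
```

Keeping both target groups in the action at all times, and only changing the weights, is what lets the controller flip traffic without touching anything else on the listener.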

📉 Responding to Low Traffic

You'll need to create an alarm to respond to the changes in traffic in ECS. You can configure the alarm to whatever levels make sense for you. The alarm should be in an OK state when traffic is high enough to justify using ECS, and in an ALARM state when it should be switched to Lambda.

Tip -
I like to use EventBridge to trigger a Lambda when the state changes, but you can also connect to the alarm directly.
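If you go the EventBridge route, a minimal sketch of the rule pattern, and of pulling the new state out of the event, might look like the following. CloudWatch publishes alarm transitions with the detail type "CloudWatch Alarm State Change"; the alarm name `low-traffic-alarm` and the function and type names are placeholders of mine.

```typescript
// EventBridge event pattern matching state changes of one alarm.
// 'low-traffic-alarm' is a placeholder name.
const alarmStateChangePattern = {
  source: ['aws.cloudwatch'],
  'detail-type': ['CloudWatch Alarm State Change'],
  detail: { alarmName: ['low-traffic-alarm'] },
};

// The subset of the event that the controller needs.
interface AlarmStateChangeEvent {
  detail: {
    alarmName: string;
    state: { value: 'OK' | 'ALARM' | 'INSUFFICIENT_DATA' };
  };
}

type TrafficLevel = 'high' | 'low' | 'unknown';

// OK means traffic justifies ECS; ALARM means traffic is low enough for Lambda.
function trafficLevelFromEvent(event: AlarmStateChangeEvent): TrafficLevel {
  switch (event.detail.state.value) {
    case 'OK':
      return 'high';
    case 'ALARM':
      return 'low';
    default:
      // INSUFFICIENT_DATA: leave the system as it is
      return 'unknown';
  }
}
```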

When a CloudWatch alarm detects low traffic:

A Lambda controller function is triggered.
It updates the ALB listener rule to:
- Set ECS weight to 0
- Set Lambda weight to 100
It scales down the ECS service to 0 tasks.

This stops container usage completely, minimizing costs while keeping the service available via Lambda.

📈 Responding to High Traffic

On rising demand:

The same CloudWatch alarm triggers the controller Lambda.
It:
- Sets ECS weight to 100
- Sets Lambda weight to 0
It scales up the ECS service (e.g., to your default task count).

While the ECS service is starting, if no healthy targets are available, the ALB continues routing traffic to Lambda—even with weight 0—ensuring a smooth transition.
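The two responses above boil down to one small decision. Here is a sketch of that decision as a pure function, assuming the alarm convention from earlier (ALARM means low traffic). The names and the default task count of 2 are illustrative; applying the plan would mean updating the listener weights and setting the ECS service's desired count, which is not shown.

```typescript
type AlarmState = 'OK' | 'ALARM';

interface RoutingPlan {
  ecsWeight: number;
  lambdaWeight: number;
  ecsDesiredCount: number;
}

// Hypothetical planner: ALARM means low traffic (use Lambda), OK means high traffic (use ECS).
// defaultTaskCount is whatever your service normally runs; 2 is just an example default.
function planForAlarmState(state: AlarmState, defaultTaskCount = 2): RoutingPlan {
  if (state === 'ALARM') {
    // low traffic: send everything to Lambda and stop all tasks
    return { ecsWeight: 0, lambdaWeight: 100, ecsDesiredCount: 0 };
  }
  // high traffic: send everything to ECS and bring the tasks back
  return { ecsWeight: 100, lambdaWeight: 0, ecsDesiredCount: defaultTaskCount };
}
```

Keeping the decision separate from the AWS calls makes it easy to test the controller logic without touching real infrastructure.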

✅ Benefits

Efficiency: Save money by not running idle ECS tasks.
Resilience: Lambda catches any gaps during ECS startup.
Simplicity: One codebase, two runtimes.
Flexibility: Control via Lambda and CloudWatch means no manual intervention.

⚠️ Considerations

Authentication and authorization: If you're a serverless person you are probably used to using some authorization at the point of entry (e.g., at an API Gateway or AppSync). With this model you no longer have the zero trust model of IAM. You can use network security to be sure only certain sources can access your service; just be aware of the implications of doing so.
Cold Starts: Lambda functions might introduce latency during first invocation.
Startup Time: ECS services take time to start; make sure your fallback duration is appropriate.

🧠 Final Thoughts

This adaptive pattern gives you the best of both worlds: the scalability and efficiency of ECS for peak loads, and the cost savings and simplicity of Lambda when traffic drops. It has some limitations, and it's not the right solution for everyone, but it does start to look a little like that future Allen was talking about.

Check out a working example of this here

Step Toward Better Tenant Provisioning

Jason Wadsworth — Tue, 02 Jul 2024 01:59:19 +0000

In a multi-tenant SaaS application, you often need to manage resources that are tenant-specific. Whether it's a tenant-specific role, isolated DynamoDB tables, or per-tenant Cognito user pools, you need to have a way to deploy and update these resources across your application. In this blog, I'll show you how I have approached this problem.

There are three components of managing tenant-specific resources: creating new resources as a tenant is added, updating resources as your application's needs change, and deleting resources when a tenant is deleted.

The Old Way

In the past, I would use a model that looks something like the image above to create and delete resources. When a new tenant is added to the system an event is sent out and that would trigger a Lambda function. That Lambda function would use the SDK to create the necessary resources. Similarly, a delete would send out an event that would trigger a Lambda function that would use the SDK to delete tenant-specific resources.

There are some problems with this approach.

First, I like to use CDK, specifically L2 constructs, for all of my infrastructure. The SDK is very different, so there is a cognitive cost of using the SDK. You often need to remember more details, and the structure is very different.

Second, there isn't a place to go see all the resources associated with a tenant. This isn't a big deal when you have one resource per tenant, but as that grows it's nice to be able to go to a single spot to see everything that belongs to that tenant.

Third, while creating new resources when a tenant is created, and deleting them when a tenant is deleted, is pretty straightforward, updating is not. There isn't an event that you can use to trigger the update; at least not one that is a system event. Updates, unlike creates and deletes, are deploy time changes. As the application changes, you need to update the infrastructure of your tenants. That's a very different process than creating and deleting. Updating requires code that is aware of the current state and understands how to go from one to the other. It requires code to handle rolling back into a previous state when something goes wrong. Updates are complicated, and anytime I can remove complicated code I'm going to do it.

A Better Approach

Hearing others have the same problem, I wanted to find a solution to make things easier. I knew I wanted CDK and CloudFormation to be a part of the solution, and my thoughts quickly went to Step Functions. Could there be an answer there?

Here is what I figured out.

It starts with CDK. I create a Stack in CDK that holds all the resources for a tenant. This stack includes the use of CfnParameter to pass in the identifier of the tenant. Any resources that you need to create are added to this stack. The code looks something like this.

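import { CfnParameter, Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';

export class TenantTemplateStack extends Stack {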
    constructor(scope: Construct, id: string, props: StackProps) {
        super(scope, id, props);

        const parameterTenantId = new CfnParameter(this, 'TenantId', {
            type: 'String',
            description: 'The ID of the tenant',
        });

        const tenantId = parameterTenantId.valueAsString;

        // any resources you want to provision per tenant go here
    }
}

The stack needs to be synthesized and made available, so in our tenant management stack I include an S3 bucket where the template will be deployed, I synthesize the above template, and I deploy the synthesized template to the bucket. The keys to this are the use of the BootstraplessSynthesizer in the template stack and the Stage that I'll use to synthesize it. This creates a sort of CDK Inception, where your "cdk.out" will have your synthesized stack(s) and each stack will have another synthesized stack for your tenant template. Accessing the assembly of the stage allows us to grab the output and push it to S3 using the BucketDeployment construct.

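import { dirname } from 'path';
import { BootstraplessSynthesizer, Stack, Stage } from 'aws-cdk-lib';
import { Bucket, BucketEncryption, ObjectOwnership } from 'aws-cdk-lib/aws-s3';
import { BucketDeployment, Source } from 'aws-cdk-lib/aws-s3-deployment';
import { Construct } from 'constructs';

export class TenantManagement extends Construct {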
    public readonly templateBucketName: string;

    public readonly templateBucketKey: string;

    constructor(scope: Construct, id: string, props: TenantManagementProps) {
        super(scope, id);

        const stack = Stack.of(this);

        const templateBucket = new Bucket(this, 'TemplateBucket', {
            blockPublicAccess: {
                blockPublicAcls: true,
                blockPublicPolicy: true,
                ignorePublicAcls: true,
                restrictPublicBuckets: true,
            },
            objectOwnership: ObjectOwnership.OBJECT_WRITER,
            encryption: BucketEncryption.S3_MANAGED,
            enforceSSL: true,
            publicReadAccess: false,
            versioned: false,
            // important so that updates can be triggered based on this event
            eventBridgeEnabled: true,
        });

        const stage = new Stage(this, 'SynthStage');

        new TenantTemplateStack(stage, 'TenantTemplate', {
            // this allows the synthesis to generate a template without resolving CDK values like account and region
            synthesizer: new BootstraplessSynthesizer(),
        });

        // synthesize the template stack
        const assembly = stage.synth();

        // the stage only has one stack, so it's safe to grab index zero here to get the path of the output
        const templateFullPath = assembly.stacks[0].templateFullPath;

        // the bucket deployment construct will copy the resources in the specified path to S3
        new BucketDeployment(this, 'EachTenantStackDeployment', {
            destinationBucket: templateBucket,
            sources: [Source.asset(dirname(templateFullPath))],
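        });

        // expose where the template lives so other constructs can reference it
        this.templateBucketName = templateBucket.bucketName;
        // file name of the synthesized template within the deployed directory
        this.templateBucketKey = assembly.stacks[0].templateFile;
    }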
}

At this point, I have a template that is being created that can be used for each of our tenants, but I still need to run the template at the right times. I'll create a few Step Functions to do this.

I'll start with the create because it's really hard to test doing an update and delete if you don't first create it. :) The create is triggered in the same way our Lambda function that was making SDK calls was triggered; via a tenant-created event sent to EventBridge.

The flow looks a bit like this:

The detail of the Step Function looks like this:

The Step Function is actually rather simple. It makes a call to CreateStack using the template I uploaded to the S3 bucket. After calling CreateStack it calls DescribeStacks in a loop, checking to see that it has completed and failing if the stack fails. This way I can add metric alarms to notify the team if there are failures.
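To make the two halves of that flow concrete, here is a small sketch written as plain functions rather than state machine definitions. The field names follow the CloudFormation API; the `tenant-` stack naming scheme, the S3 URL layout, and the function names are assumptions for illustration, not taken from the real state machine.

```typescript
// Input for CloudFormation CreateStack.
interface CreateStackInput {
  StackName: string;
  TemplateURL: string;
  Parameters: { ParameterKey: string; ParameterValue: string }[];
}

// Assumes tenant IDs are valid in a stack name (letters, digits, hyphens).
function buildCreateStackInput(
  tenantId: string,
  bucketName: string,
  templateKey: string,
  region: string,
): CreateStackInput {
  return {
    StackName: `tenant-${tenantId}`,
    TemplateURL: `https://${bucketName}.s3.${region}.amazonaws.com/${templateKey}`,
    // matches the CfnParameter named 'TenantId' in the template stack
    Parameters: [{ ParameterKey: 'TenantId', ParameterValue: tenantId }],
  };
}

type PollResult = 'succeeded' | 'in-progress' | 'failed';

// Classifies the stack status seen while waiting on a create.
function classifyCreateStatus(stackStatus: string): PollResult {
  if (stackStatus === 'CREATE_COMPLETE') return 'succeeded';
  if (stackStatus === 'CREATE_IN_PROGRESS') return 'in-progress';
  // CREATE_FAILED, ROLLBACK_IN_PROGRESS, ROLLBACK_COMPLETE, ROLLBACK_FAILED, ...
  return 'failed';
}
```

In the state machine the classification becomes a Choice state: loop back through a Wait on 'in-progress', succeed on 'succeeded', and fail the execution otherwise so an alarm can pick it up.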

Next, I'll do the delete. Like the create, the delete is triggered from a tenant-deleted event sent to EventBridge. This runs a Step Function that looks a bit like this:

This one is a bit more complicated than the create because it first checks to see that the stack is in a state that allows it to be deleted. This way you don't end up with errors if you try to delete a stack while it's in the process of being updated.
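The check itself can be very small. A sketch, assuming you only need to know whether CloudFormation is still working on the stack (it reports that with statuses ending in `_IN_PROGRESS`); the function names are mine:

```typescript
// CloudFormation reports ongoing work with statuses ending in _IN_PROGRESS
// (for example CREATE_IN_PROGRESS or UPDATE_COMPLETE_CLEANUP_IN_PROGRESS).
function isStackBusy(stackStatus: string): boolean {
  return /_IN_PROGRESS$/.test(stackStatus);
}

type NextStep = 'wait' | 'proceed';

// Decide whether to loop (wait and describe again) or move on to the change.
function nextStepBeforeChange(stackStatus: string): NextStep {
  return isStackBusy(stackStatus) ? 'wait' : 'proceed';
}
```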

Finally, the update.

This Step Function is started whenever the template is updated in the S3 bucket. This means that when the deployment of our tenant management sends the updated template to the bucket this Step Function will run, which will automatically update all of your tenants' stacks. The State Machine looks a bit overwhelming, but when broken down into its parts it's pretty easy to understand.

The first part of this is just getting a list of tenants. I am storing our tenants in a DynamoDB table so I can query the data from there. DynamoDB uses paging, so I have to have some logic to loop over the data and call back into DynamoDB to get the next page.
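In the state machine this is a loop of states, but the same logic written as code is easier to read. The sketch below mirrors DynamoDB's paging (a last evaluated key comes back with each page and is passed in as the start key of the next call). `fetchPage` is a hypothetical callback so the loop can be exercised without AWS; in a real implementation it would wrap a Query or Scan call.

```typescript
// One page of results, mirroring DynamoDB's LastEvaluatedKey / ExclusiveStartKey paging.
interface Page<T, K> {
  items: T[];
  lastEvaluatedKey?: K;
}

// Keeps calling fetchPage until a page comes back without a lastEvaluatedKey.
async function listAllTenants<T, K>(
  fetchPage: (exclusiveStartKey?: K) => Promise<Page<T, K>>,
): Promise<T[]> {
  const tenants: T[] = [];
  let startKey: K | undefined = undefined;
  do {
    const page: Page<T, K> = await fetchPage(startKey);
    tenants.push(...page.items);
    startKey = page.lastEvaluatedKey;
  } while (startKey !== undefined);
  return tenants;
}
```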

The next bit is checking to see if the tenant's stack can be updated. This is important because a tenant may be in the process of being created when you deploy a new template. If you don't do this loop the update will fail and the new tenant won't get the updates.

Lastly, the UpdateStack call, and subsequent looping, looks just like our create logic adjusted for an update.

Conclusion

When you are building a multi-tenant SaaS app it's important to have a strategy for managing any tenant-specific resources you may have. Using Step Functions with the CDK is a great way to manage those updates. With this approach I get to continue to use CDK to model our resources, I have one place to go to see all the resources for a tenant (the CloudFormation stack for that tenant), and the complexities of updating and rolling back changes are managed by CloudFormation.

How to Add Paid Features to Your SaaS Apps

Jason Wadsworth — Mon, 13 May 2024 13:00:00 +0000

Welcome! This post focuses on implementing feature tiers in SaaS applications, rather than payment processing tools like Stripe or Square. If you're interested in learning about tiers and managing features for different customer levels, read on!

If you are still here, great! Let's get into what I am going to talk about.

Why should you consider tiers in your SaaS application?
How can you manage which customers have what tiers/features?
How do you add it to your code?
How do you account for the "noisy neighbor" problem?
How do you make sure you’re not losing money?

Let's get started

What is the point of tiers?

When talking about a SaaS application you'll often hear about tiers or levels. The idea is simply that you have different features that each tier has access to in your application. Usually, the tiers are progressive, so someone in the second tier would get everything included in the first tier plus something more. Why might you want to include multiple tiers?

Increase Adoption

If you have an application for which you are currently charging money, then adding a lower tier, whether free or just cheaper, can increase adoption of your app. That increased adoption enables some of the next points. If nothing else, more people using your app means more people being aware of your app, which is a good thing.

Upsell Opportunities

Whether you're adding a lower tier to increase the number of users or adding a paid, or higher cost, tier, the point of doing it is to upsell. You have users who are currently using your app; hopefully getting value out of it at whatever tier they are in. By adding another tier you have the chance to convert that user into a paid, or higher-paying, user.

Learning

One thing that is often overlooked when considering a free tier is the value of what you can learn by having more users using your app. If you've ever worked in a startup, or just have worked on an app with limited users, you know how challenging it can be to get feedback, and then to understand what value to put on the feedback you get. With a small set of users, you aren't sure if the request from one or two users is really valuable or just valuable to them. As you increase the number of users in your app you have a chance to hear from a larger audience. That means a bigger sample size and more meaningful data.

Keep this one in mind if you are deciding whether or not to hold on to the lower tier(s) of your app.

It's also important to always understand your users. Users at different tiers sometimes have different needs. Be sure to weigh your feedback in light of who you are getting the feedback from and test any hypothesis against all of your user personas.

Enablement

Speaking of understanding user needs, one last point on the "why" of offering tiers is that it can enable you to do things that you can't do for free, or can't do at the price of your current app. Imagine you want to add a cool AI feature to your app. It sounds like a great idea, but you quickly realize that giving it away for free is too expensive. By adding a new tier you can charge a fee that makes adding that feature an option. This can be true even if you aren't looking to make money; you can price it in a way that at least covers the cost.

Tier Management

Now that we understand some of the reasons why you might want to add tiers, how do you manage them?

I'll start with some things you shouldn't do...

Don't Build Your Own Tools

As engineers, we always think we can do that. We'll look at some problem and decide that it's not a hard problem to solve and we'll go off and solve it. Don't fall into this trap. Engineering resources are precious; spend them on things that add value to your app, your users.

Don't Make it Complicated

I'm always telling people to stop making things more complicated than they need to be. Solve the problem that is in front of you, not one that you might have later. This is true with tier management. In its simplest form it's just a true/false; does this tenant have this tier. It can become more complex later, as you learn, but start simple...always.

There are, of course, some things you can/should do as well...

Do Use 3rd Party Tools

This kind of goes without saying since it's the opposite of the first don't I listed, but it's worth restating and giving some examples. Using tools from third parties means taking advantage of what they have done so you don't have to do that work. This means you are free to build things that make your app special. I like to use feature flag tools for this. Some examples are LaunchDarkly, Split, and AWS AppConfig. I can't say I've used AppConfig, but the principles behind all of these are about the same; you pass in some bit of information and it tells you if the thing is on/off. It's a simple way to get tier management without a lot of work. Plus, if you aren't using feature flags in your app already you really should consider them. That's probably worth a blog post of its own.

Start With Options That Don't Scale

This is something you'll hear a lot in the startup space, but it's true everywhere. When you are doing something new you don't know how successful it will be. Don't spend time building things that will make your life easier once it succeeds before you know whether it will. It's okay if the first implementation of your tier management is someone going into your feature flag tool and manually changing values. You can track your billing in Excel when you're just getting started. Sure, those options are going to be painful if you are successful, but that's the point when you should increase the automation; not before.

Build As You Grow

Eventually, those things that don't scale will be pain points, for you and possibly for your users. As that happens start to build. Add something to automatically set values in the feature flag tool using their APIs. Send out invoices automatically with some sort of billing software. Even an internal UI that makes tier management a little easier can be a quick win that improves things just long enough to get to the next level of scale.

How Do You Make It Work?

At some point, you have to start putting something in the code. Here are some things to keep in mind when you do so.

Focus On Features, Not Tiers

We've been talking about tiers a lot, but what tiers really are is a collection of features. When you are adding features to your code you should mostly be thinking in terms of those features, not the tiers themselves. We've all seen pages that look something like the following:

Each tier shows you what features are included. Imagine if you want to move a feature from the basic tier to the free tier. If you focus on tiers then you have to go into the code and make that change. That doesn't sound so bad, after all, it's just one feature, so just one place in the code. What if you decided to add a whole new tier? Now you have to find every feature throughout the code and make sure the new tier is included in all the right spots. This makes changing tiers difficult and limits your sales and marketing options.

In addition to allowing you to change tiers, taking a feature-based approach allows you to grandfather in users when you make changes, and even do à la carte sales where individual features are added for particular customers.
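Here is a tiny sketch of what that separation can look like. The tier names, feature names, and the `extraFeatures` field are invented for illustration, and in practice the mapping would more likely live in your feature flag tool than in code. The point is only that application code asks about a feature and never mentions a tier.

```typescript
type Feature = 'file-storage' | 'standard-search' | 'ai-search';
type Tier = 'free' | 'basic' | 'pro';

// The only place that knows which tier includes which feature.
const tierFeatures: Record<Tier, Feature[]> = {
  free: ['file-storage'],
  basic: ['file-storage', 'standard-search'],
  pro: ['file-storage', 'standard-search', 'ai-search'],
};

interface Tenant {
  id: string;
  tier: Tier;
  // features granted individually: grandfathered or sold à la carte
  extraFeatures?: Feature[];
}

// Application code asks about features and never mentions a tier.
function hasFeature(tenant: Tenant, feature: Feature): boolean {
  const fromTier = tierFeatures[tenant.tier].indexOf(feature) !== -1;
  const fromExtras = (tenant.extraFeatures ?? []).indexOf(feature) !== -1;
  return fromTier || fromExtras;
}
```

Moving a feature between tiers, or adding a whole new tier, is then a change to the mapping and nothing else.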

Don't Confuse Permissions and Features

There is some crossover between permissions and features, so it's easy to think they can be seen as the same. Both may result in a 403 - forbidden response from an HTTP call, after all. While they do have things in common, they are different.

Permissions will often go beyond the high-level "can you access this feature?" and into object-level permissions. A user may be allowed to access the files feature, but may only be able to see certain files.

Features, on the other hand, will often go into application flow. You may have permission to access the search feature, for example, but a feature flag may determine whether you use the standard search or the AI-based search. There isn't a permission issue, it's just a different path within the code.

Use Feature Flags

I already touched on this a bit; feature flags are a great way to determine what features a user/tenant can access. The code snippet below shows a quick example of what it might look like to evaluate whether a user can access a particular API. In this example, we have a Lambda function that is being called by AppSync. Our AppSync is using a custom authorizer where we are adding the tenantId and userId to the resolverContext. All the code does is make a call to the feature flag service to determine the flag value for the given context. We use the userId as the key, but the important data point here is the tenantId. That's the value that we'll have rules for to determine whether the value is true or false. If it's false we'll simply throw an error and we're done. If it's true then it does what it would normally do. Again, keep it simple to start with. Many of the feature flag tools have capabilities beyond simple true/false evaluation, but that's all you need to get started.
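A minimal sketch of such a check could look like this. The `FeatureFlagClient` interface and the flag key `ai-search` are stand-ins for whatever feature flag service you use (each tool has its own API), and the handler takes the client as an argument so it can be swapped out in tests. The `identity.resolverContext` location is where AppSync passes along the values set by a Lambda authorizer.

```typescript
// Context sent to the flag service: userId is the key, tenantId is what the rules target.
interface FlagContext {
  key: string;
  tenantId: string;
}

// Hypothetical client interface; real tools (LaunchDarkly, Split, ...) have their own APIs.
interface FeatureFlagClient {
  isEnabled(flagKey: string, context: FlagContext, defaultValue: boolean): Promise<boolean>;
}

// The part of the AppSync event this sketch relies on.
interface ResolverEvent {
  identity: { resolverContext: { tenantId: string; userId: string } };
}

function createHandler(flags: FeatureFlagClient) {
  return async (event: ResolverEvent): Promise<string> => {
    const { tenantId, userId } = event.identity.resolverContext;
    // default to false so a flag service outage never grants a paid feature
    const allowed = await flags.isEnabled('ai-search', { key: userId, tenantId }, false);
    if (!allowed) {
      throw new Error('Feature not available for this tenant');
    }
    // ...do what the resolver would normally do
    return 'ok';
  };
}
```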

Make Sure the UI is Aware

When you're building a UI that has different levels of permissions it's a good practice to hide things from users that they can't do. This limits confusion and generally creates a better user experience. How many of you have clicked a button in the AWS console only to be given an error message saying you aren't allowed? Not a great experience.

When dealing with tiers and features you want to take a different approach. As we've already said, permissions and features are not the same. When you have a feature that is available to the user at a different tier you want the UI to show that to them. That doesn't mean it should look like you can do something and it will give you an error when you try it. Let's avoid rebuilding the AWS console experience. But you can grey out a button and show a message when the user hovers over it.

Making sure your UI is aware is how you upsell. You have users in the app who may not realize what additional capabilities your app offers at higher pricing tiers. Tell them!

Dealing With Noisy Neighbors

When you build a SaaS application you're most often hoping to get some cost benefits from having your different tenants share resources. This creates the opportunity for what is referred to as the noisy neighbor problem. It happens when one tenant is impacted by or is impacting other tenants. In a multi-tier application, this can happen in several ways, and its impact can be made worse when paid customers feel like the free-tier tenants are causing the system to slow down. This is particularly noticeable if you add a free tier and suddenly everything is worse.

There are things you can do to help.

Rate Limiting

There is a reason why every API, every service, in AWS has limits. The main reason is that AWS is IaaS/PaaS. Those share the same noisy neighbor problem as a SaaS app. Rate limiting allows you to limit how much any one user or tenant can use your system. By controlling the rate at which tenants can use your application you can avoid becoming overwhelmed by a single tenant.

In AWS one way to achieve this is by utilizing usage plans in API Gateway. With a usage plan, you can set the maximum rate for calls with the same API key. The nice thing is that you can have different usage plans so you can have different limits for different tiers. You might want your paid customers to be able to hit your APIs more frequently than the free ones, and usage plans make that easy to do.
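As a small illustration of per-tier limits, here is the kind of mapping you might keep. The field names mirror the throttle settings of an API Gateway usage plan (a steady-state rate and a burst limit); the tiers and numbers are made up.

```typescript
type Tier = 'free' | 'paid';

// Field names mirror API Gateway usage plan throttle settings.
interface ThrottleSettings {
  rateLimit: number; // steady-state requests per second
  burstLimit: number; // maximum burst size
}

// Example numbers only; pick limits that fit your own workload.
const throttleByTier: Record<Tier, ThrottleSettings> = {
  free: { rateLimit: 5, burstLimit: 10 },
  paid: { rateLimit: 50, burstLimit: 100 },
};

function throttleFor(tier: Tier): ThrottleSettings {
  return throttleByTier[tier];
}
```

Each tier would get its own usage plan with these settings, and a tenant's API key is attached to the plan for its tier.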

If you aren't using API Gateway (REST API to be specific) your options are a bit more limited. You can get some benefit from WAF, though it's not really designed to be tenant-based. Still, it can help. Beyond that, you're mostly on your own. Keep in mind that anything you implement in your code is already sharing some amount of resources. Let's just hope AWS decides to add it to other places, like AppSync, in the near future.

Segmented Queues

If you've ever gone to an amusement park you've seen how queues work. You stand in line and wait for your turn. In a SaaS application, you may want to have your paid customers going through a different queue than your free customers. Think of this like the fast pass at an amusement park. The fast pass is still a queue. It goes to the same ride. It just has fewer people in it, so you get your turn faster. You can do the same thing by sending your data to different queues, one for the free tier and one for the paid. The great thing about this is that the same Lambda function is used by both. And because you can set the concurrency at the integration you can have each queue processing messages at different speeds. You can even go so far as to set up a queue for a single tenant.
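A sketch of the routing side of this, assuming SQS queues feeding one Lambda function. The queue URLs, the tenant ID, and the concurrency numbers are placeholders; the idea is just that the producer picks a queue by tier, with an optional dedicated queue for a single tenant.

```typescript
type Tier = 'free' | 'paid';

interface QueueConfig {
  queueUrl: string;
  // maximum concurrent Lambda invocations this queue's event source may drive
  maximumConcurrency: number;
}

// Placeholder URLs and limits: the paid queue is allowed to process faster.
const tierQueues: Record<Tier, QueueConfig> = {
  free: { queueUrl: 'https://example.invalid/free-queue', maximumConcurrency: 2 },
  paid: { queueUrl: 'https://example.invalid/paid-queue', maximumConcurrency: 50 },
};

// Optional dedicated queues for individual tenants.
const dedicatedQueues: Record<string, QueueConfig> = {
  'tenant-42': { queueUrl: 'https://example.invalid/tenant-42-queue', maximumConcurrency: 20 },
};

// A dedicated queue wins over the tier queue when one exists.
function queueFor(tenantId: string, tier: Tier): QueueConfig {
  return dedicatedQueues[tenantId] ?? tierQueues[tier];
}
```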

Tenant Partitioning

As your application grows you may want to have a multi-account strategy in place. You may start by thinking you can put all your free tier customers in one AWS account and your paid customers in another. While this strategy does make sure your free tier tenants aren't impacting your paid tier tenants it does create a couple of issues.

First, it doesn't necessarily balance the workload. You may have some paid tenants who aren't doing much and some free tenants who do a lot. You really want a balance so that each account is doing a similar amount of work. That allows you to keep the account settings the same, making it easier to manage.

Second, tenants are, hopefully, moving between tiers (hopefully going from free to paid). If you isolate the tenants by tier you need to have a migration plan in place for when a tenant changes tiers. That can be expensive and complicated, especially if you want the tenant to keep working while you migrate them.

Instead of separating by tier, I like to do a weighted assignment approach. As a new tenant is added to the application the system determines where that tenant should go based on the usage of the current tenants. You can't know how the new tenant is going to behave but you can at least understand how the existing ones do and use that information to decide where to put a new tenant.
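One simple way to read "based on the usage of the current tenants" is to place each new tenant in the account that is currently doing the least work. That is an assumption on my part for this sketch, as are the names and the single `currentLoad` number; a real implementation might weigh several signals.

```typescript
interface AccountUsage {
  accountId: string;
  // any usage measure you track per account, e.g. requests over the last 30 days
  currentLoad: number;
}

// Picks the account with the lowest observed load; ties go to the first in the list.
function chooseAccountForNewTenant(accounts: AccountUsage[]): AccountUsage {
  if (accounts.length === 0) {
    throw new Error('no accounts available');
  }
  return accounts.reduce((best, candidate) =>
    candidate.currentLoad < best.currentLoad ? candidate : best,
  );
}
```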

Monitoring

No matter how you decide to manage the noisy neighbor problem the one thing you really should do is have monitoring in place. It's important to understand how your system is behaving at all times so you can make adjustments before your customers start to complain. Monitor things like latency, iterator age, message age, concurrency, and anything else that can impact the performance of your application. Use AWS tools at a minimum.

Understanding Your Cost

Even if you aren't adding multiple tiers to your app you really should try to understand the costs of each tenant/user. When you're considering adding tiers, either up or downstream, this information is crucial in understanding how much to charge and what features belong in what tier.

Custom Metrics

You can use the AWS metrics to get a lot of information, but it does have its limits. When you can't get what you need from the out-of-the-box metrics create your own. Add it anywhere you have a feature that you want to understand better. If nothing else just record that the feature was used. Make sure you are recording metrics for anything that you might want to give away. You don't want to be surprised by the cost of something you gave away for free.

Include Tier and Tenant

When you are adding your custom metrics be sure to include both the tier and tenant identifier in the metric as dimensions. This will allow you to look at the data aggregated by tier as well as to see how an individual tenant is using your app. It also can allow you to exclude a tenant if you believe their behavior is an anomaly.
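One way to do this from a Lambda function is the CloudWatch Embedded Metric Format, where a structured JSON log line is turned into a metric. The sketch below builds such a record with two dimension sets, one by tier and one by tier and tenant. The namespace `MySaaSApp`, the metric name `FeatureUsed`, and the function name are placeholders.

```typescript
// Builds a CloudWatch Embedded Metric Format record for one use of a feature.
function buildFeatureUsageMetric(
  feature: string,
  tier: string,
  tenantId: string,
  timestampMs: number,
) {
  return {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: 'MySaaSApp', // placeholder namespace
          // one series aggregated by tier, one per tenant within a tier
          Dimensions: [
            ['Feature', 'Tier'],
            ['Feature', 'Tier', 'TenantId'],
          ],
          Metrics: [{ Name: 'FeatureUsed', Unit: 'Count' }],
        },
      ],
    },
    // dimension values and the metric value live at the root of the record
    Feature: feature,
    Tier: tier,
    TenantId: tenantId,
    FeatureUsed: 1,
  };
}

// Writing this string to stdout in Lambda is enough for CloudWatch to pick it up.
const exampleLogLine = JSON.stringify(
  buildFeatureUsageMetric('ai-search', 'paid', 'tenant-42', 1700000000000),
);
```

Keep in mind that a tenant dimension creates one metric series per tenant, which has its own cost as the number of tenants grows.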

Gather Real Use Data

Some APIs, like DynamoDB, return the actual usage with each call. For S3 you can see the size of the objects via EventBridge. Record that information, with tier and tenant added, so that you can see how much things will actually cost. While you can get this information from the generic AWS metrics you won't be able to see it by tier or tenant.

Recap

There are so many things to think about as you make decisions about adding tiers to your app. We talked about a few key points today:

Tiers allow you to upsell your app
Feature flags are a great way to manage tiers/features
Focus on features, not tiers to allow for greater flexibility
Don’t forget about your neighbors
Always be aware of cost

If you want to hear some more thoughts on this topic check out the presentation I recently did on this subject with the #BelieveInServerless group. There was a lot of great discussion that followed the presentation.

Lambda-less AppSync for SaaS

Jason Wadsworth — Mon, 04 Mar 2024 20:45:11 +0000

As a builder of SaaS software, I often find myself looking at services like AppSync with a bit of jealousy. See, AppSync has a way for you to interact directly with services like DynamoDB, removing the need for a Lambda function, and the cold starts that come with it. As a SaaS builder, these direct integrations have always been out of reach because of the inability to secure the data at the tenant level. Due to some features introduced by the Step Functions team last year, there now is a way. In this post, I'll walk you through how you can access DynamoDB data from an AppSync API without the need for a Lambda function, all while maintaining tenant data isolation.

Tenant Isolation

Before getting into the details of how this solution works let's be sure we understand the problem we are trying to solve. If you are building a multi-tenant SaaS application your application must be built in such a way that one tenant isn't able to access another tenant's data. I talk about this in some of my talks and have written some blogs about it as well. It's not enough to write code that isn't supposed to access the wrong tenant's data, you need to build the protection into the system; so that the code doesn't work at a permission level if it attempts to access the wrong tenant. An attempt to access the wrong tenant's data shouldn't just be a bug, it should be a failed operation. It just shouldn't be possible. This is where the problem with AppSync direct integrations comes in. When you have AppSync querying DynamoDB, for instance, you grant AppSync specific permission to do so. Those permissions aren't unique to the caller, they are only unique to the specific integration. So if tenant 1 calls the API it looks the same, at the permission level, as if tenant 2 makes the call. Not great for isolation.

The typical solution to this problem is to have your AppSync talk to a Lambda function. Somewhere along the way, you do an STS AssumeRole operation to get credentials specifically for the tenant on which you want to operate and use those credentials to talk to DynamoDB. These credentials are scoped to the tenant, so you can only get data for that tenant. There are some different ways of accomplishing this, but in the end, it comes down to each call to the DynamoDB table being made with credentials specific to the tenant making the request. If you were to request data for another tenant the permissions wouldn't allow it.

Unfortunately, that option isn't available to us with direct integrations. I'd love to see that change, but for now, it's just not possible.

Step Functions Cross Account Access

Sometime back in 2023, the Step Functions team announced a feature that would allow you to run a state machine task from one account and have it access resources in another. It turns out that this feature has a use within the same account, too.

While Step Functions Cross Account Access was designed for...well, cross-account access, it's really just telling Step Functions what role to assume to perform the task. You can use that same mechanism within an account to have the state machine assume a specific role for a task. For example, let's say you want your state machine to assume a role for a specific tenant, with permissions scoped down to just that tenant's data. See where I'm going here? Because the role in the state machine can be dynamic, you can have a Step Function that assumes the role of the specified tenant, and reuse the Step Function across all your tenants, much like you would a Lambda function.
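To make that a bit more concrete, here is a rough sketch of a state machine definition whose task assumes a role taken from the execution input. It builds the Amazon States Language document as a plain object. The table name, the key attribute, the `$.arguments.id` path, and the `buildTenantGetItemDefinition` helper are all assumptions for illustration; the Credentials field with a `RoleArn.$` JSONPath is the piece that matters.

```typescript
// Sketch only: an Amazon States Language (ASL) definition built as a plain object.
// The table name, key layout, and input paths are hypothetical.
interface TaskState {
  Type: 'Task';
  Resource: string;
  Credentials: { 'RoleArn.$': string };
  Parameters: Record<string, unknown>;
  End: true;
}

interface StateMachineDefinition {
  StartAt: 'GetTenantItem';
  States: { GetTenantItem: TaskState };
}

function buildTenantGetItemDefinition(tableName: string, roleArnPath: string): StateMachineDefinition {
  return {
    StartAt: 'GetTenantItem',
    States: {
      GetTenantItem: {
        Type: 'Task',
        // Optimized DynamoDB integration; an aws-sdk:* resource follows the same pattern.
        Resource: 'arn:aws:states:::dynamodb:getItem',
        // Step Functions assumes this role for the task. The ".$" suffix means the
        // value is a JSONPath resolved against the state input at run time.
        Credentials: { 'RoleArn.$': roleArnPath },
        Parameters: {
          TableName: tableName,
          // Assumed key shape: a single string partition key read from the input.
          Key: { pk: { 'S.$': '$.arguments.id' } },
        },
        End: true,
      },
    },
  };
}

const definition = buildTenantGetItemDefinition('MyTable', '$.tenantRoleArn');
console.log(JSON.stringify(definition, null, 2));
```

Because the role is read from the input rather than baked into the definition, one state machine can serve every tenant.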

Step Functions SDK Integrations

One of the great things about Step Functions is that it has literally hundreds of integrations available. You can make calls to most AWS services via the SDK integrations, or use the optimized integrations for a smaller set of services. And with the ability to specify the role you want to be assumed you can call them with the permissions scoped to just the current tenant.

AppSync and Step Functions

So far we have talked about how to have Step Functions access data for a particular tenant. What we want is an integration with AppSync that doesn't require a Lambda function. This is where the Express State Machine integration with AppSync comes into play. With Express state machines you can make synchronous Step Functions calls that run the state machine directly.

There isn't anything new about this feature, so I won't go into the details. The main point is that you can call a Step Function from AppSync and return data from there.

Putting it All Together

Now that we understand that we can use Step Functions to make direct API calls with a tenant-specific IAM role, and we can call Step Functions from AppSync, how do we put this together?

To get this all to work safely we need to step back a bit and talk about the authentication. If you've read any of my previous posts on SaaS you've seen that I'll typically have a custom authorizer that not only validates the user (typically via JWT validation) but also obtains credentials for that user's tenant. In this case, we'll take a slightly different approach. Because Step Functions will be assuming the role for us we don't need to provide credentials, but we do want to provide the role that the Step Function should use. We'll add the tenant-specific role ARN to the resolverContext of the custom authorizer. This value will be available as part of the input to your Step Function. Specifically, you can access anything that you put in the resolverContext at $.identity.resolverContext in your state machine.
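As a rough illustration of the authorizer side, the sketch below returns the tenant's role ARN in the resolverContext. The `isAuthorized` and `resolverContext` fields are what AppSync expects back from a Lambda authorizer; `verifyToken`, `lookupTenantRoleArn`, and the `tenantRoleArn` key are hypothetical names, injected here so the example stays self-contained.

```typescript
// Response shape AppSync expects from a Lambda authorizer (only the fields used here).
interface AuthorizerResponse {
  isAuthorized: boolean;
  resolverContext?: { tenantId: string; tenantRoleArn: string };
}

interface AuthorizerEvent {
  authorizationToken: string;
}

// Hypothetical helpers: verifyToken validates the JWT and returns the tenant ID
// (or undefined); lookupTenantRoleArn reads the role ARN stored with the tenant.
type VerifyToken = (token: string) => Promise<string | undefined>;
type LookupTenantRoleArn = (tenantId: string) => Promise<string | undefined>;

function createAuthorizer(verifyToken: VerifyToken, lookupTenantRoleArn: LookupTenantRoleArn) {
  return async (event: AuthorizerEvent): Promise<AuthorizerResponse> => {
    const tenantId = await verifyToken(event.authorizationToken);
    if (!tenantId) return { isAuthorized: false };

    // The role ARN is looked up, never derived from the tenant ID.
    const tenantRoleArn = await lookupTenantRoleArn(tenantId);
    if (!tenantRoleArn) return { isAuthorized: false };

    // Surfaces in the state machine input under $.identity.resolverContext.
    return { isAuthorized: true, resolverContext: { tenantId, tenantRoleArn } };
  };
}
```

With a context shaped like this, the state machine would read the role from `$.identity.resolverContext.tenantRoleArn`.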

Tip:
You can access the input arguments of your Step Functions state machine from anywhere in your state machine by going through the context object, which is accessible by using the $$ notation. More information about the context object can be found here.

You may be tempted to make the name of the role something like TenantRole<tenant id> so that you can easily put together the role name anywhere that needs it. Doing so is not advisable because it can lead to the very problem we are trying to avoid. If the Step Function is determining the role to assume then it can make a mistake and use the wrong role. This, combined with a mistake about which tenant to access, allows the wrong tenant's data to be returned. Instead, you should have the tenant's role names be somewhat random. I like to use a ULID and store the name of the role with information about the tenant.

There is one more thing to keep in mind here. You probably want to limit what roles your Step Function can assume, but you don't know the names of the roles. I like to take advantage of the path of the role to make this easier. All my tenant-specific roles have a path that is something like /tenant-role/, which allows me to create an IAM policy that only allows assuming roles that are at that path. You can also limit what services can assume the role via the assume role policy document. Just be sure to keep in mind all the places you might want to assume this role (it's probably not just Step Functions).
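Here is a minimal sketch of what that path-scoped policy could look like for the state machine's execution role. The account ID and the `/tenant-role/` path are placeholders, and `buildAssumeTenantRolePolicy` is a hypothetical helper, not something from a library.

```typescript
interface PolicyStatement {
  Effect: 'Allow';
  Action: string;
  Resource: string;
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Allows sts:AssumeRole only for roles created under the given IAM path.
// rolePath must start and end with "/", for example "/tenant-role/".
function buildAssumeTenantRolePolicy(accountId: string, rolePath: string): PolicyDocument {
  if (!rolePath.startsWith('/') || !rolePath.endsWith('/')) {
    throw new Error('rolePath must start and end with "/"');
  }
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: 'sts:AssumeRole',
        // A role ARN includes its path, so a wildcard after the path matches
        // every tenant role without knowing any of the randomized names.
        Resource: `arn:aws:iam::${accountId}:role${rolePath}*`,
      },
    ],
  };
}

console.log(JSON.stringify(buildAssumeTenantRolePolicy('111122223333', '/tenant-role/'), null, 2));
```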

The Tradeoffs

This may all sound a bit too good to be true. That's probably because there are some tradeoffs to be aware of.

The first, and probably the most important, is that each tenant must have their own role. Quite often we use a single role, with a dynamic policy, for tenant isolation. This has several advantages, not the least of which is that you don't have to manage all the roles. Unfortunately, the Step Functions integration doesn't allow for a dynamic policy (wouldn't that be nice?). The importance of this tradeoff can't be overstated. There is a hard limit of 5000 IAM roles per account. If you expect to have more than maybe 1000-2000 tenants you need to consider how you will manage the role limit. You might look into tenant sharding to help (Bill Tarr talks about this a bit in his talk SaaS architecture pitfalls: Lessons from the field from re:Invent 2023). In addition to the IAM limits, you need to be able to update these roles if and when your application's needs change. There are several options here, just know that this is something you have to deal with that isn't an issue when using dynamic policies.

Another tradeoff here is that there aren't utility functions in either AppSync or Step Functions for unmarshalling DynamoDB formatted data. Interestingly, AppSync does have a way to marshall data into the DynamoDB format, but because its direct DynamoDB integration unmarshalls results automatically on the way out, it doesn't offer a utility for going the other direction. What does this mean? A lot of very specific mapping code that has to convert the ugliness of DynamoDB JSON into something a bit more useful.

Conclusion

AppSync direct integrations are a great way to allow your API to get data without needing a Lambda function. Until recently, these integrations didn't work for multi-tenant SaaS apps. With the introduction of cross-account Step Function tasks, we can now leverage the direct integrations in AppSync and Step Functions to allow us to build a multi-tenant API using AppSync without using a Lambda function, all while still isolating each tenant's data.

Securing Cross-Account Access in Multi-Tenant SaaS Applications

Jason Wadsworth — Thu, 11 Jan 2024 06:00:00 +0000

If you’re building a SaaS solution, it’s critically important that you protect and isolate each customer's data from other customers (often referred to as tenants). For companies building SaaS on AWS, one aspect of their isolation strategy is connecting the data that resides in tenant-owned AWS account(s) with the SaaS application running in the provider-owned AWS accounts.

AWS recommends using the AWS Security Token Service (STS) API to make calls to get temporary credentials for this type of cross-account access. This API leverages AWS Identity and Access Management (IAM) roles to provide access between AWS accounts.

But what are you doing to make sure these roles are only being used to connect the correct tenant accounts? In a multi-tenant application, there are two primary concerns for protecting your tenants' accounts: the potential for one tenant, a bad actor, to use the information of another tenant to gain access to that tenant's data using your application, and your own mistakes. In this blog, we examine methods of securing cross-account access using STS to ensure our customers' data is secure and isolated.

Protection From Bad Actors - The Confused Deputy Problem

When your application needs to access data in your tenants' accounts, you use the Assume Role API to get temporary credentials. Without some extra protection, this opens the door for a bad actor to take advantage of the Confused Deputy problem. I won't go into the problem in detail here, but stated simply, it allows one tenant to access another tenant's data simply by knowing the ARN of the role in the other tenant's account. AWS has a solution to combat this problem: the use of an ExternalId. There is a great post from AWS about how to implement this in your application. One important element to this is that you, the SaaS provider, need to supply the ExternalId and make sure it is a unique value in your application. By being the owner of the ID you can be sure that a second tenant cannot use the same ID in their tenant. When a tenant adds an AWS account to your application they must take the ID you supplied and include it in the assume role policy for the role they want your application to assume. This makes sure the tenant has access to that role.
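A small sketch of how the ExternalId travels with the request follows. The `TenantAwsAccount` record and `buildAssumeRoleInput` helper are assumptions for illustration; the returned object uses the parameter names of the STS AssumeRole API and could be handed to the SDK's AssumeRoleCommand.

```typescript
// Hypothetical record stored by the SaaS application for each connected account.
interface TenantAwsAccount {
  tenantId: string;
  roleArn: string;    // supplied by the tenant
  externalId: string; // generated and owned by the SaaS provider, unique per tenant
}

interface AssumeRoleInput {
  RoleArn: string;
  RoleSessionName: string;
  ExternalId: string;
}

function buildAssumeRoleInput(account: TenantAwsAccount): AssumeRoleInput {
  // Session names only allow letters, digits, and _+=,.@- (up to 64 characters),
  // so anything else in the tenant ID is replaced.
  const sessionName = `tenant-${account.tenantId}`.replace(/[^\w+=,.@-]/g, '-').slice(0, 64);
  return {
    RoleArn: account.roleArn,
    RoleSessionName: sessionName,
    // Always read from the provider's own record, never from user input.
    ExternalId: account.externalId,
  };
}

console.log(
  buildAssumeRoleInput({
    tenantId: 'acme/1',
    roleArn: 'arn:aws:iam::210987654321:role/MyApplicationAccess',
    externalId: 'my-application-provided-external-id',
  }),
);
```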

Protection From Yourself - The Bad Code Problem

I have a saying — "if your idea of data protection is a where clause in SQL, you aren't protecting my data." That saying doesn't translate perfectly well to this subject, but the point is that you can't rely on your code to protect your tenants' data. You need something more; something that makes sure bad code doesn't lead to a break in isolation.

As much as we try to avoid it, mistakes happen when writing code. We have processes in place to help avoid the mistakes (code reviews, automated tests), but we can't eliminate them entirely. We can, however, plan for them and build a system that fails safely. I talk about this subject a lot, but it's typically about things like DynamoDB or S3 data: data that you, the SaaS provider, control. It turns out we can leverage the same sort of techniques to be sure we don't accidentally assume a role for the wrong tenant.

First, let's talk a bit about what the problem is we are trying to solve. Imagine the worst-case scenario where there is hard-coded data in your application that makes its way to production. This hard-coded data uses a single tenant's role ARN and external ID to make calls into their AWS account. Now all your tenants are seeing that one tenant's data. While this example is extreme, there are less extreme ways to have the same, or similar, results. What we need is a way that even bad code can't lead to a break in isolation.

If you've heard me speak on the subject, or have read my blog post on Multi-tenant Security Implementation, you know that we don't give our code (typically running in Lambda) permission to access tenant data. Instead, we pass in credentials that are used to access the data. These credentials are created using a dynamic policy that limits access to only that tenant's data. We'll do the same thing here, but instead of accessing data, we'll limit the ability to call the Assume Role API for the specific tenant's role.

To do this, we take advantage of the iam:ResourceTag condition. This condition allows you to require that the role being assumed includes a tag with a specific value. For our example, we'll call the tag MyApplicationTenantId. The condition will require that the role being assumed is tagged with a tag called MyApplicationTenantId and a value of that tenant's ID in our application. The dynamic policy looks something like this:

{
    Action: ['sts:AssumeRole'],
    Effect: 'Allow',
    Resource: ['*'],
    Condition: {
        StringEquals: {
            'iam:ResourceTag/MyApplicationTenantId': tenantId,
        },
    },
},

NOTE: You may have noticed that we are allowing all resources. Add specific resources to further restrict that if you'd like, but the condition is good enough for this use case. I've used the path of the role in the past but since the console doesn't allow you to set the path it limits how customers can configure this role.

When the customer creates the role in their AWS account they include the tag MyApplicationTenantId on that role and set its value to their tenant ID in your application.
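For illustration, here is one way the dynamic policy could be turned into the session policy string that is passed as the Policy parameter when requesting the scoped-down credentials. `buildTenantSessionPolicy` is a hypothetical helper; the statement mirrors the one shown earlier in this section.

```typescript
// Builds the session policy for one tenant. The resulting credentials can only
// call sts:AssumeRole on roles tagged with this tenant's ID.
function buildTenantSessionPolicy(tenantId: string): string {
  const policy = {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: ['sts:AssumeRole'],
        Resource: ['*'],
        Condition: {
          StringEquals: {
            'iam:ResourceTag/MyApplicationTenantId': tenantId,
          },
        },
      },
    ],
  };
  // Session policies are passed to STS as a JSON string.
  return JSON.stringify(policy);
}

console.log(buildTenantSessionPolicy('tenant-123'));
```

Because the tenant ID is baked into the credentials rather than checked in application code, a wrong role ARN further down the line results in an access denied error instead of the wrong tenant's data.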

Setting the Source Identity to Secure AssumeRole

If you're familiar with the Assume Role API you may be aware of some of its limits, particularly as it relates to role chaining. Role chaining, simply put, is assuming one role and then using those credentials to assume another role. There are other limitations to be aware of if you are going to use role chaining but one that is important to our scenario is that role chaining requires that you have permission to set the source identity when calling the API. This permission must exist on both the assume role policy document of the role being assumed as well as the permissions of the role doing the assuming. So for our code to work, we need to add a couple of things.

First, we need to add permissions to the assume role policy document on the role we are assuming. The full policy will look something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::012345678912:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "sts:ExternalId": "my-application-provided-external-id"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::012345678912:root"
            },
            "Action": "sts:SetSourceIdentity"
        }
    ]
}

It's important to note that you can't include the condition statement on sts:SetSourceIdentity, but that's okay because it's only used in the context of assuming a role.

We also need to add the sts:SetSourceIdentity permission to the role doing the assuming. Our dynamic policy now looks something like this:

{
    Action: ['sts:AssumeRole'],
    Effect: 'Allow',
    Resource: ['*'],
    Condition: {
        StringEquals: {
            'iam:ResourceTag/MyApplicationTenantId': tenantId,
        },
    },
},
{
    Action: ['sts:SetSourceIdentity'],
    Effect: 'Allow',
    Resource: ['*'],
},

With this in place, you can safely assume your tenants' roles without being concerned you'll get data for the wrong tenant. Even if you have a hard-coded value in your code, the credentials your code uses won't have permission to assume another tenant's role.

Conclusion

Building multi-tenant SaaS applications comes with important tenant isolation challenges, especially if you connect your application to customer-owned AWS accounts. Being aware of the potential for bad actors or bad code to break that isolation is a key step in understanding how to protect against it. Making sure you’ve taken all steps to secure cross-account access with solutions like the one I've outlined in this blog helps keep your tenants’ data safe and your application secure.

Solving the DynamoDB EventBridge Pipes Problem

Jason Wadsworth — Tue, 28 Feb 2023 03:08:35 +0000

I was really excited when AWS announced EventBridge Pipes at re:Invent last year. This was going to simplify all the CDC (change data capture) code I find myself writing, and probably reduce my Lambda spend.

At first, everything was going great. I was able to create EventBridge events directly from my DynamoDB stream records with some simple JSON path. Then I ran into a problem, and I wasn't alone.

The Problem

Everything worked great when I had simple records in DynamoDB, and even complex objects were easy enough. Where things fell apart was when I had a list. It wasn't a complicated record. The data looks like this:

{
    "id": "ABCDEFG",
    "firstName": "Jason",
    "lastName": "Wadsworth",
    "email": "jasonwadsworth@outlook.com",
    "groups": ["Administrator"]
}

In DynamoDB it ends up looking like this:

{
    "id": {
        "S": "ABCDEFG"
    },
    "firstName": {
        "S": "Jason"
    },
    "lastName": {
        "S": "Wadsworth"
    },
    "email": {
        "S": "jasonwadsworth@outlook.com"
    },
    "groups": {
        "L": [
            {
                "S": "Administrator"
            }
        ]
    }
}

Now, if you've played with EventBridge Pipes you know that you can do a bit of a transform in the target, via the input template. It's a little odd to work with, but it gets the job done. The input template for the above would end up looking something like this (I intentionally left out the groups because...well, that's the problem).

{
    "id": <$.dynamodb.NewImage.id.S>,
    "firstName": <$.dynamodb.NewImage.firstName.S>,
    "lastName": <$.dynamodb.NewImage.lastName.S>,
    "email": <$.dynamodb.NewImage.email.S>
}

Okay, so what about the groups? Well, turns out that this syntax only supports some of JSON path, and it doesn't help here. With the help of some others I tried this, but it didn't work.

{
    "id": <$.dynamodb.NewImage.id.S>,
    "firstName": <$.dynamodb.NewImage.firstName.S>,
    "lastName": <$.dynamodb.NewImage.lastName.S>,
    "email": <$.dynamodb.NewImage.email.S>,
    "groups": <$.dynamodb.NewImage.groups.L[*].S>
}

The Solution

After being very frustrated by this I felt there had to be a path forward. Turns out there is. The solution is in the enrichment of EventBridge Pipes. One of the enrichment options is Step Functions Express State Machines. After some trial and error I came up with the following solution (code is in CDK).

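// Imports assume CDK v2 (aws-cdk-lib). `table` (a DynamoDB table with a stream enabled)
// and `defaultEventBus` are assumed to be defined elsewhere in the stack.
import { PolicyDocument, PolicyStatement, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { CfnPipe } from 'aws-cdk-lib/aws-pipes';
import { Map, Pass, StateMachine, StateMachineType } from 'aws-cdk-lib/aws-stepfunctions';

// Express state machine that reshapes each stream record in the batch.
const userCreatedEnrichment = new StateMachine(this, 'UserCreatedEnrichment', {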
    definition: new Map(this, 'UserCreatedEnrichmentMap', {}).iterator(
        new Pass(this, 'UserCreatedEnrichmentPass', {
            parameters: {
                'id.$': '$.dynamodb.NewImage.id.S',
                'email.$': '$.dynamodb.NewImage.email.S',
                'firstName.$': '$.dynamodb.NewImage.firstName.S',
                'groups.$': '$.dynamodb.NewImage.groups.L[*].S',
                'lastName.$': '$.dynamodb.NewImage.lastName.S',
            },
        }),
    ),
    stateMachineType: StateMachineType.EXPRESS,
});

const pipeRole = new Role(this, 'PipeRole', {
    assumedBy: new ServicePrincipal('pipes.amazonaws.com'),
    inlinePolicies: {
        sourcePolicy: new PolicyDocument({
            statements: [
                new PolicyStatement({
                    resources: [table.tableStreamArn],
                    actions: ['dynamodb:DescribeStream', 'dynamodb:GetRecords', 'dynamodb:GetShardIterator', 'dynamodb:ListStreams'],
                }),
            ],
        }),
        enrichmentPolicy: new PolicyDocument({
            statements: [
                new PolicyStatement({
                    resources: [userCreatedEnrichment.stateMachineArn],
                    actions: ['states:Start*'],
                }),
            ],
        }),
        targetPolicy: new PolicyDocument({
            statements: [
                new PolicyStatement({
                    resources: [defaultEventBus.eventBusArn],
                    actions: ['events:PutEvents'],
                }),
            ],
        }),
    },
});


new CfnPipe(this, 'UserCreatedPipe', {
    description: 'Sends UserCreated events',
    roleArn: pipeRole.roleArn,
    source: table.tableStreamArn,
    target: defaultEventBus.eventBusArn,
    sourceParameters: {
        dynamoDbStreamParameters: {
            startingPosition: 'LATEST',
            batchSize: 1,
        },
    },
    enrichment: userCreatedEnrichment.stateMachineArn,
    targetParameters: {
        eventBridgeEventBusParameters: {
            detailType: 'UserCreated',
            source: `MySource`,
        },
    },
});

The key here is that Step Functions DO support full JSON path. So by passing the raw data to a state machine I was able to manipulate the data exactly how I wanted it. Sure, it's an extra step, and it would be nice if EventBridge Pipes would fix it, but this is still better than writing more Lambda code.
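If it helps to see what the Pass state produces, the following is a plain TypeScript equivalent of the same mapping applied to one stream record. It is only an illustration of the transformation; the `UserStreamRecord` type and `toUserCreatedDetail` function are hypothetical and are not part of the pipe.

```typescript
// Just the parts of a DynamoDB stream record that the mapping reads.
interface UserStreamRecord {
  dynamodb: {
    NewImage: {
      id: { S: string };
      email: { S: string };
      firstName: { S: string };
      lastName: { S: string };
      groups: { L: Array<{ S: string }> };
    };
  };
}

function toUserCreatedDetail(record: UserStreamRecord) {
  const image = record.dynamodb.NewImage;
  return {
    id: image.id.S,
    email: image.email.S,
    firstName: image.firstName.S,
    // Equivalent of $.dynamodb.NewImage.groups.L[*].S: take the S value of every list entry.
    groups: image.groups.L.map((group) => group.S),
    lastName: image.lastName.S,
  };
}

const detail = toUserCreatedDetail({
  dynamodb: {
    NewImage: {
      id: { S: 'ABCDEFG' },
      email: { S: 'jasonwadsworth@outlook.com' },
      firstName: { S: 'Jason' },
      lastName: { S: 'Wadsworth' },
      groups: { L: [{ S: 'Administrator' }] },
    },
  },
});
console.log(detail.groups); // [ 'Administrator' ]
```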

Creating a Unique Constraint with DynamoDB

Jason Wadsworth — Thu, 16 Feb 2023 02:21:34 +0000

There are a lot of reasons why switching from SQL to NoSQL is a good idea for much of what we as developers do. The vast majority of our work is OLTP, transactional data processing, where we know what the access patterns are and can design our NoSQL data storage in a way that supports those access patterns.

Of course there are inevitably things that are not natively supported in NoSQL databases like DynamoDB, and often these things are a hurdle to those looking to make the transition. One of those things is the unique constraint.

Definition

If you're not familiar with unique constraints, they are a way of guaranteeing that there is only one instance of a particular value (or set of values if it is a composite constraint) in a table. It's different than the primary key in that it isn't...well...the primary key. A good example of this is a user table where the primary key would be a userId of some sort, and a unique constraint would be the user's email. In a SQL database you can have this constraint keep you from being able to have two users with the same email because the table will not allow duplicates.

In DynamoDB there isn't a unique constraint, but there is a way to get the same behavior. Here is how you do it.

Transactions to the rescue

There was a Twitter thread recently debating the value of DynamoDB transactions. I know there are some who don't like them, but this is one example of why I think they are valuable.

With a DynamoDB transaction you can create a limited ACID transaction on a set of records. For the example of the email constraint on a user you need just two (DynamoDB supports up to 100 records at the time of this writing, with a 4MB total size limit). So what does this transaction look like?

Basically you have a Put for each unique constraint and one for the primary record. If you're doing a single table design this can all be targeting the same table, but DynamoDB transactions can work across many tables. Each constraint record's key (combination of Partition Key and Sort Key) uniquely identifies it on the table. Each Put includes a condition that requires that the record either doesn't already exist or is owned by the user being updated.

Here's an example of what that might look like:

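import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, TransactWriteCommand } from '@aws-sdk/lib-dynamodb';

// The document client handles marshalling plain objects to DynamoDB JSON.
const documentClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

const user = {
    userId: 'User1',
    email: 'john@example.com',
    first: 'John',
    last: 'Doe'
};

await documentClient.send(new TransactWriteCommand({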
    TransactItems: [
        {
            Put: {
                Item: { pk: user.email, sk: 'EmailConstraint', userId: user.userId },
                TableName: 'User',
                ConditionExpression: 'attribute_not_exists(pk) OR userId = :userId',
                ExpressionAttributeValues: {
                    ':userId': user.userId,
                }
            }
        },
        {
            Put: {
                Item: { ...user, pk: user.userId, sk: 'User' },
                TableName: 'User'
            }
        }
    ]
}));

The first Put has a unique value of the email and the value 'EmailConstraint'. The second Put has a unique value of the userId and the value 'User'. That means that you can only have one record with a particular email in the table, and only one record with a particular userId. The ConditionExpression on the first Put limits the operation by saying that the record being put must be a new record (attribute_not_exists(pk)) or the userId of the current record must be the same as the record being saved (userId = :userId).

Now imagine if we try to add a second user that looks like this:

const user = {
    userId: 'User2',
    email: 'john@example.com',
    first: 'John',
    last: 'Roe'
};

Using the above transaction this would fail because the first item in the transaction would violate the ConditionExpression. Specifically, the record would already exist, so the attribute_not_exists(pk) would be false, and because the userId of the existing record is for a different user (User1), that would also be false.

If we change the record to have a different email it succeeds:

const user = {
    userId: 'User2',
    email: 'john.roe@example.com',
    first: 'John',
    last: 'Roe'
};

Now, let's say we want to update the first record, so we do another transaction to update it to the following:

const user = {
    userId: 'User1',
    email: 'john@example.com',
    first: 'Johnathan',
    last: 'Doe'
};

This one will succeed. The EmailConstraint record's ConditionExpression is satisfied. While the attribute_not_exists(pk) would be false, the userId = :userId would be true. Essentially nothing is changing on this record.

What about when you want to change the email of a user?

const user = {
    userId: 'User2',
    email: 'john@example.com',
    first: 'John',
    last: 'Roe'
};

Again, this would fail because the EmailConstraint record already exists and it does not belong to this user.

const user = {
    userId: 'User1',
    email: 'johnanthan@example.com',
    first: 'Johnathan',
    last: 'Doe'
};

This would succeed because there is no record with johnanthan@example.com as its key. Of course this creates a different problem. Now you have an orphaned record, john@example.com. Your first thought might be to include a Delete in the transaction, but that won't quite work because we don't know the email address to delete. You could look it up first, of course, but you can't look it up within the transaction, so you'd end up possibly having out of sync data if two updates happened at the same time. Probably not a big concern for a user's email, but it could be an issue with other data. You could make sure that the record you're updating is still the record you read, and fail if it is not. That is a good option in some cases, when the probability of a collision is low. If it fails you can look it up again.
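A sketch of that "read first, then verify" option might look like the following. It only builds the TransactItems array (the same shape used with TransactWriteCommand above); the `buildEmailChangeTransaction` helper and its parameter names are assumptions, and the previous email is whatever you read just before the update.

```typescript
interface User {
  userId: string;
  email: string;
  first: string;
  last: string;
}

interface WriteOptions {
  TableName: string;
  ConditionExpression?: string;
  ExpressionAttributeNames?: Record<string, string>;
  ExpressionAttributeValues?: Record<string, unknown>;
}

interface TransactItem {
  Put?: WriteOptions & { Item: Record<string, unknown> };
  Delete?: WriteOptions & { Key: Record<string, unknown> };
}

function buildEmailChangeTransaction(user: User, previousEmail: string): TransactItem[] {
  if (user.email === previousEmail) {
    // A transaction cannot touch the same item twice, so this path is for real changes only.
    throw new Error('email has not changed');
  }
  const TableName = 'User';
  return [
    // Claim the new email, exactly as in the create case.
    {
      Put: {
        TableName,
        Item: { pk: user.email, sk: 'EmailConstraint', userId: user.userId },
        ConditionExpression: 'attribute_not_exists(pk) OR userId = :userId',
        ExpressionAttributeValues: { ':userId': user.userId },
      },
    },
    // Release the old email, but only if this user still owns it.
    {
      Delete: {
        TableName,
        Key: { pk: previousEmail, sk: 'EmailConstraint' },
        ConditionExpression: 'userId = :userId',
        ExpressionAttributeValues: { ':userId': user.userId },
      },
    },
    // Fail the whole transaction if the stored email is no longer the one we read.
    {
      Put: {
        TableName,
        Item: { ...user, pk: user.userId, sk: 'User' },
        ConditionExpression: '#email = :previousEmail',
        ExpressionAttributeNames: { '#email': 'email' },
        ExpressionAttributeValues: { ':previousEmail': previousEmail },
      },
    },
  ];
}
```

If the condition on the user record fails, someone else changed the email in between, so you read the record again and retry.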

Embracing eventual consistency

Personally, I like to take a different approach to this problem. By using DynamoDB streams you can check for a change to the email address whenever there is a MODIFY, and if there is a change you can delete the old record at that time. This means that there is a period when the email is still unavailable, but it is a great way to guarantee the delete. You'll want to do the same on a REMOVE, since a delete has the same problem of not knowing what the email is at the time of the delete.
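As a sketch of the stream side, the function below decides which constraint record (if any) has gone stale for a given stream record. It assumes the stream is configured to include both old and new images and that user records use sk 'User' as in the earlier example; `staleEmailConstraintKey` is a hypothetical name.

```typescript
interface StreamImage {
  sk?: { S: string };
  email?: { S: string };
}

interface StreamRecord {
  eventName: 'INSERT' | 'MODIFY' | 'REMOVE';
  dynamodb: { OldImage?: StreamImage; NewImage?: StreamImage };
}

interface ConstraintKey {
  pk: string;
  sk: 'EmailConstraint';
}

// Returns the key of the constraint record to delete, or undefined if nothing is stale.
function staleEmailConstraintKey(record: StreamRecord): ConstraintKey | undefined {
  const oldImage = record.dynamodb.OldImage;
  // Only user records matter here; constraint records also flow through the stream.
  if (!oldImage || !oldImage.sk || oldImage.sk.S !== 'User' || !oldImage.email) return undefined;

  const oldEmail = oldImage.email.S;
  if (record.eventName === 'REMOVE') return { pk: oldEmail, sk: 'EmailConstraint' };

  const newImage = record.dynamodb.NewImage;
  const newEmail = newImage && newImage.email ? newImage.email.S : undefined;
  if (record.eventName === 'MODIFY' && newEmail !== oldEmail) {
    return { pk: oldEmail, sk: 'EmailConstraint' };
  }
  return undefined;
}
```

When issuing the actual delete, it is worth adding a condition that the constraint record still belongs to this user, in case another user has claimed the email in the meantime.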

Conclusion

No, a unique constraint in DynamoDB isn't as easy as it is in a SQL database, and you should only use it when you absolutely need it, but at least now you know that it is an option. One more excuse to avoid NoSQL is gone.

Multi-tenant Security Implementation

Jason Wadsworth — Tue, 06 Dec 2022 15:34:00 +0000

In my previous post I talked about why you need to think about data and security differently when working on a multi-tenant application. In this post I'll dig in a bit deeper and show you what we did at ByteChek (RIP) for our multi-tenant strategy.

The Architecture

To start, let's talk a bit about the high level architecture of the platform. 100% of the ByteChek platform is serverless. We use services like AppSync and API Gateway for synchronous communication and EventBridge, SNS, and SQS for asynchronous communication. Data is stored in S3 and DynamoDB. Compute is Lambda with a sprinkle of Step Functions for some coordination within a service. Like I said, 100% serverless. Here is a simple version of what it looks like:

Custom Authorizer

Our user interface is a React app. We use Cognito for user authentication and pass in a JWT to AppSync for requests from the app. This is the beginning of the multi-tenant strategy. As a request comes in to AppSync we use a custom authorizer on the request. There are a lot of things you can do with a custom authorizer, but the important piece for this conversation is called the resolver context (resolverContext in the JSON). This is a place where you can add anything you want to the request payload that AppSync will send to its resolvers. We use the resolver context to store credentials for the tenant of the user.

Where do we get these credentials? We use the STS (security token service) in AWS to request credentials. One of the features of the STS assume role API is the ability to pass in a policy with the request. This policy is combined with the policy of the role and the result is a set of permissions that are limited to the intersection of the two policies, so the credentials can never do more than the role itself allows. Included in the request from Cognito is the tenant ID of that user. We use that ID to create a policy that limits access to just that tenant's resources. We then pass those credentials along with the request, in the resolver context. I'll show you what that looks like a bit later.

Using the Credentials

As I mentioned, we use Lambda for our compute, so all of our AppSync resolvers are Lambda functions. I'd love to do some direct integrations here (like read/write directly from/to DynamoDB) but we need some support from AWS on this front. A recent addition to Step Functions to support assuming a role within a task is a step in the right direction, so I feel pretty confident that the future is bright for these options. Anyway, back to our solution. Typically when you create a Lambda function you assign it permissions to do whatever it is that the function needs to do. This will include things like writing to CloudWatch Logs, possibly X-Ray, as well as making calls to a DynamoDB table or S3 bucket. In our platform these functions are almost never granted permissions for the latter. Permissions to read or write to DynamoDB and/or S3 are instead granted via the credentials passed in via the resolver context. This means that the function itself cannot make a call to get or update data directly. From a security perspective that has removed the possibility of a developer accidentally writing code to do so. Instead we provide code for them to call that takes in the AppSync event and returns an AwsCredentialIdentity (JavaScript SDK v3) object that can be used with a DynamoDB or S3 client (or any other AWS client). For a developer this isn't much different than what they might be already used to doing. The biggest difference is that the client has to be created within the context of the request instead of being shared. If a developer forgets to pass in the credentials the calls don't work because the default credentials don't have the necessary permissions.
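A minimal sketch of such a helper is below. It is not the ByteChek code; the resolver context field names (`accessKeyId`, `secretAccessKey`, `sessionToken`) are assumptions about how the authorizer stored the credentials, and the returned object simply has the same fields as the SDK's AwsCredentialIdentity.

```typescript
// Same fields as AwsCredentialIdentity in the AWS SDK for JavaScript v3.
interface TenantCredentials {
  accessKeyId: string;
  secretAccessKey: string;
  sessionToken: string;
}

// Only the part of the AppSync Lambda resolver event this helper reads.
// The resolverContext keys are hypothetical; they depend on what the authorizer stored.
interface AppSyncEvent {
  identity?: {
    resolverContext?: {
      accessKeyId?: string;
      secretAccessKey?: string;
      sessionToken?: string;
    };
  };
}

function credentialsFromEvent(event: AppSyncEvent): TenantCredentials {
  const context = event.identity ? event.identity.resolverContext : undefined;
  if (!context || !context.accessKeyId || !context.secretAccessKey || !context.sessionToken) {
    // Fail loudly instead of silently falling back to the function's own role.
    throw new Error('Tenant credentials missing from resolver context');
  }
  return {
    accessKeyId: context.accessKeyId,
    secretAccessKey: context.secretAccessKey,
    sessionToken: context.sessionToken,
  };
}

// Usage inside a resolver, creating the client per request:
//   const client = new DynamoDBClient({ credentials: credentialsFromEvent(event) });
```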

Asynchronous Activity

As you've likely noticed, everything up to this point has been for a synchronous request made from AppSync. You likely also recall that we used services like EventBridge for asynchronous operations. How does this all work for asynchronous code? Most asynchronous actions are a result of a synchronous request, and so you might be tempted to think you can just pass along the credentials from the synchronous request and all is well. Not so fast. First, those credentials have a time limit. The default and max is one hour. While an hour is surely good enough for most things you want to be careful not to find yourself in a situation where a failure leads to an extremely difficult recovery. Also, keep in mind that a request may start near the end of that one hour limit, meaning you may be left with far less. Second, not all paths provide a way to do this neatly (DynamoDB streams come to mind here). You certainly don't want to be storing credentials. And finally, let's not forget I said "most", which means there are things that don't have any context to pull from. Any scheduled operation would fall into this category.

For asynchronous code our Lambda functions follow the same rule as our synchronous code; no permissions to access DynamoDB or S3. Instead they are granted permission to get credentials for a tenant. These credentials are the same credentials that are used for the synchronous case, they just are requested instead of being passed in. Everything in the system has a tenant ID in it -- whether it's a record in a DynamoDB table, or a message sent in EventBridge. As we process a record we use the tenant ID to request the credentials and use those credentials to get a DynamoDB or S3 client. Once again, if a developer forgets to pass in the credentials the calls don't work. Furthermore, if we somehow have the wrong credentials it won't work either.

A side note here. You may be asking why we don't just request the credentials in the synchronous case. While there are some potential advantages to that approach, not the least of which is consistency throughout the code base, there is a reason not to. For a synchronous request, any time added to the processing of the request degrades the user experience. By getting the credentials during the custom authorizer we greatly reduce how often we need to get these credentials. You can control the cache timeout on the authorizer (we have it set to 15 minutes). This means that you only request the credentials once during that time. That can be a significant reduction in STS calls.
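
To illustrate the authorizer side, here is a minimal sketch of a response that carries the credentials and sets a 15 minute cache. The isAuthorized, resolverContext, and ttlOverride fields belong to the AppSync Lambda authorizer response; the helper name and the keys inside resolverContext are assumptions, and the cache timeout could just as well be set on the API instead of per response.

```javascript
// 15 minutes, matching the cache timeout mentioned above.
const CACHE_TTL_SECONDS = 15 * 60;

// Hypothetical helper: wraps tenant credentials in an authorizer response.
function buildAuthorizerResponse(tenantId, credentials) {
    return {
        isAuthorized: true,
        // Kept as a flat map of string values (the key names are assumptions).
        resolverContext: {
            tenantId,
            accessKeyId: credentials.accessKeyId,
            secretAccessKey: credentials.secretAccessKey,
            sessionToken: credentials.sessionToken,
            expiration: credentials.expiration.toISOString(),
        },
        // Number of seconds AppSync may cache this response.
        ttlOverride: CACHE_TTL_SECONDS,
    };
}

const authorizerResponse = buildAuthorizerResponse('TENANT_1_ID', {
    accessKeyId: 'ASIAEXAMPLE',
    secretAccessKey: 'example-secret',
    sessionToken: 'example-token',
    expiration: new Date('2022-10-18T03:00:00.000Z'),
});
```

Whatever the cache timeout is, the credentials need to stay valid for at least that long, otherwise a cached response could hand a resolver credentials that have already expired.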

The Credentials

Okay, we've seen a lot about the flow of the system and when and how we get credentials. Let's take a little bit of time to look at the specifics of the policies. I'm going to be broader in my policies than you might actually want to be, but you should be able to get the idea here.

First, let's take a look at the policy attached to the role that we will be assuming. It looks something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "dynamodb:*Item",
                "dynamodb:Query"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Condition": {
                "StringEquals": {
                    "s3:prefix": [
                        ""
                    ],
                    "s3:delimiter": [
                        "/"
                    ]
                }
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketVersions"
            ],
            "Resource": "arn:aws:s3:::*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:PutObject*",
                "s3:GetObject*",
                "s3:DeleteObject*"
            ],
            "Resource": "arn:aws:s3:::*",
            "Effect": "Allow"
        },
        {
            "Action": "cognito-idp:*",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Notice that this policy grants some pretty broad permissions, and it needs to. This role will be used for all requests to customer data within the system. The key is the policy used when assuming the role. That policy looks something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "dynamodb:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ],
            "Condition": {
                "ForAllValues:StringLike": {
                    "dynamodb:LeadingKeys": [
                        "TENANT_1_ID*"
                    ]
                }
            }
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::*"
            ],
            "Condition": {
                "StringEquals": {
                    "s3:prefix": [
                        ""
                    ],
                    "s3:delimiter": [
                        "/"
                    ]
                }
            }
        },
        {
            "Action": [
                "s3:ListBucket",
                "s3:ListBucketVersions"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::*"
            ],
            "Condition": {
                "StringLike": {
                    "s3:prefix": [
                        "TENANT_1_ID/*"
                    ]
                }
            }
        },
        {
            "Action": [
                "s3:*Object*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::*/TENANT_1_ID/*"
            ]
        }
    ]
}

Notice the conditions and resources that include TENANT_1_ID in them. It's those parts of this policy that limit the access to just that tenant's data.
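
A minimal sketch of how that policy could be generated per tenant and passed along when assuming the role follows. The helper names, the role ARN, the session name format, and the tenant ID validation rule are all assumptions; RoleArn, RoleSessionName, DurationSeconds, and Policy are parameters of the STS AssumeRole API.

```javascript
// Hypothetical helper: builds the session policy shown above for one tenant.
function buildTenantSessionPolicy(tenantId) {
    // Assumed ID format. Rejecting wildcard characters keeps a bad ID from widening the policy.
    if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
        throw new Error('Unexpected characters in tenant ID');
    }
    return {
        Version: '2012-10-17',
        Statement: [
            {
                Action: ['dynamodb:*'],
                Effect: 'Allow',
                Resource: ['*'],
                Condition: {
                    'ForAllValues:StringLike': { 'dynamodb:LeadingKeys': [`${tenantId}*`] },
                },
            },
            {
                Action: ['s3:ListBucket'],
                Effect: 'Allow',
                Resource: ['arn:aws:s3:::*'],
                Condition: { StringEquals: { 's3:prefix': [''], 's3:delimiter': ['/'] } },
            },
            {
                Action: ['s3:ListBucket', 's3:ListBucketVersions'],
                Effect: 'Allow',
                Resource: ['arn:aws:s3:::*'],
                Condition: { StringLike: { 's3:prefix': [`${tenantId}/*`] } },
            },
            {
                Action: ['s3:*Object*'],
                Effect: 'Allow',
                Resource: [`arn:aws:s3:::*/${tenantId}/*`],
            },
        ],
    };
}

// Hypothetical helper: parameters in the shape expected by STS AssumeRole.
function buildAssumeRoleParams(roleArn, tenantId) {
    return {
        RoleArn: roleArn,
        RoleSessionName: `tenant-${tenantId}`,
        DurationSeconds: 3600,
        // The session policy is passed as a JSON string.
        Policy: JSON.stringify(buildTenantSessionPolicy(tenantId)),
    };
}

// The role ARN below is a placeholder.
const assumeRoleParams = buildAssumeRoleParams('arn:aws:iam::123456789012:role/tenant-data', 'TENANT_1_ID');
```

Generating the policy from a single function keeps the tenant ID substitution in one place, which makes it easier to review than string templates scattered through the code.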

As I mentioned earlier, the effective permissions when assuming a role are the intersection of the role's policy and the policy of the assume role request. This means that even if I add something to the policy when assuming a role, I won't actually have that permission unless the role's policy also grants it. The same is true in reverse: a permission included in the role but left out of the policy of the request is not granted either.
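
To make the rule concrete, a permission is effective only when both the role's policy and the policy of the request allow it. The toy model below treats each policy as a flat list of exact action names; real IAM evaluation also deals with wildcards, resources, and conditions, so this is an illustration of the idea rather than of IAM itself.

```javascript
// Toy model: an action is effective only if it appears in both lists.
function effectiveActions(roleActions, sessionActions) {
    const allowedBySession = new Set(sessionActions);
    return roleActions.filter((action) => allowedBySession.has(action));
}

const roleActions = ['dynamodb:GetItem', 'dynamodb:Query', 's3:GetObject'];
const sessionActions = ['dynamodb:GetItem', 'dynamodb:Query', 'sqs:SendMessage'];

// s3:GetObject is dropped (not in the request policy) and
// sqs:SendMessage is dropped (not in the role's policy).
const effective = effectiveActions(roleActions, sessionActions);
// effective is ['dynamodb:GetItem', 'dynamodb:Query']
```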

By making tenant data security a foundation of the system we've made it really easy for developers to do the right thing while making it really hard not to. Starting your multi-tenant SaaS application off with a good foundation will save you from potential headaches in the future.

Multi-tenant Security

Jason Wadsworth — Tue, 18 Oct 2022 01:55:34 +0000

Security is hard. Multi-tenant security is harder. Multi-tenancy, however, is what makes the SaaS model work, and so security becomes something that needs to be at the forefront of your system's architecture.

Let me start by telling you a bit more about what I mean when I talk about security in a multi-tenant platform. Security has many levels, from user authentication and authorization, to encryption of data at rest and in flight, and probably a million other things (I'm not a security expert and won't pretend to be one). Those things are important, but providing user authentication and encrypting data doesn't solve the multi-tenancy problem. A former colleague of mine once summed it up for me something like this:

If a "where" clause on a SQL statement is your idea of security then you aren't protecting my data.

When I first heard that I was stunned. That's what I had been doing. Everywhere I worked, everyone I talked to, was doing just that. Why was that not enough? What else should I be doing?

The multi-tenancy problem

Why was that not enough? Well, let me make this easy for you. What happens when a developer makes a mistake and leaves off that "where" clause? Now every tenant sees every other tenant's data. It's a very simple problem with very real probability and significant consequences. How significant the consequences are probably depends a bit on the type of data, but at a minimum it impacts the trust customers will have in your platform.

Testing

The problem is easy enough to identify, so the next step is to address it. Your first thought might be to add some testing that makes sure you are including your tenant identifier in your queries. I'm not going to tell you to not write tests -- tests are great -- but this problem isn't one you can test yourself out of. Tests are only as good as your ability to remember to write them. Sometimes you can generalize things in a way that allows you to always include them, and perhaps you can do that here (that would seem to remove some degree of team autonomy, but that's a different topic). Tests, however, are not part of the system that is running, and while you hope you'd catch something before it is deployed it doesn't always work out that way.

So...?

So we know we have a big problem to solve. How do we systematically make sure tenant A can't see tenant B's data, and vice versa? This requires a change to how you think about accessing data within a system.

How it's typically done

Typically you have some bit of software that uses a database of some sort, and that software has credentials to allow it to access that database. These credentials can read and/or write to any of the data in that database, and it's up to the software to control what data is being read/written. That, as we've already stated, is the problem. The software is prone to mistakes, and the credentials do nothing to protect us from those mistakes.

A better way

What if you could protect yourself from those mistakes? I'm not suggesting you won't make the mistakes. Rather, I'm suggesting that the mistake doesn't result in exposing data it shouldn't. The solution is, frankly, pretty simple. Instead of the software having credentials to read/write to any of the data in the database it doesn't have any credentials. Okay, obviously it needs to have some credentials to work, but what if those credentials weren't so much a part of the software as much as a part of the process of the software. In other words it's not a configuration value that is global to the software but is something that the software gets as a part of the process. That alone doesn't help, but one small step does. Instead of having a set of credentials that allows access to the entire database you need to have credentials that are unique to the tenant. Those credentials should fundamentally limit access to only that tenant's data. Think of this as row-level security (though you could implement it a number of ways). The software gets the credentials for the tenant it's currently processing and therefore can only access that tenant's data.

If we go back to our missing "where" clause problem we can see that this would no longer be an issue. Well, it's still an issue because now your application, or feature, isn't working, but an error message is a lot easier to explain than "oops, you weren't supposed to see that". And hopefully those tests you were writing at least found that error.

I'm not foolish enough to think this solution is bulletproof, but I do have a pretty high level of confidence that we aren't going to make a mistake that will accidentally show one tenant's data to another. At the very least I can say with confidence that "a 'where' clause in a SQL statement" is not my idea of security.

Stay tuned for my next post where I'll show you how we did this at ByteChek

State of Serverless

Jason Wadsworth — Sat, 11 Jun 2022 22:47:34 +0000

The latest report on the state of serverless from DataDog was a bit disappointing for me. Here’s why.

There is a lot of talk about what serverless means. There are those who say we shouldn’t gate keep, essentially saying everyone’s opinion matters. There is, of course, truth to that, but at some point we have to come to an agreement about what something means in order to have meaningful conversations. Conversations like “what is the state of serverless?” can’t happen if we don’t have some agreed upon understanding of what serverless is. That is where my disappointment with DataDog’s report lies. They’ve chosen to decide for the community what serverless means, and, in my (and many others) opinion, they’re wrong.

Why do I care?

First, let me tell you why I even care. As I’ve already stated, an agreed understanding is vital to continued conversations about a subject. If we cannot agree on what something is how can we reasonably discuss it? But why, then, do I care whether the view DataDog has taken is the agreed upon view?

I care because I’m tired of companies taking over terminology. I’m tired of a word or phrase turning into a marketing tool rather than having actual meaning. Let me give you a recent example.

What do you think of when you hear DevOps? Do you think of a culture shift in engineering that shifts the way we work and how we think about supporting software? Or do you think of a team that manages K8s clusters and CI/CD tooling? DevOps meant something at one point and it was hijacked to mean something very different. In its early days you could talk to people about DevOps and, assuming they were familiar with the term, you’d be talking about the same thing. Nowadays I get constant messages from recruiters about a “DevOps Role” and I know right away they aren’t talking about the DevOps I believe in. It’s rare that I can have a conversation about DevOps anymore and expect people to be on the same page as to what it means.

What is serverless?

So, what do I think serverless means, or should mean? Let’s start with what Wikipedia says (at the time of this writing):

Serverless computing is a cloud computing execution model in which the cloud provider allocates machine resources on demand, taking care of the servers on behalf of their customers. "Serverless" is a misnomer in the sense that servers are still used by cloud service providers to execute code for developers. However, developers of serverless applications are not concerned with capacity planning, configuration, management, maintenance, fault tolerance, or scaling of containers, VMs, or physical servers. Serverless computing does not hold resources in volatile memory; computing is rather done in short bursts with the results persisted to storage. When an app is not in use, there are no computing resources allocated to the app. Pricing is based on the actual amount of resources consumed by an application. It can be a form of utility computing.

There are some key points here.

On demand

This one is a bit of a no brainer, but serverless needs to be on demand. Of course what that means, exactly, can be debated. After all, I can spin up an EC2 in AWS “on demand”, but most would agree that doing so would not qualify as serverless. On demand is part of the picture, but certainly not all of it.

Consumption based pricing

Again, pretty obvious. And again, it can be applied to things that aren’t serverless, too.

Ilities

Serverless should have all the ilities, like scalability, availability, and so on. These things are inherent in serverless and not something to be configured. That’s not to say you can’t have some controls in place (e.g., Lambda concurrency limits) but if autoscaling groups and multi-AZ are things you have to worry about then, to me, you aren’t doing serverless.

Single execution

This one may not be as obvious when reading the Wikipedia definition, but it’s in there.

Serverless computing does not hold resources in volatile memory; computing is rather done in short bursts with the results persisted to storage.

I think the “results persisted to storage” bit is a bit out of line, but the essence is that there is no expectation that you’ll have access to anything from a previous operation. We do have some ability to cache things in Lambda, but it’s very limited and isn’t something you rely on by any stretch.

Scale to zero

This is probably the most significant, and maybe controversial, part of the definition. Serverless doesn’t cost anything when there isn’t anything being executed. It’s important because it means I can afford to have entire teams of engineers, each with a fully functional serverless application, and not worry about the cost. This leads to more innovation, faster development, and greater understanding. As soon as you take away scale to zero you take away the freedom of individual engineers to try things that might break the system because doing so will impact everyone.

Containers are not serverless

Okay, I said it. I’m ignoring the fact that you can use a Docker image in a Lambda (it’s still a Lambda, not a container). Yes, I’m including Fargate in that. Now, before you list all the use cases for a container and tell me how I’m wrong about Fargate let me say a few things.

Note that I didn’t say you can’t have containers in a serverless architecture. What I’m saying here is that just because you are using Fargate (or its equivalent) doesn’t mean you are serverless. Fargate may have a place in a serverless world, but it’s in a limited role. If you spin up a container to perform a single task, and shut the container down when it’s done, then you are possibly doing serverless. Be mindful of what you are doing in the container though. If it’s just a giant batch job that is really a bunch of smaller jobs put into a single execution environment you have probably crossed the line (and might want to consider Step Functions instead).

When I say containers aren’t serverless I mean that if you aren’t doing all the things that make something serverless then it’s not serverless. When DataDog includes Fargate in its report it’s unable to know how the container is being used, and thus can’t know if it’s being used in a true serverless mode. Frankly, I'd bet that it most often is not.

We may not all agree on the minor details of what makes something serverless (personally I’d like to exclude anything that requires networking), but we need to have a general understanding to be able to have conversations about it. More importantly, we need to do what we can to avoid losing the term completely, lest it become like “cloud native” (which isn’t).

Working With Hierarchy Data In DynamoDB

Jason Wadsworth — Tue, 06 Jul 2021 13:48:01 +0000

Working with hierarchies in DynamoDB can be a little intimidating. In this post I'll show you two ways to work with hierarchies, and hopefully take away some of the fear.

The Path Pattern

The first, and most common, way to deal with hierarchy data in DynamoDB is what I refer to as the path pattern. If you think of your hierarchy data like a directory or folder structure on a computer, it's pretty easy to get your head around how this works. Each element in the tree represents a folder and finding what is in a folder is as simple as knowing the path to the folder. With the right table structure, you can query for items in a folder as well as all items in all folders below a given folder. Here is how that works.

We'll start with a simple folder structure. Imagine you have some data that is something like this:

  Drive C
    Folder I
    Folder II
  Drive D
    Folder III
      Folder a
      Folder b
    Folder IV
      Folder c
    Folder V
      Folder d
        Folder i
        Folder ii
        Folder iii
      Folder e

When storing this data in DynamoDB you would include the full path of each item with that item. You might have a table that looks something like this:

ID	Root	Path	Folder Name
C			Drive C
I	C	/	Folder I
II	C	/	Folder II
D			Drive D
III	D	/	Folder III
a	D	/III/	Folder a
b	D	/III/	Folder b
IV	D	/	Folder IV
c	D	/IV/	Folder c
V	D	/	Folder V
d	D	/V/	Folder d
i	D	/V/d/	Folder i
ii	D	/V/d/	Folder ii
iii	D	/V/d/	Folder iii
e	D	/V/	Folder e

With the data in this format, you can get any single folder by its ID. You can list all folders directly under a folder by using the GSI (which will be the Root and the Path) and specifying the root and the path. You can also get all folders below a folder (all the way down the tree) by using the GSI and specifying the root and that the path begins_with the path of the folder.
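
As a small sketch of how the Root and Path values in the table above could be derived when saving an item, the helper below takes the list of IDs above the item, from the root down. The function name and its signature are assumptions for illustration.

```javascript
// Hypothetical helper: builds a table item for the path pattern.
// ancestorIds lists the IDs above the item, from the root down.
function buildPathItem(id, folderName, ancestorIds) {
    if (ancestorIds.length === 0) {
        // Root items (the drives) have no Root or Path.
        return { ID: id, 'Folder Name': folderName };
    }
    const [root, ...folders] = ancestorIds;
    // Items directly under the root get '/', deeper items get '/a/b/'.
    const path = folders.length === 0 ? '/' : `/${folders.join('/')}/`;
    return { ID: id, Root: root, Path: path, 'Folder Name': folderName };
}

const folderI = buildPathItem('i', 'Folder i', ['D', 'V', 'd']);
// folderI.Root is 'D' and folderI.Path is '/V/d/'
```

The trailing slash on the path matters for begins_with queries: a path of '/V2/' does not begin with '/V/', so a sibling whose ID merely starts with the same characters is not matched by accident.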

The code to get the Folder d looks something like this:

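// Assumes the AWS SDK for JavaScript v2 and that this runs inside an async function.
const AWS = require('aws-sdk');

const documentClient = new AWS.DynamoDB.DocumentClient();
// Get a single folder by its ID.
const result = await documentClient.get({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            Key: {
                ID: 'd'
            },
        }).promise();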

Querying for a list of the items directly below Folder V would look something like this:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.query({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            IndexName: 'gsi1',
            KeyConditionExpression: '#root = :root AND #path = :path',
            ExpressionAttributeNames: {
                '#root': 'Root',
                '#path': 'Path'
            },
            ExpressionAttributeValues: {
                ':root': 'D',
                ':path': '/V/'
            },
        }).promise();

Because the condition is = the only results will be those directly below Folder V; Folder d and Folder e.

Getting a list of all items below Folder V would look something like this:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.query({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            IndexName: 'gsi1',
            KeyConditionExpression: '#root = :root AND begins_with(#path, :path)',
            ExpressionAttributeNames: {
                '#root': 'Root',
                '#path': 'Path'
            },
            ExpressionAttributeValues: {
                ':root': 'D',
                ':path': '/V/'
            },
        }).promise();

In this case the condition is begins_with, so the results will include all values below Folder V; Folder d, Folder i, Folder ii, Folder iii, and Folder e.

This pattern allows for a great deal of power when dealing with hierarchical data. I would say that most often this is the pattern you'll want to use.

The Replication Pattern

While the above pattern is a very powerful tool for working with data in DynamoDB (I use it all the time), you might sometimes find that it doesn't work for you. The above approach assumes that you will always know the structure above the node you are searching. In the examples, querying for items below a folder requires that you know not only the ID of the folder you wish to query, but all the folders above it as well. In most cases that's not a problem, and even when you don't want to expose that information (maybe for security reasons), doing a quick lookup of that data is likely the best route (i.e., you can get the folder you want to query, and use that to know the path information above it without requiring that the caller supply the full path). So, when might you want to do something different?

Querying varying levels of depth

In the first example you were limited to querying either the items directly below a node, or all items below a node.

If you wanted to get all items except those directly below a node you'd have to query all items and filter the direct items out. That's not so bad if the number of direct items is a small number, but if you have a lot of data to exclude it's not ideal.

If you wanted to get just the items below a node and the items directly below those items (i.e., two levels deep from the node you're querying) you'd have to either query all the items and filter out the ones you don't want, or run multiple queries, one for the direct items, and one query for each of the results from that query. The first option may be okay if there aren't too many records you want to exclude, but if you have a large hierarchy, say you are working with employee data and you're the size of Amazon, doing that for the root node (e.g., the CEO) would probably not be the best idea. The second option may not be terrible, assuming the number of items directly below the node you are querying is small, but it's certainly not going to be as fast as querying it all at once. Yes, you can run each of the sub node queries in parallel, but they all require the first query to return first.

If you want three levels, or four levels, want to skip a level or two, or some other combination of ranges, well, I think you can see how complicated, and non-performant, that might get.

Faster querying when hierarchy is unknown

As I mentioned, if you don't know the structure of the hierarchy you can always request the node you want to start at and get that information before querying the hierarchy. In the world of DynamoDB you're talking about single digit millisecond latency on that get. For some applications that level of added latency is a problem (if it wasn't there wouldn't be a service like DAX). Most applications are probably fine accepting this small hit, but if you find that you need something with better read performance keep reading.

Reduced read costs

This one isn't a slam dunk, and there are even some cases where it might not be true (though most of those cases aren't a good use for this pattern anyway), but the size of that hierarchy data can get to be big if your hierarchy is deep. Small amounts can add up to a lot if you process a lot of data. This pattern will be more expensive on writes (and storage), but the read costs can be quite a bit lower in many cases.

The solution

Okay, let's get into how it works. At its core is one simple thing; every record needs to have a copy made for each node above it in the hierarchy. The data will include some extra information, specifically the relative depth and the id of the ancestor item. Using the above data, you'd end up with records like this:

ID	Ancestor ID	Relative Depth	Folder Name
C	C	0	Drive C
I	I	0	Folder I
I	C	1	Folder I
II	II	0	Folder II
II	C	1	Folder II
D	D	0	Drive D
III	III	0	Folder III
III	D	1	Folder III
a	a	0	Folder a
a	III	1	Folder a
a	D	2	Folder a
b	b	0	Folder b
b	III	1	Folder b
b	D	2	Folder b
IV	IV	0	Folder IV
IV	D	1	Folder IV
c	c	0	Folder c
c	IV	1	Folder c
c	D	2	Folder c
V	V	0	Folder V
V	D	1	Folder V
d	d	0	Folder d
d	V	1	Folder d
d	D	2	Folder d
i	i	0	Folder i
i	d	1	Folder i
i	V	2	Folder i
i	D	3	Folder i
ii	ii	0	Folder ii
ii	d	1	Folder ii
ii	V	2	Folder ii
ii	D	3	Folder ii
iii	iii	0	Folder iii
iii	d	1	Folder iii
iii	V	2	Folder iii
iii	D	3	Folder iii
e	e	0	Folder e
e	V	1	Folder e
e	D	2	Folder e

As you can see, there are a lot more records in the table. You'll also notice that each record is smaller than the first example. The Relative Depth is a number, so the size of that is fixed, and because we don't have to store the entire hierarchy on each row the extra data of the Ancestor ID is smaller than the Path, in the first pattern, for most rows.
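
The replication itself can be sketched in a few lines. The helper below is an illustration under assumed names, not the linked example code: given an item and the IDs above it (nearest parent first), it produces the record for the item itself at depth 0 plus one copy per ancestor.

```javascript
// Hypothetical helper: creates one record per ancestor, plus the record
// for the item itself at relative depth 0.
// ancestorIds lists the IDs above the item, nearest parent first.
function replicate(id, folderName, ancestorIds) {
    const chain = [id, ...ancestorIds];
    return chain.map((ancestorId, depth) => ({
        ID: id,
        'Ancestor ID': ancestorId,
        'Relative Depth': depth,
        'Folder Name': folderName,
    }));
}

const replicatedRecords = replicate('i', 'Folder i', ['d', 'V', 'D']);
// Four records, with Ancestor ID i, d, V, D at depths 0, 1, 2, 3,
// matching the four rows for Folder i in the table above.
```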

Let's now look at some of the query patterns this makes available.

The code to get a folder, like Folder d, looks like it did before, but we need to add the relative depth:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.get({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            Key: {
                ID: 'd',
                'Relative Depth': 0
            },
        }).promise();

Querying for a list of the items directly below Folder V would look something like this:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.query({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            IndexName: 'gsi1',
            KeyConditionExpression: '#ancestorId = :ancestorId AND #relativeDepth = :relativeDepth',
            ExpressionAttributeNames: {
                '#ancestorId': 'Ancestor ID',
                '#relativeDepth': 'Relative Depth'
            },
            ExpressionAttributeValues: {
                ':ancestorId': 'V',
                ':relativeDepth': 1
            },
        }).promise();

Because the relative depth is set to 1 the only results will be those directly below Folder V; Folder d and Folder e.

Getting a list of all items below Folder V would look something like this:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.query({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            IndexName: 'gsi1',
            KeyConditionExpression: '#ancestorId = :ancestorId AND #relativeDepth > :zero',
            ExpressionAttributeNames: {
                '#ancestorId': 'Ancestor ID',
                '#relativeDepth': 'Relative Depth'
            },
            ExpressionAttributeValues: {
                ':ancestorId': 'V',
                ':zero': 0
            },
        }).promise();

In this example we are stating that the relative depth needs to be greater than 0, so it will not return Folder V, but it will return everything below it; Folder d, Folder i, Folder ii, Folder iii, and Folder e.

Now, let's say you want everything directly in Drive D as well as everything in those folders, but nothing beyond that (i.e., two levels deep). The query would look like this:

const documentClient = new AWS.DynamoDB.DocumentClient();
const result = await documentClient.query({
            TableName: 'working-with-hierarchy-data-in-dynamodb',
            IndexName: 'gsi1',
            KeyConditionExpression: '#ancestorId = :ancestorId AND #relativeDepth BETWEEN :start AND :end',
            ExpressionAttributeNames: {
                '#ancestorId': 'Ancestor ID',
                '#relativeDepth': 'Relative Depth'
            },
            ExpressionAttributeValues: {
                ':ancestorId': 'D',
                ':start': 1,
                ':end': 2,
            },
        }).promise();

This would return Folder III, Folder a, Folder b, Folder IV, Folder c, Folder V, Folder d, and Folder e, but it would not return Folder i, Folder ii, or Folder iii.

If you want three (3), or four (4) levels you just change the parameters. You can even include the node you are querying by starting at zero (0) or skip a level by starting at a depth of two (2).
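
Since only the ancestor and the depth range change between these queries, the parameters can be produced by one small function. This is a sketch with an assumed helper name; the table, index, and attribute names are taken from the examples above.

```javascript
// Hypothetical helper: builds query parameters for any depth range below
// (or including, when minDepth is 0) the given ancestor.
function buildDepthRangeQuery(ancestorId, minDepth, maxDepth) {
    if (!Number.isInteger(minDepth) || !Number.isInteger(maxDepth) || minDepth < 0 || maxDepth < minDepth) {
        throw new RangeError('Depths must be integers with 0 <= minDepth <= maxDepth');
    }
    return {
        TableName: 'working-with-hierarchy-data-in-dynamodb',
        IndexName: 'gsi1',
        // BETWEEN is inclusive on both ends.
        KeyConditionExpression: '#ancestorId = :ancestorId AND #relativeDepth BETWEEN :start AND :end',
        ExpressionAttributeNames: {
            '#ancestorId': 'Ancestor ID',
            '#relativeDepth': 'Relative Depth',
        },
        ExpressionAttributeValues: {
            ':ancestorId': ancestorId,
            ':start': minDepth,
            ':end': maxDepth,
        },
    };
}

// Skip the first level and return only the items two levels below Drive D.
const secondLevelOnly = buildDepthRangeQuery('D', 2, 2);
```

The returned object can be passed straight to documentClient.query(...), so including the node itself, or going three levels deep, is just buildDepthRangeQuery('D', 0, 3).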

As you can see, this pattern opens up some new access patterns that didn't exist before, or that might have been slower performing.

Tradeoffs

Of course, there are tradeoffs to this method, beyond the write vs. read tradeoffs we already mentioned.

First, any time a record is updated you'll have to update the record and all the replicated records as well. That operation isn't too expensive because the number of copies grows linearly with the record's depth in the tree. For most trees you probably wouldn't need to do more than a handful of updates.

Second, re-parenting can be a bit complex. In the first pattern re-parenting just means updating the Path for the records at and below the re-parented record. With this pattern you need to remove records from former ancestors and add them to new ancestors. I've linked to example code for doing this, and it's not too terribly complex, but it's not a particularly cheap operation, so if you move nodes around a lot you may want to consider the cost of that.
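
To give a feel for the work involved, here is a minimal planning sketch. It is not the linked example code; the function name and input shapes are assumptions. For the moved node and everything below it, it lists the copies to remove from the former ancestors and the copies to add under the new ones. Copies that live inside the moved subtree are unaffected.

```javascript
// Hypothetical planner for moving a node (and its subtree) to a new parent.
// subtree: the moved node and everything below it, each with its depth
//          below the moved node (the moved node itself is 0).
// oldAncestorIds / newAncestorIds: ancestors of the moved node, nearest first.
function planReparent(subtree, oldAncestorIds, newAncestorIds) {
    const deletes = [];
    const puts = [];
    for (const node of subtree) {
        // Remove the copies held under the former ancestors.
        oldAncestorIds.forEach((ancestorId, index) => {
            deletes.push({
                ID: node.id,
                'Ancestor ID': ancestorId,
                'Relative Depth': node.depthBelowMoved + index + 1,
            });
        });
        // Add copies under the new ancestors with recalculated depths.
        newAncestorIds.forEach((ancestorId, index) => {
            puts.push({
                ID: node.id,
                'Ancestor ID': ancestorId,
                'Relative Depth': node.depthBelowMoved + index + 1,
                'Folder Name': node.folderName,
            });
        });
    }
    return { deletes, puts };
}

// Move Folder d from Folder V to Folder IV (only Folder i is shown below it).
const reparentPlan = planReparent(
    [
        { id: 'd', folderName: 'Folder d', depthBelowMoved: 0 },
        { id: 'i', folderName: 'Folder i', depthBelowMoved: 1 },
    ],
    ['V', 'D'],
    ['IV', 'D'],
);
```

The deletes should be applied before the puts, because a new copy can share a key with an old one (here both the old and the new chain end at Drive D). The plan grows with the size of the subtree times the depth of the chains, which is why frequent moves can get costly.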

Streams

In the example code I've linked to this blog post I'm doing all the work of duplicating and updating the records at the time of saving. A better approach to this would be to use DynamoDB streams. As a record is added you can use the stream INSERT event to trigger the code to replicate the data to each of the ancestors. You can also use the stream MODIFY event to trigger re-parenting and updating of name data (or any other data that needs to be updated), and the REMOVE event to remove records from ancestors.
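
A stream-based version might start with a router like the sketch below. The eventName values and the dynamodb.NewImage and dynamodb.OldImage fields are part of the DynamoDB stream record format (the images are only present when the stream view type includes them); the action names, and the assumption that ID is a string attribute, are mine.

```javascript
// Hypothetical router: decides what to do with one DynamoDB stream record.
function routeStreamRecord(record) {
    // REMOVE events only carry the old image.
    const image = record.dynamodb.NewImage || record.dynamodb.OldImage;
    const id = image.ID.S;
    // Stream images use DynamoDB JSON, so numbers arrive as strings under "N".
    const relativeDepth = Number(image['Relative Depth'].N);
    if (relativeDepth !== 0) {
        // Replicated copies written to the same table also show up on the
        // stream; ignore them so the function does not trigger itself.
        return { action: 'ignore', id };
    }
    switch (record.eventName) {
        case 'INSERT':
            return { action: 'replicateToAncestors', id };
        case 'MODIFY':
            return { action: 'updateOrReparentCopies', id };
        case 'REMOVE':
            return { action: 'removeCopies', id };
        default:
            return { action: 'ignore', id };
    }
}

const insertAction = routeStreamRecord({
    eventName: 'INSERT',
    dynamodb: { NewImage: { ID: { S: 'd' }, 'Relative Depth': { N: '0' } } },
});
// insertAction is { action: 'replicateToAncestors', id: 'd' }
```

A Lambda handler would loop over event.Records, call this for each record, and perform the matching writes.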

Conclusion

Hierarchies are an important part of working with data, and data stored in DynamoDB is no exception. By using one of the above patterns, I hope you find that working with hierarchy data in DynamoDB isn't something to fear. Once you find yourself using them, you'll open up a whole world of potential query patterns you might not have thought possible before.

Try It Out

If you want to give the second pattern a try, I have a working example on GitHub that you can deploy to your AWS account. https://github.com/jasonwadsworth/blog-code/tree/main/working-with-hierarchy-data-in-dynamodb

Lambda Retries and Dead Letter Queues

Jason Wadsworth — Sun, 02 May 2021 00:26:02 +0000

As you may know, I'm a big fan of serverless in AWS. The primary compute component of serverless in AWS is AWS Lambda, so as you might imagine, I use it a lot. When using Lambda, I try to follow best practices for retries and dead-letter-queues (DLQs) or error destinations, but there are so many ways to do it I often find myself needing to look them up. So, I thought it might be useful to have a simple guide. Here it is.

To make this easier to use quickly I'm including a quick reference table for each integration or integration type. The table will look something like this:

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Up to 2	❌ No	✅ Yes	✅ Yes

I'll do my best to include more information when necessary.

Asynchronous or Synchronous

There are two ways to invoke a Lambda function. The type of invocation is key to how you handle errors and retries. Understanding the difference between asynchronous and synchronous invocations will help you narrow down the options pretty quickly.

Asynchronous Invocations

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Up to 2	❌ No	✅ Yes	✅ Yes

All asynchronous invocations will retry a message up to two times (default). You can use a DLQ on the function, and/or an error destination. Asynchronous invocations do not support DLQs at the integration because the integration returns immediately after Lambda has received the request. The requests are placed in an internal queue that is managed by Lambda.

Lambda manages the function's asynchronous event queue and attempts to retry on errors. If the function returns an error, Lambda attempts to run it two more times, with a one-minute wait between the first two attempts, and two minutes between the second and third attempts. Function errors include errors returned by the function's code and errors returned by the function's runtime, such as timeouts.
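As a rough sketch, the retry count and error destination for asynchronous invocations are set per function. The object below mirrors the request shape of Lambda's PutFunctionEventInvokeConfig API as I understand it; the helper name and the ARN in the usage example are made up for illustration.

```typescript
// Local type mirroring the request fields of PutFunctionEventInvokeConfig.
interface AsyncInvokeConfig {
  FunctionName: string;
  MaximumRetryAttempts: 0 | 1 | 2; // asynchronous retries top out at 2
  DestinationConfig?: { OnFailure?: { Destination: string } };
}

// Hypothetical helper: build the settings for one function.
function asyncInvokeConfig(
  functionName: string,
  retries: 0 | 1 | 2,
  onFailureArn?: string
): AsyncInvokeConfig {
  return {
    FunctionName: functionName,
    MaximumRetryAttempts: retries,
    // Only include a destination when one was supplied.
    ...(onFailureArn
      ? { DestinationConfig: { OnFailure: { Destination: onFailureArn } } }
      : {}),
  };
}

// Usage: one retry, failures sent to an (invented) SQS queue ARN.
const exampleAsyncConfig = asyncInvokeConfig(
  "my-function",
  1,
  "arn:aws:sqs:us-east-1:123456789012:my-failures"
);
```

The function-level DLQ is a separate setting on the function configuration, so it is not part of this object.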

The following integrations asynchronously invoke your Lambda function.

Synchronous Invocations

For synchronous invocations you won't get a DLQ or error destination at the function itself. For these integrations it's up to the caller to handle retries and errors, so you may get retries and DLQs in some cases, but not others.

When you invoke a function synchronously, Lambda runs the function and waits for a response. When the function completes, Lambda returns the response from the function's code with additional data, such as the version of the function that was invoked.

If Lambda isn't able to run the function, the error is displayed in the output.

Here is a list of each integration and how its retries and DLQs work.

CLI & SDK

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Some	❌ No	❌ No	❌ No

CLI & SDK invocations can call your function either synchronously or asynchronously. There are some limited cases where the call will be retried, but exceptions thrown by your code will not lead to a retry.
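One reason exceptions in your code are not retried: on a synchronous invoke the call to Lambda itself succeeds, and the failure is reported inside the response. The sketch below uses a local type with the StatusCode and FunctionError fields from the Invoke response; the classifyInvoke helper and its outcome names are my own, not part of the SDK.

```typescript
// Local stand-in for the fields of an Invoke response that matter here.
interface InvokeResult {
  StatusCode?: number; // 200 for a synchronous call, 202 for an asynchronous one
  FunctionError?: string; // present when the function itself failed
}

type InvokeOutcome = "succeeded" | "queued" | "function-error";

// Hypothetical helper: decide what happened from the caller's point of view.
function classifyInvoke(result: InvokeResult): InvokeOutcome {
  if (result.FunctionError) {
    // The request reached Lambda, so nothing retries it for you.
    return "function-error";
  }
  if (result.StatusCode === 202) {
    // Asynchronous invoke: Lambda accepted the event and owns retries from here.
    return "queued";
  }
  return "succeeded";
}
```

If you want retries on function errors for synchronous calls, the caller has to add them.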

API Gateway

Retries	DLQ at integration	DLQ at function	Error destination at function
❌ None	❌ No	❌ No	❌ No

API Gateway invokes your function synchronously. There are no retries when making calls to your function, and the integration does not support a DLQ.

Cognito (except custom sender triggers)

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ 2	❌ No	❌ No	❌ No

Cognito invokes your function synchronously except for custom sender triggers.

When called, your Lambda function must respond within 5 seconds. If it does not, Amazon Cognito retries the call. After 3 unsuccessful attempts, the function times out.

DynamoDB Streams

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Until Expired*	✅ Yes	❌ No	❌ No

DynamoDB Streams invoke your function synchronously in batches. By default, if the function returns an error or times out, the entire batch is retried until the records expire. The integration is an event source mapping, which includes several configuration options to control retries and DLQs. Although the Lambda console does make it seem as though you can configure an error destination on the function, that configuration is really part of the event source mapping.

BisectBatchOnFunctionError allows you to isolate a single record that is causing a problem. Every error will result in splitting the batch in order to isolate the error. These retries do not count toward the MaximumRetryAttempts.
DestinationConfig allows you to send failures to either an SQS queue or an SNS topic.
MaximumRetryAttempts controls the number of times a message can fail before being discarded (or sent to a failure destination if configured). The default value is -1, which will retry until the message expires.
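Put together, the three settings above might look something like this. The field names follow the CreateEventSourceMapping API; the helper, the retry count of 3, and the ARNs in the usage example are assumptions for illustration only.

```typescript
// Local type covering the retry-related fields of an event source mapping.
interface StreamMappingSettings {
  EventSourceArn: string;
  FunctionName: string;
  StartingPosition: "TRIM_HORIZON" | "LATEST";
  BisectBatchOnFunctionError: boolean;
  MaximumRetryAttempts: number; // -1 means retry until the record expires
  DestinationConfig: { OnFailure: { Destination: string } };
}

// Hypothetical helper: a mapping that gives up after a few tries
// and sends details of the failed batch to a queue or topic.
function streamMappingSettings(
  streamArn: string,
  functionName: string,
  failureArn: string
): StreamMappingSettings {
  return {
    EventSourceArn: streamArn,
    FunctionName: functionName,
    StartingPosition: "TRIM_HORIZON",
    BisectBatchOnFunctionError: true, // split failing batches to find the bad record
    MaximumRetryAttempts: 3, // assumed value; the default is -1
    DestinationConfig: { OnFailure: { Destination: failureArn } },
  };
}

// Usage with invented ARNs.
const exampleMapping = streamMappingSettings(
  "arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2021-05-01T00:00:00.000",
  "my-function",
  "arn:aws:sqs:us-east-1:123456789012:my-stream-failures"
);
```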

Elastic Load Balancing - Application Load Balancer

Retries	DLQ at integration	DLQ at function	Error destination at function
❌ None	❌ No	❌ No	❌ No

Application Load Balancer invokes your function synchronously. There are no retries when making calls to your function, and the integration does not support a DLQ.

Kinesis Firehose

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ User controlled	❌ No	❌ No	❌ No

Kinesis Firehose calls your function synchronously. There are options within the integration to control the number of retries. There is no option for a DLQ.

Kinesis Streams

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Until Expired*	✅ Yes	❌ No	❌ No

Kinesis Streams invoke your function synchronously in batches. By default, if the function returns an error or times out, the entire batch is retried until the records expire. The integration is an event source mapping, which includes several configuration options to control retries and DLQs.

BisectBatchOnFunctionError allows you to isolate a single record that is causing a problem. Every error will result in splitting the batch in order to isolate the error. These retries do not count toward the MaximumRetryAttempts.
DestinationConfig allows you to send failures to either an SQS queue or an SNS topic.
MaximumRetryAttempts controls the number of times a message can fail before being discarded (or sent to a failure destination if configured). The default value is -1, which will retry until the message expires.
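If the bisecting behavior is hard to picture, here is a simplified model of it. This is not how Lambda implements it internally, just a sketch of the idea: when a batch fails, split it in half and try each half, until the failing record is alone. The isolateFailures name and the processBatch callback are hypothetical.

```typescript
// Simplified model of bisect-on-error: returns the records that fail on their own.
function isolateFailures<T>(
  batch: T[],
  processBatch: (records: T[]) => void
): T[] {
  if (batch.length === 0) {
    return [];
  }
  try {
    processBatch(batch);
    return []; // the whole batch succeeded
  } catch {
    if (batch.length === 1) {
      return batch; // a single record that fails is the culprit
    }
    // Split the failed batch and try each half separately.
    const middle = Math.ceil(batch.length / 2);
    return [
      ...isolateFailures(batch.slice(0, middle), processBatch),
      ...isolateFailures(batch.slice(middle), processBatch),
    ];
  }
}
```

Notice that the good records in a failing batch get processed more than once along the way, which is one more reason to keep stream handlers idempotent.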

Lex

Retries	DLQ at integration	DLQ at function	Error destination at function
❌ None	❌ No	❌ No	❌ No

Lex invokes your function synchronously. There are no retries when making calls to your function, and the integration does not support a DLQ.

Amazon MQ

Retries	DLQ at integration	DLQ at function	Error destination at function
❌ None	❌ No	❌ No	❌ No

Amazon MQ invokes your function synchronously. The integration is an event source mapping, but does not include any settings for DLQ or retries. Those can be handled within ActiveMQ itself.

S3 Batch

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Yes	❌ No	❌ No	❌ No

S3 Batch invokes your function synchronously. There is no option for a DLQ. If the Lambda function returns a TemporaryFailure response code, Amazon S3 retries the operation.
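Here is a sketch of what returning TemporaryFailure might look like. The event and response types are local, trimmed-down versions of the S3 Batch Operations invocation schema, and the work and isTransient callbacks are hypothetical: how you decide an error is worth retrying is up to you.

```typescript
type BatchResultCode = "Succeeded" | "TemporaryFailure" | "PermanentFailure";

// Trimmed-down local versions of the S3 Batch Operations event and response.
interface BatchTask { taskId: string; s3Key: string; }
interface BatchEvent {
  invocationSchemaVersion: string;
  invocationId: string;
  tasks: BatchTask[];
}
interface BatchResponse {
  invocationSchemaVersion: string;
  treatMissingKeysAs: BatchResultCode;
  invocationId: string;
  results: { taskId: string; resultCode: BatchResultCode; resultString: string }[];
}

// Run the (hypothetical) work for each task and report a result code per task.
function buildBatchResponse(
  batchEvent: BatchEvent,
  work: (task: BatchTask) => void,
  isTransient: (error: unknown) => boolean
): BatchResponse {
  const results = batchEvent.tasks.map((task) => {
    try {
      work(task);
      return { taskId: task.taskId, resultCode: "Succeeded" as BatchResultCode, resultString: "ok" };
    } catch (error) {
      // TemporaryFailure asks S3 to retry the task; PermanentFailure does not.
      const resultCode: BatchResultCode = isTransient(error)
        ? "TemporaryFailure"
        : "PermanentFailure";
      const message = error instanceof Error ? error.message : String(error);
      return { taskId: task.taskId, resultCode, resultString: message };
    }
  });
  return {
    invocationSchemaVersion: batchEvent.invocationSchemaVersion,
    treatMissingKeysAs: "PermanentFailure",
    invocationId: batchEvent.invocationId,
    results,
  };
}
```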

SQS

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Configured in SQS	❌ No	❌ No	❌ No

SQS invokes your function synchronously. The integration is an event source mapping; however, no DLQ or retry options are available on the integration. Instead, you configure a redrive policy on the queue itself. This policy controls how many times a message can be received before being sent to a DLQ.

SQS also handles throttling in a rather unique way. Because the polling of the queue is handled by Lambda, it will often request more messages than your function can process. When this happens, your message will be tried again until the time remaining on the message timeout is less than the function timeout. At that point the message will be allowed to time out, allowing the message to be retried or sent to the DLQ based on your redrive policy. These retries do not count toward the message delivery count.
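The redrive policy is just a queue attribute whose value is a JSON string. A minimal sketch, assuming you set it through the queue's attributes; the helper name and the ARN in the usage example are made up.

```typescript
// Build the RedrivePolicy attribute for a source queue.
// dlqArn is the ARN of the dead-letter queue; maxReceiveCount is how many
// times a message can be received before SQS moves it there.
function redriveAttributes(
  dlqArn: string,
  maxReceiveCount: number
): { RedrivePolicy: string } {
  return {
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount,
    }),
  };
}

// Usage: move a message to the (invented) DLQ after five receives.
const exampleRedrive = redriveAttributes(
  "arn:aws:sqs:us-east-1:123456789012:my-queue-dlq",
  5
);
```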

Step Functions (synchronous invoke)

Retries	DLQ at integration	DLQ at function	Error destination at function
✅ Controlled in retry configuration	✅ Controlled as a state	❌ No	❌ No

Step Functions can invoke a Lambda function synchronously. When doing so you can use the Retry setting on the task to control the retries. Any error logic can be handled in the state machine using the Catch setting and sending the message to an SQS queue or SNS topic within the state machine itself.
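As a sketch, a state machine definition with both settings might look like the following. The Retry and Catch fields and the service integration resource names come from the Amazon States Language; the state names, retry numbers, and the helper itself are assumptions chosen for illustration.

```typescript
// Build a small Amazon States Language definition: invoke a function with
// retries, and on final failure send the input plus error details to a queue.
function definitionWithDlq(functionArn: string, queueUrl: string) {
  return {
    StartAt: "Process",
    States: {
      Process: {
        Type: "Task",
        Resource: "arn:aws:states:::lambda:invoke",
        Parameters: { FunctionName: functionArn, "Payload.$": "$" },
        // Assumed values: up to 3 retries with exponential backoff.
        Retry: [
          {
            ErrorEquals: ["States.TaskFailed"],
            IntervalSeconds: 2,
            MaxAttempts: 3,
            BackoffRate: 2,
          },
        ],
        // After retries are exhausted, keep the error and go to the DLQ state.
        Catch: [
          { ErrorEquals: ["States.ALL"], ResultPath: "$.error", Next: "SendToDlq" },
        ],
        End: true,
      },
      SendToDlq: {
        Type: "Task",
        Resource: "arn:aws:states:::sqs:sendMessage",
        Parameters: { QueueUrl: queueUrl, "MessageBody.$": "$" },
        End: true,
      },
    },
  };
}
```

Because the DLQ here is just another state, the execution ends successfully after the message is sent. If you want the execution itself to show as failed, you could follow the send with a Fail state.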