<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pubudu Jayawardana</title>
    <description>The latest articles on Forem by Pubudu Jayawardana (@pubudusj).</description>
    <link>https://forem.com/pubudusj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F353213%2F355626e5-6081-4cf4-8e50-d8783b1da6ff.jpeg</url>
      <title>Forem: Pubudu Jayawardana</title>
      <link>https://forem.com/pubudusj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pubudusj"/>
    <language>en</language>
    <item>
      <title>Understanding Lambda Tenant Isolation</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Sun, 22 Feb 2026 23:33:54 +0000</pubDate>
      <link>https://forem.com/aws-builders/understanding-lambda-tenant-isolation-4kdc</link>
      <guid>https://forem.com/aws-builders/understanding-lambda-tenant-isolation-4kdc</guid>
      <description>&lt;p&gt;Lambda tenant isolation is one of the important security features that came out of the 2025 re:Invent season.&lt;/p&gt;

&lt;p&gt;Achieving tenant isolation in SaaS applications is not straightforward, and taking the single-tenant route to solve it introduces its own scaling challenges. This new feature is not a silver bullet, but it does offer much better support for keeping tenants isolated at scale.&lt;/p&gt;

&lt;p&gt;In this blog post, I discuss what this feature is and the problems it addresses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda execution environment
&lt;/h2&gt;

&lt;p&gt;When an invoke request first reaches the AWS Lambda service, Lambda starts a virtualized environment on an EC2 worker host. We call this an execution environment. The execution environment downloads the code and required dependencies, processes the request, and returns the response if required.&lt;/p&gt;

&lt;p&gt;One of the key attributes of this execution environment is that it is not removed or deleted immediately after processing a single request. It is kept in a 'warm' state to serve another incoming request. When the next request comes in, the Lambda service uses the execution environment that is already available to process it, without creating a new one.&lt;/p&gt;

&lt;p&gt;Likewise, when you invoke a Lambda function, the Lambda service uses any execution environments already available for that function to process the requests; otherwise, it creates new ones.&lt;/p&gt;

&lt;p&gt;However, this behaviour can be a concern for a multi-tenant Lambda function, because execution environments carry 'leftover' state between invocations, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global variables&lt;/li&gt;
&lt;li&gt;Objects initialized outside of the handler&lt;/li&gt;
&lt;li&gt;Files saved in the /tmp directory&lt;/li&gt;
&lt;/ul&gt;
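&lt;p&gt;To make this leftover state concrete, here is a minimal, hypothetical handler (not from any particular project) in which both a global counter and a file in &lt;code&gt;/tmp&lt;/code&gt; survive between requests that land on the same warm execution environment:&lt;/p&gt;

```python
# Hypothetical handler illustrating state that survives warm invocations.
# Both the global counter and the file in /tmp persist between requests
# that land on the same execution environment.

import os

counter = 0  # initialized once per execution environment, not per request

def handler(event, context):
    global counter
    counter += 1  # the value left by the previous invocation is visible here

    cache_path = "/tmp/cache.txt"
    if os.path.exists(cache_path):
        # a file saved by a previous (possibly different) tenant may still be here
        with open(cache_path) as f:
            cached = f.read()
    else:
        cached = None
        with open(cache_path, "w") as f:
            f.write("data for tenant %s" % event.get("tenant_id", "unknown"))

    return {"invocation_count": counter, "leftover_cache": cached}
```

&lt;p&gt;Two back-to-back invocations on the same environment would see &lt;code&gt;invocation_count&lt;/code&gt; climb and the second caller would read the first caller's cached file.&lt;/p&gt;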

&lt;h2&gt;
  
  
  Multi-tenant Lambda function
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q16t958g395l2m1228c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q16t958g395l2m1228c.png" alt="Image: Multi-tenant Lambda function" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this multi-tenant setup, a Lambda function is shared by more than one tenant. Based on how execution environments behave, irrespective of which tenant invokes the Lambda function, the Lambda service uses whichever execution environments are already available for that function. But, as mentioned earlier, execution environments carrying data across executions can be a security issue.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sharing data across invocations within an execution environment can be a great optimization if that data is accessible only by the intended tenant. However, when multiple tenants use the same execution environments, a tenant can gain access to data it was never intended to see.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example, within a single execution environment, tenant 1 might fetch some secrets from Secrets Manager or save some files in the /tmp directory during its execution. If the same execution environment is then used for an execution of tenant 2, tenant 2 has access to the secrets fetched for tenant 1 and the contents tenant 1 saved in the /tmp directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution
&lt;/h3&gt;

&lt;p&gt;One solution to this problem is to reset the execution environment just before processing each request: unsetting any global variables, wiping the /tmp directory, and so on. However, this approach is not practical at scale.&lt;/p&gt;
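&lt;p&gt;A naive version of that reset could look like the sketch below. The scratch-directory name and the list of globals are my own illustrative choices; the point is that enumerating every kind of leftover state reliably is exactly what becomes impractical at scale:&lt;/p&gt;

```python
# Naive per-request "reset" sketch (illustrative only): restore known globals
# to their defaults and wipe the function's /tmp scratch area before each
# request. Catching every form of leftover state this way is error-prone.

import os
import shutil

SCRATCH_DIR = "/tmp/app-scratch"  # hypothetical scratch area used by the handler
counter = 0

def reset_environment():
    global counter
    counter = 0  # forget state left behind by the previous tenant
    shutil.rmtree(SCRATCH_DIR, ignore_errors=True)
    os.makedirs(SCRATCH_DIR, exist_ok=True)

def handler(event, context):
    global counter
    reset_environment()  # must run before any tenant-specific work
    counter += 1
    return {"invocation_count": counter}
```

&lt;p&gt;With the reset in place, every invocation starts from a clean slate, but only for the state the reset code happens to know about.&lt;/p&gt;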

&lt;p&gt;Another option is tenant-specific Lambda functions: the single-tenant approach, where each tenant gets its own dedicated Lambda function. This solves the problem of unintended access to temporary data, because execution environments belonging to different Lambda functions are never shared.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yov989thxkq9z7rsbo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yov989thxkq9z7rsbo4.png" alt="Image: Single-tenant Lambda functions" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, having a Lambda function per tenant is not scalable. With a lot of tenants, you end up with a lot of Lambda functions to manage. While this is possible with an IaC tool like CDK or Terraform, whenever you need to update the source code to introduce new functionality or a bug fix, you have to update all of these tenant-specific resources, which is not easy. And, most of the time, those resources are not fully utilized either.&lt;/p&gt;

&lt;p&gt;What if we could have a single Lambda function shared by all of the tenants (the multi-tenant approach), yet with the isolation we get from a Lambda function per tenant (the single-tenant approach)?&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda tenant isolation
&lt;/h2&gt;

&lt;p&gt;With Lambda tenant isolation, you can have exactly that: a single Lambda function shared by all tenants, with the isolation of a Lambda function per tenant. With this new feature, the Lambda service does the heavy lifting by creating execution environments that are dedicated to a specific tenant. Execution environments are not shared across tenants, so tenants cannot access data that is not intended for them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdj7msvdewauh18dah69.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdj7msvdewauh18dah69.png" alt="Image: Tenant isolated Lambda function" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how does Lambda determine which tenant an incoming request comes from? For that, we need to provide a &lt;strong&gt;tenant-id&lt;/strong&gt; in the request to Lambda.&lt;br&gt;
If you use the Lambda invoke CLI command, you can use the &lt;code&gt;--tenant-id&lt;/code&gt; parameter as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda invoke &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--function-name&lt;/span&gt; tenant-aware-lambda &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--payload&lt;/span&gt; &lt;span class="s1"&gt;'{ "name": "Bob" }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tenant-id&lt;/span&gt; t1 &lt;span class="se"&gt;\&lt;/span&gt;
    response.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use the Lambda API directly, you need to provide the value in the &lt;code&gt;X-Amz-Tenant-Id&lt;/code&gt; header as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="nf"&gt;POST&lt;/span&gt; &lt;span class="nn"&gt;/2015-03-31/functions/tenant-aware-lambda/invocations&lt;/span&gt; &lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;
&lt;span class="na"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;lambda.eu-central-1.amazonaws.com&lt;/span&gt;
&lt;span class="na"&gt;Content-Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application/json&lt;/span&gt;
&lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS4-HMAC-SHA256 Credential=...&lt;/span&gt;
&lt;span class="na"&gt;X-Amz-Tenant-Id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;t1&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bob"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;tenant id&lt;/strong&gt; is case sensitive and can be any alphanumeric string with a maximum length of 256 characters. A few special characters are allowed too: hyphens (-), underscores (_), colons (:), equals (=), plus (+), at (@) and periods (.).&lt;/p&gt;
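&lt;p&gt;If you want to catch malformed tenant ids before they reach Lambda, the rules above translate into a simple check. This &lt;code&gt;is_valid_tenant_id&lt;/code&gt; helper is my own sketch, not part of any AWS SDK:&lt;/p&gt;

```python
# Client-side sanity check for tenant ids, based on the rules described above:
# alphanumeric plus - _ : = + @ . with a maximum length of 256 characters.

import re

TENANT_ID_PATTERN = re.compile(r"^[A-Za-z0-9\-_:=+@.]{1,256}$")

def is_valid_tenant_id(tenant_id):
    return bool(TENANT_ID_PATTERN.match(tenant_id))
```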

&lt;p&gt;One of the key attributes of the tenant id is that we don't need to pre-register tenant ids. We can pass any dynamic value as the tenant id, and the Lambda service takes care of creating and maintaining a pool of execution environments for each value passed. Also, we can use any number of unique tenant ids; there is no limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to enable this feature in Lambda
&lt;/h3&gt;

&lt;p&gt;If you use the AWS Console, you can enable this in the additional security section of the Lambda creation wizard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hpjwx3r9uo70optiu0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hpjwx3r9uo70optiu0a.png" alt="Image: Creating Lambda function in AWS console" width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Options are also available for the CLI, CloudFormation and CDK.&lt;/p&gt;

&lt;p&gt;CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws lambda create-function &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--function-name&lt;/span&gt; tenant-aware-lambda &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--runtime&lt;/span&gt; python3.14 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--zip-file&lt;/span&gt; fileb://tenant-aware-lambda.zip &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--handler&lt;/span&gt; index.handler &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt; arn:aws:iam:123456789012:role/execution-role &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--tenancy-config&lt;/span&gt; &lt;span class="s1"&gt;'{"TenantIsolationMode": "PER_TENANT"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudFormation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;MyLambdaFunction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Lambda::Function&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;FunctionName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tenant-aware-lambda&lt;/span&gt;
      &lt;span class="na"&gt;Runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python3.14&lt;/span&gt;
      &lt;span class="na"&gt;Role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;LambdaExecutionRole.Arn&lt;/span&gt;
      &lt;span class="na"&gt;Handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;index.handler&lt;/span&gt;
      &lt;span class="na"&gt;TenancyConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;TenantIsolationMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PER_TENANT&lt;/span&gt;
      &lt;span class="na"&gt;Code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;ZipFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;.....&lt;/span&gt;
          &lt;span class="s"&gt;.....&lt;/span&gt;
      &lt;span class="na"&gt;Timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="na"&gt;MemorySize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tenant_aware_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TenantAwareFunction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;function_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tenant-aware-lambda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PYTHON_3_13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index.handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/lambda/tenant_aware&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tenancy_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TenancyConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PER_TENANT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Please note: This feature can be enabled ONLY when the Lambda function is created. You cannot enable this for an existing Lambda function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Try this yourself
&lt;/h3&gt;

&lt;p&gt;I have created a sample application for you to see how this feature works. You can deploy it to your AWS environment using CDK with Python.&lt;/p&gt;

&lt;p&gt;Clone the repository at &lt;a href="https://github.com/pubudusj/lambda-tenant-isolation-demo" rel="noopener noreferrer"&gt;github.com/pubudusj/lambda-tenant-isolation-demo&lt;/a&gt; and follow the steps below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a virtual environment and activate it:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Install the dependencies:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deploy the stack:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create two Lambda functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A generic Lambda function (without tenant isolation)&lt;/li&gt;
&lt;li&gt;A Lambda function with tenant isolation enabled&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both Lambda functions have a global variable &lt;code&gt;counter&lt;/code&gt; which increments on each invocation. This is to simulate the shared state across executions. An API Gateway is also created with two endpoints to trigger these Lambda functions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/execute_generic_lambda?tenant_id=&amp;lt;tenant_id&amp;gt;&lt;/code&gt; - invokes the generic Lambda function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/execute_tenant_aware_lambda?tenant_id=&amp;lt;tenant_id&amp;gt;&lt;/code&gt; - invokes the tenant-aware Lambda function&lt;/li&gt;
&lt;/ul&gt;
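&lt;p&gt;The demo's handlers can be sketched roughly as follows (a simplified, hypothetical version; the repository is the reference). The generic function reads the tenant id from the query string and bumps a module-level counter:&lt;/p&gt;

```python
# Simplified sketch of the generic demo handler (not the exact repository
# code). The module-level counter increments on every invocation handled by
# this execution environment, regardless of which tenant made the request.

import json

counter = 0

def generic_handler(event, context):
    global counter
    counter += 1
    tenant_id = (event.get("queryStringParameters") or {}).get("tenant_id")
    return {
        "statusCode": 200,
        "body": json.dumps({"tenant_id": tenant_id, "invocation_count": counter}),
    }
```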

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Generic Lambda function
&lt;/h4&gt;

&lt;p&gt;First, let's test the generic Lambda function. Call the &lt;code&gt;/execute_generic_lambda&lt;/code&gt; endpoint with a tenant id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"&amp;lt;APIGW_BASE_URL&amp;gt;/execute_generic_lambda?tenant_id=tenant1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a response like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tenant_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tenant1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"invocation_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now call the same endpoint again but with a different tenant id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"&amp;lt;APIGW_BASE_URL&amp;gt;/execute_generic_lambda?tenant_id=tenant2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the invocation count keeps increasing regardless of the tenant id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tenant_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tenant2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"invocation_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is because the generic Lambda function shares execution environments across all tenants. The global variable &lt;code&gt;counter&lt;/code&gt; retains its value across invocations regardless of which tenant made the request. This is the exact problem we discussed earlier - any global state, cached data or files in &lt;code&gt;/tmp&lt;/code&gt; are accessible across tenants.&lt;/p&gt;

&lt;h4&gt;
  
  
  Tenant-aware Lambda function
&lt;/h4&gt;

&lt;p&gt;Now let's test the tenant-aware Lambda function. Call the &lt;code&gt;/execute_tenant_aware_lambda&lt;/code&gt; endpoint with a tenant id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"&amp;lt;APIGW_BASE_URL&amp;gt;/execute_tenant_aware_lambda?tenant_id=tenant1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see a response like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tenant"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tenant1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"invocation_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now call the same endpoint again with a different tenant id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="s2"&gt;"&amp;lt;APIGW_BASE_URL&amp;gt;/execute_tenant_aware_lambda?tenant_id=tenant2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, the invocation count resets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"tenant"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tenant2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"invocation_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is tenant isolation in action. Even though both requests are handled by the same Lambda function, Lambda service creates separate execution environments for each tenant. The global variable &lt;code&gt;counter&lt;/code&gt;, or any other shared state in the execution environment, is isolated per tenant. Tenant 2 will never see the state left behind by Tenant 1.&lt;/p&gt;

&lt;p&gt;Also note that in the tenant-aware Lambda function, we can access the tenant id from the Lambda context object (e.g. &lt;code&gt;context.tenant_id&lt;/code&gt; in Python) instead of extracting it from the query parameters. The Lambda service automatically makes the tenant id available in the context when tenant isolation is enabled. This is useful if you need to perform operations based on the tenant id - for example, fetching tenant-specific data from other services.&lt;/p&gt;
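&lt;p&gt;A tenant-aware handler could read it like this (a sketch: the &lt;code&gt;tenant_id&lt;/code&gt; attribute name follows the example above, and the query-string fallback is my own addition for local testing):&lt;/p&gt;

```python
# Sketch of a tenant-aware handler reading the tenant id from the context
# object. getattr is used so the sketch also runs when the attribute is
# absent, e.g. in local tests or when tenant isolation is not enabled.

def handler(event, context):
    tenant_id = getattr(context, "tenant_id", None)
    if tenant_id is None:
        # fallback: read it from the query string instead
        tenant_id = (event.get("queryStringParameters") or {}).get("tenant_id")
    return {"tenant": tenant_id}
```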

&lt;p&gt;In this example, I have used the tenant id from the context object to publish a custom CloudWatch metric per tenant. This is helpful for monitoring per-tenant invocation patterns, which can be useful for billing or auditing purposes.&lt;/p&gt;
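&lt;p&gt;Such a per-tenant metric can be published with boto3's &lt;code&gt;put_metric_data&lt;/code&gt;. The namespace &lt;code&gt;MyApp/Tenants&lt;/code&gt; and metric name &lt;code&gt;TenantInvocations&lt;/code&gt; below are my own choices, not anything the feature mandates, and the function needs &lt;code&gt;cloudwatch:PutMetricData&lt;/code&gt; permission:&lt;/p&gt;

```python
# Publish a per-tenant invocation metric to CloudWatch. The boto3 import is
# guarded so the payload-building part of this sketch runs even where boto3
# is not installed.

try:
    import boto3
except ImportError:
    boto3 = None

def build_tenant_metric(tenant_id, value=1):
    # One datapoint, dimensioned by tenant id
    return {
        "MetricName": "TenantInvocations",
        "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
        "Value": value,
        "Unit": "Count",
    }

def publish_tenant_metric(tenant_id):
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="MyApp/Tenants",
        MetricData=[build_tenant_metric(tenant_id)],
    )
```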

&lt;h2&gt;
  
  
  Effect on Lambda concurrency and cold starts
&lt;/h2&gt;

&lt;p&gt;One important thing to understand is how tenant isolation affects Lambda concurrency. Since execution environments are not shared across tenants, each tenant will need their own set of execution environments. This means that the overall number of concurrent execution environments can be higher compared to a non-isolated Lambda function where environments are freely shared.&lt;/p&gt;

&lt;p&gt;For example, if you have 10 tenants each making concurrent requests, instead of reusing a pool of warm execution environments, Lambda needs to maintain separate pools per tenant. This can lead to &lt;strong&gt;more cold starts&lt;/strong&gt;, especially for low-traffic tenants.&lt;/p&gt;

&lt;p&gt;However, this is a trade-off worth making when tenant isolation is critical for your application.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Please note:&lt;/strong&gt; Make sure to consider the Lambda concurrency limits in your account when enabling tenant isolation, especially when dealing with a large number of tenants.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Integration with API Gateway
&lt;/h2&gt;

&lt;p&gt;As of now, the only integration that supports the Lambda tenant isolation feature is API Gateway.&lt;/p&gt;

&lt;p&gt;In the example project, I have mapped the incoming query string &lt;code&gt;tenant_id&lt;/code&gt; to the integration request header &lt;code&gt;X-Amz-Tenant-Id&lt;/code&gt; for the Lambda service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tenant_aware_integration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;apigw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LambdaIntegration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tenant_aware_lambda&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;request_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integration.request.header.X-Amz-Tenant-Id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;method.request.querystring.tenant_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda integration on API Gateway gives us a lot of flexibility in choosing which value to map as the &lt;code&gt;X-Amz-Tenant-Id&lt;/code&gt;. It could be the source AWS account id, a value from the request body, or a header value. If authentication and authorization are enabled, it could even be a Cognito user group or a claim from a JWT. This makes API Gateway a convenient place to resolve and pass the tenant id to Lambda.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd like to see next
&lt;/h2&gt;

&lt;p&gt;While Lambda tenant isolation is a solid step forward, there are a few areas where I think it can be even more valuable.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;More integration support&lt;/strong&gt;: Currently, API Gateway is the only Lambda integration that supports this feature. It would be great to see it extended to other integrations, such as SQS to Lambda via event source mapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native per-tenant metrics&lt;/strong&gt;: Built-in CloudWatch metrics by tenant id would remove the need to publish custom metrics manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-tenant concurrency controls&lt;/strong&gt;: The ability to set concurrency limits per tenant would help prevent a noisy tenant from consuming all the available concurrency.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Tenant isolation is a fundamental requirement in many SaaS applications. While the single-tenant approach provides strong isolation, it comes with significant operational overhead. The multi-tenant approach is operationally simpler but introduces security risks with shared execution environments.&lt;/p&gt;

&lt;p&gt;Lambda tenant isolation gives us the best of both worlds. We get a single Lambda function that is easy to manage and deploy, while Lambda service ensures that the execution environments are isolated per tenant. This eliminates the risk of data leakage across tenants without the burden of managing separate Lambda functions for each tenant.&lt;/p&gt;

&lt;p&gt;This is a great addition to the serverless toolbox. If you are building multi-tenant SaaS applications on AWS Lambda, this feature is worth exploring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;AWS Lambda tenant isolation documentation: &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/tenant-isolation.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/lambda/latest/dg/tenant-isolation.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Launch blog post: &lt;a href="https://aws.amazon.com/blogs/aws/streamlined-multi-tenant-application-development-with-tenant-isolation-mode-in-aws-lambda" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/streamlined-multi-tenant-application-development-with-tenant-isolation-mode-in-aws-lambda&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;👋 I regularly create content on &lt;strong&gt;AWS&lt;/strong&gt; and &lt;strong&gt;Serverless&lt;/strong&gt;, and if you're interested, feel free to follow/connect with me so you don't miss out on my latest posts!&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/pubudusj" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pubudusj&lt;/a&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/pubudusj" rel="noopener noreferrer"&gt;https://x.com/pubudusj&lt;/a&gt;&lt;br&gt;
Medium: &lt;a href="https://medium.com/@pubudusj" rel="noopener noreferrer"&gt;https://medium.com/@pubudusj&lt;/a&gt;&lt;br&gt;
Personal blog: &lt;a href="https://pubudu.dev" rel="noopener noreferrer"&gt;https://pubudu.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>serverless</category>
      <category>saas</category>
    </item>
    <item>
      <title>Simple Leave Management with AWS Lambda Durable Functions</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Wed, 14 Jan 2026 00:17:16 +0000</pubDate>
      <link>https://forem.com/aws-builders/simple-leave-management-with-aws-lambda-durable-functions-if2</link>
      <guid>https://forem.com/aws-builders/simple-leave-management-with-aws-lambda-durable-functions-if2</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;At AWS re:Invent 2025, Lambda introduced Durable Functions with a great set of features. One of the main features is that a single execution can span up to one year. Also, with built-in checkpointing, it is possible to track the steps that have already been completed: when an execution is retried after a resume or an interruption, completed steps are skipped and the execution resumes from the next step.&lt;/p&gt;

&lt;p&gt;When it comes to orchestrating multiple AWS services into a workflow, up until now AWS Step Functions was the best choice. However, Durable Functions now offer similar functionality for building a workflow within the same familiar Lambda environment, which is great!&lt;/p&gt;

&lt;p&gt;In this blog post, I explain how the Durable Functions callback feature can be used for human-in-the-loop functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Durable Function callbacks
&lt;/h2&gt;

&lt;p&gt;When an execution starts, the durable function starts an invocation. When there are multiple steps within the execution, it is possible to put the execution on hold (or wait) at a certain step, either for a given time period or until a signal is received from an external process. In the case of an external signal, once the Lambda service receives it, the execution that was on hold is either resumed or terminated, depending on whether the signal indicates success or failure.&lt;/p&gt;

&lt;p&gt;This is a great feature supported natively by Durable Functions. It is helpful, for example, in a business process where human approval is required to continue the flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple leave management with Durable Functions
&lt;/h2&gt;

&lt;p&gt;In this example, an employee can send a leave request and the manager can approve or reject it. Below are the steps involved and how Durable Functions are used.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Employee sends a leave request. The durable function starts an execution.&lt;/li&gt;
&lt;li&gt;A leave record is created in a DynamoDB table with a pending state.&lt;/li&gt;
&lt;li&gt;The employee receives an email confirming receipt of the request.&lt;/li&gt;
&lt;li&gt;The manager receives an email with a callback id to be used to approve or reject the leave.&lt;/li&gt;
&lt;li&gt;The durable function keeps the execution on hold until the manager's approval or rejection is received.&lt;/li&gt;
&lt;li&gt;Once the manager approves or rejects, the execution resumes.&lt;/li&gt;
&lt;li&gt;If the manager doesn’t process the request within a given time period, the request expires.&lt;/li&gt;
&lt;li&gt;The leave status is updated in the DynamoDB table.&lt;/li&gt;
&lt;li&gt;The employee receives an email with the manager’s decision or the expiry.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnwlydjs7sytrj8n6f13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftnwlydjs7sytrj8n6f13.png" alt="Image: Architecture" width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There is a proxy function that accepts and validates the incoming leave-creation request. Here, I used a Lambda Function URL to submit the request.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once validated, the proxy function will invoke the durable function.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why use the proxy function?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with Durable Functions, synchronous invocations are limited to 15 minutes. Since I need the durable function to run longer, it has to be triggered asynchronously. Via the proxy function, the durable function is invoked with invocation type ‘Event’, which is fire-and-forget and does not wait for the durable function to complete.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the durable function starts its execution, first a leave record is created in the DynamoDB table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, an email is sent to the employee confirming the receipt of the leave request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the next step, an email is sent to the manager asking to approve or reject the leave request. This step is a callback step. The execution waits at this step until the manager approves or rejects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since this is a callback step, a callback id is generated. This callback id is included in the email to the manager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To simulate the manager’s approval or rejection, there is a Function URL exposed to trigger the &lt;em&gt;Process Leave Lambda&lt;/em&gt; function.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This function accepts &lt;em&gt;callback_id&lt;/em&gt; and the &lt;em&gt;decision&lt;/em&gt; (approve or reject) as input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Based on the decision, this function will call the Lambda boto3 methods &lt;em&gt;send_durable_execution_callback_success&lt;/em&gt; or &lt;em&gt;send_durable_execution_callback_failure&lt;/em&gt; with the given callback id.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Please note:&lt;/strong&gt; These latest SDK methods are not yet available in the boto3 version provided by Lambda by default. So, you are required to package boto3 version 1.42.1 or newer with your Lambda source code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the Lambda service receives the decision, the durable execution that was on hold will resume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the manager’s decision is not received in time (I set this to 5 minutes at the moment), the wait_for_callback step will throw a &lt;em&gt;CallableRuntimeError&lt;/em&gt; exception with the error message &lt;em&gt;Callback timed out&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on the decision or the expiry of the callback, the leave record in the DB is updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then an email will be sent to the employee with the leave's final status and the execution is completed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
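&lt;p&gt;To make the proxy and callback wiring above concrete, here is a minimal sketch. The &lt;code&gt;invoke&lt;/code&gt; parameters are standard boto3; the keyword argument names for the two callback methods are assumptions, so verify them against the boto3 (1.42.1+) documentation.&lt;/p&gt;

```python
import json

def async_invoke_params(function_name: str, leave_request: dict) -> dict:
    """Kwargs for the proxy's fire-and-forget invocation of the durable function."""
    return {
        "FunctionName": function_name,
        "InvocationType": "Event",  # fire-and-forget: bypasses the 15-minute sync limit
        "Payload": json.dumps(leave_request),
    }

def callback_call(callback_id: str, decision: str) -> tuple:
    """Pick the Lambda API call that signals the waiting execution.

    Method names are from the post; the kwarg names below are assumptions.
    """
    if decision == "approve":
        return ("send_durable_execution_callback_success",
                {"CallbackId": callback_id, "Result": json.dumps({"status": "approved"})})
    return ("send_durable_execution_callback_failure",
            {"CallbackId": callback_id, "Error": "LeaveRejected"})

# Usage (requires boto3 >= 1.42.1 packaged with the function):
#   client = boto3.client("lambda")
#   client.invoke(**async_invoke_params("leave-durable-fn", {"start_date": "2026-01-10"}))
#   method, kwargs = callback_call("abc-123", "approve")
#   getattr(client, method)(**kwargs)
```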

&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;Here is a GitHub repository of a sample project I created with AWS CDK and Python so you can try out this scenario in your AWS account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pubudusj/simple-leave-management-with-durable-functions" rel="noopener noreferrer"&gt;https://github.com/pubudusj/simple-leave-management-with-durable-functions&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy .env.example to .env and add the values. The value for &lt;em&gt;SYSTEM_FROM_EMAIL&lt;/em&gt; must already be verified in Simple Email Service (SES) so it can send emails to the manager and employee email addresses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy the stack with CDK.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once deployed, there will be two Lambda function URLs in the output: one for creating the leave request and the other for processing the leave request (as the manager).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Send a POST request to the create leaves endpoint with a payload similar to below:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"start_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-01-10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"end_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-20"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"employee_email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"employee@email.com"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This will send an email to the employee email address as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2syxb7rcmlvsereo3qbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2syxb7rcmlvsereo3qbk.png" alt="Image: Leave submitted notification for employee" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also, an email to the manager with the callback id as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlz554e4smlp10qw26qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvlz554e4smlp10qw26qt.png" alt="Image: Leave approval request for manager" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The manager can send a POST request to the leave process endpoint with the callback id from the email and their decision (approve or reject) as follows:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"callback_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"[callback id from email]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"decision"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"approve"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on the decision, the employee will receive an email with the status.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjyq7cb30shx23x62244.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjyq7cb30shx23x62244.png" alt="Image: Leave decision notification for employee" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the manager didn't process the decision within the given time (for demo purposes, set to 5 minutes), the leave will be marked as expired and the employee will receive an email as follows:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66y9u4x1xi4zdpvzwdw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66y9u4x1xi4zdpvzwdw2.png" alt="Image: Leave expired notification for employee" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Please try this and let me know your thoughts!&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;AWS Lambda Durable Functions documentation: &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/lambda/latest/dg/durable-functions.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Durable execution SDK: &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/lambda/latest/dg/durable-execution-sdk.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>durablefunctions</category>
    </item>
    <item>
      <title>Monitoring multiple dynamic resources using a single Amazon CloudWatch alarm</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Tue, 21 Oct 2025 09:18:26 +0000</pubDate>
      <link>https://forem.com/aws-builders/monitoring-multiple-dynamic-resources-using-a-single-amazon-cloudwatch-alarm-3lbi</link>
      <guid>https://forem.com/aws-builders/monitoring-multiple-dynamic-resources-using-a-single-amazon-cloudwatch-alarm-3lbi</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;When you need to monitor your resources with a CloudWatch alarm, what you normally have to do is create an alarm with a specific metric of that resource. Although this gives granular monitoring of your resources, you always have to add or remove alarms whenever resources are created or removed. This is an operational overhead, even though it can be automated with an infrastructure-as-code tool.&lt;/p&gt;

&lt;p&gt;Another option is to use aggregated metrics in your alarms, such as CPUUtilization for EC2, which gives you high-level coverage across a group of resources. The downside is that it lacks granular visibility into your individual resources. Also, only a limited number of resources support aggregated metrics.&lt;/p&gt;

&lt;p&gt;In September 2025, Amazon CloudWatch introduced a nice feature that allows monitoring multiple individual metrics via a single alarm using CloudWatch Metrics Insights. By using a Metrics Insights SQL query in the alarm, the alarm automatically updates its query results with each evaluation and adjusts in real time as resources are created and deleted.&lt;/p&gt;

&lt;p&gt;With the introduction of multi-metric alarms, you now get both granular per-resource monitoring and lower maintenance: you don’t need to update alarms when resources are created or removed from your application, because the alarm itself automatically discovers and monitors the resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;You can define a generic SQL statement as the alarm metric.&lt;/p&gt;

&lt;p&gt;For example, let’s assume you would like to monitor all your SQS queues for available messages and be notified if any messages are available.&lt;/p&gt;

&lt;p&gt;For this requirement, you can define a query as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ApproximateNumberOfMessagesVisible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"AWS/SQS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;QueueName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;QueueName&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this query, the alarm monitors the ApproximateNumberOfMessagesVisible count of all your queues and triggers if a specific queue has at least one visible message. Also, if you add a new queue after the alarm is created, it will be taken into account when evaluating the query.&lt;/p&gt;
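&lt;p&gt;As a sketch of creating such an alarm programmatically, the parameters below embed the Metrics Insights query as a metric math expression in a PutMetricAlarm call. The alarm name and threshold are illustrative; check the alarm creation documentation for any extra settings the per-contributor behaviour requires.&lt;/p&gt;

```python
QUERY = (
    'SELECT MAX(ApproximateNumberOfMessagesVisible) '
    'FROM SCHEMA("AWS/SQS", QueueName) '
    'GROUP BY QueueName '
    'ORDER BY COUNT() DESC'
)

# Illustrative parameters for CloudWatch PutMetricAlarm
alarm_params = {
    "AlarmName": "sqs-visible-messages-any-queue",  # hypothetical name
    "Metrics": [{
        "Id": "q1",
        "Expression": QUERY,  # the Metrics Insights query, re-evaluated on every cycle
        "Period": 60,
        "ReturnData": True,
    }],
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```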

&lt;p&gt;Also, you can use tags to filter the resources you need to monitor. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ApproximateNumberOfMessagesVisible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"AWS/SQS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;QueueName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Component&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'consumer'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Environment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'production'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsDLQ&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;QueueName&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;It would have been great if this supported wildcard or regex-based filtering on resource properties, but as of now the only ways to filter are by tags or by comparing the complete property value (equal or not equal).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you create the alarm with this Metrics Insights Query, you can see a new section on the alarm called “Contributors”. Here, you can see all the resources that match the conditions in the alarm query as well as the current state of each contributor.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kkp1y03atyifts7qojt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kkp1y03atyifts7qojt.png" alt="Image: CloudWatch multi metric alarm contributors." width="800" height="427"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: CloudWatch multi metric alarm contributors.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alarm states
&lt;/h2&gt;

&lt;p&gt;When a single contributor metric breaches the threshold, its state changes to ‘In alarm’. The alarm details clearly show which contributor caused the state change, which is super helpful for identifying the exact resource that triggered the alarm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqcyajbocr6agkofbmvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqcyajbocr6agkofbmvf.png" alt="Image: CloudWatch multi metric alarm notification.&amp;lt;br&amp;gt;
" width="635" height="587"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: CloudWatch multi metric alarm notification.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The alarm action is based on contributor transitions. This means that even when the alarm is already in the ‘In alarm’ state, if another contributor breaches the threshold and its state becomes ‘In alarm’, the alarm action is triggered again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflazxnnnyelqa26s0qd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflazxnnnyelqa26s0qd1.png" alt="Image: CloudWatch multi metric example alarm history.&amp;lt;br&amp;gt;
" width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: CloudWatch multi metric example alarm history.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is how a similar situation appears in the CloudWatch console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9y6ok180ace23wbzpmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9y6ok180ace23wbzpmi.png" alt="Image: CloudWatch multi metric example alarm graph.&amp;lt;br&amp;gt;
" width="800" height="598"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: CloudWatch multi metric example alarm graph.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;I have created a GitHub repository with a SAM template that deploys some AWS resources into your AWS account so you can try this scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-requisites
&lt;/h3&gt;

&lt;p&gt;You need to enable the CloudWatch setting “Resource tags for telemetry” if you want to use tag-based filters in these queries. Go to CloudWatch &amp;gt; Settings &amp;gt; Enable resource tags on telemetry in your region to enable this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the repository: &lt;a href="https://github.com/pubudusj/cloudwatch-multi-metrics-alarm" rel="noopener noreferrer"&gt;https://github.com/pubudusj/cloudwatch-multi-metrics-alarm&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy the stack using AWS SAM CLI.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the stack is deployed, you can see the contributors in the created CloudWatch alarm.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When you add some messages to one of the SQS queues, the alarm state changes and you receive the alarm notification, which includes the contributor.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can uncomment lines 64-74 and deploy again to create a new resource.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This new resource will be evaluated, and you can see it listed in the contributors list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If you add a message to the new queue, the alarm triggers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Likewise, any resource, new or old, that matches the query is considered when evaluating the alarm.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;There are some notable limitations in this approach.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Currently a single query can return no more than 500 time series, and this is a hard limit. This means that with the above query, a maximum of 500 resources can be monitored, which might not be sufficient for an application with a lot of resources. When you have more than 500 resources to monitor, you will have to use multiple alarms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is no wildcard or regex support in the query (unlike some other CloudWatch features).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This new CloudWatch feature is great for monitoring dynamic resources in your application without the management overhead of adding and removing alarms as resources are created and deleted. There are some hard limits, so it might not suit every situation. However, I believe more improvements will be added to this feature in the near future, such as wildcard or regex-based resource filtering, which will give a better developer experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;Release news: Amazon CloudWatch query alarms now support monitoring metrics individually. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/09/amazon-cloudwatch-alarm-multiple-metrics/" rel="noopener noreferrer"&gt;https://aws.amazon.com/about-aws/whats-new/2025/09/amazon-cloudwatch-alarm-multiple-metrics/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating a Metrics Insights CloudWatch alarm. &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-alarm-create.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-alarm-create.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Metrics Insights quotas &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;👋 I regularly create content on &lt;strong&gt;AWS&lt;/strong&gt; and &lt;strong&gt;Serverless&lt;/strong&gt;, and if you're interested, feel free to follow/connect with me so you don't miss out on my latest posts!&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/pubudusj" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pubudusj&lt;/a&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/pubudusj" rel="noopener noreferrer"&gt;https://x.com/pubudusj&lt;/a&gt;&lt;br&gt;
Medium: &lt;a href="https://medium.com/@pubudusj" rel="noopener noreferrer"&gt;https://medium.com/@pubudusj&lt;/a&gt;&lt;br&gt;
Personal blog: &lt;a href="https://pubudu.dev" rel="noopener noreferrer"&gt;https://pubudu.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloudwatch</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Detect EventBridge target failure: Part 2 - using enhanced monitoring</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Tue, 30 Sep 2025 10:06:27 +0000</pubDate>
      <link>https://forem.com/aws-builders/detect-eventbridge-target-failure-part-2-using-enhanced-monitoring-46fd</link>
      <guid>https://forem.com/aws-builders/detect-eventbridge-target-failure-part-2-using-enhanced-monitoring-46fd</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;When delivering messages to different targets using EventBridge, it is important to get notified if there are any delivery failures. EventBridge doesn’t provide this out of the box, but there are several ways to achieve this.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://pubudu.dev/posts/detect-eventbridge-target-failure-part-1/" rel="noopener noreferrer"&gt;1st part of this blog&lt;/a&gt; we discussed how we can get notified when a delivery fails to a target using a dead letter queue.&lt;/p&gt;

&lt;p&gt;In this blog post we will discuss another (better?) option to achieve the same using &lt;strong&gt;EventBridge enhanced logging&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  EventBridge Enhanced Logging
&lt;/h2&gt;

&lt;p&gt;On 15th July 2025, &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/07/amazon-eventbridge-enhanced-logging-improved-observability/" rel="noopener noreferrer"&gt;AWS introduced enhanced logging for EventBridge&lt;/a&gt;. This means you can now enable logging, and EventBridge will send the logs to a configured log delivery location.&lt;/p&gt;

&lt;p&gt;Available log destinations are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3&lt;/li&gt;
&lt;li&gt;CloudWatch logs&lt;/li&gt;
&lt;li&gt;Amazon Data Firehose stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, you can configure the standard log levels as required: Trace, Info and Error.&lt;/p&gt;

&lt;p&gt;You can select more than one log destination, and for each destination you can select the same or a different log level. This is really useful: for example, I can use S3 to log all traces while using CloudWatch Logs to log only errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;In this example, I have used EventBridge enhanced logging to log any errors to a CloudWatch log stream.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxbqm3hs7k1nl935awhe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnxbqm3hs7k1nl935awhe.png" alt="Architecture" width="800" height="406"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: Architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Simply put, EventBridge logs any errors to CloudWatch Logs. Once a log record is available in the CloudWatch log stream, there are multiple ways to trigger a CloudWatch alarm. In this example, I use the number of incoming log events as the metric to trigger the alarm.&lt;/p&gt;

&lt;p&gt;If you use a CloudWatch log stream with log level trace (which includes info and errors as well) and you still want to trigger an alarm only when an error occurs, you can create a metric filter on the log group.&lt;/p&gt;
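&lt;p&gt;As a sketch, a metric filter like the one below could turn error log entries into a custom metric to alarm on. The log group name and the JSON field names in the pattern are assumptions about the enhanced log format; adjust them to the actual log entries you see.&lt;/p&gt;

```python
# Illustrative parameters for CloudWatch Logs PutMetricFilter
metric_filter_params = {
    "logGroupName": "/aws/vendedlogs/events/my-bus",  # hypothetical log group
    "filterName": "eventbridge-invocation-failures",
    # Assumed field name; match it to the actual enhanced log entries
    "filterPattern": '{ $.logLevel = "ERROR" }',
    "metricTransformations": [{
        "metricName": "EventBridgeDeliveryErrors",
        "metricNamespace": "Custom/EventBridge",  # hypothetical namespace
        "metricValue": "1",   # count one per matching log event
        "defaultValue": 0,
    }],
}
# boto3.client("logs").put_metric_filter(**metric_filter_params)
```

An alarm on the &lt;code&gt;EventBridgeDeliveryErrors&lt;/code&gt; metric then fires only on error entries, even when the destination receives trace-level logs.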
&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;I have created a GitHub repository with an AWS SAM template for you to test this scenario in your AWS account.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the GitHub repository: &lt;a href="https://github.com/pubudusj/event-bridge-target-failure-detection-with-enhanced-logging" rel="noopener noreferrer"&gt;https://github.com/pubudusj/event-bridge-target-failure-detection-with-enhanced-logging&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploy the stack using the command below:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sam deploy \
--template-file template.yaml \
--stack-name eb-fail-detection-with-enhanced-logging \
--capabilities CAPABILITY_IAM \
--no-confirm-changeset \
--parameter-overrides NotificationEmail=[YourEmailAddress]
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Here, pass your email address as &lt;code&gt;NotificationEmail&lt;/code&gt; so you will receive a notification by email when the target fails.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Once the stack is deployed, you will get an SNS subscription confirmation email. You need to confirm it in order to receive notifications.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Then, publish a message to the created event bus with the source set to &lt;code&gt;xyzcorp&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;This message matches the rule, and EventBridge attempts to deliver it to the target.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;I have intentionally removed the permission to publish to the target in order to simulate a failure.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In a moment, you should get an email with the alarm status.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;If you go to the created CloudWatch log stream, you can see the entry with log level &lt;code&gt;ERROR&lt;/code&gt; and message type &lt;code&gt;INVOCATION_FAILURE&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;
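&lt;p&gt;Step 5 above (publishing the test event) can be done from the console or programmatically. As a sketch with boto3, where the event bus name and the detail payload are placeholders:&lt;/p&gt;

```python
import json

# Sketch: build a PutEvents entry for the test event with source
# "xyzcorp". The event bus name and detail payload are placeholders;
# only the source must match the rule from this example.
def build_test_event(event_bus_name: str) -> dict:
    return {
        "Source": "xyzcorp",
        "DetailType": "test-event",
        "Detail": json.dumps({"message": "trigger target failure"}),
        "EventBusName": event_bus_name,
    }

# Sending it would look like:
#   import boto3
#   boto3.client("events").put_events(Entries=[build_test_event("my-demo-bus")])
```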

&lt;p&gt;&lt;strong&gt;Please note:&lt;/strong&gt;&lt;br&gt;
As of now, setting up enhanced logging and sending the logs to a delivery destination is not a straightforward configuration. You need to create a CloudWatch log group, a delivery source, a delivery destination and a logs delivery.&lt;/p&gt;

&lt;p&gt;Refer: &lt;a href="https://github.com/pubudusj/event-bridge-target-failure-detection-with-enhanced-logging/blob/main/template.yaml#L31-L59" rel="noopener noreferrer"&gt;https://github.com/pubudusj/event-bridge-target-failure-detection-with-enhanced-logging/blob/main/template.yaml#L31-L59&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that if you need to send these enhanced logs to multiple destinations, you have to repeat this set of resources for each destination.&lt;/p&gt;

&lt;p&gt;However, the nice thing about this approach is that you only need to configure it once on the event bus, and it will log the whole message life cycle within EventBridge, including ingestion as well as delivery to all the targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Overall, enhanced logging is a great improvement to EventBridge because, until now, message delivery was a black box for customers, especially on the consumer side. With this addition, you can transparently track and debug the flow of a message within EventBridge using the logs generated at every step the message goes through, from ingestion to delivery (depending, of course, on the configured log level).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also configure more than one log destination, as well as different log levels.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since creating a single log delivery with CloudFormation requires several AWS resources, I hope the EventBridge team will provide an easier way to configure this, ideally as properties of the EventBridge bus.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Monitor and debug event-driven applications with new Amazon EventBridge logging: &lt;a href="https://aws.amazon.com/blogs/aws/monitor-and-debug-event-driven-applications-with-new-amazon-eventbridge-logging/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/monitor-and-debug-event-driven-applications-with-new-amazon-eventbridge-logging/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;👋 I regularly create content on &lt;strong&gt;AWS&lt;/strong&gt; and &lt;strong&gt;Serverless&lt;/strong&gt;, and if you're interested, feel free to follow/connect with me so you don't miss out on my latest posts!&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/pubudusj" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pubudusj&lt;/a&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/pubudusj" rel="noopener noreferrer"&gt;https://x.com/pubudusj&lt;/a&gt;&lt;br&gt;
Medium: &lt;a href="https://medium.com/@pubudusj" rel="noopener noreferrer"&gt;https://medium.com/@pubudusj&lt;/a&gt;&lt;br&gt;
Personal blog: &lt;a href="https://pubudu.dev" rel="noopener noreferrer"&gt;https://pubudu.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>eventbridge</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>Detect EventBridge target failure: Part 1 - with dead letter queue</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Tue, 23 Sep 2025 20:20:31 +0000</pubDate>
      <link>https://forem.com/pubudusj/detect-eventbridge-target-failure-part-1-with-dead-letter-queue-4o73</link>
      <guid>https://forem.com/pubudusj/detect-eventbridge-target-failure-part-1-with-dead-letter-queue-4o73</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;When EventBridge delivers messages to a target, delivery can fail for many reasons: permission issues, rate limits, unavailability of the target, or even a glitch within AWS itself, to name a few.&lt;/p&gt;

&lt;p&gt;No matter the reason, it is always ideal to get notified that messages are failing to deliver, along with the reason for the failure. In this blog post I discuss how a dead letter queue can be used to get notified when EventBridge fails to deliver messages to a target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dead letter queues
&lt;/h2&gt;

&lt;p&gt;Dead letter queues are the unsung heroes of event-driven architecture 😀. They are easy to set up and manage, greatly improve the resilience of a system, and are very cost effective.&lt;/p&gt;

&lt;p&gt;Let’s see how we can capture the target delivery failures in EventBridge using a dead letter queue.&lt;/p&gt;

&lt;p&gt;Please note that EventBridge supports DLQs at a couple of “levels”: the event bus itself can have a DLQ, or you can set a DLQ per target. Let’s discuss the differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  DLQ on EventBridge bus level
&lt;/h2&gt;

&lt;p&gt;An EventBridge bus can have a DLQ of its own. However, this is limited to capturing errors related to KMS encryption: EventBridge sends events that aren’t successfully encrypted to this DLQ.&lt;/p&gt;

&lt;p&gt;You can see this DLQ setting for the event bus in the AWS console only when a customer managed KMS key is used to encrypt messages. In fact, it is part of the encryption settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy124o0m6p3zkjlwtkxq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy124o0m6p3zkjlwtkxq4.png" alt="Image: DLQ for Event bus only available when customer managed KMS is in use." width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: DLQ for Event bus only available when customer managed KMS is in use.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;this DLQ will NOT capture any target-related failures&lt;/strong&gt;, so we cannot use it for our purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  DLQ on EventBridge target level
&lt;/h2&gt;

&lt;p&gt;At the target level, we can set up an SQS queue where EventBridge puts any message it cannot deliver to that target.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k4oi090bk31fmj8j19a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0k4oi090bk31fmj8j19a.png" alt="DLQ on target" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: DLQ on target.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since one rule can have more than one target, each target can have a different DLQ as well. You can use the same SQS queue as the DLQ for all the targets, but you have to configure it for each and every target separately. That may sound like repetitive work, but if you use an infrastructure as code tool like CDK or CloudFormation, it is not complex.&lt;/p&gt;
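&lt;p&gt;As a sketch, attaching a DLQ to a target via the EventBridge &lt;code&gt;PutTargets&lt;/code&gt; API looks like the following. The rule, bus, queue, and target names are placeholders, not values from any specific stack.&lt;/p&gt;

```python
# Sketch: an EventBridge target definition with a DeadLetterConfig.
# ARNs and IDs are placeholders chosen for illustration.
def target_with_dlq(target_arn: str, dlq_arn: str) -> dict:
    return {
        "Id": "sqs-target-with-dlq",
        "Arn": target_arn,
        # Messages EventBridge cannot deliver to this target are sent
        # to the SQS queue referenced here.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }

# Applying it would look like:
#   import boto3
#   boto3.client("events").put_targets(
#       Rule="my-rule",
#       EventBusName="my-bus",
#       Targets=[target_with_dlq(queue_arn, dlq_arn)],
#   )
```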

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwblxhq8g9k3b6ljowecu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwblxhq8g9k3b6ljowecu.png" alt="High level architecture" width="800" height="311"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: High level architecture.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The event bus tries to deliver a message to its target (here, an SQS queue) via an EventBridge rule.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let’s assume there is a permission issue, and the message cannot be delivered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, EventBridge will put the message into the DLQ configured for this specific target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In CloudWatch, there is an alarm set up to be triggered whenever there is a message in the DLQ.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the failed message lands in the DLQ, the alarm triggers; an SNS topic is configured as the alarm action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the alarm action publishes a message to the SNS topic, a notification about the failure is sent to all subscribers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
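&lt;p&gt;Steps 4 and 5 can be sketched as a CloudWatch &lt;code&gt;PutMetricAlarm&lt;/code&gt; call on the DLQ depth. The queue name, threshold values, and SNS topic ARN below are placeholders; an actual template may use different values.&lt;/p&gt;

```python
# Sketch: CloudWatch alarm parameters that fire when the DLQ contains
# at least one visible message. All names and ARNs are placeholders.
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    return {
        "AlarmName": f"{queue_name}-has-messages",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,             # evaluate over 1-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # step 5: notify via SNS
    }

# Applying it would look like:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **dlq_alarm_params("eb-target-dlq", topic_arn))
```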
&lt;h3&gt;
  
  
  Try this yourself
&lt;/h3&gt;

&lt;p&gt;I have created an AWS SAM template to try this scenario in your AWS account.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Clone the GitHub repository: &lt;a href="https://github.com/pubudusj/event-bridge-target-failure-detection-with-dlq" rel="noopener noreferrer"&gt;https://github.com/pubudusj/event-bridge-target-failure-detection-with-dlq&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deploy the stack using the command below:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;sam deploy &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--template-file&lt;/span&gt; template.yaml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--stack-name&lt;/span&gt; event-bridge-target-failure-detection-with-dlq &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_IAM &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--no-confirm-changeset&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--parameter-overrides&lt;/span&gt; &lt;span class="nv"&gt;NotificationEmail&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;YourEmailAddress]
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Here, pass your email address as &lt;code&gt;NotificationEmail&lt;/code&gt; so you will receive a notification by email when the target fails.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Once the stack is deployed, you will get an SNS subscription confirmation email. You need to confirm it in order to receive notifications.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Then, publish a message to the created event bus with the source set to &lt;code&gt;xyzcorp&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;This message matches the rule, and EventBridge attempts to deliver it to the target.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;I have intentionally removed the permission to publish to the target in order to simulate a failure.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In a moment, you should get an email with the alarm status.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Further, if you check the messages in the DLQ, you can see the failed message, and in the message attributes you may see the reason for the failure (depending on the error).&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniuqskek5wdmkfbkub83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fniuqskek5wdmkfbkub83.png" alt="Message attributes of a failed message in DLQ" width="800" height="392"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Image: Message attributes of a failed message in DLQ.&lt;/em&gt;&lt;/p&gt;
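&lt;p&gt;The failure reason can be read programmatically from the message attributes EventBridge adds to DLQ messages (such as ERROR_CODE, ERROR_MESSAGE, RULE_ARN and TARGET_ARN). The sketch below assumes a message shape as returned by SQS &lt;code&gt;ReceiveMessage&lt;/code&gt;; the sample values are illustrative only.&lt;/p&gt;

```python
# Sketch: pull the failure details out of an SQS message that
# EventBridge placed in a DLQ. EventBridge attaches attributes such as
# ERROR_CODE, ERROR_MESSAGE, RULE_ARN and TARGET_ARN to these messages.
def failure_reason(message: dict) -> dict:
    attrs = message.get("MessageAttributes", {})
    return {
        name: attrs[name]["StringValue"]
        for name in ("ERROR_CODE", "ERROR_MESSAGE", "RULE_ARN", "TARGET_ARN")
        if name in attrs
    }

# Illustrative message shape, as returned by
# sqs.receive_message(..., MessageAttributeNames=["All"]):
sample = {
    "MessageAttributes": {
        "ERROR_CODE": {"DataType": "String", "StringValue": "NO_PERMISSIONS"},
        "ERROR_MESSAGE": {"DataType": "String", "StringValue": "not authorized"},
    }
}
```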

&lt;p&gt;You can configure the threshold, period and evaluation periods of the alarm as needed to control the frequency of the notifications in case of a failure: &lt;a href="https://github.com/pubudusj/event-bridge-target-failure-detection-with-dlq/blob/main/template.yaml#L61-L63" rel="noopener noreferrer"&gt;https://github.com/pubudusj/event-bridge-target-failure-detection-with-dlq/blob/main/template.yaml#L61-L63&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An EventBridge bus can have a DLQ of its own, but it serves a different purpose and does not capture target failures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can use this dead letter queue approach to capture any messages that cannot be delivered to a target. Based on the number of messages in the queue, you can get notified using a CloudWatch metric and SNS. However, you need to configure it for each and every EventBridge target separately, so using an IaC tool is convenient here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I will discuss another solution to achieve the same in part 2 of this blog post.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Using dead-letter queues to process undelivered events in EventBridge &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-rule-dlq.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;





</description>
      <category>aws</category>
      <category>serverless</category>
      <category>eventbridge</category>
      <category>sqs</category>
    </item>
    <item>
      <title>EventBridge to SQS when cross region and cross account</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Wed, 17 Sep 2025 07:43:30 +0000</pubDate>
      <link>https://forem.com/aws-builders/eventbridge-to-sqs-when-cross-region-and-cross-account-1jgd</link>
      <guid>https://forem.com/aws-builders/eventbridge-to-sqs-when-cross-region-and-cross-account-1jgd</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Delivering a message from EventBridge to a target SQS queue is a very common requirement in an event-driven application. Sometimes this SQS queue is in a different region, and maybe even in a different AWS account. Depending on where the target SQS queue lives, the way you set up the solution differs.&lt;/p&gt;

&lt;p&gt;In this blog post, we look at the different scenarios for how EventBridge can deliver messages to a target SQS queue that is in a different region or a different AWS account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 1: Event bus, rule and target SQS queue in same AWS account, same region
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0enfekt6oulsn38kiwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0enfekt6oulsn38kiwg.png" alt="Image: Event bus, rule and target SQS queue in same AWS account, same region" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the most common and simplest scenario. The event bus has a rule with an SQS queue configured directly as a target. All the resources are in the same region and the same AWS account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 2: Event bus and rule in one region, target SQS queue in another region in the same AWS account
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik5fra82u1wwz9wfwhrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik5fra82u1wwz9wfwhrx.png" alt="Image: Event bus and rule in one region, target SQS queue in another region in the same AWS account" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this scenario, the event bus and the rule are in one region. The target SQS queue is in the same AWS account, but in a different region.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As of today (September 2025), EventBridge supports SQS as a target only when it is in the same region.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because of that, you first have to create a new event bus (or use the default event bus) in the second region and configure the SQS queue as a target of that bus.&lt;br&gt;
Then, configure that second-region event bus as a target in the EventBridge rule of the first region.&lt;/p&gt;
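&lt;p&gt;As a sketch, the first-region rule's target is then simply the ARN of the second-region bus. All ARNs, names, and the role below are placeholders for illustration.&lt;/p&gt;

```python
# Sketch: target definition pointing a first-region rule at an event
# bus in a second region. ARNs, names and the role are placeholders.
SECOND_REGION_BUS_ARN = "arn:aws:events:us-east-1:111122223333:event-bus/relay-bus"

def cross_region_bus_target(role_arn: str) -> dict:
    return {
        "Id": "second-region-bus",
        "Arn": SECOND_REGION_BUS_ARN,
        # A role the first-region rule assumes in order to put events
        # onto the remote bus.
        "RoleArn": role_arn,
    }

# In the second region, a separate rule on relay-bus then delivers the
# event to the SQS queue, which sits in the same region as that rule.
```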

&lt;h2&gt;
  
  
  Scenario 3: Event bus, rule in one AWS account, target SQS queue in another AWS account, but all are in the same region
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknt2qurpru3yf889p13a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknt2qurpru3yf889p13a.png" alt="Image: Event bus and rule in one AWS account, target SQS queue in another AWS account, but all are in the same region" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS introduced cross-account support for EventBridge targets at the beginning of 2025. Previously, when delivering messages to a cross-account resource, you had to use an event bus in the second account to establish the connection (similar to scenario 2 above).&lt;/p&gt;

&lt;p&gt;With this new feature, it is possible to have a cross-account target, which simplifies message delivery from an event bus to an SQS queue in another account.&lt;/p&gt;

&lt;p&gt;However, there is a limitation. Although cross-account delivery is possible, the EventBridge bus, the rule, and the target in the second AWS account &lt;strong&gt;must&lt;/strong&gt; all be in the same region.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario 4: Event bus, rule in one AWS account, target SQS queue in a different region in another AWS account
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzsr1y7ksa2mbxfx5fzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzsr1y7ksa2mbxfx5fzm.png" alt="Image: Event bus, rule in one AWS account, target SQS queue in a different region in another AWS account" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned earlier, cross-account delivery is supported only when the target SQS queue is in the same region. If the SQS queue is in a different region, you have to use an intermediate event bus in the second account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The pattern is simple: an SQS queue can be a target of an EventBridge rule only when the rule and the queue are in the same region. The target SQS queue can be in another AWS account as long as the region is the same.&lt;/p&gt;

&lt;p&gt;If the SQS queue is in a different region than the EventBridge rule, you have to use an intermediate event bus in the second region. This intermediate event bus and the target SQS queue can be in another AWS account.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Event bus targets in Amazon EventBridge - &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-targets.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-targets.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Introducing cross-account targets for Amazon EventBridge Event Buses - &lt;a href="https://aws.amazon.com/blogs/compute/introducing-cross-account-targets-for-amazon-eventbridge-event-buses/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/compute/introducing-cross-account-targets-for-amazon-eventbridge-event-buses/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;





</description>
      <category>aws</category>
      <category>serverless</category>
      <category>sqs</category>
      <category>eventbridge</category>
    </item>
    <item>
      <title>How I built a spelling game with AWS Serverless and GenAI</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Wed, 02 Apr 2025 14:48:34 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-i-built-a-spelling-game-with-aws-serverless-and-genai-5d65</link>
      <guid>https://forem.com/aws-builders/how-i-built-a-spelling-game-with-aws-serverless-and-genai-5d65</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In December 2024, as part of the &lt;a href="https://awsdevchallenge.devpost.com/" rel="noopener noreferrer"&gt;AWS Game Builder Challenge&lt;/a&gt;, I built a simple spelling game. For that, I used AWS serverless and GenAI services, with CDK as the IaC tool. This was my first experience building something with the help of GenAI and using Bedrock in an application. This blog post explains the project and my experience building this simple spelling game.&lt;/p&gt;

&lt;p&gt;This is the final result - the game you can play :)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://spelling-game.pubudu.dev/" rel="noopener noreferrer"&gt;https://spelling-game.pubudu.dev/&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to play the game?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The player selects a language to play the game. As of now, English (US) and Dutch are the only available languages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the player selects a language, a maximum of 5 words is generated. Each word comes with audio, a brief meaning, and its number of characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The player then needs to type the word into the text box provided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is an indicator next to the text box showing how many characters have been entered and how many characters the word requires. The indicator stays red until the required number of characters is entered, then turns green.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A timer starts as soon as the words are generated; its duration is based on the number of words generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When the remaining time is less than 30 seconds, the background of the page as well as the background of the timer turns red.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The player can submit the answers before the timer runs out; otherwise they are submitted automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The answers are then evaluated, and a pop-up appears based on the number of correct answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If all the answers are correct, a "Confetti" effect appears on the page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By clicking the 'show results' button on the pop-up, the player can see the correct/incorrect answers and, in case of an incorrect answer, the correct word.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;There are 2 main parts of this application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend - where words are generated and APIs are served.&lt;/li&gt;
&lt;li&gt;Frontend - a Vue.js application for the player to interact with.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Source code
&lt;/h3&gt;

&lt;p&gt;There are two repositories available with the complete source code.&lt;/p&gt;

&lt;p&gt;Backend - &lt;strong&gt;&lt;a href="https://github.com/pubudusj/spelling-game-backend" rel="noopener noreferrer"&gt;https://github.com/pubudusj/spelling-game-backend&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Frontend - &lt;strong&gt;&lt;a href="https://github.com/pubudusj/spelling-game-frontend" rel="noopener noreferrer"&gt;https://github.com/pubudusj/spelling-game-frontend&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment instructions:
&lt;/h3&gt;

&lt;p&gt;The backend is implemented using AWS CDK and can be deployed as a generic CDK application. The URL of the CloudFront distribution is required for the frontend to work.&lt;/p&gt;

&lt;p&gt;In the frontend, set &lt;code&gt;VITE_API_BASE_URL&lt;/code&gt; in the env file to the CloudFront API URL. Install the necessary dependencies, then run the application in dev mode using &lt;code&gt;npm run dev&lt;/code&gt;, or build the frontend app using &lt;code&gt;npm run build&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the backend stack is deployed, it outputs the S3 hosting bucket name. You can copy the built frontend app into this bucket, which hosts the frontend via the CloudFront distribution created in the backend stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backend
&lt;/h2&gt;

&lt;p&gt;There are 2 main components of the backend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Words generator component - to generate and save words in the database.&lt;/li&gt;
&lt;li&gt;API component - to serve frontend.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Backend - words generator component
&lt;/h3&gt;

&lt;p&gt;Here is a high-level overview of the words generator component and the steps within the state machine. Within the Step Functions execution, several AWS services are called, as explained below.&lt;/p&gt;

&lt;p&gt;Image: Words generator overview:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frncoqacegq8bh2jr14jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frncoqacegq8bh2jr14jd.png" alt="Image: Words generator overview" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Image: Words generator state machine:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwku6yjwrhpjzfdmpo3eq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwku6yjwrhpjzfdmpo3eq.png" alt="Image: Words generator state machine" width="649" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Words Generator Step Functions state machine is responsible for generating words.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Step Functions execution takes a language code as input, e.g. en-US, nl-NL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As the first step, Bedrock InvokeModel is called to generate 5 words, with a description for each word, based on the language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here, the Anthropic Claude 3 Haiku model is used, which gives a good balance of accuracy and price in this scenario.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Here is the prompt I used to generate the words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generate 5 unique words that have a random number of characters more than 4 and less than 10 in Dutch language. For each word, provide a brief description of its meaning in English with more than a couple of words. Produce output only in a minified JSON array with the keys word and description. Word must always be in lowercase."&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here, the response is a JSON string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Within the step's ResultSelector, the response is converted to an array using the intrinsic function &lt;em&gt;StringToJson&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"words.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.StringToJson($.Body.content[0].text)"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Then there is a map state where each word is the input to one iteration of the map.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Within the Map state, there are branches (currently two) based on the language.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;In each branch, there is a step that synthesises the word with Polly using &lt;em&gt;StartSpeechSynthesisTask&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;em&gt;StartSpeechSynthesisTask&lt;/em&gt; is an asynchronous operation, so the next step checks whether the synthesis task has completed using Polly's &lt;em&gt;GetSpeechSynthesisTask&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;A great thing about the &lt;em&gt;StartSpeechSynthesisTask&lt;/em&gt; API is that it not only synthesises the speech but also automatically saves the mp3 file to the given S3 bucket.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;If &lt;em&gt;GetSpeechSynthesisTask&lt;/em&gt; reports that the synthesis task has not finished, the workflow waits and retries the status check.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Once synthesis is done, the execution continues to the save-word-to-DynamoDB step.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;In this step, we use DynamoDB's &lt;em&gt;PutItem&lt;/em&gt; API to save the generated data to the table. One record consists of the following fields:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pk - Primary key in the format of Word#{LanguageCode}. ex: Word#en-US
sk - MD5 hash of the word - here, the intrinsic function States.Hash($.word, 'MD5') is in use.
word - The word generated by Bedrock.
description - The description generated by Bedrock.
s3file - Mp3 file location provided by Polly synthesis task.
charcount - Character count of the word. This is retrieved from Polly synthesis task.
updated_at - Update timestamp.
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Since the synthesis and save-to-DB tasks run in a Map state, a single execution makes a maximum of 5 new words available in DynamoDB for the given language.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Two EventBridge Schedules run every 5 minutes and invoke this state machine, one for each language code - English and Dutch.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;
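&lt;p&gt;To make the record shape concrete, here is a minimal Python sketch of the item the save step writes. This is illustrative only: the state machine actually builds the item with intrinsic functions, and the helper name and timestamp format below are my assumptions.&lt;/p&gt;

```python
import hashlib
from datetime import datetime, timezone

# Illustrative helper: builds an item in the shape the PutItem step writes.
def build_word_record(language_code, word, description, s3file, charcount):
    return {
        "pk": {"S": f"Word#{language_code}"},                 # e.g. Word#en-US
        "sk": {"S": hashlib.md5(word.encode()).hexdigest()},  # MD5 hash of the word
        "word": {"S": word},
        "description": {"S": description},
        "s3file": {"S": s3file},
        "charcount": {"N": str(charcount)},
        "updated_at": {"S": datetime.now(timezone.utc).isoformat()},
    }
```

&lt;p&gt;For example, the word "hello" in en-US gets the pk &lt;code&gt;Word#en-US&lt;/code&gt; and an sk equal to its MD5 hash, mirroring &lt;em&gt;States.Hash($.word, 'MD5')&lt;/em&gt;.&lt;/p&gt;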

&lt;h3&gt;
  
  
  Backend - API component
&lt;/h3&gt;

&lt;p&gt;The API component creates the resources the frontend interacts with.&lt;/p&gt;

&lt;p&gt;There are two APIs available in the backend.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;POST /questions&lt;/code&gt; - Generates, using Step Functions, the questions that appear on the frontend.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;POST /answers&lt;/code&gt; - To validate the answers submitted by the player.&lt;/p&gt;

&lt;p&gt;Image: Frontend and API components:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e9rfue4ecchptd0dk34.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e9rfue4ecchptd0dk34.png" alt="Frontend and API components" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Generate questions API
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The generate questions API accepts a single argument: the language code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The /questions API has a proxy integration with a Lambda function, which starts an execution of the Questions Generator state machine synchronously using the start_sync_execution SDK call.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This Questions Generator state machine is of type &lt;strong&gt;Express&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Image: Questions Generator State Machine:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg22oowlabf186ijztxh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg22oowlabf186ijztxh3.png" alt="Questions Generator State Machine" width="649" height="824"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A sample input is as follows:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"en-US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"iterate"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Here, "&lt;em&gt;iterate&lt;/em&gt;" is a hard coded array to start a map execution within the state machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Within the state machine, first the map state is executed based on the "iterate" array from the input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inside the Map state, it first fetches up to 50 records from DynamoDB using a Scan. To make the result somewhat random, a random &lt;em&gt;ExclusiveStartKey&lt;/em&gt; is generated with the help of the &lt;em&gt;UUID()&lt;/em&gt; intrinsic function, and a &lt;em&gt;FilterExpression&lt;/em&gt; restricts the records to the given language code.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"TableName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:dynamodb:****"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ExclusiveStartKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"S.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.Format('Word#{}', $$.Execution.Input.language)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"S.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.UUID()"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"FilterExpression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pk = :pk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ExpressionAttributeValues"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;":pk"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"S.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.Format('Word#{}', $$.Execution.Input.language)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ReturnConsumedCapacity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"TOTAL"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, the number of items returned from the previous step is checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If the count is more than 0, the next step selects a single random record from them. This is a Pass state that transforms the input using Parameters with the intrinsic functions &lt;em&gt;ArrayGetItem&lt;/em&gt;, &lt;em&gt;MathRandom&lt;/em&gt; and &lt;em&gt;ArrayLength&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"item.$"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"States.ArrayGetItem($.items,States.MathRandom(0, States.ArrayLength($.items)))"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, the selected record is sent to the Generate Pre-signed URL step. This Lambda function generates a pre-signed URL for the record's s3file path, so the frontend can play the mp3 file using this URL. The Lambda function also transforms the data. The pre-signed URL's expiry is set to the minimum because it is only needed within a game session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This is the last step within the Map state, and it outputs the record in the following format.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"XXXX"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Description of the word"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"charcount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"en-US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Presigned-url"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once all the Map iterations are completed, a final aggregation Lambda function - &lt;em&gt;GetUniqueResultsLambda&lt;/em&gt; - runs. Since each iteration is independent, the same record may be selected in more than one iteration; this Lambda function simply removes such duplicates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The result is then returned to the frontend as the response of the &lt;code&gt;/questions&lt;/code&gt; endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
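&lt;p&gt;The random selection and the final de-duplication above can be sketched in Python as follows. The function names are illustrative: the selection is actually done by the Pass state's intrinsic functions, and the real &lt;em&gt;GetUniqueResultsLambda&lt;/em&gt; implementation may differ.&lt;/p&gt;

```python
import random

# Rough Python equivalent of the Pass state's
# States.ArrayGetItem($.items, States.MathRandom(0, States.ArrayLength($.items)))
def pick_random_item(items):
    return items[random.randrange(len(items))]

# Sketch of GetUniqueResultsLambda: drop records picked by more than one
# independent map iteration, keyed by id, preserving order.
def get_unique_results(results):
    seen = set()
    unique = []
    for item in results:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique
```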

&lt;h4&gt;
  
  
  Validate answers API
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;POST /answers&lt;/code&gt; API is responsible for validating the answers submitted from the frontend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This API endpoint has a proxy Lambda function which accepts the payload in the following format:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"en-US"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"answers"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"312e8b6583d4b65b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"word"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"effect"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"99b788c54c1a8265"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"word"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"perilous"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Lambda function makes a DynamoDB &lt;em&gt;batch_get_item&lt;/em&gt; SDK call to fetch the words by their ids and matches each against the word provided in the API request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It then returns the response in the following format:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"312e8b6583d4b65b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"original_word"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"affect"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"99b788c54c1a8265"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"original_word"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"perilous"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"correct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Based on the value of "correct", the frontend calculates and displays the results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
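&lt;p&gt;The matching step inside the answers Lambda can be sketched like this. The real function fetches the words via &lt;em&gt;batch_get_item&lt;/em&gt;; this sketch assumes the fetched words are already in a dict, and the case/whitespace normalisation is my assumption (the prompt forces stored words to lowercase).&lt;/p&gt;

```python
# Illustrative sketch of the answer-matching logic.
# stored_words: mapping of id -> original word (as fetched via batch_get_item)
# answers: list of {"id": ..., "word": ...} from the request payload
def validate_answers(stored_words, answers):
    results = []
    for answer in answers:
        original = stored_words.get(answer["id"], "")
        results.append({
            "id": answer["id"],
            "original_word": original,
            # Stored words are lowercase, so normalise the submitted word first.
            "correct": answer["word"].strip().lower() == original,
        })
    return results
```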

&lt;h2&gt;
  
  
  Frontend
&lt;/h2&gt;

&lt;p&gt;For the frontend, I used a simple single-page application built with Vue.js. I have very limited knowledge of frontend technologies, so I used &lt;a href="https://aws.amazon.com/q/developer/" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt; in VSCode to implement the frontend application.&lt;/p&gt;

&lt;p&gt;Almost 95% of the frontend application was built by Amazon Q Developer. I asked various questions, and in most cases Amazon Q was able to analyse the existing code and generate code matching my requirements. This was a step-by-step process where I asked Amazon Q to generate one specific piece of functionality at a time.&lt;/p&gt;

&lt;p&gt;Here are some work-in-progress "&lt;em&gt;versions&lt;/em&gt;" of the application, implemented and fine-tuned step by step using Amazon Q.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yeo3dtuph3dsvgei8s5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yeo3dtuph3dsvgei8s5.png" alt="work in progress page" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hy1jvns5z4r0ciyraq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2hy1jvns5z4r0ciyraq.png" alt="work in progress page" width="800" height="780"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe7bvb3bthg4w73symes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe7bvb3bthg4w73symes.png" alt="work in progress page" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kbbfk5fpymsbetzj2od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kbbfk5fpymsbetzj2od.png" alt="work in progress page" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucu4uey6h40s6dtwjukh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucu4uey6h40s6dtwjukh.png" alt="work in progress page" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc2lue9qcn981seooad2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvc2lue9qcn981seooad2.png" alt="work in progress page" width="750" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples of the questions I asked Amazon Q and how it analysed and generated code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b1pcn2qfhjf0ee3lp2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b1pcn2qfhjf0ee3lp2r.png" alt="AmazonQ question" width="442" height="106"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipygm95tjulc14c920v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipygm95tjulc14c920v5.png" alt="AmazonQ response" width="426" height="673"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgznwgs1l5ggtl3n4bqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgznwgs1l5ggtl3n4bqc.png" alt="AmazonQ question" width="365" height="120"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti41d0776j27o0lbzd6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti41d0776j27o0lbzd6k.png" alt="AmazonQ response" width="318" height="734"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw8bgm2vs8y3bxiomfmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw8bgm2vs8y3bxiomfmx.png" alt="AmazonQ question" width="352" height="174"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixu84kiom1ood2n60pg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixu84kiom1ood2n60pg2.png" alt="AmazonQ response" width="313" height="792"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxcbbvw3djo8nnztvrr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxcbbvw3djo8nnztvrr0.png" alt="AmazonQ question" width="425" height="157"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioo8vbuvjjpl81bld2bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioo8vbuvjjpl81bld2bn.png" alt="AmazonQ response" width="324" height="747"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learnt
&lt;/h2&gt;

&lt;p&gt;Below are some lessons I learnt while working on this project, along with some feedback on the services I used.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Bedrock does not always return JSON. In the prompt, I stated: "Produce output only in a minified JSON array with the keys word and description". However, once in a while, Bedrock returns data in a different format. This could have been made more reliable by including the beginning of the expected response in the prompt, so Bedrock continues from there. However, that would increase the request token count of each API call. To avoid the additional cost, and because the error rate is acceptable (this is a background job anyway), I kept the prompt as it is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Polly's &lt;em&gt;StartSpeechSynthesisTask&lt;/em&gt; doesn't accept a full S3 path. We can provide the &lt;em&gt;OutputS3BucketName&lt;/em&gt; where the generated audio will be stored, but we cannot specify an object key. Instead, I used the &lt;em&gt;OutputS3KeyPrefix&lt;/em&gt; parameter to provide a path containing the language code, so the audio is saved as &lt;code&gt;s3://bucket_name/language_code/file_name.mp3&lt;/code&gt;&lt;br&gt;
One minor issue is that Polly always adds a dot (.) between the prefix and the file name, so all the files generated in the sub path start with a dot.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost of Polly: apart from Standard, Polly also offers Generative and Neural text-to-speech engines. However, they are considerably more expensive than Standard and are available only for a limited number of languages. &lt;a href="https://aws.amazon.com/polly/pricing/" rel="noopener noreferrer"&gt;https://aws.amazon.com/polly/pricing/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Selecting random items from a DynamoDB table is hard; there is no straightforward way to do it. That's why I used a random &lt;em&gt;ExclusiveStartKey&lt;/em&gt;, fetched a maximum of 50 items, and selected one at random. This can introduce duplicates, which is why I needed the &lt;em&gt;GetUniqueResultsLambda&lt;/em&gt; to remove any duplicates across the map iterations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;I initially used a direct API Gateway integration to start the Step Functions express workflow that generates questions. However, the VTL mapping templates are complex to build, especially to shape the response into a specific format, so I stuck with the simpler Lambda proxy option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initially, I hosted the frontend using &lt;a href="https://docs.aws.amazon.com/amplify/latest/userguide/welcome.html" rel="noopener noreferrer"&gt;Amplify Hosting&lt;/a&gt;. However, I wanted to restrict API Gateway access to the Amplify project only, and there is no option for that yet. So I switched to a CloudFront and S3 setup. This is described in detail in one of my previous blog posts: &lt;a href="https://dev.to/posts/access-api-gw-rest-api-only-from-cloudfront/"&gt;Enforce CloudFront-Only Access for AWS API Gateway&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For the frontend, I did indeed do &lt;a href="https://en.wikipedia.org/wiki/Vibe_coding" rel="noopener noreferrer"&gt;vibe coding&lt;/a&gt;. So, there is a high chance that a frontend specialist will find unacceptable code in it 😉.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most of the time, Amazon Q generated code with the expected functionality. However, there were cases where it couldn't fix a code snippet. For example, when I asked it to center a component, it still hadn't fixed it after 20 iterations. Maybe this is because of the complexity of the single-page structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuhu0fuyyublz18g29bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuhu0fuyyublz18g29bj.png" alt="Amazon Q attempts to center a div" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Possible improvements
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Currently, there is no 'history' of the games a particular player has played. Adding a logged-in mode to record the player's status and progress would be a nice feature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add more languages: currently, only English (US) and Dutch are available to select. Having more languages would be nice; however, they need to be supported by Polly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Showing a finished game's standing against all other games would be a nice addition to the results. This would require recording the results and comparing each game against all of them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What are the other possible improvements you see?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Overall, it was a great experience using Bedrock and Amazon Q. I didn't win the hackathon, but since this was my first time using these GenAI services, I learnt a lot from this project and enjoyed it thoroughly. The frontend code was mostly generated by Amazon Q, and I am happy with the results, although I am certain the frontend code could be improved in many ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Getting started with Amazon Q Developer: &lt;a href="https://aws.amazon.com/q/developer/getting-started/" rel="noopener noreferrer"&gt;https://aws.amazon.com/q/developer/getting-started/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Getting started with Amazon Bedrock: &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/bedrock/latest/userguide/getting-started.html&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Invoke and customize Amazon Bedrock models with Step Functions: &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-bedrock.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/step-functions/latest/dg/connect-bedrock.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Claude Prompt engineering overview: &lt;a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview" rel="noopener noreferrer"&gt;https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;👋 I regularly create content on AWS and Serverless, and if you're interested, feel free to follow/connect with me so you don't miss out on my latest posts!&lt;/p&gt;

&lt;p&gt;LinkedIn: &lt;a href="https://www.linkedin.com/in/pubudusj" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pubudusj&lt;/a&gt;&lt;br&gt;
Twitter/X: &lt;a href="https://x.com/pubudusj" rel="noopener noreferrer"&gt;https://x.com/pubudusj&lt;/a&gt;&lt;br&gt;
Dev.to: &lt;a href="https://dev.to/pubudusj"&gt;https://dev.to/pubudusj&lt;/a&gt;&lt;br&gt;
Personal blog: &lt;a href="https://pubudu.dev" rel="noopener noreferrer"&gt;https://pubudu.dev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>amazonq</category>
      <category>bedrock</category>
    </item>
    <item>
      <title>Enforce CloudFront-Only Access for AWS API Gateway</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Thu, 06 Mar 2025 15:58:02 +0000</pubDate>
      <link>https://forem.com/aws-builders/enforce-cloudfront-only-access-for-aws-api-gateway-1hdd</link>
      <guid>https://forem.com/aws-builders/enforce-cloudfront-only-access-for-aws-api-gateway-1hdd</guid>
<description>&lt;p&gt;Recently, I had a requirement to expose a REST API from API Gateway exclusively through CloudFront. Keeping API Gateway behind CloudFront provides an additional layer of security, because CloudFront comes with the automatic protections of AWS Shield Standard at no additional charge.&lt;/p&gt;

&lt;p&gt;There are several ways to achieve this (for example, signing requests with Lambda@Edge in CloudFront), but I went for a simpler solution: adding a custom header to the requests CloudFront sends to API Gateway and validating it with a Lambda authorizer at the API Gateway end. This blog post explains how to implement this solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2y33hae94fg36fu57lm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2y33hae94fg36fu57lm.png" alt="Architecture" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the CloudFront distribution, we create an origin for the API Gateway endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We define a custom header on this CloudFront origin.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a request comes to the CloudFront distribution, it calls the API Gateway endpoint as per the behaviour we define.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This request contains the custom header.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At the API Gateway end, a Lambda authorizer validates this incoming header.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An SSM secure string parameter holds the expected value of the header.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To validate the incoming header, the Lambda authorizer fetches this value from Parameter Store.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
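&lt;p&gt;As a sketch of the validation steps above, a minimal Lambda authorizer could look like the following. This is illustrative only: the parameter name, header name, and environment variables are assumptions, not taken from the repository.&lt;/p&gt;

```python
import os

# Illustrative names; in practice these would come from Lambda environment variables.
PARAM_NAME = os.environ.get("HEADER_PARAM_NAME", "/secure-api/header-value")
HEADER_NAME = os.environ.get("CUSTOM_HEADER_NAME", "x-origin-verify")


def build_policy(effect: str, method_arn: str) -> dict:
    """Build the IAM policy document an API Gateway Lambda authorizer returns."""
    return {
        "principalId": "cloudfront",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": method_arn,
                }
            ],
        },
    }


def handler(event, context):
    import boto3  # imported lazily so the module loads without AWS dependencies

    # Fetch the expected header value from the SSM secure string parameter.
    ssm = boto3.client("ssm")
    expected = ssm.get_parameter(Name=PARAM_NAME, WithDecryption=True)[
        "Parameter"
    ]["Value"]

    received = event.get("headers", {}).get(HEADER_NAME)
    effect = "Allow" if received == expected else "Deny"
    return build_policy(effect, event["methodArn"])
```

&lt;p&gt;In a production version, you would also cache the parameter value between invocations instead of calling SSM on every request.&lt;/p&gt;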

&lt;p&gt;Communication between the CloudFront distribution and API Gateway is secure, and there is no way for someone without access to the CloudFront settings to learn the header value. However, relying on a static value for validation is not ideal, so it is better to rotate the header value for added security.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rotating the header value
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An EventBridge scheduler invokes a Lambda function at a given interval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This Lambda function performs the following actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate a random value for the header.&lt;/li&gt;
&lt;li&gt;Update the SSM secure parameter.&lt;/li&gt;
&lt;li&gt;Update the CloudFront origin's custom header value.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once this Lambda execution succeeds, both sides (CloudFront and API Gateway) use the new header value.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
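&lt;p&gt;The rotation steps above can be sketched as follows. This is a hedged sketch, not the repository's code: the function and parameter names are illustrative and error handling is omitted. The CloudFront part reflects the usual API shape, where you fetch the full distribution config together with its ETag and send it back.&lt;/p&gt;

```python
import secrets


def generate_header_value(num_bytes: int = 32) -> str:
    """Generate a URL-safe random token to use as the new header value."""
    return secrets.token_urlsafe(num_bytes)


def rotate(param_name: str, distribution_id: str, header_name: str) -> str:
    import boto3  # imported lazily so the module loads without AWS dependencies

    ssm = boto3.client("ssm")
    cloudfront = boto3.client("cloudfront")

    new_value = generate_header_value()

    # 1. Update the SSM secure string parameter with the new value.
    ssm.put_parameter(
        Name=param_name, Value=new_value, Type="SecureString", Overwrite=True
    )

    # 2. Update the CloudFront origin's custom header. CloudFront requires
    #    sending the full distribution config back together with its ETag.
    resp = cloudfront.get_distribution_config(Id=distribution_id)
    config, etag = resp["DistributionConfig"], resp["ETag"]
    for origin in config["Origins"]["Items"]:
        for header in origin.get("CustomHeaders", {}).get("Items", []):
            if header["HeaderName"] == header_name:
                header["HeaderValue"] = new_value
    cloudfront.update_distribution(
        Id=distribution_id, DistributionConfig=config, IfMatch=etag
    )
    return new_value
```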

&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;Here is the GitHub repository of the project I created to try out this solution: &lt;a href="https://github.com/pubudusj/secure-api-with-cloudfront" rel="noopener noreferrer"&gt;https://github.com/pubudusj/secure-api-with-cloudfront&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can deploy this to your AWS account using CDK.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Please note:&lt;/strong&gt; First, you need to create a secure SSM parameter with any value. Then create an &lt;code&gt;.env&lt;/code&gt; file by copying the &lt;code&gt;.env.example&lt;/code&gt; file in the project root directory, and set the name/path of the parameter in the &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the CDK code, you will notice that the initial header value is hardcoded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   origin=origins.RestApiOrigin(
       rest_api,
       origin_path="/prod",
       custom_headers={custom_header_key: "test"},
   ),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, this obviously does not match the value you have in the SSM parameter.&lt;/p&gt;

&lt;p&gt;This is fine, because this stack includes a &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources.html" rel="noopener noreferrer"&gt;CloudFormation custom resource&lt;/a&gt;, which is executed &lt;strong&gt;on create&lt;/strong&gt;. This custom resource starts the initial token rotation as soon as the stack is created, generating a new header value and syncing both the SSM secure parameter and the CloudFront distribution.&lt;/p&gt;

&lt;p&gt;Once the stack is deployed, you can access the API using both the API Gateway endpoint and the CloudFront endpoint.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFront endpoint: &lt;code&gt;https://[CloudfrontPrefix].cloudfront.net/prod/hello&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;API Gateway endpoint: &lt;code&gt;https://[APIGWPrefix].execute-api.[region].amazonaws.com/prod/hello&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will notice that the CloudFront endpoint works fine, while the API Gateway endpoint returns a &lt;code&gt;401 Unauthorized&lt;/code&gt; error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips/Lesson Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Here I have used SSM Parameter Store with a custom-built rotation mechanism. However, you can use Secrets Manager instead, with its built-in rotation features. Make sure you are aware of the differences; Yan has compared the two in detail in his blog post: &lt;a href="https://theburningmonk.com/2023/03/the-old-faithful-why-ssm-parameter-store-still-reigns-over-secrets-manager/" rel="noopener noreferrer"&gt;https://theburningmonk.com/2023/03/the-old-faithful-why-ssm-parameter-store-still-reigns-over-secrets-manager/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep in mind that setting a new custom header value on the CloudFront origin triggers a re-deployment. CloudFront re-deployments take time, anywhere from a couple of minutes to 10-15 minutes, though a small change like this should be on the faster end.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Because of this, during a CloudFront re-deployment, your SSM parameter and the incoming header value at API Gateway can be different. One solution I used to address this is caching the Lambda authorizer result. Here I used a TTL of 5 minutes, but you can adjust it as required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;authorizer = apigateway.RequestAuthorizer(
   self,
   "LambdaHeaderAuthorizer",
   handler=custom_authorizer,
   identity_sources=[apigateway.IdentitySource.header(custom_header_key)],
   results_cache_ttl=Duration.minutes(5),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cloudfront custom headers: &lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/add-origin-custom-headers.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/add-origin-custom-headers.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API Gateway Lambda authorizer: &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Please let me know your thoughts on this implementation.&lt;/p&gt;

&lt;p&gt;You can find more AWS and Serverless contents at my personal blog: &lt;a href="https://pubudu.dev" rel="noopener noreferrer"&gt;https://pubudu.dev&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And don't forget to follow me on LinkedIn too:&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/pubudusj" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/pubudusj&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cloudfront</category>
      <category>apigateway</category>
    </item>
    <item>
      <title>SQS encryption options</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Wed, 04 Sep 2024 09:33:17 +0000</pubDate>
      <link>https://forem.com/aws-builders/sqs-encryption-options-21g9</link>
      <guid>https://forem.com/aws-builders/sqs-encryption-options-21g9</guid>
<description>&lt;p&gt;When building distributed applications on AWS, Amazon Simple Queue Service (SQS) often becomes a crucial component in managing message flow between services. Ensuring the security of these messages is important, especially when dealing with sensitive data across multiple AWS accounts. In this post, we’ll explore the different encryption options available for SQS and how to choose the best option for scenarios like cross-account access.&lt;/p&gt;

&lt;h2&gt;
  
  
  In-Transit (as it travels to and from Amazon SQS) Encryption
&lt;/h2&gt;

&lt;p&gt;You can protect data in transit using HTTPS (TLS). This ensures that messages are protected as they travel between your application and SQS, preventing attacks such as man-in-the-middle. You can enforce encrypted connections over HTTPS (TLS) only, using the &lt;code&gt;aws:SecureTransport&lt;/code&gt; condition in the queue policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"aws:SecureTransport"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
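&lt;p&gt;For context, here is what a complete queue policy statement enforcing this could look like: an explicit deny for any request where &lt;code&gt;aws:SecureTransport&lt;/code&gt; is false. The queue ARN below is a placeholder.&lt;/p&gt;

```json
{
  "Sid": "DenyInsecureTransport",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "sqs:*",
  "Resource": "arn:aws:sqs:eu-west-1:111111111111:my_sqs_queue",
  "Condition": {
    "Bool": {
      "aws:SecureTransport": "false"
    }
  }
}
```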



&lt;p&gt;Another option is client-side encryption, where you encrypt data before sending it to SQS. Here, you have to manage the encryption and decryption mechanism yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Server-Side Encryption
&lt;/h2&gt;

&lt;p&gt;Server-side encryption (SSE) adds an extra layer of security by encrypting the contents of your queue at the storage level. SSE protects the contents of messages in queues using SQS-managed encryption keys (SSE-SQS) or keys managed in the AWS Key Management Service (SSE-KMS).&lt;/p&gt;

&lt;h4&gt;
  
  
  SSE-SQS (SQS Managed Keys)
&lt;/h4&gt;

&lt;p&gt;This is the simplest option, where Amazon SQS takes care of the encryption keys for you. SQS generates, manages, and uses the encryption key, requiring no additional configuration on your part. &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/10/amazon-sqs-announces-server-side-encryption-ssq-managed-sse-sqs-default/" rel="noopener noreferrer"&gt;Since October 2022&lt;/a&gt;, this has been enabled by default for any new SQS queue.&lt;/p&gt;

&lt;h4&gt;
  
  
  SSE-KMS (AWS Key Management Service Keys)
&lt;/h4&gt;

&lt;p&gt;This option provides more control by allowing you to use AWS Key Management Service (KMS) to manage the encryption keys. With SSE-KMS, you can use an existing KMS key or create a new one specifically for your SQS queue. This method enables finer-grained access control and auditing capabilities compared to SSE-SQS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cross-account access with SSE
&lt;/h2&gt;

&lt;p&gt;As SQS is one of the most used messaging services between software components, cross-account access is a very common scenario. Still, we need to make sure the messages exchanged through SQS queues are secured.&lt;/p&gt;

&lt;p&gt;In general, you can manage cross-account access for an SQS queue using its access policy.&lt;/p&gt;

&lt;p&gt;Below is the access policy of &lt;code&gt;my_sqs_queue&lt;/code&gt; in account &lt;code&gt;111111111111&lt;/code&gt;. It grants account &lt;code&gt;222222222222&lt;/code&gt; permission to send messages to &lt;code&gt;my_sqs_queue&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cross_account_access"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::222222222222:root"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sqs:GetQueueAttributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sqs:GetQueueUrl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"sqs:SendMessage"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:sqs:eu-west-1:111111111111:my_sqs_queue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the server-side encryption used on the queue, additional permissions may be required to send messages to &lt;code&gt;my_sqs_queue&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  If SSE-SQS is used
&lt;/h4&gt;

&lt;p&gt;The good news is that if SSE-SQS is used, account &lt;code&gt;222222222222&lt;/code&gt; requires no additional encryption-related permissions, which means the above access policy is sufficient to send messages to &lt;code&gt;my_sqs_queue&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  If SSE-KMS is used
&lt;/h4&gt;

&lt;p&gt;If SSE-KMS is used, additional permissions for the KMS key must be granted in order to successfully send messages to &lt;code&gt;my_sqs_queue&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's assume &lt;code&gt;my_sqs_queue&lt;/code&gt; is encrypted using a KMS key in the same account &lt;code&gt;111111111111&lt;/code&gt;, with the alias &lt;code&gt;my_kms_key&lt;/code&gt;. In the key policy of &lt;code&gt;my_kms_key&lt;/code&gt;, you have to grant permissions to account &lt;code&gt;222222222222&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::222222222222:root"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"kms:Encrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"kms:Decrypt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"kms:ReEncrypt*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"kms:GenerateDataKey*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"kms:DescribeKey"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:kms:eu-west-1:111111111111:key/[key-id]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further, in account &lt;code&gt;222222222222&lt;/code&gt;, the sending principal needs the &lt;code&gt;kms:GenerateDataKey&lt;/code&gt; permission for the KMS key as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kms:GenerateDataKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:kms:eu-west-1:111111111111:key/[key-id]"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion: What SSE method to choose?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Both SSE-SQS and SSE-KMS support cross-account access for SQS queues.&lt;/strong&gt; The key difference lies in how much control and responsibility you want over the encryption process.&lt;/p&gt;

&lt;p&gt;SSE-SQS is ideal when you need simple, effective encryption without the additional complexity of managing KMS keys. It suits most general use cases where ease of setup and management is a priority.&lt;/p&gt;

&lt;p&gt;Use SSE-KMS when you require more control over your encryption keys and need to meet strict security and compliance requirements. This option is suited for environments where key management and detailed access control are critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amazon SQS security best practices: &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-security-best-practices.html#implement-server-side-encryption" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-security-best-practices.html#implement-server-side-encryption&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Encryption at rest in Amazon SQS - Developer Guide : &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-server-side-encryption.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-server-side-encryption.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Amazon SQS Key management - Developer guide: &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-key-management.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-key-management.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>sqs</category>
      <category>encryption</category>
    </item>
    <item>
      <title>Dead Letter Queue (DLQ) for AWS Step Functions</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Mon, 27 May 2024 07:28:38 +0000</pubDate>
      <link>https://forem.com/aws-builders/dead-letter-queue-dlq-for-aws-step-functions-25eo</link>
      <guid>https://forem.com/aws-builders/dead-letter-queue-dlq-for-aws-step-functions-25eo</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;When it comes to building a software system, one of the most critical components is error handling, because “everything fails all the time”. While it's impossible to anticipate every possible failure scenario, having a plan in place for when something unexpected happens always helps the robustness and resilience of the system.&lt;/p&gt;

&lt;p&gt;This ‘plan’ can be a simple &lt;strong&gt;Dead Letter Queue (DLQ)&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Dead letter queues act as a safety net where you can keep messages when unexpected issues arise. This helps you isolate problematic messages, debug the issue without disrupting the rest of your workflow, and then retry or reprocess the messages as needed.&lt;/p&gt;

&lt;p&gt;Many AWS Serverless services support SQS queues as the dead letter queue natively. However, Step Functions - one of the main Serverless services offered by AWS for workflow orchestration - &lt;strong&gt;does not support&lt;/strong&gt; dead letter queues &lt;strong&gt;natively&lt;/strong&gt; (yet).&lt;/p&gt;

&lt;p&gt;In this blog post, I am going to discuss a couple of workarounds to safely capture, in a dead letter queue, any messages that a Step Functions execution fails to process.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;Let’s consider a very common use case where we have messages in an SQS source queue that need to be processed by Step Functions. First, the messages are read by a Lambda function that starts a Step Functions execution for each message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr7zz4kxpvfxf2dx7h0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsr7zz4kxpvfxf2dx7h0h.png" alt="Scenario" width="620" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, once the Lambda function reads a message from the source queue and successfully starts the Step Functions execution, the message is marked as successfully processed and deleted from the queue. If any error occurs during the Step Functions execution, the message is lost.&lt;/p&gt;
&lt;h2&gt;
  
  
  Solution 01
&lt;/h2&gt;

&lt;p&gt;In order to retain the message even when there is a failure in the Step Functions execution, we can add a &lt;strong&gt;second SQS queue that acts as a dead letter queue&lt;/strong&gt; as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo9svjtes8fmc35flt4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjo9svjtes8fmc35flt4c.png" alt="Solution 1" width="679" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The state machine will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv7zlfpppkosf9vrgbtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbv7zlfpppkosf9vrgbtt.png" alt="Solution 1 - state machine" width="419" height="416"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Within the state machine, we create a step named “Send message to DLQ”, which is an SDK integration for the SQS SendMessage action.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In this step, the message to be sent to the DLQ is built from the execution input retrieved from the context object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the other steps of the state machine, as required, we configure Catch rules in the error handling settings, using the above “Send message to DLQ” step as the catcher.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This way, when an error happens in a state, the message is sent to the DLQ, and we can re-process it from there.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
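&lt;p&gt;As an illustration of the steps above (not the exact definition from the repository), the “Send message to DLQ” state and a catcher pointing at it could look like this in Amazon States Language. The queue URL, state names, and the catching state's details are placeholders.&lt;/p&gt;

```json
{
  "Process message": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "Send message to DLQ"
      }
    ],
    "End": true
  },
  "Send message to DLQ": {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage",
    "Parameters": {
      "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/111111111111/my-dlq",
      "MessageBody.$": "$$.Execution.Input"
    },
    "End": true
  }
}
```

&lt;p&gt;Here &lt;code&gt;$$.Execution.Input&lt;/code&gt; pulls the original execution input from the context object, so the DLQ receives the message exactly as the execution started with it.&lt;/p&gt;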
&lt;h3&gt;
  
  
  Try this yourself
&lt;/h3&gt;

&lt;p&gt;Here is the GitHub repository of a sample project I created to try out this solution:&lt;br&gt;
&lt;a href="https://github.com/pubudusj/dlq-for-stepfunctions" rel="noopener noreferrer"&gt;https://github.com/pubudusj/dlq-for-stepfunctions&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can deploy this to your AWS account using CDK.&lt;/p&gt;

&lt;p&gt;Once deployed, you can test the functionality by sending a message to the source SQS queue in the expected format below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see the message ends up in the DLQ, and the "metadata" object now includes the "attempt_number" as well.&lt;br&gt;
As an alternative to using the metadata section of the message to track the attempt number, you may use an SQS message attribute.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Please note:&lt;/strong&gt; In this approach, the DLQ is not a "real" DLQ, as it is not configured on the source SQS queue. However, it helps capture any messages that the Step Functions execution fails to process.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Solution 02
&lt;/h2&gt;

&lt;p&gt;In this method, we will use a real DLQ that is configured with the source queue. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wk09kyszgnxwevsjvqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wk09kyszgnxwevsjvqn.png" alt="Solution 2" width="615" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The state machine will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5t0nadkn9t1onhveir8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5t0nadkn9t1onhveir8.png" alt="Solution 2 - state machine" width="593" height="332"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;There is a source SQS queue and a DLQ configured to it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the DLQ settings, the max receive count is set as &amp;gt; 1 so the message will be available in the DLQ immediately after the first failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a Lambda function with a trigger set up to process messages from the source queue and start the Step Functions execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Event Source Mapping settings of this Lambda function, the ReportBatchItemFailures response type must be enabled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;First, when the message is processed by the Lambda function, we set the visibility timeout of the message to a larger value. This must be larger than the time it takes to complete the Step Functions execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, the Step Functions execution is started. Here, we pass the SQS queue URL and the message receipt handle along with the original values from the SQS message.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the example above, we use a simple Choice state to determine whether to simulate a failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the execution is successful, we call the step "Delete SQS Message". Here we use the SQS SDK integration to delete the message, using the SQS queue URL and receipt handle values received in the input payload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If it is a failure, we call a step named "Set message visibility timeout 5s". Here we use the SQS SDK integration for the "changeMessageVisibility" action to set the SQS message's visibility to 5 seconds, again using the SQS queue URL and receipt handle values passed in the execution input.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the message visibility is set to 5 seconds, the message reappears on the source queue after 5 seconds. However, since the max receive count has already been reached, the message is immediately moved to the DLQ of the source queue.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
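&lt;p&gt;Steps 3 to 5 above could look roughly like the following Lambda handler. This is a hedged sketch, not the repository's exact code; the environment variable names and the execution input shape are my assumptions.&lt;/p&gt;

```python
import json
import os

# Sketch of the trigger Lambda: extend the message's visibility, then start
# the Step Functions execution with the queue URL and receipt handle included.
VISIBILITY_TIMEOUT_SECONDS = 900  # must exceed the Step Functions execution time


def build_execution_input(record, queue_url):
    """Combine the original message body with the queue URL and receipt
    handle the state machine needs for its delete / change-visibility steps."""
    return {
        "original": json.loads(record["body"]),
        "sqsQueueUrl": queue_url,
        "sqsReceiptHandle": record["receiptHandle"],
    }


def handler(event, context):
    import boto3  # imported lazily so the sketch stays importable without the AWS SDK

    sqs = boto3.client("sqs")
    sfn = boto3.client("stepfunctions")
    queue_url = os.environ["QUEUE_URL"]
    failures = []
    for record in event["Records"]:
        try:
            # 1. Extend visibility so the message does not reappear while
            #    the Step Functions execution is still running.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record["receiptHandle"],
                VisibilityTimeout=VISIBILITY_TIMEOUT_SECONDS,
            )
            # 2. Start the execution with the receipt handle in its input.
            sfn.start_execution(
                stateMachineArn=os.environ["STATE_MACHINE_ARN"],
                input=json.dumps(build_execution_input(record, queue_url)),
            )
        except Exception:
            # With ReportBatchItemFailures enabled, only this message is retried.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```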
&lt;h3&gt;
  
  
  Try this yourself
&lt;/h3&gt;

&lt;p&gt;I have another GitHub repository for you to try this in your own AWS environment. You can set it up using CDK/Python.&lt;br&gt;
&lt;a href="https://github.com/pubudusj/dlq-for-stepfunctions-2" rel="noopener noreferrer"&gt;https://github.com/pubudusj/dlq-for-stepfunctions-2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To simulate a failure scenario, send a message to the source queue with the "failed" field set to true.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will make the step function execution fail and the message will be immediately available in the DLQ of the source queue.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With this approach, you can use the native DLQ functionality when messages cannot be processed by the Step Functions execution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Step Functions is one of the most widely used AWS serverless services. However, it does not yet support dead letter queues (DLQs) natively. Still, there are workarounds that achieve this with a few simple steps. This blog post explained two such workarounds, which help you build a more resilient system.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>stepfunctions</category>
      <category>sqs</category>
    </item>
    <item>
      <title>Call external APIs with OAuth within Step Functions</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Sat, 09 Dec 2023 08:24:44 +0000</pubDate>
      <link>https://forem.com/aws-builders/call-external-apis-with-oauth-within-step-functions-5cme</link>
      <guid>https://forem.com/aws-builders/call-external-apis-with-oauth-within-step-functions-5cme</guid>
      <description>&lt;p&gt;Until last week, if you needed to call an external API from Step Functions execution, you had to use a Lambda function. And you needed to manage the responses and make the execution fail or success based on that. Also, most probably due to the fact that the external API is protected, custom code is required to manage the authorisation of the api call.&lt;/p&gt;

&lt;p&gt;As you see this involves significant custom code which needs to be maintained by the developers.&lt;/p&gt;

&lt;p&gt;Good news is, in the re:Invent 2023, AWS has introduced native integration to HTTPS APIs from Step Functions which allows to call to any 3rd party API endpoint as a step in the execution and, based on the response, you can perform any business logic remaining in the state machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Under the hood, Step Functions utilises an &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destinations.html#eb-api-destination-connection" rel="noopener noreferrer"&gt;EventBridge connection&lt;/a&gt; to manage the authentication credentials for the connection to the third-party API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create EventBridge Connection
&lt;/h3&gt;

&lt;p&gt;When creating an EventBridge connection, you can select from 3 different authentication options.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic (Username/Password)&lt;/li&gt;
&lt;li&gt;OAuth Client Credentials&lt;/li&gt;
&lt;li&gt;API Key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this example, we will use OAuth client credentials as the authorisation mechanism.&lt;/p&gt;

&lt;p&gt;To create an EventBridge connection with OAuth client credentials, you have to provide the following information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client id&lt;/li&gt;
&lt;li&gt;Client secret&lt;/li&gt;
&lt;li&gt;Auth endpoint &lt;/li&gt;
&lt;li&gt;HTTP method&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the EventBridge connection creation request is received, the provided OAuth credentials are first stored in AWS Secrets Manager.&lt;/p&gt;

&lt;p&gt;Next, it calls the auth endpoint with the given client id and client secret to obtain an auth token.&lt;/p&gt;

&lt;p&gt;If the auth call is successful, the auth token is securely saved in the same secret previously created in Secrets Manager, and this token is used in the calls to the API endpoint.&lt;/p&gt;

&lt;p&gt;If the auth attempt fails, the EventBridge connection is simply de-registered and can no longer be used.&lt;br&gt;
Once the EventBridge connection is successfully created, we can use it in a state machine as the authentication option for the API endpoint.&lt;/p&gt;
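&lt;p&gt;As a rough illustration, such a connection can also be created with the AWS SDK. The sketch below only builds the request parameters for boto3's create_connection call; the connection name, client credentials, and auth endpoint are placeholders.&lt;/p&gt;

```python
# Build the parameters for events.create_connection with OAuth client
# credentials. All concrete values here are illustrative placeholders.
def oauth_connection_params(name, client_id, client_secret, auth_endpoint):
    return {
        "Name": name,
        "AuthorizationType": "OAUTH_CLIENT_CREDENTIALS",
        "AuthParameters": {
            "OAuthParameters": {
                "ClientParameters": {
                    "ClientID": client_id,
                    "ClientSecret": client_secret,
                },
                "AuthorizationEndpoint": auth_endpoint,
                "HttpMethod": "POST",
            }
        },
    }


params = oauth_connection_params(
    "my-api-connection",
    "my-client-id",
    "my-client-secret",
    "https://auth.example.com/oauth/token",
)
# Usage: boto3.client("events").create_connection(**params)
```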
&lt;h3&gt;
  
  
  Create State Machine with 3rd party API
&lt;/h3&gt;

&lt;p&gt;In the state machine, simply add a "Call third-party API" step and add the configuration values: the API endpoint and the HTTP method.&lt;/p&gt;

&lt;p&gt;For authentication, enter the ARN of the EventBridge Connection created previously.&lt;/p&gt;

&lt;p&gt;Also, you can configure the request payload that is sent to the API, and you may use reference paths to build the payload from runtime data.&lt;/p&gt;
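&lt;p&gt;For illustration, the resulting "Call third-party API" task state might look like the following, shown here as a Python dict of the ASL definition; the endpoint, connection ARN, and payload fields are placeholders rather than values from the sample application.&lt;/p&gt;

```python
# Hedged sketch of an ASL "Call third-party API" task state; all values
# below (endpoint, connection ARN, payload) are illustrative placeholders.
call_api_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::http:invoke",
    "Parameters": {
        "ApiEndpoint": "https://api.example.com/orders",
        "Method": "POST",
        "Authentication": {
            "ConnectionArn": "arn:aws:events:eu-west-1:111122223333:connection/my-api-connection/aaaabbbb"
        },
        # Reference paths pull values from the execution input at runtime.
        "RequestBody": {"orderId.$": "$.orderId"},
    },
    "End": True,
}
```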
&lt;h3&gt;
  
  
  Refreshing OAuth Token
&lt;/h3&gt;

&lt;p&gt;Normally, OAuth uses short-lived tokens. When an API request uses a token that has already expired, the API returns a 401 Unauthorized error. In such a case, you first need to renew the token by calling the auth endpoint and then use the new token to call the API.&lt;/p&gt;

&lt;p&gt;In the Step Function HTTP API call, this is taken care of by the EventBridge Connection.&lt;/p&gt;

&lt;p&gt;Let's assume the API returns a 401 error, the standard OAuth response when the token has expired. In that case, the EventBridge connection automatically calls its auth endpoint, retrieves a new token, and updates the Secrets Manager entry.&lt;/p&gt;

&lt;p&gt;So, in the state machine, you need to set up a "Retry" on the "Call third-party API" step for the specific error "States.Http.StatusCode.401". This retry automatically resolves the unauthorized error without any additional steps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You can always catch specific error using the HTTP error code as follows:&lt;br&gt;
&lt;strong&gt;States.Http.StatusCode.[Status_Code]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In EventBridge Connection, OAuth tokens are refreshed when a &lt;strong&gt;401&lt;/strong&gt; or &lt;strong&gt;407&lt;/strong&gt; response is returned.&lt;/p&gt;

&lt;p&gt;Since the Step Functions execution uses the EventBridge connection and Secrets Manager, the necessary permissions must be set in the state machine role.&lt;/p&gt;
&lt;/blockquote&gt;
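&lt;p&gt;The retry described above could be sketched as follows, shown as a Python dict of the ASL Retry block; the interval and attempt counts are illustrative choices, not the sample application's exact values.&lt;/p&gt;

```python
# Illustrative Retry block for the "Call third-party API" state: on a 401,
# the EventBridge connection refreshes the token, so a simple retry succeeds.
retry_on_expired_token = [
    {
        "ErrorEquals": ["States.Http.StatusCode.401"],
        "IntervalSeconds": 1,
        "MaxAttempts": 2,
        "BackoffRate": 2.0,
    }
]
```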
&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;I have created a sample application that uses a Step Functions step to call an external API, built with CDK and Python. The GitHub repository can be found at:&lt;br&gt;
&lt;a href="https://github.com/pubudusj/step-functions-https-api-integration" rel="noopener noreferrer"&gt;https://github.com/pubudusj/step-functions-https-api-integration&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository.&lt;/li&gt;
&lt;li&gt;Install the dependencies and deploy the stack using CDK CLI.&lt;/li&gt;
&lt;li&gt;Once deployed, it will create 2 Lambda functions with Function URLs.&lt;/li&gt;
&lt;li&gt;One Function URL will be used as the API endpoint, while the other will be used as the authorization endpoint.&lt;/li&gt;
&lt;li&gt;An EventBridge connection is also created with the authorization endpoint's Function URL.&lt;/li&gt;
&lt;li&gt;Finally, a state machine is created with a single step that calls the API endpoint, authorised via the EventBridge connection, with 2 retries set up in case the API returns status code 401.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgjn24s4gaevyebsihh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwgjn24s4gaevyebsihh1.png" alt="State Machine with Call 3rd party API state" width="264" height="231"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;As soon as the stack is deployed, you can see a secret created in Secrets Manager with the given client id, client secret and a newly generated auth token.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can initialize a Step Functions execution with the input below.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"set401"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;You can see it successfully completes and the output includes the return value from the API Lambda function. Also, in the Lambda logs, you can see the auth token from the secret is being used in the header as the Bearer token.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;To simulate a 401 error, initialize a Step Functions execution with the input below:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"set401"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;&lt;p&gt;With this input, the API Lambda function will return a 401 error for the initial attempt. The state will immediately start retrying, and the step will succeed on a retry.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;If you check the secret in Secrets Manager, you will see that the token has been updated with a new value. Also, if you check the API Lambda's logs, you can see the request headers include different Bearer token values in the initial and retry attempts.&lt;/p&gt;&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Express vs Standard workflows
&lt;/h2&gt;

&lt;p&gt;This feature works in both Express and Standard state machine types. However, keep in mind that although Express workflows are generally more cost-effective than Standard workflows, the cost of an Express workflow also depends on the execution duration.&lt;/p&gt;

&lt;p&gt;So, if your API is slow, it will cost more. For the Standard type, the call is a single state transition, so the time it takes to call the API does not affect pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The introduction of HTTP integration support in AWS Step Functions is a significant enhancement that simplifies the process of invoking external APIs. This feature not only reduces the custom Lambda code needed for API calls but also eliminates the necessity for custom code to handle complex retry logic when generating and renewing authentication tokens.&lt;/p&gt;

&lt;p&gt;With this capability, developers can seamlessly utilise third-party APIs with minimal code, and enhance the overall efficiency of workflow orchestration in AWS Step Functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Step Functions external endpoints AWS Blog: &lt;a href="https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step Functions external endpoints documentation: &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/connect-third-party-apis.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/step-functions/latest/dg/connect-third-party-apis.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;EventBridge Connection documentation: &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destinations.html#eb-api-destination-connection" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-api-destinations.html#eb-api-destination-connection&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>stepfunctions</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>3 ways to catch all the events going through the EventBridge Event Bus</title>
      <dc:creator>Pubudu Jayawardana</dc:creator>
      <pubDate>Wed, 01 Nov 2023 20:49:11 +0000</pubDate>
      <link>https://forem.com/aws-builders/3-ways-to-catch-all-the-events-going-through-the-eventbridge-event-bus-aja</link>
      <guid>https://forem.com/aws-builders/3-ways-to-catch-all-the-events-going-through-the-eventbridge-event-bus-aja</guid>
      <description>&lt;p&gt;For some requirements, you will need to record all the events that go through your EventBridge Event Bus. CloudWatch can be a suitable target for this. &lt;a href="https://repost.aws/knowledge-center/cloudwatch-log-group-eventbridge" rel="noopener noreferrer"&gt;https://repost.aws/knowledge-center/cloudwatch-log-group-eventbridge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog post, I am going to discuss 3 different rules that can be used to implement catch-all functionality for an EventBridge event bus.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using prefix
&lt;/h2&gt;

&lt;p&gt;You can use the “prefix” pattern matching feature of an EventBridge rule to capture all the events.&lt;br&gt;
Setting the prefix value to an empty string does the trick.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EventRuleCatchAllWithPrefix:
    Type: AWS::Events::Rule
    Properties:
      Description: "EventRule to catch all using prefix"
      EventBusName: !Ref MyEventBus
      EventPattern:
        source:
          - prefix: ""
      Targets:
        - Arn: !GetAtt CatchAllWithPrefixLogGroup.Arn
          Id: "TargetCatchAllWithPrefix"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to the prefix matching docs here: &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-prefix-matching" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-prefix-matching&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, I have used the field “source” to apply the prefix filter, since every event going through an event bus has a source field. As an alternative, you may use any field that always exists in the event.&lt;/p&gt;
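&lt;p&gt;For reference, every event delivered through an event bus carries the standard envelope below (shown as a Python dict with illustrative placeholder values), which is why fields like “source” and “version” are safe anchors for a catch-all pattern.&lt;/p&gt;

```python
# The standard EventBridge event envelope; all values are placeholders.
# "source" and "version" are always present, so a pattern on either field
# matches every event on the bus.
sample_event = {
    "version": "0",
    "id": "11112222-3333-4444-5555-666677778888",
    "detail-type": "OrderCreated",
    "source": "com.example.orders",
    "account": "123456789012",
    "time": "2023-11-01T20:49:11Z",
    "region": "eu-west-1",
    "resources": [],
    "detail": {"orderId": "123"},
}
```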

&lt;h2&gt;
  
  
  Using version
&lt;/h2&gt;

&lt;p&gt;Similar to prefix matching, we can exactly match the “version” of the event to capture all the events. Every event going through an event bus includes the version field with the value 0. As of now, there is no version value other than 0, but this might change in the future.&lt;/p&gt;

&lt;p&gt;Here is an example of how you can define the catch-all rule by exactly matching the version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;EventRuleCatchAllWithVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Events::Rule&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EventRule&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;catch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;version"&lt;/span&gt;
      &lt;span class="na"&gt;EventBusName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;MyEventBus&lt;/span&gt;
      &lt;span class="na"&gt;EventPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;Targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;CatchAllWithVersionLogGroup.Arn&lt;/span&gt;
          &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TargetCatchAllWithVersion"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using wildcard
&lt;/h2&gt;

&lt;p&gt;EventBridge recently announced the support for wildcards in their event rules (&lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-eventbridge-wildcard-filters-rules/" rel="noopener noreferrer"&gt;https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-eventbridge-wildcard-filters-rules/&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We can use this to form the catch-all rule as follows: use any field that always exists in the event (here, the “source” field) and apply the wildcard “*”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;EventRuleCatchAllWithWildcard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::Events::Rule&lt;/span&gt;
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EventRule&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;catch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;using&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;wildcard"&lt;/span&gt;
      &lt;span class="na"&gt;EventBusName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;MyEventBus&lt;/span&gt;
      &lt;span class="na"&gt;EventPattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;wildcard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;
      &lt;span class="na"&gt;Targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!GetAtt&lt;/span&gt; &lt;span class="s"&gt;CatchAllWithWildcardLogGroup.Arn&lt;/span&gt;
          &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TargetCatchAllWithWildcard"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to the wildcard matching docs here: &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-wildcard-matching" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html#eb-filtering-wildcard-matching&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try this yourself
&lt;/h2&gt;

&lt;p&gt;Here is the GitHub repository I created to demonstrate this functionality. You can deploy it into your AWS environment using the AWS SAM CLI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pubudusj/eventbridge-catch-all" rel="noopener noreferrer"&gt;https://github.com/pubudusj/eventbridge-catch-all&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once deployed, it will create an event bus, the 3 different rules discussed above, and 3 different CloudWatch log groups as targets for those rules.&lt;/p&gt;

&lt;p&gt;When you send any message to the event bus, you can see it end up in all 3 CloudWatch log groups.&lt;/p&gt;
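&lt;p&gt;For example, you could send a test event with a snippet like the following; the bus name, source, and detail values are placeholders, not values from the repository.&lt;/p&gt;

```python
import json

# Hypothetical test event for the deployed bus. The entry fields match the
# shape expected by EventBridge's PutEvents API.
entry = {
    "EventBusName": "my-event-bus",
    "Source": "com.example.orders",
    "DetailType": "OrderCreated",
    "Detail": json.dumps({"orderId": "123"}),
}
# Usage: boto3.client("events").put_events(Entries=[entry])
```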

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;AWS is well known for providing more than one way to achieve the same result. This is one such example: you can implement catch-all functionality for your event bus by defining rules in 3 different ways.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>eventbridge</category>
      <category>eventdriven</category>
    </item>
  </channel>
</rss>
