Forem: Ran Isenberg

Protect Your API Gateway with AWS WAF using CDK

Ran Isenberg — Mon, 16 Dec 2024 07:36:30 +0000

In this post, you will learn about the basics of the AWS Web Application Firewall (WAF) and write CDK code to protect a REST API Gateway service. We will enable WAF metrics, add managed rules to the ACL, and enable logging into a CloudWatch log group.

This is the second of three posts in the WAF series.

In the **first** post, I provided tips and tricks for using AWS WAF for a production-ready SaaS service.

In the third post, we will review AWS Firewall Manager and how it allows an organization to manage AWS Web Application Firewall ACLs at scale.

This blog post was originally published on my website, “Ran The Builder.”

AWS Web Application Firewall (WAF) Introduction

AWS WAF is a web application firewall that lets you monitor or block the HTTP(S) requests forwarded to your protected web application resources. You can protect the following resource types:

Amazon CloudFront distribution
Amazon API Gateway REST API
Application Load Balancer
AWS AppSync GraphQL API
Amazon Cognito user pool
AWS App Runner service
AWS Verified Access instance

As you can see from the list above, AWS WAF can protect both your container-based and Serverless services. I covered the threats that WAF protects you from in the **first** post in the series.

To understand how WAF provides an extra layer of security, we first need to understand how it works. Let’s see how we can configure WAF to protect our SaaS service.

ACL and Rules

To use WAF with your AWS resources (as defined above), create a WAF access control list (ACL), define its rules, and associate the ACL with the resource you wish to protect.

The associated resources forward incoming requests to AWS WAF for inspection by the web ACL. In your web ACL, you create rules to define traffic patterns to look for in requests and to specify the actions to take on matching requests. — AWS

Each rule has an action to run when matched. Action choices include the following:

Allow the requests to go to the protected resource for processing and response.
Block the requests.
Count the requests.
Run CAPTCHA or challenge checks against requests to verify human users and standard browser use.

AWS provides managed rules out of the box, and I highly suggest you use them as much as possible. These rules are constantly updated by AWS and easy to use, so there’s really no reason to reinvent the wheel. However, sometimes, we need to use custom rules.

A rule can have multiple conditions with an ‘and,’ ‘or,’ or ‘not’ between them. A condition can match an HTTP header against a regex, apply a rate-based limit, match traffic originating from specific countries, or match an IP set (for example, to block traffic that doesn’t originate from your work VPN IP ranges). WAF also has the option to transform headers before examining them. You have many custom options; just be aware that the more advanced and custom you go, the more WCU you use (see WCU section).

Another important notion to remember is that rules have priority. WAF examines traffic from the top priority to the bottom, and when a rule matches its conditions, it performs the defined action, whatever it may be.
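To make the priority idea concrete, here is a minimal sketch in plain Python that models how an ACL walks its rules. It is not AWS code: the `Rule` class, the `evaluate` function, and the sample rules are hypothetical names invented for illustration. It assumes that lower priority numbers are evaluated first and that a ‘count’ action records the match and keeps going, while ‘allow’ and ‘block’ end the evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]


@dataclass
class Rule:
    name: str
    priority: int  # lower number = evaluated earlier
    action: str  # "allow", "block", or "count"
    matches: Callable[[Request], bool]


def evaluate(rules: List[Rule], request: Request,
             default_action: str = "allow") -> Tuple[str, List[str]]:
    """Return the final action and the names of the 'count' rules that matched."""
    counted: List[str] = []
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.matches(request):
            continue
        if rule.action == "count":
            # Count is non-terminating: record it and keep evaluating.
            counted.append(rule.name)
            continue
        # Allow and block are terminating: stop at the first match.
        return rule.action, counted
    # No terminating rule matched, so the ACL default action applies.
    return default_action, counted


# Hypothetical rules, deliberately listed out of priority order.
rules = [
    Rule("block-country", 1, "block", lambda r: r.get("country") == "XX"),
    Rule("count-curl", 0, "count", lambda r: "curl" in r.get("user-agent", "")),
    Rule("allow-office-ip", 2, "allow", lambda r: r.get("ip", "").startswith("203.0.113.")),
]

request = {"country": "XX", "user-agent": "curl/8.0", "ip": "198.51.100.7"}
print(evaluate(rules, request))  # ('block', ['count-curl'])
```

Starting a new rule in ‘count’ mode is a cheap way to see what it would match before you switch it to ‘block’.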

Logging is also an essential aspect of WAF debugging. In this post, I will cover logging to a CW log group, but there are two other options — I covered them with my recommendations in the **first post.**

Lastly, you can enable CloudWatch metrics for your WAF rules. I suggest you enable them as they provide valuable insight into whether they make a difference (know what and why you’re paying for!). For more information, check the AWS docs.

Now that we have the basics down, let’s move on to the CDK code.

Sample Serverless Service Architecture

The ‘orders’ service allows users to order products. We will use my open-source Serverless template project: AWS Lambda Handler Cookbook.

This repository provides a working, deployable, open-source, serverless service template with an AWS Lambda function and AWS CDK Python code, all the best practices, and a complete CI/CD pipeline. You can start a serverless service in 3 clicks!

Now, let’s protect our API Gateway from DDoS and other attacks or disruptions.

Let’s add an AWS WAF ACL to the mix and associate it with our API Gateway:

The API Gateway will pass traffic to our AWS WAF ACL for inspection. The ACL will try to match the traffic against its list of rules (ordered by priority) and execute the action of the first rule that matches.

AWS CDK Code

For the API Gateway CDK code click here.

For the complete WAF code click here.

Let’s review the WAF CDK code below. We want to create a new WAF ACL with three AWS managed rules and enable metrics and logging:

In line 10 we get the API Gateway construct we wish to associate with our WAF ACL.

In lines 14–88, we define the WAF ACL, enable CW metrics, and define the ACL list of rules.

Between lines 23 and 88, we define the rules with a priority, statements, and a visibility configuration that includes CloudWatch metrics for comprehensive monitoring and control.

In line 91 we associate our service API Gateway to the ACL.

In lines 94–128, we add the logging configuration. We create the log group, allow WAF to create log streams inside it and enable the logging configurations.
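For readers who want to see the shape of the configuration without opening the repository, here is a rough, dependency-free sketch. It is not the CDK code from the walkthrough: it only builds plain dictionaries that follow the property names of the CloudFormation `AWS::WAFv2::WebACL` resource, which is what the CDK L1 construct ultimately renders. The helper names and the three managed rule groups chosen here are my assumptions for illustration, not necessarily the ones used in the project.

```python
import json
from typing import Any, Dict, List


def visibility_config(metric_name: str) -> Dict[str, Any]:
    # Enables CloudWatch metrics and request sampling for an ACL or a rule.
    return {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": metric_name,
    }


def managed_rule(group_name: str, priority: int) -> Dict[str, Any]:
    # One AWS managed rule group; OverrideAction "None" keeps the group's own actions.
    return {
        "Name": group_name,
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {"VendorName": "AWS", "Name": group_name}
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": visibility_config(group_name),
    }


def web_acl(name: str, rule_group_names: List[str]) -> Dict[str, Any]:
    # REGIONAL scope is the one used for API Gateway; list order becomes priority.
    return {
        "Name": name,
        "Scope": "REGIONAL",
        "DefaultAction": {"Allow": {}},
        "Rules": [managed_rule(n, i) for i, n in enumerate(rule_group_names)],
        "VisibilityConfig": visibility_config(name),
    }


def waf_log_group_name(service: str) -> str:
    # WAF only accepts CloudWatch log groups whose name starts with this prefix.
    return f"aws-waf-logs-{service}"


# Assumed rule groups, chosen only as an example.
acl = web_acl(
    "orders-waf",
    [
        "AWSManagedRulesCommonRuleSet",
        "AWSManagedRulesKnownBadInputsRuleSet",
        "AWSManagedRulesAmazonIpReputationList",
    ],
)
print(json.dumps(acl["Rules"][0], indent=2))
print(waf_log_group_name("orders"))  # aws-waf-logs-orders
```

Keeping the rule list in one place makes it easy to reuse the same set across services and to review priority changes in a pull request.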

After deployment, we get the AWS managed rules we defined ordered by their priority:

and this logging config:

That’s it!

Check WAF’s advanced logging features, such as filters or redacted fields, in the official documentation.

AWS AppSync Events — Serverless WebSockets Done Right or Just Different?

Ran Isenberg — Mon, 18 Nov 2024 06:36:05 +0000

On October 30th, 2024, AWS announced a new feature: AppSync Events. This feature lets developers easily broadcast real-time event data to a few or millions of subscribers using secure and performant Serverless WebSocket APIs. But wait, didn’t we already have Serverless web sockets with API Gateway Websocket APIs? Well, yes, but not exactly like this.

In this post, you will learn about AppSync Events. I’ll cover its basics and provide my take on its future, use cases, usage, and how it differs from API Gateway Websockets. I’ll also provide a working JavaScript GitHub repository so you can play with it yourself.

This blog post was originally published on my website, “Ran The Builder.”

What’s a WebSocket?

**WebSocket** is a computer communications protocol, providing a simultaneous two-way communication channel over a single Transmission Control Protocol (TCP) connection (Wikipedia)

Web sockets are nothing new. They are used in every social application on your phone and on numerous websites that show live scores and real-time updates. Have you ever wondered how your Chrome tab magically gets updated with new information? No, it’s not polling. It’s WebSockets. The application backend pushes an update (either a broadcast/unicast/multicast) to your browser’s web socket.

While the protocol has its own implementation on top of TCP, it starts its handshake on top of the HTTP protocol and then “upgrades” to the WebSocket protocol, where it does its own thing according to the RFC.
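As a small illustration of that upgrade step, the sketch below computes the `Sec-WebSocket-Accept` value a server returns during the handshake, as defined in RFC 6455: the client’s `Sec-WebSocket-Key` is concatenated with a fixed GUID, hashed with SHA-1, and base64-encoded. The function name is mine; the key in the example is the sample one from the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a client key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455, section 1.3.
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Once the client verifies this value, both sides stop speaking HTTP and exchange WebSocket frames over the same TCP connection.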

If you want to learn more about the protocol, check out this video.
Let’s move on to AppSync Events.

What is AppSync Events?

AWS AppSync Events lets you create secure and performant serverless WebSocket APIs that can broadcast real-time event data to millions of subscribers, without you having to manage connections or resource scaling. — AWS

First of all, let’s get it out of the way. This has nothing to do with GraphQL and AppSync.

It’s just bad branding for this new interpretation of Serverless websockets.

Core Concepts & Terminology

AppSync Events interprets WebSockets differently, both at the protocol level and in serverless management and message interpretation.

To my understanding, according to the documentation, they added another layer of abstraction, an inner sub-protocol (AppSync-y) for managing a new channel entity on top of a single WebSocket. This can be seen from the AWS developer’s guide diagram:

Once clients connect to the AppSync endpoint, they can try to subscribe to a channel.

The cool trick here is connecting to multiple channels on the same web socket. It gives the illusion of numerous “pub-sub” topics/channels, but it’s all the same abstracted web socket underneath. Nice touch!

Subscription requires authorization and there are five options:

These options cover every standard authorization method. I usually opt for a Lambda function that checks the web application’s secure cookie and JWT and additional checks.

The documentation goes into a lot of detail and even provides a Lambda function input schema example and a sample implementation. Impressive!

Returning to the topic of channels, there’s the /default channel, but you can add more channels. A channel namespace is composed of a maximum of five segments, offering a high degree of flexibility. Think of it as UI-oriented pub/sub topics.

For example, you can create a channel for general user updates, or by categories such as NBA, Football, and so on. The ability to use up to five segments gives you the power to publish updates about different sports, like /users/nba and /users/football.

In addition, you can subscribe to /users/nba/ to get general NBA news, or you can subscribe to /users/nba/* (wildcard!), which means you get all NBA news from the subsegments like /users/nba/bulls and /users/nba/lakers.
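The matching rules are easier to reason about with a tiny model. The sketch below is my own approximation in plain Python, not AppSync’s implementation: `split_channel` and `subscription_matches` are hypothetical helpers, the five-segment cap is applied to the whole path, and I assume a trailing `*` matches only channels below the prefix, not the prefix itself.

```python
from typing import List

MAX_SEGMENTS = 5  # cap mentioned above, applied here to the full path (assumption)


def split_channel(path: str) -> List[str]:
    """Split '/users/nba/bulls' into ['users', 'nba', 'bulls'] and check the depth."""
    segments = [s for s in path.strip("/").split("/") if s]
    if not 1 <= len(segments) <= MAX_SEGMENTS:
        raise ValueError(f"channel must have 1 to {MAX_SEGMENTS} segments: {path!r}")
    return segments


def subscription_matches(subscription: str, published_to: str) -> bool:
    """Would a subscriber on `subscription` receive an event sent to `published_to`?"""
    sub = split_channel(subscription)
    pub = split_channel(published_to)
    if sub[-1] == "*":
        prefix = sub[:-1]
        # Wildcard: anything strictly below the prefix matches.
        return len(pub) > len(prefix) and pub[: len(prefix)] == prefix
    # No wildcard: only the exact channel matches.
    return sub == pub


print(subscription_matches("/users/nba/*", "/users/nba/bulls"))  # True
print(subscription_matches("/users/nba/", "/users/nba/bulls"))   # False
print(subscription_matches("/users/nba/*", "/users/football"))   # False
```

A model like this is handy in unit tests for your own channel design, so a refactor doesn’t silently widen what a subscriber can see.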

When it comes to publishing, this can be done from both the frontend and backend and requires authorization. You can set a different authorization method per channel and use a Lambda function (a nice touch!). If you have permission to publish messages to these channels, you only need to send an HTTP request to the channel (and the segments) address. You can publish from a Lambda function or EventBridge (API destination, I assume), turning AppSync Events into an event-driven architecture enabler.
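To give a feel for how small a backend publisher can be, here is a sketch that only builds the HTTP request and does not send it. Treat the details as assumptions based on my reading of the documentation: the `/event` path, the `x-api-key` header for API key authorization, and events being sent as a list of JSON strings. The domain and key below are placeholders, and `build_publish_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Any, List


def build_publish_request(http_domain: str, api_key: str, channel: str,
                          events: List[Any]) -> urllib.request.Request:
    """Build (but do not send) a publish request for one channel."""
    body = {
        "channel": channel,
        # Assumption: each event travels as a JSON-encoded string.
        "events": [json.dumps(event) for event in events],
    }
    return urllib.request.Request(
        url=f"https://{http_domain}/event",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


# Placeholder domain and key; replace them with the values of your own Events API.
req = build_publish_request("example.invalid", "my-api-key", "/default/test",
                            [{"message": "hello"}])
print(req.get_method(), req.full_url)  # POST https://example.invalid/event
print(req.data.decode("utf-8"))
```

Sending it is a matter of passing `req` to `urllib.request.urlopen`, ideally with a timeout and error handling around it.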

First Impressions and Usage

I followed the console and documentation and created my first Events API. I don’t remember the last time I saw such well-made documentation for a new launch. There’s support for many authorizers and even code examples for the client side. One glaring missing feature is CDK support, which is limited to L1 constructs. At the time of writing, there’s an open issue for CDK L2. I started with the basics: the not really secure API key authorization.

I didn’t use the console’s pub/sub editor. I went straight to playing with a web application and tried subscribing to and publishing to a channel. I used the code examples from the console. And it worked.

I had some minor issues because I’m a complete frontend noob, but after some help from my amazing colleague Afik Grinstein, we had a working Vite project.

The complete GitHub repo is found at https://github.com/ran-isenberg/appsync-events-client.

Let’s review what we built:

Our client subscribes to the ‘/default/test’ channel and segment. Once subscribed, it publishes events via the publish UI text input section.

The events logs section displays all published messages on ‘/default/test’.

and yes, it’s really a websocket with AppSync protocol inside:

and:

Here’s the JS code, pretty simple when using Amplify SDK:

I was able to subscribe to another channel, for example, ‘/users’, and check the wildcard subscription. Yeah, it works, and it’s super easy. When dealing with unicast or customer-specific channels, you must add custom authorization to check whether the user can subscribe to that tenant/user ID channel.

Anyways, honestly, this is impressive, it just works.

If you are interested in the full code and HTML file, it’s here.

Let’s move to the insights section where things get spicy.

Insights and Tips

In this section, I’ll share some insights gained from using the service for several hours, leading to the summary and conclusion — will I use it or not?

Security

I like the flexibility of the authorization options. Five authorization types for the initial connection and extra authorization for channel-specific subscriptions are more than enough.

You can also add a per-channel connection authorization method and fine-tune it for publish and subscription requests with a Lambda of your choice. Here’s a handler example.

There’s AWS WAF support, which doesn’t exist for API Gateway Websockets. It’s mandatory since we are exposing an HTTP endpoint for publishing messages.

Check out my blog post here to learn about AWS WAF.

No Connection Tables

There’s no need to manage connection tables; it’s genuinely Serverless!

If you have no idea why I’m surprised, it’s because with API Gateway WebSockets, when you want to send a message to a user, you need to know their WebSocket connection ID. This mapping is visible during the handshake process ($connect) when you authenticate the user and allow the connection, meaning you need to save it somewhere for your backend service. DynamoDB is an excellent option for storing this key-value information, but it’s code you need to add, write, maintain, and test. It’s not a big deal, but AWS didn’t give it to us out of the box. Here, it’s abstracted. You don’t know the connection ID (and I’m sure there is one); you speak the language of channels. You don’t even know who is connected to the channels; the assumption here is that you don’t even care. You want to publish a message to a channel, and whoever is subscribed will get that message. Classic pub-sub.

To sum it up, it’s a different approach — channel vs. connection id.

Mass Broadcast is Easy

With API Gateway sockets, you need to iterate the connection table, fetch the relevant connection IDs, and then use the AWS SDK to publish a message. You need to handle errors, retries, and throttling, and it gets very challenging as the target audience list grows.
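Here is roughly what that loop looks like, as a self-contained sketch rather than production code. The `broadcast` function, the `ConnectionGone` exception, and the injected `send` callable are hypothetical; in a real service, `send` would wrap the API Gateway Management API `post_to_connection` call from the AWS SDK and translate its “gone” error, and the connection IDs would come from your connection table.

```python
from typing import Callable, Dict, Iterable, List


class ConnectionGone(Exception):
    """Stand-in for the error raised when a connection ID is already closed."""


def broadcast(connection_ids: Iterable[str], message: bytes,
              send: Callable[[str, bytes], None],
              max_attempts: int = 3) -> Dict[str, List[str]]:
    """Send one message to every connection and report what happened to each."""
    result: Dict[str, List[str]] = {"delivered": [], "gone": [], "failed": []}
    for connection_id in connection_ids:
        for attempt in range(1, max_attempts + 1):
            try:
                send(connection_id, message)
                result["delivered"].append(connection_id)
                break
            except ConnectionGone:
                # Closed connection: retrying is pointless, clean up the table later.
                result["gone"].append(connection_id)
                break
            except Exception:
                # Throttling or a transient error: retry up to max_attempts.
                if attempt == max_attempts:
                    result["failed"].append(connection_id)
    return result


# A fake sender that simulates a closed, a flaky, and a broken connection.
attempts: Dict[str, int] = {}


def fake_send(connection_id: str, data: bytes) -> None:
    attempts[connection_id] = attempts.get(connection_id, 0) + 1
    if connection_id == "closed":
        raise ConnectionGone()
    if connection_id == "flaky" and attempts[connection_id] < 2:
        raise RuntimeError("throttled")
    if connection_id == "down":
        raise RuntimeError("boom")


report = broadcast(["ok", "closed", "flaky", "down"], b"hello", fake_send)
print(report)  # {'delivered': ['ok', 'flaky'], 'gone': ['closed'], 'failed': ['down']}
```

A real version would also need backoff between retries and batching or parallelism, which is exactly the work that grows with the audience size.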

With AppSync Events, all you need to do is make ONE API call to an HTTP endpoint with the right authorization — mind-blowing. Simple.

However, you don’t have any feedback on which users got or didn’t get the message. With API GW web sockets, if you publish to a closed connection ID, you get an exception and know it. Maybe that’s unimportant to your use case, but you should be aware of that.

Dear AppSync Events team, I have a suggestion. What if there could be a message queue feature for channels? A feature where specific channels (/users) store data for a predefined time even if users log out. Then, when the channel subscription is restored (/users/), the missed messages pop out of the queue.

Tenant/User Isolation Requires Channel Design

Unicast is harder (kind of, but not by much). Let’s say I need to send user-specific messages over the websocket and all the customers’ general messages, but they are customer-specific and cannot be sent to everybody.

Our channels namespace can look something like this: users subscribe to ‘/tenant/’ and ‘/users/’. Make sure you don’t use a wildcard subscription in this instance; otherwise, you’d break tenant and user isolation.

Be advised: Channel namespaces are not infinite. There’s a quota of 50, but it’s a soft limit.

In addition, each namespace has a limitation of 50 characters, so pay attention when you add tenant IDs or user IDs to the mix; a standard UUID is 36 characters.
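A small guard in your backend can catch both problems (length and accidental wildcards) before a channel name ever reaches the client. This is a sketch under assumptions: `scoped_channel` is a hypothetical helper, the 50-character limit quoted above is applied per segment, and the ‘tenant’ and ‘users’ prefixes follow the example above.

```python
import uuid

MAX_SEGMENT_LENGTH = 50  # limit quoted above; verify against the current quotas


def scoped_channel(namespace: str, identifier: str) -> str:
    """Build '/<namespace>/<identifier>' and refuse unsafe or oversized segments."""
    for segment in (namespace, identifier):
        if not segment or len(segment) > MAX_SEGMENT_LENGTH:
            raise ValueError(f"segment must be 1 to {MAX_SEGMENT_LENGTH} characters: {segment!r}")
        if "*" in segment or "/" in segment:
            # A wildcard or extra slash here would break tenant and user isolation.
            raise ValueError(f"segment must not contain '*' or '/': {segment!r}")
    return f"/{namespace}/{identifier}"


tenant_id = str(uuid.uuid4())
print(len(tenant_id))  # 36
print(scoped_channel("tenant", tenant_id))
```

A UUID fits comfortably, but composite identifiers (for example, a tenant ID joined with a user ID in one segment) would not, so it is worth deciding on the layout early.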

(Over) Flexibility Means Complexity

Designing channels and authorizing users to channels adds complexity. You need to understand how to model them and remember what type of data (personal or not) goes where because there’s no protection here. You need to build it. You need to build the logic into your custom channel authorization. You also need to ensure your front-end clients subscribe to the correct channels. In addition, the more entities you add to the system, the more code changes you need to make. If you add an ‘admin’ persona, you might need to add a new namespace or a different/extra subscription with all their updated authorization handlers.

It’s much more complicated than in the API GW WebSockets, where you connect and listen to messages. Sure, I need to build a DynamoDB table that keeps a mapping between the connection ID, the user ID, and the session ID, but that’s pretty basic.

Developer Experience

Documentation is good but not perfect. CDK support is severely lacking.

The pub/sub editor is a nice touch for quick testing of a channel and its authorization. I don’t expect backend developers to work on the console on a daily basis, though. It’s a good option for developers who want to test a new channel and see data from their backend publisher.

The Amplify code examples are good; they got me started very quickly. If you don’t use their SDK, you can still use Events API, but you will need to add all the AppSync headers in the correct places.

Lastly, you have to use the built-in JavaScript event handler for ‘onSubscribe’ and ‘onPublish’ custom channel handlers. You can’t bring your own function or dependencies. Not the best experience in my view.

Pricing

As Jeremy Daly mentioned:

the pricing is 3x cheaper than API Gateway WebSockets for connection minutes. $0.08 per million connection minutes for AppSync versus $0.25 per million for API Gateway. Message transfers seem to be the same, but AppSync seems to charge you for every “operation”.

There’s also a generous free tier.

That’s a win in my book for AppSync over API Gateway.
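To put the connection-minute prices quoted above into perspective, here is a quick back-of-the-envelope calculation. The workload (10,000 clients connected around the clock for 30 days) is a made-up example, and the sketch ignores message and operation charges as well as the free tier.

```python
# Prices per million connection minutes, as quoted above.
APPSYNC_PER_MILLION_MINUTES = 0.08
API_GW_PER_MILLION_MINUTES = 0.25


def connection_cost(connections: int, days: int, price_per_million_minutes: float) -> float:
    """Cost in USD of keeping `connections` clients connected for `days` days."""
    minutes = connections * days * 24 * 60
    return round(minutes / 1_000_000 * price_per_million_minutes, 2)


# Hypothetical workload: 10,000 always-on connections for 30 days.
print(connection_cost(10_000, 30, APPSYNC_PER_MILLION_MINUTES))  # 34.56
print(connection_cost(10_000, 30, API_GW_PER_MILLION_MINUTES))   # 108.0
print(round(API_GW_PER_MILLION_MINUTES / APPSYNC_PER_MILLION_MINUTES, 3))  # 3.125
```

The ratio stays at roughly 3x regardless of scale, so the absolute saving only becomes meaningful with a large number of long-lived connections.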

Summary — Done Right or Just Different — Or in Other Words, Will I Use it?

AppSync Events is a new take on Serverless WebSockets. It adds complexity to authorization and channel management but fixes the one major downside of using API GW web sockets for mass broadcast.

If you don’t require mass broadcast and always send user-specific messages, the regular API GW websockets are a solid, battle-tested option. I’ll write a blog post about it soon with CDK code examples, too!

The service has a bright future and lots of potential. I highly recommend it for POCs, startups, and other companies that are willing to be early adopters.

However, as an enterprise architect, I must think carefully about it. The developer experience is not great, GovCloud isn’t supported yet, and being an early adopter means the service can break or change out of the blue. In addition, history offers a warning: Evidently and AppConfig offered similar feature sets with some differences, and in the end, AppConfig remained, and Evidently didn’t. By the way, if you want to learn more about AppConfig for feature flags, check my post here.

But who knows, maybe it’s the other way around. Maybe API Gateway web sockets will be deprecated over time. It all depends on AppSync Events adoption and on the service team’s improvement of the developer experience and documentation.

AWS WAF Essentials: Securing Your SaaS Services Against Cyber Threats

Ran Isenberg — Wed, 13 Nov 2024 07:33:14 +0000

I work in a cybersecurity company, CyberArk, and security is a concept that is constantly on my mind when I build SaaS services on AWS. It has to be — malicious actors are constantly looking for vulnerabilities, waiting for someone to slip up or for the perfect moment to launch a DDoS attack against your cloud service. As a result, we must prepare and secure our services. We can delegate most of the heavy lifting (at a cost) to AWS.

In this post, you will learn about the AWS Web Application Firewall (WAF), what it is for, tips, and insights for visibility, ownership, governance (and more) to protect your SaaS services.

This blog post was originally published on my website, “Ran The Builder.”

You’ve Got to Secure Your SaaS Application

DDoS attacks, whether deployed by individuals or botnets, flood servers with requests and overwhelm them with traffic, which leads to the hosted services and sites being unavailable for users and visitors — Akamai

When you build public APIs or cloud-based applications, malicious actors will eventually attempt to disrupt and damage your customers’ experience.

These attacks, known as distributed denial of service attacks (DDoS attacks), are carried out automatically. They are orchestrated by a network of compromised and remotely controlled machines, also known as bots, which are used to flood the target with an overwhelming amount of traffic.

A bot is a computer program that automates interactions with web properties over the Internet — CloudFlare

However, bots don’t always launch full-scale attacks. They access sites for various reasons, including indexing (think about the Google search engine), which makes them “good” bots.

A bad bot is a computer program that tries either to steal data by scraping or to probe for vulnerabilities in preparation for an incoming attack. Luckily, we can protect ourselves from malicious bots while allowing the “good” bots to do their work. If you want to learn more about good and bad bots, check out Cloudflare’s article.

Blocking DDoS attacks and malicious bots will maintain the integrity and performance of your SaaS service. Let’s see how we can do that with AWS Web Application Firewall.

If you want to learn more about DDoS and the most famous DDoS attacks in history, check out this article.

AWS Web Application Firewall (WAF) Introduction

AWS WAF is a web application firewall that lets you monitor or block the HTTP(S) requests forwarded to your protected web application resources. You can protect the following resource types:

Amazon CloudFront distribution
Amazon API Gateway REST API
Application Load Balancer
AWS AppSync GraphQL API
Amazon Cognito user pool
AWS App Runner service
AWS Verified Access instance

As you can see from the list above, AWS WAF can protect both your container-based and Serverless services.

To understand how WAF provides an extra layer of security, we first need to understand how it works. Let’s see how we can configure WAF to protect our SaaS service.

ACL and Rules

To use WAF with your AWS resources (as defined above), create a WAF access control list (ACL), define its rules, and associate the ACL with the resource you wish to protect.

The associated resources forward incoming requests to AWS WAF for inspection by the web ACL. In your web ACL, you create rules to define traffic patterns to look for in requests and to specify the actions to take on matching requests. — AWS

Each rule has an action to run when matched. Action choices include the following:

Allow the requests to go to the protected resource for processing and response.
Block the requests.
Count the requests.
Run CAPTCHA or challenge checks against requests to verify human users and standard browser use.

However, sometimes, we need to use custom rules.

Lastly, you can enable CloudWatch metrics for your WAF rules. I suggest you enable them as they provide valuable insight into whether they make a difference (know what and why you’re paying for!). For more information, check the AWS docs.

For other rules-building tips, check out this post and always consult with a security expert.

Now that we have the basics out of the way let’s head over to my tips and tricks section, which discusses working with WAF and using it in an enterprise SaaS environment.

AWS WAF Tips & Tricks

In this section, I will provide insights and tips I’ve gathered using AWS WAF in my organization. The tips span team ownership, organizational governance, logs, and debugging issues in production and ongoing WAF ACL maintenance.

Ownership

Security experts, whether they are internal security architects or external security advisors, play a crucial role in analyzing your service’s attack vector and constructing the set of ACL rules. The security architect also has the task of monitoring the overall adoption of WAF across the organization and the rules metrics over time (are they making an impact? do they match?).

Developers, on the other hand, are responsible for translating these rules into AWS CDK or Terraform IaC templates (more on that in the second post of the series). The developers will deploy the WAF ACL, associate it with their service resource (API Gateway), and add custom rules if needed.

If there’s a WAF issue (for example, customers are unable to connect to the service), developers can view the WAF logs and traffic and resolve the incident with the help of the security architects.

This approach is a good balance between governance and team independence.

Governance

Speaking of governance, in the next post, we will discuss how to manage these rules in a centralized manner. Meanwhile, it’s recommended to use a CDK construct or a Terraform template as a black-box architectural pattern containing all the recommended rules and deploy them across the organization. You can read more about this concept in my AWS.com article with Anton Aleksandrov “streamlining serverless governance by codifying architectural blueprints”.

In my third article in the series, we will discuss how we take governance further and utilize AWS Firewall Manager for centralized security governance with AWS WAF.

Adding ACL Rules Over Time

Security changes and adapts over time. Your ACL rules will change. However, adding a new rule (especially with a ‘block’ action) can damage customers’ experience or even block them when misconfigured. As such, having two ACL variants is essential: dev and production. As the name states, dev ACLs are used in the development accounts and are a great testing ground for new rules, blocking patterns, or other techniques. Try to think out of the box and simulate customer traffic as much as possible.

Once you are sure the new rules will not alter the customer’s experience, add them to the production WAF ACL.

Mistakes can still occur, and unexpected customer flows can be hindered. As such, you need to plan a quick “break glass” method that developers practice beforehand and know how to initiate in case of a production issue. The break glass method can be as simple as a quick GitHub revert to remove or adjust the new ACL rule. Developers must know about a pending WAF change and how to debug it correctly.

Visibility & Logs

Developers must easily understand customers’ issues that originate from WAF blocking their requests. Luckily, AWS WAF has logging support, which you must enable per ACL.

There are three possible WAF logging destinations:

S3 bucket
Kinesis Firehose
CloudWatch log group

Options one and two are good options for centralized monitoring, where all logs are sent to one AWS account, stored, parsed, and analyzed. The downside is that it requires you to build a data pipeline for those logs and then manage access for all developers in the organization, as they need means to debug their service production issues.

AWS has recently released a video showing how you can build such a pipeline:
https://www.youtube.com/watch?v=qPYOsMsHDEM

However, my preferred option is option number three, using a local CloudWatch log group, as it is simple and allows the developers to use CloudWatch Logs Insights and tools to debug the WAF logs on their account, so no special permissions are required. If you choose this option, set log retention time to 7–14 days at most to reduce log storage costs.
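When a customer reports blocked requests, the first question is usually “which rule did it?”. The sketch below shows the idea on exported log lines, assuming one JSON record per line with the `action` and `terminatingRuleId` fields that WAF writes to its logs. The `blocked_by_rule` helper and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def blocked_by_rule(log_lines: Iterable[str]) -> Counter:
    """Count blocked requests per terminating rule."""
    counts: Counter = Counter()
    for line in log_lines:
        record = json.loads(line)
        if record.get("action") == "BLOCK":
            counts[record.get("terminatingRuleId", "unknown")] += 1
    return counts


# Made-up sample records, trimmed to the fields used here.
sample_logs = [
    '{"action": "BLOCK", "terminatingRuleId": "AWSManagedRulesCommonRuleSet", "httpRequest": {"clientIp": "198.51.100.7", "uri": "/orders"}}',
    '{"action": "ALLOW", "terminatingRuleId": "Default_Action", "httpRequest": {"clientIp": "203.0.113.9", "uri": "/orders"}}',
    '{"action": "BLOCK", "terminatingRuleId": "AWSManagedRulesCommonRuleSet", "httpRequest": {"clientIp": "198.51.100.8", "uri": "/orders"}}',
]

print(blocked_by_rule(sample_logs).most_common())
# [('AWSManagedRulesCommonRuleSet', 2)]
```

The same grouping can be expressed as a query in the CloudWatch console; the point is to group by the terminating rule before digging into individual requests.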

For more information, check out AWS documentation.

WAF WCU & Pricing

AWS WAF uses WCUs to calculate and control the operating resources that are required to run your rules, rule groups, and web ACLs. The WCU requirements for a rule group are determined by the rules that you define inside the rule group. The maximum capacity for a rule group is 5,000 WCUs. — AWS

While I’d rather add maximum security and build the perfect ACL with all the rules possible, it’s not possible due to the 5000 WCU limit. You need to find the correct balance between security coverage and overall WCU. Try to use AWS-managed rules as much as possible, as they are more WCU-optimized and well-maintained.
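Budgeting WCUs is simple arithmetic, but it helps to do it before adding a rule rather than after a failed deployment. In this sketch, the `wcu_budget` helper, the rule names, and the per-rule WCU numbers are placeholders I made up; look up the real capacity of each rule group in the AWS documentation or console.

```python
from typing import Dict, Tuple

WEB_ACL_WCU_LIMIT = 5000  # limit discussed above


def wcu_budget(rule_wcus: Dict[str, int], limit: int = WEB_ACL_WCU_LIMIT) -> Tuple[int, int]:
    """Return (used, remaining) capacity and fail if the plan exceeds the limit."""
    used = sum(rule_wcus.values())
    if used > limit:
        raise ValueError(f"planned rules need {used} WCUs, limit is {limit}")
    return used, limit - used


# Placeholder numbers for illustration only.
planned = {"managed-group-a": 700, "managed-group-b": 200, "custom-rate-limit": 50}
print(wcu_budget(planned))  # (950, 4050)
```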

Another critical issue is the cost. WAF is expensive as you pay for rules, ACLs, and traffic volume that goes through the ACL. For more details, check out the pricing calculator.

Summary

In this post, I introduced the AWS Web Application Firewall and explained why you should use it. We also covered the basics of ACLs and rules. Lastly, I shared some crucial tips that I’ve learned from using and updating WAF in a large enterprise.

In the next post in the series, I will provide AWS CDK code for WAF deployment and associate it with an API Gateway.

In the third post, we will review AWS Firewall Manager and how it allows the organization to manage AWS Web Application Firewall ACLs at scale.

Understanding AWS Availability Zones: Boosting SaaS Resilience and Uptime

Ran Isenberg — Mon, 21 Oct 2024 06:02:53 +0000

Resilience and availability are critical aspects of every SaaS application. By building your SaaS Serverless services on AWS’ infrastructure, you automatically deploy to multiple availability zones within a single region, which provides increased resilience and automated availability failover mechanisms. As engineers and architects, it is imperative to understand these concepts and how to configure our services correctly.

In this post, you will learn about availability zones (AZs), what they are, why they are essential, and my tips for configuring resources across multiple AZs within a single region.

This blog post was originally published on my website, “Ran The Builder.”

Availability Zones Definitions and Properties

When we create resources in a region, we create them in one or more availability zones.

The AZs (availability zones) work together behind the scenes to keep our service running by sharing the traffic, replicating the data in the databases, and taking over each other’s traffic in case of an AZ outage.

The specifics are sometimes hidden from us, but sometimes, we explicitly define the number of AZs we want to use. VPCs, Aurora, and Fargate come to mind.

Let’s define what an availability zone is. According to AWS documentation:

AWS resources are hosted in multiple locations world-wide. These locations are composed of AWS Regions, Availability Zones, and Local Zones. Each AWS Region is a separate geographic area. Each AWS Region has multiple, isolated locations known as Availability Zones.

As Werner Vogels always says, “everything fails all the time,” so the concept of AZs is critical to make sure our service works even if one AZ has issues but the other AZs are online.

AZs provide higher availability, fault tolerance, and scalability than a single data center. Instead of having a single point of failure, you now have multiple zones to rely on in case one of the AZs has issues.

*There’s also the concept of multi-regional services, which raises service availability a notch, but I will not cover it here.

Availability Zone’s Properties

AZs have special characteristics according to AWS documentation:

Each region has a different number of AZs, but they all have at least 3.
All AZs in a region are interconnected via high-bandwidth, low-latency, fully redundant metro fiber.
Traffic between AZs is encrypted and supports synchronous replication.
Partitioning applications across AZs improves protection against issues like power outages and natural disasters.
AZs are physically separated by a meaningful distance, typically within 100 km (60 miles) of each other.
Your service might not utilise ALL the AZs.

When one AZ has issues in a region, AWS knows how to utilize the other AZs to keep your service and its resources going. Traffic will “automagically” shift to other healthy AZs and their resources. In addition, once the faulty AZ has returned online, some databases can restore the data they missed.

All of this is managed for us by AWS. That’s pretty amazing if you ask me.

Why Understanding AZs Matters

Serverless developers don’t usually think about AZs, as this information is handled by AWS behind the scenes. When using Serverless services like Lambda and DynamoDB, the AZs are selected for you, and you don’t need to manage or maintain them. For more info, check the AWS docs.

However, when using VPCs or not fully Serverless services (you need to define their VPCs and they don’t scale to zero) like Aurora, OpenSearch, or Fargate, AZs come into play, and it is important to understand their impact.

AZs can increase overall SLA and SLI as you are more resilient to a single AZ failure. They increase fault resilience and availability and have the potential to improve performance as AWS automatically balances traffic and access across AZs.
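As a rough illustration of why more AZs can raise availability, here is a minimal Python sketch. It assumes that AZ failures are independent and that the service stays up as long as one of its AZs is healthy; both are simplifications. The function name and the 0.99 per-AZ figure are hypothetical and are not AWS numbers.

```python
def combined_availability(per_az_availability: float, az_count: int) -> float:
    """Probability that at least one of `az_count` AZs is up.

    Assumes independent AZ failures, which real deployments only approximate.
    """
    if not 0.0 <= per_az_availability <= 1.0:
        raise ValueError("per_az_availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The service is down only when every AZ is down at the same time.
    return 1.0 - (1.0 - per_az_availability) ** az_count


if __name__ == "__main__":
    # 0.99 is a made-up per-AZ availability used purely for illustration.
    for count in (1, 2, 3):
        print(count, combined_availability(0.99, count))
```

Under these assumptions, each added AZ shrinks the chance of a full outage, which matches the intuition that you are more resilient to a single AZ failure.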

Let’s take Aurora, for example, and review some of the advantages that deploying to multiple AZs brings:

Aurora stores copies of the data in a DB cluster across multiple Availability Zones in a single AWS Region. When data is written to the primary DB instance, Aurora synchronously replicates the data across Availability Zones to six storage nodes associated with your cluster volume — AWS docs

Other than protecting your databases against failures when one AZ has issues, it also allows AWS to conduct failovers in case of planned maintenance.

AZs and the Peculiar IDs Case

Availability zones have physical IDs and names:

AWS maps the physical Availability Zones **randomly** to the Availability Zone names for each AWS account. This approach helps to distribute resources across the Availability Zones in an AWS Region, instead of resources likely being concentrated in Availability Zone “a” for each Region. As a result, the Availability Zone us-east-1a for your AWS account **might not represent** the same physical location as us-east-1a for a different AWS account.

When we consider a larger scale than a single AWS account, such as an AWS organization with multiple services deployed across multiple accounts, the issue becomes even more significant.

If two services in two different accounts in that organization were to deploy to specific AZs by their names: ‘us-east-1a’ and ‘us-east-1b’, AWS would map to different physical AZs (over which you have no control) across your organization’s accounts. While you think you are deploying your resources to the same physical AZs, that is likely untrue!

In order to control what specific AZ you deploy to, you need to know the correct mapping between name and physical ID.

We will cover a workaround with an AWS CDK code example later in this post.

When Can an AZ or Region Go Down?

An AZ can go down due to hardware failures, natural disasters, power outages, or other disasters.

A region goes down when ALL of its AZs suffer failures and are marked as “down.” Such an outage affects all AWS customers deployed to this region. See the AWS case history, which goes back 13 years. These things happen!

Let’s review some service failure use cases, from the simple to the edge cases.

The simple use case is that your service goes down when all the AZs it utilizes are down.

However, things can get more complicated. Let’s assume you don’t deploy to all the AZs in that region. The region has 3 AZs, and you use only 2 out of 3 AZs: AZs A and B. If availability zones A and B are down, you will still experience a regional failure on your application, even though the region still has one AZ functioning.

But it can get even more complicated. Let’s assume that your SaaS service depends on another SaaS service in runtime. Both services are deployed in different AWS accounts.

Assume your service deploys to AZs 1 and 2, and service B, which you critically depend on, deploys to AZs 1 and 3. If AZs 1 and 3 are down, you are essentially down even though your AZ 2 is still available, because your critical dependency, service B, is down.

That is why it is important to understand AZs and to ensure that the entire SaaS organization uses the same methodology and the same AZs.
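The dependency scenario above boils down to simple set logic. The sketch below is only an illustration: the function names and AZ labels are made up, and it assumes a service is up as long as at least one of the AZs it deploys to is healthy.

```python
from __future__ import annotations


def is_service_up(service_azs: set[str], down_azs: set[str]) -> bool:
    """A service is up while at least one of its AZs is not down."""
    return bool(service_azs - down_azs)


def is_effectively_up(
    service_azs: set[str], dependency_azs: set[str], down_azs: set[str]
) -> bool:
    """Our service needs both itself and its critical dependency to be up."""
    return is_service_up(service_azs, down_azs) and is_service_up(
        dependency_azs, down_azs
    )


if __name__ == "__main__":
    down = {"az1", "az3"}
    # Misaligned: we use AZs 1 and 2, the dependency uses AZs 1 and 3.
    print(is_effectively_up({"az1", "az2"}, {"az1", "az3"}, down))
    # Aligned: both services use AZs 1 and 2.
    print(is_effectively_up({"az1", "az2"}, {"az1", "az2"}, down))
```

With AZs 1 and 3 down, the misaligned pair is effectively down while the aligned pair survives on AZ 2.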

Like everything else in software, it’s all about pros, cons, and restrictions. Deploying to all AZs is the simplest answer, but it will cost you a lot.

Let’s discuss my recommendations for AZ selection.

Recommendations for Availability Zone Selection

As a rule of thumb:

Two are better than one. Deploy to a minimum of two AZs. For consistency, make sure all services across the organization use the same AZs (1 and 2, for instance). See the ‘AZ Selection via IaC’ section below for how to do it via IaC.
Critical SaaS services that are willing to pay extra for improved SLI and performance during partial AZ malfunction are encouraged to deploy to three or more AZs.
Before adding an AZ, it’s crucial to calculate the cost of both the DATA TRANSFER and the extra infrastructure. This step ensures that the deployment remains cost-effective and aligns with the organization’s budget.
For critical SaaS services that deploy to regions with more than 3 AZs, evaluate the cost to determine whether the added AZs are worth it. As a reference, such regions have failed before, so even a region with 6 AZs can go down.

AZ Selection via IaC

It is impossible to set AZ physical IDs like ‘use1-az1’ (for the us-east-1 region) directly in the AWS CloudFormation/CDK code. Instead, you need to provide the AZ names, which, as we know, are mapped to different IDs in each account. We need to determine which AZ name is mapped to AZ1, AZ2, etc.

You can do that in the CDK code by using the AWS SDK (boto3 for Python) to find the account-specific mapping between AZ name and ID.

Consider this Python code example that creates a VPC that is ALWAYS deployed to AZ1 and AZ2:

The magic occurs in lines 48–57. We iterate over the mapping and look for the AZ names that match the ID we want to deploy to. Simple and effective!
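The lookup can be sketched roughly as follows. This is a simplified illustration rather than the full example: the helper names are hypothetical, and only the boto3 `describe_availability_zones` call and its `ZoneName`, `ZoneId`, and `State` fields come from the EC2 API.

```python
from __future__ import annotations


def az_names_for_ids(describe_response: dict, wanted_ids: list[str]) -> list[str]:
    """Return the account-specific AZ names for the given AZ IDs, in request order."""
    # Build the ID -> name mapping from the EC2 response, skipping unavailable zones.
    id_to_name = {
        zone["ZoneId"]: zone["ZoneName"]
        for zone in describe_response["AvailabilityZones"]
        if zone.get("State") == "available"
    }
    missing = [zone_id for zone_id in wanted_ids if zone_id not in id_to_name]
    if missing:
        raise ValueError(f"AZ IDs not available in this account/region: {missing}")
    return [id_to_name[zone_id] for zone_id in wanted_ids]


def lookup_az_names(region: str, wanted_ids: list[str]) -> list[str]:
    """Query EC2 with boto3 (requires AWS credentials) and resolve the AZ names."""
    import boto3  # imported lazily so the pure helper above works without boto3

    client = boto3.client("ec2", region_name=region)
    return az_names_for_ids(client.describe_availability_zones(), wanted_ids)
```

In a CDK app, you could call `lookup_az_names("us-east-1", ["use1-az1", "use1-az2"])` during synthesis and pass the resulting names to the VPC construct, for example through its `availability_zones` argument, assuming your CDK version supports it.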

AZ Selection Tips for VPC

When configuring a Virtual Private Cloud (VPC), you have more control and must explicitly select which AZs to use. Typically, it is best practice to deploy your resources (such as EC2 instances or Elastic Load Balancers) across multiple AZs in the region.

Use multiple subnets within the VPC, each residing in a different AZ. This allows you to spread your resources across AZs and maintain redundancy in case of an AZ failure.

On a side note, if you have Lambdas inside your VPC, they will be deployed according to the VPC definition. Otherwise, it’s up to the Lambda service to configure and choose the AZs.

AZ Selection Tips for Aurora

An Aurora DB cluster is fault-tolerant by design. The cluster volume spans multiple Availability Zones (AZs) in a single AWS Region, and each AZ contains a copy of the cluster volume data. This functionality means that your DB cluster can tolerate a failure of an AZ without any data loss and only a brief interruption of service.

We recommend that you distribute the primary instance and reader instances in your DB cluster over multiple Availability Zones to improve the availability of your DB cluster. — AWS docs

Understanding Availability Zone’s Cost

Deploying and creating resources in multiple AZs increases cost. You pay for the deployed resources (Aurora replicas, EC2s, ALBs, NAT gateways, etc.) and, in some cases, for the traffic between the AZs.

Let’s review the following scenario: We deploy an EC2 machine and an Aurora RDS MySQL cluster over a VPC with 2 AZs.

In this case, we will pay double the amount for the resources of the EC2, Aurora DB, VPCs, and other network parts (ENIs, etc.). As for data transfer, you don’t pay for the data replication between the RDS instances across AZs, nor for data sent within the same AZ (between EC2 and RDS). However, you will pay for any cross-AZ traffic between EC2s and for any data transferred between EC2 and RDS in a different AZ (in case RDS is down in the first AZ).

Due to network and infrastructure deployment, you will pay more than twice as much if you have one AZ and add another. However, as you add more AZs, the infrastructure deployment’s per-unit incremental cost will be lower. The cost of data transfer itself differs between regions, but for the most part, cross-AZ data transfer within the region costs $0.01/GB. If the updates are back and forth, you pay twice, $0.02/GB. Read more here.
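To make the data transfer arithmetic concrete, here is a small sketch. The function name is hypothetical, and the default rate is the $0.01/GB figure quoted above; check the current pricing for your region before relying on it.

```python
def cross_az_transfer_cost(
    gb_sent: float, gb_returned: float = 0.0, rate_per_gb: float = 0.01
) -> float:
    """Estimate the cross-AZ data transfer cost in USD.

    Each direction is billed at `rate_per_gb`, so back-and-forth traffic
    is counted once per direction.
    """
    if gb_sent < 0 or gb_returned < 0 or rate_per_gb < 0:
        raise ValueError("volumes and rate must be non-negative")
    return (gb_sent + gb_returned) * rate_per_gb


if __name__ == "__main__":
    # One-way traffic vs. the same volume going back and forth.
    print(cross_az_transfer_cost(500))
    print(cross_az_transfer_cost(500, 500))
```

For example, 500 GB sent one way comes to about $5, while 500 GB in each direction comes to about $10, which is the “pay twice” effect described above.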

Remember, the cost of adding an AZ can differ significantly between regions, so it’s essential to consider these regional costs before making any changes.

Summary

In this post, we covered the importance of resilience and availability in SaaS applications built on AWS using Availability Zones (AZs). We defined what AZs are and why they matter and wrote concrete IaC code to configure resources across multiple AZs.

Thank you, Maxim Drobachevsky and Meitar Karas, for your help and review!

AWS re:Invent 2024 — My Selection Of Sessions

Ran Isenberg — Wed, 02 Oct 2024 07:04:46 +0000

In this post, you will find my opinionated list of AWS re:Invent 24 breakout sessions, workshops, builder sessions, code-talks, dev chats, and chalk talks that I found relevant to Serverless or highly interesting in general.

Don’t forget to read my guide to AWS re:Invent — tips and tricks.

You can find my complete session list over at reinvent planner.

This blog post was originally published on my website, “Ran The Builder.”

Session Types
Session Levels
My Breakout Session
Levels 100–200
Level 300
Level 400
Heroes/Community Track
Non Serverless But Highly Recommended

Session Types

**Breakout sessions:** lecture-style sessions that run 45 to 60 minutes. They often include 10–15 minutes of Q&A.

**Builders’ sessions:** these one-hour hands-on sessions have ten attendees and one AWS expert per table. Each builders’ session begins with a short explanation or demo of what you are going to build. There is no formal presentation. It’s just you, your laptop, and the AWS expert.

**Chalk talks:** highly interactive whiteboarding sessions with AWS experts. Expect a lively technical discussion, centered around real-world architecture challenges, with a small group of experts and peers. These sessions run for 60 minutes.

**Workshops:** two-hour interactive sessions where you work in small teams to solve real problems using AWS services. Each workshop starts with a short lecture (10 to 15 minutes) by the main speaker, and the rest of the time is spent working as a group. Don’t forget to bring your laptop to these workshops.

**Code talks:** engaging, code-focused sessions with a small audience. AWS experts lead a live coding discussion as they explain the why behind AWS solutions.

**Dev chats:** shorter, community-driven sessions. Get insights from AWS customers.

Session Levels

There are four levels: 100, 200, 300, and 400.

The 100–200 (foundational & intermediate) levels are excellent for a Serverless beginner.

If you build Serverless applications daily, target the 300–400 (advanced & expert) levels and only go to 100–200 for sessions in unfamiliar subjects (perhaps containers, data-related, or machine learning).

The complete catalog can be found at:

https://hub.reinvent.awsevents.com/attendee-portal/catalog/

But I highly suggest you use this alternative and better catalog to build your sessions list:

https://reinvent-planner.cloud/

Let’s go over my recommended sessions.

My Breakout Session

SVS401 | Breakout Session | Best practices for serverless developers

I’m thrilled to share that I will be co-presenting a breakout session at AWS re:Invent 2024 with Julian Wood.

This session provides architectural best practices, optimizations, and useful shortcuts that experts can use to build secure, high-scale, and high-performance serverless applications.

I will add insights and best practices from my perspective, as I have been running production workloads on Serverless for the past four years.

Levels 100–200

SEG101 | Breakout | Adopting the SaaS mindset to drive growth

Learn about AWS Lambda, Amazon API Gateway, and event-driven integration services, discover how to build your first serverless application, and learn how to handle multi-tenant architectures for SaaS applications. A good starting position to the world of SaaS and Serverless.

SVS204-R | Builder Session | Write less code: Building applications with a serverless mindset [REPEAT]

This hands-on session explores patterns for using direct service integrations using Amazon API Gateway, AWS Step Functions, and Amazon EventBridge. Discover the efficiency of utilizing configuration to streamline development tasks, and push the heavy lifting to AWS. You must bring your laptop to participate.

SVS206 | Chalk Talk | Building an event sourcing system using AWS serverless technologies

In this chalk talk, explore strategies for building effective event sourcing architectures using AWS serverless technologies. Learn how event sourcing stores the application state as an append-only event log, preserving context and enabling traceability. Discover powerful benefits like auditing, fault tolerance, root cause analysis, and event-driven architectures across industries and applications. Learn how to distinguish event sourcing from patterns like event streaming and domain-driven design. Leave with practical insights into leveraging serverless for implementing event sourcing to gain audit visibility, fault tolerance, and visibility into your application state.

SVS205 | Workshop | Building a serverless web application for a theme park

In this workshop, learn how to build a complete serverless web application for a popular theme park called Innovator Island. You must bring your laptop to participate. *Sounds like an intro workshop for people who are not experienced with Serverless. It also includes some frontend work, which is a nice bonus.*

API203-R | Builder’s Session | Building common orchestrated workflows with AWS Step Functions [REPEAT]

Accelerate your data processing journey in this hands-on AWS Step Functions workshop. Build orchestrated data processing, async processing, and distributed transaction use cases. Leave with a deeper understanding of how to use AWS Step Functions to build scalable, efficient, cost-effective data processing architectures. You must bring your laptop to participate.

SVS209 | Breakout Session | Containers or serverless functions: A path for cloud-native success

In this session, explore the fundamental differences between containers and serverless functions. Investigate real-world scenarios to gain insights into choosing the right approach based on workload requirements, deployment scenarios, and operations. Choosing the wrong tool for the job is one of the most critical mistakes an architect can make, so I highly recommend this session.

SVS201-R | Workshop | Getting started with serverless patterns [REPEAT]

In this workshop, learn how to recognize and apply those patterns and best practices by building nearly production-ready code for a serverless application. Create microservices, run unit and integration tests, configure a CI/CD pipeline, and set up observability. You must bring your laptop to participate. I’m interested in knowing how close this will be to my Serverless blueprint.

API206-R | Chalk Talk | How event-driven architectures can go wrong and how to fix them

Attend this chalk talk to learn common event-driven pitfalls, including YOLO events, god events, observability soup, event loops, exposing monoliths, state corruption, and surprise bills. Explore strategies and techniques teams can implement to avoid these pitfalls and reap the full benefits of event-driven architectures.

SVS202-R | Chalk Talk | Thinking serverless [REPEAT]

Serverless is more than just AWS Lambda. It’s about learning to use a range of different services and techniques to solve a technical problem. How do you approach building a solution with a serverless mindset? In this chalk talk, learn how to tackle a business problem from a customer perspective by breaking down needs into serverless building blocks that work well together. I really like this approach. I discuss similar things in my “reflecting on Serverless” blog post.

Level 300

SVS306 | Workshop | Accelerate development with AWS Lambda Powertools for serverless APIs

In this workshop, start with an existing application built with Python and progressively improve your API event handler using Powertools for AWS Lambda. Learn how to implement request and response validation, dynamic routing, exception handling, middleware, and OpenAPI schema generation. Discover how to improve your API event handler with serverless best practices using Python that you can easily extend to other Powertools runtimes. You must bring your laptop to participate.

API306 | Breakout Session | Advanced patterns for distributed systems

Today’s applications are interconnected: they expose APIs, publish events, call third-party services, and externalize states. They must therefore address the fundamental challenges of distributed systems, such as out-of-order delivery, retries, idempotence, or partial failures. To balance those characteristics, architects have a range of options, including reducing the level of coupling through indirection, transformation, and asynchrony. In this session, learn about common design trade-offs for distributed systems and how to navigate them with design patterns, illustrated with real-world examples.

SVS312-R | Chalk Talk | AWS Lambda performance tuning: Best practices and guidance [REPEAT]

In this chalk talk, you learn about opportunities to optimize your serverless applications built with AWS Lambda, including optimizations in the function configuration and within your function code. This talk also covers how you can best measure and tune your function’s performance by configuring memory to get the right application performance. You also hear best practices for initialization logic and reuse to enable fast startup and fast function processing times.

API310 | Code Talk | Build a meeting summarization solution with generative AI & serverless

In this code talk, see live coding of a serverless application for producing meeting summaries with generative AI. Learn how to orchestrate transcription using Amazon Transcribe and summarization with Amazon Bedrock, orchestrated with AWS Step Functions. Discover how to simplify and scale your application using event-driven techniques with Amazon EventBridge. Leave with practical skills for developing serverless generative AI solutions that streamline meeting insights through automated transcription and summarization powered by AI/ML services.

SVS339 | Breakout Session | Building event-driven architectures using Amazon ECS with AWS Fargate

Event-driven architecture (EDA) enables organizations to build highly flexible and resilient systems, and customers are leveraging serverless containers to run EDA workloads due to its ease of use, scalability, and deep integrations with AWS serverless services. This session explores practical aspects of implementing EDA on Amazon ECS with AWS Fargate, focusing on patterns for consuming events in containerized environments using AWS Step Functions, Amazon SQS, and Amazon EventBridge. Learn how to build scalable, fault-tolerant, and event-driven solutions that can adapt to changing business requirements.

DEV341 | Dev chat | From single to multitenant: Scaling a mission-critical serverless app

Both single-tenant and multitenant architectures come with advantages and drawbacks. While the single-tenant approach is often simpler to implement initially, it may fall short as your system scales. In this dev chat, explore how PostNL transitioned one of its mission-critical applications, EBE, from a single-tenant to a multitenant architecture, hearing about the challenges it faced, the strategies it employed, and the benefits it realized through this transformation, providing valuable insights for those considering a similar evolution for their applications.

SVS324 | Breakout Session | Implementing security best practices for serverless applications

Building with serverless enables organizations to build and deploy applications without managing underlying infrastructure. Serverless strengthens your overall security posture by reducing attack surface and shifting security operations to AWS. In this session, explore how to implement security best practices across the software delivery lifecycle and into production deployment. Hear lessons learned from working with numerous enterprise customers that can help your builders be productive and innovative within security guardrails.

SVS313 | Chalk talk | Is your serverless application ready for production?

Building secure, reliable, and performant applications while balancing cost and operations can be challenging. Getting this right by aligning to the AWS Well-Architected Framework can greatly increase your chance of success. In this chalk talk, specific best practice guidance is applied to a serverless reference architecture.

SVS320 | Breakout Session | Accelerate serverless deployments using Terraform with proven patterns

In this session, discover best practices and proven patterns for using Terraform to build serverless applications safely, predictably, and repeatedly. Learn techniques for designing modular, reusable architectures and strategies to test applications locally. Understand how to manage ownership and separation of concerns between operations and development teams. Gain insights into efficiently deploying serverless applications to the cloud. Familiarize yourself with open source frameworks to accelerate your serverless journey with Terraform today. Leave equipped with practical skills for leveraging Terraform’s power in your organization’s modern cloud architectures.

DEV339 | Dev chat | Supercharge Lambda functions with Powertools for AWS Lambda

AWS Lambda functions are crucial in cloud architectures but can be challenging due to potential failures and repetitive AWS-specific code. Powertools for AWS Lambda is a library that addresses these issues by enhancing observability, resiliency, and operational excellence in your Lambda functions. In this dev chat, explore the Powertools library’s capabilities and see how it can improve your Lambda functions by using it in a real-world application, helping you achieve a well-architected solution in AWS.

SVS319 | Breakout Session | Unlock the power of generative AI with AWS Serverless

Learn to harness the power of AWS Serverless to build robust, cost-effective generative AI applications in this breakout session. Explore using AWS Step Functions to orchestrate complex AI workflows seamlessly. Gain insights through real-world use cases and patterns covering prompt engineering, model fine-tuning, batch inferencing, Retrieval Augmented Generation (RAG), and more. Leave equipped with the knowledge and skills to unlock the true potential of secured, highly scalable, high-performance generative AI applications using serverless workflows. Elevate your AI capabilities in this rapidly evolving field.

Level 400

SVS404 | Workshop | Building serverless distributed data processing workloads

Enterprises today face an ever-increasing need to process large-scale data to meet their business goals and unlock new value. Distributed data processing offers a cost-efficient way to speed up processing, but it also presents challenges for developers in managing the parallelism within serverful environments. In this workshop, learn how serverless technologies like AWS Step Functions and AWS Lambda can help you simplify management and scaling, offload undifferentiated tasks, and address the challenges of distributed data processing. Also, discover use cases, best practices, and resources that can help you accelerate your data processing journey. You must bring your laptop to participate.

OPN402 | Breakout | Gain expert-level knowledge about Powertools for AWS Lambda

Did you learn serverless best practices but are unsure about implementation? Have you used Powertools for AWS Lambda but felt you barely scratched the surface? This session dives deep into observability practices, resilient data pipelines with AWS Batch, safe retries with idempotency, mono- and multi-function APIs, and more. Learn about each practice in depth, achieve expert-level knowledge, and hear from maintainers about what’s next.

API401 | Chalk talk | Multi-tenant Amazon SQS queues: Mitigating noisy neighbors

This chalk talk explores advanced strategies for managing multi-tenant Amazon SQS queues, discusses the challenges posed by noisy neighbors, and shares effective mitigation techniques, including shuffle sharding and overflow queues. Gain insights into optimizing queue performance, ensuring fair resource allocation, and maintaining service quality across tenants. Walk through best practices for implementing these solutions, potential trade-offs, and examples of multi-tenant Amazon SQS architectures.

SVS406 | Chalk talk | Scale streaming workloads with AWS Lambda

In this chalk talk, learn how to optimize your streaming data processing with AWS Lambda. Explore scenarios where default processing speeds may bottleneck workloads consuming messages from Apache Kafka, Amazon DynamoDB, or other sources — especially when data enrichment is required. Learn how to implement parallel processing techniques for ordered and unordered use cases to address throughput limitations. See a live demo showcasing performance improvements in an example message processing pipeline. Leave this talk with practical strategies for achieving high-throughput, scalable streaming workloads on Lambda.

Heroes/Community Track

The community track consists of AWS Heroes’ and builders’ breakout sessions and dev chats. Hearing from proven community leaders who share their real production knowledge is an invaluable asset.

With over 30 sessions, I was unable to include them all here, so search the catalog for sessions that start with ‘DEV’: ‘DEV2’ for 200-level dev sessions, ‘DEV3’ for 300-level, and ‘DEV4’ for 400-level.

Non Serverless But Highly Recommended

SAS313 | Chalk Talk | Designing SaaS architectures that support global growth and scale

SaaS organizations are often driven by growth. Scaling to meet this growth often requires teams to think about how their underlying architecture, operations, and application can support these growth models. Designing a multi-tenant environment that can scale into new geographies, be deployed into more regions, and/or address additional compliance requirements can be challenging. This chalk talk examines the architectural challenges that come with supporting various growth/reach models, highlighting techniques, patterns, and strategies that are used to prime your SaaS offering for broader reach/expansion. It covers the architectural, deployment, resilience, and operational considerations that come with tackling this growth profile.

DEV335 | Breakout Session | The modern CI/CD toolbox: Strategies for consistency and reliability

As software delivery scales and environments become more diverse, maintaining consistency, security, and reliability in continuous integration and continuous delivery (CI/CD) becomes increasingly challenging. Never fear! This fun, interactive session featuring AWS community and employee experts shows how to tackle the growing complexity by adopting best practices and modern techniques. Explore methods for ensuring consistent deployments across environments, robust configuration management, progressive delivery strategies, drift detection, and automated auditing with generative AI. Discover practical solutions to enhance reliability, safety, and efficiency, enabling faster delivery and reducing errors by treating all changes equally in the pipeline and streamlining processes across projects.

SAS406 | Breakout Session | Accelerating multitenant development with the SaaS Builder Toolkit

The SaaS Builder Toolkit (SBT) provides developers with a pre-built set of tools that decompose SaaS into a series of building blocks that can be used to create multitenant environments. This session digs into the moving parts of this toolkit, exploring the inner working of its core components, architecture, and extensibility model. It also looks at a real-world example of SBT in action, composing a working multitenant application from scratch. Additionally, it explores how SBT addresses core concepts, including building a control plane, onboarding tenants, authenticating tenants, supporting tiering, and provisioning tenant resources.

SAS305 | Breakout Session | SaaS architecture pitfalls: Lessons from the field

The last 7+ years spent helping companies build SaaS solutions has been eye-opening. AWS has gotten great insights into the dynamics, challenges, and pitfalls that teams often face when building SaaS solutions. In this session, explore a range of different patterns, including common technical and business themes that have impacted the scale, growth, and cost efficiency of SaaS offerings. This is about capturing these trends and outlining guidance that can help teams avoid falling into these same traps. Hear about the technical nuances, architecture challenges, and operational impacts that undermine the success of SaaS businesses.

Guide to AWS re:Invent - Tips & Tricks

Ran Isenberg — Mon, 23 Sep 2024 05:54:08 +0000

It’s that time of the year — AWS re:Invent is upon us!

This will be my second year attending, and having been both a speaker and attendee last year, I wanted to share some valuable insights and tricks with you.

In this post, you’ll find my top tips for making the most out of the AWS re:Invent conference while navigating it like a pro.

This blog post was originally published on my website, “Ran The Builder.”

Arrival Date
Pack the Correct Tools
Sessions & Getting Around
Night Time
AWS re:Play
Socialize
Shopping
Departure Date

Arrival Date

Coming from outside the USA, the flight can be harsh, and the jet lag hits hard. Give yourself time to relax and adjust. I’d recommend coming on Friday or Saturday before the conference starts on Monday.

Take the conference tag and swag on Sunday if possible. Come prepared with the QR code you got after you registered.

Take a hike or a short day trip around Vegas before the conference starts. The beautiful scenery will help you overcome jet lag.

Pack the Correct Tools

Everything is big in America: both the conference area and the hotels. The conference sessions are spread over several hotels. You will move from one hotel to another on foot or with free conference shuttles. However, getting from your room to the shuttles, food area, and sessions will take time. It can look close when you start walking, but trust me, it’s not; it’s far away. I found myself walking 8+ kilometers per day.

So, get proper walking or running shoes, lots of water, and a battery pack for your phone.

Dress in layers. When you step outside, it’s cold in the morning (3–6°C the last time I was there), but it warms up to 15+°C by noon. Inside the hotels, it’s warmer, and when you walk a lot, you get hot, so layers let you adjust accordingly.

Sessions & Getting Around

Session registration: seats are limited, so try to register as soon as session registration opens.

If you miss out on a seat, not all hope is lost. You can stand in line outside the session’s room before the session starts. Try to be there one hour before the session, and chances are you will get in. If not, for popular sessions, “repeats” might get scheduled later, and there are sometimes video rooms where you can watch the session live. Failing that, most breakout sessions are recorded, and you can watch them on YouTube after the conference ends.

As we’ve established, getting around takes time. Unless your daily sessions are in the same hotel, it will take you time to get from one to another. Plan at least 40 minutes between sessions and plan time for relaxation, food, and socializing. Try to go to a maximum of 3 sessions per day. You can do more, but you will be exhausted by the end.

Go to at least one keynote. Werner Vogels’ keynote is a must in my book. Last year, I could watch another keynote on the TV in my hotel room. Ask your hotel whether this is an option.

Attend different session types, not just breakout sessions. Go to dev/builder talks and chalk talks, as they are more intimate, and you can chat with the speakers and get a personal experience.

GameDay is an amazing experience. It’s like a four-hour hackathon. Bring a laptop; it’s challenging and fun. It’s for groups of up to four people, but you can always join people on the spot, which decreases your chances of winning the prizes.

An excellent alternative for GameDay is the workshop sessions.

Session levels: if you are an expert in a specific field, don’t go to any session below 300, or even 400 if you know that field well. 100 and 200 sessions are perfect for areas you don’t know anything about or if you are not a technical person.

The community track is amazing. Both builders and heroes share their knowledge there. It’s an authentic track where non-AWS employees, real professionals from the field, share their knowledge and insights. It’s a must. Choose your preferred sessions.

Expo: swag and fun. Attend the expo during the first two days to try your luck with contests, swag, and cool gifts from many companies.

Plan, plan, and plan again. Use the unofficial re:Invent planner, https://reinvent-planner.cloud/, to plan your sessions. It is a simple yet powerful tool built by AWS community builder Raphael Manke.

Night Time

Vegas has so many shows; there are also serverless parties and other companies that have free mixers and parties. Check out:

https://conferenceparties.com/reinvent2024/

AWS re:Play

On Thursday night, there’s the official party of re:Invent. The food is okay, but it’s crowded and loud, which makes me feel too old for it. There’s lots of beer, and you get a free T-shirt!

It’s an open area far from the hotels, so it’s cold! Dress accordingly.

Socialize

One of the coolest things about re:Invent is that you get to meet like-minded people from all over the world. Don’t be shy; socialize! Talk to as many people as possible. Don’t be afraid to reach out to AWS builders, heroes, and session speakers that you want to meet.

You can always view recorded sessions at home, but you probably can’t meet your favorite AWS expert and discuss serverless over beer.

Try to go to mixer events or parties (serverless parties were a thing last time!)

If you want to increase your chance of meeting AWS heroes and fellow community builders, someone might be eating over at Denny’s in the morning.

Shopping

Take an Uber (I think it was $30 per direction) and get to the outlets (north or south).

Be aware that re:Invent takes place after Black Friday, so many shops might look like they survived a zombie attack, but some good deals can be had.

You can buy on Amazon.com and ship it to your hotel (check with your hotel). This is a fast option that allows you to take advantage of any Black Friday deals and have the packages waiting for you until you fly. It costs extra to store the package, but nothing too bad. You should do an Amazon Prime trial to maximize savings and deals.

Other than that, the hotel area has other malls and pretty much any shop and brand you can think of. Some of them have late Black Friday/Cyber Monday deals throughout the week.

Departure Date

Most people leave on Friday, even though there is still at least half a day of sessions. I’ve attended some of the best sessions on Friday, but it’s really personal.

Be aware that the expo will be tearing down on Friday.

Lastly, have fun and learn!

A Critical Look at AWS Lambda Extensions: Pros, Cons, and Recommended Use Cases

Ran Isenberg — Tue, 10 Sep 2024 07:41:34 +0000

In this VERY opinionated post, I will share my thoughts about AWS Lambda extensions, the good and the bad, and when you should or should not use them.

This blog post was originally published on my website, “Ran The Builder.”

Lambda Extensions Introduction

Lambda extensions are an interesting mechanism. AWS recommends using them to enhance your functions with black box capabilities developed by AWS or external providers.

…use Lambda extensions to integrate functions with your preferred monitoring, observability, security, and governance tools — AWS

You can think of them as black boxes that you attach to your Lambda function as a Lambda layer to enjoy new features.

Here’s a list of classic uses that come to mind:

Fetch configuration, secrets, and parameters and store them in the cache. Fetch them automatically once every several minutes. Think of fetching secrets from Secrets Manager or Parameter Store.
Send logs, traces, and metrics to an external observability provider such as Lumigo, DataDog, and others.
Monitor the Lambda function’s CPU, memory, disk usage, and other interesting machine parameters and send them to a monitoring system, such as the Lambda insights extension.

You can find more extension use cases in this repository. Some are AWS-backed, and some are third-party.

To use a Lambda extension, you only need to attach it as a Lambda layer. Check out my post to learn about Lambda layers and when to use them. I find them particularly useful for deployment optimization.

The Case for Lambda Extensions

On paper, extensions sound like a fantastic mechanism — you add a Lambda layer and a few environment variables, and boom, you are now integrated with an observability third-party provider and get new functionality you didn’t develop.

It gets even better — extension developers can develop extensions in Rust, a compiled language that provides blazing-fast performance, a minor cold start hit, is environment agnostic, and can run on Lambda functions that use a different runtime.

The advantage of an executable is that it is compiled to native code, and is ready to run. The extension is environment agnostic, so it can run alongside with a Lambda function written in another runtime. — Optimizing AWS Lambda extensions in C# and Rust

However, Lambda extensions often come at a cost that outweighs these advantages.

Let’s review my reasons and the use cases for which I currently use extensions.

The Case Against Lambda Extensions

Lambda extensions are a powerful tool, but they inherit most of their problems from the mechanism they are built upon: Lambda layers. Let’s review the cases against Lambda extensions.

We can divide the cases into several categories:

Security
Developer experience
Performance & cost

Security

Extensions are similar to any open-source SDK library. However, there’s an added twist.

Extensions can expose you to a security risk as you are never sure what is bundled in the layer (some extension providers document it and do a good job, but many don’t).

In addition, according to the AWS documentation:

Extensions have access to the same resources as functions. Because extensions are executed within the same environment as the function — AWS documentation

If you use a compromised version of an extension (think of hackers gaining access and publishing their version of the layer in the layer publisher’s official AWS account), they can access whatever resource your function has in their black box running process. So far, this is quite similar to a compromised open-source SDK.

But there’s a plot twist.

Secure CI/CD pipelines include a vulnerability scanner, such as Snyk, that scans your Python dependency files (such as pyproject.toml or poetry.lock), checks whether there’s any compromised version, and fails your pipeline before a compromised version is deployed to production.

That’s not the case with Lambda layers and extensions. Their code is not part of your dependency files, so the pipeline scanner never sees it. Only once the layer is attached to a deployed function can tools like Amazon Inspector scan the Lambda’s layers, code, and dependencies and find compromised code.

However, that’s too late. Your compromised extension is already running in production. So, there’s an added security risk.

Developer Experience

Setting up a local developer environment with extensions is hard. You need to figure out what external dependencies the extension’s layer brings and install them locally to test your code and debug in the IDE. This can prove challenging, especially in cases where there are version conflicts. A discrepancy between the local developer and Lambda function environments can lead to crashes and bugs that only happen on production and not in local environments.

In addition, some Lambda extensions require your Lambda function’s code to interact with it (most likely via localhost network calls), making it almost impossible to test your code in the IDE. If you have no idea what I’m talking about, check out my AWS re:invent 2023 session and my Lambda testing series, where I show how you can locally test your functions. Extensions break this experience.
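To make the localhost coupling concrete, here is a minimal sketch of function code that asks an extension for a secret. It is modeled on the AWS Parameters and Secrets Lambda Extension; the port, path, header name, and environment variable are written from memory, so treat them as assumptions and check the extension’s documentation. The helper names are made up for this example.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumed default port of the extension's local HTTP server; verify it in the docs.
EXTENSION_PORT = int(os.environ.get("PARAMETERS_SECRETS_EXTENSION_HTTP_PORT", "2773"))


def build_secret_request(secret_id: str) -> urllib.request.Request:
    """Build the localhost request the function sends to the extension."""
    query = urllib.parse.urlencode({"secretId": secret_id})
    url = f"http://localhost:{EXTENSION_PORT}/secretsmanager/get?{query}"
    # The extension is assumed to expect the function's session token in this header.
    headers = {"X-Aws-Parameters-Secrets-Token": os.environ.get("AWS_SESSION_TOKEN", "")}
    return urllib.request.Request(url, headers=headers)


def get_secret(secret_id: str) -> str:
    """Fetch a secret through the extension; needs the extension process nearby."""
    with urllib.request.urlopen(build_secret_request(secret_id), timeout=2) as resp:
        return json.loads(resp.read())["SecretString"]
```

On a laptop, nothing listens on that port, so get_secret fails with a connection error unless you stub the endpoint yourself. That stub is exactly the extra local setup the function code would not need if it called an SDK or library directly.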

Another critical developer experience is related to maintenance and upgrades.

Lambda extensions are enabled via Lambda layers.

Lambda layers have inherent problems:

Versioning — Layers are versioned, and your function always consumes a specific version that changes between regions, making your life even more complicated (in one region, it’s version 55, in another, 43). The version is part of the layer ARN.
Updates — You need to be aware of a new version (somehow) and manually change it to the latest version. Lambda layers don’t have a package manager like Python and other languages, so updates become a manual endeavor.
The total unzipped size of the function and all layers cannot exceed the 250 MB unzipped deployment package size quota. Your extension takes a chunk out of this quota.
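The versioning pain is easiest to see in code. Below is a small sketch, assuming a hypothetical publisher account ID and layer name, with the version numbers 55 and 43 borrowed from the example above. Every new release means editing this mapping by hand for every region you deploy to.

```python
# Hypothetical values for illustration only: the account ID, layer name,
# and per-region versions are assumptions, not a real published extension.
PUBLISHER_ACCOUNT = "123456789012"
LAYER_NAME = "my-observability-extension"

# The same extension release can carry a different layer version per region,
# so the mapping has to be maintained manually.
LAYER_VERSION_BY_REGION = {
    "us-east-1": 55,
    "eu-west-1": 43,
}


def extension_layer_arn(region: str) -> str:
    """Build the layer version ARN for a region, failing loudly if unmapped."""
    try:
        version = LAYER_VERSION_BY_REGION[region]
    except KeyError:
        raise ValueError(f"no known layer version for region {region!r}") from None
    return f"arn:aws:lambda:{region}:{PUBLISHER_ACCOUNT}:layer:{LAYER_NAME}:{version}"
```

Failing on an unmapped region is a deliberate choice in this sketch: a loud error at synth or deploy time is better than silently attaching the wrong version.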

Performance & Cost

There are no free meals. Extensions share function resources such as CPU, memory, and storage, and you may see an increased billed function duration.

In addition, each extension must complete its initialization before Lambda invokes the function. Therefore, an extension that consumes significant initialization time can increase the function’s cold start duration. It may be worth it for you, but it’s a fact you need to be aware of.

When to Use Lambda Extensions

Let’s cover several use cases and see if they fit Lambda extensions or not.

Fetch Configuration

I recently read a blog proving that an extension can fetch configuration faster than the Node Lambda function that uses it. That’s probably because the extension is written in Rust, which is faster than Node. However, once the secret is fetched, it is stored in a cache for several minutes. Powertools for AWS, an amazing open-source library, provides the same features without an extension. An extension saves you a few dozen milliseconds in the first call (and for the first call every time the cache expires), but it’s all the same once the cache is not empty. Unless you fetch a ridiculously high volume of secrets in the function, it’s probably not worth adding the complexity and issues mentioned above to save a dozen milliseconds every few minutes.
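To illustrate why the cache matters more than who does the fetching, here is a minimal time-based cache sketch in plain Python. It is not how Powertools or any extension is implemented; the class name, the fetch callable, and the injectable clock are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache values for max_age seconds; call fetch() only on a miss or expiry."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # warm path: no network call at all
        value = self._fetch(name)  # cold path: the only place latency is paid
        self._entries[name] = (now, value)
        return value
```

Whatever sits behind fetch, an extension or a direct SDK call, it only runs on the cold path. With a five-minute max_age, that is one slow call every five minutes per execution environment, which is where the few dozen milliseconds of difference would show up.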

TL;DR — don’t use an extension.

Send Logs to a 3rd Party Observability Provider

This is one of the most common use cases for Lambda extensions. It works, and it works well, and you probably don’t need your code to interact with it directly in the IDE. It seems fitting for most companies.

However, coming from a security enterprise, we chose a different route. I wrote a blog post about why it’s better to use a centralized service to send all logs from your AWS account to a 3rd party observability provider instead of having all Lambda functions send them individually. It’s simple when you have few services, but it gets painful at scale, especially when you want to have the same configuration and log filtering governance across hundreds of services. You are better off writing the data to CloudWatch and using a centralized mechanism to send your data to any provider you wish and filter out data you don’t want to send. It’s more secure and performs better, but you pay more (but it’s worth it!). Check my post here for detailed pros and cons.

TL;DR — use an extension but you might want to reconsider at scale.

Monitoring Lambda’s Container Metrics

This is a great use case, and I can’t find any issues with it other than it should come out of the box in Lambda. I don’t want to think about it or attach extensions. I want to enable it via IaC configuration and see the statistics in CloudWatch.

TL;DR — use an extension

Chaos Engineering

Another unique use case where Lambda extensions are a practical tool is chaos engineering.

Koby Aharon covered serverless chaos engineering concepts in his two fantastic guest posts on my website and provided implementation details. He conducted his chaos experiment using a Lambda extension. You can review his introduction post here and his experiment details here.

He “installed” the extension at the beginning of the experiment, leaving his deployment code and function code clean of extensions and layers, and removed it once the experiment was done. It is super clean, and he does it only during a chaos experiment, which is done in a separate account, so it does not affect production. In this example, maintenance is less of an issue, and developers don’t need to be aware of it during development. Zero cons, and you get all the extension pros.

TL;DR — use an extension

Now, I haven’t covered all the extensions in the world. Still, these examples cover a large percentage of the widespread use cases.

Summary

In this post, I have reviewed several common Lambda extension use cases.

I have listed the pros and cons of Lambda extensions and suggested the most fitting use cases.

Bottom line, if your code actively interacts with the extension, you are doing something wrong, and you should replace the extension with regular Lambda function code or, in most cases, with an open-source library that does the same work, sometimes even with more features. Just because you can do it with an extension does not mean you should.

Observability and chaos engineering, on the other hand, are fine examples of proper extension usage.

Build a Serverless Web Application on Fargate ECS with AWS CDK

Ran Isenberg — Tue, 13 Aug 2024 07:14:01 +0000

Containers can be serverless, too, at least to some degree. I’ve decided to try Fargate and see how easy it is to deploy a web application while learning about its advantages and developer experience.

In this post, you will learn how to build a Fargate ECS cluster with an application load balancer and a web application using Python CDK code. We will build a production-grade and secure cluster that passes security scanning tools like CDK-nag and hosts a ChatBot web application that connects to OpenAI.

You can find the complete deployable GitHub repository here.

If you find any misconfiguration or mistake, let me know.

This blog post was originally published on my website, “Ran The Builder.”

AWS Fargate Quick Introduction
Building a Serverless Chatbot Web Application on Fargate
High Level Architecture
CDK Code
Web Application Code & Dockerfile
Deployment Result
Summary

AWS Fargate Quick Introduction

You can create a web application on Lambda, App Runner, EC2, ECS, or EKS. In this post, we will focus on ECS and containerized web applications.

To deploy a containerized web application, you need an ECS cluster and an Application load balancer that routes traffic to it.

AWS Fargate is a service that takes containerization services, either ECS or EKS, to the next level. It adds another layer of welcomed abstraction and ease of management.

AWS Fargate is a technology that you can use with Amazon ECS to run containers without having to manage servers or clusters of Amazon EC2 instances. With AWS Fargate, you no longer have to provision, configure, or scale clusters of virtual machines to run containers. — AWS documentation

You don’t need to think about EC2 machine types, manage them, install security updates, or manage their scaling up or down; it’s all done for you. AWS claims it is a serverless service; however, in my book, it’s “almost” serverless since, as you will see in the code examples, we still need to create VPCs, and we pay for the service even when it’s idle and no customers are accessing our cluster. However, it’s still a blessed abstraction, and I do agree with the following AWS statement:

With AWS Fargate, you can focus on building applications. You manage less — AWS documentation

You manage less but pay extra, so keep that in mind. If you are okay with managing the infrastructure and the EC2 machines under the hood, you can use ECS clusters without Fargate, which will cost you DevOps/infra team time.

Building a Serverless Chatbot Web Application on Fargate

We will write a CDK construct that will deploy a Fargate-managed ECS cluster that runs a web application with an Application load balancer routing traffic to it. I purchased a custom domain and created a certificate, but it is not mandatory.

The entire solution will be secure and audited with production-grade settings. I used CDK-nag to scan my CDK’s CloudFormation output and ensure it was not missing any important auditing or security configurations.

The web application will be a simple chatbot (yes, one of those) that is powered by OpenAI. The chatbot is an ECS task container running a Python-based Streamlit application.

Streamlit, an open-source tool, allowed me to create a chatbot interface with less than 10 lines of code. It’s designed to handle user sessions and responses in the backend while providing an attractive frontend user interface.

As a side note, the CDK code can deploy any web application, not just a chatbot.

High Level Architecture

Let’s review the architecture of our Fargate solution below.

AWS Fargate will manage the Application load balancer and an ECS cluster. The ECS cluster deploys on a VPC across two AZs and creates ECS task containers (the web application container) with their security groups. The container image is uploaded to ECR.

Fargate also creates a load balancer. AWS WAF protects it, one S3 bucket stores its traffic logs, and another bucket tracks access to the first S3 bucket.

We also created a Route53 custom domain and DNS records that point to the ALB public address.

Let’s review the CDK code and the web application’s code.

CDK Code

There’s a lot of code, so follow my comments.

In line 19, we get a WAF object (see its definition here) and network assets. The network assets are the custom domain and its certificate. Be aware that these are optional. Their definition is here.

In lines 25–31, we build the ECR docker image for our web application container image. CDK will run Docker and expect a Dockerfile in the directory we provide. More on that later.

In lines 34–78, we define a VPC for our ECS cluster. Fargate is not “truly” serverless; we must still use VPCs like regular ECS. We also enable VPC flows and all the required permissions (yes, it’s a lot of code just for logs!).

In line 81, we define the ECS cluster and enable container insights. There are many moving parts here, so we should enable as many logs and metrics as possible to debug issues.

In lines 35–51, we define the Fargate task definition. Notice that we don’t select EC2 machine types. Instead, we define the amount of memory and virtual CPUs we require. Fargate will find the appropriate machine types according to its cost-efficient strategies (you can control them and add Spot instances capacity provider, too). In line 42, we select the ARM64 platform since I use an ARM-based laptop to build the Docker image.
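Fargate only accepts certain CPU and memory pairings, so a typo in the task definition fails at deployment. The sketch below encodes the combinations as I remember them from the ECS documentation (the larger 8 and 16 vCPU sizes are left out), so verify the table before relying on it. The function name is made up for this example.

```python
# CPU units (1024 = 1 vCPU) mapped to the allowed memory values in MiB.
# Table recalled from the ECS documentation; verify before relying on it.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair is in the table above."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu, [])
```

A check like this can run in a unit test next to the CDK code, so an invalid pairing is caught before CloudFormation rejects it.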

In lines 84–103, we define the container used on the ECS task. We set the Docker image and enabled logging for the container. Again, another log type that we should enable to help debug the web application.

In line 103, we set the container open port to 8501, which is the default for Streamlit as our web application.

Let’s move on to the ALB and other production readiness configurations.

In lines 105–129, we create the S3 buckets for ALB and bucket access logs; we enable the security-recommended configs.

In lines 132–163, we define the Fargate ALB task. It creates an ECS service with our task definition (and container) and sets up an ALB that handles the routing to those containers. We provide it with the certificate and hosted zone created in the network assets construct. We don’t set a public IP; we enable HTTP redirect to HTTPS. Advanced features such as circuit breakers are recommended to detect failures during deployment and start a rollback to a previously stable state.

In lines 154–160, we add IAM permissions to the running process in the container. In our case, we need to access the secrets manager to fetch the OpenAI API key that was stored there.

In line 166, we enable WAF ACL on our ALB for extra security.

In lines 172–182, we tell Fargate how and when to scale its service. We provide both CPU and memory thresholds.
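As a rough mental model of how CPU and memory thresholds drive scaling, here is a simplified proportional sketch. It is not the actual Application Auto Scaling algorithm, which also involves CloudWatch alarms and cooldowns. It only captures the idea that the service scales out when either metric is above its target and scales in only when both allow it. All names and numbers are assumptions.

```python
import math


def desired_task_count(
    current_tasks: int,
    cpu_pct: float,
    memory_pct: float,
    cpu_target: float,
    memory_target: float,
    min_tasks: int,
    max_tasks: int,
) -> int:
    """Estimate the task count that brings both metrics back to their targets."""
    by_cpu = math.ceil(current_tasks * cpu_pct / cpu_target)
    by_memory = math.ceil(current_tasks * memory_pct / memory_target)
    # Take the larger demand so neither metric stays above its threshold,
    # then clamp to the configured capacity limits.
    wanted = max(by_cpu, by_memory)
    return max(min_tasks, min(max_tasks, wanted))
```

For example, two tasks at 90% CPU with a 50% target need about four tasks, regardless of how low memory utilization is.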

Lastly, line 184 enables health checks on our web application. Streamlit provides a health check endpoint on the ‘/healthz’ path.
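The load balancer performs the health check for you, but it helps to probe the same path yourself when a deployment refuses to become healthy. Below is a minimal sketch using only the standard library; the function name, default path argument, and timeout are assumptions for the example.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, path: str = "/healthz", timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as unhealthy.
        return False
```

You can point it at the container running locally on port 8501 before deploying, for example is_healthy("http://localhost:8501").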

You can find the complete file in this repository.

Web Application Code & Dockerfile

Let’s start with the Dockerfile.

When you run CDK deploy, it will eventually use this Dockerfile to create and upload an ECR image. It requires a requirements.txt file to reside in the same folder. We also require app.py of our web application.

I created it automatically by running the following command:

poetry export --without=dev --format=requirements.txt > cdk/service/docker/requirements.txt

Let’s review the file below:

As for the web application code, it’s mostly based on this example with added code that fetches the OpenAI API key secret from Secrets Manager.

If you look closely at line 21, you will see that I did a simple “prompt engineering” and asked the bot to answer every question with some predefined context. Obviously, this is not an optimized or good chat bot, but just a POC. The main idea was to learn how to deploy a Fargate web application.

Let’s deploy the code and see it in action.

Deployment Result

Once deployed, you will access the URL and be greeted by this chatbot interface.

I asked it whether Ran the Builder prefers K8s or serverless and got a pretty reasonable answer.

Summary

In this post, we learned a bit about Fargate. We created a Chatbot on Fargate ECS and improved its security and auditing best practices using CDK. You can use this code as a template and deploy any web application.

Please review this code and have it checked for misconfigurations and security issues. Also, make sure you add CloudWatch dashboards and alerts in the same manner I did for my serverless service (see this post).

If you find any security issues, please contact me.

Reflecting on Serverless: Current State, Community Thoughts, and Future Prospects

Ran Isenberg — Wed, 24 Jul 2024 14:22:31 +0000

Lately, I’ve seen many serverless outtakes and mixed opinions online. They stirred up the community and started a discussion, which is always good.

In this post, I will reflect on the current state of serverless, share my thoughts about serverless articles from the community, and discuss the future of serverless.

This blog post was originally published on my website, “Ran The Builder.”

Intro
Serverless Reads on the Web
Clickbait Articles
Serverless & AI
Serverless as a Halo Product
Serverless Challenges
Serverless isn’t a Silver Bullet
Developers Doing DevOps & Platform Engineering
A Lacking Developer Experience and Too Many Options
It’s Solvable
The Serverless Community
The Future of Serverless
Summary

Intro

Even though it has been almost five years since I started using serverless, it still feels like magic to me. Deploying code without managing the underlying infrastructure is freeing.

In addition, the ability to create event-driven architectures built from AWS-managed resources or direct integrations that replace containers’ code is mind-boggling. You can build services that would otherwise take years in just minutes. I’m still amazed at DynamoDB’s global tables’ cross-region replication performance.

The reality is that Serverless is not the newest cool kid on the block anymore (yes, GenAI, I’m talking about you). Lambda has celebrated its 10th anniversary this year. It’s crazy — 10 years already! And you could also argue that the first serverless service was SQS, which was launched two decades ago! The hype has moved, and that’s normal.

Any technology matures over time. Kubernetes is not “new” anymore. Heck, even ChatGPT is not “new” anymore. I wrote a post a year ago about how my grandmother wanted to use it.

At some point, technologies and products reach maturity, which is where we are currently. My take is that Serverless has matured into a true cloud “workhorse” — an amazing scalable and reliable technology that solves many problems but (like any other) comes with its own set of limitations and learnings (more on that later).

Serverless Reads on the Web

Lately, I’ve seen many serverless outtakes and mixed opinions online. They stirred up the community and started a discussion, which is always good.

I’d like to share my thoughts on some serverless outtakes I’ve seen in the last few months.

Clickbait Articles

This is a generalization of a trend of articles I’ve seen: “Team X stopped using serverless, and now they are 1000% better”.

There’s a lot of misinformation and misconceptions about serverless. Usually, in these articles, they refer to Lambda alone.

It’s 2024, and for some reason people still view serverless as just Lambda (facepalm).

There’s a beautiful serverless world out there with staple services: S3, SNS, SQS, DynamoDB, EventBridge and many more, that you can (and should!) use with your K8S cluster. Why operate Kafka when you get the same, if not better, experience with MSK or SQS without the maintenance cost?

Second, like any other technology, Lambda isn’t a magic solution to all problems. There are problems for which Lambda doesn’t fit (gasp) or make sense, and that’s okay!

Sometimes you need long-running sessions lasting more than 15 minutes. Or you need a GPU. Or you want to handle predictable traffic patterns and are OK with building on ECS Fargate. Containers are awesome, and not having to manage the underlying infrastructure is even better.

The main point these articles miss is that just because Lambda was not used in a specific use case does not mean that Lambda is a flawed technology. It just means that it didn’t fit that specific service’s requirements.

BTW, it’s okay to mix. I designed a service where part of it is Lambda and part containers. Don’t choose a solution because that’s what you know and have always done. Werner Vogels discussed it at AWS re:Invent 2023. Choose the solution that fits the problem requirements and constraints the BEST, period, and don’t be afraid to do things differently from what you are used to.

Serverless & AI

Unless you’ve lived under a rock in the past month, you’ve probably read Luc van Donkersgoed’s post about serverless and AI. In case you missed it, I asked ChatGPT to summarize it for you:

Luc expresses concern that AWS’s intense focus on Generative AI (GenAI) overshadows traditional cloud engineering and core infrastructure. He highlights how AWS’s recent events and announcements have been predominantly centered on GenAI, leaving less attention and resources for other essential areas like databases, scalable infrastructure, and maintainable applications. Luc argues that while GenAI has its benefits, it cannot replace the foundational elements of cloud engineering and urges AWS to balance its focus to better support developers and existing business needs. — ChatGPT 4o

As an engineer and an architect, I agree that I’d rather see fewer AI features and more serverless or infrastructure features. I’m sick of hearing of new AI features, but I can’t ignore that GenAI has made me a better and faster engineer. I can’t deny that some of it is genuinely cool and has amazing potential, like Bedrock agents calling APIs (see my post and impressions here).

Now, let’s take on AWS’s customer obsession and its customers’ business needs. I’m not an AWS customer; my company is, and that’s a significant distinction. CyberArk’s enterprise customers want us to build AI features, and plenty of them. There’s a real demand, and some of the implementations are improving our customers’ lives. So from that perspective, AWS is doing what its customers want. However, since AWS was late to the game, we’re seeing this never-ending wave of AI features.

Looking forward, I believe the AWS AI madness will calm down in a year or two as they close the gap, at least until the next hype comes along.

Serverless as a Halo Product

Gregor Hohpe published another interesting read titled: “Is AWS Lambda a Halo Product? — Shiny, advanced, a strong fan following, but lack of mainstream adoption — the trademarks of a Halo product. Sounds like AWS Lambda?”.

There are some solid points and some that I personally have a different opinion on.

Let’s discuss the mainstream aspect. What does it really mean? Is it publicly known and used, or are these just pure market share numbers?

Lambda will not completely replace ECS or EC2. It wasn’t supposed to.

As I’ve mentioned above, they each have their place. Is it mainstream? Yes. People know of it and use it in large-scale enterprise companies.

I can say that with confidence because when I interview for developer positions at CyberArk, I get resumes of engineers who claim to know Lambda, SQS, DynamoDB, and other serverless services — and they actually know them.

Let’s continue.

I disagree with sentences like “I use serverless mainly for demos” or “EC2s are reliable” (hinting that Lambda isn’t?); well, like I wrote in the beginning, serverless is more than just Lambda. Other serverless services are helpful even if you are running EC2s or K8s.

In addition, many enterprise companies use serverless in production, and it’s highly reliable. In fact, the AWS post-failure-event summaries from the last 13 years show that EC2 had four specific outages the previous year, with Lambda having just one.

Let’s talk about the following quote:

Those EC2 instances aren’t quite as sexy, but they are well-understood, reliable, and have predictable cost

According to DataDog’s state of cloud costs report item number 4, and I quote:

More than 80 percent of container spend is wasted on idle resources. About 54 percent of this wasted spend is on cluster idle, which is the cost of over-provisioning cluster infrastructure.

Let’s read that again: 80%! Wow, what a mind-boggling waste of resources and money. Perhaps the EC2s that host container-based solutions and Kubernetes aren’t that understood after all?

With serverless, you don’t pay for idle resources and don’t need to manage their infrastructure, which dramatically reduces your costs: a massive win for serverless.

Let’s discuss ongoing maintenance.

Securing EC2s over time is not easy. Handling OS and library upgrades is sometimes not straightforward at all. We literally had one of the biggest outages in years just last week — thanks, CrowdStrike! Talk about having a predictable maintenance cost.

Take a look at K8s, too. You hear many horror stories about version upgrades, setting up a cluster or service mesh, and other issues. So why is serverless getting so much heat? That’s because, with K8s, the infra team handles that.

With serverless, every developer needs to learn and understand the concepts and write IaC that builds serverless architecture moving parts.

Let’s expand on this point.

Serverless Challenges

Gregor makes good points, too. Serverless, and Lambda as part of it, is unique (halo-ish), and as such, you need to learn the tech; you need to REALLY understand to make the most of it. I stumbled across an article by Sheen Brisals that discusses whether serverless is hard or not. I highly suggest you read it.

Let’s discuss the current state of serverless struggles in detail. I want to clarify: any technology comes with its own set of challenges. Every serverless challenge I mention is solvable; it’s just a matter of cost, knowledge, and having the right people to drive the technology inside your organization.

Serverless isn’t a Silver Bullet

The great power of serverless is that starting with and becoming productive is much easier. Just think how long it would take a developer who has never seen either Lambda or Kubernetes to deploy a Hello World backend with public API on both. As you start building more realistic production applications, the complexity increases. You must take care of observability, security, cost optimization, failure handling, etc. With non-serverless, this responsibility usually falls on the operations team. With serverless, it usually falls on developers, where there is considerable confusion.

Let’s go back to the interview example I was using.

The developers knew serverless concepts. However, most of them fumbled and failed the second I asked about testing techniques or failure handling with DLQ and retries.

Serverless is not a silver bullet that removes the complexity of building and running cloud applications. It can help you significantly reduce the complexity of infrastructure maintenance and focus on the application layer. However, that also comes with the new complexity of understanding and applying serverless application patterns, such as using DLQ, event driven architecture concepts, and more.

Developers Doing DevOps & Platform Engineering

Some companies fear change or want to avoid putting in the cost to make the change. Serverless requires a culture shift, better developer onboarding, and someone leading the way in best practices and governance. In my company, it’s been the platform engineering responsibility. That’s been part of my job. However, many companies don’t connect platform engineering with serverless. At scale, it’s a must; you need governance, security, FinOps optimizations, general best practices, and someone to own it.

If there’s no concrete owner, it falls on the shoulders of every single developer, and it’s just too much.

A Lacking Developer Experience and Too Many Options

If using serverless was so easy and obvious, I wouldn’t have needed to write over 50 articles or share my serverless pragmatism views at AWS re:Invent 2023 with my partner in crime, Heitor Lessa.

Issues like serverless testing, serverless observability, learning to write a proper Lambda handler, dealing with tenant isolation, working with infrastructure as code tools (too many AWS options — SAM, CDK, Chalice, which one to choose and why?), and learning all the best practices overwhelm developers and managers alike.

AWS has published articles on most topics, but there are many opinions, too many ‘hello world’ projects that get deprecated within six months, and not enough advanced use cases.

AWS must provide a better developer experience so developers don’t need to worry about these topics and focus on writing business logic.

It’s Solvable

It’s solvable, but adopting any new technology always, always, always (!) requires investment. At CyberArk, we’ve reduced these challenges by creating self-service blueprints and codifying architectural blueprints and patterns so our developers get all the best practices right out of the box. Add internal workshops and knowledge sharing to that, and you will get on the right path. We’ve scaled from 10 serverless developers to hundreds.

It can be done.

I’ve shared many insights on AWS serverless office hours webinars and an AWS article and will share our journey at AWS Community Day DACH this September in Munich.

The Serverless Community

The serverless community is nothing short of amazing.

There are global events, serverless days, CDK day, AWS community days with tons of serverless content, AWS serverless developer advocates, AWS community builders specializing in serverless, AWS serverless heroes, serverless newsletters, and so on.

Experts are sharing knowledge like never before. Even AWS’s documentation has gotten better. And with the AI revolution, writing serverless code is easier.

Four years ago, it wasn’t like that.

Open-source libraries like Powertools for AWS Lambda have changed the industry. They recently crossed 100 billion API integrations per week. Powertools is making customers’ lives a lot easier worldwide.

There’s even a new serverless community on Discord called ‘Believe in Serverless’, where you can easily connect with serverless experts and engage in discussions about the technology. This is unprecedented access to pragmatic advice from those who actually use the tech.

The Future of Serverless

Let’s clarify this section. I’m not a prophet; this is a mere wish list.

The serverless future will be bright; serverless is not going anywhere, and it’s not dying.

However, if AWS wants to increase its adoption, it needs to improve the following issues:

Focus on developer experience for current and new features. Don’t release a service without debugging logs (I’m looking at you, EventBridge Pipes, which took a year to bring in failure logs).
More AWS Powertools, please! Support more runtimes (Rust, Go), bring them all to feature parity with the Python version, and extend it to containers, not just Lambda. Please invest in this amazing AWS DevEx team.
Don’t put the ‘serverless’ tag on services that are NOT genuinely serverless. If I need to set up a VPC or pay for idle time, it’s not serverless. Newer services that uphold these requirements, such as Verified Permissions, are encouraged.
All new non-compute services MUST be serverless. I don’t want to manage infrastructure ever again, so serverless should be the standard.
Focus on one IaC tool and make it the best you possibly can. For me, the obvious choice is CDK. Focus on releasing better CDK L2 and L3 constructs that implement best practices and security to turn constructs into productized patterns, as Jeremy Daly says.
Step function testing and definition — this is still too hard to do. Let’s change it. It’s a mandatory service that gets hate from developers due to poor DevEx.
Maintain GitHub sample projects for at least two years after publication and deprecate projects that are not maintained anymore.
Be more pragmatic. It starts with AWS Summit sessions and goes all the way to AWS re:Invent. Please bring more technical customer use cases. We want to learn from actual production environments, not just theory.
Provide production-grade, secure blueprints and recommended project setups, similar to what I created, the AWS Lambda handler cookbook. Maybe even endorse community projects if you don’t have the capacity to build them.
Help us tackle tenant isolation in a more generic and safe manner. Every company handles it differently, and it’s very error-prone.
We’re going to see AI creep into serverless, especially in the IaC domain, and that’s okay. I’ve yet to try AWS App Studio. Please don’t overdo it (like Luc wrote).
Strive to make features transparent — Lambda insights, for example, instead of having me as a user attaching a Lambda extension and managing its version (which is not a great experience; see my Lambda layers post), let me provide a CDK flag to the Lambda construct, and you will attach the extension behind the scenes and manage it for me.
Cross-account access at scale with serverless and EDA is challenging; we can and should simplify it. These are challenges that come with this architecture. I discuss it in my security post.

Summary

There have been many serverless outtakes and mixed opinions online. They stirred up the community and started a discussion, which is always good. It made me take a moment to think about the current state and organize my thoughts.

I hope you enjoyed reading my vision. Time will tell whether I was right or not.

Thank you Anton Aleksandrov, Bill Tarr and Johannes Koch for reviewing this post and offering insights.

Build a Chatbot with Amazon Bedrock: Automate API Calls Using Powertools for AWS Lambda and CDK

Ran Isenberg — Mon, 08 Jul 2024 07:51:34 +0000

Bedrock and LLMs are the cool kids in town. I decided to figure out how easy it is to build a chatbot using Bedrock agents’ capability to trigger an API based on OpenAPI schema with the new Bedrock support that Powertools for AWS offers.

This post will teach you how to use Amazon Bedrock agents to automate Lambda function-based API calls using Powertools for AWS and CDK for Python. We will also discuss the limitations and gotchas of this solution with Bedrock agents.

Disclaimer: I’m not an AI guy, so if you want to learn more about how it works under the hood, this is not your blog. I will show you practical code for setting up Bedrock agents with your APIs in 5 minutes.

This blog post was originally published on my website, “Ran The Builder.”

Bedrock Introduction
Bedrock Agents
Powertools for AWS Bedrock Support
What are We Building
Infrastructure
Lambda Function Handler — Powertools for Bedrock
Generating OpenAPI File
Bedrock Agent in Action
Limitations and Gotchas
Summary

Bedrock Introduction

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies — https://aws.amazon.com/bedrock/

Bedrock introduces a bold claim:

The easiest way to build and scale generative AI applications with foundation models

The fact that I could build such an application within an hour or so speaks volumes about this claim. I did, however, have help; I used the excellent Powertools for AWS Lambda documentation and its new support for Bedrock agents.

The way I see it, Bedrock offers a wide range of APIs and Agents that empower you to interact with third-party LLMs and utilize them for any purpose, depending on the LLM’s expertise, whether it’s acting as a general-purpose helper, writing music, creating pictures out of text, or calling APIs on your behalf.

Bedrock doesn’t require any particular infrastructure for you to deploy (VPCs etc.) or manage. It is a fully managed service, and you pay only for what you use, but it can get expensive. The pricing model is quite complex and varies greatly depending on the models you use and the features you select.

Highly sought-after features like guardrails (clean language filters, personal identifiable information scanners, etc.) add to the cost, but again, they are fully managed.

Bedrock Agents

Agents enable generative AI applications to execute multistep tasks across company systems and data sources

Agents are your friendly chatbots that can run multi-step tasks. In our case, they will call APIs according to user input.

Bedrock agents have several components:

Model — the LLM you select for the agent to use.
Instructions — the initial prompt that sets the context for the agent’s session. This is a classic prompt engineering practice: ‘you are a sales agent, selling X for customers’.
Action groups — You define the actions the agent can perform for the user. You provide an OpenAPI schema and a Lambda function that implements that OpenAPI schema.
Knowledge bases — Optional. The agent queries the knowledge base for extra context to augment response generation and input into steps of the orchestration process.

If you want to learn how they work, check out AWS docs.

Powertools for AWS Bedrock Support

Agents for Bedrock, or just “agents,” understand the free text input, find the correct API to trigger, build the payload according to the OpenAPI, and learn whether the call was successful.

At first, I didn’t realize that Bedrock expects your APIs to change.

Usually, I serve my APIs with API Gateway, which triggers my Lambda functions. The event sent to the function has API Gateway metadata and information, and the body comes as a JSON-encoded string.

Agents don’t interact with an API Gateway URL. They interact with a Lambda function (or more than one), each providing a different OpenAPI file. Agents invoke the functions directly, send a different input than API Gateway, and expect a different response than the regular API Gateway response.

Powertools abstracts these differences. I was able to take a Lambda function that worked behind an API Gateway, use Powertools’ event handler for API Gateway, and change the event handler type to Bedrock handler, and it just worked with the agents. Amazing!

Below, you can see the flow of events:

Agents use LLM and user input to understand what API (Lambda function) to invoke using the OpenAPI file that describes the API.
Powertools handles the Bedrock agent input parsing, validation, and routes to the correct inner function. Each inner function handles a different API route, thus creating a monolith Lambda.
Your custom business logic runs and returns a response that adheres to the OpenAPI schema.
Powertools returns a Bedrock format response that contains the response from step 3.
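To make the four steps above concrete, here is a minimal, standard-library-only sketch of what such an event handler does conceptually. It is not Powertools' implementation. The event and response field names follow my understanding of the Bedrock agent Lambda contract, and the route table, the `route` decorator, and the `customer_name` / `order_item_count` fields are hypothetical names chosen for illustration.

```python
import json
from typing import Any, Callable, Dict, Tuple

# Hypothetical route table: (HTTP method, API path) -> inner function.
ROUTES: Dict[Tuple[str, str], Callable[[Dict[str, Any]], Dict[str, Any]]] = {}


def route(method: str, path: str):
    """Register an inner function for one API route (a stand-in for a real router)."""
    def decorator(func):
        ROUTES[(method, path)] = func
        return func
    return decorator


def parse_body(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the agent's list of {name, type, value} properties into a dict."""
    props = event["requestBody"]["content"]["application/json"]["properties"]
    body: Dict[str, Any] = {}
    for prop in props:
        value = prop["value"]
        # Values arrive as strings; convert the ones the schema marks as integers.
        body[prop["name"]] = int(value) if prop["type"] == "integer" else value
    return body


@route("POST", "/api/orders")
def create_order(body: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder business logic; a real handler would save the order.
    return {
        "customer_name": body["customer_name"],
        "order_item_count": body["order_item_count"],
    }


def lambda_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    # Steps 1-2: pick the inner function from the agent's path and method.
    handler = ROUTES[(event["httpMethod"], event["apiPath"])]
    # Step 3: run the business logic on the parsed input.
    result = handler(parse_body(event))
    # Step 4: wrap the result in the response envelope the agent expects.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            "responseBody": {"application/json": {"body": json.dumps(result)}},
        },
    }
```

Neither the input nor the output resembles an API Gateway proxy event, which is why the same function cannot serve both without an abstraction layer in between.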

This brings me to problem number one — you can’t use the Lambda function with Bedrock agents and an API Gateway. You need to choose only one.

This is a major problem. It means I need to duplicate my APIs: one for Bedrock and another for regular customers. The inputs and responses are just too different. It’s really a shame that Bedrock didn’t extend the API Gateway model by adding Bedrock agent context and headers.

If you want to see a TypeScript variation without Powertools, then I highly suggest you check out Lee Gilmore’s post.

What are We Building

We will build a Bedrock agent that will represent a seller. We will ask the agent to purchase orders on our behalf. We will use my AWS Lambda Handler Cookbook template open-source project that represents an order service. You can place orders by calling the POST ‘/api/orders’ API with a JSON payload of customer name and item counts. Orders are saved to a DynamoDB table by a Lambda function.

The Cookbook template was recently featured in an AWS article.

I altered the template and replaced the API Gateway with Bedrock agents. We will build the following parts:

Agents with CDK infrastructure as code
Generate OpenAPI schema file
Lambda function handler’s code to support Bedrock

All code examples can be found at the bedrock branch.

Infrastructure

We will start with the CDK code to deploy the Lambda function and the agent.

You can also use SAM according to Powertools’ documentation.

First, add the ‘cdklabs’ constructs to your poetry file.

Let’s review the Bedrock construct.

This construct is about 90% the one shown in Powertools’ excellent documentation:

In lines 18–24, we create the Bedrock agent.

In line 21, we select the model we wish to use.

In line 22, we supply the prompt engineering instruction.

In line 23, we prepare the agent to be used and tested immediately after deployment.

In lines 26–38, we prepare the action group and connect the Lambda function. We get input for the OpenAPI file. The OpenAPI file in this example resides in the ‘docs/swagger/openapi.json’ file.

Lambda Function Handler — Powertools for Bedrock

Let’s review the code for the Lambda function that implements the API of the orders service.

In line 17, we initiate the Bedrock Powertools event handler. Input validation is enabled by default.

In lines 27–43, we define our POST ‘/api/orders’ API. This metadata helps Powertools generate an OpenAPI file for us (see next section). It defines the API description, input schema, HTTP responses, and their schemas.

In lines 59–62, we define the function’s entry point. According to the input path and HTTP command (POST), it will route the Bedrock agent request to the correct inner function. In this example, there is just one function in line 20.

In lines 49–53, we handle the input (validated automatically by Powertools!) and pass it to the inner logic handler to create the order and save it in the database. This is a hexagonal architectural implementation. You can learn more about it here.

In line 56, we return a Pydantic response object, and Powertools handles the Bedrock response format for us.

Find the complete code here.

Generating OpenAPI File

Powertools for AWS Lambda provides a way to generate the required OpenAPI file from the code.

Let’s see the simplified version below:

You can run this code and then move the output file to the folder you assign in the CDK construct at line 28.

Bedrock Agent in Action

After we deploy our service, we can enter the Bedrock console and see our new agent waiting for us:

Let’s test it out and chat with it in the console:

Success! It understood that we wanted to place an order; it built the payload input, executed the Lambda function successfully, and even displayed its output.

Let’s see what input it sent to the function (I printed it off the Lambda logs)

As you can see, it’s very different from the API Gateway schema. Line 8 contains all sorts of metadata about the agent origin, the API path, the payload, lines 9–18, and the HTTP command, which is shown in line 20.

Let’s verify the functionality of the function and the accuracy of the agents by examining the order that was successfully saved to the DynamoDB table.

As you can see, the order id matches the agent’s response and the input parameters.

Limitations and Gotchas

The Powertools’ documentation and code examples were spot on. It worked out of the box.

Powertools did an excellent job. However, I’ve had several issues with Bedrock agents:

Bedrock agents currently support OpenAPI schema 3.0.0 but not 3.1.0. Pydantic 2, which I use to define my models, generates the latest version. I had to change the number to 3.0.0 manually and hope that my API does not use any special 3.1.0 features. Luckily, it didn’t, but it was tough to find the error as it was raised during CloudFormation deployment (‘OpenAPI file rejected’) and didn’t explain why my schema file was unsuitable. Powertools’ excellent support over their Discord channel helped me. Thanks, Leandro!
Your Lambda needs to be monolithic and contain all the routes of your OpenAPI. An alternative would be to create multiple action groups with multiple OpenAPI files, which is doable but does not scale with a large number of routes and APIs.
This one’s a major issue: you can’t use the Lambda function with agents and an API Gateway. You need to choose. This means I need to duplicate my APIs: one for Bedrock, another for regular customers. The inputs and responses are just too different. It’s really a shame that Bedrock didn’t extend the API Gateway model by adding Bedrock agent context and headers.
My agent sent an incorrect payload type. The schema marked a payload field as an integer, but the agent kept sending its value in the JSON object as a string instead of a number. My API has strict validation, so it didn’t convert the string to a number and failed the request. I had to debug the matter, which was not as easy as I’d hoped. Your mileage may vary with different LLMs; I chose the “simplest” one.
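The manual version change from the first gotcha can be scripted. The sketch below is one possible way to do it, assuming the schema is a plain JSON file; the function names are hypothetical. It only rewrites the top-level version string and does not convert any 3.1.0-only constructs, so it is safe only if the schema avoids them.

```python
import json
from pathlib import Path


def downgrade_openapi_version(schema: dict, target: str = "3.0.0") -> dict:
    """Return a copy of the schema with the top-level 'openapi' version replaced."""
    patched = dict(schema)  # shallow copy so the caller's dict is left untouched
    patched["openapi"] = target
    return patched


def patch_file(path: str) -> None:
    """Rewrite the OpenAPI file in place with the downgraded version string."""
    file = Path(path)
    schema = json.loads(file.read_text())
    file.write_text(json.dumps(downgrade_openapi_version(schema), indent=2))
```

Running `patch_file` on the generated file before deployment keeps the step out of the manual checklist.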

Summary

Chatting with an “agent” that resulted in a DynamoDB entry being created is quite amazing. Even more amazing is the fact that I was able to get this working so fast. Managed services with CDK support are the way to go forward!

I hope Bedrock makes changes according to my feedback and improves the user experience. The current implementation does not allow me to use it in production APIs without duplicating a lot of code.

Amazon CloudFormation Custom Resources Best Practices with CDK and Python Examples

Ran Isenberg — Tue, 18 Jun 2024 05:23:43 +0000

When I started developing services on AWS, I thought CloudFormation resources could cover all my needs. I was wrong.

I quickly discovered that production environments are complex, with numerous edge cases. Fortunately, CloudFormation allows for extension through custom resources. While custom resources can be handy, improper implementation can result in stack failures, deletion issues, and significant headaches.

In this blog post, we’ll explore CloudFormation custom resources, why you need them, and their different types. We’ll also define best practices for implementing them correctly with AWS CDK and Python code examples using Powertools for AWS, Pydantic and crhelper.

This blog post was originally published on my website, “Ran The Builder.”

The Case for a CloudFormation Custom Resource
Post Deployment Scripts
Custom Resource — One Stack to Rule Them All
CloudFormation Custom Resource Types
Plain AWS SDK Calls
SNS-Backed Custom Resource
Lambda Backed Custom Resource
Custom Resources Best Practices
Summary

The Case for a CloudFormation Custom Resource

Custom resources can be useful when your provisioning requirements involve complex logic or workflows that can’t be expressed with CloudFormation’s built-in resource types. — AWS Docs

Here are some examples that come to mind:

Adding a database to an Aurora cluster.
Creating a Cognito admin/test user for a user pool.
Creating a Route53 DNS entry or creating a certificate in a domain created in a different AWS account.
Uploading a JSON file as an observability dashboard to DataDog
You want to trigger a resource provisioning that takes a lot of time — maybe up to an hour.
Any non AWS resource that you wish to create.

Post Deployment Scripts

To counter such scenarios, I’ve seen people add a mysterious ‘post_deploy’ script to their CI/CD pipeline that runs after the CF deployment stage and creates the missing resources and configurations via API calls.

It’s dangerous. If that script fails, you cannot automatically revert the CF stack deployment because it has already completed, leaving your service in an unstable state.

In addition, people forget that resources have a lifecycle and neglect to handle object deletion, thus leaving many orphaned resources when the stack is deleted.

Custom Resource — One Stack to Rule Them All

The way I see it, everything that you do in the pipeline’s deployment stage, any resource that you add or reconfigure, should update together as there are dependencies. If there’s a failure, CloudFormation will reliably revert the stack deployment and safeguard your production from being broken.

Our solution is to stress the importance of including ALL resources and configuration changes, including their lifecycle event handling (more on that below), as part of the CloudFormation stack as a **custom resource**.

However, it’s not all roses and daisies. Many people stay away from custom resources because mistakes can be highly annoying — from the custom resources failing to delete to waiting for up to an hour for a stack to fail deployment.

Rest assured, you’d be fine if you followed the code examples and best practices.

Let’s review the types of custom resources.

CloudFormation Custom Resource Types

It’s important to remember that every CloudFormation resource has life cycle events it needs to implement. The main events include creation, update (due to logical ID or configuration changes), and deletion. When we build our custom resource, we will need to define its behavior in reaction to these CloudFormation events.

There are three types of custom resources; let’s list them from the simplest to the most customized and complicated options:

Plain AWS SDK calls — simple, less code to write
SNS-backed resource — more complicated
Lambda-backed resource — the most complicated but the most flexible

Let’s start with the first type.

Plain AWS SDK Calls

This is the simplest way to implement a custom resource. In the example below, we want to create a Cognito user pool test user right after the user pool is created.

The process of creating and deleting a user is as simple as making a call to the AWS SDK. You can find the necessary steps [here] and [here].

Let’s see how we can translate these API calls to a simple CDK object.

We define a CDK function that receives a Cognito user pool object used as SDK parameters (its ID and ARN).

In line 7, we create a new ‘AwsCustomResource’ instance.

In line 10, we pass the API definition for the creation process: the AWS SDK service, the API name ‘adminCreateUser,’ and its parameters. Similarly, we can add ‘on_delete’ and ‘on_update’ handlers.

Behind the scenes, AWS creates a singleton Lambda function that handles the CloudFormation lifecycle events for you — super simple and easy!

In line 26, we add a dependency; this resource depends on the user pool created before running an API.

Bottom line: if you can map your lifecycle events to AWS SDK API calls, it’s the best and most straightforward way to cover CloudFormation’s missing capabilities with minimal code.

SNS-Backed Custom Resource

The second type is an interesting one.

I’d use this custom resource to trigger long provisioning (up to an hour!) in a decoupled and async manner via an SNS message. Depending on where the SNS topic resides, it can create resources or configurations, even in a different account.

One practical application of this custom resource type is to send all custom resource creation information to a centralized account. This allows for easy tracking of unique resources, enhancing organizational visibility.

This is a use case I describe in an article that I wrote with Bill Tarr from the AWS SaaS factory for the cloud operations AWS blog website. Hopefully, it will be released soon.

The entire GitHub repository can be found here.

Event Flow

Let’s review the custom resource creation flow below. Please note that the SNS to SQS to Lambda pattern is not given in the CDK below; it is assumed that the SNS topic owner (perhaps even in a different CF stack) creates this pattern. However, I will provide the Lambda function code, as it contains logic specific to custom resources.

Custom resource creation event flow:

Parameters are sent as a dictionary to the SNS topic. You must ensure the topic accepts messages from the deploying account/organization.
SNS topic passes the message to its subscriber, the SQS queue.
SQS queue triggers the Lambda function with a batch of messages (min size is 1).
The Lambda function parses the messages and extracts the custom resource event type (create/delete/update) and its parameters, which appear in the ‘resource_properties’ property of the SQS message body. Note that you will be given both the previous and current parameters for an update event.
The Lambda function handles the logic aspect of the custom resource, creating or configuring resources.
The Lambda function sends a PUT request to the pre-signed S3 URL that is part of the event, with the correct status (failure/success) and any other required information. Click here for a ‘create’ event example.
Custom resource is released from its wait state, deployment ends with a success or failure (reverted).
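The unwrapping in steps 3 and 4 can be sketched with the standard library alone. This assumes the SQS subscription does not use raw message delivery, so each SQS record body is an SNS envelope whose `Message` field holds the CloudFormation event as a JSON string. The function names are hypothetical. In the raw event the properties appear as `ResourceProperties` and `OldResourceProperties`; snake_case names such as ‘resource_properties’ come from the parsing models.

```python
import json
from typing import Any, Dict, Iterator


def iter_custom_resource_events(sqs_event: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the CloudFormation event wrapped inside each SQS record."""
    for record in sqs_event.get("Records", []):
        sns_envelope = json.loads(record["body"])  # SQS body -> SNS envelope
        yield json.loads(sns_envelope["Message"])  # SNS message -> CloudFormation event


def summarize(cfn_event: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the fields the handler needs from one CloudFormation event."""
    return {
        "request_type": cfn_event["RequestType"],  # Create / Update / Delete
        "properties": cfn_event.get("ResourceProperties", {}),
        # Only present on Update events.
        "old_properties": cfn_event.get("OldResourceProperties", {}),
        "response_url": cfn_event["ResponseURL"],
    }
```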

During the deployment in stage 1, the custom resource enters a wait state after it sends an SNS message. The message receiver needs to release the resource from its wait state. If an hour passes without this release (default timeout time), the stack fails on a timeout, and a revert takes place. If the message receiver sends a failure message back, the stack fails, and a revert takes place.

The receiver must send an HTTP PUT request with a specific body that marks success or failure to a pre-signed S3 URL the custom resource generates.
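For readers who want to see what that release request looks like without a helper library, here is a minimal sketch using only the standard library. The body fields follow the CloudFormation custom resource response format as I understand it; `build_response` and `send_response` are hypothetical names. To my knowledge CloudFormation expects an HTTP PUT with an empty content type here, so treat both details as assumptions to verify against the AWS documentation.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_response(
    event: Dict[str, Any],
    status: str,
    physical_resource_id: str,
    reason: str = "",
    data: Optional[Dict[str, Any]] = None,
) -> bytes:
    """Build the JSON body that releases the custom resource from its wait state."""
    if status not in ("SUCCESS", "FAILED"):
        raise ValueError("status must be 'SUCCESS' or 'FAILED'")
    body = {
        "Status": status,
        "Reason": reason or "See the function logs for details",
        "PhysicalResourceId": physical_resource_id,
        # These three values are echoed back from the incoming event.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }
    return json.dumps(body).encode("utf-8")


def send_response(event: Dict[str, Any], body: bytes) -> int:
    """Upload the body to the pre-signed URL found in the event."""
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # An explicit empty content type stops urllib from adding a default one.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A helper library wraps exactly this step, plus the error handling that guarantees a response is always sent.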

Elements 2–6 can be part of a different AWS account belonging to a different team entirely in your organization and serve as a ‘black box’ orchestration. In that case, you just build the Custom resource, which is relatively easy.

Custom Resource CDK Code

Let’s start with the custom resource definition. The custom resource sends the SNS topic a message with predefined parameters as the message body. Each life-cycle event (create, delete, update) automatically sends a different SNS message that carries the CDK properties we defined. In an update event, both the current and previous parameters are sent.

In lines 9–18, we define the custom resource.

In line 12, we provide the SNS topic ARN as the message target.

In line 13, we define the resource type (it will appear in the CF console), and it must start with ‘Custom::.’

In line 15, we define the dictionary SNS message payload that will be sent to the topic. We can use any set of keys and values we want as long as their value is known during deployment.
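As a rough illustration of what such a definition boils down to in the CloudFormation template, the sketch below builds the resource as a plain dictionary: the topic ARN becomes the `ServiceToken`, and the payload keys sit next to it under `Properties`. The helper name and the example values are hypothetical, and only the ‘Custom::’ prefix rule mentioned above is checked.

```python
from typing import Any, Dict


def sns_custom_resource(
    topic_arn: str, resource_type: str, properties: Dict[str, Any]
) -> Dict[str, Any]:
    """Build a template-style definition of an SNS-backed custom resource."""
    if not resource_type.startswith("Custom::"):
        raise ValueError("resource type must start with 'Custom::'")
    return {
        "Type": resource_type,
        # The ServiceToken is where CloudFormation sends the lifecycle events.
        "Properties": {"ServiceToken": topic_arn, **properties},
    }
```

For example, `sns_custom_resource(topic_arn, "Custom::Demo", {"team": "orders"})` yields a resource whose `team` value reaches the receiver inside the resource properties of every lifecycle event.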

Lambda Function’s Code

Let’s review the receiver side of the flow and how it handles the CF custom resource events. We will use the ‘crhelper’ library to handle the events, combined with Powertools’ Parser utility and ‘pydantic’ for input validation. ‘crhelper’ will route each event to the appropriate function inside the handler, manage the response to the S3 pre-signed URL, and handle errors (send a failure response for every uncaught exception).

The code below is taken from one of my open-source projects, which deploys Service Catalog products and uses custom resources and SNS messages. Other than the code under the ‘logic’ folder, which you can replace with your own implementation, most of the code is generic.

You can view the complete code here.

The flow is simple:

Initialize the CR helper library. It will handle the routing to the inner event handler functions and, once completed, release the custom resource from a wait state (see 2c below) with an HTTP request.
Iterate the batch of SQS messages and per SQS message:
Route to the correct inner function according to the SQS message body, the custom resource CF event. Route ‘create’ events to my ‘create_event’ function, ‘delete’ to the ‘delete_event’ function, and ‘update’ to ‘update_event.’
Each ‘x_event’ function parses the input according to the expected parameters defined in the CDK code and the ‘CloudFormationCustomResource’ schemas (lines 5–7). We leverage the Powertools for AWS parser utility and pass the payload to the logic layer that creates, deletes, or updates resources.
‘crhelper’ sends an HTTP PUT request to the pre-signed URL with either success or failure information. Failure is sent when the inner event handlers raise an exception.

In line 13, we import the event handler logic functions, which are in charge of the resource logic. Replace this import with your implementation. I followed a Lambda best practice of writing the function with architectural layers. Click here to learn more.

In lines 17–22, we initialize the ‘crhelper’ utility.

In line 43, we must return a resource ID in the ‘create_event’ function. It’s crucial to make sure it is unique. Otherwise, you won’t be able to create multiple custom resources of this type in the same account.

In line 50, we implement an update flow. This can happen when either the resource id changes or the input parameters change. The CloudFormation event will contain both the current and previous parameters so it’s possible to find the differences and make changes in the logic code accordingly.
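One simple way to find those differences is a dictionary diff between the previous and current parameters. The sketch below is a hypothetical helper, not part of any library. It ignores `ServiceToken`, which CloudFormation includes among the resource properties.

```python
from typing import Any, Dict


def diff_properties(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare previous and current resource properties of an update event."""
    ignored = {"ServiceToken"}  # infrastructure detail, not a user parameter
    keys = (set(old) | set(new)) - ignored
    return {
        "added": {k: new[k] for k in keys if k not in old},
        "removed": {k: old[k] for k in keys if k not in new},
        # Map each changed key to its (previous, current) pair of values.
        "changed": {
            k: (old[k], new[k])
            for k in keys
            if k in old and k in new and old[k] != new[k]
        },
    }
```

The logic layer can then act only on the keys that changed instead of re-provisioning everything.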

The bottom line is that if you need to trigger a provision or logic in another account or service (that might belong to another team), this is a great way to decouple this logic between the services and allow a long process, which can last up to an hour.

Lambda Backed Custom Resource

In this case, the custom resource triggers a Lambda function with a CloudFormation life-cycle event to handle. It’s beneficial in cases where you want to write the entire provisioning flow yourself and maintain it in the same project; that’s in contrast to the previous custom resource, where you send an async message to an SNS topic and let someone else handle the resource logic.

Let’s review the custom resource creation flow in the diagram below.

Event Flow

Custom resource creation event Flow:

Parameters are sent as a dictionary as part of the event to the invoked Lambda function.
Lambda function parses the messages, extracts the custom resource event type (create/delete/update) and its parameters that appear at ‘resource_properties’. Note that for an ‘update’ event you will be given both the previous and current parameters.
The Lambda function handles the logic aspect of the custom resource, creating or configuring resources.
The Lambda function sends a PUT request to the pre-signed S3 URL (‘ResponseURL’ in the event) with the correct status, failure or success, and any other required information. Click here for a ‘create’ event example.
Custom resource is released from its wait state, deployment ends with a success or failure (reverted).
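The last two steps of the flow can be sketched with the standard library alone. This is a minimal illustration of what a helper such as ‘cr_helper’ does for you, not its actual implementation: it builds the response body CloudFormation expects and sends it to the pre-signed URL with an HTTP PUT. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_cfn_response(
    event: Dict[str, Any],
    status: str,
    physical_resource_id: str,
    reason: str = "",
    data: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the JSON body that releases the custom resource from its wait state."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See the Lambda function logs for details",
        "PhysicalResourceId": physical_resource_id,
        # The next three values must be echoed back exactly as received.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_cfn_response(event: Dict[str, Any], body: Dict[str, Any]) -> int:
    """PUT the response body to the pre-signed S3 URL and return the HTTP status code."""
    request = urllib.request.Request(
        url=event["ResponseURL"],
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        # An empty content type stops urllib from adding its form-encoded default.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

If the function never sends this response, the stack waits until the custom resource times out, which is why the error handling around it matters so much.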

You can use this resource to trigger a longer provision process (up to an hour) by triggering a Step Function state machine in the Lambda function, as long as you send the S3 pre-signed URL to that process so it can mark the result instead.

Custom Resource CDK Code

Let’s review the code below.

In lines 10–16, we build the Lambda function to handle the CF custom resource events.

In line 18, we define a provider, a synonym for an event handler, and set our lambda function as the custom resource event target.

In lines 19–27, we define the custom resource and set the service_token as the provider’s service token. See the provider definition here.

In lines 24–25, we define the input parameters we want the Lambda to receive. We can pass whatever parameters the Lambda can use during the provisioning process.

In line 27, we set the custom resource type in the CF console. It must start with ‘Custom::.’

Lambda Function’s Code

Let’s review the function’s code below. It will look familiar from the previous example, but without the SQS batch iteration section, which is replaced with a global error handler in lines 19–23.

We define one function for each event type: create, update, delete, and the ‘helper’ library knows which one to trigger based on the incoming input event properties.

Pydantic and Powertools’s parser utility are used as before to parse the input of every event. This input is then passed to any logic function you write to handle the event: create a resource, send an API request, delete resources, etc.

As before, we need to return a resource ID in the ‘create_event’ function. It’s crucial to make sure it is unique; otherwise, you won’t be able to create multiple custom resources of this type in the same account.

As in the SNS example, the functions ‘handle_delete,’ ‘handle_create,’ and ‘handle_update’ are your implementation logic.

Bottom line: If you need to trigger a flow and manage it entirely in the same account via Lambda function code, this is a great way to do so and handle its life-cycle events.

Custom Resources Best Practices

Custom resources are error-prone, so you must put extra care into your error-handling code.

Failing to do so can result in resources that CF cannot delete.

Here are a few pointers:

Use the tools in this guide: ‘cr_helper’ and Powertools.
Read the documents specified in this guide to make sure you understand the input events and when each event is sent.
Understand timeouts and ensure you configure all the resources accordingly — Lambda timeout definition, CR timeout, etc.
Try to be as flexible in the Lambda function logic implementation as possible. Don’t fail on every issue. For example, if you need to delete a resource via API and it’s not there, you can return a success instead of failing.
Test, test, and test again the create, update, and delete flows. Be creative and ensure proper integration and E2E tests for your Lambda. Learn here in my testing blog series about serverless tests.
Set the custom resource timeout setting. It can now be changed so you don’t wait for an hour in case of an error in your code.
‘cr_helper’ also provides a polling mechanism helper for longer creation flows — use it when required. I have yet to use it. See the readme.
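As an illustration of the “don’t fail on every issue” pointer, here is a minimal sketch of a tolerant delete handler. The exception class and the injected ‘delete_fn’ callable are hypothetical stand-ins for whatever API your logic layer calls.

```python
from typing import Callable


class ResourceNotFoundError(Exception):
    """Hypothetical error raised by the downstream API when the resource is already gone."""


def handle_delete(resource_id: str, delete_fn: Callable[[str], None]) -> str:
    """Delete the resource, treating 'not found' as a successful delete."""
    try:
        delete_fn(resource_id)
    except ResourceNotFoundError:
        # Nothing left to delete, so report success and let CloudFormation finish.
        return "already_deleted"
    # Any other exception propagates, so the failure is reported to CloudFormation.
    return "deleted"
```

Only the “already gone” case is swallowed here; unexpected errors still fail the delete so that they are visible.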

Finally, choose the simplest custom resource that makes sense to you. Don’t over-engineer and think about custom resource team ownership. Decouple when possible with the SNS mechanism if another team handles the provision flow. In that case, it’s best to do it in a centralized manner.

Summary

This post covered several cases that CloudFormation native resources don’t cover. We learned about custom resources, their types, and their use cases, and reviewed general best practices with CDK and Python code.

AWS Security Best Practices: Leveraging IAM for Service-to-Service Authentication and Authorization

Ran Isenberg — Mon, 03 Jun 2024 06:59:42 +0000

A critical aspect of cloud services is service-to-service communication, and it’s essential to do it securely. As an architect who designed centralized authentication and authorization services in CyberArk, a cyber security SaaS provider, I will share my take on developing a secure authorization mechanism between AWS cloud-based services, whether serverless or containers.

By the time you finish reading this post, you will not only understand the importance of service authentication and authorization, but also be equipped with the practical knowledge to implement it. This includes securely enabling cross-account access to resources for both synchronous and asynchronous communication patterns using AWS IAM.

This post includes JSON IAM policies and Python code examples.

This post is the first of many security topics in an upcoming series.

This blog post was originally published on my website, “Ran The Builder.”

Authentication and Authorization Concepts
Service-to-Service Communication Patterns
IAM Based Authentication & Authorization
Synchronous Communication IAM Solution
Asynchronous Communication Solution
Choosing Between Assume Role and Resource Based Policies
Summary
Appendix: Private and Public API Gateways

Authentication and Authorization Concepts

Authentication is:

Authentication is the act of proving an assertion, such as the identity of a computer system user — Wikipedia

Authorization, on the other hand, is:

Authorization is the function of specifying access rights/privileges to resources — Wikipedia

In a secure service, authentication and authorization go hand in hand.

As a service developer who exposes a REST API, it’s crucial to understand that leaving your API open to the world without proper authentication and authorization measures can lead to unauthorized access and data manipulation.

You want your service first to accept communication requests from authenticated services (principals), those whose identities are proven to be trusted and known. Once identified, you want to ensure that only authorized services can trigger your API or communicate with your service.

Or in other words:

Authentication aims to validate and identify that a principal is who it claims to be.
Authorization aims to define the relationship between principals, actions, and resources.

We can visualize authorization as the relationship between the principals (services), the actions they wish to take (e.g., execute an API), and the resources (the actual REST API endpoint and HTTP action type) on which their actions are called.

So, service-to-service communication security comes in two parts:

First, make sure the caller service is who it claims to be.
Then, assert that it has permission to take action on the resource.

To better understand the importance of authentication and authorization, let’s consider some real-world examples of service-to-service communication. These examples will highlight the critical role these measures play in ensuring the security and integrity of your REST API services.

Service-to-Service Communication Patterns

Developers spend a lot of time designing customer-facing REST APIs. These APIs introduce the concept of users to the system (human or non-human). AWS offers several user and service authentication options such as Amazon Cognito (Cognito) and AWS Identity and Access Management (IAM), with IAM being the main option for internal user management, while Cognito can connect to external identity providers such as CyberArk and others.

In this post, I’d like to focus on AWS IAM and its role in an important aspect of cloud services: service-to-service authentication and authorization.

Synchronous Communication Pattern

In this use case, there are three services: A, B, and C, each in its own AWS account.

Service A is serverless, B is EC2-based, and C serves a REST API with API Gateway. Both services A and B use service C’s API.

*This part discusses public API Gateways. For private API Gateway, visit the appendix below.

However, service C wants to ensure that only these specific services can call its API and perhaps even fine-tune it to be least privileged so that only a particular Lambda function or an EC2 machine can trigger it.

Let’s continue to the second pattern.

Asynchronous Communication Pattern

In this use case, we have our three services as before:

A, B and C each in its own AWS account as before.
C holds an SNS topic used as a centralized publisher-subscriber asynchronous communication service in the organization.

Service A is the publisher, and Service B subscribes to the SNS topic via an SQS queue.

Service C wants to ensure that only Service A can publish messages to the SNS topic and that only Service B’s SQS can subscribe to the topic as the topic contains data that you should keep private.

This is a standard pattern. You can swap the SNS topic to the EventBridge bus or any messaging service you may have, such as Amazon MSK.

Single or Multiple Accounts

As a side note, services A, B, and C may share the same account, simplifying the IAM solution. However, following the IAM practices in this post is essential even if all services share the same account, so do not take shortcuts. You might find yourself moving them to different accounts in the future. In that case, if you don’t follow the best practices, you will have a hard time and many breaking changes ahead of you (from experience!). It’s best always to follow best practices, especially when security is involved.

IAM Based Authentication & Authorization

Now that we’ve covered the basics of service-to-service authentication and authorization issues, let’s discuss the solution and start using AWS IAM.

This solution will cover both cross-account and same-account authentication and authorization.

IAM authentication means that a principal, for instance, service A, must be authenticated (signed in to AWS) using their credentials to send a request to other AWS services and resources, or service C in our example. Once authenticated, we can leverage IAM again to ensure service A is authorized to access service C.

We will combine two aspects of IAM, identity-based and resource-based policies, and provide authorization solutions for both synchronous and asynchronous communications. We will also provide cross-account access, i.e., authorizing services from different AWS accounts using the “assume role” or delegation mechanism.

Resource Based Policies

According to AWS documentation:

Resource-based policies grant the specified principal permission to perform specific actions on that resource and defines under what conditions this applies

Resource-based policies are JSON policy documents that you attach to a resource, such as API Gateway, SNS, S3, or another resource. With resource-based policies, you can specify who has access to the resource and what actions they can perform on it. In addition, they can be used to allow cross-account access. Sounds perfect, right? Well, it has limitations, and not all resource types support it.

Let’s review the pros and cons of this IAM mechanism.

Pros:

Simple to define
Enable cross-account access.

Cons:

Not all AWS resources support resource-based policies.
Resource-based policies have a size limit like all IAM policies. When you define more resources and reach the maximum size, there’s no way to overcome it other than provisioning a new resource (duplicating part of service C, basically).

Assume Role Mechanism

The IAM access delegation mechanism, or “assume role,” as I call it, is crucial in cross-account communication. However, it can also be used in a single AWS account scenario.

The IAM access delegation mechanism is underpinned by identity-based policies attached to an IAM role. In our example, role C is located in service C’s account and grants access to the resource in question, which could be service C’s API Gateway or SNS topic.

Whoever has permission to assume role C can get a set of temporary IAM security credentials that can be used for authentication and authorization to communicate with service C, whether to execute a REST API call or publish a message to an SNS topic.

Assuming a role involves making an AWS SDK call to AWS STS (AWS Security Token Service) and using the temporary credentials from the SDK response to initiate a communication session with the service you intend to communicate with. We will explore code examples later in this post.

However, we need to define who can assume this role and gain access to service C. We will need a resource policy on the role that specifies that only the roles of services A and B can assume role C and gain access to service C.

Let’s review the pros and cons of this IAM mechanism.

Pros:

Supports all resource types as long as you can define an IAM policy.
Abstracts the resource and its ARN. You get a role that provides you access. The resource can change tomorrow, but all you know is the role ARN and what API call to use. Suppose the call is abstracted in an organizational SDK that encapsulates the resource. In that case, you can change resources (an SNS topic to an EventBridge bus, for example) and keep the changes in the role policies and SDK implementation levels, while the role ARNs remain the same.
Role C’s resource-based policy, which defines who can assume the role, has a size limit. However, if we want to add more services, we can provision a new role for new services to assume. Unlike the previous mechanism, it’s okay to provision a new resource as the protected resource remains unchanged (API Gateway or SNS topic).

Cons:

It is more complicated to define and requires an extra role in service C.
Assuming a role is another AWS SDK call that extends the overall runtime of services A and B and adds more error-prone code to maintain.

Let’s see how we can solve our authentication and authorization issues in synchronous and asynchronous communication patterns with concrete IAM policies and Python code examples.

Synchronous Communication IAM Solution

Let’s start by boosting the security of service C, the API Gateway. First, we will add an IAM authorizer to all its API endpoints. By doing so, by default, all unauthenticated and unauthorized requests are denied.

When IAM authorization is enabled, clients must use Signature Version 4 to sign their requests with AWS credentials. API Gateway invokes your API route only if the client has execute-api permission for the route. — AWS documentation

AWS IAM ensures that only authorized requests with authenticated and valid IAM credentials will execute service C’s API. All that remains is for service C to define which services (identities) and AWS accounts can execute its APIs.

There are two ways to achieve that:

Resource-based policy
Identity-based policies with the assume role mechanism.

Let’s start with a resource-based policy.

Resource Based Policy

In this use case, we need to alter service C’s API Gateway resource-based policy to allow services A and B (their Lambda function role, for instance, or their entire account, VPC endpoint, etc.) to execute the API and its endpoints. We can write a fine-grained definition and decide exactly which endpoint each service can communicate with.

Here’s an example of such a resource policy for service C API Gateway:

In lines 8–9, we allow a principal: either an entire account, such as service A’s or service B’s account, or a specific role ARN in a different account (the better, least-privileged option). Line 12 sets the action, executing an API endpoint, and line 14 sets the exact endpoint and HTTP method.
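Since the policy itself is easiest to reason about when you can see it, here is a minimal sketch that builds such a resource policy as a Python dictionary and prints it as JSON. Its line numbers do not correspond to the ones referenced above, and the account IDs, role names, API ID, stage, and path are all hypothetical.

```python
import json
from typing import Any, Dict, List


def build_api_resource_policy(
    api_arn: str, allowed_role_arns: List[str], stage: str, method: str, path: str
) -> Dict[str, Any]:
    """Allow only the given roles to invoke one method on one path of the API."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Principals: specific roles from other accounts (least privileged).
                "Principal": {"AWS": allowed_role_arns},
                "Action": "execute-api:Invoke",
                # Resource format: <api-arn>/<stage>/<HTTP method>/<resource path>
                "Resource": f"{api_arn}/{stage}/{method}{path}",
            }
        ],
    }


api_policy = build_api_resource_policy(
    api_arn="arn:aws:execute-api:us-east-1:333333333333:abc123defg",
    allowed_role_arns=[
        "arn:aws:iam::111111111111:role/service-a-lambda-role",
        "arn:aws:iam::222222222222:role/service-b-ec2-role",
    ],
    stage="prod",
    method="GET",
    path="/orders",
)
print(json.dumps(api_policy, indent=2))
```

Listing role ARNs instead of whole accounts keeps the policy least privileged, at the cost of a longer policy document.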

Be advised that at the moment, you cannot use this mechanism on an HTTP API Gateway, just the REST variant.

On their side, services A and B must define their roles (the Lambda function role for service A) with identity permissions for the same action specified in the resource policy. For cross-account access, you must define the policy in this two-sided manner. Service A defines its Lambda function’s role with permissions to call service C’s API Gateway, and service C allows service A to call it from the other account.

In addition, services A and B must sign their requests with their IAM credentials (Sig v4) and send the resulting signature in the HTTP Authorization header when calling service C’s API Gateway.

Here’s a Python example for this process:

In this example, we use our service’s role to authenticate with IAM, create the auth headers in line 7, and send them in line 15.
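To show what the signing step involves, here is a minimal, standard-library sketch of Signature Version 4 header creation. It is an illustration rather than the code referenced above, and in practice you would more likely rely on the AWS SDK or a signing library. It assumes a URL without an explicit port and a path made of unreserved characters; the function name, credentials, and URL are hypothetical.

```python
import datetime
import hashlib
import hmac
from typing import Dict, Optional
from urllib.parse import parse_qsl, quote, urlparse


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(
    method: str,
    url: str,
    region: str,
    access_key: str,
    secret_key: str,
    session_token: Optional[str] = None,
    body: bytes = b"",
    service: str = "execute-api",
    now: Optional[datetime.datetime] = None,
) -> Dict[str, str]:
    """Return the headers (including Authorization) for a SigV4-signed request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parsed = urlparse(url)

    headers = {"host": parsed.netloc, "x-amz-date": amz_date}
    if session_token:  # temporary credentials (Lambda role, assumed role)
        headers["x-amz-security-token"] = session_token

    # Step 1: canonical request
    canonical_uri = quote(parsed.path or "/", safe="/-_.~")
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}"
        for k, v in sorted(parse_qsl(parsed.query, keep_blank_values=True))
    )
    signed_headers = ";".join(sorted(headers))
    canonical_headers = "".join(f"{name}:{headers[name]}\n" for name in sorted(headers))
    canonical_request = "\n".join(
        [method.upper(), canonical_uri, canonical_query, canonical_headers,
         signed_headers, hashlib.sha256(body).hexdigest()]
    )

    # Step 2: string to sign
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()]
    )

    # Step 3: derive the signing key and sign
    key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    headers["Authorization"] = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return headers


# Hypothetical values; real credentials come from the service's IAM role.
example_headers = sigv4_headers(
    "GET",
    "https://abc123defg.execute-api.us-east-1.amazonaws.com/prod/orders",
    region="us-east-1",
    access_key="AKIDEXAMPLE",
    secret_key="not-a-real-secret",
    session_token="not-a-real-token",
)
```

The returned headers are sent as-is with the HTTP request; API Gateway recomputes the signature and rejects the call if it does not match or if the caller lacks ‘execute-api:Invoke’.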

Assume Role

In this use case, we need to create a role in service C’s account with permissions to execute the API Gateway and let a principal in services A and B assume it.

We can start by defining the role’s trust policy. In this policy, the resource is the role itself.

We let specific roles (principals) from services A and B assume this role (action). We can give a broader scope for services A or B, either to the entire account or to a role with a predefined prefix. However, it’s usually best to minimize the scope, so a role prefix or a specific role is better and more secure.

Next, we need to add the permissions to execute service C’s API to the role:
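The two policies attached to the role in service C’s account can be sketched as follows. This is a minimal illustration with hypothetical account IDs, role names, and API ARN: the trust policy states who may assume the role, and the permissions policy states what the role may do once assumed.

```python
import json
from typing import Any, Dict, List


def build_trust_policy(trusted_role_arns: List[str]) -> Dict[str, Any]:
    """Trust policy: only the listed roles (services A and B) may assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_role_arns},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_invoke_policy(api_resource_arn: str) -> Dict[str, Any]:
    """Permissions policy: whoever assumes the role may invoke the given API resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": api_resource_arn,
            }
        ],
    }


trust_policy = build_trust_policy(
    [
        "arn:aws:iam::111111111111:role/service-a-lambda-role",
        "arn:aws:iam::222222222222:role/service-b-ec2-role",
    ]
)
invoke_policy = build_invoke_policy(
    "arn:aws:execute-api:us-east-1:333333333333:abc123defg/prod/GET/orders"
)
print(json.dumps(trust_policy, indent=2))
print(json.dumps(invoke_policy, indent=2))
```

Note that the trust policy has no ‘Resource’ element: it is attached to the role, so the role itself is the implied resource.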

Lastly, the code on service A is very similar to before, with the addition of the assumed role code.

You can find code examples for multiple services here or refer to the code below.

For ‘RoleArn,’ you need to provide the role that service C creates and shares its ARN with you. Typically, service teams exchange ARNs manually. I’d recommend saving that ARN as an environment variable in the Lambda function. Also, ensure that your Lambda role has the necessary permissions to assume roles.

As you see below, the code is very similar to the previous code example, with the addition of the STS API call in lines 5–10 and the use of the response values in lines 11–18.
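The STS step on its own can be sketched like this. It is a minimal illustration whose line numbers do not match the ones referenced above. The STS client is passed in as a parameter so the function is easy to test; in a Lambda function it would be ‘boto3.client("sts")’. The function name, session name, and environment variable name are hypothetical.

```python
from typing import Any, Dict


def assume_role_credentials(
    sts_client: Any, role_arn: str, session_name: str = "service-a-session"
) -> Dict[str, str]:
    """Assume role C and return its temporary credentials as boto3-style keyword arguments."""
    response = sts_client.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    credentials = response["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


# Usage inside service A's Lambda function (requires boto3 and real AWS access):
#   import os
#   import boto3
#   creds = assume_role_credentials(boto3.client("sts"), os.environ["SERVICE_C_ROLE_ARN"])
#   # creds now hold role C's temporary identity; use them to sign the API request.
```

The temporary credentials expire, so long-running processes should assume the role again rather than cache the values indefinitely.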

If you wish to learn about private API Gateway use cases, refer to the appendix.

Asynchronous Communication Solution

Our goal is to allow service A to publish SNS messages to service C’s SNS topic and to let service B subscribe via SQS to the SNS messages.

Let’s review the two IAM implementation options we have.

Resource Based Policy

In this case, we will define an SNS access policy that allows service A (principal) to publish messages (action) to the SNS topic of service C (resource) and service B to subscribe to the messages. Be advised it’s best to fine-tune these permissions to the role that can publish and to the specific SQS that can subscribe.
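A minimal sketch of such an SNS access policy, built as a Python dictionary, could look like the following. The ARNs and account IDs are hypothetical; the subscribe statement uses the ‘sns:Protocol’ and ‘sns:Endpoint’ condition keys to narrow the permission to one specific SQS queue.

```python
import json
from typing import Any, Dict


def build_topic_policy(
    topic_arn: str,
    publisher_role_arn: str,
    subscriber_account_id: str,
    subscriber_queue_arn: str,
) -> Dict[str, Any]:
    """Allow service A's role to publish and only service B's queue to subscribe."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowServiceAPublish",
                "Effect": "Allow",
                "Principal": {"AWS": publisher_role_arn},
                "Action": "sns:Publish",
                "Resource": topic_arn,
            },
            {
                "Sid": "AllowServiceBSubscribe",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{subscriber_account_id}:root"},
                "Action": "sns:Subscribe",
                "Resource": topic_arn,
                # Narrow the subscription to SQS and to one specific queue.
                "Condition": {
                    "StringEquals": {
                        "sns:Protocol": "sqs",
                        "sns:Endpoint": subscriber_queue_arn,
                    }
                },
            },
        ],
    }


topic_policy = build_topic_policy(
    topic_arn="arn:aws:sns:us-east-1:333333333333:central-topic",
    publisher_role_arn="arn:aws:iam::111111111111:role/service-a-lambda-role",
    subscriber_account_id="222222222222",
    subscriber_queue_arn="arn:aws:sqs:us-east-1:222222222222:service-b-queue",
)
print(json.dumps(topic_policy, indent=2))
```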

On the service A side, the Lambda role will define its permissions to publish to the SNS topic. Having the two sides define the permissions allows them to work when dealing with cross-account access.

When publishing a message to the topic, service A’s Lambda function will utilize the AWS SDK to make the API call. The SDK uses the function role’s credentials to take care of the IAM authentication and authorization side.

On the service B side, we need to define an SQS subscription to the SNS topic by following the documentation here.
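Part of that setup is a queue policy on service B’s side that lets the SNS service deliver messages into the queue. Here is a minimal sketch with hypothetical ARNs; the ‘aws:SourceArn’ condition ensures that only service C’s topic can send to the queue.

```python
import json
from typing import Any, Dict


def build_queue_policy(queue_arn: str, topic_arn: str) -> Dict[str, Any]:
    """Allow SNS to send messages to the queue, but only from the given topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


queue_policy = build_queue_policy(
    queue_arn="arn:aws:sqs:us-east-1:222222222222:service-b-queue",
    topic_arn="arn:aws:sns:us-east-1:333333333333:central-topic",
)
print(json.dumps(queue_policy, indent=2))
```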

Assume Role

In this case, we need to create a role in service C’s account with permission to publish messages to the SNS topic. Then, we let a specific role of service A assume this role.

We can do this by defining the role’s trust policy.

We can give a broader scope for service A to the entire account or a role with a predefined prefix. However, it’s usually best to minimize the scope. Hence, a role prefix or a specific role is better and more secure.

Next, we need to add to the role the permissions to publish messages to service C’s SNS topic:

Now, on service A’s side, we need to assume the role and use the AWS SDK to send SNS messages. It’s also important to ensure the Lambda function role has the ‘sts:AssumeRole’ permission for that role and that the assumed role has ‘sns:Publish’ on the topic; otherwise, the AWS SDK calls will not work.

The service teams need to exchange the resource ARNs and account numbers for the policies to work.

Service A assumes the role via an SDK call and uses the temporary IAM credentials to create a boto client for the SNS publish message SDK call.
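The flow can be sketched as follows. It is a minimal illustration with the AWS clients injected as parameters so it can be exercised without AWS access; with boto3 you would pass ‘boto3.client("sts")’ and ‘functools.partial(boto3.client, "sns")’. The function name and session name are hypothetical.

```python
from typing import Any, Callable


def publish_with_assumed_role(
    sts_client: Any,
    sns_client_factory: Callable[..., Any],
    role_arn: str,
    topic_arn: str,
    message: str,
) -> str:
    """Assume service C's role, then publish to its SNS topic with the temporary credentials."""
    credentials = sts_client.assume_role(
        RoleArn=role_arn, RoleSessionName="service-a-publisher"
    )["Credentials"]

    # Build an SNS client that acts as the assumed role, not as the Lambda role.
    sns_client = sns_client_factory(
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
    response = sns_client.publish(TopicArn=topic_arn, Message=message)
    return response["MessageId"]


# Usage with boto3 (requires real AWS access):
#   import functools
#   import boto3
#   message_id = publish_with_assumed_role(
#       boto3.client("sts"), functools.partial(boto3.client, "sns"),
#       role_arn, topic_arn, '{"order_id": "123"}',
#   )
```

Reusing the SNS client across invocations while the credentials are still valid avoids paying for the extra STS call on every publish.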

Service B remains as in the resource-based policy example; it works only with a resource policy that allows its SQS queue to subscribe. However, service A requires some code changes.

Choosing Between Assume Role and Resource Based Policies

Your service communication will be secure with authentication and authorization, whether you choose resource-based policies or the ‘assume role’ solution. However, each implementation has its pros and cons that can make your life harder in the future if you ignore them.

I’ll divide my recommendation by different parameters.

Suppose I had to choose just one implementation. In that case, I’d go with the ‘assume role’ path, as it allows my service to support multiple services in the future easily. I can create more roles to support more services that assume them and I’m not limited by IAM policy size.

However, resource-based policies are better if you only care about performance, since they don’t add another SDK call to assume the role. Keep in mind that these policies have a maximum size. Suppose you expect to connect many services and different AWS accounts (think of a central pub-sub account or central API). In that case, you will encounter these limitations at some point in the future. Since it’s easier to provision new roles than to duplicate an SNS topic or an API Gateway for each new integration, you are better off choosing the ‘assume role’ path.

Another deciding factor is whether the services are in the same account and who maintains them — the same team or not. The resource-based policy is excellent for internal service or micro-service communication when the same team maintains them, as it introduces some degree of coupling. Still, it is acceptable as it “stays” in the family. However, suppose different accounts and teams are involved. In that case, the ‘assume role’ route is better as it decouples the resources and teams in question and supports endless future extensions.

Lastly, you can always change the implementation, so don’t be afraid to choose; just make sure you select one of these two options.

Summary

In this post, we have learned about authentication and authorization. We have also learned about two service-to-service communication patterns: asynchronous and synchronous.

We have implemented both authentication and authorization for those patterns using AWS IAM. We saw four different implementations and discussed their pros and cons, whether the resource-based policies route or the ‘assume role’ one.

In the following posts, we will discuss the challenges these patterns bring, how to solve them, and how to take authorization another step forward into the fine-grained domain.

Appendix: Private and Public API Gateways

AWS recommends building private API Gateways for service-to-service communication to enhance performance, reduce network costs (you don’t leave the AWS network), and improve security.

…traffic to your private API uses secure connections and does not leave the Amazon network — it is isolated from the public internet — AWS documentation

In reality, it’s possible and easier to build APIs as public API Gateways. In that case, authentication and authorization become more critical and you must follow the guidelines in this post.

Private API Gateways require VPCs. They bring extra complexity as connecting services also need to use VPC and VPC endpoints.

AWS recommends connecting the services’ VPC networks via resource policies for VPC endpoints or VPC peering. You can read more about it here.

When you use serverless and Lambda functions and wish to communicate with a private API Gateway, you need to put your Lambda functions inside VPCs. This is not ideal, as VPCs have unwanted effects on the Lambda functions, such as longer cold starts and increased costs.

VPCs Don’t Replace Authentication and Authorization

I want to set this point straight.

Setting up a network connection between different VPC endpoints does not mean you implemented service authentication or authorization or that you are 100% secure.

Yes, it brings up an extra layer of security, but it does not replace IAM-based authentication and authorization.

First, it’s not least privilege; any service inside those VPC networks can access your service. It’s an extra layer of security but does not replace authorization. In addition, it does not scale. The more services you add, the more “breached” your service becomes, with more VPCs and services that gain access.

In addition, if attackers gain access to one of the VPCs, they can communicate freely with your service because your service accepts any incoming calls.

However, combining IAM authentication and authorization with VPC provides the best and most comprehensive security for your API Gateway.

You can learn more about such patterns with API Gateways and VPC in this video: https://www.youtube.com/watch?v=H4hCygngTrU

Forem: Ran Isenberg

Protect Your API Gateway with AWS WAF using CDK

AWS Web Application Firewall (WAF) Introduction

ACL and Rules

Sample Serverless Service Architecture

AWS CDK Code

AWS AppSync Events — Serverless WebSockets Done Right or Just Different?

What’s a WebSocket?

What is AppSync Events?

Core Concepts & Terminology

First Impressions and Usage

Insights and Tips

Security

No Connection Tables

Mass Broadcast is Easy

Tenant/User Isolation Requires Channel Design

(Over) Flexibility Means Complexity

Developer Experience

Pricing

Summary — Done Right or Just Different — Or in Other Words, Will I Use it?

AWS WAF Essentials: Securing Your SaaS Services Against Cyber Threats

You’ve Got to Secure Your SaaS Application

AWS Web Application Firewall (WAF) Introduction

ACL and Rules

AWS WAF Tips & Tricks

Ownership

Governance

Adding ACL Rules Over Time

Visibility & Logs

WAF WCU & Pricing

Summary

Understanding AWS Availability Zones: Boosting SaaS Resilience and Uptime

Availability Zones Definitions and Properties

Availability Zone’s Properties

Why Understanding AZs Matters

AZs and the Peculiar IDs Case

When Can an AZ or Region Go Down?

Recommendations for Availability Zone Selection

AZ Selection via IaC

AZ Selection Tips for VPC

AZ Selection Tips for Aurora

Understanding Availability Zone’s Cost

Summary

AWS re:Invent 2024 — My Selection Of Sessions

Table of Contents

Session Types

Session Levels

My Breakout Session

Levels 100–200

Level 300

Level 400

Heroes/Community Track

Non Serverless But Highly Recommended

Guide to AWS re:Invent - Tips & Tricks

Table of Contents

Arrival Date

Pack the Correct Tools

Sessions & Getting Around

Night Time

AWS re:Play

Socialize

Shopping

Departure Date

Lastly, have fun and learn!

A Critical Look at AWS Lambda Extensions: Pros, Cons, and Recommended Use Cases

Lambda Extensions Introduction

The Case for Lambda Extensions

The Case Against Lambda Extensions

Security

Developer Experience

Performance & Cost

When to Use Lambda Extensions

Fetch Configuration

Send Logs to a 3rd Party Observability Provider

Monitoring Lambda’s Container Metrics

Chaos Engineering

Summary

Build a Serverless Web Application on Fargate ECS with AWS CDK

Table of Contents

AWS Fargate Quick Introduction