<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Maciej Radzikowski</title>
    <description>The latest articles on Forem by Maciej Radzikowski (@mradzikowski).</description>
    <link>https://forem.com/mradzikowski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F131827%2F10fdbad2-0794-42cf-a075-ba82c6c4c48c.jpeg</url>
      <title>Forem: Maciej Radzikowski</title>
      <link>https://forem.com/mradzikowski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mradzikowski"/>
    <language>en</language>
    <item>
      <title>25 Good and Bad Serverless (and other) Announcements from re:Invent 2023</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Mon, 04 Dec 2023 16:02:31 +0000</pubDate>
      <link>https://forem.com/aws-builders/25-good-and-bad-serverless-and-other-announcements-from-reinvent-2023-18oc</link>
      <guid>https://forem.com/aws-builders/25-good-and-bad-serverless-and-other-announcements-from-reinvent-2023-18oc</guid>
      <description>&lt;p&gt;Originally published at &lt;a href="https://betterdev.blog/25-good-and-bad-serverless-announcements-reinvent-2023/"&gt;BetterDev.blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS re:Invent 2023 ended, so let’s look at the most interesting announcements for serverless and more. There are exciting &lt;strong&gt;features to try&lt;/strong&gt; and, maybe more importantly, &lt;strong&gt;features to avoid&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Looking at the announcements, I think &lt;strong&gt;we are entering the next era for serverless and cloud computing&lt;/strong&gt; in general. There are fewer groundbreaking new services or big features. Instead, there are more quality of (developer) life improvements. Enough to say that four of the announcements below are about CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And that’s a great thing.&lt;/strong&gt; It means that things are stable and most capability gaps are closed, so we can focus on building our solutions on top instead of making workarounds (in most cases, at least).&lt;/p&gt;

&lt;p&gt;That doesn’t mean there were no important updates. For me, &lt;strong&gt;Step Functions and CloudWatch Logs teams are winning&lt;/strong&gt; this re:Invent in terms of the best improvements and new features.&lt;/p&gt;

&lt;p&gt;Okay, let’s see what’s new.&lt;/p&gt;

&lt;h2&gt;Serverless&lt;/h2&gt;

&lt;h3&gt;Step Functions: HTTP request task&lt;/h3&gt;

&lt;p&gt;The new task type allows you to &lt;strong&gt;make HTTP requests directly from the Step Function&lt;/strong&gt;. No more need for a Lambda function to make an external API call.&lt;/p&gt;

&lt;p&gt;What’s best is that it’s &lt;strong&gt;really versatile&lt;/strong&gt;. Instead of creating integrations with specific API partners, &lt;strong&gt;the Step Functions team lets you connect to almost any API&lt;/strong&gt;. You can set the request method, body, headers, and query parameters, and even encode the body as &lt;code&gt;x-www-form-urlencoded&lt;/code&gt;. There are some limitations, sure – the request body must be provided as JSON, and not all headers are allowed. So no XML or SOAP requests, but I don’t think that’s a big issue, even though I still integrate with such APIs occasionally.&lt;/p&gt;

&lt;p&gt;An interesting design choice is &lt;strong&gt;using EventBridge Connections for authorization&lt;/strong&gt;. That makes sense – it’s already in place and supports basic auth, API keys, and OAuth. But don’t be afraid; it does not use EventBridge API Destinations, so there is no 5-second timeout limit (although I could not find any specific timeout mentioned).&lt;/p&gt;
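
&lt;p&gt;For illustration, here is a minimal CDK sketch of such a task, using a raw state definition (the endpoint and the connection ARN are placeholders, and the parameter names are taken from the announcement – treat it as a sketch, not a reference):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import { Construct } from 'constructs';

// Sketch: an HTTP Task calling an external API through an
// EventBridge Connection (endpoint and ARN are placeholders).
export function callExternalApiState(scope: Construct): sfn.CustomState {
  return new sfn.CustomState(scope, 'CallExternalApi', {
    stateJson: {
      Type: 'Task',
      Resource: 'arn:aws:states:::http:invoke',
      Parameters: {
        ApiEndpoint: 'https://api.example.com/orders',
        Method: 'POST',
        Authentication: {
          // secrets live in the EventBridge Connection, not here
          ConnectionArn: 'arn:aws:events:us-east-1:123456789012:connection/my-api/12345678-1234-1234-1234-123456789012',
        },
        Headers: { 'Content-Type': 'application/json' },
        'RequestBody.$': '$.order',
      },
      End: true,
    },
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;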

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hgRxxhIZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dyammnlo7agj0rnjiedn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hgRxxhIZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dyammnlo7agj0rnjiedn.png" alt="Step Functions HTTP step parameters" width="800" height="822"&gt;&lt;/a&gt;&lt;/p&gt;
Step Functions HTTP step parameters



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Easily my #1 announcement of this re:Invent. &lt;strong&gt;My only issue is Secrets Manager&lt;/strong&gt;, which you must use for secrets when creating an EventBridge Connection. While EventBridge then makes its own copy of the secret, which is free and carries no usage charges, the original one still costs a flat $0.40 per month to store a few bytes of API key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; I'm loving it 😍&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Step Functions: redrive failed execution&lt;/h3&gt;

&lt;p&gt;Did a Step Functions execution fail in the middle because of an external error, and now you have to re-run the whole process? Not anymore. You can &lt;strong&gt;redrive the execution starting from the failed state&lt;/strong&gt;. It’s important to know that the redrive runs on exactly the same Step Function definition, so you can’t modify the Step Function and re-run the failed step using a new, fixed workflow version.&lt;/p&gt;
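
&lt;p&gt;A minimal sketch of a redrive with the AWS SDK for JavaScript v3 (assuming the &lt;code&gt;@aws-sdk/client-sfn&lt;/code&gt; client; the execution ARN is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { SFNClient, RedriveExecutionCommand } from '@aws-sdk/client-sfn';

const client = new SFNClient({});

// Restart a failed execution from its failed state onwards;
// the original, unchanged definition is used.
await client.send(new RedriveExecutionCommand({
  executionArn: 'arn:aws:states:us-east-1:123456789012:execution:MyStateMachine:my-failed-run',
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;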

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--54QfhEB6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zvcpz330iizotocmpno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--54QfhEB6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9zvcpz330iizotocmpno.png" alt="Step Functions failure retry and redrive" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;
Step Functions failed step – retries and then manual redrive



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; It’s never the fault of our code, right? But jokes aside, it’s very useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; SF team admiration 🤗&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/compute/introducing-aws-step-functions-redrive-a-new-way-to-restart-workflows/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Step Functions: test step&lt;/h3&gt;

&lt;p&gt;Developing and testing Step Functions can be annoying: you make changes, run an execution, wait for it to reach your modified state, see errors, fix them, rerun the whole thing… You get what I mean. Wouldn’t it be great if you could &lt;strong&gt;run an individual state to test it&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Well, now you can. You can select and run a single state in the Step Function edit view, providing only that state’s input. What’s great is that you get very detailed, step-by-step (see what I did there?) &lt;strong&gt;logs on input/output processing&lt;/strong&gt;. This lets you catch problems like missing IAM permissions, incorrect result selectors, and so on.&lt;/p&gt;
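
&lt;p&gt;You can also script it with the TestState API; a small sketch with the AWS SDK for JavaScript v3 (the role ARN is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { SFNClient, TestStateCommand } from '@aws-sdk/client-sfn';

const client = new SFNClient({});

// Run a single state definition against a sample input; the DEBUG
// inspection level returns the input/output processing details.
const result = await client.send(new TestStateCommand({
  definition: JSON.stringify({
    Type: 'Pass',
    Parameters: { 'greeting.$': "States.Format('Hello, {}!', $.name)" },
    End: true,
  }),
  roleArn: 'arn:aws:iam::123456789012:role/StepFunctionsTestRole',
  input: JSON.stringify({ name: 'World' }),
  inspectionLevel: 'DEBUG',
}));
console.log(result.status, result.output, result.inspectionData);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;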

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jGPadqB_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zokojjqlxg5mshofza7s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jGPadqB_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zokojjqlxg5mshofza7s.png" alt="Step Functions HTTP test state results" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;
Step Functions test state shows step input and output processing details



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; The total developer-hours this saves Step Functions users will be counted in the thousands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; amazed 🤩&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Lambda: much faster scaling up&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;, when handling increased traffic, Lambda would scale up by creating 500 to 3,000 new execution environments in the first minute (depending on the region) and an additional 500 environments every minute after that. Moreover, those scaling quotas were shared by all Lambda functions in the account and region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now&lt;/strong&gt;, it can &lt;strong&gt;scale up by 1,000 environments every 10 seconds&lt;/strong&gt;. That’s… much faster. And each function is scaled independently, meaning each of your Lambda functions can create up to 1,000 new environments every 10 seconds.&lt;/p&gt;

&lt;p&gt;There is still the limit of total account concurrency. The default quota for Lambda concurrent executions is 1,000, and it’s even lower for new accounts. But you can request to have it raised to "tens of thousands".&lt;/p&gt;
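
&lt;p&gt;You can check where your account currently stands with a single SDK call; a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { LambdaClient, GetAccountSettingsCommand } from '@aws-sdk/client-lambda';

const client = new LambdaClient({});

// The account-wide concurrency quota that still caps the scaling
const { AccountLimit } = await client.send(new GetAccountSettingsCommand({}));
console.log('Concurrent executions quota:', AccountLimit?.ConcurrentExecutions);
console.log('Unreserved concurrency:', AccountLimit?.UnreservedConcurrentExecutions);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;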

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fVL3n7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bsql3g0pbjq7m7detqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a2fVL3n7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8bsql3g0pbjq7m7detqz.png" alt="Lambda scaling example" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

Lambda scaling example – 1,000 new environments every 10 seconds, up to the account concurrency limit (&lt;a href="https://aws.amazon.com/blogs/aws/aws-lambda-functions-now-scale-12-times-faster-when-handling-high-volume-requests/"&gt;source&lt;/a&gt;)




&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; That's great. And you don't have to do anything to benefit from this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; positively shocked 😲&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/aws-lambda-functions-now-scale-12-times-faster-when-handling-high-volume-requests/"&gt;announcement post&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/scaling-behavior.html"&gt;docs&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Lambda: future runtime launch dates&lt;/h3&gt;

&lt;p&gt;Next to Lambda runtime deprecation dates, the AWS docs now include the &lt;strong&gt;target&lt;/strong&gt; dates for the new runtime version releases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is it so important that I put it on the list? Because if I’m tired of people constantly asking "when Python x", "when Node y", then the AWS Lambda team must be fed up, too.&lt;/p&gt;

&lt;p&gt;Also, since we’re on this subject, &lt;strong&gt;I don’t think people understand&lt;/strong&gt; that, for AWS, adding a new runtime version is more than &lt;code&gt;wget https://nodejs.org/en/download/the-latest-node-version.zip&lt;/code&gt; – it also includes a &lt;strong&gt;commitment to maintain it for an extended time&lt;/strong&gt; for tens of thousands of clients. So people asking for a new runtime on the same day as the language release are, frankly, delusional.&lt;/p&gt;

&lt;p&gt;Also, AWS already improved the runtime upgrade cycle. Now, with this added transparency on the &lt;strong&gt;expected&lt;/strong&gt; release dates, I’m totally satisfied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; relieved 😮‍💨&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html#runtimes-future"&gt;docs&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Lambda: advanced logging controls&lt;/h3&gt;

&lt;p&gt;You can now select the &lt;strong&gt;JSON log format&lt;/strong&gt; for a Lambda function and use language-default logging tools, like the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Console"&gt;Node.js console object&lt;/a&gt; and the &lt;a href="https://docs.python.org/3/library/logging.html"&gt;Python logging module&lt;/a&gt;, to log messages in a &lt;strong&gt;structured JSON format&lt;/strong&gt;. When using it, you set the log level in the Lambda settings, which is a nice improvement over a standard environment variable. Additionally, system logs like &lt;code&gt;START&lt;/code&gt;, &lt;code&gt;END&lt;/code&gt;, and &lt;code&gt;REPORT&lt;/code&gt; are also logged as JSON.&lt;/p&gt;

&lt;p&gt;Another feature of advanced logging controls is &lt;strong&gt;setting a custom log group instead of the auto-generated one&lt;/strong&gt;, which can be helpful if you want to aggregate logs from multiple functions in a single place.&lt;/p&gt;
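
&lt;p&gt;In CDK, both settings could look roughly like this (a sketch – these props landed in &lt;code&gt;aws-cdk-lib&lt;/code&gt; around this time, so the exact names may differ in your version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

declare const scope: Construct;

// A custom log group shared by multiple functions
const sharedLogGroup = new logs.LogGroup(scope, 'SharedLogs');

new lambda.Function(scope, 'MyFunction', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('dist'),
  // JSON log format with log levels set in the function config
  loggingFormat: lambda.LoggingFormat.JSON,
  applicationLogLevel: lambda.ApplicationLogLevel.INFO,
  systemLogLevel: lambda.SystemLogLevel.WARN,
  logGroup: sharedLogGroup,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;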

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NUHiFO2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r1axbnxt0i81kex620b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NUHiFO2H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r1axbnxt0i81kex620b5.png" alt="Lambda advanced logging configuration" width="800" height="777"&gt;&lt;/a&gt;&lt;/p&gt;
Lambda advanced logging configuration



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Initially, &lt;strong&gt;I was ecstatic&lt;/strong&gt; – no need for added logger dependencies! &lt;strong&gt;Then I tested it.&lt;/strong&gt; On Node.js, &lt;a href="https://twitter.com/radzikowski_m/status/1725500755153687036"&gt;extra parameters added in the standard way are inlined as a string, not as JSON fields&lt;/a&gt;. On Python, &lt;a href="https://twitter.com/radzikowski_m/status/1725511993551904938"&gt;the DEBUG log level includes logs from boto3&lt;/a&gt;, and there are a lot of them. &lt;strong&gt;A big no-no for me.&lt;/strong&gt; I’m continuing to &lt;a href="https://betterdev.blog/aws-lambda-logging-best-practices/#log_as_json"&gt;log as JSON&lt;/a&gt; with logger libraries, which also provide features like log sampling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; big sad 😭&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/compute/introducing-advanced-logging-controls-for-aws-lambda-functions/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;CloudWatch Logs: cheaper log class&lt;/h3&gt;

&lt;p&gt;CloudWatch Logs, with $0.50 per GB of ingested data, can quickly become one of the top costs for serverless applications. That’s why it’s essential to follow some &lt;a href="https://betterdev.blog/aws-lambda-logging-best-practices/"&gt;best practices&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now&lt;/strong&gt;, however, you can choose the &lt;strong&gt;Infrequent Access&lt;/strong&gt; log class for your Log Group and &lt;strong&gt;pay only half the price for the ingest&lt;/strong&gt; – $0.25 per GB. There are, obviously, tradeoffs – not all features are available with Infrequent Access. You can’t create subscription filters, export to S3, or use Live Tail, Lambda Insights, or the new anomaly detection, and you can’t create metrics from logs with metric filters or the embedded metric format. Additionally, you can view the logs only through Logs Insights, not the regular log stream view, so &lt;strong&gt;reading logs costs $0.005 per GB of data scanned&lt;/strong&gt;.&lt;/p&gt;
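
&lt;p&gt;Until IaC support arrives (more on that below), a sketch of creating such a log group with the AWS SDK for JavaScript v3 – note that the log class can only be chosen at creation time:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { CloudWatchLogsClient, CreateLogGroupCommand } from '@aws-sdk/client-cloudwatch-logs';

const client = new CloudWatchLogsClient({});

// Create a log group in the cheaper Infrequent Access class
await client.send(new CreateLogGroupCommand({
  logGroupName: '/myapp/background-jobs',
  logGroupClass: 'INFREQUENT_ACCESS',
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;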

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The lack of CloudFormation support&lt;/strong&gt; makes it, for now, only a theoretical feature for any IaC (unless you want it badly enough to create the Log Group with a custom resource on your own). The CDK may be the first to support it, since it creates Lambda Log Groups programmatically anyway.&lt;/p&gt;

&lt;p&gt;Apart from that, if you don’t need any of the unsupported capabilities, it’s definitely worth considering, especially for production environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; mixed feelings 🥲&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-log-class-for-infrequent-access-logs-at-a-reduced-price/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;CloudWatch Logs: pattern grouping&lt;/h3&gt;

&lt;p&gt;In CloudWatch Logs Insights, all logs are now additionally &lt;strong&gt;grouped into automatically recognized patterns&lt;/strong&gt;. "Patterns" are similar logs that differ only in some values, like dumped variables. That makes total sense, because there is usually only a limited set of log types your application writes.&lt;/p&gt;

&lt;p&gt;With patterns, it’s much easier to find unusual or alarming logs – instead of traversing hundreds of log messages, you only look at a few aggregated types of log messages. You can click on any of them to see all matching logs. Additionally, you can &lt;strong&gt;compare patterns between different periods&lt;/strong&gt; to see whether the number and ratio of log types have changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8X_rMg6j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p19a1dmz3rskmhofpjgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8X_rMg6j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p19a1dmz3rskmhofpjgi.png" alt="CloudWatch Logs Insights pattern detection" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;
Example patterns detected on relatively simple Lambda logs



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; It’s one of those features that make you wonder how you could have lived without them for so long. And you get it for free (no additional cost apart from the regular price of CloudWatch Logs Insights), which is a cherry on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; simply amazed 🤩&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-logs-now-offers-automated-pattern-analytics-and-anomaly-detection/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;CloudWatch Logs: anomaly detection&lt;/h3&gt;

&lt;p&gt;You can now enable anomaly detection on a CloudWatch Log Group. After a short learning time, it will automatically, well, detect anomalies in your logs. &lt;strong&gt;Anomalies are based on automatically recognized patterns&lt;/strong&gt; (similar to those described above) &lt;strong&gt;and changes in their frequency&lt;/strong&gt;. When enabled, CloudWatch will detect changes like a decreased number of success logs, an increased number of warning or error logs, or even a complete lack of logs, meaning your Lambda stopped being invoked (yes, I had to create such alarms in the past).&lt;/p&gt;

&lt;p&gt;You can temporarily or permanently suppress types of findings, which is very useful.&lt;/p&gt;

&lt;p&gt;To get notified of new anomalies, you need to create a CloudWatch Alarm. Apart from that, &lt;strong&gt;anomaly detection is free of charge&lt;/strong&gt; (or rather – included in the data ingestion price).&lt;/p&gt;
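
&lt;p&gt;If you prefer the API over the console, enabling a detector looks roughly like this (a sketch based on the API reference – verify the parameter names before use):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { CloudWatchLogsClient, CreateLogAnomalyDetectorCommand } from '@aws-sdk/client-cloudwatch-logs';

const client = new CloudWatchLogsClient({});

// Enable anomaly detection on a single log group (placeholder ARN)
await client.send(new CreateLogAnomalyDetectorCommand({
  logGroupArnList: ['arn:aws:logs:us-east-1:123456789012:log-group:/myapp/api'],
  detectorName: 'myapp-api-anomalies',
  evaluationFrequency: 'FIFTEEN_MIN',
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;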

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zlFIHNfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/koj5mndvoec13nboi2dm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zlFIHNfV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/koj5mndvoec13nboi2dm.png" alt="CloudWatch Logs anomaly detection" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;
Example detected anomalies in CloudWatch Logs (&lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-logs-now-offers-automated-pattern-analytics-and-anomaly-detection/"&gt;source&lt;/a&gt;)



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS announced pattern grouping and anomaly detection in a single blog post, but I think those two are big and important enough to be separate topics. I’ll definitely use both of those features on real production apps sooner rather than later.&lt;/p&gt;

&lt;p&gt;With anomaly detection, &lt;strong&gt;the only problem may be the initial number of false positives&lt;/strong&gt;, especially if you link it to a CloudWatch Alarm. And the fact that we get this at no additional charge surprises me – in a good way, of course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; simply amazed 🤩&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-logs-now-offers-automated-pattern-analytics-and-anomaly-detection/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;CloudWatch Logs: query generator&lt;/h3&gt;

&lt;p&gt;While the CloudWatch Logs Insights &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html"&gt;query syntax&lt;/a&gt; is not overly complicated, I don’t use it frequently enough to learn it by heart like SQL. Therefore, the new query generator that converts natural language expressions like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;find 10 longest Lambda invocation times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, @message 
| filter @type = "REPORT" 
| stats max(@duration) as maxDuration by @logStream 
| sort maxDuration desc 
| limit 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;is really something.&lt;/p&gt;

&lt;p&gt;Currently, the capability is in preview and available only in the &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;us-west-2&lt;/code&gt; regions, free of charge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I’m not the biggest fan of Generative AI, simply because everyone is adding it everywhere even if it does not make sense, sometimes worsening instead of improving services. That being said, the query generator for Logs Insights is &lt;strong&gt;an example of AI applied correctly&lt;/strong&gt;. I’m eager to put it to the test in real cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; interested 🧐&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/use-natural-language-to-query-amazon-cloudwatch-logs-and-metrics-preview/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;DynamoDB: zero-ETL OpenSearch integration&lt;/h3&gt;

&lt;p&gt;DynamoDB can now &lt;strong&gt;automatically ingest items to OpenSearch&lt;/strong&gt;. You need to enable DynamoDB Streams and point-in-time recovery on the table, and the rest – both the initial data load and keeping it in sync – is handled for you. There is just one catch – it requires an &lt;strong&gt;OpenSearch Ingestion pipeline&lt;/strong&gt;.&lt;/p&gt;
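
&lt;p&gt;The table-side prerequisites are plain CDK; a sketch (the pipeline itself is configured on the OpenSearch Ingestion side):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

declare const scope: Construct;

// The integration requires a stream and point-in-time recovery
// on the source table.
new dynamodb.Table(scope, 'Orders', {
  partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
  pointInTimeRecovery: true,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;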

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Ingesting from DynamoDB to OpenSearch is &lt;strong&gt;a common pattern&lt;/strong&gt; to get a fast and scalable database with added search capabilities. The built-in integration that does not require a Lambda function in the middle is totally spot on. &lt;strong&gt;The only problem lies in the OpenSearch Ingestion pipeline pricing&lt;/strong&gt;, starting at a not-so-small $170/month. While I see it as worthwhile for large applications, it’s unacceptable for serverless development and smaller services. Thus, I’m good with my Lambda functions for now…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; not even sad, just disappointed 😑&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-dynamodb-zero-etl-integration-with-amazon-opensearch-service-is-now-generally-available/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;DynamoDB: zero-ETL Redshift integration&lt;/h3&gt;

&lt;p&gt;Similarly to the OpenSearch integration, DynamoDB will be able to keep data in sync with Redshift.&lt;/p&gt;

&lt;p&gt;The capability is now in limited preview, so it’s hard to say more about it. You can &lt;a href="https://pages.awscloud.com/Aurora-Limitless-Database-Preview.html"&gt;sign up for access&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; In my experience, it’s a less common pattern than ingest to OpenSearch, but a direct integration is still warmly welcomed. I just hope that, unlike the OpenSearch ingestion, it won’t introduce extra costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; cautiously interested 🧐&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-redshift/"&gt;announcement&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;AppSync: easier Aurora Data API integration&lt;/h3&gt;

&lt;p&gt;The AppSync JavaScript resolver utils now support making Aurora Data API requests with the &lt;code&gt;createMySQLStatement()&lt;/code&gt; and &lt;code&gt;createPgStatement()&lt;/code&gt; helper functions. A query can be provided as raw SQL or created with a builder. Additionally, in AppSync, you can now generate a whole API from an Aurora cluster with a few clicks.&lt;/p&gt;
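
&lt;p&gt;A resolver using the new helpers could look more or less like this (a sketch based on the docs – the table and arguments are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createPgStatement, sql, toJsonObject } from '@aws-appsync/utils/rds';

// Request: build a parameterized Postgres statement from the arguments
export function request(ctx) {
  return createPgStatement(
    sql`SELECT id, title FROM todos WHERE id = ${ctx.args.id}`
  );
}

// Response: unwrap the Data API result; toJsonObject returns
// one list of rows per executed statement.
export function response(ctx) {
  return toJsonObject(ctx.result)[0];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;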

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--09Z7EzvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sfr43973y32a1c4gvwz1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--09Z7EzvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sfr43973y32a1c4gvwz1.png" alt="AppSync Data API integration with Amazon Aurora" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;
AppSync can make direct SQL queries… but only to Aurora Serverless v1 (&lt;a href="https://aws.amazon.com/blogs/mobile/build-a-graphql-api-for-your-amazon-aurora-mysql-database-using-aws-appsync-and-the-rds-data-api/"&gt;source&lt;/a&gt;)



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; This sounds good until you realize that &lt;strong&gt;the Data API is still supported only by Aurora Serverless v1, not v2&lt;/strong&gt;. And that’s a &lt;strong&gt;big limitation&lt;/strong&gt; because v1 does not scale well and is available only with a few not-so-recent Aurora versions. And since you could already make Data API calls from AppSync, this does not introduce any truly new capabilities. Don’t get me wrong – the helper functions are great, but with no Data API support in Aurora Serverless v2, they are not worth much…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; meh 😒&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/mobile/build-a-graphql-api-for-your-amazon-aurora-mysql-database-using-aws-appsync-and-the-rds-data-api/"&gt;announcement post&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/resolver-reference-rds-js.html"&gt;docs&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;ElastiCache Serverless&lt;/h3&gt;

&lt;p&gt;The new serverless version of ElastiCache is more expensive than the eight cheapest on-demand instance variants. But the pricing model is interesting because, unlike other expensive serverless services, you don’t pay for always-on idle processing units here, but only for actual per-request usage. What always generates costs is the data storage, with a &lt;strong&gt;minimum billable value of 1 GB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; While better than for OpenSearch Serverless, &lt;strong&gt;the pricing model, with its roughly $90/month minimum charge, is still unsuitable for serverless workloads&lt;/strong&gt;. DynamoDB is a better key-value store for most new applications, so unless you are migrating an existing system, have particular needs, or have calculated significant cost savings, you will be better off with DynamoDB by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; could be worse 🙄&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-elasticache-serverless-for-redis-and-memcached-now-generally-available/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;SQS: FIFO throughput increase and DLQ redrive&lt;/h3&gt;

&lt;p&gt;SQS FIFO queues offer exactly-once processing and strict ordering. Of course, this comes at a price, and that price is limited throughput. While standard queues offer a "nearly unlimited number of transactions per second", FIFO queues have always had limits. But now, with &lt;strong&gt;support for 70,000 transactions per second&lt;/strong&gt; in high throughput mode, you have to try really hard to reach them.&lt;/p&gt;

&lt;p&gt;Also, FIFO queues now support the &lt;strong&gt;Dead Letter Queue redrive option&lt;/strong&gt;, allowing the re-delivery of messages that have not been processed.&lt;/p&gt;
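
&lt;p&gt;As for the high throughput mode, it’s an opt-in queue configuration; in CDK it comes down to two properties (a minimal sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

declare const scope: Construct;

// High throughput FIFO: deduplication and throughput are scoped
// to the message group instead of the whole queue.
new sqs.Queue(scope, 'OrdersQueue', {
  fifo: true,
  deduplicationScope: sqs.DeduplicationScope.MESSAGE_GROUP,
  fifoThroughputLimit: sqs.FifoThroughputLimit.PER_MESSAGE_GROUP_ID,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;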

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Sadly, I don’t work on any service requiring such a large throughput. However, it’s worth remembering that this is possible &lt;strong&gt;only in the high throughput mode&lt;/strong&gt;, where messages should belong to a uniform distribution of message groups, and FIFO order is guaranteed only within a single group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; nice 🙂&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/announcing-throughput-increase-and-dead-letter-queue-redrive-support-for-amazon-sqs-fifo-queues/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;S3: Express One Zone storage class&lt;/h3&gt;

&lt;p&gt;The new S3 storage class, named &lt;strong&gt;Express One Zone&lt;/strong&gt;, is created to handle "&lt;strong&gt;hundreds of thousands of requests per second with consistent single-digit millisecond latency&lt;/strong&gt;". What’s important is that this is a purpose-built storage that works best for small files with a short life span.&lt;/p&gt;

&lt;p&gt;This storage is different enough from other S3 storage classes that it has a separate tab on the S3 page in the AWS Console. Next to the "general purpose buckets", those new Express One Zone buckets are under the "&lt;strong&gt;directory buckets&lt;/strong&gt;" tab. They have a special naming scheme and a new authentication method that grants access tokens valid for 5 minutes. Furthermore, when listing objects, prefixes must be full directory names ending with &lt;code&gt;/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The main use case of Express One Zone buckets is to store intermediate files of data-intensive distributed computing, like AI/ML training. But I’m sure AWS customers will find many other applications.&lt;/p&gt;
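
&lt;p&gt;Working with a directory bucket looks mostly like regular S3; a sketch (the bucket name, with the Availability Zone ID and &lt;code&gt;--x-s3&lt;/code&gt; suffix, is a made-up example of the naming scheme):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { S3Client, PutObjectCommand, ListObjectsV2Command } from '@aws-sdk/client-s3';

const client = new S3Client({ region: 'us-east-1' });

// Directory bucket names embed the AZ ID and end with --x-s3
const bucket = 'my-temp-data--use1-az4--x-s3';

await client.send(new PutObjectCommand({
  Bucket: bucket,
  Key: 'shards/part-0001.bin',
  Body: 'intermediate data',
}));

// Prefixes must be whole directories ending with "/"
const listed = await client.send(new ListObjectsV2Command({
  Bucket: bucket,
  Prefix: 'shards/',
}));
console.log(listed.Contents?.map((object) =&gt; object.Key));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;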

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mu6QGbzj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yey0sif9t1ojgf6qm12b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mu6QGbzj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yey0sif9t1ojgf6qm12b.png" alt="S3 Directory buckets" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;
Express One Zone buckets are listed under new "Directory buckets" tab




&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; It’s a specific solution for a specific problem. Before you use it because "it’s faster", "it has directories", and "reads and writes cost half the Standard class price", please note &lt;strong&gt;it costs seven times more for storage&lt;/strong&gt; and &lt;strong&gt;does not offer the same durability and availability&lt;/strong&gt;, making it unsuitable for long-term or general-purpose storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; nice 🙂&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-express-one-zone-high-performance-storage-class/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;OpenSearch Serverless: vector engine&lt;/h3&gt;

&lt;p&gt;The vector engine for OpenSearch Serverless, meant as &lt;strong&gt;a database for ML/AI models&lt;/strong&gt;, is now generally available. Interestingly, OpenSearch and OpenSearch Serverless diverge more and more, with the Serverless version getting unique new features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I saw at least five new or adapted vector databases this year, so this one does not surprise me. And all would be good if not for the OpenSearch Serverless costs – &lt;strong&gt;starting at $700/month&lt;/strong&gt;. However, the announcement post mentions the possibility of running development workloads with no active replicas and 0.5 compute units, which would land at around $170/month – though I’m not entirely sure, since there is no mention of it on the pricing page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; not my area, hard to say 😶&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/vector-engine-for-amazon-opensearch-serverless-is-now-generally-available/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Integrated Application Test Kit&lt;/h3&gt;

&lt;p&gt;The Integrated Application Test Kit, or IATK, is a new library that helps run tests for serverless applications in the cloud. It has a few useful features, like &lt;strong&gt;resolving resource physical IDs from the CloudFormation stack&lt;/strong&gt; or &lt;strong&gt;waiting for asynchronous EventBridge events&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The library is available in public preview and, for now, only in Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I’m looking forward to the library’s development, both in terms of new features and supported platforms (Node.js in particular). The current capabilities are limited, and a lot will depend on the development team’s direction. But in a good scenario, it can become the base for running integration tests, replacing the helpers I currently write on my own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; curious 🤔&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/compute/aws-integrated-application-test-kit/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Not Serverless (but still interesting)&lt;/h2&gt;

&lt;h3&gt;Aurora: Limitless preview&lt;/h3&gt;

&lt;p&gt;Scaling SQL databases is hard. Scaling SQL databases for writes is especially hard. And this is the problem the new Aurora Limitless Database tackles.&lt;/p&gt;

&lt;p&gt;The Limitless Database uses &lt;strong&gt;data sharding&lt;/strong&gt; to spread the load over multiple instances, with transaction routers managing writes and reads to and from the shards. Unlike read replicas, this horizontal scaling works for both read and write operations. And thanks to the routers, &lt;strong&gt;the sharding is transparent to the client&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Aurora Limitless is now in limited preview. You can &lt;a href="https://pages.awscloud.com/Aurora-Limitless-Database-Preview.html"&gt;sign up for access&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_Be-gEjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6rl4o7dt3v8qg7zs5ys3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_Be-gEjr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6rl4o7dt3v8qg7zs5ys3.png" alt="Aurora Limitless architecture" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;
Aurora Limitless architecture (&lt;a href="https://aws.amazon.com/blogs/aws/join-the-preview-amazon-aurora-limitless-database/"&gt;source&lt;/a&gt;)



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; While data sharding is not new, having it as a managed AWS service removes a lot of complexity for extremely high-traffic applications requiring SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; academically interested 🤓&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/join-the-preview-amazon-aurora-limitless-database/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;OpenSearch: direct S3 queries&lt;/h3&gt;

&lt;p&gt;You can now connect OpenSearch to S3 and &lt;strong&gt;make queries on the data in the buckets&lt;/strong&gt;. Similarly to Athena, it uses the Glue Data Catalog to represent your S3 data as tables. You can select one of &lt;strong&gt;three indexing strategies&lt;/strong&gt;, from ingesting only metadata to ingesting all the data from S3 into OpenSearch, which translates into different query performance.&lt;/p&gt;

&lt;p&gt;Indexing and making queries consume &lt;strong&gt;compute units&lt;/strong&gt;, which is an additional cost on top of your OpenSearch cluster. Although no compute units are consumed when "no queries or indexing activities are active", I’m not sure how that relates to keeping the S3 index up to date and whether that is a 24/7 activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Yes, it looks like Athena with extra steps. However, I see the benefit of making queries through a single engine in a uniform way. Moreover, indexing data in OpenSearch should significantly outperform Athena.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; interested 🧐&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-opensearch-service-zero-etl-integration-with-amazon-s3-preview/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;AWS SDK for Rust and Kotlin&lt;/h3&gt;

&lt;p&gt;The Rust and Kotlin AWS SDKs are now generally available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I’m not planning to use either of those two languages in the foreseeable future. But I know the community highly anticipated at least the Rust SDK, so I’m happy for the folks using it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; nice 🙂&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/developer/announcing-general-availability-of-the-aws-sdk-for-rust/"&gt;Rust AWS SDK announcement post&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/developer/aws-sdk-for-kotlin-ga/"&gt;Kotlin AWS SDK announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Cost Optimization Hub&lt;/h3&gt;

&lt;p&gt;The Billing and Cost Management pages are now merged into one. I never understood why they were separate and why I had to jump between them in the first place.&lt;/p&gt;

&lt;p&gt;In this new management page, there is now a Cost Optimization Hub. It &lt;strong&gt;aggregates optimization suggestions&lt;/strong&gt; from over ten different services.&lt;/p&gt;

&lt;p&gt;Using the Cost Optimization Hub is free, but you must &lt;strong&gt;opt in&lt;/strong&gt; to enable it. For best results, you must also opt in to Compute Optimizer.&lt;/p&gt;
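
&lt;p&gt;There is also an API for it; a sketch of opting in and pulling recommendations, assuming the &lt;code&gt;@aws-sdk/client-cost-optimization-hub&lt;/code&gt; client (I’m writing the response fields from memory, so verify them against the client’s types):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import {
  CostOptimizationHubClient,
  UpdateEnrollmentStatusCommand,
  ListRecommendationsCommand,
} from '@aws-sdk/client-cost-optimization-hub';

const client = new CostOptimizationHubClient({ region: 'us-east-1' });

// Opt the account in...
await client.send(new UpdateEnrollmentStatusCommand({ status: 'Active' }));

// ...and list the aggregated savings recommendations
const { items } = await client.send(new ListRecommendationsCommand({}));
console.log(items?.map((item) =&gt; [item.currentResourceType, item.estimatedMonthlySavings]));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;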

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; While it’s an improvement, I don’t understand why we now have a separate &lt;strong&gt;Cost Optimization Hub&lt;/strong&gt; and &lt;strong&gt;Compute Optimizer&lt;/strong&gt;, especially since the former takes its data from the latter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; confused 🤨&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/about-aws/whats-new/2023/11/cost-optimization-hub/"&gt;announcement&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;myApplications&lt;/h3&gt;

&lt;p&gt;The Console Home page now has a myApplications view, where you can create applications and get a dashboard with costs, alarms, security findings, and other widgets just for the selected resource groups. You add resources to an application by &lt;strong&gt;tagging them with a special &lt;code&gt;awsApplication&lt;/code&gt; tag&lt;/strong&gt;. You can choose resources manually or select a CloudFormation stack, and all its resources will be tagged for you, which is nice. But even better is to add the tag to all resources in your stack(s), for example, with the &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/tagging.html"&gt;CDK Tags aspect&lt;/a&gt;, as shown below.&lt;/p&gt;
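
&lt;p&gt;That approach is a one-liner; a sketch (the application ARN is a placeholder you copy from the myApplications page):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { App, Tags } from 'aws-cdk-lib';

const app = new App();

// Tag every resource in every stack of the app so that
// myApplications picks them up (placeholder ARN).
Tags.of(app).add(
  'awsApplication',
  'arn:aws:resource-groups:us-east-1:123456789012:group/MyApp/0123abcd',
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;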

&lt;p&gt;There is also an &lt;code&gt;AWS::ServiceCatalogAppRegistry::ResourceAssociation&lt;/code&gt; CloudFormation resource suggested by the myApplications page that you can add to the stack. While I understand that it should associate your stack with the application, it does not work, and myApplications shows that you must add tags anyway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rjGTsCt6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fgg9z9bdkujg8gyee9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rjGTsCt6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fgg9z9bdkujg8gyee9w.png" alt="myApplications Compute widget" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;
myApplications Compute widget automatically selects and shows the most utilized resource statistics (&lt;a href="https://aws.amazon.com/blogs/aws/new-myapplications-in-the-aws-management-console-simplifies-managing-your-application-resources/"&gt;source&lt;/a&gt;)



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I know the lack of logical "here is your application and all its resources" groups in AWS is somewhat intimidating for beginners. But I doubt this is helpful for those new users – it took me about 15 minutes to fully understand how it works… Besides, if you follow basic AWS best practices, you deploy a single application per account, so it’s not needed. And I don’t understand why this is a thing separate from Resource Groups instead of an integral part of them…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; meh 😒&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/new-myapplications-in-the-aws-management-console-simplifies-managing-your-application-resources/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;CloudWatch: Application Signals&lt;/h3&gt;

&lt;p&gt;In short, it’s &lt;strong&gt;automatic instrumentation and monitoring&lt;/strong&gt; of EKS Clusters. You can also use it for non-EKS applications by running the CloudWatch agent on your ECS Cluster or EC2 instance yourself. Notably, for now, the instrumentation works only for Java applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_7GPVXnQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sa5ijopm1dp334otyxou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_7GPVXnQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sa5ijopm1dp334otyxou.png" alt="Application Signals services view" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;
Application Signals services view (&lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-application-signals-for-automatic-instrumentation-of-your-applications-preview/"&gt;source&lt;/a&gt;)



&lt;p&gt;Another introduced capability of Application Signals is &lt;strong&gt;Service Level Objective (SLO) monitoring&lt;/strong&gt;. You can monitor one of the discovered services or a CloudWatch Metric and set the target objective. Then, you can tell your customers about the 99.9% uptime of your application based on the actual data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MfWhp3Tg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gh7j54iwsz65o6425gea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MfWhp3Tg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gh7j54iwsz65o6425gea.png" alt="Applications Signals SLO monitoring" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;
 Applications Signals SLO monitoring (&lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-application-signals-for-automatic-instrumentation-of-your-applications-preview/"&gt;source&lt;/a&gt;)



&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; I’ve been lucky enough never to use Kubernetes, and I do not intend to. But I hope my less fortunate colleagues will find Application Signals useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; nice 🙂&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/amazon-cloudwatch-application-signals-for-automatic-instrumentation-of-your-applications-preview/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Q: the AWS AI&lt;/h2&gt;

&lt;p&gt;I’ve left the "best" for the end. There are multiple levels to unpack here, so bear with me.&lt;/p&gt;

&lt;p&gt;Not surprisingly, "AI" continued to pop up everywhere during re:Invent, for better or worse. Undoubtedly, the biggest announcement in this area was Amazon Q, &lt;strong&gt;the AWS response to Large Language Models (LLMs)&lt;/strong&gt; taking over the world.&lt;/p&gt;

&lt;p&gt;But what is Q? Well, it’s one thing in four forms.&lt;/p&gt;

&lt;h3&gt;Form 1: generative AI service&lt;/h3&gt;

&lt;p&gt;The Amazon Q service lets you create &lt;strong&gt;a customized LLM working in a ChatGPT style&lt;/strong&gt;, trained on the materials you provide. Those may be, for example, your company knowledge base, enhancing Q with domain-specific facts. You can use a few configuration tweaks to tune and improve responses, like restricting irrelevant topics or defining the context for the answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ctVGi9L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0gdg051udbeojw189jh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ctVGi9L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r0gdg051udbeojw189jh.png" alt="Amazon Q data sources" width="800" height="1033"&gt;&lt;/a&gt;&lt;/p&gt;
There are plenty of data source connections to make your Q bot smarter (&lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-q-a-new-generative-ai-powered-assistant-preview/"&gt;source&lt;/a&gt;)



&lt;p&gt;A side note: extra points for the short and simple service name. Totally unrelated: do you know &lt;strong&gt;how many letters you must type into the AWS Console search bar&lt;/strong&gt; before the service shows up in the results?&lt;/p&gt;

&lt;h3&gt;Form 2: AWS AI assistant&lt;/h3&gt;

&lt;p&gt;Without a doubt, you will notice the new popups in the AWS docs and the Console itself with an Amazon Q chat. You can ask it about AWS services and features and hope for the correct answer. It’s the Q service in practice, trained on AWS documentation, showing the service’s capabilities. It’s an excellent example of AWS &lt;a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food"&gt;dogfooding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, it shows not only the good but also the bad sides of Amazon Q. Like every LLM, it’s prone to hallucinations, as shown on many screenshots circulating on Twitter. See this &lt;a href="https://twitter.com/QuinnyPig/status/1730405664042991928"&gt;thread from Corey Quinn&lt;/a&gt; as an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RO1cbQsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwl0k6i61odu6bjye8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RO1cbQsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jlwl0k6i61odu6bjye8z.png" alt="Image description" width="730" height="1088"&gt;&lt;/a&gt;&lt;/p&gt;
You can learn a lot of new things from Amazon Q (&lt;a href="https://twitter.com/QuinnyPig/status/1730411650505986441"&gt;source&lt;/a&gt;)



&lt;p&gt;It also &lt;strong&gt;does not know about resources in your account&lt;/strong&gt; and does not make any operations. Thus, asking it about the total number of your Lambda functions will only give you a CLI command to check it yourself.&lt;/p&gt;

&lt;p&gt;It’s also integrated with CodeWhisperer, so you can chat with it without leaving your IDE.&lt;/p&gt;

&lt;h3&gt;Form 3: troubleshooting helper&lt;/h3&gt;

&lt;p&gt;There are new troubleshooting tools integrated into the AWS Console that use Q. Right now, it can help &lt;strong&gt;find the failed Lambda invocation error root cause&lt;/strong&gt; or debug &lt;strong&gt;VPC connectivity issues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KsEpwVto--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8gcw80dxb1xiyn7d21o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KsEpwVto--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z8gcw80dxb1xiyn7d21o.png" alt="Lambda troubleshooting with Amazon Q" width="800" height="678"&gt;&lt;/a&gt;&lt;/p&gt;
If identifying IAM permissions problems would be the only Amazon Q capability, it would be the most used AWS service anyway (&lt;a href="https://aws.amazon.com/blogs/aws/amazon-q-brings-generative-ai-powered-assistance-to-it-pros-and-developers-preview/"&gt;source&lt;/a&gt;)




&lt;h3&gt;Form 4: context-aware query generator&lt;/h3&gt;

&lt;p&gt;In Redshift, you can now &lt;strong&gt;write your question in a natural language and get SQL matching your tables&lt;/strong&gt; in a response. That’s actually pretty awesome. I hope similar functionalities will also land in other services.&lt;/p&gt;

&lt;h3&gt;It's still in preview&lt;/h3&gt;

&lt;p&gt;Everyone expects the best from AWS, but Amazon Q and all the capabilities it powers are still in preview, so some hiccups are understandable. I’m sure AWS will fine-tune it to reduce hallucinations and give better answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My opinion:&lt;/strong&gt; Finding the appropriate documentation page in Google takes me less time than waiting for the chatbot’s answer. But I’ll give it a go. For now, I like that Amazon Q’s answers include the links to the documentation sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My reaction:&lt;/strong&gt; mixed feelings 😵‍💫&lt;/p&gt;

&lt;p&gt;See more: &lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-q-a-new-generative-ai-powered-assistant-preview/"&gt;announcement post&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Wow, that turned out long.&lt;/p&gt;

&lt;p&gt;Recordings from re:Invent are available on YouTube: &lt;a href="https://www.youtube.com/playlist?list=PL2yQDdvlhXf_yTJdRlfK7K1ARdhYHhUvR"&gt;keynotes&lt;/a&gt;, &lt;a href="https://www.youtube.com/playlist?list=PL2yQDdvlhXf-5R7VtNr9P4nosA7DiDtM1"&gt;hundreds of presentations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please let me know in the comments if there is anything you liked the most from the announcements!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Avoiding and solving CDK resource name conflicts</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Mon, 30 Jan 2023 16:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/avoiding-and-solving-cdk-resource-name-conflicts-3hen</link>
      <guid>https://forem.com/aws-builders/avoiding-and-solving-cdk-resource-name-conflicts-3hen</guid>
      <description>&lt;p&gt;CDK generates Logical IDs used by the CloudFormation to track and identify resources. In this post, I'll explain what Logical IDs are, how they're generated, and why they're important. Understanding this will help you avoid unexpected resource deletions and baffling "resource already exists" errors during deployment.&lt;/p&gt;

&lt;p&gt;CDK provides an abstraction layer over CloudFormation, which it uses under the hood. With CDK, Infrastructure as Code is easier and more secure. But to use CDK effectively, you still need to understand how CloudFormation works. Failing to do so can have dire consequences, like the accidental removal of all your production database data. And we don't want that.&lt;/p&gt;

&lt;h2&gt;Construct ID vs. Logical ID vs. Physical ID&lt;/h2&gt;

&lt;p&gt;Let's create a simple Stack with one &lt;strong&gt;Construct&lt;/strong&gt; - an SQS Queue. For the Construct ID, the second parameter in the constructor, we set &lt;code&gt;MyQueue&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running &lt;code&gt;cdk deploy&lt;/code&gt; we get a CloudFormation stack with a single &lt;strong&gt;resource&lt;/strong&gt;. The generated CloudFormation template looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;MyQueueE6CA6235&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::SQS::Queue&lt;/span&gt;
    &lt;span class="na"&gt;UpdateReplacePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete&lt;/span&gt;
    &lt;span class="na"&gt;DeletionPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete&lt;/span&gt;
    &lt;span class="na"&gt;Metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s"&gt;aws:cdk:path: MyStack/MyQueue/Resource&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template contains a resource &lt;code&gt;AWS::SQS::Queue&lt;/code&gt; with Logical ID &lt;code&gt;MyQueueE6CA6235&lt;/code&gt;. As you can see, the Logical ID is the Construct ID we provided, with an extra suffix added by the CDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDK Constructs&lt;/strong&gt; relate to &lt;strong&gt;CloudFormation resources&lt;/strong&gt; in a one-to-many relationship. A single CDK Construct can create one or more CloudFormation resources. In this example, the Queue Construct creates a single &lt;code&gt;AWS::SQS::Queue&lt;/code&gt; resource.&lt;/p&gt;

&lt;p&gt;Yet another thing is the resource name, or the &lt;strong&gt;Physical ID&lt;/strong&gt;. If you go to the SQS page in the AWS Console, you will find a queue with a name like &lt;code&gt;MyStack-MyQueueE6CA6235-86lqOs0JG5ZC&lt;/code&gt;. It's the name auto-generated by CloudFormation, consisting of the stack name, the resource Logical ID, and a random suffix added by CloudFormation for uniqueness. This will become important further down the road, so read on.&lt;/p&gt;

&lt;p&gt;For now, we have three IDs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Construct ID that we set in the CDK code (&lt;code&gt;MyQueue&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;the Logical ID generated by the CDK and put in the CloudFormation template (&lt;code&gt;MyQueueE6CA6235&lt;/code&gt;),&lt;/li&gt;
&lt;li&gt;the Physical ID (resource name) generated by CloudFormation (&lt;code&gt;MyStack-MyQueueE6CA6235-86lqOs0JG5ZC&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, the Physical ID is part of the ARN (Amazon Resource Name) used by clients to make API calls to the resource. The Logical ID matters to CloudFormation, but the Physical ID is what the resource's clients need.&lt;/p&gt;
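
&lt;p&gt;For illustration, this is roughly what our queue's ARN could look like - the region and account ID below are placeholders, but note the Physical ID embedded at the end:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;arn:aws:sqs:eu-west-1:123456789012:MyStack-MyQueueE6CA6235-86lqOs0JG5ZC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;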

&lt;h2&gt;
  
  
  How CloudFormation tracks resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CloudFormation identifies resources by their Logical IDs.&lt;/strong&gt; If we change the Logical ID in the CloudFormation template, CloudFormation sees it as two changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;removal of the old resource,&lt;/li&gt;
&lt;li&gt;and creation of the new one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In CloudFormation terms, this is called &lt;strong&gt;replacing&lt;/strong&gt; the resource.&lt;/p&gt;

&lt;p&gt;This behavior is described in the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-get-template.html"&gt;CloudFormation documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For most resources, changing the logical name of a resource is equivalent to deleting that resource and replacing it with a new one. Any other resources that depend on the renamed resource also need to be updated and might cause them to be replaced.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The simplest way to provoke it is to change the Construct ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyRenamedQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run &lt;code&gt;cdk deploy&lt;/code&gt;, CloudFormation will first create a new SQS queue and only then remove the old one.&lt;/p&gt;

&lt;p&gt;The order of operations is essential here - &lt;strong&gt;CloudFormation will first create new resources, and only after that succeeds will it remove the old ones&lt;/strong&gt;. This minimizes downtime and prevents the removal of existing resources if something goes wrong during the update and it needs to be rolled back.&lt;/p&gt;

&lt;p&gt;Old resources are removed in the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-view-stack-data-resources.html"&gt;UPDATE_COMPLETE_CLEANUP_IN_PROGRESS phase&lt;/a&gt;, which is described as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ongoing removal of old resources for one or more stacks after a successful stack update. For stack updates that require resources to be replaced, CloudFormation creates the new resources first and then deletes the old resources to help reduce any interruptions with your stack. In this state, the stack has been updated and is usable, but CloudFormation is still deleting the old resources.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the example above, we changed the Construct ID (and, therefore, the Logical ID), and the update went smoothly. But that's not always the case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangers of replacing CloudFormation resources
&lt;/h2&gt;

&lt;p&gt;By changing the CloudFormation resource Logical ID, we removed the existing SQS queue and created a new one. That's a dangerous thing to do in a production environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Losing production data by accident
&lt;/h3&gt;

&lt;p&gt;What if the queue had messages that were not yet processed? We would lose them.&lt;/p&gt;

&lt;p&gt;If, instead of an SQS queue, it were a DynamoDB table, an RDS instance, or any other database - we would replace it with a fresh, empty one.&lt;/p&gt;

&lt;p&gt;It's also not good when dealing with stateless resources like Lambda functions. By replacing one resource with another, we lose metrics continuity.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudFormation resource already exists error
&lt;/h3&gt;

&lt;p&gt;Losing data is not the only potential problem. Sometimes, CloudFormation may not perform the update at all, telling us that the resource we want to create already exists.&lt;/p&gt;

&lt;p&gt;Let's modify the first version of our Stack and add the &lt;code&gt;queueName&lt;/code&gt; property. This property corresponds to the queue's Physical ID. Previously, CloudFormation generated that name for us, keeping it unique by adding a random suffix. Now, we hardcode it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;queueName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-queue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we deploy the stack now and then do the same as before - change the Construct ID from &lt;code&gt;MyQueue&lt;/code&gt; to &lt;code&gt;MyRenamedQueue&lt;/code&gt;, leaving the &lt;code&gt;queueName&lt;/code&gt; as it is - updating the CloudFormation stack will fail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE_FAILED | AWS::SQS::Queue | MyRenamedQueue
Resource handler returned message: "Resource of type 'AWS::SQS::Queue' with identifier 'my-queue' already exists." (RequestToken: 557cc5a2-5e53-feb7-1d7e-63d41aed398f, HandlerErrorCode: AlreadyExists)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is that?&lt;/p&gt;

&lt;p&gt;The queue name must be unique within a given AWS account and region. The same goes for Lambda functions, DynamoDB tables, and, frankly, most other AWS resources.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But wait!&lt;/em&gt;, you may say. We did not declare a second SQS queue with the same name. Our stack still contains a single queue.&lt;/p&gt;

&lt;p&gt;But let's look at the order of operations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We create a CDK Construct with ID &lt;code&gt;MyQueue&lt;/code&gt; and name &lt;code&gt;my-queue&lt;/code&gt;

&lt;ol&gt;
&lt;li&gt;CloudFormation creates a queue with Logical ID &lt;code&gt;MyQueueE6CA6235&lt;/code&gt; (suffix added by the CDK) named &lt;code&gt;my-queue&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;We change the CDK Construct ID from &lt;code&gt;MyQueue&lt;/code&gt; to &lt;code&gt;MyRenamedQueue&lt;/code&gt;

&lt;ol&gt;
&lt;li&gt;CloudFormation sees it as the removal of &lt;code&gt;MyQueueE6CA6235&lt;/code&gt; and creation of &lt;code&gt;MyRenamedQueue5E166F18&lt;/code&gt; (suffix added by the CDK)&lt;/li&gt;
&lt;li&gt;Firstly, it tries to create the new queue &lt;code&gt;MyRenamedQueue5E166F18&lt;/code&gt; named &lt;code&gt;my-queue&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Creation fails - queue with name &lt;code&gt;my-queue&lt;/code&gt; already exists&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How to fix it? There are two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Restore the original Construct ID. However, as we will see in a moment, it may not always be possible if we refactor the code.&lt;/li&gt;
&lt;li&gt; Comment out the Construct, re-deploy the Stack (so the old resource is removed), uncomment the Construct, and re-deploy again to create the new resource.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Preventing CloudFormation resources replacement
&lt;/h2&gt;

&lt;p&gt;Okay, so to prevent all those problems, is it enough to not set the resource names by hand and not modify the Construct IDs? Well, unfortunately, it's not that simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  Letting CloudFormation generate unique names
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The best practice is to let CloudFormation generate unique resource names instead of hardcoding them.&lt;/strong&gt; This has two benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we prevent the errors like the one described above,&lt;/li&gt;
&lt;li&gt;we can deploy multiple instances of the same CloudFormation stack on the same account, for example, to create various environments of our service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(The latter can also be achieved with hardcoded names by including the environment name in the resource name.)&lt;/p&gt;
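
&lt;p&gt;As a minimal sketch of that approach - assuming an &lt;code&gt;environment&lt;/code&gt; value that, in a real stack, would come from props or CDK context - the environment name can be baked into the hardcoded name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumption: in a real stack, 'environment' would come from props or CDK context
const environment = 'dev';

new Queue(this, 'MyQueue', {
    queueName: `my-queue-${environment}`,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;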

&lt;p&gt;But sometimes, auto-generated names are not suitable. In my experience, "hardcoded" names are better:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for resources shared with other AWS accounts (for example, if a service in another AWS account pushes messages directly to our SQS queue) because if we remove and re-create the stack, the resource ARN will not change, and no update of external clients will be needed,&lt;/li&gt;
&lt;li&gt;for resources like Glue Tables, where a nice and short name is much better to use in Athena queries, and it needs to be unique only in the scope of the Glue Database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Not changing CDK Construct IDs
&lt;/h3&gt;

&lt;p&gt;But as we discussed earlier, replacing resources is likely not the best thing to do in the first place. So to prevent it, we just don't modify the CDK Construct IDs. Simple enough, right?&lt;/p&gt;

&lt;p&gt;Well, you can guess - not really.&lt;/p&gt;

&lt;p&gt;Let's look again at our simple Stack with a Queue Construct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's say that as our service grows, we add more SQS queues and always need them to have dead-letter queues (DLQ) configured. So instead of repeating ourselves, we extract that setup into a separate Construct. Remember, &lt;strong&gt;Constructs are abstract CDK building blocks&lt;/strong&gt; you can nest, and each Construct may create one or more CloudFormation resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;MyQueueWithDLQ&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueueWithDLQ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;MyQueueWithDLQ&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dlq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DLQ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;deadLetterQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;maxReceiveCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;dlq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We've moved the Queue from &lt;code&gt;MyStack&lt;/code&gt; into the &lt;code&gt;MyQueueWithDLQ&lt;/code&gt; Construct (instantiated with the ID &lt;code&gt;MyCustomQueue&lt;/code&gt;). But the Queue's own Construct ID stays the same - it's still &lt;code&gt;MyQueue&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If we re-deploy the stack now, we will see two new queues created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MyCustomQueueMyQueue20F468EB&lt;/code&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MyCustomQueueDLQE6D3019E&lt;/code&gt;,&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and the existing one removed.&lt;/p&gt;

&lt;p&gt;Why is that?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDK generates the Logical IDs based on the full Construct "path".&lt;/strong&gt; With nested Constructs, IDs of all "higher" Constructs are used to create the unique Logical ID. So when the path changed from &lt;code&gt;MyQueue&lt;/code&gt; to &lt;code&gt;MyCustomQueue/MyQueue&lt;/code&gt;, the generated Logical ID changed from &lt;code&gt;MyQueueE6CA6235&lt;/code&gt; to &lt;code&gt;MyCustomQueueMyQueue20F468EB&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So even if we don't change the Construct IDs, moving Constructs into other Constructs changes the generated Logical IDs. This often happens during development or refactoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pinning Logical IDs during CDK refactoring
&lt;/h3&gt;

&lt;p&gt;Thankfully, we can still refactor our CDK code while preventing changes to resources' Logical IDs.&lt;/p&gt;

&lt;p&gt;To do so, we can override the Logical ID, setting it by hand instead of letting CDK generate it. Of course, doing this ahead of time is not recommended - it's meant for when we refactor the code and want to move an existing Construct. Then, we can check the current Logical ID and "pin" it so it won't change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;CfnQueue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyRenamedQueue&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;defaultChild&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;CfnQueue&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;overrideLogicalId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyQueueE6CA6235&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
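
&lt;p&gt;A good sanity check for this technique is to run &lt;code&gt;cdk diff&lt;/code&gt; before deploying - if the refactoring with the pinned Logical ID is transparent to CloudFormation, the diff should show no resource replacements.&lt;/p&gt;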



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I hope this post clarifies how CDK and CloudFormation track resources and makes their behavior less confusing.&lt;/p&gt;

&lt;p&gt;What's important is that &lt;strong&gt;CloudFormation identifies resources by the Logical ID&lt;/strong&gt;, not the name or any other property. So if you change the Logical ID, a new resource is created, and then the old one is removed.&lt;/p&gt;

&lt;p&gt;Replacing resources with new ones is usually safe in development environments but dangerous in production, where it can cause us to lose data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDK generates the Logical IDs from the Construct ID.&lt;/strong&gt; If you have nested Constructs, all higher Construct IDs are used to generate the Logical ID. Moving a Construct into another Construct changes its Logical ID.&lt;/p&gt;

&lt;p&gt;When we refactor the CDK code and want to move the Construct without causing the resource to be replaced, we can pin down the current Logical ID.&lt;/p&gt;

&lt;p&gt;A particularly nasty problem is changing the Logical ID of a resource with a hardcoded name. CloudFormation will first try to create the new resource and fail because the resource with the same name already exists. The solution is to either revert to the previous Logical ID or to temporarily remove the Construct from the Stack, re-deploy to remove the old resource, restore the Construct, and re-deploy again.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>serverless</category>
      <category>cloudformation</category>
    </item>
    <item>
      <title>Top 12 Serverless Announcements from re:Invent 2022</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Wed, 07 Dec 2022 09:07:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/top-12-serverless-announcements-from-reinvent-2022-3mgh</link>
      <guid>https://forem.com/aws-builders/top-12-serverless-announcements-from-reinvent-2022-3mgh</guid>
      <description>&lt;p&gt;re:Invent 2022, the annual AWS conference in Las Vegas, is now behind us. I did not attend in person, but that gave me time to consolidate this list of top new serverless features while everyone else is sleeping off the intense 5-day conference. And I envy them just a little.&lt;/p&gt;

&lt;h2&gt;
  
  
  pre:Invent
&lt;/h2&gt;

&lt;p&gt;"pre:Invent" is a few weeks before the actual conference. You can always see an increased number of features and improvements releases in that period.&lt;/p&gt;

&lt;p&gt;Here are my favorite picks.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔑 Multiple MFA devices in IAM
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/security/you-can-now-assign-multiple-mfa-devices-in-iam/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally.&lt;/p&gt;

&lt;p&gt;You could already set up &lt;strong&gt;Multi-Factor Authentication&lt;/strong&gt; for IAM users and the account root user. But until now, you were limited to a single MFA device. This was not ideal. If the device was lost or destroyed, you could get locked out of the account.&lt;/p&gt;

&lt;p&gt;But it's no issue anymore. Now you can assign up to 8 MFA devices, which can be a:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;virtual MFA device - like the &lt;a href="https://authy.com/features/setup/"&gt;Authy&lt;/a&gt; app&lt;/li&gt;
&lt;li&gt;FIDO security key - such as &lt;a href="https://www.yubico.com/products/"&gt;YubiKey&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;hardware TOTP token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don't have MFA enabled yet, especially for your AWS account root user - it's about time. Virtual MFA is easy and free to set up. On the other hand, FIDO is more secure, although it requires having a security key. Good news - you may be eligible for &lt;a href="https://aws.amazon.com/security/amazon-security-initiatives/free-mfa-security-key/"&gt;a free YubiKey from AWS&lt;/a&gt; if you are from the US.&lt;/p&gt;

&lt;p&gt;Yes, this is not serverless per se, but it's too important to omit.&lt;/p&gt;

&lt;h3&gt;
  
  
  🏃‍♂️ Lambda Node.js 18 runtime
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/compute/node-js-18-x-runtime-now-available-in-aws-lambda/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;18.x is the currently active LTS version of Node.js. As with every version, it comes with various new features and improvements. One of the most significant is the &lt;strong&gt;Fetch API&lt;/strong&gt;, bringing the well-known &lt;code&gt;fetch()&lt;/code&gt; function from browsers to the backend, eliminating the need for third-party packages to make HTTP requests (or at least to make them easily). While still experimental, the Fetch API is available by default in Node 18.&lt;/p&gt;
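
&lt;p&gt;As a quick, hedged sketch of what this enables (the URL and response shape are placeholders), a Node.js 18 Lambda handler can call an external API without any HTTP client dependency:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;export async function handler() {
    // fetch() is available globally in the Node.js 18 runtime - no import or package needed
    const response = await fetch('https://api.example.com/data'); // placeholder URL
    const data = await response.json();

    return { statusCode: 200, body: JSON.stringify(data) };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;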

&lt;p&gt;But, maybe even more importantly, &lt;strong&gt;Node.js 18.x Lambda runtime comes with AWS SDK v3 included&lt;/strong&gt;. That replaces AWS SDK v2, which was available in the previous runtime versions. Now, while using the new SDK v3, you can omit it from your code bundle to reduce its size since the SDK is already available in the runtime. That's not my favorite practice, but I know many folks are doing so.&lt;/p&gt;

&lt;p&gt;However, there are &lt;a href="https://github.com/aws/aws-lambda-base-images/issues/47#issuecomment-1327249479"&gt;reports of increased cold starts with Node 18 runtime&lt;/a&gt; versus Node 16. Hopefully, the Lambda team will improve this soon.&lt;/p&gt;

&lt;p&gt;If you are using the AWS JS SDK v3, the best way to mock it for unit tests is to use the &lt;a href="https://github.com/m-radzikowski/aws-sdk-client-mock"&gt;aws-sdk-client-mock&lt;/a&gt; library.&lt;/p&gt;
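
&lt;p&gt;As a minimal test sketch with that library - the queue URL is a placeholder, and a Jest-style test runner is assumed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { mockClient } from 'aws-sdk-client-mock';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

// Replace the real SQS client with a mock for the duration of the tests
const sqsMock = mockClient(SQSClient);
sqsMock.on(SendMessageCommand).resolves({ MessageId: '12345678-0000-0000-0000-000000000000' });

it('sends a message', async function () {
    const sqs = new SQSClient({});
    const result = await sqs.send(new SendMessageCommand({
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue', // placeholder
        MessageBody: 'hello',
    }));
    expect(result.MessageId).toBeDefined();
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;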

&lt;h3&gt;
  
  
  ⏰ EventBridge Scheduler
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/compute/introducing-amazon-eventbridge-scheduler/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new EventBridge capability allows scheduling tasks for execution. But wait, we already had CloudWatch Events, later transformed into EventBridge scheduled rules. So what's new here, you may ask?&lt;/p&gt;

&lt;p&gt;Well, the new EventBridge Scheduler is much more powerful. For instance, &lt;strong&gt;it integrates with hundreds of AWS services&lt;/strong&gt;, allowing you to make thousands of API calls directly without a Lambda function.&lt;/p&gt;

&lt;p&gt;But the most distinct feature is &lt;strong&gt;one-time schedules&lt;/strong&gt;. Until now, setting up one-off actions to be executed in the future involved architecture patterns with DynamoDB TTL or periodic status checking. Now, you can offload this to EventBridge.&lt;/p&gt;

&lt;p&gt;The Scheduler comes with a soft limit of 1 million scheduled tasks, high throughput, and configurable time windows for distributing the load. The only drawback is that one-time tasks are not automatically deleted and count toward the scheduled tasks limit. However, the responsible team is &lt;a href="https://twitter.com/pinskinator/status/1594784355003830272"&gt;working on improving this soon&lt;/a&gt;.&lt;/p&gt;
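
&lt;p&gt;A rough sketch of creating a one-time schedule with the SDK - all names, ARNs, and the date below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { SchedulerClient, CreateScheduleCommand } from '@aws-sdk/client-scheduler';

export async function scheduleOneTimeTask() {
    const scheduler = new SchedulerClient({});

    // 'at(...)' creates a one-time schedule for the given UTC date and time
    await scheduler.send(new CreateScheduleCommand({
        Name: 'my-one-time-task', // placeholder
        ScheduleExpression: 'at(2023-01-01T12:00:00)',
        FlexibleTimeWindow: { Mode: 'OFF' },
        Target: {
            Arn: 'arn:aws:lambda:us-east-1:123456789012:function:my-function', // placeholder
            RoleArn: 'arn:aws:iam::123456789012:role/my-scheduler-role', // placeholder
        },
    }));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;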

&lt;h3&gt;
  
  
  📨 EventBridge suffix, case-insensitive, and OR matching
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-eventbridge-enhanced-filtering-capabilities/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Continuing with EventBridge, content-based event filtering has new capabilities. Now you can filter by a &lt;code&gt;suffix&lt;/code&gt; - this was a highly requested feature, with one use-case being &lt;strong&gt;filtering S3 object events by the file extension&lt;/strong&gt;. There is also a new &lt;code&gt;equals-ignore-case&lt;/code&gt; condition and an &lt;code&gt;$or&lt;/code&gt; directive to match if any of the provided conditions match.&lt;/p&gt;
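
&lt;p&gt;For example, a sketch of an event pattern using the new &lt;code&gt;suffix&lt;/code&gt; condition to match S3 "Object Created" events for CSV files - the bucket name is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// EventBridge event pattern, shown here as a TypeScript object
const pattern = {
    source: ['aws.s3'],
    'detail-type': ['Object Created'],
    detail: {
        bucket: { name: ['my-bucket'] }, // placeholder
        object: { key: [{ suffix: '.csv' }] },
    },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;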

&lt;p&gt;See the &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns-content-based-filtering.html"&gt;documentation&lt;/a&gt; for the description of all filters.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚀 AppSync JavaScript Resolvers
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/aws-appsync-graphql-apis-supports-javascript-resolvers/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was &lt;a href="https://github.com/aws/aws-appsync-community/issues/147"&gt;the top-voted, long-awaited request for AppSync&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Resolvers are code snippets that integrate AppSync with other services. They are used to prepare the request and parse the response. Until now, you had to write them in VTL (Apache Velocity Template Language) - a format beloved by developers. If they didn't love it, why would they spend so much time writing VTLs, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JavaScript Resolvers are the new default in AppSync.&lt;/strong&gt; However, they come with several limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.aws.amazon.com/appsync/latest/devguide/core-features.html"&gt;not whole JavaScript syntax is supported&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;asynchronous operations are not supported,&lt;/li&gt;
&lt;li&gt;there is no external network connectivity,&lt;/li&gt;
&lt;li&gt;code must be a single file under 32 KB in size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, even in JavaScript, they are still AppSync Resolvers. Their role is to prepare payloads that AppSync will pass on. They are not a replacement for Lambda functions for more complex operations.&lt;/p&gt;

&lt;p&gt;Still, this is a great improvement. With JavaScript, writing and testing Resolvers will be much easier. And, of course, you can use TypeScript and transpile it to JS!&lt;/p&gt;
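
&lt;p&gt;For a taste of the format, here is a hedged sketch of a resolver for a DynamoDB &lt;code&gt;GetItem&lt;/code&gt; data source, assuming a hypothetical &lt;code&gt;id&lt;/code&gt; argument:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { util } from '@aws-appsync/utils';

// Prepares the DynamoDB GetItem request from the GraphQL arguments
export function request(ctx) {
    return {
        operation: 'GetItem',
        key: util.dynamodb.toMapValues({ id: ctx.args.id }),
    };
}

// Passes the DynamoDB item through as the GraphQL response
export function response(ctx) {
    return ctx.result;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;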

&lt;h3&gt;
  
  
  🧩 Cross-account access in Step Functions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/compute/introducing-cross-account-access-capabilities-for-aws-step-functions/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step Functions Task steps can now assume provided IAM roles and access resources on other AWS accounts directly.&lt;/p&gt;

&lt;p&gt;Until now, to access another account, you needed a Lambda function that would assume a cross-account role. &lt;strong&gt;Now you just provide the role ARN in the Task definition, and Step Function assumes it.&lt;/strong&gt; This way, you can make any API call to any service on a different account (with a role that gives you access to it, of course).&lt;/p&gt;
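
&lt;p&gt;In the state definition, this boils down to a &lt;code&gt;Credentials&lt;/code&gt; field. A sketch with placeholder ARNs, written here as a TypeScript object for readability:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Task state (Amazon States Language) assuming a cross-account role before the API call
const taskState = {
    Type: 'Task',
    Resource: 'arn:aws:states:::aws-sdk:sqs:sendMessage',
    Credentials: { RoleArn: 'arn:aws:iam::111122223333:role/cross-account-role' }, // placeholder
    Parameters: {
        QueueUrl: 'https://sqs.us-east-1.amazonaws.com/111122223333/my-queue', // placeholder
        MessageBody: 'hello from another account',
    },
    End: true,
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;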

&lt;h2&gt;
  
  
  re:Invent
&lt;/h2&gt;

&lt;p&gt;Of course, the biggest announcements were on the re:Invent itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚰 EventBridge Pipes
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/new-create-point-to-point-integrations-between-event-producers-and-consumers-with-amazon-eventbridge-pipes/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EventBridge Pipes are here to make your Lambdas obsolete.&lt;/p&gt;

&lt;p&gt;Pipes are triggered by &lt;strong&gt;events from various sources&lt;/strong&gt;, just like Lambda functions. Then you can filter, &lt;strong&gt;enrich and transform&lt;/strong&gt; the incoming events. Finally, you &lt;strong&gt;send them to a target&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That flow describes a lot of Lambda functions I wrote. With Pipes, it's simple, low-code, reliable, and effective.&lt;/p&gt;

&lt;p&gt;At the moment of the initial release, Pipes support DynamoDB Streams, Kinesis Streams, SQS, MSK, and MQ as event sources. You can use Lambdas, Step Functions, or API calls for enriching events. Finally, Pipes can send events to 15 target destinations, including EventBridge buses, APIs, Kinesis Streams, Kinesis Firehose, SNS, SQS, Step Functions, Lambdas, and more.&lt;/p&gt;

&lt;p&gt;And all those features cost just $0.40 per million invocations (after filtering!). For comparison, it's the same price as for SQS requests. Furthermore, you can optimize it by batching the input events.&lt;/p&gt;

&lt;h3&gt;
  
  
  🪣 Step Functions Distributed Map
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/step-functions-distributed-map-a-serverless-solution-for-large-scale-parallel-data-processing/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step Functions are great for data processing. But the payload you can pass between steps has a limited size, and the limited parallelism affects performance for larger jobs. This still makes file processing very dependent on Lambda functions.&lt;/p&gt;

&lt;p&gt;Well, no more.&lt;/p&gt;

&lt;p&gt;The new flavor of the Map state, the Distributed Map, is here to &lt;strong&gt;orchestrate large-scale processing jobs directly in the Step Functions&lt;/strong&gt;, focusing on S3 files. It can read a JSON or CSV file from S3 and iterate over individual records. Or, even better, it can list files from the S3 location on its own and iterate over them. Then, for processing the records or files, it starts separate child workflows with up to 10,000 parallel executions. And to optimize the work, it can process in batches (with a single child workflow getting multiple records/files as input).&lt;/p&gt;

&lt;h3&gt;
  
  
  🫰 Lambda SnapStart for Java
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/new-accelerate-your-lambda-functions-with-lambda-snapstart/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Java is known for long cold starts on Lambda. And even though I say cold starts are not a big problem in most cases, I mean cases where the initialization takes 0.5-1 second. With Java, it's often above 5 seconds, which is a whole different story.&lt;/p&gt;

&lt;p&gt;Probably that's why AWS decided to tackle the issue, starting with Java first. With the new SnapStart feature, &lt;strong&gt;the function initialization happens during the deployment&lt;/strong&gt;. Then the disk and memory state of the initialized environment are cached. So when you invoke the function, &lt;strong&gt;the environment is restored from the cache in under 200 ms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I'm unlikely to write any Lambda function in Java. However, I'm hoping the SnapStart will also become available on other runtimes. If (or when) it comes to Python and Docker, it will be a game changer for &lt;a href="https://betterdev.blog/serverless-ml-on-aws-lambda/"&gt;serverless Machine Learning solutions&lt;/a&gt;, which also suffer from long cold starts.&lt;/p&gt;

&lt;h3&gt;
  
  
  🕵️‍♂️ Inspector support for Lambda
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/amazon-inspector-now-scans-aws-lambda-functions-for-vulnerabilities/"&gt;(announcement post)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon Inspector is a service that scans software libraries against known security vulnerabilities. It does not require installing any additional dependencies or agents. And after EC2 and ECR, it now supports Lambda functions.&lt;/p&gt;

&lt;p&gt;You just enable the Inspector in the AWS Console. &lt;strong&gt;Then it automatically and continuously scans all the Lambda functions on the account.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How much does security cost? $0.30/Lambda/month.&lt;/p&gt;

&lt;p&gt;Should you enable it right away on the production account? Probably yes, unless you already have dependency vulnerabilities scanning in place (like GitHub Dependabot or Snyk).&lt;/p&gt;

&lt;h2&gt;
  
  
  no:Invent
&lt;/h2&gt;

&lt;p&gt;Unfortunately, there were some disappointments as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  💸 OpenSearch "Serverless"
&lt;/h3&gt;

&lt;p&gt;One of the promises of serverless is no-use, no-pay pricing. AWS themselves said it &lt;a href="https://aws.amazon.com/blogs/industries/how-sgk-reduced-operating-costs-by-83-with-noops-serverless-microservices/"&gt;multiple&lt;/a&gt; &lt;a href="https://aws.amazon.com/blogs/publicsector/scaling-zero-serverless-way-future-university-of-york/"&gt;times&lt;/a&gt; in the past.&lt;/p&gt;

&lt;p&gt;But this year, AWS decided to break that promise. In my opinion - for marketing purposes, because "serverless" is trending now.&lt;/p&gt;

&lt;p&gt;So after MSK "Serverless", Aurora "Serverless" v2, and Neptune "Serverless", now we got OpenSearch "Serverless".&lt;/p&gt;

&lt;p&gt;The problem with all of them? They do not scale down to zero. Therefore, you will pay a minimum fee for created instances, even if not used at all.&lt;/p&gt;

&lt;p&gt;How much? &lt;strong&gt;Almost $700/month for the OpenSearch "Serverless".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why is that a problem? I'm glad you asked. &lt;a href="https://www.lastweekinaws.com/blog/no-aws-aurora-serverless-v2-is-not-serverless/"&gt;I wrote about this after the Aurora "Serverless" v2 release.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And don't get me wrong. The auto-scaling offer of all those services is a wonderful thing. I also understand it's not easy to make a database that will scale down to zero and then scale up to handle incoming requests with no additional latency. My only problem lies in the misleading naming.&lt;/p&gt;

&lt;h3&gt;
  
  
  🏅 No Serverless Specialty Certificate
&lt;/h3&gt;

&lt;p&gt;Despite all the marketing around serverless, there is still no Serverless Specialty &lt;a href="https://betterdev.blog/how-to-pass-aws-certification-exams/"&gt;AWS certificate&lt;/a&gt;. While serverless solutions are part of the Associate and Professional certificate exams, they make up only about 10% of the questions. A certificate that proves knowledge of modern, serverless architectures and solutions without EC2 machines and complex network routing is something the community eagerly awaits.&lt;/p&gt;

&lt;p&gt;But we got a consolation prize - &lt;a href="https://aws.amazon.com/blogs/compute/introducing-new-aws-serverless-digital-learning-badges/"&gt;a Serverless Learning Path in the AWS Skill Builder&lt;/a&gt;. It's a free, self-paced, online course where you can earn a badge on completion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notable mentions
&lt;/h2&gt;

&lt;p&gt;There were many, many more releases at re:Invent this year, serverless and otherwise.&lt;/p&gt;

&lt;p&gt;You can now &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/11/manage-resources-aws-organizations-cloudformation/"&gt;manage your AWS Organization through CloudFormation&lt;/a&gt;, including creating accounts, organizational units, and policies. It's one of those things you are surprised were not already possible. However, I will stick to the &lt;a href="https://github.com/org-formation/org-formation-cli"&gt;OrgFormation&lt;/a&gt; for my own accounts, as it offers additional features like deploying stacks and performing custom logic across the organization.&lt;/p&gt;

&lt;p&gt;AWS Glue, a service I'm not a big fan of personally, announced &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/11/introducing-aws-glue-4-0/"&gt;version 4.0&lt;/a&gt; and several new capabilities.&lt;/p&gt;

&lt;p&gt;SageMaker, &lt;a href="https://betterdev.blog/serverless-ml-on-aws-lambda/#why_not_sagemaker_serverless"&gt;already bloated with features&lt;/a&gt;, got at least a dozen more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/compute/visualize-and-create-your-serverless-workloads-with-aws-application-composer/"&gt;Application Composer&lt;/a&gt; is a new visual tool for designing serverless applications. After the Step Functions Workflow Studio, it's another drag-and-drop solution suggesting that AWS wants to improve on the Developer Experience field. However, I doubt I will use it myself. I don't believe in a drag-and-drop application design. And it integrates with SAM for IaC while I'm on the team CDK.&lt;/p&gt;

&lt;p&gt;However, I'm looking forward to learning more about &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-verified-permissions-preview/"&gt;Amazon Verified Permissions&lt;/a&gt;, which is now in a closed preview. From my understanding, it will allow you to offload application permission management to AWS. I'll definitely give it a try.&lt;/p&gt;

&lt;h2&gt;
  
  
  Direction of serverless
&lt;/h2&gt;

&lt;p&gt;AWS Lambda is no longer a necessary element of serverless applications. More and more solutions can exclusively rely on low-code services like AppSync with its direct integrations (now connected with JavaScript), EventBridge (now with Pipes), or Step Functions (now with built-in file processing). If Lambda functions are used, their role is reduced.&lt;/p&gt;

&lt;p&gt;And that's a great thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code is a liability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By moving the standard and repeatable tasks to the platform, we can innovate faster. There is less code to write, test, and maintain. Less code also means a lower risk of bugs.&lt;/p&gt;

&lt;p&gt;And that's the idea of serverless. Fewer things for us, developers, to manage. More focus on what matters for the business.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sessions recordings
&lt;/h2&gt;

&lt;p&gt;AWS re:Invent is not only about exciting new launches. It's also a lot of tech sessions.&lt;/p&gt;

&lt;p&gt;Sure, some talks are just brand marketing. But many technical presentations are given by the best people in the industry who built the solutions you are using. Those sessions come at all levels of advancement.&lt;/p&gt;

&lt;p&gt;Their recordings are available on this lengthy playlist (over 440 videos at this moment!): &lt;a href="https://www.youtube.com/playlist?list=PL2yQDdvlhXf_hIzmfHCdbcXj2hS52oP9r"&gt;AWS re:Invent 2022 sessions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>conference</category>
    </item>
    <item>
      <title>How to pass AWS Certification exams</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Wed, 23 Nov 2022 17:48:19 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-pass-aws-certification-exams-9ch</link>
      <guid>https://forem.com/aws-builders/how-to-pass-aws-certification-exams-9ch</guid>
      <description>&lt;p&gt;I've never cared too much about certificates, apart from the SSL ones (haha). And yet I passed 7 AWS exams. Why? How to prepare? How to pass? How to pay only 50% for the exam? I answer all this and more in this post.&lt;/p&gt;

&lt;p&gt;After passing both Professional-level exams, the DevOps Engineer and the Solutions Architect, I shared my thoughts on them on Twitter. People were interested, so this post extends those tweets with content universal for all AWS certificates.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;
      &lt;div class="ltag__twitter-tweet__media"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcZpSvY1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/FgkwntGVQAAjypD.png" alt="unknown tweet media content"&gt;
      &lt;/div&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--3RiorXQ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1315917249123942402/Oe639m6U_normal.jpg" alt="Maciej Radzikowski profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Maciej Radzikowski
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @radzikowski_m
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      This week I passed the AWS Architect Professional exam 🎉&lt;br&gt;&lt;br&gt;Here are my hints on preparing, exam scope, understanding questions, and some pro tips!&lt;br&gt;&lt;br&gt;A thread 🧵👇 
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      17:07 - 02 Nov 2022
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1587853805961486336" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1587853805961486336" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1587853805961486336" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;



&lt;blockquote class="ltag__twitter-tweet"&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--3RiorXQ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1315917249123942402/Oe639m6U_normal.jpg" alt="Maciej Radzikowski profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Maciej Radzikowski
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @radzikowski_m
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      This week I passed the AWS DevOps Professional exam 🎉&lt;br&gt;&lt;br&gt;Here is a little guide - scope, what to pay attention to, and how I prepared for it in less than 2 weeks.&lt;br&gt;&lt;br&gt;A thread 🧵👇
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      17:00 - 13 Jan 2022
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1481672401565876225" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1481672401565876225" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1481672401565876225" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What are AWS certificates?
&lt;/h2&gt;

&lt;p&gt;AWS offers 12 certificates. They come in four categories, covering different areas of AWS and varying in difficulty.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lKr3Mr8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8etdrlk2u8ssfspr1rhw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lKr3Mr8X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8etdrlk2u8ssfspr1rhw.png" alt="All 12 AWS Certificates" width="531" height="746"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Foundational, Associate, and Professional-level certificates form learning paths for Architects and Engineers. They cover a broad spectrum of AWS services and solutions built on AWS. Specialty certificates focus on particular areas and go into much greater detail on the services in scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which AWS certificate is for you?
&lt;/h3&gt;

&lt;p&gt;If you are a "tech" person - &lt;strong&gt;don't take the Cloud Practitioner - Foundational exam&lt;/strong&gt;. It is very abstract, requiring just a knowledge of what the cloud is, its advantages, concepts, and the purpose of core services. However, it may be a proper exam for Sales, Marketing, or Agile people from your organization, helping them understand the technology the Engineers are working with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is no required order for taking the exams.&lt;/strong&gt; You don't need Associate certifications to pursue Professional ones.&lt;/p&gt;

&lt;p&gt;Nonetheless, &lt;strong&gt;I suggest starting with an Associate-level exam first&lt;/strong&gt;. Which one? The one that best matches your expertise and practical experience. The scope and expected knowledge for each exam are listed on &lt;a href="https://aws.amazon.com/certification/"&gt;the AWS Certification&lt;/a&gt; pages.&lt;/p&gt;

&lt;p&gt;Only after obtaining one or two Associate certificates would I recommend going for the Professional or Specialty ones, which are considerably more difficult.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much do AWS exams cost?
&lt;/h3&gt;

&lt;p&gt;As you can see in the illustration above, the Foundational-level exam is the cheapest - it costs "only" 100 USD. The Associate-level exams cost 150 USD, and the Professional and Specialty - 300 USD.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, you can pay only half of that!&lt;/strong&gt; After you pass an exam, you get a voucher with a 50% discount for the next one. So, as long as you prepare well enough to pass on the first try, you pay half the price for every exam after the first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do AWS Certificates expire?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;All certificates are valid for three years.&lt;/strong&gt; Then, to keep them active, you must either re-take the exam or achieve a higher-level certification. The connection lines in the above illustration show which certificates prolong which. You can keep the Cloud Practitioner - Foundational certificate active by achieving any Associate-level certificate. The DevOps Engineer - Professional extends the validity of both the Developer and the SysOps Administrator from the Associate level. And the Solutions Architect - Associate can be prolonged by passing the Solutions Architect - Professional exam.&lt;/p&gt;

&lt;p&gt;Why do certificates expire? Besides the obvious monetary reasons, AWS constantly adds new features and services, and the exams are updated over time with refreshed questions to reflect this. After three years, there are always new and better ways to solve particular problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why take AWS exams?
&lt;/h2&gt;

&lt;p&gt;Everyone can have different reasons for getting certified. I will list mine.&lt;/p&gt;

&lt;p&gt;Firstly, for me, &lt;strong&gt;getting certified is a great way to learn&lt;/strong&gt;. The exam scope pushes me to learn about services and features I may not have used and dive deeper into the ones I know. And at least part of this knowledge always comes in handy in my daily work.&lt;/p&gt;

&lt;p&gt;In my opinion, &lt;strong&gt;AWS exams are valuable for being close to real-life problems&lt;/strong&gt;. So after preparing for them, you are left with practical skills. And that's not something you can say about all the technical certificates out there...&lt;/p&gt;

&lt;p&gt;Secondly, &lt;strong&gt;getting certified is promoted by my employer,&lt;/strong&gt; &lt;a href="https://merapar.com/"&gt;&lt;strong&gt;Merapar&lt;/strong&gt;&lt;/a&gt;. The company is an &lt;a href="https://partners.amazonaws.com/partners/0010L00001mauzOQAQ/Merapar"&gt;Advanced Consulting Partner of AWS&lt;/a&gt;, and as such, it must maintain a certain number of active AWS certifications among its employees. Certifications achieved across the company are also proof of knowledge and expertise for our customers.&lt;/p&gt;

&lt;p&gt;Also, even though I discovered it only after the fact, being AWS certified gives you access to AWS Certification Lounges at events like AWS re:Invent or AWS Summit. I was at the Summit in London this year, and there were good snacks in the Lounge, so totally worth it!&lt;/p&gt;

&lt;p&gt;And finally, getting certified &lt;strong&gt;boosts your professional profile and CV&lt;/strong&gt;, opening doors to interviews and promotions. While it's not my most significant reason, it's certainly a valid one. And if you can get your current employer to pay for the learning and certification - something that can considerably help you if you look for another company within the next three years - that's a great deal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do they relate to real work?
&lt;/h3&gt;

&lt;p&gt;On each exam, you will encounter scenarios and services you don't deal with in your job. But that's because there are so many solutions you can deploy on AWS. To make the certification more tailored, it would need to be more granular, ending up with not 12 but 50 different certificates.&lt;/p&gt;

&lt;p&gt;There is probably no architect who, even across a few years, will work with all the scenarios you are tested against for the Solutions Architect - Professional certificate. Nor an engineer who will use all the kinds of databases and accompanying services you need to know for the Database - Specialty exam. But that doesn't invalidate the usefulness of the scope that overlaps with your day-to-day work.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to learn for an AWS Certificate?
&lt;/h2&gt;

&lt;p&gt;I always rely on three elements when learning for AWS Certificate exams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;practical experience,&lt;/li&gt;
&lt;li&gt;certificate course,&lt;/li&gt;
&lt;li&gt;solving practice tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first one, &lt;strong&gt;practical experience&lt;/strong&gt;, is not to be underestimated. The more hands-on experience you have with the services you are questioned about, the better. While you can learn everything in theory and pass the exam, I strongly advise against it. Getting at least some experience in core services for the given certificate will make learning and taking the exam much easier.&lt;/p&gt;

&lt;p&gt;Many times on the AWS exams, I got a question, wasn't sure about the answer right away, and figured it out based on similar work I did in the past. It's much easier to remember something you did hands-on than the information you only learned in the course.&lt;/p&gt;

&lt;p&gt;That leads us to the next part - courses. I recommend &lt;strong&gt;Udemy courses by&lt;/strong&gt; &lt;a href="https://www.udemy.com/user/stephane-maarek/"&gt;&lt;strong&gt;Stephane Maarek&lt;/strong&gt;&lt;/a&gt;. While I'm generally not a fan of video courses, they are the best, most comprehensive way I have found to go through the exam's scope and get condensed knowledge.&lt;/p&gt;

&lt;p&gt;But don't just watch. Active learning is much more effective!&lt;/p&gt;

&lt;p&gt;Make notes. Draw mind maps. Whatever suits you. And above all - &lt;strong&gt;go to the AWS Console and play with the services and concepts you learn&lt;/strong&gt;. If you set things up on your own once or twice, you will be able to recall them much better on the exam and will know which options are possible and which are not.&lt;/p&gt;

&lt;p&gt;And finally - &lt;strong&gt;solving practice tests&lt;/strong&gt;. It's a great learning technique - it forces your brain to actively work on problems and figure out the answer. It works even if you answer incorrectly, as long as you check the correct solution and its justification afterward. It will also prepare you for the type of questions on the exam.&lt;/p&gt;

&lt;p&gt;Where to get the practice questions from? For each certificate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there are 10 sample questions in PDF linked on the AWS certificate page,&lt;/li&gt;
&lt;li&gt;20 more are in the "Official Practice Question Set" on the AWS Skill Builder, also linked on the certificate page in the resources section,&lt;/li&gt;
&lt;li&gt;few more are in the "Exam Readiness" training on the AWS Skill Builder,&lt;/li&gt;
&lt;li&gt;there are separate Udemy courses, again from Stephane Maarek, containing only practice tests with 100 to 400 sample questions,&lt;/li&gt;
&lt;li&gt;you can google for more sample questions for the individual exams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It takes me 2-4 weeks to prepare for each exam. I schedule the exam shortly after I start learning for it - there is no better motivation than a deadline 🙃&lt;/p&gt;

&lt;h2&gt;
  
  
  How to solve AWS exam questions?
&lt;/h2&gt;

&lt;p&gt;Most AWS exam questions are scenario-based. Therefore, you need to know how to read and understand them to solve them.&lt;/p&gt;

&lt;p&gt;My process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the question thoroughly.&lt;/li&gt;
&lt;li&gt;Identify key phrases, services, and requirements.&lt;/li&gt;
&lt;li&gt;Identify the objective of the question.&lt;/li&gt;
&lt;li&gt;Scan through the answers and eliminate obviously incorrect ones.&lt;/li&gt;
&lt;li&gt;Reread the remaining answers and continue eliminating until only one is left, or choose the best of the rest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The "objective" is often highlighted in the question, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Which solution meets the requirements in the &lt;strong&gt;MOST cost-effective&lt;/strong&gt; manner?"&lt;/li&gt;
&lt;li&gt;"Which combination of steps will meet these requirements with the &lt;strong&gt;LEAST change to the architecture&lt;/strong&gt;?"&lt;/li&gt;
&lt;li&gt;"Which solution meets the requirements with the &lt;strong&gt;LOWEST overall latency&lt;/strong&gt;?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multiple answers may present a technically correct solution to a given scenario but bring different pros and cons. Thus, you need to weigh them against the identified objective.&lt;/p&gt;

&lt;p&gt;Always choose some answer. There are no negative points. You can flag a question to go back to it later, but it's better to select an answer right away in case you don't have spare time at the end of the exam.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solving an example exam question
&lt;/h3&gt;

&lt;p&gt;Let's try it! From Database - Specialty sample questions:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A company’s ecommerce application stores order transactions in an Amazon RDS for MySQL database. The database has run out of available storage and the application is currently unable to take orders.&lt;/p&gt;

&lt;p&gt;Which action should a database specialist take to resolve the issue in the shortest amount of time?&lt;/p&gt;

&lt;p&gt;A) Add more storage space to the DB instance using the ModifyDBInstance action.&lt;br&gt;&lt;br&gt;
B) Create a new DB instance with more storage space from the latest backup.&lt;br&gt;&lt;br&gt;
C) Change the DB instance status from STORAGE_FULL to AVAILABLE.&lt;br&gt;&lt;br&gt;
D) Configure a read replica with more storage space.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key phrases and services: &lt;strong&gt;Amazon RDS; storage is full, so writes fail&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Objective: &lt;strong&gt;solve the issue with minimal downtime&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First scan through answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C is incorrect. You can't just "tell the database it isn't full" and expect it to magically work without adding the storage space.&lt;/li&gt;
&lt;li&gt;D is incorrect. A read replica is for distributing reads from the database, while our problem is with writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That leaves us with A and B. Both are theoretically possible. But the objective is to minimize the downtime, and creating a new DB instance from a backup could take hours, depending on the database size. That means B is incorrect too. So the answer is A - adding storage with the ModifyDBInstance action is the only option left, and it sounds reasonable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solving multiple-response questions
&lt;/h3&gt;

&lt;p&gt;AWS exams also include multiple-response questions. The question always indicates how many answers you must choose.&lt;/p&gt;

&lt;p&gt;There are two types of multiple-response questions. You are asked to select either &lt;strong&gt;a combination of steps&lt;/strong&gt; to achieve the solution or &lt;strong&gt;multiple alternative solutions&lt;/strong&gt;. Check the wording of the question carefully.&lt;/p&gt;

&lt;p&gt;However, in 90% of cases, you are asked to choose a combination of steps. Those questions often contain pairs of answers. For example, if you must select 3 answers from 6, there are usually 3 aspects of the question scenario and 2 answers for each. So identify the pairs and choose the better answer from each of them.&lt;/p&gt;

&lt;p&gt;An example from the DevOps Engineer - Professional exam:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A devops engineer wants to implement a blue/green deployment process for an application on AWS and be able to gradually shift the traffic between the environments. The application runs on Amazon EC2 instances behind an Application Load Balancer. The instances run in an EC2 Auto Scaling group. Data is stored in an Amazon RDS Multi-AZ DB instance. External DNS is provided by Amazon Route 53.&lt;/p&gt;

&lt;p&gt;Which combination of steps will implement the blue/green process? (Select THREE.)&lt;/p&gt;

&lt;p&gt;A) Create a second Auto Scaling group behind the same Application Load Balancer.&lt;br&gt;&lt;br&gt;
B) Create a second Application Load Balancer and Auto Scaling group.&lt;br&gt;&lt;br&gt;
C) Create a second alias record in Route 53 pointing to the new environment and use a failover routing policy between the two records.&lt;br&gt;&lt;br&gt;
D) Create a second alias record in Route 53 pointing to the new environment and use a weighted routing policy between the two records.&lt;br&gt;&lt;br&gt;
E) Configure the new EC2 instances to use the same RDS database instance.&lt;br&gt;&lt;br&gt;
F) Configure the new EC2 instances to use the failover node of the RDS database instance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Key phrases and services: &lt;strong&gt;EC2, Auto Scaling, Application Load Balancer, RDS, Route 53&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Objective: &lt;strong&gt;implement blue/green deployment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We can see the pairs of answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A and B refer to the architecture of Auto Scaling groups and Application Load Balancer,&lt;/li&gt;
&lt;li&gt;C and D are about setting up Route 53,&lt;/li&gt;
&lt;li&gt;E and F are about connecting EC2 to RDS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we choose one answer from each pair:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;B - we need a second Application Load Balancer to direct the traffic to it from Route 53.&lt;/li&gt;
&lt;li&gt;D - failover routing is for Disaster Recovery, while weighted routing allows us to shift the traffic gradually.&lt;/li&gt;
&lt;li&gt;E - both environments (blue and green) need to work simultaneously on the same RDS instance, and the failover node is, again, for Disaster Recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to take an AWS Certificate exam?
&lt;/h2&gt;

&lt;p&gt;You register and take the exam through one of the testing companies, Pearson VUE or PSI. From January 1, 2023 - only through Pearson VUE.&lt;/p&gt;

&lt;p&gt;You can take the exam at &lt;strong&gt;a local testing center&lt;/strong&gt; or &lt;strong&gt;online&lt;/strong&gt;. However, if you have a testing center nearby - go there. I took two exams online, and my experience was poor. First, you waste time installing testing software that monitors everything and often requires disabling any antivirus you have. Then you spend more time preparing your room and taking photos of it. And when you are finally ready to start, the software crashes. And then, for the next 30 minutes, you are trying to contact the support and make it work, stressing out about the technical issues on top of the exam itself. And it's not just my experience, but my colleagues' as well. So, if you can, &lt;strong&gt;go to the local testing center&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Sadly, you won't get the results right after you finish the exam. You must wait up to 24 hours. Usually, the first thing you get is a notification from &lt;a href="https://credly.com/"&gt;Credly&lt;/a&gt; about a new badge issued to you (if you passed) and, several hours later, an official email from AWS Training and Certification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;AWS offers certifications that prove your knowledge on several levels and in distinct areas.&lt;/p&gt;

&lt;p&gt;Getting AWS Certified is a good way to deepen your AWS knowledge and prove that knowledge to your (current or future) employer and customers. But it does not replace the hands-on experience. Quite the contrary - it will be much easier to pass an exam having at least some practical experience in the area first.&lt;/p&gt;

&lt;p&gt;It's best to start with an Associate-level certificate. Then continue with Specialty or Professional, depending on your area of expertise.&lt;/p&gt;

&lt;p&gt;While those exams cost from $150 (Associate) to $300 (Professional and Specialty), after each exam you pass, you get a voucher with a 50% discount for the next one. The only trick to always paying half price (except for the first exam) is not to fail 😉&lt;/p&gt;

&lt;p&gt;AWS exams are not trivial, so you must prepare accordingly. I recommend three learning methods in conjunction: getting practical experience in the area, going through a certificate course, and solving practice tests.&lt;/p&gt;

&lt;p&gt;AWS exam questions are usually scenario-based, and you need to learn how to understand and solve them. This is one of the reasons why taking practice tests is so important.&lt;/p&gt;

&lt;p&gt;And finally, you can take the exam at a local test center or online. If available, I recommend the first option, as it's less stressful and, counterintuitively, often less time-consuming.&lt;/p&gt;

&lt;p&gt;Good luck!&lt;/p&gt;

&lt;p&gt;PS. Do you have any tips and tricks for the AWS exams yourself? Please share in the comments!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
    </item>
    <item>
      <title>Running Serverless ML on AWS Lambda</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Mon, 21 Nov 2022 16:18:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/running-serverless-ml-on-aws-lambda-2pbg</link>
      <guid>https://forem.com/aws-builders/running-serverless-ml-on-aws-lambda-2pbg</guid>
      <description>&lt;p&gt;Yes, you can run Machine Learning models on serverless, directly with AWS Lambda. I know because I built and productionized such solutions. It's not complicated, but there are a few things to be aware of. I explain them in this in-depth tutorial, where we build a serverless ML pipeline.&lt;/p&gt;

&lt;p&gt;As always, the link to the complete project on GitHub is at the end of the post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Productionizing ML solutions on AWS
&lt;/h2&gt;

&lt;p&gt;There is a wide variety of advanced ML models available on the internet. You can download and use them with just a few lines of code in a high-level ML library. But there is a gap between running amazing models locally and productionizing their usage.&lt;/p&gt;

&lt;p&gt;Here comes &lt;strong&gt;serverless&lt;/strong&gt;, allowing you to run your models in the cloud as simply as you do ad-hoc jobs locally and build event-driven ML pipelines without managing any infrastructure components. And, of course, all that while paying only for what you actually use, not for some virtual machines waiting idly for work.&lt;/p&gt;

&lt;p&gt;But there must be some troubles along the way. Otherwise, I would not have to write this post.&lt;/p&gt;

&lt;p&gt;The number one problem we face with running ML models is &lt;strong&gt;the size of the dependencies&lt;/strong&gt;. Both ML models and libraries are huge. The other thing to consider is &lt;strong&gt;latency&lt;/strong&gt; - loading the model into memory takes time. But we can tackle both those issues.&lt;/p&gt;

&lt;p&gt;So let's build a serverless Machine Learning pipeline. We just need a use case. How about &lt;strong&gt;automatically generating captions for uploaded images&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless ML pipeline architecture
&lt;/h2&gt;

&lt;p&gt;Our objective is simple: &lt;strong&gt;generate a caption for each uploaded image&lt;/strong&gt;. That's complex enough to make it a real-life example while keeping the tutorial concise.&lt;/p&gt;

&lt;p&gt;The starting point of our pipeline is an S3 bucket. When we upload images to it, the bucket will send notifications about new objects to the Lambda function. There we do the Machine Learning magic and save the generated caption in the DynamoDB table.&lt;/p&gt;

&lt;p&gt;We will define the infrastructure with AWS CDK, &lt;a href="https://betterdev.blog/aws-cdk-pros-and-cons/"&gt;my preferred Infrastructure as Code tool&lt;/a&gt;. Once we declare the infrastructure, the CDK will use CloudFormation to deploy the needed resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R3Ys_MiP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcfrzn4eg9r9jesedalm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R3Ys_MiP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qcfrzn4eg9r9jesedalm.png" alt="Serverless Machine Learning pipeline for auto-generating image captions" width="581" height="361"&gt;&lt;/a&gt;&lt;/p&gt;
Serverless Machine Learning pipeline for auto-generating image captions



&lt;h3&gt;
  
  
  Python, the obvious choice
&lt;/h3&gt;

&lt;p&gt;For Machine Learning, Python is the default language of choice. So this is also our choice for the Lambda function.&lt;/p&gt;

&lt;p&gt;AWS CDK also supports Python, so it makes perfect sense to use it to define the infrastructure and keep the project uniform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not SageMaker Serverless?
&lt;/h3&gt;

&lt;p&gt;Amazon SageMaker is a Swiss Army knife for Machine Learning. But I don't mean the handy pocket version. Rather something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sDItapAQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr7r570w57ymatwaycox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sDItapAQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jr7r570w57ymatwaycox.png" alt="Wenger 16999 Swiss Army Knife Giant, with 87 tools included, looks a lot like Amazon SageMaker" width="880" height="621"&gt;&lt;/a&gt;&lt;/p&gt;
Wenger 16999 Swiss Army Knife Giant, with 87 tools included, looks a lot like Amazon SageMaker



&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html"&gt;SageMaker Serverless Inference&lt;/a&gt;, you can deploy and use an ML model, paying only for the actual usage. However, it runs on AWS Lambda under the hood, bringing the same limitations - like the lack of the GPU.&lt;/p&gt;

&lt;p&gt;At the same time, introducing SageMaker adds complexity to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the architecture - adding another service to our pipeline and calling it from the Lambda function,&lt;/li&gt;
&lt;li&gt;the deployment process - preparing and deploying the model to SageMaker,&lt;/li&gt;
&lt;li&gt;and the code - using the SageMaker library.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the advantages of using SageMaker libraries and tooling could be significant in some scenarios, we intend to use pre-trained models and high-level libraries that will be entirely sufficient on their own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating image captions with Machine Learning
&lt;/h2&gt;

&lt;p&gt;Let's start with the Lambda function code that will be the heart of our pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python libraries
&lt;/h3&gt;

&lt;p&gt;We will use an existing model from &lt;a href="https://huggingface.co/"&gt;🤗 Hugging Face&lt;/a&gt;. It's a platform containing pre-trained ML models for various use cases. It also provides &lt;a href="https://huggingface.co/docs/transformers/index"&gt;🤗 Transformers&lt;/a&gt; - a high-level ML library making using those models dead simple. So simple that even I can use it.&lt;/p&gt;

&lt;p&gt;For dependency management, we will use &lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt;. Why not pip? Because &lt;a href="https://betterprogramming.pub/5-reasons-why-poetry-beats-pip-python-setup-6f6bd3488a04"&gt;Poetry is better in every aspect&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So we use it to install the libraries we need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry add boto3 transformers[torch] pillow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html"&gt;boto3&lt;/a&gt; is an AWS SDK for Python. We will need it to communicate with S3 and DynamoDB from the Lambda function.&lt;/p&gt;

&lt;p&gt;Then we add the 🤗 Hugging Face &lt;code&gt;transformers&lt;/code&gt; library mentioned above, specifying it should also install the &lt;a href="https://pytorch.org/"&gt;PyTorch&lt;/a&gt; ML framework it will use under the hood.&lt;/p&gt;

&lt;p&gt;And finally, we need the &lt;a href="https://pillow.readthedocs.io/en/stable/"&gt;Pillow&lt;/a&gt; library for image processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-trained model
&lt;/h3&gt;

&lt;p&gt;As I mentioned, we will use an existing, pre-trained model that does exactly what we need: &lt;a href="https://huggingface.co/nlpconnect/vit-gpt2-image-captioning"&gt;nlpconnect/vit-gpt2-image-captioning&lt;/a&gt; from 🤗 Hugging Face. We just need to download it.&lt;/p&gt;

&lt;p&gt;Because the pre-trained model is large, around 1 GB, we need the &lt;a href="https://git-lfs.github.com/"&gt;Git LFS&lt;/a&gt; extension installed to download it. Then we run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git lfs &lt;span class="nb"&gt;install
&lt;/span&gt;git clone https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
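
&lt;p&gt;Before wiring the model into Lambda, it's worth a quick local sanity check. Here is a minimal sketch of my own (not from the project repo), assuming the clone sits in the current directory and &lt;code&gt;test.jpg&lt;/code&gt; is any local image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

# Load the cloned model from disk - no network access needed
model = VisionEncoderDecoderModel.from_pretrained("./vit-gpt2-image-captioning")
feature_extractor = ViTFeatureExtractor.from_pretrained("./vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("./vit-gpt2-image-captioning")

# The model expects RGB input
image = Image.open("test.jpg").convert("RGB")

pixel_values = feature_extractor(images=[image], return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=16, num_beams=4)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;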



&lt;h3&gt;
  
  
  Lambda code
&lt;/h3&gt;

&lt;p&gt;The Lambda code is just 51 lines (and I put blank lines generously!).&lt;/p&gt;

&lt;p&gt;&lt;code&gt;captioning_lambda/main.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VisionEncoderDecoderModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ViTFeatureExtractor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dynamodb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dynamodb"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;captions_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamodb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"TABLE_NAME"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VisionEncoderDecoderModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./vit-gpt2-image-captioning"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;feature_extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ViTFeatureExtractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./vit-gpt2-image-captioning"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./vit-gpt2-image-captioning"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;

    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;caption&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generate_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;persist_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file_byte_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;"Body"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_byte_string&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;"RGB"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"RGB"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pixel_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;feature_extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"pt"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;pixel_values&lt;/span&gt;

    &lt;span class="n"&gt;output_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_beams&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch_decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persist_caption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;captions_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"caption"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;caption&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Firstly, in lines &lt;code&gt;8-11&lt;/code&gt;, we create &lt;code&gt;boto3&lt;/code&gt; clients to interact with S3 and DynamoDB. For DynamoDB, we need the table name, which we pass to the Lambda function as an environment variable.&lt;/p&gt;

&lt;p&gt;Then we load the previously downloaded ML models (lines &lt;code&gt;13-15&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We do both those things &lt;strong&gt;outside the handler method&lt;/strong&gt;. Therefore, this code will be executed only once, &lt;a href="https://betterdev.blog/aws-lambda-performance-optimization/#one-time_initialization"&gt;when the Lambda environment is created&lt;/a&gt;, not on every Lambda execution. This is critical, as loading the models takes quite a long time. We will look at it in more detail a bit later.&lt;/p&gt;

&lt;p&gt;Next comes the handler method, called on every Lambda invocation. The event we receive contains the details of the newly created S3 object, and we extract the bucket name and the object key from it.&lt;/p&gt;
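
&lt;p&gt;For reference, here is roughly what the relevant part of the S3 notification event looks like, written as a Python dict you could use to invoke the handler locally (the bucket name and object key are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Trimmed-down S3 "ObjectCreated" notification event - only the fields our handler reads
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-images-bucket"},  # hypothetical bucket name
                "object": {"key": "uploads/cat.jpg"},  # hypothetical object key
            }
        }
    ]
}

# Note: this calls real S3 and DynamoDB, so AWS credentials and TABLE_NAME must be set
handler(event, None)  # the context argument is unused, so None works for a local test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;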

&lt;p&gt;Then, we do three simple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the image from the S3 bucket.&lt;/li&gt;
&lt;li&gt;Use the previously loaded ML models to understand the image content and generate a caption.&lt;/li&gt;
&lt;li&gt;Save the caption in the DynamoDB table.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Overcoming Lambda size limitations
&lt;/h2&gt;

&lt;p&gt;If we package our code right now, with libraries and model, and upload it to Lambda with a Python environment, we will get an error. The package size limit is 250 MB. Our package is... around 3 GB.&lt;/p&gt;

&lt;p&gt;The 250 MB limit includes &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html"&gt;Lambda layers&lt;/a&gt;, so they are not a solution here.&lt;/p&gt;

&lt;p&gt;So what is the solution? &lt;strong&gt;Bundling it as a Docker image instead.&lt;/strong&gt; The Docker image size limit for Lambda is 10 GB.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda Docker image
&lt;/h3&gt;

&lt;p&gt;We will use a &lt;a href="https://docs.docker.com/build/building/multi-stage/"&gt;multi-stage build&lt;/a&gt; for the Docker image to omit build dependencies in our target image.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;captioning_lambda/Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;public.ecr.aws/docker/library/python:3.9.15-slim-bullseye&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /root&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    curl &lt;span class="se"&gt;\
&lt;/span&gt;    git &lt;span class="se"&gt;\
&lt;/span&gt;    git-lfs

&lt;span class="k"&gt;RUN &lt;/span&gt;curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://install.python-poetry.org | python3 -
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/root/.local/bin:$PATH"&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;git lfs &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 https://huggingface.co/nlpconnect/vit-gpt2-image-captioning
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; vit-gpt2-image-captioning/.git

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; pyproject.toml poetry.lock ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;poetry &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; requirements.txt &lt;span class="nt"&gt;--output&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;########################################&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; public.ecr.aws/lambda/python:3.9.2022.10.26.12&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /root/vit-gpt2-image-captioning ./vit-gpt2-image-captioning&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=build /root/requirements.txt ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip3 &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LAMBDA_RUNTIME_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; main.py ./&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["main.handler"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the first stage, &lt;code&gt;build&lt;/code&gt;, we install curl, Git, Git LFS, and Poetry.&lt;/p&gt;

&lt;p&gt;Then we download the &lt;code&gt;nlpconnect/vit-gpt2-image-captioning&lt;/code&gt; model from 🤗 Hugging Face, just as we previously did locally. Finally, we use Poetry to generate a &lt;code&gt;requirements.txt&lt;/code&gt; file with our production Python dependencies.&lt;/p&gt;

&lt;p&gt;Then we use the official Docker image for Lambda with Python. Firstly, we copy the ML model fetched in the build stage and the &lt;code&gt;requirements.txt&lt;/code&gt; file. Next, we install Python dependencies with pip and copy the sources of our Lambda function - the Python code we wrote above. Finally, we instruct that our Lambda handler is the handler method from the &lt;code&gt;main.py&lt;/code&gt; file.&lt;/p&gt;
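
&lt;p&gt;As a side note, the AWS base images ship with the Lambda Runtime Interface Emulator, so you can smoke-test the image locally before deploying. After building it and starting a container with &lt;code&gt;docker run -p 9000:8080 my-image&lt;/code&gt; (the image tag is up to you), you can post an event to the local endpoint. A rough sketch, assuming the &lt;code&gt;requests&lt;/code&gt; library is installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Invocation endpoint exposed by the Lambda Runtime Interface Emulator
URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

# Same hypothetical S3 event as before; the function still calls real
# S3 and DynamoDB, so the container needs AWS credentials and TABLE_NAME set
event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-images-bucket"}, "object": {"key": "uploads/cat.jpg"}}}
    ]
}

response = requests.post(URL, json=event)
print(response.status_code, response.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;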

&lt;h3&gt;
  
  
  Importance of Dockerfile commands order
&lt;/h3&gt;

&lt;p&gt;The order of operations in our &lt;code&gt;Dockerfile&lt;/code&gt; is essential. Each command creates &lt;a href="https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache"&gt;a cacheable layer&lt;/a&gt;. But if one layer is changed, all the next are rebuilt. That's why &lt;strong&gt;we want to have the largest and least frequently changed layers first and the ones changed more often last&lt;/strong&gt;. So in our image, we have, in order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ML model&lt;/li&gt;
&lt;li&gt;Python libraries&lt;/li&gt;
&lt;li&gt;Lambda code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we change the Lambda code, only that last layer is updated. That means no time-consuming operations, like fetching the ML model or the Python libraries, happen on every code update. Also, when making changes and deploying the Docker image, only that last, thin layer will be uploaded each time, not the full 3 GB image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Provisioning ML pipeline with CDK
&lt;/h2&gt;

&lt;p&gt;Now we need to provision AWS infrastructure. It's a simple CDK Stack with three constructs - the DynamoDB table, Lambda function, and S3 bucket.&lt;/p&gt;

&lt;p&gt;Setting up the CDK project from scratch is out of the scope of this tutorial, but you can find the complete source in the GitHub project repository at the end of the post. Here is the essential part - the MLStack that contains our resources.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cdk/ml_stack.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Duration&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk.aws_dynamodb&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BillingMode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AttributeType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk.aws_lambda&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerImageFunction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DockerImageCode&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk.aws_logs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetentionDays&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk.aws_s3&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;EventType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;aws_cdk.aws_s3_notifications&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LambdaDestination&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;constructs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MLStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;construct_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;captions_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CaptionsTable"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;billing_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;BillingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PAY_PER_REQUEST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;partition_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AttributeType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;captioning_lambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DockerImageFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"CaptioningLambda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DockerImageCode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_image_asset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./captioning_lambda/"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;memory_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;log_retention&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RetentionDays&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ONE_MONTH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"TABLE_NAME"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;captions_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;captions_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grant_write_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captioning_lambda&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;images_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ImagesBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;removal_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;auto_delete_objects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;images_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_event_notification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT_CREATED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LambdaDestination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captioning_lambda&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;images_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grant_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;captioning_lambda&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DynamoDB table definition is pretty simple. It's just a table with on-demand billing.&lt;/p&gt;

&lt;p&gt;The S3 bucket is not complicated either. We add an event notification rule to it to invoke the Lambda for every new object created in the bucket.&lt;/p&gt;

&lt;p&gt;We also add proper permissions to Lambda to access the DynamoDB table and S3 bucket.&lt;/p&gt;

&lt;p&gt;For the Lambda function, we use the &lt;code&gt;DockerImageFunction&lt;/code&gt; construct. We point to the &lt;code&gt;Dockerfile&lt;/code&gt; location for the source code, and the CDK will handle building the Docker image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adjusting Lambda memory and CPU for Machine Learning
&lt;/h3&gt;

&lt;p&gt;ML libraries require a lot of memory, partially because they need to load huge ML models. Here we set the maximum possible memory size for the Lambda - 10 GB. However, our model does not require this much - 5 GB would be enough.&lt;/p&gt;

&lt;p&gt;But the amount of allocated memory translates to the allocated CPU power. That's why adding more memory is the first step for &lt;a href="https://betterdev.blog/aws-lambda-performance-optimization/#increase_memory"&gt;optimizing Lambda functions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ML operations are compute-expensive. In Lambda, without GPU, everything is done on the CPU. &lt;strong&gt;Therefore, the more CPU power is available, the faster our function will work.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On the other hand, remember that increasing the memory allocation also increases Lambda invocation cost. So with heavy usage, it's worth finding the best balance between the execution speed and costs. I detailed how to do this in my &lt;a href="https://betterdev.blog/aws-lambda-performance-optimization/"&gt;Lambda performance optimization&lt;/a&gt; post.&lt;/p&gt;
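&lt;p&gt;For reference, the memory allocation is a single parameter of the function construct. A minimal sketch (the construct ID and Dockerfile directory here are illustrative, and other parameters are omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from aws_cdk.aws_lambda import DockerImageCode, DockerImageFunction

captioning_lambda = DockerImageFunction(
    self, "CaptioningLambda",  # illustrative ID
    code=DockerImageCode.from_image_asset("lambda"),  # directory with the Dockerfile
    memory_size=10240,  # 10 GB - the maximum; allocated CPU power scales with memory
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;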

&lt;h3&gt;
  
  
  Deploying the CDK stack
&lt;/h3&gt;

&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html#getting_started_install"&gt;CDK CLI installed&lt;/a&gt;, the deployment is as simple as running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run the deployment for the first time, the CDK will build and upload our Lambda function Docker image. It can take a couple of minutes, as it requires fetching gigabytes of dependencies from the internet and then uploading the image to AWS. But consecutive deployments, if we modify only the Lambda code, will be much faster thanks to the image layer caching described before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing serverless image captions generation
&lt;/h2&gt;

&lt;p&gt;After uploading several images to the S3 bucket and checking the DynamoDB table after a moment, we see the automatically generated captions:&lt;/p&gt;

&lt;p&gt;Quite good!&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimizing serverless ML pipeline latency
&lt;/h2&gt;

&lt;p&gt;Now, let's look at the latency.&lt;/p&gt;

&lt;p&gt;During the first Lambda function invocation, it goes through a &lt;a href="https://betterdev.blog/aws-lambda-performance-optimization/#cold_starts"&gt;cold start&lt;/a&gt;. First, AWS fetches the Docker image, provisions the environment, and executes everything we have outside the handler method. Only then is Lambda ready to handle the event.&lt;/p&gt;

&lt;p&gt;The same happens after every Lambda update or after it's inactive (not called) for some time.&lt;/p&gt;

&lt;p&gt;Here is a sample invocation with cold start traced with AWS X-Ray:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lXMzbNKe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97oxu8h8br6f8vh9t19y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lXMzbNKe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97oxu8h8br6f8vh9t19y.png" alt="Lambda invocation timeline with cold start" width="880" height="593"&gt;&lt;/a&gt;&lt;/p&gt;
Lambda invocation timeline with cold start



&lt;p&gt;Cold start - highlighted &lt;code&gt;Initialization&lt;/code&gt; segment - took 11.4s. From the additional logs, I know that 10s of it was loading the ML models (lines &lt;code&gt;13-15&lt;/code&gt; of the Lambda code).&lt;/p&gt;

&lt;p&gt;Then, the longest part of the invocation was generating the caption, which took almost 2s.&lt;/p&gt;

&lt;p&gt;On consecutive runs, there is no cold start (&lt;code&gt;Initialization&lt;/code&gt;) part, so the entire execution ends in less than 3s.&lt;/p&gt;

&lt;p&gt;Here are the measured times for other cold start invocations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;initialization total&lt;/th&gt;
&lt;th&gt;loading models&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;11.4s&lt;/td&gt;
&lt;td&gt;10.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;52.3s&lt;/td&gt;
&lt;td&gt;17.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16.0s&lt;/td&gt;
&lt;td&gt;14.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17.6s&lt;/td&gt;
&lt;td&gt;16.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2m27s&lt;/td&gt;
&lt;td&gt;2m15s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10.7s&lt;/td&gt;
&lt;td&gt;9.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11.3s&lt;/td&gt;
&lt;td&gt;10.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9.5s&lt;/td&gt;
&lt;td&gt;8.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, while the cold start was usually under 20 seconds, occasionally it took much longer - even more than 2 minutes. This variability is something I observe with large Docker images and CPU-intensive initializations, typical for ML workloads. It comes from AWS internals and is not something we can improve ourselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimizing cold starts
&lt;/h3&gt;

&lt;p&gt;Contrary to popular belief, cold starts are not so big of a problem. They happen rarely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on the initial invocation,&lt;/li&gt;
&lt;li&gt;when the invocation count increases and Lambda scales up to accommodate it,&lt;/li&gt;
&lt;li&gt;after the function was not invoked for some time and AWS freed allocated resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On production systems, cold starts can affect less than 0.1% of invocations.&lt;/p&gt;

&lt;p&gt;But if you do need to reduce them, what can you do?&lt;/p&gt;

&lt;h4&gt;
  
  
  Storing files on EFS
&lt;/h4&gt;

&lt;p&gt;One option I tried in the past is using &lt;a href="https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html"&gt;EFS&lt;/a&gt;. It's a file system that you can attach to the Lambda function. By putting large Python libraries and ML models there, you no longer deal with a large deployment package, so you can use the native Python Lambda runtime instead of the Docker image. A smaller bundle and the native Lambda runtime both reduce the cold start latency.&lt;/p&gt;
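&lt;p&gt;For context, attaching an EFS file system to a Lambda function in CDK looks roughly like this - a sketch only, not part of this project's stack; the VPC, construct IDs, and paths are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from aws_cdk import aws_ec2 as ec2, aws_efs as efs, aws_lambda as lambda_

# EFS requires a VPC, and so does the Lambda function that mounts it
vpc = ec2.Vpc(self, "Vpc")
file_system = efs.FileSystem(self, "ModelsFileSystem", vpc=vpc)
access_point = file_system.add_access_point(
    "ModelsAccessPoint",
    path="/models",
    create_acl=efs.Acl(owner_uid="1001", owner_gid="1001", permissions="750"),
    posix_user=efs.PosixUser(uid="1001", gid="1001"),
)

ml_lambda = lambda_.Function(
    self, "MlLambda",
    runtime=lambda_.Runtime.PYTHON_3_9,
    handler="index.handler",
    code=lambda_.Code.from_asset("lambda"),
    vpc=vpc,
    # Mount the file system with the models under /mnt/models
    filesystem=lambda_.FileSystem.from_efs_access_point(access_point, "/mnt/models"),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;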

&lt;p&gt;But in our case, most of the cold start time is not a result of fetching the large Docker image but of loading the ML model into memory. And EFS does not solve this part.&lt;/p&gt;

&lt;p&gt;Instead, EFS introduces several complications. First, you need an EC2 instance to put the files on EFS, as there is no easy way to upload files to EFS during the deployment. Additionally, you must put your Lambda in a VPC to attach the EFS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Altogether, the drawbacks of using EFS in this scenario heavily outweigh the benefits.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Initializing ahead of time with provisioned concurrency
&lt;/h4&gt;

&lt;p&gt;The more comprehensive solution for Lambda cold starts is to use &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html"&gt;provisioned concurrency&lt;/a&gt;. It's basically like asking AWS to keep the given number of function runtimes active for us.&lt;/p&gt;

&lt;p&gt;Provisioned function initialization happens during the deployment. As a result, after the deployment completes, our Lambda function is ready to handle events - without a cold start.&lt;/p&gt;

&lt;p&gt;However, provisioned concurrency incurs costs for the number of runtimes we require active, regardless of whether the Lambda is invoked. Also, keep in mind that if there are more events in the queue than already provisioned functions can handle at once, Lambda will create new environments - with cold starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provisioned concurrency is a good solution for client-facing ML Lambdas, where we cannot allow a 20-second cold start.&lt;/strong&gt; Nonetheless, for asynchronous processes, like in our case, it's most often a waste of money.&lt;/p&gt;

&lt;p&gt;To set up provisioned concurrency, you create a Lambda alias and specify the number of provisioned concurrent executions it should keep ready:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;captioning_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"live"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provisioned_concurrent_executions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reducing concurrent cold starts
&lt;/h3&gt;

&lt;p&gt;If we upload five images to the S3 bucket at once, then, due to the long cold start, we will see five separate Lambda runtimes created, each handling one invocation - assuming, of course, that no "hot" runtimes of our Lambda function already exist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7M1zFviW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fii9s8gg3wzlzyogst8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7M1zFviW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fii9s8gg3wzlzyogst8.png" alt="Lambda runtimes and cold starts" width="461" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That may or may not be what we want. But with a cold start so disproportionately long compared to the invocation itself, having just a single runtime provisioned and handling all the invocations would not take much longer. And &lt;a href="https://bitesizedserverless.com/bite/when-is-the-lambda-init-phase-free-and-when-is-it-billed/"&gt;we pay for the initialization time of the Docker-based runtimes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So sometimes, if we know a limited number of runtimes is enough and our workloads come in batches (like uploading multiple images to our bucket simultaneously), it may be wise to limit the number of runtimes Lambda can create. We do this by setting the &lt;a href="https://docs.aws.amazon.com/lambda/latest/operatorguide/reserved-concurrency.html"&gt;reserved concurrency&lt;/a&gt; for Lambda. In CDK, it's the &lt;code&gt;reserved_concurrent_executions&lt;/code&gt; property of the Lambda function construct.&lt;/p&gt;

&lt;p&gt;For example, with reserved concurrency set to 1, Lambda will create only a single runtime environment. All events will be queued on the Lambda internal queue and executed one by one.&lt;/p&gt;
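&lt;p&gt;In CDK, setting it could look like this (a sketch with an illustrative construct ID; other parameters are omitted):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;captioning_lambda = DockerImageFunction(
    self, "CaptioningLambda",
    code=DockerImageCode.from_image_asset("lambda"),
    reserved_concurrent_executions=1,  # at most one concurrent runtime environment
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;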

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kKPRQdWg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lx9d4qduwyhkgms5gpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kKPRQdWg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0lx9d4qduwyhkgms5gpc.png" alt="Lambda runtime with multiple invocations" width="461" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, we need to ensure that the workload won't be bigger than the throughput of our Lambda. If we keep uploading more images than the set number of runtimes can process, the queue will continue to grow. Eventually, &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html"&gt;events that won't be processed in 6 hours (by default) will get dropped&lt;/a&gt;.&lt;/p&gt;
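&lt;p&gt;If silently dropping events is not acceptable, one option - sketched here as an addition, not part of the original stack - is to configure an on-failure destination, so events that expire or exhaust their retries land in an SQS queue for inspection or replay:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from aws_cdk import aws_sqs as sqs
from aws_cdk.aws_lambda_destinations import SqsDestination

failed_events_queue = sqs.Queue(self, "FailedCaptioningEvents")

# Send events that fail all retries or exceed the maximum event age to SQS
captioning_lambda.configure_async_invoke(
    retry_attempts=2,
    on_failure=SqsDestination(failed_events_queue),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;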

&lt;h2&gt;
  
  
  Summary and lessons learned
&lt;/h2&gt;

&lt;p&gt;Well, that is quite a long post. But I wanted it to reflect real-life objectives and considerations, so I went into detail. I hope it will be helpful for you.&lt;/p&gt;

&lt;p&gt;Let's recap.&lt;/p&gt;

&lt;p&gt;Nowadays, &lt;strong&gt;we can fetch pre-trained Machine Learning models for various cases from the internet and use them in a few lines of code&lt;/strong&gt;. That's great. However, if you have more ML experience, using smaller, more specialized libraries instead of high-level ones could benefit both initial deployment times and cold starts. That said, the libraries' size is not the biggest problem, so this is optional.&lt;/p&gt;

&lt;p&gt;No matter what Python libraries we choose, ML models are usually at least 1 GB in size anyway. That means &lt;strong&gt;we need to use a Docker image instead of native Lambda runtimes&lt;/strong&gt;. But that's okay. If we &lt;strong&gt;order commands in the Dockerfile correctly&lt;/strong&gt;, only the first image build and upload will take a long time. Consecutive ones will take seconds, as we will only update the last image layer containing our code.&lt;/p&gt;

&lt;p&gt;The ML Lambda functions require &lt;strong&gt;a generous amount of memory assigned&lt;/strong&gt; for two reasons. Firstly, ML libraries load models into memory, so too little memory will result in out-of-memory errors. Secondly, Lambda scales the CPU with the memory, and ML operations are CPU-intensive. Therefore, the more CPU power there is, the lower the latency of our function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initializing ML libraries with models increases the cold start&lt;/strong&gt;, which can take a significant amount of time. &lt;strong&gt;If the ML pipeline is an asynchronous process, that's probably not an issue.&lt;/strong&gt; Cold starts happen only from time to time. But &lt;strong&gt;it may be worth paying extra for Lambda provisioned concurrency in client-facing ML Lambdas&lt;/strong&gt;, where we cannot allow such long cold starts.&lt;/p&gt;

&lt;p&gt;And finally, due to the long cold starts, &lt;strong&gt;limiting the number of Lambda runtimes created with the reserved concurrency setting may be a good idea if our workloads come in batches&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With both &lt;strong&gt;provisioned concurrency&lt;/strong&gt; and &lt;strong&gt;reserved concurrency&lt;/strong&gt;, we should pay extra attention to proper monitoring.&lt;/p&gt;

&lt;p&gt;You can find the complete source for this project on GitHub: &lt;a href="https://betterdev.blog/serverless-ml-on-aws-lambda/"&gt;aws-lambda-ml&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Least deployment privilege with CDK Bootstrap</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Mon, 26 Sep 2022 13:00:34 +0000</pubDate>
      <link>https://forem.com/aws-builders/least-deployment-privilege-with-cdk-bootstrap-3ha5</link>
      <guid>https://forem.com/aws-builders/least-deployment-privilege-with-cdk-bootstrap-3ha5</guid>
      <description>&lt;p&gt;Security is not convenient. That’s probably why the CDK, by default, uses &lt;code&gt;AdministratorAccess&lt;/code&gt; Policy to deploy resources. But we can easily change it and increase the security of our AWS account, following the least privilege principle with a minimal additional burden.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dangers of default CDK Bootstrap
&lt;/h2&gt;

&lt;p&gt;To start using the CDK, we must bootstrap our AWS account. Bootstrapping creates the resources required by the CDK on the account.&lt;/p&gt;

&lt;p&gt;If we follow &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html" rel="noopener noreferrer"&gt;the official docs&lt;/a&gt; for getting started with CDK, the process is as simple as it can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; aws-cdk

&lt;span class="nv"&gt;$ &lt;/span&gt;cdk bootstrap aws://372507991746/eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;

 ⏳  Bootstrapping environment aws://372507991746/eu-west-1...
Trusted accounts &lt;span class="k"&gt;for &lt;/span&gt;deployment: &lt;span class="o"&gt;(&lt;/span&gt;none&lt;span class="o"&gt;)&lt;/span&gt;
Trusted accounts &lt;span class="k"&gt;for &lt;/span&gt;lookup: &lt;span class="o"&gt;(&lt;/span&gt;none&lt;span class="o"&gt;)&lt;/span&gt;
Using default execution policy of &lt;span class="s1"&gt;'arn:aws:iam::aws:policy/AdministratorAccess'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
Pass &lt;span class="s1"&gt;'--cloudformation-execution-policies'&lt;/span&gt; to customize.
CDKToolkit: creating CloudFormation changeset...
 ✅  Environment aws://372507991746/eu-west-1 bootstrapped.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;cdk bootstrap&lt;/code&gt; command creates a CloudFormation Stack named &lt;code&gt;CDKToolkit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This stack contains 5 IAM Roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CloudFormationExecutionRole&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DeploymentActionRole&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LookupRole&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FilePublishingRole&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ImagePublishingRole&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What are they used for?&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Why is leaving them as they are against the least privilege principle?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
And how can we fix this?&lt;/p&gt;

&lt;p&gt;Read on.&lt;/p&gt;
&lt;h2&gt;
  
  
  IAM Roles created by CDK Bootstrap
&lt;/h2&gt;
&lt;h3&gt;
  
  
  CloudFormationExecutionRole
&lt;/h3&gt;

&lt;p&gt;This is the Role that CloudFormation will assume to deploy our Stacks. CloudFormation will use this Role both when we deploy from our local machine with the &lt;code&gt;cdk deploy&lt;/code&gt; command and through &lt;a href="https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines-readme.html" rel="noopener noreferrer"&gt;CDK Pipelines&lt;/a&gt; for CI/CD.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CloudFormationExecutionRole&lt;/code&gt; must have permissions to list, create, modify and delete all the resources we use in our Stacks. For example, if our Stack contains a Lambda function, CloudFormation must have permission to create it.&lt;/p&gt;

&lt;p&gt;To allow creating any kind of resource with CDK, this Role has the &lt;code&gt;arn:aws:iam::aws:policy/AdministratorAccess&lt;/code&gt; Policy assigned by default. That’s right – &lt;strong&gt;it gives full access to our account, allowing it to do anything&lt;/strong&gt;. That’s very much against the least privilege principle.&lt;/p&gt;
&lt;h4&gt;
  
  
  Dangers of the &lt;code&gt;AdministratorAccess&lt;/code&gt; Policy
&lt;/h4&gt;

&lt;p&gt;Why is it bad? Don’t we want the CDK to be able to create any resources we need in our Stacks?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We want the CDK to be able to deploy only the resources we use.&lt;/strong&gt; So, for example, if we build an app utilizing just a few serverless services, like Lambda, API Gateway, and DynamoDB, we don’t want the CDK to be able to spin up EC2 machines.&lt;/p&gt;

&lt;p&gt;Suppose our computer or the code repository with automatic deployment through the CI pipeline gets compromised. In that case, the attacker can use the CDK to deploy a CloudFormation stack with a bunch of EC2 machines mining bitcoins.&lt;/p&gt;

&lt;p&gt;Security, like onions and ogres, has layers. &lt;strong&gt;Each layer should prevent the attacker from achieving their goal.&lt;/strong&gt; The fact that we have a password on our computer and that the code repository is private doesn’t justify leaving the next set of doors behind them wide open.&lt;/p&gt;

&lt;p&gt;Thankfully, we can improve it. Looking again at the output of the &lt;code&gt;cdk bootstrap&lt;/code&gt; command, we can notice this message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Using default execution policy of &lt;span class="s1"&gt;'arn:aws:iam::aws:policy/AdministratorAccess'&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
Pass &lt;span class="s1"&gt;'--cloudformation-execution-policies'&lt;/span&gt; to customize.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stay tuned; we will fix it in a moment. But first, let’s make sure we know what the other IAM Roles created by the CDK do.&lt;/p&gt;

&lt;h3&gt;
  
  
  DeploymentActionRole
&lt;/h3&gt;

&lt;p&gt;The CDK CLI and CDK Pipelines assume this Role to create and manage CloudFormation Stacks and the files in a CDK assets S3 Bucket.&lt;/p&gt;

&lt;p&gt;It also allows passing the &lt;code&gt;CloudFormationExecutionRole&lt;/code&gt; to CloudFormation. Then CloudFormation can use it to create, update and delete resources.&lt;/p&gt;

&lt;p&gt;Moreover, the &lt;code&gt;DeploymentActionRole&lt;/code&gt; allows accessing and managing objects in S3 Buckets in other accounts, which is needed for cross-account deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  LookupRole
&lt;/h3&gt;

&lt;p&gt;CDK CLI uses the &lt;code&gt;LookupRole&lt;/code&gt; when it needs to get information about the already existing resources that we want to use in our CDK app. Those resources include Route53 Hosted Zones, VPCs, SSM Parameters, and a few others.&lt;/p&gt;

&lt;p&gt;The bad part is that the &lt;code&gt;LookupRole&lt;/code&gt; uses a &lt;code&gt;ReadOnlyAccess&lt;/code&gt; IAM Policy, which gives it access to read everything, not only the resources the CDK can do a lookup for.&lt;/p&gt;

&lt;p&gt;On the bright side, it’s just read-only access, and &lt;code&gt;kms:Decrypt&lt;/code&gt; is explicitly excluded from it through the second Policy attached to the &lt;code&gt;LookupRole&lt;/code&gt;, so it can’t be used to read encrypted data and secrets.&lt;/p&gt;

&lt;h3&gt;
  
  
  FilePublishingRole and ImagePublishingRole
&lt;/h3&gt;

&lt;p&gt;Those two Roles allow the CDK to upload and manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assets (like Lambda function sources) in the CDK assets bucket,&lt;/li&gt;
&lt;li&gt;container images in the CDK ECR repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those assets and images are built from our application code, uploaded by the CDK, and then referenced in the CloudFormation Stacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limiting the CDK Execution Policy access
&lt;/h2&gt;

&lt;p&gt;After reviewing the IAM Roles created by the CDK bootstrap process, we can see the most problematic is the &lt;code&gt;CloudFormationExecutionRole&lt;/code&gt;. It gives CDK full access to our AWS account, while it should only allow deploying and managing the types of resources we use in our app.&lt;/p&gt;

&lt;p&gt;Let’s fix this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating own CloudFormation Execution Policy
&lt;/h3&gt;

&lt;p&gt;We start by creating our own IAM Policy. It should allow access to only the AWS services that we use in our CDK application. On the other hand, it needs broad access within those selected services, so we will simply give full access to them with an asterisk wildcard (&lt;code&gt;*&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Additionally, we will limit access to only the region where we operate. In this example, it will be &lt;code&gt;eu-west-1&lt;/code&gt;. Some services, like CloudFront, are global, so we list them separately with no region restriction.&lt;/p&gt;

&lt;p&gt;And finally, permissions for IAM actions. As a service managing access to other AWS components, IAM is critical to security. At the same time, it has over 200 actions, so we select only the ones required for our Stacks to work. We also exclude access to the Roles generated by the CDK and to the Policy itself for additional protection.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cdkCFExecutionPolicy.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"apigateway:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"cloudwatch:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"lambda:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"logs:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"ssm:*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"aws:RequestedRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"cloudfront:*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:*Role*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:GetPolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:CreatePolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:DeletePolicy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"iam:*PolicyVersion*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"NotResource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::*:role/cdk-*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::*:policy/cdkCFExecutionPolicy"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy document should be committed to the project repository, as it will evolve with time.&lt;/p&gt;

&lt;p&gt;Having the JSON file, we need to create the IAM Policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-name&lt;/span&gt; cdkCFExecutionPolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-document&lt;/span&gt; file://cdkCFExecutionPolicy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bootstrapping CDK with custom Execution Policy
&lt;/h3&gt;

&lt;p&gt;Now we need to bootstrap the CDK, providing the created IAM Policy to be used instead of the default &lt;code&gt;AdministratorAccess&lt;/code&gt; one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"Account"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
cdk bootstrap aws://&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;/eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cloudformation-execution-policies&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;&lt;span class="s2"&gt;:policy/cdkCFExecutionPolicy"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And… that’s it. &lt;strong&gt;That’s how easy it is to apply the least privilege principle to CDK deployments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we have the CDK already bootstrapped on our account, simply rerunning &lt;code&gt;cdk bootstrap&lt;/code&gt;, this time with our custom execution policy, will update it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating the Policy
&lt;/h3&gt;

&lt;p&gt;With time, we add more services to our application. This requires us to extend the &lt;code&gt;cdkCFExecutionPolicy&lt;/code&gt; with access to additional services.&lt;/p&gt;

&lt;p&gt;To do this, we first modify the definition in the &lt;code&gt;cdkCFExecutionPolicy.json&lt;/code&gt;. Then we create a new Policy version and set it as the default one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ACCOUNT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sts get-caller-identity &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"Account"&lt;/span&gt; &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;
aws iam create-policy-version &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;:policy/cdkCFExecutionPolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-document&lt;/span&gt; file://cdkCFExecutionPolicy.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set-as-default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From now on, the CDK will be using the updated Policy.&lt;/p&gt;

&lt;p&gt;There is a limit of 5 Policy versions, so at some point we need to delete old versions to make further updates. But it’s not difficult. We simply list the existing versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam list-policy-versions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;:policy/cdkCFExecutionPolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then delete the selected old version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam delete-policy-version &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;:policy/cdkCFExecutionPolicy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version-id&lt;/span&gt; &amp;lt;VERSION&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multiple projects
&lt;/h2&gt;

&lt;p&gt;Best practices recommend having only a single project per AWS account. But if we really need to deploy a second CDK project to the same account, here is how to bootstrap it with its own execution policy.&lt;/p&gt;

&lt;p&gt;The first step is to create and deploy the new IAM Policy for the second project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws iam create-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-name&lt;/span&gt; cdkCFExecutionPolicy2ndProject &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-document&lt;/span&gt; file://cdkCFExecutionPolicy.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we bootstrap the CDK on the same account, creating a separate set of IAM Roles by adding two flags to the &lt;code&gt;cdk bootstrap&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cdk bootstrap aws://&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;/eu-west-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cloudformation-execution-policies&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::&lt;/span&gt;&lt;span class="nv"&gt;$ACCOUNT_ID&lt;/span&gt;&lt;span class="s2"&gt;:policy/cdkCFExecutionPolicy2ndProject"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--toolkit-stack-name&lt;/span&gt; CDKToolkitMySecondProject &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--qualifier&lt;/span&gt; 2ndProject
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first one, &lt;code&gt;--toolkit-stack-name&lt;/code&gt;, ensures that a separate CDK stack with its own resources will be created. The default Stack name is &lt;code&gt;CDKToolkit&lt;/code&gt;, so we provide a distinct one.&lt;/p&gt;

&lt;p&gt;The second parameter, &lt;code&gt;--qualifier&lt;/code&gt;, is a short string added to many resource names created by the CDK to avoid name collisions. It must be unique for every project.&lt;/p&gt;

&lt;p&gt;And lastly, for the second project to actually use these newly bootstrapped CDK Roles, we need to add the same qualifier to the project’s &lt;code&gt;cdk.json&lt;/code&gt; configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"app"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"context"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@aws-cdk/core:bootstrapQualifier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2ndProject"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;By default, CDK uses the &lt;code&gt;AdministratorAccess&lt;/code&gt; IAM Policy to deploy CloudFormation Stacks. That’s far from the “least privilege” principle.&lt;/p&gt;

&lt;p&gt;Thankfully, we can quickly improve it for better security. First, we create a custom IAM Policy with access to only the services we use in our application. Then we (re)bootstrap the CDK, providing our Policy ARN as an &lt;code&gt;--cloudformation-execution-policies&lt;/code&gt; argument.&lt;/p&gt;

&lt;p&gt;Over time, if we need to grant the CDK access to more services, we just update the IAM Policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Afterthoughts on the convenience
&lt;/h3&gt;

&lt;p&gt;They say security is not convenient. In fact, I believe this is why the CDK uses the &lt;code&gt;AdministratorAccess&lt;/code&gt; Policy by default – it allows using the CDK right away, with just one simple &lt;code&gt;cdk bootstrap&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;It’s good the &lt;code&gt;cdk bootstrap&lt;/code&gt; output warns about using the &lt;code&gt;AdministratorAccess&lt;/code&gt;, but sadly, I suspect it’s ignored in most cases.&lt;/p&gt;

&lt;p&gt;Luckily, creating a custom Policy and maintaining it is straightforward, so the problem can be fixed quickly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
    </item>
    <item>
      <title>The AWS CDK, Or Why I Stopped Being a CDK Skeptic</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Thu, 30 Jun 2022 12:54:07 +0000</pubDate>
      <link>https://forem.com/mradzikowski/the-aws-cdk-or-why-i-stopped-being-a-cdk-skeptic-4o2o</link>
      <guid>https://forem.com/mradzikowski/the-aws-cdk-or-why-i-stopped-being-a-cdk-skeptic-4o2o</guid>
      <description>&lt;p&gt;Until recently, I was skeptical about the AWS CDK. I believe in Infrastructure as Code (IaC), but with the "code" being YAML. But after using CDK in real projects, the amount of heavy lifting it does and the vast reduction of a boilerplate code changed my view.&lt;/p&gt;

&lt;p&gt;I'm a long-time user and fan of the &lt;a href="https://www.serverless.com/"&gt;Serverless Framework&lt;/a&gt;, and it was my go-to tool for the IaC on AWS. It provides an abstraction layer on top of the CloudFormation, the AWS infrastructure provisioning service. I thought that with the Serverless Framework, building serverless projects is as straightforward as it can be.&lt;/p&gt;

&lt;p&gt;Then, in July 2019, AWS released the CDK – Cloud Development Kit. Like Serverless Framework, it also uses CloudFormation under the hood. But contrary to the SF, in which the primary way to declare infrastructure is YAML, in CDK, you write code in one of the supported programming languages.&lt;/p&gt;

&lt;p&gt;When others started adopting CDK, I didn't jump on the hype train right away but kept looking from a distance. This changed in the past months.&lt;/p&gt;

&lt;h2&gt;
  
  
  My objection towards CDK
&lt;/h2&gt;

&lt;p&gt;My biggest objection was the core trait of the CDK – declaring infrastructure in a programming language.&lt;/p&gt;

&lt;p&gt;While YAML &lt;a href="https://www.arp242.net/yaml-config.html"&gt;has its problems&lt;/a&gt;, I see it as an elegant and readable solution to define the configuration. And that includes infrastructure configuration.&lt;/p&gt;

&lt;p&gt;On the other hand, it's much easier to make a mess using a programming language, and that's what I was afraid of. When you can use loops, if conditions, and any advanced language features, you can define the infrastructure cleanly and concisely just as easily as you can turn it into spaghetti code. And the infrastructure is the last place where I want to investigate step-by-step what the hell is happening through the code flow.&lt;/p&gt;

&lt;p&gt;To sum up – I believe(d) that it's much less likely to make the infrastructure definition unreadable with YAML than using a programming language.&lt;/p&gt;

&lt;h2&gt;
  
  
  The need for CDK
&lt;/h2&gt;

&lt;p&gt;Then I started working on a new project that required the resources to be created based on the configuration files. During the deployment, multiple instances of Lambda functions and other resources had to be generated with different settings depending on the provided configuration.&lt;/p&gt;

&lt;p&gt;Generating resources dynamically during deployment requires some higher-level logic. While you can use JavaScript/TypeScript instead of YAML to define resources in the Serverless Framework, the CDK, with code as the native way to declare infrastructure, seemed like an obvious choice.&lt;/p&gt;

&lt;p&gt;If you already used CloudFormation or any IaC tool based on it (like Serverless Framework or SAM), the learning curve of CDK is flat. In a few weeks, I scaffolded the new project, finding out how much good there was in the CDK. Now, two projects later, the CDK is the default IaC framework for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros of CDK
&lt;/h2&gt;

&lt;p&gt;The two most obvious pros of CDK are that you can utilize all the power of a programming language to define the infrastructure and that it uses CloudFormation under the hood.&lt;/p&gt;

&lt;p&gt;With languages supported by the CDK – TypeScript, JavaScript, Python, Java, C#, and Go – you write the code in a familiar way. No more esoteric CloudFormation logic instructions in JSON or YAML. And, with the code completion in your IDE, no more checking the documentation for the exact name of every single parameter.&lt;/p&gt;

&lt;p&gt;Equally important, CDK "synthesizes" the code into standard CloudFormation stacks and deploys them. By using CloudFormation, the stable and battle-tested (although slow) IaC system, to perform the actual infrastructure management, CDK can focus on providing the best possible developer experience on top of it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JJeaOB3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hy4m981knco8wc2e79rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JJeaOB3X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hy4m981knco8wc2e79rb.png" alt="CDK synthetizes application to a CloudFormation Template which is then deployed as a CloudFormation Stack" width="841" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what really convinced me were things I found only after I started working with the CDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reusable custom Constructs
&lt;/h3&gt;

&lt;p&gt;First and foremost, Constructs reusability.&lt;/p&gt;

&lt;p&gt;Constructs are building blocks of the CDK. Like LEGO bricks, you can compose them to make larger, specialized Constructs. Then you can use those Constructs multiple times across the application to create many similar resources without repeating the configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--91MAT0te--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3s6okrkge6w9l0shf9vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--91MAT0te--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3s6okrkge6w9l0shf9vq.png" alt="Constructs can be reused across the application" width="880" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is really powerful. The first thing I did was to create a custom Construct for a Node.js Lambda function, which included basic CloudWatch Alarms. This way, every function in my application is always properly monitored. Overriding the parameters of this custom Lambda Construct, I can customize alarm thresholds and function configuration where needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-level Constructs
&lt;/h3&gt;

&lt;p&gt;CDK comes with 3 "levels" of Constructs.&lt;/p&gt;

&lt;p&gt;Level 1, or L1, are low-level Constructs that correspond directly to the CloudFormation resources. Their names are prefixed with &lt;code&gt;Cfn&lt;/code&gt;, like &lt;code&gt;CfnBucket&lt;/code&gt; or &lt;code&gt;CfnFunction&lt;/code&gt;. With them, you work with exactly the same structures as in raw CloudFormation, with the same parameters and behavior. Nothing more, nothing less.&lt;/p&gt;

&lt;p&gt;L2 Constructs, on the other hand, are smarter and provide a higher-level API. They come with sensible defaults and reduce the required boilerplate code to the minimum. They also provide helper functions, for example, to set up IAM permissions.&lt;/p&gt;

&lt;p&gt;L2 Constructs are the core of the CDK. Using them, you apply many best practices for setting up individual resources, like protecting access to the S3 Buckets. And since their API is an abstraction over the CloudFormation properties, it's a lot more concise and readable.&lt;/p&gt;

&lt;p&gt;This snippet creates a secured S3 Bucket and a Lambda function with an IAM Role granting read access to it, which translates to about 80 lines of L1 Constructs code or raw CloudFormation YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;myBucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyBucket&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;encryption&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BucketEncryption&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;S3_MANAGED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;blockPublicAccess&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;BlockPublicAccess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BLOCK_ALL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;objectOwnership&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ObjectOwnership&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BUCKET_OWNER_ENFORCED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;myLambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;NodejsFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyLambda&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODEJS_16_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;myBucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grantRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;myLambda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;L3 Constructs, also called patterns, are even more abstract building blocks that include multiple resources. For example, with a single L3 Construct, you can create a Fargate service running on an ECS cluster behind an Application Load Balancer. Don't make me count how many CloudFormation resources you need to define to set it up yourself…&lt;/p&gt;

&lt;h3&gt;
  
  
  Sensible defaults
&lt;/h3&gt;

&lt;p&gt;I've mentioned it above, but let me emphasize this. L2 and L3 Constructs include sensible defaults that reduce the boilerplate and make the infrastructure safer, more robust, and closer to the best practices.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 Buckets and DynamoDB tables having the "Retain" Deletion Policy enabled by default to not lose production data in case of accidental stack removal,&lt;/li&gt;
&lt;li&gt;Node.js Lambda functions &lt;a href="https://betterdev.blog/aws-lambda-performance-optimization/#keep_http_connections_alive"&gt;enabling connection reuse to optimize AWS SDK v2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Utility functions
&lt;/h3&gt;

&lt;p&gt;Utility functions are a great help in L2 and L3 Constructs.&lt;/p&gt;

&lt;p&gt;The ones you will most often come across are the IAM permissions helpers. Instead of laboriously defining IAM policies, you can glue access between Constructs with &lt;code&gt;grant*()&lt;/code&gt; functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;myLambda&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;NodejsFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;MyLambda&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;__dirname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;index.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;NODEJS_16_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;myS3Bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grantRead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;myLambda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;myDynamoDBTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grantReadWriteData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;myLambda&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But IAM helpers are not all. A few other examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Function.metricError()&lt;/code&gt; to get a CloudWatch Metric for the Lambda function error count that you can use to set up a CloudWatch Alarm,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MappingTemplate.dynamoDbGetItem()&lt;/code&gt; to create an AppSync resolver mapping template to get an item from a DynamoDB Table,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Arn.format()&lt;/code&gt; to build an ARN from the parts like the region, service name, and the resource name,&lt;/li&gt;
&lt;li&gt;and many more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Logic above CloudFormation
&lt;/h3&gt;

&lt;p&gt;CloudFormation, used under the hood by the CDK, deals only with the infrastructure and is very strict about it. It takes the current state of the resources and modifies them to achieve the expected state.&lt;/p&gt;

&lt;p&gt;But deployment often includes more than just creating cloud resources. For example, you need to upload your application code, sometimes upload some assets to S3 buckets, or create an SSL certificate for the domain in the &lt;code&gt;us-east-1&lt;/code&gt; region for the CloudFront. Those actions are outside of the CloudFormation scope.&lt;/p&gt;

&lt;p&gt;Thankfully, they are not outside of the CDK scope. For example, the Lambda &lt;code&gt;Function&lt;/code&gt; Construct will bundle and upload your function code. The S3 &lt;code&gt;BucketDeployment&lt;/code&gt; Construct will put your website files in a bucket. And &lt;code&gt;DnsValidatedCertificate&lt;/code&gt; Construct will create an SSL certificate in any region you need.&lt;/p&gt;

&lt;p&gt;Another neat built-in feature is the &lt;code&gt;autoDeleteObjects&lt;/code&gt; parameter of the S3 &lt;code&gt;Bucket&lt;/code&gt; Construct. When set to &lt;code&gt;true&lt;/code&gt;, it will empty the bucket on the stack removal, letting the bucket be deleted. This is perfect for website hosting and short-living feature branch environments, where buckets do not contain valuable data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Advantages over Serverless Framework plugins
&lt;/h4&gt;

&lt;p&gt;You can achieve all of the above in the Serverless Framework with plugins. But the CDK has two advantages here.&lt;/p&gt;

&lt;p&gt;Firstly, it's all built-in. No additional dependencies to install and no problems with unmaintained and outdated plugins.&lt;/p&gt;

&lt;p&gt;Secondly, everything is handled server-side (cloud-side?) by custom resources triggered by CloudFormation lifecycle hooks. You run the stack deployment and don't need to worry about keeping an uninterrupted internet connection during the SSL certificate creation and verification. Even better, deleting the stack from the AWS Console will also trigger the bucket content removal - not only deleting it with the CDK CLI.&lt;/p&gt;

&lt;p&gt;In contrast, Serverless Framework plugins usually perform such actions through the AWS SDK calls from your local machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multiple stacks support
&lt;/h3&gt;

&lt;p&gt;Most of the projects I work on consist of &lt;a href="https://articles.alfa1.com/serverless-project-structure-beyond-monorepo"&gt;multiple CloudFormation stacks&lt;/a&gt;. In CDK, you create an &lt;code&gt;App&lt;/code&gt;, which may include numerous &lt;code&gt;Stacks&lt;/code&gt;. Then, you can define dependencies between the stacks, and the CDK will take care of deploying them in the correct order.&lt;/p&gt;

&lt;p&gt;The fact that multiple stacks are managed as a single application enables applying settings and making changes to the whole project in one place, without duplication. This includes setting tags or applying &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/aspects.html"&gt;Aspects&lt;/a&gt; to all the resources in all the stacks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Set RemovalPolicy DESTROY on LogGroups
 * so they are removed on the stack removal, not kept indefinitely.
 */&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;LogGroupRemovalPolicyAspect&lt;/span&gt; &lt;span class="k"&gt;implements&lt;/span&gt; &lt;span class="nx"&gt;IAspect&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;visit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IConstruct&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;CfnLogGroup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;applyRemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;DESTROY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;App&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;Tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;projectName&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mySecretProject&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;Aspects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;LogGroupRemovalPolicyAspect&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Built-in CI
&lt;/h3&gt;

&lt;p&gt;With the CDK, you can create a Continuous Integration pipeline for the project. The pipeline, built on AWS CodePipeline, is quite clever. You deploy it once, and if you commit any changes to it, it will mutate itself before deploying your stacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Construct Hub
&lt;/h3&gt;

&lt;p&gt;Constructs' reusability makes them perfect for sharing. &lt;a href="https://constructs.dev/"&gt;Construct Hub&lt;/a&gt; is a catalog of open-source Constructs built by AWS, AWS partners, and the community.&lt;/p&gt;

&lt;p&gt;Among a variety of L3 Constructs, these three caught my eye:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://constructs.dev/packages/cdk-iam-floyd/"&gt;cdk-iam-floyd&lt;/a&gt; – IAM Policy generator with a fluent interface&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://constructs.dev/packages/cdk-monitoring-constructs/"&gt;cdk-monitoring-constructs&lt;/a&gt; – CloudWatch Dashboard and Alarms for various AWS services&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://constructs.dev/packages/cdk-spa-deploy/"&gt;CDK-SPA-Deploy&lt;/a&gt; – all you need to have a SPA website running&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cons of CDK
&lt;/h2&gt;

&lt;p&gt;Obviously, I had to find some drawbacks. Otherwise, I would be even angrier that I convinced myself to try the CDK only now.&lt;/p&gt;

&lt;h3&gt;
  
  
One-at-a-time stack deployments
&lt;/h3&gt;

&lt;p&gt;CDK runs on top of CloudFormation, which is famously slow. In a multi-stack application, you'd want to deploy independent stacks in parallel to reduce the overall deployment time.&lt;/p&gt;

&lt;p&gt;Unfortunately, at the moment, this is not possible with the CDK when deploying from your local machine. So you can take a (long) break every time you deploy the whole application. But this will hopefully be resolved soon with &lt;a href="https://github.com/aws/aws-cdk/pull/20345"&gt;a &lt;code&gt;--concurrency&lt;/code&gt; option&lt;/a&gt;.&lt;/p&gt;
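
&lt;p&gt;Once that option lands, a parallel deployment could look like this (a sketch based on the flag name from the linked pull request, so it may change before release):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# deploy all stacks, running up to 4 CloudFormation deployments in parallel where dependencies allow
cdk deploy --all --concurrency 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
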

&lt;h3&gt;
  
  
  CodePipeline for the CI
&lt;/h3&gt;

&lt;p&gt;The built-in CI, while clever, uses AWS CodePipeline. And if you have worked with CodePipeline, you know it's not the best CI out there.&lt;/p&gt;

&lt;p&gt;The biggest issue I encountered is that you can't retry individual failed stage deployments. The CDK splits stack deployments into creating CloudFormation Change Sets and executing them. If executing a Change Set fails due to a conflict that you then fix, it's impossible to retry the given stage deployment without re-running the whole pipeline. This is because, when retrying an individual stage deployment, CodePipeline runs only the failed actions (the Change Set execution) and does not re-create the Change Sets first.&lt;/p&gt;

&lt;p&gt;Also, if you want to notify your GitHub repository of the pipeline execution result, you need to &lt;a href="https://aws.amazon.com/blogs/devops/aws-codepipeline-build-status-in-a-third-party-git-repository/"&gt;implement the webhook call yourself&lt;/a&gt;. This is yet another example of how great the AWS CodePipeline is.&lt;/p&gt;

&lt;p&gt;Of course, you don't have to use built-in CDK pipelines. Instead, you can script your own deployment on any CI/CD platform you want.&lt;/p&gt;
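
&lt;p&gt;As a minimal sketch, assuming a Node.js CDK project, such a scripted deployment step could be as simple as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# install dependencies and deploy all stacks without interactive approval prompts
npm ci
npx cdk deploy --all --require-approval never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
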

&lt;h3&gt;
  
  
  Low-value constructs in Construct Hub
&lt;/h3&gt;

&lt;p&gt;I've described the &lt;a href="https://constructs.dev/"&gt;Construct Hub&lt;/a&gt; above as a catalog of high-level CDK Constructs. At the moment, it contains over 1,000 packages.&lt;/p&gt;

&lt;p&gt;That sounds great, but many of them are L3 patterns that are, in my opinion, rather low value. For example, the &lt;a href="https://constructs.dev/packages/@aws-solutions-constructs/aws-lambda-sqs/"&gt;&lt;code&gt;LambdaToSqs&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://constructs.dev/packages/@aws-solutions-constructs/aws-sqs-lambda/"&gt;&lt;code&gt;SqsToLambda&lt;/code&gt;&lt;/a&gt; Constructs integrate just a Lambda function writing to or reading from an SQS queue. Maybe it's just me, but it seems a lot like the &lt;a href="https://www.npmjs.com/package/is-even"&gt;is-even&lt;/a&gt; package – the benefits are too small to justify installing a dependency instead of doing it yourself. But maybe I'm wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While I still find the Serverless Framework awesome for simpler use cases, I discovered that CDK is the best fit for larger projects. With it, you can reduce the boilerplate to the minimum. Declaring infrastructure is faster with high-level Constructs, code completion, sensible defaults, and utilities. And it's easier to keep high-quality and unified configuration with reusable Constructs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>cloud</category>
      <category>iac</category>
    </item>
    <item>
      <title>Personal backup to Amazon S3 – cheap and easy</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Wed, 16 Mar 2022 11:05:57 +0000</pubDate>
      <link>https://forem.com/aws-builders/personal-backup-to-amazon-s3-cheap-and-easy-3c39</link>
      <guid>https://forem.com/aws-builders/personal-backup-to-amazon-s3-cheap-and-easy-3c39</guid>
      <description>&lt;p&gt;In need to backup my personal files in the cloud, I wrote a script that archives the data into the Amazon S3 bucket. After some fine-tuning and solving a bunch of edge-cases, it's limited mainly by the disk read and my internet upload speed. And it costs me only $3.70 per &lt;a href="https://en.wikipedia.org/wiki/Binary_prefix" rel="noopener noreferrer"&gt;TiB&lt;/a&gt; per month.&lt;/p&gt;

&lt;p&gt;Instead of reinventing the wheel, I started with research. There must be a good, easy-to-use cloud backup service, right? But everything I found was too complex and/or expensive. So I wrote the backup script myself.&lt;/p&gt;

&lt;p&gt;Then I did the research again, and the results were quite different - this time, I found a few reasonable services I could use. But I already had the script, I had fun writing it, and I will continue using it, so I decided to share it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Off-site backup for personal files
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://news.ycombinator.com/item?id=29978099" rel="noopener noreferrer"&gt;question&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=12999934" rel="noopener noreferrer"&gt;about&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=25758675" rel="noopener noreferrer"&gt;the&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=13694079" rel="noopener noreferrer"&gt;personal&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=14329524" rel="noopener noreferrer"&gt;backup&lt;/a&gt; &lt;a href="https://news.ycombinator.com/item?id=1946416" rel="noopener noreferrer"&gt;system&lt;/a&gt; is raised from time to time on Hacker News. The commonly recommended approach is the &lt;strong&gt;3-2-1 backup strategy&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 copies&lt;/li&gt;
&lt;li&gt;on 2 different media&lt;/li&gt;
&lt;li&gt;with at least 1 copy off-site&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have an external disk with an archive of documents and photos. This, however, is just one copy kept right next to my laptop. And hard drives fail.&lt;/p&gt;

&lt;p&gt;So I needed an off-site backup.&lt;/p&gt;

&lt;p&gt;Looking for &lt;strong&gt;personal cloud backup&lt;/strong&gt; solutions, I found some overcomplicated, some expensive, and one or two reasonable services. But then I remembered that I work on AWS, and Amazon S3 storage is cheap. This is especially true if you want to archive data and don't touch it too often.&lt;/p&gt;

&lt;p&gt;The result is the script I wrote for backing up files to the S3 bucket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backup to S3 script
&lt;/h2&gt;

&lt;p&gt;Below is the full, detailed explanation. If you are interested only in the script and usage instructions, you can find the link to the GitHub repository at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;The script uses &lt;a href="https://rclone.org/" rel="noopener noreferrer"&gt;rclone&lt;/a&gt; and &lt;a href="https://www.gnu.org/software/parallel/" rel="noopener noreferrer"&gt;GNU Parallel&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;On macOS, you can install them with Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;rclone parallel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Provision AWS resources
&lt;/h3&gt;

&lt;p&gt;To store backups in an S3 bucket, you need to have such a bucket. And while you could create and configure it by hand, it's easier to provision it with a simple CloudFormation template.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;stack.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;AWSTemplateFormatVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2010-09-09&lt;/span&gt;  

&lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  

  &lt;span class="na"&gt;BackupBucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::S3::Bucket&lt;/span&gt;  
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="na"&gt;PublicAccessBlockConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;BlockPublicAcls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
        &lt;span class="na"&gt;IgnorePublicAcls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
        &lt;span class="na"&gt;BlockPublicPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
        &lt;span class="na"&gt;RestrictPublicBuckets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  
      &lt;span class="na"&gt;OwnershipControls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;Rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ObjectOwnership&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BucketOwnerEnforced&lt;/span&gt;  
      &lt;span class="na"&gt;VersioningConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enabled&lt;/span&gt;  
      &lt;span class="na"&gt;LifecycleConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="na"&gt;Rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AbortIncompleteMultipartUpload&lt;/span&gt;  
            &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enabled&lt;/span&gt;  
            &lt;span class="na"&gt;AbortIncompleteMultipartUpload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
              &lt;span class="na"&gt;DaysAfterInitiation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3&lt;/span&gt;  
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoncurrentVersionExpiration&lt;/span&gt;  
            &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enabled&lt;/span&gt;  
            &lt;span class="na"&gt;NoncurrentVersionExpiration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
              &lt;span class="na"&gt;NewerNoncurrentVersions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3&lt;/span&gt;  
              &lt;span class="na"&gt;NoncurrentDays&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30&lt;/span&gt;  

  &lt;span class="na"&gt;BackupUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::IAM::User&lt;/span&gt;  
    &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
      &lt;span class="na"&gt;Policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PolicyName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3-access&lt;/span&gt;  
          &lt;span class="na"&gt;PolicyDocument&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
            &lt;span class="na"&gt;Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2012-10-17"&lt;/span&gt;  
            &lt;span class="na"&gt;Statement&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Allow&lt;/span&gt;  
                &lt;span class="na"&gt;Action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:*MultipartUpload*'&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:ListBucket'&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:GetObject'&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:PutObject'&lt;/span&gt;  
                &lt;span class="na"&gt;Resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${BackupBucket.Arn}'&lt;/span&gt;  
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Sub&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${BackupBucket.Arn}/*'&lt;/span&gt;  

&lt;span class="na"&gt;Outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  

  &lt;span class="na"&gt;BackupBucketName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;BackupBucket&lt;/span&gt;  

  &lt;span class="na"&gt;BackupUserName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
    &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;BackupUser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The template defines two resources: the &lt;code&gt;BackupBucket&lt;/code&gt; and &lt;code&gt;BackupUser&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;BackupBucket&lt;/code&gt; has &lt;strong&gt;public access disabled&lt;/strong&gt; for all the objects, as you don't want any of the files to be publicly accessible by mistake.&lt;/p&gt;

&lt;p&gt;It also enables &lt;strong&gt;object versioning&lt;/strong&gt;. When uploading new versions of existing files - fresh backups of the same files - the previous ones will be kept instead of being immediately overwritten.&lt;/p&gt;

&lt;p&gt;On the other hand, to avoid keeping old backups indefinitely (and paying for them), the bucket has a lifecycle rule that &lt;strong&gt;automatically removes old file versions&lt;/strong&gt;. It will keep only the last 3 noncurrent versions and remove older ones 30 days after they become noncurrent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnnv7gjw0mo7igkt3128.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnnv7gjw0mo7igkt3128.png" alt="S3 lifecycle rule description in AWS Console&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The other lifecycle rule &lt;strong&gt;aborts incomplete file uploads after 3 days&lt;/strong&gt;. The script uploads big files in multiple chunks. If the process fails or is interrupted, you are still charged for the uploaded chunks until you complete or abort the upload. This rule prevents those incomplete uploads from staying forever and generating charges.&lt;/p&gt;
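
&lt;p&gt;If you're curious whether any incomplete uploads are lingering in the bucket, you can list them with the AWS CLI (the bucket name below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# list multipart uploads that were started but never completed or aborted
aws s3api list-multipart-uploads --bucket my-backup-bucket-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
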

&lt;p&gt;The second created resource is the &lt;code&gt;BackupUser&lt;/code&gt;. It's an IAM user with permission to upload files to the bucket.&lt;/p&gt;
&lt;h4&gt;
  
  
  Deploy stack
&lt;/h4&gt;

&lt;p&gt;To deploy the stack, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws cloudformation deploy &lt;span class="nt"&gt;--stack-name&lt;/span&gt; backupToS3 &lt;span class="nt"&gt;--template-file&lt;/span&gt; stack.yml &lt;span class="nt"&gt;--capabilities&lt;/span&gt; CAPABILITY_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get bucket name
&lt;/h4&gt;

&lt;p&gt;After the deployment is completed, go to CloudFormation in the AWS Console and find the &lt;code&gt;backupToS3&lt;/code&gt; stack. Then, in the "Outputs" tab, you will see the &lt;code&gt;BackupBucketName&lt;/code&gt; key with the generated S3 bucket name. You will need it in a moment.&lt;/p&gt;
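
&lt;p&gt;Alternatively, you can read the stack output directly with the AWS CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# print the generated bucket name from the stack outputs
aws cloudformation describe-stacks --stack-name backupToS3 \
  --query "Stacks[0].Outputs[?OutputKey=='BackupBucketName'].OutputValue" --output text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
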

&lt;h4&gt;
  
  
  Get access key
&lt;/h4&gt;

&lt;p&gt;Similarly, you will find the &lt;code&gt;BackupUserName&lt;/code&gt; output with the IAM user name. Go to IAM, open that user's details, and create an access key in the "Security credentials" tab.&lt;/p&gt;
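
&lt;p&gt;You can also create the access key from the CLI, substituting the user name from the stack outputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# returns the AccessKeyId and SecretAccessKey for the rclone configuration
aws iam create-access-key --user-name THE_BACKUP_USER_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
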

&lt;h3&gt;
  
  
  Setup rclone
&lt;/h3&gt;

&lt;p&gt;rclone requires setting up the storage backend upfront. You can do this by running &lt;code&gt;rclone config&lt;/code&gt; and &lt;a href="https://rclone.org/s3/#configuration" rel="noopener noreferrer"&gt;setting up the S3 remote&lt;/a&gt;, or by manually editing the configuration file.&lt;/p&gt;

&lt;p&gt;In the configuration, set the access key ID and secret access key generated for the IAM user.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;~/.config/rclone/rclone.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[backup]&lt;/span&gt;
&lt;span class="py"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;s3&lt;/span&gt;
&lt;span class="py"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
&lt;span class="py"&gt;env_auth&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;access_key_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;xxxxxx&lt;/span&gt;
&lt;span class="py"&gt;secret_access_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;xxxxxx&lt;/span&gt;
&lt;span class="py"&gt;acl&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;private&lt;/span&gt;
&lt;span class="py"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;eu-west-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Backup to S3
&lt;/h3&gt;

&lt;p&gt;Generally, the idea is straightforward: we copy everything to the S3 bucket.&lt;/p&gt;

&lt;p&gt;But things are rarely so simple. So let's break it down, step by step.&lt;/p&gt;

&lt;p&gt;The script is based on the &lt;a href="https://betterdev.blog/minimal-safe-bash-script-template/" rel="noopener noreferrer"&gt;minimal Bash script template&lt;/a&gt;. Bash provides the easiest way to glue together various CLI programs and tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-Eeuo&lt;/span&gt; pipefail

usage&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="c"&gt;# omitted for brevity&lt;/span&gt;
  &lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

parse_params&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;split_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
  &lt;span class="nv"&gt;max_size_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1024
  &lt;span class="nv"&gt;storage_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GLACIER"&lt;/span&gt;
  &lt;span class="nv"&gt;dry_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false

  &lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; :&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    case&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;--help&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; usage &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;--verbose&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;-b&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;--bucket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;backup_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nt"&gt;--path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;root_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;--max-size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;max_size_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;--split-depth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;split_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;--storage-class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;storage_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;shift&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="nt"&gt;--dry-run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nv"&gt;dry_run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    -?&lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; die &lt;span class="s2"&gt;"Unknown option: &lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;break&lt;/span&gt; &lt;span class="p"&gt;;;&lt;/span&gt;
    &lt;span class="k"&gt;esac&lt;/span&gt;
    &lt;span class="nb"&gt;shift
  &lt;/span&gt;&lt;span class="k"&gt;done&lt;/span&gt;

  &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; die &lt;span class="s2"&gt;"Missing required parameter: bucket"&lt;/span&gt;
  &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;backup_name&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; die &lt;span class="s2"&gt;"Missing required parameter: name"&lt;/span&gt;
  &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;root_path&lt;/span&gt;&lt;span class="p"&gt;-&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; die &lt;span class="s2"&gt;"Missing required parameter: path"&lt;/span&gt;

  &lt;span class="k"&gt;return &lt;/span&gt;0
&lt;span class="o"&gt;}&lt;/span&gt;

main&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nv"&gt;root_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;
    &lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;pwd&lt;/span&gt; &lt;span class="nt"&gt;-P&lt;/span&gt;
  &lt;span class="si"&gt;)&lt;/span&gt;/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# convert to absolute path&lt;/span&gt;

  &lt;span class="c"&gt;# division by 10k gives integer (without fraction), round result up by adding 1&lt;/span&gt;
  &lt;span class="nv"&gt;chunk_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;max_size_gb &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;

  &lt;span class="c"&gt;# common rclone parameters&lt;/span&gt;
  &lt;span class="nv"&gt;rclone_args&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
    &lt;span class="s2"&gt;"-P"&lt;/span&gt;
    &lt;span class="s2"&gt;"--s3-storage-class"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$storage_class&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="s2"&gt;"--s3-upload-concurrency"&lt;/span&gt; 8
    &lt;span class="s2"&gt;"--s3-no-check-bucket"&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;backup_file &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$split_depth&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;backup_path &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;traverse_path &lt;span class="nb"&gt;.&lt;/span&gt;
  &lt;span class="k"&gt;fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

parse_params &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script expects at least three parameters: the S3 bucket name (&lt;code&gt;--bucket&lt;/code&gt;), the backup name (&lt;code&gt;--name&lt;/code&gt;), and the local path to be backed up (&lt;code&gt;--path&lt;/code&gt;). The backup name serves as an S3 prefix to separate distinct backups.&lt;/p&gt;
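
&lt;p&gt;For example, assuming you saved the script as &lt;code&gt;backup-to-s3.sh&lt;/code&gt; (the file name and values below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# back up the whole disk under the "my_disk" prefix in the bucket
./backup-to-s3.sh --bucket my-backup-bucket-name --name my_disk --path /Volumes/my_disk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
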

&lt;p&gt;After parsing the input arguments, the script does four things.&lt;/p&gt;

&lt;p&gt;Firstly, it converts the path to absolute.&lt;/p&gt;

&lt;p&gt;Secondly, it calculates the chunk size for the multipart file upload based on the max archive size. More on this later.&lt;/p&gt;

&lt;p&gt;Thirdly, it creates an array of common parameters for rclone.&lt;/p&gt;

&lt;p&gt;And finally, it executes the backup based on the provided arguments.&lt;/p&gt;
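
&lt;p&gt;As a quick worked example of the chunk size calculation from the second step: S3 multipart uploads are limited to 10,000 parts per object, so for the default max size of 1024 GB the script picks 1024 * 1024 / 10000 + 1 = 105 MB chunks (integer division plus one to round up), which keeps even the largest allowed archive within the parts limit.&lt;/p&gt;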

&lt;h4&gt;
  
  
  Single file backup
&lt;/h4&gt;

&lt;p&gt;If the backup path points to a file, the script uses the &lt;a href="https://rclone.org/commands/rclone_copy/" rel="noopener noreferrer"&gt;rclone copy&lt;/a&gt; command to simply upload the file to the bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Arguments:&lt;/span&gt;
&lt;span class="c"&gt;# - path - absolute path to backup&lt;/span&gt;
&lt;span class="c"&gt;# - name - backup file name&lt;/span&gt;
backup_file&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

  msg &lt;span class="s2"&gt;"⬆️ Uploading file &lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;
    &lt;span class="s2"&gt;"-P"&lt;/span&gt;
    &lt;span class="s2"&gt;"--checksum"&lt;/span&gt;
    &lt;span class="s2"&gt;"--s3-storage-class"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$storage_class&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="s2"&gt;"--s3-upload-concurrency"&lt;/span&gt; 8
    &lt;span class="s2"&gt;"--s3-no-check-bucket"&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dry_run&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; args+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"--dry-run"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  rclone copy &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"backup:&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$backup_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;rclone will calculate an MD5 checksum of the file and upload it only if a file with the same name and checksum does not yet exist. This prevents wasting time uploading the file if it's unchanged since the last time you backed it up.&lt;/p&gt;
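
&lt;p&gt;You can see this behavior without transferring anything by re-running the same backup with the script's &lt;code&gt;--dry-run&lt;/code&gt; flag, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# nothing is uploaded; rclone only reports what it would do
./backup-to-s3.sh --bucket my-backup-bucket-name --name my_disk --path /Volumes/my_disk/notes.txt --dry-run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
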

&lt;h4&gt;
  
  
  Directory backup
&lt;/h4&gt;

&lt;p&gt;If the path points to a directory, things get more complex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Arguments:&lt;/span&gt;
&lt;span class="c"&gt;# - path - absolute path to backup&lt;/span&gt;
&lt;span class="c"&gt;# - name - backup name, without an extension, optionally being an S3 path&lt;/span&gt;
&lt;span class="c"&gt;# - files_only - whether to backup only dir-level files, or directory as a whole&lt;/span&gt;
backup_path&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;
    &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;files_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="p"&gt;-false&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

    &lt;span class="nb"&gt;local &lt;/span&gt;archive_name files &lt;span class="nb"&gt;hash &lt;/span&gt;s3_hash

    &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s#(/(\./)+)|(/\.$)#/#g'&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s|/$||'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;     &lt;span class="c"&gt;# remove /./ and trailing /&lt;/span&gt;
    &lt;span class="nv"&gt;archive_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$backup_name&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$name&lt;/span&gt;&lt;span class="s2"&gt;.tar.gz"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s|/(\./)+|/|g'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# remove /./&lt;/span&gt;

    &lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; die &lt;span class="s2"&gt;"Can't access &lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files_only&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;msg &lt;span class="s2"&gt;"🔍 Listing files in &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
      &lt;span class="nv"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-type&lt;/span&gt; f &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 1 | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^\.\///g'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else
      &lt;/span&gt;msg &lt;span class="s2"&gt;"🔍 Listing all files under &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
      &lt;span class="nv"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-type&lt;/span&gt; f | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^\.\///g'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# sort to maintain always the same order for hash&lt;/span&gt;
    &lt;span class="nv"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nv"&gt;LC_ALL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;C &lt;span class="nb"&gt;sort&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;msg &lt;span class="s2"&gt;"🟫 No files found"&lt;/span&gt;
      &lt;span class="k"&gt;return
    fi

    &lt;/span&gt;&lt;span class="nv"&gt;files_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{ print $1 }'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    msg &lt;span class="s2"&gt;"ℹ️ Found &lt;/span&gt;&lt;span class="nv"&gt;$files_count&lt;/span&gt;&lt;span class="s2"&gt; files"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files_only&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;msg &lt;span class="s2"&gt;"#️⃣ Calculating hash for files in path &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
    &lt;span class="k"&gt;else
      &lt;/span&gt;msg &lt;span class="s2"&gt;"#️⃣ Calculating hash for directory &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# replace newlines with zero byte to distinct between whitespaces in names and next files&lt;/span&gt;
    &lt;span class="c"&gt;# "md5sum --" to signal start of file names in case file name starts with "-"&lt;/span&gt;
    &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; &lt;span class="s1"&gt;'\0'&lt;/span&gt; | parallel &lt;span class="nt"&gt;-0&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;md5sum&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; | &lt;span class="nb"&gt;md5sum&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{ print $1 }'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    msg &lt;span class="s2"&gt;"ℹ️ Hash is: &lt;/span&gt;&lt;span class="nv"&gt;$hash&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

    &lt;span class="nv"&gt;s3_hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"s3://&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;.md5"&lt;/span&gt; - 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$hash&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$s3_hash&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; aws s3api head-object &lt;span class="nt"&gt;--bucket&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--key&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &amp;amp;&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;msg &lt;span class="s2"&gt;"🟨 File &lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt; already exists with the same content hash"&lt;/span&gt;
    &lt;span class="k"&gt;else
      &lt;/span&gt;msg &lt;span class="s2"&gt;"⬆️ Uploading file &lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dry_run&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; &lt;span class="s1"&gt;'\0'&lt;/span&gt; | xargs &lt;span class="nt"&gt;-0&lt;/span&gt; &lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-zcf&lt;/span&gt; - &lt;span class="nt"&gt;--&lt;/span&gt; |
          rclone rcat &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--s3-storage-class&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$storage_class&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--s3-chunk-size&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;chunk_size_mb&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Mi"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--s3-upload-concurrency&lt;/span&gt; 8 &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="nt"&gt;--s3-no-check-bucket&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
            &lt;span class="s2"&gt;"backup:&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$hash&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; - &lt;span class="s2"&gt;"s3://&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;.md5"&lt;/span&gt;
        &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; - &lt;span class="s2"&gt;"s3://&lt;/span&gt;&lt;span class="nv"&gt;$bucket&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt;.txt"&lt;/span&gt;
        msg &lt;span class="s2"&gt;"🟩 File &lt;/span&gt;&lt;span class="nv"&gt;$archive_name&lt;/span&gt;&lt;span class="s2"&gt; uploaded"&lt;/span&gt;
      &lt;span class="k"&gt;fi
    fi&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The directory upload function starts by cleaning parts like &lt;code&gt;/./&lt;/code&gt; from the path and creating the archive name. The archive will be named just like the directory, with a &lt;code&gt;.tar.gz&lt;/code&gt; extension.&lt;/p&gt;

&lt;p&gt;The subsequent process is best explained with a flowchart:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1tcpsv4n75muyco4e6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1tcpsv4n75muyco4e6b.png" alt="Directory backup process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the directory is compressed and uploaded, the script creates two additional text files in the S3 bucket. One contains the calculated MD5 hash of the files, and the other contains the list of archived files.&lt;/p&gt;
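
&lt;p&gt;So for a hypothetical backup named &lt;code&gt;my_disk&lt;/code&gt; containing a &lt;code&gt;documents&lt;/code&gt; directory, the bucket would end up with objects like these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-backup-bucket-name
└── my_disk
   ├── documents.tar.gz      (the compressed directory)
   ├── documents.tar.gz.md5  (MD5 hash of the directory content)
   └── documents.tar.gz.txt  (list of the archived files)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
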

&lt;h4&gt;
  
  
  Subdirectories backup
&lt;/h4&gt;

&lt;p&gt;Since we compress the directory and calculate its hash to avoid re-uploading it unnecessarily, it makes sense to archive individual subdirectories separately. This way, if the content of one of them changes, only that archive must be re-uploaded, not everything.&lt;/p&gt;

&lt;p&gt;At the same time, we should aim for a smaller number of bigger archives instead of creating too many small ones. This makes the backup and restore process more efficient, in both time and cost.&lt;/p&gt;

&lt;p&gt;For those reasons, an optional &lt;code&gt;--split-depth&lt;/code&gt; parameter defines how many levels down the directory tree the script should go to create separate archives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Arguments:&lt;/span&gt;
&lt;span class="c"&gt;# - path - the path relative to $root_path&lt;/span&gt;
&lt;span class="c"&gt;# - depth - the level from the $root_path&lt;/span&gt;
traverse_path&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="p"&gt;-1&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

  &lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; die &lt;span class="s2"&gt;"Can't access &lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  backup_path &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;/_files"&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;

  &lt;span class="c"&gt;# read directories to array, taking into account possible spaces in names, see: https://stackoverflow.com/a/23357277/2512304&lt;/span&gt;
  &lt;span class="nb"&gt;local dirs&lt;/span&gt;&lt;span class="o"&gt;=()&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\0&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;dirs&lt;/span&gt;+&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REPLY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt; &amp;lt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;find &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-mindepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 1 &lt;span class="nt"&gt;-type&lt;/span&gt; d &lt;span class="nt"&gt;-print0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;dirs&lt;/span&gt;&lt;span class="k"&gt;:-}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt; &lt;span class="c"&gt;# if dirs is not unbound due to no elements&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="nb"&gt;dir &lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;dirs&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
      if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;&lt;span class="se"&gt;\$&lt;/span&gt;RECYCLE.BIN &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;.Trash-1000 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;System&lt;span class="se"&gt;\ &lt;/span&gt;Volume&lt;span class="se"&gt;\ &lt;/span&gt;Information &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        if&lt;/span&gt; &lt;span class="o"&gt;[[&lt;/span&gt; &lt;span class="nv"&gt;$depth&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; &lt;span class="nv"&gt;$split_depth&lt;/span&gt; &lt;span class="o"&gt;]]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
          &lt;/span&gt;backup_path &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$root_path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nb"&gt;false
        &lt;/span&gt;&lt;span class="k"&gt;else
          &lt;/span&gt;traverse_path &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$path&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="k"&gt;$((&lt;/span&gt;depth &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;fi
      fi
    done
  fi&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each found directory is archived with the &lt;code&gt;backup_path()&lt;/code&gt; function, the same as before. Additionally, all the files in directories above the &lt;code&gt;--split-depth&lt;/code&gt; level are archived as &lt;code&gt;_files.tar.gz&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To illustrate this, let's take this file structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_disk
├── browsing_history.txt
├── documents
│  ├── cv.doc
│  ├── chemtrails-evidence.pdf
│  ├── work
│  │  ├── report1.doc
│  │  └── report2.doc
│  └── personal
│     └── secret_plans.txt
├── photos
│  ├── 1947-07-02-roswell
│  │  └── evidence1.jpg
│  │  └── evidence2.jpg
│  └── 1969-07-20-moon
│     └── moon-landing-real-001.jpg
│     └── moon-landing-real-002.jpg
└── videos
   ├── area51.avi
   └── dallas-1963.avi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;--split-depth 1&lt;/code&gt;, the disk will be backed up as four archives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_disk
├── _files.tar.gz
├── documents.tar.gz
├── photos.tar.gz
└── videos.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And with &lt;code&gt;--split-depth 2&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_disk
├── _files.tar.gz
├── documents
│  ├── _files.tar.gz
│  ├── work.tar.gz
│  └── personal.tar.gz
├── photos
│  ├── 1947-07-02-roswell.tar.gz
│  └── 1969-07-20-moon.tar.gz
└── videos
   └── _files.tar.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;./backup.sh &lt;span class="nt"&gt;-b&lt;/span&gt; backuptos3-backupbucket-xxxxxxxxxxxxx &lt;span class="nt"&gt;-n&lt;/span&gt; radziki &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"/Volumes/RADZIKI"&lt;/span&gt; &lt;span class="nt"&gt;--split-depth&lt;/span&gt; 1
2022-02-18 16:30:35 🔍 Listing files &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"/Volumes/RADZIKI"&lt;/span&gt;...
2022-02-18 16:30:39 🟫 No files found
2022-02-18 16:30:39 🔍 Listing files under &lt;span class="s2"&gt;"/Volumes/RADZIKI/nat"&lt;/span&gt;...
2022-02-18 16:35:13 ℹ️ Found 55238 files
2022-02-18 16:35:13 &lt;span class="c"&gt;#️⃣ Calculating hash for directory "/Volumes/RADZIKI/nat"...&lt;/span&gt;
2022-02-18 18:11:04 ℹ️ Hash is: 989626a276bec7f0e9fb6e7c5f057fb9
2022-02-18 18:11:05 ⬆️ Uploading file radziki/nat.tar.gz
Transferred:       41.684 GiB / 41.684 GiB, 100%, 2.720 MiB/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed &lt;span class="nb"&gt;time&lt;/span&gt;:   1h57m45.1s
2022-02-18 20:08:58 🟩 File radziki/nat.tar.gz uploaded
2022-02-18 20:08:58 🔍 Listing files under &lt;span class="s2"&gt;"/Volumes/RADZIKI/Photo"&lt;/span&gt;...
2022-02-18 20:12:43 ℹ️ Found 42348 files
2022-02-18 20:12:43 &lt;span class="c"&gt;#️⃣ Calculating hash for path "/Volumes/RADZIKI/Photo"...&lt;/span&gt;
2022-02-18 22:19:42 ℹ️ Hash is: c3e347566fa8e12ffc19f7c2e24a1578
2022-02-18 22:19:42 ⬆️ Uploading file radziki/Photo.tar.gz
Transferred:      177.471 GiB / 177.471 GiB, 100%, 89.568 KiB/s, ETA 0s
Transferred:            1 / 1, 100%
Elapsed &lt;span class="nb"&gt;time&lt;/span&gt;:   8h12m19.6s
2022-02-19 06:32:02 🟩 File radziki/Photo.tar.gz uploaded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, two directories were found in the path and uploaded.&lt;/p&gt;

&lt;p&gt;The first one was 41 GiB in total and was transferred in 1h57m. The other was 177 GiB and took 8h12m to upload. If you &lt;a href="https://www.omnicalculator.com/other/data-transfer" rel="noopener noreferrer"&gt;calculate&lt;/a&gt; that, it almost perfectly matches my internet upload speed of 50 Mbit/s.&lt;/p&gt;
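
&lt;p&gt;To spell out the math for the first archive: 41.684 GiB is roughly 358,000 Mbit, and 1h57m45s is 7,065 seconds, which works out to about 50.7 Mbit/s of effective throughput.&lt;/p&gt;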

&lt;h2&gt;
  
  
  Questions and answers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why IAM user?
&lt;/h3&gt;

&lt;p&gt;Generally, using IAM users is not the best practice. Instead, it's far better (and more common) to &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#delegate-using-roles" rel="noopener noreferrer"&gt;use SSO and grant access through IAM roles&lt;/a&gt;. Furthermore, it's even safer (and more convenient) to use tools like &lt;a href="https://www.leapp.cloud/" rel="noopener noreferrer"&gt;Leapp&lt;/a&gt; or &lt;a href="https://github.com/99designs/aws-vault" rel="noopener noreferrer"&gt;aws-vault&lt;/a&gt; to manage AWS access.&lt;/p&gt;

&lt;p&gt;However, credentials obtained this way are valid for only up to 12 hours and may require manual actions, like providing an MFA token, to refresh them. Depending on the directory size and your internet connection, uploading a backup may take longer than that.&lt;/p&gt;

&lt;p&gt;For that reason, we use an IAM user with static access credentials set directly in rclone. This user has access limited to only the backup bucket.&lt;/p&gt;
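
&lt;p&gt;For reference, a minimal IAM policy for such a user could look more or less like this (the bucket name here is just a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-backup-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::my-backup-bucket/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;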

&lt;p&gt;Putting the user access credentials directly in the rclone configuration has one additional perk. It allows for the backup to be done in the background without affecting your other work with AWS.&lt;/p&gt;
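
&lt;p&gt;On the rclone side, the remote with those static credentials is a plain entry in &lt;code&gt;~/.config/rclone/rclone.conf&lt;/code&gt; – a sketch with placeholder values (the remote name is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[s3backup]
type = s3
provider = AWS
access_key_id = AKIAXXXXXXXXXXXXXXXX
secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
region = eu-west-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;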

&lt;h3&gt;
  
  
  Why .tar.gz archives?
&lt;/h3&gt;

&lt;p&gt;If you just want to upload the whole directory recursively to the S3 bucket, rclone &lt;a href="https://rclone.org/commands/rclone_copy/" rel="noopener noreferrer"&gt;copy&lt;/a&gt; or &lt;a href="https://rclone.org/commands/rclone_sync/" rel="noopener noreferrer"&gt;sync&lt;/a&gt; commands will handle it. So why bother with compressing files into a &lt;code&gt;.tar.gz&lt;/code&gt; archive?&lt;/p&gt;

&lt;p&gt;Uploading or downloading a single archive file will be much faster than doing the same with hundreds or thousands of individual files. Since this is a backup, not regular storage, there is no requirement to quickly fetch a single file.&lt;/p&gt;
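
&lt;p&gt;For comparison, the no-archive variant would be a single &lt;code&gt;rclone copy&lt;/code&gt; call, transferring every file as a separate S3 object (remote and bucket names are examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rclone copy "/Volumes/MY_DISK/photos" "s3backup:my-backup-bucket/photos"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;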

&lt;p&gt;Working on a smaller number of big files also reduces the costs of S3 PUT and GET operations.&lt;/p&gt;

&lt;p&gt;This may be simplified in the future &lt;a href="https://github.com/rclone/rclone/issues/2815" rel="noopener noreferrer"&gt;when rclone gets a built-in archive capability&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why stream archive?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;.tar.gz&lt;/code&gt; archive is streamed with the pipe (&lt;code&gt;|&lt;/code&gt;) operator to the rclone &lt;a href="https://rclone.org/commands/rclone_rcat/" rel="noopener noreferrer"&gt;rcat&lt;/a&gt; command, which reads the data from the standard input and sends it to the storage. The archive file is never created on the disk, so you don't need free space in memory or on disk equal to the backup size.&lt;/p&gt;
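
&lt;p&gt;Stripped of logging and error handling, the core of the upload is a pipe along these lines (a sketch – remote and bucket names are examples, and the chunk size flag is explained below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# gzip the tar stream to stdout and let rclone read it from stdin
tar -czf - "documents" | rclone rcat --s3-chunk-size "${chunk_size_mb}M" "s3backup:my-backup-bucket/documents.tar.gz"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;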

&lt;p&gt;This, however, brings some consequences.&lt;/p&gt;

&lt;p&gt;One is that rclone does not know the total size of the archive upfront.&lt;/p&gt;

&lt;p&gt;Sending a big file to S3 is done with a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html" rel="noopener noreferrer"&gt;multipart upload&lt;/a&gt;. First, the upload process is started, then the consecutive chunks of the file are sent separately, and, finally, the transfer is completed. There is a limit, though, of 10,000 parts max. Thus, each chunk must be big enough. Since rclone does not know the total file size, we must set the chunk size manually.&lt;/p&gt;

&lt;p&gt;This is where the previously calculated &lt;code&gt;$chunk_size_mb&lt;/code&gt; comes into play. By default, it's set so that the total file size limit is 1 TiB. You can use the &lt;code&gt;--max-size&lt;/code&gt; parameter to modify it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;chunk_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;max_size_gb &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
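
&lt;p&gt;For example, with the default 1 TiB limit (a &lt;code&gt;max_size_gb&lt;/code&gt; of 1024), integer division gives 1024 × 1024 / 10,000 + 1 = 105 MiB per chunk, and 10,000 such parts cover just over 1 TiB.&lt;/p&gt;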



&lt;p&gt;The other consequence of streaming the archive file is that rclone can't calculate its checksum before uploading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why calculate the hash?
&lt;/h3&gt;

&lt;p&gt;Typically, rclone calculates the file checksum and compares it with the checksum of the file already existing in the storage. For example, this happens when we back up a single file using &lt;code&gt;rclone copy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But since we are streaming the archive, rclone can't calculate the checksum on its own, so we need to do it ourselves.&lt;/p&gt;

&lt;p&gt;We never create the whole archive file locally, so we can't calculate its hash. Instead, we calculate the MD5 of all the files and then compute the MD5 of those hashes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$files&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;tr&lt;/span&gt; &lt;span class="s1"&gt;'\n'&lt;/span&gt; &lt;span class="s1"&gt;'\0'&lt;/span&gt; | parallel &lt;span class="nt"&gt;-0&lt;/span&gt; &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;md5sum&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; | &lt;span class="nb"&gt;md5sum&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{ print $1 }'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hash is uploaded as a separate object next to the archive. On the next script run, the hash for the local files is re-calculated and compared with the one in the storage.&lt;/p&gt;
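
&lt;p&gt;The comparison itself can then be as simple as this sketch (the &lt;code&gt;.md5&lt;/code&gt; object name and remote are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fetch the previously stored hash, if any
remote_hash=$(rclone cat "s3backup:my-backup-bucket/documents.tar.gz.md5" 2&amp;gt;/dev/null || true)
if [[ "$hash" == "$remote_hash" ]]; then
  echo "Files unchanged since the last run - skipping upload"
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;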

&lt;p&gt;Calculating MD5 hashes for thousands of files can be time-consuming. However, the biggest bottleneck of the backup process is the upload speed. Therefore, calculating the hash and possibly skipping the archive upload can reduce the total process time, especially for rarely modified files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why rclone?
&lt;/h3&gt;

&lt;p&gt;rclone provides a layer on top of the AWS SDK for interacting with S3 buckets. While the same could be achieved with the AWS CLI, rclone displays upload progress when sending data from the input stream, and the AWS CLI does not.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;the progress shown by rclone is not 100% accurate&lt;/strong&gt;. &lt;a href="https://forum.rclone.org/t/multipart-upload-stops-for-few-minutes-after-4th-chunk/29122/4?u=madzikowski" rel="noopener noreferrer"&gt;rclone shows data as transferred once buffered by the AWS SDK&lt;/a&gt;, not when actually sent. For that reason, the displayed progress may speed up, slow down, or even stall from time to time. In reality, data is constantly uploaded.&lt;/p&gt;

&lt;p&gt;Another reason to use rclone is that, because access credentials for rclone are set separately in its config file, ongoing backup upload does not interfere with other activities and tools that use local AWS credentials. So we can work with other AWS accounts at the same time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why S3 Glacier Flexible Retrieval?
&lt;/h3&gt;

&lt;p&gt;AWS S3 has different storage classes, with different pricing per GB stored and per operation. &lt;strong&gt;The S3 Glacier Flexible Retrieval class is over 6 times cheaper per GB stored than the default S3 Standard.&lt;/strong&gt; In the least expensive AWS regions (like &lt;code&gt;us-east-1&lt;/code&gt; or &lt;code&gt;eu-west-1&lt;/code&gt;), the storage price is &lt;strong&gt;only $3.69 per TiB per month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On the other hand, fetching objects from S3 Glacier Flexible Retrieval is not instantaneous. First, you have to request an object restore. It's free if you are okay with waiting 5-12 hours for it to be ready. Otherwise, it can be sped up by choosing Expedited retrieval, but for $30 per TiB.&lt;/p&gt;

&lt;p&gt;An even cheaper storage class is S3 Glacier Deep Archive. Using it can further reduce storage costs to only $1.01 per TiB. But, contrary to Flexible Retrieval, it does not provide an option to retrieve the data in less than a few hours. Also, there is no free data retrieval option, although it costs only $2.56 per TiB if you are willing to wait up to 48 hours.&lt;/p&gt;

&lt;p&gt;With personal backups in mind, data retrieval should be rare and, hopefully, not time-critical. Thus the S3 Glacier Flexible Retrieval storage class provides a reasonable balance between costs and data access options.&lt;/p&gt;

&lt;p&gt;Also, keep in mind that with S3 Glacier Flexible Retrieval, you are billed for at least 90 days of storing your objects, even if you remove them sooner.&lt;/p&gt;

&lt;p&gt;The storage class can be set with the &lt;code&gt;--storage-class&lt;/code&gt; parameter.&lt;/p&gt;
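
&lt;p&gt;For example, to use Deep Archive instead (assuming the value is passed straight through to S3, so it takes the standard S3 storage class names):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./backup.sh -b backuptos3-backupbucket-xxxxxxxxxxxxx -n radziki -p "/Volumes/RADZIKI" --storage-class DEEP_ARCHIVE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;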

&lt;p&gt;Please see the &lt;a href="https://aws.amazon.com/s3/pricing/" rel="noopener noreferrer"&gt;S3 pricing page&lt;/a&gt; for all the details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problems and considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Automatic backups
&lt;/h3&gt;

&lt;p&gt;Backups are best if they are done regularly. Otherwise, you may find yourself looking for a backup to restore, only to discover the last one was made a year ago.&lt;/p&gt;

&lt;p&gt;If you want to back up local files from your computer, nothing stops you from automating it with a cron job.&lt;/p&gt;
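
&lt;p&gt;For example, a crontab entry like this one would run the backup every Sunday at 3 AM (paths and arguments are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# edit with: crontab -e
0 3 * * 0 /home/user/backup.sh -b my-backup-bucket -n laptop -p "/home/user/documents" &amp;gt;&amp;gt; /var/log/backup.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;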

&lt;p&gt;Unfortunately, such automation is hardly possible with external drives that are attached only from time to time. There, it all relies on your good practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Backup from Google Photos (and similar)
&lt;/h3&gt;

&lt;p&gt;Apart from physical drives and local files, backing up photos and documents from services like Google Photos, Google Drive, etc., is not a bad idea either. Accidents happen, even with cloud services.&lt;/p&gt;

&lt;p&gt;As a Google Photos user, I hoped to &lt;a href="https://rclone.org/googlephotos/" rel="noopener noreferrer"&gt;fetch photos with rclone&lt;/a&gt;. Unfortunately, it's not possible to download photos in their original resolution this way, which makes it unfit for backups.&lt;/p&gt;

&lt;p&gt;The only possibility I found is to use &lt;a href="https://takeout.google.com/settings/takeout" rel="noopener noreferrer"&gt;Google Takeout&lt;/a&gt; to get a data dump from Photos, Drive, Gmail, and other Google services, and then upload it to S3 with the backup script.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disk operations vs power-saving mode
&lt;/h3&gt;

&lt;p&gt;Starting a backup and leaving the laptop for the night may not necessarily bring the best results. I did that and was surprised to see that calculating the file checksums had not completed after 8 hours. But when I re-ran the script, it finished in an hour or so.&lt;/p&gt;

&lt;p&gt;Even when not running on battery, computers tend to throttle background operations when idle.&lt;/p&gt;

&lt;p&gt;On macOS, in Settings -&amp;gt; Battery -&amp;gt; Power Adapter, you can uncheck "Put hard disks to sleep when possible". A more aggressive option is to &lt;a href="https://gist.github.com/pwnsdx/2ae98341e7e5e64d32b734b871614915" rel="noopener noreferrer"&gt;disable sleep&lt;/a&gt; altogether.&lt;/p&gt;
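
&lt;p&gt;Alternatively, you can keep the machine awake just for the duration of the backup by wrapping the script in &lt;code&gt;caffeinate&lt;/code&gt;, which prevents idle and disk sleep while the command runs (arguments are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# -i prevents idle sleep, -m prevents disk sleep, for as long as backup.sh runs
caffeinate -im ./backup.sh -b my-backup-bucket -n my-disk -p "/Volumes/MY_DISK"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;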

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A 250-line Bash script is not a foolproof backup system. The backup is not incremental, and it does not handle file modifications during the archiving process.&lt;/p&gt;

&lt;p&gt;Nonetheless, this is just what I need to back up my external drive from time to time and sleep without worrying about losing data if the drive does not work the next time I attach it to the computer.&lt;/p&gt;

&lt;p&gt;The whole script with usage instructions is available on GitHub: &lt;a href="https://github.com/m-radzikowski/aws-s3-personal-backup" rel="noopener noreferrer"&gt;aws-s3-personal-backup&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, as I said at the beginning, I'm aware that I may have reinvented the wheel. If so, please share what backup systems work for you.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>backup</category>
    </item>
    <item>
      <title>Headless CMS with Gatsby on AWS for $0.00 per month</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Tue, 19 Oct 2021 15:19:02 +0000</pubDate>
      <link>https://forem.com/aws-builders/headless-cms-with-gatsby-on-aws-for-000-per-month-420o</link>
      <guid>https://forem.com/aws-builders/headless-cms-with-gatsby-on-aws-for-000-per-month-420o</guid>
      <description>&lt;p&gt;Can you have a &lt;strong&gt;website with a CMS on AWS&lt;/strong&gt; and not pay just for its existence? I looked at Amazon Lightsail, headless WordPress, and Webiny CMS but &lt;strong&gt;found none of those suitable&lt;/strong&gt;. So I chose Prismic – a SaaS headless CMS – and Gatsby to create the site. Yes, I needed to make a pipeline to build my website after content changes. But when I did it, &lt;strong&gt;I got a website with a CMS hosted at no cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Need of a Website
&lt;/h2&gt;

&lt;p&gt;I build serverless applications on AWS daily. But when I needed to create a simple website for my new project, &lt;a href="https://siteclue.app"&gt;SiteClue&lt;/a&gt;, I surprisingly had no solution ready off the top of my head.&lt;/p&gt;

&lt;p&gt;My requirements were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;landing page with a &lt;strong&gt;custom HTML&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;ability to &lt;strong&gt;edit content with CMS&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;hosted on AWS&lt;/strong&gt; – to have everything in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;low to no fixed costs&lt;/strong&gt; – paying for hosting a website that (for now) has almost no visits is simply against my rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Regular, Headless or Serverless CMS on AWS
&lt;/h2&gt;

&lt;p&gt;There are multiple ways to host a website with CMS on AWS. So why not…?&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not Amazon Lightsail with WordPress
&lt;/h3&gt;

&lt;p&gt;If you look for &lt;strong&gt;hosting a WordPress site on AWS&lt;/strong&gt;, the default recommended service is &lt;a href="https://aws.amazon.com/lightsail/"&gt;Lightsail&lt;/a&gt;. It’s an out-of-the-box solution for simple web applications. While being much less complex than EC2, you still get access to a virtual machine to install your app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3uDTsYTH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j428hthiewmorqcz1rz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3uDTsYTH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j428hthiewmorqcz1rz0.png" alt="Apache with WordPress and MySQL can be hosted on a single Amazon Lightsail instance."&gt;&lt;/a&gt;&lt;br&gt;Hosting WordPress on Lightsail can be as simple as that (&lt;a href="https://aws.amazon.com/blogs/compute/deploying-a-highly-available-wordpress-site-on-amazon-lightsail-part-1-implementing-a-highly-available-lightsail-database-with-wordpress/"&gt;source&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;The problem is, Lightsail pricing starts at $3.50 per month. That’s about twice what I pay for hosting this blog. And I don’t feel like paying three bucks for hosting a website that I edit once or twice per month, just for the fact that it exists. Serverless has spoiled me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not Headless WordPress on AWS
&lt;/h3&gt;

&lt;p&gt;My second thought: hey, there was an article about &lt;a href="https://dev.to/aws-builders/serverless-static-wordpress-on-aws-for-0-01-a-day-1b29"&gt;Serverless Static WordPress on AWS for $0.01 a day&lt;/a&gt; not long ago. A penny a day – I can accept that. Let’s dig into this.&lt;/p&gt;

&lt;p&gt;It turns out that this solution works like that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You go to AWS CodeBuild and &lt;strong&gt;run the job&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The job starts an ECS container&lt;/strong&gt; with WordPress&lt;/li&gt;
&lt;li&gt;You log in to WordPress to &lt;strong&gt;make changes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You click “Generate Static Site” in WordPress to &lt;strong&gt;dump the website to the S3 bucket&lt;/strong&gt; and host it from there&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You run the job to stop the ECS container&lt;/strong&gt; (or pay for the running container that you don’t use and forgot to stop)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGBFHlGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ms06l2r8nr3bwbchitgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGBFHlGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ms06l2r8nr3bwbchitgj.png" alt="Serverless Static WordPress has an ECS container and RDS Aurora Serverless for editing content, and S3 with CloudFront and Lambda@Edge for service the website."&gt;&lt;/a&gt;&lt;br&gt;Serverless Static WordPress architecture (&lt;a href="https://www.techtospeech.com/serverless-static-wordpress-on-aws-for-0-01-a-day/https://www.techtospeech.com/serverless-static-wordpress-on-aws-for-0-01-a-day/"&gt;source&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;Well… That’s an interesting solution, but not appropriate for my needs.&lt;/p&gt;

&lt;p&gt;With it, you are launching the CMS only when you need it. I see several drawbacks here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need to go to AWS to launch it and stop it.&lt;/strong&gt; Not optimal if you want to give access to CMS for someone not technical to write a new blog post.&lt;/li&gt;
&lt;li&gt;Even if you automate the above with some UI button, &lt;strong&gt;you have to wait a minute or two for the container to start.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;And lastly, you need to &lt;strong&gt;remember to stop the instance&lt;/strong&gt; to not pay for it when you no longer use it. However, you could automate this with some inactivity timeout.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, the whole solution is a bit too complicated for hosting a simple website, and the user experience is not the best.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not Webiny CMS
&lt;/h3&gt;

&lt;p&gt;At that point, I resigned from using WordPress and focused my search on &lt;strong&gt;“serverless CMS”&lt;/strong&gt;. Pretty soon, I came across &lt;a href="https://www.webiny.com/serverless-cms"&gt;Webiny Serverless CMS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Webiny is an open-source project that you deploy to AWS. It takes advantage of serverless services to bring you a self-hosted, customizable, and extensible CMS. It creates the infrastructure containing Lambda functions, API Gateway, DynamoDB, S3, and… Elasticsearch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Negaaock--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9oddvcptg599oyjowtdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Negaaock--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9oddvcptg599oyjowtdg.png" alt="Webiny CMS is a SPA frontend application with a bunch of Lambda functions, DynamoDB, Elasticsearch and Cognito on the backend."&gt;&lt;/a&gt;&lt;br&gt;Webiny API architecture (&lt;a href="https://www.webiny.com/docs/key-topics/cloud-infrastructure/api/overview/"&gt;source&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Almost everything deployed is serverless&lt;/strong&gt;, where you pay only for what you really use. &lt;strong&gt;Except for the Elasticsearch,&lt;/strong&gt; &lt;a href="https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-is-now-amazon-opensearch-service-and-supports-opensearch-10/"&gt;recently renamed&lt;/a&gt; to Amazon OpenSearch Service.&lt;/p&gt;

&lt;p&gt;While you can have a small Elasticsearch/OpenSearch instance free of charge in the free tier for the first year, afterward, you have to pay &lt;strong&gt;at least $13 per month&lt;/strong&gt;. I rejected Lightsail for $3.50 per month, so I won’t pay 4x as much for OpenSearch.&lt;/p&gt;

&lt;p&gt;Nonetheless, I like the idea of Webiny and see a great use for it in the future. The good news is that &lt;a href="https://github.com/webiny/webiny-js/discussions/1899#discussioncomment-1282715"&gt;they work on the ability to make Elasticsearch optional&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headless CMS and Static Website
&lt;/h2&gt;

&lt;p&gt;Eventually, I came up with a solution. So why…?&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Prismic CMS
&lt;/h3&gt;

&lt;p&gt;Finally, I looked for &lt;strong&gt;“headless CMS”&lt;/strong&gt;. I quickly found out &lt;a href="https://jamstack.org/headless-cms/"&gt;there is a ton of them&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The general idea of a headless CMS is that it provides an editor to create content and an API that exposes that content. Then &lt;strong&gt;the website fetches content from the API&lt;/strong&gt;. This is, by the way, the same workflow as with the Webiny described above.&lt;/p&gt;

&lt;p&gt;Building a website with a headless CMS can be more work than with a standard CMS like WordPress. But at the same time, it brings some benefits, which I will cover at the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--79J3aGeM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wrx2c5o58c0d1mlzn27m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--79J3aGeM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wrx2c5o58c0d1mlzn27m.png" alt="Prismic CMS provides a WYSIWYG editor and ability to define own custom blocks to build rich websites."&gt;&lt;/a&gt;&lt;br&gt;Prismic CMS UI (&lt;a href="https://prismic.io/non-developer"&gt;source&lt;/a&gt;)
  &lt;/p&gt;

&lt;p&gt;Some of the headless CMS services are open-source, and you host them by yourself. This is nice, but I don’t want to have an EC2 or even a Docker container running all the time.&lt;/p&gt;

&lt;p&gt;Others are SaaS products, where you edit content and have access to the API. I reviewed a few of them and ended up choosing &lt;a href="https://prismic.io/"&gt;Prismic&lt;/a&gt;. Why? I had heard of it before, the docs looked good, and it has a free plan for one user as well as sensible pricing when the editorial team grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Gatsby for Website
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.gatsbyjs.com/"&gt;Gatsby&lt;/a&gt; is a static site generator&lt;/strong&gt;, one of the most popular ones. I think the peak of its popularity, when everyone was talking about it, was actually some time ago, but most days, I look at the frontend world from the distance.&lt;/p&gt;

&lt;p&gt;Now, needing to build a website that will consume and display the content from an API, I decided to catch up and see for myself what’s the deal with Gatsby. &lt;strong&gt;It’s based on React and has a lot of plugins&lt;/strong&gt;, which really speeds up development. Importantly, like every other major headless CMS, &lt;strong&gt;Prismic provides a plugin for Gatsby&lt;/strong&gt;. That makes the integration effortless.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PFgZHg_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7to0nu502kqiolpue2k7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PFgZHg_v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7to0nu502kqiolpue2k7.jpg" alt="Great Gatsby, the movie, not the statis site generator."&gt;&lt;/a&gt;&lt;br&gt;Oops, sorry, wrong Gatsby
  &lt;/p&gt;

&lt;p&gt;Gatsby produces a static website, meaning all the content is generated/fetched during the build and converted into &lt;strong&gt;a simple HTML page you can host from an S3 bucket&lt;/strong&gt;. This gives you an ultra-fast website that does not rely on any backend making a bunch of database queries on each page view. However, on every content change, you need to re-build the website.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI Pipeline for Headless CMS and Gatsby
&lt;/h2&gt;

&lt;p&gt;With the tech stack selected, now it’s time to make it work.&lt;/p&gt;

&lt;p&gt;As mentioned above, &lt;strong&gt;you need to build the Gatsby website to update it&lt;/strong&gt;. This is necessary for two situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you update the website code itself (e.g., website structure),&lt;/li&gt;
&lt;li&gt;or you update the content in the CMS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of this first case, &lt;strong&gt;I decided to build the website using GitHub Actions&lt;/strong&gt;, since this is my preferred CI system. Every time I push changes to the repository &lt;code&gt;main&lt;/code&gt; branch, it triggers the build and deployment of the website.&lt;/p&gt;

&lt;p&gt;Now, &lt;strong&gt;I needed the build on GitHub to also happen after changes in Prismic&lt;/strong&gt;. Thankfully, in Prismic, you can add a webhook that triggers after content edits.&lt;/p&gt;

&lt;p&gt;In the perfect world, we could point the Prismic webhook directly to the GitHub API to trigger the build job. But the GitHub &lt;a href="https://docs.github.com/en/rest/reference/repos#create-a-repository-dispatch-event"&gt;&lt;code&gt;/dispatches&lt;/code&gt; endpoint&lt;/a&gt; requires an &lt;code&gt;event_type&lt;/code&gt; parameter in the payload body. Unfortunately, we can’t add it to the messages sent from Prismic. However, as you probably know, &lt;strong&gt;there is nothing we couldn’t patch with a Lambda function&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;APIGatewayProxyHandlerV2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-lambda/trigger/api-gateway-proxy&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;prismicSecret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PRISMIC_SECRET&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;githubUser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GITHUB_USER&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;githubRepo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GITHUB_REPO&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;githubToken&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;GITHUB_TOKEN&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;APIGatewayProxyHandlerV2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;prismicSecret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;statusCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`https://api.github.com/repos/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;githubUser&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;githubRepo&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/dispatches`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;prismatic_update&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;githubUser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;githubToken&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;GitHub response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When invoked from the Prismic webhook, this Lambda verifies the request by checking the secret token and calls the GitHub API to start the build.&lt;/p&gt;

&lt;p&gt;The whole architecture looks as follows:&lt;/p&gt;


  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K_p6vOxw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ubj15hudg6ut4h3thhg4.png" alt="Prismic webhook triggers Lambda function that starts a GitHub Actions execution. It builds and uploads the Gatsby static website to AWS S3 bucket, from where it's hosted by CloudFront."&gt;Pipeline for updating Gatsby website after changes in Headless CMS
  


&lt;p&gt;In practice, the build and deployment in the GitHub Actions are handled by Serverless Framework, which deploys AWS services along with the website.&lt;/p&gt;
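
&lt;p&gt;For completeness, the workflow just needs to listen to both triggers – a minimal sketch of the relevant part of the GitHub Actions YAML, assuming the &lt;code&gt;event_type&lt;/code&gt; sent by the Lambda above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;on:
  push:
    branches: [main]
  repository_dispatch:
    types: [prismatic_update]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;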

&lt;p&gt;To make it all work, you need to configure a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;in &lt;strong&gt;Prismic&lt;/strong&gt;, you need to &lt;strong&gt;create a webhook&lt;/strong&gt; providing the HTTP API URL as a target and some secret value for authorization,&lt;/li&gt;
&lt;li&gt;in &lt;strong&gt;Build Trigger Lambda&lt;/strong&gt;, you need to set:

&lt;ul&gt;
&lt;li&gt;the same secret value to validate requests,&lt;/li&gt;
&lt;li&gt;GitHub username and repository name,&lt;/li&gt;
&lt;li&gt;generated GitHub Personal Access Token to be able to dispatch builds,&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;and finally in &lt;strong&gt;GitHub&lt;/strong&gt;, you need to create repository secrets used in the build:

&lt;ul&gt;
&lt;li&gt;AWS access key ID and secret,&lt;/li&gt;
&lt;li&gt;Prismic repo name and generated API token.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s it. From this moment, every published change in Prismic will trigger the build. The website will be updated in about 3-4 minutes.&lt;/p&gt;

&lt;p&gt;You could optimize it by only building and uploading the website to the S3 bucket from the webhook, instead of deploying the whole stack every time. Let’s make this your homework, shall we?&lt;/p&gt;

&lt;p&gt;You can find the link to the repository with the entire solution at the end of the article 👇&lt;/p&gt;
&lt;h2&gt;
  
  
  Real Cost of Static Website with CMS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is it really $0.00 per month?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. But also no.&lt;/p&gt;

&lt;p&gt;There are no fixed or minimal costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prismic&lt;/strong&gt; is free for one user,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions&lt;/strong&gt; provide free 2000 execution minutes per month (each build takes about 4 minutes),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP API requests&lt;/strong&gt; cost $0.000001 per invocation, so to get billed $0.01, you would need to make at least 10,000 updates,&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; is free up to 1M invocations and 400,000 GB-seconds of compute time per month (a single execution takes milliseconds),&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; with even a 50 MB website and 100 file uploads also costs below $0.01 per month.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus you can update the website multiple times per month, and as long as no one visits it, you don’t pay a cent.&lt;/p&gt;

&lt;p&gt;On the other hand, you may eventually begin paying once people start visiting the website. &lt;strong&gt;The only significant cost comes from CloudFront&lt;/strong&gt;, where you pay for the number of requests and the total data transfer size. Pricing will depend on the CloudFront settings and the location of your visitors.&lt;/p&gt;

&lt;p&gt;I think that’s fair – you pay only for the actual usage, as with other serverless services. But, if that cost becomes a problem, you can use another CDN in its place. &lt;a href="https://www.cloudflare.com/"&gt;Cloudflare&lt;/a&gt; may be a much cheaper option here (even free).&lt;/p&gt;
&lt;h2&gt;
  
  
  Is using SaaS CMS cheating?
&lt;/h2&gt;

&lt;p&gt;Maybe.&lt;/p&gt;

&lt;p&gt;Yes, I said I wanted to have everything hosted on AWS. But by that, I meant that I don’t have to deploy anything elsewhere myself. Using an external SaaS solution is much less of a problem for me.&lt;/p&gt;

&lt;p&gt;And yes, I’m eager to try Webiny as soon as they provide a built-in deployment without the Elasticsearch option. I will write a follow-up of this post then – subscribe to not miss it 😉&lt;/p&gt;
&lt;h2&gt;
  
  
  Pros and Cons of Headless CMS and Gatsby
&lt;/h2&gt;

&lt;p&gt;There are several benefits of this solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It’s easy to make changes.&lt;/strong&gt; Any non-technical person can modify the content. Even a technical one wouldn’t like to change the HTML and manually update the website every time a few words need to be added to the page.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It’s so cheap it’s basically free.&lt;/strong&gt; You pay nothing for the existence of the website and making changes several times a month. Even with a lot of changes, it would be hard to spend more than a few cents. You pay for the visits, like with any page you expose through CloudFront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The website loads fast.&lt;/strong&gt; That’s the beauty and main idea of static site generators like Gatsby. No backend and database queries. Everything is already in HTML, served from the nearest CDN edge location.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, there are also some drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Changes are not visible immediately.&lt;/strong&gt; After every modification, the page needs to be built and deployed, which takes a moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The CMS is not as well-known and flexible as WordPress.&lt;/strong&gt; You can be almost sure that any writer will be familiar with WordPress. Prismic, on the other hand, is not (yet?) as well known, although it provides all the essential features needed in most cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will leave the rest of the “WordPress vs. anything else” discussion out of this list. There is no universal answer to it, and everything depends on the use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final notes
&lt;/h2&gt;

&lt;p&gt;As always, you can find a repository with the complete example on GitHub:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i3JOwpme--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/m-radzikowski"&gt;
        m-radzikowski
      &lt;/a&gt; / &lt;a href="https://github.com/m-radzikowski/aws-website-cms"&gt;
        aws-website-cms
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Gatsby website with Prismic CMS automatically updated on AWS.
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;If you are interested in &lt;strong&gt;website analytics that cares about privacy and where you pay based on the actual usage&lt;/strong&gt; (just like in the solution this whole article was about), please check out SiteClue. This post would not exist if I hadn’t had to make this website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://siteclue.app"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PVTDRt9j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lgmyuiw09iathbpv42bb.png" alt="SiteClue: Intuitive web analytics with privacy at heart"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cms</category>
      <category>gatsby</category>
    </item>
    <item>
      <title>My AWS toolbox – tools, plugins and applications</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Fri, 23 Oct 2020 15:58:19 +0000</pubDate>
      <link>https://forem.com/mradzikowski/my-aws-toolbox-tools-plugins-and-applications-1p79</link>
      <guid>https://forem.com/mradzikowski/my-aws-toolbox-tools-plugins-and-applications-1p79</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published at &lt;a href="https://betterdev.blog/my-aws-toolbox/" rel="noopener noreferrer"&gt;https://betterdev.blog/my-aws-toolbox/&lt;/a&gt; - check it out for more related content.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Developers, like all specialists, discover and collect their favorite tools over time. Having a good, proven set of tools makes the work easier and more pleasant. We can focus on getting the job done. Sometimes eliminating minor inconveniences or improving a small element of everyday activity makes the greatest impact on the comfort of work.&lt;/p&gt;

&lt;p&gt;It’s not always easy to find the best tools. There is a wide choice. More importantly, everyone has different habits and preferences. The best way is to test them yourself and see what suits you.&lt;/p&gt;

&lt;p&gt;To help a little bit with that, here I present a collection of my AWS tools. These are applications, plugins, and extensions that I use in my daily work with AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS CLI
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://aws.amazon.com/cli/" rel="noopener noreferrer"&gt;AWS CLI&lt;/a&gt;&lt;/strong&gt; is the obvious first position on this list. After all, sometimes it’s just quicker to do something in the CLI. Other times we need to wrap some process interacting with AWS in a simple script.&lt;/p&gt;

&lt;p&gt;The AWS CLI v2 has some nice features, such as improved command completion. I’m using the &lt;a href="https://fishshell.com/" rel="noopener noreferrer"&gt;fish shell&lt;/a&gt; in the terminal, and the AWS CLI &lt;a href="https://github.com/aws/aws-cli/issues/1079" rel="noopener noreferrer"&gt;does not natively provide&lt;/a&gt; command completion for it. Fortunately, fish is extremely good with completions, so the fix is quite easy. It’s enough to add one (quite long) line to the config file, and it works like a charm.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;~/.config/fish/config.fish&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test -x (which aws_completer); and complete --command aws --no-files --arguments '(begin; set --local --export COMP_SHELL fish; set --local --export COMP_LINE (commandline); aws_completer | sed \'s/ $//\'; end)'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Ffish-aws-cli-v2-completion.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Ffish-aws-cli-v2-completion.png" alt="AWS CLI v2 completion in fish"&gt;&lt;/a&gt;&lt;/p&gt;
AWS CLI v2 completion in fish



&lt;h3&gt;
  
  
  asp plugin for oh-my-fish
&lt;/h3&gt;

&lt;p&gt;As mentioned above, I’m using the fish shell. Its true beauty and power can be unlocked with &lt;a href="https://github.com/oh-my-fish/oh-my-fish" rel="noopener noreferrer"&gt;Oh My Fish&lt;/a&gt;, which is basically a plugin and theme manager for the shell.&lt;/p&gt;

&lt;p&gt;The OMF plugin I use daily when working with AWS is &lt;strong&gt;&lt;a href="https://github.com/m-radzikowski/omf-plugin-asp" rel="noopener noreferrer"&gt;asp&lt;/a&gt;&lt;/strong&gt;. It’s a small, handy plugin that allows changing the currently selected AWS profile. I took it over from the original author and I’m its maintainer right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Ffish-asp-plugin-300x273.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Ffish-asp-plugin-300x273.png" alt="Oh My Fish asp plugin"&gt;&lt;/a&gt;&lt;/p&gt;
Oh My Fish asp plugin



&lt;p&gt;If you are using zsh instead of fish, a &lt;a href="https://github.com/ohmyzsh/ohmyzsh/blob/master/plugins/aws/README.md" rel="noopener noreferrer"&gt;similar plugin&lt;/a&gt; exists also for &lt;a href="https://github.com/ohmyzsh/ohmyzsh" rel="noopener noreferrer"&gt;Oh My Zsh&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Serverless Framework
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://www.serverless.com/" rel="noopener noreferrer"&gt;Serverless Framework&lt;/a&gt;&lt;/strong&gt; is the most basic tool for my work with AWS. The built-in functionalities and number of community plugins accelerate infrastructure development. Even when creating just “ordinary” stacks, without any Lambda functions or other plugin-driven resources, writing CloudFormation with syntax extended by Serverless (for example, with variables) is far easier.&lt;/p&gt;

&lt;p&gt;While CloudFormation is not always the best, it’s the default IaC for AWS and supported by them. The Serverless Framework is, in fact, building and deploying normal CloudFormation templates. That gives me confidence that I’m depending mostly on AWS, without additional parties. Anything that is not directly supported by Serverless or its plugins can be created using raw CloudFormation in the stack. This makes the IaC, the critical element of systems, stable and powerful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fserverless-stack-with-cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fserverless-stack-with-cf.png" alt="Sample Serverless stack with raw CloudFormation resource"&gt;&lt;/a&gt;&lt;/p&gt;
Sample Serverless stack with raw CloudFormation resource



&lt;h2&gt;
  
  
  Chrome extensions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Extend Switch Roles
&lt;/h3&gt;

&lt;p&gt;If you work on multiple AWS accounts and/or use various roles, you must know the pain of switching between them in the AWS Console. The site remembers your past roles, so you don’t have to provide the role name and account ID every time – at least as long as you have no more than 5 of them. That’s the limit of the role history, after which entries are overwritten.&lt;/p&gt;

&lt;p&gt;Here to help comes the &lt;strong&gt;&lt;a href="https://chrome.google.com/webstore/detail/aws-extend-switch-roles/jpmkfafbacpgapdghgdpembnojdlgkdl" rel="noopener noreferrer"&gt;AWS Extend Switch Roles&lt;/a&gt;&lt;/strong&gt; extension. The configuration is dead simple – you just copy the content of your &lt;code&gt;~/.aws/config&lt;/code&gt; file. From that point, when you click on the extension icon, you will get a nice, filterable list of all defined roles to choose from. And you can have as many of them as you need.&lt;/p&gt;
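
&lt;p&gt;In other words, entries like this one in your config (the account ID and role name are examples) become items on that list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[profile company-prod-admin]
role_arn = arn:aws:iam::123456789012:role/AdminAccess
source_profile = default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;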

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-extend-switch-roles-plugin-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-extend-switch-roles-plugin-1.png" alt="AWS Extend Switch Role extension"&gt;&lt;/a&gt;&lt;/p&gt;
AWS Extend Switch Role extension



&lt;p&gt;Available also for &lt;a href="https://addons.mozilla.org/en-US/firefox/addon/aws-extend-switch-roles3/" rel="noopener noreferrer"&gt;Firefox&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Simple Iconification Service
&lt;/h3&gt;

&lt;p&gt;This one is from the “small but delightful” category. The &lt;strong&gt;&lt;a href="https://chrome.google.com/webstore/detail/aws-simple-iconification/edagjlhogddnlkbkllibfhbekpcdppbk" rel="noopener noreferrer"&gt;AWS Simple Iconification Service&lt;/a&gt;&lt;/strong&gt; extension fixes favicons in the AWS Console.&lt;/p&gt;

&lt;p&gt;The fact that half of the service pages in the AWS Console have one of two versions of the same default favicon is somewhat astonishing. The fact that the other half have favicons in a few different styles, from 3D to flat, is just amusing. Well, we all know that the UI is not the AWS team’s priority, and the whole site looks a little bit like Frankenstein’s monster.&lt;/p&gt;

&lt;p&gt;But identical or inconsistent favicons don’t only hurt someone’s sensitive UI feelings. They also make it harder to quickly find one of the 15 currently open AWS Console tabs during development. Or, worse, while looking for the cause of a production error on some pleasant Friday afternoon.&lt;/p&gt;

&lt;p&gt;With the Iconification extension, all services get their own favicons, taken from the official AWS architecture icons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-simple-iconification-service-comparison.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-simple-iconification-service-comparison.png" alt="AWS Simple Iconification Service – favicon comparison"&gt;&lt;/a&gt;&lt;/p&gt;
AWS Simple Iconification Service – favicon comparison



&lt;p&gt;Available also for &lt;a href="https://addons.mozilla.org/pl/firefox/addon/simple-iconification-service/" rel="noopener noreferrer"&gt;Firefox&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  IDE plugins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Toolkit for JetBrains
&lt;/h3&gt;

&lt;p&gt;We can argue about which IDE is the best, but for me, it’s always one from the JetBrains stable. Thus this list could not miss the &lt;strong&gt;&lt;a href="https://aws.amazon.com/intellij/" rel="noopener noreferrer"&gt;AWS Toolkit for JetBrains&lt;/a&gt;&lt;/strong&gt; plugin.&lt;/p&gt;

&lt;p&gt;The list of services the plugin supports is slowly growing. As I’m not building SAM applications, the most useful parts for me so far are the S3, CloudWatch, and CloudFormation interfaces. Being able to work with them directly from the IDE, sometimes easier and faster than going through the AWS Console in the browser, is really handy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-toolkit-for-jetbrains.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Faws-toolkit-for-jetbrains.png" alt="AWS Toolkit for JetBrains menu"&gt;&lt;/a&gt;&lt;/p&gt;
AWS Toolkit for JetBrains menu



&lt;p&gt;The plugin works with all JetBrains IDEs (IntelliJ, WebStorm, PyCharm, Rider, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Toolkit for VS Code
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://aws.amazon.com/visualstudiocode/" rel="noopener noreferrer"&gt;AWS Toolkit for Visual Studio Code&lt;/a&gt;&lt;/strong&gt; is a little bit younger brother of the Toolkit for JetBrains. Their development goes with similar, but not identical paths. Some features are available sooner in one of them.&lt;/p&gt;

&lt;p&gt;I’m not using VS Code on a daily basis, but its AWS Toolkit is one of the reasons I launch it. It provides an Amazon States Language graph preview, which is a great help when working a lot with Step Functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fvscode-step-functions-preview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fvscode-step-functions-preview.png" alt="Step Function graph preview in VS Code"&gt;&lt;/a&gt;&lt;/p&gt;
Step Function graph preview in VS Code



&lt;p&gt;This will stay on the list for now, at least until the &lt;a href="https://github.com/aws/aws-toolkit-jetbrains/issues/584" rel="noopener noreferrer"&gt;same feature&lt;/a&gt; becomes available in the Toolkit for JetBrains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serverless Framework plugin
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://plugins.jetbrains.com/plugin/14537-serverless-framework-completion-navigation-syntax" rel="noopener noreferrer"&gt;Serverless Framework Completion/Navigation/Syntax&lt;/a&gt;&lt;/strong&gt; plugin for IntelliJ provides support for writing Serverless stacks. While rather basic, it can help a lot. First, it warns of references to non-existing files or resources. Furthermore, the ability to click on the resource name or path and jump straight to the code is very useful and minimizes the scrolling and clicking through files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture diagrams
&lt;/h2&gt;

&lt;p&gt;A picture tells more than a thousand words. And a good software architecture diagram can tell more than any other kind of documentation, especially when working in a microservice or serverless environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  OmniGraffle
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href="https://www.omnigroup.com/omnigraffle" rel="noopener noreferrer"&gt;OmniGraffle&lt;/a&gt;&lt;/strong&gt; is a paid and Mac-only application for prototyping, design, and diagramming. My case is the latter and the application does a good job in that area. After remembering only a few shortcuts the work is intuitive and fast. Even if you are pedantic like me and everything on the diagram must be exactly aligned, with OmniGraffle it’s quick to do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fomnigraffle-example.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fomnigraffle-example.png" alt="OmniGraffle AWS architecture example"&gt;&lt;/a&gt;&lt;/p&gt;
OmniGraffle AWS architecture example



&lt;p&gt;A nice feature is &lt;a href="https://stenciltown.omnigroup.com/" rel="noopener noreferrer"&gt;Stenciltown&lt;/a&gt; – a community-driven library of “stencils”, packs of graphics that you can add and use in OmniGraffle. Apart from that, there are also paid stencils available on the internet.&lt;/p&gt;

&lt;p&gt;If you use OmniGraffle and need AWS icons, &lt;a href="https://stenciltown.omnigroup.com/stencils/aws-architecture-icons-light-all-2020-04/" rel="noopener noreferrer"&gt;here&lt;/a&gt; is a stencil from me.&lt;/p&gt;

&lt;p&gt;And if you want to create a stencil on your own, here is my tool that will do it for you: &lt;a href="https://github.com/m-radzikowski/omnigraffle-stencil" rel="noopener noreferrer"&gt;OmniGraffle Stencil generator&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  diagrams.net / draw.io
&lt;/h3&gt;

&lt;p&gt;The OmniGraffle app is great to use but has two drawbacks: it’s macOS-only and paid. Sometimes you cannot expect everyone to use it.&lt;/p&gt;

&lt;p&gt;For such cases, I use &lt;strong&gt;&lt;a href="https://www.diagrams.net/" rel="noopener noreferrer"&gt;diagrams.net&lt;/a&gt;&lt;/strong&gt; (&lt;a href="https://www.diagrams.net/blog/move-diagrams-net" rel="noopener noreferrer"&gt;previously known as draw.io&lt;/a&gt;). It’s free and works in the browser, so everyone can edit the diagrams. And for Confluence, it’s really worth buying the add-on that integrates it – having editable diagrams in the same place as the rest of the documentation is the best thing possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fdiagrams-net-example.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fbetterdev.blog%2Fapp%2Fuploads%2F2020%2F10%2Fdiagrams-net-example.png" alt="diagrams.net AWS architecture example"&gt;&lt;/a&gt;&lt;/p&gt;
diagrams.net AWS architecture example



&lt;p&gt;Sadly, in comparison with OmniGraffle, while diagrams.net wins in the accessibility category, its usability and user experience are, in my opinion, worse. Not bad, just worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;It’s not an especially long list. There are a lot more tools, toolkits, extensions, and plugins on the internet. From quite a few that I reviewed and tested, only the ones above survived the test of time. Maybe a list of “tools for AWS that I do not use” will appear someday?&lt;/p&gt;

&lt;p&gt;Of course, apart from AWS-related tools, there are a lot of other ones that I use. But that may be a topic for another post.&lt;/p&gt;

&lt;p&gt;Maybe you have some tools not listed here that you find extremely useful when working with AWS? Or at least ones that solve some minor inconveniences – that’s important as well. If so, let me know in the comments, and I will be happy to check them out!&lt;/p&gt;

&lt;p&gt;Toolbox icon in the featured image made by &lt;a href="https://smashicons.com/" rel="noopener noreferrer"&gt;Smashicons&lt;/a&gt; from &lt;a href="https://www.flaticon.com/" rel="noopener noreferrer"&gt;www.flaticon.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>aws</category>
      <category>serverless</category>
      <category>tools</category>
    </item>
    <item>
      <title>⚡ Speed up everyday work with handy Git aliases</title>
      <dc:creator>Maciej Radzikowski</dc:creator>
      <pubDate>Thu, 17 Sep 2020 10:01:09 +0000</pubDate>
      <link>https://forem.com/mradzikowski/speed-up-everyday-work-with-handy-git-aliases-5ddl</link>
      <guid>https://forem.com/mradzikowski/speed-up-everyday-work-with-handy-git-aliases-5ddl</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was originally published at &lt;a href="https://betterdev.blog/handy-git-aliases/"&gt;https://betterdev.blog/handy-git-aliases/&lt;/a&gt; - check it out for more related content.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Git allows us to define &lt;a href="https://git-scm.com/book/en/v2/Git-Basics-Git-Aliases"&gt;aliases&lt;/a&gt;, which are essentially our own custom commands. They may be just calls to other commands with parameters, or even entire shell scripts. The possibilities are unlimited.&lt;/p&gt;

&lt;p&gt;Do you google for that one &lt;a href="https://git-scm.com/"&gt;Git&lt;/a&gt; command you forget every time? Do you often execute several commands one by one, always in the same combination, to achieve a single effect? Or have you seen a really nice Git command on the internet, but with way too many flags to use it in real life? Git aliases are the solution.&lt;/p&gt;

&lt;p&gt;Here I will show the Git aliases I use in everyday work, with explanations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Defining first Git alias
&lt;/h1&gt;

&lt;p&gt;Aliases are part of the &lt;a href="https://git-scm.com/docs/git-config"&gt;Git configuration&lt;/a&gt; that is saved to the &lt;code&gt;~/.gitconfig&lt;/code&gt; file. They can be added or modified directly by editing this file, or by executing a command that will do this for us.&lt;/p&gt;

&lt;p&gt;Let’s create an alias for the &lt;code&gt;git status&lt;/code&gt; command that will be just &lt;code&gt;git s&lt;/code&gt; – a simple abbreviation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; alias.s status
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, we can open the &lt;code&gt;~/.gitconfig&lt;/code&gt; file (creating it if it does not already exist) and add these config lines by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;s&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;status&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In both cases, we add a new alias named &lt;code&gt;s&lt;/code&gt; for the &lt;code&gt;status&lt;/code&gt; command. From then on, these two calls are equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git status
git s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;More aliases may be added to the same &lt;code&gt;[alias]&lt;/code&gt; block in the &lt;code&gt;~/.gitconfig&lt;/code&gt; file – we don’t need to repeat it. We just add the next lines under it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Useful Git aliases for everybody
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Handy shortcuts for Git commands
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;s&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;status&lt;/span&gt;
    &lt;span class="py"&gt;c&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;commit&lt;/span&gt;
    &lt;span class="py"&gt;go&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;checkout&lt;/span&gt;
    &lt;span class="py"&gt;gob&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;checkout -b&lt;/span&gt;
    &lt;span class="py"&gt;d&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;diff&lt;/span&gt;
    &lt;span class="py"&gt;dc&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;diff --cached&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It may look silly at first. Oh, we can save 5 letters calling &lt;code&gt;git status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But when you start to use those short command versions, you will find it frustrating to go back. These versions are two times shorter (including the &lt;code&gt;git&lt;/code&gt; call at the beginning) than the normal ones, meaning you type them two times faster. And for the &lt;code&gt;git status&lt;/code&gt; and &lt;code&gt;git commit&lt;/code&gt; commands, which we usually execute multiple times a day, that’s a significant change. You stop spending time typing commands and analyze the results instead. It’s also one second less of distraction from work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bi3ksxwW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/04z8dh15bb989g0xjgys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bi3ksxwW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/04z8dh15bb989g0xjgys.png" alt="carbon (3)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The alias I use to move between branches, &lt;code&gt;git go&lt;/code&gt;, has a newer alternative available since Git 2.23: &lt;code&gt;git switch&lt;/code&gt;, a command dedicated to changing branches. But still, I prefer my shorter version.&lt;/p&gt;
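&lt;p&gt;If you prefer the newer command, nothing stops you from aliasing it the same way – one possible variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;[alias]
    # switch to an existing branch
    sw  = switch
    # create a new branch and switch to it
    swc = switch -c
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;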

&lt;h2&gt;
  
  
  Beautiful and meaningful Git history log
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[pretty]&lt;/span&gt;
    &lt;span class="py"&gt;better-oneline&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"format:%C(auto)%h%d %s %Cblue[%cn]"&lt;/span&gt;

&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;tree&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;log --pretty=better-oneline --all --graph&lt;/span&gt;
    &lt;span class="py"&gt;ls&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;log --pretty=better-oneline&lt;/span&gt;
    &lt;span class="py"&gt;ll&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;log --pretty=better-oneline --numstat&lt;/span&gt;

    &lt;span class="py"&gt;details&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"!f() { git ll "$1"^.."$1"; }; f"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For these, we first need to declare a new &lt;code&gt;git log&lt;/code&gt; format, so we can use it in all three aliases.&lt;/p&gt;

&lt;p&gt;The first 3 aliases (lines #5-7) show the Git history in a nicer form. &lt;code&gt;git tree&lt;/code&gt; is my favorite and the most used, as it shows a branching graph similar to the one on GitHub or GitLab. Irreplaceable in a &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/"&gt;branch-based workflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can think of &lt;code&gt;git ls&lt;/code&gt; and &lt;code&gt;git ll&lt;/code&gt; as equivalents of the shell &lt;code&gt;ls&lt;/code&gt; command for listing files. The first version shows the commit history in a nice, one-line-per-commit format. The second additionally shows the modified files, with the number of added and removed lines in each of them.&lt;/p&gt;

&lt;p&gt;The custom format we define makes the output of all 3 commands nice and meaningful. We get the short commit hash, the message, the author, and the branch name pointing to that commit, if any.&lt;/p&gt;

&lt;p&gt;The last one, &lt;code&gt;git details&lt;/code&gt; (line #9), shows the same statistics as &lt;code&gt;git ll&lt;/code&gt;, but for a single commit. We use it by providing a commit reference as an argument – for example, &lt;code&gt;git details HEAD&lt;/code&gt; to show the files modified in the last commit.&lt;/p&gt;

&lt;p&gt;This last alias has a different syntax than all the previous ones. It starts with an exclamation mark (&lt;code&gt;!&lt;/code&gt;), which makes Git execute it as a shell command rather than a &lt;code&gt;git&lt;/code&gt; subcommand. Such commands are always executed in the repository root directory. Additionally, we need to wrap it in a function to safely use positional arguments – see the explanation &lt;a href="https://blog.theodo.fr/2017/06/git-game-advanced-git-aliases/"&gt;here&lt;/a&gt;.&lt;/p&gt;
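&lt;p&gt;For example (the second commit hash here is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git details HEAD     # files changed in the last commit
git details abc1234  # files changed in the given commit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;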

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l_Z9jXpk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ohdbm69j4nod74j4q0id.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l_Z9jXpk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ohdbm69j4nod74j4q0id.png" alt="carbon (4)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Plurals for listing all elements
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;branches&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;branch -a&lt;/span&gt;
    &lt;span class="py"&gt;tags&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;tag&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;These two aliases are very helpful, especially for beginners. They make listing all branches and tags as easy as using the plural form of the word. They also fix an inconsistency, as the native commands list all branches and all tags in different ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fast files adding and committing
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;a&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;!cd ${GIT_PREFIX:-.} &amp;amp;&amp;amp; git add . &amp;amp;&amp;amp; git s&lt;/span&gt;
    &lt;span class="py"&gt;aa&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;!git add -A &amp;amp;&amp;amp; git s&lt;/span&gt;

    &lt;span class="py"&gt;ac&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;!cd ${GIT_PREFIX:-.} &amp;amp;&amp;amp; git add . &amp;amp;&amp;amp; git c&lt;/span&gt;
    &lt;span class="py"&gt;aac&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;!git add -A &amp;amp;&amp;amp; git c&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;These commands will speed up committing but need to be used with care.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git a&lt;/code&gt; will add files from the current directory and its subdirectories to the &lt;a href="https://softwareengineering.stackexchange.com/a/119790"&gt;index/staging area&lt;/a&gt;, ready to be committed. It will also show &lt;code&gt;git status&lt;/code&gt; right away, so you see the state of your repository after the call. Remember to check that no random files were added by mistake.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git aa&lt;/code&gt; will do a very similar thing, but it will add all modified files from the whole repository, no matter which directory you call it from. You can memorize these two commands as “add” and “add all”.&lt;/p&gt;

&lt;p&gt;The other two aliases are extended versions of the previous ones – they additionally commit the changes. Using them is very helpful and time-saving, but we need to be sure that only what we want gets committed.&lt;/p&gt;
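&lt;p&gt;Since Git appends any extra arguments to the end of a shell-style alias, the commit message can be passed directly (the messages here are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git ac -m "Fix typo in README"        # stage the current directory, then commit
git aac -m "Update all dependencies"  # stage everything, then commit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;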

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OWOTaGoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ebo3axqzl0vu5b0gjimy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OWOTaGoS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ebo3axqzl0vu5b0gjimy.png" alt="carbon (5)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Clearing Git workspace and index
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;unstage&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;reset HEAD&lt;/span&gt;
    &lt;span class="py"&gt;cleanout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;!git clean -df &amp;amp;&amp;amp; git checkout -- .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first alias, &lt;code&gt;unstage&lt;/code&gt;, does the opposite of the &lt;code&gt;git aa&lt;/code&gt; alias we defined previously. It’s a shortcut for removing all files from the index/staging area. It doesn’t undo the changes made in the files – they just go back to the workspace.&lt;/p&gt;

&lt;p&gt;The next alias behaves differently. It acts on the files in the workspace (not yet added to the staging area with &lt;code&gt;git add&lt;/code&gt;). &lt;code&gt;git cleanout&lt;/code&gt; will undo all changes in the workspace files and remove all new, never-committed files. Effectively, we get back the clean repository state of the last commit.&lt;/p&gt;
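&lt;p&gt;A typical cleanup could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git unstage   # move everything back from the index to the workspace
git cleanout  # then drop all uncommitted changes for a clean state
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;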

&lt;p&gt;&lt;strong&gt;Careful&lt;/strong&gt; – &lt;code&gt;cleanout&lt;/code&gt; will cause you to lose all uncommitted changes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2pW8Vklk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/56kx7b92m6wpzht4hj9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2pW8Vklk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/56kx7b92m6wpzht4hj9g.png" alt="carbon (6)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Undoing commits and merges
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;uncommit&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;reset --soft HEAD~1&lt;/span&gt;
    &lt;span class="py"&gt;unmerge&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;reset --hard ORIG_HEAD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;How many times have you realized that you committed or merged branches too soon? Here are commands to easily revert that. They do exactly what their names suggest.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git uncommit&lt;/code&gt; will remove the last commit. All committed changes go back to the index – nothing is lost. You can change them and commit again. It’s helpful instead of &lt;code&gt;git commit --amend&lt;/code&gt; when we need to make bigger modifications or wait longer before committing.&lt;/p&gt;

&lt;p&gt;The other alias does a similar thing for branch merges. When we run &lt;code&gt;git merge&lt;/code&gt; and then &lt;code&gt;git unmerge&lt;/code&gt; right away, we go back to the repository state from before the merge. This time, no files will be left in the index after the command, and if we resolved any merge conflicts, that work will be lost. It’s important to know that we can &lt;code&gt;unmerge&lt;/code&gt; safely only right after the merge, as it uses the special &lt;code&gt;ORIG_HEAD&lt;/code&gt; pointer created during the merge. Other Git commands may change it, so calling &lt;code&gt;unmerge&lt;/code&gt; in any other circumstances could cause us to lose other commits and their content.&lt;/p&gt;

&lt;p&gt;Both of these commands rewrite Git history by removing commits. A good practice is to never do such operations on commits already pushed to the remote repository. In fact, without special permissions on the remote repository, we won’t even be able to push such changes. Therefore, it’s best to use &lt;code&gt;uncommit&lt;/code&gt; and &lt;code&gt;unmerge&lt;/code&gt; only when we spot a mistake right after making it, so we can fix it before it leaves our computer.&lt;/p&gt;
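&lt;p&gt;A typical rescue could look like this (the branch name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git c -m "WIP"       # oops, committed too soon
git uncommit         # the changes are back in the index

git merge feature/x  # oops, merged the wrong branch
git unmerge          # back to the pre-merge state, right away
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;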

&lt;h2&gt;
  
  
  Showing Git merge details
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[alias]&lt;/span&gt;
    &lt;span class="py"&gt;merge-span&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"!f() { echo $(git log -1 $2 --merges --pretty=format:%P | cut -d' ' -f1)$1$(git log -1 $2 --merges --pretty=format:%P | cut -d' ' -f2); }; f"&lt;/span&gt;
    &lt;span class="py"&gt;merge-log&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"!git ls `git merge-span .. $1`"&lt;/span&gt;
    &lt;span class="py"&gt;merge-diff&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"!git diff `git merge-span ... $1`"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Git branch merges may cause a lot of headaches, especially for beginners. Here is a way to at least understand and review what a given merge introduced.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;merge-span&lt;/code&gt; is only a helper function, &lt;code&gt;merge-log&lt;/code&gt; and &lt;code&gt;merge-diff&lt;/code&gt; show us the list of commits added by a merge and all the changes it introduced. We can pass a merge commit hash as a parameter to review a chosen merge; otherwise, the last merge found on the current branch is shown.&lt;/p&gt;
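&lt;p&gt;For example (the commit hash is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git merge-log           # commits brought in by the last merge on this branch
git merge-diff          # all changes introduced by that merge
git merge-diff abc1234  # the same, for a specific merge commit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;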

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fa15Kp-F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/chig93o47kaylavjms3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fa15Kp-F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/chig93o47kaylavjms3u.png" alt="carbon (7)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Having aliases for the most common actions and using them will speed up and simplify everyday work, allowing us to get the most out of the version control system.&lt;/p&gt;

&lt;p&gt;There is nothing stopping you from adding more aliases. You can create them yourself based on your most common actions. There are also plenty of ready-to-use aliases created by various people. I’ve taken some of my aliases from such lists and got inspired by others. Here they are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/GitAlias/gitalias"&gt;https://github.com/GitAlias/gitalias&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gist.github.com/robmiller/6018582"&gt;https://gist.github.com/robmiller/6018582&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gggritso.com/human-git-aliases"&gt;https://gggritso.com/human-git-aliases&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have your own favorite aliases, share them in the comments.&lt;/p&gt;

</description>
      <category>git</category>
      <category>alias</category>
      <category>cli</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
