<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: fme Group</title>
    <description>The latest articles on Forem by fme Group (@fmegroup).</description>
    <link>https://forem.com/fmegroup</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F5595%2F1a54f357-3a74-4f10-91ff-9317ecb930ae.jpg</url>
      <title>Forem: fme Group</title>
      <link>https://forem.com/fmegroup</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/fmegroup"/>
    <language>en</language>
    <item>
      <title>Managing AWS Lambda Versions with AWS Step Functions: A comprehensive guide</title>
      <dc:creator>Maureen Plank</dc:creator>
      <pubDate>Fri, 19 Apr 2024 12:44:30 +0000</pubDate>
      <link>https://forem.com/fmegroup/managing-aws-lambda-versions-with-aws-step-functions-a-comprehensive-guide-2lfa</link>
      <guid>https://forem.com/fmegroup/managing-aws-lambda-versions-with-aws-step-functions-a-comprehensive-guide-2lfa</guid>
      <description>&lt;p&gt;Each AWS account has a regional default storage limit for Lambda and Lambda Layers (Zip archives) of 75 GB. This sounds a lot, and it can even be increased by demand, but many functions combined with frequent deployments might lead to storage pollution quite quickly. Therefore, it makes sense to clean up old deployment packages regularly. This blogpost introduces an AWS Step Functions state machine that is designed to automate the identification and deletion of older Lambda function versions.&lt;/p&gt;

&lt;h2&gt;Starting point&lt;/h2&gt;

&lt;p&gt;AWS creates a new version for each Lambda function deployment and keeps them all available – just in case an older one is needed in the future. Every deployment package is stored and counts toward the overall size limit of 75 GB. Even though this amount of storage is quite large, it is advisable (and recommended by AWS) to regularly clean up older versions that are no longer needed.&lt;/p&gt;

&lt;p&gt;As of now, AWS does not provide a direct solution to manage these versions, prompting developers to create custom solutions. One example of a custom solution is the use of a Lambda function like the Lambda Janitor by Yan Cui. Our idea for an implementation was a state machine – an automated and visual solution for the Lambda version management.&lt;/p&gt;

&lt;p&gt;This state machine handles all Lambda functions (optionally targeting specific ones based on predefined criteria) and deletes old versions based on a threshold that determines how many versions should be kept.&lt;br&gt;
A visual representation of the final workflow, as well as a brief explanation of the most important steps, is given below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpfkb3e0pc4d29jhtw6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpfkb3e0pc4d29jhtw6r.png" alt="Image description" width="800" height="910"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Retrieving and filtering the function names&lt;/h2&gt;

&lt;p&gt;The workflow begins by listing the names of all existing Lambda functions within the AWS region. The “lambda:listFunctions” API call returns up to 50 items per call, plus a marker token in case more results are available. To take this into account, the paginator pattern is applied after the “Iterate Functions” map state.&lt;/p&gt;

&lt;p&gt;For the first step, we transform the result with a ResultSelector to extract only the function names. This gives us an output like the following:&lt;/p&gt;
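&lt;p&gt;As a sketch, the corresponding task state in Amazon States Language could look like this (state and field names are assumptions for illustration, not taken from the original workflow; the marker handling for pagination is omitted for brevity):&lt;/p&gt;

```json
"List Functions": {
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:lambda:listFunctions",
  "ResultSelector": {
    "function_names.$": "$.Functions[*].FunctionName"
  },
  "Next": "Iterate Functions"
}
```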

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonxvldv3nguksytte4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdonxvldv3nguksytte4g.png" alt="Image description" width="392" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To iterate through the array of Lambda function names in the list, we use a Map state, which allows us to set a concurrency limit of 5. This limit helps control the number of simultaneous requests and prevents throttling.&lt;/p&gt;

&lt;p&gt;In our case, we only want to clean up versions of the Lambda functions that belong to our project. Other functions, which are managed by the platform team, must remain untouched.&lt;/p&gt;

&lt;p&gt;The Lambda functions from our project have a specific prefix in their name, which makes it easy to filter through all functions in the region. To do that, we added a “Choice” step that determines whether the state machine should continue the version cleanup or skip the current Lambda function. This prefix can also be retrieved from the state machine invocation event.&lt;/p&gt;
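&lt;p&gt;A minimal sketch of such a “Choice” state, assuming a hypothetical prefix “my-project-” and invented state names:&lt;/p&gt;

```json
"Matches Project Prefix?": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.function_name",
      "StringMatches": "my-project-*",
      "Next": "List Versions"
    }
  ],
  "Default": "Skip Function"
}
```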

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xpbip7hzwg6kooainvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xpbip7hzwg6kooainvo.png" alt="Image description" width="473" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Getting all the available versions&lt;/h2&gt;

&lt;p&gt;In the following step, all available versions are retrieved using the “lambda:listVersionsByFunction” API call. The outcome of this step is shown below – a list of all versions and the name of the function currently processed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ajx4ivll0tlrt2s5fck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ajx4ivll0tlrt2s5fck.png" alt="Image description" width="547" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keeping the function name in this step and adding it to the ResultPath is mandatory, because we will need it again a few steps later when deleting the versions.&lt;/p&gt;

&lt;p&gt;The “lambda:listVersionsByFunction” API call returns up to 50 items and a marker token to retrieve more results. To take care of the pagination, the paginator pattern would have been required again. However, we decided not to implement it in order to keep the workflow simpler.&lt;/p&gt;

&lt;p&gt;Deployments in our environment are infrequent enough that we would not reach more than 50 versions per Lambda function before the workflow runs the next time. In addition, the workflow, which is triggered by an AWS EventBridge scheduler, could simply run more often to avoid this situation – for instance, every day instead of every week.&lt;/p&gt;

&lt;h2&gt;Removing the $LATEST version and getting the version count&lt;/h2&gt;

&lt;p&gt;The Lambda API returns the versions such that the latest one (called $LATEST 😊) comes first, followed by all other versions in ascending order – from oldest to newest. We decided not to rely on this implicit order but to explicitly exclude the $LATEST version from all further processing steps and to sort the remaining version numbers – just in case AWS changes the response format.&lt;/p&gt;

&lt;p&gt;Unfortunately, AWS Step Functions provides only a limited set of array and JSON processing functions – so-called intrinsic functions. A Lambda function is required to perform the necessary work. Hopefully, AWS will add more power to the intrinsic function palette so that these kinds of simple helper functions are no longer necessary in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiws1sa2bpwh14sm6g42l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiws1sa2bpwh14sm6g42l.png" alt="Image description" width="541" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The outcome of this step consists of the Lambda function name, the array of versions to be processed, and their count.&lt;/p&gt;
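&lt;p&gt;Such a helper Lambda can be very small. Here is a minimal sketch in Python; the event field names (“function_name”, “versions”) are assumptions for illustration:&lt;/p&gt;

```python
def handler(event, context=None):
    """Helper Lambda: drop $LATEST and sort the remaining version numbers.

    Expects an event shaped like:
    {"function_name": "my-function", "versions": ["$LATEST", "2", "1"]}
    (field names are illustrative assumptions, not the original implementation).
    """
    # Exclude $LATEST explicitly instead of relying on the implicit API order.
    versions = [v for v in event["versions"] if v != "$LATEST"]
    # Published version qualifiers are numeric strings; sort oldest first.
    versions.sort(key=int)
    return {
        "function_name": event["function_name"],
        "versions": versions,
        "version_count": len(versions),
    }
```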

&lt;h2&gt;Checking the number of available versions and removing outdated ones&lt;/h2&gt;

&lt;p&gt;In the next step, the state machine compares the number of available versions against a threshold. This is done with a “Choice” state. In our case, we always want to keep the three most recent versions plus $LATEST, so “version_count” is compared with 3:&lt;/p&gt;
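&lt;p&gt;In ASL, such a threshold check might be sketched as follows (state names are assumptions for illustration):&lt;/p&gt;

```json
"More than 3 versions?": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.version_count",
      "NumericGreaterThan": 3,
      "Next": "Delete oldest version"
    }
  ],
  "Default": "Function cleaned up"
}
```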

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37g2zdnyftyvma6c3mfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F37g2zdnyftyvma6c3mfk.png" alt="Image description" width="645" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the number of available versions exceeds the threshold we set, the oldest version (the first position in the array) is deleted using the “lambda:deleteFunction” API call in the step “Delete oldest version”.&lt;/p&gt;

&lt;p&gt;After that, the payload needs to be modified, as the deleted version number must be removed and the version count adapted. A “Pass” state is used to drop the oldest version (the first position in the versions array) and to decrement the version counter.&lt;/p&gt;
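&lt;p&gt;A sketch of such a “Pass” state, using a JsonPath array slice to drop the first element and the States.MathAdd intrinsic function to decrement the counter (state names are illustrative assumptions):&lt;/p&gt;

```json
"Remove deleted version from payload": {
  "Type": "Pass",
  "Parameters": {
    "function_name.$": "$.function_name",
    "versions.$": "$.versions[1:]",
    "version_count.$": "States.MathAdd($.version_count, -1)"
  },
  "Next": "More than 3 versions?"
}
```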

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfj3cyrjo7g3jf83q3b2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfj3cyrjo7g3jf83q3b2.png" alt="Image description" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The execution keeps checking and deleting old versions until their number has reached the threshold value (three in this example).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvcfz71x4sddmq7j61el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvcfz71x4sddmq7j61el.png" alt="Image description" width="641" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One remaining thing to consider: Lambda aliases.&lt;/p&gt;

&lt;p&gt;A Lambda alias references one or two function versions, which cannot be deleted while the alias exists. It is possible to retrieve all aliases of a function via the “lambda:listAliases” API call. One way to take aliases into account would be to collect all referenced versions and remove them from the versions array before the deletion step. This requires some custom code in a Lambda function.&lt;/p&gt;

&lt;p&gt;Another option – which we have implemented – is to define an error handler for the “Delete oldest version” step. It catches the “Lambda.ResourceConflictException”, which is thrown when trying to delete a version that cannot be deleted, for instance because of an alias reference. This error handling ensures that the state machine does not fail in case of alias references or other unforeseen problems during deletion.&lt;/p&gt;
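&lt;p&gt;A sketch of the delete step with such a Catch handler attached (state names and the error-result path are assumptions for illustration):&lt;/p&gt;

```json
"Delete oldest version": {
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:lambda:deleteFunction",
  "Parameters": {
    "FunctionName.$": "$.function_name",
    "Qualifier.$": "$.versions[0]"
  },
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": ["Lambda.ResourceConflictException"],
      "ResultPath": "$.delete_error",
      "Next": "Remove deleted version from payload"
    }
  ],
  "Next": "Remove deleted version from payload"
}
```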

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30sq54bm8wp5z55ovisj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30sq54bm8wp5z55ovisj.png" alt="Image description" width="574" height="895"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The state machine finishes once every targeted Lambda function retains no more than the configured number of recent versions – all others have been deleted. Used regularly, this mechanism helps prevent deployment package pollution.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this guide on managing AWS Lambda versions with AWS Step Functions, I’ve shown how developers can leverage automation to streamline the management of Lambda function versions. By automating these processes, teams can focus more on development rather than maintenance, maintaining operational efficiency and resource optimization. Overall, this solution provides a practical, visual tool for managing Lambda functions effectively, helping users navigate the complexities of cloud operations with greater ease.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>stepfunctions</category>
      <category>lambda</category>
    </item>
    <item>
      <title>AWS documentation and service quotas are your friends – do not miss them!</title>
      <dc:creator>Thomas Laue</dc:creator>
      <pubDate>Fri, 06 Oct 2023 14:31:16 +0000</pubDate>
      <link>https://forem.com/fmegroup/aws-documentation-and-service-quotas-are-your-friends-do-not-miss-them-1n43</link>
      <guid>https://forem.com/fmegroup/aws-documentation-and-service-quotas-are-your-friends-do-not-miss-them-1n43</guid>
      <description>&lt;p&gt;Who loves reading docs? Probably not that many developers and engineers. Coding and developing are so much more exciting than spending hours reading tons of documentation. However just recently I was taught again that this is one of the big misconceptions and fallacies -- probably not only for me.&lt;/p&gt;

&lt;p&gt;AWS provides extensive documentation for each service, containing not just a general overview but, in most cases, deep knowledge of the details and specifics of the service. Most service documentation runs to hundreds of pages, including many examples and code snippets which are quite often helpful -- especially related to IAM policies. It is not always easy to find the relevant piece for a specific edge case, and it might be missing from time to time, but overall AWS has done a great job documenting its landscape.&lt;/p&gt;

&lt;p&gt;Service quotas, which exist for every AWS service, are another part that should not be missed, either when starting to work with a new AWS service or when using one more extensively. Many headaches and hours lost debugging an issue could be avoided by taking these quotas into account right from the start. Unfortunately, this lesson is too easy to forget, as will be shown in the following example.&lt;/p&gt;

&lt;p&gt;In a recent project, AWS DataSync was used to move about 40 million files from an AWS EFS share to an S3 bucket. The whole sync process was to be repeated from time to time after the initial sync to pick up new and updated files. AWS DataSync supports this scenario by applying an incremental approach after the first run.&lt;/p&gt;

&lt;p&gt;One DataSync location was created for EFS and another one for S3, and both were tied together by a DataSync task, which configures, among other things, the sync properties. The initial run of this task went fine: all files were synced after about 9 hours.&lt;/p&gt;

&lt;p&gt;Some days later a first incremental sync was started to reflect the changes which had happened on EFS since the first run. The task went into the preparation phase but broke after about 30 minutes with a strange error message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2WTT-UGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zka9hs2ri82cpk8gbmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2WTT-UGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6zka9hs2ri82cpk8gbmg.png" alt="AWS Cloudtrail extract showing a strange error message related to AWS DataSync" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Cannot allocate memory" -- what are you trying to tell me? No memory setting was configured in the DataSync task definition as no agent was involved. The first hit on Google shed some light on this problem by redirecting me to the documentation of AWS DataSync&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y0HvVXSQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0n18uymki8fgbr2sx22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y0HvVXSQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0n18uymki8fgbr2sx22.png" alt="Explanation of error message in AWS DataSync documentation" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;which contains a link to the DataSync task quotas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BTOmCL-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmcqpgjz15de4j9v37qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BTOmCL-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mmcqpgjz15de4j9v37qi.png" alt="Extract of AWS DataSync quotas" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apparently, 40 million files are far too many for one task, as only 25 million are supported when transferring files between AWS storage services. A request to AWS Support also confirmed that the problem was related to the large number of files. I have no idea why the initial run was able to complete while the follow-up one failed. Splitting the task up into several smaller ones solved the issue, so that the incremental run could finally succeed as well.&lt;/p&gt;

&lt;p&gt;Nevertheless, some hours were lost even though we learned something new.&lt;/p&gt;

&lt;p&gt;Lessons learned -- again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Embrace the docs -- even though they are really extensive!&lt;/li&gt;
&lt;li&gt;  Take the service quotas into account before starting to work and while working with an AWS service. They will become relevant one day -- possibly sooner rather than later!&lt;/li&gt;
&lt;li&gt;  AWS technical support really likes to help and is quite competent. Do not hesitate to contact them (if you have a support plan available).&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>cloudcomputing</category>
      <category>documentation</category>
    </item>
    <item>
      <title>Azure CosmosDB — why technology choices matter</title>
      <dc:creator>Jens Goldhammer</dc:creator>
      <pubDate>Wed, 09 Aug 2023 05:48:03 +0000</pubDate>
      <link>https://forem.com/fmegroup/azure-cosmosdb-why-technology-choices-matter-14j1</link>
      <guid>https://forem.com/fmegroup/azure-cosmosdb-why-technology-choices-matter-14j1</guid>
      <description>&lt;p&gt;Some months ago, my colleague Florian and me joined a development team of one of our clients. We are involved as architects and engineers of the application used in their retail stores.&lt;/p&gt;

&lt;p&gt;The client is currently migrating the core software from running decentralized in each retail store (with its own databases) to a central solution. They have invested heavily in Microsoft Azure as a cloud provider and are moving more and more workloads to the Azure cloud.&lt;/p&gt;

&lt;p&gt;The client currently uses MSSQL databases in combination with the open-source Firebird database and has started to migrate data into the cloud. Some years ago, they decided to use Cosmos DB as the standard database for all new services in the cloud, as it was the cheapest choice from their point of view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2JCNcOiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obkfadxlcenxqeeczsc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2JCNcOiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obkfadxlcenxqeeczsc2.png" alt="Image description" width="423" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;What is Azure Cosmos DB?&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB is Microsoft's solution for fast NoSQL databases. For those who live in the AWS (Amazon Web Services) world, Cosmos DB is comparable to DynamoDB. You can learn more about Cosmos DB here: &lt;a href="https://azure.microsoft.com/en-us/products/cosmos-db"&gt;https://azure.microsoft.com/en-us/products/cosmos-db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j1139z56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz3eorhfltnxmb5oxx06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j1139z56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz3eorhfltnxmb5oxx06.png" alt="Image description" width="800" height="429"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://azure.microsoft.com/en-us/products/cosmos-db"&gt;https://azure.microsoft.com/en-us/products/cosmos-db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working with Cosmos DB, you have to forget many of the things you learned in the relational database world. To design a good data model, you need to model your data according to your future access patterns, because the performance of Cosmos DB depends on its partitioning. You must therefore put more effort into data modelling upfront. You can find more about this here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cosmos DB instances can be created on demand and can be used in many programming languages.&lt;/p&gt;

&lt;p&gt;The unique point of Cosmos DB — in comparison to traditional relational databases — is the worldwide distribution of the stored data, the on-demand scalability, and the effortless way of getting data out of it. Thanks to its guaranteed low response times, Cosmos DB supports use cases in web, mobile, gaming, and IoT applications that handle many reads and writes.&lt;/p&gt;

&lt;p&gt;Further use cases can be found here: &lt;a href="https://learn.microsoft.com/en/azure/cosmos-db/use-cases"&gt;https://learn.microsoft.com/en/azure/cosmos-db/use-cases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having joined the project as an architect and engineer, I was critical of using Azure Cosmos DB from the start, as I am a big fan of relational databases, especially for transactional data. My Cosmos DB journey began with writing a centralized device service to store clients’ purchased devices. We have used Azure Functions to implement the business logic on top of Azure Cosmos DB to retrieve and store the data.&lt;/p&gt;

&lt;h1&gt;Structure of Azure Cosmos DB&lt;/h1&gt;

&lt;p&gt;Microsoft allows its customers to create several Cosmos DB instances in one Azure tenant. You can compare this to a database holding several tables. These instances can be used to separate workloads for different teams, stages, or use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr1S4Sfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4q0qfsc1yfudomsgj0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr1S4Sfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4q0qfsc1yfudomsgj0w.png" alt="Image description" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, in the document database world you can model your data without a fixed schema. You can create your own JSON structures, which keeps things flexible as well. Often the idea is to combine different data into one item to allow fast reads. To reference data in other domains, you can use unique identifiers, like the “id” property in the customer object.&lt;/p&gt;
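&lt;p&gt;As an illustration, a denormalized item combining a customer with its devices might look like this (the structure and property names are hypothetical, not taken from the project):&lt;/p&gt;

```json
{
  "id": "42",
  "type": "customer",
  "name": "Jane Doe",
  "devices": [
    { "deviceId": "d-100", "model": "pos-terminal", "purchasedAt": "2023-01-15" }
  ]
}
```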

&lt;p&gt;You can find more about the structure of Azure Cosmos DB here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Accessing data&lt;/h1&gt;

&lt;p&gt;Azure provides multiple ways to query data from Azure Cosmos DB.&lt;/p&gt;

&lt;p&gt;The following interfaces are possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  NoSQL API&lt;/li&gt;
&lt;li&gt;  MongoDB API&lt;/li&gt;
&lt;li&gt;  Cassandra API&lt;/li&gt;
&lt;li&gt;  Gremlin API&lt;/li&gt;
&lt;li&gt;  Table API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An overview can be found here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/choose-api"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/choose-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started using the NoSQL API in our project with its SQL-like interface. Coming from SQL-based relational databases, it was an easier migration path than the other interfaces. To access the data, you can also use the Data Explorer in the Azure Portal — it allows you to browse your collections and to query and manipulate data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNnqWtuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ad97nq1b88pvatz08ms1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNnqWtuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ad97nq1b88pvatz08ms1.png" alt="Image description" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Migration / Import mass data&lt;/h1&gt;

&lt;p&gt;Cosmos DB allows you to import data flexibly via different APIs.&lt;br&gt;
There are two options at the moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Import via the Azure Data Factory service, which can be used out of the box&lt;/li&gt;
&lt;li&gt;  Import via a custom CLI tool which uses the Cosmos DB API -&amp;gt; this tool needs to be developed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One way to import mass data is via Azure Data Factory, which allows you to pipeline and map data from various sources and import it into a Cosmos DB collection. We have used this mechanism a lot to transfer data from on-premises relational databases into the cloud and migrate it via pipelines into Cosmos DB collections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2BG2Ul3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcxu1at219jd051tqqk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2BG2Ul3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcxu1at219jd051tqqk0.png" alt="Image description" width="800" height="435"&gt;&lt;/a&gt;&lt;br&gt;
Source: Azure Portal example pipeline&lt;/p&gt;

&lt;p&gt;Azure Data Factory works quite well, is fast and very flexible, but has its own drawbacks and challenges. Unfortunately, this topic is a subject in itself, so we cannot go into more detail here.&lt;/p&gt;

&lt;p&gt;In the past we have also written our own CLI tools; they are more flexible for the data mapping and can be reviewed more easily by other team members. By using bulk import with parallel threads against the Cosmos DB API, you can be as fast as importing data via Azure Data Factory.&lt;/p&gt;

&lt;p&gt;You can find a list of available SDKs here: &lt;a href="https://developer.azurecosmosdb.com/community/sdk"&gt;https://developer.azurecosmosdb.com/community/sdk&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Querying data&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB provides several APIs to retrieve data out of the containers. We decided to use the SQL interface to fetch data in our Azure Functions.&lt;/p&gt;

&lt;p&gt;You can, for example, use an SQL-like query like the following to select all devices of a customer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lOZkFA10--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sosl30hu1enrvoqqnqd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lOZkFA10--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sosl30hu1enrvoqqnqd1.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;
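&lt;p&gt;Spelled out in text, such a query might read as follows (the container and property names are hypothetical, not taken from the project):&lt;/p&gt;

```sql
SELECT d.deviceId, d.model
FROM devices d
WHERE d.customerId = "42"
```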

&lt;p&gt;This looks familiar, right?&lt;/p&gt;

&lt;p&gt;After some time, you notice that the SQL capabilities are limited, as Azure Cosmos DB implements only a subset of the SQL specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Cosmos DB does not allow joining items from different collections — it only allows joining an item with itself, which means that you need to read data from different collections separately in order to combine it. The documentation says that you have to change your data model if you need joins.&lt;/li&gt;
&lt;li&gt;  Cosmos DB provides functions as well, but you may know only a few of them, and some work completely differently than you may know from SQL. You have to learn the Cosmos-specific syntax, as there is no standard for querying data in NoSQL databases.&lt;/li&gt;
&lt;li&gt;  Cosmos DB has limited capabilities for GROUP BY with a HAVING clause. Sometimes there are workarounds, sometimes not.&lt;/li&gt;
&lt;li&gt;  Cosmos DB supports OFFSET and LIMIT, but the implementation is very slow, and you should use continuation tokens instead. Why? If you are interested in understanding this, read here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/offset-limit"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/offset-limit&lt;/a&gt; and &lt;a href="https://stackoverflow.com/questions/58771772/cosmos-db-paging-performance-with-offset-and-limit"&gt;https://stackoverflow.com/questions/58771772/cosmos-db-paging-performance-with-offset-and-limit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
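&lt;p&gt;The short version of the OFFSET problem: the engine has to scan and discard every skipped item on each page request, while a continuation token lets it resume exactly where the previous page ended. A plain-Python sketch of token-based paging (the azure-cosmos SDK exposes the same idea through continuation tokens when iterating query results page by page):&lt;/p&gt;

```python
def fetch_page(items, page_size, token=None):
    """Token-based paging: the token records where the last page ended,
    so the next request resumes there instead of re-scanning the skipped
    items the way OFFSET does."""
    start = 0 if token is None else token
    page = items[start:start + page_size]
    next_start = start + page_size
    next_token = None if next_start >= len(items) else next_start
    return page, next_token

# Page through seven items, three at a time.
items = [f"device-{i}" for i in range(7)]
pages, token = [], None
while True:
    page, token = fetch_page(items, 3, token)
    pages.append(page)
    if token is None:
        break
```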

&lt;p&gt;In my experience, you often find good workarounds or a completely Cosmos-specific way, but sometimes there is no solution at all, which is a little bit frustrating.&lt;/p&gt;

&lt;p&gt;Nevertheless, the most painful issue was that Cosmos DB often reports errors in your query with only the message “One of the input values is invalid.”, without any useful hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--66Nacy9W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z7p6d6qeawgt7v7193gz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--66Nacy9W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z7p6d6qeawgt7v7193gz.png" alt="Image description" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, I made the mistake of putting a semicolon at the end of the query.&lt;/p&gt;

&lt;h1&gt;
  
  
  Updating data
&lt;/h1&gt;

&lt;p&gt;Updating one or multiple rows in a relational database with one SQL statement is a common request for processing data.&lt;/p&gt;

&lt;p&gt;Azure Cosmos DB allows you to update exactly one item within a container, and doing so requires multiple requests.&lt;/p&gt;

&lt;p&gt;The procedure looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Retrieve the whole document you want to update&lt;/li&gt;
&lt;li&gt;  Update the fields you want to update in your application code&lt;/li&gt;
&lt;li&gt;  Write back the whole document to Cosmos DB&lt;/li&gt;
&lt;/ul&gt;
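&lt;p&gt;The three steps above boil down to a read-modify-write cycle. A sketch against an in-memory dict standing in for the container (with the azure-cosmos SDK, the read and write would be &lt;code&gt;read_item&lt;/code&gt; and &lt;code&gt;replace_item&lt;/code&gt; calls on the container client):&lt;/p&gt;

```python
# In-memory stand-in for a Cosmos DB container, keyed by item id.
container = {"device-1": {"id": "device-1", "status": "active", "owner": "alice"}}

def update_item(container, item_id, changes):
    # 1. Retrieve the whole document.
    doc = dict(container[item_id])
    # 2. Update the fields in your application code.
    doc.update(changes)
    # 3. Write back the whole document.
    container[item_id] = doc
    return doc

updated = update_item(container, "device-1", {"status": "inactive"})
```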

&lt;p&gt;Microsoft provides a newer SQL update API to update one item without reading it first. The syntax for updating data follows the JSON Patch standard (&lt;a href="https://jsonpatch.com"&gt;https://jsonpatch.com&lt;/a&gt;). This feature spent a long time in preview and is now generally available in Azure Cosmos DB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-getting-started?tabs=dotnet"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-getting-started?tabs=dotnet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mass updates to a larger set of documents cannot be done out of the box with one SQL statement; you must update each document separately. This limitation is particularly surprising when you want to evolve your document schema.&lt;/p&gt;

&lt;p&gt;Yes, you can write a tool based on the bulk API. But updating a lot of data this way is slow and involves much more effort than writing a single update query, as you would in the relational world.&lt;/p&gt;
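&lt;p&gt;In practice a schema migration therefore becomes a loop over all affected documents, one write each. A sketch over plain dicts (each assignment stands in for a per-document replace or patch call):&lt;/p&gt;

```python
def migrate(docs, field, default):
    """Add a missing field to every document, one write per document,
    mimicking the query-then-update-each loop Cosmos DB forces on you."""
    writes = 0
    for doc in docs:
        if field not in doc:
            doc[field] = default   # in reality: a replace_item or patch_item call
            writes += 1
    return writes

docs = [{"id": "a"}, {"id": "b", "version": 2}, {"id": "c"}]
writes = migrate(docs, "version", 1)
```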

&lt;h1&gt;
  
  
  Deleting data
&lt;/h1&gt;

&lt;p&gt;Deleting data in Azure Cosmos DB is generally possible by removing one item at a time via the API. But unfortunately there is no support for SQL DELETE statements!&lt;/p&gt;

&lt;p&gt;In general, the limitations on mass operations in Cosmos DB have their reasons, for instance the guaranteed response times for any action in Cosmos DB: operations on a bigger set of data might lead to higher execution times.&lt;/p&gt;

&lt;p&gt;Still, this is an annoying point while writing and testing your software. Sometimes you need to remove specific data very quickly. One workaround is to drop the whole container and create your test data again, but often you want to keep specific data in it.&lt;/p&gt;

&lt;p&gt;For example, we had to remove two million entries from a collection to repeat a migration while keeping the other data in the collection. Using a self-developed tool, this took half an hour.&lt;/p&gt;
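&lt;p&gt;Such a tool is essentially a query followed by one delete call per item. A sketch of the pattern over an in-memory dict (in the real SDK each removal is a single &lt;code&gt;delete_item&lt;/code&gt; request, which is why two million entries take so long):&lt;/p&gt;

```python
def delete_where(container, predicate):
    """Delete matching items one at a time; there is no DELETE ... WHERE.
    With the azure-cosmos SDK, each removal is a delete_item(id, partition_key) call."""
    doomed = [item_id for item_id, doc in container.items() if predicate(doc)]
    for item_id in doomed:
        del container[item_id]
    return len(doomed)

container = {
    "m-1": {"id": "m-1", "migrated": True},
    "m-2": {"id": "m-2", "migrated": True},
    "keep": {"id": "keep", "migrated": False},
}
deleted = delete_where(container, lambda d: d["migrated"])
```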

&lt;h1&gt;
  
  
  Transactions
&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB provides a simple transactional concept. It allows you to group a set of operations into a single batch. Unfortunately, the batch concept is not nicely integrated into the API: you cannot wrap your code in a transactional block the way relational database interfaces allow.&lt;/p&gt;

&lt;p&gt;Additionally, transactions cannot update documents from different partitions, which is understandable from a technical point of view but very limiting. In our service we had the use case of updating several documents across several partitions at once, and in the end we had to live without transactions.&lt;/p&gt;
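&lt;p&gt;To illustrate the constraint, here is a toy all-or-nothing batch over an in-memory store that, like Cosmos DB's transactional batch, refuses writes outside the given partition; the rollback logic is purely illustrative:&lt;/p&gt;

```python
def execute_batch(container, partition_key, operations):
    """All-or-nothing batch, restricted to one partition like Cosmos DB's
    transactional batch. Each operation is ('upsert', doc) or ('delete', id)."""
    snapshot = {k: dict(v) for k, v in container.items()}
    try:
        for kind, payload in operations:
            if kind == "upsert":
                if payload["pk"] != partition_key:
                    raise ValueError("cross-partition writes are not allowed")
                container[payload["id"]] = payload
            elif kind == "delete":
                del container[payload]
        return True
    except Exception:
        container.clear()
        container.update(snapshot)   # roll back: none of the operations apply
        return False

store = {"a": {"id": "a", "pk": "p1"}}
ok = execute_batch(store, "p1", [("upsert", {"id": "b", "pk": "p1"}), ("delete", "a")])
failed = execute_batch(store, "p1", [("upsert", {"id": "c", "pk": "p2"})])
```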

&lt;h1&gt;
  
  
  Ecosystem &amp;amp; Community
&lt;/h1&gt;

&lt;p&gt;Starting with Cosmos DB, I was very surprised to find such limited resources, articles and tools around the platform. But I quickly understood why: Cosmos DB is an exclusive, commercial service of Microsoft and is not as popular as, for example, Amazon DynamoDB.&lt;/p&gt;

&lt;p&gt;Microsoft itself provides only limited tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Azure portal with the Data Explorer (&lt;a href="https://cosmos.azure.com"&gt;https://cosmos.azure.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;a Visual Studio Code extension: &lt;a href="https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-cosmosdb"&gt;https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-cosmosdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;a Cosmos DB emulator which runs natively under Windows: &lt;a href="https://learn.microsoft.com/de-de/azure/cosmos-db/data-explorer"&gt;https://learn.microsoft.com/de-de/azure/cosmos-db/data-explorer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The community has written some tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Cosmos DB Explorer for Windows: &lt;a href="https://github.com/sachabruttin/CosmosDbExplorer"&gt;https://github.com/sachabruttin/CosmosDbExplorer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CosmicClone to clone data from one container to another &lt;a href="https://github.com/microsoft/CosmicClone"&gt;https://github.com/microsoft/CosmicClone&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, there is no big community around Cosmos DB like there is for PostgreSQL or MySQL/MariaDB. Most of the well-known, multi-vendor database tools out there do not support Azure Cosmos DB, mostly because it works completely differently from relational databases.&lt;/p&gt;

&lt;h1&gt;
  
  
  Advanced topics
&lt;/h1&gt;

&lt;p&gt;Additionally, Azure Cosmos DB allows you to use stored procedures. Wait, aren't stored procedures a thing from the last century? Why should we use them? You will probably notice that you need stored procedures for some scenarios, such as a mass deletion of entries in a collection, as this is not supported out of the box.&lt;/p&gt;

&lt;p&gt;Stored procedures in Cosmos DB are written in JavaScript. As you may know, testing this kind of code is challenging, and besides that, most of the backend developers in our team are not familiar with JavaScript. Due to these challenges, we decided to use them only for administrative purposes, not within the application!&lt;/p&gt;

&lt;p&gt;There are many advanced topics for Cosmos DB, such as scaling and partition keys, which deserve their own blog post. You can read more about them in the official documentation: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;p&gt;Using a document-based database is not a no-brainer. Document-based databases like Azure Cosmos DB are not a replacement for relational databases, and they were never intended to be.&lt;/p&gt;

&lt;p&gt;Yes, Azure Cosmos DB has its use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you have a “write once, read many” use case (for example, just storing data with a stable structure), you can use it.&lt;/li&gt;
&lt;li&gt;  If you need global distribution of your data, you probably need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is that business applications very often do not have these requirements. In my opinion, most applications do not need to scale this way (unless you are Amazon, Microsoft, Netflix or another global player).&lt;/p&gt;

&lt;p&gt;On the other hand, Azure Cosmos DB has some heavy limitations when working with the data, especially if you want to evolve your schema. If you want to store relational data in Cosmos DB and the data changes a lot over time, Cosmos DB makes things very complicated and is currently not a good choice from my point of view.&lt;/p&gt;

&lt;p&gt;Besides these considerations, one task proved very important right from the beginning: designing how to model and partition your data. But that is a story of its own.&lt;/p&gt;

&lt;h1&gt;
  
  
  Resources
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://medium.com/@yurexus/features-and-pitfalls-of-azure-cosmos-db-3b18c7831255"&gt;https://medium.com/@yurexus/features-and-pitfalls-of-azure-cosmos-db-3b18c7831255&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://mattruma.com/adventures-with-azure-cosmos-db-limit-query-rus/"&gt;https://mattruma.com/adventures-with-azure-cosmos-db-limit-query-rus/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://blog.scooletz.com/2019/06/04/Cosmos%20DB-and-its-limitations"&gt;https://blog.scooletz.com/2019/06/04/Cosmos DB-and-its-limitations&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>cloud</category>
      <category>cosmosdb</category>
    </item>
    <item>
      <title>Microsoft Graph API- a practical example in python</title>
      <dc:creator>Torben Bruns</dc:creator>
      <pubDate>Thu, 23 Mar 2023 13:58:21 +0000</pubDate>
      <link>https://forem.com/fmegroup/microsoft-graph-api-a-practical-example-in-python-2eb4</link>
      <guid>https://forem.com/fmegroup/microsoft-graph-api-a-practical-example-in-python-2eb4</guid>
      <description>&lt;p&gt;Nothing is as constant as change." Following this theme, Microsoft is planning to discontinue Azure AD Graph in 2023 and introduce something new: Microsoft Graph. It will not only replace the former API but also enhance it with new capabilities. Apart from interacting with Azure AD Graph, the new API can also communicate with Microsoft 365 products. If you want a successful pipeline run to post a message in a Microsoft Teams channel, Microsoft Graph can do it. And if an application needs to send emails to users, Microsoft Graph can also handle that.&lt;/p&gt;

&lt;p&gt;To put it simply, Microsoft Graph is a REST API that acts as a gateway to the numerous services Microsoft 365 offers [1].&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Microsoft Graph in your environment
&lt;/h2&gt;

&lt;p&gt;To begin with, you need an active subscription for Microsoft 365. The actual plan does not matter, as even the Basic tier is sufficient. If you want to get a first look at the API's capabilities, check out Microsoft Graph Explorer. (&lt;a href="https://developer.microsoft.com/en-us/graph/graph-explorer" rel="noopener noreferrer"&gt;https://developer.microsoft.com/en-us/graph/graph-explorer&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo9kexfyouuydyjxdwl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo9kexfyouuydyjxdwl9.png" alt="MS Graph Explorer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing our own application
&lt;/h2&gt;

&lt;p&gt;If you want to create your own application, let's get started. Let's consider an application that monitors inventory stock. As soon as the stock falls below a certain number, an email should be sent to the orders team.&lt;/p&gt;

&lt;p&gt;We will focus on the following things:&lt;br&gt;
• Registering an application in Azure AD&lt;br&gt;
• Setting up a Graph Client in Python&lt;br&gt;
• Sending an email&lt;/p&gt;

&lt;p&gt;The image below visualizes what we want to achieve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g1t9k2im6ncagfr8quk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g1t9k2im6ncagfr8quk.jpeg" alt="Script overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Monitoring the stock is not covered within this article.&lt;/p&gt;
&lt;h2&gt;
  
  
  AzureAD Registration
&lt;/h2&gt;

&lt;p&gt;There are two types of permissions in AzureAD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delegated permissons&lt;/li&gt;
&lt;li&gt;Application permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With delegated permissions, the application acts as a logged-in user, like the Graph Explorer does. Application permissions, on the other hand, allow the app to act as its own entity rather than on behalf of a user. The downside is that this type of permission requires administrative rights.&lt;br&gt;
After this short explanation of the permission types in Azure, let us begin by registering an application in AzureAD.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to portal.azure.com and login with your credentials&lt;/li&gt;
&lt;li&gt;Click on Azure Active Directory&lt;/li&gt;
&lt;li&gt;From the left side select App Registrations&lt;/li&gt;
&lt;li&gt;Click on New Registration and copy the configuration from the image below&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82nviq6sbe5krhmctqaq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82nviq6sbe5krhmctqaq.png" alt="App Registration Example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The supported account types can be adjusted to your needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on the newly created app registration&lt;/li&gt;
&lt;li&gt;Select Authentication from the menu on the right&lt;/li&gt;
&lt;li&gt;Add a new Authentication of type Mobile and desktop application&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For our example to work, enter the configuration below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70rjddj9jl26njuqln3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70rjddj9jl26njuqln3x.png" alt="Example Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch the slider for Allow Public Client Flows to the “on” position and save&lt;/li&gt;
&lt;li&gt;From the menu select Certificates &amp;amp; Secrets&lt;/li&gt;
&lt;li&gt;Add a new client secret and remember to save it as it is only shown once&lt;/li&gt;
&lt;li&gt;Go to API permissions and select permissions like shown below&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9n38donnmleectn7imj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9n38donnmleectn7imj.png" alt="Permissions in Azure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is all, the configuration of the application in the Azure portal is done.&lt;/p&gt;

&lt;p&gt;Save the following values for later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client id&lt;/li&gt;
&lt;li&gt;Client secret&lt;/li&gt;
&lt;li&gt;Tenant id&lt;/li&gt;
&lt;li&gt;Implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the sake of simplicity, I used Python. Microsoft offers SDKs for different languages like &lt;code&gt;C#&lt;/code&gt;, &lt;code&gt;Java&lt;/code&gt;, &lt;code&gt;Go&lt;/code&gt; and &lt;code&gt;PHP&lt;/code&gt;. Still, all that is necessary is making HTTP calls, so if there is no SDK for your specific language, you only lose some comfort.&lt;/p&gt;

&lt;p&gt;Let us have a look at the source code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jom1e6y7wt3o6iy6t0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jom1e6y7wt3o6iy6t0.png" alt="Python Code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The packages &lt;code&gt;msgraph&lt;/code&gt; and &lt;code&gt;azure&lt;/code&gt; make it relatively simple to implement a Microsoft Graph API client. First, a GraphClient is created, which then queries the API for a list of users. Then, we call the "send_mail" function, which takes a GraphClient and userlist as inputs. It sends an email with some example text on behalf of the first user found in the list using their Outlook account to the recipients listed under the keyword "toRecipients". If you want to know the exact mechanism, please refer to Microsoft’s documentation [4].&lt;/p&gt;
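&lt;p&gt;For reference, the JSON body posted to the &lt;code&gt;sendMail&lt;/code&gt; endpoint has the following shape (the helper name is illustrative; the structure follows Microsoft's documented sendMail request body):&lt;/p&gt;

```python
def build_mail(subject, text, recipients):
    """Build the JSON body for POST /users/{id}/sendMail (Microsoft Graph)."""
    return {
        "message": {
            "subject": subject,
            "body": {"contentType": "Text", "content": text},
            "toRecipients": [
                {"emailAddress": {"address": addr}} for addr in recipients
            ],
        },
        "saveToSentItems": "true",
    }

payload = build_mail("Stock alert", "Item X is below threshold.", ["orders@example.com"])
```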

&lt;p&gt;A mail is not limited to plain text; it is also possible to send attachments through a call to the URL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/users/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;userPrincipalName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;/mailFolders/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;/messages/&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;/attachments&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result of the above call to the API looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynp7y0yxuq3bgt3pzvdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fynp7y0yxuq3bgt3pzvdd.png" alt="Email result"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Graph API is a powerful gateway to the services offered by Microsoft. There are numerous applications imaginable, such as status updates on pipeline runs through Teams, email notifications like in the example, or user management within Azure AD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/de-de/graph/overview" rel="noopener noreferrer"&gt;MS Graph Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/de-de/graph/outlook-things-to-know-about-send-mail" rel="noopener noreferrer"&gt;MS Outlook Graph API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/graph/api/resources/mail-api-overview?view=graph-rest-1.0" rel="noopener noreferrer"&gt;Mail API overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mattfeltonma/query-microsoft-graph" rel="noopener noreferrer"&gt;MS Graph example queries&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/develop/permissions-consent-overview" rel="noopener noreferrer"&gt;Permission overview&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>microsoftgraph</category>
      <category>python</category>
      <category>cloud</category>
    </item>
    <item>
      <title>A cross account cost overview dashboard powered by Lambda, Step Functions, S3 and Quicksight</title>
      <dc:creator>Thomas Laue</dc:creator>
      <pubDate>Fri, 17 Feb 2023 13:01:44 +0000</pubDate>
      <link>https://forem.com/fmegroup/a-cross-account-cost-overview-dashboard-powered-by-lambda-step-functons-s3-and-quicksight-22c1</link>
      <guid>https://forem.com/fmegroup/a-cross-account-cost-overview-dashboard-powered-by-lambda-step-functons-s3-and-quicksight-22c1</guid>
      <description>&lt;p&gt;Keeping an eye on cloud spendings in AWS or any other cloud service&lt;br&gt;
provider is one of the most important parts of every project team's&lt;br&gt;
work. This can be tedious when following the AWS recommendation and best&lt;br&gt;
practice to split workloads over more than one AWS account. Consolidated&lt;br&gt;
billing -- a feature of AWS Organization would help in this case -- but&lt;br&gt;
access to the main/billing account is very often not granted to project&lt;br&gt;
teams and departments for good reason.&lt;/p&gt;

&lt;p&gt;In this article, a solution is presented which automatically collects billing information from various accounts and presents it in a concise format in one or more AWS Quicksight dashboards. Before going into the details, let's look briefly at the available AWS services for cost management and the difficulty of working with more than one account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Cost Management Platform and Consolidated Billing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS provides a complete toolset consisting of AWS Billing, Cost Explorer, and other sub-services. They offer functionality to get a focused overview of costs and usage as well as tools to drill down into the details of most services. The tooling is sufficient to get the job done when working with one AWS account, even though the user experience might not be as polished as that of more dedicated and focused 3rd-party tools. Special IAM permissions are needed to access all this information, which covers the governance part as well.&lt;/p&gt;

&lt;p&gt;AWS Organizations, which governs all AWS accounts associated with it, provides some extended functionality (consolidated billing) to centralize cost monitoring and management. However, in most companies and large enterprises only very few people are allowed to access the billing account, which does not help a project lead get the required information easily. Depending on the number of AWS accounts used by the team, someone who is allowed either has to log into every account regularly and check the costs, or rely on "Cost and Usage" reports, which can be exported automatically to S3. These reports are very detailed (maybe too detailed for simple use cases) and require some custom tooling to extract the required information.&lt;/p&gt;

&lt;p&gt;AWS has published a solution called &lt;a href="https://wellarchitectedlabs.com/cost/200_labs/200_cloud_intelligence/cost-usage-report-dashboards/"&gt;Cloud Intelligence Dashboards&lt;/a&gt;, a collection of Quicksight dashboards based, among other data sources, on these cost and usage reports. Besides this one, company-internal cost control tools, sometimes based on the same AWS services, exist and can be "rented". All these solutions have their advantages and use cases, but also drawbacks (mostly related to their costs and sometimes to overly large IAM permission requirements).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An approach for simple use cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes it is fully sufficient to present some information in a concise manner to stay informed about the general trends and total amounts. In case something turns out to be strange or outside the expected range, a more detailed analysis can be performed using, for instance, the AWS tooling mentioned above.&lt;/p&gt;

&lt;p&gt;Following this idea, Step Functions, Lambda, S3 and Quicksight power a small application which retrieves cost-related data for every designated AWS account and stores it as a JSON file in S3. Quicksight, which supports S3 as a data source, reads this data directly and provides it to build one or more dashboards displaying various cost-related diagrams and tables. This workflow (shown below) is triggered regularly by an EventBridge rule (e.g., once a day) so that up-to-date information is available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--flgRn7YP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdaxojpy5f5b02zmpgxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--flgRn7YP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xdaxojpy5f5b02zmpgxb.png" alt="StepFunctions workflow which coordinates the cost related data" width="427" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "Data preparation" step provides the AWS account ids and names as input for the following &lt;em&gt;Map State&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vIfEMsvy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lbm3d352zrubobplhl87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vIfEMsvy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lbm3d352zrubobplhl87.png" alt="Input data for Map State" width="553" height="793"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For every array element (= AWS account), the Map State starts an instance of a Lambda function which assumes an IAM role in the relevant account and queries the AWS Cost Explorer and AWS Budgets APIs to collect the required information: total current costs, costs per service, cost forecasts...&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inject_lambda_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_event&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capture_lambda_handler&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
     &lt;span class="n"&gt;validate_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;assumed_role_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assume_role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'arn:aws:iam::&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:role/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ROLE_NAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"eu-central-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assumed_role_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ce"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="n"&gt;budget_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;assumed_role_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"budgets"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
     &lt;span class="n"&gt;first_day_of_current_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_start_and_end_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;cost_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_cost_and_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;TimePeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="s"&gt;"Start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_day_of_current_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
             &lt;span class="s"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;},&lt;/span&gt;
         &lt;span class="n"&gt;Granularity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"MONTHLY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"UnblendedCost"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
     &lt;span class="p"&gt;)&lt;/span&gt;

     &lt;span class="n"&gt;cost_per_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_cost_and_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="n"&gt;TimePeriod&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
             &lt;span class="s"&gt;"Start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;first_day_of_current_month&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
             &lt;span class="s"&gt;"End"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;end_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="p"&gt;},&lt;/span&gt;
         &lt;span class="n"&gt;Granularity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"MONTHLY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"UnblendedCost"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
         &lt;span class="n"&gt;GroupBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s"&gt;"Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"DIMENSION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"SERVICE"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
     &lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="err"&gt;…&lt;/span&gt;

     &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
         &lt;span class="s"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
         &lt;span class="s"&gt;"current_month"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s"&gt;"current_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s"&gt;"account_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"account_name"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s"&gt;"current_costs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ResultsByTime"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"Total"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"UnblendedCost"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
         &lt;span class="s"&gt;"forecasted_costs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;forecasted_cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;budget_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"Budgets"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"BudgetLimit"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s"&gt;"Amount"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="s"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

     &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;service_costs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;

     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"statusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The outcome of the Map State is an array consisting of the cost data retrieved from every AWS account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-02-09"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"current_month"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"current_year"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"account_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"current_costs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"152.49"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"forecasted_costs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"468.10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"budget"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"700.00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"account_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"11111111111x"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"Amazon DynamoDB"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.0002276717&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"AWS CloudTrail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.0943441333&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"AWS Lambda"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;30.24534433&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"AWS Key Management Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.8699148608&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"AWS Step Functions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5439809890&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"Amazon Relational Database Service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;16.1240975116&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;…&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the last workflow step, this data is stored as a JSON file in an&lt;br&gt;
S3 bucket. Every file name contains the current date to make it unique.&lt;/p&gt;
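&lt;p&gt;As a sketch, this storage step could also be performed by a small boto3 helper. The bucket name and the &lt;code&gt;cost-reports&lt;/code&gt; key prefix below are placeholders, not part of the original setup:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone


def report_key(run_date: str, prefix: str = "cost-reports") -> str:
    """Build a unique object key by embedding the run date in the file name."""
    return f"{prefix}/{run_date}.json"


def store_cost_report(results: list, bucket: str) -> str:
    """Store the Map State output array as a date-stamped JSON file in S3."""
    import boto3  # local import keeps the pure helper testable without AWS access

    key = report_key(datetime.now(timezone.utc).date().isoformat())
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(results).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

&lt;p&gt;In the actual workflow, the same effect might also be achieved without custom code via a direct Step Functions SDK integration against S3.&lt;/p&gt;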

&lt;p&gt;Apart from the Step Functions workflow and the Lambda function, which can&lt;br&gt;
for instance run in a dedicated AWS account to simplify permission&lt;br&gt;
management, an IAM role needs to be deployed to every account whose&lt;br&gt;
costs should be retrieved. This role must trust the Lambda function and&lt;br&gt;
contain the necessary permissions to query AWS Cost Explorer and AWS&lt;br&gt;
Budgets. An example written in Terraform is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"iam_assume_role_sts"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/iam/aws//modules/iam-assumable-role"&lt;/span&gt;
   &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 4.7.0"&lt;/span&gt;

   &lt;span class="nx"&gt;role_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"query-cost-center-role"&lt;/span&gt;
   &lt;span class="nx"&gt;trusted_role_arns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
     &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query_cost_center_lambda_role_arn&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;

   &lt;span class="nx"&gt;create_role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
   &lt;span class="nx"&gt;role_requires_mfa&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

   &lt;span class="nx"&gt;custom_role_policy_arns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
     &lt;span class="k"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query_cost_center_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"query_cost_center_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/iam/aws//modules/iam-policy"&lt;/span&gt;
   &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 4.7.0"&lt;/span&gt;

   &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Query-cost-center-policy"&lt;/span&gt;
   &lt;span class="nx"&gt;path&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/"&lt;/span&gt;
   &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"This policy allows to query the AWS CostCenter"&lt;/span&gt;
   &lt;span class="nx"&gt;policy&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;query_cost_center_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"query_cost_center_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="s2"&gt;"ce:GetCostAndUsage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s2"&gt;"ce:GetCostForecast"&lt;/span&gt;
     &lt;span class="p"&gt;]&lt;/span&gt;

     &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;

   &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="s2"&gt;"budgets:ViewBudget"&lt;/span&gt;
     &lt;span class="p"&gt;]&lt;/span&gt;

     &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="s2"&gt;"arn:aws:budgets::&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:budget/project_budget"&lt;/span&gt;
     &lt;span class="p"&gt;]&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Data visualization with Quicksight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Quicksight supports S3 as a direct data source -- only a manifest file&lt;br&gt;
describing the data stored in the bucket is needed. This is quite handy&lt;br&gt;
for data whose structure does not change, or only very rarely.&lt;/p&gt;
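&lt;p&gt;As an illustration, such a manifest file can be generated with a few lines of Python; the bucket and prefix below are placeholders:&lt;/p&gt;

```python
import json


def build_manifest(bucket: str, prefix: str) -> str:
    """Render a QuickSight S3 manifest covering every cost file under a prefix."""
    manifest = {
        # QuickSight reads all objects whose keys start with the given prefix.
        "fileLocations": [{"URIPrefixes": [f"s3://{bucket}/{prefix}"]}],
        # All referenced files share the same format -- JSON in this setup.
        "globalUploadSettings": {"format": "JSON"},
    }
    return json.dumps(manifest, indent=2)
```

&lt;p&gt;The resulting file is uploaded to S3 and referenced when creating the Quicksight dataset.&lt;/p&gt;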

&lt;p&gt;A more involved setup including AWS Glue and AWS Athena might be&lt;br&gt;
beneficial in cases where either a lot of details (not only basic cost&lt;br&gt;
information) are queried, or many different AWS services are used over&lt;br&gt;
time in the different AWS accounts. Quicksight might run into problems&lt;br&gt;
when loading this kind of data, as its structure changes constantly and&lt;br&gt;
the manifest file would require frequent updates. A Glue Crawler&lt;br&gt;
combined with an Athena table might be the better approach in such a&lt;br&gt;
scenario.&lt;/p&gt;

&lt;p&gt;As soon as a new dataset based on the S3 bucket has been created, one or&lt;br&gt;
several dashboards can be implemented. Depending on the specific&lt;br&gt;
requirements, they can present overview data, as in the example below,&lt;br&gt;
or go into more detail. How to create these dashboards is out of scope&lt;br&gt;
for this article, but Quicksight offers enough tooling to start simple&lt;br&gt;
and work up to sophisticated information displays.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UZw58LUj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wr167vegkmuq3rln7pke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UZw58LUj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wr167vegkmuq3rln7pke.png" alt="Quicksight dashboard showing cost related details" width="880" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Quicksight dashboard can either be shared with individuals in need of&lt;br&gt;
this information, or a scheduled email notification can be established.&lt;br&gt;
Quicksight will send a mail to all specified recipients, which can&lt;br&gt;
include an image of the dashboard as well as some data in CSV format.&lt;br&gt;
This feature helps a lot, as it is no longer necessary to log in to keep&lt;br&gt;
the costs under control. Simply receiving an automated message, for&lt;br&gt;
instance every day or just once a week, can already help to stay&lt;br&gt;
informed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrapping up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cost monitoring is an important topic for every project -- from small to&lt;br&gt;
large. AWS offers various tools to stay up to date, but this task&lt;br&gt;
becomes tedious when following AWS best practices and separating an&lt;br&gt;
application into different stages and AWS accounts. There are 3rd-party&lt;br&gt;
or company-internal tools available which help to overcome this&lt;br&gt;
situation, but it is not always possible to use them (especially in an&lt;br&gt;
enterprise setup), or they come with their own drawbacks.&lt;/p&gt;

&lt;p&gt;This blog post has presented a small-scale application which offers&lt;br&gt;
enough information and detail to monitor the costs generated by small&lt;br&gt;
to medium-sized projects. It has its own limitations, as it might not be&lt;br&gt;
powerful enough when dealing with tens or even hundreds of AWS accounts&lt;br&gt;
-- but that is normally not the typical setup of a project.&lt;/p&gt;

&lt;p&gt;Photo credits&lt;br&gt;
Photo of Anna Nekrashevich: &lt;a href="https://www.pexels.com/de-de/foto/lupe-oben-auf-dem-dokument-6801648/"&gt;https://www.pexels.com/de-de/foto/lupe-oben-auf-dem-dokument-6801648/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>cloud</category>
      <category>python</category>
    </item>
    <item>
      <title>How not to send all your money to AWS</title>
      <dc:creator>Thomas Laue</dc:creator>
      <pubDate>Fri, 07 Oct 2022 12:52:33 +0000</pubDate>
      <link>https://forem.com/fmegroup/how-not-to-send-all-your-money-to-aws-2kdn</link>
      <guid>https://forem.com/fmegroup/how-not-to-send-all-your-money-to-aws-2kdn</guid>
<description>&lt;p&gt;The AWS environment has grown into a kind of universe, providing more than 250 services over the last 20 years. Many applications, which quite often use 5-10 or more of these services, benefit from the rich feature set provided. Burdens that used to exist in local data centers, such as server management, have been taken over by AWS, giving developers and builders the freedom to be more productive, more creative and, in the end, more cost-effective.&lt;/p&gt;

&lt;p&gt;Billing and cost management at AWS is one of the hotter topics that have been discussed and complained about throughout the internet. AWS provides tools like the &lt;a href="https://calculator.aws/#/" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt; which helps to create cost estimates for application infrastructure. Significant effort has been spent over the last couple of years to improve the &lt;a href="https://aws.amazon.com/aws-cost-management/aws-billing/" rel="noopener noreferrer"&gt;AWS Billing Console&lt;/a&gt; in order to provide better insights into daily and monthly spending. However, it can still be hard to get a clear picture, as every service, and quite often also every AWS region, has its own pricing structure.&lt;/p&gt;

&lt;p&gt;At the same time, more and more cost saving options have been released. Depending on the characteristics and architecture of an application, it can be astonishingly easy to save a considerable amount of money with no, or only a limited, investment of engineering time by using some of the following tips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cleanup resources and data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Probably the most obvious step to reduce costs is to delete unused resources like dangling EC2 or RDS instances, old EBS snapshots or AWS Backup files etc. S3 buckets containing huge amounts of old or unused data might as well be a good starting point to reduce costs.&lt;/p&gt;

&lt;p&gt;Besides not wasting money, all these measures enhance security at the same time: things which are no longer available cannot be hacked or leaked. AWS provides features like AWS Backup/S3 retention policies to help manage the data lifecycle automatically, so that not everything must be done manually.&lt;/p&gt;
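&lt;p&gt;Such cleanup can also be set up programmatically. A minimal sketch with boto3, assuming a hypothetical bucket and key prefix, that expires old exports automatically:&lt;/p&gt;

```python
def lifecycle_rule(prefix: str, days: int = 90) -> dict:
    """Build an S3 lifecycle rule that expires objects under `prefix` after `days` days."""
    return {
        "ID": f"expire-{prefix.strip('/')}-after-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def apply_rule(bucket: str, rule: dict) -> None:
    """Attach the rule to the bucket (this overwrites the existing configuration)."""
    import boto3  # local import keeps the pure helper testable without AWS access

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [rule]},
    )
```

&lt;p&gt;Once attached, S3 deletes matching objects on its own -- no scheduled job or manual sweep is needed.&lt;/p&gt;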

&lt;p&gt;&lt;strong&gt;AWS Saving Plans&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Savings Plans, which come in different flavours, were released about 3 years ago. They offer discounts of up to 72 percent compared to On-Demand pricing when choosing an "EC2 Instance Savings" plan. The more flexible "Compute Savings Plans", which still offer up to 66 percent discount, are especially attractive as they cover not only EC2, but Lambda and Fargate as well.&lt;/p&gt;

&lt;p&gt;Workloads running 24/7 with somewhat predictable demand are best suited for this type of offering. Depending on the selected term length &lt;br&gt;
(1 or 3 years) and payment option (No, Partial or All Upfront), a fixed discount is granted in return for a commitment to spend a certain amount of dollars on compute per hour. Architectural changes like switching EC2 instance types or moving workloads from EC2 to Lambda or Fargate are possible and covered by the Compute Savings Plans.&lt;/p&gt;

&lt;p&gt;Purchasing Savings Plans requires a minimum of work for a considerable savings outcome, especially as most workloads require some sort of compute, which contributes significantly to the total costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reserved Instances and Reserved Capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, AWS does not offer Savings Plans for all possible scenarios or AWS services, but other powerful discount options like &lt;a href="https://aws.amazon.com/rds/reserved-instances/" rel="noopener noreferrer"&gt;Amazon RDS Reserved Instances&lt;/a&gt; come to the rescue. Reserved Instances use configuration options comparable to Savings Plans and promise a similar discount rate for workloads which require continuously running database servers.&lt;/p&gt;

&lt;p&gt;The flexibility of change is however limited and depends on the database used. Nevertheless, Reserved Instances are worth considering as a cost optimization choice, again with only a minimal investment of time necessary.&lt;/p&gt;

&lt;p&gt;Amazon DynamoDB, the serverless NoSQL database, offers a feature called Reserved Capacity. It reserves a guaranteed amount of read and write throughput per second per table. Similar term and payment conditions as already mentioned for Savings Plans and Reserved Instances apply here as well. Predictable traffic patterns benefit from cost reduction compared to the On-Demand or Provisioned throughput modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automatic Shutdown and Restart of EC2 and RDS Instances&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many EC2 and RDS instances are only used during specific times of the day and most often not at all during the weekend. This applies mostly to development and test environments but might also be valid for production workloads. A considerable amount of money can be saved by turning off these idle instances when they are not needed.&lt;/p&gt;

&lt;p&gt;An automatic approach which initiates and manages the shutdown and restart according to a schedule can take over this task, so that almost no manual intervention is needed. AWS provides a solution called "&lt;a href="https://aws.amazon.com/solutions/implementations/instance-scheduler/" rel="noopener noreferrer"&gt;Instance Scheduler&lt;/a&gt;" which can perform this work if no specific start or shutdown logic for an application has to be followed.&lt;/p&gt;
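&lt;p&gt;A minimal custom variant of such a scheduler might look as follows: a Lambda handler, triggered by a scheduled EventBridge rule, that stops all running EC2 instances carrying a scheduling tag. The &lt;code&gt;Schedule&lt;/code&gt; tag key and its value are assumptions for this sketch, not part of the AWS solution:&lt;/p&gt;

```python
def instances_to_stop(reservations, tag_key="Schedule", tag_value="office-hours"):
    """Collect the IDs of instances that carry the scheduling tag."""
    ids = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            if tags.get(tag_key) == tag_value:
                ids.append(instance["InstanceId"])
    return ids


def handler(event, context):
    """Scheduled Lambda entry point: stop all running, tagged EC2 instances."""
    import boto3  # local import keeps the pure helper testable without AWS access

    ec2 = boto3.client("ec2")
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    ids = []
    for page in pages:
        ids += instances_to_stop(page["Reservations"])
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

&lt;p&gt;A second rule with a counterpart handler calling &lt;code&gt;start_instances&lt;/code&gt; would restart the environment in the morning.&lt;/p&gt;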

&lt;p&gt;Specific workflows which require, for instance, starting the databases first prior to launching any servers can be modelled with AWS Step Functions and executed using scheduled EventBridge rules. Step Functions is extremely powerful and supports a huge range of API operations, so that nearly no custom code is necessary.&lt;/p&gt;

&lt;p&gt;An example of a real-world workflow which stops an application consisting of several RDS and EC2 instances is shown in the image below. A strict sequence of shutdown steps must be followed to make sure that the application stops correctly. This workflow is triggered every evening when the environment is no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvaekhjjle85abdrafhem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvaekhjjle85abdrafhem.png" alt="AWS Step Functions workflow to stop a workload" width="800" height="913"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The counter part is given in the next example. This workflow is used to launch the environment every morning before the first developer starts working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik51w3l81xpg0bp7rseu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik51w3l81xpg0bp7rseu.png" alt="AWS Step Functions workflow to start a workload" width="800" height="1743"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shutting down all EC2 and RDS instances during the night and over the weekend cut the compute costs by about 50 percent in this project, which is significant for a larger environment. The only caveat with this approach has so far been insufficient EC2 capacity when trying to restart the instances in the morning. This has happened very seldom, but it took about half a day until AWS had enough resources available for a successful launch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9bzjsqrxvlkj6x369wv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9bzjsqrxvlkj6x369wv.png" alt="Error message which pops up in case AWS cannot provide enough EC2 resources of a certain type" width="800" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Up and downscale of instances in test environments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This option might not work in all cases, as test (a.k.a. UAT) environments often mirror the production workload by design to provide nearly identical conditions when performing manual or automated tests. Load tests especially, but others as well, should be executed on production-like systems, as their results are not reliable otherwise. In addition, not every application runs as smoothly on smaller EC2 instances as on larger ones, and changing an instance size might require additional application configuration modifications.&lt;/p&gt;

&lt;p&gt;Nevertheless, it is sometimes possible to downscale them (RDS databases might be an additional option) when load tests and other heavy tests are not performed regularly (even though regular runs might be recommended in theory).&lt;/p&gt;

&lt;p&gt;Infrastructure-as-code frameworks like Terraform or CloudFormation make it relatively easy to define two configuration sets that can be swapped prior to running a load test to upscale the environment. Changing an EC2 instance type only requires a short stop/modify/start cycle, and some RDS instance classes can be modified with minimal interruption. The whole up- or downscale process takes only a small amount of time (depending on the environment size and instance types) and can save a considerable amount of money.&lt;/p&gt;
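&lt;p&gt;As a sketch of what such a swap can look like outside of the IaC templates, the following Python snippet resizes an instance via boto3. The size mapping, the instance ID and the stop/modify/start sequence are illustrative assumptions, not part of the project described above:&lt;/p&gt;

```python
# Sketch: resize an EC2 instance for a load test and back again.
# The size mapping and instance IDs below are hypothetical examples.

LOAD_TEST_SIZES = {
    "t3.medium": "m5.xlarge",  # everyday size -> load-test size
}

def target_size(current: str, upscale: bool) -> str:
    """Return the instance type to switch to; identity if no mapping exists."""
    if upscale:
        return LOAD_TEST_SIZES.get(current, current)
    reverse = {v: k for k, v in LOAD_TEST_SIZES.items()}
    return reverse.get(current, current)

def resize_instance(instance_id: str, new_type: str) -> None:
    """Stop, modify and restart an instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure logic above stays testable offline
    ec2 = boto3.client("ec2")
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": new_type})
    ec2.start_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    print(target_size("t3.medium", upscale=True))
```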

&lt;p&gt;&lt;strong&gt;Designing new applications with a serverless first mindset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Serverless has become a buzzword and marketing term over the last few years (everything seems to be serverless nowadays), but at its core it is still a very promising technological approach. Not having to deal with a whole range of administrative and operational tasks, such as provisioning and operating virtual servers or databases, paired with the "pay only for what you use" model, is quite appealing. Other advantages of serverless architectures are beyond the scope of this article but can be found easily using your favorite search engine.&lt;/p&gt;

&lt;p&gt;The "pay as you go" cost model in particular counts towards direct cost optimization (leaving aside topics like total cost of ownership and time to market, which are important in practice as well). There is no need to shut down or restart anything when it is not needed: serverless components do not contribute to your AWS bill when they are not used -- for instance at night in development or test environments. Even production workloads, which often receive spiky rather than constant traffic, benefit compared to an architecture based on containers or VMs.&lt;/p&gt;

&lt;p&gt;Not every application or workload is suited to a serverless design model. To be fair, it should be mentioned that a serverless approach can become much more expensive than a container-based one under very heavy traffic patterns. However, this is probably relevant for only a very small portion of all existing implementations.&lt;/p&gt;
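&lt;p&gt;A back-of-the-envelope calculation illustrates the point. All prices below are illustrative assumptions, not current AWS rates:&lt;/p&gt;

```python
# Back-of-the-envelope comparison of a spiky workload on Lambda vs. an
# always-on VM. All prices are illustrative assumptions, not official rates.

LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # illustrative
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20   # illustrative
EC2_PRICE_PER_HOUR = 0.0416                # illustrative small instance

def monthly_lambda_cost(requests: int, avg_duration_s: float, memory_gb: float) -> float:
    """Compute duration charges plus the per-request fee for one month."""
    compute = requests * avg_duration_s * memory_gb * LAMBDA_PRICE_PER_GB_SECOND
    request_fee = requests / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + request_fee

def monthly_ec2_cost(hours: float = 730) -> float:
    """An always-on instance is billed around the clock."""
    return hours * EC2_PRICE_PER_HOUR

# A test environment handling 100,000 requests/month at 200 ms and 512 MB:
lam = monthly_lambda_cost(100_000, 0.2, 0.5)
vm = monthly_ec2_cost()
print(f"Lambda: ${lam:.2f}/month vs. always-on VM: ${vm:.2f}/month")
```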

&lt;p&gt;Quite often it is possible and beneficial to replace a VM or container with one or more Lambda functions, an RDS database with DynamoDB, or a custom REST/GraphQL API implementation with API Gateway or AppSync. The learning curve is steep and well-designed serverless architectures are not easy to achieve at the beginning, as a complete shift in mindset is required, but believe me: this journey is worth the effort and becomes a lot of fun once you have gained some insight into this technology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think about what should be logged and sent to CloudWatch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Logging has been an important part of application development and operations since the invention of software. Useful log output (ideally in a structured form) can help to identify bugs or other deficiencies and provides useful insight into a software system. With CloudWatch, AWS provides a platform for log ingestion, storage, and analysis.&lt;/p&gt;

&lt;p&gt;Unfortunately, log processing is quite costly. It is not unusual for the CloudWatch-related portion of the AWS bill to be up to 10 times higher than, for instance, the Lambda portion in serverless projects. The same applies to container- or VM-based architectures; the ratio might not be as high, but it is still not negligible. A concept for dealing with log output is advisable and can make a considerable difference at the end of the month.&lt;/p&gt;

&lt;p&gt;Some of the following ideas help to keep CloudWatch costs under control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Change log retention time from "Never expire" to a reasonable value&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply log sampling as described in this &lt;a href="https://dev.to/tlaue/keep-your-cloudwatch-bill-under-control-when-running-aws-lambda-at-scale-3o40"&gt;post&lt;/a&gt; and provided, for instance, by the &lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/latest/core/logger/" rel="noopener noreferrer"&gt;AWS Lambda Powertools&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consider using a third-party monitoring system like &lt;a href="https://lumigo.io/" rel="noopener noreferrer"&gt;Lumigo&lt;/a&gt; or &lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;Datadog&lt;/a&gt; instead of outputting a lot of log messages. These external systems are not free and their use is not always permitted (especially in an enterprise context), but they provide a lot of additional features that can make a real difference.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In some cases, it might be possible to send logs directly to other systems (instead of ingesting them first into CloudWatch) or to store them in S3 and use Athena to get some insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Activate logging when needed and suitable, but not always by default -- not every application requires, for instance, VPC flow logs or API Gateway access logs, even though good reasons to enable them exist in certain environments (due to security requirements or certain regulations and company rules)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
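&lt;p&gt;The sampling idea can be sketched in a few lines. This is a simplified, standalone illustration of the concept; the AWS Lambda Powertools logger offers it out of the box via its &lt;code&gt;sample_rate&lt;/code&gt; option:&lt;/p&gt;

```python
# Minimal sketch of debug-log sampling: emit full debug output for only a
# fraction of invocations. This standalone version is a simplification of
# what the Powertools logger does internally.
import random

def sampled(sample_rate: float, rng: random.Random) -> bool:
    """Decide once per invocation whether debug logging is enabled."""
    return sample_rate > rng.random()

def handler_invocations(n: int, sample_rate: float, seed: int = 42) -> int:
    """Simulate n invocations and count how many would log at DEBUG level."""
    rng = random.Random(seed)
    return sum(sampled(sample_rate, rng) for _ in range(n))

print(handler_invocations(10_000, 0.1))  # roughly 10% of invocations
```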

&lt;p&gt;Logging is important and quite useful in most cases, but it makes sense to keep an eye on the expenditures and to adjust the logging concept in case of sprawling costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wrap up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All the cost optimization possibilities mentioned above only scratch the surface of what is possible in the AWS universe. Things like S3 and DynamoDB storage tiers, EC2 Spot Instances and many others have not even been mentioned, let alone explained. Nevertheless, applying one or several of the strategies briefly discussed in this article can help to save a lot of money without having to spend weeks of engineering time. Savings Plans and Reserved Instances in particular, as well as shutting down idle instances, are easy and quite effective measures to reduce the AWS bill by 30% to 50% for existing workloads. Newer workloads suited to the serverless design model really benefit from its cost and operations model -- and are a lot of fun for developers.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>serverless</category>
      <category>costs</category>
      <category>aws</category>
    </item>
    <item>
      <title>Recommender systems based on AWS Personalize</title>
      <dc:creator>Jens Goldhammer</dc:creator>
      <pubDate>Tue, 27 Sep 2022 08:47:15 +0000</pubDate>
      <link>https://forem.com/fmegroup/recommender-systems-based-on-aws-personalize-1fjk</link>
      <guid>https://forem.com/fmegroup/recommender-systems-based-on-aws-personalize-1fjk</guid>
      <description>&lt;p&gt;With its Personalize service, AWS offers a complete solution for&lt;br&gt;
building and using recommendation systems in its own solutions. The&lt;br&gt;
service, which is now also offered in the Europe/Frankfurt region, has&lt;br&gt;
been available since 2019 and is constantly being improved. Only last&lt;br&gt;
year, major improvements in the area of filters were added to the&lt;br&gt;
product.&lt;/p&gt;

&lt;p&gt;AWS Personalize allows customers to create recommendations based on an&lt;br&gt;
ML model for platform or product users. The following activities are&lt;br&gt;
abstracted and made particularly easy by AWS Personalize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Import business data into AWS Personalize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous training of models with current data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read recommendations with filtering capabilities from AWS&lt;br&gt;
Personalize&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might think that recommendation systems based on machine learning&lt;br&gt;
are old news and that you can do it all yourself anyway. Machine&lt;br&gt;
learning has been around for a few years now, and with it the&lt;br&gt;
possibility of developing such recommendation systems yourself.&lt;/p&gt;

&lt;p&gt;But the difference is: AWS Personalize takes the complete management of machine learning environments off the users' hands and allows you to take your first steps very quickly. And you don't need the best ML experts on the team, because AWS Personalize handles a lot of the more complex issues for you. Why it is still good to understand machine learning is shown by the challenges below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3QdDmmXp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw7mhukj0qcj9pczz2ly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3QdDmmXp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw7mhukj0qcj9pczz2ly.png" alt="" width="704" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize makes it easier than ever for us to create&lt;br&gt;
recommendations. From a technical perspective, everything seems easy to&lt;br&gt;
master. We find challenges mainly in the clear delineation of the use&lt;br&gt;
case and the meaningfulness of the recommendations.&lt;/p&gt;

&lt;p&gt;The recommendations should be scoped as clearly as possible to a single use case. This influences both the selection of data and the structure of the data model and schema.&lt;/p&gt;

&lt;p&gt;A recommendation system lives or dies by the relevance and timeliness of the displayed recommendations. If, for example, I show a user recommendations that they already know, that are two years old, or that are not relevant at all, I lose their interest and trust. Recommendations tend to be viewed critically at first and must therefore be convincing from the outset, even if this is of course partly a subjective judgment.&lt;/p&gt;

&lt;p&gt;Therefore, the validity of the recommendations needs to be considered from the very beginning:&lt;/p&gt;

&lt;p&gt;It is very important to constantly validate the recommendations created by AWS Personalize. At the start, it is important to validate the recommendations manually, i.e., to spot-check whether the recommendations appear meaningful to a user at all. It is therefore advisable to start with a recommendation system whose validity can be checked easily. In order to give users recommendations that they do not yet know, it is necessary to work extensively with recommendation filters, so that users' favourites or content they have already seen does not appear again.&lt;/p&gt;

&lt;p&gt;Now how do we make Personalize create recommendations for us? To do&lt;br&gt;
this, there are a few steps to complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tqz4jHtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtoz7175f7qw4m6qr2qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tqz4jHtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtoz7175f7qw4m6qr2qr.png" alt="Image description" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, you should select a domain that best matches the use case&lt;br&gt;
(1).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In case of a user-defined use case, data models are defined&lt;br&gt;
afterwards. Importing your own data into Personalize is done once or&lt;br&gt;
continuously based on the defined data models (2).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Personalize uses the imported data to train and provide&lt;br&gt;
recommendation models (3).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To query recommendations, both an HTTP-based real-time API for one&lt;br&gt;
user and batch jobs for multiple users can be integrated (4).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's take a look at these data models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to the e-commerce and video use cases, AWS Personalize&lt;br&gt;
offers the option of mapping your own use cases (domain). The bottom&lt;br&gt;
line is that it is always about the following data sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Items&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactions of the user with these items&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YazgG5xH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcqh7avr220qcj24l2uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YazgG5xH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcqh7avr220qcj24l2uq.png" alt="Image description" width="751" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These datasets form a dataset group and are used as a whole in&lt;br&gt;
Personalize. Crucial here are the interactions that are necessary for&lt;br&gt;
most ML models and are used for training. A short example will&lt;br&gt;
illustrate this data model:&lt;/p&gt;

&lt;p&gt;"A fme employee reads the blog post "AWS Partnership" on the social&lt;br&gt;
intranet and writes a comment below it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User&lt;/strong&gt;: fme employee&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item&lt;/strong&gt;: Blog post "AWS Partnership"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactions&lt;/strong&gt;: read | comment&lt;/p&gt;

&lt;p&gt;For this data, a developer can define their own schema --- one schema each for Interactions, Users and Items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zd7HJZhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/625kujsrsl6lxm2zxim2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zd7HJZhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/625kujsrsl6lxm2zxim2.png" alt="Image description" width="749" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following is an example schema for a user with 6 fields. These fields can later be used to get recommendations for the content of specific users, e.g. users from a specific company or country.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6mejD3-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijwibmz8kabhivbpohp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6mejD3-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijwibmz8kabhivbpohp5.png" alt="Image description" width="426" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When importing data, this schema must then be followed. All three&lt;br&gt;
datasets have mandatory attributes (e.g. ID) as well as additional&lt;br&gt;
attributes that help to refine the ML model so that the recommendations&lt;br&gt;
can become even more precise. The additional attributes can be textual&lt;br&gt;
or categorical. They can also be used to filter recommendations.&lt;/p&gt;

&lt;p&gt;However, there are a few restrictions in modeling that you need to be aware of, such as the limit of 1,000 characters per metadata field. This is especially important if you want to model lists of values.&lt;/p&gt;

&lt;p&gt;Further info can be found &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/custom-datasets-and-schemas.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Import data into Personalize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The quality of recommendations depends on the data provided. But how does the data get into the system?&lt;/p&gt;

&lt;p&gt;The import of data always takes place into these data pots, the so-called datasets (see above) --- there is exactly one dataset each for Users, Items and Interactions. These datasets are combined in a dataset group.&lt;/p&gt;

&lt;p&gt;To be able to train the ML model, the data sets have to be imported at&lt;br&gt;
the beginning (bulk import via S3). It is also possible to update the&lt;br&gt;
data continuously (via an API), which ensures that the model can always&lt;br&gt;
be improved.&lt;/p&gt;

&lt;p&gt;When you start with AWS Personalize, you usually already have a lot of historical data in your own application. This is necessary because recommendations only "work" meaningfully once a certain amount of data is available (as with any ML application).&lt;/p&gt;

&lt;p&gt;Here it is recommended to use the bulk import APIs of AWS Personalize. For this, the data must first be stored in S3 in CSV format according to the previously defined schema. Then you can start import jobs (one per dataset) via the AWS Console, AWS CLI, AWS API or AWS SDKs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xuOOuBqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o39ao0i5f62mrg744mpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xuOOuBqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o39ao0i5f62mrg744mpu.png" alt="Image description" width="743" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For continuous updating of the Users and Items datasets, AWS Personalize&lt;br&gt;
provides REST APIs that can be easily used with the AWS Client SDKs.&lt;/p&gt;

&lt;p&gt;A so-called event tracker can be used to update the interactions. Once created, this tracker can ingest a large number of events via HTTP within a very short time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qjrLxWfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqwv2fjjo484adx5bu52.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qjrLxWfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqwv2fjjo484adx5bu52.jpeg" alt="Image description" width="749" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Train models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the initial data is imported, AWS Personalize can now use this data&lt;br&gt;
in the form of the Dataset Group to train a model. To do this, you can&lt;br&gt;
first create a Solution, which is a "folder" for models. This sets the&lt;br&gt;
Recipe that Personalize should use.&lt;/p&gt;

&lt;p&gt;The recipe represents the ML model, which is then later trained (as a&lt;br&gt;
Solution version) with user-defined data. There are different types of&lt;br&gt;
recipes that offer different types of recommendations. For example,&lt;br&gt;
&lt;em&gt;USER_PERSONALIZATION&lt;/em&gt; provides personalized recommendations (from all&lt;br&gt;
items) and &lt;em&gt;PERSONALIZED_RANKING&lt;/em&gt; can provide a list of items with&lt;br&gt;
rankings for a particular user. Some recipes use all three data sets and&lt;br&gt;
some use only parts of them (e.g. SIMS does not need user data).&lt;/p&gt;

&lt;p&gt;After creating a solution, it can then be trained with the current state&lt;br&gt;
of the data sets, resulting in a solution version. Depending on the&lt;br&gt;
amount of data, this can take a little longer --- our tests showed&lt;br&gt;
runtimes of around 45 minutes. A solution version is the fully trained&lt;br&gt;
model that can be used directly for batch inference jobs or as the basis&lt;br&gt;
for a campaign --- a real-time API for recommendations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use recommendations in your own application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now it's time to integrate recommendations into our own application. AWS provides a REST interface that allows us to retrieve recommendations from AWS Personalize in real time. This makes it easy to integrate with any system.&lt;/p&gt;

&lt;p&gt;Recommendations in AWS Personalize are always user-related.&lt;br&gt;
Recommendations can therefore look different for each user --- but can also be the same for certain recipes, as in the case of "Popularity count".&lt;/p&gt;

&lt;p&gt;The response is a list of recommendations in the form of IDs of the recommended items, each with a score. The items are uniquely referenced via the ID.&lt;/p&gt;

&lt;p&gt;These recommendations can now be evaluated in your own application,&lt;br&gt;
linked with the content from your own database and then displayed to the&lt;br&gt;
user in a user interface. The performance of the query (at least for&lt;br&gt;
smaller amounts of data) is so good that this query can be done live.&lt;br&gt;
However, one can also think about keeping the results of the query for a&lt;br&gt;
while per user, so as not to have to constantly request the service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rq6FNKMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swywmoa5nfwh3677imse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rq6FNKMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swywmoa5nfwh3677imse.png" alt="Image description" width="508" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need recommendations for a large number of users for mailings,&lt;br&gt;
batch jobs ( &lt;strong&gt;batch inference jobs&lt;/strong&gt; ) can efficiently create these&lt;br&gt;
recommendations in the background. These batch jobs can be "fed" with&lt;br&gt;
the UserIds --- the result are recommendations for each user within one&lt;br&gt;
big JSON file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z4w9YEEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5oz7qudz18b9q4tbra7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z4w9YEEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5oz7qudz18b9q4tbra7c.png" alt="Example for a result of the batch inference&amp;lt;br&amp;gt;
jobs" width="529" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Personalize worth the effort?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pricing model of the service can be quite demanding, so it is&lt;br&gt;
advisable to define in advance a result that you want to achieve with&lt;br&gt;
appropriate recommendations and resulting follow-up activities or repeat&lt;br&gt;
business.&lt;/p&gt;

&lt;p&gt;As a guide: for individual recommendations for individual users via Personalize batch jobs, we assume about 0.06 cents per recommendation per user. That doesn't sound like a lot, but with several hundred thousand users and individual recommendations, it becomes part of the overall consideration. Depending on how often and at what scale batch runs for mailings etc. take place, it can get expensive. And the instances AWS uses for batch runs are very large and very fast. We created several batch jobs to mass-export recommendations for 200k users for testing purposes. The batch jobs then ran overnight, and we incurred costs of several hundred euros --- we had probably underestimated the numbers in the AWS Calculator a bit.&lt;/p&gt;

&lt;p&gt;If referrals have a positive impact on the business and thus directly&lt;br&gt;
generate more sales for the customer, it can pay off very well. But what&lt;br&gt;
if my recommendations do not have a direct positive impact on my sales?&lt;br&gt;
One reason could be to bind customers more closely (subscription&lt;br&gt;
model) --- in the long term, this will in turn lead to more sales, but&lt;br&gt;
perhaps not in the short term.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize is a service that makes it very easy to get started with&lt;br&gt;
recommendation systems. As a development team, all you have to do is&lt;br&gt;
deliver the data in the right format and pick up the recommendations. It&lt;br&gt;
doesn't get much easier than that from a technical perspective.&lt;/p&gt;

&lt;p&gt;AWS Personalize can therefore be used well to extend existing systems&lt;br&gt;
without having to make deep changes. With the ability to create custom&lt;br&gt;
data models and tune the different ML algorithms, you can apply AWS&lt;br&gt;
Personalize to a wide variety of scenarios.&lt;/p&gt;

&lt;p&gt;The real work is in finding meaningful use cases, delineating them from&lt;br&gt;
one another, and providing the system with the right data.&lt;/p&gt;

&lt;p&gt;As always, this comes at a price. Is it worth it for you? Let's find out together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References and Links&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below are a few more links to help dig deeper into the topic:&lt;/p&gt;

&lt;p&gt;Official documentation:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/personalize"&gt;https://aws.amazon.com/de/personalize&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Blogposts:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/blogs/machine-learning/category/artificial-intelligence/amazon-personalize/"&gt;https://aws.amazon.com/de/blogs/machine-learning/category/artificial-intelligence/amazon-personalize/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize Best Practices:&lt;br&gt;
&lt;a href="https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md"&gt;https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Efficacy of models:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/blogs/machine-learning/using-a-b-testing-to-measure-the-efficacy-of-recommendations-generated-by-amazon-personalize/"&gt;https://aws.amazon.com/de/blogs/machine-learning/using-a-b-testing-to-measure-the-efficacy-of-recommendations-generated-by-amazon-personalize/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize Code Samples:&lt;br&gt;
&lt;a href="https://github.com/aws-samples/personalization-apis"&gt;https://github.com/aws-samples/personalization-apis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://content.fme.de/en/blog/aws-personalize"&gt;https://content.fme.de&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Fortifying federated access to AWS via OIDC</title>
      <dc:creator>Konstantin Troshin</dc:creator>
      <pubDate>Fri, 12 Aug 2022 22:02:00 +0000</pubDate>
      <link>https://forem.com/fmegroup/fortifying-federated-access-to-aws-via-oidc-2995</link>
      <guid>https://forem.com/fmegroup/fortifying-federated-access-to-aws-via-oidc-2995</guid>
      <description>&lt;p&gt;In order to avoid management of numerous long-term IAM users, AWS&lt;br&gt;
provides federated access options that include &lt;a href="https://wiki.oasis-open.org/security/FrontPage"&gt;SAML2.0&lt;/a&gt; and &lt;a href="https://openid.net/connect/"&gt;OIDC&lt;/a&gt; &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers.html"&gt;&lt;strong&gt;identity providers&lt;/strong&gt; (IDP)&lt;/a&gt;. Whereas the SAML option is used by many of our customers and there are &lt;a href="https://www.google.com/search?q=aws+saml+access+keycloak"&gt;numerous examples&lt;/a&gt; of how to set it up , the &lt;a href="https://www.google.com/search?q=aws+oidc+access+keycloak"&gt;examples of use of OIDC&lt;/a&gt; are much scarcer. Thus, while selecting our own method of access federation, we decided to try OIDC out to get better understanding of its limits and advantages and be able to better advise our customers.&lt;/p&gt;
&lt;h2&gt;
  
  
  Differences between SAML and OIDC identity federation
&lt;/h2&gt;

&lt;p&gt;To demonstrate the key differences between OIDC and SAML, I have created a small &lt;a href="https://github.com/fmeAG/keycloak-aws"&gt;repo&lt;/a&gt; that allows you to deploy &lt;a href="https://www.keycloak.org/"&gt;Keycloak&lt;/a&gt; on an EC2 instance and then configure the SAML and OIDC clients for use with AWS. &lt;br&gt;
For those unfamiliar with Keycloak, it is an open-source Identity and Access Management tool sponsored by Red Hat and widely used as an &lt;strong&gt;identity provider&lt;/strong&gt; by many of our customers and by ourselves. Among other features, Keycloak supports the SAML and OIDC protocols for identity management and provides user federation via LDAP, which makes it possible to use an existing user base from an Active Directory. After &lt;a href="https://github.com/fmeAG/keycloak-aws#deployment"&gt;deployment&lt;/a&gt; of Keycloak and configuration of the &lt;a href="https://github.com/fmeAG/keycloak-aws#access-to-aws-with-saml"&gt;SAML&lt;/a&gt; and &lt;a href="https://github.com/fmeAG/keycloak-aws#access-to-aws-via-oidc"&gt;OIDC&lt;/a&gt; clients, we can use Keycloak to log in to AWS. &lt;br&gt;
The SAML login can be performed by going to &lt;code&gt;https://auth.${TF_VAR_root_dn}/realms/awsfed/protocol/saml/clients/amazon-aws&lt;/code&gt;, where &lt;code&gt;${TF_VAR_root_dn}&lt;/code&gt; is the subdomain you need to create before the deployment. After entering the credentials for the user &lt;code&gt;testuser&lt;/code&gt; that is created by the deployment scripts, we get redirected to the AWS console for the AWS account to which Keycloak has been deployed. If we had assigned multiple roles to the same Keycloak group (or multiple groups to &lt;code&gt;testuser&lt;/code&gt;), a page like the one below would appear (which will look familiar to everyone who has already used SAML federation with AWS).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cjYIQC1z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phgbuaer4bluucn6z0ls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cjYIQC1z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phgbuaer4bluucn6z0ls.png" alt="SAML choice window" width="880" height="613"&gt;&lt;/a&gt;&lt;br&gt;
If you like to experiment and have deployed everything from the &lt;a href="https://github.com/fmeAG/keycloak-aws"&gt;repo&lt;/a&gt;, you can go to the network tab of the development tools of the browser, find the &lt;code&gt;saml&lt;/code&gt; document there and copy its contents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--abOJYkft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnuwav514xouap5ag2ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--abOJYkft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xnuwav514xouap5ag2ot.png" alt="SAML assertion" width="880" height="1215"&gt;&lt;/a&gt;&lt;br&gt;
Save the contents as &lt;code&gt;aws-saml/assertion&lt;/code&gt; and run &lt;code&gt;saml.sh&lt;/code&gt; from the same folder. If you are fast enough (by default, the SAML assertion for AWS is valid for only 5 minutes), assuming the role should work for the first role but fail for the second. If you look at the trust policies for the corresponding roles (whose names should end with &lt;code&gt;_Federated_Admin-SAML&lt;/code&gt; and &lt;code&gt;_Federated_Admin-SAML2&lt;/code&gt;, respectively), you will see that they are identical and allow the &lt;em&gt;AssumeRoleWithSAML&lt;/em&gt; operation for the same SAML provider. So, why is access granted for the first role and denied for the second? This is because AWS checks the SAML assertion itself for the presence of the role that you try to assume. Looking at the &lt;a href="https://github.com/fmeAG/keycloak-aws/blob/main/aws-saml/kccommands.sh"&gt;script&lt;/a&gt; we ran to configure Keycloak, we can see these two lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kcadm create "clients/$clientId/roles" -r ${REALM_NAME} -s "name=$(terraform output -raw role_arn),$(terraform output -raw provider_arn)" -s 'description=AWS Access'
kcadm add-roles -r ${REALM_NAME} --gname "${GROUP_NAME}" --cclientid 'urn:amazon:webservices'  --rolename "$(terraform output -raw role_arn),$(terraform output -raw provider_arn)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These lines create an entry for the first role (the one without the 2) in Keycloak and map this role to the group &lt;code&gt;aws_access&lt;/code&gt; that is later assigned to our &lt;code&gt;testuser&lt;/code&gt;. Thus, this role shows up in the SAML assertion and can be assumed. Since the same does not happen for the second role, access to it is denied for &lt;code&gt;testuser&lt;/code&gt; (of course, this would change if you created the corresponding entry and mapping in Keycloak for this role too).&lt;/p&gt;
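&lt;p&gt;The check AWS performs here can be illustrated with a short sketch (helper names are hypothetical; the real validation happens inside AWS STS): the assertion carries a list of "role ARN, provider ARN" pairs, exactly as configured via kcadm above, and &lt;em&gt;AssumeRoleWithSAML&lt;/em&gt; only succeeds if the requested role appears among them.&lt;/p&gt;

```python
# Sketch of the role-presence check AWS STS performs on a SAML assertion.
# The attribute values have the form "role_arn,provider_arn", as configured
# in Keycloak via kcadm above. All names and ARNs here are illustrative.

def roles_in_assertion(role_attribute_values):
    """Extract the role ARNs from the SAML role attribute values."""
    return {value.split(",")[0] for value in role_attribute_values}

def may_assume(requested_role_arn, role_attribute_values):
    """AssumeRoleWithSAML succeeds only if the requested role is listed."""
    return requested_role_arn in roles_in_assertion(role_attribute_values)

assertion_roles = [
    "arn:aws:iam::111122223333:role/Federated_Admin-SAML,"
    "arn:aws:iam::111122223333:saml-provider/keycloak",
]

# The first role is present in the assertion, the second is not.
print(may_assume("arn:aws:iam::111122223333:role/Federated_Admin-SAML", assertion_roles))   # True
print(may_assume("arn:aws:iam::111122223333:role/Federated_Admin-SAML2", assertion_roles))  # False
```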

&lt;p&gt;But what about OIDC? Running the &lt;code&gt;./oidc.sh&lt;/code&gt; script from the &lt;code&gt;aws-oidc&lt;/code&gt; folder, we can see that our &lt;code&gt;testuser&lt;/code&gt; can assume the role for which our OIDC provider is listed in the trust policy. A closer look at this &lt;a href="https://github.com/fmeAG/keycloak-aws/blob/main/aws-oidc/iam.tf#L5"&gt;policy&lt;/a&gt; shows that it contains only two things: the ARN of the OIDC provider and the &lt;strong&gt;client ID&lt;/strong&gt; as &lt;em&gt;aud&lt;/em&gt;. This corresponds to what the AWS Console does when a role is created there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--usFKT02M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ciifa7jrokexmsszdhtr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--usFKT02M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ciifa7jrokexmsszdhtr.png" alt="AWS Console IAM Role TP OIDC" width="880" height="290"&gt;&lt;/a&gt;&lt;br&gt;
Also note that, as opposed to the SAML case, there was no need to do anything in Keycloak after running the terraform scripts in the &lt;code&gt;aws-oidc&lt;/code&gt; folder. What does this mean? Well, in the case of OIDC, AWS does not check for any role or group assignments in the &lt;strong&gt;ID token&lt;/strong&gt;. The only two things that matter with the default settings are the &lt;strong&gt;IDP&lt;/strong&gt; itself (which is defined by the URL and the thumbprint, as you can clearly see from the &lt;a href="https://github.com/fmeAG/keycloak-aws/blob/main/aws-oidc/openid.tf"&gt;openid.tf&lt;/a&gt; file) and the &lt;strong&gt;client ID&lt;/strong&gt; (defined in the &lt;code&gt;aud&lt;/code&gt; section of the trust policy). For illustration, a decoded &lt;strong&gt;ID token&lt;/strong&gt; issued by Keycloak looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "exp": 1657326250,
  "iat": 1657322650,
  "auth_time": 0,
  "jti": "a valid id must be here",
  "iss": "https://our.domain/realms/somerealm",
  "aud": "THISISWHATMATTERS",
  "sub": "typically_this_is_the_user_id",
  "typ": "ID",
  "azp": "the_same_as_aud",
  "session_state": "another id is here",
  "at_hash": "some stuff",
  "sid": "and yet another id",
  "email_verified": true,
  "groups": [
    "group1",
    "group2",
    "group3",
    "group4",
    "group5"
  ],
  "preferred_username": "some_user",
  "email": "some_user@our.domain",
  "username": "some_user"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
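&lt;p&gt;With the default trust policy, the only claims evaluated from a token like the one above are the issuer (which identifies the IDP) and &lt;em&gt;aud&lt;/em&gt; (the client ID). A minimal sketch with illustrative names, not the actual STS implementation:&lt;/p&gt;

```python
# Sketch: with the default OIDC trust policy, only the issuer (the IDP) and
# the aud claim (the client ID) are checked -- the groups claim is ignored.
# Values are taken from the example token above and are illustrative.

TRUSTED_ISSUER = "https://our.domain/realms/somerealm"  # from the OIDC provider URL
TRUSTED_AUDIENCE = "THISISWHATMATTERS"                  # the aud condition in the trust policy

def default_policy_allows(id_token_claims):
    """Mimics the default trust-policy evaluation on the ID token claims."""
    return (id_token_claims.get("iss") == TRUSTED_ISSUER
            and id_token_claims.get("aud") == TRUSTED_AUDIENCE)

claims = {
    "iss": "https://our.domain/realms/somerealm",
    "aud": "THISISWHATMATTERS",
    "groups": ["group1", "group2"],  # present in the token but never evaluated
}
print(default_policy_allows(claims))  # True, regardless of group membership
```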



&lt;p&gt;This all means that any user who has access to the corresponding&lt;br&gt;
Keycloak realm can assume any role that trusts the &lt;strong&gt;IDP&lt;/strong&gt;, which is neither granular nor secure and far inferior to SAML, right? Well, that would be so if not for one very important thing: the way I used OIDC in this example is not how it is supposed to be used. Let's look at the &lt;code&gt;oidc.sh&lt;/code&gt; script more closely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function getClientSecret(){
  kcadm get -r ${REALM_NAME} "clients/$(getClientId ${1})/client-secret" | jq -r '.value'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, I use &lt;code&gt;kcadm.sh&lt;/code&gt; (which is containerized and somewhat hidden behind &lt;code&gt;source ../kcadm.sh&lt;/code&gt;) to get the &lt;strong&gt;client secret&lt;/strong&gt; for the Keycloak OIDC client. This operation requires admin rights and is equivalent to a Keycloak administrator handing a &lt;strong&gt;client secret&lt;/strong&gt; to a user. This secret is then used together with the username and password of &lt;code&gt;testuser&lt;/code&gt; to obtain the &lt;strong&gt;ID token&lt;/strong&gt; directly, which is in turn submitted to AWS STS. Of course, as a Keycloak admin I would never do this in a non-test environment, because the &lt;strong&gt;client secret&lt;/strong&gt; (which is bound to the &lt;strong&gt;client ID&lt;/strong&gt; that is, in turn, specified in the IAM trust policy) is not meant to be available to users. But what is it for then? The &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html"&gt;AWS documentation on the OIDC topic&lt;/a&gt; mentions an &lt;strong&gt;identity broker&lt;/strong&gt;. And this &lt;strong&gt;identity broker&lt;/strong&gt; (which, unlike in the SAML case, is not provided by AWS) is what the &lt;strong&gt;client ID and secret&lt;/strong&gt; are actually intended for.&lt;br&gt;
So, what is an &lt;strong&gt;identity broker&lt;/strong&gt; anyway? An &lt;strong&gt;identity broker (IB)&lt;/strong&gt; is an application that acts as a link between AWS and Keycloak and takes over the management of user rights (it knows which user should be able to assume which role). A proper OIDC login flow is started by the &lt;strong&gt;IB&lt;/strong&gt;, which redirects the user to the &lt;strong&gt;IDP&lt;/strong&gt; (Keycloak in our case); after verifying the user credentials, the IDP provides the &lt;strong&gt;ID token&lt;/strong&gt; for that user to the &lt;strong&gt;IB&lt;/strong&gt;. The &lt;strong&gt;IB&lt;/strong&gt; uses the &lt;strong&gt;client ID and secret&lt;/strong&gt; to authenticate itself against the &lt;strong&gt;IDP&lt;/strong&gt;. As you can also see from the &lt;code&gt;oidc.sh&lt;/code&gt; script, it would be a bad idea to hand the &lt;strong&gt;ID token&lt;/strong&gt; to the user, because a combination of the role ARN and the &lt;strong&gt;ID token&lt;/strong&gt; is all you need to assume a role with OIDC.&lt;br&gt;
Instead, the &lt;strong&gt;IB&lt;/strong&gt; should check whether the user has access to the requested role, use the &lt;strong&gt;ID token&lt;/strong&gt; to get temporary AWS credentials (via the &lt;em&gt;AssumeRoleWithWebIdentity&lt;/em&gt; operation), and then return these credentials to the user (or use them to get a login URL for the AWS console). In my demo above, I use cURL as the &lt;strong&gt;IB&lt;/strong&gt;, which is obviously a very poor choice for a production environment since it grants any user access to any role.&lt;/p&gt;
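&lt;p&gt;The broker's core responsibility can be sketched as follows (the user-to-role mapping and all names are hypothetical; a real IB would call &lt;em&gt;AssumeRoleWithWebIdentity&lt;/em&gt; through an AWS SDK only after this check):&lt;/p&gt;

```python
# Sketch of the access check a proper identity broker (IB) should perform
# before exchanging an ID token for AWS credentials. The mapping and the
# role ARN are illustrative, not part of the original setup.

ROLE_MAPPING = {
    "testuser": {"arn:aws:iam::111122223333:role/Federated_Admin-OIDC"},
}

def broker_check(username, requested_role_arn):
    """Return True only if the IB knows this user may assume the role."""
    return requested_role_arn in ROLE_MAPPING.get(username, set())

# Only after this check would the IB call sts:AssumeRoleWithWebIdentity
# with the ID token and return the temporary credentials to the user.
print(broker_check("testuser", "arn:aws:iam::111122223333:role/Federated_Admin-OIDC"))      # True
print(broker_check("someone_else", "arn:aws:iam::111122223333:role/Federated_Admin-OIDC"))  # False
```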
&lt;h2&gt;
  
  
  Hardening the OIDC-based roles
&lt;/h2&gt;

&lt;p&gt;Whereas the use of a proper &lt;strong&gt;identity broker&lt;/strong&gt; minimizes the risk of OIDC access to AWS being misused, the experiments above raised the question of whether it is possible to get AWS STS to look at user attributes from the &lt;strong&gt;ID token&lt;/strong&gt; and not only at the &lt;strong&gt;client ID&lt;/strong&gt; (&lt;em&gt;aud&lt;/em&gt;) and the &lt;strong&gt;IDP&lt;/strong&gt; itself. Looking at &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp_oidc.html"&gt;the documentation for GitHub&lt;/a&gt; (which also uses OIDC) as an IDP, I saw that there is another attribute, &lt;em&gt;sub&lt;/em&gt;, that is used in trust policies. For Keycloak, the default value of &lt;em&gt;sub&lt;/em&gt; is the user ID, which is not very useful, but Keycloak has mappers that can be assigned to clients and override the defaults. Experimenting with mappers, I discovered that it is indeed possible to get Keycloak to provide any LDAP user attribute (we use LDAP user federation in our environment) as &lt;em&gt;sub&lt;/em&gt; to AWS. The only caveat is that this attribute needs to be a string, so it is not directly possible to use group memberships (which are arrays) to additionally secure the trust policies. It is, however, possible to use the &lt;em&gt;StringLike&lt;/em&gt; operator to match substrings. With this operator, AWS STS can check for LDAP groups as long as those are stringified. For instance, the following trust policy checks for a certain group (provided by terraform as &lt;code&gt;${var.group}&lt;/code&gt;) in a group string looking like this:&lt;br&gt;
&lt;code&gt;-group1-group2-group3-...&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": aws_iam_openid_connect_provider.oidc.arn
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${var.oidc_provider}:aud": var.client_id
        },
        "StringLike":{
          "${var.oidc_provider}:sub": ["*-${var.group}-*"]
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
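&lt;p&gt;The &lt;em&gt;StringLike&lt;/em&gt; condition above behaves like shell-style wildcard matching against the stringified group list, which Python's &lt;code&gt;fnmatch&lt;/code&gt; can illustrate (a sketch of the semantics, not the STS implementation):&lt;/p&gt;

```python
# Sketch: the StringLike condition performs wildcard matching, illustrated
# here with fnmatch. The sub claim carries the stringified groups produced
# by the custom Keycloak mapper, e.g. "-group1-group2-group3-".
from fnmatch import fnmatchcase

def stringify_groups(groups):
    """Mimics the custom mapper: '-' separated, with leading and trailing dash."""
    return "".join("-" + g for g in groups) + "-"

def string_like_allows(sub_claim, required_group):
    """The '*-${var.group}-*' pattern from the trust policy above."""
    return fnmatchcase(sub_claim, "*-" + required_group + "-*")

sub = stringify_groups(["aws_access", "group2"])        # "-aws_access-group2-"
print(string_like_allows(sub, "aws_access"))            # True
print(string_like_allows(sub, "aws_access_exclusive"))  # False
```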



&lt;p&gt;So, where could this group string come from? One option would be to write a custom plugin for Keycloak; another would be to let the &lt;strong&gt;IB&lt;/strong&gt; (which is a custom app) handle this. My &lt;a href="https://github.com/fmeAG/keycloak-aws"&gt;repo&lt;/a&gt; contains such a custom mapper (the next section discusses it in more detail) that should be active in your Keycloak if you deployed it as described above. To see the mapper in action, we can run the &lt;code&gt;./oidc_protected.sh&lt;/code&gt; script from the &lt;code&gt;aws-oidc&lt;/code&gt; folder. As you will see, you are able to assume the first role but not the second one.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hzVg3WCV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yr9d2fnwzqw08xvx3z89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hzVg3WCV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yr9d2fnwzqw08xvx3z89.png" alt="Protected Role test" width="880" height="167"&gt;&lt;/a&gt;&lt;br&gt;
Why? Let's take a look at the &lt;a href="https://github.com/fmeAG/keycloak-aws/blob/main/aws-oidc/openid.tf"&gt;trust policies&lt;/a&gt;: the one for the first role contains the &lt;code&gt;aws_access&lt;/code&gt; group, which, as we know, is assigned to our &lt;code&gt;testuser&lt;/code&gt;; the one for the second role refers to the &lt;code&gt;aws_access_exclusive&lt;/code&gt; group, which does not even exist in Keycloak yet. So, even though our user had a valid &lt;strong&gt;ID token&lt;/strong&gt;, it was not possible to assume the protected role because this token did not contain the correct group. If you want to verify that access is granted once you create the corresponding group and assign it to &lt;code&gt;testuser&lt;/code&gt;, and also want to look at the new Keycloak UI (which is in preview for Keycloak 18.0.2), you can do so at &lt;code&gt;https://auth.${TF_VAR_root_dn}&lt;/code&gt;. In this case, you would need to use the admin credentials (admin and &lt;code&gt;${TF_VAR_keycloak_password}&lt;/code&gt; defined in &lt;code&gt;export.sh&lt;/code&gt;). Once the group is created and assigned, the access works as expected. Sweet!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nIzriMTI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whckmy5wfop44g6zwwf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nIzriMTI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4whckmy5wfop44g6zwwf.png" alt="Keycloak Admin UI" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M5mVsAJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/21apods1ohzvjx8b3tku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M5mVsAJD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/21apods1ohzvjx8b3tku.png" alt="Assuming Role with OIDC" width="880" height="459"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Developing custom mappers for Keycloak
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.keycloak.org/docs/latest/server_development/#_script_providers"&gt;Keycloak documentation&lt;/a&gt; mentions "JavaScript" providers which can be used to create custom mappers. As I read JavaScript, I was expecting to write something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function stringifyGroups(groups){ 
    return groups.reduce((current, element)=&amp;gt;{ 
        return current+"-"+element; 
    }, "")+"-"; 
} 
token.setOtherClaims("sub",stringifyGroups(token.groups));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then place this into a .jar file as described in the documentation. It turned out that it does not work like that at all. Firstly, custom scripts are disabled by default, as I found out looking at the Keycloak logs. To fix this, one needs either to activate the preview features or to enable the &lt;em&gt;scripts&lt;/em&gt; option alone, as described &lt;a href="https://github.com/keycloak/keycloak-documentation/blob/main/server_installation/topics/profiles.adoc"&gt;here&lt;/a&gt;. Secondly, this "JavaScript" turned out to be very Java-based and needs to call functions from the corresponding Java classes of Keycloak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var res="";
var forEach = Array.prototype.forEach;
forEach.call(user.getGroupsStream().toArray(), function (group) {
  res=res+"-"+group.getName();
});
res=res+"-";
exports=res;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/fmeAG/keycloak-aws"&gt;repo&lt;/a&gt; shows how it all comes together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;In conclusion, both SAML2 and OIDC are good options for access federation with AWS, each with its own advantages and drawbacks. If you decide to use OIDC like us, you need an &lt;strong&gt;identity broker (IB)&lt;/strong&gt; that provides the link between an &lt;strong&gt;IDP&lt;/strong&gt; (such as Keycloak) and AWS. It would be unwise and potentially dangerous to hand &lt;strong&gt;ID tokens&lt;/strong&gt; directly to the federated users, because the combination of such a token with a role ARN is usually enough to assume that role. Of course, it would be even more unwise to hand an AWS-trusted &lt;strong&gt;client ID&lt;/strong&gt; and its secret to the users. A combination of the &lt;em&gt;StringLike&lt;/em&gt; operator and Keycloak mappers can be used to increase the security of OIDC-federated AWS accounts by restricting role access based on user attributes such as group membership, similar to how SAML2 federation works.&lt;/p&gt;

</description>
      <category>oidc</category>
      <category>aws</category>
      <category>saml</category>
      <category>keycloak</category>
    </item>
    <item>
      <title>Certbot as an init container for AWS ECS.</title>
      <dc:creator>Konstantin Troshin</dc:creator>
      <pubDate>Mon, 08 Aug 2022 17:00:00 +0000</pubDate>
      <link>https://forem.com/fmegroup/certbot-as-an-init-container-for-aws-ecs-2n5p</link>
      <guid>https://forem.com/fmegroup/certbot-as-an-init-container-for-aws-ecs-2n5p</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.tourl"&gt;&lt;/a&gt;&lt;br&gt;
Encryption in transit has become a security standard for most&lt;br&gt;
network-based applications and is requested by the majority of our&lt;br&gt;
customers for all applications we help them to build or manage. Most of the modern applications support TLS out of the box but require the certificate and the corresponding private key to be provided externally.&lt;br&gt;
In some cases (for example, for intranet apps), self-signed certificates (or certificates signed by an internal CA) are sufficient, but if the application is internet-facing and needs to be used without additional steps on the client side, a certificate signed by a commonly trusted certificate authority (CA) is required. For AWS-based applications (as you may have guessed from the title, AWS is the main focus of this post), AWS Certificate Manager (ACM) can be used in combination with a load balancer to provide an Amazon-signed certificate. This simple and efficient method is not applicable, however, if the certificate and the corresponding private key need to be provided to the application directly instead of an AWS-managed load balancer. This can be the case if the application uses TLS in combination with its own protocol, which makes TLS termination on the load balancer impossible. &lt;a href="https://letsencrypt.org/"&gt;Let's Encrypt&lt;/a&gt; is an open CA that provides trusted certificates, which can be acquired with any tool that supports the ACME protocol. In this case, the certificate and private key can be provided to the application directly and also used for custom TLS-based protocols. &lt;a href="https://certbot.eff.org/"&gt;Certbot&lt;/a&gt; is one such tool and can be used to obtain the TLS credentials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s2eDXALi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvv45rzg3eijv9jvym1t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s2eDXALi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvv45rzg3eijv9jvym1t.png" alt="le" width="880" height="648"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The use case
&lt;/h2&gt;

&lt;p&gt;Recently, I was asked to provide a publicly accessible &lt;a href="https://neo4j.com/product/neo4j-graph-database"&gt;Neo4j database&lt;/a&gt; for development purposes. Since Neo4j is available as a Docker container, I chose AWS ECS to run it (a Kubernetes-based solution such as EKS would be overkill for such a simple use case). To start things up, I deployed a &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html"&gt;Network Load Balancer&lt;/a&gt; (NLB) with three listeners and the corresponding target groups for ports 7474 (HTTP), 7473 (HTTPS), and 7687 (bolt). To improve the security of the database, I decided to activate TLS for the bolt and HTTPS endpoints.&lt;br&gt;
Neo4j supports both &lt;a href="https://neo4j.com/developer/kb/setting-up-ssl-with-docker/"&gt;out of the box&lt;/a&gt; but requires the certificates to be provided externally. My initial approach was to use TLS listeners in combination with an Amazon-signed ACM certificate and TLS target groups to talk to the Neo4j container. I used openssl to create a self-signed certificate and provided it to Neo4j via an ECS mount point. This worked just fine for the HTTPS endpoint but not for bolt, which is, however, crucial for the Neo4j clients. It became clear that TLS termination would not work for this use case and that I needed to use TCP listeners and target groups and provide the publicly facing certificate directly to Neo4j. Since the customer wished that the database could be easily accessed by clients without much configuration on their side, I also wanted this certificate to be publicly trusted. In many of our k8s-based projects, we use cert-manager, which can directly obtain Let's Encrypt (LE) certificates, and this brought me to the idea of using LE for the task at hand. Thinking about k8s and init containers, I also remembered reading about &lt;a href="https://aws.amazon.com/about-aws/whats-new/2019/03/amazon-ecs-introduces-enhanced-container-dependency-management/?nc1=h_ls"&gt;container dependencies in ECS&lt;/a&gt;, so I came up with the idea of using a certbot Docker container as such an "init container" for my Neo4j database. The schematic architecture is depicted below and includes an EC2 ECS host on which three containers run: first, the &lt;strong&gt;certbot&lt;/strong&gt; container is started and requests the certificate for the corresponding domain. Once the certificate and the private key are there, the certbot container exits successfully, upon which the second container (copier) is started. This container just needs a shell (I decided to use &lt;strong&gt;debian:latest&lt;/strong&gt; for this purpose), and its purpose is to copy the certificate and the private key into the folders and under the file names Neo4j expects. Upon the successful exit of this container, the &lt;strong&gt;Neo4j&lt;/strong&gt; container is finally started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3n57HaoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59rczy688es0v8w9w9zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3n57HaoX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59rczy688es0v8w9w9zq.png" alt="Architechture" width="792" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To achieve the correct order of the containers, AWS ECS supports the&lt;br&gt;
&lt;strong&gt;dependsOn&lt;/strong&gt; attribute - a list of &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDependency.html"&gt;&lt;em&gt;ContainerDependency&lt;/em&gt; objects&lt;/a&gt;&lt;br&gt;
that in turn consist of &lt;strong&gt;containerName&lt;/strong&gt; and &lt;strong&gt;condition&lt;/strong&gt;. The condition attribute specifies whether the previous container must have started (START), exited (COMPLETE), exited successfully (SUCCESS), or be passing docker health checks (HEALTHY). In the present use case, SUCCESS is the correct &lt;strong&gt;condition&lt;/strong&gt;, since both certificate retrieval and the copy step are crucial for the Neo4j container to work properly (the copier container is called debian here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  {
    "dependsOn": [
      {
        "containerName": "certbot",
        "condition": "SUCCESS"
      },
      {
        "containerName": "debian",
        "condition": "SUCCESS"
      }
    ],
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
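&lt;p&gt;The semantics of the two exit-based condition values can be sketched in a few lines (illustrative, not the actual ECS agent implementation): COMPLETE only requires the container to have exited, while SUCCESS additionally requires exit code 0.&lt;/p&gt;

```python
# Sketch of the ECS container-dependency conditions (illustrative; the real
# evaluation happens inside the ECS agent).

def dependency_satisfied(condition, exited, exit_code):
    if condition == "COMPLETE":
        return exited                     # any exit, successful or not
    if condition == "SUCCESS":
        return exited and exit_code == 0  # must have exited with code 0
    raise ValueError("START/HEALTHY depend on runtime state, not exit status")

# certbot exited with 0 -> a SUCCESS dependency is satisfied ...
print(dependency_satisfied("SUCCESS", True, 0))   # True
# ... but a failed certbot (exit code 1) would only satisfy COMPLETE,
# so Neo4j would (correctly) not be started with the SUCCESS condition.
print(dependency_satisfied("SUCCESS", True, 1))   # False
print(dependency_satisfied("COMPLETE", True, 1))  # True
```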



&lt;h2&gt;
  
  
  Routing to certbot
&lt;/h2&gt;

&lt;p&gt;A small challenge for the architecture above is to ensure that certbot can solve the HTTP challenge of Let's Encrypt, which is part of the ACME protocol (this is needed to verify that the domain for which the certificate is requested is indeed controlled by us). The problem here is that if targets of type &lt;em&gt;instance&lt;/em&gt; are used with the load balancer (which makes sense for an ECS EC2 host), health checks are mandatory. On the other hand, since certbot runs only for a short time, it cannot itself serve the health checks on port 80. Also, LE expects the domain to already be routable to the certbot requesting the certificate, which means that the typical registration delay of load balancers is not acceptable in this case. As a result, the instance must be&lt;br&gt;
registered at the corresponding target group of the NLB and already be healthy &lt;em&gt;before&lt;/em&gt; the certbot container is even started. To address this issue, I used a simple trick based on a small &lt;a href="https://github.com/konstl000/handshaker"&gt;handshaker app&lt;/a&gt;. This app provides a golang-based HTTP server listening on a specified port that simply replies "OK" to any request and can be deployed as a scratch-based Docker container (or a binary). Since the app must not block port 80 (which is required by certbot once it is ready to accept the HTTP challenge), I configured the corresponding target group (TG80) to forward to port 80 but to health check on another port (6666), which I then assigned to handshaker. To ensure the correct timing, I included starting the app in the &lt;em&gt;user data&lt;/em&gt; script of the ECS EC2 instance and had terraform (with which the whole infrastructure is built) register the auto scaling group that deploys these instances at TG80.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -d -e HEALTH_CHECK_PORT=6666 -p 6666:6666 \
${SOME_ACCOUNT_ID}.dkr.ecr.eu-central-1.amazonaws.com/handshaker:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
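&lt;p&gt;The behaviour of the handshaker can be sketched with Python's standard library (the real app is a Go binary from the linked repo; this sketch only illustrates "reply OK to anything on a dedicated health-check port"):&lt;/p&gt;

```python
# Sketch of what the handshaker app does: answer "OK" to any HTTP request
# on a dedicated health-check port. The real app is Go-based; this Python
# version only illustrates the behaviour.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handshaker(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# Port 0 lets the OS pick a free port; the real app reads HEALTH_CHECK_PORT.
server = HTTPServer(("127.0.0.1", 0), Handshaker)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any path gets the same answer, which is all a target-group health check needs.
reply = urllib.request.urlopen(f"http://127.0.0.1:{port}/anything").read()
print(reply)  # b'OK'
server.shutdown()
```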



&lt;p&gt;As expected, shortly after &lt;code&gt;terraform apply&lt;/code&gt;, the instance was registered at TG80 and became healthy. After this, I used the AWS CLI to scale the ECS service to 1 task (I usually deploy ECS services with an initial task count of 0, so that the whole infrastructure such as load balancers, instances, Route53 entries, etc. is available before the containers are even started).&lt;/p&gt;

&lt;p&gt;To my delight, certbot successfully requested the certificate, passed the HTTP challenge, and stored the results in a shared folder mounted via a mount point. After this, the following script ran in the copier container, followed by the successful start of Neo4j.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

#The le folder will be mounted from the host and filled by certbot
cp /le/live/"${DOMAIN}"/cert.pem /home/neo4j/certificates/bolt/public.crt
cp /le/live/"${DOMAIN}"/privkey.pem /home/neo4j/certificates/bolt/private.key
#from here, we just need to create some more copies
cp /home/neo4j/certificates/bolt/private.key /home/neo4j/certificates/https/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/https/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/bolt/trusted/
cp /home/neo4j/certificates/bolt/public.crt /home/neo4j/certificates/https/trusted/

chown -R 7474 /home/neo4j/certificates #so that the neo4j user can read 'em
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jZgTnNZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vqsxblsseglxzmh21ui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jZgTnNZR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3vqsxblsseglxzmh21ui.png" alt="cert view" width="880" height="1089"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives
&lt;/h2&gt;

&lt;p&gt;Of course, the described approach is not the only way to get a&lt;br&gt;
certificate from LE and provide it to a Neo4j container (or another application). Some simple alternatives that immediately come to mind would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run certbot directly on the EC2 host instead of a container&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use k8s/k3s/k0s in combination with cert-manager&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build a custom container that has certbot inside of it&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That being said, I think the init container approach shows a way of using ECS similar to k8s pods and can be successfully applied to other ECS-based solutions. It also allows the use of upstream containers, which makes upgrades seamless as opposed to the "one custom container" approach. In case you wonder how I could run a custom bash script within an upstream debian container: I just used a mount point to attach a folder from the host that had been created and filled by the &lt;em&gt;user data&lt;/em&gt; script during the EC2 deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
mkdir -p /home/xtra #prepare the xtra folder that will be attached to the debian container
echo "${CERT_SCRIPT}" | base64 -d &amp;gt;/home/xtra/copy_certs.sh
chmod +x /home/xtra/copy_certs.sh
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scaling
&lt;/h2&gt;

&lt;p&gt;In the current example, I used an auto scaling group with just one&lt;br&gt;
instance in it, which allows all the mount points to be folders on this instance. Of course, this local-folder solution would not scale well. In that case, however, EFS can be used instead, so that the certificate and the key are requested just once by one of the certbot containers (certbot exits automatically if a valid certificate is already present) and can then be used by all of the corresponding Neo4j containers. All other services used in the infrastructure above (NLB, NAT Gateway, ECS) support horizontal scaling, so a solution based on this approach can be scaled out with ease.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;In conclusion, AWS ECS provides a nice option to implement k8s-like "init containers" by using container dependencies and non-essential containers. These can be employed for a variety of purposes, including the generation of TLS certificates with a certbot container. The TLS credentials can then be provided immediately to an application running in the essential container on the same host, resulting in an app secured by a publicly trusted certificate.&lt;/p&gt;
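&lt;p&gt;The container-dependency wiring boils down to a few fields in the ECS task definition. A sketch (container names, images, and paths here are illustrative, not taken from the actual setup):&lt;/p&gt;

```json
{
  "containerDefinitions": [
    {
      "name": "certbot",
      "image": "certbot/certbot:latest",
      "essential": false
    },
    {
      "name": "neo4j",
      "image": "neo4j:4.4",
      "essential": true,
      "dependsOn": [
        { "containerName": "certbot", "condition": "SUCCESS" }
      ],
      "mountPoints": [
        { "sourceVolume": "certs", "containerPath": "/ssl", "readOnly": true }
      ]
    }
  ],
  "volumes": [
    { "name": "certs", "host": { "sourcePath": "/home/certs" } }
  ]
}
```

&lt;p&gt;The &lt;code&gt;SUCCESS&lt;/code&gt; condition makes ECS start the essential container only after the non-essential certbot container has exited with code 0.&lt;/p&gt;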

</description>
      <category>aws</category>
      <category>containers</category>
      <category>ecs</category>
      <category>docker</category>
    </item>
    <item>
      <title>Refactor Terraform code with Moved Blocks - a new way without manually modifying the state</title>
      <dc:creator>Thomas Laue</dc:creator>
      <pubDate>Fri, 08 Jul 2022 11:07:49 +0000</pubDate>
      <link>https://forem.com/fmegroup/refactor-terraform-code-with-moved-blocks-a-new-way-without-manually-modifying-the-state-g2c</link>
      <guid>https://forem.com/fmegroup/refactor-terraform-code-with-moved-blocks-a-new-way-without-manually-modifying-the-state-g2c</guid>
      <description>&lt;p&gt;Most software and IT infrastructure projects which have been deployed to production have to deal with requirement changes during their lifetime. User expectations change, new use cases appear, traffic patterns are different than expected or new technology becomes available. Refactoring of existing code (application code as well as infrastructure-as-code) has always been an important task but also one of the major pain points in IT. A good support of refactoring tools and patterns can make a difference for a framework like Terraform compared with its competitors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting the stage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform by HashiCorp -- one of the major players in the&lt;br&gt;
infrastructure-as-code world -- has been around since 2014. It has been used to set up many small, medium, and large projects all over the world. It provides a rich feature set to define infrastructure in a concise manner. One of its strengths is the ability to create identical or similar resources using either the meta-argument &lt;code&gt;count&lt;/code&gt; or the newer &lt;code&gt;for_each&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;count&lt;/code&gt; makes it very easy to define identical resources, as shown in the listing below, which defines a very basic setup of 3 EC2 instances running on AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;server_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver3"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;server_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;ami&lt;/span&gt;                       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-0a1ee2fb28fe05df3"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;server_names&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform stores references to resources created with the &lt;code&gt;count&lt;/code&gt; meta-argument in its internal state as an array, using an index-based approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f0qk32w98fdpfcmv28m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f0qk32w98fdpfcmv28m.png" alt="Terraform state listing showing three web server instance details"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This works fine as long as no single instance has to be replaced or deleted. Such an action affects all resources located at a higher index in the array, due to the way Terraform manages its state.&lt;/p&gt;

&lt;p&gt;Trying to remove "webserver2" in the example above&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;server_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver3"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will result in the destruction of the EC2 instance tagged "webserver3" and a renaming of the instance previously named "webserver2" to "webserver3". The result does not correspond to the expressed intention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl5ec2v4gifkz0dfpmg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl5ec2v4gifkz0dfpmg3.png" alt="Terraform listing showing unintended result of refactoring a structure build upon  raw `count` endraw "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Version 0.12.6 of Terraform introduced the &lt;code&gt;for_each&lt;/code&gt; meta-argument - a more flexible way to create identical/similar resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;server_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;ami&lt;/span&gt;                         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-0a1ee2fb28fe05df3"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; 
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Terraform state now references the resources not by an index but by a key. It is therefore possible to address a single resource without affecting others.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstiice0glkzwlcwkxqm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstiice0glkzwlcwkxqm9.png" alt="Terraform listing showing correctly refactored structure based upon  raw `for_each` endraw "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The removal of "webserver2" can now be performed successfully without affecting other resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3ljhaic16c74yw4cswd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv3ljhaic16c74yw4cswd.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Due to the greater flexibility of &lt;code&gt;for_each&lt;/code&gt; it might be helpful or even required to refactor existing code (i.e. migrate from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;for_each&lt;/code&gt;). In the past, this was possible by manipulating the Terraform state directly using the &lt;code&gt;terraform state mv&lt;/code&gt; CLI command. However, all manual state manipulations are brittle and error-prone, which makes them a kind of last resort.&lt;/p&gt;
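&lt;p&gt;For the three-instance example above, such an imperative migration from &lt;code&gt;count&lt;/code&gt; to &lt;code&gt;for_each&lt;/code&gt; would look roughly like this (single quotes keep the shell from interpreting the brackets):&lt;/p&gt;

```shell
terraform state mv 'aws_instance.web[0]' 'aws_instance.web["webserver1"]'
terraform state mv 'aws_instance.web[1]' 'aws_instance.web["webserver2"]'
terraform state mv 'aws_instance.web[2]' 'aws_instance.web["webserver3"]'
```

&lt;p&gt;Each command must be typed correctly against a live state -- one reason why this approach is considered brittle.&lt;/p&gt;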

&lt;p&gt;&lt;strong&gt;From imperative to explicit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;HashiCorp introduced an improved refactoring experience with version 1.1 of Terraform: the &lt;a href="https://www.terraform.io/language/modules/develop/refactoring" rel="noopener noreferrer"&gt;&lt;code&gt;moved&lt;/code&gt; block&lt;/a&gt; syntax, which allows refactoring steps to be expressed in code instead of imperatively via the CLI.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;moved&lt;/code&gt; block specifies the old and the new reference of a resource, as shown in the following example, which has been rewritten to use &lt;code&gt;for_each&lt;/code&gt; instead of &lt;code&gt;count&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;server_names&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"webserver3"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;moved&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;moved&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;moved&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"webserver3"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;toset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;server_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-0a1ee2fb28fe05df3"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.micro"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;each&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A subsequent &lt;code&gt;terraform plan/apply&lt;/code&gt; reveals that no instance will be destroyed or modified in any way; each is only moved in the state from its old reference to the new one created by the way &lt;code&gt;for_each&lt;/code&gt; works. There is no need for manual state manipulation anymore; everything can be done safely using Terraform's native workflow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vi8u35moj0xiu6gul18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vi8u35moj0xiu6gul18.png" alt="Terraform listing showing how moved blocks work"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;moved&lt;/code&gt; blocks cannot only be applied to refactor &lt;code&gt;count&lt;/code&gt; into&lt;br&gt;
&lt;code&gt;for_each&lt;/code&gt; syntax but can also be used to rename resources, to move resources into modules, and so on. Not everything is possible with the new language element, but many (not extremely complex) refactoring tasks can benefit from it. Terraform's documentation contains various examples and use cases with further details.&lt;/p&gt;
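&lt;p&gt;As an illustration of those other use cases, a rename and a move into a module follow the same pattern (the names &lt;code&gt;frontend&lt;/code&gt; and &lt;code&gt;module.webservers&lt;/code&gt; are purely hypothetical):&lt;/p&gt;

```terraform
# Rename a resource without recreating it:
moved {
  from = aws_instance.web
  to   = aws_instance.frontend
}

# Move a resource into a child module:
moved {
  from = aws_instance.frontend
  to   = module.webservers.aws_instance.frontend
}
```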

&lt;h3&gt;
  
  
  Wrap-up
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;moved&lt;/code&gt; blocks have made refactoring existing Terraform projects easier and safer. For many use cases, no manual steps are required any longer, even though &lt;code&gt;terraform state mv&lt;/code&gt; is still there to solve problems which cannot be tackled with the new element. It is helpful to have tooling/framework elements like this at hand.&lt;/p&gt;

&lt;p&gt;Depending on the type and size of the project (internal project or public module), it might make sense -- and is even recommended by HashiCorp -- not to delete the blocks after the changes have been applied. Not everyone using the module might already have fetched the latest version. Apart from avoiding trouble for users, it can be helpful to document significant changes to the project structure for later reviews. A short, well-written, and dated comment combined with the &lt;code&gt;moved&lt;/code&gt; block syntax might answer your question, or that of a colleague, six months down the road.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Docker is dead?!? Podman - an alternative tool?</title>
      <dc:creator>Maxi Krone</dc:creator>
      <pubDate>Fri, 08 Jul 2022 11:04:36 +0000</pubDate>
      <link>https://forem.com/fmegroup/docker-is-dead-podman-an-alternative-tool-2iij</link>
      <guid>https://forem.com/fmegroup/docker-is-dead-podman-an-alternative-tool-2iij</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lqa2uz4n3bhbk8261ya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0lqa2uz4n3bhbk8261ya.png" alt="Docker vs Podman"&gt;&lt;/a&gt; &amp;gt; &lt;em&gt;&lt;a href="https://www.flaticon.com/free-icon/fight_1256685" rel="noopener noreferrer"&gt;https://www.flaticon.com/free-icon/fight_1256685&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You're no stranger to container images, Docker, and Kubernetes, and you may have heard it said in many places that Docker is dead. You can't explain exactly where this statement comes from, or maybe you haven't quite figured out the topic yet? Then this blog post will help you out.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Updated: June 22nd, 2022:&lt;/strong&gt; This article was originally published on May 29, 2022 and blew up on Hacker News, sparking a broad discussion of the future roles of Podman and Docker. This discussion also contained valid criticism and corrections of some erroneously assumed facts, and we are very happy it took place: it turned into a lively exchange about containers, Docker, Podman, and all the other containerization-related topics between professionals. We want to correct the wrong facts in a transparent way, which is why we decided not to change the original article but to extend it with a paragraph at the end where the wrong assumptions are corrected. Again, thanks a lot for all your feedback! We appreciate it very much!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this blog post, I would like to briefly explain how the statement comes about, what background it has and what possible alternatives there are with my own impressions.&lt;/p&gt;

&lt;p&gt;In particular, I will discuss the alternative Podman -- which does not require a background service (daemon) -- including its advantages and disadvantages compared to Docker. First, I will explain the OCI and then how Podman works in general. What Podman is not, and a list of its advantages over Docker, will follow later.&lt;/p&gt;

&lt;p&gt;To make a long story short: container images -- especially according to the OCI standard -- are not dead, and Docker also still has its application areas and accordingly will not die out quickly. This article is much more about the &lt;a href="https://www.docker.com/blog/updating-product-subscriptions/" rel="noopener noreferrer"&gt;licensing changes&lt;/a&gt; of Docker -- especially Docker Desktop -- and the consequences they entail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Docker dead? The history behind Docker.
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Beginnings of containers and Docker
&lt;/h4&gt;

&lt;p&gt;Docker is open-source software that primarily serves to isolate&lt;br&gt;
applications with the help of container virtualization. It does this by&lt;br&gt;
using Linux techniques such as cgroups and namespaces to implement&lt;br&gt;
containers, among other things. More detailed information on this can&lt;br&gt;
also be found in Docker's official &lt;a href="https://docs.docker.com/get-started/overview/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Docker was not, as often thought, the first approach in this direction.&lt;br&gt;
The isolation of applications via process isolation - in principle, containerization is this - goes far back into the past of computers. For example, the first approaches can be traced back to the chroot system call (syscall) of the late 1970s. This advance was the beginning of process isolation: separate file access for each process.&lt;/p&gt;

&lt;p&gt;However, in March 2013, Docker Inc, initially known as dotCloud, managed to take containerization to a new level that made it usable to the general public by releasing Docker as free and open source software.&lt;br&gt;
Above all, the simple, understandable usability, high reliability and extreme added value of Docker containers led to the great success and revival of containerization in the world of IT in the following years.&lt;/p&gt;

&lt;h4&gt;
  
  
  Acquisition by Mirantis
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j64cpue16ki1x31kxpx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j64cpue16ki1x31kxpx.jpeg" alt="Source: https://www.mirantis.com/wp-content/uploads/2017/01/mirantis-logo-2color-rgb-transparent.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.mirantis.com/wp-content/uploads/2017/01/mirantis-logo-2color-rgb-transparent.png" rel="noopener noreferrer"&gt;https://www.mirantis.com/wp-content/uploads/2017/01/mirantis-logo-2color-rgb-transparent.png&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then, in November 2019, Mirantis acquired Docker Inc. along with the Docker suite. Shortly thereafter, Mirantis announced an adjustment to the existing licensing model. Previously, the standard use of Docker, including the daemon, CLI, and Docker Desktop, was free of charge. After a transition phase that ended in January 2022, professional use of Docker Desktop now requires a subscription starting at $5 per user/month. Since then, only individuals, open-source communities, and small businesses with up to 250 employees or $10 million in annual revenue are exempt.&lt;/p&gt;

&lt;p&gt;In addition to this, &lt;a href="https://www.docker.com/increase-rate-limits/" rel="noopener noreferrer"&gt;a rate limit was introduced&lt;/a&gt; in the central Docker registry -- &lt;a href="https://hub.docker.com/" rel="noopener noreferrer"&gt;Docker Hub&lt;/a&gt; -- so that anonymous users can only download 100 images and authenticated free users only 200 images in a six-hour period.&lt;/p&gt;

&lt;p&gt;These moves resulted in the departure of several major vendors and platforms, as they disagreed with the sudden change in licensing terms. Given the use of Docker in popular platforms such as Kubernetes, this showed the danger of depending on Docker Inc. If the company can continue to adjust its licensing model at will, this can have major consequences for all companies and platforms that rely on Docker.&lt;/p&gt;

&lt;p&gt;One example is the &lt;a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md#deprecation" rel="noopener noreferrer"&gt;announcement&lt;/a&gt; by the Kubernetes project that Docker support, i.e. dockershim, will be discontinued as of version 1.24.&lt;br&gt;
Red Hat made a similar decision with RHEL 8. Due to these decisions, and with Red Hat and Kubernetes moving away from Docker, an increasing portion of developers is moving away from Docker and looking for alternatives. An interesting report on the declining use of Docker can be found here.&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker Desktop
&lt;/h4&gt;

&lt;p&gt;To understand the scope of it all, here's an explanation of the Docker Desktop application.&lt;/p&gt;

&lt;p&gt;Docker Desktop is a software suite for Mac, Windows and &lt;a href="https://www.docker.com/blog/the-magic-of-docker-desktop-is-now-available-on-linux/" rel="noopener noreferrer"&gt;now&lt;br&gt;
Linux&lt;/a&gt; that can be used to create containerized applications using Docker's own tools. Docker Desktop includes the Docker Engine, docker-cli, docker-compose, and a credential helper, among other tools.&lt;/p&gt;

&lt;p&gt;With the adapted licensing model, Docker Desktop can now only be used free of charge by the exempted groups described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Podman?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh4z8x4s122btp1qy6d4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh4z8x4s122btp1qy6d4.png" alt="Source: https://developers.redhat.com/sites/default/files/podman-logo.png"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://developers.redhat.com/sites/default/files/podman-logo.png" rel="noopener noreferrer"&gt;https://developers.redhat.com/sites/default/files/podman-logo.png&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For many in the community, Podman is following in Docker's footsteps as free software. But what exactly Podman is -- and why it is not just a simple replacement -- is explained below.&lt;/p&gt;

&lt;p&gt;Podman is an open source and free container management tool for developing, managing and running OCI containers.&lt;/p&gt;

&lt;h4&gt;
  
  
  OCI Images and Runtimes
&lt;/h4&gt;

&lt;p&gt;OCI stands for the &lt;a href="https://opencontainers.org/" rel="noopener noreferrer"&gt;Open Container Initiative&lt;/a&gt;, which was initiated by Docker Inc. in 2015. It describes two specifications: the Runtime Specification and the Image Specification.&lt;/p&gt;

&lt;p&gt;Any software can implement the specifications and thus create&lt;br&gt;
OCI-compatible container images and container runtimes that are&lt;br&gt;
compatible with each other. In addition, there are the so-called&lt;br&gt;
&lt;strong&gt;C&lt;/strong&gt;ontainer &lt;strong&gt;R&lt;/strong&gt;untime &lt;strong&gt;I&lt;/strong&gt;nterfaces (CRI), which are based on the OCI runtimes and abstract them.&lt;/p&gt;

&lt;p&gt;Example implementations of the container runtime interfaces in this context are dockershim (OCI wrapper for the original Docker Engine implementation, see this &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/migrating-from-dockershim/check-if-dockershim-removal-affects-you/#role-of-dockershim" rel="noopener noreferrer"&gt;article&lt;/a&gt;), &lt;a href="https://containerd.io/" rel="noopener noreferrer"&gt;containerd&lt;/a&gt; (new implementation of Docker's container runtime interface (CRI)) and &lt;a href="https://github.com/cri-o/cri-o" rel="noopener noreferrer"&gt;cri-o&lt;/a&gt; (implementation of the Kubernetes container runtime interface).&lt;/p&gt;

&lt;p&gt;Implementations of the OCI image specification can be found in Docker,&lt;br&gt;
&lt;a href="https://buildah.io/" rel="noopener noreferrer"&gt;Buildah&lt;/a&gt;, &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/introducing-kaniko-build-container-images-in-kubernetes-and-google-container-builder-even-without-root-access" rel="noopener noreferrer"&gt;kaniko&lt;/a&gt;, and &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;Podman&lt;/a&gt;, for example. The OCI-compatible&lt;br&gt;
container images can then be executed on a &lt;strong&gt;CRI&lt;/strong&gt;-compatible runtime (containerd, cri-o), which in turn calls an OCI runtime such as runc or runsc. Runc or runsc then start the actual container on the client.&lt;/p&gt;

&lt;p&gt;This may seem a bit confusing at first glance, but can be clarified using the article from &lt;a href="https://www.tutorialworks.com/difference-docker-containerd-runc-crio-oci/" rel="noopener noreferrer"&gt;tutorialworks&lt;/a&gt; and the following figure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4nvbk7a2lr0o1kl4qed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4nvbk7a2lr0o1kl4qed.png" alt="Source: Tutorial Works - https://www.tutorialworks.com/assets/images/container-ecosystem-docker.drawio.png?ezimgfmt=rs:704x305/rscb6/ng:webp/ngcb6"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The projects involved in running a container with Docker -- Source: Tutorial Works&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Podman not?
&lt;/h2&gt;

&lt;p&gt;Because Podman - unlike Docker - is not a container runtime, but only an implementation of the OCI image specification, Podman cannot launch images itself. It requires the aforementioned CRI container runtime, which in turn uses the hardware-related OCI runtime such as runc or runsc to start the actual containers.&lt;/p&gt;

&lt;p&gt;Podman itself only takes over administration tasks for the containers, including builds. To execute the images, Podman then uses e.g. the mentioned containerd, which in turn runs e.g. runc to actually start the container.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of Podman
&lt;/h3&gt;

&lt;p&gt;The following is a brief explanation of the advantages of Podman over Docker.&lt;/p&gt;

&lt;h4&gt;
  
  
  Proximity to Docker
&lt;/h4&gt;

&lt;p&gt;Despite the sometimes significant differences between the implementations of Podman and Docker, the most common Docker commands can be used one-to-one. For example, common commands like login, build, pull, push, tag, etc. work exactly as the user expects. Also, Dockerfiles are still used to build the container images. This makes it much easier for a former Docker user to get started.&lt;/p&gt;

&lt;p&gt;Additionally, this command compatibility helps with a necessary migration from Docker to Podman. For example, by aliasing docker to podman, most scripts can continue to be used.&lt;/p&gt;
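
&lt;p&gt;As a minimal sketch, assuming Podman is installed and on the PATH, such an alias could look like this (the example commands are illustrative):&lt;/p&gt;

```shell
# Drop-in for ~/.bashrc or ~/.zshrc: let existing scripts keep
# calling "docker" while Podman actually does the work.
alias docker=podman

# The familiar commands then behave as before, e.g.:
#   docker build -t myimage .
#   docker pull alpine
#   docker run --rm -it alpine sh
```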

&lt;h4&gt;
  
  
  Resource requirements and embedding in OS
&lt;/h4&gt;

&lt;p&gt;Podman was built as a lean and efficient solution. As a result, it has a lower overall memory footprint and is considered significantly faster and comparatively efficient. For these reasons, among others, Podman has become a standard on some Linux distributions, such as Fedora CoreOS.&lt;/p&gt;

&lt;p&gt;Additionally, Podman is now included in the default repositories of Ubuntu 22.04 LTS, see also this &lt;a href="https://podman.io/blogs/2022/04/05/ubuntu-2204-lts-kubic.html" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pods
&lt;/h4&gt;

&lt;p&gt;As mentioned earlier, Kubernetes discontinues support for Docker with version 1.24. This means that dockershim must be replaced by another container runtime such as containerd or CRI-O.&lt;/p&gt;

&lt;p&gt;With Podman, however, collaboration should continue to be possible without any problems. Podman even supports so-called pods, as can be derived from the name. These were originally established by Kubernetes and can combine several containers, similar to a Docker Compose setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0jcxskt97oc31f0pre0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0jcxskt97oc31f0pre0.png" alt="Schematic concept of a pod"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Schematic concept of a pod&lt;/em&gt;&lt;/p&gt;
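
&lt;p&gt;A quick sketch of the pod workflow (pod name, port and images are hypothetical; this assumes a working Podman installation):&lt;/p&gt;

```shell
# Create a pod that exposes port 8080 and add two containers to it;
# the containers share the pod's network namespace.
podman pod create --name mypod -p 8080:80
podman run -d --pod mypod docker.io/library/nginx:alpine
podman run -d --pod mypod docker.io/library/redis:alpine
podman pod ps

# Export the pod as a Kubernetes manifest for a later rollout:
podman generate kube mypod
```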

&lt;p&gt;This makes Podman a good choice for building and testing pods locally before rolling them out to Kubernetes. Alternatively, a local Minikube installation is a good way to test your Kubernetes manifests. Podman, however, is leaner and correspondingly more performant in small setups.&lt;/p&gt;

&lt;p&gt;More about Pods in Podman can also be found &lt;a href="https://developers.redhat.com/blog/2019/01/15/podman-managing-containers-pods" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Rootless Mode
&lt;/h4&gt;

&lt;p&gt;Unlike Docker, which depends on its background process (daemon), Podman does not require root permissions to execute its commands. Docker requires root privileges or the user to be a member of the docker group; a user who is neither and does not use sudo only gets an appropriate error message and cannot use Docker at all.&lt;/p&gt;

&lt;p&gt;Why might that be a problem at all? As already explained, the Docker daemon requires root privileges on the server and thus creates a potential security risk. If an attacker were to break into a container, there is a risk that they could break out onto the underlying server and then infiltrate other services in the network.&lt;/p&gt;

&lt;h4&gt;
  
  
  Auditing
&lt;/h4&gt;

&lt;p&gt;Unlike Docker, Podman follows the fork-exec model. This means that changes are recorded in the auditd system, which can be an advantage over Docker from a compliance perspective. A comparison of the audit model in Docker with the model in Podman can be found in the following blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of Podman
&lt;/h2&gt;

&lt;p&gt;There are two main limitations that Podman brings with it. These are briefly explained below.&lt;/p&gt;

&lt;h4&gt;
  
  
  Linux based
&lt;/h4&gt;

&lt;p&gt;Podman currently runs stably only on Linux-based systems. On Windows or macOS it becomes a bit more demanding, although it is possible with workarounds.&lt;/p&gt;

&lt;p&gt;On Windows, Podman can be used well &lt;a href="https://opensource.com/article/21/10/podman-windows-wsl" rel="noopener noreferrer"&gt;via WSL&lt;/a&gt;; on macOS it can be installed directly with Homebrew. However, these solutions are not yet fully mature and are still considered "in development". More about this can be found in the official Podman &lt;a href="https://podman.io/blogs/2021/09/06/podman-on-macs.html" rel="noopener noreferrer"&gt;blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Personally, I've been using Podman for some time now with WSL 2 and the Ubuntu distribution, and I'm basically very satisfied. There are still minor teething problems here and there, but they are usually quickly fixed via a Google search.&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker-Compose
&lt;/h4&gt;

&lt;p&gt;Podman itself does not support Docker Compose to launch multiple containers locally. There are two alternatives for this. First, there is already a project called &lt;a href="https://github.com/containers/podman-compose" rel="noopener noreferrer"&gt;Podman-Compose&lt;/a&gt; that is supposed to provide the core functionality of Docker-Compose, and second, Podman supports the pods described above. These can also be used to launch and manage multiple containers at once - even via a more Kubernetes-friendly path.&lt;/p&gt;

&lt;p&gt;Another useful alternative to Docker-Compose is, for example, the use of &lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;minikube&lt;/a&gt; or &lt;a href="https://k3d.io/" rel="noopener noreferrer"&gt;k3d&lt;/a&gt;. These tools can be used to roll out local Kubernetes clusters easily and quickly, which can then be used for development purposes to deploy and test local Kubernetes objects such as Deployments, Services or Pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  MacOS: Alternative Lima
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj4ow2ewdttxceo82yhe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj4ow2ewdttxceo82yhe.png" alt="Source: https://github.com/lima-vm/lima"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://github.com/lima-vm/lima" rel="noopener noreferrer"&gt;https://github.com/lima-vm/lima&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Besides Podman, there is another alternative worth mentioning:&lt;br&gt;
&lt;a href="https://github.com/lima-vm/lima" rel="noopener noreferrer"&gt;Lima&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Lima - short for &lt;strong&gt;Li&lt;/strong&gt;nux virtual &lt;strong&gt;ma&lt;/strong&gt;chines - is mainly used as an alternative for MacOS in this context and comes with &lt;a href="https://www.qemu.org/" rel="noopener noreferrer"&gt;QEMU&lt;/a&gt; (a hypervisor), &lt;a href="https://containerd.io/" rel="noopener noreferrer"&gt;containerd&lt;/a&gt; and &lt;a href="https://github.com/containerd/nerdctl" rel="noopener noreferrer"&gt;nerdctl&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thus, Lima uses QEMU in the background for virtualization, containerd as container runtime and nerdctl as Docker-compatible CLI for containerd.&lt;br&gt;
Lima is similar to the Windows Subsystem for Linux (WSL) in version 2.&lt;/p&gt;

&lt;p&gt;Thus, it is definitely worth a look for our Mac users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outlook Unikernel
&lt;/h3&gt;

&lt;p&gt;Another possible way to move away from Docker containers are the so-called &lt;a href="https://github.com/cetic/unikernels" rel="noopener noreferrer"&gt;Unikernels&lt;/a&gt;. They are mentioned here briefly for the sake of completeness, even though they currently play no role in the Kubernetes context. An interesting &lt;a href="https://hackernoon.com/docker-is-dead-long-live-the-unikernel-2yn3zdr" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on the topic can be found on Hackernoon. It is an interesting construct that may one day find its way into the Kubernetes world if containers can be replaced by unikernels. At present, however, using unikernels would not be feasible in my opinion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgbw48egwkkgc4g92fyi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgbw48egwkkgc4g92fyi.jpeg" alt="Source: https://github.com/cetic/unikernels"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://github.com/cetic/unikernels" rel="noopener noreferrer"&gt;https://github.com/cetic/unikernels&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Last but not least, I would like to summarize the usage of Docker and Podman at fme AG to support our point of view.&lt;/p&gt;

&lt;p&gt;Many of us have used the Docker suite for years - some continue to use it in ongoing projects, some have already switched to other alternatives.&lt;/p&gt;

&lt;p&gt;There is - as always - no black and white, but many shades of gray in between. Accordingly, there are still plenty of use cases for Docker Desktop, and some customers are also willing to pay the associated licensing costs.&lt;/p&gt;

&lt;p&gt;However, there are also many opposing voices - both in the community and among our customers. The sudden change of the licensing model is a nuisance to them and contributes to people turning away from this software.&lt;/p&gt;

&lt;p&gt;We try to use Podman, Lima or similar free software in newer projects, because we want to stay independent of vendors and licensing constraints. In my opinion, open source and free software are an important cornerstone of successful and promising IT projects. Even Microsoft has understood and embraced this in recent years by investing heavily in cross-platform compatibility with products like .NET Core and WSL.&lt;/p&gt;

&lt;p&gt;For this reason, I am pro-Podman and the win - in my opinion - goes to Podman and the idea of free software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Corrections
&lt;/h2&gt;

&lt;p&gt;As explained above, we want to correct the wrong statements in this article. Therefore, this section contains a few corrections and clarifications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Mirantis acquired Docker Inc.
&lt;/h4&gt;

&lt;p&gt;This was a wrong assumption. Mirantis did not acquire Docker Inc. but Docker Enterprise, which was a part of Docker Inc. For more information have a look at &lt;a href="https://techcrunch.com/2019/11/13/mirantis-acquires-docker-enterprise/" rel="noopener noreferrer"&gt;this article&lt;/a&gt;. Another useful article can be found [German only] &lt;a href="https://entwickler.de/kubernetes/unterm-hammer-mirantis-kauft-docker-enterprise/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Podman uses containerd
&lt;/h4&gt;

&lt;p&gt;This was a wrong assumption. Podman directly uses runc or crun instead of containerd, relying on a technology named &lt;a href="https://github.com/containers/conmon" rel="noopener noreferrer"&gt;conmon&lt;/a&gt;. Some more useful information can be found in &lt;a href="https://iximiuz.com/en/posts/journey-from-containerization-to-orchestration-and-beyond/#container-management" rel="noopener noreferrer"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Docker-Compose not working with Podman
&lt;/h4&gt;

&lt;p&gt;Docker-Compose has worked with Podman &lt;a href="https://www.redhat.com/sysadmin/podman-docker-compose" rel="noopener noreferrer"&gt;since version 3.0&lt;/a&gt;. Basically, you point to the Podman socket instead of the Docker socket and it works just fine. As a result, docker-based scripts and tools can be used with Podman as if they were using Docker - they will not see any difference and will believe they are talking to Docker even though they are using Podman.&lt;/p&gt;
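
&lt;p&gt;A minimal sketch of that setup for a rootless user (the socket path may differ per distribution):&lt;/p&gt;

```shell
# Start the Podman API socket, which speaks the Docker API ...
systemctl --user enable --now podman.socket

# ... and point Docker-based tooling at it instead of the Docker socket.
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/podman/podman.sock

# docker-compose (and other Docker clients) now talk to Podman:
#   docker-compose up -d
```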

&lt;h4&gt;
  
  
  Rate limits
&lt;/h4&gt;

&lt;p&gt;We said that rate limits on Docker Hub are a reason why people and companies are switching from Docker to alternatives like Podman. One comment pointed out that the container registry has nothing to do with the container runtime itself. That's 100% true and valid. Even for Podman, Docker Hub - next to quay.io - is a default registry to pull images from.&lt;/p&gt;

&lt;p&gt;Nevertheless, Docker Hub is part of the Docker ecosystem, and the rate limit of the official Docker registry brought our customers to the point of switching to other registry providers or implementing their own container registry. They thereby drop a service that is part of the plain Docker ecosystem, so its use in the enterprise is decreasing.&lt;/p&gt;

&lt;h4&gt;
  
  
  RedHat replaced Docker with Podman in RHEL 8
&lt;/h4&gt;

&lt;p&gt;There was a comment that it only makes sense for RedHat to support Podman since it comes from their own product forge. That's definitely right!&lt;/p&gt;

&lt;p&gt;Still, there are a lot of businesses using and paying RedHat, which automatically leads to decreased use of Docker compared to Podman. Less use of Docker means more use of Podman.&lt;/p&gt;

&lt;p&gt;Last but not least, RedHat would not have invented Podman if there were no need for a tool independent of Docker. Podman helps in several areas where Docker lacks functionality, such as support for pods, rootless mode, etc. Useful information about why RedHat invests in Podman can be found &lt;a href="https://www.redhat.com/en/blog/why-red-hat-investing-cri-o-and-podman" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>podman</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Automate DevOps Workflows using AWS StepFunctions Service Integrations</title>
      <dc:creator>Thomas Laue</dc:creator>
      <pubDate>Tue, 31 May 2022 18:48:45 +0000</pubDate>
      <link>https://forem.com/fmegroup/automate-devops-workflows-using-aws-stepfunctions-service-integrations-250c</link>
      <guid>https://forem.com/fmegroup/automate-devops-workflows-using-aws-stepfunctions-service-integrations-250c</guid>
      <description>&lt;p&gt;&lt;a href="https://aws.amazon.com/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&amp;amp;step-functions.sort-order=desc" rel="noopener noreferrer"&gt;AWS Step Functions&lt;/a&gt;, a serverless workflow orchestration service offering by AWS, has been around since several years now. Many blog posts (like &lt;a href="https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/" rel="noopener noreferrer"&gt;Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions&lt;/a&gt;), presentations and learning courses (e.g. &lt;a href="https://theburningmonk.thinkific.com/courses/complete-guide-to-aws-step-functions" rel="noopener noreferrer"&gt;Complete guide to AWS Step Functions&lt;/a&gt;) have been published showing the capabilities and rich feature set provided. &lt;/p&gt;

&lt;p&gt;However, not many of them deal with topics related to DevOps tasks -- maybe because until recently Step Functions only offered a limited set of direct service integrations like &lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;. Accessing an AWS API required using, for instance, an AWS SDK or AWS CLI commands in a script or Lambda function, but this changed a few months ago.&lt;/p&gt;

&lt;p&gt;In September 2021, AWS added support for over 200 AWS services via AWS SDK integrations, making over 9,000 AWS API actions available. Only a few weeks before, another major enhancement had been released: the new Workflow Studio, a low-code visual tool for building state machines. It is now easier than ever to build workflows -- from simple to complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The challenge&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Around the same time, we joined a migration project at a customer who was moving a large application, which had been hosted on-premises so far, to AWS using services like EC2, RDS, ALB.... Some of the typical operational tasks like managing the database servers are now gone, as AWS takes care of the heavy lifting, but new ones have arrived and others stay the same.&lt;/p&gt;

&lt;p&gt;As the project proceeded, we thought about how we could automate as many operational tasks as possible using native AWS services. AWS Step Functions service integrations came along at just the right time to make our life much easier. We were able to handle many recurring tasks by creating state machines, which are sometimes triggered by scheduled &lt;a href="https://aws.amazon.com/eventbridge/" rel="noopener noreferrer"&gt;Amazon EventBridge&lt;/a&gt; rules or run manually via CLI or Console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The simple one&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex7ztjkpman23lfcdeei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fex7ztjkpman23lfcdeei.png" alt="Simple Workflow" width="505" height="401"&gt;&lt;/a&gt; A workflow consisting only of two steps (neglecting &lt;em&gt;Start&lt;/em&gt; and &lt;em&gt;End&lt;/em&gt;) is triggered shortly before the next EC2 maintenance window to get an overview about all security patches which will be installed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/systems-manager/" rel="noopener noreferrer"&gt;AWS Systems Manager&lt;/a&gt;`s service integration &lt;em&gt;ssm:describeInstancePatches&lt;/em&gt; is used to get the list all patches which will be sent to an AWS SNS topic in order to be delivered to an email inbox of someone who is in charge to check if there might be a conflict ahead with the application requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq32d1x13m801vxhc2z2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq32d1x13m801vxhc2z2h.png" alt="API Parameter" width="586" height="450"&gt;&lt;/a&gt; The Workflow Studio editor makes it quite easy to assemble a workflow and to enrich every step with the required parameters and settings. All service integrations are based on the AWS SDK API calls so that the parameters can be retrieved from the SDK documentation (an example is shown for Systems Manager API).&lt;/p&gt;

&lt;p&gt;Workflow Studio also allows exporting the state machine definition to a JSON or YAML file so that it can be included in an infrastructure-as-code project using, for instance, Terraform.&lt;/p&gt;

&lt;p&gt;Information like the EC2 instance ID or the SNS topic ARN can be injected at deploy time using, for instance, Terraform template variables, as shown in the example JSON state machine definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr4wdkjfxo9rlerohr41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyr4wdkjfxo9rlerohr41.png" alt="State Machine Definition" width="800" height="809"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big benefit of using Step Functions is that no custom code and no additional overhead for managing a Lambda function are required to complete this task. And the best thing: the state machine is quite intuitive to create, self-documenting, and easy to follow and recap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more complex process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Following the same principles, it is possible to create more complex workflows. The given example shows a workflow which is used to restart, in a rolling manner, all servers belonging to the web app tier which are behind an AWS Application Load Balancer. No application downtime is required in order to restart them, as only a certain number is restarted at once.&lt;/p&gt;

&lt;p&gt;In the first step, the alarm actions of some CloudWatch alarms which should not fire during the restart process, as well as some AWS EventBridge rules, are disabled using a Lambda function, as the logic to filter these resources needs some custom code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpg0gjq5r128tc09tcxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpg0gjq5r128tc09tcxz.png" alt="More Complex Workflow" width="774" height="973"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A property of the AWS Step Functions &lt;em&gt;Map&lt;/em&gt; state, the &lt;em&gt;Maximum Concurrency Control&lt;/em&gt;, is used to restrict the number of instances which are deregistered from the ALB target group at once, followed by a reboot and a final check whether the application has been launched successfully before the instance is brought back into the target group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhw0oslih1nxnhhwyihu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhw0oslih1nxnhhwyihu1.png" alt="Concurrency Control Setting" width="572" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rebooting only a limited number of instances at a time makes sure that the application stays online and that enough servers are always available to handle user traffic without a significant influence on the user experience.&lt;/p&gt;
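
&lt;p&gt;In the state machine definition, this concurrency limit is a single field on the &lt;em&gt;Map&lt;/em&gt; state. A sketch, with the inner workflow reduced to a placeholder state:&lt;/p&gt;

```python
# Sketch of a Map state iterating over instance IDs, at most two at a time.
map_state = {
    "Type": "Map",
    "ItemsPath": "$.InstanceIds",
    # Maximum Concurrency Control: never take more than two instances
    # out of the target group simultaneously.
    "MaxConcurrency": 2,
    "Iterator": {
        "StartAt": "DeregisterRebootCheckRegister",
        "States": {
            # Placeholder for deregister / reboot / health check / register
            "DeregisterRebootCheckRegister": {"Type": "Pass", "End": True},
        },
    },
    "End": True,
}
```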

&lt;p&gt;The new AWS SDK service integrations help again to model the workflow, as a sequence of steps must be followed in order to reboot a running instance successfully. Not only does a server have to be de-/registered from the target group (among others using the &lt;em&gt;elasticloadbalancingv2:registerTargets&lt;/em&gt; SDK command), it also has to be rebooted (&lt;em&gt;ec2:rebootInstances&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;After a certain wait period, an application startup check is performed using a Lambda function to make sure that everything is working correctly, as the whole check process again requires some custom logic. Only a healthy, working server should be put back into the ALB target group.&lt;/p&gt;

&lt;p&gt;The application requires some minutes to get everything sorted out until it is ready to serve, whereby the startup time varies depending on factors like external database connections... The &lt;em&gt;Wait&lt;/em&gt; state helps in this case to pause the workflow for a certain time. Nevertheless, it can happen that the following startup check fails because the application is not yet ready and another wait period is required.&lt;/p&gt;

&lt;p&gt;A built-in "for-loop" feature for Step Functions would be quite helpful in this case to re-run the last two steps (wait + startup check). It is possible to model this construct using a &lt;em&gt;Choice&lt;/em&gt; state which checks the result returned by the startup check Lambda function and acts upon it (i.e., goes back to the &lt;em&gt;Wait&lt;/em&gt; state if the application is not ready yet). &lt;/p&gt;

&lt;p&gt;However, this feels somewhat clumsy and more like a workaround. Additionally, a break condition (e.g., a maximum number of checks) is required, which introduces state that must somehow be passed around or stored somewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmphfu3g9jh3qv9pl5mz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmphfu3g9jh3qv9pl5mz2.png" alt="Kind of For-Loop" width="682" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Custom Retry and Error Handling&lt;/em&gt; for Lambda functions, another cool feature of Step Functions, comes to our rescue. Custom errors which are thrown from a Lambda function can be handled: depending on the use case, a &lt;em&gt;Catcher&lt;/em&gt; or a &lt;em&gt;Retrier&lt;/em&gt; for the custom error class can be defined to deal with the situation. The latter is used to simulate a "for-loop" without relying on the &lt;em&gt;Choice&lt;/em&gt; state workaround.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyz8bsj4v7myqoxfxj45.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyz8bsj4v7myqoxfxj45.png" alt="Code Snipped of Custom Error Class" width="661" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Lambda function raises a custom &lt;em&gt;InstanceNotYetStartedException&lt;/em&gt; in case the health check fails. This exception is handled by a specific &lt;em&gt;Retrier&lt;/em&gt; which defines a longer wait interval (120 seconds) to give the application some additional time before the next check. This whole procedure is repeated up to three times, until it can be assumed that something went wrong and should be handled otherwise (processing moves on to a &lt;em&gt;States.ALL Catcher&lt;/em&gt; which calls an SNS integration step to publish an alarm).&lt;/p&gt;
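
&lt;p&gt;In ASL, such a &lt;em&gt;Retrier&lt;/em&gt; is just a few fields on the task state. A sketch with the values described above (state names and the Lambda target are hypothetical):&lt;/p&gt;

```python
# Sketch of the startup-check task with a Retrier for the custom error.
startup_check = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",  # placeholder target
    "Retry": [
        {
            # Retry only the custom "not ready yet" error ...
            "ErrorEquals": ["InstanceNotYetStartedException"],
            # ... waiting 120 s between attempts, up to three times.
            "IntervalSeconds": 120,
            "MaxAttempts": 3,
            "BackoffRate": 1.0,
        }
    ],
    "Catch": [
        {
            # Everything else (or exhausted retries) raises the alarm.
            "ErrorEquals": ["States.ALL"],
            "Next": "PublishAlarm",
        }
    ],
    "Next": "RegisterTarget",
}
```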

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7yjm9flexmq0ou510ol.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7yjm9flexmq0ou510ol.png" alt="Error and Retry Handling Parameters" width="556" height="881"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a last note on this workflow: the &lt;em&gt;Map&lt;/em&gt; state fails as soon as one of its executions has failed. All running inner executions are aborted and all waiting ones are cancelled. Care should be taken with this scenario: adding a dead letter queue to the inner &lt;em&gt;Map&lt;/em&gt; state workflow would be one option, defining a &lt;em&gt;States.ALL Catcher&lt;/em&gt; on the &lt;em&gt;Map&lt;/em&gt; state level another one, or the complete state machine execution could even be failed on purpose. The best error handling method depends on the workflow requirements. The global &lt;em&gt;Catcher&lt;/em&gt; is used in the presented case, as some additional steps (putting the deactivated CloudWatch alarms back in place) must be executed in all cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When not to use&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step Functions has some limits, like every other AWS service, which might prevent one from using it in some rare cases or which require a workaround. Furthermore, there are external API properties which might not fit Step Functions. Two examples are briefly discussed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maximum input/output size for a task is 256 KB: AWS API calls might return a lot of JSON data, but there are various mechanisms like the &lt;em&gt;filters&lt;/em&gt; parameter and &lt;em&gt;pagination&lt;/em&gt; support in place to narrow down the scope of a request. Additionally, Step Functions provides output processing functions to extract the data of interest, so this limitation should not be a blocker for most use cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to deal with API calls supporting pagination: many AWS API endpoints return a maximum number of items plus an additional &lt;em&gt;NextToken&lt;/em&gt; value which can be used to retrieve the next batch with a subsequent call. The clumsy &lt;em&gt;Choice&lt;/em&gt;-state construct mentioned above could be used to handle this, but it is not practical. A Lambda function is much better suited in this situation in case a lot of data must be retrieved.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
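
&lt;p&gt;For illustration, the &lt;em&gt;NextToken&lt;/em&gt; pattern such a Lambda function would implement can be sketched generically; the &lt;code&gt;fetch_page&lt;/code&gt; stub below stands in for any paginated AWS API call:&lt;/p&gt;

```python
def collect_all_items(fetch_page):
    """Drain a NextToken-paginated API. fetch_page(token) must return
    a dict with an "Items" list and an optional "NextToken"."""
    items, token = [], None
    while True:
        page = fetch_page(token)
        items.extend(page["Items"])
        token = page.get("NextToken")
        if token is None:
            return items

# Stub standing in for a real paginated AWS call:
def fake_fetch(token):
    pages = {
        None: {"Items": [1, 2], "NextToken": "t1"},
        "t1": {"Items": [3], "NextToken": "t2"},
        "t2": {"Items": [4, 5]},  # last page: no NextToken
    }
    return pages[token]

print(collect_all_items(fake_fetch))  # [1, 2, 3, 4, 5]
```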

&lt;p&gt;&lt;strong&gt;Wrapping up&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog post presents use cases for Step Functions which might not be the most common ones out there but which proved to be extremely useful. The new SDK integrations have opened up a wide field of possibilities to model workflows visually without writing a lot of custom code (even though Lambda is always there if something cannot be solved by built-in mechanisms).&lt;/p&gt;

&lt;p&gt;The Step Functions Workflow Studio allows designing and building workflows from simple to quite complex ones in an intuitive and rapid way. The ready-to-use workflow can be exported to code (JSON or YAML), so a developer's heart does not need to cry and the integration into an infrastructure-as-code framework can be made.&lt;/p&gt;

&lt;p&gt;Some additional features, like more intrinsic functions (e.g., for string processing) to deal with the sometimes very large JSON results of AWS SDK calls, would make working with Step Functions even easier (a big point for &lt;a href="https://twitter.com/search?q=%23awswishlist" rel="noopener noreferrer"&gt;#awswishlist&lt;/a&gt;).&lt;/p&gt;

</description>
      <category>python</category>
      <category>serverless</category>
      <category>aws</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
