Forem: Thomas Laue

AWS documentation and service quotas are your friends – do not miss them!

Thomas Laue — Fri, 06 Oct 2023 14:31:16 +0000

Who loves reading docs? Probably not that many developers and engineers. Coding and developing are so much more exciting than spending hours reading tons of documentation. However, I was recently reminded that skipping the docs is one of the big misconceptions and fallacies -- and probably not only mine.

AWS provides extensive documentation for each service which contains not just a general overview but, in most cases, in-depth details and specifics of that service. The documentation of most services runs to hundreds of pages, including a lot of examples and code snippets which are quite often helpful -- especially those related to IAM policies. It is not always easy to find the relevant pieces for a specific edge case, and now and then they are missing altogether, but overall AWS has done a great job documenting its landscape.

Service quotas, which exist for every AWS service, are another part which should not be missed when either starting to work with a new AWS service or starting to use one more extensively. Many headaches and hours lost debugging an issue could be avoided by taking these quotas into account right from the start. Unfortunately, this lesson is all too easy to forget, as the following example shows.

In a recent project, AWS DataSync was used to move about 40 million files from an Amazon EFS share to an S3 bucket. The whole sync process was to be repeated from time to time after the initial sync to take new and updated files into account. AWS DataSync supports this scenario by applying an incremental approach after the first run.

One DataSync location was created for EFS and another one for S3, and both were tied together by a DataSync task which configures, among other things, the sync properties. The initial run of this task went fine. All files were synced after about 9 hours.

Some days later, a first incremental sync was started to reflect the changes which had happened on EFS since the first run. The task went into the preparation phase but failed after about 30 minutes with a strange error message:

"Cannot allocate memory" -- what are you trying to tell me? No memory setting was configured in the DataSync task definition as no agent was involved. The first hit on Google shed some light on this problem by pointing me to the AWS DataSync documentation, which contains a link to the DataSync task quotas.

Apparently, 40 million files are way too many for one task, as only 25 million are supported when transferring files between AWS storage services. A request to AWS Support confirmed that the problem was related to the large number of files. I have no idea why the initial run went through, but at least the follow-up run failed. Splitting the task into several smaller ones solved the issue so that the incremental run could finally succeed as well.
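How the split was done is not described here. One possible way, sketched below in Python, is to group top-level directories so that each group stays within the quota and then to use each group as the include filter of its own DataSync task. The directory names and file counts are made up, and `group_directories` is a hypothetical helper, not part of any AWS SDK.

```python
from typing import Dict, List

# Quota mentioned above for transfers between AWS storage services.
MAX_FILES_PER_TASK = 25_000_000


def group_directories(file_counts: Dict[str, int],
                      limit: int = MAX_FILES_PER_TASK) -> List[List[str]]:
    """Greedily pack directories into groups whose file totals stay within the limit."""
    groups: List[List[str]] = []
    totals: List[int] = []
    # Place the largest directories first so the groups fill up more evenly.
    for path, count in sorted(file_counts.items(), key=lambda kv: kv[1], reverse=True):
        if count > limit:
            raise ValueError(f"{path} alone exceeds the limit and must be split further")
        for i, total in enumerate(totals):
            if total + count <= limit:
                groups[i].append(path)
                totals[i] += count
                break
        else:
            # No existing group has room left, so open a new one.
            groups.append([path])
            totals.append(count)
    return groups


# Hypothetical directory sizes adding up to about 40 million files.
counts = {"/invoices": 18_000_000, "/images": 12_000_000,
          "/exports": 7_000_000, "/logs": 3_000_000}
task_groups = group_directories(counts)
# Each group could become the include filter of one DataSync task
# (patterns in a filter string are separated by "|").
include_filters = ["|".join(group) for group in task_groups]
```

With these sample numbers two tasks are enough. Real file counts per directory would have to be determined first, for example from a storage inventory.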

Nevertheless, some hours were lost even though we learned something new.

Lessons learned -- again:

Embrace the docs -- even though they are really extensive!
Take the service quotas into account before starting to work and while working with an AWS service. They will become relevant one day -- possibly sooner rather than later!
AWS technical support really likes to help and is quite competent. Do not hesitate to contact them (if you have a support plan available).

A cross account cost overview dashboard powered by Lambda, Step Functions, S3 and Quicksight

Thomas Laue — Fri, 17 Feb 2023 13:01:44 +0000

Keeping an eye on cloud spending in AWS or with any other cloud service
provider is one of the most important parts of every project team's
work. This can be tedious when following the AWS recommendation and best
practice to split workloads over more than one AWS account. Consolidated
billing -- a feature of AWS Organizations -- would help in this case, but
access to the main/billing account is very often not granted to project
teams and departments for good reason.

In this article, a solution is presented which automatically collects
billing information from various accounts and presents it in a concise
format in one or more AWS Quicksight dashboards. Before going into the
details, let's look briefly at the available AWS services for cost
management and at the difficulties that arise when working with more
than one account.

AWS Cost Management Platform and Consolidated Billing

AWS provides a complete toolset consisting of AWS Billing, Cost
Explorer, and other sub services. They offer functionality to get a
focused overview of costs and usage as well as tools to drill down
into the details of most services. The tooling is sufficient to get the
job done when working with one AWS account, even though the user
experience might not be as polished as that of more dedicated and
focused third-party tools. Special IAM permissions are needed to access
all this information, which covers the governance part as well.

AWS Organizations, which governs all AWS accounts associated with it,
provides some extended functionality (aka consolidated billing) to
centralize cost monitoring and management. However, in most companies
and large enterprises only very few people are allowed to access the
billing account. This does not help a project lead to get the required
information easily. Depending on the number of AWS accounts used by a
team, someone with the necessary permissions either has to log in to
every account regularly and check the costs or has to rely on "Cost and
Usage" reports which can be exported automatically to S3. These reports
are very detailed (maybe too detailed for simple use cases) and require
some custom tooling to extract the required information.

AWS has published a solution called Cloud Intelligence Dashboards -- a collection of Quicksight dashboards which are based on these cost and usage reports, among other data sources. Besides this one, company-internal cost control tools -- sometimes based on the same AWS services -- exist and can be "rented". All these solutions have their advantages and use cases -- but also drawbacks (mostly related to their costs and sometimes also to overly broad IAM permission requirements).

An approach for simple use cases

Sometimes it is fully sufficient to present some information in a
concise manner to stay informed about the general trends and total
amounts. If something looks strange or falls outside the expected
range, a more detailed analysis can be performed using, for instance,
the AWS tooling mentioned above.

Following this idea, Step Functions, Lambda, S3 and Quicksight power a
small application which retrieves cost related data for every designated
AWS account and stores it as a JSON file in S3. Quicksight, which
supports S3 as a data source, reads this data directly and makes it
available for building one or more dashboards displaying various cost
related diagrams and tables. This workflow (which is shown below) is
triggered regularly by an EventBridge rule (e.g., once a day) so that
up-to-date information is available.

The "Data preparation" step provides the AWS account ids and names as input for the following Map State.

For every array element (= AWS account), the Map State starts an
instance of a Lambda function which assumes an IAM role in the
relevant account and queries the AWS Cost Explorer and AWS Budgets
APIs to collect the required information: total current costs, costs
per service, cost forecasts...

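import json

import arrow
import boto3
from aws_lambda_powertools import Logger, Tracer

# Assumed module-level setup (not part of the original listing).
# validate_input, assume_role and get_start_and_end_date are helper
# functions which are not shown here.
logger = Logger()
tracer = Tracer()
session = boto3.Session()
date_format = "YYYY-MM-DD"
ROLE_NAME = "query-cost-center-role"


@logger.inject_lambda_context(log_event=True)
@tracer.capture_lambda_handler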
def lambda_handler(event, context):
     validate_input(event)

     assumed_role_session = assume_role(
         session,
         f'arn:aws:iam::{event["account_id"]}:role/{ROLE_NAME}',
         region_name="eu-central-1",
     )

     client = assumed_role_session.client("ce")
     budget_client = assumed_role_session.client("budgets")

     now = arrow.utcnow()
     first_day_of_current_month, end_date = get_start_and_end_date(now)

     cost_response = client.get_cost_and_usage(
         TimePeriod={
             "Start": first_day_of_current_month.format(date_format),
             "End": end_date.format(date_format),
         },
         Granularity="MONTHLY",
         Metrics=["UnblendedCost"],
     )

     cost_per_service = client.get_cost_and_usage(
         TimePeriod={
             "Start": first_day_of_current_month.format(date_format),
             "End": end_date.format(date_format),
         },
         Granularity="MONTHLY",
         Metrics=["UnblendedCost"],
         GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
     )

 …

     payload = {
         "date": now.format(date_format),
         "current_month": now.month,
         "current_year": now.year,
         "account_name": f'{event["account_name"].upper()}',
         "current_costs": f'{float(cost_response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"]):.2f}',  
         "forecasted_costs": f"{float(forecasted_cost):.2f}",
         "budget": f'{float(budget_response["Budgets"][0]["BudgetLimit"]["Amount"]):.2f}',
         "account_id": f'{event["account_id"]}',
     }

     for item in service_costs:
         payload = payload | item

     return {"statusCode": 200, "body": json.dumps(payload)}

The outcome of the Map State is an array consisting of the cost data retrieved from every AWS account.

[
  {
     "date": "2023-02-09",
     "current_month": 2,
     "current_year": 2023,
     "account_name": "Test",
     "current_costs": "152.49",
     "forecasted_costs": "468.10",
     "budget": "700.00",
     "account_id": "11111111111x",
     "Amazon DynamoDB": 10.0002276717,
     "AWS CloudTrail": 3.0943441333,
     "AWS Lambda": 30.24534433,
     "AWS Key Management Service": 1.8699148608,
     "AWS Step Functions": 32.5439809890,
     "Amazon Relational Database Service": 16.1240975116,
 …
  },
  {}
]

This data is stored as a JSON file in an S3 bucket in the last workflow
step. Every file name contains the current date to make it unique.
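The handler listing above elides how the per-service entries of the payload are derived. A minimal sketch of one way to flatten a grouped Cost Explorer response into such entries is given below. The helper name `extract_service_costs` and the sample response are made up for illustration; the response only mimics the shape returned when grouping by the SERVICE dimension.

```python
from typing import Any, Dict, List


def extract_service_costs(cost_per_service: Dict[str, Any]) -> List[Dict[str, float]]:
    """Turn a grouped cost response into one {service name: amount} dict per service."""
    service_costs = []
    for result in cost_per_service["ResultsByTime"]:
        for group in result["Groups"]:
            service_name = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            service_costs.append({service_name: amount})
    return service_costs


# Trimmed, hand-written sample in the shape of a response grouped by service.
sample_response = {
    "ResultsByTime": [
        {
            "Groups": [
                {"Keys": ["Amazon DynamoDB"],
                 "Metrics": {"UnblendedCost": {"Amount": "10.0002276717", "Unit": "USD"}}},
                {"Keys": ["AWS Lambda"],
                 "Metrics": {"UnblendedCost": {"Amount": "30.24534433", "Unit": "USD"}}},
            ]
        }
    ]
}

service_costs = extract_service_costs(sample_response)

payload = {"account_name": "TEST"}
for item in service_costs:
    # Same effect as "payload | item" in the handler, but also works before Python 3.9.
    payload = {**payload, **item}
```

Merging the entries into the payload yields one flat JSON object per account, which keeps the later import into Quicksight simple.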

Apart from the Step Functions workflow and the Lambda function, which
can for instance run in a dedicated AWS account to simplify the
permission management, an IAM role needs to be deployed to every account
whose costs should be retrieved. This role must trust the Lambda
function and contain the necessary permissions to query AWS Cost
Explorer and AWS Budgets. An example written in Terraform is given below.

module "iam_assume_role_sts" {
   source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
   version = ">= 4.7.0"

   role_name = "query-cost-center-role"
   trusted_role_arns = [
     var.query_cost_center_lambda_role_arn
   ]

   create_role       = true
   role_requires_mfa = false

   custom_role_policy_arns = [
     module.query_cost_center_policy.arn
   ]
}

module "query_cost_center_policy" {
   source  = "terraform-aws-modules/iam/aws//modules/iam-policy"
   version = ">= 4.7.0"

   name        = "Query-cost-center-policy"
   path        = "/"
   description = "This policy allows querying AWS Cost Explorer and AWS Budgets"
   policy      = data.aws_iam_policy_document.query_cost_center_policy.json
}

# Provides the account id used in the budget ARN below
# (only needed if it is not already declared elsewhere in the module).
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "query_cost_center_policy" {
   statement {
     actions = [
       "ce:GetCostAndUsage",
       "ce:GetCostForecast"
     ]

     resources = ["*"]
   }

   statement {
     actions = [
       "budgets:ViewBudget"
     ]

     resources = [
       "arn:aws:budgets::${data.aws_caller_identity.current.account_id}:budget/project_budget"
     ]
   }
}

Data visualization with Quicksight

Quicksight supports S3 as a direct data source -- only a manifest file
containing a description of the data stored in the bucket is needed.
This is quite handy for data whose structure does not change or only
very seldom.
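Such a manifest file is a small JSON document. The following Python sketch builds one that points at all files below a prefix. The bucket name and prefix are placeholders, `build_manifest` is a hypothetical helper, and the keys supported in a manifest should be checked against the Quicksight documentation.

```python
import json


def build_manifest(bucket: str, prefix: str) -> dict:
    """Build a manifest that points Quicksight at all JSON files below a prefix."""
    return {
        "fileLocations": [{"URIPrefixes": [f"s3://{bucket}/{prefix}"]}],
        "globalUploadSettings": {"format": "JSON"},
    }


# Bucket name and prefix are placeholders.
manifest = build_manifest("my-cost-data-bucket", "cost-reports/")
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Using a prefix instead of a list of individual files means that the daily files added by the workflow are picked up without touching the manifest again.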

A more involved setup including AWS Glue and Amazon Athena might be
beneficial in cases where either a lot of details (not only basic cost
information) are queried, or a lot of different AWS services are used
over time in the different AWS accounts. Quicksight might run into
problems when trying to load this kind of data because its structure
changes constantly and the manifest file requires a lot of updates. A
Glue Crawler combined with an Athena table might be the better approach
in such a scenario.

As soon as a new dataset based on the S3 bucket has been created, one or
several dashboards can be implemented. They can present some overview
data, as is done in the example below, or go into more detail --
depending on the specific requirements. How to create these dashboards
is out of scope of this article, but Quicksight offers enough tooling to
start simple and to go a long way towards sophisticated displays.

A Quicksight dashboard can either be shared with individuals in need of
this information or a scheduled email notification can be established.
Quicksight will send a mail to all specified recipients which can
include an image of the dashboard as well as some data in CSV format.
This feature helps a lot as it is not always necessary to log in to keep
the costs under control. Simply receiving an automated message, for
instance every day or just once a week, can already help to stay
informed.

Wrapping up

Cost monitoring is an important topic for every project -- from small to
large. AWS offers various tools to stay up to date, but this task
becomes tedious when following AWS best practice and separating an
application into different stages and AWS accounts. There are 3rd party
or company-internal tools available which help to overcome this
situation, but it is not always possible to use them (especially in an
enterprise setup) or they come with their own drawbacks.

This blog post has presented a small-scale application which offers
enough information and details to monitor the costs generated by small
to medium size projects. It has its own limitations as it might not be
powerful enough when dealing with tens or even hundreds of AWS accounts
-- but this is normally not the typical setup of a project.

Photo credits
Photo by Anna Nekrashevich: https://www.pexels.com/de-de/foto/lupe-oben-auf-dem-dokument-6801648/

How not to send all your money to AWS

Thomas Laue — Fri, 07 Oct 2022 12:52:33 +0000

The AWS environment has grown into a kind of universe providing more than 250 services over the last 20 years. Many applications, which quite often use 5-10 or more of these services, benefit from the rich feature set provided. Burdens which existed in local data centers, like server management, have been taken over by AWS to give developers and builders more freedom to be more productive, more creative and, in the end, more cost effective.

Billing and cost management at AWS is one of the hotter topics which have been discussed and complained about throughout the internet. AWS provides tools like the AWS Pricing Calculator which help to create cost estimates for application infrastructure. Significant effort has been spent over the last couple of years to improve the AWS Billing Console in order to provide better insights into daily and monthly spending. However, it can still be hard to get a clear picture as every service and quite often also every AWS region has its own pricing structure.

At the same time, more and more cost saving options have been released. Depending on the characteristics and architecture of an application, it can be astonishingly easy to save a considerable amount of money with no or only a limited investment of engineering time by using some of the following tips.

Clean up resources and data

Probably the most obvious step to reduce costs is to delete unused resources like dangling EC2 or RDS instances, old EBS snapshots or AWS Backup files etc. S3 buckets containing huge amounts of old or unused data might as well be a good starting point to reduce costs.

Besides saving money, all these measures enhance security at the same time: things which are no longer available cannot be hacked or leaked. AWS provides features like AWS Backup and S3 retention policies to manage the data lifecycle automatically so that not everything must be done manually.

AWS Savings Plans

Savings Plans, which come in different flavours, were released about 3 years ago. They offer discounts of up to 72 percent compared to On-Demand pricing when choosing an "EC2 Instance Savings Plan". The more flexible "Compute Savings Plans", which still offer up to 66 percent discount, are especially attractive as they cover not only EC2 but Lambda and Fargate as well.

Workloads running 24/7 with a somewhat predictable load are best suited for this type of offering. Depending on the selected term length (1 or 3 years) and payment option (No, Partial or All Upfront), a fixed discount is granted in return for a commitment to spend a certain amount of dollars per hour on compute. Architectural changes like switching EC2 instance types or moving workloads from EC2 to Lambda or Fargate are possible and covered by the Compute Savings Plans.

Purchasing Savings Plans requires a minimum of work and yields considerable savings, especially as most workloads require some sort of compute, which contributes significantly to the total costs.
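A rough feel for the effect of an hourly commitment can be had with a few lines of Python. The sketch below is a simplification under stated assumptions: it applies one flat discount rate to all usage (real Savings Plans rates differ per instance type, region and term), and the function name and all numbers are made up for illustration.

```python
def hourly_cost(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Hourly compute cost in dollars with a Savings Plan in place.

    on_demand_usage: what the usage of this hour would cost at On-Demand rates
    commitment:      committed spend per hour (billed even if it is not used)
    discount:        assumed flat discount as a fraction, e.g. 0.5 for 50 percent
    """
    # On-Demand-equivalent usage which the commitment can absorb.
    covered = commitment / (1 - discount)
    # Usage beyond the covered amount is billed at On-Demand rates.
    overflow = max(0.0, on_demand_usage - covered)
    return commitment + overflow


# Hypothetical workload: 10 $/h On-Demand, 4 $/h commitment, 50 percent discount.
with_plan = hourly_cost(on_demand_usage=10.0, commitment=4.0, discount=0.5)
saving_per_hour = 10.0 - with_plan

# An idle hour still costs the full commitment.
idle_hour = hourly_cost(on_demand_usage=0.0, commitment=4.0, discount=0.5)
```

The idle-hour case shows why the commitment should be sized to the steady baseline of a workload rather than to its peaks.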

Reserved Instances and Reserved Capacity

Unfortunately, AWS does not offer Savings Plans for all possible scenarios or AWS services, but other powerful discount options like Amazon RDS Reserved Instances come to the rescue. Reserved Instances use configuration options comparable to those of Savings Plans and promise a similar discount rate for workloads which require continuously running database servers.

The flexibility to change is, however, limited and depends on the database used. Nevertheless, it is worth considering Reserved Instances as a cost optimization choice which again requires only a minimal investment of time.

Amazon DynamoDB, the serverless NoSQL database, offers a feature called Reserved Capacity. It reserves a guaranteed amount of read and write throughput per second per table. Similar term and payment conditions as already mentioned for Savings Plans and Reserved Instances apply here as well. Predictable traffic patterns benefit from cost reduction compared to the On-Demand or Provisioned throughput modes.

Automatic Shutdown and Restart of EC2 and RDS Instances

Many EC2 and RDS instances are only used at specific times of the day and most often not at all during the weekend. This applies mostly to development and test environments but might also be valid for production workloads. A considerable amount of money can be saved by turning off these idle instances when they are not needed.

An automated approach which initiates and manages the shutdown and restart according to a schedule can take over this task so that nearly no manual intervention is needed. AWS provides a solution called "Instance Scheduler" which can be used to perform this work if the application does not require a specific start or shutdown sequence.

Specific workflows which require for instance to start the databases first, prior to launching any servers can be modelled by AWS Step Functions and executed using scheduled EventBridge rules. Step Functions is extremely powerful and supports a huge range of API operations so that nearly no custom code is necessary.
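A heavily simplified version of such a start workflow could look like the following state machine definition, built here as a Python dictionary. This is a sketch under assumptions, not the workflow used in the project: the identifiers are placeholders, the fixed wait stands in for proper polling of the database status, and the exact parameter casing expected by the SDK integrations should be verified against the Step Functions documentation.

```python
import json


def build_start_workflow(db_instance_id: str, ec2_instance_ids: list) -> dict:
    """Return a state machine definition that starts the database before the servers."""
    return {
        "Comment": "Start the database first, then the application servers",
        "StartAt": "StartDatabase",
        "States": {
            "StartDatabase": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:rds:startDBInstance",
                "Parameters": {"DbInstanceIdentifier": db_instance_id},
                "Next": "WaitForDatabase",
            },
            # Simplification: a real workflow would poll the instance status instead.
            "WaitForDatabase": {"Type": "Wait", "Seconds": 300, "Next": "StartServers"},
            "StartServers": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:ec2:startInstances",
                "Parameters": {"InstanceIds": ec2_instance_ids},
                "End": True,
            },
        },
    }


# Placeholder identifiers for illustration only.
definition = build_start_workflow("app-db", ["i-0123456789abcdef0"])
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON could be passed as the definition of a state machine, for example via Terraform or CloudFormation.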

An example of a real-world workflow which stops an application consisting of several RDS and EC2 instances is shown in the image below. A strict sequence of shutdown steps must be followed to make sure that the application stops correctly. This workflow is triggered every evening when the environment is no longer needed.

The counter part is given in the next example. This workflow is used to launch the environment every morning before the first developer starts working.

Shutting down all EC2 and RDS instances at night and over the weekend cut the compute costs by about 50 percent in this project, which is significant for a larger environment. The only caveat with this approach so far has been insufficient EC2 capacity when trying to restart the instances. This has happened very seldom, but when it did, it took about half a day until AWS had enough resources available for a successful launch.
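The schedules for such evening and morning workflows are expressed as EventBridge cron expressions. The small helper below builds expressions which fire on weekdays only. The function name and the chosen times are assumptions for illustration, and EventBridge evaluates these expressions in UTC.

```python
def weekday_cron(hour_utc: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression firing Monday to Friday at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} ? * MON-FRI *)"


start_schedule = weekday_cron(6)       # launch the environment in the morning
stop_schedule = weekday_cron(18, 30)   # stop it in the evening
```

Because no rule fires on Saturday or Sunday, the environment simply stays down after the Friday evening stop until Monday morning.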

Up- and downscaling of instances in test environments

This option might not work in all cases as test (aka UAT) environments often mirror the production workload by design to have nearly identical conditions when performing manual or automated tests. Load tests in particular, but others as well, should be executed on production-like systems as their results are not reliable otherwise. In addition, not every application runs as smoothly on smaller EC2 instances as on larger ones, and changing an instance size might require additional changes to the application configuration.

Nevertheless, it sometimes is possible to downscale them (RDS databases might be an additional option) when load and other heavy tests are not performed on a regular basis (even though this might be recommended in theory).

Infrastructure as code frameworks like Terraform or CloudFormation make it relatively easy to define two configuration sets. They can be swapped prior to running a load test to upscale the environment. Changing the type of an EC2 instance requires stopping and starting the instance, whereas some RDS databases can be modified with little or no interruption. The whole up- or downscale process requires only a small amount of time (depending on the environment size and instance types) and can save a considerable amount of money.

Designing new applications with a serverless first mindset

Serverless has become a buzzword and marketing term during the last few years (everything seems to be serverless nowadays), but at its core it is still a quite promising technological approach. Not having to deal with a whole bunch of administrative and operational tasks like provisioning and operating virtual servers or databases, paired with the "pay only for what you use" model, is quite appealing. Other advantages of serverless architectures are not discussed in this article but can be found easily using your favorite search engine.

Especially the "pay as you go" cost model counts towards direct cost optimization (excluding in this post topics like total cost of ownership and time to market which are important in practice as well). There is no need to shut down or restart anything when it is not needed. Serverless components do not contribute to your AWS bill when they are not used -- for instance at night in development or test environments. Even production workloads which often do not receive a constant traffic flow but more a spiky one benefit compared to an architecture based on containers or VMs.

Not every application or workload is suited for a serverless design model. To be fair it should be mentioned that a serverless approach can get much more expensive than a container based one in case of very heavy traffic patterns. However, this is probably relevant for just a very small portion of all existing implementations.

Quite often it is possible and beneficial to replace a VM or container by one or more Lambda function(s), an RDS database by DynamoDB or a custom REST/GraphQL API implementation by API Gateway or AppSync. The learning curve is steep and well-designed serverless architectures are not that easy to achieve at the beginning as a complete mind shift is required, but believe me: this journey is worth the effort and is a ton of fun after having gained some insights into this technology.

Think about what should be logged and sent to CloudWatch

Logging has been an important part of any application development and operation since the invention of software. Useful log output (hopefully in a structured form) can help to identify bugs or other deficiencies and provides useful insight into a software system. With CloudWatch, AWS provides a platform for ingesting, storing and analyzing logs.

Unfortunately, log processing is quite costly. It is not unusual in serverless projects that the portion of the AWS bill related to CloudWatch is up to 10 times higher than, for instance, the one for Lambda. The same is valid for container or VM based architectures: the ratio might not be that high, but it is still not negligible. A concept for dealing with log output is advisable and might make a considerable difference at the end of the month.

Some of the following ideas help to keep CloudWatch costs under control:

Change log retention time from "Never expire" to a reasonable value
Apply log sampling as described in this post and as provided, for instance, by AWS Lambda Powertools
Consider using a 3rd party monitoring system like Lumigo or Datadog instead of outputting a lot of log messages. These external systems are not free and their use is not always permitted (especially in an enterprise context), but they provide a lot of additional features which can make a real difference.
In some cases, it might be possible to send logs directly to other systems (instead of ingesting them first into CloudWatch) or to store them in S3 and use Athena to get some insights.
Activate logging when needed and suitable but not always by default -- not every application requires for instance VPC flow logs or API Gateway access logs even though good reasons exist to do so in certain environments (for security reasons or because of regulations and company rules)
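The idea behind log sampling can be illustrated with the Python standard library alone. The sketch below keeps every warning and error but only a fraction of the lower-level records. It samples per record, which is a simplification chosen for this example and not necessarily how AWS Lambda Powertools implements its sampling feature; the class names and the rate of 10 percent are assumptions.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Let warnings and above through, but only a sample of lower-level records."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.sample_rate


class ListHandler(logging.Handler):
    """Collects records in memory so the effect of sampling is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("sampling-demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
# A seeded random generator keeps the demo reproducible.
handler.addFilter(SamplingFilter(sample_rate=0.1, rng=random.Random(42)))
logger.addHandler(handler)

for i in range(1000):
    logger.debug("debug message %d", i)
logger.error("something went wrong")

debug_kept = sum(1 for r in handler.records if r.levelno == logging.DEBUG)
errors_kept = sum(1 for r in handler.records if r.levelno == logging.ERROR)
```

Only roughly a tenth of the debug messages would reach the log destination in this setup, while the error is always kept.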

Logging is important and quite useful in most cases, but it makes sense to keep an eye on the expenditures and to adjust the logging concept in case of sprawling costs.

Wrap up

All the cost optimization possibilities mentioned above only scratch the surface of what is possible in the AWS universe. Things like S3 and DynamoDB storage tiers, EC2 spot instances and many others have not even been mentioned. Nevertheless, applying one or several of the strategies briefly discussed in this article can help to save a ton of money without having to spend weeks of engineering time. Savings Plans and Reserved Instances as well as shutting down idle instances in particular are easy and quite effective measures which can reduce the contribution of existing workloads to the AWS bill by 30% to 50%. Newer workloads which are suited for the serverless design model really benefit from its cost and operation model and provide a ton of fun for developers.

Refactor Terraform code with Moved Blocks - a new way without manually modifying the state

Thomas Laue — Fri, 08 Jul 2022 11:07:49 +0000

Most software and IT infrastructure projects which have been deployed to production have to deal with requirement changes during their lifetime. User expectations change, new use cases appear, traffic patterns are different than expected or new technology becomes available. Refactoring of existing code (application code as well as infrastructure-as-code) has always been an important task but also one of the major pain points in IT. Good support for refactoring through tools and patterns can set a framework like Terraform apart from its competitors.

Setting the stage

Terraform by HashiCorp -- one of the major players in the infrastructure-as-code framework world -- has been around since 2014. It has been used to set up a lot of small, medium, and large projects all over the world. It provides a rich feature set to define infrastructure in a concise manner. One of its strengths is the ability to create identical or similar resources using either the meta-argument count or its newer alternative for_each.

count makes it very easy to define identical resources, as shown in the listing below, which defines a very basic setup of 3 EC2 instances running on AWS:

locals {
  server_names = ["webserver1", "webserver2", "webserver3"]
}

resource "aws_instance" "web" {
  count = length(local.server_names)

  ami                       = "ami-0a1ee2fb28fe05df3"
  instance_type             = "t3.micro"

  tags = {
    Name = local.server_names[count.index]
  }
}

Terraform stores references to resources created by using the count meta-argument in its internal state in an array using an index-based approach.

This works fine as long as no single instance has to be replaced or deleted. Such an action affects all resources located at a higher index in the array because of the way Terraform manages its state.

Trying to remove "webserver2" in the example above

locals {
  server_names = ["webserver1", "webserver3"]
}
...

will result in the destruction of the EC2 instance tagged "webserver3" and in the instance previously named "webserver2" being renamed to "webserver3". The result does not correspond to the expressed intention.

Version 0.12.6 of Terraform introduced the for_each meta-argument - a more flexible way to create identical/similar resources.

resource "aws_instance" "web" {
  for_each = toset(local.server_names)

  ami                         = "ami-0a1ee2fb28fe05df3"
  instance_type               = "t3.micro"

  tags = {
    Name = each.value 
  }
}

The Terraform state references the resources no longer based on an index but by using a key-based approach. It is now possible to address a single resource without affecting others.

The removal of "webserver2" can now be performed successfully without affecting other resources.
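The difference between the two addressing schemes can be reproduced with a small simulation in plain Python. It does not call Terraform; it only models state addresses as dictionary keys, and all function names are made up for this illustration.

```python
def count_addresses(names):
    """Index-based addressing as used with count: position -> name."""
    return {f"aws_instance.web[{i}]": name for i, name in enumerate(names)}


def for_each_addresses(names):
    """Key-based addressing as used with for_each: key -> name."""
    return {f'aws_instance.web["{name}"]': name for name in names}


def diff(before, after):
    """Classify addresses as destroyed or changed in place between two states."""
    destroyed = sorted(addr for addr in before if addr not in after)
    changed = sorted(addr for addr in before
                     if addr in after and before[addr] != after[addr])
    return destroyed, changed


old = ["webserver1", "webserver2", "webserver3"]
new = ["webserver1", "webserver3"]  # "webserver2" removed

count_destroyed, count_changed = diff(count_addresses(old), count_addresses(new))
each_destroyed, each_changed = diff(for_each_addresses(old), for_each_addresses(new))
```

With index-based addresses the last index disappears and index 1 changes its name, whereas with key-based addresses only the entry for "webserver2" is removed.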

Due to the greater flexibility of for_each it might be helpful or even required to refactor existing code (migrate from count to for_each). This has been possible in the past by manipulating the Terraform state directly using the terraform state mv CLI command. However, all manual state manipulations are brittle and prone to errors, which makes them a kind of last resort.

From imperative to explicit

HashiCorp introduced an improved refactoring experience with version 1.1 of Terraform: the moved block syntax, which allows refactoring steps to be expressed in code instead of performing them imperatively via the CLI.

The moved block specifies the old and the new reference of a resource, as shown in the following example, which has been rewritten to use for_each instead of count:

locals {
  server_names = ["webserver1", "webserver2", "webserver3"]
}

moved {
  from = aws_instance.web[0]
  to   = aws_instance.web["webserver1"]
}

moved {
  from = aws_instance.web[1]
  to   = aws_instance.web["webserver2"]
}

moved {
  from = aws_instance.web[2]
  to   = aws_instance.web["webserver3"]
}

resource "aws_instance" "web" {
  for_each = toset(local.server_names)

  ami           = "ami-0a1ee2fb28fe05df3"
  instance_type = "t3.micro"

  tags = {
    Name = each.value
  }
}

A subsequent terraform plan/apply reveals that no instance will be destroyed or modified in any way; each one is only moved in the state from its old reference to the new one created by for_each. There is no need for any manual state manipulation anymore, as everything can be done safely using Terraform's native way of working.
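Writing the moved blocks by hand becomes tedious for longer lists. A small script can render them, as in the following Python sketch. It assumes that the order of the list matches the old count indices, and the helper `render_moved_blocks` is hypothetical and not part of Terraform.

```python
def render_moved_blocks(resource_address: str, keys: list) -> str:
    """Render one moved block per element, mapping index i to the key at position i."""
    blocks = []
    for index, key in enumerate(keys):
        blocks.append(
            "moved {\n"
            f"  from = {resource_address}[{index}]\n"
            f'  to   = {resource_address}["{key}"]\n'
            "}"
        )
    return "\n\n".join(blocks)


server_names = ["webserver1", "webserver2", "webserver3"]
hcl = render_moved_blocks("aws_instance.web", server_names)
print(hcl)
```

The printed text can be pasted into the Terraform configuration and should be reviewed with terraform plan before applying.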

moved blocks can be used not only to refactor count into for_each syntax but also to rename resources, to move resources into modules and so on. Not everything is possible using the new language element, but many (not extremely complex) refactoring tasks can benefit from using it. Terraform's documentation contains different examples and use cases with further details.

Wrap-up

moved blocks have made refactoring existing Terraform projects easier and safer to perform. No manual steps are required any longer for many use cases even though terraform state mv is still there to solve problems which cannot be tackled by using the new element. It is helpful to have tooling/framework elements like this one at hand.

Depending on the type and size of the project (internal project or public module), it might make sense, and is even recommended by HashiCorp, not to delete the moved blocks after having applied the changes. Not everyone using the module might already have fetched the latest version. Apart from avoiding trouble for users, keeping the blocks helps to document significant changes to the project structure for later reviews. A short, well-written and dated comment combined with the moved block might answer your question or that of a colleague six months down the road.

Automate DevOps Workflows using AWS StepFunctions Service Integrations

Thomas Laue — Tue, 31 May 2022 18:48:45 +0000

AWS Step Functions, a serverless workflow orchestration service offered by AWS, has been around for several years now. Many blog posts (like Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions), presentations and learning courses (e.g. Complete guide to AWS Step Functions) have been published showing the capabilities and rich feature set provided.

However, not many of them deal with topics related to DevOps tasks -- maybe because Step Functions only offered a limited set of direct service integrations like AWS Lambda until recently. Accessing an AWS API required using, for instance, an AWS SDK or AWS CLI commands in a script or Lambda function, but this changed a few months ago.

In September 2021 AWS added support for over 200 AWS Services with AWS SDK integration resulting in over 9000 AWS API Actions available. Only a few weeks before, another major enhancement, the new Workflow Studio -- a low-code visual tool for building state machines, had been released so that it is now easier than ever to build workflows -- from simple to complex.

The challenge

Around the same time, we joined a migration project at a customer who was moving a large application, which had been hosted on-premises so far, to AWS using services like EC2, RDS and ALB. Some of the typical operational tasks like managing the database servers are now gone as AWS takes care of the heavy lifting, but new ones have arrived and others stay the same.

As the project proceeded, we thought about how we could automate as many operational tasks as possible using native AWS services. AWS Step Functions Service Integrations arrived at just the right time to make our life much easier. We were able to handle many recurring tasks by creating state machines which are either triggered by scheduled Amazon EventBridge rules or started manually via CLI or Console.

The simple one

A workflow consisting of only two steps (neglecting Start and End) is triggered shortly before the next EC2 maintenance window to get an overview of all security patches which will be installed.

The AWS Systems Manager service integration ssm:describeInstancePatches is used to get the list of all patches. This list is sent to an AWS SNS topic in order to be delivered to the email inbox of someone who is in charge of checking whether there might be a conflict with the application requirements.

The Workflow Studio editor makes it quite easy to assemble a workflow and to enrich every step with the required parameters and settings. All service integrations are based on the AWS SDK API calls so that the parameters can be retrieved from the SDK documentation (an example is shown for Systems Manager API).

Workflow Editor allows exporting the state machine definition to a JSON or YAML file so that it can be included into an infrastructure as code project using for instance Terraform.

Information like the EC2 instance ID or the SNS topic ARN can be derived during deploy time using for instance Terraform template variables as shown in the example JSON state machine definition.
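To make this more concrete, the two-step workflow could look roughly like the following sketch. It builds the state machine definition as a plain JavaScript object instead of using Terraform template variables. The function name buildPatchReportDefinition, the state names and the placeholder instance ID and topic ARN are assumptions made for illustration and are not taken from the original project.

```javascript
// Minimal sketch of the two-step workflow as an Amazon States Language definition.
// buildPatchReportDefinition, the state names and both arguments are illustrative.
function buildPatchReportDefinition(instanceId, topicArn) {
  return {
    Comment: "Report the patches of an instance before the maintenance window",
    StartAt: "DescribeInstancePatches",
    States: {
      DescribeInstancePatches: {
        Type: "Task",
        // AWS SDK service integrations follow the pattern
        // arn:aws:states:::aws-sdk:<service>:<apiAction>
        Resource: "arn:aws:states:::aws-sdk:ssm:describeInstancePatches",
        Parameters: { InstanceId: instanceId },
        Next: "PublishPatchList",
      },
      PublishPatchList: {
        Type: "Task",
        Resource: "arn:aws:states:::sns:publish",
        Parameters: {
          TopicArn: topicArn,
          // Forward the patch list returned by the previous state as message text.
          "Message.$": "States.JsonToString($.Patches)",
        },
        End: true,
      },
    },
  };
}

const definition = buildPatchReportDefinition(
  "i-0123456789abcdef0", // placeholder instance ID
  "arn:aws:sns:eu-central-1:123456789012:patch-report" // placeholder topic ARN
);
console.log(JSON.stringify(definition, null, 2));
```

The printed JSON has the same shape as a definition exported from Workflow Studio, so the two values could just as well be injected by an infrastructure as code tool at deploy time.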

The big benefit of using Step Functions is that no custom code and no additional overhead for managing a Lambda function is required to complete this task and the best thing: the state machine is quite intuitive to create, self-documenting and easy to follow and to recap.

The more complex process

Following the same principles, it is possible to create more complex workflows. The given example shows a workflow which is used to restart all servers belonging to the web app tier, which sit behind an AWS application load balancer, in a rolling manner. No application downtime is required in order to restart them as only a certain number is restarted at once.

In the first step, the alarm actions of some CloudWatch alarms which should not fire during the restart process, as well as some AWS EventBridge rules, are disabled. A Lambda function is used for this step as the logic to filter these resources needs some custom code.

A property of the AWS Step Functions Map state, the Maximum Concurrency Control, is used to restrict the number of instances which are deregistered from the ALB target group, followed by a reboot and a final check if the application has been launched successfully before bringing it back into the target group.

Rebooting only a limited number of instances makes sure that the application stays online, and that always enough servers are available to handle user traffic without a significant influence on the user experience.
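The concurrency limit could be sketched as shown below. The input shape (an instances array with an instanceId field), the function name and the state names are assumptions for illustration. The real inner workflow contains more steps (deregistration, wait, startup check, registration), which are left out here.

```javascript
// Sketch of a Map state that processes only a limited number of instances in parallel.
// Assumed input: { "instances": [ { "instanceId": "i-..." }, ... ] }
function buildRollingRestartMapState(maxConcurrency) {
  return {
    Type: "Map",
    ItemsPath: "$.instances",
    // Upper bound for the number of iterations that run at the same time.
    MaxConcurrency: maxConcurrency,
    Iterator: {
      StartAt: "RebootInstance",
      States: {
        RebootInstance: {
          Type: "Task",
          Resource: "arn:aws:states:::aws-sdk:ec2:rebootInstances",
          // The API expects a list, so the single ID is wrapped into an array.
          Parameters: { "InstanceIds.$": "States.Array($.instanceId)" },
          End: true,
        },
      },
    },
    End: true,
  };
}

const rollingRestart = buildRollingRestartMapState(2);
console.log(JSON.stringify(rollingRestart, null, 2));
```

With a value of 2, at most two servers are out of the target group at any time while the others keep serving traffic.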

The new AWS SDK service integrations help again to model the workflow, as a sequence of steps must be followed in order to reboot a running instance successfully. A server not only has to be deregistered from and registered with the target group (among others using the elasticloadbalancingv2:registerTargets SDK command) but also has to be rebooted (ec2:rebootInstances).

After a certain wait period, an application startup check is performed to make sure that everything is working correctly using a Lambda function as the whole check process requires again some custom logic. Only a healthy and working server should be put back into the ALB target group.

The application requires some minutes to get everything sorted out until it is ready to serve, and the startup time varies depending on factors like external database connections. The Wait state helps in this case to pause the workflow for a certain time. Nevertheless, it can happen that the following startup check fails as the application is not yet ready and another wait period is required.

A built-in "for-loop" feature for Step Functions would be quite helpful in this case to re-run the last two steps (wait + startup check). It is possible to model this construct using a Choice state which checks the result returned from the startup check Lambda function and acts upon it (i.e., go back to the Wait state if the application is not ready yet).

However, this feels somewhat clumsy and more like a workaround. Additionally, a break condition (e.g., a maximum number of checks) is required, which introduces state that must be passed around or stored somewhere.

Custom Retry and Error Handling for Lambda functions, another cool feature of Step Functions, comes to our rescue. Custom errors which are thrown from a Lambda function can be handled. Depending on the use case, a Catcher or a Retrier for this custom error class might be defined to deal with this situation. The latter is used to simulate a "for-loop" without relying on the Choice state workaround.

The Lambda function raises a custom InstanceNotYetStartedException in case the health check fails. This exception is handled by a specific Retrier which defines a longer wait interval (120 seconds) to give the application some additional time before the next check. This whole procedure is repeated up to three times in this case until it can be assumed that something went wrong and should be handled otherwise (processing moves on to a States.ALL Catcher which calls an SNS integration step for publishing an alarm).
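A minimal sketch of both sides of this mechanism follows. The error class name, the 120 second interval and the three attempts come from the description above. The function name startup-check, the helper assertApplicationStarted and the state name RegisterTarget are placeholders, and the constant BackoffRate of 1.0 is an assumption.

```javascript
// Custom error thrown by the startup check; Step Functions matches on the error name.
class InstanceNotYetStartedException extends Error {
  constructor(message) {
    super(message);
    this.name = "InstanceNotYetStartedException";
  }
}

// Hypothetical check logic: throws the custom error while the app is not ready.
function assertApplicationStarted(isHealthy) {
  if (!isHealthy) {
    throw new InstanceNotYetStartedException("Application is not ready yet");
  }
  return { status: "STARTED" };
}

// Task state with a Retrier that acts like a bounded "for-loop".
const startupCheckState = {
  Type: "Task",
  Resource: "arn:aws:states:::lambda:invoke",
  Parameters: {
    FunctionName: "startup-check", // placeholder function name
    "Payload.$": "$",
  },
  Retry: [
    {
      ErrorEquals: ["InstanceNotYetStartedException"],
      IntervalSeconds: 120, // wait before each further check
      MaxAttempts: 3, // give up after three retries
      BackoffRate: 1.0, // assumed: keep the interval constant
    },
  ],
  Next: "RegisterTarget", // placeholder for the next step
};

console.log(JSON.stringify(startupCheckState.Retry, null, 2));
```

Once the retries are exhausted, the error propagates to whichever Catcher is defined for it, so no counter has to be passed around in the state input.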

As a last note to this workflow: the Map state fails as soon as one of its executions has failed. All running inner executions are aborted and all waiting ones are cancelled. Care should be taken for this scenario: adding a dead letter queue to the inner Map state workflow would be one option, defining a States.ALL Catcher on the Map state level another one, or even failing the complete state machine execution on purpose. The best error handling method depends on the workflow requirements. The global Catcher is used in the presented case as some additional steps (putting the deactivated CloudWatch alarms back in place) must be executed in all cases.

When not to use

Step Functions has some limits like every other AWS service, which might prevent one from using it in some rare cases or which might require a workaround. Furthermore, there are external API properties which might not fit Step Functions well. Two examples are briefly discussed:

Maximum input/output size for a task is 256 KB: AWS API calls might return a lot of JSON data, but there are various mechanisms like the filters parameter and pagination support in place to narrow down the scope of a request. Additionally, Step Functions provides output processing functions to extract the data of interest so that this limitation should not be a blocker for most use cases.
How to deal with API calls supporting pagination: many AWS API endpoints return a maximum number of items and an additional NextToken value which can be used to retrieve the next batch with a following call. The clumsy Choice-state construct mentioned above could be used to handle this, but this is not practical. A Lambda function is much more suited in this situation in case a lot of data must be retrieved.
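The pagination loop that such a Lambda function would run can be sketched as follows. fetchPage is a hypothetical function which stands in for any AWS API call that accepts a NextToken and returns one batch. The in-memory createFakeApi helper only exists to make the sketch runnable without AWS access.

```javascript
// Collects all items from a paginated API by following NextToken until it is absent.
// fetchPage is hypothetical: (nextToken) => Promise<{ Items, NextToken }>.
async function collectAllItems(fetchPage) {
  const items = [];
  let nextToken;
  do {
    const page = await fetchPage(nextToken);
    items.push(...(page.Items || []));
    nextToken = page.NextToken;
  } while (nextToken);
  return items;
}

// In-memory stand-in for an AWS API call that returns pageSize items per call.
function createFakeApi(allItems, pageSize = 2) {
  return async (nextToken) => {
    const start = nextToken ? Number(nextToken) : 0;
    const end = start + pageSize;
    return {
      Items: allItems.slice(start, end),
      // A token is only returned while more items are left.
      NextToken: end < allItems.length ? String(end) : undefined,
    };
  };
}

collectAllItems(createFakeApi(["a", "b", "c", "d", "e"])).then((items) => {
  console.log(`Collected ${items.length} items`);
});
```

Keeping the loop inside one function invocation avoids passing the token and the partial result through the state machine, which would also count against the 256 KB limit.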

Wrapping up

This blog post presents use cases for Step Functions which might not be the most common ones out there but proved to be extremely useful. The new SDK integrations have opened a wide field of possibilities to model workflows visually without writing a lot of custom code (even though Lambda is always there if something cannot be solved by built-in mechanisms).

The Step Functions Workflow Studio allows designing and building workflows, from simple to quite complex ones, in an intuitive and rapid way. The ready-to-be-used workflow can be exported to code (is JSON code?) so that a developer's heart does not need to cry and the integration into an infrastructure as code framework can be made.

Some additional features like more intrinsic functions (e.g., string processing) to deal with the sometimes very large JSON results of AWS SDK calls would make working with Step Functions even easier (big point for #awswishlist).

Keep your CloudWatch bill under control when running AWS Lambda at scale

Thomas Laue — Tue, 19 Jan 2021 21:11:56 +0000

In this post, I show a way to keep the AWS CloudWatch costs caused by log messages coming from AWS Lambda under control without losing insights and debug information in case of errors. A logger with an included cache mechanism is presented. It manages the number of messages sent to AWS CloudWatch depending on the log level and function invocation result.

AWS Lambda and AWS CloudWatch

AWS Lambda, the serverless compute service offered by AWS, sends all log messages (platform as well as custom messages) to AWS CloudWatch. Log messages are sorted into log groups and streams which are associated with the Lambda function and its invocations from which the messages originated.

Depending on the AWS region, CloudWatch charges for data ingestion (up to $0.90 per GB) and data storage (up to $0.0408 per GB and month). These fees add up really quickly, and it is not uncommon to spend a lot more on CloudWatch logs (sometimes up to 10 times more) than on Lambda itself in a production environment. In addition, log files are often sent from CloudWatch to 3rd party systems for analysis, adding even more cost to the bill.
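A quick back-of-the-envelope calculation shows how these two price components combine. The sketch below uses the upper-bound prices quoted above. The workload numbers (500 GB ingested per month, 2000 GB stored) are purely hypothetical, and the function name is an illustrative choice.

```javascript
// Rough monthly estimate based on the upper-bound prices quoted above (USD).
const INGESTION_USD_PER_GB = 0.9;
const STORAGE_USD_PER_GB_MONTH = 0.0408;

function estimateMonthlyLogCost(ingestedGb, storedGb) {
  const ingestion = ingestedGb * INGESTION_USD_PER_GB;
  const storage = storedGb * STORAGE_USD_PER_GB_MONTH;
  return { ingestion, storage, total: ingestion + storage };
}

// Hypothetical workload: 500 GB of new logs per month, 2000 GB already stored.
const estimate = estimateMonthlyLogCost(500, 2000);
console.log(`Ingestion: $${estimate.ingestion.toFixed(2)}`);
console.log(`Storage:   $${estimate.storage.toFixed(2)}`);
console.log(`Total:     $${estimate.total.toFixed(2)}`);
```

In this made-up scenario, ingestion clearly dominates the bill, which is why reducing the number of messages written is the more effective lever compared to shortening retention alone.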

Logging

Nevertheless, log files are an important resource to debug problems and to get deeper insights into the behavior of a serverless system. Every logged detail might help to identify issues and to fix bugs and problems. Structured logging is important as log files can be analyzed much more easily (e.g. with AWS CloudWatch Insights), which will save time and engineering costs. The dazn-lambda-powertools library provides a logger that supports structured logging for Node.js; the AWS Lambda Powertools offer the same for Python and Java.

Furthermore, it is highly recommended to reduce the retention time of CloudWatch log groups to a suitable time period. By default, logs will be stored forever, leading to increasing costs over time. The retention policy for every log group might be changed manually using the AWS Console or preferably by using an automated approach provided for instance by this AWS SAR app.

Finally, sampling debug logs might cut off the biggest part of the CloudWatch Logs bill, especially when running AWS Lambda at scale, without losing the complete insight into the system. Depending on the sampling rate (which has to be representative for a workload), a certain amount of debugging information is available for monitoring and diagnostics.
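The idea behind sampling can be sketched in a few lines. This is only an illustration of the principle, not the implementation used by any particular library. The function names are assumptions, and the random source is injectable so that the behavior can be tested.

```javascript
// Decides per invocation whether debug logging should be enabled.
// sampleRate is a fraction between 0 and 1; random can be replaced in tests.
function shouldSampleDebugLogs(sampleRate, random = Math.random) {
  return random() < sampleRate;
}

// Picks the effective log level for a single invocation.
function resolveLogLevel(defaultLevel, sampleRate, random = Math.random) {
  return shouldSampleDebugLogs(sampleRate, random) ? "DEBUG" : defaultLevel;
}

// Count how many of 10,000 simulated invocations would log at DEBUG level.
let sampled = 0;
for (let i = 0; i < 10000; i += 1) {
  if (resolveLogLevel("INFO", 0.01) === "DEBUG") sampled += 1;
}
console.log(`${sampled} of 10000 invocations sampled`);
```

With a rate of 0.01, roughly one in a hundred invocations writes its debug messages while all others stay at the default level.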

The following image shows a CloudWatch log stream belonging to a Lambda function for which a sampling rate of 10 % was used for demonstration purposes. A reasonable value for production will probably be much lower (e.g. 1%).

Problem with sampling debug logs

Nevertheless - as life goes - the sampling might not be in place when something goes wrong (e.g. a bug which only happens for edge cases), leaving a developer without detailed information to fix the issue. For instance, the invocation event or the parameters for database or external API requests are of interest in case of issues.

A logger that caches all messages which are not written to the output stream as their severity is below the defined log level could be used. The cached messages would only be sent to CloudWatch in case of a program error - in addition to the error information to get a full picture of the function invocation. This idea originated from the Production-Ready Serverless course by Yan Cui.

A reduced version of the logger which is based on the dazn-lambda-powertools-logger:

const log = require("@dazn/lambda-powertools-logger");

const LogLevels = {
  DEBUG: 20, INFO: 30, WARN: 40, ERROR: 50
};

class Logger {
  #logMessages = [];
  #level = "DEBUG";

  constructor() {
    this.#level = log.level;
  }

  handleMessage(levelName = "debug", message = "", params = {}, error = {}) {
    log[levelName](message, params, error);

    const level = LogLevels[levelName.toUpperCase()];

    if (level < LogLevels[this.#level]) {
      this.addToCache(levelName, message, params, error);
      return;
    }
  }

  addToCache(levelName, ...params) {
    this.#logMessages.push({ levelName, params });
  }

  writeAllMessages() {
    try {
      // The log level of the logger has to be set to "debug" as
      // the current log level might prevent messages from
      // being logged.
      log.enableDebug();

      this.#logMessages.forEach((item) => {
        log[item.levelName.toLowerCase()](...item.params);
      });
    } finally {
      log.resetLevel();
    }
  }

  static debug(message, params) {
    globalLogger.handleMessage("debug", message, params);
  }

  static info(message, params) {
    globalLogger.handleMessage("info", message, params);
  }

  static warn(message, params, error) {
    globalLogger.handleMessage("warn", message, params, error);
  }

  static error(message, params, error) {
    globalLogger.handleMessage("error", message, params, error);
  }

  static writeAllMessages() {
    globalLogger.writeAllMessages();
  }

  ...
}

const globalLogger = new Logger();
module.exports = Logger;

The logger provides methods for the most common log levels. A message is either written to the output stream or added to the internal cache, depending on the current log level defined in the Lambda environment. If required, all cached messages can be written out as well using the "writeAllMessages" method.

How to use the logger within AWS Lambda

All required logic (including sample logging configuration) has been added to a wrapper that receives the Lambda handler function as an argument. This wrapper can be reused for any Lambda function and published for instance in a private NPM package.

const middy = require("middy");
const sampleLogging = require("@dazn/lambda-powertools-middleware-sample-logging");

const log = require("./logger");

module.exports = (lambdaHandler) => {
  const lambdaWrapper = async (event, context) => {
    log.debug(`Input event...`, { event });

    try {
      const response = await lambdaHandler(event, context, log);

      log.info(
        `Function [${context.functionName}] finished successfully with result: [${JSON.stringify(
          response
        )}] at [${new Date()}]`
      );

      return response;
    } catch (error) {
      log.writeAllMessages();
      throw error;
    } finally {
      log.clear();
    }
  };

  return middy(lambdaWrapper).use(
    sampleLogging({
      sampleRate: parseFloat(process.env.SAMPLE_DEBUG_LOG_RATE || "0.01"),
    })
  );
};

An example of a simple Lambda handler in which some user information is retrieved from DynamoDB is given below. This function fails on a random basis to demonstrate logger behavior.

const { DynamoDB } = require("@aws-sdk/client-dynamodb");
const { marshall, unmarshall } = require("@aws-sdk/util-dynamodb");

// Assumed path of the wrapper module shown above.
const wrapper = require("./wrapper");

const dynamoDBClient = new DynamoDB({ region: "eu-central-1" });

const handler = async (event, context, log) => {
  const userId = event.queryStringParameters.userId;
  const { name, age } = await getUserDetailsFromDB(userId, log);

  // Fail in about half of the invocations to demonstrate the logger behavior.
  if (Math.random() > 0.5) {
    throw new Error("An error occurred");
  }

  const response = {
    statusCode: 200,
    body: JSON.stringify({
      name,
      age,
    }),
  };

  log.debug(`Response...`, { response });

  return response;
};

// The logger is passed in explicitly as it is only available as a handler argument.
const getUserDetailsFromDB = async (userId, log) => {
  log.debug(`Get user information for user with id...`, { userId });

  const { Item } = await dynamoDBClient.getItem({
    TableName: process.env.TABLE_NAME,
    // Look up the requested user (assumes the key attribute is stored as a string).
    Key: marshall({
      userId,
    }),
  });

  const userDetails = unmarshall(Item);
  log.debug("Retrieved user information...", { userDetails });

  return userDetails;
};

module.exports.handler = wrapper(handler);

A small sample application (as shown by the lumigo platform) demonstrates the different logger behavior:

A successful invocation of the sample app with log level set to "INFO" does not write out any debug message (only in the rare case of a sampled invocation):

However, all debug information will be sent to CloudWatch Logs in case of an error, as can be seen below:

Caveats

Platform errors like timeouts or out of memory issues will not trigger the logger logic as the function will not run to its end but will be terminated by the Lambda runtime.

Takeaways

Logging is one of the important tools to get some insights into the behavior of any system including AWS Lambda. CloudWatch Logs centralizes and manages all logs from most AWS services. It is not free, but there are options such as sampling logs in production to reduce the bill. As this might result in NO logs in case of an error, a logger with an internal cache has been presented which outputs all logs, but only in case of a problem. This logger can be combined with the sample logging strategy to keep the bill low but get all information when it is really required.

Let me know if you found this useful and what other approaches are used to keep the CloudWatch bill reasonable without losing all insights. Thank you for reading.

The full code including a small test application can be found in: