<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ivica Kolenkaš</title>
    <description>The latest articles on Forem by Ivica Kolenkaš (@ivicak).</description>
    <link>https://forem.com/ivicak</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1048060%2Fdafc17e0-8f37-43ee-ba98-257a69ae5f98.jpeg</url>
      <title>Forem: Ivica Kolenkaš</title>
      <link>https://forem.com/ivicak</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ivicak"/>
    <language>en</language>
    <item>
      <title>Unable to emit metadata to DataHub GMS with Airflow - a solution</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Thu, 14 Aug 2025 05:47:04 +0000</pubDate>
      <link>https://forem.com/ivicak/unable-to-emit-metadata-to-datahub-gms-with-airflow-a-solution-22ke</link>
      <guid>https://forem.com/ivicak/unable-to-emit-metadata-to-datahub-gms-with-airflow-a-solution-22ke</guid>
      <description>&lt;p&gt;&lt;a href="https://datahub.com/" rel="noopener noreferrer"&gt;DataHub&lt;/a&gt; is a popular open-source data catalog  and its &lt;a href="https://datahub.com/blog/data-lineage-what-it-is-and-why-it-matters/" rel="noopener noreferrer"&gt;Lineage feature&lt;/a&gt; is one of its highlights.&lt;/p&gt;

&lt;p&gt;Doing ingestion or data processing with &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;, a very popular open-source platform for developing and running workflows, is a fairly common setup. &lt;a href="https://docs.datahub.com/docs/lineage/airflow#automatic-lineage-extraction" rel="noopener noreferrer"&gt;DataHub's automatic lineage extraction&lt;/a&gt; works great with Airflow - provided you configure the Airflow connection to DataHub correctly.&lt;/p&gt;

&lt;p&gt;This article shows how to resolve the infamous &lt;code&gt;Unable to emit metadata to DataHub GMS&lt;/code&gt; when using the &lt;a href="https://pypi.org/project/acryl-datahub-airflow-plugin/" rel="noopener noreferrer"&gt;DataHub Airflow plugin&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datahub.configuration.common.OperationalError: (
'Unable to emit metadata to DataHub GMS', 
{
  'message': '404 Client Error: Not Found for url: https://my-datahub-host.net/aspects?action=ingestProposal'
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;URL-encode the &lt;code&gt;host&lt;/code&gt; portion of your Airflow connection string:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;correct: &lt;code&gt;datahub-rest://my-datahub-host.net%2Fapi%2Fgms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;incorrect: &lt;code&gt;datahub-rest://my-datahub-host.net/api/gms&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
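&lt;p&gt;A quick way to produce the encoded form is Python's standard-library &lt;code&gt;urllib.parse.quote&lt;/code&gt; with &lt;code&gt;safe=""&lt;/code&gt;, so that slashes are encoded too (the hostname below is a placeholder):&lt;/p&gt;

```python
from urllib.parse import quote

# placeholder host; substitute your own DataHub hostname
host = "my-datahub-host.net/api/gms"

# safe="" forces "/" to be percent-encoded as %2F as well
print(quote(host, safe=""))  # my-datahub-host.net%2Fapi%2Fgms
```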

&lt;h2&gt;
  
  
  The problem - 404, wrong URL
&lt;/h2&gt;

&lt;p&gt;The error message makes the problem obvious: &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/404" rel="noopener noreferrer"&gt;404 - Not Found&lt;/a&gt; indicates that the URL is wrong or does not exist.&lt;/p&gt;

&lt;p&gt;A quick glance at the DataHub API docs shows that the REST API is available at &lt;code&gt;https://my-datahub-host.net/api/gms&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Compare that to the URL reported in the error message above and it is obvious that &lt;strong&gt;&lt;code&gt;/api/gms&lt;/code&gt;&lt;/strong&gt; is missing from our Airflow connection string - woohooo!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzurpnblf1he1opy8vme.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzurpnblf1he1opy8vme.webp" alt="A developer celebrating a new error message" width="800" height="773"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I quickly look at HashiCorp Vault, which we use as the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/security/secrets/secrets-backend/index.html" rel="noopener noreferrer"&gt;external connections store&lt;/a&gt; in our Airflow deployments, and the connection string looks just fine to me - &lt;code&gt;/api/gms&lt;/code&gt; is there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datahub-rest://https://:TOKEN@https%3A%2F%2Fmy-datahub-host.net/api/gms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check how Airflow "understands" the connection because that is what it will use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow connections get datahub_rest_default &lt;span class="nt"&gt;--output&lt;/span&gt; yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which outputs&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# shortened for brevity&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;conn_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datahub_rest_default&lt;/span&gt;
  &lt;span class="na"&gt;conn_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datahub_rest&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://my-datahub-host.net&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api/gms'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two issues pop out immediately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;schema&lt;/code&gt; should only have &lt;code&gt;http&lt;/code&gt; or &lt;code&gt;https&lt;/code&gt; in it, not &lt;code&gt;/api/gms&lt;/code&gt; (&lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-http/stable/connections/http.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;host&lt;/code&gt; value is missing the &lt;code&gt;/api/gms&lt;/code&gt; path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To understand why the connection is not parsed properly, let's look at what &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/connections.html" rel="noopener noreferrer"&gt;Connections&lt;/a&gt; look like under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anatomy of an Airflow connection
&lt;/h2&gt;

&lt;p&gt;An Airflow &lt;code&gt;Connection&lt;/code&gt; object (&lt;a href="https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/definitions/connection.py#L34" rel="noopener noreferrer"&gt;source&lt;/a&gt;) is very well documented so I won't repeat that; use the source, Luke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A connection to an external data source.

    :param conn_id: The connection ID.
    :param conn_type: The connection type.
    :param description: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The connection description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    :param host: The host.
    :param login: The login.
    :param password: The password.
    :param schema: The schema.
    :param port: The port number.
    :param extra: Extra metadata. Non-standard data such as private/SSH keys can be saved here. JSON
        encoded object.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is good to know that a Connection can also be represented as a &lt;a href="https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/definitions/connection.py#L62" rel="noopener noreferrer"&gt;connection string (also called a URI)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because we are dealing with an &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-http/stable/connections/http.html" rel="noopener noreferrer"&gt;HTTP connection&lt;/a&gt;, the &lt;code&gt;HOST&lt;/code&gt; consists of the full URL, including the path; for example, a connection URI for the Google Images page would look like &lt;code&gt;http://https://google.com/imghp&lt;/code&gt;.&lt;/p&gt;
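&lt;p&gt;Standard URI parsing shows why this matters: anything after the first unencoded &lt;code&gt;/&lt;/code&gt; in the authority part is no longer the host. A minimal sketch with Python's &lt;code&gt;urllib&lt;/code&gt; (not Airflow's actual parser, but it splits the URI along the same lines):&lt;/p&gt;

```python
from urllib.parse import urlsplit

# unencoded: the path is split away from the host
plain = urlsplit("datahub-rest://my-datahub-host.net/api/gms")
print(plain.hostname)  # my-datahub-host.net
print(plain.path)      # /api/gms

# percent-encoded: everything stays in the host component
encoded = urlsplit("datahub-rest://my-datahub-host.net%2Fapi%2Fgms")
print(repr(encoded.path))  # ''
```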

&lt;h2&gt;
  
  
  Airflow connection parsing
&lt;/h2&gt;

&lt;p&gt;DataHub's &lt;code&gt;DatahubRestHook&lt;/code&gt; (&lt;a href="https://github.com/datahub-project/datahub/blob/master/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/hooks/datahub.py#L24" rel="noopener noreferrer"&gt;source&lt;/a&gt;) is based on Airflow's &lt;code&gt;BaseHook&lt;/code&gt; (&lt;a href="https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/bases/hook.py#L30" rel="noopener noreferrer"&gt;source&lt;/a&gt;) so it inherits this method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@classmethod&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_connection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Connection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Get connection, given connection id.

    :param conn_id: connection id
    :return: connection
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# shortened for brevity
&lt;/span&gt;    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConnectionModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_connection_from_secrets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From it, and several levels down the code path, we find a &lt;a href="https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/connection.py#L211" rel="noopener noreferrer"&gt;function that parses the URI&lt;/a&gt; to turn it into a &lt;code&gt;Connection&lt;/code&gt; object. To simplify the demo below, I've extracted a couple of functions that parse the URI into &lt;a href="https://gist.github.com/ivica-k/71bd20f8af1bd7c68c94237b23481874" rel="noopener noreferrer"&gt;this Gist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using that code standalone (without Airflow) shows exactly what's wrong with the connection:&lt;/p&gt;

&lt;p&gt;Connection string without URL-encoding it first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow_connection_parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_from_uri&lt;/span&gt;

&lt;span class="nf"&gt;parse_from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datahub-rest://https://:TOKEN@my-datahub-host.net/api/gms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;datahub&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;
&lt;span class="n"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connection string with the URL (&lt;code&gt;host&lt;/code&gt;) being URL-encoded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow_connection_parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;parse_from_uri&lt;/span&gt;

&lt;span class="nf"&gt;parse_from_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datahub-rest://https://:TOKEN@my-datahub-host.net%2Fapi%2Fgms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;datahub&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gms&lt;/span&gt;
&lt;span class="n"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;schema&lt;/code&gt; is empty and the &lt;code&gt;host&lt;/code&gt; contains &lt;code&gt;/api/gms&lt;/code&gt;, as it should according to &lt;a href="https://docs.datahub.com/docs/lineage/airflow#configuration" rel="noopener noreferrer"&gt;DataHub's Airflow integration&lt;/a&gt; docs.&lt;/p&gt;




&lt;p&gt;URL-encode your connection strings if you create them outside of the Airflow ecosystem; the Airflow UI (not recommended for production) and the &lt;code&gt;airflow&lt;/code&gt; CLI take care of the encoding for you.&lt;/p&gt;

&lt;p&gt;In our case, all the connections are managed with Terraform and the code for it missed a simple &lt;code&gt;urlencode(MY_HOST_HERE)&lt;/code&gt; function call.&lt;/p&gt;
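&lt;p&gt;If you generate connection strings with a script instead, the same fix is a single &lt;code&gt;quote()&lt;/code&gt; call over the host before interpolating it into the URI (the token and hostname below are placeholders):&lt;/p&gt;

```python
from urllib.parse import quote

# placeholder values
token = "TOKEN"
host = "my-datahub-host.net/api/gms"

# percent-encode both pieces before assembling the connection URI
uri = f"datahub-rest://https://:{quote(token, safe='')}@{quote(host, safe='')}"
print(uri)  # datahub-rest://https://:TOKEN@my-datahub-host.net%2Fapi%2Fgms
```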

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>datahub</category>
      <category>data</category>
    </item>
    <item>
      <title>Data and analytics reimagined - platform architecture</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Fri, 25 Jul 2025 14:10:11 +0000</pubDate>
      <link>https://forem.com/ivicak/data-and-analytics-reimagined-platform-architecture-3f9j</link>
      <guid>https://forem.com/ivicak/data-and-analytics-reimagined-platform-architecture-3f9j</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/ivicak/data-and-analytics-reimagined-with-terraform-and-devops-principles-1g0b"&gt;The previous article&lt;/a&gt; in the series introduces the business, its data-related challenges and a vision of a data platform to tackle them.&lt;/p&gt;

&lt;p&gt;This article goes over the architecture of the platform in its most basic form - arrows and rectangles.&lt;/p&gt;




&lt;p&gt;A business analyst trying to answer "&lt;em&gt;How many white shirts do I need?&lt;/em&gt;" will have their work made easier by having all the relevant sales data in a uniform shape and in the same place. The reality is, the sales data from previous years is fragmented in several data silos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpdgnkqg5ctib8fmmupz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpdgnkqg5ctib8fmmupz.png" alt="Relevant data - wish vs. reality" width="800" height="586"&gt;&lt;/a&gt;&lt;br&gt;Relevant data - wish vs. reality
  &lt;/p&gt;

&lt;p&gt;These data silos will have differently shaped data (databases, files, or worse), their security practices (if any) will be different and ownership may or may not be known.&lt;/p&gt;

&lt;p&gt;Having all the data be uniform and in the same place is a hard nut to crack for a large organization with highly autonomous teams. A good alternative is to have the data in a similar-enough place, and uniform-enough shape so that it appears "the same".&lt;/p&gt;

&lt;p&gt;To get it into a similar-enough place and a uniform-enough shape, we put a fence around it and it became a &lt;a href="https://www.getdbt.com/blog/data-domains" rel="noopener noreferrer"&gt;data domain&lt;/a&gt;. Very similar to &lt;a href="https://en.wikipedia.org/wiki/Containerization" rel="noopener noreferrer"&gt;containerization&lt;/a&gt; of shipped goods and &lt;a href="https://en.wikipedia.org/wiki/Containerization_(computing)" rel="noopener noreferrer"&gt;software products&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data domains
&lt;/h2&gt;

&lt;p&gt;"&lt;em&gt;What is a data domain on our platform?&lt;/em&gt;" is a question that has a different answer depending on who you ask. For a data engineer, a data domain is a grouping of similar data; for example, all sales data for the wholesale sales channel belongs to a WHOLESALE data domain.&lt;/p&gt;

&lt;p&gt;A security specialist would argue that a data domain is a security boundary, while a data platform engineer will say that a data domain is a collection of - spoiler alert - AzureAD groups and Snowflake objects.&lt;/p&gt;

&lt;p&gt;All three engineers are correct; a data domain is a grouping of data that belongs together, forms a security boundary, and is formed by, in our case, Snowflake objects and AzureAD groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfmc60npg7h8rz9rnq05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwfmc60npg7h8rz9rnq05.png" alt="Relevant data in a similar-enough place; a data domain" width="800" height="337"&gt;&lt;/a&gt;&lt;br&gt;Relevant data in a similar-enough place; a data domain
  &lt;/p&gt;

&lt;p&gt;Encapsulating data in a data domain helps us tackle the three main challenges of the existing data landscape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear ownership&lt;/li&gt;
&lt;li&gt;defined security guidelines&lt;/li&gt;
&lt;li&gt;defined shape of data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data domain ownership
&lt;/h3&gt;

&lt;p&gt;Each data domain must have an owner and a self-sufficient data domain team behind it.&lt;/p&gt;

&lt;p&gt;Being self-sufficient means that they own and manage the data lifecycle and the domain fully. Ingestion of raw data, its transformation or serving as data products is entirely up to them.&lt;/p&gt;

&lt;p&gt;"&lt;em&gt;Want to make another data product?&lt;/em&gt;" Sure. "&lt;em&gt;Want to delete all of them?&lt;/em&gt;" Absolutely.&lt;/p&gt;

&lt;p&gt;Ownership over a data domain can be split into two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;business (administrative) ownership&lt;/li&gt;
&lt;li&gt;technical ownership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both types of owners share an overlapping right and responsibility: managing access to the data they own while staying aware of any sensitive data. They are expected to reject data access requests that do not conform to the rules.&lt;/p&gt;

&lt;p&gt;The main difference is that a business owner manages access for people to perform data exploration, while the technical owner will manage it for automated processes using code (more on that in the next article!).&lt;/p&gt;

&lt;p&gt;For everyone outside of the domain, domain owners serve as a point of contact regarding the data they own.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security guidelines
&lt;/h3&gt;

&lt;p&gt;One selling point of our data platform, &lt;a href="https://dev.to/ivicak/data-and-analytics-reimagined-with-terraform-and-devops-principles-1g0b"&gt;our curated collection of tools, standards and processes&lt;/a&gt;, is exactly that - standards.&lt;/p&gt;

&lt;p&gt;Applying the same security standards to every data domain makes accessing the data sets in those domains a seamless experience. No matter where the data set you need is located, getting to it is technically the same.&lt;/p&gt;

&lt;p&gt;For people accessing data, this means having one of the standardized domain roles that grant read or write permissions.&lt;/p&gt;

&lt;p&gt;For machines (think scheduled jobs, automated processes etc.) this means having a service principal that authenticates with a private key, among other things.&lt;/p&gt;

&lt;p&gt;Our security guidelines span all the systems and tools we offer on the platform. Deviations must have a very strong business case - after all, the platform should be &lt;a href="https://dev.to/ivicak/data-and-analytics-reimagined-with-terraform-and-devops-principles-1g0b#tenets"&gt;&lt;em&gt;the way&lt;/em&gt; but not in the way&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shape of data products
&lt;/h3&gt;

&lt;p&gt;This aspect of the platform is the trickiest to tackle from the platform perspective since the shape of data products is ultimately chosen by the owning data domain team.&lt;/p&gt;

&lt;p&gt;We can, however, standardize on a few basic rules. Every data product is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;managed with code ("&lt;a href="https://www.atlassian.com/git/tutorials/why-git" rel="noopener noreferrer"&gt;Why Git&lt;/a&gt;" from Atlassian)&lt;/li&gt;
&lt;li&gt;described with a data contract&lt;/li&gt;
&lt;li&gt;true to that data contract&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data products are meant to be used and they provide an interface. In the same way that you know how to operate a door by its interface - &lt;a href="https://uxdesign.cc/intro-to-ux-the-norman-door-61f8120b6086" rel="noopener noreferrer"&gt;unless they're Norman doors&lt;/a&gt; - you should know how to use a data product by seeing its interface, its contract.&lt;/p&gt;

&lt;p&gt;Data domain teams are responsible for defining data contracts and making sure their data products adhere to them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data mesh
&lt;/h2&gt;

&lt;p&gt;Data domains on their own are powerful, but their superpower is in their ability to interconnect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxiwrgof5rjane7my4px3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxiwrgof5rjane7my4px3.png" alt="Data mesh" width="800" height="393"&gt;&lt;/a&gt;&lt;br&gt;Data mesh
  &lt;/p&gt;

&lt;p&gt;A great definition of a data mesh architecture: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A data mesh architecture is a decentralized approach that enables domain teams to perform cross-domain data analysis on their own. At its core is the domain with its responsible team and its operational and analytical data. The domain team ingests operational data and builds analytical data models as data products to perform their own analysis. It may also choose to publish data products with data contracts to serve other domains’ data needs. &lt;a href="https://www.datamesh-architecture.com/#how-to-design-a-data-mesh" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Data domains are the nodes in this mesh. What makes it a proper &lt;em&gt;mesh&lt;/em&gt; are the (data) connections between these domains.&lt;/p&gt;

&lt;p&gt;A business analyst trying to answer "&lt;em&gt;How many white shirts do I need?&lt;/em&gt;" now has a data mesh at their disposal. A data mesh made up of clearly defined data domains, each with an owner, with described and maintained data products that adhere to a contract and have the same security guidelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf633hojx2ig61xbmuja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnf633hojx2ig61xbmuja.png" alt="Business analyst using a data mesh" width="800" height="553"&gt;&lt;/a&gt;&lt;br&gt;Business analyst using a data mesh
  &lt;/p&gt;

&lt;p&gt;They know who to contact regarding data access (they can even request it themselves!), and once that access is given, the data is stored in a uniform place and is secured in a uniform way.&lt;/p&gt;

&lt;p&gt;By building data domains and by organizing them into a data mesh we have established a framework for organizing and connecting data on our platform. More importantly, we have set the groundwork for success and hugely improved our data landscape.&lt;/p&gt;




&lt;p&gt;This article went over the base architecture of our platform and certain standards applied to objects in it.&lt;/p&gt;

&lt;p&gt;But what are &lt;a href="https://diablo.fandom.com/wiki/Tools_of_the_Trade_(Quest)" rel="noopener noreferrer"&gt;the tools of the trade&lt;/a&gt;? Which processes are standardized on the platform? The third article in the series goes over the tools chosen to make up the platform and to maintain it.&lt;/p&gt;

</description>
      <category>datamesh</category>
      <category>snowflake</category>
      <category>terraform</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Data and analytics reimagined with Terraform and DevOps principles</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Wed, 16 Jul 2025 18:19:41 +0000</pubDate>
      <link>https://forem.com/ivicak/data-and-analytics-reimagined-with-terraform-and-devops-principles-1g0b</link>
      <guid>https://forem.com/ivicak/data-and-analytics-reimagined-with-terraform-and-devops-principles-1g0b</guid>
      <description>&lt;p&gt;"&lt;em&gt;How many white shirts do I need?&lt;/em&gt;" is a very simple question to answer for you and me. Answering that same question as a demand planner in a fashion enterprise requires a data driven approach, because the consequences of being wrong are far reaching. To be data driven, you must have actionable data.&lt;/p&gt;

&lt;p&gt;The following blog series describes the path that &lt;a href="https://bestseller.com/" rel="noopener noreferrer"&gt;BESTSELLER's&lt;/a&gt; Data &amp;amp; Analytics Platform set out on 2.5 years ago in an effort to produce actionable data. It will reason about the choices we made, focus on key challenges we faced, and celebrate the big wins that we got from it.&lt;/p&gt;




&lt;p&gt;The first step on this path is to understand the business and the platform we're building for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bestseller.com/" rel="noopener noreferrer"&gt;BESTSELLER&lt;/a&gt; is a family-owned fashion business that is a home to more than 22000 colleagues across design, logistics and tech. They work for more than 20 brands, including JACK &amp;amp; JONES, ONLY and VERO MODA, across 75 countries.&lt;/p&gt;

&lt;p&gt;Each brand in this multi-brand matrix organisation operates with a high degree of independence, which allows them to remain agile in their decision-making process. The brands' identities are different, but their operational practices remain the same. That's why shared functions, such as the Data &amp;amp; Analytics platform, provide the common building blocks for the brands to use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeeuc037joe6yau8jgqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeeuc037joe6yau8jgqc.png" alt="data analytics as a shared function of the business" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During daily operations, clothes are sold through multiple sales channels (retail, wholesale, online etc.), which produces large amounts of operational data. It makes perfect sense to use this data for analytical purposes - to understand the past and predict the future. But with a high degree of independence comes the responsibility to exercise it responsibly. Over the years, several data silos formed, each with its own governance practices, levels of data maturity and ownership structures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1hbizb3q3h1wr0eqgka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv1hbizb3q3h1wr0eqgka.png" alt="a web of point-to-point connection" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These silos made answering the question "&lt;em&gt;How many white shirts do I need?&lt;/em&gt;" very difficult, because the data you need to answer it is scattered, possibly unavailable and under unclear ownership. Data producers and consumers started to become connected in an ever-growing web of point-to-point connections. Even worse, those connections were between clients and databases located on-premise or in the cloud, semi-accessible data stores, semi-structured files and even data stores on personal laptops.&lt;/p&gt;

&lt;p&gt;The business is ambitious and we have clear goals:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We want to open one retail store each working day in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To live up to the expectations of the business, and to answer "&lt;em&gt;How many white shirts do I need?&lt;/em&gt;" reliably, across 20 brands and across multiple sales channels, we needed a structured solution that provides actionable data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The platform
&lt;/h2&gt;

&lt;p&gt;The data &amp;amp; analytics platform (the platform) we're building has a very clear goal - to enable data and machine learning (ML) engineers to create data products and make them readily available to various parts of the business. These data products can be schemas or views in a database, (semi)structured files unloaded to blob storage, various reports compiled for executives or anything else in between.&lt;/p&gt;

&lt;p&gt;Data products are used by various departments in the company to understand the past and predict the future. When used by business analysts and decision makers, these data products help in demand planning, supply chain optimizations, understanding the environmental impact and so on. They augment the decision making process with data.&lt;/p&gt;

&lt;p&gt;When talking to prospective stakeholders, we describe the platform as a highway: it gets you from point A to point B, whether you drive a small electric car or a diesel-powered truck. The signage is uniform, the speed limits are known in advance, and the rules apply to every vehicle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tenets
&lt;/h3&gt;

&lt;p&gt;We are building our platform to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;The&lt;/em&gt; way, but not in the way.&lt;/li&gt;
&lt;li&gt;Flexible, while having general rules. &lt;/li&gt;
&lt;li&gt;In service of those who are using it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three tenets shape our vision, decisions and priorities.&lt;/p&gt;




&lt;p&gt;If I had to describe the platform in a single sentence, I would say that it is a curated collection of tools, standards and processes to ingest, store, transform and serve data.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/ivicak/data-and-analytics-reimagined-platform-architecture-3f9j"&gt;second article in the series&lt;/a&gt; explains the architecture chosen for the platform and what challenges it is meant to solve. &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>snowflake</category>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>Validating Terraform configuration just got easier</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Mon, 30 Dec 2024 22:20:37 +0000</pubDate>
      <link>https://forem.com/ivicak/validating-terraform-configuration-just-got-easier-1ioh</link>
      <guid>https://forem.com/ivicak/validating-terraform-configuration-just-got-easier-1ioh</guid>
      <description>&lt;p&gt;Upgrading provider versions is essential for keeping your infrastructure managed with Terraform up-to-date and feature-rich.&lt;/p&gt;

&lt;p&gt;In software engineering, it is inevitable that components (functions, APIs, implementations) become deprecated and get phased out. Luckily, Terraform has a &lt;a href="https://developer.hashicorp.com/terraform/language/providers/requirements#version-constraints" rel="noopener noreferrer"&gt;robust way of managing provider versions&lt;/a&gt; and of &lt;a href="https://developer.hashicorp.com/terraform/cli/commands/validate" rel="noopener noreferrer"&gt;validating your configuration&lt;/a&gt;, so you can see which resources are currently deprecated or misconfigured.&lt;/p&gt;

&lt;p&gt;Working with the output of Terraform's &lt;code&gt;validate&lt;/code&gt; command is not always convenient, considering that it can easily be over 50000 (yes, fifty thousand) lines.&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit on &lt;code&gt;terraform validate&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I was recently in such a situation; my Terraform state had close to 3600 resource instances, 2075 of which were deprecated - a cool 57% of all resource instances 😄&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;│ This resource is deprecated and will be removed in a future major version release.
│ 
│ (and 4133 more similar warnings elsewhere)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;terraform validate&lt;/code&gt; (&lt;a href="https://developer.hashicorp.com/terraform/cli/commands/validate" rel="noopener noreferrer"&gt;docs&lt;/a&gt;) is a great tool - it shows you all the details about deprecated and misconfigured resource instances that need your attention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "format_version": "1.0",
  "valid": true,
  "error_count": 0,
  "warning_count": 2075,
  "diagnostics": [
    {
      "severity": "warning",
      "summary": "Deprecated Resource",
      "detail": "This resource is deprecated and will be removed in a future major version release. Please use CDEF instead.",
      "address": "module.some.module.address.abcd.name",
      "range": {
        "filename": ".terraform/modules/some.address/main.tf",
        "start": {
          "line": 539,
          "column": 71,
          "byte": 13195
        },
        "end": {
          "line": 539,
          "column": 72,
          "byte": 13196
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only issue is that the output file of &lt;code&gt;terraform validate -json&lt;/code&gt; has more than 50000 lines and is not very convenient to work with. &lt;code&gt;terraform-validate-explorer&lt;/code&gt; to the rescue!&lt;/p&gt;
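&lt;p&gt;For a quick first pass without any tooling, the same file can also be sliced with a few lines of Python. A minimal sketch, assuming the &lt;code&gt;diagnostics&lt;/code&gt; schema shown above; the resource addresses are made up for illustration:&lt;/p&gt;

```python
import json

# Shaped like `terraform validate -json` output (schema shown above);
# the addresses are hypothetical.
raw = """
{
  "format_version": "1.0",
  "valid": true,
  "error_count": 0,
  "warning_count": 2,
  "diagnostics": [
    {"severity": "warning", "summary": "Deprecated Resource",
     "address": "module.legal.snowflake_role_grants.tables_future_read"},
    {"severity": "warning", "summary": "Deprecated Resource",
     "address": "module.sales.snowflake_role_grants.tables_future_read"}
  ]
}
"""

report = json.loads(raw)

# Collect every warning and the unique resource addresses behind them.
warnings = [d for d in report["diagnostics"] if d["severity"] == "warning"]
unique_addresses = sorted({w["address"] for w in warnings})

print(len(warnings))
print(unique_addresses)
```

&lt;p&gt;Handy for a head count, but much less pleasant than a point-and-click filter once the file grows to tens of thousands of lines.&lt;/p&gt;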

&lt;h2&gt;
  
  
  terraform-validate-explorer
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;terraform-validate-explorer&lt;/code&gt; is a tool that helps you search and filter resource instances from the output of &lt;code&gt;terraform validate -json&lt;/code&gt;. Get it from this &lt;a href="https://github.com/ivica-k/terraform-validate-explorer/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea for this tool came from a situation at work: the state file has many Snowflake resources, and the &lt;a href="https://registry.terraform.io/providers/Snowflake-Labs/snowflake" rel="noopener noreferrer"&gt;Terraform provider for Snowflake&lt;/a&gt; has undergone many changes in the past year, leading to plenty of deprecations.&lt;/p&gt;

&lt;p&gt;Version &lt;code&gt;1.x&lt;/code&gt; of the Snowflake provider became available and I wanted to upgrade the provider, meaning that I had to deal with 2075 resource instances that were deprecated. Some of these manage account role grants and I don't want to break those. As a matter of fact, I don't want to break anything for my stakeholders, so I decided to take things slowly.&lt;/p&gt;

&lt;h3&gt;
  
  
  "contains" filter
&lt;/h3&gt;

&lt;p&gt;Upgrading these resources one-by-one means that I have to find them first, and this is where the "contains" filter helped me:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh01rfi9o4r4qhkro7q8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh01rfi9o4r4qhkro7q8.jpg" alt="contains tables_future_read" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The screenshot above shows a search for all resources that have &lt;code&gt;tables_future_read&lt;/code&gt; in the name (Snowflake's &lt;a href="https://docs.snowflake.com/en/sql-reference/sql/grant-privilege#future-grants-on-database-or-schema-objects" rel="noopener noreferrer"&gt;"future" grants&lt;/a&gt; are amazing btw!)&lt;/p&gt;

&lt;h3&gt;
  
  
  "does not contain" filter
&lt;/h3&gt;

&lt;p&gt;To verify that only &lt;code&gt;snowflake_&lt;/code&gt; resources are deprecated, I filtered all the warnings that do not contain the word &lt;code&gt;snowflake&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5y0ba905v6y449ryxef.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5y0ba905v6y449ryxef.png" alt="does not contain snowflake" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No errors and no warnings - perfect!&lt;/p&gt;

&lt;h3&gt;
  
  
  "regex" filter
&lt;/h3&gt;

&lt;p&gt;If the other two filters are not cutting it for you, you can always do it with &lt;a href="https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/" rel="noopener noreferrer"&gt;one of the world's write-only languages&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Suppose you want to look for a resource instance that has &lt;code&gt;future_&lt;/code&gt; in the name, followed by a four-letter word that is at the end of the resource name:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukhd3yn6inp9819dsd6d.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukhd3yn6inp9819dsd6d.jpg" alt="regex search" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With regex, the sky's the limit! Well, that and the 255-character limit I put on that &lt;a href="https://doc.qt.io/qt-6/qlineedit.html" rel="noopener noreferrer"&gt;QLineEdit&lt;/a&gt;.&lt;/p&gt;
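&lt;p&gt;The same kind of pattern can be tried outside the app as well. A small Python sketch; the resource names below are hypothetical:&lt;/p&gt;

```python
import re

# Hypothetical resource addresses, for illustration only.
addresses = [
    "module.sales.snowflake_grant.tables_future_read",     # ends in a four-letter word
    "module.sales.snowflake_grant.views_future_read_all",  # "future_" not at the end
    "module.sales.snowflake_grant.schemas_future_grant",   # five letters after "future_"
]

# "future_" followed by a four-letter word at the end of the name.
pattern = re.compile(r"future_[a-z]{4}$")

matches = [a for a in addresses if pattern.search(a)]
print(matches)  # only the first address matches
```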

&lt;h3&gt;
  
  
  Output file validation
&lt;/h3&gt;

&lt;p&gt;If the output file of &lt;code&gt;terraform validate -json&lt;/code&gt; was somehow made invalid (with errrrm, manual edits?), &lt;code&gt;terraform-validate-explorer&lt;/code&gt; will check for that too:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls874d9nu4om4vg403u9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls874d9nu4om4vg403u9.png" alt="invalid file" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;terraform-validate-explorer&lt;/code&gt; is simple at the moment, with just the basic functionality. To make it more useful and more stable in the future, I plan to implement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;unit tests&lt;/li&gt;
&lt;li&gt;error handling&lt;/li&gt;
&lt;li&gt;showing only unique resources&lt;/li&gt;
&lt;li&gt;filtering an already filtered dataset&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>tooling</category>
      <category>parser</category>
    </item>
    <item>
      <title>Import AzureAD app role assignments into Terraform state</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Sat, 30 Dec 2023 18:07:35 +0000</pubDate>
      <link>https://forem.com/ivicak/import-azuread-app-role-assignments-into-terraform-state-5ckp</link>
      <guid>https://forem.com/ivicak/import-azuread-app-role-assignments-into-terraform-state-5ckp</guid>
      <description>&lt;p&gt;This article shows how to import &lt;a href="https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/app_role_assignment"&gt;AzureAD app role assignments&lt;/a&gt; into the Terraform state. With app role assignments, AzureAD users, groups, or service principals are assigned a role in an application. &lt;a href="https://learn.microsoft.com/en-us/graph/api/resources/approleassignment?view=graph-rest-1.0"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the use case I'm writing about, AzureAD groups are assigned a role "User" in the AzureAD enterprise application for Snowflake, allowing members of those &lt;a href="https://community.snowflake.com/s/article/HOW-TO-Setup-SSO-with-Azure-AD-and-the-Snowflake-New-URL-Format-or-Privatelink"&gt;AzureAD groups to single sign-on to Snowflake&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  In short
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The problem&lt;/li&gt;
&lt;li&gt;Get Microsoft Graph API token&lt;/li&gt;
&lt;li&gt;
List existing app role assignments using the Graph API

&lt;ul&gt;
&lt;li&gt;Handling multiple assignments&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Import existing app role assignments into the Terraform state&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Creating and managing a data platform involves, among other things, managing its infrastructure and user access. In the data platform my team is managing, we use Snowflake for analytical data warehousing, and user access is managed from &lt;a href="https://learn.microsoft.com/en-us/entra/identity/saas-apps/snowflake-tutorial"&gt;AzureAD&lt;/a&gt; (Microsoft Entra).&lt;/p&gt;

&lt;p&gt;This setup allows us to use existing AzureAD users and groups and grant them access to roles inside of Snowflake. In short, adding a group named "ECOM_SALES" will allow any member of that group to log in to Snowflake and use a &lt;a href="https://docs.snowflake.com/en/user-guide/security-access-control-considerations#aligning-object-access-with-business-functions"&gt;functional role&lt;/a&gt; named &lt;code&gt;ECOM_SALES&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Below is a screenshot of the Snowflake enterprise application showing multiple groups assigned the role "User":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--STKSteNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b97b7a45wyocgm098sne.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--STKSteNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b97b7a45wyocgm098sne.png" alt="Snowflake enterprise application showing added groups" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Being strong believers in infrastructure-as-code, my team manages the entire platform with Terraform. However, during the very early days, some changes were done manually, such as adding AzureAD groups to the enterprise application for Snowflake single sign-on.&lt;/p&gt;

&lt;p&gt;Access management is an important part of the overall platform security and as such should be managed through code. Manual configuration can be overlooked during migrations and does not follow the &lt;a href="https://www.unido.org/overview/member-states/change-management/faq/what-four-eyes-principle"&gt;four-eyes principle&lt;/a&gt;, which can cause outages; it also does not provide a stable foundation for a data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Now that we understand why all AzureAD app role assignments should be managed with Terraform, let's see how the manually added ones can be imported.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://registry.terraform.io/providers/hashicorp/azuread/latest/docs/resources/app_role_assignment"&gt;The docs&lt;/a&gt; state that app role assignments are imported with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform import \
  azuread_app_role_assignment.example \
  00000000-0000-0000-0000-000000000000/appRoleAssignment/aaBBcDDeFG6h5JKLMN2PQrrssTTUUvWWxxxxxyyyzzz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;00000000-0000-0000-0000-000000000000&lt;/code&gt; is the object ID of the service principal object associated with your AzureAD enterprise application for Snowflake.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aaBBcDDeFG6h5JKLMN2PQrrssTTUUvWWxxxxxyyyzzz&lt;/code&gt; is the app role assignment ID.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had absolutely no idea where to find the real value of &lt;code&gt;aaBBcDDeFG6h5JKLMN2PQrrssTTUUvWWxxxxxyyyzzz&lt;/code&gt; in the UI.&lt;/p&gt;

&lt;p&gt;Since Terraform providers communicate with third-party APIs (AWS, AzureAD, etc.), I figured I could do the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Microsoft Graph API token
&lt;/h2&gt;

&lt;p&gt;Microsoft's &lt;a href="https://learn.microsoft.com/en-us/graph/use-the-api"&gt;Graph API&lt;/a&gt; allows programmatic access to the Microsoft Cloud service resources.&lt;/p&gt;

&lt;p&gt;To use it, you must authenticate and obtain a &lt;a href="https://jwt.io/"&gt;JWT&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOKEN=$(curl -d grant_type=client_credentials \
  -d client_id=$CLIENT_ID \
  -d client_secret=$CLIENT_SECRET \
  -d scope=https://graph.microsoft.com/.default \
  -d resource=https://graph.microsoft.com  \
  https://login.microsoftonline.com/$TENANT_ID/oauth2/token  |
  jq -j .access_token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, export &lt;code&gt;CLIENT_ID&lt;/code&gt;, &lt;code&gt;CLIENT_SECRET&lt;/code&gt; and &lt;code&gt;TENANT_ID&lt;/code&gt; first.&lt;/p&gt;

&lt;h2&gt;
  
  
  List existing app role assignments using the Graph API
&lt;/h2&gt;

&lt;p&gt;Having the JWT in the &lt;code&gt;TOKEN&lt;/code&gt; variable, we can list the existing app role assignment for group &lt;code&gt;ECOM_SALES&lt;/code&gt; in the AzureAD enterprise application for Snowflake with ID &lt;code&gt;1ab2c3de-f456-7890-fghi-j12345k67lm8&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSIGNMENT_ID=$(curl -sX GET \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://graph.microsoft.com/v1.0/servicePrincipals/1ab2c3de-f456-7890-fghi-j12345k67lm8/appRoleAssignedTo |
  jq '.value[] | select(.principalDisplayName =="ECOM_SALES").id')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "@odata.context": "LINK_HERE",
    "@odata.nextLink": "LINK_HERE",
    "value": [
        {
            "id": "ZMW3ujNOGUuENMLa2k6-mVWFENfBHbVDlwQJg1ui848",
            "deletedDateTime": null,
            "appRoleId": "33c0d484-efdc-4e65-b6fc-470f4bcb4f46",
            "createdDateTime": "2023-02-17T08:48:17.1803369Z",
            "principalDisplayName": "ECOM_SALES",
            "principalId": "bab7c564-4e33-4b19-8434-c2dada4ebe99",
            "principalType": "Group",
            "resourceDisplayName": "Snowflake AAD",
            "resourceId": "1ab2c3de-f456-7890-fghi-j12345k67lm8"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and our variable &lt;code&gt;ASSIGNMENT_ID&lt;/code&gt; has the value &lt;code&gt;ZMW3ujNOGUuENMLa2k6-mVWFENfBHbVDlwQJg1ui848&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;With this value we now have all the elements necessary to import this app role assignment into the Terraform state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multiple assignments
&lt;/h3&gt;

&lt;p&gt;In a situation where you have multiple AzureAD app role assignments for groups such as &lt;code&gt;ECOM_SALES&lt;/code&gt;, &lt;code&gt;ECOM_FINANCE&lt;/code&gt; and &lt;code&gt;ECOM_LEGAL&lt;/code&gt;, you can use the following &lt;code&gt;jq&lt;/code&gt; expression to list them all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.value[] | select(.principalDisplayName | startswith("ECOM_"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
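&lt;p&gt;The same selection can be sketched in Python, using the &lt;code&gt;appRoleAssignedTo&lt;/code&gt; response shape shown earlier; the IDs and group names below are fake:&lt;/p&gt;

```python
# A Python take on the jq "startswith" filter, run against a response
# shaped like the Graph API output above (all values are fake).
response = {
    "value": [
        {"id": "ZMW3ujNOGUuENMLa2k6-fake1", "principalDisplayName": "ECOM_SALES"},
        {"id": "AbCdEfGhIjKlMnOpQrSt-fake2", "principalDisplayName": "ECOM_FINANCE"},
        {"id": "UvWxYzAbCdEfGhIjKlMn-fake3", "principalDisplayName": "HR_PAYROLL"},
    ]
}

# Map every ECOM_* group to its app role assignment ID.
ecom_assignments = {
    a["principalDisplayName"]: a["id"]
    for a in response["value"]
    if a["principalDisplayName"].startswith("ECOM_")
}
print(ecom_assignments)
```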



&lt;h2&gt;
  
  
  Import existing app role assignments into the Terraform state
&lt;/h2&gt;

&lt;p&gt;With the app role assignment ID in the &lt;code&gt;ASSIGNMENT_ID&lt;/code&gt; variable, and the AzureAD enterprise application for Snowflake having the ID &lt;code&gt;1ab2c3de-f456-7890-fghi-j12345k67lm8&lt;/code&gt;, we can import the assignment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform import \
  azuread_app_role_assignment.example \
  1ab2c3de-f456-7890-fghi-j12345k67lm8/appRoleAssignment/$ASSIGNMENT_ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article provided a quick solution for importing AzureAD app role assignments into the Terraform state, addressing the risks of manual interventions in infrastructure management. Using Microsoft's Graph API and Terraform, we demonstrated a streamlined process to list and import app role assignments.&lt;/p&gt;

</description>
      <category>azuread</category>
      <category>terraform</category>
      <category>snowflake</category>
      <category>dataplatform</category>
    </item>
    <item>
      <title>Practical ECS scaling: horizontally scaling an application based on its response time</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Fri, 22 Dec 2023 17:48:09 +0000</pubDate>
      <link>https://forem.com/aws-builders/practical-ecs-scaling-horizontally-scaling-an-application-based-on-its-response-time-bap</link>
      <guid>https://forem.com/aws-builders/practical-ecs-scaling-horizontally-scaling-an-application-based-on-its-response-time-bap</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-an-application-with-a-memory-leak-3bel"&gt;previous article&lt;/a&gt; looked at whether changing the performance envelope for an application with a memory leak was effective. This article answers the question: "Should you horizontally scale your application based on its response time?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontal scaling&lt;/li&gt;
&lt;li&gt;The endpoint under test&lt;/li&gt;
&lt;li&gt;Running tests&lt;/li&gt;
&lt;li&gt;Should you horizontally scale your application based on response times?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Horizontal scaling
&lt;/h2&gt;

&lt;p&gt;Since this is the first article that deals with horizontal scaling, a quote from Nathan reminds us what horizontal scaling is.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Horizontal scaling is when you spread your workload across a larger number of application containers. It is based on aggregate resource consumption metrics for the service. For example you can look at average CPU resource consumption metric across all copies of your container. &lt;/p&gt;

&lt;p&gt;When the aggregate average utilization breaches a high threshold you scale out by adding more copies of the container. If it breaches a low threshold you reduce the number of copies of the container. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The endpoint under test
&lt;/h2&gt;

&lt;p&gt;Our &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-an-introduction-4f26"&gt;mock application&lt;/a&gt; comes with this REST API endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/long_response_time&lt;/code&gt;, simulating an increasingly busy database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When this endpoint is invoked, the application calculates the square root of &lt;code&gt;64 * 64 * 64 * 64 * 64 * 64 ** 64&lt;/code&gt; and saves the result to a database. Due to an increased load on the database, each &lt;code&gt;INSERT&lt;/code&gt; query takes longer and longer to complete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running tests
&lt;/h2&gt;

&lt;p&gt;Your morning has just started and that coffee smells so good. While sipping it, you glance at your application's monitoring dashboard and notice that the average response time has gone from ±300ms to more than 2 seconds.&lt;/p&gt;

&lt;p&gt;Not good.&lt;/p&gt;

&lt;p&gt;You decide to configure application autoscaling based on the response time. The idea is that running more containers will help distribute the workload and bring down the response time to acceptable levels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scaling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;auto_scale_task_count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_capacity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scaling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scale_to_track_custom_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;responsescaling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;target_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;target_response_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;target_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;scale_in_cooldown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;scale_out_cooldown&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minutes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above CDK code configures autoscaling for your service running on ECS. It tracks the target group's response time metric; when the value exceeds the target of 2 (seconds), additional ECS tasks are launched, with a one-minute cooldown between scaling actions, up to a maximum of 5 tasks.&lt;/p&gt;

&lt;p&gt;The configuration is applied:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6okvz1sxilfv6woxfpud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6okvz1sxilfv6woxfpud.png" alt="ECS autoscaling configuration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and it seems to be working! The response time has dropped below 2 seconds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ub26xqgaxj0b31xrx8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ub26xqgaxj0b31xrx8x.png" alt="Response time dropping with two running containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That cold coffee is the least of your concerns now because increasing the number of tasks helped only temporarily.&lt;/p&gt;

&lt;p&gt;The third and fourth containers start up fairly quickly, but the response time rises relentlessly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmikp0k6jvwgxn30fesgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmikp0k6jvwgxn30fesgr.png" alt="Response time rising relentlessly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few minutes, the service scales up to the maximum of 5 tasks to try and cope with the rising response times...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sbmpyo6hxo9ajgf8n80.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sbmpyo6hxo9ajgf8n80.png" alt="Running the maximum number of containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;... but it is completely ineffective, as response time is still growing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmvkn5cainbbhonuz9jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmvkn5cainbbhonuz9jv.png" alt="Response time still rising"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Why is that? &lt;/p&gt;

&lt;p&gt;Well, the ship is taking on water and many sailors are rushing to empty it, but there's only a handful of buckets available 🪣 🪣 🪣.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you horizontally scale your application based on response times?
&lt;/h2&gt;

&lt;p&gt;You can, but it won't do much good.&lt;/p&gt;

&lt;p&gt;The "&lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/capacity-autoscaling.html" rel="noopener noreferrer"&gt;Identifying a utilization metric&lt;/a&gt;" paragraph has a great explanation on choosing the metric to track and base the autoscaling on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The metric must be correlated with demand. When resources are held steady, but demand changes, the metric value must also change. The metric should increase or decrease when demand increases or decreases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The metric value must scale in proportion to capacity.&lt;/strong&gt; When demand holds constant, adding more resources must result in a proportional change in the metric value. So, doubling the number of tasks should cause the metric to decrease by 50%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The part in bold is important because it applies directly to our use case: the metric value is not scaling in proportion to capacity. Doubling the number of tasks (and doubling them again) did not cause the metric value to decrease by 50%.&lt;/p&gt;
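&lt;p&gt;That property is easy to eyeball with a back-of-the-envelope check. The numbers below are illustrative, not measurements from the tests above:&lt;/p&gt;

```python
def scales_in_proportion(metric_before: float, metric_after: float,
                         capacity_factor: float, tolerance: float = 0.2) -> bool:
    """True if the metric dropped roughly in inverse proportion to added capacity."""
    expected = metric_before / capacity_factor
    return abs(metric_after - expected) <= tolerance * expected

# Average CPU utilization: doubling the tasks roughly halves it -> a good scaling metric.
print(scales_in_proportion(80.0, 42.0, 2.0))  # True

# Response time against an overloaded database: doubling the tasks barely moves it.
print(scales_in_proportion(2.4, 2.2, 2.0))    # False
```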

&lt;p&gt;An overloaded database can cause response times to skyrocket, and adding more tasks won't help.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In fact, it may actually make it much worse because launching more copies of the application causes more connections to an already overloaded database server. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
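&lt;p&gt;The connection math behind that warning is simple to sketch (the per-task pool size here is an assumption for illustration):&lt;/p&gt;

```python
# Each task typically opens its own database connection pool, so scaling
# out multiplies the number of connections the database must handle.
POOL_SIZE_PER_TASK = 10  # assumed per-task pool size

def total_db_connections(tasks):
    return tasks * POOL_SIZE_PER_TASK

print(total_db_connections(4))   # 4 tasks: 40 connections
print(total_db_connections(16))  # 16 tasks: 160 connections, same slow DB
```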




&lt;p&gt;Dear reader, thank you for following my journey through practical ECS scaling. We looked into how a &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c"&gt;CPU-heavy application performs better with more CPU resources&lt;/a&gt;, how &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-an-application-with-a-memory-leak-3bel"&gt;memory leaks are like monkey wrenches in the machine&lt;/a&gt; and the futility of horizontally scaling an application when the database is overloaded (this article).&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>scaling</category>
      <category>cdk</category>
    </item>
    <item>
      <title>Practical ECS scaling: vertically scaling an application with a memory leak</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Wed, 20 Dec 2023 18:59:37 +0000</pubDate>
      <link>https://forem.com/aws-builders/practical-ecs-scaling-vertically-scaling-an-application-with-a-memory-leak-3bel</link>
      <guid>https://forem.com/aws-builders/practical-ecs-scaling-vertically-scaling-an-application-with-a-memory-leak-3bel</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c"&gt;previous article&lt;/a&gt; looked at how changing the performance envelope for a CPU-heavy application affected its performance. This article shows whether vertically scaling an application with a memory leak is effective.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The endpoint under test&lt;/li&gt;
&lt;li&gt;Running tests&lt;/li&gt;
&lt;li&gt;
Results

&lt;ul&gt;
&lt;li&gt;Container 1 (1GB of memory)&lt;/li&gt;
&lt;li&gt;Container 2 (2GB of memory)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;How effective is it to vertically scale an application that has a memory leak?&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The endpoint under test
&lt;/h2&gt;

&lt;p&gt;Our &lt;a href="https://dev.to/ivicak/practical-ecs-scaling-2b05-temp-slug-3558989?preview=609977daf9ecaef438eaca84344f22a40049ed64d0c4f120e2451035ac73faf886c11952d19be6aacd95fc4ed114f985bc762efbd07d1406ead73497#meet-the-app"&gt;mock application&lt;/a&gt; comes with this REST API endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/memory_leak&lt;/code&gt;, simulating a &lt;a href="https://en.wikipedia.org/wiki/Memory_leak" rel="noopener noreferrer"&gt;memory leak&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When this endpoint is invoked, the application calculates the square root of &lt;code&gt;64 * 64 * 64 * 64 * 64 * 64 ** 64&lt;/code&gt; and returns the result. Due to a bad code merge, 1MB of memory is added to a Python list on each request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;leaked_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/memory_leak&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory_leaky_task&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;leaked_memory&lt;/span&gt;

    &lt;span class="c1"&gt;# appends approx. 1MB of data to a list on each
&lt;/span&gt;    &lt;span class="c1"&gt;# request, creating a memory leak
&lt;/span&gt;    &lt;span class="n"&gt;leaked_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;bytearray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_build_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;do_sqrt&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running tests
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;It is better to be roughly right than precisely wrong.&lt;br&gt;
  — Alan Greenspan&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To load-test this application, I used &lt;a href="https://github.com/rakyll/hey" rel="noopener noreferrer"&gt;hey&lt;/a&gt; to invoke the endpoint with 5 requests per second for 30 minutes using &lt;code&gt;hey -z 30m -q 1 -c 5 $URL/memory_leak&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To be able to compare results, I ran the same application in two containers with different hardware configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CPUs&lt;/th&gt;
&lt;th&gt;Memory (GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Container 1&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container 2&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Container 1 (1GB of memory)
&lt;/h3&gt;

&lt;p&gt;Looking at the summary of &lt;code&gt;hey&lt;/code&gt;, we notice that not all requests were successful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary:
  Total:        1800.0255 secs
  Slowest:      10.1465 secs
  Fastest:      0.0088 secs
  Average:      0.1860 secs
  Requests/sec: 4.3499

Status code distribution:
  [200] 6148 responses
  [502] 341 responses
  [503] 1211 responses
  [504] 130 responses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Roughly 21% of all requests had a non-200 status code. 😞 This is not a great user experience.&lt;/p&gt;
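&lt;p&gt;A quick way to double-check that figure from &lt;code&gt;hey&lt;/code&gt;'s status code distribution:&lt;/p&gt;

```python
# Error rate computed from hey's status code distribution above
responses = {200: 6148, 502: 341, 503: 1211, 504: 130}

total = sum(responses.values())
errors = total - responses[200]
error_rate = errors / total

print(f"{errors} of {total} requests failed ({error_rate:.1%})")
```

&lt;p&gt;That works out to 1682 failed requests out of 7830, i.e. about 21.5%.&lt;/p&gt;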

&lt;p&gt;Looking at the ECS task details, we notice that there are 7 tasks in total, only one of which is running at the moment - the other 6 are stopped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4nwwt1bv12d88xhemhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4nwwt1bv12d88xhemhr.png" alt="Several stopped tasks" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To get more details, we can describe one of the stopped tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecs describe-tasks \
  --cluster ecs-scaling-cluster \
  --tasks 7f0872485e6e421e8f83a062a3704303 |
  jq -r '.tasks[0].containers[0].reason'

OutOfMemoryError: Container killed due to memory usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;OutOfMemoryError: Container killed due to memory usage&lt;/strong&gt; becomes obvious when looking at the memory utilization metric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy09c6wo0pkrkn9265m2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy09c6wo0pkrkn9265m2g.png" alt="Memory utilization" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sawtooth pattern reveals the problem: our application is exceeding the performance envelope for the "Memory" dimension! Each request leaks approx. 1MB of memory, and because the container is given 1GB of memory, serving roughly 1,000 requests leads to the container running out of memory. At that point the ECS service forcefully stops the container and starts a fresh one.&lt;/p&gt;
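&lt;p&gt;A back-of-the-envelope estimate (assuming a clean 1MB leak per request and ignoring the app's baseline memory usage) matches what we observed:&lt;/p&gt;

```python
LEAK_PER_REQUEST_MB = 1

def requests_until_oom(container_memory_gb):
    # memory available to leak into, divided by the per-request leak
    return container_memory_gb * 1024 // LEAK_PER_REQUEST_MB

print(requests_until_oom(1))  # 1024 requests before the OOM kill
print(requests_until_oom(2))  # 2048 requests: more memory only delays it
```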

&lt;h3&gt;
  
  
  Container 2 (2GB of memory)
&lt;/h3&gt;

&lt;p&gt;Running another container with double the memory (1GB → 2GB) and load testing it in the same way, we get very similar results, with approx. 7% of all requests having a &lt;code&gt;5xx&lt;/code&gt; status code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Summary:
  Total:        1800.0240 secs
  Slowest:      10.0976 secs
  Fastest:      0.0119 secs
  Average:      0.0850 secs
  Requests/sec: 4.7249

Status code distribution:
  [200] 7868 responses
  [502] 177 responses
  [503] 405 responses
  [504] 55 responses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this instance only 4 tasks were started, 3 of which were forcefully stopped:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn05ickg2ux71563zr6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn05ickg2ux71563zr6a.png" alt="Several stopped tasks" width="800" height="319"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecs describe-tasks \
  --cluster ecs-scaling-cluster \
  --tasks c35b7029c38c4383b26e768aec3c77f2 |
  jq -r '.tasks[0].containers[0].reason'

OutOfMemoryError: Container killed due to memory usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And again, the sawtooth memory utilization reveals that we have a memory leak:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq396e5gr08051h79qh4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq396e5gr08051h79qh4r.png" alt="Sawtooth memory utilization" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How effective is it to vertically scale an application that has a memory leak?
&lt;/h2&gt;

&lt;p&gt;Not at all.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A memory leak can not be fixed by scaling. You can’t vertically or horizontally scale yourself out of a memory leak. The only way to fix this is to fix the application code. You cannot have scalability with a memory leak. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Regardless of its memory configuration, a container with an application that has a memory leak will sooner or later run out of memory.&lt;/p&gt;

&lt;p&gt;AWS re:Post has an article on &lt;a href="https://repost.aws/knowledge-center/ecs-resolve-outofmemory-errors" rel="noopener noreferrer"&gt;troubleshooting OutOfMemory errors&lt;/a&gt;. &lt;a href="https://aws.amazon.com/blogs/containers/how-amazon-ecs-manages-cpu-and-memory-resources/" rel="noopener noreferrer"&gt;This blog post&lt;/a&gt; explains how containers (in general, and those running on ECS) consume CPU and memory.&lt;/p&gt;




&lt;p&gt;Next up: Should you horizontally scale your application based on response times?&lt;/p&gt;

&lt;p&gt;You also might want to check how &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c"&gt;changing the performance envelope for a CPU-heavy application affects its performance&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>cdk</category>
      <category>scaling</category>
    </item>
    <item>
      <title>Practical ECS scaling: vertically scaling a CPU-heavy application</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Sun, 17 Dec 2023 15:35:52 +0000</pubDate>
      <link>https://forem.com/aws-builders/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c</link>
      <guid>https://forem.com/aws-builders/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c</guid>
      <description>&lt;p&gt;The &lt;a href="https://dev.to/ivicak/practical-ecs-scaling-an-introduction-4f26"&gt;introductory article&lt;/a&gt; defined the performance envelope, and this one looks at how changing the performance envelope for a CPU-heavy application affects its performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The endpoint under test&lt;/li&gt;
&lt;li&gt;Running tests&lt;/li&gt;
&lt;li&gt;
Results 

&lt;ul&gt;
&lt;li&gt;Container 1 (0.25CPU)&lt;/li&gt;
&lt;li&gt;Container 2 (0.5CPU)&lt;/li&gt;
&lt;li&gt;Container 3 (1CPU)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Can a CPU-heavy application perform better with more CPU resources?&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The endpoint under test
&lt;/h2&gt;

&lt;p&gt;Our &lt;a href="https://dev.to/ivicak/practical-ecs-scaling-2b05-temp-slug-3558989?preview=609977daf9ecaef438eaca84344f22a40049ed64d0c4f120e2451035ac73faf886c11952d19be6aacd95fc4ed114f985bc762efbd07d1406ead73497#meet-the-app"&gt;mock application&lt;/a&gt; is built in Flask and has several REST API endpoints, one of which is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/cpu_intensive&lt;/code&gt;, simulating a CPU-intensive task. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When this endpoint is invoked, the application calculates the square root of &lt;code&gt;64 * 64 * 64 * 64 * 64 * 64 ** 64&lt;/code&gt; and returns the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ http  http://ALB.eu-central-1.elb.amazonaws.com/cpu_intensive

HTTP/1.1 200 OK
Connection: close
Content-Length: 36
Content-Type: application/json
Date: Sun, 26 Nov 2023 15:52:46 GMT
Server: Werkzeug/2.3.7 Python/3.9.6

{
    "result": "2.0568806966515076e+62"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
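&lt;p&gt;The math checks out: the expression equals 2^414, so its square root is exactly 2^207.&lt;/p&gt;

```python
import math

# 64^5 * 64^64 = 2^30 * 2^384 = 2^414, whose square root is exactly 2^207
value = 64 * 64 * 64 * 64 * 64 * 64 ** 64
print(math.sqrt(value))  # 2.0568806966515076e+62, as in the response above
```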



&lt;h2&gt;
  
  
  Running tests
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;It is better to be roughly right than precisely wrong.&lt;br&gt;
  — Alan Greenspan&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To load-test this application, I used &lt;a href="https://github.com/rakyll/hey" rel="noopener noreferrer"&gt;hey&lt;/a&gt; to invoke the endpoint with 5 requests per second for 30 minutes using &lt;code&gt;hey -z 30m -q 1 -c 5 $URL/cpu_intensive&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To be able to compare results, I ran the same application in three containers, each with different hardware constraints:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CPUs&lt;/th&gt;
&lt;th&gt;Memory (GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Container 1&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container 2&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container 3&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Container 1 (0.25CPU)
&lt;/h3&gt;

&lt;p&gt;As expected, Container 1 performed the worst, averaging 3.13 requests per second. Containers 2 and 3 were both able to serve 4.99 requests per second.&lt;/p&gt;

&lt;p&gt;One of the &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/files/slides/21.svg" rel="noopener noreferrer"&gt;graphs&lt;/a&gt; from Nathan's article shows a CPU load peaking and staying at 100% for the duration of the load test. I was able to achieve the same results with container 1 in my test.&lt;/p&gt;

&lt;p&gt;Container 1 is clearly on its knees, with average CPU utilization at 100% for the duration of the test:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoyy7xrkf57cjzlrure2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhoyy7xrkf57cjzlrure2.png" alt="A quarter of CPU and half-gig of memory is not enough"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In this graph you can see CPU and memory utilization over time as the load test ramps up. The CPU metric is much higher than the memory metric, and it flattens out around 100%.&lt;/p&gt;

&lt;p&gt;This means that the application ran out of CPU resource first. The workload is primarily CPU bound. This is quite normal, as most workloads run out of CPU before they run out of memory. As the application runs out of CPU, the quality of the service suffers before it actually runs out of memory.&lt;/p&gt;

&lt;p&gt;This tells us one micro optimization we might be able to make, is to &lt;strong&gt;modify the performance envelope to add a bit more CPU&lt;/strong&gt; and a bit less memory. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Container 2 (0.5CPU)
&lt;/h3&gt;

&lt;p&gt;Container 2 has double the CPU and delivers the expected performance of 5 requests per second with an average CPU utilization of around 90%:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccc3bke2cvuzhp3yll4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccc3bke2cvuzhp3yll4q.png" alt="Half of CPU and one gig of memory is enough"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Container 3 (1CPU)
&lt;/h3&gt;

&lt;p&gt;Doubling the amount of CPU again, container 3 delivers the expected performance with average CPU utilization around 35%:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eyds1s73ybfksivkfus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eyds1s73ybfksivkfus.png" alt="A full CPU core and 2 gig of memory is more than enough"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could even say that container 3, with 1 CPU and 2GB of memory, is overprovisioned. In dollar amounts, it would cost roughly $41 per month to run. On the other hand, container 2 would cost roughly $20 while delivering the same baseline performance of 5 requests per second.&lt;/p&gt;
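&lt;p&gt;Those dollar amounts can be reproduced from Fargate's two pricing dimensions. The rates below are approximate eu-central-1 on-demand prices and are an assumption; check the current pricing page before relying on them:&lt;/p&gt;

```python
# Approximate Fargate on-demand rates (assumed; verify against current pricing)
VCPU_PER_HOUR_USD = 0.04656
GB_PER_HOUR_USD = 0.00511
HOURS_PER_MONTH = 730

def monthly_cost_usd(vcpu, memory_gb):
    return (vcpu * VCPU_PER_HOUR_USD + memory_gb * GB_PER_HOUR_USD) * HOURS_PER_MONTH

print(round(monthly_cost_usd(1.0, 2.0)))  # container 3: about $41/month
print(round(monthly_cost_usd(0.5, 1.0)))  # container 2: about half that
```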

&lt;h2&gt;
  
  
  Can a CPU-heavy application perform better with more CPU resources?
&lt;/h2&gt;

&lt;p&gt;As expected, yes. Increasing the amount of CPU from 0.25 to 0.5 allows the application container to deliver the expected performance of 5 requests per second while doing a CPU-heavy calculation.&lt;/p&gt;

&lt;p&gt;Going from 0.5CPU to 1CPU doesn't add any measurable benefit at 5 requests per second, but it would allow the application to respond more quickly and scale to more requests per second.&lt;/p&gt;

&lt;p&gt;Looking at &lt;code&gt;hey&lt;/code&gt;'s output in more detail, we can see that container 3 had response times that are almost &lt;em&gt;3 times faster&lt;/em&gt; than those from container 2.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CPUs&lt;/th&gt;
&lt;th&gt;Memory (GB)&lt;/th&gt;
&lt;th&gt;Requests/sec&lt;/th&gt;
&lt;th&gt;Avg. response time (sec)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Container 1&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;3.1384&lt;/td&gt;
&lt;td&gt;1.5909&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container 2&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;4.9974&lt;/td&gt;
&lt;td&gt;0.8514&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container 3&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;2.0&lt;/td&gt;
&lt;td&gt;4.9990&lt;/td&gt;
&lt;td&gt;0.3217&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;The end goal of all this load testing and metric analysis is to define an expected performance envelope that fits your application needs. Ideally it should also provide a little bit of extra space for occasional bursts of activity. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Container 2, with 0.5 CPU and 1GB of memory, provides just that. Vertically scaling a CPU-heavy application results in increased performance.&lt;/p&gt;




&lt;p&gt;Next up: Let's look at how &lt;a href="https://dev.to/aws-builders/practical-ecs-scaling-vertically-scaling-an-application-with-a-memory-leak-3bel"&gt;vertically scaling an application with a memory leak&lt;/a&gt; goes. ☠️&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>cdk</category>
      <category>scaling</category>
    </item>
    <item>
      <title>Practical ECS scaling: an introduction</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Sun, 17 Dec 2023 15:35:35 +0000</pubDate>
      <link>https://forem.com/aws-builders/practical-ecs-scaling-an-introduction-4f26</link>
      <guid>https://forem.com/aws-builders/practical-ecs-scaling-an-introduction-4f26</guid>
      <description>&lt;p&gt;How effective is it to vertically scale an application that has a memory leak? Can a CPU-heavy application perform better with more CPU resources? Should you horizontally scale your application based on response times?&lt;/p&gt;

&lt;p&gt;To show the answers to these questions in practice, I built and load tested a mock application in order to achieve results that are the same or very similar to those shown in Nathan Peck's great article on &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/" rel="noopener noreferrer"&gt;Amazon ECS Scalability Best Practices&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet the app
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falorl6q020694a03cj3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falorl6q020694a03cj3w.png" alt="Application architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The application itself is built in Flask and its infrastructure is managed with AWS CDK for Python. The app has several REST API endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/cpu_intensive&lt;/code&gt;, simulating a CPU-intensive task. Calculates the square root of &lt;code&gt;64 * 64 * 64 * 64 * 64 * 64 ** 64&lt;/code&gt; on each request.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/memory_leak&lt;/code&gt;, simulating a &lt;a href="https://en.wikipedia.org/wiki/Memory_leak" rel="noopener noreferrer"&gt;memory leak&lt;/a&gt;. Leaks approx. 1MB of memory on each request.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/long_response_time&lt;/code&gt;, simulating increasingly longer responses from a busy downstream system (e.g. a database). Sleeps for 2ms for each request received since the app was started.&lt;/li&gt;
&lt;/ul&gt;
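&lt;p&gt;As an illustration, the &lt;code&gt;/long_response_time&lt;/code&gt; behaviour boils down to something like this (a sketch with assumed names, not the exact source; the real implementation lives in the repository linked below):&lt;/p&gt;

```python
import time

SLEEP_PER_REQUEST_S = 0.002  # 2ms
request_count = 0

def long_response_time():
    # each request sleeps 2ms longer than the previous one, mimicking an
    # increasingly busy downstream system such as a database
    global request_count
    request_count += 1
    delay = SLEEP_PER_REQUEST_S * request_count
    time.sleep(delay)
    return {"slept_for_seconds": delay}
```

&lt;p&gt;The 10th request sleeps 20ms, the 500th a full second - response times keep climbing no matter how healthy the container itself is.&lt;/p&gt;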

&lt;p&gt;All source code is available in &lt;a href="https://github.com/ivica-k/practical-ecs-scaling/tree/main" rel="noopener noreferrer"&gt;this repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The performance envelope
&lt;/h2&gt;

&lt;p&gt;Vertically scaling a container means giving the container more hardware resources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When you vertically scale an application the first step is to identify what resources the application container actually needs in order to function.&lt;/p&gt;

&lt;p&gt;There are different dimensions of resources that an application needs. For example: CPU, memory, storage, and network bandwidth. Some machine learning workloads may actually require GPU as well. &lt;a href="https://containersonaws.com/presentations/amazon-ecs-scaling-best-practices/#vertical-scaling" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These resources form the &lt;em&gt;performance envelope&lt;/em&gt;: a set of hardware constraints for the container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb0u4vhcqkgmgre4irst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftb0u4vhcqkgmgre4irst.png" alt="Performance envelope"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;The first article in the series deals with an application that exceeds the "CPU" dimension of the performance envelope.&lt;/p&gt;

&lt;p&gt;Have a look at how &lt;a href="https://dev.to/ivicak/practical-ecs-scaling-vertically-scaling-a-cpu-heavy-application-105c"&gt;changing the performance envelope for a CPU-heavy application&lt;/a&gt; affects its performance.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>containers</category>
      <category>cdk</category>
      <category>scaling</category>
    </item>
    <item>
      <title>Manage Airflow connections with Terraform and AWS SecretsManager</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Tue, 15 Aug 2023 17:57:42 +0000</pubDate>
      <link>https://forem.com/aws-builders/manage-airflow-connections-with-terraform-4hof</link>
      <guid>https://forem.com/aws-builders/manage-airflow-connections-with-terraform-4hof</guid>
      <description>&lt;p&gt;Managing infrastructure as code brings speed, consistency and it makes the software development lifecycle more efficient and predictable. Infrastructure for your ETL/orchestration tool is managed with code - why not manage the secrets that your tool uses with code as well?&lt;/p&gt;

&lt;p&gt;This article shows a proof-of-concept implementation of how to manage Airflow secrets through Terraform and keep them committed to a code repository.&lt;/p&gt;

&lt;p&gt;Caveats/assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users (developers) on the AWS account don't have permissions to retrieve secret values from SecretsManager
&lt;/li&gt;
&lt;li&gt;Terraform is using a remote state with &lt;a href="https://developer.hashicorp.com/terraform/language/state/sensitive-data" rel="noopener noreferrer"&gt;appropriate security measures&lt;/a&gt; in place&lt;/li&gt;
&lt;li&gt;IAM role used by Terraform has relevant permissions to manage a wide range of AWS services and resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ivica-k/encrypted-airflow-secrets/tree/main" rel="noopener noreferrer"&gt;Repository with example code&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Architecture
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6ezwrv8xytvh205zhcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6ezwrv8xytvh205zhcz.png" alt="Solution architecture diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  How it works
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzolmel5bay3ns8b2wviq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzolmel5bay3ns8b2wviq.png" alt="How it works"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro to Airflow Connections
&lt;/h2&gt;

&lt;p&gt;Airflow can connect to various systems, such as databases, SFTP servers or S3 buckets. To connect, it needs credentials. Connections are an Airflow concept to store those credentials. They are a great way to configure access to an external system once and use it multiple times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/connections.html#" rel="noopener noreferrer"&gt;More on Connections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A Connection can be expressed as a string; for example, a connection to a MySQL database may look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

mysql://username:password@hostname:3306/database


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Airflow understands this format and can use it to connect to the database for which the connection was configured.&lt;/p&gt;
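&lt;p&gt;You can see which parts Airflow extracts from such a URI by parsing it with Python's standard library:&lt;/p&gt;

```python
from urllib.parse import urlparse

parts = urlparse("mysql://username:password@hostname:3306/database")

# conn type, login and password, then host, port and schema, in Airflow terms
print(parts.scheme, parts.username, parts.password)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```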

&lt;p&gt;Connections can be configured through environment variables, in an external secrets backend (our use case) and in the internal Airflow database.&lt;/p&gt;

&lt;p&gt;A centralized way of managing connections becomes a necessity as soon as your proof-of-concept goes live or you are working in a team.&lt;/p&gt;

&lt;h2&gt;
  
  
  External secrets backends
&lt;/h2&gt;

&lt;p&gt;Airflow supports multiple &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/secrets-backends.html" rel="noopener noreferrer"&gt;external secrets backends&lt;/a&gt;, such as AWS SecretsManager, Azure KeyVault and Hashicorp Vault.&lt;/p&gt;

&lt;p&gt;Connection details are read from these backends when a connection is used. This keeps the sensitive part of the connection, such as a password, secure and minimizes the attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS SecretsManager backend
&lt;/h3&gt;

&lt;p&gt;Configuring your Airflow deployment to use AWS SecretsManager is well explained on &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/_api/airflow/providers/amazon/aws/secrets/secrets_manager/index.html#airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend" rel="noopener noreferrer"&gt;this page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Creating AWS SecretsManager secrets with Terraform is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

resource "aws_secretsmanager_secret" "secret" {
  name = "my-precious"
}

resource "aws_secretsmanager_secret_version" "string" {
  secret_id     = aws_secretsmanager_secret.secret.id
  secret_string = "your secret here"
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;but committing this to a code repository is a cardinal sin! &lt;/p&gt;

&lt;p&gt;So how do you manage Airflow Connections in such a way that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the sensitive part of a connection stays hidden&lt;/li&gt;
&lt;li&gt;users can manage connections through code and commit them to a repository&lt;/li&gt;
&lt;li&gt;Airflow can use these connections when running DAGs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read on!&lt;/p&gt;

&lt;h2&gt;
  
  
  Encryption (is the) key
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Encryption is a way to conceal information by altering it so that it appears to be random data. &lt;a href="https://www.cloudflare.com/en-gb/learning/ssl/what-is-encryption/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AWS Key Management Service (KMS) allows us to create and manage encryption keys. These keys can be used to encrypt the contents of many AWS resources (buckets, disks, clusters...) but they can also be used to &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/kms/encrypt.html#examples" rel="noopener noreferrer"&gt;encrypt and decrypt&lt;/a&gt; user-provided strings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnimulmt7hp7vurqqkseb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnimulmt7hp7vurqqkseb.png" alt="User encrypts a string"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your users (developers) need a developer-friendly way of encrypting strings without having access to the KMS key. Developers love APIs. Keep your developers happy and give them an API.&lt;/p&gt;

&lt;p&gt;In this case, we have an API built with &lt;a href="https://docs.powertools.aws.dev/lambda/python/latest/" rel="noopener noreferrer"&gt;Powertools for AWS Lambda (Python)&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-urls.html" rel="noopener noreferrer"&gt;Lambda Function URLs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A custom &lt;a href="https://github.com/ivica-k/encrypted-airflow-secrets/blob/main/functions/kms_encrypt/lambda_function.py" rel="noopener noreferrer"&gt;Lambda function&lt;/a&gt; can be used to encrypt a given string, or to generate and encrypt a random one. This covers two use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an administrator of an external system created credentials for us, and we use them to create an Airflow connection&lt;/li&gt;
&lt;li&gt;we create credentials for a system we manage, and we use them to create an Airflow connection&lt;/li&gt;
&lt;/ul&gt;
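&lt;p&gt;A minimal sketch of such an encryption handler (the function and key names here are illustrative assumptions, not the exact code from the linked repository):&lt;/p&gt;

```python
import base64
import json
import secrets


def encrypt_with_kms(plaintext: str, kms_client, key_id: str) -> str:
    """Encrypt a string with KMS and return base64 text that is safe to commit."""
    response = kms_client.encrypt(KeyId=key_id, Plaintext=plaintext.encode())
    return base64.b64encode(response["CiphertextBlob"]).decode()


def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    # Either encrypt a caller-provided value, or generate and encrypt a random one.
    plaintext = body.get("encrypt_this") or secrets.token_urlsafe(32)

    import boto3  # imported lazily so the helper above can be tested offline

    encrypted = encrypt_with_kms(
        plaintext,
        boto3.client("kms"),
        key_id="alias/my-airflow-secrets",  # a hypothetical key alias
    )
    return {
        "encrypted_value": encrypted,
        "message": "Encrypted a user-provided value.",
    }
```

&lt;p&gt;Because the KMS client is passed in as an argument, the encryption helper can be exercised with a stub client; the handler only touches AWS when actually invoked.&lt;/p&gt;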

&lt;p&gt;Give a string to this Lambda&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

POST https://xxxxxx.lambda-url.eu-central-1.on.aws/encrypt
Accept: application/json

{
"encrypt_this": "mysql://username:password@hostname:3306/database"
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and it returns something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

AQICAHjTAGlNShkkcAYzHludk...IhvcNAQcGoIGOMIGLAgEAMIGFBg/AluidQ==


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Completely unreadable to you and me and safe to commit to a repository. (If you recognized it, yes, it is &lt;code&gt;base64&lt;/code&gt; encoded. &lt;a href="https://www.base64decode.org/" rel="noopener noreferrer"&gt;Try decoding it&lt;/a&gt;; even less readable!)&lt;/p&gt;
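&lt;p&gt;To be clear about why that is safe: &lt;code&gt;base64&lt;/code&gt; is only a transport encoding on top of the encryption. Decoding it recovers the ciphertext bytes, not the secret. A tiny sketch with made-up stand-in bytes (not real KMS output):&lt;/p&gt;

```python
import base64

# A made-up stand-in for KMS ciphertext: raw bytes with no readable structure.
ciphertext = bytes([0x01, 0x02, 0xDE, 0xAD, 0xBE, 0xEF])

# KMS output is base64-encoded so it can travel as text (JSON, Terraform, Git).
encoded = base64.b64encode(ciphertext).decode()
print(encoded)  # AQLerb7v

# Decoding the base64 only gets you back to the ciphertext, not to the secret.
assert base64.b64decode(encoded) == ciphertext
```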

&lt;h2&gt;
  
  
  Create secrets with Terraform
&lt;/h2&gt;

&lt;p&gt;That unreadable "sausage" from before can be used with Terraform, provided that Terraform has permission to decrypt with the KMS key that encrypted the original string.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

data "aws_kms_secrets" "secret" {
  secret {
    name    = "secret"
    payload = "AQICAHjTAGlNShkkcAYzHludk...IhvcNAQcGoIGOMIGLAgEAMIGFBg/AluidQ=="
  }
}

resource "aws_secretsmanager_secret" "secret" {
  name = var.name
}

resource "aws_secretsmanager_secret_version" "string" {
  secret_id     = aws_secretsmanager_secret.secret.id
  secret_string = data.aws_kms_secrets.secret.plaintext["secret"]
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code above will happily decrypt the encrypted string using the KMS key from your AWS account and store the decrypted value in SecretsManager.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It &lt;strong&gt;will store the secret&lt;/strong&gt; in the Terraform state - take the &lt;a href="https://developer.hashicorp.com/terraform/language/state/sensitive-data" rel="noopener noreferrer"&gt;necessary precautions&lt;/a&gt; to secure it. &lt;strong&gt;Anyone with access&lt;/strong&gt; to the KMS key that encrypted the string &lt;strong&gt;can decrypt it&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Keep your Terraform state secure and your KMS keys secure-er(?).&lt;/p&gt;

&lt;h2&gt;
  
  
  Using it in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Encrypt a MySQL connection string that Airflow will use:
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

http -b POST https://xxxxxx.lambda-url.eu-central-1.on.aws/encrypt encrypt_this='mysql://mysql_user:nb_6qaAI8CmkoI-FKxuK@hostname:3306/mysqldb'
{
    "encrypted_value": "AQICAHjTAGlNShkkcAYzHl8C2qXs7f...zaxroJDDw==",
    "message": "Encrypted a user-provided value."
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Use the &lt;code&gt;"encrypted_value"&lt;/code&gt; value with a Terraform module to create a secret
&lt;/h4&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

module "db_conn" {
  source = "./modules/airflow_secret"

  name             = "airflow/connections/db"
  encrypted_string = "AQICAHjTAGlNShkkcAYzHl8C2qXs7f...zaxroJDDw=="
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;after which you get a nice AWS SecretsManager secret.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67kdz6vvp7m83moj63sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67kdz6vvp7m83moj63sp.png" alt="AWS SecretsManager secret"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anyone with the &lt;code&gt;secretsmanager:GetSecretValue&lt;/code&gt; permission will be able to read the secret. Keep access to your AWS SecretsManager secrets secured.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Configure Airflow to use the AWS SecretsManager backend
&lt;/h2&gt;

&lt;p&gt;One of the great features of Airflow is the possibility to set (and override) configuration parameters &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html" rel="noopener noreferrer"&gt;through environment variables&lt;/a&gt;. We can leverage this to configure MWAA so that it uses a different secrets backend:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

resource "aws_mwaa_environment" "this" {
  airflow_configuration_options = {
    "secrets.backend"               = "airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend",
    "secrets.backend_kwargs"        = "{\"connections_prefix\": \"airflow/connections\",\"variables_prefix\": \"airflow/variables\"}"
  }
... rest of the config...
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the secret in SecretsManager and Airflow configured to use SecretsManager as the backend, we can finally reference the secret like any other Airflow connection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ivica-k/encrypted-airflow-secrets/blob/main/infra/dags/example_dag.py" rel="noopener noreferrer"&gt;Example DAG&lt;/a&gt; shows that fetching the secret works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27310rqq45buipptlxil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27310rqq45buipptlxil.png" alt="Airflow logs showing the secret value"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;With this proof-of-concept solution we were able to achieve the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the sensitive part of a connection stays hidden&lt;/li&gt;
&lt;li&gt;users can manage connections through code and commit them to a repository&lt;/li&gt;
&lt;li&gt;Airflow can use these connections when running DAGs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One obvious downside of having encrypted strings in Git is that you can't understand what actually changed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

diff --git a/infra/secrets.tf b/infra/secrets.tf
index fe53e5f..7e85d90 100644
--- a/infra/secrets.tf
+++ b/infra/secrets.tf
@@ -9,5 +9,5 @@ module "db_conn" {
   source = "./modules/airflow_secret"

   name             = "airflow/connections/db"
-  encrypted_string = "AQICAHjTAGlNShkkcAYzHl8C2qXs7fs5x9gByXim/PPuwt+TuwH8pYZHik8Cx0AZDM+ECML8AAAAnzCBnAYJ...XwF2a8zaxroJDDw=="
+  encrypted_string = "AQICAHjTAGlNShkkcAYzHl8C2qXs7fs5x9gByXim/PPuwt+TuwGhmhBNcePnQmhjrTgozm6rAAAAnTCmgYJK...nLU8TVWkLDUsSfDs="
 }


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This encrypted approach also doesn't work well for connections that have no secrets in them, for example AWS connections that use IAM roles:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

aws://?aws_iam_role=my_role_name&amp;amp;region_name=eu-west-1&amp;amp;aws_account_id=123456789123


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If you would like to improve the cost-effectiveness of your MWAA setup, give &lt;a href="https://medium.com/apache-airflow/setting-up-aws-secrets-backends-with-airflow-in-a-cost-effective-way-dac2d2c43f13" rel="noopener noreferrer"&gt;this article&lt;/a&gt; by Vince Beck a read.&lt;/p&gt;

&lt;p&gt;The nice things I said in &lt;a href="https://medium.com/aws-in-plain-english/my-thoughts-on-aws-managed-workflows-for-apache-airflow-10df7dd6fb4d" rel="noopener noreferrer"&gt;my thoughts on MWAA&lt;/a&gt; 18 months ago still hold true, and the pace of development on the &lt;a href="https://github.com/aws/aws-mwaa-local-runner" rel="noopener noreferrer"&gt;aws-mwaa-local-runner&lt;/a&gt; has picked up.&lt;/p&gt;

&lt;p&gt;Until next time, keep those ETLs ET-elling!&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>terraform</category>
      <category>infrastructureascode</category>
      <category>encryption</category>
    </item>
    <item>
      <title>How (not) to spend $15000 per month with AWS Athena</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Mon, 01 May 2023 15:07:29 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-not-to-spend-15000-per-month-with-aws-athena-1fn8</link>
      <guid>https://forem.com/aws-builders/how-not-to-spend-15000-per-month-with-aws-athena-1fn8</guid>
      <description>&lt;p&gt;In the 1980s, traffic in downtown Boston was nearly unbearable so city planners came up with a plan to reroute the highway tunnel below downtown Boston. The project was nicknamed "the Big Dig". Construction started in 1991 and when it finished in 2007, the final price tag was around 15 billion dollars, about twice the original cost that was expected. &lt;a href="https://www.youtube.com/watch?v=dOe_6vuaR_s" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a story about a software project that ran over budget due to organizational mishaps, no oversight and no awareness of cost.&lt;/p&gt;

&lt;p&gt;Spoiler alert: &lt;em&gt;No actual query analysis is shown in the article. I am not a SQL expert; more of an observer with a passion for writing and with 20/20 hindsight.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Before we dive deeper into the money bleeding monster, here's a bit of background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost of a project
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Total cost of ownership (TCO) is the total cost of owning software over its entire lifecycle, including the initial building price and ongoing charges, such as maintenance, human capital investments, resource allocation, and opportunity costs. &lt;a href="https://www.okteto.com/blog/total-cost-of-ownership-tco-of-building-versus-buying-software-for-development/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A software product can make financial sense if the value it provides to the organization is greater than the TCO over its lifespan.&lt;/p&gt;

&lt;p&gt;The cost of this particular software product, whose main part was the Athena query, was around $15000 per month in its operational phase (excluding the cost of making it). At one point, it was more expensive than ~40 RDS databases. Its value to the business is hard to determine because it provided data for dashboards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2cr9tbhzvvv0mnme3gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2cr9tbhzvvv0mnme3gw.png" alt="Cost breakdown per service for October 2022"&gt;&lt;/a&gt;&lt;br&gt;Cost breakdown per service for October 2022
  &lt;/p&gt;

&lt;h3&gt;
  
  
  The organization
&lt;/h3&gt;

&lt;p&gt;The organization and its people built this product during a difficult period, with COVID, financial insecurity and layoffs looming over everyone and everything. I'm sure they did their best under the circumstances and this article is in no way meant to criticize.&lt;/p&gt;

&lt;p&gt;Most software development in this organization is done with 2-pizza teams working in 2-week sprints to deliver peer-reviewed code that runs in the cloud configured by infrastructure-as-code. Decent operational monitoring and alerting exists and it works.&lt;/p&gt;

&lt;p&gt;Our money bleeding monster was an outlier; it was built by one developer. Read on to understand how serverless can be expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Cost awareness, or lack of it
&lt;/h2&gt;

&lt;p&gt;From the initial idea for this product, through its design and creation, all the way to delivery, no one sat down to calculate how much it would cost. Architecture, design and development were done by a single developer to fulfill a business need. But it is not that developer's fault that the TCO calculation was not done.&lt;/p&gt;

&lt;p&gt;Business stakeholders and product owners usually think about TCO and whether a software product makes financial sense. A developer could have calculated, with a fair amount of certainty, the operational costs of the product. AWS Athena pricing is dead simple: $5 per terabyte of data scanned. Fourth-grade math at best.&lt;/p&gt;
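&lt;p&gt;That fourth-grade math looks like this; the scan size and schedule below are made-up but realistic numbers, not figures from this project:&lt;/p&gt;

```python
# Athena pricing: $5 per terabyte of data scanned.
price_per_tb = 5

# Assume a query that scans 3 TB per run and is scheduled hourly.
tb_per_run = 3
runs_per_day = 24
days_per_month = 30

monthly_cost = tb_per_run * runs_per_day * days_per_month * price_per_tb
print(f"${monthly_cost} per month")  # $10800 per month
```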

&lt;p&gt;So what failed then? There was no one to ask the question: "How much will this query cost?"&lt;/p&gt;

&lt;h2&gt;
  
  
  2. No systematic cost reporting
&lt;/h2&gt;

&lt;p&gt;AWS infrastructure used by the organization is managed through code and it is uniformly tagged. Some of those tags include a project name, the environment where the project is running (DEV, PROD) and owner/cost center.&lt;/p&gt;

&lt;p&gt;It would be trivially easy to create a monthly cost report using AWS Cost Explorer or any similar tool. The report could break down cost per project, environment and cost center and it could be sent to owners for review.&lt;/p&gt;

&lt;p&gt;So what failed then? There is always something more pressing to work on. No one cares about these cost reports so the platform team never prioritized them.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Managed service misuse
&lt;/h2&gt;

&lt;p&gt;AWS Athena is a serverless analytics service with the capability to query structured data from AWS S3 and other data sources using SQL syntax.&lt;/p&gt;

&lt;p&gt;In our case, data was uploaded to an AWS S3 bucket from Kafka. Kafka is the backbone for all microservices and a large chunk of business data flows through it. All that data ends up in an S3 bucket and is partitioned to support &lt;code&gt;WHERE&lt;/code&gt; clauses in queries. A &lt;code&gt;WHERE&lt;/code&gt; clause tells Athena to scan only the data partitions that match it. The partitioning schema follows a &lt;code&gt;YEAR/MONTH/DAY&lt;/code&gt; pattern, so an example query can look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

SELECT style_id FROM schema.SALES
WHERE month = 11 AND day = 20 AND color = 'blue'


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This would return the style IDs of all blue clothing items that were sold on the 20th of November. So far so good.&lt;/p&gt;

&lt;p&gt;The real query did not use any &lt;code&gt;WHERE&lt;/code&gt; clause. That might be acceptable in the DEV environment (personally, I'd at least add &lt;code&gt;LIMIT 10&lt;/code&gt;), but not in the PROD environment, especially when the volume of data will only ever go up. The ever-growing amount of data in PROD meant that every time the query ran:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it scanned more and more data&lt;/li&gt;
&lt;li&gt;it scanned all the data; even data it does not need&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr03n5gtxuzuddd5mtx9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr03n5gtxuzuddd5mtx9z.png" alt="Cost report for AWS Athena from June to November 2022"&gt;&lt;/a&gt;&lt;br&gt;Cost report for AWS Athena from June to November 2022
  &lt;/p&gt;

&lt;p&gt;Athena is truly serverless and it will happily scan, scale and charge you for what it scanned. "Scan everything? On it boss!". "Scan some more? Don't mind if I do".&lt;/p&gt;

&lt;p&gt;Using a &lt;code&gt;WHERE&lt;/code&gt; clause can drastically reduce the amount of data scanned which directly correlates to incurred costs. It will also shorten the query execution time.&lt;/p&gt;

&lt;p&gt;So what failed then? There was no one to ask the question: "Can this query be improved?".&lt;/p&gt;

&lt;h2&gt;
  
  
  4. One person "team"
&lt;/h2&gt;

&lt;p&gt;Architecture, design and development on this product were done by a single developer. He had no help, no one to discuss ideas with, no one to review his code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpoa0tagjwghvdhtfenk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpoa0tagjwghvdhtfenk.png" alt="Commits made and merged by the same person"&gt;&lt;/a&gt;&lt;br&gt;Commits made and merged by the same person
  &lt;/p&gt;

&lt;p&gt;Every developer has the responsibility to write good software. Every human being has the right to make mistakes.&lt;/p&gt;

&lt;p&gt;So what failed then? The organization should not have allowed this. The developer was not set up for success from day one. One person is not a team.&lt;/p&gt;




&lt;h2&gt;
  
  
  Learnings
&lt;/h2&gt;

&lt;p&gt;I hope it goes without saying but: do the opposite of what the points above illustrate. Also, here are some learnings we acquired over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do the math
&lt;/h3&gt;

&lt;p&gt;The operations phase is an important phase of every product or service: that is when it generates value for the organization. This phase is hopefully also the longest. Because of these two facts, understanding the operational costs of a product is very important. Calculate them early.&lt;/p&gt;

&lt;h3&gt;
  
  
  Own it
&lt;/h3&gt;

&lt;p&gt;And I really mean OWN IT. All of it. A product's lifecycle is only in its infancy when you ship the code - it doesn't end there. All the logs, metrics, alerts, bills etc. that the product creates must be owned by someone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inform and be informed
&lt;/h3&gt;

&lt;p&gt;If I tell you that on average, an ice cream costs $0.40 where I live, that's data. But if I add that the ice cream truck passes my house every day, that's information (and temptation!) that you can use to buy cheap ice cream every day.&lt;/p&gt;

&lt;p&gt;1TB of data scanned with Athena costs $5 and that's a fact. Knowing that your query will scan close to 100TB each time it runs is valuable information. Be informed and inform business stakeholders too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Friends or foes
&lt;/h3&gt;

&lt;p&gt;Managed services can be great friends. With all the heavy lifting they do and a tendency to &lt;a href="https://aws.amazon.com/blogs/aws/category/price-reduction/" rel="noopener noreferrer"&gt;reduce prices over time&lt;/a&gt;, one would be hard-pressed to not use them.&lt;/p&gt;

&lt;p&gt;You can make sure they stay your friends by owning your product and creating appropriate billing alerts. Even hardcore serverless teams experience runaway costs - but they catch them early!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1635244161778737152-395" src="https://platform.twitter.com/embed/Tweet.html?id=1635244161778737152"&gt;
&lt;/iframe&gt;




 &lt;/p&gt;




&lt;p&gt;In case you're wondering, the query was improved and it now costs a fraction of what it cost before. And if you're thinking that the money that was wasted on a poor query  could have been used to hire more people, I agree with you. We humans learn and evolve our whole lives, but so do organizations.&lt;/p&gt;

&lt;p&gt;Learning from our own mistakes is the most difficult but those lessons tend to stick the longest. I can only hope that this organization has enough knowledge management capacity to prevent scenarios like this one in the future. Otherwise, the myth that serverless is expensive will live on. High cost comes from sloppy code, bad development practices and processes in the organization that allow those to happen.&lt;/p&gt;

</description>
      <category>awsathena</category>
      <category>query</category>
      <category>costoptimization</category>
      <category>finops</category>
    </item>
    <item>
      <title>How working with AWS open-source tools made me a better developer</title>
      <dc:creator>Ivica Kolenkaš</dc:creator>
      <pubDate>Mon, 10 Apr 2023 14:14:48 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-working-with-aws-open-source-tools-made-me-a-better-developer-343c</link>
      <guid>https://forem.com/aws-builders/how-working-with-aws-open-source-tools-made-me-a-better-developer-343c</guid>
      <description>&lt;h2&gt;
  
  
  In the beginning
&lt;/h2&gt;

&lt;p&gt;In the beginning God created the heaven and the earth. The earth began to cool, the autotrophs began to drool, Neanderthals developed tools and to-make-a-long-story-short I started learning Python. Much like our early ancestors &lt;a href="https://www.goodreads.com/book/show/11148989-catching-fire"&gt;benefited from using fire to prepare food&lt;/a&gt;, I benefited from using (and sporadically contributing to) AWS open-source tools.&lt;/p&gt;

&lt;p&gt;This is the story of how I went from a self-taught Python developer with a profound dislike of type hints to tolerating and even using them. All it took to convince me was a multi-billion dollar company with tens of thousands of developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  I am not stubborn and I'm not a duck
&lt;/h2&gt;

&lt;p&gt;Contrary to what the intro paragraph led you to believe, I am not a stubborn person. I'd say I'm opinionated, and my opinion about using data types was formed in my teen years. I was a rebel without a pause. It was the time of Python 2.6, which of course has multiple data types, but they are dynamic. This means that you don't have to declare a type for a variable -- the type is inferred by the interpreter at runtime.&lt;/p&gt;

&lt;p&gt;You could just re-declare a variable with a different type (&lt;code&gt;int&lt;/code&gt; to a &lt;code&gt;float&lt;/code&gt;) and that was fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12000&lt;/span&gt;
&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;12000.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was amazed. It was &lt;em&gt;so&lt;/em&gt; easy. No &lt;code&gt;static final void warranty&amp;lt;List&amp;gt;&lt;/code&gt; that college professors tried to teach me. You just sit down (sitting is optional), type Python and watch magic happen in front of you. An occasional &lt;code&gt;TypeError&lt;/code&gt; wasn't gonna stop me. Trying to iterate over an integer? Been there, done that. Neither of those convinced me to spend some time and learn about type hints in Python.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1638201600995934210-342" src="https://platform.twitter.com/embed/Tweet.html?id=1638201600995934210"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;p&gt;I won't blame you if by reading this you assume I'm a bad developer. I never said I was good :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't believe the hype
&lt;/h2&gt;

&lt;p&gt;Python and I grew older together, and with Python version 3.5 came &lt;a href="https://peps.python.org/pep-0484/"&gt;type hints&lt;/a&gt;. As their name suggests, they are just hints; Python is still a dynamically typed language. They are not evaluated and they don't affect your application at runtime.&lt;/p&gt;
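&lt;p&gt;A two-line demonstration of that: the annotation below is plainly violated, yet Python runs the code without complaint.&lt;/p&gt;

```python
def double(x: int) -> int:
    return x * 2

# The hint promises an int, but Python never checks it at runtime:
# passing a string "works" and silently repeats it instead.
result = double("ab")
print(result)  # abab
```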

&lt;p&gt;Those hints do help though; modern IDEs understand them and suggest better auto-complete results. They also warn when incorrect types are used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was aware of these Python advancements but I wasn't aboard the hype train. Who is this Guido person and why is he bringing the noise? Python's syntax was so clean before type hints. My colleagues, who were wiser than me, did get on the hype train and soon I was looking at code that resembled this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salaries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AverageSalary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# perform calculations
&lt;/span&gt;
&lt;span class="n"&gt;salaries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
&lt;span class="n"&gt;avg_salary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AverageSalary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salaries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It was &lt;em&gt;awful&lt;/em&gt;. It was noisy, it was unreadable, it was sacrilegious.&lt;/p&gt;

&lt;h2&gt;
  
  
  The big melt
&lt;/h2&gt;

&lt;p&gt;Some time ago I was introduced to &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/work-with-cdk-python.html"&gt;AWS CDK for Python&lt;/a&gt; and it was a breath of fresh air after years of Terraform (I still &amp;lt;3 Terraform, don't get me wrong). Then I started noticing that CDK code is strongly typed in all its language variants, and for very good reasons. I liked CDK but did I like it enough?&lt;/p&gt;

&lt;p&gt;I gulped, bit the bullet and added types where I had to. I also had a brief adventure with CDK for Typescript and...and... I started &lt;strong&gt;liking&lt;/strong&gt; strict typing. It made the code strict, and well defined, and you knew what to expect as a return value. It all made &lt;strong&gt;sense&lt;/strong&gt;! Not to mention how it &lt;a href="https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c"&gt;improves collaboration, makes large code-bases more maintainable and prevents silly bugs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Peasants put down their pitchforks because the &lt;a href="https://artpassions.net/wilde/selfish_giant.html"&gt;giant's heart has melted&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  From "I hate this" to...
&lt;/h2&gt;

&lt;p&gt;I truly disliked type hints. Working with CDK started to change that for the better, and then I discovered &lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/"&gt;AWS Lambda Powertools&lt;/a&gt;. It combines multiple things I am passionate about, so I became passionate about using the library and improving it. Lambda Powertools uses static typing for several of its main features (&lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/latest/utilities/parser/"&gt;Parser&lt;/a&gt;, &lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/latest/utilities/data_classes/"&gt;Event Sources&lt;/a&gt; and &lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/latest/utilities/typing/"&gt;Typing&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;I was keen to contribute to this open-source project, which meant I had to jump into static typing and contribute code that would blend in and be accepted. Using static types felt natural and made a lot of sense, especially in a codebase that was unknown to me. Instead of being noise, those type hints were real &lt;strong&gt;hints&lt;/strong&gt; that helped me understand a codebase I had never seen before.&lt;/p&gt;
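&lt;p&gt;A toy example of my own (not Powertools code) shows what those hints buy you when reading unfamiliar code: the signature alone documents what goes in and what comes out.&lt;/p&gt;

```python
from typing import Dict, List

def group_by_city(records: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group employee records by their "city" field.

    Without the annotations you would have to read the body (or the docs)
    to learn what `records` holds and what shape the result has.
    """
    grouped: Dict[str, List[Dict[str, str]]] = {}
    for record in records:
        grouped.setdefault(record["city"], []).append(record)
    return grouped

people = [
    {"name": "Ana", "city": "Zagreb"},
    {"name": "Ben", "city": "Berlin"},
    {"name": "Ceca", "city": "Zagreb"},
]
print(sorted(group_by_city(people)))  # ['Berlin', 'Zagreb']
```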

&lt;p&gt;My stomach now stays perfectly calm while I'm reading or writing code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;log_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
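&lt;p&gt;That annotation says the level may arrive as an int (for example &lt;code&gt;logging.DEBUG&lt;/code&gt;), as a string (&lt;code&gt;"DEBUG"&lt;/code&gt;), or not at all. Here is a small sketch of my own (not taken from any particular library) of how such a parameter is typically consumed:&lt;/p&gt;

```python
import logging
from typing import Optional, Union

def resolve_level(log_level: Optional[Union[int, str]] = None) -> int:
    """Accept a level as an int (10), a name ("DEBUG"), or nothing at all."""
    if log_level is None:
        return logging.WARNING  # fall back to a default
    if isinstance(log_level, str):
        return logging.getLevelName(log_level.upper())  # "debug" -> 10
    return log_level

print(resolve_level())         # 30
print(resolve_level("debug"))  # 10
print(resolve_level(40))       # 40
```

&lt;p&gt;On Python 3.10+ the same annotation can also be written as &lt;code&gt;int | str | None&lt;/code&gt; (PEP 604).&lt;/p&gt;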



&lt;p&gt;Using CDK and contributing to Lambda Powertools made me do a full 180. I went from "I hate this..." to "type hints are great!". Using them made me understand why type hints and static typing are necessary and helpful. I even use them sporadically in my &lt;a href="https://github.com/ivica-k"&gt;pet projects&lt;/a&gt; while trying to become a better developer.&lt;/p&gt;




&lt;p&gt;Our ancestors cooked food, which allowed their digestive tracts to become smaller, leaving more energy for brain growth. Something similar happened to me: using type hints allowed me to understand unfamiliar codebases and improve my coding skills. My humble contributions are a way of giving back to the open-source community and thanking the contributors of &lt;a href="https://awslabs.github.io/aws-lambda-powertools-python/"&gt;Lambda Powertools for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A good way to end this article is with another Public Enemy reference: it took a &lt;a href="https://en.wikipedia.org/wiki/It_Takes_a_Nation_of_Millions_to_Hold_Us_Back"&gt;multi-billion-dollar company with tens of thousands of developers&lt;/a&gt; just a few months to convert this not-so-stubborn dynamic typer into a type hints aficionado.&lt;/p&gt;

&lt;p&gt;No statically typed languages were hurt during the writing of this article.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>programming</category>
      <category>python</category>
      <category>cdk</category>
    </item>
  </channel>
</rss>
