<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ervin Szilagyi</title>
    <description>The latest articles on Forem by Ervin Szilagyi (@ervin_szilagyi).</description>
    <link>https://forem.com/ervin_szilagyi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F485678%2Fbaece0e0-a37e-4abe-9d77-66728db1fe95.jpeg</url>
      <title>Forem: Ervin Szilagyi</title>
      <link>https://forem.com/ervin_szilagyi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ervin_szilagyi"/>
    <language>en</language>
    <item>
      <title>my latest blog post</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Mon, 28 Jul 2025 06:15:35 +0000</pubDate>
      <link>https://forem.com/ervin_szilagyi/my-latest-blog-post-4j7a</link>
      <guid>https://forem.com/ervin_szilagyi/my-latest-blog-post-4j7a</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/aws-builders/lessons-learned-from-building-an-mcp-client-30if" class="crayons-story__hidden-navigation-link"&gt;Lessons Learned from Building an MCP Client&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;
          &lt;a class="crayons-logo crayons-logo--l" href="/aws-builders"&gt;
            &lt;img alt="AWS Community Builders  logo" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2794%2F88da75b6-aadd-4ea1-8083-ae2dfca8be94.png" class="crayons-logo__image"&gt;
          &lt;/a&gt;

          &lt;a href="/ervin_szilagyi" class="crayons-avatar  crayons-avatar--s absolute -right-2 -bottom-2 border-solid border-2 border-base-inverted  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F485678%2Fbaece0e0-a37e-4abe-9d77-66728db1fe95.jpeg" alt="ervin_szilagyi profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/ervin_szilagyi" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Ervin Szilagyi
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Ervin Szilagyi
                
              
              &lt;div id="story-author-preview-content-2723244" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/ervin_szilagyi" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F485678%2Fbaece0e0-a37e-4abe-9d77-66728db1fe95.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Ervin Szilagyi&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

            &lt;span&gt;
              &lt;span class="crayons-story__tertiary fw-normal"&gt; for &lt;/span&gt;&lt;a href="/aws-builders" class="crayons-story__secondary fw-medium"&gt;AWS Community Builders &lt;/a&gt;
            &lt;/span&gt;
          &lt;/div&gt;
          &lt;a href="https://dev.to/aws-builders/lessons-learned-from-building-an-mcp-client-30if" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jul 25 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/aws-builders/lessons-learned-from-building-an-mcp-client-30if" id="article-link-2723244"&gt;
          Lessons Learned from Building an MCP Client
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mcp"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mcp&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/java"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;java&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/langchain"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;langchain&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/aws-builders/lessons-learned-from-building-an-mcp-client-30if" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;9&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/aws-builders/lessons-learned-from-building-an-mcp-client-30if#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            15 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>mcp</category>
      <category>java</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Lessons Learned from Building an MCP Client</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Fri, 25 Jul 2025 20:05:43 +0000</pubDate>
      <link>https://forem.com/aws-builders/lessons-learned-from-building-an-mcp-client-30if</link>
      <guid>https://forem.com/aws-builders/lessons-learned-from-building-an-mcp-client-30if</guid>
      <description>&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;It is &lt;del&gt;May&lt;/del&gt; July 2025. By now, everyone and their mother has heard about AI agents, MCP, and all these fancy words being thrown around. According to some, MCP will change the world, while others consider it a marginal improvement over existing solutions. Having a skeptical mind, I decided to fiddle around and figure out what all the fuss is about with this shiny new technology.&lt;/p&gt;

&lt;h2&gt;What is MCP exactly?&lt;/h2&gt;

&lt;p&gt;MCP (Model Context Protocol) is an open protocol that standardizes how applications provide context to LLMs&lt;sup id="fnref1"&gt;1&lt;/sup&gt;. It is a set of rules and concepts defining how additional information can be provided to an LLM to achieve better results from prompts, and how LLMs can be augmented with tools, such as the ability to search the web, use a calculator, or control a computer. The protocol defines further concepts besides tool calling and resources, and we will likely see more additions in the future. Currently, tool calling is by far the most widely implemented part of the specification.&lt;/p&gt;

&lt;p&gt;The protocol defines the notions of MCP Servers and MCP Clients. While these resemble the classical client-server architecture, there are certain important differences, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The majority of MCP Server implementations run locally on the same machine as the MCP Client. Although the protocol provides the option for a remote server, it currently feels somewhat like an afterthought. I am confident this will be improved in the future.&lt;/li&gt;
&lt;li&gt;Many MCP servers are implemented to serve one user at a time. This is somewhat unheard of in the traditional server landscape. However, since most of these servers are intended to be local-first, this is to be expected.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The concepts brought together by MCP are not entirely new. Tool/function calling, for instance, was introduced by OpenAI back in 2023, after which others, such as Anthropic, followed suit with their own support and APIs. The aim of the MCP protocol is to standardize these efforts and provide a common way of implementing agentic workflows.&lt;/p&gt;

&lt;h2&gt;My Project of Choice&lt;/h2&gt;

&lt;p&gt;Back to my project: I decided to build an MCP client. My idea was to create something similar to &lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;Goose CLI&lt;/a&gt;. The initial goal was to develop a CLI application that can be executed from a terminal. I aimed to support multiple large language models and enable the configuration of one or more MCP servers.&lt;/p&gt;

&lt;p&gt;My programming language of choice for this project was Java. While most online tutorials focus on JavaScript or Python, I wanted to approach this task differently. Java might not be the trendiest option, and it is certainly not always the most recommended choice for a CLI application, but it is not an inherently worse one. Several libraries, such as &lt;a href="https://picocli.info/" rel="noopener noreferrer"&gt;picocli&lt;/a&gt; and &lt;a href="https://jline.org/docs/intro" rel="noopener noreferrer"&gt;jline3&lt;/a&gt;, are aimed at helping with the development of text-based CLI applications. Additionally, &lt;a href="https://docs.langchain4j.dev/get-started" rel="noopener noreferrer"&gt;LangChain4J&lt;/a&gt; provides an implementation for many MCP features, although it is still in an experimental phase.&lt;/p&gt;
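
&lt;p&gt;To give a feel for the CLI plumbing, a minimal picocli entry point looks something like the sketch below. This is illustrative only; the command and option names are made up, not taken from my client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import picocli.CommandLine;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;

// Hypothetical entry point: picocli parses the arguments and help flags,
// then hands control to the interactive loop.
@Command(name = "mcp-cli", mixinStandardHelpOptions = true,
        description = "Chat with an LLM augmented by MCP tools.")
public class McpCli implements Runnable {

    @Option(names = {"-c", "--config"}, description = "Path to the MCP server configuration file.")
    private String configPath;

    @Override
    public void run() {
        // Load the config, start the MCP servers, then enter the main loop (shown later).
    }

    public static void main(String[] args) {
        System.exit(new CommandLine(new McpCli()).execute(args));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;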

&lt;p&gt;Ultimately, my goal was to get a better understanding of the current state of working with LLMs and MCP using Java. I was not looking to reinvent the wheel.&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;Here are a few lessons I learned during development. Some may be obvious to anyone who has spent more than a few minutes with an LLM, while others might be new.&lt;/p&gt;

&lt;h3&gt;1. It is really easy to build the main loop of the application&lt;/h3&gt;

&lt;p&gt;With the help of LangChain4J, it can be implemented in a few lines of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Read something from the user &lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readLine&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;gt;&amp;gt; "&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MaskingCallback&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isBlank&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Some boilerplate code to handle commands such as /exit from the user&lt;/span&gt;
        &lt;span class="nc"&gt;Optional&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;SessionCommand&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;toCommand&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toLowerCase&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isPresent&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;orElseThrow&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;SessionCommand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;EXIT&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
                &lt;span class="c1"&gt;// .. other commands&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// This is the gist of our app. This makes an API call to the LLM, it automatically handles MCP tools calls and sends back the responses to the LLM.&lt;/span&gt;
        &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llmClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

        &lt;span class="c1"&gt;// Print the final answer to the user&lt;/span&gt;
        &lt;span class="n"&gt;stylizedPrinter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printMarkDown&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RateLimitException&lt;/span&gt; &lt;span class="n"&gt;rateLimitException&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since it is Java, it is not as concise as it could be, but nevertheless, the main interaction with the LLM can be boiled down to this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llmClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The &lt;code&gt;llmClient&lt;/code&gt; itself is an interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;LlmClient&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@SystemMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"""
                    You are a general-purpose AI agent.

                    The current date is: {{currentDateTime}}.

                    ...
                    """&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@UserMessage&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nd"&gt;@V&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"currentDateTime"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;currentDateTime&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We don't have to provide an implementation for this interface; LangChain4J generates one under the hood using reflection. We just have to register the interface as an AI service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;AiServices&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LlmClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatModel&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toolProvider&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;toolProvider&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;chatMemory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatMemory&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I personally don't like this pattern. It makes our lives harder whenever we need to debug something; it can be challenging to place a breakpoint inside code that exists only at runtime.&lt;/p&gt;
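
&lt;p&gt;For completeness, the &lt;code&gt;toolProvider&lt;/code&gt; passed to the builder above is what bridges the AiServices proxy to the MCP servers. A minimal sketch of how it can be constructed, assuming LangChain4J's experimental &lt;code&gt;langchain4j-mcp&lt;/code&gt; module and a stdio-based server (the launch command is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import dev.langchain4j.mcp.McpToolProvider;
import dev.langchain4j.mcp.client.DefaultMcpClient;
import dev.langchain4j.mcp.client.McpClient;
import dev.langchain4j.mcp.client.transport.McpTransport;
import dev.langchain4j.mcp.client.transport.stdio.StdioMcpTransport;
import dev.langchain4j.service.tool.ToolProvider;

import java.util.List;

// Launch the MCP server as a local subprocess and talk to it over stdio.
McpTransport transport = new StdioMcpTransport.Builder()
        .command(List.of("docker", "run", "-i", "--rm", "example/mcp-server")) // placeholder command
        .logEvents(true)
        .build();

McpClient mcpClient = new DefaultMcpClient.Builder()
        .transport(transport)
        .build();

// The tool provider exposes the server's tools to the AiServices proxy.
ToolProvider toolProvider = McpToolProvider.builder()
        .mcpClients(List.of(mcpClient))
        .build();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;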

&lt;h3&gt;2. Use a capable model for better outcomes&lt;/h3&gt;

&lt;p&gt;I'm aware that this point is obvious to any reader. However, in the case of MCP, it really is important to have a capable, powerful model.&lt;/p&gt;

&lt;p&gt;First of all, the model must support tool calling (function calling). Nowadays, most LLMs do, so this is not something we usually have to worry about.&lt;/p&gt;

&lt;p&gt;During development, I had the chance to test the tool calling capabilities of a few models. Starting with the most well-known LLM providers: at the time of writing, the latest models from Anthropic (Claude Sonnet 4, Claude 3.7, and even Claude 3.5) are, in my opinion, perfectly fine for "any" MCP tool call. I'm using quotes because there can obviously be counter-examples, but in my experience these models handle tool calling well. For most of my development, I used Claude Sonnet 3.5: it is faster than 3.7 most of the time, and it is good enough to focus on the things that matter when providing responses.&lt;/p&gt;

&lt;p&gt;That being said, I had a bad experience with Haiku 3.5. As a concrete example, I built a small MCP server for querying AWS cost usage over a specific time period. Whenever I asked something like "What are my costs for the last month?", Haiku usually forgot what last month was, even though this information was explicitly provided in the system prompt (&lt;code&gt;The current date is: {{currentDateTime}}.&lt;/code&gt;). Sonnet 3.5, on the other hand, had no trouble remembering this piece of information, and it queried the correct period every single time.&lt;/p&gt;

&lt;p&gt;Moving on to OpenAI models: the latest O1 model, GPT-4.1, and even GPT-4.5 also handle tool calling well. In my experience, O1 specifically gives more natural answers than any Anthropic model: Claude likes to be as chatty as possible and often pads its answers with fluff, while O1 usually sticks to the topic. With GPT-4.1, I noticed something interesting: in some cases, it simply avoids calling a tool and answers from its own knowledge base. This can be a good thing (it uses fewer tokens), but it can also be a bad thing, since it can come up with some nonsense.&lt;/p&gt;

&lt;p&gt;From Google, I tried the Gemini 2.5 Pro preview model. This model is also perfectly fine for an MCP client, especially for tool usage.&lt;/p&gt;

&lt;p&gt;I also tried the Amazon Nova models. I had a really bad experience with Amazon Titan models&lt;sup id="fnref2"&gt;2&lt;/sup&gt; in the past, and unfortunately, I was still struggling with the latest Nova models. For tool usage, everything other than Nova Pro is borderline usable at best. Nova Pro itself also managed to hallucinate tools on the fly; this happened only a few times, and in most cases it is able to navigate between the tools at hand. Nevertheless, I do not like the answers I get from it: more often than not, something feels off. Even Haiku manages to feel more "natural," although Haiku can simply forget things, as I've mentioned before.&lt;/p&gt;

&lt;p&gt;What is important is that there are many options. I prefer the Sonnet models and did most of my development with Sonnet 3.5. If I had to recommend one, it would probably be Gemini 2.5 Pro: at the time of writing, it is the cheapest option I checked, and it is powerful enough for anything we might need.&lt;/p&gt;

&lt;h3&gt;3. Dealing with all kinds of limits&lt;/h3&gt;

&lt;p&gt;Here is an exception I received many times while developing my MCP client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error: Rate limited by the LLM provider: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"error"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"Request too large for gpt-4o in organization org-gqgox1u3NjYa2JiUAHl3HrMn on tokens per min (TPM): Limit 30000, Requested 87968. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more."&lt;/span&gt;,
        &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"tokens"&lt;/span&gt;,
        &lt;span class="s2"&gt;"param"&lt;/span&gt;: null,
        &lt;span class="s2"&gt;"code"&lt;/span&gt;: &lt;span class="s2"&gt;"rate_limit_exceeded"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is another one from Claude:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Error: &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"error"&lt;/span&gt;,&lt;span class="s2"&gt;"error"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"rate_limit_error"&lt;/span&gt;,&lt;span class="s2"&gt;"message"&lt;/span&gt;:&lt;span class="s2"&gt;"This request would exceed the rate limit for your organization (874899f0-c2de-4906-8802-cc478416bee6) of 20,000 input tokens per minute. For details, refer to: https://docs.anthropic.com/en/api/rate-limits. You can see the response headers for current usage. Please reduce the prompt length or the maximum tokens requested, or try again later. You may also contact sales at https://www.anthropic.com/contact-sales to discuss your options for a rate limit increase."&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem we may face while working with LLMs, especially when doing tool calls, is that we cannot really control either the size or the rate of our messages. What do I mean by this?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Depending on the MCP servers we use, we might have just a few available tools that return short responses. On the other hand, we might be working with a server that exposes many tools, some of which can generate lengthy responses (looking at you, GitHub MCP). All of these things—tool definitions, requests for tool usage, and responses from tools—become part of the chat context, which ultimately translates to tokens.&lt;/li&gt;
&lt;li&gt;Depending on what the model decides, after a response from an MCP tool call, it can request further tool calls instead of providing a final answer to the initial user query. This will result in more back-and-forth communication between our client and the LLM model, racking up even more token usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The challenge is that you, as the MCP client developer, cannot really intervene in either of those cases, especially if you are developing a general-purpose MCP client meant to be used with any kind of MCP server.&lt;/p&gt;

&lt;p&gt;Possible solutions I could think of for dealing with this challenge are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try to shorten or compress large MCP server responses. If your client knows what to expect from a tool call and which parts of the response are irrelevant, it can extract only the meaningful part.&lt;/li&gt;
&lt;li&gt;Use another model (probably from another vendor) to summarize the MCP response. This can introduce a bunch of other issues: information loss, even more token usage (more $), and a longer wait for the final answer to the initial query.&lt;/li&gt;
&lt;li&gt;Apply some rate limiting on the client side. Aside from the fact that this could make the user experience way worse, it can be challenging to implement if we want to support several models from different providers: limits differ per provider, each provider uses a different tokenizer, limits can and will change over time, etc. (A retry-based sketch of this idea follows after this list.)&lt;/li&gt;
&lt;/ul&gt;
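
&lt;p&gt;On the last point: the simplest client-side mitigation I can think of is retrying with an exponential backoff once the provider signals a rate limit. A minimal sketch, reusing the &lt;code&gt;RateLimitException&lt;/code&gt; and &lt;code&gt;llmClient&lt;/code&gt; from the main loop shown earlier (the retry parameters are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;// Retry the chat call with exponentially increasing delays between attempts.
Result&amp;lt;String&amp;gt; chatWithRetry(String line) throws InterruptedException {
    long delayMs = 1_000;
    int maxAttempts = 5;
    for (int attempt = 1; attempt &amp;lt;= maxAttempts; attempt++) {
        try {
            return llmClient.chat(line, LocalDate.now().toString());
        } catch (RateLimitException rateLimitException) {
            if (attempt == maxAttempts) {
                throw rateLimitException; // give up and let the main loop report it
            }
            Thread.sleep(delayMs);
            delayMs *= 2; // wait 1s, 2s, 4s, 8s between attempts
        }
    }
    throw new IllegalStateException("unreachable");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;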

&lt;h3&gt;4. Let's Talk about $&lt;/h3&gt;

&lt;p&gt;The limits we talked about in the previous section exist for several reasons: to save you from racking up a huge bill, and to protect the LLM providers themselves from bad actors. Running an LLM requires a lot of computing power, hence a lot of energy.&lt;/p&gt;

&lt;p&gt;I figured that this would be the time to talk about money from the perspective of a client developer. While developing this client, I spent around $30 on Claude, around $10 on OpenAI, and a few bucks on both Gemini and Bedrock. Models from Bedrock are barely usable because of their aggressive rate limiting, so I could not rack up a huge cost there even if I wanted to.&lt;/p&gt;

&lt;p&gt;LLM providers charge mainly based on the number of tokens used. As I mentioned before, once we add MCP tools to the mix, we largely lose control over how many tokens each query consumes. This, I think, is a problem. In my client, I display the number of tokens used after each user query, but I have no way to estimate either the token usage or the cost beforehand. In the end, it is good to have a limit imposed, because I can imagine a simple query going haywire pretty easily.&lt;/p&gt;
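
&lt;p&gt;Displaying the usage, at least, is straightforward: the &lt;code&gt;Result&lt;/code&gt; wrapper returned by the AiServices proxy carries the token counts reported by the provider. A sketch of the per-query display (as far as I can tell, the counts also include the intermediate tool-calling round trips):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import dev.langchain4j.model.output.TokenUsage;

Result&amp;lt;String&amp;gt; response = llmClient.chat(line, LocalDate.now().toString());

// Result wraps the answer together with the token usage of the exchange.
TokenUsage usage = response.tokenUsage();
System.out.printf("tokens used: input=%d, output=%d, total=%d%n",
        usage.inputTokenCount(), usage.outputTokenCount(), usage.totalTokenCount());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;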

&lt;h3&gt;5. Overcoming limitations imposed by the context window&lt;/h3&gt;

&lt;p&gt;Words and phrases we send to LLMs via API calls are transformed into tokens. This process is called tokenization, and it is performed by a tokenizer&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. There are different tokenization algorithms, each producing a set of tokens. Most of the time, these tokens are sub-words: groups of characters that represent parts of the words in the original text (for example, "tokenization" might be split into "token" and "ization").&lt;/p&gt;

&lt;p&gt;Why is it important to be aware of this tokenization process?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM providers charge based on the number of tokens.&lt;/li&gt;
&lt;li&gt;Each LLM has a context window, that is, the maximum amount of input, measured in tokens, that the model can accept at once&lt;sup id="fnref4"&gt;4&lt;/sup&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a user's query cannot fit into the model's context window, the request will simply fail with an error from the provider. Nowadays, most models have a context window above 128k tokens, significantly larger than what we had around a year ago (for example, Amazon Titan, a model decommissioned at the start of this year, had a context window of 8k tokens). Even with a large context window, MCP usage, especially tool calling, can quickly become problematic. To illustrate this, let's break down what the context window will contain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;System prompt: MCP clients need a system prompt. This can be shorter or longer, depending on what we consider a good system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Definition of each available tool, with a description of what it does, a description of each argument, and other relevant information. As an example, here is a snippet of what is sent to the model when using &lt;a href="https://github.com/github/github-mcp-server" rel="noopener noreferrer"&gt;GitHub's MCP server&lt;/a&gt;. Emphasis on snippet: the full tool definition goes on and on.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"tools"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_pull_request_review"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Create a review on a pull request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"input_schema"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review comment text"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"comments"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Line-specific comments array of objects to place comments on pull request changes. Requires path and body. For line comments use line or position. For multi-line comments use start_line and line with optional side parameters."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comment body"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"line"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"line number in the file to comment on. For multi-line comments, the end of the line range"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"path to the file"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"position of the comment in the diff"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"side"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The side of the diff on which the line resides. For multi-line comments, this is the side for the end of the line range. (LEFT or RIGHT)"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"start_line"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The first line of the range to which the comment refers. Required for multi-line comments."&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"start_side"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The side of the diff on which the start line resides for multi-line comments. (LEFT or RIGHT)"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"body"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"commitId"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SHA of commit to review"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Review action ('APPROVE', 'REQUEST_CHANGES', 'COMMENT')"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Repository owner"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"pullNumber"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pull request number"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"repo"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Repository name"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pullNumber"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"event"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User message(s): obviously, user requests will also be included in the context window.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI messages: the responses from the model itself will also be part of the context window, in case there is an ongoing conversation, of course.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Requests from the model for calling tools and responses from tools: as you may have guessed, the whole procedure of executing a tool will also be part of the context window at some point. This is important because, depending on which MCP server we are communicating with, the responses from tools can be short or really, really large. For example, when using GitHub's MCP server, if I ask what kind of repositories I have, I will get a very lengthy response back, since I have a lot of repositories published to GitHub. To make matters worse, the LLM can request further tool calls after it gets back the response from the one it called. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AI message summarizing the response from the tool: this happens each time a tool invocation occurs. The model will do its best to either summarize the tool response or to extract the necessary information from the tool response to answer the initial question from the user.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvbldsgq23vr6n433o71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvbldsgq23vr6n433o71.png" alt="MCP Tool calling message flow" width="644" height="685"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The point of all this was to illustrate that the available context window can fill up quickly. From a user's perspective, it is really annoying if the whole conversation falls apart after one or two interactions.&lt;/p&gt;

&lt;p&gt;The challenge is: how can we limit the growth of the content sent back and forth between the client and the model? Personally, I don't think there is a silver bullet here. What I did in my CLI application was to squash tool messages. Requests and responses from tool calls can often be eliminated without losing much information: the model's goal is to answer the user's question, and tool calls essentially act as additional context (or RAG&lt;sup id="fnref5"&gt;5&lt;/sup&gt;) for a more accurate answer. To keep the conversation going for longer, we can remove certain "tool messages," reducing the size of the context we send back and forth. Once the model stops triggering new tool calls and returns an answer of its own, the client can go back and remove all the tool messages, which are not visible to the user anyway.&lt;/p&gt;

&lt;p&gt;Again, this is probably a naive solution to the problem at hand, and it has downsides, such as information loss; in many cases, it simply does not work, because the context window fills up before the LLM provides a final answer. Nevertheless, it was an attempt to keep the program from crashing as often.&lt;/p&gt;
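
&lt;p&gt;In code, the squashing pass boils down to rebuilding the chat memory without the tool-related messages. A rough sketch of the idea, using LangChain4J's message types (not verbatim from my client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.ToolExecutionResultMessage;
import dev.langchain4j.memory.ChatMemory;

import java.util.List;

// Run after the model returns a final answer: drop tool requests and tool
// results from memory, keep the user messages and the final AI answers.
static void squashToolMessages(ChatMemory chatMemory) {
    List&amp;lt;ChatMessage&amp;gt; kept = chatMemory.messages().stream()
            .filter(message -&amp;gt; !(message instanceof ToolExecutionResultMessage))
            .filter(message -&amp;gt; !(message instanceof AiMessage aiMessage
                    &amp;amp;&amp;amp; aiMessage.hasToolExecutionRequests()))
            .toList();
    chatMemory.clear();
    kept.forEach(chatMemory::add);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;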

&lt;h3&gt;6. Lack of examples/best practices&lt;/h3&gt;

&lt;p&gt;While scouring the internet for anything meaningful in my quest to build an MCP client, I learned one thing: there is no clear way of doing it right. Everybody does tool calling; that is easy to understand and implement. For anything else, well...&lt;/p&gt;

&lt;p&gt;My client supports tool calls. It can also list available resources and prompts, but at this point, these cannot really be used for anything meaningful. MCP offers a lot more concepts: aside from tools, resources, and prompts, there are sampling and elicitation, which my client does not use in any way. My client is essentially a CLI app that can do tool calling. That's it. Admittedly, there are certain limitations to what a text-based CLI app can accomplish, but still, my app is nothing special. To be fair, most of the available open-source MCP clients are mainly based on tool calling.&lt;/p&gt;

&lt;p&gt;Why mainly tool calling? It is easy to understand, simple to implement, and genuinely useful for enhancing the capabilities of an LLM.&lt;/p&gt;

&lt;p&gt;In contrast, implementing MCP resources is more challenging than expected. According to the protocol, resources can be anything&lt;sup id="fnref6"&gt;6&lt;/sup&gt;, so it is practically impossible to implement a client that can use any kind of resource. I understand that specialized clients, such as code editors with MCP client features, can support text-based resources from git repositories (GitHub's MCP server exposes them under the &lt;code&gt;repo://&lt;/code&gt; URI scheme), and the notion of resources was essentially invented for this kind of purpose. My main problem with resources is that the protocol definition is really vague about them, and it is similarly vague about other concepts as well.&lt;/p&gt;
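
&lt;p&gt;For illustration, listing resources is a single JSON-RPC call at the protocol level. Below is a hypothetical request and a possible response for a server exposing such a repository resource (the shape follows the MCP specification; the URI is made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{ "jsonrpc": "2.0", "id": 1, "method": "resources/list" }

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "resources": [
      {
        "uri": "repo://octocat/hello-world/contents/README.md",
        "name": "README.md",
        "mimeType": "text/markdown"
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;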

&lt;p&gt;That being said, I'm pretty sure this will improve in the future as long as the protocol is adopted by many more people and organizations.&lt;/p&gt;

&lt;h2&gt;Final Thoughts&lt;/h2&gt;

&lt;p&gt;I started writing this article in May 2025. Then life happened, and I had to put it aside to work on other things. Now, at the end of July, a lot has happened in the world of MCP. The protocol definition has improved significantly in the area of security, elicitation was introduced, and the original HTTP+SSE transport is already considered legacy.&lt;/p&gt;

&lt;p&gt;The protocol is still changing, which is good. There are still a lot of rough edges to be ironed out.&lt;/p&gt;

&lt;p&gt;From a client development perspective, the libraries are getting better. LangChain4J has evolved a lot in the last few months, although it still lags behind in implementing everything from the MCP protocol definition. We are getting there.&lt;/p&gt;

&lt;h2&gt;Additional Links&lt;/h2&gt;

&lt;p&gt;The source code for the CLI application I developed can be found on GitHub: &lt;a href="https://github.com/Ernyoke/jmcpx" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/jmcpx&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;MCP - Introduction: &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;link&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;How to Build a BlueSky RSS-like Bot with AWS Lambda and Terraform: &lt;a href="//how-to-build-a-bluesky-rss-like-bot-with-aws-lambda-and-terraform.md"&gt;link&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;The Technical User's Introduction to LLM Tokenization: &lt;a href="https://christophergs.com/blog/understanding-llm-tokenization" rel="noopener noreferrer"&gt;link&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;What is a context window?: &lt;a href="https://www.ibm.com/think/topics/context-window" rel="noopener noreferrer"&gt;link&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;What is RAG (Retrieval-Augmented Generation)?: &lt;a href="https://aws.amazon.com/what-is/retrieval-augmented-generation/" rel="noopener noreferrer"&gt;link&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;From the MCP protocol documentation as of version &lt;a href="https://modelcontextprotocol.io/docs/concepts/resources" rel="noopener noreferrer"&gt;2025-06-18&lt;/a&gt;: "Resources represent any kind of data that an MCP server wants to make available to clients." ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>mcp</category>
      <category>java</category>
      <category>langchain</category>
    </item>
    <item>
      <title>Building Super Slim Containerized Lambdas on AWS - Revisited</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Wed, 02 Apr 2025 14:11:50 +0000</pubDate>
      <link>https://forem.com/aws-builders/building-super-slim-containerized-lambdas-on-aws-revisited-8m7</link>
      <guid>https://forem.com/aws-builders/building-super-slim-containerized-lambdas-on-aws-revisited-8m7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I was reading through some AWS blogs when I stumbled upon this article, &lt;a href="https://aws.amazon.com/blogs/containers/optimize-your-container-workloads-for-sustainability/" rel="noopener noreferrer"&gt;Optimize your container workloads for sustainability&lt;/a&gt; by Karthik Rajendran and Isha Dua. Among other topics, the article discusses reducing the size of your Lambda container images to achieve better sustainability. Some of the main points discussed are &lt;strong&gt;how&lt;/strong&gt; and &lt;strong&gt;why&lt;/strong&gt; to reduce the size of Lambda containers - the idea being that smaller containers require less bandwidth to transfer over the internet, take up less storage on disk, and are usually faster to build, thereby consuming less energy.&lt;/p&gt;

&lt;p&gt;At first glance, this might seem like a minor optimization, but when you consider that AWS has millions of customers building Docker images, it suddenly makes sense to recommend working with slimmer images. Furthermore, having smaller images is considered a best practice overall.&lt;/p&gt;

&lt;p&gt;Around three years ago, I wrote an article about the &lt;strong&gt;"how"&lt;/strong&gt;, discussing ways to reduce the size of Lambda containers. My article was titled &lt;a href="https://dev.to/aws-builders/building-super-slim-containerized-lambdas-on-aws-3kpe"&gt;Building Super Slim Containerized Lambdas on AWS&lt;/a&gt;, and it primarily focused on Lambda functions written in Rust. Reading the AWS blog article reminded me that I should probably revisit the topic of creating slim Lambda images and provide a more informed perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Recap
&lt;/h2&gt;

&lt;p&gt;As I mentioned, in my old article, I used Rust to build a Lambda function. Code written in Rust is compiled into a binary. To execute this binary as a Lambda function, you can either upload the binary directly to AWS Lambda or package it in a Docker image, push the image to Amazon ECR, and configure a Lambda function to use that image.&lt;/p&gt;

&lt;p&gt;The size of this Docker image can vary significantly. If you use the default image recommended by AWS (&lt;code&gt;public.ecr.aws/lambda/provided&lt;/code&gt;), the container size can be a few hundred megabytes. However, if you go with a minimal approach, such as a Distroless image, you can get containers down to just a few dozen megabytes, depending mostly on the size of the compiled binary.&lt;/p&gt;

&lt;p&gt;When it comes to Lambda execution time, image size has little impact. Whether the image is large or small, the function's execution time remains roughly the same - though I haven’t tested exceptionally large images that might push Lambda’s limits.&lt;/p&gt;

&lt;p&gt;Ultimately, the takeaway from the article was that while Distroless allows you to build smaller images, it might not be all that beneficial since it doesn’t improve performance. I also pointed out that a smaller image can speed up the build pipeline. Admittedly, at that time, I didn’t even consider the sustainability angle. Even after all these years, I can only speculate about its impact. As a solo AWS user, it’s difficult to quantify how much of a difference using smaller images would actually make.&lt;/p&gt;

&lt;p&gt;On the other hand, I’m really excited to dive into the nifty details of building tiny images. &lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to Build a Slim Container
&lt;/h2&gt;

&lt;p&gt;In order to build slim containers, you will want to follow these steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Use tiny base images
&lt;/h3&gt;

&lt;p&gt;This is probably the most important step in reducing the image size. In most cases, this step allows us to achieve the greatest size reduction without modifying our application's source code.&lt;/p&gt;

&lt;p&gt;The general idea is that instead of base images such as &lt;code&gt;public.ecr.aws/lambda/provided:al2&lt;/code&gt; (the AWS-recommended image for compiled executables like Rust applications) or images based on Ubuntu, you can use images such as Alpine, Distroless, or Chainguard. These alternatives have a significantly smaller footprint. &lt;/p&gt;

&lt;p&gt;Distroless and Chainguard images are optimized to include only the essentials needed for execution. They lack a shell, package manager, and most standard Linux binaries, making them extremely small. Alpine-based images include more utilities, such as a shell and basic Linux binaries, yet they are still designed to occupy only a few megabytes on disk.&lt;/p&gt;

&lt;p&gt;The downside of using these small images is that they can be challenging to work with during development or debugging. For example, you cannot directly open a shell in a Distroless image since it does not include one.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Note&lt;/strong&gt;:&lt;br&gt;
To start a shell inside a Distroless image, you can rebuild your image using the &lt;code&gt;:debug&lt;/code&gt; tag. For example, instead of &lt;code&gt;FROM gcr.io/distroless/cc:latest-amd64&lt;/code&gt; you can have &lt;code&gt;FROM gcr.io/distroless/cc:debug&lt;/code&gt;. "Debug" images include busybox, which provides a minimal set of common Linux binaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ℹ️ Note&lt;/strong&gt;:&lt;br&gt;
Similarly, with Chainguard images, you can append the &lt;code&gt;-dev&lt;/code&gt; keyword to the tag of the image. For example, if you have an image built on &lt;code&gt;FROM cgr.dev/chainguard/static:latest&lt;/code&gt;, you can rebuild it with &lt;code&gt;FROM cgr.dev/chainguard/static:latest-dev&lt;/code&gt; to have access to a shell and other debug tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Add strictly what you need to the image
&lt;/h3&gt;

&lt;p&gt;This point is similar to the previous one, the difference being that even if you choose the smallest base image possible, you might find yourself with a bunch of unnecessary files added back to your image at build time. To avoid this, you can do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Use multistage builds&lt;sup id="fnref1"&gt;1&lt;/sup&gt;: Most likely, your application is built alongside the Docker image you publish. In many cases, developers copy the source code into the image and then run commands to create the application package. As a result, the final image may contain unnecessary development dependencies that are not required for execution. You can either remove these manually, or simply use multistage builds. With multistage builds, you can use different images for the build and execution stages. In fact, you can use a fully-fledged, "large" development image for the build step, after which you copy the artifacts to a stripped-down image used for execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use &lt;code&gt;.dockerignore&lt;/code&gt; to copy only what you need to your image: similarly to &lt;code&gt;.gitignore&lt;/code&gt;, &lt;code&gt;.dockerignore&lt;/code&gt; lets you specify files and folders that should not be copied over to your images at build time (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
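
&lt;p&gt;As a sketch, a &lt;code&gt;.dockerignore&lt;/code&gt; for a Rust project like the one later in this article could look as follows; the entries are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# .dockerignore - keep the build context (and the image) small
target/
.git/
.github/
*.md
Dockerfile
.dockerignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;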

&lt;h3&gt;
  
  
  3. Use a compiled language to build a small executable
&lt;/h3&gt;

&lt;p&gt;Nowadays, it is becoming increasingly challenging to distinguish between interpreted and compiled languages. Many interpreted (or so-called "scripting") languages use just-in-time (JIT) compilers, meaning the code is compiled to machine code at execution time.&lt;/p&gt;

&lt;p&gt;The key takeaway here is that, to optimize the size of your container, you may want to avoid writing your Lambda functions in languages that require an interpreter or JIT compiler. Instead, you should choose something that compiles to machine code and is linked into a standalone executable.&lt;/p&gt;

&lt;p&gt;To be more specific, rather than using Python or JavaScript, you might consider Rust, Go, or C++. Of course, I fully understand that this may not always be feasible, and my intention is not to discredit Python, JavaScript, or any other language. However, it is important to recognize that if we prioritize minimizing container size, eliminating the interpreter can free up tens, if not hundreds, of megabytes.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Static vs dynamic linking
&lt;/h3&gt;

&lt;p&gt;If you followed the previous points, your image should already be really small. At this point, your main goal is probably to reduce the size of the executable. One thing you may encounter is the presence of libc (usually &lt;code&gt;glibc&lt;/code&gt;) in your Docker image. Both Distroless and Chainguard offer the option to choose between a base image that has &lt;code&gt;glibc&lt;/code&gt; and an equivalent image that does not. Obviously, the image that has glibc is larger in size.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;glibc&lt;/code&gt;, or the GNU C Library, is the standard C library (libc) implementation used on Linux systems. It offers a wide range of functions that allow programs to interact with the operating system, such as handling input/output, managing memory, and manipulating strings. Rust relies on &lt;code&gt;glibc&lt;/code&gt; for interacting with the operating system: on Linux, Rust's standard library, &lt;code&gt;libstd&lt;/code&gt;, typically links against &lt;code&gt;glibc&lt;/code&gt; for system calls and other operations. Go, on the other hand, offers more flexibility by allowing compilation without relying on the system's C standard library. By setting &lt;code&gt;CGO_ENABLED=0&lt;/code&gt;, Go programs use their own implementations for system interactions, avoiding the &lt;code&gt;glibc&lt;/code&gt; dependency. This means that if you build a Lambda function with Go, you can simply disable linking against &lt;code&gt;glibc&lt;/code&gt; and put your executable in an image that does not have the library. In the case of Rust, you can build an executable that statically links libc by using musl&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;sup id="fnref3"&gt;3&lt;/sup&gt;.&lt;/p&gt;
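
&lt;p&gt;For reference, this is roughly what the two approaches look like from the command line, using the standard Go and Rust toolchains:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Go: build a fully static binary with no glibc dependency
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o bootstrap main.go

# Rust: statically link libc by targeting musl
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;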

&lt;h3&gt;
  
  
  5. Building an image from Scratch
&lt;/h3&gt;

&lt;p&gt;The most minimal base image you can use is &lt;code&gt;scratch&lt;/code&gt;&lt;sup id="fnref4"&gt;4&lt;/sup&gt;. You can simply copy your binary executable into the image, and it will be executed as PID 1. This can work for AWS Lambda functions as well, but you may encounter issues depending on what your Lambda function attempts to accomplish. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CA (Certificate Authority) certificates will be missing; you won't even have an &lt;code&gt;/etc/ssl/certs/&lt;/code&gt; folder. This will cause HTTPS connections to fail. To use any AWS service such as DynamoDB or S3, HTTPS must work.&lt;/li&gt;
&lt;li&gt;Standard directories such as &lt;code&gt;/var&lt;/code&gt;, &lt;code&gt;/home&lt;/code&gt;, and &lt;code&gt;/root&lt;/code&gt; will be missing. The exception is the &lt;code&gt;/tmp&lt;/code&gt; directory, which will be mounted by AWS, allowing us to write to a temporary folder if needed.&lt;/li&gt;
&lt;li&gt;Time zone data may cause issues, as the &lt;code&gt;/usr/share/zoneinfo&lt;/code&gt; directory will be missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you can overcome these issues by adding the necessary files and folders at build time (see the sketch below), but that defeats the purpose of using the &lt;code&gt;scratch&lt;/code&gt; base image. Instead, you are usually better off choosing Distroless or Chainguard.&lt;/p&gt;
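
&lt;p&gt;For completeness, here is a sketch of what fixing just the HTTPS issue would look like, copying the CA bundle from the builder stage (which, being Debian-based, ships with &lt;code&gt;ca-certificates&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;FROM rust:1.84-bullseye AS builder
# ... build the bootstrap executable as before ...

FROM scratch

# Bring just the CA bundle into the otherwise empty image so HTTPS works
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
COPY --from=builder /build/target/x86_64-unknown-linux-musl/release/bootstrap bootstrap

ENTRYPOINT [ "./bootstrap" ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;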

&lt;h2&gt;
  
  
  Build the Slimmest Image Possible for a Rust Lambda
&lt;/h2&gt;

&lt;p&gt;Following these steps, let's try to build a slim but usable container image for a Lambda function developed in Rust.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ARG &lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/function"&lt;/span&gt;

FROM rust:1.84-bullseye AS builder

WORKDIR /build

ADD Cargo.toml Cargo.toml
ADD Cargo.lock Cargo.lock
ADD src src

&lt;span class="c"&gt;# Cache build folders, see: https://stackoverflow.com/a/60590697/7661119&lt;/span&gt;
&lt;span class="c"&gt;# Docker Buildkit required&lt;/span&gt;
RUN &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cargo/registry &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/home/root/app/target &lt;span class="se"&gt;\&lt;/span&gt;
    rustup target add x86_64-unknown-linux-musl &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; x86_64-unknown-linux-musl

&lt;span class="c"&gt;# copy artifacts to a clean image&lt;/span&gt;
FROM cgr.dev/chainguard/static:latest

COPY &lt;span class="nt"&gt;--from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;builder /build/target/x86_64-unknown-linux-musl/release/bootstrap bootstrap

ENTRYPOINT &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"./bootstrap"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size of this image is 4.03 MB, of which the final base image itself (&lt;code&gt;chainguard/static&lt;/code&gt;) accounts for approximately 1.33 MB, while the remaining 2.7 MB is the executable. Admittedly, my Lambda function does not do a lot and has only a few dependencies (the code for the function can be found here: &lt;a href="https://github.com/Ernyoke/aws-lambda-benchmarks/tree/main/aws-lambda-compute-pi-rs/src" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;). What I would want to point out is that this Docker image follows the steps outlined above to achieve the reduced size:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is using a builder image to compile the executable.&lt;/li&gt;
&lt;li&gt;Only the necessary files are copied to the final image - specifically, the executable named &lt;code&gt;bootstrap&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It is using &lt;code&gt;chainguard/static&lt;/code&gt; to run the Lambda function. Distroless could have been an option as well, but that would result in a slightly larger image size (4.68 MB).&lt;/li&gt;
&lt;li&gt;It is using the &lt;code&gt;x86_64-unknown-linux-musl&lt;/code&gt; toolchain to build the executable, ensuring that libc is statically linked.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally, the target architecture is &lt;code&gt;x86_64&lt;/code&gt;. However, with a few modifications, you could build the same image for arm64:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ARG &lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/function"&lt;/span&gt;

FROM rust:1.84-bullseye AS builder

WORKDIR /build

ADD Cargo.toml Cargo.toml
ADD Cargo.lock Cargo.lock
ADD src src

&lt;span class="c"&gt;# Cache build folders, see: https://stackoverflow.com/a/60590697/7661119&lt;/span&gt;
&lt;span class="c"&gt;# Docker Buildkit required&lt;/span&gt;
RUN &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cargo/registry &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/home/root/app/target &lt;span class="se"&gt;\&lt;/span&gt;
    rustup target add aarch64-unknown-linux-musl &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; aarch64-unknown-linux-musl

&lt;span class="c"&gt;# copy artifacts to a clean image&lt;/span&gt;
FROM gcr.io/distroless/static:latest-arm64

COPY &lt;span class="nt"&gt;--from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;builder /build/target/aarch64-unknown-linux-musl/release/bootstrap bootstrap

ENTRYPOINT &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"./bootstrap"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size of this image will be roughly the same - I measured 4.06 MB on my computer. There are minor variations in size depending on the target architecture, with &lt;code&gt;x86_64&lt;/code&gt; typically being a few KB smaller. However, this difference is negligible.&lt;/p&gt;

&lt;p&gt;You can make the image even slimmer by using &lt;code&gt;scratch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ARG &lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/function"&lt;/span&gt;

FROM rust:1.84-bullseye AS builder

WORKDIR /build

ADD Cargo.toml Cargo.toml
ADD Cargo.lock Cargo.lock
ADD src src

&lt;span class="c"&gt;# Cache build folders, see: https://stackoverflow.com/a/60590697/7661119&lt;/span&gt;
&lt;span class="c"&gt;# Docker Buildkit required&lt;/span&gt;
RUN &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cargo/registry &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/home/root/app/target &lt;span class="se"&gt;\&lt;/span&gt;
    rustup target add x86_64-unknown-linux-musl &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; x86_64-unknown-linux-musl

&lt;span class="c"&gt;# copy artifacts to a clean image&lt;/span&gt;
FROM scratch

COPY &lt;span class="nt"&gt;--from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;builder /build/target/x86_64-unknown-linux-musl/release/bootstrap bootstrap

ENTRYPOINT &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"./bootstrap"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, this Lambda function works - it executes successfully and produces the expected output. However, all it does is calculate the value of PI using the Unbounded Spigot Algorithm for the Digits of PI&lt;sup id="fnref5"&gt;5&lt;/sup&gt;. It is a "toy" function, serving as proof that &lt;code&gt;scratch&lt;/code&gt; can work for Lambda functions. Nevertheless, I do not recommend using this base image. The size of this image is 2.69 MB.&lt;/p&gt;

&lt;p&gt;Before closing this section, here is a comparison between base images with Rust executables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4btjeocb8g4g08h7iaqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4btjeocb8g4g08h7iaqy.png" alt="Rust Container Image Sizes by Base Images" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the Slimmest Image Possible for a Python Lambda
&lt;/h2&gt;

&lt;p&gt;Understandably, there can be several reasons not to use a compiled language for your Lambda function. In this section of the article, we will try to build a slim image for a Lambda function written in Python.&lt;/p&gt;

&lt;p&gt;For this example, I will use a Python image based on Alpine Linux. Alpine-based images are widely known as slim images, but they are not the slimmest possible options. They come with a package manager (&lt;code&gt;apk&lt;/code&gt;), a shell, and all the well-known Linux utilities, so they are not as "clean" as Distroless or Chainguard-based images.&lt;/p&gt;

&lt;p&gt;I tried to build an image for a Python Lambda using a Distroless base, but I failed miserably. The AWS Lambda Python Runtime Interface Client relies on C++ modules using CPython&lt;sup id="fnref6"&gt;6&lt;/sup&gt;. These C++ modules have to be built at installation time and require a bunch of dependencies. Besides, they rely on dynamically linking several dependencies (musl vs. glibc, remember?). I tried to add all the necessary dependencies to the final runtime image, but in the end, whatever I did, it didn't work out. So I gave up. This might be another project of mine for a later time: to try to make it work.&lt;/p&gt;

&lt;p&gt;Nevertheless, here is how a containerized Lambda function for Python would look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ARG &lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"/home/app/"&lt;/span&gt;
ARG &lt;span class="nv"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"3.11"&lt;/span&gt;
ARG &lt;span class="nv"&gt;DISTRO_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"3.21"&lt;/span&gt;

&lt;span class="c"&gt;# Stage 1 - bundle base image + runtime&lt;/span&gt;
&lt;span class="c"&gt;# Grab a fresh copy of the image and install GCC&lt;/span&gt;
FROM python:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-alpine&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DISTRO_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; AS python-alpine
&lt;span class="c"&gt;# Install GCC (Alpine uses musl but we compile and link dependencies with GCC)&lt;/span&gt;
RUN apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; libstdc++

&lt;span class="c"&gt;# Stage 2 - build function and dependencies&lt;/span&gt;
FROM python-alpine AS build-image

&lt;span class="c"&gt;# Needed for libexecinfo-dev. Alternatives such as libunwind may build awslambdaric, but the function wont execute in the final runtime image.&lt;/span&gt;
&lt;span class="c"&gt;# Tanks https://stackoverflow.com/questions/77518311/dockerfile-for-node16-alpine-in-aws-lambda&lt;/span&gt;
RUN apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; &lt;span class="nt"&gt;--update&lt;/span&gt; &lt;span class="nt"&gt;--repository&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://dl-cdn.alpinelinux.org/alpine/v3.16/main/ libexecinfo-dev

&lt;span class="c"&gt;# Install aws-lambda-cpp build dependencies&lt;/span&gt;
RUN apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    build-base &lt;span class="se"&gt;\&lt;/span&gt;
    libtool &lt;span class="se"&gt;\&lt;/span&gt;
    autoconf &lt;span class="se"&gt;\&lt;/span&gt;
    automake &lt;span class="se"&gt;\&lt;/span&gt;
    libexecinfo-dev &lt;span class="se"&gt;\&lt;/span&gt;
    make &lt;span class="se"&gt;\&lt;/span&gt;
    cmake &lt;span class="se"&gt;\&lt;/span&gt;
    libcurl

ARG FUNCTION_DIR
ARG RUNTIME_VERSION

RUN &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

COPY &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

RUN python&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;awslambdaric &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
RUN python&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;RUNTIME_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;requirements.txt &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Stage 3 - final runtime image&lt;/span&gt;
&lt;span class="c"&gt;# Grab a fresh copy of the Python image&lt;/span&gt;
FROM python-alpine

ARG FUNCTION_DIR

WORKDIR &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

COPY &lt;span class="nt"&gt;--from&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;build-image &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;FUNCTION_DIR&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;

ENTRYPOINT &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"python"&lt;/span&gt;, &lt;span class="s2"&gt;"-m"&lt;/span&gt;, &lt;span class="s2"&gt;"awslambdaric"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;

CMD &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"main.handler"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Admittedly, I didn't come up with all of this by myself. The image is based on &lt;a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support" rel="noopener noreferrer"&gt;this&lt;/a&gt; blog post&lt;sup id="fnref7"&gt;7&lt;/sup&gt; by &lt;a href="https://bsky.app/profile/danilop.bsky.social" rel="noopener noreferrer"&gt;Danilo Poccia&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The final image size is 81.6 MB, which is significantly smaller than the 717 MB base image (&lt;code&gt;public.ecr.aws/lambda/python&lt;/code&gt;) recommended in the AWS documentation&lt;sup id="fnref8"&gt;8&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;To put into perspective what we built, here is a comparison of the Rust x86-64 images and the Python images. The size of the Alpine image is more than three times larger than the largest Distroless image. But all in all, it is still relatively small compared to the base images provided by AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkvuao4e8q5dlgri5o1z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkvuao4e8q5dlgri5o1z.png" alt="Rust (x86-64) and Python Container Image Sizes by Base Images" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Revisiting this topic made me realize that building smaller Docker images can be a lot of fun, but it can also be quite challenging. Returning to the original point of this article, it is true that smaller containers can contribute to sustainability. However, in order to build and work with them as developers, we must invest a significant amount of time and effort.&lt;/p&gt;

&lt;p&gt;Should you go in tomorrow and try to reduce the size of all your Lambda images running in production? Probably not. There is no such thing as a free lunch. You trade ease of use for bandwidth and storage savings. It is up to you to decide if it’s worth it.&lt;/p&gt;

&lt;p&gt;As always, the code referenced in this article can be found on GitHub: &lt;a href="https://github.com/Ernyoke/aws-lambda-benchmarks" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/aws-lambda-benchmarks&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
References
&lt;/h2&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;&lt;a href="https://docs.docker.com/build/building/multi-stage/" rel="noopener noreferrer"&gt;Multistage Builds&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;&lt;a href="https://doc.rust-lang.org/1.15.0/book/advanced-linking.html#static-linking" rel="noopener noreferrer"&gt;Rust - Static Linking&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;&lt;a href="https://doc.rust-lang.org/reference/linkage.html#static-and-dynamic-c-runtimes" rel="noopener noreferrer"&gt;The Rust Reference - Static and dynamic C runtimes&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;&lt;a href="https://docs.docker.com/build/building/base-images/#create-a-minimal-base-image-using-scratch" rel="noopener noreferrer"&gt;Docker Docs - Create a minimal base image using scratch&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;&lt;a href="https://www.cs.ox.ac.uk/jeremy.gibbons/publications/spigot.pdf" rel="noopener noreferrer"&gt;Unbounded Spigot Algorithms for the Digits of Pi - Jeremy Gibbons&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;&lt;a href="https://github.com/aws/aws-lambda-python-runtime-interface-client" rel="noopener noreferrer"&gt;GitHub Repository for AWS Lambda Python Runtime Interface Client&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/" rel="noopener noreferrer"&gt;New for AWS Lambda – Container Image Support&lt;/a&gt; - in the blogpost Danilo gives an example for Python 3.9. Based on his Dockerfile, I updated mine to use Python 3.11. At first glance, this should have been pretty easy to do, but I still had to track down dependencies (&lt;code&gt;libexecinfo-dev&lt;/code&gt;) that got removed from newer version of Alpine images. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/python-image.html#python-image-instructions" rel="noopener noreferrer"&gt;Deploy Python Lambda functions with container images&lt;/a&gt; ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>containers</category>
      <category>docker</category>
    </item>
    <item>
      <title>How to Build a BlueSky RSS-like Bot with AWS Lambda and Terraform</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Wed, 20 Nov 2024 14:53:42 +0000</pubDate>
      <link>https://forem.com/aws-builders/how-to-build-a-bluesky-rss-like-bot-with-aws-lambda-and-terraform-2f0</link>
      <guid>https://forem.com/aws-builders/how-to-build-a-bluesky-rss-like-bot-with-aws-lambda-and-terraform-2f0</guid>
      <description>&lt;p&gt;BlueSky, an alternative social media platform to the well-known X (or commonly known as Twitter), is currently experiencing a surge of new users. There are multiple reasons why many people, especially from Twitter, are migrating to BlueSky, but this blog post is not about that. We want to talk about bots, useful bots, not spam/scam bots, obviously.&lt;/p&gt;

&lt;p&gt;With the influx of the new user base, I also decided to create a new account there. I'm not a social butterfly; I tend to post once in a while. Seeing the activity on BlueSky, I decided to get involved in the way I know best, which is building something useful (or at least I want to believe it is useful) for a part of the community.&lt;/p&gt;

&lt;p&gt;As a disclaimer before going into the technicalities of building a bot:&lt;/p&gt;

&lt;p&gt;Social media bots, especially in the context of Twitter, have a negative connotation. That is because many people abuse them. In this article, I don't want to promote that. A bot - a client that can automatically share/re-share content on social media - can be useful. Many organizations rely on bots and/or external clients for automatically sharing information on multiple social media sites at once. On the other hand, I strongly condemn bots that have malicious intent in the messaging they spread, spam bots whose purpose is to create as many posts as possible regardless of whether the content they share is meaningful, and scam bots.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea for a Bot
&lt;/h2&gt;

&lt;p&gt;I became an AWS Community Builder in 2022. Since then, I've authored a few blog posts and have read many more by other authors. My initial idea for a BlueSky bot was to create one that shares blog posts written by fellow builders as part of the &lt;a href="https://dev.to/aws-builders"&gt;DEV.to AWS Community Builders&lt;/a&gt; organization.&lt;/p&gt;

&lt;p&gt;This idea is not entirely new. A similar bot was created some time ago by another fellow Community Builder &lt;a href="https://x.com/jreijn" rel="noopener noreferrer"&gt;Jeroen Reijn&lt;/a&gt;. If you use Twitter and are interested in a feed of Community Builder blog posts, please follow &lt;a href="https://x.com/aws_cb_blogs" rel="noopener noreferrer"&gt;@aws_cb_blogs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How would this bot work? Pretty simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the latest blog posts from DEV.to&lt;/li&gt;
&lt;li&gt;Make a nice post on BlueSky: add tags, a card with the link to the origin post, mention the author, etc.&lt;/li&gt;
&lt;li&gt;Wait for N minutes&lt;/li&gt;
&lt;li&gt;Go to step 1. and repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Implementation of the "AWS Community Builder Blog Posts" Bot
&lt;/h2&gt;

&lt;p&gt;As you can imagine, the idea is pretty simple, so the implementation should also be straightforward. That's partially true, although there might be some edge cases to handle.&lt;/p&gt;

&lt;p&gt;First, let's take a look at the API provided by DEV.to. DEV.to is powered by Forem, an open-source platform for blogging. The API &lt;a href="https://developers.forem.com/api/v1#tag/organizations/operation/getOrgArticles" rel="noopener noreferrer"&gt;provided by Forem&lt;/a&gt; for fetching articles from an organization is as simple as it gets. We need to specify an identifier for the organization we’re interested in and provide pagination details. Articles are sorted by their publishing timestamp, from the most recent to the least recent. For pagination, we need to specify the page number (the default is 1, which contains the most recent articles) and the number of articles per page.&lt;/p&gt;

&lt;p&gt;As far as I know, there is no way to specify a timestamp in the past and retrieve all more recent articles. Therefore, we need to rely on pagination to create our own solution for identifying which articles we have already shared on BlueSky and which ones have not yet been posted.&lt;/p&gt;

&lt;p&gt;My solution is illustrated in the following state diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t4rg2cwoezxa5pbpmm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t4rg2cwoezxa5pbpmm9.png" alt="State diagram for the actions of the bot" width="800" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, what I'm doing is the following (a code sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In a loop, get the last 10 articles from the first page;&lt;/li&gt;
&lt;li&gt;For all of the articles, check if they exist in a DynamoDB table;&lt;/li&gt;
&lt;li&gt;Drop all the articles which already exist in the table;&lt;/li&gt;
&lt;li&gt;The remaining articles are considered "new" or "recently published" articles. Save them in the table and publish them to BlueSky;&lt;/li&gt;
&lt;li&gt;In case all 10 articles are considered "new", fetch the second page as well and repeat the actions from step 1. In case there is at least one article that has already been published OR we reached page 3, stop and go to the next step;&lt;/li&gt;
&lt;li&gt;Wait for 5 minutes and re-do everything from the beginning.&lt;/li&gt;
&lt;/ol&gt;
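
&lt;p&gt;Expressed as a compressed TypeScript sketch, the loop looks roughly like this; &lt;code&gt;fetchPage&lt;/code&gt;, &lt;code&gt;alreadyShared&lt;/code&gt;, and &lt;code&gt;saveAndPublish&lt;/code&gt; are hypothetical helpers, not the exact functions from the repository:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Article = { id: number; title: string };

// Hypothetical helpers; in the real bot these wrap the Forem API,
// the DynamoDB table, and the BlueSky client.
declare function fetchPage(page: number, perPage: number): Promise&lt;Article[]&gt;;
declare function alreadyShared(id: number): Promise&lt;boolean&gt;;
declare function saveAndPublish(article: Article): Promise&lt;void&gt;;

const PAGE_SIZE = 10;
const MAX_PAGES = 3;

export async function run(): Promise&lt;void&gt; {
  for (let page = 1; page &lt;= MAX_PAGES; page++) {
    const articles = await fetchPage(page, PAGE_SIZE);
    const fresh: Article[] = [];
    for (const article of articles) {
      if (!(await alreadyShared(article.id))) {
        fresh.push(article);
      }
    }
    for (const article of fresh) {
      await saveAndPublish(article);
    }
    // Stop paging as soon as at least one article was already shared.
    if (fresh.length &lt; articles.length) {
      break;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;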

&lt;p&gt;You might ask, "How did I come up with all of these numbers? Why do I fetch 10 articles, and why do I do this 3 times?" To answer briefly, I'm just guessing that they are safe, even for periods with higher activity, like when re:Invent happens.&lt;/p&gt;

&lt;p&gt;To make a more educated guess, here is the number of posts published monthly by all Community Builders under the DEV.to organization (thank you, Jeroen Reijn, for the chart):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpjrl01gl1zrk9di3p3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpjrl01gl1zrk9di3p3v.jpg" alt="Number of DEV.to posts by month" width="800" height="520"&gt;&lt;/a&gt;&lt;br&gt;
(source: &lt;a href="https://bsky.app/profile/jeroenreijn.com/post/3l7uzeyay3r2u" rel="noopener noreferrer"&gt;https://bsky.app/profile/jeroenreijn.com/post/3l7uzeyay3r2u&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The number of posts fluctuates, but the periods we are most interested in are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;March: This is when a new cohort of Community Builders joins. At this point, everyone is enthusiastic about posting something;&lt;/li&gt;
&lt;li&gt;December: This is usually when re:Invent takes place. Many new things are introduced during this event, providing plenty of exciting topics to write about.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over the past year, the highest number of articles was posted in March, slightly exceeding 250. With this in mind, we can conclude that the values I chose are likely much higher than necessary. Realistically, we don’t expect all 250 articles to be posted within a 5-minute window. If that were the case, I’d have other concerns, such as BlueSky’s rate limits, especially for blob content like images.&lt;/p&gt;
&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;p&gt;For the technology stack, I went with an AWS Lambda function written in TypeScript. For detecting which articles I should re-share, I'm using a DynamoDB table. I'm using CloudWatch scheduled events to trigger the Lambda every 5 minutes. At this point, I don't have any special error-handling mechanism for the Lambda (other than the obvious exception checking and logging), but I would most likely create a dead-letter queue for missed events.&lt;/p&gt;

&lt;p&gt;For the infrastructure, I'm using Terraform. The reason is that I'm mostly familiar with it, and it took the shortest time for me to set everything up and get running.&lt;/p&gt;
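
&lt;p&gt;As a sketch, the 5-minute trigger boils down to a few Terraform resources; the resource names here are illustrative, not the exact ones from the repository:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_cloudwatch_event_rule" "every_five_minutes" {
  name                = "bsky-bot-schedule"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "bot_target" {
  rule = aws_cloudwatch_event_rule.every_five_minutes.name
  arn  = aws_lambda_function.bot_lambda.arn
}

# The scheduler also needs permission to invoke the function
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.bot_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.every_five_minutes.arn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;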

&lt;p&gt;The whole codebase can be found on GitHub: &lt;a href="https://github.com/Ernyoke/bsky-aws-community-builder-blogposts" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/bsky-aws-community-builder-blogposts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have a BlueSky account and you want to see recently published blog posts by AWS community builders, you can follow this account: &lt;a href="https://bsky.app/profile/awscmblogposts.bsky.social" rel="noopener noreferrer"&gt;https://bsky.app/profile/awscmblogposts.bsky.social&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Giving Another Try and Implementing an RSS Feed Bot for AWS News
&lt;/h2&gt;

&lt;p&gt;After finishing with the "AWS Community Builder Blog Posts" bot, I wanted to move on to another topic that interests me: AWS News. AWS has an &lt;a href="https://aws.amazon.com/about-aws/whats-new/recent/feed/" rel="noopener noreferrer"&gt;RSS feed&lt;/a&gt; for any important short announcement they make. These announcements are succinct and to the point. I think it would make sense to re-share them also on BlueSky, since there is a &lt;a href="https://x.com/awswhatsnew" rel="noopener noreferrer"&gt;similar account doing the same thing on Twitter&lt;/a&gt; with a considerable amount of followers.&lt;/p&gt;

&lt;p&gt;The approach to implementing this kind of bot is similar to the one for the Community Builder blog posts. One thing that differs is that in this case, we have to parse an RSS feed. Usually, there are only a few announcements made by AWS on a daily basis. We can extend the fetching schedule to 30 minutes (or even more). For the DynamoDB table, we can lower the provisioned READ/WRITE capacity to a small amount, since there won't be a huge number of reads and writes.&lt;/p&gt;
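
&lt;p&gt;The lower throughput is a small change in the table definition; a sketch with illustrative names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_dynamodb_table" "news_feed" {
  name           = "bsky-aws-news-feed"
  billing_mode   = "PROVISIONED"
  hash_key       = "guid"
  read_capacity  = 1 # a handful of announcements per day needs very little throughput
  write_capacity = 1

  attribute {
    name = "guid"
    type = "S"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;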

&lt;p&gt;Again, the implementation for this bot can also be found on GitHub: &lt;a href="https://github.com/Ernyoke/bsky-aws-news-feed" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/bsky-aws-news-feed&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case you would like to follow the BlueSky account with RSS news feed from AWS, you can do it here: &lt;a href="https://bsky.app/profile/awsrecentnews.bsky.social" rel="noopener noreferrer"&gt;https://bsky.app/profile/awsrecentnews.bsky.social&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After I finished developing this bot, I noticed that a fellow community builder, &lt;a href="https://bsky.app/profile/thulasirajkomminar.com" rel="noopener noreferrer"&gt;Thulasiraj Komminar&lt;/a&gt; developed his &lt;a href="https://bsky.app/profile/awsnews.bsky.social" rel="noopener noreferrer"&gt;own variant&lt;/a&gt;. You can follow whichever account you like, the more important thing is to stay updated.&lt;/p&gt;
&lt;h2&gt;
  
  
  Taking it a Step Further: Re-sharing Content from Official AWS Blogs and Detecting Deprecations
&lt;/h2&gt;

&lt;p&gt;In case you want to be as up-to-date as possible with all things AWS, you most likely stumbled into the &lt;a href="https://aws-news.com/" rel="noopener noreferrer"&gt;AWS News Feed&lt;/a&gt; page. This page is created and maintained by fellow AWS Hero, &lt;a href="https://bsky.app/profile/lucvandonkersgoed.com" rel="noopener noreferrer"&gt;Luc van Donkersgoed&lt;/a&gt;. The purpose of this page is to aggregate blog posts from all the official AWS blogs. Moreover, it can detect blog posts talking about service and feature deprecations, which unfortunately, are getting more frequent lately. It is a pretty impressive piece of work, and I absolutely recommend bookmarking and following this page.&lt;/p&gt;

&lt;p&gt;My idea was to bring both of those functionalities to my BlueSky feed. I wanted to create a bot that simply re-shares and tags all the articles from different kinds of official AWS blogs, and I also wanted to create a bot that shares posts talking about AWS service deprecations.&lt;/p&gt;

&lt;p&gt;Long story short, I came up with the following event-based architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtlp6szcl7yaje5lu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtlp6szcl7yaje5lu3c.png" alt="Event based architecture for AWS blogs and AWS deprecations Bots" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I split the so-called business logic into three parts (three Lambda Functions):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetcher Lambda: Works very similarly to the previous RSS-like bots I presented. It queries the AWS API for blog posts, uses a DynamoDB table to detect posts that have not yet been shared on BlueSky, and publishes those to an SNS topic.&lt;/li&gt;
&lt;li&gt;Blogs Publisher Lambda: Uses an SQS standard queue to listen to the topic. From this queue, it polls the messages and simply re-shares them to BlueSky.&lt;/li&gt;
&lt;li&gt;Deprecations Lambda: The "fun" part of this architecture. It also has its own queue from which it retrieves newly published blog posts. Before re-sharing them to BlueSky, it will use an LLM Model from Bedrock to detect if the article is about any kind of service deprecation. If it is, it will proceed to post the article to BlueSky.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What we have here is called a fan-out architecture. We have a producer Lambda (Fetcher) which produces events for multiple consumers.&lt;/p&gt;
&lt;h3&gt;
  
  
  Event Sourcing with Lambda
&lt;/h3&gt;

&lt;p&gt;Having an SQS queue gives a lot of flexibility and control over how our functions are invoked. Using a Lambda event source mapping, we get access to features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batching: we can group messages from a queue together and have our function invoked once for multiple messages;&lt;/li&gt;
&lt;li&gt;Error handling and partial batch processing: in case something fails while dealing with a message from the queue, we can have a number of retry attempts. Event source mapping allows partial batch processing: if there is an error with one of the messages from the batch, we don't necessarily have to reprocess everything; we can reprocess only those for which the execution failed;&lt;/li&gt;
&lt;li&gt;Concurrency control: we can limit how many function invocations run in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these features have to be configured when defining the event source mapping. Since I'm using Terraform, in my case I have the following configuration for the Blogs Publisher Lambda:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_lambda_event_source_mapping"&lt;/span&gt; &lt;span class="s2"&gt;"blogs_event_source_mapping"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nx"&gt;event_source_arn&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_sqs_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;blogs_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# ARN of the source SQS queue&lt;/span&gt;
 &lt;span class="nx"&gt;enabled&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# Flag used mainly for debugging&lt;/span&gt;
 &lt;span class="nx"&gt;function_name&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lambda_function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;blogs_lambda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt; &lt;span class="c1"&gt;# Lambda ARN&lt;/span&gt;
 &lt;span class="nx"&gt;batch_size&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# Accept a batch of max 10 messages&lt;/span&gt;
 &lt;span class="nx"&gt;maximum_batching_window_in_seconds&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="c1"&gt;# Time to wait for messages to arrive to be able to be gathered in a batch&lt;/span&gt;
 &lt;span class="nx"&gt;function_response_types&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ReportBatchItemFailures"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;# Used for partial error handling of a batch&lt;/span&gt;

 &lt;span class="nx"&gt;scaling_config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="nx"&gt;maximum_concurrency&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="c1"&gt;## Limit the number of instances of the function that can be invoked at the same time&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the case of the Deprecations Lambda, I have to deal with other limitations. Since AWS decided to limit my account to 20 invocations per minute of a base model from Bedrock, I decided to use the poor man's approach to Lambda rate limiting: setting the reserved concurrency to 1 (see the sketch below). This allows only one instance of my function to be executed at the same time. I'm aware that I can still hit the rate limit imposed by Bedrock, but I feel like at this point there is not much more I can do. It is also important to note that in this case &lt;code&gt;maximum_concurrency&lt;/code&gt; has to be disabled.&lt;/p&gt;
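
&lt;p&gt;Expressed in Terraform, this is a single attribute on the function; a sketch with an illustrative resource name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_lambda_function" "deprecations_lambda" {
  function_name                  = "bsky-aws-deprecations"
  # ... role, package, runtime, and the rest of the configuration ...
  reserved_concurrent_executions = 1 # at most one concurrent instance
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;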

&lt;h3&gt;
  
  
  Working with AI
&lt;/h3&gt;

&lt;p&gt;I'm using AI to detect if an article is about any AWS service deprecation. This works, most of the time, but in many cases, it can decide to be as disciplined as a badly behaved toddler.&lt;/p&gt;

&lt;p&gt;What I'm doing is extracting the text content of each article. This text is provided to the bot. As a response, I expect answers to the following questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Does the article mention any deprecation of any AWS service?&lt;/li&gt;
&lt;li&gt;If yes, what is the name of those services?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Moreover, I expect to get the response in JSON format.&lt;/p&gt;

&lt;p&gt;At first, I was under the impression that this should not be a big challenge for any of the available models. Boy, was I wrong!&lt;/p&gt;

&lt;p&gt;Both of my questions require summarization, and LLM models should be pretty good at summarization. The second part of my challenge is getting the model to provide the answer in a structured format (see the sketch after the list below). &lt;/p&gt;

&lt;p&gt;I tried different models for this, with different degrees of success:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Titan&lt;/strong&gt;: it does summarization really well and could answer both of my questions. However, getting the answers in JSON format was a challenge: I was unable to get valid JSON no matter how much I tried. I used LangChain's &lt;a href="https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/" rel="noopener noreferrer"&gt;&lt;code&gt;Structured Output&lt;/code&gt;&lt;/a&gt; and explicitly provided the format instructions to the bot as part of the prompt. Titan managed to generate something resembling valid JSON, but every time something was off. Ultimately, I decided to drop it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Instant&lt;/strong&gt;: I decided to try one of the cheapest offerings among the Anthropic models. Summarization works well enough, and it can build JSON most of the time. When I don't get a valid JSON response, the solution is to retry. Considering that I have 20 invocations per minute (thanks, AWS), retries quickly eat into that quota. So, I decided to try another model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Haiku&lt;/strong&gt;: I'm currently using Haiku, which provides correctly formatted JSON 99% of the time. When it doesn't, I simply retry the request (see the sketch after this list). It seems stable enough for my purposes, and I can work within the strict rate limiting imposed by AWS.&lt;/li&gt;
&lt;/ol&gt;
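&lt;p&gt;The retry itself is nothing fancy. A minimal sketch, reusing the hypothetical &lt;code&gt;checkArticle&lt;/code&gt; helper from above, with a growing delay between attempts to stay friendly with the Bedrock quota:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;async function checkArticleWithRetry(text: string, attempts = 3) {
  let lastError: unknown;
  for (let i = 0; i &lt; attempts; i++) {
    try {
      // Throws if the model output cannot be parsed against the schema
      return await checkArticle(text);
    } catch (err) {
      lastError = err;
      // Invalid JSON or a Bedrock throttle: wait a bit, then try again
      await new Promise((resolve) =&gt; setTimeout(resolve, 1000 * (i + 1)));
    }
  }
  throw lastError;
}
&lt;/code&gt;&lt;/pre&gt;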

&lt;p&gt;Aside from that, I still need to refine the prompt to avoid false positive detections. Sometimes the AI struggles to identify AWS services, or I might simply be bad at prompting it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Stack
&lt;/h3&gt;

&lt;p&gt;In short, I'm using TypeScript for the Lambda functions, DynamoDB to keep track of which articles have already been re-shared, SNS with standard SQS queues for fan-out, and LangChain with Claude Haiku from Bedrock. For the infrastructure, I'm using Terraform.&lt;/p&gt;

&lt;p&gt;The whole codebase can be found on GitHub: &lt;a href="https://github.com/Ernyoke/bsky-aws-blogs" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/bsky-aws-blogs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to see the deprecation warnings in your BlueSky feed, you can follow the handle: &lt;a href="https://bsky.app/profile/deprecatedbyaws.bsky.social" rel="noopener noreferrer"&gt;https://bsky.app/profile/deprecatedbyaws.bsky.social&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Words
&lt;/h2&gt;

&lt;p&gt;In conclusion, I had a lot of fun developing these bots. It was about time for me to revisit the features AWS offers for streaming and event sourcing. Working with AI, although a lot of fun, can be challenging at times.&lt;/p&gt;

&lt;p&gt;BlueSky also has the concept of starter packs. A starter pack makes it easier to follow multiple accounts at the same time with the push of a button. If you want a feed full of AWS-related blog posts/articles/news, I created a starter pack for all of these bots. You can follow them from here: &lt;a href="https://go.bsky.app/EdJArRR" rel="noopener noreferrer"&gt;https://go.bsky.app/EdJArRR&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://developers.forem.com/api/v1#tag/organizations/operation/getOrgArticles" rel="noopener noreferrer"&gt;Forem API for Organizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/" rel="noopener noreferrer"&gt;LangChain - Structured Output&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>serverless</category>
      <category>bluesky</category>
    </item>
    <item>
      <title>About AWS AI Practitioner (Beta) Exam</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Sun, 29 Sep 2024 10:11:31 +0000</pubDate>
      <link>https://forem.com/aws-builders/about-aws-ai-practitioner-beta-exam-21e4</link>
      <guid>https://forem.com/aws-builders/about-aws-ai-practitioner-beta-exam-21e4</guid>
      <description>&lt;p&gt;In June this year, AWS announced two new certification exams: the AI Practitioner exam and the Machine Learning Engineer Associate exam. At the same time, AWS still offers the Machine Learning Specialty (MLS-C01) certification. You can still take this exam, and at the time of writing this article, there is no news of it being retired. However, we can assume that the Machine Learning Engineer Associate will take its place.&lt;/p&gt;

&lt;p&gt;On the flip side, the AI Practitioner certification is an entirely new offering, marking the second foundational-level certification alongside the AWS Cloud Practitioner exam. Foundational-level certifications are considered entry-level: they can be attempted by people who do not necessarily have in-depth technical knowledge of cloud concepts.&lt;/p&gt;

&lt;p&gt;Out of pure curiosity, I decided to attempt this certification. In my day-to-day job, I've had the chance to work with AWS technologies and LLMs lately, so I thought it would be an interesting challenge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exam Overview
&lt;/h2&gt;

&lt;p&gt;The exam is intended not only for IT professionals but also for people working as business analysts, sales and marketing professionals, and IT managers. It relies mainly on theoretical knowledge and does not require as much hands-on experience as an associate or professional-level exam. This does not mean that it does not have its challenges.&lt;/p&gt;

&lt;p&gt;The exam consists of 85 questions which should be answered in 120 minutes (though I had 130 minutes through Pearson's online testing). You can opt for an accommodation of an extra 30 minutes in case English is not your mother tongue. I did not opt for that, but I recommend considering it if you are in a similar situation to mine.&lt;/p&gt;

&lt;p&gt;Aside from the multiple-choice and multiple-selection types of questions, AWS recently introduced new question types: ordering, matching, and case studies. At this point, I believe, AWS is A/B testing these new question types, since I did not encounter any of them during my session.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Do You Need to Know for the Exam?
&lt;/h2&gt;

&lt;p&gt;Considering that this is my 8th AWS certification, I can affirm that the exam was more challenging than expected. Regardless of your level of AWS cloud and AI/machine learning knowledge, I strongly suggest taking the time to do some meaningful preparation. To aid with that, I will present my learning plan and what I think you should know to pass the exam. For a more thorough preparation, I suggest considering a fully-fledged course.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bear in mind that this is not a comprehensive guide. If you are seriously considering attempting this certification, you may want to enroll in training provided by AWS or by a third-party trainer. See the next section for my recommendations for courses and learning material.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My training guide does not specifically respect the order of the domains enumerated in the &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-ai-practitioner/AWS-Certified-AI-Practitioner_Exam-Guide.pdf" rel="noopener noreferrer"&gt;official exam guide&lt;/a&gt;, which I strongly suggest that you read. That being noted, these are the topics I learned about during my preparation:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Cloud Computing Basics
&lt;/h3&gt;

&lt;p&gt;For the exam, you need to have a clear understanding of some basic concepts related to cloud computing, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud vs on-prem: what are the advantages, what are the drawbacks of both&lt;/li&gt;
&lt;li&gt;Public vs private cloud&lt;/li&gt;
&lt;li&gt;Pricing of cloud services: one important keyword that you will encounter is &lt;strong&gt;on-demand&lt;/strong&gt;. Everything in the cloud is pay-per-use: you only pay for what you use. Whenever you see a question about pricing, as a rule of thumb, you can default to the answer that mentions the &lt;strong&gt;on-demand&lt;/strong&gt; keyword. Of course, there might be exceptions, so use your best judgment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Basic Machine Learning Concepts
&lt;/h3&gt;

&lt;p&gt;For the exam, you will need to have surface-level knowledge of machine learning concepts and algorithms. In-depth understanding of these topics is not required, but you should be able to recognize them and understand their applications. The key concepts are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI/Machine learning&lt;/strong&gt;: What is AI and what is machine learning? What use cases are solved by AI systems?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI application components&lt;/strong&gt;: Data layer, ML algorithms, model layer, application layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neural Networks&lt;/strong&gt;: What are neural networks? Also, you should know about their components: neurons, layers, activation functions, and loss functions.&lt;/li&gt;
&lt;li&gt;You should understand what &lt;strong&gt;backpropagation&lt;/strong&gt; is: the process of updating the weights in the network to minimize the loss function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep Learning&lt;/strong&gt;: Know what deep learning is and what it is used for. Remember the keyword convolution, as it refers to a type of deep learning network (convolutional networks), primarily used for computer vision and image recognition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GenAI&lt;/strong&gt;: What is generative AI and what is it used for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Models/Diffusion Models/Multi-Modal Models&lt;/strong&gt;: These are types of GenAI models. There is a high likelihood that you will encounter questions regarding the properties of these models, so it's recommended to have a basic understanding of them. You likely won't need to go into detail about their inner workings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supervised vs. Unsupervised learning&lt;/strong&gt;: Understand the difference between these two approaches and know when to use one over the other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reinforcement Learning (RL)&lt;/strong&gt;: How does it work, and what is it used for? Be familiar with some key use cases.&lt;/li&gt;
&lt;li&gt;Machine learning model training and evaluation:

&lt;ul&gt;
&lt;li&gt;Model fit: The exam covers some more technical topics, such as &lt;strong&gt;underfitting&lt;/strong&gt; and &lt;strong&gt;overfitting&lt;/strong&gt;. These concepts are essential when training a model. A model is considered to &lt;strong&gt;underfit&lt;/strong&gt; if it does not perform well on the training data. Conversely, it is &lt;strong&gt;overfitting&lt;/strong&gt; if it performs well on the training data but poorly on evaluation or real-world data. If neither of these applies to your model, you can assume you have a &lt;strong&gt;balanced&lt;/strong&gt; model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bias&lt;/strong&gt; and &lt;strong&gt;variance&lt;/strong&gt;: Both of these refer to errors introduced by your model. A model is &lt;strong&gt;biased&lt;/strong&gt; when it consistently makes the same error in its predictions, often due to erroneous assumptions during training. &lt;strong&gt;Variance&lt;/strong&gt;, on the other hand, reflects a model's sensitivity to small fluctuations in the input data. For the exam, you should be able to identify whether a model is highly biased or has high variance. Additionally, you should understand how bias and variance relate to model fit. For example, a model with high variance will likely overfit, while a model with high bias will probably underfit.&lt;/li&gt;
&lt;li&gt;Model evaluation metrics: you are expected to be familiar with the concepts used to evaluate a model. You are not required to know the math behind them, but the exam expects you to know when to use one metric versus another (a short worked sketch follows this list). These metrics are the following:

&lt;ul&gt;
&lt;li&gt;Metrics used for classification models:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confusion Matrix&lt;/strong&gt;: used to evaluate the performance of classification models. It is usually structured as a square matrix, where rows represent the actual classes, and columns represent the predicted classes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt;: measures the fraction of correct predictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall (or Sensitivity)&lt;/strong&gt;: measures the true positive rate of the predictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: measures how many of the predicted positives are actually positive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specificity&lt;/strong&gt;: measures the true negative rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 score&lt;/strong&gt;: combines precision and recall into a final score. Use it when both precision and recall are considered important for evaluating your model.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Metrics used for regression models: as with the classification metrics, you don't need to know the formulas and mathematics behind them. What you should know for the exam is how to recognize them and whether they can be used with a presented model or not:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MAE (Mean Absolute Error)&lt;/strong&gt;: measures the average magnitude of errors between the predicted values and the actual values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAPE (Mean Absolute Percentage Error)&lt;/strong&gt;: used to assess the accuracy of a predictive model by calculating the average percentage error between the predicted values and the actual values. MAPE expresses the error as a percentage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RMSE (Root Mean Squared Error)&lt;/strong&gt;: measures the average magnitude of the error between the predicted values and the actual values, with a higher emphasis on larger errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R Squared&lt;/strong&gt;: explains how much of the variance in the data is captured by the model. An R2 close to 1 means the predictions are good.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Metrics used to evaluate LLMs: the exam might ask you about evaluating the performance of a large language model. In this case, you would want to know about these:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity loss&lt;/strong&gt;: measures how well the model can predict the next word in a sequence of text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall-Oriented Understudy for Gisting Evaluation (ROUGE)&lt;/strong&gt;: a set of metrics used in the field of natural language processing to evaluate the quality of machine-generated text.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;
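&lt;p&gt;While the exam does not require the math, seeing the classification and regression formulas once makes the metrics much easier to tell apart. A small illustrative sketch (all counts and values are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical confusion matrix counts: true/false positives and negatives
const tp = 80, fp = 10, tn = 90, fn = 20;

const accuracy = (tp + tn) / (tp + tn + fp + fn); // fraction of correct predictions
const precision = tp / (tp + fp);                 // how many predicted positives are real
const recall = tp / (tp + fn);                    // true positive rate (sensitivity)
const specificity = tn / (tn + fp);               // true negative rate
const f1 = (2 * precision * recall) / (precision + recall); // balances precision and recall

// Regression metrics over made-up predicted vs. actual values
const actual = [3, 5, 2.5, 7];
const predicted = [2.5, 5, 4, 8];
const mae = actual.reduce((sum, a, i) =&gt; sum + Math.abs(a - predicted[i]), 0) / actual.length;
const rmse = Math.sqrt(
  actual.reduce((sum, a, i) =&gt; sum + (a - predicted[i]) ** 2, 0) / actual.length
);

console.log({ accuracy, precision, recall, specificity, f1, mae, rmse });
&lt;/code&gt;&lt;/pre&gt;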

&lt;h3&gt;
  
  
  3. Generative AI and AWS Services for GenAI
&lt;/h3&gt;

&lt;p&gt;The exam expects you to be familiar with GenAI models. While it does not require an understanding of their inner workings, you should have experience using them. For example, you might be asked whether, for a certain problem, you would prefer to use a GenAI-based model or a more "legacy" ML model.&lt;/p&gt;

&lt;p&gt;Probably the most important AWS service for this exam is Amazon Bedrock. You can expect around 15 to 20 questions involving Bedrock in one way or another. Bedrock is essentially a one-stop shop for GenAI models. It provides access to a variety of Foundation Models, such as Amazon Titan, several Anthropic models (Claude), and models from AI21 Labs, Cohere, Stability AI, Meta (LLaMA), Mistral, and others. In addition to allowing you to build on these models, Bedrock offers several other features. The following are relevant for the exam:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Bases&lt;/strong&gt;: A solution for provisioning and using &lt;strong&gt;vector databases&lt;/strong&gt;. You may want to use vector databases if you are building a system based on &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agents&lt;/strong&gt;: Bedrock's method for function calling. You can integrate your model with an "action," allowing the model to perform tasks in addition to responding to messages, such as executing a Lambda function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt;: You can use Guardrails to limit the model's responses in terms of what it should be able to answer. Additionally, Guardrails offer features such as detecting personally identifiable information (PII) and removing it from the response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: You can evaluate a model using AWS-provided metrics or by employing human reviewers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Aside from these, there are other features of Bedrock the exam might ask about (Bedrock Studio, Watermark Detection). I strongly recommend doing some hands-on practice with Bedrock and experiencing what it has to offer.&lt;/p&gt;

&lt;p&gt;Another GenAI-based AWS service is &lt;strong&gt;Amazon Q&lt;/strong&gt;, a fully managed GenAI assistant for enterprise usage (whatever that means). It combines a lot of functionality into one service and comes in a few flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Q Business&lt;/strong&gt;: a question-answering chatbot built on Amazon Bedrock. It can ingest your company's internal data and answer your questions based on that data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Q Developer&lt;/strong&gt;: it serves two completely different use cases:

&lt;ul&gt;
&lt;li&gt;It is a chatbot built on AWS documentation, so it can answer your questions related to AWS services.&lt;/li&gt;
&lt;li&gt;It is a code companion (previously known as CodeWhisperer), similar to GitHub Copilot. It can generate source code, integrates with a bunch of IDEs, and can help you write code.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;I personally would not worry too much about Amazon Q for the exam. In my experience, there aren’t many questions about this service. On the other hand, it’s important to ensure you don’t confuse Amazon Q with Kendra, another service built for similar purposes. The exam may present them side by side, but you should be able to determine which one is more appropriate for your scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;Before embarking on my learning journey for the AI Practitioner certification, I considered prompt engineering to be a pseudo-science. My rule of thumb was (and still is) that if you want better answers from a model, provide it with as much information as possible. During my preparation and while building AI chatbots at my workplace, I learned that there are useful prompting techniques that can yield significantly better answers compared to what I was accustomed to before.&lt;/p&gt;

&lt;p&gt;For the AI Practitioner certification, you should be aware of the following prompting techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Shot Prompting&lt;/strong&gt;: This is a more "scientific" definition of what I was doing before adopting any prompt engineering techniques. You simply present a query to the model without specific wording, formatting, or examples of expected output. The model's output may vary. It could be useful or it might be complete garbage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few-Shot Prompting&lt;/strong&gt;: In this approach, you provide a prompt with examples of what you would expect as output from the model (a sample prompt follows this list). Surprisingly, this technique works better than I had imagined. In terms of the exam, you should choose this technique when asked for a low-cost solution with the highest precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain of Thought Prompting&lt;/strong&gt;: This involves dividing your queries into a sequence of reasoning steps, using phrases like "Think step by step." Interestingly, chain-of-thought reasoning is what several recently popular models rely on under the hood, which has contributed to the technique's popularity. Therefore, expect some questions about it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;: RAG is also considered a prompting technique. It relies on using a vector database to fetch content related to the user query. This content is then injected into the user prompt as additional context.&lt;/li&gt;
&lt;/ul&gt;
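&lt;p&gt;To illustrate the difference, here is a hypothetical few-shot prompt builder. The two worked examples embedded in the prompt show the model the exact output format expected before the real input is presented:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical example: sentiment classification with two embedded examples
function buildFewShotPrompt(review: string): string {
  return `Classify the sentiment of the review as POSITIVE or NEGATIVE.

Review: "The instance started in seconds, great service."
Sentiment: POSITIVE

Review: "My spot instance was interrupted twice in an hour."
Sentiment: NEGATIVE

Review: "${review}"
Sentiment:`;
}
&lt;/code&gt;&lt;/pre&gt;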

&lt;p&gt;Related to prompt engineering, the exam might ask about the hyperparameters you can set on a model to optimize its responses. The parameters you should be aware of are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature&lt;/strong&gt;: a value between 0 and 1.0 that defines the creativity of the model. The higher the value, the more creative the responses you will get.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top P&lt;/strong&gt;: a value between 0 and 1.0 that defines the size of the set of candidate words considered while generating each part of the answer. For example, with a value of 0.25, the model samples only from the most likely words that together account for 25% of the probability mass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top K&lt;/strong&gt;: similar to Top P, but an integer value. By setting a Top K value, we tell the model to consider only the K most likely options for the next word.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The general rule for these hyperparameters is that lower values make your model more conservative and its responses more coherent, while higher values result in more creative and less predictable responses.&lt;/p&gt;
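&lt;p&gt;As an illustration of where these hyperparameters are set, here is a sketch using the Bedrock Converse API from the AWS SDK for JavaScript; the model ID and the chosen values are assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function ask(prompt: string) {
  const response = await client.send(
    new ConverseCommand({
      modelId: "anthropic.claude-3-haiku-20240307-v1:0", // assumed model ID
      messages: [{ role: "user", content: [{ text: prompt }] }],
      inferenceConfig: {
        temperature: 0.2, // low value: more conservative, coherent answers
        topP: 0.9, // sample from the smallest set of words covering 90% probability
      },
      // Top K is model-specific; Anthropic models accept it as an extra field
      additionalModelRequestFields: { top_k: 50 },
    })
  );
  return response.output?.message?.content?.[0]?.text;
}
&lt;/code&gt;&lt;/pre&gt;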

&lt;h3&gt;
  
  
  5. Amazon SageMaker
&lt;/h3&gt;

&lt;p&gt;Another important AWS service for the exam is Amazon SageMaker. SageMaker is a managed service used by data and machine learning engineers to build, train, and deploy machine learning models. Like other AWS services, it offers a variety of features. I will focus on those that may appear in the exam, although I encountered some unexpectedly challenging questions during my session. These questions felt as if they were taken from the Machine Learning Specialty exam question set.&lt;/p&gt;

&lt;p&gt;One of the most important offerings of SageMaker is &lt;strong&gt;SageMaker Studio&lt;/strong&gt;. At first glance, it looks like a managed Jupyter Notebook where a machine learning engineer can write Python code, but it is way more than that. Part of SageMaker Studio is &lt;strong&gt;Data Wrangler&lt;/strong&gt;, used for feature engineering and data preparation before training. From Data Wrangler, we can publish data into the &lt;strong&gt;SageMaker Feature Store&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Another part of SageMaker Studio is &lt;strong&gt;SageMaker Clarify&lt;/strong&gt;. It is used to evaluate foundation models against AWS-provided metrics or metrics provided by you. It even lets you leverage human intervention (have your employees evaluate the model, or use Ground Truth). SageMaker Clarify has a specific feature you should be aware of for the exam: &lt;strong&gt;Model Explainability&lt;/strong&gt;. It is used to explain why a model produces certain outputs and which features influenced those outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker Ground Truth&lt;/strong&gt; is another sub-service the exam expects you to know about. It is based on &lt;strong&gt;Reinforcement Learning from Human Feedback (RLHF)&lt;/strong&gt;: whenever you see this keyword, think of Ground Truth, and vice versa. Ground Truth is used to review models, do customizations, and perform evaluations based on human feedback. &lt;/p&gt;

&lt;p&gt;In terms of ML governance, you should be aware of &lt;strong&gt;SageMaker Model Cards&lt;/strong&gt;, &lt;strong&gt;SageMaker Model Dashboard&lt;/strong&gt;, and &lt;strong&gt;SageMaker Role Manager&lt;/strong&gt;. Model Cards lets you create cards with essential information about a model. Model Dashboard is a centralized repository for ML models; it displays insights for each model, such as risk ratings, model quality, and data quality. Role Manager lets you create and define roles for AWS users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker Model Monitor&lt;/strong&gt; lets you monitor the quality of your models deployed in production. You can also create alerts for your models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SageMaker Pipelines&lt;/strong&gt; allows you to create pipelines for training and deploying models. Whenever the exam asks about MLOps-related services, SageMaker Pipelines is most likely the correct answer.&lt;/p&gt;

&lt;p&gt;Model fine-tuning: in the exam, you might face questions about model fine-tuning. You may want to use fine-tuning when you want to take an existing model and do some additional training on it with your own data. &lt;strong&gt;SageMaker JumpStart&lt;/strong&gt; is one of the places where you can start fine-tuning a model.&lt;/p&gt;

&lt;p&gt;The exam also likes to compare fine-tuning with other techniques for adapting an LLM. If you are faced with a comparison based on price, keep in mind the following order, from least to most expensive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt engineering (excluding RAG): least expensive&lt;/li&gt;
&lt;li&gt;RAG: more expensive than other prompt engineering techniques because it usually requires the presence of a vector database&lt;/li&gt;
&lt;li&gt;Instruction-based fine-tuning: a fine-tuning approach that uses labeled data to modify the weights of a model. It requires model training, which demands specific hardware, so it is considered more expensive than RAG&lt;/li&gt;
&lt;li&gt;Domain adaptation fine-tuning: uses unlabeled data for fine-tuning; it is the most expensive approach&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Other SageMaker sub-services you may want to look up are the following: &lt;strong&gt;SageMaker Canvas&lt;/strong&gt;, &lt;strong&gt;MLflow for SageMaker&lt;/strong&gt;, and &lt;strong&gt;Automatic Model Tuning&lt;/strong&gt;. Moreover, SageMaker provides a bunch of built-in machine learning algorithms (for example: XGBoost, DeepAR, Seq2Seq, etc.). You may want to check them out so you can at least recognize them if they pop up during the certification.&lt;/p&gt;

&lt;p&gt;SageMaker is an advanced topic. It’s somewhat surprising that AWS expects such a deep level of knowledge about it, especially considering the exam is recommended for individuals who may never use this product. If you're comfortable writing a few lines of code and are familiar with Jupyter Notebooks, I recommend doing some hands-on practice with SageMaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. AWS Managed AI Services
&lt;/h3&gt;

&lt;p&gt;AWS offers a comprehensive list of AI services that are managed and trained by AWS itself. The services you should be aware of are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Comprehend and Amazon Comprehend Medical: extract relevant information from documents of all kinds.&lt;/li&gt;
&lt;li&gt;Amazon Translate: an on-demand translation service, think of it as an on-demand version of Google Translate.&lt;/li&gt;
&lt;li&gt;Amazon Transcribe and Amazon Transcribe Medical: speech-to-text service.&lt;/li&gt;
&lt;li&gt;Amazon Polly: text-to-speech service.&lt;/li&gt;
&lt;li&gt;Amazon Rekognition: used for image recognition and classification.&lt;/li&gt;
&lt;li&gt;Amazon Forecast: used with time series data to forecast stuff. It has been discontinued by AWS but is still part of the exam.&lt;/li&gt;
&lt;li&gt;Amazon Lex: it is similar to Amazon Q, or an Amazon Bedrock agent. It is technically Alexa as a service.&lt;/li&gt;
&lt;li&gt;Amazon Personalize: recommendation service.&lt;/li&gt;
&lt;li&gt;Amazon Textract: used to extract text from images (OCR).&lt;/li&gt;
&lt;li&gt;Amazon Kendra: it is a document search service. It is somewhat similar to Amazon Q, but it is way more restricted and it cannot do summarization (good idea to keep this in mind!)&lt;/li&gt;
&lt;li&gt;Amazon Mechanical Turk: it is not necessarily an AI service. With Mechanical Turk you rely on a human workforce to carry out certain tasks for machine learning, such as labeling, classification, and data collection. &lt;/li&gt;
&lt;li&gt;Amazon Augmented AI (A2I): like Mechanical Turk, it is not necessarily a managed AI service. It is a service that lets you conduct human reviews of machine learning models. It can use Mechanical Turk under the hood.&lt;/li&gt;
&lt;li&gt;AWS DeepRacer: this is also an interesting thing to mention. It is a game where you use reinforcement learning to drive a race car. While DeepRacer is still part of the exam, the service has been discontinued by AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exam might present a task and it might ask which service would be able to solve that task. It also might put one of these services head-to-head with Bedrock or Amazon Q.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. AI Challenges and Responsibilities
&lt;/h3&gt;

&lt;p&gt;The exam will ask you about generative AI challenges and how to overcome them. A few challenges you should keep in mind are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regulatory violation&lt;/li&gt;
&lt;li&gt;Social risks&lt;/li&gt;
&lt;li&gt;Data security and privacy concerns&lt;/li&gt;
&lt;li&gt;Toxicity&lt;/li&gt;
&lt;li&gt;Hallucinations&lt;/li&gt;
&lt;li&gt;Nondeterminism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should be able to identify the challenges of a given AI application. You might be asked to find solutions to overcome some of these challenges. For example, to address hallucinations, you can use Knowledge Bases and Retrieval-Augmented Generation (RAG) techniques. To mitigate toxicity, you can implement Guardrails. To reduce nondeterminism, you can adjust the model's hyperparameters (such as temperature, Top P, or Top K).&lt;/p&gt;

&lt;p&gt;Another important topic that may arise is governance. Governance refers to a set of practices that must be followed when developing AI products. For example, you should be able to address the ethical concerns of an AI-based solution, consider bias and fairness, adhere to regulatory and compliance requirements, and ensure data lineage and data quality. There are several AWS services you should be familiar with when discussing governance, including AWS Config, Amazon Inspector, AWS Audit Manager, AWS Artifact, AWS CloudTrail, and AWS Trusted Advisor. You should understand the purpose of each of these services and be able to recognize them in the context of governance-related questions.&lt;/p&gt;

&lt;p&gt;Generative AI Security Scoping Matrix: it is a framework designed to identify and manage security risks associated with deploying GenAI applications.&lt;br&gt;
It is used to classify your app into 5 defined GenAI scopes, from low to high ownership:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scope 1: your app is using public GenAI services&lt;/li&gt;
&lt;li&gt;Scope 2: your app is using a SaaS with GenAI features&lt;/li&gt;
&lt;li&gt;Scope 3: your app is using a pre-trained model&lt;/li&gt;
&lt;li&gt;Scope 4: your app is using a fine-tuned model&lt;/li&gt;
&lt;li&gt;Scope 5: your app is using a self-trained model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. AWS Security Services and Other Services
&lt;/h3&gt;

&lt;p&gt;For the exam, you should be aware of a list of AWS services. As with the other topics, surface-level knowledge of them is enough. Most importantly, you should know when to use which.&lt;/p&gt;

&lt;p&gt;The most important service you will be asked about is &lt;strong&gt;AWS Identity and Access Management (IAM)&lt;/strong&gt;. It is used to define roles and access policies. Whenever you, as a user, want to perform any action in your AWS account, you need the necessary permissions. These permissions are granted through roles and policies. Similarly, when one AWS service needs to interact with another, it must have an assigned role that grants the required access. In some cases, this interaction can also be facilitated through service policies. The exam does not go into detail about when to use roles versus policies. The key point to remember is that whenever you are asked about security, think of IAM.&lt;/p&gt;

&lt;p&gt;Another important service that will pop up is S3. S3 is an object storage service; think of it as Google Drive on steroids. Whenever the exam asks about storage for model input/output, you should default to S3.&lt;/p&gt;

&lt;p&gt;EC2 is also an important service. EC2 provides virtual machines. In the context of machine learning, you need EC2 instances for training and inference. There are several types of EC2 instances. For the exam, you may want to remember the following ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P3, P4, P5, G3, G6 instances: these are instances with GPUs attached; they can be used both for training and for inference;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Trainium&lt;/strong&gt; and &lt;strong&gt;AWS Inferentia&lt;/strong&gt;: these power instances purpose-built for training and inference, respectively, at a lower cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exam might mention Spot Instances. Spot Instances are spare EC2 capacity available at a lower cost. They are cheaper than On-Demand Instances, but the trade-off is that they can be interrupted if AWS needs the capacity for other purposes.&lt;/p&gt;

&lt;p&gt;Networking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You should know what a VPC is. A VPC (Virtual Private Cloud) is a virtual private network, isolated by default from internet traffic or other AWS services. Certain resources, such as EC2 instances or most databases, need to be placed in a VPC to function properly. You can provide internet access to the VPC if desired.&lt;/li&gt;
&lt;li&gt;The exam often asks about VPC Endpoints in terms of security. VPC Endpoints allow communication with AWS services from a private VPC, one that does not have internet access. When using a VPC Endpoint, the traffic does not pass through the public internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch: used for monitoring and logging&lt;/li&gt;
&lt;li&gt;CloudTrail: used to keep an audit trail of every action taken in an AWS account&lt;/li&gt;
&lt;li&gt;AWS Config: used to enforce compliance in an account&lt;/li&gt;
&lt;li&gt;AWS Lambda: serverless functions that run only when needed. In the context of this exam, they are usually used as integration glue between two services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Courses and Practice Exams that I Recommend
&lt;/h2&gt;

&lt;p&gt;The previous section aimed to present what you need to know to pass the exam. It is not comprehensive material; it might have missing topics or inaccuracies. If you want a more robust preparation plan, I recommend enrolling in a paid course.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://skillbuilder.aws/" rel="noopener noreferrer"&gt;AWS Skill Builder&lt;/a&gt; is the official learning portal run by AWS Training and Certification. They have a learning path for the AI practitioner exam. I personally did not use this because the video courses in the learning path do not strictly focus on exam topics. &lt;/p&gt;

&lt;p&gt;I recommend taking one of the Udemy courses from Stephane Maarek or Frank Kane. For my preparation, I used Stephane Maarek's course. I'm also familiar with Frank's teaching style, making his course a solid recommendation as well.&lt;/p&gt;

&lt;p&gt;Stephane's course was the first available on the market. I purchased it on the day of its release. Having taken his other AWS certification courses, I appreciate his organized content and clear presentation, which makes note-taking easy. Initially, the course had some gaps in the required material, but it was updated based on student feedback. I confidently recommend it.&lt;/p&gt;

&lt;p&gt;In addition to courses, taking a practice exam before the live exam is beneficial. For this, Skill Builder is an excellent option. It offers a free practice exam with 20 questions, which is essential for anyone preparing for this certification. If you're willing to pay, Skill Builder also provides a comprehensive practice exam with 85 questions and explanations.&lt;/p&gt;

&lt;p&gt;On Udemy, Stephane also offers a set of practice exams. Having completed everything on Skill Builder, I did not purchase these, so I cannot review them.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Experience Taking the Exam
&lt;/h2&gt;

&lt;p&gt;I took the exam on Thursday morning from home through Pearson. I had no issues this time, everything went smoothly.&lt;/p&gt;

&lt;p&gt;I went through all 85 questions within an hour, after which I spent the next 20 minutes reviewing the questions I had flagged. As I hinted before, the exam was not easy and had its fair share of challenges. Most of the questions had the difficulty I expected, but many seemed to be taken directly from the Machine Learning Specialty question set, or at least that’s how it felt. Moreover, the wording of some questions felt really awkward, making me guess what the author was looking for. I suppose this is a downside of taking a beta exam.&lt;/p&gt;

&lt;p&gt;I did not encounter any new types of questions, all of them were multiple choice and multiple selection. These types of questions are not a new innovation from AWS. Anyone who has taken a Microsoft Azure exam should be familiar with them. In my opinion, they do not affect the difficulty of the exam. Personally, I detest the ordering type of questions, while I prefer the case studies.&lt;/p&gt;

&lt;p&gt;That being said, I passed without issues. My score was lower than I expected, but at the end of the day, it’s not something I care that much about.&lt;/p&gt;

&lt;h2&gt;
  
  
  My List of Recommendations for You for the Exam
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Keep an eye on the clock. There are 85 questions, and you have less than 1.5 minutes per question. Don't spend too much time on any single question. Many of them can be answered quickly as long as you're well-prepared. Try to select the correct answer and move on.&lt;/li&gt;
&lt;li&gt;If you're unsure about a question, flag it and move on. You'll have time to return to it later and figure it out. It's important not to panic if you don't know something right away.&lt;/li&gt;
&lt;li&gt;You will encounter questions you don't know the answer to, and that's okay. The official exam guide is intentionally vague. Any course you take will have some gaps. Don't be discouraged if you come across something completely new during the exam. The important thing is to pass. Your final score doesn't matter much.&lt;/li&gt;
&lt;li&gt;The exam is challenging, but with adequate preparation, anyone should be able to pass it. You'll be fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Closing Notes
&lt;/h2&gt;

&lt;p&gt;Ultimately, I enjoyed the process of preparation and taking the exam itself. You should enjoy yours too! Even though it may not have a significant impact on my career, I’m proud that I was able to pass it.&lt;/p&gt;

&lt;p&gt;I wish happy learning and good luck to anyone preparing to become an AI Practitioner!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
      <category>genai</category>
      <category>career</category>
    </item>
    <item>
      <title>Recertification and the Current State of AWS Exams</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Wed, 14 Aug 2024 14:30:22 +0000</pubDate>
      <link>https://forem.com/aws-builders/recertification-and-the-current-state-of-aws-exams-2872</link>
      <guid>https://forem.com/aws-builders/recertification-and-the-current-state-of-aws-exams-2872</guid>
      <description>&lt;p&gt;Three years ago I wrote one of my first blog posts titled "&lt;a href="https://ervinszilagyi.dev/articles/my-journey-to-become-5-times-aws-certified.html" rel="noopener noreferrer"&gt;My Journey to become 5 times AWS Certified&lt;/a&gt;". This was right after I received my badge for the Solutions Architect Professional certification. I was happy and relieved back then after I successfully passed one of the most challenging exams.&lt;/p&gt;

&lt;p&gt;The unfortunate thing about AWS exams is that they have an expiration date. They are valid for 3 years, after which you either have to recertify, or you accept that your certification has expired and move on with your life. Recertification means sitting through the same exams all over again. Passing a professional exam automatically renews the corresponding associate exams, hence it was enough for me to take both the DevOps Professional and Solutions Architect Professional exams to keep all of my 5 badges active.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;p&gt;For the preparation, I used the same resources as 3 years ago. For the DevOps exam, I had purchased Stephane Maarek's course on Udemy back then and written my notes based on it, which I planned to reuse. The problem with this course is that it was entirely re-shot by Stephane. Admittedly, a lot of topics have changed. There is a new version of this exam, DOP-C02, which requires knowledge of a set of new AWS services such as Security Hub, Control Tower, Network Firewall, etc. At the same time, AWS tends to introduce updates and new features for most of its offerings. I decided to re-watch the whole course, skimming over certain lessons, and I did not do any practical exercises. Since I work with AWS daily, I did not feel the need for them.&lt;/p&gt;

&lt;p&gt;For the Solutions Architect Professional exam, I had purchased the course from Adrian Cantrill back then. In contrast, this course remained roughly the same as 3 years ago, with additional updates where needed. The Direct Connect portion, for example, received a major overhaul: Adrian added all the DX content from the Advanced Networking course to the Solutions Architect Professional course. Unfortunately, there is some leftover legacy content, which was required for the previous version of the exam but is simply not needed anymore. Services such as Server Migration Service or Simple Workflow Service are no longer in scope for the exam.&lt;/p&gt;

&lt;p&gt;To contrast the two courses: I find it difficult to recommend Stephane's course as the sole source of preparation. Admittedly, Stephane himself recommends going through his associate-level courses before going for the professional ones. The DevOps exam has really raised the bar in terms of difficulty, and I think the course material is not detailed enough for it. Adrian's course, on the other hand, is very lengthy, with lots of practice examples. While this course covers most of the necessary topics for the exam, it still has some things missing (and some unnecessary content, as stated before). I'm confident there will be updates with more topics. As guidance for both exams, I strongly recommend downloading the exam guides (found &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-sa-pro/AWS-Certified-Solutions-Architect-Professional_Exam-Guide.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; for the SAP and &lt;a href="https://d1.awsstatic.com/training-and-certification/docs-devops-pro/AWS-Certified-DevOps-Engineer-Professional_Exam-Guide.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt; for the DOP) and going through the Appendix with the list of services. You should at least be able to identify the purpose of each service in case you see them in questions and answer choices. In-depth knowledge of some services is required, but for many others, it is enough to know what they are used for.&lt;/p&gt;

&lt;p&gt;In terms of practice questions, I used the ones from TutorialDojo and also the ones from Digital Cloud Training. As with the courses, I had done my preparation with the same question sets 3 years ago, but I was not able to recall any of those questions. These question sets are good, but unfortunately, they are starting to show their age, especially those from TutorialDojo.&lt;/p&gt;

&lt;p&gt;The DevOps question sets from either of these vendors do not reflect the difficulty you will face in the exam. I was probably just unlucky, but the questions I got in the exam were way more challenging than expected. Many of them focused on AWS Organizations, Control Tower, Security Hub, and other newer features. These services are barely covered in the question sets from TutorialDojo and Digital Cloud Training.&lt;/p&gt;

&lt;p&gt;The Solutions Architect Professional question set from TutorialDojo is a bit better, and the one from Digital Cloud is fantastic, to the point that for certain questions in the live exam you will have a deja-vu feeling.&lt;/p&gt;

&lt;p&gt;A more representative practice exam comes straight from AWS on their SkillBuilder site. Sadly, the question set contains only 20 questions, but at least it is free. AWS offers these practice questions for each of their certifications, so I strongly recommend going through the ones for your exam.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exam Experience
&lt;/h2&gt;

&lt;p&gt;I took the DevOps exam in May, after which I took the Solutions Architect Professional in August. To begin with my experience with the DevOps exam, I must confess that I had a rough time. I was aware that a new version of this exam had been released around a year ago, but I did not think it would be so different from the previous one.&lt;/p&gt;

&lt;p&gt;In my opinion, the exam has become way more challenging. In my case, most of the questions were wordy and lengthy, and many were multiple-selection. I understand that this is what we should expect from a professional-level exam, but I felt that this exam crossed a threshold and was testing my sanity instead of my knowledge. Usually, I don't even bother about the clock, since I tend to be fast enough to finish in time, but in this case, I had to use all the time I had.&lt;/p&gt;

&lt;p&gt;In terms of question types, I had all the usual CI/CD and cloud automation-related problems, stuff you might expect. Aside from that, I had way too many questions about AWS Organizations and Control Tower. In fact, at some point, I was rolling my eyes while encountering yet another AWS Organizations question. Other significant services touched were the following: Security Hub, AWS Config, AWS Lambda, S3 (I had a question about S3 Object Lambda as well), Storage Gateway, FSx (I got 2 questions which touched FSx for NetApp ONTAP, a topic not covered by any course at the time of writing this article), EventBridge, CloudWatch (Logs, Dashboards, Synthetics), and many other things I don't remember.&lt;/p&gt;

&lt;p&gt;Aside from this, there were some borderline infuriating questions. To provide an example, the exam expected me to know what kind of condition should (or should not) allow a certain action in a bucket policy, or what exactly a certain AWS Config rule does (more specifically, whether a Config rule exists for a certain check). Memorizing such things is a waste of time in my opinion, and I think it proves basically nothing.&lt;/p&gt;

&lt;p&gt;Continuing with the Solutions Architect Professional exam, most of the things I was asked were in the realm of what I had expected. Even though this exam essentially throws every AWS product at you, you only need in-depth experience with a handful of services. &lt;/p&gt;

&lt;p&gt;Nevertheless, it is a challenging exam, and it can have some surprises. For example, I had a question involving Lambda SnapStart. SnapStart is a newer feature used to improve cold starts for Lambdas written in Java. I was closely monitoring this functionality when it was released by AWS, since I am a Java developer at heart. Back then, I did not consider it as revolutionary as AWS was advertising it. As a tangent, in my opinion, if you want fast cold starts, you either go for GraalVM or simply drop Java and write your Lambda in anything else. A native option such as Rust (no fanboy, I'm just stating &lt;a href="https://maxday.github.io/lambda-perf/" rel="noopener noreferrer"&gt;facts&lt;/a&gt;) would offer way better cold starts.&lt;/p&gt;

&lt;p&gt;Anyway, aside from SnapStart, most of the questions were okay. In some cases, the wording was abysmal for either the questions or the answer options, but I'm not a native English speaker, so I won't complain. A few services I would like to mention, for which I encountered multiple questions: networking and DX, IoT (IoT Core, Greengrass, and some other obscure IoT services; you should know about each of them anyway), containers (ECS, Kubernetes - I had one question where Kubernetes made the most sense), streaming services (Kinesis Streams and Firehose, AWS MSK in one question only), big data and data engineering (Redshift, Glue, EMR, Athena), and many others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is It Worth Getting (Re)Certified?
&lt;/h2&gt;

&lt;p&gt;Is it worth getting recertified? &lt;/p&gt;

&lt;p&gt;As cliche as it sounds, it depends. I tend to believe it was worth it for me. I wanted to maintain the status of having active certifications in my current workplace. Also, I wanted to be up to date with the latest changes in AWS.&lt;/p&gt;

&lt;p&gt;Will it be worth it for you? I don't know.&lt;/p&gt;

&lt;p&gt;In case you are reading this article before having any certification, a more important question for you would be if it is worth getting certified.&lt;/p&gt;

&lt;p&gt;If you stumbled on this blog post just before your exam, you must know that you did not waste your time and money. You acquired useful knowledge that you will be able to apply even outside of the AWS cloud.&lt;/p&gt;

&lt;p&gt;On the other hand, if you are still considering an AWS certification, any type of AWS certification, you may take into consideration the following as well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You are being tested on volatile technology. This means that a lot of changes happen to many of the AWS products while you are doing your preparation. AWS constantly releases upgrades, making products more usable with more features, and this is a good thing. From a certification perspective, however, these new upgrades are not introduced into the required curriculum right away. What might happen is that your knowledge is already outdated by the moment you get your certification badge. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Admittedly, even if certain updates are rolled out and you do not learn about them, it is not the end of the world. Similarly, in the ballpark of volatility, AWS can retire services. Unfortunately, this has happened more often lately. For example: &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa12uma335v8burwdjiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffa12uma335v8burwdjiv.png" alt="Jeff Bar Tweet about retiring AWS products" width="593" height="1077"&gt;&lt;/a&gt;&lt;br&gt;
Most of these AWS products are part of the currently required learning material for certifications. In fact, CodeCommit is a pivotal part of the DevOps exam. Aside from these, other services were sunset, such as OpsWorks, Server Migration Service, Snowmobile, and others I cannot remember. The thing is that the exam expects in-depth knowledge about these products, but that knowledge becomes outdated right away. I understand that what you are learning might carry over to alternative solutions, but many of those are not drop-in replacements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The next point might apply to some specialized exams focusing on certain areas of AWS, such as the DevOps exam. You may not benefit as much from what you are learning. Your workplace might use alternative tools, or your role might not correspond with how AWS envisions it. To give you an example, I worked as a DevOps/Platform Engineer in the past. We were using AWS for the infrastructure. We were using Terraform for IaC; we did not touch CloudFormation at all. We were not using SAM for Lambda functions, since Terraform could deploy Lambda functions more cleanly without all the baggage of a Serverless Application. We were working for a huge organization, meaning that account creation and managing AWS Organizations was the role of an entirely different team. AWS Config rules were managed by the security team, an entirely different department. For CI/CD we were using GitHub with GitHub Actions. As you can see, we diverged more and more from how AWS sees a DevOps engineer. AWS has every right to ask questions about its services, and I strongly believe that all the CloudFormation and CodeCommit/CodeBuild/CodeDeploy questions have their place in the exam. Ultimately, my point is that I'm not the DevOps engineer the exam portrays, and this certification might not be as beneficial for my short-term career. The same might apply to you as well.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Will the Certifications Help Me Get a Job?
&lt;/h2&gt;

&lt;p&gt;During my career, I've seen a handful of job descriptions listing a certification as a requirement. They do exist, but in most cases, the hiring company does not care about certifications. Also, I have not seen any job description explicitly requiring an active certificate from a candidate. As long as your certificate is relatively recent, you should be fine regardless of whether it is still active. And probably nobody will care anyway if you have hands-on experience.&lt;/p&gt;

&lt;p&gt;Will a certificate help me get a job?&lt;/p&gt;

&lt;p&gt;It might help you get noticed or jump the queue in certain cases. But that's about it, or at least that is what I experienced. You will still have to pass the interview after that; no certification will give you an automatic pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;I recently got recertified. I took both the &lt;strong&gt;AWS Certified DevOps Engineer - Professional (DOP-C02)&lt;/strong&gt; and the &lt;strong&gt;AWS Certified Solutions Architect - Professional (SAP-C02)&lt;/strong&gt; exams. Here are the resources I used:&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Certified DevOps Engineer - Professional (DOP-C02):
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Learning Material&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Free or Paid&lt;/th&gt;
&lt;th&gt;Additional Comments&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.udemy.com/course/aws-certified-devops-engineer-professional-hands-on" rel="noopener noreferrer"&gt;Stephane Maarek - AWS Certified DevOps Engineer Professional - Udemy Course&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Video content&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;I recommend only if you did Stephane's associate level Udemy Courses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-devops-engineer-professional-practice-exams" rel="noopener noreferrer"&gt;TutorialDojo practice exams&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;They are good, but they are starting to show their age&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://explore.skillbuilder.aws/learn/course/external/view/elearning/14673/aws-certified-devops-engineer-professional-official-practice-question-set-dop-c02-english" rel="noopener noreferrer"&gt;Exam Prep Official Practice Question Set&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Strongly recommend doing these questions, sadly the set contains only 20 of them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://learn.digitalcloud.training/course/aws-certified-devops-engineer-practice-exams" rel="noopener noreferrer"&gt;Digital Cloud Training practice exams&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;They are similar to TutorialDojo questions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My notes can be found here: &lt;a href="https://github.com/Ernyoke/certified-aws-devops-professional" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/certified-aws-devops-professional&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Certified Solutions Architect - Professional (SAP-C02):
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Learning Material&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Free or Paid&lt;/th&gt;
&lt;th&gt;Additional Comments&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://learn.cantrill.io/" rel="noopener noreferrer"&gt;AWS Certified Solutions Architect - Professional - Adrian Cantrill&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Video content&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;It is a lengthy, 60+ hour course. I recommend it if you can afford the higher price&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://portal.tutorialsdojo.com/courses/aws-certified-solutions-architect-professional-practice-exams/" rel="noopener noreferrer"&gt;TutorialDojo practice exams&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;They are good, but they are starting to show their age&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://explore.skillbuilder.aws/learn/course/external/view/elearning/13270/aws-certified-solutions-architect-professional-official-practice-question-set-sap-c02-english" rel="noopener noreferrer"&gt;Exam Prep Official Practice Question Set&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;I strongly recommend doing these questions; sadly, the set contains only 20 of them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://learn.digitalcloud.training/course/aws-certified-solutions-architect-professional-practice-exams" rel="noopener noreferrer"&gt;Digital Cloud Training practice exams&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Practice Questions&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;td&gt;They are really good and they manage to be very similar to what you will get on the exam&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;My notes can be found here: &lt;a href="https://github.com/Ernyoke/certified-aws-solutions-architect-professional" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/certified-aws-solutions-architect-professional&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>certification</category>
      <category>devops</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Terragrunt for Multi-Region/Multi-Account Deployments</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Tue, 07 May 2024 22:38:24 +0000</pubDate>
      <link>https://forem.com/aws-builders/terragrunt-for-multi-regionmulti-account-deployments-1o1</link>
      <guid>https://forem.com/aws-builders/terragrunt-for-multi-regionmulti-account-deployments-1o1</guid>
<description>&lt;p&gt;For a few years now, I've been working for a company whose products are used by millions. It feels somewhat refreshing to know that my contributions have an impact on the daily lives of so many people. On the other hand, this work also comes with a lot of anxiety when you have to make decisions, even though these decisions are usually made together with a team of other highly experienced individuals.&lt;/p&gt;

&lt;p&gt;Such a decision was the introduction of Terragrunt into our workflow. Why did we need Terragrunt, you may ask? This is what I will try to answer in this article.&lt;/p&gt;

&lt;p&gt;As a disclaimer, this article is subjective, based on my own experience in finding a solution to the problems we had. Usually, there is more than one way to tackle a challenge, and in many cases, there are no perfect solutions. Knowing this, I think it is important to address the weaknesses and limitations of your solution, which this article will do later.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Terragrunt?
&lt;/h2&gt;

&lt;p&gt;Before going into the "whys", I think it is important to know what Terragrunt is. I won't give you marketing points or a definition copied from sites like Wikipedia; I will try to explain the purpose of this tool from my own point of view.&lt;/p&gt;

&lt;p&gt;Essentially, Terragrunt is a wrapper around Terraform, acting as an orchestrator over multiple Terraform modules. As developers, we have to organize our Terraform code in &lt;a href="https://developer.hashicorp.com/terraform/language/modules" rel="noopener noreferrer"&gt;modules&lt;/a&gt;. Under the hood, each module gets embedded inside a Terraform project and is deployed individually. Modules communicate with each other through outputs and &lt;a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependency" rel="noopener noreferrer"&gt;dependency&lt;/a&gt; blocks, while the dependency tree and rollout order are determined by Terragrunt.&lt;/p&gt;
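
&lt;p&gt;To make this concrete, here is a minimal sketch of a &lt;code&gt;dependency&lt;/code&gt; block (the module paths and output names are illustrative, not taken from a real project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# terragrunt.hcl of a module that needs the VPC ID

dependency "vpc" {
  # relative path to the terragrunt.hcl of the dependency
  config_path = "../vpc"
}

inputs = {
  # consume an output exposed by the vpc module;
  # Terragrunt makes sure the vpc module is rolled out first
  vpc_id = dependency.vpc.outputs.vpc_id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;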

&lt;h2&gt;
  
  
  Why Terragrunt?
&lt;/h2&gt;

&lt;p&gt;To understand why Terragrunt was chosen, it would make sense to go through a timeline of challenges we encountered. &lt;/p&gt;

&lt;p&gt;Let's assume we have an application spanning a few micro-services, with an SQL database, some static assets, and a few ETL jobs bringing in data from external providers. We decide we want to migrate everything to AWS. Our users are from all over the globe, but our main focus is Europe and the US. We have to serve different data to EU users and US users; moreover, we would like to reduce the latency of the responses. So it makes sense to deploy the whole stack in two different regions. We also want to have the application deployed separately for development and testing, for which we can use different AWS accounts.&lt;/p&gt;

&lt;p&gt;For IaC we decided to use Terraform, because we had the most experience with it compared to other options. With this in mind, the following events happened:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We started writing our Terraform code. We put everything in a single Terraform project. We relied on &lt;code&gt;tfvars&lt;/code&gt; files to have inputs for different environments and regions.&lt;/li&gt;
&lt;li&gt;We soon ran into a scaling problem: an &lt;code&gt;apply&lt;/code&gt; went from taking a few minutes to a few tens of minutes. Moreover, we ran into some communication and deployment issues, with certain changes being deployed to production before we wanted them.&lt;/li&gt;
&lt;li&gt;Our Terraform project got bigger and bigger, so we decided to slice it into smaller pieces. We introduced the internal concept of "stacks" (this was well before the introduction of Terraform Cloud Stacks). From our point of view, a "stack" was essentially a Terraform project deploying a well-defined part of our infrastructure. Each stack could use resources deployed by other stacks by relying on Terraform outputs and &lt;a href="https://developer.hashicorp.com/terraform/language/state/remote-state-data" rel="noopener noreferrer"&gt;&lt;code&gt;terraform_remote_state&lt;/code&gt;&lt;/a&gt; (see the sketch after this list), or simply by using data sources.&lt;/li&gt;
&lt;li&gt;With the introduction of stacks, we had different projects for networking, databases, ETL (we mainly used AWS Batch), storage (S3 buckets), and so on. This worked for a while, until we ran into another problem. At first, it was easy to follow which stack depended on which other stack, but soon we ran into circular dependencies: stack A could create resources used by stack B, while also relying on resources created by stack B. Obviously, this is bad, and at this point, there was no entity to check and police our dependencies.&lt;/li&gt;
&lt;li&gt;Moreover, we ran into yet another problem: certain resources were needed only in certain environments. For example, we needed a read replica only for prod; for testing and development we could get by with just the main database. In the beginning, we solved this with conditions deciding whether to deploy a resource in the current environment or not. At a certain point, we noticed that we had to put these conditions in many places, adding a lot of complexity baggage to our infrastructure code.&lt;/li&gt;
&lt;li&gt;So we decided to introduce Terragrunt. &lt;/li&gt;
&lt;/ol&gt;
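
&lt;p&gt;To illustrate point 3, a stack could read another stack's outputs roughly like this (a sketch; the bucket, key, and output names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# inside the consuming stack
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "my-company-tf-states"           # hypothetical state bucket
    key    = "networking/terraform.tfstate"   # state of the networking stack
    region = "us-east-1"
  }
}

# referencing an output of the networking stack:
# data.terraform_remote_state.networking.outputs.private_subnet_ids
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;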

&lt;p&gt;To answer the initial question, we chose Terragrunt because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It solved the dependency hell we had encountered. With Terragrunt, we have to be explicit in defining what our current module depends on.&lt;/li&gt;
&lt;li&gt;It fits the multi-region/multi-account approach. If we design our modules wisely, we use only the necessary modules for each region/environment. The catch is that we have to modularize our code and do it adequately, which might not be as easy as we would expect.&lt;/li&gt;
&lt;li&gt;By introducing versioning for our modules, we could evolve different environments at their own pace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, all of this comes with a price: refactoring. Terragrunt relies on Terraform modules, and our initial code was not as modular as we might have expected. So we had to do a lot of refactoring, which came with an even bigger challenge: state management and transferring resources between states.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Terragrunt Work?
&lt;/h2&gt;

&lt;p&gt;To use Terragrunt, first, we have to be comfortable with &lt;a href="https://developer.hashicorp.com/terraform/language/modules" rel="noopener noreferrer"&gt;Terraform modules&lt;/a&gt;. The concept of modularization is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform provides building blocks such as resources and data sources;&lt;/li&gt;
&lt;li&gt;Some of these resources are often used together (for example: a database and a Route 53 record for its hostname);&lt;/li&gt;
&lt;li&gt;It would make sense to group these resources in a reusable container. These reusable containers are called modules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modules can communicate between themselves with inputs and outputs. Terragrunt requires that all of our Terraform resources be part of modules.&lt;/p&gt;
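
&lt;p&gt;As a quick refresher, a minimal module and its invocation could look like this (a sketch with illustrative names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# modules/alerts/main.tf - the module itself
variable "topic_name" {
  type = string
}

resource "aws_sns_topic" "this" {
  name = var.topic_name
}

output "topic_arn" {
  value = aws_sns_topic.this.arn
}

# main.tf of the caller - reusing the module
module "alerts" {
  source     = "./modules/alerts"
  topic_name = "my-alerts"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;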

&lt;h3&gt;
  
  
  Setting Up a Terragrunt Project
&lt;/h3&gt;

&lt;p&gt;The official Terragrunt documentation has &lt;a href="https://terragrunt.gruntwork.io/docs/features/keep-your-terraform-code-dry/" rel="noopener noreferrer"&gt;a good article&lt;/a&gt; about how to set up a Terragrunt project and where to place modules. In fact, there is also a &lt;a href="https://github.com/gruntwork-io/terragrunt-infrastructure-live-example" rel="noopener noreferrer"&gt;repository&lt;/a&gt; on GitHub providing an example of how the creators recommend setting up a Terragrunt project. I certainly recommend going through that repository, because it is a good reference for a starting point. That being said, I like to structure mine a little differently. My recommendation is to have a different AWS account for each environment. Getting a new account is usually relatively easy to accomplish, even in a corporate environment (your workplace is most likely using AWS Organizations to manage accounts). Having multiple accounts does not incur additional costs; we only pay for what we use.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/gruntwork-io/terragrunt-infrastructure-live-example" rel="noopener noreferrer"&gt;terragrunt-infrastructure-live-example&lt;/a&gt;, the split between environments is done by &lt;strong&gt;prod&lt;/strong&gt; and &lt;strong&gt;non-prod&lt;/strong&gt; accounts. Each of these is further split by region. The &lt;strong&gt;non-prod&lt;/strong&gt; account is also used for the &lt;strong&gt;qa&lt;/strong&gt; and &lt;strong&gt;stage&lt;/strong&gt; environments. This setup can be perfectly acceptable, the one downside being that we have to think about a naming convention for our resources, since &lt;strong&gt;non-prod&lt;/strong&gt; will contain the same cloud resources for both &lt;strong&gt;qa&lt;/strong&gt; and &lt;strong&gt;stage&lt;/strong&gt;. While this is not that big of a deal, I prefer to have one environment per account. My proposal for a Terragrunt project setup looks like this (the GitHub repository for this example project can be found here: &lt;a href="https://github.com/Ernyoke/tg-multi-account" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/tg-multi-account&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tg-multi-account
│   .gitignore
│   global.hcl
│   terragrunt.hcl
│
├───dev
│   │   account.hcl
│   │
│   └───us-east-1
│       │   region.hcl
│       │
│       ├───alb
│       │       terragrunt.hcl
│       │
│       ├───ecs-cluster
│       │       terragrunt.hcl
│       │
│       ├───ecs-services
│       │   └───frontend
│       │           terragrunt.hcl
│       │
│       └───vpc
│               terragrunt.hcl
│
├───prod
│   │   account.hcl
│   │
│   ├───eu-west-1
│   │   │   region.hcl
│   │   │
│   │   ├───alb
│   │   │       terragrunt.hcl
│   │   │
│   │   ├───ecs-cluster
│   │   │       terragrunt.hcl
│   │   │
│   │   ├───ecs-services
│   │   │   └───frontend
│   │   │           terragrunt.hcl
│   │   │
│   │   └───vpc
│   │           terragrunt.hcl
│   │
│   └───us-east-1
│       │   region.hcl
│       │
│       ├───alb
│       │       terragrunt.hcl
│       │
│       ├───ecs-cluster
│       │       terragrunt.hcl
│       │
│       ├───ecs-services
│       │   └───frontend
│       │           terragrunt.hcl
│       │
│       └───vpc
│               terragrunt.hcl
│
├───qa
│   │   account.hcl
│   │
│   ├───eu-west-1
│   │   │   region.hcl
│   │   │
│   │   ├───alb
│   │   │       terragrunt.hcl
│   │   │
│   │   ├───ecs-cluster
│   │   │       terragrunt.hcl
│   │   │
│   │   ├───ecs-services
│   │   │   └───frontend
│   │   │           terragrunt.hcl
│   │   │
│   │   └───vpc
│   │           terragrunt.hcl
│   │
│   └───us-east-1
│       │   region.hcl
│       │
│       ├───alb
│       │       terragrunt.hcl
│       │
│       ├───ecs-cluster
│       │       terragrunt.hcl
│       │
│       ├───ecs-services
│       │   └───frontend
│       │           terragrunt.hcl
│       │
│       └───vpc
│               terragrunt.hcl
│
└───_env
        frontend.hcl
        vpc.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we have 3 environments: &lt;strong&gt;dev&lt;/strong&gt;, &lt;strong&gt;qa&lt;/strong&gt;, and &lt;strong&gt;prod&lt;/strong&gt;. Each environment should be living in a single AWS account. The root of the project contains configuration (&lt;code&gt;locals&lt;/code&gt;) shared by every environment. If we go inside a directory acting as an environment/account, we have the account-specific properties (&lt;code&gt;account.hcl&lt;/code&gt;). We also create directories here for each region in which we would want to provision things. Navigating one step deeper we find the region-specific configuration (&lt;code&gt;region.hcl&lt;/code&gt;) and all the modules we would like to have in that region.&lt;/p&gt;

&lt;p&gt;Now let's focus on the Terragrunt modules. If we open a configuration, for example for the VPC, a possible implementation would be the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tfr:///terraform-aws-modules/vpc/aws//.?version=5.8.1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;global_vars&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;read_terragrunt_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"global.hcl"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;project_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;global_vars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${local.project_name}-vpc"&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;cidr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;azs&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.2.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.3.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.101.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.102.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.103.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;enable_nat_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;enable_vpn_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To give a short explanation of what we have here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;code&gt;terraform&lt;/code&gt; block we have to specify a path to a Terraform module. In this case, we use the VPC module from the &lt;a href="https://github.com/terraform-aws-modules/terraform-aws-vpc" rel="noopener noreferrer"&gt;terraform-aws-modules&lt;/a&gt; open source project, for which we can either use the URL of the repository or the URL provided by the Terraform Registry. We don't necessarily need to rely on other people's code: we can use modules maintained by ourselves by providing a link to our remote Git repository, or we can even point to a local path on our drive.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;include&lt;/code&gt; block is optional. It is used for "inheritance": we can think of it as if the parent configuration file were copy/pasted into the current configuration file. This can be useful because we can share common inputs between environments/regions. When we include them in the current configuration, Terragrunt automatically provides them to the Terraform module. We also have the ability to append to or override certain inputs as we wish.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;locals&lt;/code&gt; block essentially acts the same as Terraform &lt;code&gt;locals&lt;/code&gt;. These are local "variables" used in the current configuration. We can also read locals from other configuration files with Terragrunt functions (&lt;code&gt;read_terragrunt_config&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;inputs&lt;/code&gt; are values we provide to the Terraform module. If we use inheritance, the inputs provided by the parent configuration are automatically merged with the current inputs, making our configuration &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" rel="noopener noreferrer"&gt;DRY&lt;/a&gt; but arguably less readable; more on this later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Taking the DRY-ness a step further, we can notice that modules such as &lt;code&gt;vpc&lt;/code&gt; are used in each environment/region with only small configuration differences. What we can do is extract this configuration into a top-level folder, such as &lt;code&gt;_env&lt;/code&gt;, and rely on the inheritance feature discussed before.&lt;/p&gt;

&lt;p&gt;The extracted file will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./_env/vpc.hcl&lt;/span&gt;

&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tfr:///terraform-aws-modules/vpc/aws//.?version=5.8.1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;global_vars&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;read_terragrunt_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"global.hcl"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;project_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;global_vars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;project_name&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${local.project_name}-vpc"&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;cidr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;azs&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"us-east-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.1.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.2.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.3.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;public_subnets&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.101.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.102.0/24"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"10.0.103.0/24"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;enable_nat_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;enable_vpn_gateway&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In case our region is not us-east-1, we can override the availability zones with sensible ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ./qa/eu-west-1/vpc/terragrunt.hcl&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"root"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;find_in_parent_folders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;include&lt;/span&gt; &lt;span class="s2"&gt;"env"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${get_terragrunt_dir()}/../../_env/vpc.hcl"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="err"&gt;  &lt;/span&gt;&lt;span class="nx"&gt;azs&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"eu-west-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More or less this is what we need to know to be able to write Terragrunt code. Now let's discuss deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployments
&lt;/h3&gt;

&lt;p&gt;The deployment of a Terragrunt project can be accomplished with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terragrunt run-all apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we run this command from the root of our project, Terragrunt will attempt to deploy all the resources in all the accounts. This assumes we have the necessary rights to deploy to each account and that Terragrunt knows about the IAM role it can assume to do the provisioning (see: &lt;a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#iam_role" rel="noopener noreferrer"&gt;&lt;code&gt;iam_role&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
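
&lt;p&gt;For reference, pointing Terragrunt at a deployment role is a one-line attribute in the &lt;code&gt;terragrunt.hcl&lt;/code&gt; configuration (the role ARN below is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Terragrunt assumes this role before invoking Terraform
iam_role = "arn:aws:iam::123456789012:role/terragrunt-deployer"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;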

&lt;p&gt;&lt;strong&gt;LATER UPDATE&lt;/strong&gt;: It turns out that it is not possible to execute &lt;code&gt;run-all apply&lt;/code&gt; from the root of the project in this example. Terragrunt will fail to find either &lt;code&gt;global.hcl&lt;/code&gt; or &lt;code&gt;account.hcl&lt;/code&gt;. It appears strange that it cannot find &lt;code&gt;global.hcl&lt;/code&gt;, since this file is in the same directory as the &lt;code&gt;terragrunt.hcl&lt;/code&gt; configuration file from which it is referenced. It is more understandable that it cannot locate &lt;code&gt;account.hcl&lt;/code&gt;: this file is not in the root folder but in a child folder relative to the root, so &lt;code&gt;find_in_parent_folders&lt;/code&gt; will fail to locate it. This was an oversight on my part while building the example project for this article, and I want to apologize for that. Deploying from within an environment works as presented in the upcoming lines.&lt;/p&gt;

&lt;p&gt;In case we don't want to deploy everything everywhere at once, we can simply navigate into the folder of the environment/region where we want to provision resources, and we can execute the command there.&lt;/p&gt;

&lt;p&gt;In case we want to deploy certain modules only in a certain region, we can navigate into the folder of that module and execute the command there. An important thing to note here: if we run &lt;code&gt;terragrunt run-all apply&lt;/code&gt; for a module, all the dependencies of that module will also be deployed. This might be time-consuming if we need to execute it frequently (during development, for example). As a workaround, we can run &lt;code&gt;terragrunt apply&lt;/code&gt; instead (without the &lt;code&gt;run-all&lt;/code&gt; command). This skips rolling out all the dependencies; instead, it relies on outputs cached during previous deployments. If a dependency has never been deployed, the command will fail.&lt;/p&gt;
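
&lt;p&gt;In practice, the difference between the two approaches looks like this (the paths follow the example project above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# deploy everything for the qa environment in eu-west-1, dependencies included
cd qa/eu-west-1
terragrunt run-all apply

# deploy a single module, relying on the cached outputs of its dependencies
cd qa/eu-west-1/vpc
terragrunt apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;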

&lt;p&gt;Terragrunt commands are similar to what we've been accustomed to while using Terraform. We can see the plan by executing the &lt;code&gt;plan&lt;/code&gt; command (with certain limitations, more about this below), we can import resources with the &lt;code&gt;import&lt;/code&gt; command, we can &lt;code&gt;force-unlock&lt;/code&gt; the state of a module in case it got stuck with an unsuccessful apply, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides and Limitations
&lt;/h2&gt;

&lt;p&gt;Like every other tool, Terragrunt has its limitations, especially if we are coming from a pure Terraform setup. While I consider Terragrunt a valuable and useful tool, I think it is very important to know its limitations if we consider adopting it.&lt;/p&gt;

&lt;p&gt;The following is a list of challenges that I've encountered during its adoption and day-to-day usage. I imagine there are plenty of others faced by other people; this is not an exhaustive list.&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Steep learning curve and complexity of usage&lt;/strong&gt;: if we are new to Terragrunt, we may easily get overwhelmed by all the new concepts we are introduced to. As we get more familiar with it, we face other challenges, such as configuration inheritance. Having to adhere to practices that make our configuration DRY can also make it more challenging to understand and follow.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;The &lt;code&gt;plan&lt;/code&gt; command is broken (at least in certain scenarios)&lt;/strong&gt;: this is stated even in the documentation for the &lt;a href="https://terragrunt.gruntwork.io/docs/reference/cli-options/#run-all" rel="noopener noreferrer"&gt;&lt;code&gt;run-all&lt;/code&gt; command&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[WARNING] Using run-all with plan is currently broken for certain use cases. If you have a stack of Terragrunt modules with dependencies between them—either via dependency blocks or &lt;code&gt;terraform_remote_state&lt;/code&gt; data sources—and you’ve never deployed them, then &lt;code&gt;run-all plan&lt;/code&gt; will fail as it will not be possible to resolve the dependency blocks or &lt;code&gt;terraform_remote_state&lt;/code&gt; data sources!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This might seem like a non-issue at first, but if we also consider the following note for the &lt;code&gt;apply&lt;/code&gt; command...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[NOTE] Using &lt;code&gt;run-all&lt;/code&gt; with apply or destroy silently adds the &lt;code&gt;-auto-approve&lt;/code&gt; flag to the command line arguments passed to Terraform due to issues with shared stdin making individual approvals impossible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;...we can probably guess why blindly doing deployments might be dangerous. That said, the situation might not be as drastic as it seems. My recommendation is that if we have doubts, we should restrict roll-outs to individual modules. For individual modules, we can execute &lt;code&gt;plan&lt;/code&gt; or &lt;code&gt;apply&lt;/code&gt; without the silent auto-approve flag.&lt;/p&gt;

&lt;p&gt;Also, Terragrunt is meant to be used with multiple environments in mind. A successful rollout in a non-prod environment should make us confident and prepared for the production rollout.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;There is no easy way to import Terraform resources (at least none that I'm aware of)&lt;/strong&gt;: in case we have a Terraform project and decide to transform it into a Terragrunt one, we will most likely have to manually import all the resources into the new state. This might be a non-issue if we can destroy our Terraform stack and re-provision everything with Terragrunt, but that might not be possible in a production environment where availability is important.&lt;/p&gt;
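
&lt;p&gt;To give an idea of what such a manual migration involves, importing an existing VPC into the state of a new Terragrunt module could look like this (the resource address and the VPC ID are illustrative; Terragrunt simply forwards the command to &lt;code&gt;terraform import&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cd dev/us-east-1/vpc

# import the already provisioned VPC into this module's state
terragrunt import 'aws_vpc.this[0]' vpc-0123456789abcdef0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;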

&lt;p&gt;4. &lt;strong&gt;Deployment speed&lt;/strong&gt;: Terragrunt runs Terraform under the hood. It invokes Terraform independently for each module, an action that takes time. A mitigation is to keep deployments scoped and apply only what we need. In cases where we need a plan for the whole project (for a security assessment, for example), we will most likely still have to wait a lot to get it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternatives to Terragrunt
&lt;/h2&gt;

&lt;p&gt;At the beginning of this article, we saw why Terragrunt made sense for us for a business-critical solution. In the previous section, we also faced the limitations of this tool. With all of this in mind, we might want to keep an eye on what other alternatives exist.&lt;/p&gt;

&lt;p&gt;Here are a few examples that can be considered instead of Terragrunt.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/terramate-io/terramate" rel="noopener noreferrer"&gt;Terramate&lt;/a&gt;: It seems like a good alternative, and I think it could have been a better choice for certain issues we had. With Terramate the transition from Terraform might have been easier since Terraform projects can be imported seamlessly. Furthermore, we don't have to think right away about how to modularize everything. The reason it was not chosen, is that my team was mainly familiar with Terragrunt. We had no experience with Terramate, so we decided to play it safely.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.hashicorp.com/blog/terraform-stacks-explained" rel="noopener noreferrer"&gt;Terraform Stacks&lt;/a&gt;: at the point of writing this post, it is still not generally available. It was not even considered by us back then, since it was in private preview and nobody had access to it. It might be a good choice in the future, but for now, it is not something we can use.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.hashicorp.com/terraform/language/state/workspaces" rel="noopener noreferrer"&gt;Terraform Workspaces&lt;/a&gt;: they are a similar approach to having different tfvars files per environment/region, a solution we were extensively using. We found that it is not the best choice since it scales poorly if the infrastructure gets bigger and bigger. However, If you start a project, I still recommend sticking to workspaces at the beginning and moving to something afterward, when it is needed.&lt;/li&gt;
&lt;li&gt;Insert any other tool here: understandably, there are many other options out there. When making a decision about something that will have to be maintained by multiple people for a living, we usually go with the tool that has the most support on the internet, is known by most of the team, and generally has a good reputation.&lt;/li&gt;
&lt;/ol&gt;
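
&lt;p&gt;For completeness, this is what the workspace-based approach mentioned in point 3 looks like (a sketch; the &lt;code&gt;tfvars&lt;/code&gt; file name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# one state per environment within the same Terraform project
terraform workspace new dev
terraform workspace select dev
terraform apply -var-file=dev.tfvars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;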

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;In conclusion, Terragrunt is a powerful tool with many functionalities. It is an opinionated way of working with infrastructure. It might not be the best choice for everyone. &lt;/p&gt;

&lt;p&gt;Should I use it for my next project?&lt;br&gt;
It depends. If you have not encountered the issues it aims to solve, then you probably don't need it; it will add considerable maintenance baggage. To quote the Terragrunt author &lt;a href="https://www.reddit.com/r/Terraform/comments/15242e4/comment/jsmoedj/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;[source]&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt; If you're working on a small project (e.g., a solo project or hobby), none of this matters, and you probably don't need Terragrunt. But if you're working at a company that is using Terraform to manage infrastructure for multiple teams and multiple environments, the items above make it hard to create code that is maintainable and understandable, and that's where Terragrunt can be a great option.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/modules" rel="noopener noreferrer"&gt;Terraform Modules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependency" rel="noopener noreferrer"&gt;Terragrunt Dependency Blocs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.hashicorp.com/terraform/language/state/remote-state-data" rel="noopener noreferrer"&gt;The terraform_remote_state Data Source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://terragrunt.gruntwork.io/docs/features/keep-your-terraform-code-dry/" rel="noopener noreferrer"&gt;Keep your Terraform code DRY&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#iam_role" rel="noopener noreferrer"&gt;&lt;code&gt;iam_role&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://terragrunt.gruntwork.io/docs/reference/cli-options/#run-all" rel="noopener noreferrer"&gt;&lt;code&gt;run-all&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>terragrunt</category>
      <category>aws</category>
    </item>
    <item>
      <title>Fluent Bit with ECS: Configuration Tips and Tricks</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Tue, 26 Dec 2023 11:09:34 +0000</pubDate>
      <link>https://forem.com/aws-builders/fluent-bit-with-ecs-configuration-tips-and-tricks-4acp</link>
      <guid>https://forem.com/aws-builders/fluent-bit-with-ecs-configuration-tips-and-tricks-4acp</guid>
      <description>&lt;p&gt;A while ago I wrote a &lt;a href="https://ervinszilagyi.dev/articles/ecs-custom-logging-with-fluentbit.html" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; about Fluent Bit integration with containers running in an ECS cluster. According to my statistics, this post is one of the most viewed on my blog, so I was determined to write a follow-up for it. I've been using Fluent Bit with ECS for more than a year in a business application running in production. During this period I had the chance to make use of and even abuse several features provided by Fluent Bit.&lt;/p&gt;

&lt;p&gt;Generally speaking, my experience with Fluent Bit over the last year was positive. In many cases, I found that it had a steep learning curve, and several times I felt I was doing things that I was not supposed to be doing. In the end, I managed to reconcile myself with how it operates, and I can say for sure that it is a fast and very polished product that can handle huge production workloads.&lt;/p&gt;

&lt;p&gt;In this blog post, I will talk about certain tips and tricks for the Fluent Bit configuration file that I found useful. Some of them might be trivial if you already have experience with it. Nevertheless, I think that they might be helpful for anybody interested in introducing Fluent Bit to their cluster of services.&lt;/p&gt;

&lt;p&gt;This post will not provide a guideline on how to set up Fluent Bit. If you are interested in that, please read my previous post on this topic: &lt;a href="//./ecs-custom-logging-with-fluentbit.md"&gt;ECS Fargate Custom Logging with Fluent Bit&lt;/a&gt;. Moreover, there are several useful articles from AWS, such as this one: &lt;a href="https://aws.amazon.com/blogs/opensource/centralized-container-logging-fluent-bit/" rel="noopener noreferrer"&gt;Centralized Container Logging with Fluent Bit&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fluent Bit Configuration Basics
&lt;/h2&gt;

&lt;p&gt;Fluent Bit can be configured with a &lt;code&gt;fluent-bit.conf&lt;/code&gt; configuration file or with a YAML configuration. We will focus on the so-called classic &lt;code&gt;.conf&lt;/code&gt; format, since at this point the YAML configuration is not that widespread. &lt;/p&gt;

&lt;p&gt;A basic configuration file would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;SERVICE]
    Flush           5
    Daemon          off
    Log_Level       debug

&lt;span class="o"&gt;[&lt;/span&gt;INPUT]
    Name cpu
    Tag  my_cpu

&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name  &lt;span class="nb"&gt;grep
    &lt;/span&gt;Match &lt;span class="k"&gt;*&lt;/span&gt;
    Regex log aa

&lt;span class="o"&gt;[&lt;/span&gt;OUTPUT]
    Name  stdout
    Match my&lt;span class="k"&gt;*&lt;/span&gt;cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can notice that a configuration file can have the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Service&lt;/em&gt;: defines global settings for the Fluent Bit container. Some examples of these kinds of settings are how often the container should flush its content, the logging level of the Fluent Bit agent, or whether we would like to add additional plugins or parsers (more about parsers below).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Input&lt;/em&gt;: defines the source from where the agent will attempt to collect records. Fluent Bit can receive records from multiple sources, such as log streams created by applications and services, Linux/Unix system logs, hardware metrics, Docker events, etc. A full list of inputs can be found in the &lt;a href="https://docs.fluentbit.io/manual/pipeline/inputs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. When we talk about Fluent Bit usage together with ECS containers, most of the time these records are log events (log messages with additional metadata).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Output&lt;/em&gt;: defines the sink, the destination where certain records will go. Fluent Bit supports multiple destinations, such as Elasticsearch, AWS S3, Kafka, or even stdout. For a full list, see the official documentation for &lt;a href="https://docs.fluentbit.io/manual/pipeline/outputs" rel="noopener noreferrer"&gt;outputs&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Filter&lt;/em&gt;: the name of this section is somewhat misleading in my opinion. Filters can be used to manipulate records, not just for filtering and dropping entries. With filters, we can modify certain fields from the records or we can add/remove/rename certain information. A full list of filters can be found &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all of these sections are mandatory in a configuration file. Generally, we need at least the input and output sections. The &lt;code&gt;fluent-bit.conf&lt;/code&gt; file is also referred to as the main configuration file. Besides this file, we can have additional configurations, such as parsers. Parsers are used to read and transform raw input records into structured objects, such as Lua tables (tables are the equivalent of a dictionary/map in other languages). The agent requires this to be able to further process the records with filters.&lt;/p&gt;

&lt;p&gt;We can write our own parsers and load them, or we can rely on the ones provided by Fluent Bit itself. It comes with its own pre-configured &lt;code&gt;parser.conf&lt;/code&gt; file (&lt;a href="https://github.com/fluent/fluent-bit/blob/master/conf/parsers.conf" rel="noopener noreferrer"&gt;https://github.com/fluent/fluent-bit/blob/master/conf/parsers.conf&lt;/a&gt;). These parsers support most of the popular log formats, such as Docker, nginx, or syslog.&lt;/p&gt;
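
&lt;p&gt;For illustration, a minimal custom parser declared in its own file would look something like this (the parser name is arbitrary, and the time fields assume the records carry a &lt;code&gt;time&lt;/code&gt; attribute):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;[PARSER]
    Name        my_json
    Format      json
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;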

&lt;h2&gt;
  
  
  Debugging and Troubleshooting Fluent Bit Configuration File
&lt;/h2&gt;

&lt;p&gt;While working with Fluent Bit, I found myself losing a lot of time with deployments. If I wanted to see the effect of a change I made in the configuration file, I had to rebuild the Fluent Bit image, push it to an ECR repo, restart the main service (which would load the newest version of the sidecar container), and then just wait for the log messages to arrive while hoping to see some meaningful change. This laborious process can be very annoying. It is much wiser to run the container locally and provide some test input to validate a modification to the configuration file.&lt;/p&gt;

&lt;p&gt;We mentioned that we can have several types of &lt;code&gt;INPUT&lt;/code&gt;s. One of them, named &lt;code&gt;dummy&lt;/code&gt;, was purposefully implemented for quick testing. It accepts a pre-defined JSON as input and will repeatedly send this input for processing, simulating a stream of data. Additionally, if we set the &lt;code&gt;OUTPUT&lt;/code&gt; to be &lt;code&gt;stdout&lt;/code&gt;, we get a way of doing "printf debugging".&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;We create a &lt;code&gt;fluent-bit.conf&lt;/code&gt; file with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;INPUT]
    Name   dummy
    Dummy &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"custom dummy"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;[&lt;/span&gt;OUTPUT]
    Name   stdout
    Match  &lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We create a Dockerfile for our Fluent Bit image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;FROM amazon/aws-for-fluent-bit:latest

WORKDIR /

ADD fluent-bit.conf fluent-bit.conf

CMD &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/fluent-bit/bin/fluent-bit"&lt;/span&gt;, &lt;span class="s2"&gt;"-c"&lt;/span&gt;, &lt;span class="s2"&gt;"fluent-bit.conf"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We build the Fluent Bit docker image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker buildx build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; fluent-bit-dummy &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We run the image locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; fluent-bit-dummy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will launch the container that will run until we stop it with &lt;code&gt;Ctrl+C&lt;/code&gt; key combination. This execution will produce an output similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; fluent-bit-dummy
WARNING: The requested image&lt;span class="s1"&gt;'s platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/12/24 16:06:59] [ info] [fluent bit] version=1.9.10, commit=557c8336e7, pid=1
[2023/12/24 16:06:59] [ info] [storage] version=1.4.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2023/12/24 16:06:59] [ info] [cmetrics] version=0.3.7
[2023/12/24 16:06:59] [ info] [output:stdout:stdout.0] worker #0 started
[2023/12/24 16:06:59] [ info] [sp] stream processor started
[0] dummy.0: [1703434019.553880465, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434020.555768799, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434021.550525174, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434022.551563050, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434023.551944509, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434024.550027843, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434025.550901801, {"message"=&amp;gt;"custom dummy"}]
[0] dummy.0: [1703434026.549279385, {"message"=&amp;gt;"custom dummy"}]
^C[2023/12/24 16:07:08] [engine] caught signal (SIGINT)
[0] dummy.0: [1703434027.549678344, {"message"=&amp;gt;"custom dummy"}]
[2023/12/24 16:07:08] [ warn] [engine] service will shutdown in max 5 seconds
[2023/12/24 16:07:08] [ info] [engine] service has stopped (0 pending tasks)
[2023/12/24 16:07:08] [ info] [output:stdout:stdout.0] thread worker #0 stopping...
[2023/12/24 16:07:08] [ info] [output:stdout:stdout.0] thread worker #0 stopped
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with &lt;code&gt;dummy&lt;/code&gt; is that it requires the content of the message to be inline. This can be annoying if we want to provide something more complex. For example, a log event generated by the log router of an ECS container looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:region:0123456789012:task/FluentBit-cluster/13EXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"container_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs-windows-app-task-1-sample-container-cEXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_cluster"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FluentBit-cluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_container_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sample-container"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_definition_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"container_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61f5e6EXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_definition_family"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"windows-app-task"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could provide this JSON document as a one-liner and call it a day. In my opinion, it would be ideal if we could put it into a file and point the input to use the content of that file to generate records. Unfortunately, the &lt;code&gt;dummy&lt;/code&gt; input is dummy and does not support reading from a file. &lt;/p&gt;

&lt;p&gt;A workaround that I've been using to overcome this limitation is to use &lt;code&gt;exec&lt;/code&gt; instead of &lt;code&gt;dummy&lt;/code&gt;. &lt;code&gt;exec&lt;/code&gt; can take the content of the standard output of a script and generate records based on that.&lt;/p&gt;

&lt;p&gt;We can provide a simple bash script that reads and outputs the content of a file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Read the content of the log entry from a file&lt;/span&gt;
&lt;span class="nv"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;log.json&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Echo the output, which will be the input for Fluent Bit&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can alter the configuration file as such:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;SERVICE]
    Flush        5
    Log_Level    info
    Parsers_File /fluent-bit/parsers/parsers.conf

&lt;span class="o"&gt;[&lt;/span&gt;INPUT]
    Name         &lt;span class="nb"&gt;exec
    &lt;/span&gt;Command      /generate.sh
    Tag          dummy.input
    Parser       json

&lt;span class="o"&gt;[&lt;/span&gt;OUTPUT]
    Name   stdout
    Match  &lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Dockerfile should also be modified to have the bash script and the log entry JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;FROM amazon/aws-for-fluent-bit:latest

WORKDIR /

ADD fluent-bit.conf fluent-bit.conf
ADD log.json log.json
ADD generate.sh generate.sh

RUN &lt;span class="nb"&gt;chmod&lt;/span&gt; +x generate.sh

CMD &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/fluent-bit/bin/fluent-bit"&lt;/span&gt;, &lt;span class="s2"&gt;"-c"&lt;/span&gt;, &lt;span class="s2"&gt;"fluent-bit.conf"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this container locally will print the following over and over again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0] dummy.input: [1703444437.645291508, {"source"=&amp;gt;"stdout", "ecs_task_arn"=&amp;gt;"arn:aws:ecs:region:0123456789012:task/FluentBit-cluster/13EXAMPLE", "container_name"=&amp;gt;"/ecs-windows-app-task-1-sample-container-cEXAMPLE", "ecs_cluster"=&amp;gt;"FluentBit-cluster", "ecs_container_name"=&amp;gt;"sample-container", "ecs_task_definition_version"=&amp;gt;"1", "container_id"=&amp;gt;"61f5e6EXAMPLE", "log"=&amp;gt;"10", "ecs_task_definition_family"=&amp;gt;"windows-app-task"}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading Parsers
&lt;/h2&gt;

&lt;p&gt;Parsers should not be part of the main configuration file; they should be placed in a separate file. Adhering to this requirement, Fluent Bit provides a set of stock parsers located at the &lt;code&gt;/fluent-bit/parsers/parsers.conf&lt;/code&gt; path. For these parsers to be used, we have to load them in the service section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
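
&lt;p&gt;For reference, a parser definition inside such a file looks something like this (a minimal sketch based on the stock &lt;code&gt;json&lt;/code&gt; parser; the time settings are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[PARSER]
    Name        json
    Format      json
    Time_Key    time
    Time_Format %d/%b/%Y:%H:%M:%S %z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;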



&lt;p&gt;Parsers can work together with INPUTs, as we have already seen in the case of the &lt;code&gt;exec&lt;/code&gt; INPUT. We can also have a separate FILTER do the parsing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name     parser
    Match    dummy.&lt;span class="k"&gt;*&lt;/span&gt;
    Key_Name data
    Parser   dummy_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a reminder, FILTERs are used to modify data. We will discuss different types of FILTERs in the upcoming paragraphs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modify Records with FILTER
&lt;/h2&gt;

&lt;p&gt;The most basic FILTER operation is &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/modify" rel="noopener noreferrer"&gt;Modify&lt;/a&gt;. Modify can be used to make a number of changes to a record, as sketched below:&lt;br&gt;
    - add fields with static values&lt;br&gt;
    - overwrite fields with static values&lt;br&gt;
    - remove fields&lt;br&gt;
    - rename fields&lt;/p&gt;
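
&lt;p&gt;A single FILTER section can combine several of these operations; a minimal sketch (the field names and values are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[FILTER]
    Name   modify
    Match  *
    # Add sets the field only if it does not exist yet
    Add    team          platform
    # Set overwrites the field even if it already exists
    Set    source        fluent-bit
    Remove container_id
    Rename log           message
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;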

&lt;p&gt;Aside from static fields, we can refer to environment variables as well. For example, we can inject the current environment and AWS region into each record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name            modify
    Add environment &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ENVIRONMENT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
    Add region      &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ENVIRONMENT&lt;/code&gt; and &lt;code&gt;AWS_REGION&lt;/code&gt; are environment variables; they should be specified in the task definition, as sketched below.&lt;/p&gt;
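
&lt;p&gt;For reference, this is roughly how the variables could be declared in the container definition of the task (a fragment only; the values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;"environment": [
    { "name": "ENVIRONMENT", "value": "production" },
    { "name": "AWS_REGION", "value": "us-west-2" }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;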

&lt;p&gt;Additionally, the Modify FILTER supports conditional actions. For example, we could apply a rename only if a certain condition is met, such as the field starting with a particular string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name                                 modify
    Match                                &lt;span class="k"&gt;*&lt;/span&gt;
    Rename    ecs_task_definition_family family
    Condition Key_value_matches ecs_task_definition_family windows.&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above FILTER will rename the &lt;code&gt;ecs_task_definition_family&lt;/code&gt; field to &lt;code&gt;family&lt;/code&gt; if its value matches the regular expression &lt;code&gt;windows.*&lt;/code&gt;, that is, if it starts with &lt;code&gt;windows&lt;/code&gt;. Aside from the &lt;code&gt;Key_value_matches&lt;/code&gt; condition, there are several other conditions we can use. All of them can be found in the &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/modify#conditions" rel="noopener noreferrer"&gt;Fluent Bit documentation for Modify&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Routing and Multiple Outputs
&lt;/h2&gt;

&lt;p&gt;One of the most important abilities of the Fluent Bit agent is its support for multiple outputs. For example, we could deliver every log message to a centralized logging aggregator built upon Elasticsearch, while at the same time directing error messages to an alerting system. To achieve this architecture, we need to introduce the concept of routing records.&lt;/p&gt;

&lt;p&gt;Routing requires the presence of two other important properties: &lt;code&gt;Tag&lt;/code&gt; and &lt;code&gt;Match&lt;/code&gt;. When we create an INPUT, we can add an optional &lt;code&gt;Tag&lt;/code&gt; property. Every record originating from this INPUT will carry this tag. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;INPUT]
    Name cpu
    Tag  cpu_usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We collect the CPU usage to generate records. Each record will be tagged with &lt;code&gt;cpu_usage&lt;/code&gt;. Now we can define FILTERs to process only these records with the help of &lt;code&gt;Match&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name          modify
    Match         cpu_usage
    Add   brand   AMD
    Add   mark    Ryzen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each record tagged with &lt;code&gt;cpu_usage&lt;/code&gt; will have a &lt;code&gt;brand&lt;/code&gt; and a &lt;code&gt;mark&lt;/code&gt; field. If we add another INPUT for collecting the memory usage and tag its records with &lt;code&gt;mem_usage&lt;/code&gt;, the records originating from the memory INPUT won't receive the &lt;code&gt;brand&lt;/code&gt; and &lt;code&gt;mark&lt;/code&gt; fields. &lt;/p&gt;
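
&lt;p&gt;For completeness, such a second INPUT could look like this (a sketch using Fluent Bit's &lt;code&gt;mem&lt;/code&gt; input):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INPUT]
    Name mem
    Tag  mem_usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;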

&lt;p&gt;Similarly, we can create multiple OUTPUTs, each with its own &lt;code&gt;Match&lt;/code&gt; property. As an example, we can create an OUTPUT that matches only &lt;code&gt;cpu_usage&lt;/code&gt; records and builds a CloudWatch metric based on this information, while we also save every event to an S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;OUTPUT]
    Name              cloudwatch_logs
    Match             cpu_usage
    log_stream_name   fluent-bit-cloudwatch
    log_group_name    fluent-bit-cloudwatch
    region            us-west-2
    log_format        json/emf
    metric_namespace  local_cpu_metrics
    metric_dimensions amd_ryzen_7700x
    auto_create_group &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="o"&gt;[&lt;/span&gt;OUTPUT]
    Name                         s3
    Match                        &lt;span class="k"&gt;*&lt;/span&gt;
    bucket                       fluent-bit-metrics
    region                       us-west-2
    s3_key_format                /&lt;span class="nv"&gt;$TAG&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;2]/&lt;span class="nv"&gt;$TAG&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;0]/%Y/%m/%d/%H/%M/%S/&lt;span class="nv"&gt;$UUID&lt;/span&gt;.gz
    s3_key_format_tag_delimiters .-
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that &lt;code&gt;Match&lt;/code&gt; accepts a wildcard pattern: we can use &lt;code&gt;*&lt;/code&gt; to match everything. If we need a full regular expression, there is a separate &lt;code&gt;Match_Regex&lt;/code&gt; property for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nest and Lift
&lt;/h2&gt;

&lt;p&gt;When working with Fluent Bit on ECS, generally it is a good idea to configure our services to log in JSON format. Most of the logging libraries support this out of the box. Assuming we are logging everything in JSON format, let's imagine our service generates the following log messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Something happened!"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log router from ECS will embed the content of every log message under the &lt;code&gt;"log"&lt;/code&gt; field, the final event having a similar format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_arn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:region:0123456789012:task/FluentBit-cluster/13EXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"container_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs-windows-app-task-1-sample-container-cEXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_cluster"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"FluentBit-cluster"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_container_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sample-container"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_definition_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"container_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61f5e6EXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Something happened!"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ecs_task_definition_family"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"windows-app-task"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We decided we don't like our log content being embedded under the &lt;code&gt;"log"&lt;/code&gt; property, so we want everything at the root level of the record. To do this, we can use the &lt;code&gt;Nest&lt;/code&gt; FILTER. This filter has two operations: the first one is &lt;code&gt;Nest&lt;/code&gt; (again, confusing, I know), and the second one is &lt;code&gt;Lift&lt;/code&gt;. In case we want to lift fields out to the root level, we can do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name         nest
    Match        &lt;span class="k"&gt;*&lt;/span&gt;
    Operation    lift
    Wildcard     container_id
    Nested_under log
    Add_prefix   LIFTED_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This FILTER will lift everything situated under the &lt;code&gt;"log"&lt;/code&gt; field and put it at the root level. The output will be something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;5] dummy.input: &lt;span class="o"&gt;[&lt;/span&gt;1703451474.539715464, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"source"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"stdout"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_arn"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:region:0123456789012:task/FluentBit-cluster/13EXAMPLE"&lt;/span&gt;, &lt;span class="s2"&gt;"container_name"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"/ecs-windows-app-task-1-sample-container-cEXAMPLE"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_cluster"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"FluentBit-cluster"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_container_name"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"sample-container"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_definition_version"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;, &lt;span class="s2"&gt;"container_id"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"61f5e6EXAMPLE"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_definition_family"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"windows-app-task"&lt;/span&gt;, &lt;span class="s2"&gt;"LIFTED_type"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;, &lt;span class="s2"&gt;"LIFTED_message"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"Something happened!"&lt;/span&gt;&lt;span class="o"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Usually, I recommend adding a prefix to the lifted fields, but this can be omitted.&lt;/p&gt;

&lt;p&gt;Now we are happy with how our record looks, but unfortunately, our colleague does not agree with us. He suggests we keep the &lt;code&gt;"log"&lt;/code&gt; object as it is and move the &lt;code&gt;"container_id"&lt;/code&gt; inside that object. We can accomplish this with the &lt;code&gt;Nest&lt;/code&gt; operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name       nest
    Match      &lt;span class="k"&gt;*&lt;/span&gt;
    Operation  nest
    Wildcard   container_id
    Nest_under log
    Add_prefix NESTED_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output after adding this section to the configuration will look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;3] dummy.input: &lt;span class="o"&gt;[&lt;/span&gt;1703451786.500213512, &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"source"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"stdout"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_arn"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:region:0123456789012:task/FluentBit-cluster/13EXAMPLE"&lt;/span&gt;, &lt;span class="s2"&gt;"container_name"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"/ecs-windows-app-task-1-sample-container-cEXAMPLE"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_cluster"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"FluentBit-cluster"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_container_name"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"sample-container"&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_definition_version"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;, &lt;span class="s2"&gt;"log"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;, &lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"Something happened!"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;, &lt;span class="s2"&gt;"ecs_task_definition_family"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"windows-app-task"&lt;/span&gt;, &lt;span class="s2"&gt;"log"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;{&lt;/span&gt;&lt;span class="s2"&gt;"NESTED_container_id"&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;"61f5e6EXAMPLE"&lt;/span&gt;&lt;span class="o"&gt;}}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can notice that this output is a little bit funky, since it appears there are two &lt;code&gt;"log"&lt;/code&gt; objects. This is a &lt;a href="https://github.com/fluent/fluent-bit/issues/1177" rel="noopener noreferrer"&gt;"bug"&lt;/a&gt; in the Fluent Bit version used for this blog post. The demos and examples presented in this post use the latest Fluent Bit Docker image maintained by AWS, which, at the moment of writing, is based on Fluent Bit &lt;code&gt;1.9.10&lt;/code&gt;. Technically, this issue was &lt;a href="https://github.com/fluent/fluent-bit/commit/1d148860a8825d5f80aef40efd0d6d2812419740" rel="noopener noreferrer"&gt;fixed&lt;/a&gt; in a later version of Fluent Bit. The fix consists of keeping only the latest key in the table, effectively overwriting the key if it existed before.&lt;/p&gt;

&lt;p&gt;So, we can enumerate a few caveats for Nest and Lift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As we have seen before, if we would like to nest a field into an already existing field, that is not really possible, even if the receiving field itself is a nested object. Personally, I would have preferred a possibility to merge them, but I'm fully aware that this would come with its own baggage of challenges and edge cases.&lt;/li&gt;
&lt;li&gt;Let's say we have a deeply nested object such as this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdout"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"log"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Something happened!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"stacktrace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In case we would like to lift only the &lt;code&gt;"code"&lt;/code&gt; property to the root, we simply cannot do this directly. We would have to lift the content of &lt;code&gt;"log"&lt;/code&gt; first and then the content of &lt;code&gt;"details"&lt;/code&gt;, as sketched below. At that point, we have essentially broken the original structure of our JSON, which is probably not what we wanted in the first place.&lt;/p&gt;
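
&lt;p&gt;A sketch of the two chained lifts (after the first lift, &lt;code&gt;"details"&lt;/code&gt; sits at the root, so the second FILTER can unpack it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[FILTER]
    Name         nest
    Match        *
    Operation    lift
    Nested_under log

[FILTER]
    Name         nest
    Match        *
    Operation    lift
    Nested_under details
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;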

&lt;h2&gt;
  
  
  Lua Scripting
&lt;/h2&gt;

&lt;p&gt;If we think we need more flexibility for processing records, we can write our own &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/lua" rel="noopener noreferrer"&gt;embedded filters using the Lua&lt;/a&gt; language. &lt;a href="https://www.lua.org/" rel="noopener noreferrer"&gt;Lua&lt;/a&gt; is a highly efficient programming language used mainly for embedded scripting.&lt;/p&gt;

&lt;p&gt;It is relatively easy to integrate a Lua script into a Fluent Bit configuration. First, we have to define a FILTER which will call our script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;FILTER]
    Name    lua
    Match   &lt;span class="k"&gt;*&lt;/span&gt;
    script  script.lua
    call    transform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we have to create a script file (named &lt;code&gt;script.lua&lt;/code&gt; in this case, but we can name it however we want) and write our function (named &lt;code&gt;transform&lt;/code&gt; in this case, but again, we can name this as we wish) which will be invoked for each record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"from_lua"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hello from lua"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few restrictions for this function. The function should accept the following arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;tag&lt;/code&gt;: the tag attached to the record; we discussed tags in detail in the routing section above;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: a Unix timestamp attached to each record;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;record&lt;/code&gt;: the record itself. The type of this argument is a Lua &lt;a href="https://www.lua.org/pil/2.5.html" rel="noopener noreferrer"&gt;table&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This function has to return 3 values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;code&lt;/code&gt;: must be one of the following values:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-1&lt;/code&gt;: tells Fluent Bit to drop the current record&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0&lt;/code&gt;: the current record was not modified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt;: the current record was modified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt;: the timestamp was modified&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: the Unix timestamp of the record; usually it is returned as it was received in the arguments&lt;/li&gt;

&lt;li&gt;
&lt;code&gt;record&lt;/code&gt;: the record itself, in the form of a Lua table.&lt;/li&gt;

&lt;/ul&gt;
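
&lt;p&gt;To make these return codes concrete, here is a minimal sketch of a function that drops certain records and leaves the rest untouched (it assumes the records have a &lt;code&gt;"type"&lt;/code&gt; field, like in our earlier examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;function transform(tag, timestamp, record)
    -- code -1: tell Fluent Bit to drop this record
    if record["type"] == "debug" then
        return -1, timestamp, record
    end
    -- code 0: the record was not modified
    return 0, timestamp, record
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;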

&lt;p&gt;We can do some fairly complex transformations with Lua. My suggestion is to keep them to a minimum. We have to remember that this script will be executed for every record (as long as we did not filter records out before it). A heavy, time-consuming transformation will make our processing lag or, in the worst-case scenario, cause us to drop records. Moreover, sidecar containers usually share the resources allocated to the task with the main service. If we steal a significant amount of resources from the main service, we might disturb its operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The motivation behind this blog post was to share several ideas acquired while working with the Fluent Bit sidecar container in production. Some of these might seem boring or obvious to experienced people, and that is absolutely fine. Logging should be boring, without any unforeseen surprises. It should just work. &lt;/p&gt;

&lt;p&gt;That being said, I hope some of these tips may be helpful for somebody out there.&lt;/p&gt;

&lt;p&gt;The code for the examples presented in this blog post can be found on GitHub: &lt;a href="https://github.com/Ernyoke/ecs-with-fluentbit" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/ecs-with-fluentbit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Fluent Bit Inputs: &lt;a href="https://docs.fluentbit.io/manual/pipeline/inputs" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fluent Bit Outputs: &lt;a href="https://docs.fluentbit.io/manual/pipeline/outputs" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fluent Bit Filters: &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fluent Bit Modify: &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/modify" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Fluent Bit Lua Filter: &lt;a href="https://docs.fluentbit.io/manual/pipeline/filters/lua" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lua programming language: &lt;a href="https://www.lua.org" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lua - Tables: &lt;a href="https://www.lua.org/pil/2.5.html" rel="noopener noreferrer"&gt;Official Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>containers</category>
      <category>fluentbit</category>
    </item>
    <item>
      <title>Why I Am Not Able to Remove a Security Group?</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Sun, 10 Sep 2023 22:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/why-i-am-not-able-to-remove-a-security-group-467</link>
      <guid>https://forem.com/aws-builders/why-i-am-not-able-to-remove-a-security-group-467</guid>
      <description>&lt;p&gt;If you have a slightly more extended experience with IaC, more specifically with Terraform, you might have run into the following issue:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3yj52vk120039sggu1i.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3yj52vk120039sggu1i.gif" alt="Terraform Remove Security Group Attached to a Lambda" width="720" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This usually happens when we are trying to remove a Lambda Function placed in a VPC. The reason for this is that the removal of the security group is temporarily blocked by one or more network interfaces. &lt;/p&gt;

&lt;p&gt;In the upcoming lines, we will see how we can deal with cases when our security group seemingly cannot be removed. We will discuss what is causing these blockages and how we can gracefully handle them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does a Security Group become impossible to remove?
&lt;/h2&gt;

&lt;p&gt;A security group is a &lt;a href="https://en.wikipedia.org/wiki/Stateful_firewall" rel="noopener noreferrer"&gt;stateful firewall&lt;/a&gt; whose purpose is to control what kind of inbound and outbound traffic is allowed for a resource in a VPC. A security group is always assigned to an ENI (&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html" rel="noopener noreferrer"&gt;Elastic network interface&lt;/a&gt;). This is true even if the AWS console makes it seem like we assign security groups to all kinds of resources such as EC2 instances, load balancers, Lambda Functions, databases, etc. What happens in the background is that one or more ENIs are placed inside our VPC, and the security group is assigned to those ENIs. The ENIs are used by our resource, hence the AWS console shows it as if the security group were assigned to the resource itself.&lt;/p&gt;

&lt;p&gt;A security group cannot be removed in the following cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It is assigned to one or more ENIs: a security group can be assigned to one or more ENIs; moreover, an ENI can have &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-security-groups" rel="noopener noreferrer"&gt;up to 5 security groups assigned&lt;/a&gt; to it (a soft limit; by asking AWS Support we can increase it to 16). If a security group is attached to at least one ENI, we need to either get rid of the ENI or de-assign the security group from it before the SG can be removed.&lt;/li&gt;
&lt;li&gt;It is referenced by a security group rule: a security group can allow inbound/outbound traffic based on rules. We can use another SG as the source/destination of a rule. If an SG is referenced by a rule from another SG, it cannot be removed until the rule is removed or changed.&lt;/li&gt;
&lt;li&gt;The SG is a default SG in a VPC: each VPC automatically gets a security group when it is created. We can get rid of this security group only if we remove the VPC.&lt;/li&gt;
&lt;li&gt;We do not have the privileges to remove the SG: this can happen if the role we are using does not have the necessary permission to perform the &lt;code&gt;DeleteSecurityGroup&lt;/code&gt; action.&lt;/li&gt;
&lt;/ul&gt;
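
&lt;p&gt;If we hit the first case and attempt the removal from the CLI, we get an error along these lines (the group ID is made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws ec2 delete-security-group --group-id sg-0123456789abcdef0

An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-0123456789abcdef0 has a dependent object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;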

&lt;h2&gt;
  
  
  The Security Group is Assigned to an ENI
&lt;/h2&gt;

&lt;p&gt;In case we assign a security group to an AWS resource (EC2, Lambda, RDS database, VPC Endpoint, etc.), the security group will always be assigned to a Network Interface (ENI). The AWS console is somewhat misleading, because it displays the security group as if it were assigned to the resource itself, but this is not the case. Since all of this ENI provisioning and SG assignment happens in the background, in most cases we are not allowed to tamper with the ENI and the security group assignment. Sometimes it can be confusing what is happening under the hood when we provision resources, so let's see a few examples to understand what AWS is doing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 instances&lt;/strong&gt;: when we provision an EC2 instance, it will automatically receive a default ENI. This ENI cannot be detached from the instance, hence it cannot be removed. AWS expects us to assign a security group to the instance at the time of creation. If we want to remove this security group, we have to assign another security group to our EC2 instance. We can do this by either going to the ENI console and assigning a security group straight to the ENI, or by going to the EC2 instance and changing the security group in the Security Settings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EC2 instances can also have secondary ENIs attached to them. These ENIs are provisioned independently, so we can change the security group assigned to them from the ENI console.&lt;/p&gt;
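
&lt;p&gt;For reference, both swaps can also be done from the CLI; a sketch with illustrative IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace the set of security groups on an instance's default ENI
$ aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --groups sg-0aaaaaaaaaaaaaaaa

# Replace the set of security groups directly on a (secondary) ENI
$ aws ec2 modify-network-interface-attribute \
    --network-interface-id eni-0123456789abcdef0 \
    --groups sg-0aaaaaaaaaaaaaaaa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;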

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lambda Functions&lt;/strong&gt;: Lambda Functions require a security group in case we want them to have connectivity to a VPC. If we configure them so, AWS will place an ENI in each subnet we specify, and the security group will be assigned to each provisioned ENI. We can change the security group freely by modifying the Lambda Function configuration, but we cannot directly tamper with the ENIs. If we decide to remove our functions, the ENIs will also be removed automatically. This removal usually happens with a delay of 10-15 minutes, essentially getting stuck temporarily. We simply have to wait until the removal is finally completed. This can be annoying if we use Terraform for our infrastructure, since it will try to remove the security group over and over again (see the GIF from the beginning of the article). If the removal does not happen in time, we can easily end up with an inconsistent Terraform state. What we can do is simply wait and hope that Terraform won't time out. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ECS Fargate Tasks&lt;/strong&gt;: in the case of ECS tasks, each container from the task can have an ENI, depending on the network settings. These ENIs are managed by AWS, and we cannot really tamper with them. We can change the security groups in the task settings. When the containers are decommissioned, because we decided to remove our task, the ENIs are removed automatically. In most cases this happens instantly, but in very rare instances we can end up with a stuck ENI. If this happens, we can attempt to manually remove the ENI from the AWS console. If this is unsuccessful, we have to write to AWS Support.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VPC Endpoints&lt;/strong&gt;: VPC Endpoints have multiple benefits for our infrastructure. We can have endpoints for reaching AWS services such as S3, DynamoDB, etc., without needing outgoing connectivity to the public internet, or we can have one-to-one connectivity to an instance in a totally different VPC from another AWS account. The restriction is that PrivateLink, the service that powers VPC Endpoints, works at the availability zone level. This means our connecting subnet has to be in the same AZ as the subnet exposing the endpoint. In terms of ENIs and security groups, the idea is the same as with other resources. We get an ENI in each subnet in which we place an endpoint. This ENI is managed by AWS. We can modify the security groups in the VPC Endpoint settings. If we get rid of the VPC Endpoint, the ENI will be removed and the security group will be detached automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can notice a pattern here. If we take any AWS-managed resource that needs access to a VPC, we end up with a similar networking setup: ENIs placed in the VPC and security groups assigned to those ENIs. The important things to know are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security groups are assigned to Network Interfaces. In most cases, an ENI cannot exist without a security group&lt;/li&gt;
&lt;li&gt;In most cases, ENIs are placed inside our VPC while we provision a resource. At the time of provisioning, we have to assign a security group to the ENI&lt;/li&gt;
&lt;li&gt;Usually, we cannot tamper with the ENI, meaning we cannot directly de-associate the security group from it. We can change the security group by modifying the AWS service which uses the ENI&lt;/li&gt;
&lt;li&gt;If we want to remove a security group we have to either:

&lt;ul&gt;
&lt;li&gt;Remove the AWS service which is using the ENI to which our security group is assigned;&lt;/li&gt;
&lt;li&gt;Modify the service that is using the ENI, hence the security group, by assigning another security group to it and removing the one that we would like to remove.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is using my randomly named Security Group?
&lt;/h2&gt;

&lt;p&gt;We finally decided to remove a security group with a random funky name that we don't remember creating. We suspect it is used by some resources, but we are not really sure which ones. In the AWS console, we navigate to the security group and press the &lt;code&gt;Delete security groups&lt;/code&gt; button. We are greeted with this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jzmxsaa76tipp3s6vpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jzmxsaa76tipp3s6vpg.png" alt="Delete security group" width="800" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The console tells us that we cannot remove the security group because it is used by one or more network interfaces. It also conveniently gives us a link to a list of all these network interfaces. We click on the link, we get the list of network interfaces, and after a few moments, we realize we have no idea what is using them. &lt;/p&gt;

&lt;p&gt;Before moving on, you may think this scenario is unrealistic, that it cannot happen to you, because you know every resource you create in your account and you have good naming practices. In an ideal world, our infrastructure would be as clean as possible, with well-defined naming conventions. In the real world, unfortunately, this is not always true. If you have experience working in AWS accounts where multiple teams deploy their stuff, you know that you can pretty easily end up with resources that do not adhere to any predefined convention. Moreover, in many cases, the AWS console itself offers the opportunity to create security groups with semi-randomly generated names.&lt;/p&gt;

&lt;p&gt;Coming back to our topic, we have a list of network interfaces, but unfortunately, the console does not help us see who is using them (unless the ENI is attached to an EC2 instance). The question is: how can we proceed next?&lt;/p&gt;

&lt;p&gt;There are a few tricks we can use to detect who is using a network interface. In general, we can take a look at the description of the ENI. This may contain an attachment ID or the ID of a resource. For example, in the case of a Lambda Function, we may have something like this in the description: &lt;code&gt;AWS Lambda VPC ENI-vpc-lambda-f8872d9f-745a-42dd-bca9-3ac0e87ac215&lt;/code&gt;. Here the description tells us the ENI is used by a Lambda Function that is placed in a VPC, the name of the function is &lt;code&gt;ENI-vpc-lambda&lt;/code&gt;, and the identifier of the function is the UUID &lt;code&gt;f8872d9f-745a-42dd-bca9-3ac0e87ac215&lt;/code&gt;. Unfortunately, this description format is not standard and is not documented in the AWS documentation. For other resources, we may get a description using a different format (for example: &lt;code&gt;[DO NOT DELETE] ENI managed by SageMaker for Studio Domain(d-vegsk0mcgrdp) - 946a4c21ed31356ee889a8dd95fde7cf&lt;/code&gt; for an ENI used by SageMaker).&lt;/p&gt;
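
&lt;p&gt;To grab these descriptions quickly for a given security group, we can query the ENIs from the CLI (the group ID is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws ec2 describe-network-interfaces \
    --filters Name=group-id,Values=sg-0123456789abcdef0 \
    --query 'NetworkInterfaces[].[NetworkInterfaceId,Description]' \
    --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;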

&lt;p&gt;At this point, we may ask ourselves if there is a better solution to find out who is using an ENI. Scouring the internet, I did not find anything to help me out, so I decided to create a tool for myself. I want to introduce &lt;a href="https://github.com/cloud-crafts/sg-ripper" rel="noopener noreferrer"&gt;&lt;code&gt;sg-ripper&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sg-ripper&lt;/code&gt; is a CLI application, developed in Golang, whose purpose is to make our life easier in case we want to do a little bit of cleanup in our list of security groups. It can list all the security groups from an AWS account, it can grab all the ENIs for each security group, and it tries to locate all the other resources that might be relying on those ENIs.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can list all the security groups, their associated ENIs and the resources that are using those ENIs:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn11t8bhs9it4jtydyvav.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn11t8bhs9it4jtydyvav.gif" alt="sg-ripper list security groups" width="720" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can list all the ENIs directly. In this case, it will show which security group is using each ENI and also which other AWS resources are relying on the ENIs:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjq4adh9q8w410y5507f.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frjq4adh9q8w410y5507f.gif" alt="sg-ripper list security groups" width="720" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;sg-ripper&lt;/code&gt; we can also apply a filter to see only certain security groups or ENIs, in case we don't want to grab all the existing ones from our account. Aside from showing which resource is using an ENI, it can display whether a security group is available for removal. If it is not, it will also show an explanation as to why the removal is blocked.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sg-ripper&lt;/code&gt; is a work-in-progress project; the source code is open-source and can be found on GitHub: &lt;a href="https://github.com/cloud-crafts/sg-ripper" rel="noopener noreferrer"&gt;https://github.com/cloud-crafts/sg-ripper&lt;/a&gt;. Contributions are welcome.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;As our infrastructure evolves, we may tend to leave unused resources behind, such as security groups. Security groups can be removed only if they are not in use, not referenced by other security group rules, and not the default SG of a VPC. &lt;code&gt;sg-ripper&lt;/code&gt; can make our life easier by detecting unused security groups, explaining why a certain security group cannot be removed, and pointing out which ENI/resource is blocking us from removing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Stateful_firewall" rel="noopener noreferrer"&gt;Stateful Firewall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html" rel="noopener noreferrer"&gt;Elastic network interfaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-security-groups" rel="noopener noreferrer"&gt;Security groups Limits&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>ecs</category>
      <category>go</category>
    </item>
    <item>
      <title>Disallow GPT Bot from Scraping our Blog Posts</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Tue, 08 Aug 2023 19:49:31 +0000</pubDate>
      <link>https://forem.com/ervin_szilagyi/disallow-gpt-bot-from-scraping-our-blog-posts-38pc</link>
      <guid>https://forem.com/ervin_szilagyi/disallow-gpt-bot-from-scraping-our-blog-posts-38pc</guid>
      <description>&lt;p&gt;Lately, we can bloc GPT bots from scraping our pages for a site that we control, by setting the following lines &lt;a href="https://platform.openai.com/docs/gptbot" rel="noopener noreferrer"&gt;in the &lt;code&gt;robots.txt&lt;/code&gt; file&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User-agent: GPTBot
Disallow: /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I myself found out about this from a &lt;a href="https://twitter.com/GergelyOrosz/status/1688829094249615360?s=20" rel="noopener noreferrer"&gt;tweet&lt;/a&gt; by Gergely Orosz:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fub9qg7ddrc18v4xwio20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fub9qg7ddrc18v4xwio20.png" alt="Bloc GPT Bot tweet from Gergey" width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My stance on this is similar to what Gergely is saying: GPT offers no citation for the information it provides. While I did update the &lt;code&gt;robots.txt&lt;/code&gt; file on my personal website, I am also cross-posting to DEV. If we look at the &lt;a href="https://dev.to/robots.txt"&gt;&lt;code&gt;robots.txt&lt;/code&gt;&lt;/a&gt; from DEV.to, we can notice that it does not have the same rule for GPTBot.&lt;/p&gt;
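
&lt;p&gt;A quick way to check any site's rules is to fetch its &lt;code&gt;robots.txt&lt;/code&gt; and search for the bot's user agent; something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl -s https://dev.to/robots.txt | grep -i gptbot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;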

&lt;p&gt;There are thousands of people posting to DEV, many of whom have different views about their content being scraped for LLM training. I'm in no position to request changes that would affect how the site works. I'm just curious what other fellow authors think about their articles being scraped for ML training.&lt;/p&gt;

&lt;h1&gt;
  
  
  Does updating your &lt;code&gt;robots.txt&lt;/code&gt; actually solve something?
&lt;/h1&gt;

&lt;p&gt;Obviously not. Adding this statement to your site is just a hint, a request for a bot to please not scrape your data.&lt;br&gt;
This won't stop huge LLM bots (including GPT) from getting the information they want.&lt;/p&gt;

&lt;p&gt;So, I'm curious: what do other people think about this topic? &lt;/p&gt;

</description>
      <category>watercooler</category>
      <category>openai</category>
      <category>gpt3</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Expose our REST API on AWS with a Custom Domain</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Tue, 23 May 2023 21:58:22 +0000</pubDate>
      <link>https://forem.com/aws-builders/expose-our-rest-api-on-aws-with-a-custom-domain-1326</link>
      <guid>https://forem.com/aws-builders/expose-our-rest-api-on-aws-with-a-custom-domain-1326</guid>
      <description>&lt;p&gt;DNS is hard.&lt;/p&gt;

&lt;p&gt;This is absolutely true for huge enterprise networks and distributed systems. With a simple Google search, we can find many IT incidents caused by DNS issues.&lt;/p&gt;

&lt;p&gt;But this is not the topic of the current article. Most of us are not in a position to deal with DNS for enterprise systems. What we most likely encounter once in a while is "a simple" DNS setup for a REST API. Even in this case, DNS can be confusing for the uninitiated. The purpose of this article is to clear up certain misconceptions and to guide the reader through the steps of exposing a REST API publicly on AWS with a custom domain name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get our own domain
&lt;/h2&gt;

&lt;p&gt;If we are thinking about making our API public to the world, we will want to purchase a custom domain. AWS provides a domain registrar where we can buy domains, but there are better third-party options out in the wild. Purchasing a domain can be an adventure in itself, with each registrar offering different prices depending on the length/wording/choice of the top-level domain. I own the domain &lt;a href="https://ervinszilagyi.dev/" rel="noopener noreferrer"&gt;&lt;em&gt;ervinszilagyi.dev&lt;/em&gt;&lt;/a&gt;, and I have also registered &lt;em&gt;ervinszilagyi.xyz&lt;/em&gt; for this tutorial (and for my other projects as well). I'm using the GoDaddy domain registrar for my domains, and I will be referring to them in this article. I have no affiliation with them; I just happened to use them for my domain purchases. To follow the steps of this tutorial, our domain registrar of choice should allow changing the nameservers, or, if we want to delegate only a subdomain, it should allow us to register NS records. If these concepts are not clear at this point, we should not worry too much; there are explanations below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hosted Zones and Records
&lt;/h2&gt;

&lt;p&gt;In AWS, everything related to DNS is handled by Route 53 (the name is a pun: DNS resolution uses port 53).&lt;/p&gt;

&lt;p&gt;Route 53 works with &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-working-with.html" rel="noopener noreferrer"&gt;Hosted Zones&lt;/a&gt;, which are "databases" or containers for storing and managing our records. We can have 2 types of Hosted Zones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Private Hosted Zones&lt;/em&gt;: they are used for domains inside an Amazon VPC (Virtual Private Cloud). They are not resolvable from the public internet. We won't use them in this tutorial, but it is important to be aware of them.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Public Hosted Zones&lt;/em&gt;: they are used for storing records for publicly routable domains. We can access them from the public internet.&lt;/li&gt;
&lt;/ul&gt;
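
&lt;p&gt;For reference, a public Hosted Zone can also be created from the CLI; a sketch (the caller reference just has to be a unique string):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws route53 create-hosted-zone \
    --name ervinszilagyi.xyz \
    --caller-reference rest-api-tutorial-001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;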

&lt;p&gt;Hosted Zones are used to manage &lt;a href="https://en.wikipedia.org/wiki/Domain_Name_System#Resource_records" rel="noopener noreferrer"&gt;records&lt;/a&gt;. Records are used to store information about our domain, for example: if our domain is mapped to an IP address pointing to our backend, we can specify a record with this IP address. There are many different &lt;a href="https://en.wikipedia.org/wiki/List_of_DNS_record_types" rel="noopener noreferrer"&gt;types of records&lt;/a&gt;, for our purposes it is enough if we know about a few of them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;A record&lt;/em&gt; (Address record): used to store IPv4 addresses&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;AAAA record&lt;/em&gt; : used to store IPv6 addresses&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;CNAME record&lt;/em&gt;: it points to another domain, in cases where we don't have or simply cannot use an IP address for our backend&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;TXT record&lt;/em&gt; (Text record): used to store additional text information. This information can be for other humans or other systems (metadata information)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;SOA record&lt;/em&gt;: Specifies authoritative information about a DNS zone. Each Hosted Zone has one by default. It cannot be removed or modified.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;NS record&lt;/em&gt; (Name server record): identifies the name servers for the hosted zone. Each Hosted Zone automatically gets assigned an NS record with 4 nameservers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hosted Zones can have &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html" rel="noopener noreferrer"&gt;alias records&lt;/a&gt;. Alias records are Amazon Route 53-specific records. The reason for their existence is to overcome a limitation of how DNS zones work: if we want to register a record pointing to the apex node of the DNS space (the apex domain), this record has to be either an A or an AAAA record (remember, both of them point to IP addresses). The problem with this is that for a lot of AWS services (such as S3, CloudFront, API Gateway, etc.) we don't receive static IP addresses. To work around this limitation, AWS introduced alias records. We will use an alias record for our REST API below.&lt;/p&gt;
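
&lt;p&gt;As a sketch of what such a record looks like, here is how an alias A record pointing to a CloudFront distribution could be created from the CLI (the zone ID and distribution domain are made up; &lt;code&gt;Z2FDTNDATAQYW2&lt;/code&gt; is the fixed Hosted Zone ID used for CloudFront alias targets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "ervinszilagyi.xyz",
          "Type": "A",
          "AliasTarget": {
            "HostedZoneId": "Z2FDTNDATAQYW2",
            "DNSName": "d111111abcdef8.cloudfront.net",
            "EvaluateTargetHealth": false
          }
        }
      }]
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;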

&lt;h2&gt;
  
  
  Configure a custom domain resolution with a Hosted Zone
&lt;/h2&gt;

&lt;p&gt;Let's say we decided on our domain (which in my case is &lt;em&gt;ervinszilagyi.xyz&lt;/em&gt;) and now we would want to initiate its setup. &lt;/p&gt;

&lt;p&gt;First, we need to create a &lt;em&gt;public&lt;/em&gt; Hosted Zone in Route 53. We should make sure the name of the Hosted Zone is the same as our domain.&lt;/p&gt;

&lt;p&gt;In AWS Console we should go to Route 53 -&amp;gt; Hosted Zones -&amp;gt; and press the Create Hosted Zone button. We should be directed to the following form, where we have to fill in our domain name:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfpe2dkmgwleo9vt0jg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfpe2dkmgwleo9vt0jg8.png" alt="Create Hosted Zone Form" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should make sure we select that we want to create a &lt;em&gt;public&lt;/em&gt; Hosted Zone and we should press "Create Hosted Zone". After a few moments, our hosted zone should be up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsovz2t2zanc404s0f9s2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsovz2t2zanc404s0f9s2.png" alt="Hosted Zone created with Records" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can notice that we have an &lt;code&gt;SOA&lt;/code&gt; record and an &lt;code&gt;NS&lt;/code&gt; record with 4 nameservers. What we have to do next is go to our domain registrar (in my case GoDaddy) and change the nameservers for the domain we purchased:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7u00y13fcbp8ab2yyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7u00y13fcbp8ab2yyc.png" alt="GoDaddy Nameservers" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should copy the nameservers from the Hosted Zone and add them to the GoDaddy settings:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1edoyt7yv1y3wvssn77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1edoyt7yv1y3wvssn77.png" alt="Set Nameservers" width="800" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GoDaddy will warn us that changing the nameservers can be dangerous. We should not worry about these alerts: our nameservers will be managed by AWS, so we will be fine.&lt;/p&gt;

&lt;p&gt;This nameserver change can take up to 48 hours to take effect. In my experience the change usually propagates within a few minutes, but there are no guarantees, so we have to wait until our domain becomes usable. To check whether the changes took effect, we can use the Unix &lt;code&gt;dig&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;dig +short NS ervinszilagyi.xyz
ns-1976.awsdns-55.co.uk.
ns-832.awsdns-40.net.
ns-47.awsdns-05.com.
ns-1142.awsdns-14.org.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query should return the 4 nameservers we just configured. &lt;/p&gt;

&lt;p&gt;At this point, we can move on and request a TLS certificate for our domain (skip the next section if you don't care about subdomain delegation).&lt;/p&gt;

&lt;h2&gt;
  
  
  Delegate a custom subdomain resolution to a Hosted Zone
&lt;/h2&gt;

&lt;p&gt;I own &lt;a href="https://ervinszilagyi.dev/" rel="noopener noreferrer"&gt;&lt;em&gt;ervinszilagyi.dev&lt;/em&gt;&lt;/a&gt;. This domain resolves to my personal website and blog. Obviously, I don't want to change this behavior, since I would like other people to keep reading my blog posts. Still, I would like to use this domain for my tutorials as well.&lt;/p&gt;

&lt;p&gt;One way to make use of this domain is to create a subdomain under it and delegate the nameserver resolution for this subdomain to an AWS Hosted Zone. Let's say I would like to register my REST API using the &lt;code&gt;rest.ervinszilagyi.dev&lt;/code&gt; domain name.&lt;/p&gt;

&lt;p&gt;As before, in the AWS console we should go to Route53 service and create a &lt;em&gt;public&lt;/em&gt; Hosted Zone for &lt;code&gt;rest.ervinszilagyi.dev&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogwj2rlumy01f5epqn8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogwj2rlumy01f5epqn8p.png" alt="Create Hosted Zone for the Sub-Domain" width="800" height="608"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the Hosted Zone is created, we get an NS record with 4 nameservers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1czhcq3bzzts7vi77szr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1czhcq3bzzts7vi77szr.png" alt="Hosted Zone Created for the Sub-Domain with Records" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have to grab the values from the NS record and navigate to GoDaddy. Under our purchased domain, we have to create 4 NS records, one for each value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k797inxk5ufuz7vz1zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k797inxk5ufuz7vz1zr.png" alt="NS Records GoDaddy" width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After we save the records, we should wait a little for the changes to take effect. We can use the &lt;code&gt;dig +short NS rest.ervinszilagyi.dev&lt;/code&gt; command to check whether our changes are in place. &lt;/p&gt;

&lt;p&gt;We have seen how to set up nameservers for both domains and sub-domains. Moving on with this tutorial, we will use the Hosted Zone created for &lt;code&gt;ervinszilagyi.xyz&lt;/code&gt;. Everything we do with this Hosted Zone will apply to the Hosted Zone created for the sub-domain as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Request a TLS certificate for our domain
&lt;/h2&gt;

&lt;p&gt;TLS certificates are used to establish secure connections between our machine and a remote server. Our plan for this tutorial is to expose a REST API, for which we will use API Gateway. API Gateway enforces the usage of a valid certificate for the base path mapping (we will see in detail below what base path mapping is).&lt;/p&gt;

&lt;p&gt;Requesting a certificate with AWS Certificate Manager is easy. We just have to go to the AWS Certificate Manager portal from our AWS console and press the Request button. We will be redirected to this page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4svn2rsesul4oec6r9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4svn2rsesul4oec6r9z.png" alt="Request TLS Certificate" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need a public certificate, so we have to choose this option. Moving on, we will reach this page:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbkrx07agh02xavg4f7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbkrx07agh02xavg4f7t.png" alt="Request TLS Certificate with Details" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have to enter our domain, and we should select the &lt;code&gt;DNS validation&lt;/code&gt; option. We have to be able to prove somehow that we own the domain name for which the certificate is issued. With &lt;code&gt;DNS validation&lt;/code&gt;, AWS will create a record in our Hosted Zone with this proof. For the validation record to be created, we may also have to press the &lt;code&gt;Create Record&lt;/code&gt; button after the certificate was requested:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr4ugdtrk42wvya13geh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffr4ugdtrk42wvya13geh.png" alt="Create Record for Certificate Validation" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we go to our Hosted Zone, we should see the newly created record:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzhhst25lr4c337k4h6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzhhst25lr4c337k4h6x.png" alt="Certificate Validation Record" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should make sure we have this record and our certificate validation status is green:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvndmqy36z7gvnf19snep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvndmqy36z7gvnf19snep.png" alt="Certificate Validation Status" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;
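
&lt;p&gt;For reference, the certificate request can be scripted as well. A minimal sketch with the AWS CLI; the certificate ARN in the second command is a hypothetical placeholder taken from the output of the first one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Request a public certificate validated via DNS
$ aws acm request-certificate \
    --domain-name ervinszilagyi.xyz \
    --validation-method DNS

# Look up the CNAME validation record we have to add to the Hosted Zone
$ aws acm describe-certificate \
    --certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abcd-1234
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;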

&lt;h2&gt;
  
  
  Create a REST API
&lt;/h2&gt;

&lt;p&gt;The next step is to create the REST API itself. There are several ways of creating and exposing an API in AWS. In most cases, what we want is to build a REST API using &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;Amazon API Gateway&lt;/a&gt;. Amazon API Gateway is a managed service built for managing APIs. It is a front-facing service standing between the user and our backend, and it can handle authentication and authorization, TLS encryption, rate-limiting and quota enforcement, and many other things.&lt;/p&gt;

&lt;p&gt;To create a REST API Gateway, from the console we should go to the API Gateway Service and select REST API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8xqa6dacn8buw3lv9as.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8xqa6dacn8buw3lv9as.png" alt="Create REST API" width="756" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is important to select the option with the &lt;em&gt;public&lt;/em&gt; REST API. A private API Gateway is accessible only internally from a VPC. Since we want our API to be reachable from the internet, we need a public API Gateway.&lt;/p&gt;

&lt;p&gt;By pressing &lt;em&gt;Build&lt;/em&gt; we are redirected to a page where we should select the protocol for our API Gateway (we want REST, not WebSocket) and we have to give a name to our API Gateway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t3bqureyazsfe5qxc21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5t3bqureyazsfe5qxc21.png" alt="Create REST API settings" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After we press &lt;em&gt;Create API&lt;/em&gt;, we should have our API Gateway up and running. We still need to add a method to handle incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06f37rddj7msmqekyrpt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F06f37rddj7msmqekyrpt.png" alt="Create REST Method" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A method is essentially an HTTP verb (GET, POST, PUT, etc.) and it does exactly what we would expect: it handles the matching REST API requests. Methods can be nested inside resources, creating more complex and lengthy request paths. For now, we will keep our path simple and place our method at the root of our API Gateway. If we press the tick symbol (✓), we are redirected to a page where we have to set up the integration for our method. The integration is essentially the backend. We will build a &lt;em&gt;Mock&lt;/em&gt; backend for this tutorial, which means that our API Gateway will respond with the same static response every time. This is enough for our tutorial, but we can imagine that in a production environment we would have a Lambda function or a microservice here instead of the mock.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwamn0zpicbio95txgbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwamn0zpicbio95txgbc.png" alt="Create GET Mock" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sadly, at this point, our mock will respond with no content. To have a response body, we need to set up an integration response. To do this we have to select &lt;em&gt;Integration Response&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdr6gl3oip7zmydu9rgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdr6gl3oip7zmydu9rgs.png" alt="Create Integration Response" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there we press the drop-down for the 200 response:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65aw7lrvvu829d4pzs75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65aw7lrvvu829d4pzs75.png" alt="Create Integration Response - Edit 200 Response" width="800" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we add a Mapping Template with the content type of &lt;code&gt;application/json&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxlv59nesh4bnshkrn04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxlv59nesh4bnshkrn04.png" alt="Create Integration Response - Create Content Type" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we press the tickmark, we should be able to input a response body for the template. We should paste in the following JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"statusCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Works!"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqpy6oldqdrxp0zfh8j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqpy6oldqdrxp0zfh8j0.png" alt="Create Integration Response - Template" width="800" height="741"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should press &lt;em&gt;Save&lt;/em&gt;. &lt;/p&gt;
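
&lt;p&gt;The console clicks above can also be reproduced with the AWS CLI. The following is a sketch, assuming we already have the API's ID and the ID of its root resource at hand (&lt;code&gt;REST_API_ID&lt;/code&gt; and &lt;code&gt;RESOURCE_ID&lt;/code&gt; are hypothetical placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create a GET method on the root resource
$ aws apigateway put-method --rest-api-id REST_API_ID \
    --resource-id RESOURCE_ID --http-method GET --authorization-type NONE

# Back the method with a Mock integration
$ aws apigateway put-integration --rest-api-id REST_API_ID \
    --resource-id RESOURCE_ID --http-method GET --type MOCK \
    --request-templates '{"application/json": "{\"statusCode\": 200}"}'

# Declare the 200 method response
$ aws apigateway put-method-response --rest-api-id REST_API_ID \
    --resource-id RESOURCE_ID --http-method GET --status-code 200

# Map a static JSON body to the 200 integration response
$ aws apigateway put-integration-response --rest-api-id REST_API_ID \
    --resource-id RESOURCE_ID --http-method GET --status-code 200 \
    --response-templates '{"application/json": "{\"statusCode\": 200, \"message\": \"Works!\"}"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;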

&lt;p&gt;Now we also have to &lt;strong&gt;Deploy&lt;/strong&gt; our API Gateway. If we go to the &lt;em&gt;Actions&lt;/em&gt; button, we will get a dropdown menu, from where we should select &lt;em&gt;Deploy API&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82zxl1xjnrwy72v4lmy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82zxl1xjnrwy72v4lmy4.png" alt="Deploy API Dropdown" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are asked to create a new Stage. We can name it however we want, so we will simply choose &lt;code&gt;dev&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2x49t7r7o8qhbxywbx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2x49t7r7o8qhbxywbx0.png" alt="Create Deployment Stage" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We press &lt;em&gt;Deploy&lt;/em&gt; and our REST API should be live. We also get a generated URL, where we can test our GET method with a &lt;code&gt;curl&lt;/code&gt; request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfao6z9bpj5hb72a3bs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfao6z9bpj5hb72a3bs0.png" alt="Grab the generated URL" width="800" height="551"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl https://ppwm7oataf.execute-api.us-east-1.amazonaws.com/dev
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"statusCode"&lt;/span&gt;: 200,
    &lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"Works!"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should receive the body we configured above.&lt;/p&gt;
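
&lt;p&gt;If we want to script the deployment step as well, the equivalent AWS CLI call is the following sketch (&lt;code&gt;REST_API_ID&lt;/code&gt; is a hypothetical placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Deploy the current API configuration to the dev stage
$ aws apigateway create-deployment \
    --rest-api-id REST_API_ID \
    --stage-name dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;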

&lt;h2&gt;
  
  
  Configure base path mapping for the API
&lt;/h2&gt;

&lt;p&gt;At this point our REST API is live, but it only responds at the URL generated by AWS. Next, we want to configure the API to be usable with our custom domain (&lt;code&gt;ervinszilagyi.xyz&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In the API Gateway console, on the top-left, we should select &lt;em&gt;Custom domain names&lt;/em&gt;. This will take us to another page, where most likely we don't have any domain configured yet in the list. We should press the &lt;em&gt;Create&lt;/em&gt; button:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xnym1pgvmrz63tuz6kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xnym1pgvmrz63tuz6kk.png" alt="Create Custom Domain" width="765" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are taken to yet another page. Here we have to make sure we carefully enter the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain name: this is the domain we own; it should be the same as what we entered for our Hosted Zone above.&lt;/li&gt;
&lt;li&gt;We have a Regional API Gateway, so we should leave that option as it is.&lt;/li&gt;
&lt;li&gt;For the certificate, we should select the one issued for our domain. We created it earlier (in the &lt;em&gt;Request a TLS certificate for our domain&lt;/em&gt; step), and we should be able to see it in the list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t0k8887pp084wr1xq3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t0k8887pp084wr1xq3g.png" alt="Create Domain Name Settings" width="800" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After pressing create, we are shortly taken to another page. What we have to do now is set up the base path mapping. We tested our API before with the generated URL (&lt;code&gt;https://ppwm7oataf.execute-api.us-east-1.amazonaws.com/dev&lt;/code&gt; in my case). We want to configure our API to use our custom domain (&lt;code&gt;ervinszilagyi.xyz&lt;/code&gt; in my case). To accomplish this, we should select the second tab (&lt;em&gt;API mappings&lt;/em&gt;) and press the &lt;em&gt;Configure API mappings&lt;/em&gt; button:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyl58kwpj1c9vy4ngfuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuyl58kwpj1c9vy4ngfuf.png" alt="Configure API Mappings" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We should select our API and the stage (&lt;code&gt;dev&lt;/code&gt; in my case), and we should leave the &lt;em&gt;Path&lt;/em&gt; blank.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge9vi6hwro8alqlwhhvc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge9vi6hwro8alqlwhhvc.png" alt="Configure Mapping" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is it, we have the mapping set up. However, we are not finished just yet: one more step is needed to expose our REST API with our custom domain. We need to create a record for our API in our Hosted Zone.&lt;/p&gt;
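
&lt;p&gt;As a reference, the custom domain and the mapping could also be created from the AWS CLI. A sketch assuming a regional endpoint; the certificate ARN and &lt;code&gt;REST_API_ID&lt;/code&gt; are hypothetical placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Create the custom domain backed by our ACM certificate
$ aws apigateway create-domain-name \
    --domain-name ervinszilagyi.xyz \
    --regional-certificate-arn arn:aws:acm:us-east-1:123456789012:certificate/abcd-1234 \
    --endpoint-configuration types=REGIONAL

# Map the dev stage of our API to the empty base path
$ aws apigateway create-base-path-mapping \
    --domain-name ervinszilagyi.xyz \
    --rest-api-id REST_API_ID \
    --stage dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;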

&lt;h2&gt;
  
  
  Create an Alias record for our API
&lt;/h2&gt;

&lt;p&gt;We should navigate back to Route 53 and select our Hosted Zone. We want to create a record, so we should press the big orange &lt;em&gt;Create record&lt;/em&gt; button.&lt;/p&gt;

&lt;p&gt;We want to create an &lt;code&gt;A&lt;/code&gt; record that is an &lt;em&gt;Alias&lt;/em&gt;. We explained at the beginning what Alias records are; what we have to know now is that we can use an Alias record when we don't have an IP address for the &lt;code&gt;A&lt;/code&gt; record. It is not recommended at all to rely on IP addresses for AWS API Gateways, so we should enable the Alias tickbox. &lt;/p&gt;

&lt;p&gt;For the &lt;em&gt;Route traffic&lt;/em&gt; section, we need to select &lt;code&gt;Alias to API Gateway&lt;/code&gt; and we will have to find our API Gateway in the region in which we are working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt5wwoyy1p5etw0unk0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpt5wwoyy1p5etw0unk0t.png" alt="Configure Alias Record for the API Gateway" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We leave the routing policy at the default simple routing option.&lt;/p&gt;

&lt;p&gt;After pressing create, we should see our &lt;code&gt;A&lt;/code&gt; record inside our Hosted Zone:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaippjzxf4n7fueiulml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbaippjzxf4n7fueiulml.png" alt="API Gateway A Record" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;
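
&lt;p&gt;The same Alias record can be created programmatically. A sketch with the AWS CLI; the hosted zone ID and the alias target values (which come from the custom domain's regional hosted zone ID and regional domain name) are hypothetical placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Upsert an A (Alias) record pointing the apex to the API Gateway domain
$ aws route53 change-resource-record-sets \
    --hosted-zone-id HOSTED_ZONE_ID \
    --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "ervinszilagyi.xyz",
          "Type": "A",
          "AliasTarget": {
            "HostedZoneId": "API_GW_REGIONAL_ZONE_ID",
            "DNSName": "d-abc123.execute-api.us-east-1.amazonaws.com",
            "EvaluateTargetHealth": false
          }
        }
      }]
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;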

&lt;p&gt;To check if our domain works, we can use the &lt;code&gt;dig&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;dig ervinszilagyi.xyz

&lt;span class="p"&gt;;&lt;/span&gt; &amp;lt;&amp;lt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; DiG 9.16.1-Ubuntu &amp;lt;&amp;lt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ervinszilagyi.xyz
&lt;span class="p"&gt;;;&lt;/span&gt; global options: +cmd
&lt;span class="p"&gt;;;&lt;/span&gt; Got answer:
&lt;span class="p"&gt;;;&lt;/span&gt; -&amp;gt;&amp;gt;HEADER&lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt; &lt;span class="no"&gt;opcode&lt;/span&gt;&lt;span class="sh"&gt;: QUERY, status: NOERROR, id: 52495
;; flags: qr rd ad; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;ervinszilagyi.xyz.             IN      A

;; ANSWER SECTION:
ervinszilagyi.xyz.      0       IN      A       34.202.63.112
ervinszilagyi.xyz.      0       IN      A       54.209.119.225
ervinszilagyi.xyz.      0       IN      A       3.86.19.75

;; Query time: 50 msec
;; SERVER: 172.24.80.1#53(172.24.80.1)
;; WHEN: Sun May 21 18:56:53 EEST 2023
;; MSG SIZE  rcvd: 100
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We should be able to see three &lt;code&gt;A&lt;/code&gt; records with IP addresses. These are the public IP addresses of the API Gateway service, and they are managed by AWS. They might be different for you; what is important is that we get some &lt;code&gt;A&lt;/code&gt; records back when doing the DNS resolution.&lt;/p&gt;

&lt;p&gt;We should also do a &lt;code&gt;curl&lt;/code&gt; request to see if we get a response from our REST API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl https://ervinszilagyi.xyz
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"statusCode"&lt;/span&gt;: 200,
    &lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"Works!"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can notice that we get the same answer as above. This is great! We have our REST API exposed publicly using our custom domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating all these steps
&lt;/h2&gt;

&lt;p&gt;We can agree that there are quite a few steps to take to set up a custom domain for a REST API. Fortunately, we can have our infrastructure configured quickly if we write some Terraform code covering all these changes. &lt;/p&gt;

&lt;p&gt;This is exactly what I've already done, making the code available for anybody on GitHub: &lt;a href="https://github.com/Ernyoke/aws-custom-domain-r53" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/aws-custom-domain-r53&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This repository contains 3 Terraform projects (let's call them stacks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;tree
&lt;span class="nb"&gt;.&lt;/span&gt;
├── README.md
├── api-gw
│   ├── api-gw.tf
│   ├── main.tf
│   ├── output.tf
│   ├── route53.tf
│   ├── terraform.tf
│   └── variables.tf
├── certificate
│   ├── main.tf
│   ├── output.tf
│   ├── terraform.tf
│   └── variables.tf
└── route53
    ├── main.tf
    ├── output.tf
    ├── terraform.tf
    └── variables.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reason for this project structure is that setting up the nameservers on GoDaddy (or any other registrar) takes time and requires manual steps. So, if we would like to deploy the Terraform code from my project, the first thing we should do is go inside the &lt;code&gt;route53&lt;/code&gt; folder and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform init
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the plan is rolled out successfully, we should see the nameservers in the output. We should copy these nameservers and make the change for our domain at the registrar. We have to make sure our changes have propagated before moving on to the next step.&lt;/p&gt;
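
&lt;p&gt;If we need the nameservers again later, we can re-print the stack outputs at any time (a sketch; the exact output names depend on the stack's &lt;code&gt;output.tf&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Re-print the outputs of the already-applied route53 stack
$ terraform output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;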

&lt;p&gt;The &lt;code&gt;certificate&lt;/code&gt; stack creates the TLS certificate and its validation. For the validation to succeed, our domain should already point to the correct nameservers. The commands for the Terraform rollout are the same as we've seen with the &lt;code&gt;route53&lt;/code&gt; stack. &lt;/p&gt;

&lt;p&gt;Last, we should roll out our REST API. After that succeeds, we should be able to test it with a &lt;code&gt;curl&lt;/code&gt; request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Setting up a custom domain for a REST API in AWS is not the most complicated procedure in the world. Still, it is not something we do on a daily basis, so I think it is a good idea to have it documented.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Working with hosted zones: &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-working-with.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-working-with.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;DNS Records: &lt;a href="https://en.wikipedia.org/wiki/Domain_Name_System#Resource_records" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Domain_Name_System#Resource_records&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Choosing between alias and non-alias records: &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resource-record-sets-choosing-alias-non-alias.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Amazon API Gateway: &lt;a href="https://aws.amazon.com/api-gateway/" rel="noopener noreferrer"&gt;https://aws.amazon.com/api-gateway/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>route53</category>
      <category>dns</category>
      <category>apigateway</category>
    </item>
    <item>
      <title>An Introduction to AWS Batch</title>
      <dc:creator>Ervin Szilagyi</dc:creator>
      <pubDate>Sun, 19 Mar 2023 11:38:37 +0000</pubDate>
      <link>https://forem.com/aws-builders/an-introduction-to-aws-batch-2pei</link>
      <guid>https://forem.com/aws-builders/an-introduction-to-aws-batch-2pei</guid>
      <description>&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html" rel="noopener noreferrer"&gt;AWS Batch&lt;/a&gt; is a fully managed service that helps us developers run batch computing workloads on the cloud. The goal of this service is to effectively provision infrastructure for batch jobs submitted by us while we can focus on writing the code for dealing with business constraints.&lt;/p&gt;

&lt;p&gt;Batch jobs running on AWS are essentially Docker containers that can be executed on different environments. AWS Batch supports job queues deployed on EC2 instances, on ECS clusters with Fargate, and on Amazon EKS (Elastic Kubernetes Service). Regardless of what we choose for the basis of our infrastructure, the provisioning of the necessary services and orchestration of the jobs is managed by AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Components of AWS Batch
&lt;/h2&gt;

&lt;p&gt;Although one of the selling points of AWS Batch is to simplify batch computing in the cloud, it has a bunch of components, each requiring its own configuration. The components required for a job running on the AWS Batch service are the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  Jobs
&lt;/h3&gt;

&lt;p&gt;Jobs are Docker containers wrapping units of work which we submit to an AWS Batch queue. Jobs can have names and they can receive parameters from their job definition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Definitions
&lt;/h3&gt;

&lt;p&gt;A job definition specifies how a job should run. Job definitions can include the following (a CLI sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an IAM role to provide access to other AWS services;&lt;/li&gt;
&lt;li&gt;information about the memory and CPU requirements of the job;&lt;/li&gt;
&lt;li&gt;other properties required for the job such as environment variables, container properties, and mount points for extra storage.&lt;/li&gt;
&lt;/ul&gt;
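
&lt;p&gt;To illustrate these properties, this is how a job definition could be registered from the AWS CLI. A minimal sketch for a Fargate container job; the name, image, role ARN, and resource values are hypothetical placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Register a container job definition for Fargate
$ aws batch register-job-definition \
    --job-definition-name movie-ratings-import \
    --type container \
    --platform-capabilities FARGATE \
    --container-properties '{
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/movie-ratings:latest",
      "executionRoleArn": "arn:aws:iam::123456789012:role/batch-execution-role",
      "resourceRequirements": [
        {"type": "VCPU", "value": "1"},
        {"type": "MEMORY", "value": "2048"}
      ]
    }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;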

&lt;h3&gt;
  
  
  Job Queues
&lt;/h3&gt;

&lt;p&gt;Jobs are submitted to job queues. The role of a job queue is to schedule jobs and execute them on compute environments. Jobs can have a priority, based on which they can be scheduled to run on multiple different compute environments. The job queue itself decides which job is executed first and on which compute environment.&lt;/p&gt;
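
&lt;p&gt;Submitting a job to a queue is a single API call. A sketch with the AWS CLI, using hypothetical queue and job definition names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Submit a job to a queue using a registered job definition
$ aws batch submit-job \
    --job-name import-movie-ratings \
    --job-queue movie-ratings-queue \
    --job-definition movie-ratings-import
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;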

&lt;h3&gt;
  
  
  Compute Environments
&lt;/h3&gt;

&lt;p&gt;Compute environments are essentially ECS clusters. They contain the Amazon ECS container instances used for the containerized batch jobs. We can have managed or unmanaged compute environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Managed compute environments&lt;/strong&gt;: AWS Batch decides the capacity and the EC2 instance type required for the job (in case we decide to run our jobs on EC2). Alternatively, we can use a Fargate environment, which will run our containerized batch jobs on instances entirely hidden from us and fully managed by AWS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unmanaged compute environments&lt;/strong&gt;: we manage our own compute resources. This requires our compute environments to use an AMI that meets the ECS required AMI specification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multi-node jobs and GPU jobs
&lt;/h2&gt;

&lt;p&gt;AWS Batch supports multi-node parallel jobs that span multiple EC2 instances. They can be used for parallel data processing, high-performance computing applications, and training machine learning models. Multi-node jobs can run only on managed compute environments.&lt;/p&gt;

&lt;p&gt;In addition to multi-node jobs, we can enhance the underlying EC2 instances with graphics cards (GPUs). This can be useful for operations relying on parallel processing, such as deep learning.&lt;/p&gt;

&lt;p&gt;AWS Batch also supports applications that use EFA. An &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/efa.html" rel="noopener noreferrer"&gt;Elastic Fabric Adapter (EFA)&lt;/a&gt; is a network device used to accelerate High Performance Computing (HPC) applications using Message Passing Interface (MPI). Moreover, if we would like even better performance for parallel computing, we can have direct GPU-to-GPU communication via &lt;a href="https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa-part-1/" rel="noopener noreferrer"&gt;NVIDIA Collective Communication Library (NCCL)&lt;/a&gt;, which is also built on EFA.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Batch - When to use it?
&lt;/h2&gt;

&lt;p&gt;AWS Batch is recommended for any task which requires a lot of time, memory, or computing power to run. This may sound vague, so let's see some examples of use cases for AWS Batch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-performance computing: tasks that require a lot of computing power such as running usage analytics tasks on a huge amount of data, automatic content rendering, transcoding, etc.&lt;/li&gt;
&lt;li&gt;Machine Learning: as we've seen before AWS Batch supports multi-node jobs and GPU-powered jobs, which can be essential for training ML models&lt;/li&gt;
&lt;li&gt;ETL: we can use AWS Batch for ETL (extract, transform, and load) tasks&lt;/li&gt;
&lt;li&gt;For any other task which may take up a lot of time (hours/days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While these use cases may sound cool, I suggest exercising caution before deciding whether AWS Batch is the right choice for us. AWS offers a bunch of other products tailored to specialized use cases. Let's walk through a few of these:&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Batch vs AWS Glue/Amazon EMR
&lt;/h3&gt;

&lt;p&gt;Earlier, I mentioned that AWS Batch can be used for ETL jobs. While this is true, we may want to step back and take a look at another service, such as AWS Glue. AWS Glue is a fully managed solution developed specifically for ETL jobs. It is a serverless option offering a bunch of choices for data preparation, data integration, and ingestion into several other services. It relies on Apache Spark.&lt;/p&gt;

&lt;p&gt;Similarly, Amazon EMR is also an ETL solution for petabyte-scale data processing relying on open-source frameworks, such as Apache Spark, Apache Hive, and Presto.&lt;/p&gt;

&lt;p&gt;My recommendation would be to use Glue/EMR if we are comfortable with the technologies they rely on. If we want to have something custom, built by ourselves, we can stick to AWS Batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Batch vs SageMaker
&lt;/h3&gt;

&lt;p&gt;We've also seen that AWS Batch can be used for machine learning. Again, while this is true, it is a crude way of doing machine learning. AWS offers SageMaker for machine learning and data science. SageMaker can run its own jobs that can be enhanced by GPU computing power.&lt;/p&gt;

&lt;p&gt;While SageMaker is a one-stop shop for everything related to machine learning, AWS Batch is an offering for executing long-running tasks. If we have a machine learning model implemented and we just need the computing power to do the training, we can use AWS Batch; other than that, SageMaker probably makes more sense for everything ML-related.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Batch vs AWS Lambda
&lt;/h3&gt;

&lt;p&gt;AWS Lambda can also be an alternative to AWS Batch jobs. For certain generic tasks, a simple Lambda function can be more appropriate than a fully-fledged batch job. We can consider using a Lambda function when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the task is not that compute-intensive: AWS Lambda can have up to 6 vCPU cores and up to 10GB of RAM;&lt;/li&gt;
&lt;li&gt;we know that our task can finish within 15 minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we can adhere to these Lambda limitations, I strongly suggest using Lambda instead of AWS Batch. Lambda is considerably easier to set up and it has way fewer moving parts. We can simply focus on the implementation details rather than dealing with the infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an AWS Batch Job
&lt;/h2&gt;

&lt;p&gt;In the upcoming sections, we will put all these things together and build an AWS Batch job from scratch. For the sake of this exercise, let's assume we have a movie renting website and we want to present movie information, with ratings from critics, to our customers. The purpose of the batch job will be to import a set of movie ratings into a DynamoDB table at certain intervals.&lt;/p&gt;

&lt;p&gt;For the movies dataset, we will use one from &lt;a href="https://www.kaggle.com/datasets/db55ac3dfd0098a0cf96dd542807f9253a16587ff233e06baef372bccfd09942" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;. Since Kaggle does not offer an easy way to download datasets automatically, I had to download it manually and upload it to an S3 bucket first. Usually, if we were running a similar service in production, we would pay a provider to expose such a dataset for us.&lt;/p&gt;

&lt;p&gt;Also, one may question the usage of a batch job here, considering that the data size might not be that big. A Lambda function may be sufficient to accomplish the same goal. While this is true, for the sake of this exercise we will stick to Batch.&lt;/p&gt;

&lt;p&gt;A simplified architectural diagram of what we would want to accomplish can be seen here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ft5n7r43d2g0cv2gbub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ft5n7r43d2g0cv2gbub.png" alt="Import Movie Ratings Architecture" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating a batch job requires provisioning several of its components. To make this exercise reproducible, we will use Terraform for the infrastructure. The upcoming steps can be accomplished from the AWS console as well, or with other IaC tools such as CDK; Terraform is simply my preference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compute Environment
&lt;/h3&gt;

&lt;p&gt;The first component we will create is the compute environment. Our batch job will be a managed job running on AWS Fargate. We can write the IaC code for the compute environment as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_batch_compute_environment"&lt;/span&gt; &lt;span class="s2"&gt;"compute_environment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;compute_environment_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;

  &lt;span class="nx"&gt;compute_resources&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;max_vcpus&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

    &lt;span class="nx"&gt;security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;subnets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;service_role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MANAGED"&lt;/span&gt;
  &lt;span class="nx"&gt;depends_on&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_role_policy_attachment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_role_attachment&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can notice that the resource definition requires a few other resources to be present. First, the compute environment needs a service role. According to the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/batch_compute_environment#service_role" rel="noopener noreferrer"&gt;Terraform documentation&lt;/a&gt;, the service role "allows AWS Batch to make calls to other AWS services on your behalf". With all respect to the people who wrote the documentation, this statement alone does not tell me much. In all fairness, the Terraform documentation offers an example of this service role, which we will use in our project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"assume_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;

    &lt;span class="nx"&gt;principals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Service"&lt;/span&gt;
      &lt;span class="nx"&gt;identifiers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"batch.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"service_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-service-role"&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assume_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"service_role_attachment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essentially, what we are doing here is creating a role with an IAM policy offered by AWS, the name of the policy being &lt;code&gt;AWSBatchServiceRole&lt;/code&gt;. Moreover, we create a trust policy that allows AWS Batch to assume this role. &lt;/p&gt;

&lt;p&gt;Another important thing required by our compute environment is a list of security groups and subnets. I mention them together because they are part of the AWS networking infrastructure needed for the project. A security group is a &lt;a href="https://en.wikipedia.org/wiki/Stateful_firewall" rel="noopener noreferrer"&gt;stateful firewall&lt;/a&gt;, while a subnet is a partition of a virtual private network. Networking in AWS is a complex topic that falls outside the scope of this article. Since AWS Batch requires the presence of a minimal networking setup, this is what we can use for our purposes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc"&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.1.0.0/16"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Name"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-vpc"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"public_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.1.1.0/24"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Name"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-public-subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_subnet"&lt;/span&gt; &lt;span class="s2"&gt;"private_subnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.1.2.0/24"&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"Name"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-private-subnet"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_internet_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"igw"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-igw"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_eip"&lt;/span&gt; &lt;span class="s2"&gt;"eip"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_nat_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"nat"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;allocation_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-nat"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;depends_on&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;igw&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"public_rt"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_internet_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;igw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-public-rt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table"&lt;/span&gt; &lt;span class="s2"&gt;"private_rt"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;route&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_block&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;
    &lt;span class="nx"&gt;gateway_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_nat_gateway&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-private-rt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"public_rt_association"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;public_rt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_route_table_association"&lt;/span&gt; &lt;span class="s2"&gt;"private_rt_association"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_rt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, this may seem like a lot of code. What happens here is that we create an entirely new VPC with two subnets (a private one and a public one). We place our cluster in the private subnet, behind a NAT gateway, so that it can make outbound calls to the internet. This is required for our batch job to work properly, since it has to communicate with the AWS Batch API. Last but not least, for the security group, we can use this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_security_group"&lt;/span&gt; &lt;span class="s2"&gt;"sg"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-sg"&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Movies batch demo SG."&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;egress&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from_port&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;to_port&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="nx"&gt;protocol&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"-1"&lt;/span&gt;
    &lt;span class="nx"&gt;cidr_blocks&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"0.0.0.0/0"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is probably the simplest possible security group: it allows all outbound traffic and denies all inbound traffic. Remember, security groups are stateful, so responses to requests initiated by our job are still allowed back in. This should be perfect for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job Queue
&lt;/h2&gt;

&lt;p&gt;Now that we have the compute environment, we can create a job queue that will use this environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_batch_job_queue" "job_queue" {
  name     = "${var.module_name}-job-queue"
  state    = "ENABLED"
  priority = 1
  compute_environments = [
    aws_batch_compute_environment.compute_environment.arn,
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The definition of a queue is pretty simple: it needs a name, a state (enabled or disabled), a priority, and the compute environment to which it can schedule jobs. Next, we will need a job definition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Definition
&lt;/h3&gt;

&lt;p&gt;For a job definition, we need to specify a few things. Let's look at the resource definition first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_batch_job_definition"&lt;/span&gt; &lt;span class="s2"&gt;"job_definition"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-job-definition"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"container"&lt;/span&gt;

  &lt;span class="nx"&gt;platform_capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;container_properties&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform_remote_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecr_registry_url&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:latest"&lt;/span&gt;

    &lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"TABLE_NAME"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"BUCKET"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FILE_PATH"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file_path&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;fargatePlatformConfiguration&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;platformVersion&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LATEST"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;resourceRequirements&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"VCPU"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1.0"&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MEMORY"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2048"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;executionRoleArn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task_execution_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
    &lt;span class="nx"&gt;jobRoleArn&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;job_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the platform capabilities, the same job definition could target both &lt;code&gt;FARGATE&lt;/code&gt; and &lt;code&gt;EC2&lt;/code&gt;. In our case, we need only &lt;code&gt;FARGATE&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;For the container properties, we need a few things in place. Probably the most important one is the repository URL of the Docker image. We will build this Docker image in the next section. &lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;resourceRequirements&lt;/code&gt; we configure the CPU and memory allocation. These apply to the job itself, and they have to "fit" inside the compute environment; moreover, Fargate accepts only certain vCPU/memory combinations (1 vCPU, for example, can be paired with 2 GB up to 8 GB of memory). &lt;/p&gt;

&lt;p&gt;Moving on, we can specify some environment variables for the container; this is how we pass input to the job. We could also override the CMD (command) part of the Docker container and provide some input values there, but we are not doing that in this case. A sketch of what such an override could look like follows below.&lt;/p&gt;
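
&lt;p&gt;For illustration only: when submitting a job manually with the AWS CLI, both the command and the environment variables can be overridden at submission time. The job name, queue, and values below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws batch submit-job \
    --job-name import-movies \
    --job-queue movies-loader-job-queue \
    --job-definition movies-loader-job-definition \
    --container-overrides '{"command": ["python3", "main.py"], "environment": [{"name": "FILE_PATH", "value": "movies/archive.zip"}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;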

&lt;p&gt;Last, but not least, we see that the job definition requires 2 IAM roles. The first one is the execution role, which "grants to the Amazon ECS container and AWS Fargate agents permission to make AWS API calls" (according to &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/execution-IAM-role.html" rel="noopener noreferrer"&gt;AWS Batch execution IAM role&lt;/a&gt;). The second one is the job role, which is an "IAM role that the container can assume for AWS permissions" (according to the &lt;a href="https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerProperties.html" rel="noopener noreferrer"&gt;ContainerProperties docs&lt;/a&gt;). Is this confusing for anybody else or just for me? Probably yes... so let's clarify these roles.&lt;/p&gt;

&lt;p&gt;The execution role grants permission to the ECS container agent (and the AWS Fargate agent) to make certain AWS API calls on our behalf. These calls include pulling the Docker image from an ECR repository and creating CloudWatch log streams.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_task_execution_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-ecs-task-execution-role"&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assume_role_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"assume_role_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRole"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;principals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Service"&lt;/span&gt;
      &lt;span class="nx"&gt;identifiers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ecs-tasks.amazonaws.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_task_execution_role_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task_execution_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since AWS already provides a policy for the execution role (&lt;code&gt;AmazonECSTaskExecutionRolePolicy&lt;/code&gt;), we can reuse that. &lt;/p&gt;

&lt;p&gt;The job role grants permissions to the running container itself. In our case, since we write entries to a DynamoDB table, we have to give the job write permissions on that table. Likewise, since we read from an S3 bucket, we have to create a policy with S3 read permissions as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role"&lt;/span&gt; &lt;span class="s2"&gt;"job_role"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-job-role"&lt;/span&gt;
  &lt;span class="nx"&gt;assume_role_policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;assume_role_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// dynamodb table Write Policy&lt;/span&gt;
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb_write_policy_document"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"dynamodb:DeleteItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"dynamodb:GetItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"dynamodb:PutItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"dynamodb:BatchWriteItem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"dynamodb:UpdateItem"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:dynamodb:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:table/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;table_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy"&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb_write_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-dynamodb-write-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dynamodb_write_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"dynamodb_write_policy_attachment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;job_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dynamodb_write_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// S3 readonly bucket policy&lt;/span&gt;
&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"s3_read_only_policy_document"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"s3:ListObjectsInBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bucket_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy"&lt;/span&gt; &lt;span class="s2"&gt;"s3_readonly_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-s3-readonly-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_read_only_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"s3_readonly_policy_attachment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;job_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_readonly_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both the execution role and the job role require a trust policy that allows them to be assumed by the ECS Tasks service (&lt;code&gt;ecs-tasks.amazonaws.com&lt;/code&gt;); in the code above, they share the same &lt;code&gt;assume_role_policy&lt;/code&gt; document. &lt;/p&gt;
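
&lt;p&gt;For reference, that policy document renders to roughly the following JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "ecs-tasks.amazonaws.com"
            }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;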

&lt;h3&gt;
  
  
  Building the Docker Container for the Job
&lt;/h3&gt;

&lt;p&gt;For our job to be complete, we need to build a Docker image with the so-called "business logic". We can store this image in an AWS ECR repository or on DockerHub. Usually, I tend to create a separate Terraform project for the ECR repository. The reason for this choice is that the Docker image has to exist in ECR by the time the AWS Batch job is deployed. &lt;/p&gt;

&lt;p&gt;The code for the ECR is very simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecr_repository"&lt;/span&gt; &lt;span class="s2"&gt;"repository"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;repo_name&lt;/span&gt;
  &lt;span class="nx"&gt;image_tag_mutability&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MUTABLE"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also create an output with the Docker repository URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"ecr_registry_url"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecr_repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;repository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;repository_url&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This output can be imported into the other project and provided to the job definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data "terraform_remote_state" "ecr" {
  backend = "s3"

  config = {
    bucket = "tf-demo-states-1234"
    key    = "aws-batch-demo/ecr"
    region = var.aws_region
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the source code that ingests the movie ratings into DynamoDB, we can use the following Python snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;zipfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ZipFile&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Downloading data from bucket &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Extracting data!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;zip_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ZipFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;zip_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;zip_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;namelist&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_dynamo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Parsing csv data!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DictReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_content&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

    &lt;span class="n"&gt;dynamo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dynamodb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dynamo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting to write data into table &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch_writer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overview&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;overview&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;release_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;release_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vote_average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vote_average&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vote_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vote_count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;original_language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;original_language&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;popularity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;popularity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Written &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items into table &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished writing data into &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BUCKET&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FILE_PATH&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TABLE_NAME&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;is_env_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment variable BUCKET is not set!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;is_env_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment variable FILE_PATH is not set!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;is_env_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment variable TABLE_NAME is not set!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;is_env_missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_env_missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Execution finished with one ore more errors!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;download_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;write_to_dynamo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code is fairly self-explanatory: we download an archive containing a CSV file from the bucket, extract it in memory, and iterate over its rows while batch-writing them into DynamoDB. Inputs such as the bucket name, the archive path, and the table name are provided as environment variables.&lt;/p&gt;
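
&lt;p&gt;With AWS credentials configured, the script can also be run locally by setting those variables; the values below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BUCKET=movies-loader-bucket \
FILE_PATH=movies/archive.zip \
TABLE_NAME=movies \
python3 main.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;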

&lt;p&gt;For the Dockerfile, we can use the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; public.ecr.aws/docker/library/python:3.9.16-bullseye&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; main.py .&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python3", "main.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
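

&lt;p&gt;The &lt;code&gt;requirements.txt&lt;/code&gt; file referenced above is not shown here; for this script it presumably needs to contain only the AWS SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;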



&lt;p&gt;We build the image with the usual Docker &lt;code&gt;build&lt;/code&gt; (or &lt;code&gt;buildx&lt;/code&gt;) command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64 &lt;span class="nt"&gt;-t&lt;/span&gt; movies-loader &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: the &lt;code&gt;--platform&lt;/code&gt; flag is important if we are building on an Apple Silicon (M1) machine, where images default to the ARM architecture, since AWS Batch &lt;a href="https://github.com/aws/containers-roadmap/issues/1652" rel="noopener noreferrer"&gt;does not support ARM/Graviton yet&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We can push the image to the ECR repository by following the push commands shown in the AWS console.&lt;/p&gt;
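
&lt;p&gt;A sketch of those commands, with a placeholder account ID and region:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag movies-loader:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/movies-loader:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/movies-loader:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;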

&lt;h2&gt;
  
  
  Triggering a Batch Job
&lt;/h2&gt;

&lt;p&gt;There are several ways to trigger batch jobs; for instance, they are available as EventBridge targets, so a scheduled EventBridge rule could submit the job periodically (see the sketch below).&lt;/p&gt;
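
&lt;p&gt;A minimal sketch of such a scheduled rule, assuming an IAM role (not shown here) that allows EventBridge to submit Batch jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_cloudwatch_event_rule" "schedule" {
  name                = "${var.module_name}-schedule"
  schedule_expression = "rate(1 day)"
}

resource "aws_cloudwatch_event_target" "batch_target" {
  rule     = aws_cloudwatch_event_rule.schedule.name
  arn      = aws_batch_job_queue.job_queue.arn
  role_arn = aws_iam_role.events_role.arn // hypothetical role with batch:SubmitJob permissions

  batch_target {
    job_definition = aws_batch_job_definition.job_definition.arn
    job_name       = "ImportMovies"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;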

&lt;p&gt;To make my life easier and be able to debug my job, I opted to create a simple &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" rel="noopener noreferrer"&gt;Step Function&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Step Functions are state machines used for serverless orchestration. They are a perfect candidate for managing jobs, offering an easy way to see and monitor the running state of a job and to report its final status. The states of a Step Function are defined in JSON (the Amazon States Language).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_sfn_state_machine"&lt;/span&gt; &lt;span class="s2"&gt;"sfn_state_machine"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-sfn"&lt;/span&gt;
  &lt;span class="nx"&gt;role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sfn_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

  &lt;span class="nx"&gt;definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
{
    "Comment": "Run AWS Batch job",
    "StartAt": "Submit Batch Job",
    "TimeoutSeconds": 3600,
    "States": {
        "Submit Batch Job": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "ImportMovies",
                "JobQueue": "${aws_batch_job_queue.job_queue.arn}",
                "JobDefinition": "${aws_batch_job_definition.job_definition.arn}"
            },
            "End": true
        }
    }
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
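

&lt;p&gt;Once deployed, the state machine can be started from the console or with the CLI (the ARN below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:aws-batch-demo-sfn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;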



&lt;p&gt;Like everything in AWS, Step Functions require an IAM role as well. The IAM role used in our example is similar to the one given in the &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/batch-job-notification.html" rel="noopener noreferrer"&gt;AWS documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy_document"&lt;/span&gt; &lt;span class="s2"&gt;"sfn_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"batch:SubmitJob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"batch:DescribeJobs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"batch:TerminateJob"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;statement&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="s2"&gt;"events:PutTargets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"events:PutRule"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="s2"&gt;"events:DescribeRule"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:events:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_caller_identity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;account_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:rule/StepFunctionsGetEventsForBatchJobsRule"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_policy"&lt;/span&gt; &lt;span class="s2"&gt;"sfn_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-sfn-policy"&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sfn_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_iam_role_policy_attachment"&lt;/span&gt; &lt;span class="s2"&gt;"sfn_policy_attachment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sfn_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;policy_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sfn_policy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our Step Function needs to be able to create and listen to CloudWatch Events, which is why the policy grants access to the &lt;code&gt;rule/StepFunctionsGetEventsForBatchJobsRule&lt;/code&gt; resource (see the &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/WhatIsCloudWatchEvents.html" rel="noopener noreferrer"&gt;CloudWatch Events documentation&lt;/a&gt;).&lt;/p&gt;
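&lt;p&gt;For context, these event permissions are needed because of the &lt;code&gt;.sync&lt;/code&gt; integration pattern: when a state machine submits a Batch job with &lt;code&gt;arn:aws:states:::batch:submitJob.sync&lt;/code&gt;, Step Functions creates a managed EventBridge rule behind the scenes to get notified when the job finishes. A minimal sketch of such a state machine in Terraform could look like the snippet below (the job queue and job definition references are placeholders, adjust them to the resource names used in your own module):&lt;/p&gt;

&lt;pre class="highlight terraform"&gt;&lt;code&gt;resource "aws_sfn_state_machine" "sfn" {
  name     = "${var.module_name}-sfn"
  role_arn = aws_iam_role.sfn_role.arn

  # jsonencode keeps the ASL definition in sync with Terraform references
  definition = jsonencode({
    Comment = "Submit a Batch job and wait for it to complete"
    StartAt = "SubmitBatchJob"
    States = {
      SubmitBatchJob = {
        Type = "Task"
        # The .sync suffix tells Step Functions to wait for the job to
        # finish, which is what requires the events:* permissions above
        Resource = "arn:aws:states:::batch:submitJob.sync"
        Parameters = {
          JobName       = "batch-job"
          JobQueue      = aws_batch_job_queue.job_queue.arn            # placeholder
          JobDefinition = aws_batch_job_definition.job_definition.arn  # placeholder
        }
        End = true
      }
    }
  })
}
&lt;/code&gt;&lt;/pre&gt;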

&lt;p&gt;Ultimately, we end up with this simple Step Function containing only one intermediary state:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F428l39ayde7ghu2s3wfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F428l39ayde7ghu2s3wfu.png" alt="Step Function" width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;In this article, we've had a fairly in-depth introduction to the AWS Batch service. We talked about when to use AWS Batch and when to consider other services that might be better suited for the task at hand, and we built a batch job from scratch using Terraform, Docker, and Python.&lt;/p&gt;

&lt;p&gt;In conclusion, I think AWS Batch is a powerful service that gets overshadowed by other offerings targeting more specific tasks. While the service itself abstracts away the provisioning of the underlying infrastructure, the whole setup process of a batch job can still be challenging, and the official documentation in many cases lacks clarity. Ultimately, if we don't want to get into the weeds, we can rely on a &lt;a href="https://registry.terraform.io/modules/terraform-aws-modules/batch/aws/latest" rel="noopener noreferrer"&gt;Terraform module&lt;/a&gt; maintained by the community to spin up a batch job.&lt;/p&gt;
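&lt;p&gt;As a hypothetical starting point, pulling in the community module is as simple as declaring a &lt;code&gt;module&lt;/code&gt; block; the exact input variables for compute environments, job queues, and job definitions are documented on the registry page linked above:&lt;/p&gt;

&lt;pre class="highlight terraform"&gt;&lt;code&gt;module "batch" {
  source = "terraform-aws-modules/batch/aws"
  # Pin a concrete version from the registry page, e.g.:
  # version = "x.y.z"

  # Inputs describing compute environments, job queues, and job
  # definitions go here; see the module documentation for their names.
}
&lt;/code&gt;&lt;/pre&gt;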

&lt;p&gt;The source code used for this article can also be found on GitHub at this URL: &lt;a href="https://github.com/Ernyoke/aws-batch-demo" rel="noopener noreferrer"&gt;https://github.com/Ernyoke/aws-batch-demo&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;AWS Batch Documentation: &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/batch/latest/userguide/what-is-batch.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Terraform Documentation - Compute Environment: &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/batch_compute_environment#service_role" rel="noopener noreferrer"&gt;https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/batch_compute_environment#service_role&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Stateful Firewall: &lt;a href="https://en.wikipedia.org/wiki/Stateful_firewall" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Stateful_firewall&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Batch - Execution IAM Role: &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/execution-IAM-role.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/batch/latest/userguide/execution-IAM-role.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS Batch - Container Properties: &lt;a href="https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerProperties.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/batch/latest/APIReference/API_ContainerProperties.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Step Functions: &lt;a href="https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;EFA for AWS Batch: &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/efa.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/batch/latest/userguide/efa.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Optimizing deep learning on P3 and P3dn with EFA: &lt;a href="https://docs.aws.amazon.com/batch/latest/userguide/efa.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/batch/latest/userguide/efa.html&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>batch</category>
      <category>terraform</category>
      <category>containers</category>
    </item>
  </channel>
</rss>
