<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sebastian Mincewicz</title>
    <description>The latest articles on Forem by Sebastian Mincewicz (@sebolabs).</description>
    <link>https://forem.com/sebolabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F827973%2F3a4ce389-c3ad-4222-b06b-dd1ca7baafb6.png</url>
      <title>Forem: Sebastian Mincewicz</title>
      <link>https://forem.com/sebolabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sebolabs"/>
    <language>en</language>
    <item>
      <title>Reusing CloudFront, ALB, and API Gateway in a Serverless Platform</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Tue, 17 Mar 2026 22:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/reusing-cloudfront-alb-and-api-gateway-in-a-serverless-platform-1boi</link>
      <guid>https://forem.com/aws-builders/reusing-cloudfront-alb-and-api-gateway-in-a-serverless-platform-1boi</guid>
      <description>&lt;p&gt;Modern serverless platforms often need to support a mix of public-facing entry points - web applications and APIs - while keeping internal communication private and tightly controlled. Non-functional requirements around security, compliance, and operational isolation typically drive this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxanvn5a04iewc7tzl6fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxanvn5a04iewc7tzl6fz.png" alt="Post Banner" width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, such platforms are increasingly built around multiple domain services, each representing a bounded context and owned by a small team. Independent deployments, limited blast radius, and clear ownership are usually explicit architectural goals.&lt;/p&gt;

&lt;p&gt;These goals, however, often collide with another equally important requirement: speed. In fast-paced development environments, the ability to spin up feature environments on demand, test changes quickly, and tear everything down without friction is critical to maintaining delivery velocity.&lt;/p&gt;

&lt;p&gt;This post explores how architectural choices at the edge - specifically around CloudFront, Application Load Balancers (ALB), and API Gateway - can either enable or severely constrain that speed. It looks at how careful reuse of these components can support isolated, per-service deployments and on-demand environments without driving up cost or operational complexity, and where the trade-offs start to appear.&lt;/p&gt;

&lt;h2&gt;Architectural choices&lt;/h2&gt;

&lt;h3&gt;Frontend&lt;/h3&gt;

&lt;p&gt;In many serverless platforms, the frontend acts as the primary public entry point. A common pattern is to place CloudFront at the edge and route requests to different backend origins depending on their nature.&lt;/p&gt;

&lt;p&gt;Static assets are typically served from S3, while dynamic requests are forwarded to compute workloads that render pages or handle frontend-driven API calls. These dynamic workloads may run as Lambda functions or containers, often inside a VPC to meet security and compliance requirements.&lt;/p&gt;

&lt;p&gt;At this point, there is an architectural choice to make: whether dynamic frontend traffic should be handled by the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/choosing-the-right-aws-service-for-your-microservice-endpoints/services-comparison.html" rel="noopener noreferrer"&gt;API Gateway or an Application Load Balancer (ALB)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;From a non-functional requirements perspective, separating public access from private execution is key. While API Gateway is well-suited for exposing managed APIs, frontend traffic often benefits from being terminated at an ALB, with requests forwarded to compute that remains fully private within a VPC. &lt;strong&gt;CloudFront supports private origins through VPC-origin integrations with an internal ALB, whereas private API Gateways cannot be used directly as CloudFront origins&lt;/strong&gt;. This makes ALB a more natural fit when frontend compute is intentionally kept off the public internet.&lt;/p&gt;

&lt;p&gt;Placing &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/lambda-functions.html" rel="noopener noreferrer"&gt;Lambda functions behind an ALB&lt;/a&gt; and inside a VPC also addresses increasingly common security requirements in regulated environments. When compute is fronted by an internal ALB, CloudFront becomes the only public ingress path by design. The ALB is not internet-facing and cannot be accessed directly, which ensures that all external traffic is consistently terminated, inspected, and controlled at the edge.&lt;/p&gt;

&lt;p&gt;This pattern significantly reduces the risk of accidental or intentional bypass of edge-level controls. Combined with VPC-based execution, it enables tighter governance of inbound and outbound traffic through WAF, security groups, routing, and egress filtering. Together, these measures provide stronger guarantees than relying solely on managed service boundaries or publicly reachable endpoints.&lt;/p&gt;
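
&lt;p&gt;As a rough illustration, the CloudFront-to-internal-ALB wiring described above can be sketched in Terraform. This is a minimal sketch, not a complete distribution: the resource names and the &lt;code&gt;aws_lb.internal&lt;/code&gt; reference are assumptions for this example, and exact argument names may vary with the AWS provider version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# VPC origin pointing at an internal (non-internet-facing) ALB
resource "aws_cloudfront_vpc_origin" "frontend" {
  vpc_origin_endpoint_config {
    name                   = "frontend-internal-alb"
    arn                    = aws_lb.internal.arn
    http_port              = 80
    https_port             = 443
    origin_protocol_policy = "https-only"

    origin_ssl_protocols {
      items    = ["TLSv1.2"]
      quantity = 1
    }
  }
}

# Referenced from the distribution's origin block, e.g.:
#   origin {
#     domain_name = aws_lb.internal.dns_name
#     origin_id   = "frontend-alb"
#     vpc_origin_config {
#       vpc_origin_id = aws_cloudfront_vpc_origin.frontend.id
#     }
#   }
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With this in place, the ALB never needs to accept traffic from the public internet; CloudFront reaches it through the managed VPC-origin path.&lt;/p&gt;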

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pvx6i2nzflns66bs19w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pvx6i2nzflns66bs19w.png" alt="CF-ALB-Frontend" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From an operational perspective, ALB offers lower latency for HTTP workloads, native support for both host-based and path-based routing, and seamless integration with both Lambda and container-based runtimes such as ECS. These characteristics make it particularly well-suited for frontend and backend-for-frontend (BFF) use cases, where routing flexibility, performance, and network controls often outweigh API-centric features such as request validation, usage plans, or request transformation.&lt;/p&gt;

&lt;p&gt;This does not diminish the role of API Gateway in the overall architecture. API Gateway remains a strong choice for internal and external APIs, especially where API lifecycle management, authentication, throttling, and contract enforcement are primary concerns. Using ALB at the frontend edge simply allows each component to be applied where it fits best, rather than forcing a single service to satisfy fundamentally different requirements.&lt;/p&gt;

&lt;h3&gt;Internal API&lt;/h3&gt;

&lt;p&gt;Internally, platforms often expose functionality through APIs consumed by other services or internal clients. A key architectural decision is whether to operate a single internal API Gateway or multiple gateways aligned with services, domains, or teams.&lt;/p&gt;

&lt;p&gt;There is no universal answer. Smaller platforms or early-stage systems often benefit from a single internal API, as it simplifies discovery and reduces operational overhead. As platforms grow, multiple APIs aligned with domains or "two-pizza teams" can become more appropriate, particularly when ownership boundaries, release cadence, or non-functional requirements diverge.&lt;/p&gt;

&lt;p&gt;What matters more than the number of gateways, however, is recognising that &lt;strong&gt;API shape and deployment shape are not the same thing&lt;/strong&gt;. A single logical API does not require a single infrastructure deployment unit, just as multiple services do not automatically justify multiple gateways. Treating these concerns separately allows teams to optimise for independent releases, blast-radius reduction, and clearer ownership, while keeping the external API surface coherent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febepn857ewyrgoupjiwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febepn857ewyrgoupjiwv.png" alt="SingleVsMultipleAPIGs" width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This distinction becomes especially important in high-paced environments, where the ability to evolve services independently should not be constrained by how APIs are presented or grouped at the edge.&lt;/p&gt;

&lt;h3&gt;Backend services and infrastructure-as-code&lt;/h3&gt;

&lt;p&gt;From an infrastructure-as-code (IaC) perspective, backend services are often deployed as independent units to support isolation, targeted releases, and reduced blast radius. This aligns naturally with domain-driven design, where each service represents a bounded context and can evolve independently.&lt;/p&gt;

&lt;p&gt;When multiple backend services are exposed through &lt;strong&gt;a single API Gateway&lt;/strong&gt;, this model becomes more nuanced. API Gateway configuration changes - such as routes, integrations, or method settings - are only made effective through an explicit &lt;a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-deploy-api.html" rel="noopener noreferrer"&gt;deployment step&lt;/a&gt;. As a result, while service-specific configuration can be defined independently, the API Gateway itself still has a shared deployment lifecycle.&lt;/p&gt;

&lt;p&gt;This does not make per-service backend stacks incompatible with a single API Gateway, but it does require additional structure. In practice, this means decomposing API Gateway configuration into distinct concerns: a base or core API configuration, per-service or per-domain route and integration definitions, and a dedicated deployment component responsible for applying changes to the gateway. That deployment component must be notified when any service updates its portion of the API configuration, without introducing tight coupling between services or requiring knowledge of how many services exist.&lt;/p&gt;

&lt;p&gt;Without this explicit separation, a shared API Gateway can easily become a point of implicit coupling, where independent service deployments are forced to coordinate around a central API deployment. With the right decomposition and signalling in place, however, a single logical API can still support independent service lifecycles and parallel provisioning.&lt;/p&gt;
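
&lt;p&gt;A minimal Terraform sketch of this decomposition might look as follows. The &lt;code&gt;orders&lt;/code&gt; service and the &lt;code&gt;service_config_versions&lt;/code&gt; variable are hypothetical, and in a real split each concern would live in its own stack, referencing the shared API via outputs or data sources; the signalling mechanism will differ per setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Core stack: the shared API itself
resource "aws_api_gateway_rest_api" "platform" {
  name = "internal-platform-api"
}

# Per-service stack: routes and integrations for one bounded context
resource "aws_api_gateway_resource" "orders" {
  rest_api_id = aws_api_gateway_rest_api.platform.id
  parent_id   = aws_api_gateway_rest_api.platform.root_resource_id
  path_part   = "orders"
}

resource "aws_api_gateway_method" "orders_any" {
  rest_api_id   = aws_api_gateway_rest_api.platform.id
  resource_id   = aws_api_gateway_resource.orders.id
  http_method   = "ANY"
  authorization = "NONE"
}

# Deployment stack: re-deploys the gateway whenever any service
# signals a change to its portion of the API configuration
resource "aws_api_gateway_deployment" "platform" {
  rest_api_id = aws_api_gateway_rest_api.platform.id

  triggers = {
    redeployment = sha1(jsonencode(var.service_config_versions))
  }

  lifecycle {
    create_before_destroy = true
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The key property is that services only publish a version signal; the deployment component consumes it without knowing how many services exist.&lt;/p&gt;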

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxdvnz4dealdxtlsbqq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxdvnz4dealdxtlsbqq2.png" alt="TFsplit" width="800" height="916"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In contrast, using an &lt;strong&gt;API Gateway per backend service&lt;/strong&gt; naturally scopes configuration and deployment to the service itself, avoiding the need for cross-service coordination at the gateway level. The trade-off is increased operational overhead and a more fragmented API surface. The choice between these approaches is therefore not about feasibility, but about how much complexity is absorbed by infrastructure design versus operational management, and how much flexibility is required as the platform continues to evolve.&lt;/p&gt;

&lt;h2&gt;On-demand feature environments&lt;/h2&gt;

&lt;h3&gt;Why they matter&lt;/h3&gt;

&lt;p&gt;Fast feedback is essential. Feature environments allow teams to validate changes in realistic conditions, unblock reviews, and catch integration issues early. While tools like &lt;a href="https://www.localstack.cloud/" rel="noopener noreferrer"&gt;LocalStack&lt;/a&gt; can help with local development, they have limitations - especially for integration testing, CI pipelines, and workflows involving multiple services.&lt;/p&gt;

&lt;p&gt;On-demand environments bridge that gap, but naïvely provisioning a full stack per feature quickly becomes expensive and, above all, too slow.&lt;/p&gt;

&lt;h3&gt;Reuse as an enabler&lt;/h3&gt;

&lt;p&gt;Not all infrastructure needs to be duplicated. Components like CloudFront, ALB, and API Gateway are surprisingly well-suited to reuse, provided routing is designed with that goal in mind.&lt;/p&gt;

&lt;p&gt;This is not just a cost consideration. Creating or modifying edge components such as CloudFront distributions or load balancers can easily take ten minutes or more, which makes them a poor fit for fast, iterative workflows. When feature environments depend on provisioning or reconfiguring these components, feedback loops slow down dramatically.&lt;/p&gt;

&lt;p&gt;The key, therefore, is deciding what varies per environment and what stays shared. By reusing long-lived edge infrastructure and shifting environment-specific concerns into routing, configuration, and backend resources, platforms can keep costs predictable and provisioning times low. The trade-off is that this requires upfront discipline in routing design, naming conventions, and configuration boundaries, but that investment pays off quickly as the environment count and delivery pace increase.&lt;/p&gt;

&lt;h3&gt;Ephemeral on-demand environments&lt;/h3&gt;

&lt;p&gt;Ephemeral environments, the holy grail of this approach, aim for maximum speed. They rely on shared edge infrastructure and differentiate environments through routing.&lt;/p&gt;

&lt;p&gt;This approach typically requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared CloudFront configuration with a common domain name&lt;/li&gt;
&lt;li&gt;Path-based routing at ALB&lt;/li&gt;
&lt;li&gt;Frontend support for runtime &lt;code&gt;basePath&lt;/code&gt; and &lt;code&gt;assetPrefix&lt;/code&gt; changes&lt;/li&gt;
&lt;li&gt;Multiple API Gateway stages mapped to environment identifiers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c9hypc1z8yyq7fh6k09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9c9hypc1z8yyq7fh6k09.png" alt="EphmeralEnvs" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Internal API design is irrelevant here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When these conditions are met, spinning up a new environment becomes little more than adding routing rules and deploying service-specific resources. Cleanup is trivial, and costs remain largely constant regardless of the number of environments. Neither CloudFront nor the ALB needs to be aware of how many environments exist or may exist in the future.&lt;/p&gt;
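
&lt;p&gt;Those routing rules can be sketched in Terraform roughly as follows. Variable names such as &lt;code&gt;env_id&lt;/code&gt; and the shared listener/API references are assumptions for this example, not a prescribed layout.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Path-based ALB listener rule for one ephemeral environment
resource "aws_lb_listener_rule" "feature_env" {
  listener_arn = var.shared_listener_arn
  priority     = var.env_priority

  condition {
    path_pattern {
      values = ["/${var.env_id}/*"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.env.arn
  }
}

# API Gateway stage mapped to the same environment identifier
resource "aws_api_gateway_stage" "env" {
  rest_api_id   = var.shared_rest_api_id
  deployment_id = var.deployment_id
  stage_name    = var.env_id
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Teardown is then the inverse: delete the rule, the target group, and the stage, leaving the shared edge untouched.&lt;/p&gt;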

&lt;h3&gt;Semi-static on-demand environments&lt;/h3&gt;

&lt;p&gt;Some frontends (e.g., Next.js) require environment-specific configuration at build time and cannot easily adapt to runtime path changes. In these cases, a semi-static approach is often necessary.&lt;/p&gt;

&lt;p&gt;Here, each environment may have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared CloudFront configuration with alternate domain names and static Route 53 records&lt;/li&gt;
&lt;li&gt;Shared ALB with host-based routing&lt;/li&gt;
&lt;li&gt;Any API Gateway configuration implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt35u9wu8jd44aaboi12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt35u9wu8jd44aaboi12.png" alt="SemiStaticEnvs" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Internal API design is irrelevant here.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In contrast to the ephemeral environments option, this model requires CloudFront and Route 53 to be prepared upfront. Each environment must be explicitly represented through distribution configuration and DNS records, making environments longer-lived and less dynamic by design.&lt;/p&gt;
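
&lt;p&gt;The host-based variant can be sketched in Terraform like this. The &lt;code&gt;example.com&lt;/code&gt; domain and variable names are placeholders, and the alias points at the shared CloudFront distribution, which must already list the environment hostname as an alternate domain name with a matching certificate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Host-based ALB listener rule for one semi-static environment
resource "aws_lb_listener_rule" "env_host" {
  listener_arn = var.shared_listener_arn
  priority     = var.env_priority

  condition {
    host_header {
      values = ["${var.env_id}.example.com"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.env.arn
  }
}

# Static DNS record pointing the environment hostname at CloudFront
resource "aws_route53_record" "env" {
  zone_id = var.zone_id
  name    = "${var.env_id}.example.com"
  type    = "A"

  alias {
    name                   = var.cloudfront_domain_name
    zone_id                = "Z2FDTNDATAQYW2" # CloudFront's fixed hosted zone ID
    evaluate_target_health = false
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;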

&lt;h3&gt;Practical limits and quotas&lt;/h3&gt;

&lt;p&gt;In practice, hard service quotas are rarely the first blocker. More often, provisioning time and operational friction become limiting factors as environments scale.&lt;/p&gt;

&lt;p&gt;That said, quotas do influence how far reuse patterns can go, and it matters whether a given limit can be raised or must be designed around. API Gateway REST APIs support a limited number of stages per API (10 by default, a soft limit), which constrains stage-based environment reuse. Private API Gateway custom domain names are also subject to soft limits. On the load-balancing side, an ALB supports up to 100 listener rules (a soft limit) and 100 target groups (a hard limit), the latter requiring an architectural change once reached.&lt;/p&gt;

&lt;p&gt;For example, reusing a single ALB across ten environments consumes listener rules at a manageable pace, while a single API Gateway with one stage per environment hits its default stage limit almost immediately. These constraints don't prevent reuse, but they highlight the need to understand early which limits can be adjusted and which require rethinking the design.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The patterns described in this post are not a single fixed architecture, but a &lt;strong&gt;set of reuse-oriented edge and routing strategies&lt;/strong&gt; for serverless platforms built on CloudFront, ALB, and API Gateway. Their value lies in being applied selectively.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;lower environments&lt;/strong&gt; - feature, preview, and integration stages - reusing edge components and differentiating environments through routing enables fast provisioning, easy teardown, and predictable costs. These environments benefit most from shared CloudFront distributions, shared ALBs, and carefully structured API Gateway reuse, where speed and feedback outweigh the need for strict isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upper environments are different&lt;/strong&gt;. Staging, pre-production, and production typically require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standalone, fully isolated infrastructure&lt;/li&gt;
&lt;li&gt;Right-sized capacity and scaling characteristics&lt;/li&gt;
&lt;li&gt;Stable DNS and edge configuration&lt;/li&gt;
&lt;li&gt;The ability to run meaningful load, stress, and performance tests against an environment that truly behaves like production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those environments, reuse at the edge is usually inappropriate. Dedicated CloudFront distributions, ALBs, and API Gateways are not an optimisation failure - they are a requirement.&lt;/p&gt;

&lt;p&gt;Supporting both approaches within the same platform does introduce additional complexity, particularly in infrastructure-as-code. Designing IaC that remains DRY while allowing some environments to share edge infrastructure and others to be fully isolated requires clear layering, strong conventions, and explicit ownership boundaries. This is more demanding than a uniform setup, but it is entirely achievable with deliberate design.&lt;/p&gt;

&lt;p&gt;These patterns optimise for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast feedback and experimentation where environment churn is high&lt;/li&gt;
&lt;li&gt;Independent service evolution with controlled blast radius&lt;/li&gt;
&lt;li&gt;Predictable cost and provisioning time at scale&lt;/li&gt;
&lt;li&gt;The ability to evolve toward stricter isolation when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are not optimised for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum isolation across all environments&lt;/li&gt;
&lt;li&gt;Treating edge infrastructure as immutable per environment&lt;/li&gt;
&lt;li&gt;Eliminating all shared components&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Being explicit about where reuse applies - and where it must not - allows the platform to move quickly without compromising correctness. The goal is not to reuse everything, but to reuse the &lt;em&gt;right&lt;/em&gt; things, in the &lt;em&gt;right&lt;/em&gt; places, for the &lt;em&gt;right&lt;/em&gt; reasons.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>architecture</category>
      <category>cloud</category>
      <category>aws</category>
    </item>
    <item>
      <title>AWS CodeBuild-powered GitHub Actions self-hosted runners — without webhooks</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Tue, 03 Feb 2026 13:45:26 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-codebuild-powered-github-actions-self-hosted-runners-without-webhooks-je4</link>
      <guid>https://forem.com/aws-builders/aws-codebuild-powered-github-actions-self-hosted-runners-without-webhooks-je4</guid>
      <description>&lt;p&gt;This topic may sound familiar, but this post intentionally goes beyond what you’ll find in AWS documentation or official blog posts.&lt;/p&gt;

&lt;p&gt;The goal is to &lt;strong&gt;avoid webhooks&lt;/strong&gt; and instead achieve maximum flexibility when using GitHub Actions for CI/CD with AWS CodeBuild-powered, ephemeral runners &lt;strong&gt;only when they are actually needed&lt;/strong&gt;, while continuing to rely on GitHub-hosted runners for everything else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlhntb5fwpmutnybiz3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftlhntb5fwpmutnybiz3i.png" alt="post image" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Webhooks — one-way door&lt;/h2&gt;

&lt;p&gt;I’m not saying webhooks are wrong. If you have a clear reason to use them, they can be a good fit. However, in practice, they often become a one-way route that reduces flexibility — and I strongly prefer fit-for-purpose solutions.&lt;/p&gt;

&lt;p&gt;Relevant AWS documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/codebuild/latest/userguide/action-runner-overview.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/codebuild/latest/userguide/action-runner-overview.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/devops/aws-codebuild-managed-self-hosted-github-action-runners/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/devops/aws-codebuild-managed-self-hosted-github-action-runners/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main issue with the webhook option is that you subscribe to specific GitHub event types, which effectively pushes you toward using CodeBuild for all jobs — unless you introduce increasingly complex filters and workflow logic.&lt;/p&gt;

&lt;p&gt;At that point, your GitHub Actions configuration starts encoding infrastructure decisions, which is rarely ideal.&lt;/p&gt;

&lt;h2&gt;On-demand &amp;amp; ad-hoc runners — flexibility&lt;/h2&gt;

&lt;p&gt;Instead, there’s a way to make CodeBuild-powered self-hosted runners available explicitly for the workflows that need them — and only when they’re actually required.&lt;/p&gt;

&lt;p&gt;The idea is to start a CodeBuild project &lt;strong&gt;on demand&lt;/strong&gt;, scoped to a specific GitHub Actions workflow run, and configured so it can only be used by that run. This avoids clashes, ghost runs, or unintended usage across repositories within your GitHub organisation.&lt;/p&gt;

&lt;p&gt;Consider a setup where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;some workflows only build artifacts,&lt;/li&gt;
&lt;li&gt;others run unit tests or static analysis,&lt;/li&gt;
&lt;li&gt;others perform deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of these can run perfectly fine on GitHub-hosted runners. However, there are cases where that breaks down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;running tests against IP-allowlist-protected public endpoints,&lt;/li&gt;
&lt;li&gt;running tests against private endpoints accessible only from within a custom VPC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the first case, some teams attempt to add GitHub runners’ public IP ranges to allowlists. This creates a false sense of security, as those endpoints remain accessible to a large, shared address space.&lt;/p&gt;

&lt;p&gt;In the second case — private endpoints — it’s simply a hard stop.&lt;/p&gt;

&lt;p&gt;So how do you bring both worlds together and cover all of these use cases?&lt;/p&gt;

&lt;h3&gt;Architecture overview&lt;/h3&gt;

&lt;p&gt;In this architecture, a GitHub App is created with its associated private key, enabling secure authentication with one or more repositories. AWS CodeBuild leverages this identity to generate temporary tokens and register itself dynamically as an ephemeral GitHub Actions runner.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This approach builds on the standard GitHub -&amp;gt; AWS OIDC integration, as documented &lt;a href="https://docs.github.com/en/actions/how-tos/secure-your-work/security-harden-deployments/oidc-in-aws" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yl74flpq4w105monscg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yl74flpq4w105monscg.png" alt="Flow HLD" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;GitHub App&lt;/h3&gt;

&lt;p&gt;A key part of making this pattern robust and production-ready is how the GitHub runner registration token is obtained. While many examples rely on a Personal Access Token (PAT), using a &lt;strong&gt;GitHub App&lt;/strong&gt; is the more appropriate choice for team and organisation-level setups.&lt;/p&gt;

&lt;p&gt;A GitHub App can be installed on &lt;strong&gt;specific repositories only&lt;/strong&gt;, which aligns well with the idea of tightly scoped, purpose-built runners. This ensures that a CodeBuild-provisioned runner can only ever register against repositories you explicitly allow, reducing both blast radius and the risk of accidental reuse across the organisation.&lt;/p&gt;

&lt;p&gt;The permission model is also much cleaner than with PATs. At a minimum, the GitHub App needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;read access to code and metadata&lt;/li&gt;
&lt;li&gt;read and write access to administration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No organisation-wide permissions are required, and the short-lived installation access tokens generated by the App naturally fit the ephemeral runner lifecycle.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;For &lt;strong&gt;personal projects&lt;/strong&gt;, prototypes, or one-off experiments, a &lt;strong&gt;fine-grained or classic PAT&lt;/strong&gt; can still be a perfectly acceptable and much simpler option. It avoids the additional setup overhead of a GitHub App and is often “good enough” when the scope and risk are limited.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;AWS CodeBuild project &amp;amp; buildspec&lt;/h3&gt;

&lt;p&gt;From the CodeBuild project perspective, it really doesn’t get any simpler. The project does not need to be configured with a source repository at all — the buildspec can remain fully generic and work for any repository within your GitHub organisation.&lt;/p&gt;
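
&lt;p&gt;Such a project can be defined in Terraform along these lines. This is a hedged sketch: the resource names, variables, and especially the image identifier are assumptions, and you should pick a current Graviton (ARM) CodeBuild image for your region.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_codebuild_project" "gha_runner" {
  name         = "gha-ephemeral-runner"
  service_role = aws_iam_role.codebuild.arn

  # No source repository: the generic buildspec is supplied inline
  source {
    type      = "NO_SOURCE"
    buildspec = file("${path.module}/buildspec.yml")
  }

  artifacts {
    type = "NO_ARTIFACTS"
  }

  environment {
    compute_type = "BUILD_GENERAL1_SMALL"
    type         = "ARM_CONTAINER"
    # Indicative Graviton image name; verify against current offerings
    image        = "aws/codebuild/amazonlinux2-aarch64-standard:3.0"

    environment_variable {
      name  = "GITHUB_ORG"
      value = var.github_org
    }
  }

  # Place the runner inside the VPC to reach private endpoints
  vpc_config {
    vpc_id             = var.vpc_id
    subnets            = var.private_subnet_ids
    security_group_ids = [aws_security_group.runner.id]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The remaining inputs (&lt;code&gt;GITHUB_REPO&lt;/code&gt;, &lt;code&gt;GITHUB_ACTIONS_RUNNER_NAME&lt;/code&gt;) are supplied per build by the workflow, keeping the project itself repository-agnostic.&lt;/p&gt;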

&lt;p&gt;The snippet below assumes the CodeBuild project is configured to use Amazon Linux running on Graviton-powered infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# REQUIRED INPUT:
# - GITHUB_ORG (from IaC)
# - GITHUB_APP_ID (from IaC)
# - GITHUB_APP_INSTALLATION_ID (from IaC)
# - GITHUB_APP_PK_ASM_SECRET_ARN (from IaC)
# - GITHUB_REPO (from GHA workflow)
# - GITHUB_ACTIONS_RUNNER_NAME (from GHA workflow)

version: 0.2

env:
  variables:
    USER_HOME_DIR: "/home/runneruser"
  secrets-manager:
    GITHUB_APP_PK: "${GITHUB_APP_PK_ASM_SECRET_ARN}"

phases:
  install:
    commands:
      - |
        echo "&amp;gt; Creating runner user..."
        useradd -m runneruser -d $USER_HOME_DIR
        echo "runneruser ALL = NOPASSWD:/usr/bin/yum" &amp;gt;&amp;gt; /etc/sudoers # &amp;lt;- only needed if jobs must install dependencies on the fly

        echo "&amp;gt; Downloading the lastest runner installation package..."
        cd $USER_HOME_DIR
        RUNNER_VERSION=$(curl -s https://api.github.com/repos/actions/runner/tags | jq -r '.[0].name' | sed 's/^v//')
        echo "Latest tag: $RUNNER_VERSION"
        curl -Ls https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-arm64-${RUNNER_VERSION}.tar.gz -o actions-runner.tar.gz
        mkdir actions-runner &amp;amp;&amp;amp; tar xzf actions-runner.tar.gz -C actions-runner
        chown -R runneruser:runneruser .

  build:
    commands:
      - |
        echo "&amp;gt; Generating GitHub App JWT (header + payload)..."
        cd $USER_HOME_DIR/actions-runner
        now=$(date +%s)
        iat=$((${now} - 60))  # Issued 1 minute in the past
        exp=$((${now} + 600)) # Expires 10 minutes in the future

        b64enc() { openssl base64 | tr -d '=' | tr '/+' '_-' | tr -d '\n'; }

        header_json='{
            "typ":"JWT",
            "alg":"RS256"
        }'
        # Header encode
        header=$(echo -n "${header_json}" | b64enc)

        payload_json="{
            \"iat\":${iat},
            \"exp\":${exp},
            \"iss\":\"${GITHUB_APP_ID}\"
        }"
        # Payload encode
        payload=$( echo -n "${payload_json}" | b64enc )

        # Signature
        header_payload="${header}"."${payload}"
        signature=$(
            openssl dgst -sha256 -sign &amp;lt;(echo -n "${GITHUB_APP_PK}") \
            &amp;lt;(echo -n "${header_payload}") | b64enc
        )
        # Generate an installation token for the app
        JWT="${header_payload}"."${signature}"

        echo "&amp;gt; Requesting GitHub App installation access token..."
        INSTALLATION_TOKEN=$(curl --request POST \
            --url "https://api.github.com/app/installations/${GITHUB_APP_INSTALLATION_ID}/access_tokens" \
            --header "Accept: application/vnd.github+json" \
            --header "Authorization: Bearer ${JWT}" \
            --header "X-GitHub-Api-Version: 2022-11-28"  \
          | jq -r '.token'
        )

        echo "&amp;gt; Requesting ephemeral runner registration token..."
        GITHUB_RUNNER_TOKEN=$(curl --request POST \
            --url "https://api.github.com/repos/${GITHUB_ORG}/${GITHUB_REPO}/actions/runners/registration-token" \
            --header "Accept: application/vnd.github+json" \
            --header "Authorization: Bearer ${INSTALLATION_TOKEN}" | jq -r '.token'
        )

        echo "&amp;gt; Configuring GitHub Actions runner for ${GITHUB_ORG}/${GITHUB_REPO} ..."
        su runneruser -c "./config.sh \
          --url https://github.com/${GITHUB_ORG}/${GITHUB_REPO} \
          --token ${GITHUB_RUNNER_TOKEN} \
          --unattended --ephemeral \
          --name ${GITHUB_ACTIONS_RUNNER_NAME} \
          --labels self-hosted,${GITHUB_ACTIONS_RUNNER_NAME}"

        echo "&amp;gt; Starting runner..."
        su runneruser -c "./run.sh"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
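&lt;p&gt;The JWT assembly in the &lt;code&gt;build&lt;/code&gt; phase can be exercised locally with a throwaway key before wiring it into CodeBuild. Below is a minimal, self-contained sketch; the app ID is a dummy value, and the key is generated on the fly rather than pulled from a real GitHub App.&lt;/p&gt;

```shell
# Standalone version of the buildspec's JWT assembly, using a throwaway RSA key.
# GITHUB_APP_ID is a placeholder here; in CodeBuild it comes from the environment.
set -e
GITHUB_APP_ID="12345"
openssl genrsa -out /tmp/gh-app-test.pem 2048

# base64url encoding: strip padding, swap '+/' for '-_', drop newlines
b64enc() { openssl base64 | tr -d '=' | tr '/+' '_-' | tr -d '\n'; }

now=$(date +%s)
header=$(printf '{"typ":"JWT","alg":"RS256"}' | b64enc)
payload=$(printf '{"iat":%s,"exp":%s,"iss":"%s"}' "$((now - 60))" "$((now + 600))" "$GITHUB_APP_ID" | b64enc)
signature=$(printf '%s.%s' "$header" "$payload" | openssl dgst -sha256 -sign /tmp/gh-app-test.pem | b64enc)
JWT="$header.$payload.$signature"
echo "$JWT"
```

&lt;p&gt;The printed token should have exactly three dot-separated base64url segments; decoding the first two should round-trip the header and payload JSON.&lt;/p&gt;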



&lt;p&gt;If you wish, you can build a custom image and eliminate the &lt;code&gt;install&lt;/code&gt; phase entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  GitHub Actions workflow
&lt;/h3&gt;

&lt;p&gt;Below is a workflow snippet that completes the configuration and shows how everything fits together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  start-cb-runner:
    runs-on: ubuntu-latest
    outputs:
      runner_name: ${{ steps.start-cb-project.outputs.runner_name }}
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v5
        with:
          audience: sts.amazonaws.com
          aws-region: ${{ env.AWS_REGION }}
          role-to-assume: ${{ env.AWS_ROLE_ARN }}
          role-session-name: GithubActionsSession

      - name: Start CodeBuild project
        id: start-cb-project
        run: |
          RUNNER_NAME="cb-runner-${GITHUB_RUN_ID}-${GITHUB_SHA::8}"
          CODEBUILD_PROJECT_NAME="${{ env.CODEBUILD_PROJECT_NAME }}"
          echo "runner_name=$RUNNER_NAME" &amp;gt;&amp;gt; $GITHUB_OUTPUT
          PAYLOAD=$(jq -n \
            --arg project "$CODEBUILD_PROJECT_NAME" \
            --arg gh_repo "${{ github.event.repository.name }}" \
            --arg gha_runner_name "$RUNNER_NAME" \
            '{
              projectName: $project,
              environmentVariablesOverride: [
                {name: "GITHUB_REPO", value: $gh_repo, type: "PLAINTEXT"},
                {name: "GITHUB_ACTIONS_RUNNER_NAME", value: $gha_runner_name, type: "PLAINTEXT"}
              ]
            }')
          aws codebuild start-build --cli-input-json "$PAYLOAD" &amp;gt; /dev/null

  run-tests-on-cb-runner:
    needs: start-cb-runner
    runs-on: [self-hosted, "${{ needs.start-cb-runner.outputs.runner_name }}"]
    steps:
      - uses: actions/checkout@v6
      - name: Introduction message
        run: |
          echo "Testing ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From a performance perspective, bringing a runner online should take &lt;strong&gt;no more than ~30 seconds&lt;/strong&gt;, after which it is ready to pick up the queued job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9z0ng2kuccxz7lw3uo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl9z0ng2kuccxz7lw3uo4.png" alt="GHA view" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the less obvious but critical elements is the &lt;code&gt;RUNNER_NAME&lt;/code&gt;. In teams where multiple developers and testers run workflows in parallel, there is always a risk of workflows competing for runners. By generating a unique runner name per workflow run and passing it into the CodeBuild project, you guarantee that the runner you spin up is used exclusively for that specific execution and cannot be accidentally picked up by another job.&lt;/p&gt;

&lt;p&gt;Finally, the CodeBuild project output should look similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Configuring and starting runner for sebolabs/my-repo ...
--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
√ Runner successfully added
# Runner settings
√ Settings Saved.
√ Connected to GitHub
Current runner version: '2.331.0'
2026-01-19 20:34:07Z: Listening for Jobs
2026-01-19 20:34:09Z: Running job: run-tests-on-cb-runner
2026-01-19 20:34:17Z: Job run-tests-on-cb-runner completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stop the service, no retry needed.
Exiting runner...
[Container] 2026/01/19 20:34:17.683704 Phase complete: BUILD State: SUCCEEDED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;Using AWS CodeBuild–powered, ephemeral GitHub Actions self-hosted runners without webhooks gives you precise, on-demand control over when and why you leave the GitHub-hosted runner pool. You retain full workflow flexibility, keep logs and execution visibility inside the GitHub Actions UI, and only incur additional infrastructure when a workflow genuinely requires network proximity to protected or private endpoints.&lt;/p&gt;

&lt;p&gt;This model avoids over-coupling your CI/CD design to webhook-driven automation, reduces the risk of unintended runner usage, and scales cleanly even when many workflows are triggered in parallel. Treating CodeBuild runners as a specialised, opt-in execution environment — rather than a default — keeps both architecture and blast radius under control.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Historically, self-hosted GitHub Actions runners were effectively free from GitHub’s billing perspective — you only paid for the infrastructure you ran them on. That changes on &lt;strong&gt;March 1, 2026&lt;/strong&gt;, when GitHub will introduce a &lt;strong&gt;$0.002 per-minute GitHub Actions cloud platform charge&lt;/strong&gt; for self-hosted runner usage, with those minutes counting toward your plan.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>githubactions</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS Anywhere - a route to EKS Hybrid Nodes</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Fri, 03 Jan 2025 13:16:24 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-anywhere-a-route-to-eks-hybrid-nodes-4h6d</link>
      <guid>https://forem.com/aws-builders/aws-anywhere-a-route-to-eks-hybrid-nodes-4h6d</guid>
      <description>&lt;p&gt;In an era where cloud and on-premises environments increasingly converge, the ability to seamlessly integrate these ecosystems has never been more critical. This story explores what I like to call &lt;em&gt;"AWS Anywhere"&lt;/em&gt; - an overarching concept encompassing a suite of AWS capabilities that enable seamless hybrid operations. From establishing a &lt;strong&gt;Site-to-Site VPN&lt;/strong&gt; to bridging your on-premises network with AWS, configuring &lt;strong&gt;Route 53 Inbound Resolvers&lt;/strong&gt; to enable private connections to &lt;strong&gt;VPC Endpoint Interfaces&lt;/strong&gt;, leveraging &lt;strong&gt;IAM Roles Anywhere&lt;/strong&gt; for secure identity management, and setting up the &lt;strong&gt;SSM agent&lt;/strong&gt; for streamlined operations, this journey culminates in deploying &lt;strong&gt;EKS Hybrid Nodes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This builds on my earlier story, &lt;a href="https://dev.to/aws-builders/aws-landing-zone-hybrid-networking-ele"&gt;AWS Landing Zone 3: Hybrid Networking&lt;/a&gt;, where I explored hybrid networking fundamentals. This time, I'm going a step further by providing source code and guides, making it easier for anyone to replicate the setup and put these concepts into action.&lt;/p&gt;

&lt;p&gt;This story takes a practical approach, using a &lt;strong&gt;Raspberry Pi&lt;/strong&gt; as the on-premises node to simulate real-world scenarios &lt;strong&gt;at home&lt;/strong&gt;. By the end, you'll understand how these AWS services combine to create a unified hybrid infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Hybrid Nodes TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://medium.com/r/?url=https%3A%2F%2Faws.amazon.com%2Feks%2Fhybrid-nodes%2F" rel="noopener noreferrer"&gt;Amazon EKS Hybrid Nodes&lt;/a&gt;, introduced at AWS re:Invent 2024, allow businesses to run Kubernetes workloads seamlessly across on-premises, edge, and cloud environments. This solution simplifies Kubernetes management by offloading control plane availability and scalability to AWS while integrating with services like centralized logging and monitoring. It enables organizations to maximize existing infrastructure while modernizing deployments with AWS cloud capabilities. This unified approach reduces operational complexity and accelerates application modernization.&lt;br&gt;
Before we get there, let me take you through the foundational steps that must be set up first.&lt;/p&gt;
&lt;h2&gt;
  
  
  Source code &amp;amp; guides
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Terraform
&lt;/h3&gt;

&lt;p&gt;The Terraform configuration files are available via the link below. These files include switches - disabled by default - represented by boolean variables that enable specific functionalities, all of which are detailed in the sections below.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/sebolabs/aws-anywhere-tf" rel="noopener noreferrer"&gt;https://github.com/sebolabs/aws-anywhere-tf&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Guides.md
&lt;/h3&gt;

&lt;p&gt;I've also utilized a combination of &lt;code&gt;templatefile()&lt;/code&gt; and &lt;code&gt;local_file&lt;/code&gt; resources to generate markdown guides. These guides provide step-by-step instructions and commands to configure the on-premises side of the setup, whether you're using a Raspberry Pi or another machine. Once a functionality is enabled and applied, a tailored guide file is generated, complete with references and values specific to your environment.&lt;/p&gt;
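&lt;p&gt;As a sketch of that mechanism - resource names, template path, and variables below are illustrative, not the repository's actual ones - a guide can be rendered like this:&lt;/p&gt;

```terraform
# Hypothetical example: render a per-environment markdown guide from a template
resource "local_file" "s2s_vpn_guide" {
  count    = var.on_prem_s2s_vpn_enabled ? 1 : 0
  filename = "${path.module}/guides/S2S_VPN.md"
  content = templatefile("${path.module}/templates/s2s_vpn.md.tpl", {
    tunnel1_address       = aws_vpn_connection.on_prem[0].tunnel1_address
    tunnel1_preshared_key = aws_vpn_connection.on_prem[0].tunnel1_preshared_key
  })
}
```

&lt;p&gt;Each switch gates its own &lt;code&gt;local_file&lt;/code&gt;, so a guide only appears once the corresponding functionality is enabled and applied.&lt;/p&gt;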
&lt;h2&gt;
  
  
  AWS Anywhere
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Raspberry Pi
&lt;/h3&gt;

&lt;p&gt;For my setup, I used my Raspberry Pi - the same one I used a few years ago to simulate hybrid networking, as mentioned in the introduction. It runs Ubuntu 24.04, ensuring compatibility with all necessary installations to make this integration work seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvnqoiyba00nbxej0k3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhvnqoiyba00nbxej0k3d.png" alt="My Raspberry Pi" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh pi                                                                                                                                                                                                                         255 ✘ │ 14:50:42 

Welcome to Ubuntu 24.04.1 LTS (GNU/Linux 6.8.0-1017-raspi aarch64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 * Strictly confined Kubernetes makes edge and IoT secure. Learn how MicroK8s
   just raised the bar for easy, resilient and secure K8s cluster deployment.

   https://ubuntu.com/engage/secure-kubernetes-at-the-edge

Last login: Sat Dec 28 14:37:17 2024 from 192.168.101.10

seb@pi:~$ neofetch
            .-/+oossssoo+/-.               seb@pi
        `:+ssssssssssssssssss+:`           ------
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 24.04.1 LTS aarch64
    .ossssssssssssssssssdMMMNysssso.       Host: Raspberry Pi 4 Model B Rev 1.4
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 6.8.0-1017-raspi
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 3 days, 5 hours, 24 mins
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 810 (dpkg)
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.2.21
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Terminal: /dev/pts/0
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: (4) @ 1.800GHz
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Memory: 399MiB / 7802MiB
+sssshhhyNMMNyssssssssssssyNMMMysssssss+
.ssssssssdMMMNhsssssssssshNMMMdssssssss.
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once it's up and running, we can move on to the next step.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS VPC defaults
&lt;/h3&gt;

&lt;p&gt;By default, apart from the Transit Gateway responsible for integrating different networks, the module configures a Transit VPC. This VPC hosts VPC Interface Endpoints and can also act as a centralized egress point in your landing zone setup. Here's what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckbwbsqbi8fhiahlijd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fckbwbsqbi8fhiahlijd8.png" alt="HLD: Part 1 - Transit Gateway + Transit VPC" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Site-to-Site VPN
&lt;/h3&gt;

&lt;p&gt;AWS Site-to-Site VPN is a flexible and cost-effective solution for securely connecting on-premises networks to AWS. While AWS Direct Connect offers a more reliable, dedicated connection with lower latency, Site-to-Site VPN is often chosen for its quicker, simpler setup or when Direct Connect isn't available. The VPN connection uses IPsec tunnels over the Internet, ensuring secure communication between local networks and AWS resources.&lt;/p&gt;

&lt;p&gt;On the AWS side, this is a managed service. On my Pi, I configured the VPN connection using &lt;strong&gt;StrongSwan&lt;/strong&gt;, an open-source IPsec-based VPN solution. StrongSwan provides flexible configuration and integrates seamlessly with diverse network setups. By pairing StrongSwan with AWS's managed service, I maintain granular control over the VPN configuration while benefiting from AWS's operational simplicity.&lt;/p&gt;
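&lt;p&gt;For orientation, a single-tunnel StrongSwan configuration typically looks something like the fragment below. All addresses, CIDRs, and proposal sets are placeholders - the generated guide contains the values matching your actual VPN connection.&lt;/p&gt;

```conf
# /etc/ipsec.conf - hypothetical single AWS tunnel
conn aws-vpn-tunnel1
    auto=start
    type=tunnel
    keyexchange=ikev1
    authby=secret
    left=%defaultroute
    leftid=203.0.113.10          # on-prem public IP (customer gateway)
    leftsubnet=192.168.101.0/24  # home network
    right=198.51.100.20          # AWS tunnel 1 outside IP
    rightsubnet=10.0.0.0/16      # VPC CIDR
    ike=aes128-sha1-modp1024!
    esp=aes128-sha1-modp1024!
    ikelifetime=8h
    lifetime=1h
    dpdaction=restart
```

&lt;p&gt;The tunnel's pre-shared key goes into &lt;code&gt;/etc/ipsec.secrets&lt;/code&gt;.&lt;/p&gt;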

&lt;p&gt;To get it configured, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on_prem_s2s_vpn_enabled = true
on_prem_props           = { ... }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and upon a successful Terraform apply you'll get a guide on how to configure StrongSwan. For testing purposes, feel free to limit it to a single VPN tunnel. When done, here's what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzgy32e2nobv8akxhxr9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flzgy32e2nobv8akxhxr9.png" alt="HLD: Part 2- Site-to-Site VPN" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The private subnet is included in case you wish to spin up an EC2 instance for testing purposes, such as confirming successful connectivity to and from a real server running in the VPC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eb2sg7y8mjhz6t0sceq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7eb2sg7y8mjhz6t0sceq.png" alt="Site-to-Site VPN connection detail" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok22xmmr5x3mmvtvaam0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fok22xmmr5x3mmvtvaam0.png" alt="Site-to-Site VPN connection tunnels" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid DNS and private access to VPC Interface Endpoints
&lt;/h3&gt;

&lt;p&gt;Hybrid DNS and private access to VPC Interface Endpoints, alongside AWS PrivateLink, enable secure, private connectivity to AWS services. By integrating a local &lt;strong&gt;Bind&lt;/strong&gt; DNS server with &lt;strong&gt;Route 53 Inbound Resolver&lt;/strong&gt; over the configured Site-to-Site VPN, DNS queries for AWS services are routed privately within AWS. This setup allows the use of VPC Interface Endpoints to connect to AWS APIs. PrivateLink ensures that traffic to services such as S3, Systems Manager, or EKS remains within the AWS network, avoiding exposure to the public Internet and enhancing both security and performance.&lt;/p&gt;
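&lt;p&gt;As an illustration - the region and resolver endpoint IPs below are placeholders, the generated guide has the real ones - the Bind side boils down to forward zones pointing at the inbound resolver endpoints:&lt;/p&gt;

```conf
// Hypothetical named.conf fragment: forward AWS service lookups
// to the Route 53 inbound resolver endpoints over the VPN
zone "eu-west-1.amazonaws.com" {
    type forward;
    forward only;
    forwarders { 10.0.1.10; 10.0.2.10; };
};
```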

&lt;p&gt;To get it configured, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;r53_inbound_resolver_enabled = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and upon a successful Terraform apply you'll get a guide on how to configure Bind. When done, here's what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx9nlugaaqlx38n3ordr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx9nlugaaqlx38n3ordr.png" alt="HLD: Part 3: Route 53 Inbound Resolver + Bind + VPC Interface Endpoints" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ibn6tqzuk8czj1i09no.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ibn6tqzuk8czj1i09no.png" alt="Route 53 Inbound Resolver" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  IAM Roles Anywhere
&lt;/h3&gt;

&lt;p&gt;With the Site-to-Site VPN in place, I can securely extend my on-premises network to AWS, providing seamless communication between local infrastructure and AWS resources. Integrating IAM Roles Anywhere enables secure and temporary access to AWS services from on-prem systems. This leverages the VPN connection and IAM roles for secure authentication. Additionally, with private access to VPC Interface Endpoints, on-prem systems can resolve AWS service API addresses to private IPs, ensuring traffic remains within the AWS network for enhanced security, and avoiding public Internet exposure.&lt;/p&gt;
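&lt;p&gt;On the client side, AWS's IAM Roles Anywhere credential helper (&lt;code&gt;aws_signing_helper&lt;/code&gt;) can expose the certificate-based identity as a normal AWS profile via &lt;code&gt;credential_process&lt;/code&gt; - roughly like this, where all ARNs and file paths are placeholders:&lt;/p&gt;

```ini
# Hypothetical ~/.aws/config profile backed by IAM Roles Anywhere
[profile pi-anywhere]
credential_process = aws_signing_helper credential-process --certificate /etc/pki/pi.crt --private-key /etc/pki/pi.key --trust-anchor-arn arn:aws:rolesanywhere:eu-west-1:111111111111:trust-anchor/EXAMPLE --profile-arn arn:aws:rolesanywhere:eu-west-1:111111111111:profile/EXAMPLE --role-arn arn:aws:iam::111111111111:role/pi-anywhere
```

&lt;p&gt;Any tool that understands AWS profiles - the AWS CLI, SDKs, or the SSM Agent - can then obtain temporary credentials through that profile.&lt;/p&gt;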

&lt;p&gt;To get it configured, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iam_roles_anywhere_enabled = true

# NOTE: Terraform must be re-run once the CA cert is uploaded to SSM PS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and upon a successful Terraform apply you'll get a guide on how to configure a local CA, generate certificates, and leverage IAM to access AWS services. When done, here's what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9teh11npbomcr4hlb7dt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9teh11npbomcr4hlb7dt.png" alt="HLD: Part 4 - IAM Roles Anywhere + CA" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fi5ucd27xpvcdrcgx4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fi5ucd27xpvcdrcgx4w.png" alt="CloudTrail CreateSession event from Pi" width="800" height="617"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SSM Agent
&lt;/h3&gt;

&lt;p&gt;With the Site-to-Site VPN in place, private access to AWS services via VPC Interface Endpoints configured, and IAM Roles Anywhere enabling secure role-based access, the next step is integrating the SSM Agent. The SSM Agent facilitates secure, managed access to instances in AWS, including on-premises servers, via AWS Systems Manager. By leveraging IAM roles, the SSM Agent ensures that commands and configurations are executed securely, enabling full management of infrastructure across both on-premises and AWS environments. Additionally, communication between the agent and the service remains private, as all traffic flows through established private connections to AWS services, avoiding the public Internet.&lt;/p&gt;
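&lt;p&gt;Against the hybrid activation created by Terraform, the node registration then amounts to a couple of commands. The activation code, ID, and region below are placeholders taken from the generated guide:&lt;/p&gt;

```shell
# Hypothetical registration of the Pi as an SSM managed instance
sudo systemctl stop amazon-ssm-agent
sudo amazon-ssm-agent -register -code "ACTIVATION_CODE" -id "ACTIVATION_ID" -region "eu-west-1"
sudo systemctl enable --now amazon-ssm-agent
```

&lt;p&gt;Once registered, the node shows up in Fleet Manager with an &lt;code&gt;mi-&lt;/code&gt; prefixed instance ID.&lt;/p&gt;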

&lt;p&gt;To get it configured, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssm_hybrid_activation_registred     = true
ssm_advanced_instances_tier_enabled = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and upon a successful Terraform apply you'll get a guide on how to configure the SSM Agent. When done, here's what you get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgv00ecmsjqr04z9whp1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgv00ecmsjqr04z9whp1m.png" alt="HLD: Part 5 - SSM Agent" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gnoe04ayg0cainoydp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4gnoe04ayg0cainoydp2.png" alt="ISSM Fleet Manager managed nodes" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb5rxbmsaes83o3xyi7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyb5rxbmsaes83o3xyi7p.png" alt="SSM Session Manager connection to Pi" width="800" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Hybrid Nodes continued
&lt;/h2&gt;

&lt;p&gt;The functionalities described and configured above provide a robust foundation for integrating on-premises systems with AWS. They also cover the key prerequisites for connecting an on-premises Kubernetes node to an EKS cluster in AWS.&lt;/p&gt;

&lt;p&gt;The source code does not include specific EKS cluster configuration, as such setups vary by use case. Instead, it assumes an EKS cluster with hybrid node support enabled is already running in a separate VPC. The provided configuration focuses on integrating the cluster with the Transit Gateway and the necessary IAM-related resources.&lt;/p&gt;

&lt;p&gt;AWS provides comprehensive guidance for configuring everything from scratch, especially the CNI-related aspects, which can be found &lt;a href="https://medium.com/r/?url=https%3A%2F%2Fdocs.aws.amazon.com%2Feks%2Flatest%2Fuserguide%2Fhybrid-nodes-overview.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; while most of the non-EKS-specific prerequisites have already been covered in this post.&lt;/p&gt;

&lt;p&gt;And don't forget, we're only simulating this at home!&lt;/p&gt;
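&lt;p&gt;For reference, per AWS's hybrid-nodes flow, the node-side bootstrap revolves around &lt;code&gt;nodeadm&lt;/code&gt; driven by a small NodeConfig file. The sketch below is indicative only - cluster name, region, and ARNs are placeholders:&lt;/p&gt;

```yaml
# Hypothetical nodeConfig.yaml for an IAM Roles Anywhere-backed hybrid node
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-eks-cluster
    region: eu-west-1
  hybrid:
    iamRolesAnywhere:
      nodeName: pi
      trustAnchorArn: arn:aws:rolesanywhere:eu-west-1:111111111111:trust-anchor/EXAMPLE
      profileArn: arn:aws:rolesanywhere:eu-west-1:111111111111:profile/EXAMPLE
      roleArn: arn:aws:iam::111111111111:role/eks-hybrid-node
```

&lt;p&gt;It is then applied with &lt;code&gt;nodeadm install &amp;lt;k8s-version&amp;gt; --credential-provider iam-ra&lt;/code&gt; followed by &lt;code&gt;nodeadm init -c file://nodeConfig.yaml&lt;/code&gt;.&lt;/p&gt;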

&lt;p&gt;To get it configured, start with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eks_hybrid_nodes_enabled = true
eks_props                = { ... }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and upon a successful Terraform apply you'll get a guide on how to configure your node (Pi). When done, here's what you'll most likely get (depending on your individual EKS cluster setup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyva9y3fqjrqog6apcjpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyva9y3fqjrqog6apcjpq.png" alt="HLD: Part 6- EKS Hybrid Nodes" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhihc5gfmxps94xm9tx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhihc5gfmxps94xm9tx0.png" alt="EKS cluster info &amp;amp; nodes" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55jnhp3tgel2ssg0zae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz55jnhp3tgel2ssg0zae.png" alt="EKS Pi Node details" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;What started as a personal Proof of Concept turned into a hands-on guide for building a hybrid infrastructure that connects on-premises systems with AWS. Using a Raspberry Pi as a simulated on-prem node, this journey explored key AWS services and functionalities required to establish a route to EKS Hybrid Nodes.&lt;/p&gt;

&lt;p&gt;The lessons learned here - from setting up a Site-to-Site VPN to enabling private API access and configuring a hybrid Kubernetes node - showcase practical steps that can inform professional designs. These configurations pave the way for securely running Kubernetes workloads across hybrid environments, leveraging AWS for control plane management while maintaining local resources.&lt;/p&gt;

&lt;p&gt;The benefits of this approach are clear: enhanced security with private connectivity, simplified operations through managed AWS services, and the ability to experiment and learn on a small scale while gaining insights applicable to enterprise-grade scenarios. This PoC not only highlights the potential of hybrid cloud architectures but also demonstrates how such integrations can help modernize on-prem systems and provide flexibility for diverse business needs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>Databricks on AWS</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Fri, 03 Jan 2025 12:14:50 +0000</pubDate>
      <link>https://forem.com/aws-builders/databricks-on-aws-1hnp</link>
      <guid>https://forem.com/aws-builders/databricks-on-aws-1hnp</guid>
      <description>&lt;p&gt;This is to share my experience around building an enterprise data platform powered by Databricks on AWS. It’s not about the data side of things but purely about the platform architecture and wider configuration aspects.&lt;/p&gt;

&lt;p&gt;While Databricks provides its customers with many capabilities and features to fulfill a spectrum of needs, there are still common elements and considerations that Databricks users must decide on, depending on their individual requirements.&lt;/p&gt;

&lt;p&gt;Just look out for 💡 down below - the information they highlight may prove relevant or even give you a head start with using Databricks on AWS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Account and workspaces
&lt;/h2&gt;

&lt;p&gt;It’s not about AWS accounts and Amazon Workspaces. When working with Databricks on AWS, you should quickly learn to be precise when discussing architectures and configurations to avoid unnecessary confusion about what’s what. You’ll see why… mark my words!&lt;/p&gt;

&lt;p&gt;This time it’s about Databricks accounts and workspaces. With the Enterprise Edition (E2) model, Databricks introduced a highly scalable multi-tenant environment, a successor to previous deployment options that have been deprecated. By the way, it’s running on Kubernetes.&lt;/p&gt;

&lt;p&gt;Here’s what it looks like on a very high level:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfynbm5nzmf0osvdvdh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfynbm5nzmf0osvdvdh3.png" alt="Image description" width="720" height="365"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Databricks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💡 The account console is hosted in the US West (Oregon) AWS region, while you choose which AWS region (15 currently) you want to have your workspaces deployed into.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Databricks account&lt;/strong&gt; is used to manage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users and their access to objects and resources,&lt;/li&gt;
&lt;li&gt;Workspaces and cloud resources,&lt;/li&gt;
&lt;li&gt;Metastores — the top-level containers for catalogs in Unity Catalog (💡 There can only be a single metastore per account per region),&lt;/li&gt;
&lt;li&gt;Other account-level settings like SSO, SCIM, security controls, various optional features, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;strong&gt;Databricks workspace&lt;/strong&gt;, on the other hand, is a Databricks deployment that can be considered an environment for data engineers to access Databricks assets. To configure a workspace, the following cloud-relevant information must be provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credentials ~ an IAM role&lt;/li&gt;
&lt;li&gt;Storage ~ an S3 bucket&lt;/li&gt;
&lt;li&gt;Storage CMK ~ a KMS key&lt;/li&gt;
&lt;li&gt;Network ~ a VPC with subnets and security group(s)&lt;/li&gt;
&lt;/ul&gt;
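&lt;p&gt;As a rough sketch, those four inputs end up as identifiers in the request body sent to the Account API when creating a workspace (you register the IAM role, S3 bucket, and VPC first, then reference them by ID). The field names below are my assumption based on the API reference, so verify them before relying on this:&lt;/p&gt;

```python
# Sketch: mapping the cloud-relevant inputs to a workspace-creation request
# body for the Databricks Account API. Field names are assumptions based on
# the API reference; check the current docs before using them.

def workspace_payload(name, region, credentials_id, storage_config_id, network_id):
    """Build the JSON body referencing the previously registered cloud resources."""
    return {
        "workspace_name": name,
        "aws_region": region,
        "credentials_id": credentials_id,               # registered IAM role
        "storage_configuration_id": storage_config_id,  # registered S3 bucket
        "network_id": network_id,                       # registered VPC/subnets/SGs
    }

payload = workspace_payload("dev", "eu-west-2", "cred-123", "stor-456", "net-789")
print(sorted(payload))
```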

&lt;p&gt;Now, it’s up to you how you want to manage the workspaces, but we went with a workspace per environment, meaning every SDLC environment is represented by a distinct Databricks workspace and its corresponding AWS account, with all the necessary services and resources configured accordingly. All that configuration was covered with IaC.&lt;/p&gt;

&lt;h3&gt;
  
  
  Databricks REST APIs
&lt;/h3&gt;

&lt;p&gt;Databricks exposes two APIs, the &lt;a href="https://docs.databricks.com/api/account/introduction" rel="noopener noreferrer"&gt;Account API&lt;/a&gt; and the &lt;a href="https://docs.databricks.com/api/workspace/introduction" rel="noopener noreferrer"&gt;Workspace API&lt;/a&gt;, so depending on which resource you’re configuring, you interact with one or the other. Many of the methods are listed as Public Preview; however, when you go with IaC, you must simply treat them as Public/GA, as you have no choice but to use and rely on them.&lt;/p&gt;

&lt;p&gt;💡 Not everything makes perfect sense when it comes to what is managed through which API, but maybe that’s just me. An example can be the &lt;a href="https://docs.databricks.com/api/workspace/artifactallowlists" rel="noopener noreferrer"&gt;Artifacts Allowlist&lt;/a&gt;. Namely, it is configurable with the Workspace API while that configuration is considered global as it’s linked to the Unity Catalog, i.e., it applies to all workspaces in your account. Now, having a workspace per environment, you may have different artifacts per environment that need to be whitelisted. In that case, you must bring them together and contain them as a single list. Moreover, when interacting with the Workspace API, you must provide a workspace host URL. Say you don’t have any workspaces yet or, just like in our case, you have workspaces representing distinct environments, and you want to set up the allow list — which workspace URL would you use to configure that global setting?&lt;/p&gt;
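&lt;p&gt;Bringing those per-environment lists together can be as simple as a union that your IaC or tooling computes before calling the Workspace API. A minimal sketch (the artifact paths are purely illustrative):&lt;/p&gt;

```python
# Sketch: because the Artifacts Allowlist is account-global (it applies to all
# workspaces via Unity Catalog), per-environment artifact lists must be merged
# into a single, de-duplicated list before submitting it to the Workspace API.

def merge_allowlists(*env_lists):
    """Union the per-environment allowlists, preserving first-seen order."""
    merged = []
    for env in env_lists:
        for artifact in env:
            if artifact not in merged:
                merged.append(artifact)
    return merged

dev = ["s3://artifacts/dev/init.sh", "s3://artifacts/shared/agent.sh"]
prd = ["s3://artifacts/prd/init.sh", "s3://artifacts/shared/agent.sh"]
print(merge_allowlists(dev, prd))
```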

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;On a high level, the following diagram visualizes what the core components are and how they are spread across the AWS accounts, where one belongs to Databricks and another one belongs to the customer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowpkt56yos8o3f8t8eco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowpkt56yos8o3f8t8eco.png" alt="Image description" width="720" height="851"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Databricks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can imagine, making both sides interact with each other securely relies on trust between both AWS accounts, which is fulfilled with the use of a cross-account IAM role that is granted permissions to launch clusters or manage data in S3 in the customer AWS account. It’s always the same &lt;code&gt;414351767826&lt;/code&gt; AWS account and &lt;code&gt;arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL&lt;/code&gt; IAM role that will keep popping up in Security Hub findings if not properly marked as trusted. When using serverless compute, there’s another combination of an AWS account ID and IAM role.&lt;br&gt;
Then, at the network layer, there are whitelisting mechanisms, covered in the sections below.&lt;/p&gt;
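&lt;p&gt;On the customer side, that trust is expressed as a standard IAM trust policy on the cross-account role. A minimal sketch, assuming the usual pattern of trusting the Databricks AWS account with your Databricks account ID as the external ID (the exact policy Databricks requires may differ, so treat this as the general shape rather than the authoritative document):&lt;/p&gt;

```python
import json

# Sketch of the cross-account trust relationship: the customer role trusts the
# Databricks AWS account, scoped with an external ID. The exact policy
# Databricks requires may differ; this shows only the general IAM pattern.

DATABRICKS_AWS_ACCOUNT = "414351767826"

def cross_account_trust_policy(databricks_account_id):
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DATABRICKS_AWS_ACCOUNT}:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": databricks_account_id}},
        }],
    }

print(json.dumps(cross_account_trust_policy("00000000-0000-0000-0000-000000000000"), indent=2))
```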

&lt;h3&gt;
  
  
  Authentication and access control
&lt;/h3&gt;

&lt;p&gt;While Databricks comes with its user store, it’s a common pattern to leverage SSO and configure Databricks with an IdP that is already used in your organization. The &lt;strong&gt;unified login&lt;/strong&gt; option that is now enabled by default allows you to manage one SSO configuration in your account that is used for the account and Databricks workspaces.&lt;/p&gt;

&lt;p&gt;💡 The unified login does not yet support workspaces with public access completely disabled; the only way around that is to contact Databricks to get unified login disabled for your account and then configure SSO on a per-workspace basis.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.databricks.com/en/admin/users-groups/scim/index.html" rel="noopener noreferrer"&gt;SCIM provisioning&lt;/a&gt; feature allows for syncing groups and users from your IdP, like Microsoft Entra ID, and using them in Databricks to grant permissions while following the least-privilege principle. Here, make sure you use groups and not individual user accounts to grant permissions.&lt;br&gt;
Service principals, on the other hand, must be managed locally in Databricks, while they can still be members of the SCIM-synced groups.&lt;/p&gt;

&lt;p&gt;Service principals can use personal access tokens (PATs) or OAuth for automation, and it’s recommended to go with the latter wherever you can. Make sure you verify the available &lt;a href="https://docs.databricks.com/en/dev-tools/auth/index.html" rel="noopener noreferrer"&gt;authentication types&lt;/a&gt; based on the use case.&lt;/p&gt;

&lt;p&gt;💡 For example, currently, Qlik does not support OAuth for Databricks hosted in AWS, while it does for Databricks hosted in Azure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Networking
&lt;/h2&gt;

&lt;p&gt;It’s important to understand that in the E2 Databricks deployment model, the control plane is in the hands of Databricks while you use their web applications to access the account console and your workspaces. That architecture requires all those endpoints to be accessible over the public Internet or rather publicly resolvable.&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;Databricks account&lt;/strong&gt; endpoint, there’s only a single option for restricting access at the network level, and that is the &lt;strong&gt;IP Access List&lt;/strong&gt;.&lt;br&gt;
For the &lt;strong&gt;workspace&lt;/strong&gt; endpoints, apart from the IP access list, you can leverage &lt;strong&gt;AWS PrivateLink&lt;/strong&gt;. Whether you use it for front-end access is up to your requirements, but you should definitely use it for back-end access, i.e., secure cluster connectivity. We went for both!&lt;/p&gt;

&lt;h3&gt;
  
  
  Private-only connectivity
&lt;/h3&gt;

&lt;p&gt;The following diagram nicely visualizes the concept.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1c8b4h4uuepxb0vlte1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1c8b4h4uuepxb0vlte1.png" alt="Image description" width="720" height="888"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Databricks&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can see that Databricks exposes two APIs privately (SCC Relay API and Workspace API) with the use of the &lt;strong&gt;VPC Endpoint Services&lt;/strong&gt; for you to connect with the &lt;strong&gt;VPC Interface Endpoints&lt;/strong&gt; you configure in your VPCs and whitelist in the Databricks account console.&lt;/p&gt;

&lt;p&gt;💡 The Workspace endpoint is not only used for the front-end access but also REST API and ODBC/JDBC connections; hence, you must realize that by closing the front-end door, you’re restricting all types of connectivity to only whitelisted networks.&lt;/p&gt;

&lt;p&gt;💡 VPC Interface Endpoints don’t have to live in the VPC you have your workspace configured with. This means, as long as your VPC Interface Endpoint is whitelisted (by ID) it can belong to any AWS account and any VPC which enables their &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/building-scalable-secure-multi-vpc-network-infrastructure/centralized-access-to-vpc-private-endpoints.html" rel="noopener noreferrer"&gt;centralization&lt;/a&gt;. However, you can control which endpoints can be used to access a given workspace.&lt;/p&gt;

&lt;p&gt;💡 No matter whether you disable public access to your workspace or not, the &lt;strong&gt;access verification&lt;/strong&gt; process is &lt;strong&gt;always&lt;/strong&gt; the same and &lt;strong&gt;starts with authentication&lt;/strong&gt;. Yes, the IP access list and AWS PrivateLink go second, while the IP access list only applies when accessing workspaces from public IP addresses. All that is influenced by the Databricks control plane architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delta-sharing
&lt;/h3&gt;

&lt;p&gt;Delta Sharing is an open-source approach to sharing data across data, analytics, and AI tools, developed and implemented by Databricks. Among other things, it allows for sharing data between Databricks customers that have their own accounts and workspaces.&lt;/p&gt;

&lt;p&gt;💡 Delta Sharing happens privately over the AWS backbone and the Databricks AWS account, which means that going fully private, i.e., disabling public access to your workspaces, does not affect the ability to use that feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analytics tools
&lt;/h3&gt;

&lt;p&gt;Nowadays, when every tool has its cloud-based option, so do various analytics tools like Power BI or Qlik Sense. I don’t know all of them, but at least these two provide a solution that allows for establishing private connectivity between them and Databricks: &lt;strong&gt;gateways&lt;/strong&gt; — Power BI Gateway and Qlik Data Gateway, respectively. Such a gateway must be deployed in a VPC in the customer AWS account in a way that it can connect to Databricks privately through the front-end VPC Interface Endpoint, while the connection from the gateway to the given analytics tool is established using encryption and whitelisting mechanisms, so no endpoint on the AWS side is exposed publicly.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Graviton
&lt;/h2&gt;

&lt;p&gt;Why pay more if you can pay less?&lt;/p&gt;

&lt;p&gt;💡 AWS Graviton-powered EC2 instances for Databricks clusters are supported; however, &lt;a href="https://docs.databricks.com/en/compute/configure.html#graviton-limitations" rel="noopener noreferrer"&gt;limitations&lt;/a&gt; may make it useless in some use cases so make sure you know what they are before you calculate your ARR based on the assumption you can use them everywhere for anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging and monitoring with CloudWatch
&lt;/h2&gt;

&lt;p&gt;To make sure all relevant logs go to CloudWatch where they can be retained and analyzed, especially when cluster nodes are usually transient, but also to make use of non-default EC2 performance metrics, you can leverage &lt;a href="https://docs.databricks.com/en/init-scripts/cluster-scoped.html" rel="noopener noreferrer"&gt;cluster-scoped init scripts&lt;/a&gt; to install and configure the &lt;strong&gt;CloudWatch agent&lt;/strong&gt;.&lt;br&gt;
Those init scripts must be whitelisted with the aforementioned Artifacts Allowlist before they can be used.&lt;/p&gt;
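&lt;p&gt;As an illustration, such an init script typically renders a CloudWatch agent configuration before starting the agent. A minimal sketch of generating one; the file path and log group name are illustrative assumptions, not Databricks defaults:&lt;/p&gt;

```python
import json

# Sketch: the kind of CloudWatch agent configuration a cluster-scoped init
# script might render before starting the agent. The log file path and log
# group name are illustrative, not Databricks defaults.

def agent_config(cluster_id, log_group="/databricks/cluster-logs"):
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [{
                        "file_path": "/databricks/driver/logs/*.log",
                        "log_group_name": log_group,
                        "log_stream_name": f"{cluster_id}/driver",
                    }]
                }
            }
        }
    }

print(json.dumps(agent_config("0101-123456-abcdefgh"), indent=2))
```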

&lt;p&gt;💡 Considering the compute plane can be configured with an instance profile, one can leverage that fact to configure a custom logger and, with properly set up IAM policies, stream logs directly to custom log groups.&lt;/p&gt;

&lt;p&gt;Databricks also produces &lt;strong&gt;audit logs&lt;/strong&gt; that can be stored in &lt;strong&gt;S3&lt;/strong&gt;. From there they can be pushed to CloudWatch (or elsewhere) for analysis and event management purposes.&lt;/p&gt;
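&lt;p&gt;A minimal sketch of what that downstream processing might look like: flattening an audit log record into the fields you’d typically index. The field names follow the audit log schema as I recall it (&lt;code&gt;serviceName&lt;/code&gt;, &lt;code&gt;actionName&lt;/code&gt;, &lt;code&gt;userIdentity&lt;/code&gt;), so verify them against the current reference:&lt;/p&gt;

```python
import json

# Sketch: flattening a Databricks audit log record (as delivered to S3) into
# a few fields for downstream analysis. Field names are assumptions based on
# the documented audit log schema; verify before relying on them.

def summarize(record):
    return {
        "time": record.get("timestamp"),
        "service": record.get("serviceName"),
        "action": record.get("actionName"),
        "user": (record.get("userIdentity") or {}).get("email"),
    }

raw = ('{"timestamp": 1700000000000, "serviceName": "accounts", '
       '"actionName": "login", "userIdentity": {"email": "someone@example.com"}}')
print(summarize(json.loads(raw)))
```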

&lt;h2&gt;
  
  
  Caveats 💡
&lt;/h2&gt;

&lt;p&gt;Here are some things it’s good to know sooner rather than later to avoid headaches or other surprises when using Databricks (in general):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s constantly evolving, so pay attention to the &lt;a href="https://docs.databricks.com/en/release-notes/product/index.html" rel="noopener noreferrer"&gt;release notes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For the same reason, documentation is sometimes unclear, as it tries to cover both the new things and features that have since been deprecated but that some customers still use.&lt;/li&gt;
&lt;li&gt;Not every feature is available in all regions so make sure you know the &lt;a href="https://docs.databricks.com/en/resources/feature-region-support.html" rel="noopener noreferrer"&gt;coverage&lt;/a&gt; before picking one. For example, currently, serverless compute features are not yet available in the London region.&lt;/li&gt;
&lt;li&gt;Make sure you know the difference between roles, entitlements, permissions, and grants — it can become confusing sometimes.&lt;/li&gt;
&lt;li&gt;Using SQL to grant a given principal permissions to Databricks objects requires running compute resources, as every statement must be run somewhere; don’t expect to run queries directly against the control plane, as it’s not Kubernetes.&lt;/li&gt;
&lt;li&gt;Follow &lt;a href="https://docs.databricks.com/en/lakehouse-architecture/security-compliance-and-privacy/best-practices.html" rel="noopener noreferrer"&gt;Databricks security best practices&lt;/a&gt; and analyze carefully any weaknesses in your architecture that allow for data exfiltration. For example, make sure any egress traffic is controlled and inspected by using AWS Network Firewall or any other next-generation firewall solution.&lt;/li&gt;
&lt;li&gt;In case of issues or doubts, don’t hesitate to contact Databricks and get a hold of a Solution Architect who can help you understand things and even give a hint on how other customers go about some aspects.&lt;/li&gt;
&lt;li&gt;When using Terraform, refer to the &lt;a href="https://registry.terraform.io/providers/databricks/databricks/latest/docs" rel="noopener noreferrer"&gt;guides&lt;/a&gt; available in the registry, just don’t use them blindly and adapt to your design and standards.&lt;/li&gt;
&lt;li&gt;It’s not always a good idea to manage Databricks classic compute clusters themselves with IaC unless their configuration can remain unchanged. Instead, it’s worth considering keeping their definition in source control and getting them deployed with CI/CD running the Databricks CLI, while still managing all their dependencies in IaC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, I’m not planning to keep this post updated with feature availability changes, so everything here can only be considered relevant at the time of its publication 😉&lt;/p&gt;

&lt;h2&gt;
  
  
  Lastly…
&lt;/h2&gt;

&lt;p&gt;There’s more to comprehend for a platform solution architect than you can imagine. To make it right, I’d suggest having an AWS expert working alongside a data architect, as that’s the best way to get all the different requirements and best practices considered and implemented properly.&lt;/p&gt;

&lt;p&gt;To help with that journey, you can have a look at the &lt;a href="https://www.databricks.com/learn/training/certification" rel="noopener noreferrer"&gt;Databricks certification program&lt;/a&gt; and &lt;a href="https://www.databricks.com/learn/training/login" rel="noopener noreferrer"&gt;Partner Academy&lt;/a&gt; to help you learn Databricks and optionally even get certified or achieve an accreditation.&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>data</category>
      <category>aws</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Re-platforming to AWS Lambda container images</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Fri, 26 May 2023 20:00:00 +0000</pubDate>
      <link>https://forem.com/aws-builders/re-platforming-to-aws-lambda-container-images-1hn2</link>
      <guid>https://forem.com/aws-builders/re-platforming-to-aws-lambda-container-images-1hn2</guid>
<description>&lt;p&gt;This is an attempt at showing how event-driven backend processing services can be migrated, with minimum changes, from a container orchestration service to AWS Lambda powered by container images.&lt;br&gt;
The driver for this story was one of my recent migration projects, where a fleet of services running on an old EKS version had to be migrated to a brand-new environment running the latest versions of everything. During that journey, one of the weaknesses revealed in the architecture of the system in question was insufficient resistance to data loss due to the lack of loose coupling. That made me think… firstly, how a particular backend processing service could be redesigned accordingly, and secondly, whether we need EKS for this at all.&lt;br&gt;
For the sake of this story, let’s say the former has already been solved, so I’m going to focus on the latter, as one of the exercises I did was checking how difficult &lt;strong&gt;re-platforming a containerized service from Amazon ECS or Amazon EKS to AWS Lambda with container image support&lt;/strong&gt; is.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vr0t9ua2t9j3qoe3uop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vr0t9ua2t9j3qoe3uop.png" alt="AWS Lambda container images" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  EKS, ECS, and Lambda
&lt;/h2&gt;

&lt;p&gt;It’s important to emphasize again that I’m focusing on an event-driven backend processing service which is why this comparison is focused on a specific, but still common, use case rather than a holistic capabilities overview of these services. EKS, ECS, and Lambda are not mutually exclusive and each of these services is better at one thing while less optimal at the other. It’s usually a game of trade-offs anyway but depending on a scenario, workload type, number of services in an environment, etc., to come up with the best architecture those services from AWS can be combined or implemented interchangeably. My observations show that Lambda is barely considered whenever containerized workloads are in use, if at all. These days, containers usually equal Kubernetes.&lt;/p&gt;
&lt;h3&gt;
  
  
  Event-driven architecture
&lt;/h3&gt;

&lt;p&gt;Conceptually, I’m a big fan of this type of architecture. Do only when there’s something to do and upon getting notified about it. Once done, turn off the lights and wait for another alarm to be woken up. Repeat forever. It’s efficient, sustainable, and cost-effective.&lt;br&gt;
While Lambda, along with other serverless services, is designed to be at the heart of event-driven solutions, both EKS and ECS can be used to deal with such requirements; however, not always efficiently enough and not without additional overhead or complexity.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scaling
&lt;/h3&gt;

&lt;p&gt;One of the crucial capabilities of a service that is used to support event-driven architecture is dynamic and adaptive scaling. Lambda is known for its native scaling capabilities and handling of in-flight requests in parallel that can be controlled and supported with the use of the concurrency settings that were comprehensively described &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-concurrency.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;. A Lambda function can be triggered by the majority of relevant AWS services, if not all of them, and therefore perfectly adapts to unpredictable traffic events.&lt;br&gt;
In the world of Kubernetes, with EKS, there are add-ons (autoscalers) like &lt;em&gt;Keda&lt;/em&gt; that enable similar capabilities with the use of supported &lt;a href="https://keda.sh/docs/latest/scalers/" rel="noopener noreferrer"&gt;scalers&lt;/a&gt; and allow for pod scaling based on external metrics.&lt;br&gt;
ECS on the other hand supports so-called &lt;a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-autoscaling-stepscaling.html" rel="noopener noreferrer"&gt;step scaling&lt;/a&gt; that leverages CloudWatch alarms and adjusts the number of service tasks in steps based on the size of the alarm breach. Interestingly, that type of scaling is not recommended as per the ECS Developer Guide, which is why I decided to give it a try as part of a PoC for this story. An alternative approach that I’ve seen people implement is a Lambda function, triggered periodically, that watches a given external metric and based on its value manipulates the number of running ECS tasks. Even then, it’s neither flexible nor fully automatic, as thresholds must be predefined.&lt;/p&gt;
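&lt;p&gt;That alternative approach boils down to logic along these lines; the messages-per-task ratio and the task bounds are illustrative assumptions, and the real function would then apply the result via the ECS API (e.g., an &lt;code&gt;update_service&lt;/code&gt; call):&lt;/p&gt;

```python
import math

# Sketch: the core sizing logic of a periodically triggered Lambda that scales
# an ECS service from queue depth. The ratio and bounds are illustrative; a
# real implementation would then update the service's desired count via ECS.

def desired_tasks(queue_depth, msgs_per_task=100, min_tasks=0, max_tasks=10):
    if queue_depth <= 0:
        return min_tasks
    return max(min_tasks, min(max_tasks, math.ceil(queue_depth / msgs_per_task)))

print(desired_tasks(0), desired_tasks(250), desired_tasks(5000))  # → 0 3 10
```

This is exactly the inflexibility mentioned above: the thresholds live in code (or configuration) and don’t adapt to traffic on their own.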
&lt;h2&gt;
  
  
  Migration, modernization, re-platforming
&lt;/h2&gt;

&lt;p&gt;While migrating more and more workloads to the cloud remains high on the list of initiatives organizations want to make progress on, according to &lt;a href="https://www.flexera.com/about-us/press-center/flexera-2023-state-of-the-cloud-report" rel="noopener noreferrer"&gt;Flexera 2023 State of the Cloud Report&lt;/a&gt; cost savings is the top one for the seventh year in a row. Moreover, 71% of heavy cloud users want to optimize the existing use of the cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl6hm5d7hxgr4dyp8q21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl6hm5d7hxgr4dyp8q21.png" alt="Flexera1" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx14ajwuc7t7d2daso16h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx14ajwuc7t7d2daso16h.png" alt="Flexera2" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, cloud spend along with security and expertise are the top three cloud challenges recognized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8yelqlfq1u84sdjq8m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8yelqlfq1u84sdjq8m5.png" alt="Flexera3" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Therefore, a potential path leading to cost optimization that simultaneously can elevate security and doesn’t necessarily require extensive expertise is modernizing with the use of AWS Lambda.&lt;br&gt;
Depending on where you are with your cloud adoption journey and what the key business drivers are, you should choose the most appropriate migration strategy for your workloads. &lt;strong&gt;Modernization&lt;/strong&gt; is something you can go for straight away or consider as the next step after re-hosting first.&lt;br&gt;
&lt;strong&gt;Re-platforming&lt;/strong&gt; can turn out to be a golden mean as it allows for reshaping by leveraging cloud-native capabilities &lt;strong&gt;without modifying the application source code or its core architecture&lt;/strong&gt;. That way you can benefit from increased flexibility and resilience with reduced time and financial investments. Obviously, it doesn’t mean it is the way and any decision on implementing any strategy should be supported by relevant analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  PoC: ECS to Lambda
&lt;/h2&gt;

&lt;p&gt;Let’s imagine a scenario where the goal is to re-platform from ECS to Lambda with zero or minimal changes to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;application code (&lt;em&gt;Python&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;container image build process (&lt;em&gt;Dockerfile&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All that to reduce the risks associated with the platform change itself.&lt;br&gt;
Why ECS and not EKS? Because for some reason I haven’t got much experience with it, hence I considered this a chance to see what it’s capable of in the given scenario and let myself expand my horizons.&lt;/p&gt;
&lt;h3&gt;
  
  
  HLD
&lt;/h3&gt;

&lt;p&gt;It’s about re-platforming from this…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4adg4jw48elqz5ge3pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4adg4jw48elqz5ge3pk.png" alt="ECS HLD" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;to this…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n68tygbpyboxcn397s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n68tygbpyboxcn397s0.png" alt="Lambda HLD" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Application code&lt;/strong&gt;&lt;br&gt;
What this sample code below does is it receives messages from a given SQS queue and deletes them immediately. That’s it, just enough for this PoC.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# app.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;QUEUE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQS_QUEUE_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;MAX_NUM_MSGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQS_MAX_NUM_MSGS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;VISIBILITY_TO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQS_VISIBILITY_TO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;WAIT_SECONDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQS_WAIT_SECONDS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;sqs_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;get_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqs_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receive_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEUE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MaxNumberOfMessages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MAX_NUM_MSGS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;VisibilityTimeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VISIBILITY_TO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;WaitTimeSeconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;WAIT_SECONDS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

    &lt;span class="n"&gt;messages_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of messages received: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;messages_length&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message body: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;del_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqs_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;QUEUE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deletion status code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;del_response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ResponseMetadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;HTTPStatusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dockerfile &amp;amp; RIC
&lt;/h3&gt;

&lt;p&gt;The only change the existing Dockerfile needed for Lambda to work with the container image was installing the &lt;strong&gt;AWS runtime interface client&lt;/strong&gt; (RIC) for Python: one additional line that still keeps the image generic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM python:3.10

WORKDIR /app

COPY requirements.txt  .
RUN pip install -r requirements.txt

RUN pip install awslambdaric

COPY app.py .

CMD ["python", "app.py"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
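&lt;p&gt;For comparison, if keeping the image generic were not a goal, the RIC could be baked in directly instead of relying on Lambda-side overrides. A hypothetical Lambda-only variant (assuming the handler function lives in &lt;em&gt;app.py&lt;/em&gt; as &lt;em&gt;handler&lt;/em&gt;) might look like this:&lt;/p&gt;

```dockerfile
# Hypothetical Lambda-only variant, NOT the generic image used in this post:
# the RIC becomes the entrypoint and the handler is passed as the command.
FROM python:3.10

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt awslambdaric

COPY app.py .

# awslambdaric implements the Lambda runtime API; "app.handler" is module.function
ENTRYPOINT ["python", "-m", "awslambdaric"]
CMD ["app.handler"]
```

&lt;p&gt;The trade-off is that such an image no longer runs as-is on ECS, which is exactly why the post keeps the overrides on the Lambda side.&lt;/p&gt;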



&lt;p&gt;RIC is an implementation of the Lambda runtime API: it extends a base image to make it Lambda-compatible by exposing an interface for receiving requests from, and sending responses to, the AWS Lambda service. While it can be installed for Python, Node.js, Go, Java, .NET, and Ruby, AWS also provides Amazon Linux-based container images for those runtimes &lt;a href="https://gallery.ecr.aws/lambda/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Even though they might be bigger than standard (&lt;em&gt;slim&lt;/em&gt;) base images, they are proactively cached by the Lambda service, so they don’t have to be pulled in their entirety every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lambda function configuration
&lt;/h3&gt;

&lt;p&gt;From the Lambda function configuration perspective, to instruct Lambda how to run the application code, both the &lt;em&gt;ENTRYPOINT&lt;/em&gt; and &lt;em&gt;CWD&lt;/em&gt; settings must be overridden as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2toq1p6n3sc2st0d4lt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2toq1p6n3sc2st0d4lt.png" alt="Lambda config" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it. It’s ready to be invoked.&lt;/p&gt;
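&lt;p&gt;The same overrides can also be applied programmatically through the function’s image configuration. Below is a minimal boto3 sketch; the function name and paths are assumptions matching the Dockerfile above, and the actual API call is left commented out:&lt;/p&gt;

```python
# Sketch: overriding ENTRYPOINT and CWD for a container-image Lambda function.
# The function name and paths are hypothetical; adjust to your image layout.
image_config = {
    # Run the runtime interface client installed in the image...
    "EntryPoint": ["/usr/local/bin/python", "-m", "awslambdaric"],
    # ...and point it at the handler (module.function inside the working dir).
    "Command": ["app.handler"],
    "WorkingDirectory": "/app",
}

# import boto3
# lambda_client = boto3.client("lambda")
# lambda_client.update_function_configuration(
#     FunctionName="sqs-consumer",  # hypothetical function name
#     ImageConfig=image_config,
# )
```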

&lt;h3&gt;
  
  
  Parallel testing
&lt;/h3&gt;

&lt;p&gt;For that purpose, I used the SQS queue seeder Python code below and ran it from Lambda. Based on the settings provided, it generates random strings and sends them as messages to as many queues as you want; in this case there were two, one for ECS and one for Lambda.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# sqs_seeder.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# SQS queues
&lt;/span&gt;&lt;span class="n"&gt;SQS_QUEUE_URLS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;# queue for ECS,
&lt;/span&gt;    &lt;span class="c1"&gt;# queue for Lambda,
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Settings
&lt;/span&gt;&lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;MAX_MSGS_PER_BATCH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;MIN_SECONDS_BETWEEN_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="n"&gt;MAX_SECONDS_BETWEEN_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;

&lt;span class="n"&gt;sqs_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sqs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sqs_send_msg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgs_count&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;MSGS_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgs_count&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Number of messages in batch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MSGS_COUNT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;MSGS_COUNT&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;MESSAGE_CHAR_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;MESSAGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascii_uppercase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MESSAGE_CHAR_SIZE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;queue_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SQS_QUEUE_URLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Sending message &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MESSAGE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;queue_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqs_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;QueueUrl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;queue_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;MessageBody&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MESSAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Message sent, ID: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MessageId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[ERROR] Queue URL: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;MSGS_COUNT&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Number of message batches: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;BATCH_NUMBER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Batch number #&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BATCH_NUMBER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;NUMBER_OF_MSGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_MSGS_PER_BATCH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;sqs_send_msg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUMBER_OF_MSGS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;BATCH_NUMBER&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;NUMBER_OF_MSG_BATCHES&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;SLEEP_TIME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MIN_SECONDS_BETWEEN_MSG_BATCHES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAX_SECONDS_BETWEEN_MSG_BATCHES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[INFO] Sleeping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SLEEP_TIME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SLEEP_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the ECS service step scaling policies I configured. They didn’t do what I initially thought they would, but more on that later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytixr8tulnn46w9rhbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytixr8tulnn46w9rhbf.png" alt="ECS Autoscaling config" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;
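&lt;p&gt;For reference, a step scaling policy like the scale-out one above can be defined through the Application Auto Scaling API. The sketch below only assembles the request parameters — the cluster name, service name, and adjustment values are assumptions — and leaves the call itself commented out:&lt;/p&gt;

```python
# Sketch: an ECS step scaling policy adding 5 tasks when the backlog alarm fires.
# Resource names and adjustment values are hypothetical.
scale_out_policy = {
    "PolicyName": "sqs-backlog-scale-out",
    "ServiceNamespace": "ecs",
    "ResourceId": "service/demo-cluster/sqs-consumer",  # format: service/cluster-name/service-name
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        # "ChangeInCapacity" corresponds to the Add action,
        # "ExactCapacity" to the Set action mentioned later in the post.
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # From the alarm threshold upwards: add 5 tasks per alarm evaluation.
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 5},
        ],
        "Cooldown": 60,
    },
}

# import boto3
# boto3.client("application-autoscaling").put_scaling_policy(**scale_out_policy)
```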

&lt;p&gt;The Lambda function was configured with reserved concurrency set to 1 to try to give ECS a head start.&lt;/p&gt;

&lt;p&gt;The screenshot below shows relevant metrics illustrating how messages were being loaded, received, and deleted from the two individual queues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnix9hg62lrer5me5a864.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnix9hg62lrer5me5a864.png" alt="ECS vs. Lambda" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clearly, something wasn’t right with ECS step scaling; it just wasn’t what I was hoping for. I thought it would keep adding 5 tasks as the number of messages in the queue grew, but that wasn’t the case. Based on the &lt;strong&gt;&lt;em&gt;scale-out&lt;/em&gt;&lt;/strong&gt; action, I expected 15 tasks by the second run, yet the DesiredTaskCount remained at 5, so it looks like a one-off scaling operation that does not adapt over time. I admit I was misled by the Set/Add actions available there; I should have read more carefully and managed my expectations. Now I see why it wasn’t recommended.&lt;br&gt;
Either way, the metrics make it crystal clear that Lambda scaling is seamless and fast, while ECS needs time before the CloudWatch alarm kicks in. That is because the minimum period for AWS metrics is 1 minute, so the reaction is never immediate.&lt;br&gt;
Finally, for reasons unknown, the &lt;strong&gt;&lt;em&gt;scale-in&lt;/em&gt;&lt;/strong&gt; action didn’t work: it did not set the number of tasks to 0 even though the alarm history showed the action executed successfully. I had to do it by hand.&lt;/p&gt;

&lt;p&gt;When it comes to SQS itself, I learned one thing that seems important from a performance and efficiency point of view: even with &lt;em&gt;MaxNumberOfMessages&lt;/em&gt; set to 10 (the maximum), a ReceiveMessage API call will still return a single message most of the time when the queue doesn’t hold many messages. More on that &lt;a href="https://github.com/boto/boto3/issues/324" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
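&lt;p&gt;One way to work around that behaviour is to keep polling until the queue stops returning messages, rather than relying on a single call. Here is a small sketch, deliberately decoupled from boto3 so the polling logic stands on its own; &lt;em&gt;receive_fn&lt;/em&gt; stands in for a &lt;em&gt;sqs_client.receive_message&lt;/em&gt; call with your queue settings:&lt;/p&gt;

```python
def drain_queue(receive_fn, max_polls=10):
    """Aggregate messages across repeated ReceiveMessage calls, since a
    single call often returns fewer messages than MaxNumberOfMessages."""
    collected = []
    for _ in range(max_polls):
        messages = receive_fn().get("Messages", [])
        if not messages:
            break  # the queue looks empty; stop polling
        collected.extend(messages)
    return collected


# Example with a fake receiver returning two short batches and then nothing:
batches = [{"Messages": [{"Body": "A"}]}, {"Messages": [{"Body": "B"}]}, {}]
fake_receive = lambda it=iter(batches): next(it, {})
print(len(drain_queue(fake_receive)))  # 2
```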

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;While I had no intention of going deep into ECS step scaling and demystifying it, even if it had worked the way I expected, it would still perform worse than Lambda. However, &lt;strong&gt;this is not about judging but about knowing your options and their constraints&lt;/strong&gt;. It is also about recognizing the caveats of the integration points and their limitations. While one service may seem fit for purpose, something else might eventually have a decisive impact on the final design.&lt;br&gt;
The takeaway is that, thanks to container portability, such a comparison doesn’t have to be difficult to execute or require much effort before it yields conclusions that help in decision-making. A Lambda function powered by a container image is probably one of the easiest things to configure, so when there are no obvious obstacles, and especially when there are indications of potential improvement, why wouldn’t you give it a chance? The worst that can happen is that you’ll end up with additional arguments for your decision.&lt;br&gt;
Again, know your options! Don’t hesitate to try things out by running PoCs to find the best solution, and remember that modernization is a continuous improvement process that never ends. Technology, business requirements, functionalities, and KPIs keep changing over time, so continuously assess your solutions against the AWS Well-Architected Framework and strive for optimization.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>containers</category>
      <category>devops</category>
    </item>
    <item>
      <title>Amazon VPC Lattice — feasibility study</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Thu, 04 May 2023 19:36:24 +0000</pubDate>
      <link>https://forem.com/aws-builders/amazon-vpc-lattice-feasibility-study-d1i</link>
      <guid>https://forem.com/aws-builders/amazon-vpc-lattice-feasibility-study-d1i</guid>
      <description>&lt;p&gt;Amazon VPC Lattice has now become generally available (March 2023) and finally, I managed to give it a try and see whether it would meet the expectations it had aroused back at AWS re:Invent 2022. There were quite a few of them, e.g. having the ability to avoid using VPC peerings or VPC service endpoints to facilitate cross-account, cross-VPC applications communication while separating the core networking management from individual services configuration across the estate, as well as easily defining and attaching services to a wider application networking mesh.&lt;/p&gt;

&lt;p&gt;Just in case…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Amazon VPC Lattice is a fully managed application networking service that you use to connect, secure, and monitor all of your services across multiple accounts and virtual private clouds (VPCs).” ~ AWS&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Need to know more? Check it out &lt;a href="https://docs.aws.amazon.com/vpc-lattice/latest/ug/what-is-vpc-service-network.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;; the rest of this story assumes you know the basics and focuses on presenting the results of my early experimentation with VPC Lattice, as well as sharing my findings and opinions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proof of concept
&lt;/h2&gt;

&lt;p&gt;The idea was simple — I wanted to test out VPC Lattice functional capabilities as much as possible with just a light touch on the non-functional ones. And so I ended up building this…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlf92ghqpkiqyagtlur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatlf92ghqpkiqyagtlur.png" alt="Image description" width="800" height="1057"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you can see above is a set-up containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;two AWS accounts being a part of the same AWS organization&lt;/li&gt;
&lt;li&gt;two VPC Lattice service networks where &lt;strong&gt;Test #1&lt;/strong&gt; is shared through RAM&lt;/li&gt;
&lt;li&gt;two VPC Lattice services where &lt;strong&gt;test-svc-1&lt;/strong&gt; is shared through RAM for association with the &lt;strong&gt;Test #2&lt;/strong&gt; service network&lt;/li&gt;
&lt;li&gt;three VPCs associated with service networks where &lt;strong&gt;VPC .103&lt;/strong&gt; is client-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;test-svc-1&lt;/strong&gt; service with several different target types&lt;/li&gt;
&lt;/ul&gt;
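&lt;p&gt;The RAM sharing mentioned above can be scripted too. A sketch assembling a resource share for the &lt;strong&gt;Test #1&lt;/strong&gt; service network is shown below; the ARN and account ID are placeholders, and the actual call is left commented out:&lt;/p&gt;

```python
# Sketch: sharing a VPC Lattice service network through AWS RAM.
# The ARN and the principal account ID are hypothetical placeholders.
resource_share = {
    "name": "test-1-service-network-share",
    "resourceArns": [
        "arn:aws:vpc-lattice:eu-west-1:111111111111:servicenetwork/sn-0123456789abcdef0"
    ],
    "principals": ["222222222222"],  # the consumer account
    "allowExternalPrincipals": False,  # keep the share within the AWS organization
}

# import boto3
# boto3.client("ram").create_resource_share(**resource_share)
```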

&lt;p&gt;Note: the Lambda-powered curlers in all three VPCs are there to facilitate sending custom HTTP requests to any VPC Lattice service for testing purposes. Need that Lambda function source code to try it yourself? It’s not sophisticated, but it does the job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import http.client
import ssl
from urllib.parse import urljoin

###################################################################################
REQ_PROTOCOL = '&amp;lt;???&amp;gt;'                            # options: 'http', 'https'
REQ_HOST     = '&amp;lt;???&amp;gt;'                            # e.g. myservice.mydomain.aws
REQ_PATH     = '/'                                # e.g. '/lambda80', '/lambda443'
REQ_HEADERS  = {                                  # a map of custom headers
    # "My-Header": "lambda-hh",                       
    "Content-Type": "application/json"
}
###################################################################################

def get(protocol, host, path, headers, event):
    """GET request"""

    if protocol == "https":
        conn = http.client.HTTPSConnection(host, context = ssl._create_unverified_context())
    elif protocol == "http":
        conn = http.client.HTTPConnection(host)
    else:
        return "[ERR] Unknown protocol provided!"

    try:
        conn.request('GET', path, json.dumps(event), headers)
        res = conn.getresponse()
        location_header = res.getheader("location")

        if location_header is not None:
            location = urljoin(path, location_header)
            # print(location)
            return get(protocol, host, location, headers, event)

        data = res.read()
    except Exception as error:
        return f"[ERR] {str(error)}"

    return data

def lambda_handler(event, context):
    """Lambda handler"""

    response = get(
        REQ_PROTOCOL,
        REQ_HOST,
        REQ_PATH,
        REQ_HEADERS,
        event
    )

    return {
        "statusCode": 200,
        "body": response,
        "headers": {
            "Content-Type": "application/json"
        }
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  VPC Lattice service
&lt;/h2&gt;

&lt;p&gt;That &lt;strong&gt;test-svc-1&lt;/strong&gt; is the main element, though, and most of the focus was put on the VPC Lattice service and service network configuration aspects, as well as cross-account and cross-VPC communication to services. It privately exposes several microservices under different combinations of protocols, ports, and paths, backed by various AWS compute services and configured with the following target groups (TGs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 in ASG&lt;/li&gt;
&lt;li&gt;ALB with EC2 in ASG&lt;/li&gt;
&lt;li&gt;ALB with ECS powered by Fargate&lt;/li&gt;
&lt;li&gt;Lambda functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apart from one Lambda function, everything else runs in private subnets across multiple availability zones.&lt;/p&gt;

&lt;p&gt;Guess what!? It all worked very nicely!&lt;br&gt;
But hey! Doesn’t that configuration and its elements look familiar?&lt;br&gt;
To me, it did, hence I’m going to risk the following statement…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Amazon VPC Lattice service is an implementation of a private Application Load Balancer with cross-account- and auth-related features in mind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  VPC Lattice service vs. ALB
&lt;/h3&gt;

&lt;p&gt;First of all, VPC Lattice is meant to satisfy application-layer load balancing with weighted targets and blue/green (B/G) deployment support.&lt;br&gt;
Moreover, let’s have a look at the target group’s target types available in both cases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tvm0xk6bkixr5nojk77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tvm0xk6bkixr5nojk77.png" alt="VPC Lattice service TG vs. EC2 (ALB) TG configuration options" width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, both TGs support pretty much the same target types, with only small differences, which doesn’t necessarily mean a given option is missing. E.g. even though &lt;strong&gt;Amazon EC2 Auto Scaling&lt;/strong&gt; is not explicitly mentioned in the Instances section, I managed to successfully attach an ASG as per the diagram above, because that option simply exists:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivlknqmbcqf8syck4v4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivlknqmbcqf8syck4v4i.png" alt="Image description" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes, that same ASG can be attached to an ALB and a VPC Lattice service at the same time. Nothing surprising here, as a given service may need to be accessible not only privately to another service but also to clients over a public network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv7hh6vdyk8ysnkbmpck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffv7hh6vdyk8ysnkbmpck.png" alt="Image description" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moreover, the VPC Lattice service routing configuration consists of listeners and rules, just like in the case of an ALB; however, the only available action is forwarding, with no request manipulation whatsoever.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyn7lvcbqettpp0mhglx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyn7lvcbqettpp0mhglx.png" alt="Image description" width="800" height="803"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the cost of just running a VPC Lattice service vs. an ALB is almost identical. For example, in the Ireland (eu-west-1) region, it’s $0.0275 vs. $0.0252 per hour, respectively.&lt;/p&gt;
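&lt;p&gt;As a rough sketch of what those hourly rates mean per month (assuming ~730 hours a month and ignoring per-GB data processing charges, which differ between the two services):&lt;/p&gt;

```shell
# Rough monthly base cost at the eu-west-1 hourly rates quoted above
# (~730 hours/month; data processing charges are excluded).
lattice=$(awk 'BEGIN { printf "%.2f", 0.0275 * 730 }')
alb=$(awk 'BEGIN { printf "%.2f", 0.0252 * 730 }')
echo "VPC Lattice service: ~\$${lattice}/month"
echo "ALB: ~\$${alb}/month"
```

So the base-cost difference is in the order of a couple of dollars per month.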
&lt;h3&gt;
  
  
  EKS integration
&lt;/h3&gt;

&lt;p&gt;I believe the Gateway API and Amazon EKS form a more specific use case that deserves separate treatment, therefore it’s not in the scope of this study.&lt;br&gt;
For those interested though, I can say that there is an &lt;strong&gt;AWS Gateway API controller&lt;/strong&gt; that is meant to let you connect services across different EKS clusters by leveraging Amazon VPC Lattice, and more info on how to do that can be found &lt;a href="https://www.gateway-api-controller.eks.aws.dev/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Admin vs. Developer
&lt;/h2&gt;

&lt;p&gt;A key selling point when VPC Lattice was announced was that it finally enables a separation of duties between network administrators and developers, who can now freely define and manage services themselves.&lt;br&gt;
This reminds me of the time shortly after the introduction of AWS Lambda and the foreshadowing of a no-Ops era. In this case too, time will tell whether configuring VPC Lattice services with no admin involvement is doable or not.&lt;br&gt;
Either way, the idea is that developers simply get a VPC in which to develop and define their services, and then share those services with admins who control VPC Lattice service networks by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;associating services and VPCs to networks based on requirements,&lt;/li&gt;
&lt;li&gt;sharing service networks with AWS accounts or organizations,&lt;/li&gt;
&lt;li&gt;enforcing authentication on service network access.&lt;/li&gt;
&lt;/ul&gt;
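&lt;p&gt;As a sketch of what that admin workflow could look like with the AWS CLI (all names and identifiers below are made up for illustration):&lt;/p&gt;

```shell
# Hypothetical sketch; all names and IDs are placeholders.

# Create a service network that enforces IAM-based auth on access.
aws vpc-lattice create-service-network \
  --name platform-sn \
  --auth-type AWS_IAM

# Associate a developer-owned service with the service network.
aws vpc-lattice create-service-network-service-association \
  --service-network-identifier sn-0123456789abcdef0 \
  --service-identifier svc-0123456789abcdef0

# Associate a consumer VPC (with its security group) with the service network.
aws vpc-lattice create-service-network-vpc-association \
  --service-network-identifier sn-0123456789abcdef0 \
  --vpc-identifier vpc-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0
```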
&lt;h3&gt;
  
  
  Sharing
&lt;/h3&gt;

&lt;p&gt;Both VPC Lattice services and service networks can be shared with the use of AWS Resource Access Manager (RAM). While services are shared so they can be associated with different service networks, service networks are shared so that other principals can associate their VPCs and communicate with the services associated with those networks.&lt;br&gt;
To better understand what can and cannot be done as a shared resource owner and/or consumer, see the “&lt;a href="https://docs.aws.amazon.com/vpc-lattice/latest/ug/sharing.html#sharing-perms" rel="noopener noreferrer"&gt;Responsibilities and permissions for shared resources&lt;/a&gt;” section in the docs, as the information there can become very helpful when designing more complex, enterprise-grade architectures and strategies.&lt;/p&gt;
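&lt;p&gt;For example, sharing a service network with another account through RAM could look roughly like this (the resource ARN and the account ID below are placeholders):&lt;/p&gt;

```shell
# Hypothetical sketch; the resource ARN and the principal are placeholders.
aws ram create-resource-share \
  --name lattice-service-network-share \
  --resource-arns arn:aws:vpc-lattice:eu-west-1:111122223333:servicenetwork/sn-0123456789abcdef0 \
  --principals 444455556666
```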
&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;The rules defining allowed network communication between individual services living in VPCs are applied with the use of &lt;strong&gt;security groups&lt;/strong&gt; (SGs) configured for every VPC-to-service-network association. The VPC Lattice IPv4 and IPv6 managed prefix lists, on the other hand, simplify the other part of that set-up: the security rules for clients and targets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkp969znxv2bu2f9tjzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkp969znxv2bu2f9tjzk.png" alt="Image description" width="800" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another element of the wider security picture are &lt;strong&gt;auth policies&lt;/strong&gt;, which can be applied at both the service network and the service level. They are represented by IAM policy documents and are meant to control, in a more granular way, which principal has access to which service or group of services.&lt;br&gt;
Both SGs and auth policies are optional but recommended.&lt;/p&gt;
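&lt;p&gt;As a hedged sketch, applying an auth policy to a single service could look like this (the service identifier and the role ARN are placeholders; the policy allows only one specific role to invoke the service):&lt;/p&gt;

```shell
# Hypothetical sketch; the service ID and the role ARN are placeholders.
aws vpc-lattice put-auth-policy \
  --resource-identifier svc-0123456789abcdef0 \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/consumer-role" },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*"
    }]
  }'
```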
&lt;h3&gt;
  
  
  Logging and monitoring
&lt;/h3&gt;

&lt;p&gt;While all the out-of-the-box features are nicely described in the docs, one thing worth emphasizing is that both VPC Lattice service and service network access logs can be streamed concurrently to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch Log Group&lt;/li&gt;
&lt;li&gt;S3 bucket&lt;/li&gt;
&lt;li&gt;Kinesis Data Firehose delivery stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Awesome! I would love to see that for every AWS service, including the ALB.&lt;/p&gt;
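&lt;p&gt;Each destination is configured as a separate access log subscription, so delivering service network logs to all three concurrently could be sketched as follows (all ARNs below are placeholders):&lt;/p&gt;

```shell
# Hypothetical sketch; all ARNs below are placeholders.
SN_ARN=arn:aws:vpc-lattice:eu-west-1:111122223333:servicenetwork/sn-0123456789abcdef0

aws vpc-lattice create-access-log-subscription \
  --resource-identifier "$SN_ARN" \
  --destination-arn arn:aws:logs:eu-west-1:111122223333:log-group:/vpclattice/test-sn

aws vpc-lattice create-access-log-subscription \
  --resource-identifier "$SN_ARN" \
  --destination-arn arn:aws:s3:::my-lattice-logs-bucket

aws vpc-lattice create-access-log-subscription \
  --resource-identifier "$SN_ARN" \
  --destination-arn arn:aws:firehose:eu-west-1:111122223333:deliverystream/lattice-logs
```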

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3mji0h2f58al0tiokzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3mji0h2f58al0tiokzq.png" alt="[test-svc-1] CloudWatch Logs Insights query output" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  VPC Lattice wrap-up
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;As the Amazon VPC Lattice service is still new, there are some caveats and limitations one should know about. One of them, claimed to be temporary, is that the console only allows creating listener rules with an exact-match path condition (case insensitive). To configure an HTTP header match condition for my PoC, I had to use the following AWS CLI command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ aws vpc-lattice create-rule \
--name lambda-hh \
--service-identifier svc-0ce77bf32833f5b5b \
--listener-identifier listener-0f287b2d41f2cb905 \
--action '{ "forward": { "targetGroups": [ { "targetGroupIdentifier": "tg-0096d77adfb1bcd29", "weight": 1 } ] } }' \
--match '{ "httpMatch": { "headerMatches": [ { "caseSensitive": false, "match": { "contains": "lambda-hh" }, "name": "My-Header" } ] } }' \
--priority 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the way, don’t expect to match the Host header with such a rule; it’s not like an ALB with host-based routing.&lt;br&gt;
However, creating that rule through the API resulted in the associated &lt;strong&gt;listener no longer being editable&lt;/strong&gt; from the console (see the screenshot above; the “Edit listener” button is greyed out).&lt;/p&gt;

&lt;p&gt;With the abilities delivered by VPC Lattice, your appetite for more complex and sophisticated scenarios may grow, so don’t forget that &lt;strong&gt;every VPC can be associated with only a single service network at a time&lt;/strong&gt;! Therefore, it’s important to design your architecture accordingly.&lt;/p&gt;

&lt;p&gt;When working with &lt;strong&gt;Lambda functions as targets&lt;/strong&gt;, as long as your function doesn’t require access to your custom VPC resources, there’s no need to configure it with a VPC. Even when you send a request via a VPC Lattice service to a Lambda function set up with a VPC, it won’t use the associated ENI in that VPC for communication. Instead, it will always communicate with your function via the Lambda API in the region where the function is located, which is clearly visible in the CW Logs Insights query output screenshot above (see the entries without a destinationVpcId).&lt;br&gt;
Need more details on that? Have a look at my “&lt;a href="https://medium.com/faun/aws-lambda-security-paradox-3002475dbe97" rel="noopener noreferrer"&gt;Lambda security paradox&lt;/a&gt;” story from 2019.&lt;/p&gt;

&lt;p&gt;There is also one thing that I found very easy to miss. Even though the docs say you should have rules allowing traffic from clients to VPC Lattice, one may whitelist only the clients on the other side of the network and forget about the &lt;strong&gt;&lt;em&gt;local&lt;/em&gt; clients living in the VPC associated with the service network&lt;/strong&gt;. In other words, when configuring a client (like the &lt;strong&gt;test-svc-curler&lt;/strong&gt; Lambda function in &lt;strong&gt;VPC .103&lt;/strong&gt;) you must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[Lambda SG] allow outbound traffic to the VPC Lattice prefix list&lt;/li&gt;
&lt;li&gt;[VPC SG] allow inbound traffic from the Lambda SG&lt;/li&gt;
&lt;/ul&gt;
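&lt;p&gt;Those two bullets translate roughly into the following CLI calls (all IDs, including the prefix list ID, are placeholders; the actual VPC Lattice prefix list ID can be found among the managed prefix lists in your region):&lt;/p&gt;

```shell
# Hypothetical sketch; all IDs below are placeholders.

# [Lambda SG] allow outbound HTTPS to the VPC Lattice managed prefix list.
aws ec2 authorize-security-group-egress \
  --group-id sg-0aaaabbbbccccdddd \
  --ip-permissions 'IpProtocol=tcp,FromPort=443,ToPort=443,PrefixListIds=[{PrefixListId=pl-0123456789abcdef0}]'

# [VPC association SG] allow inbound HTTPS from the Lambda SG.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0eeeeffff00001111 \
  --protocol tcp --port 443 \
  --source-group sg-0aaaabbbbccccdddd
```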

&lt;p&gt;Last but not least, make sure you know what the default &lt;strong&gt;service quotas&lt;/strong&gt; are and which ones can be adjusted upon request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1stwfnqpg6afpv5ziee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1stwfnqpg6afpv5ziee.png" alt="Image description" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Expectations
&lt;/h3&gt;

&lt;p&gt;After watching the re:Invent introductory video, I expected a bit more when it comes to path-based routing. Namely, I thought it would allow defining the entire application routing at the VPC Lattice service level regardless of the target type. With forwarding as the only available action, when using EC2 or ECS as targets the path is always passed through to the backend, which must take that information into account and know how to handle such requests without throwing a 404.&lt;br&gt;
While that is not a problem when using Lambda or the Gateway API for EKS, I thought it would be great to avoid having to keep route mappings in sync between infrastructure and application code, especially when they may live in different repositories and have independent deployment pipelines.&lt;/p&gt;

&lt;p&gt;And hey, where’s private Amazon API Gateway target support!?&lt;/p&gt;

&lt;h3&gt;
  
  
  The good, the great, and the awesome!
&lt;/h3&gt;

&lt;p&gt;Either way, I have no doubt that Amazon VPC Lattice is a superb improvement over what previously had to be put in place to satisfy similar private, cross-account, cross-VPC service-to-service comms. While there’s always room for enhancement and more features, which I’m sure AWS will introduce over time based on customers’ feedback, it has already made things easier by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simplifying service-to-service cross-VPC comms,&lt;/li&gt;
&lt;li&gt;mitigating the IP overlap issue,&lt;/li&gt;
&lt;li&gt;enhancing service-to-service comms security,&lt;/li&gt;
&lt;li&gt;facilitating B/G deployments or A/B testing,&lt;/li&gt;
&lt;li&gt;supporting migration and modernization activities,&lt;/li&gt;
&lt;li&gt;and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, a great thing about VPC Lattice is that services can be associated with many service networks at the same time, which provides maximum flexibility and extensibility. I can’t wait to start using it on future projects!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>networking</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Amazon EKS with Terraform and GitOps in minutes</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Mon, 28 Nov 2022 11:17:27 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-eks-with-terraform-and-gitops-in-minutes-2ncp</link>
      <guid>https://forem.com/aws-builders/aws-eks-with-terraform-and-gitops-in-minutes-2ncp</guid>
      <description>&lt;p&gt;This one is simply a result of a need that I had and that was about getting a fully functional, flexible, and secure Amazon EKS cluster set up in under half an hour to be able to test anything asap. For that, I did not want to spend too much time developing IaC myself as there are so many great sources out there that are worth supporting rather than reinventing the wheel. The force is there in the community and as an AWS Community Builder I came across something that met my expectations hence I’m sharing my experience hoping you may find it helpful too.&lt;/p&gt;

&lt;p&gt;It is meant to get you your EKS cluster while you can go buy yourself a coffee ☕️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwwn67eykcgq89837je9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwwn67eykcgq89837je9.png" alt="Image description" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time I will start the other way around and go straight to the solution, while context and other details can be found down below.&lt;br&gt;
The only thing to reveal at this stage is that I’m leveraging &lt;a href="http://aws-ia.github.io/terraform-aws-eks-blueprints" rel="noopener noreferrer"&gt;Amazon EKS Blueprints for Terraform&lt;/a&gt; 🚀&lt;/p&gt;
&lt;h2&gt;
  
  
  MVP
&lt;/h2&gt;

&lt;p&gt;While one can use the flexibility of the EKS Blueprints solution to set things up in many different ways depending on individual requirements, I’ve got a minimal/initial configuration I start with, which consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the control plane with &lt;em&gt;whitelisted&lt;/em&gt; public access,&lt;/li&gt;
&lt;li&gt;the data plane (spot EC2 instances) communicating with the control plane privately,&lt;/li&gt;
&lt;li&gt;all EKS-managed add-ons enabled and using the most recent versions,&lt;/li&gt;
&lt;li&gt;ArgoCD publicly accessible (&lt;em&gt;whitelisted&lt;/em&gt;) through an ALB configured with a Route53 domain and an ACM certificate,&lt;/li&gt;
&lt;li&gt;a set of additional add-ons deployed with the use of ArgoCD and following the GitOps approach.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following extra add-ons are enabled by default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster autoscaler&lt;/li&gt;
&lt;li&gt;AWS load balancer controller&lt;/li&gt;
&lt;li&gt;External DNS&lt;/li&gt;
&lt;li&gt;FluentBit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ce75r4kf0av8isbtnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ce75r4kf0av8isbtnl.png" alt="Image description" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the code that sets everything up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sebolabs/eks-tf-gitops/tree/release-1-0-0" rel="noopener noreferrer"&gt;EKS-TF-GITOPS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s opinionated, however, I believe it’s a perfect starting point where you have a fully functional Kubernetes cluster with GitOps support and can immediately start deploying and testing anything you want.&lt;/p&gt;

&lt;p&gt;The Terraform code (/terraform) in this repo consists of three components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;account&lt;/strong&gt; (optional) — covers S3 bucket for storing logs, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;core&lt;/strong&gt; — covers networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;k8s&lt;/strong&gt; — covers EKS cluster configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, there’s a K8s configuration (/k8s) covering add-ons that ArgoCD periodically reads to keep things set up as declared in Git, the GitOps way. If you decide to use that code, simply look for TODOs and provide values relevant to your set-up.&lt;/p&gt;
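&lt;p&gt;Assuming the component layout described above, bringing everything up boils down to applying the components in order; a minimal sketch:&lt;/p&gt;

```shell
# Sketch of the apply order; configure backends/variables first,
# and skip the optional "account" component if you don't need it.
for component in account core k8s; do
  (
    cd "terraform/${component}" || exit 1
    terraform init
    terraform apply -auto-approve
  )
done
```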

&lt;p&gt;Finally, after running Terraform and then going to get your well-deserved coffee…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u3gtdk16bi3j1t568xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u3gtdk16bi3j1t568xo.png" alt="Image description" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;… it’s there, up and running!&lt;/p&gt;

&lt;p&gt;It needs a couple of minutes to deploy the add-ons automatically, including the &lt;strong&gt;AWS Load Balancer controller&lt;/strong&gt; and &lt;strong&gt;External DNS&lt;/strong&gt; responsible for exposing the ArgoCD UI publicly.&lt;br&gt;
Then, you just have to retrieve ArgoCD's initial admin password…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath=”{.data.password}” | base64 -d 

n20x3mwZoapDv9JC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
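&lt;p&gt;If the decoding step looks unfamiliar: the secret value comes back base64-encoded, and &lt;code&gt;base64 -d&lt;/code&gt; simply reverses that. A quick local roundtrip with an example string (no cluster required):&lt;/p&gt;

```shell
# Local demo of the decoding step; "example-password" stands in
# for the real secret value.
encoded=$(printf '%s' "example-password" | base64)
printf '%s' "$encoded" | base64 -d   # prints: example-password
```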



&lt;p&gt;…and you can log in 😎&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpk7bmyv6x4volrot78l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpk7bmyv6x4volrot78l.png" alt="Image description" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  App of Apps
&lt;/h2&gt;

&lt;p&gt;Now there’s that first ArgoCD application, called add-ons, to which the enabled core K8s controllers belong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuhgwciu2ianc0fszx7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuhgwciu2ianc0fszx7f.png" alt="Image description" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about the App of Apps pattern in ArgoCD check this &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;

&lt;p&gt;Apart from the cluster logs that were enabled, the logs from all the pods also get nicely delivered to CloudWatch and can be easily queried with Logs Insights. See some examples below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqk0141v6lgxu60cqae3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjqk0141v6lgxu60cqae3.png" alt="Image description" width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdu6n51ftxan6v2f9anz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdu6n51ftxan6v2f9anz.png" alt="Image description" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  EKS Blueprints
&lt;/h2&gt;

&lt;p&gt;Now, let’s get to the roots of the solution…&lt;/p&gt;

&lt;p&gt;EKS Blueprints helps you compose complete EKS clusters that are fully bootstrapped with the operational software that is needed to deploy and operate workloads. With EKS Blueprints, you describe the configuration for the desired state of your EKS environment, such as the control plane, worker nodes, and Kubernetes add-ons, as an IaC blueprint.&lt;/p&gt;

&lt;p&gt;Looks like a sponsored advertisement? Maybe, but it’s not!&lt;br&gt;
What I was after personally is something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;follows best practices&lt;/li&gt;
&lt;li&gt;is flexible and extensible&lt;/li&gt;
&lt;li&gt;is being actively supported&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First of all, EKS Blueprints turned out not to be an open-source project supported only by some K8s and AWS enthusiasts. It’s the result of cooperation between AWS representatives, their partners, and others, as an answer to customers’ needs. That, I believe, has defined the direction and shaped the foundation of what the project represents. One of its pillars is that it follows AWS Well-Architected Framework best practices and therefore lets you focus more on the functional side of your set-up.&lt;br&gt;
Secondly, it supports a wide and constantly growing range of Kubernetes &lt;a href="http://aws-ia.github.io/terraform-aws-eks-blueprints/add-ons/" rel="noopener noreferrer"&gt;add-ons&lt;/a&gt;, and it can be deployed either with Terraform or with the AWS CDK, probably the two most popular tools out there.&lt;br&gt;
Then, it implements the so-called &lt;a href="https://aws-ia.github.io/terraform-aws-eks-blueprints/add-ons/#gitops-bridge" rel="noopener noreferrer"&gt;GitOps bridge&lt;/a&gt; that takes care of configuring resources (e.g. IAM roles and service accounts) to satisfy the add-ons’ functional requirements.&lt;br&gt;
Lastly, the already mentioned growing community (people using it for real, professionals battle-testing it on real projects, in many different use cases and for other purposes) made me realize it is not an ephemeral thing and that enough quality is there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;Things one should be aware of when using EKS Blueprints…&lt;/p&gt;

&lt;p&gt;There are multiple module calls happening behind the scenes, due to the fact that EKS Blueprints supports quite a wide range of various controllers/add-ons, which ultimately makes the Terraform initialization last a bit longer (~3 minutes).&lt;/p&gt;

&lt;p&gt;When configuring private connectivity between the data plane and the API server endpoint, sometimes things don’t work at the beginning. AWS recommends that if your endpoint does not resolve to a private IP address within the VPC, you enable public access and then disable it again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html" rel="noopener noreferrer"&gt;Amazon EKS cluster endpoint access control&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deleting K8s namespaces created with Terraform is not that straightforward so check the link below to get yourself unblocked just in case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/eks-terminated-namespaces/" rel="noopener noreferrer"&gt;Troubleshoot terminated Amazon EKS namespaces&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beware! Terraform doesn’t know about AWS resources provisioned by K8s controllers running on the cluster. Make sure you tidy up after running &lt;code&gt;terraform destroy&lt;/code&gt; or know what you should do to make the controllers delete relevant resources before destroying your infrastructure. The last thing you want is to have a couple of ALBs hanging there and costing you $16–$20 per month each.&lt;/p&gt;
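&lt;p&gt;A sketch of that tidy-up, assuming the ALBs were created by the AWS Load Balancer controller from Ingress resources:&lt;/p&gt;

```shell
# Sketch: let the K8s controllers deprovision their AWS resources first.
# Deleting the Ingress objects makes the AWS Load Balancer controller
# remove the ALBs it created; give it time before destroying the infra.
kubectl delete ingress --all --all-namespaces
sleep 120
terraform destroy
```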

</description>
      <category>blockchain</category>
      <category>cryptocurrency</category>
    </item>
    <item>
      <title>AWS Landing Zone: Hybrid networking</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Thu, 07 Jul 2022 08:32:27 +0000</pubDate>
      <link>https://forem.com/aws-builders/aws-landing-zone-hybrid-networking-ele</link>
      <guid>https://forem.com/aws-builders/aws-landing-zone-hybrid-networking-ele</guid>
      <description>&lt;p&gt;Previously, I fleshed out the core aspects of AWS Control Tower managed landing zone and brought closer how to approach accounts baselining to maintain consistency and elevate the security level across the estate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://faun.pub/aws-landing-zone-1-expanding-control-tower-managed-estate-d6dd52e5b06c" rel="noopener noreferrer"&gt;AWS Landing Zone #1: Expanding Control Tower managed estate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://faun.pub/aws-landing-zone-2-control-tower-account-factory-and-baselining-debd72d2167" rel="noopener noreferrer"&gt;AWS Landing Zone #2: Control Tower Account Factory and baselining&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one is more of a try-out of how hybrid networking could potentially be set up, including hybrid DNS. Why so? Because it really depends on the individual requirements of an organisation. Such requirements can revolve around various aspects like security, scalability, performance, etc., so the final architecture should be carefully considered to make sure the correct model is applied. Otherwise, one may end up with a configuration that won’t fit in the long term, while modifications to such a fundamental matter can turn out to be very costly in many ways.&lt;br&gt;
I decided to define objectives that could fit a wide range of possible use cases and set things up myself, to gain even more experience with the recent AWS network services as well as share my thoughts, as usual.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid networking
&lt;/h2&gt;

&lt;p&gt;Hybrid networking is nothing more than connecting on-premises networks with those in the cloud in a secure and performant way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivdete7r5l6iho67iuck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivdete7r5l6iho67iuck.png" alt="Image description" width="713" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To establish hybrid network connectivity three elements are required:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS hybrid connectivity service (Virtual Private Gateway, Transit Gateway, Direct Connect Gateway)&lt;/li&gt;
&lt;li&gt;Hybrid network connection (AWS managed or software VPN, Direct Connect)&lt;/li&gt;
&lt;li&gt;On-prem customer gateway&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Objectives
&lt;/h3&gt;

&lt;p&gt;As my on-prem network I simply chose my home network, so there was only one set of building blocks I could use, and these were respectively:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Transit Gateway (TGW)&lt;/li&gt;
&lt;li&gt;AWS Managed Site-to-Site VPN (S2S VPN)&lt;/li&gt;
&lt;li&gt;StrongSwan @ RaspberryPi&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The functional objectives were to get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;centralised TGW in the Networking account shared across the Organisation with the use of AWS Resource Access Manager (RAM)&lt;/li&gt;
&lt;li&gt;centralised egress routing via NAT Gateway (NGW) and Internet Gateway (IGW) living in the Networking account&lt;/li&gt;
&lt;li&gt;hybrid DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hybrid DNS
&lt;/h3&gt;

&lt;p&gt;DNS is a critical component of every network. I wanted to make sure that I can resolve my local/home DNS domain (&lt;strong&gt;sebolabs.home&lt;/strong&gt;), hosted on my Synology NAS, from AWS accounts across my Organization, and at the same time be able to resolve the Route53 (R53) Private Hosted Zones’ records configured in those accounts.&lt;br&gt;
The challenge here was the decision itself, which had to be made so as to get it all set up in the least complicated way while keeping in mind the associated costs, given that the R53 resolver endpoints are not the cheapest services out there. They cannot be avoided though, as the VPC’s native R53 resolver (.2) is not reachable from outside of AWS. The concept of using the Inbound/Outbound resolvers is pretty straightforward per se; however, things get more complicated in a multi-account set-up.&lt;/p&gt;
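&lt;p&gt;To make the outbound direction concrete: a forwarding rule attached to the outbound resolver endpoint sends queries for the on-prem domain towards the on-prem DNS server, roughly like this (the endpoint ID and the target IP are placeholders):&lt;/p&gt;

```shell
# Hypothetical sketch; the endpoint ID and the target IP are placeholders.
aws route53resolver create-resolver-rule \
  --creator-request-id sebolabs-home-rule-001 \
  --name sebolabs-home \
  --rule-type FORWARD \
  --domain-name sebolabs.home \
  --resolver-endpoint-id rslvr-out-0123456789abcdef0 \
  --target-ips Ip=192.168.1.10,Port=53
```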

&lt;p&gt;One other objective that I set was to be able to provide flexibility and autonomy to manage the R53 Private Hosted Zones (PHZ) within individual accounts but under one condition. That condition was that those hosted zones must overlap with the root hosted zone living in the Networking account along with the resolvers, namely they must represent subdomains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking AWS account root R53 PHZ: &lt;strong&gt;&lt;em&gt;sebolabs.aws&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Sandbox AWS account R53 PHZ: sandbox.&lt;strong&gt;&lt;em&gt;sebolabs.aws&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;another AWS account R53 PHZ: any.&lt;strong&gt;&lt;em&gt;sebolabs.aws&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apart from the overlapping domain namespaces, one other requirement is that all the R53 PHZs across the Organization’s accounts that want to benefit from the hybrid DNS must be associated with the VPC that the root PHZ is associated with and where the R53 resolvers are located. At the same time, the Outbound R53 resolver rule must be shared through RAM so that it can be associated with all the other VPCs. The alternative of centralising multiple hosted zones in a shared AWS account didn’t feel appealing to me, so that was the idea I went with.&lt;/p&gt;
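&lt;p&gt;The cross-account PHZ association itself is a two-step process; a sketch (the hosted zone ID and the VPC ID are placeholders):&lt;/p&gt;

```shell
# Hypothetical sketch; the zone ID and the VPC ID are placeholders.

# 1) In the account owning the PHZ (e.g. Sandbox): authorize the association.
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-0123456789abcdef0

# 2) In the account owning the VPC (Networking): complete the association.
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-0123456789abcdef0
```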

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;Just to make it crystal clear, the solution presented below is a sort of MVP that in a real-world scenario would have to be expanded, at the very least to introduce enough resiliency and performance. I will touch upon that matter later in this section. Bear with me…&lt;/p&gt;

&lt;h3&gt;
  
  
  High-level design
&lt;/h3&gt;

&lt;p&gt;For my PoC, I set everything up just as explained above, leveraging my AWS Organization’s Networking account and a Sandbox one.&lt;br&gt;
My on-prem network, on the other hand, is represented by a single &lt;strong&gt;Raspberry Pi 4&lt;/strong&gt; running &lt;strong&gt;Ubuntu 20.04&lt;/strong&gt; and &lt;strong&gt;strongSwan 1.9.4&lt;/strong&gt;, as well as a &lt;strong&gt;Synology DS218+&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To keep the diagram below as clean as possible and highlight the concept, the traffic lines are drawn flowing through the TGW, while the associated NICs are shown merely to indicate that they physically exist; in fact, all that traffic is handled by them. For the same reason, no availability zones are visualised even though the entire set-up is Multi-AZ, and local routes were omitted from the routing tables.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0vx69rt8l047z686mzy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0vx69rt8l047z686mzy.png" alt="Image description" width="765" height="1935"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Transit Gateway
&lt;/h3&gt;

&lt;p&gt;As you can see, the Transit Gateway routing has been simplified, as the use case above is not complex. Normally, there would be multiple TGW routing tables assigned to attachments, depending on the individual connectivity requirements of a particular VPC.&lt;/p&gt;

&lt;p&gt;Centralised egress to the Internet is, apart from being a way of reducing the cost of running NAT and Internet gateways in each VPC that requires them, also an opportunity to introduce security appliances (“bump-in-the-wire”) combined with AWS Gateway Load Balancer for traffic inspection, or to make use of AWS Network Firewall.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49sjjp5152188wt906ij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49sjjp5152188wt906ij.png" alt="Image description" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS resolution
&lt;/h3&gt;

&lt;p&gt;Support for overlapping domain names, which is the core concept of the proposed set-up, was introduced in late 2019 and made it easy to distribute permissions for managing private hosted zones across the organisation.&lt;br&gt;
At the same time, it allows the R53 resolver to route traffic based on the most specific match: if there is no hosted zone that exactly matches the domain name in the request, the R53 resolver checks for a hosted zone whose name is the parent of the domain name in the request.&lt;/p&gt;
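&lt;p&gt;To illustrate the most-specific-match behaviour, here is a tiny Python sketch (the zone and record names are hypothetical, and the real resolver obviously does far more than this):&lt;/p&gt;

```python
def most_specific_zone(zones, qname):
    """Pick the hosted zone whose name matches the longest suffix of qname,
    mimicking how the R53 resolver chooses between overlapping PHZs."""
    candidates = [
        z for z in zones
        if qname == z or qname.endswith("." + z)
    ]
    # The longest matching zone name is the most specific one.
    return max(candidates, key=len) if candidates else None

zones = ["sebolabs.aws", "sandbox.sebolabs.aws"]
print(most_specific_zone(zones, "db.sandbox.sebolabs.aws"))  # sandbox.sebolabs.aws
print(most_specific_zone(zones, "api.sebolabs.aws"))         # sebolabs.aws
```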

&lt;p&gt;As PHZs are global constructs rather than regional ones, they are also a perfect means of supporting DR scenarios in a multi-region solution. The same goes for R53 Inbound/Outbound resolvers: another pair of resolvers can be configured in a second region to fail over to in case of a primary-region failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;The above is obviously just a foundation. Things get complicated once you start considering how your workloads will run across your Organisation’s managed accounts and how the services they host will be exposed.&lt;br&gt;
Now, with centralised egress in place, what about exposing your services to the Internet? Wait, wasn’t one of the main ideas behind centralising egress to disallow the creation of Internet Gateways in managed accounts through SCPs? In that case, you probably either centralise your ingress as well, or perhaps disallow the creation of NAT Gateways and the association of public IP addresses instead.&lt;br&gt;
I just wanted to emphasise how one decision can drive another and eventually influence the shape of the target solution. In the end, every Organisation wants to end up with patterns and procedures for doing things, doesn’t it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Going back to my initial statement, there are multiple ways such a hybrid network can be designed and implemented, depending on requirements. Each individual piece of functionality must be carefully thought through.&lt;br&gt;
Related aspects I came across while working with companies in their cloud-enablement phase included, among other things, considerations around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a centralised VPC with subnet sharing through RAM, along with centrally managed R53 PHZs per share&lt;/li&gt;
&lt;li&gt;a single, centralised ingress ALB/NLB with multiple rules passing traffic to internal ALBs/NLBs, with a firewall in between&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all of those ideas turned out to be good choices; therefore, thinking globally and making your target solution as flexible as possible is the way to go. That, of course, requires a lot of experience within the team.&lt;/p&gt;

&lt;p&gt;Now that AWS offers an enormous range of services and options to deliver very complex solutions, and as organisations decide to migrate to the cloud, we’re back to centralising things, just in a different place. The reason is that with hundreds of AWS accounts running a large number of workloads, organisations want to retain some control and raise the level of security, which is probably the most important factor for them when deciding to migrate to the cloud. The times when individual projects were treated independently seem to be over. We’re back doing the same work around networking, only in someone else’s data centre :)&lt;/p&gt;

&lt;p&gt;As all these things are not always easy to comprehend, especially as AWS services evolve, I strongly suggest following the &lt;a href="https://aws.amazon.com/blogs/networking-and-content-delivery/" rel="noopener noreferrer"&gt;Networking and Content Delivery Blog&lt;/a&gt; from AWS, where you can find many useful clues and solutions, or at least get your head around what’s going on in that world.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>hybridnetwork</category>
      <category>landingzone</category>
      <category>devops</category>
    </item>
    <item>
      <title>Auth Portal powered by AWS/AzureAD and built with CDKs</title>
      <dc:creator>Sebastian Mincewicz</dc:creator>
      <pubDate>Fri, 10 Jun 2022 08:39:28 +0000</pubDate>
      <link>https://forem.com/aws-builders/auth-portal-powered-by-awsazuread-and-built-with-cdks-2jhk</link>
      <guid>https://forem.com/aws-builders/auth-portal-powered-by-awsazuread-and-built-with-cdks-2jhk</guid>
<description>&lt;p&gt;This one aims to bring together all the pieces required to build and deploy an authentication portal in AWS, leveraging Azure AD as the IdP. It’s a pattern that has recently been used more and more often across AWS projects, and this time I thought I would go about it a bit differently to try out new things and thereby gain more insight into some tech I haven’t had a chance to master yet. As usual, I’m sharing some thoughts and experiences along the way with whoever’s interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;p&gt;From the stack outlined below, I already have extensive experience with AWS and Terraform, which should come as no surprise if you have read my previous &lt;a href="https://medium.com/@sebolabs" rel="noopener noreferrer"&gt;publications&lt;/a&gt;. However, this time I wanted to shift my focus to some other popular, open-source tooling I had limited knowledge of and put it into the mix.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS&lt;/strong&gt; — to host the Portal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS CDK&lt;/strong&gt; (v. 2.26.0) — for developing the Portal (&lt;em&gt;TypeScript&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure AD&lt;/strong&gt; — as an identity service provider (federated authentication)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDKtf&lt;/strong&gt; (v. 0.11.0) — for configuring Azure AD (&lt;em&gt;TypeScript&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;React/Amplify&lt;/strong&gt; — for a bit of frontend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As in my job I have to decide from time to time what tooling should be used to deliver a solution, I badly wanted to give &lt;strong&gt;&lt;em&gt;AWS CDK&lt;/em&gt;&lt;/strong&gt; a full end-to-end go and see whether the framework can easily be used for something more than just a PoC. And since there was &lt;strong&gt;&lt;em&gt;CDKtf&lt;/em&gt;&lt;/strong&gt; too, I decided to give that one a chance as well.&lt;br&gt;
Regarding the frontend bit, don’t expect too much ;) It’s not something I normally do; I only set it up to round out the entire infrastructure stack this story is mainly about and to show a tangible result.&lt;/p&gt;
&lt;h2&gt;
  
  
  Design
&lt;/h2&gt;

&lt;p&gt;The high-level diagram below represents the spectrum of services composing the Auth Portal in AWS and its integration with Azure AD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya4hrjly1yb70u9fhwvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fya4hrjly1yb70u9fhwvn.png" alt="Image description" width="800" height="480"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;In case you want to deepen your understanding of the SAML user pool IdP authentication flow please navigate to this page: &lt;a href="https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-pools-saml-idp-authentication.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-user-pools-saml-idp-authentication.html&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Goals
&lt;/h3&gt;

&lt;p&gt;Apart from the fact that I wanted to get some hands-on experience with CDKs I also set several goals for myself in terms of what I wanted to try out. These were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cognito hosted UI (with customisations)&lt;/li&gt;
&lt;li&gt;Cognito custom domain&lt;/li&gt;
&lt;li&gt;Pre token generation Lambda trigger&lt;/li&gt;
&lt;li&gt;Cognito required claims&lt;/li&gt;
&lt;li&gt;Azure AD additional/custom claims&lt;/li&gt;
&lt;li&gt;Azure AD → Cognito claims mappings&lt;/li&gt;
&lt;li&gt;Azure App access restrictions based on security group membership&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While I abandoned the last one, as I wanted to keep the &lt;em&gt;FREE&lt;/em&gt; Azure subscription plan that only allows access restrictions on a per-user basis, the rest turned out not to be difficult to configure.&lt;/p&gt;
&lt;h3&gt;
  
  
  Source code
&lt;/h3&gt;

&lt;p&gt;The source code for the entire stack can be found below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sebolabs/auth-portal" rel="noopener noreferrer"&gt;https://github.com/sebolabs/auth-portal&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To use it, you’ll have to export the required environment variables.&lt;br&gt;
Additionally, it expects you to have your own public domain and a valid ACM certificate in the N. Virginia region (&lt;em&gt;CloudFront and Cognito requirement&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Because the frontend part is optional, I decided to keep it separate from the AWS portal stack code. If you want to test the authentication flow in the simplest possible way, just set the &lt;code&gt;S3_DUMMY_PAGE_DEPLOY&lt;/code&gt; environment variable to &lt;code&gt;true&lt;/code&gt;, surf to the address below, sign in, and then check things with Developer Tools in the browser and &lt;a href="https://jwt.io" rel="noopener noreferrer"&gt;jwt.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://&amp;lt;cognito custom domain&amp;gt;/login?response_type=code&amp;amp;client_id=&amp;lt;cognito client id&amp;gt;&amp;amp;redirect_uri=&amp;lt;portal site url&amp;gt;&lt;/code&gt;&lt;/p&gt;
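&lt;p&gt;As a minimal sketch, the hosted UI login URL above can be assembled programmatically; the domain, client ID, and redirect URI below are made-up placeholders, so substitute your own values:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Hypothetical values -- substitute your own Cognito custom domain,
# app client ID, and portal site URL.
auth_domain = "auth.test.example.com"
params = {
    "response_type": "code",
    "client_id": "example1234567890",
    "redirect_uri": "https://portal.test.example.com",
}
# urlencode percent-encodes the redirect URI as the query string is built.
login_url = f"https://{auth_domain}/login?{urlencode(params)}"
print(login_url)
```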

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F391ehz6fazfe7f1t7bbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F391ehz6fazfe7f1t7bbp.png" alt="Image description" width="800" height="405"&gt;&lt;/a&gt; &lt;/p&gt;
&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;p&gt;The Cognito &lt;strong&gt;custom domain&lt;/strong&gt; feature assumes that if you expose the Cognito authentication endpoint at &lt;code&gt;auth.test.example.com&lt;/code&gt;, then your landing page is &lt;code&gt;test.example.com&lt;/code&gt;. If that's not the case, as in my example, then Cognito expects you to have a resolvable A record configured for &lt;code&gt;test.example.com&lt;/code&gt; in order to perform some verification. Moreover, on the Azure side, you must also let your root domain &lt;code&gt;example.com&lt;/code&gt; be verified by configuring a TXT record with a provided value.&lt;/p&gt;

&lt;p&gt;Cognito &lt;strong&gt;required claims&lt;/strong&gt; can only be set up at the user pool provisioning stage and cannot be modified later. This means that if you change your mind, you’ll have to delete your Cognito user pool and create it again. It also means your user pool ID changes, along with the client application ID, so both the Azure AD and frontend configurations must be updated with the new values. What can turn out to be even more disruptive is that you lose all user profiles previously created in Cognito. Luckily, users can now be imported from CSV.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Pre token generation Lambda function&lt;/strong&gt; hasn’t got any logic customising the claims and was set up only to see what useful information it can produce &lt;em&gt;out of the box&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  AWS &amp;amp; CDK
&lt;/h2&gt;

&lt;p&gt;With CDK, the stack gets synthesized and translated into a CloudFormation template. Depending on the features you decide to leverage in your code, additional custom resources may be spun up for you to satisfy requirements.&lt;br&gt;
A cool feature here is that any changes to IAM resources require your attention and approval before getting deployed, which is useful from a security/audit perspective.&lt;br&gt;
One thing to keep in mind when bootstrapping a CDK project is the &lt;code&gt;qualifier&lt;/code&gt; option, which helps you avoid resource name clashes when provisioning multiple bootstrap stacks in the same AWS account.&lt;/p&gt;
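&lt;p&gt;As a minimal sketch (the account ID, region, and qualifier value below are made up), bootstrapping with a custom qualifier looks roughly like this:&lt;/p&gt;

```shell
# Bootstrap the environment with a non-default qualifier so the toolkit
# resources don't clash with other bootstrap stacks in the same account.
cdk bootstrap --qualifier portal01 aws://111111111111/eu-central-1

# The app must then synthesize against the same qualifier, e.g. via the
# "@aws-cdk/core:bootstrapQualifier" context key in cdk.json.
```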

&lt;p&gt;Must read: &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/best-practices.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/cdk/v2/guide/best-practices.html&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cdk deploy
✨  Synthesis time: 3.59s

PortalStack: deploying...
[0%] start: Publishing 483ae06ed27ef8ca76e011264d772420593a6cfe8544759c306ef3b98c9e25be:XXXXXXXXXXXX-eu-central-1
[...]
[100%] success: Published ca2e471276d39c586eae61d73c2e253eb08b4a648a8676f2000f81271b73a405:XXXXXXXXXXXX-eu-central-1

PortalStack: creating CloudFormation changeset...
✅  PortalStack
✨  Deployment time: 393.82s

Outputs:
PortalStack.cognitoDomainName = https://auth.test.sebolabs.net
PortalStack.cognitoUserPoolClientId = 72661fo2r4bgob1ateqskfhicd
PortalStack.cognitoUserPoolId = eu-central-1_XXXXXXXXX
PortalStack.frontendS3BucketName = sebolabs-test-portal-XXXXXXXXXXXX-eu-central-1
PortalStack.portalSiteUrl = https://portal.test.sebolabs.net

Stack ARN:
arn:aws:cloudformation:eu-central-1:XXXXXXXXXXXX:stack/PortalStack/e9974180-e322-11ec-ab92-02e74f818fb0

✨  Total time: 397.4s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3an9jvtpkpkderkoyky7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3an9jvtpkpkderkoyky7.png" alt="Image description" width="800" height="475"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;One of the fundamentals and a huge advantage of using CDK is the fact that infrastructure code can be tested just like application code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm test
&amp;gt; portal@0.1.0 test
&amp;gt; jest
 PASS  test/cf.test.ts
 PASS  test/cognito.test.ts
 PASS  test/s3.test.ts
 PASS  test/lambda.test.ts
Test Suites: 4 passed, 4 total
Tests:       9 passed, 9 total
Snapshots:   0 total
Time:        3.852 s, estimated 4 s
Ran all test suites.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Azure AD &amp;amp; CDKtf
&lt;/h2&gt;

&lt;p&gt;With CDKtf, the stack gets synthesized and translated into a (zipped) Terraform configuration that is then executed either locally or remotely.&lt;br&gt;
When using Terraform Cloud as your backend, you can have it store your state and, optionally, run Terraform for you, keeping track of all your runs in one place. Another nice feature of that remote backend is that it also versions states and highlights changes (a diff) between consecutive Terraform runs.&lt;br&gt;
Furthermore, CDKtf supports most of Terraform’s well-known commands, e.g. &lt;code&gt;output&lt;/code&gt;, which can become very useful, for example, to pass certain values between stages in a CI/CD pipeline, but also &lt;code&gt;locals&lt;/code&gt;, &lt;code&gt;remote state&lt;/code&gt;, and other fundamental Terraform features.&lt;/p&gt;

&lt;p&gt;Must read: &lt;a href="https://www.terraform.io/cdktf" rel="noopener noreferrer"&gt;https://www.terraform.io/cdktf&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cdktf deploy
sebolabs-aad-auth-portal  Initializing the backend...
sebolabs-aad-auth-portal  Initializing provider plugins...
                          - Reusing previous version of hashicorp/azuread from the dependency lock file
sebolabs-aad-auth-portal  - Using previously-installed hashicorp/azuread v2.22.0
sebolabs-aad-auth-portal  Terraform has been successfully initialized!

azuread_claims_mapping_policy.portal_cmp (portal_cmp): Refreshing state... [id=a733375e-b67c-49de-9142-12af23de9afa]
azuread_group.portal_users (portal_users): Refreshing state... [id=e426314e-14fb-4415-9fd8-e170981b378e]
azuread_application.portal_app (portal_app): Refreshing state... [id=764cc3ae-2cf0-4cd5-9521-091b7bd3bece]
azuread_service_principal.portal_sp (portal_sp): Refreshing state... [id=a4d4926d-c5cc-402c-96c4-016dc396325f]
azuread_service_principal_claims_mapping_policy_assignment.portal_cmpa (portal_cmpa): Refreshing state... [id=a4d4926d-c5cc-402c-96c4-016dc396325f/claimsMappingPolicy/a733375e-b67c-49de-9142-12af23de9afa]

No changes. Your infrastructure matches the configuration.
Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

sebolabs-aad-auth-portal
  app_id = 6e8b9c9d-715b-4371-810f-4avc07a27c2x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i1dgjgwhcc2k7v9zssk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i1dgjgwhcc2k7v9zssk.png" alt="Image description" width="800" height="1025"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;Likewise, CDKtf also enables you to run unit tests against your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm test
&amp;gt; portal-azure@1.0.0 test
&amp;gt; jest
PASS  __tests__/main-test.ts
  Terraform
    ✓ check if the produced terraform configuration is valid
    ✎ todo check if this can be planned
  AzureAD configuration
    ✎ todo should contain an application
    ✎ todo should contain a service principal
    ✎ todo should contain a claims mapping policy
    ✎ todo should contain a service principal claims mapping policy assignment
Test Suites: 1 passed, 1 total
Tests:       5 todo, 1 passed, 6 total
Snapshots:   0 total
Time:        4.438 s, estimated 5 s
Ran all test suites.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Outcome
&lt;/h2&gt;

&lt;p&gt;And finally, here’s the result of putting all the things together…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4u11iwr6fyobq3m2rgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4u11iwr6fyobq3m2rgr.png" alt="Image description" width="800" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjues1xw0mnwxj6kx0kdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjues1xw0mnwxj6kx0kdk.png" alt="Image description" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ef4po1ljvl695gcajhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ef4po1ljvl695gcajhu.png" alt="Image description" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gi8g5w1xzzueqswwdcc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gi8g5w1xzzueqswwdcc.png" alt="Image description" width="800" height="438"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmnvm11n2tt8bprgf0b8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmnvm11n2tt8bprgf0b8.png" alt="Image description" width="800" height="522"&gt;&lt;/a&gt;     &lt;/p&gt;

&lt;p&gt;💡 In case some user information was not retrieved or mapped as expected, it’s worth comparing the SAML response (&lt;strong&gt;&lt;em&gt;idpresponse&lt;/em&gt;&lt;/strong&gt; payload) with the ID JWT token (&lt;strong&gt;&lt;em&gt;token&lt;/em&gt;&lt;/strong&gt; preview) using Developer Tools from within a browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  CloudWatch Logs Insights
&lt;/h3&gt;

&lt;p&gt;The aforementioned Lambda function, which is meant to be used for customising ID token claims, can also be used simply to log certain information carried by the tokens. Especially since Cognito is a black box that reveals very little, such data can become very useful, e.g. when the authentication flow must be debugged or when generating user-activity statistics. Moreover, project teams working on the AWS side of things very often have no access to the Azure AD application sign-in logs, so this way they can gain some insight.&lt;/p&gt;
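&lt;p&gt;As an example, a simple CloudWatch Logs Insights query against the trigger’s log group could look like the sketch below; the filter term depends entirely on what the function actually logs:&lt;/p&gt;

```
fields @timestamp, @message
| filter @message like /claims/
| sort @timestamp desc
| limit 20
```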

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c5hd3o07r7mi1nw6459.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8c5hd3o07r7mi1nw6459.png" alt="Image description" width="800" height="298"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS &amp;amp; Azure AD
&lt;/h3&gt;

&lt;p&gt;First of all, setting up an end-to-end authentication flow using Amazon Cognito and Azure AD is fairly simple. Obviously, there are some caveats here and there, plus things one must be aware of, as both are managed cloud services working on the basis of certain assumptions. Therefore, it is worth spending some time reading the relevant documentation in advance.&lt;/p&gt;

&lt;p&gt;The authentication mechanism configured as part of this story is just a beginning, though. Going further, there is logic one may want to implement with Lambda triggers, as well as the entire authorisation flow when integrating such a frontend solution with backend services. Regarding the latter, there are decisions to be made, e.g. what type of authoriser should be used when integrating with API Gateway, and even more around access scopes. Either way, having a token carrying the correct claims is a good start.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS CDK &amp;amp; CDKtf
&lt;/h3&gt;

&lt;p&gt;On the CDK side of things, I must admit both frameworks have turned out to be very appealing and promising! During my try-out, I certainly enjoyed the fact that things simply happen automatically, without me having to worry about the things I used to worry about when working with cloud infrastructure orchestrators natively. There were several issues I came across, but they were related either to the underlying orchestrators or to the cloud providers themselves; nothing that would make me want to ditch either of the CDKs.&lt;/p&gt;

&lt;p&gt;In terms of AWS CDK, I’m not a big fan of some of the concepts CloudFormation is based on, e.g. the fact you can’t delete a resource manually in the console and then rerun &lt;code&gt;cdk deploy&lt;/code&gt; to calculate what’s missing and reprovision that resource. Instead, you get a resource not found exception.&lt;/p&gt;

&lt;p&gt;In terms of CDKtf, having configured only five resources is probably not representative enough to draw big conclusions, however, it just worked and literally took me minutes to have a working deployment mechanism.&lt;/p&gt;

&lt;p&gt;Just like other frameworks, e.g. the Serverless Framework, CDKs can massively simplify developers' lives. With my AWS Security Specialty hat on, however, I want to emphasise the security aspect of infrastructure being provisioned as part of application development. I’ve already seen many times how insecure such infrastructure can become when it’s developed by individuals without enough security-in-the-cloud awareness, especially when networking is involved. Therefore, providing compliant constructs or modules is definitely something that should be considered by organisations and project teams who care about more than just application features.&lt;/p&gt;

&lt;p&gt;Finally, both CDKs require you to figure out how you want to go about environments and how you are going to provide environment-specific values when deploying stacks. It’s not as straightforward as with Terraform environment files, but once you get it right, you can’t go wrong with CDKs. Both frameworks can still be considered quite new, and I strongly believe that with time they will become even more robust.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>cdk</category>
      <category>cdktf</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
