<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rajesh Gunasekaran</title>
    <description>The latest articles on Forem by Rajesh Gunasekaran (@rgunasekaran).</description>
    <link>https://forem.com/rgunasekaran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2167227%2F4bc48c7a-09bb-4285-b3eb-6da95b4286e7.jpg</url>
      <title>Forem: Rajesh Gunasekaran</title>
      <link>https://forem.com/rgunasekaran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rgunasekaran"/>
    <language>en</language>
    <item>
      <title>Deploying Amazon MSK at Scale: A Platform Engineer's Journey at Wehkamp</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Mon, 27 Oct 2025 14:14:43 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/migrating-kakfa-to-amazon-msk-43bn</link>
      <guid>https://forem.com/rgunasekaran/migrating-kakfa-to-amazon-msk-43bn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;At Wehkamp, we embarked on a journey to modernize our messaging infrastructure by migrating from self-managed Kafka on EC2 to Amazon MSK (Managed Streaming for Apache Kafka). This post shares real-world experiences from the platform team perspective - the challenges we faced, lessons learned, and how we built a scalable, multi-account MSK platform using Infrastructure as Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background: Why MSK?
&lt;/h2&gt;

&lt;p&gt;Wehkamp's legacy Kafka infrastructure (established around 2014-2015) ran on EC2 instances. While it served us well initially, we faced several challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling complexity: Manual broker management across multiple business units&lt;/li&gt;
&lt;li&gt;Operational overhead: Patching, ZooKeeper management, monitoring setup&lt;/li&gt;
&lt;li&gt;Stability concerns: Resource contention and performance issues&lt;/li&gt;
&lt;li&gt;Multi-account sprawl: Difficult to maintain consistency across environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zlssys7txvxaul6jsx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zlssys7txvxaul6jsx4.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Our Migration Goals
&lt;/h2&gt;

&lt;p&gt;We set out to achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce operational burden on the platform team&lt;/li&gt;
&lt;li&gt;Improve reliability with AWS-managed infrastructure&lt;/li&gt;
&lt;li&gt;Standardize Kafka deployments across all AWS accounts&lt;/li&gt;
&lt;li&gt;Enable faster scaling and easier maintenance&lt;/li&gt;
&lt;li&gt;Free up engineering time for higher-value work&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Infrastructure as Code: Multi-Account MSK with Terraform
&lt;/h2&gt;

&lt;p&gt;At Wehkamp, we managed MSK clusters across multiple AWS accounts representing different business units and environments (dev, staging, production).&lt;/p&gt;

&lt;p&gt;Here's how we structured our Terraform setup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository Structure&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform/
├── modules/
│   └── msk-cluster/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       └── broker-config.tf
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   └── ...
│   ├── bu2-prod/
│   │   └── ...
│   └── bu2-dev/
│       └── ...
└── aws-config/
    └── credentials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  The MSK Module Pattern
&lt;/h2&gt;

&lt;p&gt;We created a reusable MSK module that handled all the complexity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # modules/msk-cluster/main.tf
  resource "aws_msk_cluster" "main" {
    cluster_name           = var.cluster_name
    kafka_version          = var.kafka_version
    number_of_broker_nodes = var.broker_count

    broker_node_group_info {
      instance_type   = var.instance_type
      client_subnets  = var.subnet_ids
      security_groups = [aws_security_group.msk.id]

      storage_info {
        ebs_storage_info {
          volume_size = var.storage_size_gb
        }
      }
    }

    configuration_info {
      arn      = aws_msk_configuration.main.arn
      revision = aws_msk_configuration.main.latest_revision
    }

    encryption_info {
      encryption_in_transit {
        client_broker = "TLS"
        in_cluster    = true
      }
    }

    tags = merge(var.common_tags, {
      Environment = var.environment
      ManagedBy   = "Terraform"
    })
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
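&lt;p&gt;The inputs referenced above live in modules/msk-cluster/variables.tf. A minimal sketch of that file (the defaults shown are illustrative, not our production values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # modules/msk-cluster/variables.tf (illustrative sketch)
  variable "cluster_name" {
    type = string
  }

  variable "kafka_version" {
    type    = string
    default = "2.8.1"
  }

  variable "broker_count" {
    type    = number
    default = 3
  }

  variable "instance_type" {
    type    = string
    default = "kafka.m5.large"
  }

  variable "storage_size_gb" {
    type    = number
    default = 1000
  }

  variable "subnet_ids" {
    type = list(string)
  }

  variable "environment" {
    type = string
  }

  variable "common_tags" {
    type    = map(string)
    default = {}
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;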



&lt;h2&gt;
  
  
  Calling the Module Per Account
&lt;/h2&gt;

&lt;p&gt;Each AWS account directory called the module with environment-specific values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # accounts/bu1-prod/msk.tf
  module "msk_cluster" {
    source = "../../modules/msk-cluster"

    cluster_name    = "bu1-prod-msk"
    kafka_version   = "2.8.1"
    broker_count    = 3
    instance_type   = "kafka.m5.large"
    storage_size_gb = 1000
    subnet_ids      = data.aws_subnets.private.ids

    environment = "production"

    common_tags = {
      BusinessUnit = "BU1"
      CostCenter   = "engineering"
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Multi-Account Authentication
&lt;/h2&gt;

&lt;p&gt;Each AWS account directory contained its own variable files and configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform/
├── accounts/
│   ├── bu1-prod/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars  # Variables for this account
│   │   └── msk.tf
│   ├── bu1-dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars  # Variables for this account
│   │   └── msk.tf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We used AWS credential profiles, and Terraform automatically picked up the tfvars file in each directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # Deploying to specific account
  cd terraform/accounts/bu1-prod
  terraform init
  terraform plan
  terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform automatically used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The terraform.tfvars file in the current directory&lt;/li&gt;
&lt;li&gt;The appropriate AWS credential profile configured in aws-config&lt;/li&gt;
&lt;li&gt;Module references pointing to ../../modules/msk-cluster&lt;/li&gt;
&lt;/ul&gt;
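&lt;p&gt;Concretely, a per-account setup might look like this (the profile and variable names below are illustrative, not our exact configuration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # accounts/bu1-prod/terraform.tfvars (illustrative)
  aws_profile = "bu1-prod"
  aws_region  = "eu-west-1"

  # accounts/bu1-prod/main.tf
  provider "aws" {
    region  = var.aws_region
    profile = var.aws_profile  # resolved from aws-config/credentials
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;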

&lt;h2&gt;
  
  
  The Initial Setup
&lt;/h2&gt;

&lt;p&gt;When provisioning our MSK clusters, we followed the &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html" rel="noopener noreferrer"&gt;Amazon MSK best practices guide&lt;/a&gt;. We calculated our requirements based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected throughput (MB/s)&lt;/li&gt;
&lt;li&gt;Number of partitions&lt;/li&gt;
&lt;li&gt;Retention policies&lt;/li&gt;
&lt;li&gt;Client connections&lt;/li&gt;
&lt;/ul&gt;
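&lt;p&gt;As a worked example of this kind of sizing (the figures below are illustrative, not our actual numbers), per-broker storage follows from throughput, retention, and replication:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingress      = 1.5 MB/s sustained
retention    = 7 days
replication  = 3
brokers      = 3

raw data     = 1.5 MB/s * 86,400 s/day * 7 days  ≈ 907 GB
replicated   = 907 GB * 3                        ≈ 2.7 TB cluster-wide
per broker   = 2.7 TB / 3 brokers                ≈ 0.9 TB
→ provision ~1,000 GB per broker, leaving modest headroom
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;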

&lt;p&gt;Based on these calculations, we initially deployed with kafka.m5.large instances.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem: Capacity Exceeded
&lt;/h2&gt;

&lt;p&gt;Shortly after production deployment, we hit a critical issue: our partition count (driven by organic topic growth) exceeded what the instance type could accommodate. MSK has recommended per-broker partition limits for each instance type:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Instance Type&lt;/th&gt;&lt;th&gt;Max Partitions per Broker&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;kafka.m5.large&lt;/td&gt;&lt;td&gt;~1000 partitions&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;kafka.m5.xlarge&lt;/td&gt;&lt;td&gt;~2000 partitions&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;kafka.m5.2xlarge&lt;/td&gt;&lt;td&gt;~4000 partitions&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;With multiple business units creating topics organically, we quickly exceeded capacity, causing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Broker CPU spikes&lt;/li&gt;
&lt;li&gt;Increased replication lag&lt;/li&gt;
&lt;li&gt;Connection timeouts&lt;/li&gt;
&lt;/ul&gt;
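&lt;p&gt;A quick way to see how close a cluster is to these limits is to count partitions directly (the bootstrap address and cluster name below are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Each partition prints one "Leader:" line in --describe output,
# so this counts total partitions across all topics
kafka-topics.sh --bootstrap-server $BOOTSTRAP_BROKERS \
  --describe | grep -c "Leader:"

# Alternatively, query the per-broker PartitionCount metric MSK
# publishes to CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kafka --metric-name PartitionCount \
  --dimensions Name="Cluster Name",Value=bu1-prod-msk Name="Broker ID",Value=1 \
  --start-time 2025-10-27T00:00:00Z --end-time 2025-10-27T01:00:00Z \
  --period 300 --statistics Average
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;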
&lt;h2&gt;
  
  
  Solution: Upgrade to kafka.m5.2xlarge
&lt;/h2&gt;

&lt;p&gt;We upgraded the instance type via Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  module "msk_cluster" {
    source = "../../modules/msk-cluster"

    # Changed from kafka.m5.large
    instance_type = "kafka.m5.2xlarge"

    # ... other config
  }


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform apply
# MSK performs a rolling upgrade, no downtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The upgrade worked smoothly - MSK performed a rolling update with zero downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  New Problem: Over-Provisioning &amp;amp; Cost
&lt;/h2&gt;

&lt;p&gt;A few weeks later, our AWS cost reports flagged a significant increase. We had over-provisioned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual usage: ~1,500 partitions&lt;/li&gt;
&lt;li&gt;Provisioned capacity: kafka.m5.2xlarge (4,000 partitions)&lt;/li&gt;
&lt;li&gt;Cost impact: 2x more expensive than needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wanted to downgrade to kafka.m5.xlarge (the Goldilocks size), but discovered a critical MSK limitation:&lt;/p&gt;

&lt;p&gt;⚠️ MSK does NOT support downgrading instance types through the console or API.&lt;/p&gt;

&lt;p&gt;You can only upgrade, never downgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workaround: AWS Support Case
&lt;/h2&gt;

&lt;p&gt;We raised an AWS Support case requesting manual downgrade assistance. Here's what we learned:&lt;/p&gt;

&lt;p&gt;AWS Support Options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recommended approach: Create new cluster with correct size, migrate topics

&lt;ul&gt;
&lt;li&gt;Pros: Clean slate, proper sizing&lt;/li&gt;
&lt;li&gt;Cons: Complex migration, client reconfiguration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Manual intervention (what we did): AWS engineers performed backend downgrade

&lt;ul&gt;
&lt;li&gt;Pros: Faster, no client changes&lt;/li&gt;
&lt;li&gt;Cons: Requires support case, not guaranteed for all scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AWS support team successfully downgraded our cluster, but the process took 48-72 hours and required careful planning during a maintenance window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Start conservatively, but not too conservatively

&lt;ul&gt;
&lt;li&gt;Use monitoring data to right-size within first 30 days&lt;/li&gt;
&lt;li&gt;Build in 30-40% headroom for growth&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Monitor partition count actively

&lt;ul&gt;
&lt;li&gt;Set CloudWatch alarms for partition count thresholds&lt;/li&gt;
&lt;li&gt;Implement topic creation governance (approval process)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;MSK instance type changes are one-way

&lt;ul&gt;
&lt;li&gt;You can upgrade easily, but downgrades require AWS support&lt;/li&gt;
&lt;li&gt;Plan sizing carefully to avoid cost optimization headaches&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Consider kafka.m5.xlarge as a default starting point

&lt;ul&gt;
&lt;li&gt;Good balance of capacity and cost&lt;/li&gt;
&lt;li&gt;Enough headroom for most workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Implement topic quotas

&lt;ul&gt;
&lt;li&gt;Prevent unbounded topic/partition growth&lt;/li&gt;
&lt;li&gt;Use Kafka quotas or approval workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;p&gt;After this experience, we implemented partition count monitoring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  # CloudWatch alarm for partition count
  resource "aws_cloudwatch_metric_alarm" "partition_count" {
    alarm_name          = "msk-partition-count-high"
    comparison_operator = "GreaterThanThreshold"
    evaluation_periods  = 2
    metric_name         = "PartitionCount"
    namespace           = "AWS/Kafka"
    period              = 300
    statistic           = "Average"
    threshold           = 1500  # 75% of kafka.m5.xlarge capacity
    alarm_description   = "MSK partition count approaching instance type limit"

    dimensions = {
      "Cluster Name" = aws_msk_cluster.main.cluster_name
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Estimated cost savings: ~$500/month per cluster by right-sizing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Operational Challenge #2: Production Incident - MSK Disk Space Crisis
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Our Monitoring Setup
&lt;/h2&gt;

&lt;p&gt;At Wehkamp, we implemented a tiered alerting strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Informational alerts → Slack channel #aws-platform-alerts&lt;/li&gt;
&lt;li&gt;Critical alerts → Immediate action required&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Incident Timeline
&lt;/h2&gt;

</description>
      <category>architecture</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Upgrading EKS from v1.31 to v1.32</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Tue, 30 Sep 2025 03:46:34 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/upgrading-eks-from-v131-to-v132-dmf</link>
      <guid>https://forem.com/rgunasekaran/upgrading-eks-from-v131-to-v132-dmf</guid>
      <description></description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>tutorial</category>
      <category>devops</category>
    </item>
    <item>
      <title>Production Ready Serverless</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Sun, 31 Aug 2025 17:26:48 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/production-ready-serverless-576</link>
      <guid>https://forem.com/rgunasekaran/production-ready-serverless-576</guid>
      <description></description>
      <category>serverless</category>
      <category>production</category>
      <category>cloud</category>
      <category>deployment</category>
    </item>
    <item>
      <title>Deploying a Sample Retail Store App on Amazon EKS – Step-by-Step Guide</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Wed, 30 Jul 2025 20:56:17 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/deploying-a-sample-retail-store-app-on-amazon-eks-step-by-step-guide-484p</link>
      <guid>https://forem.com/rgunasekaran/deploying-a-sample-retail-store-app-on-amazon-eks-step-by-step-guide-484p</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this workshop, we will walk through deploying a sample retail store application on Amazon EKS (Elastic Kubernetes Service) using a combination of CloudFormation and Terraform. This guide is divided into logical stages, from setting up foundational infrastructure to running your app on Spot and Graviton instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set Up AWS Infrastructure with CloudFormation
&lt;/h2&gt;

&lt;p&gt;We’ll start by using AWS CloudFormation to provision the foundational network infrastructure (VPC, subnets, Internet Gateway, etc.) that EKS requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to do:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a CloudFormation stack using a predefined template (YAML/JSON)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Include resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC&lt;/li&gt;
&lt;li&gt;Public and Private Subnets&lt;/li&gt;
&lt;li&gt;NAT Gateways&lt;/li&gt;
&lt;li&gt;Route Tables&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
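&lt;p&gt;A trimmed-down sketch of such a template (CIDR ranges and logical names here are placeholders; the real workshop template is far more complete):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: "2010-09-09"
Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsSupport: true
      EnableDnsHostnames: true
  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      CidrBlock: 10.0.0.0/24
      AvailabilityZone: !Select [0, !GetAZs ""]
      MapPublicIpOnLaunch: true
  InternetGateway:
    Type: AWS::EC2::InternetGateway
  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VPC
      InternetGatewayId: !Ref InternetGateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;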

&lt;p&gt;&lt;em&gt;Launching the CloudFormation stack using a predefined template to set up a code-server IDE for the EKS workshop.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytv7iqsa6dp39epupmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwytv7iqsa6dp39epupmq.png" alt="Create a Cloudformation stack named eks-workshop-ide" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Reviewing permissions and acknowledging IAM capabilities before creating the eks-workshop-ide CloudFormation stack.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyufqikb0mfkspzj9poq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyufqikb0mfkspzj9poq2.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CloudFormation stack eks-workshop-ide created successfully with status CREATE_COMPLETE.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2on9t4tky5jq9xw1dr3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2on9t4tky5jq9xw1dr3m.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The CloudFormation stack will take roughly 5 minutes to deploy, and once completed you can retrieve information required to continue from the Outputs tab:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecmh415zie6nu5wso9nq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecmh415zie6nu5wso9nq.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IdeUrl output contains the URL to enter in your browser to access the IDE. The IdePasswordSecret output contains a link to an AWS Secrets Manager secret that holds a generated password for the IDE.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To retrieve the password, open that URL and click the Retrieve button; the password will then be available for you to copy:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2s2f24ku6q7cjmant06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2s2f24ku6q7cjmant06.png" alt=" " width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Open the IDE URL provided and you will be prompted for the password:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvz0xajrx75v2crdivf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnvz0xajrx75v2crdivf0.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;After submitting your password you will be presented with the initial VSCode screen&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faokbhlpp129hegaay8pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faokbhlpp129hegaay8pa.png" alt=" " width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Prepare Terraform Configuration
&lt;/h2&gt;

&lt;p&gt;With the CloudFormation networking layer in place, it’s time to define the rest of our infrastructure using Terraform. This is where we set up the EKS cluster, node groups, and other supporting components.&lt;/p&gt;

&lt;p&gt;For the given configuration, Terraform will create the workshop environment with the following:&lt;/p&gt;
&lt;h2&gt;
  
  
  Resources created by Terraform:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A VPC across three Availability Zones&lt;/li&gt;
&lt;li&gt;An Amazon EKS Cluster&lt;/li&gt;
&lt;li&gt;An IAM OIDC Provider for enabling IAM roles for service accounts&lt;/li&gt;
&lt;li&gt;A Managed Node Group named default for general workloads&lt;/li&gt;
&lt;li&gt;Configuration of the Amazon VPC CNI plugin to use prefix delegation (for improved IP management and scaling)&lt;/li&gt;
&lt;/ul&gt;
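&lt;p&gt;The heart of that configuration is the EKS cluster definition. A condensed sketch using the community terraform-aws-modules/eks module (cluster name, version, and node-group sizes are illustrative, and the workshop's actual files differ in detail):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~&gt; 20.0"

  cluster_name    = "eks-workshop"
  cluster_version = "1.31"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true  # creates the IAM OIDC provider for service accounts

  eks_managed_node_groups = {
    default = {
      instance_types = ["m5.large"]
      min_size       = 2
      desired_size   = 3
      max_size       = 6
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;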

&lt;p&gt;&lt;em&gt;Terraform initial setup, including downloading necessary Terraform configuration files for the EKS workshop.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra27qp6xbou67tal65gj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra27qp6xbou67tal65gj.png" alt=" " width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Terraform init in action&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27yaofr0nj4ceim4vcgd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27yaofr0nj4ceim4vcgd.png" alt=" " width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Terraform successfully deploys the EKS cluster&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza6wiusye8odqm7wok1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fza6wiusye8odqm7wok1f.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;EKS cluster Up &amp;amp; running in AWS console&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpsncwqxcgvxrj44zx9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpsncwqxcgvxrj44zx9m.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Node Group config within EKS cluster&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h8b3xgga552f0cqko7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1h8b3xgga552f0cqko7b.png" alt=" " width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Set Up/Prepare the EKS Cluster environment
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Initializing the EKS workshop environment&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusyhvufexobrlivgh8zg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusyhvufexobrlivgh8zg.png" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verify the EKS cluster nodes, namespaces using kubectl command&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8lvca0e1qmg5jsmiurg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe8lvca0e1qmg5jsmiurg.png" alt=" " width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;
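&lt;p&gt;The verification shown above boils down to a few commands (the cluster name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Point kubectl at the new cluster
aws eks update-kubeconfig --name eks-workshop --region $AWS_REGION

# Nodes should report Ready, and the system namespaces should exist
kubectl get nodes
kubectl get namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;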
&lt;h2&gt;
  
  
  Step 4: Deploy the Sample Retail Store App
&lt;/h2&gt;

&lt;p&gt;In this workshop, we'll deploy the sample retail application efficiently using the power of Kustomize. The following kustomization file shows how you can reference other kustomizations and deploy multiple components together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rabbitmq
  - catalog
  - carts
  - checkout
  - assets
  - orders
  - ui
  - other
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
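&lt;p&gt;Applying the kustomization above deploys every referenced component in one step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From the directory containing kustomization.yaml
kubectl apply -k .

# Watch the retail store pods come up across namespaces
kubectl get pods -A --watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;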



</description>
    </item>
    <item>
      <title>What is Infrastructure as Code (IaC) and What are the benefits of using it ?</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Sat, 19 Jul 2025 09:48:27 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/what-is-infrastructure-as-code-iac-and-what-are-the-benefits-of-using-it--38gk</link>
      <guid>https://forem.com/rgunasekaran/what-is-infrastructure-as-code-iac-and-what-are-the-benefits-of-using-it--38gk</guid>
      <description>&lt;h2&gt;
  
  
  What is Infrastructure as Code (IaC) ?
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code (IaC) is a declarative approach to provisioning and managing infrastructure using tools such as Terraform, CloudFormation, Ansible, and others.&lt;/p&gt;

&lt;p&gt;With IaC, there’s no need to manually log in to cloud provider consoles like AWS, Azure, or GCP to create infrastructure resources.&lt;/p&gt;

&lt;p&gt;The core idea of IaC is that you define your infrastructure in code—allowing you to create, update, and destroy resources programmatically.&lt;/p&gt;

&lt;p&gt;By using IaC across both on-premises and cloud environments, organizations can deliver dynamic, scalable infrastructure to internal teams and ensure a seamless experience for customers.&lt;/p&gt;

&lt;p&gt;IaC has become a foundational practice in DevOps and cloud-native engineering, empowering teams to build scalable, consistent, and reproducible environments with confidence.&lt;/p&gt;
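&lt;p&gt;To make the idea concrete, here is a minimal Terraform example (the bucket name is a placeholder): you declare the desired resource, and the tool creates, updates, or destroys infrastructure to match:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket"

  tags = {
    ManagedBy = "Terraform"
  }
}

# terraform apply   → creates the bucket
# terraform apply   → after editing tags, updates it in place
# terraform destroy → removes it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;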

&lt;h2&gt;
  
  
  Benefits of IaC:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Consistency: Prevent configuration drift by ensuring the same code provisions the same infrastructure every time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Version Control: Use Git to track changes, collaborate across teams, and roll back when needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automation: Integrate with CI/CD pipelines to automatically deploy infrastructure without manual intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Speed: Launch and update environments faster and more reliably than through manual processes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reuse: Encapsulate infrastructure into reusable modules, enabling teams to build on top of proven, well-documented components.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation: With Terraform and other IaC tools, infrastructure definitions live in code repositories—making them easy to read, manage, and understand.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  IaC Tools:
&lt;/h2&gt;

&lt;p&gt;Below are some of the most widely adopted Infrastructure as Code tools used globally. These tools help automate infrastructure deployment on both private and public cloud platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;HashiCorp Terraform - terraform.io&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS CloudFormation - aws.amazon.com/cloudformation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure Resource Manager (ARM) - azure.microsoft.com&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google Cloud Deployment Manager - cloud.google.com/deployment-manager/docs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pulumi - pulumi.com&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;I hope this gave you a clear understanding of what Infrastructure as Code (IaC) is and the key benefits it brings. Whether you’re managing on-premises systems or cloud environments, IaC provides a reliable, scalable, and efficient way to provision and maintain infrastructure. Embracing IaC is a strong step toward automation, collaboration, and modern DevOps practices.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Securing Terraform Automation: Atlantis IAM Design and Implementation in AWS</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Mon, 30 Jun 2025 10:03:57 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/securing-terraform-automation-atlantis-iam-design-and-implementation-in-aws-3jg9</link>
      <guid>https://forem.com/rgunasekaran/securing-terraform-automation-atlantis-iam-design-and-implementation-in-aws-3jg9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code (IaC) tools like Terraform enable organizations to manage cloud resources efficiently. However, managing permissions for Terraform operations is crucial to maintaining security and compliance. This article explores how to design and implement IAM roles and policies for Atlantis in an Amazon EKS environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Atlantis is an open-source tool that automates Terraform workflows. It allows teams to collaborate on infrastructure changes via pull requests (PRs). Since Atlantis executes Terraform commands on behalf of users, it requires appropriate AWS IAM permissions to perform its tasks securely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges in IAM Design for Atlantis
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Principle of Least Privilege (PoLP): Atlantis should have only the minimum permissions necessary to apply infrastructure changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Managing Multi-Account Access: Organizations often deploy infrastructure across multiple AWS accounts, requiring cross-account access management.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Securely Storing AWS Credentials: Storing and managing AWS credentials securely is essential to prevent unauthorized access.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Audit and Compliance: Tracking who initiated infrastructure changes and ensuring compliance with security policies is a key challenge. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Using IAM Roles Instead of Static Credentials
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Instead of using AWS access keys, Atlantis should assume an IAM role with specific permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting Up IAM Roles for Atlantis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create an IAM Role for Atlantis: Define an IAM role with trust policies allowing Atlantis (running in Amazon EKS) to assume it.&lt;/li&gt;
&lt;li&gt;Attach Least Privilege Policies: Assign policies granting only the necessary permissions for Terraform actions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enabling Cross-Account Access
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create IAM roles in each AWS account that Atlantis needs to manage.&lt;/li&gt;
&lt;li&gt;Update trust policies to allow Atlantis to assume these roles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storing and Using IAM Credentials Securely
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use IAM Roles for Service Accounts (IRSA) to provide Atlantis with temporary credentials.&lt;/li&gt;
&lt;li&gt;Avoid storing long-term AWS credentials in configuration files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementing Logging and Auditing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enable AWS CloudTrail to track Atlantis’s API calls.&lt;/li&gt;
&lt;li&gt;Use AWS IAM Access Analyzer to review granted permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create an IAM Role for Atlantis
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;aws iam create-role --role-name AtlantisRole --assume-role-policy-document file://trust-policy.json&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The trust policy (trust-policy.json) should allow EKS to assume the role.&lt;/li&gt;
&lt;/ul&gt;
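
&lt;p&gt;As a rough sketch, an IRSA-style trust policy might look like the following. The account ID, region, OIDC provider ID, and the atlantis namespace and service account name are placeholders for your own environment:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDC_PROVIDER_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.REGION.amazonaws.com/id/OIDC_PROVIDER_ID:sub": "system:serviceaccount:atlantis:atlantis"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Scoping the condition to a single service account subject ensures that only the Atlantis pod, and not every workload in the cluster, can assume the role.&lt;/p&gt;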

&lt;h3&gt;
  
  
  Step 2: Attach Required IAM Policies
&lt;/h3&gt;

&lt;p&gt;Attach a policy that grants only the permissions Terraform needs. Note that TerraformApplyPolicy is not an AWS managed policy; you create it as a customer managed policy in your own account (ACCOUNT_ID below is your AWS account ID):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws iam attach-role-policy --role-name AtlantisRole --policy-arn arn:aws:iam::ACCOUNT_ID:policy/TerraformApplyPolicy&lt;/code&gt;&lt;/p&gt;
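
&lt;p&gt;The exact contents of the policy depend on what your Terraform code manages. A minimal sketch, assuming remote state in S3 with DynamoDB locking (the bucket and table names here are hypothetical), might start from:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StateBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-terraform-state"
    },
    {
      "Sid": "StateObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-terraform-state/*"
    },
    {
      "Sid": "LockTable",
      "Effect": "Allow",
      "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"],
      "Resource": "arn:aws:dynamodb:*:ACCOUNT_ID:table/terraform-locks"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On top of this baseline, add statements only for the specific services your Terraform modules actually create and modify, rather than broad wildcard grants.&lt;/p&gt;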

&lt;h3&gt;
  
  
  Step 3: Enable IRSA for Atlantis in EKS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a Kubernetes service account with IAM role annotations.&lt;/li&gt;
&lt;li&gt;Update the Atlantis deployment to use the service account.&lt;/li&gt;
&lt;/ul&gt;
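
&lt;p&gt;A minimal service account manifest, assuming the role from Step 1 and an atlantis namespace, might look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ServiceAccount
metadata:
  name: atlantis
  namespace: atlantis
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/AtlantisRole
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this annotation in place, EKS injects temporary credentials into the Atlantis pod automatically, so no access keys ever appear in its configuration.&lt;/p&gt;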

&lt;h3&gt;
  
  
  Step 4: Configure Cross-Account Role Assumption
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create a role in each target AWS account.&lt;/li&gt;
&lt;li&gt;Update the trust policy to allow AtlantisRole to assume it.&lt;/li&gt;
&lt;/ul&gt;
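
&lt;p&gt;In each target account, the trust policy of the local role would reference AtlantisRole in the account where Atlantis runs (TOOLING_ACCOUNT_ID is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::TOOLING_ACCOUNT_ID:role/AtlantisRole"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;AtlantisRole itself also needs an sts:AssumeRole permission on these target role ARNs; both sides of the trust relationship must agree before cross-account assumption succeeds.&lt;/p&gt;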

&lt;h3&gt;
  
  
  Step 5: Verify and Monitor Access
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Test Atlantis’s ability to assume roles and apply Terraform changes.&lt;/li&gt;
&lt;li&gt;Monitor API calls using AWS CloudTrail and IAM Access Analyzer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By designing a secure IAM strategy for Atlantis, organizations can ensure Terraform automation runs safely while adhering to security best practices. This structured approach balances security with operational efficiency, enabling teams to manage infrastructure confidently.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Step-by-Step Process of Atlantis Executing Terraform</title>
      <dc:creator>Rajesh Gunasekaran</dc:creator>
      <pubDate>Sat, 11 Jan 2025 07:11:39 +0000</pubDate>
      <link>https://forem.com/rgunasekaran/streamlining-terraform-workflows-with-atlantis-on-amazon-eks-227o</link>
      <guid>https://forem.com/rgunasekaran/streamlining-terraform-workflows-with-atlantis-on-amazon-eks-227o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Atlantis is an essential tool for automating Terraform workflows. It provides a GitOps-style approach where Terraform plans and applies are triggered by pull requests (PRs). This step-by-step guide details how Atlantis processes Terraform changes when a developer submits a PR.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscmrlok3rc3svgahznn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fscmrlok3rc3svgahznn7.png" alt="Atlantis automating Terraform workflows on Amazon EKS" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before implementing Atlantis, ensure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An Amazon EKS cluster with Atlantis deployed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A properly configured GitHub repository&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS IAM roles and permissions set up for Terraform execution&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Backend configuration for storing Terraform state (e.g., AWS S3 and DynamoDB)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
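
&lt;p&gt;For the last prerequisite, a typical backend block, with hypothetical bucket, key, and table names, looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A remote backend with locking matters here because Atlantis may run plans for several PRs concurrently; the DynamoDB lock table prevents two applies from racing on the same state.&lt;/p&gt;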

&lt;h2&gt;
  
  
  Step 1: Developer Submits a PR with Terraform Changes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A developer modifies Terraform configuration files and pushes the changes to a feature branch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A pull request (PR) is created against the main branch in the GitHub repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Atlantis automatically detects the PR and adds a comment indicating that a Terraform plan is in progress.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
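
&lt;p&gt;Which directories Atlantis plans automatically can be controlled with a repo-level atlantis.yaml; a minimal example, with a hypothetical project path, is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;version: 3
projects:
  - name: prod
    dir: terraform/prod
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
      enabled: true
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this in place, only PRs that touch matching files in that directory trigger a plan, keeping noise down in repositories that mix Terraform with other code.&lt;/p&gt;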

&lt;h2&gt;
  
  
  Step 2: Atlantis Runs terraform plan
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Atlantis checks out the PR branch inside the EKS pod.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It executes &lt;code&gt;terraform init&lt;/code&gt; to initialize the working directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It runs &lt;code&gt;terraform plan&lt;/code&gt; to generate an execution plan.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Atlantis posts the output of &lt;code&gt;terraform plan&lt;/code&gt; as a comment in the PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Developers review the plan and validate the proposed infrastructure changes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: Reviewer Approves the Plan
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the plan looks good, an authorized reviewer (or the developer themselves) comments &lt;code&gt;atlantis apply&lt;/code&gt; on the PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Atlantis detects the command and proceeds with applying the changes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 4: Atlantis Runs terraform apply
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Atlantis reinitializes the working directory and ensures the state is up to date.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It executes &lt;code&gt;terraform apply&lt;/code&gt; to make the infrastructure changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once completed, Atlantis updates the PR with the results of the apply command.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If successful, the infrastructure is updated, and Terraform state is stored in the configured backend.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 5: Merging the PR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;After a successful apply, the PR is ready to be merged into the main branch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The developer or reviewer merges the PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Atlantis automatically removes the workspace and cleans up temporary files related to the PR.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 6: Continuous Monitoring and Improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Regularly update Atlantis configurations and Terraform modules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement policies and checks to ensure compliance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use Atlantis logging and monitoring to troubleshoot any issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By following this structured approach, teams can streamline their Terraform workflows, enhance collaboration, and maintain infrastructure as code best practices using Atlantis on Amazon EKS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awscommunity</category>
      <category>awscommunitybuilder</category>
    </item>
  </channel>
</rss>
