<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shittu Sulaimon (Barry)</title>
    <description>The latest articles on Forem by Shittu Sulaimon (Barry) (@sadebare).</description>
    <link>https://forem.com/sadebare</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1337539%2F85f7f414-a6cb-4333-a371-c93b45a5a2c8.jpg</url>
      <title>Forem: Shittu Sulaimon (Barry)</title>
      <link>https://forem.com/sadebare</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sadebare"/>
    <language>en</language>
    <item>
      <title>You Pay 6x More When Your EKS Cluster Goes Out of Support</title>
      <dc:creator>Shittu Sulaimon (Barry)</dc:creator>
      <pubDate>Mon, 29 Dec 2025 18:25:04 +0000</pubDate>
      <link>https://forem.com/sadebare/you-pay-6x-more-when-your-eks-cluster-goes-out-of-support-289d</link>
      <guid>https://forem.com/sadebare/you-pay-6x-more-when-your-eks-cluster-goes-out-of-support-289d</guid>
      <description>&lt;p&gt;Many teams don’t realize this until it shows up on the AWS bill.&lt;/p&gt;

&lt;p&gt;Amazon EKS has &lt;strong&gt;two support modes&lt;/strong&gt; for Kubernetes versions: &lt;strong&gt;standard support&lt;/strong&gt; and &lt;strong&gt;extended support&lt;/strong&gt;. The difference between them is not just lifecycle; it is &lt;strong&gt;cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard vs Extended Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard support:&lt;/strong&gt; &lt;code&gt;$0.10 per hour&lt;/code&gt; per cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended support:&lt;/strong&gt; &lt;code&gt;$0.60 per hour&lt;/code&gt; per cluster (&lt;strong&gt;6× more&lt;/strong&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Extended support applies when a Kubernetes version goes out of standard support. At first, this increase may seem small, but if you manage multiple clusters across development, staging, and production, the cost adds up very quickly.&lt;/p&gt;
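&lt;p&gt;A quick back-of-the-envelope calculation makes the gap concrete. The cluster count below is an illustrative assumption; the rates are the ones listed above:&lt;/p&gt;

```shell
# Monthly cost for 3 clusters (dev, staging, prod) at ~730 hours/month,
# using the per-cluster rates above: standard $0.10/hr vs extended $0.60/hr.
echo "0.10 0.60" | awk '{printf "standard: $%.2f/mo  extended: $%.2f/mo\n", $1*3*730, $2*3*730}'
```

&lt;p&gt;That works out to roughly $219 versus $1,314 per month for the same three clusters, before any compute costs.&lt;/p&gt;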

&lt;h3&gt;
  
  
  Kubernetes Version Lifecycle in EKS
&lt;/h3&gt;

&lt;p&gt;Kubernetes releases a new minor version roughly every &lt;strong&gt;four months&lt;/strong&gt;. AWS EKS follows this model and supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;latest three minor versions&lt;/strong&gt; under standard support&lt;/li&gt;
&lt;li&gt;About &lt;strong&gt;14 months&lt;/strong&gt; of standard support per version&lt;/li&gt;
&lt;li&gt;An additional &lt;strong&gt;~12 months&lt;/strong&gt; of extended support at a higher cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Versions under standard support receive the latest security patches, API updates, and configuration improvements. Staying within these versions is strongly recommended.&lt;/p&gt;
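&lt;p&gt;To see how close a cluster is to leaving standard support, check which version it is running. A minimal sketch, assuming the AWS CLI is configured; the cluster name and region are placeholders:&lt;/p&gt;

```shell
# Print the Kubernetes minor version of a cluster (name/region are placeholders)
aws eks describe-cluster --name my-cluster --region us-east-1 \
  --query 'cluster.version' --output text
```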

&lt;h3&gt;
  
  
  What Happens If You Don’t Upgrade
&lt;/h3&gt;

&lt;p&gt;AWS notifies customers through the &lt;strong&gt;AWS Personal Health Dashboard&lt;/strong&gt; at least &lt;strong&gt;60 days&lt;/strong&gt; before a cluster enters extended support.&lt;/p&gt;

&lt;p&gt;If a cluster goes beyond extended support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS will automatically upgrade the &lt;strong&gt;control plane&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You must manually upgrade &lt;strong&gt;worker nodes and add-ons&lt;/strong&gt; such as CoreDNS, kube-proxy, VPC CNI, and CSI drivers&lt;/li&gt;
&lt;li&gt;Kubernetes &lt;strong&gt;version skew policies&lt;/strong&gt; apply: upgrades must happen one minor version at a time, and worker nodes must not lag too far behind the control plane&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ignoring this can lead to broken workloads or cluster instability.&lt;/p&gt;
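&lt;p&gt;In practice, the skew policy means a cluster that has fallen behind must be walked forward one minor version at a time. A hedged sketch, with the cluster name and versions as placeholders and assuming a configured AWS CLI:&lt;/p&gt;

```shell
# Going from 1.29 to 1.31 takes two upgrades; node groups and add-ons
# must be brought up to each version before moving on.
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.30
# ...wait for the update to finish, then upgrade node groups and add-ons to 1.30...
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.31
```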

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Keeping your EKS clusters within &lt;strong&gt;standard support&lt;/strong&gt; is not just a security and stability best practice; it is a &lt;strong&gt;cost-avoidance strategy&lt;/strong&gt;. Planning upgrades early helps you avoid surprise bills and keeps your platform healthy.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>finops</category>
    </item>
    <item>
      <title>EKS Disaster Recovery, Simplified: Native Backups with AWS Backup</title>
      <dc:creator>Shittu Sulaimon (Barry)</dc:creator>
      <pubDate>Sun, 21 Dec 2025 21:53:04 +0000</pubDate>
      <link>https://forem.com/sadebare/eks-disaster-recovery-simplified-native-backups-with-aws-backup-15g4</link>
      <guid>https://forem.com/sadebare/eks-disaster-recovery-simplified-native-backups-with-aws-backup-15g4</guid>
      <description>&lt;p&gt;For years, platform engineers have shared the same quiet nightmare: backing up EKS at scale. As clusters grow and teams stay lean, disaster recovery stops being optional and becomes mandatory. Until recently, this usually meant Velero pain, custom scripts, manually managed S3 buckets, and constant anxiety about whether your persistent volumes matched cluster state. It worked, but it was fragile, time-consuming, and easy to get wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Turning Point: November 10, 2025&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS closed a long-standing gap by introducing native Amazon EKS support in AWS Backup. This isn’t a minor feature drop; it’s a shift from DIY backup engineering to managed reliability.&lt;/p&gt;

&lt;p&gt;Here’s why this matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Native EKS Backup is a Game-Changer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Composite Recovery Points (the missing piece)&lt;/strong&gt;&lt;br&gt;
Previously, EKS backups were fragmented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster configs in one place&lt;/li&gt;
&lt;li&gt;EBS snapshots somewhere else&lt;/li&gt;
&lt;li&gt;Hope holding everything together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS Backup now captures &lt;strong&gt;cluster state + persistent storage (EBS, EFS, S3)&lt;/strong&gt; as a single, consistent recovery point. No more guessing if your data and manifests are in sync.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. One Pane of Glass&lt;/strong&gt;&lt;br&gt;
If you already use AWS Backup for EC2, RDS, or DynamoDB, EKS backups will feel familiar.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same workflows, policies, and visibility&lt;/li&gt;
&lt;li&gt;No extra controllers&lt;/li&gt;
&lt;li&gt;No per-cluster Velero babysitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Policy-Driven, Not Script-Driven&lt;/strong&gt;&lt;br&gt;
Instead of CronJobs inside your clusters, you define &lt;strong&gt;Backup Plans&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Back up every 6 hours. Retain for 30 days.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AWS handles scheduling, encryption, immutability, and lifecycle management automatically. This is what “set and forget” is supposed to look like.&lt;/p&gt;
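&lt;p&gt;As a sketch, that policy maps directly onto a Backup Plan. The plan, rule, and vault names below are placeholders, and the command assumes a configured AWS CLI:&lt;/p&gt;

```shell
# "Back up every 6 hours. Retain for 30 days." expressed as a Backup Plan
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "eks-every-6h",
  "Rules": [{
    "RuleName": "every-6h-keep-30d",
    "TargetBackupVaultName": "Default",
    "ScheduleExpression": "cron(0 */6 * * ? *)",
    "Lifecycle": {"DeleteAfterDays": 30}
  }]
}'
```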

&lt;p&gt;&lt;strong&gt;4. Restores Without the Stress&lt;/strong&gt;&lt;br&gt;
Restores no longer feel like a gamble. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restore an entire cluster&lt;/li&gt;
&lt;li&gt;Recover a single namespace&lt;/li&gt;
&lt;li&gt;Roll back individual persistent volumes&lt;/li&gt;
&lt;li&gt;Restore into a brand-new EKS cluster as part of the process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s real operational confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters Now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Native EKS backup is more than protection against accidental deletion. It provides a safety net for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster upgrades (e.g., 1.30 → 1.31)&lt;/li&gt;
&lt;li&gt;AMI rollouts that fail&lt;/li&gt;
&lt;li&gt;Security patches&lt;/li&gt;
&lt;li&gt;Kubernetes API changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production EKS, this feature quietly changes how teams sleep at night. AWS didn’t just add a backup option; they removed a category of operational stress.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Practical Guide: Enabling Native EKS Backups&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you already have an EKS cluster, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;AWS Backup&lt;/strong&gt; console, go to &lt;strong&gt;Settings&lt;/strong&gt;, then &lt;strong&gt;Configure Resource&lt;/strong&gt;, and include your EKS cluster as a protected resource.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faosiltx8cgl0ouy583yh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faosiltx8cgl0ouy583yh.png" alt="AWS Backup Configure Resource" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Go to &lt;strong&gt;Protected Resources&lt;/strong&gt;, then click &lt;strong&gt;Create On-Demand Backup&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26wf1c72u2qdl94yswzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26wf1c72u2qdl94yswzj.png" alt="Create On-Demand Backup" width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Create a &lt;strong&gt;custom IAM role&lt;/strong&gt; for backup, attaching:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWSBackupServiceRolePolicyForBackup&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWSBackupServiceRolePolicyForRestores&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example role: &lt;strong&gt;EKS-BACKUP-ROLE-EXAMPLE&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf3idq0m51msvo1idt3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf3idq0m51msvo1idt3m.png" alt="IAM Role" width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Start the backup. You can track its progress from the &lt;strong&gt;AWS Backup&lt;/strong&gt; console or the &lt;strong&gt;EKS&lt;/strong&gt; console.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb88uptsjk8298b39fsc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb88uptsjk8298b39fsc9.png" alt="Backup In Progress" width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Restoring Your EKS Cluster&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;In &lt;strong&gt;AWS Backup&lt;/strong&gt;, navigate to &lt;strong&gt;Protected Resources&lt;/strong&gt; and select the &lt;strong&gt;Resource ID&lt;/strong&gt; of the cluster. Choose the &lt;strong&gt;composite recovery point&lt;/strong&gt; and click &lt;strong&gt;Restore&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdauixjqk92f146ntwjzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdauixjqk92f146ntwjzf.png" alt="Restore Step" width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Configure restore options:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scope&lt;/strong&gt;: entire cluster or a namespace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination&lt;/strong&gt;: original cluster, existing cluster, or new cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this walkthrough, we restore into a new cluster to demonstrate full capabilities.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Select storage resources to include. AWS Backup supports EBS, EFS, and S3 storage for persistent data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Backup provisions the cluster and restores workloads based on your configuration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5f2v9ad64fcfq426wjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5f2v9ad64fcfq426wjk.png" alt="Restore in Progress" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This workflow doesn’t replace GitOps or careful upgrade strategies, but it provides a reliable safety net for runtime recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Considerations &amp;amp; Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even with native EKS backup, keep a few points in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all Kubernetes resources are restored exactly as is, especially external integrations&lt;/li&gt;
&lt;li&gt;Restore time depends on PV size and data footprint&lt;/li&gt;
&lt;li&gt;AWS Backup costs apply for snapshots, storage, and retention&lt;/li&gt;
&lt;li&gt;This complements GitOps, but doesn’t replace it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Native Amazon EKS support in AWS Backup removes much of the complexity that platform teams previously managed manually. It delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistent, policy-driven backups&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predictable restores&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No additional controllers or operational overhead&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production EKS environments, it significantly reduces the risk and stress associated with cluster-level failures while keeping operations simple and predictable. Platform teams finally have a &lt;strong&gt;set-and-forget safety net&lt;/strong&gt; for backups and restores.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/aws-backup/latest/devguide/whatisbackup.html" rel="noopener noreferrer"&gt;AWS Backup Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html" rel="noopener noreferrer"&gt;Amazon EKS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/integration-backup.html" rel="noopener noreferrer"&gt;AWS Blog: Native EKS Backup Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/secure-eks-clusters-with-the-new-support-for-amazon-eks-in-aws-backup/" rel="noopener noreferrer"&gt;AWS Blog: Secure EKS Clusters with AWS Backup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>news</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
    </item>
    <item>
      <title>Chaos Engineering in Kubernetes: 5 Real World Experiments to Try Today</title>
      <dc:creator>Shittu Sulaimon (Barry)</dc:creator>
      <pubDate>Thu, 08 May 2025 07:47:03 +0000</pubDate>
      <link>https://forem.com/sadebare/chaos-engineering-in-kubernetes-5-real-world-experiments-to-try-today-3p75</link>
      <guid>https://forem.com/sadebare/chaos-engineering-in-kubernetes-5-real-world-experiments-to-try-today-3p75</guid>
      <description>&lt;p&gt;In today’s fast paced digital world, distributed systems have become the backbone of modern applications. However, their complexity also makes them vulnerable to unpredictable failures. Chaos engineering provides a proactive approach to building resilience by intentionally injecting faults and observing how systems respond. This practice enables teams to uncover hidden weaknesses and prepare for real world disruptions before they escalate into critical incidents. It’s important to note that chaos engineering is different from traditional software testing. While testing verifies that a system works as expected under normal conditions, chaos engineering deliberately introduces failure to evaluate how resilient the system is under stress.&lt;/p&gt;

&lt;p&gt;In this article, I’ll be using Chaos Mesh, a Kubernetes-native chaos engineering tool, to demonstrate how faults can be injected into a Kubernetes ecosystem to improve its resilience. I’ll walk through its architecture and highlight common chaos experiments you can perform using Chaos Mesh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaos Mesh: Bringing Chaos Engineering to Kubernetes
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;Chaos Mesh&lt;/a&gt; is a CNCF open source project that implements chaos engineering concepts specifically for Kubernetes environments. It achieves this by injecting faults and abnormalities into a Kubernetes cluster or a physical node to analyze how workloads and the environment perform under different failure scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://chaos-mesh.org/" rel="noopener noreferrer"&gt;Chaos Mesh Architecture&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh leverages Kubernetes Custom Resource Definitions (CRDs) to perform chaos engineering in a Kubernetes environment. Different CRD types are used based on the specific fault being injected, and various controllers manage these CRD objects. Chaos Mesh consists of three primary components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Dashboard&lt;/strong&gt;: A major component that provides a user-friendly interface for visualizing and running chaos experiments. The dashboard leverages CRDs and enables users to select and induce chaos experiments into the system. Additionally, it supports Role-Based Access Control (RBAC) to grant users specific permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Controller Manager&lt;/strong&gt;: This component is responsible for scheduling and monitoring chaos experiments. It injects faults into the system through the Kubernetes API and monitors system responses. It includes different controllers such as the workflow controller, scheduler controller, and controllers for various fault types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Daemon&lt;/strong&gt;: The main execution component of Chaos Mesh. Running in &lt;em&gt;DaemonSet&lt;/em&gt; mode with privileged permissions (which can be disabled), Chaos Daemon interacts with network devices, file systems, and kernels by modifying the target Pod Namespace.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://medium.com/nerd-for-tech/chaos-engineering-in-kubernetes-using-chaos-mesh-431c1587ef0a" rel="noopener noreferrer"&gt;Chaos Mesh Architecture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l6vkekajh8zzipesdko.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3l6vkekajh8zzipesdko.webp" alt="Image description" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Chaos Mesh, we can perform different types of chaos simulations on both nodes and Kubernetes environments.&lt;/p&gt;
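&lt;p&gt;Because each fault type is a CRD, you can see which experiments a cluster supports once Chaos Mesh is installed:&lt;/p&gt;

```shell
# List the chaos experiment CRDs registered by Chaos Mesh
kubectl get crd -o name | grep chaos-mesh.org
```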

&lt;h4&gt;
  
  
  Chaos Experiments in Kubernetes
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod Fault Simulation&lt;/strong&gt;: This involves injecting pod crashes, deletions, or restarts using &lt;em&gt;PodChaos&lt;/em&gt; in Chaos Mesh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Fault Simulation&lt;/strong&gt;: This experiment simulates network outages within a cluster, packet drops, and bandwidth limitations between nodes. This is done using &lt;em&gt;NetworkChaos&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Stress Simulation&lt;/strong&gt;: This experiment stresses CPU, memory, or disk resources in the cluster. It is implemented using &lt;em&gt;StressChaos&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Fault Simulation&lt;/strong&gt;: This experiment introduces HTTP faults such as aborting or delaying HTTP connections, modifying HTTP request parameters, or altering response content. It is implemented using &lt;em&gt;HTTPChaos&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Additionally, several other chaos experiments can be performed in Kubernetes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Chaos Experiments on Physical Nodes
&lt;/h4&gt;

&lt;p&gt;For physical nodes, Chaos Mesh provides &lt;em&gt;Chaosd&lt;/em&gt;, a tool that enables experimentation with different failure scenarios, including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Process Fault Simulation&lt;/strong&gt;: Killing or stopping a process to observe its impact on the environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Pressure Simulation&lt;/strong&gt;: Applying stress to CPU, memory, and disk resources on each node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host Level Injection&lt;/strong&gt;: Shutting down or restarting a node within a cluster to simulate failures.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Instrumentation with Chaos Mesh Using Kubernetes
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh is a powerful chaos engineering tool designed for Kubernetes environments. It can be used across different Kubernetes setups, whether you're running on cloud platforms like &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html" rel="noopener noreferrer"&gt;EKS&lt;/a&gt;, &lt;a href="https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-zonal-cluster" rel="noopener noreferrer"&gt;GKE&lt;/a&gt;, or &lt;a href="https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-portal?tabs=azure-cli" rel="noopener noreferrer"&gt;AKS&lt;/a&gt;, or using local solutions like Minikube, kubeadm, or kind (&lt;a href="https://kubernetes.io/docs/tasks/tools/" rel="noopener noreferrer"&gt;local-setup-guide&lt;/a&gt;).&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Install Chaos Mesh
&lt;/h4&gt;

&lt;p&gt;Install via curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-sSL&lt;/span&gt; https://mirrors.chaos-mesh.org/v2.7.1/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;a href="https://chaos-mesh.org/docs/production-installation-using-helm/" rel="noopener noreferrer"&gt;helm (Recommended for production)&lt;/a&gt;&lt;/p&gt;
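&lt;p&gt;For reference, the Helm route looks roughly like this, pinned to the same version as the script above; depending on your container runtime you may need additional chart values, so treat this as a sketch and see the linked guide:&lt;/p&gt;

```shell
# Install Chaos Mesh via Helm into its own namespace
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-mesh
helm install chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --version 2.7.1
```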

&lt;h4&gt;
  
  
  Step 2: Verify the Installation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl get po &lt;span class="nt"&gt;-n&lt;/span&gt; chaos-mesh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that the core components (dashboard, controller manager, and chaos daemon) are running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojz00ocw4wdnmk484glc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojz00ocw4wdnmk484glc.png" alt="Image description" width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Access the Chaos Mesh Dashboard
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; chaos-mesh svc/chaos-dashboard 2333:2333
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbul75db4vwzkzqszaqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbul75db4vwzkzqszaqn.png" alt="Image description" width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4: Deploy a Sample Application for Chaos Experiments
&lt;/h4&gt;

&lt;p&gt;To conduct chaos experiments, we will deploy a &lt;a href="https://github.com/sadebare/chaos-mesh-experiments-on-kubernetes/tree/main/react-app" rel="noopener noreferrer"&gt;simple React application&lt;/a&gt; using a &lt;a href="https://github.com/sadebare/chaos-mesh-experiments-on-kubernetes/blob/main/deployment-manifest/app.yaml" rel="noopener noreferrer"&gt;Kubernetes manifest&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now we can access the application:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp9zezhh9kz79dw527fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp9zezhh9kz79dw527fq.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaos Experiments
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Experiment 1: PodChaos
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-pod-failure-experiment&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh&lt;/span&gt; 
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-kill&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt; 
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-experiment&lt;/span&gt; 
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;react-app'&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;30s'&lt;/span&gt;
  &lt;span class="na"&gt;gracePeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Simulates a scenario where all matching pods are forcefully killed. This helps test the resilience and auto-recovery behavior of deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All pods with the label &lt;code&gt;app: react-app&lt;/code&gt; in the &lt;code&gt;chaos-experiment&lt;/code&gt; namespace were terminated and restarted.&lt;/li&gt;
&lt;li&gt;The application was unavailable for about 30 seconds.&lt;/li&gt;
&lt;/ul&gt;
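&lt;p&gt;Assuming the manifest above is saved as &lt;code&gt;pod-chaos.yaml&lt;/code&gt;, you can apply it and watch the recovery happen:&lt;/p&gt;

```shell
kubectl apply -f pod-chaos.yaml
# Watch the react-app pods terminate and get recreated by the Deployment
kubectl get pods -n chaos-experiment -w
```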

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyxtp4vtxp51nk1oqwjv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxyxtp4vtxp51nk1oqwjv.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthqu3m6y7w6qirurkn5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fthqu3m6y7w6qirurkn5y.png" alt="Image description" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 2: HTTPChaos
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-http-failure-experiment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-experiment&lt;/span&gt; 
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;react-app&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Request&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
  &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GET&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
  &lt;span class="na"&gt;abort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Intercepts and aborts incoming HTTP GET requests to simulate upstream service failures or gateway crashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All HTTP GET requests to / on port 80 were blocked for 10 minutes, making the app appear down.&lt;/li&gt;
&lt;/ul&gt;
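&lt;p&gt;To make the failure mode concrete, here is a minimal local Python sketch of what clients see when GET requests are aborted. This does not use Chaos Mesh; the toy server simply drops each connection without responding, which is roughly how an aborted request looks from the client side:&lt;/p&gt;

```python
import http.server
import threading
import urllib.request

class AbortingHandler(http.server.BaseHTTPRequestHandler):
    """Drops every request without sending any response, loosely
    mimicking HTTPChaos with abort: true on the request path."""

    def handle_one_request(self):
        # Drain the request line and headers, then close without replying.
        while self.rfile.readline(65537).strip():
            pass
        self.close_connection = True

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), AbortingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def fetch(url: str):
    """Return the HTTP status code, or the exception name on failure."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except Exception as exc:
        return type(exc).__name__

result = fetch(f"http://127.0.0.1:{port}/")
print(result)  # an error name, never 200: the request was aborted
server.shutdown()
```

&lt;p&gt;From the caller's perspective the app is simply down, which is why dashboards and alerting should be watched during the experiment.&lt;/p&gt;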

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q8zb9hnxffcja9qdi2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q8zb9hnxffcja9qdi2t.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bypqi32h9wzt9v4rkkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bypqi32h9wzt9v4rkkw.png" alt="Image description" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 3: NetworkChaos
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-network-bandwidth-failure-experiment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bandwidth&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-experiment&lt;/span&gt; 
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app'&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;react-app'&lt;/span&gt;
  &lt;span class="na"&gt;bandwidth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2mbps'&lt;/span&gt;
    &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20971520&lt;/span&gt;
    &lt;span class="na"&gt;buffer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Limits the network bandwidth to simulate slow network conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The application’s outgoing bandwidth was restricted to 2 Mbps, with a buffer and rate limit applied.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Useful for testing frontend responsiveness or service timeouts under constrained network speeds.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
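&lt;p&gt;As a rough back-of-the-envelope check on what a 2 Mbps cap means for a frontend: the idealized transfer time for an asset is just its size in bits divided by the link rate. The payload sizes below are hypothetical, and real transfers add TCP and HTTP overhead on top:&lt;/p&gt;

```python
def transfer_seconds(payload_bytes: int, rate_mbps: float) -> float:
    """Idealized lower bound on transfer time at a given link rate,
    ignoring TCP slow start and protocol overhead."""
    return payload_bytes * 8 / (rate_mbps * 1_000_000)

# A 5 MB JavaScript bundle under the 2 Mbps cap from the experiment
print(transfer_seconds(5_000_000, 2.0))   # 20.0 seconds at best
# The same bundle on an unthrottled 100 Mbps link
print(transfer_seconds(5_000_000, 100.0)) # 0.4 seconds
```

&lt;p&gt;A page that loads in under a second normally can easily take tens of seconds under this cap, which is exactly the kind of timeout behavior the experiment is meant to surface.&lt;/p&gt;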

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw52ld957ietfxqr2bnf8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw52ld957ietfxqr2bnf8.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvro4u4qvxggh9dcrfvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpvro4u4qvxggh9dcrfvy.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 4: StressChaos
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StressChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-cpu-stress-test-experiment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;  &lt;span class="c1"&gt;# Apply chaos to all matching pods&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-experiment&lt;/span&gt;
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;react-app&lt;/span&gt;
  &lt;span class="na"&gt;stressors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;workers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
      &lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;  &lt;span class="c1"&gt;# 80% CPU load on each worker&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;  &lt;span class="c1"&gt;# Stress for 10 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Applies CPU pressure by generating artificial load on the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Each matching pod experienced 80% CPU usage across 4 workers for 10 minutes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This can uncover performance bottlenecks or autoscaling issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
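&lt;p&gt;Under the hood, Chaos Mesh injects &lt;code&gt;stress-ng&lt;/code&gt; workers into the target pods. A simplified Python sketch of the duty-cycle idea behind a percentage CPU load follows; this illustrates the concept only, not what stress-ng actually runs:&lt;/p&gt;

```python
import time

def stress_cpu(load_pct: int, seconds: float, slice_s: float = 0.1) -> None:
    """Busy-spin for load_pct% of each time slice, sleep for the rest.
    One call approximates one stress worker at the given load."""
    busy = slice_s * load_pct / 100
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        spin_until = time.monotonic() + busy
        while time.monotonic() < spin_until:
            pass                      # burn CPU for load_pct% of the slice
        time.sleep(slice_s - busy)    # idle for the remainder

# Roughly 80% load on one core for half a second
stress_cpu(80, 0.5)
```

&lt;p&gt;The experiment runs four such workers per pod, so a pod with fewer than four cores of CPU limit will be throttled, which is what makes autoscaling and bottleneck issues visible.&lt;/p&gt;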

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdqbasjmjp9sj0kav8sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdqbasjmjp9sj0kav8sg.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kjcuzs46wcf94wvwrej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0kjcuzs46wcf94wvwrej.png" alt="Image description" width="800" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 5: TimeChaos
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TimeChaos&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;custom-time-shift-example-experiment&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaos-mesh&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;namespaces&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;chaos-experiment&lt;/span&gt;
    &lt;span class="na"&gt;labelSelectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;react-app&lt;/span&gt;
  &lt;span class="na"&gt;timeOffset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-10m100ns'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;br&gt;
Shifts the system clock on the container to simulate clock skew or time drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;System time on each pod was shifted 10 minutes backward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Helps test the behavior of time-sensitive features like cron jobs, auth tokens, and expiry mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma3lbrvwlumqpb9i4kx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma3lbrvwlumqpb9i4kx6.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚠ Note: TimeChaos only affects the main container process (PID 1) and its child processes. It does not affect processes launched externally via &lt;code&gt;kubectl exec&lt;/code&gt;. To test TimeChaos effectively, observe the application’s internal behavior (logs, API responses, or time-based operations) rather than relying on external exec-based checks.&lt;/p&gt;
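&lt;p&gt;The &lt;code&gt;timeOffset&lt;/code&gt; value follows Go’s duration syntax (Chaos Mesh is written in Go), so &lt;code&gt;-10m100ns&lt;/code&gt; means 10 minutes plus 100 nanoseconds, applied backward. A simplified Python re-implementation of the parsing, for illustration only:&lt;/p&gt;

```python
import re

# Nanoseconds per Go duration unit
_UNIT_NS = {
    "h": 3_600_000_000_000,
    "m": 60_000_000_000,
    "s": 1_000_000_000,
    "ms": 1_000_000,
    "us": 1_000,
    "ns": 1,
}

def parse_go_duration_ns(text: str) -> int:
    """Parse a Go-style duration like '-10m100ns' into nanoseconds.
    Simplified sketch: no 'µs' alias and no validation of stray characters;
    Go itself uses time.ParseDuration for this."""
    sign = -1 if text.startswith("-") else 1
    total = 0
    # 'ms', 'us', 'ns' must be tried before 'm' and 's' in the alternation
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(h|ms|us|ns|m|s)", text):
        total += int(float(value) * _UNIT_NS[unit])
    return sign * total

print(parse_go_duration_ns("-10m100ns"))  # -600000000100 ns
```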

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Chaos engineering with Chaos Mesh turns failure into resilience by proactively testing systems under stress. By simulating real-world disruptions like pod crashes and network delays, you uncover weaknesses before they cause outages. This practice ensures your Kubernetes applications don’t just survive chaos but emerge stronger. In distributed systems, resilience isn’t luck; it’s engineered. 🚀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AWS EKS AutoMode: Simplifying Kubernetes Management</title>
      <dc:creator>Shittu Sulaimon (Barry)</dc:creator>
      <pubDate>Sat, 21 Dec 2024 08:21:16 +0000</pubDate>
      <link>https://forem.com/sadebare/aws-eks-automode-simplifying-kubernetes-management-5h27</link>
      <guid>https://forem.com/sadebare/aws-eks-automode-simplifying-kubernetes-management-5h27</guid>
<description>&lt;p&gt;From the word “Auto”, it’s clear that this feature emphasizes automation. EKS AutoMode is a feature AWS unveiled at re:Invent in December 2024 with the goal of making Kubernetes cluster administration on Amazon Elastic Kubernetes Service (EKS) easier. By removing the operational load typically involved in setting up and maintaining Kubernetes clusters, it lets customers concentrate on innovation and on adding value to their organizations.&lt;/p&gt;

&lt;p&gt;In the past, AWS simplified things for users by managing the control plane of EKS clusters, which included maintaining complex components such as the API server. With EKS AutoMode, AWS has gone one step further by also automating networking, storage, and worker node management, relieving users of several infrastructure hassles. For enterprises wishing to adopt Kubernetes without requiring in-depth knowledge of its underlying architecture, this represents a major advancement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages of AWS EKS AutoMode&lt;/strong&gt;&lt;br&gt;
Here are some of the standout benefits of this new feature:&lt;br&gt;
&lt;strong&gt;1. Simplified Cluster Operations:&lt;/strong&gt;&lt;br&gt;
EKS AutoMode aligns with the Operational Excellence pillar of the AWS Well-Architected Framework. It automates essential processes such as patching, version upgrades, cluster management, and applying security best practices. As a result, customers no longer need to invest time in overseeing the clusters' operational lifecycle; AWS takes care of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Improved Performance, Availability, and Security:&lt;/strong&gt; &lt;br&gt;
EKS AutoMode improves the security posture, availability, and performance of applications running in the cluster by automating operational tasks. Even while scaling or patching, workloads continue to meet strict security and performance requirements thanks to built-in AWS optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cost-Optimized Compute and Right-Sizing:&lt;/strong&gt;&lt;br&gt;
EKS AutoMode's ability to optimize compute resources is among its most notable features. By preventing over- or under-provisioning, it ensures that the compute, memory, and storage resources workloads need are sized appropriately, so applications run efficiently without wasting money.&lt;br&gt;
EKS AutoMode further lowers costs by automatically choosing appropriate EC2 instance types and sizes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Built-In Health Monitoring and Auto Repair:&lt;/strong&gt;&lt;br&gt;
EKS AutoMode continuously monitors the health of applications and underlying resources, automatically repairing any issues that arise. This ensures that workloads remain highly available and resilient without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Streamlined Kubernetes Adoption:&lt;/strong&gt;&lt;br&gt;
Even without a thorough understanding of Kubernetes architecture, enterprises may easily adopt Kubernetes with EKS AutoMode. For businesses wishing to adopt containerization and update their applications, this reduces the entry barrier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands-On Guide: Creating an EKS AutoMode Cluster Using the AWS Console&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To establish a cluster in the console, we have two choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Quick configuration (with EKS Auto Mode)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom configuration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post will show you how to use the Quick configuration option to create an EKS Auto Mode cluster.&lt;/p&gt;

&lt;p&gt;Step 1: Sign in to the AWS Management Console&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Log in to your AWS account at AWS Management Console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate to the EKS (Elastic Kubernetes Service) dashboard by searching for EKS in the search bar.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 2: Create a new EKS cluster&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On the Amazon EKS dashboard, click on Create cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Under the Cluster configuration section:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name: Enter a name for your cluster (e.g., eks-automode-cluster).&lt;/li&gt;
&lt;li&gt;Version: Select the Kubernetes version you’d like to use (the latest version is recommended for best support and features).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1tpy7r48z893pgtw11a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1tpy7r48z893pgtw11a.png" alt="Image description" width="800" height="430"&gt;&lt;/a&gt;&lt;br&gt;
Step 3: Choose the Cluster IAM Role. If this is your first time setting up an EKS Auto Mode cluster, use the Create recommended role option.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The Cluster IAM Role grants EKS Auto Mode the permissions it needs to manage resources such as EC2 instances, EBS volumes, and EC2 load balancers. By default the console creates a role named &lt;code&gt;eksClusterRole&lt;/code&gt;; the name can vary, but the attached policies are what matter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;These are the policies attached to the role:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc0f03iv9wbanz7np6w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc0f03iv9wbanz7np6w4.png" alt="Image description" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 4: Choose the Node IAM Role. If this is your first time setting up an EKS Auto Mode cluster, use the Create recommended role option.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Node IAM Role includes the permissions Auto Mode nodes need to join the cluster, including permission to pull container images from Amazon ECR. By default the role is named &lt;code&gt;AmazonEKSAutoNodeRole&lt;/code&gt; with the following policies attached. The name can vary, but the required policies must be attached.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksvo09oas42qxvki1kke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fksvo09oas42qxvki1kke.png" alt="Image description" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you recently created a new role, use the Refresh icon to reload the role selection dropdown.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 5: Choose the VPC for your EKS Auto Mode cluster. You can either select a VPC you have already created for EKS or use the Create VPC button to create a new one.&lt;/p&gt;

&lt;p&gt;Step 6: (optional) Select View quick configuration defaults to review all configuration values for the new cluster. The table indicates some values are not editable after the cluster is created.&lt;/p&gt;

&lt;p&gt;Step 7: Select Create cluster. Note that cluster creation may take up to fifteen minutes.&lt;/p&gt;
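&lt;p&gt;If you prefer scripting over the console, a rough AWS CLI sketch of the same setup follows. Treat every flag and value here as an assumption to verify against the current AWS CLI reference: the account ID, role ARNs, and subnet IDs are placeholders, and the Auto Mode flag shapes (&lt;code&gt;--compute-config&lt;/code&gt;, &lt;code&gt;--storage-config&lt;/code&gt;, &lt;code&gt;--kubernetes-network-config&lt;/code&gt;) may differ in your CLI version:&lt;/p&gt;

```shell
# Hedged sketch, not a verified recipe: create an EKS Auto Mode cluster.
# Replace the account ID, role names, and subnet IDs with your own values.
aws eks create-cluster \
  --name eks-automode-cluster \
  --role-arn arn:aws:iam::111122223333:role/eksClusterRole \
  --resources-vpc-config subnetIds=subnet-aaaa1111,subnet-bbbb2222 \
  --compute-config '{"enabled":true,"nodeRoleArn":"arn:aws:iam::111122223333:role/AmazonEKSAutoNodeRole","nodePools":["general-purpose","system"]}' \
  --kubernetes-network-config '{"elasticLoadBalancing":{"enabled":true}}' \
  --storage-config '{"blockStorage":{"enabled":true}}'

# Check progress; status should move from CREATING to ACTIVE
aws eks describe-cluster --name eks-automode-cluster --query cluster.status
```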

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusysoaj5u3m6pf4jvrs8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusysoaj5u3m6pf4jvrs8.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After the cluster has been created successfully, let us view the nodes being provisioned by navigating to the Compute tab:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj1xum48zyswfee8i1zo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj1xum48zyswfee8i1zo.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here no nodes have been provisioned yet, so no compute resources are available. Let us deploy a workload into the cluster so that Auto Mode provisions nodes for it: &lt;code&gt;kubectl apply&lt;/code&gt; the manifest file below
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      nodeSelector:
        eks.amazonaws.com/compute-type: auto
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
          securityContext:
            allowPrivilegeEscalation: false

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
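&lt;p&gt;After applying the manifest, you can watch Auto Mode react from the terminal as well. These are standard kubectl commands; the file name &lt;code&gt;inflate.yaml&lt;/code&gt; is simply whatever you saved the manifest above as:&lt;/p&gt;

```shell
# Apply the sample workload
kubectl apply -f inflate.yaml

# The pod stays Pending until Auto Mode provisions a node for it
kubectl get pods -l app=inflate --watch   # Ctrl+C to stop watching

# Once the pod is Running, a new node should appear
kubectl get nodes

# Clean up; Auto Mode should eventually remove the now-empty node
kubectl delete deployment inflate
```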



&lt;ul&gt;
&lt;li&gt;Now, let's recheck the nodes and the workload from the console. In the screenshots below, we can see the node that was provisioned and the deployment running on it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn92o3ec7b7n1xwtyu02e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn92o3ec7b7n1xwtyu02e.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqng1hnoa5evw7eh3zdcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqng1hnoa5evw7eh3zdcw.png" alt="Image description" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We can then destroy and clean up the workload.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;EKS Auto Mode also provides out-of-the-box cluster observability, including dashboards for control plane monitoring, cluster health issues, cluster insights, and node health issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
