Forem: Sakthivel C

Why Your AWS EKS Cluster Isn't Scaling Down — The PDB Trap With Stateless Services

Sakthivel C — Thu, 16 Apr 2026 14:50:12 +0000

Introduction
What is a Pod Disruption Budget?
The Problem - PDB Blocking Node Scale Down
Why This Is Easy To Miss
The Fix — PDB Only For Stateful Services
Key Takeaway

Introduction

Kubernetes cost optimization on AWS EKS often focuses on scaling up efficiently, but scaling down is where hidden costs live. Cluster Autoscaler identifies nodes with low resource utilization and removes them (as long as the node count stays above the configured minimum) to reduce overall EKS cost. In a production environment we worked on, we noticed that nodes weren't being scaled down by the autoscaler even when resource usage was very low. After investigating, the culprit turned out to be something small and easy to overlook: a Pod Disruption Budget configured on a stateless service.

What is a Pod Disruption Budget?

A Pod Disruption Budget (PDB) is a Kubernetes resource that limits how many pods of an application can be down at the same time during voluntary disruptions such as node drains, cluster upgrades, or autoscaler scale-down events. It is used to keep critical services available while those disruptions happen. It does not protect against involuntary disruptions such as node failures or pod OOM kills.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service

This tells Kubernetes: "at least 1 pod of this service must stay running during voluntary disruptions" (minAvailable: 1).

For stateful services like Redis or Kafka this makes complete sense. You don't want all the pods of these services going down at the same time, and you need some minimum number of pods to stay available. The problem starts when you apply the same logic to stateless services.

The Problem - PDB Blocking Node Scale Down

Here's the exact scenario we ran into:

A stateless service was running with 1 replica and had a PDB with minAvailable: 1.
The pod was consuming very little CPU and memory.
Cluster Autoscaler identified the node as underutilized and tried to scale it down.
To scale down the node, it first needed to evict the pod, but the PDB required at least 1 pod to be available at all times.
Since there was only 1 replica, evicting it would violate the PDB.

Result: the autoscaler couldn't evict the pod, so the node stayed up indefinitely. The node was essentially stuck: too empty to be useful, too protected to be removed.
Cluster Autoscaler → tries to drain node
→ attempts to evict pod
→ PDB blocks eviction (minAvailable: 1, replicas: 1)
→ node scale down blocked
→ you keep paying for an underutilized node

Why This Is Easy To Miss

The pod itself showed no issues. CPU and memory were fine. HPA wasn't triggering. Everything looked healthy from an application perspective. The only sign was nodes not scaling down during low traffic periods — which is easy to dismiss as "autoscaler being slow" rather than investigating deeper.
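
One place where the block does become visible is the PDB's own status. The fragment below is a sketch of what the API server would be expected to report for the example PDB from earlier in a single-replica setup (for instance via `kubectl get pdb my-service-pdb -o yaml`). The numbers are assumed for this scenario and are not captured output.

```yaml
# Illustrative PodDisruptionBudget with its status section.
# Values are assumed for a service running exactly one healthy replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-service
status:
  expectedPods: 1         # pods matched by the selector
  currentHealthy: 1       # pods that are currently ready
  desiredHealthy: 1       # minimum required by minAvailable
  disruptionsAllowed: 0   # currentHealthy - desiredHealthy, so no eviction is permitted
```

A `disruptionsAllowed` of 0 on a healthy workload is the signal worth looking for: any drain or scale-down that needs to evict that pod will be refused.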

The Fix — PDB Only For Stateful Services

The solution was straightforward once we identified the cause:

Removed the PDB entirely from stateless services.
Kept PDBs only for stateful services like Redis, Kafka, and similar infra components.
Moved these stateful services to a dedicated node group, so that high resource usage by stateless pods on the same node can't affect them (a PDB offers no protection against that).

With this change, the stateful services run in a dedicated node group isolated from stateless pods, and their PDBs keep these critical infra pods available during drain events so that a drain can't cause a production-wide outage.
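
As a rough sketch of the dedicated node group placement, a stateful workload can be pinned with a node selector plus a toleration. The node group name `stateful-infra`, the taint `dedicated=stateful:NoSchedule`, the replica count, and the image tag are all assumptions for illustration. The taint has to be configured on the node group itself, and it is what keeps stateless pods off those nodes.

```yaml
# Sketch: schedule Redis only onto a dedicated (hypothetical) node group.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3                       # assumed; use your real replica count
  selector:
    matchLabels:
      app: redis                    # same label the redis-pdb selector uses
  template:
    metadata:
      labels:
        app: redis
    spec:
      nodeSelector:
        # Label that EKS managed node groups put on their nodes;
        # "stateful-infra" is a made-up node group name.
        eks.amazonaws.com/nodegroup: stateful-infra
      tolerations:
        # Allows these pods onto nodes tainted dedicated=stateful:NoSchedule.
        - key: dedicated
          operator: Equal
          value: stateful
          effect: NoSchedule
      containers:
        - name: redis
          image: redis:7
```

The node selector keeps the stateful pods on the dedicated nodes, while the taint keeps everything without the toleration away from them.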

Stateless services by definition can handle being evicted and rescheduled; that's the whole point of being stateless. They don't need disruption protection by default. If even brief disruptions are not acceptable, consider a PDB with maxUnavailable instead, or isolate such services in a separate EKS node group with a higher instance tier, chosen according to whether they are CPU-intensive or memory-intensive.
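
A minimal sketch of the maxUnavailable variant, assuming the stateless service runs at least two replicas (the replica count is an assumption, not something stated above):

```yaml
# Sketch: disruption budget for a stateless service with replicas >= 2.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-stateless-service-pdb
spec:
  maxUnavailable: 1        # at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: my-stateless-service
```

With several replicas this lets a drain proceed one pod at a time while the remaining pods keep serving traffic. With a single replica, maxUnavailable: 1 no longer blocks scale-down, but it also gives no availability guarantee, so the replica count matters more than the PDB itself.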

# PDB makes sense here: stateful service
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: redis

# Avoid this: stateless service with single replica
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-stateless-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-stateless-service

Key Takeaway

If your AWS EKS cluster autoscaler isn't scaling down nodes during low traffic periods, check your PDBs before anything else. A minAvailable: 1 on a single replica stateless service is effectively telling your cluster — "this node can never be removed."
Reserve PDBs for services that genuinely need them. Your AWS bill will thank you.

Have you run into unexpected autoscaler behaviour in EKS? Drop a comment — would love to hear other gotchas people have faced.

Why Your Kubernetes Pods Scale Slowly (And How to Fix It)

Sakthivel C — Fri, 10 Apr 2026 15:40:24 +0000

The Problem
Why Autoscaling Feels Slow
The Fix: Placeholder Pods
How to Set It Up
What Happens During a Real Spike
Things to Keep in Mind
Wrapping Up

The Problem

You've set up the Horizontal Pod Autoscaler (HPA) in your cluster. Your app gets a sudden spike in traffic, and your existing pods start to throttle under the heavy load.

The HPA kicks in: "Hey, I need 3 more pods to service this traffic!"

But instead of scaling instantly, those pods sit in a Pending state for 4–5 minutes. In that window:

Requests are dropped.
Latency spikes.
You risk losing customers.

Why are the pods stuck?

The Kubernetes scheduler can't place your pods because there is no room left on your existing nodes. This triggers the Cluster Autoscaler (CA) to provision a brand new node.

That process is slow:

VM Provisioning: The cloud provider has to spin up a new instance.
Node Bootstrapping: Joining the node to the cluster and installing dependencies.
Image Pulling: Downloading your container images to the new node.

By the time the node is ready, the damage is already done.

Why Autoscaling Feels Slow

Kubernetes autoscaling operates in two distinct layers:

HPA (Horizontal Pod Autoscaler): Scales pods based on metrics. This is fast (seconds).
CA (Cluster Autoscaler): Adds new nodes when pods can't be scheduled. This is slow (3–5 minutes).

HPA reacts in seconds, but CA reacts in minutes. That gap is where your availability suffers.
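
For context, the HPA layer might look like the sketch below. The Deployment name `my-app`, the replica bounds, and the 70% CPU target are illustrative assumptions. The point is that an HPA only changes the desired replica count; whether the new pods can actually start depends on free node capacity.

```yaml
# Minimal HPA sketch (autoscaling/v2) scaling a hypothetical Deployment on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU use exceeds 70% of requests
```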

The Fix: Placeholder Pods

The Concept: Keep "dummy" pods running on your nodes to reserve space. They do nothing but hold capacity. When a real pod needs that space, Kubernetes evicts the dummy immediately, and your real pod schedules without waiting.

The evicted dummy then has nowhere to go, which signals the Cluster Autoscaler to provision a new node. The dummy lands there—restoring the buffer for the next spike.

This ensures you always have warm capacity ready. The slow provisioning happens in the background, not in your user's critical path.

How to Set It Up

Step 1: Create a Low-Priority Class

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: placeholder-pod-priority
value: -1
globalDefault: false
description: "Used for placeholder pods that can be evicted anytime"

A negative priority ensures any real pod—which defaults to priority 0—will always win. The scheduler will immediately evict the placeholder to make room for your application pod.

Step 2: Deploy the Placeholder Pods

apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: placeholder
  template:
    metadata:
      labels:
        app: placeholder
    spec:
      priorityClassName: placeholder-pod-priority
      terminationGracePeriodSeconds: 0
      containers:
        - name: placeholder
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"

Key details in this manifest:

pause image: This is the smallest possible container; it does nothing and consumes virtually no resources.

resources.requests: This tells Kubernetes to reserve this specific amount of space. Match this roughly to your app's requirements.

terminationGracePeriodSeconds: 0: Ensures the eviction is instant, handing the spot to your real pod without any shutdown delay.

Step 3: Verify Your App's Priority

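If you haven't explicitly set a priorityClassName on your application deployment, its priority defaults to 0 (unless a PriorityClass in your cluster is marked globalDefault: true, in which case that value applies). Since 0 is higher than -1, your real pods will preempt the placeholders automatically.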
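
If you would rather not depend on the default, one option is to give application pods an explicit priority. This is a sketch only: the class name `app-priority`, the value 1000, the Deployment `my-app`, and the image reference are hypothetical.

```yaml
# Hypothetical PriorityClass for application pods; any value above -1 outranks the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-priority
value: 1000
globalDefault: false
description: "Application pods that may preempt placeholder pods"
---
# Hypothetical application Deployment that references the class.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      priorityClassName: app-priority
      containers:
        - name: my-app
          image: example.com/my-app:1.0   # placeholder image reference
          resources:
            requests:
              cpu: "500m"                 # sized like one placeholder pod
              memory: "512Mi"
```

Keeping the application's requests at or below the placeholder's requests means preempting one placeholder frees enough room for one real pod.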

What Happens During a Real Spike

Traffic increases → HPA requests 3 new pods.
Scheduler looks for space → finds it (placeholder pods are holding it).
Placeholder pods get evicted instantly → real pods schedule in seconds.
Evicted placeholders are now in Pending state.
Cluster Autoscaler sees Pending pods → provisions a new node.
Placeholders land on the new node → buffer is restored for next time.

Things to Keep in Mind

Cost Trade-off: Placeholder pods reserve real node capacity, meaning you are essentially paying for "warm" standby nodes.
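Namespace Scope: Pod priority and preemption work across namespaces, so placeholders can reserve capacity from any namespace. Deploying them alongside your workloads, or tuning them per namespace based on criticality, is an organizational choice.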
Works Best with CA: This pattern targets the node provisioning delay specifically. If your nodes already have massive amounts of spare capacity, you don't need this.

Wrapping Up

Cluster Autoscaler is not broken—it's just slow by design because provisioning VMs takes time. Placeholder pods let you work with that constraint. Your HPA scales instantly into pre-warmed capacity, and the slow provisioning happens in the background where it belongs.