<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aonnis</title>
    <description>The latest articles on Forem by Aonnis (@aonnis).</description>
    <link>https://forem.com/aonnis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3683869%2F6152fd27-206a-4d6f-9bbc-d9dc88c2fcf0.png</url>
      <title>Forem: Aonnis</title>
      <link>https://forem.com/aonnis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aonnis"/>
    <language>en</language>
    <item>
      <title>Thundering Herds: The Scalability Killer</title>
      <dc:creator>Aonnis</dc:creator>
      <pubDate>Thu, 01 Jan 2026 08:00:00 +0000</pubDate>
      <link>https://forem.com/aonnis/thundering-herds-the-scalability-killer-41mh</link>
      <guid>https://forem.com/aonnis/thundering-herds-the-scalability-killer-41mh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ts15g1wirqtnohgycro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ts15g1wirqtnohgycro.png" alt="Thundering Herds: The Scalability Killer" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine it’s 3:00 AM. Your pager goes off. The dashboard shows a 100% CPU spike on your primary database, followed by a total service outage. You look at the logs and see a weird pattern: the traffic didn't actually increase, but suddenly every single request started failing at the exact same millisecond.&lt;/p&gt;

&lt;p&gt;You’ve just been trampled by the Thundering Herd.&lt;/p&gt;

&lt;p&gt;In this post, we’re going to dive into one of the most common yet misunderstood performance bottlenecks in distributed systems: the Thundering Herd problem, and how to use a combination of Request Collapsing and Jitter to build systems that don’t collapse under their own weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Thundering Herd?
&lt;/h2&gt;

&lt;p&gt;At its core, the Thundering Herd occurs when a large number of processes wait for the same event; when it fires, they all wake up at once, even though only one of them can actually "handle" it.&lt;/p&gt;

&lt;p&gt;While the term originated in OS kernel scheduling, modern web engineers most frequently encounter it in the form of a Cache Stampede.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Anatomy of a Crash
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Golden State:&lt;/strong&gt; You have a high-traffic endpoint cached in Redis. Everything is fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Expiry:&lt;/strong&gt; The cache TTL (Time-to-Live) hits zero.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Stampede:&lt;/strong&gt; 5,000 concurrent users refresh the page. They all see a cache miss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Collapse:&lt;/strong&gt; All 5,000 requests hit your database simultaneously to re-generate the same data. Load surges, latency skyrockets, and the service goes down.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; 5,000 requests is an arbitrary number; the actual breaking point depends on your system's capacity.&lt;/p&gt;
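&lt;p&gt;The anti-pattern is easy to spot once you know it. A sketch of the naive read-through cache that produces exactly this stampede (&lt;code&gt;fetch_from_db&lt;/code&gt; stands in for your real query):&lt;/p&gt;

```python
import threading
import time

cache = {}
db_calls = []  # tracks load on the "database"

def fetch_from_db(key):
    db_calls.append(key)        # every entry here is one DB query
    time.sleep(0.05)            # simulated query latency
    return f"value-for-{key}"

def naive_get(key):
    if key in cache:
        return cache[key]
    value = fetch_from_db(key)  # cache miss: EVERY caller goes to the DB
    cache[key] = value
    return value

# 50 concurrent requests arrive just after the key expired:
threads = [threading.Thread(target=naive_get, args=("hot",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_calls))  # close to 50 queries for the same key
```

&lt;p&gt;With request collapsing in place (covered below), the same experiment reports a single query.&lt;/p&gt;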

&lt;h2&gt;
  
  
  Beyond the Cache: Other Thundering Herd Scenarios
&lt;/h2&gt;

&lt;p&gt;While cache stampedes are the most common, the Thundering Herd can manifest across your entire stack:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "Welcome Back" Surge (Downstream Recovery)
&lt;/h3&gt;

&lt;p&gt;Imagine your primary Auth service goes down for 5 minutes. During this time, every other service in your cluster is failing and retrying. When the Auth service finally comes back up, it is immediately hit by a flood of requests from all the other services trying to "catch up." This often knocks the service right back down again—a phenomenon known as a &lt;strong&gt;Retry Storm&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Auth Token Expiry
&lt;/h3&gt;

&lt;p&gt;In microservices, many internal services might share a common access token (like a machine-to-machine JWT). If that token has a hard expiry and 50 different microservices all see it expire at the exact same second, they will all "thunder" toward the Identity Provider to get a new one.&lt;/p&gt;
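&lt;p&gt;The standard fix is to have each service refresh its token at a randomized point &lt;em&gt;before&lt;/em&gt; expiry rather than at the expiry instant. A minimal sketch (the 10-20% window is an arbitrary choice; tune it to your token TTL):&lt;/p&gt;

```python
import random

def next_refresh_delay(ttl_seconds):
    """Schedule a token refresh in the last 10-20% of its lifetime.

    Each of the 50 services picks its own random point, so refresh
    calls to the Identity Provider spread over minutes, not one second.
    """
    early = ttl_seconds * random.uniform(0.10, 0.20)
    return ttl_seconds - early

# For a 1-hour token, refresh somewhere between minute 48 and minute 54:
print(next_refresh_delay(3600))
```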

&lt;h3&gt;
  
  
  3. "Top of the Hour" Scheduled Tasks
&lt;/h3&gt;

&lt;p&gt;A classic ops mistake is scheduling a heavy cleanup cron job to run at &lt;code&gt;0 * * * *&lt;/code&gt; (the top of every hour) across 100 different server nodes. At precisely :00:00 of every hour, your database or shared storage is hit by 100 heavy processes simultaneously.&lt;/p&gt;
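&lt;p&gt;The cheapest fix is a random "splay" at the start of the job. All 100 nodes still wake at the same minute, but the heavy work smears over a window instead of firing in the same second. A sketch (&lt;code&gt;max_splay&lt;/code&gt; is whatever smear your storage can absorb):&lt;/p&gt;

```python
import random
import time

def run_with_splay(job, max_splay=300):
    # Sleep a random 0-5 minutes before the heavy work; each node
    # picks its own delay, so they no longer hit storage together.
    time.sleep(random.uniform(0, max_splay))
    job()
```

&lt;p&gt;If you use systemd timers, &lt;code&gt;RandomizedDelaySec=&lt;/code&gt; gives you the same behavior without any code.&lt;/p&gt;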

&lt;h3&gt;
  
  
  4. CDN "Warm-up" and Deployment Surge
&lt;/h3&gt;

&lt;p&gt;When you deploy a new version of a 500MB mobile app binary, it isn't in any CDN edge caches yet. If you immediately notify 1 million users to download it, the first few thousand requests will all miss the edge and hit your origin server at once, potentially melting your storage layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Detect the Herd (Monitoring &amp;amp; Metrics)
&lt;/h2&gt;

&lt;p&gt;You don't want your first notification of a thundering herd to be a total outage. Look for these "herd signatures" in your dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Correlation of Cache Misses and Latency&lt;/strong&gt;: A sudden spike in cache miss rates that perfectly aligns with a surge in p99 database latency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connection Pool Exhaustion&lt;/strong&gt;: If you see your database connection pool hitting its max limit within milliseconds, you likely have a stampede.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CPU Context Switching&lt;/strong&gt;: On your application servers, a massive spike in "System CPU" or context switches indicates that thousands of threads are waking up and fighting for the same locks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Logs&lt;/strong&gt;: Thousands of "lock wait timeout" or "connection refused" errors occurring in a tight cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Strategy 1: Request Collapsing (The "Wait in Line" Approach)
&lt;/h3&gt;

&lt;p&gt;Request collapsing (also known as Promise Memoization) is the practice of ensuring that for any given resource, only one upstream request is active at a time.&lt;/p&gt;

&lt;p&gt;If Request A is already fetching user_data_123 from the database, Requests B, C, and D shouldn't start their own fetches. Instead, they should "subscribe" to the result of Request A.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem with Naive Collapsing
&lt;/h4&gt;

&lt;p&gt;If you implement a simple lock, you often run into a secondary issue: Busy-Waiting. If 4,999 requests are waiting for that one database call to finish, how do they know when it's done? If they all check "Is it ready yet?" every 10ms, you’ve just created a new herd in your application memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Solution: Event-Based Notification
&lt;/h4&gt;

&lt;p&gt;To fix this, we need to move from a polling model to a notification model. Instead of repeatedly asking "Is it done?", the waiting requests should simply go to sleep and ask to be woken up when the data is ready.&lt;/p&gt;

&lt;p&gt;In Python or Node.js, this is often handled natively by Promises or Futures. In other languages, you might use Condition Variables or Channels.&lt;/p&gt;

&lt;p&gt;Here is a Python example using asyncio. Notice how we use a shared Event object. The "followers" simply await the event, consuming zero CPU while they wait for the "leader" to finish the work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RequestCollapser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Stores the events for keys currently being fetched
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Check if data is already in cache
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Check if someone else is already fetching it
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; joining the herd (waiting)...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;--- Crucial: Zero CPU usage while waiting
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Be the "Leader"
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is the LEADER. Fetching from DB...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Simulate DB fetch
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fresh Data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# 4. Notify the herd
&lt;/span&gt;            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Wakes up all waiters instantly
&lt;/span&gt;            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Giant Herd: Distributed Collapsing
&lt;/h4&gt;

&lt;p&gt;The Python example above works perfectly for a single server. But what if you have 100 app servers? You still have 100 "leaders" hitting your database at once. That may or may not be a problem, depending on your database's capacity. If it is, you need to coordinate leadership across the cluster, not just within one process.&lt;/p&gt;

&lt;p&gt;To solve this at scale, you can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Distributed Locks (Redis/Etcd)&lt;/strong&gt;: Use a library like &lt;code&gt;Redlock&lt;/code&gt; to ensure only one node in the entire cluster becomes the leader for a specific key.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Singleflight" Pattern&lt;/strong&gt;: In Go, the &lt;code&gt;golang.org/x/sync/singleflight&lt;/code&gt; package is the gold standard for this. It handles the local collapsing logic efficiently, and when combined with a distributed lock, it protects both your app memory and your database.&lt;/li&gt;
&lt;/ol&gt;
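&lt;p&gt;If your app tier is Python rather than Go, the singleflight pattern is a few dozen lines with a lock and a per-key event. A thread-based sketch (waiters receive &lt;code&gt;None&lt;/code&gt; if the leader fails; production code should propagate errors):&lt;/p&gt;

```python
import threading

class SingleFlight:
    """Collapses concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event": Event, "result": ...}

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is not None:
                is_leader = False
            else:
                entry = {"event": threading.Event(), "result": None}
                self._inflight[key] = entry
                is_leader = True

        if not is_leader:
            entry["event"].wait()   # sleep until the leader finishes
            return entry["result"]  # None if the leader raised

        try:
            entry["result"] = fn()
            return entry["result"]
        finally:
            with self._lock:
                del self._inflight[key]
            entry["event"].set()    # wake every waiter at once
```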




&lt;h3&gt;
  
  
  Strategy 2: Jitter (The "Social Distancing" for Data)
&lt;/h3&gt;

&lt;p&gt;Collapsing handles requests that are already in flight, but it can't stop thousands of clients from firing at the same instant in the first place. This is where Jitter comes in. Jitter is the introduction of intentional, controlled randomness to stagger execution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Staggered Retries
&lt;/h4&gt;

&lt;p&gt;When a request finds that a resource is being "collapsed" (someone else is already fetching it), don't let it retry on a fixed interval.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bad:&lt;/strong&gt; Retry every 50ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Good:&lt;/strong&gt; Retry every 50ms + random(0, 20ms).&lt;/li&gt;
&lt;/ol&gt;
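&lt;p&gt;In practice you combine this with exponential backoff. The "full jitter" variant (popularized by AWS's "Exponential Backoff and Jitter" article) randomizes over the entire backoff window, which desynchronizes clients fastest:&lt;/p&gt;

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Full-jitter exponential backoff.

    The ceiling doubles each attempt (capped at `cap` seconds), and
    each client picks a uniform random delay below it, so clients
    that failed together stop retrying together within a few attempts.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```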

&lt;h4&gt;
  
  
  Staggered Expirations
&lt;/h4&gt;

&lt;p&gt;Never set a hard TTL on a batch of keys. If you update 10,000 products and set them all to expire in exactly 1 hour, you are scheduling a disaster for exactly 60 minutes from now. Instead, use &lt;code&gt;TTL = 3600 + (rand() * 120)&lt;/code&gt;. This spreads the "thundering" over a 2-minute window, which your database can likely handle.&lt;/p&gt;
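&lt;p&gt;The same formula in Python (the constants are illustrative):&lt;/p&gt;

```python
import random

BASE_TTL = 3600   # one hour
JITTER = 120      # two-minute smear window

def jittered_ttl():
    # Each key gets a slightly different expiry, so a batch of keys
    # written together does not expire together.
    return int(BASE_TTL + random.uniform(0, JITTER))
```

&lt;p&gt;Pass the result as the expiry on every write, e.g. &lt;code&gt;redis.set(key, value, ex=jittered_ttl())&lt;/code&gt; with redis-py.&lt;/p&gt;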

&lt;h4&gt;
  
  
  The Pro Move: Probabilistic Early Refresh
&lt;/h4&gt;

&lt;p&gt;The most resilient systems I've built use a technique called X-Fetch. Instead of waiting for the cache to expire, we use jitter to trigger a refresh slightly before expiration.&lt;/p&gt;

&lt;p&gt;As the TTL approaches zero, each request performs a "dice roll." If the roll is low, that specific request takes the lead, re-fetches the data, and resets the cache. Because the roll is random and independent for every user, statistically only a handful of requests trigger the update, while everyone else keeps getting the "stale but safe" data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resilient_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Handle Cache Miss
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. Calculate time remaining
&lt;/span&gt;        &lt;span class="n"&gt;time_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Handle Negative Time (Expired) or Probabilistic Check
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Probability increases as time_remaining approaches 0
&lt;/span&gt;            &lt;span class="c1"&gt;# Note: We check &amp;lt;= 0 above to avoid DivisionByZero or negative probability
&lt;/span&gt;            &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_refresh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Collapse requests using a distributed lock or local future map
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;collapse_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="c1"&gt;# Fallback to stale data on DB failure
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Defense: Safety Nets
&lt;/h2&gt;

&lt;p&gt;Sometimes, despite your best efforts with Jitter or Collapsing, a herd still breaks through. In those moments, you need a final line of defense to keep your system alive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Load Shedding&lt;/strong&gt;: When your database connection pool is full, don't keep queuing requests (which just increases latency). Start dropping them with a &lt;code&gt;503 Service Unavailable&lt;/code&gt;. It’s better to fail 10% of users quickly than to make 100% of users wait 30 seconds for a timeout.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Circuit Breakers&lt;/strong&gt;: If your database is struggling, the circuit breaker "trips" and stops all traffic for a cool-down period. This gives your DB the breathing room it needs to recover without being continuously bombarded by retries.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rate Limiting&lt;/strong&gt;: By capping the number of requests per second (globally or per-user), you ensure that even a massive "herd" can't exceed your system's hard limits. Excess requests are throttled with a &lt;code&gt;429 Too Many Requests&lt;/code&gt;, protecting your infrastructure from being overwhelmed.&lt;/li&gt;
&lt;/ol&gt;
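&lt;p&gt;A token bucket is enough to sketch both rate limiting and basic load shedding at the edge of a service. A minimal, single-process version (production setups usually put this in a gateway or a shared store such as Redis):&lt;/p&gt;

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`; refill at `rate` tokens/second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed: respond 429 (per-user cap) or 503 (overload)
```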

&lt;h2&gt;
  
  
  Choosing Your Weapon: Strategy Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Implementation Complexity&lt;/th&gt;
&lt;th&gt;Best Used For...&lt;/th&gt;
&lt;th&gt;Main Drawback&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Retries, TTL Expirations&lt;/td&gt;
&lt;td&gt;Doesn't stop the initial spike, just spreads it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Request Collapsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High-traffic single keys (e.g., Homepage)&lt;/td&gt;
&lt;td&gt;Can become a complex "leader" bottleneck.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Fetch (Probabilistic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Mission-critical low-latency data&lt;/td&gt;
&lt;td&gt;Adds pre-emptive load to your database.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Scaling isn't just about adding more servers; it's about managing the coordination between them. By implementing Request Collapsing, you protect your downstream resources. By adding Jitter, you protect your coordination layer from itself.&lt;/p&gt;

&lt;p&gt;The next time you set a cache TTL, ask yourself: "What happens if 10,000 people ask for this at the same time?" If the answer is "they all wait for the DB," it's time to add some jitter.&lt;/p&gt;

&lt;p&gt;If you enjoyed this deep dive into systems engineering, feel free to follow for more insights on building resilient distributed systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build More Resilient Systems with Aonnis
&lt;/h2&gt;

&lt;p&gt;If you're managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;Aonnis Valkey Operator&lt;/a&gt;&lt;/strong&gt;. It helps you deploy and manage high-performance Valkey compatible clusters on Kubernetes with built-in best practices for reliability and scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise&lt;/strong&gt;: It is free for a limited time.&lt;/p&gt;

&lt;p&gt;Visit &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;www.aonnis.com&lt;/a&gt;&lt;/strong&gt; to learn more.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Solving the 1MiB ConfigMap Limit in Kubernetes</title>
      <dc:creator>Aonnis</dc:creator>
      <pubDate>Wed, 31 Dec 2025 08:19:17 +0000</pubDate>
      <link>https://forem.com/aonnis/solving-the-1mib-configmap-limit-in-kubernetes-14m9</link>
      <guid>https://forem.com/aonnis/solving-the-1mib-configmap-limit-in-kubernetes-14m9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy9felv3e2mturjjhmo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy9felv3e2mturjjhmo0.png" alt="1MiB ConfigMap limit in Kubernetes" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you build a Kubernetes Operator in 2025, you eventually hit the "State Problem."&lt;/p&gt;

&lt;p&gt;You start simple: storing configuration in ConfigMaps. It works perfectly until it doesn't. Perhaps you are managing a database cluster, and the cluster topology data grows. Suddenly, you hit the 1 MiB limit of Kubernetes ConfigMaps. Splitting data across multiple ConfigMaps becomes a nightmare of race conditions and unmanageable YAML.&lt;/p&gt;

&lt;p&gt;You need a durable, writable store that is accessible by all replicas of your operator.&lt;/p&gt;

&lt;p&gt;In this article, we explore how to move beyond ConfigMaps by embedding a distributed, Raft-based SQLite database directly into your Go operator. We will cover the architecture, resource overhead, and provide a complete code example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: 3 Nodes, 1 State
&lt;/h2&gt;

&lt;p&gt;Imagine you are running a high-availability Operator deployment with 3 replicas to ensure leadership election and fault tolerance.&lt;/p&gt;

&lt;p&gt;If you just write to a local file system or &lt;code&gt;sqlite.db&lt;/code&gt; file on disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Sync&lt;/strong&gt;: Node A writes data, but Node B and Node C never see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loss&lt;/strong&gt;: If Node A crashes and gets rescheduled, the local file is lost (unless you use PVs, but even then, the new pod might not get the old volume).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corruption&lt;/strong&gt;: You cannot simply mount a shared file system (like NFS) and have three SQLite instances write to it simultaneously. SQLite locks will fight, and the database will likely corrupt, because SQLite does not support concurrent writers.&lt;/li&gt;
&lt;/ul&gt;
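&lt;p&gt;The single-writer limitation is easy to demonstrate with Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module: a second connection cannot even begin a write transaction while the first holds one (over NFS it is worse, because the file locks themselves are unreliable):&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "state.db")
a = sqlite3.connect(path, timeout=0.1, isolation_level=None)
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)

a.execute("CREATE TABLE state (k TEXT, v TEXT)")
a.execute("BEGIN IMMEDIATE")                  # connection A takes the write lock
a.execute("INSERT INTO state VALUES ('x', '1')")

locked = False
try:
    b.execute("BEGIN IMMEDIATE")              # connection B cannot acquire it
except sqlite3.OperationalError as exc:
    locked = True
    print(exc)                                # "database is locked"
a.execute("COMMIT")
```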

&lt;p&gt;We need a solution that is durable, synchronized, and lightweight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Requirements for Operator State
&lt;/h2&gt;

&lt;p&gt;Before looking at tools, we must define what a robust operator state store requires in a Kubernetes environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong Consistency&lt;/strong&gt;: When managing infrastructure (like a database cluster), two replicas cannot have different views of the truth. We need a system that ensures all nodes agree on the state before proceeding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Availability&lt;/strong&gt;: The store must survive the loss of a pod. In a 3-node setup, the system should remain fully operational even if one node is down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal Footprint&lt;/strong&gt;: Kubernetes operators often run in resource-constrained environments. The database should not require massive CPU or RAM overhead that eclipses the operator's actual logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Dependency Architecture&lt;/strong&gt;: Ideally, the solution should not require an external service (like a managed database) or a complex sidecar. Adding external components increases the complexity and the number of edge cases that need to be handled. A self-contained binary simplifies lifecycle management and reduces networking overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relational Capabilities&lt;/strong&gt;: While Key-Value stores are common, having the ability to perform SQL joins and complex queries on cluster metadata significantly simplifies operator logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Landscape of Solutions
&lt;/h2&gt;

&lt;p&gt;Before writing custom code, we evaluated the standard architectural patterns for this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Sidecar Approach (rqlite / LiteFS)
&lt;/h3&gt;

&lt;p&gt;You can run a database process alongside your operator container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rqlite&lt;/strong&gt;: A distributed database that uses SQLite as its engine. It uses HTTP for queries and handles Raft consensus for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteFS&lt;/strong&gt;: A FUSE-based file system that replicates SQLite files across nodes by intercepting writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: While robust, sidecars introduce "lifecycle entanglement." You must ensure the sidecar is healthy before the operator starts, handle local network latency between containers, and manage double the resource requests/limits per pod. It also complicates &lt;code&gt;kubectl logs&lt;/code&gt; and debugging as you're monitoring two distinct processes per replica.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The "Kubernetes Native" Approach (etcd)
&lt;/h3&gt;

&lt;p&gt;K8s uses etcd, so why shouldn't you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Using the cluster's internal etcd (via the K8s API) brings you back to the 1MiB limit per object and strict rate limiting. Running your own etcd cluster inside the operator’s namespace is an option, but etcd is notoriously sensitive to disk latency and requires significant "babysitting" (backups, defragmentation, and member management). Furthermore, you lose the ability to perform relational queries, forcing you to implement complex indexing in your Go code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. External Database Service (Managed RDS / Self-hosted Postgres)
&lt;/h3&gt;

&lt;p&gt;You could connect the operator to an external database like PostgreSQL or MySQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: This moves the state outside the cluster's blast radius, but introduces significant networking hurdles. You must manage VPC peering, subnet routing, and IAM roles or Kubernetes Secrets for credentials. If the operator runs in a restricted environment (like an air-gapped cluster), an external DB might be physically unreachable. Additionally, the latency of a cross-network SQL query can slow down the reconciliation loop compared to a locally embedded store.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Embedded Approach (Go + Raft + SQLite)
&lt;/h3&gt;

&lt;p&gt;Since Kubernetes Operators are typically written in Go, we can embed the distribution logic directly into the binary using libraries that integrate Raft consensus with the SQLite driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: This solution fits perfectly given the requirements. It creates a single, self-healing binary that manages its own replication. There are no extra containers to patch, no external credentials to rotate, and it leverages the same Persistent Volumes already assigned to the operator pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Embedded Raft Consensus
&lt;/h2&gt;

&lt;p&gt;We chose an approach using an embeddable library (like Hiqlite or Dqlite) that bundles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt;: For SQL storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raft&lt;/strong&gt;: For consensus (ensuring all 3 nodes agree on the data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP/TCP Transport&lt;/strong&gt;: To replicate logs between nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How it handles "Simultaneous" Writes
&lt;/h3&gt;

&lt;p&gt;A common concern is concurrency. If operator Node A manages "Cluster X" and operator Node B manages "Cluster Y", and they write simultaneously, what happens?&lt;/p&gt;

&lt;p&gt;Distributed SQLite utilizes &lt;strong&gt;Serialized Writes&lt;/strong&gt;. Even if requests come in parallel, the Raft Leader ingests them, orders them in a log, and applies them sequentially.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: While this sounds slow, Raft can handle hundreds of operations per second—far more than what a typical Operator needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Writes are atomic, meaning Node C never sees a 'partial' transaction. Reads can be configured as Strong (guaranteed latest data from Leader) or Stale (fast local reads), giving you flexibility between correctness and performance.&lt;/li&gt;
&lt;/ul&gt;
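&lt;p&gt;The serialization model above can be sketched in plain Go: any number of goroutines (our "operator nodes") submit writes concurrently, but a single lock, standing in for the Raft leader's log, assigns each command a monotonic index and applies it to the state machine one at a time. The &lt;code&gt;store&lt;/code&gt; type and its API are purely illustrative, not from any real library.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// logEntry mimics a Raft log entry: a monotonically increasing index
// plus the command to apply to the state machine.
type logEntry struct {
	index int
	key   string
	value string
}

// store serializes all writes the way a Raft leader does: no matter how
// many goroutines call Apply concurrently, entries are ordered into a
// log and applied one at a time.
type store struct {
	mu    sync.Mutex
	log   []logEntry
	state map[string]string
}

func newStore() *store {
	s := new(store)
	s.state = make(map[string]string)
	return s
}

// Apply appends the command to the ordered log and applies it atomically,
// returning the log index it was assigned.
func (s *store) Apply(key, value string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	entry := logEntry{index: len(s.log) + 1, key: key, value: value}
	s.log = append(s.log, entry)
	s.state[key] = value // state machine apply
	return entry.index
}

func main() {
	s := newStore()
	var wg sync.WaitGroup
	// Two "operator nodes" writing state for different clusters at once.
	for _, name := range []string{"cluster-x", "cluster-y"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			s.Apply(n, "Healthy")
		}(name)
	}
	wg.Wait()
	fmt.Println("entries:", len(s.log), "cluster-x:", s.state["cluster-x"])
}
```

&lt;p&gt;Even with concurrent callers, every entry receives a unique, gap-free index; that ordering guarantee is exactly what Raft provides across the network instead of a local mutex.&lt;/p&gt;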

&lt;h2&gt;
  
  
  Resource Overhead
&lt;/h2&gt;

&lt;p&gt;Operators must be lightweight. Here is the estimated overhead of embedding a Raft/SQLite node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: Negligible when idle. During consensus and log replication (writes), expect spikes to 100-200m (millicores) as nodes handle serialization, log persistence (fsync), and active network exchange.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline&lt;/strong&gt;: ~64MiB (Estimated based on standard Go runtime + Raft log cache + SQLite page cache).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under Load&lt;/strong&gt;: 256MiB - 512MiB (depending on caching strategy and query complexity).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage&lt;/strong&gt;: Minimal. The Raft log is compacted into SQLite snapshots periodically.&lt;/li&gt;

&lt;/ul&gt;
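&lt;p&gt;Translated into pod-spec terms, those estimates might look like the following. The numbers are illustrative starting points, not measured values; tune them against your own workload:&lt;/p&gt;

```yaml
# Illustrative container resources for an operator embedding Raft/SQLite.
resources:
  requests:
    cpu: 100m       # mostly idle; consensus bursts are short
    memory: 128Mi   # ~64MiB baseline plus headroom
  limits:
    cpu: 500m
    memory: 512Mi   # upper bound expected under load
```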

&lt;h2&gt;
  
  
  Implementation: A Go-Based Stateful Operator
&lt;/h2&gt;

&lt;p&gt;Below is an illustrative example using a hypothetical Go integration of &lt;a href="https://github.com/sebadob/hiqlite" rel="noopener noreferrer"&gt;hiqlite&lt;/a&gt; (a representative library for this pattern; treat the API shown here as a sketch rather than its actual interface) to create a self-healing 3-node cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StatefulSet&lt;/strong&gt;: You must deploy this as a StatefulSet so pods get stable names (&lt;code&gt;operator-0&lt;/code&gt;, &lt;code&gt;operator-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless Service&lt;/strong&gt;: To allow pods to resolve each other's IPs by DNS.&lt;/li&gt;
&lt;/ul&gt;
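&lt;p&gt;A minimal sketch of those two objects, with names matching the peer addresses used in the code (&lt;code&gt;my-operator-0.operator-svc.default.svc.cluster.local&lt;/code&gt;). Namespace, labels, and the image are placeholders:&lt;/p&gt;

```yaml
# Headless Service: gives each pod a stable DNS name such as
# my-operator-0.operator-svc.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: operator-svc
spec:
  clusterIP: None          # headless
  selector:
    app: my-operator
  ports:
    - name: raft
      port: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-operator
spec:
  serviceName: operator-svc   # links pods to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: my-operator
  template:
    metadata:
      labels:
        app: my-operator
    spec:
      containers:
        - name: operator
          image: example.com/my-operator:latest   # placeholder image
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: data
              mountPath: /var/lib/operator/data
  volumeClaimTemplates:       # one PVC per replica, rebound by pod name
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```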

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="c"&gt;// Replace with your chosen Raft/SQLite library&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/sebadob/hiqlite"&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// ClusterData represents the schema for our Valkey clusters&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ClusterData&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ID&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Status&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;NodeCount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 1. Identity &amp;amp; Discovery&lt;/span&gt;
    &lt;span class="c"&gt;// In K8s StatefulSets, POD_NAME is stable (e.g., "my-operator-0")&lt;/span&gt;
    &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"POD_NAME"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"POD_NAME env var is required"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Define the peers. In a real operator, you might generate this string &lt;/span&gt;
    &lt;span class="c"&gt;// based on the Replicas count in your Helm chart.&lt;/span&gt;
    &lt;span class="n"&gt;peers&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Format: {pod_name}.{service_name}.{namespace}.svc.cluster.local:{port}&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-0.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-1.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-2.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 2. Initialize the Distributed DB&lt;/span&gt;
    &lt;span class="c"&gt;// This starts the Raft listener and SQLite engine&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;NodeId&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Address&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c"&gt;// Listen on this pod's network&lt;/span&gt;
        &lt;span class="n"&gt;DataDir&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"/var/lib/operator/data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c"&gt;// Must be a PersistentVolume&lt;/span&gt;
        &lt;span class="n"&gt;Members&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;peers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Secret&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"cluster-shared-secret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c"&gt;// basic security&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to initialize distributed store: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 3. Schema Migration (Idempotent)&lt;/span&gt;
    &lt;span class="c"&gt;// Usually only the Raft leader executes this, but the library handles forwarding.&lt;/span&gt;
    &lt;span class="n"&gt;initSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// 4. Start the Operator Loop&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;runOperatorLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Keep main process alive&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;`
    CREATE TABLE IF NOT EXISTS valkey_clusters (
        id TEXT PRIMARY KEY,
        status TEXT,
        node_count INTEGER,
        updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
    );`&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Schema init warning: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runOperatorLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Simulate reconciliation loop&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// WRITE OPERATION&lt;/span&gt;
        &lt;span class="c"&gt;// We insert/update state. If this node is a Follower, &lt;/span&gt;
        &lt;span class="c"&gt;// the library forwards the write to the Leader transparently.&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"INSERT OR REPLACE INTO valkey_clusters (id, status, node_count) VALUES (?, ?, ?)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"cluster-primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Healthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] Failed to sync state: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] State synced successfully via Raft"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// READ OPERATION&lt;/span&gt;
        &lt;span class="c"&gt;// Reads can be strongly consistent (via Leader) or stale (local)&lt;/span&gt;
        &lt;span class="c"&gt;// depending on configuration.&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"SELECT status, node_count FROM valkey_clusters WHERE id = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;"cluster-primary"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] Current World State: Status=%s, Nodes=%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways for Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence is Mandatory&lt;/strong&gt;: Even though Raft replicates data, you must use PersistentVolumes (PVCs) for the underlying storage directory (&lt;code&gt;/var/lib/operator/data&lt;/code&gt;). If the entire cluster restarts, in-memory data is lost. The PVC ensures the Raft log survives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling Failures&lt;/strong&gt;: If one node goes down, the other two continue to operate (Quorum = 2). When the failed node comes back, it will automatically "catch up" by downloading the missing logs or a full snapshot from the leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readiness Probes&lt;/strong&gt;: Don't mark your operator pod as "Ready" until the DB has joined the Raft cluster. This prevents K8s from routing traffic to a node that isn't synced yet. When a new pod joins (e.g., during a scale-up or replacement), it will start in a "Catch-up" state, replaying the Raft log from the leader until its local SQLite state matches the cluster consensus.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During this catch-up phase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Writes&lt;/strong&gt;: Any write request initiated by the new node will immediately work because the library transparently forwards the command to the current cluster Leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt;: Stale local reads are available immediately but will return outdated data. Strongly consistent reads will only work once the node has joined the Raft group and synchronized its state, as they require a round-trip to the Leader to verify the latest index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only once this synchronization is complete should the readiness probe pass; alternatively, build logic into the operator itself to wait for synchronization, depending on your business requirements. Either way, the goal is to ensure the operator never reconciles against potentially stale data in its local view.&lt;/p&gt;
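&lt;p&gt;A minimal sketch of such a readiness gate in Go. The &lt;code&gt;synced&lt;/code&gt; flag and the place it gets flipped are assumptions: each Raft library exposes its own "caught up" notification hook, so wire the flag to whatever callback yours provides:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// synced is flipped to true once the local node has replayed the Raft
// log and joined the cluster; until then the pod reports NotReady.
var synced atomic.Bool

// readyStatus maps the sync flag to an HTTP status code.
func readyStatus() int {
	if synced.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// readyzHandler backs the Kubernetes readinessProbe: 200 once local
// state matches cluster consensus, 503 while still catching up.
func readyzHandler(w http.ResponseWriter, r *http.Request) {
	code := readyStatus()
	w.WriteHeader(code)
	fmt.Fprintln(w, http.StatusText(code))
}

func main() {
	http.HandleFunc("/readyz", readyzHandler)
	// In a real operator, a callback from the Raft library would flip
	// this flag once the node reports it has caught up.
	synced.Store(true)
	fmt.Println("readyz status:", readyStatus())
	// http.ListenAndServe(":8081", nil) // uncomment to serve the probe
}
```

&lt;p&gt;Point the StatefulSet's &lt;code&gt;readinessProbe&lt;/code&gt; at &lt;code&gt;/readyz&lt;/code&gt; and Kubernetes will keep the pod out of rotation until the flag flips.&lt;/p&gt;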

&lt;h2&gt;
  
  
  Why Not Use a Standard Deployment?
&lt;/h2&gt;

&lt;p&gt;While it is technically possible to run this architecture using a standard Kubernetes &lt;code&gt;Deployment&lt;/code&gt;, it introduces significant operational complexity. If you choose to avoid &lt;code&gt;StatefulSets&lt;/code&gt;, you must manually manage the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quorum Management &amp;amp; Membership Changes&lt;/strong&gt;: Raft requires a majority (Quorum) to perform any action, including removing a dead node. In a &lt;code&gt;Deployment&lt;/code&gt;, if a pod dies and a new one starts with a random name, the cluster size effectively increases. If you don't explicitly remove the "old" node identity, you risk losing quorum during subsequent failures because the leader will keep trying to contact a node that no longer exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity-to-Storage Mapping&lt;/strong&gt;: Standard &lt;code&gt;Deployments&lt;/code&gt; do not guarantee which pod gets which Persistent Volume. You would need to write custom logic to ensure a new pod can find and mount the specific volume containing its previous Raft log and SQLite state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Peer Discovery&lt;/strong&gt;: Without the stable DNS names provided by a Headless Service and &lt;code&gt;StatefulSet&lt;/code&gt; (e.g., &lt;code&gt;operator-0.svc&lt;/code&gt;), your nodes must constantly query the Kubernetes API to discover the current IPs of their peers and update the Raft membership list dynamically, which is prone to race conditions during split-brain scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;StatefulSets&lt;/code&gt; simplify this by providing stable hostnames and predictable volume bindings, allowing your operator to focus on business logic rather than cluster coordination plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Moving your Operator's state from ConfigMaps to a distributed SQLite instance allows you to scale beyond the 1MiB limit while maintaining the simplicity of a single Go binary. By leveraging libraries like Hiqlite or Dqlite, you gain SQL capabilities, strong consistency, and high availability, making your Operator robust enough for critical production workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  Build More Resilient Systems with Aonnis
&lt;/h3&gt;

&lt;p&gt;If you're managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;Aonnis Valkey Operator&lt;/a&gt;&lt;/strong&gt;. It helps you deploy and manage high-performance Valkey-compatible clusters on Kubernetes with built-in best practices for reliability and scale. It is free for a limited time.&lt;/p&gt;

&lt;p&gt;Visit &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;www.aonnis.com&lt;/a&gt;&lt;/strong&gt; to learn more. If a feature you need is not available, let us know at &lt;a href="mailto:support@aonnis.com"&gt;support@aonnis.com&lt;/a&gt; and we will try to ship it within two weeks, depending on its complexity.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
