<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aonnis</title>
    <description>The latest articles on Forem by Aonnis (@aonnis).</description>
    <link>https://forem.com/aonnis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3683869%2F6152fd27-206a-4d6f-9bbc-d9dc88c2fcf0.png</url>
      <title>Forem: Aonnis</title>
      <link>https://forem.com/aonnis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aonnis"/>
    <language>en</language>
    <item>
      <title>Thundering Herds: The Scalability Killer</title>
      <dc:creator>Aonnis</dc:creator>
      <pubDate>Thu, 01 Jan 2026 08:00:00 +0000</pubDate>
      <link>https://forem.com/aonnis/thundering-herds-the-scalability-killer-41mh</link>
      <guid>https://forem.com/aonnis/thundering-herds-the-scalability-killer-41mh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ts15g1wirqtnohgycro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ts15g1wirqtnohgycro.png" alt="Thundering Herds: The Scalability Killer" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine it’s 3:00 AM. Your pager goes off. The dashboard shows a 100% CPU spike on your primary database, followed by a total service outage. You look at the logs and see a weird pattern: the traffic didn't actually increase, but suddenly every single request started failing at the exact same millisecond.&lt;/p&gt;

&lt;p&gt;You’ve just been trampled by the Thundering Herd.&lt;/p&gt;

&lt;p&gt;In this post, we’re going to dive into one of the most common yet misunderstood performance bottlenecks in distributed systems: the Thundering Herd problem, and how to use a combination of Request Collapsing and Jitter to build systems that don’t collapse under their own weight.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Thundering Herd?
&lt;/h2&gt;

&lt;p&gt;At its core, the Thundering Herd occurs when a large number of processes wait for the same event; when it fires, they all wake up at once, even though only one of them can actually "handle" it.&lt;/p&gt;

&lt;p&gt;While the term originated in OS kernel scheduling, modern web engineers most frequently encounter it in the form of a Cache Stampede.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Anatomy of a Crash
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Golden State:&lt;/strong&gt; You have a high-traffic endpoint cached in Redis. Everything is fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Expiry:&lt;/strong&gt; The cache TTL (Time-to-Live) hits zero.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Stampede:&lt;/strong&gt; 5,000 concurrent users refresh the page. They all see a cache miss.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Collapse:&lt;/strong&gt; All 5,000 requests hit your database simultaneously to re-generate the same data. Load surges, latency skyrockets, and the service goes down.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; 5,000 requests is an arbitrary number; the actual breaking point depends on your system's capacity.&lt;/p&gt;
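&lt;p&gt;The anti-pattern is easy to spot once you know it. A sketch of the naive read-through cache that produces exactly this stampede (&lt;code&gt;fetch_from_db&lt;/code&gt; stands in for your real query):&lt;/p&gt;

```python
import threading
import time

cache = {}
db_calls = []  # tracks load on the "database"

def fetch_from_db(key):
    db_calls.append(key)        # every entry here is one DB query
    time.sleep(0.05)            # simulated query latency
    return f"value-for-{key}"

def naive_get(key):
    if key in cache:
        return cache[key]
    value = fetch_from_db(key)  # cache miss: EVERY caller goes to the DB
    cache[key] = value
    return value

# 50 concurrent requests arrive just after the key expired:
threads = [threading.Thread(target=naive_get, args=("hot",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(db_calls))  # close to 50 queries for the same key
```

&lt;p&gt;With request collapsing in place (covered below), the same experiment reports a single query.&lt;/p&gt;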

&lt;h2&gt;
  
  
  Beyond the Cache: Other Thundering Herd Scenarios
&lt;/h2&gt;

&lt;p&gt;While cache stampedes are the most common, the Thundering Herd can manifest across your entire stack:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The "Welcome Back" Surge (Downstream Recovery)
&lt;/h3&gt;

&lt;p&gt;Imagine your primary Auth service goes down for 5 minutes. During this time, every other service in your cluster is failing and retrying. When the Auth service finally comes back up, it is immediately hit by a flood of requests from all the other services trying to "catch up." This often knocks the service right back down again—a phenomenon known as a &lt;strong&gt;Retry Storm&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Auth Token Expiry
&lt;/h3&gt;

&lt;p&gt;In microservices, many internal services might share a common access token (like a machine-to-machine JWT). If that token has a hard expiry and 50 different microservices all see it expire at the exact same second, they will all "thunder" toward the Identity Provider to get a new one.&lt;/p&gt;
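&lt;p&gt;The standard fix is to have each service refresh its token at a randomized point &lt;em&gt;before&lt;/em&gt; expiry rather than at the expiry instant. A minimal sketch (the 10-20% window is an arbitrary choice; tune it to your token TTL):&lt;/p&gt;

```python
import random

def next_refresh_delay(ttl_seconds):
    """Schedule a token refresh in the last 10-20% of its lifetime.

    Each of the 50 services picks its own random point, so refresh
    calls to the Identity Provider spread over minutes, not one second.
    """
    early = ttl_seconds * random.uniform(0.10, 0.20)
    return ttl_seconds - early

# For a 1-hour token, refresh somewhere between minute 48 and minute 54:
print(next_refresh_delay(3600))
```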

&lt;h3&gt;
  
  
  3. "Top of the Hour" Scheduled Tasks
&lt;/h3&gt;

&lt;p&gt;A classic ops mistake is scheduling a heavy cleanup cron job to run at &lt;code&gt;0 * * * *&lt;/code&gt; (the top of every hour) across 100 different server nodes. At precisely :00:00 of every hour, your database or shared storage is hit by 100 heavy processes simultaneously.&lt;/p&gt;
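&lt;p&gt;The cheapest fix is a random "splay" at the start of the job. All 100 nodes still wake at the same minute, but the heavy work smears over a window instead of firing in the same second. A sketch (&lt;code&gt;max_splay&lt;/code&gt; is whatever smear your storage can absorb):&lt;/p&gt;

```python
import random
import time

def run_with_splay(job, max_splay=300):
    # Sleep a random 0-5 minutes before the heavy work; each node
    # picks its own delay, so they no longer hit storage together.
    time.sleep(random.uniform(0, max_splay))
    job()
```

&lt;p&gt;If you use systemd timers, &lt;code&gt;RandomizedDelaySec=&lt;/code&gt; gives you the same behavior without any code.&lt;/p&gt;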

&lt;h3&gt;
  
  
  4. CDN "Warm-up" and Deployment Surge
&lt;/h3&gt;

&lt;p&gt;When you deploy a new version of a 500MB mobile app binary, it isn't in any CDN edge caches yet. If you immediately notify 1 million users to download it, the first few thousand requests will all miss the edge and hit your origin server at once, potentially melting your storage layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Detect the Herd (Monitoring &amp;amp; Metrics)
&lt;/h2&gt;

&lt;p&gt;You don't want your first notification of a thundering herd to be a total outage. Look for these "herd signatures" in your dashboard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Correlation of Cache Misses and Latency&lt;/strong&gt;: A sudden spike in cache miss rates that perfectly aligns with a surge in p99 database latency.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Connection Pool Exhaustion&lt;/strong&gt;: If you see your database connection pool hitting its max limit within milliseconds, you likely have a stampede.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;CPU Context Switching&lt;/strong&gt;: On your application servers, a massive spike in "System CPU" or context switches indicates that thousands of threads are waking up and fighting for the same locks.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Error Logs&lt;/strong&gt;: Thousands of "lock wait timeout" or "connection refused" errors occurring in a tight cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Strategy 1: Request Collapsing (The "Wait in Line" Approach)
&lt;/h3&gt;

&lt;p&gt;Request collapsing (also known as Promise Memoization) is the practice of ensuring that for any given resource, only one upstream request is active at a time.&lt;/p&gt;

&lt;p&gt;If Request A is already fetching user_data_123 from the database, Requests B, C, and D shouldn't start their own fetches. Instead, they should "subscribe" to the result of Request A.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem with Naive Collapsing
&lt;/h4&gt;

&lt;p&gt;If you implement a simple lock, you often run into a secondary issue: Busy-Waiting. If 4,999 requests are waiting for that one database call to finish, how do they know when it's done? If they all check "Is it ready yet?" every 10ms, you’ve just created a new herd in your application memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Solution: Event-Based Notification
&lt;/h4&gt;

&lt;p&gt;To fix this, we need to move from a polling model to a notification model. Instead of repeatedly asking "Is it done?", the waiting requests should simply go to sleep and ask to be woken up when the data is ready.&lt;/p&gt;

&lt;p&gt;In Python or Node.js, this is often handled natively by Promises or Futures. In other languages, you might use Condition Variables or Channels.&lt;/p&gt;

&lt;p&gt;Here is a Python example using asyncio. Notice how we use a shared Event object. The "followers" simply await the event, consuming zero CPU while they wait for the "leader" to finish the work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RequestCollapser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Stores the events for keys currently being fetched
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Check if data is already in cache
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Check if someone else is already fetching it
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; joining the herd (waiting)...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;--- Crucial: Zero CPU usage while waiting
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Be the "Leader"
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is the LEADER. Fetching from DB...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Simulate DB fetch
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fresh Data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# 4. Notify the herd
&lt;/span&gt;            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Wakes up all waiters instantly
&lt;/span&gt;            &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inflight_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Giant Herd: Distributed Collapsing
&lt;/h4&gt;

&lt;p&gt;The Python example above works perfectly for a single server. But what if you have 100 app servers? You still have 100 "leaders" hitting your database at once. That may or may not be a problem, depending on your database's capacity. If it is, you need to coordinate leadership across the cluster, not just within one process.&lt;/p&gt;

&lt;p&gt;To solve this at scale, you can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Distributed Locks (Redis/Etcd)&lt;/strong&gt;: Use a library like &lt;code&gt;Redlock&lt;/code&gt; to ensure only one node in the entire cluster becomes the leader for a specific key.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The "Singleflight" Pattern&lt;/strong&gt;: In Go, the &lt;code&gt;golang.org/x/sync/singleflight&lt;/code&gt; package is the gold standard for this. It handles the local collapsing logic efficiently, and when combined with a distributed lock, it protects both your app memory and your database.&lt;/li&gt;
&lt;/ol&gt;
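&lt;p&gt;If your app tier is Python rather than Go, the singleflight pattern is a few dozen lines with a lock and a per-key event. A thread-based sketch (waiters receive &lt;code&gt;None&lt;/code&gt; if the leader fails; production code should propagate errors):&lt;/p&gt;

```python
import threading

class SingleFlight:
    """Collapses concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> {"event": Event, "result": ...}

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            if entry is not None:
                is_leader = False
            else:
                entry = {"event": threading.Event(), "result": None}
                self._inflight[key] = entry
                is_leader = True

        if not is_leader:
            entry["event"].wait()   # sleep until the leader finishes
            return entry["result"]  # None if the leader raised

        try:
            entry["result"] = fn()
            return entry["result"]
        finally:
            with self._lock:
                del self._inflight[key]
            entry["event"].set()    # wake every waiter at once
```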




&lt;h3&gt;
  
  
  Strategy 2: Jitter (The "Social Distancing" for Data)
&lt;/h3&gt;

&lt;p&gt;Collapsing handles requests that are already in flight, but it can't stop thousands of clients from firing at the same instant in the first place. This is where Jitter comes in. Jitter is the introduction of intentional, controlled randomness to stagger execution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Staggered Retries
&lt;/h4&gt;

&lt;p&gt;When a request finds that a resource is being "collapsed" (someone else is already fetching it), don't let it retry on a fixed interval.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bad:&lt;/strong&gt; Retry every 50ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Good:&lt;/strong&gt; Retry every 50ms + random(0, 20ms).&lt;/li&gt;
&lt;/ol&gt;
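&lt;p&gt;In practice you combine this with exponential backoff. The "full jitter" variant (popularized by AWS's "Exponential Backoff and Jitter" article) randomizes over the entire backoff window, which desynchronizes clients fastest:&lt;/p&gt;

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Full-jitter exponential backoff.

    The ceiling doubles each attempt (capped at `cap` seconds), and
    each client picks a uniform random delay below it, so clients
    that failed together stop retrying together within a few attempts.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```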

&lt;h4&gt;
  
  
  Staggered Expirations
&lt;/h4&gt;

&lt;p&gt;Never set a hard TTL on a batch of keys. If you update 10,000 products and set them all to expire in exactly 1 hour, you are scheduling a disaster for exactly 60 minutes from now. Instead, use &lt;code&gt;TTL = 3600 + (rand() * 120)&lt;/code&gt;. This spreads the "thundering" over a 2-minute window, which your database can likely handle.&lt;/p&gt;
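&lt;p&gt;The same formula in Python (the constants are illustrative):&lt;/p&gt;

```python
import random

BASE_TTL = 3600   # one hour
JITTER = 120      # two-minute smear window

def jittered_ttl():
    # Each key gets a slightly different expiry, so a batch of keys
    # written together does not expire together.
    return int(BASE_TTL + random.uniform(0, JITTER))
```

&lt;p&gt;Pass the result as the expiry on every write, e.g. &lt;code&gt;redis.set(key, value, ex=jittered_ttl())&lt;/code&gt; with redis-py.&lt;/p&gt;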

&lt;h4&gt;
  
  
  The Pro Move: Probabilistic Early Refresh
&lt;/h4&gt;

&lt;p&gt;The most resilient systems I've built use a technique called X-Fetch. Instead of waiting for the cache to expire, we use jitter to trigger a refresh slightly before expiration.&lt;/p&gt;

&lt;p&gt;As the TTL approaches zero, each request performs a "dice roll." If the roll is low, that specific request takes the lead, re-fetches the data, and resets the cache. Because the roll is random and independent for every user, statistically only a handful of requests trigger the update, while everyone else keeps getting the "stale but safe" data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_resilient_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Handle Cache Miss
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. Calculate time remaining
&lt;/span&gt;        &lt;span class="n"&gt;time_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Handle Negative Time (Expired) or Probabilistic Check
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;time_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Probability increases as time_remaining approaches 0
&lt;/span&gt;            &lt;span class="c1"&gt;# Note: We check &amp;lt;= 0 above to avoid DivisionByZero or negative probability
&lt;/span&gt;            &lt;span class="n"&gt;should_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;time_remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_refresh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Collapse requests using a distributed lock or local future map
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;collapse_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="c1"&gt;# Fallback to stale data on DB failure
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Defense: Safety Nets
&lt;/h2&gt;

&lt;p&gt;Sometimes, despite your best efforts with Jitter or Collapsing, a herd still breaks through. In those moments, you need a final line of defense to keep your system alive:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Load Shedding&lt;/strong&gt;: When your database connection pool is full, don't keep queuing requests (which just increases latency). Start dropping them with a &lt;code&gt;503 Service Unavailable&lt;/code&gt;. It’s better to fail 10% of users quickly than to make 100% of users wait 30 seconds for a timeout.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Circuit Breakers&lt;/strong&gt;: If your database is struggling, the circuit breaker "trips" and stops all traffic for a cool-down period. This gives your DB the breathing room it needs to recover without being continuously bombarded by retries.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rate Limiting&lt;/strong&gt;: By capping the number of requests per second (globally or per-user), you ensure that even a massive "herd" can't exceed your system's hard limits. Excess requests are throttled with a &lt;code&gt;429 Too Many Requests&lt;/code&gt;, protecting your infrastructure from being overwhelmed.&lt;/li&gt;
&lt;/ol&gt;
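&lt;p&gt;A token bucket is enough to sketch both rate limiting and basic load shedding at the edge of a service. A minimal, single-process version (production setups usually put this in a gateway or a shared store such as Redis):&lt;/p&gt;

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`; refill at `rate` tokens/second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed: respond 429 (per-user cap) or 503 (overload)
```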

&lt;h2&gt;
  
  
  Choosing Your Weapon: Strategy Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Implementation Complexity&lt;/th&gt;
&lt;th&gt;Best Used For...&lt;/th&gt;
&lt;th&gt;Main Drawback&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Retries, TTL Expirations&lt;/td&gt;
&lt;td&gt;Doesn't stop the initial spike, just spreads it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Request Collapsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High-traffic single keys (e.g., Homepage)&lt;/td&gt;
&lt;td&gt;Can become a complex "leader" bottleneck.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Fetch (Probabilistic)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Mission-critical low-latency data&lt;/td&gt;
&lt;td&gt;Adds pre-emptive load to your database.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Scaling isn't just about adding more servers; it's about managing the coordination between them. By implementing Request Collapsing, you protect your downstream resources. By adding Jitter, you protect your coordination layer from itself.&lt;/p&gt;

&lt;p&gt;The next time you set a cache TTL, ask yourself: "What happens if 10,000 people ask for this at the same time?" If the answer is "they all wait for the DB," it's time to add some jitter.&lt;/p&gt;

&lt;p&gt;If you enjoyed this deep dive into systems engineering, feel free to follow for more insights on building resilient distributed systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build More Resilient Systems with Aonnis
&lt;/h2&gt;

&lt;p&gt;If you're managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;Aonnis Valkey Operator&lt;/a&gt;&lt;/strong&gt;. It helps you deploy and manage high-performance Valkey compatible clusters on Kubernetes with built-in best practices for reliability and scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surprise&lt;/strong&gt;: It is free for a limited time.&lt;/p&gt;

&lt;p&gt;Visit &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;www.aonnis.com&lt;/a&gt;&lt;/strong&gt; to learn more.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>performance</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Solving the 1MiB ConfigMap Limit in Kubernetes</title>
      <dc:creator>Aonnis</dc:creator>
      <pubDate>Wed, 31 Dec 2025 08:19:17 +0000</pubDate>
      <link>https://forem.com/aonnis/solving-the-1mib-configmap-limit-in-kubernetes-14m9</link>
      <guid>https://forem.com/aonnis/solving-the-1mib-configmap-limit-in-kubernetes-14m9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy9felv3e2mturjjhmo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdy9felv3e2mturjjhmo0.png" alt="1MiB ConfigMap limit in Kubernetes" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you build a Kubernetes Operator in 2025, you eventually hit the "State Problem."&lt;/p&gt;

&lt;p&gt;You start simple: storing configuration in ConfigMaps. It works perfectly until it doesn't. Perhaps you are managing a database cluster, and the cluster topology data grows. Suddenly, you hit the 1 MiB limit of Kubernetes ConfigMaps. Splitting data across multiple ConfigMaps becomes a nightmare of race conditions and unmanageable YAML.&lt;/p&gt;

&lt;p&gt;You need a durable, writable store that is accessible by all replicas of your operator.&lt;/p&gt;

&lt;p&gt;In this article, we explore how to move beyond ConfigMaps by embedding a distributed, Raft-based SQLite database directly into your Go operator. We will cover the architecture, resource overhead, and provide a complete code example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: 3 Nodes, 1 State
&lt;/h2&gt;

&lt;p&gt;Imagine you are running a high-availability Operator deployment with 3 replicas to ensure leadership election and fault tolerance.&lt;/p&gt;

&lt;p&gt;If you just write to a local file system or &lt;code&gt;sqlite.db&lt;/code&gt; file on disk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Sync&lt;/strong&gt;: Node A writes data, but Node B and Node C never see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loss&lt;/strong&gt;: If Node A crashes and gets rescheduled, the local file is lost (unless you use PVs, but even then, the new pod might not get the old volume).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corruption&lt;/strong&gt;: You cannot simply mount a shared file system (like NFS) and have three SQLite instances write to it simultaneously. SQLite locks will fight, and the database will likely corrupt, because SQLite does not support concurrent writers.&lt;/li&gt;
&lt;/ul&gt;
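&lt;p&gt;The single-writer limitation is easy to demonstrate with Python's standard &lt;code&gt;sqlite3&lt;/code&gt; module: a second connection cannot even begin a write transaction while the first holds one (over NFS it is worse, because the file locks themselves are unreliable):&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "state.db")
a = sqlite3.connect(path, timeout=0.1, isolation_level=None)
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)

a.execute("CREATE TABLE state (k TEXT, v TEXT)")
a.execute("BEGIN IMMEDIATE")                  # connection A takes the write lock
a.execute("INSERT INTO state VALUES ('x', '1')")

locked = False
try:
    b.execute("BEGIN IMMEDIATE")              # connection B cannot acquire it
except sqlite3.OperationalError as exc:
    locked = True
    print(exc)                                # "database is locked"
a.execute("COMMIT")
```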

&lt;p&gt;We need a solution that is durable, synchronized, and lightweight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Requirements for Operator State
&lt;/h2&gt;

&lt;p&gt;Before looking at tools, we must define what a robust operator state store requires in a Kubernetes environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong Consistency&lt;/strong&gt;: When managing infrastructure (like a database cluster), two replicas cannot have different views of the truth. We need a system that ensures all nodes agree on the state before proceeding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Availability&lt;/strong&gt;: The store must survive the loss of a pod. In a 3-node setup, the system should remain fully operational even if one node is down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal Footprint&lt;/strong&gt;: Kubernetes operators often run in resource-constrained environments. The database should not require massive CPU or RAM overhead that eclipses the operator's actual logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Dependency Architecture&lt;/strong&gt;: Ideally, the solution should not require an external service (like a managed database) or a complex sidecar. Adding external components increases the complexity and the number of edge cases that need to be handled. A self-contained binary simplifies lifecycle management and reduces networking overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relational Capabilities&lt;/strong&gt;: While Key-Value stores are common, having the ability to perform SQL joins and complex queries on cluster metadata significantly simplifies operator logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Landscape of Solutions
&lt;/h2&gt;

&lt;p&gt;Before writing custom code, we evaluated the standard architectural patterns for this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Sidecar Approach (rqlite / LiteFS)
&lt;/h3&gt;

&lt;p&gt;You can run a database process alongside your operator container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;rqlite&lt;/strong&gt;: A distributed database that uses SQLite as its engine. It uses HTTP for queries and handles Raft consensus for you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiteFS&lt;/strong&gt;: A FUSE-based file system that replicates SQLite files across nodes by intercepting writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: While robust, sidecars introduce "lifecycle entanglement." You must ensure the sidecar is healthy before the operator starts, handle local network latency between containers, and manage double the resource requests/limits per pod. It also complicates &lt;code&gt;kubectl logs&lt;/code&gt; and debugging as you're monitoring two distinct processes per replica.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The "Kubernetes Native" Approach (etcd)
&lt;/h3&gt;

&lt;p&gt;K8s uses etcd, so why shouldn't you?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: Using the cluster's internal etcd (via the K8s API) brings you back to the 1MiB limit per object and strict rate limiting. Running your own etcd cluster inside the operator’s namespace is an option, but etcd is notoriously sensitive to disk latency and requires significant "babysitting" (backups, defragmentation, and member management). Furthermore, you lose the ability to perform relational queries, forcing you to implement complex indexing in your Go code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. External Database Service (Managed RDS / Self-hosted Postgres)
&lt;/h3&gt;

&lt;p&gt;You could connect the operator to an external database like PostgreSQL or MySQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: This moves the state outside the cluster's blast radius, but introduces significant networking hurdles. You must manage VPC peering, subnet routing, and IAM roles or Kubernetes Secrets for credentials. If the operator runs in a restricted environment (like an air-gapped cluster), an external DB might be physically unreachable. Additionally, the latency of a cross-network SQL query can slow down the reconciliation loop compared to a locally embedded store.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The Embedded Approach (Go + Raft + SQLite)
&lt;/h3&gt;

&lt;p&gt;Since Kubernetes Operators are typically written in Go, we can embed the distribution logic directly into the binary using libraries that integrate Raft consensus with the SQLite driver.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict&lt;/strong&gt;: This solution fits perfectly given the requirements. It creates a single, self-healing binary that manages its own replication. There are no extra containers to patch, no external credentials to rotate, and it leverages the same Persistent Volumes already assigned to the operator pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Embedded Raft Consensus
&lt;/h2&gt;

&lt;p&gt;We chose an approach using an embeddable library (like Hiqlite or Dqlite) that bundles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt;: For SQL storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raft&lt;/strong&gt;: For consensus (ensuring all 3 nodes agree on the data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP/TCP Transport&lt;/strong&gt;: To replicate logs between nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How it handles "Simultaneous" Writes
&lt;/h3&gt;

&lt;p&gt;A common concern is concurrency. If operator Node A manages "Cluster X" and operator Node B manages "Cluster Y", and they write simultaneously, what happens?&lt;/p&gt;

&lt;p&gt;Distributed SQLite utilizes &lt;strong&gt;Serialized Writes&lt;/strong&gt;. Even if requests come in parallel, the Raft Leader ingests them, orders them in a log, and applies them sequentially.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput&lt;/strong&gt;: While this sounds slow, Raft can handle hundreds of operations per second—far more than what a typical Operator needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Writes are atomic, meaning Node C never sees a 'partial' transaction. Reads can be configured as Strong (guaranteed latest data from Leader) or Stale (fast local reads), giving you flexibility between correctness and performance.&lt;/li&gt;
&lt;/ul&gt;
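&lt;p&gt;The serialization model above can be sketched in plain Go: any number of goroutines (our "operator nodes") submit writes concurrently, but a single lock, standing in for the Raft leader's log, assigns each command a monotonic index and applies it to the state machine one at a time. The &lt;code&gt;store&lt;/code&gt; type and its API are purely illustrative, not from any real library.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// logEntry mimics a Raft log entry: a monotonically increasing index
// plus the command to apply to the state machine.
type logEntry struct {
	index int
	key   string
	value string
}

// store serializes all writes the way a Raft leader does: no matter how
// many goroutines call Apply concurrently, entries are ordered into a
// log and applied one at a time.
type store struct {
	mu    sync.Mutex
	log   []logEntry
	state map[string]string
}

func newStore() *store {
	s := new(store)
	s.state = make(map[string]string)
	return s
}

// Apply appends the command to the ordered log and applies it atomically,
// returning the log index it was assigned.
func (s *store) Apply(key, value string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	entry := logEntry{index: len(s.log) + 1, key: key, value: value}
	s.log = append(s.log, entry)
	s.state[key] = value // state machine apply
	return entry.index
}

func main() {
	s := newStore()
	var wg sync.WaitGroup
	// Two "operator nodes" writing state for different clusters at once.
	for _, name := range []string{"cluster-x", "cluster-y"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			s.Apply(n, "Healthy")
		}(name)
	}
	wg.Wait()
	fmt.Println("entries:", len(s.log), "cluster-x:", s.state["cluster-x"])
}
```

&lt;p&gt;Even with concurrent callers, every entry receives a unique, gap-free index; that ordering guarantee is exactly what Raft provides across the network instead of a local mutex.&lt;/p&gt;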

&lt;h2&gt;
  
  
  Resource Overhead
&lt;/h2&gt;

&lt;p&gt;Operators must be lightweight. Here is the estimated overhead of embedding a Raft/SQLite node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt;: Negligible when idle. During consensus and log replication (writes), expect spikes to 100-200m (millicores) as nodes handle serialization, log persistence (fsync), and active network exchange.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Baseline&lt;/strong&gt;: ~64MiB (Estimated based on standard Go runtime + Raft log cache + SQLite page cache).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Under Load&lt;/strong&gt;: 256MiB - 512MiB (depending on caching strategy and query complexity).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage&lt;/strong&gt;: Minimal. The Raft log is compacted into SQLite snapshots periodically.&lt;/li&gt;

&lt;/ul&gt;
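&lt;p&gt;Translated into pod-spec terms, those estimates might look like the following. The numbers are illustrative starting points, not measured values; tune them against your own workload:&lt;/p&gt;

```yaml
# Illustrative container resources for an operator embedding Raft/SQLite.
resources:
  requests:
    cpu: 100m       # mostly idle; consensus bursts are short
    memory: 128Mi   # ~64MiB baseline plus headroom
  limits:
    cpu: 500m
    memory: 512Mi   # upper bound expected under load
```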

&lt;h2&gt;
  
  
  Implementation: A Go-Based Stateful Operator
&lt;/h2&gt;

&lt;p&gt;Below is an illustrative example using a hypothetical Go integration of &lt;a href="https://github.com/sebadob/hiqlite" rel="noopener noreferrer"&gt;hiqlite&lt;/a&gt; (a representative library for this pattern; treat the API shown here as a sketch rather than its actual interface) to create a self-healing 3-node cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StatefulSet&lt;/strong&gt;: You must deploy this as a StatefulSet so pods get stable names (&lt;code&gt;operator-0&lt;/code&gt;, &lt;code&gt;operator-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless Service&lt;/strong&gt;: To allow pods to resolve each other's IPs by DNS.&lt;/li&gt;
&lt;/ul&gt;
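&lt;p&gt;A minimal sketch of those two objects, with names matching the peer addresses used in the code (&lt;code&gt;my-operator-0.operator-svc.default.svc.cluster.local&lt;/code&gt;). Namespace, labels, and the image are placeholders:&lt;/p&gt;

```yaml
# Headless Service: gives each pod a stable DNS name such as
# my-operator-0.operator-svc.default.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: operator-svc
spec:
  clusterIP: None          # headless
  selector:
    app: my-operator
  ports:
    - name: raft
      port: 8080
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-operator
spec:
  serviceName: operator-svc   # links pods to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: my-operator
  template:
    metadata:
      labels:
        app: my-operator
    spec:
      containers:
        - name: operator
          image: example.com/my-operator:latest   # placeholder image
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: data
              mountPath: /var/lib/operator/data
  volumeClaimTemplates:       # one PVC per replica, rebound by pod name
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```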

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"log"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;

    &lt;span class="c"&gt;// Replace with your chosen Raft/SQLite library&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/sebadob/hiqlite"&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// ClusterData represents the schema for our Valkey clusters&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ClusterData&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ID&lt;/span&gt;        &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Status&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;NodeCount&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 1. Identity &amp;amp; Discovery&lt;/span&gt;
    &lt;span class="c"&gt;// In K8s StatefulSets, POD_NAME is stable (e.g., "my-operator-0")&lt;/span&gt;
    &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"POD_NAME"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"POD_NAME env var is required"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Define the peers. In a real operator, you might generate this string &lt;/span&gt;
    &lt;span class="c"&gt;// based on the Replicas count in your Helm chart.&lt;/span&gt;
    &lt;span class="n"&gt;peers&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// Format: {pod_name}.{service_name}.{namespace}.svc.cluster.local:{port}&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-0.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-1.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"my-operator-2.operator-svc.default.svc.cluster.local:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 2. Initialize the Distributed DB&lt;/span&gt;
    &lt;span class="c"&gt;// This starts the Raft listener and SQLite engine&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;NodeId&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Address&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s:8080"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c"&gt;// Listen on this pod's network&lt;/span&gt;
        &lt;span class="n"&gt;DataDir&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"/var/lib/operator/data"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c"&gt;// Must be a PersistentVolume&lt;/span&gt;
        &lt;span class="n"&gt;Members&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;peers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Secret&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"cluster-shared-secret"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c"&gt;// basic security&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatalf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Failed to initialize distributed store: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// 3. Schema Migration (Idempotent)&lt;/span&gt;
    &lt;span class="c"&gt;// Usually only the Raft leader executes this, but the library handles forwarding.&lt;/span&gt;
    &lt;span class="n"&gt;initSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// 4. Start the Operator Loop&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;runOperatorLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// Keep main process alive&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;initSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;`
    CREATE TABLE IF NOT EXISTS valkey_clusters (
        id TEXT PRIMARY KEY,
        status TEXT,
        node_count INTEGER,
        updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
    );`&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Schema init warning: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;runOperatorLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;hiqlite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Simulate reconciliation loop&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c"&gt;// WRITE OPERATION&lt;/span&gt;
        &lt;span class="c"&gt;// We insert/update state. If this node is a Follower, &lt;/span&gt;
        &lt;span class="c"&gt;// the library forwards the write to the Leader transparently.&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"INSERT OR REPLACE INTO valkey_clusters (id, status, node_count) VALUES (?, ?, ?)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"cluster-primary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Healthy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] Failed to sync state: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] State synced successfully via Raft"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c"&gt;// READ OPERATION&lt;/span&gt;
        &lt;span class="c"&gt;// Reads can be strongly consistent (via Leader) or stale (local)&lt;/span&gt;
        &lt;span class="c"&gt;// depending on configuration.&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
        &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
            &lt;span class="s"&gt;"SELECT status, node_count FROM valkey_clusters WHERE id = ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;"cluster-primary"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[%s] Current World State: Status=%s, Nodes=%d&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways for Production
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistence is Mandatory&lt;/strong&gt;: Even though Raft replicates data, you must use PersistentVolumes (PVCs) for the underlying storage directory (&lt;code&gt;/var/lib/operator/data&lt;/code&gt;). If the entire cluster restarts, in-memory data is lost. The PVC ensures the Raft log survives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling Failures&lt;/strong&gt;: If one node goes down, the other two continue to operate (Quorum = 2). When the failed node comes back, it will automatically "catch up" by downloading the missing logs or a full snapshot from the leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Readiness Probes&lt;/strong&gt;: Don't mark your operator pod as "Ready" until the DB has joined the Raft cluster. This prevents K8s from routing traffic to a node that isn't synced yet. When a new pod joins (e.g., during a scale-up or replacement), it will start in a "Catch-up" state, replaying the Raft log from the leader until its local SQLite state matches the cluster consensus.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During this catch-up phase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Writes&lt;/strong&gt;: Any write request initiated by the new node will immediately work because the library transparently forwards the command to the current cluster Leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt;: Stale local reads are available immediately but will return outdated data. Strongly consistent reads will only work once the node has joined the Raft group and synchronized its state, as they require a round-trip to the Leader to verify the latest index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only once this synchronization is complete should the readiness probe pass; alternatively, build logic into the operator itself to wait for synchronization, depending on your business requirements. Either way, the goal is to ensure the operator never reconciles against potentially stale data in its local view.&lt;/p&gt;
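&lt;p&gt;A minimal sketch of such a readiness gate in Go. The &lt;code&gt;synced&lt;/code&gt; flag and the place it gets flipped are assumptions: each Raft library exposes its own "caught up" notification hook, so wire the flag to whatever callback yours provides:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// synced is flipped to true once the local node has replayed the Raft
// log and joined the cluster; until then the pod reports NotReady.
var synced atomic.Bool

// readyStatus maps the sync flag to an HTTP status code.
func readyStatus() int {
	if synced.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// readyzHandler backs the Kubernetes readinessProbe: 200 once local
// state matches cluster consensus, 503 while still catching up.
func readyzHandler(w http.ResponseWriter, r *http.Request) {
	code := readyStatus()
	w.WriteHeader(code)
	fmt.Fprintln(w, http.StatusText(code))
}

func main() {
	http.HandleFunc("/readyz", readyzHandler)
	// In a real operator, a callback from the Raft library would flip
	// this flag once the node reports it has caught up.
	synced.Store(true)
	fmt.Println("readyz status:", readyStatus())
	// http.ListenAndServe(":8081", nil) // uncomment to serve the probe
}
```

&lt;p&gt;Point the StatefulSet's &lt;code&gt;readinessProbe&lt;/code&gt; at &lt;code&gt;/readyz&lt;/code&gt; and Kubernetes will keep the pod out of rotation until the flag flips.&lt;/p&gt;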

&lt;h2&gt;
  
  
  Why Not Use a Standard Deployment?
&lt;/h2&gt;

&lt;p&gt;While it is technically possible to run this architecture using a standard Kubernetes &lt;code&gt;Deployment&lt;/code&gt;, it introduces significant operational complexity. If you choose to avoid &lt;code&gt;StatefulSets&lt;/code&gt;, you must manually manage the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quorum Management &amp;amp; Membership Changes&lt;/strong&gt;: Raft requires a majority (Quorum) to perform any action, including removing a dead node. In a &lt;code&gt;Deployment&lt;/code&gt;, if a pod dies and a new one starts with a random name, the cluster size effectively increases. If you don't explicitly remove the "old" node identity, you risk losing quorum during subsequent failures because the leader will keep trying to contact a node that no longer exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identity-to-Storage Mapping&lt;/strong&gt;: Standard &lt;code&gt;Deployments&lt;/code&gt; do not guarantee which pod gets which Persistent Volume. You would need to write custom logic to ensure a new pod can find and mount the specific volume containing its previous Raft log and SQLite state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Peer Discovery&lt;/strong&gt;: Without the stable DNS names provided by a Headless Service and &lt;code&gt;StatefulSet&lt;/code&gt; (e.g., &lt;code&gt;operator-0.svc&lt;/code&gt;), your nodes must constantly query the Kubernetes API to discover the current IPs of their peers and update the Raft membership list dynamically, which is prone to race conditions during split-brain scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;StatefulSets&lt;/code&gt; simplify this by providing stable hostnames and predictable volume bindings, allowing your operator to focus on business logic rather than cluster coordination plumbing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Moving your Operator's state from ConfigMaps to a distributed SQLite instance allows you to scale beyond the 1MiB limit while maintaining the simplicity of a single Go binary. By leveraging libraries like Hiqlite or Dqlite, you gain SQL capabilities, strong consistency, and high availability, making your Operator robust enough for critical production workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  Build More Resilient Systems with Aonnis
&lt;/h3&gt;

&lt;p&gt;If you're managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;Aonnis Valkey Operator&lt;/a&gt;&lt;/strong&gt;. It helps you deploy and manage high-performance Valkey-compatible clusters on Kubernetes with built-in best practices for reliability and scale. It is free for a limited time.&lt;/p&gt;

&lt;p&gt;Visit &lt;strong&gt;&lt;a href="https://aonnis.com" rel="noopener noreferrer"&gt;www.aonnis.com&lt;/a&gt;&lt;/strong&gt; to learn more. If a feature you need is not available, let us know at &lt;a href="mailto:support@aonnis.com"&gt;support@aonnis.com&lt;/a&gt; and we will try to ship it within two weeks, depending on its complexity.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
