<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gabriel Anhaia</title>
    <description>The latest articles on Forem by Gabriel Anhaia (@gabrielanhaia).</description>
    <link>https://forem.com/gabrielanhaia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F425693%2F531a245a-bc24-453e-bb2b-eb7077a3da8b.png</url>
      <title>Forem: Gabriel Anhaia</title>
      <link>https://forem.com/gabrielanhaia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gabrielanhaia"/>
    <language>en</language>
    <item>
      <title>Request Hedging: The Tail-at-Scale Technique Most Teams Skip</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 20:30:28 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/request-hedging-the-tail-at-scale-technique-most-teams-skip-2j37</link>
      <guid>https://forem.com/gabrielanhaia/request-hedging-the-tail-at-scale-technique-most-teams-skip-2j37</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Your p99 is 2 seconds. Your p50 is 80 milliseconds. That gap is what wakes you up. It's also what your SLO actually measures, and the tricks you've been reaching for (bigger pods, more replicas, a fatter cache) barely move it.&lt;/p&gt;

&lt;p&gt;Jeff Dean and Luiz Barroso wrote about this in 2013 in a paper called &lt;a href="https://research.google/pubs/the-tail-at-scale/" rel="noopener noreferrer"&gt;&lt;em&gt;The Tail at Scale&lt;/em&gt;&lt;/a&gt; (CACM, Vol. 56, No. 2). The headline finding: in a large-fanout Google service they measured, a single leaf request had a p99 of 10ms, yet the p99 for the whole fanout to finish was 140ms. Tail latency compounds with fanout. You can't outrun that by tuning the average.&lt;/p&gt;

&lt;p&gt;One of the techniques they shipped at Google to fix it (request hedging) is dead simple, well understood, and somehow still missing from most production codebases in 2026. So let's put it back on the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tail-at-scale problem still applies
&lt;/h2&gt;

&lt;p&gt;The math hasn't changed since 2013. If a single backend has a 1% chance of being slow on any given request, a request that fans out to 100 backends has a &lt;code&gt;1 - (0.99)^100 ≈ 63%&lt;/code&gt; chance of hitting at least one slow backend. Your end-to-end latency is the &lt;em&gt;max&lt;/em&gt; across all backends, not the average.&lt;/p&gt;

&lt;p&gt;The same effect shows up in non-fanout systems too. A user-facing request that depends on three serial calls (auth, profile, recommendations) sees its p99 stack. Each hop has its own bad day, and the bad days line up more often than your intuition expects.&lt;/p&gt;
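
&lt;p&gt;A minimal sketch of that arithmetic, assuming slow responses are independent across calls. The function name &lt;code&gt;p_any_slow&lt;/code&gt; and the call counts are mine, not the paper's:&lt;/p&gt;

```python
def p_any_slow(p_slow: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** calls


if __name__ == "__main__":
    # 1% slow rate per call, as in the text; the call counts are illustrative.
    for calls in (1, 3, 10, 100):
        chance = p_any_slow(0.01, calls)
        print(f"{calls:>3} calls: {chance:.1%} chance of at least one slow call")
```

&lt;p&gt;The same formula covers the serial case: three hops at 1% each give roughly a 3% chance that at least one of them is slow.&lt;/p&gt;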

&lt;p&gt;Causes haven't changed either: GC pauses, hot keys, kernel scheduling glitches, contended locks, noisy neighbors on shared hardware, NIC queue backpressure. You can fix individual sources, but you'll never zero them out. The tail is structural.&lt;/p&gt;

&lt;p&gt;That's why Dean and Barroso framed the solution as tail-tolerant techniques rather than latency-reduction tricks. Hedging is the one you should ship first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What hedging actually is
&lt;/h2&gt;

&lt;p&gt;The pattern: send the request to backend A. If you haven't gotten a response by the time you'd expect a slow-but-not-broken reply (say, your p95), send the same request to backend B. Take whichever response comes back first. Cancel the loser.&lt;/p&gt;

&lt;p&gt;That's it. No retries on failure. No queue. You're betting that the slow tail of one backend isn't correlated with the slow tail of another, which is usually true when the slowness comes from GC, kernel, or local hot-spot causes.&lt;/p&gt;

&lt;p&gt;The Dean/Barroso paper reports a benchmark that reads 1,000 keys from a BigTable table spread across 100 servers: sending a hedged request after a 10ms delay cut the 99.9th-percentile latency from 1,800ms to 74ms while sending only about 2% more requests. Those numbers depend on your workload, but the shape holds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of hedging
&lt;/h2&gt;

&lt;p&gt;You're sending extra requests, so you're doing extra work. The question is how much.&lt;/p&gt;

&lt;p&gt;If you hedge at the p95, then by definition roughly 5% of requests trigger a second call. Of those, the second call usually completes faster than the first would have, but you still pay for one extra request about 5% of the time. That's where the usual 5–10% overhead estimate comes from, and it assumes cancellation is fast enough that the loser stops working as soon as the winner returns.&lt;/p&gt;

&lt;p&gt;If you hedge at the p50, half of your requests trigger a second call and your backend traffic grows by roughly 50%. Don't do that.&lt;/p&gt;

&lt;p&gt;If you hedge at the p99 you're too late. The user has already noticed.&lt;/p&gt;

&lt;p&gt;The threshold matters. Measure it from production data, not from a load test.&lt;/p&gt;
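
&lt;p&gt;One way to get that number, sketched in Python on the assumption that you can export a list of per-request latencies in milliseconds for a single endpoint. The helper names are illustrative; &lt;code&gt;statistics.quantiles&lt;/code&gt; is from the standard library:&lt;/p&gt;

```python
import statistics


def hedge_threshold(latencies_ms: list[float], percentile: int = 95) -> float:
    """Return the given percentile of observed latencies, in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[percentile - 1]


def hedge_fraction(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests still pending when the threshold elapses."""
    slow = sum(1 for value in latencies_ms if value > threshold_ms)
    return slow / len(latencies_ms)


if __name__ == "__main__":
    # Made-up sample: 1ms to 100ms in 1ms steps, standing in for real data.
    sample = [float(ms) for ms in range(1, 101)]
    threshold = hedge_threshold(sample)
    print(f"hedge after {threshold:.2f}ms")
    print(f"{hedge_fraction(sample, threshold):.0%} of requests would hedge")
```

&lt;p&gt;The hedge fraction doubles as an upper bound on extra requests, since each hedged request sends at most one additional call. Recompute it per endpoint; a single global threshold tends to over-hedge the slow endpoints and under-hedge the fast ones.&lt;/p&gt;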

&lt;h2&gt;
  
  
  A working Go implementation
&lt;/h2&gt;

&lt;p&gt;The minimum viable Go version uses two goroutines, a buffered channel, and &lt;code&gt;context.WithCancel&lt;/code&gt;. Real, no foo/bar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;hedged&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"errors"&lt;/span&gt;
    &lt;span class="s"&gt;"time"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Body&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;Err&lt;/span&gt;  &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Fetch issues the primary request, then fires a hedge after p95 latency.&lt;/span&gt;
&lt;span class="c"&gt;// The first result that is not a cancellation artefact is returned, success or error. The loser is cancelled.&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;Fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithCancel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// always cancel the loser, plus tear down on parent cancel&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// buffered so the loser's send never blocks&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="n"&gt;hedgeTimer&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTimer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;hedgeTimer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// primary returned before the hedge fired&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;hedgeTimer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="c"&gt;// primary is slow; fire the hedge&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="c"&gt;// take the first response that isn't a context-cancel artefact&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="c"&gt;// loser may have errored because we cancelled it; ignore and wait&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Is&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Canceled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hedged: both attempts failed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;defer cancel()&lt;/code&gt; line does the heavy lifting. The moment the function returns (winner found, error raised, or parent cancelled), the still-in-flight loser sees its context die and should abort whatever it was doing. If &lt;code&gt;call&lt;/code&gt; is a &lt;code&gt;net/http&lt;/code&gt; request built with &lt;code&gt;http.NewRequestWithContext&lt;/code&gt;, the transport aborts the in-flight request promptly: over HTTP/1.1 it closes the connection, over HTTP/2 it resets the stream. That's how you keep the overhead at single-digit percent instead of double.&lt;/p&gt;

&lt;p&gt;One subtle bit: the &lt;code&gt;results&lt;/code&gt; channel is buffered to 2 so the loser's send doesn't deadlock when nobody's reading. Skip that and you leak goroutines.&lt;/p&gt;

&lt;h2&gt;
  
  
  A working Python implementation
&lt;/h2&gt;

&lt;p&gt;Same pattern in asyncio. The trick is &lt;code&gt;asyncio.wait&lt;/code&gt; with &lt;code&gt;FIRST_COMPLETED&lt;/code&gt;, then explicit cancellation of the pending task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypeVar&lt;/span&gt;

&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TypeVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hedged_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# wait up to p95 for the primary
&lt;/span&gt;    &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# primary is slow; fire the hedge
&lt;/span&gt;    &lt;span class="n"&gt;hedge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hedge&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;return_when&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FIRST_COMPLETED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# cancel the loser; await it so we don't leak the task
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CancelledError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# loser exceptions are uninteresting
&lt;/span&gt;
    &lt;span class="n"&gt;winner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;winner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;call()&lt;/code&gt; wraps an &lt;code&gt;httpx.AsyncClient.get&lt;/code&gt;, &lt;code&gt;task.cancel()&lt;/code&gt; propagates down and closes the connection. If it wraps a &lt;code&gt;requests.get&lt;/code&gt; running in a thread executor, cancelling the task only abandons the await: the worker thread keeps running the request to completion. That's exactly the case where hedging makes your load &lt;em&gt;worse&lt;/em&gt;, not better. Use async clients all the way down or skip hedging at this layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hedging at the load balancer
&lt;/h2&gt;

&lt;p&gt;You don't have to write any of the above if your traffic flows through Envoy. The &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-msg-config-route-v3-hedgepolicy" rel="noopener noreferrer"&gt;&lt;code&gt;hedge_policy&lt;/code&gt;&lt;/a&gt; on a route does it for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;route_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;profile_service&lt;/span&gt;
  &lt;span class="na"&gt;virtual_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;profile&lt;/span&gt;
      &lt;span class="na"&gt;domains&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile.internal"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;profile_cluster&lt;/span&gt;
            &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
            &lt;span class="na"&gt;retry_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;retry_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5xx,reset"&lt;/span&gt;
              &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
              &lt;span class="na"&gt;per_try_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.4s&lt;/span&gt;
            &lt;span class="na"&gt;hedge_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;initial_requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
              &lt;span class="na"&gt;additional_request_chance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;numerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
                &lt;span class="na"&gt;denominator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HUNDRED&lt;/span&gt;
              &lt;span class="na"&gt;hedge_on_per_try_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;hedge_on_per_try_timeout: true&lt;/code&gt; flag is the one that matters. When the per-try timeout (400ms here) fires, Envoy issues a retry without resetting the original request and races the two; the first to complete successfully is returned. It only takes effect alongside a &lt;code&gt;retry_policy&lt;/code&gt; like the one above. Pair this with a &lt;code&gt;per_try_timeout&lt;/code&gt; set to your measured p95 and you've got hedging without writing a line of Go or Python.&lt;/p&gt;

&lt;p&gt;Istio inherits this through its Envoy dataplane. If you're on a service mesh you may already have the primitive sitting there, unused.&lt;/p&gt;

&lt;p&gt;The application-layer version is more flexible. You can hedge selectively based on request type, vary the threshold per endpoint, or hedge against a different backend entirely. The mesh version is easier to ship and harder to get wrong. Pick based on whether you need that flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to hedge
&lt;/h2&gt;

&lt;p&gt;A short list of places hedging will hurt you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-idempotent operations.&lt;/strong&gt; &lt;code&gt;POST /payments&lt;/code&gt;, &lt;code&gt;PUT /counter/increment&lt;/code&gt;, anything that mutates state. Two requests means two payments. There is no clever idempotency-key trick that makes this safe by default. You have to engineer the dedup explicitly and you usually shouldn't bother.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expensive backends.&lt;/strong&gt; If each call costs you GPU seconds or a $0.02 LLM token bill, hedging means paying twice for 5% of requests. Do the math on your unit economics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backends with their own cascading retries.&lt;/strong&gt; If service B retries service C three times before returning, your hedge fires a &lt;em&gt;second&lt;/em&gt; tree of retries. The amplification gets ugly fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful sessions.&lt;/strong&gt; WebSockets, gRPC streaming, anything sticky. Hedging assumes the request is a pure function of its input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule of thumb: hedge read paths that hit shared backends. Don't hedge anything else.&lt;/p&gt;
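
&lt;p&gt;That rule can be encoded as a small guard in front of the hedge. This is a sketch only: the &lt;code&gt;streaming&lt;/code&gt; and &lt;code&gt;expensive&lt;/code&gt; flags are hypothetical inputs your routing layer would have to supply, and the method set is a subset of what RFC 9110 defines as safe.&lt;/p&gt;

```python
# A subset of the HTTP methods that RFC 9110 defines as safe (read-only).
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def should_hedge(method: str, streaming: bool = False, expensive: bool = False) -> bool:
    """Return True only for plain, cheap, read-only requests."""
    if method.upper() not in SAFE_METHODS:
        return False  # may mutate state; a duplicate could apply twice
    if streaming or expensive:
        return False  # sticky sessions and costly backends are excluded
    return True


if __name__ == "__main__":
    print(should_hedge("GET"))                   # plain read
    print(should_hedge("POST"))                  # mutation
    print(should_hedge("GET", expensive=True))   # costly backend
```

&lt;p&gt;Defaulting to "don't hedge" for anything the guard doesn't recognise keeps a new endpoint from being hedged by accident.&lt;/p&gt;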

&lt;h2&gt;
  
  
  The outage gotcha: pair it with a circuit breaker
&lt;/h2&gt;

&lt;p&gt;This is the part everyone misses, and it's the part that turns hedging from a tail-latency win into an outage amplifier.&lt;/p&gt;

&lt;p&gt;When a backend gets unhealthy, every request crosses your p95 threshold. Every request triggers a hedge. You've just doubled your traffic to a backend that was already failing. The hedges hit the same dying instances. They time out too. You retry. The cluster dies faster.&lt;/p&gt;

&lt;p&gt;Dean and Barroso flagged this in the paper. The mitigation is non-optional: hedging must be coupled with a circuit breaker that opens when the failure rate or hedge rate crosses a ceiling.&lt;/p&gt;

&lt;p&gt;Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track your hedge rate. If it exceeds, say, 15% over a 10-second window, stop hedging entirely until it drops below 10%. The percentages are workload-specific; pick numbers that mean "this isn't tail latency, this is an outage."&lt;/li&gt;
&lt;li&gt;Track downstream failure rate. Standard circuit-breaker behaviour: open on errors, half-open to probe, close on recovery. The hedge logic should only fire when the breaker is closed.&lt;/li&gt;
&lt;li&gt;Cap concurrent hedged requests. A bounded semaphore that drops the hedge attempt (not the primary) when saturated.&lt;/li&gt;
&lt;/ul&gt;

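&lt;p&gt;In code, the simplest form wraps the hedge call in a breaker check. Here &lt;code&gt;breaker&lt;/code&gt; stands in for whatever circuit-breaker instance you use, assumed to expose &lt;code&gt;Allow&lt;/code&gt; and &lt;code&gt;Record&lt;/code&gt;:&lt;br&gt;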
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;FetchSafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;breaker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Allow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c"&gt;// circuit open: skip hedging entirely, just run the primary&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;Fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;breaker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Envoy's hedge policy doesn't pair this for you automatically. You configure the outlier detection separately, and you need both: outlier detection ejects bad upstreams, hedging covers the tail of the healthy ones.&lt;/p&gt;
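
&lt;p&gt;The first and third bullets above (the hedge-rate cutoff and the concurrency cap) are not part of &lt;code&gt;FetchSafe&lt;/code&gt;. A minimal sketch of how they could fit together follows, written in Python for brevity rather than the Go used above. The &lt;code&gt;HedgeGate&lt;/code&gt; class, its method names, and the &lt;code&gt;min_samples&lt;/code&gt; and &lt;code&gt;max_concurrent&lt;/code&gt; defaults are all hypothetical; the 15%/10%/10-second numbers are the illustrative ones from the list. One assumption matters: the gate counts requests that &lt;em&gt;would&lt;/em&gt; have hedged (the primary outlived the hedge delay), not hedges actually sent, otherwise the rate would fall to zero the moment hedging is switched off.&lt;/p&gt;

```python
import threading
import time
from collections import deque


class HedgeGate:
    """Hypothetical gate: stops hedging when the recent hedge rate looks like an outage."""

    def __init__(self, trip=0.15, resume=0.10, window_s=10.0,
                 min_samples=20, max_concurrent=32, clock=time.monotonic):
        self.trip = trip                  # disable hedging above this rate
        self.resume = resume              # re-enable once the rate falls below this
        self.window_s = window_s
        self.min_samples = min_samples    # avoid tripping on a handful of requests
        self.clock = clock
        self.events = deque()             # (timestamp, wanted_hedge), one per primary request
        self.enabled = True
        self.slots = threading.BoundedSemaphore(max_concurrent)
        self.lock = threading.Lock()

    def record(self, wanted_hedge):
        """Call once per primary; wanted_hedge is True if it outlived the hedge delay."""
        with self.lock:
            now = self.clock()
            self.events.append((now, wanted_hedge))
            # drop samples that have aged out of the sliding window
            while now - self.events[0][0] > self.window_s:
                self.events.popleft()
            if self.min_samples > len(self.events):
                return
            rate = sum(1 for _, wanted in self.events if wanted) / len(self.events)
            if self.enabled and rate > self.trip:
                self.enabled = False
            elif not self.enabled and self.resume > rate:
                self.enabled = True

    def try_start_hedge(self):
        """True if a hedge may be sent now; never blocks, so a full cap drops the hedge."""
        return self.enabled and self.slots.acquire(blocking=False)

    def finish_hedge(self):
        """Release the slot taken by a successful try_start_hedge call."""
        self.slots.release()
```

&lt;p&gt;The wrapper would call &lt;code&gt;record&lt;/code&gt; once per primary request and &lt;code&gt;try_start_hedge&lt;/code&gt; before firing a hedge, pairing every successful call with &lt;code&gt;finish_hedge&lt;/code&gt;. The gap between the trip and resume thresholds is what keeps the gate from flapping.&lt;/p&gt;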

&lt;h2&gt;
  
  
  What to ship this week
&lt;/h2&gt;

&lt;p&gt;Measure your real p95 per endpoint. Pick the three highest-fanout read paths in your system. Add hedging at the p95 threshold with a circuit breaker. Watch your p99 drop and your overhead stay under 10%.&lt;/p&gt;

&lt;p&gt;If you're on Envoy or Istio, ship the &lt;code&gt;hedge_policy&lt;/code&gt; first — it's a config change, no code. If you're in a service that calls a database or a downstream API directly, ship the application-layer version with proper context cancellation. Either way, instrument the hedge rate as a first-class metric. The day it spikes is the day you'll be glad you wired up the breaker.&lt;/p&gt;

&lt;p&gt;The 2013 paper called this "good enough" engineering against unavoidable variability. Thirteen years on, that's still the right framing. You can't make the tail go away. You can race it.&lt;/p&gt;

&lt;p&gt;What's the p99/p50 gap on your hottest read path right now, and what's stopping you from shipping a hedge against it this week?&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;Hedging sits in a family of patterns (timeouts, retries, circuit breakers, load shedding, backpressure) that decide whether your system survives its own scale. The &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals&lt;/a&gt; walks through the trade-offs in the latency and reliability chapters, with the same level of "show me the actual config" that this post tried to hit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dz9ehd0n8k7iax7x19i.jpg" alt="System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>performance</category>
      <category>distributedsystems</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Pull-Based vs Push-Based Architecture: The Choice That Decides Your Reliability Story</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 20:30:04 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/pull-based-vs-push-based-architecture-the-choice-that-decides-your-reliability-story-63n</link>
      <guid>https://forem.com/gabrielanhaia/pull-based-vs-push-based-architecture-the-choice-that-decides-your-reliability-story-63n</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Pull-based and push-based aren't just delivery styles. They're reliability stories. One absorbs producer spikes; the other absorbs consumer outages. Picking the wrong one means your most predictable failure shape becomes your worst.&lt;/p&gt;

&lt;p&gt;Most teams pick by accident. Someone says "we need webhooks," everyone nods, and three months later a marketing campaign fires a million events at a receiver that was sized for a quiet Wednesday. Or someone picks Kafka because it's on the resume bingo card, and now a tiny webhook integration runs through a 7-broker cluster with a 4-week retention policy.&lt;/p&gt;

&lt;p&gt;The right way to pick is to ask: which failure shape can I afford?&lt;/p&gt;

&lt;h2&gt;
  
  
  Push and pull, in failure-shape terms
&lt;/h2&gt;

&lt;p&gt;A push system has the producer drive the data. The producer decides when to send, how much to send, and where to send it. The consumer is the passenger. When the producer is calm, life is easy. When the producer panics (a viral campaign, a backfill job, a fan-out from a single upstream event) the consumer absorbs the punch.&lt;/p&gt;

&lt;p&gt;A pull system inverts the relationship. The consumer drives. It decides when to fetch, how much to fetch, and how fast to drain. The producer writes to a buffer (a log, a queue, a table) and walks away. When the consumer is slow, work piles up in the buffer. When the consumer dies, work waits.&lt;/p&gt;

&lt;p&gt;That's the whole tradeoff. Push gives you low latency and amplifies producer spikes. Pull gives you natural backpressure and survives consumer outages.&lt;/p&gt;

&lt;p&gt;Everything else in this post is a footnote on that one sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Push: webhooks, server-sent events, server-initiated jobs
&lt;/h2&gt;

&lt;p&gt;The push family is broader than people think. Webhooks are push. Server-sent events are push. WebSocket fan-out is push. A cron job firing HTTP POSTs at downstream services is push. AWS SNS topics that fan out to HTTPS endpoints are push.&lt;/p&gt;

&lt;p&gt;What unites them: the producer decides the timing, and the network round-trip happens at production time.&lt;/p&gt;

&lt;p&gt;The good news is latency. A push event lands at the consumer milliseconds after it's produced. That's why payment processors use webhooks for &lt;code&gt;payment.succeeded&lt;/code&gt;. The merchant wants to update the order page right now, not in 30 seconds when a poller wakes up.&lt;/p&gt;

&lt;p&gt;The bad news is that the consumer has no say. If the producer fires 50,000 webhooks per second and the consumer can absorb 5,000, the producer doesn't know or care. The 45,000 overflow becomes failed deliveries, retries, or both. Stripe, for example, retries failed webhooks with exponential backoff for up to 3 days, but most teams' receivers fall over long before the retries help.&lt;/p&gt;

&lt;p&gt;There's no shared buffer between producer and consumer in pure push. The network is the buffer, and the network is bad at buffering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull: polling consumers, Kafka-style log readers, batch jobs
&lt;/h2&gt;

&lt;p&gt;Pull means the consumer initiates. An SQS consumer that calls &lt;code&gt;ReceiveMessage&lt;/code&gt; every second. A Kafka consumer group that calls &lt;code&gt;poll()&lt;/code&gt; and gets a batch back. A nightly ETL job that scans a &lt;code&gt;transactions&lt;/code&gt; table where &lt;code&gt;created_at &amp;gt; last_run&lt;/code&gt;. All pull.&lt;/p&gt;

&lt;p&gt;The key property: there's a buffer between producer and consumer, and the consumer chooses its rate.&lt;/p&gt;

&lt;p&gt;Here's the minimum viable Kafka consumer in Python. Production code, not pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;make_consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka-1:9092,kafka-2:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable.auto.commit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# we commit after work succeeds
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto.offset.reset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max.poll.interval.ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# 5 min — kicks slow consumers out
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.timeout.ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;make_consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="c1"&gt;# partition EOF is fine, anything else we log and continue
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_PARTITION_EOF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;poll error: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
                &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asynchronous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# don't commit — message will be redelivered after rebalance
&lt;/span&gt;                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handler failed for offset %d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what this consumer controls: poll timing, batch size (via client-side fetch settings), commit timing, and what "processed" means. The producer doesn't see any of this. If the consumer dies for 6 hours, Kafka holds the messages (default retention is a week, but you can keep them for years). When the consumer comes back, it picks up at the last committed offset and drains the backlog.&lt;/p&gt;

&lt;p&gt;That's the magic of pull. The buffer absorbs the consumer's downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  How they handle producer spikes (pull absorbs, push amplifies)
&lt;/h2&gt;

&lt;p&gt;This is the failure mode that surprises teams most.&lt;/p&gt;

&lt;p&gt;Imagine a checkout system that emits an &lt;code&gt;order.placed&lt;/code&gt; event for every order. On a normal day, 100 orders per second. Black Friday morning, 8,000 per second for 90 seconds, then back to normal.&lt;/p&gt;

&lt;p&gt;In a pull system with Kafka or SQS in front of the consumers, the buffer takes the spike. The consumer group sees the backlog rising and drains it at whatever rate it can sustain. End-to-end latency goes from 50ms to maybe 30 seconds during the spike. Nothing breaks. No alerts.&lt;/p&gt;

&lt;p&gt;In a push system, the producer tries to POST 8,000 webhooks per second at the consumer's &lt;code&gt;/webhook&lt;/code&gt; endpoint. The consumer's connection pool runs out. New connections queue. The load balancer's queue fills. Requests start timing out at 30 seconds. The producer's retry logic kicks in and now we have 8,000 original requests plus retries, hammering an already-overloaded receiver. This is the classic retry storm, and it's how short producer spikes turn into multi-hour outages.&lt;/p&gt;

&lt;p&gt;The fix in push systems is rate limiting at the producer (Stripe caps webhook concurrency per endpoint), backoff schedules, and dead letter queues for permanently-failed deliveries. All real, all expensive to build, all things teams discover after their first incident.&lt;/p&gt;

&lt;p&gt;Pull doesn't need any of that. The buffer is the rate limiter.&lt;/p&gt;
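
&lt;p&gt;To put rough numbers on the pull case above, here is a back-of-the-envelope sketch in Python. The function name and the 6,000 messages-per-second drain rate are assumptions made for illustration (the scenario does not state consumer capacity); the 100/s baseline and the 8,000/s, 90-second spike come from the example.&lt;/p&gt;

```python
def spike_backlog(normal_rps, spike_rps, spike_s, drain_rps):
    """Closed-form estimate of how a buffer absorbs a constant-rate spike.

    Returns (peak_backlog, peak_wait_s, catch_up_s).
    """
    if normal_rps >= drain_rps:
        raise ValueError("consumers must outpace normal traffic or the backlog never drains")
    growth_rps = max(0, spike_rps - drain_rps)             # buffer growth while the spike lasts
    peak_backlog = growth_rps * spike_s                    # messages queued when the spike ends
    peak_wait_s = peak_backlog / drain_rps                 # queueing delay for the last spike message
    catch_up_s = peak_backlog / (drain_rps - normal_rps)   # time to return to an empty buffer
    return peak_backlog, peak_wait_s, catch_up_s


# Assumed consumer capacity: 6,000 msg/s (hypothetical, not from the scenario).
peak, wait_s, catch_up_s = spike_backlog(
    normal_rps=100, spike_rps=8_000, spike_s=90, drain_rps=6_000
)
print(f"peak backlog: {peak:,} messages")        # peak backlog: 180,000 messages
print(f"worst queueing delay: {wait_s:.0f} s")   # worst queueing delay: 30 s
print(f"caught up after: {catch_up_s:.1f} s")    # caught up after: 30.5 s
```

&lt;p&gt;Under those assumptions the worst-placed message waits about 30 seconds, and the buffer is empty again roughly half a minute after the spike ends. Halve the drain rate and the same spike costs minutes instead of seconds, but it still costs latency rather than failed deliveries.&lt;/p&gt;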

&lt;h2&gt;
  
  
  How they handle consumer outages (push loses, pull catches up)
&lt;/h2&gt;

&lt;p&gt;The mirror image of the spike scenario is the outage scenario.&lt;/p&gt;

&lt;p&gt;Your consumer is down for 2 hours. Database migration, deploy bug, whatever.&lt;/p&gt;

&lt;p&gt;In a pull system, nothing happens to the producer. It keeps writing to Kafka. Messages accumulate in the partitions. When the consumer comes back, it sees the unprocessed offsets and works through them. If you provisioned enough partitions to allow parallelism, you can catch up in minutes by scaling the consumer group temporarily.&lt;/p&gt;

&lt;p&gt;In a push system, every webhook delivery during those 2 hours fails. The producer's retry policy decides what happens next. Stripe retries for up to 3 days; AWS SNS retries with limits configurable per subscription; GitHub does not redeliver failed deliveries automatically and only records them in a delivery log that someone has to check and redeliver from. If your retry window is shorter than your outage, those events are gone unless the producer is willing to backfill on request, and most aren't.&lt;/p&gt;

&lt;p&gt;This is why "we use webhooks for everything" usually becomes "we use webhooks plus a daily reconciliation job that pulls the truth from the source-of-record API." Teams end up building pull on top of push because push alone loses events.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backpressure mechanisms per style
&lt;/h2&gt;

&lt;p&gt;Pull has backpressure for free. The consumer doesn't poll faster than it can process. The buffer grows during slow periods and shrinks during fast ones. The producer never knows.&lt;/p&gt;

&lt;p&gt;Push needs explicit backpressure, and it's awkward.&lt;/p&gt;

&lt;p&gt;The HTTP-level mechanism is &lt;code&gt;429 Too Many Requests&lt;/code&gt; plus a &lt;code&gt;Retry-After&lt;/code&gt; header. The producer is supposed to read this and back off. Some producers respect it (Stripe, GitHub). Many don't (your in-house service that fires HTTP from a cron job). And &lt;code&gt;429&lt;/code&gt; only works if your service can still respond to the producer with a &lt;code&gt;429&lt;/code&gt; — which means it's not actually overloaded, just selectively rejecting.&lt;/p&gt;
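
&lt;p&gt;On the producer side, respecting that signal can be as small as the sketch below. It assumes the HTTP client exposes response headers as a dict with lowercase keys; the function name, the 1-second base, and the 300-second cap are hypothetical defaults. &lt;code&gt;Retry-After&lt;/code&gt; may carry either a number of seconds or an HTTP date, so both forms are handled.&lt;/p&gt;

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def backoff_delay_s(status, headers, attempt, base_s=1.0, cap_s=300.0, now=None):
    """Seconds a well-behaved producer waits before retrying a failed delivery.

    Honors Retry-After on 429 and 503; otherwise uses capped exponential backoff.
    Assumes header names are already lowercased (an assumption about the client).
    """
    fallback = min(cap_s, base_s * (2 ** attempt))
    if status not in (429, 503):
        return fallback
    raw = headers.get("retry-after")
    if raw is None:
        return fallback
    raw = raw.strip()
    if raw.isdigit():                          # delta-seconds form, e.g. "30"
        return min(cap_s, float(raw))
    try:
        when = parsedate_to_datetime(raw)      # HTTP-date form
    except (TypeError, ValueError):
        return fallback                        # unparseable header: ignore it
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return min(cap_s, max(0.0, (when - now).total_seconds()))


print(backoff_delay_s(429, {"retry-after": "30"}, attempt=0))  # 30.0
print(backoff_delay_s(500, {}, attempt=3))                     # 8.0
```

&lt;p&gt;The cap is a deliberate choice: it stops a misconfigured receiver from parking the producer for hours with an oversized &lt;code&gt;Retry-After&lt;/code&gt;.&lt;/p&gt;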

&lt;p&gt;The transport-level mechanism is connection limits. Set your reverse proxy to accept N concurrent connections, no more. Past that, requests get rejected at the LB. The producer sees connection failures and (if well-behaved) retries with backoff.&lt;/p&gt;

&lt;p&gt;Neither of these is as clean as "the queue grew by 10,000 messages." That's why senior engineers reach for pull whenever the work doesn't strictly need millisecond latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real systems are usually hybrid: push for notifications, pull for processing
&lt;/h2&gt;

&lt;p&gt;Once you've stared at the tradeoff long enough, you stop picking one. You pick both.&lt;/p&gt;

&lt;p&gt;The pattern: push a tiny notification, pull the heavy data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# producer side — the push is a 200-byte heads-up
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_order_placed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;placed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
    &lt;span class="nf"&gt;notify_webhook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subscriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;webhook_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order.placed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# consumer side — push handler just records intent, pulls the rest
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/webhooks/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# idempotent enqueue keyed on order_id
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fetch_order&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="mi"&gt;202&lt;/span&gt;

&lt;span class="c1"&gt;# worker — pulls from the queue, fetches the full record at its own pace
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/orders/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# producer's read API
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Webhook delivery is now a 200-byte ping. The receiver's job is to write the order ID into a queue and return &lt;code&gt;202&lt;/code&gt; in under 50ms. The actual processing happens on a worker pool that drains the queue at whatever rate is healthy. If the worker pool goes down, the queue grows; no webhook deliveries fail.&lt;/p&gt;

&lt;p&gt;Stripe's docs explicitly recommend this pattern. The webhook is a notification, not a payload. The full source of truth is the producer's read API, which you can call at your own cadence.&lt;/p&gt;

&lt;p&gt;This is also how Kafka Connect, Debezium CDC pipelines, and most production event systems work. The CDC connector pushes a small notification to Kafka; consumers pull. Nobody pushes a 10MB payload through 14 hops of HTTP.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: webhooks need retry + idempotency on the receiver; most teams under-engineer this
&lt;/h2&gt;

&lt;p&gt;Here's where most teams ship the bug that bites them six months in.&lt;/p&gt;

&lt;p&gt;Webhooks retry. Stripe retries. SNS retries. GitHub deliveries get redelivered by hand or by script. Your in-house webhook producer retries because you copied the pattern from Stripe's docs. Retries mean the same logical event arrives at your receiver multiple times. If your handler isn't idempotent, you charge the customer twice, you send two emails, you double the inventory decrement.&lt;/p&gt;

&lt;p&gt;The minimum viable receiver looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hmac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;asyncpg&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/webhooks/payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stripe-Signature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. verify signature — reject forgeries before any DB touch
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WEBHOOK_SECRET&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bad signature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Stripe gives you a unique id per event
&lt;/span&gt;
    &lt;span class="c1"&gt;# 2. idempotency — write the event id first, in its own txn
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;acquire&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO processed_events (id, received_at) VALUES ($1, NOW())&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;asyncpg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueViolationError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# already processed — return 200 so the producer stops retrying
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duplicate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. enqueue actual work — don't process inline
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handle_payment_event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. ack within the producer's timeout (usually 30s)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things every webhook receiver needs and most don't have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signature verification before anything else.&lt;/strong&gt; The receiver is on the public internet. Anyone can POST to it. If you write to the DB before checking the signature, you've given anyone on the internet an unauthenticated write path into your event handler.&lt;/p&gt;
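
&lt;p&gt;The &lt;code&gt;verify&lt;/code&gt; helper in the receiver above is left undefined. A minimal sketch, assuming Stripe's documented header format (a &lt;code&gt;t=&lt;/code&gt; timestamp plus one or more &lt;code&gt;v1=&lt;/code&gt; HMAC-SHA256 signatures over the timestamp, a dot, and the raw body), might look like the following. The five-minute replay window is a parameter picked for illustration, other providers sign differently, and the provider's official SDK helper is usually the safer choice in production.&lt;/p&gt;

```python
import hashlib
import hmac
import time


def verify(body: bytes, sig_header: str, secret: str, tolerance_s: int = 300) -> bool:
    """Check a Stripe-style signature header of the form 't=TIMESTAMP,v1=HEX'."""
    # Split "t=...,v1=..." into key/value pairs; parts without "=" are skipped.
    parts = [p.split("=", 1) for p in sig_header.split(",") if "=" in p]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not candidates:
        return False

    try:
        sent_at = int(timestamp)
    except ValueError:
        return False

    # Reject stale deliveries so a captured request cannot be replayed later.
    if abs(time.time() - sent_at) > tolerance_s:
        return False

    # The signed payload is the timestamp, a dot, and the raw request body.
    signed = timestamp.encode() + b"." + body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()

    # compare_digest runs in constant time, so it does not leak how many characters matched.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)
```

&lt;p&gt;Note that the function takes the raw request bytes, not the parsed JSON. Re-serialising the body changes the bytes, and the signature stops matching.&lt;/p&gt;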

&lt;p&gt;&lt;strong&gt;Idempotency on the event ID, persisted in your own store.&lt;/strong&gt; Not Redis (it can evict keys under memory pressure), not in-memory (it resets on deploy). A real table with a unique constraint. Use the &lt;code&gt;INSERT ... ON CONFLICT DO NOTHING&lt;/code&gt; pattern or a try/except on the unique violation. The point is: the second delivery of the same event should be a no-op.&lt;/p&gt;
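
&lt;p&gt;To make the &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; variant concrete, here is a small sketch. It swaps in SQLite from the standard library so it runs without a server; the receiver above uses Postgres through asyncpg, where the same statement applies but the "was a row inserted" check reads the command status instead of &lt;code&gt;rowcount&lt;/code&gt;. The &lt;code&gt;record_once&lt;/code&gt; name is hypothetical.&lt;/p&gt;

```python
import sqlite3


def record_once(conn: sqlite3.Connection, event_id: str) -> bool:
    """Return True the first time an event id is seen, False on any repeat."""
    cur = conn.execute(
        "INSERT INTO processed_events (id) VALUES (?) ON CONFLICT (id) DO NOTHING",
        (event_id,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the conflict was skipped.
    return cur.rowcount == 1


# In-memory database standing in for the real store (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (id TEXT PRIMARY KEY)")

print(record_once(conn, "evt_123"))  # first delivery: the caller should enqueue work
print(record_once(conn, "evt_123"))  # redelivery: the caller should ack and do nothing
```

&lt;p&gt;The unique constraint does the deduplication, so two concurrent deliveries of the same event cannot both win. The application only reads the outcome.&lt;/p&gt;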

&lt;p&gt;&lt;strong&gt;Decouple receipt from processing.&lt;/strong&gt; Return &lt;code&gt;2xx&lt;/code&gt; fast, do the work async. If you process inline, a slow downstream call makes the producer think you failed and retry, turning one event into ten attempts. The receiver becomes a 50ms-budget endpoint whose only job is to validate, dedupe, and enqueue.&lt;/p&gt;

&lt;p&gt;The thing that keeps biting people: idempotency keys on Stripe's side don't help here. Those protect your API calls to Stripe from being processed twice. They do nothing for webhook deliveries flowing the other way. Different problem, different key.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the right side
&lt;/h2&gt;

&lt;p&gt;A small checklist for your next system.&lt;/p&gt;

&lt;p&gt;Pick push when: latency under one second matters, payloads are small, the consumer is reachable and well-sized, you control or trust the producer's retry behavior, and you're willing to invest in idempotency + signature + DLQ on the receiver.&lt;/p&gt;

&lt;p&gt;Pick pull when: throughput is bursty, the consumer might be slow or down, you want replay, you want fan-out to multiple consumer groups at different speeds, or latency above a few seconds is acceptable.&lt;/p&gt;

&lt;p&gt;Pick hybrid when: you want push's latency but pull's reliability, or when the payload is large and only some consumers care about the full record. This is the production default for a reason.&lt;/p&gt;

&lt;p&gt;The mistake to avoid is treating delivery style as a taste preference. It's a reliability decision. The day your traffic doubles or your consumer crashes, the architecture you picked will either absorb the event or become the headline of the incident report.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;Pull vs push is one of those choices that looks like plumbing and turns out to be load-bearing. The &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals&lt;/a&gt; walks through the same tradeoff lens for the rest of the core building blocks (queues, caches, replication, consistency) so the next time you sketch a system on a whiteboard, you're picking failure shapes on purpose instead of by accident.&lt;/p&gt;

&lt;p&gt;What's the worst push-vs-pull mismatch you've inherited, and what did the cleanup actually look like?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dz9ehd0n8k7iax7x19i.jpg" alt="System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>reliability</category>
    </item>
    <item>
      <title>Service Mesh in 2026: When Istio Is Overkill, When It's the Right Answer</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 15:21:06 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/service-mesh-in-2026-when-istio-is-overkill-when-its-the-right-answer-36cc</link>
      <guid>https://forem.com/gabrielanhaia/service-mesh-in-2026-when-istio-is-overkill-when-its-the-right-answer-36cc</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;A platform team I talked to last quarter spent four months rolling out Istio across a 30-service estate. They wanted three things: encrypted service-to-service traffic, retries handled outside application code, and a single dashboard showing who calls whom. Six months after go-live, they ripped it out. Not because it didn't work. Because the win was a fraction of the operational cost they paid for it.&lt;/p&gt;

&lt;p&gt;That story isn't rare. It's the modal outcome for teams under a hundred services who adopt a heavyweight mesh because a conference talk made it sound mandatory. Service mesh is a real architectural pattern with a real ROI shape. In 2026 the answer to "do we need one" got both clearer and more interesting, because the alternatives finally caught up.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a service mesh actually does
&lt;/h2&gt;

&lt;p&gt;Strip away the marketing and a mesh is four features bundled into one data plane:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mutual TLS between services.&lt;/strong&gt; Every pod gets an identity, every connection is encrypted, certificate rotation happens automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic policy.&lt;/strong&gt; Timeouts, retries, circuit breakers, weighted routing for canaries, all declared once and enforced at the proxy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L7 observability.&lt;/strong&gt; Per-RPC latency, status codes, request rate, broken down by source and destination service, without any app-side instrumentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster gateway logic.&lt;/strong&gt; Cross-cluster service discovery and routing without bespoke gateway code in each app.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You get those features by injecting a proxy (Envoy, in most cases) next to every workload. That sidecar is doing the work. The control plane (Istio's &lt;code&gt;istiod&lt;/code&gt;, Linkerd's controller) configures the sidecars and rotates certs.&lt;/p&gt;

&lt;p&gt;The cost is the sidecar itself. Memory per pod doubles or worse. Network hops add latency. The control plane has its own SRE story. And every CRD-laden Helm chart is a thing your team now operates.&lt;/p&gt;

&lt;p&gt;That cost is what makes the question "do we need this" harder than the conference talks admit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Istio in 2026: what changed
&lt;/h2&gt;

&lt;p&gt;Istio's biggest 2024–2025 story was &lt;strong&gt;ambient mode&lt;/strong&gt; going GA. Ambient swaps the per-pod Envoy sidecar for a per-node L4 proxy called ztunnel, plus an optional per-namespace L7 proxy (waypoint). The net effect: mTLS and basic identity are cheap, only namespaces that need L7 policy pay for an Envoy.&lt;/p&gt;

&lt;p&gt;The pre-ambient Istio model was the operational equivalent of a tank: capable, heavy, and the reason most teams shipped Linkerd instead. Ambient changes that calculus. If you're starting fresh in 2026 and you do want Istio, ambient is the default mode you should be evaluating, not the legacy sidecar mode.&lt;/p&gt;

&lt;p&gt;What ambient doesn't fix: the CRD surface area, the rate of breaking changes between minor versions, and the operational expertise required to debug an Envoy config that misbehaves under load. Istio is more approachable than it was. It isn't yet what anyone calls easy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternatives, briefly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Linkerd&lt;/strong&gt; is the answer for most teams that decide they need a mesh but don't want to operate Istio. The proxy is written in Rust, the control plane is small, the default configuration is sane. The Buoyant team's stance is that simplicity is a feature, and the project's history shows it. You give up some of Istio's L7 routing expressiveness; you get an operational story most platform teams can actually staff.&lt;/p&gt;

&lt;p&gt;A minimal Linkerd route policy looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy.linkerd.io/v1beta3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPRoute&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-canary&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parentRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-svc&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
      &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;core&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;x-release-channel&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;canary&lt;/span&gt;
      &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-svc-v2&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;backendRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-svc-v1&lt;/span&gt;
          &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy.linkerd.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTPLocalRateLimitPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-limits&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;core&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-svc&lt;/span&gt;
  &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requestsPerSecond&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
  &lt;span class="na"&gt;overrides&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;requestsPerSecond&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
      &lt;span class="na"&gt;clientRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;marketing-batch&lt;/span&gt;
          &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;growth&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that and the mental model is the same as Kubernetes' own Gateway API: parents, matches, backends. No &lt;code&gt;EnvoyFilter&lt;/code&gt; escape hatches, no second config language to learn. That is the whole Linkerd pitch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cilium service mesh&lt;/strong&gt; is the option that changes the architecture, not just the implementation. Cilium uses eBPF in the kernel to do most of what an Envoy sidecar does in user space, and hands the L7 work to a shared per-node Envoy instead of one proxy per pod. mTLS, L7 policy, observability, all without a sidecar process per pod. For teams already running Cilium as their CNI, turning on mesh features is a config flag, not a new deployment topology.&lt;/p&gt;

&lt;p&gt;The Cilium trade is real: your mesh logic now lives in the kernel and depends on a recent kernel version, your debugging story shifts from "exec into the sidecar" to "read &lt;code&gt;cilium monitor&lt;/code&gt;", and the L7 features still aren't as deep as Istio's. But if your platform team already knows eBPF, Cilium is the lowest-overhead path to mesh features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consul Connect&lt;/strong&gt; is the right answer if you already run Consul for service discovery. Otherwise it's rarely the first choice in a Kubernetes-native shop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No mesh, with Envoy at the edge.&lt;/strong&gt; This is what most teams should actually consider first. An ingress gateway running Envoy or Contour gives you north-south policy. Your apps emit OpenTelemetry traces. mTLS, if you need it, comes from cert-manager plus a workload identity framework like SPIFFE/SPIRE. You pay zero per-pod sidecar tax. The ceiling is lower, but most teams never approach the ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three questions that decide it
&lt;/h2&gt;

&lt;p&gt;When a team asks me whether they should adopt a service mesh, the conversation collapses to three questions. Two yeses is the bar. One yes means you're reaching for a mesh because of a different unsolved problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;If yes, why a mesh helps&lt;/th&gt;
&lt;th&gt;If no, what to do instead&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Is mTLS between services a hard requirement (regulator, security review, zero-trust mandate)?&lt;/td&gt;
&lt;td&gt;The mesh issues per-workload identities and rotates certs automatically. Doing this yourself with cert-manager + SPIRE works but the operational cost is non-trivial.&lt;/td&gt;
&lt;td&gt;Stick with TLS at the edge and network policies. Per-pod mTLS without an external requirement is a tax most teams don't recover.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do you need cross-team traffic policies (rate limits, retries, timeouts) enforced uniformly without trusting every team to ship them in code?&lt;/td&gt;
&lt;td&gt;The mesh moves the contract out of app code into a platform layer the SRE team owns. This is the killer feature for orgs with 50+ services and a platform team.&lt;/td&gt;
&lt;td&gt;Document timeouts in the SDK, fail PR review when they're missing. Cheaper than running a mesh until you scale past trust.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Are you running multi-cluster with cross-cluster service-to-service traffic that needs identity-aware routing?&lt;/td&gt;
&lt;td&gt;The mesh's gateway and identity model handles this without you writing cluster-aware DNS hacks.&lt;/td&gt;
&lt;td&gt;A regional gateway + a service registry is enough for most setups. Multi-cluster mesh is the heaviest deployment of any of these tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two yeses justifies a mesh. Three makes the cost-benefit obvious. One yes means you have a specific gap you can probably close with a smaller tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The false positive that traps everyone
&lt;/h2&gt;

&lt;p&gt;The single most common reason teams reach for a mesh is observability. "We can't see what our services are doing to each other." Then they install Istio, get a Kiali dashboard, and feel productive.&lt;/p&gt;

&lt;p&gt;Here's the problem with that path: a mesh gives you L7 telemetry at the proxy. That telemetry is per-service, not per-request-flow. You see that service A called service B 1,200 times in the last minute with p99 latency of 380ms. You can't, from mesh telemetry alone, tell which user request that traffic served, which downstream calls fanned out from it, or which code path was responsible.&lt;/p&gt;

&lt;p&gt;The thing you actually want is distributed tracing. That is OpenTelemetry, not a mesh. Instrument your apps with the OTel SDKs, propagate the W3C &lt;code&gt;traceparent&lt;/code&gt; header through your HTTP and gRPC calls, ship spans to your tracing backend. A mesh can decorate those spans with mTLS identity, but the trace pipeline is what answers the questions you actually have.&lt;/p&gt;
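
&lt;p&gt;The OTel SDKs handle propagation for you, but it helps to see how little is on the wire. The sketch below is not the OpenTelemetry API; it is a hand-rolled illustration of the W3C &lt;code&gt;traceparent&lt;/code&gt; format (version, 32-hex trace id, 16-hex parent span id, 2-hex flags), and the &lt;code&gt;outgoing_traceparent&lt;/code&gt; name is hypothetical.&lt;/p&gt;

```python
import re
import secrets
from typing import Optional

# version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def outgoing_traceparent(incoming: Optional[str]) -> str:
    """Build the header for a downstream call, continuing the trace if one exists."""
    match = TRACEPARENT.match(incoming or "")
    # All-zero ids are invalid per the spec, so treat them like a missing header.
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, flags = match.group(1), match.group(3)
    else:
        # No valid parent: start a new trace and mark it sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    # Every hop gets a fresh span id; the shared trace id is what ties the hops together.
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outgoing_traceparent(incoming))  # same trace id, new span id
print(outgoing_traceparent(None))      # brand-new trace
```

&lt;p&gt;The trace id surviving every hop is what lets a tracing backend stitch one user request back together. Proxy-level telemetry never sees that relationship unless the application forwards the header.&lt;/p&gt;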

&lt;p&gt;If your reason for wanting a mesh is "I want to see what's happening," the cheaper, better answer is OpenTelemetry first. Adopt a mesh later, if and when the three questions above shift from no to yes.&lt;/p&gt;

&lt;h2&gt;
  
  
  When ambient mode changes the math
&lt;/h2&gt;

&lt;p&gt;Ambient Istio is interesting because it decouples the cost of mTLS from the cost of L7 features. With sidecar Istio, you paid the full Envoy memory and CPU bill for every workload, even ones that only needed encrypted transport and basic telemetry. With ambient, the per-node ztunnel handles mTLS and L4 telemetry at a fraction of the cost. Waypoint proxies (per namespace, opt-in) handle L7 policy only where you actually need it.&lt;/p&gt;

&lt;p&gt;If the answer to question one (mTLS required) is yes but the answer to question two (cross-team policy) is no, ambient Istio becomes a viable shape where full sidecar Istio wasn't. You get the encryption-and-identity story without paying the full sidecar tax across every pod in the cluster.&lt;/p&gt;

&lt;p&gt;Linkerd's lighter footprint already addressed this trade for many teams. Ambient closes the gap from Istio's side. If you're reaching for Istio specifically and have an ambient-capable kernel, evaluate ambient first.&lt;/p&gt;

&lt;h2&gt;
  
  
  The operational reality nobody puts on the slides
&lt;/h2&gt;

&lt;p&gt;Two failure modes you'll hit, regardless of which mesh you pick:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecars and init order.&lt;/strong&gt; Workloads that need to make outbound calls during startup race with the sidecar. The classic symptom is your migration job that runs at pod start failing with a connection refused, because the Envoy sidecar isn't ready yet. Kubernetes' native sidecar containers (enabled by default since 1.29, stable since 1.33) fix this if you're on a recent enough cluster. If not, you need &lt;code&gt;holdApplicationUntilProxyStarts&lt;/code&gt; in Istio or equivalent in Linkerd. This bites every team eventually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upgrade choreography.&lt;/strong&gt; A mesh control plane upgrade can cascade into every sidecar in the cluster restarting. You'll discover, the first time you do this in production, that some of your pods don't tolerate a sidecar restart as gracefully as you assumed. Plan upgrades for low-traffic windows, drain the cluster's most sensitive workloads first, and have a tested rollback. The mesh vendor docs underplay this; the on-call rotation doesn't.&lt;/p&gt;

&lt;p&gt;These aren't deal-breakers. They're the kind of cost the conference talks gloss over and the procurement deck doesn't surface.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical decision tree
&lt;/h2&gt;

&lt;p&gt;You have under 30 services and a platform team of two people. Don't run a mesh. OpenTelemetry, ingress Envoy, cert-manager, network policies. Revisit in a year.&lt;/p&gt;

&lt;p&gt;You have 30 to 100 services, a real platform team, and at least two of the three justification questions are a hard yes. Run &lt;strong&gt;Linkerd&lt;/strong&gt;. Smallest surface area, fastest time to value, easiest to debug.&lt;/p&gt;

&lt;p&gt;You have 100+ services across multiple clusters and at least two of the three questions are yes. &lt;strong&gt;Ambient Istio&lt;/strong&gt; is the modern default. Pay the operational tax knowingly. Staff for it.&lt;/p&gt;

&lt;p&gt;You already run Cilium as your CNI and your platform team is comfortable with eBPF. &lt;strong&gt;Cilium service mesh&lt;/strong&gt; is your shortest path; the kernel-level data plane wins on overhead.&lt;/p&gt;

&lt;p&gt;The operational complexity of any mesh outweighs the technical win for teams under fifty services. That isn't snark; it's the conclusion the platform team I opened with reached after running Istio in production for half a year. Their advice, retroactively: solve the actual problem (mTLS for one regulator-facing service path, observability across the whole estate) with the smallest tool that solves it. Add the mesh later, if the answers to the three questions change.&lt;/p&gt;

&lt;p&gt;The thing that flipped in 2026 is the bottom of the toolbox got better. Ambient Istio lowers the floor. Linkerd's policy model matured. Cilium gives you a sidecar-free path if your infrastructure cooperates. The case for adopting a mesh is now stronger when it applies and weaker when it doesn't. That asymmetry is the win.&lt;/p&gt;

&lt;p&gt;What pushed your team toward a mesh, or kept you off one? Curious which of the three questions actually moved the decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;Service mesh is one of the higher-stakes architectural choices a platform team makes. Get it wrong and you carry the operational cost for years. The chapter on networking and service-to-service communication in the &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals&lt;/a&gt; walks through the same trade lens applied to load balancers, ingress, and inter-service protocols, with the goal of making "do we need this layer" answerable on the back of an envelope.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dz9ehd0n8k7iax7x19i.jpg" alt="System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>devops</category>
    </item>
    <item>
      <title>Caching Layers in 2026: CDN, App, DB, Query: What Goes Where</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 15:20:39 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/caching-layers-in-2026-cdn-app-db-query-what-goes-where-4p4m</link>
      <guid>https://forem.com/gabrielanhaia/caching-layers-in-2026-cdn-app-db-query-what-goes-where-4p4m</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Four cache layers sit between a user's request and the row they're asking about. Most teams use two of them (usually Redis in front of the DB plus a CDN for static assets) and treat the other two as someone else's problem.&lt;/p&gt;

&lt;p&gt;That's how you end up with a Redis tier doing the work the CDN should be doing, a database that's silently using its plan cache as a quality cushion, and a stampede every time a hot key expires. Each layer answers a different question. Pick the wrong one and you pay for the layer you didn't pick.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why caching is layered, not picked
&lt;/h2&gt;

&lt;p&gt;The pattern matters because the questions stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CDN answers "can we avoid the origin entirely?"&lt;/li&gt;
&lt;li&gt;The application cache answers "can we avoid the DB round trip?"&lt;/li&gt;
&lt;li&gt;The database cache answers "can we avoid recomputing the result?"&lt;/li&gt;
&lt;li&gt;The query cache answers "can we avoid parsing and planning the statement?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each "yes" you can get short-circuits the layers below. Each "no" passes the request down. A request that hits all four layers shouldn't feel different from one that hits none, except in cost and latency.&lt;/p&gt;
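&lt;p&gt;A minimal sketch of that short-circuiting, assuming each layer can be modelled as a function that returns a value on a hit and &lt;code&gt;None&lt;/code&gt; on a miss. The &lt;code&gt;lookup&lt;/code&gt; helper and the in-memory dictionaries are illustrative stand-ins, not a real CDN or Redis client.&lt;/p&gt;

```python
from typing import Optional


def lookup(key: str, layers: list) -> tuple:
    """Walk the layers top-down; the first hit short-circuits the rest.

    `layers` is a list of (name, get) pairs where get(key) returns None on a miss.
    Returns (value, names_of_layers_visited).
    """
    visited = []
    for name, get in layers:
        visited.append(name)
        value: Optional[str] = get(key)
        if value is not None:
            return value, visited
    return None, visited


# Hypothetical in-memory stand-ins for the real layers.
cdn = {"/api/products/1": "cached-at-edge"}
app_cache = {"user:profile:42": "cached-in-redis"}
database = {"user:profile:42": "row", "order:7": "row"}

layers = [
    ("cdn", cdn.get),
    ("app_cache", app_cache.get),
    ("database", database.get),
]

print(lookup("/api/products/1", layers))  # answered at the edge, nothing below runs
print(lookup("order:7", layers))          # misses two layers, lands on the database
```

&lt;p&gt;The list of visited layers is the cost of the request: the shorter it is, the cheaper the answer.&lt;/p&gt;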

&lt;p&gt;The mistake is collapsing this into a single layer. Teams shove everything into Redis because it's the layer they already have. Redis is a fine application cache. It's a terrible CDN, a worse materialised view, and it can't help your prepared-statement planner at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: CDN edge cache
&lt;/h2&gt;

&lt;p&gt;The CDN sits closest to the user. It's the cheapest place to serve a request because the request never reaches your servers. The trick in 2026 is that CDNs cache more than &lt;code&gt;.jpg&lt;/code&gt; files. They cache API responses too, if you tell them how.&lt;/p&gt;

&lt;p&gt;Here's a Cloudflare Worker that caches a public product-listing endpoint at the edge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cloudflare Worker - cache /api/products/* responses for 60s at edge&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;caches&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// only cache GETs for the public catalog&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GET&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt;
        &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/products/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// strip per-request query params (like trace_id) so they don't fragment the cache key&lt;/span&gt;
    &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;searchParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;trace_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cacheKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// edge hit, never touches origin&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cache-Control&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;public, max-age=60, s-maxage=60&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CDN-Cache-Control&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;max-age=60&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cacheKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;s-maxage&lt;/code&gt; is the shared-cache directive. It tells the CDN how long to hold the response, separate from what browsers do with &lt;code&gt;max-age&lt;/code&gt;. The example sets both to 60 seconds, but splitting the two directives lets you give the CDN 60 seconds while telling the browser something different (often &lt;code&gt;max-age=0, must-revalidate&lt;/code&gt;). &lt;code&gt;CDN-Cache-Control&lt;/code&gt; does the same job in a separate header aimed only at CDN caches.&lt;/p&gt;
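&lt;p&gt;The split between browser and edge lifetimes can be sketched as a small helper. This is an assumption-laden illustration: &lt;code&gt;cache_headers&lt;/code&gt; is a hypothetical function, and whether your CDN honours &lt;code&gt;CDN-Cache-Control&lt;/code&gt; depends on the provider.&lt;/p&gt;

```python
def cache_headers(edge_ttl: int, browser_ttl: int) -> dict:
    """Build response headers that give shared caches and browsers different TTLs."""
    if browser_ttl == 0:
        # Browsers revalidate on every use; the edge still holds the response.
        cache_control = f"public, max-age=0, must-revalidate, s-maxage={edge_ttl}"
    else:
        cache_control = f"public, max-age={browser_ttl}, s-maxage={edge_ttl}"
    return {
        "Cache-Control": cache_control,
        # Targeted header: intended for CDN caches, not for browsers.
        "CDN-Cache-Control": f"max-age={edge_ttl}",
    }


# Edge keeps the listing for 60 seconds, browsers always revalidate.
print(cache_headers(edge_ttl=60, browser_ttl=0))
```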

&lt;p&gt;What goes here: public GETs, anonymous responses, things that look the same for thousands of users. Product listings, marketing pages, public profile pages, search-results-without-personalisation, OG image endpoints, sitemap.xml. Anything that personalises per-user is wrong here unless you cache-key on a user ID.&lt;/p&gt;

&lt;p&gt;What doesn't: anything with &lt;code&gt;Set-Cookie&lt;/code&gt; in the response, anything authenticated, anything with a write side effect. If your CDN hit rate on &lt;code&gt;/api/*&lt;/code&gt; is above 5%, you're already doing better than most teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Application cache (Redis)
&lt;/h2&gt;

&lt;p&gt;The application cache is what people mean when they say "cache." It sits in your service process or in Redis next to it. It catches the requests the CDN couldn't (because they're authenticated or personalised) and serves them without a DB round trip.&lt;/p&gt;

&lt;p&gt;The pattern that ships well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cache.internal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:profile:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# try cache first
&lt;/span&gt;    &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# cache miss - hit the DB
&lt;/span&gt;    &lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT id, name, plan, created_at FROM users WHERE id = $1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# set with a TTL so stale data times out even if invalidation fails
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate_user_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# called from any writer that mutates the user row
&lt;/span&gt;    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user:profile:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things people skip: the TTL safety net, and the write-side invalidation. TTL alone leaks stale reads for up to 5 minutes after a write. Invalidation alone leaks forever when a network blip drops the &lt;code&gt;DEL&lt;/code&gt;. You need both. Belt, suspenders, and a third hand on the belt.&lt;/p&gt;
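&lt;p&gt;To see why both mechanisms are needed, here is a small simulation. It assumes a toy in-memory cache with a fake clock standing in for Redis; &lt;code&gt;TTLCache&lt;/code&gt; is illustrative and only mimics the &lt;code&gt;SETEX&lt;/code&gt;, &lt;code&gt;GET&lt;/code&gt; and &lt;code&gt;DEL&lt;/code&gt; behaviour the argument relies on.&lt;/p&gt;

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory stand-in for Redis SETEX / GET / DEL (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def setex(self, key: str, ttl_seconds: float, value: Any) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # TTL safety net: the entry ages out even if nobody deleted it.
            del self._data[key]
            return None
        return value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


# Fake clock so the example runs instantly.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])

# Case 1: the writer invalidates, so the next read misses immediately.
cache.setex("user:profile:42", 300, {"plan": "free"})
cache.delete("user:profile:42")
after_invalidation = cache.get("user:profile:42")

# Case 2: the DEL is lost, so only the TTL bounds the staleness.
cache.setex("user:profile:42", 300, {"plan": "free"})
now[0] = 299.0
stale_read = cache.get("user:profile:42")  # still the old value
now[0] = 300.0
after_ttl = cache.get("user:profile:42")   # expired, forces a DB read
```

&lt;p&gt;Case 1 is the normal path. Case 2 is the failure path, where the TTL is the only thing standing between a dropped &lt;code&gt;DEL&lt;/code&gt; and a permanently stale profile.&lt;/p&gt;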

&lt;p&gt;What goes here: per-user data, session state, computed aggregates that are expensive to rebuild, anything you'd hate to recompute on every request but that changes infrequently. The 80/20 rule: pick the 20% of queries that account for 80% of your DB load and cache those first.&lt;/p&gt;

&lt;p&gt;What doesn't: anything that needs to be strongly consistent (a bank balance someone is about to spend). Anything written more often than read. Anything where a stale read is worse than a 50ms delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: Database cache (materialised views)
&lt;/h2&gt;

&lt;p&gt;Materialised views are the layer most teams forget exists. They sit inside the database and pre-compute the result of an expensive query. The DB stores the result like a table, so a read becomes a cheap indexed lookup against stored rows instead of seven joins and a window function.&lt;/p&gt;

&lt;p&gt;Postgres example. A per-day, per-account revenue rollup that would otherwise scan a fact table on every dashboard load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;account_revenue_daily&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;txn_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'settled'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;account_revenue_daily&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- refresh policy: every 10 minutes via pg_cron&lt;/span&gt;
&lt;span class="c1"&gt;-- CONCURRENTLY needs the unique index above to work&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'refresh-account-revenue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'*/10 * * * *'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;account_revenue_daily&lt;/span&gt;&lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;REFRESH ... CONCURRENTLY&lt;/code&gt; is the one nobody reads about. Without it, the refresh takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock that blocks reads until it finishes. With it, Postgres builds the new result on the side, compares it with the current contents, and applies only the rows that changed, so readers keep working throughout. You pay with a slower refresh and extra temporary space; you stop blocking your dashboard.&lt;/p&gt;

&lt;p&gt;What goes here: aggregations, joins across 4+ tables, anything where the underlying data changes slower than you query it. Per-day rollups, leaderboards, search facets, anything an analyst would write a CTE for.&lt;/p&gt;

&lt;p&gt;What doesn't: results that need to be real-time. Materialised views are stale by definition between refreshes. If a user expects to see their action reflected immediately, this layer isn't the answer.&lt;/p&gt;

&lt;p&gt;The gotcha here is silent: materialised views age. A team ships one, hits the dashboard latency target, moves on. Six months later the underlying table has tripled in size, the refresh runs for 12 minutes, and the view is more often stale than fresh. Audit refresh durations like you audit query plans.&lt;/p&gt;
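&lt;p&gt;A rough sketch of such an audit, assuming you can collect recent refresh durations in seconds (pg_cron's run history is one possible source, if your version records it). The &lt;code&gt;audit_refresh&lt;/code&gt; function and its &lt;code&gt;warn_ratio&lt;/code&gt; threshold are hypothetical choices, not Postgres settings.&lt;/p&gt;

```python
def audit_refresh(durations_s: list, interval_s: float, warn_ratio: float = 0.5) -> str:
    """Classify a refresh job from its recent run durations.

    warn_ratio is an arbitrary illustrative threshold: warn once a refresh
    takes more than that share of the schedule interval.
    """
    if not durations_s:
        return "no-data"
    worst = max(durations_s)
    if worst >= interval_s:
        # A refresh that outlasts its schedule means runs pile up or get skipped.
        return "overrunning"
    if worst >= warn_ratio * interval_s:
        return "warning"
    return "ok"


# A 10-minute schedule, as in the pg_cron example above.
print(audit_refresh([40, 55, 61], interval_s=600))   # comfortably inside the window
print(audit_refresh([290, 310], interval_s=600))     # past half the window
print(audit_refresh([720], interval_s=600))          # the 12-minute refresh
```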

&lt;h2&gt;
  
  
  Layer 4: Query cache (prepared-statement plan cache)
&lt;/h2&gt;

&lt;p&gt;The deepest layer is also the most invisible. Every time your driver sends a SQL statement, the database has to parse it, plan it, and execute it. The first two steps can be cached if you use prepared statements.&lt;/p&gt;

&lt;p&gt;Plenty of code defeats this by building a fresh statement string for every query (&lt;code&gt;WHERE id = 1&lt;/code&gt; vs &lt;code&gt;WHERE id = 2&lt;/code&gt;), usually through string interpolation or a raw-SQL escape hatch in the ORM. Every distinct string looks like a new statement to the planner. The fix is parameter binding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# bad - new statement every call, plan cache miss every time
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_bad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# good - same statement text, only parameters change, plan reused
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_order_good&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE id = $1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Postgres, &lt;code&gt;pg_stat_statements&lt;/code&gt; shows how often each statement shape runs, which tells you where plan reuse matters most:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- requires pg_stat_statements extension&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mean_exec_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'SELECT % FROM orders WHERE id = $1'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



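&lt;p&gt;One caveat: &lt;code&gt;pg_stat_statements&lt;/code&gt; normalises literals to &lt;code&gt;$1&lt;/code&gt; in its output, so interpolated and bound versions of a query land in the same row. It shows which statements are hot, not whether their plans were reused. To check reuse on Postgres 13 or later, turn on &lt;code&gt;pg_stat_statements.track_planning&lt;/code&gt; and compare the &lt;code&gt;plans&lt;/code&gt; column with &lt;code&gt;calls&lt;/code&gt;: if a hot statement is planned on nearly every call, your data-access layer is bypassing the plan cache. Fix that code path (pass bindings to Eloquent's &lt;code&gt;DB::select&lt;/code&gt;, use placeholders with GORM's &lt;code&gt;Where&lt;/code&gt; or &lt;code&gt;Raw&lt;/code&gt;, etc.) before you tune anything else.&lt;/p&gt;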
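&lt;p&gt;One way to catch interpolated statements from the application side, assuming you can capture the SQL text your service sends (driver debug logging, for instance): mask the literals and group by what is left. The &lt;code&gt;fingerprint&lt;/code&gt; and &lt;code&gt;unparameterised&lt;/code&gt; helpers are illustrative, and the regex only handles simple numeric and single-quoted literals.&lt;/p&gt;

```python
import re

# Bind placeholders ($1), single-quoted strings, and standalone numbers.
_TOKEN = re.compile(r"\$\d+|'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def _mask(match: re.Match) -> str:
    text = match.group(0)
    # Keep bind placeholders like $1; mask every literal.
    return text if text.startswith("$") else "?"


def fingerprint(sql: str) -> str:
    """Collapse whitespace and replace literals with '?'."""
    return _TOKEN.sub(_mask, " ".join(sql.split()))


def unparameterised(statements: list, min_variants: int = 2) -> dict:
    """Return fingerprints that were sent with several distinct literal values."""
    variants = {}
    for sql in statements:
        variants.setdefault(fingerprint(sql), set()).add(" ".join(sql.split()))
    return {fp: len(texts) for fp, texts in variants.items() if len(texts) >= min_variants}


log = [
    "SELECT * FROM orders WHERE id = 1",
    "SELECT * FROM orders WHERE id = 2",
    "SELECT * FROM orders  WHERE id = 3",
    "SELECT * FROM orders WHERE id = $1",
    "SELECT * FROM orders WHERE id = $1",
]
print(unparameterised(log))
```

&lt;p&gt;The bound statement never shows up in the result because its text is identical on every call; only the interpolated variants do.&lt;/p&gt;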

&lt;p&gt;What goes here: hot OLTP queries. The lookup-by-id, the insert-into-orders, the update-customer-last-seen. The win per query is small (a millisecond, maybe two). The win at 50,000 queries per second is the difference between two database instances and ten.&lt;/p&gt;

&lt;p&gt;What doesn't: queries with variable shapes. If your WHERE clause changes structure based on user input, you can't reuse the plan, and that's fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision matrix
&lt;/h2&gt;

&lt;p&gt;The shortcut for "which layer does this data belong in":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data shape&lt;/th&gt;
&lt;th&gt;CDN&lt;/th&gt;
&lt;th&gt;App cache&lt;/th&gt;
&lt;th&gt;Materialised view&lt;/th&gt;
&lt;th&gt;Plan cache&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static asset, anonymous&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public GET, no auth&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;maybe&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-user profile, hot&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly analytics rollup&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;maybe (TTL)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OLTP id lookup&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strongly consistent read&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time write feedback&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "maybe" rows are where teams argue in design reviews. The honest answer is "measure first." If your hourly rollup gets hit 200 times per minute, a materialised view earns its keep. If it gets hit 3 times per day, a Redis cache with a 30-minute TTL is fine and the materialised view is over-engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Invalidation strategy per layer
&lt;/h2&gt;

&lt;p&gt;Each layer wants a different invalidation story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDN.&lt;/strong&gt; TTL is your only realistic option for most teams. You can purge specific URLs via API, but at high request rates the purge takes seconds to propagate and you can't rely on it for correctness. Use short TTLs (60s for hot endpoints, 5 min for catalog data) and accept the staleness. For long-TTL assets, version the URL (&lt;code&gt;/static/app.a3f7b2.js&lt;/code&gt;) so a new version is a new key, not an invalidation.&lt;/p&gt;
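&lt;p&gt;A sketch of that URL versioning, assuming the version is a short hash of the file contents. The &lt;code&gt;versioned_path&lt;/code&gt; helper, the choice of SHA-256, and the six-character length are illustrative; build tools usually do this for you.&lt;/p&gt;

```python
import hashlib
from pathlib import PurePosixPath


def versioned_path(path: str, content: bytes, digest_len: int = 6) -> str:
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old = versioned_path("/static/app.js", b"console.log('v1');")
new = versioned_path("/static/app.js", b"console.log('v2');")
print(old)
print(new)  # different content, different URL, so no purge is needed
```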

&lt;p&gt;&lt;strong&gt;App cache.&lt;/strong&gt; Event-driven invalidation. Every write path that mutates a row calls &lt;code&gt;r.delete(key)&lt;/code&gt; for every cached derivative of that row. This works until you have 12 places that write to &lt;code&gt;users&lt;/code&gt; and someone adds the 13th without remembering to invalidate. Centralise it: every write goes through a repository that fires invalidations as part of its post-commit hook.&lt;/p&gt;
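&lt;p&gt;A minimal sketch of the centralised version, assuming a repository that owns every write to &lt;code&gt;users&lt;/code&gt;. The &lt;code&gt;UserRepository&lt;/code&gt; class, the &lt;code&gt;user:plan:&lt;/code&gt; key, and the fake database and cache objects are hypothetical; a real implementation would hook into your transaction's commit callback.&lt;/p&gt;

```python
class UserRepository:
    """Single write path for users; owns the list of cache keys derived from a row."""

    def __init__(self, db, cache):
        self._db = db        # hypothetical: anything with update_user(user_id, fields)
        self._cache = cache  # hypothetical: anything with delete(key)

    @staticmethod
    def derived_keys(user_id: str) -> list:
        # Every cached view of the user row is listed in exactly one place.
        return [f"user:profile:{user_id}", f"user:plan:{user_id}"]

    def update(self, user_id: str, fields: dict) -> None:
        self._db.update_user(user_id, fields)  # assumed to commit before returning
        # Invalidate after the commit: deleting earlier would let a reader
        # re-cache the old row before the write lands.
        for key in self.derived_keys(user_id):
            self._cache.delete(key)


class FakeDB:
    def __init__(self):
        self.rows = {}

    def update_user(self, user_id, fields):
        self.rows.setdefault(user_id, {}).update(fields)


class FakeCache:
    def __init__(self):
        self.data = {}

    def delete(self, key):
        self.data.pop(key, None)


db, cache = FakeDB(), FakeCache()
cache.data["user:profile:42"] = '{"plan": "free"}'
repo = UserRepository(db, cache)
repo.update("42", {"plan": "pro"})
print(cache.data)  # the stale profile entry is gone
```

&lt;p&gt;The 13th writer no longer has to remember anything: it calls &lt;code&gt;repo.update&lt;/code&gt; and the invalidation comes with it.&lt;/p&gt;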

&lt;p&gt;&lt;strong&gt;Materialised view.&lt;/strong&gt; Scheduled refresh. Decide your acceptable staleness (10 minutes? 1 hour?) and set a cron. To shrink the staleness window, add event-driven &lt;code&gt;REFRESH ... CONCURRENTLY&lt;/code&gt; triggers on top of the schedule, but be careful: a refresh is itself an expensive operation that you don't want firing 100 times a minute.&lt;/p&gt;
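&lt;p&gt;One way to keep event-driven refreshes from firing too often is a throttle in front of the refresh call. This is a sketch under assumptions: &lt;code&gt;RefreshThrottle&lt;/code&gt; is a hypothetical helper, and the &lt;code&gt;refresh&lt;/code&gt; callable stands in for whatever runs the &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt; statement in your service.&lt;/p&gt;

```python
import time
from typing import Callable


class RefreshThrottle:
    """Allow at most one refresh per min_interval_s; extra requests are dropped."""

    def __init__(self, refresh: Callable[[], None], min_interval_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._refresh = refresh
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_run = None

    def request(self) -> bool:
        """Call on every relevant write event. Returns True if a refresh ran."""
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._min_interval_s:
            self._last_run = now
            self._refresh()
            return True
        return False


# Fake clock and a recording callable so the example runs instantly.
runs = []
now = [0.0]
throttle = RefreshThrottle(lambda: runs.append(now[0]), min_interval_s=60,
                           clock=lambda: now[0])

for t in (0.0, 1.0, 30.0, 60.0, 61.0, 125.0):
    now[0] = t
    throttle.request()

print(runs)  # six events, three refreshes
```

&lt;p&gt;Dropped requests mean the view can lag until the next event or the next scheduled run, so the cron job stays in place as the backstop.&lt;/p&gt;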

&lt;p&gt;&lt;strong&gt;Plan cache.&lt;/strong&gt; You don't invalidate it. The database manages it. Your job is to write queries that look the same to the planner.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: stampedes across layers
&lt;/h2&gt;

&lt;p&gt;The bug that kills the layered design: when a hot cache entry expires across multiple layers at once, every concurrent request misses every layer simultaneously, and they all stampede the origin together.&lt;/p&gt;

&lt;p&gt;A concrete version: a CDN cache for &lt;code&gt;/api/homepage&lt;/code&gt; expires at 12:00:00. A thousand concurrent requests miss the CDN, hit your app, miss the app cache (which also expired at 12:00:00 because both TTLs were 60s and both were set at 11:59:00), hit the database, all run the same expensive query at once, and the database falls over.&lt;/p&gt;

&lt;p&gt;Two patterns prevent this. The first is &lt;code&gt;singleflight&lt;/code&gt;: collapse N identical concurrent requests into one and broadcast the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"encoding/json"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/redis/go-redis/v9"&lt;/span&gt;
    &lt;span class="s"&gt;"golang.org/x/sync/singleflight"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt; &lt;span class="n"&gt;singleflight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Group&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;GetHomepage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rdb&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"homepage:v3"&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// singleflight: only the first concurrent miss recomputes;&lt;/span&gt;
    &lt;span class="c"&gt;// every other caller blocks on that call and shares its result&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Do&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;buildHomepageFromDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
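        &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;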
        &lt;span class="c"&gt;// SETEX with a small jitter so 1000 keys don't expire on the same second&lt;/span&gt;
        &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetEx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttlWithJitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
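&lt;p&gt;The Go snippet calls &lt;code&gt;ttlWithJitter&lt;/code&gt; without showing it. Here is a minimal sketch of the idea, written in Python for brevity and assuming a random extra of up to 10% on top of the base TTL. The name &lt;code&gt;ttl_with_jitter&lt;/code&gt; and the &lt;code&gt;spread_percent&lt;/code&gt; parameter are illustrative, not the original helper:&lt;/p&gt;

```python
import random


def ttl_with_jitter(base_seconds, spread_percent=10):
    """Return base_seconds plus a random extra of up to spread_percent of it.

    Whole seconds only, because Redis SETEX takes an integer TTL.
    """
    max_extra = base_seconds * spread_percent // 100
    return base_seconds + random.randint(0, max_extra)
```

&lt;p&gt;With a 60-second base this yields TTLs between 60 and 66 seconds, so entries written in the same instant no longer expire in the same instant.&lt;/p&gt;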



&lt;p&gt;The second is &lt;strong&gt;lock + serve-stale&lt;/strong&gt;. Keep the value in cache beyond its freshness window and return it while a single request rebuilds it. Pseudo-pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
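&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;

&lt;span class="c1"&gt;# assumed client; decode_responses=True makes GET return str, so meta == "fresh" can match
&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decode_responses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_with_stale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fetch_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fresh_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stale_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;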
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:meta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fresh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# try to grab the rebuild lock; if we get it, rebuild
&lt;/span&gt;    &lt;span class="n"&gt;lock_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:lock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;got_lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;got_lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;fresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stale_ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fresh&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:meta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fresh_ttl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fresh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fresh&lt;/span&gt;
        &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lock_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# we didn't get the lock - serve stale if we have it
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# no stale, no lock - wait briefly and retry the read
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;fetch_fn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



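&lt;p&gt;Now a thousand concurrent requests trigger one rebuild. The other 999 either get the slightly-stale value (acceptable) or, when nothing is cached yet, wait 50ms and pick up the freshly written one. If the rebuild takes longer than that wait, those callers fall through to &lt;code&gt;fetch_fn()&lt;/code&gt; themselves, so size the wait (or loop the retry) to match your rebuild time. In the common case the database sees one extra query instead of a thousand.&lt;/p&gt;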

&lt;p&gt;Apply this pattern at every layer. CDNs do it natively with &lt;code&gt;stale-while-revalidate&lt;/code&gt; directives. Application caches need you to wire it up yourself. Materialised views are inherently stale-while-revalidate. Plan caches don't have the problem.&lt;/p&gt;
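&lt;p&gt;As a rough illustration of the CDN side, the origin opts in by sending a &lt;code&gt;Cache-Control&lt;/code&gt; header that carries the &lt;code&gt;stale-while-revalidate&lt;/code&gt; directive (defined in RFC 5861). The helper name and the numbers below are assumptions, and CDNs differ in which of these directives they honour, so treat this as a sketch and check your provider:&lt;/p&gt;

```python
def cache_control(max_age, stale_while_revalidate, stale_if_error=None):
    """Build a Cache-Control value that lets a shared cache serve stale content.

    max_age: seconds the response counts as fresh.
    stale_while_revalidate: extra seconds a cache may serve the stale copy
        while it refetches in the background.
    stale_if_error: optional extra seconds to serve stale if the origin fails.
    """
    parts = [
        "public",
        f"max-age={max_age}",
        f"stale-while-revalidate={stale_while_revalidate}",
    ]
    if stale_if_error is not None:
        parts.append(f"stale-if-error={stale_if_error}")
    return ", ".join(parts)


# Hypothetical values mirroring the 60s fresh / 600s stale split used above.
HOMEPAGE_CACHE_CONTROL = cache_control(60, 600)
```

&lt;p&gt;Here &lt;code&gt;HOMEPAGE_CACHE_CONTROL&lt;/code&gt; is the string &lt;code&gt;public, max-age=60, stale-while-revalidate=600&lt;/code&gt;, which you would set on the &lt;code&gt;/api/homepage&lt;/code&gt; response.&lt;/p&gt;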

&lt;h2&gt;
  
  
  What to do Monday
&lt;/h2&gt;

&lt;p&gt;Walk a request through your system and label which layer each piece of data should live in. Most teams discover they have one layer doing the job of three, or three layers all caching the same thing with different TTLs that don't agree.&lt;/p&gt;

&lt;p&gt;Pick the layer with the highest miss-to-cost ratio. Add it. Measure. Repeat. The goal is not to use every layer. It's to put each piece of data in the layer that answers its specific question, then make sure none of them stampede.&lt;/p&gt;

&lt;p&gt;Which layer is doing the most work in your current system, and which one is silently absent?&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;The four-layer cache stack is the kind of thing a system-design interview will ask about, and the kind of thing your ops team will thank you for shipping. The &lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;System Design Pocket Guide: Fundamentals&lt;/a&gt; covers caching in the chapter on "performance primitives," alongside the load-balancing, queueing, and replication patterns that decide whether your design holds up at scale. If this post helped, the book is the same voice across 250+ pages of the building blocks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYMFPTWV" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dz9ehd0n8k7iax7x19i.jpg" alt="System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>caching</category>
      <category>performance</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Design a Multi-Device Authentication Service (Sessions vs JWT vs Passkeys)</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 15:20:18 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-multi-device-authentication-service-sessions-vs-jwt-vs-passkeys-5fc0</link>
      <guid>https://forem.com/gabrielanhaia/design-a-multi-device-authentication-service-sessions-vs-jwt-vs-passkeys-5fc0</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Sessions, JWTs, refresh tokens, OAuth, magic links, passkeys. Your authentication service has to support all of them, on any device, with a "log out everywhere" button that actually works. Most candidates pick one mechanism and freeze. The interviewer wants to see you weigh the tradeoff and design the seams.&lt;/p&gt;

&lt;p&gt;Here are the five decisions that frame the whole thing, the three implementations of the log-out button (cheap to bulletproof), and the gotcha that quietly breaks production at month four.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five decisions
&lt;/h2&gt;

&lt;p&gt;Before you draw a single box, write these on the whiteboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token shape&lt;/strong&gt;: opaque session ID or self-contained JWT?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful or stateless&lt;/strong&gt;: does verification hit a store on every request?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revocation model&lt;/strong&gt;: how do you kill a credential before it expires?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device tracking&lt;/strong&gt;: one session per user, or one per device?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MFA gating&lt;/strong&gt;: always-on, risk-based, or step-up on sensitive actions?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each decision constrains the next. Pick stateless JWTs and revocation becomes a research project. Pick per-user sessions and "log out my laptop only" becomes impossible. The interviewer is watching you handle the constraint propagation, not memorize a stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session model: server-side store, easy revocation
&lt;/h2&gt;

&lt;p&gt;The classic shape. Login succeeds, the server creates a session row and returns an opaque ID via cookie. Every request hits a store (Redis, Postgres, whatever) to validate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;            &lt;span class="n"&gt;BYTEA&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;-- 256-bit random&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;       &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;device_id&lt;/span&gt;     &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_seen_at&lt;/span&gt;  &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;expires_at&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ip&lt;/span&gt;            &lt;span class="n"&gt;INET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_agent&lt;/span&gt;    &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mfa_verified&lt;/span&gt;  &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;sessions_user_id_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;sessions_expires_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expires_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Revocation is a &lt;code&gt;DELETE&lt;/code&gt;. Log out everywhere is &lt;code&gt;DELETE FROM sessions WHERE user_id = $1&lt;/code&gt;. Device-tracking is free because every row already names a device. Auditing is free too.&lt;/p&gt;
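&lt;p&gt;Per-device logout is the same statement with one more predicate. A small sketch, using an in-memory SQLite table as a stand-in for the sessions store (columns trimmed, IDs made up, function names illustrative):&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the sessions table; only the columns this demo needs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, user_id TEXT, device_id TEXT)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("s1", "alice", "laptop"),
        ("s2", "alice", "phone"),
        ("s3", "bob", "laptop"),
    ],
)


def log_out_device(conn, user_id, device_id):
    """Revoke the sessions for one device; returns how many rows were deleted."""
    cur = conn.execute(
        "DELETE FROM sessions WHERE user_id = ? AND device_id = ?",
        (user_id, device_id),
    )
    return cur.rowcount


def log_out_everywhere(conn, user_id):
    """Revoke every session the user has; returns how many rows were deleted."""
    cur = conn.execute("DELETE FROM sessions WHERE user_id = ?", (user_id,))
    return cur.rowcount
```

&lt;p&gt;Calling &lt;code&gt;log_out_device(conn, "alice", "laptop")&lt;/code&gt; removes only that row: Alice's phone session and Bob's laptop session are untouched.&lt;/p&gt;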

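&lt;p&gt;The cost is a round trip per request. At 50k RPS, that's 50k Redis hits per second. Doable, but you'll cache the session in the gateway's local memory for 5-10 seconds to absorb the load, and you'll size Redis for the working set of active sessions, not the total count. The tradeoff: a revoked session can keep working on a gateway until its cached copy ages out, so that 5-10 seconds is also your worst-case revocation delay.&lt;/p&gt;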
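&lt;p&gt;A minimal sketch of that gateway-side cache, assuming a &lt;code&gt;lookup&lt;/code&gt; callable that does the real store round trip and returns &lt;code&gt;None&lt;/code&gt; for unknown sessions. The class name and parameters are hypothetical, and a production version would also need a size bound and thread safety:&lt;/p&gt;

```python
import time


class LocalSessionCache:
    """Tiny per-process cache in front of the session store."""

    def __init__(self, lookup, ttl_seconds=5.0, clock=time.monotonic):
        self._lookup = lookup      # callable: session_id -> session or None
        self._ttl = ttl_seconds    # how long a cached session is trusted
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # session_id -> (expires_at, session)

    def get(self, session_id):
        now = self._clock()
        entry = self._entries.get(session_id)
        if entry is not None and entry[0] > now:
            return entry[1]        # still inside the TTL: no store round trip
        session = self._lookup(session_id)
        if session is None:
            # Unknown or revoked: drop any expired copy and report a miss.
            self._entries.pop(session_id, None)
            return None
        self._entries[session_id] = (now + self._ttl, session)
        return session
```

&lt;p&gt;The gateway calls &lt;code&gt;cache.get(session_id)&lt;/code&gt; on every request; only the first call per TTL window for a given session reaches the store.&lt;/p&gt;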

&lt;p&gt;The hidden cost: every service that wants to know "who is this" needs the store. Pick this model and your auth service becomes a hard dependency for the whole platform. Plan for it.&lt;/p&gt;

&lt;h2&gt;
  
  
  JWT model: stateless, fast, revocation is the hard problem
&lt;/h2&gt;

&lt;p&gt;A JWT carries its own claims, signed by the auth service. Any service with the public key can validate it without calling back. No store lookup, no round trip. Beautiful, until you need to kick someone out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Verifying a JWT on a downstream service&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;jwtVerify&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jose&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;jwtVerify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;publicKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;issuer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;auth.example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api.example.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// payload.sub is the user id. No DB call.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token's claims are frozen at issue time. If you change a user's role, demote them, or notice their token leaked, the existing JWT keeps working until it expires. That's the whole catch. Every revocation strategy is a workaround for this one property.&lt;/p&gt;

&lt;p&gt;Three workarounds, in order of pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short expiry + refresh tokens.&lt;/strong&gt; Issue JWTs that live 5-15 minutes. Compromise window is small. Refresh tokens live in a server store and can be revoked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revocation list (denylist).&lt;/strong&gt; Maintain a set of &lt;code&gt;jti&lt;/code&gt; claims that should be rejected. Every verifier consults the list on every request, so you just put the store back; the win is partial: the list only has to hold revoked tokens that haven't expired yet, which keeps it small enough to cache or replicate to each verifier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioned claims.&lt;/strong&gt; Embed a &lt;code&gt;token_version&lt;/code&gt; claim. Bump the version in the user record on revocation. Verifiers compare to the user's current version. Cheap to write, expensive to read (still a store hit).&lt;/li&gt;
&lt;/ul&gt;
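&lt;p&gt;The versioned-claims check is small enough to sketch. This assumes the JWT signature has already been verified and the claims arrive as a dict; the in-memory &lt;code&gt;user_versions&lt;/code&gt; map is a stand-in for the user record store, and all function names are illustrative:&lt;/p&gt;

```python
# Stand-in for the user table: user_id -> current token version.
user_versions = {"alice": 1}


def issue_claims(user_id):
    """Claims for a new token, stamped with the user's current version."""
    return {"sub": user_id, "token_version": user_versions[user_id]}


def revoke_all_tokens(user_id):
    """Bump the version; every token issued before this call stops matching."""
    user_versions[user_id] += 1


def is_current(claims):
    """True only if the token's version matches the stored one (a store read)."""
    return claims.get("token_version") == user_versions.get(claims["sub"])
```

&lt;p&gt;A token issued before &lt;code&gt;revoke_all_tokens("alice")&lt;/code&gt; fails &lt;code&gt;is_current&lt;/code&gt; afterwards, while a token issued after it passes. The write is a single increment; the read on every verification is where the cost lands.&lt;/p&gt;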

&lt;p&gt;Notice the pattern: every workaround re-introduces the state the JWT was meant to avoid. Stateless verification is the wrong goal. The right goal is &lt;em&gt;cheap&lt;/em&gt; verification, and short JWTs plus refresh tokens get you there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Refresh tokens: the bridge
&lt;/h2&gt;

&lt;p&gt;Refresh tokens are the seam where session-like state meets stateless verification. Issue a short JWT (15 min) plus a long refresh token (30 days). The refresh token sits in a server store and looks like a session row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;refresh_tokens&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BYTEA&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;         &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;device_id&lt;/span&gt;       &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;family_id&lt;/span&gt;       &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;-- rotation lineage&lt;/span&gt;
  &lt;span class="n"&gt;issued_at&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;expires_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;revoked_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;replaced_by&lt;/span&gt;     &lt;span class="n"&gt;BYTEA&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;refresh_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the JWT expires, the client posts the refresh token to &lt;code&gt;/auth/refresh&lt;/code&gt;. The server validates it, rotates it (issues a new one and revokes the old), and returns a fresh JWT + fresh refresh token. The &lt;code&gt;family_id&lt;/code&gt; tracks the rotation lineage.&lt;/p&gt;

&lt;p&gt;The rotation matters for one reason: replay detection. If a refresh token gets used twice (once by the legit client, once by an attacker who stole it), the server sees a token in &lt;code&gt;family_id = X&lt;/code&gt; being used after it was rotated. That's a hard signal. Revoke the entire family and force re-login.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NewTokenPair&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM refresh_tokens WHERE id = $1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;InvalidToken&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revoked_at&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Replay detected. Kill the whole family.
&lt;/span&gt;        &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE refresh_tokens SET revoked_at = now() &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WHERE family_id = $1 AND revoked_at IS NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;family_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;InvalidToken&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ExpiredToken&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;new_refresh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;issue_refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;family_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UPDATE refresh_tokens SET revoked_at = now(), replaced_by = $1 &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WHERE id = $2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;new_refresh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;NewTokenPair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;issue_jwt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;refresh&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;new_refresh&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the architecture worth defending in the interview. Short JWTs for cheap verification, refresh tokens for revocation, families for replay detection. Almost every modern auth service ships this shape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Passkeys (WebAuthn): the 2026 default
&lt;/h2&gt;

&lt;p&gt;Passkeys replaced "password + TOTP" as the recommended default for new sign-in flows. They're public-key credentials stored in the user's device keychain (or roaming via iCloud Keychain, Google Password Manager, 1Password). The server never sees a secret.&lt;/p&gt;

&lt;p&gt;Registration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Server sends a challenge plus user info.&lt;/li&gt;
&lt;li&gt;Browser asks the authenticator (Touch ID, Windows Hello, security key) to generate a keypair.&lt;/li&gt;
&lt;li&gt;Authenticator returns the public key plus an attestation. Server stores the public key against the user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Authentication:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Server sends a challenge.&lt;/li&gt;
&lt;li&gt;Authenticator signs it with the matching private key.&lt;/li&gt;
&lt;li&gt;Server verifies the signature against the stored public key.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The server changes you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;webauthn_credentials&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;credential_id&lt;/span&gt;   &lt;span class="n"&gt;BYTEA&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;         &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;public_key&lt;/span&gt;      &lt;span class="n"&gt;BYTEA&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;sign_count&lt;/span&gt;      &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;transports&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;                   &lt;span class="c1"&gt;-- ['internal','hybrid','usb']&lt;/span&gt;
  &lt;span class="n"&gt;aaguid&lt;/span&gt;          &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                     &lt;span class="c1"&gt;-- authenticator model&lt;/span&gt;
  &lt;span class="n"&gt;backup_eligible&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;backup_state&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_used_at&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



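&lt;p&gt;&lt;code&gt;sign_count&lt;/code&gt; is a signature counter the authenticator returns with each assertion. If it ever fails to increase, treat that as a signal the credential may have been cloned. Authenticators that don't implement a counter, including many synced passkey providers, always report 0, so only run the check when the stored or received value is nonzero. &lt;code&gt;backup_eligible&lt;/code&gt; plus &lt;code&gt;backup_state&lt;/code&gt; tell you whether the passkey is synced (iCloud) or device-bound (a YubiKey).&lt;/p&gt;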
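
&lt;p&gt;As a rough sketch of the counter check (the function and exception names are made up for illustration and are not part of any WebAuthn library):&lt;/p&gt;

```python
class PossibleClone(Exception):
    """Hypothetical error: the signature counter failed to advance."""


def next_sign_count(stored: int, received: int) -> int:
    """Return the counter value to persist after a successful assertion."""
    # Authenticators without a counter report 0 every time, so there is
    # nothing to compare when both sides are zero.
    if stored == 0 and received == 0:
        return 0
    # A working counter strictly increases on every assertion.
    if received > stored:
        return received
    raise PossibleClone(f"sign_count did not advance: {stored} then {received}")
```

&lt;p&gt;Whether a regression should hard-fail the login or only raise the risk score is a policy choice; the sketch raises so the caller has to decide.&lt;/p&gt;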

&lt;p&gt;Once a user signs in with a passkey, your service issues the same JWT + refresh token pair as before. WebAuthn replaces the credential check, not the session layer. That's a useful framing for the interview: passkeys are an &lt;em&gt;authentication factor&lt;/em&gt;, the rest of the architecture stays.&lt;/p&gt;

&lt;p&gt;The one new failure mode: a user loses every device with the passkey. Your account-recovery flow has to exist, and it can't be an emailed password reset, which defeats the point. Use a recovery code printed at enrollment, a verified backup device, or a delegated recovery contact.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Log out everywhere": three implementations
&lt;/h2&gt;

&lt;p&gt;This is the question that separates candidates. Three answers, from cheap to bulletproof.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Delete sessions / refresh tokens.&lt;/strong&gt; Works if you're session-based or short-JWT + refresh. One SQL statement per table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;refresh_tokens&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;       &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The catch: existing JWTs keep working until they expire. If your JWT lives 15 minutes, the user's old devices have up to 15 minutes of grace. For consumer apps that's usually fine. For a banking app it isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Bump a token version.&lt;/strong&gt; Add &lt;code&gt;token_version&lt;/code&gt; to the user row and carry it as a claim (&lt;code&gt;ver&lt;/code&gt; below) in every JWT. On verify, compare the two. A mismatch means reject.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Claims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;public_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;user_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user_token_version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sub&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ver&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;user_version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Revoked&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log out everywhere becomes &lt;code&gt;UPDATE users SET token_version = token_version + 1 WHERE id = $1&lt;/code&gt;. Every existing JWT is now invalid. Verifiers pay one cache hit per request (cache the version aggressively, invalidate on bump).&lt;/p&gt;
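
&lt;p&gt;One way to picture "cache aggressively, invalidate on bump" is the sketch below. It assumes a single process and uses a plain dict as a stand-in for the users table; &lt;code&gt;TokenVersions&lt;/code&gt; and its methods are illustrative names, not an existing API.&lt;/p&gt;

```python
class TokenVersions:
    """Read-through cache of per-user token versions (single-process sketch)."""

    def __init__(self, db: dict):
        self._db = db        # stand-in for the users table: user_id to version
        self._cache = {}     # in-process cache consulted on every verify

    def get(self, user_id: str) -> int:
        # Fill the cache on first use, then serve from memory.
        if user_id not in self._cache:
            self._cache[user_id] = self._db[user_id]
        return self._cache[user_id]

    def bump(self, user_id: str) -> int:
        # "Log out everywhere": write the new version, then drop the cached copy
        # so the next verify reads the fresh value.
        self._db[user_id] += 1
        self._cache.pop(user_id, None)
        return self._db[user_id]
```

&lt;p&gt;With several verifier processes, the bump also has to reach every other cache, for example through a pub/sub message or a short cache TTL.&lt;/p&gt;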

&lt;p&gt;&lt;strong&gt;3. Per-credential revocation broadcast.&lt;/strong&gt; The bulletproof option. Maintain a denylist of &lt;code&gt;jti&lt;/code&gt; claims with TTLs matching the JWT's &lt;code&gt;exp&lt;/code&gt;. Push revocations to every verifier via Redis pub/sub, NATS, or a streaming bus. Verifiers keep a local Bloom filter and fall back to the central store only on a hit, because a Bloom filter can return false positives but never false negatives.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Claims&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;public_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;revocation_bloom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;might_contain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sismember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revoked&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jti&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Revoked&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This combines low-latency verification (Bloom filter is in-process) with hard revocation guarantees. The cost is operational: a streaming bus to maintain, a Bloom filter to size, and a denylist that has to expire entries.&lt;/p&gt;
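
&lt;p&gt;For intuition on the in-process check, here is a toy Bloom filter. It's a sketch only: the class, its default sizes, and the SHA-256 based hashing are assumptions for illustration, and it simply reuses the &lt;code&gt;might_contain&lt;/code&gt; name from the snippet above. A production filter would pack bits and be sized from the expected number of revoked tokens.&lt;/p&gt;

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits: int = 8192, hashes: int = 4):
        self.size_bits = size_bits
        self.hashes = hashes
        # One list slot per bit keeps the sketch readable; real filters pack bits.
        self.bits = [False] * size_bits

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # True means "maybe revoked, go ask the central store".
        return all(self.bits[pos] for pos in self._positions(item))
```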

&lt;p&gt;For most products, option 1 plus short JWTs is the right answer. Option 2 if your JWTs live longer than a minute and you can't tolerate that grace. Option 3 if you're under regulatory pressure (PCI, HIPAA, SOC 2 with specific revocation SLAs).&lt;/p&gt;

&lt;h2&gt;
  
  
  Device tracking: sessions per device
&lt;/h2&gt;

&lt;p&gt;A row per (user, device) is what makes "log out my laptop only" possible. The &lt;code&gt;device_id&lt;/code&gt; is generated by the client on first auth and persisted in secure storage (Keychain on iOS, Keystore-backed storage on Android, IndexedDB + a fingerprint on web). It survives reinstalls if you ship a recovery hint, and it survives nothing if you don't.&lt;/p&gt;

&lt;p&gt;Track three things per device row:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;last_seen_at&lt;/code&gt;: update on every refresh, drives "active devices" UI.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip&lt;/code&gt; and a geo lookup: drives "new sign-in from Brazil" emails.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_agent&lt;/code&gt; plus a friendly device label ("Pixel 9 Pro", "Chrome on macOS"): what the user actually recognizes in the settings page.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid hard fingerprinting (canvas, WebGL, font enumeration). It's brittle, it breaks in privacy modes, and the GDPR conversation is uncomfortable. A self-generated device ID covers most of the use cases without the legal surface area.&lt;/p&gt;

&lt;h2&gt;
  
  
  MFA gating: risk-based vs always-on
&lt;/h2&gt;

&lt;p&gt;Always-on MFA is the safe answer. Every login asks for a second factor. Users hate it. They also enable it and forget about it, which is the point.&lt;/p&gt;

&lt;p&gt;Risk-based MFA is the better UX answer. Score each login attempt: new device, new geo, unusual hour, password reset within 24h, impossible travel since last session. Above a threshold, demand a second factor. Below, skip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;risk_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoginAttempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;known_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;geo&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nf"&gt;last_geo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;impossible_travel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;password_changed_within&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hour&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;usual_hours&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A score above 50 triggers a step-up: passkey, TOTP, or push approval. The implementation lives in the auth service; the rules live in config so you can tune them without a deploy.&lt;/p&gt;
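
&lt;p&gt;To make "rules live in config" concrete, here is a hedged sketch where the weights and threshold come from a JSON document. The JSON shape, the signal names, and &lt;code&gt;requires_step_up&lt;/code&gt; are assumptions for illustration; the numbers mirror the scoring function above.&lt;/p&gt;

```python
import json

# Stand-in for a config file or config-service payload.
RULES_JSON = """
{
  "threshold": 50,
  "weights": {
    "new_device": 40,
    "new_geo": 20,
    "impossible_travel": 60,
    "recent_password_change": 20,
    "unusual_hour": 10
  }
}
"""


def requires_step_up(signals: dict, rules: dict) -> bool:
    """Sum the weights of the signals that fired and compare to the threshold."""
    score = sum(
        weight
        for name, weight in rules["weights"].items()
        if signals.get(name, False)
    )
    return score > rules["threshold"]


rules = json.loads(RULES_JSON)
```

&lt;p&gt;Tuning a weight then means editing the config payload, not the function.&lt;/p&gt;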

&lt;p&gt;Step-up MFA is the related pattern: sensitive actions (change password, add a new payment method, delete account) trigger MFA even mid-session. Implement it as an &lt;code&gt;mfa_verified_at&lt;/code&gt; timestamp on the session. A verification within the last 5 minutes counts; anything older requires a re-prompt.&lt;/p&gt;
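
&lt;p&gt;The freshness check itself is small. A minimal sketch, assuming the session exposes &lt;code&gt;mfa_verified_at&lt;/code&gt; as a timezone-aware datetime or &lt;code&gt;None&lt;/code&gt; (&lt;code&gt;needs_step_up&lt;/code&gt; and &lt;code&gt;MFA_FRESHNESS&lt;/code&gt; are illustrative names):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MFA_FRESHNESS = timedelta(minutes=5)


def needs_step_up(mfa_verified_at: Optional[datetime],
                  now: Optional[datetime] = None) -> bool:
    """True when a sensitive action should re-prompt for a second factor."""
    now = now or datetime.now(timezone.utc)
    if mfa_verified_at is None:
        return True  # this session has never passed MFA
    # Anything older than the freshness window requires a new prompt.
    return now - mfa_verified_at > MFA_FRESHNESS
```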

&lt;h2&gt;
  
  
  The 90-second answer
&lt;/h2&gt;

&lt;p&gt;If the interviewer asks for the elevator pitch:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Short-lived JWTs (15 min) for fast, stateless verification at every service. Long-lived refresh tokens (30 days, rotated on use, families tracked for replay detection) backed by a Postgres or Redis store. That's the revocation surface. Per-device session rows so the user can see and kill individual devices. Passkeys (WebAuthn) as the primary factor for new accounts, password-plus-TOTP as the legacy fallback. Risk-based step-up MFA for sensitive actions. "Log out everywhere" bumps a &lt;code&gt;token_version&lt;/code&gt; on the user record so existing JWTs fail their next verify, and deletes all refresh tokens so they can't bootstrap a new JWT. Device-tracking via a self-generated client ID, not browser fingerprinting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That covers the five decisions, names the rotation pattern, includes a real revocation strategy, and namechecks the modern default. It's about 90 seconds spoken.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: JWT revocation lists grow without bound
&lt;/h2&gt;

&lt;p&gt;The trap I see in code reviews more than any other: someone implements a JWT denylist as a Redis set keyed on user ID, with no TTL on the entries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Don't do this
&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sadd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revoked:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token_jti&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A year later, that set has 50k JTIs in it, half of them long expired, and every verify is shipping a 200KB set membership check. Redis can expire a whole key but not individual members of a set, so nothing ever ages out. Eventually the verify path dominates your CPU and someone deletes the set, which silently un-revokes everything.&lt;/p&gt;

&lt;p&gt;Two fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Store JTIs with TTLs matching the JWT &lt;code&gt;exp&lt;/code&gt;.&lt;/strong&gt; Use individual keys, not sets. &lt;code&gt;SET revoked:&amp;lt;jti&amp;gt; 1 EX &amp;lt;seconds-until-exp&amp;gt;&lt;/code&gt;. Redis expires each key on its own.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use short JWTs.&lt;/strong&gt; If your JWT lives 5 minutes, you barely need a denylist. Revoke the refresh token, wait 5 minutes, done. The denylist becomes a defense-in-depth layer for the worst case, not the primary mechanism.&lt;/li&gt;
&lt;/ol&gt;
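
&lt;p&gt;The first fix might look like this in Python. The sketch assumes a redis-py style client whose &lt;code&gt;set&lt;/code&gt; method accepts an &lt;code&gt;ex&lt;/code&gt; expiry in seconds; &lt;code&gt;revoke&lt;/code&gt; and its parameters are illustrative names.&lt;/p&gt;

```python
import time
from typing import Optional


def revoke(client, jti: str, exp: float, now: Optional[float] = None) -> bool:
    """Denylist one token until its own expiry; True if a key was written.

    `client` is assumed to behave like a redis-py client (set(name, value, ex=...)).
    `exp` is the JWT's expiry as a Unix timestamp.
    """
    now = time.time() if now is None else now
    ttl = int(exp - now)
    if ttl > 0:
        # The key disappears on its own once the token could no longer be used.
        client.set(f"revoked:{jti}", 1, ex=ttl)
        return True
    # Already expired: the exp check rejects it anyway, so no entry is needed.
    return False
```

&lt;p&gt;Because every key carries its own expiry, the denylist can never hold more entries than there are unexpired revoked tokens.&lt;/p&gt;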

&lt;p&gt;Or, put bluntly: if your JWTs live an hour and you need real revocation, your JWTs are too long.&lt;/p&gt;

&lt;p&gt;The interview takeaway: revocation drives the architecture. Pick a token lifetime that matches your revocation SLA. Five minutes if you can absorb the refresh chatter, fifteen if you can't, never an hour for anything that matters. The rest of the design (refresh families, passkeys, device tracking, MFA gating) falls out of that one decision.&lt;/p&gt;

&lt;p&gt;What's your team's revocation SLA, and does your current JWT lifetime actually match it?&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;Auth services are one of the 15 walkthroughs in the &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews&lt;/a&gt;. The chapter walks the same five decisions at interview pace, plus the variants the interviewer is likely to push you on (OAuth provider integration, mobile vs web token storage, the multi-tenant case). It's the book I'd hand a friend the week before an L5 system-design loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>security</category>
      <category>auth</category>
    </item>
    <item>
      <title>Design a Payment Ledger: Idempotent, Audit-Compliant, Reconciles to the Cent</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 13:47:58 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-payment-ledger-idempotent-audit-compliant-reconciles-to-the-cent-59p7</link>
      <guid>https://forem.com/gabrielanhaia/design-a-payment-ledger-idempotent-audit-compliant-reconciles-to-the-cent-59p7</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;"Design a payment ledger" is the prompt that ends interview rounds. Candidates draw a &lt;code&gt;payments&lt;/code&gt; table, an &lt;code&gt;update balance&lt;/code&gt; arrow, and a Redis cache. Twenty minutes later the interviewer has asked about duplicate charges, partial refunds, and what happens when the processor times out. The whiteboard is a mess.&lt;/p&gt;

&lt;p&gt;A payment ledger isn't a CRUD app. It's an append-only event store with two-sided arithmetic, a reconciliation job that runs at 03:00, and a regulator who wants to see every byte you ever wrote. Get four properties right and the rest of the design draws itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a ledger isn't a CRUD app
&lt;/h2&gt;

&lt;p&gt;The instinct is wrong from the first table. CRUD systems mutate rows. Ledgers don't mutate anything. Every event (charge, refund, chargeback, fee, FX adjustment) is a new entry. The current balance is a &lt;em&gt;projection&lt;/em&gt; over the entries, not a value you store and update.&lt;/p&gt;

&lt;p&gt;The reason isn't ideology. It's audit. A regulator (or an angry customer, or your CFO) eventually asks: "what was account 7218's balance at 14:32 on March 4th?" If you've been updating a &lt;code&gt;balance&lt;/code&gt; column, you can't answer. If you've been appending entries, you replay events up to that timestamp and the answer falls out.&lt;/p&gt;
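
&lt;p&gt;A minimal sketch of that replay, assuming entries are held in memory as dicts with signed amounts in minor units (the field names, the sample data, and &lt;code&gt;balance_at&lt;/code&gt; are illustrative, not the schema used later in this post):&lt;/p&gt;

```python
from datetime import datetime, timezone


def balance_at(entries, account_id: int, as_of: datetime) -> int:
    """Project a balance by summing every entry recorded up to `as_of`."""
    return sum(
        entry["amount_minor"]
        for entry in entries
        if entry["account_id"] == account_id and as_of >= entry["occurred_at"]
    )


# Made-up history for account 7218 on a single day.
entries = [
    {"account_id": 7218, "amount_minor": 5000,
     "occurred_at": datetime(2026, 3, 4, 9, 0, tzinfo=timezone.utc)},
    {"account_id": 7218, "amount_minor": -1200,
     "occurred_at": datetime(2026, 3, 4, 14, 0, tzinfo=timezone.utc)},
    {"account_id": 7218, "amount_minor": 700,
     "occurred_at": datetime(2026, 3, 4, 15, 0, tzinfo=timezone.utc)},
]
```

&lt;p&gt;Asking for the balance at 14:32 sums only the first two entries; the 15:00 entry doesn't exist yet from that point of view.&lt;/p&gt;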

&lt;p&gt;The four properties that separate ledgers from CRUD:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Double-entry&lt;/strong&gt;: every transaction touches at least two accounts, and the sum is zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency per attempt&lt;/strong&gt;: the same payment attempt, retried, doesn't double-charge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-only&lt;/strong&gt;: entries are never updated or deleted, only reversed by new entries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconcilable&lt;/strong&gt;: your numbers match the payment processor's numbers, to the cent, every day.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Miss any one and you're shipping a bug factory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Double-entry bookkeeping in code
&lt;/h2&gt;

&lt;p&gt;Double-entry has been around since the 15th century. Pacioli wrote it down in 1494. Every accountant on Earth uses it. Software engineers keep reinventing single-entry ledgers and discovering, three years later, why that's bad.&lt;/p&gt;

&lt;p&gt;The rule: every transaction has two sides, and they sum to zero. A customer pays you $50. The customer's &lt;em&gt;liability&lt;/em&gt; account goes down by $50; your &lt;em&gt;cash&lt;/em&gt; account goes up by $50. Net change: zero. You haven't created or destroyed money. You've moved it.&lt;/p&gt;
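
&lt;p&gt;The zero-sum rule is easy to check mechanically. A small sketch, using signed amounts in cents and made-up account labels (&lt;code&gt;is_balanced&lt;/code&gt; is an illustrative helper):&lt;/p&gt;

```python
def is_balanced(legs) -> bool:
    """A transaction is valid only if its signed legs sum to zero."""
    return sum(amount_minor for _account, amount_minor in legs) == 0


# The $50 payment from the text: one account goes down, the other goes up.
payment = [
    ("customer:liability", -5000),
    ("platform:cash", 5000),
]
```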

&lt;p&gt;In SQL, that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;owner_type&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- 'customer', 'platform', 'processor'&lt;/span&gt;
    &lt;span class="n"&gt;owner_id&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;        &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- ISO 4217&lt;/span&gt;
    &lt;span class="n"&gt;account_type&lt;/span&gt;    &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;-- 'asset', 'liability', 'revenue', 'expense'&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;owner_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;occurred_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;transaction_id&lt;/span&gt;  &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;account_id&lt;/span&gt;      &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;direction&lt;/span&gt;       &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'D'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'C'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="c1"&gt;-- Debit / Credit&lt;/span&gt;
    &lt;span class="n"&gt;amount_minor&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;         &lt;span class="c1"&gt;-- cents, always positive&lt;/span&gt;
    &lt;span class="n"&gt;currency&lt;/span&gt;        &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- The invariant that every code review should check for:&lt;/span&gt;
&lt;span class="c1"&gt;-- SUM(D) - SUM(C) = 0 for every transaction_id, per currency.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to notice. &lt;code&gt;amount_minor&lt;/code&gt; is always positive; direction lives in its own column. And &lt;code&gt;entries&lt;/code&gt; carries no &lt;code&gt;updated_at&lt;/code&gt;. The table is append-only; there's no &lt;code&gt;UPDATE entries SET ...&lt;/code&gt; anywhere in the codebase. If you find one in code review, that's a blocker.&lt;/p&gt;
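&lt;p&gt;The invariant in the schema comment can also be checked in application code before the insert. A minimal sketch, assuming the entries for one transaction are available as plain tuples; the &lt;code&gt;Entry&lt;/code&gt; type and the &lt;code&gt;unbalanced_currencies&lt;/code&gt; helper are hypothetical names, not part of the schema above:&lt;/p&gt;

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    # Hypothetical in-memory mirror of one row in the entries table.
    account_id: int
    direction: str      # 'D' or 'C'
    amount_minor: int   # always positive, like the CHECK constraint
    currency: str


def unbalanced_currencies(entries: list[Entry]) -> dict[str, int]:
    """Return {currency: SUM(D) - SUM(C)} for every currency that does not net to zero."""
    totals: dict[str, int] = defaultdict(int)
    for e in entries:
        if e.direction not in ("D", "C") or not e.amount_minor > 0:
            raise ValueError(f"invalid entry: {e}")
        totals[e.currency] += e.amount_minor if e.direction == "D" else -e.amount_minor
    # An empty dict means the transaction is balanced in every currency.
    return {currency: net for currency, net in totals.items() if net != 0}
```

&lt;p&gt;An empty result means the transaction balances. Anything else is a reason to refuse the write rather than repair it afterwards.&lt;/p&gt;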

&lt;p&gt;The balance projection is a query, not a stored value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'D'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt;
                       &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'C'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;balance_minor&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For hot accounts this gets too slow to run on every request. You materialize the projection into a &lt;code&gt;balances&lt;/code&gt; table with &lt;code&gt;last_entry_id&lt;/code&gt;, and a worker incrementally folds new entries in. The projection is a cache. If it gets corrupted, you drop it and rebuild from &lt;code&gt;entries&lt;/code&gt;. The append-only log is the source of truth.&lt;/p&gt;

&lt;p&gt;A single payment writes one transaction and two entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;occurred_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a3f...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'pay_2026_05_24_attempt_1_user_7218'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Order 9921 capture'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a3f...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4501&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'D'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- cash (asset) up&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'a3f...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7218&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'C'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- customer liability up (we owe them goods/refund-right)&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt; matters. The two entries must land atomically, or you have a half-applied transaction and the invariant is broken. Postgres gives you this for free. If you're sharding across databases, you need a two-phase commit, or better, keep both legs of any transaction in the same shard. Sharding a ledger is its own essay; the short version is: shard by &lt;code&gt;owner_id&lt;/code&gt;, route both legs to the same shard, and accept that some cross-account transfers will need a saga.&lt;/p&gt;
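&lt;p&gt;A rough sketch of the routing rule, assuming a fixed shard count and integer owner ids. The &lt;code&gt;shard_for&lt;/code&gt; and &lt;code&gt;needs_saga&lt;/code&gt; helpers are hypothetical, and the hashing scheme is only one reasonable choice:&lt;/p&gt;

```python
import hashlib


def shard_for(owner_id: int, shard_count: int) -> int:
    # A digest-based hash gives the same shard for the same owner on every
    # process and every restart, which is what routing needs.
    digest = hashlib.sha256(str(owner_id).encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def needs_saga(debit_owner_id: int, credit_owner_id: int, shard_count: int) -> bool:
    """True when the two legs land on different shards and cannot share one local transaction."""
    return shard_for(debit_owner_id, shard_count) != shard_for(credit_owner_id, shard_count)
```

&lt;p&gt;When &lt;code&gt;needs_saga&lt;/code&gt; returns false, both legs fit in one &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt; on a single shard.&lt;/p&gt;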

&lt;h2&gt;
  
  
  Idempotency keys: per attempt, not per customer
&lt;/h2&gt;

&lt;p&gt;The most common bug in payment systems: idempotency keyed on the wrong thing.&lt;/p&gt;

&lt;p&gt;A customer clicks "pay" and the request times out. The mobile app retries. Without idempotency, you charge twice. With idempotency keyed on &lt;code&gt;customer_id + order_id&lt;/code&gt;, the retry sees the existing transaction and returns the original result. Perfect.&lt;/p&gt;

&lt;p&gt;Now the customer cancels and pays again for the &lt;em&gt;same order&lt;/em&gt;. Different attempt, same key. The server returns the cached result of the first attempt, and the new payment never happens.&lt;/p&gt;

&lt;p&gt;The fix is in the title: idempotency keys are per &lt;em&gt;attempt&lt;/em&gt;, not per customer or per order. The client generates a fresh UUID for each payment attempt, sends it on every retry of that attempt, and discards it once the attempt completes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PaymentFailed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Raised when every retry of one payment attempt has failed.&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;charge_customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ONE attempt = ONE key. Retries of THIS attempt reuse it.
&lt;/span&gt;    &lt;span class="c1"&gt;# If the user clicks "pay" again later, the client generates a new one.
&lt;/span&gt;    &lt;span class="n"&gt;idempotency_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;retry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount_minor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Idempotency-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Same key, server returns the previously committed result if it landed
&lt;/span&gt;            &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PaymentFailed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3 retries exhausted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Server side, the key is enforced by the &lt;code&gt;UNIQUE&lt;/code&gt; constraint on &lt;code&gt;transactions.idempotency_key&lt;/code&gt;. Two requests with the same key race; one wins the insert, the other gets a unique-violation and reads back the existing transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="c1"&gt;# UniqueViolation is the driver's unique-constraint error, e.g. psycopg.errors.UniqueViolation.&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;txn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO transactions (id, idempotency_key, description, occurred_at) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VALUES (%s, %s, %s, %s) RETURNING id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;occurred_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="nf"&gt;insert_entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;txn&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;UniqueViolation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Another request with the same key already committed
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM transactions WHERE idempotency_key = %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,)&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;UniqueViolation&lt;/code&gt; branch is the idempotency contract. It says: "I already did this exact thing, here's the result." That's why the key must scope to one &lt;em&gt;attempt&lt;/em&gt;. If it scoped to the order, a second legitimate attempt would silently return the first attempt's result and the user would never get charged.&lt;/p&gt;

&lt;p&gt;One more wrinkle: store the request body's hash alongside the key. If a client reuses an idempotency key with a different body (different amount, different currency), reject it with &lt;code&gt;409 Conflict&lt;/code&gt;. Stripe does this. It catches integration bugs early.&lt;/p&gt;
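&lt;p&gt;A minimal sketch of that check, with an in-memory dict standing in for a hash column stored next to &lt;code&gt;transactions.idempotency_key&lt;/code&gt;. The names &lt;code&gt;body_fingerprint&lt;/code&gt;, &lt;code&gt;check_idempotent_request&lt;/code&gt; and &lt;code&gt;IdempotencyConflict&lt;/code&gt; are hypothetical, and hashing canonical JSON is an assumption about how the body is compared:&lt;/p&gt;

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Hypothetical error that the HTTP layer would map to 409 Conflict."""


def body_fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_idempotent_request(seen: dict[str, str], key: str, payload: dict) -> bool:
    """Return True for a first-time key and False for a faithful retry.

    Raises IdempotencyConflict when the key is reused with a different body.
    """
    fingerprint = body_fingerprint(payload)
    stored = seen.get(key)
    if stored is None:
        seen[key] = fingerprint
        return True
    if stored != fingerprint:
        raise IdempotencyConflict(f"key {key!r} reused with a different request body")
    return False
```

&lt;p&gt;In a real service the fingerprint would be written in the same database transaction as the key, so the two can never disagree.&lt;/p&gt;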

&lt;h2&gt;
  
  
  The append-only event log
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;entries&lt;/code&gt; table is your event log. Treat it like a Kafka topic: new events are appended, existing events are immutable, and downstream projections are rebuilt from the log.&lt;/p&gt;

&lt;p&gt;This buys you three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-travel&lt;/strong&gt;: you can answer "what was the balance at any past moment" by replaying entries up to that timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disaster recovery&lt;/strong&gt;: if your &lt;code&gt;balances&lt;/code&gt; materialized table is corrupted (a bad migration, a bug in the projector), you drop it and rebuild from &lt;code&gt;entries&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit&lt;/strong&gt;: the log is what regulators read. They don't want your &lt;code&gt;balances&lt;/code&gt; table; they want the chronological record of every event.&lt;/li&gt;
&lt;/ol&gt;
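&lt;p&gt;The time-travel point can be sketched in a few lines, assuming entries have been loaded into memory and using &lt;code&gt;created_at&lt;/code&gt; as the replay cutoff. &lt;code&gt;LoggedEntry&lt;/code&gt; and &lt;code&gt;balance_at&lt;/code&gt; are hypothetical names; the sign convention matches the balance query earlier (debits add, credits subtract):&lt;/p&gt;

```python
from datetime import datetime
from typing import NamedTuple


class LoggedEntry(NamedTuple):
    # Hypothetical in-memory view of one entries row.
    account_id: int
    direction: str      # 'D' or 'C'
    amount_minor: int
    created_at: datetime


def balance_at(entries: list[LoggedEntry], account_id: int, as_of: datetime) -> int:
    """Replay the log for one account, ignoring anything written after `as_of`."""
    balance = 0
    for e in entries:
        if e.account_id != account_id or e.created_at > as_of:
            continue
        balance += e.amount_minor if e.direction == "D" else -e.amount_minor
    return balance
```

&lt;p&gt;Because entries are never updated, the same call with the same cutoff always returns the same number.&lt;/p&gt;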

&lt;p&gt;The projector worker is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;project_balances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM entries WHERE id &amp;gt; (SELECT COALESCE(MAX(last_entry_id), 0) FROM balances) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ORDER BY id LIMIT %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;new_entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_entries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt;
                &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO balances (account_id, balance_minor, last_entry_id) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VALUES (%s, %s, %s) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ON CONFLICT (account_id) DO UPDATE SET &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  balance_minor = balances.balance_minor + EXCLUDED.balance_minor, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  last_entry_id = EXCLUDED.last_entry_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this worker falls behind, balances are stale but never wrong. If the worker dies and you restart it, it picks up from &lt;code&gt;last_entry_id&lt;/code&gt; and catches up. If you find a bug in projection logic, you &lt;code&gt;TRUNCATE balances&lt;/code&gt; and restart; the log rebuilds it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconciliation: daily checksum against the processor
&lt;/h2&gt;

&lt;p&gt;Every payment processor (Stripe, Adyen, PayPal, your acquiring bank) sends a settlement report. It lists what they think they collected, what they took in fees, and what they're going to deposit. Your job is to make their report match your ledger, to the cent.&lt;/p&gt;

&lt;p&gt;The reconciliation job runs nightly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reconcile_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Pull processor's view
&lt;/span&gt;    &lt;span class="n"&gt;processor_charges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_charges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# external truth
&lt;/span&gt;
    &lt;span class="c1"&gt;# Pull our ledger's view of charges from that processor account
&lt;/span&gt;    &lt;span class="n"&gt;our_charges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT t.id, t.idempotency_key, e.amount_minor, e.currency
        FROM transactions t
        JOIN entries e ON e.transaction_id = t.id
        WHERE e.account_id = (SELECT id FROM accounts WHERE owner_id = %s)
          AND t.occurred_at::date = %s
          AND e.direction = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;D&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Match on processor's transaction reference, which we store in transactions.description or a side table
&lt;/span&gt;    &lt;span class="n"&gt;discrepancies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor_charges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;our_charges&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;discrepancies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Don't auto-fix. File a discrepancy record, page on-call.
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;discrepancies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO reconciliation_discrepancies &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(date, kind, processor_ref, processor_amount, our_amount, details) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VALUES (%s, %s, %s, %s, %s, %s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor_ref&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;our_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reconciliation failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;discrepancies&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The discrepancies are usually one of four shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Processor has, we don't&lt;/strong&gt;: a charge succeeded at the processor but the success webhook never reached us. Reach out, replay it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;We have, processor doesn't&lt;/strong&gt;: we recorded a charge that never actually settled. Bug in our state machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amount mismatch&lt;/strong&gt;: usually rounding (more on that), or a fee we recorded against the wrong account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Currency mismatch&lt;/strong&gt;: an FX conversion happened differently on the processor side than we modeled.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-fixing reconciliation discrepancies is a trap. You think you're cleaning up; you're actually papering over the bug that produced the discrepancy. File the discrepancy, page a human, fix the root cause. That's the only way the gap doesn't widen.&lt;/p&gt;
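
&lt;p&gt;The four shapes above can be sketched as a pure classification step. This is a minimal sketch, not the article's reconciliation job: the &lt;code&gt;Settlement&lt;/code&gt; type, the &lt;code&gt;classify&lt;/code&gt; function, and the kind names are all hypothetical, and both sides are assumed to be keyed by the processor reference. Note that it only reports; nothing here tries to fix a row.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settlement:
    """One settled charge as seen by either side (hypothetical shape)."""
    amount_minor: int
    currency: str


def classify(ours, processor):
    """Return (kind, processor_ref) pairs for every row that does not match."""
    found = []
    # Present at the processor, missing in our ledger.
    for ref in sorted(set(processor).difference(ours)):
        found.append(("processor_has_we_dont", ref))
    # Present in our ledger, missing at the processor.
    for ref in sorted(set(ours).difference(processor)):
        found.append(("we_have_processor_doesnt", ref))
    # Present on both sides: compare currency first, then amount.
    for ref in sorted(set(ours).intersection(processor)):
        mine, theirs = ours[ref], processor[ref]
        if mine.currency != theirs.currency:
            found.append(("currency_mismatch", ref))
        elif mine.amount_minor != theirs.amount_minor:
            found.append(("amount_mismatch", ref))
    return found


ours = {
    "ch_1": Settlement(5000, "USD"),
    "ch_2": Settlement(1999, "EUR"),
    "ch_4": Settlement(700, "USD"),
}
processor = {
    "ch_1": Settlement(5000, "USD"),
    "ch_2": Settlement(1998, "EUR"),
    "ch_3": Settlement(250, "USD"),
}
discrepancies = classify(ours, processor)
print(discrepancies)
```

&lt;p&gt;The output of a step like this is what gets filed and paged on. The decision about what to do with each row stays with a human.&lt;/p&gt;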

&lt;h2&gt;
  
  
  Dispute handling: chargebacks as reverse entries
&lt;/h2&gt;

&lt;p&gt;A customer disputes a charge months after it happened. The processor reverses the $50 and takes a $15 fee for their trouble. You do &lt;em&gt;not&lt;/em&gt; go back and delete the original transaction. You write two new transactions, each with a balanced pair of entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The chargeback reversal (mirrors the original charge, opposite direction)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;occurred_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b88...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'chargeback_a3f'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Chargeback of txn a3f'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2026-07-20 10:00:00+00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b88...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4501&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'C'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- cash down&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b88...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7218&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'D'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- customer liability down&lt;/span&gt;

&lt;span class="c1"&gt;-- The processor's chargeback fee&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;occurred_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b89...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'chargeback_fee_a3f'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Chargeback fee for txn a3f'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2026-07-20 10:00:01+00'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;entries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transaction_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b89...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4501&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'C'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- cash down&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'b89...'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'D'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'USD'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- 'dispute fees' expense account up&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original transaction &lt;code&gt;a3f&lt;/code&gt; is still there. The reversal's idempotency key (&lt;code&gt;chargeback_a3f&lt;/code&gt;) refers back to it. The history reads: charged on March 4th, reversed on July 20th, fee posted on July 20th. A regulator or auditor can see the whole story. A &lt;code&gt;DELETE FROM transactions WHERE id = 'a3f'&lt;/code&gt; would have destroyed it.&lt;/p&gt;
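
&lt;p&gt;Building the reversal is mechanical: copy the original entries, flip each direction, and attach them to the new transaction. A minimal sketch, assuming entries are plain dicts with the same columns as the &lt;code&gt;entries&lt;/code&gt; table and that the original charge debited account 4501 and credited 7218. The helper names and the transaction ids &lt;code&gt;txn_original&lt;/code&gt; and &lt;code&gt;txn_reversal&lt;/code&gt; are placeholders.&lt;/p&gt;

```python
def reversing_entries(original_entries, reversal_txn_id):
    """Mirror each entry in the opposite direction under a new transaction id."""
    flip = {"D": "C", "C": "D"}
    return [
        {**entry, "transaction_id": reversal_txn_id, "direction": flip[entry["direction"]]}
        for entry in original_entries
    ]


def is_balanced(entries):
    """Double-entry invariant: total debits equal total credits."""
    debits = sum(e["amount_minor"] for e in entries if e["direction"] == "D")
    credits = sum(e["amount_minor"] for e in entries if e["direction"] == "C")
    return debits == credits


# Assumed shape of the original charge (placeholder transaction id).
original = [
    {"transaction_id": "txn_original", "account_id": 4501, "direction": "D",
     "amount_minor": 5000, "currency": "USD"},
    {"transaction_id": "txn_original", "account_id": 7218, "direction": "C",
     "amount_minor": 5000, "currency": "USD"},
]
reversal = reversing_entries(original, "txn_reversal")
print([(e["account_id"], e["direction"]) for e in reversal])
print(is_balanced(reversal))
```

&lt;p&gt;The original rows are never touched. The reversal is a new, balanced transaction that happens to cancel them out.&lt;/p&gt;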

&lt;p&gt;The rule is brutal and simple: never delete, always reverse. If you find code that runs &lt;code&gt;DELETE&lt;/code&gt; against &lt;code&gt;transactions&lt;/code&gt; or &lt;code&gt;entries&lt;/code&gt;, it's a bug. Add a CI check.&lt;/p&gt;
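
&lt;p&gt;One way such a CI check might look is a scan over SQL text for deletes against the two ledger tables. This is a rough sketch under stated assumptions: &lt;code&gt;find_forbidden_deletes&lt;/code&gt; is a hypothetical helper, it works line by line, and it will miss statements split across lines or built by an ORM.&lt;/p&gt;

```python
import re

# Matches "DELETE FROM transactions" or "DELETE FROM entries", case-insensitive.
FORBIDDEN_DELETE = re.compile(
    r"\bDELETE\s+FROM\s+(transactions|entries)\b", re.IGNORECASE
)


def find_forbidden_deletes(sql_text):
    """Return (line_number, line) pairs that delete from a ledger table."""
    hits = []
    for number, line in enumerate(sql_text.splitlines(), start=1):
        if FORBIDDEN_DELETE.search(line):
            hits.append((number, line.strip()))
    return hits


sample = "\n".join([
    "SELECT 1;",
    "delete from entries WHERE id = 7;",
    "DELETE FROM sessions WHERE expired;",
    "DELETE FROM entries_archive;",
])
hits = find_forbidden_deletes(sample)
print(hits)
```

&lt;p&gt;A CI job could run this over migrations and query files and fail the build when the list is not empty.&lt;/p&gt;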

&lt;h2&gt;
  
  
  Multi-currency, FX, and the precision problem
&lt;/h2&gt;

&lt;p&gt;Money is not a &lt;code&gt;float&lt;/code&gt;. This is the single most expensive bug in payment systems, and you find it in year 2 when you reconcile $0.03 against the processor and have no idea where it came from.&lt;/p&gt;

&lt;p&gt;Floats lose precision. &lt;code&gt;0.1 + 0.2 != 0.3&lt;/code&gt; in IEEE 754. Run it in any language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;
&lt;span class="mf"&gt;0.30000000000000004&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiply that across a million transactions and the rounding error becomes real money. Worse, float arithmetic is order-dependent: sum-then-multiply can give a different answer from multiply-then-sum, because each intermediate result is rounded separately. You can't reconcile against a processor that does the math in integers when your code does it in floats.&lt;/p&gt;
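
&lt;p&gt;The order dependence is easy to reproduce. The snippet below is only an illustration with made-up amounts: the same three values grouped two ways disagree as floats and agree as integer cents.&lt;/p&gt;

```python
# Same three amounts, grouped two different ways.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                  # False: the grouping changed the result

# Ten dimes do not add up to a dollar in binary floating point.
print(sum([0.1] * 10) == 1.0)         # False

# The same sums in integer cents are exact in any order.
cents_left = (10 + 20) + 30
cents_right = 10 + (20 + 30)
print(cents_left == cents_right)      # True
print(sum([10] * 10) == 100)          # True
```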

&lt;p&gt;The fix is the boring one. Store money as integer minor units (cents for USD, satoshis for BTC, the smallest denomination your currency supports). The schema above already does this: &lt;code&gt;amount_minor BIGINT&lt;/code&gt;. A $50.00 charge is &lt;code&gt;5000&lt;/code&gt;. A €19.99 charge is &lt;code&gt;1999&lt;/code&gt;. A ¥500 charge is &lt;code&gt;500&lt;/code&gt; (JPY has no minor unit, so the conversion factor is 1 instead of 100).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong
&lt;/span&gt;&lt;span class="n"&gt;amount_usd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;49.99&lt;/span&gt;
&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;amount_usd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.07&lt;/span&gt;  &lt;span class="c1"&gt;# tax. Hello, float drift.
&lt;/span&gt;
&lt;span class="c1"&gt;# Right
&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4999&lt;/span&gt;          &lt;span class="c1"&gt;# $49.99
&lt;/span&gt;&lt;span class="n"&gt;tax_minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;   &lt;span class="c1"&gt;# integer math, half rounds up: 350 cents = $3.50
&lt;/span&gt;&lt;span class="n"&gt;total_minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;amount_minor&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tax_minor&lt;/span&gt;
&lt;span class="c1"&gt;# Bankers' rounding for half-cent cases:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROUND_HALF_EVEN&lt;/span&gt;
&lt;span class="n"&gt;tax_minor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount_minor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.07&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;quantize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;rounding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ROUND_HALF_EVEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
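
&lt;p&gt;Getting amounts into minor units in the first place is where the per-currency factor matters. A minimal sketch, assuming amounts arrive as decimal strings and using a hand-written exponent table. &lt;code&gt;to_minor&lt;/code&gt; and &lt;code&gt;MINOR_UNIT_EXPONENT&lt;/code&gt; are hypothetical names, and a real system would load the exponents from ISO 4217 data.&lt;/p&gt;

```python
from decimal import Decimal

# Assumed exponents for illustration: number of decimal places per currency.
MINOR_UNIT_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0}


def to_minor(amount, currency):
    """Convert a decimal string such as "19.99" to integer minor units."""
    scaled = Decimal(amount) * (10 ** MINOR_UNIT_EXPONENT[currency])
    # Refuse amounts finer than the currency allows instead of rounding silently.
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} {currency} has more precision than the currency allows")
    return int(scaled)


print(to_minor("50.00", "USD"))   # 5000
print(to_minor("19.99", "EUR"))   # 1999
print(to_minor("500", "JPY"))     # 500
```

&lt;p&gt;Parsing from a string rather than a float matters here: &lt;code&gt;Decimal("19.99")&lt;/code&gt; is exact, while a float 19.99 has already lost precision before the conversion starts.&lt;/p&gt;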



&lt;p&gt;FX is its own pit. When you accept a payment in EUR and settle in USD, you do the conversion once, store both legs in their native currencies, and record the FX rate as metadata. Don't average rates, don't recompute historical FX with today's rate, don't lose the original currency. A regulator will ask, and "we converted everything to USD at ingest" is the wrong answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;fx_rate&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;fx_source_currency&lt;/span&gt; &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;fx_source_amount_minor&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
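
&lt;p&gt;A sketch of "convert once, keep both legs" might look like the following. Everything named here is an assumption for illustration: the &lt;code&gt;FxLeg&lt;/code&gt; record, the &lt;code&gt;convert_once&lt;/code&gt; helper, and the rate of 1.0850, which is made up. It also assumes both currencies use two decimal places, so minor units map one to one.&lt;/p&gt;

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_EVEN


@dataclass(frozen=True)
class FxLeg:
    """Both sides of one conversion plus the rate that produced it."""
    source_currency: str
    source_amount_minor: int
    target_currency: str
    target_amount_minor: int
    fx_rate: Decimal


def convert_once(source_amount_minor, source_currency, target_currency, fx_rate):
    # Assumes both currencies have two decimal places (minor units map 1:1).
    target = (Decimal(source_amount_minor) * fx_rate).quantize(
        Decimal("1"), rounding=ROUND_HALF_EVEN
    )
    return FxLeg(source_currency, source_amount_minor, target_currency, int(target), fx_rate)


# Illustrative rate only; a real rate comes from the processor or a rate feed.
leg = convert_once(1999, "EUR", "USD", Decimal("1.0850"))
print(leg.target_amount_minor)   # 2169
```

&lt;p&gt;The record is frozen on purpose. Once the conversion is stored, later code reads the rate and both amounts from it and never recomputes them from today's rate.&lt;/p&gt;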



&lt;p&gt;The cardinal sin is rounding at the wrong place. Round when the customer sees a number. Don't round inside the ledger. Every intermediate calculation runs at full precision; the rounding happens once, at the boundary, with &lt;code&gt;ROUND_HALF_EVEN&lt;/code&gt; (bankers' rounding) to keep long-run drift symmetric.&lt;/p&gt;
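
&lt;p&gt;Why half-even rather than the half-up rule from school? A small illustration with made-up values: when many amounts land exactly on a half, half-up always pushes the total upward, while half-even lets the errors cancel.&lt;/p&gt;

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Four amounts that each sit exactly on a half unit.
halves = [Decimal("0.5"), Decimal("1.5"), Decimal("2.5"), Decimal("3.5")]


def rounded_total(values, mode):
    """Round each value to a whole unit with the given mode, then add them up."""
    return sum(int(v.quantize(Decimal("1"), rounding=mode)) for v in values)


exact = sum(halves)                                    # 8.0
half_up = rounded_total(halves, ROUND_HALF_UP)         # 1 + 2 + 3 + 4 = 10
half_even = rounded_total(halves, ROUND_HALF_EVEN)     # 0 + 2 + 2 + 4 = 8
print(exact, half_up, half_even)
```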

&lt;h2&gt;
  
  
  The 90-second answer
&lt;/h2&gt;

&lt;p&gt;When the interviewer asks "design a payment ledger," the answer that wins the round:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A payment ledger is an append-only event store with double-entry bookkeeping. Every transaction has two or more entries that sum to zero, stored in an immutable &lt;code&gt;entries&lt;/code&gt; table. Current balances are a projection over the log, materialized into a &lt;code&gt;balances&lt;/code&gt; table by a worker that folds new entries in incrementally.&lt;/p&gt;

&lt;p&gt;Idempotency is per payment attempt, not per customer or order. The client generates a UUID for each attempt and reuses it on retries. The server enforces uniqueness with a &lt;code&gt;UNIQUE&lt;/code&gt; constraint on &lt;code&gt;transactions.idempotency_key&lt;/code&gt;, and stores a hash of the request body to reject reuse with a different payload.&lt;/p&gt;

&lt;p&gt;Refunds and chargebacks never delete entries; they write reversing entries that point back to the original transaction by idempotency key. This preserves the audit trail that regulators require.&lt;/p&gt;

&lt;p&gt;A nightly reconciliation job compares our ledger against the payment processor's settlement report, files discrepancies, and pages on-call. Reconciliation never auto-fixes; humans investigate.&lt;/p&gt;

&lt;p&gt;Money is stored as integer minor units (cents). Never floating point. FX conversions store both source and target currency plus the rate, computed once at ingest and never recomputed.&lt;/p&gt;

&lt;p&gt;The properties that make this different from CRUD: double-entry invariant per transaction, idempotency per attempt, append-only log as source of truth, daily reconciliation to the cent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the 90 seconds. The interviewer asks follow-ups (sharding, partial refunds, multi-currency settlement, what happens during a processor outage), and the design above gives you a vocabulary to answer them without contradiction.&lt;/p&gt;
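
&lt;p&gt;The idempotency paragraph of that answer is the one interviewers most often ask to see in code. Here is a minimal in-memory sketch of the behavior, assuming a dict stands in for the &lt;code&gt;UNIQUE&lt;/code&gt; constraint and the stored request hash. &lt;code&gt;IdempotencyStore&lt;/code&gt; and &lt;code&gt;IdempotencyConflict&lt;/code&gt; are hypothetical names, and the sketch is not safe under concurrent requests; in a real system the database constraint does that job.&lt;/p&gt;

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class IdempotencyStore:
    """In-memory stand-in for a unique key column plus a stored request hash."""

    def __init__(self):
        self._seen = {}   # idempotency_key: (request_hash, response)

    def execute(self, key, request, handler):
        body = json.dumps(request, sort_keys=True).encode()
        request_hash = hashlib.sha256(body).hexdigest()
        if key in self._seen:
            stored_hash, response = self._seen[key]
            if stored_hash != request_hash:
                raise IdempotencyConflict(f"key {key} reused with a different payload")
            return response   # retry: replay the stored response, charge nothing
        response = handler(request)
        self._seen[key] = (request_hash, response)
        return response


calls = []


def charge(request):
    calls.append(request)
    return {"status": "charged", "amount_minor": request["amount_minor"]}


store = IdempotencyStore()
first = store.execute("attempt-1", {"amount_minor": 5000, "currency": "USD"}, charge)
retry = store.execute("attempt-1", {"amount_minor": 5000, "currency": "USD"}, charge)
print(first == retry, len(calls))   # True 1
```

&lt;p&gt;The retry returns the stored response and the handler runs once. Reusing &lt;code&gt;attempt-1&lt;/code&gt; with a different amount raises instead of silently charging a second time.&lt;/p&gt;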

&lt;h2&gt;
  
  
  The gotcha: floating-point money is the bug you find at year 2
&lt;/h2&gt;

&lt;p&gt;Every ledger that uses &lt;code&gt;float&lt;/code&gt; or &lt;code&gt;double&lt;/code&gt; for money eventually fails a reconciliation. Usually around year 2, when you have enough volume that a tenth of a cent of drift per transaction shows up as a dollar a day, and a few months of those add up to a number your CFO notices.&lt;/p&gt;

&lt;p&gt;The fix is to use integers (cents) from day one. The refactor at year 2, when you have millions of rows in &lt;code&gt;entries&lt;/code&gt;, is doable but expensive. You convert column types, rewrite every query, recompute every historical balance, and run for months in parallel comparing old and new sums. It works. It's also six months of work you didn't need.&lt;/p&gt;

&lt;p&gt;If you take one thing from this post: &lt;code&gt;amount_minor BIGINT&lt;/code&gt;, on every monetary column, from day one. The interview answer mentions it. The codebase enforces it. The CI check rejects any new &lt;code&gt;REAL&lt;/code&gt;, &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;, or &lt;code&gt;NUMERIC&lt;/code&gt; amount column on a financial table; rates such as &lt;code&gt;fx_rate&lt;/code&gt; are metadata, not amounts, and stay &lt;code&gt;NUMERIC&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The other lessons are reversible. This one isn't.&lt;/p&gt;

&lt;p&gt;What's the worst ledger bug you've shipped to prod? Float drift, a missing idempotency key, a &lt;code&gt;DELETE&lt;/code&gt; that should've been a reverse entry? Drop the story in the comments; I want to read it.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;The full version of this design (sharding strategy, saga compensation for cross-shard transfers, the projection table schema in detail, and the regulator export format) is the payment-ledger chapter in the &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews&lt;/a&gt;. The book walks 15 system designs at this depth, each ending with a 90-second answer template. If you've got an interview coming up where "design Stripe" or "design Robinhood" might land, the ledger chapter is the one to read first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>fintech</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Design a Real-Time Collaboration Backend (OT vs CRDT, Step by Step)</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 13:47:32 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-real-time-collaboration-backend-ot-vs-crdt-step-by-step-2clo</link>
      <guid>https://forem.com/gabrielanhaia/design-a-real-time-collaboration-backend-ot-vs-crdt-step-by-step-2clo</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;"Design Google Docs" is the prompt that exposes whether a candidate has actually thought about real-time collaboration, or whether they've memorized a list of buzzwords. The trap is that OT and CRDT are not interchangeable. Pick the wrong one and your sync engine fights itself for the rest of its life.&lt;/p&gt;

&lt;p&gt;Most candidates pick CRDT because it sounds fancier. Most production systems that ship still use OT. Both are right answers depending on the question, and the interviewer is watching to see if you know which question is being asked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The interview-grade definition of "real-time collaboration"
&lt;/h2&gt;

&lt;p&gt;When the interviewer says "real-time collaboration," they mean four properties at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Two or more clients edit the same document. Concurrent edits don't lose each other.&lt;/li&gt;
&lt;li&gt;Edits propagate in under ~100ms perceived latency.&lt;/li&gt;
&lt;li&gt;Every client converges to the same final state, eventually.&lt;/li&gt;
&lt;li&gt;The system survives a client going offline mid-edit and reconnecting later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Notice what's missing: nothing about locking, nothing about "the server decides who wins." A collaboration system that locks rows is a CRUD app with extra steps. The hard part is that two users typed at the same position at the same time and both edits have to land.&lt;/p&gt;

&lt;p&gt;The two families of algorithms that solve this are Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDT). They solve the same problem with opposite philosophies.&lt;/p&gt;

&lt;h2&gt;
  
  
  OT: the Google Docs lineage
&lt;/h2&gt;

&lt;p&gt;Operational Transformation came out of academic CSCW research in the late 1980s, starting with the GROVE editor. The Jupiter system refined it in the mid-1990s. Google Docs picked it up around 2010 and made it the household name.&lt;/p&gt;

&lt;p&gt;The core idea: every edit is an &lt;code&gt;op&lt;/code&gt; (operation) with a position. When two ops happen concurrently, a central server transforms one against the other so they apply in a consistent order on every client.&lt;/p&gt;

&lt;p&gt;A minimal OT op looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;Op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;insert&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;char&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delete&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;len&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Client A sees: "hello"&lt;/span&gt;
&lt;span class="c1"&gt;// A inserts "X" at pos 2 → "heXllo"&lt;/span&gt;
&lt;span class="c1"&gt;// Concurrently, B deletes 1 char at pos 0 → "ello"&lt;/span&gt;
&lt;span class="c1"&gt;// Server receives both. It transforms A's op against B's:&lt;/span&gt;
&lt;span class="c1"&gt;//   A.pos -= B.len  →  insert "X" at pos 1&lt;/span&gt;
&lt;span class="c1"&gt;// Final state on every client: "eXllo"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The transform function is the heart of the algorithm. For a real editor you have transforms for every pair: &lt;code&gt;insert × insert&lt;/code&gt;, &lt;code&gt;insert × delete&lt;/code&gt;, &lt;code&gt;delete × delete&lt;/code&gt;, plus formatting ops, attribute changes, and embedded objects. Get one transform wrong and two users with the same op history end up with different text. Debugging that bug in production is the kind of thing that turns engineers into woodworkers.&lt;/p&gt;
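
&lt;p&gt;To make one of those pairs concrete, here is a sketch of the &lt;code&gt;insert × delete&lt;/code&gt; case in Python, kept deliberately small. It covers a single direction only (an insert transformed against a delete the server applied first), the function names are hypothetical, and a real editor needs the other pairs and tie-breaking rules as well.&lt;/p&gt;

```python
def apply(doc, op):
    """Apply a single insert or delete op to a string document."""
    if op["type"] == "insert":
        return doc[:op["pos"]] + op["char"] + doc[op["pos"]:]
    return doc[:op["pos"]] + doc[op["pos"] + op["len"]:]


def transform_insert_against_delete(insert_op, delete_op):
    """Shift an insert so it still lands correctly after a concurrent delete."""
    # Number of deleted characters that sat before the insert position.
    # Zero when the insert is left of the delete; capped at the delete length.
    shift = max(0, min(delete_op["len"], insert_op["pos"] - delete_op["pos"]))
    return {**insert_op, "pos": insert_op["pos"] - shift}


doc = "hello"
a = {"type": "insert", "pos": 2, "char": "X"}   # client A
b = {"type": "delete", "pos": 0, "len": 1}      # client B, concurrent

# The server happens to order B first, then applies A's transformed op.
server_doc = apply(doc, b)                      # "ello"
a_prime = transform_insert_against_delete(a, b)
server_doc = apply(server_doc, a_prime)
print(a_prime["pos"], server_doc)               # 1 eXllo
```

&lt;p&gt;An insert that falls inside the deleted range collapses to the start of the delete. That is one common convention, not the only one, and choices like it are exactly where transform bugs hide.&lt;/p&gt;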

&lt;p&gt;What OT requires in practice: a central, authoritative server. Every client sends ops to the server, the server orders them, transforms them, and broadcasts the transformed versions back. Running OT peer-to-peer generally means electing one peer as the server, which defeats the point.&lt;/p&gt;

&lt;p&gt;What you get in return: small ops on the wire (a single &lt;code&gt;insert&lt;/code&gt; is a handful of bytes), simple client-side state (just the current document plus a small pending-op queue), and a decades-deep body of academic work that explains every edge case.&lt;/p&gt;

&lt;p&gt;OT in the wild: Google Docs, Google Slides, ShareDB, Apache Wave (RIP), CodeMirror's &lt;code&gt;@codemirror/collab&lt;/code&gt; package.&lt;/p&gt;

&lt;h2&gt;
  
  
  CRDT: the Figma / Linear / Notion direction
&lt;/h2&gt;

&lt;p&gt;A CRDT replaces "transform ops on a central server" with "ops carry enough metadata to be commutative and idempotent." If every replica applies every op at least once, in any order, the replicas converge. No transform needed. No central server needed.&lt;/p&gt;

&lt;p&gt;For text, the most common CRDT family is RGA-style (Replicated Growable Array) or YATA, used by Yjs. Every character gets a unique ID (usually &lt;code&gt;(clientID, clock)&lt;/code&gt;) and a parent pointer to the character it was inserted after. Insertion order is reconstructed from the IDs, not from positions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Conceptual shape of a Yjs-style insert.&lt;/span&gt;
&lt;span class="c1"&gt;// Every character carries its own identity.&lt;/span&gt;
&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ItemID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;clock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;YItem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ItemID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                              &lt;span class="c1"&gt;// unique per character&lt;/span&gt;
  &lt;span class="nl"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ItemID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                   &lt;span class="c1"&gt;// the char it was inserted after&lt;/span&gt;
  &lt;span class="nl"&gt;rightOrigin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ItemID&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// tie-breaker for concurrent inserts&lt;/span&gt;
  &lt;span class="nl"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;deleted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                        &lt;span class="c1"&gt;// tombstone, never removed eagerly&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CRDT updates, with their per-character IDs and tombstones, take more space than OT ops. But the math works out: any two replicas that have seen the same set of ops hold the same document. No transform table. No central order.&lt;/p&gt;
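
&lt;p&gt;The convergence claim can be checked with a toy replica. The sketch below is written in Python for brevity and is a heavily simplified tree-style text CRDT, not the YATA algorithm Yjs uses: the op tuples, the &lt;code&gt;Replica&lt;/code&gt; class, and the "highest ID first" tie-break are all assumptions for illustration. It feeds the same four ops to a fresh replica in every possible order, with duplicates, and collects the resulting text.&lt;/p&gt;

```python
from itertools import permutations

# Hypothetical op shapes, for illustration only (not the Yjs wire format):
#   ("ins", item_id, origin_id, char)   where item_id = (clock, client)
#   ("del", item_id)


class Replica:
    def __init__(self):
        self.items = {}          # item_id: (origin_id, char)
        self.tombstones = set()  # ids of deleted items, kept forever

    def apply(self, op):
        if op[0] == "ins":
            _, item_id, origin_id, char = op
            self.items[item_id] = (origin_id, char)   # applying twice changes nothing
        else:
            self.tombstones.add(op[1])                # may arrive before its insert

    def text(self):
        children = {}
        for item_id, (origin_id, _) in self.items.items():
            children.setdefault(origin_id, []).append(item_id)
        out = []

        def walk(origin_id):
            # Concurrent inserts after the same origin: highest id goes first.
            for item_id in sorted(children.get(origin_id, []), reverse=True):
                if item_id not in self.tombstones:
                    out.append(self.items[item_id][1])
                walk(item_id)

        walk(None)
        return "".join(out)


ops = [
    ("ins", (1, 1), None, "h"),      # client 1 types "h"
    ("ins", (2, 1), (1, 1), "i"),    # client 1 types "i" after "h"
    ("ins", (2, 2), (1, 1), "!"),    # client 2 concurrently types "!" after "h"
    ("del", (2, 1)),                 # client 2 deletes the "i"
]

results = set()
for order in permutations(ops):
    replica = Replica()
    for op in order + order[:2]:     # replay two ops to mimic duplicate delivery
        replica.apply(op)
    results.add(replica.text())
print(results)   # a single value: every delivery order converged
```

&lt;p&gt;The tombstone set is the cost the paragraph above mentions: the deleted "i" stays in memory so that a late-arriving insert anchored to it would still know where to go.&lt;/p&gt;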

&lt;p&gt;Yjs is the de-facto library for the JavaScript world. Automerge is the Rust/JS sibling with a JSON-document-shaped API. Both are production-grade.&lt;/p&gt;

&lt;p&gt;A side-by-side of the same edit pattern looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Yjs: insert "X" at position 2 in a shared text doc.&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;yjs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Doc&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ytext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;body&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;ytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hello&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;ytext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                     &lt;span class="c1"&gt;// "heXllo"&lt;/span&gt;

&lt;span class="c1"&gt;// The encoded update is what you send on the wire.&lt;/span&gt;
&lt;span class="c1"&gt;// Doesn't matter if 7 peers receive these in different orders.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;update&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encodeStateAsUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Automerge: same insert, JSON-document API.&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@automerge/automerge&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hello&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;change&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// splice records a character-level insert (Automerge 3, or the `next` API in 2.x);&lt;/span&gt;
  &lt;span class="c1"&gt;// reassigning d.body would overwrite the whole string on merge.&lt;/span&gt;
  &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;body&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;// "heXllo"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getChanges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;A&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Ship `changes` to any peer; merge order doesn't matter.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trade: bigger payloads, heavier client memory, but you can ship a peer-to-peer editor over WebRTC, support offline-first natively, and never write a transform function in your life.&lt;/p&gt;

&lt;p&gt;CRDT in the wild: Figma (a custom, CRDT-inspired design with a central server, not Yjs), Linear (custom, ops + snapshots), Notion (CRDT-ish for blocks), Apple Notes sync, Roam Research clones, AppFlowy, the Y-collab ecosystem (Tldraw, Hocuspocus, Liveblocks, ProseMirror with &lt;code&gt;y-prosemirror&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing: three questions
&lt;/h2&gt;

&lt;p&gt;Forget "which is better." The choice falls out of three questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Server-authoritative or peer-to-peer?&lt;/strong&gt; If the server has to enforce permissions, redact content, or be the legal source of truth (medical records, financial documents, regulated content), OT fits the mental model. The server already orders everything. With CRDT you can still enforce permissions, but the server is one peer among many and the model leaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Offline-first?&lt;/strong&gt; If users edit on a plane for six hours and expect their work to merge back in cleanly, CRDT wins. OT's offline story exists, but the transform queues get hairy after a long disconnect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What's your operational complexity budget?&lt;/strong&gt; OT means you own the transform function. Every new op type is a quadratic explosion of transform pairs you have to test. CRDT means you adopt Yjs or Automerge and accept their wire format, their memory shape, their idiosyncrasies. Neither is free.&lt;/p&gt;

&lt;p&gt;A rough heuristic: if the answer to "where does the document live?" is "in our database, and the client is just a view," OT. If the answer is "on every device, and the server is just one replica," CRDT.&lt;/p&gt;

&lt;h2&gt;
  
  
  Server architecture: what you actually deploy
&lt;/h2&gt;

&lt;p&gt;With either algorithm, the deployment shape is similar. Three components.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                ┌───────────────────────┐
   clients ──── │  WebSocket gateway    │  (sticky to doc shard)
                └──────────┬────────────┘
                           │
                ┌──────────▼────────────┐
                │  Doc shard (per doc)  │  ── in-memory authoritative state
                │  - applies ops        │     - OT: transform &amp;amp; broadcast
                │  - keeps clients map  │     - CRDT: merge &amp;amp; broadcast
                └──────────┬────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ops log    snapshots     presence service
         (append)  (periodic)     (Redis pub/sub)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;WebSocket gateway.&lt;/strong&gt; Terminates client connections. Routes by &lt;code&gt;doc_id&lt;/code&gt; to the right shard. Sticky session: once a client lands on a shard, all its messages go there. Use consistent hashing on &lt;code&gt;doc_id&lt;/code&gt; so adding shards doesn't reshuffle everything.&lt;/p&gt;
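
&lt;p&gt;A minimal sketch of that routing, assuming a hash ring with virtual nodes. The names &lt;code&gt;buildRing&lt;/code&gt; and &lt;code&gt;shardFor&lt;/code&gt;, the shard labels, and the FNV-1a-style hash are illustrative assumptions, not part of any particular gateway.&lt;/p&gt;

```typescript
// Hypothetical sketch: a consistent-hash ring that maps doc_id to a shard name.
// The FNV-1a-style hash is used only because it is tiny; any stable hash works.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i !== input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

type RingPoint = { hash: number; shard: string };

// Each shard gets several virtual nodes so load spreads more evenly.
function buildRing(shards: string[], virtualNodes = 64): RingPoint[] {
  const ring: RingPoint[] = [];
  for (const shard of shards) {
    for (let v = 0; v !== virtualNodes; v++) {
      ring.push({ hash: fnv1a(`${shard}#${v}`), shard });
    }
  }
  return ring.sort((a, b) => a.hash - b.hash);
}

// Assumes a non-empty ring.
function shardFor(ring: RingPoint[], docId: string): string {
  const h = fnv1a(docId);
  // First point clockwise from the doc's hash; wrap around to the ring start.
  const point = ring.find((p) => p.hash >= h) ?? ring[0];
  return point.shard;
}

const threeShards = buildRing(["shard-a", "shard-b", "shard-c"]);
const fourShards = buildRing(["shard-a", "shard-b", "shard-c", "shard-d"]);
```

&lt;p&gt;With a ring like this, a doc whose owner changes after &lt;code&gt;shard-d&lt;/code&gt; joins can only move to &lt;code&gt;shard-d&lt;/code&gt;; every other doc keeps its shard, which is the property the gateway is after.&lt;/p&gt;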

&lt;p&gt;&lt;strong&gt;Doc shard.&lt;/strong&gt; One process (or actor, or goroutine, or Erlang process) per active document. Holds the current state in memory, applies incoming ops, broadcasts to every connected client for that doc. Dies when nobody's connected and the shard manager evicts it. Cold-start the next time someone opens the doc by replaying the ops log over the last snapshot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presence service.&lt;/strong&gt; Who's online, where their cursor is, what color they get. This is ephemeral and high-frequency (cursor moves at 60Hz). Don't put it through the same path as document ops, or it'll drown the ops log. Redis pub/sub or a separate WebSocket channel.&lt;/p&gt;

&lt;p&gt;For OT, the doc shard also holds the &lt;strong&gt;transform queue&lt;/strong&gt;: ops the server has accepted but not yet acknowledged to the originating client. For CRDT, the shard is mostly a relay. It merges incoming ops into its replica and broadcasts to other peers, and the merge math handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistence: ops log + snapshots
&lt;/h2&gt;

&lt;p&gt;This is the part that's the same for both families, and the part that interviewers almost always want you to draw on the whiteboard.&lt;/p&gt;

&lt;p&gt;You never want to write the full document to disk on every keystroke. Even a moderate-size doc is 50KB, and at 30 writes a second that is roughly 1.5MB/s of disk writes per active editor. So:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Append-only ops log.&lt;/strong&gt; Every accepted op gets written to a log keyed by &lt;code&gt;(doc_id, op_seq)&lt;/code&gt;. Postgres, DynamoDB, Cassandra, Kafka: anything that does cheap appends and ordered scans by doc. The log is the source of truth. If everything else burns down, you can rebuild any document from its log.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc_ops&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;doc_id&lt;/span&gt;      &lt;span class="n"&gt;uuid&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;op_seq&lt;/span&gt;      &lt;span class="nb"&gt;bigint&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;client_id&lt;/span&gt;   &lt;span class="nb"&gt;text&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;op&lt;/span&gt;          &lt;span class="n"&gt;jsonb&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op_seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Snapshots: a materialized state at a known op_seq.&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc_snapshots&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;doc_id&lt;/span&gt;      &lt;span class="n"&gt;uuid&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;op_seq&lt;/span&gt;      &lt;span class="nb"&gt;bigint&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;state&lt;/span&gt;       &lt;span class="n"&gt;bytea&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- serialized doc (OT) or Y.encodeStateAsUpdate&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;op_seq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Periodic snapshots.&lt;/strong&gt; Every N ops (N = 1000 is a fine starting point), serialize the current document state and write it as a snapshot. Cold-start = load the latest snapshot + replay the ops since then. Without snapshots, a hot doc with a year of history takes minutes to open.&lt;/p&gt;
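
&lt;p&gt;The cold-start rule can be sketched in a few lines. This is an in-memory illustration under assumptions: the &lt;code&gt;Op&lt;/code&gt;, &lt;code&gt;LoggedOp&lt;/code&gt; and &lt;code&gt;Snapshot&lt;/code&gt; shapes are hypothetical stand-ins for rows of the two tables above, and the document is a plain string rather than a rich-text model.&lt;/p&gt;

```typescript
// Hypothetical shapes mirroring rows in doc_ops and doc_snapshots.
type Op =
  | { kind: "insert"; pos: number; text: string }
  | { kind: "delete"; pos: number; len: number };
type LoggedOp = { opSeq: number; op: Op };
type Snapshot = { opSeq: number; state: string };

function applyOp(state: string, op: Op): string {
  if (op.kind === "insert") {
    return state.slice(0, op.pos) + op.text + state.slice(op.pos);
  }
  return state.slice(0, op.pos) + state.slice(op.pos + op.len);
}

// Cold start: begin from the latest snapshot (or an empty doc),
// then replay only the ops logged after that snapshot, in op_seq order.
function coldStart(snapshot: Snapshot | null, log: LoggedOp[]): Snapshot {
  const fromSeq = snapshot ? snapshot.opSeq : 0;
  let state = snapshot ? snapshot.state : "";
  let opSeq = fromSeq;
  const tail = log
    .filter((entry) => entry.opSeq > fromSeq)
    .sort((a, b) => a.opSeq - b.opSeq);
  for (const entry of tail) {
    state = applyOp(state, entry.op);
    opSeq = entry.opSeq;
  }
  return { opSeq, state };
}

const opsLog: LoggedOp[] = [
  { opSeq: 1, op: { kind: "insert", pos: 0, text: "hello" } },
  { opSeq: 2, op: { kind: "insert", pos: 2, text: "X" } },
  { opSeq: 3, op: { kind: "delete", pos: 0, len: 1 } },
];
const fromScratch = coldStart(null, opsLog);
const fromSnapshot = coldStart({ opSeq: 2, state: "heXllo" }, opsLog);
```

&lt;p&gt;Both calls end at the same state; the snapshot only changes how much of the log has to be replayed to get there.&lt;/p&gt;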

&lt;p&gt;For Yjs specifically, snapshots are just &lt;code&gt;Y.encodeStateAsUpdate(doc)&lt;/code&gt;, the encoded merge of every update so far. Restore with &lt;code&gt;Y.applyUpdate(newDoc, snapshot)&lt;/code&gt;. Automerge has &lt;code&gt;A.save(doc)&lt;/code&gt; / &lt;code&gt;A.load(bytes)&lt;/code&gt; for the same purpose.&lt;/p&gt;

&lt;p&gt;A practical pattern: ship snapshots out-of-band on a background job, not in the hot path. The doc shard accepts an op, writes it to the log, broadcasts to clients, and returns. A separate process tails the log and rolls up snapshots every N ops or every M minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Undo/redo
&lt;/h2&gt;

&lt;p&gt;Both OT and CRDT handle undo, but the semantics differ.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;OT&lt;/strong&gt;, undo is generating an inverse op (delete the inserted character, re-insert the deleted run) and applying it as a fresh op. The transform machinery does the rest. If someone else edited around your op in the meantime, the inverse gets transformed and lands cleanly.&lt;/p&gt;
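
&lt;p&gt;A small sketch of that inverse-then-transform step, under simplifying assumptions: plain-string documents, a hypothetical &lt;code&gt;TextOp&lt;/code&gt; shape whose deletes carry the removed text, and a transform that only handles a concurrent insert landing outside the run being undone. A real transform table covers many more cases.&lt;/p&gt;

```typescript
// Hypothetical op shape: a delete carries the removed text so it can be inverted.
type TextOp = { kind: "insert" | "delete"; pos: number; text: string };

function applyOp(doc: string, op: TextOp): string {
  if (op.kind === "insert") {
    return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
  }
  return doc.slice(0, op.pos) + doc.slice(op.pos + op.text.length);
}

// Undo = the opposite op at the same position.
function invert(op: TextOp): TextOp {
  return { kind: op.kind === "insert" ? "delete" : "insert", pos: op.pos, text: op.text };
}

// Shift `op` past a concurrent insert that was applied at or before its position.
// Simplification: the concurrent insert is assumed not to land inside op's own run.
function transformAgainstInsert(op: TextOp, other: TextOp): TextOp {
  return other.pos > op.pos ? op : { ...op, pos: op.pos + other.text.length };
}

const base = "hello world";
const mine: TextOp = { kind: "insert", pos: 6, text: "big " };
const theirs: TextOp = { kind: "insert", pos: 0, text: "oh, " };

// Server order: my insert, then someone else's insert at the front.
const current = applyOp(applyOp(base, mine), theirs);

// Undo my insert against the current state, not the state I originally saw.
const undo = transformAgainstInsert(invert(mine), theirs);
const afterUndo = applyOp(current, undo);
```

&lt;p&gt;The other user's text survives the undo because the inverse was shifted by the length of their insert before it was applied.&lt;/p&gt;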

&lt;p&gt;In &lt;strong&gt;CRDT&lt;/strong&gt; (Yjs in particular), undo is done by an &lt;code&gt;UndoManager&lt;/code&gt; that tracks which ops belong to the local user and inverts them on demand. The trick is that the inverse has to respect concurrent edits. If someone replied to your sentence, undoing your sentence doesn't take their reply with it.&lt;/p&gt;

&lt;p&gt;Mature libraries on both sides ship reasonable defaults. The interview answer: "Per-user undo stacks. The inverse is generated against the current state, not the original state, so concurrent edits survive."&lt;/p&gt;

&lt;h2&gt;
  
  
  Conflict UI: what the user actually sees
&lt;/h2&gt;

&lt;p&gt;True conflicts in text are rarer than you'd think. Two people typing at adjacent positions isn't a conflict, since both insertions land and the order is determined by the algorithm. A real conflict in a rich-text doc shows up at the semantic layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two users edit the same cell of a table.&lt;/li&gt;
&lt;li&gt;One user deletes a paragraph another user is editing.&lt;/li&gt;
&lt;li&gt;Concurrent format changes (one user bolds a run, another italicizes the overlapping run).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The honest UI answer is: most of the time, you don't show anything. The doc converges, the user reads the result. For the cases that matter, like a deleted block someone was typing into, surface a non-blocking notice ("Someone deleted this section while you were editing. Restore?") and offer one-click revert. Don't put a modal in the user's face.&lt;/p&gt;

&lt;p&gt;Figma's approach is the gold standard: their conflict-on-property-change leaves both edits applied in op order, and you can scrub the version history to see what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 90-second answer
&lt;/h2&gt;

&lt;p&gt;When the interviewer asks "design Google Docs," this is the shape:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I'd start with the algorithm. OT if the server is authoritative: Google Docs, anything with strict permissions or compliance. CRDT if I need offline-first or peer-to-peer: Figma, Linear, anything where the client is a real replica.&lt;/p&gt;

&lt;p&gt;Either way the architecture is the same: WebSocket gateway, sticky-routed by doc ID to a per-document shard process that holds in-memory state. The shard accepts ops, applies them (transforms for OT, merges for CRDT), broadcasts to other connected clients, and appends to a persistent ops log. Periodic snapshots roll the log up so cold-start is fast.&lt;/p&gt;

&lt;p&gt;Presence (cursors, selections, who's online) goes through a separate channel. Redis pub/sub. Don't pollute the ops log with cursor moves.&lt;/p&gt;

&lt;p&gt;Undo is per-user inverse ops generated against the current state. Conflict UI is mostly invisible, since the algorithm converges. For semantic conflicts (deleted-while-editing), surface a non-blocking notice with a restore option.&lt;/p&gt;

&lt;p&gt;The gotcha I'd call out: CRDT state grows. Every tombstone is forever unless you have a garbage-collection story. Yjs and Automerge both ship one. I'd budget for it from day one."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's 90 seconds. It hits algorithm choice, server shape, persistence, presence, undo, conflict UI, and one operational risk. Interviewer follow-ups will dig into whichever of those you said with the least confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: CRDT garbage collection
&lt;/h2&gt;

&lt;p&gt;Here's the bug that hits CRDT systems in production around month 18: the document size on disk keeps growing even though, as far as the user can tell, the doc shrank.&lt;/p&gt;

&lt;p&gt;CRDTs work by keeping tombstones. When you delete a character, the character object stays in the structure with &lt;code&gt;deleted: true&lt;/code&gt;. The tombstone has to exist so that concurrent edits referencing that character (an insert "after this character") have something to anchor to. If you eagerly remove the character, a delayed peer with an old reference can't merge.&lt;/p&gt;

&lt;p&gt;So your "empty" doc with one paragraph of text is actually carrying every character that's ever been typed, deleted, and re-typed. A doc that's been edited heavily for a year is comically larger than the visible text.&lt;/p&gt;
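
&lt;p&gt;To make the tombstone idea concrete, here is a toy sequence structure. It is a deliberately simplified assumption, not how Yjs or Automerge lay out their internals: the &lt;code&gt;ToySequence&lt;/code&gt; class and its ids are hypothetical, and it skips the tie-breaking a real CRDT needs for concurrent inserts at the same anchor.&lt;/p&gt;

```typescript
// Toy item: every character keeps its id forever, even after deletion.
type Item = { id: string; after: string | null; char: string; deleted: boolean };

class ToySequence {
  private items: Item[] = [];

  private indexOf(id: string): number {
    const i = this.items.findIndex((it) => it.id === id);
    if (i === -1) throw new Error(`unknown anchor ${id}`);
    return i;
  }

  // Insert `char` right after the item with id `after` (null = start of doc).
  insert(id: string, after: string | null, char: string): void {
    const at = after === null ? 0 : this.indexOf(after) + 1;
    this.items.splice(at, 0, { id, after, char, deleted: false });
  }

  // Deleting only flips a flag: the tombstone stays so later inserts can anchor to it.
  delete(id: string): void {
    this.items[this.indexOf(id)].deleted = true;
  }

  text(): string {
    return this.items.filter((it) => !it.deleted).map((it) => it.char).join("");
  }

  storedItems(): number {
    return this.items.length;
  }
}

const seq = new ToySequence();
seq.insert("a1", null, "h");
seq.insert("a2", "a1", "i");
seq.delete("a2");
// A delayed peer's insert still references the deleted character.
seq.insert("b1", "a2", "!");
```

&lt;p&gt;The visible text is two characters, but three items are stored. Drop the tombstone for &lt;code&gt;a2&lt;/code&gt; eagerly and the last insert has nothing to anchor to.&lt;/p&gt;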

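&lt;p&gt;Yjs garbage-collects deleted content as transactions commit, as long as &lt;code&gt;doc.gc&lt;/code&gt; is left at its default of &lt;code&gt;true&lt;/code&gt;: the content of a deleted item is dropped and only a lightweight placeholder stays behind. Calling &lt;code&gt;Y.encodeStateAsUpdate(doc)&lt;/code&gt; then writes that already-compacted state out as a single update. Automerge has &lt;code&gt;A.save&lt;/code&gt; which produces a compacted form. Both work, both have caveats:&lt;/p&gt;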

&lt;ul&gt;
&lt;li&gt;GC requires that all peers are caught up past the GC point. A peer that's been offline for two weeks and tries to merge against a GC'd state will fail or produce a corrupted merge. The library raises this, but you have to handle it, usually by forcing the lagging peer to re-fetch a fresh snapshot instead of merging old updates.&lt;/li&gt;
&lt;li&gt;GC is expensive on big docs. Don't run it in the request path. Run it in a background job, ideally during low-traffic windows.&lt;/li&gt;
&lt;li&gt;The compacted form is not byte-identical across runs of the GC. Don't use document bytes as a cache key after compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Yjs the practical setup looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;yjs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// On the server, every N ops or every M minutes:&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;compactDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;docId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadDoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;docId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Re-encode the full state as a single update. With doc.gc left at its&lt;/span&gt;
  &lt;span class="c1"&gt;// default (true), Yjs has already dropped the content of deleted items.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;compacted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encodeStateAsUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Write the compacted bytes as the new snapshot.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;docId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;compacted&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Truncate the ops log up to the snapshot point, identified by the state vector.&lt;/span&gt;
  &lt;span class="c1"&gt;// Any peer behind this point must be told to refetch.&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;truncateOpsLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;docId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encodeStateVector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interview-flavored version of this gotcha: "CRDT state grows because tombstones are forever. You need a GC pass that runs in the background, and a story for peers that have fallen behind the GC horizon, typically forcing a fresh snapshot pull instead of an incremental merge."&lt;/p&gt;

&lt;p&gt;OT doesn't have this problem because the server can prune ops freely once everyone has acknowledged them. The trade is that OT puts the burden on the transform function instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;If you're prepping for system design interviews and want this same step-by-step treatment for the other 14 designs interviewers actually ask (sharded counters, payment ledgers, rate limiters, news feeds), that's the spine of my &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;. The collaboration chapter goes deeper on the OT transform table and walks through a Yjs server you can actually run.&lt;/p&gt;

&lt;p&gt;Which side do you fall on for your own product: OT because the server is the source of truth, or CRDT because the client has to keep working when the network doesn't?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>realtime</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Design a Job Scheduler at 10M Jobs/Day: 4 Components, 3 Failure Modes</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 13:41:48 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-job-scheduler-at-10m-jobsday-4-components-3-failure-modes-28hn</link>
      <guid>https://forem.com/gabrielanhaia/design-a-job-scheduler-at-10m-jobsday-4-components-3-failure-modes-28hn</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Cron is fine until it isn't. One box, a flat file, a tab character that nobody can type from memory. Works great for the nightly backup. Then somebody asks you to fire ten million scheduled jobs a day, across timezones, with sub-minute accuracy, and the interviewer wants to know what k8s &lt;code&gt;CronJob&lt;/code&gt; controllers and Airflow schedulers do under the hood.&lt;/p&gt;

&lt;p&gt;This is one of the cleanest system design prompts to practice on because the failure modes are real, the components are small, and the wrong answer is loud. Let's build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "scheduler" actually means
&lt;/h2&gt;

&lt;p&gt;A scheduler is four things glued together. People conflate them and the design suffers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A schedule store&lt;/strong&gt;: the durable truth about what should run and when.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A tick coordinator&lt;/strong&gt;: the thing that wakes up every second (or 100ms, or 10ms), reads the store, and decides "these N jobs are due now."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A dispatcher&lt;/strong&gt;: the thing that hands due jobs to a queue. Owns idempotency keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An executor&lt;/strong&gt;: the worker pool that actually runs the job, with timeouts and retries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Confuse "dispatcher" with "executor" and you get a design where the same box reads the schedule and runs the user's code. That's cron. It doesn't scale past one machine because the work the executor does (run arbitrary user code for arbitrary lengths of time) has wildly different failure modes from the work the dispatcher does (publish a tiny message to a queue, deterministically).&lt;/p&gt;

&lt;p&gt;Keep these separated. The interviewer will nod.&lt;/p&gt;

&lt;h2&gt;
  
  
  Component 1: the schedule store
&lt;/h2&gt;

&lt;p&gt;This is the part most candidates get wrong because they reach for Redis. Redis is fine for the &lt;em&gt;due-now queue&lt;/em&gt;. It is not the source of truth for "every Tuesday at 14:30 Europe/Berlin, run job 47 forever."&lt;/p&gt;

&lt;p&gt;The store needs three things: durability, an index on the next firing time, and the ability to advance that firing time and record the run in one atomic step when a job runs. Postgres works at this scale. So does CockroachDB or any decent OLTP database with a proper index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduled_jobs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt;           &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tenant_id&lt;/span&gt;        &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;             &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;          &lt;span class="n"&gt;JSONB&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- exactly one of cron_expr / run_once_at is set; never both&lt;/span&gt;
    &lt;span class="n"&gt;cron_expr&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;run_once_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timezone&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'UTC'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- the calculated next firing time, always in UTC&lt;/span&gt;
    &lt;span class="n"&gt;next_run_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_run_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- versioned so dispatcher can do optimistic locking&lt;/span&gt;
    &lt;span class="k"&gt;version&lt;/span&gt;          &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;          &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="c1"&gt;-- next_run_at is NOT NULL, so the check compares the two schedule columns&lt;/span&gt;
    &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;one_of_schedule&lt;/span&gt;
      &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;cron_expr&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_once_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- the only index the tick coordinator hits&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;scheduled_jobs_due_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;scheduled_jobs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_run_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- audit table, append-only, never updated&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduled_runs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_id&lt;/span&gt;           &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt;           &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;scheduled_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;scheduled_for&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dispatched_at&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="c1"&gt;-- the idempotency key the executor will use&lt;/span&gt;
    &lt;span class="n"&gt;idempotency_key&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;           &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  &lt;span class="c1"&gt;-- 'dispatched','succeeded','failed','skipped'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to call out before the interviewer asks.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;next_run_at&lt;/code&gt; is always UTC. The &lt;code&gt;timezone&lt;/code&gt; column exists so you can re-compute the next firing time correctly when DST hits. Storing local time in the index is how you end up firing twice in November and zero times in March.&lt;/p&gt;
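
&lt;p&gt;Here is a minimal sketch of that recomputation for a job that fires daily at a fixed wall-clock time. The helper name &lt;code&gt;next_daily_run_utc&lt;/code&gt; and the 09:00 New York schedule are assumptions for illustration. It ignores wall-clock times that a DST jump skips or repeats; a real scheduler would hand the full cron expression to a cron library.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def next_daily_run_utc(after_utc, hour, minute, tz_name):
    """Return the next hour:minute wall-clock firing in tz_name, as UTC."""
    tz = ZoneInfo(tz_name)
    local = after_utc.astimezone(tz)
    candidate = local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # Already at or past today's slot: move to the same wall time tomorrow.
    if after_utc >= candidate:
        candidate = candidate + timedelta(days=1)
    # Convert only at the end, so the indexed column always holds UTC.
    return candidate.astimezone(timezone.utc)


# The job just fired at 09:00 EST, the day before US clocks spring forward.
fired = datetime(2026, 3, 7, 14, 0, tzinfo=timezone.utc)
print(next_daily_run_utc(fired, 9, 0, "America/New_York"))
```

&lt;p&gt;The previous firing was 14:00 UTC; the next one lands at 13:00 UTC because 09:00 local is now an hour earlier in UTC terms. Adding a flat 24 hours to the stored UTC value would have fired at 10:00 local instead.&lt;/p&gt;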

&lt;p&gt;The &lt;code&gt;version&lt;/code&gt; column is the optimistic lock the dispatcher uses to claim a job. No &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; row locks held across a tick; that doesn't scale. The dispatcher reads candidates, then issues an &lt;code&gt;UPDATE ... WHERE version = $expected&lt;/code&gt; and only proceeds if the row count is 1.&lt;/p&gt;
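
&lt;p&gt;The claim can be tried out in isolation. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, a cut-down table with only three columns, and a hypothetical &lt;code&gt;claim_job&lt;/code&gt; helper; all of those are assumptions made to keep it runnable. Two callers read the same version, and only the first one's update matches a row.&lt;/p&gt;

```python
import sqlite3


def claim_job(conn, job_id, expected_version, next_run_at):
    """Advance the job only if its version is still the one we read."""
    cur = conn.execute(
        "UPDATE scheduled_jobs "
        "SET next_run_at = ?, version = version + 1 "
        "WHERE job_id = ? AND version = ?",
        (next_run_at, job_id, expected_version),
    )
    conn.commit()
    # rowcount is 1 when we won the claim, 0 when someone else got there first
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_jobs ("
    "job_id TEXT PRIMARY KEY, next_run_at INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO scheduled_jobs VALUES ('job-47', 1000, 1)")

# Both dispatchers read version 1; only the first claim matches a row.
first = claim_job(conn, "job-47", 1, 1060)
second = claim_job(conn, "job-47", 1, 1060)
print(first, second)
```

&lt;p&gt;No lock is held between the read and the write, so a slow dispatcher never blocks a fast one; it just loses the claim and moves on.&lt;/p&gt;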

&lt;p&gt;The partial index on &lt;code&gt;enabled = true&lt;/code&gt; is a real win at 10M rows. You almost never query disabled jobs, and the index stays small.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;scheduled_runs&lt;/code&gt; audit table is append-only. This is the table you'll be glad you had at 03:00 when someone asks "did job 47 actually run on Tuesday or did we just say it did."&lt;/p&gt;

&lt;p&gt;A back-of-envelope sanity check: 10M jobs/day is ~116 inserts/second sustained on &lt;code&gt;scheduled_runs&lt;/code&gt;. A single Postgres primary handles that without sweating. The schedule store itself is mostly read traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Component 2: the tick coordinator
&lt;/h2&gt;

&lt;p&gt;The tick coordinator is the dangerous part. One job here: every second, find the jobs whose &lt;code&gt;next_run_at &amp;lt;= now()&lt;/code&gt; and hand them to the dispatcher.&lt;/p&gt;

&lt;p&gt;Single instance is a single point of failure. Multiple instances without coordination is duplicate dispatch. The textbook answer is &lt;strong&gt;leader election&lt;/strong&gt;: run N instances, exactly one is leader, the rest are warm standbys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// the leader-election loop, etcd-style&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;RunCoordinator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;etcd&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;clientv3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;etcd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTTL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;election&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewElection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"/scheduler/leader"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// blocks until we win the election&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;election&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Campaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"became leader, starting tick loop"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tickLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;tickLoop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lost&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt;&lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 1-second ticks. Don't try to be cleverer than this.&lt;/span&gt;
    &lt;span class="n"&gt;ticker&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTicker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;lost&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// we lost the lease; step down immediately&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"leader lease lost"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;C&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
            &lt;span class="c"&gt;// dispatch inline so ticks never overlap; parallelism lives inside dispatchDue&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dispatchDue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dispatch failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"err"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two non-obvious things here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;&amp;lt;-lost&lt;/code&gt; channel matters more than the &lt;code&gt;Campaign&lt;/code&gt; call.&lt;/strong&gt; etcd's lease can be lost without you noticing if you only listen for context cancellation. If you keep dispatching after losing the lease, the new leader is also dispatching, and you've recreated the duplicate-dispatch problem you were trying to avoid. Step down the moment the session ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't tick faster than your dispatch can finish.&lt;/strong&gt; If &lt;code&gt;dispatchDue&lt;/code&gt; sometimes takes longer than 1 second (it will, under load), Go's ticker won't queue up the missed ticks; ticks that fire while the receiver is busy are dropped. That's the behavior you want. What you don't want is goroutine-per-tick, where dispatches fan out and overlap. Serialize the tick. Parallelism happens inside &lt;code&gt;dispatchDue&lt;/code&gt;, where it's bounded.&lt;/p&gt;

&lt;p&gt;For 10M jobs/day, peak rate matters more than average. If 60% of jobs are on &lt;code&gt;0 0 * * *&lt;/code&gt; (midnight UTC), you're dispatching 6M jobs in a 1-minute window. The tick coordinator doesn't run those; it hands them off. So the question becomes how fast the dispatcher can drain that backlog.&lt;/p&gt;
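
&lt;p&gt;The arithmetic behind that, as a quick sketch. The 60% share and the 1-minute window come from the paragraph above; the batch size of 1000 is the one the dispatcher uses; treating the window as a hard drain target is an assumption.&lt;/p&gt;

```python
JOBS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60
MIDNIGHT_SHARE_PERCENT = 60   # share of jobs on "0 0 * * *"
DRAIN_WINDOW_S = 60           # assumed target: drain the spike in one minute
BATCH_SIZE = 1000             # rows per dispatcher query

average_rate = JOBS_PER_DAY / SECONDS_PER_DAY
midnight_jobs = JOBS_PER_DAY * MIDNIGHT_SHARE_PERCENT // 100
peak_rate = midnight_jobs / DRAIN_WINDOW_S
batches = midnight_jobs // BATCH_SIZE

print(f"average: {average_rate:.0f} jobs/s")
print(f"peak:    {peak_rate:.0f} jobs/s")
print(f"batches: {batches}")
```

&lt;p&gt;That is about 116 jobs/s on average against 100,000 jobs/s at the spike, roughly 864 times the average, and 6,000 batch queries to get through it. Size for the spike, not the mean.&lt;/p&gt;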

&lt;h2&gt;
  
  
  Component 3: the dispatcher
&lt;/h2&gt;

&lt;p&gt;The dispatcher reads due jobs, publishes to a queue (SQS, Kafka, RabbitMQ; pick one), updates the schedule store with the next firing time, and writes an entry to &lt;code&gt;scheduled_runs&lt;/code&gt;. All under an idempotency key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;dispatchDue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tickAt&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// batch read, process in chunks so a single tick can drain a spike&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`
            SELECT job_id, tenant_id, name, payload,
                   cron_expr, timezone, next_run_at, version
            FROM scheduled_jobs
            WHERE enabled = true
              AND next_run_at &amp;lt;= $1
            ORDER BY next_run_at
            LIMIT $2
        `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tickAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;scanJobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c"&gt;// idempotency key = job + scheduled instant.&lt;/span&gt;
            &lt;span class="c"&gt;// Same key for retries; never collides across schedules.&lt;/span&gt;
            &lt;span class="n"&gt;idem&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s:%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JobID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NextRunAt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

            &lt;span class="n"&gt;nextRun&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;computeNext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CronExpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timezone&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tickAt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c"&gt;// claim-and-advance in one statement&lt;/span&gt;
            &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`
                UPDATE scheduled_jobs
                SET next_run_at = $1,
                    last_run_at = $2,
                    version     = version + 1
                WHERE job_id = $3 AND version = $4
            `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nextRun&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NextRunAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JobID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RowsAffected&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// another dispatcher (shouldn't happen with leader&lt;/span&gt;
                &lt;span class="c"&gt;// election, but belt-and-braces) already claimed it&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idem&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="c"&gt;// queue down. NOTE: the claim above already advanced next_run_at,&lt;/span&gt;
                &lt;span class="c"&gt;// so this fire is skipped unless the claim is reverted here&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"publish failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"job"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JobID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"err"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ExecContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;`
                INSERT INTO scheduled_runs
                  (run_id, job_id, scheduled_for, idempotency_key, status)
                VALUES ($1, $2, $3, $4, 'dispatched')
            `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JobID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NextRunAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idempotency key is &lt;code&gt;job_id + scheduled_instant_unix&lt;/code&gt;. Two retries of the same scheduled fire share the key. Two different schedulings of the same job (today's midnight and tomorrow's midnight) get different keys because the timestamp differs. The executor uses this key to enforce at-most-once execution at the application layer if the user opted in.&lt;/p&gt;
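
&lt;p&gt;A small sketch of how an executor might use that key. The &lt;code&gt;OnceGuard&lt;/code&gt; class and its in-memory set are assumptions for illustration; a real deployment would back this with a durable store and a unique constraint, such as the &lt;code&gt;UNIQUE&lt;/code&gt; column on &lt;code&gt;scheduled_runs&lt;/code&gt;.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def idempotency_key(job_id, scheduled_for):
    """Same shape as the dispatcher's key: job id plus the scheduled Unix second."""
    return f"{job_id}:{int(scheduled_for.timestamp())}"


class OnceGuard:
    """Hypothetical at-most-once guard; the set stands in for a durable store."""

    def __init__(self):
        self._seen = set()

    def run_once(self, key, handler):
        if key in self._seen:
            return False          # redelivery of a fire we already started
        self._seen.add(key)       # record before running: at-most-once
        handler()
        return True


midnight = datetime(2026, 5, 25, tzinfo=timezone.utc)
today = idempotency_key("job-47", midnight)
tomorrow = idempotency_key("job-47", midnight + timedelta(days=1))

guard = OnceGuard()
runs = []
print(guard.run_once(today, lambda: runs.append("today")))        # first delivery
print(guard.run_once(today, lambda: runs.append("today")))        # queue redelivery
print(guard.run_once(tomorrow, lambda: runs.append("tomorrow")))  # next day's fire
```

&lt;p&gt;Recording the key before calling the handler is what makes this at-most-once: a handler that crashes halfway is not retried. Record it after the handler returns and the same guard becomes at-least-once.&lt;/p&gt;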

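&lt;p&gt;Note the &lt;code&gt;at-most-once vs at-least-once&lt;/code&gt; choice lives here, not in the queue. The queue is at-least-once (SQS, Kafka, and RabbitMQ all are in their common configurations). What decides the dispatcher's guarantee is the order of the claim and the publish. As written, the &lt;code&gt;UPDATE&lt;/code&gt; that advances &lt;code&gt;next_run_at&lt;/code&gt; commits &lt;em&gt;before&lt;/em&gt; &lt;code&gt;publish&lt;/code&gt;, so a publish failure or a crash between the two loses that fire: at-most-once. Flip the order, publish first and claim second, and a crash between the two makes the next tick publish the same job again: at-least-once, with the idempotency key there to absorb the duplicate. The &lt;code&gt;INSERT&lt;/code&gt; into &lt;code&gt;scheduled_runs&lt;/code&gt; comes last, so a failed insert leaves a dispatch with no audit row. Pick one, document the failure mode loudly. Most real schedulers pick at-least-once because losing jobs is worse than running them twice.&lt;/p&gt;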

&lt;h2&gt;
  
  
  Component 4: the executor
&lt;/h2&gt;

&lt;p&gt;The executor is a worker pool consuming from the queue. Standard stuff: pull a message, run the user's job with a timeout, ack on success, nack on failure with a retry policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;contextlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asynccontextmanager&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                 &lt;span class="n"&gt;default_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;          &lt;span class="c1"&gt;# name -&amp;gt; handler
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concurrency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_timeout&lt;/span&gt;
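        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# strong refs to in-flight tasks&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_running&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# the event loop keeps only weak references to tasks&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_done_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;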

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# unknown job; quarantine, don't drop silently
&lt;/span&gt;                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dead_letter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown_handler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

            &lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timeout_s&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;default_timeout&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idempotency_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TimeoutError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# log scheduled_for, not now(): the audit trail wants the
&lt;/span&gt;                &lt;span class="c1"&gt;# original tick, not the time the worker happened to drain it
&lt;/span&gt;                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                      &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;is_retryable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                      &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things to highlight for the interviewer.&lt;/p&gt;

&lt;p&gt;The timeout is per-job, not global; &lt;code&gt;default_timeout&lt;/code&gt; is only the fallback when a message carries none. A schedule that runs &lt;code&gt;send_daily_email&lt;/code&gt; and one that runs &lt;code&gt;rebuild_search_index&lt;/code&gt; have different tolerances. Storing &lt;code&gt;timeout_s&lt;/code&gt; on the job and shipping it with the message means workers don't need to know about job semantics.&lt;/p&gt;

&lt;p&gt;The bounded semaphore is what keeps a single worker from blowing out memory when a spike arrives. Without it, every received message creates a task immediately and you fan out to whatever the queue gave you. 10k pending tasks waiting on shared resources is how an executor box dies.&lt;/p&gt;
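
&lt;p&gt;One way to get that bound is to acquire the semaphore &lt;em&gt;before&lt;/em&gt; spawning the task, so the receive loop itself stalls when the worker is full. The sketch below is illustrative only: &lt;code&gt;consume&lt;/code&gt;, &lt;code&gt;max_in_flight&lt;/code&gt; and the plain iterable of messages are assumptions standing in for the real queue client.&lt;/p&gt;

```python
import asyncio


async def consume(messages, handler, max_in_flight=4):
    """Run handler for each message, never more than max_in_flight at once.

    Returns the peak number of handlers that were running concurrently.
    """
    sem = asyncio.Semaphore(max_in_flight)
    tasks = set()
    running = 0
    peak = 0

    async def run_one(msg):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        try:
            await handler(msg)
        finally:
            running -= 1
            sem.release()  # free the slot so the receive loop can continue

    for msg in messages:
        # Backpressure: wait for a free slot BEFORE creating the task, so a
        # spike stays in the queue instead of piling up as pending tasks.
        await sem.acquire()
        task = asyncio.create_task(run_one(msg))
        tasks.add(task)
        task.add_done_callback(tasks.discard)

    await asyncio.gather(*tasks)
    return peak
```

&lt;p&gt;Acquiring inside the task instead would still cap running handlers, but every message would already be a task object in memory, which is the failure the paragraph above describes.&lt;/p&gt;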

&lt;h2&gt;
  
  
  Failure mode 1: missed ticks during failover
&lt;/h2&gt;

&lt;p&gt;The leader dies. The standby takes over. The transition isn't instant; the etcd lease TTL is 10 seconds in the code above, so up to 10 seconds of ticks don't fire on time.&lt;/p&gt;

&lt;p&gt;What did you miss? Anything with &lt;code&gt;next_run_at&lt;/code&gt; between the leader's death and the new leader's first tick.&lt;/p&gt;

&lt;p&gt;The fix is the &lt;strong&gt;catch-up window&lt;/strong&gt;. The dispatcher query is &lt;code&gt;next_run_at &amp;lt;= $1&lt;/code&gt; where &lt;code&gt;$1&lt;/code&gt; is the current tick instant, &lt;em&gt;not&lt;/em&gt; &lt;code&gt;next_run_at = $1&lt;/code&gt;. So when the new leader fires its first tick, it picks up everything that was due during the gap. Late, but fired.&lt;/p&gt;

&lt;p&gt;For schedules where a late run still beats no run (most recurring cron jobs), this is the right tradeoff. For schedules where firing late is actively wrong (a stock-market open trigger at 09:30:00 doesn't want a 09:30:09 firing), you need a different model. The job carries a "skip if older than" tolerance, and the dispatcher checks &lt;code&gt;now() - next_run_at &amp;lt; tolerance&lt;/code&gt; before publishing.&lt;/p&gt;
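
&lt;p&gt;The two behaviours can live in one check. This is a sketch under assumptions: &lt;code&gt;should_publish&lt;/code&gt; is a hypothetical helper, and a &lt;code&gt;tolerance&lt;/code&gt; of &lt;code&gt;None&lt;/code&gt; is used here to mean "always catch up".&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_publish(next_run_at: datetime, now: datetime,
                   tolerance: Optional[timedelta] = None) -> bool:
    """Decide whether a due row should still be published."""
    if next_run_at > now:
        return False  # not due yet
    if tolerance is None:
        return True   # catch-up semantics: fire late rather than never
    lateness = now - next_run_at
    return tolerance > lateness  # publish only while lateness is under tolerance


open_bell = datetime(2026, 1, 5, 9, 30, tzinfo=timezone.utc)
late_tick = open_bell + timedelta(seconds=9)
print(should_publish(open_bell, late_tick))                        # True
print(should_publish(open_bell, late_tick, timedelta(seconds=2)))  # False
```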

&lt;p&gt;The catch-up window also covers the case where the dispatcher itself is slow. If a tick takes 8 seconds to drain a spike, the next tick fires immediately when the previous one returns, and it sees everything that became due during the slow batch.&lt;/p&gt;

&lt;p&gt;What you do &lt;em&gt;not&lt;/em&gt; want is the new leader rewinding its clock and re-firing already-dispatched ticks. The &lt;code&gt;version&lt;/code&gt; check in the &lt;code&gt;UPDATE&lt;/code&gt; plus the audit table's unique constraint on &lt;code&gt;idempotency_key&lt;/code&gt; are the belt and braces against that.&lt;/p&gt;
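
&lt;p&gt;The unique-constraint half of that guard is easy to demonstrate. The sketch below uses in-memory SQLite as a stand-in for Postgres, a trimmed-down &lt;code&gt;scheduled_runs&lt;/code&gt; table, and an assumed &lt;code&gt;job_id:scheduled_unix&lt;/code&gt; key format; &lt;code&gt;record_dispatch&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import sqlite3


def record_dispatch(conn, job_id: str, scheduled_unix: int) -> bool:
    """Insert an audit row; return False if this tick was already dispatched."""
    key = f"{job_id}:{scheduled_unix}"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO scheduled_runs (idempotency_key, status) "
                "VALUES (?, 'dispatched')",
                (key,),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a second row for the same tick.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_runs ("
    "idempotency_key TEXT NOT NULL UNIQUE, status TEXT NOT NULL)"
)
print(record_dispatch(conn, "job-42", 1767605400))  # True
print(record_dispatch(conn, "job-42", 1767605400))  # False
```

&lt;p&gt;A new leader that replays a tick gets &lt;code&gt;False&lt;/code&gt; back and can skip the publish instead of firing the job twice.&lt;/p&gt;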

&lt;h2&gt;
  
  
  Failure mode 2: clock skew across executors
&lt;/h2&gt;

&lt;p&gt;Your tick coordinator's clock is 400ms ahead of one executor box. The executor box's clock is 600ms behind a third box. A job named &lt;code&gt;mark_subscription_expired&lt;/code&gt; runs at 00:00:00 UTC against a row that has &lt;code&gt;expires_at = 00:00:00.000&lt;/code&gt;. When the coordinator fires the tick, the executor's clock says it's 23:59:59.6. By that clock, the row isn't expired yet.&lt;/p&gt;

&lt;p&gt;This isn't a thought experiment. It's a class of bug that gets shipped quarterly somewhere in the industry.&lt;/p&gt;

&lt;p&gt;Two rules. &lt;strong&gt;UTC everything.&lt;/strong&gt; No localtime in the schedule store, no localtime in the queue message, no localtime in the executor's comparisons. The user's timezone is metadata used to compute &lt;code&gt;next_run_at&lt;/code&gt; once, not a thing you carry around at dispatch time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NTP is non-negotiable.&lt;/strong&gt; Every box runs &lt;code&gt;chronyd&lt;/code&gt; or &lt;code&gt;systemd-timesyncd&lt;/code&gt; against a known-good pool. AWS EC2 has its time-sync endpoint at &lt;code&gt;169.254.169.123&lt;/code&gt;; GCP has Google's NTP servers; on-prem teams run their own stratum-2. The interviewer wants to hear that you'd alert on &lt;code&gt;chronyc tracking&lt;/code&gt; reporting more than 100ms of offset.&lt;/p&gt;

&lt;p&gt;A third belt-and-braces move: include &lt;code&gt;scheduled_for&lt;/code&gt; in every queue message and have the executor compare its own clock to that timestamp on receipt. If the executor sees &lt;code&gt;scheduled_for = 00:00:00&lt;/code&gt; and its own clock says &lt;code&gt;23:59:59&lt;/code&gt;, it knows to either wait or refuse, and either is better than running the job against a state that isn't yet the state the schedule was designed for.&lt;/p&gt;
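
&lt;p&gt;That receipt-time check might look like the sketch below. The function name, the three return values, and the &lt;code&gt;max_skew&lt;/code&gt; cutoff between "wait" and "refuse" (borrowed here from the 100ms alert threshold) are all assumptions for illustration.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def check_clock(scheduled_for: datetime, local_now: datetime,
                max_skew: timedelta = timedelta(milliseconds=100)) -> str:
    """Return 'run', 'wait', or 'refuse' for a message an executor just received."""
    early_by = scheduled_for - local_now
    if timedelta(0) >= early_by:
        return "run"     # local clock is at or past the scheduled instant
    if max_skew >= early_by:
        return "wait"    # slightly early: sleep for early_by, then run
    return "refuse"      # clock is too far behind to trust; nack and alert


midnight = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(check_clock(midnight, midnight - timedelta(milliseconds=60)))  # wait
print(check_clock(midnight, midnight - timedelta(seconds=1)))        # refuse
```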

&lt;h2&gt;
  
  
  Failure mode 3: long jobs blocking the next tick
&lt;/h2&gt;

&lt;p&gt;The dispatcher publishes to a queue. The executor pulls from the queue. A job that takes 4 hours doesn't block the &lt;em&gt;dispatcher&lt;/em&gt;; it blocks an &lt;em&gt;executor worker&lt;/em&gt;. This is the right separation, and it's exactly the reason dispatcher and executor are different components.&lt;/p&gt;

&lt;p&gt;The failure shape: a job takes 4 hours. The schedule fires every 1 hour. From hour 3 onward, four runs of the same job are in flight at any moment, all working on overlapping data, racing each other to corrupt state.&lt;/p&gt;

&lt;p&gt;The schedule store needs one more field for this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;scheduled_jobs&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;concurrency_policy&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'allow'&lt;/span&gt;
    &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;concurrency_policy&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'allow'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'forbid'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'replace'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three policies, taken straight from k8s CronJob because Kubernetes already solved this and there's no point re-inventing the vocabulary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;allow&lt;/code&gt;: fire it, let them overlap. Default for stateless jobs.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forbid&lt;/code&gt;: skip the new run if the previous run hasn't completed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;replace&lt;/code&gt;: cancel the previous run and start the new one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dispatcher enforces this by checking the most recent &lt;code&gt;scheduled_runs&lt;/code&gt; row for the job before publishing. &lt;code&gt;forbid&lt;/code&gt; means: if there's a &lt;code&gt;dispatched&lt;/code&gt; row without a &lt;code&gt;succeeded&lt;/code&gt;/&lt;code&gt;failed&lt;/code&gt; row, write a &lt;code&gt;skipped&lt;/code&gt; row and don't publish. &lt;code&gt;replace&lt;/code&gt; means: send a cancellation message for the in-flight run, then publish the new one.&lt;/p&gt;
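
&lt;p&gt;The decision itself is a small pure function. This sketch simplifies the audit lookup to a single status string for the job's latest run, which is an assumption; &lt;code&gt;decide&lt;/code&gt; and the action names it returns are hypothetical.&lt;/p&gt;

```python
from typing import Optional

POLICIES = {"allow", "forbid", "replace"}


def decide(policy: str, latest_run_status: Optional[str]) -> list[str]:
    """Return the dispatcher's actions for one due tick, in order.

    latest_run_status is the status of the job's most recent run, or None
    if the job has never run. 'dispatched' means a run is still in flight.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown concurrency_policy: {policy}")
    in_flight = latest_run_status == "dispatched"
    if policy == "allow" or not in_flight:
        return ["publish"]
    if policy == "forbid":
        return ["record_skipped"]            # write a skipped row, publish nothing
    return ["cancel_in_flight", "publish"]   # replace: cancel first, then publish


print(decide("forbid", "dispatched"))   # ['record_skipped']
print(decide("replace", "dispatched"))  # ['cancel_in_flight', 'publish']
```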

&lt;p&gt;Queue depth isn't execution depth. A backed-up queue and a backed-up executor pool look the same to a naive dashboard. You want metrics for both, and the alert that matters is &lt;code&gt;executions_in_flight &amp;gt; expected_max&lt;/code&gt; for any job tagged &lt;code&gt;forbid&lt;/code&gt;. That's the one that catches the silent overlap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 90-second answer
&lt;/h2&gt;

&lt;p&gt;You'll be asked to summarize at the end. Here's the version that fits in the time you'll be given:&lt;/p&gt;

&lt;p&gt;"Four components, kept separate. A &lt;strong&gt;schedule store&lt;/strong&gt; in Postgres with the cron expression, the next-run-at as UTC, and a partial index on the next-run-at where enabled is true. A &lt;strong&gt;tick coordinator&lt;/strong&gt; running with etcd leader election so exactly one instance fires the tick each second; standbys are warm. A &lt;strong&gt;dispatcher&lt;/strong&gt; that reads due rows in batches, advances the next-run-at with an optimistic version check, then publishes to SQS with an idempotency key that's &lt;code&gt;job_id + scheduled_unix&lt;/code&gt;. An &lt;strong&gt;executor&lt;/strong&gt; worker pool consuming from the queue, with per-job timeouts and a bounded concurrency semaphore. The audit table &lt;code&gt;scheduled_runs&lt;/code&gt; is append-only and tells operators exactly what was dispatched and when.&lt;/p&gt;

&lt;p&gt;Three failure modes I'd call out. &lt;strong&gt;Missed ticks during failover&lt;/strong&gt;: the dispatcher query is &lt;code&gt;next_run_at &amp;lt;= now()&lt;/code&gt;, not equality, so the catch-up window absorbs the leader-transition gap. &lt;strong&gt;Clock skew&lt;/strong&gt;: UTC stored, NTP enforced via chronyd, alert on &amp;gt;100ms offset, and &lt;code&gt;scheduled_for&lt;/code&gt; ships in the message so the executor can sanity-check its own clock. &lt;strong&gt;Long jobs overlapping their own next run&lt;/strong&gt;: &lt;code&gt;concurrency_policy&lt;/code&gt; column with allow/forbid/replace, dispatcher checks the latest &lt;code&gt;scheduled_runs&lt;/code&gt; row before publishing. Queue depth and execution depth are different metrics; alerting on one without the other hides the overlap."&lt;/p&gt;

&lt;p&gt;That's the answer. 90 seconds, hits every component, names the failure modes by their actual shapes, and shows you know that &lt;code&gt;cron&lt;/code&gt; and &lt;code&gt;Airflow&lt;/code&gt; and k8s &lt;code&gt;CronJob&lt;/code&gt; all converge on roughly this architecture for the same reasons.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;This walk-through is the exact shape the &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews&lt;/a&gt; uses across all 15 designs in the book: components first, failure modes second, a 90-second summary you can actually deliver in the room. The job-scheduler chapter goes deeper into the catch-up-window math and the cron-expression edge cases (DST, leap seconds, the &lt;code&gt;L&lt;/code&gt; and &lt;code&gt;W&lt;/code&gt; operators most parsers get wrong). If you liked this one, that's the chapter to start with.&lt;/p&gt;

&lt;p&gt;What part of the scheduler do you spend the most time arguing about in interviews: the leader-election story, the at-least-once vs at-most-once tradeoff, or the concurrency-policy column? Drop your take in the comments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>distributedsystems</category>
      <category>scalability</category>
    </item>
    <item>
      <title>Design a Feature Flag Service: 100k SDK Clients and the SSE Protocol Reframe</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 12:05:35 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-feature-flag-service-100k-sdk-clients-and-the-sse-protocol-reframe-1oj1</link>
      <guid>https://forem.com/gabrielanhaia/design-a-feature-flag-service-100k-sdk-clients-and-the-sse-protocol-reframe-1oj1</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;"Design a feature flag service" sounds soft. It's the kind of prompt candidates think they can wing because they've used LaunchDarkly and Unleash. Then the interviewer asks the follow-up: "Your SDK ships in 100,000 production processes. How do they know when a flag changes?"&lt;/p&gt;

&lt;p&gt;The naive answer hits the load balancer first. The right answer reframes the protocol before drawing a single box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The interviewer's hidden question: how do you scale READS?
&lt;/h2&gt;

&lt;p&gt;The shape of the system is asymmetric. Writes are rare: a product manager toggles a flag a few times a day. Reads are everywhere. Every request your fleet handles asks "is this flag on for this user?" at least once, sometimes dozens of times.&lt;/p&gt;

&lt;p&gt;100,000 SDK clients is not your user count. It's your &lt;em&gt;process&lt;/em&gt; count. If your app runs on 5,000 pods and each pod is a separate SDK instance, you're at 5,000. If your mobile app has a million daily actives and each device runs the SDK, you're at a million. The interviewer says "100k" to anchor you somewhere realistic for a mid-size SaaS. Push them on it. Ask whether the SDKs live in your backend fleet or on end-user devices. The protocol changes.&lt;/p&gt;

&lt;p&gt;The real question under the prompt: &lt;em&gt;given that flag config is small (kilobytes) and changes rarely, how do you make every SDK in the world see the new value in under a second without melting your origin?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you start sketching MySQL and a &lt;code&gt;GET /flags/:key&lt;/code&gt; endpoint, you've answered the wrong question. The interviewer wants you to notice that this is a read-distribution problem with a publish-subscribe shape, not a CRUD app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive design: REST polling, and why it dies at 100k clients
&lt;/h2&gt;

&lt;p&gt;The first instinct is &lt;code&gt;GET /api/flags&lt;/code&gt;. The SDK polls every 30 seconds. Toggle latency is bounded by the poll interval. Done.&lt;/p&gt;

&lt;p&gt;Walk the math on the whiteboard. 100,000 clients, 30-second poll, that's 3,333 requests per second on average. Sustainable. Except every SDK starts at process boot and processes boot in waves: a deploy of 5,000 pods finishes in 90 seconds and you've stacked thousands of polls on top of each other. Now your p99 latency on the flag endpoint spikes and your SDKs time out at startup, which means your app boots without flags, which means defaults everywhere, which means a silent incident.&lt;/p&gt;

&lt;p&gt;Lower the poll interval to 5 seconds and you're at 20,000 req/s sustained. You can cache aggressively at the edge, but you've also turned a flag toggle into a 5-second worst-case propagation. Not interview-grade.&lt;/p&gt;

&lt;p&gt;Raise the interval to 5 minutes and your incident-response story collapses. The SRE flips the kill switch and waits five minutes for the bad code path to stop firing. The PM who shipped the broken experiment is already on a call.&lt;/p&gt;

&lt;p&gt;The pattern: polling forces a tradeoff between propagation latency and origin load that gets worse linearly with client count. There is no value of &lt;code&gt;pollInterval&lt;/code&gt; that wins. The protocol itself is wrong.&lt;/p&gt;
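
&lt;p&gt;The whiteboard arithmetic fits in a few lines. This is only a back-of-envelope sketch: it assumes polls are spread evenly across the interval, which is exactly what the boot-wave scenario above violates.&lt;/p&gt;

```python
def steady_state_rps(clients: int, poll_interval_s: float) -> float:
    """Average request rate when every client polls once per interval."""
    return clients / poll_interval_s


CLIENTS = 100_000
for interval_s in (5, 30, 300):
    rps = steady_state_rps(CLIENTS, interval_s)
    # Worst-case propagation equals the poll interval: a toggle made just
    # after a client polled waits a full interval for that client's next poll.
    print(f"poll every {interval_s:>3}s: {rps:>8,.0f} req/s, "
          f"up to {interval_s}s to propagate")
```

&lt;p&gt;Every row trades one number against the other: the product of request rate and propagation delay is always the client count.&lt;/p&gt;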

&lt;h2&gt;
  
  
  The reframe: SSE push from server, clients hold the connection
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events. One long-lived HTTP connection per SDK, opened at startup, held open by the server, written to only when a flag changes. The flag toggle becomes O(N) writes across N open sockets instead of O(N) polls per interval.&lt;/p&gt;

&lt;p&gt;Why SSE over WebSocket: unidirectional fits the problem (server tells client, client never tells server), it's plain HTTP so corporate proxies don't choke, browsers and most HTTP libraries support it natively, and reconnect-with-Last-Event-ID is part of the spec. WebSocket is fine if you also need client-to-server messages, but for flag distribution you don't.&lt;/p&gt;

&lt;p&gt;Here's the wire-level protocol an SDK should implement. Real SSE, not pseudo-code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FlagStream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# exponential backoff on disconnect — flags service is best-effort
&lt;/span&gt;        &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last-Event-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;

                &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sdk/stream?env=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="c1"&gt;# stream=True is the whole point — don't buffer the response
&lt;/span&gt;                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
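                    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="c1"&gt;# SSE is always UTF-8; requests would otherwise assume ISO-8859-1 for text/*&lt;/span&gt;
                    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;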
                    &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;  &lt;span class="c1"&gt;# reset on successful connect
&lt;/span&gt;                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="c1"&gt;# keep last-known flags; never block app startup on this
&lt;/span&gt;                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decode_unicode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# blank line = dispatch the event
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
                &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# SSE comment / keepalive
&lt;/span&gt;            &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeprefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server side is symmetrically simple. On connect, send the full flag snapshot as a &lt;code&gt;put&lt;/code&gt; event. On every subsequent change, send a &lt;code&gt;patch&lt;/code&gt; event with just the diff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server-side SSE handler — FastAPI / Starlette
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sse_starlette.sse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EventSourceResponse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sdk/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# 1) snapshot — bring the client to current state
&lt;/span&gt;        &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;flag_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;put&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# 2) live patches — pubsub fan-out from the write path
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;flag_pubsub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_disconnected&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EventSourceResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# send :keepalive every 15s for proxy timeouts
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Capacity changes character. A modern Linux box holds 200k+ open SSE connections with a sane file-descriptor limit and a non-blocking server. Five such gateway nodes give roughly 1M connections of capacity, about 10x headroom over the 100k fleet. The hard part stops being throughput and becomes connection lifecycle: graceful drain on deploy, half-open detection, idle-killer proxies in front of you.&lt;/p&gt;
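
&lt;p&gt;A rough sketch of that capacity arithmetic, taking the 200k-connections-per-node figure at face value. The function name and the one-node-draining scenario are illustrative assumptions, not measurements.&lt;/p&gt;

```python
def connection_headroom(nodes: int, per_node: int, fleet: int) -> float:
    """Return total connection capacity as a multiple of the client fleet."""
    return (nodes * per_node) / fleet


# Figures from the paragraph above: 5 gateways, 200k connections each, 100k SDK clients.
full = connection_headroom(nodes=5, per_node=200_000, fleet=100_000)
# Assumed scenario: one gateway is draining during a deploy.
draining = connection_headroom(nodes=4, per_node=200_000, fleet=100_000)
print(full, draining)
```

&lt;p&gt;The useful number is usually the second one: headroom should be judged with a node out of rotation, since that is when reconnecting clients pile onto the rest.&lt;/p&gt;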

&lt;h2&gt;
  
  
  Edge caching: flag evaluations at the CDN edge for read-anywhere clients
&lt;/h2&gt;

&lt;p&gt;For browser SDKs and mobile SDKs, even SSE is too chatty. Every cold start opens a connection, downloads the snapshot, then holds an idle socket. On a flaky mobile network you'd rather not.&lt;/p&gt;

&lt;p&gt;The reframe again: push the &lt;em&gt;evaluation&lt;/em&gt; to the edge. Flag config is small enough that it fits in a Cloudflare Worker, a Fastly Compute@Edge function, or a Lambda@Edge handler. The SDK calls one HTTP endpoint, the edge worker has the flag rules cached locally, and the answer comes back from the nearest PoP in around 30ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cloudflare Worker: evaluates a single flag at the edge&lt;/span&gt;
&lt;span class="c1"&gt;// flag config is hydrated from KV (CF's edge KV store)&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;flagKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// { userId, attrs }&lt;/span&gt;

    &lt;span class="c1"&gt;// hot keys are read from the PoP-local KV cache; an unknown flag returns 404 below&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfgRaw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FLAGS_KV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`flag:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;flagKey&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UNKNOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 60s edge cache, but vary on the bucket — not the userId itself&lt;/span&gt;
    &lt;span class="c1"&gt;// (don't blow up the cache key space)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stickyBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Headers&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cache-Control&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;public, s-maxage=60&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X-Flag-Bucket&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1) kill switch — fast path&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;offVariation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OFF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 2) targeting rules — explicit user/segment overrides&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TARGET_MATCH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 3) percentage rollout — bucket the user, compare to threshold&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stickyBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rollout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cumulativeWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ROLLOUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fallthrough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;FALLTHROUGH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



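&lt;p&gt;Edge cache invalidation runs off the same pubsub channel the SSE gateway uses. When a flag changes, a subscriber writes the new config to KV, and each edge PoP picks it up on a later read. Propagation is dominated by KV replication time: Cloudflare KV is eventually consistent, so a write can take up to about a minute to become visible in every location. Budget for that lag, plus the 60-second &lt;code&gt;s-maxage&lt;/code&gt; above, when you promise a propagation SLA.&lt;/p&gt;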

&lt;p&gt;The gotcha: edge caching only works for flag values that don't depend on high-cardinality, per-user attributes. If the flag rules reference &lt;code&gt;ctx.attrs.email&lt;/code&gt; to do regex matching, you can't cache the response; nearly every user would need their own cache entry and the cache key space would explode. Restrict edge evaluation to flags with bucket-based rollouts and named-segment matches; route attribute-heavy evaluations back to origin.&lt;/p&gt;
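
&lt;p&gt;One way to enforce that restriction is a small routing check at publish time. The sketch below is only an illustration: the clause shape (&lt;code&gt;attribute&lt;/code&gt;, &lt;code&gt;op&lt;/code&gt;) and the &lt;code&gt;in_segment&lt;/code&gt; operator name are assumptions, since the article does not define the rule schema.&lt;/p&gt;

```python
# Hypothetical clause operators considered safe to evaluate and cache at the edge.
EDGE_SAFE_OPS = {"in_segment"}


def is_edge_cacheable(cfg: dict) -> bool:
    """True when every targeting clause uses a low-cardinality operator."""
    for rule in cfg.get("rules", []):
        for clause in rule.get("clauses", []):
            if clause.get("op") not in EDGE_SAFE_OPS:
                return False
    return True


def route(cfg: dict) -> str:
    """Pick where a flag should be evaluated."""
    return "edge" if is_edge_cacheable(cfg) else "origin"


# A pure percentage rollout has no attribute clauses, so it can live at the edge.
rollout_only = {"enabled": True, "salt": "s1", "rules": [], "rollout": []}
# A regex on email is per-user by nature, so it is routed back to origin.
email_regex = {
    "enabled": True,
    "rules": [{"clauses": [{"attribute": "email", "op": "regex"}], "variation": "on"}],
}
print(route(rollout_only), route(email_regex))
```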

&lt;h2&gt;
  
  
  Flag-config distribution: durable store, snapshot to object storage, push diffs over pubsub
&lt;/h2&gt;

&lt;p&gt;Behind the gateway and the edge sits the source of truth. The write path is low-traffic: a dashboard call writes a new flag version to a relational store (Postgres, simple), bumps a monotonic version counter, and publishes a change event to Redis pubsub or NATS.&lt;/p&gt;

&lt;p&gt;The read path is everything. Gateway pods don't read Postgres on every SDK request; they hold the flag set in process memory and subscribe to the same pubsub channel. On boot, a gateway reads a snapshot from S3 (refreshed by a background job every 30s) so a cold restart of the entire fleet doesn't herd against Postgres.&lt;/p&gt;

&lt;p&gt;S3 plus CloudFront is also the SDK-side fallback channel. If the SSE connection won't establish (corporate proxy strips the connection, mobile network is hostile, the gateway is down), the SDK falls back to a 60-second polled GET of &lt;code&gt;flags-{env}-{version}.json&lt;/code&gt; from a public CloudFront URL. Slower, lossier, but the application boots with real flag values instead of compiled-in defaults.&lt;/p&gt;

&lt;p&gt;The pattern to name in the interview: &lt;em&gt;write to a durable store, snapshot to object storage for cold-start, push diffs over pubsub for hot fan-out, fall back to polled snapshots for hostile networks&lt;/em&gt;. One write path and three independent read paths, ranked by cost and latency.&lt;/p&gt;
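
&lt;p&gt;A minimal sketch of the gateway side of that pattern: boot from a snapshot, then apply pubsub diffs in version order. The class and the message shapes (&lt;code&gt;version&lt;/code&gt;, &lt;code&gt;flags&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;) are hypothetical, and S3 and the pubsub transport are replaced by plain dicts so it runs on its own.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class GatewayFlagState:
    """In-memory flag set held by one gateway process (hypothetical shape)."""

    version: int = 0
    flags: dict = field(default_factory=dict)

    def load_snapshot(self, snapshot: dict) -> None:
        # Cold start: replace everything with the object-storage snapshot.
        self.flags = dict(snapshot["flags"])
        self.version = snapshot["version"]

    def apply_change(self, change: dict) -> bool:
        # Hot path: apply a pubsub diff only if it is newer than what we hold.
        if self.version >= change["version"]:
            return False  # stale or duplicate delivery, ignore it
        self.flags.update(change["diff"])
        self.version = change["version"]
        return True


state = GatewayFlagState()
state.load_snapshot({"version": 41, "flags": {"new-checkout": False}})
state.apply_change({"version": 42, "diff": {"new-checkout": True}})
state.apply_change({"version": 41, "diff": {"new-checkout": False}})  # replayed, dropped
print(state.version, state.flags)
```

&lt;p&gt;The monotonic version counter is what makes duplicate or out-of-order deliveries harmless. This sketch does not detect gaps (a jump from 42 to 44); a real gateway would likely re-read the snapshot in that case.&lt;/p&gt;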

&lt;h2&gt;
  
  
  SDK-side caching with TTL + fallback (offline-safe)
&lt;/h2&gt;

&lt;p&gt;Every flag evaluation must answer in microseconds. The SDK keeps the full flag set in memory and serves evaluations locally. The SSE stream keeps the in-memory copy fresh.&lt;/p&gt;

&lt;p&gt;When the connection drops, the SDK keeps serving the last-known values. That's the offline-safe property. No timeout, no fallback default unless the SDK has &lt;em&gt;never&lt;/em&gt; received a snapshot. Add a &lt;code&gt;lastSyncedAt&lt;/code&gt; field on the SDK that your monitoring scrapes; a process that's been disconnected from the flag service for 10 minutes is a real signal, but it shouldn't crash the request path.&lt;/p&gt;

&lt;p&gt;Compiled-in defaults belong in the application code, not the SDK. The contract is: &lt;em&gt;if the SDK has no value for this key, return the default that the calling code supplied&lt;/em&gt;. The application owner decides what "off" looks like for that specific flag, not the flag platform.&lt;/p&gt;
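
&lt;p&gt;The contract can be sketched in a few lines. The class and method names here are made up for illustration, and the cache stores already-resolved values rather than full rules to keep the example short.&lt;/p&gt;

```python
import time


class FlagClient:
    """Minimal SDK-side cache: serves last-known values, never blocks on the network."""

    def __init__(self):
        self._flags = {}
        self.last_synced_at = None  # scraped by monitoring; None means never synced

    def on_snapshot(self, flags: dict) -> None:
        # Called by the stream reader (or the polling fallback) with a full flag set.
        self._flags = dict(flags)
        self.last_synced_at = time.time()

    def variation(self, key: str, default):
        # The caller supplies the default; the SDK only uses it when it has no value.
        return self._flags.get(key, default)

    def seconds_since_sync(self):
        # Staleness signal for dashboards and alerts, never for the request path.
        if self.last_synced_at is None:
            return None
        return time.time() - self.last_synced_at


client = FlagClient()
before = client.variation("new-checkout", default=False)  # never synced: caller's default
client.on_snapshot({"new-checkout": True})
after = client.variation("new-checkout", default=False)   # last-known value from here on
print(before, after)
```

&lt;p&gt;Nothing in &lt;code&gt;variation&lt;/code&gt; checks how old the data is. Staleness is exposed through &lt;code&gt;seconds_since_sync&lt;/code&gt; for alerting, which keeps a disconnected process serving requests.&lt;/p&gt;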

&lt;h2&gt;
  
  
  Targeting and rules engine: boolean predicates, rollout percentages, sticky bucketing
&lt;/h2&gt;

&lt;p&gt;Three primitives, in order: kill switch, targeting rules, percentage rollout. Already shown in the edge evaluator. Worth saying out loud in the interview because it answers "what's a flag actually evaluating?"&lt;/p&gt;

&lt;p&gt;Sticky bucketing is the load-bearing piece. When you say "5% rollout", a given user must always land in the same bucket. Otherwise the user flips between treatment and control across requests, ruins your experiment, and corrupts your analytics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# the bucketing hash — identical implementation on every SDK,
# every edge worker, every backend evaluator. one source of truth.
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sticky_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flag_salt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# SHA-1, not for security — for stable, uniform distribution across languages.
&lt;/span&gt;    &lt;span class="c1"&gt;# Every SDK ships the same impl. If you swap algorithms you re-bucket
&lt;/span&gt;    &lt;span class="c1"&gt;# every user mid-experiment, which is a silent data corruption bug.
&lt;/span&gt;    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag_salt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# take 4 bytes, big-endian, mod the bucket count
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;big&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;total_buckets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the salt. Without it, the same user lands in the same bucket (say, 217) for &lt;em&gt;every&lt;/em&gt; flag, so a user in the 5% rollout of flag A is also in the 5% rollout of flag B, C, and D. Correlation across experiments destroys your stats. The salt is usually the flag key itself plus a per-flag random string set at flag creation.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;total_buckets=10_000&lt;/code&gt;: lets you express rollout in basis points (0.01% granularity), enough precision for ramp schedules and small canary groups.&lt;/p&gt;
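
&lt;p&gt;To see both properties at once, here is a small sketch that reuses the bucketing function above and expresses thresholds in basis points. The salts (&lt;code&gt;checkout.x9f2&lt;/code&gt;, &lt;code&gt;search.k3p7&lt;/code&gt;), the synthetic user IDs, and the &lt;code&gt;in_rollout&lt;/code&gt; helper are assumptions made for the example.&lt;/p&gt;

```python
import hashlib


def sticky_bucket(user_id: str, flag_salt: str, total_buckets: int = 10_000) -> int:
    # Same recipe as the function above: salted SHA-1, first 4 bytes, big-endian.
    digest = hashlib.sha1(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def in_rollout(user_id: str, flag_salt: str, basis_points: int) -> bool:
    # 500 basis points = 5%; a user is in when their bucket is below the threshold.
    return basis_points > sticky_bucket(user_id, flag_salt)


users = [f"user-{i}" for i in range(20_000)]
at_5 = {u for u in users if in_rollout(u, "checkout.x9f2", 500)}
at_10 = {u for u in users if in_rollout(u, "checkout.x9f2", 1_000)}
other_flag = {u for u in users if in_rollout(u, "search.k3p7", 500)}

print(len(at_5) / len(users))                           # share of users in the 5% ramp
print(at_5.issubset(at_10))                             # ramping up only adds users
print(len(at_5.intersection(other_flag)) / len(at_5))   # overlap with a differently salted flag
```

&lt;p&gt;Because the threshold only moves upward during a ramp, everyone in the 5% group stays in the 10% group. The different salt on the second flag is what keeps its 5% from being the same people.&lt;/p&gt;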

&lt;h2&gt;
  
  
  The 90-second answer that wins the round
&lt;/h2&gt;

&lt;p&gt;When the interviewer drops the prompt, talk for 90 seconds before drawing anything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Feature flags are a read-heavy, write-rare distribution problem. The naive &lt;code&gt;GET /flags&lt;/code&gt; design fails at 100k SDK clients because polling forces a bad tradeoff between propagation latency and origin load. I'd reframe to a push protocol: SSE from a stateless gateway tier, with each SDK holding one long-lived connection that receives a full snapshot on connect and patches on every flag change. The gateways subscribe to a pubsub channel (Redis or NATS) that the write path publishes to. The source of truth is Postgres for writes and S3-plus-CDN for cold-start snapshots, so a fleet restart doesn't stampede the origin. For browser and mobile SDKs I'd push evaluation to the edge via Cloudflare Workers or Lambda@Edge, with flag config replicated to edge KV; that gives responses in roughly 30 ms from the nearest PoP. SDKs cache locally and evaluate in microseconds. Sticky bucketing uses a salted SHA-1 hash with the same implementation in every SDK, edge worker, and backend evaluator. The whole system is offline-safe because the SDK keeps serving last-known values when the stream drops."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the answer. Now they ask follow-ups and you draw boxes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: sticky bucketing requires a deterministic hash, and every SDK must agree
&lt;/h2&gt;

&lt;p&gt;The single failure mode that sinks real flag platforms: hash drift across SDKs.&lt;/p&gt;

&lt;p&gt;Your Node SDK uses MurmurHash3 because someone copy-pasted it from another open-source SDK. Your Go SDK uses FNV-1a because Go's stdlib has it. Your Python SDK uses MD5 because that's what the first engineer reached for. The same user gets bucketed three different ways. You roll out a flag to 10%, and a user who touches all three SDKs has a 1 − 0.9³ ≈ 27% chance of seeing the new variant somewhere, because each SDK buckets independently.&lt;/p&gt;

&lt;p&gt;Worse: the bug is invisible. Aggregate counts look right (10% of evaluations across the fleet return the new variant), but per-user consistency is gone. An experiment that should detect a 2% conversion lift sees noise. A canary that should affect 5% of traffic affects different 5%-slices in different SDKs.&lt;/p&gt;
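
&lt;p&gt;A rough simulation makes the drift visible. This is only a sketch: MurmurHash3 and FNV-1a are not in Python's standard library, so SHA-1, MD5, and SHA-256 stand in for the three mismatched SDKs, and the &lt;code&gt;new-search.q7&lt;/code&gt; salt and helper names are made up for the example:&lt;/p&gt;

```python
import hashlib

TOTAL_BUCKETS = 10_000
ROLLOUT_BPS = 1_000  # 10% rollout


def bucket_with(hash_name, user_id, flag_salt):
    """Bucket a user with the named hashlib algorithm (same key format each time)."""
    key = f"{flag_salt}.{user_id}".encode("utf-8")
    digest = hashlib.new(hash_name, key).digest()
    return int.from_bytes(digest[:4], "big") % TOTAL_BUCKETS


def exposed(hash_name, user_id, flag_salt="new-search.q7"):
    """True when this hash puts the user inside the 10% rollout."""
    return bucket_with(hash_name, user_id, flag_salt) in range(ROLLOUT_BPS)


users = [f"user-{i}" for i in range(50_000)]
sdk_hashes = ["sha1", "md5", "sha256"]  # stand-ins: three SDKs, three hashes

per_sdk = {h: sum(exposed(h, u) for u in users) / len(users) for h in sdk_hashes}
in_any = sum(any(exposed(h, u) for h in sdk_hashes) for u in users) / len(users)
in_all = sum(all(exposed(h, u) for h in sdk_hashes) for u in users) / len(users)

print(per_sdk)  # each SDK alone looks like a clean rollout of about 10%
print(in_any)   # close to 1 - 0.9**3 = 0.271
print(in_all)   # close to 0.1**3 = 0.001
```

&lt;p&gt;Every per-SDK dashboard reads about 10%, yet only about one user in a thousand gets the same answer from all three.&lt;/p&gt;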

&lt;p&gt;The fix is governance, not code: one bucketing spec, written down, with test vectors. Every SDK ships a &lt;code&gt;test_sticky_bucket.py&lt;/code&gt; (or &lt;code&gt;_test.go&lt;/code&gt;, etc.) with at least 20 &lt;code&gt;(user_id, flag_salt) -&amp;gt; expected_bucket&lt;/code&gt; pairs. CI fails if any SDK disagrees with the canonical vectors. When you change the algorithm, you bump a &lt;code&gt;bucketingVersion&lt;/code&gt; field on every flag and run the old and new algorithms in parallel during the cutover.&lt;/p&gt;
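
&lt;p&gt;One way such a conformance check could look, as a sketch. &lt;code&gt;reference_bucket&lt;/code&gt;, &lt;code&gt;md5_bucket&lt;/code&gt;, and the vector layout are assumptions for illustration; in practice the vectors would be generated once and committed as a shared JSON file rather than rebuilt on every run:&lt;/p&gt;

```python
import hashlib
import json


def reference_bucket(user_id, flag_salt, total_buckets=10_000):
    """Hypothetical canonical spec: first 4 bytes of SHA-1, big-endian, mod buckets."""
    digest = hashlib.sha1(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def md5_bucket(user_id, flag_salt, total_buckets=10_000):
    """A drifted SDK: same key format, different hash."""
    digest = hashlib.md5(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def build_vectors(count=20):
    """Generate canonical vectors from the reference implementation."""
    vectors = []
    for i in range(count):
        user_id, flag_salt = f"user-{i}", f"flag-{i % 4}.s{i}"
        vectors.append({
            "user_id": user_id,
            "flag_salt": flag_salt,
            "expected_bucket": reference_bucket(user_id, flag_salt),
        })
    return vectors


def mismatches(bucket_fn, vectors):
    """Return every vector the implementation under test gets wrong."""
    return [v for v in vectors
            if bucket_fn(v["user_id"], v["flag_salt"]) != v["expected_bucket"]]


# Round-trip through JSON, as a vectors file shared between SDK repos would be.
vectors = json.loads(json.dumps(build_vectors()))

print(len(mismatches(reference_bucket, vectors)))  # the reference agrees with itself
print(len(mismatches(md5_bucket, vectors)))        # the drifted SDK fails almost every vector
```

&lt;p&gt;Each SDK's test suite would load the same file and fail the build when &lt;code&gt;mismatches&lt;/code&gt; is non-empty.&lt;/p&gt;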

&lt;p&gt;If the interviewer is sharp they'll ask about this. Bring it up unprompted and you've shown you've actually shipped one of these systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;This pattern (protocol reframe, push beats poll, edge evaluation, deterministic bucketing) is one of fifteen full system designs walked end-to-end in &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews&lt;/a&gt;. The feature-flag design lives next to the rate-limiter, the URL shortener, and the notification system; each one structured around the 90-second answer and the follow-up questions that actually decide the round.&lt;/p&gt;

&lt;p&gt;What's the gnarliest follow-up you've been asked on this kind of design? Drop it in the comments and I'll work through it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>distributedsystems</category>
      <category>scalability</category>
    </item>
    <item>
      <title>Design a Feature Flag Service: 100k SDK Clients and the SSE Protocol Reframe</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 12:05:34 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/design-a-feature-flag-service-100k-sdk-clients-and-the-sse-protocol-reframe-kj6</link>
      <guid>https://forem.com/gabrielanhaia/design-a-feature-flag-service-100k-sdk-clients-and-the-sse-protocol-reframe-kj6</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;"Design a feature flag service" sounds soft. It's the kind of prompt candidates think they can wing because they've used LaunchDarkly and Unleash. Then the interviewer asks the follow-up: "Your SDK ships in 100,000 production processes. How do they know when a flag changes?"&lt;/p&gt;

&lt;p&gt;The naive answer hits the load balancer first. The right answer reframes the protocol before drawing a single box.&lt;/p&gt;

&lt;h2&gt;
  
  
  The interviewer's hidden question: how do you scale READS?
&lt;/h2&gt;

&lt;p&gt;The shape of the system is asymmetric. Writes are rare: a product manager toggles a flag a few times a day. Reads are everywhere. Every request your fleet handles asks "is this flag on for this user?" at least once, sometimes dozens of times.&lt;/p&gt;

&lt;p&gt;100,000 SDK clients is not your user count. It's your &lt;em&gt;process&lt;/em&gt; count. If your app runs on 5,000 pods and each pod is a separate SDK instance, you're at 5,000. If your mobile app has a million daily actives and each device runs the SDK, you're at a million. The interviewer says "100k" to anchor you somewhere realistic for a mid-size SaaS. Push them on it. Ask whether the SDKs live in your backend fleet or on end-user devices. The protocol changes.&lt;/p&gt;

&lt;p&gt;The real question under the prompt: &lt;em&gt;given that flag config is small (kilobytes) and changes rarely, how do you make every SDK in the world see the new value in under a second without melting your origin?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you start sketching MySQL and a &lt;code&gt;GET /flags/:key&lt;/code&gt; endpoint, you've answered the wrong question. The interviewer wants you to notice that this is a read-distribution problem with a publish-subscribe shape, not a CRUD app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Naive design: REST polling, and why it dies at 100k clients
&lt;/h2&gt;

&lt;p&gt;The first instinct is &lt;code&gt;GET /api/flags&lt;/code&gt;. The SDK polls every 30 seconds. Toggle latency is bounded by the poll interval. Done.&lt;/p&gt;

&lt;p&gt;Walk the math on the whiteboard. 100,000 clients, 30-second poll, that's 3,333 requests per second on average. Sustainable. Except every SDK starts at process boot and processes boot in waves: a deploy of 5,000 pods finishes in 90 seconds and you've stacked thousands of polls on top of each other. Now your p99 latency on the flag endpoint spikes and your SDKs time out at startup, which means your app boots without flags, which means defaults everywhere, which means a silent incident.&lt;/p&gt;

&lt;p&gt;Lower the poll interval to 5 seconds and you're at 20,000 req/s sustained. You can cache aggressively at the edge, but you've also turned a flag toggle into a 5-second worst-case propagation. Not interview-grade.&lt;/p&gt;

&lt;p&gt;Raise the interval to 5 minutes and your incident-response story collapses. The SRE flips the kill switch and waits five minutes for the bad code path to stop firing. The PM who shipped the broken experiment is already on a call.&lt;/p&gt;

&lt;p&gt;The pattern: polling forces a tradeoff between propagation latency and origin load that gets worse linearly with client count. There is no value of &lt;code&gt;pollInterval&lt;/code&gt; that wins. The protocol itself is wrong.&lt;/p&gt;
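
&lt;p&gt;The whiteboard arithmetic, as a quick sketch. &lt;code&gt;polling_load&lt;/code&gt; is a throwaway helper for this example, and it only models the steady-state average, not the boot-wave spikes:&lt;/p&gt;

```python
CLIENTS = 100_000


def polling_load(clients, poll_interval_s):
    """Average origin load and worst-case staleness for a fixed poll interval."""
    return {
        "interval_s": poll_interval_s,
        "avg_req_per_s": round(clients / poll_interval_s),
        "worst_case_staleness_s": poll_interval_s,
    }


# The three intervals discussed above: 5 seconds, 30 seconds, 5 minutes.
for interval in (5, 30, 300):
    print(polling_load(CLIENTS, interval))
```

&lt;p&gt;Load and staleness move in opposite directions along the same knob: every 10x cut in staleness costs 10x the request rate.&lt;/p&gt;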

&lt;h2&gt;
  
  
  The reframe: SSE push from server, clients hold the connection
&lt;/h2&gt;

&lt;p&gt;Server-Sent Events. One long-lived HTTP connection per SDK, opened at startup, held open by the server, written to only when a flag changes. A flag toggle costs O(N) writes across N open sockets, paid only when something actually changes, instead of O(N) polls every interval whether or not anything changed.&lt;/p&gt;

&lt;p&gt;Why SSE over WebSocket: unidirectional fits the problem (server tells client, client never tells server), it's plain HTTP so corporate proxies don't choke, browsers and most HTTP libraries support it natively, and reconnect-with-Last-Event-ID is part of the spec. WebSocket is fine if you also need client-to-server messages, but for flag distribution you don't.&lt;/p&gt;

&lt;p&gt;Here's the wire-level protocol an SDK should implement. Real SSE, not pseudo-code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FlagStream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# exponential backoff on disconnect — flags service is best-effort
&lt;/span&gt;        &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Last-Event-ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt;

                &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/sdk/stream?env=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="c1"&gt;# stream=True is the whole point: don't buffer the response.
&lt;/span&gt;                &lt;span class="c1"&gt;# The 45s read timeout (3x the server's 15s keepalive) surfaces half-open sockets.
&lt;/span&gt;                &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# SSE is always UTF-8; requests may otherwise assume ISO-8859-1
&lt;/span&gt;                    &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;  &lt;span class="c1"&gt;# reset on successful connect
&lt;/span&gt;                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;ConnectionError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# keep last-known flags; never block app startup on this
&lt;/span&gt;            &lt;span class="c1"&gt;# reached after an error and after a clean server close, so reconnects never hot-loop
&lt;/span&gt;            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;backoff&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_parse_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Iterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decode_unicode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# blank line = dispatch the event
&lt;/span&gt;                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
                &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# SSE comment / keepalive
&lt;/span&gt;            &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
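            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# SSE spec: strip at most one leading space&lt;/span&gt;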
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
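
&lt;p&gt;The stream only yields events; something still has to fold them into the local cache. A minimal sketch follows, assuming the event dictionaries yielded above and a patch payload shaped as &lt;code&gt;{"set": {...}, "remove": [...]}&lt;/code&gt;. That diff shape and the &lt;code&gt;apply_event&lt;/code&gt; name are assumptions for illustration, not a documented format:&lt;/p&gt;

```python
def apply_event(flags, event):
    """Return the new flag map after one stream event; unknown events are ignored."""
    kind, data = event["type"], event["data"]
    if kind == "put":
        # full snapshot: replace everything the SDK knew before
        return dict(data["flags"])
    if kind == "patch":
        # hypothetical diff shape: {"set": {key: flag}, "remove": [key, ...]}
        updated = dict(flags)
        updated.update(data.get("set", {}))
        for key in data.get("remove", []):
            updated.pop(key, None)
        return updated
    return flags


flags = {}
events = [
    {"type": "put", "data": {"flags": {"new-search": {"on": False}}, "version": 41}},
    {"type": "patch", "data": {"set": {"new-search": {"on": True}}}},
    {"type": "patch", "data": {"set": {"dark-mode": {"on": True}}, "remove": ["new-search"]}},
]
for event in events:
    flags = apply_event(flags, event)

print(flags)  # {'dark-mode': {'on': True}}
```

&lt;p&gt;Returning a fresh dict on every event lets evaluation threads keep reading the old map while the new one is swapped in.&lt;/p&gt;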



&lt;p&gt;The server side is symmetrically simple. On connect, send the full flag snapshot as a &lt;code&gt;put&lt;/code&gt; event. On every subsequent change, send a &lt;code&gt;patch&lt;/code&gt; event with just the diff.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# server-side SSE handler (FastAPI / Starlette)
&lt;/span&gt;&lt;span class="c1"&gt;# flag_store and flag_pubsub are application-provided helpers, not shown here
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sse_starlette.sse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EventSourceResponse&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Header() makes FastAPI read the Last-Event-ID request header, not a query parameter
&lt;/span&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/sdk/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_event_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="c1"&gt;# 1) snapshot — bring the client to current state
&lt;/span&gt;        &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;flag_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;put&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# 2) live patches — pubsub fan-out from the write path
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;change&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;flag_pubsub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_disconnected&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;diff&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EventSourceResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# send :keepalive every 15s for proxy timeouts
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Capacity changes character. A modern Linux box can hold 200k+ open SSE connections with a sane file-descriptor limit and a non-blocking server. Five such gateway nodes give roughly 1M connections of capacity, about 10x the 100k fleet. The hard part stops being throughput and becomes connection lifecycle: graceful drain on deploy, half-open detection, idle-killer proxies in front of you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Edge caching: flag evaluations at the CDN edge for read-anywhere clients
&lt;/h2&gt;

&lt;p&gt;For browser SDKs and mobile SDKs, even SSE is too chatty. Every cold start opens a connection, downloads the snapshot, then holds an idle socket. On a flaky mobile network you'd rather not.&lt;/p&gt;

&lt;p&gt;The reframe again: push the &lt;em&gt;evaluation&lt;/em&gt; to the edge. Flag config is small enough that it fits in a Cloudflare Worker, a Fastly Compute@Edge function, or a Lambda@Edge handler. The SDK calls one HTTP endpoint, the edge worker has the flag rules cached locally, and the answer comes back from the nearest PoP in roughly 30ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
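&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Cloudflare Worker: evaluates a single flag at the edge&lt;/span&gt;
&lt;span class="c1"&gt;// flag config is hydrated from Workers KV (Cloudflare's edge KV store);&lt;/span&gt;
&lt;span class="c1"&gt;// stickyBucket() and matches() are helpers assumed to be defined elsewhere&lt;/span&gt;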

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;flagKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pathname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// { userId, attrs }&lt;/span&gt;

    &lt;span class="c1"&gt;// hot keys are served from the local PoP; a missing key returns 404 so the SDK uses its default&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cfgRaw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FLAGS_KV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`flag:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;flagKey&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UNKNOWN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 60s edge cache, but vary on the bucket — not the userId itself&lt;/span&gt;
    &lt;span class="c1"&gt;// (don't blow up the cache key space)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stickyBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cfgRaw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Headers&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cache-Control&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;public, s-maxage=60&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;X-Flag-Bucket&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1) kill switch — fast path&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;offVariation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OFF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 2) targeting rules — explicit user/segment overrides&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clauses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TARGET_MATCH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 3) percentage rollout — bucket the user, compare to threshold&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stickyBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;salt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rollout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cumulativeWeight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;variant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ROLLOUT&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fallthrough&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;FALLTHROUGH&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



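&lt;p&gt;Edge cache invalidation runs off the same pubsub channel the SSE gateway uses. When a flag changes, you write the new config to KV and each edge PoP picks it up on a later read. Propagation is dominated by KV replication time. Cloudflare KV is eventually consistent: a write is usually visible right away at the location where it was made, but it can take up to 60 seconds or more to show up in other locations, so treat edge-evaluated flags as fresh to within about a minute rather than sub-second.&lt;/p&gt;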

&lt;p&gt;The gotcha: edge caching only works for flag values that don't depend on high-cardinality per-user attributes. If the flag rules reference &lt;code&gt;ctx.attrs.email&lt;/code&gt; to do regex matching, you can't cache the response; nearly every user would need a distinct cache entry and the cache key space would explode. Restrict edge evaluation to flags with bucket-based rollouts and named-segment matches; route attribute-heavy evaluations back to origin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flag-config distribution: durable store, snapshot to object storage, push diffs over pubsub
&lt;/h2&gt;

&lt;p&gt;Behind the gateway and the edge sits the source of truth. The write path is low-traffic: a dashboard call writes a new flag version to a relational store (Postgres, simple), bumps a monotonic version counter, and publishes a change event to Redis pubsub or NATS.&lt;/p&gt;

&lt;p&gt;The read path is everything. Gateway pods don't read Postgres on every SDK request; they hold the flag set in process memory and subscribe to the same pubsub channel. On boot, a gateway reads a snapshot from S3 (refreshed by a background job every 30s) so a cold restart of the entire fleet doesn't herd against Postgres.&lt;/p&gt;

&lt;p&gt;S3 plus CloudFront is also the SDK-side fallback channel. If the SSE connection won't establish (corporate proxy strips the connection, mobile network is hostile, the gateway is down), the SDK falls back to a 60-second polled GET of &lt;code&gt;flags-{env}-{version}.json&lt;/code&gt; from a public CloudFront URL. Slower, lossier, but the application boots with real flag values instead of compiled-in defaults.&lt;/p&gt;

&lt;p&gt;The pattern to name in the interview: &lt;em&gt;write to a durable store, snapshot to object storage for cold-start, push diffs over pubsub for hot fan-out, fall back to polled snapshots for hostile networks&lt;/em&gt;. One write path and three independent read paths (snapshot, pubsub, polled fallback), ranked by cost and latency.&lt;/p&gt;
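
&lt;p&gt;A minimal sketch of how a gateway pod might combine the cold-start snapshot with live diffs. The &lt;code&gt;FlagStore&lt;/code&gt; class, the snapshot shape, and the convention that &lt;code&gt;None&lt;/code&gt; marks a deleted flag are assumptions for illustration, not part of any specific library.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """In-memory flag set held by a gateway pod (hypothetical structure)."""
    version: int = 0
    flags: dict = field(default_factory=dict)

    def load_snapshot(self, snapshot: dict) -> None:
        # Cold start: replace everything with the object-storage snapshot.
        self.flags = dict(snapshot["flags"])
        self.version = snapshot["version"]

    def apply_diff(self, diff_version: int, diff: dict) -> bool:
        # Hot path: pubsub diffs. Ignore anything the current state already covers.
        if self.version >= diff_version:
            return False
        for key, value in diff.items():
            if value is None:
                self.flags.pop(key, None)  # None marks a deleted flag (assumption)
            else:
                self.flags[key] = value
        self.version = diff_version
        return True


store = FlagStore()
store.load_snapshot({"version": 41, "flags": {"new_checkout": {"enabled": False}}})
applied_old = store.apply_diff(40, {"new_checkout": {"enabled": True}})  # stale, ignored
applied_new = store.apply_diff(42, {"new_checkout": {"enabled": True}})  # newer, applied
```

&lt;p&gt;The monotonic version counter is what makes the paths safe to mix: a diff that arrives after a newer snapshot is simply dropped, so the order in which the snapshot and the pubsub stream are read doesn't matter.&lt;/p&gt;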

&lt;h2&gt;
  
  
  SDK-side caching with TTL + fallback (offline-safe)
&lt;/h2&gt;

&lt;p&gt;Every flag evaluation must answer in microseconds. The SDK keeps the full flag set in memory and serves evaluations locally. The SSE stream keeps the in-memory copy fresh.&lt;/p&gt;

&lt;p&gt;When the connection drops, the SDK keeps serving the last-known values. That's the offline-safe property. No timeout, no fallback default unless the SDK has &lt;em&gt;never&lt;/em&gt; received a snapshot. Add a &lt;code&gt;lastSyncedAt&lt;/code&gt; field on the SDK that your monitoring scrapes; a process that's been disconnected from the flag service for 10 minutes is a real signal, but it shouldn't crash the request path.&lt;/p&gt;

&lt;p&gt;Compiled-in defaults belong in the application code, not the SDK. The contract is: &lt;em&gt;if the SDK has no value for this key, return the default that the calling code supplied&lt;/em&gt;. The application owner decides what "off" looks like for that specific flag, not the flag platform.&lt;/p&gt;
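
&lt;p&gt;The contract can be sketched in a few lines. This assumes flag values are already resolved to plain values in the local cache; &lt;code&gt;FlagClient&lt;/code&gt;, &lt;code&gt;variation&lt;/code&gt;, and &lt;code&gt;seconds_since_sync&lt;/code&gt; are illustrative names, not a real SDK's API.&lt;/p&gt;

```python
import time


class FlagClient:
    """Sketch of an SDK-side cache; names are illustrative, not a real SDK API."""

    def __init__(self):
        self._flags = {}            # last-known values, kept across disconnects
        self.last_synced_at = None  # None until the first snapshot arrives

    def on_snapshot(self, flags: dict) -> None:
        self._flags = dict(flags)
        self.last_synced_at = time.time()

    def on_patch(self, diff: dict) -> None:
        self._flags.update(diff)
        self.last_synced_at = time.time()

    def variation(self, key: str, default):
        # Local dictionary lookup: no network call on the request path.
        # The caller-supplied default is used only when the key is unknown.
        return self._flags.get(key, default)

    def seconds_since_sync(self):
        # Expose staleness for monitoring instead of failing evaluations.
        if self.last_synced_at is None:
            return None
        return time.time() - self.last_synced_at


client = FlagClient()
before = client.variation("new_checkout", default=False)  # no snapshot yet
client.on_snapshot({"new_checkout": True})
after = client.variation("new_checkout", default=False)
missing = client.variation("unknown_flag", default="control")
```

&lt;p&gt;Note that a dropped stream changes nothing in &lt;code&gt;variation&lt;/code&gt;: the cache is never cleared on disconnect, and staleness shows up only in the monitoring value.&lt;/p&gt;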

&lt;h2&gt;
  
  
  Targeting and rules engine: boolean predicates, rollout percentages, sticky bucketing
&lt;/h2&gt;

&lt;p&gt;Three primitives, in order: kill switch, targeting rules, percentage rollout. Already shown in the edge evaluator. Worth saying out loud in the interview because it answers "what's a flag actually evaluating?"&lt;/p&gt;

&lt;p&gt;Sticky bucketing is the load-bearing piece. When you say "5% rollout", a given user must always land in the same bucket. Otherwise the user flips between treatment and control across requests, ruins your experiment, and corrupts your analytics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# the bucketing hash — identical implementation on every SDK,
# every edge worker, every backend evaluator. one source of truth.
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sticky_bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flag_salt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# SHA-1, not for security — for stable, uniform distribution across languages.
&lt;/span&gt;    &lt;span class="c1"&gt;# Every SDK ships the same impl. If you swap algorithms you re-bucket
&lt;/span&gt;    &lt;span class="c1"&gt;# every user mid-experiment, which is a silent data corruption bug.
&lt;/span&gt;    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;flag_salt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# take 4 bytes, big-endian, mod the bucket count
&lt;/span&gt;    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;big&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;total_buckets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the salt. Without it, the same user lands in bucket 4,217 for &lt;em&gt;every&lt;/em&gt; flag, so a user in the 5% rollout of flag A is also in the 5% rollout of flag B, C, and D. Correlation across experiments destroys your stats. The salt is usually the flag key itself plus a per-flag random string set at flag creation.&lt;/p&gt;
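
&lt;p&gt;A quick way to see the effect, reusing the hash from the block above. The user IDs and salt strings here are made up for the demonstration.&lt;/p&gt;

```python
import hashlib


def sticky_bucket(user_id: str, flag_salt: str, total_buckets: int = 10_000) -> int:
    digest = hashlib.sha1(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def rollout_members(salt: str, users: list, threshold: int = 500) -> set:
    # threshold=500 out of 10,000 buckets is a 5% rollout
    return {u for u in users if threshold > sticky_bucket(u, salt)}


users = [f"user-{i}" for i in range(20_000)]

# Shared salt: two flags select exactly the same users.
shared_a = rollout_members("shared", users)
shared_b = rollout_members("shared", users)

# Per-flag salts (hypothetical values): the two 5% groups are close to independent.
flag_a = rollout_members("flag-a.k3x9", users)
flag_b = rollout_members("flag-b.q7m2", users)
overlap = len(flag_a.intersection(flag_b))
```

&lt;p&gt;With a shared salt the two sets are identical. With per-flag salts the overlap should sit near 5% of 5% of the population, which is what independent experiments need.&lt;/p&gt;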

&lt;p&gt;Why &lt;code&gt;total_buckets=10_000&lt;/code&gt;: lets you express rollout in basis points (0.01% granularity), enough precision for ramp schedules and small canary groups.&lt;/p&gt;
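
&lt;p&gt;As a sketch of how percentages might map onto those 10,000 buckets: the helper names and the &lt;code&gt;(variant, threshold)&lt;/code&gt; pair layout below are assumptions, chosen to mirror the &lt;code&gt;cumulativeWeight&lt;/code&gt; comparison in the edge evaluator.&lt;/p&gt;

```python
def to_cumulative(weights_percent: dict) -> list:
    """Turn {"variant": percent} into (variant, cumulative_basis_points) pairs."""
    pairs, running = [], 0
    for variant, percent in weights_percent.items():
        running += round(percent * 100)  # 1% == 100 basis points
        pairs.append((variant, running))
    if running > 10_000:
        raise ValueError("rollout weights exceed 100%")
    return pairs


def pick_variant(bucket: int, cumulative: list, fallthrough):
    # Same comparison as the edge evaluator: first threshold above the bucket wins.
    for variant, threshold in cumulative:
        if threshold > bucket:
            return variant
    return fallthrough


# A 0.25% canary plus a 1% holdout; everyone else falls through.
ramp = to_cumulative({"treatment": 0.25, "holdout": 1})
first = pick_variant(24, ramp, "control")
second = pick_variant(25, ramp, "control")
rest = pick_variant(125, ramp, "control")
```

&lt;p&gt;Here buckets 0 to 24 get &lt;code&gt;treatment&lt;/code&gt;, buckets 25 to 124 get &lt;code&gt;holdout&lt;/code&gt;, and the rest fall through. Raising the canary from 0.25% to 1% only moves the first threshold up, so users already in treatment stay there.&lt;/p&gt;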

&lt;h2&gt;
  
  
  The 90-second answer that wins the round
&lt;/h2&gt;

&lt;p&gt;When the interviewer drops the prompt, talk for 90 seconds before drawing anything:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Feature flags are a read-heavy, write-rare distribution problem. The naive &lt;code&gt;GET /flags&lt;/code&gt; design fails at 100k SDK clients because polling forces a bad tradeoff between propagation latency and origin load. I'd reframe to a push protocol: SSE from a stateless gateway tier, with each SDK holding one long-lived connection that receives a full snapshot on connect and patches on every flag change. The gateways subscribe to a pubsub channel (Redis or NATS) that the write path publishes to. The source of truth is Postgres for writes and S3-plus-CDN for cold-start snapshots, so a fleet restart doesn't herd. For browser and mobile SDKs I'd push evaluation to the edge via Cloudflare Workers or Lambda@Edge, with flag config replicated to edge KV; that gives roughly 30ms responses from the nearest PoP. SDKs cache locally and evaluate in microseconds. Sticky bucketing uses a salted SHA-1 hash with the same implementation in every SDK, edge worker, and backend evaluator. The whole system is offline-safe because the SDK keeps serving last-known values when the stream drops."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's the answer. Now they ask follow-ups and you draw boxes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: sticky bucketing requires a deterministic hash, and every SDK must agree
&lt;/h2&gt;

&lt;p&gt;The single failure mode that sinks real flag platforms: hash drift across SDKs.&lt;/p&gt;

&lt;p&gt;Your Node SDK uses MurmurHash3 because someone copy-pasted it from an older open-source experimentation SDK. Your Go SDK uses FNV-1a because Go's stdlib has it. Your Python SDK uses MD5 because that's what the first engineer reached for. The same user gets bucketed three different ways. You roll out a flag to 10%, and a user who touches all three SDKs has about a 27% chance (1 - 0.9^3) of seeing the new variant in at least one of them, because each SDK buckets independently.&lt;/p&gt;

&lt;p&gt;Worse: the bug is invisible. Aggregate counts look right (10% of evaluations across the fleet return the new variant), but per-user consistency is gone. An experiment that should detect a 2% conversion lift sees noise. A canary that should affect 5% of traffic affects different 5%-slices in different SDKs.&lt;/p&gt;

&lt;p&gt;The fix is governance, not code: one bucketing spec, written down, with test vectors. Every SDK ships a &lt;code&gt;test_sticky_bucket.py&lt;/code&gt; (or &lt;code&gt;_test.go&lt;/code&gt;, etc.) with at least 20 &lt;code&gt;(user_id, flag_salt) -&amp;gt; expected_bucket&lt;/code&gt; pairs. CI fails if any SDK disagrees with the canonical vectors. When you change the algorithm, you bump a &lt;code&gt;bucketingVersion&lt;/code&gt; field on every flag and run the old and new algorithms in parallel during the cutover.&lt;/p&gt;
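
&lt;p&gt;A minimal sketch of such a conformance check. In a real setup the vectors would be generated once and committed as a shared file; here they are built in-process so the example is self-contained, and &lt;code&gt;drifted_bucket&lt;/code&gt; stands in for a hypothetical SDK that picked a different hash.&lt;/p&gt;

```python
import hashlib


def reference_bucket(user_id: str, flag_salt: str, total_buckets: int = 10_000) -> int:
    # Same algorithm as sticky_bucket above: salted SHA-1, first 4 bytes, big-endian.
    digest = hashlib.sha1(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def drifted_bucket(user_id: str, flag_salt: str, total_buckets: int = 10_000) -> int:
    # A hypothetical SDK that reached for MD5 instead.
    digest = hashlib.md5(f"{flag_salt}.{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total_buckets


def build_vectors(count: int = 20) -> list:
    # (user_id, flag_salt, expected_bucket) triples from the reference implementation.
    inputs = [(f"user-{i}", f"flag-{i % 4}.salt") for i in range(count)]
    return [(u, s, reference_bucket(u, s)) for u, s in inputs]


def conformance_failures(bucket_fn, vectors: list) -> list:
    # Every input where an implementation disagrees with the canonical bucket.
    return [(u, s) for u, s, expected in vectors if bucket_fn(u, s) != expected]


VECTORS = build_vectors()
ok = conformance_failures(reference_bucket, VECTORS)
bad = conformance_failures(drifted_bucket, VECTORS)
```

&lt;p&gt;A CI job would fail the build whenever the failure list is non-empty. The drifted implementation still produces a uniform distribution, which is exactly why only per-input vectors catch it.&lt;/p&gt;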

&lt;p&gt;If the interviewer is sharp they'll ask about this. Bring it up unprompted and you've shown you've actually shipped one of these systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;This pattern (protocol reframe, push beats poll, edge evaluation, deterministic bucketing) is one of fifteen full system designs walked end-to-end in &lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;System Design Pocket Guide: Interviews&lt;/a&gt;. The feature-flag design lives next to the rate-limiter, the URL shortener, and the notification system; each one structured around the 90-second answer and the follow-up questions that actually decide the round.&lt;/p&gt;

&lt;p&gt;What's the gnarliest follow-up you've been asked on this kind of design? Drop it in the comments and I'll work through it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GX2SQ594" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6dw87uaq2vin2k1bwb0.jpg" alt="System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>interview</category>
      <category>distributedsystems</category>
      <category>scalability</category>
    </item>
    <item>
      <title>JSONB vs Relational in 2026: 5 Query Shapes, 5 Verdicts</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 12:05:00 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/jsonb-vs-relational-in-2026-5-query-shapes-5-verdicts-c1c</link>
      <guid>https://forem.com/gabrielanhaia/jsonb-vs-relational-in-2026-5-query-shapes-5-verdicts-c1c</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;Database Playbook: Choosing the Right Store for Every System You Build&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;JSONB landed in Postgres 9.4 back in 2014. A decade later your team is still arguing about it. Half of them want every new table to have a &lt;code&gt;data jsonb&lt;/code&gt; column "for flexibility." The other half think every JSONB column is a future migration tax. Both camps are right about half the time, which is the most unhelpful version of being right.&lt;/p&gt;

&lt;p&gt;So let's stop arguing in the abstract. There are five query shapes that cover roughly 90% of what apps actually do against Postgres. Each one has a clear winner. The row count where the verdict flips is also knowable. Here it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  When JSONB actually ships
&lt;/h2&gt;

&lt;p&gt;Before the verdicts, a sanity check. JSONB is the right call when you genuinely don't know the shape yet, when the shape is polymorphic per row, or when you need to store an opaque blob that you'll read back as-is.&lt;/p&gt;

&lt;p&gt;Three patterns that age well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs&lt;/strong&gt;: &lt;code&gt;event_payload jsonb&lt;/code&gt; because every event type has different fields and you want to dump the whole envelope. Nobody queries individual keys at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid prototyping&lt;/strong&gt;: week one of a product, you have no idea what fields the integration partner will send. Catch it in JSONB, see what the data looks like, extract columns in week three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polymorphic per-row data&lt;/strong&gt;: a &lt;code&gt;settings&lt;/code&gt; column where a &lt;code&gt;kafka&lt;/code&gt; integration has different keys than a &lt;code&gt;slack&lt;/code&gt; one. Forcing both into a shared schema costs more than it saves.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Outside those three, the default should be columns. The five query shapes below show why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shape 1: Equality on a known field (relational wins by a lot)
&lt;/h2&gt;

&lt;p&gt;The most common app query in existence: "find the user with email &lt;code&gt;x&lt;/code&gt;." If &lt;code&gt;email&lt;/code&gt; lives inside a JSONB column, this is what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- email lives in profile-&amp;gt;&amp;gt;'email'&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'email'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sarah@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can index it with an expression index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;users_profile_email_idx&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'email'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That works. It's also the moment you realize you've reinvented a column with extra steps. Every query that filters by email now has to spell the JSON path correctly, the index won't apply if anyone writes &lt;code&gt;profile -&amp;gt; 'email'&lt;/code&gt; instead of &lt;code&gt;profile -&amp;gt;&amp;gt; 'email'&lt;/code&gt;, and the planner's row estimates for expression indexes are routinely worse than for plain columns.&lt;/p&gt;

&lt;p&gt;Benchmarked on 5M rows with the email present on every row, the column version is consistently 2–4x faster on cold cache and emits a saner plan. The verdict here doesn't flip with size. If you're filtering on a known scalar field, that field belongs in a column.&lt;/p&gt;
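
&lt;p&gt;One way to get there without changing writers yet is a stored generated column. This is a minimal sketch, assuming Postgres 12 or newer; the column and index names are illustrative, and adding a stored generated column rewrites the table, so try it on a copy first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- assumption: Postgres 12+; users_email_idx is an illustrative name.
-- the column is computed from the JSONB blob on every write.
ALTER TABLE users
  ADD COLUMN email text
  GENERATED ALWAYS AS (profile-&amp;gt;&amp;gt;'email') STORED;

-- a plain b-tree on a plain column
CREATE INDEX users_email_idx ON users (email);

-- queries no longer need to spell the JSON path
SELECT id FROM users
WHERE email = 'sarah@example.com';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you control every writer, a regular &lt;code&gt;text&lt;/code&gt; column populated by the app is the simpler end state; the generated column is mainly useful as a bridge.&lt;/p&gt;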

&lt;h2&gt;
  
  
  Shape 2: Existence check (&lt;code&gt;?&lt;/code&gt; operator), where JSONB with a GIN index is fine
&lt;/h2&gt;

&lt;p&gt;Now the actual JSONB use case: "find every row whose &lt;code&gt;features&lt;/code&gt; JSON has a &lt;code&gt;beta_billing&lt;/code&gt; key."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="s1"&gt;'beta_billing'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without an index, Postgres scans every row and parses every JSONB document. On 5M rows that's not a query you want on a hot path. The fix is a GIN index, and this is where the index choice matters more than people realise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- the default GIN class indexes everything: keys, values, paths.&lt;/span&gt;
&lt;span class="c1"&gt;-- bigger, supports more operators (@&amp;gt;, ?, ?|, ?&amp;amp;, @?, @@).&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;accounts_features_gin&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- jsonb_path_ops indexes only the hashed path-value pairs.&lt;/span&gt;
&lt;span class="c1"&gt;-- about 30% smaller, faster builds, but only supports @&amp;gt;.&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;accounts_features_gin_pathops&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your workload is pure &lt;code&gt;@&amp;gt;&lt;/code&gt; containment lookups, &lt;code&gt;jsonb_path_ops&lt;/code&gt; is the right pick: smaller index, faster builds, faster lookups. If you need &lt;code&gt;?&lt;/code&gt; existence checks too, you're stuck with the default opclass. Or you keep both, which I've seen teams do when the containment index is the hot path and the existence one runs on admin pages.&lt;/p&gt;

&lt;p&gt;A nasty subtlety: &lt;code&gt;?&lt;/code&gt; checks for a key at the top level only. &lt;code&gt;WHERE features ? 'beta_billing'&lt;/code&gt; won't find &lt;code&gt;{"flags": {"beta_billing": true}}&lt;/code&gt;. People learn this the hard way at 2am. For nested keys you need &lt;code&gt;@?&lt;/code&gt; with a jsonpath, or restructure.&lt;/p&gt;
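
&lt;p&gt;For the nested case, the two query forms might look like the sketch below. It reuses the &lt;code&gt;accounts.features&lt;/code&gt; column from above and assumes the flag sits under a &lt;code&gt;flags&lt;/code&gt; object, as in the example document. Whether a given GIN index is used for a pure path-existence check depends on the opclass, so confirm with &lt;code&gt;EXPLAIN&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- jsonpath existence: matches when the path $.flags.beta_billing exists,
-- whatever its value is.
SELECT id FROM accounts
WHERE features @? '$.flags.beta_billing';

-- containment alternative: matches only when the value is exactly true.
SELECT id FROM accounts
WHERE features @&amp;gt; '{"flags": {"beta_billing": true}}';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;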

&lt;h2&gt;
  
  
  Shape 3: Aggregation across a JSONB field (relational wins above ~100k–1M rows)
&lt;/h2&gt;

&lt;p&gt;"Sum &lt;code&gt;amount_cents&lt;/code&gt; from every order placed last week."&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;amount_cents&lt;/code&gt; lives inside &lt;code&gt;order_data jsonb&lt;/code&gt;, the value has to be extracted and cast for every row in the aggregation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;order_data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'amount_cents'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; operator returns text. The cast to &lt;code&gt;bigint&lt;/code&gt; happens for every row in the scan. On a &lt;code&gt;bigint&lt;/code&gt; column, the same query reads a fixed-width number directly from the tuple. On a 500k-row weekly window, the column version finishes in tens of milliseconds, while the JSONB version takes seconds. Easily a 20–40x gap on typical hardware.&lt;/p&gt;

&lt;p&gt;The crossover point is somewhere between 100k and 1M rows depending on payload size and how warm the cache is. Below that you'll mostly notice it on slow endpoints. Above that, aggregation queries start showing up in your slow log as the dominant cost. Hot numeric fields you aggregate on belong in columns. Full stop.&lt;/p&gt;
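
&lt;p&gt;To find where the crossover sits on your own data, run both variants under &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;. The sketch below assumes &lt;code&gt;amount_cents&lt;/code&gt; already exists as a &lt;code&gt;bigint&lt;/code&gt; column alongside the JSONB key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- column version: reads a fixed-width bigint straight from the tuple
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(amount_cents)
FROM orders
WHERE created_at &amp;gt;= NOW() - INTERVAL '7 days';

-- JSONB version: extracts text and casts it for every row
EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM((order_data-&amp;gt;&amp;gt;'amount_cents')::bigint)
FROM orders
WHERE created_at &amp;gt;= NOW() - INTERVAL '7 days';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run each a few times so you compare warm-cache numbers with warm-cache numbers.&lt;/p&gt;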

&lt;h2&gt;
  
  
  Shape 4: Containment query (&lt;code&gt;@&amp;gt;&lt;/code&gt;), where JSONB shines
&lt;/h2&gt;

&lt;p&gt;"Find every event whose payload contains &lt;code&gt;{"source": "stripe", "type": "invoice.paid"}&lt;/code&gt;." Try expressing that with relational columns and you end up with N filter predicates per query plus index combinations the planner has to choose between. With JSONB it's one operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"source": "stripe", "type": "invoice.paid"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a &lt;code&gt;jsonb_path_ops&lt;/code&gt; GIN index on &lt;code&gt;payload&lt;/code&gt; (the same opclass shown in Shape 2), this is fast. Really fast. The index stores hashed path-value pairs, so the lookup is essentially a hash probe against a posting list. On a 50M-row event table, this returns in a couple of milliseconds for selective predicates.&lt;/p&gt;

&lt;p&gt;This is the shape JSONB was built for. Multi-key containment against semi-structured data with an unbounded set of possible keys. If your app does a lot of "find the subset of records whose JSON matches this prototype," JSONB plus &lt;code&gt;jsonb_path_ops&lt;/code&gt; is genuinely the right answer and there's no relational shape that beats it without a denormalisation explosion.&lt;/p&gt;
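
&lt;p&gt;A sketch of the index that backs this query, assuming &lt;code&gt;events.payload&lt;/code&gt; is a &lt;code&gt;jsonb&lt;/code&gt; column; the index name is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- assumption: events.payload is jsonb; the index name is made up.
CREATE INDEX events_payload_gin_pathops
  ON events USING gin (payload jsonb_path_ops);

-- check that the planner picks the index instead of a sequential scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM events
WHERE payload @&amp;gt; '{"source": "stripe", "type": "invoice.paid"}'::jsonb;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a table that already takes writes, &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; avoids blocking them while the index builds, at the cost of a slower build.&lt;/p&gt;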

&lt;h2&gt;
  
  
  Shape 5: Sparse / polymorphic schema, where JSONB is the right answer
&lt;/h2&gt;

&lt;p&gt;"Store user-defined custom fields per account, where account A has &lt;code&gt;{vat_id, billing_contact}&lt;/code&gt; and account B has &lt;code&gt;{purchase_order_number, accounting_email, ap_phone}&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;The relational shape is the EAV anti-pattern: a &lt;code&gt;custom_fields&lt;/code&gt; table with &lt;code&gt;(account_id, field_name, field_value)&lt;/code&gt; and a million JOINs to assemble one record. That works at small scale and rots at large scale. The other relational option is nullable columns for every possible custom field, which means a schema migration every time a customer asks for a new one.&lt;/p&gt;

&lt;p&gt;JSONB makes this trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="n"&gt;bigserial&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;custom&lt;/span&gt;     &lt;span class="n"&gt;jsonb&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- queryable when you need it&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;accounts_custom_gin&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get type-flexible storage, indexed containment lookups, and no schema churn. The trade is that you have no schema enforcement at the column level. That responsibility moves to the app layer, ideally with a versioned validator (JSON Schema, a Pydantic/Zod model, whatever your stack uses). When the schema is genuinely open-ended, this trade is good.&lt;/p&gt;
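
&lt;p&gt;Two small additions that tend to pair well with this table, sketched with a made-up VAT value and constraint name. The &lt;code&gt;CHECK&lt;/code&gt; is only a coarse guard that the column holds a JSON object; it does not replace the app-level validator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- accounts that set a specific custom field value (served by the GIN index)
SELECT id, name FROM accounts
WHERE custom @&amp;gt; '{"vat_id": "DE123456789"}';

-- coarse guard: reject arrays, strings and numbers at the top level
ALTER TABLE accounts
  ADD CONSTRAINT accounts_custom_is_object
  CHECK (jsonb_typeof(custom) = 'object');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;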

&lt;h2&gt;
  
  
  The migration path: extract hot keys to columns, keep cold ones in JSONB
&lt;/h2&gt;

&lt;p&gt;The common mistake is treating this as binary. It isn't. A table can have a relational core and a JSONB tail. The pattern that ages best in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;data jsonb&lt;/code&gt; when shape is unclear.&lt;/li&gt;
&lt;li&gt;Run for a few weeks. Look at your slow query log. The fields showing up in &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt; are the hot ones.&lt;/li&gt;
&lt;li&gt;Extract those fields into typed columns. Keep the long tail in JSONB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a migration script for that extraction, written so you can paste it as a template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 1. add the new typed columns, nullable so the backfill can run hot.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;amount_cents&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt; &lt;span class="nb"&gt;char&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="n"&gt;timestamptz&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 2. backfill from the JSONB blob. batch for big tables.&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;amount_cents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'amount_cents'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;currency&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'currency'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;placed_at&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'placed_at'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="n"&gt;timestamptz&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 3. enforce NOT NULL once the backfill is verified.&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;amount_cents&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;     &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt;    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 4. drop the now-redundant keys from the JSONB blob to save space.&lt;/span&gt;
&lt;span class="c1"&gt;--    skip this step if anything still reads them via the JSON path.&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'customer_id'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'amount_cents'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'currency'&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'placed_at'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 5. indexes appropriate for the new columns.&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;orders_customer_id_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;orders_placed_at_idx&lt;/span&gt;   &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For tables above ~5M rows, do step 2 in batches keyed on a primary-key range (a &lt;code&gt;WHERE primary_key BETWEEN&lt;/code&gt; clause; Postgres &lt;code&gt;UPDATE&lt;/code&gt; has no &lt;code&gt;LIMIT&lt;/code&gt;), and run them in a loop outside a single transaction. A single &lt;code&gt;UPDATE&lt;/code&gt; over 50M rows keeps every touched row locked until it commits, and the long transaction holds back vacuum and floods the WAL, which is enough to wreck your p99 latency.&lt;/p&gt;
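
&lt;p&gt;A batched backfill keyed on primary-key ranges could look like the sketch below. It assumes &lt;code&gt;orders&lt;/code&gt; has a &lt;code&gt;bigint&lt;/code&gt; primary key named &lt;code&gt;id&lt;/code&gt; (not shown in the original schema), Postgres 11 or newer so the &lt;code&gt;DO&lt;/code&gt; block can commit per batch, and that it is run outside any surrounding &lt;code&gt;BEGIN&lt;/code&gt;. The batch size is a guess to tune:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- assumptions: orders.id is a bigint primary key; Postgres 11+;
-- not wrapped in an outer transaction, otherwise COMMIT fails.
DO $$
DECLARE
  batch_size constant bigint := 10000;  -- tune for your hardware
  lo     bigint;
  max_id bigint;
BEGIN
  SELECT min(id), max(id) INTO lo, max_id FROM orders;

  -- on an empty table both values are NULL and the loop is skipped
  WHILE lo &amp;lt;= max_id LOOP
    UPDATE orders
    SET
      customer_id  = (data-&amp;gt;&amp;gt;'customer_id')::bigint,
      amount_cents = (data-&amp;gt;&amp;gt;'amount_cents')::bigint,
      currency     = data-&amp;gt;&amp;gt;'currency',
      placed_at    = (data-&amp;gt;&amp;gt;'placed_at')::timestamptz
    WHERE id BETWEEN lo AND lo + batch_size - 1
      AND customer_id IS NULL;

    COMMIT;  -- release row locks after every batch
    lo := lo + batch_size;
  END LOOP;
END $$;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because of the &lt;code&gt;customer_id IS NULL&lt;/code&gt; filter, the loop is safe to re-run if it gets interrupted partway.&lt;/p&gt;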

&lt;p&gt;Also worth saying: dropping keys from JSONB in step 4 rewrites every row. On a big table, that's a lot of bloat and a vacuum after. Skip it if disk is cheap or if anything in the codebase still reads those keys through the JSON path. The schema-as-documentation value isn't worth a multi-hour &lt;code&gt;VACUUM FULL&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha: you can't put a foreign key on a JSONB path
&lt;/h2&gt;

&lt;p&gt;This is the trap that catches teams who push too hard into JSONB. Foreign keys can only be declared on columns. Not JSONB paths. Not expressions.&lt;/p&gt;

&lt;p&gt;So this fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- doesn't work. there's no syntax for "FK targeting a JSONB path."&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;invoices_customer_fk&lt;/span&gt;
  &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- ERROR: syntax error at or near "("&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can't fake it with a check constraint, either. Check constraints in Postgres can't run subqueries, so &lt;code&gt;CHECK (EXISTS (SELECT 1 FROM customers WHERE id = (data-&amp;gt;&amp;gt;'customer_id')::bigint))&lt;/code&gt; is rejected. You're stuck with either trigger-based enforcement (slow and easy to bypass) or no referential integrity (and the orphaned-record problems that follow).&lt;/p&gt;

&lt;p&gt;The escape hatch is exactly what the migration script above does: extract the referencing key to a real column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="s1"&gt;'customer_id'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;bigint&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;invoices_customer_fk&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have a real FK with &lt;code&gt;ON DELETE&lt;/code&gt; behaviour, plan-time row estimates, and the rest of what relational integrity gives you. The JSONB column can still hold the long tail.&lt;/p&gt;
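
&lt;p&gt;On a large, busy table, a common variant splits the constraint into two steps so the existing rows are not checked while the stronger lock is held. This is a sketch of that variant using the same constraint name as above, to be used in place of the single &lt;code&gt;ADD CONSTRAINT&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- step 1: register the FK without checking existing rows.
-- new and updated rows are enforced from here on.
ALTER TABLE invoices
  ADD CONSTRAINT invoices_customer_fk
  FOREIGN KEY (customer_id) REFERENCES customers(id)
  NOT VALID;

-- step 2: check the existing rows in a separate statement.
-- fails if any invoice points at a missing customer.
ALTER TABLE invoices
  VALIDATE CONSTRAINT invoices_customer_fk;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;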

&lt;p&gt;If you find yourself wanting FKs from three different JSONB keys, that's the database telling you those keys should have been columns from day one. Listen to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five verdicts on one card
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query shape&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;th&gt;Crossover row count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Equality on a known scalar&lt;/td&gt;
&lt;td&gt;Relational column&lt;/td&gt;
&lt;td&gt;Never flips. Column wins at all sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Existence check (&lt;code&gt;?&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;JSONB + GIN&lt;/td&gt;
&lt;td&gt;Below ~10k rows the seq scan is fine; above, you need GIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Numeric aggregation&lt;/td&gt;
&lt;td&gt;Relational column&lt;/td&gt;
&lt;td&gt;Flips around 100k–1M rows; columns win above&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;) on semi-structured data&lt;/td&gt;
&lt;td&gt;JSONB + &lt;code&gt;jsonb_path_ops&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;JSONB wins at all sizes when the schema is open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Polymorphic / per-row sparse keys&lt;/td&gt;
&lt;td&gt;JSONB&lt;/td&gt;
&lt;td&gt;JSONB wins at all sizes; alternative is EAV pain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The decision isn't "JSONB or relational." It's "which fields per table." Hot, typed, queried-by-equality, foreign-keyed: column. Cold, sparse, polymorphic, containment-searched: JSONB. The migration path between them is well-trodden. Use it.&lt;/p&gt;

&lt;p&gt;What's the worst JSONB-vs-column call you've seen ship to prod, and how long did it take to undo?&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;Picking the right storage shape is the chapter most teams skim and the one that costs them six months later. The &lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;Database Playbook: Choosing the Right Store for Every System You Build&lt;/a&gt; walks the JSONB-versus-relational decision in more depth — the GIN opclass trade-offs, the schema-evolution patterns that don't paint you into a corner, and the queries you can't write because you picked the wrong store. If you spend any of your week tuning Postgres or arguing about it in PRs, it's probably worth the read.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktttopxmkwrt9qcazhnc.jpg" alt="Database Playbook: Choosing the Right Store for Every System You Build" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>performance</category>
      <category>backend</category>
    </item>
    <item>
      <title>Logical Replication for Migrations: Zero-Downtime Postgres Upgrades in 2026</title>
      <dc:creator>Gabriel Anhaia</dc:creator>
      <pubDate>Sun, 24 May 2026 12:04:34 +0000</pubDate>
      <link>https://forem.com/gabrielanhaia/logical-replication-for-migrations-zero-downtime-postgres-upgrades-in-2026-4in</link>
      <guid>https://forem.com/gabrielanhaia/logical-replication-for-migrations-zero-downtime-postgres-upgrades-in-2026-4in</guid>
      <description>&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Book:&lt;/strong&gt; &lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;Database Playbook: Choosing the Right Store for Every System You Build&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Also by me:&lt;/strong&gt; &lt;em&gt;Thinking in Go&lt;/em&gt; (2-book series) — &lt;a href="https://xgabriel.com/go-book" rel="noopener noreferrer"&gt;Complete Guide to Go Programming&lt;/a&gt; + &lt;a href="https://xgabriel.com/hexagonal-go" rel="noopener noreferrer"&gt;Hexagonal Architecture in Go&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My project:&lt;/strong&gt; &lt;a href="https://hermes-ide.com" rel="noopener noreferrer"&gt;Hermes IDE&lt;/a&gt; | &lt;a href="https://github.com/hermes-hq/hermes-ide" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; — an IDE for developers who ship with Claude Code and other AI coding tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Me:&lt;/strong&gt; &lt;a href="https://xgabriel.com" rel="noopener noreferrer"&gt;xgabriel.com&lt;/a&gt; | &lt;a href="https://github.com/gabrielanhaia" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Major Postgres upgrades used to mean a maintenance window. With logical replication, they're zero-downtime, if you set it up right. There are three traps that turn a 20-minute cutover into a 6-hour incident, and none of them are in the official upgrade docs.&lt;/p&gt;

&lt;p&gt;This is the playbook a team I work with uses for 13 to 16 to 17 jumps on databases that can't take a banner. Real SQL, the sequence script that always gets forgotten, and the verification step that catches the silent data drift before users do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why pg_upgrade is the old way
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pg_upgrade&lt;/code&gt; is fast. On the same machine, with hard links, it can move a 500GB cluster in under a minute. The catch is everything around it.&lt;/p&gt;

&lt;p&gt;You stop the database. You run the upgrade. You restart it. You analyze statistics so the planner doesn't fall over on the first query. If anything fails partway, you're recovering from backup while your status page bleeds. And &lt;code&gt;pg_upgrade&lt;/code&gt; doesn't help at all if you're moving across machines: different host, different storage class, different cloud region.&lt;/p&gt;

&lt;p&gt;A team I talked to last quarter ran &lt;code&gt;pg_upgrade&lt;/code&gt; on a 2TB cluster as part of a 13 to 16 jump. The binary swap took 90 seconds. The &lt;code&gt;ANALYZE&lt;/code&gt; on the upgraded cluster took 47 minutes. During those 47 minutes, the application was up but every query that touched a non-trivial index ran a sequential scan. Their p99 went from 80ms to 14 seconds. Customers noticed.&lt;/p&gt;

&lt;p&gt;Logical replication sidesteps all of that. You build the new cluster while the old one is still serving traffic. You analyze on the new cluster while nobody's watching it. You cut over at a moment of your choosing, with seconds of write pause instead of minutes of downtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logical replication primer
&lt;/h2&gt;

&lt;p&gt;Logical replication moves row-level changes from a publisher to a subscriber over a normal Postgres connection. It's been stable since Postgres 10 and has gotten meaningfully better every release since.&lt;/p&gt;

&lt;p&gt;Three pieces you need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Publication&lt;/strong&gt;: a named set of tables on the source. Created with &lt;code&gt;CREATE PUBLICATION&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscription&lt;/strong&gt;: a connection from the target back to the source that pulls changes for a publication. Created with &lt;code&gt;CREATE SUBSCRIPTION&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication slot&lt;/strong&gt;: a server-side bookmark on the publisher that tracks how far the subscriber has consumed. The slot keeps WAL files around until the subscriber acks them, which is why a stuck subscriber can fill your disk.&lt;/li&gt;
&lt;/ul&gt;
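
&lt;p&gt;Since a stuck subscriber shows up as WAL piling up behind its slot, it is worth watching the slot from day one. A minimal monitoring query, run on the publisher, might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- run on the publisher: how much WAL each slot is holding back
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An inactive slot with a growing &lt;code&gt;retained_wal&lt;/code&gt; is the pattern to alert on.&lt;/p&gt;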

&lt;p&gt;The flow: the publisher decodes its WAL into row changes, ships them over the wire, the subscriber applies them with regular SQL. Because it's logical (rows, not pages), the publisher and subscriber can run different Postgres versions. That's what makes this whole thing work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5-step zero-downtime upgrade
&lt;/h2&gt;

&lt;p&gt;The whole sequence assumes you have a target cluster running the new Postgres version, reachable from the source over the network, with enough disk for the data plus 30% headroom for the initial sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1. Prepare the schema on the target
&lt;/h3&gt;

&lt;p&gt;Logical replication does not copy the schema. You have to dump it yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# on a host that can reach the source&lt;/span&gt;
pg_dump &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres-13.internal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;migrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dbname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schema-only&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-owner&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-acl&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orders-schema.sql

&lt;span class="c"&gt;# apply it to the target&lt;/span&gt;
psql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres-16.internal &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5432 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;migrator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dbname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orders-schema.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Disable foreign key checks on the subscriber? You don't need to. The replication workers run with &lt;code&gt;session_replication_role = replica&lt;/code&gt;, which skips ordinary triggers, including the internal ones that enforce foreign keys, so load order is not a problem. The same setting keeps trigger-based logic from double-firing on rows whose side effects already happened upstream: during apply, only triggers marked &lt;code&gt;ENABLE REPLICA TRIGGER&lt;/code&gt; or &lt;code&gt;ENABLE ALWAYS TRIGGER&lt;/code&gt; fire. By default, your application triggers sit out. Good.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2. Create the publication on the source
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- on postgres-13.internal&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;orders_migration&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



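&lt;p&gt;&lt;code&gt;FOR ALL TABLES&lt;/code&gt; is the right call when you are migrating a whole database. A publication is scoped to one database, so a cluster with several databases needs a publication and subscription pair for each. If you want a subset, list the tables: &lt;code&gt;FOR TABLE customers, orders, line_items&lt;/code&gt;. Either way, this is cheap. It's metadata, no data movement yet.&lt;/p&gt;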

&lt;p&gt;Confirm the publisher's &lt;code&gt;wal_level&lt;/code&gt; is &lt;code&gt;logical&lt;/code&gt;. If it's not, you're restarting the source to flip it, which is the one downtime moment in this whole plan that you can't avoid. Check ahead of time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;wal_level&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- if 'replica', set 'logical' in postgresql.conf and restart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
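
&lt;p&gt;&lt;code&gt;wal_level&lt;/code&gt; is not the only publisher setting that matters: each subscription also takes a replication slot and a WAL sender, and the initial table copy borrows a few more for a while. A rough preflight sketch, assuming the host, user, and database from the examples above; both function names are hypothetical:&lt;/p&gt;

```shell
# Hypothetical preflight: print the publisher settings that logical
# replication depends on, one name=value pair per line.
publisher_preflight() {
  psql --host=postgres-13.internal --username=migrator --dbname=orders \
    --no-align --tuples-only --field-separator='=' \
    --command="SELECT name, setting FROM pg_settings
               WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders')
               ORDER BY name;"
}

# Succeeds only when wal_level is already 'logical'.
wal_level_is_logical() {
  publisher_preflight | grep -q '^wal_level=logical$'
}
```

&lt;p&gt;Calling &lt;code&gt;wal_level_is_logical || echo "restart needed"&lt;/code&gt; a few days ahead of the migration gives you time to schedule the restart instead of discovering it on the day.&lt;/p&gt;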



&lt;h3&gt;
  
  
  Step 3. Create the subscription on the target
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- on postgres-16.internal&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_migration_sub&lt;/span&gt;
  &lt;span class="k"&gt;CONNECTION&lt;/span&gt; &lt;span class="s1"&gt;'host=postgres-13.internal port=5432 dbname=orders user=replicator password=...'&lt;/span&gt;
  &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;orders_migration&lt;/span&gt;
  &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;copy_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;create_slot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;slot_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'orders_migration_slot'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;streaming&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'parallel'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;binary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few choices worth knowing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;copy_data = true&lt;/code&gt; triggers the initial table sync: every existing row gets copied before incremental changes start applying. For a 2TB cluster expect hours.&lt;/li&gt;
&lt;li&gt;
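&lt;code&gt;streaming = 'parallel'&lt;/code&gt; (subscriber on Postgres 16+) streams large in-progress transactions and applies them with parallel workers instead of waiting for the commit. Big win for write-heavy workloads, but the publisher has to support streaming too, which arrived in Postgres 14; with a 13 source like the one in this example, don't expect it to take effect.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;binary = true&lt;/code&gt; is faster but requires matching column types on both sides. If you changed a type during this migration, leave it &lt;code&gt;false&lt;/code&gt;. Like streaming, it depends on a Postgres 14+ publisher.&lt;/li&gt;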
&lt;/ul&gt;

&lt;p&gt;Watch the sync progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;subname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest_end_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;received_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lag_bytes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_subscription&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



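&lt;p&gt;When &lt;code&gt;lag_bytes&lt;/code&gt; is &lt;code&gt;0 bytes&lt;/code&gt; and stays there during normal write load, the subscriber has nothing outstanding from its own point of view. Both columns are subscriber-side positions, so treat this as a first check and confirm from the source as well: the slot's &lt;code&gt;confirmed_flush_lsn&lt;/code&gt; in &lt;code&gt;pg_replication_slots&lt;/code&gt; should be at or close to &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt;.&lt;/p&gt;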
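
&lt;p&gt;&lt;code&gt;pg_stat_subscription&lt;/code&gt; only shows the subscriber's view. A sketch of two cross-checks, one against the publisher's slot and one against the subscriber's per-table sync state in &lt;code&gt;pg_subscription_rel&lt;/code&gt;. It assumes the hosts and slot name used above, and the function names are hypothetical:&lt;/p&gt;

```shell
# Bytes of WAL the publisher has written that the subscriber has not yet
# confirmed. Runs against the SOURCE.
slot_lag_bytes() {
  psql --host=postgres-13.internal --username=migrator --dbname=orders \
    --no-align --tuples-only \
    --command="SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
               FROM pg_replication_slots
               WHERE slot_name = 'orders_migration_slot';"
}

# Number of tables whose initial copy has not reached the 'ready' state.
# Runs against the TARGET.
tables_not_ready() {
  psql --host=postgres-16.internal --username=migrator --dbname=orders \
    --no-align --tuples-only \
    --command="SELECT count(*) FROM pg_subscription_rel WHERE srsubstate != 'r';"
}
```

&lt;p&gt;You want &lt;code&gt;tables_not_ready&lt;/code&gt; to print 0 before you trust any lag number. On a busy source the slot lag may hover slightly above zero between feedback messages, so look at the trend rather than a single reading.&lt;/p&gt;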

&lt;h3&gt;
  
  
  Step 4. Cut over
&lt;/h3&gt;

&lt;p&gt;The cutover is where teams get nervous. Here's the actual sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set the application to read-only mode (or pause writes entirely). One feature flag, one deploy.&lt;/li&gt;
&lt;li&gt;Wait 5 seconds for in-flight transactions to commit on the source.&lt;/li&gt;
&lt;li&gt;Confirm &lt;code&gt;lag_bytes = 0&lt;/code&gt; from the subscription stat query above.&lt;/li&gt;
&lt;li&gt;Bump sequences (next section: the trap that bites everyone).&lt;/li&gt;
&lt;li&gt;Flip DNS / connection string / PgBouncer config to point at the new cluster.&lt;/li&gt;
&lt;li&gt;Take the application out of read-only mode.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total write pause: 10 to 30 seconds depending on how fast your config rollout is. That's the "zero-downtime": reads kept serving from a read replica throughout, writes paused for under a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5. Decommission
&lt;/h3&gt;

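&lt;p&gt;Don't drop the old cluster yet. Run the new one as the source of truth for at least 24 hours, with the old cluster still up but taking no traffic. If something goes wrong, you can flip back, though any writes the new cluster accepted in the meantime stay behind unless you replay them or set up replication in the reverse direction.&lt;/p&gt;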

&lt;p&gt;After 24 hours of clean operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- on postgres-16.internal&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;SUBSCRIPTION&lt;/span&gt; &lt;span class="n"&gt;orders_migration_sub&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- on postgres-13.internal&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;PUBLICATION&lt;/span&gt; &lt;span class="n"&gt;orders_migration&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- check the slot is gone (DROP SUBSCRIPTION should have cleaned it,&lt;/span&gt;
&lt;span class="c1"&gt;-- but verify because a leaked slot will fill disk)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slot_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;retained_wal&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then snapshot the old cluster and shut it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trap 1. Sequences don't replicate
&lt;/h2&gt;

&lt;p&gt;This is the one. Logical replication copies row changes. Sequences are not rows; they're a separate object Postgres bumps independently. When you cut over, every sequence on the new cluster is at whatever value the schema dump set it to (probably 1), while your application is about to insert a row that conflicts with an existing primary key.&lt;/p&gt;

&lt;p&gt;You have to bump every sequence to match the source, right before the cutover. Here's the script that does it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- run on the SOURCE to generate the bump statements&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT setval(%L, %s, true);'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;sequencename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_value&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_sequences&lt;/span&gt;
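&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pg_catalog'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'information_schema'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;last_value&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;  &lt;span class="c1"&gt;-- never-used sequences report NULL; skip them&lt;/span&gt;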
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sequencename&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That returns one line per sequence, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.orders_id_seq'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8472913&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.customers_id_seq'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;412877&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'public.line_items_id_seq'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;31204882&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You collect the output, then during step 4 of the cutover (between the write pause and the DNS flip), you run those statements on the target. The &lt;code&gt;true&lt;/code&gt; flag means "the next call to &lt;code&gt;nextval&lt;/code&gt; returns a value greater than this," which is what you want.&lt;/p&gt;
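
&lt;p&gt;Copying statements by hand during a write pause invites mistakes. One way to wire the two sides together, assuming the generator query above is saved as &lt;code&gt;gen-setval.sql&lt;/code&gt; (a hypothetical file name) and the same hosts and user as before:&lt;/p&gt;

```shell
# Generate setval() statements on the SOURCE and replay them on the TARGET.
# --no-align and --tuples-only strip headers so the output is plain SQL.
bump_sequences() {
  psql --host=postgres-13.internal --username=migrator --dbname=orders \
    --no-align --tuples-only --file=gen-setval.sql |
  psql --host=postgres-16.internal --username=migrator --dbname=orders \
    --set=ON_ERROR_STOP=1 --quiet
}
```

&lt;p&gt;A pipeline reports only the exit status of its last command, so a failure on the source side can pass silently. Spot-checking a couple of sequences on the target afterwards is a cheap guard.&lt;/p&gt;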

&lt;p&gt;A safer variant, if you don't trust your write-pause window: add a buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'SELECT setval(%L, %s, true);'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'.'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;sequencename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_value&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;  &lt;span class="c1"&gt;-- buffer for in-flight writes&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_sequences&lt;/span&gt;
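&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pg_catalog'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'information_schema'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;last_value&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- never-used sequences report NULL; skip them&lt;/span&gt;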
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You lose 10,000 IDs per sequence. That's nothing. What you gain is a safety margin: even if a few writes slipped through your read-only flag, the new cluster will not collide with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trap 2. DDL doesn't replicate
&lt;/h2&gt;

&lt;p&gt;Logical replication moves row data, not schema changes. If you &lt;code&gt;ALTER TABLE orders ADD COLUMN canceled_at TIMESTAMPTZ&lt;/code&gt; on the source while replication is active, the subscriber will start failing on the next insert that includes that column, with an error like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: logical replication target relation "public.orders" is missing
replicated column: "canceled_at"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replication halts. The slot keeps growing. You're now in an incident.&lt;/p&gt;

&lt;p&gt;The fix is process, not SQL. During the migration window, you freeze DDL. No schema changes, no new columns, no index rebuilds that drop and recreate. If you absolutely must ship a schema change mid-migration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apply the DDL on the subscriber first.&lt;/li&gt;
&lt;li&gt;Apply the DDL on the publisher second.&lt;/li&gt;
&lt;li&gt;Use only additive changes (&lt;code&gt;ADD COLUMN&lt;/code&gt; with a nullable column, &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;). Never drop, rename, or change a type.&lt;/li&gt;
&lt;/ol&gt;
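
&lt;p&gt;The ordering is easy to get wrong by hand at 2 a.m. A small sketch that enforces it, assuming the DDL lives in a file and the same hosts and user as earlier; &lt;code&gt;apply_ddl_subscriber_first&lt;/code&gt; is a made-up name:&lt;/p&gt;

```shell
# Hypothetical helper: apply one additive migration file to the subscriber
# first, and to the publisher only if that succeeded.
apply_ddl_subscriber_first() {
  ddl_file="$1"
  # Order matters: target (subscriber) first, then source (publisher).
  for host in postgres-16.internal postgres-13.internal; do
    echo "applying $ddl_file on $host"
    if ! psql --host="$host" --username=migrator --dbname=orders \
         --set=ON_ERROR_STOP=1 --quiet --file="$ddl_file"; then
      echo "failed on $host, stopping"
      return 1
    fi
  done
}
```

&lt;p&gt;It deliberately does not wrap the file in a single transaction, because &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; cannot run inside one.&lt;/p&gt;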

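&lt;p&gt;For longer migrations where you can't freeze DDL for days, look at the &lt;code&gt;pglogical&lt;/code&gt; extension, which can forward DDL that you route through its &lt;code&gt;pglogical.replicate_ddl_command&lt;/code&gt; function. Built-in logical replication, up to and including Postgres 17, does not replicate DDL. For a 1–2 day window, just freeze.&lt;/p&gt;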

&lt;h2&gt;
  
  
  Trap 3. Large objects and unlogged tables get skipped
&lt;/h2&gt;

&lt;p&gt;Two categories of data that logical replication silently ignores:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large objects&lt;/strong&gt; (the &lt;code&gt;pg_largeobject&lt;/code&gt; system table, accessed via &lt;code&gt;lo_*&lt;/code&gt; functions). If your app stores PDFs or images using the LO API, those bytes are not in your subscription. You'll have an empty file table on the new cluster and not know it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlogged tables&lt;/strong&gt;. These are tables created with &lt;code&gt;CREATE UNLOGGED TABLE&lt;/code&gt;, used for ephemeral data because they skip the WAL. Logical replication reads from the WAL. No WAL, no replication.&lt;/p&gt;

&lt;p&gt;Workarounds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For large objects: use &lt;code&gt;pg_dump --large-objects&lt;/code&gt; separately (the flag is spelled &lt;code&gt;--blobs&lt;/code&gt; in &lt;code&gt;pg_dump&lt;/code&gt; releases before 16), restore into the target after the initial sync, then accept that any LOs written during the cutover window will need a delta sync. The cleaner long-term answer is to move away from large objects to &lt;code&gt;bytea&lt;/code&gt; columns or external object storage.&lt;/li&gt;
&lt;li&gt;For unlogged tables: dump and restore them at cutover time, accepting that data written between dump and cutover is lost. If the table genuinely doesn't matter (cache, session store), that's fine. If it does matter, it shouldn't be unlogged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's a third gap worth knowing, and this one is loud rather than quiet: tables without a usable &lt;code&gt;REPLICA IDENTITY&lt;/code&gt;. By default the replica identity is the primary key. If a published table has no PK and no &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; set, inserts still replicate, but updates and deletes are rejected on the source with an error along the lines of &lt;code&gt;cannot update table "..." because it does not have a replica identity and publishes updates&lt;/code&gt;. The check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nspname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relreplident&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_namespace&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relnamespace&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relkind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'r'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relreplident&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'d'&lt;/span&gt;  &lt;span class="c1"&gt;-- default = primary key&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_index&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indrelid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indisprimary&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any table that comes back needs a primary key, a &lt;code&gt;REPLICA IDENTITY USING INDEX&lt;/code&gt; on a suitable unique index, or &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; before you start.&lt;/p&gt;
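
&lt;p&gt;If the check returns more than a handful of tables, generating the statements beats typing them. A sketch that reads &lt;code&gt;schema.table&lt;/code&gt; names on standard input; the function name and the two table names in the demo line are invented for illustration:&lt;/p&gt;

```shell
# Hypothetical helper: turn "schema.table" names (one per line on stdin)
# into ALTER TABLE statements. Review the output before running it.
replica_identity_full_sql() {
  while IFS= read -r table; do
    # skip blank lines
    [ -n "$table" ] || continue
    printf 'ALTER TABLE %s REPLICA IDENTITY FULL;\n' "$table"
  done
}

printf 'public.audit_log\npublic.events\n' | replica_identity_full_sql
```

&lt;p&gt;Names are passed through as-is, so mixed-case or otherwise unusual identifiers need quoting before they go in. &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; logs the whole old row for every update and delete, which makes it a fallback for tables that really cannot get a key.&lt;/p&gt;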

&lt;h2&gt;
  
  
  Verifying the cutover
&lt;/h2&gt;

&lt;p&gt;Row counts are not enough. A table can have the right count and the wrong data. You want checksums.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- run on both source and target, compare results&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="s1"&gt;'orders'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;row_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_checksum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That builds a per-row hash, concatenates them in primary-key order, and hashes the result. Two clusters with identical data return identical checksums. Two clusters off by one row return wildly different hashes.&lt;/p&gt;

&lt;p&gt;For tables with hundreds of millions of rows, this gets slow, and the aggregated string can run into Postgres's 1 GB limit for a single value. Sample instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="s1"&gt;'orders'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sample_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;checksum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;2000000&lt;/span&gt;  &lt;span class="c1"&gt;-- bounded range&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- deterministic ~1% sample; TABLESAMPLE would pick different rows on each cluster&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the same sample bounds on both sides. If checksums match on three or four random ranges, you're good.&lt;/p&gt;
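
&lt;p&gt;Comparing hashes by eye across several ranges gets tedious. A sketch of a wrapper, assuming one of the checksum queries above is saved as &lt;code&gt;orders-checksum.sql&lt;/code&gt; (an assumed file name); both function names are hypothetical:&lt;/p&gt;

```shell
# Hypothetical helper: run one checksum query file against a host and
# print the raw result row.
run_checksum() {
  psql --host="$1" --username=migrator --dbname=orders \
    --no-align --tuples-only --file=orders-checksum.sql
}

# Compare source and target. Empty output counts as a failure so that a
# broken connection on both sides is not reported as a match.
compare_checksums() {
  source_sum=$(run_checksum postgres-13.internal)
  target_sum=$(run_checksum postgres-16.internal)
  if [ -z "$source_sum" ]; then
    echo "no checksum returned from source"
    return 2
  fi
  if [ "$source_sum" = "$target_sum" ]; then
    echo "match: $source_sum"
  else
    echo "MISMATCH: source=$source_sum target=$target_sum"
    return 1
  fi
}
```

&lt;p&gt;Run it only while writes are paused. With live traffic on the source, the two sides can legitimately differ by whatever is still in flight.&lt;/p&gt;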

&lt;p&gt;The 60-second smoke test for the application: hit your top 5 read paths, hit your top 3 write paths, then run a query whose exact result you already know (a small reference table, an enum lookup). If all three pass, you're shipping.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gotcha. Replication lag during cutover
&lt;/h2&gt;

&lt;p&gt;Right when you cut over, replication lag tends to spike. The subscriber suddenly stops receiving the load it was streaming, the source is finishing in-flight transactions, and &lt;code&gt;pg_stat_subscription&lt;/code&gt; can briefly show non-zero &lt;code&gt;lag_bytes&lt;/code&gt; even after you've paused writes.&lt;/p&gt;

&lt;p&gt;If your application has a read replica pattern, keep reads going to the old primary's read replica for a 5-minute window after cutover. Writes go to the new cluster, reads to the old one. Once &lt;code&gt;lag_bytes&lt;/code&gt; has been zero for 60 seconds straight, flip reads over too.&lt;/p&gt;
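
&lt;p&gt;"Zero for 60 seconds straight" is easier to enforce with a loop than with a stopwatch. A minimal sketch: it takes the name of any command that prints the current lag in bytes (the probe is yours to supply) and returns once it has seen enough consecutive zero samples. The function name and the &lt;code&gt;LAG_POLL_SECONDS&lt;/code&gt; variable are assumptions for illustration:&lt;/p&gt;

```shell
# Hypothetical helper: return once a lag probe has printed 0 for N
# consecutive samples. Any non-zero sample resets the streak.
wait_for_stable_zero_lag() {
  probe="$1"            # command that prints lag in bytes
  needed="${2:-60}"     # consecutive zero samples required
  streak=0
  while [ "$streak" -lt "$needed" ]; do
    if [ "$($probe)" = "0" ]; then
      streak=$((streak + 1))
    else
      streak=0
    fi
    sleep "${LAG_POLL_SECONDS:-1}"
  done
  echo "lag has been zero for $needed consecutive samples"
}
```

&lt;p&gt;There is no timeout here on purpose, to keep the sketch short. In a real runbook you would cap the wait and page someone if the streak never completes.&lt;/p&gt;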

&lt;p&gt;This sounds like overkill. It isn't. The class of bug it prevents (a user submits a form, the write hits the new cluster, the immediate GET reads from a stale read replica) is exactly the kind of "ghost data" issue that gets logged as "intermittent UI bug" and lives in your backlog for months.&lt;/p&gt;

&lt;p&gt;Zero-downtime is doable. It is not effortless. The five-step playbook gets you there; the three traps and the checksum verification keep you out of the incident channel.&lt;/p&gt;

&lt;p&gt;What's the worst cutover you've shipped, and which of these traps bit you? Drop it in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  If this was useful
&lt;/h2&gt;

&lt;p&gt;This kind of migration sits in the gap between "knows Postgres" and "knows Postgres in production." If you want the longer treatment (picking the right replication tool for your topology, when to choose physical over logical, what happens to your stats and autovacuum after a major version jump), that's chapter 7 of the &lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;Database Playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/dp/B0GYLMVX9S" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktttopxmkwrt9qcazhnc.jpg" alt="Database Playbook: Choosing the Right Store for Every System You Build"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>replication</category>
    </item>
  </channel>
</rss>
