<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Philippe Gagneux</title>
    <description>The latest articles on Forem by Philippe Gagneux (@philippegagneux).</description>
    <link>https://forem.com/philippegagneux</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3790494%2F9b3b7c69-c90c-4bc6-9573-9c0558cb6c18.png</url>
      <title>Forem: Philippe Gagneux</title>
      <link>https://forem.com/philippegagneux</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/philippegagneux"/>
    <language>en</language>
    <item>
      <title>Your Retry Config is Wrong (And So Was Mine)</title>
      <dc:creator>Philippe Gagneux</dc:creator>
      <pubDate>Tue, 24 Feb 2026 23:50:56 +0000</pubDate>
      <link>https://forem.com/philippegagneux/your-retry-config-is-wrong-and-so-was-mine-3eg</link>
      <guid>https://forem.com/philippegagneux/your-retry-config-is-wrong-and-so-was-mine-3eg</guid>
      <description>&lt;p&gt;On May 12, 2022, DoorDash went down for over three hours. Not because a database failed – because a database got &lt;em&gt;slow&lt;/em&gt;. A routine latency spike in the order storage layer triggered retries. Those retries hit downstream services, which triggered their retries. Within minutes, what started as 50ms of added latency became a full retry storm: every service in the chain hammering every service below it, each one tripling the load on the next. The shared circuit breaker – designed to protect against exactly this – tripped and took out unrelated services that happened to share the same dependency. Three hours of downtime. All because every service had the same retry config: &lt;code&gt;retries: 3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;DoorDash isn't alone. In December 2024, OpenAI went down for over four hours when a telemetry deploy caused every node in their largest clusters to execute resource-intensive Kubernetes API operations simultaneously – a thundering herd that overwhelmed the control plane and locked engineers out of recovery tools. Cloudflare had a similar feedback loop in 2025 involving Let's Encrypt rate limiting and their own retry logic. A 2022 OSDI paper studying metastable failures found that retry policy was the sustaining effect in &lt;em&gt;half&lt;/em&gt; of the 22 incidents they analyzed.&lt;/p&gt;

&lt;p&gt;The root cause in every case is the same: uniform retry configuration across a service chain.&lt;/p&gt;

&lt;p&gt;I set out to find the optimal retry allocation. I found it, proved it mathematically, and then ran it on a real service chain. The math was right. The config was wrong. Here's what happened.&lt;/p&gt;

&lt;h2&gt;The multiplication problem&lt;/h2&gt;

&lt;p&gt;When I say "uniform retries are multiplicative, not additive," most engineers nod and move on. So let me be specific.&lt;/p&gt;

&lt;p&gt;You have 8 services. Each retries 3 times on failure. Your mental model says: if the leaf service fails, you get 8 × 3 = 24 extra requests. That's wrong.&lt;/p&gt;

&lt;p&gt;The actual number is 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 = 6,561.&lt;/p&gt;

&lt;p&gt;Each retry at layer N triggers a full cascade of retries through layers N+1 to 8. The gateway retries 3 times. Each of those hits auth, which retries 3 times. Each of &lt;em&gt;those&lt;/em&gt; hits orders, which retries 3 times. You're computing a product, not a sum.&lt;/p&gt;
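&lt;p&gt;A ten-line sketch makes the product concrete (a toy worst-case model, assuming the leaf is hard-down and every layer burns its full budget of &lt;code&gt;r&lt;/code&gt; attempts per incoming request):&lt;/p&gt;

```python
def attempts_per_layer(layers, r):
    """Requests arriving at each successive layer when the leaf always
    fails: every arriving request gets attempted r times downstream."""
    counts = []
    arriving = 1
    for _ in range(layers):
        arriving *= r          # each arrival fans out into r attempts
        counts.append(arriving)
    return counts

counts = attempts_per_layer(8, 3)
print(counts[-1])   # 6561 -- the leaf's worst case, 3^8
print(8 * 3)        # 24 -- the (wrong) additive intuition
```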

&lt;p&gt;Google's SRE team documented 64x amplification at just 3 layers deep. At Agoda, 8% of all production request volume during a slowdown was retry traffic. These aren't theoretical numbers – they're from production telemetry.&lt;/p&gt;

&lt;p&gt;At 16 services, the theoretical ceiling is 3^16 ≈ 43 million. Nobody hits that number because circuit breakers trip first. But "your circuit breakers save you by killing your own services" is not the safety story you think it is.&lt;/p&gt;

&lt;h2&gt;Why nobody questions this&lt;/h2&gt;

&lt;p&gt;Istio's default retry config is &lt;code&gt;attempts: 2, retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes"&lt;/code&gt;. Every VirtualService gets its own retry policy. Nothing in the Istio docs warns you about cross-service interaction.&lt;/p&gt;

&lt;p&gt;The Google SRE book talks about retry budgets in Chapter 22, but every example uses uniform values. The mental model it builds is per-service: "this service should retry N times." Not "this service's retries multiply against every other service's retries."&lt;/p&gt;

&lt;p&gt;Kubernetes and Istio docs show single-service retry config. Always. I've never seen an official example that shows a 5-service chain with &lt;code&gt;retries: 3&lt;/code&gt; on each one and a diagram of what happens when the leaf fails. The multiplicative explosion is invisible in docs because docs show one VirtualService at a time.&lt;/p&gt;

&lt;p&gt;And it works fine in staging. Your staging environment has 2-3 services. 3^3 = 27. That's noise. The bomb only detonates in production, where you have 8-20 services deep and real traffic to amplify.&lt;/p&gt;

&lt;h2&gt;The math says: concentrate retries&lt;/h2&gt;

&lt;p&gt;I built a cost model with six components – reliability, amplification, cascade timing, latency, resonance interference, circuit breaker saturation – and ran a constrained optimizer across chain lengths from 4 to 128 services.&lt;/p&gt;

&lt;p&gt;Three key principles fell out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Retry volume is a product, not a sum.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;V = r₁ × r₂ × ... × rₙ&lt;/p&gt;

&lt;p&gt;Each layer multiplies the worst case for every layer below it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Reliability has diminishing returns per layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a service succeeds 95% of the time, one retry gives you 99.75%. A second gives 99.9875%. A third gives 99.999%. Smaller gains, full multiplicative cost.&lt;/p&gt;
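&lt;p&gt;The arithmetic behind those numbers, as a quick check (a throwaway sketch; 0.05 is the per-call failure rate from the example, and &lt;code&gt;r&lt;/code&gt; counts total attempts):&lt;/p&gt;

```python
f = 0.05  # a service that succeeds 95% of the time

# r independent attempts fail only if all r of them fail
for r in range(1, 5):
    print(f"{r} attempt(s): {1 - f ** r:.6%} success")
```

Each extra attempt shaves off a factor-of-20 smaller slice of failure probability, while the chain-wide amplification cost of that attempt stays fully multiplicative.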

&lt;p&gt;&lt;strong&gt;3. Circuit breakers flip the equation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When retry volume exceeds the CB threshold (~5-10 consecutive errors), the breaker trips for &lt;em&gt;all&lt;/em&gt; requests. Retries past the CB threshold actively reduce reliability.&lt;/p&gt;

&lt;p&gt;The optimizer kept landing on the same answer: for any chain of 8+ services, concentrate all retries on exactly 2 services and set everything else to r=1. Total volume: 12x instead of 6,561x. A 99.8% reduction.&lt;/p&gt;

&lt;p&gt;The weird part: &lt;strong&gt;this allocation doesn't change when you add more services.&lt;/strong&gt; I tested 8, 16, 32, 64, 128 services. Same answer every time. The first few positions get retries, everything after that gets r=1. A 512-dimensional optimization problem collapses to a 3-dimensional one.&lt;/p&gt;

&lt;p&gt;I proved this analytically – the optimal vector "freezes" once the chain is long enough. Neat result. I was pretty pleased with myself.&lt;/p&gt;

&lt;p&gt;Then I ran it on a real service chain and everything fell apart.&lt;/p&gt;

&lt;h2&gt;The experiment that broke the theory&lt;/h2&gt;

&lt;p&gt;I deployed 8 services as Docker containers on a VPS. Real TCP connections, real DNS resolution, real resource contention (64MB memory, 0.25 CPU per container). I injected failures: service 5 at 10% failure rate, service 7 at 5%, the rest at 1%. Then I sent 500 concurrent requests and compared my mathematically optimal config against the uniform default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normal load (1-10% failure rates):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Uniform (r=3)&lt;/th&gt;
&lt;th&gt;"Optimal" (r=[1,4,1,3,1,1,1,1])&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;99.0%&lt;/td&gt;
&lt;td&gt;97.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-1.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total retries&lt;/td&gt;
&lt;td&gt;84&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+21%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 latency&lt;/td&gt;
&lt;td&gt;385ms&lt;/td&gt;
&lt;td&gt;455ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+70ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Stress (5-30% failure rates):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Uniform (r=3)&lt;/th&gt;
&lt;th&gt;"Optimal"&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;95.2%&lt;/td&gt;
&lt;td&gt;87.8%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-7.4%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total retries&lt;/td&gt;
&lt;td&gt;420&lt;/td&gt;
&lt;td&gt;487&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+16%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 latency&lt;/td&gt;
&lt;td&gt;476ms&lt;/td&gt;
&lt;td&gt;583ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+107ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "optimized" config was worse on every metric. More retries, not fewer. Lower success rate. Higher tail latency. Under stress, 7.4% more requests failed.&lt;/p&gt;

&lt;p&gt;My cost model was minimizing the wrong thing.&lt;/p&gt;

&lt;h2&gt;Where the math goes wrong&lt;/h2&gt;

&lt;p&gt;The per-service metrics told the story. Under stress, here's where the retries landed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;1&lt;/th&gt;
&lt;th&gt;2&lt;/th&gt;
&lt;th&gt;3&lt;/th&gt;
&lt;th&gt;4&lt;/th&gt;
&lt;th&gt;5&lt;/th&gt;
&lt;th&gt;6&lt;/th&gt;
&lt;th&gt;7&lt;/th&gt;
&lt;th&gt;8&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Uniform retries&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;169&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;111&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Optimal" retries&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;116&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;371&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The optimized config concentrated 76% of all retries at service 4. Service 4 sits upstream of service 5 (the 30%-failure bottleneck). Every time service 4 retries, it re-sends a request through services 5, 6, 7, and 8. That's 4 downstream hops per retry, through services that are already under stress.&lt;/p&gt;

&lt;p&gt;The analytical model minimizes the &lt;em&gt;product&lt;/em&gt; of retries (3^8 = 6,561 → 4×3 = 12). But in a real system, the &lt;em&gt;cost&lt;/em&gt; of a retry depends on where in the chain it happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A retry at position 2 re-traverses 6 downstream services.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A retry at position 7 re-traverses 1 downstream service.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The retry at position 2 is 6x more expensive than the retry at position 7 – but the product-based model treats them identically.&lt;/p&gt;

&lt;p&gt;The correct cost function isn't &lt;code&gt;Π rᵢ&lt;/code&gt;. It's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cost = Σᵢ rᵢ × (N - i) × fᵢ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;(N - i)&lt;/code&gt; is the number of downstream hops and &lt;code&gt;fᵢ&lt;/code&gt; is the failure rate at position &lt;code&gt;i&lt;/code&gt;. Each retry is priced by how much downstream work it creates.&lt;/p&gt;
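&lt;p&gt;As code, this is a direct transcription of the formula – positions are 1-indexed as in the text, and the &lt;code&gt;fail_rates&lt;/code&gt; values are whatever your telemetry reports per hop:&lt;/p&gt;

```python
def chain_cost(retries, fail_rates):
    """cost = sum over i of r_i * (N - i) * f_i, positions 1-indexed."""
    n = len(retries)
    return sum(r * (n - i) * f
               for i, (r, f) in enumerate(zip(retries, fail_rates), start=1))

# The hop weight (N - i) is the term the product model ignores:
n = 8
print(n - 2)  # 6 -- a retry at position 2 replays 6 downstream services
print(n - 7)  # 1 -- a retry at position 7 replays only 1
```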

&lt;h2&gt;What actually works&lt;/h2&gt;

&lt;p&gt;The core idea still holds: &lt;strong&gt;don't use uniform retries.&lt;/strong&gt; The 6,561x multiplication problem is real. But the fix isn't "concentrate retries early." It's simpler:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put your retries close to the failure, not upstream of it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If service 5 fails 10% of the time, give service 5 a higher retry count – or the service &lt;em&gt;immediately&lt;/em&gt; upstream of it (service 4). Don't give service 2 four retries when each retry traverses 6 hops through the failure zone.&lt;/p&gt;

&lt;p&gt;The practical retry allocation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Identify your highest-failure-rate services.&lt;/strong&gt; Look at &lt;code&gt;rate(istio_requests_total{response_code=~"5.."}[5m])&lt;/code&gt; per service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Give the retry budget to their immediate neighbors.&lt;/strong&gt; The service directly upstream of a failure hotspot should get r=2 or 3. The hotspot itself keeps r=1 – it's the one failing, and retrying into it from 6 hops away only makes things worse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Everything far from the failure: r=1.&lt;/strong&gt; Your gateway, your auth service, your API middleware – if they're not adjacent to a failure hotspot, they get r=1. Period.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Never exceed a total product of ~20.&lt;/strong&gt; Multiply all your retry values along the chain. If the product exceeds 20, you're past the circuit breaker saturation point and additional retries are pure cost.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
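&lt;p&gt;A toy brute force shows the same shape falling out of first principles. This is a sketch under stated assumptions – a 4-service chain, made-up per-hop failure rates, and a 99.9% end-to-end target, none of which come from the experiment – minimizing worst-case retry volume (the product) subject to hitting the target:&lt;/p&gt;

```python
from itertools import product

fail = [0.01, 0.10, 0.01, 0.05]  # illustrative per-hop failure rates
TARGET = 0.999                   # desired end-to-end success rate

def end_to_end(alloc):
    """Chain succeeds only if every hop succeeds within its attempts."""
    s = 1.0
    for f, r in zip(fail, alloc):
        s *= 1.0 - f ** r
    return s

def volume(alloc):
    """Worst-case amplification is the product of the attempts."""
    v = 1
    for r in alloc:
        v *= r
    return v

feasible = [a for a in product(range(1, 5), repeat=len(fail))
            if end_to_end(a) >= TARGET]
best = min(feasible, key=volume)
print(best, volume(best))  # (2, 4, 2, 3) 48
```

The cheapest allocation hands the big budgets to the flaky hops (the 10% and 5% services) – exactly the "retries go where the failures are" rule, derived rather than asserted.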

&lt;p&gt;The Istio YAML for a service adjacent to a hotspot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-processor&lt;/span&gt;  &lt;span class="c1"&gt;# directly upstream of the flaky payment service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;order-processor&lt;/span&gt;
  &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-processor&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
        &lt;span class="na"&gt;perTryTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500ms&lt;/span&gt;
        &lt;span class="na"&gt;retryOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5xx,reset,connect-failure,retriable-status-codes"&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;attempts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;perTryTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500ms&lt;/span&gt;
        &lt;span class="na"&gt;retryOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5xx,connect-failure"&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One thing that bit me: &lt;code&gt;attempts&lt;/code&gt; counts retries, not total tries, and &lt;code&gt;perTryTimeout&lt;/code&gt; applies to every attempt, including the first. If you set &lt;code&gt;attempts: 3&lt;/code&gt; and &lt;code&gt;perTryTimeout: 500ms&lt;/code&gt;, you're saying "the initial request plus up to 3 retries, each with 500ms." The outer &lt;code&gt;timeout&lt;/code&gt; is the total wall clock for all attempts. Set it to at least &lt;code&gt;perTryTimeout × (attempts + 1)&lt;/code&gt; if you want every retry a chance to run.&lt;/p&gt;

&lt;p&gt;Also: Istio retries stack with application-level retries. If your Go service has a retry loop in the HTTP client AND the VirtualService has retries configured, you're multiplying again. Audit both. &lt;code&gt;kubectl get virtualservice -A -o yaml | grep -A5 retries&lt;/code&gt; gives you the mesh-level view. For app-level, search for retry libraries (go-retryablehttp, resilience4j, polly, tenacity).&lt;/p&gt;

&lt;h2&gt;The timeout tradeoff the model found&lt;/h2&gt;

&lt;p&gt;One finding from the cost model that I didn't expect.&lt;/p&gt;

&lt;p&gt;Classical SRE wisdom says: gateway timeout must be greater than downstream timeout × retries. If your downstream has 500ms timeout and 4 retries, gateway needs at least 2.0s. Makes sense – don't give up while downstream retries are still running.&lt;/p&gt;

&lt;p&gt;The cost model's optimal gateway timeout is 1.4s – &lt;em&gt;below&lt;/em&gt; the cascade-consistent minimum of 2.0s.&lt;/p&gt;

&lt;p&gt;Why? Two reasons. First, the model penalizes synchronized timeouts across services (they create correlated retry bursts). A 1.4s gateway timeout breaks the synchronization with the 0.5s downstream timeouts. Second, the 600ms saved per request reduces worst-case latency, and in the model, the latency reduction outweighs the cascade penalty from occasionally timing out before downstream retries complete.&lt;/p&gt;

&lt;p&gt;I haven't load-tested this specific finding – the Docker experiment compared retry allocations, not timeout values. But the engineering logic is sound: under saturation, shorter gateway timeouts drop failing requests faster, freeing connections and reducing queue depth.&lt;/p&gt;

&lt;p&gt;The practical version: keep cascade-consistent timeouts as default. But consider an adaptive threshold – when your 5xx rate crosses 10%, tighten the gateway timeout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RATE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; istio-system deploy/prometheus-server &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  promtool query instant &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'sum(rate(istio_requests_total{response_code=~"5.."}[1m])) / sum(rate(istio_requests_total[1m]))'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RATE&lt;/span&gt;&lt;span class="s2"&gt; &amp;gt; 0.10"&lt;/span&gt; | bc &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;kubectl patch virtualservice api-gateway &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"spec":{"http":[{"timeout":"1400ms"}]}}'&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not pretty. But it's better than holding failing requests until your gateway OOMs.&lt;/p&gt;

&lt;h2&gt;What to do Monday morning&lt;/h2&gt;

&lt;p&gt;Run this and look at the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get virtualservice &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-B10&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"retries:"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multiply all the &lt;code&gt;attempts&lt;/code&gt; values along your longest call chain. If the product is over 50, you have a retry bomb. If it's over 200, you're one partial outage away from a DoorDash-style cascade.&lt;/p&gt;
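&lt;p&gt;If you'd rather not do the multiplication by hand, the audit is three lines (the &lt;code&gt;attempts&lt;/code&gt; values below are placeholders – substitute what the grep above prints for your own longest chain):&lt;/p&gt;

```python
attempts = [2, 3, 2, 3, 2, 2, 3, 2]  # placeholder per-hop attempts values

bomb = 1
for a in attempts:
    bomb *= a            # worst-case amplification along the chain

print(bomb)              # 864 for these placeholder values
if bomb > 200:
    print("one partial outage away from a cascade")
elif bomb > 50:
    print("retry bomb")
```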

&lt;p&gt;Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find your highest-error-rate services. &lt;code&gt;rate(istio_requests_total{response_code=~"5.."}[5m])&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Give their immediate upstream neighbor &lt;code&gt;attempts: 2&lt;/code&gt; or &lt;code&gt;3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set everything else to &lt;code&gt;attempts: 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Keep the total product under 20.&lt;/li&gt;
&lt;li&gt;Deploy to a canary, watch retry volume drop. Compare end-to-end success rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No new infrastructure. No service mesh upgrade. A config change that takes 20 minutes.&lt;/p&gt;

&lt;p&gt;The uniform retry config is wrong. My "optimal" config was also wrong. The actual answer is simpler than both: &lt;strong&gt;retries cost more the further they are from the failure. Put them close.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Methodology: Six-component cost model (reliability, amplification, cascade timing, latency, resonance, circuit breaker saturation). The freezing result is proven for feedforward chains with independent failures. Docker experiment: 8 Node.js containers, real TCP, 20 concurrent requests, Prometheus per service. Single run – take the exact percentages with a grain of salt, but the direction is consistent across configs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
