<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Akshat Sinha</title>
    <description>The latest articles on Forem by Akshat Sinha (@pingtoprod).</description>
    <link>https://forem.com/pingtoprod</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3906571%2F6ceb4b97-48d7-48ac-94d5-88ee21b67386.png</url>
      <title>Forem: Akshat Sinha</title>
      <link>https://forem.com/pingtoprod</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pingtoprod"/>
    <language>en</language>
    <item>
      <title>When CoreDNS Falls Silent: A Kubernetes DNS Disaster Story &amp; The Playbook That Saved Us</title>
      <dc:creator>Akshat Sinha</dc:creator>
      <pubDate>Tue, 12 May 2026 10:08:31 +0000</pubDate>
      <link>https://forem.com/pingtoprod/when-coredns-falls-silent-a-kubernetes-dns-disaster-story-the-playbook-that-saved-us-5dek</link>
      <guid>https://forem.com/pingtoprod/when-coredns-falls-silent-a-kubernetes-dns-disaster-story-the-playbook-that-saved-us-5dek</guid>
      <description>&lt;p&gt;&lt;em&gt;A real-world incident narrative + definitive best practices for CoreDNS at scale&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Prologue: The Calm Before the Storm
&lt;/h2&gt;

&lt;p&gt;The cluster was healthy. 312 pods spread across 24 nodes. CoreDNS: two replicas, default settings, humming along since the cluster was provisioned eighteen months ago. Nobody had touched it. Nobody &lt;em&gt;needed&lt;/em&gt; to touch it.&lt;/p&gt;

&lt;p&gt;Until the Wednesday nobody expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 1. The Incident: "Why Is Payment Timing Out?"
&lt;/h2&gt;

&lt;p&gt;It started with a Slack ping at 11:42 PM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;@oncall-alert&lt;/strong&gt; &lt;code&gt;[CRITICAL]&lt;/code&gt; Payment service unreachable; circuit breaker open on checkout-gateway&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I SSH'd into the jump box. First instinct: &lt;code&gt;kubectl get pods&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods -n production | grep payment
payment-svc-8d4f6b7c-x2k9m   1/1     Running   0          45d
order-processor-6c8d9f4-x7q2w 1/1     Running   0          45d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All pods running. All healthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get svc -n production
NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)
payment-service    ClusterIP   10.102.144.200   &amp;lt;none&amp;gt;        8080/TCP
order-processor    ClusterIP   10.102.145.88    &amp;lt;none&amp;gt;        8080/TCP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services exist. IPs assigned. Then I did the obvious:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl exec -it payment-svc-8d4f6b7c-x2k9m -n production -- curl -s http://10.102.145.88:8080/health
{"status":"ok"}   

$ kubectl exec -it payment-svc-8d4f6b7c-x2k9m -n production -- nslookup order-processor.production.svc.cluster.local
;; connection timed out; no servers could be reached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;IP works. DNS doesn't.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The entire payment pipeline was dead not because services were down, but because pods couldn't &lt;em&gt;find&lt;/em&gt; each other by name. Every microservice call that relied on DNS resolution was failing. Idempotency queues were backing up. Retry storms were starting. It was 15 minutes before we declared it an SEV-2.&lt;/p&gt;




&lt;h2&gt;
  
  
  Chapter 2. The Interrogation: What Is CoreDNS Doing?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Logging Into CoreDNS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The errors were telling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] plugin/kubernetes: Get "https://10.96.0.1:443/api/v1/namespaces/production/services/order-processor":
context deadline exceeded (Client.Timeout exceeded while awaiting headers)

[ERROR] plugin/forward:
2 errors occurred:
        * read udp 10.244.1.8:53291-&amp;gt;8.8.8.8:53: i/o timeout
        * read udp 10.244.1.8:38472-&amp;gt;8.8.4.4:53: i/o timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Two problems screaming simultaneously:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;CoreDNS couldn't talk to the Kubernetes API server fast enough (internal lookups failing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CoreDNS couldn't reach upstream DNS (external lookups timing out)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
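
&lt;p&gt;To confirm both legs independently, two quick checks helped (a sketch; the pod name is from this incident and the &lt;code&gt;busybox&lt;/code&gt; debug image is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Upstream reachability, tested from inside the CoreDNS pod's network namespace
# (ephemeral debug containers share the pod's network namespace)
kubectl debug -it -n kube-system coredns-5d78c9fd5d-4kx2m --image=busybox -- \
  nslookup google.com 8.8.8.8

# API server health from the jump box, to rule out a server-side problem
kubectl get --raw='/readyz?verbose' | tail -5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;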

&lt;h3&gt;
  
  
  2.2 Checking CoreDNS Health
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9fd5d-4kx2m   1/1     Running   0          182d
coredns-5d78c9fd5d-9vr3j   1/1     Running   0          182d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pods were "Running." But running doesn't mean performing.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Measuring the Damage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From a pod  external DNS resolution&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time &lt;/span&gt;nslookup google.com
Server:         10.96.0.10
Address:        10.96.0.10#53

&lt;span class="p"&gt;;;&lt;/span&gt; connection timed out&lt;span class="p"&gt;;&lt;/span&gt; no servers could be reached
&lt;span class="p"&gt;;;&lt;/span&gt; → Total: 5 seconds of waiting, &lt;span class="k"&gt;then &lt;/span&gt;failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Internal resolution  same story&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time &lt;/span&gt;nslookup order-processor.production.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10#53

&lt;span class="p"&gt;;;&lt;/span&gt; connection timed out&lt;span class="p"&gt;;&lt;/span&gt; no servers could be reached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Normal DNS resolution should take &lt;strong&gt;1–5 milliseconds&lt;/strong&gt;. We were at &lt;strong&gt;5 seconds (timeout)&lt;/strong&gt; or complete failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 CPU Throttling: The Hidden Killer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns
NAME                       CPU&lt;span class="o"&gt;(&lt;/span&gt;cores&lt;span class="o"&gt;)&lt;/span&gt;   MEMORY&lt;span class="o"&gt;(&lt;/span&gt;bytes&lt;span class="o"&gt;)&lt;/span&gt;
coredns-5d78c9fd5d-4kx2m   97m          168Mi
coredns-5d78c9fd5d-9vr3j   95m          172Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get deployment coredns &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.template.spec.containers[0].resources}'&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"limits"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"100m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"170Mi"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="s2"&gt;"requests"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"cpu"&lt;/span&gt;:&lt;span class="s2"&gt;"75m"&lt;/span&gt;,&lt;span class="s2"&gt;"memory"&lt;/span&gt;:&lt;span class="s2"&gt;"70Mi"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;97m usage against 100m limit.&lt;/strong&gt; Three millicores of headroom for a Go binary handling hundreds of queries per second. CoreDNS was CPU-throttled nearly continuously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Confirmed: throttling counters through the roof&lt;/span&gt;
kubectl debug &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system coredns-5d78c9fd5d-4kx2m &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;coredns &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/cpu.stat
nr_throttled 14832
throttled_usec 294812005  &lt;span class="c"&gt;# ≈ 4.9 minutes of cumulative throttled time (cgroup v2, microseconds)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Go scheduler was starving. Goroutines queued, DNS queries backed up, timeouts cascaded.&lt;/p&gt;
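
&lt;p&gt;A rough way to quantify it is to sample the throttling counters twice and look at the delta (a sketch, assuming cgroup v2, where &lt;code&gt;throttled_usec&lt;/code&gt; is cumulative microseconds):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sample cpu.stat twice, 60s apart, from a debug container targeting coredns
kubectl debug -it -n kube-system coredns-5d78c9fd5d-4kx2m --image=busybox --target=coredns -- \
  sh -c 'grep -E "nr_throttled|throttled_usec" /sys/fs/cgroup/cpu.stat; sleep 60; \
         grep -E "nr_throttled|throttled_usec" /sys/fs/cgroup/cpu.stat'
# (delta of throttled_usec / 60,000,000) x 100 = percentage of the window spent throttled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;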

&lt;h3&gt;
  
  
  2.5 The &lt;code&gt;ndots&lt;/code&gt; Multiplier
&lt;/h3&gt;

&lt;p&gt;Let's talk about the silent multiplier. Every pod in Kubernetes has a default DNS config inherited from the kubelet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dnsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ndots&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;
  &lt;span class="na"&gt;searches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;default.svc.cluster.local&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;svc.cluster.local&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cluster.local&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;eu-west-1.compute.internal&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ndots: 5&lt;/code&gt; is a &lt;em&gt;threshold&lt;/em&gt;, not a queue. Here's how it actually works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If a hostname has &lt;strong&gt;fewer than N dots&lt;/strong&gt;, the resolver &lt;strong&gt;prepends search domains first&lt;/strong&gt;, then tries the name as-is only if none of those succeed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a hostname has &lt;strong&gt;N dots or more&lt;/strong&gt;, the resolver &lt;strong&gt;tries it as an absolute name first&lt;/strong&gt;, then falls through to search domains if it fails.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So with &lt;code&gt;ndots: 5&lt;/code&gt;, when our application calls &lt;code&gt;api.stripe.com&lt;/code&gt; (which has 2 dots, fewer than 5):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Attempt&lt;/th&gt;
&lt;th&gt;Query Sent&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com.default.svc.cluster.local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NXDOMAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com.svc.cluster.local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NXDOMAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com.cluster.local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NXDOMAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com.eu-west-1.compute.internal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;NXDOMAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resolved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;5 queries for 1 hostname.&lt;/strong&gt; With 312 pods × 8 external calls per startup × 5 lookup attempts each = &lt;strong&gt;12,480 DNS queries&lt;/strong&gt; hitting CoreDNS. With &lt;code&gt;ndots: 2&lt;/code&gt;, that's &lt;strong&gt;2,496 queries&lt;/strong&gt;: a &lt;strong&gt;5× amplification&lt;/strong&gt;, 80% of that DNS load, caused by one setting.&lt;/p&gt;
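
&lt;p&gt;You can watch the amplification happen from any pod (a sketch; assumes the image ships &lt;code&gt;nslookup&lt;/code&gt;, otherwise use an ephemeral debug container):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The search list and ndots value that drive the expansion
kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- cat /etc/resolv.conf
# Expect something like:
#   search production.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
#   nameserver 10.96.0.10
#   options ndots:5

# A trailing dot makes the name absolute and skips the search list entirely,
# a handy per-call workaround while ndots is still 5
kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- nslookup api.stripe.com.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;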

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt; &lt;code&gt;ndots: 2&lt;/code&gt; &lt;strong&gt;is the right production value:&lt;/strong&gt; the short internal names applications actually use, like &lt;code&gt;order-processor&lt;/code&gt; (0 dots) or &lt;code&gt;payment-service.production&lt;/code&gt; (1 dot), stay below the threshold, so they still get the search domains prepended and keep resolving inside the cluster as before. Names with 2 or more dots are tried as absolute first, which is correct for FQDNs: external hostnames like &lt;code&gt;api.stripe.com&lt;/code&gt; (2 dots) resolve directly, with no search-domain penalty.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Chapter 3. The Diagnosis: Five Problems At Once
&lt;/h2&gt;

&lt;p&gt;We'd found the killers. Not one, but five compounding failures:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Here's how these five problems compounded each other:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    COREDNS FAILURE CHAIN                    │
│                                                             │
│  1. ndots: 5         → 5x query amplification on externals  │
│         ↓                                                   │
│  2. No node-local     → Every query traverses the network   │
│     cache               to reach CoreDNS pods               │
│         ↓                                                   │
│  3. Resources too     → 97m/100m CPU = constant throttling  │
│     tight               → goroutines back up                │
│         ↓                                                   │
│  4. Cache exhausted   → Under load, cache evicts entries,   │
│     under memory        misses spike, more upstream queries │
│         ↓                                                   │
│  5. Static replica    → No autoscaling, no PDB              │
│     count               → Single point of failure           │
│                                                             │
│  Result: 5s timeouts → Circuit breakers trip → OUTAGE       │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 4. The Fix: Step by Step
&lt;/h2&gt;

&lt;p&gt;We applied fixes in strict order; each one built on the previous. Changing five things simultaneously is how you create a new mystery.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: Set &lt;code&gt;ndots: 2&lt;/code&gt; for an 80% Load Reduction in 5 Minutes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The mechanism, stated precisely:&lt;/strong&gt; With &lt;code&gt;ndots: 2&lt;/code&gt;, any name with 2 or more dots is tried as an absolute name first (before search domains are appended). Names with fewer than 2 dots still use the search list first. In practice, this means:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Dots&lt;/th&gt;
&lt;th&gt;Behavior under &lt;code&gt;ndots: 2&lt;/code&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order-processor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Search domains prepended first: &lt;code&gt;order-processor.default.svc.cluster.local&lt;/code&gt; → resolves&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order-processor.prod&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Search domains prepended first; resolves via &lt;code&gt;order-processor.prod.svc.cluster.local&lt;/code&gt; (the &lt;code&gt;svc.cluster.local&lt;/code&gt; search entry)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order-processor.production.svc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tried as absolute first (2 ≥ 2); fails upstream, then resolves via the search list (+ &lt;code&gt;cluster.local&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;api.stripe.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tried as absolute first (2 ≥ 2) → resolves immediately on direct lookup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;api.example.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tried as absolute first → resolves directly, no search domain penalty&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pod-level configuration&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dnsPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterFirst&lt;/span&gt;
  &lt;span class="na"&gt;dnsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ndots&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To enforce &lt;code&gt;ndots: 2&lt;/code&gt; cluster-wide without editing every single Deployment YAML, use a mutating admission webhook (for example Kyverno or OPA Gatekeeper) to inject the &lt;code&gt;dnsConfig&lt;/code&gt; block into pods as they are created.&lt;/p&gt;
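
&lt;p&gt;A minimal sketch of such a policy, assuming Kyverno is installed (the policy name and the &lt;code&gt;kube-system&lt;/code&gt; exclusion are illustrative; adapt the match/exclude rules to your cluster):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Inject ndots: 2 into every new Pod outside kube-system via a Kyverno mutate rule
kubectl apply -f - &amp;lt;&amp;lt;'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: set-ndots-2
spec:
  rules:
    - name: inject-dns-options
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      mutate:
        patchStrategicMerge:
          spec:
            dnsConfig:
              options:
                - name: ndots
                  value: "2"
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the webhook only mutates pods at creation time, existing pods keep &lt;code&gt;ndots: 5&lt;/code&gt; until they are recreated by a rolling restart.&lt;/p&gt;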

&lt;p&gt;&lt;strong&gt;Impact measurement:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: 312 pods × 8 calls × 5 retries = 12,480 external queries/min
After:  312 pods × 8 calls × 1 direct   = 2,496 external queries/min

Reduction: 80%  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 2: Deploy NodeLocal DNSCache, the Game Changer
&lt;/h3&gt;

&lt;p&gt;Before you deploy the manifest, understand what changes architecturally:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A lightweight CoreDNS cache instance running as a DaemonSet on &lt;strong&gt;every node&lt;/strong&gt;, bound to the link-local IP &lt;code&gt;169.254.20.10&lt;/code&gt;. Pods resolve DNS locally instead of sending queries across the cluster to CoreDNS pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture shift:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEFORE (today):
┌─────┐    ┌──────────────────────────────────────┐
│ Pod A│───▶│ kube-dns Service (ClusterIP: 10.96.0.10) │
└─────┘    └───────────┬──────────────────────────┘
                       │
           ┌───────────▼───────────┐
           │ CoreDNS Pod (Node 3)  │──▶ Upstream (8.8.8.8)
           │ CoreDNS Pod (Node 7)  │──▶ Upstream (8.8.4.4)
           └───────────────────────┘
           ↑
     Network hop on EVERY query
     Cross-node traffic
     CoreDNS pods are the bottleneck

AFTER (with NodeLocal DNSCache):
┌─────┐    ┌──────────────────────┐
│ Pod A│───▶│ Node Cache (169.254.20.10) │───▶ Cache hit → instant  
└─────┘    │    (same node)       │
           └───────────┬──────────┘
                       │ (only on cache miss)
           ┌───────────▼───────────┐
           │ CoreDNS Pod (any node)│──▶ Upstream (8.8.8.8)
           └───────────────────────┘
           No cross-node traffic for cached queries
           CoreDNS load drops ~90%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; For a large cluster, the majority of DNS queries are for services that haven't changed recently and can be cached. By keeping the cache local, you eliminate the network round-trip &lt;em&gt;and&lt;/em&gt; reduce CoreDNS load simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployment steps:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2a: Test on a subset of nodes first (rollback safety)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before deploying cluster-wide, validate on a small canary node group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Can beary subset  deploy to specific nodes first&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DaemonSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodelocaldns&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;system-node-critical&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodelocaldns&lt;/span&gt;
      &lt;span class="na"&gt;hostNetwork&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;dnsPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Default&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
      &lt;span class="c1"&gt;# RESTRICT TO CANARY NODES FIRST:&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker-canary&lt;/span&gt;   &lt;span class="c1"&gt;# Label only your test nodes&lt;/span&gt;
      &lt;span class="c1"&gt;# Once validated, remove nodeSelector for full deployment&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-cache&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/dns/k8s-dns-node-cache:1.23.0&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-localip=169.254.20.10"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-conf=/etc/Corefile/Corefile"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-upstreamsvc=kube-dns-upstream"&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5Mi&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;50m&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15Mi&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dns-udp&lt;/span&gt;
            &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dns-tcp&lt;/span&gt;
            &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9259&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metrics&lt;/span&gt;
            &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
            &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/Corefile&lt;/span&gt;
            &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xtables-lock&lt;/span&gt;
            &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/run/xtables.lock&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xtables-lock&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/run/xtables.lock&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileOrCreate&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
          &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodelocaldns&lt;/span&gt;
            &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Corefile&lt;/span&gt;
                &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Corefile&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nodelocaldns&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Corefile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;cluster.local {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;cache {&lt;/span&gt;
            &lt;span class="s"&gt;success 99840 30   # Positive cache: up to ~28 hours&lt;/span&gt;
            &lt;span class="s"&gt;denial 30          # NXDOMAIN cache: 30 seconds&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;
        &lt;span class="s"&gt;reload&lt;/span&gt;
        &lt;span class="s"&gt;forward . __PILLAR__CLUSTER__DNS__&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;in-addr.arpa {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;cache 30&lt;/span&gt;
        &lt;span class="s"&gt;reload&lt;/span&gt;
        &lt;span class="s"&gt;forward . __PILLAR__CLUSTER__DNS__&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;ip6.arpa {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;cache 30&lt;/span&gt;
        &lt;span class="s"&gt;reload&lt;/span&gt;
        &lt;span class="s"&gt;forward . __PILLAR__CLUSTER__DNS__&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
    &lt;span class="s"&gt;. {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;cache 30&lt;/span&gt;
        &lt;span class="s"&gt;reload&lt;/span&gt;
        &lt;span class="s"&gt;forward . __PILLAR__UPSTREAM__SERVERS__&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The upstream &lt;code&gt;nodelocaldns.yaml&lt;/code&gt; manifest ships with &lt;code&gt;__PILLAR__&lt;/code&gt; placeholders that the documented install steps substitute before applying (managed installers, such as kube-up with &lt;code&gt;KUBE_ENABLE_NODELOCAL_DNS=true&lt;/code&gt;, do this for you). For a manual deployment like the one above, replace &lt;code&gt;__PILLAR__CLUSTER__DNS__&lt;/code&gt; with the kube-dns ClusterIP (e.g., &lt;code&gt;10.96.0.10&lt;/code&gt;) and &lt;code&gt;__PILLAR__UPSTREAM__SERVERS__&lt;/code&gt; with upstream resolvers (e.g., &lt;code&gt;8.8.8.8 8.8.4.4&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;
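
&lt;p&gt;A sketch of the manual substitution, assuming the manifest above is saved as &lt;code&gt;nodelocaldns.yaml&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Fill in the pillar placeholders with this cluster's values, then apply
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
sed -e "s/__PILLAR__CLUSTER__DNS__/${kubedns}/g" \
    -e "s/__PILLAR__UPSTREAM__SERVERS__/8.8.8.8 8.8.4.4/g" \
    nodelocaldns.yaml | kubectl apply -f -
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;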

&lt;p&gt;&lt;strong&gt;Rollback procedure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If issues arise on canary nodes:&lt;/span&gt;
&lt;span class="c"&gt;# 1. Remove the DaemonSet  pods revert to using kube-dns service immediately&lt;/span&gt;
kubectl delete daemonset nodelocaldns &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system

&lt;span class="c"&gt;# 2. Verify DNS is working again through the normal path&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &amp;lt;pod&amp;gt; &lt;span class="nt"&gt;--&lt;/span&gt; nslookup order-processor.production.svc.cluster.local
&lt;span class="c"&gt;# Should resolve through kube-dns service again&lt;/span&gt;

&lt;span class="c"&gt;# 3. Investigate and fix before re-deploying&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If NodeLocal DNSCache intercepts all DNS traffic and the upstream is misconfigured, pods will experience resolution failures. The &lt;code&gt;nodeSelector&lt;/code&gt; approach above lets you validate on a subset before cluster-wide rollout.&lt;/p&gt;
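
&lt;p&gt;A quick way to validate a canary node before widening the rollout (a sketch; the pod is assumed to be scheduled on a labeled canary node and to ship &lt;code&gt;nslookup&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Query the node-local cache directly on its link-local address
kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- \
  nslookup kubernetes.default.svc.cluster.local 169.254.20.10

# Repeat an external lookup: the second answer should come straight from the local cache
kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- nslookup api.stripe.com. 169.254.20.10
kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- nslookup api.stripe.com. 169.254.20.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;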




&lt;h3&gt;
  
  
  Step 3: Tune the Corefile to Make CoreDNS Efficient
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Corefile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;.:53 {&lt;/span&gt;
        &lt;span class="s"&gt;errors&lt;/span&gt;
        &lt;span class="s"&gt;health {&lt;/span&gt;
            &lt;span class="s"&gt;lameduck 5s&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;
        &lt;span class="s"&gt;ready&lt;/span&gt;

        &lt;span class="s"&gt;# Aggressive caching layer&lt;/span&gt;
        &lt;span class="s"&gt;cache {&lt;/span&gt;
            &lt;span class="s"&gt;success 99840 30         # Successful responses cached for ~28 hours&lt;/span&gt;
            &lt;span class="s"&gt;denial 60                # NXDOMAIN cached for 60s (prevents repeated failed lookups)&lt;/span&gt;
            &lt;span class="s"&gt;prefetch 120 1200 4 25   # Proactively refresh popular entries at 25% TTL remaining&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;# Kubernetes service discovery  hardened&lt;/span&gt;
        &lt;span class="s"&gt;kubernetes cluster.local {&lt;/span&gt;
            &lt;span class="s"&gt;pods verified             # Don't resolve pods that aren't in Running state&lt;/span&gt;
            &lt;span class="s"&gt;fallthrough in-addr.arpa ip6.arpa&lt;/span&gt;
            &lt;span class="s"&gt;ttl 30                     # Lower TTL for faster cluster change propagation&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;# External upstream  resilient configuration&lt;/span&gt;
        &lt;span class="s"&gt;forward . 8.8.8.8 8.8.4.4 1.1.1.1 {&lt;/span&gt;
            &lt;span class="s"&gt;max_concurrent 1000       # Prevent upstream saturation&lt;/span&gt;
            &lt;span class="s"&gt;prefer_tcp                # TCP handles retries and large responses reliably&lt;/span&gt;
            &lt;span class="s"&gt;health_check 30s          # Detect upstream failures quickly&lt;/span&gt;
            &lt;span class="s"&gt;policy random             # Distribute load across upstreams&lt;/span&gt;
            &lt;span class="s"&gt;expire 10s                # Retry interval for failed upstreams&lt;/span&gt;
            &lt;span class="s"&gt;serve_tcp                 # Support both protocols&lt;/span&gt;
            &lt;span class="s"&gt;serve_udp&lt;/span&gt;
        &lt;span class="s"&gt;}&lt;/span&gt;

        &lt;span class="s"&gt;# Allow runtime Corefile reload without pod restart&lt;/span&gt;
        &lt;span class="s"&gt;reload&lt;/span&gt;

        &lt;span class="s"&gt;# Metrics for monitoring&lt;/span&gt;
        &lt;span class="s"&gt;prometheus :9153&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key directives explained:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Directive&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_concurrent 1000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevents a single upstream from being overwhelmed during bursts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Proactively refreshes popular entries before TTL expires, preventing cache stampedes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pods verified&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only answers pod IP queries backed by an existing pod object, preventing stale records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fallthrough&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Passes reverse-lookup zones (&lt;code&gt;in-addr.arpa&lt;/code&gt;, &lt;code&gt;ip6.arpa&lt;/code&gt;) the kubernetes plugin can't answer on to the next plugin instead of returning NXDOMAIN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;force_tcp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TCP handles large responses and retries better than UDP, reducing silent packet loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;denial 9984 60&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Negative caching for 60s stops repeated NXDOMAIN lookups for non-existent names&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
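
&lt;p&gt;Because the &lt;code&gt;reload&lt;/code&gt; plugin is enabled, the tuned Corefile can be rolled out without restarting the pods (a sketch; the local filename for the ConfigMap above is assumed):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply -f coredns-configmap.yaml
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20 -f
# Expect a reload message within a minute or so; on a Corefile syntax error,
# CoreDNS logs the failure and keeps serving the previous configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;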




&lt;h3&gt;
  
  
  Step 4: Right-Size Resources and Stop Starving the Go Runtime
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Forget the default&lt;/strong&gt; &lt;code&gt;100m&lt;/code&gt; &lt;strong&gt;CPU limit.&lt;/strong&gt; CoreDNS is a Go binary with concurrent goroutines servicing all cluster DNS. The official CoreDNS scaling benchmarks show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single CoreDNS replica on 2 vCPU node:
  - Internal queries: 33,669 QPS (2.6ms latency)
  - External queries:  6,733 QPS (12ms latency, client perspective)
  - At this load, both vCPUs were pegged at ~1900m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Memory formula&lt;/strong&gt; (from &lt;a href="https://github.com/coredns/deployment/tree/master/kubernetes/Scaling_CoreDNS.md" rel="noopener noreferrer"&gt;CoreDNS Scaling Guide&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory (default settings) = (Pods + Services) / 1000 + 54 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster Scale&lt;/th&gt;
&lt;th&gt;Pods + Services&lt;/th&gt;
&lt;th&gt;Memory Needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (&amp;lt; 50 pods)&lt;/td&gt;
&lt;td&gt;~60&lt;/td&gt;
&lt;td&gt;~54 Mi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium (500 pods)&lt;/td&gt;
&lt;td&gt;~600&lt;/td&gt;
&lt;td&gt;~55 Mi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large (5000 pods)&lt;/td&gt;
&lt;td&gt;~6,000&lt;/td&gt;
&lt;td&gt;~60 Mi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XLarge (150K pods)&lt;/td&gt;
&lt;td&gt;~158,000&lt;/td&gt;
&lt;td&gt;~212 Mi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
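
&lt;p&gt;Plugging in this cluster's numbers (312 pods; roughly 40 Services is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# (312 + ~40) / 1000 + 54 ≈ 54 MB baseline, so the 150Mi request below
# leaves ample headroom for the enlarged cache and query bursts
echo $(( (312 + 40) / 1000 + 54 ))    # prints 54
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;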



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Resource requests/limits  no more starving the Go runtime&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200m"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;150Mi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250Mi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this change in our cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before: throttled_time = 294 seconds/minute (97% of time throttled)
After:  throttled_time = 0 seconds/minute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 5: Deploy the Cluster Proportional Autoscaler (CPA)
&lt;/h3&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; regular HPA. The &lt;a href="https://github.com/kubernetes-sigs/cluster-proportional-autoscaler" rel="noopener noreferrer"&gt;Cluster Proportional Autoscaler&lt;/a&gt; is specifically designed for infrastructure add-ons like CoreDNS that need to scale proportionally with cluster size. Unlike HPA, which requires a metrics pipeline and custom metrics API, CPA watches node count and adjusts replicas via a simple formula.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;replicas = max(ceil(nodes / nodesPerReplica), ceil(cores / coresPerReplica))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a floor of 2 when &lt;code&gt;preventSinglePointFailure: true&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns-cpa-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;linear&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;{"coresPerReplica": 128, "nodesPerReplica": 4, "preventSinglePointFailure": true}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scaling examples (annotated which constraint is binding):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Cores&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;Replicas&lt;/th&gt;
&lt;th&gt;Binding Constraint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;max(ceil(8/4), ceil(16/128)) = max(2, 1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nodesPerReplica&lt;/code&gt; (2 &amp;gt; 1). Also matches the &lt;code&gt;preventSinglePointFailure&lt;/code&gt; floor of 2.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;max(ceil(24/4), ceil(48/128)) = max(6, 1)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nodesPerReplica&lt;/code&gt; (6 &amp;gt; 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;max(ceil(100/4), ceil(200/128)) = max(25, 2)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;nodesPerReplica&lt;/code&gt; (25 &amp;gt; 2)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In every example above, &lt;code&gt;nodesPerReplica&lt;/code&gt; is the binding constraint; the cores-based term rounds up to only 1 or 2 and never wins. This is typical for CoreDNS, which is more sensitive to the number of nodes (and therefore the number of NodeLocal DNSCache instances forwarding cache misses) than to raw cluster compute.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Cloud vendor endorsement:&lt;/strong&gt; Oracle OKE, AWS EKS, and Azure AKS all recommend CPA for CoreDNS autoscaling. &lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/scale-cluster-services.html" rel="noopener noreferrer"&gt;EKS Best Practices&lt;/a&gt; explicitly states: &lt;em&gt;"It's recommended you use NodeLocal DNS or the cluster proportional autoscaler to scale CoreDNS."&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Deploy CPA via Helm&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;coredns-cpa cluster-proportional-autoscaler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--repo&lt;/span&gt; https://kubernetes-sigs.github.io/cluster-proportional-autoscaler &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; rbac.create&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; image.tag&lt;span class="o"&gt;=&lt;/span&gt;v1.12.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; defaultRequests.cpu&lt;span class="o"&gt;=&lt;/span&gt;100m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; defaultRequests.memory&lt;span class="o"&gt;=&lt;/span&gt;70Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; defaultLimits.cpu&lt;span class="o"&gt;=&lt;/span&gt;500m &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; defaultLimits.memory&lt;span class="o"&gt;=&lt;/span&gt;250Mi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;configMap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;coredns-cpa-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
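
&lt;p&gt;A quick sanity check after the install (on this 24-node cluster the linear mode should settle on 6 replicas):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Confirm the autoscaler's parameters and the resulting CoreDNS replica count
kubectl get configmap coredns-cpa-config -n kube-system -o jsonpath='{.data.linear}{"\n"}'
kubectl get deployment coredns -n kube-system -o jsonpath='{.spec.replicas}{"\n"}'   # expect 6 here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;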






&lt;h3&gt;
  
  
  Step 6: Add a PodDisruptionBudget so You Never Take All DNS Down at Once
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns-pdb&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;     &lt;span class="c1"&gt;# Never fewer than 2 running CoreDNS pods&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a PDB, a node drain or rolling update can kill all CoreDNS replicas simultaneously, causing an instant cluster-wide DNS blackout. This is especially dangerous when using NodeLocal DNSCache, because the node-local caches still forward misses to CoreDNS pods. If all CoreDNS pods are evicted, every cache miss fails.&lt;/p&gt;
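
&lt;p&gt;Verifying the budget is in effect is a one-liner:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pdb coredns-pdb -n kube-system
# With 6 replicas and minAvailable: 2, voluntary disruptions may evict at most 4 CoreDNS
# pods at a time; a drain that would breach the budget waits until replacements are Ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;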




&lt;h2&gt;
  
  
  Chapter 5. Verification: Did It Work?
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. DNS resolution speed  should be &amp;lt;5ms now&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time &lt;/span&gt;nslookup google.com
real    0m0.003s   &lt;span class="c"&gt;# ← was 5 seconds before&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;time &lt;/span&gt;nslookup order-processor.production.svc.cluster.local
real    0m0.002s   &lt;span class="c"&gt;# ← was timing out before&lt;/span&gt;

&lt;span class="c"&gt;# 2. CoreDNS resource usage  should have massive headroom&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns
NAME                       CPU&lt;span class="o"&gt;(&lt;/span&gt;cores&lt;span class="o"&gt;)&lt;/span&gt;   MEMORY&lt;span class="o"&gt;(&lt;/span&gt;bytes&lt;span class="o"&gt;)&lt;/span&gt;
coredns-5d78c9fd5d-4kx2m   28m          88Mi         &lt;span class="c"&gt;# ← was 97m/100m&lt;/span&gt;

&lt;span class="c"&gt;# 3. CPU throttling  should be zero&lt;/span&gt;
kubectl debug &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system coredns-5d78c9fd5d-4kx2m &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;coredns &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/cpu.stat

nr_throttled 0
throttled_usec 0          &lt;span class="c"&gt;# ← was ~295 s of cumulative throttling, climbing continuously&lt;/span&gt;

&lt;span class="c"&gt;# 4. Node-local cache hit rate  should be &amp;gt;90%&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns &lt;span class="nt"&gt;--&lt;/span&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9153/metrics | &lt;span class="nb"&gt;grep &lt;/span&gt;cache
coredns_cache_hits_total 48291
coredns_cache_misses_total 4892
&lt;span class="c"&gt;# Hit ratio: 90.7% &lt;/span&gt;

&lt;span class="c"&gt;# 5. CPA working  verify replicas scaled&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe deployment coredns | &lt;span class="nb"&gt;grep &lt;/span&gt;Replicas
Replicas:               6 &lt;span class="o"&gt;(&lt;/span&gt;desired&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 6. Monitoring: No More Surprises
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alerting rules  add to your rules file&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns.rules&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ALERT: CoreDNS error rate &amp;gt; 5% for 3 minutes&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CoreDNSHighErrorRate&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;rate(coredns_dns_requests_total{rcode=~"SERVFAIL|REFUSED|TIMEOUT"}[5m])&lt;/span&gt;
        &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="s"&gt;rate(coredns_dns_requests_total[5m])&lt;/span&gt;
      &lt;span class="s"&gt;) &amp;gt; 0.05&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoreDNS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;humanize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

  &lt;span class="c1"&gt;# ALERT: DNS resolution slow (p99 &amp;gt; 50ms)&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CoreDNSHighLatency&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;histogram_quantile(0.99,&lt;/span&gt;
        &lt;span class="s"&gt;rate(coredns_dns_request_duration_seconds_bucket[5m])&lt;/span&gt;
      &lt;span class="s"&gt;) &amp;gt; 0.05&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoreDNS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;p99&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;latency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}s"&lt;/span&gt;

  &lt;span class="c1"&gt;# ALERT: Cache efficiency dropping&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CoreDNSCacheHitRateLow&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;rate(coredns_cache_hits_total[10m])&lt;/span&gt;
        &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="s"&gt;(rate(coredns_cache_hits_total[10m]) + rate(coredns_cache_misses_total[10m]))&lt;/span&gt;
      &lt;span class="s"&gt;) &amp;lt; 0.80&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoreDNS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cache&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ratio&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;below&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;80%"&lt;/span&gt;

  &lt;span class="c1"&gt;# ALERT: CPU throttling detected&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CoreDNSCPUThrottling&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;rate(container_cpu_cfs_throttled_periods_total{container="coredns"}[5m]) &amp;gt; 0&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
    &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CoreDNS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;being&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;CPU&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;throttled"&lt;/span&gt;

  &lt;span class="c1"&gt;# ALERT: CoreDNS pod restarting&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CoreDNSPodRestarts&lt;/span&gt;
    &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;increase(kube_pod_container_status_restarts_total{container="coredns"}[1h]) &amp;gt; 3&lt;/span&gt;
    &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
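
&lt;p&gt;These rules assume Prometheus is actually scraping CoreDNS in the first place. If you run the Prometheus Operator, a minimal ServiceMonitor along these lines does the job; this is a sketch, assuming your &lt;code&gt;kube-dns&lt;/code&gt; Service exposes port 9153 under the name &lt;code&gt;metrics&lt;/code&gt; (the default in most distributions), and the &lt;code&gt;release&lt;/code&gt; label is a placeholder for whatever your Prometheus instance selects on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: ServiceMonitor for CoreDNS (Prometheus Operator assumed)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: coredns
  namespace: kube-system
  labels:
    release: prometheus          # placeholder; match your serviceMonitorSelector
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns          # label on the kube-dns Service in most clusters
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: metrics                # the 9153 port CoreDNS exposes for Prometheus
    interval: 30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
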






&lt;h2&gt;
  
  
  Chapter 7 The Playbook: TL;DR Checklist
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;╔══════════════════════════════════════════════════════════════╗
║              COREDNS PRODUCTION HARDENING CHECKLIST           ║
╠══════════════════════════════════════════════════════════════╣
║ □ 1. Set ndots: 2 (immediate 80% query reduction)            ║
║ □ 2. Deploy NodeLocal DNSCache (game changer for clusters    ║
║      with &amp;gt;50 pods; test on canary nodes first, then        ║
║      roll out cluster-wide)                                   ║
║ □ 3. Tune Corefile: cache, prefetch, max_concurrent,         ║
║      prefer_tcp, pods verified                                ║
║ □ 4. Right-size CPU limits (min 200m request, 500m limit;    ║
║       Go needs breathing room)                               ║
║ □ 5. Deploy CPA (Cluster Proportional Autoscaler)             ║
║       NOT regular HPA; scales with cluster size            ║
║ □ 6. Set PodDisruptionBudget (minAvailable: 2)               ║
║ □ 7. Add monitoring alerts (error rate, latency,             ║
║      cache hit ratio, CPU throttling)                         ║
║ □ 8. Test with: kubectl exec &amp;lt;pod&amp;gt; -- nslookup               ║
║      &amp;lt;service&amp;gt; &amp;amp;&amp;amp; verify &amp;lt;5ms, zero errors                    ║
╚══════════════════════════════════════════════════════════════╝
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Priority order (do them in this sequence; a quick verification sketch follows the list):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ndots: 2&lt;/code&gt; → 5 minutes, 80% of the problem solved&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NodeLocal DNSCache → 15 minutes, eliminates cross-node traffic (test on canary nodes first)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Resource increase → 5 minutes, un-throttles the Go runtime&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Corefile tuning → 10 minutes, cache + prefetch + failover&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CPA + PDB → 20 minutes, future-proofs against cluster growth&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring → 30 minutes, prevents the next incident&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
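
&lt;p&gt;A quick sanity check between steps; a rough sketch that loops &lt;code&gt;nslookup&lt;/code&gt; from one of the application pods (pod and service names are the ones from the incident above, so swap in your own). For latency numbers, trust the CoreDNS duration metrics rather than wall-clock timing through &lt;code&gt;kubectl exec&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Rough check: expect zero FAILED lines once the fixes are in
for i in $(seq 1 20); do
  kubectl exec -n production payment-svc-8d4f6b7c-x2k9m -- \
    nslookup order-processor.production.svc.cluster.local &gt; /dev/null \
    || echo "lookup $i FAILED"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
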




&lt;h2&gt;
  
  
  Epilogue
&lt;/h2&gt;

&lt;p&gt;That Wednesday night, we applied Steps 1–4 between 12:00 AM and 12:47 AM. DNS resolution went from &lt;strong&gt;5-second timeouts&lt;/strong&gt; to &lt;strong&gt;2-millisecond responses&lt;/strong&gt;. We deployed CPA the following day. The cluster hasn't had a DNS-related incident since.&lt;/p&gt;

&lt;p&gt;DNS is boring until it breaks, and when it breaks, everything breaks. CoreDNS isn't a set-and-forget service at scale; it's critical infrastructure that demands deliberate sizing, a caching strategy, and proportional autoscaling. The Cluster Proportional Autoscaler exists specifically because &lt;code&gt;kubectl scale deployment coredns --replicas=X&lt;/code&gt; doesn't scale when your cluster goes from 24 nodes to 240.&lt;/p&gt;

&lt;p&gt;Monitor it. Size it. Cache it. Scale it.&lt;/p&gt;

&lt;p&gt;Your 3 AM pager will thank you. 🌙&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/coredns/deployment/tree/master/kubernetes/Scaling_CoreDNS.md" rel="noopener noreferrer"&gt;CoreDNS Scaling Guide (GitHub)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/" rel="noopener noreferrer"&gt;NodeLocal DNSCache Kubernetes Official Docs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/eks/latest/best-practices/scale-cluster-services.html" rel="noopener noreferrer"&gt;EKS Best Practices Scaling Cluster Services&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/cluster-proportional-autoscaler" rel="noopener noreferrer"&gt;Cluster Proportional Autoscaler (kubernetes-sigs)&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://neon.com/blog/improving-dns-performance-with-nodelocaldns" rel="noopener noreferrer"&gt;Improving DNS Performance with NodeLocalDNS Neon Engineering Blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengbestpractices_topic-Large-Scale-Clusters-best-practices.htm" rel="noopener noreferrer"&gt;Oracle OKE Large Cluster Best Practices&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://coredns.io/plugins/kubernetes/" rel="noopener noreferrer"&gt;CoreDNS Kubernetes Plugin Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://coredns.io/plugins/cache/" rel="noopener noreferrer"&gt;CoreDNS Cache Plugin Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>networking</category>
      <category>sre</category>
    </item>
    <item>
      <title>Argo Rollouts in Production: Canary, AnalysisTemplates, and the Gotchas Nobody Documents</title>
      <dc:creator>Akshat Sinha</dc:creator>
      <pubDate>Thu, 07 May 2026 05:00:00 +0000</pubDate>
      <link>https://forem.com/pingtoprod/argo-rollouts-in-production-canary-analysistemplates-and-the-gotchas-nobody-documents-10fi</link>
      <guid>https://forem.com/pingtoprod/argo-rollouts-in-production-canary-analysistemplates-and-the-gotchas-nobody-documents-10fi</guid>
      <description>&lt;p&gt;It started with a routine Tuesday deploy. Nothing fancy: a small config change to our ingress controller across a few clusters. We'd done this a hundred times. A standard values.yaml modification, then letting ArgoCD do its magic, watching the rolling update do its thing, grabbing a tea (personal preference; a coffee works just as well).&lt;/p&gt;

&lt;p&gt;Famous last words.&lt;/p&gt;

&lt;p&gt;By the time I checked the dashboards, three clusters were throwing 502s. The rolling update had dutifully cycled through pods, but it had no clue that the new config was messing up our TLS termination. It just kept going. &lt;strong&gt;That's the thing about Kubernetes Deployments, they're optimistic to a fault.&lt;/strong&gt; They'll roll out bad code with the same enthusiasm as good code, and by the time your metrics catch up, you've already blasted through all your replicas.&lt;/p&gt;

&lt;p&gt;I spent the afternoon writing rollback scripts and explaining to stakeholders why "production-ready" Kubernetes had just taken down three environments.&lt;/p&gt;

&lt;p&gt;That was the day I stopped trusting &lt;code&gt;kind: Deployment&lt;/code&gt; for anything that matters, at least in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Isn't Rolling Updates. Actually It's What They Can't See
&lt;/h2&gt;

&lt;p&gt;Here's what they don't tell you in the tutorials: &lt;code&gt;RollingUpdate&lt;/code&gt; isn't a deployment strategy, it's a pod replacement algorithm (yes I said what I said). It knows how to swap old pods for new ones without downtime. It has zero clue whether your application is actually &lt;em&gt;working&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think about what the native Deployment actually gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Readiness probes&lt;/strong&gt; — checks if &lt;em&gt;a pod&lt;/em&gt; is ready, not if your &lt;em&gt;release&lt;/em&gt; is healthy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt; — controls &lt;em&gt;speed&lt;/em&gt;, not &lt;em&gt;safety&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pause support&lt;/strong&gt; — you can halt, but there's no automated rollback on failure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Umm, that's pretty much it, not counting the Pre-stop hooks and stuff&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No traffic management between old and new versions. No external metric validation. No blast radius control. No ability to preview a release before it gets real traffic. Research tells us that &lt;strong&gt;80% of production outages are caused by small changes&lt;/strong&gt;, and the native Deployment has no opinion about any of them.&lt;/p&gt;

&lt;p&gt;You want canary? Write it yourself with two Deployments and a fragile mess of Service selectors. You want automated rollback based on error rates? Build a custom controller. You want blue-green with preview environments? Good luck.&lt;/p&gt;

&lt;p&gt;The Kubernetes community will tell you "use Argo Rollouts!" — and they're right. But most tutorials stop at "here's how to replace Deployment with Rollout for blue-green." Let me show you what actually matters when you're running this in anger.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;kind: Rollout&lt;/code&gt; vs &lt;code&gt;kind: Deployment&lt;/code&gt; — The Actual Diff
&lt;/h2&gt;

&lt;p&gt;The first CRD is &lt;code&gt;kind: Rollout&lt;/code&gt;. It's marketed as a drop-in replacement for &lt;code&gt;kind: Deployment&lt;/code&gt;, and it mostly is, but as a DevOps engineer you can't bet production on &lt;strong&gt;"mostly works"&lt;/strong&gt;, so let's be precise about what changes.&lt;/p&gt;

&lt;p&gt;Here's a native Deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
    &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the equivalent &lt;code&gt;kind: Rollout&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;-- changed&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;                      &lt;span class="c1"&gt;# &amp;lt;-- changed&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                        &lt;span class="c1"&gt;# &amp;lt;-- this whole block changes&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m.&lt;/span&gt; 
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-health-check&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app.default.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things change: the &lt;code&gt;apiVersion&lt;/code&gt;, the &lt;code&gt;kind&lt;/code&gt;, and the &lt;code&gt;strategy&lt;/code&gt; block. Everything else — &lt;code&gt;selector&lt;/code&gt;, &lt;code&gt;replicas&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt;, container spec — is identical. The controller picks it up and manages two ReplicaSets (stable + canary) behind the scenes.&lt;/p&gt;
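
&lt;p&gt;You can watch the controller's handiwork with plain kubectl; a small check using the label from the example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Two ReplicaSets owned by the Rollout: one stable, one canary (different pod-template hashes)
kubectl get rs -l app=my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
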

&lt;p&gt;&lt;strong&gt;One important gotcha:&lt;/strong&gt; Argo Rollouts creates and manages its own ReplicaSets, so if you migrate an existing Deployment, delete the Deployment first. Running both simultaneously causes a conflict over the same pods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to migrate from&lt;/em&gt; &lt;strong&gt;&lt;em&gt;kind: Deployment&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;to&lt;/em&gt; &lt;strong&gt;&lt;em&gt;kind: Rollout&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;without downtime is a whole different story; it deserves a separate blog post.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Canary Problem Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Before we go further, there's a widely misunderstood behaviour that'll bite you if you skip it.&lt;/p&gt;

&lt;p&gt;When you set &lt;code&gt;setWeight: 30&lt;/code&gt; in a canary step, most people assume 30% of your &lt;em&gt;users&lt;/em&gt; get the new version. &lt;strong&gt;That's not what happens.&lt;/strong&gt; Argo Rollouts guarantees that 30% of &lt;em&gt;network requests&lt;/em&gt; go to canary, but those requests are completely random. The same user can hit stable on request 1, canary on request 2, and stable again on request 3. For stateless APIs this is tolerable. For anything with session state, user-specific features, or UI changes, this is a disaster.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix: Header-Based Routing
&lt;/h3&gt;

&lt;p&gt;You need a traffic provider (NGINX Ingress, Istio, Traefik, etc.) and a dedicated canary URL per user group. Here's how it looks with NGINX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;canaryService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-canary&lt;/span&gt;   &lt;span class="c1"&gt;# separate Service for canary pods&lt;/span&gt;
      &lt;span class="na"&gt;stableService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-stable&lt;/span&gt;   &lt;span class="c1"&gt;# separate Service for stable pods&lt;/span&gt;
      &lt;span class="na"&gt;trafficRouting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stableIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-ingress&lt;/span&gt;
          &lt;span class="na"&gt;annotationPrefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx.ingress.kubernetes.io&lt;/span&gt;
          &lt;span class="na"&gt;additionalIngressAnnotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;canary-by-header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;X-Canary-User&lt;/span&gt;
            &lt;span class="na"&gt;canary-by-header-value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now requests with the header &lt;code&gt;X-Canary-User: true&lt;/code&gt; always hit the canary. Everyone else stays on stable. You can give this header to internal testers, beta users, or a specific account tier — controlled, consistent, reproducible canary exposure.&lt;/p&gt;
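
&lt;p&gt;A quick way to verify the routing from outside the cluster; a sketch assuming the Ingress serves a hypothetical host &lt;code&gt;my-app.example.com&lt;/code&gt; and your app exposes something version-identifying (a header, a &lt;code&gt;/version&lt;/code&gt; endpoint, whatever you have):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# With the header: should consistently land on a canary pod
curl -s -H "X-Canary-User: true" https://my-app.example.com/version

# Without the header: stays on stable (aside from the setWeight percentage)
curl -s https://my-app.example.com/version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
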

&lt;h3&gt;
  
  
  Decoupling Traffic from Replicas: &lt;code&gt;setCanaryScale&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;By default, Argo Rollouts scales canary replicas proportionally to traffic weight. At 10% traffic, you get ~10% of your total replica count. This can cause resource issues — at 10 replicas total, 10% traffic with 1 canary pod means that pod is handling a tenth of your prod load with zero redundancy.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;setCanaryScale&lt;/code&gt; fixes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setCanaryScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;        &lt;span class="c1"&gt;# always keep 3 canary pods regardless of weight&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setCanaryScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchTrafficWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# now scale proportionally&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is critical for cost efficiency in large clusters, so you're not spinning up 50 canary pods the moment you hit 50% traffic weight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual Gates: Explicit Human Approval
&lt;/h3&gt;

&lt;p&gt;Automated analysis is great. But sometimes you want a person to look at dashboards before traffic increases. Use an indefinite &lt;code&gt;pause: {}&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;      &lt;span class="c1"&gt;# timed pause — auto-advances&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;                   &lt;span class="c1"&gt;# indefinite pause — REQUIRES manual promotion&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An indefinite pause blocks until someone runs &lt;code&gt;kubectl argo rollouts promote my-app&lt;/code&gt; or clicks Promote in the dashboard. Ideal for compliance-gated releases or high-stakes deploys.&lt;/p&gt;
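
&lt;p&gt;The plugin commands you'll typically reach for at that gate (namespace flags omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Approve the gate and continue to the next step
kubectl argo rollouts promote my-app

# Skip all remaining steps and go straight to 100%
kubectl argo rollouts promote my-app --full

# Bail out: traffic snaps back to stable, the canary scales down
kubectl argo rollouts abort my-app

# Roll back to an earlier revision entirely
kubectl argo rollouts undo my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
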




&lt;h2&gt;
  
  
  AnalysisTemplate: Executable Success Criteria
&lt;/h2&gt;

&lt;p&gt;I used to think monitoring was enough. "We'll watch the dashboards and roll back if things go south." Cute. By the time you're back from grabbing that cup of tea, you've already served errors to real users for minutes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AnalysisTemplate&lt;/code&gt; is where you define what "goooood" looks like, not vague SLOs buried in a wiki, but actual executable queries against your metrics provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate-check&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;gt;= &lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;
    &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus:9090&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;requests_total{service="{{args.service-name}}",status!~"5.."}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;)) /&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;requests_total{service="{{args.service-name}}"}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;failureLimit: 3&lt;/code&gt; is important: it means the analysis can fail up to 3 measurements before the rollout aborts. This prevents a single metric spike from triggering a premature rollback. Tune it based on your traffic patterns.&lt;/p&gt;
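
&lt;p&gt;For noisy or low-traffic services there are a few related knobs worth knowing; a rough sketch of a more defensive metric block (the specific numbers are illustrative, not recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metrics:
- name: success-rate
  interval: 5m
  count: 6                     # take 6 measurements in total, then stop
  failureLimit: 3              # abort the rollout after 3 failed measurements
  inconclusiveLimit: 2         # pause for a human after 2 inconclusive results
  consecutiveErrorLimit: 4     # tolerate transient provider errors (e.g. Prometheus briefly unreachable)
  successCondition: result[0] &gt;= 0.95
  # provider: same Prometheus block as above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
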

&lt;h3&gt;
  
  
  Beyond Prometheus — The Providers You're Not Using
&lt;/h3&gt;

&lt;p&gt;Most blogs show only Prometheus. Here's the full picture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Error rates, latency, saturation — the default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datadog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;If your org is Datadog-first; same PromQL-style queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New Relic / CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud-native shops already invested in these&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;InfluxDB / Wavefront&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IoT or high-frequency telemetry workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTP Endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Call any arbitrary URL and evaluate the response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes Job&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run any test script as a gate — massively underrated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Kubernetes Job provider&lt;/strong&gt; deserves special attention. It lets you run integration tests, smoke tests, or any shell script as an analysis step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integration-test&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-runner&lt;/span&gt;
              &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-test-runner:latest&lt;/span&gt;
              &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pytest"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/smoke/"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-v"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
            &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
        &lt;span class="na"&gt;backoffLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the Job exits 0, analysis passes. Non-zero means failure, rollback triggers. This is how you gate a canary on actual test results, not just infrastructure metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  ClusterAnalysisTemplate — Define Once, Use Everywhere
&lt;/h3&gt;

&lt;p&gt;If you're managing multiple namespaces (and you are), use &lt;code&gt;ClusterAnalysisTemplate&lt;/code&gt; instead of &lt;code&gt;AnalysisTemplate&lt;/code&gt;. It's cluster-scoped — define it once, reference it from any Rollout in any namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterAnalysisTemplate&lt;/span&gt;   &lt;span class="c1"&gt;# &amp;lt;-- cluster-scoped&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-health-check&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;gt;= &lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;
    &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;istio_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;destination_service=~"{{args.service}}",&lt;/span&gt;
              &lt;span class="s"&gt;response_code!~"5.*"&lt;/span&gt;
            &lt;span class="s"&gt;}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;)) /&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;istio_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;destination_service=~"{{args.service}}"&lt;/span&gt;
            &lt;span class="s"&gt;}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your Prometheus address changes (and it will), you update one file. Not fifty.&lt;/p&gt;
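
&lt;p&gt;One detail that's easy to miss: when a Rollout references a cluster-scoped template, the reference needs &lt;code&gt;clusterScope: true&lt;/code&gt;, otherwise the controller looks for a namespaced &lt;code&gt;AnalysisTemplate&lt;/code&gt; with that name. A minimal sketch of the reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;strategy:
  canary:
    analysis:
      templates:
      - templateName: standard-health-check
        clusterScope: true     # resolve a ClusterAnalysisTemplate, not a namespaced one
      args:
      - name: service
        value: my-app.default.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
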




&lt;h2&gt;
  
  
  AnalysisRun: The Live Execution You Should Actually Inspect
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;AnalysisRun&lt;/code&gt; is the third CRD, and it's the one people forget to look at during an active rollout. It's the live execution of an &lt;code&gt;AnalysisTemplate&lt;/code&gt; — one gets created automatically each time a Rollout triggers an analysis.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;AnalysisRun&lt;/code&gt; has three possible outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Successful&lt;/strong&gt; → Argo Rollouts advances to the next step&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failed&lt;/strong&gt; → Rollout aborts, traffic snaps back to stable, canary scales to zero&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inconclusive&lt;/strong&gt; → Rollout pauses, waits for manual judgment (useful when metrics are ambiguous)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most useful thing you can do during a live canary is inspect the AnalysisRun directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See all analysis runs for a rollout&lt;/span&gt;
kubectl argo rollouts get rollout my-app

&lt;span class="c"&gt;# Detailed view of a specific analysis run&lt;/span&gt;
kubectl describe analysisrun my-app-&amp;lt;&lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;# Watch it live&lt;/span&gt;
kubectl argo rollouts get rollout my-app &lt;span class="nt"&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--watch&lt;/code&gt; flag is your best friend. It gives you a live terminal view of step progression, traffic weights, and analysis status without needing to open the dashboard.&lt;/p&gt;

&lt;p&gt;You can also run an &lt;code&gt;AnalysisTemplate&lt;/code&gt; independently, outside of a Rollout, for dry-run validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisRun&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dry-run-health-check&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service-name&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app.default.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;error-rate-check&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this before wiring analysis into a production Rollout. Validate your PromQL actually returns what you think it returns. Save yourself the embarrassment of an analysis that always passes because the query is wrong.&lt;/p&gt;
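
&lt;p&gt;After applying it, a couple of plain kubectl checks tell you whether the query actually behaved (names match the dry-run manifest above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Overall phase: Successful, Failed, Inconclusive, or Running
kubectl get analysisrun dry-run-health-check -o jsonpath='{.status.phase}{"\n"}'

# Per-metric results, including the measured values the query returned
kubectl describe analysisrun dry-run-health-check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
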




&lt;h2&gt;
  
  
  &lt;code&gt;kind: Experiment&lt;/code&gt;: A/B Testing Inside Your Pipeline
&lt;/h2&gt;

&lt;p&gt;Last year, we were migrating from GKE to a multi-cloud setup. We needed to verify our app behaved identically across regions with different latency profiles. Normally, you'd do this manually: spin up a test deployment, run some benchmarks, compare.&lt;/p&gt;

&lt;p&gt;Enter &lt;code&gt;kind: Experiment&lt;/code&gt;. It lets you run multiple ReplicaSets side-by-side for a set duration, with optional analysis on each. Think of it as Kayenta-style comparison analysis, but native to your deployment pipeline.&lt;/p&gt;

&lt;p&gt;The most common use case isn't standalone experiments though — it's embedding them as a &lt;strong&gt;canary step&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Inside a Rollout's canary steps&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;experiment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30m&lt;/span&gt;
    &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline&lt;/span&gt;
      &lt;span class="na"&gt;specRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stable&lt;/span&gt;        &lt;span class="c1"&gt;# uses the current stable spec&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;canary&lt;/span&gt;
      &lt;span class="na"&gt;specRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;canary&lt;/span&gt;        &lt;span class="c1"&gt;# uses the new canary spec&lt;/span&gt;
    &lt;span class="na"&gt;analyses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compare-latency&lt;/span&gt;
      &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;p95-latency-comparison&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;baseline-service&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{templates.baseline.service.name}}"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;canary-service&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{templates.canary.service.name}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both versions run in parallel for 30 minutes. Your analysis compares their p95 latency side-by-side. If canary is statistically worse, the experiment fails and the rollout aborts, before a single real user sees the new version.&lt;/p&gt;

&lt;p&gt;That's not a deployment strategy. That's engineering confidence.&lt;/p&gt;
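
&lt;p&gt;The &lt;code&gt;p95-latency-comparison&lt;/code&gt; template referenced above isn't something Argo ships; you write it yourself. Here's a rough sketch of what it could look like, assuming a hypothetical &lt;code&gt;http_request_duration_seconds&lt;/code&gt; histogram labelled by service; the 10% tolerance is an arbitrary starting point, not a recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency-comparison
spec:
  args:
  - name: baseline-service
  - name: canary-service
  metrics:
  - name: p95-ratio
    interval: 5m
    # pass while canary p95 stays within 10% of baseline p95
    successCondition: result[0] &lt;= 1.10
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(
            http_request_duration_seconds_bucket{service="{{args.canary-service}}"}[5m]
          )) by (le))
          /
          histogram_quantile(0.95, sum(rate(
            http_request_duration_seconds_bucket{service="{{args.baseline-service}}"}[5m]
          )) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
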




&lt;h2&gt;
  
  
  The Argo Rollouts Dashboard: Yes, There's a GUI, and It's Your Control Room
&lt;/h2&gt;

&lt;p&gt;Here's what most tutorials skip entirely: there's a full web UI, and it's actually good.&lt;/p&gt;

&lt;p&gt;Install the kubectl plugin first if you haven't:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# macOS&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;argoproj/tap/kubectl-argo-rollouts

&lt;span class="c"&gt;# Linux&lt;/span&gt;
curl &lt;span class="nt"&gt;-LO&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x kubectl-argo-rollouts-linux-amd64
&lt;span class="nb"&gt;mv &lt;/span&gt;kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then launch the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl argo rollouts dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It opens on &lt;code&gt;http://localhost:3100&lt;/code&gt;. What you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Live rollout status&lt;/strong&gt; — step progression, current traffic weights, active canary vs stable pod counts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AnalysisRun status&lt;/strong&gt; — each metric check, pass/fail, consecutive failures, timestamps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One-click controls&lt;/strong&gt; — Promote, Abort, Retry directly from the UI without touching kubectl&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rollout history&lt;/strong&gt; — every revision with its status and timestamp&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the thing to show your team when you're making the case for Argo Rollouts. Watching a canary step from 10% → 50% in real-time while analysis checks tick green is more persuasive than any architecture diagram.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For production&lt;/strong&gt;, you can use the argo-rollouts controller Helm chart and enable the dashboard there; the chart also supports enabling an Ingress for the dashboard, so you're mostly set. If you have already migrated your NGINX controller to the Gateway API, you might have to write a separate HTTPRoute; if not, you can expose it through a LoadBalancer. Make sure it's only internally accessible and not public-facing :).&lt;/p&gt;

&lt;p&gt;Here's the GitHub repo link to the Helm chart, in case you need it: &lt;a href="https://github.com/argoproj/argo-helm/tree/main/charts/argo-rollouts" rel="noopener noreferrer"&gt;https://github.com/argoproj/argo-helm/tree/main/charts/argo-rollouts&lt;/a&gt;&lt;/p&gt;
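
&lt;p&gt;For reference, the values involved look roughly like this; treat the exact keys as an assumption and confirm them against the chart's &lt;code&gt;values.yaml&lt;/code&gt; before applying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch of Helm values for the argo-rollouts chart; verify key names against the chart
dashboard:
  enabled: true
  ingress:
    enabled: true
    ingressClassName: nginx-internal       # hypothetical internal-only ingress class
    hosts:
    - rollouts.internal.example.com        # hypothetical internal hostname
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
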




&lt;h2&gt;
  
  
  Notifications — The Part Everyone Gets Wrong
&lt;/h2&gt;

&lt;p&gt;Argo Rollouts has had native notification support since v1.1, with self-service namespace configuration since v1.6, but most setups are half-baked. Teams wire up &lt;code&gt;on-rollout-aborted&lt;/code&gt; and call it done, which is one event out of nine and usually not even the most actionable one. Most blogs show the annotation and stop there. Here's the full wiring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create the Slack Token Secret
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic argo-rollouts-notification-secret &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;slack-token&lt;span class="o"&gt;=&lt;/span&gt;xoxb-your-slack-bot-token &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure the Notification ConfigMap
&lt;/h3&gt;

&lt;p&gt;This is where triggers and templates live. Apply it in the &lt;code&gt;argo-rollouts&lt;/code&gt; namespace for cluster-wide defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argo-rollouts-notification-cm&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argo-rollouts&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Slack integration&lt;/span&gt;
  &lt;span class="na"&gt;service.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;token: $slack-token&lt;/span&gt;

  &lt;span class="c1"&gt;# Message templates&lt;/span&gt;
  &lt;span class="na"&gt;template.rollout-aborted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;message: |&lt;/span&gt;
      &lt;span class="s"&gt;:red_circle: Rollout *{{.rollout.metadata.name}}* aborted in namespace *{{.rollout.metadata.namespace}}*&lt;/span&gt;
      &lt;span class="s"&gt;Reason: {{.rollout.status.message}}&lt;/span&gt;
      &lt;span class="s"&gt;Canary weight at time of abort: {{.rollout.status.currentPodHash}}&lt;/span&gt;

  &lt;span class="na"&gt;template.analysis-run-failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;message: |&lt;/span&gt;
      &lt;span class="s"&gt;:warning: Analysis failed for *{{.rollout.metadata.name}}*&lt;/span&gt;
      &lt;span class="s"&gt;Failed metric: {{range .analysisRun.status.metricResults}}{{if eq .phase "Failed"}}{{.name}}{{end}}{{end}}&lt;/span&gt;
      &lt;span class="s"&gt;Initiating automatic rollback.&lt;/span&gt;

  &lt;span class="na"&gt;template.rollout-completed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;message: |&lt;/span&gt;
      &lt;span class="s"&gt;:white_check_mark: Rollout *{{.rollout.metadata.name}}* completed successfully.&lt;/span&gt;
      &lt;span class="s"&gt;New stable image: {{range .rollout.spec.template.spec.containers}}{{.image}}{{end}}&lt;/span&gt;

  &lt;span class="na"&gt;template.rollout-paused&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;message: |&lt;/span&gt;
      &lt;span class="s"&gt;:pause_button: Rollout *{{.rollout.metadata.name}}* paused — awaiting manual promotion.&lt;/span&gt;
      &lt;span class="s"&gt;Promote with: `kubectl argo rollouts promote {{.rollout.metadata.name}} -n {{.rollout.metadata.namespace}}`&lt;/span&gt;

  &lt;span class="c1"&gt;# Triggers — maps events to templates&lt;/span&gt;
  &lt;span class="na"&gt;trigger.on-rollout-aborted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- send: [rollout-aborted]&lt;/span&gt;

  &lt;span class="na"&gt;trigger.on-analysis-run-failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- send: [analysis-run-failed]&lt;/span&gt;

  &lt;span class="na"&gt;trigger.on-rollout-completed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- send: [rollout-completed]&lt;/span&gt;

  &lt;span class="na"&gt;trigger.on-rollout-paused&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- send: [rollout-paused]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
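
&lt;p&gt;The &lt;code&gt;$slack-token&lt;/code&gt; placeholder resolves to a key in the &lt;code&gt;argo-rollouts-notification-secret&lt;/code&gt; Secret in the controller's namespace, so the bot token never lives in the ConfigMap. A minimal sketch (the token value is obviously a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: argo-rollouts-notification-secret
  namespace: argo-rollouts
stringData:
  slack-token: xoxb-not-a-real-token   # referenced as $slack-token in the ConfigMap above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;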



&lt;h3&gt;
  
  
  Step 3: Annotate Your Rollout
&lt;/h3&gt;

&lt;p&gt;Now teams can self-subscribe to any trigger without touching the central configmap (the v1.6 self-service model):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Alert on abort and analysis failure&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-aborted.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#alerts-team-a"&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-analysis-run-failed.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#alerts-team-a"&lt;/span&gt;
    &lt;span class="c1"&gt;# Notify on success too — close the loop&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-completed.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#deploys-team-a"&lt;/span&gt;
    &lt;span class="c1"&gt;# Alert when a manual gate is waiting for promotion&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-paused.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#deploys-team-a"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The nine built-in triggers cover the full lifecycle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;When It Fires&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-rollout-updated&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A new rollout revision starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-rollout-step-completed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Each canary step completes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-rollout-paused&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rollout pauses (manual gate or analysis inconclusive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-rollout-completed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rollout reaches 100% stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-rollout-aborted&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rollout aborts for any reason&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-analysis-run-started&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An AnalysisRun begins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-analysis-run-completed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An AnalysisRun finishes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-analysis-run-failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;An AnalysisRun fails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;on-analysis-run-error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Provider error (e.g., Prometheus unreachable)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;on-analysis-run-error&lt;/code&gt; trigger is one people forget. If your Prometheus goes down mid-canary, you want to know immediately, not discover it when you wonder why the rollout is stuck.&lt;/p&gt;
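
&lt;p&gt;Wiring it up follows the same pattern as the triggers above; a minimal sketch (template name and wording are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Added to argo-rollouts-notification-cm
template.analysis-run-error: |
  message: |
    :rotating_light: AnalysisRun errored for *{{.rollout.metadata.name}}*. The metric provider may be unreachable.

trigger.on-analysis-run-error: |
  - send: [analysis-run-error]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then subscribe the Rollout with &lt;code&gt;notifications.argoproj.io/subscribe.on-analysis-run-error.slack: "#alerts-team-a"&lt;/code&gt;, exactly like the other annotations.&lt;/p&gt;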




&lt;h2&gt;
  
  
  Argo Rollouts + Argo CD: The GitOps Stack
&lt;/h2&gt;

&lt;p&gt;A common source of confusion: &lt;strong&gt;Argo CD and Argo Rollouts are not the same tool, and they solve different problems.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Argo CD&lt;/strong&gt; ensures your cluster matches the desired state in Git. It's a reconciliation engine. It sees your &lt;code&gt;kind: Rollout&lt;/code&gt; manifest in Git and syncs it to the cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Argo Rollouts&lt;/strong&gt; controls &lt;em&gt;how&lt;/em&gt; the transition from old to new happens once that manifest lands. It manages the traffic shifting, analysis, and promotion/rollback logic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer pushes new image tag to Git
        ↓
Argo CD detects the diff and syncs the Rollout spec
        ↓
Argo Rollouts controller picks up the new spec
        ↓
Canary step begins: 10% traffic → AnalysisRun starts
        ↓
Analysis passes → 50% → analysis passes → 100%
        ↓
New version is stable. Argo CD shows "Synced + Healthy"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;One important config when using both together: set &lt;code&gt;ignoreDifferences&lt;/code&gt; in your Argo CD Application &lt;strong&gt;and&lt;/strong&gt; enable the &lt;code&gt;RespectIgnoreDifferences=true&lt;/code&gt; sync option, so Argo CD doesn't fight Argo Rollouts over the replica count during a canary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Argo CD Application:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;RespectIgnoreDifferences=true&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
    &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/spec/replicas&lt;/span&gt;   &lt;span class="c1"&gt;# Argo Rollouts manages this during canary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without both, Argo CD will try to sync the replica count back to what's in Git while Argo Rollouts is actively scaling canary pods. The two controllers fight each other and you get undefined behaviour.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Production-Ready Example
&lt;/h2&gt;

&lt;p&gt;Here's everything tied together, the kind of manifest I wish someone had shown me before I learned it the expensive way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. ClusterAnalysisTemplate — define once, use everywhere&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterAnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-health-check&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;gt;= &lt;/span&gt;&lt;span class="m"&gt;0.95&lt;/span&gt;
    &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;istio_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;destination_service=~"{{args.service}}",&lt;/span&gt;
              &lt;span class="s"&gt;response_code!~"5.*"&lt;/span&gt;
            &lt;span class="s"&gt;}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;)) /&lt;/span&gt;
          &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
            &lt;span class="s"&gt;istio_requests_total{&lt;/span&gt;
              &lt;span class="s"&gt;destination_service=~"{{args.service}}"&lt;/span&gt;
            &lt;span class="s"&gt;}[5m]&lt;/span&gt;
          &lt;span class="s"&gt;))&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;p95-latency&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
    &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;lt;= &lt;/span&gt;&lt;span class="m"&gt;500&lt;/span&gt;   &lt;span class="c1"&gt;# ms&lt;/span&gt;
    &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;histogram_quantile(0.95,&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(&lt;/span&gt;
              &lt;span class="s"&gt;istio_request_duration_milliseconds_bucket{&lt;/span&gt;
                &lt;span class="s"&gt;destination_service=~"{{args.service}}"&lt;/span&gt;
              &lt;span class="s"&gt;}[5m]&lt;/span&gt;
            &lt;span class="s"&gt;)) by (le)&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;

&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# 2. The Rollout&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-aborted.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#alerts-my-team"&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-analysis-run-failed.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#alerts-my-team"&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-completed.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#deploys-my-team"&lt;/span&gt;
    &lt;span class="na"&gt;notifications.argoproj.io/subscribe.on-rollout-paused.slack&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#deploys-my-team"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v2&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;canaryService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-canary&lt;/span&gt;
      &lt;span class="na"&gt;stableService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-stable&lt;/span&gt;
      &lt;span class="na"&gt;trafficRouting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                    &lt;span class="c1"&gt;# Can use ALB, Istio, Traefik (Gateway is Supported via plugins haven't explored it yet) &lt;/span&gt;
          &lt;span class="na"&gt;stableIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-ingress&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setCanaryScale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;             &lt;span class="c1"&gt;# stable replica count regardless of weight&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;          &lt;span class="c1"&gt;# timed: auto-advances after 10m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;                &lt;span class="c1"&gt;# manual gate: requires explicit promotion&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
          &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;startingStep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;          &lt;span class="c1"&gt;# analysis starts after first setWeight&lt;/span&gt;
        &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-health-check&lt;/span&gt;
          &lt;span class="na"&gt;clusterScope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;     &lt;span class="c1"&gt;# use ClusterAnalysisTemplate&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app.default.svc.cluster.local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
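
&lt;p&gt;Once this is applied, the kubectl plugin gives you a live view of the steps, weights, and analysis as they progress. A quick usage sketch (assuming the Rollout lives in the default namespace):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Watch the canary move through its steps and analysis runs
$ kubectl argo rollouts get rollout my-app --watch

# Release the manual gate at the 30% step
$ kubectl argo rollouts promote my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;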






&lt;h2&gt;
  
  
  Flagger: The Elephant in the Room
&lt;/h2&gt;

&lt;p&gt;Flagger is worth mentioning because the question always comes up. The fundamental difference is architectural: Flagger &lt;em&gt;wraps&lt;/em&gt; your existing &lt;code&gt;kind: Deployment&lt;/code&gt; rather than replacing it, which matters if migrating manifests feels risky or if you're already deep in the Flux ecosystem. Argo Rollouts also supports &lt;a href="https://argo-rollouts.readthedocs.io/en/stable/migrating/#reference-deployment-from-rollout" rel="noopener noreferrer"&gt;referencing an existing Deployment&lt;/a&gt; without replacing it, similar to how Flagger works.&lt;/p&gt;
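
&lt;p&gt;For completeness, the reference mode on the Argo Rollouts side looks roughly like this: the Rollout points at an existing Deployment instead of carrying its own pod template. A minimal sketch (names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  workloadRef:            # reference an existing Deployment instead of an inline template
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;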

&lt;p&gt;But the tradeoff is real. Flagger's surface area is smaller and its GitOps integration with Flux is excellent. Argo Rollouts gives you more granular step control, a dashboard, and the &lt;code&gt;Experiment&lt;/code&gt; CRD. Neither is wrong; they reflect different team philosophies. If you're Flux-native, evaluate Flagger first. If you want the full progressive delivery toolkit in one place, you're already in the right article.&lt;/p&gt;




&lt;h2&gt;
  
  
  When NOT to Use Argo Rollouts
&lt;/h2&gt;

&lt;p&gt;This is the section most blogs skip because it doesn't sell the tool. But every senior engineer respects a writer who gives them the failure modes upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use Argo Rollouts for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Infrastructure controllers&lt;/strong&gt; — cert-manager, nginx, coredns, sealed-secrets. These aren't application deployments; they're cluster plumbing. A canary of your ingress controller is chaos.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Applications with shared mutable state&lt;/strong&gt; — if your app writes to a shared file, a shared queue, or a shared database schema without backward compatibility, running two versions simultaneously will corrupt data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Worker/queue consumers&lt;/strong&gt; — apps that pull from a queue typically can't handle two versions processing the same messages. Argo Rollouts doesn't control queue routing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long-lived parallel versions&lt;/strong&gt; — Argo Rollouts assumes a brief deployment window (15–60 minutes typically, 1–2 hours max). Running canary for days or weeks before deciding to promote creates operational complexity and rollback ambiguity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-cluster rollouts&lt;/strong&gt; — Argo Rollouts operates within a single cluster. If you need coordinated rollouts across clusters, look at Argo CD ApplicationSets or multi-cluster progressive delivery tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Legacy apps that can't run multiple versions concurrently&lt;/strong&gt; — some apps hold exclusive locks, bind to fixed ports, or have singleton assumptions. For these, Blue-Green (not canary) is your only option, and even that requires validation; a minimal Blue-Green sketch follows this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
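
&lt;p&gt;For reference, Blue-Green on a Rollout is just a different &lt;code&gt;strategy&lt;/code&gt; block. A minimal sketch, with illustrative names, that keeps production traffic on the active Service while the new version sits behind a preview Service for validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: legacy-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: legacy-app
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      containers:
      - name: app
        image: legacy-app:v2
  strategy:
    blueGreen:
      activeService: legacy-app-active     # receives production traffic
      previewService: legacy-app-preview   # points at the new version for validation
      autoPromotionEnabled: false          # require explicit promotion after validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;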

&lt;p&gt;And a note on StatefulSets and DaemonSets: as of Argo Rollouts 1.9, support for these workload types is still in active development. Don't try to use &lt;code&gt;kind: Rollout&lt;/code&gt; as a drop-in for &lt;code&gt;kind: StatefulSet&lt;/code&gt;; it simply won't work today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bottom Line for Production Deployments
&lt;/h2&gt;

&lt;p&gt;If you're still using &lt;code&gt;kind: Deployment&lt;/code&gt; for anything that matters, you're gambling. Not because Kubernetes is bad (it's not), but because Deployments were designed for a simpler era. They assume your code is either "ready" or "not ready." Real production systems are more nuanced than that.&lt;/p&gt;

&lt;p&gt;The four CRDs (&lt;code&gt;Rollout&lt;/code&gt;, &lt;code&gt;AnalysisTemplate&lt;/code&gt;, &lt;code&gt;AnalysisRun&lt;/code&gt;, and &lt;code&gt;Experiment&lt;/code&gt;) aren't just features. They're the difference between "deploy and hope" and actual progressive delivery. Layer in the dashboard for visibility, notifications for observability, and header-based routing for controlled canary exposure, and you've built a deployment pipeline that catches problems before your users do.&lt;/p&gt;

&lt;p&gt;Start with &lt;code&gt;Rollout&lt;/code&gt; as a drop-in replacement. Add &lt;code&gt;ClusterAnalysisTemplate&lt;/code&gt; when you're ready to automate pass/fail decisions. Use the dashboard during live canaries. Wire up notifications properly — all the triggers, not just abort. And when you're feeling brave, &lt;code&gt;Experiment&lt;/code&gt; will change how you think about pre-production testing.&lt;/p&gt;

&lt;p&gt;One more thing: set &lt;code&gt;pause: {}&lt;/code&gt; for your first few production canaries. Get comfortable promoting manually. Understand what "good" looks like in your AnalysisRun output. Then, and only then, remove the manual gate and let the system decide.&lt;/p&gt;
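
&lt;p&gt;A few commands help during that learning phase. The names below are illustrative; AnalysisRun names are generated per rollout revision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inspect AnalysisRuns and their metric results
$ kubectl get analysisrun -n production
$ kubectl describe analysisrun my-app-6b9f7c4d8-2-1 -n production

# If the canary looks wrong, abort and fall back to stable
$ kubectl argo rollouts abort my-app -n production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;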

&lt;p&gt;Future you will thank present you when a canary fails at 2 AM and the right Slack channel gets paged before any person notices.&lt;/p&gt;

&lt;p&gt;Now you don't need to hurry back after grabbing your tea. Have an easy sip and let Rollouts handle prod.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Working with Kubernetes across multi-cloud setups, one bad deploy at a time. Follow along as I document the stuff they don't put in the docs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>devex</category>
      <category>kubernetes</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
