<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Samson Tanimawo</title>
    <description>The latest articles on Forem by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://forem.com/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>Forem: Samson Tanimawo</title>
      <link>https://forem.com/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>Kubernetes Network Policies: Lessons from Production Incidents</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 05 May 2026 14:33:42 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/kubernetes-network-policies-lessons-from-production-incidents-2fmh</link>
      <guid>https://forem.com/samson_tanimawo/kubernetes-network-policies-lessons-from-production-incidents-2fmh</guid>
      <description>&lt;h2&gt;
  
  
  Why Default Kubernetes Networking Is Wrong
&lt;/h2&gt;

&lt;p&gt;Fresh Kubernetes cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every pod can talk to every other pod&lt;/li&gt;
&lt;li&gt;Across namespaces, across services, across environments&lt;/li&gt;
&lt;li&gt;No egress restrictions&lt;/li&gt;
&lt;li&gt;No ingress restrictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a lateral movement attack waiting to happen. One compromised pod = entire cluster.&lt;/p&gt;

&lt;p&gt;Network Policies fix this. Most teams ignore them until the first security audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Rule That Breaks Things
&lt;/h2&gt;

&lt;p&gt;Start with: "deny all traffic by default, explicitly allow what you need."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny-all&lt;/span&gt;
&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply this to a running namespace, and &lt;strong&gt;everything breaks&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods can't reach DNS (kube-dns is in &lt;code&gt;kube-system&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Pods can't reach the API server&lt;/li&gt;
&lt;li&gt;Metrics scraping fails&lt;/li&gt;
&lt;li&gt;Service mesh control plane loses connectivity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a bug. This is the point. You have to explicitly allow everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Allow-List Pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server&lt;/span&gt;
&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server&lt;/span&gt;
&lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# DNS&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;
&lt;span class="c1"&gt;# Database&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every allowed flow has to be declared. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Incidents We've Debugged
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Incident 1: The invisible DNS failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Symptoms: pods running but inter-service calls failing intermittently. Logs showed DNS resolution timeouts.&lt;/p&gt;

&lt;p&gt;Root cause: a recently applied Network Policy forgot to allow UDP port 53 to kube-dns. About 50% of DNS queries were failing.&lt;/p&gt;

&lt;p&gt;Fix: always include DNS egress in the base template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dns&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;53&lt;/span&gt; &lt;span class="c1"&gt;# Some clients fall back to TCP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Incident 2: Metrics gap after namespace migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Symptoms: Prometheus stopped scraping metrics for services in the &lt;code&gt;payments&lt;/code&gt; namespace after a security audit applied strict policies.&lt;/p&gt;

&lt;p&gt;Root cause: the scraper pod in the &lt;code&gt;monitoring&lt;/code&gt; namespace couldn't reach the metrics port on &lt;code&gt;payments&lt;/code&gt; pods. Network Policy blocked it.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Incident 3: External API calls blocked&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Symptoms: a service that integrates with Stripe started returning 503s after a deploy.&lt;/p&gt;

&lt;p&gt;Root cause: the new Network Policy allowed egress to internal services but didn't allow egress to external IPs. Stripe calls failed.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="c1"&gt;# Allow to all external IPs on HTTPS&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ipBlock&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;cidr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0/0&lt;/span&gt;
&lt;span class="na"&gt;except&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;10.0.0.0/8&lt;/span&gt; &lt;span class="c1"&gt;# Block private ranges&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;172.16.0.0/12&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;192.168.0.0/16&lt;/span&gt;
&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Namespace-Level vs Pod-Level
&lt;/h2&gt;

&lt;p&gt;Namespace-level policies are easier but coarse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Allow all pods in "api" namespace to reach "db" namespace&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pod-level policies are harder but more secure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only the "orders" service can reach the "orders_db"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
&lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;orders-db&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start namespace-level. Tighten to pod-level for sensitive services (payments, auth, PII stores).&lt;/p&gt;

&lt;h2&gt;
  
  
  The CNI Matters
&lt;/h2&gt;

&lt;p&gt;Not all CNIs support all Network Policy features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cilium: full support, plus advanced L7 (HTTP method filtering)
Calico: full support for v1 spec
Weave Net: basic support
Flannel: none (pair with Calico for policy enforcement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're on Flannel without Calico, your Network Policies are being silently ignored. Check your CNI.&lt;/p&gt;
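&lt;p&gt;A quick way to verify rather than assume, sketched below. The &lt;code&gt;policy_support&lt;/code&gt; helper is hypothetical and just encodes the table above; the commented-out detection line assumes &lt;code&gt;kubectl&lt;/code&gt; access to a live cluster.&lt;/p&gt;

```shell
# Map a CNI name to its Network Policy support level (encodes the table above).
policy_support() {
  case "$1" in
    cilium)  echo "full, plus L7" ;;
    calico)  echo "full" ;;
    weave)   echo "basic" ;;
    flannel) echo "none" ;;
    *)       echo "unknown" ;;
  esac
}

# Detect the CNI from a live cluster (requires kubectl access):
# cni=$(kubectl get pods -n kube-system -o name | grep -oEm1 'cilium|calico|weave|flannel')
policy_support flannel   # prints: none
```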

&lt;h2&gt;
  
  
  The Rollout Strategy
&lt;/h2&gt;

&lt;p&gt;Don't apply strict policies to production on day one. You will cause an outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 (week 1)&lt;/strong&gt;: Deploy in dev cluster only, break things, fix things&lt;br&gt;
&lt;strong&gt;Phase 2 (week 2)&lt;/strong&gt;: Apply "log only" mode in staging (Cilium supports this)&lt;br&gt;
&lt;strong&gt;Phase 3 (week 3-4)&lt;/strong&gt;: Apply in staging as enforced, watch for issues&lt;br&gt;
&lt;strong&gt;Phase 4 (week 5+)&lt;/strong&gt;: Apply in production during a quiet window, have rollback ready&lt;/p&gt;

&lt;p&gt;Total time: 4-6 weeks for a full rollout. Faster rollouts cause faster outages.&lt;/p&gt;
&lt;h2&gt;
  
  
  Testing Network Policies
&lt;/h2&gt;

&lt;p&gt;Before deploying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test from inside the cluster&lt;/span&gt;
kubectl run test-pod &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; sh
wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; http://api-service:8080

&lt;span class="c"&gt;# Verify denied flows fail&lt;/span&gt;
wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; http://admin-service:8080 &lt;span class="c"&gt;# Should time out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI idea: spin up a cluster, apply your policies, run a test suite that asserts which pods can reach which others. Fails on policy regressions.&lt;/p&gt;
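&lt;p&gt;A minimal sketch of that test suite. The service names and the &lt;code&gt;kubectl run&lt;/code&gt; probe are assumptions; the idea is to compare each observed outcome against the policy's intent and fail CI on any mismatch.&lt;/p&gt;

```shell
# Compare an observed flow outcome against the policy's intent.
expect() {  # usage: expect PASS|FAIL actual
  if [ "$1" = "$2" ]; then echo "ok"; else echo "policy regression"; fi
}

# Probe a URL from a throwaway pod; a denied flow should time out.
probe() {
  if kubectl run np-probe --image=busybox --restart=Never --rm -i -- \
      wget -q -O /dev/null --timeout=3 "$1"; then echo PASS; else echo FAIL; fi
}

# In CI, against a scratch cluster with policies applied:
# expect PASS "$(probe http://api-service:8080)"     # allowed flow
# expect FAIL "$(probe http://admin-service:8080)"   # denied flow
expect PASS PASS   # prints: ok
```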

&lt;h2&gt;
  
  
  Observability Into Policies
&lt;/h2&gt;

&lt;p&gt;Cilium provides Hubble UI for real-time network flows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See which pods are talking&lt;/li&gt;
&lt;li&gt;See which flows are denied&lt;/li&gt;
&lt;li&gt;Visualize policy coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without something like Hubble, debugging Network Policy issues is archaeology. Invest in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Policy
&lt;/h2&gt;

&lt;p&gt;If you're starting from zero, here's the minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Default deny ingress across all namespaces&lt;/span&gt;
&lt;span class="c1"&gt;# 2. Allow DNS egress (kube-dns)&lt;/span&gt;
&lt;span class="c1"&gt;# 3. Allow same-namespace pod-to-pod&lt;/span&gt;
&lt;span class="c1"&gt;# 4. Explicit cross-namespace allows for specific services&lt;/span&gt;
&lt;span class="c1"&gt;# 5. Deny egress to private IP ranges (except approved)&lt;/span&gt;
&lt;span class="c1"&gt;# 6. Allow egress to 0.0.0.0/0 on 443 for external APIs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This covers 80% of the security benefit for 20% of the complexity. Tighten from there over time.&lt;/p&gt;
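&lt;p&gt;As a sketch, items 1-3 for a single namespace might look like this. The policy name is made up, and the namespace and labels follow the conventions used in the examples earlier in the post.&lt;/p&gt;

```yaml
# Sketch of items 1-3 for one namespace; names are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: baseline
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}   # same-namespace pod-to-pod
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```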

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Skipping DNS&lt;/strong&gt;: everything breaks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting metrics scrape paths&lt;/strong&gt;: Prometheus goes blind&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not testing in staging first&lt;/strong&gt;: production outage on day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using a CNI that doesn't support policies&lt;/strong&gt;: silent failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applying policies before installing a CNI that enforces them&lt;/strong&gt;: chicken and egg&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rollback plan&lt;/strong&gt;: panic mode when things break&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Network Policies are one of the highest-leverage security controls in Kubernetes. They're also one of the easiest ways to cause a self-inflicted outage. Treat them with respect.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>security</category>
      <category>networking</category>
      <category>sre</category>
    </item>
    <item>
      <title>Reducing Toil: The Google SRE Book Applied to Startups</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 04 May 2026 14:28:17 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/reducing-toil-the-google-sre-book-applied-to-startups-hcp</link>
      <guid>https://forem.com/samson_tanimawo/reducing-toil-the-google-sre-book-applied-to-startups-hcp</guid>
      <description>&lt;h2&gt;
  
  
  The Google Rule That Breaks at Startups
&lt;/h2&gt;

&lt;p&gt;Google's SRE book says: &lt;strong&gt;SRE time should be no more than 50% toil&lt;/strong&gt;. The other 50% must go to engineering work that reduces toil.&lt;/p&gt;

&lt;p&gt;At a 10-person startup, your "SRE team" is one overworked engineer. They're already at 95% toil. There is no slack to reduce it.&lt;/p&gt;

&lt;p&gt;So you have to be ruthless about what work is worth automating and what work is worth eliminating entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Toil Precisely
&lt;/h2&gt;

&lt;p&gt;Google's definition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manual&lt;/li&gt;
&lt;li&gt;Repetitive&lt;/li&gt;
&lt;li&gt;Automatable&lt;/li&gt;
&lt;li&gt;Tactical (not strategic)&lt;/li&gt;
&lt;li&gt;Lacks enduring value&lt;/li&gt;
&lt;li&gt;Scales linearly with service growth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a task checks all six boxes, it's toil. If it checks some but not all, it might be legitimate engineering work.&lt;/p&gt;

&lt;p&gt;Example: responding to an alert is tactical and lacks enduring value, but if it's not repetitive, it's not toil.&lt;/p&gt;

&lt;p&gt;Example: writing a new runbook is manual, but it's strategic and has enduring value, so it's not toil.&lt;/p&gt;
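&lt;p&gt;The all-six-boxes test can be made mechanical. A tiny sketch (the function name and flag encoding are made up for illustration):&lt;/p&gt;

```shell
# Toil is all-or-nothing: a task qualifies only if every box is checked.
# Flags: manual repetitive automatable tactical no-enduring-value scales-linearly
is_toil() {
  for flag in "$@"; do
    if [ "$flag" -ne 1 ]; then echo "not toil"; return; fi
  done
  echo "toil"
}

is_toil 1 1 1 1 1 1   # e.g. alert response for a known, recurring cause
is_toil 1 0 1 1 1 1   # e.g. a one-off task: manual, but not repetitive
```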

&lt;h2&gt;
  
  
  The Startup-Sized Toil Audit
&lt;/h2&gt;

&lt;p&gt;Track for 2 weeks. Every 30 minutes, write down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What am I doing right now?&lt;/li&gt;
&lt;li&gt;Is it toil (manual, repetitive, automatable)?&lt;/li&gt;
&lt;li&gt;How long have I been doing it this week?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of 2 weeks, you'll have a toil ranking. Pick the top 3.&lt;/p&gt;
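&lt;p&gt;The tracking itself doesn't need tooling. A text file and two shell one-liners are enough; the file name and categories below are illustrative.&lt;/p&gt;

```shell
# Minimal toil log: one tab-separated line per observation.
rm -f toil.log
log_toil() {
  printf '%s\t%s\t%s\n' "$(date +%F)" "$1" "$2" >> toil.log
}

log_toil deploy 15m
log_toil flaky-ci 20m
log_toil flaky-ci 25m

# Rank categories by frequency at the end of the two weeks:
cut -f2 toil.log | sort | uniq -c | sort -rn
```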

&lt;p&gt;Typical top offenders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually running deploys (30+ min/week)&lt;/li&gt;
&lt;li&gt;Responding to known false-positive alerts (3+ hours/week)&lt;/li&gt;
&lt;li&gt;Provisioning new dev environments (1+ hour per request)&lt;/li&gt;
&lt;li&gt;Checking on flaky CI runs (2+ hours/week)&lt;/li&gt;
&lt;li&gt;Writing the same runbook context in every incident (1+ hour/incident)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Rule 1: Eliminate Before Automating
&lt;/h2&gt;

&lt;p&gt;The best toil is toil you don't do at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before automating the deploy process&lt;/strong&gt;, ask: why do we deploy manually?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the answer is "we don't trust our tests" → fix the tests&lt;/li&gt;
&lt;li&gt;If the answer is "we need human approval" → build a self-serve approval flow&lt;/li&gt;
&lt;li&gt;If the answer is "production is scary" → build better rollback, then trust the automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Before automating alert response&lt;/strong&gt;, ask: why is the alert firing?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it's a false positive → fix the alert&lt;/li&gt;
&lt;li&gt;If it's a symptom of something deeper → fix the root cause&lt;/li&gt;
&lt;li&gt;If it's expected behavior → delete the alert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automating around false positives is worse than handling them manually: it hides the underlying problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 2: Automate the Second-Most Common Task
&lt;/h2&gt;

&lt;p&gt;Counter-intuitive but works:&lt;/p&gt;

&lt;p&gt;The most common manual task is usually the one you've already optimized manually. You've gotten fast at it.&lt;/p&gt;

&lt;p&gt;The second-most common task is where you're slow, it's still frequent, and automation has the highest ROI.&lt;/p&gt;

&lt;p&gt;Example: you spend 3 hours/week on deploys (already optimized with scripts). You spend 2 hours/week manually provisioning dev environments (still done via the UI). Automate the environments first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 3: Measure Before and After
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;toil_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;deploy_manual_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;35 min/week&lt;/span&gt;
&lt;span class="na"&gt;deploy_automated_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5 min/week&lt;/span&gt; &lt;span class="c1"&gt;# Saves 30 min/week&lt;/span&gt;

&lt;span class="na"&gt;alert_response_time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;180 min/week&lt;/span&gt;
&lt;span class="na"&gt;alert_response_time_after_tuning&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;45 min/week&lt;/span&gt; &lt;span class="c1"&gt;# Saves 135 min/week&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you can't measure the savings, you don't know if the automation worked.&lt;/p&gt;

&lt;p&gt;Rule: if building the automation costs more than 10x what the toil costs per year, don't automate it. Eliminate it.&lt;/p&gt;
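&lt;p&gt;A worked example of that rule as a sketch; the helper name and the numbers are illustrative. 30 minutes of weekly toil is about 26 hours per year, so the 10x threshold is 260 build-hours.&lt;/p&gt;

```shell
# Apply the 10x rule: automate only if the build cost stays under
# 10x the yearly cost of the toil.
decide() {  # usage: decide TOIL_MIN_PER_WEEK BUILD_HOURS
  yearly_toil_hours=$(( $1 * 52 / 60 ))
  if [ "$2" -gt $(( yearly_toil_hours * 10 )) ]; then
    echo "eliminate"
  else
    echo "automate"
  fi
}

decide 30 40    # 26 toil-hours/year vs a 40-hour build: automate
decide 5 120    # 4 toil-hours/year vs a 120-hour build: eliminate
```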

&lt;h2&gt;
  
  
  Rule 4: Self-Service Is the Force Multiplier
&lt;/h2&gt;

&lt;p&gt;At scale, toil scales linearly with team size. Self-service breaks the linear relationship.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of SRE provisioning dev environments → Terraform module + docs + CI approval&lt;/li&gt;
&lt;li&gt;Instead of SRE running database queries → read-only proxy with query approval flow&lt;/li&gt;
&lt;li&gt;Instead of SRE creating alerts → YAML templates engineers can copy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rule of thumb: &lt;strong&gt;if three different engineers have asked you to do the same thing, build it as self-service&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 5: Runbooks Are Temporary Debt
&lt;/h2&gt;

&lt;p&gt;A runbook says "here's the manual procedure." A good runbook is instructions for a bot, not a human.&lt;/p&gt;

&lt;p&gt;Every runbook should have an expiration date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;runbook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restart_stuck_worker&lt;/span&gt;
&lt;span class="na"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-01-15&lt;/span&gt;
&lt;span class="na"&gt;expires&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-04-15&lt;/span&gt;
&lt;span class="na"&gt;automation_ticket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;#4872&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a runbook is still manual after 3 months, either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's rare enough to not need automation&lt;/li&gt;
&lt;li&gt;We've failed to allocate time for automation&lt;/li&gt;
&lt;li&gt;The underlying issue should be fixed instead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Either way, reconsider.&lt;/p&gt;
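&lt;p&gt;The expiry check is easy to automate in CI. A sketch, assuming the &lt;code&gt;expires&lt;/code&gt; date from the stanza above has been extracted into YYYYMMDD form; the function name is made up:&lt;/p&gt;

```shell
# Compare dates as integers in YYYYMMDD form, so no date math is needed.
runbook_expired() {
  [ "$(date +%Y%m%d)" -gt "$1" ]
}

if runbook_expired 20240415; then
  echo "runbook expired: automate it, delete it, or justify keeping it manual"
fi
```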

&lt;h2&gt;
  
  
  The 80/20 Rule for Startup SRE
&lt;/h2&gt;

&lt;p&gt;At Google scale, you can justify building a platform team. At startup scale, you can't. So you apply Pareto ruthlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to automate (20% effort, 80% value)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploys (frees 30+ min/week, prevents errors)&lt;/li&gt;
&lt;li&gt;Dev environment provisioning (frees 2+ hrs/week)&lt;/li&gt;
&lt;li&gt;Known-cause alert response (frees 3+ hrs/week)&lt;/li&gt;
&lt;li&gt;Secret rotation&lt;/li&gt;
&lt;li&gt;Backup verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What not to automate (80% effort, 20% value)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provisioning one-off infrastructure&lt;/li&gt;
&lt;li&gt;Debugging novel issues&lt;/li&gt;
&lt;li&gt;Writing custom dashboards&lt;/li&gt;
&lt;li&gt;Responding to security incidents (needs judgment)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Save your engineering cycles for the high-leverage automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monthly Toil Review
&lt;/h2&gt;

&lt;p&gt;Every month, ask:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What toil do I do now that I didn't do last month?&lt;/li&gt;
&lt;li&gt;What toil did I eliminate in the last month?&lt;/li&gt;
&lt;li&gt;Is my total toil going up or down?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If toil is going up and there's no plan to reduce it, that's your biggest reliability problem. Not the outages. Not the alerts. The toil.&lt;/p&gt;

&lt;p&gt;Because toil crowds out the time you need to fix the underlying systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part
&lt;/h2&gt;

&lt;p&gt;The hardest part of toil reduction isn't technical. It's psychological.&lt;/p&gt;

&lt;p&gt;Toil feels productive. You finish tasks. You feel needed. You're the hero who fixed the broken thing.&lt;/p&gt;

&lt;p&gt;Engineering work to eliminate toil feels slow. You build for weeks before seeing results. Nobody pages you for doing it.&lt;/p&gt;

&lt;p&gt;Resist the dopamine of toil. The goal is to make yourself less needed, not more.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>toil</category>
      <category>productivity</category>
      <category>automation</category>
    </item>
    <item>
      <title>Incident Severity Levels: SEV-1 to SEV-5 Calibration</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 03 May 2026 14:27:49 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</link>
      <guid>https://forem.com/samson_tanimawo/incident-severity-levels-sev-1-to-sev-5-calibration-52j1</guid>
      <description>&lt;h2&gt;
  
  
  Why Severity Is Broken at Most Companies
&lt;/h2&gt;

&lt;p&gt;Everyone has severity levels. Almost nobody agrees on what they mean.&lt;/p&gt;

&lt;p&gt;Ask ten engineers what SEV-2 means and you'll get eight different answers. This causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under-paged incidents (people thought SEV-3 meant "no rush")&lt;/li&gt;
&lt;li&gt;Over-paged incidents (everything is SEV-1)&lt;/li&gt;
&lt;li&gt;Exhausted on-call (false alarms)&lt;/li&gt;
&lt;li&gt;Missed SLOs (incidents not escalated in time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calibration matters. Here's a set of definitions that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Five Levels
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SEV-1: Critical&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is completely down for all users&lt;/li&gt;
&lt;li&gt;Active data loss&lt;/li&gt;
&lt;li&gt;Security breach in progress&lt;/li&gt;
&lt;li&gt;Core business stopped (can't process payments, can't log in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 5 minutes&lt;br&gt;
Escalation: Immediate, all hands&lt;br&gt;
Post-mortem: Required, public within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-2: High&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary product is degraded for most users&lt;/li&gt;
&lt;li&gt;Core feature unavailable for a subset&lt;/li&gt;
&lt;li&gt;Significant customer impact but workaround exists&lt;/li&gt;
&lt;li&gt;Performance significantly degraded (&amp;gt;5x normal latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 15 minutes&lt;br&gt;
Escalation: Page primary on-call, notify secondary&lt;br&gt;
Post-mortem: Required, internal within 5 days&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-3: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-critical feature broken&lt;/li&gt;
&lt;li&gt;Affects a small percentage of users&lt;/li&gt;
&lt;li&gt;Degraded performance within tolerance&lt;/li&gt;
&lt;li&gt;Bug in new feature rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 hour&lt;br&gt;
Escalation: Page during business hours, ticket overnight&lt;br&gt;
Post-mortem: Recommended&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-4: Low&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minor bug with workaround&lt;/li&gt;
&lt;li&gt;Internal tooling broken&lt;/li&gt;
&lt;li&gt;Non-customer-facing issue&lt;/li&gt;
&lt;li&gt;Cosmetic problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: 1 business day&lt;br&gt;
Escalation: Ticket only&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SEV-5: Informational&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not actually broken&lt;/li&gt;
&lt;li&gt;Preemptive warning&lt;/li&gt;
&lt;li&gt;"This might become a problem"&lt;/li&gt;
&lt;li&gt;Observed anomaly without impact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Target response: Backlog&lt;br&gt;
Escalation: None&lt;br&gt;
Post-mortem: Not required&lt;/p&gt;
&lt;h2&gt;
  
  
  The Calibration Problem
&lt;/h2&gt;

&lt;p&gt;Levels written on paper are useless. What matters is &lt;strong&gt;consistent application&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Run this exercise: take your last 50 incidents. Ask three SRE leads to independently assign severity levels. Compare.&lt;/p&gt;

&lt;p&gt;If more than 20% disagree by at least one level, your definitions aren't calibrated. Run training.&lt;/p&gt;
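
&lt;p&gt;That exercise is easy to script. A minimal sketch; the incident IDs and ratings below are made up, plug in your own:&lt;/p&gt;

```python
# Sketch: measure severity calibration across three raters.
# The incident IDs and ratings are illustrative, not real data.
ratings = {
    "INC-101": [1, 1, 2],
    "INC-102": [2, 2, 2],
    "INC-103": [3, 2, 3],
    "INC-104": [4, 4, 4],
    "INC-105": [2, 3, 3],
}

def disagreement_rate(ratings):
    """Fraction of incidents where any two raters differ by one level or more."""
    disagreements = 0
    for levels in ratings.values():
        if max(levels) - min(levels) >= 1:
            disagreements += 1
    return disagreements / len(ratings)

rate = disagreement_rate(ratings)
print(f"disagreement: {rate:.0%}")  # disagreement: 60%
if rate > 0.20:
    print("definitions aren't calibrated: run training")
```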
&lt;h2&gt;
  
  
  The "When In Doubt" Rules
&lt;/h2&gt;

&lt;p&gt;When severity is ambiguous, default to &lt;strong&gt;higher severity&lt;/strong&gt; and downgrade if wrong.&lt;/p&gt;

&lt;p&gt;Better to over-escalate and apologize than under-escalate and miss a SEV-1 for 4 hours.&lt;/p&gt;

&lt;p&gt;Specific rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User data loss&lt;/strong&gt; → always SEV-1 or SEV-2, never lower&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security issue&lt;/strong&gt; → always SEV-1 or SEV-2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revenue impact&lt;/strong&gt; → SEV-2 minimum if measurable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uncertain scope&lt;/strong&gt; → start at higher severity, downgrade when scope is clear&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Customer Impact Matrix
&lt;/h2&gt;

&lt;p&gt;For fast calibration, use a matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| &amp;lt;1% users | 1-10% users | 10-50% | &amp;gt;50% users
Product Down | SEV-2 | SEV-1 | SEV-1 | SEV-1
Major Degraded | SEV-3 | SEV-2 | SEV-2 | SEV-1
Minor Degraded | SEV-4 | SEV-3 | SEV-2 | SEV-2
Workaround Exists| SEV-4 | SEV-4 | SEV-3 | SEV-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a fast severity assignment without relying on intuition.&lt;/p&gt;
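
&lt;p&gt;The matrix translates directly into a lookup. A minimal sketch; the impact labels are illustrative names for the rows above:&lt;/p&gt;

```python
# Severity lookup from the impact matrix above.
# Rows: impact type; columns: affected-user percentage buckets.
MATRIX = {
    "product_down":      [2, 1, 1, 1],
    "major_degraded":    [3, 2, 2, 1],
    "minor_degraded":    [4, 3, 2, 2],
    "workaround_exists": [4, 4, 3, 2],
}

def severity(impact, user_pct):
    """Return the SEV level for an impact type and affected-user percentage."""
    if user_pct > 50:
        col = 3
    elif user_pct > 10:
        col = 2
    elif user_pct >= 1:
        col = 1
    else:
        col = 0
    return MATRIX[impact][col]

print(severity("product_down", 0.5))   # 2  -> SEV-2
print(severity("major_degraded", 60))  # 1  -> SEV-1
```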

&lt;h2&gt;
  
  
  Time-Based Escalation
&lt;/h2&gt;

&lt;p&gt;Severity isn't fixed for the incident lifetime. It escalates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;sev_2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;auto_escalate_to_sev_1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_not_resolved_in&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60_minutes&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_user_impact_grows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;above_10_percent&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if_revenue_loss_exceeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$10000/hour&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start at SEV-2, auto-escalate if things worsen. Don't let an incident linger at the same severity if the impact is growing.&lt;/p&gt;
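
&lt;p&gt;The same rules as a plain function, a sketch using the illustrative thresholds above:&lt;/p&gt;

```python
# Sketch of the auto-escalation rules above. The thresholds mirror
# the YAML example; they aren't a fixed standard.
def should_escalate_sev2(minutes_open, user_impact_pct, revenue_loss_per_hour):
    """Return True if a SEV-2 should be promoted to SEV-1."""
    if minutes_open >= 60:
        return True          # not resolved in 60 minutes
    if user_impact_pct > 10:
        return True          # user impact grew past 10%
    if revenue_loss_per_hour > 10_000:
        return True          # measurable revenue loss above $10k/hour
    return False

print(should_escalate_sev2(45, 4, 2_000))   # False: still within bounds
print(should_escalate_sev2(45, 12, 2_000))  # True: impact grew past 10%
```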

&lt;h2&gt;
  
  
  The Downgrade Rule
&lt;/h2&gt;

&lt;p&gt;Downgrading is allowed &lt;strong&gt;but must be justified in writing&lt;/strong&gt; in the incident channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Downgrading from SEV-1 to SEV-2 at 10:23. Initial reports of
total outage were incorrect. Real impact is ~5% of users in
us-west-2 only. Ticket: INC-1234"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents silent downgrades that understate severity for retro analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Integration
&lt;/h2&gt;

&lt;p&gt;Your SLOs and severity levels should align:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO: 99.95% availability (21.6 min/month budget)

If this month's error budget burned:
&amp;lt;25% → normal operations
25-50% → no SEV-3 burn-down deploys
50-75% → SEV-2 threshold lowered
&amp;gt;75% → any degradation is SEV-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you're running low on error budget, everything gets more severe.&lt;/p&gt;
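
&lt;p&gt;The budget math and thresholds above, sketched in Python (assumes a 30-day month; the policy strings mirror the table):&lt;/p&gt;

```python
# Sketch of an SLO-aware severity policy using the thresholds above.
def monthly_error_budget_minutes(slo=0.9995, days=30):
    """Error budget in minutes for a given availability SLO."""
    return days * 24 * 60 * (1 - slo)

def policy(burned_fraction):
    """Map error-budget burn to the operating mode from the table."""
    if burned_fraction > 0.75:
        return "any degradation is SEV-1"
    if burned_fraction > 0.50:
        return "SEV-2 threshold lowered"
    if burned_fraction > 0.25:
        return "no SEV-3 burn-down deploys"
    return "normal operations"

budget = monthly_error_budget_minutes()
print(round(budget, 1))       # 21.6 minutes, matching the SLO above
print(policy(15.0 / budget))  # ~69% burned -> SEV-2 threshold lowered
```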

&lt;h2&gt;
  
  
  Practical Incident Categories
&lt;/h2&gt;

&lt;p&gt;Beyond numeric severity, label incidents by type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;INCIDENT_TYPES&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;infrastructure (AWS, networking)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;application (code bug)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment (bad release)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;capacity (scaling failure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data (corruption, loss)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;security (breach, exposure)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;external (3rd-party dependency)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Severity tells you how urgent. Type tells you who to page.&lt;/p&gt;
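
&lt;p&gt;A routing sketch built on those two axes; the team names are hypothetical:&lt;/p&gt;

```python
# Severity says how urgent; type says who gets paged.
# The on-call team names below are hypothetical.
ROUTING = {
    "infrastructure": "platform-oncall",
    "application":    "service-owner-oncall",
    "deployment":     "release-oncall",
    "capacity":       "platform-oncall",
    "data":           "data-oncall",
    "security":       "security-oncall",
    "external":       "vendor-liaison",
}

def page_target(incident_type, severity):
    """SEV-1 pages everyone; otherwise route by incident type."""
    if severity == 1:
        return "all-hands"
    return ROUTING.get(incident_type, "triage-oncall")

print(page_target("deployment", 2))  # release-oncall
print(page_target("security", 1))    # all-hands
```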

&lt;h2&gt;
  
  
  The Monthly Review
&lt;/h2&gt;

&lt;p&gt;Once a month, review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All SEV-1s and SEV-2s&lt;/li&gt;
&lt;li&gt;Any SEV-3 that should have been SEV-2&lt;/li&gt;
&lt;li&gt;Any SEV-2 that should have been SEV-3&lt;/li&gt;
&lt;li&gt;Average time from incident open to correct severity assignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Adjust the definitions based on what you learn. Severity is a living standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pet severity&lt;/strong&gt;: every team invents its own. Standardize company-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEV-0&lt;/strong&gt;: don't add levels above SEV-1. Just use "SEV-1, all hands."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity inflation&lt;/strong&gt;: if every incident is SEV-2, nobody takes SEV-2 seriously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity deflation&lt;/strong&gt;: pressure to avoid post-mortems leads to fake SEV-4s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unchanging severity&lt;/strong&gt;: escalation is a tool; use it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;Severity should mean the same thing to every person in the org. Engineers, PMs, execs, customer support.&lt;/p&gt;

&lt;p&gt;When someone says "SEV-1," everyone should know what that means, how urgent it is, and what the response looks like.&lt;/p&gt;

&lt;p&gt;When you achieve that, incident response gets dramatically better.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>incidents</category>
      <category>sre</category>
      <category>oncall</category>
      <category>process</category>
    </item>
    <item>
      <title>Memory Leak Detection in Long-Running Services</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sat, 02 May 2026 14:27:34 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</link>
      <guid>https://forem.com/samson_tanimawo/memory-leak-detection-in-long-running-services-l9f</guid>
      <description>&lt;h2&gt;
  
  
  The Slowest Incident to Diagnose
&lt;/h2&gt;

&lt;p&gt;Memory leaks are sneaky. The service runs fine for hours. Then, slowly, it gets worse. Slower responses, more GC pauses, eventual OOM kills.&lt;/p&gt;

&lt;p&gt;And when you look at the first 30 minutes of metrics, everything looks normal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Flavors of Memory Growth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. True leaks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects allocated but never freed&lt;/li&gt;
&lt;li&gt;Classic in C/C++, rare in Go/Java with GC&lt;/li&gt;
&lt;li&gt;Grows linearly forever until OOM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Unbounded caches&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache adds entries but never evicts&lt;/li&gt;
&lt;li&gt;Common in Node.js, Python, Go&lt;/li&gt;
&lt;li&gt;Grows until memory pressure triggers other issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Memory fragmentation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Heap is large but not usable&lt;/li&gt;
&lt;li&gt;Happens in long-running Java, Go, and .NET services&lt;/li&gt;
&lt;li&gt;Not really a "leak" but behaves like one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three cause the same symptom: memory grows over time. Treatment is different for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection Without Heap Dumps
&lt;/h2&gt;

&lt;p&gt;Before you reach for pprof or heap dumps, the fastest diagnosis is graph-watching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Is memory growing linearly over the last 24 hours?
deriv(container_memory_working_set_bytes{service="api"}[24h]) &amp;gt; 0

# Is GC pause time increasing?
rate(jvm_gc_pause_seconds_sum[1h]) &amp;gt; 0.05
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If memory is growing by ~500MB/day and GC pauses are increasing, you have a leak. Diagnosis complete.&lt;/p&gt;

&lt;p&gt;The question is &lt;strong&gt;where&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Go Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Go makes this relatively easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="s"&gt;"net/http/pprof"&lt;/span&gt;

&lt;span class="c"&gt;// In main():&lt;/span&gt;
&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ListenAndServe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;":6060"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get a heap profile&lt;/span&gt;
go tool pprof http://localhost:6060/debug/pprof/heap

&lt;span class="c"&gt;# In the pprof shell:&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; top
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; list suspiciousFunction
&lt;span class="o"&gt;(&lt;/span&gt;pprof&lt;span class="o"&gt;)&lt;/span&gt; web &lt;span class="c"&gt;# generates a SVG callgraph&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objects with high &lt;code&gt;inuse_space&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Objects with growing counts over time&lt;/li&gt;
&lt;li&gt;Unexpected large maps or slices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key trick&lt;/strong&gt;: take two heap profiles 1 hour apart and diff them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go tool pprof &lt;span class="nt"&gt;-base&lt;/span&gt; heap1.pprof heap2.pprof
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What shows up as "new" allocations in the diff is almost certainly your leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  Java Memory Profiling
&lt;/h2&gt;

&lt;p&gt;Java is harder because the JVM adds layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dump the heap&lt;/span&gt;
jmap &lt;span class="nt"&gt;-dump&lt;/span&gt;:format&lt;span class="o"&gt;=&lt;/span&gt;b,file&lt;span class="o"&gt;=&lt;/span&gt;heap.hprof &amp;lt;pid&amp;gt;

&lt;span class="c"&gt;# Analyze with Eclipse MAT or JVisualVM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In MAT, look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leak Suspects report (automatic)&lt;/li&gt;
&lt;li&gt;Dominator tree (what's holding the most memory)&lt;/li&gt;
&lt;li&gt;GC roots path (what's preventing garbage collection)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common Java culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static collections (especially &lt;code&gt;static Map&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;ThreadLocal values without cleanup&lt;/li&gt;
&lt;li&gt;Listeners/callbacks registered but never unregistered&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;finalize()&lt;/code&gt; methods delaying collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Node.js Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable the inspector&lt;/span&gt;
&lt;span class="nx"&gt;node&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nx"&gt;inspect&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;js&lt;/span&gt;

&lt;span class="c1"&gt;// Then in Chrome DevTools → Memory → Heap Snapshot&lt;/span&gt;
&lt;span class="c1"&gt;// Take 3 snapshots: baseline, after 10 min, after 20 min&lt;/span&gt;
&lt;span class="c1"&gt;// Compare to find retained objects&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Node culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event emitter listeners that accumulate&lt;/li&gt;
&lt;li&gt;Closures holding references to large objects&lt;/li&gt;
&lt;li&gt;Unbounded caches (remember, Node has no built-in LRU)&lt;/li&gt;
&lt;li&gt;Stream buffers not being drained&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python Memory Profiling
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;
&lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#... run the leaky operation...
&lt;/span&gt;
&lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;top_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lineno&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_stats&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;code&gt;memory_profiler&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;memory_profiler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;profile&lt;/span&gt;

&lt;span class="nd"&gt;@profile&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;suspect_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
&lt;span class="c1"&gt;# code here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common Python culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Global lists/dicts growing unbounded&lt;/li&gt;
&lt;li&gt;Reference cycles with &lt;code&gt;__del__&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;C extensions leaking (hardest to find)&lt;/li&gt;
&lt;li&gt;Pandas DataFrames kept around too long&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Cache Leak Special Case
&lt;/h2&gt;

&lt;p&gt;The most common "leak" isn't a leak at all. It's a cache without eviction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD: unbounded
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD: bounded LRU
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch_from_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always bound your caches. Always.&lt;/p&gt;
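
&lt;p&gt;&lt;code&gt;lru_cache&lt;/code&gt; bounds size but not age. A sketch of a cache bounded by both; the class and parameter names are illustrative:&lt;/p&gt;

```python
# Sketch: a cache bounded by both entry count and entry age.
import time
from collections import OrderedDict

class BoundedTTLCache:
    def __init__(self, maxsize=10_000, ttl_seconds=300):
        self.maxsize = maxsize
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (value, inserted_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, inserted_at = item
        if time.monotonic() - inserted_at > self.ttl:
            del self._data[key]  # expired: drop it
            return None
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the oldest entry

cache = BoundedTTLCache(maxsize=2, ttl_seconds=60)
cache.put("a", 1)
cache.put("b", 2)
cache.put("c", 3)      # size bound evicts "a"
print(cache.get("a"))  # None
print(cache.get("c"))  # 3
```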

&lt;h2&gt;
  
  
  Fragmentation in Go
&lt;/h2&gt;

&lt;p&gt;Go's garbage collector can leave the heap fragmented. You see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime memory is high&lt;/li&gt;
&lt;li&gt;Heap profile shows low allocations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;runtime.GC()&lt;/code&gt; doesn't reduce usage much&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution: tune &lt;code&gt;GOGC&lt;/code&gt; or force memory release:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="s"&gt;"runtime/debug"&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetGCPercent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// More aggressive GC&lt;/span&gt;
&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FreeOSMemory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c"&gt;// Return memory to OS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Long-Running Service Pattern
&lt;/h2&gt;

&lt;p&gt;Services that run for weeks without restart accumulate cruft. Even without leaks.&lt;/p&gt;

&lt;p&gt;We use this pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deployment_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;max_uptime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;7d&lt;/span&gt;
&lt;span class="na"&gt;restart_schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rolling&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restart&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;every&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;days"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pod gets restarted weekly during a quiet window. Eliminates slow memory growth as a class of problem.&lt;/p&gt;

&lt;p&gt;This isn't defeat. It's acknowledging that long-running processes in any language eventually accumulate state you don't want.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Diagnostic Checklist
&lt;/h2&gt;

&lt;p&gt;When a service is suspected of leaking:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is memory growing linearly or logarithmically? (linear = real leak)&lt;/li&gt;
&lt;li&gt;Is GC frequency/duration increasing? (yes = real pressure)&lt;/li&gt;
&lt;li&gt;Are request rates growing proportionally? (yes = normal growth, not leak)&lt;/li&gt;
&lt;li&gt;Take heap profile, save baseline&lt;/li&gt;
&lt;li&gt;Wait 1 hour, take second profile, diff&lt;/li&gt;
&lt;li&gt;Look for unexpected high-count objects&lt;/li&gt;
&lt;li&gt;Trace back to allocation site&lt;/li&gt;
&lt;li&gt;Fix the leak, deploy, watch metrics for 24h&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Rinse and repeat. Memory leaks are annoying but systematically fixable.&lt;/p&gt;
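
&lt;p&gt;Step 1 of the checklist can be automated. A sketch that fits least-squares slopes to hourly RSS samples; the numbers and the 0.8 threshold are made up for illustration:&lt;/p&gt;

```python
# Sketch: is memory growth linear (a leak) or flattening (a warm-up)?
def slope(ys):
    """Least-squares slope of ys against their sample indices."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def looks_like_leak(samples_mb):
    """Linear growth keeps its slope in the second half; a plateau doesn't."""
    half = len(samples_mb) // 2
    early = slope(samples_mb[:half])
    late = slope(samples_mb[half:])
    return late > 0.8 * early and late > 0

leak = [500, 520, 541, 560, 581, 600, 621, 640]    # steady ~20 MB/hour
warmup = [500, 560, 590, 605, 612, 615, 616, 616]  # flattening out
print(looks_like_leak(leak))    # True
print(looks_like_leak(warmup))  # False
```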




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>memory</category>
      <category>sre</category>
      <category>performance</category>
    </item>
    <item>
      <title>CI/CD Reliability: When Your Deploy Pipeline is Your SPOF</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Fri, 01 May 2026 14:23:02 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</link>
      <guid>https://forem.com/samson_tanimawo/cicd-reliability-when-your-deploy-pipeline-is-your-spof-4g7i</guid>
      <description>&lt;h2&gt;
  
  
  The Invisible SPOF
&lt;/h2&gt;

&lt;p&gt;Every engineering org has a single point of failure that nobody lists on their risk registry: &lt;strong&gt;the deploy pipeline itself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When CI/CD breaks, you can't ship features. You can't deploy hotfixes. You can't roll back a broken release. Your production doesn't go down, but your ability to fix production does.&lt;/p&gt;

&lt;p&gt;We had a 4-hour outage last year caused by a GitHub Actions incident. Not a single server went down. We just couldn't deploy the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Categorizing the Risk
&lt;/h2&gt;

&lt;p&gt;Your pipeline consists of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source_control&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub, GitLab, Bitbucket&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;can't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;PRs"&lt;/span&gt;

&lt;span class="na"&gt;ci_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# GitHub Actions, CircleCI, self-hosted&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;builds&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;run"&lt;/span&gt;

&lt;span class="na"&gt;artifact_storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ECR, Artifactory, S3&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;push"&lt;/span&gt;

&lt;span class="na"&gt;deployment_controller&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# ArgoCD, Flux, Spinnaker&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploys&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;happen"&lt;/span&gt;

&lt;span class="na"&gt;cluster_api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# k8s API, cloud provider API&lt;/span&gt;
&lt;span class="na"&gt;failure_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;don't&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;change"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is a failure domain. A serious pipeline needs fallback plans for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Manual Escape Hatch
&lt;/h2&gt;

&lt;p&gt;Rule #1: &lt;strong&gt;You must have a documented path to deploy manually.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not for daily use; for emergencies. Every team should know:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to build the image locally&lt;/li&gt;
&lt;li&gt;How to push to the registry&lt;/li&gt;
&lt;li&gt;How to update the cluster without the normal pipeline&lt;/li&gt;
&lt;li&gt;Who has permission to do this in production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We test this quarterly. Every SRE must deploy one service manually, end-to-end, in under 10 minutes, without the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardening the Pipeline Itself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pin your dependencies&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# BAD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@main&lt;/span&gt;

&lt;span class="c1"&gt;# GOOD&lt;/span&gt;
&lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;actions/checkout@main&lt;/code&gt; breaks, your deploys break. Pin to versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Mirror your registries&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/yourorg&lt;/span&gt;
&lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ecr.amazonaws.com/yourorg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the primary registry is down, you need a mirror. Every production image should exist in at least two registries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Monitor the pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You probably monitor your services. Do you monitor your CI?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;pipeline_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;build_success_rate (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;99%)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;deploy_duration_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;time_to_rollback_p99 (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;2 min)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;runner_queue_depth (target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;5)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on these the same way you'd alert on a service.&lt;/p&gt;
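One way to wire those targets into alerting, assuming Prometheus-style rules; the metric name is hypothetical and presumes your CI exporter publishes it.

```yaml
# Prometheus-style alert sketch; the metric name is an assumption
# about what your CI exporter publishes.
groups:
  - name: pipeline
    rules:
      - alert: RollbackTooSlow
        expr: time_to_rollback_p99_seconds &amp;gt; 120
        labels:
          severity: page
        annotations:
          summary: "Rollback p99 exceeds the 2-minute target"
```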

&lt;p&gt;&lt;strong&gt;4. Test disaster modes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Can you ship if GitHub Actions is down?&lt;br&gt;
Can you ship if the main registry is unreachable?&lt;br&gt;
Can you ship if ArgoCD is down?&lt;/p&gt;

&lt;p&gt;If the answer is "no", you have undocumented SPOFs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Rollback Rule
&lt;/h2&gt;

&lt;p&gt;Every deploy must be reversible in under 2 minutes. No exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;time_to_deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15 minutes&lt;/span&gt;
&lt;span class="na"&gt;time_to_rollback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;90 seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your rollback takes longer than your deploy, your pipeline is backwards.&lt;/p&gt;

&lt;p&gt;How to achieve fast rollbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep the previous image running in parallel during deploys&lt;/li&gt;
&lt;li&gt;Use traffic-shifting deploys (ALB weights, Istio)&lt;/li&gt;
&lt;li&gt;Label every image with the git commit&lt;/li&gt;
&lt;li&gt;Never ship a rollback path you haven't tested&lt;/li&gt;
&lt;/ul&gt;
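The traffic-shifting approach can be sketched with Istio-style weights (host and subset names are illustrative). Because the previous image is still running, rolling back is just flipping the weights:

```yaml
# Istio VirtualService route sketch (names illustrative).
# Rollback = set the new subset's weight to 0 and the old one to 100.
http:
  - route:
      - destination:
          host: app
          subset: v42        # new release
        weight: 0            # rolled back
      - destination:
          host: app
          subset: v41        # previous image, kept warm during the deploy
        weight: 100
```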

&lt;h2&gt;
  
  
  The Deploy Freeze
&lt;/h2&gt;

&lt;p&gt;Some teams never deploy on Fridays. This is cargo-culting.&lt;/p&gt;

&lt;p&gt;The real rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don't deploy when the on-call person is asleep&lt;/li&gt;
&lt;li&gt;Don't deploy during peak traffic windows&lt;/li&gt;
&lt;li&gt;Don't deploy major changes during holidays&lt;/li&gt;
&lt;li&gt;DO deploy hotfixes anytime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Friday at 5pm is the only time you can deploy a fix, you deploy. The alternative is customers suffering all weekend.&lt;/p&gt;

&lt;p&gt;A reliable pipeline makes any-time deploys safe. Banning Friday deploys means your pipeline isn't reliable enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-Provider Strategy
&lt;/h2&gt;

&lt;p&gt;Big-ticket item: run critical workloads on CI from a different vendor than your code host.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code: GitHub
CI: CircleCI (not GitHub Actions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When GitHub Actions is down (it happens twice a year), your builds still run. When CircleCI is down, you can fall back to GitHub Actions.&lt;/p&gt;

&lt;p&gt;This doubles your CI bill but removes a major SPOF.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Break Glass" Deploy
&lt;/h2&gt;

&lt;p&gt;Every pipeline should have an emergency bypass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Normal deploy (takes 15 minutes, runs all tests)&lt;/span&gt;
./deploy.sh

&lt;span class="c"&gt;# Break-glass deploy (skips tests, full audit log, Slack alert)&lt;/span&gt;
./deploy.sh &lt;span class="nt"&gt;--break-glass&lt;/span&gt; &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"Fixing P1 incident #1234"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The break-glass path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires written justification&lt;/li&gt;
&lt;li&gt;Skips long-running tests&lt;/li&gt;
&lt;li&gt;Notifies the whole team&lt;/li&gt;
&lt;li&gt;Writes to a permanent audit log&lt;/li&gt;
&lt;li&gt;Can only be used while an incident is in progress&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used maybe 3-5 times a year. Without it, your 2-hour deploy pipeline becomes a bottleneck when every minute matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metric That Matters Most
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mean Time to Deploy a Hotfix (MTTDHF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From "we need to fix this" to "fix is in production" how long?&lt;/p&gt;

&lt;p&gt;Good: under 30 minutes&lt;br&gt;
Great: under 10 minutes&lt;br&gt;
Unicorn: under 5 minutes&lt;/p&gt;

&lt;p&gt;Track this. Optimize it. It's the most important reliability metric nobody talks about.&lt;/p&gt;
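Computing it is trivial once you log the two timestamps; a minimal sketch, assuming GNU date and illustrative timestamps:

```shell
# MTTDHF sketch: minutes from "decided to fix" to "fix is live".
# Timestamps are illustrative; in practice pull them from your
# incident tracker and deploy log. Requires GNU date.
decided="2026-05-05T14:00:00Z"
live="2026-05-05T14:12:30Z"
t0=$(date -u -d "$decided" +%s)
t1=$(date -u -d "$live" +%s)
mttdhf_min=$(( (t1 - t0) / 60 ))
echo "MTTDHF: ${mttdhf_min} min"
```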

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Your pipeline is production infrastructure. Treat it with the same respect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor it&lt;/li&gt;
&lt;li&gt;Back it up&lt;/li&gt;
&lt;li&gt;Test failure modes&lt;/li&gt;
&lt;li&gt;Document manual paths&lt;/li&gt;
&lt;li&gt;Never let it become a SPOF&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it breaks during an incident, you'll be very glad you did.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>Multi-Region Failover: Lessons from Running It Hot</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:47:08 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</link>
      <guid>https://forem.com/samson_tanimawo/multi-region-failover-lessons-from-running-it-hot-9c4</guid>
      <description>&lt;h2&gt;
  
  
  Why "Hot" Matters
&lt;/h2&gt;

&lt;p&gt;Three multi-region strategies:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold&lt;/strong&gt;: Backup region is off. You start it when primary fails. RTO: hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warm&lt;/strong&gt;: Backup region runs on minimum capacity. Scale up on failover. RTO: 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hot&lt;/strong&gt;: Both regions serve live traffic simultaneously. RTO: seconds.&lt;/p&gt;

&lt;p&gt;If you need an RTO under 15 minutes, you need hot. Everything else is marketing copy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Illusion of Warm Failover
&lt;/h2&gt;

&lt;p&gt;Warm sounds great on paper. In practice, on the day you need it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warm region has never seen real load&lt;/li&gt;
&lt;li&gt;DNS cache propagation takes 5-15 minutes&lt;/li&gt;
&lt;li&gt;Autoscaling lags because it's starting cold&lt;/li&gt;
&lt;li&gt;Your team has never run on the warm region&lt;/li&gt;
&lt;li&gt;Half your connection strings are hardcoded to the primary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Warm failover works in tabletop exercises. It does not work during real incidents under pressure. We learned this the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Hot: The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐
│ DNS / CDN │
└──────┬──────┘
│
┌──────────┴──────────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Region A │ │ Region B │
│ 50% TX │ │ 50% TX │
└────┬─────┘ └─────┬────┘
│ │
└──────┬──────┬────────┘
▼ ▼
┌─────────────┐
│ Shared DB │
│ (writer + │
│ replicas) │
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both regions always serve traffic. Split is usually 50/50 but can shift.&lt;/p&gt;
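The split can be sketched as weighted DNS records (Route 53 semantics; zone, record, and health-check names are illustrative), so shifting the split is a records update rather than a redeploy:

```yaml
# Weighted-routing sketch (Route 53 semantics; names illustrative).
# Health checks pull a region out of DNS automatically on failure.
records:
  - name: api.example.com
    set_identifier: region-a
    weight: 50
    value: a.lb.example.com
    health_check: region-a-check
  - name: api.example.com
    set_identifier: region-b
    weight: 50
    value: b.lb.example.com
    health_check: region-b-check
```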

&lt;h2&gt;
  
  
  The Hard Parts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Database replication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where multi-region gets hard. Three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single writer, multi-region readers&lt;/strong&gt;: simplest, but writes pay cross-region latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-master&lt;/strong&gt;: truly hot, but complex and requires conflict resolution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Region-sharded&lt;/strong&gt;: users pinned to a region for writes, simplest if your data model allows it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We use region-sharded for user-scoped data and single-writer for global config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Session stickiness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a user's session is in Region A, and their next request goes to Region B, things break.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT tokens with no server state&lt;/li&gt;
&lt;li&gt;Session data in a globally replicated store (DynamoDB Global Tables, CockroachDB)&lt;/li&gt;
&lt;li&gt;Cookie routing that pins a user to a region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Cache coherence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Region A's cache doesn't know when Region B updates the database. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Short TTLs (1-5 minutes) and accept the inconsistency&lt;/li&gt;
&lt;li&gt;Pub/sub cache invalidation across regions (complex)&lt;/li&gt;
&lt;li&gt;Read-through caches only, never write-through&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Failover Mechanics
&lt;/h2&gt;

&lt;p&gt;When Region A dies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Health checks detect the failure&lt;/strong&gt;: Route 53/ALB removes Region A from DNS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic shifts to Region B&lt;/strong&gt;: already warm, already running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autoscaling kicks in&lt;/strong&gt;: Region B doubles capacity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User sessions degrade gracefully&lt;/strong&gt;: re-authentication, cache warmup&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring reports the shift&lt;/strong&gt;: the team gets paged, not customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Target: customer-facing latency spike of under 30 seconds, full recovery under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing It Monthly
&lt;/h2&gt;

&lt;p&gt;If you don't test failover monthly, you don't have failover. You have hope.&lt;/p&gt;

&lt;p&gt;We do this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First Tuesday of every month, 10 AM&lt;/li&gt;
&lt;li&gt;Route 100% of traffic to Region B for 30 minutes&lt;/li&gt;
&lt;li&gt;Watch dashboards, fix anything that degrades&lt;/li&gt;
&lt;li&gt;Route back to 50/50&lt;/li&gt;
&lt;li&gt;Document any issues, fix them&lt;/li&gt;
&lt;li&gt;Repeat next month with the other region&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Reality Check
&lt;/h2&gt;

&lt;p&gt;Running hot doubles your compute cost. For most companies, that's $10K-$100K/month.&lt;/p&gt;

&lt;p&gt;The question is: what's your revenue per hour of downtime?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Company A: $100K/hr revenue → 99.99% target → hot is worth it
Company B: $1K/hr revenue → 99.9% target → warm is fine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do the math. Don't copy FAANG patterns if your revenue doesn't justify them.&lt;/p&gt;
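The math might look like this sketch; every number here is an assumption to replace with your own figures:

```shell
# Back-of-envelope break-even for running hot. All numbers are
# assumptions: plug in your own revenue, downtime, and compute costs.
revenue_per_hr=100000            # like Company A in the example above
downtime_hrs_avoided_per_yr=6    # hot vs. warm, assumed
hot_extra_cost_per_mo=30000      # extra compute for the second region
hot_extra_cost_per_yr=$((hot_extra_cost_per_mo * 12))
avoided_loss=$((revenue_per_hr * downtime_hrs_avoided_per_yr))

if [ "$avoided_loss" -gt "$hot_extra_cost_per_yr" ]; then
  echo "hot is worth it: avoid \$$avoided_loss vs \$$hot_extra_cost_per_yr extra"
else
  echo "stay warm"
fi
```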

&lt;h2&gt;
  
  
  The Operational Complexity Tax
&lt;/h2&gt;

&lt;p&gt;Running hot costs more than money. It costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More runbooks&lt;/strong&gt; (one per region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More monitoring&lt;/strong&gt; (cross-region latency, replication lag)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Harder debugging&lt;/strong&gt; ("which region was this request in?")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compliance surface&lt;/strong&gt; (data residency in each region)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More deployment pipelines&lt;/strong&gt; (usually)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Budget 20% more engineering time for multi-region from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single point of failure in DNS config&lt;/strong&gt;: your DNS provider becomes the new SPOF&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing only with healthy traffic&lt;/strong&gt;: test with 2x normal load during drills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting about databases&lt;/strong&gt;: DB failover is the hardest part&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using regions as backup, not active&lt;/strong&gt;: never tested until a crisis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not planning for split-brain&lt;/strong&gt;: what if both regions think they're primary?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Minimum Viable Hot Setup
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Two regions, stateless app tier, 50/50 traffic&lt;/li&gt;
&lt;li&gt;Database: multi-AZ primary, cross-region async replica&lt;/li&gt;
&lt;li&gt;CDN/DNS: health-check-based routing&lt;/li&gt;
&lt;li&gt;Session: JWT-based (stateless)&lt;/li&gt;
&lt;li&gt;Monthly failover drills&lt;/li&gt;
&lt;li&gt;Runbooks tested in last 90 days&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start there. Layer in complexity as you need it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>multiregion</category>
      <category>failover</category>
      <category>sre</category>
      <category>aws</category>
    </item>
    <item>
      <title>Disaster Recovery Drills That Actually Work</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 29 Apr 2026 15:45:56 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</link>
      <guid>https://forem.com/samson_tanimawo/disaster-recovery-drills-that-actually-work-2npp</guid>
      <description>&lt;h2&gt;
  
  
  Most DR Drills Are Theater
&lt;/h2&gt;

&lt;p&gt;Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.&lt;/p&gt;

&lt;p&gt;Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.&lt;/p&gt;

&lt;p&gt;Real DR drills test whether your team can actually recover, not whether they can talk about recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Levels of DR Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Tabletop&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through a scenario on paper&lt;/li&gt;
&lt;li&gt;Identify missing runbooks&lt;/li&gt;
&lt;li&gt;Find ownership gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: New team members, initial gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Doesn't prove anything actually works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Partial Failure Test&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail one component in staging&lt;/li&gt;
&lt;li&gt;Watch recovery happen with real tools&lt;/li&gt;
&lt;li&gt;Time the full recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Validating specific runbooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Staging ≠ production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Full Production Drill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail a real production component&lt;/li&gt;
&lt;li&gt;Customer-facing (announce a maintenance window if needed)&lt;/li&gt;
&lt;li&gt;Full team responds as if it's real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Proving you can recover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Scary, high-coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real DR Drill Scenario
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Primary database becomes unreachable&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (48 hours before)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedule window with product team&lt;/li&gt;
&lt;li&gt;Pick a time when customer impact is minimal&lt;/li&gt;
&lt;li&gt;Brief the team: "Something will fail tomorrow, respond as normal"&lt;/li&gt;
&lt;li&gt;Pre-position the incident commander&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At T+0, block network access to primary database via iptables&lt;/li&gt;
&lt;li&gt;Start stopwatch&lt;/li&gt;
&lt;li&gt;Watch the team respond&lt;/li&gt;
&lt;li&gt;Do NOT intervene or give hints&lt;/li&gt;
&lt;li&gt;Document every action, every decision, every delay&lt;/li&gt;
&lt;/ol&gt;
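The injection step can be sketched as below. The database IP and port are assumptions, and the functions only print the iptables commands (a dry run) so nothing fires by accident; run the printed commands as root on the app hosts during the drill, and keep the undo ready before you start.

```shell
# Dry-run sketch of the failure injection and its undo.
# DB_IP and DB_PORT are assumptions; substitute your own.
DB_IP="10.0.2.15"
DB_PORT=5432

# Print (not run) the rule that makes the primary DB unreachable.
inject()   { echo "iptables -A OUTPUT -d $DB_IP -p tcp --dport $DB_PORT -j DROP"; }
# Print the matching delete to restore connectivity at the end.
rollback() { echo "iptables -D OUTPUT -d $DB_IP -p tcp --dport $DB_PORT -j DROP"; }

inject     # T+0: start the stopwatch
rollback   # end of drill
```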

&lt;p&gt;&lt;strong&gt;Metrics to capture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to detection (first alert fires)&lt;/li&gt;
&lt;li&gt;Time to engagement (first human acknowledges)&lt;/li&gt;
&lt;li&gt;Time to diagnosis ("we know what's wrong")&lt;/li&gt;
&lt;li&gt;Time to mitigation (customer impact stops)&lt;/li&gt;
&lt;li&gt;Time to recovery (fully restored)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring the Drill
&lt;/h2&gt;

&lt;p&gt;Good scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection: under 2 minutes&lt;/li&gt;
&lt;li&gt;Engagement: under 5 minutes&lt;/li&gt;
&lt;li&gt;Diagnosis: under 15 minutes&lt;/li&gt;
&lt;li&gt;Mitigation: under 30 minutes (for DR scenarios)&lt;/li&gt;
&lt;li&gt;Recovery: depends on scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are 5x longer than target, you have a real problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Always Goes Wrong
&lt;/h2&gt;

&lt;p&gt;In every DR drill we run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runbook is out of date.&lt;/strong&gt; The one that worked 6 months ago has wrong commands now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials don't work.&lt;/strong&gt; The service account was rotated, nobody updated the runbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup is untested.&lt;/strong&gt; The restore fails because the backup is corrupted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation paths are stale.&lt;/strong&gt; The "DBA on-call" has left the company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies are missing.&lt;/strong&gt; The recovery playbook assumes Service X is up, but Service X depends on the failed component.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every drill uncovers 3-5 of these. Fix them, then drill again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Engineering vs. DR Drills
&lt;/h2&gt;

&lt;p&gt;These are different. Chaos engineering is continuous (daily/weekly) and usually automated. DR drills are intentional and large-scale.&lt;/p&gt;

&lt;p&gt;Chaos engineering answers: "Can we survive small failures routinely?"&lt;/p&gt;

&lt;p&gt;DR drills answer: "Can we survive catastrophic failures at all?"&lt;/p&gt;

&lt;p&gt;You need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blame-Free Rule
&lt;/h2&gt;

&lt;p&gt;DR drills expose weaknesses. Those weaknesses are process problems, not people problems.&lt;/p&gt;

&lt;p&gt;The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No firing based on drill performance&lt;/li&gt;
&lt;li&gt;No promotions based on "being the hero"&lt;/li&gt;
&lt;li&gt;Focus on process gaps, not individual failures&lt;/li&gt;
&lt;li&gt;The post-drill retrospective is 90% about fixing systems, 10% about training people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the team is afraid of the drill, you'll never learn anything real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequency That Actually Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also: after every major infrastructure change, drill the affected components.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Lesson
&lt;/h2&gt;

&lt;p&gt;The drill is the easy part. The hard part is &lt;strong&gt;making the fixes from the drill a priority&lt;/strong&gt; when there's feature pressure.&lt;/p&gt;

&lt;p&gt;We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you've never done a DR drill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one scenario (database failure, region outage, API gateway down)&lt;/li&gt;
&lt;li&gt;Schedule it for a quiet hour&lt;/li&gt;
&lt;li&gt;Run a tabletop first to find the obvious gaps&lt;/li&gt;
&lt;li&gt;Fix those gaps&lt;/li&gt;
&lt;li&gt;Run a partial failure test in staging&lt;/li&gt;
&lt;li&gt;Measure everything&lt;/li&gt;
&lt;li&gt;Run a retro focused on process&lt;/li&gt;
&lt;li&gt;Schedule the next one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dr</category>
      <category>reliability</category>
      <category>sre</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Disaster Recovery Drills That Actually Work</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:41:48 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/disaster-recovery-drills-that-actually-work-1n83</link>
      <guid>https://forem.com/samson_tanimawo/disaster-recovery-drills-that-actually-work-1n83</guid>
      <description>&lt;h2&gt;
  
  
  Most DR Drills Are Theater
&lt;/h2&gt;

&lt;p&gt;Someone schedules a meeting. A few senior engineers walk through a runbook. Everyone agrees "yes, we could do this" and marks it complete.&lt;/p&gt;

&lt;p&gt;Then the real disaster hits and nobody remembers the procedure, the runbook is 2 years out of date, and half the backup systems don't work.&lt;/p&gt;

&lt;p&gt;Real DR drills test whether your team can actually recover, not whether they can talk about recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Levels of DR Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Tabletop&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Walk through a scenario on paper&lt;/li&gt;
&lt;li&gt;Identify missing runbooks&lt;/li&gt;
&lt;li&gt;Find ownership gaps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: New team members, initial gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Doesn't prove anything actually works&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Partial Failure Test&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail one component in staging&lt;/li&gt;
&lt;li&gt;Watch recovery happen with real tools&lt;/li&gt;
&lt;li&gt;Time the full recovery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Validating specific runbooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Staging ≠ production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Level 3: Full Production Drill&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actually fail a real production component&lt;/li&gt;
&lt;li&gt;Customer-facing (announce a maintenance window if needed)&lt;/li&gt;
&lt;li&gt;Full team responds as if it's real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Useful for&lt;/strong&gt;: Proving you can recover&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: Scary, high-coordination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams stop at Level 1. Good teams do Level 2 quarterly. The best teams do Level 3 twice a year.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real DR Drill Scenario
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: Primary database becomes unreachable&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup (48 hours before)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedule window with product team&lt;/li&gt;
&lt;li&gt;Pick a time when customer impact is minimal&lt;/li&gt;
&lt;li&gt;Brief the team: "Something will fail tomorrow, respond as normal"&lt;/li&gt;
&lt;li&gt;Pre-position the incident commander&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;At T+0, block network access to primary database via iptables&lt;/li&gt;
&lt;li&gt;Start stopwatch&lt;/li&gt;
&lt;li&gt;Watch the team respond&lt;/li&gt;
&lt;li&gt;Do NOT intervene or give hints&lt;/li&gt;
&lt;li&gt;Document every action, every decision, every delay&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Metrics to capture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to detection (first alert fires)&lt;/li&gt;
&lt;li&gt;Time to engagement (first human acknowledges)&lt;/li&gt;
&lt;li&gt;Time to diagnosis ("we know what's wrong")&lt;/li&gt;
&lt;li&gt;Time to mitigation (customer impact stops)&lt;/li&gt;
&lt;li&gt;Time to recovery (fully restored)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring the Drill
&lt;/h2&gt;

&lt;p&gt;Good scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detection: under 2 minutes&lt;/li&gt;
&lt;li&gt;Engagement: under 5 minutes&lt;/li&gt;
&lt;li&gt;Diagnosis: under 15 minutes&lt;/li&gt;
&lt;li&gt;Mitigation: under 30 minutes (for DR scenarios)&lt;/li&gt;
&lt;li&gt;Recovery: depends on scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these are 5x longer than target, you have a real problem.&lt;/p&gt;
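&lt;p&gt;The scoring above is easy to automate. A minimal sketch (the timestamps and the drill timeline below are illustrative, not from a real drill):&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Target times per drill phase, in minutes (from the scores above)
TARGETS = {"detection": 2, "engagement": 5, "diagnosis": 15, "mitigation": 30}

def score_drill(t0, events):
    """Compare elapsed time per phase against targets and flag 5x overruns."""
    results = {}
    for phase, target_min in TARGETS.items():
        elapsed = (events[phase] - t0) / timedelta(minutes=1)
        results[phase] = {
            "elapsed_min": round(elapsed, 1),
            "target_min": target_min,
            # "5x longer than target" is the real-problem threshold
            "real_problem": elapsed > 5 * target_min,
        }
    return results

# Hypothetical drill timeline
t0 = datetime(2026, 5, 5, 14, 0)
events = {
    "detection":  t0 + timedelta(minutes=1.5),
    "engagement": t0 + timedelta(minutes=4),
    "diagnosis":  t0 + timedelta(minutes=90),   # 6x over target
    "mitigation": t0 + timedelta(minutes=120),
}
scores = score_drill(t0, events)
```

&lt;p&gt;Feed it the five timestamps you captured during the drill and the retro writes itself.&lt;/p&gt;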

&lt;h2&gt;
  
  
  What Always Goes Wrong
&lt;/h2&gt;

&lt;p&gt;In every DR drill we run:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Runbook is out of date.&lt;/strong&gt; The one that worked 6 months ago has wrong commands now.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credentials don't work.&lt;/strong&gt; The service account was rotated, nobody updated the runbook.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup is untested.&lt;/strong&gt; The restore fails because the backup is corrupted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation paths are stale.&lt;/strong&gt; The "DBA on-call" has left the company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies are missing.&lt;/strong&gt; The recovery playbook assumes Service X is up, but Service X depends on the failed component.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every drill uncovers 3-5 of these. Fix them, then drill again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chaos Engineering vs. DR Drills
&lt;/h2&gt;

&lt;p&gt;These are different: chaos engineering is continuous (daily or weekly) and usually automated, while DR drills are deliberate, scheduled, and large-scale.&lt;/p&gt;

&lt;p&gt;Chaos engineering answers: "Can we survive small failures routinely?"&lt;/p&gt;

&lt;p&gt;DR drills answer: "Can we survive catastrophic failures at all?"&lt;/p&gt;

&lt;p&gt;You need both.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blame-Free Rule
&lt;/h2&gt;

&lt;p&gt;DR drills expose weaknesses. Those weaknesses are process problems, not people problems.&lt;/p&gt;

&lt;p&gt;The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No firing based on drill performance&lt;/li&gt;
&lt;li&gt;No promotions based on "being the hero"&lt;/li&gt;
&lt;li&gt;Focus on process gaps, not individual failures&lt;/li&gt;
&lt;li&gt;The post-drill retrospective is 90% about fixing systems, 10% about training people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the team is afraid of the drill, you'll never learn anything real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequency That Actually Works
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1 (Tabletop): Monthly, 1 hour
Level 2 (Partial): Quarterly, 4 hours including retro
Level 3 (Full Production): Twice a year, full day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also: after every major infrastructure change, drill the affected components.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Lesson
&lt;/h2&gt;

&lt;p&gt;The drill is the easy part. The hard part is &lt;strong&gt;making the fixes from the drill a priority&lt;/strong&gt; when there's feature pressure.&lt;/p&gt;

&lt;p&gt;We track "DR drill remediation items" as a standing OKR. If after two quarters the same items are still open, the SRE team has authority to freeze feature work until they're fixed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting Point
&lt;/h2&gt;

&lt;p&gt;If you've never done a DR drill:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick one scenario (database failure, region outage, API gateway down)&lt;/li&gt;
&lt;li&gt;Schedule it for a quiet hour&lt;/li&gt;
&lt;li&gt;Run a tabletop first to find the obvious gaps&lt;/li&gt;
&lt;li&gt;Fix those gaps&lt;/li&gt;
&lt;li&gt;Run a partial failure test in staging&lt;/li&gt;
&lt;li&gt;Measure everything&lt;/li&gt;
&lt;li&gt;Run a retro focused on process&lt;/li&gt;
&lt;li&gt;Schedule the next one&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do this for three scenarios, and you'll have a DR program. Do it for ten, and you'll have a resilient company.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dr</category>
      <category>reliability</category>
      <category>sre</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>Feature Flags as a Reliability Tool, Not Just an A/B Platform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:11:47 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</link>
      <guid>https://forem.com/samson_tanimawo/feature-flags-as-a-reliability-tool-not-just-an-ab-platform-40e</guid>
      <description>&lt;h2&gt;
  
  
  Most Teams Use Feature Flags Wrong
&lt;/h2&gt;

&lt;p&gt;They wire up LaunchDarkly or Unleash, use it for two A/B tests, then forget about it.&lt;/p&gt;

&lt;p&gt;Meanwhile, their production is full of &lt;code&gt;if (isNewCheckoutEnabled)&lt;/code&gt; blocks that nobody remembers how to toggle.&lt;/p&gt;

&lt;p&gt;Feature flags are not primarily an experimentation tool. They're a &lt;strong&gt;reliability tool&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Value
&lt;/h2&gt;

&lt;p&gt;Feature flags let you &lt;strong&gt;separate deploy from release&lt;/strong&gt;. You ship code to production switched off, then turn it on gradually for real users.&lt;/p&gt;

&lt;p&gt;When things break, you flip the switch back in 10 seconds. No rollback, no redeploy, no PR reverts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Reliability Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Kill Switches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every risky new feature ships behind a kill switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;featureFlags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new_payment_flow&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;newPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;legacyPaymentFlow&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the new flow has a bug, you don't roll back. You flip the flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gradual Rollouts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;new_search_algorithm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;rollout_percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Start at 1% of users&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user.tier&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;==&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'internal'"&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="c1"&gt;# Internal users always see it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy to 1%, watch metrics, go to 5%, watch, then 25%, 50%, 100%. Each rollout takes 2-4 hours instead of being a single risky deploy.&lt;/p&gt;
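&lt;p&gt;The percentage gate only works if it is deterministic: a user who saw the feature at 1% must still see it at 5%. A common trick is to hash the flag name plus the user ID into a stable bucket (a sketch; the hashing scheme and modulus here are illustrative):&lt;/p&gt;

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percentage: float) -> bool:
    """Deterministically map a user to a bucket in [0, 100) and gate on it."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100.0  # stable 0.00-99.99 bucket
    return percentage > bucket  # in the rollout while percentage exceeds the bucket
```

&lt;p&gt;Because the bucket never changes, raising the percentage only ever adds users; nobody flaps in and out of the feature between requests.&lt;/p&gt;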

&lt;p&gt;&lt;strong&gt;3. Circuit Breakers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;external_recommendations_service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;automatic_disable_if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;error_rate_above&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5%&lt;/span&gt;
&lt;span class="na"&gt;for_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a downstream service starts failing, the flag auto-disables that feature. Your product degrades gracefully instead of crashing.&lt;/p&gt;
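&lt;p&gt;In-process, the auto-disable logic is a small sliding window over recent calls. A sketch (the class name, window size, and minimum sample count are illustrative; real systems usually read the error rate from their metrics pipeline instead):&lt;/p&gt;

```python
import time
from collections import deque

class FlagCircuitBreaker:
    """Trip a feature flag off when the recent error rate crosses a threshold."""

    def __init__(self, error_rate_threshold=0.05, window_seconds=300, min_samples=20):
        self.threshold = error_rate_threshold
        self.window = window_seconds      # mirrors for_minutes: 5 above
        self.min_samples = min_samples    # don't trip on a single early error
        self.samples = deque()            # (timestamp, is_error) pairs
        self.enabled = True

    def record(self, is_error, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, is_error))
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()        # evict samples outside the window
        errors = sum(1 for _, e in self.samples if e)
        if len(self.samples) >= self.min_samples and errors / len(self.samples) > self.threshold:
            self.enabled = False          # trip: the feature degrades gracefully

    def is_enabled(self):
        return self.enabled
```

&lt;p&gt;This version stays tripped until an operator re-enables the flag; adding a half-open retry state gets you automatic recovery.&lt;/p&gt;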

&lt;p&gt;&lt;strong&gt;4. Load Shedding&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;expensive_realtime_dashboard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;enabled_when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;cpu_utilization_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;70%&lt;/span&gt;
&lt;span class="na"&gt;active_users_below&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under load, disable non-critical features to preserve the critical path.&lt;/p&gt;
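&lt;p&gt;Evaluating the &lt;code&gt;enabled_when&lt;/code&gt; block is just a conjunction of limit checks. A sketch (the metric names and where the numbers come from are illustrative):&lt;/p&gt;

```python
def load_shed_enabled(limits, metrics):
    """A non-critical feature stays on only while the system is under every limit."""
    cpu_ok = limits["cpu_utilization_below"] > metrics["cpu_pct"]
    users_ok = limits["active_users_below"] > metrics["active_users"]
    return cpu_ok and users_ok

# Mirrors the expensive_realtime_dashboard config above
limits = {"cpu_utilization_below": 70, "active_users_below": 50000}
```

&lt;p&gt;Wire the metrics side to whatever your monitoring system already exports, and re-evaluate on a short interval so the feature comes back when load drops.&lt;/p&gt;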

&lt;h2&gt;
  
  
  The Anti-Pattern: Permanent Flags
&lt;/h2&gt;

&lt;p&gt;After a feature is 100% rolled out, the flag should be deleted within 2 weeks. Every flag left in the codebase is technical debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flag hygiene rules:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Every flag has an expiration date (90 days max)
- Every flag has an owner in CODEOWNERS
- CI fails if a flag is older than 180 days
- Monthly flag cleanup is part of standard operations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We track "flag count" as a reliability metric. If it grows unbounded, we're doing it wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;A solid feature flag system has three parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Definition store&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source of truth for all flags&lt;/li&gt;
&lt;li&gt;Versioned in Git or a managed service (LaunchDarkly, Unleash, GrowthBook)&lt;/li&gt;
&lt;li&gt;Audit log for every change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Client SDK&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-app flag evaluation&lt;/li&gt;
&lt;li&gt;Falls back to defaults if the service is unreachable&lt;/li&gt;
&lt;li&gt;Caches decisions for 60 seconds&lt;/li&gt;
&lt;li&gt;Emits telemetry for flag usage&lt;/li&gt;
&lt;/ul&gt;
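&lt;p&gt;Those SDK behaviors fit in a few dozen lines. A sketch (the fetcher callable stands in for a real flag service; class and method names are illustrative):&lt;/p&gt;

```python
import time

class FlagClient:
    """Client-side flag evaluation with a 60-second cache and safe fallbacks."""

    CACHE_TTL = 60  # seconds, per the SDK behavior above

    def __init__(self, fetcher, defaults):
        self.fetcher = fetcher    # callable: flag name -> bool (may raise)
        self.defaults = defaults  # safe fallback per flag if the service is down
        self.cache = {}           # flag name -> (value, fetched_at)

    def is_enabled(self, name, now=None):
        now = time.monotonic() if now is None else now
        if name in self.cache:
            value, fetched_at = self.cache[name]
            if self.CACHE_TTL > now - fetched_at:
                return value      # serve the cached decision
        try:
            value = bool(self.fetcher(name))
        except Exception:
            # Flag service unreachable: never crash, fall back to the default
            return self.defaults.get(name, False)
        self.cache[name] = (value, now)
        return value
```

&lt;p&gt;The important property is that a flag-service outage degrades to defaults instead of taking your application down with it.&lt;/p&gt;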

&lt;p&gt;&lt;strong&gt;3. Admin interface&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change flags without deploying code&lt;/li&gt;
&lt;li&gt;See current state across environments&lt;/li&gt;
&lt;li&gt;Role-based access (not everyone can flip prod flags)&lt;/li&gt;
&lt;li&gt;Approval workflow for high-risk flags&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Evaluating at the Right Layer
&lt;/h2&gt;

&lt;p&gt;Flags can live at multiple layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CDN edge use for marketing experiments
Load balancer use for blue/green deploys
App server use for feature experiments
Database use for schema migrations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The closer a layer is to the edge, the faster the flip: CDN flags change in seconds, while database-layer flags can take minutes to propagate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reliability Metric
&lt;/h2&gt;

&lt;p&gt;Track: &lt;strong&gt;mean time to mitigate (MTTM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If your team can mitigate an incident in under 30 seconds via a feature flag flip, that's a win. If you have to redeploy to mitigate, your reliability is bottlenecked by deploy time.&lt;/p&gt;

&lt;p&gt;Good teams: MTTM under 60 seconds&lt;br&gt;
Great teams: MTTM under 15 seconds&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stale flags skew A/B results&lt;/strong&gt;: clean them up after experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flags without defaults cause prod outages&lt;/strong&gt;: every flag must have a safe fallback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flag flips mid-request cause weird bugs&lt;/strong&gt;: evaluate at request start, cache for the request lifetime&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested flags (flags inside flags) are impossible to reason about&lt;/strong&gt;: avoid them&lt;/li&gt;
&lt;/ol&gt;
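&lt;p&gt;Gotcha 3 is worth a sketch: snapshot flag values once at the start of a request so a mid-request flip cannot split one request across two code paths (class names are illustrative):&lt;/p&gt;

```python
class RequestFlags:
    """Freeze flag decisions for the lifetime of a single request."""

    def __init__(self, client, flag_names, user_id):
        # Evaluate everything once, at request start
        self._values = {name: client.is_enabled(name, user_id) for name in flag_names}

    def is_enabled(self, name):
        return self._values[name]  # stable even if the live flag flips mid-request


class FlippingClient:
    """Stand-in flag client whose answer an operator can change at any time."""

    def __init__(self):
        self.on = True

    def is_enabled(self, name, user_id):
        return self.on

client = FlippingClient()
flags = RequestFlags(client, ["new_checkout"], "user-1")
client.on = False  # operator flips the flag while the request is in flight
```

&lt;p&gt;The request that started on the new checkout path finishes on it; only the next request sees the flip.&lt;/p&gt;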

&lt;h2&gt;
  
  
  A Reliability-First Flag Strategy
&lt;/h2&gt;

&lt;p&gt;Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every new feature ships behind a kill switch&lt;/li&gt;
&lt;li&gt;Gradual rollouts for anything touching the critical path&lt;/li&gt;
&lt;li&gt;Circuit breakers for external dependencies&lt;/li&gt;
&lt;li&gt;Flag cleanup is a monthly ritual&lt;/li&gt;
&lt;li&gt;Track MTTM and optimize it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Feature flags are the most underrated reliability tool in modern engineering. Treat them that way.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>featureflags</category>
      <category>reliability</category>
      <category>devops</category>
      <category>deployments</category>
    </item>
    <item>
      <title>eBPF for SREs: Observability Without Agents</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:11:20 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</link>
      <guid>https://forem.com/samson_tanimawo/ebpf-for-sres-observability-without-agents-2ohk</guid>
      <description>&lt;h2&gt;
  
  
  The Agent Problem
&lt;/h2&gt;

&lt;p&gt;Traditional monitoring means shipping an agent with every service. That agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adds memory overhead&lt;/li&gt;
&lt;li&gt;Needs to be updated&lt;/li&gt;
&lt;li&gt;Gets out of date&lt;/li&gt;
&lt;li&gt;Breaks with kernel upgrades&lt;/li&gt;
&lt;li&gt;Needs instrumentation code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;eBPF says: &lt;strong&gt;what if the kernel itself could emit observability data?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What eBPF Actually Is
&lt;/h2&gt;

&lt;p&gt;eBPF (extended Berkeley Packet Filter) lets you run sandboxed programs inside the Linux kernel without recompiling or loading modules. It was originally for packet filtering. Now it powers Cilium, Pixie, Falco, and dozens of other tools.&lt;/p&gt;

&lt;p&gt;From an SRE perspective: &lt;strong&gt;you get deep visibility into syscalls, network traffic, process behavior, and filesystem operations with zero code changes to your applications&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Can Observe
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;every TCP connection (src, dst, bytes, duration)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DNS queries and response times&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;TLS handshake failures&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;HTTP request/response cycles&lt;/span&gt;

&lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;function call latencies (uprobes)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory allocations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;lock contention&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;GC pauses&lt;/span&gt;

&lt;span class="na"&gt;security&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;syscall audit trails&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;privilege escalations&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;suspicious file access&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container escape attempts&lt;/span&gt;

&lt;span class="na"&gt;performance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CPU scheduling delays&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;I/O wait time per process&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;disk latency histograms&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;page fault patterns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of this &lt;strong&gt;without modifying your application code&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Practical Example: Detecting Slow HTTP Requests
&lt;/h2&gt;

&lt;p&gt;Traditional approach: instrument your HTTP framework with OpenTelemetry, deploy a collector, ship traces.&lt;/p&gt;

&lt;p&gt;eBPF approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install bpftrace&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;bpftrace

&lt;span class="c"&gt;# Trace every HTTP response larger than 1MB&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bpftrace &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'
uprobe:/usr/lib/libssl.so:SSL_write {
@http_writes[pid] = count();
@http_bytes[comm] = sum(arg2);
}
'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes. No restarts. Real-time visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools Worth Knowing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Pixie&lt;/strong&gt; (now part of New Relic)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-instruments every service in your K8s cluster&lt;/li&gt;
&lt;li&gt;No code changes, no sidecars&lt;/li&gt;
&lt;li&gt;Full HTTP, MySQL, Postgres, DNS tracing&lt;/li&gt;
&lt;li&gt;Open source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Cilium&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network observability + security policy enforcement&lt;/li&gt;
&lt;li&gt;Replaces kube-proxy&lt;/li&gt;
&lt;li&gt;Hubble UI for service-to-service traffic visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Falco&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime security detection&lt;/li&gt;
&lt;li&gt;"Alert if a process inside a container spawns a shell"&lt;/li&gt;
&lt;li&gt;Rules are written in YAML&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Parca&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous profiling via eBPF&lt;/li&gt;
&lt;li&gt;See CPU flame graphs across your entire fleet&lt;/li&gt;
&lt;li&gt;Identify the most expensive code paths&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Tracee&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security-focused eBPF tracing&lt;/li&gt;
&lt;li&gt;Detects privilege escalations, cryptojacking, suspicious syscalls&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Tradeoffs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero app code changes&lt;/li&gt;
&lt;li&gt;Near-zero overhead (kernel-level efficiency)&lt;/li&gt;
&lt;li&gt;Unified view across languages (Go, Python, Java, Rust, all seen the same way)&lt;/li&gt;
&lt;li&gt;No agent lifecycle to manage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Linux 4.14+ (5.0+ preferred)&lt;/li&gt;
&lt;li&gt;Steep learning curve for custom probes&lt;/li&gt;
&lt;li&gt;Limited visibility into in-process logic (you see syscalls, not business logic)&lt;/li&gt;
&lt;li&gt;eBPF verifier rejects programs for subtle reasons&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Shines
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network debugging&lt;/strong&gt;: "Why is service A slow to reach service B?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security auditing&lt;/strong&gt;: "What containers are making unexpected syscalls?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance profiling&lt;/strong&gt;: "Where is the cluster CPU time actually going?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident forensics&lt;/strong&gt;: "Reconstruct the syscall timeline during the outage"&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When eBPF Is Wrong
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business logic observability&lt;/strong&gt;: you still need OpenTelemetry for spans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application errors&lt;/strong&gt;: your logs and exception tracking still matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-region correlation&lt;/strong&gt;: eBPF is node-local&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use eBPF for infrastructure and network. Use OpenTelemetry for application logic. They complement each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Deploy Pixie in a dev cluster (1-line install)&lt;/li&gt;
&lt;li&gt;Open the UI, watch real-time HTTP traffic&lt;/li&gt;
&lt;li&gt;Try a bpftrace one-liner to trace a specific syscall&lt;/li&gt;
&lt;li&gt;Read the Cilium + Hubble docs&lt;/li&gt;
&lt;li&gt;Replace one agent-based tool with its eBPF equivalent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The future of observability is kernel-native. Agent-based tools will still exist, but the gap will keep shrinking.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ebpf</category>
      <category>observability</category>
      <category>linux</category>
      <category>kernel</category>
    </item>
    <item>
      <title>Observability as Code: Managing Dashboards and Alerts with Terraform</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Sun, 26 Apr 2026 14:11:06 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</link>
      <guid>https://forem.com/samson_tanimawo/observability-as-code-managing-dashboards-and-alerts-with-terraform-2hl4</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with Click-Ops Dashboards
&lt;/h2&gt;

&lt;p&gt;Your team has 200 dashboards. You don't know who owns them. Half are broken. The rest show yesterday's reality.&lt;/p&gt;

&lt;p&gt;This is click-ops debt, and it compounds faster than code debt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability as Code
&lt;/h2&gt;

&lt;p&gt;Every dashboard, alert, and SLO definition should live in a Git repository alongside your service code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"datadog_dashboard"&lt;/span&gt; &lt;span class="s2"&gt;"api_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"API Gateway - Golden Signals"&lt;/span&gt;
&lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Owner: @platform-team"&lt;/span&gt;
&lt;span class="nx"&gt;layout_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ordered"&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Request Rate (per second)"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sum:api.requests{service:gateway}.as_rate()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;widget&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;timeseries_definition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"P99 Latency"&lt;/span&gt;
&lt;span class="nx"&gt;request&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"max:api.latency{service:gateway}.as_count()"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lives next to &lt;code&gt;main.tf&lt;/code&gt; for your service. When you deploy the service, you deploy the observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits That Compound
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Ownership is clear.&lt;/strong&gt; The file has a CODEOWNERS entry. PRs require review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dashboards auto-update.&lt;/strong&gt; Renaming a service? Terraform refactor propagates to all dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Drift detection.&lt;/strong&gt; Someone clicked "save as" in the UI and now that dashboard is out of sync. &lt;code&gt;terraform plan&lt;/code&gt; catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Review before production.&lt;/strong&gt; Alert changes go through PR review. No more "who set this threshold?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Tooling by Platform
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;datadog&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DataDog/datadog&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;datadog_monitor, datadog_dashboard, datadog_slo&lt;/span&gt;

&lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana_dashboard, grafana_alert_rule&lt;/span&gt;

&lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;approach&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YAML files in Git, deployed by ArgoCD&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alert rules, recording rules&lt;/span&gt;

&lt;span class="na"&gt;new_relic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic/newrelic&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;newrelic_alert_policy, newrelic_dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pick one source of truth. Don't mix.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Real Example
&lt;/h2&gt;

&lt;p&gt;We have a module that takes a service name and generates a complete observability stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"service_observability"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/observability"&lt;/span&gt;

&lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"payment-processor"&lt;/span&gt;
&lt;span class="nx"&gt;team_slack&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"#payments"&lt;/span&gt;
&lt;span class="nx"&gt;severity_map&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;error_rate_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
&lt;span class="nx"&gt;p99_latency_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="nx"&gt;saturation_pct&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;slo_targets&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="nx"&gt;availability&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9995&lt;/span&gt;
&lt;span class="nx"&gt;latency_p99&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.99&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One module call creates: 3 dashboards, 8 alerts, 2 SLOs, a Slack channel binding, and a PagerDuty escalation policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part
&lt;/h2&gt;

&lt;p&gt;The code is easy. The hard part is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Migrating existing click-ops dashboards&lt;/strong&gt;: budget 2 weeks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Getting engineers to edit YAML/HCL instead of the UI&lt;/strong&gt;: budget 3 months of reminders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking UI edits&lt;/strong&gt;: some tools let you set dashboards to read-only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewing alert changes&lt;/strong&gt;: PR reviewers need context on what each threshold means&lt;/li&gt;
&lt;/ol&gt;
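&lt;p&gt;For point 1, Terraform 1.5+ can adopt a click-ops dashboard without destroying and recreating it, via a config-driven &lt;code&gt;import&lt;/code&gt; block. A sketch &amp;mdash; the resource address and dashboard ID are placeholders you'd take from your own module and the tool's UI:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# Adopt an existing UI-created dashboard into state (Terraform &gt;= 1.5)
import {
  # Hypothetical address; match it to the resource inside your module
  to = module.service_observability.datadog_dashboard.golden_signals
  id = "abc-123-def"  # dashboard ID copied from the tool's UI
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running &lt;code&gt;terraform plan -generate-config-out=generated.tf&lt;/code&gt; will then draft matching HCL for you to clean up and fold into the module, which is most of where that 2-week budget goes.&lt;/p&gt;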

&lt;h2&gt;
  
  
  The Anti-Pattern to Avoid
&lt;/h2&gt;

&lt;p&gt;Don't write Terraform for every custom chart an engineer wants. That leads to 500-line dashboard modules nobody understands.&lt;/p&gt;

&lt;p&gt;Instead, define &lt;strong&gt;standard dashboards&lt;/strong&gt; (golden signals, RED/USE, SLO burn rate) as modules. Let engineers add their own custom dashboards in the UI if they want, but mark them as "explore-only" (not alert-worthy).&lt;/p&gt;

&lt;p&gt;Core observability = code. Experimental exploration = UI.&lt;/p&gt;
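&lt;p&gt;One way to make that boundary visible is to stamp the code-managed dashboards in their own definition. A sketch with the Datadog provider &amp;mdash; the widget and query are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "datadog_dashboard" "golden_signals" {
  title       = "${var.service_name} golden signals"
  description = "Managed by Terraform. UI edits will be overwritten on the next apply."
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "Request rate"
      request {
        q            = "sum:trace.http.request.hits{service:${var.service_name}}.as_rate()"
        display_type = "line"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By convention, anything an engineer builds in the UI without that description is explore-only: fine to look at, never a paging source.&lt;/p&gt;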

&lt;h2&gt;
  
  
  Migration Strategy
&lt;/h2&gt;

&lt;p&gt;Week 1: Pick 1 service, convert its dashboards to Terraform&lt;br&gt;
Week 2: Add alerts + SLOs to Terraform&lt;br&gt;
Week 3: Delete the UI versions&lt;br&gt;
Week 4: Create a module from the patterns&lt;br&gt;
Month 2: Roll out to 10 more services&lt;br&gt;
Month 3: Require all new services to use the module&lt;/p&gt;

&lt;p&gt;Six months in, your click-ops debt is gone and your observability is reproducible.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>observability</category>
      <category>devops</category>
      <category>iac</category>
    </item>
  </channel>
</rss>
