<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Samson Tanimawo</title>
    <description>The latest articles on Forem by Samson Tanimawo (@samson_tanimawo).</description>
    <link>https://forem.com/samson_tanimawo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3830227%2F02ea1ab7-513f-4426-b63d-9120142bc431.png</url>
      <title>Forem: Samson Tanimawo</title>
      <link>https://forem.com/samson_tanimawo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samson_tanimawo"/>
    <language>en</language>
    <item>
      <title>The Golden Signals: A Practical Implementation Guide</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:12:08 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-6ii</link>
      <guid>https://forem.com/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-6ii</guid>
      <description>&lt;h2&gt;
  
  
  Four Metrics to Rule Them All
&lt;/h2&gt;

&lt;p&gt;Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.&lt;/p&gt;

&lt;p&gt;Here's a practical guide from someone who's implemented them across 50+ services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signal 1: Latency
&lt;/h2&gt;

&lt;p&gt;Not all latency is equal. Track the latency of successful requests and failed requests separately: fast-failing errors can drag the overall number down just as the service degrades.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad: Average latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_request_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;  &lt;span class="c1"&gt;# Useless
&lt;/span&gt;
&lt;span class="c1"&gt;# Good: Percentile latency, separated by status
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request latency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="n"&gt;status_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;xx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_class&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on p99, not p50. Your happiest users don't need help.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighLatencyP99&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Signal 2: Traffic
&lt;/h2&gt;

&lt;p&gt;Traffic tells you "is this normal?" It's the context for every other signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Current request rate
rate(http_requests_total[5m])

# Compare to same time last week
rate(http_requests_total[5m]) 
  / 
rate(http_requests_total[5m] offset 7d)

# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: &amp;gt;
    rate(http_requests_total[5m]) 
    &amp;lt; 
    (rate(http_requests_total[5m] offset 1h) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped &amp;gt;50% compared to 1 hour ago"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic drops are often more concerning than traffic spikes.&lt;/p&gt;
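&lt;p&gt;As a minimal sketch (plain Python, with illustrative names not taken from the alert rule), the TrafficDrop condition boils down to a ratio check against a baseline:&lt;/p&gt;

```python
def traffic_dropped(current_rate: float, baseline_rate: float,
                    threshold: float = 0.5) -> bool:
    """True when current traffic has fallen below threshold * baseline.

    Mirrors the alert above: current 5m request rate vs. the rate 1h ago.
    """
    if baseline_rate <= 0:
        return False  # no baseline to compare against; let other alerts fire
    return current_rate < baseline_rate * threshold
```

A 60% drop versus an hour ago trips the check; a 40% drop does not.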

&lt;h2&gt;
  
  
  Signal 3: Errors
&lt;/h2&gt;

&lt;p&gt;Track error rate as a percentage, not absolute count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Error rate percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But also track error types separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;error_categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;5xx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Server&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(our&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fault)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;4xx_excluding_404&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Client&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;issue)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeouts"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;circuit_breaker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dependency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failures"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
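&lt;p&gt;One way those categories might be assigned in application code (a hypothetical helper, not from the original post; wire the inputs to your framework's response object):&lt;/p&gt;

```python
from typing import Optional

def categorize_error(status_code: int, timed_out: bool = False,
                     circuit_open: bool = False) -> Optional[str]:
    """Map a finished request to one of the error categories above."""
    if circuit_open:
        return "circuit_breaker"    # dependency failure surfaced by the breaker
    if timed_out:
        return "timeout"
    if 500 <= status_code < 600:
        return "5xx"                # server error: our fault
    if 400 <= status_code < 500 and status_code != 404:
        return "4xx_excluding_404"  # client error worth watching
    return None                     # success, or a plain 404
```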



&lt;h2&gt;
  
  
  Signal 4: Saturation
&lt;/h2&gt;

&lt;p&gt;The most underrated signal. Saturation answers: "how close are we to full?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU saturation
process_cpu_seconds_total / container_spec_cpu_quota

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Connection pool saturation
active_connections / max_connections

# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.&lt;/p&gt;
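&lt;p&gt;The 80%/95% policy is easy to encode once saturation is expressed as a used/capacity ratio (a sketch; function and parameter names are assumptions):&lt;/p&gt;

```python
def saturation_severity(used: float, capacity: float,
                        warn: float = 0.80, crit: float = 0.95) -> str:
    """Translate a saturation ratio into an alert severity.

    Encodes the thresholds above: warning at 80%, critical at 95%.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"
```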

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Every service gets a standard dashboard with four rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row 1: Latency   [p50] [p90] [p99] [error latency]
Row 2: Traffic    [rate] [vs last week] [by endpoint]
Row 3: Errors     [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern
&lt;/h2&gt;

&lt;p&gt;Don't build a golden signals dashboard per service manually. Template it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dashboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Golden Signals: {{ service_name }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One template, 50 dashboards. Update once, apply everywhere.&lt;/p&gt;
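&lt;p&gt;Stamping out per-service dashboards from one template can be as simple as substitution plus an upload step. A minimal sketch (the SERVICE_NAME placeholder and the service list are illustrative; a real setup would use Grafana provisioning files or its HTTP API):&lt;/p&gt;

```python
import json

# Shared template; SERVICE_NAME is a placeholder substituted per service.
TEMPLATE = '{"dashboard": {"title": "Golden Signals: SERVICE_NAME"}}'

def render_dashboard(service: str) -> dict:
    """Produce one dashboard definition from the shared template."""
    return json.loads(TEMPLATE.replace("SERVICE_NAME", service))

dashboards = [render_dashboard(s) for s in ("checkout", "payments", "search")]
```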

&lt;p&gt;If you want golden signal monitoring that sets itself up automatically, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Golden Signals: A Practical Implementation Guide</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:30:02 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-1ope</link>
      <guid>https://forem.com/samson_tanimawo/the-golden-signals-a-practical-implementation-guide-1ope</guid>
      <description>&lt;h2&gt;
  
  
  Four Metrics to Rule Them All
&lt;/h2&gt;

&lt;p&gt;Google's SRE book introduced the four golden signals: Latency, Traffic, Errors, and Saturation. Simple concept, but I've seen teams struggle with implementation.&lt;/p&gt;

&lt;p&gt;Here's a practical guide from someone who's implemented them across 50+ services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signal 1: Latency
&lt;/h2&gt;

&lt;p&gt;Not all latency is equal. You need to track successful requests and error requests separately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad: Average latency
&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_request_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_requests&lt;/span&gt;  &lt;span class="c1"&gt;# Useless
&lt;/span&gt;
&lt;span class="c1"&gt;# Good: Percentile latency, separated by status
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Histogram&lt;/span&gt;

&lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request latency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status_class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;025&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.middleware&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_latency&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="n"&gt;status_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;xx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;REQUEST_LATENCY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_class&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert on p99, not p50. Your happiest users don't need help.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HighLatencyP99&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.5&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Signal 2: Traffic
&lt;/h2&gt;

&lt;p&gt;Traffic tells you "is this normal?" It's the context for every other signal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Current request rate
rate(http_requests_total[5m])

# Compare to same time last week
rate(http_requests_total[5m]) 
  / 
rate(http_requests_total[5m] offset 7d)

# Alert on sudden drops (possible outage nobody noticed)
- alert: TrafficDrop
  expr: &amp;gt;
    rate(http_requests_total[5m]) 
    &amp;lt; 
    (rate(http_requests_total[5m] offset 1h) * 0.5)
  for: 10m
  annotations:
    summary: "Traffic dropped &amp;gt;50% compared to 1 hour ago"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic drops are often more concerning than traffic spikes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Signal 3: Errors
&lt;/h2&gt;

&lt;p&gt;Track error rate as a percentage, not absolute count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Error rate percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But also track error types separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;error_categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;5xx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Server&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(our&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fault)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;4xx_excluding_404&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Client&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;issue)"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;timeouts"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;circuit_breaker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Dependency&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failures"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Signal 4: Saturation
&lt;/h2&gt;

&lt;p&gt;The most underrated signal. Saturation answers: "how close are we to full?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# CPU saturation
process_cpu_seconds_total / container_spec_cpu_quota

# Memory saturation
container_memory_working_set_bytes / container_spec_memory_limit_bytes

# Connection pool saturation
active_connections / max_connections

# Queue saturation (the one everyone forgets)
message_queue_depth / message_queue_capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alert before you hit 100%. I use 80% as the threshold for warning and 95% for critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Every service gets a standard dashboard with four rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Row 1: Latency   [p50] [p90] [p99] [error latency]
Row 2: Traffic    [rate] [vs last week] [by endpoint]
Row 3: Errors     [rate %] [by type] [by endpoint]
Row 4: Saturation [CPU] [Memory] [Connections] [Queue]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fits on one screen. No scrolling. Any engineer can assess service health in 10 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anti-Pattern
&lt;/h2&gt;

&lt;p&gt;Don't build a golden signals dashboard per service manually. Template it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dashboard"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Golden Signals: {{ service_name }}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"templating"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"query"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"staging"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One template, 50 dashboards. Update once, apply everywhere.&lt;/p&gt;
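&lt;p&gt;As a sketch of that workflow (the template shape and service names here are placeholders, not a real dashboard API): stamp the shared JSON template once per service and deploy the rendered copies.&lt;/p&gt;

```python
import json

# Shared template, mirroring the structure above; "{{ service_name }}" is the
# placeholder this sketch fills in. All names here are illustrative.
TEMPLATE = json.dumps({
    "dashboard": {
        "title": "Golden Signals: {{ service_name }}",
        "templating": {"list": [{"name": "service", "type": "query"}]},
    }
})

def render_dashboard(service_name):
    """Return one concrete dashboard dict rendered from the shared template."""
    return json.loads(TEMPLATE.replace("{{ service_name }}", service_name))

# One template, many dashboards: render a copy per service.
dashboards = [render_dashboard(name) for name in ("checkout", "payments", "search")]
```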

&lt;p&gt;If you want golden signal monitoring that sets itself up automatically, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Observability: What to Monitor and Why</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Thu, 16 Apr 2026 00:19:43 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-1lbo</link>
      <guid>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-1lbo</guid>
      <description>&lt;h2&gt;
  
  
  The Kubernetes Monitoring Maze
&lt;/h2&gt;

&lt;p&gt;Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.&lt;/p&gt;

&lt;p&gt;After running K8s in production for four years, here's what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Kubernetes observability has three distinct layers, and you need different strategies for each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 1: Cluster Health
&lt;/h2&gt;

&lt;p&gt;These are your "is the platform working?" metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_cluster_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_ready_status&lt;/span&gt;        &lt;span class="c1"&gt;# Are all nodes healthy?&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_cpu_utilization&lt;/span&gt;     &lt;span class="c1"&gt;# Alert at 85%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_memory_utilization&lt;/span&gt;  &lt;span class="c1"&gt;# Alert at 90%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_disk_pressure&lt;/span&gt;       &lt;span class="c1"&gt;# Boolean alert&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_pid_pressure&lt;/span&gt;        &lt;span class="c1"&gt;# Rarely fires, always critical&lt;/span&gt;

  &lt;span class="na"&gt;control_plane&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apiserver_request_latency_p99&lt;/span&gt;  &lt;span class="c1"&gt;# Alert &amp;gt; 1s&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;etcd_disk_wal_fsync_duration&lt;/span&gt;   &lt;span class="c1"&gt;# Alert &amp;gt; 100ms&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;scheduler_pending_pods&lt;/span&gt;         &lt;span class="c1"&gt;# Alert if &amp;gt; 0 for 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;controller_manager_queue_depth&lt;/span&gt; &lt;span class="c1"&gt;# Alert if growing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Don't alert on individual node CPU. Alert on cluster-level capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alert when cluster is 80% utilized
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) &amp;gt; 0.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
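&lt;p&gt;The same rule in plain Python, as an illustrative check (the two rate inputs stand in for the PromQL sums above and would come from your metrics backend):&lt;/p&gt;

```python
# Hedged sketch of the 80% cluster-capacity rule. busy_rate and total_rate
# stand in for the two rate() sums in the PromQL query above.
def cluster_cpu_saturated(busy_rate, total_rate, threshold=0.80):
    """True when cluster-wide busy CPU exceeds the alert threshold."""
    return busy_rate / total_rate > threshold
```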



&lt;h2&gt;
  
  
  Layer 2: Workload Health
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They monitor pods instead of workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_workload_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;available_replicas &amp;lt; desired_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# For &amp;gt; 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment_generation != observed_generation&lt;/span&gt;  &lt;span class="c1"&gt;# Stuck rollout&lt;/span&gt;

  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;restart_count increasing&lt;/span&gt;       &lt;span class="c1"&gt;# CrashLoopBackOff detection&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container_oom_killed&lt;/span&gt;            &lt;span class="c1"&gt;# Memory limits too low&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pod_pending_duration &amp;gt; 2min&lt;/span&gt;     &lt;span class="c1"&gt;# Scheduling issues&lt;/span&gt;

  &lt;span class="na"&gt;hpa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;current_replicas == max_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# Scale ceiling hit&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_utilization_vs_target&lt;/span&gt;         &lt;span class="c1"&gt;# Consistently above target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most valuable alert I ever wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect pods stuck in CrashLoopBackOff&lt;/span&gt;
&lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
&lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(kube_pod_container_status_restarts_total[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;crash-looping"&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;times&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
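&lt;p&gt;The detection logic behind that alert, reduced to a toy Python check (the sampling format is invented for illustration):&lt;/p&gt;

```python
# A pod is treated as crash-looping when its restart counter keeps climbing
# across the observation window, mirroring rate(restarts_total[15m]) above 0.
def is_crash_looping(restart_counts):
    """restart_counts: restart-counter readings over the last 15 minutes."""
    return restart_counts[-1] > restart_counts[0]
```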



&lt;h2&gt;
  
  
  Layer 3: Application Performance
&lt;/h2&gt;

&lt;p&gt;This is what your users actually care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;application_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;red_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Rate, Errors, Duration&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_rate_per_second&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;error_rate_percentage&lt;/span&gt;        &lt;span class="c1"&gt;# Alert &amp;gt; 1%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_duration_p99&lt;/span&gt;         &lt;span class="c1"&gt;# Alert &amp;gt; 500ms&lt;/span&gt;

  &lt;span class="na"&gt;use_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Utilization, Saturation, Errors&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_receive_bytes_rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
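&lt;p&gt;For intuition, here is the RED computation as a toy Python function over raw request samples (the sample format is invented for illustration, not a real client API):&lt;/p&gt;

```python
# Compute Rate, Errors, and Duration from a window of request records.
def red_summary(samples, window_seconds):
    """samples: list of (status_code, duration_seconds) request records."""
    durations = sorted(d for _, d in samples)
    errors = sum(1 for status, _ in samples if status >= 500)
    p99 = durations[int(0.99 * (len(durations) - 1))]
    return {
        "rate_per_second": len(samples) / window_seconds,
        "error_rate_pct": 100.0 * errors / len(samples),
        "duration_p99": p99,
    }

# 99 fast successes and one slow server error over a 10-second window.
summary = red_summary([(200, 0.1)] * 99 + [(500, 1.0)], window_seconds=10)
```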



&lt;h2&gt;
  
  
  The Dashboard That Saves Us
&lt;/h2&gt;

&lt;p&gt;We built a single "K8s Health" dashboard with four panels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster capacity&lt;/strong&gt; — CPU/Memory/Disk utilization per node pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload status&lt;/strong&gt; — Table of all deployments with health status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates&lt;/strong&gt; — All services, sorted by error rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent events&lt;/strong&gt; — K8s events filtered to warnings and errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This one dashboard answers 90% of "is something wrong?" questions.&lt;/p&gt;
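&lt;p&gt;Panel 2's status column boils down to the two deployment rules from the workload section; sketched in Python (field names are illustrative):&lt;/p&gt;

```python
# Classify a deployment the way the workload-status panel does: flag a stuck
# rollout first, then compare availability against the desired replica count.
def deployment_health(desired, available, generation, observed_generation):
    """Return 'healthy', 'degraded', or 'rollout-stuck' for one deployment."""
    if generation != observed_generation:
        return "rollout-stuck"
    if available >= desired:
        return "healthy"
    return "degraded"
```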

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring pods instead of services&lt;/strong&gt; — Pods are ephemeral, services are what matter&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not setting resource requests&lt;/strong&gt; — Without requests, your metrics are meaningless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on resource usage instead of SLOs&lt;/strong&gt; — High CPU isn't a problem if latency is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the control plane&lt;/strong&gt; — An unhealthy API server affects everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want unified Kubernetes observability without the complexity, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes Observability: What to Monitor and Why</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 23:29:48 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-8ek</link>
      <guid>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-8ek</guid>
      <description>&lt;h2&gt;
  
  
  The Kubernetes Monitoring Maze
&lt;/h2&gt;

&lt;p&gt;Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.&lt;/p&gt;

&lt;p&gt;After running K8s in production for four years, here's what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Kubernetes observability has three distinct layers, and you need different strategies for each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 1: Cluster Health
&lt;/h2&gt;

&lt;p&gt;These are your "is the platform working?" metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_cluster_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_ready_status&lt;/span&gt;        &lt;span class="c1"&gt;# Are all nodes healthy?&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_cpu_utilization&lt;/span&gt;     &lt;span class="c1"&gt;# Alert at 85%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_memory_utilization&lt;/span&gt;  &lt;span class="c1"&gt;# Alert at 90%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_disk_pressure&lt;/span&gt;       &lt;span class="c1"&gt;# Boolean alert&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_pid_pressure&lt;/span&gt;        &lt;span class="c1"&gt;# Rarely fires, always critical&lt;/span&gt;

  &lt;span class="na"&gt;control_plane&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apiserver_request_latency_p99&lt;/span&gt;  &lt;span class="c1"&gt;# Alert &amp;gt; 1s&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;etcd_disk_wal_fsync_duration&lt;/span&gt;   &lt;span class="c1"&gt;# Alert &amp;gt; 100ms&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;scheduler_pending_pods&lt;/span&gt;         &lt;span class="c1"&gt;# Alert if &amp;gt; 0 for 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;controller_manager_queue_depth&lt;/span&gt; &lt;span class="c1"&gt;# Alert if growing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Don't alert on individual node CPU. Alert on cluster-level capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alert when cluster is 80% utilized
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) &amp;gt; 0.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 2: Workload Health
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They monitor pods instead of workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_workload_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;available_replicas &amp;lt; desired_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# For &amp;gt; 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment_generation != observed_generation&lt;/span&gt;  &lt;span class="c1"&gt;# Stuck rollout&lt;/span&gt;

  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;restart_count increasing&lt;/span&gt;       &lt;span class="c1"&gt;# CrashLoopBackOff detection&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container_oom_killed&lt;/span&gt;            &lt;span class="c1"&gt;# Memory limits too low&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pod_pending_duration &amp;gt; 2min&lt;/span&gt;     &lt;span class="c1"&gt;# Scheduling issues&lt;/span&gt;

  &lt;span class="na"&gt;hpa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;current_replicas == max_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# Scale ceiling hit&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_utilization_vs_target&lt;/span&gt;         &lt;span class="c1"&gt;# Consistently above target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
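
&lt;p&gt;With kube-state-metrics installed, the deployment and HPA checks above translate roughly to the following PromQL (a sketch, not a drop-in rule file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deployment missing replicas (pair with for: 5m)
kube_deployment_status_replicas_available
  &amp;lt; kube_deployment_spec_replicas

# HPA pinned at its ceiling
kube_horizontalpodautoscaler_status_current_replicas
  == kube_horizontalpodautoscaler_spec_max_replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;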



&lt;p&gt;The most valuable alert I ever wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect pods stuck in CrashLoopBackOff&lt;/span&gt;
&lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
&lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(kube_pod_container_status_restarts_total[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;crash-looping"&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;times&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Application Performance
&lt;/h2&gt;

&lt;p&gt;This is what your users actually care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;application_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;red_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Rate, Errors, Duration&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_rate_per_second&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;error_rate_percentage&lt;/span&gt;        &lt;span class="c1"&gt;# Alert &amp;gt; 1%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_duration_p99&lt;/span&gt;         &lt;span class="c1"&gt;# Alert &amp;gt; 500ms&lt;/span&gt;

  &lt;span class="na"&gt;use_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Utilization, Saturation, Errors&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_receive_bytes_rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
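
&lt;p&gt;For the RED numbers, assuming standard Prometheus HTTP instrumentation (an &lt;code&gt;http_requests_total&lt;/code&gt; counter with a &lt;code&gt;status&lt;/code&gt; label and an &lt;code&gt;http_request_duration_seconds&lt;/code&gt; histogram; substitute your own metric names):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Error rate above 1%
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) &amp;gt; 0.01

# p99 latency above 500ms
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) &amp;gt; 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;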



&lt;h2&gt;
  
  
  The Dashboard That Saves Us
&lt;/h2&gt;

&lt;p&gt;We built a single "K8s Health" dashboard with four panels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster capacity&lt;/strong&gt; — CPU/Memory/Disk utilization per node pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload status&lt;/strong&gt; — Table of all deployments with health status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates&lt;/strong&gt; — All services, sorted by error rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent events&lt;/strong&gt; — K8s events filtered to warnings and errors&lt;/li&gt;
&lt;/ol&gt;
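
&lt;p&gt;Panel 4 needs no custom tooling; the raw feed behind it is one kubectl command away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Warning-level events across the cluster, newest last
kubectl get events --all-namespaces \
  --field-selector type=Warning \
  --sort-by=.lastTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;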

&lt;p&gt;This one dashboard answers 90% of "is something wrong?" questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring pods instead of services&lt;/strong&gt; — Pods are ephemeral; services are what users actually depend on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not setting resource requests&lt;/strong&gt; — Without requests, utilization ratios have no baseline and the scheduler is guessing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on resource usage instead of SLOs&lt;/strong&gt; — High CPU isn't a problem if latency is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the control plane&lt;/strong&gt; — An unhealthy API server affects everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want unified Kubernetes observability without the complexity, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Kubernetes Observability: What to Monitor and Why</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 23:18:54 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-mif</link>
      <guid>https://forem.com/samson_tanimawo/kubernetes-observability-what-to-monitor-and-why-mif</guid>
      <description>&lt;h2&gt;
  
  
  The Kubernetes Monitoring Maze
&lt;/h2&gt;

&lt;p&gt;Kubernetes gives you a thousand metrics out of the box. Most teams monitor all of them and understand none of them.&lt;/p&gt;

&lt;p&gt;After running K8s in production for four years, here's what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;Kubernetes observability has three distinct layers, and you need different strategies for each:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Cluster Health (infrastructure)
Layer 2: Workload Health (your apps)
Layer 3: Application Performance (user experience)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 1: Cluster Health
&lt;/h2&gt;

&lt;p&gt;These are your "is the platform working?" metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_cluster_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_ready_status&lt;/span&gt;        &lt;span class="c1"&gt;# Are all nodes healthy?&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_cpu_utilization&lt;/span&gt;     &lt;span class="c1"&gt;# Alert at 85%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_memory_utilization&lt;/span&gt;  &lt;span class="c1"&gt;# Alert at 90%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_disk_pressure&lt;/span&gt;       &lt;span class="c1"&gt;# Boolean alert&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node_pid_pressure&lt;/span&gt;        &lt;span class="c1"&gt;# Rarely fires, always critical&lt;/span&gt;

  &lt;span class="na"&gt;control_plane&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apiserver_request_latency_p99&lt;/span&gt;  &lt;span class="c1"&gt;# Alert &amp;gt; 1s&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;etcd_disk_wal_fsync_duration&lt;/span&gt;   &lt;span class="c1"&gt;# Alert &amp;gt; 100ms&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;scheduler_pending_pods&lt;/span&gt;         &lt;span class="c1"&gt;# Alert if &amp;gt; 0 for 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;controller_manager_queue_depth&lt;/span&gt; &lt;span class="c1"&gt;# Alert if growing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Don't alert on individual node CPU. Alert on cluster-level capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Alert when cluster CPU is 80% utilized.
# Note the rate(): node_cpu_seconds_total is a counter (cumulative since
# boot), so the raw ratio would measure lifetime usage, not current load.
(
  sum(rate(node_cpu_seconds_total{mode!="idle"}[5m]))
  /
  sum(rate(node_cpu_seconds_total[5m]))
) &amp;gt; 0.80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 2: Workload Health
&lt;/h2&gt;

&lt;p&gt;This is where most teams get it wrong. They monitor pods instead of workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;critical_workload_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;available_replicas &amp;lt; desired_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# For &amp;gt; 5min&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment_generation != observed_generation&lt;/span&gt;  &lt;span class="c1"&gt;# Stuck rollout&lt;/span&gt;

  &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;restart_count increasing&lt;/span&gt;       &lt;span class="c1"&gt;# CrashLoopBackOff detection&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;container_oom_killed&lt;/span&gt;            &lt;span class="c1"&gt;# Memory limits too low&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;pod_pending_duration &amp;gt; 2min&lt;/span&gt;     &lt;span class="c1"&gt;# Scheduling issues&lt;/span&gt;

  &lt;span class="na"&gt;hpa&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;current_replicas == max_replicas&lt;/span&gt;  &lt;span class="c1"&gt;# Scale ceiling hit&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_utilization_vs_target&lt;/span&gt;         &lt;span class="c1"&gt;# Consistently above target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most valuable alert I ever wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Detect pods stuck in CrashLoopBackOff&lt;/span&gt;
&lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodCrashLooping&lt;/span&gt;
&lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(kube_pod_container_status_restarts_total[15m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;15m&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
&lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;crash-looping"&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.pod&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;has&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;restarted&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;times&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;15&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;minutes"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 3: Application Performance
&lt;/h2&gt;

&lt;p&gt;This is what your users actually care about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;application_metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;red_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Rate, Errors, Duration&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_rate_per_second&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;error_rate_percentage&lt;/span&gt;        &lt;span class="c1"&gt;# Alert &amp;gt; 1%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;request_duration_p99&lt;/span&gt;         &lt;span class="c1"&gt;# Alert &amp;gt; 500ms&lt;/span&gt;

  &lt;span class="na"&gt;use_method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Utilization, Saturation, Errors&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cpu_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;memory_request_vs_limit_ratio&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;network_receive_bytes_rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dashboard That Saves Us
&lt;/h2&gt;

&lt;p&gt;We built a single "K8s Health" dashboard with four panels:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster capacity&lt;/strong&gt; — CPU/Memory/Disk utilization per node pool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload status&lt;/strong&gt; — Table of all deployments with health status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error rates&lt;/strong&gt; — All services, sorted by error rate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent events&lt;/strong&gt; — K8s events filtered to warnings and errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This one dashboard answers 90% of "is something wrong?" questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring pods instead of services&lt;/strong&gt; — Pods are ephemeral; services are what users actually depend on&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not setting resource requests&lt;/strong&gt; — Without requests, utilization ratios have no baseline and the scheduler is guessing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting on resource usage instead of SLOs&lt;/strong&gt; — High CPU isn't a problem if latency is fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring the control plane&lt;/strong&gt; — An unhealthy API server affects everything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want unified Kubernetes observability without the complexity, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>On-Call Wellness: Protecting Your Engineers from Burnout</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 15:21:46 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/on-call-wellness-protecting-your-engineers-from-burnout-1fii</link>
      <guid>https://forem.com/samson_tanimawo/on-call-wellness-protecting-your-engineers-from-burnout-1fii</guid>
      <description>&lt;h2&gt;
  
  
  The On-Call Burnout Epidemic
&lt;/h2&gt;

&lt;p&gt;I watched three senior SREs leave our team in six months. Exit interviews all said the same thing: on-call was unsustainable.&lt;/p&gt;

&lt;p&gt;We were spending $500K+ recruiting replacements for a problem that could have been fixed with $0 and better practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Warning Signs
&lt;/h2&gt;

&lt;p&gt;Before someone quits, they show these signals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cynicism in post-mortems&lt;/strong&gt; — "This will never get fixed"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert numbness&lt;/strong&gt; — Slow to respond, missed pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacation avoidance&lt;/strong&gt; — "I can't take time off, who would cover?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep rejection&lt;/strong&gt; — "That's not my problem"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meeting silence&lt;/strong&gt; — Previously engaged, now checked out&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you see three or more of these in someone on your team, they're already halfway out the door.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hard Cap on Pages
&lt;/h3&gt;

&lt;p&gt;We set a maximum of 2 pages per 8-hour on-call shift. If someone gets paged more than that, the secondary automatically takes over and the incident is escalated as a process failure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on_call_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;max_pages_per_shift&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;shift_duration_hours&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;overflow_action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escalate_to_secondary"&lt;/span&gt;
  &lt;span class="na"&gt;overflow_review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly_ops_review"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
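&lt;p&gt;The policy above can be enforced in a pager webhook. A hypothetical sketch (the return contract is my invention, not a PagerDuty or Opsgenie API):&lt;/p&gt;

```python
# Hard cap: after MAX_PAGES_PER_SHIFT pages, the secondary takes over
# and the overflow is queued for the weekly ops review.
MAX_PAGES_PER_SHIFT = 2

def route_page(pages_already_this_shift):
    """Decide who gets the next page under the hard-cap policy."""
    if pages_already_this_shift >= MAX_PAGES_PER_SHIFT:
        return ("secondary", "flag_for_weekly_ops_review")
    return ("primary", None)
```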



&lt;h3&gt;
  
  
  2. Follow-the-Sun Rotation
&lt;/h3&gt;

&lt;p&gt;We stopped asking people to be on-call at 3am. With team members across US timezones, we created overlapping business-hours shifts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Shift A (Eastern):  6am - 2pm ET
Shift B (Central):  11am - 7pm CT  
Shift C (Pacific):  2pm - 10pm PT
Overnight:          Managed by alert automation + escalation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nobody gets paged between 10pm and 6am unless it's a true P1.&lt;/p&gt;
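&lt;p&gt;It's worth verifying that the shift table actually leaves only the intended overnight gap. A quick sanity check using stdlib timezone math (not production code; the date is fixed to a DST-era day):&lt;/p&gt;

```python
# Which UTC hours have no human on call under the shift table above?
from datetime import datetime
from zoneinfo import ZoneInfo

SHIFTS = [
    ("America/New_York", 6, 14),     # Shift A: 6am-2pm ET
    ("America/Chicago", 11, 19),     # Shift B: 11am-7pm CT
    ("America/Los_Angeles", 14, 22), # Shift C: 2pm-10pm PT
]

def uncovered_utc_hours(year=2026, month=4, day=15):
    covered = set()
    for tz, start, end in SHIFTS:
        zone = ZoneInfo(tz)
        for hour in range(start, end):
            local = datetime(year, month, day, hour, tzinfo=zone)
            covered.add(local.astimezone(ZoneInfo("UTC")).hour)
    return sorted(set(range(24)) - covered)
```

&lt;p&gt;On that date the only humans-off window is 05:00-10:00 UTC, i.e. 1am-6am ET: exactly the slot handed to alert automation.&lt;/p&gt;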

&lt;h3&gt;
  
  
  3. On-Call Compensation
&lt;/h3&gt;

&lt;p&gt;We implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$500 flat fee per on-call week&lt;/li&gt;
&lt;li&gt;$200 per off-hours page&lt;/li&gt;
&lt;li&gt;Comp day after any overnight incident &amp;gt; 30 minutes&lt;/li&gt;
&lt;li&gt;On-call swaps require zero management approval&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. The "Toil Budget"
&lt;/h3&gt;

&lt;p&gt;Each engineer gets a toil budget: maximum 30% of their time on operational work. If toil exceeds 30%, they're pulled from on-call until the team automates the excess.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Weekly toil tracking:
  Alert response:     4 hours
  Manual deployments: 2 hours  
  Config updates:     1 hour
  Ad-hoc debugging:   3 hours
  ─────────────────────────
  Total:              10 hours (25% of 40hr week) ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
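&lt;p&gt;The tracking sheet above reduces to a few lines of arithmetic (the category names come from our sheet; the function itself is just a sketch):&lt;/p&gt;

```python
# Toil budget: flag anyone spending over 30% of the week on ops work.
TOIL_BUDGET = 0.30
WEEK_HOURS = 40

def toil_status(hours_by_category):
    """Return (total_hours, fraction_of_week, within_budget)."""
    total = sum(hours_by_category.values())
    fraction = total / WEEK_HOURS
    return total, fraction, not fraction > TOIL_BUDGET

week = {"alert_response": 4, "manual_deployments": 2,
        "config_updates": 1, "adhoc_debugging": 3}
```

&lt;p&gt;For the week above: 10 hours, 25% of a 40-hour week, within budget.&lt;/p&gt;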



&lt;h3&gt;
  
  
  5. Quarterly On-Call Reviews
&lt;/h3&gt;

&lt;p&gt;Every quarter, we review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages per person&lt;/li&gt;
&lt;li&gt;Off-hours disruptions&lt;/li&gt;
&lt;li&gt;Toil percentages&lt;/li&gt;
&lt;li&gt;Team sentiment survey&lt;/li&gt;
&lt;li&gt;Attrition risk signals&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After (6 months)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Attrition rate&lt;/td&gt;
&lt;td&gt;40%/year&lt;/td&gt;
&lt;td&gt;8%/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pages per shift&lt;/td&gt;
&lt;td&gt;4.7&lt;/td&gt;
&lt;td&gt;1.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Off-hours pages&lt;/td&gt;
&lt;td&gt;12/week&lt;/td&gt;
&lt;td&gt;2/week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team NPS&lt;/td&gt;
&lt;td&gt;-15&lt;/td&gt;
&lt;td&gt;+45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recruitment cost saved&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;~$400K/year&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Key Insight
&lt;/h2&gt;

&lt;p&gt;On-call wellness isn't a perk. It's a business decision. Replacing a senior SRE costs $150K-$200K in recruiting, onboarding, and lost productivity. Preventing burnout costs almost nothing.&lt;/p&gt;

&lt;p&gt;If you're looking to reduce on-call toil and protect your team from burnout, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>oncall</category>
      <category>burnout</category>
      <category>culture</category>
    </item>
    <item>
      <title>Post-Mortem Best Practices That Actually Drive Change</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 07:37:37 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/post-mortem-best-practices-that-actually-drive-change-3pin</link>
      <guid>https://forem.com/samson_tanimawo/post-mortem-best-practices-that-actually-drive-change-3pin</guid>
      <description>&lt;h2&gt;
  
  
  The Post-Mortem Nobody Learns From
&lt;/h2&gt;

&lt;p&gt;I've sat through hundreds of post-mortems. Most follow the same pattern: something breaks, someone writes a Google Doc, we have a meeting, we list action items, nobody follows up, the same thing happens again in 3 months.&lt;/p&gt;

&lt;p&gt;Here's how to break the cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blameless Culture Trap
&lt;/h2&gt;

&lt;p&gt;"Blameless" doesn't mean "actionless." The biggest failure mode I see is teams that use blameless culture as an excuse to avoid accountability.&lt;/p&gt;

&lt;p&gt;Blameless means: we don't punish the person who pushed the bad deploy.&lt;br&gt;
Blameless does NOT mean: nobody is responsible for fixing the systemic issue.&lt;/p&gt;
&lt;h2&gt;
  
  
  My Post-Mortem Template
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Incident: [SERVICE] [SYMPTOM] on [DATE]&lt;/span&gt;

&lt;span class="gu"&gt;## Impact&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Duration: X minutes
&lt;span class="p"&gt;-&lt;/span&gt; Users affected: N
&lt;span class="p"&gt;-&lt;/span&gt; Revenue impact: $X
&lt;span class="p"&gt;-&lt;/span&gt; SLO budget consumed: X%

&lt;span class="gu"&gt;## Timeline (UTC)&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - First alert fired
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - On-call acknowledged
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - Root cause identified
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - Fix deployed
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - Service recovered
&lt;span class="p"&gt;-&lt;/span&gt; HH:MM - All-clear declared

&lt;span class="gu"&gt;## Root Cause&lt;/span&gt;
[2-3 sentences. Technical but readable.]

&lt;span class="gu"&gt;## Contributing Factors&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; [Factor that made the incident possible]
&lt;span class="p"&gt;2.&lt;/span&gt; [Factor that made detection slow]
&lt;span class="p"&gt;3.&lt;/span&gt; [Factor that made resolution slow]

&lt;span class="gu"&gt;## What Went Well&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Something that worked]
&lt;span class="p"&gt;-&lt;/span&gt; [Something that helped]

&lt;span class="gu"&gt;## What Went Wrong&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [Process failure]
&lt;span class="p"&gt;-&lt;/span&gt; [Technical gap]

&lt;span class="gu"&gt;## Action Items&lt;/span&gt;
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| ...    | ...   | P1/P2/P3 | ...      | Open   |

&lt;span class="gu"&gt;## Lessons Learned&lt;/span&gt;
[1-2 paragraphs of genuine insight]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  The Action Item Problem
&lt;/h2&gt;

&lt;p&gt;In my experience, only about 30% of post-mortem action items ever get completed. That's terrible. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Too many items (I've seen post-mortems with 15 action items)&lt;/li&gt;
&lt;li&gt;No clear ownership&lt;/li&gt;
&lt;li&gt;No deadline&lt;/li&gt;
&lt;li&gt;No follow-up mechanism&lt;/li&gt;
&lt;li&gt;Competing with feature work&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  The Fix: Three Rules
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: Maximum 3 action items per post-mortem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can't narrow it to 3, you haven't identified the real problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Every action item gets a JIRA ticket linked to the next sprint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "someday." Not "backlog." Next sprint. If it's not important enough for next sprint, it's not an action item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 3: Review completion in the next post-mortem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start every post-mortem meeting by reviewing open action items from previous incidents. This creates accountability without blame.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Post-mortem meeting agenda

1. Review open action items (10 min)
   - Incident #42: "Add circuit breaker" — DONE
   - Incident #43: "Add canary deploys" — IN PROGRESS (blocked on CI)
   - Incident #44: "Fix retry logic" — NOT STARTED (reassigning)

2. Current incident review (30 min)
   - Timeline walkthrough
   - Contributing factors
   - Action items (max 3)

3. Pattern analysis (10 min)
   - Any recurring themes?
   - Systemic issues to address?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Metric That Matters
&lt;/h2&gt;

&lt;p&gt;Track &lt;strong&gt;Repeat Incident Rate&lt;/strong&gt;: what percentage of incidents have the same root cause as a previous incident?&lt;/p&gt;

&lt;p&gt;When we started tracking this, our repeat rate was 45%. After implementing the three rules above, it dropped to 12% over six months.&lt;/p&gt;

&lt;p&gt;That's the real measure of whether your post-mortems are working.&lt;/p&gt;
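&lt;p&gt;Repeat Incident Rate is cheap to compute from a list of root-cause labels ordered oldest first (a sketch; the labels are illustrative):&lt;/p&gt;

```python
def repeat_incident_rate(root_causes):
    """Fraction of incidents whose root cause was already seen earlier."""
    if not root_causes:
        return 0.0
    seen = set()
    repeats = 0
    for cause in root_causes:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(root_causes)

rate = repeat_incident_rate(
    ["db-pool", "disk-full", "db-pool", "cert-expiry", "disk-full", "db-pool"]
)  # 3 of 6 incidents repeat a prior root cause
```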

&lt;p&gt;If you're looking for better incident learning loops and pattern detection across your post-mortems, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>postmortem</category>
      <category>incidents</category>
      <category>devops</category>
    </item>
    <item>
      <title>Runbook Automation: From 45-Minute Fixes to 90-Second Recoveries</title>
      <dc:creator>Samson Tanimawo</dc:creator>
      <pubDate>Wed, 15 Apr 2026 03:21:55 +0000</pubDate>
      <link>https://forem.com/samson_tanimawo/runbook-automation-from-45-minute-fixes-to-90-second-recoveries-2243</link>
      <guid>https://forem.com/samson_tanimawo/runbook-automation-from-45-minute-fixes-to-90-second-recoveries-2243</guid>
      <description>&lt;h2&gt;
  
  
  The Runbook Nobody Reads
&lt;/h2&gt;

&lt;p&gt;We had runbooks. Beautiful, detailed, Google-Docs runbooks. 47 pages long. Nobody read them at 3am.&lt;/p&gt;

&lt;p&gt;The problem isn't the documentation. The problem is expecting a sleep-deprived human to follow a 47-page procedure correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Automation Ladder
&lt;/h2&gt;

&lt;p&gt;I think about runbook automation as a ladder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 0: No runbook (tribal knowledge)
Level 1: Written runbook (Google Doc)
Level 2: Structured runbook (checklist format)
Level 3: Semi-automated (scripts for each step)
Level 4: Fully automated (one-click remediation)
Level 5: Self-healing (no human needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most teams are at Level 1-2. The goal is Level 4-5 for your top 10 incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying Automation Candidates
&lt;/h2&gt;

&lt;p&gt;Not everything should be automated. Start with high-frequency, well-understood procedures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query your incident database&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
  &lt;span class="n"&gt;root_cause_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;frequency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resolution_time_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_mttr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resolution_time_minutes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_impact_minutes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;incidents&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'6 months'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;root_cause_category&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_impact_minutes&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For us, the top 5 were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Disk full on log volumes (2x/week)&lt;/li&gt;
&lt;li&gt;Memory leak requiring pod restart (1x/week)&lt;/li&gt;
&lt;li&gt;Certificate expiry (1x/month, but high impact)&lt;/li&gt;
&lt;li&gt;Database connection pool exhaustion (1x/week)&lt;/li&gt;
&lt;li&gt;Stuck deployment (2x/week)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Example: Disk Full Auto-Remediation
&lt;/h2&gt;

&lt;p&gt;Before (Level 1 — runbook):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. SSH to the affected host
2. Run df -h to confirm
3. Check /var/log for large files
4. Run logrotate manually
5. If still full, find and remove old files
6. If still full, expand the volume
7. Verify service recovered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After (Level 5 — self-healing):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# disk-remediation.sh — triggered by monitoring alert&lt;/span&gt;

&lt;span class="nv"&gt;HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;span class="nv"&gt;THRESHOLD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;90

&lt;span class="nv"&gt;USAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ssh &lt;span class="nv"&gt;$HOST&lt;/span&gt; &lt;span class="s2"&gt;"df /var/log --output=pcent | tail -1 | tr -dc '0-9'"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[Auto-Remediation] Disk at &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;% on &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  &lt;span class="c"&gt;# Step 1: Rotate logs&lt;/span&gt;
  ssh &lt;span class="nv"&gt;$HOST&lt;/span&gt; &lt;span class="s2"&gt;"sudo logrotate -f /etc/logrotate.conf"&lt;/span&gt;

  &lt;span class="c"&gt;# Step 2: Clean old logs (&amp;gt;7 days)&lt;/span&gt;
  ssh &lt;span class="nv"&gt;$HOST&lt;/span&gt; &lt;span class="s2"&gt;"find /var/log -name '*.gz' -mtime +7 -delete"&lt;/span&gt;

  &lt;span class="c"&gt;# Step 3: Clean temp files&lt;/span&gt;
  ssh &lt;span class="nv"&gt;$HOST&lt;/span&gt; &lt;span class="s2"&gt;"find /tmp -mtime +3 -delete 2&amp;gt;/dev/null"&lt;/span&gt;

  &lt;span class="c"&gt;# Verify&lt;/span&gt;
  &lt;span class="nv"&gt;NEW_USAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;ssh &lt;span class="nv"&gt;$HOST&lt;/span&gt; &lt;span class="s2"&gt;"df /var/log --output=pcent | tail -1 | tr -dc '0-9'"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$NEW_USAGE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$THRESHOLD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[Auto-Remediation] Resolved. &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;% -&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NEW_USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;%"&lt;/span&gt;
    notify_slack &lt;span class="s2"&gt;"Disk full on &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; auto-resolved (&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;% -&amp;gt; &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NEW_USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;%)"&lt;/span&gt;
  &lt;span class="k"&gt;else
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"[Auto-Remediation] Still at &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;NEW_USAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;%. Escalating."&lt;/span&gt;
    page_oncall &lt;span class="s2"&gt;"Disk full on &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; - auto-remediation failed. Manual intervention needed."&lt;/span&gt;
  &lt;span class="k"&gt;fi
fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
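&lt;p&gt;The script above calls &lt;code&gt;notify_slack&lt;/code&gt; and &lt;code&gt;page_oncall&lt;/code&gt;, which aren't shown. Here's a minimal sketch of what they might look like, assuming a Slack incoming webhook and the PagerDuty Events API v2; &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt; and &lt;code&gt;PD_ROUTING_KEY&lt;/code&gt; are placeholders you'd supply for your own team:&lt;/p&gt;

```shell
# Minimal notification helpers assumed by the remediation script above.
# SLACK_WEBHOOK_URL and PD_ROUTING_KEY are placeholders -- set them yourself.

notify_slack() {
  # Post an FYI message to the team channel via a Slack incoming webhook.
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" \
    "${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL}"
}

page_oncall() {
  # Trigger a real incident via the PagerDuty Events API v2.
  curl -s -X POST https://events.pagerduty.com/v2/enqueue \
    -H 'Content-Type: application/json' \
    --data "{\"routing_key\": \"${PD_ROUTING_KEY:?set PD_ROUTING_KEY}\", \"event_action\": \"trigger\", \"payload\": {\"summary\": \"$1\", \"source\": \"auto-remediation\", \"severity\": \"critical\"}}"
}
```

&lt;p&gt;The design point that matters: a successful auto-remediation is an FYI in the channel, while a failed one is a page. Automation should reduce pages, not replace them with silence.&lt;/p&gt;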



&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incident Type&lt;/th&gt;
&lt;th&gt;Before (MTTR)&lt;/th&gt;
&lt;th&gt;After (MTTR)&lt;/th&gt;
&lt;th&gt;Automation Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Disk full&lt;/td&gt;
&lt;td&gt;25 min&lt;/td&gt;
&lt;td&gt;90 sec&lt;/td&gt;
&lt;td&gt;Self-healing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory leak&lt;/td&gt;
&lt;td&gt;15 min&lt;/td&gt;
&lt;td&gt;45 sec&lt;/td&gt;
&lt;td&gt;One-click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cert expiry&lt;/td&gt;
&lt;td&gt;45 min&lt;/td&gt;
&lt;td&gt;0 (prevented)&lt;/td&gt;
&lt;td&gt;Proactive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB conn pool&lt;/td&gt;
&lt;td&gt;20 min&lt;/td&gt;
&lt;td&gt;60 sec&lt;/td&gt;
&lt;td&gt;Self-healing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stuck deploy&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;td&gt;2 min&lt;/td&gt;
&lt;td&gt;One-click&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total monthly incident time: 14 hours → 45 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Golden Rule
&lt;/h2&gt;

&lt;p&gt;If you've fixed the same incident three times manually, it's time to automate. The third time pays for the automation effort. Everything after that is pure savings.&lt;/p&gt;
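&lt;p&gt;The arithmetic behind the rule is simple break-even math. With illustrative numbers (the 25-minute manual disk-full fix from the table above, and an assumed ~75 minutes to script and test the runbook), automation pays for itself by the third incident:&lt;/p&gt;

```shell
MANUAL_MIN=25        # minutes per manual fix (disk-full row above)
AUTOMATION_MIN=75    # one-off effort to script the runbook (assumed)

# Ceiling division: incidents needed before automation pays for itself
BREAK_EVEN=$(( (AUTOMATION_MIN + MANUAL_MIN - 1) / MANUAL_MIN ))
echo "Breaks even at incident #${BREAK_EVEN}; every later one saves ${MANUAL_MIN} min"
# -> Breaks even at incident #3; every later one saves 25 min
```

&lt;p&gt;Plug in your own numbers; the point stands as long as the incident recurs.&lt;/p&gt;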

&lt;p&gt;If you're tired of repetitive incident response and want to automate your runbooks with AI, check out what we're building at &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;Nova AI Ops&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Written by Dr. Samson Tanimawo&lt;/strong&gt;&lt;br&gt;
BSc · MSc · MBA · PhD&lt;br&gt;
Founder &amp;amp; CEO, Nova AI Ops. &lt;a href="https://novaaiops.com" rel="noopener noreferrer"&gt;https://novaaiops.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>automation</category>
      <category>runbooks</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
