<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: arnoldbaraka</title>
    <description>The latest articles on Forem by arnoldbaraka (@king_baraka).</description>
    <link>https://forem.com/king_baraka</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3835757%2Fc519603f-aab0-4f65-b6e0-997d71dc547e.jpeg</url>
      <title>Forem: arnoldbaraka</title>
      <link>https://forem.com/king_baraka</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/king_baraka"/>
    <language>en</language>
    <item>
      <title>Our Production System Went Down at 2:13AM — Here’s Exactly What Happened</title>
      <dc:creator>arnoldbaraka</dc:creator>
      <pubDate>Fri, 20 Mar 2026 16:12:25 +0000</pubDate>
      <link>https://forem.com/king_baraka/our-production-system-went-down-at-213am-heres-exactly-what-happened-kgf</link>
      <guid>https://forem.com/king_baraka/our-production-system-went-down-at-213am-heres-exactly-what-happened-kgf</guid>
      <description>&lt;p&gt;At 2:13AM, production went down.&lt;/p&gt;

&lt;p&gt;No warning. No gradual degradation. Just alerts firing everywhere.&lt;/p&gt;

&lt;p&gt;CPU was fine. Memory was fine. Nodes were healthy.&lt;/p&gt;

&lt;p&gt;But users?&lt;br&gt;
Nothing was working.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;We traced it to Kubernetes.&lt;/p&gt;

&lt;p&gt;Pods were restarting.&lt;br&gt;
CrashLoopBackOff.&lt;/p&gt;

&lt;p&gt;But logs?&lt;br&gt;
Almost useless.&lt;/p&gt;

&lt;p&gt;No clear error. Just silence… and restarts.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;After digging deeper, we found it:&lt;/p&gt;

&lt;p&gt;An image pull issue.&lt;/p&gt;

&lt;p&gt;The cluster couldn’t pull from ECR.&lt;/p&gt;

&lt;p&gt;Not because the image didn’t exist.&lt;br&gt;
Not because of the network.&lt;/p&gt;

&lt;p&gt;But because of authentication.&lt;/p&gt;

&lt;p&gt;Expired credentials.&lt;/p&gt;
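
&lt;p&gt;When a pull fails, the container never starts, so there are no application logs to read. The signal sits in the pod status instead, where the container's waiting reason is typically ErrImagePull or ImagePullBackOff. Below is a minimal sketch of a check for that, assuming input shaped like the JSON from &lt;code&gt;kubectl get pods -o json&lt;/code&gt;. The names PULL_FAILURE_REASONS and find_image_pull_failures are hypothetical, chosen for illustration.&lt;/p&gt;

```python
# Sketch only: scans the parsed JSON that `kubectl get pods -o json` returns.
# PULL_FAILURE_REASONS and find_image_pull_failures are illustrative names.
PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def find_image_pull_failures(pod_list):
    """Return (pod, container, reason, message) for containers stuck on an image pull."""
    failures = []
    for pod in pod_list.get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "unknown")
        status = pod.get("status", {})
        # Init containers pull images too, so check both status lists.
        container_statuses = (
            status.get("initContainerStatuses", [])
            + status.get("containerStatuses", [])
        )
        for container in container_statuses:
            # `waiting` is only present while the container has not started.
            waiting = container.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in PULL_FAILURE_REASONS:
                failures.append(
                    (
                        pod_name,
                        container.get("name"),
                        waiting["reason"],
                        waiting.get("message", ""),
                    )
                )
    return failures
```

&lt;p&gt;Feeding it the output of &lt;code&gt;json.load&lt;/code&gt; on the kubectl JSON gives one tuple per stuck container. The message field is where an authentication problem would usually show up.&lt;/p&gt;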

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;Here’s what made it worse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD pipeline was green
&lt;/li&gt;
&lt;li&gt;Deployment succeeded
&lt;/li&gt;
&lt;li&gt;No alerts for registry auth failures
&lt;/li&gt;
&lt;li&gt;Monitoring didn’t catch it early
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything looked healthy.&lt;/p&gt;

&lt;p&gt;It wasn’t.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;What this incident taught me:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;“Green pipeline” ≠ working system
&lt;/li&gt;
&lt;li&gt;Observability must include external dependencies (ECR, APIs, etc.)
&lt;/li&gt;
&lt;li&gt;Kubernetes can fail quietly while CPU, memory, and node health all look “normal”
&lt;/li&gt;
&lt;li&gt;Authentication failures are one of the most dangerous hidden killers
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implemented registry auth monitoring
&lt;/li&gt;
&lt;li&gt;Added image pull failure alerts
&lt;/li&gt;
&lt;li&gt;Automated credential rotation
&lt;/li&gt;
&lt;li&gt;Improved logging visibility
&lt;/li&gt;
&lt;/ul&gt;
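
&lt;p&gt;One way registry auth monitoring could be sketched, assuming the credentials are short-lived tokens with a known expiry time. ECR authorization tokens, for instance, are valid for 12 hours, and the GetAuthorizationToken response includes an expiresAt timestamp. The function name needs_refresh and the two-hour margin are assumptions for illustration only.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Assumption: refresh once less than two hours of lifetime remain (illustrative value).
REFRESH_MARGIN = timedelta(hours=2)


def needs_refresh(expires_at, now=None, margin=REFRESH_MARGIN):
    """Return True when the credential has expired or will expire within the margin."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now + margin >= expires_at


# Example: a token issued at midnight UTC with a 12-hour lifetime.
issued = datetime(2026, 1, 1, 0, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=12)
print(needs_refresh(expires, now=issued + timedelta(hours=1)))   # False
print(needs_refresh(expires, now=issued + timedelta(hours=11)))  # True
```

&lt;p&gt;Run on a schedule, a check like this can trigger rotation or an alert before the token lapses, instead of after pulls start failing.&lt;/p&gt;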

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;DevOps isn’t about tools.&lt;/p&gt;

&lt;p&gt;It’s about understanding failure.&lt;/p&gt;

&lt;p&gt;And failure doesn’t announce itself.&lt;/p&gt;

&lt;p&gt;It hides.&lt;/p&gt;

&lt;p&gt;—&lt;/p&gt;

&lt;p&gt;I’ll be sharing more real production incidents like this.&lt;/p&gt;

&lt;p&gt;No theory. No fluff.&lt;/p&gt;

&lt;p&gt;Just what actually happens in the trenches.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
