<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sreekanth Kuruba</title>
    <description>The latest articles on Forem by Sreekanth Kuruba (@sreekanth_kuruba_91721e5d).</description>
    <link>https://forem.com/sreekanth_kuruba_91721e5d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3286476%2Fc7a306ec-1c67-4d33-901a-1148effc29ce.jpg</url>
      <title>Forem: Sreekanth Kuruba</title>
      <link>https://forem.com/sreekanth_kuruba_91721e5d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sreekanth_kuruba_91721e5d"/>
    <language>en</language>
    <item>
      <title>DevOps vs Platform Engineering in 2026</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:29:40 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/devops-vs-platform-engineering-in-2026-h56</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/devops-vs-platform-engineering-in-2026-h56</guid>
      <description>&lt;p&gt;DevOps transformed how teams build and ship software.&lt;br&gt;
It helped organizations move faster with automation, CI/CD, and shared ownership.&lt;/p&gt;

&lt;p&gt;But as companies scale across countries and teams, new challenges start to appear.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What worked for small teams doesn’t always work at global scale.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Global Companies Are Quietly Shifting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main Theme:&lt;/strong&gt;&lt;br&gt;
At global scale, traditional DevOps starts to crack. Platform Engineering is the next evolution that makes DevOps truly scalable, consistent, and effective across countries and large teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  Imagine this:
&lt;/h2&gt;

&lt;p&gt;A company has engineering teams in India, the US, Europe, and Singapore.&lt;br&gt;
Hundreds of developers working across time zones.&lt;/p&gt;

&lt;p&gt;Yet, releasing even a small feature still takes weeks — not because the developers are slow, but because they’re stuck fighting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up environments&lt;/li&gt;
&lt;li&gt;Fixing inconsistent CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Waiting for approvals&lt;/li&gt;
&lt;li&gt;Dealing with tool chaos across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is the reality when traditional DevOps tries to scale internationally.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 What DevOps Solved — And Where It Breaks at Global Scale
&lt;/h2&gt;

&lt;p&gt;DevOps was revolutionary. It brought developers and operations together through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automation &amp;amp; CI/CD&lt;/li&gt;
&lt;li&gt;Infrastructure as Code (IaC)&lt;/li&gt;
&lt;li&gt;Shared responsibility (“You build it, you run it”)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works beautifully for small and mid-sized teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here’s the uncomfortable truth:&lt;/strong&gt;&lt;br&gt;
👉 &lt;em&gt;At large scale, many developers become part-time infrastructure managers instead of product builders.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At global enterprise scale, DevOps starts showing serious cracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every team picks different tools → massive tool sprawl&lt;/li&gt;
&lt;li&gt;Same problems get solved repeatedly&lt;/li&gt;
&lt;li&gt;Compliance and regulations (GDPR, data sovereignty, etc.) become extremely hard to manage&lt;/li&gt;
&lt;li&gt;Developers waste more time on infrastructure than on actual features&lt;/li&gt;
&lt;li&gt;DevOps fatigue kicks in — frustration, burnout, and slower delivery&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🏗️ Platform Engineering: The Next Evolution
&lt;/h2&gt;

&lt;p&gt;Here’s the sharper truth:&lt;br&gt;
While DevOps focuses on collaboration, Platform Engineering focuses on developer productivity at scale.&lt;/p&gt;

&lt;p&gt;Think of it like this:&lt;/p&gt;

&lt;p&gt;DevOps = Every team manages their own kitchen&lt;br&gt;
Platform Engineering = One professional central kitchen with ready tools, standard recipes, and built-in safety&lt;/p&gt;

&lt;p&gt;So developers can stop worrying about setup and just focus on cooking great features.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ What Platform Engineering Actually Delivers
&lt;/h2&gt;

&lt;p&gt;A dedicated platform team builds an Internal Developer Platform (IDP) that offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🚀 Self-service environment creation (minutes instead of days/weeks)&lt;/li&gt;
&lt;li&gt;🛤️ “Golden Paths” — safe, standardized, and recommended workflows&lt;/li&gt;
&lt;li&gt;🔐 Security, compliance, and observability built-in by default&lt;/li&gt;
&lt;li&gt;🧭 A clean developer portal for easy self-service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Often powered by tools like Backstage, Crossplane, along with core DevOps tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Developers get guided freedom instead of complete chaos or total restriction.&lt;/p&gt;
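
&lt;p&gt;As a rough sketch of what “self-service” looks like in practice (the resource kind and field names here are hypothetical, not tied to any specific product), a golden-path environment request on an IDP is often just a small declarative file the developer submits:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# environment-request.yaml (hypothetical golden-path template)
kind: EnvironmentClaim
metadata:
  name: checkout-service-dev
spec:
  team: payments
  region: eu-west-1     # compliance guardrails applied automatically
  size: small           # platform team defines the allowed sizes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The platform turns that request into infrastructure, pipelines, and monitoring behind the scenes; the developer never touches Terraform or cloud consoles directly.&lt;/p&gt;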




&lt;h2&gt;
  
  
  ⚖️ DevOps vs Platform Engineering – Clear Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;DevOps&lt;/th&gt;
&lt;th&gt;Platform Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main Focus&lt;/td&gt;
&lt;td&gt;Collaboration between Dev &amp;amp; Ops&lt;/td&gt;
&lt;td&gt;Developer productivity &amp;amp; experience at scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ownership&lt;/td&gt;
&lt;td&gt;Shared by all teams&lt;/td&gt;
&lt;td&gt;Dedicated platform team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approach&lt;/td&gt;
&lt;td&gt;Flexible (every team does it their way)&lt;/td&gt;
&lt;td&gt;Standardized with smart guardrails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Suited For&lt;/td&gt;
&lt;td&gt;Small to mid-size teams&lt;/td&gt;
&lt;td&gt;Large global organizations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Key Metric&lt;/td&gt;
&lt;td&gt;Deployment frequency &amp;amp; speed&lt;/td&gt;
&lt;td&gt;Time saved + Developer Experience (DevEx)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;One-line summary:&lt;/strong&gt;&lt;br&gt;
DevOps gives freedom.&lt;br&gt;
Platform Engineering gives freedom that actually scales globally.&lt;/p&gt;




&lt;h2&gt;
  
  
  🌍 Why Global Companies Are Making This Shift in 2026
&lt;/h2&gt;

&lt;p&gt;At the international level, complexity explodes — multi-cloud setups, different regulations, time zone differences, and 100+ engineering teams.&lt;/p&gt;

&lt;p&gt;Platform Engineering solves these by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drastically reducing repetitive work and cognitive load&lt;/li&gt;
&lt;li&gt;Bringing consistency across countries and clouds&lt;/li&gt;
&lt;li&gt;Making security &amp;amp; compliance automatic&lt;/li&gt;
&lt;li&gt;Improving developer happiness and retention&lt;/li&gt;
&lt;li&gt;Enabling faster feature delivery with lower risk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is exactly why Platform Engineering roles are becoming some of the highest-paying and most strategic positions in 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Challenges &amp;amp; Smart Way to Adopt
&lt;/h2&gt;

&lt;p&gt;It’s not effortless. Common pitfalls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building the platform without real developer feedback&lt;/li&gt;
&lt;li&gt;Making it too rigid&lt;/li&gt;
&lt;li&gt;Ignoring legacy systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Better approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start small (fix one major pain point first)&lt;/li&gt;
&lt;li&gt;Treat developers as customers&lt;/li&gt;
&lt;li&gt;Iterate continuously based on feedback&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧭 What Should You Learn?
&lt;/h2&gt;

&lt;p&gt;If you're an engineer (especially aiming for global or remote opportunities):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Master DevOps fundamentals&lt;/strong&gt;&lt;br&gt;
→ Docker, Kubernetes, Terraform, CI/CD&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Level up to Platform Engineering&lt;/strong&gt;&lt;br&gt;
→ Internal Developer Platforms (IDP)&lt;br&gt;
→ Developer portals (e.g., Backstage)&lt;br&gt;
→ Developer Experience (DevEx) mindset&lt;/p&gt;

&lt;p&gt;💡 Pro tip: Build even a small internal platform project — it gives you a massive edge in interviews and on LinkedIn.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔮 Final Thought
&lt;/h2&gt;

&lt;p&gt;DevOps is not going away.&lt;/p&gt;

&lt;p&gt;But the companies winning in 2026 are not just “doing DevOps”.&lt;br&gt;
They are building Platform Engineering on top of it — turning DevOps into something scalable, structured, and developer-first at global scale.&lt;/p&gt;

&lt;p&gt;👉 The future is DevOps made effortless through smart platforms.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 What about you?
&lt;/h2&gt;

&lt;p&gt;What is the &lt;strong&gt;biggest time-waster&lt;/strong&gt; in your current DevOps setup?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment setup delays?&lt;/li&gt;
&lt;li&gt;CI/CD issues?&lt;/li&gt;
&lt;li&gt;Too many tools?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your real experience in the comments — curious to see what teams are struggling with most 👇&lt;/p&gt;

</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>cloud</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Types of APIs Explained: REST, GraphQL, gRPC &amp; SOAP (With Real-World Examples)</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Thu, 09 Apr 2026 12:07:06 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/types-of-apis-explained-rest-graphql-grpc-soap-with-real-world-examples-1lo2</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/types-of-apis-explained-rest-graphql-grpc-soap-with-real-world-examples-1lo2</guid>
      <description>&lt;p&gt;&lt;strong&gt;Types of APIs Explained: REST, GraphQL, gRPC &amp;amp; SOAP (With Real-World Examples)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When beginners start learning &lt;strong&gt;APIs&lt;/strong&gt;, they usually think there’s only one kind:&lt;br&gt;&lt;br&gt;
“Send a request → Get a response.”&lt;/p&gt;

&lt;p&gt;But in reality, there are &lt;strong&gt;multiple types of APIs&lt;/strong&gt;, each built for different purposes — &lt;strong&gt;speed&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, &lt;strong&gt;security&lt;/strong&gt;, or &lt;strong&gt;simplicity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this guide, you’ll learn the &lt;strong&gt;main types of APIs&lt;/strong&gt; with simple explanations, code examples, and real-world use cases.&lt;/p&gt;
&lt;h3&gt;
  
  
  🧠 What is an API? (Quick Recap)
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;API (Application Programming Interface)&lt;/strong&gt; is a set of rules that allows different software systems to communicate with each other.&lt;/p&gt;

&lt;p&gt;One system sends a &lt;strong&gt;request&lt;/strong&gt; → another system processes it → and returns a &lt;strong&gt;response&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;style&lt;/strong&gt; and &lt;strong&gt;protocol&lt;/strong&gt; of this communication decide the &lt;strong&gt;type of API&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  🔹 Main Types of APIs by Architecture Style
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. REST APIs – The Most Popular Type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST (Representational State Transfer)&lt;/strong&gt; is the &lt;strong&gt;most widely used&lt;/strong&gt; API style in 2026.&lt;/p&gt;

&lt;p&gt;It uses standard &lt;strong&gt;HTTP methods&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GET&lt;/strong&gt; – Fetch data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POST&lt;/strong&gt; – Create new data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PUT / PATCH&lt;/strong&gt; – Update data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; – Delete data
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /users/1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sreekanth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sreekanth@example.com"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt; Public APIs, mobile apps, and web applications&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Popular Examples:&lt;/strong&gt; Stripe, Razorpay, GitHub, Google Maps  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s popular:&lt;/strong&gt; Simple, scalable, and works everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. GraphQL – Get Exactly What You Need&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphQL&lt;/strong&gt; solves a common REST problem called &lt;strong&gt;over-fetching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of getting extra data, the client can request &lt;strong&gt;exactly&lt;/strong&gt; the fields it needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;createdAt&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt; Modern frontend and mobile apps&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Popular Examples:&lt;/strong&gt; Facebook, Shopify, GitHub, Airbnb  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Big Advantage:&lt;/strong&gt; Faster responses and better control for developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. gRPC – The Fastest for Microservices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gRPC&lt;/strong&gt; is a high-performance framework developed by Google.&lt;/p&gt;

&lt;p&gt;It uses &lt;strong&gt;Protocol Buffers&lt;/strong&gt; (binary format) instead of JSON, making it &lt;strong&gt;much faster&lt;/strong&gt; and lighter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely fast and low latency&lt;/li&gt;
&lt;li&gt;Smaller data size&lt;/li&gt;
&lt;li&gt;Strongly typed&lt;/li&gt;
&lt;li&gt;Supports streaming&lt;/li&gt;
&lt;/ul&gt;
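
&lt;p&gt;A minimal Protocol Buffers definition (a sketch — the service and message names are illustrative) shows how a gRPC contract is declared:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// user.proto (illustrative example)
syntax = "proto3";

service UserService {
  // One RPC method; gRPC generates client and server stubs from this
  rpc GetUser (GetUserRequest) returns (UserReply);
}

message GetUserRequest {
  int32 id = 1;
}

message UserReply {
  string name  = 1;  // fields are numbered, then encoded as compact binary
  string email = 2;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;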

&lt;p&gt;&lt;strong&gt;Best For:&lt;/strong&gt; Internal microservices communication and high-traffic systems&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Popular Examples:&lt;/strong&gt; Uber, Netflix, Google, Kubernetes  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to choose gRPC:&lt;/strong&gt; When you need &lt;strong&gt;maximum speed&lt;/strong&gt; between services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. SOAP – The Secure Enterprise Option&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOAP (Simple Object Access Protocol)&lt;/strong&gt; is an older but still important protocol, especially in large organizations.&lt;/p&gt;

&lt;p&gt;It uses &lt;strong&gt;XML&lt;/strong&gt; and has strong built-in security features.&lt;/p&gt;
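
&lt;p&gt;For a feel of the format (the operation name and namespace here are illustrative), a SOAP request wraps everything in an XML envelope:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"&amp;gt;
  &amp;lt;soap:Body&amp;gt;
    &amp;lt;!-- The operation and its parameters live inside the Body --&amp;gt;
    &amp;lt;GetBalance xmlns="http://example.com/bank"&amp;gt;
      &amp;lt;AccountId&amp;gt;12345&amp;lt;/AccountId&amp;gt;
    &amp;lt;/GetBalance&amp;gt;
  &amp;lt;/soap:Body&amp;gt;
&amp;lt;/soap:Envelope&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;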

&lt;p&gt;&lt;strong&gt;Still Used In:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional banking core systems&lt;/li&gt;
&lt;li&gt;Government and highly regulated industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important Note for India:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Modern systems like &lt;strong&gt;UPI&lt;/strong&gt;, &lt;strong&gt;BBPS&lt;/strong&gt;, and most fintech apps primarily use &lt;strong&gt;REST APIs&lt;/strong&gt; with ISO 20022 standards. They have largely moved away from SOAP for better speed and flexibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔹 Types of APIs by Access Level
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Public APIs&lt;/strong&gt; → Open to everyone (Example: Weather API, Google Maps)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private/Internal APIs&lt;/strong&gt; → Used only inside a company
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partner APIs&lt;/strong&gt; → Shared with specific business partners&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔄 Real-World Architecture Insight
&lt;/h3&gt;

&lt;p&gt;Most modern applications use a &lt;strong&gt;hybrid approach&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External-facing&lt;/strong&gt; (apps &amp;amp; websites) → &lt;strong&gt;REST&lt;/strong&gt; or &lt;strong&gt;GraphQL&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal microservices&lt;/strong&gt; → &lt;strong&gt;gRPC&lt;/strong&gt; (for high speed)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legacy systems&lt;/strong&gt; → &lt;strong&gt;SOAP&lt;/strong&gt; (for security &amp;amp; compliance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In India’s fintech ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UPI and public integrations → &lt;strong&gt;REST APIs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;High-volume internal services → &lt;strong&gt;gRPC&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Old core banking systems → Often still use SOAP or hybrid setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎯 Final Takeaway
&lt;/h3&gt;

&lt;p&gt;There is &lt;strong&gt;no single best API type&lt;/strong&gt; — each has its own strengths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;REST&lt;/strong&gt; → Best for &lt;strong&gt;simplicity&lt;/strong&gt; and wide compatibility
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphQL&lt;/strong&gt; → Best for &lt;strong&gt;flexibility&lt;/strong&gt; and precise data fetching
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gRPC&lt;/strong&gt; → Best for &lt;strong&gt;speed&lt;/strong&gt; and microservices
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOAP&lt;/strong&gt; → Best for &lt;strong&gt;security&lt;/strong&gt; in enterprise environments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these &lt;strong&gt;types of APIs&lt;/strong&gt; helps you design better systems and choose the right tool for every situation.&lt;/p&gt;

&lt;p&gt;💬 &lt;strong&gt;Your Turn:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Which &lt;strong&gt;type of API&lt;/strong&gt; have you used the most?&lt;br&gt;&lt;br&gt;
Which one do you want to learn next?  &lt;/p&gt;

&lt;p&gt;Drop your answers in the comments below! 👇&lt;/p&gt;




</description>
      <category>restapi</category>
      <category>graphql</category>
      <category>grpc</category>
      <category>soap</category>
    </item>
    <item>
      <title>API Explained: From Basics to Real-World Systems (UPI Deep Dive)</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Wed, 08 Apr 2026 07:21:54 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/api-explained-from-basics-to-real-world-systems-upi-deep-dive-8gn</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/api-explained-from-basics-to-real-world-systems-upi-deep-dive-8gn</guid>
      <description>&lt;p&gt;When you send ₹100 using PhonePe or Google Pay, it feels instant.&lt;br&gt;&lt;br&gt;
But behind that single tap, multiple systems communicate in real time across different banks.  &lt;/p&gt;

&lt;p&gt;👉 This seamless communication is powered by &lt;strong&gt;APIs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 What is an API?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
An &lt;strong&gt;Application Programming Interface (API)&lt;/strong&gt; is a set of rules that allows one software system to request another system to perform an action and return a result.  &lt;/p&gt;

&lt;p&gt;In simple terms:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Request → Process → Response&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🍽️ Simple Analogy&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Think of a restaurant:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You → Client
&lt;/li&gt;
&lt;li&gt;Waiter → API
&lt;/li&gt;
&lt;li&gt;Kitchen → Backend
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don’t enter the kitchen yourself.&lt;br&gt;&lt;br&gt;
You just place an order, the waiter handles everything, and you get your food.&lt;/p&gt;

&lt;p&gt;APIs work exactly the same way between different systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Types of APIs (Quick Overview)&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;REST APIs&lt;/strong&gt; — Most common (uses HTTP methods: GET, POST, PUT, DELETE)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphQL&lt;/strong&gt; — Client decides exactly what data it needs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gRPC / SOAP&lt;/strong&gt; — Used in high-performance or enterprise systems
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog, we’ll mainly focus on &lt;strong&gt;REST APIs&lt;/strong&gt;, as they power most modern applications including UPI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚙️ Basic API Example&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Here’s a very simple API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /users/1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sreekanth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DevOps Engineer"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client requests data → the server processes it → and sends back the response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📲 Real-World Example: UPI Payment Flow&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Let’s see what actually happens when you send ₹100:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App → Payer Bank → NPCI → Payee Bank → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;App collects amount + UPI PIN (encrypted)
&lt;/li&gt;
&lt;li&gt;App sends a secure API request to your bank
&lt;/li&gt;
&lt;li&gt;Your bank validates:

&lt;ul&gt;
&lt;li&gt;UPI PIN
&lt;/li&gt;
&lt;li&gt;Account balance
&lt;/li&gt;
&lt;li&gt;Daily limits
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Request is forwarded to &lt;strong&gt;NPCI&lt;/strong&gt; (National Payments Corporation of India)
&lt;/li&gt;
&lt;li&gt;NPCI routes the request to the payee’s bank
&lt;/li&gt;
&lt;li&gt;Payee’s bank credits the amount
&lt;/li&gt;
&lt;li&gt;Success response flows back to both apps
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 Total time: Usually &lt;strong&gt;under 2–3 seconds&lt;/strong&gt; ⚡&lt;/p&gt;

&lt;p&gt;At a high level, the UPI transaction flow is simply &lt;strong&gt;Payer PSP → NPCI → Payee PSP&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;🔔 &lt;strong&gt;When Things Are Not Instant (Webhooks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sometimes the bank takes longer.&lt;/p&gt;

&lt;p&gt;Instead of waiting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transaction is marked &lt;strong&gt;Pending&lt;/strong&gt; ⏳&lt;/li&gt;
&lt;li&gt;The bank/NPCI sends a &lt;strong&gt;webhook callback&lt;/strong&gt; 🔔 once it completes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Think of it as:&lt;br&gt;
“Don’t call us, we’ll call you.”&lt;/p&gt;

&lt;p&gt;This makes systems asynchronous and scalable.&lt;/p&gt;
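
&lt;p&gt;A webhook callback is just an HTTP POST to a URL you registered in advance. The payload below is a simplified, hypothetical shape, not an actual NPCI schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /webhooks/upi-status HTTP/1.1
Content-Type: application/json

{
  "txnId": "txn-12345",
  "status": "SUCCESS",
  "completedAt": "2026-04-08T07:21:54Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your server matches the &lt;code&gt;txnId&lt;/code&gt; to the pending transaction and updates its status — no polling required.&lt;/p&gt;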

&lt;p&gt;&lt;strong&gt;⚙️ Sample UPI API Request&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="nf"&gt;POST&lt;/span&gt; &lt;span class="nn"&gt;/v1/payments/upi&lt;/span&gt; &lt;span class="k"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;1.1&lt;/span&gt;
&lt;span class="na"&gt;Host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.phonepe.com&lt;/span&gt;
&lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bearer &amp;lt;access_token&amp;gt;&lt;/span&gt;
&lt;span class="na"&gt;Content-Type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application/json&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;           &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;paise&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(₹&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payeeVpa"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"friend@oksbi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remarks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Lunch money"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"txnId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"txn-12345"&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;used&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;idempotency&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Response:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SUCCESS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transactionId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UPI987654321"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"responseCode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"00"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🧩 API Request/Response Lifecycle (End-to-End Flow)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every API call follows a clear lifecycle, from the moment the request is sent until the response is received.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔧 What Happens Inside the Backend&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When an API receives a request, it goes through several important layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway&lt;/strong&gt; – Handles rate limiting &amp;amp; routing
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; – JWT, OAuth2, or API keys
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input Validation&lt;/strong&gt; – Checks request format and data
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Logic&lt;/strong&gt; – Balance check, fraud detection, rules
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database Operations&lt;/strong&gt; – Secure debit/credit (ACID transaction)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External API Calls&lt;/strong&gt; – To NPCI or other banks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging &amp;amp; Monitoring&lt;/strong&gt; – For debugging and observability
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response&lt;/strong&gt; – Sent back to the client
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;👉 To prevent duplicate payments, systems use &lt;strong&gt;idempotency keys&lt;/strong&gt; (&lt;code&gt;txnId&lt;/code&gt;).&lt;/p&gt;
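
&lt;p&gt;The idempotency check itself is simple. In pseudocode (a sketch of the general pattern, not any bank’s actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;on payment_request(txnId, amount, payeeVpa):
    if store.contains(txnId):        # duplicate or retried request
        return store.get(txnId)      # replay the original response
    result = process_payment(amount, payeeVpa)
    store.put(txnId, result)         # remember the outcome
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is why tapping “Pay” twice, or a network retry, never debits your account twice.&lt;/p&gt;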

&lt;p&gt;🏦 &lt;strong&gt;Why ACID Matters in Payments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Banking systems rely on ACID properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Atomicity:&lt;/strong&gt; Either the full transaction happens or none of it does&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; The total money across accounts remains correct&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolation:&lt;/strong&gt; Millions of users can transact simultaneously and safely&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability:&lt;/strong&gt; Once success is returned, the result is permanent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This ensures no “money lost” scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔄 Microservices Architecture&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Modern apps like PhonePe are not built as one single block. They are divided into independent microservices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User Service
&lt;/li&gt;
&lt;li&gt;Payment Service
&lt;/li&gt;
&lt;li&gt;Notification Service
&lt;/li&gt;
&lt;li&gt;Fraud Detection Service
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These services talk to each other using internal APIs or message queues.&lt;/p&gt;

&lt;p&gt;This makes systems scalable and fault-tolerant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚡ Scaling APIs for Millions of Transactions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Apps like PhonePe and Google Pay handle crores of transactions every day using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load balancing
&lt;/li&gt;
&lt;li&gt;Horizontal scaling (Kubernetes)
&lt;/li&gt;
&lt;li&gt;Caching (Redis)
&lt;/li&gt;
&lt;li&gt;Message queues (Kafka)
&lt;/li&gt;
&lt;li&gt;Rate limiting
&lt;/li&gt;
&lt;li&gt;Circuit breakers
&lt;/li&gt;
&lt;/ul&gt;
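&lt;p&gt;Here is one of those patterns in miniature: a hypothetical circuit breaker that fails fast after repeated downstream errors instead of hammering a struggling service (the threshold is illustrative):&lt;/p&gt;

```python
# Circuit-breaker sketch: after N consecutive failures, stop calling the
# downstream service and fail immediately until the circuit is reset.
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def call(self, func):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
        except Exception:
            self.failures = min(self.failures + 1, self.threshold)
            if self.failures == self.threshold:
                self.open = True  # trip the breaker
            raise
        self.failures = 0  # any success resets the count
        return result
```

&lt;p&gt;Real implementations also add a "half-open" state that probes the service and closes the circuit once it recovers.&lt;/p&gt;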

&lt;p&gt;&lt;strong&gt;🎯 Final Takeaway&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
APIs are not just a technical concept — they are the &lt;strong&gt;invisible engine&lt;/strong&gt; behind everything: &lt;/p&gt;

&lt;p&gt;Sending money 💰&lt;br&gt;
Logging in 🔐&lt;br&gt;
Fetching data 📊&lt;/p&gt;

&lt;p&gt;Once you understand APIs,&lt;br&gt;
you start seeing the architecture behind every app you use.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>api</category>
      <category>microservices</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Why "Just Restart It" Stopped Working</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 24 Mar 2026 07:58:32 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/why-just-restart-it-stopped-working-2ef9</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/why-just-restart-it-stopped-working-2ef9</guid>
      <description>&lt;h2&gt;
  
  
  Why "Just Restart It" Stopped Working
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A eulogy for the universal debugging technique&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Universal Truth
&lt;/h2&gt;

&lt;p&gt;Every engineer has said it.&lt;br&gt;&lt;br&gt;
Every engineer has heard it.&lt;/p&gt;

&lt;p&gt;Three words that have debugged more systems than all monitoring tools combined:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Have you tried restarting it?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It worked for decades. So well we turned it into a meme. A joke. A badge of honor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Did you turn it off and on again?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We laughed because it was true.&lt;/p&gt;


&lt;h2&gt;
  
  
  When Restarting Made Sense
&lt;/h2&gt;

&lt;p&gt;Once upon a time, a server was a physical thing.&lt;/p&gt;

&lt;p&gt;One machine. One process. One problem.&lt;/p&gt;

&lt;p&gt;When something broke:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Service stops responding
→ SSH into the box
→ ps aux | grep myapp
→ PID still there? Process hung?
→ kill -9 PID
→ ./start-myapp.sh
→ Everything works again

Total time: 2 minutes
Total stress: Minimal
Total sleep lost: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why did this work?&lt;/p&gt;

&lt;p&gt;Because the problem was usually temporary.&lt;br&gt;&lt;br&gt;
A memory leak. A deadlock. A bad connection that timed out wrong.&lt;/p&gt;

&lt;p&gt;The code had a bug, sure. But restarting reset the state to &lt;em&gt;before the bug happened&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It wasn't elegant. It wasn't permanent.&lt;br&gt;&lt;br&gt;
But at 3 AM, that's all anyone cared about.&lt;/p&gt;


&lt;h2&gt;
  
  
  The First Sign of Trouble
&lt;/h2&gt;

&lt;p&gt;Then we got more servers.&lt;/p&gt;

&lt;p&gt;One box became ten.&lt;br&gt;&lt;br&gt;
Ten became a hundred.&lt;/p&gt;

&lt;p&gt;Restarting stopped being a single command.&lt;br&gt;&lt;br&gt;
It became a deployment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;server &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;servers.txt&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;ssh &lt;span class="nv"&gt;$server&lt;/span&gt; &lt;span class="s2"&gt;"systemctl restart myapp"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This worked. Mostly.&lt;/p&gt;

&lt;p&gt;Until the day it didn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Cascade
&lt;/h2&gt;

&lt;p&gt;I watched this happen once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;02:15 - Pager: "Database connections failing"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The on-call engineer checks the logs.&lt;br&gt;&lt;br&gt;
Database is overwhelmed. Too many connections.&lt;/p&gt;

&lt;p&gt;The solution, burned into muscle memory from years of single-server debugging:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Restart the database."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One command. One mistake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database came back in 45 seconds.&lt;/p&gt;

&lt;p&gt;In those 45 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All 200 application servers lost their connection pools&lt;/li&gt;
&lt;li&gt;All 200 retried simultaneously, using identical retry logic&lt;/li&gt;
&lt;li&gt;All 200 failed their health checks&lt;/li&gt;
&lt;li&gt;The load balancer marked them all unhealthy&lt;/li&gt;
&lt;li&gt;The site went down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The database was fine.&lt;br&gt;&lt;br&gt;
The app servers were fine.&lt;br&gt;&lt;br&gt;
The connections were gone.&lt;/p&gt;

&lt;p&gt;The restart fixed nothing and broke everything.&lt;/p&gt;

&lt;p&gt;One restart.&lt;br&gt;&lt;br&gt;
47 minutes of downtime.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Restarting Broke
&lt;/h2&gt;

&lt;p&gt;Restarting worked when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State lived in one place&lt;/li&gt;
&lt;li&gt;Dependencies were simple&lt;/li&gt;
&lt;li&gt;Recovery was faster than finding root cause&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Restarting broke when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;State moved to databases, caches, message queues&lt;/li&gt;
&lt;li&gt;Services started calling other services&lt;/li&gt;
&lt;li&gt;"Just restart it" became "restart everything in the right order with the right delays and pray"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A restart is no longer a local action.&lt;br&gt;&lt;br&gt;
It's a distributed event.&lt;/p&gt;

&lt;p&gt;You don't restart &lt;em&gt;one thing&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
You restart a graph of dependencies.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Happens When You Restart Now
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You restart Service A
↓
Service A disconnects from database
↓
Database releases locks
↓
Service B loses connection to Service A
↓
Service B retries aggressively
↓
Retries overwhelm Service C
↓
Service C crashes
↓
Everything is on fire
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;All because you restarted "just one thing."&lt;/p&gt;


&lt;h2&gt;
  
  
  The Lie We Tell Ourselves
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;"Restarting is harmless."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It isn't.&lt;/p&gt;

&lt;p&gt;Every restart is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A forced state reset&lt;/li&gt;
&lt;li&gt;A connection teardown&lt;/li&gt;
&lt;li&gt;A potential cascade trigger&lt;/li&gt;
&lt;li&gt;A temporary partial outage (even if small)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We accepted restarts as "free" because the cost was invisible.&lt;/p&gt;

&lt;p&gt;Until it wasn't.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Replaced Restarting
&lt;/h2&gt;

&lt;p&gt;The industry didn't ban restarts.&lt;/p&gt;

&lt;p&gt;It made them unnecessary.&lt;/p&gt;
&lt;h3&gt;
  
  
  Health checks
&lt;/h3&gt;

&lt;p&gt;Detect problems before users do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kubernetes liveness probe example&lt;/span&gt;
&lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If service unhealthy, don't send traffic
Let it recover or replace it
Users never see the failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Graceful degradation
&lt;/h3&gt;

&lt;p&gt;Fail partially, not completely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache down? Serve stale data
Database slow? Queue writes, serve reads
Something broke? Everything else keeps running
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
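&lt;p&gt;The cache-down case can be sketched in a few lines (the function and cache names are made up):&lt;/p&gt;

```python
# Graceful-degradation sketch: prefer fresh data, fall back to stale
# cached data when the fresh path fails, and only error out when there
# is nothing to serve at all.
cache = {}

def get_profile(user_id, fetch_fresh):
    try:
        fresh = fetch_fresh(user_id)
        cache[user_id] = fresh
        return fresh
    except ConnectionError:
        if user_id in cache:
            return cache[user_id]  # stale, but the page still renders
        raise  # nothing cached: this request genuinely fails
```

&lt;p&gt;A slightly stale profile beats an error page.&lt;/p&gt;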



&lt;h3&gt;
  
  
  Automatic replacement
&lt;/h3&gt;

&lt;p&gt;Never restart. Always replace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pod dies? New one starts
Node fails? Pods move
Same binary. Clean state. No cascade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rolling restarts
&lt;/h3&gt;

&lt;p&gt;One at a time, with verification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Restart server 1 of 10
Wait for health check
Restart server 2 of 10
Never lose capacity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
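&lt;p&gt;In code, that loop might look like this sketch, where &lt;code&gt;restart&lt;/code&gt; and &lt;code&gt;healthy&lt;/code&gt; are hypothetical callbacks into your infrastructure:&lt;/p&gt;

```python
import time

# Rolling-restart sketch: one server at a time, verified healthy before
# moving on, so capacity never drops by more than one node.
def rolling_restart(servers, restart, healthy, retries=30, delay=1):
    for server in servers:
        restart(server)
        for _ in range(retries):
            if healthy(server):
                break
            time.sleep(delay)
        else:
            raise RuntimeError(f"{server} failed its health check; halting rollout")
```

&lt;p&gt;The important part is the &lt;code&gt;else&lt;/code&gt;: if a server never comes back healthy, the rollout stops instead of marching on and taking out the rest.&lt;/p&gt;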






&lt;h2&gt;
  
  
  The Systems That Don't Need Restarts
&lt;/h2&gt;

&lt;p&gt;Netflix doesn't restart. It terminates and replaces.&lt;br&gt;&lt;br&gt;
Google doesn't restart. It shifts load and repairs.&lt;br&gt;&lt;br&gt;
Your bank doesn't restart. It fails over to another region.&lt;/p&gt;

&lt;p&gt;These aren't magic.&lt;br&gt;&lt;br&gt;
They're design choices.&lt;/p&gt;

&lt;p&gt;They assumed from day one that "restart" was not a strategy.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Confession
&lt;/h2&gt;

&lt;p&gt;I still say "have you tried restarting it?"&lt;/p&gt;

&lt;p&gt;Sometimes it's the fastest path to &lt;em&gt;it works now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But I don't pretend it's a fix anymore.&lt;/p&gt;

&lt;p&gt;It's a diagnostic.&lt;br&gt;&lt;br&gt;
A temporary patch.&lt;br&gt;&lt;br&gt;
A way to buy time until the real problem reveals itself.&lt;/p&gt;

&lt;p&gt;The difference is:&lt;br&gt;&lt;br&gt;
I know the difference now.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Can Do Monday
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For your most critical service:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the last time it was restarted&lt;/li&gt;
&lt;li&gt;Ask: "Why did that restart happen?"&lt;/li&gt;
&lt;li&gt;Ask: "Could we have avoided it?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If yes, build the automation.&lt;br&gt;&lt;br&gt;
If no, document why (so next time you know).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For your next outage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resist the restart reflex&lt;/li&gt;
&lt;li&gt;Check dependencies first&lt;/li&gt;
&lt;li&gt;Check connections second&lt;/li&gt;
&lt;li&gt;Check logs third&lt;/li&gt;
&lt;li&gt;Restart only when you understand what you're about to break&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Question
&lt;/h2&gt;

&lt;p&gt;When was the last time you restarted something&lt;br&gt;&lt;br&gt;
and &lt;em&gt;didn't&lt;/em&gt; know exactly what would happen when it came back?&lt;/p&gt;

&lt;p&gt;Be honest.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on operations in the age of distributed systems. Next up: "The Pager Should Not Exist."&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
    <item>
      <title>From Process Management to State Reconciliation</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 24 Feb 2026 03:09:46 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/from-process-management-to-state-reconciliation-9cj</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/from-process-management-to-state-reconciliation-9cj</guid>
      <description>&lt;h2&gt;
  
  
  I used to restart servers at 2AM… Kubernetes made that job disappear
&lt;/h2&gt;

&lt;p&gt;02:15 AM — Pager goes off&lt;br&gt;
“nginx is down on web-01”&lt;/p&gt;

&lt;p&gt;You wake up.&lt;br&gt;
Grab your laptop.&lt;br&gt;
SSH into the server.&lt;br&gt;
Run a few commands. Restart the process.&lt;/p&gt;

&lt;p&gt;02:22 AM — It’s back.&lt;/p&gt;

&lt;p&gt;Try to sleep again.&lt;/p&gt;

&lt;p&gt;This used to be normal.&lt;/p&gt;

&lt;p&gt;Then Kubernetes changed the rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 The old world: Process-driven operations
&lt;/h2&gt;

&lt;p&gt;Before Kubernetes, everything revolved around &lt;strong&gt;processes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A service was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Linux process&lt;/li&gt;
&lt;li&gt;Running on a specific machine&lt;/li&gt;
&lt;li&gt;Identified by a PID&lt;/li&gt;
&lt;li&gt;Restarted manually (or via basic supervisors)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The assumptions were simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machines are stable&lt;/li&gt;
&lt;li&gt;Failures are rare&lt;/li&gt;
&lt;li&gt;Humans fix problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And when something broke…&lt;br&gt;
👉 &lt;strong&gt;you fixed it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Availability depended on:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How fast someone could wake up and respond.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🐳 Containers helped… but didn’t solve the real problem
&lt;/h2&gt;

&lt;p&gt;With tools like Docker, things improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistent environments&lt;/li&gt;
&lt;li&gt;Faster deployments&lt;/li&gt;
&lt;li&gt;Fewer “works on my machine” issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But let’s be honest…&lt;/p&gt;

&lt;p&gt;If a container crashed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maybe it restarted&lt;/li&gt;
&lt;li&gt;Maybe it didn’t&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the node died?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re still in trouble&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If dependencies failed?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still your problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Containers improved &lt;strong&gt;portability&lt;/strong&gt;&lt;br&gt;
👉 They did NOT guarantee &lt;strong&gt;reliability&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔄 Kubernetes changed the question
&lt;/h2&gt;

&lt;p&gt;Kubernetes doesn’t ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Is this process running?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Is the system in the state I declared?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s a massive shift.&lt;/p&gt;

&lt;p&gt;Instead of managing processes…&lt;br&gt;
you define &lt;strong&gt;desired state&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ The magic: State reconciliation
&lt;/h2&gt;

&lt;p&gt;You declare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I want 3 replicas”&lt;/li&gt;
&lt;li&gt;“They should always be running”&lt;/li&gt;
&lt;li&gt;“They should be healthy”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes continuously checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current state&lt;/li&gt;
&lt;li&gt;Desired state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If something breaks…&lt;br&gt;
👉 it fixes it automatically&lt;/p&gt;

&lt;p&gt;Not later.&lt;br&gt;
Not after a pager alert.&lt;br&gt;
&lt;strong&gt;Continuously.&lt;/strong&gt;&lt;/p&gt;
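&lt;p&gt;The reconciliation loop is simple enough to sketch. This toy version (pod names and counts are illustrative) captures the core idea behind every Kubernetes controller: compare desired state to current state and act on the difference:&lt;/p&gt;

```python
import itertools

_ids = itertools.count(100)  # fresh ids for replacement pods (illustrative)

# Reconciliation sketch: converge the current pod list toward the
# desired replica count, replacing what is missing and trimming excess.
def reconcile(desired_replicas, pods):
    missing = max(desired_replicas - len(pods), 0)
    excess = max(len(pods) - desired_replicas, 0)
    for _ in range(missing):
        pods.append(f"pod-{next(_ids)}")  # start a replacement
    for _ in range(excess):
        pods.pop()                        # scale down
    return pods
```

&lt;p&gt;Run this in a loop and a dead pod is replaced on the next pass, with no pager involved.&lt;/p&gt;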




&lt;h2&gt;
  
  
  🔄 Traditional vs Kubernetes mindsets
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Traditional:&lt;/strong&gt; a service is a process with a PID on one machine&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes:&lt;/strong&gt; a service is a desired state the cluster keeps true&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Traditional:&lt;/strong&gt; a human restarts what broke&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes:&lt;/strong&gt; the controller replaces what broke, automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🧠 Why Kubernetes doesn’t care about PIDs
&lt;/h2&gt;

&lt;p&gt;In traditional systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PID = identity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PID = irrelevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because a PID is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local to a machine&lt;/li&gt;
&lt;li&gt;Temporary&lt;/li&gt;
&lt;li&gt;Lost on restart&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes doesn’t track processes.&lt;/p&gt;

&lt;p&gt;It tracks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Desired outcomes&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don’t ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What’s the PID?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Do I have 3 healthy pods?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 That’s the difference between &lt;strong&gt;instance thinking&lt;/strong&gt; and &lt;strong&gt;system thinking&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💥 The real shift: Replace, don’t repair
&lt;/h2&gt;

&lt;p&gt;Old mindset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fix the broken process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;New mindset:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace it
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;👉 Failure is handled through replacement, not repair.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes doesn’t try to “save” things.&lt;/p&gt;

&lt;p&gt;It simply ensures:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system matches your declared state&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧪 Jobs are different too
&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run jobs manually&lt;/li&gt;
&lt;li&gt;Monitor externally&lt;/li&gt;
&lt;li&gt;Retry manually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define a Job&lt;/li&gt;
&lt;li&gt;Kubernetes ensures completion&lt;/li&gt;
&lt;li&gt;Retries automatically&lt;/li&gt;
&lt;li&gt;Tracks success/failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 You define intent.&lt;br&gt;
👉 System enforces outcome.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Failure is not an exception anymore
&lt;/h2&gt;

&lt;p&gt;At scale, failure is constant.&lt;/p&gt;

&lt;p&gt;Systems like Google’s Borg (Kubernetes’ ancestor) proved this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Machines fail&lt;/li&gt;
&lt;li&gt;Networks break&lt;/li&gt;
&lt;li&gt;Processes crash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not &lt;em&gt;if&lt;/em&gt;&lt;br&gt;
But &lt;em&gt;how often&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes is built for this reality.&lt;/p&gt;

&lt;p&gt;It assumes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes will disappear&lt;/li&gt;
&lt;li&gt;Pods will die&lt;/li&gt;
&lt;li&gt;Networks will glitch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it’s okay with that.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔁 What actually changed?
&lt;/h2&gt;

&lt;p&gt;Before Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You maintained systems&lt;/li&gt;
&lt;li&gt;You fixed failures&lt;/li&gt;
&lt;li&gt;You reacted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You define intent&lt;/li&gt;
&lt;li&gt;The system maintains itself&lt;/li&gt;
&lt;li&gt;Recovery is automatic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Your job shifts from:&lt;br&gt;
&lt;strong&gt;operator → system designer&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🏁 Final thought
&lt;/h2&gt;

&lt;p&gt;Kubernetes doesn’t remove failure.&lt;/p&gt;

&lt;p&gt;It removes panic.&lt;/p&gt;

&lt;p&gt;The system doesn’t ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Who will fix this?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What should this look like?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And then it makes it happen.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Your turn
&lt;/h2&gt;

&lt;p&gt;What’s the last thing you had to fix manually at 2AM?&lt;/p&gt;

&lt;p&gt;And could Kubernetes have handled it for you?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>linux</category>
      <category>sre</category>
    </item>
    <item>
      <title>How Platform Engineering Changes the Game</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 27 Jan 2026 14:45:50 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/how-platform-engineering-changes-the-game-102d</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/how-platform-engineering-changes-the-game-102d</guid>
      <description>&lt;p&gt;DevOps isn't dying.&lt;br&gt;&lt;br&gt;
But the &lt;strong&gt;"central DevOps team doing everything" model&lt;/strong&gt; is hitting limits at scale.&lt;/p&gt;

&lt;p&gt;Here's what's replacing it — and &lt;strong&gt;why it works&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧱 What Platform Teams &lt;strong&gt;Actually&lt;/strong&gt; Build
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;(Not just theory)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Internal Developer Platforms (IDPs)&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single control plane for deployments, from dev → prod
&lt;/li&gt;
&lt;li&gt;Example: &lt;strong&gt;Backstage&lt;/strong&gt; (Spotify), &lt;strong&gt;Internal Developer Portal&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Result: &lt;strong&gt;60% less time&lt;/strong&gt; spent on deployment setup (Humanitec data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Golden Paths, Not Guardrails&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-approved Terraform modules for AWS/GCP/Azure
&lt;/li&gt;
&lt;li&gt;Standardized K8s configurations with sane defaults
&lt;/li&gt;
&lt;li&gt;Security/compliance &lt;strong&gt;baked in&lt;/strong&gt;, not bolted on
&lt;/li&gt;
&lt;li&gt;Outcome: &lt;strong&gt;83% faster&lt;/strong&gt; infra provisioning (Gartner)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Self-Service, Not Ticket-Based&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers deploy via UI/API/Git push — no tickets
&lt;/li&gt;
&lt;li&gt;Automated approval workflows replace manual reviews
&lt;/li&gt;
&lt;li&gt;Impact: &lt;strong&gt;10x more deployments&lt;/strong&gt; with same team size&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🏢 Real-World Example: &lt;strong&gt;Amazon's "You Build It, You Run It"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The famous mandate works &lt;strong&gt;because&lt;/strong&gt; of the invisible platform:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What developers see:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;git push&lt;/code&gt; → running service
&lt;/li&gt;
&lt;li&gt;Built-in monitoring, logging, alerting
&lt;/li&gt;
&lt;li&gt;One-click rollback, canary deployments
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What platform provides:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CodePipeline&lt;/strong&gt; templates (not custom Jenkins)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDK constructs&lt;/strong&gt; (not raw CloudFormation)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal service catalog&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized observability stack&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;150M+ deployments/year
&lt;/li&gt;
&lt;li&gt;Teams deploy &lt;strong&gt;thousands of times daily&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No central bottleneck&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚙️ The Tooling Shift
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OLD DevOps Stack:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Jenkins → Ansible → Custom scripts → Slack alerts → Manual dashboards&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NEW Platform Stack:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Backstage (UI) → ArgoCD (GitOps) → Crossplane (Control Plane)&lt;br&gt;&lt;br&gt;
→ OpenTelemetry (Observability) → Internal APIs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key difference:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative&lt;/strong&gt; over imperative
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git as source of truth&lt;/strong&gt; for everything
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API-first&lt;/strong&gt; everything&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📊 The Numbers Don't Lie
&lt;/h3&gt;

&lt;p&gt;Companies with mature platforms report:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50% fewer production incidents&lt;/strong&gt; (DORA)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;75% faster mean time to recovery&lt;/strong&gt; (MTTR)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;40% less time spent on "keeping lights on"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3x higher developer satisfaction&lt;/strong&gt; (SPACE metrics)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🤖 Where AI &lt;strong&gt;Actually&lt;/strong&gt; Helps Today
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not:&lt;/strong&gt; "AI will write your Terraform"&lt;br&gt;&lt;br&gt;
&lt;strong&gt;But:&lt;/strong&gt; "AI explains why your deployment failed"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful patterns right now:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-driven &lt;strong&gt;failure analysis&lt;/strong&gt; in CI/CD logs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization suggestions&lt;/strong&gt; for cloud resources
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security misconfiguration detection&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation generation&lt;/strong&gt; from code changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Still needed:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Platform engineers to &lt;strong&gt;design the systems&lt;/strong&gt; AI operates on
&lt;/li&gt;
&lt;li&gt;Human judgment for &lt;strong&gt;architecture decisions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cultural change management&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚨 The Hard Parts (Nobody Talks About)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Platform adoption isn't automatic&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need &lt;strong&gt;developer buy-in&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Must be &lt;strong&gt;better than the DIY alternative&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Requires &lt;strong&gt;investment in UX&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Platform teams get it wrong when:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They build &lt;strong&gt;what they think devs need&lt;/strong&gt; (not what they actually need)
&lt;/li&gt;
&lt;li&gt;They create &lt;strong&gt;another complex tool&lt;/strong&gt; (instead of simplifying)
&lt;/li&gt;
&lt;li&gt;They &lt;strong&gt;over-standardize&lt;/strong&gt; and kill innovation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Success metrics are tricky&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not: "How many services use our platform?"
&lt;/li&gt;
&lt;li&gt;But: "How much faster can teams ship?"
&lt;/li&gt;
&lt;li&gt;And: "How many outages did we prevent?"&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🎯 The Real Shift
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Submit a ticket, wait 3 days, get your dev environment"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Click button, get environment, start coding in 5 minutes"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;From:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Ops owns stability, Dev owns features" (siloed)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To:&lt;/strong&gt;  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Teams own their services, platform provides safety nets" (aligned)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  💡 If You Remember One Thing
&lt;/h3&gt;

&lt;p&gt;Platform engineering &lt;strong&gt;isn't about building tools&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
It's about &lt;strong&gt;reducing cognitive load&lt;/strong&gt; for developers.&lt;/p&gt;

&lt;p&gt;The best platform is the one developers &lt;strong&gt;don't even notice&lt;/strong&gt; —&lt;br&gt;&lt;br&gt;
because it just &lt;strong&gt;gets out of their way&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;🔍 Are you building or using an internal platform?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;What's the ONE thing that made it successful (or painful)?&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>platformengineering</category>
      <category>devops</category>
      <category>automation</category>
      <category>internaldeveloperplatform</category>
    </item>
    <item>
      <title>Companies like Spotify (with Backstage) and Netflix scaled DevOps exactly this way — by building platforms instead of doing everything centrally.</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 06 Jan 2026 12:00:40 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/companies-like-spotify-with-backstage-and-netflix-scaled-devops-exactly-this-way-by-building-3n93</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/companies-like-spotify-with-backstage-and-netflix-scaled-devops-exactly-this-way-by-building-3n93</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im" class="crayons-story__hidden-navigation-link"&gt;Why Traditional DevOps Stops Scaling&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/sreekanth_kuruba_91721e5d" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3286476%2Fc7a306ec-1c67-4d33-901a-1148effc29ce.jpg" alt="sreekanth_kuruba_91721e5d profile" class="crayons-avatar__image" width="96" height="96"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/sreekanth_kuruba_91721e5d" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sreekanth Kuruba
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sreekanth Kuruba
                
              
              &lt;div id="story-author-preview-content-3148221" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/sreekanth_kuruba_91721e5d" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3286476%2Fc7a306ec-1c67-4d33-901a-1148effc29ce.jpg" class="crayons-avatar__image" alt="" width="96" height="96"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sreekanth Kuruba&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jan 6&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im" id="article-link-3148221"&gt;
          Why Traditional DevOps Stops Scaling
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag crayons-tag--filled  " href="/t/discuss"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;discuss&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/platformengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;platformengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/career"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;career&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="24" height="24"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            2 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>discuss</category>
      <category>career</category>
    </item>
    <item>
      <title>Why Traditional DevOps Stops Scaling</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 06 Jan 2026 06:06:56 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/why-traditional-devops-stops-scaling-1im</guid>
      <description>&lt;p&gt;Traditional DevOps works well…&lt;br&gt;
&lt;strong&gt;until the organization grows.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At small scale, a central DevOps team deploying, fixing, and firefighting everything feels efficient.&lt;/p&gt;

&lt;p&gt;At large scale, it becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;And not because DevOps is bad.&lt;br&gt;
Because &lt;strong&gt;humans don’t scale the same way systems do&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  🚧 Why Traditional DevOps Stops Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. People become the bottleneck&lt;/strong&gt;&lt;br&gt;
As companies grow, everyone needs DevOps help.&lt;br&gt;
Deployments. Pipelines. Terraform. Kubernetes.&lt;/p&gt;

&lt;p&gt;Senior DevOps engineers are expensive and hard to hire.&lt;br&gt;
Soon, the DevOps team becomes a ticket queue instead of an enabler.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;2. Toolchains turn into spaghetti&lt;/strong&gt;&lt;br&gt;
CI tools, CD tools, scanners, monitors, secrets managers.&lt;/p&gt;

&lt;p&gt;Each one solves a problem.&lt;br&gt;
Together, they create complexity.&lt;/p&gt;

&lt;p&gt;Maintaining fragile integrations slows teams down more than it helps them move fast.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;3. Manual steps creep back in&lt;/strong&gt;&lt;br&gt;
Approvals, one-off fixes, environment-specific configs.&lt;/p&gt;

&lt;p&gt;Manual work means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inconsistency&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;li&gt;Late-night outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual processes don’t scale. They multiply risk.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;4. Developers carry too much operational weight&lt;/strong&gt;&lt;br&gt;
“You build it, you run it” sounds great.&lt;/p&gt;

&lt;p&gt;But without the right abstractions, developers become:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accidental infrastructure experts&lt;/li&gt;
&lt;li&gt;Part-time SREs&lt;/li&gt;
&lt;li&gt;Slower feature builders&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cognitive load goes up. Velocity goes down.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;5. No self-service = no speed&lt;/strong&gt;&lt;br&gt;
Without self-service platforms, developers must touch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes YAML&lt;/li&gt;
&lt;li&gt;Terraform internals&lt;/li&gt;
&lt;li&gt;Cloud primitives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of shipping features, they wrestle with infrastructure.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;6. Silos quietly return&lt;/strong&gt;&lt;br&gt;
Even with DevOps intentions, silos reappear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ops rewarded for stability&lt;/li&gt;
&lt;li&gt;Dev rewarded for speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Different incentives. Same old friction.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;7. Monitoring stays reactive&lt;/strong&gt;&lt;br&gt;
Traditional monitoring reacts &lt;em&gt;after&lt;/em&gt; things break.&lt;/p&gt;

&lt;p&gt;At scale, teams need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proactive observability&lt;/li&gt;
&lt;li&gt;Fast root cause analysis&lt;/li&gt;
&lt;li&gt;Context, not just alerts&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧱 The Natural Outcome: Platform Engineering
&lt;/h3&gt;

&lt;p&gt;These challenges didn’t kill DevOps.&lt;/p&gt;

&lt;p&gt;They &lt;strong&gt;forced it to evolve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Platform Engineering emerged to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Codify best practices&lt;/li&gt;
&lt;li&gt;Provide golden paths&lt;/li&gt;
&lt;li&gt;Abstract complexity&lt;/li&gt;
&lt;li&gt;Enable self-service safely&lt;/li&gt;
&lt;/ul&gt;
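&lt;p&gt;A golden path usually surfaces as a one-command, self-service flow. The sketch below is hypothetical: the &lt;code&gt;platform&lt;/code&gt; CLI and its verbs are invented for illustration, with a stub standing in for a real tool so the snippet runs anywhere:&lt;/p&gt;

```shell
# Hypothetical golden-path flow; the platform() function is a stub that
# stands in for an internal platform CLI (invented for this sketch).
platform() { echo "platform: $*"; }

# A developer gets a production-ready service without touching
# Kubernetes YAML, Terraform, or cloud primitives directly.
platform service create --template nodejs-api --name checkout
platform env promote checkout --to production
```

&lt;p&gt;The point is not the tool. It is that the paved road encodes the organization’s best practices, so the safe way becomes the easy way.&lt;/p&gt;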

&lt;p&gt;Internal Developer Platforms don’t replace DevOps principles.&lt;/p&gt;

&lt;p&gt;They make them work &lt;strong&gt;at enterprise scale&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 The Big Idea
&lt;/h3&gt;

&lt;p&gt;DevOps didn’t fail.&lt;/p&gt;

&lt;p&gt;It succeeded so well that it needed a new form.&lt;/p&gt;

&lt;p&gt;From:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Humans doing DevOps for everyone&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Platforms enabling DevOps for everyone&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the shift.&lt;/p&gt;

&lt;p&gt;And it’s why Platform Engineering exists.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤔 The Big Question
&lt;/h3&gt;

&lt;p&gt;If a central DevOps team can’t deploy everything forever…&lt;/p&gt;

&lt;p&gt;What replaces it?&lt;/p&gt;

&lt;p&gt;👉 In Part 2, we’ll look at how leading companies are solving this with Platform Engineering.&lt;/p&gt;




</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>discuss</category>
      <category>career</category>
    </item>
    <item>
      <title>Docker Networking: How Packets Actually Move</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 23 Dec 2025 13:36:51 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/docker-networking-how-packets-actually-move-2k6h</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/docker-networking-how-packets-actually-move-2k6h</guid>
      <description>&lt;p&gt;Containers do not have “networking” in the abstract sense.&lt;br&gt;&lt;br&gt;
They participate in Linux networking through isolation, indirection, and policy.  &lt;/p&gt;

&lt;p&gt;When a container sends a packet, it does not leave Docker. It leaves a &lt;strong&gt;network namespace&lt;/strong&gt;, traverses a &lt;strong&gt;virtual Ethernet pair&lt;/strong&gt;, crosses a &lt;strong&gt;bridge or routing boundary&lt;/strong&gt;, and is transformed by &lt;strong&gt;netfilter rules&lt;/strong&gt; before it ever reaches a wire.  &lt;/p&gt;

&lt;p&gt;Understanding this path explains nearly every networking behavior attributed to Docker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Network Namespaces as the Isolation Boundary
&lt;/h3&gt;

&lt;p&gt;Each container runs inside its own network namespace containing:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interfaces
&lt;/li&gt;
&lt;li&gt;Routes
&lt;/li&gt;
&lt;li&gt;ARP tables
&lt;/li&gt;
&lt;li&gt;iptables chains
&lt;/li&gt;
&lt;li&gt;Loopback device
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing inside the container is virtualized. The kernel enforces isolation by scoping visibility.  &lt;/p&gt;

&lt;p&gt;Docker’s responsibility is namespace construction and wiring — not packet delivery.&lt;/p&gt;
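&lt;p&gt;This scoping is observable from any unprivileged shell. Every process holds a reference to its network namespace under &lt;code&gt;/proc&lt;/code&gt;; two processes share a network stack exactly when those references point at the same kernel object:&lt;/p&gt;

```shell
# A network namespace is a kernel object identified by an inode.
# A process inside a container reports a different inode than the host shell.
readlink /proc/self/ns/net    # prints something like net:[4026531840]
```

&lt;p&gt;Comparing this value from a host shell and from a &lt;code&gt;docker exec&lt;/code&gt; shell makes the isolation boundary concrete.&lt;/p&gt;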

&lt;h3&gt;
  
  
  The Default Bridge Is a Linux Bridge
&lt;/h3&gt;

&lt;p&gt;The default Docker network is backed by a Linux bridge named &lt;code&gt;docker0&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;When a container is created:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A veth pair is allocated
&lt;/li&gt;
&lt;li&gt;One endpoint enters the container namespace as &lt;code&gt;eth0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The peer endpoint attaches to &lt;code&gt;docker0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;An IP is assigned from the bridge subnet
&lt;/li&gt;
&lt;li&gt;NAT rules are installed for outbound traffic
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The bridge provides Layer 2 adjacency. Routing and NAT occur outside the container.&lt;br&gt;&lt;br&gt;
This model trades a little performance (NAT) for isolation and control, and it remains Docker’s default for a reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  Port Publishing Is Address Translation, Not Exposure
&lt;/h3&gt;

&lt;p&gt;Publishing a port does not modify the container. It installs &lt;strong&gt;DNAT rules&lt;/strong&gt; on the host that rewrite incoming traffic.  &lt;/p&gt;

&lt;p&gt;Traffic flow:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Host interface receives packet
&lt;/li&gt;
&lt;li&gt;iptables PREROUTING rewrites destination
&lt;/li&gt;
&lt;li&gt;Packet forwarded to container IP
&lt;/li&gt;
&lt;li&gt;Return traffic SNATed back
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This explains why:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Containers do not bind host ports
&lt;/li&gt;
&lt;li&gt;Port collisions are resolved at the host layer
&lt;/li&gt;
&lt;li&gt;Network performance differs from host mode
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Port publishing is policy, not plumbing.&lt;/p&gt;
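&lt;p&gt;The rewrite is visible in the host’s NAT table. The rules below are representative of what &lt;code&gt;docker run -p 8080:80&lt;/code&gt; installs; exact chain names and addresses vary by Docker version, so inspect the live rules with &lt;code&gt;iptables -t nat -L -n -v&lt;/code&gt;:&lt;/p&gt;

```
# PREROUTING hands traffic to the DOCKER chain, where DNAT rewrites the destination
-A DOCKER ! -i docker0 -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:80
# Traffic leaving the bridge subnet is masqueraded on the way out
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
```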

&lt;h3&gt;
  
  
  Network Modes Are Policy Choices
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bridge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Isolated namespace, NATed egress, explicit ingress&lt;/td&gt;
&lt;td&gt;Default, safest&lt;/td&gt;
&lt;td&gt;NAT overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Host&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No namespace, no translation&lt;/td&gt;
&lt;td&gt;Max performance&lt;/td&gt;
&lt;td&gt;No isolation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;None&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Namespace with only loopback&lt;/td&gt;
&lt;td&gt;Batch jobs, hardened workloads&lt;/td&gt;
&lt;td&gt;No connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macvlan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real MAC address, appears as physical device&lt;/td&gt;
&lt;td&gt;VM-like networking&lt;/td&gt;
&lt;td&gt;Bypasses iptables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overlay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encapsulation for multi-host&lt;/td&gt;
&lt;td&gt;Swarm, Kubernetes&lt;/td&gt;
&lt;td&gt;Encapsulation latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  libnetwork Is Control, Not Data Plane
&lt;/h3&gt;

&lt;p&gt;libnetwork programs the kernel:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allocates IPs
&lt;/li&gt;
&lt;li&gt;Selects drivers
&lt;/li&gt;
&lt;li&gt;Creates endpoints
&lt;/li&gt;
&lt;li&gt;Configures routing and firewall rules
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does not forward packets. The kernel always does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Container Communication Is Name Resolution
&lt;/h3&gt;

&lt;p&gt;User-defined bridge networks include an embedded DNS service.&lt;br&gt;&lt;br&gt;
Containers discover each other by name — Docker resolves names to IPs at runtime.&lt;br&gt;&lt;br&gt;
No static environment variables needed.&lt;/p&gt;
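&lt;p&gt;The mechanism is observable from inside the container: on a user-defined network, &lt;code&gt;/etc/resolv.conf&lt;/code&gt; points at Docker’s embedded resolver at 127.0.0.11, which answers container names and forwards everything else to the host’s upstream DNS. A representative fragment:&lt;/p&gt;

```
# /etc/resolv.conf inside a container on a user-defined network
nameserver 127.0.0.11
options ndots:0
```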

&lt;h3&gt;
  
  
  Debugging Means Leaving the Container
&lt;/h3&gt;

&lt;p&gt;Most Docker networking failures occur &lt;strong&gt;outside&lt;/strong&gt; the container namespace.  &lt;/p&gt;

&lt;p&gt;Useful commands:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ip link show type veth&lt;/code&gt; — veth pairs
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;brctl show&lt;/code&gt; or &lt;code&gt;ip link show docker0&lt;/code&gt; — bridge membership
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip route&lt;/code&gt; (host) vs &lt;code&gt;docker exec &amp;lt;id&amp;gt; ip route&lt;/code&gt; — routing
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;iptables -t nat -L -v -n&lt;/code&gt; — NAT chains
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nsenter --net=/proc/&amp;lt;pid&amp;gt;/ns/net&lt;/code&gt; — enter namespace
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Container logs rarely explain network issues. The host almost always does.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Docker is often blamed. The kernel is usually guilty.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance and Security Tradeoffs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Bridge: NAT overhead
&lt;/li&gt;
&lt;li&gt;Host: No isolation
&lt;/li&gt;
&lt;li&gt;Macvlan: Bypasses iptables
&lt;/li&gt;
&lt;li&gt;Overlay: Encapsulation latency
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker networking prioritizes &lt;strong&gt;containment&lt;/strong&gt; over concealment. Security comes from explicit policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Docker networking is a composition of kernel primitives:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network namespaces for isolation
&lt;/li&gt;
&lt;li&gt;veth pairs for connectivity
&lt;/li&gt;
&lt;li&gt;Bridges/routes for topology
&lt;/li&gt;
&lt;li&gt;netfilter for policy
&lt;/li&gt;
&lt;li&gt;libnetwork for orchestration
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once internalized, Docker networking becomes predictable.&lt;/p&gt;




</description>
      <category>networking</category>
      <category>docker</category>
      <category>linux</category>
      <category>devops</category>
    </item>
    <item>
      <title>Dockerfile Internals and the Image Build Pipeline</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Thu, 18 Dec 2025 06:32:00 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/dockerfile-internals-and-the-image-build-pipeline-37b1</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/dockerfile-internals-and-the-image-build-pipeline-37b1</guid>
      <description>&lt;p&gt;When engineers say "Docker builds an image," they usually mean a single command.&lt;br&gt;
In reality, &lt;code&gt;docker build&lt;/code&gt; triggers a deterministic pipeline that transforms a text file into an OCI-compliant artifact, composed of immutable, content-addressed layers.&lt;/p&gt;

&lt;p&gt;Understanding this pipeline explains why cache behaves the way it does, why instruction order matters, and why small Dockerfile changes can dramatically impact build time and image size.&lt;/p&gt;


&lt;h2&gt;
  
  
  From Dockerfile to Build Graph
&lt;/h2&gt;

&lt;p&gt;The build process starts long before any filesystem changes occur.&lt;/p&gt;

&lt;p&gt;Docker first parses the Dockerfile into an internal instruction graph.&lt;br&gt;
This phase validates syntax, resolves build stages, and prepares the build context after applying &lt;code&gt;.dockerignore&lt;/code&gt;. No layers are created here. The output is a dependency-aware plan for how the image &lt;em&gt;could&lt;/em&gt; be built.&lt;/p&gt;

&lt;p&gt;Only after this plan is constructed does execution begin.&lt;/p&gt;
&lt;h3&gt;
  
  
  Practical Impact: The &lt;code&gt;.dockerignore&lt;/code&gt; Advantage
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Without .dockerignore:&lt;/span&gt;
Sending build context to Docker daemon  1.2GB  &lt;span class="c"&gt;# Slow transfer&lt;/span&gt;

&lt;span class="c"&gt;# With proper .dockerignore:&lt;/span&gt;
Sending build context to Docker daemon  12.3kB  &lt;span class="c"&gt;# Fast transfer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Key files to exclude:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_modules/
.git/
*.log
.env
dist/  # For multi-stage builds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Layer Creation Is Content, Not Commands
&lt;/h2&gt;

&lt;p&gt;Each filesystem-changing instruction such as &lt;code&gt;RUN&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, or &lt;code&gt;ADD&lt;/code&gt; produces a new layer.&lt;br&gt;
These layers are immutable and identified by a cryptographic hash derived from their content and their parent layer.&lt;/p&gt;

&lt;p&gt;This is why Docker caching is reliable.&lt;br&gt;
If the inputs are identical, the resulting layer hash is identical. The build system does not care &lt;em&gt;why&lt;/em&gt; a command ran, only &lt;em&gt;what it produced&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache Key Composition
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer Hash = SHA256(
  Parent Layer Hash +
  Instruction Content + 
  File Content (for COPY/ADD) +
  Build Arguments at this point
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Example Cache Behavior:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Layer 1: Always cached (base image)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:18-alpine&lt;/span&gt;

&lt;span class="c"&gt;# Layer 2: Cached unless WORKDIR changes&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;

&lt;span class="c"&gt;# Layer 3: Cache breaks if package.json changes&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;

&lt;span class="c"&gt;# Layer 4: Cache breaks if Layer 3 changes&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci

&lt;span class="c"&gt;# Layer 5: Cache breaks if ANY file changes&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;

&lt;span class="c"&gt;# Layer 6: Always cached (metadata)&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["npm", "start"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design is what allows Docker to reuse layers across images, hosts, and even registries.&lt;/p&gt;
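&lt;p&gt;The determinism behind this reuse is easy to reproduce outside Docker. The sketch below illustrates content addressing (it is not Docker’s exact key schema): a key derived from the parent hash, the instruction text, and the file content is identical whenever the inputs are identical:&lt;/p&gt;

```shell
# Illustrative cache key: same inputs, same key -- the basis of layer reuse.
parent=sha256:0000000000000000000000000000000000000000000000000000000000000000
instruction='COPY package*.json ./'
files=$(printf 'package.json bytes' | sha256sum | cut -d' ' -f1)

key1=$(printf '%s|%s|%s' "$parent" "$instruction" "$files" | sha256sum | cut -d' ' -f1)
key2=$(printf '%s|%s|%s' "$parent" "$instruction" "$files" | sha256sum | cut -d' ' -f1)

if [ "$key1" = "$key2" ]; then echo "cache hit: layer reused"; fi
```

&lt;p&gt;Change any one input and the key changes, which is exactly a cache miss.&lt;/p&gt;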




&lt;h2&gt;
  
  
  Why BuildKit Changed Everything
&lt;/h2&gt;

&lt;p&gt;The classic Docker builder executed instructions sequentially, treating each step as an isolated operation.&lt;br&gt;
BuildKit replaces this with a graph-based execution model.&lt;/p&gt;

&lt;p&gt;With BuildKit, independent steps can execute in parallel, cache keys are more precise, and sensitive data such as credentials can be mounted at build time without ever becoming part of an image layer.&lt;/p&gt;
&lt;h3&gt;
  
  
  BuildKit vs Classic: A Performance Comparison
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Classic Builder (sequential)&lt;/span&gt;
Step 1/8 : FROM alpine:latest
Step 2/8 : RUN apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; python3 py3-pip
Step 3/8 : RUN pip3 &lt;span class="nb"&gt;install &lt;/span&gt;pandas
... &lt;span class="c"&gt;# Each step waits for previous&lt;/span&gt;

&lt;span class="c"&gt;# BuildKit (concurrent possible)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;+] Building 8.2s &lt;span class="o"&gt;(&lt;/span&gt;15/15&lt;span class="o"&gt;)&lt;/span&gt; FINISHED
 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; CACHED &lt;span class="o"&gt;[&lt;/span&gt;stage-1 2/6] ...
 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; CACHED &lt;span class="o"&gt;[&lt;/span&gt;stage-1 3/6] ...  &lt;span class="c"&gt;# Parallel execution&lt;/span&gt;
 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; CACHED &lt;span class="o"&gt;[&lt;/span&gt;stage-1 4/6] ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Advanced BuildKit Features
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Build Secrets (Never in Image Layers)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret,id&lt;span class="o"&gt;=&lt;/span&gt;npm_token &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"//registry.npmjs.org/:_authToken=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /run/secrets/npm_token&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .npmrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    npm ci
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Cache Mounts (Persistent Between Builds)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nt"&gt;--mount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cache,target&lt;span class="o"&gt;=&lt;/span&gt;/var/cache/apt &lt;span class="se"&gt;\
&lt;/span&gt;    apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; packages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not an optimization.&lt;br&gt;
It is a fundamental shift in how image builds are modeled.&lt;/p&gt;


&lt;h2&gt;
  
  
  Multi-Stage Builds as a Security Boundary
&lt;/h2&gt;

&lt;p&gt;Multi-stage builds are often described as a size optimization.&lt;br&gt;
More importantly, they create a clean separation between build-time and runtime concerns.&lt;/p&gt;

&lt;p&gt;Compilers, package managers, and secrets exist only in intermediate stages.&lt;br&gt;
The final image contains exactly what is required to run the application, and nothing else.&lt;/p&gt;
&lt;h3&gt;
  
  
  Security Impact Analysis
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Single-Stage (Vulnerable)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:18&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci  &lt;span class="c"&gt;# 600+ dev dependencies&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm run build
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node", "dist/app.js"]&lt;/span&gt;
&lt;span class="c"&gt;# Result: 1.2GB image with dev tools, compilers, secrets&lt;/span&gt;

&lt;span class="c"&gt;# Multi-Stage (Secure)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:18&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build  &lt;span class="c"&gt;# Dev dependencies here&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:18-alpine&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/dist ./dist&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci &lt;span class="nt"&gt;--only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production  &lt;span class="c"&gt;# Only 40 prod dependencies&lt;/span&gt;
&lt;span class="c"&gt;# Result: 180MB image, no dev tools, no build secrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This reduces attack surface, simplifies vulnerability scanning, and makes image provenance easier to reason about.&lt;/p&gt;


&lt;h2&gt;
  
  
  Debugging Builds Means Debugging Inputs
&lt;/h2&gt;

&lt;p&gt;Most Docker build issues are not runtime problems.&lt;br&gt;
They are cache invalidation problems.&lt;/p&gt;

&lt;p&gt;Unexpected rebuilds almost always trace back to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing inputs in early layers&lt;/li&gt;
&lt;li&gt;Overly broad &lt;code&gt;COPY&lt;/code&gt; instructions&lt;/li&gt;
&lt;li&gt;Uncontrolled build arguments&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Diagnostic Toolkit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Layer Inspection&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;history &lt;/span&gt;myimage &lt;span class="nt"&gt;--no-trunc&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"{{.CreatedBy}}"&lt;/span&gt;
dive myimage  &lt;span class="c"&gt;# Interactive layer explorer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Cache Analysis&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See why cache invalidated&lt;/span&gt;
docker build &lt;span class="nt"&gt;--progress&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;plain &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Check specific layer&lt;/span&gt;
docker inspect myimage &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{{.RootFS.Layers}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Context Troubleshooting&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See what's being sent to daemon&lt;/span&gt;
docker build &lt;span class="nt"&gt;--no-cache&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"sending build context"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools like &lt;code&gt;docker build --progress=plain&lt;/code&gt;, &lt;code&gt;docker history&lt;/code&gt;, and layer inspection utilities expose these relationships directly, turning "Docker magic" back into observable behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Deterministic Builds
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin everything&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:18.20.1-alpine3.19  # Not :latest&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci &lt;span class="nt"&gt;--frozen-lockfile&lt;/span&gt;  &lt;span class="c"&gt;# Not npm install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
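&lt;p&gt;Pinning can go one step further: a digest reference stays immutable even if the tag is later re-pushed (the digest below is a placeholder):&lt;/p&gt;

```dockerfile
# The tag is informational; the digest is authoritative
FROM node:18.20.1-alpine3.19@sha256:...
```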



&lt;h3&gt;
  
  
  2. Build-Time Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Order matters: Stable → Changing&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./     # Infrequent changes&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci               &lt;span class="c"&gt;# Expensive operation&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .                 # Frequent changes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Size Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clean as you go&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apt-get update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="c"&gt;# Build something &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    apt-get remove &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    apt-get autoremove &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The OCI Artifact: What Actually Gets Built
&lt;/h2&gt;

&lt;p&gt;At the end of the pipeline, Docker produces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Image Manifest&lt;/strong&gt; - Metadata and layer references&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Config&lt;/strong&gt; - Environment, entrypoint, working directory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer Tarballs&lt;/strong&gt; - Compressed filesystem diffs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index (multi-arch)&lt;/strong&gt; - Platform-specific manifests
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schemaVersion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:abc123..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Content&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hash&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234567&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"digest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:def456..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Cmd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"start"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
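&lt;p&gt;Content addressability is easy to verify with plain shell: a blob's identity is just the SHA-256 of its bytes, so identical content always yields the same digest and can be stored once. A minimal sketch (file names are illustrative):&lt;/p&gt;

```shell
# Two "layers" with identical bytes get identical digests
workdir=$(mktemp -d)
printf 'hello layer\n' | tee "$workdir/layer_a" "$workdir/layer_b"

digest_a=$(sha256sum "$workdir/layer_a" | awk '{print $1}')
digest_b=$(sha256sum "$workdir/layer_b" | awk '{print $1}')
echo "layer_a: sha256:$digest_a"
echo "layer_b: sha256:$digest_b"
```

&lt;p&gt;A registry or content store keyed by these digests stores that blob exactly once, which is what makes layer deduplication work.&lt;/p&gt;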






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The Docker build pipeline transforms human-readable instructions into a secure, efficient, distributable artifact through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graph-based planning&lt;/strong&gt; - Not linear execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content-addressable storage&lt;/strong&gt; - Deterministic layer creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage isolation&lt;/strong&gt; - Build/runtime separation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable behavior&lt;/strong&gt; - Every layer is inspectable&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding these internals moves teams from "Docker builds" to "engineered artifact pipelines."&lt;/p&gt;




</description>
      <category>devops</category>
      <category>dockerfile</category>
      <category>docker</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Docker internals deep dive what really happens when you run docker run (2025 edition)</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 16 Dec 2025 03:10:57 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/docker-internals-deep-dive-what-really-happens-when-you-run-docker-run-2025-edition-2k97</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/docker-internals-deep-dive-what-really-happens-when-you-run-docker-run-2025-edition-2k97</guid>
      <description>&lt;p&gt;🧵 You type &lt;code&gt;docker run nginx&lt;/code&gt;. In milliseconds, 7 components work together. Here's EXACTLY what happens at each layer (with debugging tips for when it breaks).&lt;/p&gt;

&lt;p&gt;Modern container platforms depend on predictable, modular behavior. Docker's architecture is a layered execution pipeline built around standard interfaces: REST, gRPC, OCI Runtime, and Linux kernel primitives. Understanding this flow eliminates ambiguity during debugging, scaling, or integrating with orchestration systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. Core Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLI → dockerd (API + orchestration)
    → containerd (runtime management)
    → containerd-shim (process supervisor)
    → runc (OCI runtime)
    → Linux kernel (namespaces, cgroups, filesystems, networking)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Docker CLI&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;User command interface&lt;/li&gt;
&lt;li&gt;Converts flags to JSON&lt;/li&gt;
&lt;li&gt;Talks to dockerd through &lt;code&gt;/var/run/docker.sock&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;dockerd&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;REST API server&lt;/li&gt;
&lt;li&gt;Container lifecycle orchestration&lt;/li&gt;
&lt;li&gt;Network/volume management&lt;/li&gt;
&lt;li&gt;Delegates image and runtime operations to containerd&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;containerd&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;High-level runtime manager&lt;/li&gt;
&lt;li&gt;Manages snapshots, images, and content store&lt;/li&gt;
&lt;li&gt;Pulls/unpacks layers&lt;/li&gt;
&lt;li&gt;Creates OCI runtime specifications&lt;/li&gt;
&lt;li&gt;Launches a &lt;code&gt;containerd-shim&lt;/code&gt; for each container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image Storage Detail:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each layer is content-addressable via its SHA-256 digest&lt;/li&gt;
&lt;li&gt;Identical layers are stored once and deduplicated&lt;/li&gt;
&lt;li&gt;OverlayFS stacks shared, read-only lower layers, so running containers reuse them without copying&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;containerd-shim&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Parent process for the container's workload&lt;/li&gt;
&lt;li&gt;Keeps containers alive if dockerd/containerd restart&lt;/li&gt;
&lt;li&gt;Manages IO streams (logs, attach)&lt;/li&gt;
&lt;li&gt;Returns exit codes to containerd&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;runc&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implements the OCI runtime spec&lt;/li&gt;
&lt;li&gt;Creates namespaces&lt;/li&gt;
&lt;li&gt;Applies cgroup limitations&lt;/li&gt;
&lt;li&gt;Mounts root filesystem&lt;/li&gt;
&lt;li&gt;Executes the entrypoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exits immediately after container creation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Linux Kernel&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Enforces process isolation (namespaces)&lt;/li&gt;
&lt;li&gt;Resource control (cgroups)&lt;/li&gt;
&lt;li&gt;Layered filesystems (OverlayFS)&lt;/li&gt;
&lt;li&gt;Networking (veth, bridges, iptables/NAT)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  ✈️ The Airport Analogy: A Mental Model
&lt;/h2&gt;

&lt;p&gt;Just as you don't need to know air traffic control to board a flight, &lt;br&gt;
you don't need to understand all Docker components to run containers. &lt;br&gt;
But when things go wrong, knowing the layers helps troubleshoot!&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Airport Role&lt;/th&gt;
&lt;th&gt;Real-World Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Passenger Terminal&lt;/td&gt;
&lt;td&gt;You type &lt;code&gt;docker run&lt;/code&gt;, check status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;dockerd&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Airport Operations Center&lt;/td&gt;
&lt;td&gt;Manages all flights, gates, schedules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;containerd&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ground Control&lt;/td&gt;
&lt;td&gt;Loads luggage (images), assigns runways&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;containerd-shim&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gate Agents&lt;/td&gt;
&lt;td&gt;Ensures plane stays ready even if Ops Center reboots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;runc&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pilot&lt;/td&gt;
&lt;td&gt;Actually flies the plane (executes container)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Air Traffic Control&lt;/td&gt;
&lt;td&gt;Manages airspace (resources), prevents collisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Container&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The Actual Flight&lt;/td&gt;
&lt;td&gt;Your app running in isolated airspace&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Use this mental model to remember component relationships during troubleshooting.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. Execution Flow: &lt;code&gt;docker run -d -p 8080:80 nginx&lt;/code&gt;&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1. CLI → dockerd&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;CLI parses command, constructs a JSON payload, and sends it over the Unix socket.&lt;/p&gt;
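&lt;p&gt;As a sketch, the request body for this command looks roughly like the following (field names follow the Engine API's &lt;code&gt;/containers/create&lt;/code&gt; endpoint; the real payload carries many more fields):&lt;/p&gt;

```shell
# Approximate JSON the CLI POSTs to dockerd over the Unix socket
payload='{
  "Image": "nginx",
  "HostConfig": {
    "PortBindings": { "80/tcp": [ { "HostPort": "8080" } ] }
  }
}'
echo "$payload"
```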
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2. dockerd Validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dockerd validates configuration, checks local images, and coordinates container creation.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3. Image Pull (if needed)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;containerd handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Registry authentication&lt;/li&gt;
&lt;li&gt;Manifest resolution&lt;/li&gt;
&lt;li&gt;Layer download and verification&lt;/li&gt;
&lt;li&gt;Storage in the content store&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4. Filesystem Assembly&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;containerd prepares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snapshot&lt;/li&gt;
&lt;li&gt;OverlayFS upper/lower directory layout&lt;/li&gt;
&lt;li&gt;OCI bundle with metadata and runtime config&lt;/li&gt;
&lt;/ul&gt;
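&lt;p&gt;A stripped-down &lt;code&gt;config.json&lt;/code&gt; from such a bundle looks roughly like this (a real spec also carries mounts, capabilities, and cgroup settings):&lt;/p&gt;

```json
{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["nginx", "-g", "daemon off;"],
    "cwd": "/"
  },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" }
    ]
  }
}
```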
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5. Networking Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dockerd configures the network namespace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;veth pair creation&lt;/li&gt;
&lt;li&gt;Host end added to &lt;code&gt;docker0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Container assigned IP (e.g., 172.17.0.2)&lt;/li&gt;
&lt;li&gt;iptables DNAT for port-mapping&lt;/li&gt;
&lt;li&gt;MASQUERADE rule for outbound traffic&lt;/li&gt;
&lt;/ul&gt;
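&lt;p&gt;The port-mapping step can be rendered as a string to see roughly what dockerd programs into the nat table (this only prints the command, it does not apply it; the real rule also excludes traffic arriving on &lt;code&gt;docker0&lt;/code&gt;):&lt;/p&gt;

```shell
container_ip=172.17.0.2
host_port=8080
container_port=80

# DNAT rule dockerd adds for -p 8080:80 (rendered, not executed)
dnat_rule="iptables -t nat -A DOCKER -p tcp --dport $host_port -j DNAT --to-destination $container_ip:$container_port"
echo "$dnat_rule"
```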
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6. containerd → containerd-shim&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;containerd:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spawns shim&lt;/li&gt;
&lt;li&gt;Hands off the OCI spec&lt;/li&gt;
&lt;li&gt;Delegates lifecycle supervision&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7. shim → runc&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;runc:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates namespaces&lt;/li&gt;
&lt;li&gt;Mounts rootfs&lt;/li&gt;
&lt;li&gt;Applies cgroup limits&lt;/li&gt;
&lt;li&gt;Executes container entrypoint&lt;/li&gt;
&lt;li&gt;Exits (shim remains as supervisor)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 8. Container Running&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Container runs as an isolated Linux process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shim maintains lifecycle&lt;/li&gt;
&lt;li&gt;dockerd streams logs and reports state&lt;/li&gt;
&lt;li&gt;kernel enforces isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoqt1wa8cyp9o61040ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoqt1wa8cyp9o61040ps.png" alt="Docker workflow" width="800" height="32"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Component Responsibilities&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Delegates&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;User interface, request creation&lt;/td&gt;
&lt;td&gt;dockerd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dockerd&lt;/td&gt;
&lt;td&gt;API, orchestration, networking&lt;/td&gt;
&lt;td&gt;containerd&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;containerd&lt;/td&gt;
&lt;td&gt;Image mgmt, snapshots, lifecycle&lt;/td&gt;
&lt;td&gt;runc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;containerd-shim&lt;/td&gt;
&lt;td&gt;Supervises container process&lt;/td&gt;
&lt;td&gt;kernel (via runc-created namespaces)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runc&lt;/td&gt;
&lt;td&gt;Creates container environment&lt;/td&gt;
&lt;td&gt;kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel&lt;/td&gt;
&lt;td&gt;Isolation + resource control&lt;/td&gt;
&lt;td&gt;hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Related Architecture:&lt;/strong&gt;&lt;br&gt;
For Kubernetes, replace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dockerd  →  kubelet → CRI → containerd  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything downstream (containerd → shim → runc → kernel) remains unchanged.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Key Clarifications&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Containers are processes, not virtual machines.&lt;/li&gt;
&lt;li&gt;runc does not stay resident; shim manages the lifecycle.&lt;/li&gt;
&lt;li&gt;Docker's layered filesystem is copy-on-write for efficient storage.&lt;/li&gt;
&lt;li&gt;Kubernetes removed dockerd and uses containerd directly for a simpler CRI pipeline.&lt;/li&gt;
&lt;li&gt;Live-restore works because shim decouples containers from dockerd.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Debugging Guide (Ops-Ready Edition)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A structured, layered sequence for diagnosing container failures. Designed for SRE, DevOps, and runtime engineering teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Container exits immediately&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Follow the layers from highest to lowest impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Application Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Severity: Low&lt;/strong&gt;&lt;br&gt;
Most failures originate here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &amp;lt;container&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for: runtime exceptions, crash loops, missing configs, entrypoint failures.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Runtime Layer (containerd / OCI)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Severity: Medium&lt;/strong&gt;&lt;br&gt;
Issues here affect container creation, not app logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; containerd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Detects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invalid OCI specs&lt;/li&gt;
&lt;li&gt;Snapshot/unpack errors&lt;/li&gt;
&lt;li&gt;Permission issues&lt;/li&gt;
&lt;li&gt;Image metadata failures&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Kernel Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Severity: High&lt;/strong&gt;&lt;br&gt;
Kernel failures affect all containers on the node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dmesg | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reveals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespace creation failures&lt;/li&gt;
&lt;li&gt;cgroup enforcement errors&lt;/li&gt;
&lt;li&gt;LSM blocks (AppArmor/SELinux)&lt;/li&gt;
&lt;li&gt;OverlayFS mount issues&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Slow container startup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pinpoint latency at the registry, storage, or runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Image Pull / Unpack Latency&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; containerd &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2 minutes ago"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-Ei&lt;/span&gt; &lt;span class="s2"&gt;"pull|unpack"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finds slow remote pulls, layer unpack delays, decompression problems.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Host Storage Bottleneck&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iostat &lt;span class="nt"&gt;-dx&lt;/span&gt; 1 /var/lib/containerd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Detects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High I/O wait&lt;/li&gt;
&lt;li&gt;OverlayFS backing store saturation&lt;/li&gt;
&lt;li&gt;Slow disks or overloaded volumes&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Registry / Network Slowness&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;docker pull alpine:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Measures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Round-trip latency&lt;/li&gt;
&lt;li&gt;Download throughput&lt;/li&gt;
&lt;li&gt;Registry auth or proxy delays&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Network issues&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Trace connectivity host → bridge → container.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Verify NAT / Port Forward Rules&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-L&lt;/span&gt; DOCKER &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Bridge &amp;amp; veth Topology&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip addr show docker0
brctl show   &lt;span class="c"&gt;# deprecated; iproute2 alternative: bridge link&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Container Namespace Networking&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ip addr show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Common Error Patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A quick pattern-matching cheat sheet.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Message&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;no such file or directory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Missing entrypoint or wrong working dir&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;permission denied&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User namespace restriction, volume permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;address already in use&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Host port collision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec format error&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Architecture mismatch (amd64 vs arm64)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;layer does not exist&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Corrupted image store, partial pull&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;failed to setup network namespace&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kernel lacking required capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
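&lt;p&gt;The table above can double as a tiny triage helper; the mappings below mirror it and are purely illustrative:&lt;/p&gt;

```shell
# Map a container error message to its likely cause (mirrors the table above)
triage() {
  case "$1" in
    *"exec format error"*)         echo "Architecture mismatch (amd64 vs arm64)" ;;
    *"address already in use"*)    echo "Host port collision" ;;
    *"permission denied"*)         echo "User namespace restriction or volume permissions" ;;
    *"no such file or directory"*) echo "Missing entrypoint or wrong working dir" ;;
    *"layer does not exist"*)      echo "Corrupted image store or partial pull" ;;
    *)                             echo "Unknown: check docker logs and journalctl -u containerd" ;;
  esac
}

triage "standard_init_linux.go:228: exec user process caused: exec format error"
```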




&lt;h2&gt;
  
  
  &lt;strong&gt;Recovery Actions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Map root cause to corrective steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Image Pull Failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check registry auth tokens&lt;/li&gt;
&lt;li&gt;Verify proxy/SSL configuration&lt;/li&gt;
&lt;li&gt;Test connectivity to registry endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. OCI Spec / Runtime Errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensure Docker + containerd + runc versions are compatible&lt;/li&gt;
&lt;li&gt;Validate custom seccomp or AppArmor profiles&lt;/li&gt;
&lt;li&gt;Recreate corrupted snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Kernel Namespace / Cgroup Failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Check kernel version supports required features&lt;/li&gt;
&lt;li&gt;Validate cgroup v1/v2 mode&lt;/li&gt;
&lt;li&gt;Inspect sysctl overrides affecting namespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbi9lcp37tvlpmatmok6s.png" alt="DEBUGGING TREE IMAGE" width="800" height="740"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Summary
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;docker run&lt;/code&gt; invocation travels through a disciplined, modular execution path. Each component accepts a small, well-defined piece of responsibility and hands off cleanly to the next, forming a predictable control flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dockerd&lt;/strong&gt; parses intent and translates it into runtime instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerd&lt;/strong&gt; orchestrates container lifecycle through stable gRPC APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerd-shim&lt;/strong&gt; isolates the container’s process management from daemon restarts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;runc&lt;/strong&gt; materializes the OCI Runtime Spec into Linux primitives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The kernel&lt;/strong&gt; provides the final enforcement layer through namespaces, cgroups, and filesystem drivers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These boundaries are governed by open standards (REST → gRPC → OCI Spec → syscalls), ensuring compatibility, reliability, and deep observability across layers.&lt;/p&gt;

&lt;p&gt;Isolation, resource governance, and performance efficiency emerge directly from native Linux constructs—no hidden hypervisor, no extra abstraction. As a result, containers start fast, run lean, and scale predictably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Note:&lt;/strong&gt;&lt;br&gt;
Because process ownership is delegated to containerd-shim, both dockerd and containerd can be restarted without disrupting running containers. This design supports safe daemon upgrades, node maintenance, and high-availability workflows that do not interrupt workloads.&lt;/p&gt;
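&lt;p&gt;Live-restore is opt-in for dockerd; a minimal &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; fragment (reload the daemon after editing):&lt;/p&gt;

```json
{
  "live-restore": true
}
```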





&lt;p&gt;Drop your thoughts in the comments below! 👇&lt;br&gt;
Follow me for more deep dives into fundamental CS concepts made approachable!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>containers</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>TLS 1.2 vs TLS 1.3 in Production (2025)</title>
      <dc:creator>Sreekanth Kuruba</dc:creator>
      <pubDate>Tue, 09 Dec 2025 13:30:11 +0000</pubDate>
      <link>https://forem.com/sreekanth_kuruba_91721e5d/tls-12-vs-tls-13-in-production-2025-5c0e</link>
      <guid>https://forem.com/sreekanth_kuruba_91721e5d/tls-12-vs-tls-13-in-production-2025-5c0e</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;How We Reduced p95 Latency by 40% and Eliminated Certificate Incidents&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Modern web performance depends on minimizing round trips. In late 2025, we evaluated our global traffic (300M+ requests/day) and found a surprising bottleneck:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over 80% of our latency overhead came from TLS 1.2 handshakes — not from the application.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We migrated fully to TLS 1.3 across Cloudflare → ALB → Nginx.&lt;/p&gt;

&lt;p&gt;Here's the data, the architecture impact, and the configuration used.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Executive Summary: Key Results&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance:&lt;/strong&gt; 40% reduction in p95 latency&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; certificate incidents dropped to zero&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; 28% reduction in ALB CPU usage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migration:&lt;/strong&gt; 45 minutes, near-zero risk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility:&lt;/strong&gt; 99.3% of traffic unaffected&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. The Simplest Analogy: Airport Security&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;TLS 1.2 = Old Airport Security&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Remove shoes&lt;/li&gt;
&lt;li&gt;Remove laptop&lt;/li&gt;
&lt;li&gt;Two screening stages&lt;/li&gt;
&lt;li&gt;Long waits for everyone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;TLS 1.3 = Modern Fast-Track&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single unified check&lt;/li&gt;
&lt;li&gt;Faster crypto negotiation&lt;/li&gt;
&lt;li&gt;PreCheck (0-RTT) for returning users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exactly the same logic applies to round trips.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. How the Handshake Changed&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;TLS 1.2 — 2 Round Trips&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client ──ClientHello────────────► Server
Client ◄─ServerHello+Cert──────── Server
Client ─────Finished────────────► Server
Client ◄────Finished───────────── Server
         ↑↑
     2 RTT required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TLS 1.3 — 1 Round Trip&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client ──ClientHello+KeyShare───► Server
Client ◄─ServerHello+Finished──── Server
Client ─────Finished────────────► Server
         ↑
     1 RTT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TLS 1.3 (Resume) — 0-RTT&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client ──Early Data──────────────► Server
Client ◄─Immediate Response─────── Server
         ↑
       0 RTT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is the core performance difference.&lt;/p&gt;
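&lt;p&gt;The latency effect is plain arithmetic. Assuming a 100 ms client-server round trip and one RTT for the TCP handshake (QUIC and resumption change these numbers), connection setup costs roughly:&lt;/p&gt;

```shell
rtt=100                        # ms, assumed client-server round trip
tcp=$((1 * rtt))               # TCP 3-way handshake
tls12=$((tcp + 2 * rtt))       # TLS 1.2 full handshake: 2 extra RTTs
tls13=$((tcp + 1 * rtt))       # TLS 1.3 full handshake: 1 extra RTT
tls13_0rtt=$((tcp + 0 * rtt))  # TLS 1.3 resumption with 0-RTT

echo "TLS1.2=${tls12}ms TLS1.3=${tls13}ms 0-RTT=${tls13_0rtt}ms"
```

&lt;p&gt;The higher the RTT, the bigger the absolute saving, which is why high-latency regions benefit most.&lt;/p&gt;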

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ot04368k7q5ac2klf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6ot04368k7q5ac2klf1.png" alt="TLS protocol round-trip time comparison: TLS 1.2 (2 RTTs, slow) → TLS 1.3 (1 RTT, baseline) → TLS 1.3 Resume (0 RTT, instant" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Real Production Data (Nov–Dec 2025)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After enabling TLS 1.3 everywhere:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;TLS 1.2&lt;/th&gt;
&lt;th&gt;TLS 1.3&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p95 TTFB (global)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;318 ms&lt;/td&gt;
&lt;td&gt;194 ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;–40%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full handshakes&lt;/td&gt;
&lt;td&gt;~40%&lt;/td&gt;
&lt;td&gt;&amp;lt;6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;–85%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ALB CPU&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;–28%&lt;/td&gt;
&lt;td&gt;Savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failed handshakes&lt;/td&gt;
&lt;td&gt;1.2%&lt;/td&gt;
&lt;td&gt;0.4%&lt;/td&gt;
&lt;td&gt;Higher compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0-RTT usage&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;Faster repeat visitors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Certificate incidents&lt;/td&gt;
&lt;td&gt;3–4/mo&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stability win&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Largest gains:&lt;/strong&gt; India, Brazil, Indonesia, and South Africa (broadly APAC, LATAM, and Africa, where RTTs are naturally high).&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;4. Why TLS 1.3 Wins (Operational view)&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Fewer Round Trips&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Connection setup time is the single biggest latency factor for first-time visitors.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;High Resumption Success&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;TLS 1.3 replaces legacy session tickets with Pre-Shared Keys (PSKs), enabling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;94–98% session reuse&lt;/li&gt;
&lt;li&gt;Fewer full handshakes&lt;/li&gt;
&lt;li&gt;Lower CPU cost&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Simplified Cipher Suites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;TLS 1.2 had 15–20 negotiable options.&lt;br&gt;
TLS 1.3 has 5 secure defaults.&lt;/p&gt;

&lt;p&gt;This largely eliminates cipher-suite misconfiguration.&lt;/p&gt;
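&lt;p&gt;For reference, RFC 8446 defines exactly five cipher suites; in practice only the first three are widely deployed:&lt;/p&gt;

```shell
suites="TLS_AES_128_GCM_SHA256
TLS_AES_256_GCM_SHA384
TLS_CHACHA20_POLY1305_SHA256
TLS_AES_128_CCM_SHA256
TLS_AES_128_CCM_8_SHA256"

count=$(printf '%s\n' "$suites" | grep -c '^TLS_')
echo "RFC 8446 cipher suites: $count"
```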
&lt;h3&gt;
  
  
  &lt;strong&gt;Forward Secrecy by Default&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Impossible to accidentally weaken.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Ready for ECH (2025–2026)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Encrypted ClientHello = SNI protection + privacy upgrade&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;5. Configuration That Works Everywhere (2025)&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Cloudflare&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSL/TLS → Edge Certificates → Minimum TLS Version = 1.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AWS ALB / CloudFront&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use any policy with &lt;strong&gt;TLS13&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ELBSecurityPolicy-TLS13-1-2-2021-06&lt;/code&gt; or newer.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Nginx&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;ssl_protocols&lt;/span&gt; &lt;span class="s"&gt;TLSv1.3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ssl_early_data&lt;/span&gt; &lt;span class="no"&gt;on&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;# Enables 0-RTT safely for GET/HEAD&lt;/span&gt;
&lt;span class="k"&gt;ssl_prefer_server_ciphers&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ssl_session_cache&lt;/span&gt; &lt;span class="s"&gt;shared:TLS:50m&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ssl_session_timeout&lt;/span&gt; &lt;span class="s"&gt;1d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ssl_session_tickets&lt;/span&gt; &lt;span class="no"&gt;off&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;# Use PSK instead&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Caddy&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tls {
    protocols tls1.3
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;6. Monitoring Your TLS Migration&lt;/strong&gt;
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Live TLS version monitoring
tail -f /var/log/nginx/access.log | \
  awk '{print $NF}' | \
  sort | uniq -c

# CloudWatch metrics (AWS) -- start/end/period are required arguments
aws cloudwatch get-metric-statistics \
  --metric-name ProcessedBytes \
  --namespace AWS/ApplicationELB \
  --statistics Sum \
  --period 3600 \
  --start-time 2025-11-01T00:00:00Z \
  --end-time 2025-11-02T00:00:00Z \
  --dimensions Name=LoadBalancer,Value=your-alb

# TLS error tracking
grep -E "SSL|TLS" /var/log/nginx/error.log | \
  cut -d' ' -f6- | \
  sort | uniq -c | sort -rn

# Client compatibility check
curl -I https://yoursite.com -v 2&amp;gt;&amp;amp;1 | grep -E "TLS|SSL"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Alert threshold:&lt;/strong&gt; investigate if more than 0.1% of connections still fall back to TLS 1.2 after 7 days.&lt;/p&gt;
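&lt;p&gt;A quick way to turn that threshold into a number, assuming the negotiated protocol (&lt;code&gt;$ssl_protocol&lt;/code&gt;) is logged per request; the inline &lt;code&gt;printf&lt;/code&gt; sample stands in for the real access log:&lt;/p&gt;

```shell
# Share of connections still falling back to TLS 1.2.
# The printf sample stands in for /var/log/nginx/access.log entries
# whose log_format includes $ssl_protocol.
printf 'TLSv1.3\nTLSv1.3\nTLSv1.2\nTLSv1.3\n' | awk '
  /TLSv1\.2/ { old++ }
  { total++ }
  END { printf "TLS 1.2 share: %.1f%%\n", 100 * old / total }'
# prints: TLS 1.2 share: 25.0%
```

Point the pipeline at your real log and wire the percentage into whatever alerting you already have.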
&lt;h2&gt;
  
  
  &lt;strong&gt;7. When You Should Keep TLS 1.2 (Rare)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Organizations that commonly require fallback:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Banks with legacy proxies&lt;/li&gt;
&lt;li&gt;Government/defense systems&lt;/li&gt;
&lt;li&gt;Healthcare EMR systems&lt;/li&gt;
&lt;li&gt;Windows Server 2008 environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommended fallback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;ssl_protocols&lt;/span&gt; &lt;span class="s"&gt;TLSv1.3&lt;/span&gt; &lt;span class="s"&gt;TLSv1.2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ssl_ciphers&lt;/span&gt; &lt;span class="s"&gt;"TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-RSA-AES256-GCM-SHA384"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check TLS 1.2 traffic usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; TLSv1.2 /var/log/nginx/access.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most modern consumer traffic, TLS 1.2 now accounts for under 0.7% of connections.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. ROI Calculator
&lt;/h2&gt;

&lt;p&gt;For 100M monthly requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TLS 1.2: ~40M full handshakes
TLS 1.3: ~6M full handshakes
Reduction: 34M handshakes

AWS ALB cost impact:
- LCU cost: $0.008/hour
- Monthly savings: ~$2,100
- Annual: $25,200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
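&lt;p&gt;The handshake figures above follow directly from the resumption rates discussed earlier: roughly 40% of TLS 1.2 connections and roughly 6% of TLS 1.3 connections need a full handshake. A sketch of the arithmetic (the rates are assumptions; substitute your own traffic numbers):&lt;/p&gt;

```shell
# Full-handshake reduction implied by the resumption rates above.
awk 'BEGIN {
  requests   = 100000000           # 100M monthly requests
  tls12_full = requests * 0.40     # ~40% full handshakes under TLS 1.2
  tls13_full = requests * 0.06     # ~6% full handshakes under TLS 1.3
  printf "Handshakes avoided per month: %d\n", tls12_full - tls13_full
}'
# prints: Handshakes avoided per month: 34000000
```

The dollar figures then depend on your load balancer's pricing model, so treat them as order-of-magnitude estimates.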



&lt;h2&gt;
  
  
  Performance ROI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;40% faster TTFB = better conversion rates&lt;/li&gt;
&lt;li&gt;Improved Core Web Vitals = SEO boost&lt;/li&gt;
&lt;li&gt;Reduced CDN egress = lower bandwidth costs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  9. Recommended Migration Plan
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1 — Observation&lt;/strong&gt; (Day 1-7)
&lt;/h3&gt;

&lt;p&gt;Enable TLS 1.3 with fallback. Monitor breakage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl_protocols TLSv1.3 TLSv1.2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2 — Prefer TLS 1.3&lt;/strong&gt; (Day 8-14)
&lt;/h3&gt;

&lt;p&gt;Prioritize TLS 1.3 in negotiation.&lt;br&gt;
Monitor error rates.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3 — Enforce&lt;/strong&gt; (Day 15+)
&lt;/h3&gt;

&lt;p&gt;Disable TLS 1.2 once error rate stays below 0.1%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl_protocols TLSv1.3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total migration time for us: 45 minutes end-to-end.&lt;/p&gt;
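&lt;p&gt;If you would rather make the Phase 3 switch a scripted decision than a judgment call, a minimal gate on the 0.1% threshold could look like this (the inline sample stands in for your real access log, and assumes &lt;code&gt;$ssl_protocol&lt;/code&gt; is logged):&lt;/p&gt;

```shell
# Phase 3 gate: only enforce TLS 1.3-only when the TLS 1.2 share of
# recent connections is under 0.1%. Sample data stands in for the log.
share=$(printf 'TLSv1.3\nTLSv1.3\nTLSv1.3\nTLSv1.3\n' | awk '
  /TLSv1\.2/ { old++ }
  { total++ }
  END { printf "%.4f", (total ? 100 * old / total : 0) }')

if awk -v s="$share" 'BEGIN { exit (0.1 > s + 0) ? 0 : 1 }'; then
  echo "OK to enforce TLS 1.3-only"
else
  echo "Keep TLS 1.2 fallback for now"
fi
```

Run it against a representative window (say, the last 7 days of logs) before flipping &lt;code&gt;ssl_protocols&lt;/code&gt; to TLS 1.3 only.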




&lt;h2&gt;
  
  
  &lt;strong&gt;10. CDN Provider Differences (2025)&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;TLS 1.3 Default&lt;/th&gt;
&lt;th&gt;0-RTT Support&lt;/th&gt;
&lt;th&gt;ECH Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Rolling out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Akamai&lt;/td&gt;
&lt;td&gt;Yes (Edge)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Beta&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fastly&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Planned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS CloudFront&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP Cloud CDN&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What's your organization's TLS 1.3 status?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enforced everywhere (100% TLS 1.3)&lt;/li&gt;
&lt;li&gt;Enabled but with fallback&lt;/li&gt;
&lt;li&gt;Still evaluating/testing&lt;/li&gt;
&lt;li&gt;Not on roadmap yet&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. Final Recommendation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;TLS 1.3 is not "new technology" anymore.&lt;br&gt;
It is the expected baseline for global applications.&lt;/p&gt;

&lt;p&gt;Upgrading gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster connections&lt;/li&gt;
&lt;li&gt;Better Core Web Vitals&lt;/li&gt;
&lt;li&gt;Lower compute cost&lt;/li&gt;
&lt;li&gt;Simplified security posture&lt;/li&gt;
&lt;li&gt;Zero operational downsides&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In 2025, continuing to rely on TLS 1.2 means accepting unnecessary latency on every single request.&lt;/p&gt;




&lt;p&gt;Drop your thoughts in the comments below! 👇&lt;br&gt;
Follow me for more deep dives into fundamental CS concepts made approachable!&lt;/p&gt;

</description>
      <category>tls</category>
      <category>webdev</category>
      <category>networking</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
