<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pratheesh Satheesh Kumar</title>
    <description>The latest articles on Forem by Pratheesh Satheesh Kumar (@pratheesh_s).</description>
    <link>https://forem.com/pratheesh_s</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1554347%2Ffa5e7a03-28d2-4924-8e18-dc817410b239.jpg</url>
      <title>Forem: Pratheesh Satheesh Kumar</title>
      <link>https://forem.com/pratheesh_s</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pratheesh_s"/>
    <language>en</language>
    <item>
      <title>Why Two-Thirds of AI Teams Are Betting on Kubernetes (And What That Means for You)</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Mon, 04 May 2026 02:08:30 +0000</pubDate>
      <link>https://forem.com/pratheesh_s/why-two-thirds-of-ai-teams-are-betting-on-kubernetes-and-what-that-means-for-you-3edo</link>
      <guid>https://forem.com/pratheesh_s/why-two-thirds-of-ai-teams-are-betting-on-kubernetes-and-what-that-means-for-you-3edo</guid>
      <description>&lt;p&gt;Kubernetes and AI have become unlikely bedfellows—and the numbers prove it. New data from CNCF and SlashData reveals that two-thirds of organizations running generative AI models have standardized on Kubernetes for orchestration. But here's the thing: &lt;strong&gt;it's not because Kubernetes magically solves AI problems.&lt;/strong&gt; It's because the engineering fundamentals that make Kubernetes valuable—standardization, repeatability, resource isolation—are exactly what AI workloads demand when they move beyond the laptop and into production.&lt;/p&gt;

&lt;p&gt;If you're building or scaling AI systems, this isn't just trivia. It's a signal about where the industry is converging, and whether Kubernetes is right for you depends less on hype and more on what you're actually trying to accomplish.&lt;/p&gt;

&lt;h2&gt;The Real Story Behind the Numbers&lt;/h2&gt;

&lt;p&gt;Let's be clear: Kubernetes didn't become the platform of choice for AI because it was purpose-built for LLMs or model inference. It became the default because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standardization across teams&lt;/strong&gt;: When you have data scientists, ML engineers, and infrastructure teams all shipping models, Kubernetes provides a common deployment target. No more "it works on my machine" fragmentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource orchestration&lt;/strong&gt;: AI workloads are hungry. GPUs, accelerators, memory—Kubernetes abstracts these away and lets you define what each model needs without manual provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy at scale&lt;/strong&gt;: If you're running multiple models for different teams or products, isolation and fair resource allocation become non-negotiable.&lt;/li&gt;
&lt;/ul&gt;
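&lt;p&gt;The resource-orchestration point is concrete: in Kubernetes, a workload declares its accelerator needs and the scheduler finds a node that can satisfy them. A minimal sketch, assuming the NVIDIA device plugin is installed on the cluster (the pod name and image are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: inference-worker                  # illustrative name
spec:
  containers:
  - name: model-server
    image: ghcr.io/example/model-server   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 2    # extended resources like GPUs go under limits
        memory: 16Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The scheduler will only place this pod on a node advertising two free GPUs; no manual node selection or provisioning required.&lt;/p&gt;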

&lt;p&gt;But here's what the data really highlights: &lt;strong&gt;success with AI still comes down to boring, foundational work.&lt;/strong&gt; The teams winning aren't the ones who found the perfect Kubernetes YAML template. They're the ones with solid internal developer platforms (IDPs), clear observability, and a relentless focus on developer experience.&lt;/p&gt;

&lt;h2&gt;The IDP Question Every AI Team Needs to Answer&lt;/h2&gt;

&lt;p&gt;The most important implication from this research is the emphasis on internal developer platforms. Here's why:&lt;/p&gt;

&lt;p&gt;AI teams move fast but often lack the operational maturity of traditional backend teams. They want to experiment, iterate, and ship—quickly. But you can't scale that without abstraction.&lt;/p&gt;

&lt;p&gt;An effective IDP for AI sits between your data scientists (who want to ship models) and Kubernetes (which handles the orchestration). It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-service model deployment&lt;/strong&gt;: Data scientists submit a model; the platform handles GPU allocation, versioning, and rollback.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized observability&lt;/strong&gt;: Metrics, logs, and traces for inference endpoints—not just for ops, but for the ML team to catch drift and degradation early.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility&lt;/strong&gt;: AI is expensive. Your IDP should show teams exactly what their models cost to run.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: A simplified model deployment abstraction&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ml.company.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ModelEndpoint&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-classifier-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gcr.io/company/gpt-classifier:v2.1.3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia.com/gpu:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;32Gi"&lt;/span&gt;
  &lt;span class="na"&gt;autoscaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;targetUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This layer matters &lt;em&gt;more&lt;/em&gt; than Kubernetes itself. Kubernetes is just the underlying engine.&lt;/p&gt;

&lt;h2&gt;Practical Takeaway: Do You Actually Need Kubernetes for AI?&lt;/h2&gt;

&lt;p&gt;Honest answer: &lt;strong&gt;probably, eventually.&lt;/strong&gt; But not on day one.&lt;/p&gt;

&lt;p&gt;If you're:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a single model for inference with predictable load → managed services (Vertex AI, SageMaker, Modal) might be faster to market.&lt;/li&gt;
&lt;li&gt;Experimenting with models in notebooks → local containers and lightweight orchestration are enough.&lt;/li&gt;
&lt;li&gt;Running multiple models, multiple teams, with variable workloads and cost constraints → Kubernetes becomes the logical choice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trap is assuming Kubernetes is the goal. It's not. &lt;strong&gt;The goal is reliable, scalable, observable AI systems that developers actually enjoy maintaining.&lt;/strong&gt; Kubernetes is often the best tool for that—but it requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Strong foundations first&lt;/strong&gt;: GitOps, infrastructure-as-code, observability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An IDP on top&lt;/strong&gt;: Don't expose Kubernetes complexity to data scientists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear resource governance&lt;/strong&gt;: AI compute is expensive; track it ruthlessly.&lt;/li&gt;
&lt;/ol&gt;
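&lt;p&gt;Point three doesn't require custom tooling to start: Kubernetes ships a quota primitive that caps what a team's namespace can consume. A minimal sketch (the namespace name and limits are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team                # illustrative namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"    # cap total GPUs this team can request
    requests.memory: 256Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pair a quota like this with per-namespace cost reporting and "track it ruthlessly" becomes enforceable rather than aspirational.&lt;/p&gt;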

&lt;h2&gt;What's Missing from the Narrative&lt;/h2&gt;

&lt;p&gt;One thing the data doesn't capture: the operational overhead. Two-thirds of teams use Kubernetes for AI, but we don't know how many are struggling with it. How many are maintaining custom YAML hell? How many have visibility into whether their GPU allocation actually makes sense?&lt;/p&gt;

&lt;p&gt;The fact that two-thirds converge on Kubernetes is less about it being perfect and more about it being the least bad option at scale. That's important context.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;AI doesn't demand Kubernetes. Kubernetes is &lt;em&gt;made to pay off&lt;/em&gt; by AI teams that have mature engineering practices and the discipline to build abstractions on top of it.&lt;/p&gt;

&lt;p&gt;If you're starting an AI project, ask yourself: Do we have the fundamentals in place? Do we have an IDP or the plan to build one? If the answer is "not yet," Kubernetes can wait. If you're already managing multiple models across teams, you're probably not far from needing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your experience?&lt;/strong&gt; Are you running AI workloads on Kubernetes? What would have made the journey smoother—and what would you do differently next time?&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>platform</category>
      <category>devops</category>
    </item>
    <item>
      <title>How Cloudflare Built Resilience: Lessons from Their Infrastructure Overhaul</title>
      <dc:creator>Pratheesh Satheesh Kumar</dc:creator>
      <pubDate>Sun, 03 May 2026 04:06:18 +0000</pubDate>
      <link>https://forem.com/pratheesh_s/how-cloudflare-built-resilience-lessons-from-their-infrastructure-overhaul-4oef</link>
      <guid>https://forem.com/pratheesh_s/how-cloudflare-built-resilience-lessons-from-their-infrastructure-overhaul-4oef</guid>
      <description>&lt;h1&gt;
  
  
  How Cloudflare Built Resilience: Lessons from Their Infrastructure Overhaul
&lt;/h1&gt;

&lt;p&gt;When a single misconfiguration can cascade across a global CDN and take down customer traffic, every deployment becomes a high-stakes decision. Cloudflare recently completed a massive push to make their infrastructure fundamentally more resilient—and their approach offers critical lessons for anyone operating at scale.&lt;/p&gt;

&lt;h2&gt;The Problem: Risk Concentrates in Configuration&lt;/h2&gt;

&lt;p&gt;Most infrastructure incidents don't happen because of hardware failures or clever attacks. They happen because someone pushed a configuration change, the change propagated faster than expected, and there was no circuit breaker in between.&lt;/p&gt;

&lt;p&gt;Cloudflare's situation was familiar to anyone running global-scale systems: their engineering teams were shipping improvements constantly, but each deployment carried latent risk. A small mistake in a configuration file could reach millions of users before detection. The traditional guardrails—code review, staging tests, gradual rollouts—weren't enough to catch every edge case.&lt;/p&gt;

&lt;p&gt;This is why they launched "Fail Small," an engineering initiative focused on preventing large-scale incidents by making small failures impossible to propagate.&lt;/p&gt;

&lt;h2&gt;The Two-Tool Foundation: Snapstone and Engineering Codex&lt;/h2&gt;

&lt;p&gt;The solution wasn't a single tool. Instead, Cloudflare invested in two complementary systems:&lt;/p&gt;

&lt;h3&gt;Snapstone: Safer Configuration Changes&lt;/h3&gt;

&lt;p&gt;Snapstone is a configuration validation and deployment framework that treats configuration changes with the same rigor as code deployments. Here's what makes it different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-flight validation&lt;/strong&gt;: Changes are tested against historical traffic patterns and failure scenarios before rollout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staged rollout control&lt;/strong&gt;: Configuration doesn't flip globally—it rolls out in waves with automated rollback if anomalies appear&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change hygiene&lt;/strong&gt;: Every configuration change is tagged with context: who changed it, why, what it affects, and what the rollback plan is&lt;/li&gt;
&lt;/ul&gt;
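&lt;p&gt;Cloudflare hasn't published Snapstone's internals, but the staged-rollout behavior can be approximated with open-source tooling. A sketch using Argo Rollouts, where a change advances in waves and pauses between them (the service name, traffic percentages, and durations are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: edge-config-service      # illustrative name
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5             # first wave: 5% of traffic
      - pause: {duration: 10m}   # watch for anomalies before widening
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 100
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Add automated analysis between steps and you approximate the "rollback if anomalies appear" behavior without hand-rolled machinery.&lt;/p&gt;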

&lt;p&gt;Think of it as infrastructure-as-code discipline applied to runtime configuration. The payoff is measurable: configuration-related incidents drop significantly because bad changes simply don't reach production simultaneously across all regions.&lt;/p&gt;

&lt;h3&gt;Engineering Codex: Embedding Best Practices&lt;/h3&gt;

&lt;p&gt;Tools alone don't prevent incidents—culture does. The Engineering Codex is Cloudflare's answer: a formalized knowledge base of "how we safely operate infrastructure" that's embedded into workflows.&lt;/p&gt;

&lt;p&gt;When engineers write configuration or deploy services, they're nudged toward patterns that have been proven safe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment templates that encode retry logic and timeout handling&lt;/li&gt;
&lt;li&gt;Configuration examples that highlight common failure modes&lt;/li&gt;
&lt;li&gt;Runbooks that appear automatically when certain alerts fire&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not gatekeeping. It's scaffolding. New engineers learn the "right way" by default, and experienced engineers can deviate with confidence because they understand the underlying principles.&lt;/p&gt;

&lt;h2&gt;Why This Matters Beyond Cloudflare&lt;/h2&gt;

&lt;p&gt;You might think: "Sure, this makes sense for a global CDN. But we're running a smaller operation." That's exactly backward.&lt;/p&gt;

&lt;p&gt;Cloudflare's insight applies &lt;em&gt;especially&lt;/em&gt; to smaller teams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your blast radius is fixed regardless of team size&lt;/strong&gt;. A misconfigured load balancer breaks things just as hard at a 50-person startup as at Cloudflare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have fewer engineers to catch mistakes&lt;/strong&gt;. Automation and frameworks matter more when you don't have five people reviewing every change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incidents are more expensive relative to revenue&lt;/strong&gt;. A two-hour outage costs a small startup far more, relative to its revenue, than the same outage costs a large company.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Fail Small philosophy: &lt;em&gt;Make the safe path the default path.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Actionable Takeaway: Start With Configuration as Code&lt;/h2&gt;

&lt;p&gt;If you take one thing from Cloudflare's approach, it's this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat configuration changes with the same discipline as code deployments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three things you can do today:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audit your current configuration management. Is it in version control? Are changes tested before rollout? Is there a rollback procedure?&lt;/li&gt;
&lt;li&gt;Identify your highest-risk configuration files (anything that affects traffic routing, authentication, or resource limits).&lt;/li&gt;
&lt;li&gt;Implement one simple control: all changes to critical configuration must be reviewed and tested in staging before production rollout.&lt;/li&gt;
&lt;/ol&gt;
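&lt;p&gt;Step three can be as simple as a CI job that refuses to merge invalid configuration. A sketch as a GitHub Actions workflow using kubeconform for offline schema validation (the &lt;code&gt;config/&lt;/code&gt; repo layout is an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .github/workflows/validate-config.yml (illustrative)
name: validate-config
on:
  pull_request:
    paths: ["config/**"]         # run only when critical config changes
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests against API schemas
        run: |
          go install github.com/yannh/kubeconform/cmd/kubeconform@latest
          ~/go/bin/kubeconform -strict config/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It isn't Snapstone, but it is a real gate: bad config is caught at review time instead of in production.&lt;/p&gt;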

&lt;p&gt;You don't need to build Snapstone from scratch. Tools like Terraform, ArgoCD, or even careful GitOps practices get you 80% of the way there.&lt;/p&gt;

&lt;h2&gt;The Bigger Picture: Resilience is Systematic&lt;/h2&gt;

&lt;p&gt;Cloudflare's Fail Small initiative reminds us that infrastructure resilience isn't about heroic incident response. It's about making bad outcomes progressively harder to achieve.&lt;/p&gt;

&lt;p&gt;Each control they added—validation, staged rollouts, embedded best practices—removes one more degree of freedom from the "I broke production" state space.&lt;/p&gt;

&lt;p&gt;What's one configuration change that could take down your service right now? How many approval gates stand between someone and deploying it? That's where to start.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your team's biggest source of configuration-related incidents? Have you invested in preventing them, or mostly in recovering from them? Drop your thoughts below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>cicd</category>
      <category>platform</category>
    </item>
  </channel>
</rss>
