<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alexandre Vazquez</title>
    <description>The latest articles on Forem by Alexandre Vazquez (@alexandrev).</description>
    <link>https://forem.com/alexandrev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F167984%2F9c789f5c-7dab-4a86-aece-8bf66ea955bd.jpeg</url>
      <title>Forem: Alexandre Vazquez</title>
      <link>https://forem.com/alexandrev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/alexandrev"/>
    <language>en</language>
    <item>
      <title>Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 21 Apr 2026 11:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-54a1</link>
      <guid>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-54a1</guid>
      <description>&lt;h1&gt;
  
  
  Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does
&lt;/h1&gt;

&lt;h2&gt;
  
  
  How HPA Actually Decides to Scale
&lt;/h2&gt;

&lt;p&gt;The HPA controller uses a formula to determine desired replicas: &lt;code&gt;desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))&lt;/code&gt;. A critical detail is that "the metric value is expressed relative to the resource &lt;em&gt;request&lt;/em&gt;, not the resource limit." This distinction explains many HPA failures.&lt;/p&gt;

&lt;p&gt;HPA polls metrics every 15 seconds by default, scaling up within one to three polling cycles when thresholds are exceeded. Scale-down is deliberately slow, waiting 5 minutes by default to prevent oscillation.&lt;/p&gt;
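
&lt;p&gt;To make the formula concrete: 4 replicas averaging 80% CPU utilization against a 60% target give &lt;code&gt;ceil(4 × 80 / 60) = ceil(5.33) = 6&lt;/code&gt; desired replicas.&lt;/p&gt;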

&lt;h2&gt;
  
  
  CPU-Based HPA: When It Works and When It Doesn't
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Works Well
&lt;/h3&gt;

&lt;p&gt;CPU-based HPA succeeds with stateless request-processing workloads where CPU consumption correlates with request volume. Prerequisites include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accurate CPU requests&lt;/strong&gt; set to actual sustained consumption, not placeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasonable request-to-limit ratios&lt;/strong&gt; (1:4 or less)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU consumption that tracks user load linearly&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Fails
&lt;/h3&gt;

&lt;p&gt;CPU HPA struggles with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency-sensitive services with sharp spikes&lt;/strong&gt; — by the time HPA detects and reacts to peaks, the burst may be over&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O-bound workloads&lt;/strong&gt; — showing low CPU even under heavy load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workloads with cold-start costs&lt;/strong&gt; — requiring earlier scaling decisions than CPU metrics can trigger&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Memory-Based HPA: Why It Almost Always Breaks
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Problem
&lt;/h3&gt;

&lt;p&gt;Memory is incompressible; exhausting it causes OOM termination. Unlike CPU, "memory consumption is relatively stable" for well-architected services. A Go service or JVM application maintains a consistent memory footprint regardless of traffic volume from 10 to 10,000 requests per second.&lt;/p&gt;

&lt;p&gt;This creates two outcomes: memory HPA either never triggers (useless) or always triggers (permanently scaled out).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request Misconfiguration Trap
&lt;/h3&gt;

&lt;p&gt;A Java service needing 512Mi heap but configured with a 256Mi request will immediately consume 200% of its request. An HPA with 70% memory threshold will scale such workloads to maximum replicas permanently. The solution is right-sizing requests, not adjusting thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  JVM and Go Runtime Memory Behavior
&lt;/h3&gt;

&lt;p&gt;The JVM allocates heap up to its maximum and doesn't release it aggressively, even after garbage collection. Go's garbage collector prioritizes low latency over minimal memory use, potentially holding memory above strict necessity.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Memory HPA Is Actually Appropriate
&lt;/h3&gt;

&lt;p&gt;Memory-based HPA is defensible only in narrow cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Workloads where memory consumption tracks load linearly&lt;/li&gt;
&lt;li&gt;As a secondary safety valve (not primary) at 85-90% threshold for protecting against memory leaks&lt;/li&gt;
&lt;li&gt;Caching services where avoiding eviction before scaling out is critical&lt;/li&gt;
&lt;/ul&gt;
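
&lt;p&gt;As a sketch of the safety-valve pattern (thresholds and ordering are illustrative), an HPA v2 &lt;code&gt;metrics&lt;/code&gt; list can keep CPU as the primary signal and let memory matter only at a high watermark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 60   # primary scaling signal
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 85   # secondary safety valve (e.g. leaks)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because HPA satisfies the most demanding metric, the memory target stays inert until consumption genuinely approaches the limit.&lt;/p&gt;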

&lt;h2&gt;
  
  
  Right-Sizing Requests Before Adding HPA
&lt;/h2&gt;

&lt;p&gt;No HPA strategy works without accurate resource requests. Run workloads under representative load and measure actual consumption. VPA in recommendation mode provides data-driven baselines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VerticalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-vpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Off"&lt;/span&gt;   &lt;span class="c1"&gt;# Recommendation only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical note:&lt;/strong&gt; VPA and HPA cannot both auto-manage the same resource metric simultaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Signals: What to Scale On Instead
&lt;/h2&gt;

&lt;p&gt;Shift from resource consumption metrics (describing the past) to demand metrics (describing current needs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests Per Second (RPS)
&lt;/h3&gt;

&lt;p&gt;For HTTP services, "requests per second per replica is usually the most accurate proxy for load." RPS measures demand directly, working for CPU-bound, memory-bound, or I/O-bound services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http_requests_per_second&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AverageValue&lt;/span&gt;
        &lt;span class="na"&gt;averageValue&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Queue Depth and Lag
&lt;/h3&gt;

&lt;p&gt;For consumer workloads reading from message queues, "consumer lag: how many messages are waiting to be processed" is the right scaling signal. KEDA was built for this use case, reading consumer group lag directly.&lt;/p&gt;
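
&lt;p&gt;A minimal KEDA &lt;code&gt;ScaledObject&lt;/code&gt; for a Kafka consumer might look like this (deployment, topic, group names, and the lag threshold are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-consumer-scaler
spec:
  scaleTargetRef:
    name: my-consumer          # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: my-consumer-group
      topic: orders
      lagThreshold: "100"      # target lag per replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;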

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;P99 latency per replica is an excellent signal for latency-sensitive services, requiring custom metrics from service meshes or APM tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scheduled and Predictive Scaling
&lt;/h3&gt;

&lt;p&gt;For predictable traffic patterns, proactive scaling outperforms reactive scaling. KEDA's Cron scaler enables time-based scaling rules.&lt;/p&gt;
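
&lt;p&gt;As an illustrative sketch (schedule, timezone, and replica count are examples), a Cron trigger pre-scales for a known busy window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;triggers:
- type: cron
  metadata:
    timezone: Europe/Madrid
    start: 0 8 * * 1-5         # scale up weekdays at 08:00
    end: 0 20 * * 1-5          # scale back down at 20:00
    desiredReplicas: "10"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;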

&lt;h2&gt;
  
  
  HPA Configuration Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Always Set minReplicas ≥ 2 for Production
&lt;/h3&gt;

&lt;p&gt;An HPA allowed to scale in to a single replica leaves that replica as a single point of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tune Stabilization Windows
&lt;/h3&gt;

&lt;p&gt;The default 5-minute scale-down stabilization is too aggressive for workloads with cyclical patterns. Increase it to match your workload's natural cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-hpa&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;600&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;25&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;behavior&lt;/code&gt; block (available in HPA v2) enables independent control over scale-up and scale-down.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Lower CPU Threshold Than You Think
&lt;/h3&gt;

&lt;p&gt;If scale-up takes 45 seconds, a 70% threshold leaves existing pods throttled during that window. Set CPU targets at 50-60% for services where scaling latency matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine HPA with PodDisruptionBudgets
&lt;/h3&gt;

&lt;p&gt;HPA scale-down terminates pods. Without a PodDisruptionBudget, multiple replicas can be terminated simultaneously during maintenance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50%"&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Don't Mix VPA Auto-Update with HPA on the Same Metric
&lt;/h3&gt;

&lt;p&gt;VPA auto-updating requests while HPA scales on those metrics creates conflicting control loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Which Autoscaler for Which Workload
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload type&lt;/th&gt;
&lt;th&gt;Recommended signal&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, CPU-bound&lt;/td&gt;
&lt;td&gt;CPU utilization at 50-60%&lt;/td&gt;
&lt;td&gt;HPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, I/O-bound&lt;/td&gt;
&lt;td&gt;RPS per replica or P99 latency&lt;/td&gt;
&lt;td&gt;HPA + custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message queue consumer&lt;/td&gt;
&lt;td&gt;Consumer lag / queue depth&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven / Kafka / SQS&lt;/td&gt;
&lt;td&gt;Event rate or lag&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictable traffic pattern&lt;/td&gt;
&lt;td&gt;Schedule (time-based)&lt;/td&gt;
&lt;td&gt;KEDA Cron scaler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload with memory leak risk&lt;/td&gt;
&lt;td&gt;CPU primary + memory at 85% secondary&lt;/td&gt;
&lt;td&gt;HPA (v2 multi-metric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right-sizing before HPA&lt;/td&gt;
&lt;td&gt;Historical CPU/memory recommendations&lt;/td&gt;
&lt;td&gt;VPA recommendation mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Going Beyond HPA: KEDA and Custom Metrics
&lt;/h2&gt;

&lt;p&gt;KEDA provides a Kubernetes-native autoscaling framework supporting over 60 built-in scalers. The key architectural point: "KEDA does not replace HPA — it feeds it." KEDA creates and manages HPA resources while consuming signals HPA cannot access natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I use both CPU and memory in the same HPA?
&lt;/h3&gt;

&lt;p&gt;Yes. HPA v2 supports multiple metrics simultaneously, scaling to satisfy the most demanding metric. Use CPU at 60% threshold and memory at 85% threshold so memory only triggers in genuine overconsumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does my workload scale up immediately after deployment?
&lt;/h3&gt;

&lt;p&gt;Resource request misconfiguration. Check actual consumption against requests using &lt;code&gt;kubectl top pods&lt;/code&gt;. If consuming 200% of request by simply running, adjust requests to match actual usage before enabling HPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does HPA scale down too aggressively and cause latency spikes?
&lt;/h3&gt;

&lt;p&gt;Increase &lt;code&gt;scaleDown.stabilizationWindowSeconds&lt;/code&gt; in the HPA &lt;code&gt;behavior&lt;/code&gt; block. Also add a &lt;code&gt;Percent&lt;/code&gt; policy limiting scale-down to 25% of replicas per minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I set HPA on every deployment?
&lt;/h3&gt;

&lt;p&gt;No. HPA fits stateless services, consumers, and request handlers. It's inappropriate for stateful workloads where scaling involves more than adding replicas, for singleton controllers, and for batch jobs that should run to completion.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the minimum CPU request for reliable HPA?
&lt;/h3&gt;

&lt;p&gt;No absolute minimum, but requests below 100m make percentage thresholds coarse-grained. At 50m and 70% threshold, scaling triggers at 35m consumption. For lower needs, use RPS or custom metrics instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I debug HPA scaling decisions?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;kubectl describe hpa&lt;/code&gt; to see current metrics and last scaling events. Check HPA events with &lt;code&gt;kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler&lt;/code&gt;. For custom metrics, verify the metrics server returns expected values.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/kubernetes-hpa-best-practices/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/kubernetes-hpa-best-practices&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>scalability</category>
    </item>
    <item>
      <title>Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:00:06 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-3oab</link>
      <guid>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-3oab</guid>
      <description>&lt;h1&gt;
  
  
  Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/kubernetes-hpa-best-practices/" rel="noopener noreferrer"&gt;alexandre-vazquez.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read the full article on my blog: &lt;a href="https://alexandre-vazquez.com/kubernetes-hpa-best-practices/" rel="noopener noreferrer"&gt;https://alexandre-vazquez.com/kubernetes-hpa-best-practices/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>XSLT for beginners: your first transformation</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 20 Apr 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/alexandrev/xslt-for-beginners-your-first-transformation-53hl</link>
      <guid>https://forem.com/alexandrev/xslt-for-beginners-your-first-transformation-53hl</guid>
      <description>&lt;p&gt;XSLT is a language for transforming XML documents into other formats — another XML structure, HTML, plain text, or JSON. If you are new to it, the learning curve can feel steep because XSLT is declarative and template-driven, which is different from procedural languages. This guide walks you through the core ideas with working examples you can run in &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; right now.&lt;/p&gt;

&lt;h2&gt;
  
  
  What XSLT does
&lt;/h2&gt;

&lt;p&gt;You start with an XML source document. You write a stylesheet that describes rules for transforming it. The processor reads both and produces an output document. The stylesheet does not loop through the input line by line — instead, it defines templates that match nodes, and the processor calls those templates as it traverses the document tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your first stylesheet
&lt;/h2&gt;

&lt;p&gt;Start with a simple XML document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;catalog&amp;gt;
  &amp;lt;book&amp;gt;
    &amp;lt;title&amp;gt;Design Patterns&amp;lt;/title&amp;gt;
    &amp;lt;author&amp;gt;Gang of Four&amp;lt;/author&amp;gt;
    &amp;lt;price&amp;gt;45.00&amp;lt;/price&amp;gt;
  &amp;lt;/book&amp;gt;
  &amp;lt;book&amp;gt;
    &amp;lt;title&amp;gt;Clean Code&amp;lt;/title&amp;gt;
    &amp;lt;author&amp;gt;Robert Martin&amp;lt;/author&amp;gt;
    &amp;lt;price&amp;gt;38.00&amp;lt;/price&amp;gt;
  &amp;lt;/book&amp;gt;
&amp;lt;/catalog&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now write a stylesheet that turns this into an HTML table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&amp;gt;

  &amp;lt;xsl:template match="/"&amp;gt;
    &amp;lt;html&amp;gt;
      &amp;lt;body&amp;gt;
        &amp;lt;table border="1"&amp;gt;
          &amp;lt;tr&amp;gt;
            &amp;lt;th&amp;gt;Title&amp;lt;/th&amp;gt;
            &amp;lt;th&amp;gt;Author&amp;lt;/th&amp;gt;
            &amp;lt;th&amp;gt;Price&amp;lt;/th&amp;gt;
          &amp;lt;/tr&amp;gt;
          &amp;lt;xsl:apply-templates select="catalog/book"/&amp;gt;
        &amp;lt;/table&amp;gt;
      &amp;lt;/body&amp;gt;
    &amp;lt;/html&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

  &amp;lt;xsl:template match="book"&amp;gt;
    &amp;lt;tr&amp;gt;
      &amp;lt;td&amp;gt;&amp;lt;xsl:value-of select="title"/&amp;gt;&amp;lt;/td&amp;gt;
      &amp;lt;td&amp;gt;&amp;lt;xsl:value-of select="author"/&amp;gt;&amp;lt;/td&amp;gt;
      &amp;lt;td&amp;gt;&amp;lt;xsl:value-of select="price"/&amp;gt;&amp;lt;/td&amp;gt;
    &amp;lt;/tr&amp;gt;
  &amp;lt;/xsl:template&amp;gt;

&amp;lt;/xsl:stylesheet&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Paste both into &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; and run it. You will see an HTML table with the book data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding templates
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;xsl:template match="/"&amp;gt;&lt;/code&gt; rule fires when the processor reaches the document root. Inside it, &lt;code&gt;&amp;lt;xsl:apply-templates select="catalog/book"/&amp;gt;&lt;/code&gt; tells the processor to find all &lt;code&gt;book&lt;/code&gt; elements inside &lt;code&gt;catalog&lt;/code&gt; and call the matching template for each one.&lt;/p&gt;

&lt;p&gt;The second template, &lt;code&gt;&amp;lt;xsl:template match="book"&amp;gt;&lt;/code&gt;, fires once per &lt;code&gt;book&lt;/code&gt; element. Inside it, &lt;code&gt;&amp;lt;xsl:value-of select="title"/&amp;gt;&lt;/code&gt; extracts the text content of the &lt;code&gt;title&lt;/code&gt; child.&lt;/p&gt;

&lt;p&gt;This is the core loop of XSLT: match, apply, select.&lt;/p&gt;

&lt;h2&gt;
  
  
  XPath selects nodes
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;select&lt;/code&gt; and &lt;code&gt;match&lt;/code&gt; attributes use XPath, a path language for navigating XML trees. A few rules you will use constantly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;XPath&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catalog/book&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;book&lt;/code&gt; children of &lt;code&gt;catalog&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;//book&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All &lt;code&gt;book&lt;/code&gt; elements anywhere in the document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;book/@id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The &lt;code&gt;id&lt;/code&gt; attribute of &lt;code&gt;book&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;book[price &amp;gt; 40]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Books whose price exceeds 40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The current node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;..&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The parent of the current node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Filtering with predicates
&lt;/h2&gt;

&lt;p&gt;Add a predicate to the &lt;code&gt;apply-templates&lt;/code&gt; call to show only books over 40:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:apply-templates select="catalog/book[price &amp;gt; 40]"/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use &lt;code&gt;xsl:if&lt;/code&gt; inside the template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:template match="book"&amp;gt;
  &amp;lt;xsl:if test="price &amp;gt; 40"&amp;gt;
    &amp;lt;tr&amp;gt;
      &amp;lt;td&amp;gt;&amp;lt;xsl:value-of select="title"/&amp;gt;&amp;lt;/td&amp;gt;
    &amp;lt;/tr&amp;gt;
  &amp;lt;/xsl:if&amp;gt;
&amp;lt;/xsl:template&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Sorting output
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;xsl:sort&lt;/code&gt; to control the order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&amp;lt;xsl:apply-templates select="catalog/book"&amp;gt;
  &amp;lt;xsl:sort select="price" data-type="number" order="ascending"/&amp;gt;
&amp;lt;/xsl:apply-templates&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What to try next
&lt;/h2&gt;

&lt;p&gt;Once you are comfortable with templates and XPath:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn &lt;code&gt;xsl:for-each&lt;/code&gt; for inline iteration&lt;/li&gt;
&lt;li&gt;Explore &lt;code&gt;xsl:choose&lt;/code&gt; for multi-branch conditionals&lt;/li&gt;
&lt;li&gt;Try &lt;code&gt;xsl:variable&lt;/code&gt; to store intermediate values&lt;/li&gt;
&lt;li&gt;Look at &lt;code&gt;xsl:param&lt;/code&gt; to pass values into your stylesheet from outside&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these work in XSLT 1.0, 2.0, and 3.0. The &lt;a href="https://xsltplayground.com" rel="noopener noreferrer"&gt;XSLT Playground&lt;/a&gt; lets you set the version and experiment without installing anything.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>data</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Debugging Distroless Containers: kubectl debug, Ephemeral Containers, and When to Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Thu, 16 Apr 2026 11:00:01 +0000</pubDate>
      <link>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-203b</link>
      <guid>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-203b</guid>
      <description>&lt;h1&gt;
  
  
  Debugging Distroless Containers: kubectl debug, Ephemeral Containers, and When to Use Each
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why Distroless Breaks the Normal Debugging Workflow
&lt;/h2&gt;

&lt;p&gt;Traditional container debugging assumes shell access with standard tools like &lt;code&gt;ps&lt;/code&gt;, &lt;code&gt;netstat&lt;/code&gt;, and &lt;code&gt;curl&lt;/code&gt;. Distroless images intentionally exclude these utilities to reduce attack surface and CVEs. This creates an operational challenge: when something goes wrong, the image contains no shell and none of the tools you would normally reach for.&lt;/p&gt;

&lt;p&gt;Kubernetes addresses this through ephemeral containers, stabilized in version 1.25, which enable temporary debug containers to be injected into running pods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: kubectl debug with Ephemeral Containers
&lt;/h2&gt;

&lt;p&gt;The canonical solution uses ephemeral containers to inject a debug container sharing the target pod's network and process namespaces without modifying the original container or restarting the pod.&lt;/p&gt;

&lt;p&gt;Basic invocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--target&lt;/code&gt; flag shares the process namespace of the specified container, enabling inspection via &lt;code&gt;ps aux&lt;/code&gt; and &lt;code&gt;/proc/&lt;/code&gt; access.&lt;/p&gt;

&lt;p&gt;For network diagnostics, use a richer image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Capabilities and Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ephemeral containers provide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full network namespace visibility&lt;/li&gt;
&lt;li&gt;Process inspection via &lt;code&gt;/proc/&lt;/code&gt; (open files, environment variables, memory maps)&lt;/li&gt;
&lt;li&gt;Pod-level DNS resolution access&lt;/li&gt;
&lt;li&gt;Outbound network calls from the pod's network context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ephemeral containers do not provide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct application container filesystem access&lt;/li&gt;
&lt;li&gt;Container removal after creation&lt;/li&gt;
&lt;li&gt;Volume mount modifications via CLI&lt;/li&gt;
&lt;li&gt;Resource limits support in the &lt;code&gt;kubectl debug&lt;/code&gt; CLI&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Accessing the Application Filesystem
&lt;/h3&gt;

&lt;p&gt;The workaround for filesystem access uses the &lt;code&gt;/proc&lt;/code&gt; filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Browse via /proc&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /proc/1/root/app/
&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/1/root/etc/config.yaml

&lt;span class="c"&gt;# Or chroot into the application's filesystem&lt;/span&gt;
&lt;span class="nb"&gt;chroot&lt;/span&gt; /proc/1/root /bin/sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/root&lt;/code&gt; symlink (here &lt;code&gt;/proc/1/root&lt;/code&gt;, since the application typically runs as PID 1 in the shared namespace) provides read access to the container's filesystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Requirements
&lt;/h3&gt;

&lt;p&gt;Ephemeral containers require the &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; subresource permission, separate from &lt;code&gt;pods/exec&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ephemeral-debugger&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/ephemeralcontainers"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/attach"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, scope this tightly with time-limited bindings and approval workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 2: kubectl debug --copy-to (Pod Copy Strategy)
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;--copy-to&lt;/code&gt; flag creates a full pod copy with modifications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug my-pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--copy-to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-pod-debug &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-app:debug &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--share-processes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new pod with the container image replaced. Alternatively, keep the original image and add a debug container alongside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug my-pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--copy-to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-pod-debug &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--share-processes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;debugger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;The copy is a separate pod, not the original, which limits what it can tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It lacks the original pod's in-memory state&lt;/li&gt;
&lt;li&gt;It creates a new Pod UID, potentially triggering different admission policies&lt;/li&gt;
&lt;li&gt;For crashing pods, the copy will also crash unless the entrypoint is modified&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For crash debugging, combine with a modified entrypoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug my-crashing-pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--copy-to&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-pod-debug &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--share-processes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;3600
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
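

&lt;p&gt;With the entrypoint replaced by &lt;code&gt;sleep&lt;/code&gt;, the copy stays up long enough to attach a shell. Volumes from the original spec (ConfigMaps, Secrets) are still mounted; the mount path below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl exec -it my-pod-debug -- sh

# Inspect the mounted configuration the crashing app was reading
ls /etc/config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;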



&lt;h2&gt;
  
  
  Option 3: Debug Image Variants
&lt;/h2&gt;

&lt;p&gt;Maintain a debug variant of your application image including shell tooling. Google distroless images provide &lt;code&gt;:debug&lt;/code&gt; tags with BusyBox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Production image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/java17-debian12&lt;/span&gt;

&lt;span class="c"&gt;# Debug variant&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; gcr.io/distroless/java17-debian12:debug&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Chainguard images follow a similar pattern with &lt;code&gt;:latest-dev&lt;/code&gt; variants that include &lt;code&gt;apk&lt;/code&gt; and a shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Production&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; cgr.dev/chainguard/go:latest&lt;/span&gt;

&lt;span class="c"&gt;# Development/debug&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; cgr.dev/chainguard/go:latest-dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For custom images, use multi-stage builds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;golang:1.22&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;go build &lt;span class="nt"&gt;-o&lt;/span&gt; myapp .

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;gcr.io/distroless/static-debian12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/myapp /myapp&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/myapp"]&lt;/span&gt;

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;gcr.io/distroless/static-debian12:debug&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/myapp /myapp&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/myapp"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build both targets and push &lt;code&gt;my-app:${VERSION}&lt;/code&gt; (production) and &lt;code&gt;my-app:${VERSION}-debug&lt;/code&gt; (debug) to your registry.&lt;/p&gt;
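
&lt;p&gt;One way to produce both tags from the multi-stage Dockerfile above (tag names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build --target production -t my-app:${VERSION} .
docker build --target debug -t my-app:${VERSION}-debug .
docker push my-app:${VERSION}
docker push my-app:${VERSION}-debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;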

&lt;h3&gt;
  
  
  Security Considerations
&lt;/h3&gt;

&lt;p&gt;Debug image variants undermine distroless security benefits if deployed to production. Track usage carefully, require explicit approval, and ensure removal after debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 4: cdebug
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cdebug&lt;/code&gt; is an open-source CLI tool that simplifies ephemeral container debugging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;cdebug

&lt;span class="c"&gt;# Debug a running pod&lt;/span&gt;
cdebug &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod

&lt;span class="c"&gt;# Specify namespace and container&lt;/span&gt;
cdebug &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; production my-pod &lt;span class="nt"&gt;-c&lt;/span&gt; my-container

&lt;span class="c"&gt;# Use specific debug image&lt;/span&gt;
cdebug &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cdebug&lt;/code&gt; adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic chroot into the target container's filesystem&lt;/li&gt;
&lt;li&gt;Docker container integration (&lt;code&gt;cdebug exec&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;No RBAC complications for Docker-based local development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is that it requires third-party tooling installation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 5: Node-Level Debugging
&lt;/h2&gt;

&lt;p&gt;For issues that ephemeral containers cannot address—pod crashing too fast, kernel-level problems, or tools requiring elevated privileges—node-level debugging provides direct container access from the host node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl debug node/my-node-name &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the node debug pod (which runs with the host's PID and network namespaces), use &lt;code&gt;nsenter&lt;/code&gt; to enter container namespaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find the container's PID&lt;/span&gt;
crictl ps | &lt;span class="nb"&gt;grep &lt;/span&gt;my-container
crictl inspect &amp;lt;container-id&amp;gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;pid

&lt;span class="c"&gt;# Enter the container's namespaces&lt;/span&gt;
nsenter &lt;span class="nt"&gt;-t&lt;/span&gt;  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; /bin/sh

&lt;span class="c"&gt;# Enter only network namespace&lt;/span&gt;
nsenter &lt;span class="nt"&gt;-t&lt;/span&gt;  &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; ip a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach enables running &lt;code&gt;strace&lt;/code&gt; and other kernel-level tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trace all syscalls from the application process&lt;/span&gt;
nsenter &lt;span class="nt"&gt;-t&lt;/span&gt;  &lt;span class="nt"&gt;--&lt;/span&gt; strace &lt;span class="nt"&gt;-p&lt;/span&gt;  &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  RBAC and Security
&lt;/h3&gt;

&lt;p&gt;Node-level debugging requires &lt;code&gt;nodes/proxy&lt;/code&gt; and ability to create privileged pods. The debug pod runs with &lt;code&gt;hostPID: true&lt;/code&gt; and &lt;code&gt;hostNetwork: true&lt;/code&gt;, providing visibility into all node processes. Treat this as a break-glass procedure with dual approval, complete audit logging, and immediate cleanup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Approach: Access Profile Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Active production incident, pod running&lt;/td&gt;
&lt;td&gt;kubectl debug + ephemeral container&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers RBAC, k8s 1.25+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pod crashing too fast to attach&lt;/td&gt;
&lt;td&gt;kubectl debug --copy-to + modified entrypoint&lt;/td&gt;
&lt;td&gt;Ability to create pods in namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer debugging in dev/staging&lt;/td&gt;
&lt;td&gt;cdebug exec or kubectl debug&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers or pod create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need full filesystem access&lt;/td&gt;
&lt;td&gt;kubectl debug --copy-to + debug image variant&lt;/td&gt;
&lt;td&gt;Debug image in registry, pod create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need strace or kernel tracing&lt;/td&gt;
&lt;td&gt;Node-level debug with nsenter&lt;/td&gt;
&lt;td&gt;nodes/proxy, cluster admin equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network packet capture&lt;/td&gt;
&lt;td&gt;kubectl debug + nicolaka/netshoot&lt;/td&gt;
&lt;td&gt;pods/ephemeralcontainers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Docker debugging&lt;/td&gt;
&lt;td&gt;cdebug exec&lt;/td&gt;
&lt;td&gt;Docker socket access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI-reproducible debug environment&lt;/td&gt;
&lt;td&gt;Debug image variant in separate build target&lt;/td&gt;
&lt;td&gt;Separate image tag in registry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Developer — Local or Development Cluster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Reproduce bugs, inspect configuration, verify service connectivity.&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; Debug image variants or cdebug.&lt;/p&gt;

&lt;p&gt;Speed and iteration take priority. Build the debug variant and deploy it directly, or use &lt;code&gt;cdebug exec&lt;/code&gt; for automatic filesystem root access.&lt;/p&gt;
&lt;h2&gt;
  
  
  Developer — Staging Cluster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Debug integration issues and environment-specific behavior.&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; kubectl debug with ephemeral containers (&lt;code&gt;--target&lt;/code&gt;), scoped to own namespace.&lt;/p&gt;

&lt;p&gt;Grant developers &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt; in their team's namespaces for self-service debugging without ops involvement.&lt;/p&gt;
&lt;h2&gt;
  
  
  Platform Engineer / SRE — Production
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Diagnose live production incidents while minimizing risk.&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; kubectl debug with ephemeral containers.&lt;/p&gt;

&lt;p&gt;Ephemeral containers satisfy production requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They are recorded in API audit logs (who, when, which pod)&lt;/li&gt;
&lt;li&gt;They do not modify the running application container&lt;/li&gt;
&lt;li&gt;They are limited to the pod's network and process namespaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoid &lt;code&gt;--copy-to&lt;/code&gt; in production incidents because it creates a pod that may not exhibit the issue and adds load during an incident.&lt;/p&gt;
&lt;h2&gt;
  
  
  Platform Engineer — Production, Node-Level Issue
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Diagnose kernel-level issues, container runtime problems, or multi-pod networking issues.&lt;br&gt;
&lt;strong&gt;Approach:&lt;/strong&gt; Node-level debug pod with &lt;code&gt;nsenter&lt;/code&gt;. Treat as break-glass.&lt;/p&gt;

&lt;p&gt;Create a dedicated RBAC role that grants &lt;code&gt;nodes/proxy&lt;/code&gt; access only on-demand with separate authentication and time-limited bindings. Log all access.&lt;/p&gt;
&lt;h2&gt;
  
  
  Common Errors and Solutions
&lt;/h2&gt;
&lt;h3&gt;
  
  
  "ephemeral containers are disabled for this cluster"
&lt;/h3&gt;

&lt;p&gt;Ephemeral containers shipped as alpha in Kubernetes 1.16 behind the &lt;code&gt;EphemeralContainers&lt;/code&gt; feature gate, became beta (enabled by default) in 1.23, and are stable and always-on from 1.25. On older clusters, enable the feature gate or upgrade.&lt;/p&gt;
&lt;h3&gt;
  
  
  "cannot update ephemeralcontainers" (RBAC)
&lt;/h3&gt;

&lt;p&gt;You have &lt;code&gt;pods/exec&lt;/code&gt; but lack &lt;code&gt;pods/ephemeralcontainers&lt;/code&gt;. These are separate subresources.&lt;/p&gt;
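
&lt;p&gt;You can check which subresources a user holds before an incident; &lt;code&gt;kubectl auth can-i&lt;/code&gt; accepts the subresource syntax directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl auth can-i create pods/exec -n production
kubectl auth can-i update pods/ephemeralcontainers -n production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
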
&lt;h3&gt;
  
  
  "container not found" with --target
&lt;/h3&gt;

&lt;p&gt;The container name in &lt;code&gt;--target&lt;/code&gt; must match exactly. Verify with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod my-pod &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.spec.containers[*].name}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Can see processes but cannot read /proc/1/root
&lt;/h3&gt;

&lt;p&gt;The ephemeral container may lack the &lt;code&gt;CAP_SYS_PTRACE&lt;/code&gt; capability, which the kernel requires for reading another process's &lt;code&gt;/proc&lt;/code&gt; entries when the UIDs differ. Explicitly add the capability (note that the Baseline Pod Security Standards profile does not allow adding &lt;code&gt;SYS_PTRACE&lt;/code&gt;, so the debug namespace may need the Privileged profile):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SYS_PTRACE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  tcpdump shows no traffic
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;tcpdump -i any&lt;/code&gt; to capture on all interfaces including loopback, where inter-container traffic travels.&lt;/p&gt;
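
&lt;p&gt;A typical capture from a netshoot ephemeral container (the port and output file are examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Capture on all interfaces, including loopback
tcpdump -i any -n port 8080

# Write a capture file for later analysis in Wireshark
tcpdump -i any -n -w /tmp/capture.pcap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;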

&lt;h2&gt;
  
  
  Production RBAC Design
&lt;/h2&gt;

&lt;p&gt;Separate three privilege tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1: Developer self-service&lt;/strong&gt; (team namespaces)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;distroless-debugger&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-namespace&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/ephemeralcontainers"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/attach"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 2: SRE production incident access&lt;/strong&gt; (all namespaces)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sre-distroless-debugger&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/ephemeralcontainers"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/attach"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tier 3: Break-glass node access&lt;/strong&gt; (time-limited binding recommended)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterRole&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node-debugger&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nodes/proxy"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delete"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bind Tier 1 permanently to developers. Bind Tier 2 permanently to SREs with audit alerts on use. Bind Tier 3 only on-demand, via an operator or workflow that creates a time-limited ClusterRoleBinding and removes it afterward, never as a permanent ClusterRoleBinding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Distroless containers reduce attack surface and CVE count by forcing a clean separation between application and tooling. Kubernetes answers the resulting debugging gap with ephemeral containers and &lt;code&gt;kubectl debug&lt;/code&gt;: inject a debug container with the necessary tools into the running pod, sharing its network and process namespaces, without restarting or modifying the application.&lt;/p&gt;

&lt;p&gt;For scenarios ephemeral containers cannot address—filesystem access, crash debugging, kernel-level investigation—the copy strategy and node-level debug fill remaining gaps. The key to scaling this approach is the access model: developers get self-service ephemeral container access in their namespaces, SREs get cluster-wide ephemeral container access for production incidents, and node-level access is a break-glass procedure with audit trail and time limits.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/debugging-distroless-containers/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/debugging-distroless-containers&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Debugging Distroless Containers: Kubectl Debug, Ephemeral Containers, and When to Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 14 Apr 2026 10:24:40 +0000</pubDate>
      <link>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-ook</link>
      <guid>https://forem.com/alexandrev/debugging-distroless-containers-kubectl-debug-ephemeral-containers-and-when-to-use-each-ook</guid>
      <description>&lt;h1&gt;
  
  
  Debugging Distroless Containers: Kubectl Debug, Ephemeral Containers, and When to Use Each
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/debugging-distroless-containers/" rel="noopener noreferrer"&gt;alexandre-vazquez.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Read the full article on my blog: &lt;a href="https://alexandre-vazquez.com/debugging-distroless-containers/" rel="noopener noreferrer"&gt;https://alexandre-vazquez.com/debugging-distroless-containers/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:39:16 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-319a</link>
      <guid>https://forem.com/alexandrev/kubernetes-hpa-best-practices-when-cpu-works-why-memory-almost-never-does-319a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://alexandre-vazquez.com/kubernetes-hpa-best-practices/" rel="noopener noreferrer"&gt;alexandre-vazquez.com/kubernetes-hpa-best-practices/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a configuration that appears in virtually every Kubernetes cluster: a HorizontalPodAutoscaler targeting 70% CPU utilization and 70% memory utilization. It looks reasonable. It follows the examples in the official documentation. And in many cases, it silently causes more harm than good.&lt;/p&gt;

&lt;p&gt;The problems surface in predictable ways: workloads that do nothing get scaled up because their memory footprint is naturally high. Latency-sensitive APIs scale too slowly because the CPU spike is already over by the time new pods are ready. Batch jobs oscillate between scaling up and down during normal operation. And teams spend hours debugging autoscaling behavior that should have been straightforward.&lt;/p&gt;

&lt;p&gt;This article explains &lt;strong&gt;why the default HPA configuration fails&lt;/strong&gt;, the exact conditions under which memory-based HPA is appropriate (and when it is not), and which alternative metrics — custom metrics, event-driven triggers, and external signals — produce autoscaling behavior that actually matches workload demand.&lt;/p&gt;

&lt;h2&gt;
  
  
  How HPA Actually Decides to Scale
&lt;/h2&gt;

&lt;p&gt;Before diagnosing the problems, it is worth understanding the mechanics precisely. HPA computes a desired replica count using this formula:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a CPU target of 70%, with 2 replicas currently consuming an average of 140% of their CPU request, HPA computes &lt;code&gt;ceil(2 × (140 / 70)) = 4&lt;/code&gt; replicas. This is conceptually simple but has a critical dependency that most configurations ignore: &lt;strong&gt;the metric value is expressed relative to the resource &lt;em&gt;request&lt;/em&gt;, not the resource limit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This distinction is fundamental to understanding every failure mode that follows. If a container has a CPU request of 100m and a limit of 2000m, and it is currently consuming 80m, HPA sees 80% utilization — even though the container is using only 4% of its allowed ceiling. Set an HPA threshold of 70% on a container with a CPU request of 100m and any nontrivial workload will trigger scaling immediately.&lt;/p&gt;
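&lt;p&gt;The replica computation can be sketched in a few lines of Python (illustrative only; the real controller also applies a tolerance band and ignores pods that are not yet ready):&lt;/p&gt;

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA core formula: ceil(currentReplicas * (currentMetricValue / desiredMetricValue))."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 2 replicas averaging 140% of their CPU request against a 70% target:
print(desired_replicas(2, 140, 70))  # 4

# The same absolute usage looks very different depending on the request size:
# 80m of CPU against a 100m request is 80% utilization...
print(desired_replicas(2, 80, 70))   # 3 -> scales up
# ...but 80m against a 1000m request is only 8% utilization.
print(desired_replicas(2, 8, 70))    # 1 -> would scale down (bounded by minReplicas)
```

&lt;p&gt;Because the percentages are relative to the request, a misconfigured request distorts every threshold in the sections that follow.&lt;/p&gt;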

&lt;p&gt;The HPA controller polls metrics every 15 seconds by default (&lt;code&gt;--horizontal-pod-autoscaler-sync-period&lt;/code&gt;). Scale-up happens quickly — within one to three polling cycles when the threshold is consistently exceeded. Scale-down is deliberately slow: by default the controller waits 5 minutes (&lt;code&gt;--horizontal-pod-autoscaler-downscale-stabilization&lt;/code&gt;) before reducing replicas, to avoid thrashing. This asymmetry matters when debugging oscillation.&lt;/p&gt;

&lt;h2&gt;
  
  
  CPU-Based HPA: When It Works and When It Doesn't
&lt;/h2&gt;

&lt;p&gt;CPU is a &lt;em&gt;compressible&lt;/em&gt; resource. When a container hits its CPU limit, the kernel throttles it — the process slows down but does not crash or get evicted. This property makes CPU a reasonable proxy for load in many, but not all, scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Works Well
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stateless request-processing workloads&lt;/strong&gt; are the sweet spot for CPU-based HPA. If your service does CPU-bound work per request — REST APIs performing data transformation, compute-heavy business logic, image processing — then CPU utilization correlates strongly with request volume. More requests means more CPU consumed, which means HPA adds replicas, which distributes the load.&lt;/p&gt;

&lt;p&gt;The key prerequisites for CPU HPA to work correctly are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accurate CPU requests.&lt;/strong&gt; Set requests to the actual sustained consumption of the workload under normal load, not a low placeholder. Use VPA in recommendation mode or historical Prometheus data to right-size requests before enabling HPA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasonable request-to-limit ratio.&lt;/strong&gt; A ratio of 1:4 or less keeps HPA thresholds meaningful. A container with request 100m and limit 4000m makes percentage-based thresholds nearly useless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU consumption that tracks user load linearly.&lt;/strong&gt; If your service does CPU-heavy background work independent of incoming requests, CPU utilization will trigger scaling regardless of actual demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where CPU HPA Fails
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Latency-sensitive services with sharp traffic spikes.&lt;/strong&gt; HPA reacts to average CPU utilization measured over the polling window. For a service that handles traffic bursts — a flash sale, a cron-triggered batch of API calls, a notification broadcast — by the time the HPA controller detects the spike, schedules new pods, and those pods pass readiness checks, the burst may already be over. The result is replicas added after the damage is done, with the added cost of a scale-down cycle afterward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I/O-bound workloads.&lt;/strong&gt; A service that spends most of its time waiting on database queries, external API calls, or message queue reads will show low CPU utilization even under heavy load. HPA will not add replicas while the service is degraded — it sees idle CPUs while goroutines or threads are blocked waiting on I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workloads with cold-start costs.&lt;/strong&gt; If a new replica takes 30-60 seconds to warm up (loading ML models, establishing connection pools, populating caches), scaling decisions need to happen earlier — before CPU peaks — not in reaction to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Memory-Based HPA: Why It Almost Always Breaks
&lt;/h2&gt;

&lt;p&gt;Memory is an &lt;em&gt;incompressible&lt;/em&gt; resource. Unlike CPU — which can be throttled without killing a process — when a container exhausts its memory limit, the OOM killer terminates it. This single property cascades into a set of fundamental problems with using memory as an HPA trigger.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Problem: Memory Doesn't Naturally Correlate With Load
&lt;/h3&gt;

&lt;p&gt;For most well-architected services, memory consumption is relatively stable. A Go service allocates memory at startup for its runtime structures, connection pools, and caches — and then maintains roughly that footprint regardless of traffic. A JVM application allocates a heap at startup and uses garbage collection to manage it. In both cases, memory usage at 10 requests per second and at 10,000 requests per second may be nearly identical.&lt;/p&gt;

&lt;p&gt;This means a memory-based HPA with a 70% threshold will either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Never trigger,&lt;/strong&gt; because the workload's memory is stable and always below the threshold — rendering the HPA useless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always trigger,&lt;/strong&gt; because the workload's baseline memory consumption is naturally above the threshold — causing the workload to scale out permanently and never scale back in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither outcome corresponds to actual scaling need.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Request Misconfiguration Trap
&lt;/h3&gt;

&lt;p&gt;This is the most common cause of "my workload scales up for no reason." Consider a Java service that needs 512Mi of heap to run normally. The team sets the memory request to 256Mi — too conservative, either to save cost or because the initial estimate was wrong. The service immediately consumes 200% of its memory request just by being alive. An HPA with a 70% memory target will scale this workload to maximum replicas within minutes of deployment, and it will stay there forever.&lt;/p&gt;

&lt;p&gt;The fix is never "adjust the HPA threshold." The fix is right-sizing the memory request. But this reveals the deeper issue: &lt;strong&gt;memory-based HPA is extremely sensitive to the accuracy of your resource requests&lt;/strong&gt;, and most teams do not have accurate requests — especially for newer workloads or after code changes that alter memory footprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  JVM and Go Runtime Memory Behavior
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;JVM workloads&lt;/strong&gt; are particularly problematic. By default, the JVM allocates heap up to a maximum (&lt;code&gt;-Xmx&lt;/code&gt;) and then holds that memory — it does not release heap back to the OS aggressively, even after garbage collection. A JVM service that handles one request per hour will show nearly the same memory footprint as one handling thousands of requests per minute. Furthermore, the JVM's garbage collector introduces memory spikes during collection cycles that are unrelated to load.&lt;/p&gt;

&lt;p&gt;In containerized JVM environments, you also need to account for container awareness (&lt;code&gt;-XX:+UseContainerSupport&lt;/code&gt;, enabled by default since JDK 10 and backported to JDK 8u191), which affects how the JVM calculates its heap ceiling relative to the container limit. Without proper tuning, the JVM may allocate a heap that fills 80-90% of the container's memory limit — immediately triggering any memory-based HPA.&lt;/p&gt;
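&lt;p&gt;One common mitigation can be sketched as follows (the &lt;code&gt;MaxRAMPercentage&lt;/code&gt; flag is standard, but the container names and sizes here are illustrative): cap the heap as a percentage of the container limit so the JVM leaves headroom for metaspace, thread stacks, and off-heap buffers.&lt;/p&gt;

```yaml
# Illustrative Deployment fragment: bound the heap at 60% of the container
# limit so baseline memory reflects configuration, not load.
containers:
- name: my-java-service          # hypothetical name
  image: my-java-service:1.0     # hypothetical image
  resources:
    requests:
      memory: "768Mi"            # sized from observed usage, not a guess
    limits:
      memory: "1Gi"
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=60.0"
```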

&lt;p&gt;&lt;strong&gt;Go workloads&lt;/strong&gt; behave differently but also poorly with memory HPA. Go's garbage collector is designed to maintain low latency rather than minimal memory use. The runtime may hold memory above what is strictly needed, and the memory footprint can vary based on GC tuning parameters (&lt;code&gt;GOGC&lt;/code&gt;, &lt;code&gt;GOMEMLIMIT&lt;/code&gt;) in ways that are not correlated with incoming request load.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Memory HPA Is Actually Appropriate
&lt;/h3&gt;

&lt;p&gt;There are narrow cases where memory-based HPA makes sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workloads where memory consumption genuinely tracks with load linearly.&lt;/strong&gt; Some data processing pipelines, in-memory caches that grow with request volume, or streaming applications that buffer data proportionally to throughput. If you can demonstrate from metrics that memory and load have a strong linear correlation, memory HPA is defensible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;As a safety valve alongside CPU HPA.&lt;/strong&gt; Using memory as a secondary metric (not primary) to protect against memory leaks or runaway allocations in a service that normally scales on CPU. In this case, set the memory threshold high — 85-90% — so it only triggers in genuine overconsumption scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching services where eviction is not desirable.&lt;/strong&gt; If a service uses memory as a performance cache and you want to scale out before memory pressure causes cache eviction, memory utilization can be a useful trigger — provided requests are accurately sized.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Outside these specific cases, removing memory from your HPA spec and relying on the signals below will produce better behavior in virtually every scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right-Sizing Requests Before You Add HPA
&lt;/h2&gt;

&lt;p&gt;No HPA strategy works correctly without accurate resource requests. Before adding any autoscaler — CPU, memory, or custom metrics — run your workload under representative load and measure actual consumption. The easiest way to do this is with VPA in recommendation mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only — don't auto-apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After 24-48 hours of traffic, check the VPA recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe vpa my-service-vpa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;lowerBound&lt;/code&gt;, &lt;code&gt;target&lt;/code&gt;, and &lt;code&gt;upperBound&lt;/code&gt; values give you a data-driven baseline for setting requests. Set your requests at or near the VPA &lt;code&gt;target&lt;/code&gt; value before configuring HPA. This single step eliminates the most common cause of HPA misbehavior.&lt;/p&gt;

&lt;p&gt;Note that &lt;strong&gt;VPA and HPA cannot both manage the same resource metric simultaneously&lt;/strong&gt;. If VPA is set to auto-update CPU or memory, and HPA is also scaling on those metrics, the two controllers will fight each other. The safe combination is: HPA on CPU/memory + VPA in recommendation-only mode, or HPA on custom metrics + VPA on CPU/memory in auto mode. See the &lt;a href="https://alexandre-vazquez.com/introduction/" rel="noopener noreferrer"&gt;Kubernetes VPA guide&lt;/a&gt; for the full details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Better Signals: What to Scale On Instead
&lt;/h2&gt;

&lt;p&gt;The fundamental shift is moving from &lt;em&gt;resource consumption&lt;/em&gt; metrics (which describe the past) to &lt;em&gt;demand&lt;/em&gt; metrics (which describe what the workload is being asked to do right now or will be asked to do in seconds).&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests Per Second (RPS)
&lt;/h3&gt;

&lt;p&gt;For HTTP services, requests per second per replica is usually the most accurate proxy for load. Unlike CPU, it measures demand directly — not a side-effect of demand. An HPA that maintains 500 RPS per replica will scale predictably as traffic grows, regardless of whether the service is CPU-bound, memory-bound, or I/O-bound.&lt;/p&gt;

&lt;p&gt;RPS is available as a custom metric from your service mesh (Istio exposes it as &lt;code&gt;istio_requests_total&lt;/code&gt;), from your ingress controller (NGINX exposes request rates via Prometheus), or from your application's own Prometheus metrics. Configuring HPA on custom metrics requires the Prometheus Adapter or a compatible custom metrics API implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # 500 RPS per replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Queue Depth and Lag
&lt;/h3&gt;

&lt;p&gt;For consumer workloads — services reading from Kafka, RabbitMQ, SQS, or any message queue — the right scaling signal is &lt;strong&gt;consumer lag&lt;/strong&gt;: how many messages are waiting to be processed. A lag of zero means consumers are keeping up; a growing lag means you need more consumers.&lt;/p&gt;

&lt;p&gt;CPU will not give you this signal reliably. A consumer blocked on a slow database write will show low CPU but growing lag. An idle consumer will show low CPU even if the queue contains millions of unprocessed messages. Scaling on lag directly solves both problems.&lt;/p&gt;

&lt;p&gt;This is precisely the use case that &lt;a href="https://alexandre-vazquez.com/enhanced-autoscaling-options-for-event-driven-applications/" rel="noopener noreferrer"&gt;KEDA was built for&lt;/a&gt;. KEDA's Kafka scaler, for example, reads consumer group lag directly and scales replicas to maintain a configurable lag threshold — no custom metrics pipeline required.&lt;/p&gt;
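&lt;p&gt;A minimal KEDA &lt;code&gt;ScaledObject&lt;/code&gt; for this pattern might look like the following sketch (deployment, topic, broker, and consumer group names are placeholders):&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-consumer-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: order-consumer             # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092           # placeholder address
      consumerGroup: order-consumer-group    # placeholder group
      topic: orders                          # placeholder topic
      lagThreshold: "100"                    # target lag per replica
```

&lt;p&gt;KEDA translates &lt;code&gt;lagThreshold&lt;/code&gt; into an external metric and manages the underlying HPA for you.&lt;/p&gt;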
&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;P99 latency per replica is an excellent scaling signal for latency-sensitive services. If your SLO is a 200ms P99 response time and latency starts climbing toward 400ms, that is a direct signal that the service is overloaded — regardless of what CPU or memory shows.&lt;/p&gt;

&lt;p&gt;Latency-based autoscaling requires custom metrics from your service mesh or APM tool, but the added complexity is often justified for user-facing APIs where latency directly impacts experience.&lt;/p&gt;
&lt;h3&gt;
  
  
  Scheduled and Predictive Scaling
&lt;/h3&gt;

&lt;p&gt;For workloads with predictable traffic patterns — business-hours services, weekly batch jobs, end-of-month processing peaks — proactive scaling outperforms reactive scaling by definition. Rather than waiting for CPU to spike and then scrambling to add replicas, you pre-scale before the expected load increase.&lt;/p&gt;

&lt;p&gt;KEDA's Cron scaler enables this pattern declaratively, defining scale rules based on time windows rather than observed metrics.&lt;/p&gt;
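&lt;p&gt;As a sketch (timezone, schedule, and replica counts are illustrative), a business-hours pre-scale rule looks like this:&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler   # hypothetical name
spec:
  scaleTargetRef:
    name: my-service
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Madrid   # placeholder timezone
      start: 0 8 * * 1-5        # scale up weekdays at 08:00
      end: 0 20 * * 1-5         # scale back down at 20:00
      desiredReplicas: "10"
```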
&lt;h2&gt;
  
  
  HPA Configuration Best Practices
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Always Set minReplicas ≥ 2 for Production
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;minReplicas: 1&lt;/code&gt; HPA means your service has a single point of failure during scale-in events. When HPA scales down to 1 replica and that pod is evicted for node maintenance, your service has zero available instances until a replacement pod starts and becomes ready. For any production workload, set &lt;code&gt;minReplicas: 2&lt;/code&gt; as a baseline.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tune Stabilization Windows
&lt;/h3&gt;

&lt;p&gt;The default 5-minute scale-down stabilization window is too aggressive for many workloads. A service that processes jobs in 3-minute batches will show a predictable CPU trough between batches — HPA will attempt to scale down, only to scale back up when the next batch arrives. Increase the stabilization window to match your workload's natural cycle:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 minutes
      policies:
      - type: Percent
        value: 25                        # Scale down max 25% of replicas at once
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;behavior&lt;/code&gt; block (available in HPA v2, GA since Kubernetes 1.23) gives you independent control over scale-up and scale-down behavior. Aggressive scale-up with conservative scale-down is the right default for most production services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use a Lower CPU Threshold Than You Think
&lt;/h3&gt;

&lt;p&gt;A CPU target of 70% sounds like it leaves headroom, but it does not account for the time required to scale. If your service takes 45 seconds to pass readiness checks after a new pod starts, and you scale at 70% CPU, the existing pods will be at 100%+ CPU (throttled) for 45 seconds before relief arrives. Set CPU targets at 50-60% for services where scale-up latency matters. This keeps more headroom available during the scaling reaction window.&lt;/p&gt;

&lt;h3&gt;
  
  
  Combine HPA with PodDisruptionBudgets
&lt;/h3&gt;

&lt;p&gt;HPA scale-down terminates pods. Without a PodDisruptionBudget, HPA can terminate multiple replicas simultaneously during a scale-down event, potentially taking your service below its minimum healthy instance count during cluster maintenance. Always pair an HPA with a PDB:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: my-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Don't Mix VPA Auto-Update with HPA on the Same Metric
&lt;/h3&gt;

&lt;p&gt;If VPA is set to auto-update CPU or memory requests, and HPA is also scaling on CPU or memory utilization, you create a control loop conflict. VPA changes the request (the denominator of the utilization calculation), which immediately changes the apparent utilization, which triggers HPA to change replica count, which changes the per-pod load, which triggers VPA again. Use VPA in &lt;code&gt;Off&lt;/code&gt; or &lt;code&gt;Initial&lt;/code&gt; mode when HPA is managing the same workload on resource metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework: Which Autoscaler for Which Workload
&lt;/h2&gt;

&lt;p&gt;Use this as a starting point when configuring autoscaling for a new workload:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload type&lt;/th&gt;
&lt;th&gt;Recommended signal&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, CPU-bound&lt;/td&gt;
&lt;td&gt;CPU utilization at 50-60%&lt;/td&gt;
&lt;td&gt;HPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless HTTP API, I/O-bound&lt;/td&gt;
&lt;td&gt;RPS per replica or P99 latency&lt;/td&gt;
&lt;td&gt;HPA + custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message queue consumer&lt;/td&gt;
&lt;td&gt;Consumer lag / queue depth&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven / Kafka / SQS&lt;/td&gt;
&lt;td&gt;Event rate or lag&lt;/td&gt;
&lt;td&gt;KEDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictable traffic pattern&lt;/td&gt;
&lt;td&gt;Schedule (time-based)&lt;/td&gt;
&lt;td&gt;KEDA Cron scaler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workload with memory leak risk&lt;/td&gt;
&lt;td&gt;CPU primary + memory at 85% secondary&lt;/td&gt;
&lt;td&gt;HPA (v2 multi-metric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Right-sizing before HPA&lt;/td&gt;
&lt;td&gt;Historical CPU/memory recommendations&lt;/td&gt;
&lt;td&gt;VPA recommendation mode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Going Beyond HPA: KEDA and Custom Metrics
&lt;/h2&gt;

&lt;p&gt;Once you outgrow what HPA v2 can express — particularly for event-driven architectures, external system triggers, or composite scaling conditions — KEDA provides a Kubernetes-native autoscaling framework that extends the HPA model without replacing it.&lt;/p&gt;

&lt;p&gt;KEDA works by implementing a custom metrics API that HPA can consume, plus its own ScaledObject CRD that abstracts the configuration of over 60 built-in scalers: Kafka, RabbitMQ, Azure Service Bus, AWS SQS, Prometheus queries, Datadog metrics, HTTP request rate, and more. The important architectural point is that &lt;strong&gt;KEDA does not replace HPA — it feeds it&lt;/strong&gt;. Under the hood, KEDA creates and manages an HPA resource targeting the scaled deployment. You get HPA's stabilization windows, replica bounds, and Kubernetes-native behavior, driven by signals that HPA itself cannot access natively.&lt;/p&gt;

&lt;p&gt;For a detailed walkthrough of KEDA scalers and real-world event-driven patterns, see &lt;a href="https://alexandre-vazquez.com/enhanced-autoscaling-options-for-event-driven-applications/" rel="noopener noreferrer"&gt;Event-Driven Autoscaling in Kubernetes with KEDA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For workloads where the right scaling signal comes from a Prometheus metric — request rates, custom business metrics, queue sizes exposed via exporters — the &lt;a href="https://alexandre-vazquez.com/kubernetes-autoscaling-126/" rel="noopener noreferrer"&gt;Kubernetes Autoscaling 1.26 and HPA v2 article&lt;/a&gt; covers how the custom metrics API pipeline works and how changes in Kubernetes 1.26 affected KEDA behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  ❓ FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can I use both CPU and memory in the same HPA?
&lt;/h3&gt;

&lt;p&gt;Yes. HPA v2 supports multiple metrics simultaneously — it scales to satisfy the most demanding metric. If CPU is at 40% (below threshold) but memory is at 80% (above threshold), HPA will scale up. This multi-metric capability is useful for using memory as a safety valve while CPU drives normal scaling behavior. Set the CPU threshold at 60% and the memory threshold at 85% so memory only triggers in genuine overconsumption scenarios.&lt;/p&gt;
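&lt;p&gt;A sketch of that combination (the workload name and replica bounds are illustrative):&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60    # primary scaling driver
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85    # safety valve only
```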

&lt;h3&gt;
  
  
  Why does my workload scale up immediately after deployment?
&lt;/h3&gt;

&lt;p&gt;Almost always a resource request misconfiguration. Check &lt;code&gt;kubectl top pods&lt;/code&gt; immediately after deployment and compare the actual consumption to the configured request. If the workload is consuming 200% of its request by simply being alive, the request is set too low. Use VPA in recommendation mode for 24 hours and adjust the request to match actual usage before re-enabling HPA.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does HPA scale down too aggressively and cause latency spikes?
&lt;/h3&gt;

&lt;p&gt;Increase the &lt;code&gt;scaleDown.stabilizationWindowSeconds&lt;/code&gt; in the HPA &lt;code&gt;behavior&lt;/code&gt; block. The default 300 seconds is too short for workloads with cyclical load patterns. Also add a &lt;code&gt;Percent&lt;/code&gt; policy to scale down at most 25% of replicas per minute, preventing simultaneous termination of multiple pods during a rapid scale-down event.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I set HPA on every deployment?
&lt;/h3&gt;

&lt;p&gt;No. HPA is appropriate for workloads where replica count meaningfully affects capacity — stateless services, consumers, request handlers. It is not appropriate for stateful workloads (databases, caches) where scaling requires more than just adding replicas, for singleton controllers that should never have more than one replica, or for batch jobs that should run to completion without scaling. Adding HPA to every deployment creates operational noise and potential instability without benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the minimum CPU request I should set to use HPA reliably?
&lt;/h3&gt;

&lt;p&gt;There is no absolute minimum, but requests below 100m make percentage thresholds very coarse-grained. At 50m CPU request and a 70% threshold, HPA triggers when the pod consumes 35m CPU — essentially any non-trivial activity. In practice, if your workload genuinely needs less than 100m CPU under load, it probably should not be using CPU-based HPA at all. Consider RPS or custom metrics instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I debug HPA scaling decisions?
&lt;/h3&gt;

&lt;p&gt;Start with &lt;code&gt;kubectl describe hpa &amp;lt;name&amp;gt;&lt;/code&gt; — it shows the current metric values, the computed desired replica count, and the last scaling event reason. For deeper inspection, check HPA events with &lt;code&gt;kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler&lt;/code&gt;. If using custom metrics, verify the metrics server is returning expected values with &lt;code&gt;kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://alexandre-vazquez.com/enhanced-autoscaling-options-for-event-driven-applications/" rel="noopener noreferrer"&gt;Event-Driven Autoscaling with KEDA&lt;/a&gt; — queue depth, Kafka lag, and external triggers as scaling signals&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://alexandre-vazquez.com/introduction/" rel="noopener noreferrer"&gt;Kubernetes VPA Explained&lt;/a&gt; — right-sizing resource requests before enabling HPA&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://alexandre-vazquez.com/kubernetes-autoscaling-126/" rel="noopener noreferrer"&gt;HPA v2 and Kubernetes Autoscaling 1.26&lt;/a&gt; — custom metrics API, KEDA integration, and behavioral changes&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;Kubernetes HPA Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://keda.sh/docs/latest/concepts/" rel="noopener noreferrer"&gt;KEDA Concepts — Official Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Prometheus Alertmanager vs Grafana Alerting (2026): Architecture, Features, and When to Use Each</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Wed, 08 Apr 2026 12:04:56 +0000</pubDate>
      <link>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-4gn</link>
      <guid>https://forem.com/alexandrev/prometheus-alertmanager-vs-grafana-alerting-2026-architecture-features-and-when-to-use-each-4gn</guid>
      <description>&lt;p&gt;Most observability stacks that have been running in production for more than a year end up with alerting spread across two systems: Prometheus Alertmanager handling metric-based alerts and Grafana Alerting managing everything else. Engineers add a Slack integration in Grafana because it is convenient, then realize their Alertmanager routing tree already covers the same service. Before long, the on-call team receives duplicated pages, silencing rules live in two places, and nobody is confident which system is authoritative.&lt;/p&gt;

&lt;p&gt;This is the alerting consolidation problem, and it affects teams of every size. The question is straightforward: should you standardize on Prometheus Alertmanager, move everything into Grafana Alerting, or deliberately run both? The answer depends on your datasource mix, your GitOps maturity, and how your organization manages on-call routing. This guide breaks down the architecture, features, and operational trade-offs of each system so you can make a deliberate choice instead of drifting into accidental complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Before comparing features, you need to understand how each system fits into the alerting pipeline. They occupy the same logical space — “receive a condition, route a notification” — but they get there from fundamentally different starting points.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus Alertmanager: The Standalone Receiver
&lt;/h3&gt;

&lt;p&gt;Alertmanager is a dedicated, standalone component in the Prometheus ecosystem. It does not evaluate alert rules itself. Instead, Prometheus (or any compatible sender like Thanos Ruler, Cortex, or Mimir Ruler) evaluates PromQL expressions and pushes firing alerts to the Alertmanager API. Alertmanager then handles deduplication, grouping, inhibition, silencing, and notification delivery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified Prometheus → Alertmanager flow
#
# [Prometheus] --evaluates rules--&amp;gt; [firing alerts]
#        |
#        +--POST /api/v2/alerts--&amp;gt; [Alertmanager]
#                                      |
#                          +-----------+-----------+
#                          |           |           |
#                       [Slack]    [PagerDuty]  [Email]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire configuration lives in a single YAML file (&lt;code&gt;alertmanager.yml&lt;/code&gt;). This includes the routing tree, receiver definitions, inhibition rules, and silence templates. There is no database, no UI-driven state — just a config file and an optional local storage directory for notification state and silences. This makes it trivially reproducible and ideal for GitOps workflows.&lt;/p&gt;

&lt;p&gt;For high availability, you run multiple Alertmanager instances in a gossip-based cluster. They use a mesh protocol to share silence and notification state, ensuring that failover does not result in duplicate or lost notifications. The HA model is well-understood and has been stable for years.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grafana Alerting: The Integrated Platform
&lt;/h3&gt;

&lt;p&gt;Grafana Alerting (sometimes called “Grafana Unified Alerting,” introduced in Grafana 8 and significantly matured through Grafana 11 and 12) takes a different architectural approach. It embeds the entire alerting lifecycle — rule evaluation, state management, routing, and notification — inside the Grafana server process. Under the hood, it actually uses a fork of Alertmanager for the routing and notification layer, but this is an implementation detail that is invisible to users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simplified Grafana Alerting flow
#
# [Grafana Server]
#   ├── Rule Evaluation Engine
#   │     ├── queries Prometheus
#   │     ├── queries Loki
#   │     ├── queries CloudWatch
#   │     └── queries any supported datasource
#   │
#   ├── Alert State Manager (internal)
#   │
#   └── Embedded Alertmanager (routing + notifications)
#           |
#           +-----------+-----------+
#           |           |           |
#        [Slack]    [PagerDuty]  [Email]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical distinction is that Grafana Alerting evaluates alert rules itself, querying any configured datasource — not just Prometheus. It can fire alerts based on Loki log queries, Elasticsearch searches, CloudWatch metrics, PostgreSQL queries, or any of the 100+ datasource plugins available in Grafana. Rule definitions, contact points, notification policies, and mute timings are stored in the Grafana database (or provisioned via YAML files and the Grafana API).&lt;/p&gt;

&lt;p&gt;For high availability in self-hosted environments, Grafana Alerting relies on a shared database and a peer-discovery mechanism between Grafana instances. In Grafana Cloud, HA is fully managed by Grafana Labs.&lt;/p&gt;
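&lt;p&gt;A minimal sketch of those self-hosted HA settings in &lt;code&gt;grafana.ini&lt;/code&gt; (the peer hostnames and ports here are illustrative, not defaults):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# grafana.ini — illustrative HA settings for self-hosted Grafana Alerting
# All instances must share the same database for this to work
[unified_alerting]
enabled = true
ha_listen_address = "0.0.0.0:9094"
ha_peers = "grafana-0:9094,grafana-1:9094,grafana-2:9094"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;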

&lt;h2&gt;
  
  
  Feature Comparison
&lt;/h2&gt;

&lt;p&gt;The following table provides a side-by-side comparison of the capabilities that matter most in production alerting systems. Both systems are mature, but they prioritize different things.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Prometheus Alertmanager&lt;/th&gt;
&lt;th&gt;Grafana Alerting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prometheus-compatible only (Prometheus, Thanos, Mimir, VictoriaMetrics)&lt;/td&gt;
&lt;td&gt;Any Grafana datasource (Prometheus, Loki, Elasticsearch, CloudWatch, SQL databases, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rule evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External (Prometheus/Ruler evaluates rules and pushes alerts)&lt;/td&gt;
&lt;td&gt;Built-in (Grafana evaluates rules directly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Routing tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hierarchical YAML-based routing with match/match_re, continue, group_by&lt;/td&gt;
&lt;td&gt;Notification policies with label matchers, nested policies, mute timings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grouping&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full support via group_by, group_wait, group_interval&lt;/td&gt;
&lt;td&gt;Full support via notification policies with equivalent controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inhibition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native inhibition rules (suppress alerts when a related alert is firing)&lt;/td&gt;
&lt;td&gt;Supported since Grafana 10.3 but less flexible than Alertmanager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silencing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Label-based silences via API or UI, time-limited&lt;/td&gt;
&lt;td&gt;Mute timings (recurring schedules) and silences (ad-hoc, label-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Notification channels&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Email, Slack, PagerDuty, OpsGenie, VictorOps, webhook, WeChat, Telegram, SNS, Webex&lt;/td&gt;
&lt;td&gt;All of the above plus Microsoft Teams, Discord, Google Chat, LINE, Threema, Grafana OnCall, and more via contact points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templating&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go templates in notification config&lt;/td&gt;
&lt;td&gt;Go templates with access to Grafana template variables and functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not built-in; achieved via separate instances or Mimir Alertmanager&lt;/td&gt;
&lt;td&gt;Native multi-tenancy via Grafana organizations and RBAC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gossip-based cluster (peer mesh, well-proven)&lt;/td&gt;
&lt;td&gt;Database-backed HA with peer discovery between Grafana instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single YAML file, fully declarative&lt;/td&gt;
&lt;td&gt;UI + API + provisioning YAML files, stored in database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent — config file lives in version control natively&lt;/td&gt;
&lt;td&gt;Possible via provisioning files or Terraform provider, but requires extra tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;External alert sources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any system that can POST to the Alertmanager API&lt;/td&gt;
&lt;td&gt;Supported via the Grafana Alerting API (external alerts can be pushed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available via Grafana Cloud (as Mimir Alertmanager) and Amazon Managed Service for Prometheus&lt;/td&gt;
&lt;td&gt;Available via Grafana Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Alertmanager Strengths
&lt;/h2&gt;

&lt;p&gt;Alertmanager has been a production staple since 2015. Over a decade of use across thousands of organizations has made it one of the most battle-tested components in the CNCF ecosystem. Here is where it genuinely excels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Declarative, GitOps-Native Configuration
&lt;/h3&gt;

&lt;p&gt;The entire Alertmanager configuration is a single YAML file. There is no hidden state in a database, no click-driven configuration that someone forgets to document. You check it into Git, review it in a pull request, and deploy it through your CI/CD pipeline like any other infrastructure code. This is a significant operational advantage for teams that have invested in GitOps.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# alertmanager.yml — everything in one file
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXX"

route:
  receiver: platform-team
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      group_wait: 10s
    - match_re:
        team: "^(payments|checkout)$"
      receiver: payments-slack
      continue: true

receivers:
  - name: platform-team
    slack_configs:
      - channel: "#platform-alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: ""
  - name: payments-slack
    slack_configs:
      - channel: "#payments-oncall"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every change is auditable. Rollbacks are a &lt;code&gt;git revert&lt;/code&gt; away. This matters enormously when you are debugging why an alert did not fire at 3 AM.&lt;/p&gt;
&lt;h3&gt;
  
  
  Lightweight and Single-Purpose
&lt;/h3&gt;

&lt;p&gt;Alertmanager does one thing: route and deliver notifications. It has no dashboard, no query engine, no datasource plugins. This single-purpose design makes it operationally simple. Resource consumption is minimal — a small Alertmanager instance handles thousands of active alerts on a few hundred megabytes of memory. It starts in milliseconds and requires almost no maintenance.&lt;/p&gt;
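&lt;p&gt;To make that concrete, an illustrative Kubernetes resource allocation for a small instance (the values are examples to size against, not official recommendations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative container resources for a small Alertmanager replica
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    memory: 256Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;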
&lt;h3&gt;
  
  
  Mature Inhibition and Routing
&lt;/h3&gt;

&lt;p&gt;Alertmanager’s inhibition rules are first-class citizens. You can suppress downstream warnings when a critical alert is already firing, preventing alert storms from overwhelming your on-call team. The hierarchical routing tree with &lt;code&gt;continue&lt;/code&gt; flags allows for nuanced delivery: send to the team channel AND escalate to PagerDuty simultaneously, with different grouping strategies at each level.&lt;/p&gt;
&lt;h3&gt;
  
  
  Proven High Availability
&lt;/h3&gt;

&lt;p&gt;The gossip-based HA cluster has been stable for years. Running three Alertmanager replicas behind a load balancer (or using Kubernetes service discovery) gives you reliable notification delivery without shared storage. The protocol handles deduplication across instances automatically, which is the hardest part of distributed alerting.&lt;/p&gt;
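&lt;p&gt;As a sketch, a three-replica cluster is wired together with the &lt;code&gt;--cluster.*&lt;/code&gt; flags; the peer addresses below are illustrative Kubernetes DNS names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Each replica lists every peer (including itself is harmless)
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-0.alertmanager:9094 \
  --cluster.peer=alertmanager-1.alertmanager:9094 \
  --cluster.peer=alertmanager-2.alertmanager:9094
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;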
&lt;h2&gt;
  
  
  Grafana Alerting Strengths
&lt;/h2&gt;

&lt;p&gt;Grafana Alerting has matured considerably since its rocky introduction in Grafana 8. By Grafana 11 and 12, it has become a legitimate production alerting platform with capabilities that Alertmanager cannot match on its own.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multi-Datasource Alert Rules
&lt;/h3&gt;

&lt;p&gt;This is Grafana Alerting’s strongest differentiator. You can write alert rules that query Loki for error log spikes, CloudWatch for AWS resource utilization, Elasticsearch for application errors, or a PostgreSQL database for business metrics — all from the same alerting system. If your observability stack includes more than just Prometheus, this eliminates the need for separate alerting tools per datasource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grafana alert rule provisioning example — alerting on Loki log errors&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;orgId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application-errors&lt;/span&gt;
    &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-error-spike&lt;/span&gt;
        &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;payment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;service"&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C&lt;/span&gt;
        &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loki-prod&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sum(rate({app="payment-service"}&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ERROR"&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[5m]))'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;B&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__expr__"&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;reduce&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A&lt;/span&gt;
              &lt;span class="na"&gt;reducer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;last&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;refId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;C&lt;/span&gt;
            &lt;span class="na"&gt;datasourceUid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__expr__"&lt;/span&gt;
            &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;threshold&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;B&lt;/span&gt;
              &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;evaluator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gt&lt;/span&gt;
                    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;10&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is something Alertmanager simply cannot do. Alertmanager only receives pre-evaluated alerts — it has no concept of datasources or query execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified UI for Alert Management
&lt;/h3&gt;

&lt;p&gt;Grafana provides a single pane of glass for alert rule creation, visualization, notification policy management, contact point configuration, and silence management. For teams where not every engineer is comfortable editing YAML routing trees, the visual notification policy editor significantly reduces the barrier to entry. You can see the state of every alert rule, its evaluation history, and the exact notification path it will take — all without leaving the browser.&lt;/p&gt;

&lt;h3&gt;
  
  
  Native Multi-Tenancy and RBAC
&lt;/h3&gt;

&lt;p&gt;Grafana’s organization model and role-based access control extend naturally to alerting. Different teams can manage their own alert rules, contact points, and notification policies within their organization or folder scope, without seeing or interfering with other teams. Achieving this with standalone Alertmanager requires either running separate instances per tenant or using Mimir’s multi-tenant Alertmanager.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mute Timings and Richer Scheduling
&lt;/h3&gt;

&lt;p&gt;While Alertmanager supports silences (ad-hoc, time-limited suppressions), Grafana Alerting adds mute timings — recurring time-based windows where notifications are suppressed. This is useful for scheduled maintenance windows, business-hours-only alerting, or suppressing non-critical alerts on weekends. Alertmanager requires external tooling or manual silence creation for recurring windows.&lt;/p&gt;
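&lt;p&gt;A sketch of a recurring weekend window using Grafana's file provisioning (key names follow the provisioning format of recent Grafana versions; the schedule shown is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Grafana provisioning — mute timing for weekend quiet hours
apiVersion: 1
muteTimes:
  - orgId: 1
    name: weekend-quiet-hours
    time_intervals:
      - weekdays: ['saturday', 'sunday']
        times:
          - start_time: '00:00'
            end_time: '08:00'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;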

&lt;h3&gt;
  
  
  Grafana Cloud as a Managed Option
&lt;/h3&gt;

&lt;p&gt;For teams that want to avoid managing alerting infrastructure entirely, Grafana Cloud provides a fully managed Grafana Alerting stack. This includes HA, state persistence, and notification delivery without any self-hosted components. The Grafana Cloud alerting stack also includes a managed Mimir Alertmanager, which means you can use Prometheus-native alerting rules if you prefer that model while still benefiting from the managed infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Prometheus Alertmanager
&lt;/h2&gt;

&lt;p&gt;Alertmanager is the right choice when the following conditions describe your environment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your metrics stack is Prometheus-native.&lt;/strong&gt; If all your alert rules are PromQL expressions evaluated by Prometheus, Thanos Ruler, or Mimir Ruler, Alertmanager is the natural fit. There is no added value in routing those alerts through Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitOps is non-negotiable.&lt;/strong&gt; If every infrastructure change must go through a pull request and be fully declarative, Alertmanager’s single-file configuration model is significantly easier to manage than Grafana’s database-backed state. Tools like &lt;code&gt;amtool&lt;/code&gt; provide config validation in CI pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need fine-grained routing with inhibition.&lt;/strong&gt; Complex routing trees with multiple levels of grouping, inhibition rules, and &lt;code&gt;continue&lt;/code&gt; flags are more naturally expressed in Alertmanager’s YAML format. The routing logic has been stable and well-documented for years.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You run microservices with per-team routing.&lt;/strong&gt; If each team owns its routing subtree and the routing logic is complex, Alertmanager’s hierarchical model scales better than UI-driven configuration. Teams can own their section of the config file via CODEOWNERS in Git.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You want minimal operational overhead.&lt;/strong&gt; Alertmanager is a single binary with minimal resource requirements. There is no database to back up, no migrations to run, and no UI framework to keep updated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
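&lt;p&gt;The &lt;code&gt;amtool&lt;/code&gt; validation mentioned above fits naturally into a CI step; a minimal sketch (file paths and label values are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fail the pipeline on an invalid config
amtool check-config alertmanager.yml

# Dry-run the routing tree: which receiver would this alert reach?
amtool config routes test --config.file=alertmanager.yml \
  severity=critical team=payments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;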

&lt;h2&gt;
  
  
  When to Use Grafana Alerting
&lt;/h2&gt;

&lt;p&gt;Grafana Alerting is the right choice when these conditions apply:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You alert on more than just Prometheus metrics.&lt;/strong&gt; If you need alert rules based on Loki logs, Elasticsearch queries, CloudWatch metrics, or database queries, Grafana Alerting is the only option that handles all of these natively. The alternative is running separate alerting tools per datasource, which is worse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your team prefers UI-driven configuration.&lt;/strong&gt; Not every engineer wants to edit YAML routing trees. If your organization values a visual interface for managing alerts, contact points, and notification policies, Grafana’s UI is a major productivity advantage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You are using Grafana Cloud.&lt;/strong&gt; If you are already on Grafana Cloud, using its built-in alerting is the path of least resistance. You get HA, managed notification delivery, and a unified experience without running any additional infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-tenancy is a requirement.&lt;/strong&gt; If multiple teams need isolated alerting configurations with RBAC, Grafana’s native organization and folder-based access model is significantly easier to set up than running per-tenant Alertmanager instances.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You want mute timings for recurring maintenance windows.&lt;/strong&gt; If your team regularly needs to suppress alerts during scheduled windows (deploy windows, batch processing hours, weekend non-critical suppression), Grafana’s mute timings feature is more ergonomic than creating and managing recurring silences in Alertmanager.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running Both Together: The Hybrid Pattern
&lt;/h2&gt;

&lt;p&gt;In practice, many production environments run both Alertmanager and Grafana Alerting. This is not necessarily a mistake — it can be a deliberate architectural choice when done with clear boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Hybrid Architecture
&lt;/h3&gt;

&lt;p&gt;The most common pattern looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prometheus Alertmanager&lt;/strong&gt; handles all metric-based alerts. PromQL rules are evaluated by Prometheus or a long-term storage ruler (Thanos, Mimir). Alertmanager owns routing, grouping, and notification for these alerts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grafana Alerting&lt;/strong&gt; handles non-Prometheus alerts: log-based alerts from Loki, business metrics from SQL datasources, and cross-datasource correlation rules.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key to making this work without chaos is establishing clear ownership rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ownership boundaries for hybrid alerting
#
# Prometheus Alertmanager owns:
#   - All PromQL-based alert rules
#   - Infrastructure alerts (node, kubelet, etcd, CoreDNS)
#   - Application SLO/SLI alerts based on metrics
#
# Grafana Alerting owns:
#   - Log-based alert rules (Loki, Elasticsearch)
#   - Business metric alerts (SQL datasources)
#   - Cross-datasource correlation rules
#   - Alerts for teams that prefer UI-driven management
#
# Shared:
#   - Contact points / receivers use the same Slack channels and PagerDuty services
#   - On-call rotations are managed externally (PagerDuty, Grafana OnCall)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both systems can deliver to the same notification channels. The critical discipline is ensuring that silencing and maintenance windows are applied in both systems when needed. This is the primary operational cost of the hybrid approach.&lt;/p&gt;
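&lt;p&gt;In practice that means a maintenance silence is created twice — once per system. A sketch, with hostnames, tokens, and matcher values all illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Silence in Alertmanager via amtool
amtool silence add service=checkout \
  --duration=2h --comment="DB maintenance" --author=alex \
  --alertmanager.url=http://alertmanager:9093

# The same silence in Grafana's embedded Alertmanager,
# via its Alertmanager-compatible API
curl -X POST "http://grafana:3000/api/alertmanager/grafana/api/v2/silences" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"matchers":[{"name":"service","value":"checkout","isRegex":false}],"startsAt":"2026-04-08T12:00:00Z","endsAt":"2026-04-08T14:00:00Z","comment":"DB maintenance","createdBy":"alex"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;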

&lt;h3&gt;
  
  
  Grafana as a Viewer for Alertmanager
&lt;/h3&gt;

&lt;p&gt;Even if you use Alertmanager exclusively for routing and notification, Grafana can serve as a read-only viewer. Grafana natively supports connecting to an external Alertmanager datasource, allowing you to see firing alerts, active silences, and alert groups in the Grafana UI. This gives you the operational visibility of Grafana without moving your alerting logic into it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Grafana datasource provisioning for external Alertmanager&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;datasources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Alertmanager&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alertmanager&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://alertmanager.monitoring.svc:9093&lt;/span&gt;
    &lt;span class="na"&gt;access&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxy&lt;/span&gt;
    &lt;span class="na"&gt;jsonData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;implementation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Migration Considerations
&lt;/h2&gt;

&lt;p&gt;If you are moving from one system to the other, here are the practical considerations to plan for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating from Alertmanager to Grafana Alerting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule conversion.&lt;/strong&gt; Your PromQL-based recording and alerting rules defined in Prometheus rule files need to be recreated as Grafana alert rules. Grafana provides a migration tool that can import Prometheus-format rules, but complex expressions may need manual adjustment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Routing tree translation.&lt;/strong&gt; Alertmanager’s hierarchical routing tree maps to Grafana’s notification policies, but the semantics are not identical. Test the notification routing thoroughly — the &lt;code&gt;continue&lt;/code&gt; flag behavior and default routes may differ.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silence and inhibition migration.&lt;/strong&gt; Active silences are ephemeral and do not need migration. Inhibition rules need to be recreated in Grafana’s format. Recurring maintenance windows should be converted to mute timings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run in parallel first.&lt;/strong&gt; The safest migration strategy is to run both systems in parallel for two to four weeks, sending notifications from both, then cutting over when you have confidence in the Grafana setup. Accept the temporary noise of duplicate alerts — it is far cheaper than missing a critical page during migration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Migrating from Grafana Alerting to Alertmanager
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Datasource limitation.&lt;/strong&gt; You can only migrate alerts that are based on Prometheus-compatible datasources. Alerts querying Loki, Elasticsearch, or SQL datasources have no equivalent in Alertmanager — you will need an alternative solution for those.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rule export.&lt;/strong&gt; Export Grafana alert rules and convert them to Prometheus-format rule files. The Grafana API (&lt;code&gt;GET /api/v1/provisioning/alert-rules&lt;/code&gt;) provides structured output that can be transformed with a script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contact point mapping.&lt;/strong&gt; Map Grafana contact points to Alertmanager receivers. The configuration format is different, but the concepts are equivalent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;State loss.&lt;/strong&gt; Alertmanager does not carry over Grafana’s alert evaluation history. You start fresh. Plan for a brief period where alerts may re-fire as Prometheus evaluates rules that were previously managed by Grafana.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
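&lt;p&gt;The rule export described above is a single authenticated request; a sketch (host and token are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dump all provisioned alert rules as JSON for conversion scripting
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://grafana.example.com/api/v1/provisioning/alert-rules \
  | jq '.[].title'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;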

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;p&gt;If you want a quick decision path, use this framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start here:
│
├── Do you alert on non-Prometheus datasources (Loki, ES, SQL, CloudWatch)?
│   ├── YES → Grafana Alerting (at least for those datasources)
│   └── NO ↓
│
├── Is GitOps/declarative config a hard requirement?
│   ├── YES → Alertmanager
│   └── NO ↓
│
├── Do you need multi-tenancy with RBAC?
│   ├── YES → Grafana Alerting (or Mimir Alertmanager)
│   └── NO ↓
│
├── Are you on Grafana Cloud?
│   ├── YES → Grafana Alerting (path of least resistance)
│   └── NO ↓
│
└── Default → Alertmanager (simpler, lighter, well-proven)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For many teams, the honest answer is “both” — Alertmanager for the Prometheus-native metric pipeline, Grafana Alerting for everything else. That is a valid architecture as long as the ownership boundaries are documented and the on-call team knows where to look.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between Alertmanager and Grafana Alerting?
&lt;/h3&gt;

&lt;p&gt;Prometheus Alertmanager is a standalone notification routing engine that receives pre-evaluated alerts from Prometheus and delivers them to receivers like Slack, PagerDuty, or email. It does not evaluate alert rules itself. Grafana Alerting is an integrated alerting platform embedded in Grafana that both evaluates alert rules (querying any supported datasource) and handles notification routing. Alertmanager is configured entirely via YAML, while Grafana Alerting offers a UI, API, and file-based provisioning. The fundamental difference is scope: Alertmanager handles only the routing and notification phase, while Grafana Alerting handles the full lifecycle from query evaluation to notification.&lt;/p&gt;
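&lt;p&gt;A minimal Alertmanager configuration shows that routing-only scope in practice. The receiver names and placeholder URLs below are illustrative, not part of any standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# alertmanager.yml -- routing and notification only, no rule evaluation
route:
  receiver: slack-default
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/EXAMPLE'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'EXAMPLE-ROUTING-KEY'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;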

&lt;h3&gt;
  
  
  Can Grafana Alerting replace Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;Yes, for many use cases. Grafana Alerting can evaluate PromQL rules directly against your Prometheus datasource, so you do not strictly need a separate Alertmanager instance. However, there are scenarios where Alertmanager remains the better choice: heavily GitOps-driven environments, teams that need Alertmanager’s mature inhibition rules, or architectures where Prometheus rule evaluation happens externally (Thanos Ruler, Mimir Ruler) and a dedicated Alertmanager is already in the pipeline. If your only datasource is Prometheus and you value declarative configuration, Alertmanager is still simpler and lighter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Grafana Alertmanager the same as Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;Not exactly. Grafana Alerting uses a fork of the Prometheus Alertmanager code internally for its notification routing engine, but it is not the same product. The Grafana “Alertmanager” you see in the UI is a managed, embedded component with a different configuration interface (notification policies, contact points, mute timings) compared to the standalone Prometheus Alertmanager (routing tree, receivers, inhibition rules in YAML). Grafana can also connect to an external Prometheus Alertmanager as a datasource, which adds to the confusion. When people refer to “Grafana Alertmanager,” they usually mean the embedded routing engine inside Grafana Alerting.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the best alternatives to Prometheus Alertmanager?
&lt;/h3&gt;

&lt;p&gt;The most direct alternative is Grafana Alerting, which can receive and route Prometheus alerts while also supporting other datasources. Beyond that, other options include: &lt;strong&gt;Grafana OnCall&lt;/strong&gt; for on-call management and escalation (often used alongside Alertmanager rather than replacing it), &lt;strong&gt;PagerDuty&lt;/strong&gt; or &lt;strong&gt;Opsgenie&lt;/strong&gt; as managed incident response platforms that can receive alerts directly, &lt;strong&gt;Keep&lt;/strong&gt; as an open-source AIOps alert management platform, and &lt;strong&gt;Mimir Alertmanager&lt;/strong&gt; for multi-tenant environments running Grafana Mimir. The choice depends on whether you need an Alertmanager replacement (routing and notification) or a complementary tool for escalation and incident response.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use Prometheus alerts or Grafana alerts for Kubernetes monitoring?
&lt;/h3&gt;

&lt;p&gt;For Kubernetes monitoring specifically, the kube-prometheus-stack (which includes Prometheus, Alertmanager, and a comprehensive set of pre-built alerting rules) remains the industry standard. These rules are PromQL-based and are designed to work with Alertmanager. If you are deploying kube-prometheus-stack, using Alertmanager for metric-based alerts is the straightforward choice. Add Grafana Alerting on top if you also need to alert on logs (via Loki) or non-metric datasources. For Kubernetes-specific monitoring, the combination of Prometheus rules with Alertmanager for routing is the most mature and well-supported path.&lt;/p&gt;
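&lt;p&gt;With kube-prometheus-stack, custom metric alerts are added as &lt;code&gt;PrometheusRule&lt;/code&gt; objects that the Prometheus Operator discovers and routes through Alertmanager. The discovery label depends on your Helm release name, and the rule below is a hypothetical example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-app-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the operator's ruleSelector
spec:
  groups:
    - name: custom-app
      rules:
        - alert: PodRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total[1h]) &amp;gt; 5
          for: 15m
          labels:
            severity: warning

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;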

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The Alertmanager vs Grafana Alerting debate is not really about which tool is better — it is about which tool fits your operational context. Alertmanager is simpler, lighter, and more GitOps-friendly. Grafana Alerting is more versatile, more accessible to UI-oriented teams, and the only option if you need multi-datasource alerting. Running both is perfectly valid when the boundaries are clear.&lt;/p&gt;

&lt;p&gt;The worst outcome is not picking the “wrong” tool. The worst outcome is running both accidentally, with overlapping coverage, duplicated notifications, and no clear ownership. Whatever you choose, document the decision, define the ownership boundaries, and make sure your on-call team knows exactly where to go when they need to silence an alert at 3 AM.&lt;/p&gt;

</description>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Gateway API Provider Support in 2026: A Critical Evaluation</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:00:49 +0000</pubDate>
      <link>https://forem.com/alexandrev/gateway-api-provider-support-in-2026-a-critical-evaluation-pc1</link>
      <guid>https://forem.com/alexandrev/gateway-api-provider-support-in-2026-a-critical-evaluation-pc1</guid>
      <description>&lt;p&gt;The &lt;a href="https://alexandre-vazquez.com/kubernetes-gateway-api-versions-compatibility-guide/" rel="noopener noreferrer"&gt;Kubernetes Gateway API&lt;/a&gt; is no longer a future concept—it’s the present standard for traffic management. With the &lt;a href="https://www.reddit.com/r/kubernetes/comments/1qw534a/understanding%5Fthe%5Fingressnginx%5Fdeprecation%5Fbefore/" rel="noopener noreferrer"&gt;deprecation of Ingress NGINX’s stable APIs&lt;/a&gt; signaling a definitive shift, platform teams and architects are now faced with a critical decision: which Gateway API provider to adopt. The official &lt;a href="https://gateway-api.sigs.k8s.io/implementations/" rel="noopener noreferrer"&gt;implementations page&lt;/a&gt; lists numerous options, but the real-world picture is one of fragmented support, varying stability, and significant gaps that can derail multi-cluster strategies.&lt;/p&gt;

&lt;p&gt;In this evaluation, we move beyond marketing checklists to analyze the practical state of Gateway API support across major cloud providers, ingress controllers, and service meshes. We’ll examine which versions are truly production-ready, where the interoperability pitfalls lie, and what you must account for before standardizing across your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gateway API Maturity Spectrum: From Experimental to Standard
&lt;/h2&gt;

&lt;p&gt;Not all Gateway API resources are created equal. The API’s &lt;a href="https://alexandre-vazquez.com/kubernetes-gateway-api-versions-compatibility-guide/" rel="noopener noreferrer"&gt;unique versioning model&lt;/a&gt;—with features progressing through Experimental, Standard, and Extended support tracks—means provider support is inherently uneven. An implementation might fully support the stable &lt;code&gt;Gateway&lt;/code&gt; and &lt;code&gt;HTTPRoute&lt;/code&gt; resources while offering only partial or experimental backing for &lt;code&gt;GRPCRoute&lt;/code&gt; or &lt;code&gt;TCPRoute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This creates a fundamental challenge for architects: designing for the lowest common denominator or accepting provider-specific constraints. The decision hinges on accurately mapping your traffic management requirements (HTTP, TLS termination, gRPC, TCP/UDP load balancing) against what each provider actually delivers in a stable form.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core API Support: The Foundation
&lt;/h3&gt;

&lt;p&gt;Most providers now support the v1 (GA) versions of the foundational resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GatewayClass &amp;amp; Gateway:&lt;/strong&gt; Nearly universal support for v1. These are the control plane resources for provisioning and configuring load balancers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTPRoute:&lt;/strong&gt; Universal support for v1. This is the workhorse for HTTP/HTTPS traffic routing and is considered the most stable.&lt;/li&gt;
&lt;/ul&gt;
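&lt;p&gt;A minimal pairing of these two stable resources looks the same on any conformant implementation; only the &lt;code&gt;gatewayClassName&lt;/code&gt; is provider-specific (the value below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public-web
spec:
  gatewayClassName: example-class   # provider-specific value
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route
spec:
  parentRefs:
    - name: public-web
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;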

&lt;p&gt;However, support for other route types reveals the fragmentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GRPCRoute:&lt;/strong&gt; Often in beta or experimental stages. Critical for modern microservices architectures but not yet universally reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCPRoute &amp;amp; UDPRoute:&lt;/strong&gt; Patchy support. Some providers implement them as beta, others ignore them entirely, forcing fallbacks to provider-specific annotations or custom resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TLSRoute:&lt;/strong&gt; Frequently tied to specific certificate management integrations (e.g., cert-manager).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Major Provider Deep Dive: Implementation Realities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Elastic Kubernetes Service (EKS)
&lt;/h3&gt;

&lt;p&gt;AWS offers an &lt;a href="https://www.eksworkshop.com/docs/networking/vpc-lattice/gateway-api-controller" rel="noopener noreferrer"&gt;official Gateway API controller&lt;/a&gt; for EKS. Its support is pragmatic but currently limited:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Supported Resources:&lt;/strong&gt; &lt;code&gt;GatewayClass&lt;/code&gt;, &lt;code&gt;Gateway&lt;/code&gt;, &lt;code&gt;HTTPRoute&lt;/code&gt;, and &lt;code&gt;GRPCRoute&lt;/code&gt; (all still at v1beta1 at the time of writing). Note the use of v1beta1 for GRPCRoute, indicating it’s not yet at GA stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Underlying Infrastructure:&lt;/strong&gt; Maps directly to AWS Application Load Balancer (ALB) and Network Load Balancer (NLB). This is a strength (managed AWS services) and a constraint (you inherit ALB/NLB feature limits).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Critical Gap:&lt;/strong&gt; No support for &lt;code&gt;TCPRoute&lt;/code&gt; or &lt;code&gt;UDPRoute&lt;/code&gt;. If your workload requires raw TCP/UDP load balancing, you must use the legacy Kubernetes Service type &lt;code&gt;LoadBalancer&lt;/code&gt; or a different ingress controller alongside the Gateway API controller, creating a disjointed management model.&lt;/li&gt;
&lt;/ul&gt;
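&lt;p&gt;In practice, the TCP fallback on EKS is a classic &lt;code&gt;LoadBalancer&lt;/code&gt; Service, usually annotated for the AWS Load Balancer Controller to provision an NLB. The service name and port below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: tcp-broker
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  type: LoadBalancer
  selector:
    app: broker
  ports:
    - protocol: TCP
      port: 5672
      targetPort: 5672

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;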

&lt;h3&gt;
  
  
  Google Kubernetes Engine (GKE) &amp;amp; Azure Kubernetes Service (AKS)
&lt;/h3&gt;

&lt;p&gt;Both Google and Azure have integrated Gateway API support directly into their managed Kubernetes offerings, often with a focus on their global load-balancing infrastructures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GKE:&lt;/strong&gt; Offers the GKE Gateway controller. It supports v1 resources and can provision Google Cloud Global External Load Balancers. Its integration with Google’s certificate management and CDN is a key advantage. However, advanced routing features may require GCP-specific backend configs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AKS:&lt;/strong&gt; Provides Gateway API support through Application Gateway for Containers, the successor to the Application Gateway Ingress Controller (AGIC), mapping to Azure’s managed application load balancing. Support for newer route types like &lt;code&gt;GRPCRoute&lt;/code&gt; has historically lagged behind other providers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern here is clear: cloud providers implement the Gateway API as a facade over their existing, proprietary load-balancing products. This ensures stability and performance but can limit portability and advanced cross-provider features.&lt;/p&gt;

&lt;h3&gt;
  
  
  NGINX &amp;amp; Kong Ingress Controller
&lt;/h3&gt;

&lt;p&gt;These third-party, cluster-based controllers offer a different value proposition: consistency across any Kubernetes distribution, including on-premises.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX:&lt;/strong&gt; With its stable Ingress APIs deprecated in favor of Gateway API, its Gateway API implementation is now the primary path forward. It generally has excellent support for the full range of experimental and standard resources, as it’s not constrained by a cloud vendor’s underlying service. This makes it a strong choice for hybrid or multi-cloud deployments where feature parity is crucial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong Ingress Controller:&lt;/strong&gt; Kong has been an early and comprehensive supporter of the Gateway API, often implementing features quickly. It leverages Kong Gateway’s extensive plugin ecosystem, which can be a major draw but also introduces vendor lock-in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Critical Gaps for Enterprise Architects
&lt;/h2&gt;

&lt;p&gt;Beyond checking resource support boxes, several deeper gaps can impact production deployments, especially in complex environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi-Cluster &amp;amp; Hybrid Environment Support
&lt;/h3&gt;

&lt;p&gt;The Gateway API specification includes concepts like &lt;code&gt;ReferenceGrant&lt;/code&gt; for cross-namespace and future cross-cluster routing. In practice, very few providers have robust, production-ready multi-cluster stories. Most implementations assume a single cluster. If your architecture spans multiple clusters (for isolation, geography, or failure domains), you will likely need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage separate &lt;code&gt;Gateway&lt;/code&gt; resources per cluster.&lt;/li&gt;
&lt;li&gt;Use an external global load balancer (such as cloud DNS/GSLB) to distribute traffic across cluster-specific gateways.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This negates some of the API’s promise of a unified, abstracted configuration.&lt;/p&gt;
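&lt;p&gt;The cross-namespace piece, at least, works today: a &lt;code&gt;ReferenceGrant&lt;/code&gt; in the target namespace explicitly allows routes from another namespace to reference its backends. The namespaces and names below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-web-routes
  namespace: backend          # namespace that owns the Services
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: web          # routes here may reference backends below
  to:
    - group: ""
      kind: Service

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;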

&lt;h3&gt;
  
  
  2. Policy Attachment and Extension Consistency
&lt;/h3&gt;

&lt;p&gt;Gateway API is designed to be extended through policy attachment (e.g., for rate limiting, WAF rules, authentication). There is no standard for &lt;em&gt;how&lt;/em&gt; these policies are implemented. One provider might use a custom &lt;code&gt;RateLimitPolicy&lt;/code&gt; CRD, while another might rely on annotations or a separate policy engine. This creates massive configuration drift and vendor lock-in, breaking the portability goal.&lt;/p&gt;
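&lt;p&gt;The attachment idiom itself is fairly consistent even when the CRDs are not: a vendor policy targets a Gateway API object through a &lt;code&gt;targetRef&lt;/code&gt;. The &lt;code&gt;RateLimitPolicy&lt;/code&gt; below is a hypothetical CRD shown only to illustrate the pattern, not a standard resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: policies.example.io/v1alpha1   # vendor-specific API group
kind: RateLimitPolicy
metadata:
  name: api-rate-limit
spec:
  targetRef:                               # the common attachment idiom
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: web-route
  limit:
    requests: 100
    window: 1m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;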

&lt;h3&gt;
  
  
  3. Observability and Debugging Interfaces
&lt;/h3&gt;

&lt;p&gt;While the API defines status fields, the richness of operational data—detailed error logs, granular metrics tied to API resources, distributed tracing integration—varies wildly. Some providers expose deep integration with their monitoring stack; others offer minimal visibility. You must verify that the provider’s observability model meets your SRE team’s needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Framework: Questions for Your Team
&lt;/h2&gt;

&lt;p&gt;Before selecting a provider, work through this technical checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Route Requirements:&lt;/strong&gt; Do we need stable support for HTTP only, or also gRPC, TCP, UDP? Is beta support acceptable for non-HTTP routes?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Model:&lt;/strong&gt; Do we want a cloud-managed load balancer (simpler, less control) or a cluster-based controller (more portable, more operational overhead)?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Cluster Future:&lt;/strong&gt; Is our architecture single-cluster today but likely to expand? Does the provider have a credible roadmap for multi-cluster Gateway API?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Policy Needs:&lt;/strong&gt; What advanced policies (auth, WAF, rate limiting) are required? How does the provider implement them? Can we live with vendor-specific policy CRDs?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe &amp;amp; Debug:&lt;/strong&gt; What logging, metrics, and tracing are exposed for Gateway API resources? Do they integrate with our existing observability platform?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upgrade Path:&lt;/strong&gt; What is the provider’s track record for supporting new Gateway API releases? How painful are version upgrades?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Strategic Recommendations
&lt;/h2&gt;

&lt;p&gt;Based on the current landscape, here are pragmatic paths forward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For Single-Cloud Deployments:&lt;/strong&gt; Start with your cloud provider’s native controller (AWS, GKE, AKS). It’s the path of least resistance and best integration with other cloud services (IAM, certificates, monitoring). Just be acutely aware of its specific limitations regarding unsupported route types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Hybrid/Multi-Cloud or On-Premises:&lt;/strong&gt; Standardize on a portable, cluster-based controller such as NGINX’s Gateway API implementation or Kong. The consistency across environments will save significant operational complexity, even if it means forgoing some cloud-native integrations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Greenfield Projects:&lt;/strong&gt; Design your applications and configurations against the stable v1 resources (&lt;code&gt;Gateway&lt;/code&gt;, &lt;code&gt;HTTPRoute&lt;/code&gt;) only. Treat any use of beta/experimental resources as a known risk that may require refactoring later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always Have an Exit Plan:&lt;/strong&gt; Isolate Gateway API configuration YAMLs from provider-specific policies and annotations. This modularity will make migration less painful when the next generation of providers emerges or when you need to switch.&lt;/li&gt;
&lt;/ul&gt;
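&lt;p&gt;One way to enforce that separation is at the repository level, keeping portable Gateway API manifests apart from provider-specific overlays. The directory names below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gateway-config/
├── base/                  # portable, standard v1 resources only
│   ├── gateway.yaml
│   └── httproute.yaml
└── overlays/
    ├── aws/               # provider policies, annotations, extra CRDs
    │   └── policies.yaml
    └── kong/
        └── plugins.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;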

&lt;p&gt;The Gateway API’s evolution is a net positive for the Kubernetes ecosystem, offering a far more expressive model than the original Ingress. However, in 2026, the provider landscape is still maturing. Support is broad but not deep, and critical gaps in multi-cluster management and policy portability remain. The successful architect will choose a provider not based on a feature checklist, but based on how well its specific constraints and capabilities align with their organization’s immediate traffic patterns and long-term platform strategy. The era of a universal, write-once-run-anywhere Gateway API configuration is not yet here—but with careful, informed provider selection, you can build a robust foundation for it.&lt;/p&gt;

</description>
      <category>uncategorized</category>
    </item>
    <item>
      <title>Kubernetes Housekeeping: How to Clean Up Orphaned ConfigMaps and Secrets</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Tue, 10 Feb 2026 13:00:20 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-housekeeping-how-to-clean-up-orphaned-configmaps-and-secrets-28hb</link>
      <guid>https://forem.com/alexandrev/kubernetes-housekeeping-how-to-clean-up-orphaned-configmaps-and-secrets-28hb</guid>
      <description>&lt;p&gt;If you’ve been running Kubernetes clusters for any meaningful amount of time, you’ve likely encountered a familiar problem: orphaned ConfigMaps and Secrets piling up in your namespaces. These abandoned resources don’t just clutter your cluster—they introduce security risks, complicate troubleshooting, and can even impact cluster performance as your resource count grows.&lt;/p&gt;

&lt;p&gt;The reality is that Kubernetes doesn’t automatically clean up ConfigMaps and Secrets when the workloads that reference them are deleted. This gap in Kubernetes’ native garbage collection creates a housekeeping problem that every production cluster eventually faces. In this article, we’ll explore why orphaned resources happen, how to detect them, and most importantly, how to implement sustainable cleanup strategies that prevent them from accumulating in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Orphaned Resource Problem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What Are Orphaned ConfigMaps and Secrets?
&lt;/h3&gt;

&lt;p&gt;Orphaned ConfigMaps and Secrets are configuration resources that no longer have any active references from Pods, Deployments, StatefulSets, or other workload resources in your cluster. They typically become orphaned when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applications are updated and new ConfigMaps are created while old ones remain&lt;/li&gt;
&lt;li&gt;Deployments are deleted but their associated configuration resources aren’t&lt;/li&gt;
&lt;li&gt;Failed rollouts leave behind unused configuration versions&lt;/li&gt;
&lt;li&gt;Development and testing workflows create temporary resources that never get cleaned up&lt;/li&gt;
&lt;li&gt;CI/CD pipelines generate unique ConfigMap names (often with hash suffixes) on each deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Matters for Production Clusters
&lt;/h3&gt;

&lt;p&gt;While a few orphaned ConfigMaps might seem harmless, the problem compounds over time and introduces real operational challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Risks&lt;/strong&gt;: Orphaned Secrets can contain outdated credentials, API keys, or certificates that should no longer be accessible. If these aren’t removed, they remain attack vectors for unauthorized access—especially problematic if RBAC policies grant broad read access to Secrets within a namespace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster Bloat&lt;/strong&gt;: Kubernetes stores these resources in etcd, your cluster’s backing store. As the number of orphaned resources grows, etcd size increases, potentially impacting cluster performance and backup times. In extreme cases, this can contribute to etcd performance degradation or even hit storage quotas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Complexity&lt;/strong&gt;: When troubleshooting issues or reviewing configurations, sifting through dozens of unused ConfigMaps makes it harder to identify which resources are actually in use. This “configuration noise” slows down incident response and increases cognitive load for your team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Implications&lt;/strong&gt;: While individual ConfigMaps are small, at scale they contribute to storage costs and can trigger alerts in cost monitoring systems, especially in multi-tenant environments where resource quotas matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting Orphaned ConfigMaps and Secrets
&lt;/h2&gt;

&lt;p&gt;Before you can clean up orphaned resources, you need to identify them. Let’s explore both manual detection methods and automated tooling approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual Detection with kubectl
&lt;/h3&gt;

&lt;p&gt;The simplest approach uses kubectl to cross-reference ConfigMaps and Secrets against active workload resources. Here’s a basic script to identify potentially orphaned ConfigMaps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
# detect-orphaned-configmaps.sh
# Identifies ConfigMaps not referenced by any active Pods

NAMESPACE=${1:-default}

echo "Checking for orphaned ConfigMaps in namespace: $NAMESPACE"
echo "---"

# Get all ConfigMaps in the namespace
CONFIGMAPS=$(kubectl get configmaps -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for cm in $CONFIGMAPS; do
    # Skip kube-root-ca.crt as it's system-managed
    if [[ "$cm" == "kube-root-ca.crt" ]]; then
        continue
    fi

    # Check if any Pod references this ConfigMap
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg cm "$cm" '.items[] |
        select(
            (.spec.volumes[]?.configMap.name == $cm) or
            (.spec.containers[].env[]?.valueFrom.configMapKeyRef.name == $cm) or
            (.spec.containers[].envFrom[]?.configMapRef.name == $cm)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $cm"
    fi
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A similar script for Secrets would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
# detect-orphaned-secrets.sh

NAMESPACE=${1:-default}

echo "Checking for orphaned Secrets in namespace: $NAMESPACE"
echo "---"

SECRETS=$(kubectl get secrets -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for secret in $SECRETS; do
    # Skip service account tokens and system secrets
    SECRET_TYPE=$(kubectl get secret $secret -n $NAMESPACE -o jsonpath='{.type}')
    if [[ "$SECRET_TYPE" == "kubernetes.io/service-account-token" ]]; then
        continue
    fi

    # Check if any Pod references this Secret
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg secret "$secret" '.items[] |
        select(
            (.spec.volumes[]?.secret.secretName == $secret) or
            (.spec.containers[].env[]?.valueFrom.secretKeyRef.name == $secret) or
            (.spec.containers[].envFrom[]?.secretRef.name == $secret) or
            (.spec.imagePullSecrets[]?.name == $secret)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $secret"
    fi
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: These scripts only check currently running Pods. They won’t catch ConfigMaps or Secrets referenced by Deployments, StatefulSets, or DaemonSets that might currently have zero replicas. For production use, you’ll want to check against all workload resource types.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Detection with Specialized Tools
&lt;/h3&gt;

&lt;p&gt;Several open-source tools have emerged to solve this problem more comprehensively:&lt;/p&gt;

&lt;h4&gt;
  
  
  Kor: Comprehensive Unused Resource Detection
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/yonahd/kor" rel="noopener noreferrer"&gt;Kor&lt;/a&gt; is a purpose-built tool for finding unused resources across your Kubernetes cluster. It checks not just ConfigMaps and Secrets, but also PVCs, Services, and other resource types.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Kor
brew install kor

# Scan for unused ConfigMaps and Secrets
kor all --namespace production --output json

# Check specific resource types
kor configmap --namespace production
kor secret --namespace production --exclude-namespaces kube-system,kube-public

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kor works by analyzing resource relationships and identifying anything without dependent objects. It’s particularly effective because it understands Kubernetes resource hierarchies and checks against Deployments, StatefulSets, and DaemonSets—not just running Pods.&lt;/p&gt;

&lt;h4&gt;
  
  
  Popeye: Cluster Sanitization Reports
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://github.com/derailed/popeye" rel="noopener noreferrer"&gt;Popeye&lt;/a&gt; scans your cluster and generates reports on resource health, including orphaned resources. While broader in scope than just ConfigMap cleanup, it provides valuable context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install Popeye
brew install derailed/popeye/popeye

# Scan cluster
popeye --output json --save

# Focus on specific namespace
popeye --namespace production

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Custom Controllers with Kubernetes APIs
&lt;/h4&gt;

&lt;p&gt;For more sophisticated detection, you can build custom controllers using client-go that continuously monitor for orphaned resources. This approach works well when integrated with your existing observability stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Pseudocode example
func detectOrphanedConfigMaps(namespace string) []string {
    configMaps := listConfigMaps(namespace)
    deployments := listDeployments(namespace)
    statefulSets := listStatefulSets(namespace)
    daemonSets := listDaemonSets(namespace)

    referenced := make(map[string]bool)

    // Check all workload types for ConfigMap references
    for _, deploy := range deployments {
        for _, cm := range getReferencedConfigMaps(deploy) {
            referenced[cm] = true
        }
    }
    // ... repeat for other workload types

    orphaned := []string{}
    for _, cm := range configMaps {
        if !referenced[cm.Name] {
            orphaned = append(orphaned, cm.Name)
        }
    }

    return orphaned
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prevention Strategies: Stop Orphans Before They Start
&lt;/h2&gt;

&lt;p&gt;The best cleanup strategy is prevention. By implementing proper resource management patterns from the beginning, you can minimize orphaned resources in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Owner References for Automatic Cleanup
&lt;/h3&gt;

&lt;p&gt;Kubernetes provides a built-in mechanism for resource lifecycle management through &lt;strong&gt;owner references&lt;/strong&gt;. When properly configured, child resources are automatically deleted when their owner is removed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      uid: d9607e19-f88f-11e6-a518-42010a800195
      controller: true
      blockOwnerDeletion: true
data:
  app.properties: |
    database.url=postgres://db:5432

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the &lt;code&gt;uid&lt;/code&gt; must match the live owner object, owner references are normally set programmatically by controllers and operators rather than hard-coded in manifests. Deployment tooling that tracks ownership declaratively, such as ArgoCD with pruning enabled, achieves a similar cleanup effect, which is one reason GitOps workflows tend to have fewer orphaned resources than imperative deployment approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement Consistent Labeling Standards
&lt;/h3&gt;

&lt;p&gt;Labels make it much easier to identify resource relationships and track ownership:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: api-gateway-config-v2
  labels:
    app: api-gateway
    component: configuration
    version: v2
    managed-by: argocd
    owner: platform-team
data:
  config.yaml: |
    # configuration here

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With consistent labeling, you can easily query for ConfigMaps associated with specific applications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Find all ConfigMaps for a specific app
kubectl get configmaps -l app=api-gateway

# Clean up old versions
kubectl delete configmaps -l app=api-gateway,version=v1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adopt GitOps Practices
&lt;/h3&gt;

&lt;p&gt;GitOps tools like ArgoCD and Flux excel at preventing orphaned resources because they maintain a clear desired state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative management&lt;/strong&gt;: All resources are defined in Git&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic pruning&lt;/strong&gt;: Tools can detect and remove resources not defined in Git&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt;: Git history shows when and why resources were created or deleted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ArgoCD’s sync policies can automatically prune resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
spec:
  syncPolicy:
    automated:
      prune: true  # Remove resources not in Git
      selfHeal: true

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


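&lt;p&gt;Flux offers the same pruning behavior through its Kustomization resource. A minimal sketch (path and repository names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 10m
  path: ./deploy/production
  prune: true  # Remove resources no longer present in Git
  sourceRef:
    kind: GitRepository
    name: myapp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;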

&lt;h3&gt;
  
  
  Use Kustomize ConfigMap Generators with Hashes
&lt;/h3&gt;

&lt;p&gt;Kustomize’s ConfigMap generator feature appends content hashes to ConfigMap names, ensuring that configuration changes trigger new ConfigMaps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# kustomization.yaml
configMapGenerator:
  - name: app-config
    files:
      - config.properties
generatorOptions:
  disableNameSuffixHash: false  # Include hash in name

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates ConfigMaps like &lt;code&gt;app-config-dk9g72hk5f&lt;/code&gt;. When you update the configuration, Kustomize creates a new ConfigMap with a different hash. Combined with &lt;code&gt;kubectl apply --prune&lt;/code&gt; (still an alpha feature, so validate it in non-production first), old ConfigMaps are automatically removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply --prune -k ./overlays/production \
  -l app=myapp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set Resource Quotas
&lt;/h3&gt;

&lt;p&gt;While quotas don’t prevent orphans, they create backpressure that forces teams to clean up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: config-quota
  namespace: production
spec:
  hard:
    configmaps: "50"
    secrets: "50"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When teams hit quota limits, they’re incentivized to audit and remove unused resources.&lt;/p&gt;
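&lt;p&gt;You can track how close a namespace is to its limits with standard kubectl queries (the quota and namespace names match the example above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Human-readable usage summary
kubectl describe resourcequota config-quota -n production

# Programmatic used/hard ratio for ConfigMaps
kubectl get resourcequota config-quota -n production \
  -o jsonpath='{.status.used.configmaps}/{.status.hard.configmaps}{"\n"}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;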

&lt;h2&gt;
  
  
  Cleanup Strategies for Existing Orphaned Resources
&lt;/h2&gt;

&lt;p&gt;For clusters that already have accumulated orphaned ConfigMaps and Secrets, here are practical cleanup approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-Time Manual Cleanup
&lt;/h3&gt;

&lt;p&gt;For immediate cleanup, combine detection scripts with kubectl delete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Dry run first - review what would be deleted
./detect-orphaned-configmaps.sh production &amp;gt; orphaned-cms.txt
cat orphaned-cms.txt

# Manual review and cleanup
for cm in $(grep "Orphaned:" orphaned-cms.txt | awk '{print $2}'); do
    kubectl delete configmap "$cm" -n production
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical warning&lt;/strong&gt;: Always do a dry run and manual review first. Some ConfigMaps might be referenced by workloads that aren’t currently running but will start later (Deployments temporarily scaled to zero, suspended CronJobs, and so on).&lt;/p&gt;
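&lt;p&gt;Since detection scripts typically inspect only Deployments, StatefulSets, and DaemonSets, it is worth checking CronJobs separately before deleting anything. A sketch following the same jq pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List ConfigMaps referenced by CronJobs in a namespace
kubectl get cronjobs -n production -o json | \
  jq -r '.items[].spec.jobTemplate.spec.template.spec |
    [.volumes[]?.configMap.name,
     .containers[].env[]?.valueFrom.configMapKeyRef.name,
     .containers[].envFrom[]?.configMapRef.name] |
    .[] | select(. != null)' | sort -u

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;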

&lt;h3&gt;
  
  
  Scheduled Cleanup with CronJobs
&lt;/h3&gt;

&lt;p&gt;For ongoing maintenance, deploy a Kubernetes CronJob that runs cleanup scripts periodically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: configmap-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"  # Weekly at 2 AM Sunday
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest  # pin a specific tag in production; the script also requires bash and jq in the image
            command:
            - /bin/bash
            - -c
            - |
              # Cleanup script here
              echo "Starting ConfigMap cleanup..."

              for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
                echo "Checking namespace: $ns"

                # Get all workload-referenced ConfigMaps
                REFERENCED_CMS=$(kubectl get deploy,sts,ds -n $ns -o json | \
                  jq -r '.items[].spec.template.spec |
                  [.volumes[]?.configMap.name,
                   .containers[].env[]?.valueFrom.configMapKeyRef.name,
                   .containers[].envFrom[]?.configMapRef.name] |
                  .[] | select(. != null)' | sort -u)

                ALL_CMS=$(kubectl get cm -n $ns -o jsonpath='{.items[*].metadata.name}')

                for cm in $ALL_CMS; do
                  if [[ "$cm" == "kube-root-ca.crt" ]]; then
                    continue
                  fi

                  if ! echo "$REFERENCED_CMS" | grep -q "^$cm$"; then
                    echo "Deleting orphaned ConfigMap: $cm in namespace: $ns"
                    kubectl delete cm $cm -n $ns
                  fi
                done
              done
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cleanup-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cleanup-role
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets", "namespaces"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cleanup-binding
subjects:
- kind: ServiceAccount
  name: cleanup-sa
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cleanup-role
  apiGroup: rbac.authorization.k8s.io

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security consideration&lt;/strong&gt;: This CronJob needs cluster-wide permissions to read workloads and delete ConfigMaps. Review and adjust the RBAC permissions based on your security requirements. Consider limiting to specific namespaces if you don’t need cluster-wide cleanup.&lt;/p&gt;
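&lt;p&gt;If cluster-wide access is more than you need, a namespace-scoped Role and RoleBinding can replace the ClusterRole above. A sketch for a single namespace (the cleanup script would then iterate over that namespace only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cleanup-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cleanup-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: cleanup-sa
  namespace: kube-system
roleRef:
  kind: Role
  name: cleanup-role
  apiGroup: rbac.authorization.k8s.io

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;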

&lt;h3&gt;
  
  
  Integration with CI/CD Pipelines
&lt;/h3&gt;

&lt;p&gt;Build cleanup into your deployment workflows. Here’s an example GitLab CI job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cleanup_old_configs:
  stage: post-deploy
  image: bitnami/kubectl:latest
  script:
    - |
      # Delete ConfigMaps with old version labels after successful deployment
      kubectl delete configmap -n production \
        -l app=myapp,version!=v${CI_COMMIT_TAG}

    - |
      # Keep only the last 3 ConfigMap versions by timestamp
      kubectl get configmap -n production \
        -l app=myapp \
        --sort-by=.metadata.creationTimestamp \
        -o name | head -n -3 | xargs -r kubectl delete -n production
  only:
    - tags
  when: on_success

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Safe Deletion Practices
&lt;/h3&gt;

&lt;p&gt;When cleaning up ConfigMaps and Secrets, follow these safety guidelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dry run first&lt;/strong&gt;: Always review what will be deleted before executing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup before deletion&lt;/strong&gt;: Export resources to YAML files before removing them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check age&lt;/strong&gt;: Only delete resources older than a certain threshold (e.g., 30 days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclude system resources&lt;/strong&gt;: Skip kube-system, kube-public, and other system namespaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor for impact&lt;/strong&gt;: Watch application metrics after cleanup to ensure nothing broke&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example backup and conditional deletion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Backup before deletion
kubectl get configmap -n production -o yaml &amp;gt; cm-backup-$(date +%Y%m%d).yaml

# Only delete ConfigMaps older than 30 days
kubectl get configmap -n production -o json | \
  jq -r --arg date "$(date -d '30 days ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.items[] | select(.metadata.creationTimestamp &amp;lt; $date) | .metadata.name' | \
  while read cm; do
    echo "Would delete: $cm (created: $(kubectl get cm $cm -n production -o jsonpath='{.metadata.creationTimestamp}'))"
    # Uncomment to actually delete:
    # kubectl delete configmap $cm -n production
  done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Patterns for Large-Scale Clusters
&lt;/h2&gt;

&lt;p&gt;For organizations running multiple clusters or large multi-tenant platforms, housekeeping requires more sophisticated approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy-Based Cleanup with OPA Gatekeeper
&lt;/h3&gt;

&lt;p&gt;Use OPA Gatekeeper to enforce ConfigMap lifecycle policies at admission time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: configmaprequiredlabels
spec:
  crd:
    spec:
      names:
        kind: ConfigMapRequiredLabels
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package configmaprequiredlabels

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["app"]
          msg := "ConfigMaps must have an 'app' label for lifecycle tracking"
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["owner"]
          msg := "ConfigMaps must have an 'owner' label for lifecycle tracking"
        }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy prevents ConfigMaps without proper labels from being created, making future tracking and cleanup much easier.&lt;/p&gt;
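&lt;p&gt;Note that a ConstraintTemplate only defines the policy; to enforce it you also need a Constraint instance referencing it. A minimal sketch (the constraint name and excluded namespaces are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: constraints.gatekeeper.sh/v1beta1
kind: ConfigMapRequiredLabels
metadata:
  name: require-configmap-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["ConfigMap"]
    excludedNamespaces: ["kube-system"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;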

&lt;h3&gt;
  
  
  Centralized Monitoring with Prometheus
&lt;/h3&gt;

&lt;p&gt;Monitor orphaned resource metrics across your clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: orphan-detection-exporter
data:
  script.sh: |
    #!/bin/bash
    # Emit metrics in Prometheus text format. Stdout alone is not scrapable:
    # redirect this output to a file served by node_exporter's textfile
    # collector, or serve it over HTTP with a small sidecar.
    while true; do
      echo "# HELP k8s_orphaned_configmaps Number of orphaned ConfigMaps"
      echo "# TYPE k8s_orphaned_configmaps gauge"

      for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
        count=$(./detect-orphaned-configmaps.sh $ns | grep -c "Orphaned:")
        echo "k8s_orphaned_configmaps{namespace=\"$ns\"} $count"
      done

      sleep 300  # Update every 5 minutes
    done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create alerts when orphaned resource counts exceed thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;groups:
- name: kubernetes-housekeeping
  rules:
  - alert: HighOrphanedConfigMapCount
    expr: k8s_orphaned_configmaps &amp;gt; 20
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: "High number of orphaned ConfigMaps in {{ $labels.namespace }}"
      description: "Namespace {{ $labels.namespace }} has {{ $value }} orphaned ConfigMaps"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Multi-Cluster Cleanup with Crossplane or Cluster API
&lt;/h3&gt;

&lt;p&gt;For platform teams managing dozens or hundreds of clusters, extend cleanup automation across your entire fleet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Crossplane Composition for cluster-wide cleanup
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: cluster-cleanup-policy
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1
    kind: ClusterCleanupPolicy
  resources:
    - name: cleanup-cronjob
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1
        kind: Object
        spec:
          forProvider:
            manifest:
              apiVersion: batch/v1
              kind: CronJob
              # ... CronJob spec from earlier

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Housekeeping Checklist for Production Clusters
&lt;/h2&gt;

&lt;p&gt;Here’s a practical checklist to implement sustainable ConfigMap and Secret housekeeping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Immediate Actions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Run detection scripts to audit current orphaned resource count&lt;/li&gt;
&lt;li&gt;[ ] Backup all ConfigMaps and Secrets before any cleanup&lt;/li&gt;
&lt;li&gt;[ ] Manually review and delete obvious orphans (with team approval)&lt;/li&gt;
&lt;li&gt;[ ] Document which ConfigMaps/Secrets are intentionally unused but needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Short-term (1-4 weeks):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Implement consistent labeling standards across teams&lt;/li&gt;
&lt;li&gt;[ ] Add owner references to all ConfigMaps and Secrets&lt;/li&gt;
&lt;li&gt;[ ] Deploy scheduled CronJob for automated detection and reporting&lt;/li&gt;
&lt;li&gt;[ ] Integrate cleanup steps into CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term (1-3 months):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Adopt GitOps tooling (ArgoCD, Flux) with automated pruning&lt;/li&gt;
&lt;li&gt;[ ] Implement OPA Gatekeeper policies for required labels&lt;/li&gt;
&lt;li&gt;[ ] Set up Prometheus monitoring for orphaned resource metrics&lt;/li&gt;
&lt;li&gt;[ ] Create runbooks for incident responders&lt;/li&gt;
&lt;li&gt;[ ] Establish resource quotas per namespace&lt;/li&gt;
&lt;li&gt;[ ] Conduct quarterly cluster hygiene reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ongoing Practices:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Review orphaned resource reports weekly&lt;/li&gt;
&lt;li&gt;[ ] Include cleanup tasks in sprint planning&lt;/li&gt;
&lt;li&gt;[ ] Train new team members on resource lifecycle best practices&lt;/li&gt;
&lt;li&gt;[ ] Update cleanup automation as cluster architecture evolves&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes doesn’t automatically clean up orphaned ConfigMaps and Secrets, but with the right strategies, you can prevent them from becoming a problem. The key is implementing a layered approach: use owner references and GitOps for prevention, deploy automated detection for ongoing monitoring, and run scheduled cleanup jobs for maintenance.&lt;/p&gt;

&lt;p&gt;Start with detection to understand your current situation, then focus on prevention strategies like owner references and consistent labeling. For existing clusters with accumulated orphaned resources, implement gradual cleanup with proper safety checks rather than aggressive bulk deletion.&lt;/p&gt;

&lt;p&gt;Remember that housekeeping isn’t a one-time task—it’s an ongoing operational practice. By building cleanup into your CI/CD pipelines and establishing clear resource ownership, you’ll maintain a clean, secure, and performant Kubernetes environment over time.&lt;/p&gt;

&lt;p&gt;The tools and patterns we’ve covered here—from simple bash scripts to sophisticated policy engines—can be adapted to your organization’s scale and maturity level. Whether you’re managing a single cluster or a multi-cluster platform, investing in proper resource lifecycle management pays dividends in operational efficiency, security posture, and team productivity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can Kubernetes automatically delete unused ConfigMaps and Secrets?
&lt;/h3&gt;

&lt;p&gt;No. Kubernetes does &lt;strong&gt;not&lt;/strong&gt; garbage-collect ConfigMaps or Secrets by default when workloads are deleted. Unless they have &lt;strong&gt;ownerReferences&lt;/strong&gt; set, these resources remain in the cluster indefinitely and must be cleaned up manually or via automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it safe to delete ConfigMaps or Secrets that are not referenced by running Pods?
&lt;/h3&gt;

&lt;p&gt;Not always. Some resources may be referenced by workloads scaled to zero, CronJobs, or future rollouts. Always &lt;strong&gt;perform a dry run&lt;/strong&gt;, check workload definitions (Deployments, StatefulSets, DaemonSets), and review resource age before deletion.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the safest way to prevent orphaned ConfigMaps and Secrets?
&lt;/h3&gt;

&lt;p&gt;The most effective prevention strategies are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;strong&gt;ownerReferences&lt;/strong&gt; (via Helm or Kustomize)&lt;/li&gt;
&lt;li&gt;Adopting &lt;strong&gt;GitOps&lt;/strong&gt; with pruning enabled (ArgoCD / Flux)&lt;/li&gt;
&lt;li&gt;Applying &lt;strong&gt;consistent labeling&lt;/strong&gt; (&lt;code&gt;app&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, &lt;code&gt;version&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these ensure unused resources are detected and removed automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which tools are best for detecting orphaned resources?
&lt;/h3&gt;

&lt;p&gt;Popular and reliable tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kor&lt;/strong&gt; – purpose-built for detecting unused Kubernetes resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Popeye&lt;/strong&gt; – broader cluster hygiene and sanitization reports&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom scripts/controllers&lt;/strong&gt; – useful for tailored environments or integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For production clusters, Kor provides the best signal-to-noise ratio.&lt;/p&gt;

&lt;h3&gt;
  
  
  How often should ConfigMap and Secret cleanup run in production?
&lt;/h3&gt;

&lt;p&gt;A common best practice is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Weekly detection&lt;/strong&gt; (reporting only)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly cleanup&lt;/strong&gt; for resources older than a defined threshold (e.g. 30–60 days)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Immediate cleanup&lt;/strong&gt; integrated into CI/CD after successful deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This balances safety with long-term cluster hygiene.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://alexandre-vazquez.com/tag/kubernetes/" rel="noopener noreferrer"&gt;Kubernetes Housekeeping Best Practices – Alexandre Vazquez Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sedai.io/blog/detecting-unused-and-orphance-resources-in-kubernetes-cluster" rel="noopener noreferrer"&gt;Sedai: Detecting Unused &amp;amp; Orphaned Kubernetes Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.blinkops.com/blog/finding-and-deleting-orphaned-configmaps" rel="noopener noreferrer"&gt;Blink Ops: How to Clean Up Orphaned ConfigMaps in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stackstate.com/blog/orphaned-resources-in-kubernetes-detection-impact-and-prevention-tips/" rel="noopener noreferrer"&gt;StackState: Orphaned Resources in Kubernetes Detection and Prevention&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://martinheinz.dev/blog/60" rel="noopener noreferrer"&gt;Martin Heinz: Keeping Kubernetes Clusters Clean and Tidy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://infinitejs.com/posts/eliminate-orphaned-configmaps-guide/" rel="noopener noreferrer"&gt;InfiniteJS: Eliminate Orphaned ConfigMaps Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mr-karan/k8s-pruner" rel="noopener noreferrer"&gt;GitHub: k8s-pruner – Cleanup unused configmaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/farshad%5Fnick/cleaning-up-kubernetes-a-guide-to-finding-unused-resources-with-kor-3p82"&gt;DEV Community: Cleaning Up Kubernetes with Kor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>uncategorized</category>
    </item>
    <item>
      <title>Kubernetes Gateway API Versions: Complete Compatibility and Upgrade Guide</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Wed, 04 Feb 2026 13:28:54 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-gateway-api-versions-complete-compatibility-and-upgrade-guide-36do</link>
      <guid>https://forem.com/alexandrev/kubernetes-gateway-api-versions-complete-compatibility-and-upgrade-guide-36do</guid>
      <description>&lt;p&gt;The Kubernetes Gateway API has rapidly evolved from its experimental roots to become the standard for ingress and service mesh traffic management. But with multiple versions released and various maturity levels, understanding which version to use, how it relates to your Kubernetes cluster, and when to upgrade can be challenging.&lt;/p&gt;

&lt;p&gt;In this comprehensive guide, we’ll explore the different Gateway API versions, their relationship to Kubernetes releases, provider support levels, and the upgrade philosophy that will help you make informed decisions for your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Gateway API Versioning
&lt;/h2&gt;

&lt;p&gt;The Gateway API follows a unique versioning model that differs from standard Kubernetes APIs. Unlike built-in Kubernetes resources that are tied to specific cluster versions, Gateway API CRDs can be installed independently as long as your cluster meets the minimum requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimum Kubernetes Version Requirements
&lt;/h3&gt;

&lt;p&gt;Gateway API v1.1 and later require &lt;strong&gt;Kubernetes 1.26 or later&lt;/strong&gt;. The API commits to supporting at least the most recent 5 Kubernetes minor versions, providing a reasonable window for cluster upgrades.&lt;/p&gt;

&lt;p&gt;This rolling support window means that if you’re running Kubernetes 1.26, 1.27, 1.28, 1.29, or 1.30, you can safely install and use the latest Gateway API without concerns about compatibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Channels: Standard vs Experimental
&lt;/h2&gt;

&lt;p&gt;Gateway API uses two distinct release channels to balance stability with innovation. Understanding these channels is critical for choosing the right version for your use case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Channel
&lt;/h3&gt;

&lt;p&gt;The Standard channel contains only GA (Generally Available, v1) and Beta (v1beta1) level resources and fields. When you install from the Standard channel, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stability guarantees&lt;/strong&gt;: No breaking changes once a resource reaches Beta or GA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backwards compatibility&lt;/strong&gt;: Safe to upgrade between minor versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production readiness&lt;/strong&gt;: Extensively tested features with multiple implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformance coverage&lt;/strong&gt;: Full test coverage ensuring portability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Resources in the Standard channel include GatewayClass, Gateway, HTTPRoute, and ReferenceGrant at the v1 level, plus stable features like GRPCRoute.&lt;/p&gt;
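&lt;p&gt;A minimal Standard-channel configuration wiring these resources together looks like this (the class, hostname, and Service names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example-gateway
spec:
  gatewayClassName: example-class  # provided by your Gateway implementation
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route
spec:
  parentRefs:
  - name: example-gateway
  hostnames:
  - "www.example.com"
  rules:
  - backendRefs:
    - name: example-svc
      port: 8080

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;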

&lt;h3&gt;
  
  
  Experimental Channel
&lt;/h3&gt;

&lt;p&gt;The Experimental channel includes everything from the Standard channel plus Alpha-level resources and experimental fields. This channel is for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Early feature testing&lt;/strong&gt;: Try new capabilities before they stabilize&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cutting-edge functionality&lt;/strong&gt;: Access the latest Gateway API innovations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No stability guarantees&lt;/strong&gt;: Breaking changes can occur between releases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature feedback&lt;/strong&gt;: Help shape the API by testing experimental features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Features may graduate from Experimental to Standard or be dropped entirely based on implementation experience and community feedback.&lt;/p&gt;
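&lt;p&gt;Both channels are installed as CRD bundles from the project’s release assets; for example (substitute the release tag you are targeting):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Standard channel (GA and Beta resources only)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

# Experimental channel (includes Alpha resources and experimental fields)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;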

&lt;h2&gt;
  
  
  Gateway API Version History and Features
&lt;/h2&gt;

&lt;p&gt;Let’s explore the major Gateway API releases and what each introduced.&lt;/p&gt;

&lt;h3&gt;
  
  
  v1.0 (October 2023)
&lt;/h3&gt;

&lt;p&gt;The v1.0 release marked a significant milestone, graduating core resources to GA status. This release included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gateway, GatewayClass, and HTTPRoute at v1 (stable)&lt;/li&gt;
&lt;li&gt;Full backwards compatibility guarantees for v1 resources&lt;/li&gt;
&lt;li&gt;Production-ready status for ingress traffic management&lt;/li&gt;
&lt;li&gt;Multiple conformant implementations across vendors&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v1.1 (May 2024)
&lt;/h3&gt;

&lt;p&gt;Version 1.1 expanded the API significantly with service mesh support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GRPCRoute&lt;/strong&gt;: Native support for gRPC traffic routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service mesh capabilities&lt;/strong&gt;: East-west traffic management alongside north-south&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple implementations&lt;/strong&gt;: Both Istio and other service meshes achieved conformance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced features&lt;/strong&gt;: Additional matching criteria and routing capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This version bridged the gap between traditional ingress controllers and full service mesh implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  v1.2 and v1.3
&lt;/h3&gt;

&lt;p&gt;These intermediate releases introduced structured release cycles and additional features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refined conformance testing&lt;/li&gt;
&lt;li&gt;BackendTLSPolicy (experimental in v1.3)&lt;/li&gt;
&lt;li&gt;Enhanced observability and debugging capabilities&lt;/li&gt;
&lt;li&gt;Improved cross-namespace routing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v1.4 (October 2025)
&lt;/h3&gt;

&lt;p&gt;The latest GA release as of this writing, v1.4.0 brought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continued API refinement&lt;/li&gt;
&lt;li&gt;Additional experimental features for community testing&lt;/li&gt;
&lt;li&gt;Enhanced conformance profiles&lt;/li&gt;
&lt;li&gt;Improved documentation and migration guides&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Kubernetes Version Compatibility Matrix
&lt;/h2&gt;

&lt;p&gt;Here’s how Gateway API versions relate to Kubernetes releases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gateway API Version&lt;/th&gt;
&lt;th&gt;Minimum Kubernetes&lt;/th&gt;
&lt;th&gt;Recommended Kubernetes&lt;/th&gt;
&lt;th&gt;Release Date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1.0.x&lt;/td&gt;
&lt;td&gt;1.25&lt;/td&gt;
&lt;td&gt;1.26+&lt;/td&gt;
&lt;td&gt;October 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.1.x&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.27+&lt;/td&gt;
&lt;td&gt;May 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.2.x&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.28+&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.3.x&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.29+&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v1.4.x&lt;/td&gt;
&lt;td&gt;1.26&lt;/td&gt;
&lt;td&gt;1.30+&lt;/td&gt;
&lt;td&gt;October 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key takeaway: Gateway API v1.1 and later all support Kubernetes 1.26+, meaning you can run the latest Gateway API on any reasonably modern cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gateway Provider Support Levels
&lt;/h2&gt;

&lt;p&gt;Different Gateway API implementations support various versions and feature sets. Understanding provider support helps you choose the right implementation for your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conformance Levels
&lt;/h3&gt;

&lt;p&gt;Gateway API defines three conformance levels for features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Core&lt;/strong&gt;: Features that must be supported for an implementation to claim conformance. These are portable across all implementations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended&lt;/strong&gt;: Standardized optional features. Implementations indicate Extended support separately from Core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation-specific&lt;/strong&gt;: Vendor-specific features without conformance requirements.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Major Provider Support
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Istio
&lt;/h4&gt;

&lt;p&gt;Istio reached Gateway API GA support in version 1.22 (May 2024). Istio provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full Standard channel support (v1 resources)&lt;/li&gt;
&lt;li&gt;Service mesh (east-west) traffic management via GAMMA&lt;/li&gt;
&lt;li&gt;Ingress (north-south) traffic control&lt;/li&gt;
&lt;li&gt;Experimental support for BackendTLSPolicy (Istio 1.26+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Istio is particularly strong for organizations needing both ingress and service mesh capabilities in a single solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Envoy Gateway
&lt;/h4&gt;

&lt;p&gt;Envoy Gateway tracks Gateway API releases closely. Version 1.4.0 includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gateway API v1.3.0 support&lt;/li&gt;
&lt;li&gt;Compatibility matrix for Envoy Proxy versions&lt;/li&gt;
&lt;li&gt;Focus on ingress use cases&lt;/li&gt;
&lt;li&gt;Strong experimental feature adoption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check the Envoy Gateway compatibility matrix to ensure your Envoy Proxy version aligns with your Gateway API and Kubernetes versions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cilium
&lt;/h4&gt;

&lt;p&gt;Cilium integrates Gateway API deeply with its CNI implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-node Envoy proxy architecture&lt;/li&gt;
&lt;li&gt;Network policy enforcement for Gateway traffic&lt;/li&gt;
&lt;li&gt;Both ingress and service mesh support&lt;/li&gt;
&lt;li&gt;eBPF-based packet processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cilium’s unique architecture makes it a strong choice for organizations already using Cilium for networking.&lt;/p&gt;

&lt;h4&gt;
  
  
  Contour
&lt;/h4&gt;

&lt;p&gt;Contour v1.31.0 implements Gateway API v1.2.1, supporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All Standard channel v1 resources&lt;/li&gt;
&lt;li&gt;Most v1alpha2 resources (TLSRoute, TCPRoute, GRPCRoute)&lt;/li&gt;
&lt;li&gt;BackendTLSPolicy support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Checking Provider Conformance
&lt;/h3&gt;

&lt;p&gt;To verify which Gateway API version and features your provider supports:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Visit the official implementations page&lt;/strong&gt;: The Gateway API project maintains a comprehensive list of implementations with their conformance levels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check provider documentation&lt;/strong&gt;: Most providers publish compatibility matrices showing Gateway API, Kubernetes, and proxy version relationships.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review conformance reports&lt;/strong&gt;: Providers submit conformance test results that detail exactly which Core and Extended features they support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test in non-production&lt;/strong&gt;: Before upgrading production, validate your specific use cases in a staging environment.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Upgrade Philosophy: When and How to Upgrade
&lt;/h2&gt;

&lt;p&gt;One of the most common questions about Gateway API is: “Do I need to run the latest version?” The answer depends on your specific needs and risk tolerance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Staying on Older Versions
&lt;/h3&gt;

&lt;p&gt;You &lt;strong&gt;don’t need to always run the latest Gateway API version&lt;/strong&gt;. It’s perfectly acceptable to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stay on an older stable release if it meets your needs&lt;/li&gt;
&lt;li&gt;Upgrade only when you need specific new features&lt;/li&gt;
&lt;li&gt;Wait for your Gateway provider to officially support newer versions&lt;/li&gt;
&lt;li&gt;Maintain stability over having the latest features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Standard channel’s backwards compatibility guarantees mean that when you do upgrade, your existing configurations will continue to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Consider Upgrading
&lt;/h3&gt;

&lt;p&gt;Consider upgrading when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You need a specific feature&lt;/strong&gt;: A new HTTPRoute matcher, GRPCRoute support, or other functionality only available in newer versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your provider recommends it&lt;/strong&gt;: Gateway providers often optimize for specific Gateway API versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security considerations&lt;/strong&gt;: While rare, security issues could prompt upgrades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes cluster upgrades&lt;/strong&gt;: When upgrading Kubernetes, verify your Gateway API version is compatible with the new cluster version&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Safe Upgrade Practices
&lt;/h3&gt;

&lt;p&gt;Follow these best practices for Gateway API upgrades:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Stick with Standard Channel
&lt;/h4&gt;

&lt;p&gt;Using Standard channel CRDs makes upgrades simpler and safer. Experimental features can introduce breaking changes, while Standard features maintain compatibility.&lt;/p&gt;
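&lt;p&gt;For reference, the two channels ship as separate install manifests; the commands below use v1.4.0 as an example version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Standard channel: GA and beta resources only
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

# Experimental channel: also includes alpha resources and may change between releases
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;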

&lt;h4&gt;
  
  
  2. Upgrade One Minor Version at a Time
&lt;/h4&gt;

&lt;p&gt;While it’s usually safe to skip versions, the most tested upgrade path is incremental. Going from v1.2 to v1.3 to v1.4 is safer than jumping directly from v1.2 to v1.4.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Test Before Upgrading
&lt;/h4&gt;

&lt;p&gt;Always test upgrades in non-production environments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install specific Gateway API version in test cluster
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Review Release Notes
&lt;/h4&gt;

&lt;p&gt;Each Gateway API release publishes comprehensive release notes detailing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New features and capabilities&lt;/li&gt;
&lt;li&gt;Graduation of experimental features to standard&lt;/li&gt;
&lt;li&gt;Deprecation notices&lt;/li&gt;
&lt;li&gt;Upgrade considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Check Provider Compatibility
&lt;/h4&gt;

&lt;p&gt;Before upgrading Gateway API CRDs, verify your Gateway provider supports the target version. Installing Gateway API v1.4 won’t help if your controller only supports v1.2.&lt;/p&gt;
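&lt;p&gt;As a sketch of this check (namespace and deployment names depend on your provider; Envoy Gateway defaults are assumed here), read the running controller version and compare it against the provider’s published compatibility matrix before touching the CRDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example for Envoy Gateway: read the controller image version
kubectl -n envoy-gateway-system get deployment envoy-gateway \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;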

&lt;h4&gt;
  
  
  6. Never Overwrite Different Channels
&lt;/h4&gt;

&lt;p&gt;Implementations should never overwrite Gateway API CRDs that use a different release channel. Keep track of whether you’re using Standard or Experimental channel installations.&lt;/p&gt;

&lt;h3&gt;
  
  
  CRD Management Best Practices
&lt;/h3&gt;

&lt;p&gt;Gateway API CRD management requires attention to detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check currently installed Gateway API version
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/bundle-version'

# Verify which channel is installed
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/channel'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Staying Informed About New Releases
&lt;/h2&gt;

&lt;p&gt;Gateway API releases follow a structured release cycle with clear communication channels.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Know When New Versions Are Released
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Releases Page&lt;/strong&gt;: Watch the kubernetes-sigs/gateway-api repository for release announcements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Blog&lt;/strong&gt;: Major Gateway API releases are announced on the official Kubernetes blog&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mailing Lists and Slack&lt;/strong&gt;: Join the Gateway API community channels for discussions and announcements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider Announcements&lt;/strong&gt;: Gateway providers announce support for new Gateway API versions through their own channels&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Release Cadence
&lt;/h3&gt;

&lt;p&gt;Gateway API follows a quarterly release schedule for minor versions, with patch releases as needed for bug fixes and security issues. This predictable cadence helps teams plan upgrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Decision Framework
&lt;/h2&gt;

&lt;p&gt;Here’s a framework to help you decide which Gateway API version to run:&lt;/p&gt;

&lt;h3&gt;
  
  
  For New Deployments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Production workloads&lt;/strong&gt;: Use the latest GA version supported by your provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Innovation-focused&lt;/strong&gt;: Consider Experimental channel if you need cutting-edge features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conservative approach&lt;/strong&gt;: Use v1.1 or later with Standard channel&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Existing Deployments
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If things are working&lt;/strong&gt;: Stay on your current version until you need new features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If provider recommends upgrade&lt;/strong&gt;: Follow provider guidance, especially for security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If a Kubernetes upgrade is planned&lt;/strong&gt;: Verify compatibility; you may need to upgrade the Gateway API first or at the same time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Feature-Driven Upgrades
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Need service mesh support&lt;/strong&gt;: Upgrade to v1.1 minimum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need GRPCRoute&lt;/strong&gt;: Upgrade to v1.1 minimum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Need BackendTLSPolicy&lt;/strong&gt;: Requires v1.3+ and provider support for experimental features&lt;/li&gt;
&lt;/ul&gt;
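&lt;p&gt;To illustrate a feature-driven upgrade, here is a minimal GRPCRoute, which graduated to the v1 API in Gateway API v1.1; the gateway, hostname, service, and backend names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: greeter-route
spec:
  parentRefs:
  - name: example-gateway        # Gateway that accepts this route
  hostnames:
  - "grpc.example.com"
  rules:
  - matches:
    - method:
        service: com.example.GreeterService   # gRPC service to match
    backendRefs:
    - name: greeter-service
      port: 50051

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;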

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes Gateway API represents the future of traffic management in Kubernetes, offering a standardized, extensible, and role-oriented API for both ingress and service mesh use cases. Understanding the versioning model, compatibility requirements, and upgrade philosophy empowers you to make informed decisions that balance innovation with stability.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gateway API versions install independently of Kubernetes, requiring only Kubernetes 1.26 or later for recent releases&lt;/li&gt;
&lt;li&gt;The Standard channel provides stability, while the Experimental channel provides early access to new features&lt;/li&gt;
&lt;li&gt;You don’t need to always run the latest version—upgrade when you need specific features&lt;/li&gt;
&lt;li&gt;Verify provider support before upgrading Gateway API CRDs&lt;/li&gt;
&lt;li&gt;Follow safe upgrade practices: test first, upgrade incrementally, review release notes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these guidelines, you can confidently deploy and maintain Gateway API in your Kubernetes infrastructure while making upgrade decisions that align with your organization’s needs and risk tolerance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the difference between Kubernetes Ingress and the Gateway API?
&lt;/h3&gt;

&lt;p&gt;Kubernetes Ingress is a legacy API focused mainly on HTTP(S) traffic with limited extensibility. The Gateway API is its successor, offering a more expressive, role-oriented model that supports multiple protocols, advanced routing, better separation of concerns, and consistent behavior across implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which Gateway API version should I use in production today?
&lt;/h3&gt;

&lt;p&gt;For most production environments, you should use the &lt;strong&gt;latest GA (v1.x) release supported by your Gateway provider&lt;/strong&gt;, installed from the &lt;strong&gt;Standard channel&lt;/strong&gt;. This ensures stability, backwards compatibility, and conformance guarantees while still benefiting from ongoing improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I upgrade the Gateway API without upgrading my Kubernetes cluster?
&lt;/h3&gt;

&lt;p&gt;Yes. Gateway API CRDs are installed independently of Kubernetes itself. As long as your cluster meets the &lt;strong&gt;minimum supported Kubernetes version&lt;/strong&gt; (1.26+ for recent releases), you can upgrade the Gateway API without upgrading the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens if my Gateway provider does not support the latest Gateway API version?
&lt;/h3&gt;

&lt;p&gt;If your provider lags behind, you should stay on the &lt;strong&gt;latest version officially supported by that provider&lt;/strong&gt;. Installing newer Gateway API CRDs than your controller supports can lead to missing features or undefined behavior. Provider compatibility should always take precedence over running the newest API version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it safe to upgrade Gateway API CRDs without downtime?
&lt;/h3&gt;

&lt;p&gt;In most cases, yes—&lt;strong&gt;when using the Standard channel&lt;/strong&gt;. The Gateway API provides strong backwards compatibility guarantees for GA and Beta resources. However, you should always test upgrades in a non-production environment and verify that your Gateway provider supports the target version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/concepts/versioning/" rel="noopener noreferrer"&gt;Kubernetes Gateway API Versioning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2023/10/31/gateway-api-ga/" rel="noopener noreferrer"&gt;Gateway API v1.0 GA Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2024/05/09/gateway-api-v1-1/" rel="noopener noreferrer"&gt;Gateway API v1.1 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/blog/2025/11/06/gateway-api-v1-4/" rel="noopener noreferrer"&gt;Gateway API v1.4 Release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/implementations/" rel="noopener noreferrer"&gt;Gateway API Implementations List&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/concepts/conformance/" rel="noopener noreferrer"&gt;Gateway API Conformance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway.envoyproxy.io/news/releases/matrix/" rel="noopener noreferrer"&gt;Envoy Gateway Compatibility Matrix&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/tasks/traffic-management/ingress/gateway-api/" rel="noopener noreferrer"&gt;Istio Gateway API Support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/latest/network/servicemesh/gateway-api/gateway-api/" rel="noopener noreferrer"&gt;Cilium Gateway API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gateway-api.sigs.k8s.io/guides/crd-management/" rel="noopener noreferrer"&gt;Gateway API CRD Management&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>uncategorized</category>
      <category>gatewayapiversions</category>
      <category>kubernetesgatewayapi</category>
    </item>
    <item>
      <title>Kubernetes Dashboard Alternatives in 2026: Best Web UI Options After Official Retirement</title>
      <dc:creator>Alexandre Vazquez</dc:creator>
      <pubDate>Mon, 26 Jan 2026 10:44:32 +0000</pubDate>
      <link>https://forem.com/alexandrev/kubernetes-dashboard-alternatives-in-2026-best-web-ui-options-after-official-retirement-4e02</link>
      <guid>https://forem.com/alexandrev/kubernetes-dashboard-alternatives-in-2026-best-web-ui-options-after-official-retirement-4e02</guid>
      <description>&lt;p&gt;The Kubernetes Dashboard, once a staple tool for cluster visualization and management, has been officially archived and is no longer maintained. For many teams who relied on its straightforward web interface to monitor pods, deployments, and services, this retirement marks the end of an era. But it also signals something important: the Kubernetes ecosystem has evolved far beyond what the original dashboard was designed to handle.&lt;/p&gt;

&lt;p&gt;Today’s Kubernetes environments are multi-cluster by default, driven by GitOps principles, guarded by strict RBAC policies, and operated by platform teams serving dozens or hundreds of developers. The operating model has simply outgrown the traditional dashboard’s capabilities.&lt;/p&gt;

&lt;p&gt;So what comes next? If you’ve been using Kubernetes Dashboard and need to migrate to something more capable, or if you’re simply curious about modern alternatives, this guide will walk you through the best options available in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Kubernetes Dashboard Was Retired
&lt;/h2&gt;

&lt;p&gt;The Kubernetes Dashboard served its purpose well in the early days of Kubernetes adoption. It provided a simple, browser-based interface for viewing cluster resources without needing to master &lt;code&gt;kubectl&lt;/code&gt; commands. But as Kubernetes matured, several limitations became apparent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-cluster focus&lt;/strong&gt;: Most organizations now manage multiple clusters across different environments, but the dashboard was designed for viewing one cluster at a time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited RBAC capabilities&lt;/strong&gt;: Modern platform teams need fine-grained access controls at the cluster, namespace, and workload levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No GitOps integration&lt;/strong&gt;: Contemporary workflows rely on declarative configuration and continuous deployment pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal observability&lt;/strong&gt;: Beyond basic resource listing, the dashboard lacked advanced monitoring, alerting, and troubleshooting features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security concerns&lt;/strong&gt;: The dashboard’s architecture required careful configuration to avoid exposing cluster access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The community recognized these constraints, and the official recommendation now points toward &lt;strong&gt;Headlamp&lt;/strong&gt; as the successor. But Headlamp isn’t the only option worth considering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top Kubernetes Dashboard Alternatives for 2026
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Headlamp: The Official Successor
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://headlamp.dev/" rel="noopener noreferrer"&gt;Headlamp&lt;/a&gt; is now the official recommendation from the Kubernetes SIG UI group. It’s a CNCF Sandbox project developed by Kinvolk (now part of Microsoft) that brings a modern approach to cluster visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean, intuitive interface built with modern web technologies&lt;/li&gt;
&lt;li&gt;Extensive plugin system for customization&lt;/li&gt;
&lt;li&gt;Works both as an in-cluster deployment and desktop application&lt;/li&gt;
&lt;li&gt;Uses your existing kubeconfig file for authentication&lt;/li&gt;
&lt;li&gt;OpenID Connect (OIDC) support for enterprise SSO&lt;/li&gt;
&lt;li&gt;Read and write operations based on RBAC permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Installation Options:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Using Helm
helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/
helm install my-headlamp headlamp/headlamp --namespace kube-system

# As Minikube addon
minikube addons enable headlamp
minikube service headlamp -n headlamp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Headlamp excels at providing a familiar dashboard experience while being extensible enough to grow with your needs. The plugin architecture means you can customize it for your specific workflows without waiting for upstream changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams transitioning from Kubernetes Dashboard who want a similar experience with modern features and official backing.&lt;/p&gt;
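&lt;p&gt;If you run Headlamp in-cluster, you typically sign in with a ServiceAccount token. A minimal sketch follows; the account name is a placeholder, and &lt;code&gt;cluster-admin&lt;/code&gt; is deliberately broad, so scope it down for real teams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a ServiceAccount and bind it to a role (cluster-admin shown for brevity)
kubectl -n kube-system create serviceaccount headlamp-admin
kubectl create clusterrolebinding headlamp-admin \
  --serviceaccount=kube-system:headlamp-admin \
  --clusterrole=cluster-admin

# Generate a short-lived token to paste into the Headlamp login screen
kubectl -n kube-system create token headlamp-admin

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;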

&lt;h3&gt;
  
  
  2. Portainer: Enterprise Multi-Cluster Management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.portainer.io/" rel="noopener noreferrer"&gt;Portainer&lt;/a&gt; has evolved from a Docker management tool into a comprehensive Kubernetes platform. It’s particularly strong when you need to manage multiple clusters from a single interface. We already covered in detail &lt;a href="https://alexandre-vazquez.com/portainer-software/" rel="noopener noreferrer"&gt;Portainer&lt;/a&gt; so you can also take a look&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster management dashboard&lt;/li&gt;
&lt;li&gt;Enterprise-grade RBAC with fine-grained access controls&lt;/li&gt;
&lt;li&gt;Visual workload deployment and scaling&lt;/li&gt;
&lt;li&gt;GitOps integration support&lt;/li&gt;
&lt;li&gt;Comprehensive audit logging&lt;/li&gt;
&lt;li&gt;Support for both Kubernetes and Docker environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations managing multiple clusters across different environments who need enterprise RBAC and centralized control.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Skooner (formerly K8Dash): Lightweight and Fast
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/skooner-k8s/skooner" rel="noopener noreferrer"&gt;Skooner&lt;/a&gt; keeps things simple. If you appreciated the straightforward nature of the original Kubernetes Dashboard, Skooner delivers a similar philosophy with a cleaner, faster interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast, real-time updates&lt;/li&gt;
&lt;li&gt;Clean and minimal interface&lt;/li&gt;
&lt;li&gt;Easy installation with minimal configuration&lt;/li&gt;
&lt;li&gt;Real-time metrics visualization&lt;/li&gt;
&lt;li&gt;Built-in OIDC authentication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want a simple, no-frills dashboard without complex features or steep learning curves.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Devtron: Complete DevOps Platform
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://devtron.ai/" rel="noopener noreferrer"&gt;Devtron&lt;/a&gt; goes beyond simple cluster visualization to provide an entire application delivery platform built on Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cluster application deployment&lt;/li&gt;
&lt;li&gt;Built-in CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Advanced security scanning and compliance&lt;/li&gt;
&lt;li&gt;Application-centric view rather than resource-centric&lt;/li&gt;
&lt;li&gt;Support for seven different SSO providers&lt;/li&gt;
&lt;li&gt;Chart store for Helm deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Platform teams building internal developer platforms who need comprehensive deployment pipelines alongside cluster management.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. KubeSphere: Full-Stack Container Platform
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://kubesphere.io/" rel="noopener noreferrer"&gt;KubeSphere&lt;/a&gt; positions itself as a distributed operating system for cloud-native applications, using Kubernetes as its kernel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant architecture&lt;/li&gt;
&lt;li&gt;Integrated DevOps workflows&lt;/li&gt;
&lt;li&gt;Service mesh integration (Istio)&lt;/li&gt;
&lt;li&gt;Multi-cluster federation&lt;/li&gt;
&lt;li&gt;Observability and monitoring built-in&lt;/li&gt;
&lt;li&gt;Plug-and-play architecture for third-party integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Organizations building comprehensive container platforms who want an opinionated, batteries-included experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Rancher: Battle-Tested Enterprise Platform
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.rancher.com/" rel="noopener noreferrer"&gt;Rancher&lt;/a&gt; from SUSE has been in the Kubernetes management space for years and offers one of the most mature platforms available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manage any Kubernetes cluster (EKS, GKE, AKS, on-premises)&lt;/li&gt;
&lt;li&gt;Centralized authentication and RBAC&lt;/li&gt;
&lt;li&gt;Built-in monitoring with Prometheus and Grafana&lt;/li&gt;
&lt;li&gt;Application catalog with Helm charts&lt;/li&gt;
&lt;li&gt;Policy management and security scanning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise organizations managing heterogeneous Kubernetes environments across multiple cloud providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Octant: Developer-Focused Cluster Exploration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://octant.dev/" rel="noopener noreferrer"&gt;Octant&lt;/a&gt; (originally developed by VMware) takes a developer-centric approach to cluster visualization with a focus on understanding application architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plugin-based extensibility&lt;/li&gt;
&lt;li&gt;Resource relationship visualization&lt;/li&gt;
&lt;li&gt;Port forwarding directly from the UI&lt;/li&gt;
&lt;li&gt;Log streaming&lt;/li&gt;
&lt;li&gt;Context-aware resource inspection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Application developers who need to understand how their applications run on Kubernetes without being cluster administrators.&lt;/p&gt;

&lt;h2&gt;
  
  
  Desktop and CLI Alternatives Worth Considering
&lt;/h2&gt;

&lt;p&gt;While this article focuses on web-based dashboards, it’s worth noting that not everyone needs a browser interface. Some of the most powerful Kubernetes management tools work as desktop applications or terminal UIs.&lt;/p&gt;

&lt;p&gt;If you’re considering client-side tools, you might find these articles on my blog helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://alexandre-vazquez.com/freelens-vs-openlens-vs-lens-kubernetes-ide/" rel="noopener noreferrer"&gt;Choosing The Right Kubernetes IDE: FreeLens vs OpenLens vs Lens&lt;/a&gt;&lt;/strong&gt; – A comprehensive comparison of the Lens ecosystem and which variant makes sense in 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://alexandre-vazquez.com/discover-your-perfect-tool-for-managing-kubernetes/" rel="noopener noreferrer"&gt;Discover Your Perfect Tool For Managing Kubernetes&lt;/a&gt;&lt;/strong&gt; – An overview of different management approaches including K9s, a powerful terminal UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These client tools offer advantages that web dashboards can’t match: offline access, better performance, and tighter integration with your local development workflow. FreeLens, in particular, has emerged as the lowest-risk choice for most organizations looking for a desktop Kubernetes IDE.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Alternative for Your Team
&lt;/h2&gt;

&lt;p&gt;With so many options available, how do you choose? Here’s a decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Headlamp if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the officially recommended path forward&lt;/li&gt;
&lt;li&gt;You need a lightweight dashboard similar to what you had before&lt;/li&gt;
&lt;li&gt;Plugin extensibility is important for future customization&lt;/li&gt;
&lt;li&gt;You prefer CNCF-backed open source projects&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Portainer if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You manage multiple Kubernetes clusters&lt;/li&gt;
&lt;li&gt;Enterprise RBAC is a critical requirement&lt;/li&gt;
&lt;li&gt;You also work with Docker environments&lt;/li&gt;
&lt;li&gt;Visual deployment tools would benefit your team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Skooner if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the simplest possible alternative&lt;/li&gt;
&lt;li&gt;Your needs are straightforward: view and manage resources&lt;/li&gt;
&lt;li&gt;You don’t need advanced features or multi-cluster support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Devtron or KubeSphere if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re building an internal developer platform&lt;/li&gt;
&lt;li&gt;You need integrated CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Application-centric workflows matter more than resource-centric views&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Rancher if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re managing enterprise-scale, multi-cloud Kubernetes&lt;/li&gt;
&lt;li&gt;You need battle-tested stability and vendor support&lt;/li&gt;
&lt;li&gt;Policy management and compliance are critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consider desktop tools like FreeLens if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You work primarily from a local development environment&lt;/li&gt;
&lt;li&gt;You need offline access to cluster information&lt;/li&gt;
&lt;li&gt;You prefer richer desktop application experiences&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migration Considerations
&lt;/h2&gt;

&lt;p&gt;If you’re actively using Kubernetes Dashboard today, here’s what to think about when migrating:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Authentication method&lt;/strong&gt;: Most modern alternatives support OIDC/SSO, but verify your specific identity provider is supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC policies&lt;/strong&gt;: Review your existing ClusterRole and RoleBinding configurations to ensure they translate properly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom workflows&lt;/strong&gt;: If you’ve built automation around Dashboard URLs or specific features, you’ll need to adapt these&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User training&lt;/strong&gt;: Even similar-looking alternatives have different UIs and workflows; budget time for team training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ingress configuration&lt;/strong&gt;: If you expose your dashboard externally, you’ll need to reconfigure ingress rules&lt;/li&gt;
&lt;/ol&gt;
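&lt;p&gt;For the RBAC review in step 2, a read-only role like the following is a common starting point for dashboard users; the names, group, and resource list are illustrative and should be tightened to your needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dashboard-viewer
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "services", "deployments", "jobs", "events"]
  verbs: ["get", "list", "watch"]   # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dashboard-viewer-binding
subjects:
- kind: Group
  name: dashboard-users            # group from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: dashboard-viewer
  apiGroup: rbac.authorization.k8s.io

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;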

&lt;h2&gt;
  
  
  The Future of Kubernetes UI Management
&lt;/h2&gt;

&lt;p&gt;The retirement of Kubernetes Dashboard isn’t a step backward—it’s recognition that the ecosystem has matured. Modern platforms need to handle multi-cluster management, GitOps workflows, comprehensive observability, and sophisticated RBAC out of the box.&lt;/p&gt;

&lt;p&gt;The alternatives listed here represent different philosophies about what a Kubernetes interface should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimalist dashboards&lt;/strong&gt; (Headlamp, Skooner) that stay close to the original vision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise platforms&lt;/strong&gt; (Portainer, Rancher) that centralize multi-cluster management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer platforms&lt;/strong&gt; (Devtron, KubeSphere) that integrate the entire application lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Desktop experiences&lt;/strong&gt; (FreeLens, OpenLens) that bring IDE-like capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The right choice depends on your team’s size, your infrastructure complexity, and whether you’re managing platforms or building applications. For most teams migrating from Kubernetes Dashboard, starting with Headlamp makes sense—it’s officially recommended, actively maintained, and provides a familiar experience. From there, you can evaluate whether you need to scale up to more comprehensive platforms.&lt;/p&gt;

&lt;p&gt;Whatever you choose, the good news is that the Kubernetes ecosystem in 2026 offers more sophisticated, capable, and secure dashboard alternatives than ever before.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Kubernetes Dashboard officially deprecated or just unmaintained?
&lt;/h3&gt;

&lt;p&gt;The Kubernetes Dashboard has been officially &lt;strong&gt;archived&lt;/strong&gt; by the Kubernetes project and is no longer actively maintained. While it may still run in existing clusters, it no longer receives security updates, bug fixes, or new features, making it unsuitable for production use in modern environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the official replacement for Kubernetes Dashboard?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Headlamp&lt;/strong&gt; is the officially recommended successor by the Kubernetes SIG UI group. It provides a modern web interface, supports plugins, integrates with existing kubeconfig files, and aligns with current Kubernetes security and RBAC best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Headlamp production-ready for enterprise environments?
&lt;/h3&gt;

&lt;p&gt;Yes. Headlamp supports &lt;strong&gt;OIDC authentication&lt;/strong&gt;, fine-grained &lt;strong&gt;RBAC&lt;/strong&gt;, and can run either in-cluster or as a desktop application. While still evolving, it is actively maintained and suitable for many production use cases, especially when combined with proper access controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are there lightweight alternatives similar to the old Kubernetes Dashboard?
&lt;/h3&gt;

&lt;p&gt;Yes. &lt;strong&gt;Skooner&lt;/strong&gt; is a lightweight, fast alternative that closely mirrors the simplicity of the original Kubernetes Dashboard while offering a cleaner UI and modern authentication options like OIDC.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I still need a web-based dashboard to manage Kubernetes?
&lt;/h3&gt;

&lt;p&gt;Not necessarily. Many teams prefer &lt;strong&gt;desktop or CLI-based tools&lt;/strong&gt; such as FreeLens, OpenLens, or K9s. These tools often provide better performance, offline access, and deeper integration with developer workflows compared to browser-based dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it safe to expose Kubernetes dashboards over the internet?
&lt;/h3&gt;

&lt;p&gt;Exposing any Kubernetes dashboard publicly requires &lt;strong&gt;extreme caution&lt;/strong&gt;. If external access is necessary, always use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong authentication (OIDC / SSO)&lt;/li&gt;
&lt;li&gt;Strict RBAC policies&lt;/li&gt;
&lt;li&gt;Network restrictions (VPN, IP allowlists)&lt;/li&gt;
&lt;li&gt;TLS termination and hardened ingress rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many cases, dashboards should only be accessible from internal networks.&lt;/p&gt;
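&lt;p&gt;As a sketch of those controls combined (the allowlist annotation is specific to ingress-nginx, and the host, service name, CIDR, and secret are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: headlamp
  annotations:
    # Restrict access to internal address ranges (ingress-nginx only)
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - dashboard.internal.example.com
    secretName: dashboard-tls      # TLS certificate for the dashboard host
  rules:
  - host: dashboard.internal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: headlamp
            port:
              number: 80

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;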

&lt;h3&gt;
  
  
  Can these dashboards replace kubectl?
&lt;/h3&gt;

&lt;p&gt;No. Dashboards are &lt;strong&gt;complementary tools&lt;/strong&gt;, not replacements for &lt;code&gt;kubectl&lt;/code&gt;. While they simplify visualization and some management tasks, advanced operations, automation, and troubleshooting still rely heavily on CLI tools and GitOps workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should I consider before migrating away from Kubernetes Dashboard?
&lt;/h3&gt;

&lt;p&gt;Before migrating, review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication and identity provider compatibility&lt;/li&gt;
&lt;li&gt;Existing RBAC roles and permissions&lt;/li&gt;
&lt;li&gt;Multi-cluster requirements&lt;/li&gt;
&lt;li&gt;GitOps and CI/CD integrations&lt;/li&gt;
&lt;li&gt;Training needs for platform teams and developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting with Headlamp is often the lowest-risk migration path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which Kubernetes dashboard is best for developers rather than platform teams?
&lt;/h3&gt;

&lt;p&gt;Tools like &lt;strong&gt;Octant&lt;/strong&gt; and &lt;strong&gt;Devtron&lt;/strong&gt; are more developer-focused. They emphasize application-centric views, resource relationships, and deployment workflows, making them ideal for developers who want insight without managing cluster infrastructure directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which Kubernetes dashboard is best for multi-cluster management?
&lt;/h3&gt;

&lt;p&gt;For multi-cluster environments, &lt;strong&gt;Portainer&lt;/strong&gt;, &lt;strong&gt;Rancher&lt;/strong&gt;, and &lt;strong&gt;KubeSphere&lt;/strong&gt; are strong options. These platforms are designed to manage multiple clusters from a single control plane and offer enterprise-grade RBAC, auditing, and centralized authentication.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>tooling</category>
      <category>ui</category>
    </item>
  </channel>
</rss>
